
Data unit

From Piki


The data unit is the principal Synapse data container. Data units are created in preprocessing and used throughout the application.


Anatomy of the data unit

Input format

Main article: Input format

The input format is responsible for translating external data to internal form. It can in principle support any format conversion. Examples of input formats in Synapse include the SQL format and the image array format.

The data unit uses the input format to fill its input buffer. This buffer contains the read data in its original form and is used as the foundation for the non-destructive operation of the filter stack.

Filter stack

Main article: Filter

Filters are data processing elements that can change data and describe in structural terms the changes that they made. A filter can be anything from an element that calculates a moving average to one that reorders the features in a data unit or one that makes a query-based selection.

Filters have meta-data that describe their operation. This allows for a dynamic filter stack in which the order of the filters can be changed at any time without the need to recompute the entire filter chain. The filter stack takes its input from the input buffer and fills the output buffer. It is therefore completely non-destructive.
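The non-destructive relationship between the input buffer, the filter stack and the output buffer can be illustrated with a minimal Python sketch. The class DataUnitSketch and the two filter functions are hypothetical names invented for this illustration and are not part of Synapse. For simplicity the sketch recomputes the whole chain on every request; the meta-data driven partial recomputation described above is not modelled.

```python
from typing import Callable, List, Sequence

# Hypothetical filter type: takes a list of samples and returns a new list.
Filter = Callable[[List[float]], List[float]]


class DataUnitSketch:
    """Illustrative stand-in for a data unit; not the Synapse API."""

    def __init__(self, input_buffer: Sequence[float]) -> None:
        # The input buffer keeps the data exactly as it was read.
        self.input_buffer = tuple(input_buffer)
        self.filters: List[Filter] = []

    def output_buffer(self) -> List[float]:
        # The output is always derived from the untouched input buffer,
        # so changing the stack never loses the original data.
        data = list(self.input_buffer)
        for apply_filter in self.filters:
            data = apply_filter(data)
        return data


def scale_by_two(data: List[float]) -> List[float]:
    return [2 * x for x in data]


def add_one(data: List[float]) -> List[float]:
    return [x + 1 for x in data]


unit = DataUnitSketch([1.0, 2.0, 3.0])
unit.filters = [scale_by_two, add_one]
first = unit.output_buffer()   # scale first, then add

unit.filters.reverse()         # reorder the stack
second = unit.output_buffer()  # add first, then scale

print(first, second, unit.input_buffer)
```

Because every filter returns a new list, reordering the stack only changes the output buffer; the input buffer is the same before and after.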


Data sets

Data units have three different data channels, usually called just channels or data sets. They are the following:

  • Training - Data in the training set is used for training (adapting) the system. Essentially, when a control system reads data from this set, it allows training to be turned on.
  • Validation - Data in the validation set is used to benchmark the performance of a system. When a control system sends validation data through the system, it makes sure that training is turned off. This guarantees that the system is benchmarked on data it has not seen before.
  • Test - The test set is deprecated and no current Synapse components use it. The test channel was removed because the validation channel can perform the same functions.

Input formats usually fill both the training and validation sets on data load. A percentage (default: 15%) at the end of the data is usually set aside for validation.
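As a rough illustration of setting aside the end of the data for validation, the following Python sketch splits a sequence at a tail fraction. The function name split_tail and the rounding rule are assumptions made for this example; how Synapse rounds the sample count is not stated here.

```python
from typing import List, Sequence, Tuple


def split_tail(samples: Sequence[int],
               validation_fraction: float = 0.15) -> Tuple[List[int], List[int]]:
    """Hypothetical helper: put the last fraction of samples in validation."""
    if not 0.0 <= validation_fraction <= 1.0:
        raise ValueError("validation_fraction must be between 0 and 1")
    # Assumption: the validation sample count is rounded to the nearest integer.
    n_validation = round(len(samples) * validation_fraction)
    cut = len(samples) - n_validation
    return list(samples[:cut]), list(samples[cut:])


training, validation = split_tail(list(range(1, 101)))
print(len(training), len(validation))  # 85 15
```

With 100 samples and the default of 15%, the first 85 samples end up in the training channel and the last 15 in the validation channel.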


Validation vs Test set

Once you have seen the results of the test set and used that information to modify the system, it stops being a test set and becomes a validation set. A test set is a final check of the generalization capabilities of the system, and if the check fails you must use new data for the next test. Otherwise you are adapting the system to the test set, and it loses its meaning as an independent and unbiased performance measure.


  • Training set: Direct adaptation
  • Validation set: Indirect (meta) changes to the system (either by manual modification or automatic optimization). It gives you an idea of how well the system performs on data it has not been explicitly trained on.
  • Test set: No changes to the system allowed. Final performance metric only.


Conclusion: Never use the test data in the development loop of the system; it is only a final performance metric. In Synapse the best way of using a test set is to have a second data unit containing the test data. Once you are pleased with your system, you replace the training/validation data unit with the test data unit (with learning turned off) in the data sources and run the postprocessing functions to analyze the model. Should the results be inadequate, you discard the test data and get a new batch of independent samples for the next round of testing.

Data types and data sets

The validation data set should have enough data for it to be a representative benchmark of the system.


Static data

When dealing with static data (i.e. where the order of the samples is irrelevant) one must make sure that the validation set data has the same distribution as the training set data.

If, for instance, a variable x has values in the range 1-100 and the data is ordered according to that variable, it is no good if 1-84 are placed in the training set and 85-100 in the validation set: the system would then be validated only on values it never encountered during training.

In order to prevent this, randomizing the order of a data set can be a good idea. This can be done using a mixer filter.
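The effect of randomizing the order before the tail split can be sketched in Python as follows. This is not the mixer filter itself; split_tail is a hypothetical helper, the 15% fraction is the default mentioned earlier, and the fixed seed is only there to make the example repeatable.

```python
import random
from typing import List, Tuple


def split_tail(samples: List[int],
               validation_fraction: float = 0.15) -> Tuple[List[int], List[int]]:
    """Hypothetical helper: put the last fraction of samples in validation."""
    cut = len(samples) - round(len(samples) * validation_fraction)
    return samples[:cut], samples[cut:]


ordered = list(range(1, 101))  # data sorted by the variable x
ordered_training, ordered_validation = split_tail(ordered)

shuffled = ordered[:]          # copy, so the original order is kept
random.Random(0).shuffle(shuffled)
shuffled_training, shuffled_validation = split_tail(shuffled)

# Sorted data: validation holds only the largest values of x.
print(min(ordered_validation), max(ordered_validation))
# Shuffled data: validation values are drawn from the whole range of x.
print(min(shuffled_validation), max(shuffled_validation))
```

In the sorted case the validation values do not overlap the training range at all, whereas after shuffling both channels sample the same range.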

Time series data

When dealing with time series data (i.e. where the order of samples is important) one must keep in mind that the validation set will constitute the end of that time series. If one were, for instance, to use a crop filter to remove the end of the training channel, the validation channel would become useless, since it would no longer follow directly on the training data.

The sample order of time series data must never be randomized. No samples should be removed from a time series unless they are at the end of the data (in the validation set) or at the start of the data (in the training set).

When data mining a time series it can be advisable to use a mixer filter to move all data to the training channel. This will allow for easier visualization and will make sure that you don't accidentally apply filters to just one channel. When you are done with your data mining you can use the mixer filter again to move data into the validation channel.
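The workflow of moving everything into the training channel, working on the whole series, and then restoring the validation channel can be sketched in Python. The functions merge_into_training and restore_validation are hypothetical stand-ins for what a mixer filter would do, not its actual interface, and the halving step is just a placeholder for any order-preserving filter.

```python
from typing import List, Tuple

Channels = Tuple[List[float], List[float]]  # (training, validation)


def merge_into_training(training: List[float],
                        validation: List[float]) -> Channels:
    """Hypothetical: move all samples to the training channel."""
    # Validation is the end of the series, so appending keeps the time order.
    return training + validation, []


def restore_validation(training: List[float], n_validation: int) -> Channels:
    """Hypothetical: move the last n_validation samples back to validation."""
    if not 0 <= n_validation <= len(training):
        raise ValueError("n_validation out of range")
    cut = len(training) - n_validation
    return training[:cut], training[cut:]


series = [float(t) for t in range(20)]
training, validation = series[:17], series[17:]

merged, empty = merge_into_training(training, validation)
# Any order-preserving filter now sees the whole series in one channel.
processed = [0.5 * x for x in merged]
new_training, new_validation = restore_validation(processed, len(validation))

print(len(new_training), len(new_validation))  # 17 3
```

Since nothing is reordered or dropped, the restored validation channel is still the end of the same time series.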

Using data units

Data units are primarily used in preprocessing, where they are created and formed, and in design mode, where they are used in data source blocks. In preprocessing, data units can be created, viewed and modified using the Data unit manager, and they can be visualized using visualizers.


See also

  • Preprocessing mode - Article covering the principles of the preprocessing mode.
  • Design mode - Article covering the principles of the design mode.
  • Data source - Documentation for the data source block.
This page was last modified 16:47, 20 January 2013.