Preprocessing mode
Layout
Quick nomenclature
- Data unit: A unified way to store lumps of data. It also features an Input format (such as file reading capabilities or an SQL link) and can have any number of filters.
- Filter: An operator that operates on data units.
- Feature: A variable of the data (a dimension or axis, if you will). Also often referred to as a column or attribute.
Operation
Loading data
To do anything useful in Synapse, the data that drives the system must first be imported. Data is stored in data units and loaded into Synapse from the data unit manager (1). Pressing the “Add” button starts a data import wizard. First an Input format is chosen. After the specific settings for this format have been set, the data is loaded into a data unit, which is placed in the list of the data unit manager (1).
Studying data
The Statistics pane (5) always shows basic statistics for the chosen data. Select a data unit in the data unit manager (1) to see its statistics in the Statistics pane (5). The statistics are repeated column-wise for each feature of the data. The radio buttons at the bottom switch between the Training and Validation channels (more on this later).
For a more detailed view of the data, drag the data unit from the data unit manager (1) to an empty visualizer tab (4).
(The rightmost tab is always an empty tab.) When a data unit is dropped on an empty visualizer tab (4), a window with a list of available visualizers is shown. Clicking one of the visualizer icons selects that visualizer, which then appears on the visualizer tab (4).
Many visualizers enable selection of data points or regions of data. The statistics for these selections are shown in an alternate color in the Statistics pane (5).
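The idea of column-wise statistics, computed once for the whole data unit and once for a selection, can be sketched in a few lines of Python. This is only a conceptual illustration, not Synapse code: the function `column_stats`, the feature names, and the choice of count, minimum, maximum, and mean as the "basic statistics" are all assumptions made for the example.

```python
from statistics import mean

def column_stats(rows, feature_names):
    """Compute assumed 'basic statistics' separately for each feature (column)."""
    stats = {}
    for i, name in enumerate(feature_names):
        column = [row[i] for row in rows]
        stats[name] = {
            "count": len(column),
            "min": min(column),
            "max": max(column),
            "mean": mean(column),
        }
    return stats

# Hypothetical data: two features, four samples (rows).
features = ["temperature", "pressure"]
rows = [[20.0, 1.0], [22.0, 1.2], [30.0, 0.8], [24.0, 1.0]]

# Statistics for the whole data set.
all_stats = column_stats(rows, features)

# Statistics for a selection, here every row with temperature above 21.
selection = [row for row in rows if row[0] > 21.0]
selected_stats = column_stats(selection, features)
```

Showing `all_stats` and `selected_stats` side by side corresponds to the two colors in the Statistics pane: one set of figures for the full data, one for the selected points.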
Hierarchical data mining
By dragging one visualizer to an empty visualizer tab (4), the selection of the first visualizer is made available to a second visualizer of choice. (Again, the window with a list of available visualizers is shown.) This enables a progressive method of finding and viewing interesting parts of the data. Some visualizers can use this ability not only for viewing but also to permanently modify the underlying data unit, for example by applying the removal of certain data points as a filter (see Changing data).
Changing data
Data almost always has to be manipulated in some way before use. For this task there are filters. A list of available filters is found in the filter bar (2). A filter is applied to the current data unit by dragging it from the filter bar (2) and dropping it onto the filter stack (3).
The filter stack (3) always shows the filters of the current data unit, that is, the one selected in the data unit manager (1). When a filter is selected in the filter stack (3), its settings are shown in the area just underneath the filter stack (3). By default the “Auto apply” checkbox is not checked, and the “Apply” button must be pressed for new filters and filter settings to take effect. When the data unit is recalculated, the filters are applied in the order they appear in the filter stack (3), topmost first. If a data unit is being visualized, the visualization is updated to reflect the new content of the data unit.
If a filter is no longer desired, it can either be temporarily deactivated by means of the checkbox next to the filter in the filter stack (3), or permanently removed by pressing the [Delete] key. Again, the “Apply” button must be used if “Auto apply” is not checked.
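The behaviour of the filter stack (ordered application, topmost first, with individual filters switched off by a checkbox) can be illustrated with a small conceptual sketch in Python. This is not Synapse's implementation or API: the `Filter` class, the `recalculate` function, and the two example filters are hypothetical, and a data unit is modelled simply as a list of rows.

```python
from dataclasses import dataclass
from typing import Callable, List

Row = List[float]

@dataclass
class Filter:
    """Hypothetical stand-in for one entry in the filter stack."""
    name: str
    func: Callable[[List[Row]], List[Row]]
    enabled: bool = True  # mirrors the checkbox next to the filter

def recalculate(source: List[Row], stack: List[Filter]) -> List[Row]:
    """Apply all enabled filters in stack order, topmost (index 0) first."""
    rows = [list(row) for row in source]  # work on a copy of the loaded data
    for f in stack:
        if f.enabled:
            rows = f.func(rows)
    return rows

def drop_negative(rows: List[Row]) -> List[Row]:
    """Example filter: remove rows containing a negative value."""
    return [row for row in rows if all(v >= 0 for v in row)]

def scale_by_ten(rows: List[Row]) -> List[Row]:
    """Example filter: multiply every value by ten."""
    return [[v * 10 for v in row] for row in rows]

data = [[1.0, 2.0], [-1.0, 5.0], [3.0, 4.0]]
stack = [
    Filter("drop negative rows", drop_negative),
    Filter("scale by ten", scale_by_ten),
]

# Both filters active: the negative row is removed, then values are scaled.
result = recalculate(data, stack)

# Temporarily deactivate the second filter and recalculate ("Apply").
stack[1].enabled = False
result_without_scaling = recalculate(data, stack)
```

In this sketch a deactivated filter stays in the stack and is simply skipped on recalculation, whereas deleting it would remove the entry from the list altogether.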
More on data units, training and validation
Data units have two data sets: one for training and one for validation. This is due to the statistical nature of adaptive systems and the need to test such a system on data that has not been used in the process of creating it.
For this reason a portion of the available data is stored in a separate channel for validation purposes. This data is never used to train the system. The two channels, or data sets, are identical in form: they consist of the same features in the same order. Usually, however, they are not of the same size. (That is, they do not necessarily have the same number of rows.)
By default, many input formats store the last part of the loaded data set in the validation channel. This is fine for time series, and for static data if the data is not spatially correlated. In static cases, however, one may want to reshuffle the data. This can be achieved, for instance, by using the Mixer filter.
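The difference between the default split and a reshuffled split can be sketched in Python as follows. This is a conceptual illustration only: the function names and the 25% validation fraction are assumptions made for the example, and the shuffle shown here is a generic random permutation rather than a description of how the Mixer filter works internally.

```python
import random
from typing import List, Sequence, Tuple

def split_tail(rows: Sequence, validation_fraction: float = 0.25) -> Tuple[List, List]:
    """Keep the first part for training and the last part for validation."""
    n_validation = int(len(rows) * validation_fraction)
    cut = len(rows) - n_validation
    return list(rows[:cut]), list(rows[cut:])

def shuffle_then_split(rows: Sequence, validation_fraction: float = 0.25,
                       seed: int = 0) -> Tuple[List, List]:
    """Randomly permute the rows first, then split as above."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    return split_tail(shuffled, validation_fraction)

rows = list(range(8))  # stand-in for eight samples in loading order

# Default behaviour: the last rows end up in the validation channel.
train, validation = split_tail(rows)

# Reshuffled: both channels draw from the whole data set.
mixed_train, mixed_validation = shuffle_then_split(rows)
```

For a time series the tail split is usually the appropriate one, since it validates on samples that come after everything the system was trained on. Shuffling is mainly of interest for static data whose row order carries structure.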
The data unit manager (1) shows the number of samples (rows) in each set.
Embedding data in solutions
Sometimes, for the sake of portability, you may wish to embed the data into the solution. This allows you to use the solution offline, without access to the original data source. To do so, select a data unit and check the "Embed Data in Solution" checkbox. To restore reading from the original data source, uncheck the box.






