Data
Data is information in a structured form. Although the structure can take any form, the table-based approach is the one most commonly used in data mining applications.
Table based data structure
The table-based (also called matrix-based) approach to data storage is a two-dimensional storage system commonly used in databases and spreadsheets. The basic principle is a map in which samples are represented as rows and variables as columns.
Features
Features, also commonly known as variables, are the columns in the table scheme. They describe different properties of the system or phenomenon that the data describes. Together the features form a feature space. If you have the features X, Y and Z, they form a 3-dimensional feature space. For that reason features are also often referred to collectively as dimensions. A twenty-dimensional feature space simply means that the data has twenty features. When the dimensionality of the data is discussed, the topic is the number of features in the data.
Samples
Samples, examples or instances are the rows in the table scheme. They are the individual data points. If X, Y and Z are the features in the data then [1, 2, 3] is an example of a sample. A sample is hence an instance that follows the form defined by the features.
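The table scheme can be sketched in a few lines of Python using plain lists. This is only an illustration: the names `features`, `table` and `column` and the extra sample values are assumptions made for the example and do not come from any particular library.

```python
# Feature names define the columns of the table.
features = ["X", "Y", "Z"]

# Each row is one sample; its values follow the order of `features`.
table = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
    [10, 11, 12],
]

n_samples = len(table)       # number of rows
n_features = len(features)   # number of columns, i.e. the dimensionality


def column(name):
    """Return the values of one feature across all samples."""
    idx = features.index(name)
    return [row[idx] for row in table]


# Pairing a sample with the feature names makes its structure explicit.
first_sample = dict(zip(features, table[0]))
```

Here `table[0]` is the sample [1, 2, 3] from the text, and `column("Y")` collects the Y value of every sample.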
Data size
Although a data table can in principle be square, with as many features as samples, statistical algorithms generally need many more samples than features to work. You should have at least a statistically useful amount of data for each feature. A typical scenario would be 5-50 features and 5,000 samples. If there are too few samples, there is no way of finding patterns in the data. The exception to this rule is when the features are highly correlated, so that the amount of independent information per feature is small. An example of this is image data, where each pixel is a feature.
Having a large number of features will significantly slow down most data processing algorithms (much more than having a large number of samples). This is because most optimization and learning algorithms search the feature space in one way or another, and adding more dimensions means that a larger space has to be searched. On the other hand, having more relevant features in the data means that you have more information about the phenomenon you are trying to model. The key word here is "relevant", and relevance isn't always easy to judge. Ultimately you want as few features as possible that together provide a maximum of independent information.
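To get a feel for how quickly the space grows, suppose, purely as an illustrative assumption, that an algorithm examined a regular grid with a fixed number of points along each feature axis. Real algorithms rarely search this exhaustively, but the count shows the trend: the number of grid points grows exponentially with the number of features. The function name `grid_points` is made up for this sketch.

```python
def grid_points(points_per_axis, n_features):
    """Number of points in a regular grid over the feature space."""
    return points_per_axis ** n_features


# With 10 points per axis, each added feature multiplies the grid size by 10.
for n_features in (1, 2, 5, 10, 20):
    print(n_features, grid_points(10, n_features))
```

Doubling the number of samples, by contrast, only doubles the amount of data to process.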
Time series
A time series is a sequence of samples spaced at uniform time intervals. This is also referred to as dynamic data, as opposed to static data, which has no temporal dependencies. The important difference between static and dynamic data is that the order of the samples matters in the latter but not in the former. Time series have to be treated with caution because removing or reordering samples will contaminate the data, making it potentially unusable. Because of this, only a limited number of preprocessing operations can be applied to time series data: those that keep the sequence of samples intact.
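As a rough sketch of what order-preserving operations can look like, the Python snippet below splits a series chronologically instead of shuffling it, and computes first differences between consecutive samples. The function names and the `readings` values are hypothetical and chosen only for illustration.

```python
def chronological_split(series, train_fraction=0.8):
    """Split a series into an earlier and a later part without reordering."""
    cut = int(len(series) * train_fraction)
    return series[:cut], series[cut:]


def difference(series):
    """First difference: each value minus the one before it, in order."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


# Hypothetical readings taken at uniform time intervals.
readings = [10, 12, 11, 15, 14, 18, 17, 21, 20, 24]

earlier_part, later_part = chronological_split(readings)
changes = difference(readings)
```

Both operations leave the samples in their original order. Shuffling the rows before splitting, which is routine for static data, would break the temporal dependencies described above.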
Ordinal vs nominal
Generally speaking, there are two types of data values: nominal and ordinal. Ordinal data has an implicit order while nominal data does not. Suppose that you have data with two features, “Age” and “Fruit”. Although “Age” may be expressed in integers, it is still an ordered sequence. Even if your data contains only discrete integer values such as 14 and 15, the in-between value 14.5 has a meaning. On the other hand, if your “Fruit” variable contains “Banana”, “Apple” and “Kiwi” and you apply a mapping where Apple = 1, Banana = 2 and Kiwi = 3, the in-between values have no meaning.
Expansion
In many situations you may want to include non-numerical ordinal and nominal data in an otherwise numerical data set. There are good and bad ways of doing it.
For ordinal data (data that has some implicit order) it is OK to simply enumerate the values, as long as the enumeration is consistent with the order:
Good:
| Age Original | Age Expanded |
|---|---|
| one | 1 |
| two | 2 |
| three | 3 |
| four | 4 |
Bad:
| Age | One | Two | Three | Four |
|---|---|---|---|---|
| one | 1 | 0 | 0 | 0 |
| two | 0 | 1 | 0 | 0 |
| three | 0 | 0 | 1 | 0 |
| four | 0 | 0 | 0 | 1 |
For nominal data (data where there is no implicit order), expand each value into its own binary column:
Good:
| Fruit | Banana | Apple | Kiwi | Melon |
|---|---|---|---|---|
| Banana | 1 | 0 | 0 | 0 |
| Apple | 0 | 1 | 0 | 0 |
| Kiwi | 0 | 0 | 1 | 0 |
| Melon | 0 | 0 | 0 | 1 |
Bad:
| Fruit Original | Fruit Expanded |
|---|---|
| Banana | 1 |
| Apple | 2 |
| Kiwi | 3 |
| Melon | 4 |
This type of preprocessing is needed for most types of adaptive systems, but it is good practice in general because it encodes whether the information is nominal or ordinal. That makes it much easier to apply sensible statistics to the data.
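The two "good" expansions can be sketched in plain Python as follows. The helper names `encode_ordinal` and `encode_nominal` are hypothetical, and the sketch assumes that the order of the ordinal labels and the list of nominal categories are known in advance.

```python
def encode_ordinal(values, order):
    """Map each label to its rank (starting at 1) in the given order."""
    rank = {label: position + 1 for position, label in enumerate(order)}
    return [rank[value] for value in values]


def encode_nominal(values, categories):
    """Expand each label into one binary column per category."""
    return [[1 if value == category else 0 for category in categories]
            for value in values]


ages = ["two", "four", "one"]
age_codes = encode_ordinal(ages, order=["one", "two", "three", "four"])

fruits = ["Apple", "Melon", "Banana"]
fruit_columns = encode_nominal(
    fruits, categories=["Banana", "Apple", "Kiwi", "Melon"]
)
```

Each row of `fruit_columns` contains exactly one 1, so no fruit ends up numerically "between" two others, while `age_codes` keeps the ordering of the ages.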

