Self-organizing map

From Piki

  • Currently5.00/5
Jump to: navigation, search

The self-organizing map (SOM) also known as a Kohonen map is a type of neural network based on competitive learning. It can be used for clustering and for visualization of high-dimensional data.

Contents

Competitive learning

Main article: Competitive learning

Competitive learning is category of algorithms based around the concept of a special data point called a "unit" that gravitates towards the other points in the data.

Basic principles

A SOM is based on competitve learning, but has a significant modification. The units are no longer unbound agents free to roam feature space as they wish. Instead they are all interconnected in a grid. When a unit tries to run away in a direction, it will be pulled back by the strings that are attached to neighboring units in the grid.

One of the problems with plain competitive learning is that once the units have converged to their final positions, all you can get is the coordinates for cluster centers – you still have no real idea of how feature space looks like and what the relationships between the different clusters are. By tethering the units together in a grid, we can force them to keep formation. One unit will have some form of relationship in feature space to the units next to it in the grid.

The following animated example demonstrates the principle (black dots= data, green dots = units):

Animated SOM operation
Animated SOM operation


You can see how while the map gets deformed as the individual units run off in the search for data points it still keeps a global structure. A unit’s neighbors remain its neighbors after adaptation. In the animated example the units are connected in a rectangular grid but other forms are possible. Later, when we come to visualization examples we will actually use a hexagonal grid. But before we explore the usefulness of the SOM, we’ll go through a bit more theory.

It is important to understand that while the map itself is two dimensional (i.e. a grid), the units in the map have the same number of dimensions as the feature space (the number of features). Here is an example of a SOM being stretched across a three dimensional feature space (features x,y,z):

Animated 3D SOM
Animated 3D SOM


The map is still a plane, but each point in it has a coordinate with three dimensions. As was mentioned earlier, these maps become useful when we have more features. Although it isn’t possible to directly visualize it, the principle remains the same when the number of dimensions increases.

The usefulness of the SOM stems from the grid connections between the units. It makes sure that through the grid, the topological properties of the data in feature space are preserved. We get a map of how the features (variables) are connected in the data. The map is then used for visualization.


Visualization with the SOM

Suppose we have a 3×3 SOM (i.e. 9 units in a square grid) that has been adapted and twisted to fit a 2D data set. It could look something like the left part of this picture:

Example 3x3 SOM
Example 3x3 SOM

All the units in the grid on the left side of the pictures have been assigned numbers. On the right side there is a 3×3 grid with each cell corresponding to one unit in the SOM. Reading from the graph on the left, we will fill out the boxes on the right in a meaningful way. We’ll call the table on the right side a "Maplet"

The feature-space that the SOM above occupies has two dimensions: X and Y. Suppose we created one maplet for X and one for Y and filled each cell with a color that corresponds to the value of the unit associated with the cell (for the corresponding feature):

Maplet coloring
Maplet coloring


Each cell corresponds to a unit. In the "X" maplet the cell colors correspond to the X value for each unit and the "Y" maplet the same goes for the Y value. If we go back to the left side plot in the previous figure and look for instance at units 1 and 9:

  • Unit 1: Low X value, High Y value => X Maplet, Cell 1 Blue, Y Maplet, Cell 1 Red
  • Unit 9: High X value, Low Y value => X Maplet, Cell 9 Red, Y Maplet, Cell 9 Blue

While the example above is trivial and was used just to illustrate the principle, this method of visualization is very powerful. The SOM creates a map in the word’s real sense. It is a map of the data and of the feature space. Due to the connections between the units, the map is topologically intact meaning that you will get sorted continuous values on the map.

As an example of a real SOM visualization in action, this is a visualization of the "Cows" dataset covered in Synapse tutorial 4.

Cows visualized
Cows visualized

You can ignore the two first maplets for now. This SOM has a hexagonal grid – it’s better for visualization as it allows smoother transitions between the unit cells. The black circles inside the unit cells indicate how many data points are close to a particular unit. Larger circle means more data. This is a 15×15 grid in a 10-dimensional feature space.

If we for instance look at the Churches feature and focus on the red area we can make the following conclusion by looking at the same unit cells for other features:

Where there are many churches per capita there are few schools per capita, many cows, the population is large and the median age is fairly low…

U-Matrix

There is a visualization technique called the U-matrix or unified distance matrix that visualizes the distance between adjacent units in the SOM.

Instead of the color representing the value of the unit for a particular variable, imagine if we instead showed the average distance of that unit to other units. You would end up with something like this:

Unified distance matrix
Unified distance matrix

The dark areas are where the distance between the nodes is large – it’s a gap. The light areas are where the nodes are close to each other. So we can read the U-matrix and relatively easy see that there are three white areas separated by a dark region. You can think of it as bits of land separated by a ditch. In the case above there are hence three clusters covered by the SOM. The U-matrix is an important clue when you try to determine how many natural clusters the SOM is covering.

See also