The abalone is a type of shellfish fairly common around the world. It is of interest to biologists and ecologists due to its variation and world-wide spread. For more information on abalones, you can look at this article.
The age of the abalone is determined by cutting the shell through the cone, staining it and counting the number of rings through a microscope. This is a boring and time-consuming task which most biologists would like to avoid. So let’s see if we can help them by making an adaptive system that learns to predict abalone age from other measurements which are easier to obtain.
The data file is located in Synapse\Sample Data\Abalone and called “abalone.txt”. The following features (variables) are available to us in the data file:
- ID – sample ID number
- Age – the age of the abalone (our desired output value) (years)
- Sex – male, female or infant
- Length – longest shell measurement (mm)
- Diameter – perpendicular to length (mm)
- Height – including the meat in the shell (mm)
- Whole weight – whole abalone (g)
- Shuck weight – weight of meat (g)
- Viscera weight – gut weight (g)
- Shell weight – after being dried (g)
One very important thing to decide on is the requirement specification for our solution to the problem. What precision do we want from the data? What theoretical limits do we have? Are individual samples very important, or do we plan to look at groups of samples?
In the case of abalone, there are a few things to keep in mind. First of all, the exact age of an individual abalone is fairly uninteresting to biologists. They are more interested in measuring the age of groups of abalones. This means that per-sample precision is less important than overall statistical coherence in the age estimate. The age predicted should be correct on average – i.e. we should avoid any errors that systematically affect the results. One example of such an error would be a systematic over-estimation of the age.
Second, our model can only be as good as the data. The abalone data is based on manual measurements, which are always associated with wide error margins. The age estimation in the data, for instance, is based on counting rings under a microscope. In the abalone case the error on that data is, with 95% confidence, ±3 rings (years). Unless there is a systematic error in the measurements, this is our theoretical limit of precision. In addition, we have other manual measurements such as length, height and weight – all of which are likely associated with some degree of measurement error. So what goals should we set? To set a fairly ambitious goal, let's say that we want our predictions to be within ±5 years with 95% confidence and to have an average error of no more than 2 years.
The first step in building an adaptive system is to get a better understanding of the data. When we become familiar with the data, we can start to format it and filter it in a way that helps the adaptive system do a better job.
Start up Synapse and make sure you are in Preprocessing. If the start page is activated, close it. If you already have Synapse open, create a new solution ("File->New Solution"). You will see something similar to this:
We'll start by importing the abalone data file into Synapse. In the Data Unit Manager click on the "Add" button.
Select "CSV File" and click "Next>":
Click on the "Browse"¦" button and navigate to the Synapse\Sample Data\abalone.txt file. Click on "Next>":
To import the file, click "Finish".
In the Data Unit Manager, you should now have a Data Unit named "CSV". The size column tells you how many training and validation samples there are and how many features the data unit has. In our case we have 10 features and there are 3551 training samples and 626 validation samples.
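Synapse makes this training/validation split for you during import. Conceptually it is just a hold-out split; a minimal sketch in pandas, assuming the file is a plain delimited text file with the column names listed above (the path, separator and random seed are illustrative):

```python
import pandas as pd

# Load the raw abalone data (separator and header assumptions may need adjusting).
data = pd.read_csv(r"Sample Data\abalone.txt")

# Hold out a validation set; the remaining rows are used for training.
validation = data.sample(n=626, random_state=0)
training = data.drop(validation.index)

print(len(training), "training samples,", len(validation), "validation samples")
```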
Now, from the Data Unit Manager, drag-drop the data unit to the first empty visualizer.
In the window that pops up, select "Correlation".
In the Correlation visualizer set the "Feature" drop-down to "Age".
The correlation visualizer shows the linear correlation between features. The higher the bar goes, the more the features are correlated. Blue bars indicate a positive correlation while red bars indicate a negative correlation.
This can be seen as a triage stage, showing which features affect the desired output. We can see here that, for instance, the "ID" feature doesn't contribute much. This isn't very surprising as it is just the sample ID. The "Sex" feature doesn't seem to contribute much either. So let's get rid of them, as they are likely only to take up valuable computing power in the adaptive system without contributing anything.
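Synapse computes these correlations for us, but for reference, the bars correspond to ordinary (Pearson) linear correlation between each feature and the desired output. A minimal sketch of the same calculation, under the same loading assumptions as before:

```python
import pandas as pd

data = pd.read_csv(r"Sample Data\abalone.txt")   # assumed layout, as before

# Linear (Pearson) correlation of each numeric feature against "Age".
# Non-numeric features such as "Sex" are skipped by numeric_only.
correlations = data.corr(numeric_only=True)["Age"].drop("Age")
print(correlations.sort_values())   # positive values = blue bars, negative = red bars
```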
Make sure that our data unit is selected and then drag-drop the Extract filter from the "Filter Bar" to the "Filter Stack".
Below the filter stack you have the settings of the currently selected filter. The extract filter that we have placed on the stack can be used to remove features or to split them to a second data unit. What we want to do is to remove the "ID" and "Sex" features.
The first step is to change the "Mode" setting of the extract filter to "Remove". Next under the "Selected" category, change "ID" and "Sex" to "True". This will mark these features for removal.
Click on the "Apply" button that is located below the settings list. As you can see the two features are now gone from the correlation visualizer.
Let's now take a closer look at the data. Close the correlation visualizer by clicking on the "x" in the right corner. Drag the data unit again to the now empty visualizer. This time select "Histogram".
The Histogram visualizer is the most useful tool for understanding data in the individual features. It shows how the data is distributed – value on the horizontal axis and frequency on the vertical axis.
When we first load the histogram the "Age" feature is selected. As we can see most shells are 9 years old – 579 of them to be precise.
The "Bins" settings adjusts the resolution of the histogram. You can try changing the default 255 resolution to for instance 29 (as the oldest shell is 29 years old and the age is expressed in whole years.
The curve you are seeing is bell-shaped. This is called a normal (or Gaussian) distribution. The normal distribution is a very important probability distribution that arises in many situations. Generally speaking, when you have a large number of independent samples from a naturally occurring phenomenon, the data will follow a normal distribution. For more about normal distributions, you can look at this Wikipedia article.
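If you want to convince yourself of this outside Synapse, a quick sketch with numpy and matplotlib can overlay the fitted bell curve on the age histogram (purely illustrative; the tutorial itself does not require this):

```python
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

ages = pd.read_csv(r"Sample Data\abalone.txt")["Age"].to_numpy()
mu, sigma = ages.mean(), ages.std()

plt.hist(ages, bins=29, density=True, alpha=0.5)    # one bin per year of age
x = np.linspace(ages.min(), ages.max(), 200)
pdf = np.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
plt.plot(x, pdf)                                    # the fitted bell curve
plt.show()
```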
Now, change the selected feature to "Height". Change the number of bins back to 255. What you see now is a normal distribution on the left side with lots of empty space on the right side. In the far right corner there is one single sample. That is a nice example of an outlier. An outlier is a data point that lies outside the normal data range. In this case, the average height is about 0.14 m and we can see that our normal distribution bell curve goes to around 0.25 m. Is it really likely that our outlier is a real shell that's over 1.2 m high? No, not likely at all. In all probability, the scientist who measured the abalones made an error in his notes.
Why is this sample important? Because adaptive systems learn from data. Give them incorrect data and they will learn incorrectly. An adaptive system is only as good as the data you feed it.
So, how do we get rid of our outlier? Fortunately, the fact that the data is normally distributed makes it easy. As we know how the distribution looks, we know what lies outside it – and we can simply remove it. That is what the Outlier Remover filter can do for us.
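We don't need to implement this ourselves (the Outlier Remover filter does it for us), but the general idea, assuming roughly normal data, is to drop samples that lie many standard deviations from the mean. A hypothetical sketch; the threshold and the exact rule Synapse applies may differ:

```python
import pandas as pd

def remove_outliers(df, columns, n_sigma=4.0):
    """Drop rows where any selected column lies more than n_sigma
    standard deviations from its mean (assumes roughly normal data)."""
    keep = pd.Series(True, index=df.index)
    for col in columns:
        z = (df[col] - df[col].mean()) / df[col].std()
        keep &= z.abs() <= n_sigma
    return df[keep]

data = pd.read_csv(r"Sample Data\abalone.txt")       # assumed layout, as before
cleaned = remove_outliers(data, ["Height"])           # the 1.2 m shell would be dropped
```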
From the Filter Bar, drag-drop the "Outlier Remover" filter onto the Filter Stack, after (below) the Extract filter that we've already placed there.
We should now select which features to apply this filter to. In our case, all our variables are natural measurements, which means they are very likely to follow something close to a normal distribution. So we can with good conscience select all features for outlier removal.
One way of doing this is to set all the features under the "Selected" category to "True" in the filter's settings by double-clicking on them individually. There is, however, a faster way. Under "Operator", select the "Invert" setting. You should see, at the edge, a small button with "..." written on it:
Click on it to invert the values under "Selected":
Click on "Apply". As you can see the picture changed dramatically. The outlier was removed and now the rest of the data that fits well enough in the normal distribution covers the whole plot.
One important thing to remember is that using the Outlier Remover filter in this fashion only works if the data follows something that resembles a normal distribution.
We are going to do one final thing in pre-processing in this tutorial and that is to extract our output feature (Age) to a data unit of its own. Although this isn't strictly necessary at this point, it will simplify things in design mode.
Add another extract filter to the Filter Stack. Make sure that Mode is set to "Split". Under the "Selected" category, change "Age" from "False" to "True".
You can see now in the Data Unit Manager that we've got a second data unit. It contains our age feature which has been extracted from our original data unit.
Now let's go to design mode and build our system.
Select Design in the mode bar.
We're going to start by placing a snippet. Snippets are prefab topologies, or bits of topologies. Basically, they are reusable collections of connected components. In many cases they save you the trouble of having to build a system from scratch.
Make sure that you are in move/select mode.
Right click on the work area (the white region in the middle of the screen). You'll get a pop-up menu.
Select "Insert Snippet"¦"Static" --> "MLP One Layer"
If you need to center the snippet on the screen, press the mouse wheel button to pan, or press "d" and click-drag on the work area to move the view.
What you see now on the screen is a basic type of neural network. It contains two weight layers, two function layers and one delta terminator. It's called a multilayer perceptron (MLP) and is a basic type of artificial neural network.
The weight layers represent the long-term memory of the system. They hold the weights that the signal travelling through the network gets multiplied with. These weights are adjusted by the learning algorithm to produce a desired output signal, given an input signal.
The function layers can be seen as non-linear thresholds for the signal. They give the adaptive system its non-linear computing capabilities.
Finally the delta terminator is an error criterion. It takes two signals and compares them according to some metric. One input to it is set to the actual output of the system and one is set to the desired output. The delta terminator compares them and sends its results back through the system where that information is used to update the weights. It's called a "terminator" because it terminates the forward signal flow through the system.
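To make these roles concrete, here is a rough numpy sketch of the same structure: two weight layers, two function layers and an error criterion comparing actual to desired output. The layer sizes, the tanh function layers and the squared-error criterion are illustrative choices, not necessarily what the snippet actually uses:

```python
import numpy as np

rng = np.random.default_rng(0)
n_inputs, n_hidden, n_outputs = 7, 10, 1      # 7 inputs remain after removing ID/Sex and splitting off Age

W1 = rng.normal(size=(n_inputs, n_hidden))    # first weight layer (long-term memory)
W2 = rng.normal(size=(n_hidden, n_outputs))   # second weight layer

def forward(x):
    h = np.tanh(x @ W1)                       # first function layer: non-linear threshold
    return np.tanh(h @ W2)                    # second function layer
    # (in practice the data would be scaled to match the function layers' output range)

def delta_terminator(actual, desired):
    # Compares actual and desired output; the result is fed back to adjust the weights.
    return np.mean((desired - actual) ** 2)
```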
What this system now lacks is data, so let's connect the data units we constructed in preprocessing. We have two data units: "CSV", which contains our system inputs, and "X_CSV", which contains our desired system output.
From the Solution Explorer, drag-drop the "CSV" data unit (under "Resources") in front of the first weight layer and then drag-drop the "X_CSV" data unit in front of the blue port of the delta terminator. This will create two data source components, one for each data unit.
Now let's connect them to the system. The "CSV" data unit, as it is the input, should be connected to the first component in the system – i.e. the first weight layer. Our desired output should be connected to the delta terminator. To do this, switch to link mode.
Click on the data source in front of the weight layer and drag a link to the weight layer. When the link turns green, you can release the mouse button. You've just connected two components.
Now connect the second data source to the delta terminator's blue port. When you drag the link over the delta terminator, you'll see two boxes covering it, one green and one blue. These are ports. Each port can take an independent signal. Connect the link to the blue port. The green port is already connected to the last function layer in the system. You can see this by looking at the color of the tip of the link between the function layer and the delta terminator. It's green, indicating that it is connected to the green port of the delta terminator. The link from the data source that we just connected has a blue tip, indicating that it is connected to the blue port.
You should have something like this on your screen now:
You can click on the work area and press Ctrl+F to frame the system in the window. That's it for design, at least for this tutorial. Now we move on to training the system.
Click on "Training" in the mode bar.
Again, in this mode you can press Ctrl+F to fit the system in the window.
In training mode the system is adapted to the data. As you might have noticed, some of the components have a different look. This is Synapse's context-sensitive GUI. Depending on mode and context, the components can show a more appropriate GUI.
For instance if you take a look at the weight layers, you will see that the lines are now colored. The color opacity is proportional to the weight values. Weights that are zero are transparent while strong positive weights are blue and strong negative weights are red. This can give us an idea of how many of the weights are actually used.
Before we run the system, we'll set some parameters in the control system GUI. The control system handles all information flow in the system and decides when and how adaptation is to take place.
In the control system GUI on the right side, enter 400 in the "Max Epochs" box and press enter. This is the number of iterations the system will go through. One epoch means showing the whole training data to the system once. So 400 epochs means that the training data will go through the system 400 times.
In the "Batch Length" box write 200 and press enter. Batch length decides how many samples are sent through the system at once. Batch length has two major consequences.
The first relates to the adaptation. When a batch length > 1 is used, multiple samples are used at once to decide how the system will be updated. Instead of changing the system after each sample, you update it with groups of samples. This has a stabilizing effect on the training, as the system is changed less frequently. So if there are high degrees of noise in the data, using larger batches can improve training stability. On the other hand, using large batches can have too strong a stabilizing effect, so that the adaptation has too low a resolution and relevant individual sample variations get lost.
The second consequence of batch length is performance. Larger batch sizes require less CPU time, but more memory. Smaller batch sizes require considerably less memory but the system takes considerably longer to train. There is no simple way of assessing the optimal batch size, so it is something that can require trial-and-error. In some cases you might even want to change batch size during training.
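As a rough illustration of how epochs and batch length interact, here is a generic gradient-descent sketch. It is not Synapse's actual learning algorithm; grad_fn stands in for whatever gradient the system computes:

```python
def train(X, T, weights, grad_fn, learning_rate=0.01, batch_length=200, max_epochs=400):
    n = len(X)
    for epoch in range(max_epochs):                  # one epoch = the whole training set once
        for start in range(0, n, batch_length):
            xb = X[start:start + batch_length]
            tb = T[start:start + batch_length]
            # One update per batch: averaging over more samples stabilizes training,
            # but very large batches can smooth away useful per-sample variation.
            weights -= learning_rate * grad_fn(weights, xb, tb)
    return weights
```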
Now, with those two parameters set, it's time to press play.
The system will now start adapting.
In the control system GUI you can see how the current epoch progresses. Observe the two graphs in the delta terminator. One shows the error on training data and one on validation data. Validation data is data that has been put aside for testing and not included in the training – it is used to see how well the system deals with unseen data.
When the epoch progress bar reaches 400, we're done with the training. Now we have a fully trained adaptive system. As a final step, let's see how well the system actually performs.
Click on "Post Processing" in the mode bar.
Click on "Error Analyzer" in the bar on the left side of the window. The Error Analyzer is a tool that allows us to put the performance of the system in a statistical context.
Click on the "Refresh" button. From the "Error Metric" drop-down, if it isn't already selected, select "Linear":
This plot shows us the distribution of the errors on the validation data. Recall the requirement specifications we made when we first took a look at the problem.
The first requirement was that we wanted the average error to be no more than 2 years.
What we can see from this first graph is that our average error is around -0.2 years – well within our requirements.
Change the Error Metric from "Linear" to "Manhattan" in the Error Metric drop-down. The linear metric that we looked at showed us the plain error: desired output minus actual output. The Manhattan metric, on the other hand, shows abs(desired output - actual output). It is more reasonable to look at averages that way, since in the linear case negative and positive errors cancel each other out. In the linear case, if we have an error of -2 and an error of 2, the average error will be (2-2)/2 = 0. In the Manhattan case it will be (2+2)/2 = 2, which is more useful. This is called the absolute error.
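In plain code, the two metrics look like this for the example above:

```python
import numpy as np

desired = np.array([10.0, 8.0])
actual  = np.array([12.0, 6.0])

errors = desired - actual                 # linear errors: [-2.0, 2.0]
mean_error = errors.mean()                # linear metric:    (-2 + 2) / 2 = 0
mean_abs_error = np.abs(errors).mean()    # Manhattan metric:  (2 + 2) / 2 = 2
```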
Anyway, if we look at the absolute error, we can see that we're within our requirement specification here as well. The average absolute error is around 1.3 years, which is well below our maximum limit.
We had a second requirement specification for our solution: we required it to be within ±5 years with 95% confidence. That means that we want at least 95% of the cases to have errors of less than 5 years.
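Checking that kind of requirement is simple once you have the per-sample validation errors: look at the 95th percentile of the absolute error. A sketch, with made-up numbers standing in for the real validation errors:

```python
import numpy as np

rng = np.random.default_rng(0)
errors = rng.normal(0.0, 1.7, size=626)     # stand-in for the 626 validation errors

bound = np.percentile(np.abs(errors), 95)   # 95% of errors fall within +/- this bound
print(f"95% confidence bound: +/-{bound:.1f} years (requirement: 5 years)")
```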
To see how well we did in that department, click on the "Confidence" tab. As we can see, with 95% confidence the values are within about ±3.4 years of the desired values.
So we did nearly as well as our theoretical limit allows. Hence we have a system that is quite acceptable for our purposes.
You can zoom in the confidence plot by dragging a box around the area you wish to take a closer look at. The purple line is the output of the system; the black line is the desired output.
The two other lines that enclose the shaded area are the confidence limits. With the confidence level set to 95% it means that 95% of the desired outputs should be within the gray area.
You can use the sort buttons to sort the data according to sample number (S), desired value (T), error (E) or output (O):
With this we conclude the first tutorial. You have now learned how to import data, visualize it and use filters. In addition, you have learned how to use snippets and how to get the data into design mode. You've seen how a system is trained and how to analyze the trained system. In the next tutorial we'll take a look at classification. We will use more advanced data mining functions, spend more time in design mode and finally see how a system can be deployed for external use.