Tutorial 4
In this tutorial we will look at the finer points of building a model while keeping an eye on the limits imposed by the data. We will also take a look at some more post-processing components, including the probe that allows us to test our model in real time.
It is assumed that you have completed the previous tutorials before starting this one.
The problem
Greece and the islands of the Aegean Sea have given birth to many myths and legends of war and agriculture. And those once-proud stones of ruined and shattered temples bear witness to the civilization that flourished and then died here, and to the demigods and heroes who inspired those legends on this land and these islands. But, though the stage is the same, ours is a legend of our own times, and its heroes are not demigods, but the people from the National Statistical Service of Greece.
In the last decade of the 20th century, so the story goes, people from NSSG collected agricultural data in the rural parts of the Greek mainland and islands. Their goal is said to be collecting background data for a new taxation model. Truth be told, we may never know as the full description has been lost in the vast regions of the public domain. The written records of the data however remain, and we shall use them to model the presence on these lands of the bovine creatures named "The Cows of Navarone".
The data contains a number of variables measured at different villages in Greece. Each sample represents one village. One of these variables is the number of cows in the village, and this is the one we will try to model, given the other variables.
The data
The data file is located on the Peltarion data server: http://data.peltarion.com/cows/cows.txt. We will later import that data directly from Synapse, so there is no need to download it. These are the features (variables) in the data file:
- Population - number of people living in the village
- Distance - distance to nearest town (km)
- AgricA - agricultural area (ha)
- Asphalt - asphalt roads (km)
- Churches - number of churches / 10,000 people (weighted, adjusted)
- Schools - number of schools / 10,000 people (weighted, adjusted)
- Tractors - number of tractors / 1,000 people (weighted, adjusted)
- MAge - median age (years)
- Water - water quality (0-100%)
- Cows - number of cows in village (target variable)
The requirement specifications
Contrary to popular belief, we at Peltarion have relatively little knowledge of cattle. Without domain knowledge, it is difficult to define requirements and boundaries for the model. As there is no earlier model to compare to, we'll have to rely on the statistics of the data.
Our target feature is "Cows". If we look at the descriptive statistics of the data, we can see it has a standard deviation of about 123 cows. The mean number of cows is 158, and the maximum and minimum are 796 and 3 respectively. An error larger than the standard deviation would indicate a very bad model. So, let's aim for less than half of it: say +-50 cows with 95% confidence, to pick a round number. Had we actually had domain knowledge about bovine logistics, we could have made a far more meaningful requirement. The one we made is pretty much arbitrary and only within very rough statistical boundaries.
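The reasoning above can be sketched in a few lines of Python. This is only an illustration, assuming the "Cows" column has already been read into a list; the function name `requirement_bounds` and the sample values are hypothetical and are not part of Synapse or of the real data.

```python
import statistics

def requirement_bounds(cows):
    """Summarize the target and derive a rough error requirement from it."""
    std = statistics.stdev(cows)  # sample standard deviation
    return {
        "mean": statistics.mean(cows),
        "std": std,
        "min": min(cows),
        "max": max(cows),
        # An error as large as the spread of the target is no better than
        # always guessing the mean, so we aim well below it.
        "half_std": std / 2,
    }

# Made-up values for illustration only (not the NSSG data).
summary = requirement_bounds([3, 40, 120, 158, 200, 310, 796])
```

With the statistics quoted in the text (standard deviation of about 123), half the standard deviation is roughly 61, so a target of +-50 sits a little below that.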
There is a point to it however, which will be a recurring theme in this tutorial, and that is that adaptive systems may give you the power to blindly create models, but that they may not mean much if you aren't qualified to evaluate them.
Data mining
Let's get the data into Synapse. It's located on the Peltarion data server, but apart from that it's a plain text file.
Start a new solution and add a new Data Unit, using the CSV file format. Instead of adding a file, in the "File name" text box, write the URL: http://data.peltarion.com/cows/cows.txt :
You will see that the "Browse.." button has changed to a "Load" button. Click on it and complete the wizard. That's how you load a data file over a network. Of course, for more robust data storage, you should use a database and the SQL format to get the data into Synapse.
Before we move on to data mining, make sure that the "CSV" data unit is selected in the Data Unit Manager. Above the "Add" button, you will find a check box that says "Embed Data in Solution". Check it:
This means that the data will be saved with the solution, and will be read from there rather than from the format. That way you don't have to download the file over the internet each time you load the solution. You can always restore the connection by unchecking the box. The cows.txt file is of manageable size so it is safe to do - you should however avoid embedding larger amounts of data as it will result in huge solution files that will slow down the system.
Now take five to ten minutes to explore the data. As we have covered parts of this in earlier tutorials, we won't go through it again here.
If you are done, let's move on. Visualize the data unit in a scatter plot, displaying "AgricA" on the X axis and "Cows" on the Y axis:
As we can see, there are three clear outliers on the right side of the plot (AgricA > 30,000). They should be removed so that they don't sabotage our model.
You may be thinking "outlier filter" right away, but that would be wrong. Sure, an outlier filter would take care of those three immediately, but switch to the Validation set to see why it won't do here:
Do you see the dot in the upper right corner? An outlier filter applied to the training set would not have removed it, and hence our test results would have been unreasonably bad. And indiscriminately applying an outlier remover to the validation set is a big no-no, as you have no direct control over what is removed. Unless you are willing to impose stricter restrictions on the model than the data does, you should not touch the validation set.
Stricter restrictions are however exactly what we want here: we are going to carve out reasonable limits for the entire model by trimming the data.
Place a Mixer Filter on the Filter Stack:
Press the "Apply" button. The mixer filter repartitions the data into a training and a validation set. As we have left its "Percent" setting at the default of "0", all data will be put into one channel, and training and validation will be the same. This way we can operate on our entire data set rather than on a per-set basis.
Our scatter plot now looks like this:
As you can see we now have all four points. From the toolbar on the left, pick the "Select" tool and select those four points. From the "Options" menu select "Make Filter-->Remove Selected".
Our points are gone and we have a new filter on the stack:
Problem solved? Not quite.
This is why you never should do something like this unless absolutely necessary: FilterFrames or local filters are index based. In short what they do is remove points at specified indices. Had we for instance loaded a new file with different data in it, or had we simply reordered the data prior to the FilterFrame, the indices would have no longer matched and we would have removed the wrong points!
FilterFrames are for this reason also ignored during deployment. They are a fast way of removing individual points, but they are in no way a good solution unless you intend to always stick to the same data.
A much better alternative is the Select filter which removes points based on a range of one or more variables, rather than absolute points.
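The difference between the two approaches can be illustrated with a small Python sketch. It does not use Synapse; the helper names and the three-row data set are hypothetical and serve only to show why index-based removal breaks when the data is reordered, while range-based removal does not.

```python
def remove_by_index(rows, indices):
    """FilterFrame-style removal: drop rows at fixed positions."""
    drop = set(indices)
    return [row for i, row in enumerate(rows) if i not in drop]

def remove_by_range(rows, feature, upper):
    """Select-style removal: keep rows whose feature value is below a limit."""
    return [row for row in rows if row[feature] < upper]

# Hypothetical mini data set with one outlier at index 1.
rows = [
    {"AgricA": 5000, "Cows": 120},
    {"AgricA": 35000, "Cows": 90},
    {"AgricA": 7000, "Cows": 200},
]
# The same data in a different order: the outlier is now at index 0.
reordered = rows[1:] + rows[:1]

by_index_original = remove_by_index(rows, [1])        # outlier removed
by_index_reordered = remove_by_index(reordered, [1])  # wrong row removed
by_range_reordered = remove_by_range(reordered, "AgricA", 20000)
```

After reordering, the index-based filter still removes position 1 and leaves the outlier in place, whereas the range-based filter removes the outlier regardless of where it sits.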
So, remove the FilterFrame from the stack by selecting it and pressing the "Del" key on your keyboard. Press "Apply" to see the removed points reappear. While you are at it, you can remove the Mixer filter as well, since we don't need it now.
Add a Select filter to the stack. In the "Select" setting under "Operation", write "Agrica < 20000" and set "Validation" to "True":
Press the "Apply" button. Now we are back with the points removed from both the Validation and the Training set, but in a far better and deployable way. What we have to remember however is that we have imposed a limit on our model. Our model is by definition no longer valid for villages where AgricA is 20,000 ha or more.
The final thing we wish to do in preprocessing is to make sure that the order of the data is randomized. The 15% for validation that we defined in the format wizard is simply taken from the end of the data (so that it doesn't destroy time series). In static problems such as this one, it doesn't hurt to make sure that the data is truly in random order.
Add a second Mixer filter to the stack, set its mode to "Shuffle" and set the percent to "15":
Press "Apply".
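What the shuffling step achieves can be sketched in plain Python. This is not Synapse's implementation; `shuffle_split` is a hypothetical helper that randomizes the row order and then sets aside a percentage for validation.

```python
import random

def shuffle_split(rows, validation_percent, seed=None):
    """Shuffle the rows, then split off a percentage for validation."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_validation = round(len(shuffled) * validation_percent / 100)
    split = len(shuffled) - n_validation
    return shuffled[:split], shuffled[split:]

# 100 stand-in samples, 15% of them used for validation.
training, validation = shuffle_split(range(100), 15, seed=0)
```

Fixing the seed makes the split reproducible, which is convenient when comparing models; without a seed, every run produces a different partition.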
OK, let's go to Design mode.
Design
Add a "Static->MLP Two Layer" snippet to the work area:
Now, we could leave it as is, but let's improve upon it a bit. Instead of just using a standard network, let's build a custom system. So what do we wish to add?
Well, we could try to increase its robustness. If we add a Hebbian layer with the appropriate rule, we can get it to perform a form of Principal Component Analysis (PCA), which helps it cope with outliers. Adding a function layer after it makes that branch non-linear.
From the component bar, drag a "Hebbian Layer" (found under "Unsupervised") on the work area and one Function Layer. Hook them up to the existing system like this:
Select the Hebbian Layer and in the property browser set the "Forward Rule" to "GHA", which is the Generalized Hebbian Algorithm rule that produces PCA. Set "Step" to 0.01. This will make sure that it doesn't go off too fast:
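For readers curious about what the GHA rule does, here is a minimal pure-Python sketch of the Generalized Hebbian Algorithm (also known as Sanger's rule) for a linear layer. It is a textbook formulation, not Synapse's code; the function names, the initial weight range and the synthetic two-dimensional data are assumptions made for illustration.

```python
import random

def gha_step(weights, x, step):
    """One GHA (Sanger's rule) update of the weight rows for sample x."""
    # Linear outputs: y_i = sum_j w_ij * x_j
    y = [sum(w * xj for w, xj in zip(row, x)) for row in weights]
    updated = []
    for i, row in enumerate(weights):
        new_row = []
        for j, xj in enumerate(x):
            # Part of x_j already explained by outputs 0..i
            explained = sum(y[k] * weights[k][j] for k in range(i + 1))
            new_row.append(row[j] + step * y[i] * (xj - explained))
        updated.append(new_row)
    return updated

def train_gha(samples, n_outputs, step=0.01, epochs=1, seed=0):
    """Train a linear layer with GHA, starting from small random weights."""
    rng = random.Random(seed)
    n_inputs = len(samples[0])
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_inputs)]
               for _ in range(n_outputs)]
    for _ in range(epochs):
        for x in samples:
            weights = gha_step(weights, x, step)
    return weights

# Synthetic data whose main direction of variation is along (1, 1).
data_rng = random.Random(1)
samples = []
for _ in range(3000):
    t = data_rng.gauss(0, 1)
    samples.append([t + data_rng.gauss(0, 0.2), t + data_rng.gauss(0, 0.2)])

first_component = train_gha(samples, n_outputs=1, step=0.01, epochs=2)[0]
```

With a single output the rule reduces to Oja's rule, and the weight vector should settle close to the unit vector along the first principal component. A small step keeps the updates stable, which is the motivation for the 0.01 setting above.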
Let's change the default learning rule for the other branch as well. The Quickprop update rule is a gradient-based rule that uses second-order information to improve its convergence. In theory it is superior to the standard Step rule, but it can also be unreliable and diverge. Let's try it anyway. Select the middle branch:
In the Setting Browser, under "Learning Backpass", set "Back Rule" to "Quick Prop":
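To give a feel for how Quickprop uses second-order information, here is a simplified single-weight sketch in Python. It follows the general idea of Fahlman's Quickprop (a secant estimate of the curvature from two successive gradients, with a cap on step growth), but it omits several details of the full algorithm and is not Synapse's implementation. The function name, the default parameters and the quadratic test function are assumptions.

```python
def quickprop_minimize(grad, w, step=0.1, max_growth=1.75, iterations=20):
    """Minimize a 1-D function given its gradient, Quickprop style."""
    prev_grad = None
    prev_delta = 0.0
    for _ in range(iterations):
        g = grad(w)
        if prev_grad is None or prev_delta == 0.0 or prev_grad == g:
            # No usable history yet: fall back to a plain gradient step.
            delta = -step * g
        else:
            # Secant step: jump to the minimum of the parabola fitted
            # through the last two gradient measurements.
            delta = prev_delta * g / (prev_grad - g)
            # Cap the step so it cannot grow too fast between iterations.
            limit = max_growth * abs(prev_delta)
            delta = max(-limit, min(limit, delta))
        w += delta
        prev_grad, prev_delta = g, delta
    return w

# Gradient of f(w) = (w - 3)^2, whose minimum is at w = 3.
minimum = quickprop_minimize(lambda w: 2.0 * (w - 3.0), w=0.0)
```

On a well-behaved quadratic like this one the secant step lands on the minimum within a few iterations. On rougher error surfaces the parabola can be a poor fit, which is one way the rule becomes unreliable.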
And we're done. Suppose now that we are so proud of our new creation that we want to save it for posterity or perhaps use it on a regular basis. Select all blocks by pressing CTRL+A on the work area. Right-click on any block to get a context menu:
Select "Create NetSnippet..". You will get a new window, in which you can write a name, author and description:
Click on Save, and you will be asked for a location. The default user snippet directory is under My Documents/Peltarion Synapse/Snippets.
Create a new directory called "Cattle" (under My Documents/Peltarion Synapse/Snippets) and save the file in it (CattleNet.synsnip).
Back in Design, right-click on the work area and select "Insert Snippet..". Notice that there is now a "Cattle" category with a "CattleNet" snippet.
That's how you add your own snippets.
Now, add the data as you have learned in previous tutorials - our desired output feature is "Cows" and the rest are input features. You should by now have no problems with creating the data sources on your own.
Once you have configured them, connect them like this:
Training
Go to training mode and train for roughly 400 epochs. Your MSE should be on the order of 1e-3. Remember: the pause button halts the training, while the stop button halts the training and resets the system and all weights. Depending on your CPU, training may take a few minutes, as GHA is a computationally expensive update rule.
Compared to the standard Step rule, Quickprop is more aggressive, which results in faster training but also in less consistent results and sometimes divergent behavior. Because of this, the model trained for this tutorial and your model have most likely converged to two different solutions. Don't be alarmed if your numbers and graphs don't exactly match the ones shown here.
Postprocessing
Now, go to Postprocessing mode, start the error analyzer and refresh it. Switch the "Error Metric" to "Manhattan":
This is the distribution of our absolute errors (on the training set). You can see that most absolute errors are well confined below 50, but that we have some greater errors, the worst one being more than 300 cows wrong. Wouldn't it be great if we could know what samples cause those errors?
Well, we can. From the toolbar on the left side, select the "Select" tool. With it, you can select bars in the histogram, the same way as in pre-processing. Let's take a look at the really bad errors, say those with over 100 cows wrong.
With the select tool, drag a box that contains the bars that are on positions > 100:
You will see that the "Send to Preprocessing" button has been enabled:
Click on it and a box will pop up, prompting you to choose which data unit you want to match this selection to.
As we only have one, just click on "OK". Synapse now sends us off to pre-processing, allowing us to choose a visualizer to show the selected samples with:
Choose the Grid View and you'll see something like this:
These are the samples that caused those big errors. Notice something right away? The populations are all > 2900. If you go back to post-processing and expand your selection, you'll see that the trend holds. You can in preprocessing see in the statistics pane how these samples differ from the samples that were within acceptable error range. That way you can understand the limitations of your model.
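The idea of tracing large errors back to their samples can be sketched outside Synapse as well. In this hypothetical Python example, `worst_samples` computes the absolute (Manhattan) error per sample and returns the indices of those above a threshold, which can then be used to look up the corresponding input rows. The numbers are made up.

```python
def worst_samples(desired, predicted, threshold):
    """Return (index, absolute error) for samples whose error exceeds threshold."""
    errors = [abs(d - p) for d, p in zip(desired, predicted)]
    return [(i, e) for i, e in enumerate(errors) if e > threshold]

# Hypothetical values for illustration only.
desired = [120, 90, 400, 60]
predicted = [110, 95, 150, 65]
flagged = worst_samples(desired, predicted, threshold=100)
```

Here only the third sample (index 2) is flagged, with an error of 250 cows. Inspecting the input features of the flagged rows is what reveals patterns such as the large populations noted above.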
In our case, the model isn't very good at predicting the number of cows that falls within a cluster of villages that are large, have large agricultural areas, but relatively few cows. It becomes clear from the data that to describe those adequately, we need more variables - which we don't have. But at least we know now what causes the errors.
So, how are we doing generally speaking? In the error analyzer's Confidence view, we can see that we managed to meet our requirement. We wanted +-50 and got +-45 in validation and +-36 in training with 95% confidence. As training starts from random weights, your results will vary. They should however be between +-30 and +-50 with 95% confidence on the validation set:
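One plausible way to read a statement such as "+-45 with 95% confidence" is as the absolute error below which 95% of the samples fall. The sketch below computes that empirical bound in Python. Whether Synapse's Confidence view uses exactly this definition is an assumption, and the function name and sample values are hypothetical.

```python
def error_bound(desired, predicted, confidence_percent=95):
    """Smallest absolute error that covers the given percentage of samples."""
    errors = sorted(abs(d - p) for d, p in zip(desired, predicted))
    # Integer arithmetic for ceil(percent * n / 100), avoiding float rounding.
    covered = (confidence_percent * len(errors) + 99) // 100
    return errors[max(covered - 1, 0)]

# Hypothetical errors: 19 small ones (1..19) and a single large one (300).
desired = [100] * 20
predicted = [100 + e for e in range(1, 20)] + [400]
bound = error_bound(desired, predicted)
```

In this example the bound is 19: the single error of 300 falls in the 5% that the bound is allowed to miss, which is why a good confidence figure can coexist with a few very large errors.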
Now suppose you are in your car, driving in Greece and your friend, sitting next to you says:
"Look at that village, it looks like it has a population of about 1000 people, is 30 km from the nearest town, has an agricultural area of 6000 ha and 10 km of asphalted roads. Judging from what I can see, it has 8 churches, 33 schools, 18 tractors per 10,000 capita, a median age of 31 years and 70% water quality. How many cows do you reckon are in the village?"
Well, one way of answering your friend is to deploy the system you just made, and input data to the deployed component. There is however a quicker way. It's called the Probe:
Start the probe from the post-processing bar and press "Refresh":
(If you get different values, don't worry - that particular input is nonsensical and isn't representative of the data it was trained on. As systems are initialized with random values they converge to different solutions. Problem space not covered by the data can result in great variance in output between models. More on this below.)
As you might have guessed by now the Probe allows you to enter your inputs and get an output live. You can either write the values in the text boxes or use the sliders. The output will be shown on the right side. The plot in the lower right corner tracks the changes in the output.
In the "Source:" dropdown, you choose which source you wish to override to enter your values and in the "Probe At" dropdown you choose at which block you wish to collect the output.
So, let's enter your friend's values:
You can now tell your friend that there are 195 +-45 cows in that village, with 95% confidence. Again, your exact values will vary depending on the random order of the data and the initial weights, as well as on what Quickprop did with them.
Press "Clear" on the plot.
So, what would happen if the number of tractors per capita increased in that village? To find out, just drag the Tractors slider to the right and watch the plot:
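Dragging a single slider amounts to sweeping one input while holding the others fixed. A minimal Python sketch of that idea follows. The `toy_model` function is a hypothetical stand-in with made-up coefficients, not the network trained in this tutorial, and only two of the input features are included to keep it short.

```python
def sweep_feature(model, inputs, feature, values):
    """Vary one input over `values` while holding the others fixed."""
    outputs = []
    for value in values:
        probe = dict(inputs)      # copy, so the original inputs stay intact
        probe[feature] = value
        outputs.append((value, model(probe)))
    return outputs

def toy_model(v):
    # Hypothetical stand-in for a trained system (made-up coefficients).
    return 0.1 * v["Population"] + 2.0 * v["Tractors"]

village = {"Population": 1000, "Tractors": 18}
curve = sweep_feature(toy_model, village, "Tractors", [18, 30, 60])
```

Each entry in `curve` pairs a slider position with the model output at that position, which is the information the probe's plot tracks as the slider moves.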
Finally, a word of caution. With the probe it is quite possible to make combinations that could never happen in the real world and for which the model will output nonsense. If you have no domain knowledge at all, this can be difficult to spot. If you are not qualified to evaluate the input variables and to interpret the output variables, then the chances of making a mistake are high.
This is why an economist armed with a neural network will always outperform a programmer with a neural network when it comes to stock market prediction. There is simply no substitute for domain knowledge, as you need to interpret the results. Adaptive systems may build the model for you, but if you input wrong data or aren't qualified to interpret the outputs, then they are of little worth.
To demonstrate one example of an impossible combination, drag all the sliders down to the left, except for Schools which you should drag to the right:
Negative cows??? Fortunately we are all domain experts at counting objects, so we can spot that a negative number of cows is a nonsense count. Why the input combination above is nonsense (which it clearly is, given the nonsense output) is a bit more difficult to answer. You'd have to actually know something about the domain to answer that. The results for impossible combinations vary greatly from model to model, as that part of the problem space is not covered by the training data, and the output there depends on whatever particular solution the system happened to construct.
Finally in this tutorial, we'll take a look at one thing we've been lacking so far: qualitative understanding. This is where sensitivity analysis comes in:
Press "Refresh" and you'll see a bar plot (yours will probably look somewhat different):
What sensitivity analysis does is to go through the input features, one by one and dither them a defined percentage while holding the rest fixed. It runs the system for each feature and records how much the output changed due to the dithering of the input feature. This information is then assembled and you can see how sensitive the system is to changes in the features.
In plain English this means that you can see how much the output changes if you change an input feature a certain percentage. This is a measure of how important a feature is at a certain level.
For instance, in the bar plot above we can see that a 10% increase (dither 0.1) of the population variable gives a far stronger output change than a 10% increase of the asphalt variable.
One thing that is important to understand is that sensitivity is per model as long as input variables are correlated (which they nearly always are in real-world problems). A neural net can choose which information it wishes to extract from what feature. How that choice is made depends on many factors, the initial random weights in the network being a major one. This directly impacts the sensitivity, and two models trained on the same data are unlikely to have equal sensitivity.
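The procedure described above can be sketched in a few lines of Python. This is an illustration under assumptions: the dither is applied as a relative increase of each feature's current value (the exact scheme Synapse uses is not stated here), and `toy_model` is a made-up non-linear function, not the tutorial's network.

```python
def sensitivity(model, inputs, dither=0.1):
    """Output change when each feature alone is increased by `dither`."""
    baseline = model(inputs)
    result = {}
    for name, value in inputs.items():
        nudged = dict(inputs)               # hold the other features fixed
        nudged[name] = value * (1.0 + dither)
        result[name] = abs(model(nudged) - baseline)
    return result

def toy_model(v):
    # Hypothetical non-linear stand-in for a trained system.
    return v["Population"] ** 0.5 + 0.001 * v["Distance"] ** 2

village = {"Population": 1000, "Distance": 30}
small = sensitivity(toy_model, village, dither=0.1)
large = sensitivity(toy_model, village, dither=0.5)
```

Because the stand-in model is non-linear, the relative weight of the two features differs between the 10% and the 50% dither, so a ranking obtained at one dither level does not automatically carry over to another.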
It is tempting to say that sensitivity equals importance. In a way, it does, but you have to take into account range.
It would not be correct to say that the Population variable is more important than the Water variable, because these systems are non-linear. Had they been simple, then you would have just written an equation and never bothered with adaptive systems. The real world however is complex and non-linear so if we for instance change the dither to 0.5 (50%) and press refresh we may get something like this (again your model will most likely be different):
However, if you are careful with the interpretation, sensitivity analysis can give a lot of information about the model. The non-linearity is a great thing, if you know how to use it. Consider the distance variable in this model: it has a far smaller relative impact on the number of cows when the dither is small than when it is large. This tells us that whether a village is 10 or 15 km away doesn't really matter, but whether it is 10 or 100 km away does. In the case of population, it's the opposite: 500 or 600 people makes more relative difference than 2000 or 3000 people.
Use it right, embrace the non-linear nature of it, and the sensitivity analysis will be your best friend when it comes to understanding the model.
In this fourth tutorial we have learned how to apply filters through visualizers, and why we should avoid that and use the Select filter instead. We have created a Hebbian-MLP hybrid and made a snippet out of it. In post-processing we saw how to trace errors back to pre-processing and how this could give us a better understanding of the limits of the model. Finally we checked out the probe and the sensitivity analysis plugins.