Tutorial 2
In this tutorial we will use Synapse to solve a classification problem.
Before following this tutorial, it is strongly recommended that you complete the previous tutorial if you have not done that already.
The problem
In the late 1990s the Dutch noticed that they had a problem: a large percentage of their police officers were terrible at their jobs. Nearly 20% of all police officers were receiving failing grades five years after their police academy graduation. Could it have something to do with the lowering of the grade requirements? Or was it possibly because they removed the requirement that you could only become a police officer if you had no criminal record? There were plenty of good questions, and the Dutch ministry of justice wisely decided to start collecting data to see if any patterns emerged. The idea was pretty simple - collect a bunch of data on the police cadets when they leave the academy and then grade their performance each year. After five years, once the officers have gone through an important pass/fail evaluation, you can start looking for patterns.
Unfortunately the ministry of justice was apparently only familiar with relatively simple statistical methods and didn't have Synapse. They did find some interesting patterns, but as we shall see, those are not very difficult to find. With the methods available to them at the time they did not manage to build an adequate model, but we are still thankful to them for collecting the data and making it available to us. We, on the other hand, armed with Synapse, shall make a killer model.
Now, to state the problem: Given data collected at the time of graduation of the cadets, can we accurately predict if they will be failing at their jobs five years later?
The data
The data file is located in Synapse\Sample Data\Police and called "police.txt". The following features (variables) are available to us in the data file:
- Age - the age at which the cadet started studying to become a police officer
- AvgG - average grade at the time of graduation (scale 1-10; people with average grades < 4.0 were not accepted into the police force)
- Chdn - number of children at the time of graduation
- ExEd - extra university-level or equivalent education (years)
- CR - criminal record (0 = No, 1 = Yes)
- Sex - sex of the cadet (0 = Male, 1 = Female)
- SecE - other experience in the security sector (0 = No, 1 = Yes)
- AvgE - average yearly evaluation score (the average of five years; the evaluation is performed by a committee of 10 senior police officers; scale 1-5)
- FinalE - final evaluation: "Fail" if the average yearly evaluation score (AvgE) < 2.0, otherwise "Pass"
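The rule behind FinalE can be written down directly. This is a minimal sketch that assumes only the 2.0 threshold stated in the list above; the function name `final_evaluation` is our own and is not part of Synapse.

```python
def final_evaluation(avg_e: float, threshold: float = 2.0) -> str:
    """Return "Fail" when the five-year average evaluation is below the threshold."""
    return "Fail" if avg_e < threshold else "Pass"


# A score exactly at the threshold counts as a pass, because the rule is "< 2.0".
print(final_evaluation(1.8))
print(final_evaluation(2.0))
```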
Requirement specification
We are looking at a classification problem here, so the requirements are best looked at in terms of the rate of false positives and false negatives. We want to decide if a person will fail as a police officer. So a false positive is if we incorrectly say that a person will fail. A false negative on the other hand is if we incorrectly predict that a person won't fail. From a technical point of view, the two kinds of error are equivalent. However, depending on the application of your solution, one may be much more important than the other.
In our case the original purpose of the Dutch experiment was to find a model that could help them decide on stricter rules for granting a commission to a cadet, or that could be used to single out individual officers who could use extra guidance and help to improve their job performance. In the former case, false positives are far worse than false negatives - accepting a potentially problematic officer is less serious than not accepting an officer who would do a good job. In the latter case it isn't as serious, but the principle holds.
Ok, then let's set some goals. Let's say that we want no more than 10% false negatives and no more than 5% false positives. So to be pleased with the model we want to be able to predict with 95% confidence that a cadet will pass and with 90% confidence that a cadet will fail. For comparison the best the Dutch model managed to do was about 75% for both pass and fail predictions - not good enough for practical use.
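To make the two error rates concrete, here is a small sketch that counts them from lists of actual and predicted labels and checks them against the goals above. It assumes that each rate is measured per actual class ("Fail" being the positive class). The function names and the sample labels are made up for illustration.

```python
def error_rates(actual, predicted):
    """Return (false_positive_rate, false_negative_rate), with "Fail" as the positive class.

    False positive: an officer who passed was predicted to fail.
    False negative: an officer who failed was predicted to pass.
    """
    for_passed = [p for a, p in zip(actual, predicted) if a == "Pass"]
    for_failed = [p for a, p in zip(actual, predicted) if a == "Fail"]
    fp_rate = for_passed.count("Fail") / len(for_passed)
    fn_rate = for_failed.count("Pass") / len(for_failed)
    return fp_rate, fn_rate


def meets_requirements(fp_rate, fn_rate, max_fp=0.05, max_fn=0.10):
    """Check the goals: at most 5% false positives and 10% false negatives."""
    return fp_rate <= max_fp and fn_rate <= max_fn


# Made-up labels: 20 officers who passed, 10 who failed, one mistake in each group.
actual = ["Pass"] * 20 + ["Fail"] * 10
predicted = ["Pass"] * 19 + ["Fail"] + ["Fail"] * 9 + ["Pass"]
fp, fn = error_rates(actual, predicted)
print(fp, fn, meets_requirements(fp, fn))
```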
In addition, as deployment will be covered in this tutorial, a requirement for this problem will be to write an application that uses our model. We want the user to be able to input the data for a cadet (age, average grade etc), press a button and get an answer if that person will be failing as a police officer five years in the future.
Ok then, let's get started.
Data mining
As always the first step in making an adaptive system is to get a better understanding of data through data mining. In the previous tutorial you learned how to import data into Data Units, how to apply filters to that data and how to use visualizers. In this tutorial we'll go deeper into data mining and see how we can draw qualitative conclusions about the problem before even thinking about building a model.
Start up Synapse and make sure you are in Preprocessing mode. From the input manager add a new Data Unit, using the CSV format. The file you want to load is located in Synapse\Sample Data\Police and called "police.txt". One thing worth noting here is how the data looks in raw form - specifically the last feature (FinalE). As you can see in the first display of the CSV format, that feature is in alphanumeric format - the data is written out explicitly as "Pass" and "Fail".
As Synapse deals with numeric data only, these strings have to be converted and it is done through a method called "Feature Expansion". On the next display you can see what it does:
As you see, the "FinalE" feature that contained the strings is gone. Instead we have two new features called "C_Fail" and "C_Pass". The conversion looks simply like this:
"Pass" => C_Pass=1 and C_Fail=0. In short, when the value "Pass" occurs, the "C_Pass" feature is set to 1 and the "C_Fail" feature is set to 0. The same principle applies for "Fail", which results in a "C_Pass" value of 0 and a "C_Fail" value of 1.
| FinalE | C_Fail | C_Pass |
|---|---|---|
| Pass | 0 | 1 |
| Fail | 1 | 0 |
Note: In newer versions of Synapse the new feature names will be "FinalE_Pass" and "FinalE_Fail", as the feature they are based on is called "FinalE".
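The conversion in the table is what is often called one-hot encoding. The sketch below reproduces it outside Synapse. The function name `expand_feature` and the alphabetical ordering of the new columns are assumptions made for this example, not a description of how Synapse implements Feature Expansion.

```python
def expand_feature(values, prefix="C_"):
    """Turn a list of category strings into one 0/1 column per category."""
    categories = sorted(set(values))  # assumed ordering: alphabetical
    names = [prefix + c for c in categories]
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return names, rows


names, rows = expand_feature(["Pass", "Fail", "Pass"])
print(names)
for row in rows:
    print(row)
```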
Drag your newly created data unit to the top empty visualizer tab and select the correlation visualizer. The correlation plot is a good starting point as it gives you an overview of the available variables and an idea of how correlated they are.
To start with something recognizable and obvious, set "Feature" to "Chdn" (number of children).
Not surprisingly we see a strong correlation with "Age" - people that have many children are generally older. The second largest positive correlation is with "ExEd" (extra years of education). And this is where the Romans would have said cum hoc ergo propter hoc.
"With this, therefore because of this" - the most common logical fallacy in data mining. Correlation does not necessarily imply causation. In this case a typical mistake would be to read from the data that extra education leads to more children. In reality the two variables are just connected through age. Older people tend to have more children than younger people and older people tend to have more extra education than younger people.
As we shall see later, it is very important to keep this in mind and not jump to conclusions when you see a correlation.
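The effect is easy to reproduce with synthetic data. In the sketch below, children and extra education are both generated from age plus independent noise, so neither influences the other. The numbers are invented for the illustration and have nothing to do with police.txt. The raw correlation between the two comes out clearly positive, while the partial correlation (the correlation that remains once age is accounted for) is close to zero.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def partial_corr(x, y, z):
    """Correlation between x and y after removing the linear effect of z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


rng = random.Random(0)  # fixed seed so the run is repeatable
age = [rng.uniform(20, 40) for _ in range(2000)]
chdn = [0.1 * (a - 20) + rng.gauss(0, 0.5) for a in age]  # depends on age only
exed = [0.2 * (a - 20) + rng.gauss(0, 1.0) for a in age]  # depends on age only

raw = pearson(chdn, exed)
controlled = partial_corr(chdn, exed, age)
print(f"raw correlation: {raw:.2f}, controlling for age: {controlled:.2f}")
```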
With that in mind, select "AvgE" from the "Feature" dropdown in the correlation visualizer. "AvgE" is the average of five years' worth of performance evaluations. This is the primary variable we will be looking at while data mining. The Pass/Fail variables are directly derived from this variable. While we won't be including it in the model, this variable is suitable to look at because it is far more detailed than the simple binary Pass/Fail. AvgE contains a continuous range of values between 1 and 5.
To see this for yourself, drag-drop the data unit to the other empty visualizer below the correlation visualizer. Select a histogram. From the "Feature" drop down select "AvgE".
You can see here that the values approximately follow a normal distribution, ranging from about 1 to 5. Now select the "C_Fail" feature. As you can see there are only two values here, 0 and 1. So "C_Fail" has a far lower resolution, which makes "AvgE" more useful. But why look at a variable we are not going to use in the model? Simply because it is more educational from a data mining point of view.
Now, back to our correlation plot. We can see a strong correlation between AvgE and AvgG (average grade at time of graduation). Let's take a closer look at that.
Before we proceed let us just define how we refer to the position of the visualizers as it otherwise will get a bit complicated when we start making more advanced operations.
We'll call the upper visualizer area A and the lower B. We'll enumerate the visualizers in each area 1,2,3. So our correlation visualizer is A1 and our histogram is B1.
Ok, close down the histogram B1. Drag-drop the Data Unit to B1 and select "Scatter". From the correlation plot (A1) we can see that it makes sense to look at AvgG vs AvgE first. Select "AvgG" in the "X" drop down and "AvgE" in the "Y" drop down.
As we can see there is a rather clear relation between those variables. Observe though that it is not quite linear. Let's see how it relates to the Pass/Fail ratio. We'll do that by adding another plot to the same visualizer - an AvgG vs Fail plot. In the "U" drop down select "AvgG" and in the "V" drop down select "C_Fail". Go to the "Options" menu of the visualizer and select "Use UV as X2Y". You should see something like this:
The red points in the plot are the "C_Fail" feature. It has a value of 1 for Fail and 0 otherwise. As we can see, nobody with a grade average of around 5.5 or below did well five years later, and everybody above around 6.5 passed the evaluation. This is how far the Dutch ministry of justice got. While it may not seem like much to you, it was very important to them. You see, in the Dutch educational system 5.5 is usually the limit for a failing grade. This requirement was lowered in the early 90's to 4.0 for police cadets due to a shortage of police officers. This shows that their original cut-off limit was a fairly good idea.
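What the plot tells us so far amounts to a very simple rule based on the average grade alone. The sketch below writes that rule down. The cut-offs 5.5 and 6.5 are the approximate values read from the plot, and the function name and the three labels are our own.

```python
def grade_band(avg_g, low=5.5, high=6.5):
    """Classify a cadet by average grade alone, using the two approximate cut-offs."""
    if avg_g <= low:
        return "likely fail"
    if avg_g > high:
        return "likely pass"
    return "uncertain"  # the mixed region, where more features are needed


for grade in (4.8, 6.0, 7.2):
    print(grade, grade_band(grade))
```

The "uncertain" band is exactly where a rule like this stops being useful and a model that looks at the other features becomes necessary.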
Now there are two areas of this curve we wish to study more closely. First of all we have the region between 5.5 and 6.5, where there is a mix of people failing and passing. This is highly relevant for our problem.
Select the selection tool in the scatter visualizer in B1:
By dragging a box, select the points with an average grade (AvgG) between 0 and 5.
Take a look at the Statistics pane in the lower left corner of Synapse. The Statistics pane shows a quick overview of max/min, mean and standard deviation values of the whole data unit and of selected points.
If we scroll through the variables we can see the following: the age of the selected group is somewhat lower than the average (23.7 vs 25.8), the average grade is significantly lower, the number of children is less than half the average (0.22 vs 0.52), they have on average four times less extra education, they have (surprisingly perhaps) fewer criminal records than the average, the percentage of women in the group is half the average, and so on.
So we can conclude that the typical member of that group is a guy in his early 20s with very bad grades, no extra education, no criminal record and no previous employment in the security sector.
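The comparison made in the Statistics pane can be sketched in a few lines: compute min, max, mean and standard deviation for one feature, once for the selected rows and once for all rows. The records below are made up for illustration and are not taken from police.txt. The function names are our own.

```python
from statistics import mean, stdev


def summarize(values):
    """Return min, max, mean and standard deviation of a list of numbers."""
    return {"min": min(values), "max": max(values),
            "mean": mean(values), "stdev": stdev(values)}


def compare(rows, predicate, feature):
    """Summarize one feature for the selected rows and for all rows."""
    selected = [r[feature] for r in rows if predicate(r)]
    everyone = [r[feature] for r in rows]
    return summarize(selected), summarize(everyone)


# Made-up records, for illustration only.
cadets = [
    {"Age": 22, "AvgG": 4.5},
    {"Age": 24, "AvgG": 4.9},
    {"Age": 27, "AvgG": 6.8},
    {"Age": 31, "AvgG": 7.4},
]
sel, overall = compare(cadets, lambda r: r["AvgG"] <= 5.0, "Age")
print("selection mean age:", sel["mean"], "overall mean age:", overall["mean"])
```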
Play around a bit and select other points and look at their statistics. For adding selections use the SHIFT key while dragging the selection box and for subtracting selections use the ALT key. When you are ready to move on, select the points between 5.5 and 6.5 - our area of interest.
Close the correlation visualizer (A1). Drag the Scatter (B1) tab to the empty plot in A1.
Select "Histogram". These two visualizers are now linked. The histogram is a child visualizer of the scatter plot and will only show the points you have selected in the scatter plot. You can in turn link the histogram visualizer to another new visualizer, and so on. This is an incredibly powerful tool for deep data exploration.
In the histogram visualizer select the "C_Fail" feature. As expected both Fail and Pass frequencies are pretty high.
Let's see if we can see why some of them failed while others did not. In the A1 histogram, using the select tool, select the bar located at 0:
Select the empty visualizer B2. Drag the Histogram A1 tab to B2 and select a Histogram visualizer. Now these two plots are linked and you are looking at the age distribution of the people within the 5.5-6.5 range that passed the evaluation. In the A1 histogram select the other column. Now you are looking at the people within the 5.5-6.5 group that did not pass the test.
Look at the other features in the B2 histogram. Can you see any notable patterns? Can you see for instance that the people who passed the evaluation are more likely to have more extra education and more likely (than proportional) to be women?
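Conceptually, linked visualizers are chained filters: each child only sees the rows selected in its parent. The sketch below mimics the chain used here (grade band first, then pass or fail) on made-up records. The helper names are hypothetical and the values are not from police.txt.

```python
def select(rows, predicate):
    """Return the rows that satisfy the predicate, like a selection in a visualizer."""
    return [r for r in rows if predicate(r)]


def mean_of(rows, feature):
    """Mean of one feature over a list of rows."""
    return sum(r[feature] for r in rows) / len(rows)


# Made-up records, for illustration only.
cadets = [
    {"AvgG": 5.8, "C_Fail": 0, "ExEd": 2.0},
    {"AvgG": 6.1, "C_Fail": 0, "ExEd": 1.0},
    {"AvgG": 6.0, "C_Fail": 1, "ExEd": 0.0},
    {"AvgG": 5.6, "C_Fail": 1, "ExEd": 0.5},
    {"AvgG": 7.9, "C_Fail": 0, "ExEd": 3.0},
]

band = select(cadets, lambda r: 5.5 <= r["AvgG"] <= 6.5)  # selection in the scatter plot
passed = select(band, lambda r: r["C_Fail"] == 0)         # histogram bar at 0
failed = select(band, lambda r: r["C_Fail"] == 1)         # the other bar

print("mean ExEd, passed:", mean_of(passed, "ExEd"))
print("mean ExEd, failed:", mean_of(failed, "ExEd"))
```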
Try connecting more plots and explore the data. A suggested area to explore is the three diverging lines at the high-grade end in the AvgG vs AvgE scatterplot.
After you are done with the data exploration, we can move on. As mentioned earlier the very informative AvgE feature won't be included in our model. We can't use it as an input as it is decided five years after the graduation of the cadets. As we are only interested in getting a Pass/Fail prediction, we don't want it as an output.
Remove that feature by using an Extract filter as we did in the previous tutorial.
Unlike in the last tutorial, we won't be using the extract filter to split off the output features to a second data unit. Instead we'll see how we can do that through the Data Source component.
Switch to design mode.
Design
From the component bar, drag the "Data Source" to the work area.
Select the Data Source by clicking on it. In the Settings browser select "Add Data" and select "CSV". This is one of the alternatives to drag-dropping from the Solution Explorer.
Select the Zoom tool from the toolbar:
Click on the Data Source component. The component now covers your visible work area. You can also see that the Data Source has a graphical user interface (GUI) of its own. All components support having a fully interactive GUI of their own.
Select the Component tool from the toolbar.
This tool allows us to interact with the component's GUI. Click on the "Features" tab. In the list select the "C_Fail" and "C_Pass" features. You can do it either by using the CTRL key for an additive selection or you can simply press the left mouse button over "C_Fail" and drag it downwards.
Check the "Remove" checkbox and press "Apply". These features have now been removed from usage in this Data Source. Note that they have not been removed from the Data Unit in question. This will be our input source.
Select the "Select Tool" from the toolbar.
Press the BACKSPACE key to return to the previous zoom level. Click on the component to select it. Copy it to the clipboard by pressing CTRL+C. Move your mouse a bit below it and press CTRL+V to paste the copy. Click on the new Data Source component to select it.
In the property browser, scroll down to the "Selected" category. Here you can see that all the features are set to "True", except the two we removed with the Data Source GUI.
As this second source is supposed to become our output source, we want exactly the opposite situation. Scroll up again to the top of the settings list. Under "Feature Op" click on the "Invert" setting and press the ".." button.
Do the same to "Apply". Scroll down again to the "Selected" category. As you can see we achieved what we wanted by inverting the feature selection and by applying our change.
This way the features used by the Data Source can be controlled through its settings list.
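The remove-then-invert trick can be pictured as operations on a table of True/False flags, one per feature. The sketch below is only a model of that idea. The function names are hypothetical, and the feature list assumes that AvgE has already been removed by the Extract filter.

```python
# Features left in the Data Unit (AvgE assumed removed by the Extract filter).
FEATURES = ["Age", "AvgG", "Chdn", "ExEd", "CR", "Sex", "SecE", "C_Fail", "C_Pass"]


def remove_features(selected, names):
    """Switch off the named features, leaving the others as they were."""
    return {f: (on and f not in names) for f, on in selected.items()}


def invert(selected):
    """Flip every flag: selected features become unselected and vice versa."""
    return {f: not on for f, on in selected.items()}


all_on = {f: True for f in FEATURES}
input_selection = remove_features(all_on, {"C_Fail", "C_Pass"})  # first Data Source
output_selection = invert(input_selection)                       # the pasted copy

print([f for f, on in input_selection.items() if on])
print([f for f, on in output_selection.items() if on])
```

Because the second source is derived by inversion, the input and output feature sets cannot overlap or leave a feature out by accident.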
Now, by right clicking on the work area next to the input data source, insert an "MLP One Layer" snippet (you'll find it under "Static"). It is one of the simpler topologies, but it should provide enough power for our problem. Feel free to rearrange the components by using the "Select" tool. Dragging a component moves it.
This may be a good time to draw your attention to the Error List located along the bottom of Synapse.
Here Synapse provides feedback, telling you what you are doing wrong. These messages are categorized by severity, ranging from "Low" to "Severe". By clicking on the filter buttons, you can select which messages you wish to see. In our case Synapse is telling us that there are unconnected components in the system and that these will be ignored. That is because the snippet we placed isn't connected to any sources.
Using the Link tool in the toolbar connect our input source to the first weight layer.
Ok, that got rid of that warning, but we got two new ones: "Topology not valid" and "Loose Receiver". This is the control system's way of saying that it is not pleased with the situation. In this case it objects to the Delta Terminator not having connections on both its ports. So let's make it happy by connecting our output source to the Delta Terminator's blue port.
You should have something like this now:
One final thing to do before we move on to training. Select the last function layer ("Function Layer 4", the one that is connected to the Delta Terminator). In the Settings Browser, change the "Name" setting from "Function Layer 4" to "Output".
Training
Now go to Training mode. If you don't see anything or your components are in strange positions, click once on the work area and press CTRL+F to frame all the components. Design and Training have independent zoom modes to allow more flexible viewing of the system. So when you zoom in Design, it doesn't zoom in Training.
Select the Delta Terminator by clicking on it. In its settings list, now shown in the property browser, select "Confusion_Matrix" from the "Visualization" drop-down list (found under the "Display" category). I will explain what this is in a minute.
Before that however, go to the Control System GUI and write 20 under "Max Epochs", and press enter. Write 100 under "Batch Length" and press enter.
Press the "Play" button on the training toolbar. Now take a closer look at the Delta Terminator. You should see something like this (numbers may differ):
This is called a confusion matrix and is used to see how well a system does classification. In a perfect classification, you'll have 100% on the diagonal going from top left to bottom right.
In the image above we can see from the first row that "C_Pass" is correctly classified as "C_Pass" 77.6% of the time and misclassified as "C_Fail" 22.4% of the time. From the second row we can see that "C_Fail" is misclassified as "C_Pass" 24.14% of the time and correctly classified as "C_Fail" 75.86% of the time. Each row adds up to 100%. The confusion matrix shows validation data - i.e. data that hasn't been included in the training.
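A confusion matrix with percentages per row is simple to compute by hand. The sketch below builds one from lists of actual and predicted class names. It illustrates the concept and is not Synapse's implementation. The labels are made up, and the code assumes every class occurs at least once among the actual labels.

```python
def confusion_matrix(actual, predicted, classes=("C_Pass", "C_Fail")):
    """Return {actual_class: {predicted_class: percent}}, one row per actual class."""
    counts = {a: {p: 0 for p in classes} for a in classes}
    for a, p in zip(actual, predicted):
        counts[a][p] += 1
    percent = {}
    for a in classes:
        total = sum(counts[a].values())  # number of samples that really are class a
        percent[a] = {p: 100.0 * counts[a][p] / total for p in classes}
    return percent


# Made-up validation labels: four real passes and four real fails.
actual = ["C_Pass"] * 4 + ["C_Fail"] * 4
predicted = ["C_Pass"] * 3 + ["C_Fail"] + ["C_Fail"] * 3 + ["C_Pass"]

matrix = confusion_matrix(actual, predicted)
for name, row in matrix.items():
    print(name, row)
```

Because each row is normalized by the number of samples in that actual class, every row sums to 100% and the off-diagonal entries are the two misclassification rates.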
Let's do a proper training run now. Set the "Max Epochs" to 2000 and press "Play". Watch the system adapt, and you should have something like this at the end of those 2000 epochs:
Remember our requirements? We wanted no more than 10% false negatives and no more than 5% false positives. What we actually got is quite a bit better: Only 3.45% false negatives and as little as 0.83% false positives. And I'm sure you would agree that's a bit better than the 25% error levels the original study had.
We can with 96.55% confidence say that a graduating cadet will be failing at his job in five years and we can with 99.17% confidence say that a graduating cadet will be performing an adequate job five years in the future. How's that for predictability?
Deployment
While our model may be great it is little more than an academic exercise as long as it is locked up inside Synapse. If we want people to be able to use the model, we have to get it out in some usable form. This is where the "Deployment" post-processing component comes in. It allows us to deploy the systems we make in Synapse to regular .NET components. Not only that, but the components are reduced to a minimal functional subset optimized for performance.
Note: For this stage to be meaningful you need Microsoft Visual Studio 2005 or a similar .NET 2 compatible development environment. Earlier versions will not work as they are not compatible with the 2.0 version of the framework. If you don't have Visual Studio, you can still read this section to get an understanding of how Synapse deployment works, and why it is one of the more powerful features of Synapse. A free version of Visual Studio 2005 called "Express" can be downloaded from Microsoft.
Go to Post Processing mode, and select "Deployment".
In the "System Name" box write "GoodCopBadCop".
In the "Deployment Path" you can see (and change if you wish) where the system will be deployed.
There are a few interesting things to observe here. Under "Data Unit Selection" you can see and select the Data Unit(s) that will be deployed. The UDF checkbox indicates if a special deployment input format is to be used. If you remember our input format in this case was the CSV file format. While it is very suitable to load data from a text file when you are to train a system, it may not be as suitable in a deployed system. Instead we are using the default Matrix input format. It allows us to directly input data (in matrix form) to the Data Unit. That way we don't need a text file.
Under "Shortcuts" you can select which components will be tagged as outputs and visible in the component root. For instance you can see that our "Output" function layer has been detected as a default output and selected. Once deployed, it will be available as a property, GoodCopBadCop.Output_Port0.
Press the "Deploy" button. Once it is done deploying, you'll see a pop-up box. Now you can save the solution and exit Synapse.
Start Visual Studio 2005. Open the Synapse\Sample Data\Police\PoliceEval\PoliceEval.sln solution. To save time, we've built the GUI of our application.
Start the PoliceEval project (F5). This is the basic GUI:
As it is, it has no relevant function. When the "Evaluate" button is pressed the parameters set in the various text boxes and drop down lists are parsed and sent to a FailEvaluation function. If the function returns true, "Fail" is highlighted, otherwise "Pass" is highlighted. Right now, it looks like this:
    private bool FailEvaluation(int age, double avgg, int exed, int chdn, ...)
    {
        // Placeholder: returns a random answer
        return (new Random().NextDouble() < 0.5 ? true : false);
    }
As you can see, it does nothing but return a random value. We want to replace this with our system that can actually respond with something meaningful.
Our first task is to import our deployed system into Visual Studio. In the solution explorer, under the PoliceEval project, right-click on "References" and select "Add Reference...". Select the "Browse" tab and navigate to the GoodCopBadCop directory to which we deployed the system. Select the GoodCopBadCop.dll and press OK:
Our system has now been imported. Let's start editing the code. Right click on the "Police.cs" file and select "View Code".
The first thing to do is to add two "using" directives. The first one is to select the deployed system and the second one is to enable us to use the Matrix class. Add the two following using directives:
    using Peltarion.Deployed;
    using Peltarion.Maths;
Now, let's create an instance of our system. Add this to the code:
    public partial class Police : Form
    {
        GoodCopBadCop sys = new GoodCopBadCop(); // <--- Add

        public Police()
        {
            InitializeComponent();
            InitializeFields();
        }
        // ...
    }
Ok, so now we have an instance called sys. Let's use it. In the bool FailEvaluation(...) method, remove the existing return command:
    private bool FailEvaluation(int age,…)
    {
        return (new Random().NextDouble() < 0.5 ? true : false); // <--- Remove
    }
Recall our data unit, called "CSV". It had 10 features (Age, AvgG, Chdn, ExEd, CR, Sex, SecE, AvgE, C_Fail and C_Pass). So this is the data that our deployed Data Unit is expecting as well.
We'll use the Set_CSV function, which allows us to enter one sample. The more general method is to assign a matrix to sys.Input_CSV, but this one is more practical, as we can get help from Visual Studio's IntelliSense to fill it out.
FailEvaluation(..) provides us with the values we need to pass to the function. Assigning the data is as simple as this:
    private bool FailEvaluation(int age,…)
    {
        sys.Set_CSV(age, avgg, chdn, exed, cr, sex, sece, 0, 0, 0);
    }
Visual Studio will show you which fields to fill out by showing you a tooltip:
What happened to the three last values? Why are they zero? Note which three values are zero: AvgE, C_Fail and C_Pass. If you recall AvgE gets removed by an extract filter, so it doesn't matter what we put in there. And C_Fail and C_Pass are desired outputs, and are not used as inputs to the system. Remember how we removed them from the input data source? Hence we can leave them as zero. As they are not used by any input data sources their values are irrelevant. The Data Unit however still must get its full range of features for it to work.
Now that we've assigned data, we can run the system. We only want to run one pass, to get an output for the sample we've put in, so we'll just step the system one epoch:
    private bool FailEvaluation(int age,…)
    {
        sys.Set_CSV(age, avgg, chdn, exed, cr, sex, sece, 0, 0, 0);
        sys.StepEpoch(); // Step one epoch
    }
Now we only have to retrieve the answer and use it to let the FailEvaluation(..) function return a value. If you remember, our output component was called "Output", so the data on its first (and in this case only) port is called "Output_Port0". The output had two features, "C_Pass" and "C_Fail". Our return function is quite simple, if the value of fail is larger than the value of pass, return true:
    private bool FailEvaluation(int age,…)
    {
        sys.Set_CSV(age, avgg, chdn, exed, cr, sex, sece, 0, 0, 0);
        sys.StepEpoch();               // Step one epoch
        Matrix o = sys.Output_Port0;   // Get the output
        return o[0] < o[1];            // Return true if pass < fail
    }
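Stripped of the Synapse-specific calls, the method does two things: it pads the sample to the full ten-feature layout, and it compares the two output values. The sketch below shows just that logic in isolation. It does not call the deployed system. The function names are hypothetical, and the output order (pass first, then fail) is assumed from the comment in the snippet above.

```python
FEATURES = ["Age", "AvgG", "Chdn", "ExEd", "CR", "Sex", "SecE", "AvgE", "C_Fail", "C_Pass"]


def build_sample(age, avgg, chdn, exed, cr, sex, sece):
    """Return a full 10-feature row; the three unused features are zero placeholders."""
    return [age, avgg, chdn, exed, cr, sex, sece, 0, 0, 0]


def will_fail(output):
    """Decide on 'fail' when the fail value is larger than the pass value.

    Assumes the output is ordered (pass, fail), as in the snippet's comment.
    """
    pass_value, fail_value = output
    return pass_value < fail_value


sample = build_sample(age=23, avgg=5.1, chdn=0, exed=0, cr=0, sex=0, sece=0)
print(len(sample) == len(FEATURES))
print(will_fail((0.2, 0.9)))
```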
That's it, we're done. You can now build and run the program. The system we trained is now fully operational within the PoliceEval app.