We don’t need no optimization

When we are working with data – fitting models, building algorithms – it is very easy to lose sight of what we are really doing: trying to explore the geometry of the data. In fact, the geometry of the data points in space is what is fundamental – everything else is secondary. The assumptions involved in all our statistical and machine learning models (the same assumptions that we so often fail to verify – yes, it's YOU I'm looking at!) are really assumptions about this point geometry in space.

If these assumptions are valid – that is to say, if our data are indeed configured in space as we imagined them to be – then the cut-and-dried formulas and algorithms that we are familiar with, be it logistic regression or nearest neighbors, will go through.

An illustration

To illustrate, let us consider the Swedish banknote dataset. This is a small dataset containing 6 measurements on each of 200 Swedish banknotes, 100 of which are genuine and 100 fake. We have a very simple task before us – to predict which is which.

Our technique of choice here is linear discriminant analysis (LDA). This is a very simple way to build a classifier, closely related to logistic regression, that tries to find optimal separating planes between the classes. We get one fewer discriminant direction than there are classes – so in this case, with only 2 classes, we shall obtain 1 separating plane (a line, really).

Below, we load the dataset into R and fit an LDA model to it. And indeed, the separating vector is found rather easily.

## Data from:
## http://search.r-project.org/library/gclus/html/bank.html
## LDA technique explained at:
## http://stats.stackexchange.com/questions/95247/logistic-regression-vs-lda-as-two-class-classifiers
require(MASS)  # provides lda()
bank <- read.csv('/dos/Msc/datasets/banknote.csv')
head(bank)
bank.lda <- lda(Y ~ Length + Left + Right + Top + Bottom + Diagonal, data = bank)
bank.lda$scaling
##                   LD1
## Length   -0.005011113
## Left     -0.832432523
## Right     0.848993093
## Top       1.178884468
## Bottom    1.117335597
## Diagonal -1.556520967
plot(bank.lda)
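
As an aside, for a two-class problem the LD1 direction can be reproduced by hand: it is proportional to the inverse of the pooled within-class covariance matrix times the difference of the two class mean vectors. Here is a minimal sketch, assuming Y takes exactly two values; the result should match bank.lda$scaling up to scale and sign.

X   <- bank[, c("Length", "Left", "Right", "Top", "Bottom", "Diagonal")]
grp <- split(X, bank$Y)                      # one data frame per class
mu  <- sapply(grp, colMeans)                 # 6 x 2 matrix of class means
Sw  <- Reduce(`+`, lapply(grp, function(g) (nrow(g) - 1) * cov(g))) /
       (nrow(X) - length(grp))               # pooled within-class covariance
w   <- solve(Sw, mu[, 2] - mu[, 1])          # discriminant direction
w / sqrt(sum(w^2))                           # unit vector; compare with bank.lda$scaling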

We can draw up a pretty plot from the fitted model as well.

[Figure: histograms of the LD1 scores, one per class]
The 200 points are projected onto a line that is optimal for class separation. Since our data are now 1-dimensional, histograms can be used for visualization. We can see how the range splits cleanly between the two classes, showing the model is a good fit.
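
For anyone who wants to redraw this plot by hand, here is a minimal sketch: we extract each note's LD1 score with predict() and overlay one histogram per class. The colours, and the assumption that Y == 1 marks the fake notes, are mine, not from the original code.

ld1 <- predict(bank.lda)$x[, 1]             # each note's score on LD1
hist(ld1[bank$Y == 0], breaks = 20, col = rgb(0.5, 0, 0.5, 0.5),
     xlim = range(ld1), main = "LD1 scores by class", xlab = "LD1")
hist(ld1[bank$Y == 1], breaks = 20, col = rgb(1, 1, 0, 0.5), add = TRUE)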

Is this really all there is to this data?

We did get a very neat and succinct solution to our problem of separating the fake notes from the genuine ones. But have we let our data speak, freely, openly, willingly? Or have we strapped it to a chair, and extracted a confession out of it?

All we were interested in was trying to separate out the classes. But is this also what the data wants to talk about? Maybe the data wants to tell us more, maybe it has other stories – have we tried to explore them?

Whenever our data modelling algorithms optimize, what they are really doing is extracting a confession. To let the data speak, we must keep an open mind about it – explore the data's geometry itself – and see what comes up. One never knows what the data wants to say, but if we never try to listen, we shall never know!

A more free-flowing exploration

GGobi is one piece of software that allows us to do this kind of undirected exploration. Here is a movie I made earlier today, in which one can see the data geometry itself being explored.

What we have for the first 10 seconds is a static plot. At that point, I switch the scatterplot to a tour – a kind of dynamic graphic that moves fluidly in space to show the data points. Although the computer screen is naturally a two-dimensional object, the motion of the points is intended to give an impression of high-dimensional visualization – which is indeed what we need, since our points really lie in 6-D space.

[Unfortunately, most of that fluidity of motion that I can see on my screen is quite lost in the video – I'm still investigating why that happens, but I suspect it's because the video encoder does not see sufficient difference between successive frames and tends to "compress" them all together. One gets an impression quite like stop-motion animation from the video.]
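
If you would like to run a tour yourself rather than watch the video, the tourr R package implements the same grand tour idea in plain R. A minimal sketch, assuming the bank data frame loaded above:

library(tourr)   # install.packages("tourr") if needed
## A grand tour of the six measurement columns – a rough stand-in for
## the GGobi tour shown in the video.
animate_xy(bank[, c("Length", "Left", "Right", "Top", "Bottom", "Diagonal")])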

We can see several point configurations in the video. It is all really the same 6-D point cloud – photographed, as it were, from different angles.

At about the 00:20 mark, we see for the first time a sort of split in the cloud. The shape is similar to a butterfly, its two wings separated by a narrow gap in the middle. This view becomes even more acute at the 00:26 mark, when the two wings split almost totally. (Again, on my screen these are fluid movements, but this is lost in the video.)

I wanted to see if this split really is the class separation we are looking for – namely, between genuine and fake notes. To see if this is true, I brushed all the fake notes yellow. This is possible because this is training data – we have the labels available. And indeed, the left wing of the butterfly turns out to correspond entirely to fake notes!
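
Brushing is an interactive operation in GGobi; in tourr the closest stand-in is to colour the points by their label up front. A sketch, again assuming Y == 1 marks the fake notes:

library(tourr)
## Colour fake notes yellow and genuine notes purple, then tour as before.
cols <- ifelse(bank$Y == 1, "yellow", "purple")
animate_xy(bank[, c("Length", "Left", "Right", "Top", "Bottom", "Diagonal")],
           col = cols)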

And we came this far without any optimization at all! In fact, the classes in the data were visible to us even without the brushing. This is the power of visual exploration.

From 00:38 onward, I started exploring again, letting the point cloud drift and move where it wanted to.

At the 00:43 mark, the cloud "flips" – we are now seeing the butterfly from below; the right and left wings are swapped, but the structure is essentially the same. Such "flips" are often indicative of the strength of the class structure in the point geometry. It shows that the class separation is indeed a fundamental feature of the point geometry of this dataset, and not something we have forced out of it. The more point configurations we observe with the split between classes, the surer we can be that we have indeed chanced upon something real, rather than a mirage (often brought about by bad model design).

At 00:47, something even more interesting happens. The clear split between the genuine notes (purple points) and fake notes (yellow points) is lost, but a small set of points has split off entirely from the rest of the cloud. The structure is rather persistent, and can be seen up to 00:52.

Curious, I pause the cloud movement and activate the brush at 00:57 to color these points. Although I do it fluidly on my screen, this is not captured in the video, and we see the small bunch of points colored orange, all at once, from 1:04 to 1:10.

From 1:10 onward, I resume the movement of the point cloud. We can see the structure persist right up to 1:17, when it starts to dissolve. This often happens – it shows that this is a "weaker" structure in the data than the genuine/fake split. But I'm curious, and I decide to meddle a bit – I open up the Projection Pursuit window (you can see parts of it from 1:17–1:20) and try to "encourage" the structure of the orange points.
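
Projection pursuit optimizes an index function over projections, nudging the tour towards "interesting" views instead of letting it drift. In tourr the same idea is available as a guided tour. A sketch using the "holes" index, which rewards projections with empty space near the centre – my guess at a reasonable index for encouraging a split, not necessarily the one used in the video:

library(tourr)
## A guided (projection pursuit) tour: the tour now climbs towards
## projections that score well on the chosen index, here holes().
animate_xy(bank[, c("Length", "Left", "Right", "Top", "Bottom", "Diagonal")],
           tour_path = guided_tour(holes()))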

Is it too late? At 1:21, we see the structure already lost, and another butterfly taking shape. But hey, our little encouragement works, and the orange points split off entirely at 1:26. Yay!

And what we see now is something quite different from before – three distinct clusters: genuine notes, and two separate clusters of fake notes. Are there two sets of fraudsters in our data? Visualization suggests so! The structure moves a bit at 1:30, but persists. This seems to be an important secondary structure in our data. Not in a million years would we have guessed at this from our cut-and-dried models and algorithms.
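
A quick, non-visual sanity check one could run here (a sketch of mine, not something from the video): cluster only the notes labelled fake and see whether two well-populated groups emerge. Again this assumes Y == 1 marks the fake notes.

fake <- scale(bank[bank$Y == 1, c("Length", "Left", "Right", "Top", "Bottom", "Diagonal")])
set.seed(1)                                  # k-means uses random starts
km <- kmeans(fake, centers = 2, nstart = 25)
table(km$cluster)                            # two sizeable clusters would support the 'two fraudsters' reading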

Visualize your data!

[The author is a trained statistician, with independent work in the areas of projection pursuit and high-dimensional classification and machine learning. You can follow him on Twitter @mdayal1789]
