Three years ago, on a usual evening, when I was doing nothing as usual, an unusual thought struck my usual mind. It was an idea of mapping any kind of world data into an m x n matrix. A naive thought, I know; but it had been only 17 seconds since the flash of this idea. Allow me.
It goes this way: there were attributes (columns), and those attributes had values for various entities (rows). It looked possible. But that wasn't the unusual thing. The crux was that I felt patterns could be logically decoded from data, just the way our eyes can search, see and understand. And starting with the m x n model felt easy and empowering.
My idea completely revolved around finding patterns, patterns and more patterns; anywhere and everywhere. I went further and decided that a computer program should be able to do that for me. It would not only do basic statistics (which I guessed most software was already doing) but also, somehow, think about what to measure based on the data. This was the key.
So, I was on to building an insight-spitting machine. I named a folder on my desktop - "Insight spitter". Give it data that fits in m x n and it will spit information in your face. I was attracted to her. I wanted to know her more.
Late night talks
Not so surprisingly, I started thinking about what kinds of data points I would come across. I figured we would mainly have to tackle two types:
Quantitative data points - e.g. temperature, the exponential moving average of a stock, goals scored, the number of girls you have dated. Basically, the ones where terms like average, mean and median make sense.
Qualitative data points - these mainly represent categories into which an observation might fall, like who voted for whom, or the credit rating of stocks. They don't necessarily map to numbers that can be added or multiplied and still make sense.
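That split is easy to automate. Here is a minimal sketch (my own toy data and a hypothetical helper, not the original Insight spitter code) that sorts a table's columns into the two buckets by dtype:

```python
import pandas as pd

def classify_columns(df):
    """Split columns into quantitative and qualitative by dtype."""
    quantitative, qualitative = [], []
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            quantitative.append(col)   # averages, medians etc. make sense
        else:
            qualitative.append(col)    # categories: compare, don't add
    return quantitative, qualitative

df = pd.DataFrame({
    "temperature": [21.5, 19.0, 23.1],    # quantitative
    "credit_rating": ["AA", "B", "AAA"],  # qualitative
})
quant, qual = classify_columns(df)
# quant == ["temperature"], qual == ["credit_rating"]
```

In practice you would also want to catch numeric-looking categories (zip codes, ID columns), but dtype is a reasonable first pass.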
An end-to-end program that, given the data and its types, would tell you all kinds of things about it.
What kind of things?
The trivial - maximum, minimum, averages, some regression fits and a few more statistical metrics.
Not so trivial - correlation (not really the statistical correlation, but some logical measure I called correlation) between all pairs of rows and all pairs of columns: mC2 row pairs and nC2 column pairs, showing similarity between two entities and significance between two columns. Depending on a column's correlation measures w.r.t. the other columns, I would apply some logical weights to predict a particular cell.
Time fields - so if your data had timestamps, a series of helper functions in my code would automatically add some more pseudo-columns (qualitative ones) like the day of the week, the month of the year, the day number, etc., and correlate based on those too.
Outliers - find the things that don't fit the patterns, exclude them somehow, and then recalculate.
Difference columns - basically, find the change in data points down a column and correlate between those differences.
Some more logical basic ideas
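Several of these ideas - time pseudo-columns, difference columns, and the pairwise column correlations - can be sketched in a few lines of pandas. This is toy data with column names I made up, not the original program:

```python
import numpy as np
import pandas as pd

# A toy m x n frame with a timestamp column.
dates = pd.date_range("2021-01-01", periods=10, freq="D")
df = pd.DataFrame({
    "ts": dates,
    "price": np.linspace(100, 109, 10),  # rises by 1.0 per day
    "volume": np.linspace(50, 5, 10),    # falls linearly
})

# Time fields: derive qualitative pseudo-columns from the timestamp.
df["day_of_week"] = df["ts"].dt.day_name()
df["month"] = df["ts"].dt.month

# Difference columns: change in the data points down each column.
df["price_diff"] = df["price"].diff()
df["volume_diff"] = df["volume"].diff()

# Pairwise correlations over the quantitative columns:
# nC2 pairs for n numeric columns.
numeric = df.select_dtypes("number")
corr = numeric.corr()  # Pearson by default
# price and volume move in exactly opposite directions in this toy
# data, so corr.loc["price", "volume"] is -1.
```

Swap `.corr()` for Spearman or a custom pairwise measure and you have the "logical correlation" machinery; the row-pair (mC2) side works the same way on the transposed frame.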
I had started liking her.
There should be a pattern as to why a particular stock goes up in a particular season, or why, when one stock goes up, some other particular stock goes up or down. I even tried this code on some 1,500 stocks over one year of data, after learning about different correlation measures and removing outliers wherever necessary. Of course, it failed miserably. But that's not the point.
Before I went ahead with all of my plans of becoming a billionaire, I happened to talk to two excellent computer science friends, both working for a big quant firm. Both, in their own ways, advised me to look into what's called machine learning. Everything I had described (read: invented) was already done, and done very well. These were very sophisticated, deeply researched fields of data science (much, much more so than I had imagined, I can say this now). Call me ignorant. It's like inventing the law of gravitation while being thrown off a cliff, and then being told about the existence of Newton's law while counting your extra bones in a nearby hospital.
A new mature relationship
I have been doing machine learning (on and off) for more than a year now. The difference between then and now, if you ask, is not much. The logic has remained the same, pretty much. I have just added some real weapons to my arsenal. I know how it works. And the thought that you can just apply a machine learning algorithm one night and make millions from trading stocks is exactly how it doesn't work. It's not rocket science and it's not black magic. Under the covers it's all mathematics - linear algebra, differential calculus, probability and statistics, etc. - and of course programming. It's hard work, more than anything else.
Through this series of blog posts, I will try to write about data science, about tools, about algorithms, about various new exotic things I might have learned, about some of my attempts with real code on the plate. This is not a place to learn machine learning. It’s a log of a guy having adventures with machine learning.
Our world is not deterministic (to the best of my knowledge), or at least it doesn't appear to be with the tools we currently have.
There is chaos.
Machine learning will not predict the future. Ever. But, approximations are good, too.