Data Science - Thought Process

How to go about modeling data for analysis.

One of the problems I faced when I took part in my first Data Science competition was how to go about the whole process, and I am sure many others have faced something similar. The theoretical knowledge that all of us try to accumulate before taking part in such a competition does not help much if you don't know how to apply it in practice. Anyone with experience in building data analysis pipelines will agree that much of the decision process is heuristic in nature, i.e. the decisions taken while building the model depend on the data at hand. There is no user manual one could follow to get optimal results. This is exactly where experience comes into play. There is no substitute for it, and it can only be gained by constantly trying things out.

This post is about a thought process that I find really helpful in tackling a lot of the decisions that need to be made. The process is really simple: think about how you, as a human, would go about performing the task you want the machine to do.

Let me pose a problem: suppose you have to design a machine learning system for a library that predicts whether a given person who is borrowing a book will return it or not. Before jumping into which initial analysis methods or what kind of machine learning model to use, let's first take a step back. Imagine that instead of building a machine learning application, you were given the job of doing this manually. How would you do this as a person? How would you decide whether or not lending a book to a person is safe? Some of the things that I would probably look at before lending a person a book are:

  • Does the person come to the library often: If yes, chances are that they will return the book, as they are a regular patron of the library
  • Is the person married: In general, people with families will be less inclined to steal a book

Most people will likely think of a lot more than these two when deciding whether or not to lend the book. However, that's not the point. The important thing to note here is that what we did just now is feature engineering. Instead of jumping directly into the problem, trying out multiple methods, and probably caging our thought process within particular techniques, we thought about the problem beforehand and came up with a few features that are likely to be important.
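To make this concrete, here is a minimal sketch of how the two questions above might be turned into inputs for a model. Everything specific in it is an assumption made for illustration: the Borrower record, its field names, the 90-day window and the threshold of 6 visits are all hypothetical, and a real library's data would dictate different choices.

```python
from dataclasses import dataclass


@dataclass
class Borrower:
    # Hypothetical fields; a real library system would have its own schema.
    visits_last_90_days: int
    is_married: bool


def make_features(borrower: Borrower, frequent_threshold: int = 6) -> dict:
    """Turn the two human heuristics into numeric features.

    frequent_threshold is an arbitrary cut-off chosen for illustration.
    """
    return {
        # "Does the person come to the library often?"
        "is_frequent_visitor": int(borrower.visits_last_90_days >= frequent_threshold),
        # "Is the person married?"
        "is_married": int(borrower.is_married),
    }


features = make_features(Borrower(visits_last_90_days=10, is_married=False))
print(features)
```

The resulting dictionary is the kind of row you could hand to whatever model you later choose. The point is that the features came from thinking about the problem, not from the model.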

It is important to keep in mind that our brains are still the gold standard that most machine learning systems try to replicate. Hence it is to our own benefit to use them to the fullest instead of mindlessly applying methods and formulas in the hope of getting a good result.