Simple is good

When I first started learning data science and machine learning, I had a couple of goals in mind:

Learn all these fancy machine learning techniques
Win big bucks in Kaggle competitions by using those techniques

Naturally, with such goals in mind, I sought out the fastest way to accomplish all of this. The hacker inside me knew that there was a shortcut to it all, and that reading up on basic techniques was not worth the time as they were never used in real-world scenarios, or so I thought. After reading a lot of blogs on the cutting-edge models used by Kaggle competition winners, I decided to enter a competition and try out my newly learned skills :D. After creating a lot of parametric and non-parametric models and ensembling them (yup, in my very first competition I was already ensembling models), I decided to upload my solution. Before uploading, I saw that quite a few people had scored lower than even the sample submission provided by the competition master, and I thought to myself: how badly can you suck at this if the sample is better than your solution? So, laughing at these people, I uploaded my solution, and after seeing the results my laughter turned into bewilderment. I was at the bottom of the leaderboard, lower than any of the people I had just been laughing at.

My brain went into complete denial: this can’t be true, I applied the methods used by competition winners, so how can I be at the bottom? There must have been a mistake; I must have uploaded the wrong file. After re-uploading the files and trying to change the parameters of the models, it slowly dawned on me that my solution sucked. So I started looking at the discussion forum and the kernels that other people had created to try and understand what went wrong. I found a kernel in which a guy discussed his approach, and it was fairly simple: he used only one of the techniques I had used, and yet his solution was a hundred times better than mine. So what was the difference? He applied a small transformation to his data to handle highly skewed variables, and that was it. This one simple thing made his solution perform far better than mine.

It was then that I realized that fancy, shiny methods are nice, but they cannot work if the dataset does not suit them. And the best way to understand a dataset, and which methods can be applied to it, is to try out the simple things first.

tldr: Tried fancy stuff without understanding the basics.
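
The kind of fix described in that kernel can be surprisingly small. The story above does not say which transformation was used, so the sketch below assumes a log transform (log1p), a common choice for right-skewed values. The synthetic log-normal data and the skewness helper are made up purely for illustration.

```python
import math
import random
import statistics


def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return sum(((v - mean) / std) ** 3 for v in values) / len(values)


random.seed(0)
# Hypothetical right-skewed feature (assumption): log-normal values,
# similar in shape to prices or revenues
raw = [random.lognormvariate(0, 1) for _ in range(10_000)]
# log1p computes log(1 + x), so it stays defined when x is 0
transformed = [math.log1p(v) for v in raw]

print(f"skewness before: {skewness(raw):.2f}")
print(f"skewness after:  {skewness(transformed):.2f}")
```

A skewness close to zero means the values are roughly symmetric around their mean, which is what many models handle best.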

Now let’s see how one of the easiest methods, term frequency, can reveal interesting information about a dataset.

For this demonstration I will be analyzing the plot keywords that people assigned to movies on IMDB. The data set can be found here. The following libraries are used:

pandas: For basic data analysis
matplotlib: For visualization
wordcloud: For generating the word cloud

Let's get started. First, let's read the data into a dataframe and keep only the entries with more than 100 votes to reduce noise, as entries with very few votes are not reliable.

import pandas as pd
df = pd.read_csv('data/movie_metadata.csv')
# Keep movies with more than 100 votes and only the columns needed here
df = df[df['num_voted_users'] > 100][['plot_keywords', 'gross', 'budget', 'duration', 'imdb_score', 'movie_title']]

As the various plot keywords for a movie are stored as a single string delimited by '|', we need to split the plot_keywords column on '|' and then create a new row in the table for each keyword.

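# Split each '|'-delimited string into one column per keyword, then stack
# the columns into a single series with one keyword per row.
# dropna() discards the empty cells left by movies with fewer keywords.
tags = df['plot_keywords'].str.split('|', expand=True).stack().dropna()
# Drop the inner index level added by stack() so the index matches df again
tags.index = tags.index.droplevel(-1)
tags.name = 'tags'
# Index-based join: each movie row is repeated once per keyword
df = df.join(tags)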
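
To see what this reshaping does, it can help to run the same steps on a tiny table. The sketch below uses a made-up two-movie dataframe (the titles and keywords are hypothetical, not from the IMDB data) and the expand=True form of str.split.

```python
import pandas as pd

# Hypothetical two-movie table in the same shape as the IMDB data
toy = pd.DataFrame({
    'movie_title': ['Movie A', 'Movie B'],
    'plot_keywords': ['love|murder', 'police'],
})

# One column per keyword position, then one keyword per row;
# dropna() removes the empty cell left by the shorter keyword list
tags = toy['plot_keywords'].str.split('|', expand=True).stack().dropna()
tags.index = tags.index.droplevel(-1)  # back to the original row labels
tags.name = 'tags'

# Joining on the index repeats each movie once per keyword
long_form = toy.join(tags)
print(long_form[['movie_title', 'tags']])
```

'Movie A' ends up on two rows, one per keyword, while 'Movie B' keeps a single row.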

The split-and-stack step creates a new series containing one keyword per row, which is then joined to the original dataframe on its index. As we are performing textual analysis, we need to weed out certain common words from our corpus: words such as 'the' and 'in', which don't convey any information but appear frequently enough to mess up our term-frequency analysis. Let's create a list of such words from the point of view of our analysis.

sWords = {'and', 'in', 'of', 'the', 'on','to', 'title','reference','female','male','by'}

Next we are going to use the wordcloud library, which does all the heavy lifting for us by calculating the count of each tag and also creating the corresponding word cloud image.

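from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Join all keywords into one string; generate() tokenizes and counts the words
tagsString = " ".join(df.tags.dropna().tolist())
wordcloud = WordCloud(stopwords=sWords,
                      background_color='white',
                      max_words=30,
                      width=2500,
                      height=2000
                      ).generate(tagsString)
# Set the figure size before drawing so that it takes effect
plt.figure(figsize=(8, 6))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()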

That's all we needed to do to generate a word cloud of the plot keywords. Now let's take a look at it.

Interestingly (or obviously), 'nudity' is a very common keyword in movies. The other frequent words also have to do with the day-to-day lives of normal people.
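
To check observations like these against actual numbers, the term frequencies can also be counted by hand. The following is a minimal sketch using only the standard library. The keyword strings are hypothetical, and the word cloud library applies its own tokenization, so its counts may differ from this simple whitespace split.

```python
from collections import Counter

# Hypothetical keyword strings in the same '|'-delimited format as plot_keywords
keyword_strings = [
    "love|friend|murder",
    "murder|police|death of friend",
    "love|title directed by female",
]
stop_words = {'and', 'in', 'of', 'the', 'on', 'to',
              'title', 'reference', 'female', 'male', 'by'}

# Count every word that is not a stop word
counts = Counter(
    word
    for keywords in keyword_strings
    for word in keywords.replace('|', ' ').lower().split()
    if word not in stop_words
)

# Term frequency = count of a word / total number of counted words
total = sum(counts.values())
for word, count in counts.most_common(3):
    print(word, count, round(count / total, 2))
```

On the real data, keyword_strings could be replaced by the non-null values of the plot_keywords column.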

Using a simple technique like word count or term frequency, we were able to create a graphical representation of what movies are about in our world. Finally, this was only a small part of the analysis that I actually did. I went on to compare the keywords associated with the top 250 IMDB movies against those of the top 250 highest-grossing movies. You can try it out on your own or read it up here.