Music Genre Classification Using Convolutional Neural Network

This post provides a simple and brief overview of what Covolutional Neural Networks are and how to use them to predict the genre of any songs. I will go into a lot of details so buckle up cause this is gonna be a long post

Convolutional Neural Network

Convolutional Neural Networks (CNNs) are a special type of Neural Network which have special convolutional layers in addition of the usual fully connected layers. These convolutional layers are notthing but filters which slide over the data to produce output. The history of CNNs can be traced to Image processing and recognition. Basically, earlier for any sort of image recognition what people used to do was to generate hand crafted features based on their intuition and understanding of the problem domain. Take for example the following task of recognizing the digit in an image

Here, a simple approach would be to maybe look for edges in the images and go from there. To capture these edge what is normally applied is a filter that would slide through the image, perform a dot product with the image values and record if it observes any edge in it or not. Something like the following would happen if we want to slide an edge filter over the image of 7.

A Convolutional Neural Network does exactly this. It has a bunch of such filters which it slides over images to generate what are called activation map which is nothing but the dot product of the filter and a small part of the image (Note: the depth of the filter is same as that of the image so if the image has 3 channels Red Green and Blue, then the depth of the filter will be automatically set to 3). The awesome thing about a convolutional network though is that you don't have to worry about designing the filters themselves. The training portion of the network takes care of that. And this is the exact thing that made CNNs so popular as before the major challenge in any image recognition task was to come with the hand crafted features and corresponding filters for them.

So in essence CNNs are neural networks with an additional set of learnable filters which are used to identify features in the dataset. The typical architecture of any convolutional neural network looks something like this:

So I missed a few key terms such as Non linear Activation, Pooling and dropout. Lets now understand briefly what these terms are:

Non Linear Activation

Non Linear Activation functions are what gives neural networks their advantage over other machine learning methods. These activations are inspired from the process that happen in our own neuron. However, they are really just a very crude and rudimentary approximation of the actual process. Common Activation functions are: sigmoid, tanh and relu. Though sigmoid in the most known function in all of them but sadly it is also the least used one. The most commonly used activation function is relu, which is simply Max(0, x) where x is the input to the activation function. The reason relu works better than other activation functions is that it does not saturate, where as both sigmoid and tanh saturate as they reach their min or max value which causes the gradients to become very very small causing the network to not learn.

Pooling

Pooling is a way to reduce the size of the input to any of the layers in the network. Basically as done with the filters before a matrix is slided over the output of the Non Linear Activation layer. This matrix however is not learnable and has to be specified while describing the network. There are many type of pooling operations such as min, max, average, etc. . The most common type of pooling is max pooling, in it a window of size specified during network construction. This window is slid over the input to the layer and the max value is found to replace the window with that max value.

Dropout

Dropout is a hyperparameter that specifies how many of the neurons should be deactivated at each training step. A simple hand wavy explaination of why would anyone want to deactivate any neurons during training is that it sort of creates an ensembles of networks, someting similar to what happens in a Random Forest or any other Boosted tree methods. Basically, by deactivating some random neurons you ensure that the models does not learn only in a particular direction but rather tries and generalize as much as possible. This is also the reason why Dropout is used as a regularization method to improve model generalization. Another reason why people use dropout is that by randomly deactivating certain neurons, you can force gradients to flow through the network making it learn better.

Uptil now we only covered what the typical convolutional neural network looks like, we have not discussed how a model learns all the things it has to learn. This process is the same as any regular neural network and is known as Back Propogation. To describe what Back Propagation is and how is everything calculated will take a post of its own. But to just describe it briefly it is a way by which based on the output of the last layer and its comparison with the actual class labels, the weights of the layers are updated one by one. It goes one layer at a time starting from the last, then the second last and so on till the first layer. This is why the method is called Back Propogation as the updates the propogated from the last to the first layer.

If you are interested in understanding the maths behind all the topics I covered above I would highly recommend that you go through Stanford's cs231n taught by Andrej Karpathy. It is a good course but if you don't have time and just want to understand the concepts discussed here you can go through the following as well here , here and here. If you want to see how the output of various layers in a convolutional layer looks like visit here , it is a real fun example that helps in understanding a CNN

Music Genre Prediction

Now that we have briefly discussed what a CNN is and how it functions. Lets now see how it can be applied to predict the genre of a song. But first lets go through how we are going to represent a song and how this CNN will be different to the ones we discussed till now. Finally before I start I am using GTZAN dataset for training my neural network

Feature Engineering - Mel Spectogram

The normal sampling rate of any audio sample is generaly 44100 Hz i.e. 44,100 samples are taken per second. If we were to give raw audio clip like to we do for images and even if the clips are of size 100ms (this is what we will do a bit later), the input layer would need 4410 neurons. This is not particularly a lot but if you want to train it on your local machine, this might cause an issue as the for each neuron in the network we have to store some data for back propogation and training. So if there are a lot of neurons in the network, more than for which the data could be stored in the RAM, the training time will be very very high.

So in order to represent the audio file we will use something that is known as a Mel Spectogram. It is basically a spectogram with values mapped down to Mel scale i.e. log scale. Mel Spectogram represents the frequency distribution over time. The reason why I am using this and not anything else is that I found Mel Spectogram to be used in a lot of audio processing tasks. Lets now see how these Mel Spectogram look for two very different genres Blues and Jazz

Network Architecture — Spectogram for Jazz