Music Genre Classification Using Convolutional Neural Network
Using Convolutional Neural Network to predict a song's genre
This post provides a simple and brief overview of what Convolutional Neural Networks are and how to use them to predict the
genre of a song. I will go into a lot of detail, so buckle up because this is going to be a long post.
Convolutional Neural Network
Convolutional Neural Networks (CNNs) are a special type of neural network which have special convolutional layers in addition
to the usual fully connected layers. These convolutional layers are nothing but filters which slide over the data to produce an
output. The history of CNNs can be traced back to image processing and recognition. Basically,
earlier, for any sort of image recognition, what people used to do was generate hand-crafted features based on their
intuition and understanding of the problem domain. Take, for example, the task of recognizing the digit in an image.
Here, a simple approach would be to look for edges in the image and go from there. To capture these edges, what is
normally applied is a filter that slides over the image, performs a dot product with the image values and
records whether it observes an edge or not. Something like the following would happen if we slid an edge
filter over the image of 7.
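To make this concrete, here is a tiny NumPy sketch of the slide-and-dot-product operation (the image and filter values are made up for illustration):

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide `kernel` over `image` ("valid" mode) and record the dot
    product at every position -- the core operation of a conv layer."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

# A tiny "image" with a vertical dark-to-bright edge,
# and a classic vertical-edge filter
image = np.array([[0, 0, 0, 1, 1],
                  [0, 0, 0, 1, 1],
                  [0, 0, 0, 1, 1],
                  [0, 0, 0, 1, 1]], dtype=float)
edge_filter = np.array([[-1, 0, 1],
                        [-1, 0, 1],
                        [-1, 0, 1]], dtype=float)

activation_map = convolve2d(image, edge_filter)
print(activation_map)  # each row is [0. 3. 3.]: strong response at the edge
```

The output is large exactly where the filter's window overlaps the edge, and zero over the flat region, which is what "detecting an edge" means here.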
A Convolutional Neural Network does exactly this. It has a bunch of such filters which it slides over images to generate
what are called activation maps, each of which is nothing but the dot product of a filter and a small patch of the image (note:
the depth of a filter is the same as that of the image, so if the image has 3 channels, Red, Green and Blue, then the depth of
the filter will automatically be set to 3). The
awesome thing about a convolutional network, though, is that you don't have to worry about designing the filters themselves.
The training of the network takes care of that. And this is exactly what made CNNs so popular, as before them the
major challenge in any image recognition task was to come up with the hand-crafted features and the corresponding filters for them.
So, in essence, CNNs are neural networks with an additional set of learnable filters which are used to identify features in the
data. The typical architecture of a convolutional neural network looks something like this:
So I skipped over a few key terms, such as Non Linear Activation, Pooling and Dropout. Let's now briefly understand what these terms mean:
Non Linear Activation
Non Linear Activation functions are what give neural networks their advantage over other machine learning methods. These
activations are inspired by the processes that happen in our own neurons. However, they are really just a very crude and
rudimentary approximation of the actual process. Common activation functions are: sigmoid, tanh and relu. Though sigmoid is
the best known of them all, sadly it is also the least used one. The most commonly used activation function
is relu, which is simply max(0, x), where x is the input to the activation function. The reason
relu works better than other activation functions is that it does not saturate, whereas both sigmoid and tanh saturate as they
approach their min or max values, which causes the gradients to become very, very small and the network to stop learning.
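The saturation point is easy to see numerically. The sketch below (my own toy, with made-up sample inputs) compares relu and sigmoid and their gradients at extreme values:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Gradients with respect to the input
def relu_grad(x):
    return (x > 0).astype(float)

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # peaks at 0.25, collapses at the extremes

x = np.array([-10.0, -1.0, 0.5, 10.0])
print(relu(x))          # [ 0.   0.   0.5 10. ]
print(sigmoid_grad(x))  # nearly 0 at both extremes: the saturation problem
print(relu_grad(x))     # stays exactly 1 for any positive input
```

At x = 10 the sigmoid gradient is already around 5e-5, so almost no learning signal flows backwards, while relu's gradient is still 1.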
Pooling
Pooling is a way to reduce the size of the input to any of the layers in the network. Basically, as with the filters before,
a matrix is slid over the output of the Non Linear Activation layer. This matrix, however, is not learnable and has to be
specified when describing the network. There are many types of pooling operations, such as min, max and average. The most
common type is max pooling: a window of a size specified during network construction is slid
over the layer's input, and each window is replaced by the max value found within it.
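A minimal NumPy sketch of max pooling (the 2x2 window, the stride and the input values are my own choices for illustration):

```python
import numpy as np

def max_pool(x, size=2, stride=2):
    """Non-overlapping 2x2 max pooling by default: each window of the
    input is replaced by its maximum value, shrinking the map."""
    h, w = x.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * stride:i * stride + size,
                       j * stride:j * stride + size]
            out[i, j] = window.max()
    return out

x = np.array([[1, 3, 2, 1],
              [4, 6, 5, 0],
              [7, 2, 9, 8],
              [1, 0, 3, 4]], dtype=float)
print(max_pool(x))  # [[6. 5.] [7. 9.]] -- a 4x4 map shrunk to 2x2
```

Note there is nothing to learn here: unlike a convolutional filter, the pooling window has no weights.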
Dropout
Dropout is a hyperparameter that specifies what fraction of the neurons should be deactivated at each training step. A simple hand
wavy explanation of why anyone would want to deactivate neurons during training is that it sort of creates an
ensemble of networks, something similar to what happens in a Random Forest or other ensemble tree methods. Basically,
by deactivating some random neurons you ensure that the model does not learn only in a particular direction but rather tries
to generalize as much as possible. This is also the reason why Dropout is used as a regularization method to improve model
generalization. Another effect of randomly deactivating certain neurons is that gradients are forced to flow through
alternative paths in the network, which can make the remaining neurons learn more robust features.
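As a toy illustration (a sketch of the commonly used "inverted dropout" variant, not code from any particular library), dropout is just a random mask applied during training:

```python
import numpy as np

def dropout(activations, p=0.5, training=True, rng=None):
    """Inverted dropout: zero each activation with probability p during
    training, and rescale the survivors by 1/(1-p) so the expected
    magnitude of the layer's output is unchanged at test time."""
    if not training or p == 0.0:
        return activations  # at test time all neurons stay active
    rng = rng or np.random.default_rng(0)
    mask = (rng.random(activations.shape) >= p).astype(float)
    return activations * mask / (1.0 - p)

a = np.ones(10)
print(dropout(a, p=0.5))  # roughly half the entries are 0, the rest are 2.0
print(dropout(a, training=False))  # unchanged outside of training
```

Because a different random subset of neurons is active at every step, each step effectively trains a slightly different sub-network, which is where the ensemble intuition comes from.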
Up until now we have only covered what a typical convolutional neural network looks like; we have not discussed how a model learns all
the things it has to learn. This process is the same as for any regular neural network and is known as Back Propagation. Describing
what Back Propagation is and how everything is calculated would take a post of its own. But to describe it briefly, it is a way
by which, based on the output of the last layer and its comparison with the actual class labels, the weights of the layers are updated
one by one. It goes one layer at a time, starting from the last, then the second last, and so on until the first layer. This is why the
method is called Back Propagation: the updates are propagated from the last layer to the first.
If you are interested in understanding the maths behind all the topics I covered above, I would highly recommend that you go
through Stanford's cs231n, taught by Andrej Karpathy. It is a good course, but if you don't have the time and just want to understand
the concepts discussed here, you can go through the following as well: here
, here and here.
If you want to see what the outputs of the various layers of a convolutional network look like, visit here
; it is a really fun example that helps in understanding a CNN.
Music Genre Prediction
Now that we have briefly discussed what a CNN is and how it functions, let's see how it can be applied to predict the genre of a
song. But first, let's go through how we are going to represent a song and how this CNN will differ from the ones we have discussed until
now. Finally, before I start: I am using the GTZAN dataset for training my
neural network.
Feature Engineering - Mel Spectrogram
The normal sampling rate of an audio file is generally 44100 Hz, i.e. 44,100 samples are taken per second. If we were to feed in the
raw audio clip, like we do for images, then even if the clips are only 100 ms long (this is what we will do a bit later),
the input layer would need 4410 neurons. This is not a particularly large number, but if you want to train on your local machine it
might cause an issue, as for each neuron in the network we have to store some data for back propagation during training. So if there
are more neurons in the network than your RAM can hold data for, training will become very, very slow.
So in order to represent the audio file we will use something known as a Mel Spectrogram. It is basically a spectrogram with
values mapped down to the Mel scale, i.e. a roughly logarithmic scale. A Mel Spectrogram represents the frequency distribution over time. The reason I am
using this and not anything else is that I found Mel Spectrograms to be used in a lot of audio processing tasks. Let's now see how these
Mel Spectrograms look for two very different genres, Blues and Jazz.
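In practice a single library call such as librosa.feature.melspectrogram computes this for you. Purely to illustrate the idea, here is a rough NumPy-only sketch (my own simplification: real implementations refine the windowing and filterbank details considerably):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_spectrogram(signal, sr=22050, n_fft=1024, hop=512, n_mels=40):
    """Minimal mel spectrogram: a magnitude STFT followed by a bank of
    triangular filters spaced evenly on the mel (log-like) scale."""
    # Short-time Fourier transform (Hann window, power spectrum)
    window = np.hanning(n_fft)
    frames = [signal[i:i + n_fft] * window
              for i in range(0, len(signal) - n_fft + 1, hop)]
    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2  # (time, n_fft//2 + 1)

    # Triangular filters: centres evenly spaced in mel, mapped back to Hz
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, c):
            fbank[m - 1, k] = (k - lo) / max(c - lo, 1)
        for k in range(c, hi):
            fbank[m - 1, k] = (hi - k) / max(hi - c, 1)

    # Log-compress so quiet and loud content stay comparable
    return np.log(spec @ fbank.T + 1e-10)            # (time, n_mels)

# A 1-second 440 Hz sine wave as a stand-in for a real audio clip
sr = 22050
t = np.arange(sr) / sr
mel = mel_spectrogram(np.sin(2 * np.pi * 440.0 * t), sr=sr)
print(mel.shape)  # (42, 40): 42 time frames, 40 mel frequency bands
```

The result is a small 2D array, frequency bands over time, which is exactly why a CNN built for images can be pointed at audio: the spectrogram is treated as a picture of the song.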