Use Deep Q Learning to land a shuttle
Using reinforcement learning techniques to train a bot that can play 2D video games is something I had wanted to try for a long time. Fortunately, as part of the Artificial Intelligence course I took this semester, I got to implement just that. The task was to train a Deep Reinforcement Learning agent that learns how to properly land a shuttle on the surface of the moon. The game environment was OpenAI's Lunar Lander. The particular technique I used was Q-learning, the same approach DeepMind used in its landmark 2013 paper on playing Atari Breakout with Deep Reinforcement Learning. This post in no way covers all the concepts required for implementing such a system from scratch; rather, I will try to provide a brief overview of the related concepts. Finally, I am not claiming that following what I am about to say will lead to results similar to mine, as neural networks are hard to train correctly and require a lot of tweaking due to their non-deterministic nature. What I will try to do is point out the problems I faced while training my network and the steps I took to solve them.
Before we begin, a quick introduction to the basics of Q-learning and how we can use it with a neural network. The optimal Q-value for a state-action pair, according to the Bellman equation, is given by $$ Q^*(s, a) = \sum_{s'} p(s'|s, a)\left(R(s, a, s') + \gamma \max_{a'} Q^*(s', a')\right) $$ Here Q(s, a) is the Q-value of a state-action pair, p(s'|s, a) is the probability of reaching state s' by performing action a from state s (by incorporating the transition probability we account for the uncertainty associated with taking an action), R(s, a, s') is the reward for the action, and \( \gamma \) is the discount factor, which represents how much future rewards should affect actions in the current state. A high \( \gamma \) leads to the bot favoring higher rewards in the future, whereas lower values make it look for more immediate rewards.
In our current scenario, since we are playing a game, we can assume that whatever action we take gets executed with 100% certainty, so the above equation becomes: $$ Q^*(s, a) = R(s, a) + \gamma \max_{a'} Q^*(s', a') $$ One question that came to my mind while training the neural network was why do it this way at all, i.e. why use a neural network, which is very hard to train, when you could simply use more traditional methods like value iteration and policy iteration, which are deterministic and easy to implement. The reason is that although the Q-value update using the Bellman equation is correct in theory (the Bellman equation basically says: start optimal and be optimal at every step of the way), its implementation via value or policy iteration assumes a discrete state space, which is something we rarely come across in real-world problems. We could always try to discretize the space, but that usually means storing a lot of state information, much of it for nearby states that have similar Q-values. A neural network solves exactly this problem: we can use it to produce Q(s, a) values over a continuous state space.
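For contrast, a tabular Q-learning update, the kind that only works when states can be enumerated, might look like the minimal sketch below; the table size, learning rate and discount factor here are assumptions for illustration only.

import numpy as np

# Hypothetical tabular Q-update: only feasible if the continuous observation
# space has been bucketed into a small, discrete set of states.
n_states, n_actions = 1000, 4      # assumed sizes for illustration
q_table = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99           # assumed learning rate and discount factor

def tabular_update(s, a, r, s_next, done):
    # Bellman target: immediate reward plus discounted best future Q-value
    target = r if done else r + gamma * np.max(q_table[s_next])
    q_table[s, a] += alpha * (target - q_table[s, a])

The problem is precisely that bucketing step: Lunar Lander's observation is a handful of continuous numbers, and a table fine enough to be useful quickly becomes enormous.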
Now that we know what the Q-value equation looks like and why we should use a neural network to solve it, let's look at how exactly to do this. Anyone familiar with neural networks, or with any other optimization strategy in general, will know that we need a loss value to minimize for things to work. A good loss in this case is the mean squared difference between the Q-value given by the Bellman equation and the one returned by the neural network. This makes sense, as we would like to train our model so that the Q-values it produces match what we would have obtained with methods like value iteration had we discretized the state space. So our loss function looks like the following: $$ L = \text{mean}\left((\text{Bellman} - Q(s, a))^2\right) $$ Substituting the Bellman value we derived above, we get $$ L = \text{mean}\left((R(s, a) + \gamma \max_{a'} Q(s', a') - Q(s, a))^2\right) $$
Now that we have all the maths figured out, it's time to get into the technical details and implement it using Tensorflow. First, let's define our model.
import gym
import tensorflow as tf

env = gym.make('LunarLander-v2')  # OpenAI Gym's discrete Lunar Lander environment

def build_model(observation_input, trainable=True):
    # Three fully connected hidden layers of 64 ReLU units each
    hidden = tf.layers.dense(observation_input, 64, activation=tf.nn.relu, trainable=trainable, name='dense')
    hidden_2 = tf.layers.dense(hidden, 64, activation=tf.nn.relu, trainable=trainable, name='dense_1')
    hidden_3 = tf.layers.dense(hidden_2, 64, activation=tf.nn.relu, trainable=trainable, name='dense_2')
    # Output layer: one Q-value per possible action
    action_values = tf.squeeze(tf.layers.dense(hidden_3, env.action_space.n, trainable=trainable, name='qValueLayer'))
    return action_values
The build_model function takes two input variables: "observation_input", which is a Tensorflow placeholder for the input data, and "trainable", a boolean that determines whether or not the network's weights get trained. The latter is used to create a target network, more on this later. The output dimension of the network is set to "env.action_space.n" (env is the object reference to OpenAI's Lunar Lander gym environment), which gives the number of possible actions, in this case 4. What we are trying to do with this model is get the Q-values associated with all the actions for a given state: the input to the network is a state observation and the output is the Q-value associated with each of the actions. This is exactly how we intended to use a neural network, i.e. to get Q-values for state-action pairs.
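To give a rough idea of how this might be wired up, here is a minimal sketch; the placeholder and variable-scope names (obs_ph, next_obs_ph, 'online', 'target') are my own illustrative choices, not necessarily what the full project uses.

# Placeholders for the current and next observations in a batch of transitions
obs_ph = tf.placeholder(tf.float32, shape=(None, env.observation_space.shape[0]), name='obs')
next_obs_ph = tf.placeholder(tf.float32, shape=(None, env.observation_space.shape[0]), name='next_obs')

with tf.variable_scope('online'):
    q_values = build_model(obs_ph, trainable=True)             # Q(s, .): the network being trained
with tf.variable_scope('target'):
    next_q_values = build_model(next_obs_ph, trainable=False)  # Q(s', .): the frozen target network

In a typical DQN setup the target network's weights are periodically copied over from the online network, so the Bellman target is not chasing a value that changes on every gradient step.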
Some subtle points about this network: the weights of the layers are initialized using Xavier (Glorot) initialization, which is actually the default initialization scheme in Tensorflow, and the biases are all initialized to 0. The reason I am pointing these out is because whether or not a network learns anything at all can depend on how the weights and biases were initialized.
Now that we have a network set up, it's time to define the loss. As specified earlier, the loss should be a measure of the difference between the ideal output and the actual output received. Let's first define what the ideal output should have been. If the state is terminal, i.e. the game is over, the ideal output is just the reward achieved: $$ Q^*(s, a) = R(s, a) $$ If the state is not terminal, then the ideal Q-value is given by: $$ Q^*(s, a) = R(s, a) + \gamma \max_{a'} Q(s', a') $$ A thing to note here is that the ideal output is a 1-D vector (one value per transition in the batch), so the actual output has to be a 1-D vector as well. Remember that the network we defined above gives a 4-D vector as output for each input observation. This is something that confused me for quite some time, as I couldn't understand how to use the network's output. After a few frustrating hours of going back and forth on my loss definition, I understood that what I actually need from the network's output is Q(s, a), i.e. the value it returns for the particular action that was taken. Basically, for each state we get a 4-D output; from this vector we need to choose the value corresponding to the action that was actually taken in that state. Doing this for all input observations gives us a 1-D vector, which we can then use for calculating the loss.
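As a sketch of how that selection and the loss might look in Tensorflow, building on the q_values and next_q_values tensors from the earlier snippet (the placeholder names and the discount value are assumptions):

# Placeholders for a batch of transitions (names are illustrative)
action_ph = tf.placeholder(tf.int32, shape=(None,), name='action')
reward_ph = tf.placeholder(tf.float32, shape=(None,), name='reward')
done_ph = tf.placeholder(tf.float32, shape=(None,), name='done')  # 1.0 if the episode ended, else 0.0
gamma = 0.99  # assumed discount factor

# Pick out Q(s, a) for the action actually taken: 4-D output per state -> 1-D vector per batch
action_mask = tf.one_hot(action_ph, env.action_space.n)
q_taken = tf.reduce_sum(q_values * action_mask, axis=1)

# Bellman target: just the reward on terminal steps, otherwise reward + discounted max Q(s', a')
bellman_target = reward_ph + (1.0 - done_ph) * gamma * tf.reduce_max(next_q_values, axis=1)

# Mean squared difference between the Bellman target and the network's Q(s, a)
loss = tf.reduce_mean(tf.square(tf.stop_gradient(bellman_target) - q_taken))
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)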
With the network architecture and the loss function figured out, the only major part that remains to be set up is the epsilon-greedy strategy for exploration vs exploitation. But before that, I feel now is the right time to talk about some technical/engineering things needed to make the network learn properly.
The final piece of the puzzle to make the agent learn is to decide on the epsilon-greedy strategy. It determines what action the agent takes given an observation: the agent performs a random action with probability epsilon and uses its learned knowledge the rest of the time. Pseudocode for the epsilon-greedy strategy looks like the following.
def takeAction(Agent, Observation, epsilon):
    return Action.Random() if random() < epsilon else Agent.getAction(Observation)
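In concrete terms, with the network and placeholders sketched earlier, this might look roughly like the following (the session handling and names are assumptions on my part):

import random
import numpy as np

def take_action(sess, observation, epsilon):
    # Explore: sample a random action from the environment's action space
    if random.random() < epsilon:
        return env.action_space.sample()
    # Exploit: pick the action with the highest predicted Q-value for this observation
    q = sess.run(q_values, feed_dict={obs_ph: observation[None, :]})
    return int(np.argmax(q))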
What we ideally want is to explore a lot of states in the beginning by taking random actions, and then, as the agent learns, make the actions less and less random and more based on what the agent has learned about its environment. The reason is that in the beginning the agent knows nothing about the environment it is in, and if we let it act only on what it knows, it won't properly explore all the states and may miss out on a higher reward. We also want the randomness to die down as the agent learns, so that it can solidify its understanding of the environment by acting on what it knows.
Multiple strategies can work here: linear decay, exponential decay, and so on. What worked for me was to let the epsilon value decay linearly for the first 50000 observations and then decay it exponentially from there on; a rough sketch of such a schedule is shown right below. Following is how my agent worked after 4000 episodes of training; one episode is one complete round of the game.
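Here is that schedule sketch; the start and end epsilon values and the decay rate are assumed for illustration, not necessarily the exact numbers I used.

# Epsilon schedule: linear decay for the first 50000 observed steps,
# exponential decay afterwards. Start/end values and rate are illustrative.
EPS_START, EPS_MID, LINEAR_STEPS, DECAY_RATE = 1.0, 0.1, 50000, 0.9999

def epsilon_at(step):
    if step < LINEAR_STEPS:
        # Linearly interpolate from EPS_START down to EPS_MID
        return EPS_START - (EPS_START - EPS_MID) * (step / LINEAR_STEPS)
    # Then decay exponentially towards zero
    return EPS_MID * (DECAY_RATE ** (step - LINEAR_STEPS))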
Lastly, here are some general-purpose tips for training a neural network.
That's all for now. If you have any questions, please leave a comment!
Cheers!!