Use Deep Q Learning to land a shuttle
Using reinforcement learning techniques to train a bot that can play 2D video games is something I had wanted to try for a long time. Fortunately, as part of the Artificial Intelligence course I took this semester, I got to implement just that. The task was to train a Deep Reinforcement Learning agent that learns how to properly land a shuttle on the surface of the moon. The game environment was OpenAI's Lunar Lander. The particular technique I used was Q-learning, the same approach DeepMind used in its landmark 2013 paper on playing Atari Breakout with Deep Reinforcement Learning. This post in no way covers all the concepts required for implementing such a system from scratch; rather, I will try to provide a brief overview of the related concepts. Finally, I am not claiming that following what I am about to say will lead to results similar to mine, as neural networks are hard to train correctly and require a lot of tweaking due to their non-deterministic nature. What I will try to do is point out the problems I faced while training my network and the steps I took to solve them.
Before we begin, a quick introduction to the basics of Q-learning and how we can use it with a neural network. The optimal Q-value for a state-action pair, according to the Bellman equation, is given by $$ Q^*(s, a) = \sum_{s'} p(s'|s, a)\left(R(s, a, s') + \gamma \max_{a'} Q^*(s', a')\right) $$ Here Q(s, a) is the Q-value of a state-action pair, p(s'|s, a) is the probability of reaching state s' by performing action a from state s (by incorporating the transition probability we account for the uncertainty associated with taking an action), R(s, a, s') is the reward for the action, and \( \gamma \) is the discount factor, which represents how much future rewards should affect actions in the current state. A high \( \gamma \) leads to the bot favoring higher rewards in the future, whereas lower values make it look for more immediate rewards.
In our current scenario, since we are playing a game, we can assume that whatever action we take gets executed with 100% certainty, so the above equation becomes: $$ Q^*(s, a) = R(s, a) + \gamma \max_{a'} Q^*(s', a') $$ One question that came to my mind while training the neural network was why do it this way at all, i.e. why use a neural network, which is very hard to train, when you could simply use more traditional methods like value iteration and policy iteration, which are deterministic and easy to implement. The reason is that although the Q-value update using the Bellman equation is correct in theory (the Bellman equation basically says: start optimal and be optimal at every step of the way), its implementation via value or policy iteration assumes a discrete state space, which is something we rarely come across in real-world problems. We could always try to discretize the space, but that usually means storing a lot of state information, much of it for nearby states that have similar Q-values. A neural network solves exactly this problem: we can use it to produce Q(s, a) values over a continuous state space.
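For contrast, a tabular Q-learning update, the kind that only works when states can be enumerated, might look like the minimal sketch below; the table size, learning rate and discount factor here are assumptions for illustration only.

import numpy as np

# Hypothetical tabular Q-update: only feasible if the continuous observation
# space has been bucketed into a small, discrete set of states.
n_states, n_actions = 1000, 4      # assumed sizes for illustration
q_table = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99           # assumed learning rate and discount factor

def tabular_update(s, a, r, s_next, done):
    # Bellman target: immediate reward plus discounted best future Q-value
    target = r if done else r + gamma * np.max(q_table[s_next])
    q_table[s, a] += alpha * (target - q_table[s, a])

The problem is precisely that bucketing step: Lunar Lander's observation is a handful of continuous numbers, and a table fine enough to be useful quickly becomes enormous.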
Now that we know what the Q-value equation looks like and why we should use a neural network to solve it, let's look at how exactly to do this. Anyone familiar with neural networks, or with any other optimization strategy in general, will know that we need a loss value to minimize for things to work. A good loss in this case is the mean squared difference between the Q-value given by the Bellman equation and the one returned by the neural network. This makes sense, as we would like to train our model so that the Q-values it produces match what we would have obtained with methods like value iteration had we discretized the state space. So our loss function looks like the following: $$ L = \text{mean}\left((\text{Bellman} - Q(s, a))^2\right) $$ Substituting the Bellman value we derived above, we get $$ L = \text{mean}\left((R(s, a) + \gamma \max_{a'} Q(s', a') - Q(s, a))^2\right) $$
Now that we have all the maths figured out, it's time to get into the technical details and implement it using Tensorflow. First, let's define our model.
import gym
import tensorflow as tf

env = gym.make('LunarLander-v2')  # OpenAI Gym's discrete Lunar Lander environment

def build_model(observation_input, trainable=True):
    # Three fully connected hidden layers of 64 ReLU units each
    hidden = tf.layers.dense(observation_input, 64, activation=tf.nn.relu, trainable=trainable, name='dense')
    hidden_2 = tf.layers.dense(hidden, 64, activation=tf.nn.relu, trainable=trainable, name='dense_1')
    hidden_3 = tf.layers.dense(hidden_2, 64, activation=tf.nn.relu, trainable=trainable, name='dense_2')
    # Output layer: one Q-value per possible action
    action_values = tf.squeeze(tf.layers.dense(hidden_3, env.action_space.n, trainable=trainable, name='qValueLayer'))
    return action_values
The build_model function takes two input variables: "observation_input", which is a Tensorflow placeholder for the input data, and "trainable", a boolean that determines whether or not the network's weights get trained. The latter is used to create a target network, more on this later. The output dimension of the network is set to "env.action_space.n" (env is the object reference to OpenAI's Lunar Lander gym environment), which gives the number of possible actions, in this case 4. What we are trying to do with this model is get the Q-values associated with all the actions for a given state: the input to the network is a state observation and the output is the Q-value associated with each of the actions. This is exactly how we intended to use a neural network, i.e. to get Q-values for state-action pairs.
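To give a rough idea of how this might be wired up, here is a minimal sketch; the placeholder and variable-scope names (obs_ph, next_obs_ph, 'online', 'target') are my own illustrative choices, not necessarily what the full project uses.

# Placeholders for the current and next observations in a batch of transitions
obs_ph = tf.placeholder(tf.float32, shape=(None, env.observation_space.shape[0]), name='obs')
next_obs_ph = tf.placeholder(tf.float32, shape=(None, env.observation_space.shape[0]), name='next_obs')

with tf.variable_scope('online'):
    q_values = build_model(obs_ph, trainable=True)             # Q(s, .): the network being trained
with tf.variable_scope('target'):
    next_q_values = build_model(next_obs_ph, trainable=False)  # Q(s', .): the frozen target network

In a typical DQN setup the target network's weights are periodically copied over from the online network, so the Bellman target is not chasing a value that changes on every gradient step.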
Some subtle points about this network: the weights of the layers are initialized using Xavier (Glorot) initialization, which is actually the default initialization scheme in Tensorflow, and the biases are all initialized to 0. The reason I am pointing these out is because whether or not a network learns anything at all can depend on how the weights and biases were initialized.
Now that we have a network set up, it's time to define the loss. As specified earlier, the loss should be a measure of the difference between the ideal output and the actual output received. Let's first define what the ideal output should have been. If the state is terminal, i.e. the game is over, the ideal output is just the reward achieved: $$ Q^*(s, a) = R(s, a) $$ If the state is not terminal, then the ideal Q-value is given by: $$ Q^*(s, a) = R(s, a) + \gamma \max_{a'} Q(s', a') $$ A thing to note here is that the ideal output is a 1-D vector (one value per transition in the batch), so the actual output has to be a 1-D vector as well. Remember that the network we defined above gives a 4-D vector as output for each input observation. This is something that confused me for quite some time, as I couldn't understand how to use the network's output. After a few frustrating hours of going back and forth on my loss definition, I understood that what I actually need from the network's output is Q(s, a), i.e. the value it returns for the particular action that was taken. Basically, for each state we get a 4-D output; from this vector we need to choose the value corresponding to the action that was actually taken in that state. Doing this for all input observations gives us a 1-D vector, which we can then use for calculating the loss.
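As a sketch of how that selection and the loss might look in Tensorflow, building on the q_values and next_q_values tensors from the earlier snippet (the placeholder names and the discount value are assumptions):

# Placeholders for a batch of transitions (names are illustrative)
action_ph = tf.placeholder(tf.int32, shape=(None,), name='action')
reward_ph = tf.placeholder(tf.float32, shape=(None,), name='reward')
done_ph = tf.placeholder(tf.float32, shape=(None,), name='done')  # 1.0 if the episode ended, else 0.0
gamma = 0.99  # assumed discount factor

# Pick out Q(s, a) for the action actually taken: 4-D output per state -> 1-D vector per batch
action_mask = tf.one_hot(action_ph, env.action_space.n)
q_taken = tf.reduce_sum(q_values * action_mask, axis=1)

# Bellman target: just the reward on terminal steps, otherwise reward + discounted max Q(s', a')
bellman_target = reward_ph + (1.0 - done_ph) * gamma * tf.reduce_max(next_q_values, axis=1)

# Mean squared difference between the Bellman target and the network's Q(s, a)
loss = tf.reduce_mean(tf.square(tf.stop_gradient(bellman_target) - q_taken))
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)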
With the network architecture and the loss function figured out, the only major part that remains to be set up is the epsilon-greedy strategy for exploration vs exploitation. But before that, I feel now is the right time to talk about some technical/engineering things needed to make the network learn properly.
The final piece of the puzzle to make the agent learn is to decide on the epsilon-greedy strategy. It determines what action the agent takes given an observation: the agent performs a random action with probability epsilon and uses its learned knowledge the rest of the time. Pseudocode for the epsilon-greedy strategy looks like the following.
def takeAction(Agent, Observation, epsilon):
    return Action.Random() if random() < epsilon else Agent.getAction(Observation)
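In concrete terms, with the network and placeholders sketched earlier, this might look roughly like the following (the session handling and names are assumptions on my part):

import random
import numpy as np

def take_action(sess, observation, epsilon):
    # Explore: sample a random action from the environment's action space
    if random.random() < epsilon:
        return env.action_space.sample()
    # Exploit: pick the action with the highest predicted Q-value for this observation
    q = sess.run(q_values, feed_dict={obs_ph: observation[None, :]})
    return int(np.argmax(q))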
What we ideally want is to explore a lot of states in the beginning by taking random actions, and then, as the agent learns, make the actions less and less random and more based on what the agent has learned about its environment. The reason is that in the beginning the agent knows nothing about the environment it is in, and if we let it act only on what it knows, it won't properly explore all the states and may miss out on a higher reward. We also want the randomness to die down as the agent learns, so that it can solidify its understanding of the environment by acting on what it knows.
Multiple strategies can work here: linear decay, exponential decay, and so on. What worked for me was to let the epsilon value decay linearly for the first 50000 observations and then decay it exponentially from there on; a rough sketch of such a schedule is shown right below. Following is how my agent worked after 4000 episodes of training; one episode is one complete round of the game.
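Here is that schedule sketch; the start and end epsilon values and the decay rate are assumed for illustration, not necessarily the exact numbers I used.

# Epsilon schedule: linear decay for the first 50000 observed steps,
# exponential decay afterwards. Start/end values and rate are illustrative.
EPS_START, EPS_MID, LINEAR_STEPS, DECAY_RATE = 1.0, 0.1, 50000, 0.9999

def epsilon_at(step):
    if step < LINEAR_STEPS:
        # Linearly interpolate from EPS_START down to EPS_MID
        return EPS_START - (EPS_START - EPS_MID) * (step / LINEAR_STEPS)
    # Then decay exponentially towards zero
    return EPS_MID * (DECAY_RATE ** (step - LINEAR_STEPS))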
Lastly, here are some general-purpose tips for training a neural network.
That's all for now. If you have any questions, please leave a comment!
Cheers!!