Space Invaders challenge: a Reinforcement Learning competition
Overview
Current advancements in Reinforcement Learning have made it possible to train an AI agent to master many games and play at a human level or, in some cases, beat a human player. This was made possible mainly by a novel set of techniques that incorporate Deep Learning into traditional Reinforcement Learning, giving rise to a new class of algorithms called Deep Reinforcement Learning algorithms. As part of this challenge I have explored a few such Deep Reinforcement Learning algorithms, from DQN to the current state-of-the-art A3C, and this report covers the details of these algorithms along with their experimental results. As the title suggests, the game used to validate the performance of these algorithms is Atari's Space Invaders, and all the algorithms explored use visual sensory data (image frames) for training. The environment for Space Invaders is provided by OpenAI Gym's Atari emulator.
Background
\textbf{Markov Decision Processes (MDPs)} : Reinforcement Learning's main objective is to solve MDPs. From an AI agent's viewpoint, an MDP can be thought of as the agent interacting with an environment by taking an action depending on its current state and moving to a new state. Depending on the new state the agent lands in, a reward is awarded to the agent. The goal of the agent is to take those actions that maximize the long term rewards. MDPs follow the Markov property: only the current state matters while taking an action, not the states that preceded it. In the Space Invaders Atari environment the "state" is made up of image frames, the "actions" include "no-op (do nothing), move left, move right, fire, left fire, right fire" and the rewards depend on whether the agent survived or shot down aliens.
\textbf{State pre-processing} : As explained above, the image frames are the states of our environment. The Atari environment provides image frames in RGB format with dimension 210x160x3. These RGB frames are down-sampled and converted to grayscale, with a final dimension of 88x80. These pre-processed images are used in training and evaluation of all the algorithms.
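For illustration, a minimal pre-processing sketch is shown below. The exact crop offsets, the grayscale conversion by channel averaging and the normalization to [0, 1] are my own assumptions; the report only specifies the 210x160x3 RGB input and the 88x80 grayscale output.

```python
import numpy as np


def preprocess_frame(frame: np.ndarray) -> np.ndarray:
    """Down-sample a 210x160x3 RGB Atari frame to an 88x80 grayscale image.

    The crop offsets are illustrative; any crop keeping the playing field
    while leaving 176 rows works, since taking every second pixel then
    yields the 88x80 shape described above.
    """
    gray = frame.mean(axis=2).astype(np.uint8)     # RGB -> grayscale, 210x160
    cropped = gray[17:193, :]                      # keep 176 rows of play area
    downsampled = cropped[::2, ::2]                # 176x160 -> 88x80
    return downsampled.astype(np.float32) / 255.0  # normalization is an assumption
```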
\textbf{Bellman Equation or state-value function} : The agent is trying to maximize its long term rewards, but in some cases episodes can last for many time steps or even forever, which gives rise to the concept of discounted rewards. Discounted rewards are rewards obtained in future time steps whose value is discounted. The Bellman equation captures the value $V$ of being in a state $s_t$ based on the immediate reward of entering that state and all the discounted rewards that can be obtained from that state, as shown below -
$$V(s_t) = \max_{a_t}\Big(R(s_t, a_t) + \gamma \sum_{s_{t+1}} P(s_{t+1} \mid s_t, a_t)\, V(s_{t+1})\Big)$$
The optimal value $V$ of the current state $s_t$ is obtained by taking the action $a_t$ that maximizes the immediate reward $R$ for taking that action in the current state, plus the sum over the values $V$ of the new states $s_{t+1}$, weighted by the probability $P$ of ending up in each new state after taking that action, multiplied by a discount factor $\gamma$.
\textbf{Q-function or action-value function} : The Bellman equation described above requires modeling the probability distribution over every new state given the current state and action. The Q-function eliminates the need for this model-based approach by redefining the Bellman equation as below -
$$Q(s_t, a_t) = R(s_t, a_t) + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1})$$
The Q-value $Q$ of the current state $s_t$ for taking action $a_t$ is the immediate reward $R$ for taking that action in the current state, plus the maximum possible Q-value, obtained by choosing the optimal action out of all possible actions $a_{t+1}$ in the new state $s_{t+1}$, multiplied by the discount factor $\gamma$. This doesn't entirely capture the original Bellman equation, which gave us the optimal value $V$ of being in a state $s_t$; we can still obtain this from the action-value function or Q-function as below -
$$V(s_t) = \max_{a_t} Q(s_t, a_t)$$
This is exactly what is done in practice while using the action-value function or Q-function. When in a state $s_t$ we pick the action $a_t$ that maximizes the Q-value, which is nothing but the optimal value of that state $V(s_t)$. Thus algorithms that use the Q-function are usually referred to as value-iteration algorithms.
\textbf{Policy} : A policy $\pi$ indicates which action $a_t$ to take in state $s_t$. An optimal policy $\pi^{*}$ is the one that maximizes long term rewards. It can be represented as below -
$$\pi(a_t \mid s_t) = \arg\max_{a_t} Q(s_t, a_t)$$
\textbf{Value-iteration methods} : The goal of these methods is to maximize the long term rewards by directly maximizing the value of being in a particular state $s_t$.
\textbf{Policy-iteration methods} : The goal of these methods is to maximize the long term rewards by finding an optimal policy $\pi^{*}$ from a set of policies $(\pi_1, \pi_2, \ldots, \pi_n)$.
\textbf{Advantage function} : The advantage function measures the relative importance of an action. It can be obtained from the state-value function and action-value function as below -
$$A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)$$
The state-value function $V(s)$ measures how good it is to be in state $s$. The action-value function $Q(s,a)$ measures the value of choosing a particular action $a$ in state $s$. The advantage function $A(s,a)$ subtracts the value of the state from the action-value function, thereby measuring the importance of each action. All three follow the policy $\pi$.
Now, let's say that all the functions have converged and we have found an optimal policy $\pi^{*}$ which gives the optimal action that maximizes the long term reward. Suppose we choose only the action $a$ returned by the policy $\pi^{*}$ in state $s$ for both the action-value function $Q(s,a)$ and the advantage function $A(s,a)$; then,
$A^{\pi^{*}}(s,a) = 0$, because once all the functions have converged and we are choosing the action from $\pi^{*}$, there is no need to compute $V^{\pi^{*}}(s) = \max_a Q^{\pi^{*}}(s,a)$, since $V^{\pi^{*}}(s) = Q^{\pi^{*}}(s,a)$.
\textbf{Q-learning} : Q-learning is a value-iteration method that builds a table of Q-values $Q$ for all states $(s_t, s_{t+1}, \ldots)$ and all possible actions $(a_t, a_{t+1}, \ldots)$ that can be taken from those states. The Q-value updates are done as below -
$Q(s_t, a_t) = R(s_t, a_t)$ for a terminal state $s_t$, and $Q(s_t, a_t) = R(s_t, a_t) + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1})$ for a non-terminal state $s_t$.
With the table of Q-values we can now find the optimal policy as below, which is to just pick the action $a_t$ that gives the highest Q-value $Q$ for that state $s_t$.
$$\pi^{*}_t = \arg\max_{a_t} Q(s_t, a_t)$$
\textbf{$\bm{\epsilon}$-greedy policy} : Q-learning uses an $\epsilon$-greedy policy for addressing the exploration vs exploitation problem in Reinforcement Learning. The exploration vs exploitation problem arises when the agent keeps taking greedy actions that maximize immediate rewards; however, in order to strategize better the agent also needs to explore other state-action spaces by taking some random actions every now and then. In the $\epsilon$-greedy policy, for an $\epsilon$ value between 0 and 1, we take a random action with probability $p=\epsilon$ and choose an action based on the Q-values with probability $p = 1-\epsilon$.
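A minimal tabular sketch combining the Q-learning update and the $\epsilon$-greedy policy described above is shown below. The learning rate `alpha` is the usual tabular blending factor and is an addition on top of the pure one-step target shown in the equations; the classic Gym `reset`/`step` API is assumed.

```python
import random
from collections import defaultdict

import numpy as np


def tabular_q_learning(env, num_episodes=500, gamma=0.99, alpha=0.1, epsilon=0.1):
    """Tabular Q-learning sketch for a small discrete-state Gym environment.

    Atari states (image frames) are far too large for a table; this only
    illustrates the update rule and epsilon-greedy selection from above.
    """
    q_table = defaultdict(lambda: np.zeros(env.action_space.n))
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            # epsilon-greedy: random action with probability epsilon
            if random.random() < epsilon:
                action = env.action_space.sample()
            else:
                action = int(np.argmax(q_table[state]))
            next_state, reward, done, _ = env.step(action)
            # one-step target: reward for terminal states, bootstrapped otherwise
            target = reward if done else reward + gamma * np.max(q_table[next_state])
            q_table[state][action] += alpha * (target - q_table[state][action])
            state = next_state
    return q_table
```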
Deep Q Networks (DQN)
\textbf{Overview}
The Q-learning algorithm will converge to the optimal Q-values $Q^{*}$ as the number of timesteps reaches $\infty$, i.e. $Q_i \rightarrow Q^{*}$ as $i \rightarrow \infty$. However, in practice this is not possible, as the action-value function or Q-function used to compute the Q-values $Q_i$ is estimated for each state separately. Thus, at best, new methods can come up with function approximators to estimate the action-value function or Q-function. These function approximators can be either linear or non-linear. The Deep Q Network or DQN, first proposed in [1], uses a non-linear function approximator, a deep neural network, to approximate the action-value function. Thus the optimal action-value function can now be approximated as below -
$Q(s_t, a_t ; \theta) \approx Q^{*}(s_t, a_t)$, where $\theta$ are the weights of the deep neural network, referred to as the Deep Q-network.
The Deep Q-network architecture described in the original paper [1] is as follows - a first Conv2D layer with 16 filters, kernel size of 8, stride of 4 and ReLU activation, followed by a second Conv2D layer with 32 filters, kernel size of 4, stride of 2 and ReLU activation, followed by a fully connected layer with 256 units; the output layer emits logits of size equal to the number of actions. Thus for Space Invaders we get one output per action. The DQN algorithm uses experience replay [1][2] in order to eliminate correlations in the state-space observations, without which the agent would keep training over similar states, which can be detrimental. The experience replay consists of tuples $e_t = (s_t, a_t, r_t, d_t, s_{t+1})$, a record of the current state, action, reward, done flag and new state. Up to \textit{replay buffer max size} such tuples are maintained in the replay buffer, which is used while training the DQN. The done flag is a boolean returned by Gym's Atari environment, set to \textit{True} if the episode has ended, meaning $s_t$ is the terminal state, and \textit{False} otherwise.
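For reference, a minimal sketch of the network described above in Keras is shown below; the 88x80x1 input shape follows the pre-processing described earlier, and treating a single frame as the input (rather than a stack of frames) and the framework choice are my assumptions.

```python
import tensorflow as tf


def build_dqn(num_actions: int, input_shape=(88, 80, 1)) -> tf.keras.Model:
    """Sketch of the DQN from [1]: two conv layers, one dense layer and a
    linear output head with one Q-value logit per action."""
    inputs = tf.keras.Input(shape=input_shape)
    x = tf.keras.layers.Conv2D(16, 8, strides=4, activation="relu")(inputs)
    x = tf.keras.layers.Conv2D(32, 4, strides=2, activation="relu")(x)
    x = tf.keras.layers.Flatten()(x)
    x = tf.keras.layers.Dense(256, activation="relu")(x)
    q_values = tf.keras.layers.Dense(num_actions)(x)  # linear Q-value outputs
    return tf.keras.Model(inputs, q_values)


# Usage: SpaceInvaders-v0 has 6 actions
# model = build_dqn(num_actions=6)
```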
The Deep Q-network is trained by obtaining the target estimate for the Q-values and by minimizing the loss function $L_i(\theta)$ shown below at every training iteration $i$, then taking the gradients -
$$Q(s_{t+1}, a_{t+1}; \theta) = \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}; \theta)$$
$L_i(\theta) = (y_i - Q(s_t, a_t; \theta))^2$, where $y_i = r$ for a terminal state and $y_i = r + \gamma Q(s_{t+1}, a_{t+1}; \theta)$ for a non-terminal state.
Here $i$ is used with respect to the Deep Q-network and $t$ with respect to the Atari environment. The DQN algorithm is detailed in [1]; at a high level it can be summarized as follows. Depending on the $\epsilon$-greedy policy, if we choose to act based on the Q-value estimates then the action $a_t$ for state $s_t$ is obtained by passing $s_t$ through the DQN with parameters $\theta$ and, from the output logits (the Q-value estimates for all actions), picking the action $a_t$ with the highest value; if $\epsilon$-greedy suggests a random action instead, we simply choose $a_t$ randomly from the action space. Either way, $a_t$ is passed on to the environment, which returns the corresponding new state $s_{t+1}$ and reward $r$ for taking action $a_t$. We then build the tuple $e_t = (s_t, a_t, r_t, d_t, s_{t+1})$ and add it to the experience replay buffer. Before training, the experience replay buffer is filled with \textit{replay buffer min size} tuple entries by performing random actions $a_t$. Then, every \textit{update frequency} steps, the DQN is trained by randomly sampling \textit{mini batch} tuple entries from the experience replay buffer. All the \textit{mini batch} $s_{t+1}$ entries are passed through the DQN to obtain the Q-values for all actions; we take the maximum Q-values $\max Q(s_{t+1}, a_{t+1})$, multiply them by the discount factor $\gamma$ and by $(1 - d_t)$ (which is 0 for a terminal state and 1 for a non-terminal state), and finally add the rewards $r_t$ to obtain the target Q-value estimates $y_i$ of the loss. The predicted Q-value estimates $Q(s_t, a_t; \theta)$ of the loss are obtained by passing the \textit{mini batch} $s_t$ entries through the DQN to get the Q-values for all actions and keeping only the Q-values corresponding to the action $a_t$ in each tuple. After this the gradients are computed and the network parameters are optimized.
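A minimal replay-buffer sketch is shown below; the deque-based storage and the class/method names are illustrative choices, not taken from the report.

```python
import random
from collections import deque


class ReplayBuffer:
    """Stores experience tuples (s_t, a_t, r_t, d_t, s_{t+1}) up to a maximum
    size and supports uniform random mini-batch sampling, as described above."""

    def __init__(self, max_size=1_000_000):
        self.buffer = deque(maxlen=max_size)  # old entries are evicted automatically

    def add(self, state, action, reward, done, next_state):
        self.buffer.append((state, action, reward, done, next_state))

    def __len__(self):
        return len(self.buffer)

    def sample(self, batch_size=32):
        # uniform sampling without prioritization (PER is discussed later)
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, dones, next_states = zip(*batch)
        return states, actions, rewards, dones, next_states
```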
As one can notice from the above, unlike supervised learning the target here is not fixed but is instead an estimate that keeps moving at each training iteration, which can lead to instability in training: it is either too hard to converge due to the moving target, or it converges to a sub-optimal value due to correlations between the target Q-values and predicted Q-values when the rewards are small or delayed. This issue is addressed in [2] with the use of a target DQN network, which is used to obtain the target Q-value estimate $y_i$ of the loss. The target DQN network is a lazy copy of the main DQN network that is being trained to predict the Q-values. Lazy copy means that the target network parameters are updated with the main network parameters every \textit{target network update frequency} steps, so the target network is not trained; it just copies the parameter values of the main network every now and then. This strategy helps stabilize the training of DQNs. With this, the target estimate for the Q-values and the $y_i$ term of the loss function $L_i(\theta)$ are updated as below -
$$Q(s_{t+1}, a_{t+1}; \theta^{-}) = \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}; \theta^{-})$$
$L_i(\theta) = (y_i - Q(s_t, a_t; \theta))^2$, where $y_i = r$ for a terminal state and $y_i = r + \gamma Q(s_{t+1}, a_{t+1}; \theta^{-})$ for a non-terminal state. Here, $\theta$ are the parameters of the main DQN network being trained and $\theta^{-}$ are the parameters of the target DQN network.
One other trick used in [2] to stabilize training further is to clip the value of the loss $L_i(\theta)$ to be between -1 and 1. If the temporal difference between the target and predicted estimates is $x$ -
$$x = r + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}; \theta^{-}) - Q(s_t, a_t; \theta)$$
A drawback of the squared error loss $L_i(\theta)$ used above is that the higher the error difference $x$ between the target and predicted estimates, the more amplified the loss becomes, which in turn shifts the network parameters by a huge margin. This may be advantageous in a supervised learning setting, but it impacts the stability of DQN training since both the predictions and the target estimates are moving. The authors explore the absolute value loss $|x|$, which treats both high and low values of $x$ similarly by providing similar gradients. However, we would still prefer higher error differences to optimize the network parameters a bit more, just not at the scale of the squared error loss. A middle ground between these two requirements is the Huber loss, shown below, which uses the squared error loss if $|x| \leq 1$ and the absolute value loss otherwise.
$huber(x) = \dfrac{1}{2} x^2$ for $|x| \leq 1$; and $huber(x) = |x| - \dfrac{1}{2}$ otherwise.
Thus the above loss function is updated as - $L_i(\theta) = (y_i - Q(s_t, a_t; \theta))^2$, if $|y_i - Q(s_t, a_t; \theta)| \leq 1$, and
$L_i(\theta) = |y_i - Q(s_t, a_t; \theta)|$, if $|y_i - Q(s_t, a_t; \theta)| > 1$, where $y_i$ is the same as above.
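Putting the pieces together, a sketch of one training step with the target network and the Huber loss is shown below, continuing the TensorFlow/Keras sketches above; the variable names and the exact batching details are illustrative assumptions.

```python
import numpy as np
import tensorflow as tf

huber = tf.keras.losses.Huber()  # quadratic for small errors, linear beyond delta=1


def train_step(main_dqn, target_dqn, optimizer, batch, gamma=0.99):
    """One DQN update: build the target y_i with the target network and
    minimize the Huber loss between y_i and the predicted Q(s_t, a_t)."""
    states, actions, rewards, dones, next_states = batch
    states = tf.convert_to_tensor(np.array(states), dtype=tf.float32)
    next_states = tf.convert_to_tensor(np.array(next_states), dtype=tf.float32)
    rewards = np.array(rewards, dtype=np.float32)
    not_done = 1.0 - np.array(dones, dtype=np.float32)

    # y_i = r + gamma * max_a' Q(s_{t+1}, a'; theta^-) * (1 - d_t)
    max_next_q = tf.reduce_max(target_dqn(next_states), axis=1)
    targets = rewards + gamma * max_next_q * not_done

    with tf.GradientTape() as tape:
        q_values = main_dqn(states)
        # keep only the Q-value of the action actually taken in each tuple
        action_mask = tf.one_hot(actions, q_values.shape[1])
        predicted = tf.reduce_sum(q_values * action_mask, axis=1)
        loss = huber(targets, predicted)

    grads = tape.gradient(loss, main_dqn.trainable_variables)
    optimizer.apply_gradients(zip(grads, main_dqn.trainable_variables))
    return loss
```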
As explained above, DQN also uses the $\epsilon$-greedy policy to address the exploration vs exploitation problem. The $\epsilon$ value starts at $1$ until the replay buffer reaches 50000 entries, is annealed down to a minimum of $0.1$ as training progresses up to the 1 millionth frame, and stays at $0.1$ from there on.
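This schedule can be written as a simple piecewise-linear function of the frame count; the sketch below mirrors the numbers quoted above (function and parameter names are my own).

```python
def epsilon_schedule(frame: int,
                     replay_start: int = 50_000,
                     anneal_end: int = 1_000_000,
                     eps_start: float = 1.0,
                     eps_min: float = 0.1) -> float:
    """Piecewise-linear epsilon annealing: 1.0 while the buffer fills,
    linearly decayed to 0.1 by the 1 millionth frame, then held at 0.1."""
    if frame < replay_start:
        return eps_start
    if frame >= anneal_end:
        return eps_min
    fraction = (frame - replay_start) / (anneal_end - replay_start)
    return eps_start + fraction * (eps_min - eps_start)
```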
The same techniques described above are used in all the following algorithms except A3C, with only incremental changes in each of them, so the following sections will not be as verbose as this one. The DQN papers [1][2] also describe further parameter settings related to the Atari environment, which we will go over in the experiments sections below.
\textbf{Experiments and Results}
The first change in my DQN implementation compared to what was described in the overview section is the network architecture. Instead of using the architecture from [1], I tried the convolutional layers and the following fully connected layer from [5]. The network - a first Conv2D layer with 32 filters, kernel size of 8 and stride of 4, followed by a second Conv2D layer with 64 filters, kernel size of 4 and stride of 2, followed by a third Conv2D layer with 64 filters, kernel size of 3 and stride of 1, followed by a fourth Conv2D layer with 1024 filters, kernel size of 5 and stride of 1, followed by a fully connected layer with 1024 units; the output layer emits logits of size equal to the number of actions. The second change was to use the Adam optimizer instead of the RMSProp used in [1][2]. This convolutional architecture is the same for all the following DQN-based algorithms, and they use the Adam optimizer as well.
The model was trained with the hyper-parameter settings below for a total of \textit{max training frames}.
\textit{target network update frequency} = 10000
$\gamma$ = 0.99
\textit{replay buffer min size} = 50000
\textit{replay buffer max size} = 1000000
\textit{max training frames} = 1000000
\textit{no-op steps} = 30
\textit{action repeat} = 3
\textit{update frequency} = 4
\textit{learning rate} = 0.00025
\textit{mini batch} = 32
These hyper-parameter settings are the same for all the following DQN-based algorithms except for \textit{max training frames} and \textit{learning rate}.
The parameters related to interaction with the Atari environment are \textit{no-op steps}, which corresponds to the maximum number of "do nothing" actions (no-op) performed by the agent at the start of an episode, and \textit{action repeat}, which corresponds to repeating an action over that many frames. It was noted in [1] that \textit{action repeat} had to be clipped to a maximum of 3 for Space Invaders, as anything more could lead to missing the frames containing the bullet. It was not very clear from [2] whether \textit{no-op steps} and \textit{action repeat} had to be applied during both training and evaluation or only during evaluation, so I applied both only during evaluation, and the number of "do nothing" actions was sampled uniformly between 0 and \textit{no-op steps} each time.
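A sketch of how the evaluation-time no-op steps and action repeat can be wired around a Gym environment is shown below; the structure is my own, only the two parameters come from the report, NOOP is assumed to be action index 0, and `preprocess_frame` is the 88x80 grayscale helper sketched earlier.

```python
import random

import numpy as np


def evaluate_episode(env, model, noop_max=30, action_repeat=3):
    """One evaluation episode with random initial no-ops and action repeat."""
    state = env.reset()
    total_reward, done = 0.0, False

    # a random number of "do nothing" steps at the start of the episode
    for _ in range(random.randint(0, noop_max)):
        state, reward, done, _ = env.step(0)
        total_reward += reward
        if done:
            break

    while not done:
        obs = preprocess_frame(state)[np.newaxis, ..., np.newaxis]  # (1, 88, 80, 1)
        action = int(np.argmax(model.predict(obs, verbose=0)[0]))
        for _ in range(action_repeat):  # repeat the chosen action
            state, reward, done, _ = env.step(action)
            total_reward += reward
            if done:
                break
    return total_reward
```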
The DQNs output Q-values for each action - there are 6 actions $[\textit{'noop'}, \textit{'fire'}, \textit{'right'}, \textit{'left'}, \textit{'rightfire'}, \textit{'leftfire'}]$ in SpaceInvaders-v0, which is used to train and evaluate all the algorithms. Thus there is a Q-value associated with each one of these actions.
Two experiments were carried out for the DQN algorithm -
Experiment-1 (wandering-sponge-106) : To play SpaceInvaders-v0 without the \textit{no-op steps} and \textit{action repeat} suggestions given in [2].
Experiment-2 (worthy-tree-127) : To play SpaceInvaders-v0 with \textit{no-op steps} and \textit{action repeat}.
Both experiments were evaluated for 100 episodes. The visualizations below capture the \textit{episodic reward} and \textit{cumulative avg reward} plots for both experiments. The \textit{Avg Q Values} plot, which shows the average Q-value over 100 episodes (y-axis) for each of the 6 actions (x-axis), is only for the 2nd experiment. The episodic and cumulative rewards look more or less similar in both cases, with one episode scoring over 1000 in each case and cumulative average rewards of 279 and 273 for Experiments 1 and 2 respectively. One peculiar thing that can be observed from the gameplay videos is that as soon as the episode begins, or when the agent re-spawns, it races to the right of the window firing on its way and continues to fire after coming to a standstill. This corresponds to the actions \textit{'fire'} and \textit{'rightfire'}; the agent seems to believe that performing these two actions most of the time will maximize its long term rewards. This behavior is also backed up by the \textit{Avg Q Values} plot, where the Q-value is higher for action no 5, \textit{'rightfire'}, and action no 1, \textit{'fire'}. It might be that the long-standing issue with Q-learning, over-estimation of Q-values, caused this. The issue of over-estimation gets amplified with a larger number of actions, which in our case is 6. One reason why these two actions might have higher over-estimation than others is that during training we adjust the Q-value of only the selected action and not of all the actions; thus the more an action gets selected, the more its Q-value gets optimized through network parameters that can carry noise, if any, at the initial stages of training, which in turn amplifies the over-estimation. And from the looks of it, these two actions were likely picked more often during training. It is also worth noticing that both of these actions are high-reward-yielding, so there is also a play of exploitation over exploration here. There is a real possibility of the agent leaning towards exploitation over exploration: the agent was trained for only 1 million frames and the $\epsilon$ value of the $\epsilon$-greedy policy is annealed from 1 to 0.1 as training progresses. By the 500000th frame $\epsilon$ = 0.58, so we are already taking 42% of the decisions based on the estimated Q-values, and the experience replay buffer would only have around 500000 frames to sample from, which is still only halfway through. With the combination of over-estimated Q-values for high-reward-yielding actions and the minimal training, it is quite possible that the agent started exploiting only these 2 actions and got stuck at some local optimum from which it was never able to escape due to minimal exploration. Over-estimation of Q-values, which has been prevalent in many Q-learning applications, is addressed by the following algorithm. For the issue of minimal exploration, it seems a high $\epsilon$ value is needed at least until the experience replay buffer fills up, which means training for definitely more than 1000000 frames.
Finally, it looks like both \textit{no-op steps} and \textit{action repeat} did not have much impact on the agent: \textit{no-op steps} only matters at the beginning of the episode, forcing the agent to do nothing, and after the first few frames this parameter plays no role. And since the agent was mainly repeating only 2 actions, the \textit{action repeat} parameter did not play any role either, as even without it the agent would still be repeating one of these two actions.
Double Deep Q Networks (DDQN)
\textbf{Overview}
The incremental change brought in by Double DQN is not directly to DQN but to the underlying action-value function or Q-function, which has an issue because of its $\max$ operator. Due to the $\max$ operator, the Q-function is prone to over-estimation of Q-values when small noise is induced by the environment or other factors; in the case of DQN the noise can come from both the function approximator, i.e. the neural network parameters and activation functions, and the Atari environment. As described in [4], the number of actions that can be taken in the environment scales up these over-estimations: if the action-value function has a uniformly distributed random error $e$ in the interval $(-e, e)$, then the target over-estimation can reach up to $e\dfrac{m-1}{m+1}$, where $m$ is the number of actions. The authors in [4] propose to use Double Q-learning, described in [3], as a solution to reduce the over-estimation errors encountered in DQN. At a high level, Double Q-learning, as the name suggests, uses two Q-functions $Q^A(s,a)$ and $Q^B(s,a)$ to estimate the Q-values for the actions and ultimately choose the optimal action, instead of using a single Q-function as in Q-learning. This approach decouples the selection and evaluation of an action by using target estimates from two Q-functions, instead of performing both action selection and evaluation with target estimates from the same Q-function.
Including Double Q-learning in DQN is fairly simple, as DQN already has two separate deep neural networks: the main DQN network that is being trained and the target DQN network that is lazily synchronized with the main DQN network parameters every \textit{target network update frequency} steps. This combination gives rise to the Double DQN algorithm. As in Double Q-learning, in Double DQN the $\max$ operation in the target estimate gets decoupled into action selection using the main DQN network and action evaluation using the target DQN network.
So instead of computing the target as in DQN -
$Q(s_{t+1}, a_{t+1}; \theta^{-}) = \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}; \theta^{-})$ and $y_i = r + \gamma Q(s_{t+1}, a_{t+1}; \theta^{-})$, for a non-terminal state,
we will now do it as per below -
$Q^{Double}(s_{t+1}, a_{t+1};\theta^{-}) = Q(s_{t+1}, \arg\max_{a_{t+1}} Q(s_{t+1}, a_{t+1}; \theta); \theta^{-})$ and $y_i = r + \gamma Q^{Double}(s_{t+1}, a_{t+1};\theta^{-})$, for a non-terminal state. Here, $\theta$ are the parameters of the main DQN network being trained and $\theta^{-}$ are the parameters of the target DQN network.
As seen above, Double DQN removes the $\max$ operator from the target estimate; instead it selects the optimal action using the main DQN network and gets the target estimate for this optimal action from the target DQN network, thereby evaluating it. With this change to the original DQN there should be reduced over-estimation of the target Q-values.
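A sketch of the Double DQN target computation is shown below, continuing the earlier TensorFlow sketches; only the target construction changes relative to the earlier `train_step`, and the function name and batching details are illustrative.

```python
import numpy as np
import tensorflow as tf


def double_dqn_targets(main_dqn, target_dqn, rewards, dones, next_states, gamma=0.99):
    """Double DQN target: the main network selects the action (argmax),
    the target network evaluates it."""
    next_states = tf.convert_to_tensor(np.array(next_states), dtype=tf.float32)
    target_q = target_dqn(next_states)                       # evaluation, theta^-
    best_actions = tf.argmax(main_dqn(next_states), axis=1)  # selection, theta
    action_mask = tf.one_hot(best_actions, target_q.shape[1])
    selected_q = tf.reduce_sum(target_q * action_mask, axis=1)
    rewards = np.array(rewards, dtype=np.float32)
    not_done = 1.0 - np.array(dones, dtype=np.float32)
    return rewards + gamma * selected_q * not_done
```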
\textbf{Experiments and Results}
All the hyper-parameter and experiment settings are the same as described in the DQN section above. Here we will just go over the results and compare them with the DQN algorithm's results: Experiment-1 (radiant-lake-108) and Experiment-2 (pleasant-smoke-135). The episode rewards look similar for both experiments, with highs in the 700s in both cases, and the cumulative average rewards are very close at 212 and 214 for Experiments 1 and 2 respectively. A surprising change that can be noticed here is that the agent is no longer exploiting the actions \textit{'rightfire'} and \textit{'fire'}; in fact, from the \textit{Avg Q Values} plot we can notice that action no 0 \textit{'noop'}, action no 2 \textit{'right'} and action no 5 \textit{'leftfire'} have higher Q-values this time. This action distribution can also be observed in both gameplay videos - during the initial stage of the episode, while the barriers protecting the agent still exist, it prefers to stay underneath them by performing \textit{'noop'}; when most of the aliens are clustered on the right flank the agent moves quickly to the right; and when the aliens are clustered in the middle towards the end of the episode the agent prefers the action \textit{'leftfire'}.
It looks like with Double DQN the over-estimation of Q-values is reduced to an extent, and thus the agent had the chance to evaluate actions for their actual value returns. As evident from the plots and gameplay videos, the agent has now learnt a wait-and-proceed strategy instead of the reward-exploitation strategy seen in the DQN case. Thus Double DQN does seem to help the agent learn better. However, the average rewards for Double DQN are lower than those of DQN; this could again be attributed to the minimal training. From the gameplay video it is clear that the agent is playing the game better now than in the previous case, and it looks like it was in the process of learning when to use the action \textit{'noop'} more effectively, which is when it is underneath the barrier. The reason for the lower rewards is that the agent is no longer trying to exploit the high-reward-yielding actions, which previously led it to get stuck in a local optimum; instead it is exploring the action space much better due to the reduced over-estimation of Q-values, and given more training samples Double DQN should perform better than DQN.
Dueling Double Deep Q Networks (DDDQN)
\textbf{Overview}
The incremental change proposed in [5] over Double DQN is the introduction of a new type of neural network architecture called the Dueling network.
As shown in the diagram above on the left, from [5], the major change in the network is after the last convolutional layer: instead of connecting to a single fully connected layer, the output of the convolutional layers is split into two separate fully connected layers which, according to the authors, form the value stream, representing the state value $V(s)$, and the advantage stream, representing the state-dependent advantage of actions $A(s,a)$. As vividly shown in the diagram above on the right, again from [5] - in the top pair the value stream focuses on the horizon, as that is where new cars are expected to show up, whereas the advantage stream doesn't pay any attention, since it doesn't really matter which action is chosen when there are no cars. In the bottom pair the value stream still focuses on the horizon, as the cars right in front would have been accounted for in previous states; this time, however, the advantage stream pays attention, since it really matters which action is chosen now. This enables Dueling networks to learn which states are (or are not) valuable without having to learn the impact of each action in every state, which is not possible with a direct single action stream $Q(s, a)$. The fully connected value stream layer, which outputs a single logit, and the fully connected advantage stream layer, which outputs a logit for each action, are merged to obtain the estimates of the action-value function $Q(s,a)$. And as described previously, the estimates of $Q(s,a)$ are also per action.
From the background section we know that $Q(s,a) = V(s) + A(s,a)$. Since we are now dealing with neural networks we have to take the network parameters into account, so the equation can be re-written as below -
$Q(s,a;\theta,\alpha,\beta) = V(s;\theta,\beta) + A(s,a;\theta,\alpha)$, where $\theta$ are the parameters of the convolutional layers, $\alpha$ are the parameters of the fully connected advantage stream layer and $\beta$ are the parameters of the fully connected value stream layer. However, in practical applications where noise can be induced into the network estimates, we cannot just directly add $V(s)$ and $A(s, a)$ to obtain $Q(s, a)$, due to the issue of identifiability: given $Q(s, a)$ it is not possible to recover $V(s)$ and $A(s,a)$ uniquely. To address this issue the authors force the advantage function estimator to have zero advantage for the chosen action, where the chosen action is the one with maximum advantage, as shown below -
$Q(s,a;\theta,\alpha,\beta) = V(s;\theta,\beta) + \big(A(s,a;\theta,\alpha) - \max_{a^{'}} A(s,a^{'};\theta,\alpha)\big)$; however, the introduction of the $\max$ operator can cause some instability in training, and thus the mean is subtracted instead -
$Q(s,a;\theta,\alpha,\beta) = V(s;\theta,\beta) + \big(A(s,a;\theta,\alpha) - \dfrac{1}{|m|}\sum_{a^{'}} A(s,a^{'};\theta,\alpha)\big)$, where $m$ is the number of actions. This provides stability during training, with the drawback that the $V(s)$ and $A(s,a)$ target estimates are slightly off. This is the approach incorporated in the final fully connected layer of the Dueling network, which aggregates the value stream $V(s)$ and advantage stream $A(s,a)$ into the target estimates for the Q-values $Q(s,a)$.
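A sketch of the dueling head in Keras is shown below. The two 512-unit streams follow the experiments section of this report; `features` is assumed to be the flattened output of the convolutional stack (e.g. `tf.keras.layers.Flatten()(last_conv_output)`), and everything else is an illustrative choice.

```python
import tensorflow as tf


def dueling_head(features: tf.Tensor, num_actions: int) -> tf.Tensor:
    """Dueling aggregation: split into a value stream V(s) and an advantage
    stream A(s, a), then combine with the mean-subtraction trick from [5]."""
    value_stream = tf.keras.layers.Dense(512, activation="relu")(features)
    value = tf.keras.layers.Dense(1)(value_stream)              # V(s): single logit

    adv_stream = tf.keras.layers.Dense(512, activation="relu")(features)
    advantage = tf.keras.layers.Dense(num_actions)(adv_stream)  # A(s, a): one logit per action

    # Q(s, a) = V(s) + (A(s, a) - mean_a' A(s, a'))
    q_values = value + (advantage - tf.reduce_mean(advantage, axis=1, keepdims=True))
    return q_values
```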
\textbf{Experiments and Results}
The experiment settings are the same as described in the DQN section and the hyper-parameters are also the same, except for \textit{max training frames} = 3000000: Experiment-1 (clear-night-96) and Experiment-2 (generous-paper-134). The episode rewards and average rewards for the two experiments show some variance, with Experiment-1 reaching episode rewards as high as 800 while Experiment-2 reached 500, and with an average reward of 296 for Experiment-1 and 266 for Experiment-2. Surprisingly, the gameplay videos look pretty much the same as in the DQN case, and the \textit{Avg Q Values} plot backs that up, indicating that the value for action no 5 \textit{'rightfire'} is the highest again; however, this time action no 2 \textit{'right'} has the next highest value instead of the action \textit{'fire'}. It has to be noted that in this algorithm we are still using Double DQN, but with the Dueling network architecture. From the Double DQN section it is clear that over-estimation of Q-values was addressed, as the agent started to learn much better than in the DQN case. With the Dueling network architecture we seem to have gone back to the original DQN issue of exploiting a subset of actions throughout an episode, in this case \textit{'rightfire'} and \textit{'right'}.
Although it seems appealing to point at network complexity as the culprit behind the poor performance, in terms of actual network parameters not much has changed. The original single fully connected layer with 1024 units is split into two fully connected layers of 512 units each, which does not modify the parameter count of this layer; the only change arises from the connection on the value stream side, where 512 units are connected to a single value unit instead of six action units, which actually reduces the number of parameters by 2565. Thus the problem seems more related to the estimation of the state-value function and the advantage function. In policy gradient research it has been observed that introducing an advantage function can introduce higher bias into the model, which has led to new methods of estimating advantages, as in [8], aimed at lowering the amount of bias introduced. This is not to imply that the advantage estimation suggested in the Dueling architecture [5] is not robust; it seems that these advantage and state-value estimations work optimally when the agent encounters more frames, i.e. trains longer. In our case, training over 3 million frames seems to cause the model to exploit the few actions that appeared to work best over the span of training.
Dueling Double Deep Q Networks (DDDQN) with Prioritized Experience Replay (PER)
\textbf{Overview}
This algorithm is the same as the previous one except for the introduction of a new feature called Prioritized Experience Replay (PER), proposed in [6]. PER is a generic technique that can be included in any algorithm that uses an experience replay buffer, in our case all three preceding algorithms; however, here I tried it out only with the Dueling Double DQN algorithm. As the name suggests, in PER we prioritize the tuples of the replay buffer and favour those tuples from which there is more to learn, rather than sampling uniformly at random as in the vanilla experience replay approach described in the DQN section. The prioritization is measured using the temporal-difference error $|\delta|$: the higher this error, the more priority is given to the tuple. However, this way of prioritizing can lead to issues like loss of diversity and large gradient steps, which are addressed with stochastic prioritization and importance sampling respectively. The temporal-difference (TD) error associated with each experience replay tuple $i$, in the case of Double DQN, can be computed as below -
$\delta_i = r + \gamma Q(s_{t+1}, \arg\max_{a_{t+1}} Q(s_{t+1}, a_{t+1}; \theta); \theta^{-}) - Q(s_t, a_t; \theta)$, where $\theta$ are the main DQN network parameters and $\theta^{-}$ are the target DQN network parameters.
If we use $|\delta_i|$ directly and draw tuples from the experience replay buffer depending on its value, this gives rise to a greedy prioritization scheme in which only high-error samples are retrieved for training each time. This can be detrimental to the learning of the agent: if it takes many updates to reduce the error, the agent ends up seeing the same samples at each gradient update, which brings back the correlation issue (loss of diversity) that experience replay was supposed to overcome in the first place. Thus, instead, a probability is associated with each experience replay tuple $i$ as below -
$P(i) = \dfrac{p_{i}^{\alpha}}{\sum_k p_{k}^{\alpha}}$, where $p_i = |\delta_i| + \epsilon$; $\epsilon$ is a small positive constant which ensures $p_i > 0$.
Now, with the above probability associated with each experience tuple, it is ensured that each and every tuple, even the ones with $|\delta_i| = 0$, gets a chance of being picked. But this still has the greedy prioritization issue of drawing higher-probability experience tuples too often, which is overcome with the stochastic prioritization technique. In this technique we divide the probability distribution of the experience tuples into \textit{mini batch} segments and then sample one experience tuple uniformly from each segment. The segmenting ensures that all parts of the probability distribution are accounted for, and within each segment the tuples with higher probability get drawn more often than the ones with lower probability, thereby achieving a balanced as well as prioritized replay of experience. The introduction of $\alpha$ allows us to control how much prioritization is applied: if $\alpha=0$ then all the tuples end up having the same probability of being drawn, so there is no prioritization, and if $\alpha=1$ it becomes fully prioritized sampling. In [6] it is observed that $\alpha=0.6$ seems to work well for the Atari environment.
The next issue that needs to be addressed is the large gradient steps. Since in PER we make the model learn from high-error transitions more often, larger gradient steps are taken more often, which can be disruptive for the DQN network due to the moving nature of its target estimates. Thus the gradient steps are softly scaled down (rather than being clipped between a min and max, they are multiplied by a weight that decreases with the sampling probability) using the importance sampling weights shown below -
$w_i = \left(\dfrac{1}{N} \cdot \dfrac{1}{P(i)}\right)^{\beta}$, where $N$ is the current number of entries in the experience replay buffer.
These importance sampling weights are normalized before being used. And the gradient scaling was not done by multiplying the weights directly into the gradients, but instead by scaling the loss by the weights.
$$w_i = \dfrac{w_i}{\max(w)}$$
Now, the above details the concept of PER, but how can it be implemented efficiently? The \textit{replay buffer max size} can reach up to 1000000 entries and we cannot use an $\mathcal{O}(N)$ method. In [6] it is suggested to use a sum tree data structure, explained in [11], which makes PER very efficient to implement. At a high level, a sum tree is a binary tree where each leaf node holds a priority value and each parent node holds the sum of its children; thus the root node holds the sum of all the priority values $\sum_k p_k$. For retrieval we traverse from the root node, moving to the left or right child depending on the priority value, until we reach a leaf node and return it; thus retrieval takes $\mathcal{O}(\log(N))$. For updating the value of a node we start at the leaf node whose priority value has to be updated, compute the difference, and propagate this change up through each parent until we reach the root node; thus updating a value also takes $\mathcal{O}(\log(N))$. The leaf nodes can hold a tuple $(p_i, idx)$, where $p_i$ is the priority value of the experience tuple with index $idx$ in the replay buffer.
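A compact array-based sum tree sketch is shown below; the array layout and method names are illustrative, only the retrieve/update behaviour follows the description above.

```python
import numpy as np


class SumTree:
    """Array-backed binary sum tree: leaves hold priorities, internal nodes
    hold the sum of their children, the root holds the total priority."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.tree = np.zeros(2 * capacity - 1)  # internal nodes + leaves
        self.write = 0                           # next replay index to write

    def total(self) -> float:
        return self.tree[0]

    def add(self, priority: float) -> int:
        """Add a new entry, overwriting the oldest when full; return its replay index."""
        idx = self.write
        self.update(idx, priority)
        self.write = (self.write + 1) % self.capacity
        return idx

    def update(self, idx: int, priority: float) -> None:
        """Set the priority of replay entry `idx` and propagate the change
        up to the root (O(log N))."""
        tree_idx = idx + self.capacity - 1
        change = priority - self.tree[tree_idx]
        self.tree[tree_idx] = priority
        while tree_idx != 0:
            tree_idx = (tree_idx - 1) // 2
            self.tree[tree_idx] += change

    def retrieve(self, value: float):
        """Walk down from the root using `value` in [0, total()) and return
        (replay index, priority) of the leaf reached (O(log N))."""
        tree_idx = 0
        while True:
            left, right = 2 * tree_idx + 1, 2 * tree_idx + 2
            if left >= len(self.tree):            # reached a leaf
                break
            if value <= self.tree[left]:
                tree_idx = left
            else:
                value -= self.tree[left]
                tree_idx = right
        return tree_idx - (self.capacity - 1), self.tree[tree_idx]
```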
The high-level implementation details of PER are as follows. We maintain the sum tree as a separate class that holds the $(p_i, idx)$ members and the retrieval and update functions. While adding or updating each experience tuple $e_t = (s_t, a_t, r_t, d_t, s_{t+1})$ in the experience replay buffer, we calculate the priority value $p_i = |\delta_i| + \epsilon$, which is passed on to the sum tree node associated with the $idx$ of $e_t$; that node then carries out the update procedure described above. Samples are retrieved by first dividing the probability distribution into equal ranges or segments; this is done by dividing the sum tree's root value by \textit{mini batch}, i.e. $\textit{segment range len} = \dfrac{\sum_k p_k}{\textit{mini batch}}$. Then, from 0 to $\sum_k p_k$, within every \textit{segment range len} we uniformly sample a priority value, and a sum tree node is retrieved based on this value. From the retrieved node we can access the experience tuple using its $idx$ member. During this retrieval process the importance sampling weights $w_i$ are also calculated and sent along with the experience tuples for training. During training the importance sampling weights $w_i$ are passed as a scaling factor to the loss function. After training, the new TD error $|\delta_i|$ is computed for the sampled \textit{mini batch} experience tuples $e_t$ and the priority values of only those sampled tuples are updated.
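Building on the SumTree sketch, the segment-based sampling and the importance-sampling weights might look as follows; this is again an illustrative sketch, not the exact code used for the experiments, and it assumes the buffer already holds at least one mini-batch worth of entries with non-zero priorities.

```python
import random

import numpy as np


def sample_prioritized(tree, replay_buffer, batch_size=32, beta=0.4):
    """Sample one tuple per equal-priority segment and compute normalized
    importance-sampling weights w_i = (1 / (N * P(i)))^beta / max(w)."""
    indices, priorities, batch = [], [], []
    segment = tree.total() / batch_size          # segment range len
    for k in range(batch_size):
        value = random.uniform(k * segment, (k + 1) * segment)
        idx, priority = tree.retrieve(value)
        indices.append(idx)
        priorities.append(priority)
        batch.append(replay_buffer[idx])

    n = len(replay_buffer)
    probs = np.array(priorities) / tree.total()  # P(i)
    weights = (1.0 / (n * probs)) ** beta
    weights /= weights.max()                     # normalize by the max weight
    # `indices` is kept so the priorities of these tuples can be updated
    # after the training step via tree.update(idx, new_priority).
    return indices, batch, weights
```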
\textbf{Experiments and Results}
The experiment settings are the same as described in the Dueling Double DQN section, and the hyper-parameters are also the same except for \textit{learning rate} = 0.000625; the additional new ones are $\alpha$ = 0.6 and $\beta$ = 0.4. It should be noted that $\beta$ is annealed from its initial value to 1.
The episodic rewards and average rewards for both experiments look similar. From the gameplay videos it looks like the agent is trying out all possible actions, which is also reflected in the \textit{Avg Q Values} plot, where the Q-values show only small variation between the actions, except for \textit{'noop'}. Thus, similar to how Double DQN overcame DQN's exploitation issue, Dueling Double DQN with PER has overcome the exploitation issue of Dueling Double DQN without PER. Since in PER the agent is trained on the experience tuples with higher temporal-difference error, this seems to help the agent navigate the action space better. The reason for the sub-optimal average rewards lies in the fact that we do stochastic prioritization by sampling from an equally segmented probability distribution: tuples with lower priority in the higher segments are drawn far less often than tuples with higher priority in the lower segments, which can be an issue since the first set of tuples have higher temporal-difference error than the second set. However, this would not be an issue if the agent went through many more frames than just 3000000, because that would give the lower-priority frames in the higher segments sufficient opportunity to be picked a sufficient number of times.
Asynchronous Advantage Actor-Critic (A3C) Method
\textbf{Overview}
A3C, introduced in [7], belongs to the actor-critic family of algorithms, where the actor is trying to learn and the critic is assisting the actor in learning. The policy gradient algorithm is one of the original actor-critic algorithms, where we try to directly learn the optimal policy (actor) using a baseline state-value function (critic). In the previous DQN methods we learned the action-value function or Q-function estimates and derived a policy from them; this is one major difference between the previous four algorithms and A3C. The policy gradient algorithm can be summarized as follows - compute the policy gradient $\nabla_{\theta}\log\pi(s_t, a_t; \theta)A^{'}_t$, where $A^{'}_t = R_t - b(s_t)$ is the advantage estimate and $R_t = \sum_{t^{'}=t}^{T-1}\gamma^{t^{'}-t} r_{t^{'}}$ is the total discounted reward. Here, $R_t$ and $A^{'}_t$ are computed by iterating over \textit{t-max} timesteps. The Advantage, Actor and Critic parts of A3C can be seen in the policy gradient expression, where $\log\pi(s_t, a_t; \theta)$ is the Actor, $A^{'}_t$ is the Advantage and $b(s_t)$ is the Critic. At a high level we can assume that $R_t \approx Q(s_t, a_t)$ and $b(s_t) \approx V(s_t)$, which then gives rise to the advantage estimate we know from the previous algorithms.
The Asynchronous part of A3C refers to spawning multiple agents which execute in parallel on different CPU cores, each with its own environment and model. We no longer need the experience replay buffer, as the issue of "training data correlation" is addressed by each agent interacting with its own environment and computing the loss with respect to its own model parameters; however, the gradient updates are not applied to its own model but to a shared global model. This global model is shared among all the agents, so the experience gained by each agent also gets shared with the others, which speeds up learning. Before starting a training iteration the agents synchronize their local model parameters with the global model's, thereby gaining access to the experience of all the other agents. The gradient updates to the shared global model are done in Hogwild fashion, which means that multiple agents can simultaneously update the global model's parameters and it should still be fine in the long run. The elimination of the experience replay buffer greatly reduces the memory requirements, as a million experience tuples need ~9 GB of RAM.
The network suggested in [7] is as follows - a first convolutional layer with 16 filters, kernel size of 8 and stride 4, followed by a second convolutional layer with 32 filters, kernel size of 4 and stride 2, followed by a fully connected layer with 256 units, with all the layers using ReLU activation. The 256-unit output of the fully connected layer is then shared across two fully connected layers, just like in the Dueling DQN: the first one represents the actor or policy stream, which emits a softmax output over the actions, and the second one represents the value stream, emitting a single logit as output.
In A3C we don't train over \textit{mini-batch} samples; instead, training is done over a smaller trajectory of \textit{t-max} timesteps, or until a terminal state is reached, whichever comes first. Before each training iteration the agent's local model parameters are synchronized with the shared global model. At a high level the training iteration and loss computation can be summarized as follows. Until the trajectory ends: pass the current environment state to the model to obtain the policy and value estimates $(\pi_t, b(s_t))$, get $\log\pi_t$ from $\pi_t$, calculate the entropy $entropy_t = \sum \pi \cdot (-\log\pi)$, sample an action $a_t$ from the probability distribution $\pi_t$, step through the environment with the selected $a_t$ and receive the $(\textit{new-state}, r_t, \textit{terminal})$ tuple from the environment. Once the trajectory ends, check whether the last state is a \textit{terminal} state; if yes, the target reward $R$ is initialized with $0$, otherwise it is initialized with the $b(s_t)$ value of the last state. The $b(s_t)$, $\log\pi_t$, $entropy_t$ and $r_t$ are stored in a list for each timestep of the trajectory. Now the target reward is computed as $R = r_t + \gamma R$, accumulated backwards over the trajectory, the advantage is computed as $A^{'}_t = R - b(s_t)$, the value loss is computed as $L_i(V_i; \theta) = \sum \dfrac{A^{'2}_t}{2}$, the TD residual $\delta_t = r_t + \gamma b(s_{t+1}) - b(s_t)$ is computed as described in [8], the generalized advantage estimate described in [8] is computed as $gA^{'}_t = \gamma \lambda \delta_t$, where $\lambda$ is a new constant, and finally the policy loss is $L_i(\pi_i;\theta) = -(\log\pi_t \, gA^{'}_t + \beta\, entropy_t)$, where $\beta$ is a new constant; the loss is negative because we are trying to maximize the target reward by performing gradient ascent in policy gradient methods. The final loss is a linear combination of $L_i(\pi_i; \theta)$ and $L_i(V_i; \theta)$, which in our case was $L_i(\theta) = \dfrac{L_i(V_i;\theta)}{2} + L_i(\pi_i;\theta)$. The gradients are then computed with respect to $L_i(\theta)$ and the local model parameters; however, the optimization step or gradient update is applied to the global model's parameters. This completes one trajectory-based training iteration of an agent in the A3C algorithm. One important point to note here is the use of $entropy_t$ - in the previous DQN approaches the exploration vs exploitation problem was addressed using the $\epsilon$-greedy policy, whereas in A3C $entropy_t$ is used to address it. Higher $entropy_t$ implies more exploration, as the probability distribution is spread evenly across all the actions, and lower $entropy_t$ implies exploitation, as the probability distribution is concentrated on a few actions.
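A framework-agnostic numpy sketch of the per-trajectory loss terms described above is shown below. The backward accumulation of $R$ and the standard GAE recursion from [8] are my reading of the summations above (the exact accumulation used in the actual training code may differ), and the gradient/optimizer plumbing to the global model is omitted.

```python
import numpy as np


def a3c_trajectory_losses(rewards, values, log_probs, entropies,
                          bootstrap_value, gamma=0.99, lam=0.94, beta=1.0):
    """Compute the value loss and policy loss for one t-max trajectory.

    `values`, `log_probs`, `entropies` are per-step lists from the local model;
    `log_probs` holds log pi(a_t | s_t) of the actions actually taken.
    `bootstrap_value` is 0 for a terminal state, otherwise b(s) of the last state.
    """
    T = len(rewards)
    returns = np.zeros(T)
    gae = np.zeros(T)
    R = bootstrap_value
    next_value = bootstrap_value
    running_gae = 0.0
    for t in reversed(range(T)):
        R = rewards[t] + gamma * R                              # discounted target reward
        returns[t] = R
        delta = rewards[t] + gamma * next_value - values[t]     # TD residual from [8]
        running_gae = delta + gamma * lam * running_gae         # GAE recursion
        gae[t] = running_gae
        next_value = values[t]

    advantages = returns - np.array(values)                     # A'_t = R_t - b(s_t)
    value_loss = 0.5 * np.sum(advantages ** 2)                  # L(V) = sum A'^2 / 2
    policy_loss = -np.sum(np.array(log_probs) * gae + beta * np.array(entropies))
    total_loss = 0.5 * value_loss + policy_loss                 # L = L(V)/2 + L(pi)
    return total_loss, value_loss, policy_loss
```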
\textbf{Experiments and Results}
For A3C, just a single experiment of playing SpaceInvaders-v0 without \textit{no-op steps} and \textit{action repeat} was carried out. The hyper-parameter settings are as below -
$\gamma$ = 0.99
$\lambda$ = 0.94
$\beta$ = 1
\textit{t-max} = 5
\textit{T-MAX} = 12800000
\textit{learning rate} = 0.0001
\textit{Num-Of-Agents} = 8
The A3C algorithm was trained for a total of \textit{T-MAX} frames with \textit{Num-Of-Agents} agents running on parallel threads. In the code, rather than sharing a global counter for \textit{T-MAX}, each agent was limited to run for only 1600000 frames. The results of the experiment are plotted below; there was a single high episode score of 640 and the average reward over 100 episodes was 75. The A3C algorithm is supposed to be trained for at least 100 million frames by 8-16 agents in parallel, but I was able to train it for only 12.8 million frames, hence the dismal scores. However, the gameplay video for the 600-score episode shows that the agent seems to explore the action space pretty well, and given more training A3C should perform better than the previous DQN approaches.
Conclusion
As part of this competition I tried out five different Deep Reinforcement Learning algorithms - DQN, Double DQN, Dueling Double DQN, Dueling Double DQN with PER and Asynchronous Advantage Actor-Critic (A3C). Although Dueling Double DQN and DQN performed the best out of the five in my experiments, with average rewards of 296 and 279 respectively, they certainly did not achieve that through reasonable exploration of the action space; instead, the agents in both cases settled for exploiting the high-reward-yielding actions. The observations with Double DQN and Prioritized Experience Replay were promising even though their average rewards do not reflect it: both methods reversed the exploitation problem faced by their respective preceding algorithms. The current state-of-the-art method is A3C, but I was not able to achieve competitive scores due to the limited training of the agent.
References
[1] Playing Atari with Deep Reinforcement Learning - Mnih, Volodymyr et al.
[2] Human-level control through deep reinforcement learning - Mnih, Volodymyr et al.
[3] Double Q-learning - Hasselt, Hado
[4] Deep Reinforcement Learning with Double Q-learning - Hasselt, Hado et al.
[5] Dueling Network Architectures for Deep Reinforcement Learning - Wang, Ziyu et al.
[6] Prioritized Experience Replay - Schaul, Tom et al.
[7] Asynchronous Methods for Deep Reinforcement Learning - Mnih, Volodymyr et al.
[8] High-Dimensional Continuous Control Using Generalized Advantage Estimation - Schulman, John et al.
[9] https://towardsdatascience.com/ - many Reinforcement Learning posts related to DQNs
[10] https://medium.com/ - many Reinforcement Learning posts related to DQNs
[11] https://www.fcodelabs.com/2019/03/18/Sum-Tree-Introduction/ - introduction to the sum tree data structure
[12] https://danieltakeshi.github.io/ - posts related to A3C, GAE, frame skipping and policy gradients