Lunar Lander - OpenAI
(Figure: sample rollout)
Introduction
In this report I showcase and analyze an implementation for solving OpenAI’s LunarLander-v2 Gym Environment using a Deep Q-Network (DQN) learning strategy. Section I introduces the learning problem and experiment goals. Section II defines the algorithm implemented and the associated hyperparameters used for tuning. Section III analyzes the results of my most successful parameter combination. Section IV describes and analyzes the effects of changing various hyperparameters. Section V summarizes the experiment and discusses areas of further research.
I Problem Definition
OpenAI maintains Gym, a Python library for experimenting with reinforcement learning techniques. Gym contains a variety of environments, each with its own characteristics and challenges. For this project, I used the LunarLander-v2 environment. The objective of LunarLander is to safely land a spaceship between two flags. The LunarLander environment is:
- Fully Observable: All necessary state information is observed at every frame.
- Single Agent: There is no competition or cooperation.
- Deterministic: There is no stochasticity in the effects of actions or the rewards obtained.
- Episodic: The reward is dependent only on the current state and action.
- Discrete Action Space: There are only 4 discrete actions: fire the main engine, fire the left engine, fire the right engine, or do nothing.
- Static: There is no penalty or state change during action deliberation.
- Finite Horizon: The episode terminates after a successful landing, a crash, or 1000 steps.
- Mixed Observation Space: Each observation contains 8 values:
- (Continuous): X distance from target site
- (Continuous): Y distance from target site
- (Continuous): X velocity
- (Continuous): Y velocity
- (Continuous): Angle of ship
- (Continuous): Angular velocity of ship
- (Binary): Left leg is grounded
- (Binary): Right leg is grounded
Finally, a reward is granted after each step. The total reward of an episode is the sum of the rewards for all steps within that episode. Moving from the top of the screen to the landing pad with zero speed is worth between 100 and 140 points; moving away from the landing pad loses the same reward that moving the same distance toward the pad would gain. The episode receives an additional -100 or +100 points for crashing or landing, respectively. Grounding a leg is worth 10 points, and firing the main engine costs 0.3 points per frame. An episode score of 200 or more is considered a solution.
Formally, the environment $E$ can be defined by a functional mapping $E_s(a) \rightarrow (r, s', d)$. The environment at state $s$, written $E_s$, takes an action $a$ and returns a reward $r$, the next state $s'$, and a flag $d$ indicating episode termination. For convenience, I will refer to this set as an experience tuple: $\langle s, a, r, s', d \rangle$.
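To make this mapping concrete, the short rollout below collects experience tuples from one episode under a random placeholder policy. It is a minimal sketch assuming the pre-0.26 gym API, in which step() returns (observation, reward, done, info).

```python
import gym

# Minimal illustration of E_s(a) -> (r, s', d) and the experience tuple <s, a, r, s', d>.
# Assumes the classic (pre-0.26) gym API where step() returns (obs, reward, done, info).
env = gym.make("LunarLander-v2")

s = env.reset()
done = False
experiences = []      # accumulated <s, a, r, s', d> tuples
episode_reward = 0.0

while not done:
    a = env.action_space.sample()        # placeholder policy: a random action
    s_next, r, done, _ = env.step(a)     # the environment mapping E_s(a) -> (r, s', d)
    experiences.append((s, a, r, s_next, done))
    episode_reward += r
    s = s_next

print(f"Episode ended after {len(experiences)} steps with total reward {episode_reward:.1f}")
```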
Consider a policy $\pi(s) \rightarrow a$ which returns an action for a given state. The cumulative value of an episode can be defined by:

$$V^\pi(s) = \sum_{t=0}^{T} E_{s_t}(\pi(s_t))_{(r)}, \qquad s_0 = s,\quad s_{t+1} = E_{s_t}(\pi(s_t))_{(s')}$$

where $s$ denotes the starting state, $E_s(a)_{(r,s',d)}$ denotes the respective component of the returned experience tuple, and $T$ is the step at which the termination flag $d$ is returned true. For this project, my objective is to create an agent capable of learning a policy which earns more than 200 points on average across 100 games.
II Experiment Design: DQN Algorithm & Model Architecture
I approached this problem in a similar fashion to Mnih et al. [1], using Deep Q-Network (DQN) learning. The first step is to reframe the problem as an MDP, solvable by Bellman's equations: $V^\pi(s) = \max_a \big( R(s,a) + \gamma \sum_{s' \in S} T(s,a,s') V^\pi(s') \big)$. Since this is a deterministic environment, $T(s,a,s')$ is $0$ for all but a single $s'$, for which it is $1$. We can therefore simplify the equation to: $V^\pi(s) = \max_a \big( R(s,a) + \gamma V^\pi(E_s(a)) \big)$. For simplicity, I will write $E_s(a)$ as $s'$. We can now define the Q-form of the equations:

$$Q(s, a) = R(s, a) + \gamma \max_{a'} Q(s', a')$$
Next, since the state space is infinite (continuous), I generalize $Q$ by leveraging a neural net parameterized by $\theta$. This turns calls of $Q(s,a)$ into $Q(s,a;\theta)$. Now, I can define the error as: $\text{error} = Q(s,a;\theta) - (r + \gamma \max_{a'} Q(s',a';\theta^-))$. This says that the error is the difference between my current Q value (predicted) and the reward plus the next state's best Q value (target). It is now trivial to calculate losses on a batch of experience tuples, for example the mean squared error loss: $L(\theta) = \mathbb{E}_{(s,a,r,s')}\big[ (r + \gamma \max_{a'} Q(s',a';\theta^-) - Q(s,a;\theta))^2 \big]$. Note that $\theta^-$ refers to a second set of target weights, which are periodically synchronized with $\theta$ at a frequency discussed later. Now that I have established a parameterized, differentiable Q function for which I can calculate a loss, all that is left is to define a training procedure that balances exploration and exploitation. I leveraged a procedure very similar to the one described by Mnih et al. [1].
TODO
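In lieu of the full listing, the following is a simplified sketch of that training loop: initialize a replay buffer, act ε-greedily, store experience tuples, sample random minibatches, regress toward $r + \gamma \max_{a'} Q(s',a';\theta^-)$, and periodically copy $\theta$ into $\theta^-$. It assumes the pre-0.26 gym API, the hyperparameter values shown are placeholders rather than the tuned values discussed later, and for brevity it uses a plain dense network and overwrites the taken-action targets instead of the action-mask input described in the next section.

```python
import random
from collections import deque

import gym
import numpy as np
from tensorflow import keras

# Placeholder hyperparameters -- the values actually used are discussed later.
GAMMA = 0.99
EPS_START, EPS_END, EPS_DECAY = 1.0, 0.01, 0.995
BATCH_SIZE = 64
BUFFER_SIZE = 100_000
TARGET_UPDATE_FREQ = 1_000   # steps between copying theta into theta^-
NUM_EPISODES = 500


def build_model(n_obs: int, n_actions: int) -> keras.Model:
    """A plain Q-network: observation in, one Q(s, a) estimate per action out."""
    return keras.Sequential([
        keras.layers.Input(shape=(n_obs,)),
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dense(n_actions, activation="linear"),
    ])


env = gym.make("LunarLander-v2")
n_obs = env.observation_space.shape[0]
n_actions = env.action_space.n

model = build_model(n_obs, n_actions)            # theta
target_model = build_model(n_obs, n_actions)     # theta^-
target_model.set_weights(model.get_weights())
model.compile(optimizer=keras.optimizers.Adam(1e-3), loss="mse")

buffer = deque(maxlen=BUFFER_SIZE)
epsilon, step_count = EPS_START, 0

for episode in range(NUM_EPISODES):
    s = env.reset()
    done = False
    while not done:
        # Epsilon-greedy action selection.
        if random.random() < epsilon:
            a = env.action_space.sample()
        else:
            a = int(np.argmax(model.predict(s[None, :], verbose=0)[0]))

        s_next, r, done, _ = env.step(a)
        buffer.append((s, a, r, s_next, done))
        s = s_next
        step_count += 1

        if len(buffer) >= BATCH_SIZE:
            batch = random.sample(buffer, BATCH_SIZE)
            states = np.array([e[0] for e in batch])
            actions = np.array([e[1] for e in batch])
            rewards = np.array([e[2] for e in batch])
            next_states = np.array([e[3] for e in batch])
            dones = np.array([e[4] for e in batch], dtype=np.float32)

            # Target: r + gamma * max_a' Q(s', a'; theta^-), with no bootstrap on terminal states.
            next_q = target_model.predict(next_states, verbose=0).max(axis=1)
            targets = model.predict(states, verbose=0)
            targets[np.arange(BATCH_SIZE), actions] = rewards + GAMMA * next_q * (1.0 - dones)
            model.train_on_batch(states, targets)

        if step_count % TARGET_UPDATE_FREQ == 0:
            target_model.set_weights(model.get_weights())

    epsilon = max(EPS_END, epsilon * EPS_DECAY)
```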
Next, we define the model architecture which estimates $Q(s,a;\theta)$. I implemented my model using the Keras high-level modeling API and parameterized the model shape as well. The model can be summarized by:
TODO
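Since the summary itself is omitted here, the sketch below shows a representative builder with the layer shape exposed as a parameter; the (64, 64) hidden widths are purely illustrative and not the tuned configuration.

```python
from tensorflow import keras


def build_q_network(hidden_layers=(64, 64), n_obs=8, n_actions=4):
    """Conceptual Q-network: 8-dimensional observation in, one Q(s, a; theta) per action out."""
    model = keras.Sequential(name="q_network")
    model.add(keras.Input(shape=(n_obs,)))
    for width in hidden_layers:
        model.add(keras.layers.Dense(width, activation="relu"))
    model.add(keras.layers.Dense(n_actions, activation="linear"))
    return model


build_q_network().summary()
```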
Note that the actual model implementation is more complicated. In fact, I accept an additional action input which is a one-hot encoding of the action. This is multiplied element-wise with the output of the final dense layer. This allows me to use the default Keras optimizer and loss function by ensuring that the only neuron in the final dense layer that contributes to the error (and therefore the only one to receive backpropagated gradients) is the neuron corresponding to the action in the experience tuple during training. At inference time, I send a vector of all ones to the action input, which allows the model to output 4 estimates, one for each action's $Q(s,a)$ value. I then simply take the argmax of that vector to get the predicted best action, or take the max to get the predicted best value.
TODO
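A sketch of that two-input construction follows; the layer sizes are illustrative, not the actual configuration. The one-hot action mask multiplies the final dense layer's output so that, during training, only the taken action's Q estimate contributes to the default Keras MSE loss, while an all-ones mask at inference time returns all four Q values.

```python
import numpy as np
from tensorflow import keras

N_OBS, N_ACTIONS = 8, 4

# Two inputs: the observation and a mask over actions.
state_in = keras.layers.Input(shape=(N_OBS,), name="state")
action_mask = keras.layers.Input(shape=(N_ACTIONS,), name="action_mask")

x = keras.layers.Dense(64, activation="relu")(state_in)
x = keras.layers.Dense(64, activation="relu")(x)
q_values = keras.layers.Dense(N_ACTIONS, activation="linear")(x)

# Element-wise multiply: with a one-hot mask during training, only the neuron
# for the taken action contributes to the error (and receives gradient).
masked_q = keras.layers.Multiply()([q_values, action_mask])

model = keras.Model(inputs=[state_in, action_mask], outputs=masked_q)
model.compile(optimizer="adam", loss="mse")

# Inference: pass an all-ones mask so every action's Q value is returned.
states = np.zeros((1, N_OBS), dtype=np.float32)           # placeholder observation
ones_mask = np.ones((1, N_ACTIONS), dtype=np.float32)
q = model.predict([states, ones_mask], verbose=0)[0]
best_action = int(np.argmax(q))     # predicted best action
best_value = float(np.max(q))       # predicted best value
```

During training, the corresponding regression target would presumably be the one-hot mask scaled by $r + \gamma \max_{a'} Q(s',a';\theta^-)$, so the masked output is penalized only at the selected action.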
III Results