
Lunar Lander - OpenAI

In this report, we analyze how a Deep Q-Network (DQN) can effectively solve OpenAI's LunarLander-v2 Gym environment. The original modeling and study were done in Spring 2019; this report adapts that work with a particular focus on instrumenting the experimentation harness with WandB's experiment tracking and hyperparameter tuning libraries.
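The snippet below is a rough, hypothetical sketch of that kind of instrumentation; the project name, config values, metric keys, and sweep parameters are illustrative, not the exact configuration used in these runs.

```python
import wandb

# Hypothetical project name, config values, and metric keys, for illustration only.
run = wandb.init(project="lunar-lander-dqn", config={
    "gamma": 0.99,
    "learning_rate": 1e-3,
    "epsilon_decay": 0.995,
})

# Inside the training loop, per-episode metrics are logged like this:
wandb.log({"episode_reward": 123.4, "epsilon": 0.42})

# A hyperparameter sweep is defined declaratively and run by an agent.
sweep_config = {
    "method": "random",
    "metric": {"name": "episode_reward", "goal": "maximize"},
    "parameters": {"learning_rate": {"values": [1e-3, 5e-4, 1e-4]}},
}
# sweep_id = wandb.sweep(sweep_config, project="lunar-lander-dqn")
# wandb.agent(sweep_id, function=train)   # `train` would be the training entry point
```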

sample_rollout (media panel: rollout videos at steps 0, 150, 300, and 400 for the selected runs)
Run sets: Sweep: Big Sweep 1 (197 runs) · Sweep: Big Sweep 2 (0 runs)


Introduction

In this report I showcase and analyze an implementation for solving OpenAI’s LunarLander-v2 Gym Environment using a Deep Q-Network (DQN) learning strategy. Section I introduces the learning problem and experiment goals. Section II defines the algorithm implemented and the associated hyperparameters used for tuning. Section III analyzes the results of my most successful parameter combination. Section IV describes and analyzes the effects of changing various hyperparameters. Section V summarizes the experiment and discusses areas of further research.

I Problem Definition

OpenAI maintains gym, a Python library for experimenting with reinforcement learning techniques. Gym contains a variety of environments, each with its own characteristics and challenges. For this project, I used the LunarLander-v2 environment. The objective of LunarLander is to safely land a spaceship between two flags. The LunarLander environment is:

  • Fully Observable: All necessary state information is observed at every frame.
  • Single Agent: There is no competition or cooperation.
  • Deterministic: There is no stochasticity in the effects of actions or the rewards obtained.
  • Episodic: The reward is dependent only on the current state and action.
  • Discrete Action Space: There are only 4 discrete actions: fire the main engine, fire the left engine, fire the right engine, or do nothing.
  • Static: There is no penalty or state change during action deliberation.
  • Finite Horizon: The episode terminates after a successful landing, a crash, or 1000 steps.
  • Mixed Observation Space: Each observation contains 8 values:
    • (Continuous): X distance from target site
    • (Continuous): Y distance from target site
    • (Continuous): X velocity
    • (Continuous): Y velocity
    • (Continuous): Angle of ship
    • (Continuous): Angular velocity of ship
    • (Binary): Left leg is grounded
    • (Binary): Right leg is grounded

Finally, a reward is granted after each step, and the total reward of an episode is the sum of the rewards for all steps within that episode. Moving from the top of the screen to the landing pad and coming to rest earns between 100 and 140 points; if the lander moves away from the landing pad, it loses the reward it would have gained by moving the same distance toward the pad. The episode receives an additional -100 or +100 points for crashing or landing, respectively. Each leg that touches the ground is worth 10 points, and firing the main engine costs 0.3 points per frame. An episode score of 200 or more is considered a solution.
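For concreteness, a minimal interaction with the environment looks roughly like the following, using the classic gym reset/step API that was current when the original 2019 work was done; the random policy is only a stand-in for the learned agent.

```python
import gym

env = gym.make("LunarLander-v2")

state = env.reset()                          # 8-value observation described above
done, total_reward = False, 0.0
while not done:
    action = env.action_space.sample()       # random stand-in for a learned policy
    state, reward, done, info = env.step(action)
    total_reward += reward                   # episode score; 200+ counts as solved

print(f"episode score: {total_reward:.1f}")
```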

Formally, the environment $E$ can be defined by a functional mapping $E_s(a) \rightarrow (r, s', d)$. The environment at state $s$, $E_s$, takes an action $a$ and returns a reward $r$, a next state $s'$, and a flag $d$ indicating episode termination. For convenience, I will refer to this set as an experience tuple: $\langle s, a, r, s', d \rangle$.

Consider a policy $\pi(s) \rightarrow a$ which returns an action for a given state. The cumulative value of an episode can be defined by:

$$V^\pi(s) = E_s(\pi(s))_r + \begin{cases} 0, & \text{if } E_s(\pi(s))_d \\ V^\pi\!\left(E_s(\pi(s))_{s'}\right), & \text{otherwise} \end{cases}$$

where $s$ denotes the starting state and $E_s(a)_{(r, s', d)}$ denotes the respective component of the returned experience tuple. For this project, my objective is to create an agent capable of learning a policy which earns greater than 200 points, on average, across 100 games.
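As a minimal sketch of how that success criterion can be checked, assuming the same classic gym API as above and a trained `policy(state)` function (a hypothetical name):

```python
import numpy as np

def evaluate(env, policy, n_episodes=100):
    """Average undiscounted episode score of `policy` over n_episodes games."""
    scores = []
    for _ in range(n_episodes):
        state, done, score = env.reset(), False, 0.0
        while not done:
            state, reward, done, _ = env.step(policy(state))
            score += reward
        scores.append(score)
    return float(np.mean(scores))

# The agent meets the objective when evaluate(env, policy) > 200.
```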

II Experiment Design: DQN Algorithm & Model Architecture

I approached this problem in a similar fashion to Mnih et al. [1], using Deep Q-Network (DQN) learning. The first step is to reframe the problem as an MDP, solvable by Bellman's equations: $V^\pi(s) = \max_{a}\left(R(s,a) + \gamma \sum_{s' \in S} T(s,a,s')\, V^\pi(s')\right)$. Since this is a deterministic environment, $T(s,a,s')$ is $1$ for exactly one $s'$ and $0$ for all others. We can therefore simplify the equation to $V^\pi(s) = \max_{a}\left(R(s,a) + \gamma V^\pi(E_s(a))\right)$. For simplicity, I will write $E_s(a)$ as $s'$. We can now define the Q-form of the equations:

$$V^\pi(s) = \max_{a} Q^\pi(s,a)$$
$$Q^\pi(s,a) = R(s,a) + \gamma V^\pi(s')$$
$$\pi(s) = \operatorname{argmax}_{a} Q^\pi(s,a)$$

Next, since the state space is continuous (and therefore infinite), I generalize $Q$ with a neural network parameterized by $\theta$, turning calls of $Q(s,a)$ into $Q(s,a;\theta)$. Now I can define the error as $\mathrm{error} = Q(s,a;\theta) - \left(r + \gamma \max_{a'} Q(s',a';\theta^-)\right)$: the difference between my current Q value (the prediction) and the reward plus the next state's best Q value (the target). It is now straightforward to compute a loss over a batch of experience tuples, for example the mean squared error loss $L(\theta) = \mathbb{E}_{(s,a,r,s')}\left[\left(r + \gamma \max_{a'} Q(s',a';\theta^-) - Q(s,a;\theta)\right)^2\right]$. Note that $\theta^-$ refers to a second set of target weights, which are periodically synchronized with $\theta$ at a frequency discussed later. Now that I have a parameterized, differentiable Q function whose loss I can compute, all that is left is to define a training procedure that balances exploration and exploitation. I used a procedure very similar to the one described by Mnih et al. [1], sketched below.

TODO
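The sketch below shows the general shape of such a procedure: epsilon-greedy action selection, a replay buffer of experience tuples, and a periodically synchronized target network. The hyperparameter values and helper names (`q_model`, `target_model`, `act`, `train_step`) are illustrative, not the exact implementation; the two-input predict and train calls anticipate the action-mask model described in the next subsection.

```python
import random
from collections import deque

import numpy as np

# Illustrative hyperparameters; the values actually swept in this report differ.
GAMMA = 0.99
BATCH_SIZE = 64
TARGET_SYNC_EVERY = 1_000     # steps between copying θ into the target weights θ⁻
N_ACTIONS = 4

replay_buffer = deque(maxlen=100_000)   # stores <s, a, r, s', done> tuples

def act(q_model, state, epsilon):
    """Epsilon-greedy action selection; a ones mask exposes all four Q-value outputs."""
    if random.random() < epsilon:
        return random.randrange(N_ACTIONS)
    q_values = q_model.predict([state[None, :], np.ones((1, N_ACTIONS))], verbose=0)
    return int(np.argmax(q_values[0]))

def train_step(q_model, target_model):
    """Sample a batch and move the online network toward the TD target."""
    batch = random.sample(replay_buffer, BATCH_SIZE)
    states, actions, rewards, next_states, dones = map(np.array, zip(*batch))
    ones = np.ones((BATCH_SIZE, N_ACTIONS))
    next_q = target_model.predict([next_states, ones], verbose=0).max(axis=1)
    targets = rewards + GAMMA * next_q * (1.0 - dones)    # r + γ max_a' Q(s', a'; θ⁻)
    masks = np.eye(N_ACTIONS)[actions]                    # one-hot action masks
    q_model.train_on_batch([states, masks], masks * targets[:, None])

# Outer loop (not shown): step the environment with act(...), append each experience
# tuple to replay_buffer, call train_step(...) once the buffer holds a full batch,
# decay epsilon toward a small floor, and copy q_model's weights into target_model
# every TARGET_SYNC_EVERY steps.
```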

Next, we define the model architecture which estimates $Q(s,a;\theta)$. I implemented the model using the Keras high-level modeling API and parameterized the model shape as well. The model can be summarized by:

TODO
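As a rough sketch of the kind of parameterized dense Q-network this describes (the hidden layer sizes here are hypothetical, not the tuned values):

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_q_network(hidden_sizes=(64, 64), n_inputs=8, n_actions=4):
    """Fully connected Q-network: an 8-value observation in, one Q-value per action out."""
    state_in = keras.Input(shape=(n_inputs,), name="state")
    x = state_in
    for units in hidden_sizes:                      # model shape is parameterized
        x = layers.Dense(units, activation="relu")(x)
    q_values = layers.Dense(n_actions, activation="linear", name="q_values")(x)
    return keras.Model(inputs=state_in, outputs=q_values)

build_q_network().summary()
```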

Note that the actual model implementation is more complicated. In fact, it accepts an additional action input, a one-hot encoding of the action, which is multiplied by the output of the final dense layer. This lets me use the default Keras optimizer and loss function by ensuring that the only neuron in the final dense layer that contributes to the error (and therefore the only one to receive backpropagated gradients) is the neuron corresponding to the action in the experience tuple during training. At inference time, I send a vector of all ones to the action input, so the model outputs 4 estimates, one for each action's $Q(s,a)$ value. I then simply take the argmax of that vector to get the predicted best action, or the max to get the predicted best value.

TODO
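A minimal sketch of that masking trick, assuming hypothetical layer sizes and a default Adam/MSE setup rather than the exact compiled configuration:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def build_masked_q_network(hidden_sizes=(64, 64), n_inputs=8, n_actions=4):
    """Q-network with a second, one-hot action input that masks the output layer."""
    state_in = keras.Input(shape=(n_inputs,), name="state")
    action_mask = keras.Input(shape=(n_actions,), name="action_mask")
    x = state_in
    for units in hidden_sizes:
        x = layers.Dense(units, activation="relu")(x)
    q_values = layers.Dense(n_actions, activation="linear")(x)
    masked_q = layers.Multiply()([q_values, action_mask])  # only the selected action's Q survives
    model = keras.Model(inputs=[state_in, action_mask], outputs=masked_q)
    model.compile(optimizer=keras.optimizers.Adam(1e-3), loss="mse")
    return model

model = build_masked_q_network()

# Training: the mask is the one-hot action, so only that output neuron contributes to the loss.
# Inference: a mask of all ones exposes every action's Q-value.
state = np.zeros((1, 8), dtype="float32")
q_all = model.predict([state, np.ones((1, 4), dtype="float32")], verbose=0)
best_action = int(np.argmax(q_all[0]))
```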

III Results

(Interactive result charts for the run sets Sweep: Big Sweep 1 (197 runs), Sweep: Big Sweep 2 (0 runs), and the full Run set (282 runs).)