
Proximal Policy Optimization (PPO) on OpenAI Gym

Tried PPO on OpenAI Gym - LunarLanderContinuous-v2 and MountainCarContinuous-v0

LunarLanderContinuous-v2





Within 5k episodes, 2 of 4 attempts solved LunarLanderContinuous-v2 (the moving-average reward reached 200).
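
To make the success check concrete, here is a minimal sketch of the moving-average criterion. The 100-episode window is an assumption (the report only says "moving average"); the 200-reward threshold is the usual Gym target for LunarLander.

```python
import numpy as np

def moving_average(episode_rewards, window=100):
    """Mean reward over the most recent `window` episodes."""
    rewards = np.asarray(episode_rewards, dtype=float)
    if len(rewards) == 0:
        return 0.0
    return rewards[-window:].mean()

# The run counts as solved once the moving average reaches 200:
# solved = moving_average(reward_history) >= 200
```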

Some findings are:

  • It seems better not to clear the replay buffer at every episode (see the sketch right after this list)
  • Learning is slow! (perhaps due to some parameter settings...)
  • The learning rate is important - as we all know :-) (it has not been fully tuned in these results yet)
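
A rough sketch of the buffer point above (the class and the training-loop outline are hypothetical, not the original code): transitions keep accumulating across episodes and only roll off once the capacity is reached, instead of the buffer being cleared after every episode.

```python
from collections import deque

class RolloutBuffer:
    """Accumulates transitions across episodes; when full, the oldest samples
    roll off instead of the whole buffer being cleared each episode."""
    def __init__(self, capacity=30_000):
        self.storage = deque(maxlen=capacity)

    def add(self, state, action, logprob, reward, done):
        self.storage.append((state, action, logprob, reward, done))

    def __len__(self):
        return len(self.storage)

# Training-loop outline (env / policy / ppo_update are placeholders):
# buffer = RolloutBuffer(capacity=30_000)
# for episode in range(num_episodes):
#     obs = env.reset()
#     done = False
#     while not done:
#         action, logprob = policy.act(obs)
#         next_obs, reward, done, info = env.step(action)
#         buffer.add(obs, action, logprob, reward, done)
#         obs = next_obs
#     ppo_update(policy_net, value_net, buffer)  # reuse accumulated samples; no buffer.clear()
```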

Main parameters are:

  • γ: 0.99
  • Replay buffer size: 30k
  • LR for policy net: 2e-5
  • LR for value net: 2e-4
  • Batch size: 128
  • Number of epochs: 32
  • Number of hidden layers*: 3
  • Number of hidden units*: 64
  • Activation function:
    • ReLU for the policy net (actor)
    • tanh for the value net (critic)

* for both the policy and value nets (used in the sketch below)
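
A minimal sketch of how these parameters could be wired up, assuming PyTorch; the Gaussian policy head (mean/log-std) and the PPO update itself are omitted, and the use of Adam is an assumption.

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=64, layers=3, activation=nn.ReLU):
    """3 hidden layers x 64 units with the given activation."""
    blocks, last = [], in_dim
    for _ in range(layers):
        blocks += [nn.Linear(last, hidden), activation()]
        last = hidden
    blocks.append(nn.Linear(last, out_dim))
    return nn.Sequential(*blocks)

obs_dim, act_dim = 8, 2  # LunarLanderContinuous-v2 observation / action sizes

policy_net = mlp(obs_dim, act_dim, activation=nn.ReLU)  # actor: ReLU hidden layers
value_net = mlp(obs_dim, 1, activation=nn.Tanh)         # critic: tanh hidden layers

policy_opt = torch.optim.Adam(policy_net.parameters(), lr=2e-5)
value_opt = torch.optim.Adam(value_net.parameters(), lr=2e-4)

GAMMA = 0.99          # discount factor
BUFFER_SIZE = 30_000  # replay buffer capacity
BATCH_SIZE = 128      # minibatch size
NUM_EPOCHS = 32       # PPO epochs per update
```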

[Plot: moving-average episode reward vs. training step (0-5k) for the run set of 4 LunarLanderContinuous-v2 runs; rewards range from roughly -300 to 200]


MountainCarContinuous-v0




[Plot: episode reward for the run set of 2 MountainCarContinuous-v0 runs]