Proximal Policy Optimization (PPO) on OpenAI Gym
Tried PPO on two OpenAI Gym environments: LunarLanderContinuous-v2 and MountainCarContinuous-v0
LunarLanderContinuous-v2
Within 5k episodes, 2 of 4 attempts solved LunarLanderContinuous-v2 (reached a moving-average reward of 200).
Some findings are:
- It seems better not to clear the replay buffer after every episode (see the sketch after this list)
- Learning is slow! (perhaps due to some parameter settings...)
- The learning rate is important - as we all know :-) (these results haven't fully optimized it yet)
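A minimal sketch of the buffer point above, assuming a plain episode loop with the older Gym step/reset API and a hypothetical `ppo_update` routine (the actual training code is not shown in this report). The idea is to accumulate transitions across episodes and only clear the buffer after an update, instead of clearing it at the end of every episode:

```python
import gym

env = gym.make("LunarLanderContinuous-v2")
buffer = []                  # stand-in for a proper rollout buffer
BUFFER_SIZE = 30_000         # matches the 30k buffer size listed below

for episode in range(5_000):
    obs = env.reset()
    done = False
    while not done:
        action = env.action_space.sample()        # placeholder for the policy
        next_obs, reward, done, info = env.step(action)
        buffer.append((obs, action, reward, next_obs, done))
        obs = next_obs

    # Update only once the buffer is full, then clear it -
    # rather than clearing after every episode.
    if len(buffer) >= BUFFER_SIZE:
        # ppo_update(buffer)   # hypothetical PPO update routine
        buffer.clear()
```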
Main parameters (a code sketch follows the list):
- γ (discount factor): 0.99
- Replay buffer size: 30k
- LR for policy net: 2e-5
- LR for value net: 2e-4
- Batch size: 128
- Number of epochs: 32
- Number of hidden layers*: 3
- Number of hidden units*: 64
- Activation functions:
  - ReLU for policy net (actor)
  - tanh for value net (critic)
* same for both the policy and value nets
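For reference, here is how these settings might translate into the actor/critic networks and optimizers. PyTorch is an assumption (the report does not show the implementation), and the observation/action dimensions are those of LunarLanderContinuous-v2:

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=64, layers=3, act=nn.ReLU):
    """Build a simple MLP with the given hidden size, depth, and activation."""
    mods, d = [], in_dim
    for _ in range(layers):
        mods += [nn.Linear(d, hidden), act()]
        d = hidden
    mods.append(nn.Linear(d, out_dim))
    return nn.Sequential(*mods)

obs_dim, act_dim = 8, 2   # LunarLanderContinuous-v2 observation/action sizes

policy_net = mlp(obs_dim, act_dim, act=nn.ReLU)   # actor: 3 x 64, ReLU
value_net = mlp(obs_dim, 1, act=nn.Tanh)          # critic: 3 x 64, tanh

policy_opt = torch.optim.Adam(policy_net.parameters(), lr=2e-5)
value_opt = torch.optim.Adam(value_net.parameters(), lr=2e-4)

GAMMA = 0.99          # discount factor
BATCH_SIZE = 128
NUM_EPOCHS = 32       # epochs per PPO update
BUFFER_SIZE = 30_000
```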
[Run set: 4 runs]
MountainCarContinuous-v0
[Run set: 2 runs]