
[WIP] APO on Gym Mujoco

APO performance on 3 seeds, 1M steps
I found that APO is sensitive to gae-lambda, so for each environment I first sweep over [0.8, 0.9, 0.95, 0.99] for 500k steps and use the best value for the 1M-step runs shown here (a hypothetical launcher for this sweep is sketched below). All other hyperparameters are left at their defaults in the code (TL;DR: they are identical to PPO). PPO reference runs are available only for the v2 envs, so instead I show PPO runs on the same seeds.
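
For reference, a launcher for this kind of sweep could look like the sketch below. The script name matches the avg_ppo_continuous_action.py runs in the plots, but the flags are assumed to follow CleanRL's usual argparse conventions; these are not the exact commands used for these runs.

```python
# Hypothetical gae-lambda sweep launcher (assumed CleanRL-style flags).
import itertools
import subprocess

ENV_IDS = ["Swimmer-v3", "HalfCheetah-v3", "Ant-v3",
           "Walker2d-v3", "Hopper-v3", "Humanoid-v3"]
GAE_LAMBDAS = [0.8, 0.9, 0.95, 0.99]
SEEDS = [1, 2, 3]

for env_id, gae_lambda, seed in itertools.product(ENV_IDS, GAE_LAMBDAS, SEEDS):
    subprocess.run([
        "python", "avg_ppo_continuous_action.py",
        "--env-id", env_id,
        "--gae-lambda", str(gae_lambda),
        "--seed", str(seed),
        "--total-timesteps", "500000",  # short sweep; best lambda is rerun for 1M steps
        "--track",                      # log to Weights & Biases
    ], check=True)
```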
P.S. I expect APO to be better on Swimmer, HalfCheetah, and Ant, but worse on Hopper and Walker2d, since those have unsafe states, which the average-reward setting does not handle by default (see the sketch below).
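
To make the P.S. concrete, here is a minimal sketch of differential (average-reward) GAE, assuming APO replaces the discounted TD error r_t + gamma * V(s_{t+1}) - V(s_t) with the differential one r_t - rho_hat + V(s_{t+1}) - V(s_t), where rho_hat estimates the policy's average reward per step. This illustrates the average-reward setting in general, not the actual avg_ppo_continuous_action.py code.

```python
import numpy as np

def average_reward_gae(rewards, values, next_value, rho_hat, gae_lambda):
    """Differential GAE with gamma = 1 (average-reward setting).

    rewards:    (T,) rewards of one rollout segment
    values:     (T,) critic values V(s_t)
    next_value: scalar V(s_T) used for bootstrapping
    rho_hat:    scalar estimate of the average reward per step
    """
    T = len(rewards)
    advantages = np.zeros(T)
    last_adv = 0.0
    values_ext = np.append(values, next_value)
    for t in reversed(range(T)):
        # Differential TD error: the average reward is subtracted instead of discounting.
        delta = rewards[t] - rho_hat + values_ext[t + 1] - values_ext[t]
        last_adv = delta + gae_lambda * last_adv
        advantages[t] = last_adv
    return advantages
```

Note that there is no done-masking here: the continuing (average-reward) view has no terminal states, which is why environments with unsafe terminations such as Hopper and Walker2d need extra care.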

Swimmer-v3

gae-lambda: 0.99

[Plot: Episodic Return vs. Step — CleanRL's avg_ppo_continuous_action.py (3 runs) vs. CleanRL's ppo_continuous_action.py (10 runs)]


HalfCheetah-v3

gae-lambda: 0.9

[Plot: Episodic Return vs. Step — APO (3 runs) vs. PPO (10 runs)]


Ant-v3

gae-lambda: 0.8

[Plot: Episodic Return vs. Step — APO, gae-lambda 0.8 (3 runs) vs. PPO (10 runs)]


Walker2d-v3

TODO

[Plot: Episodic Return vs. Step — APO (3 runs) vs. PPO (3 runs)]


Hopper-v3

TODO

[Plot: Episodic Return vs. Step — APO, gae-lambda 0.99 (3 runs) vs. PPO (3 runs)]


Humanoid-v3

TODO