
[WIP] APO on Gym Mujoco

APO performance on 3 seeds, 1M steps
I found that APO is sensitive to gae-lambda, so for each environment I first sweep over [0.8, 0.9, 0.95, 0.99] for 500k steps and use the best value for the 1M-step runs shown here (a hypothetical launcher for this sweep is sketched below). All other hyperparameters are left at their defaults in the code (TL;DR: they are identical to PPO). PPO reference runs are available only for the v2 envs, so instead I show PPO runs on the same seeds.
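
For reference, a launcher for this kind of sweep could look like the sketch below. The script name matches the avg_ppo_continuous_action.py runs in the plots, but the flags are assumed to follow CleanRL's usual argparse conventions; these are not the exact commands used for these runs.

```python
# Hypothetical gae-lambda sweep launcher (assumed CleanRL-style flags).
import itertools
import subprocess

ENV_IDS = ["Swimmer-v3", "HalfCheetah-v3", "Ant-v3",
           "Walker2d-v3", "Hopper-v3", "Humanoid-v3"]
GAE_LAMBDAS = [0.8, 0.9, 0.95, 0.99]
SEEDS = [1, 2, 3]

for env_id, gae_lambda, seed in itertools.product(ENV_IDS, GAE_LAMBDAS, SEEDS):
    subprocess.run([
        "python", "avg_ppo_continuous_action.py",
        "--env-id", env_id,
        "--gae-lambda", str(gae_lambda),
        "--seed", str(seed),
        "--total-timesteps", "500000",  # short sweep; best lambda is rerun for 1M steps
        "--track",                      # log to Weights & Biases
    ], check=True)
```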
P.S. I expect APO to be better on Swimmer, HalfCheetah, and Ant, but worse on Hopper and Walker2d, since those have unsafe states, which the average-reward setting does not handle by default (see the sketch below).
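
To make the P.S. concrete, here is a minimal sketch of differential (average-reward) GAE, assuming APO replaces the discounted TD error r_t + gamma * V(s_{t+1}) - V(s_t) with the differential one r_t - rho_hat + V(s_{t+1}) - V(s_t), where rho_hat estimates the policy's average reward per step. This illustrates the average-reward setting in general, not the actual avg_ppo_continuous_action.py code.

```python
import numpy as np

def average_reward_gae(rewards, values, next_value, rho_hat, gae_lambda):
    """Differential GAE with gamma = 1 (average-reward setting).

    rewards:    (T,) rewards of one rollout segment
    values:     (T,) critic values V(s_t)
    next_value: scalar V(s_T) used for bootstrapping
    rho_hat:    scalar estimate of the average reward per step
    """
    T = len(rewards)
    advantages = np.zeros(T)
    last_adv = 0.0
    values_ext = np.append(values, next_value)
    for t in reversed(range(T)):
        # Differential TD error: the average reward is subtracted instead of discounting.
        delta = rewards[t] - rho_hat + values_ext[t + 1] - values_ext[t]
        last_adv = delta + gae_lambda * last_adv
        advantages[t] = last_adv
    return advantages
```

Note that there is no done-masking here: the continuing (average-reward) view has no terminal states, which is why environments with unsafe terminations such as Hopper and Walker2d need extra care.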

Swimmer-v3

gae-lambda: 0.99

[Plot: Episodic Return vs. Step — CleanRL's avg_ppo_continuous_action.py (3 runs) vs. CleanRL's ppo_continuous_action.py (10 runs)]


HalfCheetah-v3

gae-lambda: 0.9

[Plot: Episodic Return vs. Step — APO (3 runs) vs. PPO (10 runs)]


Ant-v3

gae-lambda: 0.8

[Plot: Episodic Return vs. Step — APO, gae-lambda 0.8 (3 runs) vs. PPO (10 runs)]


Walker2d-v3

TODO

[Plot: Episodic Return vs. Step — APO (3 runs) vs. PPO (3 runs)]


Hopper-v3

TODO

[Plot: Episodic Return vs. Step — APO, gae-lambda 0.99 (3 runs) vs. PPO (3 runs)]


Humanoid-v3

TODO