Contributors' Twitter accounts: https://twitter.com/vwxyzjn, https://twitter.com/RousslanDossa, https://twitter.com/yooceii

Open RL Benchmark by CleanRL (https://github.com/vwxyzjn/cleanrl) provides a benchmark of popular Deep Reinforcement Learning algorithms in 34+ games, with a new level of transparency, openness, and reproducibility.

Section 3

CleanRL is a library that provides high-quality single-file implementations of Deep Reinforcement Learning algorithms with research-friendly features. All of our implementations are benchmarked to ensure quality. We log all of our experiments with Weights and Biases, so for every run you can check the metrics, logs, and recorded videos.
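
As an illustration, a run might be tracked roughly as follows. This is a minimal sketch in the spirit of our scripts, not a verbatim excerpt; the project name and logged values are placeholders:

```python
# Minimal sketch of experiment tracking with Weights and Biases + TensorBoard.
# The project name, config, and metric values below are placeholders.
import time

import wandb
from torch.utils.tensorboard import SummaryWriter

experiment_name = f"CartPole-v1__ppo__{int(time.time())}"
wandb.init(
    project="cleanrl.benchmark",  # placeholder project name
    sync_tensorboard=True,        # mirror TensorBoard scalars to W&B
    config={"gym_id": "CartPole-v1", "seed": 1},
    name=experiment_name,
    monitor_gym=True,             # upload recorded gym videos
)
writer = SummaryWriter(f"runs/{experiment_name}")

# During training, scalars written to TensorBoard are synced to W&B:
for global_step in range(3):
    writer.add_scalar("charts/episode_reward", 100.0 + global_step, global_step)
writer.close()
```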

Additionally, we package our library with Docker, which allows us to leverage AWS Batch to run thousands of experiments concurrently. This is a poor man's Google scale. A tutorial is coming up for the 0.4.0 release.
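
For a rough idea of how such a fleet of runs could be dispatched, here is a hypothetical boto3 sketch; the job queue and job definition names are placeholders, not our actual AWS setup:

```python
# Hypothetical sketch of dispatching many containerized runs via AWS Batch.
import boto3

batch = boto3.client("batch")
for seed in [1, 2, 3]:
    batch.submit_job(
        jobName=f"cleanrl-ppo-seed-{seed}",
        jobQueue="cleanrl-job-queue",            # placeholder queue name
        jobDefinition="cleanrl-job-definition",  # placeholder; points at the Docker image
        containerOverrides={
            "command": ["python", "ppo.py", "--seed", str(seed)],
        },
    )
```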

Atari Results

| gym_id | apex_dqn_atari_visual | c51_atari_visual | dqn_atari_visual | ppo_atari_visual |
| --- | --- | --- | --- | --- |
| BeamRiderNoFrameskip-v4 | 2936.93 ± 362.18 | 13380.67 ± 0.00 | 7139.11 ± 479.11 | 2053.08 ± 83.37 |
| QbertNoFrameskip-v4 | 3565.00 ± 690.00 | 16286.11 ± 0.00 | 11586.11 ± 0.00 | 17919.44 ± 383.33 |
| SpaceInvadersNoFrameskip-v4 | 1019.17 ± 356.94 | 1099.72 ± 14.72 | 935.40 ± 93.17 | 1089.44 ± 67.22 |
| PongNoFrameskip-v4 | 19.06 ± 0.83 | 18.00 ± 0.00 | 19.78 ± 0.22 | 20.72 ± 0.28 |
| BreakoutNoFrameskip-v4 | 364.97 ± 58.36 | 386.10 ± 21.77 | 353.39 ± 30.61 | 380.67 ± 35.29 |
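
The entries in these tables read as mean ± standard deviation across runs. A minimal sketch of that kind of summary, using made-up returns:

```python
import numpy as np

# Hypothetical final episodic returns from several runs of one algorithm/game pair.
final_returns = np.array([2100.0, 2010.5, 2048.7])

mean, std = final_returns.mean(), final_returns.std()
print(f"{mean:.2f} ± {std:.2f}")  # -> 2053.07 ± 36.67
```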

MuJoCo Results

| gym_id | ddpg_continuous_action | td3_continuous_action | ppo_continuous_action |
| --- | --- | --- | --- |
| Reacher-v2 | -6.25 ± 0.54 | -6.65 ± 0.04 | -7.86 ± 1.47 |
| Pusher-v2 | -44.84 ± 5.54 | -59.69 ± 3.84 | -44.10 ± 6.49 |
| Thrower-v2 | -137.18 ± 47.98 | -80.75 ± 12.92 | -58.76 ± 1.42 |
| Striker-v2 | -193.43 ± 27.22 | -269.63 ± 22.14 | -112.03 ± 9.43 |
| InvertedPendulum-v2 | 1000.00 ± 0.00 | 443.33 ± 249.78 | 968.33 ± 31.67 |
| HalfCheetah-v2 | 10386.46 ± 265.09 | 9265.25 ± 1290.73 | 1717.42 ± 20.25 |
| Hopper-v2 | 1128.75 ± 9.61 | 3095.89 ± 590.92 | 2276.30 ± 418.94 |
| Swimmer-v2 | 114.93 ± 29.09 | 103.89 ± 30.72 | 111.74 ± 7.06 |
| Walker2d-v2 | 1946.23 ± 223.65 | 3059.69 ± 1014.05 | 3142.06 ± 1041.17 |
| Ant-v2 | 243.25 ± 129.70 | 5586.91 ± 476.27 | 2785.98 ± 1265.03 |
| Humanoid-v2 | 877.90 ± 3.46 | 6342.99 ± 247.26 | 786.83 ± 95.66 |

PyBullet and Other Continuous-Action Results

| gym_id | ddpg_continuous_action | td3_continuous_action | ppo_continuous_action |
| --- | --- | --- | --- |
| MinitaurBulletEnv-v0 | -0.17 ± 0.02 | 7.73 ± 5.13 | 23.20 ± 2.23 |
| MinitaurBulletDuckEnv-v0 | -0.31 ± 0.03 | 0.88 ± 0.34 | 11.09 ± 1.50 |
| InvertedPendulumBulletEnv-v0 | 742.22 ± 47.33 | 1000.00 ± 0.00 | 1000.00 ± 0.00 |
| InvertedDoublePendulumBulletEnv-v0 | 5847.31 ± 843.53 | 5085.57 ± 4272.17 | 6970.72 ± 2386.46 |
| Walker2DBulletEnv-v0 | 567.61 ± 15.01 | 2177.57 ± 65.49 | 1377.68 ± 51.96 |
| HalfCheetahBulletEnv-v0 | 2847.63 ± 212.31 | 2537.34 ± 347.20 | 2347.64 ± 51.56 |
| AntBulletEnv-v0 | 2094.62 ± 952.21 | 3253.93 ± 106.96 | 1775.50 ± 50.19 |
| HopperBulletEnv-v0 | 1262.70 ± 424.95 | 2271.89 ± 24.26 | 2311.20 ± 45.28 |
| HumanoidBulletEnv-v0 | -54.45 ± 13.99 | 937.37 ± 161.05 | 204.47 ± 1.00 |
| BipedalWalker-v3 | 66.01 ± 127.82 | 78.91 ± 232.51 | 272.08 ± 10.29 |
| LunarLanderContinuous-v2 | 162.96 ± 65.60 | 281.88 ± 0.91 | 215.27 ± 10.17 |
| Pendulum-v0 | -238.65 ± 14.13 | -345.29 ± 47.40 | -1255.62 ± 28.37 |
| MountainCarContinuous-v0 | -1.01 ± 0.01 | -1.12 ± 0.12 | 93.89 ± 0.06 |

Other Results

| gym_id | ppo | dqn |
| --- | --- | --- |
| CartPole-v1 | 500.00 ± 0.00 | 182.93 ± 47.82 |
| Acrobot-v1 | -80.10 ± 6.77 | -81.50 ± 4.72 |
| MountainCar-v0 | -200.00 ± 0.00 | -142.56 ± 15.89 |
| LunarLander-v2 | 46.18 ± 53.04 | 144.52 ± 1.75 |

All Training Curves

Benchmarked Learning Curves: Atari
Metrics, logs, and recorded videos are at cleanrl.benchmark/reports/Atari

Benchmarked Learning Curves: MuJoCo
Metrics, logs, and recorded videos are at cleanrl.benchmark/reports/Mujoco

Benchmarked Learning Curves: PyBullet
Metrics, logs, and recorded videos are at cleanrl.benchmark/reports/PyBullet-and-Other-Continuous-Action-Tasks

Benchmarked Learning Curves: Classic Control
Metrics, logs, and recorded videos are at cleanrl.benchmark/reports/Classic-Control

Benchmarked Learning Curves: Experimental Domains
Metrics, logs, and recorded videos are at cleanrl.benchmark/reports/Others

This is a rather challenging continuous-action task that usually requires 100M+ timesteps to solve.
This is a self-play environment from https://github.com/hardmaru/slimevolleygym, so its episode reward should not steadily increase. Check out the recorded video for the agent's actual performance (i.e., see cleanrl.benchmark/reports/Others).
This is a MicroRTS environment whose goal is to build as many combat units as possible; see https://github.com/vwxyzjn/gym-microrts. These runs are created by https://github.com/vwxyzjn/gym-microrts/blob/master/experiments/ppo.py, which additionally implements invalid action masking and handles multi-discrete action spaces for PPO; a sketch of such masking appears below.
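
Invalid action masking is commonly implemented by pushing the logits of invalid actions toward negative infinity before sampling; the following is a minimal PyTorch sketch of that idea, not the exact gym-microrts code:

```python
import torch
from torch.distributions import Categorical

def masked_categorical(logits: torch.Tensor, mask: torch.Tensor) -> Categorical:
    """Rule out invalid actions by pushing their logits to -inf before sampling."""
    masked_logits = torch.where(mask.bool(), logits, torch.tensor(-1e8))
    return Categorical(logits=masked_logits)

logits = torch.tensor([1.0, 2.0, 0.5, -0.3])
mask = torch.tensor([1, 0, 1, 1])   # action 1 is invalid this step
dist = masked_categorical(logits, mask)
action = dist.sample()              # never samples action 1
log_prob = dist.log_prob(action)    # used in the PPO objective
```
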
This is an experimental run of MontezumaRevengeNoFrameskip-v4 with PPO plus RND (Random Network Distillation) by @yooceii; see vwxyzjn/cleanrl#25 and runs/j00qhu7d. We plan to officially include this run soon. A sketch of the RND idea appears below.
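
RND computes an intrinsic reward from the prediction error of a trained predictor network against a fixed, randomly initialized target network; the following is a minimal sketch of that idea, not the code from the PR:

```python
import torch
import torch.nn as nn

obs_dim, feat_dim = 4, 64  # toy sizes for illustration

# Fixed random target network: never trained.
target = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, feat_dim))
for p in target.parameters():
    p.requires_grad = False

# Predictor network: trained to match the target's output.
predictor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, feat_dim))
optimizer = torch.optim.Adam(predictor.parameters(), lr=1e-4)

obs = torch.randn(32, obs_dim)  # a batch of (normalized) observations
pred_error = ((predictor(obs) - target(obs)) ** 2).mean(dim=1)

# The per-state prediction error is the intrinsic reward:
# novel states are poorly predicted, so they earn an exploration bonus.
intrinsic_reward = pred_error.detach()

# The same error, averaged, is the predictor's training loss.
loss = pred_error.mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```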


Section 4