RL Homework 3
nv2099@nyu.edu
Created on October 19 | Last edited on October 23
P.S. The link to the web version of this report is https://wandb.ai/nikhilweee/rl-hw-3/reports/RL-Homework-3--VmlldzoxMTMyMjI0
Pendulum
The following plot shows the performance of REINFORCE (green) and PPO (purple) on the Pendulum environment. The solid line plots the average episode reward across three runs with seeds 13, 31 and 42. The range behind the solid line spans the min and max values across those runs. Towards the end of training, the PPO reward averaged across these three seeds reaches -329.8.
[Plot: Pendulum, average episode reward for REINFORCE and PPO; run set of 6 runs]
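The aggregation described for these plots (a mean curve across seeds with a min/max range behind it) can be sketched as follows. This is only an illustration: the function name `aggregate_runs` and the example numbers are hypothetical, and the actual plots in this report were produced by the W&B panel rather than by this code.

```python
import numpy as np

def aggregate_runs(rewards_per_seed):
    """Combine per-seed reward curves into mean, min and max curves.

    rewards_per_seed: array-like of shape (n_seeds, n_steps), one row per seed.
    """
    rewards = np.asarray(rewards_per_seed, dtype=float)
    mean_curve = rewards.mean(axis=0)  # the solid line
    low_curve = rewards.min(axis=0)    # lower edge of the range
    high_curve = rewards.max(axis=0)   # upper edge of the range
    return mean_curve, low_curve, high_curve

# Made-up rewards for three seeds (NOT the data from this report).
example = [
    [-1200.0, -900.0, -400.0],
    [-1100.0, -800.0, -300.0],
    [-1300.0, -1000.0, -350.0],
]
mean_curve, low_curve, high_curve = aggregate_runs(example)
```

Each returned array has one entry per logged step, so the mean can be drawn as a line and the other two as the edges of the range.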
Bipedal Walker
The following plot shows the performance of REINFORCE (green) and PPO (purple) on the Bipedal Walker environment. The solid line plots the average episode reward across three runs with seeds 0, 7 and 42. The range behind the solid line spans the min and max values across those runs. Towards the end of training, the PPO reward averaged across these three seeds reaches 138.6.
[Plot: Bipedal Walker, average episode reward for REINFORCE and PPO; run set of 6 runs]
Lunar Lander
The following plot shows the performance of REINFORCE (green) and PPO (purple) on the Lunar Lander environment. The solid line plots the average episode reward across three runs with seeds 7, 13 and 31. The range behind the solid line spans the min and max values across those runs. Towards the end of training, the PPO reward averaged across these three seeds is around 104.8.
[Plot: Lunar Lander, average episode reward for REINFORCE and PPO; run set of 6 runs]
Comments about REINFORCE
The performance of REINFORCE does not improve over time. This might be because it is not getting good gradients: the policy we optimize is the same one we sample from, so every update changes the distribution the data came from, and nothing limits how far the policy moves in a single step. PPO, on the other hand, uses a clipped objective, advantages instead of raw rewards, and importance sampling to make sure the updated policy deviates little from the policy that collected the data.
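To make the contrast concrete, here is a minimal NumPy sketch of the two loss functions. It is not the code used for the runs in this report: the function names are hypothetical, `clip_eps=0.2` is a commonly used value assumed here for illustration, and the advantages are taken as given rather than estimated.

```python
import numpy as np

def reinforce_loss(log_probs, returns):
    """REINFORCE loss: negative mean of log pi(a|s) weighted by the return."""
    log_probs = np.asarray(log_probs, dtype=float)
    returns = np.asarray(returns, dtype=float)
    return -np.mean(log_probs * returns)

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """PPO clipped surrogate loss (clip_eps is an assumed value)."""
    new_log_probs = np.asarray(new_log_probs, dtype=float)
    old_log_probs = np.asarray(old_log_probs, dtype=float)
    advantages = np.asarray(advantages, dtype=float)
    # Importance sampling ratio between the updated policy and the
    # policy that collected the data.
    ratio = np.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Taking the elementwise minimum removes any incentive to push the
    # ratio outside [1 - clip_eps, 1 + clip_eps].
    return -np.mean(np.minimum(unclipped, clipped))

# Toy example: two actions whose probabilities moved from 0.5 to 0.9 and 0.1.
old_lp = np.log([0.5, 0.5])
new_lp = np.log([0.9, 0.1])
adv = np.array([1.0, -1.0])
loss_far = ppo_clip_loss(new_lp, old_lp, adv)
loss_same = ppo_clip_loss(old_lp, old_lp, adv)
```

In the toy example the ratios are 1.8 and 0.2, both outside the clipping interval, so the objective uses the clipped values 1.2 and 0.8 and the policy gains nothing from moving further. The REINFORCE loss has no such term, so the size of each update depends only on the returns and the learning rate.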