RL Homework 3

nv2099@nyu.edu
Created on October 19 | Last edited on October 23
Table of Contents
Pendulum
Bipedal Walker
Lunar Lander
Comments about REINFORCE

P.S. The web version of this report is available at https://wandb.ai/nikhilweee/rl-hw-3/reports/RL-Homework-3--VmlldzoxMTMyMjI0

Pendulum
The following plot shows the performance of REINFORCE (green) and PPO (purple) on the Pendulum environment. The solid line is the average episode reward across three runs with seeds 13, 31 and 42. The min/max values across those runs are shown as a shaded range behind the solid line. Averaged over these three seeds, the reward for PPO reaches -329.8 towards the end of the runs.

[Plot: Pendulum, average episode reward, REINFORCE vs. PPO (run set of 6 runs)]
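
The solid line and shaded range described above can be computed per step along the following lines. This is a minimal sketch, assuming each seed's run is a list of episode rewards of equal length; the function name `aggregate_across_seeds` and the reward values in the usage are hypothetical placeholders, not results from this report.

```python
def aggregate_across_seeds(runs):
    """Aggregate reward curves from several seeds.

    runs: dict mapping seed -> list of episode rewards (equal lengths assumed).
    Returns three lists: per-step mean, per-step min, per-step max.
    """
    curves = list(runs.values())
    # zip(*curves) groups the rewards of all seeds at the same step
    steps = list(zip(*curves))
    mean = [sum(step) / len(step) for step in steps]
    low = [min(step) for step in steps]
    high = [max(step) for step in steps]
    return mean, low, high


# Usage with placeholder numbers (NOT measured values from the report)
example_runs = {
    13: [-3.0, -1.0],
    31: [-6.0, -2.0],
    42: [-9.0, -3.0],
}
mean, low, high = aggregate_across_seeds(example_runs)
```

The mean gives the solid line, while `low` and `high` bound the shaded region.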
﻿
Bipedal Walker
The following plot shows the performance of REINFORCE (green) and PPO (purple) on the Bipedal Walker environment. The solid line is the average episode reward across three runs with seeds 0, 7 and 42. The min/max values across those runs are shown as a shaded range behind the solid line. Averaged over these three seeds, the reward for PPO reaches 138.6 towards the end of the runs.

[Plot: Bipedal Walker, average episode reward, REINFORCE vs. PPO (run set of 6 runs)]

Lunar Lander
The following plot shows the performance of REINFORCE (green) and PPO (purple) on the Lunar Lander environment. The solid line is the average episode reward across three runs with seeds 7, 13 and 31. The min/max values across those runs are shown as a shaded range behind the solid line. Averaged over these three seeds, the reward for PPO is around 104.8 towards the end of the runs.

[Plot: Lunar Lander, average episode reward, REINFORCE vs. PPO (run set of 6 runs)]

Comments about REINFORCE
The performance of REINFORCE does not improve over time. This might be because its gradient estimates are unreliable: it updates the same policy it samples from, weighting each action by the raw return, which makes the estimates high-variance. PPO, on the other hand, uses tricks such as the clipped objective, advantages instead of raw rewards, and importance sampling to make sure the updated policy deviates only slightly from the policy that collected the data.
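
To make the comparison concrete, here is a minimal sketch of the two per-batch losses, assuming plain Python floats rather than autodiff tensors. The function names, the `clip_eps=0.2` default, and the inputs are illustrative assumptions, not the code used for the runs in this report.

```python
import math


def reinforce_loss(log_probs, returns):
    """REINFORCE loss: negative mean of log pi(a|s) weighted by the return."""
    weighted = [lp * g for lp, g in zip(log_probs, returns)]
    return -sum(weighted) / len(weighted)


def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """PPO clipped surrogate loss (negated so that lower is better).

    The importance ratio pi_new / pi_old is clipped to
    [1 - clip_eps, 1 + clip_eps], so a sample stops rewarding further
    movement of the new policy away from the data-collecting policy.
    """
    total = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        ratio = math.exp(new_lp - old_lp)  # importance sampling ratio
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Pessimistic choice between the unclipped and clipped terms
        total += min(ratio * adv, clipped * adv)
    return -total / len(advantages)


# Usage: a ratio of 2 with a positive advantage is capped at 1 + clip_eps
loss = ppo_clip_loss([math.log(2.0)], [0.0], [1.0])
```

When the new and old policies agree (ratio 1), the clipped loss reduces to the negative mean advantage. Once the ratio leaves the clip range in the direction the advantage favors, the objective stops growing, which is what limits the size of each policy update.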