RL Homework 3

nv2099@nyu.edu
Created on October 19 | Last edited on October 23

P.S. The link to the web version of this report is https://wandb.ai/nikhilweee/rl-hw-3/reports/RL-Homework-3--VmlldzoxMTMyMjI0

Pendulum

The following plot shows the performance of REINFORCE (green) and PPO (purple) on the Pendulum environment. The solid line is the average episode reward across three runs with seeds 13, 31 and 42, and the range behind it spans the minimum and maximum values across those runs. Towards the end of the runs, the average reward for PPO across these three seeds reaches -329.8.

[Plot: Pendulum, average episode reward for REINFORCE and PPO; run set of 6 runs]
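The aggregation behind the solid line and the min/max range can be sketched as follows. This is only an illustration in plain Python: the function name `aggregate_runs`, the dictionary layout, and the reward values are assumptions made up for the example, not the logged data or the plotting code used for this report.

```python
from statistics import mean


def aggregate_runs(runs):
    """Collapse per-seed reward curves into (mean, min, max) curves.

    `runs` maps a seed to a list of episode rewards, one entry per logged step.
    Curves are truncated to the shortest run so every step covers all seeds.
    """
    horizon = min(len(curve) for curve in runs.values())
    mean_curve, min_curve, max_curve = [], [], []
    for step in range(horizon):
        values = [curve[step] for curve in runs.values()]
        mean_curve.append(mean(values))  # solid line
        min_curve.append(min(values))    # lower edge of the range
        max_curve.append(max(values))    # upper edge of the range
    return mean_curve, min_curve, max_curve


# Made-up rewards for illustration only; these are NOT the logged values.
example_runs = {
    13: [-1200.0, -800.0, -400.0],
    31: [-1100.0, -900.0, -300.0],
    42: [-1300.0, -700.0, -200.0],
}
mean_curve, min_curve, max_curve = aggregate_runs(example_runs)
```

The same aggregation applies to the Bipedal Walker and Lunar Lander plots, only with different seeds.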


Bipedal Walker

The following plot shows the performance of REINFORCE (green) and PPO (purple) on the Bipedal Walker environment. The solid line is the average episode reward across three runs with seeds 0, 7 and 42, and the range behind it spans the minimum and maximum values across those runs. Towards the end of the runs, the average reward for PPO across these three seeds reaches 138.6.

[Plot: Bipedal Walker, average episode reward for REINFORCE and PPO; run set of 6 runs]



Lunar Lander

The following plot shows the performance of REINFORCE (green) and PPO (purple) on the Lunar Lander environment. The solid line is the average episode reward across three runs with seeds 7, 13 and 31, and the range behind it spans the minimum and maximum values across those runs. Towards the end of the runs, the average reward for PPO across these three seeds is around 104.8.

[Plot: Lunar Lander, average episode reward for REINFORCE and PPO; run set of 6 runs]


Comments about REINFORCE

The performance of REINFORCE does not increase over time. This might be because its gradient estimates are unreliable: it weights log-probabilities by raw returns sampled from the very policy it is optimizing, so the updates are noisy and a single bad step changes the data collected next. PPO, on the other hand, uses a clipped objective, advantages instead of raw rewards, and an importance-sampling ratio between the updated policy and the policy that collected the data, which together make sure that there's little deviation between the two.
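To make the contrast concrete, here is a minimal sketch of the two per-batch losses in plain Python. It is not the code used for these runs: the function names, the scalar-list inputs, and the `clip_eps=0.2` default are assumptions for illustration, and the advantages are taken as already computed.

```python
import math


def reinforce_loss(log_probs, returns):
    """REINFORCE: negative mean of log-probability weighted by the return."""
    weighted = [lp * g for lp, g in zip(log_probs, returns)]
    return -sum(weighted) / len(weighted)


def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """PPO clipped surrogate: negative mean of min(r * A, clip(r) * A)."""
    total = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        # Importance-sampling ratio between the updated and the sampling policy.
        ratio = math.exp(new_lp - old_lp)
        # Keep the ratio inside [1 - clip_eps, 1 + clip_eps].
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Taking the minimum removes any gain from moving far from the old policy.
        total += min(ratio * adv, clipped * adv)
    return -total / len(advantages)
```

With a positive advantage, the objective stops improving once the ratio exceeds `1 + clip_eps`, so a single update has no incentive to move the policy far from the one that generated the samples. REINFORCE has no such limit on its step.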