Cartpole - Reinforcement Learning in Gym
Taking an RL model from GitHub that uses the REINFORCE algorithm to solve the Cartpole problem, and seeing whether it can be further optimized to reach the goal in fewer episodes
Created on May 1|Last edited on June 1
May 2022
Baseline Model
OpenAI Gym is an API built to make environment simulation and interaction for reinforcement learning simple. It also contains a number of built-in environments (e.g. Atari games, classic control problems, etc.).
One such classic control problem is Cart Pole, in which a cart carrying an inverted pendulum needs to be controlled such that the pendulum stays upright. The reward mechanics are described in the gym page for this environment.

Here, the REINFORCE algorithm is used to solve the Cartpole problem with RL. Here is the baseline model.
The goal of this Cartpole task is to train the model until the running reward reaches 475 (that is, the pole hasn't fallen over after 475 steps). Here are some key parameters and their values in the baseline model. These are the parameters we can potentially tune later. The goal of tuning is to reach the target running reward in fewer episodes than the baseline (faster training).
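To make the stopping criterion concrete, here is a minimal sketch of how a running reward can be tracked and compared against the 475 target. It assumes the running reward is an exponential moving average of per-episode rewards, which is common in REINFORCE examples but is not confirmed for this baseline. The function name, the smoothing factor of 0.05, and the initial value of 10 are all illustrative assumptions.

```python
def episodes_to_solve(episode_rewards, threshold=475.0, smoothing=0.05, initial=10.0):
    """Return (episode_number, running_reward) at the first episode where the
    running reward exceeds `threshold`, or (None, final_running_reward) if the
    threshold is never reached.

    Assumption: the running reward is an exponential moving average of the
    per-episode rewards; `smoothing` and `initial` are hypothetical values.
    """
    running = initial
    for episode, reward in enumerate(episode_rewards, start=1):
        # Blend the newest episode reward into the running average.
        running = smoothing * reward + (1.0 - smoothing) * running
        if running > threshold:
            return episode, running
    return None, running


# Example: a run that always scores 500 still needs many episodes before the
# smoothed value crosses 475, because the average starts low.
solved_at, final_value = episodes_to_solve([500.0] * 100)
```

"Fewer episodes than the baseline" then means a smaller `solved_at` for the tuned configuration than for the baseline, under the same smoothing settings.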
Try Out Different Parameters
1. Adjust learning rate
Based on the running_reward metric of the baseline model, the model looks relatively unstable as it learns. The running reward initially rose quickly, dropped with more training steps, and eventually rose again. One intuition is that the learning rate is too high, so we first tried a few learning rates (both lower and higher).
Unfortunately, neither lower nor higher learning rates improved the model performance.
2. Manually adjust gamma, dropout, hidden layer size, optimizer
Other hyperparameters we can potentially change are gamma, dropout rate, optimizer type, and the size of the model's hidden layer. We tried a few options for each and didn't see a significant improvement from changing any of them.
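For intuition on what gamma controls, the sketch below computes discounted returns for one episode. In REINFORCE, each action's log-probability is weighted by the return that follows it, so gamma decides how much later rewards count toward earlier actions. This is an illustrative helper under those assumptions, not the baseline's code, and the function name is hypothetical.

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = r_t + gamma * G_{t+1} for every step of one episode.

    `rewards` is the list of per-step rewards; the default gamma is only an
    illustrative value.
    """
    returns = []
    g = 0.0
    # Walk backwards so each return can reuse the one that follows it.
    for reward in reversed(rewards):
        g = reward + gamma * g
        returns.append(g)
    returns.reverse()
    return returns


# With a reward of 1 per step, a lower gamma shrinks the credit that early
# actions receive for the pole staying up later in the episode.
print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
print(discounted_returns([1.0, 1.0, 1.0], gamma=1.0))  # [3.0, 2.0, 1.0]
```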
3. Sweep on dropouts, gamma, and activation function
It does look like dropout and gamma impact model performance more than the other hyperparameters, so it is worth trying a few more options there. In addition, we also want to try a few other activation functions.
We did a sweep on dropout, gamma, and activation function. It looks like:
(1) a higher gamma works a little better (not materially different);
(2) dropout rate is not strongly related to performance;
(3) the silu and tanh activation functions work a little better than relu (not materially different either).
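A sweep like the one above can be sketched as a grid over the three hyperparameters. The dictionary below follows the layout of a W&B-style sweep configuration, which is an assumption about the tooling. The specific values are hypothetical placeholders, not the ones used in these experiments, and `expand_grid` is an illustrative helper that only enumerates the combinations.

```python
from itertools import product

# Hypothetical grid: the candidate values are placeholders for illustration.
sweep_config = {
    "method": "grid",
    "metric": {"name": "running_reward", "goal": "maximize"},
    "parameters": {
        "dropout": {"values": [0.2, 0.4, 0.6]},
        "gamma": {"values": [0.95, 0.99, 0.999]},
        "activation": {"values": ["relu", "silu", "tanh"]},
    },
}


def expand_grid(config):
    """List every hyperparameter combination described by a grid config."""
    names = list(config["parameters"])
    value_lists = [config["parameters"][name]["values"] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


runs = expand_grid(sweep_config)
print(len(runs))  # 27 combinations (3 x 3 x 3)
print(runs[0])    # {'dropout': 0.2, 'gamma': 0.95, 'activation': 'relu'}
```

Each dictionary in `runs` would correspond to one training run. Because single runs of this algorithm are noisy, repeating each combination with several random seeds would make small differences easier to trust.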
4. Add model layers & change activation function
(1) adding more layers to the model doesn't seem to help either;
(2) silu activation does work a little better than relu, but again, there doesn't appear to be any significant improvement.
Conclusion
- The baseline model on GitHub seems to already have well-tuned parameters. It would be difficult to tune the model further to achieve meaningfully better results.
- The REINFORCE algorithm seems to be unstable, and it is not the state-of-the-art algorithm for RL problems like Cartpole. The next step would be to try later algorithms such as actor-critic - SOTA evolves fast.
- Running reward on its own might not be the best metric to look at when deciding on the best-performing parameters. A combination of criteria, such as running reward, last reward, and a video simulation of what the model is doing, would provide more intuition about the strengths and weaknesses of a model.
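As a rough sketch of the last point, the helper below reports several numbers per run instead of the running reward alone. It assumes a list of per-episode rewards is available for each run. The function name, the window size, and the smoothing factor are illustrative assumptions, and the video inspection mentioned above is not covered here.

```python
from statistics import mean, pstdev


def summarize_run(episode_rewards, window=10, smoothing=0.05):
    """Summarize one training run with several complementary metrics.

    Assumption: the running reward is an exponential moving average seeded
    with the first episode's reward; `window` and `smoothing` are hypothetical.
    """
    if not episode_rewards:
        raise ValueError("episode_rewards must not be empty")

    running = episode_rewards[0]
    for reward in episode_rewards[1:]:
        running = smoothing * reward + (1.0 - smoothing) * running

    recent = episode_rewards[-window:]
    return {
        "running_reward": running,             # smoothed, slow to react
        "last_reward": episode_rewards[-1],    # most recent episode only
        "recent_mean": mean(recent),           # average of the last `window` episodes
        "recent_std": pstdev(recent),          # spread: a proxy for instability
        "worst_recent": min(recent),           # worst case in the recent window
    }


# A run that collapses in its final episode looks fine on the smoothed metric
# but poor on last_reward and worst_recent.
summary = summarize_run([500.0] * 50 + [20.0])
```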