Comparison of Four Reinforcement Learning Methods on CartPole-v1
Introduction
This study compares four reinforcement learning (RL) approaches on the CartPole-v1 task. The methods are:
- REINFORCE (Policy-Based): A straightforward policy-gradient method (a minimal sketch of its update appears after this list).
- DQN (Value-Based): A deep Q-network that estimates action values.
- A2C (Hybrid): An actor-critic method that combines policy and value learning.
- Decision Transformer (Candidate): An offline method trained on expert demonstrations and then evaluated live.
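To make the policy-gradient idea concrete, below is a minimal REINFORCE sketch for CartPole-v1. The network size, learning rate, discount factor, and episode budget are illustrative assumptions, not the exact configuration used in these experiments.

```python
import gymnasium as gym
import torch
import torch.nn as nn

# Small MLP policy: maps the 4-dim CartPole state to probabilities over 2 actions.
policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2), nn.Softmax(dim=-1))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)  # lr is an assumed value
env = gym.make("CartPole-v1")

for episode in range(500):  # illustrative episode budget
    state, _ = env.reset()
    log_probs, rewards = [], []
    done = False
    while not done:
        probs = policy(torch.as_tensor(state, dtype=torch.float32))
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        state, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(reward)
        done = terminated or truncated

    # Monte Carlo returns G_t, then the REINFORCE loss -sum_t log pi(a_t|s_t) * G_t.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + 0.99 * g  # gamma = 0.99 (assumed)
        returns.insert(0, g)
    returns = torch.tensor(returns)
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)  # variance reduction
    loss = -(torch.stack(log_probs) * returns).sum()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```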
My goal was to see how well each algorithm learns to balance the pole, measured by the total reward per episode (which equals the episode length for CartPole).
Methodology
I used the CartPole-v1 environment where the agent earns 1 point for every timestep the pole remains balanced. The two main metrics are:
- Average Return: Total reward accumulated per episode, averaged over evaluation episodes.
- Average Episode Length: Number of timesteps per episode; for CartPole the two metrics coincide, as the evaluation sketch after this list illustrates.
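The snippet below shows how these two metrics can be computed from a rollout loop. Here `select_action` is a placeholder for whichever trained agent is being evaluated; the function is a sketch of the measurement, not the exact evaluation code used in this study.

```python
import gymnasium as gym

def evaluate(select_action, n_episodes=10):
    """Roll out a policy and report average return and average episode length.

    `select_action(state)` is a hypothetical stand-in for the trained agent.
    For CartPole-v1 the two metrics coincide, because every surviving
    timestep yields a reward of 1.
    """
    env = gym.make("CartPole-v1")
    returns, lengths = [], []
    for _ in range(n_episodes):
        state, _ = env.reset()
        total_reward, steps, done = 0.0, 0, False
        while not done:
            state, reward, terminated, truncated, _ = env.step(select_action(state))
            total_reward += reward
            steps += 1
            done = terminated or truncated
        returns.append(total_reward)
        lengths.append(steps)
    return sum(returns) / n_episodes, sum(lengths) / n_episodes
```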
For training:
- Online methods (REINFORCE, DQN, and A2C) learn directly from interactions with the environment.
- The Decision Transformer collects expert demonstrations first, trains offline on that data, and is later evaluated in the live environment (a sketch of the demonstration-collection step follows this list).
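The following sketch shows one way the offline dataset could be gathered. The `expert_policy` callable and the flat `(state, action, reward)` trajectory format are illustrative assumptions; the actual demonstration format used for the Decision Transformer may differ.

```python
import gymnasium as gym

def collect_demonstrations(expert_policy, n_episodes=100):
    """Gather (state, action, reward) trajectories from an expert policy.

    `expert_policy(state)` is a hypothetical stand-in for the demonstrator.
    The returned list of trajectories is the kind of data the Decision
    Transformer trains on offline before live evaluation.
    """
    env = gym.make("CartPole-v1")
    trajectories = []
    for _ in range(n_episodes):
        state, _ = env.reset()
        episode, done = [], False
        while not done:
            action = expert_policy(state)
            next_state, reward, terminated, truncated, _ = env.step(action)
            episode.append((state, action, reward))
            state = next_state
            done = terminated or truncated
        trajectories.append(episode)
    return trajectories
```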
All results were logged with WandB and plotted locally to help compare the methods.
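A minimal logging sketch is shown below; the project name, run name, and metric keys are assumptions chosen to mirror the metrics above, not the exact keys used in these runs.

```python
import wandb

# Hypothetical project and run names.
run = wandb.init(project="cartpole-rl-comparison", name="reinforce")

# In practice these values come from the training loop; a placeholder list is shown here.
episode_returns = [12.0, 25.0, 47.0]
for episode, ep_return in enumerate(episode_returns):
    wandb.log({"episode": episode, "return": ep_return})

run.finish()
```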
Results
Below is a summary of the evaluation results:
| Algorithm | Average Return | Average Episode Length |
|---|---|---|
| REINFORCE (Policy-Based) | 500 | 500 |
| DQN (Value-Based) | 9.40 | 9 |
| A2C (Hybrid) | 9.63 | 9–10 |
| Decision Transformer (Candidate) | 10.40 | ~10 |
Note: In CartPole-v1, the agent receives a reward of 1 per timestep, so the total return equals the episode length.
Analysis
- REINFORCE learned an optimal policy quickly, balancing the pole for the full 500-step episode.
- Both DQN and A2C struggled, ending with average returns below 10 and correspondingly short episodes.
- The Decision Transformer, although trained offline on expert data, achieved evaluation results similar to DQN and A2C. This suggests that, under the current settings, its offline training didn’t translate into a competitive policy for live play.
Conclusion
My experiments show that REINFORCE reached the maximum return of 500 on CartPole-v1, while DQN, A2C, and the offline Decision Transformer were unable to learn effective policies under the current configurations. Further hyperparameter tuning for DQN, A2C, and the Decision Transformer might lead to better performance.