Comparison of Four Reinforcement Learning Methods on CartPole-v1

This project compares four RL algorithms on CartPole-v1 by evaluating their average return and episode length. The study covers online methods (REINFORCE, DQN, A2C) and an offline method (Decision Transformer) trained on expert data.

Introduction

This study compares four different RL approaches on the CartPole-v1 task. The methods are:

  • REINFORCE (Policy-Based): A straightforward policy-gradient method.
  • DQN (Value-Based): A deep Q-network that estimates action values.
  • A2C (Hybrid): An actor-critic method that combines policy and value learning.
  • Decision Transformer (Candidate): An offline method trained on expert demonstrations and then evaluated live.

My goal was to see how well each algorithm learns to balance the pole, measured by the total reward per episode (which equals the episode length for CartPole).

Methodology

I used the CartPole-v1 environment where the agent earns 1 point for every timestep the pole remains balanced. The two main metrics are:

  • Average Return: Total reward per episode.
  • Average Episode Length: Number of timesteps per episode (they’re the same for CartPole).
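
As a reference for how these two metrics can be computed, here is a minimal rollout sketch using the Gymnasium API; it is an illustration rather than the project's actual evaluation code, and `select_action` is a hypothetical stand-in for whichever trained policy is being evaluated.

```python
import gymnasium as gym
import numpy as np

def evaluate(select_action, n_episodes=10, seed=0):
    """Roll out a policy in CartPole-v1 and return (avg return, avg length)."""
    env = gym.make("CartPole-v1")
    returns, lengths = [], []
    for ep in range(n_episodes):
        obs, _ = env.reset(seed=seed + ep)
        done, ep_return, ep_len = False, 0.0, 0
        while not done:
            action = select_action(obs)  # hypothetical policy callable
            obs, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            ep_return += reward
            ep_len += 1
        returns.append(ep_return)
        lengths.append(ep_len)
    env.close()
    return np.mean(returns), np.mean(lengths)

# Example: a random-policy baseline
# avg_return, avg_length = evaluate(lambda obs: int(np.random.randint(2)))
```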

For training:

  • Online methods (REINFORCE, DQN, and A2C) learn directly from interactions with the environment.
  • The Decision Transformer collects expert demonstrations first, trains offline on that data, and is later evaluated in the live environment.
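
To illustrate the online setup, below is a minimal REINFORCE sketch in PyTorch. The network size, learning rate, discount factor, and episode count are illustrative assumptions, not the configuration used in these runs.

```python
import gymnasium as gym
import torch
import torch.nn as nn

env = gym.make("CartPole-v1")
# Small illustrative policy network: 4 observations -> 2 action logits
policy = nn.Sequential(nn.Linear(4, 128), nn.ReLU(), nn.Linear(128, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)
gamma = 0.99  # assumed discount factor

for episode in range(500):  # assumed episode budget
    obs, _ = env.reset()
    log_probs, rewards, done = [], [], False
    while not done:
        logits = policy(torch.as_tensor(obs, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        obs, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(reward)
        done = terminated or truncated

    # Discounted return-to-go for each timestep, normalized for stability
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.as_tensor(returns, dtype=torch.float32)
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)

    # Policy-gradient loss: log-probability of each action weighted by its return
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The key idea is that each action's log-probability is weighted by the return that followed it, so actions that preceded long balancing streaks are reinforced.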

All results were logged with WandB and plotted locally to help compare the methods.
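
A minimal sketch of this kind of logging, assuming one WandB run per algorithm; the project name, run name, and metric values below are placeholders, not the actual ones from the study.

```python
import wandb

# Hypothetical project/run names; the report does not specify the real ones.
run = wandb.init(project="cartpole-rl-comparison", name="reinforce")

# Suppose `history` holds (return, episode_length) pairs from evaluation;
# the values here are placeholders, not measured results.
history = [(10.0, 10), (200.0, 200), (500.0, 500)]
for episode, (ep_return, ep_len) in enumerate(history):
    wandb.log({"return": ep_return, "episode_length": ep_len}, step=episode)

run.finish()
```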

Results

Below is a summary of the evaluation results:

| Algorithm                        | Average Return | Average Episode Length |
|----------------------------------|----------------|------------------------|
| REINFORCE (Policy-Based)         | 500            | 500                    |
| DQN (Value-Based)                | 9.40           | 9                      |
| A2C (Hybrid)                     | 9.63           | 9–10                   |
| Decision Transformer (Candidate) | 10.40          | ~10                    |

Note: In CartPole-v1, the agent receives a reward of 1 per timestep, so the total return equals the episode length.

Analysis

  • REINFORCE quickly learned an optimal policy, balancing the pole for the full 500-step episode.
  • Both DQN and A2C struggled under the current settings, averaging returns below 10 and correspondingly short episodes.
  • The Decision Transformer, although trained offline on expert data, achieved evaluation results similar to DQN and A2C. This suggests that, under the current settings, its offline training didn’t translate into a competitive policy for live play.
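
For context on what "evaluated live" involves, here is a conceptual sketch of a return-conditioned Decision Transformer rollout. The target return of 500 and the `dt_policy` placeholder are assumptions; the report does not describe the model architecture or the conditioning target actually used.

```python
import gymnasium as gym
import numpy as np

def dt_policy(states, actions, returns_to_go):
    """Placeholder for the trained Decision Transformer: the real model would
    attend over the (return-to-go, state, action) token history and predict
    the next action. Here it acts randomly so the sketch runs end to end."""
    return int(np.random.randint(2))

env = gym.make("CartPole-v1")
obs, _ = env.reset()
target_return = 500.0  # assumed target: the maximum CartPole-v1 return
states, actions, returns_to_go = [], [], []
done, ep_return = False, 0.0
while not done:
    states.append(obs)
    returns_to_go.append(target_return)
    action = dt_policy(states, actions, returns_to_go)
    obs, reward, terminated, truncated, _ = env.step(action)
    actions.append(action)
    target_return -= reward  # condition the next step on the return still to earn
    ep_return += reward
    done = terminated or truncated
```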

Conclusion

My experiments show that REINFORCE solved CartPole-v1, reaching the maximum return of 500, while DQN, A2C, and the offline Decision Transformer did not learn effective policies with the current configurations. Further hyperparameter tuning for DQN, A2C, and the Decision Transformer might lead to better performance.




