
Can We Beat Mario Bros with Gymnasium and CleanRL on a Laptop?

If you know Betteridge's Law of Headlines, you're getting what you expected

A 1983 Video Game vs. a 2023 Laptop


Kickstarting one of the biggest video game franchises ever upon its release in 1983, Mario Bros has a special place in the gaming industry. The franchise spans multiple decades, is frequently the flagship release on new Nintendo systems, and has enjoyed crossover appeal in everything from animated series to comic books to middling John Leguizamo movies.
Still, this is a forty-year-old game. In 1983:
  • Apple released the Lisa, one of the first commercial computers with a GUI.
  • Microsoft Windows was announced (it wouldn't ship until 1985)
  • Microsoft Word 1.0 was released
  • Quicken was released
Suffice it to say: we've come a long way since then. Machine learning in particular has seen massive advances pretty much everywhere, but for the purposes of this report, we want to talk specifically about reinforcement learning, which has racked up some genuinely impressive gaming wins in recent years.
Impressive stuff. But can a 40-year-old game hold up to a 2023 AI that I can run from my laptop? That's what we're here to find out.

Deep Q-Learning (DQN)

Here, we'll start by looking at DQN (originally from DeepMind's "Playing Atari with Deep Reinforcement Learning"). A few concepts to keep in mind:
NOTE: If you're completely new to DQN, this report is a nice, simple introduction. If you find yourself confused by any of the below, we recommend giving that a read and coming back.

Q-Learning

Q-Learning is based on the idea that every state-action pair has a Q-value: an estimate of the total reward you can expect if you take that action from that state. You can think of the Q function as converting the current state and an action into that estimate:
Q(state, action) = expected future reward (the Q-value)
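In tabular Q-Learning, that function is literally a table that gets nudged toward better estimates after every step. A minimal sketch (purely illustrative; the state/action counts, alpha, and gamma here are made up):

import numpy as np

n_states, n_actions = 16, 4
Q = np.zeros((n_states, n_actions))   # Q[s, a] = current estimate of expected reward
alpha, gamma = 0.1, 0.99              # learning rate and discount factor

def q_update(state, action, reward, next_state, done):
    # TD target: immediate reward plus the discounted best value of the next state.
    target = reward + (0.0 if done else gamma * Q[next_state].max())
    # Nudge the current estimate toward that target.
    Q[state, action] += alpha * (target - Q[state, action])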

Deep Q-Learning

DQN builds on Q-Learning by having a neural network learn the Q function itself. The network takes the current state as input and outputs a Q-value for every available action, and the agent acts by picking the action with the highest expected Q-value (i.e., reward).
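For Atari-style inputs, that network is typically a small convolutional net that reads a stack of frames and outputs one Q-value per action. The sketch below roughly mirrors the architecture used in the DeepMind papers and in CleanRL's dqn_atari.py; treat the exact layer sizes as an approximation:

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a stack of 4 grayscale 84x84 frames to one Q-value per action."""
    def __init__(self, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512),
            nn.ReLU(),
            nn.Linear(512, n_actions),   # one Q-value per action
        )

    def forward(self, x):
        return self.net(x / 255.0)       # scale raw pixels to [0, 1]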

Experience Replay

Rather than training on only the current transition, DQN keeps a Replay Buffer: a table of the agent's past experiences (state, action, reward, next state). Each training update samples a batch from that buffer, which increases the amount of data behind every update and reuses the agent's past experience far more efficiently.
For example, instead of learning from just the latest state and action, DQN might sample a batch of 64 transitions from a buffer of the last 10,000 in order to update the network.
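A replay buffer can be as simple as a bounded queue of transitions plus random sampling. This is a bare-bones sketch of the idea; CleanRL itself reuses Stable-Baselines3's ReplayBuffer rather than rolling its own:

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity: int = 10_000):
        # Oldest transitions fall off the end once capacity is reached.
        self.storage = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int = 64):
        # Random minibatch of past transitions used for the gradient update,
        # not for picking the next action.
        return random.sample(self.storage, batch_size)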

Target Network

DQN actually makes use of two neural networks: a primary network for choosing actions and a target network for generating the Q-value targets.
In traditional Q-Learning, only the current state and action feed into a given update. With a neural network, though, the same weights that pick actions also produce the targets the network is trained against, so every update shifts those targets and training can become unstable. To keep things stable, we use two networks (a rough sketch follows this list):
  • Primary Network: updated at every step and used to choose actions.
  • Target Network: updated only every n steps (by copying over the primary network's weights) and used to generate the Q-value targets.
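Here is a rough sketch of how the two networks interact during a training step. It assumes a q_network (like the one above), an optimizer, and a sampled minibatch of float tensors already exist; it illustrates the idea rather than reproducing CleanRL's exact code:

import copy
import torch
import torch.nn.functional as F

target_network = copy.deepcopy(q_network)   # starts as a frozen copy of the primary
gamma, target_update_every = 0.99, 1000

def train_step(step, states, actions, rewards, next_states, dones):
    with torch.no_grad():
        # The target network supplies the bootstrapped value of the next state,
        # so the targets don't shift with every primary-network update.
        next_q = target_network(next_states).max(dim=1).values
        td_target = rewards + gamma * next_q * (1 - dones.float())
    # The primary network's estimate for the actions that were actually taken.
    q_pred = q_network(states).gather(1, actions.long().unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q_pred, td_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Every n steps, copy the primary weights into the target network.
    if step % target_update_every == 0:
        target_network.load_state_dict(q_network.state_dict())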
In case you're curious, here's the pseudocode from DeepMind's Playing Atari with Deep Reinforcement Learning:


Expectation

First, let's look at how well DQN does with other Atari games. This image comes straight from the original paper:

What jumps out first is that DQN performs best on games such as Boxing, Breakout, and Video Pinball. Notably, these have few controls and straightforward objectives. In the case of Breakout, there are only two real controls (right and left), and DQN learns to use both extremely well.
On the other hand, DQN performs poorly on more complex games such as Montezuma's Revenge and Frostbite. Both require navigating platforms, much like Mario Bros., and DQN scores significantly below human level on both. Knowing this, I don't have high expectations going into this experiment, but it's certainly a baseline we can look at.

CleanRL DQN + Gymnasium vs. Mario Bros

I will be comparing two different runs of the same DQN implementation:
  • Out-of-the-box implementation from CleanRL
  • An approach with adjusted parameters
We'll start with the former:

Out-of-the-Box Implementation

CleanRL has an implementation of DQN that I will be using as the basis for this demonstration. There are some default parameters and generated charts that come with the CleanRL implementation.
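Before looking at the runs, it's worth sanity-checking that the Mario Bros environment itself loads in Gymnasium. A minimal sketch, assuming gymnasium and ale-py (with the Atari ROMs) are installed:

import gymnasium as gym
import ale_py  # on recent ale-py versions this import registers the ALE/... envs (older versions register them automatically)

env = gym.make("ALE/MarioBros-v5", render_mode="rgb_array")
obs, info = env.reset(seed=0)
for _ in range(100):
    action = env.action_space.sample()   # random actions, just to smoke-test the environment
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()
env.close()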
Straight from the CleanRL Docs, here is the explanation for the two charts I will be showing repeatedly.

Our CLI Command:
python3 cleanrl/dqn_atari.py --env-id ALE/MarioBros-v5 \
    --track --capture-video --buffer-size 50000
Our panels:


Our results: We ran this off-the-shelf model for a million steps and we got to level two. Maybe a different approach will work better.

Approach Two: Experimentation

There are a few things I tried when it came to tinkering with hyperparameters.
  • End Epsilon: 0.01 -> 0.001
    • CleanRL's implementation stops annealing exploration at epsilon = 0.01; I dropped that by an order of magnitude to 0.001 to leave even less random exploration late in training (see the schedule sketch after this list).
  • Batch Size: 32 -> 64
    • I also doubled the batch size from 32 to 64 so that each update learns from a larger sample of transitions. Looking back, keeping it at 32 might have been better, since many of Mario's moves are inherently reactive.
  • Total Timesteps: 10_000_000 -> 1_000_000
    • I believed that increasing the batch size would allow me to decrease the number of timesteps and save on total training time. I was sorely mistaken.
  • Train Frequency: 4 -> 10
    • I originally considered doubling the train frequency from 4 to 8 to match the doubled batch size, but I also wanted to see whether a non-power-of-two value would change anything.
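For context on the epsilon change: CleanRL anneals epsilon linearly from --start-e down to --end-e over the first chunk of training and then holds it flat, roughly like the sketch below, so lowering --end-e only changes how much random exploration remains in that late, flat phase:

def linear_schedule(start_e: float, end_e: float, duration: int, t: int) -> float:
    # Epsilon falls linearly from start_e to end_e over `duration` steps,
    # then stays at end_e for the rest of training.
    slope = (end_e - start_e) / duration
    return max(slope * t + start_e, end_e)

# Late in training the schedule has already bottomed out, so only end_e matters:
# linear_schedule(1.0, 0.01, 1_000_000, 2_000_000)  -> 0.01   (CleanRL default end-e)
# linear_schedule(1.0, 0.001, 1_000_000, 2_000_000) -> 0.001  (this run)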

Observations:

  • The losses/td_loss peaked at a much lower value (~0.11 versus ~88.00 for the out-of-the-box run). This is likely due to the larger batch size, lower end epsilon, and less frequent training updates.
  • I should have trained this run for more steps, but given my hardware limitations I settled on a lower timestep count to save time. Increasing it to 10-20 million for future runs would be ideal.
  • Mario was still able to score points at the end of the experiment! The same cannot be said for my next experiment.

Adjusted Parameter Approach:

Our CLI Command:
python3 cleanrl/dqn_atari.py --env-id ALE/MarioBros-v5 \
    --end-e 0.001 --batch-size 64 \
    --capture-video --track --total-timesteps 1000000 \
    --train-frequency 10
Our panels:


Our results: We didn't make it out of the first level.

Bonus: Learning from a "Fun" Mistake

For this run, I accidentally launched with gamma set to -0.8. A negative discount factor makes no sense in the algorithm, but it offers an interesting glimpse of what DQN does in this situation: nothing but run right.
A hilarious outcome, sure, but also a quick lesson that the wrong hyperparameter is always around the corner.
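If you want to see why a negative gamma is nonsense, look at the TD target the network is trained against: with gamma = -0.8, the value of the future flips sign at every step of lookahead, so the learning signal stops meaning anything. A tiny illustration with made-up numbers:

def td_target(reward: float, max_next_q: float, gamma: float) -> float:
    # td_target = r + gamma * max_a Q(s', a)
    return reward + gamma * max_next_q

print(td_target(1.0, 10.0, 0.99))   #  10.9 -> good futures count for the agent
print(td_target(1.0, 10.0, -0.8))   #  -7.0 -> good futures count against it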
Our CLI Command:
python3 cleanrl/dqn_atari.py --env-id ALE/MarioBros-v5 --total-timesteps 10000000 \
    --track --capture-video --buffer-size 10000 \
    --gamma -0.8 --target-network-frequency 1000 \
    --batch-size 256 --start-e 1 --end-e 0.1 --exploration-fraction 0.6 \
    --learning-starts 10000 --train-frequency 4
Our panels:


Our results: Turns out you gotta run left too.

Conclusion

In conclusion, the original 2015 DQN is a far cry from solving Mario Bros. The Out-of-the-Box Implementation ran for 1 million steps and still could only get to level two. My altered implementation did even less.
DQN is certainly not state of the art in 2023. Plenty of other RL algorithms learn better, at least on the benchmark game, Atari 2600's Pong. As of 2023, here is the current leaderboard, courtesy of Papers with Code:

One clear weakness of this report is how little experimentation was done. For that, Sweeps on Launch will be the next step in tuning DQN toward the best possible hyperparameters.
