Replit Hackathon Winner: Implementing Q-Learning from Scratch
How I implemented Q-Learning from scratch in one week and created a game using it during the Replit x Weights and Biases Hackathon
Created on February 11 | Last edited on March 8
Note: This post won the award for "Best W&B Report" during our recent hackathon with Replit. You can view the associated Repl here.
Day 0
After seeing some announcements about Replit's Machine Learning Hackathon in collaboration with Weights & Biases, I decided on a whim to sign up.
Little did I know what was in store for me.
Day 1
Brainstorming
After being accepted to the hackathon, I started thinking up some ideas:
- AI learns to walk: Pretty self-explanatory.
- A traditional ML project: For example, using GPT or some other model to do something interesting.
- God's world: A simulation world where an AI plays the role of God.
- A puzzle game: Some sort of game where an agent controlled by an AI would try to solve puzzles.
Inspired by creators like Code Bullet, and because I'm familiar with game development, I decided to teach my AI to play a unique game.
First though, I needed to do some research:
Research
Visual Design
I settled on isometric pixel art for the visual aspect, because I had been dabbling in isometric rendering just a few days earlier.

An initial isometric render.
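For anyone who hasn't touched isometric rendering before, the core of it is just a small coordinate transform from grid space to screen space. Here's a minimal sketch; the 2:1 tile size of 64×32 is a placeholder, not necessarily what my renderer uses:

```python
# Map grid coordinates to screen coordinates for a classic 2:1 isometric tile.
# TILE_W and TILE_H are placeholder dimensions.
TILE_W, TILE_H = 64, 32

def grid_to_screen(grid_x, grid_y):
    screen_x = (grid_x - grid_y) * TILE_W // 2
    screen_y = (grid_x + grid_y) * TILE_H // 2
    return screen_x, screen_y
```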
Neural Network
Until this point, I had never done much with machine learning in general. But I decided to dive straight down the rabbit hole.
After an hour of research, I decided on reinforcement learning, as I felt that was most suited to my purposes.
There are a few different types of reinforcement learning algorithms, such as Q-Learning, Deep Q-Learning, Monte Carlo methods, and plenty of others. I chose SARSA, which is an on-policy variant of Q-Learning. (I still refer to it as Q-Learning because, to my mind, it's pretty much the same thing; the update rules after the table below show the one real difference.)
Here are some reinforcement learning algorithms and how they compare:
| Algorithm | Description | Policy | Action Space | State Space | Operator |
|---|---|---|---|---|---|
| Monte Carlo | Every-visit Monte Carlo | Either | Discrete | Discrete | Sample-means |
| Q-learning | State–action–reward–state | Off-policy | Discrete | Discrete | Q-value |
| SARSA | State–action–reward–state–action | On-policy | Discrete | Discrete | Q-value |
| DQN | Deep Q-Network | Off-policy | Discrete | Continuous | Q-value |
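Concretely, the only thing separating SARSA from vanilla Q-learning is how the update bootstraps: SARSA uses the action the agent will actually take next, while Q-learning uses the best possible next action. In standard textbook form:

$$Q(s,a) \leftarrow Q(s,a) + \alpha\,\big[r + \gamma\, Q(s', a') - Q(s,a)\big] \quad \text{(SARSA, on-policy)}$$

$$Q(s,a) \leftarrow Q(s,a) + \alpha\,\big[r + \gamma\, \max_{a'} Q(s', a') - Q(s,a)\big] \quad \text{(Q-learning, off-policy)}$$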
Limitations
Repls are limited to 1 GB of storage. This meant that major machine learning libraries such as TensorFlow and Keras wouldn't fit on my repl. That's when I came up with an absolutely insane idea:
Rather than relying upon some external library, I would code everything from scratch.
This, in theory, would allow me to have control over every aspect of my project, and I'd be able to make a lightweight implementation of Q-Learning.
In doing so, I gained a far better grasp of how Q-Learning, and reinforcement learning in general, work their wonders.
Day 2
I started working on my pixel art and isometric rendering next. It took far longer than expected, as my sparse artistic skills were rendered null by the fact that I needed to draw everything isometrically.

Our robot protagonist. I think it's pretty cute 😅
Day 3
I decided to optimize my code as much as possible in order to give a smooth viewing experience. During this period, I struggled a bit against Replit's VNC viewer, but was eventually able to work everything out.
Day 4
I continued to design the puzzles for my game, as well as creating animations and making sure that everything worked well.
The game itself is pretty simple: that cute little robot above needs to navigate to colored squares, and the level is complete when it reaches the blue one. We start with a pretty simple line, move on to grids and winding paths, and eventually end up with something a bit more complicated. In some levels, the green square starts out yellow, and our robot friend needs to step there first to turn it green, then navigate to the blue one.
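To give a sense of how these rules can be boiled down into something a learning algorithm can consume, here's a toy sketch of the very first "line" level. The class name, reward values, and state encoding are illustrative, not my exact game code:

```python
# Purely illustrative: a toy version of the simplest "line" level.
class LineLevel:
    def __init__(self, length=5):
        self.length = length   # tiles 0..length-1, with the blue square at the end
        self.reset()

    def reset(self):
        self.pos = 0           # the robot starts on the first tile
        return self.pos        # the state is just the robot's tile index

    def step(self, action):
        # Two discrete actions: 0 = step backward, 1 = step forward
        self.pos += 1 if action == 1 else -1
        if self.pos < 0 or self.pos >= self.length:
            return self.pos, -1.0, True    # fell off the path: penalty, episode over
        if self.pos == self.length - 1:
            return self.pos, 1.0, True     # reached the blue square: level complete
        return self.pos, -0.01, False      # small step cost to encourage short paths
```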

The final level of my puzzle game...
Day 5
With the game finished, I finally got to work on my Q-learning implementation. At first, I browsed the web looking for similar examples. However, I quickly ran into an issue: most reinforcement learning tutorials relied on OpenAI's Gym module.
Since I had created my own environment, I needed to implement rewards, states, and more from scratch. After hours of searching, and at my wit's end, I decided to pop open Replit's Ghostwriter Chat. After I asked a few questions, it confirmed that a custom Q-Table implementation would be possible. I was soon able to come up with a custom implementation of a Q-Table that relied not on a NumPy array but on a traditional Python dictionary:
```python
# A simplified snippet of my Q-learning implementation
import random
import numpy as np

class QTable:
    def __init__(self, n_actions=4, alpha=0.2, gamma=0.9, epsilon=0.1):
        self.alpha = alpha          # learning rate
        self.gamma = gamma          # discount factor
        self.epsilon = epsilon      # exploration rate
        self.n_actions = n_actions
        self.q_table = {}           # state -> list of Q-values, one per action

    def epsilon_greedy(self, state):
        # Initialize unseen states with zero Q-values
        self.q_table.setdefault(state, [0.0] * self.n_actions)
        # Explore with probability epsilon, otherwise pick the best known action
        r = np.random.uniform(0, 1)
        if r < self.epsilon:
            return random.randint(0, self.n_actions - 1)
        return self.q_table[state].index(max(self.q_table[state]))

    def update_q(self, state, action, reward, next_state, next_action):
        self.q_table.setdefault(next_state, [0.0] * self.n_actions)
        # SARSA update: the TD error is r + gamma * Q(s', a') - Q(s, a)
        self.q_table[state][action] += self.alpha * (
            reward
            + self.gamma * self.q_table[next_state][next_action]
            - self.q_table[state][action]
        )

    def eval_greedy(self, state):
        # Greedy policy for evaluation: always take the highest-valued action
        return self.q_table[state].index(max(self.q_table[state]))
```
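To show how this class actually gets used, here's a rough outline of the training loop, paired with the toy LineLevel environment sketched back on Day 4. My real loop is wired into the game's rendering code, so treat this as a sketch rather than my exact implementation:

```python
# Rough outline of SARSA training against the toy LineLevel environment
qt = QTable(n_actions=2, alpha=0.2, gamma=0.9, epsilon=0.1)
env = LineLevel(length=5)

for episode in range(2000):
    state = env.reset()
    action = qt.epsilon_greedy(state)
    done = False
    total_reward = 0.0
    while not done:
        next_state, reward, done = env.step(action)
        next_action = qt.epsilon_greedy(next_state)
        # On-policy update: bootstraps from the action we will actually take next
        qt.update_q(state, action, reward, next_state, next_action)
        state, action = next_state, next_action
        total_reward += reward
```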
Day 6
I decided to finally integrate Weights & Biases into my project and start doing some final training.
Surprisingly, Weights & Biases was super easy to set up. I got started training in less than 10 minutes and could visualize my training data without any fuss.
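For anyone curious, the setup really is just a couple of calls: wandb.init() once at the start, then wandb.log() inside the training loop. The project and metric names below are examples of the kind of thing I logged, not an exact transcript:

```python
import wandb

wandb.init(project="qlearning-puzzle-bot")   # example project name

# Inside the training loop, once per episode:
wandb.log({
    "episode": episode,
    "total_reward": total_reward,
    "epsilon": qt.epsilon,
    "q_table_size": len(qt.q_table),
})
```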
From my graph of the Q-Table size, I realized that my epsilon function was becoming less and less efficient over time. To amend that, I tried several workarounds, such as decaying epsilon over time, oscillating it between 1 and 0 over a fixed period, and simply holding it constant.
In the end, tweaking the learning rate and discount factor of my algorithm helped the agent solve the final puzzle.
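As an example, the decaying variant I tried looked roughly like this, applied once per episode (the decay rate and floor are illustrative values, not the exact numbers I settled on):

```python
# Shrink epsilon geometrically each episode, but keep a small floor so the
# agent never stops exploring entirely (values below are just examples)
EPSILON_DECAY = 0.995
EPSILON_MIN = 0.01

qt.epsilon = max(EPSILON_MIN, qt.epsilon * EPSILON_DECAY)
```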
Day 7
I decided to polish my game even further and do some final bug testing. And here we are: I'm drafting up this report right now.
Below are some graphs of my training! I'm still quite new to using wandb, so they probably don't look the best 😅 but they get the point across. I've also included some notes on what I learned alongside them:
Hackathon Summary
This hackathon has really been an eye-opener for me and I've learned so many useful things over the course of a single week. I really appreciate the work of Replit and Weights & Biases, and hope to compete in future hackathons 😄
Comments
I saw your Repl. How did you train it? I deleted the model.pkl file, so I need to retrain it. Should I run python3 game.py or qlearn.py?
Enjoyed reading this, thanks for writing!
Loved the day-to-day journal of your progress. Keep at it. :)
Nice!