Replit Hackathon Winner: Implementing Q-Learning from Scratch
How I implemented Q-Learning from scratch in one week and created a game using it during the Replit x Weights and Biases Hackathon
Created on February 11 | Last edited on March 8
Note: This post won the award for "Best W&B Report" during our recent hackathon with Replit. You can view the associated Repl here.
Day 0
After seeing some announcements about Replit's Machine Learning Hackathon in collaboration with Weights & Biases, I decided on a whim to sign up.
Little did I know what was in store for me.
Day 1
Brainstorming
After being accepted to the hackathon, I started thinking up some ideas:
- AI learns to walk: Pretty self-explanatory.
- A traditional ML project: For example, using GPT or some other model to do something interesting.
- God's world: A simulation world where an AI plays the role of God.
- A puzzle game: Some sort of game where an agent controlled by an AI would try to solve puzzles.
Inspired by creators like Code Bullet, and because I'm familiar with game development, I decided to teach my AI to play a unique game.
First though, I needed to do some research:
Research
Visual Design
I settled on isometric pixel art for the visual aspect, because I had been dabbling in isometric rendering just a few days earlier.

An initial isometric render.
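For anyone who hasn't touched isometric rendering before, the core of it is just a small coordinate transform from grid space to screen space. Here's a minimal sketch; the 2:1 tile size of 64×32 is a placeholder, not necessarily what my renderer uses:

```python
# Map grid coordinates to screen coordinates for a classic 2:1 isometric tile.
# TILE_W and TILE_H are placeholder dimensions.
TILE_W, TILE_H = 64, 32

def grid_to_screen(grid_x, grid_y):
    screen_x = (grid_x - grid_y) * TILE_W // 2
    screen_y = (grid_x + grid_y) * TILE_H // 2
    return screen_x, screen_y
```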
Neural Network
Until this point, I had never done much with machine learning in general. But I decided to dive straight down the rabbit hole.
After an hour of research, I decided on reinforcement learning, as I felt that was most suited to my purposes.
There are a few different types of reinforcement learning algorithms, such as Q-Learning, Deep Q-Learning, Monte Carlo methods, and plenty of others. I chose SARSA, which is an on-policy variant of Q-Learning. (I still refer to it as Q-Learning because, to my mind, it's pretty much the same thing; the update rules after the table below show the one real difference.)
Here are some reinforcement learning algorithms and how they compare:
| Algorithm | Description | Policy | Action Space | State Space | Operator |
|---|---|---|---|---|---|
| Monte Carlo | Every-visit Monte Carlo | Either | Discrete | Discrete | Sample-means |
| Q-learning | State–action–reward–state | Off-policy | Discrete | Discrete | Q-value |
| SARSA | State–action–reward–state–action | On-policy | Discrete | Discrete | Q-value |
| DQN | Deep Q-Network | Off-policy | Discrete | Continuous | Q-value |
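Concretely, the only thing separating SARSA from vanilla Q-learning is how the update bootstraps: SARSA uses the action the agent will actually take next, while Q-learning uses the best possible next action. In standard textbook form:

$$Q(s,a) \leftarrow Q(s,a) + \alpha\,\big[r + \gamma\, Q(s', a') - Q(s,a)\big] \quad \text{(SARSA, on-policy)}$$

$$Q(s,a) \leftarrow Q(s,a) + \alpha\,\big[r + \gamma\, \max_{a'} Q(s', a') - Q(s,a)\big] \quad \text{(Q-learning, off-policy)}$$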
Limitations
Repls are limited to 1 GB of storage. This meant that major machine learning libraries such as TensorFlow and Keras wouldn't fit on my repl. That's when I came up with an absolutely insane idea:
Rather than relying upon some external library, I would code everything from scratch.
This, in theory, would allow me to have control over every aspect of my project, and I'd be able to make a lightweight implementation of Q-Learning.
In doing so, I gained a far better grasp of how Q-Learning, and reinforcement learning in general, work their wonders.
Day 2
I started working on my pixel art and isometric rendering next. It took far longer than expected, as my sparse artistic skills were rendered null by the fact that I needed to draw everything isometrically.

Our robot protagonist. I think it's pretty cute 😅
Day 3
I decided to optimize my code as much as possible in order to give a smooth viewing experience. During this period, I struggled a bit against Replit's VNC viewer, but was eventually able to work everything out.
Day 4
I continued to design the puzzles for my game, as well as creating animations and making sure that everything worked well.
The game itself is pretty simple: that cute little robot above needs to navigate to colored squares, and the level is complete when it reaches the blue one. We start with a pretty simple line, move on to grids and winding paths, and eventually end up with something a bit more complicated. In some levels, the green square starts out yellow, and our robot friend needs to step there first to turn it green, then navigate to the blue one.
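To give a sense of how these rules can be boiled down into something a learning algorithm can consume, here's a toy sketch of the very first "line" level. The class name, reward values, and state encoding are illustrative, not my exact game code:

```python
# Purely illustrative: a toy version of the simplest "line" level.
class LineLevel:
    def __init__(self, length=5):
        self.length = length   # tiles 0..length-1, with the blue square at the end
        self.reset()

    def reset(self):
        self.pos = 0           # the robot starts on the first tile
        return self.pos        # the state is just the robot's tile index

    def step(self, action):
        # Two discrete actions: 0 = step backward, 1 = step forward
        self.pos += 1 if action == 1 else -1
        if self.pos < 0 or self.pos >= self.length:
            return self.pos, -1.0, True    # fell off the path: penalty, episode over
        if self.pos == self.length - 1:
            return self.pos, 1.0, True     # reached the blue square: level complete
        return self.pos, -0.01, False      # small step cost to encourage short paths
```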

The final level of my puzzle game...
Day 5
With the game finished, I finally got to work on my Q-learning implementation. At first, I browsed the web looking for similar examples. However, I quickly ran into an issue: most reinforcement learning tutorials relied on OpenAI's Gym module.
Since I had created my own environment, I needed to implement rewards, states, and more from scratch. After hours of searching, and at my wit's end, I decided to pop open Replit's Ghostwriter Chat. After I asked a few questions, it confirmed that a custom Q-Table implementation would be possible. I was soon able to come up with a custom implementation of a Q-Table that relied not on a NumPy array but on a traditional Python dictionary:
```python
# A simplified snippet of my Q-learning implementation
import random
import numpy as np

class QTable:
    def __init__(self, n_actions=4, alpha=0.2, gamma=0.9, epsilon=0.1):
        self.alpha = alpha          # learning rate
        self.gamma = gamma          # discount factor
        self.epsilon = epsilon      # exploration rate
        self.n_actions = n_actions
        self.q_table = {}           # state -> list of Q-values, one per action

    def epsilon_greedy(self, state):
        # Initialize unseen states with zero Q-values
        self.q_table.setdefault(state, [0.0] * self.n_actions)
        # Explore with probability epsilon, otherwise pick the best known action
        r = np.random.uniform(0, 1)
        if r < self.epsilon:
            return random.randint(0, self.n_actions - 1)
        return self.q_table[state].index(max(self.q_table[state]))

    def update_q(self, state, action, reward, next_state, next_action):
        self.q_table.setdefault(next_state, [0.0] * self.n_actions)
        # SARSA update: the TD error is r + gamma * Q(s', a') - Q(s, a)
        self.q_table[state][action] += self.alpha * (
            reward
            + self.gamma * self.q_table[next_state][next_action]
            - self.q_table[state][action]
        )

    def eval_greedy(self, state):
        # Greedy policy for evaluation: always take the highest-valued action
        return self.q_table[state].index(max(self.q_table[state]))
```
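To show how this class actually gets used, here's a rough outline of the training loop, paired with the toy LineLevel environment sketched back on Day 4. My real loop is wired into the game's rendering code, so treat this as a sketch rather than my exact implementation:

```python
# Rough outline of SARSA training against the toy LineLevel environment
qt = QTable(n_actions=2, alpha=0.2, gamma=0.9, epsilon=0.1)
env = LineLevel(length=5)

for episode in range(2000):
    state = env.reset()
    action = qt.epsilon_greedy(state)
    done = False
    total_reward = 0.0
    while not done:
        next_state, reward, done = env.step(action)
        next_action = qt.epsilon_greedy(next_state)
        # On-policy update: bootstraps from the action we will actually take next
        qt.update_q(state, action, reward, next_state, next_action)
        state, action = next_state, next_action
        total_reward += reward
```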
Day 6
I decided to finally integrate Weights & Biases into my project and start doing some final training.
Surprisingly, Weights & Biases was super easy to set up. I got started training in less than 10 minutes and could visualize my training data without any fuss.
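For anyone curious, the setup really is just a couple of calls: wandb.init() once at the start, then wandb.log() inside the training loop. The project and metric names below are examples of the kind of thing I logged, not an exact transcript:

```python
import wandb

wandb.init(project="qlearning-puzzle-bot")   # example project name

# Inside the training loop, once per episode:
wandb.log({
    "episode": episode,
    "total_reward": total_reward,
    "epsilon": qt.epsilon,
    "q_table_size": len(qt.q_table),
})
```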
From my graph of the Q-Table size, I realized that my epsilon function was becoming less and less efficient over time. To amend that, I tried several workarounds, such as decaying epsilon over time, oscillating it between 1 and 0 over a fixed period, and simply holding it constant.
In the end, tweaking the learning rate and discount factor of my algorithm helped the agent solve the final puzzle.
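As an example, the decaying variant I tried looked roughly like this, applied once per episode (the decay rate and floor are illustrative values, not the exact numbers I settled on):

```python
# Shrink epsilon geometrically each episode, but keep a small floor so the
# agent never stops exploring entirely (values below are just examples)
EPSILON_DECAY = 0.995
EPSILON_MIN = 0.01

qt.epsilon = max(EPSILON_MIN, qt.epsilon * EPSILON_DECAY)
```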
Day 7
I decided to polish my game even further and do some final bug testing. And here we are: I'm drafting up this report right now.
Below are some graphs of my training! I'm still quite new to using wandb, so they probably don't look the best 😅 but they get the point across. I've also included some notes on what I learned alongside them:
Hackathon Summary
This hackathon has really been an eye-opener for me and I've learned so many useful things over the course of a single week. I really appreciate the work of Replit and Weights & Biases, and hope to compete in future hackathons 😄
Comments
I saw your Repl. How did you train it? I deleted the model.pkl file, so I need to retrain it. Should I run python3 game.py or qlearn.py?
Enjoyed reading this, thanks for writing!
Loved the day-to-day journal of your progress. Keep at it. :)
Nice!