Solving Wordle with Reinforcement Learning
In this article, we explore how to use Deep Reinforcement Learning (RL) to teach a bot to play Wordle, the word-guessing game now owned by The New York Times.
Created on February 12 | Last edited on July 29
TL;DR
You read the title, so chances are you know what you're getting into here. But if not: I used reinforcement learning to teach an agent to play Wordle.
The current best model I trained played a total of around 20 million games before it was decently good and now has ~99.5% win rate on the "Official" Wordle game, averaging ~3.9 guesses per game.
There are definitely more efficient ways to build a Wordle-bot (see e.g., 3blue1brown's very nice entropy minimization approach), but this was a fun excuse for me to learn about some DeepRL methods.
If you don’t want to read all the technical nonsense below (ed. note: you really should, though), the approach that worked for me was A2C (Advantage Actor Critic), a policy gradient Deep RL method. I used a staged training approach where I progressively trained the network to solve harder and harder problems (by increasing the vocabulary size), warm-starting the network each time it moved on to a more difficult problem. I also designed a neural net where, instead of learning the full space of ~13k possible discrete actions, the model only needed to learn 130 outputs.
You can test out the Deep RL agent yourself here. (This is hosted on Heroku's free tier so please be patient if it needs up to a minute to spin up).
Here's what we'll be covering in this article:
TL;DR
Wordle
Other Automated Solutions
Reinforcement Learning
Deep RL
State and Action representation
Wordle Environment and Training
Other tweaks and hacks
Results
Wordle
Just in case you haven’t played Wordle, it’s a game where you have to guess a 5-letter word within 6 tries. After each guess, the game tells you which letters are in the correct spot, which are in the word but in the wrong spot, and which don’t appear in the word at all.
Here's an example with a particularly contentious target word:

An illustration of how Wordle works, for the target word KNOLL.
One of the big challenges is that each of your guesses must be a valid word; i.e. you can’t just enter characters that you want to try out. It’s quite a challenging game the first few times you play!
Anecdotally though, most people are able to figure out the word within 3-5 guesses. If the number of guesses you take is your score, I suspect a very good human averages in the low-to-mid 3's.
Other Automated Solutions
There are some very good solvers out there, my favorite right now being 3blue1brown's entropy minimization solution.
I initially tried to do something similar by following this minimax approach, which roughly attempts to guess the word that eliminates the largest number of remaining candidates in the worst case, no matter which feedback comes back. In other words: it tries to minimize the worst case. When there's only one word remaining, it guesses that word and wins. The author measured its worst-case performance at 5 guesses, which is extremely good for such a simple heuristic!
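For the curious, here's a minimal sketch of that minimax idea (my own simplified illustration, not the linked author's code; the feedback helper below glosses over Wordle's exact duplicate-letter rules):

```python
from collections import Counter

def feedback(guess: str, target: str) -> str:
    """Simplified Wordle feedback: G = right spot, Y = in word, . = absent.
    (Real Wordle has extra rules for duplicate letters.)"""
    return "".join(
        "G" if g == t else ("Y" if g in target else ".")
        for g, t in zip(guess, target)
    )

def minimax_guess(candidates: list[str], valid_guesses: list[str]) -> str:
    """Pick the guess whose largest feedback bucket (the worst case) is smallest."""
    if len(candidates) == 1:
        return candidates[0]
    best_word, best_worst_case = valid_guesses[0], float("inf")
    for guess in valid_guesses:
        # Group remaining candidates by the feedback this guess would produce.
        buckets = Counter(feedback(guess, target) for target in candidates)
        worst_case = max(buckets.values())
        if worst_case < best_worst_case:
            best_word, best_worst_case = guess, worst_case
    return best_word
```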
Reinforcement Learning
After scratching my Golang itch and reimplementing the minimax solver, I figured the next thing to try was incorporating some sort of information on how frequently certain words occur.
See, at every stage of the game, you have ~13k possible words to guess, and you want to choose the one that maximizes some expected value, which leads to an explore/exploit tradeoff. That is exactly the kind of problem reinforcement learning is built to handle.
In the Reinforcement Learning setting, you are an agent with some notion of state and are able to take some action. The environment will update the state based on the action you take. The goal of the agent is to learn a policy that, given a current state, chooses the action that maximizes the agent’s total expected rewards.
For small problems, you can use value or policy iteration techniques to get to the optimal policy. Unfortunately, I wasn’t able to come up with a tractable formulation. One way to represent the state is, for each letter, to track whether it’s been attempted, and if it has, which of the 5 positions it’s still possible in (i.e., yes, maybe, or no for each of the 5 spots). You could store this in a binary vector of width 26 + 3*5*26 = 416.
Most of the state space is unreachable, but it's still enormous. One bound on the number of reachable states: there are ~13k actions you could take at each turn, and after each guess the game can respond with one of 3^5 combinations of Noes, Maybes, and Yeses. So you could think of this as (13k * 3^5)^6 possible game sequences, approx. 10^39.

Wordle State Debugging, target word was “WHEEL”, it won on the 6th try.
So, ignoring for now the fact that that’s an astronomically large number, one approach to RL is to ask: given my current state, what’s the action that maximizes my expected reward? You could store this data in a 2D array, a “Q-table”, indexed by the current state and the set of available discrete actions, with each value being the expected reward for taking that action in that state. Once the table is filled in, you just choose the action with the largest expected value given your current state, play the word, and then see what state you end up in next. This is the tabular value iteration approach.
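In code, the lookup itself is trivial; the catch is that the table is impossibly large for Wordle. A toy sketch with made-up sizes:

```python
import numpy as np

# Toy sizes for illustration only; Wordle's real state space can't be tabulated.
n_states, n_actions = 10_000, 13_000
q_table = np.zeros((n_states, n_actions))

def greedy_action(state_index: int) -> int:
    """Choose the action with the highest stored expected reward for this state."""
    return int(np.argmax(q_table[state_index]))
```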
Deep RL
Enter function approximation, where you use some parameterized function to approximate the Q-table, say a linear combination of the state vector. Deep Q-learning is just using a neural network as this function approximator.
More recently, however, Policy Gradient methods seem to have had more success than Deep Q-learning. I tried both Deep Q-learning and Advantage Actor Critic (A2C) methods because they had example implementations in PyTorch Lightning, which made it easy to get started. I highly recommend the papers, as well as this YouTube series of lectures; they’re very digestible.
The idea behind policy gradients is really damn cool. Basically, if you have a discrete action space (like we do in the case of Wordle), gradient descent doesn’t work well because the agent’s approach is to choose the action that maximizes some expectation, and, well, maximums make gradients sad.
The trick with policy gradients is to make the agent choose actions probabilistically; then, based on your loss function, you can nudge the agent to make a given action more or less probable.
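To make that concrete, here's a rough sketch of what an A2C-style loss looks like for a discrete action space (a simplified illustration, not my exact training code):

```python
import torch
import torch.nn.functional as F

def a2c_loss(logits, values, actions, returns):
    """Sketch of an A2C loss for discrete actions.

    logits:  (batch, n_actions) unnormalized action scores from the actor
    values:  (batch,) state-value estimates from the critic
    actions: (batch,) indices of the actions that were actually sampled
    returns: (batch,) observed discounted returns for those actions
    """
    log_probs = F.log_softmax(logits, dim=-1)
    chosen_log_probs = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)

    advantage = returns - values          # how much better than the critic expected?
    actor_loss = -(chosen_log_probs * advantage.detach()).mean()  # nudge good actions up
    critic_loss = F.mse_loss(values, returns)                     # fit the value baseline
    return actor_loss + 0.5 * critic_loss
```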

Wordle network for A2C. The agent samples from the vocabulary.
State and Action representation
The state vector I used was an integer vector of size 417: one entry for the number of remaining turns, and the rest representing the state I described above.
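Here's roughly what that encoding looks like (a simplified sketch; the attempted set and per-letter status dict are illustrative data structures, not my exact ones):

```python
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
NO, MAYBE, YES = 0, 1, 2  # per-letter, per-position status

def encode_state(remaining_turns: int,
                 attempted: set[str],
                 letter_status: dict[str, list[int]]) -> np.ndarray:
    """Sketch of the 417-wide state vector: 1 entry for remaining turns,
    26 for which letters have been tried, and 3 * 5 * 26 for the
    yes/maybe/no status of each letter in each of the 5 positions."""
    state = np.zeros(1 + 26 + 3 * 5 * 26, dtype=np.int64)
    state[0] = remaining_turns
    for i, ch in enumerate(ALPHABET):
        if ch in attempted:
            state[1 + i] = 1
        for pos in range(5):
            status = letter_status.get(ch, [MAYBE] * 5)[pos]
            state[1 + 26 + i * 15 + pos * 3 + status] = 1
    return state
```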
A naive approach to the approximator would be to take the 13k actions as the output layer of, say, an MLP (multilayer perceptron) and have your agent learn that action space. However, for Wordle, there’s a lot of information in the characters that make up the words, and I didn’t want that to go to waste.
The neural network I designed for both DQN and A2C takes this vector as input and feeds it through an MLP with a few hidden layers to an output layer of size 130. Because the output word has a fixed length (5), I one-hot encoded each of the 5 letters of every word in the vocabulary and concatenated them into a 130-wide representation of the word (26*5). Taking the dot product of the MLP output layer with this matrix of encoded words gives you a single value for each of the ~13k possible actions.
This essentially constrains the value of each word, given an input state, to be the sum of its letters’ values. It’s not clear if this is too restrictive, and it might be a good idea to learn embeddings for the encoded words first, but it’s given decent results so far. For Deep Q-learning, we use this value as the predicted Q-value of taking the action in the given state. For policy gradients, we pass it through a softmax layer to get probabilities.
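In PyTorch terms, the actor head looks roughly like this (a sketch of the idea with illustrative names and sizes; for DQN you'd use the word scores directly as Q-values instead of applying the softmax):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def encode_words(words: list[str]) -> torch.Tensor:
    """One-hot encode each of the 5 letters and concatenate: (n_words, 130)."""
    encoding = torch.zeros(len(words), 26 * 5)
    for w, word in enumerate(words):
        for pos, ch in enumerate(word.lower()):
            encoding[w, pos * 26 + (ord(ch) - ord("a"))] = 1.0
    return encoding

class WordleActor(nn.Module):
    def __init__(self, state_dim: int, words: list[str], hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 26 * 5),            # 130 outputs instead of ~13k
        )
        # Fixed (n_words, 130) matrix of letter encodings.
        self.register_buffer("word_matrix", encode_words(words))

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        letter_scores = self.mlp(state)                    # (batch, 130)
        word_scores = letter_scores @ self.word_matrix.T   # (batch, n_words)
        return F.softmax(word_scores, dim=-1)              # action probabilities
```

This is why only valid vocabulary words can ever be guessed: the 130 letter scores are only ever combined through the matrix of real words.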
Wordle Environment and Training
All that’s missing now is the Wordle environment and training these networks.
I broke the problem up into tiny, small, medium, and large vocabularies to test that things were working correctly. I highly recommend you do this if you’re starting on a new project.
The first problem/testing ground had a vocabulary of 100 words, with the environment always choosing the same target word, so there was basically no randomness. If your model can’t figure this out, there’s a serious problem somewhere.
The next step up was the same 100 words, but with the environment randomly choosing between 2 targets. I caught a few bugs here around network setup, and also some related to mutable state (I've lost track of how many times I've designed a mutable state when it really should have been immutable).
Once I was fairly confident that most of the big bugs were out, it was time to start training on real problems. The first small problem used the same 100-word vocabulary, but now the environment randomized over all of the words. This took the agent around 500k-700k games to learn to play well, reaching around a 96% win rate and averaging 2.5 guesses per game. At this point, I didn’t really need it to do better, since this wasn’t the full game yet.
The medium problem had a vocabulary of 1,000 words, with all of them as possible targets. The A2C agent was able to learn this game from scratch after playing ~5M games, reaching a win rate of over 99% and averaging 3.5 guesses per game.

Monitoring guesses to keep track of how the model is learning to win.
The full-sized problem has ~13,000 valid guess words. However, I learned recently that only 2,315 of them are eligible as targets. That keeps the problem quite difficult, but much easier than one with 13,000 possible targets! And, to be honest, I struggled to get this one going from scratch: millions of games played and no signs of learning.
The first thing that worked? Warm-starting from the model trained on the 1,000-word vocabulary. This had fantastic results. With the network architecture and state designed the way they were, the new model was able to take what it learned from the smaller vocabulary and put it to use immediately on the full set. The best model I've trained has a ~99.5% win rate and averages ~3.9 guesses per game. I don’t think this is quite human-level performance yet; my gut says that’s somewhere in the low-to-mid 3’s.
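Mechanically, warm-starting is just loading the smaller-vocabulary weights into the new model before training on the full vocabulary. Because the output layer is 130-wide regardless of vocabulary size, the shapes line up; only the fixed word-encoding matrix changes. A sketch, reusing the WordleActor class from above (the file name and the full word list are placeholders):

```python
import torch

# Hypothetical: weights saved from the 1,000-word run via torch.save(model.state_dict(), ...).
small_vocab_weights = torch.load("a2c_vocab_1000_state_dict.pt", map_location="cpu")
small_vocab_weights.pop("word_matrix", None)  # the word-encoding buffer changes size with the vocab

# full_vocabulary: the full ~13k-word guess list, loaded elsewhere.
model = WordleActor(state_dim=417, words=full_vocabulary)
model.load_state_dict(small_vocab_weights, strict=False)  # everything else transfers as-is
```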
Below are the small, medium, and full-sized vocabulary training curves, demonstrating the effectiveness of warm-starting: the warm-started models are able to start learning quickly from what the previous iteration learned. Note that the full-sized training curve is truncated; there were several restarts after this point to address some of the trickier words (see the next section).
Other tweaks and hacks
Because of letter scarcity, the model was pretty bad at some words like PIZZA. To help with this, I set up a “recent losses” FIFO queue with a fixed capacity: whenever the model lost a game during training, I pushed the target word onto the queue. Whenever the environment reset, there was a 10% chance of drawing the target from the recent-losses queue instead of randomly from the vocabulary, giving the model more practice on the words it was struggling with. Words it kept losing on would end up in the queue more often, giving it even more chances to practice; once it was solving them consistently, they would slowly be replaced by new challenging words. This approach halved the rate at which the model lost games, from 2% to 1%. It now handles PIZZA quite well.
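The queue logic itself is only a few lines; something like this (a sketch, with an illustrative capacity):

```python
import random
from collections import deque

recent_losses = deque(maxlen=1_000)  # fixed-capacity FIFO; old entries fall off as new losses arrive

def pick_target(vocabulary: list[str]) -> str:
    """On environment reset: 10% of the time, practice a recently lost word."""
    if recent_losses and random.random() < 0.10:
        return random.choice(list(recent_losses))
    return random.choice(vocabulary)

def on_game_over(target: str, won: bool) -> None:
    """Push lost targets onto the queue so they come up again for practice."""
    if not won:
        recent_losses.append(target)
```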
At this point, the model seemed to have plateaued. One word it just couldn't figure out was ZESTY. I tried to give the network even more help: when the target had been drawn from the FIFO queue and the agent still hadn't guessed it by the 5th guess, I forced the agent to guess the goal word, so that it would start to see some positive examples for this situation. This halved the error rate again, getting us from 1% down to 0.5%.
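In the environment loop, that nudge amounts to overriding the final guess for "practice" words (a sketch; sample_action and the goal index are illustrative names, not my exact API):

```python
def choose_action(agent, state, turn: int, target_from_queue: bool, goal_index: int) -> int:
    """Sample from the policy as usual, but on the 5th guess of a practice word,
    force the goal word so the agent sees a positive example."""
    if target_from_queue and turn == 5:
        return goal_index
    return agent.sample_action(state)  # hypothetical policy-sampling method
```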
Results
The model so far is doing alright, though I think it has a ways to go before reaching human level. It doesn’t seem like it will be particularly hard to get there, but I’m a bit annoyed at how many games it takes to learn! Another note: this is a pretty impractical way to create a Wordle-bot, but that's why we have side projects, isn't it?
Looking at some sampled output from the model, it seems to have settled into a strategy of opening with STARE and then playing some version of CLINK, CLINE, or CLEAN depending on the result of the first guess. This matches up pretty well with the “Wheel of Fortune” approach of going with RSTLNE. It has at least figured out to explore after the first guess, but after two guesses it gives up on exploration and goes straight into exploit mode. It hasn't quite figured out that there are cases after the second turn where it's better to eliminate more letters rather than try to guess the answer.
I encourage you to try it out, but given that it's hosted on Heroku's free tier, you may need to give it a minute while the instance spins up. Currently it has a “Goal” mode, where you give it a word and it’ll play 6 turns, and a “Suggestor” mode, which will give you a suggestion if you tell it what you’ve tried and the Wordle responses so far.
Thanks so much for this, Andrew. It's an impressive piece of work and I learned quite a bit about RL by reading the article and trying the model. It does better at WORDLE than I do.
A detail I'm not sure I followed here: with the 26*5 output, can't it produce letter combinations that aren't real words? Did I miss something?
How do you prevent it from guessing invalid words?
Such an awesome way to learn about these RL concepts. Thanks for writing this, Andrew