What is Q-Learning?
In this article, we will explore Q-learning: a model-free, off-policy and value-based reinforcement learning algorithm.
Welcome to the exciting world of Reinforcement Learning! This marks the beginning of a three-part series focused on Q-learning. Throughout this journey, we'll uncover the core principles of Q-learning, a potent algorithm in reinforcement learning.
As we progress, we'll use practical examples and straightforward mathematical equations to clarify these concepts. Additionally, we'll navigate Q-learning's challenges and applications, highlighting its versatility and limitations.
Come along on this enlightening adventure as we lay the groundwork for a comprehensive understanding of reinforcement learning, setting the stage for a detailed exploration of Q-learning and its practical uses.
Here's what we will be covering:
- Reinforcement Learning: An Example
- Reinforcement Learning: Optimizing Rewards
- CartPole: A Common RL Challenge
- Finding the optimal policy algorithm
- Bellman equation and Markov Decision Process
- Temporal Difference Learning
- What is Q-Learning?
- Q-table and Q-values
- How Q-learning works
- Initialise Q-table
- Training Time-step: 1
- Training Time-step: 2
- The Advantages of Q-learning
- Applications of Q-learning
- Challenges and Limitations of Q-learning
- Conclusion
If you already understand Reinforcement Learning, you can jump right to the section on Q-Learning here.
Reinforcement Learning: An Example
Imagine it's your birthday, and your long-cherished wish for a bicycle has finally come true. Overjoyed, you hop on the bicycle, eager to ride it. Amid the excitement, you realize that you haven't learned how to ride it yet, leading to a tumble. Before embarking on your cycling adventures, it's essential to grasp the learning process. Let's do it in the context of reinforcement learning (RL).

- Agent - In this scenario, you are the agent embarking on the quest to learn how to ride a bicycle safely.
- Environment - The environment is the encompassing space you are learning to ride, comprising the roads, your bicycle, and any obstacles or challenges.
- State - The state encapsulates your learning process, reflecting your current position, bicycle balance, and speed, among other factors.
- Action - The actions at your disposal include pedaling, steering, and maintaining balance.
- Reward - Positive feedback or rewards come your way when you master the art of riding without falling. Conversely, negative feedback, or penalties, is received when you fall or lose balance.
- Policy - The policy is the strategy you develop throughout your learning journey. It encompasses a set of guidelines, such as learning to steer in the right direction, pedal at an optimal speed, and maintain equilibrium to avert falling.
The process would involve you (the agent) starting by riding a bicycle. You take actions like pedaling, steering, and maintaining balance while attempting to ride. The environment responds by providing feedback, such as successfully riding for a certain distance without falling or losing balance and falling. Based on the feedback, you adjust your actions and strategy (the policy), gradually learning to pedal more smoothly, steer accurately, and maintain better balance, leading to longer periods of successful riding without falling.
Through this process, you learn from your experiences and refine your approach based on the feedback from your actions in the environment, eventually mastering the bicycle-riding skill.
Reinforcement Learning: Optimizing Rewards
The central idea in reinforcement learning is to train the agent to learn to act in a way that will maximize its expected rewards over time. As in the above case, you (the agent) need to interact with the environment in such a way that you take certain actions that will make you a better cyclist.
The question is: how should maximum rewards be obtained?
Typically, summing up rewards or penalties for an overall cumulative reward might not suffice. To explore this, let's look at a popular reinforcement learning problem.
CartPole: A Common RL Challenge
The CartPole problem is a popular testbed for reinforcement learning algorithms. This problem involves a pole attached to a cart by an un-actuated joint, which moves along a frictionless track. The objective is to prevent the pole from falling over, which requires the algorithm to learn the appropriate actions to take based on the current state of the system (like the angle and velocity of the pole and the position and velocity of the cart).
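If you'd like to poke at this environment yourself, here's a minimal sketch using Gymnasium's CartPole-v1 implementation (assuming the gymnasium package is installed; any CartPole implementation with the same interface works). The random policy below is only a placeholder for the learning algorithms we discuss later:

```python
import gymnasium as gym

# Minimal interaction loop with CartPole.
env = gym.make("CartPole-v1")
obs, info = env.reset(seed=42)   # obs = [cart position, cart velocity, pole angle, pole angular velocity]

total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()   # random action: 0 = push cart left, 1 = push cart right
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward               # +1 for every time-step the pole stays upright
    done = terminated or truncated

print(f"Return with a purely random policy: {total_reward}")
env.close()
```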
Consider a scenario where the agent successfully balances the pole for 100 steps. The challenge arises in pinpointing which of the numerous actions led to this accomplishment and, conversely, which actions might have contributed to the eventual imbalance at the last step. While we have information about each action's individual rewards or penalties, isolating the specific action or sequence of actions responsible for the pole's instability at the final step remains uncertain.
The complexity of assessing the contribution of each action to the eventual outcome underscores the credit assignment problem inherent in reinforcement learning. To tackle this problem, a common strategy is to evaluate an action based on the sum of all the rewards that come after it, usually applying a discount factor γ (gamma) at each step. This sum of discounted rewards is called the action’s return.

Here,
- r = rewards (at each time-step)
- γ = discount factor
Say our agent, the cart, moves to the right three times, gaining rewards of +10, 0, and -50 at each successive step. When we use the formula (10 + γ × 0 + γ² × (–50)) with a discount factor of 0.8, the return for the first action turns out to be -22.
Considering the value of the discount factor, we can determine whether immediate rewards hold more significance or if future rewards are more crucial. A discount factor closer to 0 implies that we prioritize the points gained immediately, while a factor nearer to 1 indicates similar importance placed on immediate and future rewards. In the case of the cart balancing the pole, immediate rewards take precedence due to the relatively short-term impact of the actions.
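As a quick sanity check of that arithmetic, here's a tiny helper (plain Python, nothing environment-specific) that computes a discounted return:

```python
def discounted_return(rewards, gamma):
    """Return r0 + gamma * r1 + gamma^2 * r2 + ..."""
    return sum(reward * gamma**t for t, reward in enumerate(rewards))

# The example from above: rewards of +10, 0, and -50 with a discount factor of 0.8
print(discounted_return([10, 0, -50], 0.8))   # -22.0 (up to floating-point rounding)
```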
To ensure that our agent learns to take actions leading to maximum cumulative rewards, we'll need an algorithm that guides it through this process.
Finding the optimal policy algorithm
This decision-making strategy, termed the policy, acts as the brain of the agent: it defines the agent's behavior at any given time or state. Two primary approaches exist for finding an optimal policy:
The Policy-Based Approach
The policy-based approach focuses on training the policy directly to determine the best action in any given state. Rather than predefining the policy, the training process shapes the agent's behavior. The policy can either be deterministic, always selecting the same action for a given state, or stochastic, presenting a probability distribution over actions. A well-known algorithm for this approach is the Policy Gradient method.
The Value-Based Approach
In this alternative approach, the policy is not trained directly. Instead, its behavior is predefined manually, typically as a greedy rule: the value function is trained to estimate which states yield greater value, and the policy simply selects the actions that lead to these more valuable states. Two types of value functions are typically used:
- State-value function - This function outputs the expected return if the agent starts at a given state (say S0), and then follows the policy indefinitely for each state.
- Action-value function - This function outputs the expected return if the agent starts at a given state (say S0), takes a given action (say A0), and then follows the policy indefinitely for each state and action pair.
Whichever of these two value functions we choose, its output is an expected return.
Here's the catch: consider a self-driving car, which has an enormous number of states and actions. If we calculated the expected return for each state or state-action pair by summing up all the rewards collected from that point onward, the process would be computationally prohibitive. To tackle this issue, the Bellman equation comes into the picture.
Bellman equation and Markov Decision Process
The Bellman equation simplifies the state value or state-action value calculation.
To understand it well, let's look at a Markov Decision Process (MDP). An MDP resembles a Markov chain: a fixed set of states that evolves from one state to another at each step, where the probability of moving from a state s to a state s′ is fixed and depends only on the pair (s, s′), not on past states, meaning the system has no memory. The twist is that in an MDP an agent can choose one of several possible actions, and the transition probabilities depend on the chosen action. Moreover, some state transitions return a reward (positive or negative), and the agent's goal is to find a policy that maximizes rewards over time.

Source: Hands-on Machine Learning with Scikit-Learn, Keras & TensorFlow (Example of a Markov chain)
For example, the MDP example shown in the figure below has three states (represented by circles) and up to three possible discrete actions at each step (represented by diamonds).

Source: Hands-on Machine Learning with Scikit-Learn, Keras & TensorFlow (Example of a Markov Decision Process)
In this scenario, the agent starts in state s0 and can opt for actions a0, a1, or a2. Choosing a1 results in no change and no reward, while a0 offers a 70% chance of a +10 reward. The agent can try a0 again and again to gain as much reward as possible, but at some point it will end up in state s1. There it has two possible actions: a0 or a2. It can choose to stay put by repeatedly choosing a0, or it can move on to state s2 and receive a negative reward of -50.
The state s2 has only one action a1 possible, which leads to state s0, gaining a reward of +40. While a0 seems favorable in s0 and a1 in s2, the dilemma arises in state s1, where the agent must choose between the seemingly safer option of staying (a0) or taking a risk to progress (a2).
To solve the MDP, we need to find the optimal state value for every state. This example has only three states and at most three actions to choose from, but in a complex environment with many states and actions, computing all the optimal values by simply summing up rewards becomes impractical.
To streamline this, Bellman introduced a recursive equation (in the spirit of dynamic programming). Instead of starting from scratch for each state, we can express the value of any state as the immediate reward plus the discounted value of the next state, taking the best available action and averaging over the possible outcomes. This simplifies the process and saves us from repeating the same calculations for every state.

Source: Hands-on Machine Learning with Scikit-Learn, Keras & TensorFlow (Bellman Optimality Equation)
In this equation,
- T(s, a, s′) is the transition probability from state s to state s′, given that the agent chose action a. For example, in the diagram above, T(s2, a1, s0) = 0.8.
- R(s, a, s′) is the immediate reward that the agent gets when it goes from state s to state s′, given that the agent chose action a. For example, in the diagram above, R(s2, a1, s0) = +40.
- γ is the discount factor.
- γ · V*(s′) is the discounted value of the next state.
Now, to precisely estimate the optimal state value of every possible state, we first initialize all the state value estimates to 0, and then update the values iteratively using the Value iteration algorithm.

Source: Hands-on Machine Learning with Scikit-Learn, Keras & TensorFlow (Value Iteration algorithm)
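To make Value Iteration concrete, here is a rough sketch in Python. The transition model below is a made-up two-state MDP used purely for illustration; it is not the MDP from the figure, whose full transition table isn't reproduced here:

```python
import numpy as np

# transitions[(s, a)] = list of (probability, next_state, reward) tuples.
# This toy MDP is purely illustrative.
transitions = {
    (0, 0): [(1.0, 1, 5.0)],
    (0, 1): [(0.8, 0, 0.0), (0.2, 1, 2.0)],
    (1, 0): [(1.0, 0, -1.0)],
    (1, 1): [(1.0, 1, 0.0)],
}
n_states, gamma = 2, 0.9

V = np.zeros(n_states)            # initialise every state value estimate to 0
for _ in range(100):              # repeat the Bellman update until the values settle
    V_new = np.copy(V)
    for s in range(n_states):
        actions = [a for (s_, a) in transitions if s_ == s]
        # Bellman optimality update: best action's expected (reward + discounted next-state value)
        V_new[s] = max(
            sum(p * (r + gamma * V[s2]) for p, s2, r in transitions[(s, a)])
            for a in actions
        )
    V = V_new

print(V)   # estimated optimal state values V*(s)
```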
Temporal Difference Learning
In the described Markov Decision Process, the agent lacks prior knowledge of the transition probabilities (T(s, a, s')) and the anticipated rewards (R(s, a, s')). It must actively explore each state and transition at least once to determine the rewards, and repeated experiences are necessary to develop a reliable estimate of the transition probabilities.
The Temporal Difference Learning (TD Learning) algorithm resembles the Value Iteration algorithm but is adapted to accommodate the agent's limited knowledge of the Markov Decision Process (MDP). Typically, the agent begins with knowledge of the feasible states and actions and relies on an exploration policy, like a random one, to navigate the MDP. The Temporal Difference Learning algorithm continuously adjusts the state value estimates based on observed transitions and rewards through ongoing interactions, gradually enhancing the agent's understanding of the MDP dynamics.

Source: Hands-on Machine Learning with Scikit-Learn, Keras & TensorFlow (Temporal Difference Learning algorithm)
In this equation,
- α is the learning rate.
- r + γ · Vk(s′) is the TD target.
- δk(s, r, s′) is the TD error.
This algorithm shares similarities with stochastic gradient descent, as it processes one sample at a time and relies on gradually decreasing the learning rate to converge effectively. Without this adjustment, it may continue oscillating around the optimal state values, highlighting the importance of carefully managing the learning rate to achieve stable convergence.
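In code, the TD Learning update is a single step inside the interaction loop. The sketch below shows only the update itself; how you sample transitions (the exploration policy and the environment interface) is left abstract:

```python
def td0_update(V, state, reward, next_state, alpha, gamma):
    """One Temporal Difference update of the state-value estimates.

    V is a dict or array of current value estimates; alpha is the learning rate.
    """
    td_target = reward + gamma * V[next_state]   # r + gamma * Vk(s')
    td_error = td_target - V[state]              # delta_k(s, r, s')
    V[state] = V[state] + alpha * td_error       # move the estimate toward the target
    return V
```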
Now that we've covered all the necessary groundwork, we are fully prepared to delve into the world of Q-learning.
What is Q-Learning?
Q-Learning is the algorithm used to train the Q-function, an action-value function that assesses the value of being in a given state and taking a particular action at that state. The term "Q" represents the "quality" (value) of that action at the given state.
Q-learning is a model-free, off-policy, and value-based reinforcement learning algorithm. By this we mean:
- Model-free - It aims to understand the implications of its actions through repeated experiences. In simpler terms, the algorithm executes actions multiple times and modifies its policy (the approach guiding its actions) to maximize rewards, considering the observed outcomes.
- Off-policy - The policy used to act during training is not the same as the policy being learned. Q-learning acts with an exploratory policy (such as epsilon-greedy) while updating its estimates as if it were following the greedy policy. More on this later.
- Value-based - It indirectly identifies the optimal policy by training an action-value function, enabling us to determine the value of each state-action pair.
It also uses the Temporal Difference Learning algorithm to train its action-value function.
Q-table and Q-values
To train the Q-function, we need a Q-table where each cell holds the value of one state-action pair (called a Q-value). For example, if we have six states and four actions, the Q-table is a matrix of shape [number of states, number of actions], with one row per state and one column per action.
Previously, we covered the Bellman equation and the Value Iteration algorithm in our quest to determine the optimal state values for every possible state. Bellman introduced a minor adjustment to derive the optimal Q-values.
In this revised Bellman equation for Q-values, the optimal Q-value of taking action (a) in a given state is the immediate reward plus the discounted value of the best action available in the next state.
Likewise, to precisely estimate the Q-values of every state-action pair, we initialise all the Q-value estimates to 0, and then update them iteratively using the Q-Value iteration algorithm.

Source: Hands-on Machine Learning with Scikit-Learn, Keras & TensorFlow (Q-value iteration algorithm)
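Sketched in code, Q-Value Iteration looks almost identical to Value Iteration, except that we keep one estimate per state-action pair and take the max over the next state's actions inside the update. The MDP below is the same made-up two-state example as in the Value Iteration sketch, not the figure's:

```python
import numpy as np

# Illustrative toy MDP: transitions[(s, a)] = list of (probability, next_state, reward) tuples.
transitions = {
    (0, 0): [(1.0, 1, 5.0)],
    (0, 1): [(0.8, 0, 0.0), (0.2, 1, 2.0)],
    (1, 0): [(1.0, 0, -1.0)],
    (1, 1): [(1.0, 1, 0.0)],
}
n_states, n_actions, gamma = 2, 2, 0.9

Q = np.zeros((n_states, n_actions))   # one Q-value estimate per state-action pair
for _ in range(100):
    Q_new = np.copy(Q)
    for (s, a), outcomes in transitions.items():
        # immediate reward plus the discounted value of the best action in the next state
        Q_new[s, a] = sum(p * (r + gamma * np.max(Q[s2])) for p, s2, r in outcomes)
    Q = Q_new

print(Q)   # estimated optimal Q-values Q*(s, a)
```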
Upon completing the Q-function training, we acquire the optimal Q-values, subsequently defining the optimal policy. This policy is straightforward: when the agent finds itself in state s, it should select the action corresponding to the highest Q-value for that state.

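In code, reading the optimal policy off a trained Q-table is just an argmax per state (assuming the Q-table is stored as a NumPy array with one row per state):

```python
import numpy as np

def greedy_policy(Q, state):
    """Pick the action with the highest Q-value for the given state."""
    return int(np.argmax(Q[state]))
```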
Let's delve into a detailed, step-by-step explanation to gain a clearer understanding of how Q-learning operates.
How Q-learning works
I've drawn an example from one of my favorite ongoing anime, "One Piece," for reference. Long story short, Monkey D. Luffy, the main character, embarks on a quest to locate the legendary One Piece and claim the title of Pirate King in the new era. While I don't anticipate any character defeating him, he risks peril if he drowns in the sea due to his inability to swim, a consequence of his devil fruit powers. I've designed a maze-like structure where:
- Monkey D. Luffy begins his adventure at the starting point (S0).
- The goal is to obtain the final treasure known as "One Piece".
- The episode ends if Luffy drowns in the sea.
- The learning rate (α) is 0.1.
- The discount rate (γ) is 0.99.

Source: author of this article (States)
The reward function is as follows:
- +0 - going to a state where Luffy is in his ship.
- +1 - going to a state where Luffy lands on an island.
- +10 - going to a state where Luffy finds the "One Piece".
- -10 - going to a state where Luffy drowns in the sea.
We will employ the Q-Learning algorithm to train our agent to adopt an optimal policy, which involves moving right, right, and then down.
Initialise Q-table
Initially, all the Q-values (one for each state-action pair) are set to 0. The Q-table consists of six states represented as rows and four actions (left, right, up, and down) as columns.

Source: author of this article (Initialised Q-table with 0)
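In code, this initial Q-table is simply a 6 × 4 array of zeros; the row and column indices below are assumed to mirror the table in the figure:

```python
import numpy as np

n_states, n_actions = 6, 4                 # six maze states; actions: left, right, up, down
Q = np.zeros((n_states, n_actions))        # every state-action value starts at 0
print(Q)
```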
For simplicity, we will run two training time-steps.
Training Time-step: 1
Choose an action
We employ an epsilon-greedy policy to select a random action, gradually reducing randomness with each time step. Why opt for a random action? It enables the agent to balance exploring new actions and exploiting known effective actions, effectively managing the exploration/exploitation trade-off.
To illustrate, consider visiting a restaurant for the first time where all dishes appear equally appealing. In this scenario, you randomly select a dish. If it turns out to be delightful, you might increase the probability of ordering it again, but raising that probability to 100% is unwise. Doing so would hinder your chance to try out the other dishes, some of which could be even better than the one you initially tried.
Following this policy, with ɛ initially set to 1.0:
- With probability (1-ɛ) - we do exploitation where the agent selects the action with the highest state-action pair value.
- With probability (ɛ) - we do exploration where the agent tries random actions.
Initially, during the early stages of training, the likelihood of exploration is significant as epsilon (ɛ) maintains a high value. As a result, the majority of actions will be exploratory. However, as the training progresses and the Q-table improves its estimations, we gradually decrease the epsilon value. This adjustment is necessary as we require less exploration and more exploitation over time.
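A sketch of epsilon-greedy action selection over our Q-table (the schedule used to decay epsilon is left out here and would be tuned in practice):

```python
import numpy as np

def epsilon_greedy_action(Q, state, epsilon):
    """With probability epsilon explore (random action); otherwise exploit (best known action)."""
    if np.random.rand() < epsilon:
        return np.random.randint(Q.shape[1])   # explore: any of the actions, uniformly at random
    return int(np.argmax(Q[state]))            # exploit: the action with the highest Q-value
```

With epsilon starting at 1.0, the early actions are almost entirely exploratory, which is exactly what happens next.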
Since we set the epsilon value to 1.0, which is high, Luffy randomly moves to the right.
Perform an action
By choosing right, Luffy lands on an island (a new state), which gives him a reward of +1.
Update Q-value
Utilizing the Q-value iteration algorithm and Temporal Difference Learning, the formula to derive the new Q-value estimation is expressed as follows in the diagram below:

To update our Q-value at a specific state-action pair, we utilize the Temporal Difference target. The TD target is constructed by obtaining the reward R(t+1) following the action A(t). We employ a greedy policy to determine the best state-action pair value for the next state, always selecting the action with the highest state-action value.
Putting the values into the formula, we get the updated value for the state-action pair Q(0, 1):
Q(0, 1) = 0 + 0.1 * [1 + 0.99 * 0 - 0]
Q(0, 1) = 0.1
where,
- Q(St, At) = Q(0, 1) = 0 (the current estimate)
- α = 0.1
- R(t+1) = 1
- γ = 0.99
- max(a)Q(St+1, a) = 0

Source: author of this article (Updated value of Q(0,1))
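As a quick check of that arithmetic, here is the same update written out in Python:

```python
alpha, gamma = 0.1, 0.99
q_current, reward, max_q_next = 0.0, 1.0, 0.0   # Q(0, 1), R(t+1), and max over a of Q(S(t+1), a)

q_updated = q_current + alpha * (reward + gamma * max_q_next - q_current)
print(q_updated)   # 0.1
```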
After updating this Q-value and as Luffy transitions to a new state, we proceed to the second training time-step, selecting our action again using an epsilon-greedy policy.
Training Time-step: 2
Choose an action
Luffy opts for a random action once again, given that epsilon is set at 0.99, which is relatively high. It's worth noting that we slightly decayed epsilon as the training progressed, aiming for less exploration.
Luffy chooses to go down but ends up drowning in the sea.
Perform an action
Because Luffy drowned in the sea, he receives a reward of -10, and the episode ends.
Update Q-value
Putting the values into the same formula, we get:
Q(1, 4) = 0 + 0.1 * [-10 + 0.99 * 0 - 0]
Q(1, 4) = -1

Source: author of this article (Updated value of Q(1, 4))
Since Luffy drowned, a new episode begins. Notably, after only two exploration steps, the agent's Q-value estimates are already noticeably more informative.
Continuing this iterative process of exploring and exploiting the environment while updating Q-values using the TD target, the Q-table progressively provides a more accurate approximation. By the end of the training, we obtain an estimate of the optimal Q-function.
Once we have an optimal Q-function, we have an optimal policy since we know the best action to take for each state.
Furthermore, the distinctive characteristic of Q-learning as an off-policy algorithm becomes evident: the policy used to act during training (epsilon-greedy) differs from the policy used for the updates and for final execution (greedy with respect to the Q-values).
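Putting all of these pieces together, a bare-bones Q-learning training loop looks roughly like the sketch below. It targets Gymnasium's FrozenLake-v1, a small discrete grid world similar in spirit to our maze; the hyperparameters are illustrative, not tuned:

```python
import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1", is_slippery=False)
n_states, n_actions = env.observation_space.n, env.action_space.n

Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99
epsilon, epsilon_min, epsilon_decay = 1.0, 0.05, 0.999

for episode in range(5000):
    state, _ = env.reset()
    done = False
    while not done:
        # Acting policy: epsilon-greedy (exploration vs. exploitation)
        if np.random.rand() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(Q[state]))

        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        # Updating policy: greedy max over the next state's Q-values (no bootstrap from a terminal state)
        td_target = reward + gamma * np.max(Q[next_state]) * (not terminated)
        Q[state, action] += alpha * (td_target - Q[state, action])
        state = next_state

    epsilon = max(epsilon_min, epsilon * epsilon_decay)   # explore less as training progresses

print(np.argmax(Q, axis=1))   # greedy policy: the best action index for each state
```

Notice that the TD target uses the max over the next state's Q-values rather than the action the agent actually takes next; that single design choice is what makes Q-learning off-policy.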
The Advantages of Q-learning
Q-learning, a popular reinforcement learning algorithm, offers several advantages:
- Model-Free Learning - Q-learning is model-free, meaning it doesn't require a prior understanding of the environment's dynamics. It learns solely from interactions, making it suitable for complex or unknown systems scenarios.
- Versatility and Applicability - Q-learning applies to a wide range of domains, from game playing to robotics, finance, and resource management. Its flexibility makes it a valuable choice for diverse problems.
- Learning from Experience - Q-learning excels in learning from past experiences, as it iteratively updates action-value functions based on agent interactions, allowing it to adapt and improve over time.
- Efficient Decision-Making - It's particularly effective in dynamic environments, making quick and efficient decisions based on learned policies, which is essential in real-time applications like robotics and game playing.
- Off-Policy Learning - Q-learning's off-policy nature allows it to learn from experiences collected under different policies, providing greater flexibility and adaptability, particularly when dealing with changing circumstances.
- Simple Implementation - Q-learning is relatively straightforward to implement, making it accessible for beginners as well as reinforcement learning experts.
- Proven Track Record - Q-learning has a well-established track record in various applications, having achieved notable successes in fields like game playing, autonomous systems, and optimization problems.
These advantages underscore Q-learning's practicality and effectiveness, making it a valuable tool in reinforcement learning.
Applications of Q-learning
Q-learning can be applied to a wide variety of tasks. Here are a few examples:
- The agent can be the program controlling a robot. In this scenario, the agent observes the environment (the real world) through sensors like cameras and touch sensors and acts accordingly by sending signals to the motors. It gets positive rewards for efficient navigation towards the goal and negative rewards for detours or time wastage.
- The agent can be the program playing a board game like Go or chess.
- The agent can be the program controlling Ms. Pac-Man where the environment is a simulation of the Atari game, actions are the nine joystick positions, and the rewards are the game points.
- The agent can observe stock market prices and take action on whether to buy or sell.
- The agent can be a smart thermostat that learns to understand human needs, getting positive rewards whenever it is close to the target temperature and saves energy, and getting negative rewards when humans need to tweak the temperature.
- The agent can be a program solving a maze where it gets negative rewards for every time step, so it has to reach the exit as soon as possible.
- There are various other tasks where it is well suited, such as driving cars, recommender systems, or placing ads on a web page.
The widespread applications of Q-learning underscore its versatility and adaptability, making it a powerful tool for solving complex real-world problems across various domains.
Challenges and Limitations of Q-learning
Q-learning, despite its efficacy, encounters several challenges and limitations, including:
- Slow Convergence and High Computational Requirements - Q-learning can take significant time to converge, especially in complex environments. It may require substantial computational resources, making it less feasible for real-time applications.
- Curse of Dimensionality - The performance of Q-learning can deteriorate in high-dimensional state and action spaces, leading to increased computational complexity and reduced efficiency.
- Lack of Generalisation - Q-learning tends to focus on specific states and actions, potentially leading to difficulties in generalizing learned policies to new, unseen environments and increased susceptibility to overfitting.
- Exploration vs. Exploitation Trade-off - Striking the right balance between exploration (trying new actions) and exploitation (choosing the best-known actions) can be tricky. Over-exploration can lead to inefficiency, while under-exploitation can prevent discovering better strategies.
- Handling Continuous State and Action Spaces - Q-learning is primarily designed for discrete state and action spaces. Adapting it for continuous spaces involves complex discretization techniques and can lead to suboptimal results.
- Sensitivity to Hyperparameters - Q-learning's performance can depend highly on the choice of hyperparameters, such as the learning rate and discount factor. Finding the right values can be challenging.
- Lack of Prior Knowledge - Q-learning doesn't incorporate prior knowledge about the environment, making it less efficient when some level of pre-existing understanding is available.
- Non-Markovian Environments - Q-learning assumes that the environment follows the Markov property, meaning the future depends only on the current state and action. In non-Markovian environments, it may not perform optimally.
Despite these challenges and limitations, Q-learning remains a valuable tool in reinforcement learning. Many of these issues can be mitigated or overcome through algorithmic enhancements, specialized techniques, and deep reinforcement learning approaches.
Conclusion
Q-learning stands out as a potent algorithm in reinforcement learning, offering the ability to derive optimal strategies through exploration and exploitation. Despite challenges like slow convergence and sensitivity to hyperparameters, its versatility finds applications across various domains.
By training a Q-function and updating Q-values, the algorithm efficiently adapts to environments, demonstrating its significance in decision-making and experiential learning. While acknowledging its limitations, Q-learning remains a valuable and widely applied approach in reinforcement learning.
Stay tuned for our next article, where we'll dive into hands-on examples.