Getting Started with Deep Q-Learning
Diving deep into the world of RL, we will unpack all of the details of deep Q-learning, and train a model on the OpenAI Gym Cartpole Environment!
Created on March 12 | Last edited on March 19
In this guide, we'll dive into reinforcement learning with a special focus on deep Q-learning. Specifically, we'll cover the theoretical concepts and a practical implementation in the popular OpenAI Gym Cartpole environment using PyTorch.

Table of Contents
Understanding Reinforcement Learning
Understanding States and Actions
Updating Q-Values
The Cartpole Environment
The Goal
Diving into Deep Q-Learning
Essential Functions
Logging with Weights & Biases
Results
Running Inference
The Challenge of RL
Understanding Reinforcement Learning
At the heart of reinforcement learning (RL) is the concept of agents learning to make decisions by interacting with their environment. In this setup, an agent seeks to learn a strategy, or policy, that maximizes cumulative rewards over time. Unlike traditional learning paradigms, RL focuses on learning from actions' consequences, emphasizing the balance between exploration (trying new things) and exploitation (leveraging known information).
Understanding States and Actions
In reinforcement learning, an agent interacts with an environment, which could be anything from a simple game to a complex simulation.
State (s): A state represents the current situation or position of the agent. It's like a snapshot of everything that's going on, which the agent uses to decide its next move. For instance, in chess, the state would be the arrangement of pieces on the board.
Action (a): This is what the agent can do, or the decision it makes, in its current state. In the chess example, an action could be moving a particular pawn.
Reward (r): After the agent takes an action (a) in a given state (s), the environment responds not only by transitioning to a new state but also by providing a reward (r). This reward is typically a numerical value designed to indicate how beneficial or detrimental the action was towards achieving the overall objective. Effectively, reinforcement learning is training agents to collect higher reward scores while avoiding lower ones.
What are Q-Values in Reinforcement Learning?
A Q-value is a numeric score that corresponds to an agent's action in a certain state. Essentially, it's the prediction of how beneficial (or how detrimental) it is for an agent to take an action at a given time.
For deep Q-learning, we will use a neural network to approximate Q.
The Bellman Equation in Q-Learning
The Bellman equation is the foundational component for the Q-learning algorithm. It embodies the principle that the value of a given state-action pair, or Q-value, is the immediate reward received from the environment plus the discounted value of the future rewards that the agent expects to receive.
This expectation hinges on the actions taken subsequently, all modeled through the lens of the agent's current policy.
The Q-learning update rule, built directly from the Bellman equation, is:
Q(s, a) ← Q(s, a) + α [r + γ maxₐ' Q(s', a') − Q(s, a)]
Q(s, a): This is our current estimate of the value for taking action "a" in state "s".
←: This symbol stands for "update". We're updating the Q-value based on new information.
α (Alpha): Known as the learning rate, this value determines how much new information impacts the existing Q-value. A smaller alpha results in slower learning, while a larger alpha speeds it up.
r (Reward): This is what the agent receives after executing action "a" in state "s". It's a measure of the immediate outcome of the action.
γ (Gamma): This is the discount factor, which balances the importance of immediate versus future rewards. A higher gamma values future rewards more highly.
maxₐ' Q(s', a'): This part is crucial. It means that for the new state s' after action a, we consider all possible future actions, and choose the one with the highest Q-value. This represents the best expected reward from the next state onward.
[r + γ maxₐ' Q(s', a') − Q(s, a)]: This term shows the update made based on new experiences. It's the reward received, plus the discounted value of the future rewards, minus the old Q-value.
Updating Q-Values
By applying this update rule, we adjust our Q-value to better reflect the new information. This is done every time the agent takes an action and observes the outcome. Over time, by updating the Q-values for all state-action pairs, the agent learns which actions yield the best results, developing a strategy or policy for maximizing rewards. This learning process, where the agent iteratively improves its decisions, is central to Q-learning and forms the foundation for many reinforcement learning strategies.
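To make the update rule concrete, below is a minimal tabular Q-learning sketch (a toy example, separate from the CartPole code later in this guide) with made-up values for the learning rate, discount factor, and a single observed transition:
import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))  # toy Q-table: one value per state-action pair

alpha, gamma = 0.1, 0.99             # learning rate and discount factor

# One observed transition: the agent took action a in state s, received reward r, and landed in s_next
s, a, r, s_next = 0, 1, 1.0, 2

# Bellman update: move Q(s, a) toward r + gamma * max_a' Q(s_next, a')
td_target = r + gamma * np.max(Q[s_next])
Q[s, a] = Q[s, a] + alpha * (td_target - Q[s, a])

print(Q[s, a])  # 0.1: the estimate moved 10% of the way toward the target of 1.0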
In this guide, as we progress further into the practical applications of reinforcement learning, we'll be enhancing our approach by introducing the concept of Double Deep Q-Networks (Double DQN). The traditional DQN has been known to overestimate Q-values due to the max operation in the Bellman equation, which can lead to suboptimal policy learning. Double DQN, an advancement over DQN, addresses this issue by decoupling the action selection from the Q-value generation.
This means that one network is used to select the best action (the action with the highest Q-value), and another, separate network is used to evaluate the Q-value of that action. By integrating the Double DQN approach, we aim to achieve more accurate Q-value estimations and, consequently, more reliable and efficient learning. This method not only stabilizes the training process but also helps in converging towards the optimal policy in a more systematic and robust manner, especially in environments with complex and noisy dynamics like the OpenAI Gym's Cartpole scenario. If this seems a bit complex, don't worry, it's actually quite simple to implement!
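As a quick sketch of the idea (using hypothetical q_policy and q_target networks that output one Q-value per action, a slightly different layout from the model we build below), the Double DQN target is computed like this:
import torch

# Hypothetical networks: each maps a 4-dimensional state to 2 Q-values (one per action)
q_policy = torch.nn.Linear(4, 2)  # online network: picks the best next action
q_target = torch.nn.Linear(4, 2)  # target network: evaluates that action's value

next_state = torch.randn(1, 4)
reward, gamma, done = 1.0, 0.99, False

with torch.no_grad():
    best_action = q_policy(next_state).argmax(dim=1, keepdim=True)  # selection with the online network
    next_q = q_target(next_state).gather(1, best_action)            # evaluation with the target network
    target = reward + gamma * next_q.squeeze() * (1 - float(done))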
The Cartpole Environment
We will build our deep Q-learning agent in the OpenAI Gym Cartpole environment. It's a very simple environment, but that makes it great for testing your algorithms and for spotting any bugs that may arise.

The CartPole environment is a classic benchmark in reinforcement learning, simulating a pole balanced on a moving cart. The goal is to prevent the pole from toppling over by shifting the cart's position to the left or right. This setup serves as an accessible platform for implementing and testing reinforcement learning principles.
In this environment, the state is defined by four real-valued attributes: the cart's position on the track, the cart's velocity, the pole's angle relative to vertical, and the pole's rotation rate. These parameters provide a comprehensive snapshot of the current situation, helping the agent to make informed decisions about the next action to take.
The agent has two possible actions: moving the cart to the left and moving it to the right. Despite the simplicity of the action space, mastering this environment requires a nuanced understanding of the dynamics between the cart's movements and the pole's balance.
The Goal
The objective in the CartPole game is straightforward: keep the pole balanced on the cart for as long as possible. The agent earns a reward for each time step the pole remains upright, with the episode ending when the pole falls too far from vertical or the cart strays too far from the center. Consequently, the reinforcement learning agent's mission is to accumulate the highest total reward by maintaining the pole's balance through strategic actions.
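To make the state, action, and reward concrete, here is a short sketch (assuming a recent gym version, where reset() returns an (observation, info) tuple and step() returns five values, matching the training code later in this guide) that runs one episode with random actions:
import gym

env = gym.make('CartPole-v1')

state, info = env.reset()   # state: cart position, cart velocity, pole angle, pole angular velocity
done, total_reward = False, 0.0

while not done:
    action = env.action_space.sample()                        # 0 = push cart left, 1 = push cart right
    state, reward, done, truncated, info = env.step(action)
    total_reward += reward                                    # +1 for every step the pole stays upright
    done = done or truncated

print(f"Random policy survived for {total_reward} steps")
env.close()
A random policy usually only survives a couple dozen steps or so, which gives a sense of the baseline our trained agent needs to beat.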
Diving into Deep Q-Learning
Deep Q-Learning extends traditional Q-learning by employing deep neural networks to approximate the Q-value function, which represents the expected utility of taking certain actions from specific states. This approach allows handling high-dimensional state spaces, making it feasible for complex problems.
We start by defining our neural network model, which will serve as the brain of our agent. This model, structured with three fully connected layers, takes as input a combination of the environment's state and a possible action. The output is the predicted Q-value, representing the expected cumulative reward for taking that action in the given state. The first layer of our model accepts an input size that matches the environment's state dimension plus one for the action, projecting this input onto a larger, 500-node layer to capture complex relationships. Subsequent layers distill this information down, finally outputting a single value, the Q-value for the given state-action pair.
class cpmodel(torch.nn.Module):
    def __init__(self, state_size, action_size):
        super(cpmodel, self).__init__()
        self.state_size = state_size
        self.action_size = action_size
        # Input is the 4-dimensional CartPole state concatenated with 1 action value (5 inputs total)
        self.fc1 = torch.nn.Linear(5, 500)
        self.fc2 = torch.nn.Linear(500, 128)
        self.fc3 = torch.nn.Linear(128, 1)

    def forward(self, x):
        x = torch.nn.functional.relu(self.fc1(x))
        x = torch.nn.functional.relu(self.fc2(x))
        x = self.fc3(x)  # Single output: the predicted Q-value for this state-action pair
        return x
Next, we'll define a few parameters that are specific to reinforcement learning:
epsilon = 1.0               # Initial epsilon value (start with full exploration)
epsilon_min = 0.2           # Minimum epsilon value (residual exploration)
epsilon_decay_rate = 0.995  # Rate at which epsilon decays per iteration
update_every = 20           # Update target network every 20 episodes, adjust as needed
gamma = 0.99
These parameters may seem a little bit foreign if you are new to reinforcement learning. I'll go through them:
epsilon: Initially set to 1.0, this parameter controls the balance between exploration (trying new actions) and exploitation (choosing the best-known action) in the agent's decision-making process. Starting with full exploration allows the agent to learn about the environment from a broad set of experiences.
epsilon_min: This is the minimum value that epsilon can decrease to, set at 0.2 in this case. It ensures that there is always some level of exploration, preventing the agent from solely exploiting its current knowledge and potentially missing out on better solutions.
epsilon_decay_rate: Set at 0.995, this rate dictates how epsilon decreases over time (each iteration). It allows the agent to gradually shift from exploration to exploitation as it gains more experience and confidence in its knowledge of the environment (a short sketch of this decay schedule follows the list below).
update_every: This parameter, set to 20, determines how frequently the target network should be updated. In deep reinforcement learning, this helps stabilize training by providing a fixed target for the algorithm to aim at over several episodes, adjusting as needed for better learning dynamics.
gamma: Set at 0.99, this is the discount factor used in calculating the future discounted reward. It balances the importance of immediate versus future rewards, with a high value like 0.99 indicating that future rewards are nearly as significant as immediate ones.
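As a small sketch of that decay schedule (using the values above), the following loop shows roughly how many iterations it takes epsilon to fall from 1.0 to the 0.2 floor:
# Rough sketch of the epsilon decay schedule defined above
epsilon, epsilon_min, epsilon_decay_rate = 1.0, 0.2, 0.995

iterations = 0
while epsilon > epsilon_min:
    epsilon = max(epsilon_min, epsilon * epsilon_decay_rate)
    iterations += 1

print(f"epsilon reaches {epsilon_min} after about {iterations} iterations")  # roughly 320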
Additionally, we will define our replay buffer:
from collections import deque

replay_buffer = deque([], maxlen=10000)
This is a deque (double-ended queue) with a maximum length of 10000, acting as a memory buffer to store past experiences. It enables the agent to learn from a wider range of past actions, rewards, and outcomes, improving the robustness and generalization of the learned policy by reusing experiences from this buffer during training.
The replay buffer, crucial in reinforcement learning, serves as a dynamic memory system that retains past experiences during training, each defined by states, actions, rewards, and subsequent states. It addresses several key challenges: it disrupts the correlation between consecutive experiences to avoid skewed learning processes, enhances data efficiency by allowing multiple uses of the same data, and ensures a stable learning environment against the backdrop of fluctuating data distributions, characteristic of neural network-based models like Deep Q-Learning. Furthermore, by facilitating non-sequential learning, similar to batch processing in supervised learning, it significantly stabilizes and streamlines the training process, making it an indispensable component in the reinforcement learning framework.
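As a minimal, standalone sketch (the transitions here are placeholders, and the buffer definition repeats the one above so the snippet runs on its own), this is what storing experiences and sampling a mini-batch looks like:
import random
from collections import deque

replay_buffer = deque([], maxlen=10000)  # same structure as the buffer defined above

# Store placeholder transitions of the form (state, action, reward, next_state, done)
for _ in range(64):
    dummy_state = [0.0, 0.0, 0.0, 0.0]        # 4-dimensional CartPole state
    dummy_next_state = [0.0, 0.0, 0.01, 0.0]
    replay_buffer.append((dummy_state, 0, 1.0, dummy_next_state, False))

# Sample a random mini-batch; sampling at random breaks the correlation between consecutive steps
batch = random.sample(replay_buffer, 32)
states = [transition[0] for transition in batch]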
Essential Functions
Once our model is in place, we proceed to outline the essential functions for training and evaluating our agent. The core training function, optimize_model, pulls samples from the replay buffer – a collection of past experiences, where each experience comprises the state, action taken, reward received, the subsequent state, and a termination flag indicating the end of an episode – to break the correlation between sequential observations and stabilize learning. During optimization, these experiences are separated into distinct tensors, enabling batch processing for computational efficiency. The policy model uses these to predict Q-values for the encountered state-action pairs, which serve as the basis for decision-making.
def optimize_model(policy_model, target_model):
    # Assumes batch_size, gamma, criterion, and optimizer are defined elsewhere in the training script
    global replay_buffer
    samples = random.sample(replay_buffer, batch_size)

    # Organize data
    states = torch.tensor([sample[0] for sample in samples], dtype=torch.float)
    actions = torch.tensor([sample[1] for sample in samples], dtype=torch.long)
    rewards = torch.tensor([sample[2] for sample in samples], dtype=torch.float)
    next_states = torch.tensor([sample[3] for sample in samples], dtype=torch.float)
    dones = torch.tensor([sample[4] for sample in samples], dtype=torch.float)

    best_actions = torch.zeros(batch_size, dtype=torch.long)
    # Iterate over all possible actions to find which ones yield the highest Q-values for each next state
    for a in [0, 1]:
        # Convert action to tensor and repeat it for each sample in the batch
        action_tensor = torch.full((batch_size, 1), a, dtype=torch.float)
        # Concatenate the action tensor to the end of each next state tensor
        nsa = torch.cat((next_states, action_tensor), dim=1)
        q_values = policy_model(nsa).squeeze()
        # Check whether the current action 'a' is better than the previously recorded best action for each next state
        if a == 0:
            # For the first action, initialize max_qs with the q_values directly
            max_qs = q_values
            best_actions.fill_(a)  # Fill with the current action 'a'
        else:
            # For subsequent actions, update max_qs and best_actions wherever the new q_values exceed the current max_qs
            better_idx = q_values > max_qs             # Find indices where current q_values exceed max_qs
            max_qs[better_idx] = q_values[better_idx]  # Update max_qs
            best_actions[better_idx] = a               # Update the actions for these indices to the current action 'a'

    # Prepare the next state-action pairs using the best actions determined from the online model
    next_state_actions = best_actions.unsqueeze(1)  # Add a second dimension to best_actions
    nsa_pairs = torch.cat([next_states, next_state_actions.float()], dim=1)  # Concatenate along the second dimension
    # Evaluate these pairs using the target network to get the Q-value estimates
    next_state_values = target_model(nsa_pairs).squeeze()  # Remove unnecessary dimensions

    # The (1 - dones) term ensures that we set the target Q-value to just the reward for terminal states
    targets = rewards + (gamma * next_state_values * (1 - dones))

    sa_pairs = torch.cat([states, actions.unsqueeze(1)], dim=1)  # Concatenate states and actions
    predicted_q_values = policy_model(sa_pairs).squeeze()        # Ensure this matches the shape of targets

    loss = criterion(targets, predicted_q_values)

    # Perform the gradient descent step to update the online model
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    return loss.item()  # Return the loss so the training loop can track it
Of critical importance is the loss calculation, which is shown here:
targets = rewards + (gamma * next_state_values * (1 - dones))
sa_pairs = torch.cat([states, actions.unsqueeze(1)], dim=1)  # Concatenate states and actions
predicted_q_values = policy_model(sa_pairs).squeeze()        # Ensure this matches the shape of targets
loss = criterion(targets, predicted_q_values)
In the Double DQN framework, which aims to provide a more stable learning process by reducing overestimations, optimize_model employs two distinct models: a policy model for selecting actions and a target model for evaluating these actions' future rewards. This dual-model approach mitigates the risk of value overestimation inherent in traditional Q-learning methods. However, for enthusiasts eager to explore the nuances of reinforcement learning further, transitioning to a standard DQN setup is straightforward—simply pass the same model as both the policy and target models to this function.
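Concretely, the two setups differ only in what is passed as the target model (the model and e_model names here match the training loop shown later):
# Double DQN: a separate online (policy) network and target network
optimize_model(policy_model=model, target_model=e_model)

# Standard DQN-style update: the same network both selects and evaluates actions
optimize_model(policy_model=model, target_model=model)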
Through backpropagation, the function updates the policy model by minimizing the loss between predicted Q-values of the original state and action by the policy model, along with the Q-targets, which are computed from immediate rewards and the discounted future rewards estimated by the target model. By iteratively refining the policy model via this optimization, the agent progressively enhances its decision-making strategy, navigating the environment with increasing adeptness. This continual learning cycle underscores the agent's development from exploratory attempts to a more calculated approach, embodying the essence of reinforcement learning.
The evaluation function, evaluate_model, systematically tests the performance of our trained model across several episodes, providing insights into the agent's average score per episode and helping gauge the efficacy of our training.
def evaluate_model(model, env, n_episodes=1):
    total_scores = []  # To keep track of the total score for each episode
    for episode in range(n_episodes):
        state = env.reset()
        done = False
        total_reward = 0  # Total reward for this episode
        stps = 0
        while not done:
            stps += 1
            if isinstance(state, tuple):
                actual_state = state[0]
            else:
                actual_state = state
            state = actual_state
            state_tensor = torch.tensor(state, dtype=torch.float).unsqueeze(0)

            # Select action with the highest Q-value
            qs = []
            for action in [0, 1]:
                state_action = torch.cat((state_tensor, torch.tensor([[action]], dtype=torch.float)), dim=1)
                q_value = model(state_action).item()
                qs.append(q_value)
            best_action = np.argmax(qs)

            # Take action in the environment
            next_state, reward, done, _, _ = env.step(best_action)
            total_reward += reward
            time.sleep(0.001)

            # Move to the next state
            state = next_state

            if stps > 200:  # You can change this depending on the max steps you would like to test
                done = True
                break

        # Store the total reward for this episode
        total_scores.append(total_reward)

    # Calculate and print the average score
    average_score = sum(total_scores) / n_episodes
    print(f'Average Score over {n_episodes} episodes: {average_score}')
    return average_score  # Optionally return the average score
Now, we are ready to write the training loop. The training loop involves repeatedly interacting with the CartPole environment: the agent selects actions based on its current policy, observes the outcomes (next state and reward), and stores these experiences in our replay buffer. Following an interaction, the agent updates its policy model using the optimize_model function. This cycle of exploration (via random action selection modulated by the epsilon parameter) and exploitation (using the model's predictions) underpins the agent's learning process.
# Assumes model (the policy network) and e_model (the target network) are cpmodel instances created earlier
best_avg_score = float('-inf')  # Initialize the best average score to negative infinity
max_eps = 10000

for st in range(max_eps):
    global_replay_buffer = []
    total_episodes = 0
    epsilon = max(epsilon_min, epsilon * epsilon_decay_rate)

    # Log the epsilon value to wandb
    wandb.log({'epsilon': epsilon, 'step': st})

    state = env.reset()
    done = False
    while not done:
        if isinstance(state, tuple):
            actual_state = state[0]
        else:
            actual_state = state

        # Epsilon-greedy action selection
        if random.uniform(0, 1) < epsilon:
            action = env.action_space.sample()
        else:
            vs = []
            for a in [0, 1]:
                inp = torch.tensor(list(actual_state) + [a], dtype=torch.float)
                vs.append(model(inp).item())
            action = torch.argmax(torch.tensor(vs)).item()

        next_state, reward, done, truncated, info = env.step(action)
        replay_buffer.extend([(actual_state, action, reward, next_state, done)])
        state = next_state

    # Optimize and evaluate once the buffer holds enough experiences
    if len(replay_buffer) >= 1000:
        loss = optimize_model(policy_model=model, target_model=e_model)
        avg_score = evaluate_model(model=model, env=env, n_episodes=10)
        wandb.log({'average_score': avg_score, 'step': st})
        if avg_score > best_avg_score:
            best_avg_score = avg_score
            # Save the model as the best one so far
            torch.save(model.state_dict(), 'best_model.pth')
            # Log the new best score to wandb
            wandb.log({'best_avg_score': best_avg_score, 'step': st})

    # Periodically sync the target network with the policy network
    if st % 20 == 0:
        e_model.load_state_dict(model.state_dict())

env.close()
During training, we employ an epsilon-greedy strategy to balance exploration with exploitation, gradually decreasing epsilon to shift the agent's behavior from exploring the environment to exploiting its learned knowledge. Concurrently, we periodically synchronize our target network with the policy network to stabilize the learning targets. This process continues over a series of episodes, allowing the agent to iteratively improve its understanding and performance in the task at hand.
Saving the Best Model
The code checks whether the current average score of the model, avg_score, is greater than best_avg_score, which was initialized to negative infinity. This check is performed on each training iteration once the replay buffer holds enough experiences (here, at least 1,000 transitions).
If the model outperforms the previous best, the new avg_score becomes the best_avg_score, and the model's parameters are saved using PyTorch's torch.save function:
torch.save(model.state_dict(), 'best_model.pth')
This line saves the model's parameters to a file called best_model.pth.
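To restore these weights later (as the inference script below does), load them back into a model with the same architecture:
model = cpmodel(4, 1)                                 # same architecture as used during training
model.load_state_dict(torch.load('best_model.pth'))  # restore the saved parameters
model.eval()                                          # switch to evaluation mode for inference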
Logging with Weights & Biases
Throughout the training process, certain metrics are logged using W&B:
wandb.log({'epsilon': epsilon, 'step': st})
wandb.log({'average_score': avg_score, 'step': st})
wandb.log({'best_avg_score': best_avg_score, 'step': st})
These lines log the value of epsilon (which influences the exploration-exploitation balance in reinforcement learning), the average score achieved by the model, and the best average score so far (averaged over 10 episodes). Logging these metrics helps in monitoring the model's performance and hyperparameters over time, which is crucial for experiment tracking and analysis. I've uploaded the full training script here, so feel free to check it out.
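These logging calls assume a W&B run was initialized earlier in the script; a minimal setup looks something like this (the project name is just a placeholder):
import wandb

# Initialize a run before any wandb.log calls; logging hyperparameters to the config is optional
wandb.init(project="cartpole-dqn", config={
    "epsilon_decay_rate": 0.995,
    "gamma": 0.99,
    "update_every": 20,
})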
Results
Run: still-fog-4
Now, these results may seem a little unstable, and it's because they are! Training instability is a well-known, fundamental challenge of current reinforcement learning methods. However, by saving the best model along the way, we are able to obtain a model that generalizes quite well!
Running Inference
Now we can write some code to watch the model perform in real time! I'll share a script below that uses the model we trained earlier to play the Cartpole environment.
import gym
import torch
import matplotlib.pyplot as plt
from IPython import display

env = gym.make('CartPole-v1', render_mode='human')

class cpmodel(torch.nn.Module):
    def __init__(self, state_size, action_size):
        super(cpmodel, self).__init__()
        self.state_size = state_size
        self.action_size = action_size
        self.fc1 = torch.nn.Linear(5, 500)
        self.fc2 = torch.nn.Linear(500, 128)
        self.fc3 = torch.nn.Linear(128, 1)

    def forward(self, x):
        x = torch.nn.functional.relu(self.fc1(x))
        x = torch.nn.functional.relu(self.fc2(x))
        x = self.fc3(x)
        return x

model = cpmodel(4, 1)
model.load_state_dict(torch.load('best_model.pth'))
model.eval()

state = env.reset()
# img = plt.imshow(env.render())  # Only for initialization
done = False
while not done:
    display.clear_output(wait=True)
    img = env.render()
    actual_state = state if not isinstance(state, tuple) else state[0]

    vs = []
    for a in range(env.action_space.n):
        inp = torch.tensor(list(actual_state) + [a], dtype=torch.float)
        vs.append(model(inp).item())
    action = torch.argmax(torch.tensor(vs)).item()

    state, reward, done, truncated, info = env.step(action)

env.close()
The Challenge of RL
In conclusion, while the principles and techniques detailed in this guide—spanning from the basics of reinforcement learning to the more sophisticated realms of Deep Q-Learning and Double DQN—offer robust frameworks for understanding and tackling decision-making problems, their direct application in real-world scenarios can be challenging. The complexities inherent in physical environments, the high dimensionality of real-world data, and the unpredictability of natural systems often necessitate adaptations and modifications to these theoretical models.
Opportunities
Despite these challenges, it is important to recognize that the core ideas and methodologies of reinforcement learning are influencing a wide array of applications beyond traditional gaming or simulated environments. Specifically, the concepts derived from RL are being adapted and integrated into the development of large language models, particularly through techniques such as Reinforcement Learning from Human Feedback (RLHF). In RLHF, LLMs are refined not just through pre-programmed rewards but also by incorporating human judgments and preferences into the learning process, thereby aligning the models' outputs more closely with human values and expectations.
This integration illustrates a compelling bridge between the theoretical underpinnings of RL and practical applications in AI, showcasing the adaptability and potential of RL principles in enhancing and guiding the behavior of sophisticated AI systems. Despite the hurdles in direct applications, the foundational concepts of RL continue to inspire and revolutionize how we approach, design, and refine AI systems, paving the way for more intuitive, effective, and aligned technological solutions.