Getting started with reinforcement learning (with a Python tutorial)
A hands-on introduction to reinforcement learning, explaining how agents learn from interaction to solve complex decision-making problems - plus practical implementations of Deep Q-learning and Actor-Critic methods in Python.
Reinforcement learning (RL) is a core field within artificial intelligence that enables machines to learn effective actions through experience, much like how humans and animals learn new skills. Instead of being told exactly what to do, an RL agent explores its environment, tries out different strategies, and gradually discovers which behaviors yield the most favorable outcomes. This approach makes reinforcement learning uniquely suited to tackle problems where the right answer isn’t always obvious or available in advance.
The significance of reinforcement learning is growing rapidly, powering breakthroughs across diverse domains - robotics, gaming, autonomous vehicles, finance, and beyond. Robots can learn to grasp delicate objects, game-playing agents master games like Go and Dota 2, and self-driving cars improve their navigation skills - all using RL techniques.
In this tutorial, we’ll guide you through the fundamental concepts of reinforcement learning, explaining how RL agents learn and adapt to their environment. Step by step, you’ll discover what makes RL distinct from other machine learning approaches, explore its core components and mathematical foundations, and understand the differences between popular RL algorithm types.
On the practical side, you’ll learn how to implement several key RL algorithms in Python - including Deep Q-Learning and Actor-Critic methods - using hands-on examples with classic environments like CartPole. By the end, you’ll be equipped with both the theoretical knowledge and practical skills to experiment with reinforcement learning in your own projects.
Let's get started.

Table of contents
- What is reinforcement learning?
- How does reinforcement learning work?
- Why use reinforcement learning instead of supervised learning?
- Key components of the reinforcement learning process
  - How these elements come together
- Reinforcement learning across different "environments"
  - CartPole-v1
  - Robotics (e.g., robotic arm control)
  - Autonomous driving
- Key mathematical concepts in reinforcement learning
  - Backpropagation
  - Value functions
  - The Bellman equation
  - Temporal difference (TD) learning & TD error
  - Bootstrapping
  - Action sampling: Stochastic vs. deterministic policies
  - Exploration strategies: ε-greedy and others
- Types of reinforcement learning algorithms
  - Model-based vs. model-free reinforcement learning
  - On-policy vs. off-policy learning
  - Value-based, policy-based, and actor-critic methods
  - Replay buffer vs. no replay buffer
  - Discrete vs. continuous action spaces
- Q-learning and actor-critic algorithms
  - Q-learning: Value-based, off-policy learning
  - Actor-critic: Combining policy gradients and value functions
- Implementing deep Q-learning in Python
- Implementing A2C in Python
- Continuous vs. discrete action spaces
- Exploration vs. exploitation dilemma in reinforcement learning
  - Strategies to balance exploration and exploitation
  - Epsilon-greedy
  - Boltzmann (softmax) exploration
  - Entropy regularization (in policy gradient methods)
- RLHF and reinforcement learning in language models
- Conclusion
What is reinforcement learning?
Reinforcement learning is a branch of machine learning focused on training agents to make sequences of decisions by interacting with an environment. At its core, reinforcement learning mimics the way humans and animals learn: through trial and error.
An RL agent learns by continuously observing its current state, choosing an action, and then receiving feedback from the environment in the form of a reward or penalty. The goal of the agent is to discover a strategy - called a policy - that maximizes the total reward it earns over time.
This trial and error approach is what sets RL apart from other machine learning techniques like supervised learning, where the correct answers (labels) are provided during training. In reinforcement learning, agents must explore, experiment, and learn what works best solely from the feedback received, often when the consequences of choices are delayed or spread out over many steps.
By leveraging these feedback loops and the concept of maximizing cumulative reward, reinforcement learning has become a powerful tool for tackling complex, decision-based problems in areas such as robotics, autonomous vehicles, game AI, and resource management.
How does reinforcement learning work?
Reinforcement learning operates through a feedback-driven process, where an agent learns by repeatedly interacting with its environment. The primary goal of an RL agent is to maximize the expected sum of rewards - also called the total return - over time. Rather than focusing on immediate rewards, the agent seeks strategies that yield the highest cumulative payoff in the long run.
Here’s how the process unfolds step by step:
- Observation: The agent observes the current state of the environment.
- Action: Based on its current policy (strategy), the agent selects and executes an action.
- Feedback: The environment responds by transitioning to a new state and providing a reward signal, indicating the benefit (or cost) of the chosen action.
- Learning: The agent updates its policy to improve future decision-making, guided by the rewards received.
This cycle repeats thousands or even millions of times as the agent explores different strategies. Over time, the agent learns - by trial and error - to favor actions leading to higher total returns.
In modern reinforcement learning systems, especially for complex tasks like playing video games or robotic control, the agent’s policy (and sometimes value functions) is typically parameterized by a neural network. This deep reinforcement learning approach enables the agent to handle high-dimensional states, such as images or sensor data.
Just like in supervised learning, neural networks in reinforcement learning are trained using backpropagation, which adjusts the network’s parameters to minimize a loss function. However, the calculation of this loss is fundamentally different from supervised learning:
- In supervised learning, the loss is based on the direct comparison between the model’s predictions and the true (labeled) outputs.
- In reinforcement learning, the loss is computed based on the difference between the expected return (what the agent predicts will happen) and the actual rewards observed through interaction with the environment. For example, this can involve minimizing the Temporal Difference (TD) error (more on this later) in value-based methods, or using policy gradients in policy-based methods.
This special treatment of the loss function allows reinforcement learning agents to learn not just from fixed datasets, but dynamically, as they encounter new situations and outcomes. As a result, neural networks in RL can learn sophisticated behaviors and policies that optimize long-term objectives in challenging, interactive environments.
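To make this difference concrete, here's a minimal, self-contained sketch (toy tensors, not the tutorial's actual training code) contrasting a supervised loss with a bootstrapped TD-style loss in PyTorch:

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
net = nn.Linear(4, 2)                       # toy network: 4-dim state -> 2 action values
state, next_state = torch.randn(4), torch.randn(4)
reward, gamma, done = 1.0, 0.99, False

# Supervised learning: the target is a known label
label = torch.tensor([0.5, -0.5])
supervised_loss = F.mse_loss(net(state), label)

# Value-based RL: the "label" is a bootstrapped TD target built from experience
q_sa = net(state)[0]                        # Q-value of the action actually taken
with torch.no_grad():
    td_target = reward + gamma * net(next_state).max() * (1.0 - float(done))
rl_loss = (q_sa - td_target).pow(2)         # squared TD error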
Why use reinforcement learning instead of supervised learning?
Supervised learning and reinforcement learning address fundamentally different types of problems. In supervised learning, you train a model using a dataset where the correct output (label) for every input is already known. The model learns to make predictions by minimizing the error between its outputs and these known answers. This approach works well for tasks like image classification, spam detection, or handwriting recognition, where each example can be independently labeled.
Reinforcement learning, on the other hand, is designed for situations where explicit labels are not available. Instead, an agent learns by interacting with an environment and receiving feedback in the form of rewards, which may be sparse, delayed, or dependent on the sequence of actions taken. Instead of single-step predictions, RL focuses on learning strategies that maximize long-term cumulative rewards. Essentially, the agent gathers data through its interaction with the environment.
One of the key distinctions is that, in RL, the consequences of an action might not be fully known until much later. For example, consider a video game where a robot must learn to navigate a maze: taking a wrong turn might not be penalized immediately, but could make it impossible to reach the goal several steps later. Similarly, in robotics, controlling a robotic arm to assemble a product involves a sequence of motions where only the final result determines success or failure.
These types of problems - where the right choice depends on context, timing, and delayed outcomes - are where reinforcement learning excels. By learning from ongoing interaction rather than fixed labels, RL agents can develop strategies to handle complex, sequential, and feedback-driven environments that are challenging or impossible for conventional supervised learning approaches.
Key components of the reinforcement learning process
The reinforcement learning process is defined by several fundamental components that work together to enable learning from interaction:
Agent: The decision-maker or learner, whose goal is to learn how to act optimally within a given environment.
Environment: The world in which the agent operates and with which it interacts. The environment defines the rules, dynamics, and feedback the agent receives.
State space: All possible “situations” or configurations the agent can experience. Each state captures the relevant information the agent needs to make a decision. For example, in CartPole-v1, a state includes variables like cart position and velocity, pole angle, and angular velocity. In a video game, a state might be the current frame of the screen; in a robotic setting, it could be sensor readings and joint positions.
Action space: The set of all possible actions the agent can take at a given moment. This can be as simple as discrete choices (e.g., move left or right in CartPole-v1, press up/down/left/right buttons in a game) or as complex as a range of continuous controls (e.g., adjusting the angle of a robotic joint).
Reward: A numerical value sent from the environment to the agent after each action, indicating how good or bad that action was in the current context. Rewards can be immediate or delayed and often guide the agent toward achieving a long-term objective.
How these elements come together
At every time step, the agent observes the current state, selects and executes an action, and receives two things from the environment: a new state and a reward. This interaction loop is repeated over many episodes. The agent’s challenge is to learn a strategy (policy) that selects actions so as to maximize the total sum of rewards it accumulates over time.
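Before diving into specific environments, here's a minimal sketch of that interaction loop using the Gym API (the same API used in the later examples). It uses a random policy, so no learning happens yet - it only shows the observe/act/feedback cycle. Note that newer Gym/Gymnasium versions return slightly different tuples from reset() and step():

import gym

env = gym.make("CartPole-v1")
state = env.reset()                                    # 1. observe the initial state
done = False
total_reward = 0.0

while not done:
    action = env.action_space.sample()                 # 2. choose an action (random policy here)
    next_state, reward, done, info = env.step(action)  # 3. environment returns a new state and reward
    total_reward += reward
    state = next_state                                 # 4. a learning agent would update its policy here

print(f"Episode finished with total reward {total_reward}")
env.close()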
Reinforcement learning across different "environments"
The specifics of state and action spaces, as well as reward structure, vary dramatically across environments. Here are a few common environments in which reinforcement learning algorithms tend to work well:
CartPole-v1:
State space: Four continuous variables (cart position, cart velocity, pole angle, and pole angular velocity).
Action space: Discrete (move left or right).
Reward: +1 for each timestep the pole remains upright.

Robotics (e.g., robotic arm control):
State space: Sensor readings, joint angles, positions of objects.
Action space: Continuous (set motor torques or velocities).
Reward: Positive reward for successfully completing a task (like grasping an object) and possibly negative reward for collisions or wasted energy.

Autonomous driving:
State space: Camera input, lidar data, GPS location, speed, and more.
Action space: Steering, throttle, brake (continuous or discrete).
Reward: Staying on the road, following traffic rules, minimizing travel time, and avoiding collisions.

By designing states, actions, and rewards that capture the specifics of a task, reinforcement learning can be applied in a wide variety of settings. The agent’s learning process is always grounded in this cyclical experience of observation, action, and feedback, regardless of the environment’s complexity.
Through this interaction, the agent gradually discovers effective behaviors, improving its performance as it accumulates experience across episodes.
Key mathematical concepts in reinforcement learning
Reinforcement learning relies on several core mathematical tools that help agents evaluate situations, estimate future rewards, and adjust their behavior over time. These include value functions, the Bellman equation, temporal difference (TD) learning, TD error, bootstrapping, and backpropagation. Together, they form the foundation for how RL agents learn from experience and improve their decision-making to maximize long-term reward.
Backpropagation is especially important in deep reinforcement learning, where neural networks are used to approximate policies and value functions. Instead of learning from labeled data, these networks learn from rewards, updating their parameters based on how closely predictions match actual outcomes. Understanding how these tools work - and how they fit together - makes it easier to grasp how reinforcement learning systems refine their strategies over time.
Backpropagation
Modern deep reinforcement learning algorithms, such as Deep Q-Learning and Actor-Critic methods, rely heavily on neural networks to make decisions and predictions. These networks are trained using backpropagation - the same process used in supervised learning to adjust neural network parameters and minimize loss.
However, in reinforcement learning this process can feel more abstract, since there aren’t direct input-output labels indicating the “correct” action at every step. Instead, learning is driven by rewards from the environment. The network outputs value estimates or action probabilities, and the loss is determined by objectives specific to RL, such as the Temporal Difference (TD) error. At the end of the day, the core mechanism is unchanged: the network makes predictions, observes the outcomes (rewards), computes how “off” its estimates were, and then tweaks its parameters via backpropagation. The major difference is that the target is not a static label but the agent’s best guess at maximizing long-term reward based on its experiences so far.
Value functions
Rather than relying on fixed labels, reinforcement learning agents use value functions to guide their learning:
The state value function (often denoted as V(s)) estimates how good or promising it is to be in a particular situation (state), considering all possible future outcomes if the agent follows its current strategy (policy).
The action value function (often referred to as Q(s, a) and also called the Q-function) estimates how much total reward the agent can expect if it takes a certain action from a given state, and then continues to act according to its policy.
By assigning these values, the agent can evaluate its options, compare possible actions, and choose those most likely to yield greater long-term gains.
The Bellman equation
The Bellman equation is a foundational concept that shows how the value of a state or action can be broken down into immediate reward and the (discounted) value of what comes next. For value functions, this recursive relationship is typically written as:

$$V^\pi(s) = \mathbb{E}\big[\, r(s, a) + \gamma\, V^\pi(s') \,\big]$$

In plain English: the value of a state V(s) equals the expected reward for being in that state and taking an action, plus the value of the next state (discounted by γ), assuming the agent follows its policy. This allows reinforcement learning algorithms to "bootstrap" estimates - updating current predictions using more recent (and potentially more accurate) estimates about the future.
Temporal difference (TD) learning & TD error
Temporal difference (TD) learning is a central technique in reinforcement learning that allows the agent to update its value estimates after every step, rather than waiting until the end of an episode. The key idea is to compare the predicted value of a state to the reward experienced and the value estimate of the next state. This difference - known as the TD error - serves as a learning signal.
If the outcome was better than expected (positive TD error), the value estimate should increase. If it was worse (negative TD error), the value estimate should decrease. Remarkably, neuroscience studies have shown that dopamine neurons in animal brains behave much like this:
- When rewards are better than expected, dopamine spikes - a positive teaching signal.
- When rewards are as expected, dopamine stays roughly the same.
- When rewards are lower than expected (or missing), dopamine drops - signaling disappointment and driving learning adjustments.
This parallel between reinforcement learning algorithms and biological learning further underscores the significance of TD error in the learning process.
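On the algorithmic side, the TD(0) update itself is only a few lines. Here's a minimal tabular sketch (separate from the CartPole code later on), where alpha is the learning rate:

import numpy as np

n_states = 5
V = np.zeros(n_states)              # current state-value estimates
alpha, gamma = 0.1, 0.99            # learning rate and discount factor

def td_update(state, reward, next_state, done):
    """One TD(0) update: move V[state] toward the bootstrapped target."""
    target = reward + gamma * V[next_state] * (not done)
    td_error = target - V[state]    # positive: outcome better than expected
    V[state] += alpha * td_error
    return td_error

# Example transition: from state 0 to state 1 with reward +1
print(td_update(0, 1.0, 1, done=False))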
Bootstrapping
Bootstrapping in reinforcement learning refers to updating value estimates based not just on immediate rewards, but also on the agent’s current predictions about future rewards. The algorithm “assumes” that its new estimate at the next state is correct, and uses this as a basis for improving its estimate for the current state. This allows the agent to learn more efficiently, continuously refining its expectations as new experiences are gathered and propagating improvements backward through the sequence of states and actions.
Action sampling: Stochastic vs. deterministic policies
A policy guides how the agent chooses its actions.
- In a deterministic policy, the agent always picks the same action for a given situation.
- In a stochastic policy, the agent chooses actions according to certain probabilities. This randomness allows the agent to try a wider range of moves, which can be crucial in complex or unpredictable environments. By sometimes taking less-optimal or unusual actions, the agent may discover better strategies or actions that it wouldn’t have found with a fixed, deterministic approach.
Exploration strategies: ε-greedy and others
A central challenge in reinforcement learning is balancing exploration (trying new actions to gather information) with exploitation (choosing the current-best action to maximize reward). A common technique is the ε-greedy strategy:
- Most of the time, the agent chooses what it currently believes is the best action.
- Occasionally (with probability ε), it will randomly select any action, ensuring it continues to explore and doesn’t get stuck in a local optimum.
Other, more sophisticated exploration approaches exist as well, but ε-greedy is a simple and widely used baseline.
Together, these mathematical tools and concepts form the core machinery of reinforcement learning, enabling agents to incrementally improve their performance and tackle challenging decision-making problems in complex, uncertain environments.
Types of reinforcement learning algorithms
Reinforcement learning algorithms come in many forms, each suited to different problems and environments. Two key ways to categorize RL algorithms are by whether they learn a model of the environment (model-based vs. model-free) and by how they use experience in learning (on-policy vs. off-policy). RL methods also differ by how they handle action spaces and experience data.
Model-based vs. model-free reinforcement learning
Model-based reinforcement learning involves creating an internal model of the environment to plan actions, while model-free RL learns directly from interactions. Each approach has its benefits and challenges, with model-free methods like Q-learning being widely used for their simplicity.
Model-based reinforcement learning algorithms
Model-based reinforcement learning algorithms attempt to learn (or are given) a model that can predict what will happen next - how actions affect the next state and reward. The agent can then use this internal model to plan ahead, simulate possible futures, or even imagine additional experiences beyond direct interaction.
- Advantages: Greater sample efficiency, and the ability to plan or adapt quickly if the environment changes.
- Challenges: Building a sufficiently accurate model can be very difficult, especially in complex or high-dimensional environments. Planning can also become computationally expensive.
- Examples: Dyna-Q (combines real experience with simulated experience from a learned model), AlphaZero (plans moves in board games using a learned model of game dynamics).
Model-free reinforcement learning algorithms
Model-free reinforcement learning algorithms do not try to explicitly learn the environment’s dynamics. Instead, they learn directly from observed state-action-reward sequences, tuning their value functions or policies based solely on real experience.
- Advantages: Simpler to implement, avoids the challenge of model learning, can excel in environments where accurate modeling is infeasible.
- Challenges: Usually less sample-efficient and can’t plan ahead without direct experience.
- Examples: Q-learning, SARSA, Deep Q-Networks, Policy Gradients, Actor-Critic methods.
On-policy vs. off-policy learning
A key difference between reinforcement learning algorithms is how they use the data they collect during training.
On-policy algorithms learn about the policy currently being used to make decisions. They update based on the actions the agent actually takes while following that policy. These methods are often more robust to rapid policy changes and can learn directly from exploration, but they’re less data-efficient because they can’t reuse past experience once the policy has changed.
Off-policy algorithms learn about one policy while collecting data under another. This allows experience from earlier or exploratory behavior to contribute to learning, even if the agent is no longer acting that way. For example, Q-learning is off-policy; it learns the value of the optimal policy by always updating towards the highest-value action, regardless of the action actually taken. Off-policy methods tend to be more data-efficient, as they can reuse old experience and support techniques like experience replay buffers.
Value-based, policy-based, and actor-critic methods
Another important way to classify reinforcement learning algorithms is by what they learn and how they make decisions. Three common categories are:
- Value-based methods focus on learning value functions that estimate expected future rewards for states or state-action pairs. The agent then acts by selecting actions with the highest predicted value. Examples include Q-learning and DQN. These methods are especially well suited to discrete action spaces.
- Policy-based methods learn a policy directly, mapping states to actions, often using techniques called policy gradients. These methods perform well in environments with continuous or very large action spaces. Examples include REINFORCE and PPO. While policy-based methods can be less sample efficient than value-based methods, they can learn complex, stochastic policies more effectively.
- Actor-critic methods combine both approaches, training both a policy ("actor") and a value function ("critic") in tandem. The critic helps the actor by providing feedback on the quality of actions; the actor adjusts its behavior to maximize expected reward as guided by the critic. This combination often yields more efficient and stable learning, especially in challenging, high-dimensional environments. Popular actor-critic algorithms include A2C, DDPG, and SAC.
Replay buffer vs. no replay buffer
A replay buffer is commonly used in off-policy algorithms like Deep Q-Networks (DQN). Experiences (state, action, reward, next state) are stored and randomly sampled to break up correlations in experience and allow for more robust, stable, and data-efficient learning.
Learning without a replay buffer is more common in on-policy algorithms, which typically use the most recent experiences and discard old ones to keep learning consistent with the current policy.
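For illustration, a minimal replay buffer can be built from a deque; the capacity and batch size below are arbitrary choices, not values from the tutorial code:

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)    # oldest experiences are dropped automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Random sampling breaks correlations between consecutive transitions
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)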
Discrete vs. continuous action spaces
Another key distinction is whether the agent’s set of possible actions is discrete or continuous. For discrete action spaces, the agent selects from a finite set of options (e.g., move left, right, forward, press a button). Algorithms like Q-learning and SARSA work well in these settings. In continuous action spaces, the agent selects actions from a range of real numbers (e.g., set the accelerator pedal to any value between 0 and 1). Policy gradient and actor-critic methods are better suited to continuous spaces.
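In Gym terms, the distinction shows up directly in the type of the action space. A quick check (environment names may vary slightly between Gym versions):

import gym

discrete_env = gym.make("CartPole-v1")
print(discrete_env.action_space)      # Discrete(2): push the cart left or right

continuous_env = gym.make("Pendulum-v1")
print(continuous_env.action_space)    # Box(-2.0, 2.0, (1,)): any torque between -2 and 2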
In summary, reinforcement learning algorithms can be model-based or model-free, on-policy or off-policy, may or may not use replay buffers, and can be applied to either discrete or continuous action spaces. The best choice depends on the problem, available data, and computational resources.
Q-learning and actor-critic algorithms
I chose to cover these two algorithm families - Q-learning and actor-critic - because, between them, I believe they capture the fundamental mechanics underlying many modern reinforcement learning approaches. A deep understanding of both will likely allow you to quickly generalize to more advanced or customized RL algorithms.
💡
Q-learning and actor-critic methods represent two pillars of reinforcement learning. Q-learning focuses on estimating the value of actions, while actor-critic methods combine the strengths of policy optimization and value estimation for more flexible and powerful learning.
Q-learning: Value-based, off-policy learning
Q-learning is a classic value-based, off-policy algorithm. The foundational idea is to learn a Q-function (Q(s, a)), which can be thought of as a value function conditioned on taking a specific action from a particular state - it estimates the expected return for a given (state, action) pair, assuming optimal behavior thereafter.
The main objective in Q-learning is to bootstrap and minimize the temporal difference (TD) error. Every time the agent takes an action, receives a reward, and transitions to a new state, it seeks to reduce the difference between its existing Q-value prediction and the updated target, which is based on the reward received and the maximum Q-value of the next state. Here’s the equation for the TD error:

$$\delta = r + \gamma \max_{a'} Q(s', a') - Q(s, a)$$

The Q-network is trained to minimize this TD error by updating its parameters after each transition. Here, s′ is the next state, and the next action a′ is chosen greedily - that is, whichever action has the highest predicted value in s′. This process - updating based on the best future value regardless of the action actually taken - makes Q-learning an off-policy algorithm.
Deep Q-learning (DQL) extends this approach by using neural networks to approximate the Q-function and by introducing experience replay for stability and efficiency, enabling Q-learning to operate in complex, high-dimensional environments.
Actor-critic: Combining policy gradients and value functions
Actor-critic methods combine two complementary approaches from reinforcement learning. The actor is responsible for selecting actions by learning a policy - a mapping from states to actions. The critic estimates value functions, judging how good the current state or action is, and provides feedback to guide the actor’s choices.
In practice, both the actor and the critic are implemented as neural networks. The actor network outputs a distribution over possible actions in discrete spaces or over action parameters in continuous settings. The action assigned the highest probability corresponds to the actor’s current estimate of the best action in the given state. The critic network is typically trained to estimate the value of being in a certain state, denoted as V(s), or sometimes the value of a state-action pair, written as Q(s, a).
Actor-critic algorithms also bootstrap their value estimates using temporal difference (TD) learning, just like Q-learning:

$$\delta = r + \gamma V(s') - V(s)$$

where s′ is the next state obtained after the taken action. The critic’s goal is to minimize the TD error - it learns to estimate the value of each state as accurately as possible, refining its predictions so that the predicted value matches the received reward plus estimated future values.
For the actor (policy network), the objective is to maximize the expected return. The loss for the actor is typically based on the policy gradient:

$$L_{\text{actor}} = -\log \pi_\theta(a \mid s)\, A(s, a)$$

where the advantage estimates how much better (or worse) the chosen action was, compared to the critic’s expectation:

$$A(s, a) = r + \gamma V(s') - V(s)$$
In essence, the actor learns to increase the probability of taking actions that lead to higher-than-expected rewards (positive advantage), and decrease the probability for actions that are worse than expected.
A concise way to think of the actor-critic framework is as follows (a short code sketch follows the list):
- The critic learns to evaluate states (or state-action pairs) by minimizing the temporal difference (TD) error.
- The actor learns to select actions that lead toward higher-value (advantageous) states, effectively maximizing expected return.
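In code, those two objectives reduce to just a few lines. Here's a self-contained sketch with toy numbers standing in for one transition (the full training loop appears in the A2C implementation later):

import torch

value = torch.tensor(0.5, requires_grad=True)       # critic's V(s)
value_next = torch.tensor(0.8)                      # critic's V(s'), treated as a constant
log_prob = torch.tensor(-0.7, requires_grad=True)   # log pi(a|s) from the actor
reward, gamma = 1.0, 0.99

td_error = reward + gamma * value_next - value      # delta = r + gamma * V(s') - V(s)
critic_loss = td_error.pow(2)                       # critic: minimize the squared TD error

advantage = td_error.detach()                       # actor treats the TD error as a constant
actor_loss = -log_prob * advantage                  # actor: raise probability of advantageous actions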
Actor-critic architectures underpin many advanced reinforcement learning algorithms, such as Proximal Policy Optimization (PPO), Asynchronous Advantage Actor Critic (A3C), and others, due to their effectiveness in continuous action spaces, scalability, and stable learning dynamics.
So, just a recap on these algorithms:
- Q-learning focuses on learning the value of each possible action per state and minimizing TD error by bootstrapping on the maximum future action value (off-policy).
- Actor-critic methods split responsibility: the actor learns a policy to select actions, and the critic learns to estimate value and guide the actor. Both use neural networks, bootstrapping, and TD learning - the critic minimizes TD error, and the actor maximizes the advantage of its actions.
Mastering these two approaches lays the groundwork for understanding modern deep RL algorithms.
Implementing deep Q-learning in Python
To make the ideas behind value-based and actor-critic reinforcement learning more concrete, we’ll walk through how each approach is implemented in practice. We’ll start with a foundational example: Q-learning - a value-based, off-policy algorithm that works well in discrete action spaces. In Q-learning, the agent estimates the expected return (Q-value) for each action in every state and updates these estimates by minimizing the temporal difference (TD) error as it interacts with the environment.
The code below demonstrates how deep Q-learning is applied in practice. It implements a simple DQN agent to solve the classic CartPole environment using PyTorch. The setup supports structured experimentation and performance tracking with Weights & Biases, including multiple training runs and summary plots to visualize learning progress.
Here's the code:
import gym
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import random
import wandb
import matplotlib.pyplot as plt
from scipy.ndimage import gaussian_filter1d
import argparse
import os
from datetime import datetime

ENV_NAME = 'CartPole-v1'
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class QNetwork(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.act_dim_ = act_dim
        self.obs_dim_ = obs_dim
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 1)
        )

    def forward(self, state, action):
        if state.dim() == 1:
            state = state.unsqueeze(0)
        if action.dim() == 0:
            action = action.unsqueeze(0)
        action_onehot = torch.nn.functional.one_hot(action, num_classes=self.act_dim_).float()
        x = torch.cat([state, action_onehot], dim=-1)
        return self.net(x).squeeze(-1)

    def sample_a(self, state, epsilon=0.1):
        if state.dim() == 1:
            state = state.unsqueeze(0)
        if torch.rand(1).item() < epsilon:
            return torch.randint(0, self.act_dim_, (1,), device=state.device).item()
        else:
            action = self.get_max_a(state)
            return action if isinstance(action, int) else action.item()

    def get_max_a(self, state):
        if state.dim() == 1:
            state = state.unsqueeze(0)
        batch_size = state.size(0)
        actions = torch.arange(self.act_dim_, device=state.device)
        actions = actions.unsqueeze(0).repeat(batch_size, 1)  # [B, A]
        states = state.unsqueeze(1).repeat(1, self.act_dim_, 1)  # [B, A, S]
        q_values = self.forward(states.view(-1, state.size(-1)), actions.view(-1))
        q_values = q_values.view(batch_size, self.act_dim_)
        best_action = torch.argmax(q_values, dim=1)
        return best_action.item() if batch_size == 1 else best_action

    @property
    def act_dim(self):
        return self.act_dim_

    @property
    def obs_dim(self):
        return self.obs_dim_

def make_env():
    env = gym.make(ENV_NAME)
    obs_dim = env.observation_space.shape[0]
    act_dim = env.action_space.n
    return env, obs_dim, act_dim

def train(run_id, config):
    # Initialize wandb
    run_name = f"run_{run_id}"
    wandb.init(
        project=config.project_name,
        name=run_name,
        group=config.group_name,
        config={
            "env_name": ENV_NAME,
            "learning_rate": config.learning_rate,
            "gamma": config.gamma,
            "episodes": config.episodes,
            "epsilon": config.epsilon,
            "run_id": run_id
        }
    )

    env, obs_dim, act_dim = make_env()
    q_net = QNetwork(obs_dim, act_dim).to(DEVICE)
    optimizer = torch.optim.Adam(q_net.parameters(), lr=config.learning_rate)
    gamma = config.gamma
    epsilon = config.epsilon

    all_rewards = []

    for episode in range(config.episodes):
        state = env.reset()
        if isinstance(state, tuple):  # gymnasium compatibility
            state = state[0]
        state = torch.tensor(state, dtype=torch.float32, device=DEVICE)
        done = False
        ep_reward = 0
        step_count = 0
        total_loss = 0

        while not done:
            action = q_net.sample_a(state, epsilon)
            next_state, reward, done, info = env.step(action)
            if isinstance(info, dict) and 'TimeLimit.truncated' in info:
                truncated = info.get('TimeLimit.truncated', False)
                done = done and not truncated
            if isinstance(next_state, tuple):
                next_state = next_state[0]
            next_state = torch.tensor(next_state, dtype=torch.float32, device=DEVICE)

            q_val = q_net(state, torch.tensor(action, device=DEVICE))
            with torch.no_grad():
                target = reward + gamma * q_net(next_state, torch.tensor(q_net.get_max_a(next_state), device=DEVICE)) * (1. - float(done))
            loss = (q_val - target).pow(2).mean()
            total_loss += loss.item()

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            state = next_state
            ep_reward += reward
            step_count += 1

        # Calculate metrics
        all_rewards.append(ep_reward)
        avg_reward_10 = sum(all_rewards[-10:]) / min(len(all_rewards), 10)
        avg_loss = total_loss / step_count if step_count > 0 else 0

        # Log to wandb
        wandb.log({
            "episode": episode,
            "reward": ep_reward,
            "avg_reward_10": avg_reward_10,
            "loss": avg_loss,
            "steps": step_count
        })

        if episode % 10 == 0 or episode == config.episodes - 1:
            print(f"Run {run_id} | Episode {episode} | Reward: {ep_reward:.2f} | Avg(10): {avg_reward_10:.2f}", flush=True)

    # Save final model if needed
    if config.save_model:
        model_path = os.path.join(config.output_dir, f"q_net_run_{run_id}.pt")
        torch.save(q_net.state_dict(), model_path)

    wandb.finish()
    return all_rewards

def smooth_rewards(rewards, sigma=2):
    """Apply Gaussian smoothing to rewards."""
    return gaussian_filter1d(rewards, sigma=sigma)

def create_summary_plot(all_run_rewards, config):
    """Create and save a summary plot showing average and individual runs."""
    plt.figure(figsize=(12, 8))

    # Plot individual runs with transparency
    for i, rewards in enumerate(all_run_rewards):
        episodes = range(1, len(rewards) + 1)
        smoothed = smooth_rewards(rewards)
        plt.plot(episodes, smoothed, alpha=0.3, label=f"Run {i+1}" if i < 5 else "")

    # Calculate and plot average
    min_length = min(len(r) for r in all_run_rewards)
    truncated_rewards = [r[:min_length] for r in all_run_rewards]
    avg_rewards = np.mean(truncated_rewards, axis=0)
    smoothed_avg = smooth_rewards(avg_rewards)
    plt.plot(range(1, min_length + 1), smoothed_avg, 'k-', linewidth=2, label="Average")

    # Add confidence intervals (std dev)
    std_rewards = np.std(truncated_rewards, axis=0)
    plt.fill_between(
        range(1, min_length + 1),
        smoothed_avg - std_rewards,
        smoothed_avg + std_rewards,
        color='k', alpha=0.2, label="±1 Std Dev"
    )

    plt.title(f"{ENV_NAME} - DQN Average Performance over {len(all_run_rewards)} Runs")
    plt.xlabel("Episodes")
    plt.ylabel("Smoothed Reward")
    plt.grid(True, alpha=0.3)
    plt.legend(loc="lower right")

    # Save the figure
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    filename = os.path.join(config.output_dir, f"dqn_summary_plot_{timestamp}.png")
    plt.savefig(filename, dpi=300, bbox_inches="tight")
    print(f"Summary plot saved to {filename}")

    # Log to wandb as a summary artifact
    summary_run = wandb.init(
        project=config.project_name,
        name="dqn_summary",
        group=config.group_name,
        job_type="analysis"
    )
    wandb.log({"summary_plot": wandb.Image(plt)})

    summary_data = {
        "mean_final_reward": float(avg_rewards[-1]),
        "std_final_reward": float(std_rewards[-1]),
        "max_mean_reward": float(np.max(avg_rewards)),
        "episode_of_max": int(np.argmax(avg_rewards) + 1),
        "num_runs": len(all_run_rewards)
    }
    wandb.log(summary_data)

    # Create a summary table of statistics
    data = [[i+1, r[-1], np.max(r), np.argmax(r)+1] for i, r in enumerate(all_run_rewards)]
    table = wandb.Table(columns=["Run", "Final Reward", "Max Reward", "Max Episode"], data=data)
    wandb.log({"run_stats": table})

    wandb.finish()
    return filename

def main():
    parser = argparse.ArgumentParser(description='Train DQN agent with multiple runs')
    parser.add_argument('--runs', type=int, default=50, help='Number of training runs')
    parser.add_argument('--episodes', type=int, default=300, help='Number of episodes per run')
    parser.add_argument('--learning_rate', type=float, default=1e-3, help='Learning rate')
    parser.add_argument('--gamma', type=float, default=0.99, help='Discount factor')
    parser.add_argument('--epsilon', type=float, default=0.1, help='Exploration rate')
    parser.add_argument('--project_name', type=str, default='dqn_cartpole', help='WandB project name')
    parser.add_argument('--group_name', type=str, default=None, help='WandB group name')
    parser.add_argument('--save_model', action='store_true', help='Save model checkpoints')
    parser.add_argument('--output_dir', type=str, default='./output', help='Directory to save outputs')
    config = parser.parse_args()

    # Create output directory if it doesn't exist
    if not os.path.exists(config.output_dir):
        os.makedirs(config.output_dir)

    # Set default group name if not specified
    if config.group_name is None:
        config.group_name = f"{ENV_NAME}_DQN_{datetime.now().strftime('%Y%m%d_%H%M%S')}"

    print(f"Starting {config.runs} training runs with {config.episodes} episodes each")
    print(f"WandB Project: {config.project_name}, Group: {config.group_name}")

    all_run_rewards = []
    for run_id in range(1, config.runs + 1):
        print(f"\n=== Starting Run {run_id}/{config.runs} ===")
        run_rewards = train(run_id, config)
        all_run_rewards.append(run_rewards)

    # Create summary visualization
    summary_plot = create_summary_plot(all_run_rewards, config)
    print(f"\nTraining complete! Summary visualization saved to {summary_plot}")

if __name__ == "__main__":
    main()
As you run and inspect this code, pay attention to the logging and performance monitoring provided by Weights & Biases. Logging key metrics such as episode rewards, moving averages, and losses allows for real-time visualization, easier comparison between experiments, and effective tracking of model improvements or regressions. This makes it straightforward to analyze how training progresses, compare different hyperparameter choices, and share results with others.
It’s also worth noting that this code intentionally leaves out some of the popular stabilization techniques found in more advanced DQN implementations - such as target networks, experience replay buffers, or reward normalization. By not including these mechanisms, you’ll likely observe some of the common pitfalls of reinforcement learning training, like high variance between runs, unstable value updates, and sensitivity to initialization.
This is deliberate: the aim is to expose the practical challenges of reinforcement learning in their raw form and to provide a clear diagnostic baseline. As you experiment further, you’ll see firsthand how these stabilization strategies can dramatically improve the reliability and efficiency of learning.
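As a sketch of what one such extension might look like (an illustration under my own assumptions, not part of the code above), a target network keeps a lagged copy of the Q-network and refreshes it periodically; the sync interval below is an arbitrary choice:

import copy

def make_target_network(q_net):
    """Create a frozen copy of the online Q-network."""
    target_net = copy.deepcopy(q_net)
    for p in target_net.parameters():
        p.requires_grad_(False)
    return target_net

def maybe_sync(target_net, q_net, step, sync_every=500):
    """Periodically copy the online weights into the target network."""
    if step % sync_every == 0:
        target_net.load_state_dict(q_net.state_dict())

# In the training loop, the TD target would then bootstrap from target_net instead of q_net:
#   target = reward + gamma * target_net(next_state, best_next_action) * (1. - float(done))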
Here are the logs for my run:
[Embedded W&B panel: dqn_summary run]
Implementing A2C in Python
To illustrate policy gradient methods, we’ll now implement Advantage Actor-Critic (A2C), a widely used algorithm where two neural networks learn in parallel: the actor, which selects actions, and the critic, which evaluates the value of states. This separation supports stable, efficient learning - especially in environments with complex or continuous state/action spaces.
In the code below, the actor outputs a probability distribution over discrete actions (as in CartPole), while the critic estimates the expected return of each state. This setup closely mirrors the A2C update equations, making the math easier to connect to the implementation.
As with DQN, the training loop is run multiple times to account for the variability in policy gradient methods, which are sensitive to initialization and exploration. Averaging results across runs provides a more realistic view of performance and stability.
This version of A2C omits common enhancements like entropy regularization, multi-step returns, and gradient clipping. As a result, it highlights both the strengths and limitations of the core algorithm—making it a useful foundation for experimentation and extension.
Here’s the code:
import gym
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import wandb
import matplotlib.pyplot as plt
from scipy.ndimage import gaussian_filter1d
import argparse
import os
from datetime import datetime

ENV_NAME = 'CartPole-v1'
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class Actor(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128),
            nn.ReLU(),
            nn.Linear(128, act_dim)
        )

    def forward(self, state, sample=True):
        if state.dim() == 1:
            state = state.unsqueeze(0)
        logits = self.net(state)
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample() if sample else torch.argmax(dist.probs, dim=-1)
        action_prob = dist.probs.gather(1, action.unsqueeze(-1)).squeeze(-1)
        return action, action_prob

class Critic(nn.Module):
    def __init__(self, obs_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 1)
        )

    def forward(self, state):
        if state.dim() == 1:
            state = state.unsqueeze(0)
        return self.net(state).squeeze(-1)

def make_env():
    env = gym.make(ENV_NAME)
    obs_dim = env.observation_space.shape[0]
    act_dim = env.action_space.n
    return env, obs_dim, act_dim

def train(run_id, config):
    # Initialize wandb
    run_name = f"run_{run_id}"
    wandb.init(
        project=config.project_name,
        name=run_name,
        group=config.group_name,
        config={
            "env_name": ENV_NAME,
            "lr_actor": config.lr_actor,
            "lr_critic": config.lr_critic,
            "gamma": config.gamma,
            "episodes": config.episodes,
            "run_id": run_id
        }
    )

    env, obs_dim, act_dim = make_env()
    actor = Actor(obs_dim, act_dim).to(DEVICE)
    critic = Critic(obs_dim).to(DEVICE)
    actor_opt = optim.Adam(actor.parameters(), lr=config.lr_actor)
    critic_opt = optim.Adam(critic.parameters(), lr=config.lr_critic)
    gamma = config.gamma

    all_rewards = []

    for episode in range(config.episodes):
        state = env.reset()
        if isinstance(state, tuple):
            state = state[0]
        state = torch.tensor(state, dtype=torch.float32, device=DEVICE)
        done = False
        ep_reward = 0
        step_count = 0

        while not done:
            action, prob = actor(state)
            next_state, reward, done, info = env.step(action.item())
            if isinstance(info, dict) and 'TimeLimit.truncated' in info:
                truncated = info.get('TimeLimit.truncated', False)
                done = done and not truncated
            if isinstance(next_state, tuple):
                next_state = next_state[0]
            next_state = torch.tensor(next_state, dtype=torch.float32, device=DEVICE)

            val_s = critic(state)
            val_sp = critic(next_state).detach()
            td_err = reward + gamma * val_sp * (1. - float(done)) - val_s
            critic_loss = td_err.pow(2)

            advantage = td_err.detach()
            log_prob = torch.log(prob.clamp(min=1e-8))
            actor_loss = -log_prob * advantage

            actor_opt.zero_grad()
            actor_loss.backward()
            actor_opt.step()

            critic_opt.zero_grad()
            critic_loss.backward()
            critic_opt.step()

            state = next_state
            ep_reward += reward
            step_count += 1

        all_rewards.append(ep_reward)
        avg_10 = sum(all_rewards[-10:]) / min(len(all_rewards), 10)

        # Log to wandb
        wandb.log({
            "episode": episode,
            "reward": ep_reward,
            "avg_reward_10": avg_10,
            "actor_loss": actor_loss.item(),
            "critic_loss": critic_loss.item(),
            "steps": step_count
        })

        if episode % 10 == 0 or episode == config.episodes - 1:
            print(f"Run {run_id} | Episode {episode} | Reward: {ep_reward:.2f} | Avg(10): {avg_10:.2f}", flush=True)

    # Save final model if needed
    if config.save_model:
        model_path = os.path.join(config.output_dir, f"actor_run_{run_id}.pt")
        torch.save(actor.state_dict(), model_path)

    wandb.finish()
    return all_rewards

def smooth_rewards(rewards, sigma=2):
    """Apply Gaussian smoothing to rewards."""
    return gaussian_filter1d(rewards, sigma=sigma)

def create_summary_plot(all_run_rewards, config):
    """Create and save a summary plot showing average and individual runs."""
    plt.figure(figsize=(12, 8))

    # Plot individual runs with transparency
    for i, rewards in enumerate(all_run_rewards):
        episodes = range(1, len(rewards) + 1)
        smoothed = smooth_rewards(rewards)
        plt.plot(episodes, smoothed, alpha=0.3, label=f"Run {i+1}" if i < 5 else "")

    # Calculate and plot average
    min_length = min(len(r) for r in all_run_rewards)
    truncated_rewards = [r[:min_length] for r in all_run_rewards]
    avg_rewards = np.mean(truncated_rewards, axis=0)
    smoothed_avg = smooth_rewards(avg_rewards)
    plt.plot(range(1, min_length + 1), smoothed_avg, 'k-', linewidth=2, label="Average")

    # Add confidence intervals (std dev)
    std_rewards = np.std(truncated_rewards, axis=0)
    plt.fill_between(
        range(1, min_length + 1),
        smoothed_avg - std_rewards,
        smoothed_avg + std_rewards,
        color='k', alpha=0.2, label="±1 Std Dev"
    )

    plt.title(f"{ENV_NAME} - Average Performance over {len(all_run_rewards)} Runs")
    plt.xlabel("Episodes")
    plt.ylabel("Smoothed Reward")
    plt.grid(True, alpha=0.3)
    plt.legend(loc="lower right")

    # Save the figure
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    filename = os.path.join(config.output_dir, f"summary_plot_{timestamp}.png")
    plt.savefig(filename, dpi=300, bbox_inches="tight")
    print(f"Summary plot saved to {filename}")

    # Log to wandb as a summary artifact
    summary_run = wandb.init(
        project=config.project_name,
        name="summary",
        group=config.group_name,
        job_type="analysis"
    )
    wandb.log({"summary_plot": wandb.Image(plt)})

    summary_data = {
        "mean_final_reward": float(avg_rewards[-1]),
        "std_final_reward": float(std_rewards[-1]),
        "max_mean_reward": float(np.max(avg_rewards)),
        "episode_of_max": int(np.argmax(avg_rewards) + 1),
        "num_runs": len(all_run_rewards)
    }
    wandb.log(summary_data)

    # Create a summary table of statistics
    data = [[i+1, r[-1], np.max(r), np.argmax(r)+1] for i, r in enumerate(all_run_rewards)]
    table = wandb.Table(columns=["Run", "Final Reward", "Max Reward", "Max Episode"], data=data)
    wandb.log({"run_stats": table})

    wandb.finish()
    return filename

def main():
    parser = argparse.ArgumentParser(description='Train RL agent with multiple runs')
    parser.add_argument('--runs', type=int, default=50, help='Number of training runs')
    parser.add_argument('--episodes', type=int, default=300, help='Number of episodes per run')
    parser.add_argument('--lr_actor', type=float, default=1e-3, help='Learning rate for actor')
    parser.add_argument('--lr_critic', type=float, default=1e-3, help='Learning rate for critic')
    parser.add_argument('--gamma', type=float, default=0.99, help='Discount factor')
    parser.add_argument('--project_name', type=str, default='rl_cartpole', help='WandB project name')
    parser.add_argument('--group_name', type=str, default=None, help='WandB group name')
    parser.add_argument('--save_model', action='store_true', help='Save model checkpoints')
    parser.add_argument('--output_dir', type=str, default='./output', help='Directory to save outputs')
    config = parser.parse_args()

    # Create output directory if it doesn't exist
    if not os.path.exists(config.output_dir):
        os.makedirs(config.output_dir)

    # Set default group name if not specified
    if config.group_name is None:
        config.group_name = f"{ENV_NAME}_{datetime.now().strftime('%Y%m%d_%H%M%S')}"

    print(f"Starting {config.runs} training runs with {config.episodes} episodes each")
    print(f"WandB Project: {config.project_name}, Group: {config.group_name}")

    all_run_rewards = []
    for run_id in range(1, config.runs + 1):
        print(f"\n=== Starting Run {run_id}/{config.runs} ===")
        run_rewards = train(run_id, config)
        all_run_rewards.append(run_rewards)

    # Create summary visualization
    summary_plot = create_summary_plot(all_run_rewards, config)
    print(f"\nTraining complete! Summary visualization saved to {summary_plot}")

if __name__ == "__main__":
    main()
By working through this more “bare-bones” A2C implementation, you’ll become familiar with both the elegance and the pitfalls of policy gradient approaches. You’ll see exactly how the actor and critic interact to improve the policy over time, and you’ll develop valuable insight into why further stabilization strategies are so common in real research and applications. This hands-on experience will provide the foundation you need to fully appreciate and experiment with state-of-the-art reinforcement learning algorithms.
Here are the results for my run:
[Embedded W&B panel: summary run]
Continuous vs. discrete action spaces
The nature of the action space - whether discrete or continuous - is fundamental to how reinforcement learning algorithms operate. Many foundational RL algorithms, such as traditional Q-learning and its deep variants, are primarily designed for discrete action spaces. In these settings, the agent selects from a fixed, finite set of possible actions. For example, in chess, the set of legal moves in each state is discrete; in classic video games like Atari, the agent selects among a small set of joystick/button combinations.
Discrete action space algorithms like Q-learning estimate and compare values for each possible action individually. This is tractable when the action set is small, as the agent can maintain a table (or output layer in a neural network) where each action is explicitly represented.
However, many real-world problems involve continuous action spaces - where actions are drawn from a range of real numbers, and the number of possible choices is infinite. For example, a robotic arm controlling its joint angles can set each motor to any value within a range; self-driving cars continuously choose steering angles and throttle levels. Q-learning-style algorithms struggle in continuous spaces because you can’t enumerate all possible actions to find the best one: computing $\max_{a} Q(s, a)$ is intractable when a can take infinitely many values.
In continuous spaces, you need methods that can output and optimize over continuous variables. Policy-gradient and actor-critic algorithms are much better suited to these environments. The actor network can be designed to output parameters of a continuous distribution (like the mean and variance of a Gaussian), from which actions are sampled. Examples of continuous action space algorithms include Deep Deterministic Policy Gradient (DDPG), Twin Delayed DDPG (TD3), Soft Actor-Critic (SAC), and the continuous version of PPO.
These are all actor-critic-style methods adapted for continuous control.
PPO can be seen as an extension and improvement over the Advantage Actor-Critic (A2C) framework. While A2C uses a straightforward actor-critic setup, PPO adds stability and robustness by introducing a clipped surrogate loss and limiting the step size of policy updates - essentially making sure the learned policy doesn’t change too abruptly from one update to the next. These “tricks” enable PPO to learn more reliably, especially in complex environments.
Adapting PPO to continuous action spaces is straightforward thanks to its policy-gradient foundation. Instead of outputting probabilities over discrete actions, the PPO actor network outputs the parameters (mean and variance) of a probability distribution - typically a Gaussian - for each action dimension. The agent samples actions from this continuous distribution to interact with the environment, enabling smooth control in tasks like robotics or autonomous driving.
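A minimal sketch of such an actor head in PyTorch (assuming a setup similar to the A2C actor above; the layer sizes are arbitrary):

import torch
import torch.nn as nn

class GaussianActor(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU())
        self.mean_head = nn.Linear(128, act_dim)            # mean for each action dimension
        self.log_std = nn.Parameter(torch.zeros(act_dim))   # learned, state-independent log std

    def forward(self, state):
        h = self.body(state)
        dist = torch.distributions.Normal(self.mean_head(h), self.log_std.exp())
        action = dist.sample()                              # continuous action vector
        log_prob = dist.log_prob(action).sum(-1)            # used by the policy-gradient loss
        return action, log_prob

# Example: 3-dimensional observation, 2-dimensional continuous action
actor = GaussianActor(obs_dim=3, act_dim=2)
action, log_prob = actor(torch.randn(3))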
This flexibility makes methods like PPO a popular choice for continuous control tasks, where classic discrete-action algorithms like Q-learning are not applicable. Other actor-critic variants such as DDPG, TD3, and SAC are also specifically designed for continuous spaces, but PPO remains widely favored for its balance of simplicity, stability, and effectiveness.
Exploration vs. exploitation dilemma in reinforcement learning
The exploration vs. exploitation dilemma is a fundamental challenge in reinforcement learning. An agent must continuously decide whether to exploit its current knowledge - choosing actions that have led to high rewards in the past - or to explore new actions that might lead to even greater rewards in the future. A good analogy for understanding this is how we learn to act in life: if we always stick to the routines and choices we already know, we might miss out on new opportunities or discoveries. On the other hand, occasionally trying something new - like picking a different route to work, tasting an unfamiliar dish, or experimenting with a new skill - can lead us to better outcomes or valuable insights we wouldn’t have found otherwise.
Striking the right balance is critical for effective learning and long-term performance. If an agent exploits too much, it risks getting stuck with suboptimal behavior, never discovering potentially better strategies. On the other hand, if it explores too much, it may spend excessive time trying random or unhelpful actions, slowing down progress and reducing total reward.
Strategies to balance exploration and exploitation
In reinforcement learning, agents must strike a balance between exploring new actions to discover potentially better strategies and exploiting known actions that yield high rewards. This balance, known as the exploration-exploitation trade-off, is fundamental to effective learning. Several strategies have been developed to manage this trade-off, each with its own approach to guiding the agent’s decision-making under uncertainty.
Epsilon-greedy
This is simple and widely used in environments with discrete actions. With probability ϵ, the agent picks a random action (exploration). With probability 1 − ϵ , it selects the action with the highest estimated value (exploitation). For example, in Q-learning:
- If a random number < ϵ: explore.
- Else: pick the action with the highest Q-value.
Typically, ϵ starts high and decreases over time to encourage more exploitation as the agent learns.
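Here's a minimal sketch of ε-greedy selection with a decaying ε (the schedule values are arbitrary examples):

import random

def epsilon_greedy(q_values, epsilon):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                      # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])     # exploit

# Decay epsilon from 1.0 toward 0.05 over training
epsilon, eps_min, eps_decay = 1.0, 0.05, 0.995
for episode in range(1000):
    # ... run an episode, selecting actions with epsilon_greedy(q_values, epsilon) ...
    epsilon = max(eps_min, epsilon * eps_decay)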
Boltzmann (softmax) exploration
Instead of choosing randomly or always picking the best action, the agent samples actions based on their value. Actions with higher Q-values are more likely, but not guaranteed. The formula for the probability of choosing action a in state s is:

$$P(a \mid s) = \frac{\exp\big(Q(s, a)/\tau\big)}{\sum_{a'} \exp\big(Q(s, a')/\tau\big)}$$

where the temperature τ controls how sharply the agent favors higher-valued actions.
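A small sketch of Boltzmann sampling over a set of Q-values (the temperature value is an arbitrary choice):

import numpy as np

def boltzmann_action(q_values, temperature=1.0):
    """Sample an action with probability proportional to exp(Q / temperature)."""
    q = np.asarray(q_values, dtype=np.float64)
    prefs = (q - q.max()) / temperature          # subtract the max for numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return np.random.choice(len(q), p=probs)

print(boltzmann_action([1.0, 2.0, 0.5], temperature=0.5))   # usually, but not always, picks action 1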
Entropy regularization (in policy gradient methods)
For continuous action spaces and policy-gradient approaches (like PPO or actor-critic), exploration is often promoted by adding an “entropy bonus” to the loss. This rewards the policy for maintaining some randomness in action selection, preventing premature convergence to a deterministic policy.
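In an actor-critic loop like the A2C example above, this typically means subtracting a small entropy term from the actor loss (a sketch with toy values; the 0.01 coefficient is an arbitrary choice):

import torch

logits = torch.randn(1, 2, requires_grad=True)        # actor output for one state (toy values)
dist = torch.distributions.Categorical(logits=logits)
action = dist.sample()

advantage = torch.tensor(1.5)                         # would come from the critic
entropy_coef = 0.01

policy_loss = -dist.log_prob(action) * advantage      # standard policy-gradient term
entropy_bonus = dist.entropy().mean()                 # higher entropy = more randomness in the policy
actor_loss = policy_loss.mean() - entropy_coef * entropy_bonus   # subtracting the bonus rewards exploration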
The choice and tuning of exploration strategy greatly influence the speed and quality of learning. An effective RL agent must continually manage the exploration-exploitation trade-off to maximize cumulative reward, especially in complex or changing environments.
RLHF and reinforcement learning in language models
Reinforcement Learning from Human Feedback (RLHF) is a powerful approach for aligning large language models (LLMs) with human values and expectations. Traditional language model training - predicting the next token in massive text corpora - yields models that are fluent and knowledge-rich, but not necessarily safe, helpful, or tuned to what users actually want. RLHF bridges this gap by introducing direct human input into the reinforcement learning process, allowing the model to optimize for nuanced and desirable behaviors.
The RLHF workflow typically begins with a publicly available pretrained language model. For a collection of prompts, the model generates multiple candidate responses for each prompt. Human annotators then compare these outputs - often ranking them or choosing preferred ones based on criteria like helpfulness, safety, or informativeness. This human feedback forms the foundation for training a reward model: a neural network that learns to predict the quality of a response, imitating human preferences.
Once trained, the reward model can automatically evaluate new output, assigning a quantitative reward to any generated response. The main language model is then fine-tuned with reinforcement learning, using algorithms such as Proximal Policy Optimization (PPO). Here’s how this process actually works:
1. Sampling: For each prompt, the language model samples a response by generating tokens one after another. It computes the probability (or log-probability) of this entire sampled response according to the model’s current parameters.
2. Reward Assignment: The reward model scores the response, producing a scalar reward that reflects predicted human preference.
3. Policy gradient loss: The training objective uses policy gradient methods. In essence, the model is encouraged to make responses with high rewards more likely in the future, and penalized for low-quality ones. Technically, this means computing a loss for the sampled response based on its log-probability weighted by its reward (a minimal sketch follows this list):

$$L(\theta) = -\,R(x, y)\, \log \pi_\theta(y \mid x)$$
Put simply, if the reward is high, the model updates its parameters to make that kind of response more likely.
4. Gradient computation and update: The algorithm computes the gradients of this loss with respect to the model’s parameters using backpropagation. It then updates the parameters in a way that reinforces responses with high rewards and discourages those with low rewards.
5. Stabilization (PPO): To avoid distorting the model’s overall language ability, PPO and similar methods add a regularization term, limiting how much the model changes in response to any one update. This preserves fluency and prevents the model from "gaming" the reward model or producing unnatural outputs.
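To ground step 3, here's a heavily simplified sketch of the policy-gradient update on a single sampled response. Real RLHF pipelines use PPO with per-token log-probabilities and a KL penalty against the original model; the tiny linear "language model" and reward value below are toy stand-ins:

import torch
import torch.nn as nn

vocab_size = 100
lm_head = nn.Linear(16, vocab_size)                   # toy stand-in for a language model head
hidden = torch.randn(5, 16)                           # hidden states for a 5-token sampled response
response_tokens = torch.randint(0, vocab_size, (5,))

log_probs = torch.log_softmax(lm_head(hidden), dim=-1)
response_log_prob = log_probs[torch.arange(5), response_tokens].sum()   # log pi(y | x)

reward = torch.tensor(0.8)                            # scalar score from the reward model
loss = -reward * response_log_prob                    # make high-reward responses more likely
loss.backward()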
By repeating this cycle - generating outputs, scoring them, and updating based on feedback - the model adapts to align more closely with human preferences. RLHF is the backbone of advanced conversational AI like ChatGPT, enabling such systems to not only understand language, but also deliver responses that are safe, useful, and genuinely aligned with user intent in real-world applications.
Conclusion
Reinforcement learning signals a shift in how intelligent systems are built - not as passive recall machines, but as explorers. RL agents aren’t programmed to recite right answers; instead, they actively search for them, probing the world with trial and error, and reshaping their behavior with every bit of feedback. This process of discovery - of learning not just from success but also from mistakes and surprises - echoes how humans navigate complexity and uncertainty.
As AI expands into real-world domains where rules aren’t always clear and outcomes unfold over time, reinforcement learning’s cycle of exploration and adaptation will become fundamental. It’s what enables future systems to make tough decisions, adapt to change, and get better not just by memorizing examples, but by living through the consequences. As we move toward more autonomous, interactive, and adaptable AI, reinforcement learning will likely be a driving force - pushing machines to learn, grow, and truly master the art of learning from both right and wrong.