Types of reinforcement learning algorithms
Explore how reinforcement learning helps AI learn from trial and error, with key algorithms, methods like RLHF, and real-world applications.
Created on August 26|Last edited on August 26
Reinforcement Learning (RL) represents a paradigm in artificial intelligence that draws inspiration from the way humans and animals naturally learn through trial, error, and incremental improvement. Instead of relying on labeled examples or static datasets, reinforcement learning agents interact with their environments much like a person learns to ride a bike or play a game, by performing actions, experiencing the consequences, and using feedback to gradually refine their understanding of what leads to success or failure. Through this process of exploration, adaptation, and reward-driven learning, RL systems are able to discover effective strategies even in unfamiliar or complex situations where explicit instructions are unavailable or incomplete.
What sets reinforcement learning apart from other types of machine learning is its interactive nature. An RL agent must continuously make decisions, observe how its choices shape future outcomes, and steadily build up knowledge that enables long-term goal achievement, mirroring the way humans adapt over time. As reinforcement learning increasingly powers breakthroughs in robotics, gaming, autonomous vehicles, and more, understanding the landscape of different reinforcement learning algorithms is essential for grasping the field’s potential.
In this article, we'll explore the foundational components of reinforcement learning and highlight how this approach differs from traditional supervised and unsupervised learning paradigms. We will then examine the two major families of RL algorithms, model-based and model-free, exploring how each learns, their respective benefits and limitations, and where they shine in real-world applications.
Next, we will introduce some of the most influential algorithms, such as Q-learning, Deep Q-Networks (DQN), and Proximal Policy Optimization (PPO), discussing their mechanisms and roles in advancing AI capabilities. Finally, we will touch on state-of-the-art developments, such as distributional reinforcement learning, and conclude by considering how these diverse approaches collectively drive progress, enabling AI to learn, adapt, and act with increasing sophistication in our complex and dynamic world.

Table of contents
- What is reinforcement learning?
- Agents, environments, and rewards
- Learning through interaction
- How reinforcement learning differs from other forms of machine learning
- Main types of reinforcement learning algorithms
- Model-based vs. model-free reinforcement learning methods
- Value-based, policy-based, and actor-critic reinforcement learning methods
- On-policy vs. off-policy reinforcement learning methods
- Popular reinforcement learning algorithms
- Deep Q-learning
- SARSA
- REINFORCE
- Proximal Policy Optimization (PPO)
- Advantage Actor-Critic (A2C, A3C)
- Deep Deterministic Policy Gradient (DDPG) and Soft Actor-Critic (SAC)
- Dyna-Q
- Monte Carlo Tree Search (MCTS), AlphaGo, and MuZero
- Advanced reinforcement learning methods
- Multi-agent reinforcement learning
- Reinforcement Learning from Human Feedback
- GRPO
- Distributional reinforcement learning
- Case studies and applications
- Reinforcement learning for PCB layout
- Reinforcement learning for recommendation systems
- Reinforcement learning for self-driving cars
- DeepSeek R1
- Improving efficiency with Distillation
- The results of DeepSeek-R1
- Conclusion
What is reinforcement learning?
Reinforcement learning is a distinctive approach to machine learning in which an agent learns to make decisions by interacting with an environment and receiving feedback in the form of rewards or penalties. Unlike supervised learning, which relies on examples of correct behavior, RL agents must discover for themselves which actions yield the most favorable outcomes over time.
This interactive learning process is much like how a child learns to ride a bicycle, not by being told every movement to make, but by attempting actions, experiencing wobbles and falls, and adjusting future attempts based on the resulting successes or stumbles.
Agents, environments, and rewards
The central characters in reinforcement learning are the agent and the environment. The agent is the learner and decision-maker; this could be anything from a robot to a trading system to a virtual player in a video game. The environment is the world in which the agent operates, presenting situations (states) and responding to the agent’s actions.
At each step, the agent:
- Observes the current state of the environment
- Selects and performs an action
- Receives feedback in the form of a numerical reward
- Finds itself in a new state as the environment transitions in response to that action

The agent’s ultimate objective is to learn a policy, a mapping from states to actions, that maximizes its cumulative reward over time.
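The observe, act, reward, transition cycle described above can be sketched in a few lines of Python. This is only an illustration: `CorridorEnv`, its reward values, and the two example policies are hypothetical names invented here, not part of any library.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: a corridor of cells 0..4 with the goal at cell 4."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action is -1 (left) or +1 (right); the walls clamp the position.
        self.state = min(max(self.state + action, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else -0.1  # assumed rewards: step cost, bonus at the goal
        return self.state, reward, done


def run_episode(env, policy, max_steps=100):
    """Run one episode and return the cumulative (undiscounted) reward."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                   # the agent selects an action
        state, reward, done = env.step(action)   # the environment responds
        total_reward += reward                   # feedback accumulates
        if done:
            break
    return total_reward


def always_right(state):
    """A fixed policy: a mapping from every state to the action +1."""
    return +1


def random_policy(state):
    """A policy that ignores the state and explores at random."""
    return random.choice([-1, +1])
```

Here a policy is literally a function from states to actions, and `run_episode` measures the quantity the agent is trying to maximize. In this toy corridor, `always_right` reaches the goal in four steps, while `random_policy` usually pays more step costs along the way.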
Learning through interaction
The defining feature of reinforcement learning is that learning happens continuously and iteratively. The agent must explore new actions and adapt its behavior in response to its experiences, balancing short-term rewards with the promise of greater long-term gains. Through numerous episodes of trial and error, the agent develops a behavioral repertoire that enables it to thrive in its environment, mirroring the adaptive, experiential learning that is fundamental to human and animal intelligence.
How reinforcement learning differs from other forms of machine learning
Unlike supervised learning, which depends on large datasets of labeled examples, or unsupervised learning, which seeks patterns in unlabeled data, reinforcement learning is driven by feedback from the environment itself. The agent is not told which action is “correct”; instead, it must infer the most effective strategy based on indirect signals (rewards or penalties) resulting from its own actions. This inherently goal-directed process makes reinforcement learning uniquely suited for problems where correct answers are not available in advance and success depends on a chain of interdependent decisions performed over time.
Having defined the essential ideas behind reinforcement learning, we’ll next examine the major families of RL algorithms. Understanding these categories is crucial, as they influence not only how the agent learns but also which kinds of problems each method is best suited to solve.
Main types of reinforcement learning algorithms
Reinforcement learning encompasses a variety of algorithmic approaches that can be categorized along several key dimensions. The main types are: model-based versus model-free methods, value-based versus policy-based versus actor-critic methods, and on-policy versus off-policy methods.
Each of these represents a different aspect of how reinforcement learning algorithms interact with the environment, represent decision-making strategies, and learn from experience.
Model-based vs. model-free reinforcement learning methods
One fundamental distinction is between model-based and model-free reinforcement learning.
Model-based reinforcement learning methods attempt to learn an explicit model of the environment’s dynamics: how states evolve in response to agent actions and what rewards are delivered. This model is then used to plan actions by predicting possible future scenarios. Classic planning techniques, such as value iteration and Monte Carlo Tree Search, fall within this category, as do recent approaches that combine neural networks with world models to simulate future trajectories. Model-based reinforcement learning tends to be more sample-efficient, as it leverages the learned model to plan without requiring as many real-world interactions, but it also depends heavily on the accuracy of the model.
In contrast, model-free reinforcement learning methods do not construct such a model. Instead, they learn optimal behaviors directly from experience with the environment, typically by updating value estimates or policies in response to observed rewards. Q-learning, SARSA, and Deep Q-Networks (DQN) are classic examples of model-free algorithms. While model-free methods may require more data to learn effective policies, they are often simpler to implement and can be more robust in complex environments where accurate modeling is difficult.
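To make the planning side of model-based methods concrete, here is a minimal value iteration sketch. It assumes the model is already known and deterministic, whereas a model-based agent would normally have to learn it; the three-state `MODEL` below is a hypothetical example.

```python
# Hypothetical known model: MODEL[state][action] = (next_state, reward).
# State 2 is terminal and has no outgoing actions.
MODEL = {
    0: {"stay": (0, 0.0), "go": (1, 0.0)},
    1: {"stay": (1, 0.0), "go": (2, 1.0)},
    2: {},
}


def value_iteration(model, gamma=0.9, tol=1e-9):
    """Plan with a known deterministic model by repeating Bellman backups until stable."""
    values = {s: 0.0 for s in model}
    while True:
        delta = 0.0
        for s, actions in model.items():
            if not actions:  # a terminal state keeps value 0
                continue
            best = max(r + gamma * values[s2] for s2, r in actions.values())
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            return values


def greedy_policy(model, values, gamma=0.9):
    """In each non-terminal state, pick the action with the best one-step lookahead."""
    return {
        s: max(actions, key=lambda a: actions[a][1] + gamma * values[actions[a][0]])
        for s, actions in model.items()
        if actions
    }
```

No interaction with the environment happens here: every backup queries the model instead. That is the source of the sample efficiency mentioned above, and also of the dependence on model accuracy, since a wrong entry in `MODEL` would silently produce a wrong plan.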
Value-based, policy-based, and actor-critic reinforcement learning methods
Reinforcement learning algorithms can also be grouped by how they represent and update the agent’s decision-making process.
Value-based reinforcement learning methods focus on estimating value functions, which summarize the expected reward of being in a given state or taking a particular action in a state. The agent then acts by selecting actions that maximize these values. Q-learning and DQN exemplify this approach; these algorithms are powerful in settings where it is practical to compute or approximate value functions for all relevant states and actions.
Policy-based reinforcement learning methods, instead, learn a parameterized policy that directly maps states to actions, without explicitly computing value functions. These are especially useful in high-dimensional or continuous action spaces and multi-agent settings. Policy gradient algorithms, such as REINFORCE and Trust Region Policy Optimization (TRPO), fall into this category. They update the policy parameters directly in order to maximize expected cumulative reward.
Bridging value-based and policy-based techniques are actor-critic methods, which maintain both a policy (the actor) and a value function estimate (the critic). The actor is responsible for selecting actions, while the critic evaluates those actions and provides feedback to improve the actor. This combination often leads to more stable and efficient learning, as seen in algorithms like Advantage Actor-Critic (A2C), Proximal Policy Optimization, and Deep Deterministic Policy Gradient (DDPG).
On-policy vs. off-policy reinforcement learning methods
A further differentiation among reinforcement learning algorithms is whether they are on-policy or off-policy.
On-policy reinforcement learning algorithms learn about the policy they are currently executing. That is, the data used for learning is collected using the same policy being improved, ensuring the learning process is closely tied to the agent’s current behavior. This makes on-policy methods conceptually straightforward and often leads to stable learning, though sometimes at the expense of data efficiency. Examples of on-policy algorithms include SARSA and many policy gradient methods like A2C and PPO.
Off-policy reinforcement learning algorithms, in contrast, learn the value of one policy (typically the optimal or target policy) while behaving according to another (the behavior policy). This setup enables the algorithm to reuse past experiences generated under older or different policies, often with the aid of a replay buffer, which can significantly enhance sample efficiency and facilitate the use of diverse datasets. Well-known off-policy algorithms are Q-learning, DQN, DDPG, and the Twin Delayed DDPG (TD3). Off-policy approaches are particularly valuable when learning from offline data or from demonstrations provided by other agents or humans.
Popular reinforcement learning algorithms
Reinforcement learning encompasses a diverse set of algorithms, each with its own style of learning from interaction and searching for optimal behavior. Some rely directly on experience, while others attempt to model the world for more deliberate planning. Below is an overview of prominent reinforcement learning algorithms and algorithm families, including their general approach and whether they are model-free or model-based.
Deep Q-learning
Deep Q-learning is an extension of the Q-learning algorithm that uses neural networks to approximate the action-value function (Q-function), enabling the agent to operate in complex, high-dimensional environments. In deep Q-learning, the agent interacts with its environment step by step: in each state, it selects an action, observes the reward, and transitions to a new state. Instead of maintaining a table of Q-values, the agent uses a neural network (the "Deep Q-Network," or DQN) that takes the current state as input and estimates the expected future reward (Q-value) for each possible action.
During training, the agent stores its experiences, which comprise the current state, the chosen action, the received reward, and the subsequent state, in a memory buffer. At regular intervals, the agent samples random batches of these experiences and uses them to update the neural network. The key update rule is based on minimizing the difference (the “temporal difference error”) between the network's current Q-value prediction and the updated estimate, which incorporates the observed reward and the network’s estimate of future rewards from the next state.
This process enables the network to gradually learn which actions yield higher long-term rewards, even in situations where the state space is too large to be explicitly enumerated. Deep Q-learning has enabled reinforcement learning agents to tackle tasks ranging from playing Atari games directly from pixels to decision-making problems in real-world applications.
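The two mechanisms described above, the memory buffer and the temporal difference error, can be sketched without any deep learning library. In this simplified illustration the network's outputs are passed in as plain lists of Q-values; the `done` flag and the default `gamma` are assumptions added here, and refinements used in practice are left out.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size memory of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)  # the oldest experiences are dropped first

    def add(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks up correlations between consecutive steps.
        return random.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


def td_target(reward, next_q_values, done, gamma=0.99):
    """Updated estimate for Q(s, a): r + gamma * max Q(s', a'), or just r at episode end."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


def td_error(q_value, reward, next_q_values, done, gamma=0.99):
    """Temporal difference error: the gap the network is trained to shrink."""
    return td_target(reward, next_q_values, done, gamma) - q_value
```

In a full agent, `next_q_values` would come from a forward pass of the network on `next_state`, and the squared `td_error` over a sampled batch would serve as the training loss.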
SARSA
SARSA stands for State-Action-Reward-State-Action, and it’s a core reinforcement learning algorithm. At first glance, SARSA resembles Q-learning because both aim to teach an agent the value of taking certain actions in specific situations, gradually improving its decisions through experience. The key difference is that SARSA is what’s called an on-policy method. This means it updates its values based on the actions the agent actually takes, following its current policy, including any randomness or exploration (such as when the agent tries something new instead of what seems best).
So, instead of learning based only on what would happen if it always picked the top choice (like Q-learning does), SARSA learns from the real sequences of states, actions, and rewards that occur as the agent interacts with the environment, exploration and all. Because of this, SARSA can be more aware of the consequences of exploratory or risky actions. It tends to be safer or more robust in situations where making mistakes during exploration could lead to bad outcomes, since it essentially “practices what it preaches”: it learns from what it actually does rather than from ideal actions it hasn’t necessarily tried.
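The difference between the two algorithms comes down to a single term in the update rule. The tabular sketch below shows both side by side; the function names, the dictionary-based Q-table, and the default step size and discount are illustrative choices, not a reference implementation.

```python
from collections import defaultdict


def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Off-policy: bootstrap from the best next action, whatever the agent does next."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])


def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy: bootstrap from the next action the agent actually chose."""
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])


# Example Q-table: unseen state-action pairs default to a value of 0.
Q_table = defaultdict(float)
```

Suppose the next state offers a safe action worth 1.0 and a risky one worth -5.0, and exploration makes the agent pick the risky one. `q_learning_update` still backs up the value 1.0, while `sarsa_update` backs up -5.0, which is how SARSA comes to account for the cost of its own exploration.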
REINFORCE
REINFORCE is a model-free policy-based reinforcement learning algorithm that directly learns the probability distribution over actions, known as the policy, by sampling entire episodes and nudging its behavior in the direction of higher observed rewards. It doesn’t estimate long-term value functions, but instead uses the reward signal to adjust the likelihood of chosen actions. This makes it especially effective in problems where the action space is continuous or optimal strategies are inherently stochastic.
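A minimal sketch of the REINFORCE update follows, under a strong simplifying assumption: the policy is a softmax over a handful of logits and does not depend on the state. The function names and the learning rate are illustrative.

```python
import math


def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * r_{t+1} + ... for every step of one episode."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]


def softmax(logits):
    """Turn logits into action probabilities."""
    peak = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, action, G, lr=0.1):
    """One gradient-ascent step on G * log pi(action) for a softmax policy.

    The gradient of log pi(action) with respect to logit k is
    (1 if k == action else 0) - pi(k).
    """
    probs = softmax(logits)
    return [
        logit + lr * G * ((1.0 if k == action else 0.0) - probs[k])
        for k, logit in enumerate(logits)
    ]
```

An action followed by a high return `G` has its probability pushed up in proportion to that return, and a negative return pushes it down. No value function appears anywhere, which matches the description above.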
Proximal Policy Optimization (PPO)
Proximal Policy Optimization is a widely used model-free policy gradient reinforcement learning algorithm, popular for its reliability and stability. PPO improves upon earlier policy-based methods by restricting each policy update so that the agent doesn’t leap too far from its previous behavior, encouraging steadier and safer progress. Like REINFORCE, PPO aims to directly improve the policy, but does so in a more controlled and sample-efficient way, making it a go-to algorithm for both simulated and real-world continuous control tasks.
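The restriction on each update is most commonly implemented with PPO's clipped surrogate objective, sketched below for a single sample. The clipping range of 0.2 is a commonly used default and is an assumption here rather than something fixed by the algorithm.

```python
def ppo_clipped_objective(ratio, advantage, clip_eps=0.2):
    """Per-sample clipped surrogate objective (to be maximized).

    ratio     = pi_new(a|s) / pi_old(a|s)
    advantage = estimate of how much better the action was than expected
    """
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Taking the minimum removes any incentive to move the ratio outside the clip range.
    return min(ratio * advantage, clipped_ratio * advantage)
```

For a good action (positive advantage), raising its probability beyond 1.2 times the old value earns no extra objective, so `ppo_clipped_objective(1.5, 1.0)` evaluates to 1.2 rather than 1.5. For a bad action, the objective stops improving once the probability has dropped to 0.8 times the old value.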
Advantage Actor-Critic (A2C, A3C)
Advantage Actor-Critic (A2C) and its asynchronous variant, A3C, blend value-based and policy-based reasoning; both are model-free actor-critic reinforcement learning methods. These algorithms feature two components: the actor, which selects actions, and the critic, which estimates the value of those actions. By letting the critic “advise” the actor, agents can update their policies with less variance and greater stability, which becomes especially valuable in high-dimensional or continuous environments like robotics.
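The critic's "advice" usually takes the form of an advantage estimate. The sketch below uses the simplest one-step version with plain floats; real implementations often use multi-step estimates and an autodiff framework, so treat the function names and loss forms as illustrative assumptions.

```python
def one_step_advantage(reward, value_s, value_next, done, gamma=0.99):
    """A(s, a) ~ r + gamma * V(s') - V(s): how much better the step went than the critic expected."""
    bootstrap = 0.0 if done else gamma * value_next
    return reward + bootstrap - value_s


def actor_critic_losses(log_prob, advantage):
    """Per-step losses: the actor follows the advantage, the critic shrinks it.

    In an autodiff framework the advantage would be held constant inside the
    actor loss; here everything is a plain float for illustration.
    """
    actor_loss = -log_prob * advantage
    critic_loss = advantage ** 2
    return actor_loss, critic_loss
```

Compared with REINFORCE, which weights each action by the full episode return, weighting by the advantage subtracts what the critic already expected from the state. That subtraction is where the lower variance comes from.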
Deep Deterministic Policy Gradient (DDPG) and Soft Actor-Critic (SAC)
DDPG and SAC are model-free actor-critic reinforcement learning algorithms tailored for environments with continuous action spaces. DDPG uses deterministic policies, while SAC employs stochastic policies and introduces entropy maximization for better exploration. Both methods leverage neural networks to approximate policies and value functions, enabling them to tackle sophisticated control problems such as robotic manipulation or autonomous driving.
Dyna-Q
Dyna-Q is an early and influential example of a model-based reinforcement learning algorithm. Here, the agent not only learns from real-world experience but also constructs a simple model of the environment as it interacts with it. It then uses this internal model to simulate “imagined” transitions, planning future steps and accelerating learning. Dyna-Q effectively demonstrates the synergy between model-based planning and model-free updates, and lays the groundwork for more advanced model-based RL.
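One iteration of this learn, record, and imagine cycle might look like the following sketch. It assumes a deterministic environment, so the model can simply remember the last observed outcome for each state-action pair; the function name and default hyperparameters are illustrative.

```python
import random
from collections import defaultdict


def dyna_q_step(Q, model, s, a, r, s_next, actions,
                planning_steps=5, alpha=0.1, gamma=0.95, rng=random):
    """One Dyna-Q iteration: learn from a real transition, then replay imagined ones."""

    def q_update(state, action, reward, next_state):
        best_next = max(Q[(next_state, a2)] for a2 in actions)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

    q_update(s, a, r, s_next)        # (1) direct, model-free update from real experience
    model[(s, a)] = (r, s_next)      # (2) record what the environment did

    seen = list(model)
    for _ in range(planning_steps):  # (3) planning: replay transitions from the model
        ps, pa = rng.choice(seen)
        pr, ps_next = model[(ps, pa)]
        q_update(ps, pa, pr, ps_next)


# Example containers: a Q-table defaulting to 0 and an initially empty model.
Q_values = defaultdict(float)
learned_model = {}
```

Each real step is followed by `planning_steps` extra updates that cost no further interaction with the environment, which is why Dyna-Q tends to learn from fewer real transitions than plain Q-learning.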
Monte Carlo Tree Search (MCTS), AlphaGo, and MuZero
Monte Carlo Tree Search is a model-based planning algorithm used in reinforcement learning, most widely in perfect-information games such as Go, chess, and shogi. It incrementally builds a search tree of possible moves and evaluates them with simulated rollouts, which in systems such as AlphaGo are guided by value and policy networks. MCTS was a cornerstone of DeepMind’s AlphaGo, which blended model-based planning with deep learning for value and policy estimation, and achieved historic results in Go.
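A key step in MCTS is deciding which branch of the tree to explore next. The sketch below uses the classic UCT rule, which is an assumption for illustration: AlphaGo uses a related variant that also weights moves by a policy network's prior, and the exploration constant `c` is a tunable choice.

```python
import math


def uct_score(child_value_sum, child_visits, parent_visits, c=1.4):
    """UCT score: average rollout value plus an exploration bonus for rarely tried moves."""
    if child_visits == 0:
        return math.inf  # always try unvisited moves first
    exploit = child_value_sum / child_visits
    explore = c * math.sqrt(math.log(parent_visits) / child_visits)
    return exploit + explore


def select_child(children, parent_visits, c=1.4):
    """children maps move -> (value_sum, visits); return the move with the best UCT score."""
    return max(children, key=lambda m: uct_score(*children[m], parent_visits, c))
```

The bonus term shrinks as a move accumulates visits, so the search gradually concentrates on moves whose rollouts keep paying off while still occasionally revisiting neglected ones.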
MuZero extends this approach by learning its internal model directly from experience, not only learning a policy and value function, but also discovering a compact, predictive representation of environment dynamics. While MuZero is fundamentally model-based, it can operate even in environments where the true rules are unknown, achieving world-class performance in strategic board games and beyond.
Taken together, these algorithms represent the major approaches to reinforcement learning, each with its own mechanisms, advantages, and trade-offs. Value-based approaches like Q-learning and DQN are simple, robust, and model-free, excelling when the agent can afford lots of direct experience. Policy-based and actor-critic methods (like REINFORCE, PPO, and A2C/A3C) also rely on model-free updates, but often generalize better to continuous or complex decision spaces. In contrast, model-based algorithms such as Dyna-Q, MCTS, and MuZero seek to accelerate learning and enable planning by modeling the world or simulating future outcomes internally. Real-world reinforcement learning increasingly integrates these perspectives, leveraging both experience-driven learning and imaginative planning to build more effective and data-efficient agents.
Advanced reinforcement learning methods
Reinforcement learning continues to break new boundaries as researchers seek to develop agents that not only excel in games or single-agent environments but can also tackle the complexities of human preferences, social interaction, and problem-solving at scale. In recent years, three particularly impactful frontiers have emerged: multi-agent reinforcement learning (MARL), inverse reinforcement learning, and the application of reinforcement learning to large language models through fine-tuning with preference-based or verifiable signals.
Multi-agent reinforcement learning
Multi-agent reinforcement learning extends the standard RL paradigm into environments where multiple agents, each with their own goals or policies, interact simultaneously. Unlike the single-agent setting, here the environment becomes dynamic and constantly shifts as each agent adapts in response to the others.
This introduces unique challenges. Strategies that might work for an isolated agent can break down when they must compete, cooperate, or negotiate with others whose behavior is also changing. Research in multi-agent reinforcement learning has led to systems capable of highly coordinated teamwork, emergent division of labor, and sophisticated negotiation, in applications ranging from autonomous drone swarms and robotic warehouse collaboration to multi-player video games and simulated economics.
Crucially, these agents must learn both from the underlying environment and from each other, discovering not only how to maximize their own reward but also how to anticipate, influence, or align with the policies of their peers.
Reinforcement Learning from Human Feedback
Perhaps the most transformative recent use of reinforcement learning has been in aligning and fine-tuning large language models. Reinforcement Learning from Human Feedback (RLHF) is now central to ensuring that these models, such as GPT-4o or Claude, behave helpfully, safely, and in accordance with human intentions.
Rather than relying solely on hand-crafted metrics or supervision, RLHF collects human judgments (people assess or rank multiple model responses) and then trains a reward model to predict human preferences. The language model is then optimized using reinforcement learning, most often via algorithms like Proximal Policy Optimization, to maximize the predicted “helpfulness” or acceptability of its outputs. This process has unlocked remarkable improvements in language model capability and alignment, but also brings technical challenges. For instance, PPO typically requires a separate “critic” model to estimate future rewards, which is computationally intensive given the scale of today’s language models.
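A common way to train the reward model from ranked pairs is a pairwise (Bradley-Terry style) loss, sketched below. This is one widely used choice and an assumption here, not necessarily what any particular lab uses; the function name is illustrative.

```python
import math


def pairwise_preference_loss(reward_chosen, reward_rejected):
    """-log sigmoid(r_chosen - r_rejected), written in a numerically stable form."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); the two branches avoid overflow in exp().
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss is small when the reward model scores the human-preferred response well above the rejected one and grows when the ordering is reversed. Minimizing it over many labeled pairs is what teaches the reward model to predict human preferences.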

GRPO
Group Relative Policy Optimization (GRPO), introduced by DeepSeek, offers an efficient alternative for preference-based reinforcement learning fine-tuning of LLMs. GRPO dispenses with the need for a large value network. Instead, for each given prompt, multiple candidate responses are sampled from the model. Each response is evaluated for quality, potentially by a reward model trained on human preferences or by automated checks for correctness.
Rather than estimating an absolute value for each response, GRPO compares the reward for each sample directly to the average reward within that group. The model is then reinforced to increase the probability of responses outperforming their peers within the same prompt, and decrease those lagging behind. By using the group mean as a natural baseline, GRPO aligns with the comparative nature of preference data and significantly reduces memory and compute requirements.
To further stabilize learning, GRPO employs regularization that discourages drastic departures from the original model’s distribution, which serves a similar purpose to KL-divergence penalties in traditional reinforcement learning algorithms. The result is a streamlined yet powerful approach for aligning models with nuanced human or verifiable signals, particularly effective for domains where relative output quality is paramount, such as mathematical reasoning or code generation.
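The group-relative scoring at the heart of GRPO is short enough to sketch directly. The mean subtraction follows the description above, and dividing by the group's standard deviation follows DeepSeek's published formulation; the use of the population standard deviation and the small `eps` guard are assumptions made for this illustration.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled response relative to the other responses to the same prompt.

    rewards: one scalar reward per response sampled for a single prompt.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumed: population standard deviation
    return [(r - mean) / (std + eps) for r in rewards]
```

With rewards of `[1, 0, 0, 1]` for four sampled answers, the two correct answers receive an advantage of about +1 and the two incorrect ones about -1. If every answer in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal.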

A closely related approach, known as Reinforcement Learning with Verifiable Rewards (RLVR), applies RL techniques where correctness can be determined automatically, for example when a model must produce correct code or mathematical answers verified by automated checkers. In RLVR, the reward comes not from a learned or subjective preference model, but directly from programmatic evaluation, offering a robust training signal for tasks with clear ground truth.
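As a toy illustration of a verifiable reward, the function below checks a math answer programmatically. Treating the last number in the response as the model's final answer is a simplifying assumption, and `math_answer_reward` is a hypothetical name; production checkers are considerably more careful about parsing.

```python
import re


def math_answer_reward(response, expected_answer):
    """Return 1.0 if the last number in the response equals the expected answer, else 0.0."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    if not numbers:
        return 0.0  # no numeric answer found at all
    return 1.0 if float(numbers[-1]) == float(expected_answer) else 0.0
```

Because the reward is computed by a rule rather than predicted by a learned model, it cannot drift toward rewarding answers that merely sound convincing, although it can only be used where such a check exists.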
Distributional reinforcement learning
Traditional reinforcement learning works by estimating a single number, the expected total reward (or "value") an agent can get from each state and action, averaged over all possible future outcomes. This average tells us what is likely to happen, on balance, but it throws away information about how much uncertainty, risk, or variability there might be. Sometimes, the possible outcomes can be very different: one path might give a huge win, another a big loss, even if their average is the same.
Distributional reinforcement learning takes a different approach. Instead of only tracking the average return, it tries to learn the entire distribution of possible returns. For each action and state, the agent predicts not just "on average, what happens?" but "what are all the different outcomes I might get, and how likely are they?" In other words, it learns about the range of returns, not just their mean.
Why is this useful? Because the spread and shape of possible futures matter in many situations. Knowing if an action is risky, or if rare, big rewards (or penalties) are possible, helps the agent make better decisions, especially in uncertain, noisy, or safety-critical environments. It can also make learning more stable and efficient, and help agents recognize unlikely but important events.
In practice, distributional reinforcement learning works by representing the possible future returns for each state and action as a probability distribution. Rather than updating a single value estimate, the agent updates its entire prediction about how likely different rewards are, based on experience. This is often done by modeling the return distribution as a collection of representative points ("atoms") or quantiles, and updating them using a version of the Bellman equation adapted to distributions.
For example, the C51 algorithm (a popular distributional method) represents the return distribution as probabilities assigned to 51 fixed, evenly spaced return values, or "bins." When the agent observes a transition, it shifts and rescales the probability mass over these bins to reflect the new evidence, rather than just moving a single number up or down. Over time, the agent learns which outcomes are common, which are rare, and how much variability to expect for every action in each state.
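The shifting step for a categorical distribution of this kind can be sketched in plain Python. The support range of -10 to 10 and the discount are assumptions chosen for illustration, and the training loss that compares this projected target with the network's prediction is omitted.

```python
import math

V_MIN, V_MAX, N_ATOMS = -10.0, 10.0, 51  # assumed support range; 51 bins as in C51
DELTA_Z = (V_MAX - V_MIN) / (N_ATOMS - 1)
SUPPORT = [V_MIN + i * DELTA_Z for i in range(N_ATOMS)]  # the evenly spaced return values


def expected_value(probs):
    """Collapse a return distribution to the single number ordinary Q-learning tracks."""
    return sum(p * z for p, z in zip(probs, SUPPORT))


def project_target(next_probs, reward, gamma=0.99):
    """Apply z -> reward + gamma * z to every bin, then spread each bin's
    probability over the two nearest bins of the fixed support."""
    target = [0.0] * N_ATOMS
    for p, z in zip(next_probs, SUPPORT):
        tz = min(max(reward + gamma * z, V_MIN), V_MAX)  # keep inside the support
        b = (tz - V_MIN) / DELTA_Z                       # fractional bin index
        lower, upper = math.floor(b), math.ceil(b)
        if lower == upper:                               # landed exactly on a bin
            target[lower] += p
        else:
            target[lower] += p * (upper - b)
            target[upper] += p * (b - lower)
    return target
```

The projected target still sums to one, and as long as no mass is clipped at the edges of the support, its mean equals `reward + gamma` times the mean of the next-state distribution. In other words, the ordinary Bellman update is recovered as a special case, with the shape of the distribution carried along as extra information.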
Case studies and applications
The influence of reinforcement learning now extends far beyond academic demonstrations and games, powering breakthroughs that are transforming real-world industries. RL’s appeal lies in its ability to autonomously learn complex, sequential behaviors, enabling automation and optimization in domains once too dynamic or messy for inflexible rule-based systems.
Reinforcement learning for PCB layout
A standout case comes from electronics hardware, where the product cycle has historically been bottlenecked by manual, slow printed circuit board (PCB) layout. Quilter is a startup pioneering a new approach: it employs a fully autonomous, physics-driven reinforcement learning engine to handle both component placement and trace routing. Unlike tools reliant on templates or human-derived heuristics, Quilter’s agent explores vast design spaces, optimizing for manufacturability, electrical integrity, and design constraints. Each potential board layout is instantly evaluated by real-world simulations and rule checks, and the RL agent uses feedback to propose dozens of increasingly effective layouts, giving engineering teams rapid, independent results that accelerate iteration and innovation in hardware development.
Reinforcement learning for recommendation systems
A highly influential example of reinforcement learning in real-world recommendation is YouTube’s large-scale production deployment of a top-K policy-gradient-based system (Chen et al., 2019). Faced with recommending from an immense catalog to billions of global users, YouTube reformulated the problem as a sequential decision process: the recommender learns directly from user behavior logs (like clicks and watch time) to optimize for long-term satisfaction and engagement, rather than just immediate reward.
One of the fundamental challenges in this setting is that new algorithms must learn from historical feedback generated by previous versions of the recommender, resulting in unavoidable data biases. The YouTube system addressed this by developing methods for “off-policy correction,” adjusting the learning process to account for policy changes over time and better align recommendations with genuine user preferences. The model also innovated by focusing on recommending a set of items per interaction (as users are shown multiple videos at a time), rather than optimizing for a single recommendation.
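The core idea behind off-policy correction can be illustrated with an importance weight: each logged action is reweighted by how much more (or less) likely the new policy is to take it than the old recommender was. The sketch below is a generic illustration, not YouTube's production method; the function names and the cap value are assumptions.

```python
def capped_importance_weight(target_prob, behavior_prob, cap=10.0):
    """Weight a logged action by pi(a|s) / beta(a|s), capped to limit variance.

    target_prob   = probability under the policy being trained (pi)
    behavior_prob = probability under the policy that logged the data (beta)
    """
    if behavior_prob <= 0.0:
        raise ValueError("a logged action must have had non-zero probability")
    return min(target_prob / behavior_prob, cap)


def corrected_gradient_scale(target_prob, behavior_prob, reward, cap=10.0):
    """Scalar multiplying grad log pi(a|s) in an off-policy-corrected policy gradient update."""
    return capped_importance_weight(target_prob, behavior_prob, cap) * reward
```

An item the old recommender rarely showed but the new policy favors gets a weight above one, which compensates for its under-representation in the logs. The cap prevents a handful of very rare logged actions from dominating the update.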
Through a combination of reinforcement learning, off-policy correction, and tailored techniques for large-scale settings, this approach enabled more robust adaptation to evolving user interests and richer, more diverse recommendations. Live experiments demonstrated improvements in overall viewing time and user engagement. While many implementation details have likely evolved since the original study, this line of work underscores how reinforcement learning principles are being adapted and scaled for the dynamic, production-level environments of major online platforms.
Reinforcement learning for self-driving cars
Perhaps the most dramatic example of reinforcement learning’s real-world impact comes from the 2025 study “Robust Autonomy Emerges from Self-Play” by a group of engineers affiliated with Apple. These researchers introduced GIGAFLOW, a large-scale simulator built to train self-driving agents entirely through self-play RL, without relying on data from human drivers.
In GIGAFLOW, AI agents control a diverse range of virtual traffic participants, including cars, trucks, cyclists, and pedestrians, each with unique goals and behaviors. The agents navigate diverse, dynamic environments, repeatedly encountering complex scenarios that mirror the unpredictability of real traffic: congested intersections, merges, unprotected turns, last-minute obstacles, and more. Training unfolds at an immense scale, accruing over 1.6 billion simulated kilometers, allowing the reinforcement learning agents to develop robust driving policies through continual trial, error, and feedback.

Key innovations of GIGAFLOW include its use of structured state representations, permutation-invariant network layers, and a reinforcement learning algorithm (Proximal Policy Optimization, PPO) with decoupled actor and critic parameters for stability. The system also implements “advantage filtering,” focusing learning on the most challenging or impactful experiences.
Notably, the resulting generalist policy outperformed specialist models on leading autonomous driving benchmarks (CARLA, nuPlan, and the Waymo Open Motion Dataset), despite never using a single kilometer of real-world driving data. The policies not only learned human-like, safe driving but also demonstrated adaptability across vehicle types, and even drove with variable “personalities,” from cautious to assertive, simply by adjusting reward functions.
DeepSeek R1
A great demonstration of the power of reinforcement learning is the DeepSeek R1 model. DeepSeek-R1 is a language model developed specifically to excel at advanced reasoning tasks, including mathematics, logic, and programming. Its training process is notably different from that of most general-purpose models like GPT-4o, which are primarily trained on massive datasets to predict the next token and then further aligned with human preferences via RLHF.
DeepSeek-R1 was trained using a multi-stage process to enhance its reasoning abilities and align its responses with human preferences for helpfulness and harmlessness. The training began with "cold-start" supervised fine-tuning, where thousands of high-quality, human-generated examples featuring detailed chains of thought (CoT) were used to provide the base language model with a strong foundation in readable and well-structured reasoning.
After this initial fine-tuning, the model underwent reasoning-oriented reinforcement learning. In this stage, the model was encouraged to generate correct and clear answers, particularly in tasks involving math, coding, and logical reasoning. The rewards during RL were determined by both the accuracy of the answers and the consistency of the language, guiding the model to produce correct and well-structured reasoning.
To further enhance the quality of the training data, the researchers employed rejection sampling. In this process, the model generated multiple responses for each prompt. Only those responses that satisfied specific criteria, such as accuracy, clarity, and correct formatting, were selected, while the rest were discarded. These high-quality responses formed a new supervised training dataset for further fine-tuning.
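The rejection sampling loop described above can be sketched in a few lines. In this toy version, `generate` stands in for the language model and `accept` stands in for the quality checks; both are hypothetical placeholders, and the "model" is just a deterministic function that sometimes gets simple additions wrong.

```python
import itertools

def rejection_sample(prompts, generate, accept, num_samples=4):
    """Sample several responses per prompt and keep only the accepted ones."""
    dataset = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(num_samples)]
        for candidate in candidates:
            if accept(prompt, candidate):
                dataset.append({"prompt": prompt, "response": candidate})
    return dataset

def make_toy_model():
    """Hypothetical stand-in for a model: adds two numbers, sometimes off by one."""
    noise = itertools.cycle([0, 1, 0, -1])
    def generate(prompt):
        a, b = map(int, prompt.split("+"))
        return str(a + b + next(noise))
    return generate

def is_correct(prompt, response):
    """Acceptance check: the response must equal the true sum."""
    a, b = map(int, prompt.split("+"))
    return int(response) == a + b

dataset = rejection_sample(["1+2", "10+5"], make_toy_model(), is_correct)
for record in dataset:
    print(record)
```

Only the correct responses survive, and the surviving prompt-response pairs are what would be used as new supervised fine-tuning data.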
Additionally, data covering broader domains, like writing and factual question answering, were collected and incorporated into the training, which helped expand the model’s capabilities.
The final reinforcement learning stage focused on aligning the model with human preferences, namely helpfulness and harmlessness, while further refining its reasoning skills. In this stage, reasoning responses were scored with rule-based reward signals for accuracy, formatting, and language consistency, while general responses were scored with reward models that assess helpfulness and safety. This ensured that the model not only developed strong reasoning performance but also produced responses that were appropriate, safe, and useful in real-world interactions.
Improving efficiency with distillation
The final step in the DeepSeek-R1 process is distillation: the outputs of the best RL-tuned DeepSeek-R1 model are used as supervision to train much smaller dense models based on the Qwen architecture (1.5B, 7B, 14B, and 32B) and the Llama architecture (8B and 70B). These distilled models are not trained with reinforcement learning themselves; they are simply fine-tuned on the high-quality reasoning trajectories generated by DeepSeek-R1.
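Distillation of this kind is ordinary supervised fine-tuning where the targets come from the teacher. The sketch below shows one common way to set it up, under assumptions of our own: the prompt and the teacher's trajectory are concatenated, and the loss is averaged only over the teacher-generated tokens. The token IDs and student probabilities are made-up toy values, and the helper names are hypothetical.

```python
import math

def build_sft_example(prompt_tokens, teacher_tokens):
    """Concatenate prompt and teacher output; mask the prompt out of the loss."""
    input_ids = prompt_tokens + teacher_tokens
    loss_mask = [0] * len(prompt_tokens) + [1] * len(teacher_tokens)
    return input_ids, loss_mask

def masked_nll(token_probs, loss_mask):
    """Average negative log-likelihood over positions where the mask is 1.

    token_probs[i] is the probability the student assigns to the i-th target token.
    """
    terms = [-math.log(p) for p, m in zip(token_probs, loss_mask) if m]
    return sum(terms) / len(terms) if terms else 0.0

prompt_tokens = [1, 2, 3]   # toy token IDs for the question
teacher_tokens = [7, 8]     # toy token IDs for the teacher's reasoning trajectory
input_ids, loss_mask = build_sft_example(prompt_tokens, teacher_tokens)

student_probs = [0.9, 0.9, 0.9, 0.5, 0.25]  # made-up student probabilities
loss = masked_nll(student_probs, loss_mask)
print(input_ids, loss_mask, loss)
```

Minimizing this loss pushes the student to reproduce the teacher's reasoning token by token, with no reward signal or RL loop involved.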
The results are remarkable: small, efficient open-source models match and even surpass the reasoning performance of much larger instruction-tuned models and, in many cases, close the gap to models like OpenAI's o1-mini. For example, the DeepSeek-R1-distilled Qwen-7B surpasses the 32B QwQ-Preview model on math benchmarks such as AIME 2024, while the 32B and 70B distilled versions set new records on reasoning benchmarks among dense open-source models.
The results of DeepSeek-R1
When DeepSeek-R1 and its distilled versions are compared with mainstream non-reasoning models such as GPT-4o, the effect of the training methodology becomes clear. Models like GPT-4o remain strong generalists and are chattier and more creative in open conversation, but they are not incentivized during training to explicitly generate long, logical chains of thought or to optimize for formal correctness in math, science, or competitive programming. As a result, on the very benchmarks that stress step-by-step reasoning, such as AIME and MATH-500, DeepSeek-R1 outperforms GPT-4o by a wide margin: in DeepSeek's reported comparison, GPT-4o scores in the single digits on AIME 2024 and in the mid-70s on MATH-500, while DeepSeek-R1 achieves scores above 79% on AIME and over 97% on MATH-500, matching or slightly exceeding OpenAI's closed "o1" line, which is itself specialized for reasoning.

The smaller distilled R1 models inherit much of this reasoning ability, and do so much more efficiently than any method that tries to RL-train small models directly. In fact, experiments show that attempts to teach a small model to reason via reinforcement learning from scratch are computationally costly and yield less capable models than simple distillation from a powerful RL-trained teacher. This distillation thus makes first-rate reasoning accessible at all commonly deployed model sizes, allowing smaller open LLMs to excel where previous models never could.

Conclusion
Reinforcement learning stands out as a transformative approach within artificial intelligence, offering a path for machines to achieve complex goals through experience, adaptation, and continuous feedback. As we have seen, RL enables agents not only to react but also to plan and improve over time, tackling challenges ranging from engineering design to recommendation and autonomous driving, domains where explicit instructions or static data often fall short.
The diversity of reinforcement learning algorithms, ranging from foundational value-based and policy-based techniques to advanced approaches that leverage human feedback or distributional learning, continues to expand what is possible. Each new breakthrough adds to RL’s growing ability to handle uncertainty, optimize long-term outcomes, and generalize to real-world settings.
Nonetheless, reinforcement learning is still evolving, with open questions around efficiency, safety, and real-world robustness driving continued research. As the technology matures, reinforcement learning is poised to become a cornerstone for the next generation of intelligent, adaptive systems, redefining how machines learn and act in dynamic, unpredictable environments.