Reinforcement learning for reasoning: Enhancing AI capabilities
Explore how reinforcement learning with verifiable rewards (RLVR) and GRPO shape LLM reasoning gains, where true improvements lie, plus a practical GRPO training guide.
Recent advances in large language models (LLMs) have sparked hopes that reinforcement learning, especially when paired with verifiable rewards (RLVR), can unlock truly robust reasoning, problem-solving, and mathematical abilities in AI systems. Methods like Group Relative Policy Optimization (GRPO) have become core tools in this new regime, enabling models to directly optimize for outputs that are automatically and objectively correct.
But as RLVR gains traction in both industry and open-source efforts, a growing body of research is starting to question how much actual reasoning improvement RL, as currently practiced, can produce and how much is simply a reflection of abilities that base models already possess.
For those interested in getting their hands dirty right away, you can:
Jump to the tutorial, implementing GRPO with Huggingface.
If you want a deeper understanding of RL for reasoning models, this article will first demystify how reinforcement learning works in the context of language models and outline the distinctions between traditional RL, Reinforcement Learning with Human Feedback (RLHF), and the increasingly popular RLVR approach. We will then provide a practical look at policy gradient methods and modern variants such as GRPO, explaining their motivations, mechanics, and claims to efficiency.
With this foundation in place, we will review state-of-the-art research showing dramatic performance gains in reasoning tasks via RLVR. However, we'll also take a sober look at new critical research that urges caution: recent controlled studies reveal that some of RLVR’s headline successes may not constitute genuine advances in model reasoning or generalization.
Finally, we will walk through the practical aspects of training a language model with GRPO, laying out what goes into an RLVR training system and highlighting best practices. By the end, you’ll understand not just how RLVR and GRPO work under the hood, but also where their real benefits and limitations lie as tools for building the next generation of reasoning AIs.

Table of contents
- Reinforcement learning foundations
- Policy gradient learning: A core pillar of modern reinforcement learning
- Policy gradient loss with advantage
- Proximal Policy Optimization (PPO) and beyond
- Why GRPO? What problem does it solve?
- How does GRPO work?
- A case study on RLVR and GRPO: DeepSeek R1
- Improving RL efficiency with distillation
- The results of DeepSeek-R1
- Cautionary findings on RL-fine-tuned chain-of-thought models
- Study 1: The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
- Study 2: Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
- Study 3: A Minimalist Approach to LLM Reasoning: from Rejection Sampling to Reinforce
- Study 4: Reinforcement Learning for Reasoning in Large Language Models with One Training Example
- Concluding Cautions and Future Directions
- Tutorial: Implementing GRPO training with Huggingface
- Running inference with our models
- Conclusion
Reinforcement learning foundations
At its core, reinforcement learning (RL) is about teaching an agent to act in an environment to maximize rewards. The foundational building blocks of RL are:
- State (s): the agent’s current situation or context.
- Action (a): a set of choices the agent can make.
- Reward (r): feedback received after taking an action, which tells the agent how good or bad that action was in the state.
The agent’s policy defines how it maps each state to actions, and the goal is to learn a policy that, over time, accumulates as much reward as possible.
Traditionally, reinforcement learning has been applied in contexts such as robotics or games (think AlphaGo or Atari), where the states and actions correspond to physical world configurations or moves on a board. For large language models, the same principles are now being harnessed to refine how these models generate text. For LLMs, the state, action, and reward spaces have slightly different meanings:
- State: The text generated so far (the prompt and model output up to the current token).
- Action: The next word or token the model chooses to output.
- Reward: An external signal reflecting how good a generated response is - either based on human judgments or programmatic checks.
The original landmark application for LLMs was Reinforcement Learning from Human Feedback (RLHF). Here, LLMs generate responses (actions), and human annotators (or trained reward models that mimic them) provide feedback on which answers are more helpful, safe, or appropriate (reward). The model then adjusts its policy to make desirable completions more likely.
Recently, this has evolved into Reinforcement Learning with Verifiable Rewards (RLVR). Instead of relying on subjective judgments or learned reward models, RLVR uses objective, automatic criteria to assign rewards, such as checking if a math answer is exactly correct or if generated code passes unit tests. This produces strong, unambiguous learning signals for fine-tuning LLM behavior.
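To make this concrete, here is a minimal sketch (not tied to any particular framework) of what a verifiable reward might look like for a math problem: the completion's final answer is extracted and compared against a known ground truth. The function name and the "Answer:" extraction convention are illustrative assumptions.

# Minimal sketch of a verifiable reward for a math problem.
# The "Answer:" convention below is an illustrative assumption, not a standard.
def verifiable_math_reward(completion: str, ground_truth: str) -> float:
    # Take whatever follows the last "Answer:" marker as the model's final answer
    final_answer = completion.rsplit("Answer:", 1)[-1].strip()
    # An exact-match check yields an unambiguous, automatic reward signal
    return 1.0 if final_answer == ground_truth.strip() else 0.0

print(verifiable_math_reward("3 * 4 = 12. Answer: 12", "12"))  # 1.0
print(verifiable_math_reward("3 * 4 = 13. Answer: 13", "12"))  # 0.0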
Policy gradient learning: A core pillar of modern reinforcement learning
Modern reinforcement learning algorithms for LLMs are often based on policy gradient methods. Here, instead of trying to estimate the long-term value of each state (as in classic Q-learning or value iteration), the algorithm directly tweaks the model’s parameters to increase the likelihood of actions (outputs) that led to higher rewards. In other words, the model’s policy (probability distribution over possible outputs) gets adjusted to favor rewarded actions, using gradients computed from observed reward differences.
A central idea is the advantage, which measures how much better or worse an action performs compared to the average expectation for a given state. This can be calculated as:
$$A_t = R_t - V(s_t)$$
That is, Advantage = Actual Reward − Expected Reward, where the expected reward $V(s_t)$ is predicted by a separate critic network.
Policy gradient loss with advantage
The basic policy loss (using advantage) is:
$$L^{PG}(\theta) = -\,\mathbb{E}_t\big[\log \pi_\theta(a_t \mid s_t)\, A_t\big]$$
Where:
- $\mathbb{E}_t$ = expectation over timesteps $t$
- $A_t$ = advantage at time $t$
- $\pi_\theta(a_t \mid s_t)$ = policy's probability of action $a_t$, given state $s_t$, under parameters $\theta$
This loss function encourages the model to assign higher probability to actions that turn out better than expected (positive advantage) and lower probability to actions that perform worse than expected (negative advantage). In other words, it “nudges” the policy to repeat good behaviors and avoid bad ones, as judged by their impact on future reward compared to the critic’s baseline.
The advantage acts as a weight: positive advantages increase the likelihood of those actions in the future, while negative ones decrease their probability. This way, learning focuses on the surprisingly good or bad actions rather than just repeating what is already expected.
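As a rough illustration (a sketch, not a production training loop), the advantage-weighted policy gradient loss takes only a few lines of PyTorch: multiply each action's log-probability by its advantage, negate, and average. The tensor names and toy numbers below are assumptions for illustration.

import torch

# Sketch: advantage-weighted policy gradient loss.
# log_probs: log pi_theta(a_t | s_t) of the sampled actions, shape (T,)
# advantages: advantage estimates A_t, shape (T,), treated as constants
def policy_gradient_loss(log_probs: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor:
    # Minimizing this loss raises the probability of positive-advantage actions
    # and lowers the probability of negative-advantage ones.
    return -(advantages.detach() * log_probs).mean()

# Toy example with made-up values
log_probs = torch.tensor([-1.2, -0.7, -2.3], requires_grad=True)
advantages = torch.tensor([0.5, -0.1, 1.3])
loss = policy_gradient_loss(log_probs, advantages)
loss.backward()  # gradients now favor the high-advantage actions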
Proximal Policy Optimization (PPO) and beyond
A widely used improvement over standard policy gradients is Proximal Policy Optimization (PPO). PPO keeps policy updates stable using a clipped surrogate objective, which prevents the model from changing too abruptly with each update. This stability is especially crucial for large models, such as LLMs.
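For reference, the clipped surrogate objective that PPO maximizes is commonly written as follows, where $r_t(\theta)$ is the probability ratio between the current and previous policy and $\epsilon$ is a small clipping threshold (typically around 0.1 to 0.2):
$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\Big[\min\big(r_t(\theta)\,A_t,\ \operatorname{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,A_t\big)\Big], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$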
Modern actor-critic methods - including PPO - almost always use Generalized Advantage Estimation (GAE) to compute advantages. GAE flexibly blends short-term and long-term reward signals by mixing information from n-step returns, controlled by a parameter λ. This approach strikes a practical balance between bias and variance, leading to more reliable and efficient learning.
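Concretely, GAE forms the advantage as an exponentially weighted sum of one-step temporal-difference errors, with the discount factor $\gamma$ and the mixing parameter $\lambda$ trading off bias against variance:
$$\hat{A}_t^{\text{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l\,\delta_{t+l}, \qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$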
For LLMs, newer methods like Group Relative Policy Optimization (GRPO) simplify reinforcement learning further by comparing candidate outputs within a group and using their relative rewards as the learning signal, which streamlines training without a separate value network.
GRPO, introduced in the DeepSeekMath paper, is a reinforcement learning algorithm explicitly designed to enhance the reasoning abilities of LLMs for challenging mathematical tasks while avoiding some of the computational bottlenecks of traditional RL techniques like Proximal Policy Optimization.
Why GRPO? What problem does it solve?
In standard reinforcement learning for language models, PPO has been a workhorse for aligning model output to desired behaviors:
- The model (policy) generates candidate responses,
- A value network (the "critic") estimates how good particular actions are, and
- A reward signal encourages the model to generate better answers in the future.
However, for LLM-scale models, especially in mathematical reasoning, maintaining a value network nearly as large as the main model is computationally expensive - and in the RLVR setting, where rewards are often only available at the sequence-level (i.e., the final answer is right or wrong), it can be difficult to train an accurate value model at each token.
GRPO offers a solution that is both computationally efficient and closely aligned with the structure of reward signals in language model RL.

How does GRPO work?
Instead of using a separate value model to compute baselines and advantages (which help stabilize updates and reduce variance), GRPO looks at groups of sampled outputs for a given input prompt. Here's the core idea:
1. For each prompt (question), the model generates a group of candidate outputs (answers), typically by sampling from its current policy.
2. Each output is scored using a reward function, which can be a binary indicator (e.g., "Is the answer exactly correct?") or a learned reward model.
3. GRPO calculates the average reward across the group of responses - the simplest available "baseline."
4. The "advantage" for each response is then computed relative to this mean reward: if an output's reward is better than the group average, its probability should be increased; if it's worse, it should be decreased. Here's the formula for the advantage in GRPO:
Where:
: raw return (or reward) for the sample
: the vector of all returns (e.g., over a batch or trajectory)
5. The update step uses this relative (group-wise) advantage to reinforce the model: outputs better than their peers are made more likely under the policy, and worse ones are suppressed.
This group-centric approach replaces the need for a heavy value network, greatly reducing memory and computation costs. GRPO also directly reflects the way many reward functions for LLMs are built - as comparative or preference-based, rather than requiring a notion of absolute quality.
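As a minimal sketch (not DeepSeek's actual implementation), the group-relative advantage computation amounts to normalizing each sampled completion's reward by its group's mean and standard deviation:

import torch

# Sketch: group-relative advantages in the spirit of GRPO.
# rewards: one scalar reward per sampled completion for the SAME prompt, shape (G,)
def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # The group mean serves as the baseline; dividing by the std normalizes scale.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Toy example: four sampled answers to one prompt, scored 1.0 if correct else 0.0
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
print(group_relative_advantages(rewards))
# Positive entries -> completions made more likely; negative entries -> less likely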
A case study on RLVR and GRPO: DeepSeek R1
A great demonstration of the power of RLVR is the DeepSeek R1 model. DeepSeek-R1 is a language model developed specifically to excel at advanced reasoning tasks, including mathematics, logic, and programming. Its training process is notably different from that of most general-purpose models like GPT-4o, which are primarily supervised on massive datasets to predict the next token and then further aligned with human preferences via RLHF.
DeepSeek-R1 was trained using a multi-stage process to enhance its reasoning abilities and align its responses with human preferences for helpfulness and harmlessness. The training began with "cold-start" supervised fine-tuning, where thousands of high-quality, human-generated examples featuring detailed chains of thought (CoT) were used to provide the base language model with a strong foundation in readable and well-structured reasoning.
After this initial fine-tuning, the model underwent reasoning-oriented reinforcement learning. In this stage, the model was encouraged to generate correct and clear answers, particularly in tasks involving math, coding, and logical reasoning. The rewards during RL were determined by both the accuracy of the answers and the consistency of the language, guiding the model to produce correct and well-structured reasoning.
To further enhance the quality of the training data, the researchers employed rejection sampling. In this process, the model generated multiple responses for each prompt. Only those responses that met specific criteria, such as accuracy, clarity, and correct formatting, were selected, while the rest were discarded. These high-quality responses formed a new supervised training dataset for further fine-tuning.
Additionally, data covering broader domains - like writing and factual question answering - were collected and incorporated into the training, which helped expand the model’s capabilities.
The final reinforcement learning stage focused on aligning the model with human preferences, namely helpfulness and harmlessness, while further refining its reasoning skills. At this stage, responses were evaluated both automatically (using reward signals for accuracy, formatting, and language consistency) and through model-based or rule-based assessments of helpfulness and safety. This ensured the model not only developed strong reasoning performance but also produced responses that were appropriate, safe, and useful in real-world interactions.
Improving RL efficiency with distillation
The final step in the DeepSeek-R1 training process is distillation: the outputs and capabilities of the best RL-tuned DeepSeek-R1 model are used as supervision to train much smaller dense models, such as those based on Llama or Qwen architectures at 1.5B, 7B, 14B, 32B, and 70B scales. These distilled models are not trained using reinforcement learning themselves, but rather fine-tuned on the high-quality reasoning trajectories generated by DeepSeek-R1. The results are remarkable: for the first time, small and efficient open-source models achieve and even surpass the reasoning performance of much larger, instruction-tuned models, and in many cases close the gap to models like OpenAI's o1-mini series. For example, the DeepSeek-R1-distilled Qwen-7B surpasses the 32B QwQ-Preview model on major math and logic benchmarks, while the 32B and 70B distilled versions set new records for open-source reasoning at their scale.
The results of DeepSeek-R1
When comparing DeepSeek-R1 and its distilled versions to mainstream non-reasoning models such as GPT-4o, the difference in training methodology becomes clear. Models like GPT-4o remain strong generalists and are more chatty and creative in open conversation, but they are not incentivized during training to explicitly generate long, logical chains of thought or to optimize for formal problem correctness across math, science, or competition programming. As a result, on the very benchmarks that stress step-by-step reasoning, such as AIME or MATH-500, DeepSeek-R1 outperforms GPT-4o by a dramatic margin: on AIME, where GPT-4o typically scores in the single digits to low teens, DeepSeek-R1 achieves above 79%, and it reaches over 97% on MATH-500, matching or exceeding even OpenAI's closed "o1" line, which is itself a reasoning-specialized branch.

The smaller distilled R1 models inherit much of this reasoning ability, and do so much more efficiently than any method that tries to RL-train small models directly. In fact, experiments show that attempts to teach a small model to reason via reinforcement learning from scratch are computationally costly and yield less capable models than simple distillation from a powerful RL-trained teacher. This distillation thus makes first-rate reasoning accessible at all commonly deployed model sizes, allowing smaller open LLMs to excel where previous models never could.

Cautionary findings on RL-fine-tuned chain-of-thought models
While reinforcement learning, especially when coupled with Chain-of-Thought (CoT) prompting and verifiable rewards, has led to dramatic improvements on mathematical and reasoning benchmarks for large language models, a growing body of recent research urges a sober re-examination of these apparent gains. Several lines of evidence collectively point to important limitations, hidden complexities, and unresolved questions in the current approaches:
Study 1: The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
Apple's study, The Illusion of Thinking, convincingly demonstrates that impressive benchmark scores can mask deep structural shortcomings in reinforcement learning-trained reasoning models. They find that as problems become more complex, supposed "reasoning" LLMs fail to generalize, and when they do outperform baseline LLMs on simple tasks, it is often due to architectural quirks rather than robust reasoning ability. At medium complexity, the outputs of these models are littered with inefficient trial and error; at high complexity, both RL-fine-tuned and vanilla models alike collapse to failure. Critically, as problem difficulty grows, the quantity and quality of reasoning actually decrease, revealing a bottleneck that pure RL-CoT does not overcome. Standard benchmarks, which focus solely on final-answer accuracy, are often contaminated or miss the true structure of the model's reasoning, raising the risk of overinterpreting surface-level wins.

Study 2: Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
Further skepticism comes from studies (e.g., Yang Yue et al.) investigating whether RLVR expands the fundamental reasoning capacity of LLMs. The evidence suggests it does not: RLVR is highly effective at amplifying the sampling of solutions that the base model could already reach, making them easier to find in a few tries (pass@1) - but does not teach the model to actually solve new classes of problems or invent new reasoning strategies. At high sampling rates (pass@k, for large k), base models perform similarly or even better, covering more ground and more unique solutions than their RLVR-trained versions, which become over-focused on rewarded patterns. Distillation (learning from a more capable teacher) remains the only clear way to expand true model capabilities. Thus, while RL can concentrate the probability mass over “known” solutions, it doesn’t break new ground.

Study 3. A Minimalist Approach to LLM Reasoning: from Rejection Sampling to Reinforce
A related finding, highlighted by Xiong et al., is that many common RL benefits can be matched - or even surpassed - by much simpler sample selection or post-training schemes. For example, filtering for positively-rewarded samples (as in “rejection sampling fine-tuning”) is nearly as effective as (or sometimes better than) elaborate RL schemes like PPO, GRPO, or DPO in many math and reasoning benchmarks. Furthermore, the most reliable gains come not from architectural tweaks, but from filtering out “hard negative” batches (those with all wrong answers) and selecting effective training signals. In some settings, classic RL tricks and extra complexity actually harm convergence. This suggests that the biggest practical advances may lie in principled data filtering rather than intricate RL algorithms per se.
Study 4. Reinforcement Learning for Reasoning in Large Language Models with One Training Example
Perhaps most startling were the findings from “1-shot RLVR” (Wang et al.), where training on just one carefully chosen example produced most of the gains typically attributed to thousand-example RL. While dramatic, this raised the question: how much were researchers truly teaching the model to “reason,” and how much were they simply activating or surfacing existing, dormant capabilities in the base model? These ultra-efficient gains were likely a sign of existing “islands of ability” in the base distribution that RLVR brought to the surface - rather than a sign of broad, deep learning. As such, claims of strong reasoning generalization via RLVR required careful scrutiny.

Wang et al. showed that reinforcement learning with a verifiable reward (RLVR), even when using only a single training example, dramatically enhanced the mathematical reasoning abilities of large language models. When they applied 1-shot RLVR to Qwen2.5-Math-1.5B, accuracy on the MATH500 benchmark jumped from 36% to over 73%, a result that matched RLVR training with data sets of over a thousand examples. Average performance across six reasoning benchmarks nearly doubled, and using just two examples yielded even slightly higher scores. These gains generalized broadly: similar substantial improvements appeared across different model architectures, RL algorithms (such as GRPO and PPO), and both handpicked and many randomly chosen math problems, with most single examples producing 30% or greater improvement no matter which was used for training.

Strikingly, these performance boosts were not limited to mathematical reasoning. Single-example RLVR also carried over to non-mathematical reasoning tasks, and cross-domain generalization occurred - training on a geometry problem, for instance, improved the model’s performance in algebra and number theory as well. The process led models to generate longer, more self-reflective solutions. The authors observed “post-saturation generalization,” where, even after the model fully overfit the single training example, its test accuracy continued to improve. This suggested that RLVR reshaped the model’s output distribution more fundamentally than by mere memorization.
Ablation studies revealed that these gains were driven primarily by the policy gradient RL loss, rather than by regularization effects. Moreover, simply encouraging output diversity (via entropy loss) without any reward still led to a 27% boost in MATH500 accuracy. In sum, the study demonstrated that a great deal of a base LLM’s latent reasoning ability could be surfaced with minimal RL supervision, shifting the research focus from brute-force data scaling to careful example selection and reward design - a finding that called for a nuanced interpretation of how “general reasoning” was being developed in such models.
Concluding Cautions and Future Directions
Collectively, these results suggest that many headline successes in RL-aligned CoT LLMs may reflect more about our benchmarks, sampling protocols, and data selection than they do about genuine, broad-based machine reasoning progress. RL, in its current form, seems primarily a tool for more efficiently producing what the base model already “knows” - not for discovering new reasoning strategies or applying them to fundamentally novel domains.
Tutorial: Implementing GRPO training with Huggingface
To get our hands dirty with GRPO, we will implement a simple example training a model using the GRPO trainer. To avoid extra complexity, we will use a simple reward function that encourages completions of a specific target length. In the code below, we define two reward functions: one incentivizing long completions (around 500 tokens), and another incentivizing short completions (around 100 tokens).
We use the Qwen2-0.5B-Instruct language model and a sample of the TL;DR summarization dataset. For each setting, we initialize a GRPOTrainer, configure the reward, and run training for 500 steps. We use Weights & Biases for experiment tracking. After training, each fine-tuned model is saved.
Here’s the code for training:
import wandb
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer
from transformers import AutoTokenizer

# --- Data and Tokenizer ---
dataset = load_dataset("trl-lib/tldr", split="train[:10000]")
eval_dataset = load_dataset("trl-lib/tldr", split="test[:30]")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct", trust_remote_code=True)

# --- Reward functions: Target length only, no logging ---
def reward_long(completions, **kwargs):
    num_tokens = [len(tokenizer.encode(c, add_special_tokens=False)) for c in completions]
    rewards = [-(l - 500) ** 2 / 1000 for l in num_tokens]
    return rewards

def reward_short(completions, **kwargs):
    num_tokens = [len(tokenizer.encode(c, add_special_tokens=False)) for c in completions]
    rewards = [-(l - 100) ** 2 / 1000 for l in num_tokens]
    return rewards

# --- TRAIN LONG ---
wandb.init(project="trl-qwen2-long-short", name="reward-long-500", notes="Target 500 tokens per response", reinit=True)
print("Training -- reward LONG (target ~500 tokens) completion model...")
args_long = GRPOConfig(
    output_dir="qwen2-long-500",
    per_device_train_batch_size=8,
    max_steps=500,
    logging_steps=10,
    save_strategy="no",
    eval_strategy="steps",
    eval_steps=50,
    do_eval=True,
    max_completion_length=1028,
)
long_trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=reward_long,
    args=args_long,
    train_dataset=dataset,
    eval_dataset=eval_dataset,
)
long_trainer.train()
long_trainer.save_model("qwen2-long-500")
wandb.finish()

# --- TRAIN SHORT ---
wandb.init(project="trl-qwen2-long-short", name="reward-short-100", notes="Target 100 tokens per response", reinit=True)
print("Training -- reward SHORT (target ~100 tokens) completion model...")
args_short = GRPOConfig(
    output_dir="qwen2-short-100",
    per_device_train_batch_size=8,
    max_steps=500,
    logging_steps=10,
    save_strategy="no",
    eval_strategy="steps",
    eval_steps=50,
    do_eval=True,
    max_completion_length=1028,
)
short_trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=reward_short,
    args=args_short,
    train_dataset=dataset,
    eval_dataset=eval_dataset,
)
short_trainer.train()
short_trainer.save_model("qwen2-short-100")
wandb.finish()
Once you run the code above, you will have fine-tuned two versions of a language model - one that prefers long completions (around 500 tokens), and another that prefers short completions (around 100 tokens). The key mechanism that enables this is the use of custom reward functions. Instead of asking the model to imitate a particular text, you’re giving it a numerical reward score based on how close each completion gets to your target length, and the GRPO reinforcement learning algorithm updates the model to maximize future expected reward.
During training, the process unfolds as follows. For each batch, the model is prompted using examples from the TL;DR dataset and generates a set of continuations. The reward function examines each completion, measures its length in tokens, and calculates a reward value based on how closely it aligns with the target. For the “long” model, outputs closer to 500 tokens are favored, and for the “short” model, outputs closer to 100 tokens score higher. The GRPOTrainer uses these reward signals to update the model’s policy, making it more likely to produce outputs of the rewarded length in the future.
Here are the training visualizations for my experiment:
By running both experiments with otherwise identical parameters (same model, data, optimizer, and training steps), you're isolating the effect of the reward function alone on the model's behavior. After training finishes and the models are saved, you should find that the "long" and "short" variants reliably steer the length of their outputs according to your objective, regardless of the prompt content.
Running inference with our models
Now that we have two specialized models fine-tuned via GRPO - one trained to generate longer outputs and the other trained for shorter responses - we can see their behaviors in action. In this section, we’ll load both models and run inference on a set of prompts to directly compare how reinforcement learning, guided purely by a reward for output length, shapes their generations.
Below, we define a utility function to generate completions from any saved model using the Hugging Face Transformers library. We then load a sample of prompts from the TL;DR dataset, run both our "reward-long" and "reward-short" models on each prompt, and collect both the completions and the number of tokens generated. This allows us not only to observe the actual responses side-by-side but also to quantitatively compare how closely each model sticks to the desired output length.
from transformers import AutoTokenizer, AutoModelForCausalLM
from datasets import load_dataset
import weave

weave.init("length_control_grpo")

@weave.op
def run_inference(model_path, prompt, tokenizer, max_new_tokens=2048):
    model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
    model.eval()
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,
        temperature=0.0,
        pad_token_id=tokenizer.eos_token_id,
    )
    completion = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
    return completion.strip()

model_paths = {
    "Reward-Long Model": "qwen2-long-500",
    "Reward-Short Model": "qwen2-short-100",
}

# Load example prompts
dataset = load_dataset("trl-lib/tldr", split="test")
prompts = [dataset[i]["prompt"] for i in range(50)]

# Prepare tokenizer
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct", trust_remote_code=True)

# Dicts to store completions and their lengths
all_outputs = {name: [] for name in model_paths}
all_token_counts = {name: [] for name in model_paths}

# Iterate through prompts, generating output from each model one by one
for i, prompt in enumerate(prompts):
    print(f"\nPrompt {i+1}: {prompt}")
    for name, path in model_paths.items():
        completion = run_inference(path, prompt, tokenizer)
        print(f"\n{name} Completion:\n{completion}")
        all_outputs[name].append(completion)
        # Store token count
        tokens = tokenizer(completion, add_special_tokens=False)["input_ids"]
        all_token_counts[name].append(len(tokens))

# Print out the lengths, measured in number of tokens
print("\n=== Completion Token Counts ===")
for idx, prompt in enumerate(prompts):
    print(f"\nPrompt {idx+1}: {prompt}")
    for name in model_paths:
        tokens = all_token_counts[name][idx]
        print(f"  {name}: {tokens} tokens")

# Print average length for each model
print("\n=== Average Completion Token Counts ===")
for name in model_paths:
    avg = sum(all_token_counts[name]) / len(all_token_counts[name])
    print(f"{name}: {avg:.2f} tokens (average over {len(all_token_counts[name])} prompts)")
In the script, I leverage W&B Weave to visualize the outputs from my model. Weave is a tracking and debugging tool by Weights & Biases designed for large language model (LLM) applications. By adding weave.init() and decorating your inference function with @weave.op, every function call - including inputs, outputs, and code - is automatically logged. This lets you trace, analyze, and compare LLM runs easily in the Weave web dashboard, helping you systematically evaluate and debug your models. Weave is helpful not only for debugging LLM applications during development, but also after deploying the model into production, so that you can ensure that the model is behaving as expected. Here's a screenshot inside Weave:

After running our script, we will see that our model is performing quite well, in alignment with our reward function! The model trained for longer completions averages 503.22 tokens, while the one trained for shorter completions averages 74.66 tokens. This confirms that the GRPO training effectively encoded the desired output length behavior, using only a reward signal based on token count.

In practice, more complex training systems can be constructed, which force the model to solve much more complex problems. For example, instead of simply generating text, the model might be trained to carry out multi-step reasoning, solve math word problems, or construct formal proofs. In these cases, reward functions can be designed to reflect the correctness of the final answer or the correctness of intermediate steps.
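For example, a correctness-based reward in the same shape as the length rewards above might look like the sketch below. It assumes the training dataset carries a ground-truth column (hypothetically named "answer") that is forwarded to the reward function as a keyword argument, and it reuses the simple "Answer:" extraction convention purely for illustration; check your TRL version's documentation for the exact reward-function interface.

# Sketch of a correctness-based reward function, following the same signature
# as the length rewards in the tutorial. The "answer" column and the "Answer:"
# extraction convention are illustrative assumptions.
def reward_correct(completions, answer=None, **kwargs):
    rewards = []
    for completion, ground_truth in zip(completions, answer):
        # Take whatever follows the last "Answer:" marker as the final answer
        predicted = completion.rsplit("Answer:", 1)[-1].strip()
        rewards.append(1.0 if predicted == str(ground_truth).strip() else 0.0)
    return rewards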
Conclusion
Using reinforcement learning with verifiable rewards and algorithms like GRPO leads to clear and measurable improvements in how language models solve problems, especially in tasks such as math or step-by-step reasoning. Models become more adept at producing answers that meet the specific goals set by the reward function, often improving rapidly with just a small amount of targeted feedback.
However, even with these impressive gains, it’s not clear whether the models are truly acquiring new reasoning skills or merely improving at following patterns and formats that align with the rewards. Most of the time, these improvements come from making the model produce more of the kinds of responses it could have given.
Overall, while reinforcement learning and GRPO are powerful at shaping model behavior and making outputs more reliable or useful, there’s still a lot to learn about what this actually means for real reasoning and understanding. The field has made great progress in making models appear more intelligent, but determining the actual limits and what “reasoning” actually entails within these systems is still very much a work in progress.