
New studies uncover surprising findings about reasoning models

Discover how two recent studies challenge conventional reinforcement learning in LLM reasoning, revealing that simple data filtering can rival complex methods and that RLVR may only optimize known abilities.
Two recent papers critically reassess how large language models develop reasoning abilities. One, from Salesforce AI, explores the effectiveness of simpler reinforcement learning methods. The other, from Tsinghua University, challenges whether reinforcement learning with verifiable rewards (RLVR) actually enhances reasoning at all. These studies suggest that much of what is assumed about reasoning improvements in LLMs may need to be reconsidered.

Simple methods rival complex RL (Salesforce AI – Xiong et al.)

In the paper A Minimalist Approach to LLM Reasoning, the researchers evaluate various RL methods for improving math reasoning in LLMs. Modern post-training often relies on algorithms like Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO), which are computationally demanding: PPO requires a separate value model, while GRPO normalizes rewards across groups of responses sampled for each prompt.
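To make that contrast concrete, here is a minimal sketch of the group-relative advantage GRPO uses in place of a learned value model. The function name, tensor shapes, and epsilon are illustrative assumptions, not the paper's implementation.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Illustrative GRPO-style advantage: normalize each response's reward
    against the mean and std of the group sampled for the same prompt.

    rewards: shape (num_prompts, group_size), one scalar reward per response.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled responses each, binary correctness rewards.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [1.0, 1.0, 1.0, 0.0]])
print(group_relative_advantages(rewards))
```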
The study introduces RAFT (Reward-ranked Fine-Tuning), a rejection sampling-based method that trains only on responses that receive positive rewards. Despite its simplicity, RAFT matches or outperforms GRPO during early and mid-stage training on benchmarks like MATH500, Minerva Math, and OlympiadBench. A variant, Reinforce-Rej, improves further by discarding prompts whose sampled responses are all correct or all incorrect, sharpening the quality of the training signal.
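As a rough illustration of the rejection-sampling idea behind RAFT and the prompt filtering in Reinforce-Rej, the sketch below keeps only positively rewarded samples as fine-tuning data and drops prompts with no training signal. Function names and the data layout are assumptions for clarity, not the authors' code.

```python
from typing import Callable

def raft_filter(
    prompts: list[str],
    sample_fn: Callable[[str, int], list[str]],  # draws k responses from the current policy
    reward_fn: Callable[[str, str], float],      # verifiable reward, e.g. 1.0 if the answer is correct
    k: int = 8,
) -> list[tuple[str, str]]:
    """RAFT-style data filtering: sample k responses per prompt and keep
    only those with positive reward as (prompt, response) SFT pairs."""
    sft_pairs = []
    for prompt in prompts:
        responses = sample_fn(prompt, k)
        kept = [r for r in responses if reward_fn(prompt, r) > 0]
        sft_pairs.extend((prompt, r) for r in kept)
    return sft_pairs

def reinforce_rej_filter(prompts, sample_fn, reward_fn, k: int = 8) -> list[str]:
    """Reinforce-Rej-style prompt filtering: drop prompts whose sampled
    responses are all correct or all incorrect, keeping only mixed groups."""
    kept_prompts = []
    for prompt in prompts:
        rewards = [reward_fn(prompt, r) for r in sample_fn(prompt, k)]
        if 0 < sum(r > 0 for r in rewards) < k:
            kept_prompts.append(prompt)
    return kept_prompts
```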
An improved version, RAFT++, adds importance sampling and clipping for better performance in early training. However, it eventually falls behind GRPO due to entropy collapse—since it trains only on correct answers, the model converges too quickly and fails to explore alternative reasoning paths. GRPO’s advantage, the authors argue, lies not in complex optimization, but in keeping exploration open through occasional negative samples.
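For intuition about what importance sampling and clipping add, here is a minimal sketch of a PPO-style clipped surrogate loss, the kind of correction RAFT++ layers on top of rejection sampling. Variable names and the clip range are illustrative assumptions.

```python
import torch

def clipped_pg_loss(logp_new: torch.Tensor,
                    logp_old: torch.Tensor,
                    advantages: torch.Tensor,
                    clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped importance-sampling policy-gradient loss.

    logp_new / logp_old: log-probs of the sampled tokens under the current
    policy and the policy that generated them; advantages: per-token or
    per-sequence advantage estimates.
    """
    ratio = torch.exp(logp_new - logp_old)                       # importance weight
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.minimum(unclipped, clipped).mean()             # negate to maximize the surrogate
```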
The conclusion is straightforward: filtering data well may matter more than the choice of reinforcement learning algorithm. By simply cleaning up prompts, even basic training strategies can compete with or exceed state-of-the-art techniques.


RLVR doesn’t expand reasoning (Tsinghua University – Yue et al.)

In Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?, researchers question whether RL with verifiable rewards actually builds new reasoning capacity or just improves a model’s ability to guess known answers.
RLVR stands for "Reinforcement Learning with Verifiable Rewards." It’s a specific setup in LLM training where the model is fine-tuned using reinforcement learning (RL), but instead of using vague or subjective feedback (like human preference scores), it uses binary rewards based on whether an answer is objectively right or wrong. Think math problems: the model gets a 1 if it solves the problem correctly, and a 0 if it doesn’t. These rewards are then used to adjust the model to favor responses that lead to high-reward outcomes.
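As a concrete, simplified illustration, a verifiable reward for math can be as little as an exact-match check against a reference answer. The extraction convention and normalization below are assumptions for clarity; real graders are usually more careful about equivalent answer forms.

```python
import re

def verifiable_math_reward(model_output: str, reference_answer: str) -> float:
    """Binary RLVR-style reward: 1.0 if the final answer matches the
    reference (after light normalization), else 0.0."""
    def extract_final_answer(text: str) -> str:
        # Hypothetical convention: the answer follows "Answer:"; otherwise
        # fall back to the last number in the text.
        match = re.search(r"Answer:\s*(.+)", text)
        if match:
            return match.group(1).strip()
        numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
        return numbers[-1] if numbers else text.strip()

    def normalize(s: str) -> str:
        return s.strip().rstrip(".").replace(" ", "").lower()

    return 1.0 if normalize(extract_final_answer(model_output)) == normalize(reference_answer) else 0.0

print(verifiable_math_reward("The sum is 42. Answer: 42", "42"))  # 1.0
print(verifiable_math_reward("I think it's 41.", "42"))           # 0.0
```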
The study evaluates models like LLaMA, Qwen, and DeepSeek across tasks in math, coding, and visual reasoning using different RL techniques. They use the pass@k metric, which measures the probability that at least one out of k samples is correct. RLVR-trained models outperform their base models at small k, suggesting improved single-shot accuracy. However, as k increases, the base models match or surpass them. This pattern implies that RLVR doesn't uncover new reasoning strategies but biases the model toward pre-existing ones.
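For reference, pass@k is typically computed with the unbiased estimator from Chen et al. (2021): draw n ≥ k samples, count the c correct ones, and estimate the probability that a random subset of k contains at least one success. A minimal implementation:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k samples
    drawn without replacement from n generations is correct, given that c of
    the n generations are correct."""
    if n - c < k:
        return 1.0  # too few failures for a k-subset to be all wrong
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 100 samples per problem, 5 of them correct.
print(pass_at_k(100, 5, 1))   # ≈ 0.05
print(pass_at_k(100, 5, 32))  # ≈ 0.86
```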
Further perplexity analysis shows that RL-trained models’ reasoning paths are already within the base model’s output distribution. RLVR just increases the chance of selecting these known paths, narrowing the distribution and decreasing exploratory diversity. This increases sampling efficiency but at the cost of reasoning flexibility.
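One way to run this kind of check yourself is to score the RL-tuned model's generations under the base model: low perplexity means the base model already assigns high probability to those reasoning paths. A rough sketch using Hugging Face transformers, with placeholder model names:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity_under_model(text: str, model, tokenizer) -> float:
    """Perplexity of `text` under `model`: exp of the mean token-level
    cross-entropy, computed by passing the inputs as labels."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return torch.exp(loss).item()

# Placeholder checkpoint name for the base model.
base_name = "base-model-checkpoint"
tokenizer = AutoTokenizer.from_pretrained(base_name)
base_model = AutoModelForCausalLM.from_pretrained(base_name)

rl_generation = "...a chain-of-thought produced by the RLVR-tuned model..."
print(perplexity_under_model(rl_generation, base_model, tokenizer))
```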
The researchers also compare RLVR to distillation from larger models, which genuinely transfers new reasoning capabilities. Distilled models outperform both the base and RLVR versions on tasks that require abilities not present in the original model. None of the RL methods studied (PPO, GRPO, RLOO) manages to fully unlock the model's potential, especially at large k. Longer RL training further worsens performance due to overfitting and entropy loss.
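For context, distillation in this setting usually means supervised fine-tuning the smaller model on reasoning traces produced by a stronger teacher. The sketch below shows only the shape of that data pipeline; every name in it is an illustrative assumption.

```python
from typing import Callable, Optional

def build_distillation_dataset(
    prompts: list[str],
    teacher_generate: Callable[[str], str],              # long-form reasoning + answer from the teacher
    reward_fn: Optional[Callable[[str, str], float]] = None,
) -> list[dict]:
    """Illustrative distillation pipeline: collect teacher reasoning traces
    (optionally keeping only verified-correct ones) as SFT targets."""
    dataset = []
    for prompt in prompts:
        trace = teacher_generate(prompt)
        if reward_fn is None or reward_fn(prompt, trace) > 0:
            dataset.append({"prompt": prompt, "completion": trace})
    return dataset
```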

Implications for LLM development

Together, these studies dismantle the common narrative that RL algorithms inherently enhance reasoning in LLMs. Instead, they point to two main takeaways.
  • First, performance gains often come from better filtering of training data, not from the sophistication of the RL algorithm itself.
  • Second, RLVR does not create new reasoning ability but merely optimizes the use of existing ones, often at the cost of exploration and generalization.
For developers working on post-training, this has major implications. If the aim is efficient training, methods like RAFT and Reinforce-Rej are promising due to their simplicity and stability. But if the goal is to extend a model’s capabilities, RLVR may fall short. Instead, distillation—or possibly new training paradigms—might be required to push LLM reasoning into truly new territory.
Tags: ML News