
LIMO: Less is more for reasoning?

Is less more for reasoning?
Created on February 10 | Last edited on February 10
Recent research in artificial intelligence suggests an unexpected idea: complex reasoning can emerge in large language models from only a fraction of the training data once thought necessary. The LIMO (Less-Is-More for Reasoning) model challenges long-standing assumptions by achieving state-of-the-art performance in mathematical reasoning with just 817 carefully curated training examples. This finding overturns the belief that models need tens or hundreds of thousands of examples to develop advanced reasoning capabilities.
LIMO, developed by researchers at SJTU, SII, and GAIR, significantly outperforms previous supervised fine-tuning (SFT) models on challenging benchmarks like AIME24 and MATH. With only 817 examples, it reaches 57.1% accuracy on AIME24 and 94.8% on MATH, compared to prior strong SFT-based models that scored just 6.5% and 59.2%, respectively. Even more remarkably, LIMO generalizes well to out-of-distribution reasoning tasks, posting a 40.5% absolute improvement across ten different benchmarks. These results suggest that reasoning capabilities in LLMs do not primarily emerge from large-scale supervised fine-tuning but rather from the strategic use of high-quality, carefully curated examples.


Building the LIMO Dataset: Why 817 Examples Are Enough

The success of LIMO hinges on the careful construction of its training dataset, which is significantly smaller than traditional reasoning datasets. Instead of collecting tens of thousands of examples, the researchers focused on selecting problems that naturally elicit complex reasoning and constructing solutions that encourage extended logical thinking.
To build the dataset, they started with a massive pool of reasoning problems—drawn from sources like NuminaMath-CoT, AIME historical exam questions, and the MATH dataset—and filtered them based on difficulty, diversity, and reasoning depth. Initial filtering eliminated problems that could be solved by existing models with simple heuristics, ensuring only genuinely challenging problems remained. The final selection process involved manual curation, expert review, and model-based ranking, ultimately narrowing the dataset down to 817 high-quality examples.
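To make the filtering step concrete, here is a minimal sketch of what a model-based difficulty filter could look like. The `solves` helper, the baseline models, the sampling count, and the pass-rate threshold are all illustrative assumptions; the paper's exact pipeline and prompts are not reproduced here.

```python
# Hypothetical sketch of difficulty-based filtering: keep only problems
# that existing baseline models fail to solve reliably. All thresholds
# and the model.sample API below are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Problem:
    question: str
    answer: str

def solves(model, problem: Problem, n_samples: int = 4) -> float:
    """Return the fraction of sampled solutions matching the reference answer.
    `model.sample` is a hypothetical generation API."""
    correct = sum(
        model.sample(problem.question).strip() == problem.answer
        for _ in range(n_samples)
    )
    return correct / n_samples

def filter_pool(pool: list[Problem], baseline_models, max_pass_rate: float = 0.25):
    """Discard problems that any baseline model solves too easily,
    leaving only genuinely challenging candidates for manual curation."""
    survivors = []
    for problem in pool:
        if all(solves(m, problem) <= max_pass_rate for m in baseline_models):
            survivors.append(problem)
    return survivors
```

After this automated pass, the surviving candidates would still go through the manual curation, expert review, and model-based ranking steps described above before reaching the final 817.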
LIMO’s dataset does not just provide correct answers; it pairs each problem with a detailed reasoning chain. Each solution is structured to guide the model through a step-by-step logical process, emphasizing verification, self-reflection, and structured problem decomposition. This approach ensures that even with a small dataset, the model learns not merely to memorize solutions but to develop generalizable reasoning skills.
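An illustration of what one such training example might look like is below. The field names and the problem itself are made up for this post, not taken from the released dataset, but the shape (decomposition, derivation with self-correction, and verification) mirrors the structure described above.

```python
# Illustrative structure of one LIMO-style training example. Field names
# are assumptions, not the released dataset's actual schema.
example = {
    "problem": "Find the number of positive integers n <= 100 such that "
               "n^2 + n is divisible by 6.",
    "solution": (
        "Step 1 (decompose): n^2 + n = n(n + 1), a product of consecutive "
        "integers, so it is always even; divisibility by 6 reduces to "
        "divisibility by 3.\n"
        "Step 2 (derive): n(n + 1) is divisible by 3 except when n = 1 (mod 3). "
        "Among 1..100, the values n = 1, 4, ..., 100 give 34 exceptions, "
        "so 100 - 34 = 66 values remain.\n"
        "Step 3 (verify): check small cases: n = 2 gives 6 (yes), n = 3 gives "
        "12 (yes), n = 1 gives 2 and n = 4 gives 20 (both no), "
        "consistent with the rule."
    ),
    "answer": "66",
}
```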
Another recent dataset, s1K (built for test-time scaling research), shares the same principle: high-quality, well-curated examples matter more than sheer quantity. The s1K dataset was constructed by selecting 1,000 problems from a larger pool of 59,000, again emphasizing quality, diversity, and difficulty.

LIMO vs Traditional Scaling Approaches

LIMO stands in stark contrast to existing approaches that emphasize data quantity and model scaling. Traditional supervised fine-tuning methods rely on vast datasets, often exceeding 100,000 examples, but face diminishing returns as models struggle to generalize beyond memorized patterns. Reinforcement learning scaling, exemplified by models like DeepSeek-R1, takes a different approach by using large-scale optimization to explore reasoning trajectories. While effective, RL scaling requires extensive computational resources and treats reasoning as a problem of trajectory discovery rather than activation.
LIMO, on the other hand, takes a fundamentally different stance. It assumes that sophisticated reasoning capabilities are already embedded within pre-trained models and that the key challenge is activating them through carefully designed cognitive templates. By directly constructing high-quality reasoning trajectories, LIMO achieves remarkable reasoning efficiency without the need for computationally expensive RL exploration.
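One reason this is appealing is that supervised fine-tuning on 817 examples is cheap enough to sketch almost in full. The snippet below shows a bare-bones causal-LM fine-tuning loop; the base model, prompt formatting, and hyperparameters are placeholders rather than the paper's actual configuration.

```python
# Minimal causal-LM fine-tuning sketch over a small curated dataset.
# Model name, formatting, and hyperparameters are illustrative assumptions.
import torch
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder, not LIMO's base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def to_batch(ex):
    """Concatenate the problem and its full reasoning chain into one sequence."""
    text = f"Problem: {ex['problem']}\nSolution: {ex['solution']}"
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)
    enc["labels"] = enc["input_ids"].clone()  # standard next-token objective
    return enc

# Stand-in for the 817 curated examples.
dataset = [{"problem": "1 + 1 = ?", "solution": "Step 1: 1 + 1 = 2. Answer: 2."}]
optimizer = AdamW(model.parameters(), lr=1e-5)

model.train()
for epoch in range(3):  # a few epochs suffice for so little data
    for ex in dataset:
        batch = to_batch(ex)
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

Compared with an RL pipeline that must sample and score many candidate trajectories per problem, this is a single gradient pass over a few hundred sequences, which is the efficiency argument LIMO makes.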

Independent but Parallel Breakthroughs in AI Reasoning

Interestingly, LIMO’s findings align with another recent result, s1, which fine-tunes a model on 1,000 highly curated samples and dynamically forces it to think longer at inference time by appending "Wait" tokens whenever it tries to stop. In both cases, researchers independently arrived at the conclusion that improving reasoning performance does not necessarily require more data. Instead, both approaches focus on leveraging existing model knowledge more effectively: LIMO through minimal but carefully curated training examples, and test-time scaling through real-time adjustments to computational effort during inference.
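A simplified sketch of that test-time trick is shown below: generate, and whenever the model emits an end-of-sequence token before a thinking budget is spent, strip it, append "Wait," and resume. The model choice, budget values, and stop-detection logic are assumptions for illustration, not s1's exact implementation.

```python
# Simplified sketch of s1-style "budget forcing": if the model stops
# before a minimum token budget is used, append "Wait" and keep generating.
# Model, budgets, and stopping logic here are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def generate_with_budget(prompt: str, min_thinking_tokens: int = 512,
                         chunk_tokens: int = 256) -> str:
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    start = ids.shape[1]
    while ids.shape[1] - start < min_thinking_tokens:
        ids = model.generate(ids, max_new_tokens=chunk_tokens,
                             pad_token_id=tokenizer.eos_token_id)
        if ids[0, -1].item() == tokenizer.eos_token_id:
            # Model tried to stop early: drop EOS and force more thinking.
            ids = ids[:, :-1]
            wait = tokenizer(" Wait,", return_tensors="pt").input_ids
            ids = torch.cat([ids, wait], dim=1)
    return tokenizer.decode(ids[0, start:], skip_special_tokens=True)
```

The appeal is that this knob trades extra inference compute for accuracy without touching the weights, which is complementary to LIMO's data-side intervention.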
This kind of independent convergence has been seen before in scientific research. In mathematics, multiple groundbreaking theorems have been developed independently by different researchers working with similar underlying concepts. The simultaneous discovery of calculus by Newton and Leibniz, or the independent formulation of evolutionary theory by Darwin and Wallace, are classic examples of how certain ideas emerge naturally once the conditions for discovery are met. In AI, it appears that the field has reached a stage where data efficiency and reasoning activation have become central questions, leading different research groups to arrive at similar insights from different angles.

The Future of Data-Efficient Reasoning

LIMO’s success opens up new possibilities for AI research, particularly in data-efficient training methods. Future work could extend the LIMO approach to other reasoning-intensive domains, such as scientific discovery, logical deduction, and causal inference. Additionally, these findings raise important questions about the theoretical limits of reasoning in LLMs—how much pre-training is truly necessary, and what are the minimal conditions required to activate sophisticated cognitive abilities?
Beyond research, the practical implications of LIMO are profound. If complex reasoning can be elicited with minimal fine-tuning, AI systems could become far more accessible, requiring less computational power and training data to achieve high performance. This could accelerate AI adoption in fields like education, scientific research, and industry, where robust reasoning is crucial but large-scale training data may be limited.
As AI continues to evolve, discoveries like LIMO and test-time scaling suggest a paradigm shift in how we approach intelligence in machines. Rather than viewing reasoning as a function of raw data and scale, these findings point toward a future where intelligence emerges from the strategic use of knowledge and the refinement of cognitive processes.
Tags: ML News