
S1: Achieving Test-Time Scaling with Just 1,000 Examples

Created on February 5 | Last edited on February 5
A team of researchers has introduced a new way to improve language model reasoning, showing that test-time scaling can dramatically boost accuracy using minimal extra training. Their method, described in s1: Simple Test-Time Scaling, allows a model to control how much it "thinks" before finalizing an answer. Instead of relying on massive pretraining, this technique fine-tunes an existing model with a small but highly effective dataset and introduces budget forcing, a simple intervention that improves reasoning on demand.

How does the test-time scaling algorithm work?

Traditional language models decide on their own when to stop generating, so the amount of computation spent on a problem is effectively fixed at inference time. Test-time scaling removes this limit, allowing the model to spend more time solving complex problems when needed. This is done using budget forcing, a technique that gives direct control over how long the model spends reasoning before producing an answer.
Budget forcing works by modifying the decoding process. Instead of letting the model freely decide when to stop thinking, the system sets a minimum and maximum threshold for reasoning steps. If the model reaches the maximum, it is forced to stop and provide an answer. If it tries to stop before reaching the minimum, the system forces it to continue thinking by modifying its output.
This is where the "Wait" trick comes in. Normally, when a model is ready to give its answer, it generates an end-of-thinking token like "Final Answer:" or "I'm done." Budget forcing monitors for this. If the model tries to generate this token too soon—before reaching the required reasoning depth—the system blocks the stop token and instead appends "Wait" to the model's output. This tricks the model into continuing its reasoning, often leading to self-correction and a more refined answer.
If the model is solving a math problem and initially gets the wrong answer, the forced "Wait" intervention pushes it to reconsider its reasoning before finalizing. On the other hand, if the model overthinks and produces too many reasoning steps, the system cuts it off at the predefined limit, forcing it to provide an answer.
This method allows fine control over test-time computation without retraining the model or relying on external human intervention. By simply adjusting how long the model is allowed to think, performance can be improved dynamically—even after training is complete.
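
To make the control flow concrete, here is a minimal Python sketch of the idea. It is an illustration under simplifying assumptions, not the paper's implementation: `model.generate` is a hypothetical helper, the `END_OF_THINKING` marker is a placeholder, and token counts are approximated with word counts.

```python
# Minimal sketch of budget forcing; a schematic, not the authors' code.
# `model.generate` is a hypothetical helper that returns newly generated text,
# stops just before emitting `stop`, and respects `max_new_tokens`.
# Word counts stand in for token counts purely for illustration.

END_OF_THINKING = "Final Answer:"  # placeholder end-of-thinking delimiter


def budget_forced_generate(model, prompt, min_thinking=512, max_thinking=4096):
    trace = ""
    while True:
        budget_left = max_thinking - len(trace.split())
        chunk = model.generate(prompt + trace,
                               stop=END_OF_THINKING,
                               max_new_tokens=budget_left)
        trace += chunk

        if len(trace.split()) >= max_thinking:
            break  # maximum budget reached: force the model to answer now
        if len(trace.split()) < min_thinking:
            # The model tried to stop too early: suppress the end-of-thinking
            # delimiter and append "Wait" so it keeps reasoning.
            trace += " Wait,"
            continue
        break  # enough reasoning done: move on to the answer

    # Append the delimiter ourselves and let the model write the final answer.
    answer = model.generate(prompt + trace + "\n" + END_OF_THINKING,
                            max_new_tokens=512)
    return trace, answer
```

The key design point is that the intervention lives entirely in the decoding loop: the same trained model produces shorter or longer reasoning traces depending only on the `min_thinking` and `max_thinking` settings chosen at inference time.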

Where does the reasoning dataset come from?

The dataset, called s1K, was created by selecting 1,000 carefully curated reasoning problems from a larger pool of 59,000 examples. These problems came from diverse sources, including math competitions like AIME and Olympiads, standardized tests such as the SAT and LSAT, and even Ph.D.-level science questions. The researchers also created two new datasets from Stanford’s Statistics Ph.D. qualifying exams and quantitative brain-teasers used in finance interviews.
To generate reasoning traces for these problems, the team used Google’s Gemini Flash Thinking API, which produced detailed step-by-step solutions. The pool was then filtered on three criteria: quality, removing low-quality responses and examples with formatting errors; difficulty, keeping only genuinely challenging problems; and diversity, ensuring a broad mix of questions across subjects.
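
As a rough picture of that filtering flow, the sketch below walks the 59,000-example pool through the three stages. The helper functions (`is_well_formatted`, `is_hard_enough`, `subject_of`) are hypothetical stand-ins for the paper's actual quality, difficulty, and diversity checks, which rely on model-based evaluation.

```python
# Schematic curation pipeline for s1K; a sketch, not the authors' code.
# is_well_formatted, is_hard_enough, and subject_of are hypothetical
# placeholders for the quality, difficulty, and diversity checks in the paper.
from collections import defaultdict


def curate_s1k(examples, target_size=1000):
    # Quality: drop traces with API errors or broken formatting.
    clean = [ex for ex in examples if is_well_formatted(ex)]

    # Difficulty: keep only problems that are genuinely challenging.
    hard = [ex for ex in clean if is_hard_enough(ex)]

    # Diversity: group by subject and pick round-robin so that
    # no single domain dominates the final 1,000 examples.
    by_subject = defaultdict(list)
    for ex in hard:
        by_subject[subject_of(ex)].append(ex)

    selected = []
    buckets = list(by_subject.values())
    while len(selected) < target_size and any(buckets):
        for bucket in buckets:
            if bucket and len(selected) < target_size:
                selected.append(bucket.pop())
    return selected
```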
After this filtering, only 1,000 high-quality problems remained, forming the s1K dataset, which was used to fine-tune the Qwen2.5-32B model.

How well does the model perform?

Despite being trained on only 1,000 examples, the s1-32B model outperforms larger, proprietary models. On the AIME24 math benchmark, it scores 12.1 percentage points higher than o1-preview, a roughly 27% relative improvement. The researchers also found that scaling test-time compute can outperform traditional model scaling, making s1-32B one of the most sample-efficient reasoning models available.
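
As a quick sanity check on that relative figure, working backwards from the rounded numbers quoted above (a 12.1-point gain corresponding to about a 27% relative increase) implies a baseline score of roughly 45 points for o1-preview:

```latex
% Back-of-the-envelope check using only the figures quoted above.
\text{relative gain} = \frac{\Delta}{\text{baseline}}
\;\Rightarrow\;
\text{baseline} \approx \frac{12.1}{0.27} \approx 45
```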


What are the limitations?

While budget forcing improves accuracy, it has two main limitations. First, gains eventually plateau: appending more "Wait" tokens stops yielding better results after a point. Second, the model is still bound by its context window: if the reasoning trace grows too long, its quality degrades or it simply gets cut off.
To push test-time scaling further, researchers may explore reinforcement learning methods or hybrid strategies like Monte Carlo Tree Search, which allow models to simulate multiple possible reasoning paths before choosing the best one.

What does this mean for the future of AI?

This research challenges the assumption that bigger models and more training data are always necessary to improve reasoning. Instead, test-time scaling shows that reasoning can be enhanced dynamically, even after training, with minimal extra data.
By making their model, dataset, and code fully open, the researchers have also set a new standard for transparency in AI development, allowing others to replicate and improve upon their findings. This could lead to AI systems that reason more effectively without requiring massive training resources, making advanced AI more accessible to researchers and developers worldwide.
Tags: ML News