
Budget forcing s1-32B: Waiting is all you need?

We test whether budget forcing - a simple test-time intervention - can significantly boost the reasoning accuracy of s1-32B, potentially enabling smaller models to rival closed-source giants like OpenAI's o1-preview. This is a translated version of the article. Feel free to report any possible mistranslations in the comments section.
Created on August 26|Last edited on August 26
The s1-32B model, built off Qwen 2.5 32B Instruct, is being presented as a breakthrough in reasoning efficiency - claiming to achieve OpenAI o1-level performance using just 1,000 training examples and a simple decoding trick called "budget forcing." Unlike o1, which relies on reinforcement learning and massive compute, s1-32B is built by fine-tuning an existing model on a small, carefully curated dataset. The claim is that this minimalist approach not only matches closed models in competitive reasoning tasks but also demonstrates true test-time scaling - where increasing compute at inference directly improves accuracy.
Instead of relying on more training data or complex model architectures, s1-32B controls its own reasoning process at test time. If the model tries to stop thinking too soon, it’s forced to continue by appending "Wait" to its output, encouraging deeper reasoning. This trick can be continued for about 6 "waits" until performance starts to level off. According to the authors, this method allows the model to self-correct mistakes and scale its reasoning dynamically.





How well does "waiting" work?

On benchmarks like AIME24, the authors claim s1-32B outperforms OpenAI’s o1-preview, reaching 56.7% accuracy - better than models trained on vastly larger datasets. But does this actually hold up? That’s what we're going to find out.
The approach behind s1-32B is surprisingly simple. Instead of training on massive datasets or using reinforcement learning like OpenAI’s o1, the authors fine-tune an existing model (Qwen2.5-32B-Instruct) on just 1,000 carefully selected reasoning problems. This dataset, called s1K, is designed to maximize difficulty, diversity, and quality. They start with a much larger set of 59,000 reasoning problems from sources like math competitions and PhD-level science exams.
Then, they filter it down in three steps:
  1. removing low-quality samples,
  2. keeping only difficult problems that models struggle with, and
  3. ensuring a wide range of topics.
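The three filtering steps above can be sketched as a small pipeline. This is a toy illustration, not the authors' actual selection code: the field names (`solution`, `topic`, `solved_by_reference_model`) and the round-robin sampling across topics are assumptions made for the example.

```python
from collections import defaultdict

def filter_pool(pool, target_size):
    """Toy three-stage filter: quality, then difficulty, then topic diversity."""
    # Stage 1 (quality): drop samples with an empty question or solution.
    clean = [p for p in pool if p["question"].strip() and p["solution"].strip()]
    # Stage 2 (difficulty): keep only problems a reference model failed to solve.
    hard = [p for p in clean if not p["solved_by_reference_model"]]
    # Stage 3 (diversity): pick round-robin across topics until the target is met.
    by_topic = defaultdict(list)
    for p in hard:
        by_topic[p["topic"]].append(p)
    selected = []
    while len(selected) < target_size and any(by_topic.values()):
        for topic in sorted(by_topic):
            if by_topic[topic] and len(selected) < target_size:
                selected.append(by_topic[topic].pop(0))
    return selected

# Hypothetical five-problem pool standing in for the 59,000-problem source set.
pool = [
    {"question": "Q1", "solution": "S1", "topic": "algebra", "solved_by_reference_model": False},
    {"question": "Q2", "solution": "", "topic": "algebra", "solved_by_reference_model": False},
    {"question": "Q3", "solution": "S3", "topic": "geometry", "solved_by_reference_model": True},
    {"question": "Q4", "solution": "S4", "topic": "geometry", "solved_by_reference_model": False},
    {"question": "Q5", "solution": "S5", "topic": "algebra", "solved_by_reference_model": False},
]
subset = filter_pool(pool, target_size=2)
print([p["question"] for p in subset])
```

Here Q2 is dropped for quality, Q3 for being too easy, and the remaining problems are drawn one topic at a time so no single topic dominates the final set.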
Once the model is trained on this small dataset, they introduce budget forcing at test time to control how much the model "thinks" before giving an answer.
The idea is simple:
  1. If the model tries to stop reasoning too early, they force it to keep going by appending the word "Wait" to its output, making it generate more reasoning steps.
  2. If the model keeps going for too long, they cut it off with an end-of-thinking token, forcing it to wrap up its answer.
This method pushes the model to double-check its work and refine its reasoning before committing to a final response.
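The two rules above can be sketched as a short control loop. This is a minimal sketch rather than the authors' implementation: `generate` is a hypothetical callable that returns the generated text and its token count, `fake_generate` is a stand-in for a real model, and the `<end-of-thinking>` string is a placeholder for whatever delimiter the model actually uses.

```python
def budget_force(generate, prompt, num_waits, max_thinking_tokens,
                 end_of_thinking="<end-of-thinking>"):
    """Extend thinking up to `num_waits` times without exceeding the token budget."""
    remaining = max_thinking_tokens
    text, used = generate(prompt, remaining)
    trace, remaining = text, remaining - used
    for _ in range(num_waits):
        if remaining <= 0:
            break  # Rule 2: budget exhausted, so stop extending the reasoning.
        trace += "Wait"  # Rule 1: suppress the stop and nudge the model onward.
        text, used = generate(prompt + trace, remaining)
        trace += text
        remaining -= used
    # Close the thinking phase so the model moves on to its final answer.
    return trace + end_of_thinking

def fake_generate(prompt, max_tokens):
    """Stand-in model: emits up to three 'step' tokens per call."""
    words = ["step"] * min(3, max_tokens)
    return " " + " ".join(words), len(words)

forced = budget_force(fake_generate, "Q: 2+2?", num_waits=2, max_thinking_tokens=100)
capped = budget_force(fake_generate, "Q: 2+2?", num_waits=6, max_thinking_tokens=4)
print(forced)
print(capped)
```

With a generous budget, both requested "Wait" continuations are applied. With a budget of four tokens, only one continuation fits before the loop cuts the reasoning off.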
The key claim is that budget forcing leads to test-time scaling - where simply allowing the model to think longer improves accuracy. Their results show that s1-32B’s performance on AIME24 math problems improves from 50% to 57% just by increasing its reasoning time. Here are the results reported by the paper:


Testing methodology

To rigorously evaluate these claims about budget forcing and test-time scaling, I've set up a series of checks and comparisons that will help us understand the true impact of this technique.
The first check is for dataset contamination - whether s1-32B’s performance is inflated due to direct overlap between the AIME test set and the s1K training set. If AIME problems were included in s1K, the model's accuracy could be misleading, making it seem like it generalizes when it has actually memorized answers. Before evaluating budget forcing, we will compare AIME and s1K to confirm that the model wasn’t trained on the test set.
Next, we will examine the base model, Qwen2.5-32B-Instruct, to determine how much of s1-32B’s performance comes from fine-tuning versus the underlying model’s existing capabilities. If Qwen was already strong at mathematical reasoning or had seen similar problems during pretraining, then the improvements in s1-32B might be due to its strong starting point rather than budget forcing. We will compare Qwen’s baseline performance to s1-32B to measure the effect of fine-tuning alone.
Since Qwen’s accuracy on AIME24 is documented at 50%, we use this as a baseline reference, avoiding redundant evaluations.
Once we confirm that the dataset is clean and the base model’s abilities are accounted for, we test budget forcing itself. We will compare s1-32B in two modes: one with full budget forcing (allowing up to six “Wait” iterations) and one with no Wait iterations. Since both conditions use the same fine-tuned model, this isolates the effect of budget forcing. If budget forcing is the key factor, performance should drop significantly when it’s removed. If accuracy remains similar, it suggests that fine-tuning on s1K was the main driver of improvements.
Finally, we will compare s1-32B’s performance to OpenAI’s o1-preview, o3, and DeepSeek-R1 to see if it actually matches or exceeds these leading reasoning models. If s1-32B holds up, it supports the claim that budget forcing improves reasoning. If not, the method’s impact may have been overstated. These tests ensure we aren’t just reproducing reported numbers but verifying whether budget forcing is actually responsible for the results.

Checking for dataset contamination

Before running our main evaluations, we first check for dataset contamination - specifically, whether any problems from the AIME test set appear in the s1K training set. If there is overlap, the reported performance could be inflated due to memorization rather than genuine reasoning improvements. To systematically verify this, we use Weave to log the inputs and outputs of our inference function, which returns True if two questions are the same.
By leveraging Weave’s filtering capabilities, we can quickly visualize any cases where AIME problems match those in s1K. If contamination exists, these duplicate or near-identical problems will be flagged, showing direct overlap between the training and test sets. If no matches appear, we can proceed confidently, knowing that the model wasn’t trained on the test set.
Here's the code:
import asyncio
import os

import weave
from datasets import load_dataset
from litellm import completion

weave.init("dataset_comparison")

# Set OpenAI API key
os.environ["OPENAI_API_KEY"] = ""


def load_datasets():
    print("Loading datasets...")
    aime_dataset = load_dataset("Maxwell-Jia/AIME_2024")["train"]
    s1k_dataset = load_dataset("simplescaling/s1K")["train"]
    return aime_dataset, s1k_dataset


@weave.op
async def compare_questions(q1: str, q2: str) -> bool:
    """Ask the judge model whether two problems are the same; True means a match."""
    prompt = f"""Compare these two math problems and respond with ONLY the word 'same' if they are the same question (ignoring formatting differences or slight numerical differences/wording differences) or 'different' if they are different questions.

Question 1: {q1}
Question 2: {q2}"""

    response = completion(
        model="openai/gpt-4o-mini",
        messages=[{"content": prompt, "role": "user"}],
        max_tokens=10,
        temperature=0.1
    )
    return 'same' in response.choices[0].message.content.strip().lower()


async def run_comparison():
    aime_dataset, s1k_dataset = load_datasets()
    print("Starting comparisons...")
    # Compare every AIME problem against every s1K question
    for aime_q in aime_dataset:
        for s1k_q in s1k_dataset:
            result = await compare_questions(aime_q["Problem"], s1k_q["question"])

            if result:
                print("\nFound matching question!")
                print(f"AIME: {aime_q['Problem'][:100]}...")
                print(f"S1K: {s1k_q['question'][:100]}...")
                print("---")


if __name__ == "__main__":
    asyncio.run(run_comparison())

Luckily, our script did not detect any overlap between the AIME test set and the official s1K training set. We can easily filter by the output of our inference function inside Weave and see that no calls returned True!
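Because the LLM-based check compares every AIME problem against every s1K question, a cheap lexical pass can serve as an independent cross-check or as a pre-filter. The sketch below uses Python's standard `difflib`; the normalization rule and the 0.8 threshold are assumptions chosen for illustration and were not part of the run described above.

```python
import re
from difflib import SequenceMatcher

def normalize(text):
    """Lowercase the text and keep only letter/digit runs, joined by single spaces."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))

def similarity(a, b):
    """Return a 0..1 character-level similarity score between two questions."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def candidate_pairs(test_questions, train_questions, threshold=0.8):
    """List (test_index, train_index, score) for pairs at or above the threshold."""
    pairs = []
    for i, test_q in enumerate(test_questions):
        for j, train_q in enumerate(train_questions):
            score = similarity(test_q, train_q)
            if score >= threshold:
                pairs.append((i, j, score))
    return pairs

# Tiny made-up examples; real usage would pass the AIME and s1K question lists.
test_set = ["Find x if 2x + 3 = 7."]
train_set = ["How many primes are below 100?", "find  X if 2x+3=7"]
print(candidate_pairs(test_set, train_set))
```

A lexical score will miss paraphrased problems, so it complements the LLM comparison rather than replacing it.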


Testing budget forcing

Now we will write an evaluation script designed to test s1-32B’s performance, isolate the effects of budget forcing, and compare it against OpenAI’s o1-preview. The script runs multiple models through a structured evaluation pipeline, logs predictions, and scores accuracy using a consistent method.
We start by loading the AIME dataset and preparing it for evaluation. To establish a baseline, we first run Qwen2.5-32B-Instruct, the base model from which s1-32B is fine-tuned. This helps determine how much of s1-32B’s performance comes from fine-tuning versus its original capabilities.
Next, we will evaluate s1-32B in two conditions:
  1. One with full budget forcing (allowing up to six “Wait” iterations), and
  2. One without.
Both setups use the same fine-tuned model, so this directly isolates the impact of budget forcing. If budget forcing is responsible for performance improvements, we should see a noticeable accuracy drop when it is removed.
We will also run OpenAI’s o1-preview as an external benchmark. This allows us to compare s1-32B’s performance against a closed model trained with extensive compute and reinforcement learning. If s1-32B matches or exceeds o1-preview, it supports the claim that budget forcing enables a smaller fine-tuned model to compete with more resource-intensive approaches.
The evaluation process is managed using Weave, which logs model inputs, outputs, and correctness scores. The scoring function compares each model’s output to the ground truth answer using GPT-4o, ensuring consistency in evaluation. This structured approach allows us to efficiently compare models, track performance differences, and validate whether budget forcing is genuinely responsible for the reported gains.
Here's the code for the eval:
import asyncio
import json

import weave
from datasets import load_dataset
from openai import OpenAI
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

weave.init("aime_evaluation")

# Initialize OpenAI client
openai_client = OpenAI(api_key="your api key")

# Model constants
MAX_TOKENS_THINKING = 132000
NUM_IGNORE = 6  # Maximum number of "Wait" continuations
MODEL_NAME = "simplescaling/s1-32B"
JUDGE_MODEL = "gpt-4o-2024-08-06"


# Initialize model and tokenizer globally
model = LLM(
    "Qwen/Qwen2.5-32B-Instruct",
    tensor_parallel_size=2,
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-32B-Instruct")

class QwenBaseModel(weave.Model):
    @weave.op
    async def predict(self, text: str) -> str:
        """Run inference using base Qwen model without Wait iterations"""
        try:
            # Prepare the prompt
            prompt = f"<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\n{text}<|im_end|>\n<|im_start|>assistant\n"
            # Single thinking phase
            stop_token_ids = tokenizer("<|im_start|><|im_end|>")["input_ids"]
            sampling_params = SamplingParams(
                max_tokens=MAX_TOKENS_THINKING,
                min_tokens=0,
                stop_token_ids=stop_token_ids,
                skip_special_tokens=False,
                temperature=0.0,
            )
            prompt += "<|im_start|>"
            output = model.generate(prompt, sampling_params=sampling_params)
            # Final answer phase
            prompt += output[0].outputs[0].text
            stop_token_ids = tokenizer("<|im_end|>")["input_ids"]
            sampling_params = SamplingParams(
                max_tokens=32768,
                min_tokens=0,
                stop_token_ids=stop_token_ids,
                skip_special_tokens=False,
                temperature=0.0,
            )
            final_output = model.generate(prompt, sampling_params=sampling_params)
            return final_output[0].outputs[0].text
        except Exception as e:
            print(f"Failed to get Qwen response: {e}")
            return None


class S1ModelWithWait(weave.Model):
    @weave.op
    async def predict(self, text: str) -> str:
        """Run inference using S1 model with Wait iterations"""
        try:
            # Prepare the prompt
            prompt = f"<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\n{text}<|im_end|>\n<|im_start|>assistant\n"
            # First thinking phase
            stop_token_ids = tokenizer("<|im_start|><|im_end|>")["input_ids"]
            sampling_params = SamplingParams(
                max_tokens=MAX_TOKENS_THINKING,
                min_tokens=0,
                stop_token_ids=stop_token_ids,
                skip_special_tokens=False,
                temperature=0.0,
            )
            prompt += "<|im_start|>think"
            output = model.generate(prompt, sampling_params=sampling_params)
            # Handle multiple thinking iterations
            ignore_str = "Wait"
            max_tokens_thinking_tmp = MAX_TOKENS_THINKING
            for i in range(NUM_IGNORE):
                # Subtract the tokens already spent from the thinking budget
                tokens_used = len(output[0].outputs[0].token_ids)
                max_tokens_thinking_tmp -= tokens_used
                if max_tokens_thinking_tmp < 1000:  # Safe buffer
                    break
                # Append "Wait" so the model keeps reasoning instead of stopping
                prompt += output[0].outputs[0].text + ignore_str
                sampling_params = SamplingParams(
                    max_tokens=max_tokens_thinking_tmp,
                    min_tokens=1,
                    stop_token_ids=stop_token_ids,
                    skip_special_tokens=False,
                    temperature=0.0,
                )
                output = model.generate(prompt, sampling_params=sampling_params)
                # Stop early if the model only repeats text it already produced
                if i > 0 and output[0].outputs[0].text in prompt:
                    break
            # Final answer phase
            prompt += output[0].outputs[0].text
            stop_token_ids = tokenizer("<|im_end|>")["input_ids"]
            sampling_params = SamplingParams(
                max_tokens=32768,
                min_tokens=0,
                stop_token_ids=stop_token_ids,
                skip_special_tokens=False,
                temperature=0.0,
            )
            final_output = model.generate(prompt, sampling_params=sampling_params)
            return final_output[0].outputs[0].text
        except Exception as e:
            print(f"Failed to get S1 response: {e}")
            return None

class S1ModelNoWait(weave.Model):
    @weave.op
    async def predict(self, text: str) -> str:
        """Run inference using S1 model without any Wait iterations"""
        try:
            # Prepare the prompt
            prompt = f"<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\n{text}<|im_end|>\n<|im_start|>assistant\n"
            # Single thinking phase
            stop_token_ids = tokenizer("<|im_start|><|im_end|>")["input_ids"]
            sampling_params = SamplingParams(
                max_tokens=MAX_TOKENS_THINKING,
                min_tokens=0,
                stop_token_ids=stop_token_ids,
                skip_special_tokens=False,
                temperature=0.0,
            )
            prompt += "<|im_start|>think"
            output = model.generate(prompt, sampling_params=sampling_params)
            # Final answer phase
            prompt += output[0].outputs[0].text
            stop_token_ids = tokenizer("<|im_end|>")["input_ids"]
            sampling_params = SamplingParams(
                max_tokens=32768,
                min_tokens=0,
                stop_token_ids=stop_token_ids,
                skip_special_tokens=False,
                temperature=0.0,
            )
            final_output = model.generate(prompt, sampling_params=sampling_params)
            return final_output[0].outputs[0].text
        except Exception as e:
            print(f"Failed to get S1 response: {e}")
            return None

def run_o1_inference(prompt: str) -> str:
    """Run inference using OpenAI o1-preview"""
    try:
        response = openai_client.chat.completions.create(
            model="o1-preview",
            messages=[
                {"role": "user", "content": f"Solve the following problem. put your final answer within \\boxed{{}}:\n{prompt}"}
            ]
        )
        return response.choices[0].message.content
    except Exception as e:
        print(f"Failed to get o1-preview response: {e}")
        return None


class O1PreviewModel(weave.Model):
    @weave.op
    async def predict(self, text: str) -> str:
        print("running o1-preview inference")
        return run_o1_inference(text)

@weave.op
async def gpt4o_scorer(label: str, model_output: str) -> dict:
    """Score the model's output by comparing it with the ground truth."""
    query = f"""
YOU ARE A LLM JUDGE DETERMINING IF THE FOLLOWING MODEL GENERATED ANSWER IS THE SAME AS THE CORRECT ANSWER
I WILL GIVE YOU THE LAST 100 CHARS OF THE MODEL'S REASONING PATH, WHICH WILL CONTAIN THE FINAL ANSWER ->

Model's Answer (last 100 chars): {str(model_output)[-100:]}
Correct Answer: {label}
Your task:
1. State the model's predicted answer (answer only).
2. State the ground truth (answer only).
3. Determine if the model's final answer is correct (ignore formatting differences, etc.). RESPOND with the predicted and ground truth answer, followed with a JSON object containing the correctness encapsulated within the following delimiters:
```json
{{ "correctness": true/false }}
```
"""
    try:
        response = openai_client.chat.completions.create(
            model=JUDGE_MODEL,
            messages=[{"role": "user", "content": query}]
        )
        response_text = response.choices[0].message.content
        # Pull the JSON verdict out of the fenced block in the judge's reply
        json_start = response_text.index("```json") + 7
        json_end = response_text.index("```", json_start)
        correctness = json.loads(response_text[json_start:json_end].strip()).get("correctness", False)
        return {"correctness": correctness, "reasoning": response_text}
    except Exception as e:
        print(f"Scoring failed: {e}")
        return {"correctness": False, "reasoning": str(e)}

def load_ds():
    print("Loading AIME dataset...")
    dataset = load_dataset("Maxwell-Jia/AIME_2024")["train"]
    return [{"text": row["Problem"], "label": row["Answer"]} for row in dataset]

async def run_evaluations():
    global model, tokenizer
    print("Loading dataset...")
    dataset = load_ds()
    print("Preparing dataset for evaluation...")
    dataset_prepared = [{"text": row["text"], "label": row["label"]} for row in dataset]

    # First evaluate base Qwen
    print("Starting Qwen evaluation...")
    scorers = [gpt4o_scorer]
    evaluation = weave.Evaluation(
        dataset=dataset_prepared,
        scorers=scorers,
        name="qwen-base Evaluation"
    )
    results = await evaluation.evaluate(QwenBaseModel())
    print(f"Results for base Qwen: {results}")

    # Switch to S1 model
    # Note: this creates a second vLLM engine while the Qwen engine is still
    # referenced. On limited GPU memory you may need to release the first model
    # or run the two evaluations as separate processes.
    print("\nSwitching to S1-32B model...")
    model = LLM(
        "simplescaling/s1-32B",
        tensor_parallel_size=4,
    )
    tokenizer = AutoTokenizer.from_pretrained("simplescaling/s1-32B")

    # Now evaluate S1 variations and o1
    test_models = {
        "s1-32b-wait": S1ModelWithWait(),
        "s1-32b-nowait": S1ModelNoWait(),
        "o1-preview": O1PreviewModel()
    }

    for model_name, model_instance in test_models.items():
        print(f"\nEvaluating {model_name}...")
        evaluation = weave.Evaluation(
            dataset=dataset_prepared,
            scorers=scorers,
            name=f"{model_name} Evaluation"
        )
        results = await evaluation.evaluate(model_instance)
        print(f"Results for {model_name}: {results}")


if __name__ == "__main__":
    asyncio.run(run_evaluations())
The evaluation is built on the Weave library, which provides a structured way to run model comparisons and benchmarking. Weave handles the parallel execution of model predictions and scoring, making it efficient to evaluate multiple models on the same dataset. In our implementation, each model is wrapped in a Weave Model class with an async predict method. For S1-32B, this method manages the complex thinking process with or without Wait iterations, while for o1-preview it handles the API calls to OpenAI.
The scoring process uses gpt4o_scorer, which takes each model's output and the correct answer, then uses GPT-4o to determine if the answer is correct. This approach is particularly important for math problems where the same answer might be expressed in different ways. The scorer looks at the last 100 characters of the model's output to find the final answer, making it robust to variations in reasoning format.
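Since AIME answers are integers, a deterministic check can be run alongside the LLM judge as a sanity test. The sketch below is an assumption-laden alternative, not the scorer used in this evaluation: it assumes the model wraps its final answer in `\boxed{...}`, which only the o1-preview prompt above explicitly requests, and the helper names are hypothetical.

```python
import re

def extract_boxed(text):
    """Return the contents of the last boxed{...} expression in text, or None."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def is_correct(model_output, label):
    """Compare the boxed answer with the label as integers; False if unparseable."""
    predicted = extract_boxed(model_output or "")
    if predicted is None:
        return False
    try:
        return int(predicted) == int(label)
    except (ValueError, TypeError):
        return False

print(is_correct("First \\boxed{7}, but on reflection \\boxed{042}", 42))
print(is_correct("The answer is 42.", 42))
```

Outputs without a boxed answer are scored as incorrect here, which is why this works better as a cross-check on the judge than as a replacement for it.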
Our evaluation pipeline first loads the AIME dataset using the Hugging Face datasets library, then prepares it for evaluation. The Weave Evaluation class handles the coordination between model predictions and scoring. Each model gets its own evaluation run over the same dataset with the same scoring criteria, ensuring fair comparisons.
Here are the results:

The results show that:
  • O3 Mini with the "high" reasoning parameter set achieved the highest accuracy at 0.867, reflecting its extensive reasoning effort.
  • DeepSeek-R1 followed with 0.767.
  • S1 with budget forcing scored 0.667.
  • S1 "No-Wait", which skipped the wait technique and performed a single reasoning pass, scored 0.533.
  • O1 Preview trailed at 0.500.
I had previously benchmarked O3 Mini and DeepSeek-R1, so using Weave Evaluations was really handy for quickly comparing these new results against my earlier tests. Qwen has reported that the base model performs at 50% accuracy on this benchmark. Interestingly, S1's performance on this eval is 10 percentage points higher than what they reported in the paper (66.7% versus 56.7%).
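AIME 2024 has 30 problems, so it helps to translate these accuracies into solved-problem counts. The short calculation below uses only the scores listed above and the 56.7% figure from the paper; the dictionary keys are labels chosen for this example.

```python
NUM_PROBLEMS = 30  # AIME 2024: two exams of 15 problems each

accuracies = {
    "O3 Mini (high)": 0.867,
    "DeepSeek-R1": 0.767,
    "S1 (Wait)": 0.667,
    "S1 (No-Wait)": 0.533,
    "O1 Preview": 0.500,
}

# Convert each accuracy into the number of problems solved.
solved = {name: round(acc * NUM_PROBLEMS) for name, acc in accuracies.items()}
for name, count in solved.items():
    print(f"{name}: {count}/{NUM_PROBLEMS}")

# Gap attributable to budget forcing, and gap versus the paper's reported score.
wait_gain = solved["S1 (Wait)"] - solved["S1 (No-Wait)"]
paper_solved = round(0.567 * NUM_PROBLEMS)
print(f"Budget forcing gain: {wait_gain} problems")
print(f"This eval vs. paper: {solved['S1 (Wait)'] - paper_solved} problems")
```

On a 30-problem benchmark each problem is worth about 3.3 percentage points, so the gap between the Wait and No-Wait runs comes down to four problems.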

The Weave comparisons view

Once evaluations are complete, Weave organizes results into an interactive dashboard. This powerful tool enables you to:
  • Compare model outputs side by side,
  • Filter results by specific parameters,
  • Trace inputs and outputs for every function call.
The dashboard simplifies debugging and provides deep insights into model performance, making Weave an indispensable tool for tracking and refining large language models.
For reasoning-focused tasks like those tackled by S1, the comparisons view offers a step-by-step trace of each model’s decision-making process. This feature makes it easy to identify logical missteps, errors in interpreting prompts, or areas where one model outperforms another, such as the O1-Mini.
By analyzing these granular outputs, you can better understand why models succeed or fail on specific tasks. This insight is invaluable for diagnosing issues, improving models, and tailoring their capabilities for complex reasoning scenarios. For this eval, it was really interesting to see the examples where S1 succeeded using budget forcing, and the base model failed!
Here’s a screenshot of what the comparison view looks like:


Conclusion

A 13-point boost in accuracy (from 0.533 to 0.667) from a simple test-time intervention like budget forcing is significant, especially given that it doesn’t require additional training data or architectural changes. The fact that S1-32B jumped from near-baseline performance to outperforming OpenAI’s o1-preview by such a large margin suggests that structured, iterative reasoning can be leveraged far more effectively than previously thought.
Given these results, it’s likely that closed-source models will adopt similar techniques, refining them with proprietary optimizations. If budget forcing can be applied to the reasoning process of models like O3 Mini - which already achieved 86.7% - there’s a real possibility of reaching a perfect score on AIME in the near future. Theoretically, that would demonstrate that test-time scaling alone might be enough to close the gap on top-tier reasoning benchmarks.