
Training GPT-4o to reason: Fine-tuning vs budget forcing

Can fine-tuning and budget forcing improve GPT-4o’s reasoning? We test structured datasets and inference-time techniques to boost multi-step problem-solving.
OpenAI’s GPT-4o is already one of the most capable language models, but can it be transformed into an even stronger reasoning engine through targeted fine-tuning? That’s the question this experiment aims to answer.
Using a dataset designed for complex multi-step reasoning, we evaluate whether fine-tuning can push GPT-4o’s performance beyond state-of-the-art models like OpenAI’s o1-preview. To measure this, we test it on AIME (American Invitational Mathematics Examination), a benchmark known for its challenging problems that require deep logical reasoning and structured problem-solving.


Using the s1K dataset to fine-tune GPT-4o for complex reasoning

The dataset we'll be working with, s1K, is a carefully curated selection of high-difficulty reasoning problems. It was originally used to fine-tune s1-32B (built on Qwen 2.5 32B), a model that outperforms some of OpenAI’s closed models, such as o1-preview, through test-time scaling techniques like budget forcing.
Instead of training on massive datasets, the s1 approach focused on maximizing the quality and complexity of its training data, proving that small, well-selected datasets can dramatically improve model reasoning. Now, we’re applying that same dataset to fine-tune GPT-4o and see whether it can achieve similar or better improvements in structured reasoning.
Beyond fine-tuning, we’ll also test budget forcing, a decoding strategy that encourages the model to extend its reasoning process before reaching a final answer (explored later). By combining fine-tuning and inference-time optimizations, we aim to determine how effectively GPT-4o can be enhanced for multi-step reasoning tasks.

Preparing the s1K dataset to fine-tune GPT-4o

Fine-tuning a model like GPT-4o requires careful dataset preparation, not only to maximize reasoning performance but also to control costs. GPT-4o’s fine-tuning pricing is $25 per million tokens, meaning that fine-tuning on our full 5 million-token dataset would cost approximately $125 per training run. Given this, I started with smaller-scale experiments using 10% of the dataset and GPT-4o Mini, allowing for quick iterations before committing to full fine-tuning.
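Before committing to a paid run, it also helps to sanity-check the token count and cost locally. Here's a minimal sketch of such an estimate; it assumes tiktoken's o200k_base encoding (the tokenizer used by GPT-4o) and the training JSONL produced by the preparation script later in this section, and it ignores per-message formatting overhead, so treat the result as a rough lower bound:

import json
import tiktoken

PRICE_PER_M_TOKENS = 25.0  # USD per million training tokens for GPT-4o fine-tuning
N_EPOCHS = 1

# GPT-4o uses the o200k_base encoding
enc = tiktoken.get_encoding("o200k_base")

total_tokens = 0
with open("fullnew_s1K_openai_finetune_train.jsonl") as f:
    for line in f:
        example = json.loads(line)
        for message in example["messages"]:
            total_tokens += len(enc.encode(message["content"]))

estimated_cost = total_tokens / 1_000_000 * PRICE_PER_M_TOKENS * N_EPOCHS
print(f"~{total_tokens:,} training tokens -> approx. ${estimated_cost:.2f} for {N_EPOCHS} epoch(s)")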
One of the key challenges I encountered early on was the model’s tendency to enter a never-ending thinking phase when asked to generate reasoning steps. This issue arose when the model failed to recognize when to stop, leading to uncontrolled output generation. To address this, I modified the dataset formatting by explicitly separating the "thinking" phase from the final answer and incorporating hard constraints on reasoning length.
Throughout this process, W&B Weave proved invaluable for debugging and visualizing the model’s chain of thought. Weave’s tracing capabilities made it possible to identify looping issues, overly verbose outputs, and potential inefficiencies without manually inspecting raw logs. This immediate feedback led to an optimized dataset format that better controls the reasoning phase and ensures structured outputs - a topic discussed in the next section.
Here’s a screenshot from Weave showing the model’s reasoning trace during one of these runs, where the model was unable to provide a final answer:


Formatting the s1K dataset

Each example in our dataset follows a structured multi-turn chat format, ensuring that the model generates structured reasoning before providing a final answer. The format is:
  • System message: Defines the model’s role as a helpful AI assistant.
  • User message: Contains the problem statement, followed by an instruction to think step by step within a fixed token budget.
  • Assistant response: Outputs the reasoning process, enclosed within <imstart>think and <imend> tags to distinguish it from the final answer.
  • User message: Prompts the model to provide its final response, ensuring that it does not continue reasoning indefinitely.
  • Assistant response: Gives the final answer, enclosed within <imstart>answer and <imend>.
This setup isn’t necessarily the most efficient solution, as a better approach might involve enforcing length constraints at the model level instead of structuring responses this way. However, for this specific use case, it provides a practical way to keep the model’s reasoning controlled while fitting within OpenAI’s chat-based format.
Here's the script used to prepare the dataset:
import json
import random
from datasets import load_dataset

# Load the Hugging Face dataset
dataset = load_dataset("simplescaling/s1K")["train"]

# Shuffle the dataset for randomness
random.seed(42)
dataset = list(dataset)  # Convert to list for shuffling
random.shuffle(dataset)

# Split into 95% training and 5% validation
split_idx = int(0.95 * len(dataset))
train_data, val_data = dataset[:split_idx], dataset[split_idx:]

# Function to format each example in the agreed multi-turn OpenAI structure
def format_example(example):
    question = example["question"].strip()
    thinking_steps = "\n".join(example["thinking_trajectories"]).strip()
    final_answer = example["attempt"].strip()

    # Ensure final answer is correctly formatted
    final_answer = "Answer: " + final_answer if not final_answer.startswith("Answer:") else final_answer

    # Construct multi-turn format
    formatted_messages = [
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": question + "\n Think for up to 8096 tokens."},
        {"role": "assistant", "content": "<imstart>think\n" + thinking_steps + "\n<imend>"},
        {"role": "user", "content": "Now provide the final answer. Respond in under 2048 tokens!!!"},
        {"role": "assistant", "content": "<imstart>answer\n" + final_answer + "\n<imend>"}
    ]

    return {"messages": formatted_messages}

# Apply formatting to training and validation sets
formatted_train_data = [format_example(example) for example in train_data]
formatted_val_data = [format_example(example) for example in val_data]

# Save as JSONL for OpenAI fine-tuning
train_output_file = "fullnew_s1K_openai_finetune_train.jsonl"
val_output_file = "fullnews1K_openai_finetune_val.jsonl"

# Save training data
with open(train_output_file, "w") as f:
    for entry in formatted_train_data:
        json.dump(entry, f)
        f.write("\n")

# Save validation data
with open(val_output_file, "w") as f:
    for entry in formatted_val_data:
        json.dump(entry, f)
        f.write("\n")

print(f"Training dataset saved as {train_output_file}")
print(f"Validation dataset saved as {val_output_file}")

The decision to add explicit token constraints in the user instructions was influenced by findings in the s1 paper, which explored different methods for controlling reasoning length at test time. While the paper found that models often ignored direct token limits, I found during this experiment that GPT-4o responded well to explicit output constraints, making this a viable strategy for ensuring structured reasoning without excessive generation.
By applying this formatting, the dataset remains structured, cost-effective, and optimized to avoid runaway responses. With the dataset ready, the next step is to run fine-tuning trials and evaluate how well GPT-4o adapts to structured reasoning tasks.
Here's a sample of what the dataset looks like:
{
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful AI assistant."
    },
    {
      "role": "user",
      "content": "What is the sum of the first 100 positive integers? Think for up to 2048 tokens."
    },
    {
      "role": "assistant",
      "content": "<imstart>think\n{the reasoning trace}\n<imend>"
    },
    {
      "role": "user",
      "content": "Now provide the final answer. Respond in under 256 tokens."
    },
    {
      "role": "assistant",
      "content": "<imstart>answer\n{the answer}\n<imend>"
    }
  ]
}
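Before uploading, it's worth confirming that every line of the JSONL parses and follows the expected role sequence. This is just a quick, hypothetical sanity check written for this post (not an official OpenAI validator):

import json

EXPECTED_ROLES = ["system", "user", "assistant", "user", "assistant"]

def check_jsonl(path):
    with open(path) as f:
        for i, line in enumerate(f, start=1):
            example = json.loads(line)  # raises if the line isn't valid JSON
            roles = [m["role"] for m in example["messages"]]
            assert roles == EXPECTED_ROLES, f"Line {i}: unexpected role order {roles}"
            assert example["messages"][2]["content"].startswith("<imstart>think"), f"Line {i}: missing think tag"
            assert example["messages"][4]["content"].startswith("<imstart>answer"), f"Line {i}: missing answer tag"
    print(f"{path}: OK")

check_jsonl("fullnew_s1K_openai_finetune_train.jsonl")
check_jsonl("fullnews1K_openai_finetune_val.jsonl")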

Uploading the dataset and starting fine-tuning

Once the dataset was prepared, the training process was initiated using OpenAI’s fine-tuning console. The training and validation JSONL files were uploaded, and a training run was started for one epoch. This served as a baseline test to assess how well GPT-4o adapted to structured reasoning before making further refinements. With OpenAI’s system handling tokenization and training automatically, the process remained straightforward.
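The run above was launched from the console, but the same upload and job creation can be done programmatically with the OpenAI Python SDK. Here's a minimal sketch under that assumption, using the file names from the preparation script and a single epoch to mirror the baseline run:

from openai import OpenAI

client = OpenAI(api_key="your api key")

# Upload the training and validation files for fine-tuning
train_file = client.files.create(
    file=open("fullnew_s1K_openai_finetune_train.jsonl", "rb"),
    purpose="fine-tune"
)
val_file = client.files.create(
    file=open("fullnews1K_openai_finetune_val.jsonl", "rb"),
    purpose="fine-tune"
)

# Create a one-epoch fine-tuning job on GPT-4o
job = client.fine_tuning.jobs.create(
    model="gpt-4o-2024-08-06",
    training_file=train_file.id,
    validation_file=val_file.id,
    hyperparameters={"n_epochs": 1}
)
print(f"Started fine-tuning job: {job.id}")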


Evaluating the fine-tuned model

After training the model, the next step was evaluation. The assessment was conducted using Weave Evaluations, comparing the fine-tuned GPT-4o against the base GPT-4o to measure the impact of fine-tuning on reasoning performance. Additionally, a separate model class was created to test the budget forcing technique on top of the fine-tuned model.
Previously, the same evaluation had been run with o1-preview, and Weave made it possible to save those results for comparison against new models as they are tested.
The evaluation process followed a straightforward setup:
  • Load the AIME24 dataset and prepare it for inference
  • Run GPT-4o (base model) to establish a baseline performance
  • Run fine-tuned GPT-4o and Budget Forced GPT-4o on the same dataset
Here's the code for the eval:
import os
import asyncio
import json
from datasets import load_dataset
from openai import OpenAI
import weave; weave.init("aime_evaluation")

# Initialize OpenAI client
openai_client = OpenAI(api_key="your api key")

# Model constants
MODEL_BASE = "gpt-4o-2024-08-06"  # Stock model
MODEL_FINETUNED = "ft: my fine tuned model id"  # Fine-tuned model
JUDGE_MODEL = "gpt-4o-2024-08-06"


class BaseGPTModel(weave.Model):
    @weave.op
    async def predict(self, text: str) -> str:
        """Run inference using stock GPT-4o model"""
        try:
            response = openai_client.chat.completions.create(
                model=MODEL_BASE,
                messages=[
                    {"role": "system", "content": "You are a helpful assistant."},
                    {"role": "user", "content": f"{text}\nThink for up to 8096 tokens."}
                ],
                temperature=0.0
            )
            return response.choices[0].message.content
        except Exception as e:
            print(f"Failed to get base model response: {e}")
            return None


class FineTunedGPTModel(weave.Model):
    @weave.op
    async def predict(self, text: str) -> str:
        """Run inference using fine-tuned GPT-4o model"""
        try:
            # First, get thinking steps
            think_response = openai_client.chat.completions.create(
                model=MODEL_FINETUNED,
                messages=[
                    {"role": "system", "content": "You are a helpful assistant."},
                    {"role": "user", "content": f"{text}\nThink for up to 8096 tokens."}
                ],
                temperature=1.0
            )
            thinking_steps = think_response.choices[0].message.content

            # Then, get final answer
            final_response = openai_client.chat.completions.create(
                model=MODEL_FINETUNED,
                messages=[
                    {"role": "system", "content": "You are a helpful assistant."},
                    {"role": "user", "content": text},
                    {"role": "assistant", "content": f"<imstart>think\n{thinking_steps}\n<imend>"},
                    {"role": "user", "content": "Now provide the final answer. Respond in under 2048 tokens!!!"}
                ],
                temperature=1.0
            )
            final_answer = final_response.choices[0].message.content

            # Combine thinking and answer
            full_response = f"""Thinking Steps:
{thinking_steps}

Final Answer:
{final_answer}"""
            return full_response
        except Exception as e:
            print(f"Failed to get fine-tuned model response: {e}")
            return None


class BudgetForcingGPTModel(weave.Model):
    @weave.op
    async def predict(self, text: str) -> str:
        """Run inference using budget forcing approach with fine-tuned GPT-4o"""
        try:
            # Constant for number of thinking iterations
            NUM_IGNORE = 6

            # Initialize messages
            messages = [
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": f"{text}\nThink for up to 8096 tokens."}
            ]

            # Thinking phase with budget forcing
            thinking_trace = ""
            for i in range(NUM_IGNORE):
                try:
                    response = openai_client.chat.completions.create(
                        model=MODEL_FINETUNED,
                        messages=messages,
                        max_tokens=8096,
                        temperature=1.0
                    )
                    thinking_text = response.choices[0].message.content
                    thinking_trace += thinking_text + "\nContinuing...\n"
                    messages.append({"role": "assistant", "content": thinking_text + "\nWait? Did I make a mistake? "})
                    messages.append({"role": "user", "content": "Continue your thinking where you left off, correcting any mistakes if there is any. Think for up to 8096 tokens."})
                    await asyncio.sleep(1)
                except Exception as e:
                    print(f"Error during reasoning step {i}: {e}")
                    break

            # Final answer phase
            messages.append({"role": "user", "content": "Now provide the final answer. Respond in under 2048 tokens!!!"})
            final_response = openai_client.chat.completions.create(
                model=MODEL_FINETUNED,
                messages=messages,
                max_tokens=2048,
                temperature=1.0
            )
            final_answer = final_response.choices[0].message.content

            # Combine thinking and answer
            full_response = f"""Thinking Steps:
{thinking_trace}

Final Answer:
{final_answer}"""
            return full_response
        except Exception as e:
            print(f"Failed to get budget forcing model response: {e}")
            return None


@weave.op
async def gpt4o_scorer(label: str, model_output: str) -> dict:
    """Score the model's output by comparing it with the ground truth."""
    try:
        # Extract the final answer section (last 100 characters)
        final_answer = (model_output.split("Final Answer:")[-1].strip() if "Final Answer:" in model_output else model_output)[-100:]
        query = f"""
YOU ARE A LLM JUDGE DETERMINING IF THE FOLLOWING MODEL GENERATED ANSWER IS THE SAME AS THE CORRECT ANSWER

Model's Answer: {final_answer}
Correct Answer: {label}
Your task:
1. State the model's predicted answer (answer only).
2. State the ground truth (answer only).
3. Determine if the model's final answer is correct (ignore formatting differences, etc.). RESPOND with the predicted and ground truth answer, followed with a JSON object containing the correctness encapsulated within the following delimiters:
```json
{{ "correctness": true/false }}
```
"""
        response = openai_client.chat.completions.create(
            model=JUDGE_MODEL,
            messages=[{"role": "user", "content": query}]
        )
        response_text = response.choices[0].message.content
        json_start = response_text.index("```json") + 7
        json_end = response_text.index("```", json_start)
        correctness = json.loads(response_text[json_start:json_end].strip()).get("correctness", False)
        return {"correctness": correctness, "reasoning": response_text}
    except Exception as e:
        print(f"Scoring failed: {e}")
        return {"correctness": False, "reasoning": str(e)}


def load_ds():
    print("Loading AIME dataset...")
    dataset = load_dataset("Maxwell-Jia/AIME_2024")["train"]
    return [{"text": row["Problem"], "label": row["Answer"]} for row in dataset]


async def run_evaluations():
    print("Loading dataset...")
    dataset = load_ds()
    print("Preparing dataset for evaluation...")
    dataset_prepared = [{"text": row["text"], "label": row["label"]} for row in dataset]

    scorers = [gpt4o_scorer]

    # Evaluate fine-tuned GPT-4o
    print("\nEvaluating fine-tuned GPT-4o...")
    finetuned_evaluation = weave.Evaluation(
        dataset=dataset_prepared,
        scorers=scorers,
        name="finetuned-gpt4o Evaluation"
    )
    finetuned_results = await finetuned_evaluation.evaluate(FineTunedGPTModel())
    print(f"Results for fine-tuned GPT-4o: {finetuned_results}")

    # Evaluate budget forcing GPT-4o
    print("\nEvaluating budget forcing GPT-4o...")
    budget_evaluation = weave.Evaluation(
        dataset=dataset_prepared,
        scorers=scorers,
        name="budget-forcing-gpt4o Evaluation"
    )
    budget_results = await budget_evaluation.evaluate(BudgetForcingGPTModel())
    print(f"Results for budget forcing GPT-4o: {budget_results}")

    # Evaluate base GPT-4o
    print("\nEvaluating base GPT-4o...")
    base_evaluation = weave.Evaluation(
        dataset=dataset_prepared,
        scorers=scorers,
        name="base-gpt4o Evaluation"
    )
    base_results = await base_evaluation.evaluate(BaseGPTModel())
    print(f"Results for base GPT-4o: {base_results}")

    # Return all results
    return {
        "base": base_results,
        "finetuned": finetuned_results,
        "budget_forcing": budget_results
    }


if __name__ == "__main__":
    asyncio.run(run_evaluations())
While fine-tuning improved GPT-4o’s structured reasoning abilities, there were still instances where the model struggled with multi-step reasoning. Could its performance be improved further without additional training? This led to an exploration of budget forcing, an inference-time technique designed to extend logical depth before the model reaches a final answer.

What is budget forcing?

Budget forcing is an inference-time technique designed to enhance reasoning by dynamically controlling how long a model thinks before producing a final answer. Instead of allowing the model to determine when it has finished reasoning, budget forcing ensures that it continues processing until it has fully explored the problem.
If the model attempts to stop too early, it is forced to keep going by appending a signal - such as “Wait” - prompting it to refine and expand on its reasoning. This process can be repeated multiple times, allowing the model to self-correct errors and deepen its logical steps before finalizing its response.
The key advantage of budget forcing is that it enables test-time scaling, where increasing inference-time computation directly improves accuracy. Rather than relying solely on fine-tuning or additional training data, this method pushes models to use their existing knowledge more effectively. Experiments have shown that allowing a model to engage in multiple rounds of forced reasoning significantly improves its performance on complex problems. However, there is a balance to maintain - forcing too many iterations can lead to diminishing returns, while stopping too early may prevent the model from reaching an optimal solution.
In this experiment, budget forcing was applied to the fine-tuned GPT-4o to test whether structured multi-step reasoning could be improved without additional training. The results confirmed that this approach led to the highest correctness scores among all tested models, even surpassing the fine-tuned GPT-4o without budget forcing. This suggests that inference-time techniques can be just as impactful as fine-tuning itself, offering a powerful way to enhance model reasoning without requiring additional datasets or compute during training.

Budget forcing results

Budget forcing works by ensuring that a model doesn’t stop reasoning prematurely. Instead of letting the model decide when it is done, we force it to keep thinking by appending a follow-up prompt to the end of the assistant’s answer. After each step of reasoning, the model’s response is modified to include:
"Wait? Did I make a mistake?"
This prompt prevents the model from treating its response as final and instead encourages it to review its own reasoning. The next user message then reinforces this by instructing the model to continue from where it left off and correct any mistakes:
"Continue your thinking where you left off, correcting any mistakes if there is any. Think for up to 8096 tokens."
This process is repeated multiple times, refining and expanding the reasoning trace at each step. By forcing the model to iteratively review and build upon its thought process, budget forcing effectively simulates deeper logical reasoning.
Once the model has completed several cycles of iterative thinking, it is transitioned to the final answer phase with the prompt:
"Now provide the final answer. Respond in under 2048 tokens!!!"
This ensures that the model doesn’t continue reasoning indefinitely and instead synthesizes its refined logic into a final response. By structuring inference this way, budget forcing pushes the model to fully engage with complex problems, improving accuracy through self-correction and extended reasoning. However, forcing too many iterations can introduce unnecessary complexity, while stopping too soon may leave reasoning incomplete, requiring careful tuning to find the optimal balance.
Overall, our efforts paid off, and we did see some improvement on AIME compared to the base model:

The fine-tuned GPT-4o showed a slight improvement in reasoning over the base model, increasing from 20 percent to 23 percent. However, the budget-forced model performed best, reaching 30 percent accuracy. This suggests that while fine-tuning improves reasoning, inference-time techniques like budget forcing provide even greater gains, and that they scale to models like GPT-4o.
At first, I was a bit disappointed by this seemingly modest gain, but a 10 percentage-point increase is actually a ~50 percent improvement relative to the base model’s performance.
💡
These results indicate that fine-tuning GPT-4o on high-quality reasoning datasets enhances structured reasoning, but further improvements may require a combination of fine-tuning and decoding optimizations. Future experiments could explore hybrid approaches to push GPT-4o’s reasoning abilities even further.
Also, here's the comparison of our budget-forced model against the o1-preview model:


Weave Evaluations also lets us compare each response from all of the models in the comparisons view. For reasoning models, this is particularly helpful, as it allows you to visualize each model's chain of thought and see exactly how the responses differ between models.
Here's a screenshot of the comparisons view:


Conclusion

Fine-tuning and inference-time techniques like budget forcing clearly enhance GPT-4o’s reasoning capabilities. Future experiments could explore hybrid approaches, including leveraging o3-mini for error detection and iterative reasoning refinement. Expanding this method to multimodal tasks, such as chart interpretation, could further unlock GPT-4o’s potential in complex analytical applications.
If you’re interested in trying the model yourself, feel free to reach out, or comment below - I’d be happy to provide access for further experimentation.

