
o3-mini vs. DeepSeek-R1: API setup, performance testing & model evaluation

Learn how to set up and run OpenAI o3-mini via the API, explore its flexible reasoning effort settings, and compare its performance against DeepSeek-R1 using W&B Weave Evaluations.
Getting started with OpenAI's new o3-mini model via the API is surprisingly easy, and it offers far more flexibility than what you get through the ChatGPT interface, or even with GPT-4o.
This guide will walk through running inference with o3-mini, as well as how to compare its performance against DeepSeek-R1 on the AIME dataset using Weave Evaluations. Whether you're integrating AI for research or software development, this article will help you get started with OpenAI’s latest models.
If you'd prefer to get your hands dirty faster, you can test the script out in this Colab:

Note: Currently, the o3-series of models are only accessible via API if you have a Tier 3-5 account.
Unfortunately, if you don't, you'll have to access o3-mini through the ChatGPT interface.
We recommend bookmarking this article for when the models open up more generally.
If you'd like to get started with GPT-4o, and upgrade your code later, we have a tutorial for that here.
💡





If you're just getting started and don't yet have your machine set up to run Python, we've created a quick tutorial here that will have you up and running in just a few minutes.
Additionally, if you need a walkthrough on getting the OpenAI API, you'll find that here.
💡

W&B Weave

W&B Weave streamlines the process of logging and analyzing model outputs in your project. Getting started is simple: just import Weave and initialize it with your project name.
A key feature of Weave is the @weave.op decorator. In Python, decorators extend a function’s behavior, and by applying @weave.op, you automatically log that function’s inputs and outputs. This allows you to effortlessly track the data flowing through your model.
Once executed, Weave captures and visualizes these logs in an interactive dashboard, providing insights into function calls, inputs, and outputs. This enhances debugging, organizes experimental data, and simplifies model development—especially when working with o3-mini and other AI models.
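Here's a minimal sketch of that workflow. The project name and the decorated function are placeholders for illustration, not part of this article's evaluation code:
import weave

# Initialize Weave with your project name (placeholder name used here)
weave.init("my-o3-mini-project")

# Any function decorated with @weave.op has its inputs and outputs logged
@weave.op
def build_prompt(question: str) -> str:
    """Wrap a raw question in a simple instruction template."""
    return f"Solve the following problem step by step: {question}"

# Calling the function records the call, its arguments, and its return value in Weave
print(build_prompt("What is the derivative of 3x^2 + 5x - 7?"))
Every call to build_prompt now shows up as a trace in the Weave dashboard for that project.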

o3-mini vs. R1 Pricing

Deepseek-Reasoner (R1) charges $0.55 per million input tokens (assuming a cache miss), compared to o3-mini's $1.10 per million. For output tokens, R1's cost is $2.19 per million, while o3-mini is priced at $4.40. OpenAI offers a Batch API, reducing input token costs to $0.55 per million and output token costs to $2.20 per million. The Batch API handles multiple requests simultaneously, improving efficiency and reducing costs.
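As a quick sanity check on those numbers, here's a small sketch that estimates cost for a hypothetical workload using the list prices above (standard API, cache miss; the token counts are arbitrary examples):
# Rough cost comparison using the per-million-token prices quoted above
PRICES = {
    "deepseek-reasoner": {"input": 0.55, "output": 2.19},  # $ per 1M tokens
    "o3-mini": {"input": 1.10, "output": 4.40},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated dollar cost for a given token volume."""
    p = PRICES[model]
    return (input_tokens / 1e6) * p["input"] + (output_tokens / 1e6) * p["output"]

# Example: 2M input tokens and 500K output tokens
for model in PRICES:
    print(model, round(estimate_cost(model, 2_000_000, 500_000), 2))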


o3-mini reasoning effort settings

OpenAI o3-mini offers a flexible reasoning effort setting with three levels: low, medium, and high. This allows users to optimize for either speed or accuracy based on their needs. By default, ChatGPT operates at medium reasoning effort, striking a balance between efficiency and performance, similar to OpenAI o1 in STEM tasks.
For tasks that require faster responses with simpler reasoning, the low setting reduces computational overhead, making it ideal for quick, straightforward tasks. Meanwhile, the high setting enhances problem-solving capabilities, significantly improving accuracy in complex benchmarks like AIME and GPQA. While this setting enables o3-mini to outperform previous models in advanced evaluations, it comes with a tradeoff of slightly increased latency.
These options let developers fine-tune performance for different workloads, whether optimizing for real-time applications or tackling deep analytical reasoning.
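In the API, reasoning effort is controlled per request via the reasoning_effort parameter of the Chat Completions endpoint. Here's a minimal sketch; the prompt is just an example:
import os
from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# reasoning_effort can be "low", "medium", or "high" for o3-mini
response = client.chat.completions.create(
    model="o3-mini",
    reasoning_effort="high",  # trade a bit of latency for better accuracy
    messages=[{"role": "user", "content": "How many primes are there below 100?"}],
)
print(response.choices[0].message.content)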


o3-mini Python quickstart

Let's jump right in. This tutorial assumes you have a valid OpenAI API key and Python installed, along with the following pip packages:
pip install weave openai datasets
First, we will start by writing a simple script to run inference with o3-mini and Weave:
import os
import weave
from openai import OpenAI

# Initialize Weave so the OpenAI integration can log inference calls
weave.init("o3-mini-inference")

# Option to hardcode API key (leave empty to use env variable)
OPENAI_API_KEY = ""
# Use hardcoded key if provided, else fetch from environment
api_key = OPENAI_API_KEY if OPENAI_API_KEY else os.getenv("OPENAI_API_KEY")

if not api_key:
    raise ValueError("No API key provided. Set it manually or through the environment variable.")

O3_MODEL = "o3-mini"

# Initialize OpenAI client
openai_client = OpenAI(api_key=api_key)

# Math problem
problem = "Find the derivative of f(x) = 3x^2 + 5x - 7."

# Run inference
try:
    response = openai_client.chat.completions.create(
        model=O3_MODEL,
        messages=[{"role": "user", "content": problem}]
    )
    print(response.choices[0].message.content)
except Exception as e:
    print(f"Failed to get OpenAI response: {e}")
Here, Weave logs the input and output of our model call. Notice that we didn't need to add @weave.op anywhere: Weave is already integrated with the OpenAI Python library, so importing and initializing Weave is all that's required to log inference calls.

Evaluating o3-mini with Weave

Now that we've set up the inference pipeline, we'll evaluate OpenAI's o3-mini and DeepSeek-R1 on the AIME dataset. AIME is a challenging math competition, and its problems make a rigorous benchmark for assessing AI reasoning.
We'll use Weave Evaluations to compare o3-mini's responses against ground truth solutions. The evaluation process involves generating answers with o3-mini and scoring them using GPT-4o as a judge to determine correctness. By automating this evaluation, we can measure how well o3-mini handles complex mathematical reasoning.

Handling API Failures and Ensuring Reliable Comparisons

In previous evaluations, we ran into issues where DeepSeek-R1 API calls would intermittently fail, returning empty or malformed responses, and the DeepSeek API has also experienced recent outages, so be aware that R1 calls may fail. The simplest workaround for now is to comment out the R1 portion of the code and run the evaluation with just o3-mini, or to use another provider that serves the full 671B-parameter DeepSeek-R1 model.
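If you'd rather keep R1 in the run, another option is to wrap its calls in a simple retry loop so transient failures don't sink the whole evaluation. This is a sketch, not part of the evaluation script below, and the attempt count and delay are arbitrary:
import time

def call_with_retries(fn, max_attempts: int = 3, delay: float = 5.0):
    """Call fn(); retry on exceptions or empty results up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            result = fn()
            if result:  # treat None or empty strings as failures
                return result
            print(f"Attempt {attempt}: empty response, retrying...")
        except Exception as e:
            print(f"Attempt {attempt} failed: {e}")
        time.sleep(delay)
    return None  # give up after max_attempts

# Hypothetical usage: wrap the DeepSeek call from the script below
# answer = call_with_retries(
#     lambda: deepseek_client.chat.completions.create(
#         model="deepseek-reasoner",
#         messages=[{"role": "user", "content": "2 + 2 = ?"}],
#     ).choices[0].message.content
# )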
Next, we’ll run the evaluation pipeline and analyze the results:
import os
import asyncio
import requests
from datasets import load_dataset
from openai import OpenAI
import json
import weave; weave.init("aime_evaluation")

# API Keys and Clients
openai_client = OpenAI(api_key="your api key")
deepseek_client = OpenAI(
    api_key="your-deepseek-key",
    base_url="https://api.deepseek.com"
)

# Model constants
O3_MINI = "o3-mini"
JUDGE_MODEL = "gpt-4o-2024-08-06"  # For evaluation

# Consistent system message
system_message = "Solve the following problem. put your final answer within \\boxed{}: "

def run_openai_inference(prompt: str) -> str:
    """Run inference using OpenAI API"""
    try:
        response = openai_client.chat.completions.create(
            model=O3_MINI,
            messages=[
                {"role": "user", "content": prompt}
            ]
        )
        return response.choices[0].message.content
    except Exception as e:
        print(f"Failed to get OpenAI response: {e}")
        return None

class O3MiniModel(weave.Model):
    @weave.op
    async def predict(self, text: str) -> str:
        print("running inference")
        return run_openai_inference(f"{system_message}{text}")

class DeepseekR1(weave.Model):
    @weave.op
    async def predict(self, text: str) -> str:
        try:
            response = deepseek_client.chat.completions.create(
                model="deepseek-reasoner",
                messages=[
                    {"role": "user", "content": system_message + text}
                ]
            )
            return response.choices[0].message.content
        except Exception as e:
            print(f"Failed to get Deepseek response: {e}")
            return None

@weave.op
async def gpt4_scorer(label: str, model_output: str) -> dict:
    """Score the model's output by comparing it with the ground truth."""
    query = f"""
    YOU ARE A LLM JUDGE DETERMINING IF THE FOLLOWING MODEL GENERATED ANSWER IS THE SAME AS THE CORRECT ANSWER
    I WILL GIVE YOU THE LAST 100 CHARS OF THE MODEL'S REASONING PATH, WHICH WILL CONTAIN THE FINAL ANSWER ->

    Model's Answer (last 100 chars): {str(model_output)[-100:]}
    Correct Answer: {label}
    Your task:
    1. State the model's predicted answer (answer only).
    2. State the ground truth (answer only).
    3. Determine if the model's final answer is correct (ignore formatting differences, etc.). RESPOND with the predicted and ground truth answer, followed with a JSON object containing the correctness encapsulated within the following delimiters:
    ```json
    {{ "correctness": true/false }}
    ```
    """
    try:
        response = openai_client.chat.completions.create(
            model=JUDGE_MODEL,
            messages=[{"role": "user", "content": query}]
        )
        response_text = response.choices[0].message.content
        json_start = response_text.index("```json") + 7
        json_end = response_text.index("```", json_start)
        correctness = json.loads(response_text[json_start:json_end].strip()).get("correctness", False)
        return {"correctness": correctness, "reasoning": response_text}
    except Exception as e:
        print(f"Scoring failed: {e}")
        return {"correctness": False, "reasoning": str(e)}

def load_ds():
    print("Loading AIME dataset...")
    dataset = load_dataset("Maxwell-Jia/AIME_2024")["train"]
    return [{"text": row["Problem"], "label": row["Answer"]} for row in dataset]

async def run_evaluations():
    print("Loading dataset...")
    dataset = load_ds()
    print("Initializing models...")

    models = {
        "o3-mini": O3MiniModel(),
        "deepseek-r1": DeepseekR1()
    }

    print("Preparing dataset for evaluation...")
    dataset_prepared = [{"text": row["text"], "label": row["label"]} for row in dataset]

    print("Running evaluations...")
    scorers = [gpt4_scorer]
    for model_name, model in models.items():
        print(f"\nEvaluating {model_name}...")
        evaluation = weave.Evaluation(
            dataset=dataset_prepared,
            scorers=scorers,
            name=f"{model_name} Evaluation"
        )
        results = await evaluation.evaluate(model)
        print(f"Results for {model_name}: {results}")

if __name__ == "__main__":
    asyncio.run(run_evaluations())
The script begins by loading the AIME dataset using the Hugging Face load_dataset function. This dataset consists of math competition questions, making it a strong benchmark for evaluating complex reasoning abilities. Each entry contains a problem prompt (text) and a ground truth solution (label).
The script then runs inference using OpenAI's o3-mini model, as well as DeepSeek-R1 via the DeepSeek API. Each model processes the dataset separately, generating responses for each question. The inference process works as follows:
  • o3-mini: Queries are sent to OpenAI’s API, and responses are returned as model outputs.
  • DeepSeek-R1: Queries are processed through DeepSeek’s API, retrieving responses in the same manner.

Results

For the first run, using the 'medium' reasoning effort, o3-mini scored exactly the same as R1: 76% accuracy. For anyone curious, I encourage you to try running the eval with "high" reasoning effort to see if o3-mini can out-think R1!
Here's a screenshot of the Weave evaluation, showing R1 vs. o3-mini:



The Weave comparisons dashboard

Once evaluations are complete, Weave organizes results into an interactive dashboard. This powerful tool enables you to:
  • Compare model outputs side by side,
  • Filter results by specific parameters, and
  • Trace inputs and outputs for every function call.
This structured, visual approach simplifies debugging and provides deep insights into model performance, making Weave an indispensable tool for tracking and refining large language models.
For reasoning-focused tasks like those tackled by DeepSeek-R1 and o3-mini, the comparisons view offers a step-by-step trace of each model's decision-making process. This makes it easy to identify logical missteps, errors in interpreting prompts, or areas where one model outperforms the other.
By analyzing these granular outputs, you can better understand why models succeed or fail on specific tasks. This insight is invaluable for diagnosing issues, improving models, and tailoring their capabilities for complex reasoning scenarios.
Here’s a screenshot of what the comparison view looks like:


Conclusion

OpenAI o3-mini provides a powerful and flexible option for reasoning-intensive tasks, particularly in STEM domains. With its multiple reasoning effort settings and efficient API integration, it allows developers to optimize for either speed or depth of reasoning.
The Weave Evaluations framework makes it easy to systematically compare model outputs, track performance, and refine workflows. Whether you're running direct inference or conducting large-scale benchmarking, o3-mini offers a scalable and cost-effective solution.
As API access expands and more users integrate these models into their workflows, o3-mini is set to become a key tool for research, development, and AI-driven applications.

Iterate on AI agents and models faster. Try Weights & Biases today.