
Evaluating o4-mini vs. Claude 3.7 vs. Gemini 2.5 Pro on code generation

A real-world head-to-head test of Gemini 2.5 Pro, o4-mini, and Claude 3.7 Sonnet on competitive programming problems—built on a custom execution framework with Weave integration to track correctness, spot bugs, and cut through benchmark hype.
Created on May 7 | Last edited on May 8
Google’s new Gemini 2.5 Pro Preview (I/O Edition) is now available, promising stronger code generation and smarter reasoning across a range of development tasks. Big claims are one thing, but this field moves fast, and Gemini now finds itself shoulder-to-shoulder with other top-tier models, namely OpenAI’s o4-mini and Anthropic’s Claude 3.7 Sonnet, all vying to be every developer’s favorite assistant.
To see what actually holds up, we’re putting Gemini 2.5 Pro through the same real-world programming challenges as its strongest peers: o4-mini from OpenAI and Claude 3.7 Sonnet from Anthropic. There’s no hand-picking or shortcutting here. Each model faces identical code problems drawn from serious benchmark sets, and every solution is programmatically executed and scored for correctness.
This isn’t about leaderboard marketing. It’s about seeing which model can genuinely produce code that runs and returns the right answer. We want to see if Gemini 2.5 Pro can live up to the hype, not just in isolation but in direct comparison with the other models at the frontier. This head-to-head will give a clear picture of where Gemini stands today and what developers can expect if they rely on it for tough, unfiltered coding tasks.


The problem with public benchmarks

A growing issue in LLM evaluation is that models appear to "overfit" to benchmark test sets, even when they weren't explicitly trained on them. This isn't traditional overfitting (where a model memorizes answers seen during training). Instead, it's more like data leakage on a massive scale caused by the way LLMs are pre-trained.
Most popular benchmarks like GSM8K, HumanEval, MMLU, etc. have been circulating online for years. They're on GitHub, in academic papers, blog posts, and tutorials. When an LLM is pre-trained on a huge scrape of the public internet, there's a good chance it’s seen parts—or even entire copies—of these benchmarks. This means that during evaluation, the model might already “know” the task, even if the benchmark was held out from fine-tuning.
This creates the illusion of strong generalization. A model scores well not because it truly solved the problem, but because it recognizes the format, remembers similar phrasing, or has memorized the answer distribution. That’s why some models can perform well on test sets without ever being trained specifically on them. They’ve just seen extremely similar (or identical) questions enough times in the wild to fake generalization.
This makes it hard to trust benchmark scores, especially when multiple models are trained on similar corpora. The line between pretraining and evaluation is too blurry, and as benchmarks get reused, their value erodes. Any serious claims about reasoning or generalization should be treated with caution unless backed by genuinely unseen, private evals. So we cannot guarantee that the benchmark we run below reflects true generalization: public benchmarks are increasingly saturated, and models have likely seen them during pretraining even if they were never explicitly trained on them.
That said, benchmarks are still useful. They give a shared reference point for comparing models, tracking changes over time, and stress-testing specific capabilities. But they should be treated as diagnostics, not ground truth. A high score on a benchmark doesn’t mean the model understands the task. It might just be regurgitating what it's seen before.

Our benchmark: CodeContests

We’ll run Gemini 2.5 Pro, o4-mini, and Claude 3.7 Sonnet on CodeContests to see who handles real competitive problems best. This collection brings together over 10,000 real competitive programming problems from platforms like Codeforces, AtCoder, and CodeChef. Each problem includes a natural language description, multiple public and hidden test cases, and metadata like tags and difficulty ratings.
CodeContests is widely used for benchmarking code generation models thanks to its diverse problem types and strict evaluation setup. The problems aren’t designed with LLMs in mind - they require clear reasoning, precise implementation, and actual correctness on test cases, not just plausible code.
For our head-to-head comparison, we use the public test split, so every model faces the same authentic competition problems (with the caveat about pretraining exposure noted above). This benchmark gives a direct and realistic measurement of how well Gemini 2.5 Pro Preview, o4-mini, and Claude 3.7 Sonnet handle real-world coding tasks.
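If you want to poke at the data yourself, here’s a minimal sketch of loading one problem from the Hugging Face hub and inspecting its fields. The field names (description, public_tests, name) match the deepmind/code_contests schema we use in the evaluation script later in this post.

from datasets import load_dataset

# Stream the test split so we don't download the full dataset
ds = load_dataset("deepmind/code_contests", split="test", streaming=True)
sample = next(iter(ds))

print(sample["name"])                       # problem title
print(sample["description"][:300])          # natural-language statement (truncated)
print(sample["public_tests"]["input"][0])   # first public test input
print(sample["public_tests"]["output"][0])  # expected output for that input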

Building a lightweight code execution framework

We will use a custom, lightweight framework to automate the entire evaluation. For each problem, every model will only see the natural language description, and will be prompted to generate a Python function as a solution. This framework standardizes calls to Gemini 2.5 Pro, o4-mini, and Claude 3.7 Sonnet, so every line of code and every test is apples-to-apples.
To keep things efficient and flexible, we’ll leverage an LLM to handle auxiliary tasks - such as crafting the appropriate function calls and formatting example inputs for execution. After running each model’s code against the public test cases, we’ll also use an LLM to compare the model’s outputs with the expected answers. This approach enables smart handling of edge cases (like subtle formatting differences or floating point tolerance) that are common when checking code outputs.
By automating function calling and answer evaluation with LLMs, we can ensure a repeatable evaluation process - while keeping the focus where it belongs: on whether the models can generate code that actually works for real-world problems.
Here's a simple script showing some of the core components of our code execution framework:

import subprocess
import sys
from litellm import completion


def clean_llm_code_block(text):
    return text.replace("```python", "").replace("```", "").strip()


def ask_llm_for_function_call(code: str, raw_input: str) -> str:
    prompt = (
        "You're given a Python function and a single input string. "
        "Format it into a valid Python function call using only standard types.\n\n"
        f"Function:\n{code}\n\n"
        f"Input:\n{raw_input.strip()}\n\n"
        "Return ONLY a valid function call (e.g., solve(3, 5))."
    )
    response = completion(
        model="openai/gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        max_tokens=100,
    )
    raw = response["choices"][0]["message"]["content"]
    return clean_llm_code_block(raw)

def compare_output_with_llm(expected: str, actual: str) -> bool:
    prompt = (
        f"Expected output: {expected.strip()}\n"
        f"Actual output: {actual.strip()}\n\n"
        "Are these outputs semantically equivalent? Reply YES or NO."
    )
    response = completion(
        model="openai/gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        max_tokens=10,
    )
    return response["choices"][0]["message"]["content"].strip().upper() == "YES"


def run_code_and_call_function(code: str, function_call: str, timeout=10):
    full_code = code + f"\n\nprint({function_call})"
    try:
        result = subprocess.run(
            [sys.executable, "-c", full_code],
            capture_output=True,
            text=True,
            timeout=timeout
        )
        return result.stdout.strip(), result.stderr.strip()
    except subprocess.TimeoutExpired:
        return "", "Execution timed out."
    except Exception as e:
        return "", str(e)

# Example LLM-generated code (you'd swap this in dynamically)
code_from_llm = """
def solve(a, b, mod):
    return pow(a, -1, mod)
"""

# Example test case block
test_case = {
    "input": [
        "3 2 998244353\n",
        "9 3 998244353\n",
        "3 1 998244353\n",
        "9 4 998244353\n"
    ],
    "output": [
        "665496236\n",
        "449209967\n",
        "499122178\n",
        "665496237\n"
    ]
}

# Run test cases
for i, input_line in enumerate(test_case["input"]):
    expected = test_case["output"][i]
    try:
        call = ask_llm_for_function_call(code_from_llm, input_line)
        result, error = run_code_and_call_function(code_from_llm, call)

        if error:
            print(f"[{i}] ERROR during execution: {error}")
            continue

        is_correct = compare_output_with_llm(expected, result)
        print(f"[{i}] input: {input_line.strip()} → output: {result} | expected: {expected.strip()} | PASS: {is_correct}")

    except Exception as e:
        print(f"[{i}] Failed: {repr(e)}")

When writing complex evaluations, I find it helpful to first build a simpler, single-example version that validates the core challenges involved. After that, it's much easier to add more models, metrics, and scoring functions.

Evaluating Gemini 2.5 Pro Preview, o4-mini, and Claude 3.7 Sonnet

To ensure a fair head-to-head comparison, we run each LLM on the same batch of 30 CodeContests problems (from the test set), automating every step, from solution generation to test case evaluation. For each problem, the model is given only the task description and must output a standalone Python function (solve). We then enforce output cleanliness using a code extraction routine (with a backup LLM parser for robustness), ensuring only valid, import-ready code makes it to execution.
Next, for each test input, we auto-generate the necessary function call using another LLM prompt. The candidate code is then executed in an isolated environment, and its output is compared to the expected results.
import os
import re
import sys
import time
import subprocess

from datasets import load_dataset
from litellm import completion as oai_completion
from anthropic import Anthropic
from google import genai
from google.genai import types

import weave
from weave.flow.eval_imperative import EvaluationLogger

weave.init("codecontests_eval")

# API keys
os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY", "sk-...")
CLAUDE_KEY = os.getenv("CLAUDE_API_KEY", "sk-...")
GENAI_KEY = os.getenv("GOOGLE_API_KEY", "sk-...")

# Clients
anthropic_client = Anthropic(api_key=CLAUDE_KEY)
gemini_client = genai.Client(api_key=GENAI_KEY)



def clean_llm_code_block(text):
    cleaned_text = text.replace("```python", "").replace("```", "").strip()
    code_blocks = re.findall(r"(def solve\(.*?)(?=^def |\Z)", cleaned_text, re.DOTALL | re.MULTILINE)
    source_text = code_blocks[-1] if code_blocks else cleaned_text

    prompt = (
        "Given the following response from a language model, extract ONLY the valid Python code for the function. "
        "Do not include any explanations, text, or formatting fences. Only the code.\n\n"
        f"Response:\n{source_text}\n\n"
        "Return ONLY the Python code, including any necessary imports:"
    )

    response = oai_completion(
        model="openai/gpt-4o-2024-08-06",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )

    gpt4o_code = response["choices"][0]["message"]["content"]
    gpt4o_code = gpt4o_code.replace("```python", "").replace("```", "").strip()
    return gpt4o_code

@weave.op()
def generate_completion(model: str, prompt: str) -> str:
    if model.startswith("openai/"):
        response = oai_completion(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            reasoning_effort="low",
        )
        return response["choices"][0]["message"]["content"].strip()

    elif model.startswith("anthropic/"):
        response = anthropic_client.messages.create(
            model=model.replace("anthropic/", ""),
            max_tokens=8000,
            thinking={"type": "enabled", "budget_tokens": 4000},
            messages=[{"role": "user", "content": prompt}],
        )
        for block in response.content:
            if block.type == "text":
                return block.text.strip()
        return "[No Claude response]"

    elif model.startswith("gemini/"):
        result = gemini_client.models.generate_content(
            model=model.replace("gemini/", ""),
            config=types.GenerateContentConfig(
                thinking_config=types.ThinkingConfig(thinking_budget=4000)
            ),
            contents=[prompt]
        )
        return result.text.strip() if result.text else "[No Gemini response]"

    else:
        raise ValueError(f"Unsupported model: {model}")


def ask_llm_for_function_implementation(description: str, model: str) -> str:
    prompt = (
        "Write a Python3 function named `solve` with typed input arguments for this problem -- eg the solve function should take arguments to handle different test cases:\n\n"
        f"{description.strip()}\n\n"
        "Return only a valid Python function -- no special packages that aren't commonly used and NO MAIN function, no if __name__ == __main__....., JUST write the function -- that returns the result. No comments, no explanations. "
        "HOWEVER, you still need to include necessary imports for libraries. "
        "IF you do not include the right imports, the code will not be executable, and your response will be judged as incorrect!"
    )
    return clean_llm_code_block(generate_completion(model, prompt))



@weave.op
def ask_llm_for_function_call(code: str, raw_input: str, model: str) -> str:
    prompt = (
        "You're given a Python function and a single input string. "
        "Format it into a valid Python function call using only standard types.\n\n"
        f"Function:\n{code}\n\n"
        f"Input:\n{raw_input.strip()}\n\n"
        "Return ONLY a valid function call (e.g., solve(3, 5)) WITH NO 'def' "
    )

    # Always use GPT-4o for this inference, regardless of the `model` argument.
    response = oai_completion(
        model="openai/gpt-4o-2024-08-06",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    # The LLM may return markdown code blocks; strip them just in case.
    content = response["choices"][0]["message"]["content"]
    content = content.replace("```python", "").replace("```", "").strip()
    return content



def compare_output_with_llm(expected: str, actual: str, model: str) -> bool:
    prompt = (
        f"Expected output: {expected.strip()}\n"
        f"Actual output: {actual.strip()}\n\n"
        "Are these outputs equivalent? Eg ignore minor formatting errors etc, we are just looking for overall correctness in the output. Reply YES or NO."
    )
    # Always judge with GPT-4o so the verdict doesn't depend on the model under test.
    response = oai_completion(
        model="openai/gpt-4o-2024-08-06",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    res = 'YES' in str(response["choices"][0]["message"]["content"]).upper()
    return res



def run_code_and_call_function(code: str, function_call: str, timeout=10):
    full_code = code + f"\n\nprint({function_call})"
    try:
        start = time.time()
        result = subprocess.run(
            [sys.executable, "-c", full_code],
            capture_output=True,
            text=True,
            timeout=timeout
        )
        latency = time.time() - start
        return result.stdout.strip(), result.stderr.strip(), latency
    except subprocess.TimeoutExpired:
        return "", "Execution timed out.", timeout
    except Exception as e:
        return "", str(e), 0.0


def ask_model_for_pip_command(error_msg):
    prompt = (
        "Given this Python error:\n\n"
        + error_msg +
        "\n\nWrite the pip install command needed to fix it. Only return the command, e.g.:\n"
        "pip install requests"
    )
    return generate_completion("openai/gpt-4o-2024-08-06", prompt)


def run_pip_install(pip_command):
    print(f"Running: {pip_command}")
    try:
        result = subprocess.run(
            pip_command.split(),
            capture_output=True,
            text=True,
            timeout=180
        )
        print(result.stdout.strip())
        if result.stderr:
            print(result.stderr.strip())
    except Exception as e:
        print(f"pip install failed: {e}")



def evaluate_model_on_code_contests(model_name: str):
    print(f"\n\nRunning evaluation for model: {model_name}\n")
    ds = load_dataset("deepmind/code_contests", split="test", streaming=True)
    ds = list(ds.take(31))

    eval_logger = EvaluationLogger(
        model=model_name.replace("-", "_").replace("/", "_").replace(".", "_"),
        dataset="code_contests_test"
    )
    all_latencies = []

    for i in range(30):
        row = ds[i]
        description = row["description"]
        raw_inputs = row["public_tests"]["input"]
        expected_outputs = row["public_tests"]["output"]

        try:
            code = ask_llm_for_function_implementation(description, model=model_name)
            print(f"\n=== Task {row['name']} ===", flush=True)
            # print("Generated code:\n", code)

            all_passed = True
            task_latencies = []
            results_lst, expected_lst = [], []

            for j, raw_input in enumerate(raw_inputs):
                expected = expected_outputs[j] if j < len(expected_outputs) else ""

                try:
                    function_call = ask_llm_for_function_call(code, raw_input, model=model_name)
                    result, error, latency = run_code_and_call_function(code, function_call)
                    if latency < 99:
                        task_latencies.append(latency)

                    if error:
                        print(f"[{j}] Runtime error: {error}")
                        if "ModuleNotFoundError" in error:
                            pip_cmd = ask_model_for_pip_command(error)
                            run_pip_install(pip_cmd)
                            # Re-run once after pip install
                            result, error, latency = run_code_and_call_function(code, function_call)
                            task_latencies.append(latency)
                            if error:
                                print(f"[{j}] Retry failed: {error}")
                                all_passed = False
                                continue
                        else:
                            all_passed = False
                            continue

                    is_correct = compare_output_with_llm(expected, result, model="openai/gpt-4o-2024-08-06")
                    results_lst.append(result)
                    expected_lst.append(expected)
                    if not is_correct:
                        all_passed = False
                    print(f"[{j}] input: {raw_input.strip()} → output: {result} | expected: {expected.strip()} | PASS: {is_correct} | latency: {latency:.2f}s")

                except Exception as inner:
                    print(f"[{j}] Inner error: {repr(inner)}")
                    all_passed = False

            task_avg_latency = sum(task_latencies) / len(task_latencies) if len(task_latencies) > 0 else 0.0
            all_latencies.extend(task_latencies)

            prediction_log = eval_logger.log_prediction(
                inputs={"description": description},
                output={'code': code, 'execution_result': results_lst, 'expected_execution_result': expected_lst}
            )
            prediction_log.log_score("correctness", all_passed)
            prediction_log.log_score("code_latency", task_avg_latency)
            prediction_log.finish()

        except Exception as e:
            print(f"[{i}] Top-level failure: {repr(e)}")
            prediction_log = eval_logger.log_prediction(
                inputs={"description": description},
                output=str(e)
            )
            prediction_log.log_score("correctness", False)
            prediction_log.finish()

    avg_latency = sum(all_latencies) / len(all_latencies) if all_latencies else 0.0
    eval_logger.log_summary({"avg_code_latency": avg_latency})
    print(f"Evaluation complete for {model_name}. View in Weave UI.")


# Run for all models
evaluate_model_on_code_contests("gemini/gemini-2.5-pro-preview-05-06")
evaluate_model_on_code_contests("anthropic/claude-3-7-sonnet-20250219")
evaluate_model_on_code_contests("openai/o4-mini")

To keep everything reproducible and track detailed logs, we use Weave to log our evaluation metrics - capturing raw model code, test inputs, outputs, errors, and metrics like average latency. This makes it easy to inspect failures, spot trends across models, and ensure that every result and bug is traceable, which is especially useful as both models and benchmarks evolve. I used Weave’s new EvaluationLogger for this, which offers a more flexible way to instrument evaluations.
Using the EvaluationLogger, you’re not locked into any strict format. Instead, you can manually loop through your dataset, call the model however you want, and log predictions and scores as they come. This makes it straightforward to slot Weave into pre-existing pipelines or custom eval loops without rewriting everything.
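For reference, here's a minimal sketch of that pattern, using the same EvaluationLogger calls as the full script above. Note that my_examples, call_model, and is_correct are placeholders for whatever data and model-calling code you already have.

import weave
from weave.flow.eval_imperative import EvaluationLogger

weave.init("my_project")  # hypothetical project name

eval_logger = EvaluationLogger(model="my_model", dataset="my_dataset")

correct = 0
for example in my_examples:          # any iterable you already have
    output = call_model(example)     # call the model however you want
    score = is_correct(example, output)
    correct += int(score)

    # Log one prediction and its score, exactly as in the full script
    pred = eval_logger.log_prediction(inputs={"input": example}, output=output)
    pred.log_score("correctness", score)
    pred.finish()

eval_logger.log_summary({"accuracy": correct / len(my_examples)})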

Head-to-head results: Gemini 2.5 Pro vs o4-mini vs Claude 3.7 Sonnet

Here are the results for my evaluation:

I tested 30 samples since the thinking modes are computationally intensive. I used the "low" thinking budget setting for OpenAI's o4-mini model and aimed to keep the budget around 4,000 tokens across all models. Despite this, Gemini regularly exceeded that, averaging around 25,000 tokens per response (I’m not sure if this was a bug or a mistake on my part). In my experience, that much verbosity usually boosts performance. But Gemini 2.5 Pro still only scored 0.333 in correctness, just behind Claude 3.7 Sonnet at 0.367 and well below OpenAI's o4-mini at 0.5.

The Weave comparison view

Weave's comparison view is particularly valuable for clearly visualizing differences in reasoning across multiple coding models. By displaying each model's outputs side by side, this view lets you immediately identify discrepancies in correctness, logical consistency, and implementation approach. Through this intuitive interface, you can quickly pinpoint why certain models fail where others succeed.

Such insights make it easier to analyze and optimize coding performance effectively. By highlighting not only results, but also underlying coding styles, Weave’s comparison view simplifies evaluating and improving each model's reasoning capabilities.

Catching bugs with Weave

After starting one of the evaluations, I looked at some of the initial responses from the models and noticed they included non-code explanations alongside the code. This was clearly an issue, and after further investigation I traced it to my clean_llm_code_block function, which at the time relied only on regex-based code extraction and wasn't robust to the full distribution of LLM outputs (or I'm just bad at writing regex filters). So I took the easy way out and parsed the code with a separate LLM to get near-100% reliability. Thanks to Weave, I was able to catch this bug early!



Conclusion

Our journey benchmarking Gemini 2.5 Pro (I/O Edition), o4-mini, and Claude 3.7 Sonnet shows that the landscape for AI coding assistants is rapidly evolving—and intensely competitive. These models handle complex, real-world programming problems with impressive consistency, but interpreting their headline numbers requires caution. As we’ve highlighted, public benchmarks like CodeContests remain useful diagnostic tools, but their value is limited by potential data leakage and the saturation of test sets across the open web. True breakthroughs in reasoning and generalization still demand harder-to-game, private, or novel evaluation methods.
Despite such caveats, our evaluation—supported by an automated, LLM-augmented execution framework—provides a fair, reproducible, and transparent way to compare what leading models actually deliver when faced with real, unseen coding challenges. Tools like Weave helped us catch reliability issues early, underscoring the importance of robust evaluation infrastructure.
Ultimately, while benchmarks are imperfect, stress-testing today’s best models on well-curated tasks remains crucial for understanding their practical strengths and shortcomings. Developers considering tools like Gemini 2.5 Pro can take encouragement from its raw capabilities but should remain clear-eyed about the difference between leaderboard performance and real-world resilience. As the field moves forward, the race to build not only smarter but also more rigorously evaluated code assistants is far from over.






Iterate on AI agents and models faster. Try Weights & Biases today.