Testing Claude 4 vs. Codex vs. Gemini 2.5 Pro on CodeContests
Putting Claude 4 Sonnet and Opus to the test
Anthropic’s Claude Opus 4 and Sonnet 4 represent a new generation of large language models focused on coding, reasoning, and seamless integration into developer workflows. Building on strong past performance from the Claude family, these models arrive alongside prominent peers like OpenAI’s codex-mini and Google’s Gemini 2.5 Pro, each aiming to become an essential tool for engineers and software teams.
In this article, we will explore how Claude Opus 4 and Sonnet 4 perform across a range of practical coding tasks, using consistent programmatic benchmarks and live code execution to measure capability and reliability. By comparing these results with current models from OpenAI and Google, we aim to provide a clear, up-to-date perspective on how the latest Claude models serve modern development needs—especially in scenarios where code quality, problem-solving, and efficiency truly matter.

Table of contents
- What’s New in Claude 4?
- LLM benchmarks: A north star for intelligent ability?
- Our benchmark: CodeContests
- Code Correctness as a metric
- Experiment 1: How many thinking tokens is best?
- Experiment 2: Claude Sonnet 4 vs. Claude Opus 4 vs. Codex vs. Gemini 2.5 Pro
- The Weave Comparisons View
- Conclusion
What’s New in Claude 4?
Claude 4 introduces major advances in AI reasoning, code generation, and agentic workflows, building on Anthropic’s strengths in transparency and safety. The Claude 4 family includes Claude Opus 4—the new state-of-the-art for coding and complex tasks—and a dramatically upgraded Claude Sonnet 4, now faster, smarter, and more precise than ever.
The release centers on two powerful new models, along with several new capabilities:
- Claude Opus 4: Anthropic’s most capable AI yet, excelling at challenging coding, multi-step reasoning, and large-scale agent workflows. Outperforms previous benchmarks and delivers best-in-class performance for software engineering and research.
- Claude Sonnet 4: A major step up from Sonnet 3.7, offering top-tier reasoning, better instruction-following, and stronger code synthesis—all with faster, more practical responses for everyday use.
- Tool Use + Extended Thinking (Beta): Both models can now use external tools (like calculators, code runners, web search, and custom APIs) as part of their step-by-step reasoning—pausing to “think,” call a tool, then continue building their answer.
- With interleaved thinking, Claude can alternate between inner thoughts, tool calls, and final outputs for transparent, auditable workflows (see the request sketch just after this list).
- Parallel Tool Use and Enhanced Memory: Claude 4 can execute tools in parallel and, when allowed, “remember” key facts and context across turns. With file access, Opus 4 can create and update its own memory files, enabling better context and continuity over long agent runs.
- API Upgrades: Access new API features including code execution, a flexible file API, MCP connector, and prompt caching for up to 1 hour—tools that help you build more powerful, persistent AI workflows.
- Better Safety, Fewer Loopholes: Both models show a 65% reduction in shortcut/loophole behaviors compared to the previous generation, ensuring more reliable agentic tasks and code edits.
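To make the tool use and extended thinking bullets concrete, here is a minimal request sketch. It is not part of the benchmark harness below, and both the run_python tool definition and the interleaved-thinking beta header are illustrative assumptions, so check Anthropic's documentation for the current tool and flag names.

from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=16000,
    # Extended thinking: give Claude an internal reasoning budget
    thinking={"type": "enabled", "budget_tokens": 8000},
    # A hypothetical client-side tool the model may call mid-reasoning
    tools=[{
        "name": "run_python",
        "description": "Execute a Python snippet and return its stdout.",
        "input_schema": {
            "type": "object",
            "properties": {"code": {"type": "string"}},
            "required": ["code"],
        },
    }],
    # Beta flag for interleaved thinking (name assumed; verify against the docs)
    extra_headers={"anthropic-beta": "interleaved-thinking-2025-05-14"},
    messages=[{"role": "user", "content": "Check whether 2**61 - 1 is prime."}],
)

# The response alternates thinking, tool_use, and text content blocks
for block in response.content:
    print(block.type)

With interleaved thinking enabled, the response content alternates between thinking, tool_use, and text blocks, which is what makes these agentic workflows auditable.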
LLM benchmarks: A north star for intelligent ability?
A growing issue in LLM evaluation is that models appear to "overfit" to benchmark test sets, even when they weren't explicitly trained on them. This isn't traditional overfitting (where a model memorizes answers seen during training). Instead, it's more like data leakage on a massive scale caused by the way LLMs are pre-trained.
Most popular benchmarks like GSM8K, HumanEval, MMLU, etc. have been circulating online for years. They're on GitHub, in academic papers, blog posts, and tutorials. When an LLM is pre-trained on a huge scrape of the public internet, there's a good chance it’s seen parts—or even entire copies—of these benchmarks. This means that during evaluation, the model might already “know” the task, even if the benchmark was held out from fine-tuning.
This creates the illusion of strong generalization. A model scores well not because it truly solved the problem, but because it recognizes the format, remembers similar phrasing, or has memorized the answer distribution. That’s why some models can perform well on test sets without ever being trained specifically on them. They’ve just seen extremely similar (or identical) questions enough times in the wild to fake generalization.
This makes it hard to trust benchmark scores, especially when multiple models are trained on similar corpora. The gap between pretraining and evaluation is too blurry, and as benchmarks get reused, their value erodes. Any serious claims about reasoning or generalization should be treated with caution unless backed by genuinely unseen, private evals. For the same reason, we can't guarantee that the benchmarks we run below reflect true generalization: public benchmarks are increasingly saturated, and models have likely seen them during pretraining even if they were never explicitly trained on them.
That said, benchmarks are still useful. They give a shared reference point for comparing models, tracking changes over time, and stress-testing specific capabilities. But they should be treated as diagnostics, not ground truth. A high score on a benchmark doesn’t mean the model understands the task. It might just be regurgitating what it's seen before.
Our benchmark: CodeContests
We’ll run Gemini 2.5 Pro, OpenAI’s codex-mini, and Claude Sonnet 4 and Claude Opus 4 on CodeContests to see which handles real competitive problems best. This collection brings together over 10,000 real competitive programming problems from platforms like Codeforces, AtCoder, and CodeChef. Each problem includes a natural language description, multiple public and hidden test cases, and metadata like tags and difficulty ratings.
CodeContests is widely used for benchmarking code generation models thanks to its diverse problem types and strict evaluation setup. The problems aren’t designed with LLMs in mind: they require clear reasoning, precise implementation, and actual correctness on test cases, not just plausible code.
For this comparison, we will use a fraction of the public test split so that each model faces a set of unseen, authentic problems. This provides a transparent and realistic picture of how well Gemini 2.5 Pro, Codex, and Claude 4 Sonnet and Opus perform on real competitive programming tasks.
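If you want to poke at the problems yourself before running a full eval, the same split can be streamed straight from Hugging Face. Field names follow the deepmind/code_contests schema that the evaluation code later in this article uses.

from datasets import load_dataset

# Stream the test split so nothing large is downloaded up front
ds = load_dataset("deepmind/code_contests", split="test", streaming=True)
row = next(iter(ds))

print(row["name"])                        # problem title
print(row["description"][:300])           # natural-language statement (truncated)
print(row["public_tests"]["input"][0])    # first sample input
print(row["public_tests"]["output"][0])   # expected output for that input
print(row["difficulty"], row["cf_tags"])  # metadata: difficulty bucket and tags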
Code Correctness as a metric
To ensure a fair and automated evaluation across different language models, I built a custom, lightweight code execution framework. For each competitive programming problem, I present only the natural language description to each model, prompting it to generate a Python function as its solution. This framework keeps things apples-to-apples by standardizing how I call Gemini 2.5 Pro, Codex, and Claude 4 Sonnet and Opus, so every line of code and every test is directly comparable.
A key feature of my setup is using an LLM not just for code generation, but also for the supporting tasks, such as turning example input strings into valid function calls and formatting data for execution. This lets me ensure that all the glue work—like mapping sample inputs to Python calls—is done consistently across every problem and every model.
After running the generated code against the public test cases, I rely on an LLM to compare the outputs to the expected answers. This makes output checking much more robust: the LLM can recognize semantically equivalent results, even if there are superficial differences in formatting or minor floating-point discrepancies—cases that typical string comparisons might miss.
By automating function invocation and answer checking with LLMs, I make the entire evaluation process repeatable, consistent, and above all, focused on real correctness—that is, whether the models’ code can pass real-world test cases. The main emphasis is always on actual performance, not just code that appears plausible.
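Stripped down to its two core steps, the harness looks like the sketch below: run the generated function in a separate Python process, then let GPT-4o judge whether the output matches. The full version later in the article adds retries, pip-install recovery, and Weave logging.

import sys
import time
import subprocess
from litellm import completion

def run_candidate(code: str, function_call: str, timeout: int = 10):
    """Run the generated solve() in a separate Python process and capture stdout.
    (The full harness also catches subprocess.TimeoutExpired and records a timeout.)"""
    full_code = code + f"\n\nprint({function_call})"
    start = time.time()
    result = subprocess.run(
        [sys.executable, "-c", full_code],
        capture_output=True, text=True, timeout=timeout,
    )
    return result.stdout.strip(), result.stderr.strip(), time.time() - start

def judge_equivalent(expected: str, actual: str) -> bool:
    """LLM-as-judge comparison via litellm + GPT-4o, as in the full script."""
    prompt = (
        f"Expected output: {expected.strip()}\n"
        f"Actual output: {actual.strip()}\n\n"
        "Are these outputs equivalent, ignoring minor formatting differences? Reply YES or NO."
    )
    response = completion(
        model="openai/gpt-4o-2024-08-06",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return "YES" in response["choices"][0]["message"]["content"].upper()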
Experiment 1: How many thinking tokens is best?
The first experiment aims to answer a basic question: how much internal reasoning should I ask Claude Sonnet 4 for when prompting it to solve code problems? Claude allows fine-grained control over its “thinking” process—by specifying a budget_tokens value, I can set the maximum number of tokens Claude can spend on internal reasoning (the so-called “thinking blocks”) before generating its final answer.
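Concretely, the only parameter that changes between runs in this experiment is the thinking budget. A minimal request, mirroring how the eval code below constructs it, looks like this:

from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
budget = 12000        # one of the budgets swept in this experiment

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    # max_tokens must cover both the hidden reasoning and the visible answer
    max_tokens=8000 + budget,
    thinking={"type": "enabled", "budget_tokens": budget},
    messages=[{"role": "user", "content": "Write a Python function `solve(...)` for this problem: ..."}],
)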
To investigate, I test a range of thinking budgets, from the standard setting (no explicit thinking) all the way up to 12,000 tokens. For each configuration, I run Claude on the same set of competitive programming problems from CodeContests. The framework passes each problem’s natural language description to the model and collects its solution, running it on real test cases to measure both correctness and latency.
By logging results across the different configurations, I can see how increasing Claude’s “thinking budget” affects code accuracy and response time. Does giving Claude more room for internal reflection help it write better code, or simply slow things down? This experiment is designed to map out that tradeoff in detail, so I can tune the reasoning depth for optimal performance.
Here’s the code for the eval:
import os
import sys
import time
import subprocess
import re

from datasets import load_dataset
from litellm import completion as oai_completion
from anthropic import Anthropic
from google import genai
from google.genai import types
from openai import OpenAI
import weave
from weave.flow.eval_imperative import EvaluationLogger

weave.init("codecontests_evalv2")

os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY", "your_api_key")
OAIKEY = os.environ["OPENAI_API_KEY"]
CLAUDE_KEY = os.getenv("CLAUDE_API_KEY", "your_api_key")
GENAI_KEY = os.getenv("GOOGLE_API_KEY", "your_api_key")

client = OpenAI(api_key=OAIKEY)
anthropic_client = Anthropic(api_key=CLAUDE_KEY)
gemini_client = genai.Client(api_key=GENAI_KEY)


def clean_llm_code_block(text):
    # Strip markdown fences, isolate the last `solve` definition, then have GPT-4o
    # extract only the runnable function (plus any imports it needs)
    cleaned_text = text.replace("```python", "").replace("```", "").strip()
    code_blocks = re.findall(r"(def solve\(.*?)(?=^def |\Z)", cleaned_text, re.DOTALL | re.MULTILINE)
    source_text = code_blocks[-1] if code_blocks else cleaned_text
    prompt = (
        "Given the following response from a language model, extract ONLY the valid Python code for the function. "
        "Do not include any explanations, text, or formatting fences. Only the code.\n\n"
        f"Response:\n{source_text}\n\n"
        "Return ONLY the Python code, including any necessary imports:"
    )
    response = oai_completion(
        model="openai/gpt-4o-2024-08-06",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    gpt4o_code = response["choices"][0]["message"]["content"]
    gpt4o_code = gpt4o_code.replace("```python", "").replace("```", "").strip()
    return gpt4o_code


@weave.op()
def generate_completion(model: str, prompt: str, thinking_budget=None, streaming=True) -> str:
    response_text = ""
    if model.startswith("anthropic/"):
        # Calculate the correct total max_tokens: the thinking budget counts toward the cap
        if thinking_budget and int(thinking_budget) > 0:
            max_tokens = 8000 + int(thinking_budget)
        else:
            max_tokens = 8000
        create_kwargs = dict(
            model=model.replace("anthropic/", ""),
            max_tokens=max_tokens,
            messages=[{"role": "user", "content": prompt}],
        )
        # Only enable thinking if a positive budget was provided
        if thinking_budget and int(thinking_budget) > 0:
            create_kwargs["thinking"] = {"type": "enabled", "budget_tokens": int(thinking_budget)}

        client = anthropic_client
        if streaming:
            with client.messages.stream(**create_kwargs) as stream:
                for event in stream:
                    if event.type == "content_block_start":
                        block_type = event.content_block.type
                        if block_type == "thinking":
                            print("\n[THINKING]: ", end="", flush=True)
                        elif block_type == "text":
                            print("\n[RESPONSE]: ", end="", flush=True)
                    elif event.type == "content_block_delta":
                        d = event.delta
                        if getattr(d, "type", None) == "thinking_delta":
                            print(d.thinking, end="", flush=True)
                        elif getattr(d, "type", None) == "text_delta":
                            print(d.text, end="", flush=True)
                            response_text += d.text
                    elif event.type == "content_block_stop":
                        print()
        else:
            response = client.messages.create(**create_kwargs)
            for block in response.content:
                if block.type == "thinking" and getattr(block, "thinking", "").strip():
                    print("\n[THINKING]:", block.thinking.strip(), flush=True)
                elif block.type == "text" and getattr(block, "text", "").strip():
                    print("\n[RESPONSE]:", block.text.strip(), flush=True)
                    response_text += block.text.strip()
        return str(response_text)
    else:
        raise ValueError(f"Unsupported model: {model}")


def ask_llm_for_function_implementation(description: str, model: str, thinking_budget=None) -> str:
    prompt = (
        f"Write a Python3 function named `solve` with typed input arguments for this problem -- eg the solve function should take arguments to handle different test cases:\n\n"
        f"{description.strip()}\n\n"
        "Return only a valid Python function -- no special packages that arent commonly used and NO MAIN function, no if __name__ == __main__....., JUST write the function -- that returns the result. No comments, no explanations. "
        "HOWEVER, you still need to include necessary imports for libraries. "
        "IF you do not include the right imports, the code will not be executable, and your response will be judged as incorrect!"
    )
    return clean_llm_code_block(generate_completion(model, prompt, thinking_budget=thinking_budget))


@weave.op
def ask_llm_for_function_call(code: str, raw_input: str, model: str, thinking_budget=None) -> str:
    prompt = (
        "You're given a Python function and a single input string. "
        "Format it into a valid Python function call using only standard types.\n\n"
        f"Function:\n{code}\n\n"
        f"Input:\n{raw_input.strip()}\n\n"
        "Return ONLY a valid function call (e.g., solve(3, 5)) WITH NO 'def' "
    )
    response = oai_completion(
        model="openai/gpt-4o-2024-08-06",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    content = response["choices"][0]["message"]["content"]
    content = content.replace("```python", "").replace("```", "").strip()
    return content


def compare_output_with_llm(expected: str, actual: str, model: str) -> bool:
    # LLM-as-judge: treat outputs as equivalent even if formatting differs slightly
    prompt = (
        f"Expected output: {expected.strip()}\n"
        f"Actual output: {actual.strip()}\n\n"
        "Are these outputs equivalent? Eg ignore minor formatting errors etc, we are just looking for overall correctness in the output. Reply YES or NO."
    )
    response = oai_completion(
        model="openai/gpt-4o-2024-08-06",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return "YES" in str(response["choices"][0]["message"]["content"]).upper()


def run_code_and_call_function(code: str, function_call: str, timeout=10):
    # Execute the candidate solution in a separate Python process
    full_code = code + f"\n\nprint({function_call})"
    try:
        start = time.time()
        result = subprocess.run(
            [sys.executable, "-c", full_code],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        latency = time.time() - start
        return result.stdout.strip(), result.stderr.strip(), latency
    except subprocess.TimeoutExpired:
        return "", "Execution timed out.", timeout
    except Exception as e:
        return "", str(e), 0.0


def ask_model_for_pip_command(error_msg):
    prompt = (
        "Given this Python error:\n\n"
        + error_msg +
        "\n\nWrite the pip install command needed to fix it. Only return the command, e.g.:\n"
        "pip install requests"
    )
    # No extended thinking needed here; a thinking_budget could be propagated if desired
    return generate_completion("anthropic/claude-sonnet-4-20250514", prompt, thinking_budget=None)


def run_pip_install(pip_command):
    print(f"Running: {pip_command}")
    try:
        result = subprocess.run(
            pip_command.split(),
            capture_output=True,
            text=True,
            timeout=180,
        )
        print(result.stdout.strip())
        if result.stderr:
            print(result.stderr.strip())
    except Exception as e:
        print(f"pip install failed: {e}")


def evaluate_model_on_code_contests(model_name: str, thinking_budget=None):
    tb_str = "nothinking" if not thinking_budget or int(thinking_budget) == 0 else f"tb{thinking_budget}"
    print(f"\n\nRunning evaluation for model: {model_name} | thinking_budget={tb_str}\n")
    ds = load_dataset("deepmind/code_contests", split="test", streaming=True)
    ds = list(ds.take(31))

    eval_logger = EvaluationLogger(
        model=f"{model_name.replace('-', '_').replace('/', '_').replace('.', '_')}_{tb_str}",
        dataset="code_contests_test",
    )

    all_latencies = []
    for i in range(30):
        row = ds[i]
        description = row["description"]
        raw_inputs = row["public_tests"]["input"]
        expected_outputs = row["public_tests"]["output"]
        try:
            code = ask_llm_for_function_implementation(description, model=model_name, thinking_budget=thinking_budget)
            print(f"\n=== Task {row['name']} ===", flush=True)
            all_passed = True
            task_latencies = []
            results_lst, expected_lst = [], []
            for j, raw_input in enumerate(raw_inputs):
                expected = expected_outputs[j] if j < len(expected_outputs) else ""
                try:
                    function_call = ask_llm_for_function_call(code, raw_input, model=model_name, thinking_budget=thinking_budget)
                    result, error, latency = run_code_and_call_function(code, function_call)
                    if latency < 99:
                        task_latencies.append(latency)
                    if error:
                        print(f"[{j}] Runtime error: {error}")
                        if "ModuleNotFoundError" in error:
                            pip_cmd = ask_model_for_pip_command(error)
                            run_pip_install(pip_cmd)
                            # Re-run once after pip install
                            result, error, latency = run_code_and_call_function(code, function_call)
                            task_latencies.append(latency)
                            if error:
                                print(f"[{j}] Retry failed: {error}")
                                all_passed = False
                                continue
                        else:
                            all_passed = False
                            continue
                    is_correct = compare_output_with_llm(expected, result, model="openai/gpt-4o-2024-08-06")
                    results_lst.append(result)
                    expected_lst.append(expected)
                    if not is_correct:
                        all_passed = False
                    print(f"[{j}] input: {raw_input.strip()} → output: {result} | expected: {expected.strip()} | PASS: {is_correct} | latency: {latency:.2f}s")
                except Exception as inner:
                    print(f"[{j}] Inner error: {repr(inner)}")
                    all_passed = False

            task_avg_latency = sum(task_latencies) / len(task_latencies) if len(task_latencies) > 0 else 0.0
            all_latencies.extend(task_latencies)

            prediction_log = eval_logger.log_prediction(
                inputs={"description": description},
                output={"code": code, "execution_result": results_lst, "expected_execution_result": expected_lst},
            )
            prediction_log.log_score("correctness", all_passed)
            prediction_log.log_score("code_latency", task_avg_latency)
            prediction_log.finish()
        except Exception as e:
            print(f"[{i}] Top-level failure: {repr(e)}")
            prediction_log = eval_logger.log_prediction(
                inputs={"description": description},
                output=str(e),
            )
            prediction_log.log_score("correctness", False)
            prediction_log.finish()

    avg_latency = sum(all_latencies) / len(all_latencies) if all_latencies else 0.0
    eval_logger.log_summary({"avg_code_latency": avg_latency})
    print(f"Evaluation complete for {model_name} (thinking_budget={tb_str}). View in Weave UI.")


# ---- RUN CLAUDE SONNET 4 ACROSS MULTIPLE THINKING BUDGETS, INCLUDING "NO THINKING" ----
thinking_budgets = [None, 1024, 4000, 8000, 12000]
for tb in thinking_budgets:
    evaluate_model_on_code_contests("anthropic/claude-sonnet-4-20250514", thinking_budget=tb)
This Python script sets up an automated evaluation loop for Claude Sonnet 4 at several different thinking budgets, from no extended thinking at all up to 12,000 tokens. It takes problem descriptions from the DeepMind CodeContests dataset and asks the model to generate a Python solve function, then cleans and extracts the generated code and builds the function calls needed to run it against the provided test cases.
The framework is designed with robust error handling: if it encounters a ModuleNotFoundError during execution, it asks an LLM to suggest the correct pip install command, attempts to install the missing package, and retries the code. Crucially, instead of simple string comparisons, it uses GPT-4o to semantically compare the actual and expected outputs, which makes the evaluation more forgiving of minor formatting differences and focuses on true correctness.
To visualize our results, we use Weave Evaluations to log our evaluation metrics. This captures raw model code, test inputs, outputs, errors, and performance metrics like average latency. This makes it easy to inspect failures, spot trends across models, and ensures that every result and bug is traceable, which is especially useful as both models and benchmarks evolve. I used Weave’s new EvaluationLogger for this, which offers a flexible way to instrument evaluations. With the EvaluationLogger, you're not locked into any strict format; you can manually loop through your dataset, call the model however you want, and log predictions and scores as they come. This makes it straightforward to slot Weave into pre-existing pipelines or custom evaluation loops without rewriting everything.
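Stripped of the benchmark specifics, the EvaluationLogger pattern is just a handful of calls. Here is a minimal sketch using the same methods as the script above, with placeholder values:

import weave
from weave.flow.eval_imperative import EvaluationLogger

weave.init("codecontests_evalv2")

eval_logger = EvaluationLogger(model="claude_sonnet_4", dataset="code_contests_test")

# One logged prediction with its scores
pred = eval_logger.log_prediction(
    inputs={"description": "sample problem statement"},
    output={"code": "def solve(): ...", "execution_result": ["42"]},
)
pred.log_score("correctness", True)
pred.log_score("code_latency", 0.8)
pred.finish()

# Run-level summary once all predictions are logged
eval_logger.log_summary({"avg_code_latency": 0.8})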
Here are the results as shown inside Weave:

I tested 30 samples to assess the impact of varying internal "thinking" allocation on code generation performance. It's important to note that the "Total Tokens" metric displayed does not include the full thinking tokens, as the Claude 4 model sometimes provides only summaries of its internal thought process. Interestingly, the model with no thinking budget achieved a correctness of 36.7%, outperforming the configuration with a 1024-token thinking budget, which scored 26.7%. However, as the thinking budget increased further, so did performance. The model with a 4000-token budget reached 40.0% correctness, and the 8000-token budget maintained this level. Crucially, the model configured with a 12000-token thinking budget demonstrated the highest correctness at 53.3%, a substantial improvement. This suggests that while minimal thinking might sometimes hinder performance, a more substantial thinking allocation significantly boosts solution quality.
Experiment 2: Claude Sonnet 4 vs. Claude Opus 4 vs. Codex vs. Gemini 2.5 Pro
For the second experiment, I directly compare Claude 4 Sonnet, Claude 4 Opus, OpenAI’s new Codex model, and Gemini 2.5 Pro on an identical batch of thirty competitive programming problems from the CodeContests test set. Each model receives only the problem’s natural language description and must return a single, self-contained Python function as its solution.
For each sample input, I automatically generate the correct call signature and execute the candidate code in an isolated environment. The resulting output is then checked against the expected ground truth using a robust, LLM-powered semantic equivalence check. This approach guarantees that the assessment isn’t thrown off by minor formatting mismatches, but rather focuses on real correctness.
By running all four models under these tightly controlled and identical conditions, I get a clear picture of how each system handles authentic, real-world coding challenges, measuring not just code plausibility, but actual problem-solving and implementation skill.
Here’s the code for the eval:
import os
import sys
import time
import subprocess
import re

from datasets import load_dataset
from litellm import completion as oai_completion
from anthropic import Anthropic
from google import genai
from google.genai import types
from openai import OpenAI
import weave
from weave.flow.eval_imperative import EvaluationLogger

weave.init("codecontests_evalv2")

os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY", "your_api_key")
OAIKEY = os.environ["OPENAI_API_KEY"]
CLAUDE_KEY = os.getenv("CLAUDE_API_KEY", "your_api_key")
GENAI_KEY = os.getenv("GOOGLE_API_KEY", "your_api_key")

client = OpenAI(api_key=OAIKEY)
anthropic_client = Anthropic(api_key=CLAUDE_KEY)
gemini_client = genai.Client(api_key=GENAI_KEY)


def clean_llm_code_block(text):
    # Strip markdown fences, isolate the last `solve` definition, then have GPT-4o
    # extract only the runnable function (plus any imports it needs)
    cleaned_text = text.replace("```python", "").replace("```", "").strip()
    code_blocks = re.findall(r"(def solve\(.*?)(?=^def |\Z)", cleaned_text, re.DOTALL | re.MULTILINE)
    source_text = code_blocks[-1] if code_blocks else cleaned_text
    prompt = (
        "Given the following response from a language model, extract ONLY the valid Python code for the function. "
        "Do not include any explanations, text, or formatting fences. Only the code.\n\n"
        f"Response:\n{source_text}\n\n"
        "Return ONLY the Python code, including any necessary imports:"
    )
    response = oai_completion(
        model="openai/gpt-4o-2024-08-06",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    gpt4o_code = response["choices"][0]["message"]["content"]
    gpt4o_code = gpt4o_code.replace("```python", "").replace("```", "").strip()
    return gpt4o_code


@weave.op()
def generate_completion(model: str, prompt: str) -> str:
    # Codex models (openai/codex-*, openai/codex-mini-latest, etc.)
    if model.startswith("openai/codex-"):
        # Use the .responses.create API as per the latest OpenAI SDK
        codex_model = model.replace("openai/", "")
        # You can optionally add custom instructions (here, we use a neutral persona)
        response = client.responses.create(
            model=codex_model,
            instructions="You are a helpful, accurate Python coding assistant.",
            input=prompt,
        )
        return response.output_text.strip()
    # General OpenAI chat completions
    elif model.startswith("openai/"):
        response = oai_completion(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            reasoning_effort="low",
        )
        return response["choices"][0]["message"]["content"].strip()
    # Anthropic Claude, with a fixed 4,000-token thinking budget
    elif model.startswith("anthropic/"):
        response = anthropic_client.messages.create(
            model=model.replace("anthropic/", ""),
            max_tokens=8000,
            thinking={"type": "enabled", "budget_tokens": 4000},
            messages=[{"role": "user", "content": prompt}],
        )
        for block in response.content:
            if block.type == "text":
                return block.text.strip()
        return "[No Claude response]"
    # Gemini, also with a 4,000-token thinking budget
    elif model.startswith("gemini/"):
        result = gemini_client.models.generate_content(
            model=model.replace("gemini/", ""),
            config=types.GenerateContentConfig(
                thinking_config=types.ThinkingConfig(thinking_budget=4000)
            ),
            contents=[prompt],
        )
        return result.text.strip() if result.text else "[No Gemini response]"
    else:
        raise ValueError(f"Unsupported model: {model}")


def ask_llm_for_function_implementation(description: str, model: str) -> str:
    prompt = (
        f"Write a Python3 function named `solve` with typed input arguments for this problem -- eg the solve function should take arguments to handle different test cases:\n\n"
        f"{description.strip()}\n\n"
        "Return only a valid Python function -- no special packages that arent commonly used and NO MAIN function, no if __name__ == __main__....., JUST write the function -- that returns the result. No comments, no explanations. "
        "HOWEVER, you still need to include necessary imports for libraries. "
        "IF you do not include the right imports, the code will not be executable, and your response will be judged as incorrect!"
    )
    return clean_llm_code_block(generate_completion(model, prompt))


@weave.op
def ask_llm_for_function_call(code: str, raw_input: str, model: str) -> str:
    prompt = (
        "You're given a Python function and a single input string. "
        "Format it into a valid Python function call using only standard types.\n\n"
        f"Function:\n{code}\n\n"
        f"Input:\n{raw_input.strip()}\n\n"
        "Return ONLY a valid function call (e.g., solve(3, 5)) WITH NO 'def' "
    )
    response = oai_completion(
        model="openai/gpt-4o-2024-08-06",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    content = response["choices"][0]["message"]["content"]
    content = content.replace("```python", "").replace("```", "").strip()
    return content


def compare_output_with_llm(expected: str, actual: str, model: str) -> bool:
    # LLM-as-judge: treat outputs as equivalent even if formatting differs slightly
    prompt = (
        f"Expected output: {expected.strip()}\n"
        f"Actual output: {actual.strip()}\n\n"
        "Are these outputs equivalent? Eg ignore minor formatting errors etc, we are just looking for overall correctness in the output. Reply YES or NO."
    )
    response = oai_completion(
        model="openai/gpt-4o-2024-08-06",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return "YES" in str(response["choices"][0]["message"]["content"]).upper()


def run_code_and_call_function(code: str, function_call: str, timeout=10):
    # Execute the candidate solution in a separate Python process
    full_code = code + f"\n\nprint({function_call})"
    try:
        start = time.time()
        result = subprocess.run(
            [sys.executable, "-c", full_code],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        latency = time.time() - start
        return result.stdout.strip(), result.stderr.strip(), latency
    except subprocess.TimeoutExpired:
        return "", "Execution timed out.", timeout
    except Exception as e:
        return "", str(e), 0.0


def ask_model_for_pip_command(error_msg):
    prompt = (
        "Given this Python error:\n\n"
        + error_msg +
        "\n\nWrite the pip install command needed to fix it. Only return the command, e.g.:\n"
        "pip install requests"
    )
    return generate_completion("openai/gpt-4o-2024-08-06", prompt)


def run_pip_install(pip_command):
    print(f"Running: {pip_command}")
    try:
        result = subprocess.run(
            pip_command.split(),
            capture_output=True,
            text=True,
            timeout=180,
        )
        print(result.stdout.strip())
        if result.stderr:
            print(result.stderr.strip())
    except Exception as e:
        print(f"pip install failed: {e}")


def evaluate_model_on_code_contests(model_name: str):
    print(f"\n\nRunning evaluation for model: {model_name}\n")
    ds = load_dataset("deepmind/code_contests", split="test", streaming=True)
    ds = list(ds.take(31))

    eval_logger = EvaluationLogger(
        model=model_name.replace("-", "_").replace("/", "_").replace(".", "_"),
        dataset="code_contests_test",
    )

    all_latencies = []
    for i in range(30):
        row = ds[i]
        description = row["description"]
        raw_inputs = row["public_tests"]["input"]
        expected_outputs = row["public_tests"]["output"]
        try:
            code = ask_llm_for_function_implementation(description, model=model_name)
            print(f"\n=== Task {row['name']} ===", flush=True)
            all_passed = True
            task_latencies = []
            results_lst, expected_lst = [], []
            for j, raw_input in enumerate(raw_inputs):
                expected = expected_outputs[j] if j < len(expected_outputs) else ""
                try:
                    function_call = ask_llm_for_function_call(code, raw_input, model=model_name)
                    result, error, latency = run_code_and_call_function(code, function_call)
                    if latency < 99:
                        task_latencies.append(latency)
                    if error:
                        print(f"[{j}] Runtime error: {error}")
                        if "ModuleNotFoundError" in error:
                            pip_cmd = ask_model_for_pip_command(error)
                            run_pip_install(pip_cmd)
                            # Re-run once after pip install
                            result, error, latency = run_code_and_call_function(code, function_call)
                            task_latencies.append(latency)
                            if error:
                                print(f"[{j}] Retry failed: {error}")
                                all_passed = False
                                continue
                        else:
                            all_passed = False
                            continue
                    is_correct = compare_output_with_llm(expected, result, model="openai/gpt-4o-2024-08-06")
                    results_lst.append(result)
                    expected_lst.append(expected)
                    if not is_correct:
                        all_passed = False
                    print(f"[{j}] input: {raw_input.strip()} → output: {result} | expected: {expected.strip()} | PASS: {is_correct} | latency: {latency:.2f}s")
                except Exception as inner:
                    print(f"[{j}] Inner error: {repr(inner)}")
                    all_passed = False

            task_avg_latency = sum(task_latencies) / len(task_latencies) if len(task_latencies) > 0 else 0.0
            all_latencies.extend(task_latencies)

            prediction_log = eval_logger.log_prediction(
                inputs={"description": description},
                output={"code": code, "execution_result": results_lst, "expected_execution_result": expected_lst},
            )
            prediction_log.log_score("correctness", all_passed)
            prediction_log.log_score("code_latency", task_avg_latency)
            prediction_log.finish()
        except Exception as e:
            print(f"[{i}] Top-level failure: {repr(e)}")
            prediction_log = eval_logger.log_prediction(
                inputs={"description": description},
                output=str(e),
            )
            prediction_log.log_score("correctness", False)
            prediction_log.finish()

    avg_latency = sum(all_latencies) / len(all_latencies) if all_latencies else 0.0
    eval_logger.log_summary({"avg_code_latency": avg_latency})
    print(f"Evaluation complete for {model_name}. View in Weave UI.")


# ---- RUN FOR ALL TARGET MODELS ----
evaluate_model_on_code_contests("gemini/gemini-2.5-pro-preview-05-06")
evaluate_model_on_code_contests("anthropic/claude-sonnet-4-20250514")
evaluate_model_on_code_contests("openai/codex-mini-latest")
evaluate_model_on_code_contests("anthropic/claude-opus-4-20250514")
Here are the results for the evaluation:

Codex achieved the highest correctness score at 63%, outperforming the other models despite not having a “thinking” feature (at least not one that is publicly documented, nor does its API accept a reasoning-effort parameter). Claude Sonnet 4 and Claude Opus 4 attain correctness scores of 43% and 40%, respectively, while Gemini 2.5 Pro scores 36%. It was interesting to see Claude Sonnet 4 outperform Claude Opus 4. Note, too, that this eval capped the Claude and Gemini models at 4,000 thinking tokens, so it would be worth seeing how these models scale with larger thinking budgets.
The Weave Comparisons View
Weave's comparison view is particularly valuable for clearly visualizing differences in reasoning across multiple coding models. By displaying each model's outputs side-by-side, this view lets you immediately identify discrepancies in correctness and logical consistency. Through this intuitive interface, you can quickly pinpoint reasons why certain models fail while others are able to succeed.

Such insights make it easier to analyze and optimize coding performance effectively. By highlighting not only results, but also underlying coding styles, Weave’s comparison view simplifies evaluating and improving each model's reasoning capabilities.
Conclusion
The evaluation of Claude 4 Sonnet, Claude 4 Opus, OpenAI's Codex, and Gemini 2.5 Pro on the CodeContests benchmark provides valuable insights into the capabilities of these models in handling real-world coding challenges. The results show that Codex achieves the highest correctness score, outperforming the other models. However, the evaluation also highlights the importance of thinking budgets for Claude models, with a substantial increase in correctness observed when increasing the thinking budget to 12,000 tokens.
The comparison of these models underscores the complexity of evaluating large language models, particularly in the context of coding tasks. The use of a custom code execution framework and LLM-powered semantic equivalence check ensures a fair and robust evaluation process. The results also suggest that while benchmarks are useful for comparing models, they should be treated with caution due to the potential for data leakage and overfitting.
Overall, this evaluation contributes to our understanding of the strengths and limitations of current large language models in coding tasks. By analyzing the performance of these models, developers can refine their approaches to optimize model performance and improve solution quality.