
Evaluating the new Gemini 2.5 Pro Experimental model

Gemini 2.5 Pro Experimental is Google's most advanced AI model to date, featuring multimodal input support, a massive 1 million-token context window, and the ability to solve complex problems.
Created on March 28 | Last edited on March 28
Gemini 2.5 Pro Experimental is Google's latest AI model, designed to handle complex tasks with enhanced reasoning and coding capabilities. It can generate long chains of thought before answering, working through problems step by step and producing more accurate, contextually relevant outputs. The model also supports multimodal inputs—including text, audio, images, and video—and features an extended context window of 1 million tokens, soon to expand to 2 million.

Compared to previous models like Gemini 1.5 Pro and Gemini 2.0 Flash, Gemini 2.5 Pro Experimental offers significant advancements, and it particularly excels in use cases that require advanced reasoning.
Here's a look at the model in our Evaluations feature. You can see 2.5 (in blue) outperforming 2.0 (in pink):
AIME 2025 benchmark scores

Gemini 2.5 Pro explained

Gemini 2.5 Pro and Gemini 2.0 Flash Thinking are both reasoning models: they are designed to perform complex tasks by breaking problems into smaller steps and working through them with explicit logical reasoning, rather than simply generating answers the way general-purpose LLMs do. This makes them well suited for math, code, and any task where intermediate steps and structured thought are required to reach the correct output. The trade-off is that they usually need a larger token budget and take much longer to respond, since the model is effectively thinking through the problem step by step before producing an answer.
Here are some benchmarks comparing Gemini 2.5 Pro to other models:

And here are some benchmarks comparing Gemini 2.5 Pro to other Gemini models.
Disclaimer: I have not verified these results with my own implementation of the benchmarks.


The choice between models depends heavily on your specific use case. Factors like task complexity, latency requirements, cost constraints, and the availability of high-quality data for fine-tuning all play a role.
Simpler tasks like basic classification, retrieval, or short-form text generation may run perfectly well on cheaper, faster models. But more demanding applications, especially those involving multi-step reasoning or precision in math or code, will benefit from more capable models even if they cost more or respond slightly slower.
Pricing has not yet been announced for Gemini 2.5 Pro, but if you're after the highest reasoning performance, it is currently the smartest model available in the Gemini lineup.

Gemini 2.0 Flash vs. Gemini 2.0 Flash Lite

Comparing Flash 2.0 to Flash 2.0 Lite, the average performance difference is about 5% (with Flash 2.0 ahead of Flash 2.0 Lite), and the gap ranges from 1% to 13% depending on the task.
Flash 2.0 Lite is half the price of Flash 2.0, so the choice comes down to whether that ~5% performance gap matters for your use case. If you're optimizing for cost and scale, Flash Lite is a strong option. If you need more consistent accuracy across tasks, Flash 2.0 is the better pick.

Deep reasoning and problem-solving

Gemini 2.5 Pro is engineered to handle tasks that require deep analytical thinking and problem-solving. This capability makes it ideal for domains that demand rigorous analysis, such as scientific research, advanced mathematics, and technical problem solving.

Extensive context window

The model supports a context window of up to 1 million tokens, with plans to increase this capacity to 2 million. This extensive context handling allows for superior understanding and retention over longer conversations or documents, enabling the model to maintain coherence over extended interactions and manage more data in a single query than ever before.
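If you want a sense of how much of that window a given input actually uses, the google-genai SDK includes a token-counting call. Here's a minimal sketch (the file name is a placeholder, and you'd supply your own API key):
from google import genai

client = genai.Client(api_key="")  # add your Google AI Studio key

# Hypothetical long document you want to use as context
with open("long_report.txt") as f:
    document = f.read()

# Count tokens before sending, to see how much of the 1M-token window it consumes
count = client.models.count_tokens(
    model="gemini-2.5-pro-exp-03-25",
    contents=document,
)
print(count.total_tokens)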

Code generation and STEM applications

In the realm of software development and STEM fields, Gemini 2.5 Pro proves exceptionally capable. It can generate, analyze, and refactor complex code, supporting various programming languages and frameworks. This ability extends to scientific modeling, where the model can help simulate and predict complex scientific phenomena, making it a valuable tool for researchers and engineers.
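As a quick illustration, code can be passed to the model like any other text prompt. Here's a hedged sketch using the same client pattern as the examples later in this article (the snippet and prompt are placeholders):
from google import genai

client = genai.Client(api_key="")  # add your Google AI Studio key

# A deliberately clumsy function to refactor
snippet = '''
def total(xs):
    t = 0
    for i in range(len(xs)):
        t = t + xs[i]
    return t
'''

# Ask Gemini 2.5 Pro to refactor the function and explain the changes
res = client.models.generate_content(
    model="gemini-2.5-pro-exp-03-25",
    contents=f"Refactor this Python function to be more idiomatic and explain why:\n{snippet}",
)
print(res.text)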

Multimodal input capabilities

Unlike its predecessors, Gemini 2.5 Pro can process not only text but also audio, images, and video as input. This multimodal capability allows it to act on a wider range of inputs, making it versatile in applications like multimedia content creation, educational tools, and interactive simulations.

Accessing and experimenting with Gemini 2.5 Pro

Now we’ll write some code to actually use Gemini 2.5. You’ll run multimodal prompts, call custom tools, execute Python code, and evaluate math reasoning performance on real benchmark problems.
First, install the required libraries:
pip install google-generativeai weave datasets pillow requests litellm google-genai
You’ll also need API keys for both Google AI Studio and OpenAI. Set them using os.environ in your script or however you prefer to manage secrets.
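For example, a minimal way to set both keys at the top of a script (the placeholder strings are stand-ins for your real keys; shell environment variables or a secrets manager work just as well):
import os

# Placeholders - replace with your actual keys or load them from your environment
os.environ["GOOGLE_API_KEY"] = "your_google_ai_studio_api_key"
os.environ["OPENAI_API_KEY"] = "your_openai_api_key"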

Running Gemini 2.5 Pro on images

Gemini 2.5 Pro supports multimodal inputs, which means you can pass in not just text but also images. This makes it useful for anything from document analysis to interpreting charts or photos. The script below downloads an image from a URL, sends it to Gemini along with a prompt, and returns a textual explanation.
You can wrap this logic in a @weave.op to log inputs and outputs automatically and track runs, which is useful for debugging or model comparison later. Once the image is downloaded and passed in, Gemini will process the visual content alongside the text, just like any regular chat input—no special handling needed beyond packaging the prompt and image together.
from google import genai
from google.genai import types
from PIL import Image
import requests
from io import BytesIO
import weave

weave.init("gemini-vision-demo")

# Download the image
url = "https://d1yhils6iwh5l5.cloudfront.net/charts/resized/124357/large/cotd.png"
response = requests.get(url)
image = Image.open(BytesIO(response.content))

# Gemini client
client = genai.Client(api_key="")

@weave.op()
def analyze_chart(prompt: str, img: Image.Image) -> str:
    res = client.models.generate_content(
        model="gemini-2.5-pro-exp-03-25",
        contents=[prompt, img]
    )
    return res.text

# Call and print
result = analyze_chart("What's shown in this chart?", image)
print(result)

Because analyze_chart is wrapped in @weave.op, each call is logged automatically. In the Weave Traces UI, you'll see the exact image and prompt that were passed in, along with the model's output, giving you a full trace of the interaction for later review or comparison.


Calling custom tools with Gemini

Gemini 2.5 supports function calling natively. You can define your own Python functions, pass them into the chat session, and Gemini will figure out when to call them and what arguments to use.
In this example, the model is prompted to make a room "cozy for reading." The function set_light_values is registered as a tool, and Gemini decides how to call it, including selecting values for brightness and color temperature. The tool call is parsed and executed manually, and the result is returned alongside the model’s original response.
This shows how Gemini can be used not just for Q&A, but for controlling external systems or chaining logic through your own Python functions. You can also toggle whether function calls run automatically or require manual control.
from google import genai
from google.genai import types
import weave

weave.init("gemini-function-call")

# Define the mock function
def set_light_values(brightness: int, color_temp: str) -> dict[str, int | str]:
print("simulate function to set light values")
return {
"brightness": brightness,
"colorTemperature": color_temp
}

tool_list = [set_light_values]
tool_map = {fn.__name__: fn for fn in tool_list}

@weave.op()
def run_light_prompt(prompt: str):
config = {
'tools': tool_list,
'automatic_function_calling': {'disable': True},
'tool_config': {
'function_calling_config': {
'mode': 'any'
}
}
}

client = genai.Client(api_key="")
chat = client.chats.create(model='gemini-2.5-pro-exp-03-25', config=config)
response = chat.send_message(prompt)

raw_calls = []
results = []

if response.function_calls:
for fn_call in response.function_calls:
raw_calls.append({
"function": fn_call.name,
"args": fn_call.args
})
fn = tool_map.get(fn_call.name)
if fn:
result = fn(**fn_call.args)
results.append({
"function": fn_call.name,
"args": fn_call.args,
"result": result
})

return {
"text": response.text,
"raw_function_calls": raw_calls,
"executed_results": results
}

# Run it
result = run_light_prompt("Make the room cozy for reading.")
print(result)

Using a @weave.op around this function logs the entire interaction—including the function call Gemini attempted, the arguments it selected, and the output returned by the function. You can trace exactly how Gemini interpreted and acted on the prompt.


Code execution with Gemini 2.5 Pro

Gemini 2.5 can also generate and execute code directly. When code execution is enabled via the tool config, Gemini will not only write Python but actually run it and return the output. That includes printing intermediate steps or running computations inline.
This is useful for data analysis, math, and anything that requires executable logic rather than just generation. The script parses the model’s response to extract the code, the execution output, and any accompanying explanation. You can treat this like a lightweight coding agent—no extra eval logic or tooling needed.

from google import genai
from google.genai import types
import weave

weave.init("gemini-code-exec")

# Parse response helper
def parse_code_execution_response(response):
    code = None
    output = None
    explanation = None

    for part in response.candidates[0].content.parts:
        if part.executable_code:
            code = part.executable_code.code
        if part.code_execution_result:
            output = part.code_execution_result.output
        if part.text:
            explanation = part.text

    return {
        "code": code,
        "execution_output": output,
        "explanation": explanation
    }

# Inference op
@weave.op()
def run_code_inference(prompt: str):
    client = genai.Client(api_key="")

    chat = client.chats.create(
        model='gemini-2.5-pro-exp-03-25',
        config=types.GenerateContentConfig(
            tools=[types.Tool(code_execution=types.ToolCodeExecution)]
        )
    )

    response = chat.send_message(prompt)
    return parse_code_execution_response(response)

# Run it
result = run_code_inference("Please run Python code to calculate the sum of [5, 10, 15, 20].")
print(result)
Weave tracks the full flow:
  • the input prompt,
  • the code generated,
  • the execution result, and
  • the explanation
All of this is captured automatically in the trace, making it easy to audit or review later and see how Gemini handled a specific code-generation task.


Evaluating Gemini on AIME 2025 math problems

To test Gemini 2.5 on complex math problems, this script runs it on the AIME 2025 dataset and scores results using GPT-4o as a judge. Each model gets the same problem, generates a solution, and the scorer checks whether the boxed final answer matches the ground truth.
Everything runs through Weave, which handles logging, evaluation, and scoring. Each model is wrapped in a simple class with a predict() method. Scoring uses GPT-4o, which is asked to judge whether the model's final answer is correct based on the last 100 characters of its output. The judge's verdict is extracted from a JSON block in its response, and accuracy is computed over the full dataset.
The dataset is loaded via Hugging Face’s datasets library, combining both parts of AIME 2025. You can easily swap in your own dataset or model by editing the configs or class names. The point here is reproducible evaluation at scale, using real math problems and an external judge for fairness.
Here's the code:
import os
import asyncio
import json
from datasets import load_dataset
from litellm import completion
import google.generativeai as genai
import weave

weave.init("aime_evaluation")  # Initialize Weave

# Set API keys
OPENAI_API_KEY = "your_openai_api_key"  # Replace with your actual OpenAI API key
GOOGLE_API_KEY = "your_google_ai_studio_api_key"  # Replace with your actual Google API key

# Configure API clients
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
genai.configure(api_key=GOOGLE_API_KEY)

# Define Gemini model configurations
GEMINI_CONFIGS = [
    {
        "name": "gemini-2.5-pro-exp-03-25",
        "model_name": "gemini-2.5-pro-exp-03-25"
    },
    {
        "name": "gemini-2.0-flash-thinking-exp-01-21",
        "model_name": "gemini-2.0-flash-thinking-exp-01-21"
    }
]

# Consistent system message for all models
system_message = "Solve the following problem. put your final answer within \\boxed{}: "

# Function to perform inference using litellm for the scorer
def run_inference_openai(prompt, model_id="gpt-4o-2024-08-06"):
    try:
        response = completion(
            model=model_id,
            temperature=0.0,
            messages=[
                {"role": "user", "content": prompt}
            ]
        )
        # Extract content from litellm response
        if response and hasattr(response, 'choices') and len(response.choices) > 0:
            content = response.choices[0].message.content
            return content
        else:
            print("No content found in response")
            return None
    except Exception as e:
        print(f"Failed to get response: {e}")
        return None

# Gemini inference function
async def gemini_inference(prompt: str, config: dict) -> str:
    """
    Generate solution using Gemini with specified config.
    """
    try:
        model = genai.GenerativeModel(config["model_name"])
        # Enhanced prompt with step-by-step instruction
        enhanced_prompt = f"{system_message} {prompt}\n"
        print(f"\n--- Starting inference with {config['name']} ---")
        # Generate content
        response = model.generate_content(enhanced_prompt)
        # Extract text from response
        if response.text:
            text_content = response.text
            print(f"Successfully generated content with {config['name']}")
            return text_content
        else:
            return "No response generated"
    except Exception as e:
        print(f"Error in inference with {config['name']}: {e}")
        return f"Error: {str(e)}"

# Define model classes for each Gemini configuration
class Gemini25ProExpModel(weave.Model):
    @weave.op
    async def predict(self, text: str) -> str:
        return await gemini_inference(text, GEMINI_CONFIGS[0])

class Gemini20FlashModel(weave.Model):
    @weave.op
    async def predict(self, text: str) -> str:
        return await gemini_inference(text, GEMINI_CONFIGS[1])

# Score function using GPT-4o via litellm
@weave.op
async def gpt4o_scorer(label: str, model_output: str) -> dict:
    """Score the model's output by comparing it with the ground truth."""
    query = f"""
YOU ARE A LLM JUDGE DETERMINING IF THE FOLLOWING MODEL GENERATED ANSWER IS THE SAME AS THE CORRECT ANSWER
I WILL GIVE YOU THE LAST 100 CHARS OF THE MODEL'S REASONING PATH, WHICH WILL CONTAIN THE FINAL ANSWER ->

Model's Answer (last 100 chars): {str(model_output)[-100:]}
Correct Answer: {label}
Your task:
1. State the model's predicted answer (answer only).
2. State the ground truth (answer only).
3. Determine if the model's final answer is correct (ignore formatting differences, etc.). RESPOND with the predicted and ground truth answer, followed with a JSON object containing the correctness encapsulated within the following delimiters:
```json
{{ "correctness": true/false }}
```
"""
    # Perform inference using litellm
    response = run_inference_openai(query, "gpt-4o-2024-08-06")
    if response is None:
        return {"correctness": False, "reasoning": "Inference failed."}
    try:
        # Extract correctness JSON object from the response
        json_start = response.index("```json") + 7
        json_end = response.index("```", json_start)
        correctness = json.loads(response[json_start:json_end].strip()).get("correctness", False)
    except (ValueError, IndexError):
        correctness = False

    return {"correctness": correctness, "reasoning": response}


def load_ds():
    print("Loading AIME dataset...")
    all_problems = []

    # Load AIME2025-I
    aime_i = load_dataset("opencompass/AIME2025", "AIME2025-I")
    # Check what splits are available
    splits_i = list(aime_i.keys())
    print(f"Available splits in AIME2025-I: {splits_i}")
    # Use the first available split
    first_split_i = splits_i[0]
    problems_i = [{"text": row["question"], "label": row["answer"]} for row in aime_i[first_split_i]]
    print(f"Loaded {len(problems_i)} problems from AIME2025-I")
    all_problems.extend(problems_i)

    # Load AIME2025-II
    aime_ii = load_dataset("opencompass/AIME2025", "AIME2025-II")
    splits_ii = list(aime_ii.keys())
    print(f"Available splits in AIME2025-II: {splits_ii}")
    # Use the first available split
    first_split_ii = splits_ii[0]
    problems_ii = [{"text": row["question"], "label": row["answer"]} for row in aime_ii[first_split_ii]]
    print(f"Loaded {len(problems_ii)} problems from AIME2025-II")
    all_problems.extend(problems_ii)

    print(f"Total problems loaded: {len(all_problems)}")
    return all_problems



# Run evaluations for each model
async def run_evaluations():
    print("Loading dataset...")
    dataset = load_ds()
    print(f"Loaded {len(dataset)} problems")

    print("Initializing models...")
    models = {
        # "gemini-2.5-pro-exp": Gemini25ProExpModel(),
        "gemini-2.0-flash": Gemini20FlashModel(),
    }

    print("Preparing dataset for evaluation...")
    # You can uncomment this to use a subset for testing
    # test_size = 5  # Number of problems to evaluate per model
    # dataset_prepared = dataset[:test_size]
    dataset_prepared = dataset

    print("Running evaluations...")
    scorers = [gpt4o_scorer]
    for model_name, model in models.items():
        print(f"\n\n=== EVALUATING {model_name.upper()} ===")
        evaluation = weave.Evaluation(
            dataset=dataset_prepared,
            scorers=scorers,
            name=f"{model_name} Evaluation"
        )
        results = await evaluation.evaluate(model)
        print(f"Results for {model_name}: {results}")
        # Calculate and print accuracy
        if hasattr(results, 'scores') and 'gpt4o_scorer' in results.scores:
            correct = sum(1 for score in results.scores['gpt4o_scorer'] if score['correctness'])
            accuracy = correct / len(dataset_prepared) if dataset_prepared else 0
            print(f"{model_name} accuracy: {accuracy:.2%} ({correct}/{len(dataset_prepared)})")

if __name__ == "__main__":
    # Set your API keys here or in environment variables
    if not os.environ.get("OPENAI_API_KEY"):
        os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY  # Required for litellm
    # Run evaluations
    asyncio.run(run_evaluations())

Evaluation with W&B Weave

This evaluation compares Gemini 2.0 Flash and Gemini 2.5 Pro Experimental on a set of AIME-style math problems, using Weave to track correctness, latency, and token usage.

Gemini 2.5 Pro achieved a significantly higher correctness score (0.8) compared to Gemini 2.0 Flash (0.53), but with much higher token usage. The increase in both latency and tokens reflects a longer chain of thought, as Gemini 2.5 takes more steps and uses more internal computation to reach its answers.
Correctness is evaluated using GPT-4o, which checks if the model's final boxed answer matches the ground truth. These results align with recent findings that allocating more budget for reasoning—either via longer context windows, more internal steps, or extended inference time—leads to more accurate outputs on complex tasks. Gemini 2.5 performs better because it thinks more, not because it's faster.

The Weave comparison view

Weave’s comparison view is especially useful for evaluating reasoning models because it allows for a direct visualization of response traces, making it easier to spot discrepancies in logic, structure, and correctness. By presenting multiple outputs side-by-side, it highlights the differences between the step-by-step approaches a model takes to solving a problem.

This format is particularly valuable for understanding the reasoning paths of the model. By providing transparency into the model’s thought process, Weave’s comparison tool makes it easier to refine and optimize reasoning models for specific applications. Weave’s comparison view also provides direct access to individual reasoning traces, making it easy to inspect latency, cost, and token usage for each response. This allows for a precise evaluation of how different reasoning budgets affect efficiency beyond just correctness.

Conclusion

Gemini 2.5 Pro Experimental reflects a shift toward deeper, more deliberate AI reasoning and improves accuracy on complex tasks like math, code, and multi-step analysis. This makes it especially useful in scenarios where precision and reasoning quality matter more than raw response time.
Weave makes it easy to evaluate models like Gemini in a controlled and transparent way. By tracking correctness, latency, and token usage across different prompts and model variants, it helps you understand the trade-offs between reasoning depth and computational cost. As models continue to scale in capability and complexity, having tooling like Weave to benchmark and debug model behavior becomes increasingly important.






Iterate on AI agents and models faster. Try Weights & Biases today.