
Evaluating Claude 3.7 Sonnet: Performance, reasoning, and cost optimization

Experimenting with Anthropic's new flagship LLM, Claude 3.7 Sonnet!
Claude 3.7 Sonnet is Anthropic’s latest AI model, bringing a major leap in reasoning, coding, and multimodal capabilities. Designed as an upgrade to Claude 3.5, this version introduces Extended Thinking mode, a feature that allows you to control how much the AI "thinks" before providing answers. With a 200K token context window, improved problem-solving skills, and enhanced coding abilities, Claude 3.7 positions itself as one of the most powerful LLMs available.
The model is particularly well-suited for software development, research, and business automation. It delivers some of the strongest reasoning performance in the AI space, outperforming prior Claude versions and even competing models like GPT-4o on certain benchmarks.
However, this increased capability brings a greater need for efficient budgeting: managing how much computation and token consumption you are willing to allocate per query. In this article, I will demonstrate how to work with Claude 3.7 and evaluate how additional reasoning effort impacts overall performance on the AIME 2024 math benchmark.






What's new in Claude 3.7 Sonnet?

Claude 3.7 Sonnet introduces several major upgrades over its predecessor, Claude 3.5. The most significant is Extended Thinking mode, which allows you to trade off speed for deeper reasoning. In standard mode, Claude responds quickly with general knowledge, but in Extended mode, it engages in step-by-step reflection, improving accuracy on complex tasks like coding, math, and logical deduction. Unlike some models that rely on separate architectures for fast and slow processing, Claude 3.7 uses a single hybrid model for both, giving you more control over performance.

Performance benchmarks

Comparing these frontier AI models across specific benchmarks reveals their distinct capabilities:

Instruction Following (IFEval): Claude 3.7 Sonnet leads with 93.2% (extended thinking) and 90.8% (standard mode), outperforming DeepSeek R1's 83.3%. Grok 3 Beta and o3-mini lack reported scores here.
Math Problem-Solving (MATH 500): o3-mini leads this benchmark at 97.9%, narrowly ahead of DeepSeek R1's 97.3%, with Claude 3.7 Sonnet at 96.2% (extended thinking).
Graduate-Level Reasoning (GPQA Diamond): Claude 3.7 Sonnet scores 84.8%, slightly above Grok 3 Beta which achieves 84.6%. Claude 3.7 Sonnet's high score uses "internal scoring" with parallel test time compute, while Grok 3's result uses majority voting with N=64 samples. o3-mini reaches 79.7%.
Agentic Coding (SWE-bench Verified): Claude 3.7 Sonnet leads with 70.3% (high compute) compared to o3-mini's 49.3%, DeepSeek R1's 49.2%, and no reported score for Grok 3.
High School Math (AIME 2024): Grok 3 Beta scores 93.3% and Claude 3.7 Sonnet reaches 80.0%. As with GPQA, Claude's score uses internal scoring with parallel test time compute, while Grok 3's high result uses majority voting with N=64 samples. o3-mini achieves 83.3%.
Visual Reasoning (MMMU): o3-mini is reported at 78.2%, Grok 3 Beta at 78.0%, and Claude 3.7 Sonnet at 75%.

Multimodal capabilities and large context

Claude 3.7 is now multimodal, meaning it can process images as well as text. This allows you to upload screenshots, diagrams, and scanned documents for analysis. Additionally, its 200K token context window remains one of the largest in the industry, enabling you to work with entire books, massive research documents, or full code repositories in a single prompt.
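As a minimal sketch of the multimodal input path (the file name diagram.png is hypothetical, and the client setup mirrors the code used later in this article), an image can be passed to the Messages API as a base64-encoded content block alongside a text prompt:
import base64
from anthropic import Anthropic

client = Anthropic(api_key="your claude key")  # Replace with your API key

# Read a local image (hypothetical file name) and base64-encode it
with open("diagram.png", "rb") as f:
    image_data = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=1000,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": image_data}},
            {"type": "text", "text": "Summarize what this diagram shows."},
        ],
    }],
)
print(response.content[0].text)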

Coding enhancements

Claude 3.7 was specifically optimized for software engineering. It supports a new feature called Claude Code, a command-line tool that integrates with developer workflows, allowing Claude to edit files, run tests, and even commit code to GitHub. This makes it one of the most autonomous AI coding assistants available. Benchmarks show state-of-the-art performance in solving real-world coding challenges, making it a preferred tool for developers working on large-scale projects.

Budgeting and costs

Claude 3.7 maintains the same pricing as Claude 3.5: $3 per million input tokens and $15 per million output tokens. However, with the introduction of Extended Thinking mode, budgeting becomes more important than ever. Unlike standard responses, extended reasoning consumes additional "thinking tokens", which count toward the total cost. You must decide how much computational effort you want to allocate to each query, balancing depth of reasoning with expense and response time.
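As a rough, back-of-the-envelope illustration of how thinking tokens affect cost (the token counts below are hypothetical; thinking tokens are billed as output tokens), a single extended-thinking request might break down like this:
# Rough cost estimate for one extended-thinking request (hypothetical token counts)
INPUT_PRICE_PER_M = 3.00    # $ per million input tokens
OUTPUT_PRICE_PER_M = 15.00  # $ per million output tokens (thinking tokens count as output)

input_tokens = 1_500        # prompt size (assumed)
thinking_tokens = 8_000     # extended reasoning actually consumed (assumed)
answer_tokens = 1_000       # visible answer (assumed)

cost = (input_tokens * INPUT_PRICE_PER_M
        + (thinking_tokens + answer_tokens) * OUTPUT_PRICE_PER_M) / 1_000_000
print(f"Estimated cost: ${cost:.4f}")  # ≈ $0.14 for this example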
The next section will explore how budgeting works in Claude 3.7, how it compares to OpenAI’s approach, and the trade-offs between cost, accuracy, and latency.

What is budgeting in Claude 3.7?

Budgeting in Claude 3.7 refers to managing token usage to optimize performance while controlling costs. Since the model offers both fast and extended reasoning, you can fine-tune how many tokens Claude should use for thinking before providing an answer. This is especially useful in scenarios where accuracy is important, but excessive token consumption could lead to higher costs or longer response times.

Why budgeting matters

Every response from Claude 3.7 consumes tokens, but Extended Thinking mode significantly increases this consumption. A standard response may use a few hundred to a few thousand tokens, whereas an extended reasoning response—especially for complex coding or logic problems—can consume tens of thousands. Since pricing is based on input/output token usage, longer reasoning translates directly into higher expenses.
Budgeting ensures that you get the most out of the model without overspending. It also helps in latency-sensitive applications where speed is a factor. For example, if an AI-powered support chatbot is using Claude, enabling extended reasoning on every query could slow down responses and drive up operational costs unnecessarily. In such cases, selectively using extended reasoning for only the most complex queries provides a better balance.

Budgeting comparison to OpenAI’s o3-mini

Claude 3.7 offers manual budgeting, allowing you to control reasoning depth. A higher budget improves accuracy, especially for coding, math, and logic, but increases cost and slows responses down. A lower budget is faster and cheaper but may weaken reasoning. One major advantage of Claude is its full reasoning trace, providing transparency and interpretability.
OpenAI’s o3-mini automates budgeting with a reasoning effort parameter. High mode enhances accuracy without manual tuning but consumes more tokens, while low mode is cost-effective but may weaken reasoning. Unlike Claude, o3 does not provide visibility into its full reasoning process, making Claude a great choice for those who need control and transparency.
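To make the difference in control surface concrete, here is a minimal sketch (assuming both Python SDKs are installed and API keys are set in the environment) of sending the same question with Claude's explicit token budget versus o3-mini's reasoning_effort setting:
from anthropic import Anthropic
from openai import OpenAI

anthropic_client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
openai_client = OpenAI()        # reads OPENAI_API_KEY from the environment

prompt = "How many primes are there between 100 and 200?"

# Claude 3.7 Sonnet: explicit thinking budget in tokens
claude_response = anthropic_client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=6000,
    thinking={"type": "enabled", "budget_tokens": 4000},
    messages=[{"role": "user", "content": prompt}],
)

# o3-mini: coarse-grained reasoning effort ("low", "medium", or "high")
o3_response = openai_client.chat.completions.create(
    model="o3-mini",
    reasoning_effort="high",
    messages=[{"role": "user", "content": prompt}],
)
Claude's budget_tokens gives fine-grained, per-request control over reasoning depth, while reasoning_effort exposes only three coarse levels.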

Running inference in Claude 3.7 Sonnet

Claude 3.7 Sonnet provides a flexible inference system that allows you to balance response speed and depth of reasoning. In its basic inference mode, the model processes queries without engaging in extended multi-step reasoning, making it ideal for tasks requiring quick and efficient responses. This mode is particularly useful for straightforward problem-solving, logical puzzles, and general knowledge queries where additional computational effort isn't necessary.
In this implementation, the inference request dynamically adjusts based on the THINKING_BUDGET parameter. If the budget is greater than zero, Claude is allowed to engage in step-by-step reasoning before providing an answer. Otherwise, it defaults to a standard response with minimal computation. This ensures that you can optimize both accuracy and token consumption depending on the complexity of the task.
import os
import json
from anthropic import Anthropic
import weave; weave.init("claude37_inference")

# Configuration
API_KEY = "your claude key"  # Replace with your API key
MODEL = "claude-3-7-sonnet-20250219"
THINKING_BUDGET = 2000
ANSWER_TOKENS = 4000
MAX_TOKENS = ANSWER_TOKENS + THINKING_BUDGET
ENABLE_STREAMING = False  # Set to True to enable streaming

# Default prompt if none is provided
DEFAULT_PROMPT = """
Solve this puzzle: Three people check into a hotel. They pay $30 to the manager.
The manager finds out that the room only costs $25 so he gives $5 to the bellboy to return
to the three people. The bellboy, however, decides to keep $2 and gives $1 back to each person.
Now, each person paid $10 and got back $1, so they paid $9 each, totaling $27.
The bellboy kept $2, which makes $29. Where is the missing $1?
"""

# Initialize client
client = Anthropic(api_key=API_KEY)


@weave.op
def run_inference(prompt=None):
    """
    Run inference with the given prompt, or the default prompt if none is provided.

    Args:
        prompt (str, optional): The prompt to send to Claude. Defaults to None.

    Returns:
        dict: The complete response object when using basic_inference,
        or the final text response when using stream_inference.
    """
    # Use the provided prompt or fall back to the default
    actual_prompt = prompt if prompt is not None else DEFAULT_PROMPT
    if ENABLE_STREAMING:
        return stream_inference(actual_prompt)
    else:
        return basic_inference(actual_prompt)


def basic_inference(prompt):
    """
    Basic non-streaming inference.

    Args:
        prompt (str): The prompt to send to Claude

    Returns:
        dict: The complete response object from the API
    """
    try:
        # Prepare request parameters
        request_params = {
            "model": MODEL,
            "max_tokens": MAX_TOKENS,
            "messages": [{"role": "user", "content": prompt}]
        }
        # Only include thinking if budget is greater than 0
        if THINKING_BUDGET > 0:
            request_params["thinking"] = {"type": "enabled", "budget_tokens": THINKING_BUDGET}
        response = client.messages.create(**request_params)
        # Print thinking blocks and answer
        for block in response.content:
            if block.type == "thinking":
                print("\n=== THINKING ===")
                print(block.thinking)
                print("===============")
            elif block.type == "text":
                print("\n=== ANSWER ===")
                print(block.text)
                print("==============")
        return response
    except Exception as e:
        print(f"Error: {e}")
        return None


def stream_inference(prompt):
    """
    Streaming inference.

    Args:
        prompt (str): The prompt to send to Claude

    Returns:
        str: The combined text response from the stream
    """
    try:
        # Prepare streaming request parameters
        stream_params = {
            "model": MODEL,
            "max_tokens": MAX_TOKENS,
            "messages": [{"role": "user", "content": prompt}]
        }
        # Only include thinking if budget is greater than 0
        if THINKING_BUDGET > 0:
            stream_params["thinking"] = {"type": "enabled", "budget_tokens": THINKING_BUDGET}
        full_text_response = ""
        with client.messages.stream(**stream_params) as stream:
            print("\n=== STREAMING RESPONSE ===")
            current_block_type = None
            for event in stream:
                if event.type == "content_block_start":
                    current_block_type = event.content_block.type
                    print(f"\n--- Starting {current_block_type} block ---")
                elif event.type == "content_block_delta":
                    if event.delta.type == "thinking_delta":
                        print(event.delta.thinking, end="", flush=True)
                    elif event.delta.type == "text_delta":
                        print(event.delta.text, end="", flush=True)
                        full_text_response += event.delta.text
                elif event.type == "content_block_stop":
                    print(f"\n--- End {current_block_type} block ---")
                elif event.type == "message_stop":
                    print("\n--- Message complete ---")
            print("\n==========================")
        return full_text_response
    except Exception as e:
        print(f"Error in streaming: {e}")
        return None


if __name__ == "__main__":
    # Example 1: Using the default prompt
    default_response = run_inference()
    print("\nInference with default prompt complete.")
    # Example 2: Using a custom prompt
    custom_prompt = "Explain how quantum computing differs from classical computing in simple terms."
    custom_response = run_inference(custom_prompt)
    print("\nInference with custom prompt complete.")
By structuring the request dynamically, this approach ensures that the model only engages in extended reasoning when explicitly needed. This level of control allows you to make real-time adjustments based on cost and latency considerations.
For more complex problems—such as multi-step math proofs or coding tasks—you can switch to extended reasoning mode, where Claude is allowed to allocate additional tokens for deeper analysis. The next section will examine how this impacts accuracy and problem-solving performance in practice.
After running our script, we can see the inputs and outputs to our run_inference function inside Weave:



Tool use in Claude 3.7 Sonnet

Claude 3.7 Sonnet extends its capabilities by integrating external tools, allowing it to retrieve live data, perform calculations, and interact with structured sources. Instead of relying solely on internal reasoning, the model can call APIs, execute functions, and incorporate results into its responses. This improves accuracy and expands its ability to handle real-world tasks.
In this implementation, Claude selects from multiple tools, including a weather API, a calculator, and a dictionary lookup function. The model first evaluates the user’s query, determines if a tool is needed, and generates the appropriate tool call. If required, the system loops until the correct result is retrieved and incorporated into the response.
import os
import json
from anthropic import Anthropic
import weave; weave.init("claude37_inference")

# Configuration
API_KEY = "your claude key"  # Replace with your API key
MODEL = "claude-3-7-sonnet-20250219"
ENABLE_STREAMING = False  # Set to True to enable streaming output

# Initialize client
client = Anthropic(api_key=API_KEY)

# Tool definitions
TOOLS = [
    {
        "name": "weather",
        "description": "Get current weather information for a location.",
        "input_schema": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "The location to get weather for."}
            },
            "required": ["location"]
        }
    },
    {
        "name": "calculator",
        "description": "Perform mathematical calculations.",
        "input_schema": {
            "type": "object",
            "properties": {
                "expression": {"type": "string", "description": "The mathematical expression to evaluate."}
            },
            "required": ["expression"]
        }
    },
    {
        "name": "dictionary",
        "description": "Get definitions of words.",
        "input_schema": {
            "type": "object",
            "properties": {
                "word": {"type": "string", "description": "The word to look up."}
            },
            "required": ["word"]
        }
    }
]

# Mock data for tools
WEATHER_DATA = {
    "New York": {"temperature": 72, "condition": "Sunny", "humidity": "45%"},
    "London": {"temperature": 62, "condition": "Cloudy", "humidity": "78%"},
    "Tokyo": {"temperature": 80, "condition": "Partly cloudy", "humidity": "65%"},
    "Paris": {"temperature": 65, "condition": "Rainy", "humidity": "82%"},
    "Sydney": {"temperature": 85, "condition": "Clear", "humidity": "55%"},
}

DICTIONARY_DATA = {
    "ephemeral": {
        "definition": "Lasting for a very short time.",
        "part_of_speech": "adjective",
        "example": "Ephemeral snowflakes melted on her hand."
    },
    "ubiquitous": {
        "definition": "Present, appearing, or found everywhere.",
        "part_of_speech": "adjective",
        "example": "Smartphones have become ubiquitous in modern society."
    },
    "serendipity": {
        "definition": "The occurrence of events by chance in a happy or beneficial way.",
        "part_of_speech": "noun",
        "example": "The discovery was a perfect example of serendipity."
    },
    "algorithm": {
        "definition": "A process or set of rules to be followed in calculations or other problem-solving operations.",
        "part_of_speech": "noun",
        "example": "A search algorithm"
    }
}


def get_weather(location):
    """Mock weather data function."""
    return WEATHER_DATA.get(location, {"error": f"No weather data available for {location}"})


def calculate(expression):
    """Simple calculator function."""
    try:
        # Warning: eval can be dangerous in production code.
        # Use a proper math expression parser in real applications.
        result = eval(expression, {"__builtins__": {}})
        return {"result": result}
    except Exception as e:
        return {"error": f"Could not evaluate expression: {str(e)}"}


def get_definition(word):
    """Mock dictionary function."""
    return DICTIONARY_DATA.get(word.lower(), {"error": f"No definition found for '{word}'"})


def execute_tool(tool_name, tool_input):
    """Execute the appropriate tool based on the name and input."""
    if tool_name == "weather":
        return get_weather(tool_input["location"])
    elif tool_name == "calculator":
        return calculate(tool_input["expression"])
    elif tool_name == "dictionary":
        return get_definition(tool_input["word"])
    else:
        return {"error": "Unknown tool requested"}


@weave.op
def multi_tool_example(prompt):
    """
    Example showing Claude choosing between multiple tools.

    Args:
        prompt (str): The user's input prompt

    Returns:
        dict: The final response object from the API
    """
    # Initial request with the user's prompt
    print(f"\n=== QUESTION: {prompt} ===")
    response = client.messages.create(
        model=MODEL,
        max_tokens=4000,
        thinking={"type": "enabled", "budget_tokens": 2000},
        tools=TOOLS,
        messages=[{"role": "user", "content": prompt}]
    )
    # Display thinking and tool selection
    thinking_blocks = [b for b in response.content if b.type == "thinking"]
    for block in thinking_blocks:
        print("\n🧠 THINKING:")
        print(block.thinking[:300] + "..." if len(block.thinking) > 300 else block.thinking)
    # Process tool use if needed
    conversation = [{"role": "user", "content": prompt}]
    # We might need multiple tool calls, so loop until we get a final answer
    while response.stop_reason == "tool_use":
        tool_block = next((b for b in response.content if b.type == "tool_use"), None)
        if tool_block:
            # Show which tool was selected
            print(f"\n🔧 SELECTED TOOL: {tool_block.name}")
            print(f"Tool input: {tool_block.input}")
            # Execute the appropriate tool
            tool_result = execute_tool(tool_block.name, tool_block.input)
            print(f"Tool result: {json.dumps(tool_result, indent=2)}")
            # Save assistant's response (thinking + tool use)
            assistant_blocks = thinking_blocks + [tool_block]
            conversation.append({"role": "assistant", "content": assistant_blocks})
            # Add tool result to conversation
            conversation.append({
                "role": "user",
                "content": [{
                    "type": "tool_result",
                    "tool_use_id": tool_block.id,
                    "content": json.dumps(tool_result)
                }]
            })
            # Get next response
            response = client.messages.create(
                model=MODEL,
                max_tokens=4000,
                thinking={"type": "enabled", "budget_tokens": 2000},
                tools=TOOLS,
                messages=conversation
            )
            # Update thinking blocks for next iteration
            thinking_blocks = [b for b in response.content if b.type == "thinking"]
            for block in thinking_blocks:
                print("\n🧠 ADDITIONAL THINKING:")
                print(block.thinking[:300] + "..." if len(block.thinking) > 300 else block.thinking)
    # Print final answer
    print("\n✓ FINAL ANSWER:")
    final_text = ""
    for block in response.content:
        if block.type == "text":
            print(block.text)
            final_text += block.text
    # Collect all tools used throughout the conversation
    tools_used = []
    for msg in conversation:
        if msg["role"] == "assistant" and isinstance(msg["content"], list):
            for block in msg["content"]:
                if hasattr(block, "type") and block.type == "tool_use":
                    tools_used.append({
                        "name": block.name,
                        "input": block.input,
                        "id": block.id
                    })
    # Return the complete response object for further processing
    return {
        "response_object": response,
        "final_text": final_text,
        "conversation_history": conversation,
        "tools_used": tools_used
    }


if __name__ == "__main__":
    # Example 1: Weather question
    result1 = multi_tool_example("What's the current weather in Tokyo?")
    # Example 2: Math calculation
    result2 = multi_tool_example("Calculate the square root of 144 plus 25")
    # Example 3: Dictionary definition
    result3 = multi_tool_example("Define the word 'algorithm' for me")
    # Example 4: Complex question requiring multiple tools
    result4 = multi_tool_example("If it's 62°F in London, what's that in Celsius?")
    # You can now use the returned results for further processing
    print("\n=== Example of accessing returned data ===")
    print(f"Example 4 final text: {result4['final_text']}")
    # Print tools used in the complex query example
    print("\n=== Tools used in Example 4 ===")
    if result4['tools_used']:
        for i, tool in enumerate(result4['tools_used'], 1):
            print(f"Tool {i}: {tool['name']} (ID: {tool['id']})")
            print(f"Input: {json.dumps(tool['input'], indent=2)}")
    else:
        print("No tools were used in this example.")
This method allows Claude to decide when to use external tools based on the input. The model analyzes the query, calls the necessary tool if needed, processes the result, and integrates it into its response.
This approach keeps automation transparent, letting you see how Claude reasons and interacts with tools. Retrieving live data or solving precise calculations becomes more accurate and efficient.
By incorporating tool use into its reasoning, Claude 3.7 Sonnet expands its problem-solving abilities while maintaining speed and reliability. With W&B Weave, we can also track tool usage, so we can monitor how our model uses each tool:



Evaluating Claude 3.7’s reasoning capabilities with Weave Evaluations

Now we will evaluate Claude 3.7’s reasoning capabilities by testing different reasoning budgets, ranging from standard mode to extended thinking with up to 24K tokens. Using Weave Evaluations, we will track model inputs, outputs, and performance metrics, providing a clear and structured way to analyze how budgeting affects accuracy, latency, and cost. By the end, you’ll have a streamlined process to explore Claude 3.7’s capabilities and understand the trade-offs of different reasoning budgets.
The dataset for this evaluation is the AIME 2024 dataset, a collection of challenging mathematical reasoning problems. Each entry consists of a problem description as the input and a corresponding solution as the output. This dataset is designed to test the reasoning capabilities of large language models across a variety of tasks, including algebra, calculus, and number theory. By using this dataset, we aim to benchmark Claude 3.7’s ability to understand and solve problems requiring both contextual comprehension and precise logical reasoning.
Here's the code for my evaluation:
import os
import asyncio
import json
from datasets import load_dataset
from anthropic import Anthropic
from litellm import completion  # Using litellm instead of Azure OpenAI
import weave

weave.init("aime_evaluation")  # Initialize Weave

# Initialize Anthropic client
ANTHROPIC_API_KEY = os.environ.get("ANTHROPIC_API_KEY", "your key")
client = Anthropic(api_key=ANTHROPIC_API_KEY)

# Set OpenAI API key for litellm
os.environ["OPENAI_API_KEY"] = "your key"  # Replace with your actual OpenAI API key

# Constants
MODEL_NAME = "claude-3-7-sonnet-20250219"
MAX_ANSWER_TOKENS = 8000  # Maximum tokens for the answer

# Define different reasoning setups to test, with much larger thinking budgets
REASONING_CONFIGS = [
    {
        "name": "standard",
        "thinking": None,  # Standard Claude (no extended thinking)
        "max_tokens": MAX_ANSWER_TOKENS
    },
    {
        "name": "thinking_4k",
        "thinking": {"type": "enabled", "budget_tokens": 4000},
        "max_tokens": MAX_ANSWER_TOKENS + 4000
    },
    {
        "name": "thinking_8k",
        "thinking": {"type": "enabled", "budget_tokens": 8000},
        "max_tokens": MAX_ANSWER_TOKENS + 8000
    },
    {
        "name": "thinking_16k",
        "thinking": {"type": "enabled", "budget_tokens": 16000},
        "max_tokens": MAX_ANSWER_TOKENS + 16000
    },
    {
        "name": "thinking_24k",
        "thinking": {"type": "enabled", "budget_tokens": 24000},
        "max_tokens": MAX_ANSWER_TOKENS + 24000
    }
]

# Consistent system message for all models
system_message = "Solve the following problem. put your final answer within \\boxed{}: "


# Function to perform inference using litellm (for the scorer)
def run_inference(prompt, model_id="gpt-4o-2024-08-06"):
    try:
        response = completion(
            model=model_id,
            temperature=0.0,
            messages=[
                {"role": "user", "content": prompt}
            ]
        )
        # Extract content from the litellm response
        if response and hasattr(response, 'choices') and len(response.choices) > 0:
            content = response.choices[0].message.content
            return content
        else:
            print("No content found in response")
            return None
    except Exception as e:
        print(f"Failed to get response: {e}")
        return None


# Claude inference function with streaming
async def claude_inference(prompt: str, config: dict) -> str:
    """
    Generate a solution using Claude with the specified reasoning config and streaming.
    """
    try:
        kwargs = {
            "model": MODEL_NAME,
            "max_tokens": config["max_tokens"],
            "messages": [
                {"role": "user", "content": system_message + prompt}
            ]
        }
        # Add thinking parameters if specified
        if config["thinking"]:
            kwargs["thinking"] = config["thinking"]
        print(f"\n--- Starting inference with {config['name']} config ---")
        print(f"Max tokens: {config['max_tokens']}")
        if config["thinking"]:
            print(f"Thinking budget: {config['thinking']['budget_tokens']} tokens")
        thinking_content = ""
        text_content = ""
        # Use streaming
        with client.messages.stream(**kwargs) as stream:
            for event in stream:
                if event.type == "content_block_start":
                    if event.content_block.type == "thinking":
                        print(f"\n[Starting thinking block...]")
                    elif event.content_block.type == "text":
                        print(f"\n[Starting response...]")
                    elif event.content_block.type == "redacted_thinking":
                        print(f"\n[Redacted thinking block]")
                elif event.type == "content_block_delta":
                    if event.delta.type == "thinking_delta":
                        # Just print dots to show progress without flooding the console
                        print(".", end="", flush=True)
                        thinking_content += event.delta.thinking
                    elif event.delta.type == "text_delta":
                        print(event.delta.text, end="", flush=True)
                        text_content += event.delta.text
                elif event.type == "content_block_stop":
                    if hasattr(event, 'content_block') and event.content_block.type == "thinking":
                        print(f"\n[Thinking block complete - {len(thinking_content)} chars]")
                    elif hasattr(event, 'content_block') and event.content_block.type == "text":
                        print(f"\n[Response complete - {len(text_content)} chars]")
                    else:
                        print(f"\n[Block complete]")
        print(f"\n--- Inference complete ---")
        # If no text was generated, return an error message
        if not text_content:
            return "No response generated"
        return text_content
    except Exception as e:
        print(f"Error in inference: {e}")
        return f"Error: {str(e)}"


# Define model classes for each reasoning configuration
class ClaudeStandardModel(weave.Model):
    @weave.op
    async def predict(self, text: str) -> str:
        return await claude_inference(text, REASONING_CONFIGS[0])


class ClaudeThinking4kModel(weave.Model):
    @weave.op
    async def predict(self, text: str) -> str:
        return await claude_inference(text, REASONING_CONFIGS[1])


class ClaudeThinking8kModel(weave.Model):
    @weave.op
    async def predict(self, text: str) -> str:
        return await claude_inference(text, REASONING_CONFIGS[2])


class ClaudeThinking16kModel(weave.Model):
    @weave.op
    async def predict(self, text: str) -> str:
        return await claude_inference(text, REASONING_CONFIGS[3])


class ClaudeThinking24kModel(weave.Model):
    @weave.op
    async def predict(self, text: str) -> str:
        return await claude_inference(text, REASONING_CONFIGS[4])


# Score function using GPT-4o via litellm
@weave.op
async def gpt4o_scorer(label: str, model_output: str) -> dict:
    """Score the model's output by comparing it with the ground truth."""
    query = f"""
YOU ARE A LLM JUDGE DETERMINING IF THE FOLLOWING MODEL GENERATED ANSWER IS THE SAME AS THE CORRECT ANSWER
I WILL GIVE YOU THE LAST 100 CHARS OF THE MODEL'S REASONING PATH, WHICH WILL CONTAIN THE FINAL ANSWER ->

Model's Answer (last 100 chars): {str(model_output)[-100:]}
Correct Answer: {label}
Your task:
1. State the model's predicted answer (answer only).
2. State the ground truth (answer only).
3. Determine if the model's final answer is correct (ignore formatting differences, etc.). RESPOND with the predicted and ground truth answer, followed with a JSON object containing the correctness encapsulated within the following delimiters:
```json
{{ "correctness": true/false }}
```
"""
    # Perform inference using litellm
    response = run_inference(query, "gpt-4o-2024-08-06")
    if response is None:
        return {"correctness": False, "reasoning": "Inference failed."}
    try:
        # Extract the correctness JSON object from the response
        json_start = response.index("```json") + 7
        json_end = response.index("```", json_start)
        correctness = json.loads(response[json_start:json_end].strip()).get("correctness", False)
    except (ValueError, IndexError):
        correctness = False

    return {"correctness": correctness, "reasoning": response}


# Load and preprocess the dataset
def load_ds():
    print("Loading AIME dataset...")
    try:
        dataset = load_dataset("Maxwell-Jia/AIME_2024")["train"]  # No test set here
        return [{"text": row["Problem"], "label": row["Answer"]} for row in dataset]
    except Exception as e:
        print(f"Error loading AIME dataset: {e}")
        return []


# Run evaluations for each model
async def run_evaluations():
    print("Loading dataset...")
    dataset = load_ds()
    print(f"Loaded {len(dataset)} problems")
    print("Initializing models...")
    models = {
        "standard": ClaudeStandardModel(),
        "thinking_4k": ClaudeThinking4kModel(),
        "thinking_8k": ClaudeThinking8kModel(),
        "thinking_16k": ClaudeThinking16kModel(),
        "thinking_24k": ClaudeThinking24kModel(),
    }

    print("Preparing dataset for evaluation...")
    # Take a subset for faster testing - adjust as needed
    # test_size = 5  # Number of problems to evaluate per model
    dataset_prepared = [{"text": row["text"], "label": row["label"]} for row in dataset]

    print("Running evaluations...")
    scorers = [gpt4o_scorer]
    for model_name, model in models.items():
        print(f"\n\n=== EVALUATING {model_name.upper()} ===")
        evaluation = weave.Evaluation(
            dataset=dataset_prepared,
            scorers=scorers,
            name=f"{model_name} Evaluation"
        )
        await evaluation.evaluate(model)


if __name__ == "__main__":
    # Set your API keys here or in environment variables
    if not os.environ.get("ANTHROPIC_API_KEY"):
        os.environ["ANTHROPIC_API_KEY"] = "YOUR_ANTHROPIC_API_KEY"
    if not os.environ.get("OPENAI_API_KEY"):
        os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"  # Required for litellm
    # Run evaluations
    asyncio.run(run_evaluations())
The AIME 2024 dataset is loaded using the Hugging Face load_dataset function, pairing each problem with its correct answer to enable objective evaluation of Claude 3.7’s reasoning accuracy. This dataset serves as a strong benchmark for assessing the model’s ability to handle complex mathematical and logical reasoning tasks.
The evaluation tests multiple configurations of Claude 3.7, including standard mode and extended reasoning budgets of 4K, 8K, 16K, and 24K tokens. Each model variant is instantiated using Weave’s Model class to ensure direct and consistent comparisons. A standardized system message is applied across all runs to maintain fairness in evaluation.
Claude 3.7 supports streaming inference, allowing real-time tracking of responses. When extended reasoning is enabled, the model generates visible thought processes before delivering a final answer. Each reasoning configuration defines a reasoning token budget, enabling controlled experimentation with different levels of computational effort.

Interpretation of results with Weave

This evaluation highlights the trade-offs between cost, accuracy, and response time when using different reasoning budgets in Claude 3.7. Higher budgets generally improve correctness on complex problems but lead to increased token usage and latency. Here are the results for my evaluation:


Correctness is assessed using GPT-4o as a scoring model, measuring how accurately each model produces solutions compared to ground-truth answers. Latency and total token consumption are also tracked to understand the efficiency trade-offs between faster responses and deeper reasoning.
The results indicate that higher reasoning budgets improve accuracy but significantly increase latency. Models using 16K and 24K tokens achieved the highest correctness scores (0.500), while lower-budget models scored progressively lower, with standard mode performing the worst at 0.200.
Anthropic’s reported benchmarks for Claude 3.7 Sonnet highlight its performance improvements in extended reasoning but do not fully disclose the specifics of their parallel test-time computation method. They indicate that Claude can use up to 64K tokens for extended thinking, but they haven't detailed how this process determines when to stop generating reasoning paths or how multiple possible solutions are evaluated. Their approach may involve separate internal models or heuristics that dynamically allocate reasoning time based on problem complexity.




Their results show Claude 3.7 Sonnet scoring 61.3% / 80.0% on AIME 2024. The first number likely represents a pass@1 score from a single reasoning path (this is my best guess), while the second reflects their internal test-time compute method, which may involve selecting a final reasoning trace with a more advanced approach, such as having another language model (perhaps a second, specialized copy of Claude) check the work and pick the answer it thinks is best.
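To clarify what a majority-voting style of parallel test-time compute looks like in principle (a generic sketch, not Anthropic's disclosed method; sample_fn is a hypothetical function that returns one final answer per call), the idea is to sample N independent reasoning paths and keep the most common answer, whereas pass@1 grades a single sampled path:
from collections import Counter

def majority_vote(sample_fn, problem, n=64):
    """Sample n independent answers and return the most common one (majority voting)."""
    answers = [sample_fn(problem) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

def pass_at_1(sample_fn, problem, correct_answer):
    """pass@1: grade a single sampled answer against the ground truth."""
    return sample_fn(problem) == correct_answer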

The Weave comparison view

Weave’s comparison view is especially useful for evaluating reasoning models because it allows for a direct visualization of response traces, making it easier to spot discrepancies in logic, structure, and correctness. By presenting multiple outputs side by side, it highlights how different reasoning budgets influence the step-by-step approach a model takes to solving a problem.


This format is particularly valuable for understanding how variations in token allocation affect reasoning depth, consistency, and clarity. It enables you to analyze whether a model maintains logical coherence across different configurations and to pinpoint where shorter or longer reasoning paths may lead to different conclusions. By providing transparency into the model’s thought process, Weave’s comparison tool makes it easier to refine and optimize reasoning models for specific applications.
Weave’s comparison view also provides direct access to individual reasoning traces, making it easy to inspect latency, cost, and token usage for each response. This allows for a precise evaluation of how different reasoning budgets affect efficiency beyond just correctness.


Conclusion

Claude 3.7 Sonnet represents a shift toward greater flexibility in AI reasoning, giving you more direct control over computational depth and accuracy. The model’s ability to dynamically adjust reasoning effort, integrate external tools, and handle large-scale context makes it well-suited for complex problem-solving across a range of domains. However, with this increased capability comes the need for more intentional decision-making around budgeting and response optimization.
Weave is an excellent tool for evaluating reasoning models, providing a structured framework for analyzing how different reasoning budgets affect accuracy, efficiency, and response quality. By enabling you to systematically track model outputs, token consumption, and latency, Weave offers a level of transparency that is essential for refining AI performance. As AI systems continue to evolve, tools like Weave will play a key role in helping researchers and developers optimize models like Claude 3.7 for real-world applications.
Feel free to check out the Google Colab here.
