Tutorial: Running inference with Qwen3 235B A22B Thinking-2507 using W&B Inference
Getting set up and running Qwen3 235B A22B Thinking-2507, Alibaba's advanced reasoning model, in Python using W&B Inference.
Running inference with Qwen3 235B A22B Thinking-2507 on W&B Inference, powered by CoreWeave, is easy to set up and flexible to use. In this tutorial, you'll configure inference and walk through the model's advanced features step by step.
Whether you need to handle large documents, support multiple languages, or work on complex reasoning tasks, this guide shows how to apply Qwen3 235B A22B Thinking-2507 effectively for inference.
Table of contents
- What is Qwen3 235B A22B Thinking-2507?
- W&B Weave
- Tutorial: Running inference with Qwen3 235B A22B Thinking-2507 using W&B Inference
- Prerequisites
- Step 1: Installation & setup
- Step 2: Environment configuration
- Step 3: Running basic inference with Qwen3 235B A22B Thinking-2507
- Step 4: Advanced Qwen3 235B A22B Thinking-2507 inference configuration
- Running inference with Qwen3 235B A22B Thinking-2507's unique capabilities
- Monitoring Qwen3 235B A22B Thinking-2507 inference with W&B Weave
- Best Practices
- Next steps
What is Qwen3 235B A22B Thinking-2507?
Qwen3 235B A22B Thinking-2507 is an open-weight Mixture-of-Experts model from Alibaba’s Qwen team, designed for deep reasoning. It activates 22B of its 235B parameters per token and supports ultra-long context up to 262,144 tokens, making it strong for logic, math, coding, science, and extended reasoning tasks.
Why it stands out
📊 It outperforms earlier open-source models on benchmarks like AIME, SuperGPQA, LiveCodeBench, HMMT, and MMLU-Redux.
🧠 It operates entirely in “thinking mode,” producing structured reasoning before giving final answers.
Tuned for reasoning-intensive use cases
📝 Built for step-by-step reasoning, tool use, agent workflows, multilingual tasks, and long-form content.
🚀 As of July 2025, it is the most capable open-source model in the Qwen3-235B family for structured reasoning.

- 🏆 On ArenaHard, Qwen3-235B-A22B scores 95.6, higher than OpenAI-o1 at 92.1 and Deepseek-R1 at 93.2, and nearly matching Gemini-2.5 Pro at 96.4.
- 🧮 On AIME24, it reaches 85.7, ahead of OpenAI-o1 at 74.3 and Deepseek-R1 at 79.8, though still below Gemini-2.5 Pro at 92.0.
- 📐 On AIME25, the model posts 81.5, stronger than OpenAI-o1 at 79.2 and Deepseek-R1 at 70.0, but just under Gemini-2.5 Pro at 86.7.
- 💻 On LiveCodeBench v5, it achieves 70.7, beating OpenAI-o1 at 63.9 and Deepseek-R1 at 64.3, and about even with Gemini-2.5 Pro at 70.4.
- ⚡ On CodeForces, Qwen3-235B-A22B records an Elo rating of 2056, ahead of OpenAI-o1 at 1891 and Gemini-2.5 Pro at 2001, while also edging out OpenAI-o3-mini at 2036.
- 📝 On Aider Pass@2, it scores 61.8, nearly identical to OpenAI-o1 at 61.7, above Deepseek-R1 at 56.9, but trailing Gemini-2.5 Pro at 72.9.
- 📊 On LiveBench, it reaches 77.1, surpassing OpenAI-o1 at 75.7 and Deepseek-R1 at 71.6, though still behind Gemini-2.5 Pro at 82.4.
- 🔧 On BFCL v3, it posts 70.8, stronger than OpenAI-o1 at 67.8 and Deepseek-R1 at 56.9, and well above Gemini-2.5 Pro at 62.9.
- 🌍 On MultiIF, which covers eight languages, Qwen3-235B-A22B scores 71.9, far higher than OpenAI-o1 at 48.8 and above Deepseek-R1 at 67.7, though a bit below Gemini-2.5 Pro at 77.8.
W&B Weave
W&B Weave simplifies the process of tracking and analyzing model outputs in your project. To get started with Weave, you'll first import it and initialize it with your project name.
One of its standout features is the @weave.op decorator. In Python, a decorator is a powerful tool that extends the behavior of a function. By placing @weave.op above any function in your code, you instruct Weave to automatically log that function's inputs and outputs. This makes it incredibly easy to keep track of what data goes in and what comes out.
After your code executes, logs are available in the Weave dashboard, where you can inspect visualizations and traces of each function call. This makes debugging easier and keeps experimental data well organized, helping you develop and refine models like Qwen3 235B A22B Thinking-2507 more effectively.
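Here's a minimal sketch of the pattern (the project name and the summarize function are placeholders for illustration):

import weave

# Initialize Weave with your project name
weave.init("wandb_inference")

# The decorator tells Weave to log this function's inputs and outputs
@weave.op()
def summarize(text: str) -> str:
    # Stand-in for any model call or processing step you want traced
    return text[:100] + "..."

# Each call now shows up as a trace in the Weave dashboard
print(summarize("Qwen3 235B A22B Thinking-2507 is a Mixture-of-Experts model built for deep reasoning."))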
Tutorial: Running inference with Qwen3 235B A22B Thinking-2507 using W&B Inference
Let's jump right in. We will be running inference with the Qwen/Qwen3-235B-A22B-Thinking-2507 model specifically.
If you're not familiar with Jupyter Notebooks, you can get set up in about 5 minutes. I walk you through it in this tutorial.
Prerequisites
Before starting, ensure you have:
- Python 3.7 or higher installed
- Basic familiarity with Python and API usage
- Understanding of your use case requirements (document analysis, code review, multilingual tasks, etc.)
Step 1: Installation & setup
1. Install required packages
To get started running inference with Qwen3 235B A22B Thinking-2507, all you need to install are the openai, wandb, and weave packages. We'll also show you how to streamline the review of multiple outputs with W&B Weave, making the process far more efficient.
The code to do this is:
pip install openai wandb weave
Run this in your terminal or in a Jupyter cell.
When you execute the cell, you'll notice an asterisk ([*]) appear between the brackets [ ]. This indicates that the cell is running, and you'll need to wait until the asterisk turns into a number before proceeding.
2. Get your W&B API key
- Visit https://wandb.ai/authorize to find your API key
- Copy your API key
- Keep it handy for the next step
Step 2: Environment configuration
Setting up your environment variables is crucial for secure and seamless operation. You'll need your W&B API key.
Option 1: In a Jupyter Notebook
# Set environment variables in your notebook
%env WANDB_API_KEY=your-wandb-api-key-here
Option 2: In Terminal/Shell
export WANDB_API_KEY="your-wandb-api-key-here"
Option 3: In Python script
import os

# Set environment variables programmatically
os.environ["WANDB_API_KEY"] = "your-wandb-api-key-here"
Step 3: Running basic inference with Qwen3 235B A22B Thinking-2507
Hopefully, this hasn't been too painful because now we're at the fun part.
Here's a complete example to start running inference with Qwen3 235B A22B Thinking-2507:
import os
import openai
import weave

PROJECT = "wandb_inference"
weave.init(PROJECT)

client = openai.OpenAI(
    base_url="https://api.inference.wandb.ai/v1",
    api_key=os.getenv("WANDB_API_KEY"),
    project=PROJECT,
    default_headers={
        "OpenAI-Project": "wandb_fc/quickstart_playground"  # replace with your actual team/project
    },
)

resp = client.chat.completions.create(
    model="Qwen/Qwen3-235B-A22B-Thinking-2507",
    messages=[
        {"role": "system", "content": "You are a science educator who explains complex topics clearly."},
        {"role": "user", "content": "Explain quantum computing fundamentals, advantages, applications, and limitations. Use simple analogies for someone without a physics background."},
    ],
    temperature=0.7,
    max_tokens=1000,
)

print(resp.choices[0].message.content)
You'll find the inputs and outputs recorded to your Weave dashboard with the parameters automatically included:

Step 4: Advanced Qwen3 235B A22B Thinking-2507 inference configuration
Understanding inference parameters
You can adjust Qwen3 235B A22B Thinking-2507's response behavior using these inference parameters—experiment with them and compare the results in Weave.
import os
import openai
import weave

PROJECT = "wandb_inference"
weave.init(PROJECT)

client = openai.OpenAI(
    base_url="https://api.inference.wandb.ai/v1",
    api_key=os.getenv("WANDB_API_KEY"),
    project=PROJECT,
    default_headers={
        "OpenAI-Project": "wandb_fc/quickstart_playground"  # replace with your actual team/project
    },
)

resp = client.chat.completions.create(
    model="Qwen/Qwen3-235B-A22B-Thinking-2507",
    messages=[
        {"role": "system", "content": "You are a creative writing assistant specializing in science fiction."},
        {"role": "user", "content": "Write a compelling 800-word time travel story with well-developed characters, plot twists, and philosophical themes about causality and free will."},
    ],
    temperature=0.8,
    top_p=0.9,
    max_tokens=2000,
)

print(resp.choices[0].message.content)
Parameter Guidelines:
- Temperature: Use 0.1-0.3 for analytical tasks, 0.7-0.9 for creative work
- Top_p: Combine with temperature; 0.9 works well for most applications
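To see the effect directly, here's a small sketch (reusing the client from the examples above) that runs the same prompt at an analytical and a creative temperature so you can compare the two traces in Weave:

# Run the same prompt at a low and a high temperature and compare outputs
prompt = "Describe a black hole in two sentences."

for temp in (0.2, 0.9):
    resp = client.chat.completions.create(
        model="Qwen/Qwen3-235B-A22B-Thinking-2507",
        messages=[{"role": "user", "content": prompt}],
        temperature=temp,
        top_p=0.9,
        max_tokens=200,
    )
    print(f"--- temperature={temp} ---")
    print(resp.choices[0].message.content)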
This gives us added flexibility to influence our model output. These parameters are also automatically logged to W&B Weave for observability:

Streaming inference responses
For real-time output and better user experience:
import os
import sys
import openai
import weave

PROJECT = "wandb_inference"
weave.init(PROJECT)

client = openai.OpenAI(
    base_url="https://api.inference.wandb.ai/v1",
    api_key=os.getenv("WANDB_API_KEY"),
    project=PROJECT,
    default_headers={
        "OpenAI-Project": "wandb_fc/quickstart_playground"  # replace with your actual team/project
    },
)

stream = client.chat.completions.create(
    model="Qwen/Qwen3-235B-A22B-Thinking-2507",
    messages=[
        {"role": "system", "content": "You are a space exploration historian and science communicator."},
        {"role": "user", "content": "Create a comprehensive narrative about humanity's space exploration journey from early dreams to current interplanetary plans. Include key moments, discoveries, heroes, innovations, and philosophical implications."},
    ],
    stream=True,
    temperature=0.7,
)

sys.stdout.write("Response: ")
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta and delta.content:
        sys.stdout.write(delta.content)
        sys.stdout.flush()
print()
We got a streaming response:

With the metrics logged to Weave:

As well as the full output:

Running inference with Qwen3 235B A22B Thinking-2507's unique capabilities
This is where running inference with Qwen3 235B A22B Thinking-2507 really shines. Let's explore what makes it special.
Long context inference
Qwen3 235B A22B Thinking-2507 excels at running inference on extensive documents. Here's a practical example:
import io
import os
import requests
import openai
import weave
from pypdf import PdfReader

PROJECT = "wandb_inference"
weave.init(PROJECT)

PDF_URL = "https://docs.aws.amazon.com/pdfs/bedrock-agentcore/latest/devguide/bedrock-agentcore-dg.pdf"
QUESTION = "How does AgentCore's memory architecture work?"

client = openai.OpenAI(
    base_url="https://api.inference.wandb.ai/v1",
    api_key=os.getenv("WANDB_API_KEY"),
    project=PROJECT,
    default_headers={
        "OpenAI-Project": "wandb_fc/quickstart_playground"  # replace with your actual team/project
    },
)

r = requests.get(PDF_URL, timeout=60)
r.raise_for_status()

reader = PdfReader(io.BytesIO(r.content))
pages = reader.pages[:100]
text = "\n\n".join(page.extract_text() or "" for page in pages)
doc_snippet = text

prompt = (
    f"Based on this AWS Bedrock AgentCore documentation, answer: {QUESTION}\n\n"
    f"Documentation:\n{doc_snippet}\n\n"
    "Provide analysis with direct quotes."
)

resp = client.chat.completions.create(
    model="Qwen/Qwen3-235B-A22B-Thinking-2507",
    messages=[
        {"role": "system", "content": "You are an AWS Bedrock AgentCore expert. Analyze the documentation and answer questions based only on the provided text."},
        {"role": "user", "content": prompt},
    ],
    temperature=0.2,
    max_tokens=1500,
)

print(resp.choices[0].message.content)
Which outputs to Weave:


Multilingual inference
Leverage Qwen3 235B A22B Thinking-2507's multilingual inference capabilities for international development:
import os
import openai
import weave

PROJECT = "wandb_inference"
weave.init(PROJECT)

client = openai.OpenAI(
    base_url="https://api.inference.wandb.ai/v1",
    api_key=os.getenv("WANDB_API_KEY"),
    project=PROJECT,
    default_headers={
        "OpenAI-Project": "wandb_fc/quickstart_playground"  # replace with your actual team/project
    },
)

code_snippet = """
// JavaScript function with Chinese identifiers and mixed language comments
function 计算总价(商品列表, 折扣率) {
    let 总价 = 0; // Total price accumulator
    for (const 商品 of 商品列表) {
        总价 += 商品.价格 * 商品.数量 // Calculate item total: price * quantity
    }
    const 折扣金额 = 总价 * 折扣率 // Calculate discount amount
    return 总价 - 折扣金额 // Return final price after discount
}

# Python function with Chinese docstring and mixed naming conventions
def validate_用户输入(user_data):
    '''
    验证用户输入数据的完整性和有效性
    Validates user input data for completeness and validity

    This function checks if all required fields are present in the user data
    and raises appropriate errors if validation fails.
    '''
    required_fields = ['name', 'email', '年龄']  # Mix of English and Chinese field names
    for field in required_fields:
        if field not in user_data:
            raise ValueError(f"Missing required field: {field}")
    return True
"""

task = (
    "Analyze this code: "
    "1) Explain each function in English and Chinese "
    "2) Assess code quality issues "
    "3) Recommend improvements "
    "4) Provide refactored version with consistent naming"
)

resp = client.chat.completions.create(
    model="Qwen/Qwen3-235B-A22B-Thinking-2507",
    messages=[
        {"role": "system", "content": "You are a software architect with expertise in English and Chinese."},
        {"role": "user", "content": f"{task}\n\nCode:\n{code_snippet}"},
    ],
    temperature=0.2,
    max_tokens=1200,
)

print(resp.choices[0].message.content)
Which logs to Weave as:

Complex multi-step reasoning inference with Qwen3 235B A22B Thinking-2507
Utilize Qwen3 235B A22B Thinking-2507's reasoning capabilities for complex problem solving:
import os
import openai
import weave

PROJECT = "wandb_inference"
weave.init(PROJECT)

client = openai.OpenAI(
    base_url="https://api.inference.wandb.ai/v1",
    api_key=os.getenv("WANDB_API_KEY"),
    project=PROJECT,
    default_headers={
        "OpenAI-Project": "wandb_fc/quickstart_playground"  # replace with your actual team/project
    },
)

resp = client.chat.completions.create(
    model="Qwen/Qwen3-235B-A22B-Thinking-2507",
    messages=[
        {"role": "system", "content": "You are a SaaS business strategist with expertise in growth optimization and churn reduction."},
        {"role": "user", "content": """Analyze our SaaS metrics and provide a strategic plan:

Metrics:
- Monthly churn: 15%
- CAC: $150
- CLV: $800
- MRR per customer: $50
- Active customers: 2,000
- Monthly marketing spend: $60,000

Provide: 1) Metric assessment vs benchmarks, 2) Critical issues, 3) Improvement strategy, 4) Implementation timeline, 5) Success metrics."""},
    ],
    temperature=0.7,
    max_tokens=1000,
)

print(resp.choices[0].message.content)
Which you'll see in the dashboard:

Monitoring Qwen3 235B A22B Thinking-2507 inference with W&B Weave
From the final cell, you can view the inference output and copy it if needed. To explore further or review past inference requests, open your Weights & Biases dashboard or follow the links provided with the response.
After initializing Weave with your environment variable, all inference API calls are automatically tracked. Here’s what gets logged and how you can use it effectively:
What Weave tracks automatically
- Request details: Model used, parameters, token counts
- Response data: Content, processing time, success/failure status
- Usage metrics: Token consumption, API costs, rate limit status
- Performance: Response latency, throughput patterns
Accessing your logs
- Visit your W&B project dashboard at: https://wandb.ai/[your-username]/[your-project]
- Navigate to the "Weave" section
- View detailed logs, filter by date/model/status
- Analyze usage patterns and optimize accordingly
Custom Weave annotations
Add custom metadata and organize your API calls:
import os
import openai
import weave

PROJECT = "wandb_inference"
weave.init(PROJECT)

client = openai.OpenAI(
    base_url="https://api.inference.wandb.ai/v1",
    api_key=os.getenv("WANDB_API_KEY"),
    project=PROJECT,
    default_headers={
        "OpenAI-Project": "wandb_fc/quickstart_playground"  # replace with your actual team/project
    },
)

@weave.op()
def analyze_customer_feedback(feedback_text, sentiment_threshold=0.5):
    """Analyze comprehensive customer feedback and return detailed sentiment analysis with actionable insights.

    This function performs deep sentiment analysis while tracking execution via the weave.op decorator,
    since the OpenAI client doesn't have built-in weave integration.
    """
    resp = client.chat.completions.create(
        model="Qwen/Qwen3-235B-A22B-Thinking-2507",
        messages=[
            {"role": "system", "content": "You are a customer experience analyst. Provide sentiment scores (-1 to +1), identify themes, and suggest improvements."},
            {"role": "user", "content": f"Analyze this customer feedback: {feedback_text}\n\nThreshold: {sentiment_threshold}\n\nProvide: 1) Sentiment score, 2) Key themes, 3) Specific concerns, 4) Recommended actions."},
        ],
        temperature=0.1,
        max_tokens=500,
    )
    return resp.choices[0].message.content

if __name__ == "__main__":
    out = analyze_customer_feedback(
        "I've used this software for three months. Core functionality works, but recent interface changes complicate my workflow. Features are buried in submenus, layout feels counterintuitive, and it loads slower since the update.",
        sentiment_threshold=0.3,
    )
    print(out)
Which would appear as:

Best Practices
Here are some best practices to follow when testing or deploying Qwen3 235B A22B Thinking-2507, or any other model for that matter.
Security and Configuration
Always store API keys in environment variables rather than hardcoding them. Use clear and descriptive project names that follow the “team/project” format to keep your work organized. Restrict API key permissions to only the scopes that are truly necessary.
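As a minimal sketch of the first point, read the key from the environment and fail fast if it's missing rather than embedding it in code:

import os

# Load the API key from the environment instead of hardcoding it
api_key = os.getenv("WANDB_API_KEY")
if not api_key:
    raise RuntimeError(
        "WANDB_API_KEY is not set. Export it in your shell or set it "
        "in your notebook before creating the client."
    )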
Prompt Engineering for Qwen3 235B A22B Thinking-2507
Make use of Qwen3 235B A22B Thinking-2507’s long-context support by supplying all the background information your task requires. Be explicit about the output format and writing style you expect. Well-crafted system prompts help set the right expertise and context. For analytical results, keep the temperature low (around 0.1–0.3). For more open-ended or creative tasks, raise it to 0.7–0.9.
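One way to keep these temperature settings consistent across a codebase is a small helper like the sketch below (sampling_for is an illustrative name, not part of any SDK):

# Illustrative helper: choose sampling settings by task type
def sampling_for(task_type: str) -> dict:
    if task_type == "analytical":
        return {"temperature": 0.2, "top_p": 0.9}  # precise, reproducible answers
    if task_type == "creative":
        return {"temperature": 0.8, "top_p": 0.9}  # more varied, exploratory output
    return {"temperature": 0.7, "top_p": 0.9}      # sensible default

# Usage with the client from the earlier examples:
# resp = client.chat.completions.create(model=..., messages=..., **sampling_for("analytical"))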
Performance Optimization
Enable streaming to make long responses flow more naturally and improve usability. Group similar queries with batch processing to save time and resources. Monitor token usage to manage costs and stay within limits. For tasks that repeat frequently, apply caching so results can be reused without extra computation.
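For the caching point, a simple sketch is an in-memory dictionary keyed by a hash of the full request (illustrative only; production workloads would want a persistent store and an eviction policy):

import hashlib
import json

# In-memory cache keyed by a hash of the full request
_cache = {}

def cached_completion(client, model, messages, **params):
    key = hashlib.sha256(
        json.dumps({"model": model, "messages": messages, **params}, sort_keys=True).encode()
    ).hexdigest()
    if key not in _cache:
        resp = client.chat.completions.create(model=model, messages=messages, **params)
        _cache[key] = resp.choices[0].message.content
    return _cache[key]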
Monitoring and Debugging
Leverage Weave’s automatic logging to capture every production call. Attach useful metadata as annotations so different use cases are easy to identify. Review failed requests regularly to spot recurring issues and trends. Monitor response times closely and fine-tune parameters to keep performance consistent.
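For the metadata point, recent Weave versions expose a weave.attributes context manager; assuming that API, a sketch like this tags every call made inside the block (reusing the analyze_customer_feedback op defined earlier):

import weave

# Tag all op calls inside the block so traces can be filtered in the dashboard
with weave.attributes({"use_case": "customer_feedback", "env": "production"}):
    out = analyze_customer_feedback(
        "The new dashboard is faster, but exporting reports still fails.",
        sentiment_threshold=0.3,
    )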
Next steps
Now that you're equipped with comprehensive Qwen3 235B A22B Thinking-2507 knowledge:
🔗 Explore Advanced Features
📊 Optimize Your Workflow
- Set up automated monitoring dashboards for your specific use cases
- Implement A/B testing between different prompting strategies
- Create custom evaluation metrics for your domain-specific tasks
🚀 Scale Your Implementation
- Build production pipelines with proper error handling and monitoring
- Implement cost optimization strategies based on usage patterns
- Explore integration with other W&B tools for end-to-end ML workflows
📚 Dive Deeper into Qwen3 235B A22B Thinking-2507
- Visit the Qwen3 235B A22B Thinking-2507 model card for detailed capability information
- Explore community examples and use cases
- Stay updated with model improvements and new features