
Tutorial: Running inference with Zhipu AI's GLM-4.5 using W&B Inference

Getting set up and running GLM-4.5, Zhipu's advanced long-context language model, in Python using W&B Inference.
Running inference with GLM-4.5 through W&B Inference, powered by CoreWeave, is surprisingly straightforward, and it gives you far more flexibility than you might find through other interfaces. In this tutorial, we’ll get you running inference and exploring GLM-4.5's advanced capabilities in depth.
Whether you're running inference on massive documents, building multilingual applications, or tackling complex reasoning tasks, this guide provides everything you need to run inference with GLM-4.5 effectively.

What is GLM-4.5?

GLM-4.5 is Zhipu AI’s latest large-scale model built to unify reasoning, coding, and agentic abilities. It comes in two versions: GLM-4.5 (355B total parameters, 32B active) and GLM-4.5-Air (106B total, 12B active), both using a mixture-of-experts design with a deep architecture, grouped-query attention, and speculative decoding.
🧠 On reasoning tasks, it reaches 84.6 on MMLU Pro, 91.0 on AIME24, 98.2 on MATH500, and 79.1 on GPQA. These results put it close to models like Claude 4 Opus and Gemini 2.5 Pro, and within range of Grok 4 on top reasoning benchmarks.
💻 On coding, it scores 64.2 on SWE-bench Verified and 37.5 on Terminal Bench, surpassing GPT-4.1 on SWE-bench and showing strong full-stack development skills. In agentic coding evaluations, it achieves a 90.6% tool-calling success rate, ahead of Claude 4 Sonnet (89.5%), Kimi K2 (86.2%), and Qwen3-Coder (77.1%).
🤖 On agentic tasks, GLM-4.5 scores 79.7 on τ-bench Retail and 60.4 on τ-bench Airline, on par with Claude 4 Sonnet. On BrowseComp, it hits 26.4%, beating Claude 4 Opus (18.8%) and coming close to o4-mini-high (28.3%).
⚙️ It supports a 131K-token context window!
🚀 Overall, GLM-4.5 ranks 3rd across 12 major benchmarks, balancing reasoning, coding, and agentic performance while being highly efficient at its scale.

For detailed technical specifications and performance benchmarks, visit the GLM-4.5 model documentation.

W&B Weave

W&B Weave makes it easy to track and analyze model outputs in your project, offering much more than basic logging. To get started, you simply import it and initialize it with your project name.
A key feature is the @weave.op decorator. A normal Python function runs without recording anything, but when you add @weave.op, every call is automatically logged with its inputs and outputs. This saves you from having to write manual print statements or build custom logging logic.
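For example, here's a minimal sketch of the difference (the project name and functions are purely illustrative):
import weave

weave.init("my-project")  # replace with your project name

# A plain function: nothing about its calls is recorded
def summarize_plain(text):
    return text[:100]

# The same function with @weave.op: every call is logged with inputs and outputs
@weave.op()
def summarize_tracked(text):
    return text[:100]

summarize_tracked("Weave records this call, its argument, and its return value.")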
When your code executes, the outputs are captured in the Weave dashboard. Rather than relying on raw console logs or static text files, the dashboard provides interactive charts, timelines, and detailed traces for each function call. You can dig deeper into specific runs, make comparisons, and observe how data flows through your models. Instead of fragmented log entries, you have a unified and connected view of your experiments.
This distinction is what makes Weave so effective for model development. It goes beyond recording outputs by structuring your experimental data, allowing you to debug more quickly, reproduce results consistently, and fine-tune models like GLM-4.5 with greater ease.

Tutorial: Running inference with GLM-4.5 using W&B Inference

Let's jump right in. This tutorial assumes you're working in a Jupyter notebook (as you'll notice in some screenshots), but of course, the code will work in other environments.
We will be running inference with the zai-org/GLM-4.5 model specifically.
If you're not familiar with Jupyter Notebooks, you can get set up in about 5 minutes. I walk you through it in this tutorial.

Prerequisites

Before starting, ensure you have:
  • A Weights & Biases account (you can sign up free here)
  • Python 3.7 or higher installed
  • Basic familiarity with Python and API usage
  • Understanding of your use case requirements (document analysis, code review, multilingual tasks, etc.)

Step 1: Installation & setup

1. Install required packages

To get started running inference with GLM-4.5, all you need to install are the openai, wandb, and weave packages. We'll use W&B Weave to simplify reviewing multiple outputs, which makes the process much more efficient.
The code to do this is:
pip install openai wandb weave
Run this in your terminal or in a Jupyter cell.
When you execute the cell, you'll notice an asterisk ([*]) appear between the brackets [ ]. This indicates that the cell is running, and you'll need to wait until the asterisk turns into a number before proceeding.

2. Get your W&B API key

  1. Head to https://wandb.ai/authorize and copy your API key
  2. Keep it handy for the next step

Step 2: Environment configuration

Setting up your environment variables is crucial for secure and seamless operation. You'll need your W&B API key.

Option 1: In a Jupyter Notebook

# Set environment variables in your notebook
%env WANDB_API_KEY=your-wandb-api-key-here

Option 2: In Terminal/Shell

export WANDB_API_KEY="your-wandb-api-key-here"

Option 3: In Python script

import os
# Set environment variables programmatically
os.environ["WANDB_API_KEY"] = "your-wandb-api-key-here"

Step 3: Running basic inference with GLM-4.5

Hopefully, this hasn't been too painful because now we're at the fun part.
Here's a complete example to start running inference with GLM-4.5:
import os
import openai
import weave

PROJECT = "wandb_inference"
weave.init(PROJECT)

client = openai.OpenAI(
    base_url="https://api.inference.wandb.ai/v1",
    api_key=os.getenv("WANDB_API_KEY"),
    project=PROJECT,
    default_headers={
        "OpenAI-Project": "wandb_fc/quickstart_playground"  # replace with your actual team/project
    }
)

response = client.chat.completions.create(
    model="zai-org/GLM-4.5",
    messages=[
        {"role": "system", "content": "You serve as an expert educator specializing in quantum computing, breaking down complex concepts into clear, accessible explanations that inspire curiosity and understanding."},
        {"role": "user", "content": "Explain the fundamental principles of quantum computing in an engaging way. Cover key concepts like superposition, entanglement, quantum gates, and qubits. Discuss how these elements work together to perform computations that classical computers cannot efficiently handle. Include practical examples and potential real-world applications that demonstrate quantum computing's transformative potential."}
    ],
    temperature=0.7,
)

print("Response:", response.choices[0].message.content)

You'll find the inputs and outputs recorded to your Weave dashboard with the parameters automatically included:


Step 4: Advanced GLM-4.5 inference configuration

Understanding inference parameters

Adjust GLM-4.5's response behavior with these key inference parameters (feel free to play around with them and compare the outputs in Weave!).
import os
import openai
import weave

PROJECT = "wandb_inference"
weave.init(PROJECT)

client = openai.OpenAI(
    base_url="https://api.inference.wandb.ai/v1",
    api_key=os.getenv("WANDB_API_KEY"),
    project=PROJECT,
    default_headers={
        "OpenAI-Project": "wandb_fc/quickstart_playground"  # replace with your actual team/project
    }
)

@weave.op()
def creative_writing_experiment():
    # Test different temperature settings
    temperatures = [0.1, 0.7, 1.2]

    for temp in temperatures:
        print(f"\n--- Temperature: {temp} ---")
        response = client.chat.completions.create(
            model="zai-org/GLM-4.5",
            messages=[
                {"role": "system", "content": "You are a creative writing coach who helps writers develop compelling narratives, explore different styles, and craft engaging stories that resonate with readers."},
                {"role": "user", "content": "Write a short story about a mysterious antique shop where time seems to stand still. The owner has a secret, and customers often leave with more than just purchases. Focus on atmosphere, subtle mystery, and the interplay between past and present. Keep the story under 500 words."}
            ],
            temperature=temp,
            max_tokens=400,
        )
        print(response.choices[0].message.content)

result = creative_writing_experiment()
Parameter Guidelines:
  • Temperature: Use 0.1-0.3 for analytical tasks, 0.7-0.9 for creative work
  • Top_p: Combine with temperature; 0.9 works well for most applications (see the sketch below)
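For instance, here's a minimal sketch of passing top_p alongside temperature, reusing the client from the example above (the prompt is just illustrative):
response = client.chat.completions.create(
    model="zai-org/GLM-4.5",
    messages=[
        {"role": "user", "content": "Summarize the benefits of version control in two sentences."}
    ],
    temperature=0.7,  # creative but controlled
    top_p=0.9,        # nucleus sampling: draw from tokens covering 90% of probability mass
)
print(response.choices[0].message.content)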
This gives us added flexibility to influence our model output. These parameters are also automatically logged to W&B Weave for observability:


Streaming inference responses

For real-time output and better user experience:
import os
import sys
import openai
import weave

PROJECT = "wandb_inference"
weave.init(PROJECT)

client = openai.OpenAI(
    base_url="https://api.inference.wandb.ai/v1",
    api_key=os.getenv("WANDB_API_KEY"),
    project=PROJECT,
    default_headers={
        "OpenAI-Project": "wandb_fc/quickstart_playground"  # replace with your actual team/project
    }
)

stream = client.chat.completions.create(
    model="zai-org/GLM-4.5",
    messages=[
        {"role": "system", "content": "You are a compelling narrative architect who crafts engaging stories about humanity's greatest achievements, blending historical facts with inspiring human elements."},
        {"role": "user", "content": "Craft an inspiring narrative about humanity's journey into space, from ancient stargazers to modern space explorers. Highlight pivotal moments, courageous individuals, technological innovations, international collaborations, and the philosophical implications of becoming a multi-planetary species. Weave together scientific achievements with the human spirit of discovery and the challenges overcome along the way."}
    ],
    stream=True,
    temperature=0.7,
)

sys.stdout.write("Response: ")
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta and delta.content:
        sys.stdout.write(delta.content)
        sys.stdout.flush()
print()

We got a streaming response:

With the metrics logged to Weave:

As well as the full output:


Step 5: Running inference with GLM-4.5's unique capabilities

This is where running inference with GLM-4.5 really shines. Let's explore what makes it special.

Long context inference

GLM-4.5 excels at running inference on extensive documents. Here's a practical example:
import os
import openai
import weave
from pypdf import PdfReader

PROJECT = "wandb_inference"
weave.init(PROJECT)

client = openai.OpenAI(
    base_url="https://api.inference.wandb.ai/v1",
    api_key=os.getenv("WANDB_API_KEY"),
    project=PROJECT,
    default_headers={
        "OpenAI-Project": "wandb_fc/quickstart_playground"  # replace with your actual team/project
    }
)

def extract_pdf_text(file_path):
    reader = PdfReader(file_path)
    text = ""
    for page in reader.pages:
        text += page.extract_text() + "\n"
    return text

@weave.op()
def analyze_technical_documentation():
    # For demo purposes, we'll use a sample text instead of reading a PDF
    sample_text = """
AWS Cloud Computing Services Overview:

Amazon Web Services (AWS) provides a comprehensive suite of cloud computing services designed to help businesses scale and innovate. Key services include:

1. Compute Services:
- Amazon EC2: Virtual servers in the cloud
- AWS Lambda: Serverless compute service
- Amazon ECS/EKS: Container orchestration

2. Storage Services:
- Amazon S3: Object storage service
- Amazon EBS: Block storage for EC2 instances
- Amazon Glacier: Long-term archival storage

3. Database Services:
- Amazon RDS: Managed relational databases
- Amazon DynamoDB: NoSQL database service
- Amazon Redshift: Data warehousing solution

These services enable organizations to build resilient, scalable, and cost-effective cloud architectures.
"""

    response = client.chat.completions.create(
        model="zai-org/GLM-4.5",
        messages=[
            {"role": "system", "content": "You are a technical documentation specialist who excels at analyzing complex technical materials, extracting key insights, and providing clear summaries with actionable recommendations."},
            {"role": "user", "content": f"Analyze this technical documentation and provide a detailed summary covering the main services, their purposes, and how they work together to create comprehensive cloud solutions. Identify key benefits, potential use cases, and architectural considerations. Documentation content: {sample_text}"}
        ],
        temperature=0.4,
    )
    return response.choices[0].message.content

result = analyze_technical_documentation()
print("Technical Documentation Analysis:", result)
Which outputs to Weave:


Multilingual inference

Leverage GLM-4.5's multilingual inference capabilities for translation:
import os
import openai
import weave

PROJECT = "wandb_inference"
weave.init(PROJECT)

client = openai.OpenAI(
    base_url="https://api.inference.wandb.ai/v1",
    api_key=os.getenv("WANDB_API_KEY"),
    project=PROJECT,
    default_headers={
        "OpenAI-Project": "wandb_fc/quickstart_playground"  # replace with your actual team/project
    }
)

@weave.op()
def translate_and_explain():
    response = client.chat.completions.create(
        model="zai-org/GLM-4.5",
        messages=[
            {"role": "system", "content": "You are a proficient multilingual communicator skilled in translation, cultural adaptation, and cross-language communication strategies."},
            {"role": "user", "content": "Translate the following English text to Spanish, French, and German. For each translation, provide cultural context and explain any idiomatic expressions or cultural nuances that might affect the translation. Also suggest appropriate communication strategies for each language when discussing this topic. Text to translate: 'The key to successful international business is understanding cultural differences and adapting communication styles accordingly. Building trust through respectful dialogue and genuine interest in local customs can open doors to meaningful partnerships worldwide.'"}
        ],
        temperature=0.5,
    )
    return response.choices[0].message.content

result = translate_and_explain()
print("Multilingual Translation and Analysis:", result)

Which logs to Weave as:


Complex multi-step reasoning inference with GLM-4.5

Utilize GLM-4.5's reasoning capabilities for complex problem solving:
import os
import openai
import weave

PROJECT = "wandb_inference"
weave.init(PROJECT)

client = openai.OpenAI(
    base_url="https://api.inference.wandb.ai/v1",
    api_key=os.getenv("WANDB_API_KEY"),
    project=PROJECT,
    default_headers={
        "OpenAI-Project": "wandb_fc/quickstart_playground"  # replace with your actual team/project
    }
)

@weave.op()
def analyze_saas_business_model():
    response = client.chat.completions.create(
        model="zai-org/GLM-4.5",
        messages=[
            {"role": "system", "content": "You are a seasoned SaaS strategist who provides deep insights into software business models, market positioning, and growth strategies for subscription-based companies."},
            {"role": "user", "content": "Analyze the SaaS business model for a project management platform. Provide a comprehensive assessment covering: 1) Key value propositions and competitive advantages, 2) Pricing strategy recommendations with different tiers, 3) Customer acquisition and retention strategies, 4) Key metrics to track (MRR, churn rate, CAC, LTV), 5) Scaling challenges and solutions, 6) Market positioning against competitors like Asana, Trello, and Monday.com. Include specific recommendations for product development and go-to-market strategies."}
        ],
        temperature=0.6,
    )
    return response.choices[0].message.content

result = analyze_saas_business_model()
print("SaaS Business Analysis:", result)
Which you'll see in the dashboard:


Monitoring GLM-4.5 inference with W&B Weave

From the last cell, you can see the inference results and copy them if needed. To dig deeper or revisit earlier inference requests, head to your Weave dashboard or use the links provided in the response.
Once Weave is initialized in your project, every inference API call is automatically recorded. The following shows what gets logged and how you can take advantage of it:

What Weave tracks automatically

  • Request details: Model used, parameters, token counts
  • Response data: Content, processing time, success/failure status
  • Usage metrics: Token consumption, API costs, rate limit status
  • Performance: Response latency, throughput patterns

Accessing your logs

  • Visit your W&B project dashboard at: https://wandb.ai/[your-username]/[your-project]
  • Navigate to the "Weave" section
  • View detailed logs, filter by date/model/status
  • Analyze usage patterns and optimize accordingly

Custom Weave annotations

Organize your API calls with the @weave.op decorator:
import os
import openai
import weave

PROJECT = "wandb_inference"
weave.init(PROJECT)

client = openai.OpenAI(
    base_url="https://api.inference.wandb.ai/v1",
    api_key=os.getenv("WANDB_API_KEY"),
    project=PROJECT,
    default_headers={
        "OpenAI-Project": "wandb_fc/quickstart_playground"  # replace with your actual team/project
    }
)

@weave.op()
def analyze_customer_feedback():
    response = client.chat.completions.create(
        model="zai-org/GLM-4.5",
        messages=[
            {"role": "system", "content": "You are a skilled sentiment analyst who examines customer feedback with precision, identifying emotional tones, key concerns, and actionable insights to improve user experience."},
            {"role": "user", "content": "Analyze the following customer reviews and provide a comprehensive sentiment breakdown. Identify positive aspects, areas for improvement, recurring themes, and specific recommendations for enhancing customer satisfaction. Reviews: 1. 'The product works great but customer service was slow to respond.' 2. 'Amazing quality and fast shipping, will definitely buy again!' 3. 'Interface is confusing and not user-friendly at all.' 4. 'Good value for money, exceeded my expectations.' 5. 'Delivery was delayed and packaging was damaged upon arrival.'"}
        ],
        temperature=0.3,
    )
    return response.choices[0].message.content

result = analyze_customer_feedback()
print("Customer Feedback Analysis:", result)

Which would appear as:


Best Practices

Here are some best practices to follow when testing and/or deploying GLM-4.5, or any other model for that matter.

🔐 Security and Configuration

  • Environment variables: Always store API keys in environment variables, never hardcode them
  • Project organization: Use clear, descriptive project names following the “team/project” format
  • Access control: Limit API key permissions to necessary scopes only

✍️ Prompt Engineering for GLM-4.5

  • Leverage long context: With a 131K-token window, GLM-4.5 is built to handle extensive context smoothly
  • Clear instructions: Be specific about the desired output format and style
  • System messages: Use detailed system prompts to establish expertise and context
  • Temperature selection: Lower values (0.1–0.3) for analytical tasks, higher (0.7–0.9) for creative work

⚡ Performance Optimization

  • Streaming: Use streaming for longer responses to improve user experience
  • Batch processing: Group similar requests when possible to improve efficiency
  • Token management: Monitor token usage to optimize costs and stay within limits
  • Caching: Implement response caching for frequently requested analyses (both ideas are sketched after this list)
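As a rough sketch of the token-management and caching ideas above (the in-memory dict and helper name are our own; the usage fields follow the OpenAI-compatible response schema):
import hashlib

_cache = {}  # simple in-memory cache; swap in Redis or disk storage for production

def cached_completion(client, prompt, **params):
    # Key the cache on the prompt plus the sampling parameters
    key = hashlib.sha256((prompt + repr(sorted(params.items()))).encode()).hexdigest()
    if key in _cache:
        return _cache[key]  # repeated requests skip the API call entirely
    response = client.chat.completions.create(
        model="zai-org/GLM-4.5",
        messages=[{"role": "user", "content": prompt}],
        **params,
    )
    # Token management: print usage counts to keep an eye on costs
    usage = response.usage
    print(f"prompt={usage.prompt_tokens} completion={usage.completion_tokens} total={usage.total_tokens}")
    _cache[key] = response.choices[0].message.content
    return _cache[key]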

📊 Monitoring and Debugging

  • Weave integration: Use Weave’s automatic logging for all production calls
  • Custom annotations: Add meaningful metadata to track different use cases (see the sketch after this list)
  • Error analysis: Regularly review failed requests to identify patterns
  • Performance tracking: Monitor response times and adjust parameters accordingly
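For the custom-annotation point, recent Weave versions let you attach metadata to calls with the weave.attributes context manager; here's a minimal sketch (the attribute keys and placeholder function are our own, so check that your installed Weave version supports this API):
import weave

weave.init("wandb_inference")

@weave.op()
def classify(text):
    return "positive"  # placeholder for a real GLM-4.5 call

# Calls made inside this block carry the extra metadata in the Weave dashboard
with weave.attributes({"use_case": "sentiment", "env": "staging"}):
    classify("I love this product!")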

Next steps

Now that you have a solid grasp of GLM-4.5, you can take the next steps to get the most out of the model. 🔗 Start by checking the W&B Inference documentation, which provides advanced configuration options. You may also want to explore Weave’s evaluation tools, designed to support systematic model testing.
📊 To improve your workflow, set up automated monitoring dashboards that match your project needs. Running A/B experiments with different prompting strategies can sharpen results, and defining custom evaluation metrics will let you measure performance against your domain-specific goals.
🚀 When you’re ready to scale, focus on building production pipelines that include robust error handling and monitoring. Analyzing usage patterns can guide cost management strategies, and connecting with other W&B tools will help create a more integrated end-to-end setup.
📚 For a deeper understanding of GLM-4.5 itself, review the official model card for a detailed breakdown of its capabilities. You can also study examples and use cases to see how it performs in practice.
Iterate on AI agents and models faster. Try Weights & Biases today.