Tutorial: Running inference with Kimi K2 using W&B Inference
Get set up and run Kimi K2, Moonshot AI's advanced long-context language model, in Python using W&B Inference. We'll be working with the moonshotai/Kimi-K2-Instruct model.
Running inference with Kimi K2 through W&B Inference (powered by CoreWeave) is surprisingly straightforward, and it offers far more flexibility than you'd get through most other interfaces. In this tutorial, we'll get you running inference and explore its advanced capabilities in depth.
Whether you're running inference on massive documents, building multilingual applications, or tackling complex reasoning tasks, this guide provides everything you need to run inference with Kimi K2 effectively.
Table of contents
- What is Kimi K2?
- W&B Weave
- Tutorial: Running inference with Kimi K2 using W&B Inference
- Prerequisites
- Step 1: Installation & setup
- Step 2: Environment configuration
- Step 3: Running basic inference with Kimi K2
- Step 4: Advanced Kimi K2 inference configuration
- Step 5: Running inference with Kimi K2's unique capabilities
- Monitoring Kimi K2 inference with W&B Weave
- Best Practices
- Next steps
What is Kimi K2?
Kimi K2 is a state-of-the-art large language model developed by Moonshot AI, engineered specifically for tasks requiring deep understanding and extensive context processing. Here's what sets Kimi K2 apart from other language models:
🔍 Extended context window: Process up to 128K tokens in a single conversation - perfect for analyzing entire codebases, legal documents, research papers, or maintaining context across lengthy dialogues.
🌐 Multilingual excellence: Superior performance across languages with particular strength in Chinese and English, making it ideal for international applications and cross-cultural content analysis.
🧠 Advanced reasoning: Excels at multi-step problem solving, mathematical reasoning, and complex analytical tasks that require maintaining logical consistency across long chains of thought.
💻 Code understanding: Sophisticated comprehension and generation capabilities across programming languages, with deep understanding of software architecture and patterns.
📋 Instruction following: Maintains context and follows detailed instructions throughout extended conversations, adapting its responses based on accumulated context.
For detailed technical specifications and performance benchmarks, visit the Kimi K2 model documentation.
W&B Weave
W&B Weave simplifies the process of tracking and analyzing model outputs in your project. To get started with Weave, you'll first import it and initialize it with your project name.
One of its standout features is the @weave.op decorator. In Python, a decorator is a powerful tool that extends the behavior of a function. By placing @weave.op above any function in your code, you're telling Weave to log that function's inputs and outputs automatically. This makes it incredibly easy to keep track of what data goes in and what comes out.
Once your code runs, these logs appear in the Weave dashboard, where you'll find detailed visualizations and traces of the function calls. This not only aids in debugging but also helps structure your experimental data, making it much easier to develop and fine-tune models like Kimi K2.
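As a minimal sketch of how this looks in practice (the function body and project name here are placeholders, not part of the tutorial's later setup):

import weave

# Initialize Weave with your W&B project ("team/project" format)
weave.init("your-team/your-project")

@weave.op()
def summarize(text: str) -> str:
    # Whatever happens in here - an API call, post-processing - gets traced
    return text[:100]

# This call, its input, and its return value all appear in the Weave dashboard
summarize("Weave records this call automatically.")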
Tutorial: Running inference with Kimi K2 using W&B Inference
Let's jump right in. This tutorial assumes you're working in a Jupyter notebook, but the code will work in other environments as well.
We're going to be running inference with the moonshotai/Kimi-K2-Instruct model specifically.
If you're not familiar with Jupyter Notebooks, you can get set up in about 5 minutes. I walk you through it in this tutorial.
Prerequisites
Before starting, ensure you have:
- Python 3.7 or higher installed
- Basic familiarity with Python and API usage
- Understanding of your use case requirements (document analysis, code review, multilingual tasks, etc.)
Step 1: Installation & setup
1. Install required packages
To get started running inference with Kimi K2, all you strictly need is the openai package. However, we'll also install wandb and weave so you can review and compare multiple outputs with W&B Weave, which makes the process much more efficient.
The code to do this is:
pip install openai wandb weave
Run this in your terminal, or enter it in a Jupyter cell and execute it.
When you execute the cell, you'll notice an asterisk ([*]) appear between the brackets [ ]. This indicates that the cell is running, and you'll need to wait until the asterisk turns into a number before proceeding.
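If you're working in a Jupyter notebook, you can run the same command directly in a cell by prefixing it with an exclamation mark, which tells Jupyter to execute it as a shell command:

!pip install openai wandb weave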
2. Get your W&B API key
- Copy your API key from https://wandb.ai/authorize
- Keep it handy for the next step
Step 2: Environment configuration
Setting up your environment variables is crucial for secure and seamless operation. You'll need both your W&B API key and project information.
Option 1: In a Jupyter Notebook
# Set environment variables in your notebook
%env WANDB_API_KEY=your-wandb-api-key-here
%env WANDB_PROJECT=your-team/your-project
Option 2: In Terminal/Shell
export WANDB_API_KEY="your-wandb-api-key-here"
export WANDB_PROJECT="your-team/your-project"
Option 3: In Python script
import os

# Set environment variables programmatically
os.environ["WANDB_API_KEY"] = "your-wandb-api-key-here"
os.environ["WANDB_PROJECT"] = "your-team/your-project"
Step 3: Running basic inference with Kimi K2
Hopefully this hasn't been too painful, because now we're at the fun part.
Here's a complete example to start running inference with Kimi K2:
import openai
import weave
import os

# Initialize Weave for automatic logging and monitoring
# This will use the WANDB_PROJECT environment variable
weave.init(os.getenv("WANDB_PROJECT"))

# Create the OpenAI client configured for W&B Inference
client = openai.OpenAI(
    # W&B Inference endpoint
    base_url='https://api.inference.wandb.ai/v1',
    # Use environment variable for API key (your W&B API key)
    api_key=os.getenv("WANDB_API_KEY"),
    # Specify your team and project for usage tracking
    project=os.getenv("WANDB_PROJECT"),
)

# Make a request to Kimi K2
response = client.chat.completions.create(
    model="moonshotai/Kimi-K2-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant specialized in providing detailed explanations."},
        {"role": "user", "content": "Explain quantum computing in simple terms."}
    ],
    temperature=0.7,
    max_tokens=1000
)

# Print the response
print(response.choices[0].message.content)
You'll find the inputs and outputs recorded to your Weave dashboard with the parameters automatically included:

Step 4: Advanced Kimi K2 inference configuration
Understanding inference parameters
Adjust Kimi K2's response behavior with these key inference parameters (feel free to play around with them and compare the outputs in Weave!).
response = client.chat.completions.create(
    model="moonshotai/Kimi-K2-Instruct",
    messages=[
        {"role": "system", "content": "You are a creative writing assistant."},
        {"role": "user", "content": "Write a short story about time travel."}
    ],
    temperature=0.8,        # Controls creativity (0.0=focused, 1.0=creative)
    max_tokens=2000,        # Maximum response length
    top_p=0.9,              # Nucleus sampling (0.1=focused, 1.0=diverse)
    frequency_penalty=0.1,  # Reduces repetitive phrases (0.0-2.0)
    presence_penalty=0.1,   # Encourages topic diversity (0.0-2.0)
)
Parameter Guidelines:
- Temperature: Use 0.1-0.3 for analytical tasks, 0.7-0.9 for creative work
- Top_p: Combine with temperature; 0.9 works well for most applications
- Penalties: Start with 0.1-0.2 to reduce repetition without affecting quality
This gives us added flexibility to influence our model output. These parameters are also automatically logged to W&B Weave for observability and future evaluation:

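To see how these parameters change the output in practice, you can run the same prompt with two different settings and compare the resulting traces side by side in Weave. A quick sketch, assuming the client from Step 3 is already configured:

prompt = "Summarize the benefits of unit testing in three sentences."

for temp in (0.2, 0.9):
    response = client.chat.completions.create(
        model="moonshotai/Kimi-K2-Instruct",
        messages=[{"role": "user", "content": prompt}],
        temperature=temp,  # Low = focused, high = more varied
        max_tokens=200,
    )
    # Each call becomes its own trace in Weave, making side-by-side review easy
    print(f"--- temperature={temp} ---")
    print(response.choices[0].message.content)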
Streaming inference responses
For real-time output and better user experience:
response = client.chat.completions.create(
    model="moonshotai/Kimi-K2-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me a comprehensive story about space exploration."}
    ],
    stream=True,
    temperature=0.7
)

# Print the response as it streams
print("Response: ", end="")
for chunk in response:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)
print("\n")
We got a streaming response:

With the metrics logged to Weave:

As well as the full output:

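If you also need the complete text in your program after the stream finishes (for logging or further processing), a variant of the streaming loop above can accumulate the chunks while printing them:

# Accumulate streamed chunks into a single string while printing them
chunks = []
for chunk in response:
    delta = chunk.choices[0].delta.content
    if delta is not None:
        print(delta, end="", flush=True)
        chunks.append(delta)
full_response = "".join(chunks)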
Step 5: Running inference with Kimi K2's unique capabilities
This is where running inference with Kimi K2 really shines. Let's explore what makes it special.
Long context inference
Kimi K2 excels at running inference on extensive documents. Here's a practical example:
# Real example: Analyzing a technical specification
technical_doc = """
API Documentation: RESTful Web Services v2.1

OVERVIEW
This document outlines the complete specification for our RESTful web services API,
including authentication, endpoints, request/response formats, error handling,
and rate limiting policies.

AUTHENTICATION
All API requests must include a valid Bearer token in the Authorization header:
Authorization: Bearer <your-access-token>

Tokens expire after 24 hours and must be renewed using the /auth/refresh endpoint.

ENDPOINTS

1. User Management
GET /api/v2/users - Retrieve user list
POST /api/v2/users - Create new user
PUT /api/v2/users/{id} - Update existing user
DELETE /api/v2/users/{id} - Delete user

2. Data Operations
GET /api/v2/data - Query data with filtering
POST /api/v2/data - Submit new data
PUT /api/v2/data/{id} - Update existing data

RATE LIMITING
- Standard tier: 1000 requests/hour
- Premium tier: 10000 requests/hour
- Enterprise tier: Unlimited

ERROR HANDLING
All errors return JSON with 'error' and 'message' fields:
{"error": "INVALID_TOKEN", "message": "The provided token is expired or invalid"}

[Document continues with detailed specifications...]
"""

response = client.chat.completions.create(
    model="moonshotai/Kimi-K2-Instruct",
    messages=[
        {
            "role": "system",
            "content": "You are an expert API analyst. Provide detailed technical analysis focusing on potential issues, missing information, and recommendations for improvement."
        },
        {
            "role": "user",
            "content": f"Analyze this API documentation and provide a comprehensive review:\n\n{technical_doc}"
        }
    ],
    temperature=0.3,  # Lower temperature for focused technical analysis
    max_tokens=2000
)

print(response.choices[0].message.content)
Which outputs to Weave:

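In practice, you'll often load long documents from disk rather than pasting them inline. A minimal sketch (the filename here is hypothetical, and the client from Step 3 is assumed):

# Load a long document from disk (hypothetical filename) and send it to Kimi K2
with open("api_spec.txt", "r", encoding="utf-8") as f:
    document = f.read()

response = client.chat.completions.create(
    model="moonshotai/Kimi-K2-Instruct",
    messages=[
        {"role": "system", "content": "You are an expert technical reviewer."},
        {"role": "user", "content": f"Review this document:\n\n{document}"},
    ],
    temperature=0.3,
    max_tokens=2000,
)
print(response.choices[0].message.content)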
Multilingual inference
Leverage Kimi K2's multilingual inference capabilities for codebases and content that mix languages:
mixed_language_code = """
// English comments with Chinese variable names
function 计算总价(商品列表, 折扣率) {
    let 总价 = 0;
    // Calculate base total
    for (const 商品 of 商品列表) {
        总价 += 商品.价格 * 商品.数量;
    }
    // Apply discount
    const 折扣金额 = 总价 * 折扣率;
    return 总价 - 折扣金额;
}

# Python function with mixed comments
def validate_用户输入(user_data):
    '''
    验证用户输入数据的完整性和有效性
    Validates user input data for completeness and validity
    '''
    required_fields = ['name', 'email', '年龄']
    for field in required_fields:
        if field not in user_data:
            raise ValueError(f"Missing required field: {field}")
    return True
"""

response = client.chat.completions.create(
    model="moonshotai/Kimi-K2-Instruct",
    messages=[
        {
            "role": "system",
            "content": "You are a senior software engineer fluent in multiple languages. Analyze code for best practices, potential issues, and provide refactoring suggestions."
        },
        {
            "role": "user",
            "content": f"Please review this multilingual code and suggest improvements for maintainability:\n\n{mixed_language_code}"
        }
    ],
    temperature=0.2
)

print(response.choices[0].message.content)
Which logs to Weave as:

Complex multi-step reasoning inference with Kimi K2
Utilize Kimi K2's reasoning capabilities for complex, multi-step problem solving:
response = client.chat.completions.create(
    model="moonshotai/Kimi-K2-Instruct",
    messages=[
        {
            "role": "system",
            "content": "You are a strategic business analyst. Break down complex problems into logical steps and provide actionable recommendations."
        },
        {
            "role": "user",
            "content": """Our SaaS company is experiencing a 15% monthly churn rate. Our customer acquisition cost (CAC) is $150,
average customer lifetime value (CLV) is $800, and monthly recurring revenue per customer is $50.
We have 2,000 active customers and are spending $60,000/month on marketing.

Please analyze this situation and provide a comprehensive strategy to improve our metrics,
including specific actions, expected timelines, and success metrics."""
        }
    ],
    temperature=0.4,  # Balanced creativity and focus for strategic thinking
    max_tokens=2500
)

print(response.choices[0].message.content)
Which you'll see in the dashboard:

Monitoring Kimi K2 inference with W&B Weave
From the final cell, you can access the inference output and copy it from there. If you want to dig deeper into what's going on or review past inference requests (and I recommend you do), visit your Weights & Biases dashboard or click the links printed with the response.
With Weave initialized using your environment variable, all your inference API calls are automatically tracked. Here's what gets logged and how to use it effectively:
What Weave tracks automatically
- Request details: Model used, parameters, token counts
- Response data: Content, processing time, success/failure status
- Usage metrics: Token consumption, API costs, rate limit status
- Performance: Response latency, throughput patterns
Accessing your logs
- Visit your W&B project dashboard at: https://wandb.ai/[your-team]/[your-project]
- Navigate to the "Weave" section
- View detailed logs, filter by date/model/status
- Analyze usage patterns and optimize accordingly
Custom Weave annotations
Add custom metadata and organize your API calls:
import weave

@weave.op()
def analyze_customer_feedback(feedback_text, sentiment_threshold=0.5):
    """
    Analyzes customer feedback and categorizes sentiment.
    This function will be tracked with custom metadata.
    """
    response = client.chat.completions.create(
        model="moonshotai/Kimi-K2-Instruct",
        messages=[
            {
                "role": "system",
                "content": "You are a customer feedback analyst. Analyze sentiment and provide a score from -1 (very negative) to 1 (very positive). Also categorize the main topics mentioned."
            },
            {
                "role": "user",
                "content": f"Analyze this customer feedback: {feedback_text}"
            }
        ],
        temperature=0.1  # Consistent analysis
    )
    return response.choices[0].message.content

# This call will be logged with function name, parameters, and execution context
result = analyze_customer_feedback(
    "The new update is confusing and slow. I can't find the features I used daily.",
    sentiment_threshold=0.3
)
print(result)
Which would appear as:

Best Practices
Here are some best practices to follow when testing or deploying Kimi K2 via inference (or any other model, for that matter).
Security and Configuration
- Environment variables: Always store API keys in environment variables, never hardcode them
- Project organization: Use clear, descriptive project names following the "team/project" format
- Access control: Limit API key permissions to necessary scopes only
Prompt Engineering for Kimi K2
- Leverage long context: Don't hesitate to provide extensive context - Kimi K2 handles it well
- Clear instructions: Be specific about the desired output format and style
- System messages: Use detailed system prompts to establish expertise and context
- Temperature selection: Lower values (0.1-0.3) for analytical tasks, higher (0.7-0.9) for creative work
Performance Optimization
- Streaming: Use streaming for longer responses to improve user experience
- Batch processing: Group similar requests when possible to improve efficiency
- Token management: Monitor token usage to optimize costs and stay within limits
- Caching: Implement response caching for frequently requested analyses, as shown in the sketch after this list
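Here's a minimal caching sketch, assuming the client from Step 3; the helper name cached_completion is hypothetical, and there's no eviction or persistence, so treat it as a starting point rather than production code:

import hashlib
import json

# Simple in-memory cache keyed on the full request contents
_cache = {}

def cached_completion(messages, **params):
    # Hash the messages plus parameters so identical requests share one entry
    key = hashlib.sha256(
        json.dumps({"messages": messages, **params}, sort_keys=True).encode()
    ).hexdigest()
    if key not in _cache:
        response = client.chat.completions.create(
            model="moonshotai/Kimi-K2-Instruct",
            messages=messages,
            **params,
        )
        _cache[key] = response.choices[0].message.content
    return _cache[key]

# Repeated identical calls are served from the cache without a new API request
answer = cached_completion(
    [{"role": "user", "content": "Summarize the rate limiting policy."}],
    temperature=0.1,
)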
Monitoring and Debugging
- Weave integration: Use Weave's automatic logging for all production calls
- Custom annotations: Add meaningful metadata to track different use cases
- Error analysis: Regularly review failed requests to identify patterns (see the retry sketch after this list)
- Performance tracking: Monitor response times and adjust parameters accordingly
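To make transient failures easier to survive and review, you can wrap your calls in a small retry helper. A minimal sketch, assuming the client from Step 3; the helper name complete_with_retries is hypothetical, and it retries only on rate limits and timeouts with exponential backoff:

import time
import openai

def complete_with_retries(messages, max_retries=3, **params):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="moonshotai/Kimi-K2-Instruct",
                messages=messages,
                **params,
            )
        except (openai.RateLimitError, openai.APITimeoutError) as e:
            if attempt == max_retries - 1:
                raise  # Give up after the final attempt
            wait = 2 ** attempt  # Back off: 1s, 2s, 4s, ...
            print(f"Transient error ({e}); retrying in {wait}s")
            time.sleep(wait)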
Next steps
Now that you're equipped with comprehensive Kimi K2 knowledge:
🔗 Explore Advanced Features
📊 Optimize Your Workflow
- Set up automated monitoring dashboards for your specific use cases
- Implement A/B testing between different prompting strategies
- Create custom evaluation metrics for your domain-specific tasks
🚀 Scale Your Implementation
- Build production pipelines with proper error handling and monitoring
- Implement cost optimization strategies based on usage patterns
- Explore integration with other W&B tools for end-to-end ML workflows
📚 Dive Deeper into Kimi K2
- Explore community examples and use cases
- Stay updated with model improvements and new features
With this comprehensive setup, you're ready to harness Kimi K2's advanced capabilities while maintaining professional-grade monitoring, logging, and error handling through W&B Inference and Weave.