
Tutorial: Kimi K2 for code generation with observability

Kimi K2 is Moonshot AI's 1T-parameter open model excelling in coding and reasoning. Explore its features, API deployment, and monitoring with W&B Weave.
Created on July 14|Last edited on July 18
You can run Kimi K2 for free via the API or in the Playground using W&B Inference by CoreWeave. Read the announcement.
Kimi K2 is the latest large language model (LLM) from Moonshot AI, a research lab and startup known for pushing the boundaries of open-source AI. Announced in July 2025, Kimi K2 directly challenges proprietary systems from OpenAI and Anthropic with comparable (or even superior) capabilities.
We'll be covering the capabilities and details of this new model, as well as how to deploy and use it via the API. That said, if you want to dive right into using it via the OpenRouter API, you can jump straight to the tutorial below.


At its core, this model is designed for advanced reasoning and coding: it can generate code, solve complex problems step-by-step, and even autonomously perform multi-step tasks like an AI agent. What makes Kimi K2 especially notable is its Mixture-of-Experts (MoE) architecture, which allows an enormous total parameter count (1 trillion) without incurring the full computational cost on each query.
In practice, only a subset of expert networks (32B parameters worth) are activated for any given input, making Kimi K2 highly efficient for its size. Thanks to this design and a novel MuonClip optimizer, the model was trained on a massive 15.5 trillion token dataset with stable convergence. The result is an AI system that excels in coding tasks, complex reasoning, and tool use, often matching or outperforming much larger closed-source models in benchmarks.
Beyond raw performance, Kimi K2 has been purpose-built for what Moonshot calls agentic intelligence. In simple terms, agentic capabilities refer to the model’s ability to not just provide static answers, but to interact with tools, write and execute code, and complete goals through a series of steps under its own initiative. For example, instead of merely explaining data, Kimi K2 might generate a Python script to analyze the data, run it (in a sandbox), and return results.
This focus on action-oriented output makes Kimi K2 more than a chatbot; it’s closer to an AI assistant that can act autonomously. As Moonshot’s team put it, “Kimi K2 does not just answer; it acts,” emphasizing that the model was trained to use tools and perform multi-step workflows without human intervention.
For users, this means tasks like debugging code, data analysis, or even orchestrating web actions can potentially be handled by Kimi K2 in a single prompt. In the next sections, we will break down Kimi K2’s technical features, its variants, how you can deploy it, and finally walk through an example of using W&B Weave to monitor its behavior in a coding task.


Understanding Kimi K2

Kimi K2’s key features and capabilities stem from its powerful architecture and training. It is a Mixture-of-Experts Transformer model with 61 layers and 64 attention heads, leveraging 384 expert sub-models (with 8 experts activated per token) to achieve the equivalent of a 1-trillion-parameter network. This sparse activation approach means that for any given input, only the most relevant “experts” are engaged, allowing the model to be both massive in knowledge and efficient in computation.
The attention mechanism is optimized for a very long context window of up to 128,000 tokens, enabling Kimi K2 to handle and reason about extremely large inputs (for instance, analyzing an entire codebase or lengthy document in one go). The model also uses the SwiGLU activation function and a custom optimizer called MuonClip to maintain training stability at this scale. In practical terms, these architectural choices translate into a model that has absorbed a vast amount of information and can perform complex reasoning steps without diverging or losing coherence over long sequences.
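To make the sparse-activation idea concrete, here is a minimal, illustrative top-k routing sketch in PyTorch. This is not Moonshot's implementation; the expert count (384) and active-expert count (8) simply mirror the figures quoted above, and the layer is deliberately toy-sized:
import torch
import torch.nn as nn

class ToyMoELayer(nn.Module):
    """Toy mixture-of-experts layer: route each token to its top-k experts only."""
    def __init__(self, d_model: int = 64, n_experts: int = 384, top_k: int = 8):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)   # scores every expert for each token
        self.experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_experts))
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (num_tokens, d_model)
        weights, indices = torch.topk(self.router(x).softmax(dim=-1), self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for t in range(x.size(0)):                        # only k of the n_experts run per token
            for w, e in zip(weights[t], indices[t]):
                out[t] += w * self.experts[int(e)](x[t])
        return out

# e.g. ToyMoELayer()(torch.randn(4, 64)) activates 8 experts per token, not all 384.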
Benchmark results highlight Kimi K2’s performance on various tasks compared to other models (higher is better).

In coding benchmarks like LiveCodeBench and OJBench, Kimi K2 (blue) leads open models and even surpasses some proprietary models. It also shows strong results in agentic tool-use evaluations (Tau2-Bench and AceBench) and math reasoning (AIME, GPQA), often matching or exceeding GPT-4’s scores. Kimi K2’s real-world performance backs up its impressive specs. In benchmark evaluations, it often ranks at the top among open-source models and is competitive with the best closed models.
For example, on the LiveCodeBench coding challenge (which tests writing correct code from prompts), Kimi K2 achieved 53.7% accuracy, beating another open MoE model DeepSeek-V3 (46.9%) and even outscoring GPT-4.1 (which scored 44.7%). Similarly, on the challenging MATH-500 exam, Kimi K2 scored 97.4%, edging out GPT-4.1’s 92.4%. These results suggest that Moonshot AI managed to “crack” aspects of mathematical reasoning that had eluded larger competitors.
Kimi K2 also shines in software engineering tasks: it hit 65.8% on the SWE-Bench Verified coding benchmark, significantly above most open models. In general, across coding, math, and multi-step reasoning tasks, Kimi K2 delivers state-of-the-art results for an open model. Its prowess isn’t just limited to static Q&A – because of the agentic design, it can plan and execute solutions (like writing code or calling a tool) where other models might only give a textual answer. This combination of broad knowledge, reasoning depth, and actionable output is what makes Kimi K2 stand out.
Another notable capability of Kimi K2 is its exceptional code generation skill. The model was trained on a massive corpus of source code, which enables it to produce well-structured, contextually appropriate code in multiple programming languages. In fact, Kimi K2’s code generation is one of its standout features – it can handle tasks like writing functions, debugging errors, or even translating code from one language to another.
Many users have found that Kimi K2 can solve competitive programming problems and real-world coding challenges with ease, often rivaling specialized code AIs. Combined with its long context window (useful for understanding large codebases), this makes Kimi K2 especially powerful for developers. It can keep track of extensive code context and generate new code that fits seamlessly, which is extremely valuable for tasks like code completion, refactoring, or answering questions about code. Overall, understanding Kimi K2 means recognizing it as a general-purpose AI with a specialization in coding and “thinking” tasks – one that leverages an innovative architecture to deliver high performance while remaining accessible as an open-source model.

Variants of Kimi K2

To accommodate different use cases, Moonshot AI released two variants of the Kimi K2 model: Kimi-K2-Base and Kimi-K2-Instruct. Both share the same underlying architecture and knowledge, but they are tuned for distinct purposes:
  • Kimi-K2-Base: This is the raw, foundational model. It’s essentially the pre-trained base model without instruction fine-tuning. The Base variant is ideal for researchers, enthusiasts, or companies who want full control over the model’s behavior. You might choose Kimi-K2-Base if you plan to fine-tune the model on custom data, experiment with novel prompts, or integrate it into specialized workflows. It’s described as a strong starting point for those who want to tinker, customize, or build bespoke solutions on top of Kimi K2’s knowledge. In other words, Kimi-K2-Base gives you the flexibility to shape the model’s style and responses to your needs, much like one would with a base GPT model.
  • Kimi-K2-Instruct: This is an instruction-tuned version of Kimi K2, designed to be usable out-of-the-box for general tasks and conversations. The Instruct variant has undergone additional training (often called fine-tuning or post-training) on instruction-following data, making it behave politely and helpfully in a chat or assistant setting. Moonshot optimized Kimi-K2-Instruct for general-purpose dialogue and “agentic experiences”. This means it’s the go-to model if you want a ready-made AI assistant that can follow user instructions, engage in multi-turn conversations, and perform agent-like tasks (tool use, code execution) without requiring the user to manage every step. Notably, the instruct model is said to be “reflex-grade” without long thinking, implying it responds quickly and concisely, which is often desirable in interactive applications.
Having these two variants enhances Kimi K2’s versatility. If you want a highly adaptive AI for research, Kimi-K2-Base gives you a playground to try novel prompts or fine-tune to niche domains (for example, a biomedical Q&A bot or a game-playing agent). On the other hand, if you need a reliable conversational agent or coding assistant right away, Kimi-K2-Instruct is ready to deploy with sensible defaults. This dual-release strategy (Base and Instruct) is increasingly common in open-source AI, as it caters to both builders (who prefer raw models to mold themselves) and end-users (who prefer a model that “just works” for common tasks).

Deploying Kimi K2 via API

We won't be covering running Kimi K2 locally here; our tutorial below covers running it through the API. It's also available on GitHub here.
One of the strengths of Kimi K2 being open-source is that you have multiple options to deploy and use it, from running it on your own hardware to accessing it through convenient APIs. In this section, we’ll outline how to deploy Kimi K2 via an API, highlighting compatibility with OpenAI/Anthropic API standards and the recommended inference engines for best performance.
1. Obtaining the Model Weights: Moonshot AI has made the Kimi K2 model weights available openly. You can find the checkpoints on HuggingFace. The weights use a specific block-fp8 format (to reduce memory usage), so ensure your environment supports that or convert as needed. Downloading the model requires substantial storage (since it’s a trillion-parameter model, even with only 32B active parameters, expect hundreds of gigabytes of data).
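For example, assuming you have the huggingface_hub library installed and sufficient disk space, a download might look like the following (the repo ID below is the Instruct checkpoint's page on Hugging Face at the time of writing; verify it before running):
from huggingface_hub import snapshot_download

# Pulls the full checkpoint (hundreds of GB) into a local directory.
local_dir = snapshot_download(
    repo_id="moonshotai/Kimi-K2-Instruct",  # assumed repo ID; check the Hub for the current name
    local_dir="./kimi-k2-instruct",
)
print(f"Weights downloaded to {local_dir}")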
2. Choosing an Inference Engine: Given Kimi K2’s size and MoE architecture, it’s crucial to use an optimized inference engine to serve it. The developers recommend using one of the following engines for running K2:
  • vLLM: A highly optimized transformer inference engine that supports continuous batching of requests for high throughput.
  • SGLang (Structured Generation Language): A fast LLM serving framework with an efficient runtime and optimized support for large MoE models.
  • KTransformers: A framework for running very large transformers on constrained hardware, using optimized kernels and CPU/GPU offloading (including offloading MoE experts to CPU).
  • TensorRT-LLM: NVIDIA’s TensorRT for LLMs, providing optimized GPU acceleration (especially if you have NVIDIA hardware, this can greatly speed up inference).
Each of these engines is built to handle large models efficiently. For example, vLLM is known for its ability to handle long context windows and many concurrent requests by effectively managing GPU memory and compute. TensorRT-LLM will leverage low-level optimizations and quantization to run the model faster on GPUs (often a must for production deployment of a model this size). Moonshot provides a Model Deployment Guide (linked in their repo) with example configurations for vLLM and SGLang to help you get started.
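As a rough sketch only (the exact flags depend on your vLLM version and hardware; Moonshot's Model Deployment Guide has tested configurations), launching an OpenAI-compatible vLLM server for the Instruct model might look like:
# Hypothetical invocation; adjust parallelism and context length to your GPUs.
vllm serve moonshotai/Kimi-K2-Instruct \
    --trust-remote-code \
    --tensor-parallel-size 16 \
    --max-model-len 131072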
3. Starting the Kimi K2 Service: Once you have an engine set up, you’ll load the Kimi K2 weights into it and start a server. The exact steps depend on the engine:
  • With vLLM, you might use a command-line tool or Python API to launch an HTTP server or an RPC server for the model. For instance, vLLM has a RESTful API mode that can mimic OpenAI’s API format.
  • With OpenRouter (an emerging solution for unified LLM access), you can potentially host Kimi K2 behind OpenRouter. In fact, Kimi K2 is available on the OpenRouter platform, which provides a unified API for many models. By signing up for OpenRouter and obtaining an API key, you can call Kimi K2 through OpenRouter’s endpoints without hosting it yourself. This is a convenient option if you want to try K2 without heavy infrastructure – just note their service may charge for usage after a free tier.
  • If Moonshot AI offers their own API: According to VentureBeat, Moonshot provides competitively priced API access to Kimi K2 (e.g., pricing far below OpenAI’s rates). Check Moonshot’s website for an official API endpoint. Using their API would be as simple as sending HTTP requests with your queries and API key.
4. API Compatibility (OpenAI/Anthropic format): A great feature of Kimi K2’s deployment is that it’s designed to be compatible with OpenAI and Anthropic’s API schemas. This means if you have existing code that calls openai.ChatCompletion.create(...) or an Anthropic Claude API, you can point it to your Kimi K2 service with minimal changes. In the Moonshot GitHub, they mention providing an OpenAI/Anthropic–compatible API layer. For example, the API expects chat messages with roles (“system”, “user”, “assistant”) just like OpenAI’s, and supports functions (tool calls) similar to OpenAI’s function calling interface. One caveat noted is that the temperature parameter is scaled slightly differently for Anthropic compatibility (they multiply it by 0.6 under the hood), so you might not need to adjust much when switching from, say, Claude to Kimi. Essentially, you could take an existing chatbot app that uses gpt-3.5-turbo or Claude, change the API endpoint and model name to Kimi K2, and it should work out-of-the-box.
5. Example API Usage: After your server is running, you can make requests to Kimi K2. Here’s a simple example using Python pseudocode, assuming we have an OpenAI-compatible endpoint (local or via OpenRouter):
import openai

# Point an OpenAI-compatible client at your Kimi K2 server (or OpenRouter)
client = openai.OpenAI(
    base_url="http://localhost:8000/v1",  # your Kimi K2 server URL
    api_key="YOUR_API_KEY"                # any non-empty string works for a local server
)

response = client.chat.completions.create(
    model="kimi-k2-instruct",
    messages=[
        {"role": "system", "content": "You are Kimi, an AI assistant."},
        {"role": "user", "content": "Give a brief self-introduction."}
    ],
    temperature=0.6,
    max_tokens=256
)

print(response.choices[0].message.content)
In this snippet, we treat Kimi K2 as if it were an OpenAI model. The exact model name depends on how the server or OpenRouter expects it – check their docs (on OpenRouter it's "moonshotai/kimi-k2", or "moonshotai/kimi-k2:free" for the free tier used later in this tutorial). The system prompt is optional but recommended (Moonshot suggests a default system message introducing the AI as “Kimi, an AI assistant created by Moonshot AI”). We use temperature=0.6 because that’s the recommended value for natural, accurate responses with K2-Instruct. The model will then return a completion in the usual format.
6. Inference Considerations: Because Kimi K2 is large, running it in real-time may require strong hardware (multiple high-memory GPUs). If you use a cloud service or Moonshot’s API, the heavy lifting is done for you. But if self-hosting, monitor your GPU memory usage and latency. You might use Tensor Parallelism or CPU offloading if needed to fit the model. Also, use the fp8 precision if supported, as it drastically cuts memory usage (with minimal quality loss). Using the recommended engines like TensorRT or vLLM will automatically handle many optimizations (like KV cache management, batch scheduling, etc.).
By following these steps, you can integrate Kimi K2 into various applications—whether it’s a chat assistant in a Slack bot, an AI pair programmer in your IDE, or a backend service that solves analytic problems. The big advantage here is that because Kimi K2’s API is compatible with the popular standards set by OpenAI/Anthropic, you can swap in Kimi K2 for other models with little code change. This allows easy A/B testing against models like GPT-4 or Claude and helps in adopting Kimi K2 in existing pipelines.
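As an illustration of how small that code change is, here is a hedged sketch that sends the same prompt to two OpenAI-compatible endpoints; the model identifiers are assumptions (an OpenAI model on one side, the OpenRouter Kimi K2 slug on the other), so substitute whatever your deployment actually exposes:
import openai

openai_client = openai.OpenAI(api_key="OPENAI_API_KEY")
kimi_client = openai.OpenAI(api_key="OPENROUTER_API_KEY",
                            base_url="https://openrouter.ai/api/v1")

messages = [{"role": "user", "content": "Write a Python function that merges two sorted lists."}]

# Same call shape for both models - only the client and model name change.
for label, llm_client, model in [("gpt-4.1", openai_client, "gpt-4.1"),
                                 ("kimi-k2", kimi_client, "moonshotai/kimi-k2")]:
    reply = llm_client.chat.completions.create(model=model, messages=messages, temperature=0.6)
    print(f"--- {label} ---\n{reply.choices[0].message.content[:200]}\n")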

Licensing and access

Kimi K2 is not only technologically impressive but also generously offered under an open license for others to use and build upon. Both the model’s code repository and the trained weights are released under a modified MIT License. In practical terms, this means you have significant freedom to use, distribute, and even commercialize applications built on Kimi K2, with only minimal conditions (the modification adds an attribution clause: very large-scale commercial deployments are asked to display “Kimi K2” prominently in their product’s interface).
This open licensing is a breath of fresh air in a landscape where many advanced models are closed or have restrictive licenses. Developers and organizations can inspect Kimi K2’s code, fine-tune the model on proprietary data, or integrate it into products without having to negotiate usage fees or worry about violating terms (beyond the standard MIT conditions).
To access the model, you have a few resources:
  • GitHub Repository: Moonshot AI’s GitHub (MoonshotAI/Kimi-K2) contains the code, model cards, and documentation for Kimi K2. The README provides an overview of the model, its features, and usage examples (some of which we’ve cited here). You’ll also find links to the model weights and any updates or community discussions (issues, pull requests). This is the first place to go if you want to run Kimi K2 yourself or delve into how it was built.
  • HuggingFace Hub: The model weights (checkpoints) are hosted on Hugging Face Hub for easy downloading. Hugging Face might also have a model page (possibly for Kimi-K2-Instruct and Kimi-K2-Base separately) with example usage snippets and an interactive inference widget.
It’s worth highlighting how developer-friendly this licensing and access model is. By using an MIT-like license, Moonshot encourages wide adoption of Kimi K2. You could, for instance, deploy Kimi K2 in your company’s internal tools or integrate it into a product for customers without needing special permission. The open access also means researchers around the world can inspect how the model was trained, evaluate its biases or limitations, and contribute improvements.
Moonshot’s strategy here is not just altruism; as VentureBeat noted, it’s also a savvy move to build a community and ecosystem around Kimi K2. Every developer who experiments with or fine-tunes Kimi K2 potentially feeds back useful insights or enhancements, making the model better over time. In essence, the more people using Kimi K2, the stronger it can become, which is a virtuous cycle that proprietary models don’t benefit from. Just be sure to respect the license by reviewing its terms (e.g., keep the copyright notice and note any modifications if you redistribute it).

Tutorial: Advanced Kimi K2 observability with W&B Weave

Now that you have a grasp of what Kimi K2 is and how to deploy it, let’s dive into a hands-on tutorial.
This comprehensive tutorial demonstrates how to leverage Weights & Biases (W&B) Weave's powerful observability features to monitor, debug, and optimize Kimi K2's performance across various coding and agentic use cases. We'll specifically use OpenRouter as the API endpoint for Kimi K2, providing a convenient and free-tier-friendly way to access this powerful model.

Prerequisites

Before you begin, ensure you have the necessary libraries installed and your W&B and OpenRouter accounts set up.
pip install wandb weave openai
You'll need:
  • A free Weights & Biases account (Weave will prompt you to log in with your W&B API key on first run)
  • An OpenRouter account and API key for calling Kimi K2
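One tidy way to satisfy both requirements is to keep the keys in environment variables instead of hard-coding them. WANDB_API_KEY is read automatically by W&B tools; OPENROUTER_API_KEY is just a naming suggestion that we read ourselves below:
import os

# Export these in your shell (or a .env file) before running the tutorial:
#   export WANDB_API_KEY="..."        # used by wandb/weave when logging in
#   export OPENROUTER_API_KEY="..."   # suggested name; read manually when creating the client
openrouter_api_key = os.environ.get("OPENROUTER_API_KEY", "YOUR_OPENROUTER_API_KEY")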

Step 1: Initialize Weave and configure Kimi K2 via OpenRouter

First, we'll set up our Python environment by initializing W&B Weave and configuring the openai client to point to OpenRouter's Kimi K2 endpoint.
import weave
import openai
from typing import Dict, List, Any
import time
import json

# Initialize Weave with a clear project name.
# This creates a dedicated space in your W&B dashboard for Kimi K2 logs.
# You will be prompted to log in to W&B if you haven't already.
weave.init(
    project_name="kimi-k2-deep-observability",
    # entity="your-wandb-entity"  # Optional: specify your W&B team/entity if working in a team
)

# Configure the OpenAI client to use OpenRouter's API for Kimi K2.
# OpenRouter's API is largely compatible with OpenAI's.
# IMPORTANT: Replace "YOUR_OPENROUTER_API_KEY" with your actual OpenRouter API key.
# It should start with 'sk-or-'.
client = openai.OpenAI(
    api_key="YOUR_OPENROUTER_API_KEY",
    base_url="https://openrouter.ai/api/v1"
)

# Define the model name for Kimi K2 on OpenRouter.
# Using 'moonshotai/kimi-k2:free' targets the free tier of the model.
# Be aware of OpenRouter's free tier rate limits (e.g., 50 requests/day, 20/min).
KIMI_K2_MODEL_NAME = "moonshotai/kimi-k2:free"
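Optionally, before wiring up any instrumentation, you can make a single sanity-check call to confirm that the API key, base URL, and model name all line up (this step is an addition to the walkthrough, not required by it):
# Quick connectivity check against the OpenRouter endpoint.
ping = client.chat.completions.create(
    model=KIMI_K2_MODEL_NAME,
    messages=[{"role": "user", "content": "Reply with the single word: ready"}],
    max_tokens=5,
)
print(ping.choices[0].message.content)  # should print something close to "ready"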

Step 2: Create instrumented functions with rich metadata

Weave allows you to wrap your functions with @weave.op() decorators to automatically track their execution, inputs, and outputs. This is where we'll define our interactions with Kimi K2, embedding valuable metadata for detailed analysis in the Weights & Biases dashboard.
@weave.op()
def generate_code_with_context(
    prompt: str,
    language: str = "python",
    complexity: str = "medium",
    include_tests: bool = False
) -> Dict[str, Any]:
    """
    Calls Kimi K2 to generate code based on a prompt,
    tracking various parameters and returning structured output for observability.
    """

    # Build an enhanced prompt that provides Kimi K2 with more context.
    # This helps guide the model's generation and ensures better quality outputs.
    enhanced_prompt = f"""
Language: {language}
Complexity: {complexity}
Include tests: {include_tests}
Task: {prompt}
Please provide clean, well-documented code.
"""

    start_time = time.time()

    # Call Kimi K2 via the OpenRouter-configured OpenAI client
    response = client.chat.completions.create(
        model=KIMI_K2_MODEL_NAME,  # Use the defined Kimi K2 model name (including ':free' suffix)
        messages=[
            {"role": "system", "content": "You are an expert programmer. Write clean, efficient code with clear comments."},
            {"role": "user", "content": enhanced_prompt}
        ],
        temperature=0.6,  # Recommended temperature for balanced creativity and accuracy with Kimi K2
        max_tokens=1000,
        # Optional: Add OpenRouter-specific headers for tracking/ranking
        extra_headers={
            "HTTP-Referer": "https://kimi-k2-tutorial.example.com",  # Your site URL for rankings on openrouter.ai
            "X-Title": "Kimi K2 Weave Tutorial",  # Your app title for rankings on openrouter.ai
        }
    )

    end_time = time.time()

    # Weave automatically captures basic OpenAI client metrics (tokens, cost).
    # By returning a dictionary, we can add custom metadata that will appear in the trace.
    return {
        "generated_code": response.choices[0].message.content,
        "language": language,
        "complexity": complexity,
        "include_tests": include_tests,  # Log input parameter for easier filtering later
        "response_time_seconds": end_time - start_time,
        "token_usage": {
            "prompt_tokens": response.usage.prompt_tokens,
            "completion_tokens": response.usage.completion_tokens,
            "total_tokens": response.usage.total_tokens
        },
        "model_info": {
            "model_id": response.model,  # The actual model ID used by OpenRouter/Kimi K2
            "finish_reason": response.choices[0].finish_reason
        }
    }

@weave.op()
def evaluate_code_quality(code: str, language: str) -> Dict[str, Any]:
    """
    Uses Kimi K2 to evaluate the quality of a given code snippet.
    This creates a feedback loop, demonstrating Kimi K2's ability to "self-critique"
    or provide an objective assessment.
    """

    evaluation_prompt = f"""
Please evaluate this {language} code on a scale of 1-10 for the following criteria:
- Correctness (Is the code logically sound and bug-free?)
- Readability (Is the code easy to understand and well-documented?)
- Efficiency (Is the code optimized for performance?)
- Adherence to Best Practices (Does it follow common {language} conventions and patterns?)
Code to evaluate:
```{language}
{code}
```
Provide your scores and brief explanations for each criterion in JSON format.
Example expected output:
{{
"correctness": {{ "score": 9, "explanation": "..." }},
"readability": {{ "score": 8, "explanation": "..." }},
"efficiency": {{ "score": 7, "explanation": "..." }},
"best_practices": {{ "score": 8, "explanation": "..." }}
}}
"""

    response = client.chat.completions.create(
        model=KIMI_K2_MODEL_NAME,
        messages=[{"role": "user", "content": evaluation_prompt}],
        temperature=0.3,  # Lower temperature for more consistent and factual evaluation
        max_tokens=500,
        extra_headers={
            "HTTP-Referer": "https://kimi-k2-tutorial.example.com",
            "X-Title": "Kimi K2 Weave Tutorial (Evaluator)",
        }
    )

    try:
        # Attempt to parse the evaluation response as JSON.
        # Weave will log parsing errors if they occur, aiding debugging.
        evaluation_content = response.choices[0].message.content
        evaluation = json.loads(evaluation_content)
        return {
            "evaluation_scores": evaluation,
            "evaluator_model_id": response.model,
            "evaluation_tokens": response.usage.total_tokens
        }
    except json.JSONDecodeError as e:
        # If Kimi K2 doesn't return perfect JSON, Weave still logs the raw response,
        # which is invaluable for debugging prompt engineering.
        return {
            "evaluation_scores": {"error": f"JSON parsing failed: {e}"},
            "raw_evaluation_response": response.choices[0].message.content,
            "evaluator_model_id": response.model,
            "evaluation_tokens": response.usage.total_tokens
        }

@weave.op()
def multi_step_coding_agent(task_description: str) -> Dict[str, Any]:
    """
    Demonstrates Kimi K2's agentic capabilities by orchestrating a multi-step workflow.
    Each sub-step (planning, code generation, evaluation) is tracked separately
    within the overall trace, providing deep insight into the agent's decision-making.
    """

    # Step 1: Kimi K2 acts as a planner to break down the complex task.
    # This simulates a "thought" process, which is a hallmark of agentic AI.
    print(f" --> Agent Planning for: {task_description}")
    planning_response = client.chat.completions.create(
        model=KIMI_K2_MODEL_NAME,
        messages=[{
            "role": "system",
            "content": "You are a helpful assistant that can break down complex programming tasks into actionable steps."
        }, {
            "role": "user",
            "content": f"Break down this coding task into logical, sequential steps for a developer to follow: {task_description}"
        }],
        temperature=0.5,
        max_tokens=300,
        extra_headers={
            "HTTP-Referer": "https://kimi-k2-tutorial.example.com",
            "X-Title": "Kimi K2 Weave Tutorial (Planner)",
        }
    )

    task_plan = planning_response.choices[0].message.content
    print(f" --> Plan generated:\n{task_plan}\n")

    # Step 2: Generate the code based on the original task.
    # This uses our previously defined, instrumented function.
    print(" --> Agent generating code...")
    code_result = generate_code_with_context(
        prompt=task_description,
        language="python",
        complexity="medium",
        include_tests=True
    )
    print(" --> Code generated.")

    # Step 3: Evaluate the generated code.
    # This showcases a self-correction or verification step within the agent.
    print(" --> Agent evaluating generated code...")
    evaluation = evaluate_code_quality(
        code=code_result["generated_code"],
        language="python"
    )
    print(" --> Code evaluation complete.")

    return {
        "task_description": task_description,
        "agent_plan": task_plan,
        "generated_code_result": code_result,
        "code_evaluation_result": evaluation,
        "workflow_completed": True
    }

Step 3: Execute and monitor multiple scenarios

Now we'll run our instrumented functions with different inputs. Each call will generate a trace that Weave captures and sends to your Weights & Biases dashboard, providing a rich dataset for analysis.
# Test different types of coding tasks to observe Kimi K2's performance
test_scenarios = [
    {
        "name": "Basic Algorithm",
        "prompt": "Write a Python function to check if a string is a palindrome, ignoring case and non-alphanumeric characters.",
        "language": "python",
        "complexity": "easy"
    },
    {
        "name": "Data Manipulation",
        "prompt": "Write a Python function to flatten a nested list of integers. Example: [[1,2],[3,[4,5]],6] -> [1,2,3,4,5,6]",
        "language": "python",
        "complexity": "medium"
    },
    {
        "name": "Simple Web Utility",
        "prompt": "Create a simple Python Flask API endpoint that returns the current server time.",
        "language": "python",
        "complexity": "medium"
    }
]

# Execute scenarios and collect results
print("--- Running Individual Code Generation Scenarios ---")
results = []
for scenario in test_scenarios:
    print(f"\n🔄 Running scenario: {scenario['name']}")

    # Generate code with enhanced tracking
    result = generate_code_with_context(
        prompt=scenario["prompt"],
        language=scenario["language"],
        complexity=scenario["complexity"],
        include_tests=True  # Always include tests for these examples
    )

    # Add scenario metadata to the result for easier filtering and comparison in Weave
    result["scenario_name"] = scenario["name"]
    result["scenario_complexity"] = scenario["complexity"]

    results.append(result)

    print(f"✅ Scenario completed in {result['response_time_seconds']:.2f}s")
    print(f"📊 Tokens used: {result['token_usage']['total_tokens']}")
    print(f"Generated code snippet:\n```python\n{result['generated_code'][:200]}...\n```")  # Print a snippet

# Demonstrate a more complex, multi-step agentic workflow
print("\n--- Running Multi-Step Agentic Workflow ---")
workflow_result = multi_step_coding_agent(
    "Develop a Python script that uses requests to fetch data from a public API (e.g., JSONPlaceholder /posts), processes it to count words in titles, and then writes the top 5 most common words and their counts to a CSV file."
)

print("\n✅ Multi-step workflow completed. Check your W&B Weave dashboard for details!")

Step 4: Analyzing results in W&B Weave

After running this script, you'll see a link in your console (look for the 🟢 or 🍩 emoji) directing you to your dashboard. Click this link to explore the rich observability data.
Key Features to Explore in the W&B Weave UI:
  1. Trace Visualization: Navigate to the "Traces" tab. You'll see individual runs for generate_code_with_context and a nested, multi-step trace for multi_step_coding_agent.
    • Click on a trace: See the complete execution flow, including inputs (prompts, parameters), outputs (generated code, evaluation JSON), and intermediate steps.
    • For multi_step_coding_agent, you'll observe nested calls to generate_code_with_context and evaluate_code_quality, providing a clear visualization of how Kimi K2 planned and executed the task. This nested view is incredibly powerful for understanding complex agentic behaviors.
  2. Performance Metrics: In the trace view, observe the response_time_seconds and token_usage for each operation. These metrics help you identify bottlenecks, understand the computational cost of different Kimi K2 interactions, and track efficiency over time.
  3. Input/Output Inspection: For every logged operation, you can inspect the exact prompt sent to Kimi K2 and the complete response received. This is crucial for debugging Kimi K2's outputs and refining your prompt engineering strategies.
  4. Custom Metadata: Notice how the language, complexity, include_tests, scenario_name, and scenario_complexity fields appear in your traces. These custom metadata fields make it easy to filter, group, and analyze runs in the Weights & Biases UI for targeted comparisons.
  5. Comparison Views: W&B Weave allows you to select multiple runs and compare them side-by-side. You can easily see how Kimi K2's output or performance varies with different temperature settings, max_tokens, or prompt variations. This A/B testing capability is invaluable for optimization.
  6. Error Analysis: If evaluate_code_quality encountered a JSONDecodeError (e.g., Kimi K2 failed to produce perfect JSON), you'll see it logged in the trace along with the raw_evaluation_response. This helps you quickly diagnose why Kimi K2 might not be adhering to a specific output format.
After visiting your Weave dashboard, you'll find a list of your traces, including those that failed, which helps you troubleshoot errors you might not have known existed:

You can dig further into individual traces to inspect their inputs and outputs, see how things are working and where they can be improved, and get not just explainability but observability.


Step 5: (Optional) Custom evaluation and feedback loop

While Weave automatically logs a lot, you can further enhance your observability by explicitly logging custom aggregate metrics or evaluation results. This is useful for building dashboards that track long-term performance or specific KPIs.
@weave.op()
def batch_code_generation_with_metrics(prompts: List[str]) -> Dict[str, Any]:
    """
    Processes multiple prompts in a batch and collects comprehensive aggregate metrics.
    """

    batch_results = []
    total_tokens_sum = 0
    total_time_sum = 0
    successful_requests_count = 0

    for i, prompt in enumerate(prompts):
        print(f" Processing batch item {i+1}/{len(prompts)}")
        try:
            result = generate_code_with_context(
                prompt=prompt,
                language="python",
                complexity="medium"
            )

            batch_results.append({
                "prompt_index": i,
                "prompt": prompt,
                "success": True,
                "result": result
            })

            total_tokens_sum += result["token_usage"]["total_tokens"]
            total_time_sum += result["response_time_seconds"]
            successful_requests_count += 1

        except Exception as e:
            # Weave will log the exception details automatically
            print(f" Error processing prompt {i}: {e}")
            batch_results.append({
                "prompt_index": i,
                "prompt": prompt,
                "success": False,
                "error": str(e)
            })

    # Calculate aggregate metrics for the batch
    avg_tokens_per_request = total_tokens_sum / len(prompts) if prompts else 0
    avg_response_time = total_time_sum / successful_requests_count if successful_requests_count > 0 else 0
    success_rate = successful_requests_count / len(prompts) if prompts else 0

    return {
        "batch_processing_summary": {
            "total_requests": len(prompts),
            "successful_requests": successful_requests_count,
            "success_rate": success_rate,
            "total_tokens_generated": total_tokens_sum,
            "avg_tokens_per_request": avg_tokens_per_request,
            "avg_response_time_seconds": avg_response_time,
            "total_processing_time_seconds": total_time_sum
        },
        "individual_batch_results": batch_results  # Can be used to drill down into individual failures
    }

# Run batch processing
batch_prompts = [
    "Write a Python function to reverse a linked list.",
    "Create a class for a simple calculator with add, subtract, multiply, and divide methods.",
    "Implement the bubble sort algorithm in Python.",
    "Write a Python function to validate if a string is a valid email address.",
    "Create a Python decorator for timing function execution and printing the duration."
]

print("\n--- Running Batch Processing with Aggregate Metrics ---")
batch_summary_result = batch_code_generation_with_metrics(batch_prompts)

# Display summary metrics in the console; they'll also be logged by Weave
metrics = batch_summary_result["batch_processing_summary"]
print(f"""
📈 Batch Processing Summary:
- Total Requests: {metrics['total_requests']}
- Successful Requests: {metrics['successful_requests']}
- Success Rate: {metrics['success_rate']:.2%}
- Average Response Time: {metrics['avg_response_time_seconds']:.2f}s
- Average Tokens per Request: {metrics['avg_tokens_per_request']:.0f}
- Total Processing Time for Batch: {metrics['total_processing_time_seconds']:.2f}s
""")

# Example of how you might create a dataset of evaluations, if you were manually
# reviewing and rating generated code (human-in-the-loop) or had a more
# sophisticated automated testing setup.
@weave.op()
def create_human_evaluation_dataset(generated_codes: List[Dict[str, Any]]) -> Dict[str, Any]:
    """
    Placeholder for creating a dataset of human or automated code evaluations.
    This demonstrates how to log a 'dataset' artifact to W&B for further analysis.
    """
    # In a real scenario, you might have a process where human reviewers
    # score the generated_codes and then you log those scores.
    # For this example, we'll just prepare a structure.
    evaluation_entries = []
    for i, code_info in enumerate(generated_codes):
        # Simulate some evaluation here, or integrate with an external testing framework
        simulated_score = 7 + (i % 3)  # Dummy score for demonstration
        evaluation_entries.append({
            "sample_id": i,
            "prompt": code_info.get("prompt", "N/A"),
            "generated_code": code_info["generated_code"],
            "simulated_human_score": simulated_score,
            "model_response_time": code_info["response_time_seconds"]
        })

    # Weave automatically logs the return value of @weave.op functions.
    # This dictionary will appear as a logged object in the trace.
    return {
        "evaluation_dataset_summary": {
            "dataset_size": len(evaluation_entries),
            "timestamp": time.time(),
            "average_simulated_score": sum([e["simulated_human_score"] for e in evaluation_entries]) / len(evaluation_entries) if evaluation_entries else 0
        },
        "evaluation_details": evaluation_entries
    }

# Create evaluation dataset from our generated code (using dummy scores for demonstration)
print("\n--- Creating Simulated Evaluation Dataset ---")
evaluation_dataset_output = create_human_evaluation_dataset([r for r in results if r.get("generated_code")])
print(f"📋 Created a simulated evaluation dataset with {evaluation_dataset_output['evaluation_dataset_summary']['dataset_size']} samples.")
print(f"Average simulated score: {evaluation_dataset_output['evaluation_dataset_summary']['average_simulated_score']:.2f}")

print("\nTutorial execution complete. Check your W&B Weave dashboard for detailed observability!")
Producing something like:


Benefits of this enhanced approach

By combining Kimi K2's capabilities with W&B Weave's robust observability, you gain:
  1. Comprehensive Tracking: Every interaction with Kimi K2, including inputs, outputs, and internal parameters, is automatically logged and easily accessible.
  2. Deep Agentic Visibility: For multi-step tasks, Weave visualizes the entire chain of thought and execution, allowing you to trace Kimi K2's decisions and tool uses. This is critical for understanding and debugging complex autonomous agents.
  3. Performance Analytics: Effortlessly track and analyze key metrics like response times, token usage, and success rates across different prompts or model configurations. Identify bottlenecks and cost drivers.
  4. Quality Assurance & Feedback Loops: By instrumenting evaluation steps (like evaluate_code_quality), you can build continuous feedback loops, monitor the quality of generated outputs, and pinpoint areas for prompt engineering or model improvement.
  5. Scalable Monitoring: Whether you're running a few tests or a high-volume production service, Weave scales to capture all your LLM interactions, providing a single source of truth for your AI application's behavior.
  6. Custom Metrics & Metadata: Beyond default metrics, you can log any custom information relevant to your application (e.g., complexity, scenario_name), enabling highly specific filtering and analysis in the W&B UI.
  7. Faster Debugging and Iteration: With detailed traces and clear input/output logs, you can quickly identify the root cause of issues, experiment with prompt variations, and iterate on your Kimi K2-powered application with confidence.
This tutorial provides a robust foundation for building, monitoring, and continuously improving Kimi K2-powered applications. Leverage the insights gained from W&B Weave to unlock the full potential of Kimi K2 for your coding, reasoning, and agentic tasks. Happy coding (and observing)!

Advanced use cases and customization

Kimi K2’s capabilities open the door to a range of advanced use cases, especially in areas requiring complex reasoning or multi-step actions. Let’s explore a few prominent scenarios and how developers can customize the model for their needs:
  • Competitive Programming and Code Assistant: Kimi K2 is particularly strong in coding tasks. It nails competitive coding problems and real-world bug fixing, outperforming many proprietary models on coding benchmarks. For instance, it can take a description of an algorithm problem and generate a correct, efficient solution in code. This makes it an excellent coding assistant; you can integrate Kimi K2 into IDEs or code review tools to suggest fixes or improvements. Because it has a deep understanding of code, you might use it to analyze a codebase for bugs or even translate code (e.g., convert a Python script to Rust, as the Moonshot demo did). To tailor it to your needs, you could fine-tune Kimi K2-Base on your company’s code style or domain-specific API usage. This would give it expertise in, say, your internal frameworks, making its suggestions even more relevant.
  • Math and Logical Reasoning: Thanks to its training and MoE design, Kimi K2 crushes math and STEM tasks like AIME and MATH-500, demonstrating genuine problem-solving rather than just memorizing answers. This opens use cases in education (solving or tutoring math problems), data analysis, or scientific research assistance. A researcher could use Kimi K2 to derive formulas or analyze experimental data by having it generate code to perform calculations. Customizing here might involve providing the model with specialized tool access (for example, a calculator or algebra system) and letting it figure out how to use them – something K2 is designed to do.
  • Autonomous Agents and Tool Use: Perhaps the most exciting aspect of Kimi K2 is its agentic capability. It was trained to simulate thousands of tool-use tasks and can orchestrate complex sequences of actions. In practice, this means you can use Kimi K2 as the brain of an autonomous agent. For example, you could connect it to a suite of tools: web browsers, databases, shell commands, etc. Give Kimi K2 a high-level goal (like “Gather the latest stock prices and plot a graph of their weekly trend”) and it can break this down: search the web for data, call an API to get prices, use a plotting library via a tool, and return an image or report. Moonshot’s demos included things like planning a travel itinerary by actually using a flight search tool and booking information. Kimi K2 achieved top-tier results on agent benchmarks such as Tau2-Bench and AceBench, which test an AI’s ability to use tools effectively. For developers, using K2 in this way might involve setting up a tool registry (as shown in the GitHub tool-calling example) and simply allowing the model to invoke functions (a minimal sketch of this pattern follows this list). You can customize which tools it has (to constrain its actions for safety) and even write a “critic” that evaluates its decisions (Moonshot mentions a self-critique mechanism with rubrics). With open access to K2, you could integrate it into frameworks like LangChain or custom agent loops, modifying its prompt or adding guardrails as you see fit.
  • Customization and Fine-Tuning: Because Kimi K2-Base is available, you’re free to fine-tune it on any domain-specific data. This could mean training a biomedical version of Kimi, a legal assistant that knows a country’s laws, or a gaming AI that understands the lore of a fantasy world. Fine-tuning a model of this size is non-trivial (you’d likely use techniques like Low-Rank Adaptation (LoRA) to adapt it with fewer resources), but the possibility is there. Even without full fine-tuning, you can customize behavior heavily via prompting. By crafting a detailed system prompt or few-shot examples, you can make Kimi K2 adopt a certain persona or follow specific formats. For example, you might make it an expert SQL query generator by providing a few examples in the prompt and a role like “You are an expert data analyst.”
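To make the tool-use pattern from the agent bullet above concrete, here is a minimal, hypothetical sketch using the OpenAI-style tools parameter with the same OpenRouter client from the tutorial. The get_stock_price function and its schema are invented for illustration; a real agent loop would execute the returned tool call and send the result back to the model as a "tool" message:
# Hypothetical tool definition - Kimi K2 decides whether and how to call it.
tools = [{
    "type": "function",
    "function": {
        "name": "get_stock_price",
        "description": "Look up the latest price for a stock ticker symbol.",
        "parameters": {
            "type": "object",
            "properties": {"ticker": {"type": "string"}},
            "required": ["ticker"],
        },
    },
}]

response = client.chat.completions.create(
    model=KIMI_K2_MODEL_NAME,
    messages=[{"role": "user", "content": "What is NVDA trading at right now?"}],
    tools=tools,
)

tool_calls = response.choices[0].message.tool_calls
if tool_calls:
    # In a full agent loop you would run the function, append the result as a
    # {"role": "tool", ...} message, and call the model again for the final answer.
    print(tool_calls[0].function.name, tool_calls[0].function.arguments)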
One compelling example of customization is combining Kimi K2 with other open-source tools to create a hybrid system. For example, Claude Code Router allows developers to route code-related queries to different models. By plugging Kimi K2 into this router, alongside models like Anthropic’s Claude, you can have a setup where your application chooses the best model per request. Some have reported that using Kimi K2 via Claude Code Router yields an amazing coding experience, essentially getting Claude’s reliable instruction following with Kimi’s coding prowess. This speaks to a larger benefit of open models: you can integrate them in creative ways. You could even have an ensemble where Kimi K2 handles coding questions, another model handles general chat, etc., all coordinated by your custom logic.
The open-source nature of Kimi K2 encourages a broader range of uses. Developers are not limited by strict terms, so we’re seeing Kimi K2 pop up in experimental projects. For instance, someone might embed Kimi K2 into a robot for planning tasks, or use it in a Slack bot that not only answers questions but executes commands on request. Each of these advanced use cases is aided by the fact that you can inspect and modify how Kimi works.
If something goes wrong, you’re not dealing with a black box; you can dig into logs (especially with tools like Weave) or even adjust the model if you have the skill. This transparency and modifiability are accelerating innovation and “democratizing” what can be built with AI. Moonshot’s release of Kimi K2 has implications for the future of agentic AI – it shows that top-tier capabilities need not be locked behind closed APIs, and that a community-driven approach can push the envelope in AI functionality. We encourage developers to experiment with Kimi K2, share their custom use cases, and contribute back improvements or insights. With models like this, we might see an explosion of creative AI applications tailored to every niche imaginable.

Conclusion

In this guide, we've shown how to unlock the power of Kimi K2 – from understanding its trillion-parameter architecture and standout performance, to deploying it via API and enhancing it with W&B Weave observability. The key takeaways are that Kimi K2 is a state-of-the-art open model excelling in coding, reasoning, and autonomous task execution, and it’s available for everyone to use and adapt. By walking through a W&B Weave tutorial, we demonstrated how you can monitor and debug Kimi K2’s behavior in practice, which is vital for building reliable AI applications.
The combination of Kimi K2’s capabilities and Weave’s observability tools gives developers an unprecedented level of control and insight. You can build complex AI-driven systems (like coding assistants or autonomous agents) with Kimi K2 at their core, and use Weave to ensure those systems are performing as expected – seeing each decision the model makes and each action it takes. This empowers you to iterate quickly and confidently, knowing that you have transparency into the AI’s workings.
Kimi K2 represents more than just another language model release; it signals a shift towards more open, agentic AI that developers can truly own and innovate on. Whether you’re interested in its raw coding prowess, its potential to automate multi-step tasks, or you want to fine-tune it to create a unique AI service, Kimi K2 provides a robust foundation. We encourage you to explore it further: try out the model on your own problems, set up Weave to gather insights, and engage with the community of Kimi K2 users. With tools like these at your disposal, the future of building powerful, observable AI applications is incredibly bright – and in your hands.

Sources:

  1. MoonshotAI – Kimi K2 GitHub README 
  2. Hugging Face Community – Use Kimi K2 with Claude Code Router 
  3. W&B Weave Documentation – Tracking LLM Inputs & Outputs 

Iterate on AI agents and models faster. Try Weights & Biases today.