W&B Weave

Deliver AI with confidence

Evaluate, monitor, and iterate on AI applications. Get started with one line of code.

1 2 3 4

import weave
weave.init("quickstart")
@weave.op()
def llm_app(prompt):

Keep an eye on your AI

Improve quality, cost, latency, and safety

Weave works with any LLM and framework and comes with a ton of integrations out of the box

Quality

Accuracy, robustness, relevancy

Cost

Token usage and estimated cost

Latency

Track response times and bottlenecks

Safety

Protect your end users using guardrails

Evaluations

Measure and iterate

Visual comparisons

Use powerful visualizations for objective, precise comparisons

Automatic versioning

Save versions of your datasets, code, and scorers

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18

import openai, weave
weave.init("weave-intro")

@weave.op
def correct_grammar(user_input):
    client = openai.OpenAI()
    response = client.chat.completions.create(
        model="o1-mini",
        messages=[{
            "role": "user", 
            "content": "Correct the grammar:\n\n" + 
            user_input,
        }],
    )
    return response.choices[0].message.content.strip()

result = correct_grammar("That was peace of cake!")
print(result)

Playground

Iterate on prompts in an interactive chat interface with any LLM

Leaderboards

Group evaluations into leaderboards featuring the best performers and share across your organization

Tracing and monitoring

Log everything for production monitoring and debugging

Debugging with trace trees

Weave organizes logs into an easy to navigate trace tree so you can identify issues

Multimodality

Track any modality—text, code, documents, image, and audio. Other modalities coming soon

Easily work with long form text

View large strings like documents, emails, HTML, and code in their original format

The blog post "Creating a LLM-as-a-Judge That Drives Business Results" by Hamel Husain provides a detailed guide on implementing large language models (LLMs) as judges for AI evaluation. It outlines a seven-step process called "Critique Shadowing" to improve AI systems. Key points include: 1. Identifying a Principal Domain Expert: Engage key individuals with domain expertise early to ensure evaluations align with user needs and standards. 2. Creating a Diverse Dataset: Build datasets that reflect diverse interactions the AI will encounter, using both real and synthetic data. 3. Pass/Fail Judgments with Critiques: Use simple binary judgments and detailed critiques to evaluate AI outputs, avoiding complex scoring systems. 4. Fixing Errors: Prioritize resolving obvious errors before building an LLM judge. 5. Iterative LLM Judge Development: Use expert examples to refine LLM prompts iteratively, aiming for high agreement between the LLM judge and the domain expert. 6. Performing Error Analysis: Analyze errors to identify root causes and improve AI performance. 7. Creating Specialized LLM Judges: Develop targeted judges for specific issues after completing the critique shadowing process. The post emphasizes the importance of focusing on data analysis and critiques over mere metric collection, and it provides insights into optimizing LLM prompts and handling potential pitfalls. Additionally, it discusses the value of involving human expertise in the evaluation process to ensure AI alignment with business goals.

Creating a LLM-as-a-Judge That Drives Business Results

By Hamel Husain

A detailed guide on implementing large language models (LLMs) as judges for AI evaluation, featuring a seven-step "Critique Shadowing" process.

Seven-Step Process

1. Identifying a Principal Domain Expert

Engage key individuals with domain expertise early
Ensure evaluations align with user needs and standards

2. Creating a Diverse Dataset

Build comprehensive datasets reflecting diverse interactions
Include both real and synthetic data

3. Pass/Fail Judgments with Critiques

Implement simple binary judgments
Include detailed critiques for evaluation
Avoid complex scoring systems

4. Fixing Errors

Prioritize resolving obvious errors
Address issues before LLM judge implementation

5. Iterative LLM Judge Development

Use expert examples to refine LLM prompts
Aim for high expert-LLM agreement

6. Performing Error Analysis

Analyze errors to identify root causes
Improve AI performance based on findings

7. Creating Specialized LLM Judges

Develop targeted judges for specific issues
Implement after critique shadowing completion

12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758

{
    "title": "Creating a LLM-as-a-Judge That Drives Business Results",
    "author": "Hamel Husain",
    "description": "A detailed guide on implementing LLMs as judges for AI evaluation",
    "process": "Critique Shadowing",
    "steps": {
        "1_domain_expert": {
            "title": "Identifying a Principal Domain Expert",
            "key_points": [
                "Engage domain experts early",
                "Align evaluations with user needs"
            ]
        },
        "2_dataset": {
            "title": "Creating a Diverse Dataset",
            "key_points": [
                "Build comprehensive datasets",
                "Include real and synthetic data"
            ]
        },
        "3_judgments": {
            "title": "Pass/Fail Judgments with Critiques",
            "key_points": [
                "Use binary judgments",
                "Include detailed critiques",
                "Avoid complex scoring"
            ]
        },
        "4_errors": {
            "title": "Fixing Errors",
            "key_points": [
                "Prioritize obvious errors",
                "Fix before LLM judge implementation"
            ]
        },
        "5_development": {
            "title": "Iterative LLM Judge Development",
            "key_points": [
                "Refine prompts with expert examples",
                "Maximize expert-LLM agreement"
            ]
        },
        "6_analysis": {
            "title": "Performing Error Analysis",
            "key_points": [
                "Identify root causes",
                "Improve AI performance"
            ]
        },
        "7_specialized": {
            "title": "Creating Specialized LLM Judges",
            "key_points": [
                "Develop targeted judges",
                "Implement post-critique shadowing"
            ]
        }
    }
}

Online evaluations

Score live incoming production traces for monitoring without impacting performance

Agents

Observability and governance tools for agentic systems

Build state-of-the-art agents

Supercharge your iteration speed and top the charts

Agent framework and protocol agnostic

Integrates with leading agent frameworks such as OpenAI Agents SDK and protocols such as MCP

agent.py

1 2 3 4 5 6 7 8 9 10 11

import weave
from openai import OpenAI

weave.init("agent-example")

@weave.op()
def my_agent(query: str):
    client = OpenAI()
    response = client.chat.completions.create(...)
    return response

my_agent("What is the weather?")

Trace trees purpose-built for agentic systems

Easily visualize agents rollouts to pinpoint issues and improvements

Scoring

Use our scorers or bring your own

Pre-built scorers

Jumpstart your evals with out-of-box scorers built by our experts

Toxicity

Hallucinations

Content Relevance

Write your own scorers

Near-infinite flexibility to build custom scoring functions to suit your business

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46

import weave, openai

llm_client = openai.OpenAI()

@weave.op()
def evaluate_output(generated_text, reference_text):
    """
    Evaluates AI-generated text against a reference answer.
    
    Args:
        generated_text: The text generated by the model
        reference_text: The reference text to compare against
        
    Returns:
        float: A score between 0-10
    """
    system_prompt = """You are an expert evaluator of AI outputs.
    Your job is to rate AI-generated text on a scale of 0-10.
    Base your rating on how well the generated text matches 
    the reference text in terms of factual accuracy,
    comprehensiveness, and conciseness."""
    
    user_prompt = f"""Reference: {reference_text}
    
    AI Output: {generated_text}
    
    Rate this output from 0-10:"""
    
    response = llm_client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        temperature=0.2
    )
    
    # Extract the score from the response
    score_text = response.choices[0].message.content
    # Parse score (assuming it returns a number between 0-10)
    try:
        score = float(score_text.strip())
        return min(max(score, 0), 10)  # Clamp between 0-10
    except:
        # Fallback score if parsing fails
        return 5.0