Deliver AI with confidence

Evaluate, monitor, and iterate on AI applications. Get started with one line of code.
img--dashboard
import weave
weave.init("quickstart")

@weave.op()
def llm_app(prompt):
    ...  # calls to this op are traced automatically
Keep an eye on your AI

Improve quality, cost, latency, and safety

Weave works with any LLM and framework, and comes with a wide range of integrations out of the box

img--quality

Quality

Accuracy, robustness, relevancy

img--cost

Cost

Token usage and estimated cost

img--latency

Latency

Track response times and bottlenecks

img--safety

Safety

Protect your end users with guardrails

Anthropic
Cohere
Groq
EvalForge
LangChain
OpenAI
Together
LlamaIndex
Mistral AI
Crew AI
OpenTelemetry
MCP
Evaluations

Measure and iterate

Visual comparisons

Use powerful visualizations for objective, precise comparisons

img--visual-comparison--1
img--visual-comparison--2
img--visual-comparison--3
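
The comparison views are driven by evaluation runs. A minimal sketch of running the same weave.Evaluation over two candidate models (the dataset rows, stand-in models, and scorer are illustrative):

import asyncio
import weave

weave.init("weave-intro")

# Illustrative dataset; swap in your own rows
dataset = [
    {"question": "What is the capital of France?", "expected": "Paris"},
    {"question": "What is 2 + 2?", "expected": "4"},
]

@weave.op()
def exact_match(expected: str, output: str) -> dict:
    # Weave passes dataset columns by name and the model result as `output`
    return {"correct": expected.lower() in str(output).lower()}

@weave.op()
def model_a(question: str) -> str:
    return "Paris" if "France" in question else "4"  # stand-in for an LLM call

@weave.op()
def model_b(question: str) -> str:
    return "I'm not sure."  # stand-in for a second candidate

evaluation = weave.Evaluation(dataset=dataset, scorers=[exact_match])

# Each run appears in Weave, ready for side-by-side comparison
asyncio.run(evaluation.evaluate(model_a))
asyncio.run(evaluation.evaluate(model_b))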

Automatic versioning

Save versions of your datasets, code, and scorers

import openai, weave
weave.init("weave-intro")

@weave.op
def correct_grammar(user_input):
    client = openai.OpenAI()
    response = client.chat.completions.create(
        model="o1-mini",
        messages=[{
            "role": "user", 
            "content": "Correct the grammar:\n\n" + 
            user_input,
        }],
    )
    return response.choices[0].message.content.strip()

result = correct_grammar("That was peace of cake!")
print(result)
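
Datasets and scorers are versioned the same way when you publish them; a minimal sketch using weave.Dataset (the dataset name and rows are illustrative):

import weave

weave.init("weave-intro")

# Publishing records a new version whenever the contents change
dataset = weave.Dataset(
    name="grammar-examples",
    rows=[
        {"user_input": "That was peace of cake!", "expected": "That was a piece of cake!"},
        {"user_input": "He go to school.", "expected": "He goes to school."},
    ],
)
weave.publish(dataset)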

Playground

Iterate on prompts in an interactive chat interface with any LLM

img--playground

Leaderboards

Group evaluations into leaderboards that highlight the best performers, and share them across your organization

img--leaderboards
Tracing and monitoring

Log everything for production monitoring and debugging

Debugging with trace trees

Weave organizes logs into an easy-to-navigate trace tree so you can identify issues

Trace tree
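
The tree structure comes from nesting weave.op-decorated functions: each child call is recorded under its parent. A minimal sketch (the pipeline and step names are illustrative):

import weave

weave.init("weave-intro")

@weave.op()
def retrieve_context(query: str) -> str:
    return "Weave nests traced calls."  # stand-in for a retrieval step

@weave.op()
def generate_answer(query: str, context: str) -> str:
    return f"Answer to {query!r} using: {context}"  # stand-in for an LLM call

@weave.op()
def rag_pipeline(query: str) -> str:
    # Both child calls appear nested under rag_pipeline in the trace tree
    context = retrieve_context(query)
    return generate_answer(query, context)

rag_pipeline("How does Weave build trace trees?")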

Multimodality

Track any modality: text, code, documents, images, and audio. More modalities are coming soon; see the sketch below

icon--camera
icon--image
icon--sound
icon--text
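
Media values returned from traced functions are captured alongside text; a minimal sketch that logs an image (assumes Pillow is installed, and the generated image is just a placeholder):

import weave
from PIL import Image

weave.init("weave-intro")

@weave.op()
def render_thumbnail(color: str) -> Image.Image:
    # PIL images returned from an op are logged and viewable in the trace
    return Image.new("RGB", (64, 64), color=color)

render_thumbnail("steelblue")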

Easily work with long-form text

View large strings like documents, emails, HTML, and code in their original format

Online evaluations (coming soon)

Score live incoming production traces for monitoring without impacting performance

Online evaluations
Agents

Observability and governance tools for agentic systems

Build state-of-the-art agents

Supercharge your iteration speed and top the charts

Agent leaderboard

Agent framework- and protocol-agnostic

Integrates with leading agent frameworks such as OpenAI Agents SDK and protocols such as MCP

agent.py
import weave
from openai import OpenAI

weave.init("agent-example")

@weave.op()
def my_agent(query: str):
    client = OpenAI()
    # Minimal example call; the model name is illustrative
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": query}],
    )
    return response.choices[0].message.content

my_agent("What is the weather?")

Trace trees purpose-built for agentic systems

Easily visualize agent rollouts to pinpoint issues and improvements

Agent trace trees
Scoring

Use our scorers or bring your own

Pre-built scorers

Jumpstart your evals with out-of-the-box scorers built by our experts

Toxicity
Hallucinations
Content Relevance

Write your own scorers

Near-infinite flexibility to build custom scoring functions to suit your business

import weave, openai

llm_client = openai.OpenAI()

@weave.op()
def evaluate_output(generated_text, reference_text):
    """
    Evaluates AI-generated text against a reference answer.
    
    Args:
        generated_text: The text generated by the model
        reference_text: The reference text to compare against
        
    Returns:
        float: A score between 0-10
    """
    system_prompt = """You are an expert evaluator of AI outputs.
    Your job is to rate AI-generated text on a scale of 0-10.
    Base your rating on how well the generated text matches 
    the reference text in terms of factual accuracy,
    comprehensiveness, and conciseness."""
    
    user_prompt = f"""Reference: {reference_text}
    
    AI Output: {generated_text}
    
    Rate this output from 0-10:"""
    
    response = llm_client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        temperature=0.2
    )
    
    # Extract the score from the response
    score_text = response.choices[0].message.content
    # Parse score (assuming it returns a number between 0-10)
    try:
        score = float(score_text.strip())
        return min(max(score, 0), 10)  # Clamp between 0-10
    except ValueError:
        # Fallback score if parsing fails
        return 5.0
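
A custom scorer like this plugs straight into an evaluation. A minimal sketch of wiring it up (the dataset rows, stand-in model, and wrapper are illustrative; recent Weave versions pass the model result to scorers as `output`):

import asyncio
import weave

weave.init("weave-intro")

dataset = [
    {"prompt": "What is the capital of France?",
     "reference_text": "Paris is the capital of France."},
]

@weave.op()
def reference_score(reference_text: str, output: str) -> dict:
    # Wrap the LLM judge above; Weave matches dataset columns by name
    return {"score": evaluate_output(output, reference_text)}

@weave.op()
def answer(prompt: str) -> str:
    return "Paris is the capital of France."  # stand-in for your model

evaluation = weave.Evaluation(dataset=dataset, scorers=[reference_score])
asyncio.run(evaluation.evaluate(answer))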

Human feedback

Collect user and expert feedback for real-life testing and evaluation

Human feedback

Third-party scorers

Plug and play off-the-shelf scoring functions from other vendors

RAGAS
EvalForge
LangChain
LlamaIndex
HEMM