Evaluations

Measure and iterate on AI application and agent performance

Iterate on your AI applications by experimenting with LLMs, prompts, RAG, agents, fine-tuning, and guardrails. Use Weave’s evaluation framework and scoring tools to measure the impact of improvements across multiple dimensions—including accuracy, latency, cost, and user experience. Centrally track evaluation results and lineage for reproducibility, sharing, and rapid iteration.

Flexible evaluation framework

Weave evaluations combine a test dataset with a set of scorers, making them flexible to work with. Weave aggregates the scores for each evaluation, allowing you to compare different evaluations side by side. You can also drill down into individual examples within an evaluation to understand where your prompt or model selection needs improvement.

Learn more

AI system of record

Centrally track all evaluation data to enable reproducibility, collaboration, and governance. Trace lineage back to LLMs used in your application to make continuous improvements. Weave automatically versions your code, datasets, and scorers by tracking changes between experiments, enabling you to pinpoint performance drivers across evaluations.

Powerful visual comparisons

Use powerful visualizations for objective, precise comparisons between evaluations.

Pre-built scorers or bring your own

Weave comes out of the box with a slew of industry-standard scorers. We also make it easy to define your own.

Leaderboards

Aggregate evaluations into leaderboards featuring the best performers and share them across your organization.

Online evaluations

Run evaluations on live incoming production traces, which is useful when you don’t have a curated evaluation dataset.

Get started with AI evaluations

The Weights & Biases end-to-end AI developer platform

Weave

Traces

Debug agents and AI applications

Evaluations

Rigorous evaluations of agentic AI systems

Playground

Explore prompts
and models

Agents

Observability tools for agentic systems

Guardrails

Block prompt attacks and harmful outputs

Monitors

Continuously improve in prod

Models

Experiments

Track and visualize your ML experiments

Sweeps

Optimize your hyperparameters

Tables

Visualize and explore your ML data

Core

Inference

Explore hosted, open-source LLMs

Registry

Publish and share your AI models and datasets

Artifacts

Version and manage your AI pipelines

Reports

Document and share your AI insights

SDK

Log AI experiments and artifacts at scale

Automations

Trigger workflows automatically

Evaluations

Measure and iterate on AI application and agent performance

Flexible evaluation framework

AI system of record

Powerful visual comparisons

Pre-built scorers or bring your own

Leaderboards

Online evaluations

Get started with AI evaluations

The Weights & Biases end-to-end AI developer platform

The Weights & Biases platform helps you streamline your workflow from end to end

Models

Experiments

Sweeps

Registry

Automations

Weave

Traces

Evaluations

Core

Artifacts

Tables

Reports

SDK

The Platform

Article

Resources

Company