Preview

Monitors

Score production traces using custom LLM judges

Score production traces in real time and continuously track AI application and agent performance with Weave Online Evaluations.

Create LLM judges that give you total control over online evaluations so you can catch issues instantly and maintain quality over time.

Continuously monitor your AI with online evaluations

LLMs are powerful but can be unpredictable, causing unexpected issues when they’re live. Weave’s online evaluations give you real-time visibility into your AI application’s real-world performance, helping you quickly catch and resolve issues.

Use our built-in LLM judges or configure your own to grade your application’s performance instantly across standard or custom metrics. Available through the Weave UI or SDK.
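
For the SDK path, a custom judge is just a scorer class. Below is a minimal sketch, assuming an OpenAI-backed judge; the project name, judge model, prompt, and PASS/FAIL convention are illustrative placeholders, not Weave's built-in judges.

```python
import weave
from weave import Scorer
from openai import OpenAI

weave.init("my-team/my-project")  # placeholder project name


class PolitenessJudge(Scorer):
    """Illustrative custom LLM judge: asks a model to grade an output's tone."""

    judge_model: str = "gpt-4o-mini"  # placeholder judge model

    @weave.op
    def score(self, output: str) -> dict:
        client = OpenAI()
        response = client.chat.completions.create(
            model=self.judge_model,
            messages=[
                {"role": "system", "content": "Reply with exactly PASS or FAIL."},
                {"role": "user", "content": f"Is this response polite?\n\n{output}"},
            ],
        )
        verdict = (response.choices[0].message.content or "").strip().upper()
        return {"polite": verdict == "PASS"}
```

The same judge can then be referenced when you configure a monitor, or applied to logged calls directly from code.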

Home in on the most relevant traces

Scoring every trace is often unnecessary and inefficient. Use random sampling and custom filters to choose precisely which traces your online evaluations run on, so you score only the calls that matter and keep your evaluations focused and efficient.
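
Monitors let you set the sampling rate and filters directly in the UI. As a rough illustration of the same idea from the SDK side, the sketch below scores only a random sample of calls that match a simple filter; the op, filter, and sampling rate are hypothetical, and it assumes the op.call / call.apply_scorer pattern from the Weave Python SDK.

```python
import asyncio
import random

import weave
from weave import Scorer

weave.init("my-team/my-project")  # placeholder project name

SAMPLE_RATE = 0.1  # score roughly 10% of the calls that pass the filter


class LengthScorer(Scorer):
    """Tiny illustrative scorer; an LLM judge could be used instead."""

    @weave.op
    def score(self, output: str) -> dict:
        return {"too_long": len(output) > 1_000}


@weave.op
def answer_question(question: str) -> str:
    return "placeholder answer"  # stand-in for your application logic


async def handle(question: str) -> str:
    # op.call() returns the result plus the logged Call object.
    result, call = answer_question.call(question)

    # Custom filter plus random sampling: only score refund-related
    # questions, and only a fraction of those.
    if "refund" in question.lower() and random.random() < SAMPLE_RATE:
        await call.apply_scorer(LengthScorer())

    return result


if __name__ == "__main__":
    asyncio.run(handle("Can I get a refund after 60 days?"))
```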

Build a data flywheel

Evaluation datasets are hard to build: you cannot predict every scenario during testing, yet your evaluations must reflect what happens in production. Use online evaluations to flag production examples worth adding to your evaluation datasets.
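
For example, once an online evaluation has flagged low-scoring calls, you might copy them into a versioned Weave dataset for later offline evaluations. A minimal sketch, assuming the flagged rows have already been collected; the project name, dataset name, and rows are placeholders.

```python
import weave

weave.init("my-team/my-project")  # placeholder project name

# Hypothetical examples flagged by an online evaluation (e.g. calls the
# LLM judge scored poorly), copied here as plain rows.
flagged_rows = [
    {"question": "Can I get a refund after 60 days?",
     "notes": "Judge flagged the answer as impolite."},
    {"question": "Why was my order cancelled?",
     "notes": "Judge flagged the answer as unhelpful."},
]

# Version the flagged examples as a Weave dataset for future evaluations.
dataset = weave.Dataset(name="hard-production-cases", rows=flagged_rows)
weave.publish(dataset)
```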

Keep an eye on trends over time

Your environment can change over time, causing issues like data drift that you need to spot and fix quickly before they impact your users. Monitors track metrics over time, letting you compare snapshots, catch trends early, proactively solve issues, and maintain quality.

Empower non-technical users to monitor applications

With Weave, product managers and domain experts—even those without coding experience—can directly create LLM judges and monitor their applications. Weave’s intuitive UI simplifies specifying prompts, LLM configurations, and sampling filters, enabling anyone to quickly identify and investigate critical issues.

Reduce friction to get started

Mixing evaluation with application code introduces dependencies that slow down development and add to latency.

Weave’s Online Evaluations run in the Weights & Biases environment, decoupling scoring from your application code: no extra integration steps, no deployment hassles, no blocking calls, and no added latency for end users.

The Weights & Biases end-to-end AI developer platform

Weave
Traces: Debug agents and AI applications
Evaluations: Rigorous evaluations of agentic AI systems
Playground: Explore prompts and models
Agents: Observability tools for agentic systems
Guardrails: Block prompt attacks and harmful outputs
Monitors: Continuously improve in prod

Models
Experiments: Track and visualize your ML experiments
Sweeps: Optimize your hyperparameters
Tables: Visualize and explore your ML data

Core
Registry: Publish and share your AI models and datasets
Artifacts: Version and manage your AI pipelines
Reports: Document and share your AI insights
SDK: Log AI experiments and artifacts at scale
Automations: Trigger workflows automatically
Inference: Explore hosted, open-source LLMs


Set up your first monitor