Preview
Monitors
Score production traces using custom LLM judges
Score production traces in real time and continuously track AI applications and agent performance with Weave Online Evaluations.
Create LLM judges that give you total control over online evaluations so you can catch issues instantly and maintain quality over time.


Continuously monitor your AI with online evaluations
LLMs are powerful but can be unpredictable, causing unexpected issues when they’re live. Weave’s online evaluations give you real-time visibility into your AI application’s real-world performance, helping you quickly catch and resolve issues.
Use our built-in LLM judges or configure your own to grade your application’s performance instantly across standard or custom metrics. Available through the Weave UI or SDK.
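For teams working in the SDK, here is a minimal sketch of a custom LLM judge defined as a Weave scorer. It assumes an OpenAI model as the judge; the project name, prompt, and metric names are placeholders you would replace with your own.

```python
import json
import weave
from weave import Scorer
from openai import OpenAI

weave.init("my-team/my-project")  # placeholder project name

class ToneJudge(Scorer):
    """Custom LLM judge: grades whether a response is polite and on-topic."""
    model_id: str = "gpt-4o-mini"  # judge model; swap in your own provider

    @weave.op
    def score(self, output: str) -> dict:
        client = OpenAI()
        judgment = client.chat.completions.create(
            model=self.model_id,
            messages=[
                {"role": "system",
                 "content": 'Return JSON {"polite": bool, "on_topic": bool}.'},
                {"role": "user", "content": f"Grade this response:\n{output}"},
            ],
            response_format={"type": "json_object"},
        )
        return json.loads(judgment.choices[0].message.content)
```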
Home in on the most relevant traces
Scoring every trace is often unnecessary and costly. Use random sampling and custom filters to choose precisely which traces your online evaluations run on, so you score only the calls that matter and keep your evaluations focused and efficient.
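Hosted monitors let you configure sampling rates and filters in the Weave UI; the sketch below only illustrates the same idea from the SDK side, scoring a random fraction of calls that match a simple filter. It reuses the hypothetical ToneJudge scorer from the sketch above, and the sample rate, filter, and function names are placeholders.

```python
import asyncio
import random
import weave

weave.init("my-team/my-project")  # placeholder project name

@weave.op
def answer_question(question: str) -> str:
    # Stand-in for your real application logic.
    return f"Here is an answer to: {question}"

SAMPLE_RATE = 0.1  # score roughly 10% of production calls

async def handle_request(question: str) -> str:
    # .call() returns both the result and the recorded trace call.
    result, call = answer_question.call(question)
    # Score only a random sample, and only calls that pass a simple filter.
    if random.random() < SAMPLE_RATE and len(question) > 20:
        await call.apply_scorer(ToneJudge())  # ToneJudge sketched above
    return result

asyncio.run(handle_request("What regions does the EU data residency option cover?"))
```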


Build a data flywheel
Evaluation datasets are hard to build since you cannot predict every scenario during testing, yet your evaluations must reflect what happens in production. Use online evaluations to flag production examples worth adding to your evaluation datasets.
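One way to close the loop, sketched below under the assumption that you have already collected flagged production examples: publish them as a Weave dataset and reuse them in an offline evaluation. The rows, app, and scorer here are illustrative placeholders.

```python
import asyncio
import weave

weave.init("my-team/my-project")  # placeholder project name

# Examples flagged by online evaluations as worth keeping; rows are illustrative.
flagged_rows = [
    {"question": "How do I rotate my API key?", "expected": "Settings > API keys"},
    {"question": "Can I export traces to CSV?", "expected": "Yes, from the Traces table"},
]

dataset = weave.Dataset(name="production-flagged-examples", rows=flagged_rows)
weave.publish(dataset)  # versioned alongside the rest of your evaluation data

@weave.op
def my_app(question: str) -> str:
    # Stand-in for your real application.
    return f"Answer to: {question}"

@weave.op
def exact_match(expected: str, output: str) -> dict:
    # Simple function-style scorer; swap in an LLM judge for fuzzier criteria.
    return {"match": expected.lower() in output.lower()}

evaluation = weave.Evaluation(dataset=dataset, scorers=[exact_match])
asyncio.run(evaluation.evaluate(my_app))
```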
Keep an eye on trends over time
Your environment can change over time, causing issues like data drift that you need to spot and fix quickly before they impact your users. Monitors track metrics over time, letting you compare snapshots, catch trends early, proactively solve issues, and maintain quality.


Empower non-technical users to monitor applications
With Weave, product managers and domain experts—even those without coding experience—can directly create LLM judges and monitor their applications. Weave’s intuitive UI simplifies specifying prompts, LLM configurations, and sampling filters, enabling anyone to quickly identify and investigate critical issues.
Reduce friction to get started
Mixing evaluation with application code introduces dependencies that slow down development and add to latency.
Weave’s Online Evaluations run in the Weights & Biases environment, decoupling scoring from your application code. That means no extra integration steps, no deployment hassles, and no added latency for your end users.
