Ship AI with confidence: Introducing W&B Weave Online Evaluations
Real‑time LLM scoring on live incoming traces for production monitoring of AI agents and applications
Large language models (LLMs) are incredible at generating human‑like text, but they’re also non‑deterministic and occasionally unpredictable. Even the most exhaustive offline test suite can’t anticipate every edge case your application will face once it hits the real world.
That’s why today we’re excited to launch Weave Online Evaluations, a new capability for scoring production traces in real time so you can monitor, diagnose, and improve your AI agents and applications.

Offline vs. online: a 30-second refresher
Offline evaluations
- What it is: Run scorers against a fixed dataset to iterate quickly on prompts, metrics, or model versions
- When to use it: During development and experimentation
Online evaluations
- What it is: Sample live production traces and score them continuously to understand actual user behavior over time
- When to use it: After you ship, to ensure reliability and quality
With Weave you now get both. Iterate offline to take your application from an idea to a real product, then deploy and keep tabs on it online without bolting extra code onto your application. Online evaluations let you understand the real-world user experience and continually fold those learnings back in, creating a virtuous closed loop of iteration.
Why online evaluations matter
Online evaluations surface edge cases that only emerge from real user traffic. Use an LLM judge to flag interesting production traces and feed them straight into your offline datasets. This loop makes your evaluation datasets more representative, which makes your offline evaluations more trustworthy. Better evaluations mean your application keeps evolving to handle a growing range of user scenarios over time.
Meet Weave Online Evaluations
Score what matters, skip the noise
Choose exactly which traces you evaluate, whether by random sampling for broad coverage or through precise filters like route, user cohort, or model version. This lets you control evaluation costs and maximize useful insights, keeping the balance of signal and noise fully in your hands.
Bring‑your‑own judges
Compose custom scorers directly in the Weave UI using the prompt-based templates Weave provides. All scorers run asynchronously on W&B infrastructure, so they add no extra latency for your users.
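For a sense of what a prompt-based judge looks like, here is an illustrative rubric of the kind you might paste into a monitor. The placeholder names ({input}, {output}), the scoring scale, and the response format are assumptions for this sketch, not the exact variables of Weave's built-in templates.

# A hypothetical LLM-as-a-judge rubric for a customer-support chatbot.
# The {input}/{output} placeholders and the response format are illustrative;
# check the template variables offered in the Monitors UI.
HELPFULNESS_JUDGE_PROMPT = """
You are grading a customer-support chatbot.

User message:
{input}

Chatbot response:
{output}

Score the response from 1 (unhelpful) to 5 (fully resolves the request)
and explain why. Respond with a single line:
score=<1-5>; reasoning=<one sentence>
"""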
See trends over time
Weave tracks every score in Monitors, making it trivial to compare snapshots, spot drift, and correlate incidents with deployments. Use the Weave SDK or the W&B MCP Server to pull traces along with their monitor scores, then plug them into your own analytics to visualize trends over time.
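As a rough sketch of that workflow: the snippet below assumes the client returned by weave.init exposes get_calls and that monitor scores are attached to calls as feedback. The exact method names, filter arguments, and feedback fields may differ, so treat this as an outline and check the Weave SDK reference.

import weave

# Connect to the same project your application logs traces to.
client = weave.init("my-chatbot-prod")

# Pull recent calls; the limit argument shown here is an assumption -
# consult the SDK docs for the available filter and sort options.
calls = client.get_calls(limit=200)

for call in calls:
    # Monitor scores are attached to calls as feedback records; collect them
    # here for your own dashboards or notebooks.
    for fb in call.feedback:
        print(call.id, fb.feedback_type, fb.payload)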
Empower the whole team
Product managers and domain experts can create or tweak judges directly in the Weave UI, with no repo access or pull request required. Simply configure your own LLM-as-a-judge scorers from the Monitors UI; there is no code to write.
Zero‑friction integration
Because Online Evaluations live completely outside your application code:
- No extra libraries to import
- No multithreading gymnastics
- No scorer‑induced latency
Just log your traces to Weights & Biases; Weave takes care of the rest.
From trace to insight—in less than 60 seconds
Let's look at how to get started. First, initialize Weave in your project and decorate the function you want to trace:
import weave

weave.init("my-chatbot-prod")

# Log a production trace
@weave.op
def your_code(user_input):
    agent_response = 'How can I help?'
    return agent_response

your_code('can you help please?')
Next, follow the link that Weave prints to see your trace.

Then, navigate to the Monitors tab and create a new monitor. From here, you can choose the traced operation to monitor and create an LLM judge that will be called on incoming traces matching the filters and sample rate you set.

That’s it. Now your LLM judge will be triggered as traces arrive that match your filters.

Build a data flywheel
Every low-score trace is a goldmine for improvement. Mark it with a click and Weave will add it to an evaluation dataset. Next sprint, use that dataset offline to tweak prompts, swap models, or test guardrails, then redeploy with confidence.
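For example, once flagged traces land in a dataset, an offline pass might look like the following sketch. weave.Evaluation and @weave.op are standard Weave APIs, but the dataset name ("flagged-production-traces"), its column names (user_input, expected), and the toy scorer are assumptions made for this illustration.

import asyncio
import weave

weave.init("my-chatbot-prod")

# Hypothetical dataset built from flagged production traces; the name and
# columns (user_input, expected) are assumptions for this sketch.
dataset = weave.ref("flagged-production-traces").get()

@weave.op
def contains_expected(expected: str, output: str) -> dict:
    # Toy scorer: does the response mention the expected answer?
    return {"correct": expected.lower() in output.lower()}

@weave.op
def candidate_model(user_input: str) -> str:
    # Swap in the prompt or model variant you want to test.
    return "How can I help?"

evaluation = weave.Evaluation(dataset=dataset, scorers=[contains_expected])
asyncio.run(evaluation.evaluate(candidate_model))

Run this against each candidate change, compare the scores in Weave, and promote the winner back to production, where the monitor keeps scoring it on live traffic.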
Pricing and free tier
Online evaluations are available today for all Weave users. Built-in scorers run on W&B Inference powered by CoreWeave and come with a generous free tier, so you can start monitoring without watching the meter.