
Ship AI with confidence: Introducing W&B Weave Online Evaluations

Real‑time LLM scoring on live incoming traces for production monitoring of AI agents and applications
Large language models (LLMs) are incredible at generating human‑like text, but they’re also non‑deterministic and occasionally unpredictable. Even the most exhaustive offline test suite can’t anticipate every edge case your application will face once it hits the real world.
That’s why today we’re excited to launch Weave Online Evaluations, a new capability for scoring production traces in real time so you can monitor, diagnose, and improve your AI agents and applications.


Offline vs. Online: A 30-second refresher

Offline evaluations

  • What it is: Run scorers against a fixed dataset to iterate quickly on prompts, metrics, or model versions
  • When to use it: During development and experimentation

Online evaluations

  • What it is: Sample live production traces and score them continuously to understand actual user behavior over time
  • When to use it: After you ship, to ensure reliability and quality

With Weave, you now get both. Iterate offline to take your application from an idea to a real product, then deploy and keep tabs on it online without bolting extra code into your application. Online evaluations let you understand the real-world user experience and incorporate those learnings continually, creating a virtuous closed loop of iteration.

Why online evaluations matter

Online evaluations surface edge cases that emerge from real user traffic. Use an LLM judge to identify interesting production traces and instantly feed them into your offline datasets. This iterative loop improves dataset quality, which strengthens your evaluations. Better evaluations mean your application keeps evolving to handle a growing range of user scenarios over time.

Meet Weave Online Evaluations

Score what matters, skip the noise

Choose exactly which traces you evaluate, whether by random sampling for broad coverage or through precise filters like route, user cohort, or model version. This lets you control evaluation costs and maximize useful insights, keeping the balance of signal and noise fully in your hands.
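To make that concrete, here is a purely illustrative sketch of the kind of per-trace decision a monitor makes; the route filter and the 10% sample rate are hypothetical examples, not Weave internals:
import random

def should_score(trace: dict, sample_rate: float = 0.10) -> bool:
    # Illustrative only: score traces from a hypothetical "/support" route,
    # randomly sampled at 10% for broad coverage while controlling cost.
    if trace.get("route") != "/support":
        return False
    return random.random() < sample_rate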

Bring‑your‑own judges

You can compose custom scorers in the Weave UI using the prompt‑based templates Weave provides. All scorers run asynchronously on W&B infrastructure—no extra latency for your users.
Weave tracks every score in Monitors, making it trivial to compare snapshots, spot drift, and correlate incidents to deployments. Use the Weave SDK or the W&B MCP Server to pull the traces along with the monitor scores. You can then use your custom analytics to visualize trends over time.
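For example, here is a minimal sketch of pulling traces for your own analysis with the Weave SDK. It assumes the client's get_calls method; the exact filter and pagination arguments vary by Weave version, so check the current docs before relying on it:
import weave

# Connect to the same project your production app logs to.
client = weave.init("my-chatbot-prod")

# Pull logged calls (traces); monitor scores are attached to calls as feedback.
# See https://weave-docs.wandb.ai/ for the current filtering and query options.
calls = client.get_calls()

for call in calls:
    print(call.id, call.output)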

Empower the whole team

Product managers and domain experts can create or tweak judges directly in the Weave UI, with no repo access or pull request required. Simply configure your own LLM as a judge from the Monitors UI; there's no need to write any code.

Zero‑friction integration

Because Online Evaluations live completely outside your application code:
  • No extra libraries to import
  • No multithreading gymnastics
  • No scorer‑induced latency
Just log your traces to Weights & Biases; Weave takes care of the rest.

From trace to insight—in less than 60 seconds

NOTE: Please visit https://weave-docs.wandb.ai/ for our latest documentation.
Let's look at how to get started.
import weave

weave.init("my-chatbot-prod")

# Log a production trace
@weave.op
def your_code(user_input):
    agent_response = 'How can I help?'
    return agent_response

your_code('can you help please?')
Next, just follow the link to see your trace.

Then, navigate to the Monitors tab and create a new monitor. From here, you can choose the traced operation to be monitored and create an LLM judge that will be called against incoming traces that meet the filters and sample rate you set.

That’s it. Now your LLM judge will be triggered as traces arrive that match your filters.



Build a data flywheel

Every low‑score trace is a goldmine for improvement. Mark it with a click and Weave will add it to an evaluation dataset. Next sprint, use that dataset offline to tweak prompts, swap models, or test guardrails, then redeploy with confidence.
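As a sketch of the offline half of that loop, you can run the curated dataset through weave.Evaluation. The rows and the exact-match scorer below are placeholders standing in for the traces you marked and the judge you actually use:
import asyncio
import weave

weave.init("my-chatbot-prod")

# Placeholder rows standing in for the low-scoring production traces you marked.
dataset = [
    {"user_input": "can you help please?", "expected": "How can I help?"},
]

@weave.op
def exact_match(expected: str, output: str) -> dict:
    # Toy scorer; in practice this could be an LLM judge.
    return {"correct": output == expected}

@weave.op
def your_code(user_input: str) -> str:
    return "How can I help?"

evaluation = weave.Evaluation(dataset=dataset, scorers=[exact_match])
asyncio.run(evaluation.evaluate(your_code))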

Pricing and free tier

Online evaluations are available today for all Weave users. Built-in scorers run on W&B Inference powered by CoreWeave and come with a generous free tier, so you can start monitoring without watching the meter.
