
Run LLM evaluations right in the W&B Weave UI

How to evaluate your LLM inside our application—no code, just clicks
Created on September 11|Last edited on September 11
For as long as we’ve built applications with large language models, evaluation has been essential but never easy. You craft a promising prompt, line up a dataset, and then comes the evaluation harness. Someone has to write the loop that feeds every row of the dataset to the model, record the outputs along with the prompt and model version, and then repeat the process every time you tweak a prompt or swap a model. If you’re a developer, that’s hours of tedium you’d rather not maintain. If you’re not, evaluation can become the gate that keeps you on the sidelines.
Today, that changes. We’re bringing W&B Weave Evaluations straight into the UI. That means anyone—product managers, analysts, researchers, and yes, developers who’d prefer to spend their time on higher-leverage work—can evaluate models and prompts without writing a single line of code. It’s fast, easy, and designed to help you move from “this looks good” to “this is proven” with far less friction.
For developers who need more flexibility, Weave already offers an Evaluations API. It’s easy to get started, supports evaluating entire AI applications and agents (not just prompts), and remains flexible for advanced logging such as custom aggregations and per-example loops.
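To make that concrete, here’s a minimal sketch of what the Python Evaluations API looks like. The project name, dataset rows, and toy model are placeholder assumptions, and the scorer’s output argument is named output in recent Weave versions (older releases used model_output).

import asyncio
import weave

weave.init("my-team/qa-eval-demo")  # placeholder project name

# Each row is one evaluation example; column names are up to you.
dataset = [
    {"user_input": "What is the capital of France?", "expected_output": "Paris"},
    {"user_input": "Who wrote Hamlet?", "expected_output": "William Shakespeare"},
]

@weave.op()
def exact_match(expected_output: str, output: str) -> dict:
    # Scorer arguments are matched to dataset columns by name; `output` is the model's response.
    return {"correct": expected_output.strip().lower() in str(output).lower()}

@weave.op()
def qa_model(user_input: str) -> str:
    # Stand-in for a real LLM call to whichever provider you use.
    return "Paris" if "France" in user_input else "I don't know"

evaluation = weave.Evaluation(dataset=dataset, scorers=[exact_match])
print(asyncio.run(evaluation.evaluate(qa_model)))  # prints aggregate scores and latency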
If you want to jump in now and run evaluations in the UI, you can open the Playground and give it a try:
Playground Quickstart


So what exactly is our new release? In short, a no-code evaluation Playground where you configure a dataset, choose one or more models, add LLM-as-a-judge scorers, and hit Run.
The Playground evaluates each model, scores the results, and packages everything into Weave evaluations with references to the respective prompt and model versions. You can save Weave models (an LLM paired with your system prompt), scorers (your judge and scoring instructions), and datasets as reusable project assets, then reference them again when comparing evaluations. Over time, this becomes a living library of experiments you and your teammates can build on.
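In code, that reuse amounts to publishing an object once and fetching it back by reference later. A rough sketch, with hypothetical names:

import weave

weave.init("my-team/qa-eval-demo")  # placeholder project name

# Publish a dataset once so teammates and future evaluations can reference it.
qa_cases = weave.Dataset(
    name="qa-cases",
    rows=[{"user_input": "What is the capital of France?", "expected_output": "Paris"}],
)
weave.publish(qa_cases)

# Later, pull the same versioned object back by reference...
same_cases = weave.ref("qa-cases").get()
# ...and pass it straight to weave.Evaluation(dataset=same_cases, scorers=[...]).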
Let’s walk through a simple example. Imagine you’re evaluating a question-answering prompt and you want to check basic correctness. You begin in the Weave UI inside your project. From the left menu, you open the Playground and switch to the Evaluate tab. You’ll see two paths: load a demo configuration to explore the interface, or start from scratch.
Starting fresh, you add a title and a short description so future you and your collaborators know what this run is about. Next comes the dataset. Think of it as your table of test cases, with user inputs and expected outputs that represent the behavior you want. You can create the dataset right in the UI, upload a file from your computer, or select an existing dataset you’ve saved before. Common formats like CSV, TSV, JSON, and JSONL are supported. As you add or edit rows in the right-hand pane, take a moment to name the user query column user_input so your scorer can reference it easily. Save the dataset to your project to make it a first-class, shareable artifact.
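If you’d rather assemble that file outside the UI, a few lines of Python produce an upload-ready JSONL with the user_input column named as suggested above (the rows here are only examples):

import json

# Name the query column "user_input" so the scorer template can reference it.
rows = [
    {"user_input": "What is the capital of France?", "expected_output": "Paris"},
    {"user_input": "Who wrote Hamlet?", "expected_output": "William Shakespeare"},
]

# JSONL is one JSON object per line; the Playground accepts it directly as an upload.
with open("qa_dataset.jsonl", "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")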

With data in place, you add your first model. In Weave, a “model” is the combination of a foundation model (for example, a GPT-class model from a provider you’ve connected or one of the open-source models hosted on W&B Inference) and the system prompt that sets its behavior. You click Add Model and choose New Model. Give it a name you’ll recognize later, select the foundation model, and paste in a clear system prompt like “You are a helpful assistant that answers concisely using the provided context.” If you’d like to compare approaches, add a second model with a different system prompt or even a different provider. Side-by-side evaluation is one of the fastest ways to see what really moves your metric.
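The code-side equivalent of that pairing is a weave.Model that carries the system prompt next to a predict method. This is only a sketch: the class name, model ID, and OpenAI-style client are assumptions standing in for whichever provider you’ve connected.

import weave
from openai import OpenAI

class ConciseQAModel(weave.Model):
    # Attributes are versioned with the model, so editing the prompt creates a new version.
    model_name: str
    system_prompt: str

    @weave.op()
    def predict(self, user_input: str) -> str:
        client = OpenAI()  # any OpenAI-compatible endpoint works here
        response = client.chat.completions.create(
            model=self.model_name,
            messages=[
                {"role": "system", "content": self.system_prompt},
                {"role": "user", "content": user_input},
            ],
        )
        return response.choices[0].message.content

concise_model = ConciseQAModel(
    model_name="gpt-4o-mini",  # placeholder; use your connected provider's model ID
    system_prompt="You are a helpful assistant that answers concisely using the provided context.",
)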

Now you’re ready to tell the Playground how to judge the outputs. You add a scorer, pick whether you want a boolean decision (correct/incorrect) or a numeric grade (0–1), and select an LLM to act as the judge. Then you describe, in plain language, how it should evaluate. For simple correctness, you might say: “Given the user input, the expected output, and the model’s output, return True if they match; otherwise return False.” You can reference your dataset fields and model outputs using variables like {user_input}, {expected_output}, and {output}. Because it’s all in the UI, you don’t have to wire anything up—no templating code, no function signatures, just clear instructions to the judge. If you expect nuance instead of exact matches, you can opt for a numeric scorer and ask the judge to rate helpfulness, clarity, grounding, or any domain-specific criteria you care about.
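Behind the scenes, that judge configuration boils down to a templated prompt plus a call to the judge model, roughly like the sketch below. The judge model ID and the exact wording are assumptions, one plausible rendering of the instructions you type in the UI.

import weave
from openai import OpenAI

JUDGE_TEMPLATE = """Given the user input, the expected output, and the model's output,
return True if they match; otherwise return False. Answer with only True or False.

User input: {user_input}
Expected output: {expected_output}
Model output: {output}"""

@weave.op()
def correctness_judge(user_input: str, expected_output: str, output: str) -> dict:
    client = OpenAI()  # the judge can be a different model from the one being evaluated
    verdict = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[{
            "role": "user",
            "content": JUDGE_TEMPLATE.format(
                user_input=user_input, expected_output=expected_output, output=output
            ),
        }],
    ).choices[0].message.content
    return {"correct": "true" in (verdict or "").lower()}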

With your dataset, models, and scorer in place, click Run eval. The Playground takes it from there: it runs each example through each model, captures outputs, records latency and token usage, and applies your scoring logic. When it’s done, you land on the evaluation results page with row-by-row and aggregate views. You can compare evaluations using Weave’s visual comparisons and diff tables, and you can drill into individual examples for side-by-side comparisons to see exactly what changed.
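Reusing the hypothetical pieces from the earlier sketches (the rows list, correctness_judge, and ConciseQAModel), a side-by-side comparison in code is simply the same Evaluation run once per model; each run then appears in Weave’s comparison views.

import asyncio
import weave

weave.init("my-team/qa-eval-demo")  # placeholder project name

verbose_model = ConciseQAModel(
    model_name="gpt-4o-mini",  # placeholder
    system_prompt="You are a thorough assistant that explains its reasoning step by step.",
)

evaluation = weave.Evaluation(dataset=rows, scorers=[correctness_judge])
for candidate in (concise_model, verbose_model):
    summary = asyncio.run(evaluation.evaluate(candidate))
    print(summary)  # aggregate scores; per-example results live in the Weave UI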
Two things usually surprise teams the first time they use this flow. First, how quickly you can move from an idea to a defensible result. Because Weave models are automatically versioned, you can easily see which changes drove metric improvements and iterate systematically instead of going in circles. Second, how much easier experimentation becomes. Since datasets, models, and scorers are saved objects, you can share a link with a teammate, ask them to tweak a prompt or judge instruction, and rerun the exact same experiment with one click. Decisions stop being opinions; they become artifacts you can revisit, reproduce, and refine.
This is also a win for developers. Even if you’re comfortable writing evaluation code, the boilerplate around data loading and tracking model and scorer versions in each run doesn’t make your product better. Offloading that scaffolding to the Playground lets you focus where it matters: designing better prompts and pushing the frontier of what your application can do. When you move to production, your saved models and scorers carry forward, turning experiments into the foundation of the build. The Weave evaluation APIs are the path for carrying those prompts into the final application or agent.
All of this adds up to evaluation that doesn’t slow you down; it makes you confident. With Weave evaluations available in the Playground UI, you can compare ideas quickly, bring non-developers into the loop, and ground your decisions in visible, repeatable evidence. That’s how you move an application from a demo that impresses in a meeting to a product that performs in the real world.
If you’re ready to try it, open the Playground, load the demo to get a feel for it, or start from scratch and build your first evaluation in minutes. Save what works, share it with your team, and keep iterating.
Get started today and turn your LLM workflow into a measured, collaborative engine for progress.
Playground Quickstart


Iterate on AI agents and models faster. Try Weights & Biases today.