Make evaluations count: Comparing AI application evaluation results using W&B Weave
Learn how you can build better agents faster with W&B's suite of evaluation tooling

W&B Weave streamlines the development process for AI and agentic applications by offering powerful tools for continuous evaluation, monitoring, and iteration.
But to truly realize the potential of these tools, developers need optimization methods that are straightforward to implement and designed for consistent, repeatable improvement. Weave facilitates this optimization workflow through structured processes and actionable insights.
The typical optimization journey involves the following steps (a brief code sketch of the first step appears after the list):
- Evaluate your application version on quality, latency, cost, and safety.
- Compare results from current and past evaluations.
- Identify which version performs best according to your criteria.
- Define this version as the new standard.
- Make adjustments that could enhance performance and run evaluations again.
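As a quick illustration of that first step, here is a minimal sketch of an evaluation in Weave. The project name, dataset, scorer, and application function are hypothetical placeholders; swap in your own.

```python
import asyncio
import weave

weave.init("my-team/my-eval-project")  # hypothetical project name

# A tiny illustrative dataset; in practice, build this from real usage.
examples = [
    {"question": "What is your return policy?", "expected": "30 days"},
    {"question": "Do you ship internationally?", "expected": "yes"},
]

# Scorers receive dataset columns by name, plus the application's output.
# (Older Weave versions name the output parameter `model_output`.)
@weave.op()
def exact_match(expected: str, output: str) -> dict:
    return {"match": expected.lower() in str(output).lower()}

# The application under test: a stand-in for a real prompt + LLM call.
@weave.op()
def answer_question(question: str) -> str:
    return f"Our policy: 30 days. You asked: {question}"

evaluation = weave.Evaluation(dataset=examples, scorers=[exact_match])
results = asyncio.run(evaluation.evaluate(answer_question))
print(results)  # aggregate scores and latency; full traces appear in the Weave UI
```

Each run like this shows up as an evaluation you can compare in the views covered below.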
Versioning AI assets is critical for measuring change. Weave’s Model class lets you define your application, track changes to prompts, RAG content, LLMs, and other settings, and compare results over time.
When you build an application with Weave, something as minor as editing a prompt or as significant as swapping an LLM automatically creates a new version. Weave tracks every version and its history with no extra steps for the developer. Each version is saved and can be pulled into your code with a simple URI reference, making it easy to reuse in evaluations or other tasks.
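Here is a rough sketch of how that looks in code; the project name, attributes, and URI are illustrative placeholders rather than values from a real project.

```python
import weave

weave.init("my-team/my-eval-project")  # hypothetical project

class SupportAgent(weave.Model):
    # Changing any attribute (prompt, model name, RAG settings) produces a new version.
    model_name: str
    system_prompt: str

    @weave.op()
    def predict(self, question: str) -> str:
        # Placeholder for the real LLM or RAG call.
        return f"[{self.model_name}] {self.system_prompt} {question}"

# Publishing (or simply running) the model records this exact configuration as a version.
weave.publish(SupportAgent(model_name="gpt-4o", system_prompt="Be concise."))

# Any saved version can later be pulled back by its URI for re-evaluation.
# The URI below is illustrative; copy the real object URI (and version) from the Weave UI.
agent_v2 = weave.ref("weave:///my-team/my-eval-project/object/SupportAgent:v2").get()
```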
Once you've evaluated your AI application, it's time to analyze and compare results. Weave makes this easy. Review every prompt, trace, and dataset in detail. Compare versions side by side. The interface highlights every change, so it’s easy to catch every difference, big or small.
LLM evaluations are costly in resources, time, and tokens. To see immediate and meaningful value, you need to turn results into clear, useful metrics. Weave’s interface supports this process, offering several ways to compare application versions, from high-level summaries down to detailed responses for individual questions.
Let’s run through the following comparison views available in the Weave UI to show how each one helps you get the most from your evaluations:
- Leaderboards
- The evaluations page
- The evaluations summary dashboard
- The evaluation dataset results page
Leaderboards

What is this showing me?
Leaderboards show a side-by-side comparison of your application versions. Each metric, including scores for hallucination, friendliness, return policy handling, and latency, gets its own column. The heatmap coloring highlights top and bottom performers, making patterns easy to spot. You can sort each column and choose between total scores or percentages. This simple, interactive view helps you quickly see where each version excels or falls short.
When is it useful?
Use the Leaderboard for a high-level summary of which version performs best. It compares all versions using identical questions from the same evaluation dataset, ensuring results are consistent and meaningful.
When the differences between versions are minimal, like switching between large language models (LLMs), the Leaderboard provides a clear comparison. For instance, if LLM #1 scores significantly higher than LLM #2, you know which model offers better performance. However, for more complex differences between versions, you’ll want to look deeper into your evaluation results before making any optimization decisions.
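Leaderboards can also be defined in code and published to your project. The sketch below reflects our understanding of Weave’s leaderboard objects; the module path, field names, and summary metric path are assumptions that may differ across Weave versions, and the evaluation URI and scorer name are placeholders.

```python
import weave
from weave.flow import leaderboard  # assumed module path for leaderboard objects

weave.init("my-team/my-eval-project")  # hypothetical project

# Reference a previously published evaluation (URI is illustrative).
eval_uri = "weave:///my-team/my-eval-project/object/support-agent-eval:v0"

board = leaderboard.Leaderboard(
    name="Support agent comparison",
    description="Ranks application versions on exact-match accuracy.",
    columns=[
        leaderboard.LeaderboardColumn(
            evaluation_object_ref=eval_uri,
            scorer_name="exact_match",                  # scorer op used in the evaluation
            summary_metric_path="match.true_fraction",  # assumed path to the summary metric
        )
    ],
)
weave.publish(board)
```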
Evaluations page

What is this showing me?
The Evaluations page displays all your evaluation results in a tabular format. The first column is for feedback. The second shows status. You’ll see details like app version and dataset used for each evaluation. Each score appears in its own column. Use the Filter button at the top to quickly narrow down the results.
Clicking an evaluation will open a detailed drawer on the right that includes all the information you need about the evaluation run, such as metrics from each scorer and a table containing links for accessing each traced application call.

You can export data from the Evaluations page for use in tools like Google Sheets, Excel, or Python (as a pandas dataframe). This flexibility lets you choose the tools you prefer for analysis. Of course, Weave offers the optimal interface for evaluating and comparing results, just a click or two away from this row-by-row view.
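As a rough sketch of the Python route (the project name is a placeholder, and the exact client query methods and Call fields may vary by Weave version):

```python
import pandas as pd
import weave

client = weave.init("my-team/my-eval-project")  # hypothetical project

# Pull traced calls from the project; filtering options vary by Weave version.
calls = client.get_calls()

rows = []
for call in calls:
    rows.append(
        {
            "id": call.id,
            "op_name": call.op_name,
            "inputs": call.inputs,
            "output": call.output,
            "summary": call.summary,  # includes scorer results and usage when present
        }
    )

df = pd.DataFrame(rows)
print(df.head())
```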
When is it useful?
Like the Leaderboard, this table view presents a high-level overview of evaluations and their scores. It includes columns for total tokens and costs, providing a spreadsheet-like experience. This view is perfect for quickly identifying which application versions perform better across various dimensions. We get easy access to the data, but, ultimately, we are staring at the “what” and not the “why.” To understand why results differ and break down the aggregates, let’s open up the Compare evaluations page.
Evaluations Summary

Comparing application versions in the Weave UI is easy. From the Evaluations page, just select the checkboxes for the evaluations you're interested in and hit the Compare button at the top. You’re immediately greeted by a series of insightful visualizations on the Summary tab.
What is this showing me?
The Summary dashboard helps you compare application versions quickly. The top half displays a spider chart and column charts. You can drag and drop version chips to set the order and choose a baseline version for comparison. The Metrics section highlights positive and negative differences with green and red values next to each score. The dashboard also automatically adds charts for latency and total tokens, making it simple to see how each version stacks up in performance and efficiency.
When is it useful?
The Summary page in Weave organizes data from the Evaluations page into easy-to-read charts. The spider chart gives a quick overview of version performance across several metrics. Column charts let you compare versions on each metric. You can get actionable insights in seconds, with little need to click or scroll. The dashboard makes it clear which versions meet your requirements based on the performance metrics you prioritize. It’s the most straightforward way to evaluate your application in Weave.
Click on any object or result, and a drawer with extra details opens on the right. Depending on what you click (a Weave Model, a scorer, an evaluation, or a single score), you’ll see properties or traces from the scoring process. This design keeps everything you need within easy reach and speeds up your workflow.
Evaluations Dataset results
The final stop on our path from high-level summaries to granular, low-level results is the Dataset results tab.

What is this showing me?
When comparing two evaluation runs, the Dataset results dashboard displays two interactive scatterplots. These charts help you compare how each version performs based on your chosen metric. Under the scatterplots, you’ll find a detailed list of responses from each application version for every question in the dataset.
Effective evaluations depend on strong datasets. The most valuable datasets come from real-world usage or closely resemble it. By examining each dataset question alongside application version responses, you can compare evaluation results at the most granular level of detail.
When is it useful?
The scatter plots make version-to-version changes easy to spot. The dashed regression (trend) line running through each plot helps visualize the distance between version responses.
During AI application development, you try out different prompts and models. At first, their performance tends to vary a lot. Some versions answer questions well, while others miss the mark. As you refine your application, you start to get similar answers for most questions. Now, the key questions are those where the versions still respond differently. Drilling down into these responses lends insight into which version performs best.

A quick visual scan of the dataset results exposes which questions and results to further investigate. Click on any Row ID to see a pivoted, detailed view of responses and scores for each application version. Set a baseline version at the top to compare color-coded score differences effortlessly.
As you narrow down to high-quality application versions, the Dataset results page becomes more valuable. The Summary dashboard exposes the top performers. The Dataset results page then gives you a closer look at each response. You see exactly which cases each version handles well and which it misses.
Conclusion
AI applications are easy to demo, but hard to productionize. Moving from prototype to production requires evaluation, iteration, and optimization. Comparison drives that work. You need an interface that shows the right data and lets you act on it. Weave delivers this with multiple comparison views, from Leaderboards for quick rankings to per-question analysis on evaluation datasets. With a full picture of version performance, developers and teams build better AI applications faster.
Iterate on AI agents and models faster. Try Weights & Biases today.