Announcing: New evaluation comparisons in W&B Weave
Our newest Weave feature lets you interactively compare LLM evaluations at both an overall and granular level. Here's how it works.
We’re excited to share our newest W&B Weave feature: compare evaluations. You can now get visual, high-level summaries of how your LLM evaluations stack up, then drill down to compare example outputs and scores.
It’s easy to get started. Just navigate to the evaluations tab, select a few evaluations, and click the “evaluation comparison” button. You can immediately scan through comprehensive reports comparing models across performance, latency, token usage, and more. You can also quickly see the differences between the two applications being evaluated, whether you're swapping in a new model or trying a new iteration of your prompt.
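If you haven't logged any evaluations yet, the sketch below shows roughly what that looks like with the Weave Python SDK. The project name, dataset, toy model, and scorer are all placeholders, but running the same evaluation against two model variants like this is enough to populate the evaluations tab with results you can compare. (Depending on your Weave version, the scorer's output argument may be named differently; see the comment in the code.)

```python
import asyncio
import weave

weave.init("my-team/eval-comparison-demo")  # placeholder project name

# A tiny placeholder dataset; in practice, use your own examples.
dataset = [
    {"question": "What is the capital of France?", "expected": "Paris"},
    {"question": "What is 2 + 2?", "expected": "4"},
]

@weave.op()
def exact_match(expected: str, output: str) -> dict:
    # Note: on older Weave versions the output argument is named `model_output`.
    return {"correct": output.strip().lower() == expected.strip().lower()}

class ToyModel(weave.Model):
    variant: str  # user-defined property, surfaced in the comparison view

    @weave.op()
    def predict(self, question: str) -> str:
        # Stand-in for a real LLM call.
        return "Paris" if "France" in question else "4"

evaluation = weave.Evaluation(dataset=dataset, scorers=[exact_match])

# Run the same evaluation against a baseline and a challenger; both runs
# then appear in the evaluations tab, ready to select and compare.
asyncio.run(evaluation.evaluate(ToyModel(variant="baseline")))
asyncio.run(evaluation.evaluate(ToyModel(variant="challenger")))
```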
We'll go into more detail below, but if you'd like to see evaluation comparisons in action first, you can watch this video:
Radar and bar plots compare summary metrics for different models, including user-defined scores as well as automatically captured metrics like latency and total tokens. Plus, it’s easy to add custom metrics by defining your own scoring functions.
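As a rough illustration (the scorer name and logic here are made up, not part of Weave), a custom scoring function can return a dictionary of named scores, and each key then shows up as its own metric in the plots:

```python
import weave

@weave.op()
def response_quality(expected: str, output: str) -> dict:
    # Each key returned here becomes its own metric in the comparison plots.
    # (On older Weave versions the output argument is named `model_output`.)
    return {
        "exact_match": output.strip().lower() == expected.strip().lower(),
        "concise": len(output.split()) <= 50,
    }

# Then pass it to an evaluation alongside any other scorers:
# evaluation = weave.Evaluation(dataset=dataset, scorers=[response_quality])
```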
In addition to the plots, there's a model score card showing your baseline and challenger models side by side, along with views where you can explore user-defined properties such as model names, prompt templates, and temperature. This summary lets you surface important information quickly; for example, you can see whether one model used more tokens but delivered better performance.
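For instance, a model defined with the Weave SDK might carry those properties as fields. The class name, field names, and values below are illustrative, not prescribed by Weave:

```python
import weave

class PromptedModel(weave.Model):
    # User-defined properties like these are what the score card surfaces,
    # so you can see at a glance what differed between two runs.
    llm: str
    prompt_template: str
    temperature: float = 0.2

    @weave.op()
    def predict(self, question: str) -> str:
        prompt = self.prompt_template.format(question=question)
        # Stand-in for a real LLM call using `prompt` and `self.temperature`.
        return f"(answer to: {question})"

baseline = PromptedModel(
    llm="gpt-4o-mini",  # illustrative values only
    prompt_template="Answer concisely: {question}",
)
challenger = PromptedModel(
    llm="gpt-4o-mini",
    prompt_template="Think step by step, then answer: {question}",
    temperature=0.7,
)
```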

The new evaluation comparisons also provide an easy way to explore all the different examples that your models were evaluated on and the outputs for each example. If you're using the evaluation framework, you can jump between trials and see model latency, model summaries, and overall aggregate metrics. And we always make the code available so you can see what generated the score.
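As a minimal sketch, assuming your version of Weave's Evaluation accepts a trials argument, running multiple trials per example looks something like this (project name, dataset, model, and scorer are placeholders):

```python
import asyncio
import weave

weave.init("my-team/eval-comparison-demo")  # placeholder project name

dataset = [{"question": "What is 2 + 2?", "expected": "4"}]

@weave.op()
def exact_match(expected: str, output: str) -> dict:
    return {"correct": output.strip() == expected}

class ToyModel(weave.Model):
    @weave.op()
    def predict(self, question: str) -> str:
        return "4"  # stand-in for a real LLM call

# `trials` repeats each example several times, which is what the trial
# navigation in the comparison view lets you page through.
evaluation = weave.Evaluation(dataset=dataset, scorers=[exact_match], trials=3)
asyncio.run(evaluation.evaluate(ToyModel()))
```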
One of the biggest advantages of the new evaluation comparison feature is that you can dig into challenging examples, or find outputs where the models score very differently, and discover novel behavior in your challenger models.
Differences in evaluation performance are easy to spot because we plot each example as a dot, with the baseline model's score on the X axis and the challenger model's score on the Y axis. A center line shows where performance is the same.
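To make the idea concrete, here's a rough standalone sketch of the same kind of plot using plain matplotlib and made-up scores; this is just an illustration of the concept, not the Weave UI itself:

```python
import matplotlib.pyplot as plt

# Hypothetical per-example scores for the same ten examples.
baseline_scores = [0.2, 0.5, 0.9, 0.4, 0.7, 0.6, 0.3, 0.8, 0.5, 0.1]
challenger_scores = [0.3, 0.4, 0.95, 0.7, 0.7, 0.5, 0.6, 0.85, 0.9, 0.2]

plt.scatter(baseline_scores, challenger_scores)
plt.plot([0, 1], [0, 1], linestyle="--")  # center line: equal performance
plt.xlabel("baseline model score")
plt.ylabel("challenger model score")
plt.title("Dots above the dashed line favor the challenger")
plt.show()
```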

Once you've filtered down the examples, you can compare them side by side and page through each output and its corresponding scores. You can also quickly navigate to the application trace to get to the root cause of an issue. This is particularly helpful for examples where you want to see what exact data was used by the LLM.

We built the evaluation comparison feature for scale, so you can dig into a single evaluation or compare several at once. The report automatically scales to any number of model metrics.
Ready to give it a try? Checking out our Weave resources is a great place to start.