Building better evaluations with high-quality data
Digging into what data to use when evaluating your AI applications—and showing how W&B Weave can help

Evaluations built on high-quality data produce better AI applications. The LLMs driving these applications are non-deterministic: ask the exact same question ten times and you may get ten different answers. Rigorous evaluation and optimization mitigate that inherent unpredictability, ensuring consistent, reliable AI applications that can be deployed into production with confidence.
A thorough evaluation process preceding AI application deployment requires three things: the application itself, scorers for measuring the application’s performance, and input data for testing the application. This combination permits measuring performance across multiple dimensions, including quality, latency, cost, and safety. “Vibe checks,” informally eyeballing application outputs, lack the scale and repeatability required to deliver consistent performance and a positive user experience.
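To make those three pieces concrete, here is a minimal sketch of how they fit together in Weave’s Python SDK. The project name, toy application, and exact-match scorer are illustrative only, and the scorer signature assumes a recent version of the SDK:

import asyncio
import weave
from weave import Evaluation

weave.init("evaluation-demo")  # illustrative project name

# The application under test: any function decorated with @weave.op can be evaluated.
@weave.op
def answer_question(question: str) -> str:
    return "Paris" if "France" in question else "I am not sure"

# A scorer measures one dimension of performance; here, simple exact-match quality.
@weave.op
def exact_match(expected: str, output: str) -> dict:
    return {"correct": output == expected}

# Input data: each row provides the inputs to the application and the expected answer.
examples = [
    {"question": "What is the capital of France?", "expected": "Paris"},
    {"question": "What is the capital of Peru?", "expected": "Lima"},
]

evaluation = Evaluation(dataset=examples, scorers=[exact_match])
asyncio.run(evaluation.evaluate(answer_question))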
Datasets for AI application evaluations
Evaluations are an integral part of developing AI applications—and deploying them into production. The best input data for evaluations is collected from “real-world” interactions, but getting such data is not always possible. If you have the luxury of using real-world data for evaluations, it is important to consider the following interaction types:
Happy paths
This data reflects the most favorable conditions under which an application is expected to perform. By focusing on happy path scenarios, developers can benchmark their models against ideal outcomes.
Edge cases
Edge cases are rare situations that can reveal flaws in application behavior. By rigorously testing these scenarios, developers can enhance the resilience of their applications, ensuring they perform well even in unexpected circumstances.
Error conditions
Error conditions provide critical insights into application performance during adverse scenarios. By examining this data, developers can identify vulnerabilities and optimize their applications for better reliability.
User variations
User variation data captures the many different ways users engage with an application, offering valuable insights into its effectiveness. This data allows developers to identify trends and areas for enhancement, ensuring the application is responsive to user needs.
In cases where real-world data is not available, it is possible to generate synthetic data resembling realistic interactions. The goal is to create synthetic data that is indistinguishable from real data. For example, if an AI medical assistant agent is expected to perform an initial sickness diagnosis based on symptoms, the data used during the evaluation process should mimic patients describing the symptoms associated with their health problems.
Just as is the case when collecting real data for evaluations, it is imperative that the synthetic input data used for evaluations reflects the different types of interactions that the AI application will be expected to handle. It is also especially important when generating synthetic data that domain experts be included in the validation process to ensure that the data is realistic and the output from the AI application is satisfactory.
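For example, a small evaluation dataset for the medical-assistant scenario above might tag each row with the interaction type it covers, making coverage gaps easy to spot. This is a hypothetical sketch: the field names and rows are illustrative, and weave.Dataset and weave.publish are used to store the result in Weave:

import weave
from weave import Dataset

weave.init("medical-assistant-evals")  # illustrative project name

rows = [
    # Happy path: a clear, typical symptom description.
    {"interaction_type": "happy_path",
     "symptoms": "runny nose, sneezing, and a sore throat for two days",
     "expected_triage": "likely a common cold; rest and fluids"},
    # Edge case: ambiguous symptoms that only appear in a narrow situation.
    {"interaction_type": "edge_case",
     "symptoms": "mild headache that only appears after reading for a long time",
     "expected_triage": "possible eye strain; suggest a vision check"},
    # Error condition: input the assistant should ask to have clarified rather than diagnose.
    {"interaction_type": "error_condition",
     "symptoms": "asdfgh",
     "expected_triage": "ask the patient to rephrase their symptoms"},
    # User variation: the same kind of complaint phrased very informally.
    {"interaction_type": "user_variation",
     "symptoms": "my stomach's been killing me since last night's takeout",
     "expected_triage": "possible food poisoning; monitor hydration"},
]

dataset = Dataset(name="symptom-triage-examples", rows=rows)
weave.publish(dataset)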
Synthetic data can be generated through a variety of different methods. When using an LLM to generate data, it is important to use an effective prompt, often containing suggested examples of the output data. In fact, an LLM-powered data generator is its own type of AI application that requires its own evaluations and optimizations to build properly.
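As a sketch of that idea, the generator below builds a few-shot prompt from seed examples and is itself decorated with @weave.op so its behavior can be traced and later evaluated. It assumes the openai Python client with an OPENAI_API_KEY in the environment; the model name, prompt wording, and seed examples are illustrative:

import json
import weave
from openai import OpenAI

weave.init("synthetic-data-generator")  # illustrative project name
client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Seed examples steer the generator toward realistic patient-style descriptions.
SEED_EXAMPLES = [
    {"symptoms": "persistent dry cough and a mild fever for three days",
     "expected_triage": "see a doctor within 48 hours"},
    {"symptoms": "sudden chest pain spreading to the left arm",
     "expected_triage": "seek emergency care immediately"},
]

@weave.op  # the generator is an AI application in its own right, so trace and evaluate it too
def generate_synthetic_rows(n: int) -> list[dict]:
    prompt = (
        f"Generate {n} JSON objects describing patients reporting symptoms, "
        "each with 'symptoms' and 'expected_triage' fields, returned as a JSON array. "
        "Match the style of these examples:\n" + json.dumps(SEED_EXAMPLES)
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    # A sketch only: real code should validate the output and handle malformed JSON.
    return json.loads(response.choices[0].message.content)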
New W&B Weave features
Given the importance of using high-quality data in the AI application evaluation process, W&B Weave has recently released two new features that give users greater access to, and control over, the data and datasets stored in their Weave accounts.
Manage datasets in Weave's UI

Previously, datasets could be stored with Weave but only updated by running code outside of it; now it is possible to add, edit, and delete data in those datasets directly in the Weave user interface. Conducting thorough evaluations for AI applications requires the ability to efficiently view and manage datasets. Editing data quickly and easily in the Weave UI lets users refine and improve evaluations in place, rather than making changes elsewhere, such as updating datasets by executing code in notebooks or scripts on their laptops.
Not only is it possible to change evaluation inputs that are irrelevant, erroneous, or no longer useful, it is also possible to quickly add inputs driven by new observations about customer behavior or simply curiosity about how an AI application might react to hypothetical user queries. By removing the burden of altering datasets, Weave permits more rapid iteration on evaluations and a quicker path to optimization of LLM-powered applications.
Build datasets with Weave call data

You can add Weave call data to datasets in the Weave UI (pictured above) or using the SDK (code below).
import weave
from weave import Dataset

weave.init("my-project")  # illustrative project name

@weave.op
def model(task: str) -> str:
    return f"Now working on {task}"

res1, call1 = model.call(task="fetch")
res2, call2 = model.call(task="parse")

dataset = Dataset.from_calls([call1, call2])
# Now you can use the dataset to evaluate the model, etc.
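A dataset built from calls this way can also be published back to Weave so it is available for future evaluations; the dataset name below is illustrative:

weave.publish(dataset, name="production-calls")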
It's challenging to build effective datasets for evaluations. Synthesized data intended to resemble production data works well, but has difficulty capturing edge cases, error conditions, and user variations. Nothing compares to production data for AI application evaluation purposes. Optimizing an AI application means preparing it to handle all real-world scenarios predictably, and using production data for evaluations mitigates the risks posed by the non-deterministic nature of LLMs.
Weave collects traces for AI applications during development and in production. It is now possible to add Weave call data directly to any dataset with just a few clicks in the Weave UI or a couple of lines of code using the SDK. Creating datasets from real production data captured by Weave, including edge cases and error conditions, drives better evaluations. And better evaluations lead to battle-hardened AI applications that can be deployed into production with confidence.
You can learn more about storing and managing datasets in W&B Weave and easily adding call data to Weave datasets in the product docs. Thanks for reading.