
Building better AI applications: Why evaluations matter

A primer on why evaluations are vital, which evaluations matter, and how W&B Weave can help you build better AI apps, faster
Unit tests are automated assessments built by developers to verify that specific sections of an application—often individual functions or methods—perform as intended. With traditional software, it’s reasonable to expect that each test would consistently produce a predetermined and specific result each time it was run. Now, as developers explore ways to use large language models in production AI applications, that is no longer the case.
Why? Because LLMs are non-deterministic. Asking the same question multiple times does not mean you’ll receive the same answers every time. This contrasts significantly with traditional software development, where conditional logic allows for precise code that delivers predictable outputs.
We can think of LLM evaluations like unit tests but for AI applications. By conducting evaluations, we can identify areas for improvement and enhance consistency, ensuring that end-users receive similar—and accurate—responses.
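To illustrate the difference, here is a small Python sketch; `ask_llm` is a hypothetical stand-in for whatever model call your application makes. A traditional unit test asserts a single fixed result, while an LLM evaluation runs the same input repeatedly and scores the responses.

```python
def add(a: int, b: int) -> int:
    return a + b

def test_add():
    # Traditional unit test: the same input always yields the same output.
    assert add(2, 3) == 5

def evaluate_llm(ask_llm, question: str, expected: str, runs: int = 20) -> float:
    # LLM "unit test": run the same question many times and score the results,
    # because identical prompts can produce different answers.
    correct = sum(
        expected.lower() in ask_llm(question).lower() for _ in range(runs)
    )
    return correct / runs  # e.g., 0.85 means 85% of responses contained the expected answer
```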
As an example of this shift in how applications are tested, consider this: Today, creating a basic AI chatbot is quick and easy. What would have been a massive undertaking just a couple of years ago can now be done using no-code (or low-code) tools, simple Python scripts, or code generated in response to a ChatGPT request.


The problem with the chatbot above is that while it may look good and perform well on the surface, it can be unpredictable—potentially even dangerous—in practice. Preparing any AI application for a real-world scenario where it presents end-users with accurate and credible answers requires strenuous hardening and battle testing via rigorous evaluation. “Vibe checks” based on subjective impressions of LLM interactions—no matter how frequently conducted—are a fine first step but simply not enough to ensure a production-quality application. Without constant evaluation of common, corner, and edge cases, developers lack the necessary visibility and oversight of their chatbot’s behavior.
Of course, the need for constant evaluation does not end after initial testing. Application changes or updates to underlying LLMs require that effective monitoring of production AI applications includes consistent and evolving evaluation strategies.
The gap between an AI application demo or prototype and a production-quality application is enormous. Simply put, AI applications are easy to demo, but hard to productionize.
Enter W&B Weave, the AI developer tool helping developers build trusted and credible AI applications with confidence.

Evaluation dimensions

When defining and measuring the performance of an AI application, it’s essential to recognize that the quality of user experience is multi-dimensional. Determining success is not black and white.
A number of different factors affect whether an application is ready for production, but chief among them are evaluations. Deciding which dimensions are more important—and which are less important—depends on the use case and the priorities of the stakeholders involved.
Let’s take a look at some of the most crucial dimensions for evaluating the performance of an LLM-driven AI application:

Accuracy

Not only are you likely to receive twenty slightly different answers when you ask an LLM the exact same question twenty times, but there is also no guarantee that any of the answers will be correct. Evaluations can help improve accuracy to ensure that end-users are not receiving incorrect responses.

Latency

Depending on the perceived complexity of a query, there is a maximum amount of time that an end-user can be expected to wait for a response. A challenging mathematical puzzle will likely take more time to resolve than a question involving simple arithmetic. Some models or prompts may yield a faster response time than others. It is important to understand what amount of latency is acceptable.
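As a simple illustration, latency can be measured by timing repeated calls. In the sketch below, `call_llm` is a hypothetical stand-in for your application's model call, and the two-second threshold is only an assumed example of an acceptable budget.

```python
import statistics
import time

def measure_latency(call_llm, prompt: str, runs: int = 10) -> dict:
    """Time a hypothetical `call_llm` function over several runs."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        call_llm(prompt)
        latencies.append(time.perf_counter() - start)
    return {
        "p50_seconds": statistics.median(latencies),
        "max_seconds": max(latencies),
    }

# Example threshold check: flag the prompt/model combination if the median
# response time exceeds what your users will tolerate (assumed 2 seconds here).
# assert measure_latency(call_llm, "What is 2 + 2?")["p50_seconds"] < 2.0
```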

Cost

While the cost per token of LLM API calls is often only a few cents or less, these numbers can add up fast. What is the price of acceptable performance, and is that cost reasonable? A developer must always be conscious of the costs a successful AI application is expected to incur. Evaluations provide the cost metrics that make this prediction possible.
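A back-of-the-envelope sketch like the one below shows how quickly per-token prices compound at production volume. The prices and traffic figures are assumptions for illustration, not actual vendor pricing.

```python
# Illustrative cost estimate; the per-token prices below are placeholders.
PRICE_PER_1K_INPUT_TOKENS = 0.0025   # assumed
PRICE_PER_1K_OUTPUT_TOKENS = 0.0100  # assumed

def estimate_monthly_cost(requests_per_day: int,
                          avg_input_tokens: int,
                          avg_output_tokens: int) -> float:
    cost_per_request = (
        avg_input_tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS
        + avg_output_tokens / 1000 * PRICE_PER_1K_OUTPUT_TOKENS
    )
    return cost_per_request * requests_per_day * 30

# 50,000 requests/day with a 1,500-token prompt and 300-token answer comes to
# roughly $10,125/month under the assumed prices above.
print(estimate_monthly_cost(50_000, 1_500, 300))
```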

Safety

There are two sides to the safety dimension. First, the AI developer must be aware of any attempt by end-users to trick or take advantage of an LLM in a malicious manner; the AI application must guard against nefarious attempts to thwart its security measures. Second, an AI application must protect the end-user from receiving toxic, misleading, or harmful information. Evaluations help to ensure that the proper safeguards are in place to protect both the AI application and the end-user.
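As a toy illustration of both sides of this dimension, a scorer might screen user input for prompt-injection attempts and model output for misleading or unsafe content. The patterns below are illustrative placeholders, not a production guardrail.

```python
import re

# Illustrative patterns only; a real safety layer would be far more thorough.
INJECTION_PATTERNS = [r"ignore (all )?previous instructions", r"reveal your system prompt"]
UNSAFE_OUTPUT_PATTERNS = [r"\bguaranteed (returns|profits)\b"]

def safety_score(user_input: str, model_output: str) -> dict:
    injection = any(re.search(p, user_input, re.IGNORECASE) for p in INJECTION_PATTERNS)
    unsafe = any(re.search(p, model_output, re.IGNORECASE) for p in UNSAFE_OUTPUT_PATTERNS)
    return {
        "injection_attempt": injection,
        "unsafe_output": unsafe,
        "safe": not (injection or unsafe),
    }
```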


Trade-offs may be necessary: improving performance along one dimension can require sacrificing performance along another (for example, a more accurate response may come with greater latency, while a less accurate response may require less time to generate).
That said, whether specific trade-offs are acceptable depends on the purpose of the AI application. An AI agent tasked with making airline reservations or financial decisions will have very little room for error, so it’s fair to ask for a customer’s patience with slightly longer response times. But no one is going to wait long for the answer to a less critical question about which actors starred in a blockbuster movie.
Relaxing accuracy demands may also be worth considering when cost is a driving factor, since minimizing prompt token length translates to lower costs. Using more expensive and capable models, longer prompts, or additional RAG content may result in greater accuracy and less potential for hallucinations, but that accuracy improvement comes at a cost.

Evaluations


Evaluations are critical at every stage of AI application development, from building the first prototype, through optimization and productionization, to production monitoring. Every change to an AI application, no matter how small or inconsequential it may seem, requires examining how the change impacts application performance.
W&B Weave enables straightforward model performance evaluation by tracking every single application data point and effectively organizing and visualizing all relevant inputs, outputs, code, and metadata.
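To make this concrete, below is a minimal sketch of what an evaluation run with the Weave Python SDK can look like. The project name, dataset, model function, and scorer are placeholders, and details such as scorer argument names can vary between Weave versions, so treat it as an illustration rather than a reference implementation.

```python
import asyncio
import weave

weave.init("my-chatbot-evals")  # placeholder project name

# A tiny evaluation dataset; column names are up to you.
examples = [
    {"question": "What is the capital of France?", "expected": "Paris"},
    {"question": "How many continents are there?", "expected": "seven"},
]

@weave.op()
def exact_match(expected: str, output: str) -> dict:
    # Scorer arguments are matched against dataset columns plus the model output.
    return {"correct": expected.lower() in str(output).lower()}

@weave.op()
def my_chatbot(question: str) -> str:
    # Placeholder for the real model call (LLM API, RAG pipeline, etc.).
    return "Paris is the capital of France."

evaluation = weave.Evaluation(dataset=examples, scorers=[exact_match])
asyncio.run(evaluation.evaluate(my_chatbot))  # inputs, outputs, and scores land in the Weave UI
```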
Evaluation results inform whether optimization techniques improve or diminish performance and quality of an AI application. An AI developer can employ different strategies and techniques depending on which areas of an application require improvement.
The chart below shows LLM optimization strategies, but these are equally applicable to optimizing AI applications:
Techniques frequently included in optimization efforts include:

LLM selection

Different large language models perform differently. Higher-quality LLMs may cost more, but they also provide greater accuracy. Particular LLMs have been touted for greater ability at generating programming code, solving math problems, or summarizing content. An AI application can produce markedly different results using one LLM versus another.

Prompt engineering

Prompt engineering affects AI applications by enhancing the relevance and quality of the underlying LLM output. Crafting the right prompt and choosing appropriate LLM API settings ensure that the model understands the context and intent, leading to more accurate and coherent results.
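For example, two candidate prompt templates for the same task can be evaluated against the same dataset to see which yields better accuracy, latency, and cost. The template text below is purely illustrative.

```python
# Two candidate prompt templates for the same task.
PROMPT_V1 = "Answer the question: {question}"

PROMPT_V2 = (
    "You are a concise support assistant for an airline.\n"
    "Answer only from the provided policy context. If you are unsure, say so.\n"
    "Question: {question}\n"
    "Answer in at most two sentences."
)

def build_prompt(template: str, question: str) -> str:
    return template.format(question=question)
```

Evaluating both variants with the same dataset and scorers makes the impact of the prompt change measurable rather than a matter of vibes.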

RAG optimization

Retrieval-augmented generation (RAG) optimization is valuable when building AI applications because it allows models to leverage external data repositories for contextually rich responses. By optimizing the process to supply the most pertinent information, RAG ensures that the LLM has access to high-quality content, directly impacting the coherence and relevance of the generated text.
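The sketch below shows the basic shape of a RAG step: retrieve the most relevant documents, then build an augmented prompt. It uses naive keyword overlap for brevity; real systems typically use embedding-based vector search, but the structure is the same.

```python
def retrieve(query: str, documents: list[str], top_k: int = 3) -> list[str]:
    # Rank documents by how many query terms they share (a stand-in for vector search).
    query_terms = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(query_terms & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]

def build_rag_prompt(query: str, documents: list[str]) -> str:
    context = "\n\n".join(retrieve(query, documents))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )
```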

Fine-tuning

Fine-tuning an LLM for AI applications allows the model to adapt to specific domain knowledge, user preferences, and contextual nuances, resulting in outputs that are more relevant and aligned with particular use cases. This process enhances the model’s ability to understand and generate content that resonates with the target audience, thereby improving overall performance.
AI application optimization may not require using every one of these techniques, but evaluations should be conducted every time application code is updated in any way that will affect the output provided to end-users.

Evaluation datasets

Effective LLM evaluations start with high-quality data (recall the 'garbage in, garbage out' axiom). Therefore, it’s imperative to design evaluation datasets with a strong emphasis on quality to ensure meaningful evaluations.
Hamel Husain’s excellent blog post Creating a LLM-as-a-Judge That Drives Business Results includes a thorough explanation of why properly structuring datasets for your evaluation is critical to producing the best results. He mentions the following dimensions to consider when building the dataset:
  • Features: Specific functionalities of your AI product. Examples include email summarization, order tracking, and content recommendation.
  • Scenarios: Situations or problems the AI may encounter and needs to handle. Examples include ambiguous requests, incomplete data provided by end-users, and system errors.
  • Personas: Representative user profiles with distinct characteristics and needs. Examples include new users, expert users, and technophobes.
Put simply, evaluation datasets are not one size fits all. Audience and context must always be considered as part of the effort to create a dataset reflective of the input to be expected in “the real world.”
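A dataset row can carry these dimensions explicitly, which later makes it easy to see which features, scenarios, or personas an application struggles with. The values in the sketch below are illustrative placeholders.

```python
# A sketch of an evaluation dataset organized by feature, scenario, and persona.
evaluation_dataset = [
    {
        "feature": "order tracking",
        "scenario": "ambiguous request",
        "persona": "new user",
        "input": "Where's my stuff?",
        "expected_behavior": "Ask for an order number before answering.",
    },
    {
        "feature": "email summarization",
        "scenario": "incomplete data",
        "persona": "expert user",
        "input": "Summarize the thread below:\n(empty thread)",
        "expected_behavior": "Explain that there is nothing to summarize.",
    },
]
```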
An AI application’s development stage determines the starting point for building an effective dataset. For productionized AI applications, production data logged from the application itself can be used as the foundation for an evaluation dataset. This extremely valuable data presumably includes:
  • Interactions with real-life end-users
  • Authentic edge cases entered by real-life end-users
  • Actual language and behavior from real-life end-users
  • True usage performance metrics from real-life end-users
The common theme here: real world end-users. The best data to fuel evaluations of your AI application is the data captured from your intended audience using your product. Unfortunately, this data may be limited in volume or non-existent if this is a brand new AI application.
When real-world data is scarce or unavailable, focus shifts to synthetically generating an evaluation dataset. As Husain notes in his blog, “Often, you’ll do a combination of [using] both [real-world data and generating synthetic data] to ensure comprehensive coverage. Synthetic data is not as good as real data, but it’s a good starting point. Also, we are only using LLMs to generate the user inputs, not the LLM responses or internal system behavior.”
The very LLMs that power our AI applications can also help us generate realistic data. Husain offers some great prompt strategies for generating this input data. When generating data, it is important to consider the following:
  • Use personas to guide generation: Imitate the behavior and voice of the personas that will be using your AI application to generate relevant input queries.
  • Create diverse scenarios: Make sure not to repeat identical or near-identical queries over and over again. Consider the different requests that an end-user might present and generate input data accordingly.
  • Simulate edge cases: Edge cases are often unexpected scenarios. But, if possible, try to use past experience or imagine how an end-user might use your AI application in an unexpected way.
  • Validate with domain experts: Ensuring that your input data is validated by experts is crucial. Unrecognized hallucinations and inaccuracies not only jeopardize your AI application evaluations but could also compromise the overall performance of your final product.
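A generation prompt that bakes in a persona and a scenario might look like the sketch below. The persona and scenario lists are illustrative, and the resulting prompt can be sent to whichever LLM you prefer to generate the user inputs.

```python
# Sketch of a persona- and scenario-driven prompt for generating synthetic
# *user inputs* (not responses). All values are illustrative placeholders.
PERSONAS = ["first-time user unfamiliar with the product", "impatient expert user"]
SCENARIOS = ["ambiguous request", "missing order number", "system error message pasted in"]

def generation_prompt(feature: str, persona: str, scenario: str, n: int = 5) -> str:
    return (
        f"You are simulating a {persona} interacting with an AI assistant's "
        f"'{feature}' feature.\n"
        f"Scenario: {scenario}.\n"
        f"Write {n} distinct messages this user might send. Vary tone, length, "
        f"and wording. Return one message per line."
    )

print(generation_prompt("order tracking", PERSONAS[0], SCENARIOS[1]))
```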

Evaluation scorers

W&B Weave Scorers evaluate AI outputs and provide essential metrics. Once you have your dataset ready, the next step is to select the appropriate Scorers for your AI application. Weave includes several pre-built Scorers tailored to common AI use cases. Additionally, you can create custom Scorers or leverage those from other frameworks and libraries.
There are two categories of scorers frequently used when evaluating LLMs: human annotation scorers and programmatic scorers. Each possesses specific strengths and weaknesses, making it suited to different evaluation purposes.
A human annotation scorer enlists an individual or a team responsible for manually evaluating and assigning scores or labels to the outputs generated by AI models, particularly LLMs. This process often involves assessing various aspects of the model's performance, including accuracy, relevance, coherence, fluency, and adherence to specific guidelines or criteria.
Some advantages of human annotation scorers are:

Contextual understanding

Human annotators possess the ability to interpret nuances, context, and subtleties in language that automated systems may overlook. This human insight is essential for evaluating the quality of generated responses, particularly in complex or ambiguous scenarios.

Subjectivity and quality assessment

Some aspects of language output, such as creativity, coherence, or conversational appropriateness, can be inherently subjective. Human annotators can provide richer feedback on these qualitative aspects, making their evaluations crucial for applications where user experience is paramount.

Iterative improvement

Human annotation allows for deeper engagement with the model’s output, facilitating discussions about potential improvements and guiding further training or fine-tuning. This iterative process can lead to enhanced model performance by identifying specific areas that require attention.


In contrast to human annotators, a programmatic scorer is an automated evaluation system designed to assess the performance of AI-generated outputs using predefined algorithms and metrics. These scoring systems can provide quantitative assessments of language model performance, often used to compare different models or iterations systematically and quickly.
For less complicated evaluations, such as searching for specific words or punctuation in LLM output, a regular expression in an if-then-else statement will suffice. The simpler, the better. There are also occasions where including a “human-in-the-loop” would be ideal but, for reasons of scale, impractical. For such scenarios, an LLM-as-a-Judge is often the best approach. This involves prompting an LLM to assess the quality of outputs based on criteria such as coherence, relevance, or creativity. For example, these types of scorers are often used to determine whether output contains dreaded LLM hallucinations or misleading information.
The combination of more deterministic conditional-based code scorers and non-deterministic LLM-as-a-Judge scorers offers a comprehensive approach to evaluation, allowing for both strict if-then-else style conditional assessments and more nuanced, qualitative analysis of AI-generated outputs. This dual approach enhances the reliability and depth of evaluations, ultimately contributing to better model refinement and performance.
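The sketch below illustrates that combination: a deterministic regex check alongside an LLM-as-a-judge scorer. Here, `call_judge_llm` is a hypothetical function that sends the judge prompt to your chosen LLM and returns its text response; the prompt wording is an assumption, not a prescribed template.

```python
import re

def contains_refund_window(output: str) -> dict:
    # Deterministic check: did the answer cite a refund window like "30 days"?
    return {"mentions_refund_window": bool(re.search(r"\b\d+\s?days?\b", output))}

JUDGE_PROMPT = (
    "You are grading an AI assistant's answer.\n"
    "Question: {question}\n"
    "Answer: {output}\n"
    "Reply with exactly one word, 'pass' or 'fail': does the answer stay "
    "factually consistent with the question and avoid unsupported claims?"
)

def llm_judge(question: str, output: str, call_judge_llm) -> dict:
    # Non-deterministic check: ask a judge LLM for a pass/fail verdict.
    verdict = call_judge_llm(JUDGE_PROMPT.format(question=question, output=output))
    return {"judge_pass": verdict.strip().lower().startswith("pass")}
```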
Some advantages of programmatic scorers are:

Consistency and scalability

Automated scoring systems provide consistent evaluations across large datasets, making them highly efficient for benchmarking and performance comparisons. This scalability is essential for rapidly evaluating multiple models or iterations without the time and resource constraints associated with human evaluation.

Speed and cost

Programmatic scorers can quickly process vast amounts of data, significantly reducing the time and cost associated with the evaluation process and making it feasible to conduct extensive testing more often.

Quantitative metrics

Automated evaluations can generate quantitative metrics such as BLEU, ROUGE, or accuracy scores, enabling a straightforward comparison of model performance. These metrics help establish baseline performance and facilitate systematic improvements through experimentation.
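To illustrate how such metrics are computed, here is a simplified ROUGE-1-style recall (unigram overlap). For real reporting, use an established implementation rather than this sketch.

```python
def rouge1_recall(reference: str, candidate: str) -> float:
    # Fraction of reference tokens that also appear in the candidate.
    ref_tokens = reference.lower().split()
    cand_tokens = set(candidate.lower().split())
    if not ref_tokens:
        return 0.0
    overlap = sum(1 for token in ref_tokens if token in cand_tokens)
    return overlap / len(ref_tokens)

print(rouge1_recall("the cat sat on the mat", "a cat sat on a mat"))  # ≈ 0.67
```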

Evaluation comparisons



Optimization is the result of comparing evaluations to determine which combination of objects, such as prompts, RAG content, or specific LLMs, has produced the best-performing AI application. To make visual and programmatic comparisons as simple as possible, Weave (a) records all inputs, outputs, code, results, and other relevant data points from every single evaluation, and (b) provides an intuitive and coherent interface to examine and explore the differences between any two or more evaluation runs.
The interactive “Compare Evaluations” dashboard in Weave not only visualizes and summarizes all evaluation results, but allows the developer to highlight only the differences between multiple evaluations. A developer can even define one evaluation as the baseline against which multiple other evaluations and their respective metrics are compared. Additionally, single-click drill-down is available for every important object, including Weave models and Traces.
A developer can focus not only on the differences in results but also quickly understand the details of each object underlying those differences. Knowing how the evaluation results differ is important, but understanding why they differ enables faster optimization, better decision-making, and shorter time to value.
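As a sketch of how two comparable runs might be produced, the example below evaluates two placeholder model variants against the same dataset and scorer, assuming the same Weave API conventions as the earlier sketch.

```python
import asyncio
import weave

weave.init("my-chatbot-evals")  # placeholder project name

dataset = [{"question": "What is the refund window?", "expected": "30 days"}]

@weave.op()
def expected_in_output(expected: str, output: str) -> dict:
    return {"correct": expected.lower() in str(output).lower()}

@weave.op()
def chatbot_baseline(question: str) -> str:
    return "Refunds are available."  # placeholder baseline behavior

@weave.op()
def chatbot_candidate(question: str) -> str:
    return "Refunds are available within 30 days."  # placeholder revised behavior

# Two runs over the same dataset and scorer become directly comparable in the
# "Compare Evaluations" view, with the first run serving as the baseline.
evaluation = weave.Evaluation(dataset=dataset, scorers=[expected_in_output])
asyncio.run(evaluation.evaluate(chatbot_baseline))
asyncio.run(evaluation.evaluate(chatbot_candidate))
```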

Conclusion

Evaluations provide insight. Every effort to optimize an AI application is likely to affect the resulting output. Add to this the fact that LLMs are non-deterministic, and the need for constant and thorough evaluation is obvious. The lightning-fast pace of LLM and AI innovation compels AI developers to constantly and relentlessly evaluate AI applications. Putting the best possible application in front of customers and other end-users demands it.
Every interaction between an end-user and your application reflects on your company and brand. Imagine a customer walking away from a discussion with a representative from your company in a state of confusion or doubt. Or, even worse, walking away satisfied but unaware that they have been given inaccurate information. It is as crucial to avoid negative encounters with your company's AI application as it is to ensure that customers do not face unpleasant interactions with your employees.
Many people are apprehensive about AI and the content it generates. Trust is essential to a successful AI experience, both for those building AI applications and for those using them. With robust evaluation functionality and clear visibility into the results, Weights & Biases is the AI developer platform for building applications that your end-users can trust.
Download Weights & Biases’ whitepaper, Evaluating AI agent applications, to discover proven strategies to help you accelerate your path from prototype to production.
Sign up for a 30-minute personalized demo to see firsthand how W&B Weave helps you build AI applications with confidence.
To try building an AI application using W&B Weave today, please sign up for a free account at: https://wandb.me/tryweave
You can also sign up for the free Weights & Biases LLM Apps: Evaluation course to learn:
  • Key principles, implementation methods, and appropriate use cases for LLM application evaluation.
  • How to create a working LLM-as-a-judge.
  • How to align your auto-evaluations with minimal human input.
And lastly, for more detailed information about Weave evaluations, check out our product docs.

