
Why you need evals: A primer and new techniques for evaluating LLMs

Do you need better evaluation systems for your gen AI app? Chances are you do. But don't worry: so does everyone else.
Measuring the performance of AI systems is the biggest blocker to putting them into production. By implementing a strong evaluation system, you save time and get a better product out the door faster.
However, writing good evaluations isn't easy. Human evals are expensive, programmatic evals are limited, and LLM evals are themselves in need of evaluation.
In brief, here's what we'll be covering today:

TLDR:

  • Evals are critical to fast development and to deploying an LLM-enabled production application. However, they're tricky to do well.
  • You need to create traces and datasets, decide on your criteria, implement programmatic and LLM evaluators, and use your evals actively during development.
  • You should use programmatic evals where possible (e.g. string comparison, keyword checking, NLP sentiment analysis).
  • LLM as a judge is straightforward but requires work to align with human feedback. It's a good fit for qualitative criteria like verbosity, but requires care for things like hallucination checking and accuracy.
  • The development cycle of identifying problems, modifying the prompt, running your evals, and adding to the dataset is key for rapid development.
  • The most pressing research challenges in evals revolve around their complexity of implementation, their alignment with human evaluation, their need to change over time, and the lack of solid tools for building them.

Value you should expect from evals

  • Dramatically faster prompt improvements
  • Fewer regressions when you change prompts, models, or other parts of your LLM pipeline
  • A more reliable, production-ready product by default
  • Confidence in what you ship

How to implement evals for your LLM-powered app

At a high level, the process for implementing evals follows the bullet points below. (And don't worry, we'll get into them in detail afterwards.)
  • Create Traces for your app
  • Turn those Traces into datasets (example inputs and outputs). It's worth noting you can do this with higher-level functions too, not just LLMs.
  • Observe problems, decide what metrics you should create and evaluate on. For example: correct formatting, correctness according to experts, etc.
  • Implement evaluators. Use programmatic evaluators where possible. Otherwise, LLMs as judges work well when executed carefully.
  • Compare against human-labeled data to make sure your LLM judges are aligned (a quick sketch of this follows the list).
  • Run your new set of evals during development, after making changes, and in production to ensure continued quality.
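To make the alignment step concrete, here's a minimal sketch of checking agreement between an LLM judge and human reviewers. The data structure and field names are hypothetical placeholders, not from any library:
# Hypothetical records pairing an LLM judge's verdict with a human reviewer's verdict
judged_examples = [
    {"judge_verdict": True, "human_verdict": True},
    {"judge_verdict": False, "human_verdict": True},
    {"judge_verdict": True, "human_verdict": True},
]

# Simple agreement rate; if it's low, revise the judge prompt before trusting it
agreement = sum(
    r["judge_verdict"] == r["human_verdict"] for r in judged_examples
) / len(judged_examples)
print(f"Judge/human agreement: {agreement:.0%}")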

Implementing traces

Traces let you track and version objects and function calls in your applications. You can use a tool like Weights & Biases Weave to implement LLM traces in two lines of code and function traces with a simple decorator on any function. Some code to get you going:
# uv pip install weave
import weave
import your_llm_client

weave.init("my-project") # every LLM call in the app is traced with this one line

@weave.op # this function now tracks inputs, outputs, latency, cost, etc.
def respond_to_message(message: str):
    response = your_llm_client.generate(message)
    return response

Programmatic evals

Any criterion of yours that can be tested programmatically should be tested programmatically. This has key benefits: programmatic checks are nearly free to run, and they're deterministic, so the same output always gets the same score. For example, checking an output's length or basic format takes only a few lines of Python (see the sketch below); later sections cover keyword checks, link validation, and JSON validation.
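As a minimal illustration, here's a deterministic formatting check. The is_valid_response name and the 500-character limit are illustrative choices, not from any library:
# A simple, deterministic programmatic check; the length limit is illustrative
def is_valid_response(model_output: str) -> bool:
    text = model_output.strip()
    # Response must be non-empty and under 500 characters
    return 0 < len(text) <= 500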

Weave Evals

You'll want to use a tool for tracking datasets and eval runs. We recommend W&B Weave; its documentation includes quick-start code.
You'll need three things to run evals:
  • A dataset, basically a list of dictionaries
  • One or more scoring functions, which receive the model's output and any dataset columns they need
  • A Weave model
Here's an example of all three:
# uv pip install weave anthropic python-dotenv
import asyncio
import os

import weave
from anthropic import Anthropic
from dotenv import load_dotenv

load_dotenv()

weave.init("evals-example")

# dataset
dataset = [
    {"input": "Apple", "correct_answer": "Fruit"},
    {"input": "Tomato", "correct_answer": "Fruit"},
    {"input": "Carrot", "correct_answer": "Vegetable"}
]

# evaluator
# this can return a boolean, a number score, or a dict
# Weave passes the model's output and any matching dataset columns to each scorer
@weave.op()
def exact_match(correct_answer: str, model_output: str) -> bool:
    return model_output.strip() == correct_answer

# the evaluation ties the dataset and scorers together
evaluation = weave.Evaluation(
    name='fruit_eval',
    dataset=dataset,
    scorers=[exact_match]
)

# llm to generate an output
class AnthropicChatbot(weave.Model):
    model_name: str = "claude-3-haiku-20240307"
    system_prompt: str = (
        "You are a fruit expert. Given one word, specify whether the input is a "
        "'Fruit' or a 'Vegetable'. Only return that one word, with no other commentary."
    )

    @weave.op()
    def predict(self, input: str) -> str:
        client = Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
        response = client.messages.create(
            model=self.model_name,
            max_tokens=10,
            system=self.system_prompt,
            messages=[{"role": "user", "content": input}],
        )
        return response.content[0].text

model = AnthropicChatbot()
print(asyncio.run(evaluation.evaluate(model)))

The sections below walk through a few more common programmatic checks, from simple string comparison to link and JSON validation.

Strict right-answer comparison / exact string comparison

When there is exactly one correct answer, compare the model's output to it directly:
# this can return a boolean, a number score, or a dict
def exact_match(datapoint):
    return datapoint["output"] == datapoint["correct_answer"]

Keyword checking

Sometimes an exact match is too strict; instead, check that the key term appears somewhere in the answer:
# check if a keyword is contained in the answer
def keyword_match(datapoint):
    return datapoint["correct_answer"] in datapoint["output"]

NLP tone evaluation

Tone matters for customer-facing apps: a technically correct answer that reads as curt or negative can still be a bad answer. Classic NLP libraries such as Python's NLTK can score sentiment cheaply and deterministically; see https://www.datacamp.com/tutorial/text-analytics-beginners-nltk for a deeper walkthrough.
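Here's a minimal sketch using NLTK's VADER sentiment analyzer; the pass threshold of 0 is an arbitrary, illustrative choice:
# uv pip install nltk
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time lexicon download
sia = SentimentIntensityAnalyzer()

# VADER's compound score ranges from -1 (most negative) to +1 (most positive)
def has_positive_tone(model_output: str, threshold: float = 0.0) -> bool:
    return sia.polarity_scores(model_output)["compound"] >= threshold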

Link validation

This check can take various forms, but generally it involves making sure any link the model outputs appears in a known list of valid links (for example, the links supplied in the prompt or retrieved context).
import re

# check to see if any link mentioned is present in a list of valid links
valid_links = [
    "https://store.com/about_us",
    "https://store.com/product1",
    "https://store.com/product2"
]

@weave.op()
def are_links_valid(model_output: str) -> bool:
    # Use a regular expression to check for links
    url_pattern = r'https?:\/\/(?:www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b(?:[-a-zA-Z0-9()@:%_\+.~#?&//=]*)'
    url_regex = re.compile(url_pattern, re.IGNORECASE)

    urls = url_regex.findall(model_output)
    # Check to see if found URLs are valid or not
    for url in urls:
        if url not in valid_links:
            return False
    return True

JSON validation

There are many libraries that assist with producing and validating structured JSON. One popular choice is the Instructor library, which uses Pydantic under the hood. Beyond the structured-output features LLM providers offer (tool calling, JSON mode), there are also BAML, Outlines, Guidance, and many other options. A simple approach is to parse the output and validate it against a Pydantic model:
import json
from pydantic import BaseModel, ValidationError

# validate it as a Pydantic object here
class UserProfile(BaseModel):
    name: str
    age: int
    email: str

def validate_user_profile(llm_output: str) -> bool:
    try:
        # Attempt to parse the LLM output as JSON and validate it against the UserProfile model
        user_data = json.loads(llm_output)
        UserProfile(**user_data)
        return True
    except (json.JSONDecodeError, ValidationError):
        return False

# Example usage:
# llm_output = '{"name": "John Doe", "age": 30, "email": "john@example.com"}'
# is_valid = validate_user_profile(llm_output)

LLM evals (LLM as a judge)

These are faster and cheaper than human evaluations, but they do often require human evaluations and oversight to become accurate. A mature AI app development workflow will include both LLM evaluations and human evaluations working in tandem.
LLM as a judge is the most straightforward way to use LLMs to help with evals. You pass your criteria into a language model along with your model's outputs (and sometimes inputs), and have it return a score or a binary verdict, possibly with an annotation. This works well for qualitative criteria such as tone, conciseness, and helpfulness. It can also handle more complex, interconnected criteria like faithfulness to sources, hallucinations, similarity to a gold-standard answer from a human, self-awareness of gaps in ability, or vulnerability to red-teaming.
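As a minimal sketch, here's a conciseness judge written as a Weave scorer, reusing the Anthropic client from the earlier example. The judge prompt, model choice, and PASS/FAIL format are illustrative, and you'd still want to spot-check its verdicts against human labels:
from anthropic import Anthropic
import weave

judge_client = Anthropic()

@weave.op()
def conciseness_judge(model_output: str) -> bool:
    # Ask a small, cheap model to grade the response against a single criterion
    judge_prompt = (
        "You are grading an assistant's reply for conciseness. "
        "Respond with exactly 'PASS' if the reply is concise and to the point, "
        "or 'FAIL' if it is verbose.\n\nReply to grade:\n" + model_output
    )
    response = judge_client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=5,
        messages=[{"role": "user", "content": judge_prompt}],
    )
    return response.content[0].text.strip().upper().startswith("PASS")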


