
Why you need evals: A primer and new techniques for evaluating LLMs

Do you need better evaluation systems for your gen AI app? Chances are you do. But don't worry: so does everyone else.
Measuring the performance of AI systems is the biggest blocker to putting them into production. By implementing a strong evaluation system, you save time and get a better product out the door faster.
However, writing good evaluations isn't easy. Human evals are expensive, programmatic evals are limited, and LLM evals are themselves in need of evaluation.
In brief, here's what we'll be covering today:

TLDR:

  • Evals are critical to fast development and to deploying an LLM-enabled production application. However, they're tricky to do well.
  • You need to create traces and datasets, decide on your criteria, implement programmatic and LLM evaluators, and use your evals actively during development.
  • You should use programmatic evals where possible (e.g. string comparison, keyword checking, NLP sentiment analysis).
  • LLM as a judge is straightforward but requires work to align with human feedback. It's a good fit for qualitative criteria like verbosity, but requires care for things like hallucination checking and accuracy.
  • The development cycle of identifying problems, modifying the prompt, running your evals, and adding to the dataset is key for rapid development.
  • The most pressing research challenges in evals revolve around their complexity of implementation, their alignment with human evaluation, their need to change over time, and the lack of solid tools for building them.

Value you should expect from evals

  • Dramatically faster prompt improvements
  • Fewer regressions when you change prompts, models, or other parts of your LLM pipeline
  • A more reliable, production-ready product by default
  • Confidence in what you ship

How to implement evals for your LLM-powered app

At a high level, the process for implementing evals follows the bullet points below. (And don't worry, we'll get into them in detail afterwards.)
  • Create Traces for your app
  • Turn those Traces into datasets (example inputs and outputs). It's worth noting you can do this with higher-level functions too, not just LLMs.
  • Observe problems, decide what metrics you should create and evaluate on. For example: correct formatting, correctness according to experts, etc.
  • Implement evaluators. Use programmatic evaluators where possible. Otherwise, LLMs as judges work well when executed carefully.
  • Compare against human-labeled data to make sure your LLM judges are aligned (a quick sketch of this follows the list).
  • Run your new set of evals during development, after making changes, and in production to ensure continued quality.
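To make the alignment step concrete, here's a minimal sketch of checking agreement between an LLM judge and human reviewers. The data structure and field names are hypothetical placeholders, not from any library:
# Hypothetical records pairing an LLM judge's verdict with a human reviewer's verdict
judged_examples = [
    {"judge_verdict": True, "human_verdict": True},
    {"judge_verdict": False, "human_verdict": True},
    {"judge_verdict": True, "human_verdict": True},
]

# Simple agreement rate; if it's low, revise the judge prompt before trusting it
agreement = sum(
    r["judge_verdict"] == r["human_verdict"] for r in judged_examples
) / len(judged_examples)
print(f"Judge/human agreement: {agreement:.0%}")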

Implementing traces

Traces let you track and version objects and function calls in your applications. You can use a tool like Weights & Biases Weave to implement LLM traces in two lines of code and function traces with a simple decorator on any function. Some code to get you going:
# uv pip install weave
import weave
import your_llm_client

weave.init("my-project") # every LLM call in the app is traced with this one line

@weave.op # this function now tracks inputs, outputs, latency, cost, etc.
def respond_to_message(message: str):
    response = your_llm_client.generate(message)
    return response

Programmatic evals

Any criterion of yours that can be tested programmatically should be tested programmatically. This has key benefits: programmatic checks are nearly free to run, and they're deterministic, so the same output always gets the same score. For example, checking an output's length or basic format takes only a few lines of Python (see the sketch below); later sections cover keyword checks, link validation, and JSON validation.
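As a minimal illustration, here's a deterministic formatting check. The is_valid_response name and the 500-character limit are illustrative choices, not from any library:
# A simple, deterministic programmatic check; the length limit is illustrative
def is_valid_response(model_output: str) -> bool:
    text = model_output.strip()
    # Response must be non-empty and under 500 characters
    return 0 < len(text) <= 500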

Weave Evals

You'll want to use a tool for tracking datasets and eval runs. We recommend W&B Weave; its documentation includes quick-start code.
You'll need three things to run evals:
  • A dataset, basically a list of dictionaries
  • One or more scoring functions, which receive the model's output and any dataset columns they need
  • A Weave model
Here's an example of all three:
# uv pip install weave anthropic python-dotenv
import asyncio
import os

import weave
from anthropic import Anthropic
from dotenv import load_dotenv

load_dotenv()

weave.init("evals-example")

# dataset
dataset = [
    {"input": "Apple", "correct_answer": "Fruit"},
    {"input": "Tomato", "correct_answer": "Fruit"},
    {"input": "Carrot", "correct_answer": "Vegetable"}
]

# evaluator
# this can return a boolean, a number score, or a dict
# Weave passes the model's output and any matching dataset columns to each scorer
@weave.op()
def exact_match(correct_answer: str, model_output: str) -> bool:
    return model_output.strip() == correct_answer

# the evaluation ties the dataset and scorers together
evaluation = weave.Evaluation(
    name='fruit_eval',
    dataset=dataset,
    scorers=[exact_match]
)

# llm to generate an output
class AnthropicChatbot(weave.Model):
    model_name: str = "claude-3-haiku-20240307"
    system_prompt: str = (
        "You are a fruit expert. Given one word, specify whether the input is a "
        "'Fruit' or a 'Vegetable'. Only return that one word, with no other commentary."
    )

    @weave.op()
    def predict(self, input: str) -> str:
        client = Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
        response = client.messages.create(
            model=self.model_name,
            max_tokens=10,
            system=self.system_prompt,
            messages=[{"role": "user", "content": input}],
        )
        return response.content[0].text

model = AnthropicChatbot()
print(asyncio.run(evaluation.evaluate(model)))

The sections below walk through a few more common programmatic checks, from simple string comparison to link and JSON validation.

Strict right-answer comparison / exact string comparison

When there is exactly one correct answer, compare the model's output to it directly:
# this can return a boolean, a number score, or a dict
def exact_match(datapoint):
    return datapoint["output"] == datapoint["correct_answer"]

Keyword checking

Sometimes an exact match is too strict; instead, check that the key term appears somewhere in the answer:
# check if a keyword is contained in the answer
def keyword_match(datapoint):
    return datapoint["correct_answer"] in datapoint["output"]

NLP tone evaluation

Tone matters for customer-facing apps: a technically correct answer that reads as curt or negative can still be a bad answer. Classic NLP libraries such as Python's NLTK can score sentiment cheaply and deterministically; see https://www.datacamp.com/tutorial/text-analytics-beginners-nltk for a deeper walkthrough.
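Here's a minimal sketch using NLTK's VADER sentiment analyzer; the pass threshold of 0 is an arbitrary, illustrative choice:
# uv pip install nltk
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time lexicon download
sia = SentimentIntensityAnalyzer()

# VADER's compound score ranges from -1 (most negative) to +1 (most positive)
def has_positive_tone(model_output: str, threshold: float = 0.0) -> bool:
    return sia.polarity_scores(model_output)["compound"] >= threshold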

Link validation

This check can take various forms, but generally it involves making sure any link the model outputs appears in a known list of valid links (for example, the links supplied in the prompt or retrieved context).
import re

# check to see if any link mentioned is present in a list of valid links
valid_links = [
    "https://store.com/about_us",
    "https://store.com/product1",
    "https://store.com/product2"
]

@weave.op()
def are_links_valid(model_output: str) -> bool:
    # Use a regular expression to check for links
    url_pattern = r'https?:\/\/(?:www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b(?:[-a-zA-Z0-9()@:%_\+.~#?&//=]*)'
    url_regex = re.compile(url_pattern, re.IGNORECASE)

    urls = url_regex.findall(model_output)
    # Check to see if found URLs are valid or not
    for url in urls:
        if url not in valid_links:
            return False
    return True

JSON validation

There are many libraries that assist with producing and validating structured JSON. One popular choice is the Instructor library, which uses Pydantic under the hood. Beyond the structured-output features LLM providers offer (tool calling, JSON mode), there are also BAML, Outlines, Guidance, and many other options. A simple approach is to parse the output and validate it against a Pydantic model:
import json
from pydantic import BaseModel, ValidationError

# validate it as a Pydantic object here
class UserProfile(BaseModel):
    name: str
    age: int
    email: str

def validate_user_profile(llm_output: str) -> bool:
    try:
        # Attempt to parse the LLM output as JSON and validate it against the UserProfile model
        user_data = json.loads(llm_output)
        UserProfile(**user_data)
        return True
    except (json.JSONDecodeError, ValidationError):
        return False

# Example usage:
# llm_output = '{"name": "John Doe", "age": 30, "email": "john@example.com"}'
# is_valid = validate_user_profile(llm_output)

LLM evals (LLM as a judge)

These are faster and cheaper than human evaluations, but they do often require human evaluations and oversight to become accurate. A mature AI app development workflow will include both LLM evaluations and human evaluations working in tandem.
LLM as a judge is the most straightforward way to use LLMs to help with evals. You pass your criteria into a language model along with your model's outputs (and sometimes inputs), and have it return a score or a binary verdict, possibly with an annotation. This works well for qualitative criteria such as tone, conciseness, and helpfulness. It can also handle more complex, interconnected criteria like faithfulness to sources, hallucinations, similarity to a gold-standard answer from a human, self-awareness of gaps in ability, or vulnerability to red-teaming.
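As a minimal sketch, here's a conciseness judge written as a Weave scorer, reusing the Anthropic client from the earlier example. The judge prompt, model choice, and PASS/FAIL format are illustrative, and you'd still want to spot-check its verdicts against human labels:
from anthropic import Anthropic
import weave

judge_client = Anthropic()

@weave.op()
def conciseness_judge(model_output: str) -> bool:
    # Ask a small, cheap model to grade the response against a single criterion
    judge_prompt = (
        "You are grading an assistant's reply for conciseness. "
        "Respond with exactly 'PASS' if the reply is concise and to the point, "
        "or 'FAIL' if it is verbose.\n\nReply to grade:\n" + model_output
    )
    response = judge_client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=5,
        messages=[{"role": "user", "content": judge_prompt}],
    )
    return response.content[0].text.strip().upper().startswith("PASS")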


