Why you need evals: A primer and new techniques for evaluating LLMs
Do you need better evaluation systems for your gen AI app? Chances are you do. But don't worry: so does everyone else.
Measuring the performance of AI systems is one of the biggest blockers to putting them into production. By implementing a strong evaluation system, you save time and get a better product out the door faster.
However, writing good evaluations isn't easy. Human evals are expensive, programmatic evals are limited, and LLM evals are themselves in need of evaluation.
In brief, here's what we'll be covering today:
TLDR:
- Evals are critical to fast development and to deploying an LLM-enabled production application. However, it's tricky to do this well.
- You need to create traces and datasets, decide on your criteria, implement programmatic and LLM evaluators, and use your evals actively during development.
- You should use programmatic evals where possible (e.g. string comparison, keyword checking, NLP sentiment analysis).
- LLM as a judge is straightforward but requires work to align with human feedback. It's good for qualitative criteria like verbosity, but needs care for things like hallucination checking and accuracy.
- The development cycle of identifying problems, modifying the prompt, running your evals, and adding to the dataset is key for rapid development.
- The most pressing research challenges in evals revolve around their complexity of implementation, their alignment with human evaluation, their need to change over time, and the lack of solid tools for building them.
Value you should expect from evals
- Dramatically faster prompt improvements
- Fewer regressions from language model process changes
- A more reliable, production-ready product by default
- Confidence
How to implement evals for your LLM-powered app
At a high level, the process for implementing evals follows the bullet points below. (And don't worry, we'll get into them in detail afterwards.)
- Implement traces so you can capture your app's inputs and outputs.
- Turn those traces into datasets (example inputs and outputs). It's worth noting you can do this with higher-level functions too, not just LLMs.
- Observe problems and decide what metrics to evaluate on: for example, correct formatting or correctness according to experts.
- Implement evaluators. Use programmatic evaluators where possible. Otherwise, LLMs as judges work well when executed carefully.
- Compare your LLM judges against human-labeled data to make sure they're aligned (a minimal sketch of this check follows the list).
- Run your new set of evals during development, after making changes, and in production to ensure continued quality.
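Most of these steps get concrete code in the sections that follow; the judge-alignment check doesn't, so here's a minimal sketch of that step, assuming you've collected a human verdict alongside each LLM-judge verdict for a sample of outputs:

# a minimal sketch: measure how often the LLM judge agrees with human labels
labeled_sample = [
    {"judge_verdict": True, "human_verdict": True},
    {"judge_verdict": True, "human_verdict": False},
    {"judge_verdict": False, "human_verdict": False},
]

agreement = sum(
    row["judge_verdict"] == row["human_verdict"] for row in labeled_sample
) / len(labeled_sample)

print(f"Judge/human agreement: {agreement:.0%}")  # 67% here; aim much higher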
Implementing traces
Traces let you track and version objects and function calls in your applications. You can use a tool like Weights & Biases Weave to implement LLM traces in two lines of code and function traces with a simple decorator on any function. Some code to get you going:
# uv pip install weave
import weave
import your_llm_client

weave.init("my-project")  # every LLM call in the app is traced with this one line

@weave.op  # this function now tracks inputs, outputs, latency, cost, etc.
def respond_to_message(message: str):
    response = your_llm_client.generate(message)
    return response
Programmatic evals
Any criterion of yours that can be tested programmatically should be tested programmatically. This has key benefits: programmatic checks are nearly free to run, and they're deterministic, so results are reliable and repeatable. A couple of quick examples follow.
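Here is a minimal sketch of two such checks on a single datapoint dict; the length limit and allowed labels are illustrative assumptions, not requirements:

# two cheap, deterministic checks on a single datapoint dict
def is_short_enough(datapoint):
    # hypothetical criterion: responses should stay under 200 characters
    return len(datapoint["output"]) <= 200

def is_allowed_label(datapoint):
    # hypothetical criterion: the output must be one of the expected labels
    return datapoint["output"] in {"Fruit", "Vegetable"}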
Weave Evals
You'll want to use a tool for tracking datasets and eval runs; we recommend W&B Weave. The quick-start code below will get you going.
You'll need three things to run evals:
- A dataset, basically a list of dictionaries
- One or more scorer functions, which Weave calls with the matching fields from each dataset row plus the model's output
- A Weave model
Here's an example of all three:
# uv pip install weave anthropic python-dotenv
import asyncio
import os

from anthropic import Anthropic
from dotenv import load_dotenv
import weave

load_dotenv()
weave.init("evals-example")

# dataset: a list of dictionaries
dataset = [
    {"input": "Apple", "correct_answer": "Fruit"},
    {"input": "Tomato", "correct_answer": "Fruit"},
    {"input": "Carrot", "correct_answer": "Vegetable"},
]

# evaluator: Weave passes it the matching dataset fields plus the model's output
# (older Weave versions name the output argument model_output)
# it can return a boolean, a number score, or a dict
@weave.op()
def exact_match(correct_answer: str, output: str) -> dict:
    return {"correct": correct_answer == output}

# the evaluation ties the dataset and scorers together
evaluation = weave.Evaluation(
    name="fruit_eval",
    dataset=dataset,
    scorers=[exact_match],
)

# llm to generate an output for each dataset row
class AnthropicChatbot(weave.Model):
    system_prompt: str = (
        "You are a fruit expert. Given one word, specify whether the input is a "
        "'Fruit' or a 'Vegetable'. Only return that one word, with no other commentary."
    )
    model_name: str = "claude-3-haiku-20240307"

    @weave.op()
    def predict(self, input: str) -> str:
        client = Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
        response = client.messages.create(
            model=self.model_name,
            max_tokens=10,
            system=self.system_prompt,
            messages=[{"role": "user", "content": input}],
        )
        return response.content[0].text

model = AnthropicChatbot()
print(asyncio.run(evaluation.evaluate(model)))
That full example ties everything together. In practice, you'll build up a small library of programmatic checks like the ones below; each is cheap and deterministic, so you can run them on every change and even on production traces.
Strict right-answer comparison / exact string comparison
The simplest check: compare the output directly against a known correct answer. This works well when there's exactly one acceptable response.
# this can return a boolean, a number score, or a dict
def exact_match(datapoint):
    return datapoint["output"] == datapoint["correct_answer"]
Keyword checking
When an exact match is too strict, check whether the expected keyword appears anywhere in the output instead.
# check if a keyword is contained in the answer
def keyword_match(datapoint):
    return datapoint["correct_answer"] in datapoint["output"]
NLP tone evaluation
Tone and sentiment checks can get complex, but a library like Python's NLTK covers the basics. They're useful when, say, a customer-facing assistant should never sound negative or dismissive. See https://www.datacamp.com/tutorial/text-analytics-beginners-nltk for a fuller walkthrough.
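A minimal sketch using NLTK's VADER sentiment scorer, assuming a simple "don't sound negative" criterion and a -0.3 compound-score threshold (both are illustrative choices):

# uv pip install nltk
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time download of the VADER lexicon
sia = SentimentIntensityAnalyzer()

def is_not_negative(datapoint):
    # compound score ranges from -1 (very negative) to +1 (very positive)
    compound = sia.polarity_scores(datapoint["output"])["compound"]
    return compound > -0.3  # illustrative threshold; tune to your tolerance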
Link checking
This can take various forms, but generally it means verifying that every link in the model's output appears in an allow-list (for example, links explicitly provided in the prompt) rather than being hallucinated.
import re

# check to see if any link mentioned is present in a list of valid links
valid_links = [
    "https://store.com/about_us",
    "https://store.com/product1",
    "https://store.com/product2",
]

@weave.op()
def are_links_valid(model_output: str):
    # Use a regular expression to find links
    url_pattern = r'https?:\/\/(?:www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b(?:[-a-zA-Z0-9()@:%_\+.~#?&//=]*)'
    url_regex = re.compile(url_pattern, re.IGNORECASE)
    urls = url_regex.findall(model_output)
    # Check whether every found URL is in the allow-list
    for url in urls:
        if url not in valid_links:
            return False
    return True
JSON validation
There are many libraries that help with validating JSON and structured outputs. A popular choice is the Instructor library, which uses Pydantic under the hood; most LLM providers also offer tool calling and JSON modes, and there are alternatives like BAML, Outlines, and Guidance.
import json

from pydantic import BaseModel, ValidationError

# define the expected schema as a Pydantic model
class UserProfile(BaseModel):
    name: str
    age: int
    email: str

def validate_user_profile(llm_output: str):
    try:
        # Attempt to parse the LLM output as JSON and validate it against the UserProfile model
        user_data = json.loads(llm_output)
        UserProfile(**user_data)
        return True
    except (json.JSONDecodeError, ValidationError):
        return False

# Example usage:
# llm_output = '{"name": "John Doe", "age": 30, "email": "john@example.com"}'
# is_valid = validate_user_profile(llm_output)
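If you'd rather have the model produce validated objects directly, the sketch below shows the general shape of using Instructor with the Anthropic client and the same hypothetical UserProfile model; treat the exact call as an assumption and check Instructor's docs for the current API.

# uv pip install instructor anthropic
import instructor
from anthropic import Anthropic
from pydantic import BaseModel

class UserProfile(BaseModel):
    name: str
    age: int
    email: str

# patch the client so responses are parsed and validated into Pydantic models
client = instructor.from_anthropic(Anthropic())

profile = client.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Extract: John Doe, 30, john@example.com"}],
    response_model=UserProfile,  # Instructor validates (and retries) against this schema
)
print(profile)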
LLM evals (LLM as a judge)
These are faster and cheaper than human evaluations, but they do often require human evaluations and oversight to become accurate. A mature AI app development workflow will include both LLM evaluations and human evaluations working in tandem.
LLM as a judge is the most straightforward way to use LLMs to help with evals. You pass your criteria into a language model along with your model's outputs (and sometimes inputs), and have it return a score or a binary verdict, possibly with an annotation. This works well for qualitative criteria such as tone, conciseness, and helpfulness, and, with more care, for complex, interconnected criteria like faithfulness to sources, hallucination, similarity to a gold-standard human answer, self-awareness of gaps in ability, or vulnerability to red-teaming.
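A minimal sketch of a judge scorer, assuming a conciseness criterion, a 1-5 scale, and Claude 3 Haiku as the judge model (all illustrative choices):

import json

from anthropic import Anthropic
import weave

judge_instructions = (
    "You are grading a chatbot response for conciseness. "
    "Score it from 1 (rambling) to 5 (as brief as possible while still answering). "
    'Respond with JSON only, e.g. {"score": 4, "reason": "..."}.'
)

@weave.op()
def conciseness_judge(output: str) -> dict:
    # ask a second model to grade the output against the criterion
    client = Anthropic()
    message = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=200,
        system=judge_instructions,
        messages=[{"role": "user", "content": f"Response to grade:\n{output}"}],
    )
    # in production, guard this parse with retries or a JSON-repair step
    return json.loads(message.content[0].text)

Spot-check these judge verdicts against human labels (as in the agreement sketch earlier) before trusting them to gate releases.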
