How to Evaluate, Compare, and Optimize LLM Systems
This article provides an interactive look at how to evaluate your large language model (LLM) systems and how to approach optimizing their hyperparameters.
The space of large language models (LLMs) has exploded since the public release of ChatGPT (arguably more of an agent than a bare LLM) and GPT-4. We'll only see more advancements moving forward and, attendant to that, more systems (tools, bots, services, etc.) that use LLMs under the hood.
Still, many of the examples out there are toy examples, demos, or proofs of concept. Right now, only a handful of LLM systems are in production, and perhaps the biggest hurdle has been the faithful evaluation of such systems.
Consider a simple example of a medical QA bot. You'd input a diagnosis (query), and the system might recommend medicines with dosages (response). Such a system will retrieve information (context) from a medicine database, collate the context and the query (prompt), and finally use an LLM to generate a response.
But how would you ensure the response is correct? After all, in a medical use case, getting this right matters a ton. False positives or negatives can have life-changing repercussions.
So how can we go about evaluating an LLM system like this? This article will:
- Examine evaluation as a concept for LLM-based systems. We'll start with a simple example and build up to evaluating a system that does question-answering over documents.
- We'll also dig into how to leverage hyperparameter optimization (HPO) to find a better-tuned LLM-based system.
If you like this report and are building LLM apps, we also have a free course you can sign up for by clicking the button below.
Here's what we'll be covering in detail:
Table of Contents
- Straightforward LLM Evaluation
  - 1. Eyeballing
  - 2. Human Annotation, a.k.a. Supervised Evaluation
  - The Calculator, aka a simple LLM-based system
  - Supervised Evaluation
  - Hyperparameter Optimization
  - Observations
- LLMs Evaluating LLMs
  - 1. Generate Eval Dataset Using An LLM
  - Langchain's QAGenerationChain
  - 2. Metrics
    - 1. LLMs as a Metric?
    - 2. Standard Metrics
  - Hyperparameter Optimization
  - Observations
  - What can be improved in the evaluation strategy?
  - How about evaluating the retrieval system separately?
- Conclusion
- Related Resources
Try out the code here
Straightforward LLM Evaluation
There's a difference between evaluating an LLM and evaluating an LLM-based system. After all, today's large language models can do multiple tasks: they can summarize text, answer questions, classify sentiment, translate, and more.
Typically after pre-training (generally on a huge dataset), LLMs are evaluated on standard benchmarks — GLUE, SQuAD 2.0, and SNLI, to name a few — using standard metrics. But these LLMs might not be useful to us out of the box. Here's why:
- We might have to fine-tune the LLM on our "private" dataset for our particular use case. In this case, evaluation is usually straightforward: we have a ground-truth dataset against which we evaluate our fine-tuned model, mostly using standard metrics. However, fine-tuning should not be the first thing we consider, given that it's expensive and time-consuming. To continue our example above, our medical system might be fine-tuned on patient data or medical research that isn't available to an out-of-the-box LLM.
- LLMs are powerful, but with well-thought-out pre- and post-processing around them, we can build LLM-based systems that, in many cases, perform well enough. Building such a system has gotten easier thanks to tools like LangChain, LlamaIndex, and others. However, it is still tricky to find the right components (more on this later) and to evaluate the system properly.
Here are a few straightforward evaluation tactics, starting with the most straightforward of all:
1. Eyeballing
When we start building an LLM-based system, we usually start our evaluation by eyeballing the responses from the system. We usually have a few inputs and expected responses, and we tune and build the system by trying different components, prompt templates, etc. This gives us a good proof of concept, but systems must be evaluated thoroughly.
To support this workflow, we suggest using W&B Prompts. In fact, if you use Langchain or LlamaIndex, you can use Prompts out of the box. You can also instrument your chains/plug-ins with Prompts.
Below, you'll see an example using LlamaIndex to query over a simple document. If your workflow involves eyeballing, this tool will let you see what's happening under the hood. Feel free to click around this Prompts example. It's interactive:
[W&B Prompts trace (interactive in the original report): a top-level query span (~4.3s) wrapping retrieve → embedding (~0.2s) and synthesize → llm (~4.1s) child spans. Input query: "What did the author do growing up?" Response: the author grew up writing essays, learning Italian, exploring Florence, painting people, working with computers, attending RISD, living in a rent-stabilized apartment, building an online store builder, editing Lisp expressions, publishing essays online, painting still life, working on spam filters, cooking for groups, and buying a building in Cambridge.]
2. Human Annotation, a.k.a. Supervised Evaluation
The best and most reliable way to evaluate an LLM system is to create an evaluation dataset for each component of the LLM-based system. The biggest downside of this approach is the cost and time required to create such a dataset. Depending on the LLM-based system, the design of the evaluation dataset can also be challenging.
You might question creating a new dataset and wonder why we wouldn't use standard benchmarks. The idea of building an LLM-based system is to bring private data to the LLM without actually fine-tuning it. Ideally, we'd want to evaluate the system on our private data, a.k.a. our "domain."
Let's take an example of a simple LLM-based system - a calculator. The calculator can solve a given mathematical expression using the BODMAS rule (fun fact: this is also known as the PEMDAS or BEDMAS rule, depending on where you grew up).
The Calculator, aka a simple LLM-based system
Our calculator will accept a mathematical expression with brackets, symbols, etc. It will compute and return the answer using an underlying LLM (e.g., GPT4). One can build it in a few lines of code using Langchain.
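To make this concrete, here's a minimal sketch of what such a calculator chain might look like with LangChain. The prompt wording and model choice here are illustrative assumptions, not the exact setup used later in this report:

```python
# A minimal sketch of an LLM-based calculator using LangChain.
# Prompt wording and model choice are illustrative assumptions.
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

prompt = PromptTemplate(
    input_variables=["question"],
    template=(
        "The following is a mathematical expression provided by the user.\n"
        "{question}\n"
        "Solve it using the BODMAS rule and return only the final number."
    ),
)

llm = ChatOpenAI(model_name="gpt-4", temperature=0.0)
calculator = LLMChain(llm=llm, prompt=prompt)

print(calculator.run("(2 + 3) * 4 - 6 / 2"))  # expected: 17
```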
Supervised Evaluation
Surprisingly, without proper tuning of the system, GPT-4 often fails to solve the expressions correctly. This raises the question: before we discuss evaluation, what are the different things we can tune?
We can experiment with different LLMs (gpt-4, gpt-3.5-turbo, etc.), experiment with different prompts, tweak the arguments (like temperature) provided to the API, and more. The "more" will depend on the system you are trying to build.
We could also tune blindly using the eyeballing technique, but as stated previously, that will not give us the confidence required to push the system to production. In the example of our calculator, we use a Python function to evaluate the expressions and call it the "human-annotated evaluation dataset." (Yes, this is simulated data, but the aim is to showcase how to evaluate such a system by collecting relevant data.)
The W&B Table below shows the true_result, which is human-annotated eval data (in our case, generated using a Python function), while the pred_result is computed by our LLM calculator. We are also logging the token counts and the cost of generating the response.
Using LangChain, you can track your token usage and cost. However, this is currently only implemented for the OpenAI API.
💡
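As a rough sketch (assuming the `calculator` chain from the snippet above and a tiny stand-in eval set), token usage and cost can be captured with LangChain's `get_openai_callback` and logged to a W&B Table alongside the predictions:

```python
# Sketch: track OpenAI token usage/cost per prediction and log results to a W&B Table.
# `calculator` is the LLMChain sketched above; the eval pairs here are a tiny
# stand-in for the 106 programmatically generated (expression, answer) pairs.
import wandb
from langchain.callbacks import get_openai_callback

eval_dataset = [("(2 + 3) * 4", "20"), ("10 - 2 ** 3", "2"), ("18 / (3 + 3)", "3.0")]

wandb.init(project="llm-calculator-eval")
table = wandb.Table(columns=["question", "true_result", "pred_result",
                             "total_tokens", "cost_usd"])

for question, true_result in eval_dataset:
    with get_openai_callback() as cb:      # tracks OpenAI tokens and cost for this call
        pred_result = calculator.run(question)
    table.add_data(question, true_result, pred_result,
                   cb.total_tokens, cb.total_cost)

wandb.log({"eval_predictions": table})
wandb.finish()
```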
Depending on the use case and the availability of such an eval dataset, you can choose the most relevant metric(s). In this instance, accuracy is good enough. Below, you can see the accuracy of the LLM calculator on the eval dataset, along with the cost of making 106 predictions.
Surely, we can improve the performance of our LLM-based calculator. (In fact, we'll be doing so in just a moment here.)
Hyperparameter Optimization
Setting up a hyperparameter optimization makes sense when we have an eval dataset and multiple components to tweak. It's fairly straightforward using W&B Sweeps. We'll try out three different prompt templates, three models (text-davinci-003, gpt4, gpt-3.5-turbo), and temperatures in the range of 0-1.
NOTE: Temperature is a parameter where we adjust how "creative" our model's responses are. Lower temperatures produce less novel outputs with less randomness. Higher temperatures produce more diverse content but also increase the odds our model will move a bit out of context.
💡
The different prompt templates we'll try:
1. maths_prompt_template_1.txt

The following is the mathematical expression provided by the user.
{question}
Find the answer using the BODMAS rule in the {format_instructions}:

2. maths_prompt_template_2.txt

The following is the mathematical expression provided by the user.
{question}
Find the answer in the {format_instructions}:

3. maths_prompt_template_3.txt

You are an expert mathematician. You can solve a given mathematical expression using the BODMAS rule.
BODMAS stands for Bracket, Orders of Indices, Division, Multiplication, Addition and Subtraction. The computation should happen in that order.
The order is as follows:
B: Solve expressions inside brackets in this order -> small bracket followed by curly bracket and finally square bracket.
O: Solve the indices such as roots, powers, etc.
D: Divide the numbers which are given
M: Multiply the numbers next
A: Sum up the next numbers
S: Subtract the numbers left in the end
The following is the mathematical expression provided by the user.
{question}
Think about it step-by-step. Don't skip steps.
When ready with the answer return in the {format_instructions}:
As you can see, maths_prompt_template_2.txt is the simplest prompt without any mention of the "BODMAS" rule. We mention "BODMAS" in the maths_prompt_template_1.txt template. We give detailed instructions in the final prompt template, maths_prompt_template_3.txt.
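Here's a minimal sketch of how such a sweep might be wired up with the W&B Python API. The tiny eval set, the exact-string check, and the fixed handling of {format_instructions} are assumptions for illustration, not the exact sweep code used for this report:

```python
# Sketch: a W&B sweep over prompt templates, models, and temperature for the calculator.
import wandb
from langchain.chat_models import ChatOpenAI
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

# Tiny stand-in eval set; the report evaluates 106 generated (expression, answer) pairs.
EVAL_SET = [("(2 + 3) * 4", "20"), ("10 - 2 ** 3", "2"), ("18 / (3 + 3)", "3.0")]

sweep_config = {
    "method": "random",
    "metric": {"name": "accuracy", "goal": "maximize"},
    "parameters": {
        "prompt_template_file": {"values": [
            "maths_prompt_template_1.txt",
            "maths_prompt_template_2.txt",
            "maths_prompt_template_3.txt",
        ]},
        "model_name": {"values": ["text-davinci-003", "gpt-3.5-turbo", "gpt-4"]},
        "temperature": {"min": 0.0, "max": 1.0},
    },
}

def evaluate_calculator():
    wandb.init()
    cfg = wandb.config
    template = open(cfg.prompt_template_file).read()
    # The real templates expect {format_instructions}; a fixed instruction is used here.
    prompt = PromptTemplate.from_template(template).partial(
        format_instructions="format: return only the final number"
    )
    llm_cls = OpenAI if cfg.model_name == "text-davinci-003" else ChatOpenAI
    llm = llm_cls(model_name=cfg.model_name, temperature=cfg.temperature)
    chain = LLMChain(llm=llm, prompt=prompt)
    # Naive exact-string check, purely for illustration.
    correct = sum(chain.run(q).strip() == a for q, a in EVAL_SET)
    wandb.log({"accuracy": correct / len(EVAL_SET)})

sweep_id = wandb.sweep(sweep_config, project="llm-calculator-eval")
wandb.agent(sweep_id, function=evaluate_calculator, count=20)
```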
Observations
1. There are three accuracy bands: low (10-40%), medium (60-70%), and high (88%+). The bands are positively correlated with cost: higher accuracy is achieved by spending more money.
2. GPT-4 rules them all. The GPT-4 model gives the highest accuracy.
3. GPT-4 is great, but you need to prompt it correctly. Prompt template 3 yielded ~90% accuracy, while prompt template 1 yielded 60-70%, a jump of more than 20 percentage points in system performance. In our case, being more instructive to the LLM produced better outcomes.
4. A bad model gives bad results, no matter the prompt template. text-davinci-003 didn't perform well with any prompt, resulting in an accuracy of less than 20%. Even gpt-3.5-turbo fails to perform well (<40% accuracy).
The best accuracy comes at a price. As expected, gpt-4 performs best when paired with the most detailed prompt template. This result is unsurprising for a simple LLM-based system, but that might not be the case for a slightly more complex one.
If one can spend time/money to build an evaluation dataset, it will give the most confidence when deploying LLM-based systems evaluated on such a dataset. This is the most methodologically sound way of evaluating your LLM-based system.
💡
LLMs Evaluating LLMs
Try out the code here
LLMs are versatile and show interesting characteristics. They are particularly good at extracting information from the provided text, and they're only getting better. This has led to the evolving practice of using LLMs to evaluate LLMs. The core idea is to use an LLM to generate test cases and then evaluate the LLM-based system on them.
Our LLM-based system will be a Retrieval Augmented QA bot in this section. Such a system has a few components: an embedding model, a retrieval system, and an LLM-powered QA chain. You can learn more about how to build such a system in this LangChain Chat with Your Data Deeplearning.ai course.
The QA bot is built to answer questions based on a paper titled "The Cookbook of Self-Supervised Learning" (arXiv:2304.12210). Let's dive into evaluating such a system. First, we need an evaluation dataset and decide the metrics.
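For reference, a retrieval-augmented QA bot like this can be sketched in a few lines of LangChain. The file name, chunking parameters, and model choices below are illustrative assumptions:

```python
# Sketch: a retrieval-augmented QA bot over the SSL cookbook paper.
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

# Load and chunk the paper (file name is illustrative).
docs = PyPDFLoader("2304.12210.pdf").load()
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(docs)

# Embed the chunks, build a retriever, and wire it to an LLM-powered QA chain.
vectorstore = Chroma.from_documents(chunks, OpenAIEmbeddings())
qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.0),
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),
)

print(qa_chain.run("What is self-supervised learning?"))
```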
1. Generate Eval Dataset Using An LLM
We need pairs of questions and answers in our evaluation set to actually evaluate a QA bot. Since our bot uses an information retrieval (IR) system, we must also consider evaluating it (more on it later).
As stated in the previous section, we can hire human annotators to create gold-standard pairs of questions and answers manually. This is a great method overall, but it is costly and time-consuming. After all, building an evaluation set for a QA bot over medical data needs trained people.
For many such niche use cases—ones where simple crowd-sourcing doesn't work well, and you're far better off with experts—it's hard to find the right talent, leading to higher costs and making it infeasible for individuals and small businesses. One feasible way of creating such a dataset is to leverage an LLM. This approach has obvious benefits and limitations:
- It's scalable. We can generate a vast number of test cases on demand.
- It's flexible. The test cases can be generated for special edge cases and adapted to multiple domains, ensuring relevance and applicability.
- It's cheap and fast. LLMs can quickly collate information from multiple documents at a far lower price.
As for limitations, we covered the biggest one above: the use cases that most need expert labelers (for example, the medical domain) are exactly the ones where LLM-generated test cases are hardest to trust on their own.
Langchain's QAGenerationChain
Langchain has a useful chain called QAGenerationChain, which can extract pairs of questions and answers from specific document(s). We can load the document(s) using the relevant data loader (great piece by Hamel here), split it into smaller chunks, and use the chain to extract QA pairs.
I used this prompt to generate QA pairs using the QAGenerationChain. I created 60 such pairs: 45 were created using gpt-3.5-turbo, while the remaining 15 were created using Cohere's command model.
You are a smart assistant designed to come up with a meaningful question and answer pair. The question should be to the point and the answer should be as detailed as possible.
Given a piece of text, you must come up with a question and answer pair that can be used to evaluate a QA bot. Do not make up stuff. Stick to the text to come up with the question and answer pair.
When coming up with this question/answer pair, you must respond in the following format:
```
{{
    "question": "$YOUR_QUESTION_HERE",
    "answer": "$THE_ANSWER_HERE"
}}
```
Everything between the ``` must be valid json.
Please come up with a question/answer pair, in the specified JSON format, for the following text:
----------------
{text}
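A rough sketch of the generation loop, reusing the `docs` loaded in the QA-bot sketch above (chunk size and temperature are assumptions; a custom prompt like the one above can also be passed to the chain):

```python
# Sketch: generating eval QA pairs with LangChain's QAGenerationChain.
from langchain.chains import QAGenerationChain
from langchain.chat_models import ChatOpenAI
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=0)
gen_chunks = splitter.split_documents(docs)  # `docs` from the QA-bot sketch above

qa_gen = QAGenerationChain.from_llm(ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.7))

eval_pairs = []
for chunk in gen_chunks:
    try:
        # Each call returns a list like [{"question": ..., "answer": ...}]
        eval_pairs.extend(qa_gen.run(chunk.page_content))
    except Exception:
        # Generation/JSON parsing occasionally fails; skip those chunks.
        continue
```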
Check out the generated QA pairs below and run through the examples from both models. Can you spot the difference in generation quality?
You can experiment with different prompts to change the tone of the questions and answers, pay more or less attention to detail, create negative answers, and more. The best part is that LLMs will keep improving over time, making this approach feasible for an ever-wider set of use cases.
2. Metrics
Now that we have an eval set of QA pairs, we can let our LLM-based QA bot generate predictions for the questions. We can then use a metric to evaluate the predicted and "true" answers.
1. LLMs as a Metric?
Given a predicted and a "true" answer, we can literally use an LLM to judge how well the prediction compares to the true answer. LLMs are powerful because they have a good understanding of the semantics of text. Given two texts (the true and predicted answers), an LLM can, in theory, determine whether they are semantically identical. If they are, we give that prediction a "CORRECT" label; otherwise, an "INCORRECT" label.
Luckily, LangChain has a chain called QAEvalChain that can take in a question and "true" answer along with the predicted answer and output a "CORRECT" or "INCORRECT" label for each pair. Check out the W&B Table below with one such evaluation job where an LLM was used as a metric (llm_based_eval_acc).
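A minimal sketch of this evaluation step, assuming the `eval_pairs` and `qa_chain` objects from the earlier sketches:

```python
# Sketch: grading the QA bot's predictions with LangChain's QAEvalChain.
from langchain.evaluation.qa import QAEvalChain
from langchain.chat_models import ChatOpenAI

examples = [{"query": p["question"], "answer": p["answer"]} for p in eval_pairs]
predictions = [{"result": qa_chain.run(p["question"])} for p in eval_pairs]

eval_chain = QAEvalChain.from_llm(ChatOpenAI(model_name="gpt-4", temperature=0.0))
graded = eval_chain.evaluate(
    examples, predictions,
    question_key="query", answer_key="answer", prediction_key="result",
)

# Depending on the LangChain version, each grade dict exposes "results" or "text".
labels = [g.get("results", g.get("text", "")).strip() for g in graded]
llm_based_eval_acc = labels.count("CORRECT") / len(labels)
print(f"llm_based_eval_acc: {llm_based_eval_acc:.2f}")
```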
2. Standard Metrics
As an NLP task, question-answering has a rich literature with a few dominant metrics. Two of them, used in various QA benchmark datasets including SQuAD, are:
- Exact Match: For each question-answer pair, if the tokens of the model's prediction exactly match the tokens of the true answer, exact_match is 100; otherwise, exact_match is 0. One can imagine that each token matching is a rare occurrence for a stochastic system. This metric should be taken with a grain of salt for our use case.
- F1 Score: This is a well-known metric that cares equally about the precision and recall of the system. Precision is the ratio of shared tokens to the total number of tokens in the prediction. Recall is the ratio of shared tokens to the total number of tokens in the ground truth. The F1 score is the harmonic mean of the two: F1 = 2 * (precision * recall) / (precision + recall).
We can use HuggingFace's Evaluate library to load the squad metric and compute the exact_match and f1. For the same evaluation job above, check out the exact_match and the f1 scores on a per-sample basis below.
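A sketch of computing these scores with the Evaluate library, reusing the `examples` and `predictions` lists from the QAEvalChain sketch above:

```python
# Sketch: SQuAD-style exact_match and F1 with Hugging Face Evaluate.
# The squad metric expects an id and an `answers` dict per reference.
import evaluate

squad_metric = evaluate.load("squad")

preds = [
    {"id": str(i), "prediction_text": p["result"]}
    for i, p in enumerate(predictions)
]
refs = [
    {"id": str(i), "answers": {"text": [ex["answer"]], "answer_start": [0]}}
    for i, ex in enumerate(examples)
]

scores = squad_metric.compute(predictions=preds, references=refs)
print(scores)  # {"exact_match": ..., "f1": ...}
```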
Hyperparameter Optimization
Given we have an eval set, let's use W&B Sweeps to quickly set up a hyperparameter optimization search that improves a chosen metric, in this case the mean F1 score. I used the following sweep configuration:
method: random
name: random_qa_full_sweeps
parameters:
  embedding:
    values:
      - SentenceTransformerEmbeddings
      - OpenAIEmbeddings
      - CohereEmbeddings
  llm:
    values:
      - gpt-4
      - gpt-3.5-turbo
      - text-davinci-003
      - command
      - command-light
  prompt_template_file:
    values:
      - data/qa/prompt_template_1.txt
      - data/qa/prompt_template_2.txt
  retriever:
    values:
      - Chroma
      - TFIDFRetriever
      - FAISS
program: qa_full_sweeps.py
As you can see, I'm experimenting with a few embedding models, different LLMs (the GPT family from OpenAI and the Command family from Cohere), two prompt templates, and a few retrievers. Prompt template 2 is a slight modification of prompt template 1. Another thing to note is that the TFIDFRetriever doesn't use an embedding model (obviously).
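Inside qa_full_sweeps.py, the string-valued choices from the config have to be mapped to actual objects. This is roughly how that mapping might look (a simplified sketch, not the exact code; `chunks` is the split document from the QA-bot sketch earlier):

```python
# Sketch: turning the sweep config's string choices into retriever objects.
from langchain.embeddings import (CohereEmbeddings, OpenAIEmbeddings,
                                  SentenceTransformerEmbeddings)
from langchain.retrievers import TFIDFRetriever
from langchain.vectorstores import FAISS, Chroma

EMBEDDINGS = {
    "SentenceTransformerEmbeddings": SentenceTransformerEmbeddings,
    "OpenAIEmbeddings": OpenAIEmbeddings,
    "CohereEmbeddings": CohereEmbeddings,
}

def build_retriever(name: str, embedding_name: str, chunks, k: int = 3):
    """Map the string-valued sweep choices to a concrete retriever object."""
    if name == "TFIDFRetriever":               # no embedding model needed
        retriever = TFIDFRetriever.from_documents(chunks)
        retriever.k = k
        return retriever
    embeddings = EMBEDDINGS[embedding_name]()
    store_cls = Chroma if name == "Chroma" else FAISS
    store = store_cls.from_documents(chunks, embeddings)
    return store.as_retriever(search_kwargs={"k": k})
```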
Observations
1. OpenAI models outperform the Cohere models. The top F1 scores (50+) all come from the OpenAI model family.
2. The TFIDFRetriever works surprisingly well compared to the embedding-based Chroma and FAISS retrievers. Since the TFIDFRetriever doesn't use any embedding model, using it can cut costs without reducing performance.
3. gpt-3.5-turbo seems to be performing better than gpt-4 in general. Is it because the eval set was generated using gpt-3.5-turbo? This calls for further investigation into the evaluation strategy, but it also shows how powerful gpt-3.5-turbo can be.
4. The lower F1 scores for gpt-4 are due to prompt template 2. It shows how important correct prompting can be.
What can be improved in the evaluation strategy?
Based on the observations above, one can think of ways to improve the evaluation strategy.
- Use a better metric than the F1 score, for example a semantic similarity metric like the one proposed in "Semantic Answer Similarity for Evaluating Question Answering Models" (arXiv:2108.06130); see the sketch after this list.
- Since gpt-3.5-turbo is performing better on average, it would be good to update the evaluation set to include the following:
  - More QA pairs generated using the Cohere family of models (command and command-light).
  - QA pairs generated using all of the LLMs, further scrutinized before being added to the eval set.
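As a sketch of the first idea, a semantic similarity metric can be as simple as cosine similarity between sentence embeddings of the predicted and true answers (the sentence-transformers model below is an arbitrary choice):

```python
# Sketch: a semantic-similarity metric as an alternative to token-level F1.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # arbitrary small embedding model

def semantic_similarity(prediction: str, reference: str) -> float:
    """Cosine similarity between embeddings of the predicted and true answers."""
    emb = model.encode([prediction, reference], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

print(semantic_similarity("SSL learns from unlabeled data.",
                          "Self-supervised learning uses unlabeled data."))
```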
How about evaluating the retrieval system separately?
Information retrieval (IR) is a crucial step in a QA pipeline. The evaluation strategy suggested above evaluates the pipeline as a whole, but we also need ways to evaluate the individual components.
While generating question-answer pairs from chunks of the documents, the chunks (the source truth) should be saved alongside the pairs. During evaluation, the IR system selects the top-k chunks for a given question. The retriever gets credit if the source chunk appears among the selected chunks, and the score can also account for the rank of the source chunk among the retrieved results (for example, via mean reciprocal rank). A minimal sketch of this idea is shown below.
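Here's a minimal sketch of such a retrieval score, assuming each eval pair also stores the chunk it was generated from under a hypothetical source_chunk key:

```python
# Sketch: scoring a retriever on its own via hit rate and mean reciprocal rank (MRR).
# Assumes each eval pair carries the chunk it was generated from ("source_chunk").
def retrieval_scores(retriever, eval_pairs, k=5):
    """Return hit-rate@k and MRR for a LangChain-style retriever."""
    hits, reciprocal_ranks = 0, []
    for pair in eval_pairs:
        retrieved = retriever.get_relevant_documents(pair["question"])[:k]
        rank = next(
            (i + 1 for i, doc in enumerate(retrieved)
             if pair["source_chunk"] in doc.page_content),
            None,
        )
        hits += rank is not None
        reciprocal_ranks.append(0.0 if rank is None else 1.0 / rank)
    n = len(eval_pairs)
    return {"hit_rate@k": hits / n, "mrr": sum(reciprocal_ranks) / n}
```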
I will show this in action in a separate report.
Conclusion
The evaluation of LLM-based systems is still in the early stages of development, with a lot of research and tooling being built for it. LLMs are here to stay, and many problem statements will start leveraging them in some capacity. I hope this report illuminates the importance of evaluating an LLM-based system and gives you some practical ways of evaluating your own.
I believe that LLMs evaluating LLMs will eventually become common practice, but some progress still has to be made. The most methodical way of evaluating any system remains a human-generated eval set. We will probably see a hybrid evaluation strategy: to begin with, we use an LLM to evaluate another LLM, deploy the system in some capacity, and collect real data from humans. We then update the eval set with more human-generated test cases.
I hope you enjoyed reading this post. If you have any questions or suggestions, please drop a comment below or reach out at @ayushthakur0.
Related Resources
A Gentle Introduction to LLM APIs
In this article, we dive into how large language models (LLMs) work, starting with tokenization and sampling, before exploring how to use them in your applications.
How to Run LLMs Locally With llama.cpp and GGML
This article explores how to run LLMs locally on your computer using llama.cpp — a repository that enables you to run a model locally in no time with consumer hardware.

COURSE: Building LLM-Powered Apps
A free, interactive course from Weights & Biases
Prompt Engineering LLMs with LangChain and W&B
Join us for tips and tricks to improve your prompt engineering for LLMs. Then, stick around and find out how LangChain and W&B can make your life a whole lot easier.