
How to Evaluate, Compare, and Optimize LLM Systems

This article provides an interactive look into how to go about evaluating your large language model (LLM) systems and how to approach optimizing the hyperparameters.
The space of large language models (LLMs) has exploded since the public release of ChatGPT (more of an agent, in all honesty) and GPT-4. We'll only see more advancements moving forward and, along with them, more systems (tools, bots, services, etc.) that use LLMs under the hood.
Still, many of the examples out there are toy examples, demos, or proofs of concept. Right now, only a handful of LLM systems are in production, and perhaps the biggest hurdle has been the faithful evaluation of such systems.
Consider a simple example of a medical QA bot. You'd input a diagnosis (query), and the system might recommend medicines with dosages (response). Such a system will retrieve information (context) from a medicine database, collate the context and the query (prompt), and finally use an LLM to generate a response.
But how would you ensure the response is correct? After all, in a medical use case, getting this right matters a ton. False positives or negatives can have life-changing repercussions.
So how can we go about evaluating an LLM system like this? This article will:
  • Examine evaluation as a concept for LLM-based systems. We'll start with a simple example and build up to evaluating a system that answers questions over documents.
  • We'll also dig into how to leverage hyperparameter optimization (HPO) to find a better-tuned LLM-based system.
If you like this report and are building LLM apps, we also have a free course you can sign up for by clicking the button below.
Sign up for our free LLM course


Try out the code here →

Straightforward LLM Evaluation

There's a difference between evaluating an LLM and evaluating an LLM-based system. After all, today's large language models can do multiple tasks: they can summarize text, answer questions, classify the sentiment of a text, translate, and more.
Typically after pre-training (generally on a huge dataset), LLMs are evaluated on standard benchmarks — GLUE, SQuAD 2.0, and SNLI, to name a few — using standard metrics. But these LLMs might not be useful to us out of the box. Here's why:
  • We might have to fine-tune the LLM on our "private" dataset for our particular use case. In this case, evaluation is usually straightforward: we have a ground-truth dataset against which we evaluate our fine-tuned model, mostly using standard metrics. However, fine-tuning should not be the first option we consider, given that it's expensive and time-consuming. To continue our example above, our medical system might be fine-tuned on patient data or medical research not available to off-the-shelf LLMs.
  • LLMs are powerful, but with well-thought-out pre/post-processing around them, we can build LLM-based systems that, in many cases, perform well enough. Building such a system has gotten easier thanks to tools like LangChain, LlamaIndex, and others. However, it is still tricky to find the right components (more on this later) and to evaluate the system properly.
Here are a few evaluation tactics, starting with the most straightforward of all:

1. Eyeballing

When we begin building an LLM-based system, we usually start our evaluation by eyeballing its responses. We have a few inputs and expected responses in mind, and we tune the system by trying different components, prompt templates, etc. This gives us a good proof of concept, but a production system must be evaluated more thoroughly.
To support this workflow, we suggest using W&B Prompts. In fact, if you use Langchain or LlamaIndex, you can use Prompts out of the box. You can also instrument your chains/plug-ins with Prompts.
Below, you'll see an example using LlamaIndex to query over a simple document. If your workflow involves eyeballing, this tool will let you see what's happening under the hood. Feel free to click around this Prompts example. It's interactive:
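If you want to reproduce a trace like this yourself, here's a minimal sketch, assuming a 2023-era LlamaIndex release with the "wandb" global handler and a local data/ folder containing the document; the project name is illustrative:

```python
# A minimal sketch: query a local document with LlamaIndex and send the trace to W&B Prompts.
# Assumes a 2023-era LlamaIndex with the "wandb" global handler; the project name is illustrative.
import llama_index
from llama_index import SimpleDirectoryReader, VectorStoreIndex

# Route LlamaIndex events (query, retrieve, synthesize, llm, embedding) to W&B Prompts.
llama_index.set_global_handler("wandb", run_args={"project": "llm-eval-prompts"})

documents = SimpleDirectoryReader("data").load_data()  # any folder with a text/PDF document
index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine()
response = query_engine.query("What did the author do growing up?")
print(response)
```

With the handler set, every query shows up as a trace like the one below, with timings for each span.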

[W&B Prompts trace: the query span (4,346 ms) breaks down into retrieve (208 ms, including an embedding call of 201 ms) and synthesize (4,138 ms, including an llm call of 4,115 ms). Input: "What did the author do growing up?" Output: a summary of the author's activities (writing essays, learning Italian, exploring Florence, painting, working with computers, attending RISD, building an online store builder, working on spam filters, and more).]

2. Human Annotation, a.k.a. Supervised Evaluation

The best and most reliable way to evaluate an LLM system is to create an evaluation dataset for each component of the LLM-based system. The biggest downsides of this approach are the cost and the time it takes to build such a dataset. Depending on the LLM-based system, the design of the evaluation dataset can also be challenging.
You might question creating a new dataset and wonder why we wouldn't use standard benchmarks. The idea of building an LLM-based system is to bring private data to the LLM without actually fine-tuning it. Ideally, we'd want to evaluate the system on our private data, a.k.a. our "domain."
Let's take an example of a simple LLM-based system - a calculator. The calculator can solve a given mathematical expression using the BODMAS rule (fun fact: this is also known as the PEMDAS or BEDMAS rule, depending on where you grew up).

The Calculator, aka a simple LLM-based system

Our calculator will accept a mathematical expression with brackets, symbols, etc., compute the answer using an underlying LLM (e.g., GPT-4), and return it. One can build it in a few lines of code using LangChain, roughly as sketched below.
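Here is a minimal sketch of such a calculator. It is not the exact implementation behind this report; the template file name, the response schema, and the example expression are assumptions. The output parser supplies the {format_instructions} placeholder that the prompt templates shown later rely on.

```python
# A minimal sketch of the LLM calculator; the template file name, the response schema,
# and the example expression are assumptions, not the exact setup behind this report.
from langchain.chains import LLMChain
from langchain.chat_models import ChatOpenAI
from langchain.output_parsers import ResponseSchema, StructuredOutputParser
from langchain.prompts import PromptTemplate

# The output parser supplies the {format_instructions} placeholder used by the prompt templates.
output_parser = StructuredOutputParser.from_response_schemas(
    [ResponseSchema(name="answer", description="the numeric result of the expression")]
)

prompt = PromptTemplate(
    template=open("maths_prompt_template_1.txt").read(),
    input_variables=["question"],
    partial_variables={"format_instructions": output_parser.get_format_instructions()},
)

chain = LLMChain(llm=ChatOpenAI(model_name="gpt-4", temperature=0), prompt=prompt)
raw = chain.run(question="(4 + 5) * 3 - 6 / 2")
print(output_parser.parse(raw)["answer"])
```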

Supervised Evaluation

Surprisingly, GPT-4 often cannot solve these expressions correctly without proper tuning of the system. This raises the question: before discussing evaluation, what are the different things we can tune?
We can experiment with different LLMs (gpt-4, gpt-3.5-turbo, etc.), try different prompts, tweak the arguments (like temperature) provided to the API, and more. The "more" will depend on the system you are trying to build.
We could also tune blindly using the eyeballing technique, but as stated previously, that will not give us the confidence required to push the system to production. In the case of our calculator, we use a Python function to evaluate the expressions and treat its outputs as the "human-annotated evaluation dataset." (Yes, this is simulated data, but the aim is to showcase how to evaluate such a system by collecting relevant data.)
The W&B Table below shows the true_result, which is human-annotated eval data (in our case, generated using a Python function), while the pred_result is computed by our LLM calculator. We are also logging the token counts and the cost of generating the response.
Using LangChain, you can track your token usage and cost. However, this is currently only implemented for the OpenAI API.
💡
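As a rough sketch of how the eval data and cost tracking could be collected (reusing the chain and output parser from the calculator sketch above; the expression generator and project name are illustrative):

```python
# A sketch of building the simulated "human-annotated" eval data with Python, tracking
# tokens/cost, and logging everything to a W&B Table. Reuses `chain` and `output_parser`
# from the calculator sketch above; the expression generator and project name are illustrative.
import random

import wandb
from langchain.callbacks import get_openai_callback

wandb.init(project="llm-calculator-eval")
table = wandb.Table(columns=[
    "expression", "true_result", "pred_result",
    "prompt_tokens", "completion_tokens", "total_tokens", "total_cost_usd",
])

for _ in range(106):
    a, b, c, d = (random.randint(1, 20) for _ in range(4))
    expression = f"({a} + {b}) * {c} - {d}"
    true_result = eval(expression)  # the simulated "human annotation"

    with get_openai_callback() as cb:  # tracks token usage and cost for OpenAI calls
        pred_result = output_parser.parse(chain.run(question=expression))["answer"]

    table.add_data(expression, true_result, pred_result,
                   cb.prompt_tokens, cb.completion_tokens, cb.total_tokens, cb.total_cost)

wandb.log({"calculator_eval": table})
wandb.finish()
```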

[W&B Table with columns: expression, true_result, pred_result, LLM Prompt Tokens, LLM Completion Tokens, LLM Total Tokens, LLM Total Cost (USD), Parsing Prompt Tokens, Parsing Completion Tokens, Parsing Total Tokens, Parsing Total Cost (USD); one row per evaluated expression.]
Depending on the use case and the availability of such an eval dataset, you can choose the most relevant metric(s). In this instance, accuracy is good enough. Below you can see the accuracy of the LLM calculator on the eval dataset, along with the cost of making 106 predictions.
Surely, we can improve the performance of our LLM-based calculator. (In fact, we'll be doing so in just a moment here.)



Hyperparameter Optimization

Setting up hyperparameter optimization makes sense when we have an eval dataset and multiple components to tweak. It's fairly straightforward using W&B Sweeps. We'll try out three different prompt templates, three models (text-davinci-003, gpt-3.5-turbo, gpt-4), and temperatures in the range 0-1.
NOTE: Temperature is a parameter that adjusts how "creative" the model's responses are. Lower temperatures produce more deterministic, less novel outputs; higher temperatures produce more diverse content but also increase the odds the model drifts out of context.
💡
The different prompt templates we'll try:
  1. maths_prompt_template_1.txt
The following is the mathematical expression provided by the user.
{question}

Find the answer using the BODMAS rule in the {format_instructions}:
2. maths_prompt_template_2.txt
The following is the mathematical expression provided by the user.
{question}

Find the answer in the {format_instructions}:
3. maths_prompt_template_3.txt
You are an expert mathematician. You can solve a given mathematical expression using the BODMAS rule.
BODMAS stands for Bracket, Orders of Indices, Division, Multiplication, Addition and Subtraction. The computation should happen in that order.
The order is as follows:
B: Solve expressions inside brackets in this order -> small bracket followed by curly bracket and finally square bracket.
O: Solve the indices such as roots, powers, etc.
D: Divide the numbers which are given
M: Multiply the numbers next
A: Sum up the next numbers
S: Subtract the numbers left in the end

The following is the mathematical expression provided by the user.
{question}

Think about it step-by-step. Don't skip steps.

When ready with the answer return in the {format_instructions}:
As you can see, maths_prompt_template_2.txt is the simplest prompt without any mention of the "BODMAS" rule. We mention "BODMAS" in the maths_prompt_template_1.txt template. We give detailed instructions in the final prompt template, maths_prompt_template_3.txt.
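For reference, a sweep over these knobs might be configured roughly as follows. This is a sketch, not the exact script behind the runs below; the evaluation function body and the metric name are placeholders.

```python
# A sketch of the sweep over prompt templates, models, and temperature. The evaluation
# function body and the metric name are placeholders, not the exact script used here.
import wandb

sweep_config = {
    "method": "random",
    "metric": {"name": "accuracy", "goal": "maximize"},
    "parameters": {
        "prompt_template_file": {"values": [
            "maths_prompt_template_1.txt",
            "maths_prompt_template_2.txt",
            "maths_prompt_template_3.txt",
        ]},
        "model_name": {"values": ["text-davinci-003", "gpt-3.5-turbo", "gpt-4"]},
        "temperature": {"min": 0.0, "max": 1.0},
    },
}

def evaluate_calculator():
    with wandb.init() as run:
        cfg = run.config
        # Build the chain from cfg.prompt_template_file, cfg.model_name, and cfg.temperature,
        # run it over the eval set, and log the results, e.g.:
        # run.log({"accuracy": accuracy, "total_cost_usd": total_cost})

sweep_id = wandb.sweep(sweep_config, project="llm-calculator-eval")
wandb.agent(sweep_id, function=evaluate_calculator, count=50)
```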
I ran the optimization for 50 runs using the random strategy. Here are some of the observations:

[Run set: 44 runs]


Observations

  1. There are three accuracy bands: low (10-40%), medium (60-70%), and high (88%+). The bands are positively correlated with cost: higher accuracy is achieved by spending more money.
  2. GPT-4 rules them all. The GPT-4 model gives the highest accuracy.


3. GPT-4 is great, but you need to prompt it correctly. Prompt template 3 yielded ~90% accuracy, while prompt template 1 yielded 60-70%. That's an improvement of more than 20 percentage points in system performance. In our case, being more instructive to the LLM produced better outcomes.


4. The result will be bad if you use a bad model. No matter the prompt template, text-davinci-003 didn't perform well, with an accuracy below 20%. Even gpt-3.5-turbo failed to perform well (<40% accuracy).


The best accuracy comes at a price. As expected, gpt-4 performs best with the more detailed, precise prompt template. The result is as expected for a simple LLM-based system, but that might not be the case for a slightly more complex one.
If one can spend the time and money to build an evaluation dataset, it will give the most confidence when deploying an LLM-based system evaluated on it. This is the most methodologically sound way of evaluating your LLM-based system.
💡

LLMs Evaluating LLMs

Try out the code here →

LLMs are versatile and show interesting characteristics. They are particularly good at extracting information from the provided text, and they're only getting better. This has led to the evolving practice of using LLMs to evaluate LLMs. The core idea is to use an LLM to generate test cases and then evaluate the LLM-based system on them.
In this section, our LLM-based system will be a retrieval-augmented QA bot. Such a system has a few components: an embedding model, a retrieval system, and an LLM-powered QA chain. You can learn more about how to build such a system in the LangChain: Chat with Your Data course from DeepLearning.AI.
The QA bot is built to answer questions based on a paper titled "The Cookbook of Self-Supervised Learning" (arXiv:2304.12210). Let's dive into evaluating such a system. First, we need an evaluation dataset, and we need to decide on the metrics.
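Before diving in, here is a minimal sketch of what such a QA bot can look like with LangChain. The file name, chunk sizes, retriever settings, and model choice are assumptions for illustration.

```python
# A minimal sketch of the retrieval-augmented QA bot; the file name, chunk sizes,
# retriever settings, and model choice are assumptions for illustration.
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import PyPDFLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

# 1. Load and chunk the paper (arXiv:2304.12210).
docs = PyPDFLoader("2304.12210.pdf").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_documents(docs)

# 2. Embed the chunks and build a retriever.
vectordb = Chroma.from_documents(chunks, OpenAIEmbeddings())
retriever = vectordb.as_retriever(search_kwargs={"k": 4})

# 3. Wire the retriever into an LLM-powered QA chain.
qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0),
    retriever=retriever,
)
print(qa_chain.run("What is self-supervised learning?"))
```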

1. Generate Eval Dataset Using An LLM

We need pairs of questions and answers in our evaluation set to actually evaluate a QA bot. Since our bot uses an information retrieval (IR) system, we must also consider evaluating it (more on it later).
As stated in the previous section, we can hire human annotators to create gold-standard pairs of questions and answers manually. This is a great method overall, but it is costly and time-consuming. After all, building an evaluation set for a QA bot over medical data needs trained people.
For many such niche use cases—ones where simple crowd-sourcing doesn't work well, and you're far better off with experts—it's hard to find the right talent, leading to higher costs and making it infeasible for individuals and small businesses. One feasible way of creating such a dataset is to leverage an LLM. This approach has obvious benefits and limitations:
  • It's scalable. We can generate a vast number of test cases on demand.
  • It's flexible. The test cases can be generated for special edge cases and adapted to multiple domains, ensuring relevance and applicability.
  • It's cheap and fast. LLMs can quickly collate information from multiple documents at a far lower price.
As for limitations, we covered the biggest one above: the use cases where you most need expert labelers (e.g., the medical domain) are precisely the ones where LLM-generated test cases are hardest to trust.

Langchain's QAGenerationChain

LangChain has a useful chain called QAGenerationChain, which can extract pairs of questions and answers from specific document(s). We can load the document(s) using the relevant data loader (great piece by Hamel here), split them into smaller chunks, and use the chain to extract QA pairs.
I used the following prompt to generate QA pairs with the QAGenerationChain. I created 60 such pairs: 45 with gpt-3.5-turbo and the remaining 15 with Cohere's command model.
You are a smart assistant designed to come up with meaningful question and answer pair. The question should be to the point and the answer should be as detailed as possible.
Given a piece of text, you must come up with a question and answer pair that can be used to evaluate a QA bot. Do not make up stuff. Stick to the text to come up with the question and answer pair.
When coming up with this question/answer pair, you must respond in the following format:
```
{{
"question": "$YOUR_QUESTION_HERE",
"answer": "$THE_ANSWER_HERE"
}}
```

Everything between the ``` must be valid json.

Please come up with a question/answer pair, in the specified JSON format, for the following text:
----------------
{text}
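A rough sketch of wiring this up follows. The chunk size, the number of pairs, and the temperature are assumptions, and `docs` is the loaded paper from the QA-bot sketch above.

```python
# A sketch of generating eval QA pairs with QAGenerationChain; chunk size, the number of
# pairs, and the temperature are assumptions. `docs` is the paper loaded in the QA-bot sketch.
import json

from langchain.chains import QAGenerationChain
from langchain.chat_models import ChatOpenAI
from langchain.text_splitter import RecursiveCharacterTextSplitter

chunks = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=0).split_documents(docs)

# A custom prompt (like the one above) can typically be passed via from_llm(llm, prompt=...).
qa_gen_chain = QAGenerationChain.from_llm(ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.3))

eval_set = []
for chunk in chunks[:45]:  # the remaining pairs were generated the same way with Cohere's command model
    try:
        eval_set.extend(qa_gen_chain.run(chunk.page_content))  # a list of {"question", "answer"} dicts
    except json.JSONDecodeError:
        continue  # skip chunks where the model returned malformed JSON

print(eval_set[0])
```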
Check out the generated QA pairs below and run through the pairs from both models. Can you spot the difference in generation quality?


You can experiment with different prompts to change the tone of the questions and answers, pay more or less attention to detail, create negative answers, and more. The best part of this approach is that LLMs will improve with time, making it feasible for an ever wider set of use cases.

2. Metrics

Now that we have an eval set of QA pairs, we can let our LLM-based QA bot generate predictions for the questions. We can then use a metric to evaluate the predicted and "true" answers.

1. LLMs as a Metric?

Given a predicted and a "true" answer, we can literally use an LLM to judge how well the prediction compares to the true answer. LLMs are powerful because they have a good understanding of the semantics of text: given two texts (the true and predicted answers), an LLM can, in theory, determine whether they are semantically identical. If they are, we give that prediction a "CORRECT" label; otherwise, an "INCORRECT" label.
Luckily, LangChain has a chain called QAEvalChain that takes in a question and the "true" answer along with the predicted answer, and outputs a "CORRECT" or "INCORRECT" label. Check out the W&B Table below with one such evaluation job where an LLM was used as a metric (llm_based_eval_acc).
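Here's a sketch of how such an LLM-graded evaluation could be run with QAEvalChain, reusing the `eval_set` and `qa_chain` from the earlier sketches; the dictionary keys are assumptions you can adapt to your own schema.

```python
# A sketch of LLM-graded evaluation with QAEvalChain; the dict keys below are assumptions
# matched to the eval_set / qa_chain defined in the earlier sketches.
from langchain.chat_models import ChatOpenAI
from langchain.evaluation.qa import QAEvalChain

examples = [{"question": qa["question"], "answer": qa["answer"]} for qa in eval_set]
predictions = [{"result": qa_chain.run(qa["question"])} for qa in eval_set]

eval_chain = QAEvalChain.from_llm(ChatOpenAI(model_name="gpt-4", temperature=0))
graded = eval_chain.evaluate(
    examples,
    predictions,
    question_key="question",
    answer_key="answer",
    prediction_key="result",
)

# The verdict key is "results" in recent LangChain releases and "text" in older ones;
# each verdict is expected to be "CORRECT" or "INCORRECT".
verdicts = [g.get("results", g.get("text", "")) for g in graded]
llm_based_eval_acc = sum(v.strip().upper() == "CORRECT" for v in verdicts) / len(verdicts)
print(f"llm_based_eval_acc: {llm_based_eval_acc:.2f}")
```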



2. Standard Metrics

As an NLP task, question-answering has a rich literature with a few dominant metrics. Two metrics used across various QA benchmark datasets, including SQuAD, are:
  • Exact Match: For each question-answer pair, if the tokens of the model's prediction exactly match the tokens of the true answer, exact_match is 100; otherwise, it is 0. An exact token-for-token match is a rare occurrence for a stochastic system, so this metric should be taken with a grain of salt for our use case.
  • F1 Score: This is a well-known metric that cares equally about the precision and recall of the system. Precision is the ratio of shared tokens to the total number of tokens in the prediction. Recall is the ratio of shared tokens to the total number of tokens in the ground truth.
We can use HuggingFace's Evaluate library to load the squad metric and compute the exact_match and f1. For the same evaluation job above, check out the exact_match and the f1 scores on a per-sample basis below.
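A small sketch of computing these scores with the Evaluate library, reusing the eval set and predictions from the sketches above; the id scheme is illustrative.

```python
# A sketch of computing exact_match and F1 with HuggingFace Evaluate's "squad" metric,
# reusing eval_set and predictions from the sketches above; the id scheme is illustrative.
import evaluate

squad_metric = evaluate.load("squad")

squad_predictions = [
    {"id": str(i), "prediction_text": pred["result"]}
    for i, pred in enumerate(predictions)
]
squad_references = [
    # The squad metric expects SQuAD-style references with answer_start offsets;
    # 0 is a harmless placeholder since only the answer text is used for scoring.
    {"id": str(i), "answers": {"text": [qa["answer"]], "answer_start": [0]}}
    for i, qa in enumerate(eval_set)
]

scores = squad_metric.compute(predictions=squad_predictions, references=squad_references)
print(scores)  # e.g. {"exact_match": ..., "f1": ...}

# For per-sample scores, compute the metric one prediction/reference pair at a time.
```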



Hyperparameter Optimization

Given that we have an eval set, let's use W&B Sweeps to quickly set up a hyperparameter optimization that improves a chosen metric, in this case the mean F1 score. I used the following sweep configuration:
method: random
name: random_qa_full_sweeps
parameters:
  embedding:
    values:
      - SentenceTransformerEmbeddings
      - OpenAIEmbeddings
      - CohereEmbeddings
  llm:
    values:
      - gpt-4
      - gpt-3.5-turbo
      - text-davinci-003
      - command
      - command-light
  prompt_template_file:
    values:
      - data/qa/prompt_template_1.txt
      - data/qa/prompt_template_2.txt
  retriever:
    values:
      - Chroma
      - TFIDFRetriever
      - FAISS
program: qa_full_sweeps.py
As you can see, I'm experimenting with a few embedding models, different LLMs (the GPT family from OpenAI and the Command family from Cohere), two prompt templates, and a few retrievers. Prompt template 2 is a slight modification of prompt template 1. Another thing to note: the TFIDFRetriever doesn't use an embedding model.
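For context, the sweep's program (qa_full_sweeps.py) could map these config values to components roughly as follows. This is a sketch under assumptions, not the actual script; in particular, the factory mapping and the metric logging are placeholders.

```python
# qa_full_sweeps.py -- a rough sketch of how the sweep's config values could map to
# components. The factory dict, file name, and metric logging are placeholders.
import wandb
from langchain.document_loaders import PyPDFLoader
from langchain.embeddings import CohereEmbeddings, OpenAIEmbeddings, SentenceTransformerEmbeddings
from langchain.retrievers import TFIDFRetriever
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma, FAISS

EMBEDDINGS = {
    "SentenceTransformerEmbeddings": SentenceTransformerEmbeddings,
    "OpenAIEmbeddings": OpenAIEmbeddings,
    "CohereEmbeddings": CohereEmbeddings,
}

def main():
    run = wandb.init()
    cfg = run.config

    docs = PyPDFLoader("2304.12210.pdf").load()
    chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_documents(docs)

    if cfg.retriever == "TFIDFRetriever":
        retriever = TFIDFRetriever.from_documents(chunks)  # no embedding model needed
    else:
        embedding = EMBEDDINGS[cfg.embedding]()
        store = (Chroma if cfg.retriever == "Chroma" else FAISS).from_documents(chunks, embedding)
        retriever = store.as_retriever()

    # Build the QA chain from cfg.llm and cfg.prompt_template_file, answer every question
    # in the eval set, score each answer with the "squad" metric as shown earlier, and log:
    # run.log({"mean_f1": sum(f1_scores) / len(f1_scores)})

if __name__ == "__main__":
    main()
```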

[Run set: 65 runs]


Observations

  1. The OpenAI models perform better than the Cohere models. The top F1 scores (50+) come from the OpenAI model family.


2. The TFIDFRetriever works surprisingly well compared to the embedding-based Chroma and FAISS retrievers. Since it doesn't use an embedding model, using this retriever can cut costs without reducing performance.




3. gpt-3.5-turbo seems to perform better than gpt-4 in general. Is that because the eval set was generated using gpt-3.5-turbo? This calls for further investigation into the evaluation strategy, but it also shows how powerful gpt-4 is.
[Interactive panel: selecting a model's checkbox shows the mean F1 score across its experiments, gpt-3.5-turbo (6 runs) vs. gpt-4 (10 runs).]

4. The lower F1 scores for gpt-4 are due to prompt template 2. This shows how important correct prompting can be.



What can be improved in the evaluation strategy?

Based on the observations above, one can think of ways to improve the evaluation strategy.
  • Use a better metric than the F1 score, e.g., a semantic similarity metric like the one proposed in "Semantic Answer Similarity for Evaluating Question Answering Models" (arXiv:2108.06130).
  • Since gpt-3.5-turbo performs better on average, it would be good to update the evaluation set to include:
    • more QA pairs generated using the Cohere family of models (command and command-light), and
    • QA pairs generated using all the LLMs, which are then further scrutinized.

How about evaluating the retrieval system separately?

Information retrieval (IR) is a crucial step in a QA pipeline. The evaluation strategy suggested above evaluates the pipeline as a whole. We need ways to evaluate individual systems.
While generating question-answer pairs from chunks of the documents, the source chunks (the "source truth") should be saved alongside the pairs. During evaluation, the IR system selects the top-k chunks for a given question. The retriever gets credit for a question if the source chunk appears among the selected chunks, and the score can also account for the source chunk's rank within them.
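As a rough sketch under these assumptions (a LangChain-style retriever and an eval set where each example stores its originating chunk in a source_chunk field), the scoring could look like this:

```python
# A sketch of retriever-only evaluation: hit rate and MRR at k. Assumes a LangChain-style
# retriever and an eval set where each example stores its originating chunk as "source_chunk".
def retrieval_scores(retriever, eval_set, k=4):
    hits = 0
    reciprocal_ranks = []
    for example in eval_set:
        retrieved = retriever.get_relevant_documents(example["question"])[:k]
        texts = [doc.page_content for doc in retrieved]
        if example["source_chunk"] in texts:
            hits += 1
            reciprocal_ranks.append(1.0 / (texts.index(example["source_chunk"]) + 1))
        else:
            reciprocal_ranks.append(0.0)
    return {
        "hit_rate@k": hits / len(eval_set),
        "mrr@k": sum(reciprocal_ranks) / len(eval_set),
    }
```

Hit rate rewards the source chunk appearing anywhere in the top k; MRR additionally rewards it appearing early in the ranking.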
I will show this in action in a separate report.

Conclusion

The evaluation of LLM-based systems is still in an early stage of development, with a lot of research and tooling being built around it. LLMs are here to stay, and many problem statements will start leveraging them in some capacity. I hope this report illuminated the importance of evaluating an LLM-based system and gave you some practical ways of evaluating your own.
I believe that LLMs evaluating LLMs will eventually become common practice, but some progress still has to be made. The most methodical way of evaluating any system remains a human-generated eval set. We will probably see a hybrid evaluation strategy: start by using an LLM to evaluate another LLM, deploy the system in some capacity, collect real data from humans, and then update the eval set with more human-generated test cases.
I hope you enjoyed reading this post. If you have any questions or suggestions, please drop a comment below or reach out at @ayushthakur0.
