
How to evaluate an LLM Part 3: LLMs evaluating LLMs

Employing auto-evaluation strategies to evaluate different components of our Wandbot RAG-based support system.
Created on October 18 | Last edited on June 12
In Part 1 of this LLM Evaluation series, we documented the steps taken to build a gold-standard set of questions that can be used to evaluate Wandbot, our LLM-powered documentation application. We also showed how we performed a manual evaluation in Part 2 of the series and reached a response accuracy of 66.67%.
In this report, we will leverage GPT-4 (or any other powerful LLM of your choice) to build synthetic evaluation datasets and to evaluate Wandbot. As crazy as it sounds, LLMs can be useful for evaluating parts of an LLM-based application. We will also put our gold-standard eval set to work in these LLM-driven evaluations.
A common way to evaluate an LLM-based pipeline is to employ human annotators, but this comes with a high cost, both in time and man hours. Thus, for LLM-based applications, clever utilization of LLMs and thoughtful metric/algorithm selection have shown promise in expediting the evaluation process.
The following sections present baseline evaluation scores across various categories, interpret their significance and provide insights into the methodology employed for Wandbot evaluation.







What is a typical RAG pipeline?

A typical retrieval-augmented generation (RAG) pipeline is illustrated below, with an indexing stage and a querying stage. If you want to learn more about RAG and how to build one, check out my report on Building Advanced Query Engine and Evaluation with LlamaIndex and W&B. 
Figure 1: Diagram of a typical RAG pipeline (source: hand made)
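To make the two stages concrete, here is a minimal sketch of a RAG pipeline in LlamaIndex. It assumes a local docs/ directory of markdown files and uses default chunking and embedding settings, unlike wandbot's production configuration.
from llama_index import SimpleDirectoryReader, VectorStoreIndex

# Indexing stage: load the documents, chunk them into nodes, and embed them
documents = SimpleDirectoryReader("docs/").load_data()
index = VectorStoreIndex.from_documents(documents)

# Querying stage: retrieve the top-k chunks and synthesize a response with the LLM
query_engine = index.as_query_engine(similarity_top_k=5)
response = query_engine.query("How do I log a W&B Table?")
print(response)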
There are decisions to be made in almost every component of the pipeline:
  • What data to ingest? Should we use the documentation or also use the content of wandb/wandb repo and wandb/examples repo?
  • What should the data parsing strategy be? How should code snippets, raw text, and Jupyter notebooks be parsed? How should they be chunked into smaller sections? Should parsing be done by taking the docs format (headers) into the picture?
  • What should the chunk size be?
  • What embedding model to use? Should we use OpenAI's text-embedding-ada-002 embedding model or Cohere's embedding model or should we fine-tune one?
  • What kind of retriever best suits the use case? Should we use a simple vector-based retriever or a knowledge graph-based one?
  • How many chunks/nodes should we use as context for the LLM? How many semantically similar chunks that "can" answer the user query are enough?
  • Should we rerank the retrieved chunks? Should we use a powerful reranking model to pick a few chunks from a pool of chunks?
  • Should we build multiple retrievers following some hierarchical strategy? Weights & Biases has both English and Japanese docs. Should we have separate retrievers for both of them? Should we use individual retrievers for each product and route the query to each of them?
The list can go on, but we can group our questions into two main categories:
  • How good is my retriever?
  • How good is the final synthesized response?
In this report, we will employ a few strategies to evaluate our retriever and the response. We will, in subsequent reports, try to answer a few questions posed above, but for now, the focus of this report will be to showcase strategies that might be useful for evaluating Wandbot.
Let's dig in.

Evaluating the quality of the response

Judging the quality of the response against various criteria lets us evaluate the pipeline in an end-to-end fashion. Examples of the criteria we might use:
  • Correctness: Is the response answering the query correctly?
  • Faithfulness: Is the response supported by the context, even if most of the information in the context is irrelevant?
  • Relevancy: Is the response to the query in line with the retrieved contextual information?
We can use GPT-4 for this evaluation, and in the subsequent subsections we'll see how. Note that we won't be doing the correctness evaluation with GPT-4, as we plan to do that manually (more on it later).

The evaluation dataset

Faithfulness and relevancy can be measured easily using LlamaIndex's FaithfulnessEvaluator and RelevancyEvaluator classes. Both classes use the query engine and a list of questions.
You can check out the questions below:



Generate the responses

With the gold-standard questions (creating these is the hard part), we can use the query engine to generate the responses. Here are the key parameters used by the query engine:
similarity_top_k = 25  # retriever will select the top 25 chunks
response_mode = "compact"  # stuff as much text as can fit within the context window
node_postprocessors = [CohereRerank(top_n=15, model="rerank-english-v2.0")]  # use Cohere's model to rerank and keep the top 15 chunks
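Here is a sketch of how these parameters might be wired into a LlamaIndex query engine, assuming an existing index over the documentation and a Cohere API key available in the environment; the CohereRerank import path can vary across llama_index versions.
from llama_index.postprocessor.cohere_rerank import CohereRerank  # import path varies across llama_index versions

query_engine = index.as_query_engine(
    similarity_top_k=25,  # retrieve the top 25 chunks
    response_mode="compact",  # pack as much retrieved text as fits into each LLM call
    node_postprocessors=[CohereRerank(top_n=15, model="rerank-english-v2.0")],
)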
These are the key parameters used by wandbot v1.0, the version of the bot we are evaluating here. You can see the logged responses (along with the retrieved context) in the W&B Table below. They were all generated using GPT-4.


If multiple evaluation strategies (in our case, faithfulness and relevancy evaluation) use the same query-context-response triplets, it's best to serialise them.
💡
Consider serialising the results of expensive operations so that the work isn't lost. I used Python's pickle module to serialise and de-serialise the list of responses. They were further logged to W&B Artifacts for version management.
import time
import pickle

from tqdm import tqdm


# Get all the responses
responses = []

for question in tqdm(questions):
    response = query_engine.query(question)
    responses.append(response)
    time.sleep(2)  # helped overcome timeout issues


# Save to disk
with open("responses.pkl", "wb") as file:
    pickle.dump(responses, file)
Note that both FaithfulnessEvaluator and RelevancyEvaluator make LLM calls under the hood with a default prompt template. For your use case, you can consider providing your own prompt template.
💡
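As a sketch of what providing your own template might look like (the eval_template argument and the {query_str}/{context_str} placeholder names are assumptions based on the default LlamaIndex template; check the evaluator's signature in your llama_index version):
from llama_index.evaluation import FaithfulnessEvaluator
from llama_index.prompts import PromptTemplate

# A hypothetical stricter template; the evaluator fills in the placeholders
custom_eval_template = PromptTemplate(
    "Answer YES only if the given information is fully supported by the context, "
    "otherwise answer NO.\n"
    "Context: {context_str}\n"
    "Information: {query_str}\n"
    "Answer: "
)

faithfulness_evaluator = FaithfulnessEvaluator(
    service_context=service_context,
    eval_template=custom_eval_template,
)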

Evaluating the faithfulness of our responses

The response from a RAG pipeline is dependent on the context retrieved by the retriever. We never want the response to hallucinate. In other words: the informational content of the response should come from the provided context, and if the retrieved context cannot answer the query, the response should not make something up.
The faithfulness evaluation measures whether the response from a RAG pipeline is supported by any of the retrieved chunks. This is useful for detecting whether the response was hallucinated.
import time
import pickle

from tqdm import tqdm
from llama_index.evaluation import FaithfulnessEvaluator


# Load the serialised responses back
with open("responses.pkl", "rb") as file:
    responses = pickle.load(file)


# Initialise the faithfulness evaluator
service_context = query_engine.retriever._service_context
faithfulness_evaluator = FaithfulnessEvaluator(service_context=service_context)


# Faithfulness evaluation
faithfulness_eval_results = []
for response in tqdm(responses):
    eval_result = faithfulness_evaluator.evaluate_response(response=response)
    faithfulness_eval_results.append(eval_result)
    time.sleep(2)

Under the hood, the evaluator uses a default prompt template which we can tinker with as per our use case; we went ahead with the default. Below are the faithfulness scores for each query-context-response triplet in our evaluation dataset:




Evaluating context relevance

A distance metric (usually cosine similarity) between the query and all the chunks is computed, and the top-k chunks are retrieved. These chunks should ideally contain the information needed to answer the query, i.e., they should be relevant. The relevancy evaluation measures whether the generated response is in line with the retrieved context.
Note that this evaluation differs subtly from the faithfulness evaluation. The faithfulness prompt template (check it out here) explicitly instructs the LLM to answer "YES" if any part of the context supports the information, even if most of the context is unrelated.
The relevancy prompt template (check it out here), on the other hand, instructs the LLM to answer "YES" if the response to the query is in line with the context information, without specifying the degree of relevance or accounting for mostly unrelated context. The two are similar in essence, with the faithfulness evaluation being more constrained and the relevancy evaluation being more generic and open to interpretation. You will notice this in the evaluation scores below.
import time
import pickle

from tqdm import tqdm
from llama_index.evaluation import RelevancyEvaluator

# Initialise the relevancy evaluator
service_context = query_engine.retriever._service_context
relevancy_evaluator = RelevancyEvaluator(service_context=service_context)

# Relevancy evaluation
relevancy_eval_results = []
for query, response in tqdm(zip(questions, responses)):
    eval_result = relevancy_evaluator.evaluate_response(query=query, response=response)
    relevancy_eval_results.append(eval_result)
    time.sleep(2)
Below are the relevancy scores for each query-context-response triplet in our evaluation dataset.



The evaluation results

Below are the faithfulness and relevancy accuracies. Note how the faithfulness accuracy is lower than the relevancy accuracy - this follows from the previous section, where we noted that the former is more constrained while the latter is more generic.


Clearly, there is scope for improvement. Experiments to improve these scores will be discussed in a later report in this series.
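For reference, here is a minimal sketch of how such accuracies can be computed from the evaluator outputs collected earlier; it assumes each result exposes the boolean passing flag that LlamaIndex's EvaluationResult provides.
# Fraction of responses that passed each evaluation
faithfulness_accuracy = sum(r.passing for r in faithfulness_eval_results) / len(faithfulness_eval_results)
relevancy_accuracy = sum(r.passing for r in relevancy_eval_results) / len(relevancy_eval_results)

print(f"Faithfulness accuracy: {faithfulness_accuracy:.2%}")
print(f"Relevancy accuracy: {relevancy_accuracy:.2%}")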

Evaluating the retriever

The evaluation of the retriever is probably one of the most important aspects of improving the quality of a RAG pipeline. The job of a retriever is to select a few chunks from the index (vector-based, keyword-based, graph-based, etc.) given a query. If the retriever fails to retrieve accurate and contextually relevant information, it can lead to incorrect or nonsensical responses from the generator. Just as importantly, the retriever should rank the most relevant context first, followed by chunks of decreasing relevancy.
Obviously, for doing such an evaluation, we'll need an evaluation set where we have a ground-truth context for a given query. One way would be to use human annotators to build such a dataset, but that can be onerously time-consuming and expensive.
Luckily, we can employ GPT-4 for building a good query-context dataset. In LlamaIndex, we can use the generate_question_context_pairs function.
Before we get into how to use this function, here's the idea behind the evaluation strategy:
Figure 2: Pipeline to build the query-context evaluation set. (Source: hand made)
After the node-parsing stage, we end up with multiple chunks: even if you started with just one document, depending on your parsing strategy and the max token limit of each chunk, you can end up with many chunks. In the case of wandbot's node-parsing strategy, we got 4128 nodes from our English and Japanese documentation (docs.wandb.ai), our wandb/wandb repo, and our wandb/examples repo.
We took a subset of 1392 nodes from our English documentation. Obviously, the retriever should be evaluated for Japanese as well. We chose the documentation for two reasons:
  • It is the densest source of information about W&B, and users will likely start there.
  • This evaluation also has a subgoal: to see if we can improve our documentation (more on that later).

Naively sample nodes and generate query-context pairs

From those 1392 nodes, we first decided to sample single nodes randomly to generate the query-context pairs. The code snippet below shows how to use the generate_question_context_pairs function:
import random
from llama_index.llms import OpenAI
from llama_index.evaluation import generate_question_context_pairs

# naively sample NUM_SAMPLES nodes
nodes = [node1, node2, node3, .....]
sampled_nodes = random.sample(nodes, NUM_SAMPLES)

# Initialise GPT-4
llm = OpenAI(model="gpt-4")

# Generate query-context pairs.
qa_dataset = generate_question_context_pairs(
    sampled_nodes, llm=llm, num_questions_per_chunk=1
)
The generate_question_context_pairs function returns a dict of expected node ids (chunk_id in the table below) and the generated queries. We've logged the generated query-context pairs below in a W&B Table.
Note that a single chunk with a max chunk_size of 1024 was used to generate each query. If you check out a few samples, you will find questions like "Where can you find more and detailed examples?", "What is the function Files in this context?", and many questions in the "what is this?" format.
Clearly, this is not representative of the type of questions users will ask. Naive sampling is not very useful, especially if there isn't much context within the chunk to generate good questions from.



Sample nodes using token counting

Instead of randomly sampling chunks to generate query-context pairs, we can sample chunks that fit a certain criterion. Counting the number of tokens in each chunk and plotting the counts as a histogram can help derive a useful criterion.
Interestingly, there are a few chunks with a token count of 0 and a few with over 900 tokens. The majority of good chunks are in the range of 100-400 tokens. We can randomly sample within this range to improve the quality of the query-context pairs.
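Here is a sketch of the token counting and range-based sampling described above; it assumes tiktoken with the cl100k_base encoding, and that each node exposes its text via get_content().
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

# Count tokens per chunk (this is what the histogram is built from)
token_counts = [len(encoding.encode(node.get_content())) for node in nodes]

# Keep only chunks whose token count falls within the 100-400 range
sampled_nodes = [
    node for node, count in zip(nodes, token_counts) if 100 <= count <= 400
]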



Sampling by considering the format of the document

The document in this case is our documentation. Since most docs have a tree-like structure where a parent topic (page) links to multiple sub-topics, we can group all the sub-topics under that parent topic and sample from them.
For wandbot, we evaluated the retriever for topics like Launch, Sweeps, Artifacts, Tables, etc. (essentially, our product features). Counting the number of tokens in this filtered-down list of nodes (belonging to one product category) and sampling within a range of minimum and maximum token counts helps improve the quality of the generated query-context pairs.
Figure 3: Tree-like structure of the Weights & Biases documentation.
Such a structure might not exist for your use case; consider a sampling strategy that makes sense for your data.
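Where such a structure does exist, grouping can be as simple as bucketing nodes by their source path. The paths and the metadata key below are illustrative, assuming each node carries a file_path entry in its metadata.
from collections import defaultdict

nodes_by_topic = defaultdict(list)
for node in nodes:
    path = node.metadata.get("file_path", "")
    if "guides/launch" in path:
        nodes_by_topic["launch"].append(node)
    elif "guides/sweeps" in path:
        nodes_by_topic["sweeps"].append(node)
    elif "guides/artifacts" in path:
        nodes_by_topic["artifacts"].append(node)

# Nodes for one product category, ready for token-count filtering and sampling
launch_nodes = nodes_by_topic["launch"]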

Understanding Mean Reciprocal Ranking and Hit Rate Metrics

Mean reciprocal ranking (MRR) is a metric used to evaluate the effectiveness of search engines, recommendation systems, or any system that involves ranking a list of items. It's particularly common in information retrieval. It's usually formulated as...
MRR = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{\text{rank}_i}
...where N is the number of queries or instances, and rank_i is the position of the first correct item for the i-th query.
In simpler terms, MRR penalizes systems that have correct answers ranked lower in the list and rewards those that have correct answers ranked higher.
In the example below, the reciprocal rank for each row is computed. The proposed results are the chunks retrieved by the query engine, and the correct response is what we call the ground-truth chunk. The rank is the position of the correct chunk among the retrieved chunks, and the average of the reciprocals of these ranks is the MRR.
Figure 4: An example to understand MRR and hit rate.
What about hit rate? The hit rate is a binary score indicating whether the correct chunk is present among the retrieved chunks. In the example above, the hit rate is 1 for every sample. We usually report the average of the hit rates.
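As a toy illustration of both metrics (the chunk ids below are hypothetical):
# Retrieved chunk ids for three queries, and the ground-truth chunk for each
retrieved = [["c3", "c1", "c7"], ["c2", "c9", "c4"], ["c8", "c5", "c6"]]
expected = ["c1", "c2", "c5"]

reciprocal_ranks, hits = [], []
for ranked, truth in zip(retrieved, expected):
    if truth in ranked:
        reciprocal_ranks.append(1 / (ranked.index(truth) + 1))  # 1 / rank of the correct chunk
        hits.append(1)
    else:
        reciprocal_ranks.append(0.0)
        hits.append(0)

mrr = sum(reciprocal_ranks) / len(reciprocal_ranks)  # (1/2 + 1/1 + 1/2) / 3 ≈ 0.67
hit_rate = sum(hits) / len(hits)  # 3/3 = 1.0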

Evaluating the retriever with RetrieverEvaluator

The RetrieverEvaluator class makes doing this kind of evaluation easy. All we need to do is pass the query and the expected (ground-truth) context, which we generate using the generate_question_context_pairs function.
Notice that in LlamaIndex, one can easily get a retriever from the index and pass it to the RetrieverEvaluator. What's even more useful is that, with the same index, we can create retrievers with any value for the similarity_top_k argument. Thus, if similarity_top_k is k, we get the mrr and hit_rate metrics as mrr@k and hit_rate@k respectively.
import random
from llama_index.llms import OpenAI
from llama_index.evaluation import generate_question_context_pairs
from llama_index.evaluation import RetrieverEvaluator


# Sample nodes belonging to W&B Launch whose (precomputed) token count
# falls within the chosen range
nodes = [node1, node2, node3, .....]
sampled_nodes = []
for node in nodes:
    if MIN_TOKEN_COUNT <= node.token_counts < MAX_TOKEN_COUNT:
        sampled_nodes.append(node)


# Initialise GPT-4
llm = OpenAI(model="gpt-4")


# Generate query-context pairs.
qa_dataset = generate_question_context_pairs(
    sampled_nodes, llm=llm, num_questions_per_chunk=1
)


wandbot_retriever = index.as_retriever(similarity_top_k=2)
retriever_evaluator = RetrieverEvaluator.from_metric_names(
    ["mrr", "hit_rate"], retriever=wandbot_retriever
)


# Compute mrr and hit-rate for one generated query-context pair
sample_id, sample_query = list(qa_dataset.queries.items())[0]
sample_expected = qa_dataset.relevant_docs[sample_id]
eval_result = retriever_evaluator.evaluate(sample_query, sample_expected)
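To go from this single pair to dataset-level numbers, one sketch is to loop over the whole generated dataset and average the per-query metrics; metric_vals_dict is the mapping exposed by LlamaIndex's retrieval eval result, though the exact attribute may differ across versions.
from collections import defaultdict

metric_totals = defaultdict(float)
for query_id, query in qa_dataset.queries.items():
    expected_ids = qa_dataset.relevant_docs[query_id]
    result = retriever_evaluator.evaluate(query, expected_ids)
    for name, value in result.metric_vals_dict.items():
        metric_totals[name] += value

num_queries = len(qa_dataset.queries)
print({name: total / num_queries for name, total in metric_totals.items()})  # e.g. {"mrr": ..., "hit_rate": ...}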

The query-context pairs for the entire generated dataset can be seen below, along with the MRR and hit rate scores. I have only logged the ground-truth and retrieved node ids. Check out the mrr and hit_rate values for the generated queries.


The launch-en in the panels' titles indicates that these scores are computed for the W&B Launch product category on the English documentation. Note that only this product category was used to generate the query-context pairs, while the retriever had access to the entire documentation. An MRR of 1 would make for a perfect retriever, so we have room to improve our retriever for this product category.
Moving forward, we will be doing this evaluation for every product category and using this information to improve wandbot's responses as well as the quality of our documentation.

Final Thoughts

The aim of writing this report was to share how we approached evaluating our LLM-based application, wandbot. The evaluation strategies are closely dependent on the use case.
I plan to continue this series with results from the manual annotation and from how wandbot performs in different product categories. The aim is also to answer some of the questions about a RAG pipeline in the context of wandbot: the chunk size, retrieval techniques, etc. I hope you find this report insightful and that it helps you build your own evaluation strategies.
You can also check out our preceding wandbot eval reports here:
