
How to Evaluate an LLM, Part 1: Building an Evaluation Dataset for our LLM System

Building gold standard questions for evaluating our QA bot based on production data.
Evaluating any LLM-based system isn't easy. It requires multiple steps and many weeks of careful thought. In this report, we'll look at how to do just that by evaluating the LLM-powered documentation application we call WandBot, which answers user questions on how to use W&B.
If you haven't checked out WandBot yet, you can head over to our Discord Server and join the #wandbot channel to try it out!
💡
In a previous W&B report, "How to Evaluate, Compare, and Optimize LLM Systems?" I tried to cover the whats and hows of evaluating an LLM-based system. We won't go into detail here (you can read the piece for that), but broadly speaking, there are three main categories we looked at:
  • Eyeballing: While building a baseline LLM system, we usually review the system on a few samples to evaluate its performance. In other words, we eyeball the quality of retrieved chunks from the retriever or check if the generated output satisfies the asked question.
  • Supervised: This is the recommended way to evaluate LLM apps where you manually annotate LLM system outputs.
  • Auto Evaluation: In this paradigm, we can use a powerful LLM to generate a meaningful synthetic evaluation dataset or use a manually curated evaluation dataset to then ask an LLM to evaluate different components of the LLM-based system. For example, you can ask an LLM to grade the quality of the generated response or mark it correct/incorrect.
WandBot has already gone through an extensive "eyeballing" phase of evaluation. Every tweak has been carefully looked through and evaluated. Since WandBot can already reply with standard answers to questions asked outside W&B documentation, we deployed it confidently and have had it running in production for some time now.
QA bots for sensitive domains like law, health care, policy, etc. should not be put in production even after a thorough "eyeballing" evaluation; they should first be properly evaluated using supervised methods.
💡
In this report, we will walk through the steps taken to better understand the kinds of questions asked by the users of WandBot. The idea is to sample a few hundred questions from the 800+ asked to WandBot. These sampled questions will act as gold-standard "queries" that we can use for both manual evaluation and auto-evaluation.
Understanding our distribution of user queries

As mentioned, WandBot was already in production. That means we have access to the questions asked by real users (a.k.a. production distribution). Understanding what questions the users are asking is key to deriving a gold-standard evaluation set. This understanding will also help when we see a temporary shift in the distribution due to the launch of a new product feature (users asking questions about it).
Let's look at the publicly asked questions logged as W&B Tables and perform some EDA on top of them. If this is your first time hearing about W&B Tables, here's a quick way to get started:
import pandas as pd
import wandb

# Read the csv file as a pandas DataFrame
df = pd.read_csv("path/to/file/you_want_to_log.csv")

# Log the DataFrame to W&B - it will automatically be converted to a W&B Table.
run = wandb.init(project="your_project_name")
run.log({"my_dataframe": df})
The public questions and responses (generated by the bot) are logged to this W&B project. I have curated the dataset and dumped it into a single W&B Table, shown below, as of September 22nd, 2023 at 6:21:42 pm (IST).
💡



The query column holds the questions asked by W&B's Discord community. The response is generated by various beta versions of WandBot. The feedback is collected as emoji (👍🏻/👎🏻) reactions to the response, mostly from the person who asked the question.

How many users gave feedback (👍/👎)?

Out of 872 questions, most did not receive any feedback. But with 187 thumbs-ups and 74 thumbs-downs, we have 261 questions with feedback, which is a solid number for RAG evaluation.
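As a quick sanity check, these counts can be reproduced with a one-liner, assuming the logged table has been pulled into a pandas DataFrame (called public_df later in this report) with a feedback column holding the emoji reactions:
# Tally the emoji reactions; NaN means the question received no feedback at all.
print(public_df["feedback"].value_counts(dropna=False))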



Preprocessing the user queries

First, let's clean up our data. For starters, most queries start with the substring "@wandbot (beta)", which is unnecessary information. We also see a few queries like "@wandbot (beta) are you there?" which, again, are not relevant questions.
We can also do simple deduplication with df.drop_duplicates. Note that this is a good first pass for removing exact duplicate strings from the DataFrame.
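Concretely, this first preprocessing pass can look something like the sketch below (the stripped substring is the one mentioned above; the DataFrame and column names are assumptions):
# Strip the bot mention and surrounding whitespace from each raw query.
public_df["processed_query"] = (
    public_df["query"]
    .str.replace("@wandbot (beta)", "", regex=False)
    .str.strip()
)

# Drop rows that are exact string duplicates of an earlier query.
public_df = public_df.drop_duplicates(subset=["processed_query"]).reset_index(drop=True)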
The preprocessed data is shown below:


The processed queries look better. What's more, deduplication reduced the number of queries from 872 to 801. Because the data is logged to W&B Tables, we can run through it, find all the samples we don't want in our final question bank, remove them from the local csv file, and later log the final csv file as a W&B Table. I have removed the following ids:
drop_ids = [3, 42, 194, 268]

for drop_id in drop_ids:
    public_df = public_df.drop(drop_id)

public_df = public_df.reset_index(drop=True)

Can we count tokens to filter out bad samples?

In our use case, a bad sample is anything that is not related to W&B or is ill-posed. Texts with very few tokens are usually gibberish and not relevant to W&B. Yes, users can ask such questions, and from an evaluation point of view they should not automatically be considered "bad", but we decided not to waste dollars on such short samples and instead to nail down a few questions that are hard for WandBot to answer.
Tokenizing the sentences and counting the tokens can help filter out these bad data samples. Short sentences will have a low token count, and we can choose to remove the samples below a certain token-count threshold, while longer sentences will have a large token count.
I used tiktoken, a fast byte pair encoding tokenizer used with OpenAI's models, to tokenize all the preprocessed public questions. The code snippet below encodes a sample text phrase:
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4")
enc.encode("hello world")
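Applied to the whole question bank, here's roughly how the tiktoken_gpt4_token_count column used below can be computed (a sketch, assuming public_df holds the processed queries):
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

# Count GPT-4 tokens for every processed query.
public_df["tiktoken_gpt4_token_count"] = public_df["processed_query"].apply(
    lambda text: len(enc.encode(text))
)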
The total number of samples is 797 at this point.


Let's group by the tiktoken_gpt4_token_count column to maybe find a few more bad examples. Interestingly, we see a few data points with just 1, 2, or 4 tokens. I modified the column settings of the processed_query column to show three samples vertically.


Clearly, there are a few "questions" asked by users that are not useful from an evaluation point of view. However, this also gives us an indication that WandBot should be able to deal with similar cases in the future.
Questions like "how is the weather?" or "are you sentient?" can be removed from our eval set. We can consider keeping a few of them as edge cases for which WandBot should return a standard response suggesting that not enough information was given or that, as an AI assistant bot, it cannot answer such a question.
We can either note down the IDs and remove them from the pandas DataFrame or just filter out questions with token counts below 9 (from my observation, it's a good threshold).
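As a sketch, the threshold-based route is a one-liner (though, as described below, I ultimately went through the short questions by hand and picked specific ids to drop):
# Keep only questions with at least 9 GPT-4 tokens.
public_df = public_df[public_df["tiktoken_gpt4_token_count"] >= 9].reset_index(drop=True)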
On the extreme end, folks have asked questions containing an entire error stack trace, and someone dumped Lorum Ipsum.


I could simply remove all questions with token counts of less than 9, but instead I went through each such question and picked the following ids to be dropped:
drop_ids = [
    705, 336, 635, 769, 267, 284, 772, 77, 420, 562, 125, 207, 285, 374,
    514, 518, 576, 661, 223, 337, 429, 499, 577, 621, 4, 298, 327, 434,
    553, 638, 759, 415
]
I also removed the question with token count 990 (Lorem Ipsum, though our data here chose a different spelling).
While going through a few more samples, I noticed multiple semantically similar questions. The naive deduplication using df.drop_duplicates didn't do a great job removing those since it doesn't look at the semantics. Before we tackle that, let's look at the newly filtered questions:
Just a note: the number of questions dropped from 797 to 763.
💡



Understanding the semantics of the questions

Another technique folks usually use is to group semantically similar data points in a high-dimensional embedding space. These clusters give a good idea of which questions are similar and can be dropped. We need not evaluate our bot on the same "type" of question multiple times.
I am using OpenAI's text-embedding-ada-002 embedding model to embed each question. For visualization and insights, I am using Atlas by Nomic.ai.

How to visualize embeddings using Atlas?
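The original report shows this step in a collapsible panel; a rough sketch of the workflow, assuming the pre-1.0 openai client (with OPENAI_API_KEY set) and Nomic's atlas.map_embeddings helper, looks like this:
import numpy as np
import openai
from nomic import atlas  # assumes you have already run `nomic login`

query_list = public_df["processed_query"].tolist()

# Embed every question with OpenAI's text-embedding-ada-002 (1536-dimensional vectors).
response = openai.Embedding.create(model="text-embedding-ada-002", input=query_list)
embeddings = np.array([item["embedding"] for item in response["data"]])

# Upload the embeddings (with the raw question text as metadata) to an Atlas map.
project = atlas.map_embeddings(
    embeddings=embeddings,
    data=[{"question": q} for q in query_list],
    name="Public Questions Wandbot",
)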

Atlas embedding visualization

The public URL of the Atlas embedding visualizer is here.


I highly recommend playing with the tools (especially the lasso selection) to select clusters and run through the data points. You'll realise that there are a few semantically similar questions which one can drop.
(One of the reasons we chose Atlas was to perform duplicate clustering out of the box, but that feature seems to be broken for now. I have raised an issue in their GitHub repo; if it gets resolved, I'm excited to try it.)

Remove near-deduplicates

After spending some time with the questions, I found multiple near-duplicates. Since everyone has their own definition of near-duplicate, here's what I am using:
Any pair of text strings asking roughly the same kind of question for the same product category or raising an issue about the same feature is a near-duplicate.
There are many advanced algorithms to filter out near-duplicates from a text dataset, but for roughly 700 data samples, we can get away with a two-for-loop implementation of Jaccard similarity.

Implementation of Jaccard Similarity
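Since that panel is interactive in the original report, here is a minimal sketch of the same two-for-loop idea, assuming simple lowercased whitespace tokenization and the 0.5 threshold discussed below:
def jaccard_similarity(text_a: str, text_b: str) -> float:
    # Jaccard similarity between the word sets of two strings.
    set_a, set_b = set(text_a.lower().split()), set(text_b.lower().split())
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)

# Compare every pair of questions and collect the near-duplicate pairs.
queries = public_df["processed_query"].tolist()
near_duplicates = []
for i in range(len(queries)):
    for j in range(i + 1, len(queries)):
        score = jaccard_similarity(queries[i], queries[j])
        if score >= 0.5:
            near_duplicates.append((i, j, score))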

Visualizing near-duplicates

A graphical representation of near-duplicates

Hover the mouse pointer over the nodes to see the text and over the edges to see the Jaccard similarity score. Note that I have used a high threshold of 0.5 because it works well for this use case. For your own use case, experiment with different thresholds.


Here's the filtered W&B Table after removing near-duplicates. The quality of questions has improved with every iteration. We also went down from 763 data samples to 671 samples.


I was also able to delete the near-duplicates from the Atlas project. Check it out here.
from nomic import AtlasProject

map = AtlasProject(name='Public Questions Wandbot')
atlas_df = map.maps[0].data.df

# `dropped_index` holds the DataFrame indices of the near-duplicates found above.
dropped_ids = []

for dropped_id in dropped_index:
    id_ = atlas_df.loc[dropped_id].id_
    dropped_ids.append(id_)

with map.wait_for_project_lock():
    map.delete_data(ids=dropped_ids)
    map.rebuild_maps()



How do we cluster similar questions?

After all the preprocessing and deduplication steps above, we're still left with 671 questions. But we still want to find a meaningful number of clusters that is representative of our production distribution. Why is that? Well, evaluating on the same kind of question multiple times with GPT-4 costs unnecessary money, and clusters allow us to effectively evaluate multiple similar questions at once. Plus, making many API calls to GPT-4 takes time, and we want our evaluation to be fast. Further, we decided that we need 100-150 questions (sampled from the clusters), as we felt that is the limit of what our ML Engineers could manually annotate repeatedly as we iterate on our LLM system.
There are many ways to go about clustering the questions (I've documented a few below). Note that for your use case the same algorithm might not give the best result, so experiment extensively. Another good strategy, especially if you have a lot of samples, is to randomly pick a subsample, manually annotate it into different clusters, and then use this annotated subsample to evaluate the clustering; it should improve the clusters. I skipped this step and eyeballed the clusters instead.

KMeans, BisectingKMeans, DBSCAN, others

I employed clustering algorithms like KMeans, DBSCAN, etc. to find good clusters. I used the OpenAI embedding model text-embedding-ada-002 to embed each question, followed by UMAP dimensionality reduction (with mostly default parameters) and then one of the clustering algorithms from scikit-learn. If you are new to clustering, scikit-learn's documentation on it is a must-read.
The code snippet shown below performs dimensionality reduction and clustering, followed by logging the clusters to a W&B Table. As you can see, I went ahead with near-default parameters and didn't experiment a lot with tuning them. The aim was never to have the most perfect clusters but to get clusters that are representative of the production distribution and from which I can sample one question each to form my eval set.
import wandb
import umap # pip install umap-learn
from sklearn.preprocessing import StandardScaler
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans

# List of embeddings for each query
embeddings = [...] # 1536 dimensional
# Initialize UMAP model
umap_model = umap.UMAP(n_components=50)
# Fit and transform the data
embeddings = umap_model.fit_transform(embeddings)
# Standardize
scaler = StandardScaler()
embeddings = scaler.fit_transform(embeddings)
# Get cosine similarity
cosine_sim = cosine_similarity(embeddings, embeddings)

# The `public_df` is a DataFrame where the column
# "processed_query" contains the questions.
df = public_df[["processed_query"]].copy()
# Perform clustering - replace KMeans with a different algorithm
kmeans = KMeans(n_clusters=80, random_state=42, max_iter=500, init="k-means++")
df['k_mean_cluster'] = kmeans.fit_predict(cosine_sim)

# Log the clusters as a W&B Table
run = wandb.init(project="wandbot-eval", job_type="clustering")
wandb.log({"public_questions_clusters": df})
The cluster ids along with the questions are logged to a W&B Table, which can be used to go through each cluster and eyeball whether the cluster makes sense. A few tables created using the output of the mentioned clustering algorithms are shown below. Feel free to run through them and see for yourself whether the clusters can be improved.
After selecting a run set (checkbox below), group by the column with the cluster ids. Click on the processed_query column name and select the panel type to be "String". Change the parameters after this selection to customize the viewing experience.
💡
After trying different traditional clustering algorithms, I was not very convinced by the quality of the clusters based on manually reviewing them.

Run sets: KMeans, AgglomerativeClustering, and BisectingKMeans cluster tables (interactive panels in the original report).


How to use Community Detection for better clusters?

I wasn't satisfied with the clusters and imagined using a vector index itself to first connect semantically similar questions in a graph and then group the questions into meaningful clusters. A vector index is an ordered collection of embeddings (vector representations) of each chunk of text. Why use a vector index?
  • LlamaIndex's VectorStoreIndex class can consume a list of questions (docs) and embed them - a one-liner,
  • the resulting index can easily be used to retrieve the top K semantically similar questions - also a one-liner.
The main idea is to get the top K (in my case K=2) semantically similar questions for each question and make a graph out of them. The following code snippet does just that:
from llama_index import VectorStoreIndex
from llama_index import Document
import networkx as nx

query_list = [...] # List of 671 questions
query_docs = [Document(text=t) for t in query_list] # To make it LlamaIndex compatible

# Get the top 2 semantic search retriever
vector_index = VectorStoreIndex.from_documents(query_docs)
vector_retriever = vector_index.as_retriever(similarity_top_k=2)

# Create a similarity graph
G = nx.Graph()

for i, text in enumerate(query_list):
    similar_texts = vector_retriever.retrieve(text)
    for similar_text in similar_texts:
        G.add_edge(text, similar_text.text)
The resulting graph has semantically similar questions (nodes) close to each other. There are some local clusters in the graph which can further be grouped into communities, increasing the overall density of those communities. Based on this idea, Louvain community detection was used to partition the graph into clusters. This algorithm is used for market segmentation, criminal detection, recommendation systems, etc., and is mostly used for clustering large graphs. There are a few resources online covering the mathematics of this algorithm, but this ELI5 explanation by ChatGPT might help you better understand the idea.
from community import community_louvain # pip install python-louvain

# Apply Louvain Community Detection
partition = community_louvain.best_partition(G)
I have logged the partitions as a W&B Table and visualized them as a graph. Below, I am showing the clusters formed by community detection. Each unique color represents a unique cluster of similar questions. Hover the mouse over the nodes to read through the questions.
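As a sketch, the partition dictionary (question text mapped to a community id) can be flattened into a DataFrame and logged like the earlier tables:
import pandas as pd
import wandb

# Each key of `partition` is a question (graph node); each value is its community id.
partition_df = pd.DataFrame(
    {"question": list(partition.keys()), "community_cluster": list(partition.values())}
)

run = wandb.init(project="wandbot-eval", job_type="community-detection")
run.log({"community_clusters": partition_df})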



Sample from the clusters of user queries using GPT-4

There are a total of 132 clusters. I could have randomly sampled one question per cluster and called it a gold-standard eval set, but I went a step further and used GPT-4 with the prompt shown below. The idea is to let GPT-4 pick the one question from each cluster's list that might be hard to evaluate on. Obviously, this is not a perfect sampling technique, and the prompt template could be improved further by providing a few-shot examples of what a hard question looks like.
prompt_template = """
Wandbot is a question answering bot for Weights & Biases documentation. I have provided a list of questions
that are semantically similar. As a smart assistant, your job is to find one question from the list of questions
that will be best to evaluate wandbot on. Find one hard question from the list of question.

# QUESTIONS

{questions}

Make sure to return just one question from the list of question in the given format:

```
['question']
```

Everything between ``` should be a valid python list.

ANSWER:
"""
The resulting 132 samples are shown below. They cover user questions on OG features like experiment tracking, W&B Tables, and W&B Artifacts, along with questions about newer features like Weave. I was happy with the sampled questions and decided to use them as the eval set.
Note that the eval set can improve with iterations as well. Having an eval set that covers the important bases (in this case, most of the product features) is crucial for rapid iteration on the system. Do employ a human in the loop to remove samples from and add new samples to the eval set.
💡
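One way to make that iteration easy is to version the eval set in W&B itself; a minimal sketch, assuming the 132 sampled questions live in an eval_df DataFrame (a hypothetical name):
import wandb

run = wandb.init(project="wandbot-eval", job_type="log-eval-set")

# Log the sampled questions as a Table for browsing ...
eval_table = wandb.Table(dataframe=eval_df)
run.log({"gold_standard_eval_set": eval_table})

# ... and version it as an Artifact so future edits to the eval set are tracked.
artifact = wandb.Artifact("wandbot-gold-eval-set", type="dataset")
artifact.add(eval_table, "eval_set")
run.log_artifact(artifact)
run.finish()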



Final thoughts

The evaluation strategy for any LLM-based application is closely tied to the application itself. The decision to deploy a rough version of your LLM-based application to collect some user data depends on the sensitivity and feasibility of the use case. WandBot was not harmful to deploy, and users were always invited to go through official support channels as well. In this report, I have documented the steps I took to come up with part 1 of the evaluation strategy for WandBot.
A gold-standard set of queries helps us do manual annotation (evaluation) of the system and apply auto-eval strategies that measure the faithfulness of the response and the relevancy of the retrieved context. (More on these in part 2 of this series.)
I hope the methods presented here will seed newer ideas as well. The main takeaway is that everyone is still trying to figure out evaluation for their LLM-based applications and there is no silver bullet. Having said that, the ideas presented in this report should work for most cases. Counting tokens to filter out unnecessary texts, using embedding visualization to find relevant clusters, using clustering algorithms to find clusters, etc., are crucial steps that you might encounter in your LLM-eval journey.
See you later in the week for part 2, where you will learn how we implemented our first round of manual evaluation.
