How to Evaluate an LLM, Part 2: Manual Evaluation of Wandbot, our LLM-Powered Docs Assistant
How we used manual annotation from subject matter experts to generate a baseline correctness score and what we learned about how to improve our system and our annotation process
Manual evaluation as a gold standard
In the previous installment of our LLM evaluation series, we explored the creation of a gold-standard evaluation set, a crucial foundation for assessing wandbot, our technical support bot running in Discord, Slack, Zendesk, and ChatGPT. In this post, our focus shifts to running a manual evaluation using that gold-standard eval set. The objective is to establish a baseline accuracy score by assessing how correct wandbot's generated responses actually are.
To refresh: when a user submits a query, wandbot retrieves relevant information chunks and employs an LLM to craft a response. The key criterion for assessment is how accurately this response addresses the user's query.
The accuracy results from our evaluation are below; keep reading for a detailed analysis of these results and how we're using them to improve wandbot.
To facilitate our manual evaluation, we used Argilla as our annotation tool. In this report, we'll share our insights into the manual evaluation process and results for LLM-based systems, as well as show how we used Argilla for annotation.
Table of contents
- Manual evaluation as a gold standard
- Table of contents
- Our RAG LLM Evaluation Dataset
- What makes an LLM response accurate?
- How to analyze the results from manual evaluation
- How accurate are wandbot's responses?
- Meta metrics: Link hallucination and query relevancy
- Using Argilla as our manual annotation tool
- Getting started with Argilla
- Assigning annotation samples to our annotators
- A look into the annotation UI
- Conclusion: Building on our baseline
- Read the other installments in this series
Our RAG LLM Evaluation Dataset
In part 1 of this series, we described how we built a gold-standard set of 132 questions sourced from real user queries. To run our manual evaluation, these queries were passed to wandbot to generate the responses that would be evaluated. Our golden set of evaluation questions is shown below in a W&B Table:
While generating these responses, we also stored the retrieved context, as the annotators might find it useful while judging the correctness of a response. The query, context, and generated response triplets were logged as a W&B Table, shown below:
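As a rough sketch, this is what collecting those triplets into a W&B Table might look like. The `generate_response` helper here is hypothetical and stands in for however your RAG pipeline returns a response plus its retrieved context; the project name is a placeholder.

```python
import wandb

def generate_response(query: str) -> tuple[str, str]:
    """Hypothetical stand-in for the RAG pipeline: returns (response, retrieved_context)."""
    return "placeholder response", "placeholder context"

# The 132 gold-standard questions from part 1 would go here.
eval_queries = ["How do I resume a crashed sweep?", "How do I log a confusion matrix?"]

table = wandb.Table(columns=["query", "context", "response"])
for query in eval_queries:
    response, context = generate_response(query)
    table.add_data(query, context, response)

with wandb.init(project="wandbot-eval") as run:  # placeholder project name
    run.log({"query_context_response": table})
```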
What makes an LLM response accurate?
The manual evaluation was done primarily to determine the accuracy of wandbot. But what is the definition of accuracy here? What are the different criteria under which we can call a response accurate? Properly defining your evaluation metric and ensuring it is specific to your use case is critical to the quality of annotation from manual annotators.
We highly recommend thinking through your evaluation criteria, or even running a trial annotation session, before starting your main annotation effort.
💡
The following evaluation criteria were shared as a memo with all the annotators:
- If the response has a code snippet, run it to confirm that it works.
- If the response is subjective, use your expert opinion to judge it correct or incorrect.
- The response should answer the question/query. It should not hallucinate an answer to a "similar" question.
- If the response is incorrect, please provide a correct answer. In most cases, you can copy-paste the generated response and make a minor change to correct it.
- Spend a few minutes on each data sample before marking it correct or incorrect. Use "unsure" as little as possible - ideally zero unsures.
Guidelines used:
- Code accuracy: Wandbot is meant to be used by developers looking for code snippets to get a task done, so code accuracy is crucial; any response containing a code snippet was run separately to confirm that it worked. Responses that should contain a code snippet but don't were marked incorrect.
- Subjectiveness of generated response: Sometimes a generated response may be correct for one group of developers while inadequate for another. In such cases, the annotators had the liberty to use their domain expertise to annotate.
- Hallucinations result in incorrect responses: We noticed during the EDA phase (part 1) that some generated responses were made up to answer a query merely "similar" to the one actually asked, failing to capture subtle differences. Such responses were marked incorrect.
Our annotators were all internal machine learning engineers at Weights & Biases. We booked a 3-hour slot on every annotator's calendar so that we could start annotating together. This let us discuss the few "subjective" responses and reach a satisfactory annotation. While on the call, we slightly modified the evaluation criteria on the fly. We were able to get away with this because our annotators were experts; in your use case, you might have to work closely with your hired annotators to get the most out of this evaluation exercise.
How to analyze the results from manual evaluation
So how good are the generated answers? In this section we'll look into the results, then slice and dice them to get a deeper understanding of wandbot's performance.
Argilla, the tool we used for manual annotation, makes it really easy to pull the annotated data. You can then dump the annotations into a pandas dataframe to analyze them further. On top of that, you can log the dataframe as a W&B Table, as shown below, to get powerful interactive visualization.
Paginate through the Table below to check out the raw annotations. The notes added by the annotators were really insightful for improving wandbot, something we'll take a look at in just a moment.
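For reference, here is a minimal sketch of what pulling the annotations and logging them might look like. It assumes the Argilla 1.x `FeedbackDataset` API and uses placeholder values for the Space URL, API key, dataset name, workspace, and question names - adjust them to match your own setup.

```python
import argilla as rg
import pandas as pd
import wandb

# Placeholders: point the client at your deployed Argilla Space.
rg.init(api_url="https://<your-space>.hf.space", api_key="<owner-api-key>")

dataset = rg.FeedbackDataset.from_argilla(name="wandbot-response-accuracy", workspace="ml-team")

rows = []
for record in dataset.records:
    if not record.responses:  # skip records that haven't been annotated yet
        continue
    answers = record.responses[0].values  # a single annotator per record (zero overlap)
    rows.append({
        "query": record.fields["query"],
        "response": record.fields["response"],
        "accuracy": answers["response_accuracy"].value,
        "note": answers["note"].value if "note" in answers else None,
    })

df = pd.DataFrame(rows)
with wandb.init(project="wandbot-eval") as run:
    run.log({"manual_annotations": wandb.Table(dataframe=df)})
```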
How accurate are wandbot's responses?
This is the primary metric for which we wanted to get a baseline score using manual evaluation. As you can see in the bar chart below, 88 out of 132 responses are correct (note: there are 12 responses for which the annotators weren't sure about their correctness).
Overall, wandbot's response accuracy is computed as (correct / total) * 100. This formulation treats the unsure samples as incorrect.
wandbot's response accuracy after manual evaluation is 66.67%.
💡
If we remove the unsure annotations from the computation, the accuracy is 73.3%. Nevertheless, this is a good baseline score for our LLM-based system, and there's obviously room for improvement.
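Both numbers fall directly out of the counts above; as a quick sanity check:

```python
correct, unsure, total = 88, 12, 132  # annotation counts from the table above

# Strict accuracy: unsure samples count as incorrect.
strict = correct / total * 100              # 66.67%

# Lenient accuracy: unsure samples dropped from the denominator.
lenient = correct / (total - unsure) * 100  # 73.33%

print(f"strict: {strict:.2f}%  lenient: {lenient:.2f}%")
```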
A look into incorrect responses
So what are some of the reasons for annotating the responses as incorrect? Let us look at the notes left by the annotators for the 32 incorrect responses. The notes can be put into a few categories to get a holistic picture of the limitations of wandbot.
- Wrong language: The user queries are all in English, but a few responses were generated in Japanese. Wandbot ingested both our English and Japanese documentation, and a single retriever was used to get the context. Sometimes the retrieved context is in Japanese, leading to a Japanese response. This is annotated as incorrect because a person asking a query in English expects the answer in the same language. Our initial hypothesis was that GPT could handle this by itself without us needing two separate retrievers. We have since updated wandbot to keep this language mix-up from happening.
- Documentation limitations: Some of the responses were incorrect because of missing or confusing documentation. This is not exactly wandbot's fault, but the data source is a crucial part of an LLM-based system. In some responses wandbot's suggestion (choice of API, etc.) works but is a usage pattern we discourage, and we marked such responses as incorrect. Manual evaluation showed us a few holes in our documentation that should be fixed.
- Broadly asked question: A few queries are very broad. Ideally the system should recognize this and ask the user for more information. Instead, wandbot's retriever confuses multiple keywords in such broad queries and stitches together an answer that in practice is not correct.
- Out of scope: Sometimes users ask a question that's not directly related to W&B. In such cases, wandbot should politely ask the user to pose a W&B-specific question - this is something we have specified in the system prompt. We have seen instances where wandbot retrieves context that is only loosely connected to the question and makes things up.
- Hallucination: We have seen multiple examples of hallucinated responses: either the code snippet is hallucinated or the response is made up by stitching together irrelevant contexts. Either way, there is room to improve the quality of our retriever and the overall prompt design of wandbot.
A look into unsure responses
There are a total of 12 samples that the annotators weren't sure whether to mark correct or incorrect. This uncertainty is mostly attributable to:
- Insufficient information in the query: Even though a response seemed correct, the annotators weren't sure whether it satisfactorily answered the query, mostly because of insufficient or confusing details in the query itself.
- Response being too generic: A few responses, even though they seemed correct, were too generic. For example, the user question might be about how to use W&B Sweeps with the HF Trainer, but the response talks about how to use the Sweeps APIs to get the same thing done.
Meta metrics: Link hallucination and query relevancy
While our primary goal was to measure the accuracy of wandbot's responses, we also evaluated two meta metrics:
- Link Hallucination: The generated links need to be valid and, more importantly, relevant to the user query.
- Query Relevancy: The user's question should be relevant to Weights & Biases in the first place. If it isn't, wandbot should respond with a templated answer.
Link Hallucination
A small percentage of responses contained hallucinated links. Looking at the notes left by the annotators, most of the hallucinated links were the W&B support email address. Overall, we are happy with wandbot's ability to add the correct reference links.
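Link validity (as opposed to relevance) is the part of this check that can be automated. Here is a minimal sketch, assuming `requests` is available; the regex and timeout are arbitrary choices:

```python
import re
import requests

URL_PATTERN = re.compile(r"https?://[^\s)\"'>]+")

def check_links(response_text: str) -> dict[str, bool]:
    """Map each URL found in a generated response to whether it actually resolves."""
    results = {}
    for url in set(URL_PATTERN.findall(response_text)):
        try:
            status = requests.head(url, allow_redirects=True, timeout=5).status_code
            results[url] = status < 400
        except requests.RequestException:
            results[url] = False
    return results

# A broken or made-up URL maps to False; relevance to the query still needs a human judge.
print(check_links("See https://docs.wandb.ai/guides for details."))
```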
Query Relevancy
Wandbot should respond to queries not related to Weights & Biases with an answer like "Your question doesn't pertain to wandb. I'm here to assist with wandb-related queries. Please ask a wandb-specific question." Failing to do so mostly results in a hallucinated answer.
- Wandbot successfully responded to 8 of the 15 irrelevant queries with the standard response (8/15 = 53.33%).
- The incorrect responses happened because the query, even though not related to W&B, contained keywords for which a few seemingly relevant contexts could be retrieved.
The best way to filter out irrelevant queries would be to use a classifier, as sketched below.
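Here is a minimal sketch of what such a relevance filter could look like, using a simple TF-IDF plus logistic regression pipeline trained on queries labeled during annotation. The example queries and labels below are toy data, not our actual training set:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labeled queries: 1 = related to W&B, 0 = unrelated.
queries = [
    "How do I resume a crashed sweep?",
    "How can I log a confusion matrix to W&B?",
    "What's the best pizza topping?",
    "How do I center a div in CSS?",
]
labels = [1, 1, 0, 0]

relevance_clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
relevance_clf.fit(queries, labels)

# Route the query to wandbot only if it looks W&B-related; otherwise
# return the templated "please ask a W&B-specific question" answer.
is_relevant = relevance_clf.predict(["How do I log model checkpoints as artifacts?"])[0]
```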
Using Argilla as our manual annotation tool
Choosing the right annotation tool is pivotal for a seamless evaluation process. Argilla.io emerged as the optimal choice for our LLM-based system assessment for a number of reasons:
- Tailored for LLM Use Cases: Argilla.io is purpose-built for LLM use cases, ensuring a seamless integration with our evaluation needs.
- Effortless User Experience (UX): Our criteria included avoiding the need to code the user experience. Argilla.io eliminated this concern, offering an intuitive interface without the hassle of manual coding.
- Simplified Deployment and Database Management: We wanted a tool that spared us the complexities of database management and deployment. Argilla.io excelled in this aspect, allowing us to set up quickly and commence annotation without unnecessary complications.
While alternatives such as Streamlit and Gradio were considered, they proved to be more involved. Although these options provide extensive control over UX and deployment, our priority was swift setup and ease of use, aligning perfectly with Argilla.io.
For those venturing into manual annotation to evaluate LLM-based systems, we recommend exploring various options and selecting the one that aligns most closely with your specific criteria. Argilla.io's compatibility with our needs suggests it's worth consideration in your evaluation toolkit.
Getting started with Argilla
Argilla is composed of two components:
- Argilla Client: a Python library for reading and writing data into Argilla. The client can be used to assign data splits to different annotators, control the overlap of samples between annotators, add or remove annotators, add new samples to be annotated and more.
- Argilla Server and UI: the API and UI for data annotation and curation. This was the selling point for our effort to annotate wandbot's responses for accuracy. The default UX provided for evaluating the responses was more than sufficient for us.
We hosted Argilla on Hugging Face Spaces, which is one of the two recommended ways to get started with Argilla. To avoid losing data, we used the persistent storage layer offered by Hugging Face. Deploying Argilla on HF Spaces takes only a few clicks.
You can check out our hosted wandbot evaluation app here: https://huggingface.co/spaces/wandb/wandbot_response_accuracy
💡
You can install the Argilla client by running:
pip install argilla
To learn more about how to interact with the deployed Argilla app using the client, check out the official practical guides.
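Connecting to a deployed Space from the client is then a single call. The URL and API key below are placeholders for whatever you configured when creating the Space:

```python
import argilla as rg

rg.init(
    api_url="https://<your-username>-<your-space>.hf.space",  # your Space URL
    api_key="<owner-api-key>",                                # set when the Space was created
)
```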

Assigning annotation samples to our annotators
We planned to manually annotate the generated answers in-house using some of our machine learning engineers. Wandbot is a QA bot that uses W&B documentation as its data source, and users of W&B are the target users of this system. MLEs working at W&B, being domain experts, were ideal for this evaluation/annotation task. Using in-house MLEs to annotate allowed us to:
- Quickly evaluate the generated answers for correctness,
- Gain the ability to leave extra feedback on how to improve wandbot as a whole, and
- Avoid articulating many rules and evaluation criteria to get everyone on the same page.
100 out of the 132 samples were evenly distributed (20 each) between 5 in-house MLEs. The remaining 32 samples were evaluated by myself, the author of this report. The samples were shuffled and assigned randomly so that each annotator covered the maximum product surface area.
The Argilla client makes it very easy to give each annotator a unique workspace containing their assigned split of the dataset. The client also makes it easy to configure the overlap between the assigned data splits; we opted for zero overlap, given that all our annotators are domain experts. Check out Argilla's documentation on how to assign records (data splits) to annotators, and see the sketch below the figure for what this looks like in code.

Figure 2: Data samples assigned to 6 annotators with zero overlap, viewed from an Argilla workspace with admin access. Annotators can only access their own workspace.
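As a rough sketch of that assignment step, assuming `full_dataset` is an Argilla `FeedbackDataset` holding all 132 records (its fields and questions are sketched in the next section) and that a workspace already exists for each annotator; the workspace names are placeholders:

```python
import argilla as rg

annotators = ["annotator_1", "annotator_2", "annotator_3", "annotator_4", "annotator_5"]
chunk = 20  # 20 records each; the remaining 32 went to a sixth workspace

for i, workspace in enumerate(annotators):
    split = rg.FeedbackDataset(
        fields=full_dataset.fields,
        questions=full_dataset.questions,
        guidelines=full_dataset.guidelines,
    )
    split.add_records(full_dataset.records[i * chunk : (i + 1) * chunk])
    # Annotators only see datasets pushed into their own workspace -> zero overlap.
    split.push_to_argilla(name="wandbot-response-accuracy", workspace=workspace)
```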
A look into the annotation UI
As mentioned above, Argilla comes with a practical and elegant UI/UX baked in for the annotation use case. Using the Argilla client, we can easily define multiple tasks per data sample (record) in a coherent UI.
A record in Argilla refers to a data item that requires annotation. It can consist of one or multiple fields, i.e., the pieces of information that will be shown to the annotator in the UI in order to complete the annotation task. In our case, the fields are "query", "response", and "context."
Additionally, the record will contain questions that the annotators will need to answer and guidelines to help them complete the task. All the defined questions render in a coherent UI, making it easy for the annotators to get started with annotating the records.

Fig 3: the Argilla data annotation UI
As shown in figure 3, we have three fields for the task at hand. We went with four questions overall (a sketch of how such a dataset could be configured with the Argilla client follows this list):
- Wandbot Response Accuracy: The annotators are asked to select the label (correct, incorrect, or unsure) that best describes the accuracy of the response. If the unsure label was selected, the annotators were asked to provide a reason in the Note box below.
- Wandbot Link Hallucination: LLMs can come up with incorrect links (made-up URLs). The annotators were asked to check whether the links in the response are invalid or unrelated to the query. One can programmatically check whether the links are valid, but with manual annotation we were also able to find links that are valid yet not related to the query.
- Is the Query related to W&B?: A query asked by the user might not be related to Weights & Biases. We asked the annotators to annotate the relevancy of the query. Wandbot should come up with a templated response if the query is not related to W&B, something like: "the question is not related to W&B, please provide more context". If wandbot makes things up to answer an irrelevant query, the response is marked as incorrect.
- Note: This question field was used to provide a reason when the response was incorrect and to leave other miscellaneous comments on the quality of the response. The idea behind having a free-text form is to gather ideas to further improve the quality of our LLM-based system, aka wandbot.
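Putting it together, a configuration along these lines could define those fields and questions with the Argilla client. The guidelines text, question names, and labels here are illustrative rather than our exact configuration, and `triplets` stands in for the (query, context, response) tuples generated earlier:

```python
import argilla as rg

full_dataset = rg.FeedbackDataset(
    guidelines="Run code snippets, use your expertise for subjective answers, "
               "and mark hallucinated responses as incorrect.",
    fields=[
        rg.TextField(name="query"),
        rg.TextField(name="response"),
        rg.TextField(name="context"),
    ],
    questions=[
        rg.LabelQuestion(
            name="response_accuracy",
            title="Wandbot Response Accuracy",
            labels=["correct", "incorrect", "unsure"],
        ),
        rg.LabelQuestion(
            name="link_hallucination",
            title="Wandbot Link Hallucination",
            labels=["yes", "no"],
        ),
        rg.LabelQuestion(
            name="query_relevancy",
            title="Is the Query related to W&B?",
            labels=["relevant", "not relevant"],
        ),
        rg.TextQuestion(name="note", title="Note", required=False),
    ],
)

# One record per (query, context, response) triplet from the evaluation table.
full_dataset.add_records([
    rg.FeedbackRecord(fields={"query": q, "context": c, "response": r})
    for q, c, r in triplets
])
```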
Besides the configurable fields and questions, the Argilla UI has a few useful features baked in, like semantic search, a status indicator for each record, the ability to navigate between assigned records, the submission button, and more. Learn more about the Argilla UI here.
Conclusion: Building on our baseline
In the journey of evaluating wandbot, we embarked on a multi-step process outlined in our series of reports. In the first part, we meticulously curated an evaluation dataset to ensure a robust assessment. This report encapsulates our hands-on effort to gauge the quality of responses generated by wandbot. The establishment of a predefined set of rules further fortified our evaluation criteria, ensuring a structured and objective analysis. We also benefitted from in-house MLEs as domain expert annotators. Our methodology involved leveraging Argilla.io, a powerful tool that facilitated the seamless setup of an annotation app.
The quantitative aspect of our evaluation reveals that wandbot achieves an impressive accuracy rate of 66.67% in generating responses. However, it's crucial to recognize that the true value of this LLM-based system transcends mere response accuracy. Moreover, the evaluation process illuminated an unexpected but valuable outcome - identifying areas of our documentation that need to be augmented or updated.
We have many ideas for improving wandbot and this baseline result, from trialling other embedding models and testing different retrieval methods to more subtle changes to our annotation guidelines.
In conclusion, while the quantitative metric provides a snapshot of wandbot's accuracy, its qualitative impact on user experience and the discovery of documentation gaps emphasize the broader significance of this evaluation. As we move forward, this comprehensive assessment lays the groundwork for refining wandbot and, concurrently, enhancing the quality of our documentation to better serve our users.
Read the other installments in this series: