Evaluation-Driven Development: Improving WandBot, our LLM-Powered Documentation App
This report describes the changes and enhancements we made to wandbot during our most recent sprint
Introduction
We've been working on Wandbot for a while. If you're unfamiliar, Wandbot is our LLM-powered documentation app, meant to give our users answers to myriad W&B-related questions, from uncovering the right advice in our docs to helping debug code, and everything in between. We'll link a few articles below about how we built Wandbot and what we've learned along the way, but we recommend checking out our Discord and taking it for a spin if you want to see it in action.
Our last piece on Wandbot, "RAGs To Riches: Bringing Wandbot into Production," described how we refactored our bot to bring it to production. We explained the changes we made to the codebase to improve performance and development speed.
Today, we are excited to announce the release of Wandbot 1.1, a significant upgrade that builds upon our commitment to enhancing user experience. This report details the evaluation-driven development process we followed to add new features and enhancements that improve Wandbot's performance.
This development cycle was a meticulous process of evaluation and enhancement. We focused on understanding the nuances of the queries Wandbot handles and dove into the intricacies of our RAG pipeline to improve its performance.
A key takeaway from this journey was the importance of rigorous evaluation. By closely examining every aspect of Wandbot's interactions, responses, and components, we were able to pinpoint areas for improvement, leading to more refined and practical updates in this new release.
TL;DR
Here's a quick overview of the different topics discussed in this report:
- Evaluation-Driven Development: Dive into how we used GPT-4 evaluations to drive design and development choices for new components in the RAG pipeline.
- The Auto Evaluation Framework: A GPT-powered evaluation framework aligned with human annotations and evaluations. We discuss the metrics used and their implementation details.
- Enhancements: Improvements we made to the data ingestion pipeline, query augmentations, and document retrieval to improve performance across metrics.
- Comparative Analysis: Analysis and comparison of different models and pipelines and discussion of results across various metric dimensions.
Evaluation-Driven Development
Previously, we conducted extensive evaluations of the Wandbot RAG pipeline. These included manual and automatic annotation and evaluation of the bot's responses to historical user queries collected over the past year. We documented these efforts in the following reports:
How to Evaluate an LLM, Part 1: Building an Evaluation Dataset for our LLM System
Building gold standard questions for evaluating our QA bot based on production data.
How to Evaluate an LLM, Part 2: Manual Evaluation of Wandbot, our LLM-Powered Docs Assistant
How we used manual annotation from subject matter experts to generate a baseline correctness score and what we learned about how to improve our system and our annotation process
How to evaluate an LLM Part 3: LLMs evaluating LLMs
Employing auto-evaluation strategies to evaluate different components of our Wandbot RAG-based support system.
These evaluations and datasets were a starting point for our new auto-eval framework. We needed a way to iteratively and quickly evaluate improvements and enhancements to the Wandbot RAG pipeline. Although the earlier auto-evaluation system proved a good starting point, it had a few limitations.
For instance, we relied on the default prompts to evaluate Correctness, Faithfulness, and Relevance, resulting in a misalignment between the auto and manual evaluation results. We wanted to align the auto-evaluation system with the manual evaluation so that we didn't have to perform repeated manual assessments, which were both time-consuming and tedious.
Our earlier manual evaluations:
Building on these annotations, we cleaned up the manual evaluation dataset with the help of Argilla. We used examples from earlier manual evaluation results to create a few-shot prompt to instruct GPT-4 to provide a decision and explanation for the annotation. Here's the base system prompt we used:
"You are a Weight & Biases support expert tasked with evaluating the correctness of answers to questions asked by users. ""Given the following question, document, and answer, you must analyze the provided answer and document before determining whether the answer is relevant for the provided question ""and faithful to the contents of the document. The answer must not contradict information provided in the document. ""In your evaluation, you should consider whether the answer addresses all aspects of the question and provides only correct information from the document for answering the question. ""You must also validate the answer for correctness and ensure that any code snippets provided in the answer are correct and run without errors. "'Output your final verdict by strictly following JSON format:''''{"decision": <<Provide your decision here, either correct, or incorrect>>,"explanation": <<Provide a brief explanation for your decision>>}'''' Use "correct" if the answer is correct, relevant and faithful for the given question and document "incorrect" if the answer is not. \n\n'
Alignment with Few-shot Prompt
Next, we sampled correct and incorrect examples from the datasets to create our few-shot prompt of annotated questions, answers, decisions, and explanations.
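As a rough illustration of this sampling step, the sketch below draws a balanced set of correct and incorrect examples from an annotated DataFrame and formats them as few-shot demonstrations. The file name and column names (`is_correct`, `note`, and so on) are assumptions for the sake of the example, not our actual schema.

```python
import json

import pandas as pd

# Illustrative only: assumes the manual-evaluation results live in a JSONL file
# with "question", "document", "answer", "is_correct", and "note" columns.
eval_df = pd.read_json("manual_eval_results.jsonl", lines=True)

# Draw a balanced sample of correct and incorrect annotations.
correct = eval_df[eval_df["is_correct"]].sample(n=3, random_state=42)
incorrect = eval_df[~eval_df["is_correct"]].sample(n=3, random_state=42)
few_shot_df = pd.concat([correct, incorrect]).sample(frac=1, random_state=42)

# Format each sampled row as a demonstration ending in a JSON verdict.
few_shot_examples = []
for _, row in few_shot_df.iterrows():
    verdict = {
        "decision": "correct" if row["is_correct"] else "incorrect",
        "explanation": row["note"],
    }
    few_shot_examples.append(
        f"Question: {row['question']}\n"
        f"Document: {row['document']}\n"
        f"Answer: {row['answer']}\n"
        f"Verdict: {json.dumps(verdict)}"
    )

few_shot_prompt = "\n\n".join(few_shot_examples)
```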
Curation and Alignment
Finally, we ingested the data into Argilla with the initial user annotations, adding the GPT-4 annotations as Argilla Suggestions. Here's a sample record in the Argilla UI after ingestion:

A sample annotation record in Argilla. Notice how we used text fields for user Notes and GPT Explanations and pre-annotated the example.
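Under the hood, pushing pre-annotated records like this into Argilla looks roughly like the sketch below, written against the Argilla 1.x FeedbackDataset API; the field and question names are illustrative rather than our exact schema.

```python
import argilla as rg

rg.init(api_url="http://localhost:6900", api_key="argilla.apikey")

# Illustrative schema: one text field per piece of context we want annotators
# to see, plus a single label question for the correctness decision.
dataset = rg.FeedbackDataset(
    fields=[
        rg.TextField(name="question"),
        rg.TextField(name="answer"),
        rg.TextField(name="user_note"),
        rg.TextField(name="gpt_explanation"),
    ],
    questions=[
        rg.LabelQuestion(name="correctness", labels=["correct", "incorrect"]),
    ],
)

record = rg.FeedbackRecord(
    fields={
        "question": "How do I resume a run?",
        "answer": "Use resume='allow' and pass the run id to wandb.init(...).",
        "user_note": "Original annotator note",
        "gpt_explanation": "GPT-4's explanation for its verdict",
    },
    # Pre-annotate the record: the GPT-4 decision shows up as a Suggestion in the UI.
    suggestions=[{"question_name": "correctness", "value": "correct"}],
)

dataset.add_records([record])
dataset.push_to_argilla(name="wandbot-correctness", workspace="admin")
```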
We re-annotated the data to eliminate ambiguities and inaccuracies in the annotations, creating a more precise standard for Wandbot's performance. The annotated dataset contains a robust set of 98 Question-Answer pairs. The following table shows these test data points as reference answers in our final auto-evaluation.
The Auto Evaluation Framework
Our framework utilizes the GPT-4 model to evaluate Wandbot's responses across multiple dimensions and metrics. These comprehensive metrics played a pivotal role in shaping the enhancements made in this version.
- Response
- Answer Correctness - Is the generated answer correct compared to the reference, and does it thoroughly answer the user's query?
- Answer Similarity - Assessment of the semantic resemblance between the generated answer and the ground truth
- Context
- Context Precision - Whether the ground-truth-relevant items present in the retrieved contexts are ranked at the top
- Context Recall - The extent to which the retrieved context aligns with the annotated answer, treated as the ground truth.
We subclassed and customized the CorrectnessEvaluator class in llama-index to compute Answer Correctness, Answer Relevancy, and Answer Faithfulness.
💡 Click on the links above to see the implementation of each metric.
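To make the mechanics concrete, here is a simplified, standalone sketch of the correctness check: the few-shot system prompt described above goes to GPT-4, and the JSON verdict is parsed from the response. Our production code wraps this logic inside the customized llama-index evaluator rather than calling OpenAI directly.

```python
import json

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def evaluate_correctness(system_prompt: str, question: str, document: str, answer: str) -> dict:
    """Ask GPT-4 for a {"decision", "explanation"} verdict on a single QA pair."""
    user_message = f"Question: {question}\n\nDocument: {document}\n\nAnswer: {answer}"
    response = client.chat.completions.create(
        model="gpt-4-1106-preview",
        temperature=0,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
        ],
    )
    return json.loads(response.choices[0].message.content)
```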
For completeness, we also used RAGAS to compute the same metrics, along with Answer Similarity, Context Precision, and Context Recall. The full evaluation script that runs the complete evaluation across all metrics can be seen here.
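The RAGAS side follows the library's standard pattern, roughly as sketched below. The metric names match recent RAGAS releases, but the expected dataset column names (for example, ground_truth vs. ground_truths) vary between versions, so treat this as an approximation rather than our exact script.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_correctness,
    answer_similarity,
    context_precision,
    context_recall,
)

# Column names follow the RAGAS convention; exact names may differ between versions.
eval_dataset = Dataset.from_dict(
    {
        "question": ["How do I log a confusion matrix?"],
        "answer": ["Use wandb.plot.confusion_matrix(...) and log it with wandb.log."],
        "contexts": [["wandb.plot.confusion_matrix() builds a confusion matrix chart ..."]],
        "ground_truth": ["Call wandb.log({'cm': wandb.plot.confusion_matrix(...)})."],
    }
)

results = evaluate(
    eval_dataset,
    metrics=[answer_correctness, answer_similarity, context_precision, context_recall],
)
print(results)
```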
Enhancements
The v1.1 update introduces many substantial improvements to the Wandbot pipeline, including enhancements to the data ingestion pipeline that strengthen the knowledge base, a new query enhancer that improves query understanding, and upgrades to the retriever.
Data Ingestion Improvements
While manually annotating the dataset, we noticed issues with retrieved contexts that stemmed from incorrect data parsing. This was attributed to the usage of the default MarkdownNodeParser in llama-index. Our documentation is built with Docusaurus and uses MDX features that allow embedding JavaScript components and plugins such as Tabs, frontmatter, admonitions, and other artifacts. The MarkdownNodeParser initially parsed these artifacts incorrectly, resulting in context chunks that were either too short or too long. We fixed these issues by handling these artifacts before passing the documents to the parser.
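In practice, this preprocessing amounts to stripping or normalizing the MDX-specific syntax before the text reaches the parser. The regexes below are an illustrative approximation of that cleanup, not our exact rules.

```python
import re


def preprocess_mdx(text: str) -> str:
    """Strip Docusaurus/MDX artifacts so MarkdownNodeParser sees plain markdown."""
    # Drop YAML frontmatter at the top of the file.
    text = re.sub(r"\A---\n.*?\n---\n", "", text, flags=re.DOTALL)
    # Drop JS import statements for components like Tabs/TabItem.
    text = re.sub(r"^import\s+.*?;\s*$", "", text, flags=re.MULTILINE)
    # Unwrap <Tabs>/<TabItem> components but keep their inner content.
    text = re.sub(r"</?Tabs[^>]*>|</?TabItem[^>]*>", "", text)
    # Turn admonitions (:::note ... :::) into plain blockquote-style text.
    text = re.sub(r":::(\w+)\s*", r"> \1: ", text)
    text = text.replace(":::", "")
    return text
```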
Next, we found multiple queries during the annotation exercise that could have been answered correctly by Wandbot had the right documents been included in our index. For example, here's a query from a user:
can you give me a suggestion of how to logged named entity recognition values?
While Wandbot provided an answer, it was not correct or complete. However, a Fully Connected report describes precisely how to do this. This prompted us to include new knowledge sources in the wandbot index. These include Fully Connected Reports, Weave Examples, and Wandb SDK Tests. The expanded knowledge base with more diverse sources improved the retriever's ability to provide accurate and relevant information during query time.
Query Enhancer
Another significant addition to the RAG pipeline is the Query Enhancement Stage, which ensures queries are concise, contextually relevant, and free from extraneous information.
The enhancer first uses simple string manipulation and regex to remove bot and user mentions. We then incorporated Cohere language detection to identify the query language and enable multilingual support, and we fine-tuned a Cohere classification model to detect the user's intent. This multi-label classifier provides hints to the query enhancer, which uses the Instructor library to extract the user's intent and enrich the query with keywords and sub-queries. These enhancements are then injected into the system prompt and used during retrieval to provide hints to the model during response synthesis. The complete implementation of the query enhancer can be seen here, and the panel below shows all the enhancements the query enhancer adds to the initial query.
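As an illustration of the structured-output step, a minimal Instructor-based sketch might look like the following; the response model, field names, and prompt are assumptions for the example rather than our actual schema.

```python
import instructor
from openai import OpenAI
from pydantic import BaseModel, Field


# Structured output we want the enhancer to produce; the fields are illustrative.
class EnhancedQuery(BaseModel):
    intent: str = Field(description="The user's primary intent, e.g. 'code troubleshooting'")
    keywords: list[str] = Field(description="Search keywords for keyword-based retrieval")
    sub_queries: list[str] = Field(description="Decomposed sub-questions to retrieve against")


client = instructor.patch(OpenAI())  # patches the client to accept response_model


def enhance_query(query: str, language: str, intent_hints: list[str]) -> EnhancedQuery:
    """Use an LLM with Instructor to produce a structured query enhancement."""
    return client.chat.completions.create(
        model="gpt-4-1106-preview",
        response_model=EnhancedQuery,
        messages=[
            {
                "role": "system",
                "content": "You enhance user queries about Weights & Biases for retrieval. "
                f"Detected language: {language}. Classifier hints: {intent_hints}.",
            },
            {"role": "user", "content": query},
        ],
    )
```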
Hybrid Retriever
During annotation, we also noticed that retrieval performance was not optimal and had room for improvement. Some queries related to code troubleshooting and sales required web knowledge from outside the knowledge base. We tackled this by including AI snippets from you.com: a new custom retriever fetches relevant snippets using the you.com web-search API and adds them to the retrieval results. The following code demonstrates how we add results from you.com to our retrieval results.
url = "https://api.ydc-index.io/search"querystring = {query_str,"num_web_results": self.similarity_top_k,}response = requests.get(url, headers=headers, params=querystring)results = response.json()search_hits = [("\n".join(hit["snippets"]),{"source": hit["url"],"language": "en","description": hit["description"],"title": hit["title"],"tags": ["you.com"],},)for hit in results["hits"]]
Keyword Retrieval
Next, we included a new BM25Retriever from llama-index that uses BM25Okapi to retrieve documents based on keywords generated in the query enhancement stage.
Finally, we introduced a hybrid retriever that combines the FAISS vector store, BM25, and you.com retrievers, and we added a new metadata-filtering post-processor to significantly improve our retrieval capabilities. We also modularized and moved all retrieval-related implementations into a separate retriever module to improve maintainability and code quality.
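A rough sketch of how these retrievers can be combined is shown below, assuming the pre-0.10 llama-index package layout (adjust import paths for newer releases); the actual wandbot retriever module adds metadata filtering and more careful fusion on top of this.

```python
from llama_index import QueryBundle
from llama_index.retrievers import BaseRetriever, BM25Retriever
from llama_index.schema import NodeWithScore


class HybridRetriever(BaseRetriever):
    """Merge results from a vector retriever, a BM25 retriever, and a web retriever."""

    def __init__(self, vector_retriever, bm25_retriever, web_retriever):
        self.vector_retriever = vector_retriever
        self.bm25_retriever = bm25_retriever
        self.web_retriever = web_retriever
        super().__init__()

    def _retrieve(self, query_bundle: QueryBundle) -> list[NodeWithScore]:
        # Gather candidates from all three sources.
        all_nodes: list[NodeWithScore] = []
        for retriever in (self.vector_retriever, self.bm25_retriever, self.web_retriever):
            all_nodes.extend(retriever.retrieve(query_bundle))
        # Deduplicate by node content hash, keeping the highest-scoring copy.
        best: dict[str, NodeWithScore] = {}
        for node in all_nodes:
            key = node.node.hash
            if key not in best or (node.score or 0) > (best[key].score or 0):
                best[key] = node
        return list(best.values())


# The keyword retriever can be built directly over the parsed nodes, e.g.:
# bm25_retriever = BM25Retriever.from_defaults(nodes=nodes, similarity_top_k=5)
```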
Comparative Analysis and Evaluation Results
The transition from version 1.0 to 1.1 is best understood through a comparative lens. The charts in this section visualize the improvements across each evaluation metric, demonstrating how each aspect of Wandbot has been refined. The comparison highlights the gains in correctness, relevancy, and faithfulness and showcases the reduced latency and increased system efficiency.
- Comparative Performance: The chart above shows the performance of four models: gpt-3.5-turbo-16k-0613, gpt-4-0613, gpt-4-1106-preview, and gpt-4-1106-preview-v1.1. Among these, the new pipeline, gpt-4-1106-preview-v1.1, generally outperforms the others across most metrics, indicating iterative improvements across versions. Notably, gpt-3.5-turbo-16k-0613 lags behind, particularly in Answer Correctness and Answer Relevancy, which may reflect the advancements in the GPT-4 series.
We instructed the GPT-4 auto-evaluation model to score responses across these dimensions using an ordinal scale. In the following charts, note that:
- A score of 1 means incorrect/unfaithful/irrelevant.
- A score of 2 means ambiguous.
- A score of 3 means correct/faithful/relevant.
The next chart shows the distributions of the scores for Answer Correctness, Answer Relevancy, and Answer Faithfulness.
- Metric Analysis: Metrics like Answer Faithfulness and Answer Correctness are pivotal, as they directly affect the model's utility in practical applications. The gpt-4-1106-preview-v1.1 excels in Answer Correctness, which is critical for accurate information. However, Answer Faithfulness's tighter grouping suggests that even earlier models like gpt-3.5-turbo-16k-0613 perform comparably in ensuring that answers align with the context provided.
- Context Understanding: Context Precision and Context Recall are closely related to the model's understanding of the input context. The retrieved context should ideally contain the essential information needed to address the query. To compute these metrics, we first identify the sentences within the retrieved context that are relevant to answering the given question. Again, the new v1.1 pipeline shows superiority in Context Recall, indicating its ability to retrieve more relevant contexts for a given query.
Conclusion
We'll be continuously improving Wandbot over time but we're really quite proud of our most recent iteration. What's more: we've learned a ton about building, evaluating, and productionizing bots like these along our journey.
If you'd like to read more about Wandbot, we've included some links below. And if you'd like to build your own LLM-powered app, we offer a free, interactive course you can take at your own pace. You can enroll here.
Till next time!
Creating a Q&A Bot for W&B Documentation
In this article, we run through a description of how to build a question-and-answer (Q&A) bot for Weights & Biases documentation.
WandBot: GPT-4 Powered Chat Support
This article explores how we built a support bot, enriched with documentation, code, and blogs, to answer user questions with GPT-4, Langchain, and Weights & Biases.
How to Evaluate an LLM, Part 2: Manual Evaluation of Wandbot, our LLM-Powered Docs Assistant
How we used manual annotation from subject matter experts to generate a baseline correctness score and what we learned about how to improve our system and our annotation process
How to Evaluate an LLM, Part 1: Building an Evaluation Dataset for our LLM System
Building gold standard questions for evaluating our QA bot based on production data.