
Refactoring Wandbot—our LLM-powered document assistant—for improved efficiency and speed

This report tells the story of how we utilized auto-evaluation-driven development to enhance both the quality and speed of Wandbot.
Created on May 2 | Last edited on September 16

[Charts: response latency of Wandbot v1.2 vs. v1.1 deployed on Replit, and answer correctness of final_1.2.0 vs. correct_repro_wandbot-main-branch-eval]

We've been working on Wandbot—our LLM-powered documentation assistant—for quite a while now. We've documented the beginnings of our journey, how we implemented a RAG system, and how we've been evaluating its performance. You can check out anything you've missed below, though you won't necessarily need the background to enjoy this report.

Our previous post on evaluation-driven development described how we developed a GPT-4 powered auto-evaluation framework to enhance Wandbot's retrieval-augmented generation (RAG) pipeline. We used the framework to inform design choices and improve Wandbot's performance across various metrics, including answer correctness, faithfulness, and relevance.
Although those changes improved the RAG pipeline's performance, we still saw inefficiencies: document parsing issues in the data ingestion pipeline, slow retrieval, and iterative, redundant LLM calls that increased the system's latency. Furthermore, we saw an opportunity to improve metrics such as answer correctness and relevance by tuning hyperparameters such as chunk size, the number of retrieved documents, and the system prompt templates used for response synthesis.
For those eager to get straight to the point, here’s a TL;DR of our key recommendations and learnings:

TL;DR

  • Prioritize Evaluation: Make evaluation central to the development process, ensuring that changes lead to measurable enhancements. For example, using W&B, we tracked our changes systematically, which helped improve our accuracy from 72% to 81%.
  • Switch to More Efficient Vector Stores: Switching from FAISS to ChromaDB reduced our retrieval latency by approximately 69%, making it more suitable for larger datasets.
  • Optimize Ingestion Pipelines: Use multiprocessing to speed up the ingestion process. This approach significantly improved our data ingestion speeds.
  • Use Smaller Chunk Sizes with Efficient Retrieval: Experiment with chunk sizes; we found that using a chunk size of 512 improved our system's accuracy.
  • Implement Modular Pipeline Components: Splitting the RAG pipeline into query enhancement, retrieval, and response synthesis allowed us to independently tune each component and measure their impacts effectively.
  • Incorporate Metadata: Enhancing metadata usage in retrieval and synthesis processes improved the relevance and accuracy of responses.
  • Adopt Asynchronous and Parallel Processing: Utilizing frameworks like LangChain Expression Language (LCEL) to handle asynchronous API calls and parallelize iterative LLM calls led to significant efficiency gains.
  • Iterate and Evaluate Continuously: Systematically testing and evaluating changes at each stage of refactoring helped us quickly identify and address issues, ensuring continuous improvement.
This report walks through exactly how we did that and improved our bot's performance. Here's what we'll be covering:

Table of contents

  • Key Changes
  • Tracing Wandbot with W&B Weave
  • Evaluation process
  • Efforts to make Wandbot configurable
  • Improve the evaluation speed
  • Debugging our refactored branch with evaluation
  • The speed gains
  • Challenges faced
  • Success stories
  • Lessons learned

Key Changes

Since we have already documented the critical components of Wandbot in previous reports, we're going to summarize the fundamental changes we made in our refactoring effort here.
  • Main changes
  • Changes to ingestion
    • Improved the speed of the ingestion pipeline using multiprocessing
    • Improved the parsing logic in the data ingestion pipeline using new custom Markdown and SourceCode parsers, and chunking logic
    • Added parent-child document chunks to the vector store ingestion
    • Included metadata in the ingested document chunks to allow metadata-based filtering (see the sketch after this list)
  • Changes to the RAG pipeline
    • We split the RAG pipeline into three major components—query enhancement, retrieval, and response synthesis—to make it easier to tune each component independently and measure the impact of changes on the evaluation metrics
    • The query enhancement stage was reduced to a single LLM call, improving the speed and performance of the pipeline
    • Added a parent document retrieval step to the retrieval module to enhance the context provided to the LLM
    • Incorporated a sub-query answering step in the response synthesis module to improve the completeness and relevance of generated answers
  • Changes to the API
    • We split the API into routers for retrieval, database, and chat to make it easier to integrate client applications such as Slack, Discord, and Zendesk
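To make the metadata and parent-child ideas concrete, here is a minimal sketch of storing child chunks in ChromaDB with metadata that links back to a parent document and supports metadata-based filtering at query time. The collection name, metadata fields, and filter values are illustrative, not Wandbot's actual schema.

```python
import os

import chromadb
from chromadb.utils import embedding_functions

# Persistent local vector store; path and collection name are illustrative.
client = chromadb.PersistentClient(path="./vectorstore")
embed_fn = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.environ["OPENAI_API_KEY"],
    model_name="text-embedding-3-small",
)
collection = client.get_or_create_collection(name="wandbot_docs", embedding_function=embed_fn)

# Ingest a child chunk whose metadata points back to its parent document.
collection.add(
    ids=["docs/quickstart.md#chunk-0"],
    documents=["W&B quickstart: run `wandb login`, then call `wandb.init()` in your script..."],
    metadatas=[{
        "source": "https://docs.wandb.ai/quickstart",  # used for citations
        "source_type": "documentation",                # enables metadata-based filtering
        "parent_id": "docs/quickstart.md",             # lets us fetch the full parent later
    }],
)

# Retrieve the top-k child chunks, restricted to documentation pages.
results = collection.query(
    query_texts=["How do I log metrics?"],
    n_results=5,
    where={"source_type": "documentation"},
)
```

A field like `parent_id` on each retrieved chunk is what would let a parent-document retrieval step swap small, precise chunks for their larger parent documents before response synthesis.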

Tracing Wandbot with W&B Weave

The ability to peek inside and observe intermediate steps and LLM calls enabled us to better debug a complex LLM-based system. W&B Weave provides a lightweight and easy-to-use weave.op() decorator. Use it to decorate a function or a class method, and W&B Weave will automatically trace it. Learn more about Weave here.
Check out this PR to see how easy it was to add Weave to our complex Wandbot pipeline. The resulting trace timeline enabled us to examine complex data transfer in the intermediate steps.
Figure 1: W&B Weave Trace timeline. Check out the intermediate steps in our RAG pipeline.
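Adding tracing takes only a few lines. Below is a minimal sketch of the pattern, assuming a hypothetical project name and placeholder pipeline functions rather than Wandbot's actual methods:

```python
import weave

weave.init("wandbot-dev")  # project name is illustrative

# Decorating a function with weave.op() is enough for Weave to record its
# inputs, outputs, and latency as a step in the trace timeline.
@weave.op()
def retrieve(query: str) -> list[str]:
    return [f"doc chunk relevant to: {query}"]  # placeholder retrieval step

@weave.op()
def generate_answer(query: str) -> str:
    docs = retrieve(query)  # nested ops appear as child spans in the timeline
    return f"Answer synthesized from {len(docs)} documents."

generate_answer("How do I log a confusion matrix in W&B?")
```

Because decorated functions nest, the resulting trace mirrors the structure of the pipeline itself, which is what makes the intermediate data transfer visible in the timeline above.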

Evaluation process

As you might guess, we refactored the codebase with rigorous evaluations. However, here's a little secret: we initially refactored without any evaluation.
The refactoring process took around two weeks, and we documented it in this pull request. We assumed it wouldn't significantly impact the system's performance since we were only "refactoring" and not changing much logic (as noted above). We planned to evaluate at the end of the refactor to address any performance degradation. That was the plan, but it turned into a test of patience, grit, and determination.
Here's the story of how we spent about six weeks (or more!) debugging the refactored branch and, in the process, actually improved the system's performance in both accuracy and speed. Like many LLM-app builders, our journey wasn't linear, so sit tight.
When evaluating an LLM-based system with an LLM as a judge, the evaluation scores are not deterministic; they will always move within a range. Best practice is to average across multiple evals to account for this stochasticity, though the cost of doing so should be considered.
💡

Efforts to make Wandbot configurable

After completing the refactoring, we added LiteLLM to make Wandbot more configurable, allowing us to experiment with different vendors. Our goal was to make Wandbot configurable, reproduce the evaluation results of Wandbot v1.1, and test some innovative ideas for future development.
The evaluation results were disappointing. The refactored branch with LiteLLM scored only ~23% (baseline_v1.3), while our deployed Wandbot v1.1 had an accuracy of ~70%; that's a considerable drop. When we tried to reproduce our v1.1 system with the refactored branch, we got a score of only ~25% (baseline_v1.1). We then dug into the code to fix a few things and improve the score. This process revealed the naivety of our initial plan to get by with a few tweaks.
  • Our query enhancer used a higher temperature, so we reduced it to 0.0. This adjustment improved the performance by approximately 3% (baseline_v1.1@temp0).
  • We realized we had removed the few-shot examples from our system prompt. Adding them back improved the accuracy to ~50% (baseline_v1.1@system_prompt_1.1).
We were getting somewhere, but no other tweaks from here on helped us reproduce the v1.1 Wandbot system.



Improve the evaluation speed

Each evaluation mentioned above took an average of 2 hours and 17 minutes, slowing down our tweaking and experimentation speed. We also realized that our evaluation script contained many synchronous processes (here is our old evaluation script if you are interested).
We made the evaluation script purely asynchronous, reducing the evaluation time to below 10 minutes. This change allowed us to perform many more evaluations per day.
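Here is a minimal sketch of the pattern, with placeholder coroutines standing in for the RAG pipeline call and the LLM judge; it is not our actual evaluation script:

```python
import asyncio

async def answer(question: str) -> str:
    # Stand-in for an async call into the RAG pipeline.
    await asyncio.sleep(0.1)
    return f"generated answer for: {question}"

async def judge(question: str, response: str) -> bool:
    # Stand-in for an async LLM-judge call returning correct/incorrect.
    await asyncio.sleep(0.1)
    return True

async def evaluate_sample(sample: dict) -> bool:
    response = await answer(sample["question"])
    return await judge(sample["question"], response)

async def run_eval(samples: list[dict]) -> float:
    # Launch all per-sample evaluations concurrently instead of one at a time.
    results = await asyncio.gather(*(evaluate_sample(s) for s in samples))
    return sum(results) / len(results)

accuracy = asyncio.run(run_eval([{"question": "How do I resume a run?"}]))
```

In practice, a semaphore or rate limiter around the API calls keeps the concurrency within provider limits.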
Lesson: make sure your evaluation is not the bottleneck for experimenting with new ideas or rapid iteration cycles. Spend time to optimize your evaluation pipeline.
💡

Debugging our refactored branch with evaluation

At this point, we knew that evaluating the refactored branch as a whole would be like shooting in the dark. We decided to assess systematically (which, yes, we should have done from the beginning). First, we reproduced the evaluation result of our main branch, aka Wandbot v1.1 (~70%).
We calculated answer correctness by dividing the number of answers our LLM judge marked "true" by the total number of samples. Additionally, we asked the LLM judge to grade each answer on a scale of 1-3 (with 3 being the best). Of the 98 evaluation samples, 69 were marked 3, while 22 were marked 2 (partially correct). This distribution of grades matches that of Wandbot v1.1.
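For clarity, the calculation looks like this, assuming a grade of 3 counts as "true" and inferring the remaining 7 of the 98 samples as incorrect:

```python
# LLM-judge grades for the 98-sample eval set: 3 = correct, 2 = partially correct, 1 = incorrect.
grades = [3] * 69 + [2] * 22 + [1] * 7  # the 7 incorrect samples are inferred, not reported above

answer_correctness = sum(1 for g in grades if g == 3) / len(grades)
print(f"{answer_correctness:.1%}")  # ~70.4%, matching the reproduced v1.1 baseline
```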


We started cherry-picking one commit or collection of commits from the feat/v1.3 branch (PR) to a new branch debug/feat/v1.3 (PR). We systematically went through each commit while evaluating it. If the accuracy dropped, we either reverted the change or experimented with a new change that made sense. For example, one of the first commits changed the Chat model from gpt-4-1106-preview to gpt-4-0125-preview. The accuracy dropped from ~72% to ~65%.


From here on, we kept cherry-picking commits and evaluating them. The whole process was tedious and time-consuming. We won't walk through every evaluation, because something that worked initially might not have worked in the end, and vice versa; showing such examples without proper intuition might be misleading (and documenting them one by one would make this report too long). One example: when we added another commit that changed the embedding model from `text-embedding-ada-002` to `text-embedding-3-small`, the accuracy dropped to ~57% ("lunar-shape-167" in the plot below). However, later down the road, `text-embedding-3-small` turned out to be the better embedding model, and we deployed the system with it.
The same commit also changed the chunk size from 512 to 384. When we changed it back to 512, the accuracy improved from ~57% to ~59% ("lyric-yogurt-169"). We still use 512 as the chunk size.


We ran nearly 50 unique evaluations and spent nearly $2500 to debug our refactored system. The bar graph below shows around 30 of these experiments. We began by accurately reproducing the accuracy of Wandbot v1.1 (correct_repro_wandbot-main-branch-eval) and continued to cherry-pick the commits and evaluate them. The final_1.2.0 evaluation gave us an accuracy of 81.63%. By systematically debugging the refactored branch with evaluation, we not only reproduced the baseline with the new Wandbot v1.2 but also outperformed it by approximately 9%. 🔥
Here are some key highlights of what worked and what didn't:
  • Wandbot v1.1 used the FAISS vector store. The refactor introduced ChromaDB, which allowed for a massive speedup in the system's latency (more on that later) and enabled metadata filtering. text-embedding-3-small worked better with this new vector store than text-embedding-ada-002.
  • We experimented extensively with the number of retrieved chunks, i.e., the top_k parameter, in our retrieval engine. A smaller chunk size of 384 required us to retrieve more chunks. However, when we used a chunk size of 512, performance increased for the same top_k value.
  • At some point, the refactor removed the few-shot examples from the system prompt. Reintroducing them gave a massive jump in accuracy.
  • We kept using `gpt-4-1106-preview` as our main LLM. Switching to `gpt-4-0125-preview` degraded performance early on. However, the final jump from the 70-75% mark to above 80% accuracy happened because we used `gpt-4-0125-preview`: the "better" LLM worked better with a more sophisticated system. The resulting configuration is sketched below.
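To consolidate these findings, here is a hypothetical configuration object capturing the settings that ended up in Wandbot v1.2; the field names (and the top_k value, which the experiments above don't pin down) are illustrative:

```python
from dataclasses import dataclass

@dataclass
class RAGConfig:
    # Consolidated from the experiments above; field names are illustrative.
    vector_store: str = "chromadb"                  # replaced FAISS, enables metadata filtering
    embedding_model: str = "text-embedding-3-small"
    chunk_size: int = 512                           # 384 consistently underperformed
    top_k: int = 15                                 # illustrative value; tuned per chunk size
    chat_model: str = "gpt-4-0125-preview"          # paid off only after the other fixes landed
    temperature: float = 0.0                        # lower query-enhancer temperature helped
    few_shot_system_prompt: bool = True             # removing the examples caused the largest regression
```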



The speed gains

We had deployed Wandbot on Replit. When the system attained a score of ~81%, we tested the latency of the newly deployed Wandbot v1.2 against the previously deployed Wandbot v1.1.
The v1.2 system took 64.142 seconds per response generation locally and 79.72 seconds when deployed. In contrast, the deployed v1.1 system took 491.75 seconds per response, so v1.2 reduced latency by roughly 84% (1 - 79.72/491.75 ≈ 0.84).
In other words: we improved the system's accuracy from 72% to 81% and reduced the latency by 84%.



Challenges faced

Coordination and parallelization of asynchronous API calls

Initially, Wandbot used a combination of Instructor and llama-index to implement the RAG pipeline, which included several sequential steps such as contextualizing user queries, language identification with FastText, intent classification with a fine-tuned Cohere classifier model, and extracting keywords and generating vector search queries with OpenAI.
This sequential process introduced significant delays. Our efforts to parallelize these steps uncovered coordination challenges among asynchronous API calls from Instructor, llama-index, and Cohere clients, which increased the complexity due to multiple potential points of failure.

Transition to LangChain Expression Language (LCEL)

To address these inefficiencies, we evaluated replacing the three libraries with LangChain Expression Language (LCEL), which natively supports asynchronous API calls, optimized parallel execution, retries, and fallbacks.
This transition was not straightforward, as LCEL did not directly replicate every functionality in these libraries. For example, Instructor featured Pydantic validators that run validations on function outputs, allowing us to re-ask an LLM with the validation errors—functionality not natively supported by LCEL. We addressed this gap by developing a custom re-ask loop within the LangChain framework, which allowed us to consolidate multiple serial LLM calls into a single, more efficient call.
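As an illustration of the re-ask pattern, here is a minimal sketch using a Pydantic schema and a plain retry loop around a LangChain chat model; the schema and prompt are made up for the example, and this is not Wandbot's actual implementation:

```python
from pydantic import BaseModel, Field
from langchain_core.exceptions import OutputParserException
from langchain_core.output_parsers import PydanticOutputParser
from langchain_openai import ChatOpenAI

class EnhancedQuery(BaseModel):
    # Illustrative schema for an enhanced query.
    standalone_query: str = Field(description="The question rewritten to be self-contained")
    keywords: list[str] = Field(description="Search keywords extracted from the question")

parser = PydanticOutputParser(pydantic_object=EnhancedQuery)
llm = ChatOpenAI(model="gpt-4-0125-preview", temperature=0)

def enhance_query(question: str, max_retries: int = 2) -> EnhancedQuery:
    prompt = (
        "Rewrite and analyze the question below.\n"
        f"{parser.get_format_instructions()}\n"
        f"Question: {question}"
    )
    for _ in range(max_retries + 1):
        output = llm.invoke(prompt).content
        try:
            return parser.parse(output)
        except OutputParserException as err:
            # Re-ask: feed the validation error back so the model can correct itself.
            prompt = f"{prompt}\n\nYour previous answer failed validation:\n{err}\nTry again."
    raise ValueError("Query enhancement failed validation after retries")
```

Because the schema bundles the query-enhancement fields into one structured output, a single call (plus the occasional re-ask) can replace what used to be several serial LLM calls.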

Implementation challenges with LCEL

Our initial implementation faced hurdles, particularly with LCEL primitives like RunnableAssign and RunnableParallel, which we initially applied inconsistently. This led to errors and performance issues. As we adapted more of our pipeline to LCEL, our understanding of these primitives improved, allowing us to correct our approach and use the right primitives for optimal performance.
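As a concrete example of this kind of composition, the sketch below uses RunnablePassthrough.assign and RunnableParallel to add an enhanced query to the chain state and fan out to retrieval while carrying the original question along; the step functions are placeholders, not Wandbot's components:

```python
from langchain_core.runnables import RunnableLambda, RunnableParallel, RunnablePassthrough

# Placeholder steps standing in for the real query enhancement, retrieval,
# and response synthesis components.
def enhance(state: dict) -> str:
    return state["question"].strip()

def retrieve(state: dict) -> list[str]:
    return [f"chunk relevant to: {state['enhanced_query']}"]

def synthesize(state: dict) -> str:
    return f"Answer to '{state['question']}' using {len(state['documents'])} chunks."

chain = (
    # Keep the incoming dict and add the enhanced query alongside it.
    RunnablePassthrough.assign(enhanced_query=RunnableLambda(enhance))
    # Run retrieval while passing the original question through in parallel.
    | RunnableParallel(
        documents=RunnableLambda(retrieve),
        question=lambda state: state["question"],
    )
    | RunnableLambda(synthesize)
)

print(chain.invoke({"question": "How do I resume a W&B run?"}))
```

The same chain can be run asynchronously with chain.ainvoke(...), which is how LCEL's native async support surfaces without extra coordination code.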

Overhaul of ingestion and chat modules

Adopting LCEL also necessitated a complete overhaul of the ingestion pipeline and chat modules. The combination of LangChain and llama-index created unnecessary customizations and complexities. To simplify and improve efficiency, we opted for a single framework throughout, focusing on refining the speed and performance of the entire system.

Iterative evaluation during library transition

A significant challenge was maintaining the ability to perform iterative evaluations during the transition between libraries. The shift in components made it difficult to assess the impact of changes continuously. We adopted a staged approach to address this: after refactoring the entire codebase, we selectively ran evaluations on major groups of commits to verify performance improvements and ensure that each change contributed positively to the overall efficiency and effectiveness of Wandbot.

Success stories

In addition to reducing the latency of Wandbot's RAG pipeline, we also improved its evaluation scores. Parallelizing the sequential calls in the Query Enhancer reduced the pipeline's latency, and our evaluation-driven decisions helped us improve the correctness of generated answers from 72% to 81%. Another change that improved Wandbot's performance was the inclusion of sub-query generation and answering: our evaluations show that breaking down complex queries into sub-queries and generating responses for each sub-query, along with the initial user query, significantly enhanced Wandbot's performance.
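The sketch below shows the sub-query idea in its simplest form; the helper functions are hypothetical stand-ins for the LLM calls in the response synthesis module:

```python
def decompose_query(question: str) -> list[str]:
    # In practice an LLM call breaks a complex question into simpler sub-questions.
    return [
        "How do I define a W&B sweep configuration?",
        "How do I launch sweep agents?",
    ]

def answer_sub_query(sub_question: str, documents: list[str]) -> str:
    # In practice an LLM call answers each sub-question from the retrieved context.
    return f"Partial answer for: {sub_question}"

def synthesize_final_answer(question: str, documents: list[str]) -> str:
    sub_answers = [answer_sub_query(q, documents) for q in decompose_query(question)]
    # The final response is generated from the original question plus the
    # intermediate sub-answers, which improves completeness on compound questions.
    context = "\n".join(sub_answers)
    return f"Final answer to '{question}', grounded in:\n{context}"
```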
The introduction of ChromaDB sped up retrieval times and provided a more robust metadata storage and filtering mechanism. The systematic evaluation and debugging process highlighted the importance of retaining and tweaking essential components like the system prompt with few-shot examples, which substantially impacted accuracy. Transitioning to a more asynchronous and parallelized evaluation pipeline drastically reduced evaluation times, allowing for more rapid iteration and improvement cycles.

Lessons learned

Our refactoring journey was eye-opening. It revealed the importance of systematic evaluation and the risks of assuming that refactoring won't impact performance. The process underscored the need for a robust evaluation framework to guide every development step and refactoring.
Leveraging asynchronous API calls and parallel processing frameworks like LCEL reduced latency and improved system performance. A systematic approach to evaluation that included reproducing baseline results and conducting incremental testing was crucial in identifying and addressing performance issues. Fine-tuning the chunk size and the number of retrieved documents was pivotal in optimizing the RAG pipeline's performance, and reintroducing critical components like few-shot prompts was vital in maintaining and enhancing system accuracy.
We should implement dynamic configuration options for various system parameters to optimize future iterations, allowing for real-time tuning and adaptation based on different use cases. Enhancing our use of metadata in the retrieval and synthesis processes could improve the relevance and accuracy of responses. Additionally, exploring scalability options, particularly for handling larger datasets from our integrations and more complex queries, could further enhance Wandbot's capabilities.
Based on the lessons learned from the refactoring process, we recommend prioritizing evaluation by integrating it as a core part of the development and refactoring process to ensure that changes lead to tangible improvements. Adopting asynchronous and parallel processing frameworks can enhance system efficiency. Documenting and retaining critical components and configurations during refactoring can help avoid unintentional performance degradation. Finally, approaching refactoring iteratively, with systematic testing and evaluation at each step, can help identify and address issues promptly.
If you'd like to read any of our prior pieces on Wandbot, we'll include those below. Thanks for reading!

