RAGs To Riches: Bringing Wandbot into Production
Lessons we learned bringing our LLM-powered documentation bot into production
Created on November 3 | Last edited on March 1
At the start of 2023, we created Wandbot, a conversational developer assistant to help users converse with the W&B documentation and examples as naturally as they would with a colleague.
Initially available on Discord and Slack, our bot offered a straightforward, interactive way to navigate documentation and resolve debugging issues. If you'd like to learn a little more about how we developed the initial version of the bot, check out the reports below:
WandBot: GPT-4 Powered Chat Support
This article explores how we built a support bot, enriched with documentation, code, and blogs, to answer user questions with GPT-4, Langchain, and Weights & Biases.
Creating a Q&A Bot for W&B Documentation
In this article, we run through a description of how to build a question-and-answer (Q&A) bot for Weights & Biases documentation.
The first version of Wandbot was a monolith, and while it worked as an initial implementation, it had its limitations. It was difficult to maintain, didn't scale well, and wasn't ready for production. We needed something more flexible and adaptable.
So, we refactored Wandbot. We chose a microservices architecture, a method that breaks down the bot into smaller, manageable parts. This new design is not only easier to maintain but also cost-effective and fully equipped for production use. It also allowed us to add new features like multilingual support, follow-up conversations, and model fallback mechanisms.
This article is a journey through Wandbot's evolution. From its humble beginnings to its latest features, we'll explore the re-architecture process, discuss why we decided to change, how we did it, and the results of our efforts. We'll also share insights from our manual LLM evaluation within the team. Here's what we'll cover:
Table of Contents
- Transition to Microservices Architecture
- Deep Dive into Core Components
  - Ingestion Module
    - Document Parsing and Preprocessing
    - Vector Store Ingestion
    - Report Generation
  - Chat Module
  - Database Module
  - API Module
- New Capabilities
- Commitment to Growth
- Evaluation
- Deployment
- Conclusion
First, let's look at a critical pivot that compelled these changes and pushed Wandbot's capabilities forward.
Transition to Microservices Architecture
Before diving into the details, let's first understand the challenges we faced with the initial version of Wandbot.
Initially built as a monolithic structure, the Discord and Slack apps were deployed separately, resulting in duplicated code with minor configuration changes. This approach brought with it a range of issues:
- Maintenance difficulty: Any modification necessitated updates in multiple areas, often resulting in bugs and inconsistencies due to unsynchronized deployments.
- Increased costs: Operating two distinct bots meant duplicating resources, such as vector stores and app deployments, thereby inflating our infrastructure expenses.
- Complexity and scalability issues: As we integrated new features like conversation history, the system's complexity escalated. The monolithic architecture became increasingly cumbersome to work with, hindering scalability.
To address these problems and prepare Wandbot for future growth, we resolved to transition the implementation to a microservices-oriented architecture. This strategy involves breaking down the bot into smaller, manageable components, providing flexibility, adaptability, and scalability. This restructuring allowed us to:
- Organize the system into distinct components like ingestion, chat, and database services.
- Centralize core services and models for use across applications.
- Develop dedicated APIs for seamless integration with existing and potential future platforms.
- Independently modify each service, minimizing the impact on the overall system.
In the following sections, we'll navigate through the specifics of this transition, illustrating how each change helped evolve Wandbot into a more robust, efficient, and scalable assistant.
Deep Dive into Core Components
Ingestion Module
The Ingestion Module, a new feature we added during the transition to microservices, is designed for parsing and processing raw documentation in diverse formats such as Markdown, Python code, and Jupyter Notebooks. Additionally, it plays a crucial role in creating embedding vectors for document chunks and indexing these documents into a FAISS vector store, complete with relevant metadata.
The introduction of this module brings multiple advantages. It distinctly separates the ingestion process from other components, allowing each to evolve and refine independently. By caching embedding calls, we significantly lower operational costs and avoid redundant model invocations. This strategic design not only boosts maintainability but also optimizes resource use. The module also creates artifacts for the vector store index and generates comprehensive data ingestion reports, crucial for ongoing evaluation and improvement.
Here's a representation of the process:
[W&B artifact panel: wandbot_index (direct lineage view)]
Document Parsing and Preprocessing
This process begins with syncing the latest updates from GitHub repositories.
- We use the MarkdownNodeParser from llama-index for parsing and chunking Markdown documents by identifying headers and code blocks.
- Jupyter Notebooks are converted into Markdown using nbconvert and undergo a similar parsing routine (both of these steps are sketched after this list).
- Code blocks are parsed and chunked using Concrete Syntax Trees (CST), segmenting the code logically into functions, classes, and statements.
- Each document chunk is enriched with metadata like source URLs and languages, enhancing future retrieval.
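To illustrate the first two steps, here's a minimal sketch that converts a notebook to Markdown with nbconvert and then chunks Markdown documents with llama-index's MarkdownNodeParser. The file paths and metadata are placeholders, and the import paths match llama-index ~0.9 (newer releases moved these modules):

```python
from llama_index import Document
from llama_index.node_parser import MarkdownNodeParser
from nbconvert import MarkdownExporter

# Convert a notebook to Markdown so it can reuse the Markdown pipeline.
exporter = MarkdownExporter()
notebook_md, _ = exporter.from_filename("examples/intro.ipynb")  # placeholder path

docs = [
    Document(
        text=open("docs/quickstart.md").read(),  # placeholder path
        metadata={"source": "docs/quickstart.md", "language": "en"},
    ),
    Document(
        text=notebook_md,
        metadata={"source": "examples/intro.ipynb", "language": "en"},
    ),
]

# Chunk on Markdown headers and code blocks; each node inherits its
# document's metadata, which aids retrieval later.
parser = MarkdownNodeParser()
nodes = parser.get_nodes_from_documents(docs)
```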
The output is the following WandB artifact containing all parsed and preprocessed documents:
[W&B artifact panel: raw_dataset]
Vector Store Ingestion
In this phase, the WandB artifact from parsing serves as the input for creating a vector store index. Document chunks are embedded using OpenAI's text-embedding-ada-002 model, and SQLite caching via Langchain minimizes redundant model calls, which is crucial for cost and operational efficiency. The outcome is a FAISS index with embedded chunks and metadata, stored as a WandB artifact.
[W&B artifact panel: wandbot_index]
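To make the caching concrete, here's a minimal sketch that puts a cache in front of the embedding model and indexes the chunks into FAISS. Wandbot's cache is SQLite-backed; for brevity, this sketch substitutes LangChain's CacheBackedEmbeddings with a local file store, and the chunk texts are made up. Imports match langchain ~0.0.3xx:

```python
from langchain.embeddings import CacheBackedEmbeddings, OpenAIEmbeddings
from langchain.storage import LocalFileStore
from langchain.vectorstores import FAISS

underlying = OpenAIEmbeddings(model="text-embedding-ada-002")
store = LocalFileStore("./embedding_cache/")  # repeated runs hit the cache
embedder = CacheBackedEmbeddings.from_bytes_store(
    underlying, store, namespace=underlying.model
)

# Illustrative document chunks; the real pipeline feeds in parsed nodes.
chunks = ["`wandb.init()` starts a new run.", "`wandb.log()` records metrics."]
index = FAISS.from_texts(chunks, embedder)
index.save_local("artifacts/wandbot_index")  # placeholder path
```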
Report Generation
The process culminates in creating a W&B Report using the artifacts from the previous steps. The report outlines GitHub repository revision numbers, the volume of documents ingested, and the artifacts comprising parsed documents and vector stores. This practice gives us a clear, transparent view of the ingestion process, facilitating analysis and future improvements.
Chat Module
Let's now explore the heart of user interaction and its impressive development: the chat module. Its evolution over time has brought about significant improvements in how it operates and connects with users.
Recent Transformations
We migrated the implementation from Langchain to Llama-index, which gave us better control over the chat module's underlying functionality, such as the retrieval methods, the response synthesis pipeline, and other customizations.
An exciting development is Wandbot's integration with Cohere's rerank-v2 endpoint. This allows Wandbot to sift through retriever results more effectively, ensuring that responses are not just accurate, but also highly relevant to the user's query.
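As an illustration, here's a minimal sketch of reranking a handful of retriever hits with Cohere's rerank-v2 endpoint; the query, chunks, and API key are placeholders:

```python
import cohere

co = cohere.Client("YOUR_COHERE_API_KEY")  # placeholder key

retrieved_chunks = [
    "wandb.init() starts a new run.",
    "Weave is a visualization library.",
    "wandb.log() records metrics during training.",
]

# Rerank the retriever's candidates against the user query.
response = co.rerank(
    query="How do I log metrics?",
    documents=retrieved_chunks,
    top_n=2,
    model="rerank-english-v2.0",
)
for hit in response.results:
    print(f"{hit.relevance_score:.3f}", retrieved_chunks[hit.index])
```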
We have made language inclusivity a priority in our recent updates. The chat module now detects the language of each query and responds in kind, with particular emphasis on Japanese to better serve our W&B Japan Slack community.
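For example, a lightweight language check can route each query to the matching retrieval and response path. This sketch uses the langdetect package, which stands in for whatever detection logic the chat module actually uses:

```python
from langdetect import detect  # pip install langdetect

def pick_language(query: str) -> str:
    """Return 'ja' for Japanese queries, defaulting to English otherwise."""
    return "ja" if detect(query) == "ja" else "en"

print(pick_language("How do I resume a run?"))  # -> "en"
print(pick_language("実行を再開するにはどうすればいいですか？"))  # -> "ja"
```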
We also wanted to ensure uninterrupted service by switching to a backup LLM (GPT-3.5-turbo) if the primary one (GPT-4) is down. We managed this within the llama-index service context, which adds a layer of resilience against potential downtime.
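Here's a minimal sketch of the fallback idea (not Wandbot's exact implementation; the import path matches llama-index ~0.9):

```python
from llama_index.llms import OpenAI

primary = OpenAI(model="gpt-4")
fallback = OpenAI(model="gpt-3.5-turbo")

def complete_with_fallback(prompt: str) -> str:
    try:
        return primary.complete(prompt).text
    except Exception:
        # The primary model errored or timed out; degrade to the backup.
        return fallback.complete(prompt).text
```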
Enhancing Conversations
Wandbot now retains conversation history, enabling it to recall past discussions and user feedback. We've also upgraded the system prompt that guides the LLM's behavior, ensuring responses align better with user needs. Here's the upgraded system prompt, truncated for brevity; you can see the complete system prompt [here](https://github.com/wandb/wandbot/blob/main/data/prompts/chat_prompt.json).
You are wandbot, a developer assistant designed to guide users with tasks related to Weights & Biases, its sdk `wandb` and its visualization library `weave`. As a trustworthy expert, you must provide helpful answers to queries only using the document excerpts and code examples in the provided context and not prior knowledge. Here are your guidelines:

1. Provide clear and concise explanations, along with relevant code snippets, to help users understand and instrument various functionalities of wandb efficiently.
2. Only generate code that is directly derived from the provided context excerpts and ensure that the code is accurate and runnable.
3. Do not generate code from prior knowledge or create any methods, functions and classes that are not found in the provided context.
4. Always cite the sources from the provided context in your response.
5. Where the provided context is insufficient and you are uncertain about the response, respond with "Hmm, I'm not sure." and direct the user to the Weights & Biases [support](support@wandb.com) or [community forums](http://wandb.me/community)
6. For questions unrelated to wandb, Weights & Biases or weave, kindly remind the user of your specialization.
7. Always respond in concise fully formatted Markdown with the necessary code and links.
8. For best user experience, always respond in the user's language. For instance, if the query is in Japanese, you should respond in Japanese.

Here are some examples:

<!--start-example1-->
<!--start-relevant-documents-->
...
<!--end-relevant-documents-->
<!--Start-Question-->
...
<!--End-Question-->
<!--Final Answer in Markdown-->
...
<!--end-example1-->

<!--start-example2-->
<!--start-relevant-documents-->
...
<!--end-relevant-documents-->
<!--Start-Question-->
...
<!--Final Answer in Markdown-->
...
<!--end-example2-->

<!--Begin-->
<!--Start Relevant Documents-->
{context_str}
<!--End Relevant Documents-->
Human:
<!--Start Question-->
{query_str}
<!--End Question-->
<!--Final Answer in Markdown-->
Streamlining Operations
- Caching: To enhance efficiency, Wandbot now caches embedding results. This eliminates the need for repetitive embedding model calls, cutting down operational costs.
- Modular design: By encapsulating the chat logic in a standalone module, we've simplified maintenance and updates. This modular approach also eases the integration of Wandbot across various platforms, including Discord, Slack, and an upcoming Zendesk Application.
Benefits from Improvements
The cumulative impact of these upgrades has been substantial. Users now enjoy more engaging, continuous conversations with Wandbot. The accuracy of Wandbot's responses has significantly improved, thanks to the reranking process and prompt tuning. Adding new services like reranking and language-based retrieval has become more streamlined. Additionally, Wandbot now maintains a consistent conversational context across different client applications.
Overall, these enhancements have made Wandbot not only more efficient in its operation but also more adaptive and useful in its interactions with users.
Database Module
Let's turn to the database module, another critical component that's improved Wandbot's functionality and user engagement.
Key Functions
The database module is akin to a memory bank, meticulously storing and managing conversation data. It acts as the foundation of Wandbot's data and conversational memory, providing essential services:
- Storing conversational history: Like a detailed journal, it records all user questions and Wandbot's responses.
- Providing conversational context: Utilizing chat history, the module informs future queries, thus enhancing relevance and continuity in interactions.
- Enabling personalization: Wandbot leverages conversation threads to offer customized responses to follow-up queries.
- Persisting user feedback: Collecting user opinions on Wandbot's responses, aiding in continuous improvement.
Operational Benefits
The introduction of the database module has brought about several operational advantages:
- Caching: It stores LLM query results, reducing the need for repetitive queries and thereby cutting down on operational costs.
- Consistent User Experience: As a unified data layer across different clients, it ensures users receive consistent information, regardless of the platform.
- Data-Driven Improvements: User feedback, stored within the database, is a goldmine for refining Wandbot's performance.
SQLite as the Database Choice
In choosing SQLite, Wandbot gains a lightweight yet powerful database solution:
- Serverless Architecture: With no need for a separate database server, SQLite simplifies overall management.
- Embeddable Nature: All data is contained within a single, easily transportable database file.
- Ease of Integration: The availability of Python SQLite libraries makes integration a breeze.
We also back up the data in the database to a W&B Table every 10 minutes. This lets us persist the database contents as a W&B artifact that can then be utilized in our evaluation and feedback loops.
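Here's a sketch of what such a backup job might look like; the database file, table name, and columns are hypothetical:

```python
import sqlite3
import wandb

# Hypothetical schema: adjust to the actual database layout.
conn = sqlite3.connect("wandbot.db")
rows = conn.execute(
    "SELECT thread_id, question, answer, feedback FROM question_answers"
).fetchall()

run = wandb.init(project="wandbot", job_type="database_backup")
backup = wandb.Table(
    columns=["thread_id", "question", "answer", "feedback"],
    data=[list(row) for row in rows],
)
run.log({"question_answers_backup": backup})
run.finish()
```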
By establishing this dedicated database module, Wandbot has not only enhanced its conversational abilities but also streamlined engineering efforts and optimized costs. This module is key to transforming Wandbot into a more effective, efficient, and user-friendly tool.
API Module
The API module serves as the central interface through which clients like Discord and Slack interact with Wandbot. It plays a pivotal role in simplifying integration, enhancing scalability, and ensuring the robustness of our bot's services.
Key endpoints in the API include:
- /question_answer: This endpoint allows the storage of question-answer pairs in the database.
- /chat_thread: Clients can retrieve chat threads stored in the database using this endpoint.
- /query: The primary chat endpoint that responds to user queries (a minimal sketch follows this list).
- /feedback: This endpoint is responsible for storing user feedback provided by clients.
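To make this concrete, here's a minimal FastAPI sketch of the /query endpoint. The request and response schemas and the stubbed pipeline call are illustrative rather than Wandbot's actual code:

```python
from typing import List, Optional

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class QueryRequest(BaseModel):
    question: str
    thread_id: Optional[str] = None  # set to continue an existing thread

class QueryResponse(BaseModel):
    answer: str
    sources: List[str]

def run_chat_pipeline(question: str, thread_id: Optional[str]) -> QueryResponse:
    # Stub standing in for the chat module behind the API.
    return QueryResponse(answer="Hmm, I'm not sure.", sources=[])

@app.post("/query", response_model=QueryResponse)
def query(request: QueryRequest) -> QueryResponse:
    return run_chat_pipeline(request.question, request.thread_id)
```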
By centralizing our API, we achieve several significant advantages:
Wandbot's internal workings are decoupled from the intricacies of client interactions. This separation enables us to optimize performance through mechanisms like caching layers, while keeping the client-side code unaltered even as we change the services behind the APIs.
Furthermore, our modular approach facilitates maintenance and scalability. We can horizontally scale individual API services independently to handle increased loads, ensuring smooth performance. The consistent interface of the API also reduces the complexity of client-side code and safeguards the inner workings of Wandbot from direct external access.
In addition to these technical benefits, our centralized API empowers us in multiple ways:
- Loose Coupling: It fosters loose coupling between our frontend applications and backend services, allowing us to modify internals without impacting clients.
- Developer Productivity: By abstracting away complexities behind simple APIs, we enhance developer productivity, making it easier to work with Wandbot.
- Scalability: The ability to scale individual API services independently ensures that Wandbot can handle an increasing load gracefully.
- Security: By avoiding direct exposure to core modules, we enhance the security of Wandbot, protecting sensitive data and functionalities.
In essence, the API module facilitates a modular, scalable, and secure framework that enables seamless integration with new clients while ensuring the smooth functioning of our bot.
New Capabilities
Wandbot's recent transformation into a microservices architecture has not only revamped its core, but has also infused it with several new capabilities. These enhancements directly cater to user experience and functionality:
- Multilingual Mastery: Wandbot now effortlessly handles both English and Japanese, fetching documents and responding in the user’s language. This bilingual ability makes it more accessible and user-friendly.
- Reliable Backup: If Wandbot's primary Large Language Model (LLM) faces downtime, it seamlessly switches to a secondary model. This failover mechanism ensures uninterrupted service.
- Smart Conversation Context: Leveraging chat history, Wandbot now provides context-rich conversations. It also evolves through user feedback, making interactions more natural and intuitive.
- New integrations: The new architecture allowed us to seamlessly integrate new applications like Zendesk (coming soon) and the brand-new WandB-GPT, a technical support GPT for Weights & Biases.
Commitment to Growth
Wandbot's evolution is fueled by user feedback and rigorous testing:
- Its knowledge base is constantly refreshed, keeping it up-to-date.
- We update the system regularly, based on evaluation results and how users interact with it.
- Built-in user feedback persistence ensures Wandbot stays relevant and attuned to users' needs.
These improvements, stemming from the flexible architecture, greatly enhance Wandbot’s accuracy, scalability, and readiness for production.
Evaluation
To measure the success of our efforts, we conducted extensive testing. We evaluated Wandbot on a custom test set across diverse query types and measured its retrieval accuracy and response relevance.
Here's an overview of the manual evaluation we did internally.
Additionally, here's a brief overview of the auto-evaluation results.
It's crucial to note that evaluating Retrieval-Augmented Generation (RAG) systems like Wandbot is no small feat. The complexity lies in the multitude of evaluation methods and the need to examine each component of the system, both individually and as a whole. These detailed evaluations are extensive enough to merit separate reports. For further insight into this comprehensive evaluation process, refer to the following reports:
How to Evaluate an LLM, Part 1: Building an Evaluation Dataset for our LLM System
Building gold standard questions for evaluating our QA bot based on production data.
How to Evaluate an LLM, Part 2: Manual Evaluation of Wandbot, our LLM-Powered Docs Assistant
How we used manual annotation from subject matter experts to generate a baseline correctness score and what we learned about how to improve our system and our annotation process
How to evaluate an LLM Part 3: LLMs evaluating LLMs
Employing auto-evaluation strategies to evaluate different components of our Wandbot RAG-based support system.
Deployment
During deployment, the individual microservices for the database, API, and each client application run in a single repl, as our current usage pattern doesn't require any more compute. However, to improve reliability and scalability, we migrated Wandbot to Replit Deployments. This move has improved uptime, enabled auto-scaling, and strengthened monitoring and security.
Conclusion
With companies of every stripe building LLM-powered applications, we hope the lessons we've learned building and rebuilding our documentation bot can help you avoid some common pitfalls and architect your solution intelligently from jump street. We're excited to keep improving Wandbot in 2024, employing new techniques and libraries, and we'll make sure to keep this series updated alongside the app. Thanks for reading!