
WandBot: GPT-4 Powered Chat Support

This article explores how we built a support bot, enriched with documentation, code, and blogs, to answer user questions with GPT-4, Langchain, and Weights & Biases.
Created on April 19 | Last edited on May 21
As large language models (LLMs) increasingly blur the line between software engineering and machine learning, it is crucial to explore and better understand this intersection so that we can build the best tools for this new cohort of ML practitioners entering the field.
To deeply understand these new workflows and pain points, we built our own support bot, which we've named WandBot. In this article, we'll discuss the implementation of a Q&A bot for Weights & Biases (W&B) using GPT-4, Langchain, OpenAI embeddings, and FAISS. If you'd like to take it for a spin, it's live on our Discord in the #wandbot channel. Let us know what you think!
We have also open-sourced the actual code we're using to run WandBot; you can find the GitHub repo here. We're hoping we can improve this tool together with the community.
In this article, you'll learn about the technical aspects of the code, the updates made to the original implementation, and how to create frontend applications for the Q&A bot on platforms like Discord and Slack, ensuring seamless integration and broader accessibility.

Working with Data

Data Collection and Preprocessing

In the original implementation, a CharacterTextSplitter was used to split the text. We updated the code to better accommodate different types of files by using MarkdownTextSplitter, PythonCodeTextSplitter, and NotebookTextSplitter.
Here's the code, followed by an explainer:
from langchain.text_splitter import MarkdownTextSplitter, PythonCodeTextSplitter, NotebookTextSplitter

# One splitter per source format in the documentation corpus
markdown_splitter = MarkdownTextSplitter()
python_code_splitter = PythonCodeTextSplitter()
notebook_splitter = NotebookTextSplitter()
  • MarkdownTextSplitter: Parses and processes markdown files, ensuring that the bot can understand and generate responses related to content within these files. This is particularly useful when dealing with W&B documentation, as a significant portion is written in Markdown.
  • PythonCodeTextSplitter: Processes Python code files, enabling the bot to understand and generate responses based on code snippets and programming concepts. This is crucial for addressing technical questions related to the W&B API and its usage in Python projects.
  • NotebookTextSplitter: Handles Jupyter Notebook files, which are a common format for sharing code, data, and documentation in the machine learning community. This splitter ensures that the bot can understand and generate responses based on the content within these notebooks, including code, markdown, and outputs.
By using these dedicated text splitters, the Q&A bot can better understand and process various document formats, leading to more accurate and relevant responses.
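For illustration, routing each file to the right splitter could be as simple as dispatching on the file extension (this helper is a sketch for this article, not the exact wandbot code):
def split_document(path: str, text: str) -> list[str]:
    # Route each file to the splitter that matches its format
    if path.endswith(".md"):
        return markdown_splitter.split_text(text)
    if path.endswith(".py"):
        return python_code_splitter.split_text(text)
    if path.endswith(".ipynb"):
        return notebook_splitter.split_text(text)
    return [text]  # leave any other format unsplit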

Data Ingestion

After preprocessing, the data is ingested into the Langchain framework. The code uses HyDE embeddings rather than simple embeddings. HyDE embeddings are a more advanced type of embedding based on the Hypothetical Document Embeddings (HyDE) method, which seeks to improve search results by leveraging hypothetical answers generated by an LLM like ChatGPT instead of just keywords. Compared to simple embeddings, HyDE embeddings have real benefits for this project:
  1. Richer query representation: rather than embedding the user's bare question, HyDE embeds an LLM-generated hypothetical answer, which tends to sit closer to relevant documents in the embedding space and captures more nuanced relationships between words and phrases.
  2. Context-awareness: HyDE embeddings are designed to incorporate contextual information, resulting in a better understanding of the meaning and intent behind the text. By using an LLM-generated document based on a specific question or topic, HyDE embeddings can capture relevant patterns that help find similar documents in a trusted knowledge base.
  3. Robustness: HyDE embeddings are more resistant to noise and ambiguity, making them better suited for handling complex language structures and diverse document formats. Because retrieval stays grounded in a trusted knowledge base, the HyDE method helps mitigate the risk of "hallucinations" from the LLM, which can be especially useful in sensitive applications where precise information is critical, such as medicine.
from langchain.chains import HypotheticalDocumentEmbedder
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI

# Langchain exposes HyDE as HypotheticalDocumentEmbedder: an LLM first
# generates a hypothetical answer, and that answer is what gets embedded
hyde_embeddings = HypotheticalDocumentEmbedder.from_llm(
    llm=OpenAI(),
    base_embeddings=OpenAIEmbeddings(),
    prompt_key="web_search",
)

# Ingest data into the Langchain framework using HyDE embeddings
processed_data = []
for document in preprocessed_data:
    embedding = hyde_embeddings.embed_query(document)
    processed_data.append({"text": document, "embedding": embedding})
Using HyDE embeddings allows the Q&A bot to take advantage of the full context of the knowledge base without fine-tuning or exceeding token limits, improving both retrieval quality and the overall user experience. These embeddings are then used to create and store documents with metadata, forming the basis for the bot's knowledge and response-generation capabilities.

Storing our Vectors in a Vector Store

Storing document embeddings in our FAISS index

The next step is to create a FAISS index, which is a powerful and efficient similarity search library for high-dimensional data (FAISS stands for Facebook AI Similarity Search).
In the accompanying code, we subclassed Langchain's FAISS class to also return similarity scores for the retrieved documents. We called it FAISSWithScore, and we used it to store document embeddings in the FAISS index. This allows for efficient document retrieval based on user queries, plus filtering of the retrieved documents by similarity score. We also updated the retriever to VectorStoreRetrieverWithScore to utilize the FAISS index for document and score retrieval, adapting to changes in the Langchain framework.
# Build the vector store from the ingested documents (from_documents mirrors
# the parent FAISS classmethod), then wrap it in a score-aware retriever
faiss_index = FAISSWithScore.from_documents(documents, hyde_embeddings)
retriever = VectorStoreRetrieverWithScore(vectorstore=faiss_index)
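To sketch the core idea behind the subclass (a simplified illustration; the real FAISSWithScore in the repo differs in the details):
from langchain.vectorstores import FAISS

class FAISSWithScore(FAISS):
    # Langchain's FAISS already computes similarity scores internally via
    # similarity_search_with_score; this sketch surfaces them and filters
    # results by a distance threshold
    def search_with_scores(self, query: str, k: int = 4, max_distance: float = 0.5):
        docs_and_scores = self.similarity_search_with_score(query, k=k)
        # FAISS returns L2 distances by default, so smaller means more similar
        return [(doc, score) for doc, score in docs_and_scores if score <= max_distance]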

Storing the FAISS index and embeddings in Weights & Biases Artifacts

To ensure ease of portability and effective version control, the FAISS index and embeddings are stored separately as files within a single Weights & Biases Artifact. This approach not only allows for easier access and sharing of the data but also enables tracking of changes to the available data in the store as the LLM evolves. By leveraging W&B Artifacts, the Q&A bot can be continuously updated and improved to provide the most accurate and relevant responses to user queries.
import wandb

# Log the FAISS index and embeddings to W&B Artifacts
artifact = wandb.Artifact("faiss_index_and_embeddings", type="data")
artifact.add_file("faiss_index_file.faiss")
artifact.add_file("embeddings.npy")

run = wandb.init()
run.log_artifact(artifact)
run.finish()

Creating WandBot

With the dataset and FAISS index in place, we can create our Q&A bot. This consists of three main components: designing a robust prompt, creating a Q&A pipeline, and, finally, developing the chat interface. We'll cover each in turn.

Designing a Robust Prompt for the LLM

To ensure the desired behavior (and output format) from the language model, the code utilizes Langchain's ChatPromptTemplate class. This class lets us design a custom prompt tailored to the specific requirements of the Q&A bot: a system message provides context and specifies the desired answer format, ensuring that the model's output is not only relevant but also well-structured. The relevant code:
from langchain.prompts.chat import (
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
    SystemMessagePromptTemplate,
)

# Create a custom prompt for the Q&A bot. The system message below is
# illustrative; the production prompt in the wandbot repo is more detailed.
system_template = ("You are wandbot, a support assistant for Weights & Biases. "
                   "Answer using only the context below and cite your sources.\n\n{summaries}")
prompt_template = ChatPromptTemplate.from_messages([
    SystemMessagePromptTemplate.from_template(system_template),
    HumanMessagePromptTemplate.from_template("{question}"),
])
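For example, rendering the prompt with retrieved context and a user question might look like this (both strings are placeholders):
messages = prompt_template.format_messages(
    summaries="<retrieved documentation chunks>",
    question="How do I resume a crashed run?",
)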

Creating the Q&A Pipeline

Our Q&A pipeline was created using RetrievalQAWithSourcesChainWithScore in Langchain, replacing the earlier VectorDBQAWithSourcesChain. This pipeline leverages the power of OpenAI embeddings and the FAISS index for efficient document retrieval. The benefits of using RetrievalQAWithSourcesChainWithScore include:
  1. Improved search efficiency: By using the FAISS index and OpenAI embeddings, the pipeline can search through a large number of documents quickly, yielding relevant results.
  2. Contextual understanding: Since the pipeline incorporates the HyDE embeddings, it has a better understanding of the context and can provide more accurate responses.
  3. Scoring mechanism: The RetrievalQAWithSourcesChainWithScore class also provides a scoring mechanism, allowing the system to rank the relevance of the retrieved documents and filter the documents by the similarity scores.
  4. Usage with Weights & Biases: Storing the pipeline components, such as the FAISS index and embeddings, in W&B Artifacts allows for better version control, collaboration, and data portability. The use of W&B Artifacts ensures that the pipeline can be easily updated and shared among team members, facilitating the continuous improvement of the Q&A bot. In the code, the pipeline loads the artifacts using the run.use_artifact method, which simplifies the process of accessing the required data and components.
import numpy as np
import wandb

# Load FAISS index and embeddings from W&B Artifacts
run = wandb.init()
artifact = run.use_artifact("faiss_index_and_embeddings:latest")

faiss_index = FAISSWithScore.load(artifact.get_path("faiss_index_file.faiss").download())
embeddings = np.load(artifact.get_path("embeddings.npy").download())

# Create the Q&A pipeline from the retriever, embeddings, and prompt
pipeline = RetrievalQAWithSourcesChainWithScore(
    retriever=faiss_index.as_retriever(),
    embeddings=embeddings,
    prompt_template=prompt_template,
)
This pipeline is responsible for processing user queries and generating appropriate responses based on the information stored in the dataset.
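For illustration, a single query through the pipeline might look like this (the question is a made-up example; chains in the RetrievalQAWithSourcesChain family return a dict with answer and sources keys):
result = pipeline({"question": "How do I log a confusion matrix with wandb?"})
print(result["answer"])
print(result["sources"])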

Creating the Chat Interface

The Chat class in the code serves as the chat interface (surprise!), providing stateful storage for user inputs and model responses. This is particularly useful for maintaining context during an ongoing conversation. The benefits of using the Chat class include:
  1. Interactive experience: The stateful storage can enable a more interactive and dynamic chat experience, as the model can eventually be improved to generate context-aware responses based on previous interactions with the user. This is great for diving deeper or refining queries to get the answer you really want.
  2. Flexibility: The Chat class can be easily adapted to work with various user interfaces, such as Discord and Slack applications, allowing developers to integrate the Q&A bot into different platforms seamlessly.
# Chat lives in the Q&A bot's own codebase (see the wandbot repo), not in Langchain
from qa_bot import Chat

# Instantiate the Chat class with the Q&A pipeline
chat = Chat(pipeline)
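To make the idea of stateful storage concrete, a minimal Chat wrapper might look roughly like this (a hypothetical sketch; the real class in the wandbot repo is more involved):
class Chat:
    def __init__(self, pipeline):
        self.pipeline = pipeline
        self.history = []  # stateful storage of (question, answer) pairs

    def handle_message(self, question: str) -> str:
        result = self.pipeline({"question": question})
        answer = result["answer"]
        self.history.append((question, answer))
        return answer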
By combining these three components, our Q&A bot can effectively answer user queries and provide an engaging and interactive experience for users seeking information from the Weights & Biases documentation.

Model Selection and Fallback Mechanism

The implementation of a model selection and fallback mechanism is crucial for reliability and robustness. More specifically, we get these benefits:
  1. Service Continuity: By using GPT-4 as the primary model and GPT-3.5 Turbo as a fallback, the Q&A bot can ensure continuous operation even if the primary model is unavailable or encounters issues. This is particularly important for maintaining a consistent user experience and preventing downtime, which can negatively impact user satisfaction and trust in the system.
  2. Performance Optimization: GPT-4 provides state-of-the-art performance in natural language understanding and generation. By default, our Q&A bot leverages this model to deliver the highest-quality responses to user queries. However, GPT-3.5 Turbo, while slightly less powerful, still offers a high level of performance. Utilizing this fallback mechanism allows the Q&A bot to maintain its effectiveness even when the primary model is not available. Speaking of which:
  3. Resource Management: In some cases, the availability of the primary model, GPT-4, might be limited due to resource constraints or other factors. By incorporating a fallback mechanism, the Q&A bot can seamlessly switch to GPT-3.5 Turbo, ensuring that users continue to receive responses to their queries without being negatively impacted by resource limitations.
  4. Flexibility and Scalability: The inclusion of a fallback mechanism provides the Q&A bot with the flexibility to adapt to changes in the underlying language models or infrastructure. This makes it easier to scale the system, accommodate new models or updates, and ensure that the bot remains up-to-date with the latest advancements in natural language processing.
The model selection and fallback mechanism is a crucial aspect of the Q&A bot's design, as it ensures service continuity, optimizes performance, manages resources effectively, and provides the flexibility needed for future enhancements and scalability.
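As a rough sketch of the idea (not the exact wandbot logic), the fallback can be as simple as trying the models in order of preference:
from langchain.chat_models import ChatOpenAI
from langchain.schema import HumanMessage

def ask_with_fallback(question: str) -> str:
    # Try GPT-4 first; fall back to GPT-3.5 Turbo if the call fails
    for model_name in ("gpt-4", "gpt-3.5-turbo"):
        try:
            llm = ChatOpenAI(model_name=model_name, temperature=0)
            return llm([HumanMessage(content=question)]).content
        except Exception:
            continue  # e.g. rate limits or temporary unavailability
    raise RuntimeError("All candidate models failed")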

Deploying the WandBot on Discord and Slack

With the Q&A bot's backend fully implemented, it's time to deploy the bot on Discord and Slack. After all, the whole idea is to make our docs more accessible, so offering easy access outside of the docs experience itself helps us accomplish just that.

Integrating with Discord

To integrate the Q&A bot with Discord, we first need to create a Python script that interacts with the Discord API. The discord.py library is a widely used tool for this purpose, as it simplifies the process of connecting the bot to Discord.
The code for the Discord integration can be structured as follows:
  1. Import the required libraries: Import the discord library, along with any other necessary libraries, such as the Chat class from the Q&A bot implementation.
import discord
from qa_bot import Chat
  2. Initialize the chat object: Create an instance of the Chat class to manage user inputs and model responses.
chat = Chat(**kwargs)  # configure with your pipeline settings
  3. Define the bot's behavior: Create an asynchronous function that will be triggered when a message is received from a user. In this function, process the message using the Chat instance and send the response back to the user. The handler is registered with the client in the next step.
async def on_message(message):
    # Ignore the bot's own messages to avoid reply loops
    if message.author == client.user:
        return

    response = chat.handle_message(message.content)
    await message.channel.send(response)
  4. Connect the bot to Discord: Create a new discord.Client instance (discord.py 2.x requires explicit intents, including the message-content intent to read message text), register the on_message function as an event handler, and start the bot using the bot token obtained from the Discord Developer Portal.
intents = discord.Intents.default()
intents.message_content = True  # required to read message text in discord.py 2.x
client = discord.Client(intents=intents)

client.event(on_message)  # register the handler defined in step 3

@client.event
async def on_ready():
    print(f'{client.user} has connected to Discord!')

client.run('your_bot_token')

Integrating with Slack

For Slack integration, we can use the slack-bolt library, which simplifies the process of connecting the Q&A bot.
The code for the Slack integration:
  1. Import the required libraries: Import the slack_bolt library and its components, along with any other necessary libraries, such as the Chat class from the Q&A bot implementation.
from slack_bolt import App
from slack_bolt.adapter.socket_mode import SocketModeHandler
from qa_bot import Chat
  2. Initialize the chat object: Create an instance of the Chat class to manage user inputs and model responses.
chat = Chat(**kwargs)  # configure with your pipeline settings
  3. Define the bot's behavior: Create a function that will be triggered when the bot is mentioned. In this function, process the message using the Chat instance and send the response back to the user. The handler is registered on the App instance in the next step.
@app.event("app_mention") ...
def command_handler(body, say):
message = body['event']['text']
response = chat.handle_message(message)
say(response)
  4. Connect the bot to Slack: Create a new App instance with the bot token, register the handler, and start the bot using the Slack app token obtained from the Slack Developer Portal.
app = App(token='your_bot_token')
app.event("app_mention")(command_handler)  # register the handler defined in step 3

handler = SocketModeHandler(app, 'your_app_token')
handler.start()
By integrating the Q&A bot with both Discord and Slack, users can conveniently access the wealth of information in the Weights & Biases documentation, regardless of their preferred communication platform. This ensures a broader reach and a more engaging experience for users seeking assistance with the Weights & Biases toolset.

Logging and Analysis with Weights & Biases Stream Table

Both the Discord and Slack applications are designed to log user interactions with our Q&A bot to a Weights & Biases StreamTable. This integration offers several advantages for analysis, debugging, and continuous improvement of the pipeline:
  1. Model Analysis: By logging user queries and the bot's responses, the team can analyze the performance of the model, identify areas for improvement, and gain insights into how well the model is meeting users' needs. This information can be used to iteratively refine the model and its underlying components for better performance and user satisfaction.
  2. Debugging: The W&B StreamTable serves as a central repository of user interactions, allowing developers to identify and diagnose any issues that arise during the bot's operation. This is particularly useful for understanding and resolving edge cases or unexpected behavior that may occur in real-world use.
  3. Downstream Evaluation Tasks: The logged data can be utilized for downstream evaluation tasks such as refining the prompt, adjusting the pipeline's components, or even training new models based on specific user requirements. This continuous feedback loop enables the development team to adapt and improve the Q&A bot in response to real-world user interactions.
  4. Monitoring and Reporting: The W&B StreamTable provides a visual and easily accessible platform for monitoring the bot's performance over time, enabling the team to track improvements, spot trends, and generate reports as needed.
By leveraging the Weights & Biases StreamTable in conjunction with the Q&A bot's Discord and Slack applications, the development team can maintain a robust, dynamic, and continuously improving pipeline that meets the ever-evolving needs of its users.
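As a hedged sketch, logging a single interaction could look roughly like this (the StreamTable API shipped outside the core wandb package at the time of writing, and the import path, table path, and field names below are assumptions for illustration):
from weave.monitoring import StreamTable  # assumed import path

st = StreamTable("my-entity/wandbot/chat_logs")  # placeholder table path
st.log({
    "question": "How do I resume a crashed run?",
    "answer": "You can pass resume=... to wandb.init() ...",
    "model": "gpt-4",
    "response_time_s": 4.2,
})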

Evaluation of the Q&A Bot

To evaluate the performance of our Q&A bot, we use a combination of metrics that assess the retrieval accuracy, string similarity, and model-generated responses' correctness. To perform this evaluation, we follow these steps:
  1. Load the evaluation dataset: We utilize the evaluation dataset stored as a Weights & Biases Artifact, which includes a set of questions, original responses, and document sources.
  2. Test the chat model: For each question in the evaluation dataset, we use the chat model to generate a response along with the retrieved documents and their scores.
  3. Calculate retrieval accuracy: We assess the retrieval accuracy by checking if the original document is present in the retrieved documents.
  4. Calculate string similarity: We use the FuzzyWuzzy library to compute the similarity between the chatbot-generated response and the original response. This similarity score ranges from 0 to 100, with higher scores indicating greater similarity (steps 3 and 4 are sketched just after this list).
  5. Grade the chatbot's responses: We use the language model to evaluate the chatbot's responses by comparing them with the original answers. The language model is instructed to grade each answer as either CORRECT or INCORRECT.
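Steps 3 and 4 can be sketched as follows (the handle_query helper and the field names on the evaluation records are assumptions for illustration):
from fuzzywuzzy import fuzz

def score_example(example, chat):
    # Hypothetical helper returning the answer plus retrieved docs and scores
    answer, retrieved_docs, scores = chat.handle_query(example["question"])

    # Step 3: retrieval accuracy -- is the original source among the hits?
    retrieval_hit = example["source"] in [d.metadata["source"] for d in retrieved_docs]

    # Step 4: string similarity between generated and original answers (0-100)
    similarity = fuzz.ratio(answer, example["original_answer"])
    return retrieval_hit, similarity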
Specifically, in the case of WandBot, we followed this process to evaluate its performance:
  1. Generate a large dataset of simulated user questions for each code snippet in the W&B documentation. The snippet serves as the "ideal response," providing both the question and answer.
  2. Utilize the existing WandBot code and algorithm to generate a response for each question.
  3. Implement an automated evaluation using a GPT-3.5 model-based approach.
  4. Perform human labeling on a subset of questions and model answers by our machine learning engineers (MLEs). We filtered out examples where our experts assessed the question or "ideal response" to be incorrect.
Based on this process, model responses were judged to be 74% accurate by W&B experts. While the model-based accuracy assessment correlated with human judgment, it produced many false negatives, and more work is needed to ensure the automated evaluation delivers reliable results.


By evaluating the Q&A bot using these metrics, we gain insights into the model's performance and identify areas for improvement. This evaluation process can be repeated as needed to track the progress of the chatbot and ensure it continues to deliver accurate and relevant information to users.

Conclusion

And that, in a nutshell, is how we created our Q&A bot. As a next step, potential improvements and future work may include expanding the Q&A bot to other communication platforms, incorporating active learning for continuous model improvement, personalizing user experiences, and extending support for multiple languages and domains. By exploring these avenues, we can further enhance the Q&A bot's capabilities and applicability, making it an even more valuable tool for users seeking information and assistance in their respective fields.
If you are interested in leveraging Weights & Biases for your own machine learning projects, sign up for a free account and explore the wide array of features and capabilities that can help you build, track, and improve your models with ease.
And if you'd like to check out our bot, here's that link one last time: wandb.me/wandbot

