
Creating a Q&A Bot for W&B Documentation

In this article, we walk through how to build a question-and-answer (Q&A) bot for the Weights & Biases documentation.
Created on February 13 | Last edited on March 17

Introduction

This article describes the documentation Q&A bot I built as part of the Replit x Weights & Biases ML Hackathon. The bot uses OpenAI's GPT-3 to answer natural-language questions and developer queries about the Weights & Biases documentation. I use LangChain, OpenAI embeddings, and FAISS for the Q&A backend, and the bot is served as a Gradio application.
This is a simple, rudimentary proof of concept to get a feel for what Q&A over documentation might look like. There is plenty of room for improvement across the pipeline before this could be a production-ready application.
💡
Here's a quick preview of what's in this article:


Before we dive in, here's what the repl looks like:

Check out the repl on replit.com here: 🪄🐝 Documentation Q&A bot With LangChain and OpenAI
Credit: This bot was largely inspired by the following tweet:
💡




Creating the Documentation Dataset

Data Collection

The W&B documentation can be found at docs.wandb.ai. It contains guides, API references, and examples.
Scraping this was more challenging than I had initially thought, so I took an alternate route and collected the documentation from the wandb/docodile GitHub repository instead. In the repository, each webpage is represented as a markdown file, organized in sub-directories that mirror the website's tree structure. This also made the documentation easier to parse. For completeness, I also added data from top forum questions, support-rotation tickets, and API developer references. This additional data is organized in the following Google spreadsheet - wandb_bot finetune data.

Preprocessing

The only preprocessing I did was to collapse repeated newlines (think \n) in the documentation text and convert the spreadsheet data into a single document. The data was then stored as a JSONL file, with a source key recording where each piece of text came from. Here's the artifact containing the dataset:
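I haven't included the exact preprocessing script here, but a rough sketch of it looks something like this (the directory layout and helper names are assumptions, not the exact code from the repl):

import json
import pathlib
import re

def clean_text(text):
    # Collapse runs of blank lines / repeated newlines into a single newline
    return re.sub(r"\n\s*\n+", "\n", text).strip()

def build_jsonl(markdown_dir, output_file):
    # One JSON record per markdown page, keeping track of where the text came from
    with open(output_file, "w") as out:
        for path in pathlib.Path(markdown_dir).rglob("*.md"):
            record = {"source": str(path), "reference": clean_text(path.read_text())}
            out.write(json.dumps(record) + "\n")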

docs_dataset
Full Name: parambharat/wandb_docs_bot/docs_dataset:v1
Aliases: v1
Digest: cc8d0cc556fc5f399cda1dca42ce468f
Created At: February 8th, 2023 05:58:10
Num Files: 2
Size: 2.7KB
We can retrieve the dataset artifact by running the following code:
import wandb

PROJECT = "wandb_docs_bot"
run = wandb.init(project=PROJECT)

def download_raw_dataset():
    # Fetch the latest version of the docs dataset artifact and download the JSONL file
    dataset_artifact_path = 'parambharat/wandb_docs_bot/docs_dataset:latest'
    artifact = run.use_artifact(dataset_artifact_path, type='dataset')
    artifact_path = artifact.get_path("wandb_docs.json")
    file = artifact_path.download()
    return file
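
Calling the helper and peeking at the first record is a quick way to check the data (a small sketch; it relies on the source and reference keys used throughout the rest of the pipeline):

import json

file = download_raw_dataset()
with open(file, "r") as f:
    record = json.loads(f.readline())
print(record["source"])
print(record["reference"][:200])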

Data Ingestion

Creating the Documents

The next step was to store the documents, their metadata, and their corresponding OpenAI embeddings for search and retrieval. LangChain, by default, uses the text-embedding-ada-002 model to embed documents. The model generates a 1536-dimensional embedding for documents up to 8191 tokens long.
However, at query time, we also pass the retrieved documents through the text-davinci-003 model, which has a context length of 4096 tokens. Therefore, I split the text into chunks of 1024 characters and stored each chunk as a LangChain Document, with the corresponding source file stored as metadata.
Note: I used the CharacterTextSplitter class from LangChain to do the chunking. In hindsight, a token-aware splitter (e.g., via from_tiktoken_encoder) would have been a better fit; see the sketch after the chunking code below.
💡
Here's the code I used for this step:
import json
from langchain.docstore.document import Document
from langchain.text_splitter import CharacterTextSplitter

def load_documents(fname):
    source_chunks = []
    splitter = CharacterTextSplitter(separator=" ", chunk_size=1024, chunk_overlap=0)
    for line in open(fname, "r"):
        line = json.loads(line)
        # Split each document into 1024-character chunks and keep its source as metadata
        for chunk in splitter.split_text(line["reference"]):
            source_chunks.append(Document(page_content=chunk, metadata={"source": line["source"]}))
    return source_chunks
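
For reference, a token-aware version of the chunking (as mentioned in the note above) could look something like this. This is a sketch rather than the code from the repl; it requires tiktoken to be installed, and the chunk size here is an arbitrary choice:

from langchain.text_splitter import CharacterTextSplitter

# Measure chunk sizes in tokens (via tiktoken) instead of characters,
# so chunks map more predictably onto the model's context window
token_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    separator=" ", chunk_size=512, chunk_overlap=0)
chunks = token_splitter.split_text("A long documentation page goes here ...")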

Creating the FAISS Index

Finally, we are ready to call the OpenAI Embeddings API endpoint and index the documents by their embeddings for dense vector search and retrieval. While there are many options for storing the vectors, I used faiss-gpu since it can be easily installed via pip and run on Replit. A more production-ready version should use a more resilient vector store or a managed service like Qdrant or Weaviate.
Here's the code to create and store the document embeddings using LangChain.
import pickle
import faiss
import wandb
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores.faiss import FAISS

def create_and_save_index(documents):
    # Embed the documents with OpenAI and index them with FAISS
    store = FAISS.from_documents(documents, OpenAIEmbeddings())
    artifact = wandb.Artifact("faiss_store", type="search_index")
    # Write the raw FAISS index to its own file and add it to the artifact
    faiss.write_index(store.index, "docs.index")
    artifact.add_file("docs.index")
    # Pickle the store without the FAISS index, then reattach the index so the
    # returned store is still usable for search
    index = store.index
    store.index = None
    with artifact.new_file("faiss_store.pkl", "wb") as f:
        pickle.dump(store, f)
    store.index = index
    wandb.log_artifact(artifact)
    return store
The code above stores the document index and the embeddings as separate files in a single artifact. Check out the artifact below.
💡

faiss_store artifact contents: docs.index (7.4MB), faiss_store.pkl (1.3MB)
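
As a quick sanity check, the returned store can be queried directly before wiring it into a chain (a minimal sketch; the query is just an example):

store = create_and_save_index(documents)

# Retrieve the three chunks closest to a test query
for doc in store.similarity_search("How do I log images with wandb?", k=3):
    print(doc.metadata["source"], "->", doc.page_content[:80])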

Check out the ingest.py file in the repl to see how all of the above code is put together for data ingestion.
💡

Creating the Q&A Bot

With the data ready and in the right format, we are almost ready to create our documentation bot.
Keep in mind that we want our bot to be a conversational agent. While GPT-3 has shown reasonably good zero-shot performance on in-context question answering, we need to design a prompt that is robust and keeps model hallucinations to a minimum. This brings us to prompt design.

Designing a Robust Prompt for the LLM

While prompt design has been evolving from an art into a science recently, I still tend to treat it like an art. Case in point: I drew inspiration from and imitated other prompt engineers in the field to design a prompt that worked well for this use case. Here's the final prompt I created for the bot:

You are an AI assistant for the open source library wandb. The documentation is located at https://docs.wandb.ai.
You are given the following extracted parts of a long document and a question. Provide a conversational answer with a hyperlink to the documentation.
You should only use hyperlinks that are explicitly listed as a source in the context. Do NOT make up a hyperlink that is not listed.
If the question includes a request for code, provide a code block directly from the documentation.
If you don't know the answer, just say "Hmm, I'm not sure." Don't try to make up an answer.
If the question is not about wandb, politely inform them that you are tuned to only answer questions about wandb.

QUESTION: How to log audio with wandb?
=========
Content: Weights & Biases supports logging audio data arrays or file that can be played back in W&B. You can log audio with `wandb.Audio()`
Source: 28-pl
Content: # Log an audio array or file
wandb.log({{"my whale song": wandb.Audio(
    array_or_path, caption="montery whale 0034", sample_rate=32)}})

# OR  

# Log your audio as part of a W&B Table
my_table = wandb.Table(columns=["audio", "spectrogram", "bird_class", "prediction"])
for (audio_arr, spec, label) in my_data:
       pred = model(audio)
       
       # Add the data to a W&B Table
       audio = wandb.Audio(audio_arr, sample_rate=32)
       img = wandb.Image(spec)
       my_table.add_data(audio, img, label, pred) 

# Log the Table to wandb
 wandb.log({{"validation_samples" : my_table}})'
Source: 30-pl
=========
FINAL ANSWER: Here is an example of how to log audio with wandb:

```
import wandb

# Create an instance of the wandb.data_types.Audio class
audio = wandb.data_types.Audio(data_or_path="path/to/audio.wav", sample_rate=44100, caption="My audio clip")

# Get information about the audio clip
durations = audio.durations()
sample_rates = audio.sample_rates()

# Log the audio clip
wandb.log({{"audio": audio}})
```
SOURCES: 28-pl 30-pl

QUESTION: How to eat vegetables using pandas?
=========
Content: ExtensionArray.repeat(repeats, axis=None) Returns a new ExtensionArray where each element of the current ExtensionArray is repeated consecutively a given number of times. 

Parameters: repeats int or array of ints. The number of repetitions for each element. This should be a positive integer. Repeating 0 times will return an empty array. axis (0 or ‘index’, 1 or ‘columns’), default 0 The axis along which to repeat values. Currently only axis=0 is supported.
Source: 0-pl
=========
FINAL ANSWER: You can't eat vegetables using pandas. You can only eat them using your mouth.
SOURCES:

Question: {question}
=========
{summaries}
=========
Answer in Markdown:
While the prompt is quite long, it describes exactly how we want the language model to behave and provides few-shot examples to ensure that the model generates the response in the desired fashion.
NOTE: LangChain prompt templates use Python format-string style placeholders by default, where {xxx} marks placeholder text. If you have code blocks in your prompt, you can escape a literal { by doubling it to {{. See the prompt above for an example.
💡
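Here's a tiny, standalone illustration of the escaping (not from the repl):

from langchain.prompts import PromptTemplate

# {question} is a placeholder; {{"loss": 0.1}} renders as a literal {"loss": 0.1}
template = 'QUESTION: {question}\nEXAMPLE: wandb.log({{"loss": 0.1}})\nANSWER:'
prompt = PromptTemplate(input_variables=["question"], template=template)
print(prompt.format(question="How do I log metrics?"))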
The full prompt template shown earlier can be downloaded with the following code snippet:
from langchain.prompts import PromptTemplate

def load_prompt():
    # The prompt template is stored as a file in the same dataset artifact
    dataset_artifact_path = 'parambharat/wandb_docs_bot/docs_dataset:latest'
    artifact = run.use_artifact(dataset_artifact_path, type='dataset')
    artifact_path = artifact.get_path("combine_prompt.txt")
    file = artifact_path.download()
    prompt_template = open(file, "r").read()
    prompt = PromptTemplate(input_variables=["question", "summaries"],
                            template=prompt_template)
    return prompt

Creating the Q&A Pipeline

To create the Q&A pipeline that references the documents at query time, I used the VectorDBQAWithSourcesChain from LangChain. As the name suggests, the chain first queries the vector store for the documents whose embeddings are nearest to the query. The retrieved documents are then inserted into the prompt along with the question and passed to the LLM to generate a response. Here's the code snippet used to achieve this:
from langchain.llms import OpenAI
from langchain.chains import VectorDBQAWithSourcesChain

def load_chain(openai_api_key):
    if validate_openai_key(openai_api_key):
        vectorstore = load_vectostore()
        prompt = load_prompt()
        chain = VectorDBQAWithSourcesChain.from_chain_type(
            llm=OpenAI(temperature=0, openai_api_key=openai_api_key),
            chain_type="map_reduce",
            vectorstore=vectorstore,
            combine_prompt=prompt,
        )
        return chain

def get_answer(question, chain):
    if chain is not None:
        result = chain(
            {"question": question},
            return_only_outputs=True,
        )
        response = f"Answer:\t{result['answer']}\n\nSources:\t{result['sources']}\n"
        return response
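
The load_vectostore helper used above isn't shown in this report; it essentially reverses the ingestion step by downloading the faiss_store artifact, unpickling the store, and reattaching the FAISS index. Here's a sketch of what it might look like, assuming an active wandb run as before (the artifact path and file names follow the ingestion code above; the repl's version may differ):

import pickle
import faiss

def load_vectostore():
    # Download the search index artifact logged during ingestion
    artifact = run.use_artifact("parambharat/wandb_docs_bot/faiss_store:latest",
                                type="search_index")
    artifact_dir = artifact.download()
    # Unpickle the store and reattach the raw FAISS index
    with open(f"{artifact_dir}/faiss_store.pkl", "rb") as f:
        store = pickle.load(f)
    store.index = faiss.read_index(f"{artifact_dir}/docs.index")
    return store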

Creating the Chat Interface

When creating a chat interface, it's important to store the user inputs, data, and model responses in a stateful way. This ensures that follow-up queries can make use of the chat state, and that the full conversation, with each user query and model response, can be rendered in the UI. We achieve this with a wrapper class that initializes the Q&A chain above at the beginning of a chat. Here's the code:
class Chat:
    def __init__(self):
        self.chain = None

    def __call__(self, message, history, openai_api_key):
        # Build the chain lazily on the first message, once we have an API key
        if self.chain is None:
            self.chain = load_chain(openai_api_key)

        history = history or []
        message = message.lower()
        response = get_answer(message, self.chain)
        if response is None:
            response = "Please enter a valid OpenAI API key and try again."
        history.append((message, response))
        return history, history
Note that we maintain the chat state in the history variable and use the class to store an initialized chain upon the first call.
💡
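Outside of Gradio, the same class can be exercised directly, which is handy for quick testing (the API key below is a placeholder):

chat = Chat()
history, _ = chat("How do I log images with wandb?", [], "sk-<your-openai-key>")
print(history[-1][1])  # the bot's answer to the latest message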

The User Interface

Gradio made it incredibly easy to create a simple UI for the application. The library even provides a Chatbot class that implements a text chatbot interface. I created a minimal interface that takes the user's question and their OpenAI API key as text inputs and displays the LLM's output in response. Here's the code to do this:
import gradio as gr

with gr.Blocks() as demo:
    with gr.Row():
        question = gr.Textbox(
            label='Type in your questions about wandb here and press Enter!',
            placeholder='How do i log images with wandb ?')
        openai_api_key = gr.Textbox(type='password',
                                    label="Enter your OpenAI API key here",)
    state = gr.State()
    chatbot = gr.Chatbot()
    question.submit(Chat(), [question, state, openai_api_key], [chatbot, state])

I also added a simple HTML block in the code above to introduce the bot and its usage. The full code can be seen in the main.py file of the repl.
💡
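With the Blocks defined, the app just needs to be launched. On Replit this amounts to something like the following (the host and port here are assumptions; see main.py for the exact call):

# Bind to all interfaces so the Replit web view can reach the server
demo.launch(server_name="0.0.0.0", server_port=7860)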

The final application can be seen below:


Final Thoughts and Future Work

This hackathon was a really cool and fun opportunity that I thoroughly enjoyed. I learned that LLMs can be used to create many interesting applications on top of existing data and resources. I also came to understand how the prompt-length limitations of LLMs can be overcome using embeddings and semantic search. While the chatbot I developed is quite simple and has a lot of room for improvement, I still think it's a powerful way to use LLMs to automate mundane tasks and create rich user experiences.
The project also inspired me to work on more applications of LLMs. One such idea I'm currently exploring as a side project is to generate chapters and summaries for Gradient Dissent episodes using LLM embeddings and LangChain. I'll post a report and update you soon!