
Building a RAG system with Llama 3 and Weights & Biases

Dive into the world of retrieval-augmented generation (RAG) and learn how to harness the capabilities of the cutting-edge Llama 3 language model.

Introduction

Today, we're going to guide you through the process of building a RAG system using the Llama 3 model while tracking its performance with the aid of Weights & Biases (W&B). Let's get started:

Understanding RAG systems

Retrieval-augmented Generation (RAG) systems are a novel approach to natural language generation that leverages the power of both large language models and information retrieval techniques. At their core, RAG systems consist of two key components: a retriever and a generator.

The retriever component is responsible for searching through a vast corpus of text data and identifying the most relevant information related to a given input. This process is powered by advanced information retrieval techniques, such as vector similarity search or keyword matching.
The generator component, on the other hand, is a large language model like Llama 3. It takes the relevant information retrieved by the retriever and generates coherent and contextually appropriate text based on that information.
By combining these two components, RAG systems can generate high-quality, factual, and contextually relevant text while mitigating the risk of hallucinating or generating inconsistent information.
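In code, the flow is simply retrieve, then generate. The sketch below is conceptual only; retrieve_relevant_chunks and llama3_generate are placeholder names for the components we build later in this report.

def answer_with_rag(question: str) -> str:
    # Retriever: find the passages most relevant to the question
    context = retrieve_relevant_chunks(question)
    # Generator: let Llama 3 answer using only the retrieved context
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    return llama3_generate(prompt)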

Performance monitoring with W&B

As we delve into the process of building our RAG system, it is crucial to monitor its performance and track key metrics. This is where Weights & Biases comes into play. W&B is a powerful tool for experiment tracking, model monitoring, and performance visualization, allowing us to gain valuable insights into our RAG system's behavior and identify areas for improvement.
In the coding section, we will demonstrate how to leverage W&B to track metrics such as retrieval accuracy, latency, and generation quality, enabling us to make data-driven decisions and continuously refine our RAG system.
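As a quick preview, logging a metric to W&B takes only a couple of lines. The project and metric names below are illustrative; we will log real values once the pipeline is built.

import wandb

wandb.init(project="llama3-rag")  # illustrative project name

# After each query we can log whatever we measured, for example:
wandb.log({
    "retrieval_latency_s": 0.42,   # placeholder value
    "answer_length_chars": 512,    # placeholder value
})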

Step-by-step instructions on preparing your datasets

Before diving into the coding process, it is essential to prepare our datasets correctly. Here are the steps to follow:

Dataset selection

Choosing the right dataset is crucial for building an effective RAG system. We should consider datasets that are large, diverse, and relevant to the domain we are targeting.
For this project, we will be using the Investment Guidelines of Europe PDF, a collection of detailed information about investment and financial guidelines in Europe. This dataset is well-suited for our RAG system as it covers a wide range of topics and provides a rich corpus of text data.


Data preprocessing

Raw data often contains noise, inconsistencies, and irrelevant information that can negatively impact our RAG system's performance. To preprocess this data, we leverage the PyPDF library to extract the text content from the PDF file. Once extracted, we need to clean and prepare the data for our RAG system (a minimal cleaning sketch follows the list below):
  1. Text cleaning: Remove any unwanted elements like page numbers, headers/footers, or artifacts introduced during the extraction process.
  2. Deduplication: Identify and remove any duplicate text segments that may have been inadvertently included during the extraction process.
  3. Normalization: Convert the text to a consistent format, for example by lowercasing it and removing accents or special characters, to improve retrieval accuracy.
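Here is a minimal cleaning sketch along those lines; the exact rules depend on your PDF, so treat this as a starting point rather than a fixed recipe.

import re
import unicodedata

def clean_text(text: str) -> str:
    # Normalize unicode and drop accents/special characters
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    # Collapse repeated whitespace left over from extraction
    text = re.sub(r"\s+", " ", text)
    # Lowercase for consistent matching during retrieval
    return text.strip().lower()

def deduplicate(chunks: list[str]) -> list[str]:
    # Remove exact duplicate segments while preserving order
    seen = set()
    return [c for c in chunks if not (c in seen or seen.add(c))]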

Data vectorization

For our RAG system to effectively retrieve relevant information, we need to convert our textual data into numerical vectors. This process is known as vectorization, and it involves representing words or documents as high-dimensional vectors that capture their semantic meaning.
One popular vectorization technique is TF-IDF (term frequency-inverse document frequency), which assigns higher weights to words that are more relevant to a specific document while down-weighting words that appear frequently across the entire corpus.
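For instance, a TF-IDF matrix can be built in a few lines with scikit-learn; this is only an illustration, since the pipeline below uses sentence-transformer embeddings instead.

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Investment guidelines for equity funds in Europe",
    "Equity funds must disclose their investment risk",
]
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)   # one sparse row per document
print(tfidf_matrix.shape)                       # (2, vocabulary size)
print(vectorizer.get_feature_names_out())       # the learned vocabulary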
Another widely used technique is Word2Vec, which maps words to dense vectors in a continuous vector space, preserving semantic relationships between words.
For this purpose, we are using the SentenceTransformerEmbeddings wrapper, which is backed by freely available sentence-transformer models from Hugging Face.
The SentenceTransformerEmbeddings library allows us to generate high-quality embeddings (vectorized representations) for our text data.
After vectorizing our data, we need to store these vectors efficiently to enable quick retrieval during the generation process. One effective approach is to use specialized data structures like Approximate Nearest Neighbor (ANN) libraries, which allow for fast vector similarity searches.

Building a RAG system with Llama 3

Now that our datasets are prepared, let's dive into the exciting process of building our RAG system with the powerful Llama 3 model.
Before we begin, let us try to understand the prompt format for Llama 3, which is more structured than the formats used by many other models. Here it is:
<|begin_of_text|>
<|start_header_id|>
user
<|end_header_id|>
Hello it is nice to meet you!
<|eot_id|>
<|start_header_id|>
assistant
<|end_header_id|>
The prompt starts with <|begin_of_text|>, each role header is closed with <|end_header_id|>, and each message ends with an <|eot_id|> tag. Llama 3 supports three roles: "system," "user," and "assistant." The prompt above contains only a user message, which reads "Hello it is nice to meet you!" To perform RAG, we require a more nuanced prompt. You'll see that below.
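If you prefer not to write these special tokens by hand, the tokenizer can assemble the prompt for you via apply_chat_template. This is an optional alternative; in this report we keep the template explicit so the structure stays visible.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
messages = [
    {"role": "user", "content": "Hello it is nice to meet you!"},
]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,  # appends the assistant header so the model starts answering
)
print(prompt)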

Step 1: Installing necessary libraries

!pip install langchain
!pip install chromadb
!pip install sentence-transformers
!pip install pypdf
!pip install -U bitsandbytes
!pip install -U git+https://github.com/huggingface/peft.git
!pip install -U git+https://github.com/huggingface/accelerate.git
!pip install -U einops
!pip install -U safetensors
!pip install -U xformers
!pip install -U ctransformers[cuda]
!pip install huggingface_hub
!pip install wandb

Step 2: Importing necessary libraries

from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig
from time import time
import transformers
import torch
from transformers import pipeline
from langchain.llms import HuggingFacePipeline
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain_community.vectorstores import Chroma
To use gated models such as Llama 3, we need to apply for access through Hugging Face. This ensures that these models are not misused to cause any harm. Once our access is granted, we can load the model after logging in to Hugging Face with our personal access token.
from huggingface_hub import notebook_login
notebook_login()

Step 3: Load the model and tokenizer

model_checkpoint = 'meta-llama/Meta-Llama-3-8B-Instruct'

model_config = AutoConfig.from_pretrained(model_checkpoint,
                                          trust_remote_code=True,
                                          max_new_tokens=1024)

model = AutoModelForCausalLM.from_pretrained(model_checkpoint,
                                             trust_remote_code=True,
                                             config=model_config,
                                             device_map='auto')

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
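If GPU memory is tight, the bitsandbytes package we installed earlier can load the model in 4-bit precision instead. This is an optional variant of the loading step above, not required for the rest of the walkthrough.

from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # compute in fp16 while weights stay 4-bit
)
model = AutoModelForCausalLM.from_pretrained(model_checkpoint,
                                             quantization_config=bnb_config,
                                             device_map='auto')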

Step 4: Set up a text generation pipeline on our device and make a HuggingFacePipeline object

text_generation_pipeline = pipeline("text-generation",
                                    model=model,
                                    tokenizer=tokenizer,
                                    torch_dtype=torch.float16,
                                    max_length=3000,
                                    device_map="auto")

llm = HuggingFacePipeline(pipeline=text_generation_pipeline)

Step 5: Quickly verify that everything is in order by prompting the LLM

prompt = """<|begin_of_text|>
<|start_header_id|>
user
<|end_header_id|>
Hello it is nice to meet you!
<|eot_id|>
<|start_header_id|>
assistant
<|end_header_id|>
"""
output = llm.invoke(prompt)
print(output)
We will get something like the output below, and we will have to parse it in order to extract the result:

Let's write a function to do that:
def parse(string):
    return string.split("<|end_header_id|>")[-1]
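For example, running the raw output through this helper leaves only the assistant's reply:

print(parse(output))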

Step 6: Finish setting up the retriever and our database

loader = PyPDFLoader("<path to your PDF>")
documents = loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)

embedding_function = SentenceTransformerEmbeddings(model_name="BAAI/bge-small-en-v1.5")
vectorstore = Chroma(collection_name="sample_collection",
                     embedding_function=embedding_function)

vectorstore.add_documents(texts)

retriever = vectorstore.as_retriever(search_kwargs={"k": 7})
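Before wiring everything together, it is worth a quick sanity check that the retriever returns sensible chunks. The query below is just an example question about the document.

sample_docs = retriever.invoke("What does the document say about investment guidelines?")
for doc in sample_docs:
    print(doc.page_content[:200])
    print("---")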
Rather than composing the chain with LangChain Expression Language (LCEL), let's connect the components manually so each step of the RAG pipeline stays explicit.
class Pipeline:
    def __init__(self, llm, retriever):
        self.llm = llm
        self.retriever = retriever

    def retrieve(self, question):
        docs = self.retriever.invoke(question)
        return "\n\n".join([d.page_content for d in docs])

    def augment(self, question, context):
        return f"""
<|begin_of_text|>
<|start_header_id|>
system
<|end_header_id|>
You are a helpful, respectful and honest assistant designated to answer
questions related to the user's document. If the user tries to ask off-topic
questions, do not engage in the conversation. If the given context is not
sufficient to answer the question, do not answer the question.
<|eot_id|>
<|start_header_id|>
user
<|end_header_id|>
Answer the user question based on the context provided below.
Context: {context}
Question: {question}
<|eot_id|>
<|start_header_id|>
assistant
<|end_header_id|>"""

    def parse(self, string):
        return string.split("<|end_header_id|>")[-1]

    def generate(self, question):
        context = self.retrieve(question)
        prompt = self.augment(question, context)
        answer = self.llm.invoke(prompt)
        return self.parse(answer)

def llama3_chat():
    print("Hello!!!! I am llama3 and I can help with your document. \nIf you want to stop you can enter STOP at any point!")
    print()
    print("-------------------------------------------------------------------------------------")
    pipe = Pipeline(llm, retriever)
    question = input()
    while question != "STOP":
        out = pipe.generate(question)
        print(out)
        print("\nIs there anything else you would like my help with?")
        print("-------------------------------------------------------------------------------------")
        question = input()

llama3_chat()
Now that our pipeline is ready, we can use it to ask questions about the file we loaded earlier.

Performance monitoring with W&B

import os
import wandb

wandb.login()
os.environ["LANGCHAIN_WANDB_TRACING"] = "true"


from langchain.chains import RetrievalQA

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
)

print(qa_chain.run("What are the common requirements for financing and investment operations under the InvestEU Fund?"))

By logging these sample queries and responses to W&B, we can visually compare the performance of Llama 3 alone versus our RAG system. This will help us understand the impact of the retrieval component and identify areas for further improvement.
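One simple way to do that comparison is to log a wandb.Table with each question and both answers side by side; the project and column names here are illustrative.

import wandb

wandb.init(project="llama3-rag")  # illustrative project name

sample_questions = [
    "What are the common requirements for financing and investment operations under the InvestEU Fund?",
]

table = wandb.Table(columns=["question", "llama3_with_rag", "llama3_without_rag"])
for q in sample_questions:
    rag_answer = qa_chain.run(q)          # Llama 3 with retrieved context
    plain_answer = parse(llm.invoke(q))   # Llama 3 alone, no retrieval
    table.add_data(q, rag_answer, plain_answer)

wandb.log({"qa_comparison": table})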

Now that we have an eval set of question-answer pairs, we can use our LLM-based QA bot to generate answers for those questions and then score how closely each prediction matches the reference answer. Given a predicted answer and a reference answer, we can use an LLM itself as the judge.
LLMs are effective here because they can compare the semantics of two texts: given the reference and the predicted answer, an LLM can judge whether they are semantically equivalent and assign a similarity score from 0 to 10, with 0 representing very low similarity and 10 representing very high similarity.
import pathlib
from typing import Union

from langchain.prompts.chat import (ChatPromptTemplate,
                                    SystemMessagePromptTemplate,
                                    HumanMessagePromptTemplate)

def load_eval_prompt(f_name: Union[pathlib.Path, str] = None) -> ChatPromptTemplate:
    human_template = """\nQUESTION: {query}\nCHATBOT ANSWER: {result}\n
ORIGINAL ANSWER: {answer} GRADE:"""

    system_message_prompt = SystemMessagePromptTemplate.from_template(
        """You are an evaluator for the W&B chatbot. You are given a question, the chatbot's answer, and the original answer, and are asked to grade the chatbot's answer on a scale from 0 to 10. Note that sometimes the original answer is not the best answer, and sometimes the chatbot's answer is not the best answer. You are evaluating the chatbot's answer only. Example Format:\nQUESTION: question here\nCHATBOT ANSWER: chatbot's answer here\nORIGINAL ANSWER: original answer here\nGRADE: a grade from 0 to 10 here, where 0 is the lowest (very low similarity) and 10 is the highest (very high similarity)\nPlease remember to grade based on factual accuracy. Begin!"""
    )

    human_message_prompt = HumanMessagePromptTemplate.from_template(human_template)
    chat_prompt = ChatPromptTemplate.from_messages(
        [system_message_prompt, human_message_prompt]
    )
    return chat_prompt


import pandas as pd
from types import SimpleNamespace
from langchain_community.chat_models import ChatOpenAI
from langchain.evaluation.qa import QAEvalChain

def evaluate_model(eval_dataset: pd.DataFrame, answer_col: str, eval_model_name: str) -> list:
    """Grade the answers in `answer_col` against the reference answers with an LLM judge."""
    eval_prompt = load_eval_prompt()
    llm = ChatOpenAI(
        model_name=eval_model_name,
        temperature=0,
    )
    eval_chain = QAEvalChain.from_llm(llm, prompt=eval_prompt)

    examples = []
    predictions = []
    for i in range(len(eval_dataset)):
        examples.append(
            {
                "query": eval_dataset["question"].iloc[i],
                "answer": eval_dataset["answer"].iloc[i],
            }
        )
        predictions.append(
            {
                "query": eval_dataset["question"].iloc[i],
                "answer": eval_dataset["answer"].iloc[i],
                "result": eval_dataset[answer_col].iloc[i],
            }
        )

    graded_outputs = eval_chain.evaluate(
        examples,
        predictions,
        question_key="query",
        answer_key="answer",
        prediction_key="result",
    )
    model_scores = [x.get("results", "None") for x in graded_outputs]
    return model_scores

def evaluate_answers(eval_dataset: pd.DataFrame, config: SimpleNamespace) -> pd.DataFrame:
    # The eval set is assumed to have columns: context, question, answer (reference),
    # rag_answer (Llama 3 with RAG) and model_answer (Llama 3 without RAG).
    # Both answer columns are graded by the same evaluator model.
    eval_dataset["model_score_with_RAG"] = evaluate_model(eval_dataset, "rag_answer", config.eval_model)
    eval_dataset["model_score_without_RAG"] = evaluate_model(eval_dataset, "model_answer", config.eval_model)

    # Rename the columns for the final report table
    output_df = eval_dataset.rename(columns={
        "context": "Context",
        "question": "Questions",
        "rag_answer": "Llama3 with RAG",
        "model_answer": "Llama3 without RAG",
        "model_score_with_RAG": "Model score Llama3 with RAG",
        "model_score_without_RAG": "Model score Llama3 without RAG",
    })

    return output_df

# Define a sample config; the evaluator must be a valid OpenAI chat model
config = SimpleNamespace(eval_model="gpt-3.5-turbo")

# Evaluate answers
evaluated_df = evaluate_answers(eval_dataset, config)
print(evaluated_df)
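Finally, the evaluation table itself can be logged to W&B so the scores with and without RAG sit side by side in the dashboard; the project and key names are illustrative.

import wandb

run = wandb.init(project="llama3-rag", job_type="evaluation")
run.log({"eval_results": wandb.Table(dataframe=evaluated_df)})
run.finish()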


Using Weights & Biases to track our model's performance, we can clearly see that the results are much better and more relevant when we combine Llama 3 with a Retrieval-Augmented Generation (RAG) system.
Without the RAG system, Llama 3 alone sometimes gives incorrect or made-up information because it doesn't have access to factual data on every topic. However, when we add the RAG system, it provides Llama 3 with accurate information retrieved from the source document.
This way, Llama 3 can use that factual information to generate high-quality and truthful responses. Weights & Biases allows us to easily compare the outputs before and after adding the RAG system, and we can visually see the huge improvement in accuracy and relevance. The combination of Llama 3's language abilities and the RAG system's access to real-world data, along with the monitoring power of Weights & Biases, creates a powerful system that generates excellent results.

Best practices and tips

Throughout the process of building our RAG system with Llama 3, we've learned several valuable lessons and best practices. Here are some tips that can enhance your experience:
  1. Data quality: The quality of your dataset is paramount. Ensure that your corpus is diverse, relevant, and accurately represents the domain you are targeting. Investing time in data preprocessing and cleaning will pay dividends in the long run.
  2. Vectorization techniques: Experiment with different vectorization techniques (e.g., TF-IDF, Word2Vec, BERT embeddings) and evaluate their impact on retrieval accuracy. Some techniques may work better than others for your specific use case.
  3. Retrieval efficiency: As your dataset grows larger, retrieval efficiency becomes increasingly important. Consider implementing techniques like approximate nearest neighbor search or indexing strategies to maintain high performance.
  4. Ensemble models: Consider combining multiple models or retrieval techniques to create an ensemble system. This approach can leverage the strengths of different models and mitigate their individual weaknesses, resulting in improved overall performance.
  5. Interpretability: Ensure that your RAG system's outputs are interpretable and easy to understand for end-users. Implement techniques like output formatting, highlighting retrieved information, or providing confidence scores to enhance transparency and trust in the system.
  6. Continuous monitoring: Regularly monitor your RAG system's performance using tools like Weights & Biases (W&B). Track metrics such as retrieval accuracy, latency, and generation quality over time, and identify patterns or anomalies that may require further investigation or model updates.

Conclusion

As language models like Llama 3 continue to evolve and RAG systems become more sophisticated, we can expect even more capable models for natural language generation and understanding. If you'd like to learn more about building LLM applications, we recommend checking out our courses, especially our course on building LLM apps and our newest offering, LLM engineering: Structured outputs.