
Document question answering with Llama 3 and Weaviate

Created on August 23|Last edited on October 24
Large Language Models (LLMs) have revolutionized the field of Natural Language Processing (NLP), dramatically improving performance across a wide range of tasks. From text classification and machine translation to sentiment analysis and question answering, these models have demonstrated remarkable capabilities.
One helpful application of LLMs is document question answering systems. These systems leverage LLMs' advanced language understanding and generation capabilities to provide accurate and contextually relevant answers to questions based on given documents.
In this article, we'll create a document question answering system using two powerful tools: Llama 3 and Weaviate. This practical guide will showcase how to harness the strengths of a state-of-the-art language model alongside a vector database to build an efficient and effective document analysis solution.



Understanding language models

Language models, especially Large Language Models (LLMs), are pre-trained on vast amounts of textual data, enabling them to understand and generate human-like language. Meta's Llama 3.1 405B is an example of such a model, boasting 405 billion parameters. With this scale, it has the capacity to grasp intricate patterns in language and generate contextually relevant responses.

Benefits of document question answering with LLMs

Traditional question answering systems have generally relied on keyword matching or rule-based approaches, with little-to-no "understanding" of context, which limits their ability to provide accurate answers. LLMs, on the other hand, excel at contextual comprehension, allowing for a more nuanced understanding of the questions posed to them.
In document question answering with LLMs, users can input a question related to a specific document or dataset, and the language model analyzes the content to generate precise and contextually relevant answers. This shift has significant implications for information retrieval and user experience.

Some key benefits of question answering with LLMs include:

  • Contextual Understanding: Unlike traditional methods that often rely on literal keyword matching, LLMs can capture the nuances of language and the relationships between words and phrases. This enables them to provide more accurate and contextually relevant answers.
  • Multimodal Integration: Advanced LLMs can integrate information from various modalities, such as text, images, and tables, allowing users to ask questions about diverse types of content.
  • Dynamic Adaptability: LLMs can adapt to different domains and types of documents, making them versatile tools for a wide range of applications.
  • Reduced Dependency on Keywords: LLMs can infer connections and provide answers based on the overall meaning of the document, reducing the reliance on keyword-based search.
  • User-Friendly Interface: Document question answering with LLMs simplifies the user experience by allowing natural language queries, making information retrieval more intuitive and accessible.

What is LangChain?

LangChain is a framework that enables developers to build applications powered by Large Language Models (LLMs). It provides a set of tools and APIs that simplify the process of integrating LLMs into applications, allowing developers to focus on building innovative solutions rather than worrying about the underlying complexity of LLMs.
LangChain can be thought of as a modular toolkit for building LLM-powered applications. It provides pre-built blocks, called Chains, that can be easily combined to create complex workflows. Each Chain performs a specific task, such as summarizing text, translating languages, or generating code. By snapping these Chains together, developers can build a wide range of applications, from chatbots and document analysis tools to code generation and personalized content creation systems.
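As a quick illustration of this modular idea, here is a minimal sketch of a single summarization chain built from a prompt template, a hosted model, and an output parser. The model repo and prompt text are just examples and require a Hugging Face Hub API token; this is a sketch, not the pipeline we build later.

# Minimal sketch of composing LangChain components into a chain
# (the repo_id and prompt are illustrative; a HUGGINGFACEHUB_API_TOKEN must be set)
from langchain.prompts import PromptTemplate
from langchain.llms import HuggingFaceHub
from langchain.schema.output_parser import StrOutputParser

prompt = PromptTemplate.from_template("Summarize the following text in one sentence:\n{text}")
llm = HuggingFaceHub(repo_id="google/flan-t5-large", model_kwargs={"temperature": 0.5, "max_length": 128})

summarize_chain = prompt | llm | StrOutputParser()
print(summarize_chain.invoke({"text": "LangChain provides modular building blocks for LLM applications."}))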

What is Weaviate?

Weaviate is an open-source vector database that stores and indexes data points by their underlying meanings and relationships, allowing for more intelligent and nuanced information retrieval. It uses vector embeddings to understand the meanings behind the data, enabling it to provide accurate and relevant results.
Unlike traditional databases that rely on keyword matching, Weaviate's vector-based approach enables it to "understand" the context and relationships between different pieces of data. This makes it possible for Weaviate to provide a more intuitive and human-like experience, making it easier for users to find the information they need.
  • Precise Results: Weaviate retrieves information based on semantic similarity, surfacing content that aligns more closely with user intent.
  • Contextual Awareness: Weaviate considers the context of your search, factoring in surrounding information and user preferences.
  • Unforeseen Connections: Weaviate can uncover hidden relationships and unexpected connections within your data.
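To make the idea of semantic similarity concrete, here is a toy sketch with made-up 4-dimensional embeddings (real embedding vectors have hundreds of dimensions). It shows how a vector database like Weaviate scores a document chunk against a query by comparing embeddings rather than keywords.

# Toy illustration of vector search: items are compared by embedding similarity,
# not keyword overlap. The vectors below are made up for demonstration.
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = np.array([0.9, 0.1, 0.3, 0.0])   # hypothetical embedding of the user's question
doc_a = np.array([0.8, 0.2, 0.4, 0.1])   # a chunk about the same topic
doc_b = np.array([0.0, 0.9, 0.1, 0.8])   # an unrelated chunk

print(cosine_similarity(query, doc_a))   # high score -> retrieved first
print(cosine_similarity(query, doc_b))   # low score  -> ranked lower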

Llama 3.1 405B

In this tutorial we will be using the Llama 3.1 405B model.

What is Llama 3.1 405B?

Llama 3.1 405B is a large open-source language model designed to compete with closed models like GPT-4o and Claude 3.5 Sonnet. It is the largest and most capable model in the Llama series, with 405 billion parameters and a context length of 128K tokens.

Key features of Llama 3.1 405B

Llama 3.1 405B boasts several key capabilities that make it a competitive model in the field of natural language processing. Its large size and advanced architecture enable it to capture complex patterns and nuances in language, making it well-suited for a wide range of applications, including long-form text summarization, multilingual conversational agents, and of course ... question answering.
Llama 3.1 405B has been evaluated on over 150 benchmark datasets and has demonstrated competitive performance with leading models like GPT-4o and Claude 3.5 Sonnet.

Fine-Tuning with LoRA

To further enhance the capabilities of Llama 3.1 405B, we'll use the LoRA (Low-Rank Adaptation) technique for fine-tuning. LoRA is a lightweight and efficient method that adapts the model to specific tasks like question answering without updating its full weight matrices.

The Benefits of LoRA

LoRA offers several advantages over traditional fine-tuning methods:
  • Memory-Efficient: LoRA trains small low-rank matrices instead of the full weights, reducing memory consumption during training and making fine-tuning accessible on less powerful hardware.
  • Faster Training: With far fewer parameters to adjust, LoRA trains faster than full-parameter methods, saving time and computational resources.
  • Surprisingly Good Results: Despite its compact size, LoRA often matches or even surpasses the performance of full-parameter fine-tuning, demonstrating its flexibility and adaptability.
With LoRA, we can fine-tune Llama 3.1 405B for our question answering task while reducing computational requirements and memory usage.
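A quick back-of-the-envelope calculation shows where the savings come from: for a weight matrix of size d x k, full fine-tuning updates all d*k entries, while LoRA only trains two low-rank matrices with r*(d + k) parameters. The dimensions and rank below are illustrative.

# Illustrative parameter count for a single weight matrix (values are hypothetical)
d, k, r = 4096, 4096, 8           # matrix dimensions and LoRA rank

full_params = d * k               # parameters updated by full fine-tuning
lora_params = r * (d + k)         # parameters in the low-rank adapters A (r x k) and B (d x r)

print(f"Full fine-tuning: {full_params:,} parameters")
print(f"LoRA (r={r}):     {lora_params:,} parameters ({lora_params / full_params:.2%} of full)")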
Now, let's get to the code.

Building our document question answering system

We will walk you through building a document question answering (QA) system using Llama 3.1, Weaviate, and LangChain. We'll also leverage Weights & Biases to track and visualize the fine-tuning process and the QA results.


Step 1: Install Required Libraries

First, we need to install the necessary libraries (weaviate-client is pinned to the 3.x series because the code below uses the v3 client API).
!pip install torch transformers datasets peft trl bitsandbytes accelerate "weaviate-client<4" langchain sentence-transformers pypdf wandb

Step 2: Set Up Weights & Biases

  • Create a free Weights & Biases account at https://wandb.ai if you don't already have one.
  • Log in from Python:
import wandb
wandb.login()
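Optionally, you can also start a run up front so that everything logged later (including the trainer metrics sent via report_to="wandb") lands in one project. The project name below is just an example.

import wandb

# The project name is illustrative; pick any name you like
wandb.init(project="llama3-document-qa", job_type="fine-tune")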

Step 3: Load and Prepare the Dataset

For this example, we'll use the SQuAD dataset, a popular dataset for question-answering tasks.
from datasets import load_dataset
dataset = load_dataset("squad", split="train")


Step 4: Set Up Llama 3.1 with QLoRA Fine-Tuning

We will use QLoRA (Quantized LoRA) to fine-tune the Llama 3.1 model for our QA system, keeping it lightweight but powerful. Below is the configuration to get Llama 3.1 405B up and running:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig
from trl import SFTTrainer  # SFTTrainer lives in trl, not peft
import torch

# Load the tokenizer and model with a QLoRA (4-bit) configuration
compute_dtype = torch.bfloat16  # set to torch.float16 if your GPU does not support bfloat16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=compute_dtype
)

# Load Llama 3.1 405B with the quantization config
model_name = "NousResearch/Llama-3-405B"
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)

# Load the Llama tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token  # avoid padding/overflow issues

# LoRA configuration for the QA task
peft_config = LoraConfig(r=8, lora_alpha=32, lora_dropout=0.1, task_type="CAUSAL_LM")

training_args = TrainingArguments(
    output_dir="./outputs",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    save_steps=100,
    logging_steps=10,
    learning_rate=2e-5,
    bf16=True,            # if your GPU supports bf16, otherwise use fp16=True
    report_to="wandb"     # log training metrics to W&B
)

# Set up the trainer for supervised fine-tuning
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    args=training_args,
    tokenizer=tokenizer
)

trainer.train()

# Save the trained LoRA adapter
new_model = "llama-3.1-qa-adapter"  # output name; choose anything you like
trainer.model.save_pretrained(new_model)
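As a quick sanity check, here is a hedged sketch of loading the saved adapter back on top of the base model and generating an answer. The test prompt and generation settings are illustrative, and the snippet assumes a CUDA GPU is available.

# Load the saved LoRA adapter on top of the quantized base model and run a quick test
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)
qa_model = PeftModel.from_pretrained(base_model, new_model)

test_prompt = "Question: What is LoRA?\nContext: LoRA adapts a model by training small low-rank matrices.\nAnswer:"
inputs = tokenizer(test_prompt, return_tensors="pt").to("cuda")  # assumes a GPU
outputs = qa_model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))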


Step 5: Load the Weaviate Vector Database

Now that the model is ready, we need to set up Weaviate to store document embeddings and perform efficient document retrieval. We'll use the Weaviate client we installed in Step 1 to spin up an embedded instance:
import weaviate
from weaviate.embedded import EmbeddedOptions

client = weaviate.Client(embedded_options=EmbeddedOptions())

# Load your documents and split them into manageable chunks
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

loader = PyPDFLoader("path_to_your_document.pdf")
docs = loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=100)
splits = text_splitter.split_documents(docs)

# Store the chunks in Weaviate
from langchain.vectorstores import Weaviate
from langchain.embeddings import HuggingFaceEmbeddings

vectorstore = Weaviate.from_documents(
    client=client,
    documents=splits,
    embedding=HuggingFaceEmbeddings(model_name="NousResearch/Llama-3-405B"),
    by_text=False  # search with the provided embeddings rather than Weaviate's own vectorizer
)
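Before wiring in the language model, it's worth a quick retrieval sanity check against the vector store; the query string below is just an example.

# Retrieve the chunks most similar to a sample query (the query is illustrative)
results = vectorstore.similarity_search("What is the main topic of this document?", k=3)
for doc in results:
    print(doc.page_content[:200])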


For reference, here is the complete fine-tuning script in one place, with every hyperparameter defined explicitly. The values mirror the configuration used in Step 4; the remaining settings are common QLoRA defaults that you can adjust for your hardware.

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig
from trl import SFTTrainer

# Model, dataset, and output names
model_name = "NousResearch/Llama-3-405B"
dataset_name = "squad"
new_model = "llama-3.1-qa-adapter"
output_dir = "./outputs"

# QLoRA / quantization settings
use_4bit = True
bnb_4bit_compute_dtype = "bfloat16"   # use "float16" if bf16 is not supported
bnb_4bit_quant_type = "nf4"
use_nested_quant = False

# LoRA settings
lora_r = 8
lora_alpha = 32
lora_dropout = 0.1

# Training hyperparameters (example values; tune for your setup)
num_train_epochs = 3
per_device_train_batch_size = 8
gradient_accumulation_steps = 4
optim = "paged_adamw_32bit"
save_steps = 100
logging_steps = 10
learning_rate = 2e-5
weight_decay = 0.001
fp16 = False
bf16 = True
max_grad_norm = 0.3
max_steps = -1
warmup_ratio = 0.03
group_by_length = True
lr_scheduler_type = "cosine"
max_seq_length = 1024
packing = False
device_map = "auto"

# Load the dataset and flatten each example into a single "text" field (see Step 3)
dataset = load_dataset(dataset_name, split="train")
dataset = dataset.map(
    lambda ex: {
        "text": f"Question: {ex['question']}\nContext: {ex['context']}\nAnswer: "
                f"{ex['answers']['text'][0] if ex['answers']['text'] else ''}"
    }
)

# Load tokenizer and model with QLoRA configuration
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

# Check GPU compatibility with bfloat16
if compute_dtype == torch.float16 and use_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("=" * 80)
        print("Your GPU supports bfloat16: accelerate training with bf16=True")
        print("=" * 80)

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map
)
model.config.use_cache = False
model.config.pretraining_tp = 1

# Load the Llama tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"  # avoids an overflow issue with fp16 training

# Load LoRA configuration
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
)

# Set training parameters
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type,
    report_to="wandb"
)

# Set supervised fine-tuning parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,
)

# Train model
trainer.train()

# Save the trained LoRA adapter
trainer.model.save_pretrained(new_model)




Understanding RAG (Retrieval-Augmented Generation)

In the ever-evolving landscape of natural language processing (NLP), the emergence of powerful approaches like RAG (Retrieval-Augmented Generation) has opened new frontiers in text generation and comprehension. RAG combines the strengths of retrieval-based methods and generative models, offering a unique approach to understanding and producing human-like text. We will explore what RAG is and its underlying principles, and then see how those pieces come together in our own pipeline.
RAG integrates the strengths of retrieval-based approaches with the creative capabilities of generative models. Designed to handle complex language tasks, it excels at information retrieval while maintaining the ability to generate coherent and contextually relevant responses.


Key Components of RAG:


Retrieval Module:
The retrieval module in RAG is responsible for efficiently searching through large amounts of data, such as documents or passages, to find relevant information. It employs advanced techniques like dense vector representations to capture semantic similarity and retrieve contextually relevant content.

Generative Module:
The generative module is the creative aspect of RAG. It takes the retrieved information and synthesizes it into human-like responses. This module is typically based on powerful language models like GPT (Generative Pre-trained Transformer) or similar architectures.

Integration Layer:
RAG's unique strength lies in its ability to seamlessly integrate the retrieval and generative modules. The integration layer ensures that the retrieved information is effectively utilized to enhance the quality and relevance of the generated responses.
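Putting those three pieces together, the end-to-end flow looks roughly like this. It is a conceptual sketch only: the function and variable names are illustrative, and the concrete LangChain version follows in the next section.

# Conceptual sketch of the RAG flow: retrieve, integrate, generate
def rag_answer(question, retriever, llm):
    # Retrieval module: fetch the chunks most relevant to the question
    context_docs = retriever.get_relevant_documents(question)
    context = "\n\n".join(doc.page_content for doc in context_docs)

    # Integration layer: combine the question with the retrieved context in one prompt
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

    # Generative module: produce the final answer
    return llm(prompt)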


Using W&B

If you haven't already done so in Step 2, create a W&B account and install the client with
pip install wandb
Then log in from the terminal:
wandb login


Creating the QA pipeline and integrating with W&B

With the fine-tuned model and the Weaviate vector store in place, we can wire everything together into a retrieval-augmented QA chain:

import weaviate
from weaviate.embedded import EmbeddedOptions
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Weaviate
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.prompts import ChatPromptTemplate
from langchain.llms import HuggingFaceHub
from langchain.schema.runnable import RunnablePassthrough
from langchain.schema.output_parser import StrOutputParser

# Load and chunk the source document
docs = []
loader = PyPDFLoader("/content/main_notes.pdf")
docs.extend(loader.load())

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500,
    chunk_overlap=150,
    separators=["\n\n", "\n", " ", ""]
)
splits = text_splitter.split_documents(docs)

# Embed the chunks and store them in an embedded Weaviate instance
client = weaviate.Client(embedded_options=EmbeddedOptions())

vectorstore = Weaviate.from_documents(
    client=client,
    documents=splits,
    embedding=HuggingFaceEmbeddings(model_name="NousResearch/Llama-2-7b-chat-hf"),
    by_text=False
)
retriever = vectorstore.as_retriever()

# Prompt template for grounded question answering
template = """You serve as an assistant specialized in answering questions. Utilize the provided context fragments to respond to the question. If unsure, simply state your lack of knowledge. Provide a concise answer within three sentences at most.
Question: {question}
Context: {context}
Answer:
"""
prompt = ChatPromptTemplate.from_template(template)

# Hosted LLM for generation (requires a HUGGINGFACEHUB_API_TOKEN; the repo_id is an example)
repo_id = "NousResearch/Llama-3-405B"
llm = HuggingFaceHub(
    repo_id=repo_id, model_kwargs={"temperature": 0.5, "max_length": 64}
)

# Assemble the RAG chain: retrieve context, fill the prompt, generate, and parse the output
rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

query = "Tell me about Support Vector Machine"
rag_chain.invoke(query)
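Finally, to close the loop on the W&B integration, here is a minimal sketch of logging question/answer pairs to a W&B table so they can be inspected in the UI. The project and table names are illustrative.

import wandb

# Log QA results to a W&B table for inspection (names below are illustrative)
run = wandb.init(project="llama3-document-qa", job_type="inference")
qa_table = wandb.Table(columns=["question", "answer"])

answer = rag_chain.invoke(query)
qa_table.add_data(query, answer)

run.log({"qa_results": qa_table})
run.finish()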



