Document question answering with Llama 3 and Weaviate
Large Language Models (LLMs) have revolutionized the field of Natural Language Processing (NLP), dramatically improving performance across a wide range of tasks. From text classification and machine translation to sentiment analysis and question-answering, these powerful models have remarkable capabilities.
One helpful application of LLMs is document question answering systems. These systems leverage LLMs' advanced language understanding and generation capabilities to provide accurate and contextually relevant answers to questions based on given documents.
In this article, we'll create a document question answering system using two powerful tools: Llama 3 and Weaviate. This practical guide will showcase how to harness the strengths of a state-of-the-art language model alongside a vector database to build an efficient and effective document analysis solution.

Understanding language models
Language models, especially Large Language Models (LLMs), are pre-trained on vast amounts of textual data, enabling them to understand and generate human-like language. Meta's Llama 3.1 405B is an example of such a model, boasting 405 billion parameters. With this scale, it has the capacity to grasp intricate patterns in language and generate contextually relevant responses.
Benefits of document question answering with LLMs
Traditional question answering systems have generally relied on keyword matching or rule-based approaches, with little-to-no "understanding" of context, which limits their ability to provide accurate answers. LLMs, on the other hand, excel at contextual comprehension, allowing for a more nuanced understanding of the questions posed to them.
In document question answering with LLMs, users can input a question related to a specific document or dataset, and the language model analyzes the content to generate precise and contextually relevant answers. This shift has significant implications for information retrieval and user experience.
Some key benefits of question answering with LLMs include:
- Contextual Understanding: Unlike traditional methods that often rely on literal keyword matching, LLMs can capture the nuances of language and the relationships between words and phrases. This enables them to provide more accurate and contextually relevant answers.
- Multimodal Integration: Advanced LLMs can integrate information from various modalities, such as text, images, and tables, allowing users to ask questions about diverse types of content.
- Dynamic Adaptability: LLMs can adapt to different domains and types of documents, making them versatile tools for a wide range of applications.
- Reduced Dependency on Keywords: LLMs can infer connections and provide answers based on the overall meaning of the document, reducing the reliance on keyword-based search.
- User-Friendly Interface: Document question answering with LLMs simplifies the user experience by allowing natural language queries, making information retrieval more intuitive and accessible.
What is LangChain?
LangChain is a framework that enables developers to build applications powered by Large Language Models (LLMs). It provides a set of tools and APIs that simplify the process of integrating LLMs into applications, allowing developers to focus on building innovative solutions rather than worrying about the underlying complexity of LLMs.
LangChain can be thought of as a modular toolkit for building LLM-powered applications. It provides pre-built blocks, called Chains, that can be easily combined to create complex workflows. Each Chain performs a specific task, such as summarizing text, translating languages, or generating code. By snapping these Chains together, developers can build a wide range of applications, from chatbots and document analysis tools to code generation and personalized content creation systems.
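As a minimal sketch of that idea, the snippet below composes a prompt, a hosted model, and an output parser into a single chain. The repo ID is just an illustrative example, and calling it requires a Hugging Face API token.
from langchain.prompts import ChatPromptTemplate
from langchain.llms import HuggingFaceHub
from langchain.schema.output_parser import StrOutputParser

# Each piece is a reusable block; the | operator snaps them together into one chain
prompt = ChatPromptTemplate.from_template("Summarize the following text in one sentence:\n{text}")
llm = HuggingFaceHub(  # example repo ID; requires a Hugging Face API token
    repo_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
    model_kwargs={"temperature": 0.5},
)
chain = prompt | llm | StrOutputParser()

print(chain.invoke({"text": "LangChain lets developers compose LLM calls into reusable pipelines."}))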
What is Weaviate?
Weaviate is an open-source vector database that stores and indexes data points by their underlying meanings and relationships, allowing for more intelligent and nuanced information retrieval. It uses vector embeddings to understand the meanings behind the data, enabling it to provide accurate and relevant results.
Unlike traditional databases that rely on keyword matching, Weaviate's vector-based approach enables it to "understand" the context and relationships between different pieces of data. This makes it possible for Weaviate to provide a more intuitive and human-like experience, making it easier for users to find the information they need.
- Precise Results: Weaviate retrieves information based on semantic similarity, surfacing content that aligns more closely with user intent.
- Contextual Awareness: Weaviate considers the context of your search, factoring in surrounding information and user preferences.
- Unforeseen Connections: Weaviate can uncover hidden relationships and unexpected connections within your data.
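To make this concrete, here is a minimal sketch of storing and querying vectors with an embedded Weaviate instance. The "Article" class and the toy vectors are illustrative placeholders; in practice the vectors would come from an embedding model.
import weaviate
from weaviate.embedded import EmbeddedOptions

client = weaviate.Client(embedded_options=EmbeddedOptions())

# Create a simple class that stores our own vectors (no server-side vectorizer)
client.schema.create_class({"class": "Article", "vectorizer": "none"})

# Insert an object together with its embedding vector
client.data_object.create(
    data_object={"title": "Support Vector Machines"},
    class_name="Article",
    vector=[0.12, 0.53, 0.91],  # toy 3-dimensional embedding
)

# Retrieve the object closest to a query vector (semantic similarity, not keyword match)
result = (
    client.query
    .get("Article", ["title"])
    .with_near_vector({"vector": [0.10, 0.50, 0.90]})
    .with_limit(1)
    .do()
)
print(result)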

Llama 3.1 405B
In this tutorial we will be using the Llama 3.1 405B model.
What is Llama 3.1 405B?
Llama 3.1 405B is a large open-source language model designed to compete with closed models like GPT-4o and Claude 3.5 Sonnet. It is the largest and most capable model in the Llama series, with 405 billion parameters and a context length of 128K tokens.
Key features of Llama 3.1 405B
Llama 3.1 405B boasts several key capabilities that make it a competitive model in the field of natural language processing. Its large size and advanced architecture enable it to capture complex patterns and nuances in language, making it well-suited for a wide range of applications, including long-form text summarization, multilingual conversational agents, and of course ... question answering.
Llama 3.1 405B has been evaluated on over 150 benchmark datasets and has demonstrated competitive performance with leading models like GPT-4o and Claude 3.5 Sonnet.

Fine-Tuning with LoRA
To further enhance the capabilities of Llama 3.1 405B, we'll be using the LoRA (Low-Rank Adaptation) technique for fine-tuning. LoRA is a lightweight and efficient method that lets us adapt the model to specific tasks like question answering by training small low-rank update matrices instead of updating the full weight matrices.
The Benefits of LoRA
LoRA offers several advantages over traditional fine-tuning methods:
- Memory-Efficient: LoRA trains only small low-rank matrices, reducing memory consumption during training and making it accessible on less powerful hardware.
- Faster Training: With fewer parameters to adjust, LoRA trains faster than full-parameter methods, saving time and computational resources.
- Surprisingly Good Results: Despite its compact size, LoRA often matches or even surpasses the performance of full-parameter fine-tuning, demonstrating its flexibility and adaptability.
With LoRA, we can fine-tune Llama 3.1 405B for our question answering task while also reducing the computational requirements and memory usage.
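To see what this looks like in practice, here is a minimal sketch using the PEFT library. A small model (gpt2) stands in for Llama 3.1 so the trainable-parameter count is quick to inspect, but the same pattern applies to the larger model.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("gpt2")  # small stand-in model

lora_config = LoraConfig(
    r=8,              # rank of the low-rank update matrices
    lora_alpha=32,    # scaling factor for the adapter weights
    lora_dropout=0.1,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
# Only the small adapter matrices are trainable; the base weights stay frozen
model.print_trainable_parameters()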
Now, let's get to the code.
Building our document question answering system
We will walk you through building a document question answering (QA) system using Llama 3.1, Weaviate, and LangChain. We will also leverage Weights & Biases to track and visualize the fine-tuning process.
Step 1: Install Required Libraries
First, we need to install the necessary libraries.
!pip install torch transformers peft trl accelerate bitsandbytes datasets weaviate-client langchain sentence-transformers pypdf wandb
Step 2: Set Up Weights & Biases
Log in to Weights & Biases from Python:
import wandb

wandb.login()
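Optionally, you can also start a run explicitly so the fine-tuning metrics logged later are grouped under a named project. The project name below is an arbitrary placeholder.
# Start a named W&B run; "llama3-document-qa" is a placeholder project name
wandb.init(project="llama3-document-qa", job_type="fine-tune")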
Step 3: Load and Prepare the Dataset
For this example, we'll use the SQuAD dataset, a popular dataset for question-answering tasks.
from datasets import load_dataset

dataset = load_dataset("squad", split="train")
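The SFTTrainer used in the next step trains on a single text column, while SQuAD stores separate question, context, and answers fields. A simple (and purely illustrative) way to bridge the two is to flatten each example into a prompt-plus-answer string:
def format_squad(example):
    # Take the first gold answer if one exists (SQuAD v1 examples always have one)
    answer = example["answers"]["text"][0] if example["answers"]["text"] else ""
    example["text"] = (
        f"Context: {example['context']}\n"
        f"Question: {example['question']}\n"
        f"Answer: {answer}"
    )
    return example

dataset = dataset.map(format_squad)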
Step 4: Set Up Llama 3.1 with QLoRA Fine-Tuning
We will use QLoRA (Quantized LoRA) to fine-tune the Llama 3.1 model for our QA system, keeping it lightweight but powerful. Below is the configuration to get Llama 3.1 405B up and running:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig
from trl import SFTTrainer
import torch

# QLoRA quantization configuration
compute_dtype = torch.bfloat16  # use bfloat16 if the GPU supports it, otherwise float16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=compute_dtype,
)

# Load Llama 3.1 405B with the quantization config
model_name = "NousResearch/Llama-3-405B"
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)

# Load the Llama tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token  # avoid overflow issues during training

# LoRA configuration for the QA task
peft_config = LoraConfig(r=8, lora_alpha=32, lora_dropout=0.1, task_type="CAUSAL_LM")

training_args = TrainingArguments(
    output_dir="./outputs",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    save_steps=100,
    logging_steps=10,
    learning_rate=2e-5,
    bf16=True,       # if the GPU supports bf16, otherwise use fp16
    report_to="wandb",  # log training metrics to W&B
)

# Set up the trainer for supervised fine-tuning
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    tokenizer=tokenizer,
    args=training_args,
)
trainer.train()

# Save the trained adapter weights
new_model = "llama-3.1-405b-qa-lora"
trainer.model.save_pretrained(new_model)
Step 5: Load the Weaviate Vector Database
Now that the model is ready, we need to set up Weaviate to store document embeddings and perform efficient document retrieval. With the weaviate-client library installed in Step 1, set up an embedded instance and load your documents:
import weaviate
from weaviate.embedded import EmbeddedOptions
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Weaviate
from langchain.embeddings import HuggingFaceEmbeddings

# Start an embedded Weaviate instance
client = weaviate.Client(embedded_options=EmbeddedOptions())

# Load your documents and split them into manageable chunks
loader = PyPDFLoader("path_to_your_document.pdf")
docs = loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=100)
splits = text_splitter.split_documents(docs)

# Store the chunks (and their embeddings) in Weaviate
vectorstore = Weaviate.from_documents(
    client=client,
    documents=splits,
    embedding=HuggingFaceEmbeddings(model_name="NousResearch/Llama-3-405B"),
)
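Before wiring the store into a full RAG chain, it can help to run a quick retrieval sanity check; the query string below is just an example.
# Retrieve the chunks most similar to a sample query
hits = vectorstore.similarity_search("What is this document about?", k=3)
for doc in hits:
    print(doc.page_content[:200])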
For reference, here is the fuller version of the fine-tuning script, where the hyperparameters (model_name, dataset_name, the bnb_4bit_* settings, the LoRA and TrainingArguments values, and so on) are assumed to be defined as variables earlier in the notebook:
dataset = load_dataset(dataset_name, split="train")

# Load tokenizer and model with QLoRA configuration
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

# Check GPU compatibility with bfloat16
if compute_dtype == torch.float16 and use_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("=" * 80)
        print("Your GPU supports bfloat16: accelerate training with bf16=True")
        print("=" * 80)

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map,
)
model.config.use_cache = False
model.config.pretraining_tp = 1

# Load LLaMA tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"  # fix weird overflow issue with fp16 training

# Load LoRA configuration
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
)

# Set training parameters
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type,
    report_to="wandb",
)

# Set supervised fine-tuning parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,
)

# Train model
trainer.train()

# Save trained model
trainer.model.save_pretrained(new_model)
Understanding RAG (Retrieval-Augmented Generation)
In the ever-evolving landscape of natural language processing (NLP), the emergence of techniques like RAG (Retrieval-Augmented Generation) has opened new frontiers in text generation and comprehension. RAG combines the strengths of retrieval-based methods and generative models, offering a unique approach to understanding and producing human-like text. We will explore what RAG is, its underlying principles, and how to build your own RAG pipeline.
RAG is a state-of-the-art NLP technique that integrates the strengths of retrieval-based approaches with the creative capabilities of generative models. Developed to handle complex language tasks, RAG excels at information retrieval while maintaining the ability to generate coherent and contextually relevant responses.
Key Components of RAG:
Retrieval Module:
The retrieval module in RAG is responsible for efficiently searching through large amounts of data, such as documents or passages, to find relevant information. It employs advanced techniques like dense vector representations to capture semantic similarity and retrieve contextually relevant content.
Generative Module:
The generative module is the creative aspect of RAG. It takes the retrieved information and synthesizes it into human-like responses. This module is typically based on powerful language models like GPT (Generative Pre-trained Transformer) or similar architectures.
Integration Layer:
RAG's unique strength lies in its ability to seamlessly integrate the retrieval and generative modules. The integration layer ensures that the retrieved information is effectively utilized to enhance the quality and relevance of the generated responses.
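To make the flow concrete, here is a minimal sketch of how the three components fit together. The retriever and llm objects are placeholders for whichever retrieval backend and generative model you use (for example, the Weaviate retriever and Llama model from the earlier steps).
def rag_answer(question, retriever, llm, k=3):
    # 1. Retrieval module: fetch the k most relevant chunks
    docs = retriever.get_relevant_documents(question)[:k]
    # 2. Integration layer: fold the retrieved text into the prompt
    context = "\n\n".join(doc.page_content for doc in docs)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    # 3. Generative module: produce the final answer
    return llm.invoke(prompt)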
Using W&B
Create a W&B account and install W&B using
pip install wandb
Then login using
wandb login
Creating the QnA pipeline and integrating with W&B
import weaviate
from weaviate.embedded import EmbeddedOptions
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Weaviate
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.prompts import ChatPromptTemplate
from langchain.llms import HuggingFaceHub
from langchain.schema.runnable import RunnablePassthrough
from langchain.schema.output_parser import StrOutputParser

# Load and chunk the source document
docs = []
loader = PyPDFLoader("/content/main_notes.pdf")
docs.extend(loader.load())

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500,
    chunk_overlap=150,
    separators=["\n\n", "\n", " ", ""],
)
splits = text_splitter.split_documents(docs)

# Embed the chunks and store them in an embedded Weaviate instance
client = weaviate.Client(embedded_options=EmbeddedOptions())
vectorstore = Weaviate.from_documents(
    client=client,
    documents=splits,
    embedding=HuggingFaceEmbeddings(model_name="NousResearch/Llama-2-7b-chat-hf"),
    by_text=False,
)
retriever = vectorstore.as_retriever()

# Prompt template for the QA assistant
template = """You serve as an assistant specialized in answering questions. Utilize the provided context fragments to respond to the question. If unsure, simply state your lack of knowledge. Provide a concise answer within three sentences at most.
Question: {question}
Context: {context}
Answer:
"""
prompt = ChatPromptTemplate.from_template(template)

# repo_id should be set to the Hugging Face repo ID of the model used for generation
llm = HuggingFaceHub(repo_id=repo_id, model_kwargs={"temperature": 0.5, "max_length": 64})

# Assemble the RAG chain: retrieve context, fill the prompt, generate, parse the output
rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

query = "Tell me about Support Vector Machine"
rag_chain.invoke(query)
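The code above builds the QA chain but does not yet log anything to W&B. One simple way to close that loop, sketched below with an arbitrary placeholder project name, is to record each question and generated answer in a wandb.Table so runs can be inspected and compared in the dashboard.
import wandb

run = wandb.init(project="llama3-document-qa", job_type="inference")  # placeholder project name
qa_table = wandb.Table(columns=["question", "answer"])

# Run the chain and log the query/answer pair
answer = rag_chain.invoke(query)
qa_table.add_data(query, answer)

run.log({"qa_results": qa_table})
run.finish()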