
Building a RAG pipeline using LlamaIndex and Claude 3

This article digs into Claude 3, comparing its capabilities with those of GPT. It guides the reader through constructing a complete RAG pipeline, offering insights into whether Claude 3 represents the next generation of LLMs or whether GPT continues to be the preferred choice.

In this piece, we'll guide you through the process of leveraging recent advancements in LLMs. We'll focus specifically on how to integrate LLMs for retrieval-augmented generation (RAG) using LlamaIndex (a framework for connecting data sources to LLMs) and Claude 3 (an LLM capable of analyzing images alongside text) to generate insightful, contextually relevant responses.
We'll also explore how Claude 3 compares to the famous GPT-4 model and whether it's worth the hype.

Understanding Claude 3

Claude is a family of LLMs developed by Anthropic that is claimed to offer near-human capabilities, with the ability to reason over both text and images.
The Claude 3 family tree includes three models: Claude 3 Haiku, Claude 3 Sonnet, and Claude 3 Opus. Each successive model offers increasingly powerful performance, allowing users to select the optimal balance of intelligence, speed, and cost for their specific application.
When comparing the three models, Opus is the most capable, offering better accuracy for math problem-solving and creative content generation. Sonnet comes next, providing a combination of skill and speed at a lower cost than Opus, making it best suited for code generation and summarization. Haiku is the fastest and least expensive model, which can be great for use cases like live customer chat.
The figure below shows the cost difference between each of the Claude 3 models:

Source: Anthropic
What makes Claude 3 special is its large context window, which determines how much text you can feed into the model. Claude 3 Opus offers a 200,000-token context window while consistently achieving over 99% recall. This allows it to analyze lengthy documents, such as legal summaries and books.
The graph below shows the recall for each model as the context length increases. It is noteworthy that, across all versions of Claude, recall remains above 90% even at longer context lengths, which contributes to its standout performance.
Source: Anthropic

Claude 3 vs. GPT-4

In terms of performance, Claude 3 beats GPT-4 when tested on reasoning, summarization, providing factual information, math problem-solving, and generating code, while GPT-4 still outperforms it in creative writing tasks and drafting emails.
Both models possess the ability to understand and analyze images. However, unlike GPT-4, Claude 3 can only analyze and not generate images.
As noted above, Claude 3 offers a large context window of 200k tokens as opposed to GPT-4’s 8,192. This simplifies workflows that previously required splitting inputs to fit within shorter context windows. Large inputs can now be sent as a single prompt, allowing the model to better understand the context and ultimately produce better results.
Claude 3 is available at a lower cost compared to GPT-4. The table below summarizes the context windows and token costs of both LLMs for easier comparison.



What is a RAG pipeline?

RAG is the process of improving an LLM's response by incorporating data outside of what it has already been trained on. Here's how it works:

Source: Author

1. Creation of embeddings

First, embeddings for the documents in the knowledge base are created using a model. These embeddings are high-dimensional vectors that represent the semantic meaning of the documents.

2. Query embedding

When a query or prompt is received, the model generates an embedding for the query using the same method.

3. Similarity matching

The system then compares the query embedding with the document embeddings in the knowledge base to find the most relevant documents. This is often done using cosine similarity or other distance metrics, which measure how close or related the query is to each document in the vector space.

4. Retrieval of documents

Documents with the highest similarity scores are retrieved as they are considered most relevant to the query.

5. Response generation

The retrieved documents are then input into an LLM, in combination with the original query, to generate a synthesized and contextually relevant response.
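To make steps 1 through 4 concrete, here is a minimal sketch of embedding and similarity matching using the sentence-transformers library, with placeholder documents and a placeholder query; in the pipeline we build below, this retrieval step is handled for us by ChromaDB.

from sentence_transformers import SentenceTransformer, util

# 1. Create embeddings for a toy knowledge base (placeholder documents)
model = SentenceTransformer("all-MiniLM-L6-v2")
documents = [
    "Claude 3 Opus offers a 200,000-token context window.",
    "RAG retrieves relevant documents before generating a response.",
]
doc_embeddings = model.encode(documents)

# 2. Embed the incoming query with the same model
query = "How large is the Claude 3 context window?"
query_embedding = model.encode(query)

# 3. Compare the query embedding against every document embedding
scores = util.cos_sim(query_embedding, doc_embeddings)[0]

# 4. Retrieve the highest-scoring document to use as context for the LLM
best_document = documents[int(scores.argmax())]
print(best_document)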

W&B Weave

In this project, we'll utilize W&B Weave to log and track the performance of our RAG pipeline. Weave allows us to automatically capture the inputs, outputs, and various metrics of our functions, facilitating detailed analysis and comparison. By decorating our functions with @weave.op(), we can seamlessly log data and visualize the performance of our retrieval and generative models.
When building and evaluating a RAG system, W&B Weave helps visualize the effectiveness of the retrieval component in fetching relevant information or assessing the accuracy of the generative model's responses. Multiple components of a RAG pipeline can be tuned (such as the embedding function) alongside the LLM you're testing (such as GPT or Claude), and those results can be logged to W&B Weave for comparison to determine the best-performing one.
For instance, in our Claude 3 RAG pipeline, we use Weave to log queries, contexts, and responses automatically. This allows us to track how well Claude 3 retrieves and synthesizes information from our research paper dataset, making it easier to identify areas for improvement and validate the overall performance of our pipeline.
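As a minimal illustration (using a hypothetical toy function and project name), tracking a function with Weave only requires initializing a project and adding the decorator; the actual usage for our pipeline appears in the steps below.

import weave

weave.init("weave-demo")  # hypothetical project name

@weave.op()
def greet(name: str) -> str:
    # The inputs and return value of each call are logged automatically
    return f"Hello, {name}!"

greet("Claude")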

Building our Claude 3 RAG pipeline

Now we'll build a complete RAG pipeline using Claude 3 to help analyze and extract information from lengthy research papers, instead of manually reading through them to locate relevant information.
We'll use ChromaDB to store our embeddings, which will be created with the SentenceTransformer embedding function, and then derive answers using Claude 3 Opus.
Finally, the results will be compared against those of GPT-4 to help us choose the right LLM for our RAG application.

1. Installing the necessary packages

!pip install langchain
!pip install langchain-community
!pip install chromadb
!pip install sentence-transformers
!pip install openai
!pip install pypdf
!pip install anthropic
!pip install weave

2. Importing the libraries

import openai
import chromadb
from langchain.text_splitter import RecursiveCharacterTextSplitter, SentenceTransformersTokenTextSplitter
from langchain_community.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction
import pandas as pd
from pypdf import PdfReader
import anthropic
import weave

3. Setting up Claude API key

client = anthropic.Anthropic(
    api_key="YOUR API KEY HERE",
)

4. Setting up W&B Weave

Here we initialize our Weave project so that any function decorated with @weave.op() is automatically tracked:
weave.init('RAG_Claude_Research')

5. Loading our data

We're working with a research paper from CVPR on crop segmentation using remote sensing imagery in Africa. The PDF is loaded and the text of each page is extracted.
reader = PdfReader("/content/19_CVPR_Semantic Segmentation of Crop Type in Africa.pdf")
pdf_texts = [p.extract_text().strip() for p in reader.pages]
combined_text = ' '.join(pdf_texts)

6. Chunking the data

The loaded data is divided into chunks so that the context within the text is handled efficiently when the corresponding embeddings are created.
A very large chunk may lead to hallucinations, while a very small one might not capture enough context. We'll use a chunk size of 1,000 characters, with each chunk overlapping the previous one by 200 characters. After creating the chunks, a sentence-transformer splitter is used to break these smaller units into token chunks that are optimal for sentence embedding.
character_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
character_split_texts = character_splitter.split_text(combined_text)


token_splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=100, tokens_per_chunk=256)


token_split_texts = []
for text in character_split_texts:
    token_split_texts += token_splitter.split_text(text)

7. Creating embeddings and Chroma collection/vector store

The chunks are then converted into embeddings and stored in a Chroma collection, which lets us create, store, and retrieve our embeddings.
embedding_function = SentenceTransformerEmbeddingFunction()


chroma_client = chromadb.Client()
chroma_collection = chroma_client.create_collection(name="research_paper", embedding_function=embedding_function)


ids = [str(i) for i in range(len(token_split_texts))]
chroma_collection.add(ids=ids, documents=token_split_texts)

8. Querying data from the vector store

The embeddings are now ready for use, and we set up a function that handles an incoming query by retrieving the most relevant chunks from our vector store. These retrieved chunks are then passed to Claude 3 as context to get the final response, which also reduces the number of tokens we need to send to the model.
def llm_response(query, model):
    # Retrieve the ten most relevant chunks and pass them to the LLM as context
    results = chroma_collection.query(query_texts=[query], n_results=10)
    retrieved_documents = results['documents'][0]
    relevant_passage = " ".join(retrieved_documents)
    return make_rag_prompt(model, relevant_passage, query)

9. Prompting for RAG’s response

We specify the Claude 3 model, set the system prompt, and send the retrieved context along with the query to generate the response.
model = 'claude-3-opus-20240229'

@weave.op()
def make_rag_prompt(model, context, query):
    response = client.messages.create(
        system="You are a helpful research assistant. You will be shown data from a research paper and you have to answer questions about the paper.",
        messages=[
            {"role": "user", "content": "Context: " + context + "\n\n Query: " + query},
        ],
        model=model,  # Choose the model you wish to use
        temperature=0,
        max_tokens=160
    )
    return response.content[0].text

10. Evaluating and logging the results to W&B Weave

A user query is now passed to the Claude 3 model and the returned results are logged into Weights & Biases for tracking, recording, and analysis using Weave. We have shown responses for a single query below and the same can be repeated for others. Since we added the @weave.op() decorator for our make_rag_prompt function, the query, context, and responses are automatically logged to Weights & Biases.
rag_results_claude = []
query = "Which crops are mapped in the paper?"
rag_results_claude.append({
    "Query": query,
    "Response": llm_response(query, model)
})
print(rag_results_claude)
The responses to our queries using Claude 3 are shown below in Weave, which automatically logs the inputs and outputs of the functions it's applied to. The model accurately identifies the relevant information from the research paper and provides an in-depth response, allowing us to efficiently review research papers without having to skim them for important information.




Using GPT-4

Here are the results for the GPT-4 run in Weave:
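For reference, the GPT-4 run reuses the same retrieval code and simply swaps the generation step for OpenAI's chat completions API. The sketch below shows what that function could look like; the model name, system prompt, and parameters mirror the Claude setup and are assumptions rather than the exact code behind the run shown above.

from openai import OpenAI
import weave

openai_client = OpenAI(api_key="YOUR API KEY HERE")

@weave.op()
def make_rag_prompt_gpt(model, context, query):
    # Same system prompt and retrieved context as the Claude 3 version
    response = openai_client.chat.completions.create(
        model=model,  # e.g. "gpt-4"
        temperature=0,
        max_tokens=160,
        messages=[
            {"role": "system", "content": "You are a helpful research assistant. You will be shown data from a research paper and you have to answer questions about the paper."},
            {"role": "user", "content": "Context: " + context + "\n\n Query: " + query},
        ],
    )
    return response.choices[0].message.content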


Is Claude 3 the next big thing?

Claude 3 and GPT-4 both work well when utilized as part of a RAG pipeline to extract and analyze information, as demonstrated in the results above.
The major difference is depth. Claude 3, with the much larger context window it provides, simply goes deeper. That makes Claude 3 a great choice for scenarios involving lengthy documents that require in-depth analysis, especially where chunking may lead to a loss of context.
It must be noted, however, that the cost of output tokens for Claude 3 Opus is higher than that of GPT-4. On balance, Claude 3 is most suitable for enterprises looking to securely leverage LLMs to work with large documents.

Future versions of Claude are expected to handle additional modalities, including video and audio analysis and interpretation alongside images.

Iterate on AI agents and models faster. Try Weights & Biases today.