
LLM: Chat with your Document

A starter project that uses the LangChain orchestration framework to chat with a document via an LLM.

Introduction

This is a simple project that showcases how an LLM can be used to chat with a document. I have used OpenAI's gpt-3.5-turbo as the LLM.

Description

Chatting with a document is not as straightforward as sending a question directly to an LLM. Every model has a context window, and the prompt plus any supporting context must fit inside it. gpt-3.5-turbo, for example, has a context length of about 4k tokens, so we cannot simply read the document and pass the whole text as context to the LLM.
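As a rough check (a sketch using the tiktoken package, which is not in this project's requirements), you can count a text's tokens before deciding whether it needs chunking:
import tiktoken

# tokenizer used by gpt-3.5-turbo
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
text = open("document.txt").read()  # hypothetical plain-text dump of the document
num_tokens = len(encoding.encode(text))
print(num_tokens)  # anything well above ~4k tokens will not fit in a single prompt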

Step-1 Document Loader

LangChain provides a comprehensive list of document loaders. Since I am working with a PDF, I have used PyPDFLoader as the document loader.
from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader("./Tax_Fact_2023-24.pdf")  # a file URL is also supported
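To see what the loader produces (a quick sketch, not part of the project code), call load() directly; it returns one Document per page, each with page_content and metadata:
docs = loader.load()
print(len(docs))  # number of pages in the PDF
print(docs[0].metadata)  # e.g. {'source': './Tax_Fact_2023-24.pdf', 'page': 0}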

Step-2 Document Splitting

Next, I have used RecursiveCharacterTextSplitter as the text splitter. More splitters can be found in the langchain.text_splitter package.
from langchain.text_splitter import RecursiveCharacterTextSplitter

pages = loader.load_and_split(
    text_splitter=RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
        length_function=len,
        add_start_index=True,
    )
)
chunk_size is the maximum size of each chunk (in characters here, since length_function=len), and chunk_overlap is how much consecutive chunks share, which preserves context across split boundaries.
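Here is a small standalone sketch of the overlap behavior, with toy sizes rather than the values used above:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text = "LangChain splits long text recursively: paragraphs first, then sentences, then words. " * 5
splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20)
chunks = splitter.split_text(text)
# consecutive chunks share up to ~20 characters (aligned to word boundaries),
# so a sentence cut at a boundary is still fully visible in one of the chunks
print(chunks[0][-30:])
print(chunks[1][:30])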

Step-3 Embeddings

Splitting returns a list of overlapping chunks (as Document objects) that preserve context across boundaries. Now, use OpenAIEmbeddings to turn each chunk into a vector embedding, and FAISS to index the embeddings for similarity search.
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS

faiss_index = FAISS.from_documents(pages, OpenAIEmbeddings())
Now, expose the index as a retriever, which fetches the semantically most similar chunks from the vector store.
retriever = faiss_index.as_retriever()
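Before wiring the retriever into a chain, it can be probed directly (a sketch; the query is illustrative):
docs = retriever.get_relevant_documents("budget highlights")
for doc in docs:
    print(doc.metadata["page"], doc.page_content[:80])  # top matches, most similar first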


Step-4 LLM

Next, I wire the retriever into a RetrievalQA chain, using ChatOpenAI as the LLM.
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

llm = ChatOpenAI(temperature=0)
qa_chain = RetrievalQA.from_chain_type(llm, retriever=retriever)
Finally, run qa_chain to get the response from the LLM.
answer = qa_chain.run("What are the budget highlights?")
print(answer)
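If you also want to see which chunks grounded the answer, the chain can return its source documents (a sketch using RetrievalQA's return_source_documents option; note the chain must then be called with a dict rather than run()):
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=retriever,
    return_source_documents=True,
)
result = qa_chain({"query": "What are the budget highlights?"})
print(result["result"])
for doc in result["source_documents"]:
    print(doc.metadata)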

Full Source Code

Below is the full source code to run.
First, install the requirements (pypdf and openai are also needed, since PyPDFLoader and the OpenAI classes depend on them):
wandb
langchain
python-dotenv
faiss-cpu
openai
pypdf
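The script below calls load_dotenv(), which reads credentials from a .env file next to the script. Something like the following, with placeholder values and assuming both the OpenAI and W&B keys are supplied this way:
OPENAI_API_KEY=<your-openai-api-key>
WANDB_API_KEY=<your-wandb-api-key>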
Source code:

import os
import wandb
# turn on wandb logging for langchain
os.environ["LANGCHAIN_WANDB_TRACING"] = "true"
os.environ["WANDB_PROJECT"] = "chatpdf"

from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.document_loaders import PyPDFLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS

from dotenv import load_dotenv

load_dotenv()

loader = PyPDFLoader("./Tax_Fact_2023-24.pdf")
pages = loader.load_and_split(
    text_splitter=RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
        length_function=len,
        add_start_index=True,
    )
)

faiss_index = FAISS.from_documents(pages, OpenAIEmbeddings())

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
qa_chain = RetrievalQA.from_chain_type(llm, retriever=faiss_index.as_retriever())


with wandb.init(project="chatpdf") as run:
    try:
        answer = qa_chain.run("What are the budget highlights?")
        print(answer)
    except Exception as e:
        # any errors will also be logged to Weights & Biases
        print(e)


All Tracing

The LangChain runs logged to the chatpdf project appear here as a W&B trace table in the interactive report.