Fine-Tuning vs. Retrieval-Augmented Generation: Navigating Legal Document Analysis
This article walks through fine-tuning versus RAG, when to use each approach, and a Python tutorial for legal document analysis.

Introduction
In today’s rapidly evolving digital landscape, Artificial Intelligence (AI) has become the cornerstone of innovation, transforming industries and revolutionizing how we interact with information. As AI continues to advance, organizations and individuals are faced with pivotal decisions on how to best leverage this technology for analyzing vast datasets and extracting valuable insights. Central to this decision-making process are two cutting-edge methodologies: Fine-Tuning and Retrieval-Augmented Generation (RAG). Each offers distinct advantages and challenges, but choosing the right approach can seem like navigating a labyrinth in the dark.
This blog aims to illuminate the path, providing a clear and engaging comparison between fine-tuning and RAG. Whether you are a data scientist, a business leader, or simply an AI enthusiast, understanding these methodologies is crucial in harnessing the full potential of AI.
Understanding Fine-Tuning and RAG
Fine-tuning is a specialized method in the field of artificial intelligence where pre-trained models, often developed on massive, diverse datasets, are further trained or "fine-tuned" on a smaller, specific dataset relevant to the task at hand. This approach leverages the general knowledge the model has acquired during its initial training phase, allowing it to adapt its understanding to niche areas or particular applications.
In practice, Fine-Tuning involves adjusting the weights and parameters of a pre-existing neural network to better align with the specific nuances and characteristics of the new data. This process significantly reduces the time and resources required for training, as the model does not need to learn from scratch. It's particularly beneficial in scenarios where data is scarce or highly specialized. The image below shows the overall architecture of how fine-tuning works.

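To make the idea concrete, here is a minimal sketch of fine-tuning with the Hugging Face Transformers library; the model (distilbert-base-uncased) and dataset (a small IMDB slice) are illustrative placeholders, not the legal data used later in this tutorial:
# Minimal fine-tuning sketch: adapt a pre-trained model to a small task-specific dataset.
# The model and dataset names here are illustrative placeholders.
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

train_data = load_dataset("imdb", split="train[:1000]")  # small, task-specific slice

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

train_data = train_data.map(tokenize, batched=True)

# Only the existing weights are updated; nothing is trained from scratch.
args = TrainingArguments(output_dir="finetuned-model", num_train_epochs=1, per_device_train_batch_size=8)
Trainer(model=model, args=args, train_dataset=train_data).train()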
On the other hand, RAG combines the predictive power of generative models with the vast knowledge contained in external databases or documents. RAG operates by dynamically retrieving relevant information from a large dataset during the generation process.
When a query is given to a RAG model, the model searches a large dataset or knowledge base for pieces of information relevant to that query. The retrieved information is then fed to a large, pre-trained language model, such as one based on the Transformer architecture (for example, variants of GPT or BERT).
The generative model integrates the retrieved information with its pre-existing knowledge (acquired during its initial pre-training on vast text corpora) to construct a coherent, contextually relevant response. Finally, the integrated model generates a response that not only draws from the generative model's pre-trained knowledge but also incorporates specific information retrieved in response to the query. This allows the RAG system to produce answers that are both informed by broad general knowledge and enriched with specific, relevant details pulled from the retrieval phase. The image below shows the overall architecture of how a RAG system works.

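Before the full implementation later in this post, the retrieve-then-generate loop can be illustrated with a self-contained toy example; the keyword-overlap retriever and the documents below are made up purely for illustration, and the assembled prompt stands in for what would be sent to a generative model:
# Toy retrieve-then-generate flow; the retriever and documents are illustrative only.
knowledge_base = [
    "The court held that the tax exemption applied only to charitable trusts.",
    "The appellant challenged the constitutionality of Section 10 of the Act.",
    "The judgment was delivered by a five-judge constitutional bench.",
]

def retrieve(query, documents, k=2):
    # Rank documents by naive word overlap with the query (a real system would use embeddings).
    query_words = set(query.lower().split())
    return sorted(documents, key=lambda d: len(query_words & set(d.lower().split())), reverse=True)[:k]

def build_prompt(query):
    context = "\n".join(retrieve(query, knowledge_base))
    # In a real RAG system this prompt would be passed to an LLM for generation.
    return f"Answer using the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

print(build_prompt("What did the court say about the tax exemption?"))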
While both methodologies aim to enhance the capabilities of AI models, they do so in distinct ways. Fine-tuning adapts and refines existing knowledge to new contexts, making it a go-to for tasks requiring high precision and domain specificity. RAG, conversely, extends the model’s reach, bringing in external information to provide more nuanced and comprehensive outputs.
When to Use Fine-Tuning vs. RAG
Deciding between Fine-Tuning and RAG depends on various factors including the specific task, data availability, and desired outcomes.
Fine-tuning is ideal for tasks requiring deep domain expertise. If you have a substantial amount of specialized data, fine-tuning can adapt a pre-trained model to the nuances of your field while being far more resource-efficient than training a model from scratch. Even when the specialized data is limited, fine-tuning is advantageous because it builds on the patterns the pre-trained model has already learned, reducing the need for the large labeled datasets that can be difficult to procure in certain domains.
Example use cases include analyzing market data for financial investments, customer-support chatbots that respond in a custom tone, and custom language translation.
RAG, on the other hand, suits applications requiring a wide range of information, such as question-answering systems where the model needs to pull relevant information from vast datasets to answer queries accurately. If an application deals with constantly changing information or topics, RAG can adapt by retrieving the most recent data for each query, keeping the model's outputs up-to-date and relevant. Additionally, if a project's scope is expected to expand over time, RAG's ability to consult external databases lets it scale more easily than fine-tuning, which may require retraining as the domain of interest grows.
Example use cases include retrieving information from existing documents, virtual assistants, and conversational agents.

Source: Author
The decision between Fine-Tuning and RAG should be based on a thorough assessment of your project’s goals, the nature of the data, and the specific requirements of the task at hand. Fine-tuning is generally preferred for specialized, high-precision tasks with available domain-specific data, while RAG shines in scenarios requiring broad, dynamic knowledge.
Now let’s walk through a Python tutorial in which we first fine-tune GPT on legal data and compare the results against the pre-trained model. We will also implement RAG for querying the legal documents to retrieve relevant information.
Fine-Tuning GPT on Legal Data
For this case study, we will prompt GPT-3.5 to generate summaries of legal judgments. The dataset, Indian Supreme Court judgments and their summaries taken from Hugging Face, contains the judgments of legal cases from India along with corresponding summaries. We first test the results with the pre-trained GPT-3.5 model and then compare it against a fine-tuned model to analyze performance.
As part of the coding tutorial, we'll also set up logging with Weights & Biases (W&B), an intuitive platform for monitoring model performance in real time that lets developers and data scientists visualize results, compare runs, and iterate more effectively. The steps below walk through configuring W&B so you have a robust toolkit for managing both the fine-tuning and RAG pipelines.
1. Installing the Necessary Packages
!pip install openai
!pip install wandb
!wandb login
!pip install datasets
2. Importing the Libraries
import pandas as pd
import os
import json
import openai
import wandb
from datasets import load_dataset
3. Reading the Dataset and Setting up WandB
We load the dataset, convert it from a dictionary into a pandas DataFrame, and start a new wandb run to log the model’s predictions.
dataset_name = "rishiai/indian-court-judgements-and-its-summaries"
dataset = load_dataset(dataset_name)
df = dataset['train'].to_pandas()
wandb.init(project='finetuning_vs_RAG_legal', name='LMM')
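As an optional sanity check, you can confirm that the DataFrame loaded correctly and contains the Judgment and Summary columns used in the rest of the tutorial:
print(df.shape)
print(df.columns.tolist())  # expect 'Judgment' and 'Summary' among the columns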
4. Setting up the OpenAI API Key
from IPython.display import Markdown, display

os.environ["OPENAI_API_KEY"] = 'add your openai API key'
openai.api_key = os.environ["OPENAI_API_KEY"]
5. Evaluating performance using the Pre-Trained Model
We first use the pre-trained gpt-3.5-turbo model to generate legal summaries. For this, we define a function score_review with a system prompt instructing the model to act as a legal assistant that summarizes legal judgments, and apply it to the first 5 judgments.
def score_review(review):
    # Ask the pre-trained model to summarize a single judgment
    response = openai.chat.completions.create(
        messages=[
            {"role": "system", "content": "You are a helpful legal assistant who summarizes legal judgements"},
            {"role": "user", "content": review},
        ],
        model="gpt-3.5-turbo",
        temperature=0,
        max_tokens=160,
    )
    return response.choices[0].message.content

# Generate pre-trained summaries for the first 5 judgments
df['pretrained_summary'] = df.iloc[:5].apply(lambda row: score_review(row['Judgment']), axis=1)
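To eyeball one of the generated summaries directly in the notebook, you can render it with the Markdown helper imported earlier (a quick spot check, not part of the formal comparison):
display(Markdown(df['pretrained_summary'].iloc[0]))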
6. Converting Data to JSON for Fine-Tuning
To fine-tune the model on a given dataset, we first need to format the data so that each row is a JSON object in the chat format expected by the fine-tuning API, with one object per line (JSONL).
output_file_path = 'output_data.jsonl'

# Write each (judgment, summary) pair as one chat-formatted JSON object per line.
# df_downsampled is assumed to be a downsampled subset of df prepared for fine-tuning.
with open(output_file_path, 'w') as file:
    for index, row in df_downsampled.iterrows():
        json_object = {
            "messages": [
                {"role": "system", "content": "You are a helpful legal assistant who summarizes the legal judgements."},
                {"role": "user", "content": row['Judgment']},
                {"role": "assistant", "content": str(row['Summary'])}
            ]
        }
        file.write(json.dumps(json_object) + '\n')
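Each line of the resulting output_data.jsonl is a single chat-formatted training example; with placeholder text standing in for the real judgment and summary, a line looks roughly like this:
{"messages": [{"role": "system", "content": "You are a helpful legal assistant who summarizes the legal judgements."}, {"role": "user", "content": "<full judgment text>"}, {"role": "assistant", "content": "<reference summary>"}]}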
7. Fine-Tuning the Model
After formatting the data, we upload it for fine-tuning using the Files API. Once the upload succeeds, the generated training file ID is used to initiate the fine-tuning job.
from openai import OpenAI

client = OpenAI()

# Upload the training file, then start a fine-tuning job with the returned file ID
client.files.create(
    file=open("output_data.jsonl", "rb"),
    purpose="fine-tune"
)
client.fine_tuning.jobs.create(
    training_file="file-TwlxA3eiSdp1Xkt19rZUD6IA",
    model="gpt-3.5-turbo-1106"
)
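Fine-tuning jobs run asynchronously, so it helps to poll the job status before calling the resulting model. If you capture the return value of jobs.create (the variable name job below is our own), you can check progress like this:
job = client.fine_tuning.jobs.create(
    training_file="file-TwlxA3eiSdp1Xkt19rZUD6IA",
    model="gpt-3.5-turbo-1106"
)
print(client.fine_tuning.jobs.retrieve(job.id).status)  # e.g. "running", then "succeeded"
print(client.fine_tuning.jobs.list(limit=5))            # recent jobs, including fine-tuned model names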
8. Evaluating the Performance Using Our Fine-Tuned Model
We use the fine-tuned model to generate summaries for the same 5 samples used with the pre-trained model.
results = []

def score_review(review, original_score, old_score):
    # Summarize the judgment with the fine-tuned model
    response = openai.chat.completions.create(
        messages=[
            {"role": "system", "content": "You are a helpful legal assistant who summarizes the legal judgements."},
            {"role": "user", "content": review},
        ],
        model="ft:gpt-3.5-turbo-1106:personal::97UfFebI",
        temperature=0,
        max_tokens=160,
    )
    # Collect the judgment, the reference summary, and both model outputs for logging
    results.append({
        "Judgment": review,
        "Summary": original_score,
        "pretrained_summary": old_score,
        "finetuned_summary": response.choices[0].message.content,
    })
    return response.choices[0].message.content

df['finetuned_summary'] = df.iloc[:5].apply(
    lambda row: score_review(row['Judgment'], row['Summary'], row['pretrained_summary']),
    axis=1
)
9. Logging the Output to WandB
The resulting summaries and their corresponding judgments, along with the original reference summaries, are logged to wandb for easy comparison of the results.
df_results = pd.DataFrame(results)
wandb.log({"results_table": wandb.Table(dataframe=df_results)})
10. Evaluating the Results
The table below shows the examples logged to the wandb table, along with the summaries generated by the pre-trained and fine-tuned models. The fine-tuned model matches the level of detail and specificity of the original summaries, whereas the pre-trained model simply condenses the key points of the judgment. This confirms that fine-tuning is beneficial when the tone and format of the reference outputs need to be preserved.

Source: Author
Implementing RAG for Legal Document Retrieval
Now we will walk through a RAG implementation for asking questions about a judgment, using the same dataset as above. Steps 1 through 4 of the fine-tuning tutorial (installation, imports, loading the dataset, and setting the API key) carry over unchanged.
1. Installing the Necessary Packages
!pip install langchain
!pip install python-dotenv
!pip install chromadb
!pip install tiktoken
!pip install langchain_openai
!pip install langchainhub
2. Importing the Libraries
import dotenv
dotenv.load_dotenv()

from langchain_community.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
import chromadb
from langchain_openai import ChatOpenAI
from langchain import hub
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain.text_splitter import RecursiveCharacterTextSplitter
3. Accessing and Chunking the Data
For demonstration purposes we work with only the first judgment, which is split into chunks both to respect GPT's context-length limit and to retrieve relevant passages efficiently at query time. The judgment is divided into 24 chunks of up to 1,000 characters each, with a 200-character overlap between consecutive chunks to preserve contextual information.
docs = df['Judgment'].iloc[0]
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200, add_start_index=True)
all_splits = text_splitter.create_documents([docs])
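It is worth verifying the split before building the vector store; a quick check of the number of chunks and their metadata (which includes the start index we requested) looks like this:
print(len(all_splits))                   # number of chunks produced from the judgment
print(all_splits[0].page_content[:200])  # preview of the first chunk
print(all_splits[0].metadata)            # contains 'start_index' because add_start_index=True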
4. Creating a Vector Store
We create a vector store using Chroma, which embeds the chunks, stores the embeddings, and lets us retrieve them later.
vectorstore = Chroma.from_documents(documents=all_splits, embedding=OpenAIEmbeddings())
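By default this typically keeps the embeddings only for the current session; if you want them to persist across sessions, Chroma also accepts a persist_directory argument (the path below is just an example):
vectorstore = Chroma.from_documents(
    documents=all_splits,
    embedding=OpenAIEmbeddings(),
    persist_directory="./chroma_legal_db"  # example path; point to it again later to reload the store
)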
5. Querying data from the Vector Store
Based on a similarity metric, we retrieve the chunks from the vector store that are most similar to the user’s query.
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 6})
retrieved_docs = retriever.invoke("What did the learned Attorney General say about taxes?")
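Before wiring the retriever into a chain, you can inspect what it returns for the query; each retrieved document is one of the chunks created earlier:
print(len(retrieved_docs))  # up to k=6 chunks
for doc in retrieved_docs:
    print(doc.page_content[:120], "...")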
6. Formatting RAG’s Response
We load the model and a RAG prompt template from the LangChain hub, then compose them into a rag_chain that answers the question using the retrieved chunks as context.
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
prompt = hub.pull("rlm/rag-prompt")

def format_docs(docs):
    # Join the retrieved chunks into a single context string for the prompt
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
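If you do not need token-by-token streaming, the same chain can be called in one shot with invoke; the streaming variant used in the next step is equivalent but yields the response incrementally:
answer = rag_chain.invoke("What did the learned Attorney General say about taxes?")
print(answer)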
7. Evaluating and Logging the Results to WandB
rag_chunk = []
# Stream the chain's answer chunk by chunk and join the pieces into one string
for chunk in rag_chain.stream("What did the learned Attorney General say about taxes?"):
    rag_chunk.append(chunk)
sentence = ''.join(rag_chunk)

rag_results = []
rag_results.append({
    "Query": "What did the learned Attorney General say about taxes?",
    "Response": sentence
})

df_results = pd.DataFrame(rag_results)
wandb.log({"results_table": wandb.Table(dataframe=df_results)})
The result, logged to wandb, shows the model’s response after evaluating the judgment. The response accurately identifies the Attorney General’s position on taxes, letting us answer questions without reading the entire judgment manually. RAG is therefore a strong choice for querying large documents efficiently and accurately.

Source: Author
Best Practices and Considerations
Although both fine-tuning and RAG make it possible to work with large collections of documents quickly, a few considerations should be kept in mind for efficient and accurate results.
Firstly, the success of fine-tuning depends heavily on the quality and relevance of the data used. High-quality, well-labeled datasets that closely align with your task's requirements can dramatically improve the model's performance. Fine-tuning large models can also be resource-intensive, and with OpenAI's GPT models, querying a fine-tuned model costs more than querying the base pre-trained model. Plan accordingly, considering both the computational resources required and the time it will take to fine-tune the model to your specific needs.
For RAG, effectiveness is heavily influenced by the quality and reliability of the external data sources it retrieves from. Ensure that the knowledge base is up-to-date, accurate, and covers a broad spectrum of information relevant to your tasks. The retrieval process should be optimized, for example with indexing, to speed up searches, and strategies should be in place to handle retrieved information that is outdated, incorrect, or biased.
Conclusion
In conclusion, parsing and analyzing vast datasets presents a unique set of challenges, but with the introduction of LLMs the opportunities are greater than ever. This blog has delved into two pivotal methodologies for such cases: fine-tuning and RAG. Each offers distinct advantages tailored to different needs and scenarios, depending on a project's requirements, data availability, and desired outcomes. While fine-tuning may be the go-to for projects with rich domain-specific data, RAG stands out in scenarios where integrating diverse, up-to-date information can significantly enhance the model's output.
The article also provided a step-by-step guide to implementing both methodologies in Python, focusing on legal data: summarizing judgments while preserving their tone and content, and querying judgments with RAG to retrieve relevant information. These methodologies can help individuals as well as large-scale organizations navigate large volumes of data efficiently and effectively.