
Crafting Superior Summaries: The ChatGPT Fine-Tuning Guide

This article details the fine-tuning of ChatGPT for dialogue summarization, showcasing marked improvements and using Weights & Biases for performance tracking and optimization.
Created on October 15|Last edited on November 22

Introduction

Drowning in a sea of endless information? You're not alone. One communication expert estimates that the average knowledge worker must process, consciously or subconsciously, the equivalent of 174 newspapers of information every day. So, how do we make sense of it all? One word: Summarization.
In this article, we're diving into the world of ChatGPT, tweaking it a bit to make it a pro at summarizing dialogues. While reading, newcomers to machine learning will gain valuable theoretical insights and practical implementations, while seasoned experts will benefit from an in-depth understanding of the comprehensive ChatGPT fine-tuning process.
And while we're at it, we'll also see how tools like Weights & Biases help keep an eye on how our model is doing. With that said, let's dive in!

Understanding ChatGPT and Summarization

ChatGPT Architecture

The magic behind ChatGPT is the GPT (Generative Pre-trained Transformer) architecture, which, as the name implies, can be broken down into three main parts.
  1. Generative: This means the model can generate text. Give it a prompt, and it'll continue the text in a way that's contextually relevant.
  2. Pre-trained: Before you even interact with it, this model has already been trained on vast amounts of text from the internet. So, it comes with a lot of general knowledge right out of the box.
  3. Transformer: This is the actual neural network architecture it uses. Without diving too deep into tech jargon, think of it as a super-smart design that lets the model consider multiple parts of a sentence at once to make sense of what's being said.
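To make the "consider multiple parts of a sentence at once" idea a bit more concrete, here is a minimal, illustrative sketch of scaled dot-product attention, the mechanism at the heart of the Transformer. This toy NumPy example is our own addition for intuition, not code from the tutorial itself:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: each position mixes information
    from every other position, weighted by similarity."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # pairwise similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
    return weights @ V                               # weighted mix of values

# Three toy token vectors "attending" to each other.
x = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
out = attention(x, x, x)
print(out.shape)  # (3, 2): each token's output blends all three inputs
```

Each output row is a weighted average of all input rows, which is exactly how the model "looks at" the whole sentence when interpreting any one word.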

The Importance of Text Summarization

We live in an age of information overload. Every day, countless articles, papers, reports, and other forms of content are published. Reading everything isn't just impractical; it's impossible. Text summarization provides a way to quickly understand the gist of vast amounts of text without going through every single word.
Imagine having to skim through a 50-page report to know its main points. It's time-consuming, right? Automated summaries can condense that content into a few paragraphs or even sentences, saving a lot of time.

Why Do We Need To Fine-Tune ChatGPT for Text Summarization?

While ChatGPT comes pre-trained on a massive amount of text and can handle a variety of tasks out-of-the-box, it's a bit like a jack-of-all-trades. Text summarization, on the other hand, is a nuanced and specific task. Fine-tuning focuses the model's abilities on that specific task, making it better at generating concise and coherent summaries.
Without fine-tuning, ChatGPT might provide summaries that are more verbose, miss the key points, or capture irrelevant details. By fine-tuning on a dataset of well-crafted summaries, we're essentially teaching the model the art of distilling information, ensuring the summaries are of high quality and relevance.
Let's walk through how to tailor ChatGPT specifically for dialogue summarization. The model will be fed with conversations involving two or more individuals. The objective is to have the model churn out concise and focused summaries of these chats, ensuring no crucial details are left out.

An Overview of Weights and Biases

Weights & Biases, commonly referred to as W&B, is a pivotal tool for machine learning experimentation. It serves as a digital laboratory journal for machine learning researchers and practitioners, enabling them to meticulously log their experiments, outcomes, models, and more.
Catering to machine learning professionals around the globe, W&B offers tools that allow for comprehensive tracking, supervision, and visual representation of every model detail. By harnessing these features, ML specialists are empowered to rapidly attain their model's peak performance, guaranteeing the best results in the shortest time frame.
Throughout this article's hands-on section, we will harness the power of W&B to keenly observe our model's efficacy, both before and after the fine-tuning phase.
When it comes to text summarization, the gold standard for performance evaluation is a hands-on review. We'll store the model's output from each session in W&B.
Subsequently, we'll undertake a side-by-side comparison of the model's responses, before and after adjustments, ensuring it aligns increasingly with our desired outcome.

Data Preparation and Annotation

The Significance of High-Quality Training Data for Summarization

When it comes to summarization and data quality, there are two main points, each with a corresponding phrase, that we should focus on.
First, we have the phrase “Good Data = Good Summaries”. Imagine having a student that you want to teach how to summarize a given book. If you give them a bunch of poorly written book summaries as examples, they're probably gonna get the wrong idea about what a good summary looks like. The same goes for machines! If our training data is of top-notch quality, our machine (or model) will produce summaries that are on point.
The other phrase which is more commonly known in the machine learning world is “Garbage In, Garbage Out”. It’s like trying to cook a great meal with bad ingredients. No matter how good of a chef you are, if your ingredients are rotten, the meal won’t taste good. Similarly, if we feed our AI model with low-quality training data, it’s going to spit out low-quality summaries.

Annotating a Summarization Dataset for Fine-Tuning

ChatGPT requires a specific format for the data it is fine-tuned on. This format includes three sections:
  • System: This is the prompt that you will pass to ChatGPT. In our case, the prompt would be “GPT is a great and to-the-point dialogue summarization tool.”
  • User: This is the question asked to the model. In our case, it would be the text that we are required to summarize.
  • Assistant: This is the answer that our model would return. In this case, it would be a brief summary of the text.

Example of the Required JSONL Data Format

{"messages": [{"role": "system", "content": "GPT is a great and to-the-point dialogue summarization tool."}, {"role": "user", "content": "#Person1#: hey, you look great! how's everything?\n#Person2#: yeah, you know what? I've been going to the club regularly. The training really pays off. Now I am in a good shape and I know more about how to keep fit.\n#Person1#: really? tell me about it. I haven't gone to the club for a long time. I am too busy with work.\n#Person2#: it's important to do proper exercises.\n#Person1#: you're right. Too much or too little won't do any good.\n#Person2#: the trainer tells me, besides regular sports activities, I should also have a healthy and balanced diet.\n#Person1#: sounds reasonable.\n#Person2#: we should eat more vegetables instead of junk food to stay energetic.\n#Person1#: and fruits!\n#Person2#: surely it is. Getting enough sleep is also crucial for fitness.\n#Person1#: I've heard that. Does your trainer tell you anything about keeping fit?\n#Person2#: yeah, he advises me to stay in a good mood. That can help one to keep sound physical health.\n#Person1#: I think if you follow your trainer's advice, you'll be on the right track.\n#Person2#: you bet it!"}, {"role": "assistant", "content": "#Person2# looks great because #Person2#'s been to the training club regularly. #Person2# tells #Person1# that having a healthy and balanced diet, getting enough sleep, and staying in a good mood help keep physical health."}]}
Regardless of whether the initial dataset is in CSV or JSON format, the final data must be presented as shown above, with the three separate sections. To dive deeper into the dataset preparation process, check the following OpenAI Documentation.
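Before uploading, it's worth sanity-checking that every line of your JSONL file parses and carries the three roles in order. Here is a small hypothetical helper (the function name and the strict role ordering are our own assumptions, not an OpenAI requirement stated above):

```python
import json

REQUIRED_ROLES = ["system", "user", "assistant"]

def validate_jsonl_line(line):
    """Return True if a JSONL line matches the chat fine-tuning format."""
    record = json.loads(line)
    messages = record.get("messages", [])
    roles = [m.get("role") for m in messages]
    return roles == REQUIRED_ROLES and all("content" in m for m in messages)

example = json.dumps({"messages": [
    {"role": "system", "content": "GPT is a great and to-the-point dialogue summarization tool."},
    {"role": "user", "content": "#Person1#: hi!\n#Person2#: hello!"},
    {"role": "assistant", "content": "#Person1# greets #Person2#."},
]})
print(validate_jsonl_line(example))  # True
```

Running this over every line of the file before upload catches malformed rows early, before they cause a fine-tuning job to fail.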

Fine-Tuning Step-by-Step Tutorial

Step 1: Install Necessary Libraries

!pip install openai
!pip install wandb
!pip install git+https://github.com/wandb/wandb.git@openai-finetuning

import os
import openai
import wandb
import pandas as pd
import json
from openai import OpenAI
from wandb.integration.openai import WandbLogger

Step 2: Set Up the OpenAI API Key

openai.api_key = "Insert your own personal OpenAI key here"

client = openai.Client()

Step 3: Initialize the WandbLogger Function

The WandbLogger() function for OpenAI, as part of the Weights & Biases (W&B) toolkit, is designed to facilitate the fine-tuning of OpenAI models, including ChatGPT. It allows you to track and monitor training, visualize performance, and compare experiments.
To know more about the function and to check the multitude of additional parameters that can be passed to this function click here.
WandbLogger.sync(project="chatgpt3.5-fine-tuning-for-Summarization")

Step 4: Load and Sample the Dataset

# Read the CSV data
df = pd.read_csv("Insert path to summarization dataset")

# Sample 100 rows
df = df.sample(100)

Step 5: Modify the Dataset Format To Adhere to That of OpenAI

output_filename = "Insert the new path to save the newly modified JSONL data set"

# Create a new JSONL file
# Create a new JSONL file
with open(output_filename, 'w') as jsonl_file:
    for _, row in df.iterrows():
        # Create the desired format for each row in the CSV
        data = {
            "messages": [
                {
                    "role": "system",
                    "content": "GPT is a great and to-the-point dialogue summarization tool."
                },
                {
                    "role": "user",
                    "content": row['dialogue']
                },
                {
                    "role": "assistant",
                    "content": row['summary']
                }
            ]
        }
        # Write the data to the JSONL file
        jsonl_file.write(json.dumps(data) + '\n')

Step 6: Upload the Created File to OpenAI

training_file = client.files.create(
    file=open(output_filename, "rb"),
    purpose='fine-tune'
)

training_file_id = training_file.id

Step 7: Fine-Tune the Model

fine_tuning_job = client.fine_tuning.jobs.create(
    training_file=training_file_id,
    model="gpt-3.5-turbo"
)

job_id = fine_tuning_job.id
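The fine-tuning job runs asynchronously, so before evaluating the fine-tuned model in Step 11 you need to wait for the job to finish and retrieve the resulting model ID. A small helper sketch for this (the function name and polling interval are our own choices, not part of the original tutorial) could look like:

```python
import time

def wait_for_fine_tune(client, job_id, poll_seconds=30):
    """Poll the job until it finishes, then return the fine-tuned model name."""
    while True:
        job = client.fine_tuning.jobs.retrieve(job_id)
        if job.status == "succeeded":
            return job.fine_tuned_model
        if job.status in ("failed", "cancelled"):
            raise RuntimeError(f"Fine-tuning ended with status: {job.status}")
        time.sleep(poll_seconds)
```

The returned name (something like "ft:gpt-3.5-turbo:...") is what you pass as the model ID when evaluating the fine-tuned model.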

Step 8: Define an Evaluation Function

First, we'll start a new W&B run for each model we evaluate. Next, we'll evaluate our dataset using the specified format: we provide the dialogue to the model, specify that we want a summary, and then await its response.
def evaluate_model(model_name, model_id):
    # Initialize Weights & Biases
    run = wandb.init(project="text-summarization-with-openai", reinit=True)

    correct_predictions = 0
    loop_count = 0
    results = []

    # Iterate over each row in the DataFrame
    for index, row in df.iterrows():
        loop_count += 1
        dialogue_text = row['dialogue']

        try:
            completion = client.chat.completions.create(
                model=model_id,
                messages=[
                    {"role": "system", "content": "GPT is a great and to-the-point dialogue summarization tool."},
                    {"role": "user", "content": dialogue_text},
                ]
            )
            response = completion.choices[0].message.content
            results.append({
                "dialogue": dialogue_text,
                "actual_summary": row['summary'],
                "predicted_summary": response
            })

            if compare_summaries(response, row['summary']):
                correct_predictions += 1

            print(f"[{model_name}] Processed {loop_count}/{len(df)} rows.")

        except Exception as e:
            print(f"Error on index {index} with {model_name}: {e}")
            continue

    accuracy = (correct_predictions / len(df)) * 100
    wandb.log({f"{model_name} Accuracy": accuracy})

    # Convert results list to DataFrame
    df_results = pd.DataFrame(results)

    # Log the entire DataFrame as a table to W&B
    wandb.log({"results_table": wandb.Table(dataframe=df_results)})

    print(f'{model_name} Summarization Accuracy: {accuracy:.2f}%')
    # Finish the Weights & Biases run for the current model
    run.finish()

def compare_summaries(predicted, actual):
    return predicted.strip() == actual.strip()
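Note that compare_summaries uses exact string equality, which is a very strict bar for free-form summaries and helps explain why accuracy alone undersells the model, making the logged W&B tables the more informative signal. If you want a softer automatic check, one hypothetical alternative (a crude word-overlap proxy for ROUGE-1 recall; the function name and 0.5 threshold are arbitrary choices of ours) is:

```python
def summary_overlap(predicted, actual, threshold=0.5):
    """Softer check than exact equality: fraction of reference words
    that also appear in the prediction (a crude ROUGE-1-recall proxy)."""
    pred_words = set(predicted.lower().split())
    actual_words = set(actual.lower().split())
    if not actual_words:
        return False
    recall = len(actual_words & pred_words) / len(actual_words)
    return recall >= threshold

print(summary_overlap("The cat sat on the mat.", "A cat sat on a mat."))  # True
```

Swapping this in for compare_summaries would reward summaries that capture the reference content without demanding a character-for-character match.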

Step 9: Load the Dialogue Summarization Dataset

Please be aware that we'll use an entirely distinct test sample for evaluating both models to ensure that there's no repetition of data points. We will also remove any NaN (Not a Number) value from the dataset.
filename = "/kaggle/input/dialogsum/CSV/test.csv"
df = pd.read_csv(filename)

# Remove any NaN values in 'dialogue' and 'summary'
df.dropna(subset=['dialogue', 'summary'], inplace=True)
df = df.sample(100) # Use a subset of the data for quick testing

Step 10: Evaluate the Base Model

In this step, we'll assess the gpt-3.5-turbo model using the evaluation dataset, without applying any fine-tuning.
evaluate_model("base", "gpt-3.5-turbo")

Step 11: Evaluate the Fine-Tuned Model

In this step, we'll assess the newly fine-tuned model using the evaluation dataset after applying the fine-tuning process.
evaluate_model("fine_tuned", "Insert fine_tuned model ID")

Evaluating Fine-Tuned ChatGPT

Using W&B, we have stored a unique evaluation table for each of our models. Each table consists of three columns: first the dialogue to be summarized, second the actual_summary provided in the dataset, and finally the predicted summary of the model. As mentioned, each model's answers are saved in a separate table to make them easier to read.

Old Model’s Performance Table




New Model’s Performance Table



The data showcased in Weights & Biases underscores a dramatic improvement between the original and the fine-tuned models. While the former, non-fine-tuned GPT model tends to be indirect, often bordering on redundancy, the latter encapsulates the essence of text summarization by delivering crisp and direct summaries.
Take, for instance, the very first summarization data point from both tables. The predecessor ChatGPT, in its attempt to be comprehensive, loses the core principle of brevity, producing an extensive summary.
In contrast, the refined model produces a concise summary, a mere three lines compared to the almost 20 lines from the previous version. This demonstrates a significant and noticeable enhancement in the model's overall performance.
While the model shows promising results, it's important to acknowledge that it was trained on a mere 100 data points. This limited dataset can influence accuracy. Although we've achieved concise summarization, there's room for refinement to ensure the summaries consistently capture the essence of the dialogues accurately. Future enhancements will further optimize performance.

Conclusion

In this article, we delved into the process of fine-tuning ChatGPT for the specialized task of dialogue summarization. Through quite a range of adjustments and evaluations, it became evident that tailored training can significantly enhance a model's performance in specific domains.
By using Weights & Biases as an instrumental tool, we were capable of meticulously monitoring and optimizing the model's trajectory. In an era inundated with vast volumes of conversational data, the importance of efficient summarization cannot be overstated. With the advancements demonstrated in this study, it is evident that we are on the cusp of pioneering strides in automated information condensation.
Iterate on AI agents and models faster. Try Weights & Biases today.