
Crafting Superior Summaries: The ChatGPT Fine-Tuning Guide

This article details the fine-tuning of ChatGPT for dialogue summarization, showcasing marked improvements and using Weights & Biases for performance tracking and optimization.
Created on October 15|Last edited on November 22

Introduction

Drowning in a sea of endless information? You're not alone. One communication expert estimates that the average knowledge worker must process, consciously or subconsciously, the equivalent of 174 newspapers of information every day. So, how do we make sense of it all? One word: Summarization.
In this article, we're diving into the world of ChatGPT, tweaking it a bit to make it a pro at summarizing dialogues. While reading, newcomers to machine learning will gain valuable theoretical insights and practical implementations, while seasoned experts will benefit from an in-depth understanding of the comprehensive ChatGPT fine-tuning process.
And while we're at it, we'll also see how tools like Weights & Biases help keep an eye on how our model is doing. With that said, let's dive in!

Understanding ChatGPT and Summarization

ChatGPT Architecture

The magic behind ChatGPT is the GPT (Generative Pre-trained Transformer) architecture, which, as the name implies, can be broken down into three main parts.
  1. Generative: This means the model can generate text. Give it a prompt, and it'll continue the text in a way that's contextually relevant.
  2. Pre-trained: Before you even interact with it, this model has already been trained on vast amounts of text from the internet. So, it comes with a lot of general knowledge right out of the box.
  3. Transformer: This is the actual neural network architecture it uses. Without diving too deep into tech jargon, think of it as a super-smart design that lets the model consider multiple parts of a sentence at once to make sense of what's being said.
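To make the "consider multiple parts of a sentence at once" idea a bit more concrete, here is a minimal, illustrative sketch of scaled dot-product attention, the mechanism at the heart of the Transformer. This toy NumPy example is our own addition for intuition, not code from the tutorial itself:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: each position mixes information
    from every other position, weighted by similarity."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # pairwise similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
    return weights @ V                               # weighted mix of values

# Three toy token vectors "attending" to each other.
x = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
out = attention(x, x, x)
print(out.shape)  # (3, 2): each token's output blends all three inputs
```

Each output row is a weighted average of all input rows, which is exactly how the model "looks at" the whole sentence when interpreting any one word.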

The Importance of Text Summarization

We live in an age of information overload. Every day, countless articles, papers, reports, and other forms of content are published. Reading everything isn't just impractical; it's impossible. Text summarization provides a way to quickly understand the gist of vast amounts of text without going through every single word.
Imagine having to skim through a 50-page report to know its main points. It's time-consuming, right? Automated summaries can condense that content into a few paragraphs or even sentences, saving a lot of time.

Why Do We Need To Fine-Tune ChatGPT for Text Summarization?

While ChatGPT comes pre-trained on a massive amount of text and can handle a variety of tasks out-of-the-box, it's a bit like a jack-of-all-trades. Text summarization, on the other hand, is a nuanced and specific task. Fine-tuning focuses the model's abilities on that specific task, making it better at generating concise and coherent summaries.
Without fine-tuning, ChatGPT might provide summaries that are more verbose, miss the key points, or capture irrelevant details. By fine-tuning on a dataset of well-crafted summaries, we're essentially teaching the model the art of distilling information, ensuring the summaries are of high quality and relevance.
Let's walk through how to tailor ChatGPT specifically for dialogue summarization. The model will be fed with conversations involving two or more individuals. The objective is to have the model churn out concise and focused summaries of these chats, ensuring no crucial details are left out.

An Overview of Weights and Biases

Weights & Biases, commonly referred to as W&B, is a pivotal tool for machine learning experimentation. It serves as a digital laboratory journal for machine learning researchers and practitioners, enabling them to meticulously log their experiments, outcomes, models, and more.
Catering to machine learning professionals around the globe, W&B offers tools that allow for comprehensive tracking, supervision, and visual representation of every model detail. By harnessing these features, ML specialists are empowered to rapidly attain their model's peak performance, guaranteeing the best results in the shortest time frame.
Throughout this article's hands-on section, we will harness the power of W&B to keenly observe our model's efficacy, both before and after the fine-tuning phase.
When it comes to text summarization, the gold standard for performance evaluation is a hands-on review. We'll store the model's output from each session in W&B.
Subsequently, we'll undertake a side-by-side comparison of the model's responses, before and after adjustments, ensuring it aligns increasingly with our desired outcome.

Data Preparation and Annotation

The Significance of High-Quality Training Data for Summarization

When it comes to summarization and data quality, there are two main points, each with a corresponding phrase, that we should focus on.
First, we have the phrase “Good Data = Good Summaries”. Imagine having a student that you want to teach how to summarize a given book. If you give them a bunch of poorly written book summaries as examples, they're probably gonna get the wrong idea about what a good summary looks like. The same goes for machines! If our training data is of top-notch quality, our machine (or model) will produce summaries that are on point.
The other phrase which is more commonly known in the machine learning world is “Garbage In, Garbage Out”. It’s like trying to cook a great meal with bad ingredients. No matter how good of a chef you are, if your ingredients are rotten, the meal won’t taste good. Similarly, if we feed our AI model with low-quality training data, it’s going to spit out low-quality summaries.

Annotating a Summarization Dataset for Fine-Tuning

ChatGPT requires a specific format for the data it is fine-tuned on. This format includes three sections:
  • System: This is the prompt that you will pass to ChatGPT. In our case, the prompt would be “GPT is a great and to-the-point dialogue summarization tool.”
  • User: This is the question asked to the model. In our case, it would be the text that we are required to summarize.
  • Assistant: This is the answer that our model would return. In this case, it would be a brief summary of the text.

Example of the Required JSONL Data Format

{"messages": [{"role": "system", "content": "GPT is a great and to-the-point dialogue summarization tool."}, {"role": "user", "content": "#Person1#: hey, you look great! how's everything?\n#Person2#: yeah, you know what? I've been going to the club regularly. The training really pays off. Now I am in a good shape and I know more about how to keep fit.\n#Person1#: really? tell me about it. I haven't gone to the club for a long time. I am too busy with work.\n#Person2#: it's important to do proper exercises.\n#Person1#: you're right. Too much or too little won't do any good.\n#Person2#: the trainer tells me, besides regular sports activities, I should also have a healthy and balanced diet.\n#Person1#: sounds reasonable.\n#Person2#: we should eat more vegetables instead of junk food to stay energetic.\n#Person1#: and fruits!\n#Person2#: surely it is. Getting enough sleep is also crucial for fitness.\n#Person1#: I've heard that. Does your trainer tell you anything about keeping fit?\n#Person2#: yeah, he advises me to stay in a good mood. That can help one to keep sound physical health.\n#Person1#: I think if you follow your trainer's advice, you'll be on the right track.\n#Person2#: you bet it!"}, {"role": "assistant", "content": "#Person2# looks great because #Person2#'s been to the training club regularly. #Person2# tells #Person1# that having a healthy and balanced diet, getting enough sleep, and staying in a good mood help keep physical health."}]}
Regardless of whether the initial dataset is in CSV or JSON format, the final data must be presented as shown above, with the three separate sections. To dive deeper into the dataset preparation process, check the following OpenAI Documentation.
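Before uploading, it's worth sanity-checking that every line of your JSONL file parses and carries the three roles in order. Here is a small hypothetical helper (the function name and the strict role ordering are our own assumptions, not an OpenAI requirement stated above):

```python
import json

REQUIRED_ROLES = ["system", "user", "assistant"]

def validate_jsonl_line(line):
    """Return True if a JSONL line matches the chat fine-tuning format."""
    record = json.loads(line)
    messages = record.get("messages", [])
    roles = [m.get("role") for m in messages]
    return roles == REQUIRED_ROLES and all("content" in m for m in messages)

example = json.dumps({"messages": [
    {"role": "system", "content": "GPT is a great and to-the-point dialogue summarization tool."},
    {"role": "user", "content": "#Person1#: hi!\n#Person2#: hello!"},
    {"role": "assistant", "content": "#Person1# greets #Person2#."},
]})
print(validate_jsonl_line(example))  # True
```

Running this over every line of the file before upload catches malformed rows early, before they cause a fine-tuning job to fail.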

Fine-Tuning Step-by-Step Tutorial

Step 1: Install Necessary Libraries

!pip install openai
!pip install wandb
!pip install git+https://github.com/wandb/wandb.git@openai-finetuning

import os
import openai
import wandb
import pandas as pd
import json
from openai import OpenAI
from wandb.integration.openai import WandbLogger

Step 2: Set Up the OpenAI API Key

openai.api_key = "Insert your own personal OpenAI key here"

client = openai.Client()

Step 3: Initialize the WandbLogger Function

The WandbLogger() function for OpenAI, as part of the Weights & Biases (W&B) toolkit, is designed to facilitate the fine-tuning of OpenAI models, including ChatGPT. It allows you to track and monitor training, visualize performance, and compare experiments.
To know more about the function and to check the multitude of additional parameters that can be passed to this function click here.
WandbLogger.sync(project="chatgpt3.5-fine-tuning-for-Summarization")

Step 4: Load and Sample the Dataset

# Read the CSV data
df = pd.read_csv("Insert path to summarization dataset")

# Sample 100 rows
df = df.sample(100)

Step 5: Modify the Dataset Format To Adhere to That of OpenAI

output_filename = "Insert the new path to save the newly modified JSONL data set"

# Create a new JSONL file
# Create a new JSONL file
with open(output_filename, 'w') as jsonl_file:
    for _, row in df.iterrows():
        # Create the desired format for each row in the CSV
        data = {
            "messages": [
                {
                    "role": "system",
                    "content": "GPT is a great and to-the-point dialogue summarization tool."
                },
                {
                    "role": "user",
                    "content": row['dialogue']
                },
                {
                    "role": "assistant",
                    "content": row['summary']
                }
            ]
        }
        # Write the data to the JSONL file
        jsonl_file.write(json.dumps(data) + '\n')

Step 6: Upload the Created File to OpenAI

training_file = client.files.create(
    file=open(output_filename, "rb"),
    purpose='fine-tune'
)

training_file_id = training_file.id

Step 7: Fine-Tune the Model

fine_tuning_job = client.fine_tuning.jobs.create(
    training_file=training_file_id,
    model="gpt-3.5-turbo"
)

job_id = fine_tuning_job.id
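The fine-tuning job runs asynchronously, so before evaluating the fine-tuned model in Step 11 you need to wait for the job to finish and retrieve the resulting model ID. A small helper sketch for this (the function name and polling interval are our own choices, not part of the original tutorial) could look like:

```python
import time

def wait_for_fine_tune(client, job_id, poll_seconds=30):
    """Poll the job until it finishes, then return the fine-tuned model name."""
    while True:
        job = client.fine_tuning.jobs.retrieve(job_id)
        if job.status == "succeeded":
            return job.fine_tuned_model
        if job.status in ("failed", "cancelled"):
            raise RuntimeError(f"Fine-tuning ended with status: {job.status}")
        time.sleep(poll_seconds)
```

The returned name (something like "ft:gpt-3.5-turbo:...") is what you pass as the model ID when evaluating the fine-tuned model.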

Step 8: Define an Evaluation Function

First, we'll start a new W&B run for each model we evaluate. Next, we'll evaluate our dataset using the specified format: we provide the dialogue to the model, specify that we want a summary, and then await its response.
def evaluate_model(model_name, model_id):
    # Initialize Weights & Biases
    run = wandb.init(project="text-summarization-with-openai", reinit=True)

    correct_predictions = 0
    loop_count = 0
    results = []

    # Iterate over each row in the DataFrame
    for index, row in df.iterrows():
        loop_count += 1
        dialogue_text = row['dialogue']

        try:
            completion = client.chat.completions.create(
                model=model_id,
                messages=[
                    {"role": "system", "content": "GPT is a great and to-the-point dialogue summarization tool."},
                    {"role": "user", "content": dialogue_text},
                ]
            )
            response = completion.choices[0].message.content
            results.append({
                "dialogue": dialogue_text,
                "actual_summary": row['summary'],
                "predicted_summary": response
            })

            if compare_summaries(response, row['summary']):
                correct_predictions += 1

            print(f"[{model_name}] Processed {loop_count}/{len(df)} rows.")

        except Exception as e:
            print(f"Error on index {index} with {model_name}: {e}")
            continue

    accuracy = (correct_predictions / len(df)) * 100
    wandb.log({f"{model_name} Accuracy": accuracy})

    # Convert results list to DataFrame
    df_results = pd.DataFrame(results)

    # Log the entire DataFrame as a table to W&B
    wandb.log({"results_table": wandb.Table(dataframe=df_results)})

    print(f'{model_name} Summarization Accuracy: {accuracy:.2f}%')
    # Finish the Weights & Biases run for the current model
    run.finish()

def compare_summaries(predicted, actual):
    return predicted.strip() == actual.strip()
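Note that compare_summaries uses exact string equality, which is a very strict bar for free-form summaries and helps explain why accuracy alone undersells the model, making the logged W&B tables the more informative signal. If you want a softer automatic check, one hypothetical alternative (a crude word-overlap proxy for ROUGE-1 recall; the function name and 0.5 threshold are arbitrary choices of ours) is:

```python
def summary_overlap(predicted, actual, threshold=0.5):
    """Softer check than exact equality: fraction of reference words
    that also appear in the prediction (a crude ROUGE-1-recall proxy)."""
    pred_words = set(predicted.lower().split())
    actual_words = set(actual.lower().split())
    if not actual_words:
        return False
    recall = len(actual_words & pred_words) / len(actual_words)
    return recall >= threshold

print(summary_overlap("The cat sat on the mat.", "A cat sat on a mat."))  # True
```

Swapping this in for compare_summaries would reward summaries that capture the reference content without demanding a character-for-character match.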

Step 9: Load the Dialogue Summarization Dataset

Please be aware that we'll use an entirely distinct test sample for evaluating both models to ensure that there's no repetition of data points. We will also remove any NaN (Not a Number) value from the dataset.
filename = "/kaggle/input/dialogsum/CSV/test.csv"
df = pd.read_csv(filename)

# Remove any NaN values in 'dialogue' and 'summary'
df.dropna(subset=['dialogue', 'summary'], inplace=True)
df = df.sample(100) # Use a subset of the data for quick testing

Step 10: Evaluate the Base Model

In this step, we'll assess the gpt-3.5-turbo model using the evaluation dataset, without applying any fine-tuning.
evaluate_model("base", "gpt-3.5-turbo")

Step 11: Evaluate the Fine-Tuned Model

In this step, we'll assess the newly fine-tuned model using the evaluation dataset after applying the fine-tuning process.
evaluate_model("fine_tuned", "Insert fine_tuned model ID")

Evaluating Fine-Tuned ChatGPT

Using W&B, we have stored a unique evaluation table for each of our models. Each table consists of three columns: first the dialogue to be summarized, second the actual_summary provided in the dataset, and finally the predicted summary of the model. As mentioned, each model's answers are saved in a separate table to make them easier to read.

Old Model’s Performance Table




New Model’s Performance Table



The data showcased in Weights & Biases underscores a dramatic improvement between the original and the fine-tuned models. While the former, non-fine-tuned GPT model tends to be indirect, often bordering on redundancy, the latter encapsulates the essence of text summarization by delivering crisp and direct summaries.
Take, for instance, the very first summarization data point from both tables. The predecessor ChatGPT, in its attempt to be comprehensive, loses the core principle of brevity, producing an extensive summary.
In contrast, the refined model produces a concise summary, a mere three lines compared to the almost 20 lines from the previous version. This demonstrates a significant and noticeable enhancement in the model's overall performance.
While the model shows promising results, it's important to acknowledge that it was trained on a mere 100 data points. This limited dataset can influence accuracy. Although we've achieved concise summarization, there's room for refinement to ensure the summaries consistently capture the essence of the dialogues accurately. Future enhancements will further optimize performance.

Conclusion

In this article, we delved into the process of fine-tuning ChatGPT for the specialized task of dialogue summarization. Through quite a range of adjustments and evaluations, it became evident that tailored training can significantly enhance a model's performance in specific domains.
By using Weights & Biases as an instrumental tool, we were capable of meticulously monitoring and optimizing the model's trajectory. In an era inundated with vast volumes of conversational data, the importance of efficient summarization cannot be overstated. With the advancements demonstrated in this study, it is evident that we are on the cusp of pioneering strides in automated information condensation.
Iterate on AI agents and models faster. Try Weights & Biases today.