Fine-Tuning ChatGPT for Sentiment Analysis With W&B
This article explores fine-tuning ChatGPT for sentiment analysis using W&B. Our experiment will lead to a 25-percentage-point accuracy boost, and we'll delve into practical applications.
Created on October 10|Last edited on November 22
In today's data-driven world, sentiment analysis plays a pivotal role in discerning public opinion on a myriad of topics. Advanced models like ChatGPT, built on the GPT-3.5 architecture, offer immense potential in understanding and interpreting human emotions from textual data.
However, like many tools, their out-of-the-box capabilities might not capture the nuanced intricacies of sentiment, especially in diverse datasets like those from Reddit.
This article dives deep into the process of fine-tuning ChatGPT for sentiment analysis, utilizing the powerful features of the Weights & Biases platform, and delves into the improvements and challenges faced.
Here's what we'll be covering:
Table of Contents
How Can ChatGPT Be Used for Sentiment Analysis?
A Brief Overview of the GPT-3.5 Architecture
Fine-Tuning ChatGPT for Sentiment Analysis
Data Preparation and Labeling
The Current Data Set at Hand
Data Augmentation: Sentiment Analysis Dataset for Fine-Tuning
The Importance of High-Quality Training Data for Sentiment Analysis
Step-by-Step Tutorial
A) Evaluating the Old Model’s Performance
B) Fine-Tuning the ChatGPT Model
C) Evaluating the New Model’s Performance
Fine-Tuning Results and Analysis
Practical Applications and Use Cases
Jargon and Slang Understanding
E-Commerce Product Reviews
Further Improvements
Conclusion
Let's get going!
How Can ChatGPT Be Used for Sentiment Analysis?
ChatGPT's ability to understand natural language makes it a good fit for sentiment analysis. This is because, unlike traditional chatbots that rely on predefined responses, ChatGPT generates real-time answers based on a vast amount of training data.
This approach enables it to provide responses that are contextually relevant and informed by a broad spectrum of information.
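Concretely, a zero-shot sentiment query is just a chat request containing an instruction and the text to classify. Here's a minimal sketch of that payload; the system prompt matches the one used later in this tutorial, and the example user comment is made up:

```python
# Sketch of a zero-shot sentiment request payload; the example
# user comment is hypothetical.
messages = [
    {
        "role": "system",
        "content": (
            "What is the sentiment of the following text? "
            "Please respond with 'positive', 'negative', or 'neutral'."
        ),
    },
    {"role": "user", "content": "The new update completely broke my workflow."},
]

# With an API key configured, this payload would be sent as:
# client.chat.completions.create(model="gpt-3.5-turbo", messages=messages)
print(messages[0]["content"])
```

The model's one-word reply can then be mapped back to a numeric label, which is exactly what the tutorial below does.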
A Brief Overview of the GPT-3.5 Architecture
The GPT-3.5 model is a significant advancement in the realm of natural language processing. It builds on GPT-3, which boasts 175 billion parameters, essentially the components that the model adjusts during its training phase. (OpenAI has not published an exact parameter count for GPT-3.5 itself.)
These parameters allow GPT-3.5 to capture and reproduce the nuances and complexities of human language. The "3.5" indicates a refinement over its predecessor, marking progress in its capacity to generate coherent and contextually appropriate responses.
In this tutorial, we'll be fine-tuning GPT-3.5 to improve the accuracy of sentiment analysis.
Fine-Tuning ChatGPT for Sentiment Analysis
Fine-tuning is a pivotal step in adapting a general-purpose model like ChatGPT to a specific task such as sentiment analysis.
ChatGPT, with its broad language understanding capabilities, can grasp a vast array of topics and concepts. However, sentiment analysis is more than just comprehending text; it requires a nuanced understanding of subjective tones, moods, and emotions.
Think sarcasm. Understanding sarcasm is tricky, even for humans sometimes. Sarcasm is when we say something but mean the opposite, often in a joking or mocking way. For example, if it starts raining just as you're about to go outside, and you say, "Oh, perfect timing!" you're probably being sarcastic because it's actually bad timing.
Now, imagine a machine trying to understand this. Without special training, it might think you're genuinely happy about the rain because you said "perfect." This is where fine-tuning a model like ChatGPT becomes crucial.
ChatGPT, out of the box, is pretty good at understanding a lot of text. It's read more than most humans ever will. But sarcasm is subtle and often needs context. So, to make ChatGPT really get sarcasm, we'd expose it to many examples of sarcastic sentences until it starts catching on to the patterns.
But here's the catch: sarcasm doesn't look the same everywhere. In different cultures or situations, what's sarcastic in one place might be meant seriously in another. That's why just general knowledge isn't enough. The model needs specific examples to truly grasp the playful twists and turns of sarcasm.
In short, to make ChatGPT understand sarcasm like a human, it needs extra training on it, just like someone might need to watch several comedy shows to start understanding a comedian's sense of humor.
Data Preparation and Labeling
The Current Data Set at Hand
In this tutorial, we'll leverage the Reddit dataset sourced from Kaggle, available via the following link. This dataset features two key columns: clean_comment (the sentiment text) and its corresponding category (the sentiment label).

Data Augmentation: Sentiment Analysis Dataset for Fine-Tuning
Nevertheless, it's important to note that the ChatGPT fine-tuning process requires a specific data structure for optimal training. Here's a representative format for this data, as shown in the OpenAI fine-tuning documentation:
```json
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"}, {"role": "assistant", "content": "Oh, just some guy named William Shakespeare. Ever heard of him?"}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "How far is the Moon from Earth?"}, {"role": "assistant", "content": "Around 384,400 kilometers. Give or take a few, like that really matters."}]}
```
In Step 10 of the tutorial below, we'll transform our two-column dataset into this JSONL format.
The Importance of High-Quality Training Data for Sentiment Analysis
High-quality training data is pivotal for sentiment analysis, as it ensures the model learns to accurately distinguish nuances in emotion. Poor data can lead to misinterpretations, reducing the effectiveness of the analysis. Moreover, comprehensive and well-curated data can significantly boost the model's ability to generalize across diverse real-world scenarios. The dataset we're utilizing underscores this point, as some of its entries are so nuanced that even humans might struggle to discern their sentiment.
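A quick sanity check worth running on any sentiment dataset is the label distribution, since a heavily skewed split can bias both fine-tuning and the accuracy metric. Here's a sketch using a small hypothetical sample with the same `clean_comment`/`category` schema:

```python
import pandas as pd

# Hypothetical mini-sample mirroring the Kaggle schema:
# 'clean_comment' (text) and 'category' (-1, 0, or 1).
df = pd.DataFrame({
    "clean_comment": ["great movie", "terrible service", "it exists", "love it", "meh"],
    "category": [1, -1, 0, 1, 0],
})

# Count rows per label to spot class imbalance before sampling.
counts = df["category"].value_counts().sort_index()
print(counts.to_dict())  # {-1: 1, 0: 2, 1: 2}
```

If one class dominates, accuracy alone can look deceptively good, so it pays to check this before drawing the evaluation sample.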
Step-by-Step Tutorial
A) Evaluating the Old Model’s Performance
Step 1: Installing and Importing Necessary Libraries
```python
!pip install openai
!pip install wandb
!pip install git+https://github.com/wandb/wandb.git@openai-finetuning
```

```python
import os
import openai
import wandb
import pandas as pd
import json
from openai import OpenAI
from wandb.integration.openai import WandbLogger
```
Step 2: Setting Up OpenAI API Key
```python
openai.api_key = "Insert your OpenAI API key here"
```
Step 3: Loading and Processing the Sentiment Analysis Dataset
```python
filename = "Insert the path to your data set here"

# Read the CSV
df = pd.read_csv(filename)

# Drop rows with NaN values in 'clean_comment' and 'category'
df.dropna(subset=['clean_comment', 'category'], inplace=True)

# Sample 100 rows from the dataset
df = df.sample(100)
```
Step 4: Initializing a New Weights & Biases Project
In this section, we'll harness the capabilities of the recently introduced WandbLogger() function, designed to streamline the integration of W&B with OpenAI. This tool is tailored to enhance the fine-tuning process of OpenAI models, including ChatGPT. It offers a simplified, efficient pathway for tracking and monitoring the training process, visualizing performance metrics, and comparing different experimental setups.
The WandbLogger() function is instrumental in providing a cohesive and user-friendly environment for managing and analyzing the nuances of model fine-tuning in a streamlined manner.
To learn more about the function and the additional parameters it accepts, see the Weights & Biases documentation for the OpenAI fine-tuning integration.
```python
WandbLogger.sync(project="chatgpt3.5-fine-tuning-for-Sentiment-Analysis")
```
Step 5: Take a New Sample To Test the Model On
```python
df = df.sample(100)
```
Step 6: Defining a Function To Convert the Model Response to Sentiment Value and Vice Versa
```python
def convert_response_to_sentiment(response):
    response = response.lower()
    if 'positive' in response:
        return 1
    elif 'negative' in response:
        return -1
    elif 'neutral' in response:
        return 0
    else:
        return -1  # Unknown sentiment


def convert_numeric_to_string_sentiment(value):
    if value == 1:
        return "positive"
    elif value == -1:
        return "negative"
    elif value == 0:
        return "neutral"
    else:
        return "unknown"
```
Step 7: Evaluating the Old Model’s Performance
```python
client = openai.Client()

correct_predictions = 0
loop_count = 0  # Counter for loop iterations
results = []
total_rows = len(df)

# Iterate over each row in the DataFrame
for index, row in df.iterrows():
    loop_count += 1  # Increment the loop count
    text = row['clean_comment']
    try:
        completion = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": "What is the sentiment of the following text? Please respond with 'positive', 'negative', or 'neutral'."},
                {"role": "user", "content": text},
            ],
        )
        response = completion.choices[0].message.content
        predicted_sentiment = convert_response_to_sentiment(response)
        results.append({
            "sentiment": text,
            "labeled_prediction": convert_numeric_to_string_sentiment(row['category']),
            "old_model_prediction": response,
        })
        # Check if the predicted sentiment matches the actual sentiment
        if predicted_sentiment == row['category']:
            correct_predictions += 1
        # Print the current progress
        print(f"Processed {loop_count}/{total_rows} rows.")
    except Exception as e:
        print(f"Error on index {index}: {e}")
        continue
```
Step 8: Calculating Old Accuracy
```python
accuracy = (correct_predictions / total_rows) * 100
```
Step 9: Logging the Old Accuracy to W&B
```python
wandb.log({"Old Accuracy": accuracy})
print(f'Model Accuracy before: {accuracy:.2f}%')
```
Output: Model Accuracy before: 48.00%
B) Fine-Tuning the ChatGPT Model
Step 10: Converting the DataFrame to the Desired JSONL Format
```python
output_filename = "insert the directory in which to save the newly processed data set"

# Convert the DataFrame to the desired JSONL format
with open(output_filename, "w") as file:
    for _, row in df.iterrows():
        # Map the numeric label to its corresponding string label
        target_label = {
            0: 'neutral',
            1: 'positive',
            -1: 'negative'
        }.get(row['category'], 'unknown')
        data = {
            "messages": [
                {
                    "role": "system",
                    "content": "What is the sentiment of the following text? Please respond with 'positive', 'negative', or 'neutral'."
                },
                {
                    "role": "user",
                    "content": row['clean_comment']
                },
                {
                    "role": "assistant",
                    "content": target_label
                }
            ]
        }
        # Write each data point as a separate line in the JSONL file
        file.write(json.dumps(data) + "\n")
```
Step 11: Uploading the Created File to OpenAI
```python
training_file = client.files.create(
    file=open(output_filename, "rb"),
    purpose='fine-tune'
)
training_file_id = training_file.id
```
Step 12: Creating a New Fine-Tuning Job
```python
fine_tuning_job = client.fine_tuning.jobs.create(
    training_file=training_file_id,
    model="gpt-3.5-turbo"
)
job_id = fine_tuning_job.id
```
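Rather than waiting for the notification email, you can also poll the job from code. The helper below is a minimal sketch assuming the OpenAI v1 Python client, where a retrieved job exposes `status` and `fine_tuned_model` fields:

```python
import time

def wait_for_fine_tune(client, job_id, poll_seconds=30):
    """Poll a fine-tuning job until it finishes; return the model ID.

    Sketch assuming the OpenAI v1 client: jobs pass through statuses
    such as 'running' before ending in 'succeeded', 'failed', or
    'cancelled', and expose the new model ID as `fine_tuned_model`.
    """
    while True:
        job = client.fine_tuning.jobs.retrieve(job_id)
        if job.status == "succeeded":
            return job.fine_tuned_model
        if job.status in ("failed", "cancelled"):
            raise RuntimeError(f"Fine-tuning ended with status: {job.status}")
        time.sleep(poll_seconds)

# Usage (with the client and job_id from the steps above):
# model_id = wait_for_fine_tune(client, job_id)
```

Jobs on small datasets like ours typically finish within minutes, but the polling interval can be raised for larger runs.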
C) Evaluating the New Model’s Performance
Step 13: Evaluating the New Model’s Performance
Note that OpenAI will email you the new model's ID once the fine-tuning job completes.
```python
model_id = "Insert here the new model's ID"

correct_predictions = 0
loop_count = 0  # Counter for loop iterations
loop_index = 0  # Index into the results list built earlier

# Iterate over each row in the DataFrame for the new model
for index, row in df.iterrows():
    loop_count += 1  # Increment the loop count
    text = row['clean_comment']
    try:
        completion = client.chat.completions.create(
            model=model_id,
            messages=[
                {"role": "system", "content": "What is the sentiment of the following text? Please respond with 'positive', 'negative', or 'neutral'."},
                {"role": "user", "content": text},
            ],
        )
        response = completion.choices[0].message.content
        predicted_sentiment = convert_response_to_sentiment(response)
        results[loop_index].update({"new_model_prediction": response})
        loop_index += 1  # Increment the results index
        # Check if the predicted sentiment matches the actual sentiment
        if predicted_sentiment == row['category']:
            correct_predictions += 1
        # Print the current progress
        print(f"Processed {loop_count}/{total_rows} rows.")
    except Exception as e:
        print(f"Error on index {index}: {e}")
        continue
```
Step 14: Calculating the New Accuracy
```python
accuracy = (correct_predictions / total_rows) * 100
```
Step 15: Logging the New Accuracy to W&B
```python
wandb.log({"New Accuracy": accuracy})
print(f'Model Accuracy after: {accuracy:.2f}%')
```
Output: Model Accuracy after: 73.00%
Step 16: Create a New vs Old Result Comparison Table in W&B
```python
# Convert the results list to a DataFrame
df_results = pd.DataFrame(results)

# Log the entire DataFrame as a table to W&B
wandb.log({"results_table": wandb.Table(dataframe=df_results)})
```
Step 17: Finishing the Weights & Biases Run
```python
wandb.finish()
```
Fine-Tuning Results and Analysis
After the processing is complete, a link to Weights & Biases (W&B) will be generated. Click on this link to access the logs and view the results of your experiment.

The new model demonstrates a marked improvement of 25 percentage points in predictive accuracy. While the previous model achieved an accuracy of 48%, the updated version impressively reached 73%.
Moreover, we've compiled a table consisting of 100 entries and saved it using the Weights & Biases tool (Step 16). This table is structured with four columns: Sentiment, Labeled Prediction, Old Model Prediction, and New Model Prediction. Through manual inspection, we'll be able to pinpoint instances where the model exhibited improvements.
Navigating the complexities of model evaluation becomes simpler with W&B. It enables us to save and compare multiple tables with various accuracies and outputs. By tweaking our fine-tuning process, we can easily monitor improvements or setbacks, streamlining our path to achieving the highest model accuracy possible.
For example, in the displayed table, the new model accurately predicted the sentiment label for the second, fourth, and seventh rows, where the previous model had classified them incorrectly. Thus, we can clearly see the improvement in our model's predictions, along with its strengths and weaknesses.
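This kind of inspection can also be automated. The sketch below uses a hypothetical miniature version of the logged results table to flag rows the fine-tuned model fixed and rows where it regressed:

```python
import pandas as pd

# Hypothetical stand-in for the results table logged in Step 16.
df_results = pd.DataFrame({
    "sentiment": ["nice post", "awful", "ok i guess"],
    "labeled_prediction": ["positive", "negative", "neutral"],
    "old_model_prediction": ["negative", "negative", "positive"],
    "new_model_prediction": ["positive", "negative", "neutral"],
})

old_correct = df_results["old_model_prediction"] == df_results["labeled_prediction"]
new_correct = df_results["new_model_prediction"] == df_results["labeled_prediction"]

# Rows the fine-tuned model fixed, and rows it broke.
fixed = df_results[~old_correct & new_correct]
regressed = df_results[old_correct & ~new_correct]
print(len(fixed), len(regressed))  # 2 0
```

The `fixed` and `regressed` frames could themselves be logged as W&B tables, making the manual inspection step repeatable across runs.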
Practical Applications and Use Cases
Jargon and Slang Understanding
Social media platforms, such as Facebook, Twitter, and even Reddit, have a unique language characterized by platform-specific slang, memes, niche topics, and abbreviations. Fine-tuning allows the model to interpret and respond to such vernacular accurately, ensuring better sentiment understanding.
E-Commerce Product Reviews
E-commerce platforms can analyze product reviews to identify highly rated products or vendors and adjust their recommendation algorithms accordingly.
Further Improvements
By leveraging a mere 100 training samples, we've seen a notable 25-percentage-point improvement in the model's performance. This initial result is promising and underscores the model's adaptive capabilities. However, it's essential to recognize the expansive potential lying ahead. As with most machine learning models, the depth and diversity of training data often correlate with the model's precision and robustness.
Introducing a larger training sample, say 1,000 or 10,000 data points, could not only improve accuracy but also ensure the model is better equipped to handle a wider variety of scenarios and nuances. Such a vast dataset would encompass a broader spectrum of sentiments, jargon, and contexts, thereby refining the model's ability to discern subtleties and reduce false positives or negatives.
Moreover, a larger training set would allow the model to generalize better, minimizing overfitting to any specific subset of data. It would be intriguing to explore how the model evolves as we scale our training efforts. Given the promising improvement from a modest sample, there's a compelling case to be made for continued investment in refining and expanding the training data.
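When scaling up, it's also worth sampling by label rather than uniformly, so rarer sentiments stay represented in the larger training set. Here's a minimal pandas sketch using a hypothetical imbalanced frame in the same schema as the Reddit data:

```python
import pandas as pd

# Hypothetical imbalanced frame in the same schema as the Reddit data.
df = pd.DataFrame({
    "clean_comment": [f"comment {i}" for i in range(90)] + [f"rant {i}" for i in range(10)],
    "category": [1] * 90 + [-1] * 10,
})

# Sample an equal number of rows per label instead of df.sample(n),
# so rare classes aren't drowned out as the training set grows.
per_class = 5
balanced = pd.concat(
    group.sample(min(per_class, len(group)), random_state=42)
    for _, group in df.groupby("category")
)
print(balanced["category"].value_counts().sort_index().to_dict())  # {-1: 5, 1: 5}
```

Swapping this in for the plain `df.sample(100)` calls in Steps 3 and 5 would keep both the fine-tuning data and the evaluation sample balanced.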
Conclusion
Fine-tuning powerful models such as ChatGPT presents a promising avenue for specialized tasks like sentiment analysis. Our exploration, backed by Weights & Biases, not only showcased the significance of specific training but also the vast potential improvements one can achieve, even with limited data samples.
The realm of sentiment analysis is rife with complexities, from understanding sarcastic tones to deciphering platform-specific jargon. With continued advancements and refined fine-tuning processes, models like ChatGPT can become indispensable tools for businesses, researchers, and developers.
The journey of refining models underscores a critical lesson: in the world of AI, there's always room for improvement and innovation.