
Fine-Tuning ChatGPT for Sentiment Analysis With W&B

This article explores fine-tuning ChatGPT for sentiment analysis using W&B. Our experiment yields a 25-point accuracy boost, and we'll delve into practical applications.
Created on October 10|Last edited on November 22
In today's data-driven world, sentiment analysis plays a pivotal role in discerning public opinion on a myriad of topics. Advanced models like ChatGPT, built on the GPT-3.5 architecture, offer immense potential in understanding and interpreting human emotions from textual data.
However, like many tools, their out-of-the-box capabilities might not capture the nuanced intricacies of sentiment, especially in diverse datasets like those from Reddit.
This article dives deep into the process of fine-tuning ChatGPT for sentiment analysis, utilizing the powerful features of the Weights & Biases platform, and delves into the improvements and challenges faced.

Let's get going!

How Can ChatGPT Be Used for Sentiment Analysis?

ChatGPT's ability to understand natural language makes it a good fit for sentiment analysis. This is because, unlike traditional chatbots that rely on predefined responses, ChatGPT generates real-time answers based on a vast amount of training data.
This approach enables it to provide responses that are contextually relevant and informed by a broad spectrum of information.

A Brief Overview of the GPT-3.5 Architecture

The GPT-3.5 model is a significant advancement in the realm of natural language processing. It boasts 175 billion parameters, which are essentially the components that the model adjusts during its training phase.
These parameters allow GPT-3.5 to capture and reproduce the nuances and complexities of human language. The "3.5" here indicates a refinement from its predecessor, marking progress in its capacity to generate coherent and contextually appropriate responses.
In this tutorial, we'll fine-tune GPT-3.5 to improve its accuracy on sentiment analysis.

Fine-Tuning ChatGPT for Sentiment Analysis

Fine-tuning is a pivotal step in adapting a general-purpose model like ChatGPT to a specific task such as sentiment analysis.
ChatGPT, with its broad language understanding capabilities, can grasp a vast array of topics and concepts. However, sentiment analysis is more than just comprehending text; it requires a nuanced understanding of subjective tones, moods, and emotions.
Think sarcasm. Understanding sarcasm is tricky, even for humans sometimes. Sarcasm is when we say something but mean the opposite, often in a joking or mocking way. For example, if it starts raining just as you're about to go outside, and you say, "Oh, perfect timing!" you're probably being sarcastic because it's actually bad timing.
Now, imagine a machine trying to understand this. Without special training, it might think you're genuinely happy about the rain because you said "perfect." This is where fine-tuning a model like ChatGPT becomes crucial.
ChatGPT, out of the box, is pretty good at understanding a lot of text. It's read more than most humans ever will. But sarcasm is subtle and often needs context. So, to make ChatGPT really get sarcasm, we'd expose it to many examples of sarcastic sentences until it starts catching on to the patterns.
But here's the catch: sarcasm doesn't look the same everywhere. In different cultures or situations, what's sarcastic in one place might be meant seriously in another. That's why just general knowledge isn't enough. The model needs specific examples to truly grasp the playful twists and turns of sarcasm.
In short, to make ChatGPT understand sarcasm like a human, it needs extra training on it, just like someone might need to watch several comedy shows to start understanding a comedian's sense of humor.

Data Preparation and Labeling

The Current Data Set at Hand

In this tutorial, we'll leverage the Reddit dataset sourced from Kaggle, available via the following link. This dataset features two key columns: clean_comment (the sentiment text) and its corresponding category (the sentiment label).


Formatting the Sentiment Analysis Dataset for Fine-Tuning

Nevertheless, it's important to note that the ChatGPT fine-tuning process requires a specific data structure for optimal training. Here's a representative format for this data, as provided by the OpenAI fine-tuning documentation:
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."}]}

{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"}, {"role": "assistant", "content": "Oh, just some guy named William Shakespeare. Ever heard of him?"}]}

{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "How far is the Moon from Earth?"}, {"role": "assistant", "content": "Around 384,400 kilometers. Give or take a few, like that really matters."}]}
In the "Preprocessing and Uploading Data File" section of the tutorial, we'll transform our two-column dataset into the specified JSON format.

The Importance of High-Quality Training Data for Sentiment Analysis

High-quality training data is pivotal for sentiment analysis, as it ensures the model learns to accurately distinguish nuances in emotion. Poor data can lead to misinterpretations, reducing the effectiveness of the analysis. Moreover, comprehensive and well-curated data can significantly boost the model's ability to generalize across diverse real-world scenarios. The dataset we're utilizing underscores this point: some of its entries are so nuanced that even humans might struggle to discern their sentiment.

Step-by-Step Tutorial

A) Evaluating the Old Model’s Performance

Step 1: Installing and Importing Necessary Libraries

!pip install openai
!pip install wandb
!pip install git+https://github.com/wandb/wandb.git@openai-finetuning

import os
import openai
import wandb
import pandas as pd
import json
from openai import OpenAI
from wandb.integration.openai import WandbLogger

Step 2: Setting Up OpenAI API Key

openai.api_key = "Insert your OpenAI API key here"
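Hard-coding a key in a notebook risks leaking it when the notebook is shared. A common alternative, sketched below, is to read the key from an environment variable (assumption: you have exported OPENAI_API_KEY in your shell before starting the session):

```python
import os

# Read the key from an environment variable instead of hard-coding it
# (assumption: OPENAI_API_KEY has been exported in your shell).
api_key = os.environ.get("OPENAI_API_KEY", "")
if not api_key:
    print("Set the OPENAI_API_KEY environment variable before running the notebook.")
```

You would then assign this value to openai.api_key as above.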

Step 3: Loading and Processing the Sentiment Analysis Dataset

filename = "Insert the path to your data set here"

# Read the CSV
df = pd.read_csv(filename)

# Drop rows with NaN values in 'clean_comment' and 'category'
df.dropna(subset=['clean_comment', 'category'], inplace=True)

# Sample 100 rows from the dataset
df = df.sample(100)
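Before sampling, it's worth checking the label distribution, since a heavily imbalanced sample would skew the accuracy numbers. A minimal sketch with a toy DataFrame (hypothetical values, but the same two-column schema as the Kaggle Reddit dataset):

```python
import pandas as pd

# Toy frame mirroring the dataset's two columns (hypothetical values,
# same schema as the Kaggle Reddit dataset).
sample_df = pd.DataFrame({
    "clean_comment": ["great movie", None, "terrible service", "it was okay"],
    "category": [1, 1, -1, 0],
})

# Same cleaning step as above: drop rows with missing values
sample_df = sample_df.dropna(subset=["clean_comment", "category"])

# Checking the label distribution helps catch class imbalance before sampling
print(sample_df["category"].value_counts())
```

Running df['category'].value_counts() on the real dataset serves the same purpose.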

Step 4: Initializing a New Weights & Biases Project

In this section, we'll harness the capabilities of the recently introduced WandbLogger() function, designed to streamline the integration of W&B with OpenAI. This tool is tailored to enhance the fine-tuning process of OpenAI models, including ChatGPT. It offers a simplified, efficient pathway for tracking and monitoring the training process, visualizing performance metrics, and comparing different experimental setups.
The WandbLogger() function is instrumental in providing a cohesive and user-friendly environment for managing and analyzing the nuances of model fine-tuning in a streamlined manner.
To learn more about the function and to check the multitude of additional parameters that can be passed to this function, click here.
WandbLogger.sync(project="chatgpt3.5-fine-tuning-for-Sentiment-Analysis")

Step 5: Take a New Sample To Test the Model On

df = df.sample(100)

Step 6: Defining a Function To Convert the Model Response to Sentiment Value and Vice Versa

def convert_response_to_sentiment(response):
    response = response.lower()
    if 'positive' in response:
        return 1
    elif 'negative' in response:
        return -1
    elif 'neutral' in response:
        return 0
    else:
        return -1  # Unknown sentiment

def convert_numeric_to_string_sentiment(value):
    if value == 1:
        return "positive"
    elif value == -1:
        return "negative"
    elif value == 0:
        return "neutral"
    else:
        return "unknown"
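A quick sanity check of the response converter (its definition is repeated so the snippet runs standalone) shows how the substring matching behaves, including one caveat of the fallback branch:

```python
def convert_response_to_sentiment(response):
    response = response.lower()
    if 'positive' in response:
        return 1
    elif 'negative' in response:
        return -1
    elif 'neutral' in response:
        return 0
    else:
        return -1  # Unknown sentiment falls back to -1

# Matching is case-insensitive and substring-based, so full sentences work:
assert convert_response_to_sentiment("The sentiment is Positive.") == 1
assert convert_response_to_sentiment("Neutral") == 0
# Caveat: an unrecognized reply also scores as -1, making it
# indistinguishable from a genuine 'negative' prediction.
assert convert_response_to_sentiment("I can't tell") == -1
print("Sanity checks passed.")
```

Mapping unknown replies to a separate value (say, None) would avoid conflating them with negatives, at the cost of slightly more bookkeeping in the evaluation loop.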

Step 7: Evaluating the Old Model’s Performance

client = openai.Client()

correct_predictions = 0
loop_count = 0  # Counter for loop iterations
total_rows = len(df)

results = []

# Iterate over each row in the DataFrame
for index, row in df.iterrows():
    loop_count += 1  # Increment the loop count
    text = row['clean_comment']
    try:
        completion = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": "What is the sentiment of the following text? Please respond with 'positive', 'negative', or 'neutral'."},
                {"role": "user", "content": text},
            ]
        )
        response = completion.choices[0].message.content
        predicted_sentiment = convert_response_to_sentiment(response)
        results.append({
            "sentiment": text,
            "labeled_prediction": convert_numeric_to_string_sentiment(row['category']),
            "old_model_prediction": response
        })
        # Check if the predicted sentiment matches the actual sentiment
        if predicted_sentiment == row['category']:
            correct_predictions += 1
        # Print the current progress using loop_count
        print(f"Processed {loop_count}/{total_rows} rows.")
    except Exception as e:
        print(f"Error on index {index}: {e}")
        continue

Step 8: Calculating Old Accuracy

accuracy = (correct_predictions / total_rows) * 100

Step 9: Logging the Old Accuracy to W&B

wandb.log({"Old Accuracy": accuracy})
print(f'Model Accuracy before: {accuracy:.2f}%')
Output: Model Accuracy before: 48.00%

B) Fine-Tuning the ChatGPT Model

Step 10: Converting the DataFrame to the Desired JSONL Format

output_filename = "Insert the path where the processed JSONL file should be saved"

# Convert DataFrame to the desired JSONL format
with open(output_filename, "w") as file:
    for _, row in df.iterrows():
        # Map the numeric label to its corresponding string label
        target_label = {
            0: 'neutral',
            1: 'positive',
            -1: 'negative'
        }.get(row['category'], 'unknown')
        data = {
            "messages": [
                {
                    "role": "system",
                    "content": "What is the sentiment of the following text? Please respond with 'positive', 'negative', or 'neutral'."
                },
                {
                    "role": "user",
                    "content": row['clean_comment']
                },
                {
                    "role": "assistant",
                    "content": target_label
                }
            ]
        }
        # Write each data point as a separate line in the JSONL file
        file.write(json.dumps(data) + "\n")
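Before uploading, it's cheap to verify the JSONL file is well-formed, since a single malformed line will fail the fine-tuning job. A self-contained sketch (using toy records and a temporary file rather than the real dataset) of the round-trip check:

```python
import json
import tempfile

SYSTEM_PROMPT = ("What is the sentiment of the following text? "
                 "Please respond with 'positive', 'negative', or 'neutral'.")

# Two toy records in the chat fine-tuning format
records = [
    {"messages": [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "love this phone"},
        {"role": "assistant", "content": "positive"},
    ]},
    {"messages": [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "battery died in a day"},
        {"role": "assistant", "content": "negative"},
    ]},
]

# Write the records as JSONL: one JSON object per line
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
    path = f.name

# Read the file back; every line must parse and carry the three roles in order
with open(path) as f:
    lines = [json.loads(line) for line in f]

assert len(lines) == len(records)
for record in lines:
    roles = [m["role"] for m in record["messages"]]
    assert roles == ["system", "user", "assistant"]
print("JSONL file is well-formed.")
```

Pointing the same loop at output_filename validates the real training file.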

Step 11: Uploading the Created File to OpenAI

training_file = client.files.create(
    file=open(output_filename, "rb"),
    purpose='fine-tune'
)

training_file_id = training_file.id

Step 12: Creating a New Fine-Tuning Job

fine_tuning_job = client.fine_tuning.jobs.create(
    training_file=training_file_id,
    model="gpt-3.5-turbo"
)

job_id = fine_tuning_job.id

C) Evaluating the New Model’s Performance

Step 13: Evaluating the New Model’s Performance

Note that OpenAI will email you the ID of the new fine-tuned model once the job completes.
model_id = "Insert the new model's ID here"
correct_predictions = 0
loop_count = 0  # Counter for loop iterations
loop_index = 0  # Index into the results list built earlier

# Iterate over each row in the DataFrame for the new model
for index, row in df.iterrows():
    loop_count += 1  # Increment the loop count
    text = row['clean_comment']
    try:
        completion = client.chat.completions.create(
            model=model_id,
            messages=[
                {"role": "system", "content": "What is the sentiment of the following text? Please respond with 'positive', 'negative', or 'neutral'."},
                {"role": "user", "content": text},
            ]
        )
        response = completion.choices[0].message.content
        predicted_sentiment = convert_response_to_sentiment(response)
        results[loop_index].update({"new_model_prediction": response})
        loop_index += 1  # Increment the loop index
        # Check if the predicted sentiment matches the actual sentiment
        if predicted_sentiment == row['category']:
            correct_predictions += 1
        # Print the current progress using loop_count
        print(f"Processed {loop_count}/{total_rows} rows.")
    except Exception as e:
        print(f"Error on index {index}: {e}")
        continue

Step 14: Calculating the New Accuracy

accuracy = (correct_predictions / total_rows) * 100

Step 15: Logging the New Accuracy to W&B

wandb.log({"New Accuracy": accuracy})
print(f'Model Accuracy after: {accuracy:.2f}%')
Output: Model Accuracy after: 73.00%

Step 16: Create a New vs Old Result Comparison Table in W&B

# Convert results list to DataFrame
df_results = pd.DataFrame(results)

# Log the entire DataFrame as a table to W&B
wandb.log({"results_table": wandb.Table(dataframe=df_results)})
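Beyond eyeballing the table in the W&B UI, the same DataFrame can be queried directly to count rows the fine-tuned model fixed versus rows it newly broke. A sketch using a toy version of the results table (hypothetical rows, same four columns as Step 16; it assumes model replies have been normalized to the bare labels):

```python
import pandas as pd

# Toy version of the logged results table (hypothetical rows, same four
# columns as the table built in Step 16, with replies normalized to labels).
df_results = pd.DataFrame({
    "sentiment": ["great film", "meh", "awful plot", "loved it"],
    "labeled_prediction": ["positive", "neutral", "negative", "positive"],
    "old_model_prediction": ["negative", "neutral", "negative", "negative"],
    "new_model_prediction": ["positive", "neutral", "negative", "positive"],
})

old_correct = df_results["old_model_prediction"] == df_results["labeled_prediction"]
new_correct = df_results["new_model_prediction"] == df_results["labeled_prediction"]

# Rows the fine-tuned model fixed, and rows where it regressed
fixed = df_results[new_correct & ~old_correct]
regressed = df_results[old_correct & ~new_correct]
print(f"fixed: {len(fixed)}, regressed: {len(regressed)}")
```

Logging these two counts to W&B alongside the table makes run-to-run comparison immediate.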

Step 17: Finishing the Weights & Biases Run

wandb.finish()

Fine-Tuning Results and Analysis

After the processing is complete, a link to Weights & Biases (W&B) will be generated. Click on this link to access the logs and view the results of your experiment.

The new model demonstrates a marked improvement of 25 percentage points in predictive accuracy: while the previous model achieved 48%, the fine-tuned version reached an impressive 73%.
Moreover, we've compiled a table consisting of 100 entries and saved it using the Weights & Biases tool (Step 16). This table is structured with four columns: Sentiment, Labeled Prediction, Old Model Prediction, and New Model Prediction. Through manual inspection, we'll be able to pinpoint instances where the model exhibited improvements.


Navigating the complexities of model evaluation becomes simpler with W&B. It enables us to save and compare multiple tables with various accuracies and outputs. By tweaking our fine-tuning process, we can easily monitor improvements or setbacks, streamlining our path to achieving the highest model accuracy possible.
For example, in the displayed table, the new model accurately predicted the sentiment label for the second, fourth, and seventh rows, all of which the previous model had classified incorrectly. Thus, we can clearly see the improvement in our model's predictions, along with its strong and weak points.

Practical Applications and Use Cases

Jargon and Slang Understanding

Social media platforms, such as Facebook, Twitter, and even Reddit, have a unique language characterized by platform-specific slang, memes, niche topics, and abbreviations. Fine-tuning allows the model to interpret and respond to such vernacular accurately, ensuring better sentiment understanding.

E-Commerce Product Reviews

E-commerce platforms can analyze product reviews to identify highly rated products or vendors and adjust their recommendation algorithms accordingly.

Further Improvements

By leveraging a mere 100 training samples, we've seen a notable 25-point improvement in the model's performance. This initial result is promising and underscores the model's adaptive capabilities. However, it's essential to recognize the expansive potential lying ahead. As with most machine learning models, the depth and diversity of training data often correlate with the model's precision and robustness.
Introducing a larger training sample, say 1,000 or 10,000 data points, could not only improve accuracy but also ensure the model is better equipped to handle a wider variety of scenarios and nuances. Such a vast dataset would encompass a broader spectrum of sentiments, jargon, and contexts, thereby refining the model's ability to discern subtleties and reduce false positives or negatives.
Moreover, a larger training set would allow the model to generalize better, minimizing overfitting to any specific subset of data. It would be intriguing to explore how the model evolves as we scale our training efforts. Given the promising improvement from a modest sample, there's a compelling case to be made for continued investment in refining and expanding the training data.

Conclusion

Fine-tuning powerful models such as ChatGPT presents a promising avenue for specialized tasks like sentiment analysis. Our exploration, backed by Weights & Biases, not only showcased the significance of specific training but also the vast potential improvements one can achieve, even with limited data samples.
The realm of sentiment analysis is rife with complexities, from understanding sarcastic tones to deciphering platform-specific jargon. With continued advancements and refined fine-tuning processes, models like ChatGPT can become indispensable tools for businesses, researchers, and developers.
The journey of refining models underscores a critical lesson: in the world of AI, there's always room for improvement and innovation.
Iterate on AI agents and models faster. Try Weights & Biases today.