Fine-Tuning ChatGPT for Sentiment Analysis With W&B
This article explores fine-tuning ChatGPT for sentiment analysis using W&B. Our experiment will lead to a 25-percentage-point accuracy boost, and we'll delve into practical applications.
Created on October 10|Last edited on November 22
In today's data-driven world, sentiment analysis plays a pivotal role in discerning public opinion on a myriad of topics. Advanced models like ChatGPT, built on the GPT-3.5 architecture, offer immense potential in understanding and interpreting human emotions from textual data.
However, like many tools, their out-of-the-box capabilities might not capture the nuanced intricacies of sentiment, especially in diverse datasets like those from Reddit.
This article dives deep into the process of fine-tuning ChatGPT for sentiment analysis, utilizing the powerful features of the Weights & Biases platform, and delves into the improvements and challenges faced.
Here's what we'll be covering:
Table of Contents
How Can ChatGPT Be Used for Sentiment Analysis?
A Brief Overview of the GPT-3.5 Architecture
Fine-Tuning ChatGPT for Sentiment Analysis
Data Preparation and Labeling
The Current Data Set at Hand
Data Augmentation: Sentiment Analysis Dataset for Fine-Tuning
The Importance of High-Quality Training Data for Sentiment Analysis
Step-by-Step Tutorial
A) Evaluating the Old Model’s Performance
B) Fine-Tuning the ChatGPT Model
C) Evaluating the New Model’s Performance
Fine-Tuning Results and Analysis
Practical Applications and Use Cases
Jargon and Slang Understanding
E-Commerce Product Reviews
Further Improvements
Conclusion
Let's get going!
How Can ChatGPT Be Used for Sentiment Analysis?
ChatGPT's ability to understand natural language makes it a good fit for sentiment analysis. This is because, unlike traditional chatbots that rely on predefined responses, ChatGPT generates real-time answers based on a vast amount of training data.
This approach enables it to provide responses that are contextually relevant and informed by a broad spectrum of information.
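Concretely, a zero-shot sentiment query is just a chat request containing an instruction and the text to classify. Here's a minimal sketch of that payload; the system prompt matches the one used later in this tutorial, and the example user comment is made up:

```python
# Sketch of a zero-shot sentiment request payload; the example
# user comment is hypothetical.
messages = [
    {
        "role": "system",
        "content": (
            "What is the sentiment of the following text? "
            "Please respond with 'positive', 'negative', or 'neutral'."
        ),
    },
    {"role": "user", "content": "The new update completely broke my workflow."},
]

# With an API key configured, this payload would be sent as:
# client.chat.completions.create(model="gpt-3.5-turbo", messages=messages)
print(messages[0]["content"])
```

The model's one-word reply can then be mapped back to a numeric label, which is exactly what the tutorial below does.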
A Brief Overview of the GPT-3.5 Architecture
The GPT-3.5 model is a significant advancement in the realm of natural language processing. It builds on GPT-3, which boasts 175 billion parameters, essentially the components that the model adjusts during its training phase. (OpenAI has not published an exact parameter count for GPT-3.5 itself.)
These parameters allow GPT-3.5 to capture and reproduce the nuances and complexities of human language. The "3.5" indicates a refinement over its predecessor, marking progress in its capacity to generate coherent and contextually appropriate responses.
In this tutorial, we'll be fine-tuning GPT-3.5 to improve the accuracy of sentiment analysis.
Fine-Tuning ChatGPT for Sentiment Analysis
Fine-tuning is a pivotal step in adapting a general-purpose model like ChatGPT to a specific task such as sentiment analysis.
ChatGPT, with its broad language understanding capabilities, can grasp a vast array of topics and concepts. However, sentiment analysis is more than just comprehending text; it requires a nuanced understanding of subjective tones, moods, and emotions.
Think sarcasm. Understanding sarcasm is tricky, even for humans sometimes. Sarcasm is when we say something but mean the opposite, often in a joking or mocking way. For example, if it starts raining just as you're about to go outside, and you say, "Oh, perfect timing!" you're probably being sarcastic because it's actually bad timing.
Now, imagine a machine trying to understand this. Without special training, it might think you're genuinely happy about the rain because you said "perfect." This is where fine-tuning a model like ChatGPT becomes crucial.
ChatGPT, out of the box, is pretty good at understanding a lot of text. It's read more than most humans ever will. But sarcasm is subtle and often needs context. So, to make ChatGPT really get sarcasm, we'd expose it to many examples of sarcastic sentences until it starts catching on to the patterns.
But here's the catch: sarcasm doesn't look the same everywhere. In different cultures or situations, what's sarcastic in one place might be meant seriously in another. That's why just general knowledge isn't enough. The model needs specific examples to truly grasp the playful twists and turns of sarcasm.
In short, to make ChatGPT understand sarcasm like a human, it needs extra training on it, just like someone might need to watch several comedy shows to start understanding a comedian's sense of humor.
Data Preparation and Labeling
The Current Data Set at Hand
In this tutorial, we'll leverage the Reddit dataset sourced from Kaggle, available via the following link. This dataset features two key columns: clean_comment (the sentiment text) and its corresponding category (the sentiment label).

Data Augmentation: Sentiment Analysis Dataset for Fine-Tuning
Nevertheless, it's important to note that the ChatGPT fine-tuning process requires a specific data structure for optimal training. Here's a representative format for this data, as shown in the OpenAI fine-tuning documentation:
```json
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"}, {"role": "assistant", "content": "Oh, just some guy named William Shakespeare. Ever heard of him?"}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "How far is the Moon from Earth?"}, {"role": "assistant", "content": "Around 384,400 kilometers. Give or take a few, like that really matters."}]}
```
In Step 10 of the tutorial below, we'll transform our two-column dataset into this JSONL format.
The Importance of High-Quality Training Data for Sentiment Analysis
High-quality training data is pivotal for sentiment analysis, as it ensures the model learns to accurately distinguish nuances in emotion. Poor data can lead to misinterpretations, reducing the effectiveness of the analysis. Moreover, comprehensive and well-curated data can significantly boost the model's ability to generalize across diverse real-world scenarios. The dataset we're utilizing underscores this point, as some of its entries are so nuanced that even humans might struggle to discern their sentiment.
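A quick sanity check worth running on any sentiment dataset is the label distribution, since a heavily skewed split can bias both fine-tuning and the accuracy metric. Here's a sketch using a small hypothetical sample with the same `clean_comment`/`category` schema:

```python
import pandas as pd

# Hypothetical mini-sample mirroring the Kaggle schema:
# 'clean_comment' (text) and 'category' (-1, 0, or 1).
df = pd.DataFrame({
    "clean_comment": ["great movie", "terrible service", "it exists", "love it", "meh"],
    "category": [1, -1, 0, 1, 0],
})

# Count rows per label to spot class imbalance before sampling.
counts = df["category"].value_counts().sort_index()
print(counts.to_dict())  # {-1: 1, 0: 2, 1: 2}
```

If one class dominates, accuracy alone can look deceptively good, so it pays to check this before drawing the evaluation sample.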
Step-by-Step Tutorial
A) Evaluating the Old Model’s Performance
Step 1: Installing and Importing Necessary Libraries
```python
!pip install openai
!pip install wandb
!pip install git+https://github.com/wandb/wandb.git@openai-finetuning
```

```python
import os
import openai
import wandb
import pandas as pd
import json
from openai import OpenAI
from wandb.integration.openai import WandbLogger
```
Step 2: Setting Up OpenAI API Key
```python
openai.api_key = "Insert your OpenAI API key here"
```
Step 3: Loading and Processing the Sentiment Analysis Dataset
```python
filename = "Insert the path to your data set here"

# Read the CSV
df = pd.read_csv(filename)

# Drop rows with NaN values in 'clean_comment' and 'category'
df.dropna(subset=['clean_comment', 'category'], inplace=True)

# Sample 100 rows from the dataset
df = df.sample(100)
```
Step 4: Initializing a New Weights & Biases Project
In this section, we'll harness the capabilities of the recently introduced WandbLogger() function, designed to streamline the integration of W&B with OpenAI. This tool is tailored to enhance the fine-tuning process of OpenAI models, including ChatGPT. It offers a simplified, efficient pathway for tracking and monitoring the training process, visualizing performance metrics, and comparing different experimental setups.
The WandbLogger() function is instrumental in providing a cohesive and user-friendly environment for managing and analyzing the nuances of model fine-tuning in a streamlined manner.
To learn more about the function and the additional parameters it accepts, see the Weights & Biases documentation for the OpenAI fine-tuning integration.
```python
WandbLogger.sync(project="chatgpt3.5-fine-tuning-for-Sentiment-Analysis")
```
Step 5: Take a New Sample To Test the Model On
```python
df = df.sample(100)
```
Step 6: Defining a Function To Convert the Model Response to Sentiment Value and Vice Versa
```python
def convert_response_to_sentiment(response):
    response = response.lower()
    if 'positive' in response:
        return 1
    elif 'negative' in response:
        return -1
    elif 'neutral' in response:
        return 0
    else:
        return -1  # Unknown sentiment


def convert_numeric_to_string_sentiment(value):
    if value == 1:
        return "positive"
    elif value == -1:
        return "negative"
    elif value == 0:
        return "neutral"
    else:
        return "unknown"
```
Step 7: Evaluating the Old Model’s Performance
```python
client = openai.Client()

correct_predictions = 0
loop_count = 0  # Counter for loop iterations
results = []
total_rows = len(df)

# Iterate over each row in the DataFrame
for index, row in df.iterrows():
    loop_count += 1  # Increment the loop count
    text = row['clean_comment']
    try:
        completion = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": "What is the sentiment of the following text? Please respond with 'positive', 'negative', or 'neutral'."},
                {"role": "user", "content": text},
            ],
        )
        response = completion.choices[0].message.content
        predicted_sentiment = convert_response_to_sentiment(response)
        results.append({
            "sentiment": text,
            "labeled_prediction": convert_numeric_to_string_sentiment(row['category']),
            "old_model_prediction": response,
        })
        # Check if the predicted sentiment matches the actual sentiment
        if predicted_sentiment == row['category']:
            correct_predictions += 1
        # Print the current progress
        print(f"Processed {loop_count}/{total_rows} rows.")
    except Exception as e:
        print(f"Error on index {index}: {e}")
        continue
```
Step 8: Calculating Old Accuracy
```python
accuracy = (correct_predictions / total_rows) * 100
```
Step 9: Logging the Old Accuracy to W&B
```python
wandb.log({"Old Accuracy": accuracy})
print(f'Model Accuracy before: {accuracy:.2f}%')
```
Output: Model Accuracy before: 48.00%
B) Fine-Tuning the ChatGPT Model
Step 10: Converting the DataFrame to the Desired JSONL Format
```python
output_filename = "insert the directory in which to save the newly processed data set"

# Convert the DataFrame to the desired JSONL format
with open(output_filename, "w") as file:
    for _, row in df.iterrows():
        # Map the numeric label to its corresponding string label
        target_label = {
            0: 'neutral',
            1: 'positive',
            -1: 'negative'
        }.get(row['category'], 'unknown')
        data = {
            "messages": [
                {
                    "role": "system",
                    "content": "What is the sentiment of the following text? Please respond with 'positive', 'negative', or 'neutral'."
                },
                {
                    "role": "user",
                    "content": row['clean_comment']
                },
                {
                    "role": "assistant",
                    "content": target_label
                }
            ]
        }
        # Write each data point as a separate line in the JSONL file
        file.write(json.dumps(data) + "\n")
```
Step 11: Uploading the Created File to OpenAI
```python
training_file = client.files.create(
    file=open(output_filename, "rb"),
    purpose='fine-tune'
)
training_file_id = training_file.id
```
Step 12: Creating a New Fine-Tuning Job
```python
fine_tuning_job = client.fine_tuning.jobs.create(
    training_file=training_file_id,
    model="gpt-3.5-turbo"
)
job_id = fine_tuning_job.id
```
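Rather than waiting for the notification email, you can also poll the job from code. The helper below is a minimal sketch assuming the OpenAI v1 Python client, where a retrieved job exposes `status` and `fine_tuned_model` fields:

```python
import time

def wait_for_fine_tune(client, job_id, poll_seconds=30):
    """Poll a fine-tuning job until it finishes; return the model ID.

    Sketch assuming the OpenAI v1 client: jobs pass through statuses
    such as 'running' before ending in 'succeeded', 'failed', or
    'cancelled', and expose the new model ID as `fine_tuned_model`.
    """
    while True:
        job = client.fine_tuning.jobs.retrieve(job_id)
        if job.status == "succeeded":
            return job.fine_tuned_model
        if job.status in ("failed", "cancelled"):
            raise RuntimeError(f"Fine-tuning ended with status: {job.status}")
        time.sleep(poll_seconds)

# Usage (with the client and job_id from the steps above):
# model_id = wait_for_fine_tune(client, job_id)
```

Jobs on small datasets like ours typically finish within minutes, but the polling interval can be raised for larger runs.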
C) Evaluating the New Model’s Performance
Step 13: Evaluating the New Model’s Performance
Note that OpenAI will email you the new model's ID once the fine-tuning job completes.
```python
model_id = "Insert here the new model's ID"

correct_predictions = 0
loop_count = 0  # Counter for loop iterations
loop_index = 0  # Index into the results list built earlier

# Iterate over each row in the DataFrame for the new model
for index, row in df.iterrows():
    loop_count += 1  # Increment the loop count
    text = row['clean_comment']
    try:
        completion = client.chat.completions.create(
            model=model_id,
            messages=[
                {"role": "system", "content": "What is the sentiment of the following text? Please respond with 'positive', 'negative', or 'neutral'."},
                {"role": "user", "content": text},
            ],
        )
        response = completion.choices[0].message.content
        predicted_sentiment = convert_response_to_sentiment(response)
        results[loop_index].update({"new_model_prediction": response})
        loop_index += 1  # Increment the results index
        # Check if the predicted sentiment matches the actual sentiment
        if predicted_sentiment == row['category']:
            correct_predictions += 1
        # Print the current progress
        print(f"Processed {loop_count}/{total_rows} rows.")
    except Exception as e:
        print(f"Error on index {index}: {e}")
        continue
```
Step 14: Calculating the New Accuracy
```python
accuracy = (correct_predictions / total_rows) * 100
```
Step 15: Logging the New Accuracy to W&B
```python
wandb.log({"New Accuracy": accuracy})
print(f'Model Accuracy after: {accuracy:.2f}%')
```
Output: Model Accuracy after: 73.00%
Step 16: Create a New vs Old Result Comparison Table in W&B
```python
# Convert the results list to a DataFrame
df_results = pd.DataFrame(results)

# Log the entire DataFrame as a table to W&B
wandb.log({"results_table": wandb.Table(dataframe=df_results)})
```
Step 17: Finishing the Weights & Biases Run
```python
wandb.finish()
```
Fine-Tuning Results and Analysis
After the processing is complete, a link to Weights & Biases (W&B) will be generated. Click on this link to access the logs and view the results of your experiment.

The new model demonstrates a marked improvement of 25 percentage points in predictive accuracy. While the previous model achieved an accuracy of 48%, the updated version impressively reached 73%.
Moreover, we've compiled a table consisting of 100 entries and saved it using the Weights & Biases tool (Step 16). This table is structured with four columns: Sentiment, Labeled Prediction, Old Model Prediction, and New Model Prediction. Through manual inspection, we'll be able to pinpoint instances where the model exhibited improvements.
Navigating the complexities of model evaluation becomes simpler with W&B. It enables us to save and compare multiple tables with various accuracies and outputs. By tweaking our fine-tuning process, we can easily monitor improvements or setbacks, streamlining our path to achieving the highest model accuracy possible.
For example, in the displayed table, the new model accurately predicted the sentiment label for the second, fourth, and seventh rows, where the previous model had classified them incorrectly. Thus, we can clearly see the improvement in our model's predictions, along with its strengths and weaknesses.
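This kind of inspection can also be automated. The sketch below uses a hypothetical miniature version of the logged results table to flag rows the fine-tuned model fixed and rows where it regressed:

```python
import pandas as pd

# Hypothetical stand-in for the results table logged in Step 16.
df_results = pd.DataFrame({
    "sentiment": ["nice post", "awful", "ok i guess"],
    "labeled_prediction": ["positive", "negative", "neutral"],
    "old_model_prediction": ["negative", "negative", "positive"],
    "new_model_prediction": ["positive", "negative", "neutral"],
})

old_correct = df_results["old_model_prediction"] == df_results["labeled_prediction"]
new_correct = df_results["new_model_prediction"] == df_results["labeled_prediction"]

# Rows the fine-tuned model fixed, and rows it broke.
fixed = df_results[~old_correct & new_correct]
regressed = df_results[old_correct & ~new_correct]
print(len(fixed), len(regressed))  # 2 0
```

The `fixed` and `regressed` frames could themselves be logged as W&B tables, making the manual inspection step repeatable across runs.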
Practical Applications and Use Cases
Jargon and Slang Understanding
Social media platforms, such as Facebook, Twitter, and even Reddit, have a unique language characterized by platform-specific slang, memes, niche topics, and abbreviations. Fine-tuning allows the model to interpret and respond to such vernacular accurately, ensuring better sentiment understanding.
E-Commerce Product Reviews
E-commerce platforms can analyze product reviews to identify highly rated products or vendors and adjust their recommendation algorithms accordingly.
Further Improvements
By leveraging a mere 100 training samples, we've seen a notable 25-percentage-point improvement in the model's performance. This initial result is promising and underscores the model's adaptive capabilities. However, it's essential to recognize the expansive potential lying ahead. As with most machine learning models, the depth and diversity of training data often correlate with the model's precision and robustness.
Introducing a larger training sample, say 1,000 or 10,000 data points, could not only improve accuracy but also ensure the model is better equipped to handle a wider variety of scenarios and nuances. Such a vast dataset would encompass a broader spectrum of sentiments, jargon, and contexts, thereby refining the model's ability to discern subtleties and reduce false positives or negatives.
Moreover, a larger training set would allow the model to generalize better, minimizing overfitting to any specific subset of data. It would be intriguing to explore how the model evolves as we scale our training efforts. Given the promising improvement from a modest sample, there's a compelling case to be made for continued investment in refining and expanding the training data.
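When scaling up, it's also worth sampling by label rather than uniformly, so rarer sentiments stay represented in the larger training set. Here's a minimal pandas sketch using a hypothetical imbalanced frame in the same schema as the Reddit data:

```python
import pandas as pd

# Hypothetical imbalanced frame in the same schema as the Reddit data.
df = pd.DataFrame({
    "clean_comment": [f"comment {i}" for i in range(90)] + [f"rant {i}" for i in range(10)],
    "category": [1] * 90 + [-1] * 10,
})

# Sample an equal number of rows per label instead of df.sample(n),
# so rare classes aren't drowned out as the training set grows.
per_class = 5
balanced = pd.concat(
    group.sample(min(per_class, len(group)), random_state=42)
    for _, group in df.groupby("category")
)
print(balanced["category"].value_counts().sort_index().to_dict())  # {-1: 5, 1: 5}
```

Swapping this in for the plain `df.sample(100)` calls in Steps 3 and 5 would keep both the fine-tuning data and the evaluation sample balanced.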
Conclusion
Fine-tuning powerful models such as ChatGPT presents a promising avenue for specialized tasks like sentiment analysis. Our exploration, backed by Weights & Biases, not only showcased the significance of specific training but also the vast potential improvements one can achieve, even with limited data samples.
The realm of sentiment analysis is rife with complexities, from understanding sarcastic tones to deciphering platform-specific jargon. With continued advancements and refined fine-tuning processes, models like ChatGPT can become indispensable tools for businesses, researchers, and developers.
The journey of refining models underscores a critical lesson: in the world of AI, there's always room for improvement and innovation.