Fine-Tuning ChatGPT for Question Answering With W&B
An in-depth guide on fine-tuning ChatGPT for enhanced question-answering capabilities, leveraging Weights & Biases for optimal tracking and results.
Created on September 26|Last edited on November 22

Introduction
In the constantly evolving realm of artificial intelligence, models like ChatGPT by OpenAI have become cornerstones of conversational AI applications. However, while the inherent knowledge and capabilities of these models are immense, the need to fine-tune them for specific tasks or to update them with more recent information is often paramount.
The process of fine-tuning can be intricate, requiring careful data preparation, meticulous training, and precise monitoring. Leveraging tools like Weights & Biases (W&B) simplifies this journey, offering transparency and control. This article delves into the nuances of fine-tuning ChatGPT for question-answering tasks, utilizing the power of W&B to ensure optimal results.
Additionally, if you want more information than what we've provided here, you can download our free guide to fine-tuning and prompt engineering LLMs. Click the button below:

Table of Contents
- Introduction
- Table of Contents
- Understand ChatGPT and Question Answering
- Weights & Biases: An Overview
- Data Preparation and Annotation
- Fine-Tuning Process
- Fine-Tuning Results and Analysis
- Practical Applications of Fine-Tuned QA Models
- Tips for Optimizing the Fine-Tuning Process
- Conclusion
Understand ChatGPT and Question Answering
What Is ChatGPT?
Think of ChatGPT as your digital companion with an encyclopedic knowledge base. Created by the experts at OpenAI, this model is designed to communicate effectively, resembling the conversational style you'd expect from a well-informed peer.
While ChatGPT is undoubtedly a reservoir of knowledge, what sets it apart is its ability to convey that information in a relatable manner. It isn't about reciting facts; it's about providing answers that make sense to the inquisitive human mind.
ChatGPT-3.5 Architecture

GPT-3, like its predecessors, uses the Transformer architecture. The Transformer is particularly known for its ability to handle sequential data, making it ideal for tasks like text generation.
The Transformer architecture is fundamentally composed of two main components: encoders and decoders. Each of these components is made up of layers that contain self-attention mechanisms and feed-forward neural networks. The self-attention mechanism allows the model to weigh the importance of different words relative to a given word, facilitating a contextual understanding of sequences. This is particularly beneficial in capturing long-range dependencies in textual data.
In the context of GPT models, including GPT-3, only the decoder portion of the Transformer is employed for tasks such as language modeling and text generation. This is in stark contrast to models like BERT, which harness only the encoder part of the Transformer for the bidirectional understanding of text.
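To make the decoder-only idea concrete, here is a minimal sketch of masked (causal) scaled dot-product self-attention in NumPy. This is illustrative only: the single-head setup, tiny dimensions, and plain-NumPy softmax are simplifying assumptions, not GPT-3's actual implementation.

```python
import numpy as np

def causal_self_attention(x, Wq, Wk, Wv):
    """Single-head scaled dot-product attention with a causal mask.

    x: (seq_len, d_model) token representations
    Wq, Wk, Wv: (d_model, d_k) learned projection matrices
    """
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)  # (seq_len, seq_len) attention logits
    # Causal mask: each position may only attend to itself and earlier positions
    mask = np.triu(np.ones_like(scores), k=1).astype(bool)
    scores = np.where(mask, -1e9, scores)
    # Row-wise softmax turns logits into attention weights
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = causal_self_attention(x, Wq, Wk, Wv)
print(out.shape)  # (4, 8)
```

Because of the causal mask, the first token's output depends only on the first token itself, which is exactly the property that lets a decoder-only model generate text left to right.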
Challenges of Using ChatGPT for Question Answering
ChatGPT, despite its vast knowledge and impressive conversational capabilities, still comes with challenges, specifically for the task of question answering. Out of the box, ChatGPT is trained on a broad spectrum of data. This makes it general-purpose and capable of handling a wide array of topics. However, it also means the model might not excel in any specialized domain: for nuanced or particular questions, its answers might lack depth or precision.
ChatGPT's knowledge has a cutoff (currently January 2022), which means it is not aware of events or developments that occurred after its last training data update. This can lead to outdated or incomplete answers. Moreover, ChatGPT sometimes provides longer answers than necessary, which is not always ideal for succinct question-answering scenarios.
Weights & Biases: An Overview
Weights & Biases is a machine learning experiment tracking and optimization platform. At its core, it's designed to help ML practitioners keep a detailed record of their experiments, visualize results, and optimize their models. In the ever-evolving field of ML, where hundreds of experiments can be run before arriving at a desirable model, W&B acts as the central hub for all experiment-related information, ensuring nothing gets lost in the shuffle.
To better understand what W&B is used for, let's give a simple example.
Imagine you're training a neural network to recognize handwritten digits. As you tweak and adjust your model (maybe you change the architecture, the learning rate, or the type of optimizer), you'll naturally want to see how each variation performs. Instead of manually jotting down notes or getting lost in endless spreadsheets, W&B will automatically log all relevant details: from hyperparameters used to the model's performance metrics (like accuracy or loss) across epochs.
Having said that, W&B isn't just about numbers; it brings them to life. Want to see how your model's loss decreased over time? Or perhaps you're curious about how different hyperparameters influence accuracy? W&B's dashboards present this data in intuitive graphs and charts, making it easier to spot trends, anomalies, or opportunities for improvement.
Why W&B Is Beneficial for Fine-Tuning ChatGPT
Fine-tuning ChatGPT is no simple endeavor. It frequently requires running multiple experiments, each with its unique combination of datasets, hyperparameters, and training approaches.
In the hands-on segment of this piece, we'll harness W&B to record vital metrics throughout our fine-tuning journey. This encompasses everything from the initial model response, the adjusted output, right down to the conclusive training statistics.
Plus, by maintaining versions of our experiments, we create a safety net, enabling swift retracing or replication of any given experiment. This is particularly invaluable in the intricate domain of question answering.
Data Preparation and Annotation
The Importance of High-Quality Training Data
In the vast universe of machine learning, the mantra "garbage in, garbage out" rings true. At the heart of any successful machine learning model lies its training data. Think of it as the foundation of a building; the stronger and more reliable it is, the more resilient and functional the resulting structure.
Training data essentially guides a model, shaping its understanding and refining its abilities. High-quality data is clean, representative, balanced, and free from biases. When models are trained on such data, they can achieve higher accuracy, generalization, and robustness.
Data Annotating for Fine-Tuning ChatGPT for Question-Answering
For effective fine-tuning, users should utilize a similar data format to the one outlined below:
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"}, {"role": "assistant", "content": "Oh, just some guy named William Shakespeare. Ever heard of him?"}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "How far is the Moon from Earth?"}, {"role": "assistant", "content": "Around 384,400 kilometers. Give or take a few, like that really matters."}]}
Each data entry is comprised of three components. First, there's the system segment, which sets the context or provides the prompt for the ChatGPT model. Next is the User section, representing the question posed by the individual. Finally, we have the Assistant portion, which showcases the model's reply to the presented question.
In this manner, we're training our model using a set of questions along with their respective answers.
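As a sketch, a JSONL file in this format can be assembled with Python's standard library. The question/answer pair below is a placeholder, not our actual training set; replace it with your own data.

```python
import json

# Placeholder data: one system prompt shared by all examples, plus Q&A pairs
system_prompt = "Marv is a factual chatbot that is also sarcastic."
qa_pairs = [
    ("What's the capital of France?",
     "Paris, as if everyone doesn't know that already."),
]

# Write one JSON object per line (the JSONL format the fine-tuning API expects)
with open("training_data.jsonl", "w") as f:
    for question, answer in qa_pairs:
        entry = {"messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ]}
        f.write(json.dumps(entry) + "\n")
```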
Fine-Tuning Process
What's the objective behind fine-tuning ChatGPT? As it stands, ChatGPT's knowledge is capped at January 2022, leaving it unaware of later events. For instance, it's in the dark about Elon Musk's October 2022 acquisition of Twitter. Our aim with fine-tuning is to bring ChatGPT up to speed on such recent developments, priming it for more current and accurate question answering in the future.
This approach isn't just limited to current events; it applies across various question-answering domains. For instance, if software product documentation was released after 2022, you can fine-tune the model on that content to keep it updated and informed.
Step 1: Importing and Installing Necessary Packages
In this phase, we'll be adding both the OpenAI and Weights & Biases libraries to our project directory.
!pip install openai
!pip install wandb
!pip install git+https://github.com/wandb/wandb.git@openai-finetuning

import os
import openai
import wandb
import pandas as pd
import json
from openai import OpenAI
from wandb.integration.openai import WandbLogger
Step 2: Inserting Personal OpenAI Key
To obtain your unique OpenAI Key, you'll need to set up your own account.
openai.api_key = "Insert Your Personal OpenAI Key Here"
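Hard-coding a key in source is risky if the notebook is ever shared. A common alternative, sketched below, is to read the key from an environment variable; the helper name is our own, not part of the OpenAI library.

```python
import os

def load_openai_key(var="OPENAI_API_KEY"):
    """Return the API key from an environment variable, failing loudly if unset."""
    key = os.environ.get(var)
    if not key:
        raise RuntimeError(f"Set the {var} environment variable before running.")
    return key
```

You would then assign `openai.api_key = load_openai_key()` in place of the hard-coded string.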
Step 3: Initializing Weights & Biases using WandbLogger Function
The WandbLogger() function is a recent addition designed to aid in the fine-tuning of OpenAI's ChatGPT models. To learn more about the function and the many additional parameters it accepts, see the W&B documentation.
WandbLogger.sync(project="chatgpt3.5-fine-tuning-for-Question-Answering")
Step 4: Creating the Dataset File
client = openai.Client()

training_file = client.files.create(
    file=open("insert your dataset directory here", "rb"),
    purpose='fine-tune'
)
training_file_id = training_file.id
Step 5: Fine-Tuning Our New ChatGPT Model
fine_tuning_job = client.fine_tuning.jobs.create(
    training_file=training_file_id,
    model="gpt-3.5-turbo"
)
job_id = fine_tuning_job.id
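Fine-tuning jobs run asynchronously, so before retrieving results you may want to poll until the job reaches a terminal state. A minimal sketch (the helper name and polling interval are our own; the retrieve call follows the v1 `client.fine_tuning.jobs` API):

```python
import time

def wait_for_job(client, job_id, poll_seconds=30):
    """Poll a fine-tuning job until it reaches a terminal status."""
    while True:
        job = client.fine_tuning.jobs.retrieve(job_id)
        if job.status in ("succeeded", "failed", "cancelled"):
            return job
        time.sleep(poll_seconds)
```

A succeeded job carries the ID of the resulting fine-tuned model in `job.fine_tuned_model`, which is the model name used for inference later.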
Step 6: Retrieving the Training_loss and Logging It Into W&B
response = client.fine_tuning.jobs.list_events(fine_tuning_job_id=job_id, limit=10)
events = list(response.data)
events.reverse()

for index, event in enumerate(events):
    key = f'training loss {index}'
    wandb.log({key: str(event.message)})
Step 7: Preparing Our Validation Dataset
Our validation dataset contains 10 data points on which we will evaluate the model. The idea is for the model to return the most up-to-date responses it was fine-tuned on.
file_path = 'Insert validation data set directory here'

# Read and process the JSONL file
dataset = []
with open(file_path, 'r') as file:
    for line_number, line in enumerate(file, 1):
        try:
            data_point = json.loads(line)
            dataset.append(data_point["messages"])
        except json.JSONDecodeError as e:
            print(f"Error decoding JSON on line {line_number}: {e}")

# List to store results
results = []
Step 8: Evaluating Our Fine-tuned Model's Response
# Generate responses and process them
for messages in dataset:
    system_message = next((m['content'] for m in messages if m['role'] == 'system'), None)
    user_query = next((m['content'] for m in messages if m['role'] == 'user'), None)

    # Generate a completion with the fine-tuned model
    completion_ft = client.chat.completions.create(
        model="insert fine tuned model ID here",
        messages=messages
    )
    response_ft = completion_ft.choices[0].message.content

    # Append results to the list
    results.append({
        "role:system": system_message,
        "role:user": user_query,
        "role:assistant": response_ft,
    })

# Convert the results list to a DataFrame
df_results = pd.DataFrame(results)

# Log the entire DataFrame as a table to W&B
wandb.log({"results_table": wandb.Table(dataframe=df_results)})
Step 9: Finishing the Weights & Biases Run
wandb.finish()
Fine-Tuning Results and Analysis
Utilizing W&B, we've logged data that helps us assess how much our fine-tuned model has improved. For question-answering tasks specifically, there isn't a straightforward automatic metric, so we rely on a careful manual review, comparing the model's responses before and after fine-tuning to discern any advancement.
Below is the result table that we have logged into W&B, which shows that for each prompt we asked, the model returned a corresponding up-to-date answer.
This set of panels contains runs from a private project, which cannot be shown in this report
Moreover, we will manually check the old and the new responses for the first question.
Old Response

New Response

As we can see, the old response provided by the non-fine-tuned model is outdated: it states that the current owner of Twitter is Jack Dorsey. The new response reflects the updated information, correctly stating that Elon Musk is the current owner of Twitter.

In the final stage of evaluation, we have displayed graphs depicting training accuracy and loss. These curves demonstrate that the model has effectively learned from the provided data.
Practical Applications of Fine-Tuned QA Models
Customer Support
High on the list of practical applications for a freshly fine-tuned ChatGPT model is customer support. Companies can adapt ChatGPT to their product manuals and FAQ repositories. This allows the AI to swiftly provide accurate responses to customer inquiries, easing the burden on human support teams.
Another intriguing application involves tailoring ChatGPT to respond as a specific entity. For instance, the model could be fine-tuned to answer as though it represents the Bank of France.
Technical Documentation
Software companies can fine-tune ChatGPT on their API documentation. Instead of searching through thousands of documentation pages, developers can ask specific questions about functions, classes, or methods and get detailed answers, speeding up the coding or learning process.
Tips for Optimizing the Fine-Tuning Process
Understanding the Objective of the QA Fine-Tuning
When broadening the knowledge base of the model, a straightforward strategy is to feed it an extensive array of questions and their corresponding answers. This ensures that no detail, however minor, is overlooked. Furthermore, make sure to prep the model to deliver specific responses even when faced with questions outside its pre-existing knowledge.
On the other hand, if the aim is to customize the model's tone or have it respond as a particular entity (e.g., a bank's customer support), then the focus slightly shifts. While it remains crucial to equip the model with enough data for comprehensive replies, the tone of the response becomes pivotal. For instance, in the context of a bank, the model's trained responses might be peppered with phrases like, "The Bank of France regrets any inconvenience caused," and so forth.
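For the tone-customization case, a training entry might look like the following. The content is entirely hypothetical and is shown only to illustrate how the system prompt and the assistant's phrasing carry the desired tone:

```python
import json

# Hypothetical tone-focused training entry for a bank support persona
entry = {"messages": [
    {"role": "system",
     "content": "You are the Bank of France's customer support assistant. "
                "Respond formally and courteously."},
    {"role": "user", "content": "Why was my card declined?"},
    {"role": "assistant",
     "content": "The Bank of France regrets any inconvenience caused. "
                "Please verify that your card is active and within its limit, "
                "or contact your branch for assistance."},
]}
print(json.dumps(entry))
```

Repeating the persona's signature phrasing across many assistant turns is what teaches the fine-tuned model to adopt that voice by default.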
Training on Sufficient and Diverse Data
While OpenAI's documentation suggests that in certain situations as few as 10 training examples might be adequate for fine-tuning, it's essential to consider the specific requirements of your task. Some tasks might demand thousands of data points to adequately equip the model with the necessary knowledge.
Additionally, users should emphasize diversity and balance in their training data. If a model is predominantly trained on a particular subject, it might default to responses related to that subject even when it is inappropriate or when asked about another subject.
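One quick sanity check, sketched below, is to count how many training examples share each system prompt (or any topic label you embed there), so imbalances are visible before training. The file name and grouping key are assumptions for illustration.

```python
import json
from collections import Counter

def topic_counts(jsonl_path):
    """Count training examples per system prompt as a rough balance check."""
    counts = Counter()
    with open(jsonl_path) as f:
        for line in f:
            messages = json.loads(line)["messages"]
            system = next((m["content"] for m in messages
                           if m["role"] == "system"), "<none>")
            counts[system] += 1
    return counts
```

If one prompt or topic dominates the counts, consider adding examples for the under-represented ones before launching the fine-tuning job.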
Common Pitfalls To Avoid
It's essential to remember that, currently, once a ChatGPT model has been fine-tuned, it cannot undergo the fine-tuning process again. Hence, it's crucial to include all necessary data when initially training the model.
Conclusion
The world of AI is vast, and models like ChatGPT are a testament to the strides we've made in natural language processing and conversational intelligence. Nevertheless, the efficacy of such models in specialized tasks hinges on effective fine-tuning.
This article has showcased how, with the right tools and approach, one can seamlessly update and tailor ChatGPT for specific question-answering domains. By using Weights & Biases, we can monitor, analyze, and optimize our experiments, ensuring our AI models remain relevant, informed, and efficient. As AI continues to shape our world, it is these fine-tuning endeavors that will dictate the precision, relevance, and value of our digital companions.