Elevating Conversations By Fine-Tuning Chatbots with GPT-3
This guide details how to create a GPT-3-powered chatbot, emphasizing fine-tuning for realistic conversations. It includes a step-by-step process with practical insights, making it a valuable resource for developers and AI enthusiasts interested in conversational AI.
Introduction
The GPT family of models has revolutionized machine learning. And while GPT-4 and GPT-4o are the current state of the art, many organizations choose older versions for some use cases. The reason is fairly simple: older GPT models are cheaper, and both GPT-2 and GPT-3 can still perform quite well. One such use case? Chatbots.
In this article, we're going to explore GPT-3 for just that use case. We'll explore why fine-tuning is essential, particularly for chatbot applications, to ensure the technology remains relevant and effective in real-world scenarios. Additionally, we'll show how Weights & Biases (W&B) can streamline the fine-tuning process by providing visualizations and results we can compare across runs.
From the prerequisites for developing a GPT-3 chatbot, to a step-by-step guide on fine-tuning your model, to evaluating its performance, this article aims to provide a comprehensive roadmap.

Why fine-tune GPT?
Fine-tuning GPT-3 is essential for tailoring it to specific needs because it allows the model to adapt to the unique context and requirements of each use case. By fine-tuning, we can update the model with the latest information, integrate it with new datasets that weren't part of its original training, and customize its responses. This process ensures that the fine-tuned model can understand and respond to user inquiries more accurately, provide updated and relevant information, and interact in a way that feels more natural and specific to the intended audience or domain.
Preparing for GPT-3 chatbot development
For fine-tuning our GPT models, the first thing we need is OpenAI’s API, which we will use to create our training jobs and fine-tune our model. We'll also need to set up a Jupyter notebook and import the necessary libraries. You don’t need specialized hardware for this process, as training runs on OpenAI's servers.
Purpose of fine-tuning
By fine-tuning our GPT-3 model, we're looking to customize and enhance its capabilities to better serve specific needs. For example, GPT-3's training data has a cutoff, so the model has no information about events that occurred after its training phase ended. Ultimately, the goal is to transform a general-purpose model into a specialized tool by training it on new data and expanding its knowledge.
In this article, we will fine-tune on the popular SQuAD dataset. Through fine-tuning, the model will learn to understand passages of text and accurately answer questions about them.
Let's get started.
Let's build our fine-tuned chatbot
In the following section, we'll walk you through a detailed, step-by-step guide on how to fine-tune your GPT-3 model. We will start by having a closer look at our dataset and converting it to the appropriate format for the fine-tuning process. Then we'll look at the fine-tuning process itself.
Step 1: Installing the required packages
First, install the necessary libraries for our fine-tuning jobs by running the following code in your notebook environment.
!pip install wandb
!pip install openai
Step 2: Preparing the dataset
For fine-tuning, I have selected the SQuAD dataset, which you can download from the official SQuAD website. SQuAD is a question-answering dataset containing over 100,000 question-answer pairs drawn from more than 500 Wikipedia articles. Each data point consists of a passage of text (context) from a Wikipedia article, along with associated questions and corresponding answers.
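For reference, the raw SQuAD v2.0 JSON is organized roughly as follows (trimmed to the fields the conversion script below relies on; the values shown are illustrative):

{
  "data": [
    {
      "title": "Normans",
      "paragraphs": [
        {
          "context": "The Normans were the people who in the 10th and 11th centuries ...",
          "qas": [
            {
              "question": "In what country is Normandy located?",
              "is_impossible": false,
              "answers": [{"text": "France", "answer_start": 159}]
            }
          ]
        }
      ]
    }
  ]
}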
First, we have to convert the dataset to the JSONL format that OpenAI's fine-tuning API expects.
import json

def convert_dataset_to_jsonl(input_json_path, output_jsonl_path):
    # Load the raw SQuAD-style JSON file
    with open(input_json_path, 'r', encoding='utf-8') as f:
        data = json.load(f)
    # Write one prompt/completion pair per line in JSONL format
    with open(output_jsonl_path, 'w', encoding='utf-8') as outfile:
        for article in data['data']:
            for paragraph in article['paragraphs']:
                context = paragraph['context']
                for qa in paragraph['qas']:
                    question = qa['question']
                    is_impossible = qa.get('is_impossible', False)
                    prompt = f"Context: {context}\nQuestion: {question}\nAnswer:"
                    if is_impossible:
                        completion = " Impossible"
                    else:
                        answer = qa['answers'][0]['text'] if qa['answers'] else "Unknown"
                        completion = f" {answer}"
                    jsonl_entry = json.dumps({"prompt": prompt, "completion": completion})
                    outfile.write(jsonl_entry + '\n')

convert_dataset_to_jsonl('/train-v2.0.json', '/train.jsonl')
convert_dataset_to_jsonl('/dev-v2.0.json', '/dev.jsonl')
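After conversion, each line of train.jsonl holds one self-contained prompt/completion pair. An illustrative (not verbatim) line looks like this:

{"prompt": "Context: The Normans were the people who in the 10th and 11th centuries ...\nQuestion: In what country is Normandy located?\nAnswer:", "completion": " France"}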
Step 3: Configuring and Initializing OpenAI
Now let's import the modules that we will be using and configure some basic settings. From your OpenAI account, find your API key and either set it as an environment variable or pass it directly as shown below.
import os

import openai
import wandb

# Set the OpenAI API key (prefer reading it from an environment variable over hard-coding it)
openai.api_key = os.environ.get("OPENAI_API_KEY", "your api key")
Step 4: Logging in to wandb
Next, we log in!
wandb.login()
wandb.init(project='project_name', entity='entity_name')
Step 5: Uploading training and validation datasets to OpenAI for fine-tuning
Now that our dataset is ready in JSONL format, it's time to upload it to OpenAI. Make sure to upload the data only once and save the returned train_file_id and dev_file_id, so you can reuse the same files across multiple runs instead of uploading them again and again and eating into your storage quota.
def upload_file_to_openai(file_path, purpose='fine-tune'):
    # Upload the JSONL file to OpenAI and return its file ID
    response = openai.File.create(file=open(file_path, 'rb'), purpose=purpose)
    return response.id

train_file_id = upload_file_to_openai("/train.jsonl")
dev_file_id = upload_file_to_openai("/dev.jsonl")
print(train_file_id)
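To avoid uploading the same files on every run, one simple approach (a sketch, not part of the original script; the cache file name is hypothetical) is to store the returned IDs in a small local JSON file and reuse them:

import json
import os

FILE_ID_CACHE = "openai_file_ids.json"  # hypothetical local cache file

def get_or_upload(file_path, key):
    # Reuse a previously uploaded file ID if one is cached; otherwise upload and cache it
    cache = {}
    if os.path.exists(FILE_ID_CACHE):
        with open(FILE_ID_CACHE) as f:
            cache = json.load(f)
    if key not in cache:
        cache[key] = upload_file_to_openai(file_path)
        with open(FILE_ID_CACHE, "w") as f:
            json.dump(cache, f)
    return cache[key]

train_file_id = get_or_upload("/train.jsonl", "train")
dev_file_id = get_or_upload("/dev.jsonl", "dev")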
Step 6: Defining hyperparameters and logging to wandb
# Define hyperparameters
hyperparameters = {
    "n_epochs": 2,                    # Number of training epochs
    "batch_size": 4,                  # Batch size for training
    "learning_rate_multiplier": 0.1,  # Learning rate adjustment factor
}

# Log hyperparameters to wandb
wandb.config.update(hyperparameters)
Step 7: Initiating a fine-tuning job on OpenAI and logging the job ID with Weights & Biases
Now it's time to initiate our fine-tuning job using the openai.FineTuningJob.create() method, passing the train_file_id and dev_file_id we saved during the upload step. After that, we'll check the status of the job with openai.FineTuningJob.retrieve().
fine_tune_response = openai.FineTuningJob.create(
    training_file=train_file_id,
    validation_file=dev_file_id,
    model="babbage-002",
    hyperparameters=hyperparameters
)
print(f"Fine-tuning started with ID: {fine_tune_response['id']}")
wandb.log({"fine_tune_id": fine_tune_response["id"]})

# Check the current status of the job
fine_tune_id = fine_tune_response['id']
fine_tune_status = openai.FineTuningJob.retrieve(fine_tune_id)
print(f"Fine-tuning job status: {fine_tune_status['status']}")
Step 8: Monitoring and fetching results of fine-tuning job
Now that our fine-tuning job has started, we want to know its status (whether it has completed or not) and, once it has, the details of its events. The script below captures the fine-tuning job ID and then registers a signal handler to catch interrupt signals (SIGINT).
Upon receiving an interrupt, it retrieves and reports the current status of the fine-tuning job. It then requests and streams the events associated with the fine-tuning job, formatting and printing each event's timestamp and message. If the streaming process is interrupted or an error occurs, it reports the disruption.
import signal
import datetime

fine_tune_id = fine_tune_response['id']

def signal_handler(sig, frame):
    status = openai.FineTuningJob.retrieve(fine_tune_id)['status']  # Access status correctly
    print(f"Stream interrupted. Job is still {status}.")
    return

print(f"Streaming events for the fine-tuning job: {fine_tune_id}")
signal.signal(signal.SIGINT, signal_handler)

try:
    events_response = openai.FineTuningJob.list_events(id=fine_tune_id)
    events = events_response['data']  # Access the list of events
    for event in events:
        event_time = datetime.datetime.fromtimestamp(event['created_at']).strftime('%Y-%m-%d %H:%M:%S')
        print(f"{event_time} {event['message']}")
except Exception as e:
    print(f"Stream interrupted (client disconnected). Error: {str(e)}")
Running this prints a timestamped stream of fine-tuning events. Next, we want to get these training metrics into W&B.

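One way to do that is sketched below: loop over the job's metric events and log any numeric fields with wandb.log. This assumes that metric events expose a numeric payload under event['data'] (e.g. step and train_loss); the exact schema may differ across openai SDK versions.

# Sketch: push fine-tuning metrics into W&B
events = openai.FineTuningJob.list_events(id=fine_tune_id)['data']
for event in reversed(events):  # reverse so older events are logged first (the API typically returns newest-first)
    data = event.get('data') or {}
    metrics = {k: v for k, v in data.items()
               if k != 'step' and isinstance(v, (int, float))}
    if metrics:
        wandb.log(metrics, step=data.get('step'))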
Here is the graph plotted from the above data in W&B

Step 9: Evaluating the fine-tuned model
Now that we have fine-tuned our model, we will evaluate it. For the evaluation, we have prepared a simple dataset of questions and answers. The script queries the model with each question and its context, collects the model's answers, and then compiles these into a pandas DataFrame. Finally, the results are logged as a table in a Weights & Biases (wandb) project and optionally saved to a CSV file for local use.
wandb.login()
wandb.init(project='project_name', entity='entity_name')
Let's have a look at our test dataset:
# Test dataset
test_data = [
    {
        "context": "The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse (\"Norman\" comes from \"Norseman\") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries.",
        "qas": [
            {"question": "In what country is Normandy located?", "answer": "France"},
            {"question": "When were the Normans in Normandy?", "answer": "10th and 11th centuries"},
            {"question": "From which countries did the Norse originate?", "answer": "Denmark, Iceland and Norway"},
            {"question": "Who was the Norse leader?", "answer": "Rollo"},
            {"question": "What century did the Normans first gain their separate identity?", "answer": "10th century"}
        ]
    },
    {
        "context": "The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries.",
        "qas": [
            {"question": "Who was the duke in the battle of Hastings?", "answer": "William the Conqueror"},
            {"question": "Who ruled the duchy of Normandy", "answer": "Richard I"}
        ]
    }
]
The following function, normalize_answer(s), standardizes textual answers by converting them to lowercase, removing punctuation, and eliminating common articles like 'a', 'an', and 'the'. This ensures consistency in formatting, making it easier to compare and evaluate answers.
# Function to normalize answers (removing punctuation, lowercase, etc.)
def normalize_answer(s):
    import re

    def remove_articles(text):
        return re.sub(r'\b(a|an|the)\b', ' ', text)

    def white_space_fix(text):
        return ' '.join(text.split())

    def remove_punct(text):
        return re.sub(r'[\W]', ' ', text)

    def lower(text):
        return text.lower()

    return white_space_fix(remove_articles(remove_punct(lower(s))))
The function f1_score(prediction, truth) calculates the F1 score. It first normalizes both the predicted and ground-truth answers using the function defined above, then computes the F1 score from the number of tokens the prediction and the truth have in common.
from collections import Counter

# Calculate F1 score
def f1_score(prediction, truth):
    prediction_tokens = normalize_answer(prediction).split()
    truth_tokens = normalize_answer(truth).split()
    common_tokens = Counter(prediction_tokens) & Counter(truth_tokens)
    num_same = sum(common_tokens.values())
    if num_same == 0:
        return 0
    precision = 1.0 * num_same / len(prediction_tokens)
    recall = 1.0 * num_same / len(truth_tokens)
    f1 = (2 * precision * recall) / (precision + recall)
    return f1
The function exact_match_score(prediction, truth) calculates the Exact Match score, which checks whether the predicted answer exactly matches the ground-truth answer. It returns 1 if the normalized prediction matches the normalized truth, and 0 otherwise.
# Calculate Exact Match score
def exact_match_score(prediction, truth):
    return int(normalize_answer(prediction) == normalize_answer(truth))
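As a quick sanity check of both metrics (the outputs noted in the comments follow directly from the normalization above):

print(f1_score("the 10th century", "10th century"))            # 1.0: tokens match once articles are stripped
print(f1_score("France", "Denmark, Iceland and Norway"))       # 0: no overlapping tokens
print(exact_match_score("the 10th century", "10th century"))   # 1: identical after normalization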
The function query_model(question, context, model) queries a language model (specified by the model parameter) with a given question and context using the OpenAI API. It builds a prompt from the question and context, retrieves the model's answer, and returns it with leading and trailing whitespace stripped.
# Function to query the model and get the answer
def query_model(question, context, model):
    # The API key was already set in Step 3
    response = openai.Completion.create(
        model=model,
        # Use the same prompt layout as the fine-tuning data (context first, then question)
        prompt=f"Context: {context}\nQuestion: {question}\nAnswer:",
        temperature=0,
        max_tokens=50,
        top_p=1.0,
        frequency_penalty=0.0,
        presence_penalty=0.0,
        stop=["\n"]
    )
    return response.choices[0].text.strip()
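For a quick single-question check (the model ID below is a placeholder; use the ID of your own fine-tuned model):

context = test_data[0]['context']
print(query_model("In what country is Normandy located?", context,
                  model="ft:babbage-002:your-org::your-job-id"))  # placeholder fine-tuned model ID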
The following snippet takes the test data defined above, iterates through each item, extracts the context and questions, queries the model for answers, stores the detailed results in a list, converts that list into a DataFrame, and logs the DataFrame as a table to Weights & Biases.
import pandas as pd

# Compile results into a DataFrame
detailed_results = []  # List to store detailed results

for item in test_data:
    context = item['context']
    for qa in item['qas']:
        question = qa['question']
        true_answer = qa['answer']
        model_answer = query_model(question, context, model="modelid")  # Replace "modelid" with your fine-tuned model ID

        # Append detailed results for each question
        detailed_results.append({
            "question": question,
            "model_answer": model_answer,
            "true_answer": true_answer
        })

# Convert detailed results list to DataFrame
df_results = pd.DataFrame(detailed_results)

# Log the entire DataFrame as a table to W&B
wandb.log({"results_table": wandb.Table(dataframe=df_results)})

# Optional: Save the DataFrame to CSV for local use
df_results.to_csv('evaluation_results.csv', index=False)
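The loop above logs raw answers but doesn't apply the scoring functions we defined earlier. As a follow-up sketch, you can compute aggregate Exact Match and F1 over the same DataFrame and log them alongside the table:

# Score each row with the metrics defined earlier, then log the averages to W&B
df_results["exact_match"] = df_results.apply(
    lambda row: exact_match_score(row["model_answer"], row["true_answer"]), axis=1)
df_results["f1"] = df_results.apply(
    lambda row: f1_score(row["model_answer"], row["true_answer"]), axis=1)

wandb.log({
    "eval/exact_match": df_results["exact_match"].mean(),
    "eval/f1": df_results["f1"].mean(),
})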

Here is the final table from W&B, showing both the true answer to each question and the answer given by the model. We can use this table to evaluate our fine-tuned model and the quality of its answers!