Fine-Tuning Tips and Exploration on OpenAI's GPT-3
What does GPT-3 know and how can we optimize its fine-tuning?
Introduction
OpenAI released a fine-tuning API for GPT-3, enabling better performance than few-shot prompting, especially when you have datasets larger than a few hundred samples.
We explore different GPT-3 models and the fine-tuning process, and log our experiments through the W&B integration using just a single line of code:
openai wandb sync
Today, we're using a dataset based on WIT: Wikipedia-based Image Text Dataset. We'll use it to inspect GPT-3's knowledge of different subjects.
As always, our data exploration is fully traceable from dataset creation to training models and finally to logging results:

W&B Graph View of Logged Artifacts
First, let's take a quick moment to talk about the dataset that will form the backbone of this report & our fine-tuning efforts today:
Our Wikipedia Dataset
We created a dataset from WIT where each sample is simply:
- prompt: a Wikipedia article title
- completion: the corresponding first sentence from the article
This will help us query GPT-3 about its knowledge of different subjects: we get outputs in the form of a simple "one-sentence summary." The full dataset contains 1.5M samples, which allows us to quickly build training & validation sets from subsets of this larger group.
We created a few subsets for our experiments, built as sketched in the snippet after this list:
- training set: 20 samples and 50,000 samples
- validation set: 10,000 samples
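Concretely, the fine-tuning files are plain JSONL with prompt/completion pairs. Below is a minimal sketch of how such subsets could be built, assuming the (title, first sentence) pairs have already been extracted from WIT; the separator and stop-token conventions, as well as the file names, are illustrative choices rather than the exact ones used for this report.

```python
import json
import random

# Assumed to be already extracted from WIT: (article_title, first_sentence) pairs.
pairs = [
    ("Ada Lovelace", "Augusta Ada King, Countess of Lovelace, was an English mathematician and writer."),
    # ... ~1.5M samples in the full dataset
]

def write_jsonl(path, samples):
    """Write prompt/completion pairs in the JSONL format expected by the fine-tuning API."""
    with open(path, "w") as f:
        for title, sentence in samples:
            record = {
                "prompt": f"{title}\n\n###\n\n",  # fixed separator marks the end of the prompt
                "completion": f" {sentence}\n",   # leading space, trailing newline used as a stop sequence
            }
            f.write(json.dumps(record) + "\n")

random.seed(0)
random.shuffle(pairs)
write_jsonl("train_20.jsonl", pairs[:20])
write_jsonl("train_50k.jsonl", pairs[:50_000])
write_jsonl("valid_10k.jsonl", pairs[50_000:60_000])
```

Keeping a fixed separator at the end of each prompt and a stop sequence at the end of each completion makes it easier to query the fine-tuned models consistently later on.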
Fine-Tuning GPT-3
Measuring model performance
It can be difficult to measure a single model's performance:
- Surprisingly, the validation loss and validation token accuracy oscillate during training. It is possible that our validation dataset is too large (10,000 samples) and that these metrics are therefore computed on only a few validation batches at each step.
- Sequence accuracy is almost always 0, but this is to be expected here: a completion only counts as correct when every generated token matches the reference exactly, which rarely happens with free-form sentences.
Models trained for different tasks, such as classification, would produce classification metrics that may be more representative.
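To make the distinction between the two accuracy metrics concrete, here is a small, self-contained sketch (a simplification, not OpenAI's exact metric implementation) contrasting token accuracy with sequence accuracy on toy token sequences:

```python
# Toy illustration of why token accuracy can be high while sequence accuracy stays near 0.

def token_accuracy(predicted: list[str], reference: list[str]) -> float:
    """Fraction of positions where the predicted token matches the reference token."""
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / max(len(reference), 1)

def sequence_accuracy(predicted: list[str], reference: list[str]) -> float:
    """1.0 only if every token matches exactly, otherwise 0.0."""
    return float(predicted == reference)

reference = "Albert Einstein was a German - born theoretical physicist .".split()
predicted = "Albert Einstein was a German - born theoretical scientist .".split()

print(token_accuracy(predicted, reference))     # 0.9 -- most tokens are right
print(sequence_accuracy(predicted, reference))  # 0.0 -- a single wrong token zeroes it out
```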
Learning rate schedule
Certain runs show a training loss that decreases in steps, in particular when the learning rate multiplier is high. This is likely due to a custom learning rate scheduler, where each drop corresponds to a learning rate decrease.
Parameter Search
You can customize a lot of parameters when fine-tuning a model. The ones we experiment with today are:
- training dataset size: we have 2 versions, with either 20 samples or 50,000 samples
- base model: this can be "ada", "babbage", or "curie" (from smallest to largest model available)
- learning rate multiplier: we'll try 0.1 and 0.01
This leads to a total of 2 × 3 × 2 = 12 experiments.
Other parameters can also be configured, such as the prompt loss weight, batch size, and number of epochs.
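As a reference, here is a minimal sketch of how such a grid could be launched with the legacy (pre-1.0) openai Python SDK, the same era as the `openai wandb sync` CLI; the equivalent `openai api fine_tunes.create` command works too. The file names follow the snippet above and are assumptions, not the exact files used for this report.

```python
import openai  # legacy (pre-1.0) openai SDK

# Upload the prepared JSONL files; each call returns an object with a "file-..." id.
train_20 = openai.File.create(file=open("train_20.jsonl", "rb"), purpose="fine-tune")
train_50k = openai.File.create(file=open("train_50k.jsonl", "rb"), purpose="fine-tune")
valid_10k = openai.File.create(file=open("valid_10k.jsonl", "rb"), purpose="fine-tune")

# The 2 x 3 x 2 grid described above: dataset size x base model x learning rate multiplier.
for size_name, train_file in [("20", train_20), ("50k", train_50k)]:
    for base_model in ["ada", "babbage", "curie"]:
        for lr_multiplier in [0.1, 0.01]:
            fine_tune = openai.FineTune.create(
                training_file=train_file["id"],
                validation_file=valid_10k["id"],
                model=base_model,
                learning_rate_multiplier=lr_multiplier,
            )
            print(fine_tune["id"], size_name, base_model, lr_multiplier)

# Once the jobs have finished, `openai wandb sync` pushes their metrics and artifacts to W&B.
```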
The top 2 models use Curie and a higher learning rate.
Results
The best way to evaluate a generative model is to explore sample predictions. In the table below, you can select which models you would like to visualize:
Let's see what our model predicts on a few celebrities!
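As an illustration, a fine-tuned model can be queried directly with an article title as the prompt. Here is a minimal sketch with the legacy SDK, where the fine-tuned model name is a hypothetical placeholder and the prompt uses the same separator and stop conventions as the earlier snippet:

```python
import openai  # legacy (pre-1.0) openai SDK

FINE_TUNED_MODEL = "curie:ft-your-org-2023-01-19"  # hypothetical fine-tuned model name

for title in ["Serena Williams", "Freddie Mercury", "Marie Curie"]:
    response = openai.Completion.create(
        model=FINE_TUNED_MODEL,
        prompt=f"{title}\n\n###\n\n",  # same separator used at training time
        max_tokens=64,
        temperature=0,
        stop=["\n"],                   # completions were trained to end with a newline
    )
    print(title, "->", response["choices"][0]["text"].strip())
```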
Resources
Tags: Beginner, NLP, Text Generation, OpenAI, Experiment, Tutorial, Panels, Plots, Tables, Large Models, LLM, Fine-tuning, GPT