Fine-Tuning Tips and Exploration on OpenAI's GPT-3
What does GPT-3 know and how can we optimize its fine-tuning?
Introduction
OpenAI released a fine-tuning API for GPT-3, enabling better performance than few-shot prompting, especially when you have datasets larger than a few hundred samples.
We explore different GPT-3 models and the fine-tuning process, and log our experiments through the W&B integration using just a single line of code:
openai wandb sync
Today, we're using a dataset based on WIT: Wikipedia-based Image Text Dataset. We'll use it to inspect GPT-3's knowledge of different subjects.
As always, our data exploration is fully traceable from dataset creation to training models and finally to logging results:

W&B Graph View of Logged Artifacts
First, let's take a quick moment to talk about the dataset that will form the backbone of this report & our fine-tuning efforts today:
Our Wikipedia Dataset
We created a dataset from WIT where each sample is simply:
- prompt: a Wikipedia article title
- completion: the corresponding first sentence from the article
This will help us query GPT-3 about its knowledge of different subjects: we get outputs in the form of a simple "one-sentence summary." The full dataset contains 1.5M samples, which allows us to quickly build training & validation sets from subsets of this larger group.
We created a few subsets for our experiments, built as sketched in the snippet after this list:
- training set: 20 samples and 50,000 samples
- validation set: 10,000 samples
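Concretely, the fine-tuning files are plain JSONL with prompt/completion pairs. Below is a minimal sketch of how such subsets could be built, assuming the (title, first sentence) pairs have already been extracted from WIT; the separator and stop-token conventions, as well as the file names, are illustrative choices rather than the exact ones used for this report.

```python
import json
import random

# Assumed to be already extracted from WIT: (article_title, first_sentence) pairs.
pairs = [
    ("Ada Lovelace", "Augusta Ada King, Countess of Lovelace, was an English mathematician and writer."),
    # ... ~1.5M samples in the full dataset
]

def write_jsonl(path, samples):
    """Write prompt/completion pairs in the JSONL format expected by the fine-tuning API."""
    with open(path, "w") as f:
        for title, sentence in samples:
            record = {
                "prompt": f"{title}\n\n###\n\n",  # fixed separator marks the end of the prompt
                "completion": f" {sentence}\n",   # leading space, trailing newline used as a stop sequence
            }
            f.write(json.dumps(record) + "\n")

random.seed(0)
random.shuffle(pairs)
write_jsonl("train_20.jsonl", pairs[:20])
write_jsonl("train_50k.jsonl", pairs[:50_000])
write_jsonl("valid_10k.jsonl", pairs[50_000:60_000])
```

Keeping a fixed separator at the end of each prompt and a stop sequence at the end of each completion makes it easier to query the fine-tuned models consistently later on.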
Fine-Tuning GPT-3
Measuring model performance
It can be difficult to measure a single model's performance:
- Surprisingly, the validation loss and validation token accuracy oscillate during training. It is possible that our validation dataset is too large (10,000 samples) and that these metrics are therefore computed on only a few validation batches at each step.
- Sequence accuracy is almost always 0, but this is to be expected here: a completion only counts as correct when every generated token matches the reference exactly, which rarely happens with free-form sentences.
Models trained for different tasks, such as classification, would produce classification metrics that may be more representative.
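To make the distinction between the two accuracy metrics concrete, here is a small, self-contained sketch (a simplification, not OpenAI's exact metric implementation) contrasting token accuracy with sequence accuracy on toy token sequences:

```python
# Toy illustration of why token accuracy can be high while sequence accuracy stays near 0.

def token_accuracy(predicted: list[str], reference: list[str]) -> float:
    """Fraction of positions where the predicted token matches the reference token."""
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / max(len(reference), 1)

def sequence_accuracy(predicted: list[str], reference: list[str]) -> float:
    """1.0 only if every token matches exactly, otherwise 0.0."""
    return float(predicted == reference)

reference = "Albert Einstein was a German - born theoretical physicist .".split()
predicted = "Albert Einstein was a German - born theoretical scientist .".split()

print(token_accuracy(predicted, reference))     # 0.9 -- most tokens are right
print(sequence_accuracy(predicted, reference))  # 0.0 -- a single wrong token zeroes it out
```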
Learning rate schedule
Certain runs show a training loss that decreases in steps, in particular when the learning rate multiplier is high. This is likely due to a custom learning rate scheduler, where each drop corresponds to a learning rate decrease.
Parameter Search
You can customize a lot of parameters when fine-tuning a model. The ones we experiment with today are:
- training dataset size: we have 2 versions, with either 20 samples or 50,000 samples
- base model: this can be "ada", "babbage", or "curie" (from smallest to largest model available)
- learning rate multiplier: we'll try 0.1 and 0.01
This leads to a total of 2 × 3 × 2 = 12 experiments.
Other parameters can also be configured, such as the prompt loss weight, batch size, and number of epochs.
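As a reference, here is a minimal sketch of how such a grid could be launched with the legacy (pre-1.0) openai Python SDK, the same era as the `openai wandb sync` CLI; the equivalent `openai api fine_tunes.create` command works too. The file names follow the snippet above and are assumptions, not the exact files used for this report.

```python
import openai  # legacy (pre-1.0) openai SDK

# Upload the prepared JSONL files; each call returns an object with a "file-..." id.
train_20 = openai.File.create(file=open("train_20.jsonl", "rb"), purpose="fine-tune")
train_50k = openai.File.create(file=open("train_50k.jsonl", "rb"), purpose="fine-tune")
valid_10k = openai.File.create(file=open("valid_10k.jsonl", "rb"), purpose="fine-tune")

# The 2 x 3 x 2 grid described above: dataset size x base model x learning rate multiplier.
for size_name, train_file in [("20", train_20), ("50k", train_50k)]:
    for base_model in ["ada", "babbage", "curie"]:
        for lr_multiplier in [0.1, 0.01]:
            fine_tune = openai.FineTune.create(
                training_file=train_file["id"],
                validation_file=valid_10k["id"],
                model=base_model,
                learning_rate_multiplier=lr_multiplier,
            )
            print(fine_tune["id"], size_name, base_model, lr_multiplier)

# Once the jobs have finished, `openai wandb sync` pushes their metrics and artifacts to W&B.
```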
The top 2 models use Curie and a higher learning rate.
Results
The best way to evaluate a generative model is to explore sample predictions. In the table below, you can select which models you would like to visualize:
Let's see what our model predicts on a few celebrities!
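As an illustration, a fine-tuned model can be queried directly with an article title as the prompt. Here is a minimal sketch with the legacy SDK, where the fine-tuned model name is a hypothetical placeholder and the prompt uses the same separator and stop conventions as the earlier snippet:

```python
import openai  # legacy (pre-1.0) openai SDK

FINE_TUNED_MODEL = "curie:ft-your-org-2023-01-19"  # hypothetical fine-tuned model name

for title in ["Serena Williams", "Freddie Mercury", "Marie Curie"]:
    response = openai.Completion.create(
        model=FINE_TUNED_MODEL,
        prompt=f"{title}\n\n###\n\n",  # same separator used at training time
        max_tokens=64,
        temperature=0,
        stop=["\n"],                   # completions were trained to end with a newline
    )
    print(title, "->", response["choices"][0]["text"].strip())
```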
Resources
Tags: Beginner, NLP, Text Generation, OpenAI, Experiment, Tutorial, Panels, Plots, Tables, Large Models, LLM, Fine-tuning, GPT