# Pretraining a 124-M Parameter GPT-2 Language Model

This report describes my experience and learnings while training the 124-M parameter GPT-2 model. Made by Bilal using Weights & Biases

## TL;DR

A few months ago, I started working on a research project to pretrain my own, more efficient language model from scratch. I got access to a 128-core TPUv3 pod from the TensorFlow Research Cloud and used it to pretrain a $124$M parameter GPT-2 model to a perplexity pretty close to OpenAI's results: my model was trained for about $1/8$th of the number of iterations that OpenAI trained their model for and got $21$ ppl on OpenWebText, compared to $17$ ppl for OpenAI's model. I then pretrained an ALBERT-style GPT-2 language model (which I am calling ALGPT-2) with a factorized input embedding and layer-wise parameter sharing, which reduces the number of parameters in the model from $124$M to around $12$M.

Unfortunately, ALGPT-2 does not perform as well as GPT-2 (ALGPT-2 gets $31$ ppl on OpenWebText compared to $21$ ppl for my pretrained GPT-2 model), but I am writing this series of blog posts to go through everything I have learned over the last few months.

original version first published here

## The Idea

The main goal of this sort-of "research project" that I was working on by myself this spring was to develop and train a more efficient version of the $124$M parameter version of GPT-2. I originally wanted to pretrain the $1.5$B parameter version of GPT-2, but since I only had access to the TPU pod for a week, I had to choose a model that would train in time. A $100$k iteration training run takes about $20$ hours, which gave me enough time to run multiple experiments. In contrast, following OpenAI's training procedure exactly and training for the full $800$k iterations would take almost an entire week and use up most of my quota.

I was able to almost replicate the $124$M parameter version of GPT-2 by pretraining it to a perplexity pretty close to OpenAI's results: my pretrained model was trained for about $1/8$th of the number of iterations that OpenAI trained their model for and got $21$ perplexity (ppl) on the standard OpenWebText dataset, compared to $17$ ppl for OpenAI's model.

My idea of making a more efficient transformer did not work out, since my pretrained transformer ended up being about $20$ ppl worse than an equivalent GPT-2 model. However, I still wanted to write up what I learned over the two or three months that I was working on this.

## GPT-2

GPT-2 is a transformer decoder. The embedding layer at the input of the model maps a one-hot vector of a given token's index (all the GPT-2 models use a vocabulary size of $50257$) to a $768$-dimensional vector (all GPT-2 numbers in this blog post are for the $124$M parameter version of GPT-2).

The embedding matrix is followed by a stack of self-attention and feed-forward layers that each output a $768$ dimensional vector (keeping the number of outputs for each layer constant), which makes up the central part of the transformer.

The stack of transformer layers is then followed by an output embedding (the weights of the input and output embeddings are tied to make training easier) that maps the $768$-dimensional output of the last layer to a $50257$-dimensional vector of scores, which a softmax turns into the probability of each token in the vocabulary being the next token in the sequence.
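To make the weight tying concrete, here is a minimal pure-Python sketch with toy sizes. The names (`embed`, `next_token_probs`) and the tiny dimensions are only illustrative assumptions, not the real GPT-2 code; the point is that the same matrix is used both to look up input vectors and to score every vocabulary entry at the output.

```python
import math
import random

VOCAB_SIZE = 5  # toy value for illustration; GPT-2 uses 50257
N_EMBD = 3      # toy value for illustration; the 124M GPT-2 uses 768

random.seed(0)
# One shared matrix: row i is the embedding vector of token i.
embedding = [[random.gauss(0.0, 0.02) for _ in range(N_EMBD)]
             for _ in range(VOCAB_SIZE)]


def embed(token_id):
    """Input embedding: the same as multiplying a one-hot vector by the matrix."""
    return embedding[token_id]


def next_token_probs(hidden):
    """Tied output embedding: score every token with the same matrix, then softmax."""
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in embedding]
    biggest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(logit - biggest) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


hidden = embed(2)  # stand-in for the output of the transformer stack
probs = next_token_probs(hidden)
print(len(probs), sum(probs))
```

In a real model the transformer layers sit between `embed` and `next_token_probs`; the sketch skips them to keep the focus on the shared matrix.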

Take a look at The Illustrated GPT-2 for a more in-depth look into GPT-2.

## ALBERT

ALBERT (A Lite BERT) is a paper that takes a look at BERT and identifies some ways to make it more efficient and reduce the number of parameters in the model: a factorized embedding, layer-wise parameter sharing, a sentence-order-prediction auxiliary loss, and removing dropout.

### Factorized embedding

GPT-2's embedding has a lot of parameters. It is just a dense matrix of dimensions $50257 \times 768$. That means that the input embedding alone uses up $50257 \times 768 \approx 38.6$M parameters, which is a pretty big chunk of the $124$M total parameters in the model.

The ALBERT authors propose a factorized embedding with an intermediate embedding size of $128$: one embedding of size $50257 \times 128$ and another embedding of size $128 \times 768$. By breaking up the large embedding matrix into two smaller matrices, the total number of parameters used in the embedding goes from about $38$M to about $6$M.

$50257 \times 128 \approx 6,400,000$

$128 \times 768 \approx 100,000$

The authors try different intermediate embedding sizes and settle on $128$ as a reasonable tradeoff between parameters and performance.
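The arithmetic above can be checked with a few lines of Python. This is just a sketch of the parameter counts for the two embedding layouts; the function names are my own, and biases are ignored for simplicity.

```python
VOCAB_SIZE = 50257
N_EMBD = 768


def dense_embedding_params(vocab_size, n_embd):
    """A single vocab_size x n_embd lookup table."""
    return vocab_size * n_embd


def factorized_embedding_params(vocab_size, n_embd, embedding_size):
    """A small lookup table followed by a projection up to n_embd (no bias)."""
    return vocab_size * embedding_size + embedding_size * n_embd


dense = dense_embedding_params(VOCAB_SIZE, N_EMBD)
factorized = factorized_embedding_params(VOCAB_SIZE, N_EMBD, embedding_size=128)

print(f"dense embedding:      {dense:,} parameters")
print(f"factorized embedding: {factorized:,} parameters")
print(f"reduction:            {dense / factorized:.1f}x")
```

With an intermediate size of $128$, the embedding shrinks by a factor of roughly six.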

### Layer-wise parameter sharing

In a normal transformer model, the transformer layers are created something like this:

    import torch.nn as nn

    class BERT(nn.Module):
        def __init__(self, n_layers):
            super().__init__()
            # ...
            # one independent Block (with its own weights) per layer
            self.blocks = nn.ModuleList([Block() for _ in range(n_layers)])
            # ...

        def forward(self, x):
            # ...
            for block in self.blocks:
                x = block(x)
            # ...


ALBERT shares all parameters across the transformer layers something like this:

    import torch.nn as nn

    class ALBERT(nn.Module):
        def __init__(self, n_layers):
            super().__init__()
            # ...
            self.n_layers = n_layers
            # a single Block whose weights are reused by every layer
            self.block = Block()
            # ...

        def forward(self, x):
            # ...
            for _ in range(self.n_layers):
                x = self.block(x)
            # ...


By only defining one transformer block and looping over it `n_layers` times, ALBERT saves the GPU memory that would otherwise be used to store a separate set of parameters for every layer.

Since we usually use $32$-bit floats to store parameters on the GPU, storing the $1.5$B parameter GPT-2 on the GPU will use up about $6$GB of the GPU's memory. That is a pretty big chunk of the $16$GB of memory on a normal V100 GPU, and it is used up before taking into account the memory needed to store the model's activations as well as any momentum parameters needed by the optimizer. In contrast, if you share parameters across all transformer layers in the $1.5$B parameter GPT-2, the resulting model will only have about $37$M parameters. The parameter-sharing version would only use up around $148$MB of GPU memory.

The authors try applying parameter sharing to BERT and see that it reduces performance but makes it easier to train much larger models.

In a machine learning framework like JAX, a plain Python loop is unrolled and inlined when your code is traced and compiled with XLA, so the unrolled loop would make the computation graph really large and slow to compile. This is why it is recommended to use something like `lax.scan()` in these situations.
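The memory numbers above are just the parameter count times four bytes. A small sketch of that arithmetic follows; the helper name is mine, and the optimizer-state line assumes an Adam-style optimizer that keeps two extra float32 values per parameter.

```python
BYTES_PER_FLOAT32 = 4


def param_memory_gb(n_params, bytes_per_param=BYTES_PER_FLOAT32):
    """Memory needed to store the parameters alone, in gigabytes (10^9 bytes)."""
    return n_params * bytes_per_param / 1e9


full_model = param_memory_gb(1.5e9)   # 1.5B parameter GPT-2
shared_model = param_memory_gb(37e6)  # ~37M parameters with layer sharing

print(f"1.5B parameters: {full_model:.2f} GB")
print(f"37M parameters:  {shared_model * 1000:.0f} MB")

# Assumption: Adam-style optimizers store two extra float32 values per parameter,
# so weights plus optimizer state take roughly three times the weights alone.
print(f"1.5B parameters with Adam state: {3 * full_model:.0f} GB")
```

Under that assumption, the weights and optimizer state of the $1.5$B model would already exceed the $16$GB on a V100 before counting any activations.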

### Sentence-order-prediction auxiliary loss

The ALBERT authors add an auxiliary loss to help training. Since language modeling is usually done autoregressively, I did not use this for my custom model.

### Removing dropout

The ALBERT authors remove all dropout from BERT and see that it significantly improves performance.

That is pretty much what my idea was: Take GPT-2, add a factorized embedding, share parameters across all transformer layers, remove dropout (I missed the part about ALBERT removing dropout until I was pretty far into my work, but I did run one or two runs without dropout to see how that works), and pretrain on a large dataset for a few hundred thousand iterations.

There is no way that I could pretrain something like GPT-2 by myself, so I applied to the TensorFlow Research Cloud (TFRC). I emailed the TFRC team to ask if I could get upgraded from $5$ individual TPUv3s (with 8 cores each) to a TPU pod to pretrain a large language model. The very next day (!) I got an email back saying that I could get access to a preemptible 128-core TPUv3 Pod for 7 days, which unfortunately was not long enough for me to pretrain the $1.5$B parameter model but was enough to train a few runs on the $124$M model.

## Setup

In this section, I will go through all the steps that I took to set up my VM and TPU Pod and to preprocess the dataset.

When working on this project, I set up two VMs: one with lots of RAM and CPU cores to process the data quickly, and another small instance to run the TPU training script. One of the nice things about training on TPUs and TPU pods is that as long as your data has been preprocessed as a set of TFRecord files, you do not need a powerful VM instance, which saves you a lot of money and compute credits.

You can look at this for a full list of every command that I used to set up the VM and preprocess the dataset.

### OpenWebText

I used an `n1-standard-16` instance with TF2.1 to process the OpenWebText dataset. Make sure that you use an instance with an SSD instead of the default HDD, because processing the dataset involves processing many tiny text files and is mostly limited by your drive's I/O speed. I made the mistake of using an HDD, and just extracting the dataset's TAR archives took about 7 hours. I put all the data in a folder at `~/data/openwebtext/`, so modify the paths if you want to download the data elsewhere.

_TIL: most common Linux utilities (like `ls`, `mv`, and `cat`) aren't that optimized for working with almost 10 million files like in OpenWebText. Just counting the number of text files in the dataset could take several minutes._

Download the OpenWebText dataset (which is just a tar archive of a bunch of tar archives that contain many text files) and extract it:

    gdown https://drive.google.com/uc?id=1EA5V0oetDCOke7afsktL_JDQ-ETtNOvx
    tar -xf openwebtext.tar.xz
    cat *.xz | tar -J -xf - -i


The dataset is about 12GB compressed and 53GB uncompressed and has just about 8 million text files.

I moved the first $100,000$ files in the dataset to a separate directory to create a validation set:

    ls -f | head -100000 | xargs -i mv {} ../openwebtext-valid/
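The same split can also be sketched in Python, which avoids one quirk of the shell version: `ls -f` lists `.` and `..` along with the files. This is only an illustrative alternative, not the script I actually ran; the function name and the commented-out paths are assumptions.

```python
import os
import shutil


def move_first_n_files(src_dir, dst_dir, n):
    """Move the first n regular files that the OS lists in src_dir into dst_dir."""
    os.makedirs(dst_dir, exist_ok=True)

    # Collect the names first so the directory is not modified while it is scanned.
    names = []
    with os.scandir(src_dir) as entries:
        for entry in entries:
            if entry.is_file():  # skips sub-directories
                names.append(entry.name)
                if len(names) == n:
                    break

    for name in names:
        shutil.move(os.path.join(src_dir, name), os.path.join(dst_dir, name))
    return len(names)


# Hypothetical usage, mirroring the shell command above:
# move_first_n_files("data/openwebtext", "data/openwebtext-valid", 100_000)
```

Like `ls -f`, `os.scandir` returns entries in arbitrary directory order without sorting, which matters when there are millions of files.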


### Tokenization

I trained a byte-level BPE tokenizer with a vocabulary size of $50,257$ (the same as GPT-2) on a $1$M file subset of the training set. (I am not sure if GPT-2 trains the tokenizer on the entire dataset or just a subset, but I know that the CTRL paper trains their tokenizer on a 5% split of their training set.) I used Hugging Face's fast Rust-based Tokenizers library and their `ByteLevelBPETokenizer` tokenizer.

You can use my script here and run

    python3 train_tokenizer.py --train_path ./data/openwebtext/ --save_path ./tokenizer/ \
        --vocab_size 50257 --n_files 1000000


to train the tokenizer, or just take a look at this for the main details (It just trains a tokenizer and saves it as well as a configuration file to disk):

    import os
    import glob
    import json

    from tokenizers import ByteLevelBPETokenizer

    # train on (at most) the first 1M files of the training set
    paths = glob.glob(os.path.join('./data/openwebtext', '*'))[:1000000]

    tok = ByteLevelBPETokenizer()
    # args.vocab_size and args.control_codes come from the script's command-line arguments
    tok.train(files=paths, vocab_size=args.vocab_size, special_tokens=args.control_codes)
    tok.save('./tokenizer/')

    # save the maximum sequence length next to the tokenizer files
    tokenizer_config = {
        "max_len": 1024
    }

    with open(os.path.join('./tokenizer/', "tokenizer_config.json"), 'w') as fp:
        json.dump(tokenizer_config, fp)


### TFRecords

TPU Pods expect your data to be available as a set of TFRecord files in a GCP cloud bucket. The files get downloaded to the powerful VM built into each TPU board, which takes care of deserializing them and feeding the data to the TPU chips. Make sure that your GCP bucket and your TPU pod are in the same compute zone; otherwise, you will quickly rack up a lot of charges by transferring hundreds of GBs of data across compute zones.

Here is something that's not very well documented when working with TPU Pods (this does not apply to individual TPUs as much): TPU Pods create a lot (100s of GBs) of logs that get sent to Stackdriver, where you get charged about 50 cents for each GiB of logs ingested beyond a specific limit (I think it is around 50GiB/month). In just a few days of training, I ended up being charged about $100$ dollars, IIRC. Luckily, I still had most of my free GCP credits, so this did not become a significant problem for me, but make sure to turn off ingesting logs for TPUs.

I ran into a problem early on when I got access to the TPU pod: my code would work perfectly on a single TPU but would throw an `Out of range: End of sequence` error when running on a TPU pod. I struggled with this for a pretty long time until I took a look at this Kaggle discussion post, which says that TPUs expect each TPU board (8 cores) to get its own TFRecord file. Until that point, I was splitting the train set into 8 TFRecord files when I should have been splitting it into 16 (128 cores / 8 cores per board) TFRecord files.

**_TPUs are excellent for scaling to huge models and enormous datasets. However, there is a lot of TPU-specific information (especially for TPU Pods) that you need to know that is not covered in the documentation and is not easy to find._**
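The rule of thumb from that discussion post (one TFRecord file per 8-core TPU board) is easy to turn into a quick check. The sketch below is only that rule written out; the function names are my own, and the 8M figure is the approximate number of files in the train set.

```python
import math

CORES_PER_BOARD = 8  # each TPU board has 8 cores


def num_tfrecord_shards(n_cores):
    """One TFRecord file per TPU board."""
    if n_cores % CORES_PER_BOARD != 0:
        raise ValueError("expected a multiple of 8 cores")
    return n_cores // CORES_PER_BOARD


def files_per_shard(n_files, n_shards):
    """How many text files to put in each TFRecord file to end up with n_shards files."""
    return math.ceil(n_files / n_shards)


shards = num_tfrecord_shards(128)
print(shards)                              # number of TFRecord files for a 128-core pod
print(files_per_shard(8_000_000, shards))  # text files per TFRecord file for ~8M files
```

For a single 8-core TPU this gives one shard, which is why the problem only shows up once you move to a pod.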

You can use my script here and run

    python3 make_tfrecords.py --path ./data/openwebtext/ --save_path ./train/ --files_per_tfrecord 500000 \
        --use_control_codes --seq_len 1024 --min_seq_len --tokenizer ./tokenizer/

    python3 make_tfrecords.py --path ./data/openwebtext-valid/ --save_path ./val/ --files_per_tfrecord 50000 \
        --use_control_codes --seq_len 1024 --min_seq_len --tokenizer ./tokenizer/


to convert the raw text files from the train and validation splits into two sets of $16$ TFRecord files.

I ran a quick analysis of the lengths of the text files in the dataset: $67$% of files have fewer than $1024$ tokens, $35$% have fewer than $512$ tokens, and only $10$% have fewer than $256$ tokens. This means that if I wanted to make the dataset as clean as possible and have each input sequence to the model be a single contiguous stream of $1024$ tokens, the dataset would be a lot smaller. For this reason, the common practice is to prepend a token like `<|endoftext|>` to the beginning of each sequence and concatenate sequences shorter than $1024$ tokens together. The specifics of how exactly you do that (e.g., do you treat the dataset as a single stream of tokens and just break it up into sequences of length $1024$, or do you keep track of sequences smaller than $1024$ and just concatenate them together into a single sequence) really should not make too big of a difference in your model's performance. However, you can take a look at my implementation here.
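As an illustration of the first option (treating the dataset as one long token stream), here is a minimal sketch. It is not the implementation linked above; the function name is mine, and token id `0` stands in for `<|endoftext|>` in the toy example.

```python
from typing import Iterable, Iterator, List


def pack_sequences(docs: Iterable[List[int]], seq_len: int, eos_id: int) -> Iterator[List[int]]:
    """Prepend eos_id to every document, concatenate, and cut into seq_len chunks."""
    buffer: List[int] = []
    for tokens in docs:
        buffer.append(eos_id)  # marks the start of a new document
        buffer.extend(tokens)
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            buffer = buffer[seq_len:]
    # Any leftover tokens shorter than seq_len are dropped in this sketch.


# Toy example: three short "documents" packed into sequences of length 4.
docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
packed = list(pack_sequences(docs, seq_len=4, eos_id=0))
print(packed)
```

Every packed sequence has exactly `seq_len` tokens, and document boundaries can fall anywhere inside a sequence.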

My version does not take full advantage of the fast, multithreaded batch_encode_plus() way to tokenize large datasets in parallel since it only keeps the first context_len tokens in each line of the files, which makes dealing with files with more or less than $1024$ tokens harder. Thus, tokenizing the dataset takes about $8$ hours, which is something I want to improve.

The train set comes out to about $26$GB and consists of about $8$M text files that have been transformed into just under $7$M TFRecord examples, each with $1024$ tokens (same as GPT-2). The validation set comes out to about $300$MB and consists of about $100$K text files that have been transformed into about $90$K TFRecord examples, each with $1024$ tokens (also the same as GPT-2).

## Code

Since I am using TPUs, the only library that you can practically use right now is TensorFlow. I did not want to go through the learning curve of writing custom training loops in TF2, so I just stuck to using Keras. You can take a look at my training script (it is pretty short) here. It is pretty simple, so I will not copy the entire training script, but I will talk about a few small code snippets.

I usually like to add a ptvsd breakpoint to my script so I can debug my training script locally with VS Code before pushing it up to my VM:

    if args.debug:
        import ptvsd

        # start the debug server (on ptvsd's default host and port) and
        # block until the VS Code debugger attaches
        ptvsd.enable_attach(redirect_output=True)
        ptvsd.wait_for_attach()
        breakpoint()


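I am using Weights & Biases to keep track of my experiments and save checkpoints.

    import wandb
    from wandb.keras import WandbCallback

    wandb.login()
    wandb.init(project='lm-finetuning', config=args, tags=args.tags)

    ...

    # log Keras metrics to W&B without having the callback save the model
    wandb_callback = WandbCallback(save_model=False)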


Usually, when you are using a TPU with Keras, you pass in the IP address and port of the TPU to `TPUClusterResolver`, but you pass the name of the TPU itself to the resolver when using a TPU Pod.

    resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu=args.tpu)
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)


## Replicating GPT-2

I tried to use as many as possible of the original hyperparameters that OpenAI used when replicating their $124$M parameter version of GPT-2. However, I had to modify a few things so I could train everything in time.

Note: For some reason, the authors of the GPT-2 paper do not state precisely what learning rates they used for training their models and instead just state, "The learning rate of each model was manually tuned for the best perplexity on a 5% held-out sample of WebText".

OpenAI trains their models for a total of $800$K iterations at a batch size of $512$ (which comes out to around $60$ epochs through the training set).

I trained my GPT-2 model for $1/8$th the number of iterations that OpenAI trained theirs for (a total of around $100$K iterations), since each $100$K iteration training run took about $20$ hours to run on my 128-core TPU Pod. If I wanted to train GPT-2 for the same number of iterations as OpenAI, a single training run would have used up most of my one week of access to the pod.
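The numbers in this section fit together with some quick arithmetic. The sketch below assumes a batch size of $512$ and a train set of roughly $7$M examples of $1024$ tokens each (the size of my TFRecord train set); both the helper name and those round figures are assumptions for illustration.

```python
def epochs(iterations, batch_size, n_examples):
    """Number of passes over the training set for a given number of iterations."""
    return iterations * batch_size / n_examples


N_EXAMPLES = 7_000_000  # assumption: "just under 7M" examples in the train set
BATCH_SIZE = 512        # assumption: same batch size as OpenAI

full_run_epochs = epochs(800_000, BATCH_SIZE, N_EXAMPLES)
short_run_epochs = epochs(100_000, BATCH_SIZE, N_EXAMPLES)
print(f"800K iterations: {full_run_epochs:.1f} epochs")
print(f"100K iterations: {short_run_epochs:.1f} epochs")

# Wall-clock time: 100K iterations took about 20 hours on the 128-core pod.
HOURS_PER_100K = 20
full_run_hours = HOURS_PER_100K * 800_000 / 100_000
print(f"800K iterations: {full_run_hours:.0f} hours ({full_run_hours / 24:.1f} days)")
```

Under these assumptions the full $800$K iteration run works out to a little under a week of pod time, which is why I settled for $1/8$th of the iterations.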

Since my TPU pod was preemptible and reset every $24$ hours, I usually had to resume my training run at least once, which is why all of these graphs usually have two or more training runs on them.

### Replicating GPT-2

So here is my model that came really close to replicating GPT-2. The training perplexity is about $21.5$ at the end of almost $90$K training iterations. For comparison, GPT-2 gets a training perplexity of about $17.5$ after about $800$K training iterations, so the difference is only about $4$ ppl.
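For readers who are more used to looking at the loss than at perplexity: perplexity is the exponential of the mean per-token cross-entropy. The sketch below converts the two perplexities above back to losses, assuming the loss is measured in nats (natural log), as is usual for cross-entropy losses; the function names are my own.

```python
import math


def perplexity(mean_cross_entropy):
    """Perplexity from the mean per-token cross-entropy (in nats)."""
    return math.exp(mean_cross_entropy)


def loss_from_perplexity(ppl):
    """Inverse of perplexity(): the mean per-token cross-entropy (in nats)."""
    return math.log(ppl)


my_loss = loss_from_perplexity(21.5)      # my replication
openai_loss = loss_from_perplexity(17.5)  # OpenAI's GPT-2

print(f"loss at 21.5 ppl: {my_loss:.3f}")
print(f"loss at 17.5 ppl: {openai_loss:.3f}")
print(f"difference:       {my_loss - openai_loss:.3f} nats per token")
```

A gap of about $4$ ppl at this point corresponds to a difference of roughly $0.2$ in the loss.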

I made a colab notebook showing how to use my pretrained GPT-2 model to generate text


I wanted to use the memory-saving Adafactor optimizer to make it easier to train larger language models, but all of my Adafactor training runs were a lot worse (~5 ppl, IIRC) than the ones using AdamW. This may be due to not using Adafactor's momentum parameter or relative update scale, so this is something I want to look into more soon.

### Learning Rates

I started out using Adam with a learning rate of $1e-4$, but I quickly figured out that I could train my models a lot faster by using a higher learning rate like $1e-3$.

Section 2 of the GPT-3 paper lists the learning rates the OpenAI team used for different sized models when training GPT-3. They use a learning rate of $6e-4$ for the $124$M version of their model and decrease the learning rate with model size.

As you can see from this partial training run, it's pretty clear that the loss decreases a lot faster with a learning rate of 1e-3. This doesn't mean that training at a lower learning rate will necessarily lead to a worse model, but it likely will take a lot more training iterations to get there.

## Pretraining ALGPT-2

Since I was using the Hugging Face Transformers repository's implementations of GPT-2 and ALBERT, I just forked the repository and modified a few files to implement my ALGPT-2 model. You can take a look at all the changes that I had to make here. Most of the changes are only there to make ALGPT-2 compatible with the Transformers library and to be able to use the useful abstractions that it gives you. The important code is in the `modelling_algpt2.py` file, in which I just copied over the contents of `modelling_gpt2.py` and changed a few parts of the code. I'm only showing the changes that I made to the PyTorch version of ALGPT-2 here; the changes in the TF version are pretty similar and can be seen here.

### Implementing Parameter Sharing

Implementing parameter sharing only involves changing a few lines of code:

     class ALGPT2Model(ALGPT2PreTrainedModel):
         def __init__(self, config):
             super().__init__(config)

             ...

    -        self.h = nn.ModuleList([Block(config.n_ctx, config, scale=True)
    -            for _ in range(config.n_layer)])
    +        self.h = Block(config.n_ctx, config, scale=True)

             ...

         def forward(self, ...):

             ...

             if past is None:
                 past_length = 0
    -            past = [None] * len(self.h)
    +            past = [None] * self.config.n_layer

             ...

    -        for i, (block, layer_past) in enumerate(zip(self.h, past)):
    +        for i in range(self.config.n_layer):
    +            layer_past = past[i]

                 if self.output_hidden_states:
                     all_hidden_states = all_hidden_states + (hidden_states.view(*output_shape),)
    -
    -            outputs = block(
    +            outputs = self.h(
                     hidden_states,
                     layer_past=layer_past,
                     use_cache=use_cache,
                 )
             ...



### Implementing a Factorized Embedding

Adding a factorized embedding is a little more work:

In the config.json that you use for your ALGPT-2 model, you need to specify that you want to use the ALGPT-2 and you need to specify the dimension of the factorized embedding that you want to use:

     {
         "attn_pdrop": 0.1,
         "bos_token_id": 50256,
         "embd_pdrop": 0.1,
         "eos_token_id": 50256,
         "initializer_range": 0.02,
         "layer_norm_epsilon": 1e-5,
    +    "model_type": "algpt2",
         "n_ctx": 1024,
         "n_embd": 768,
         "n_layer": 12,
         "n_positions": 1024,
         "resid_pdrop": 0.1,
         "summary_activation": null,
         "summary_first_dropout": 0.1,
         "summary_proj_to_labels": true,
         "summary_type": "cls_index",
         "summary_use_proj": true,
         "vocab_size": 50257,
    +    "embedding_size": 128
     }


Back in `modelling_algpt2.py`, define the two factorized embedding matrices (the second matrix is really just a simple linear layer):

     class ALGPT2Model(ALGPT2PreTrainedModel):
         def __init__(self, config):
             super().__init__(config)

             ...

    -        self.wte = nn.Embedding(config.vocab_size, config.n_embd)
    -        self.wpe = nn.Embedding(config.n_positions, config.n_embd)
    +        self.wte = nn.Embedding(config.vocab_size, config.embedding_size)
    +        self.wpe = nn.Embedding(config.n_positions, config.embedding_size)

    +        self.projection_layer = nn.Linear(config.embedding_size, config.n_embd)

             ...

         def forward(self, ...):

             ...

             hidden_states = inputs_embeds + position_embeds + token_type_embeds

    +        hidden_states = self.projection_layer(hidden_states)

             ...

     # in the language-modeling head class that wraps the model above:
         def __init__(self, config):
             super().__init__(config)

             ...

    -        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
    +        self.dense = nn.Linear(config.n_embd, config.embedding_size)
    +        self.lm_head = nn.Linear(config.embedding_size, config.vocab_size, bias=False)

         def forward(self, ...):

             ...

    +        dense = self.dense(hidden_states)
             ...


### Effect of Layer-Wise Parameter Sharing

This version of ALGPT-2 has about $47$M parameters while GPT-2 has $124$M. This ALGPT-2 model with parameter sharing trains a lot faster than GPT-2 ($9$ hours vs $20$ hours for a $90$K iteration training run), but is consistently about $10$ ppl worse than GPT-2 ($31$ vs $21$ ppl).
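The $124$M and $47$M figures can be reproduced by counting parameters by hand. The sketch below assumes the standard GPT-2 block layout (a fused query/key/value projection, an output projection, a 4x-wide MLP, two layer norms, all with biases) and an output embedding tied to the input embedding; the function names are my own and this is not code from the training scripts.

```python
def block_params(d):
    """Parameters in one GPT-2 transformer block of width d (weights plus biases)."""
    attention = (d * 3 * d + 3 * d) + (d * d + d)  # fused QKV projection + output projection
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)    # up-projection + down-projection
    layer_norms = 2 * (2 * d)                      # two layer norms, scale and shift each
    return attention + mlp + layer_norms


def gpt2_params(vocab=50257, n_positions=1024, d=768, n_layer=12, share_layers=False):
    """Total parameters, with the output embedding tied to the input embedding."""
    embeddings = vocab * d + n_positions * d       # token + position embeddings
    unique_blocks = 1 if share_layers else n_layer # sharing keeps a single block
    final_layer_norm = 2 * d
    return embeddings + unique_blocks * block_params(d) + final_layer_norm


baseline = gpt2_params()
shared = gpt2_params(share_layers=True)
print(f"GPT-2 (12 separate blocks): {baseline:,}")
print(f"ALGPT-2 (1 shared block):   {shared:,}")
```

Sharing the block removes $11$ of the $12$ copies of the per-layer weights, but the $38.6$M parameter embedding is untouched, which is why the factorized embedding is needed to get the model much smaller.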

This difference is quite a bit larger than the difference between ALBERT and BERT, but might be explained by masked language modelling being an easier task than autoregressive language modelling. Increasing the size of the ALGPT-2 model might make it more competitive with GPT-2.


### Effect of Removing Dropout

I ran a partial training run with dropout removed from ALGPT-2. I did not run it for very long, but it looks like removing dropout gives you a slight improvement (~3 ppl).


### Effect of Factorized Embeddings

I ran three experiments for $90$K iterations with three different factorized embedding sizes ($128$, $256$, and $512$), as well as the baseline version without a factorized embedding.

| Model | ALGPT-2 | ALGPT-2 512 | ALGPT-2 256 | ALGPT-2 128 |
| --- | --- | --- | --- | --- |
| Parameters | 47M | 34M | 20M | 13M |
| Time | ~9H | ~9H | ~9H | ~9H |
| Perplexity | 31 | 31 | 34 | 38 |

There was practically no difference in the loss curves between the baseline and the $512$ run since the change in the number of parameters wasn't that great. However, the training runs with factorized embeddings of sizes $256$ and $128$ were significantly worse than the baseline: $34$ and $38$ ppl respectively, a pretty big difference from the baseline of $31$ ppl.


### Effect of Model Size

I only had the time to run one more full training run with ALGPT-2-medium (this one is comparable to the $345$M version of GPT-2). ALGPT-2-medium has about $66$M parameters and took twice as long as ALGPT-2 to train (a little more than $20$ hours). The larger model size made quite a big difference in performance: the training perplexity decreased by $5$ ppl, from $31$ to $26$ ppl.

<iframe title='Effect of model size' src='https://app.wandb.ai/bkkaggle/lm-finetuning/reports/Effect-of-model-size--VmlldzoxNzI0OTM' height='600px' width='100%'></iframe>

## Conclusion and Next Steps

After my TPU pod's quota was used up, I started working on a [few](https://github.com/bkkaggle/L2) other things over the summer and just kept delaying writing up what I did for a couple of months until now.

There are a lot of things that I still want to work on or look into:

• Training larger versions of ALGPT-2
• Removing or replacing the normalization layers in transformers
• Working on distilling/shrinking language models with billions of parameters to make them more accessible
• Applying something like PPLM to condition language models for few-shot inference (kinda like what GPT-3 does)

Thanks for reading through all this. If you think there are any mistakes or inaccuracies in this post, please let me know.