Pretraining a 124-M Parameter GPT-2 Language Model

This article describes experience and learnings from training the 124-M parameter GPT-2 model.
Created on September 4|Last edited on July 15
A few months ago, I started working on a research project trying to pre-train my own, more efficient language model from scratch. 
I got access to a 128-core TPUv3 pod from the TensorFlow Research Cloud and used it to pre-train a 124M parameter GPT-2 model to a perplexity pretty close to OpenAI's results: my model was trained for about 1/8th of the number of iterations that OpenAI trained their model for and got 21 ppl on OpenWebText, compared to 17 ppl for OpenAI's model. I then pre-trained an ALBERT-style GPT-2 (that I am calling ALGPT-2) with a factorized input embedding and layer-wise parameter sharing, which reduces the number of parameters in the model from 124M to around 13M.
Unfortunately, ALGPT-2 does not perform as well as GPT-2 (ALGPT-2 gets 31 ppl on OpenWebText compared to 21 ppl for my pre-trained GPT-2 model), but I am writing this series of blog posts to go through everything I have learned over the last few months.
The original version was first published here. ﻿
Here's what this article will cover: 
Table of Contents
The Idea
GPT-2
What is ALBERT in Machine Learning?
Setup
TFRecords
Code
Replicating GPT-2
Pretraining ALGPT-2
Conclusion and Next Steps
Let's dig in! 
The Idea
The main goal of this solo "research project" that I worked on this spring was to develop and train a more efficient version of the 124M parameter version of GPT-2.
I wanted to pretrain the 1.5B parameter version of GPT-2, but since I only got access to the TPU pod for a week, I had to choose a model that would train in time. A 100k iteration training run takes about 20 hours, which gave me plenty of time to run multiple experiments. In contrast, following OpenAI's training procedure exactly and training for the full 800k iterations would take almost an entire week and use up most of my quota.
I was able to almost replicate the 124M parameter version of GPT-2 by pretraining it to a perplexity pretty close to OpenAI's results: my model was trained for about 1/8th of the number of iterations that OpenAI trained their model for and got 21 perplexity (ppl) on the standard OpenWebText dataset, compared to 17 ppl for OpenAI's model.
My idea of making a more efficient transformer did not work out, since my pretrained transformer ended up being about 20 ppl worse than an equivalent GPT-2 model. However, I wanted to write up what I learned over the two or three months that I was working on this anyway.
GPT-2
GPT-2 is a transformer decoder. The embedding layer at the root of the model maps a one-hot vector of a given token's index (all the GPT-2 models use a vocabulary size of 50257) to a 768-dimensional vector (all GPT-2 numbers in this blog post are for the 124M parameter version of GPT-2).
The embedding matrix is followed by a stack of self-attention and feed-forward layers that each output a 768-dimensional vector (keeping the number of outputs for each layer constant), which makes up the central part of the transformer.
The stack of self-attention layers is then followed by an output embedding (the weights of the input and output embeddings are tied to make training easier) that maps the 768-dimensional output of the last transformer layer to a 50257-dimensional vector that represents the probability of each token in the vocabulary being the next token in the sequence.
Take a look at The Illustrated GPT-2 for a more in-depth look into GPT-2.
What is ALBERT in Machine Learning?
ALBERT (A Lite BERT) is a paper that takes a look at BERT and identifies some ways to make it more efficient and reduce the number of parameters in the model: a factorized embedding, layer-wise parameter sharing, a sentence-order-prediction auxiliary loss, and removing dropout.
Factorized embedding
GPT-2's embedding has many parameters. It is just a dense matrix of dimensions $50257 \times 768$. That means that the input embedding alone uses up $50257 \times 768 \approx 38{,}600{,}000$ parameters, which is a pretty big chunk of the 124M total parameters in the model.
The ALBERT authors propose a factorized embedding with an intermediate embedding size of 128: one embedding of size $50257 \times 128$ and another embedding of size $128 \times 768$. By breaking up the large embedding matrix into two smaller matrices, the total number of parameters used in the embedding goes from about 38.6M to about 6.5M.
$50257 \times 128 \approx 6{,}400{,}000$
$128 \times 768 \approx 100{,}000$
The authors try different intermediate embedding sizes and settle on 128 as a reasonable tradeoff between parameters and performance.
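The arithmetic above can be checked with a few lines of Python. This is only a sketch: the helper name `embedding_params` is made up for illustration, and it counts weights only (no biases).

```python
from typing import Optional


def embedding_params(vocab_size: int, hidden_size: int,
                     factor_size: Optional[int] = None) -> int:
    """Count the weights in a dense or a factorized token embedding."""
    if factor_size is None:
        # One dense vocab_size x hidden_size matrix.
        return vocab_size * hidden_size
    # A vocab_size x factor_size lookup followed by a
    # factor_size x hidden_size projection.
    return vocab_size * factor_size + factor_size * hidden_size


dense = embedding_params(50257, 768)
factorized = embedding_params(50257, 768, factor_size=128)
print(f"dense:      {dense:,}")       # dense:      38,597,376
print(f"factorized: {factorized:,}")  # factorized: 6,531,200
```

The factorized version is roughly six times smaller, and almost all of what remains sits in the vocabulary-sized lookup table.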
Layer-Wise Parameter Sharing
In a normal transformer model, the transformer layers are created something like this:
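import torch.nn as nn

class BERT(nn.Module):
    def __init__(self, n_layers):
        super().__init__()
        # ...
        # Block is the usual transformer layer (attention + feed-forward), defined elsewhere.
        # One independent Block, with its own weights, per layer.
        self.blocks = nn.ModuleList([Block() for _ in range(n_layers)])
        # ...

    def forward(self, x):
        # ...
        for block in self.blocks:
            x = block(x)
        # ...
        return x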
ALBERT shares all parameters across the transformer layers something like this:
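class ALBERT(nn.Module):
    def __init__(self, n_layers):
        super().__init__()
        # ...
        self.n_layers = n_layers
        # A single Block whose weights are reused by every layer.
        self.block = Block()
        # ...

    def forward(self, x):
        # ...
        # Apply the same block n_layers times.
        for _ in range(self.n_layers):
            x = self.block(x)
        # ...
        return x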
By only defining one transformer block and looping around it n_layers times, ALBERT saves the GPU memory used to store the parameters for all the layers.
Since we usually use 32-bit floats to store parameters on the GPU, storing the 1.5B parameter GPT-2 on the GPU will use up about 6GB of the GPU's memory. That is a pretty big chunk of the 16GB of memory on a normal V100 GPU already being used up before taking into account the memory needed to store the model's activations as well as any momentum parameters needed by the optimizer. In contrast, if you share parameters across all transformer layers in the 1.5B parameter GPT-2 (and also factorize its embedding as described above), the resulting model will only have about 37M parameters. The parameter-sharing version would only use up around 148MB of GPU memory.
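As a rough check of those memory figures, here is a small sketch that converts parameter counts into bytes. The helper name is hypothetical, and the last line assumes an Adam-style optimizer that keeps two extra 32-bit values per parameter, which is an assumption about the optimizer and not something measured in this project.

```python
BYTES_PER_FP32 = 4  # a 32-bit float takes 4 bytes


def param_memory_bytes(n_params: int, bytes_per_param: int = BYTES_PER_FP32) -> int:
    """Memory needed just to hold the parameters (no activations, no optimizer state)."""
    return n_params * bytes_per_param


full = param_memory_bytes(1_500_000_000)  # 1.5B parameter GPT-2
shared = param_memory_bytes(37_000_000)   # the ~37M parameter shared version

print(f"{full / 1e9:.1f} GB")    # 6.0 GB
print(f"{shared / 1e6:.0f} MB")  # 148 MB

# Assumption: Adam-style optimizers store two moment estimates per parameter,
# so weights plus optimizer state take roughly three times the weight memory.
print(f"{3 * full / 1e9:.1f} GB")  # 18.0 GB
```

Under that assumption the 1.5B model would not fit in 16GB with its optimizer state even before counting activations, which is why the savings from sharing are attractive.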
The authors try applying parameter sharing to BERT and see that it reduces performance but makes it easier to train larger models.
In a machine learning framework like JAX, which by default unrolls and inlines loops when it is compiling your code with XLA, the size of the unrolled and inlined loop would make the computation graph really large and take a long time to compile. This is why you are recommended to use something like lax.scan() in these situations.
Sentence-Order-Prediction Auxiliary Loss
The ALBERT authors add an auxiliary loss to help training. Since language modeling is usually done autoregressively, I did not use this for my custom model.
Removing Dropout
The ALBERT authors remove all dropout from BERT and see that it significantly improves performance.

That is pretty much what my idea was: take GPT-2, add a factorized embedding, share parameters across all transformer layers, remove dropout, and pre-train on a large dataset for a few hundred thousand iterations. (I missed the part about ALBERT removing dropout until I was pretty far into my work, but I did run one or two runs without dropout to see how that works.)
There is no way that I could pre-train something like GPT-2 by myself, so I applied to the TensorFlow Research Cloud (TFRC). I emailed the TFRC team to ask if I could get upgraded from 5 individual TPUv3s (with 8 cores each) to a TPU pod to pretrain a large language model. The very next day (!) I got an email back saying that I could get access to a preemptible 128-core TPUv3 Pod for 7 days, which unfortunately was not long enough for me to pretrain the 1.5B parameter model but was enough to train a few runs on the 124M model.
Setup
In this section, I will go through all the steps that I took to set up my VM and TPU Pod and to preprocess the dataset.
When working on this project, I set up two VMs: one with a lot of RAM and CPU cores to process the data quickly, and another small instance to run the TPU training script. One of the nice things about training on TPUs and TPU pods is that as long as your data has been preprocessed as a set of TFRecord files, you do not need a powerful VM instance, which saves you a lot of money/compute credits.
You can look at this for a full list of every command that I used to set up the VM and preprocess the dataset.
OpenWebText
I used an n1-standard-16 instance with TF2.1 to process the OpenWebText dataset. Make sure that you use an instance with an SSD instead of the default HDD, because processing the dataset involves processing many tiny text files and is mostly limited by your drive's I/O speed. I made the mistake of using an HDD, and just extracting the dataset's TAR archives took about 7 hours. I put all the data in a folder at ~/data/openwebtext/, so modify the paths if you want to download the data elsewhere.
TIL: most common Linux utilities (like ls, mv, and cat) aren't that optimized for working with almost 10 million files like in OpenWebText. Just counting the number of text files in the dataset could take several minutes.
Download the OpenWebText dataset (which is just a tar archive of a bunch of tar archives that contain many text files) and extract it:
gdown https://drive.google.com/uc?id=1EA5V0oetDCOke7afsktL_JDQ-ETtNOvx
tar -xf openwebtext.tar.xz
cat *.xz | tar -J -xf - -i
The dataset is about 12GB compressed and 53GB uncompressed and has just about 8 million text files.
I moved the first 100,000 files in the dataset to a separate directory to create a validation set:
ls -f | head -100000 | xargs -i mv {} ../openwebtext-valid/
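If you would rather do the split from Python, the sketch below does roughly the same thing as the shell pipeline. It is not the script I used; `move_first_n_files` is a hypothetical helper. Like `ls -f`, it takes files in directory order (not sorted), but it only ever picks regular files, so it cannot pick up the `.` and `..` entries that `ls -f` also lists.

```python
import os
import shutil
from itertools import islice


def move_first_n_files(src_dir: str, dst_dir: str, n: int) -> int:
    """Move up to n regular files from src_dir into dst_dir; return how many moved."""
    os.makedirs(dst_dir, exist_ok=True)
    # Collect the names first so the directory is not modified while it is being listed.
    with os.scandir(src_dir) as entries:
        files = (entry for entry in entries if entry.is_file())
        names = [entry.name for entry in islice(files, n)]
    for name in names:
        shutil.move(os.path.join(src_dir, name), os.path.join(dst_dir, name))
    return len(names)


# Example call (paths follow the layout used in this post):
# move_first_n_files("./data/openwebtext", "./data/openwebtext-valid", 100_000)
```

`os.scandir` streams directory entries instead of building one huge sorted listing, which helps when a directory holds millions of files.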
Tokenization
I trained a byte-level BPE tokenizer with a vocabulary size of 50,257 (the same as GPT-2) on a 1M file subset of the training set. (I am not sure if GPT-2 trains the tokenizer on the entire dataset or just a subset, but I know that the CTRL paper trains their tokenizer on a 5% split of their training set.) I used Hugging Face's fast Rust-based Tokenizers library and their ByteLevelBPETokenizer tokenizer.
You can use my script here and run
python3 train_tokenizer.py --train_path ./data/openwebtext/ --save_path ./tokenizer/ \
    --vocab_size 50257 --n_files 1000000
to train the tokenizer, or just take a look at this for the main details (It just trains a tokenizer and saves it as well as a configuration file to disk):
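import os
import glob
import json

from tokenizers import ByteLevelBPETokenizer

# Use (up to) the first 1M text files of the training set.
paths = glob.glob(os.path.join('./data/openwebtext', '*'))[:1000000]

# `args` holds the parsed command-line options of the full script
# (vocabulary size, control codes, ...); it is not defined in this excerpt.
tok = ByteLevelBPETokenizer()
tok.train(files=paths, vocab_size=args.vocab_size, special_tokens=args.control_codes)
# Writes the vocabulary and merges files; newer releases of tokenizers
# expect tok.save_model('./tokenizer/') for a directory instead.
tok.save('./tokenizer/')

# Also store the maximum sequence length alongside the tokenizer files.
tokenizer_config = {
    "max_len": 1024
}

with open(os.path.join('./tokenizer/', "tokenizer_config.json"), 'w') as fp:
    json.dump(tokenizer_config, fp)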
TFRecords
TPU Pods expect your data to be available as a set of TFRecord files in a GCP cloud bucket. The files get downloaded to the powerful VM built into each TPU board, which takes care of deserializing them and feeding the data to the TPU chips. Make sure that your GCP bucket and your TPU pod are in the same compute zone; otherwise, you will quickly rack up charges by transferring hundreds of GBs of data across compute zones.
Here is a thing that's not very well documented when working with TPU Pods (this does not apply to individual TPUs as much): TPU Pods create a lot (100s of GBs) of logs that get sent to Stackdriver, where you get charged about 50 cents for each GiB of logs ingested beyond a specific limit (I think it is around 50GiB/month). In just a few days of training, I ended up being charged about $100 IIRC. Luckily, I still had most of my free GCP credits, so this did not become a significant problem for me, but make sure to turn off ingesting logs for TPUs.
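To get a feel for how quickly this adds up, here is a back-of-the-envelope sketch. The price and free allowance are the approximate figures recalled above, not official pricing, and `logging_cost_usd` is just an illustrative helper.

```python
def logging_cost_usd(ingested_gib: float,
                     free_gib: float = 50.0,
                     usd_per_gib: float = 0.50) -> float:
    """Estimate the log ingestion charge; only volume above the free allowance is billed."""
    billable_gib = max(0.0, ingested_gib - free_gib)
    return billable_gib * usd_per_gib


print(logging_cost_usd(250))  # 100.0
print(logging_cost_usd(10))   # 0.0
```

With these assumed numbers, a bill of about $100 corresponds to roughly 250GiB of ingested logs, which is consistent with "100s of GBs" over a few days.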
I ran into a problem early on when I got access to the TPU pod: my code would work perfectly on a single TPU but would throw an Out of range: End of sequence error when running on a TPU pod. I struggled with this for a pretty long time until I took a look at this Kaggle discussion post, which says that TPUs expect each TPU board (8 cores) to get its own TFRecord file. Until that point, I was splitting the train set into 8 TFRecord files where I should have been splitting it into 16 (128 cores / 8 cores per board) TFRecord files.
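The rule of thumb from that discussion can be written down as a small sketch. The helper names are hypothetical, and the round-robin assignment is just one simple way to spread input files over shards; it is not necessarily what my script does.

```python
from typing import List, Sequence

CORES_PER_BOARD = 8  # each TPU board has 8 cores


def num_shards(total_cores: int, cores_per_board: int = CORES_PER_BOARD) -> int:
    """One TFRecord file per TPU board."""
    if total_cores % cores_per_board != 0:
        raise ValueError("total_cores must be a multiple of cores_per_board")
    return total_cores // cores_per_board


def assign_round_robin(files: Sequence[str], n_shards: int) -> List[List[str]]:
    """Spread the input files over n_shards as evenly as possible."""
    shards: List[List[str]] = [[] for _ in range(n_shards)]
    for i, path in enumerate(files):
        shards[i % n_shards].append(path)
    return shards


print(num_shards(128))  # 16
print([len(s) for s in assign_round_robin([f"{i}.txt" for i in range(10)], 4)])  # [3, 3, 2, 2]
```

For a 128-core pod this gives the 16 TFRecord files mentioned above.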
TPUs are excellent for scaling to huge models and enormous datasets. However, there is a lot of TPU-specific information (especially for TPU Pods) that you need to know that is not covered in the documentation and is not easy to find.
You can use my script here and run
python3 make_tfrecords.py --path ./data/openwebtext/ --save_path ./train/ --files_per_tfrecord 500000 \
    --use_control_codes --seq_len 1024 --min_seq_len --tokenizer ./tokenizer/
python3 make_tfrecords.py --path ./data/openwebtext-valid/ --save_path ./val/ --files_per_tfrecord 50000 \
    --use_control_codes --seq_len 1024 --min_seq_len --tokenizer ./tokenizer/
to convert the raw text files from the train and validation splits into two sets of 16 TFRecord files.
I ran a quick analysis of the lengths of the text files in the dataset: 67% of files have fewer than 1024 tokens, 35% of files have fewer than 512 tokens, and only 10% of files have fewer than 256 tokens. This means that if I wanted to make the dataset as clean as possible and have each input sequence to the model be a single contiguous stream of 1024 tokens, the dataset would be a lot smaller. For this reason, the common approach is to prepend a token like <|endoftext|> to the beginning of each sequence and concatenate sequences shorter than 1024 tokens.
The specifics of how exactly you do that (e.g., do you treat the dataset as a single stream of tokens and just break it up into sequences of length 1024, or do you keep track of sequences shorter than 1024 and just concatenate them together into a single sequence) really should not make too big of a difference in your model's performance. However, you can take a look at my implementation here.
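To make the first option concrete, here is a minimal sketch of the "single stream of tokens" approach. It is not my actual implementation; `pack_sequences` is a hypothetical helper that works on already-tokenized documents and drops the final partial block.

```python
from typing import Iterable, Iterator, List


def pack_sequences(docs: Iterable[List[int]], seq_len: int, eot_id: int) -> Iterator[List[int]]:
    """Treat the corpus as one token stream and cut it into blocks of seq_len tokens."""
    buffer: List[int] = []
    for tokens in docs:
        buffer.append(eot_id)  # mark the start of every document
        buffer.extend(tokens)
        # Emit as many full blocks as the buffer currently holds.
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            buffer = buffer[seq_len:]
    # Whatever is left in the buffer is shorter than seq_len and is dropped.


# Toy example with seq_len=4 and token id 0 standing in for <|endoftext|>.
toy_docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
print(list(pack_sequences(toy_docs, seq_len=4, eot_id=0)))
# [[0, 1, 2, 3], [0, 4, 5, 0], [6, 7, 8, 9]]
```

For the real dataset you would call it with seq_len=1024 and the id of the <|endoftext|> token (50256 in the GPT-2 vocabulary).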
My version does not take full advantage of the fast, multithreaded batch_encode_plus() way to tokenize large datasets in parallel since it only keeps the first context_len tokens in each line of the files, which makes dealing with files with more or fewer than 1024 tokens harder. As a result, tokenizing the dataset takes about 8 hours, which is something I want to improve.
The train set comes out to about 26GB and consists of about 8M text files that have been transformed into just under 7M TFRecord examples, each with 1024 tokens (same as GPT-2). The validation set comes out to about 300MB and consists of about 100K text files that have been transformed into about 90K TFRecord examples, each with 1024 tokens (also the same as GPT-2).
Code
Since I am using TPUs, the only library that you can practically use right now is TensorFlow. I did not want to go through the learning curve of writing custom training loops in TF2, so I just stuck to using Keras. You can take a look at my training script (it is pretty short) here. It is pretty simple, so I will not copy the entire training script, but I will talk about a few small code snippets.
I usually like to add a ptvsd breakpoint to my script so I can debug my training script locally with VS Code before pushing it up to my VM:
if args.debug:
    import ptvsd
    ptvsd.enable_attach(address=('localhost', 5678),
                        redirect_output=True)
    ptvsd.wait_for_attach()
    breakpoint()
I am using Weights & Biases to keep track of my experiments and save checkpoints:
    wandb.login()
    wandb.init(project='lm-finetuning', config=args, tags=args.tags)

    ...

    wandb_callback = WandbCallback(save_model=False)
Usually, when you are using a TPU with Keras, you pass in the IP address and port of the TPU to TPUClusterResolver, but you pass the name of the TPU itself to the resolver when using a TPU Pod.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu=args.tpu)
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
Replicating GPT-2
I tried to use as many of the original hyperparameters that OpenAI used as possible when replicating their 124M parameter version of GPT-2. However, I had to modify a few things so I could train everything in time.
Note: For some reason, the authors of the GPT-2 paper do not state precisely what learning rates they used for training their models and instead just state, "The learning rate of each model was manually tuned for the best perplexity on a 5% held-out sample of WebText".
OpenAI trains their models for a total of 800K iterations at a batch size of 512 (which comes out to around 60 epochs through the training set).
I trained my GPT-2 model for 1/8th the number of iterations that OpenAI trained theirs for (a total of around 100K iterations), since each 100K iteration training run took about 20 hours on my 128-core TPU Pod. If I wanted to train GPT-2 for the same number of iterations as OpenAI, a single training run would have used up most of my one week of access to the pod.
Since my TPU pod was preemptible and reset every 24 hours, I usually had to resume my training run at least once, which is why all of these graphs usually have two or more training runs on them.
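The time budget is easy to sanity-check. This sketch simply assumes that training time scales linearly with the number of iterations, using the 20 hours per 100K iterations measured on my pod; `training_hours` is a made-up helper.

```python
HOURS_PER_100K_ITERATIONS = 20  # measured on the 128-core TPUv3 pod


def training_hours(iterations: int,
                   hours_per_100k: float = HOURS_PER_100K_ITERATIONS) -> float:
    """Estimate wall-clock hours, assuming time scales linearly with iterations."""
    return iterations / 100_000 * hours_per_100k


full_run = training_hours(800_000)
print(full_run)                 # 160.0
print(round(full_run / 24, 1))  # 6.7
```

A full 800K iteration run would take about 160 hours, or roughly 6.7 of the 7 days I had, before accounting for preemptions and restarts.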
Replicating GPT-2
So here is my model that came really close to replicating GPT-2. The training perplexity is about 21.5 at the end of the almost 90K training iterations. For comparison, GPT-2 gets a training perplexity of about 17.5 ppl after about 800K training iterations, so a difference of only about 4 ppl.
I made a Colab notebook showing how to use my pretrained GPT-2 model to generate text.
AdamW vs Adafactor
I wanted to use the memory-saving Adafactor optimizer to make it easier to train larger language models, but all of my Adafactor training runs were a lot (~5 ppl IIRC) worse than using AdamW. (This may be due to not using Adafactor's momentum parameter or relative update scale, so this is something I want to look into more soon.)
Learning Rates
I started out with a learning rate of 1e-4, but I quickly figured out that I could train my models a lot faster by using a higher learning rate like 1e-3.
Section 2 of the GPT-3 paper lists the learning rates the OpenAI team used for different sized models when training GPT-3. They use a learning rate of 6e-4 for the 124M version of their model and decrease the learning rate as the model size increases.
As you can see from this partial training run, it's pretty clear that the loss decreases a lot faster with a learning rate of 1e-3. This doesn't mean that training at a lower learning rate will necessarily lead to a worse model, but it likely will take a lot more training iterations to get there.
Pretraining ALGPT-2
Since I was using the Hugging Face Transformers repository's implementations of GPT-2 and ALBERT, I just forked the repository and modified a few files to implement my ALGPT-2 model. You can take a look at all the changes that I had to make here. Most of the changes only make ALGPT-2 compatible with the Transformers library and let it use the useful abstractions that the library gives you; the important code is in the modelling_algpt2.py file, in which I just copied over the contents of modelling_gpt2.py and changed a few parts of the code. I'm only showing the changes that I made to the PyTorch version of ALGPT-2 here; the changes in the TF version are pretty similar and can be seen here.
Implementing Parameter Sharing
Implementing parameter sharing only involves changing a few lines of code:
class ALGPT2Model(ALGPT2PreTrainedModel):
    def __init__(self, config):
        super().__init__(config)

        ...

-       self.h = nn.ModuleList([Block(config.n_ctx, config, scale=True)
-           for _ in range(config.n_layer)])
+       # A single block whose weights are reused for every layer.
+       self.h = Block(config.n_ctx, config, scale=True)

        ...

    def forward(self, ...):

        ...

        if past is None:
            past_length = 0
-           past = [None] * len(self.h)
+           past = [None] * self.config.n_layer

        ...

-       for i, (block, layer_past) in enumerate(zip(self.h, past)):
+       for i in range(self.config.n_layer):

            if self.output_hidden_states:
                all_hidden_states = all_hidden_states + (hidden_states.view(*output_shape),)

-           outputs = block(
+           outputs = self.h(
                hidden_states,
-               layer_past=layer_past,
+               layer_past=past[i],  # layer_past is no longer bound by the loop
                attention_mask=attention_mask,
                head_mask=head_mask[i],
                use_cache=use_cache,
            )
        ...
Implementing a Factorized Embedding
Adding a factorized embedding is a little more work:
In the config.json that you use for your ALGPT-2 model, you need to specify that you want to use the ALGPT-2 architecture, and you need to specify the dimension of the factorized embedding that you want to use:
{
+	"architectures": ["ALGPT2LMHeadModel"],
	"attn_pdrop": 0.1,
	"bos_token_id": 50256,
	"embd_pdrop": 0.1,
	"eos_token_id": 50256,
	"initializer_range": 0.02,
	"layer_norm_epsilon": 1e-5,
+	"model_type": "algpt2",
	"n_ctx": 1024,
	"n_embd": 768,
	"n_head": 12,
	"n_layer": 12,
	"n_positions": 1024,
	"resid_pdrop": 0.1,
	"summary_activation": null,
	"summary_first_dropout": 0.1,
	"summary_proj_to_labels": true,
	"summary_type": "cls_index",
	"summary_use_proj": true,
	"vocab_size": 50257,
+	"embedding_size": 128
}
Back in modelling_algpt2.py, define the two factorized embedding matrices (the second matrix is really just a simple linear layer):
class ALGPT2Model(ALGPT2PreTrainedModel):
    def __init__(self, config):
        super().__init__(config)

        ...

-       self.wte = nn.Embedding(config.vocab_size, config.n_embd)
-       self.wpe = nn.Embedding(config.n_positions, config.n_embd)
+       self.wte = nn.Embedding(config.vocab_size, config.embedding_size)
+       self.wpe = nn.Embedding(config.n_positions, config.embedding_size)

+       self.projection_layer = nn.Linear(config.embedding_size, config.n_embd)

        ...

    def forward(self, ...):

        ...

        hidden_states = inputs_embeds + position_embeds + token_type_embeds

+       hidden_states = self.projection_layer(hidden_states)

        ...


class ALGPT2LMHeadModel(ALGPT2PreTrainedModel):
    def __init__(self, config):
        super().__init__(config)

        ...

-       self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
+       self.dense = nn.Linear(config.n_embd, config.embedding_size)
+       self.lm_head = nn.Linear(config.embedding_size, config.vocab_size, bias=False)

    def forward(self, ...):

        ...

-       lm_logits = self.lm_head(hidden_states)
+       dense = self.dense(hidden_states)
+       lm_logits = self.lm_head(dense)
        ...
Effect of Layer-Wise Parameter Sharing
This version of ALGPT-2 has about 47M parameters while GPT-2 has 124M. This ALGPT-2 model with parameter sharing trains a lot faster than GPT-2 (9 hours vs 20 hours for a 90K iteration training run), but is consistently about 10 ppl worse than GPT-2 (31 vs 21 ppl).
This difference is quite a bit larger than the difference between ALBERT and BERT, but might be explained by masked language modelling being an easier task than autoregressive language modelling. Increasing the size of the ALGPT-2 model might make it more competitive with GPT-2.
Effect of Removing Dropout
I ran a partial training run on removing dropout from ALGPT-2. I did not run it for very long, but it looks like removing dropout gives you a slight improvement (~3 ppl).
Effect of Factorized Embeddings
I ran three experiments for 90K iterations with three different values for the factorized embedding size (128, 256, and 512) as well as the baseline version without a factorized embedding.

Model        ALGPT-2   ALGPT-2 512   ALGPT-2 256   ALGPT-2 128
Parameters   47M       34M           20M           13M
Time         ~9H       ~9H           ~9H           ~9H
Perplexity   31        31            34            38

There was practically no difference in the loss curves between the baseline and the 512 run since the change in the number of parameters wasn't that great. However, the training runs with factorized embeddings of sizes 256 and 128 were significantly worse than the baseline: 34 and 38 ppl respectively, a pretty big difference from the baseline of 31 ppl.
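The parameter counts for these variants can be roughly reproduced with a short calculation. This is a sketch under stated assumptions rather than a measurement: it assumes the standard GPT-2 layer shapes, an output head whose weights are tied to the token embedding, and the two extra linear layers (`projection_layer` and `dense`) from the code above. The function names are made up for illustration.

```python
from typing import Optional


def block_params(d: int) -> int:
    """Weights and biases in one GPT-2 transformer block of width d."""
    layer_norms = 2 * (2 * d)                      # ln_1 and ln_2, scale + bias each
    attention = (d * 3 * d + 3 * d) + (d * d + d)  # fused q/k/v projection + output projection
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)    # the two feed-forward layers
    return layer_norms + attention + mlp


def model_params(vocab: int = 50257, n_positions: int = 1024, d: int = 768,
                 n_layer: int = 12, share_layers: bool = False,
                 embedding_size: Optional[int] = None) -> int:
    e = d if embedding_size is None else embedding_size
    total = vocab * e + n_positions * e  # token + position embeddings (head assumed tied)
    if embedding_size is not None:
        total += e * d + d               # projection_layer: embedding_size -> n_embd
        total += d * e + e               # dense: n_embd -> embedding_size before the head
    total += block_params(d) * (1 if share_layers else n_layer)
    total += 2 * d                       # final layer norm
    return total


print(f"GPT-2:       {model_params():,}")                    # GPT-2:       124,439,808
print(f"ALGPT-2:     {model_params(share_layers=True):,}")   # ALGPT-2:     46,473,216
for size in (512, 256, 128):
    n = model_params(share_layers=True, embedding_size=size)
    print(f"ALGPT-2 {size}: {n:,}")
# ALGPT-2 512: 34,132,992
# ALGPT-2 256: 20,611,584
# ALGPT-2 128: 13,850,880
```

These land close to the rounded numbers in the table, and they show that once the layers are shared, most of the remaining parameters sit in the embedding, which is why shrinking it changes the total so much.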
﻿
﻿
Effect of Model Size
I only had the time to run one more full training run with ALGPT-2-medium (this one is comparable to the 345M version of GPT-2). ALGPT-2-medium has about 66M parameters and took twice as long as ALGPT-2 to train (a little more than 20 hours). The larger model size made quite a big difference in performance: the training perplexity decreased 5 ppl, from 31 to 26 ppl.
[Embedded W&B report: Effect of model size, https://app.wandb.ai/bkkaggle/lm-finetuning/reports/Effect-of-model-size--VmlldzoxNzI0OTM]
Conclusion and Next Steps
After my TPU pod's quota was used up, I started working on a few other things over the summer and just kept delaying writing up what I did for a couple of months until now.
There are a lot of things that I still want to work on or look into:
Training larger versions of ALGPT-2
Removing or replacing the normalization layers in transformers
Working on distilling/shrinking language models with billions of parameters to make them more accessible
Applying something like PPLM to condition language models for few-shot inference (kind of like what GPT-3 does)
Thanks for reading through all this. If you think there are any mistakes or inaccuracies in this post, please let me know.
﻿
﻿