
Notes on Implementing GPT-2 from scratch

Insights from Karpathy's Approach
After implementing transformers based on my previous understanding (which you can read about in this blog post), I recently watched Andrej Karpathy's video on training GPT from scratch. This post explores the key differences between the standard transformer architecture from the "Attention Is All You Need" paper and the GPT-2 architecture, along with practical insights from my implementation experience.
Note: This blog post is not a step-by-step guide to implementing GPT-2 from scratch. Instead, it captures my personal observations, architectural insights, and practical learnings while following Andrej Karpathy’s GPT training video. If you're following the video yourself, you might find these reflections and notes helpful.

Architectural Differences in GPT-2

Decoder-Only Architecture

GPT-2 uses a decoder-only architecture, removing both the encoder and the cross-attention components found in the original transformer. This streamlined approach focuses entirely on the auto-regressive capabilities of the decoder.

Positional Encoding Evolution

While the original transformer paper used fixed sinusoidal positional encodings, GPT-2 takes a different approach:
  • It trains the embeddings for positional encoding rather than using fixed values (see the sketch after this list)
  • Interestingly, Karpathy notes that these trained embeddings still exhibit some sinusoidal properties
  • GPT-2 uses a fixed sequence length of 1024, which is incorporated into the positional encodings
  • Some recent studies suggest that large language models can implicitly learn positional information, potentially eliminating the need for explicit positional embeddings altogether.
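As a concrete illustration, here is a minimal sketch of how learned positional embeddings look in PyTorch; the module and argument names (context_len, d_model) are my own placeholders rather than anything from a specific codebase:

import torch
import torch.nn as nn

class GPT2Embeddings(nn.Module):
    def __init__(self, vocab_size=50257, context_len=1024, d_model=768):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)    # token id -> vector
        self.pos_emb = nn.Embedding(context_len, d_model)   # positions are trained, not fixed sinusoids

    def forward(self, idx):
        # idx: (batch, T) with T <= context_len
        T = idx.shape[1]
        positions = torch.arange(T, device=idx.device)      # 0 .. T-1
        # Token and position embeddings are simply summed, as in GPT-2.
        return self.tok_emb(idx) + self.pos_emb(positions)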

Architectural Refinements

Several architectural changes were made to improve performance:
  • LayerNorm Repositioning: LayerNorm was moved from the output of each sub-block (post-norm) to its input (pre-norm)
  • Additional LayerNorm: A final LayerNorm was added after the last transformer block
  • Activation Function: GELU with the tanh approximation replaced ReLU in the feed-forward layer, removing ReLU's hard zero cutoff (a sketch of the resulting pre-LN block follows this list)
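Putting these changes together, a minimal sketch of a pre-LN GPT-2 block could look like the following; the 4x MLP width is the conventional choice and the attention module is left abstract, so treat this as an illustration rather than a faithful reproduction of any particular implementation:

import torch.nn as nn

class Block(nn.Module):
    def __init__(self, d_model, attn):
        super().__init__()
        self.ln_1 = nn.LayerNorm(d_model)    # LayerNorm now sits at the start of each sub-block
        self.attn = attn                     # any causal self-attention module
        self.ln_2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(approximate='tanh'),     # GELU (tanh approximation) instead of ReLU
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))      # pre-LN: normalize before attention
        x = x + self.mlp(self.ln_2(x))       # pre-LN: normalize before the MLP
        return x

The additional final LayerNorm mentioned above sits at the model level, applied once after the last of these blocks and before the output projection.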

Attention Masking Approach

In GPT-2, the attention mask has a fixed size determined by the context length (1024):
torch.tril(torch.ones(self.context_len, self.context_len)).view(
1, 1, self.context_len, self.context_len)
This differs from the variable-length masking often used in transformer implementations.
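For context, here is a hedged sketch of how such a fixed mask is typically stored and applied inside the attention forward pass; the buffer name mask and the slicing to the current sequence length T are illustrative rather than taken from a specific implementation:

# In __init__: store the mask once as a non-trainable buffer.
self.register_buffer(
    "mask",
    torch.tril(torch.ones(self.context_len, self.context_len))
        .view(1, 1, self.context_len, self.context_len)
)

# In forward: slice the fixed 1024x1024 buffer down to the current sequence length T,
# then block out future positions before the softmax.
att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float('-inf'))
att = torch.softmax(att, dim=-1)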

Optimizations

Weight Tying

Although weight tying already appears in the original Transformer paper, I only truly grasped its significance after watching Karpathy's explanation. With weight tying, the model shrinks by roughly 40 million parameters.
Weight tying in GPT models is a technique in which the embedding layer (which converts token IDs to vectors) and the output projection layer (which converts vectors back into scores over the vocabulary) share the same weight matrix. This works because the two layers perform roughly inverse mappings.
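In code, weight tying usually comes down to a single assignment that makes the output projection reuse the embedding matrix. A minimal sketch, assuming an nn.Embedding called wte and a bias-free nn.Linear head (the names are illustrative):

import torch.nn as nn

vocab_size, d_model = 50257, 768
wte = nn.Embedding(vocab_size, d_model)                 # token id -> vector
lm_head = nn.Linear(d_model, vocab_size, bias=False)    # vector -> scores over the vocabulary

# Both layers now share one (vocab_size x d_model) matrix.
lm_head.weight = wte.weight

print(vocab_size * d_model)  # ~38.6M parameters stored once instead of twice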

Precision and Performance Optimizations

Karpathy demonstrated several precision configurations:
FP32 Precision (baseline):
Epoch 1/10 [Train]: 0%| 0/94242 [00:00<?, ?it/s]step: 0, loss: 11.025991439819336, dt: 1174.2680072784424, tokens/s : 1744.0652281301386
Epoch 1/10 [Train]: 0%| | 100/94242 [00:40<10:45:38, 2.43it/s]step: 100, loss: 7.5851287841796875, dt: 395.0457572937012, tokens/s : 5184.209581264764
Epoch 1/10 [Train]: 0%|| 200/94242 [01:20<10:31:35, 2.48it/s]step: 200, loss: 6.906430244445801, dt: 418.8878536224365, tokens/s : 4889.136751732982
Epoch 1/10 [Train]: 0%|| 300/94242 [02:01<10:41:26, 2.44it/s]step: 300, loss: 6.423778533935547, dt: 407.1216583251953, tokens/s : 5030.4373597440135
Epoch 1/10 [Train]: 0%|| 400/94242 [02:42<10:37:16, 2.45it/s]step: 400, loss: 6.498997211456299, dt: 412.72664070129395, tokens/s : 4962.122136143414
TF32 Precision:
Epoch 1/10 [Train]: 0%| | 0/94242 [00:00<?, ?it/s]step: 0, loss: 11.02596378326416, dt: 1072.5276470184326, tokens/s : 1909.5078860608642
Epoch 1/10 [Train]: 0%|| 100/94242 [00:31<8:07:40, 3.22it/s]step: 100, loss: 7.585122585296631, dt: 310.6203079223633, tokens/s : 6593.258546739574
Epoch 1/10 [Train]: 0%|| 200/94242 [01:03<8:12:15, 3.18it/s]step: 200, loss: 6.910003662109375, dt: 302.4632930755615, tokens/s : 6771.0695707077675
Epoch 1/10 [Train]: 0%|| 300/94242 [01:34<8:07:02, 3.21it/s]step: 300, loss: 6.423272132873535, dt: 303.3599853515625, tokens/s : 6751.05517831095
While Karpathy saw a 3x increase in tokens processed per second, I didn't observe the same improvement, possibly because my GPU doesn't fully support TF32. Nevertheless, there was still a noticeable increase in tokens/sec.
Mixed Precision with BFloat16:
Epoch 1/10 [Train]: 0%| | 0/94242 [00:00<?, ?it/s]step: 0, loss: 11.025543212890625, dt: 1149.9764919281006, tokens/s : 1780.9059701439933
Epoch 1/10 [Train]: 0%|| 100/94242 [00:25<6:30:14, 4.02it/s]step: 100, loss: 7.585273742675781, dt: 249.59468841552734, tokens/s : 8205.302817143578
Epoch 1/10 [Train]: 0%|| 200/94242 [00:50<6:33:51, 3.98it/s]step: 200, loss: 6.9183244705200195, dt: 249.06373023986816, tokens/s : 8222.795017273744
Epoch 1/10 [Train]: 0%|| 300/94242 [01:16<6:32:43, 3.99it/s]step: 300, loss: 6.416199684143066, dt: 249.7537136077881, tokens/s : 8200.078270772656
Epoch 1/10 [Train]: 0%|| 400/94242 [01:41<6:32:50, 3.98it/s]step: 400, loss: 6.4893035888671875, dt: 253.40914726257324, tokens/s : 8081.791924732447
An added benefit of mixed precision training was reduced GPU memory usage, decreasing by approximately 2GB.
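For reference, both modes are switched on with just a couple of lines in PyTorch. This is a minimal sketch of the pattern, where model, optimizer, and train_loader are placeholders for your own training setup:

import torch

# TF32: allow matmuls on supported GPUs (Ampere and newer) to use TensorFloat-32.
torch.set_float32_matmul_precision('high')

# Mixed precision with bfloat16: only the forward pass and loss run under autocast.
for x, y in train_loader:
    x, y = x.to('cuda'), y.to('cuda')
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type='cuda', dtype=torch.bfloat16):
        logits, loss = model(x, y)   # a model returning (logits, loss) is assumed here
    loss.backward()
    optimizer.step()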

Flash Attention Implementation

Implementing Flash Attention provided a remarkable speedup. Combined with the vocabulary-size adjustment (Karpathy rounds the vocabulary up from 50,257 to the more GPU-friendly 50,304), it dramatically increased processing speed, with tokens/s jumping from around 8,000 to over 15,000.
Epoch 1/10 [Train]: 0%| | 0/94242 [00:00<?, ?it/s]step: 0, loss: 10.933685302734375, dt: 891.303300857544, tokens/s : 2297.7587966179085
Epoch 1/10 [Train]: 0%|| 100/94242 [00:14<3:35:17, 7.29it/s]step: 100, loss: 7.229774475097656, dt: 134.56296920776367, tokens/s : 15219.64038143296
Epoch 1/10 [Train]: 0%|| 200/94242 [00:28<3:36:35, 7.24it/s]step: 200, loss: 6.723935604095459, dt: 127.14266777038574, tokens/s : 16107.889160376917
Epoch 1/10 [Train]: 0%|| 300/94242 [00:41<3:36:07, 7.24it/s]step: 300, loss: 6.507452964782715, dt: 135.31255722045898, tokens/s : 15135.328472606432
Epoch 1/10 [Train]: 0%|| 400/94242 [00:55<3:33:34, 7.32it/s]step: 400, loss: 6.184662818908691, dt: 144.84047889709473, tokens/s : 14139.6936519041
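In PyTorch, Flash Attention is exposed through torch.nn.functional.scaled_dot_product_attention, so the manual mask-and-softmax steps can be replaced by a single fused call. A sketch, assuming q, k, and v are already shaped (batch, n_heads, T, head_dim):

import torch.nn.functional as F

# is_causal=True applies the lower-triangular mask internally,
# so the explicit tril buffer is no longer needed.
y = F.scaled_dot_product_attention(q, k, v, is_causal=True)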

Training Insights

Training without gradient accumulation on a single GPU





Learning Rate's Impact

I experimented with two different learning rate peak values:
  • Initially: A higher learning rate (~0.0006)
  • Later: A reduced learning rate (~0.0003)
The higher learning rate initially gave faster convergence but introduced significant instability around step 200k. After reducing to the lower rate, training stabilized dramatically - a perfect demonstration of how critical the right learning rate is for transformer models.
Looking at my loss curves, you can see the higher learning rate caused the loss to spike from 4 to 7, while the lower rate maintained a steady trajectory. My perplexity charts tell the same story - a huge spike to 1,200 with the higher rate.
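For completeness, Karpathy's run uses a linear warmup followed by cosine decay down to a minimum learning rate. The sketch below follows that shape, but the warmup length, total steps, and the 3e-4 peak are illustrative values rather than my exact configuration:

import math

def get_lr(step, max_lr=3e-4, min_lr=3e-5, warmup_steps=2000, max_steps=300_000):
    # Linear warmup from 0 to max_lr.
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    # After max_steps, stay at the minimum learning rate.
    if step > max_steps:
        return min_lr
    # Cosine decay from max_lr down to min_lr in between.
    decay_ratio = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return min_lr + coeff * (max_lr - min_lr)

# In the training loop, the value is written into the optimizer each step:
# for param_group in optimizer.param_groups:
#     param_group['lr'] = get_lr(step)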


Challenges

One of the first major challenges I encountered while training GPT-2 was when the dataset size grew significantly. Initially, I was working with a smaller dataset — about 900 million tokens — and everything worked smoothly. But as I scaled to larger models and increased the dataset size to 40 GB, my original data loading approach simply didn’t hold up. Let me explain this with a piece of code I originally used.
import torch
from torch.utils.data import Dataset

class GPT2Dataset(Dataset):
    def __init__(self, data, seq_len, tokenizer):
        super().__init__()
        self.seq_len = seq_len
        self.data = data
        self.tokenizer = tokenizer
        # Tokenize the whole corpus in memory.
        self.tokens = self.tokenizer.encode(self.data, allowed_special={'<|endoftext|>'})
        # Trim to a whole number of (seq_len + 1)-sized samples and reshape.
        num_samples = len(self.tokens) // (self.seq_len + 1)
        self.tokens = self.tokens[: num_samples * (self.seq_len + 1)]
        self.tokens = torch.tensor(self.tokens, dtype=torch.long).reshape(num_samples, self.seq_len + 1)

    def __len__(self):
        return len(self.tokens)

    def __getitem__(self, idx):
        x = self.tokens[idx, :-1]  # Input: all but last token
        y = self.tokens[idx, 1:]   # Target: all but first token
        return x, y
This approach worked fine for smaller datasets. I could tokenize the entire text corpus in memory and create training samples on the fly. However, as I scaled up, several issues emerged.

Problem 1: Tokenization Bottleneck

When working with a smaller dataset, tokenization is relatively fast and can be done in-memory. But once I began scaling up to tens of gigabytes of raw text, tokenizing the entire dataset in one go became infeasible. Every time I re-ran training, I had to tokenize the entire corpus again, which was time-consuming and redundant.

Solution: Pre-tokenization and Storage

To solve this, I pre-tokenized the dataset and saved the tokens as NumPy arrays on disk. This way, I could skip tokenization during training and load the processed tokens directly. However, this fix introduced a new challenge.
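Before getting to that challenge, here is a minimal sketch of the pre-tokenization step, assuming tiktoken's GPT-2 encoding and some documents iterable over the raw text; the shard size, the file naming, and writing rows of seq_len + 1 tokens are arbitrary illustrative choices:

import numpy as np
import tiktoken

enc = tiktoken.get_encoding("gpt2")
seq_len = 1024
seqs_per_shard = 100_000          # illustrative shard size
buffer, shard_id = [], 0

for doc in documents:             # `documents` is a placeholder for the raw-text iterator
    buffer.extend(enc.encode(doc + '<|endoftext|>', allowed_special={'<|endoftext|>'}))
    # Cut the token stream into rows of seq_len + 1 (input plus shifted target).
    while len(buffer) >= seqs_per_shard * (seq_len + 1):
        chunk = np.array(buffer[: seqs_per_shard * (seq_len + 1)], dtype=np.uint16)  # GPT-2 ids fit in uint16
        np.save(f"shard_{shard_id:04d}.npy", chunk.reshape(seqs_per_shard, seq_len + 1))
        buffer = buffer[seqs_per_shard * (seq_len + 1):]
        shard_id += 1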

Problem 2: Memory Constraints with Large Arrays

Saving tokenized data in .npy format works well — until the size of the np array itself becomes too large to fit into memory. If your dataset contains billions of tokens, you can easily exceed your available RAM just trying to load the array. This becomes a huge bottleneck, especially when training large language models (LLMs), where datasets can reach hundreds of gigabytes or even terabytes.

Solution: Data Sharding and Streaming

One common solution is to break the tokenized dataset into smaller shards. Each shard contains a manageable chunk of the data that can be loaded and used independently. This works great when your data pipeline can accommodate a streaming architecture. You can also use the Hugging Face streaming datasets feature to load data lazily. This allows you to work with datasets larger than your available RAM without needing to load everything at once.
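If you go the Hugging Face route, streaming is a one-line switch; a sketch, where the dataset name is just a placeholder for whichever corpus you are using:

from datasets import load_dataset

# streaming=True returns an iterable dataset: shards are downloaded and read lazily,
# so the full corpus never has to sit in RAM at once.
ds = load_dataset("your-dataset-name", split="train", streaming=True)

for example in ds:
    text = example["text"]   # assumes the corpus stores raw text under a "text" field
    ...                      # tokenize / batch as needed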
But what if you want *more control* over your data pipeline, or you're building a custom training loop?

Enter np.memmap: Memory Mapping for Large Datasets

np.memmap is one of those tools that makes you go, “How did I not know about this before?”
It lets us access a NumPy array stored on disk as if it were in memory: only the parts we actually index are loaded into RAM. This efficient random access means we can read any chunk of the data without ever holding the whole dataset in memory.
import os, glob, bisect
import numpy as np
import torch
from torch.utils.data import Dataset

class GPT2Dataset(Dataset):
    def __init__(self, data_dir, seq_len):
        self.seq_len = seq_len
        pattern = os.path.join(data_dir, "**", "shard_*.npy")
        self.shard_files = sorted(glob.glob(pattern, recursive=True))
We first gather all the NumPy shards in the directory and keep them in sorted order.
        self.shards_mmap = []
        self.cumsum_shards = []
        total_len = 0

        for path in self.shard_files:
            # mmap_mode='r' returns a memory map instead of loading the array into RAM.
            shard_mmap = np.load(path, mmap_mode='r')
            # Record how many sequences this shard holds so that any sequence
            # can later be located by a global index.
            shard_len = shard_mmap.shape[0]
            self.shards_mmap.append(shard_mmap)
            total_len += shard_len
            self.cumsum_shards.append(total_len)

        self.total_len = total_len
        # We now have two lists: one stores the memory maps, the other stores the
        # cumulative number of sequences up to and including each shard.
We repeat this for every shard, retaining the two lists (and the total sequence count) for the lookups below.
    def __len__(self):
        return self.total_len

    def __getitem__(self, idx):
        # This is where the magic happens.
        # bisect performs a binary search over the sorted cumulative counts,
        # telling us which shard contains the global index. Using bisect_right
        # avoids edge-case bugs at shard boundaries.
        shard_idx = bisect.bisect_right(self.cumsum_shards, idx)
        if shard_idx == 0:
            local_idx = idx
        else:
            local_idx = idx - self.cumsum_shards[shard_idx - 1]
        # Now we know which shard the index we are looking for lives in,
        # and where inside that particular shard.
        seq = self.shards_mmap[shard_idx][local_idx]
        # Only this one row is read from disk; split it into input and target.
        x = torch.from_numpy(seq[:-1].astype(np.int64))
        y = torch.from_numpy(seq[1:].astype(np.int64))

        return x, y
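Hooking this into training is then the usual Dataset plus DataLoader pattern; a hypothetical usage, assuming the shards produced earlier live under a tokenized_shards directory:

from torch.utils.data import DataLoader

dataset = GPT2Dataset(data_dir="tokenized_shards", seq_len=1024)
loader = DataLoader(dataset, batch_size=8, shuffle=True, num_workers=2, pin_memory=True)

for x, y in loader:
    # x and y have shape (batch_size, seq_len); only the rows actually touched are read from disk.
    break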




Conclusion

Training a GPT model from scratch with Karpathy's guidance revealed numerous architectural nuances and optimization techniques that weren't immediately obvious from just reading papers.
The most valuable lesson was seeing how seemingly small changes—like weight tying, precision adjustments, and attention optimizations—can have dramatic effects on both model size and training efficiency. These insights will certainly influence my approach to future model implementations.
What are your experiences with implementing transformer models? Have you found other optimizations that made a significant difference? Share your thoughts in the comments below.