Training Tiny Llamas for Fun—and Science
Exploring how the softmax implementation can impact model performance, using Karpathy's tiny llama implementation.
Created on August 2 | Last edited on September 14
Introduction
In this report, we're going to look at the new llama2.c repo and perform some experiments to see how different softmax implementations behave. But first, a cute mammal:

From nanoGPT to llama2.c
Both repositories have a lot in common, including the same approach to training a model: just two files, in this case train.py and model.py. One contains the code to train the model with different hyperparameters; the other is the bare PyTorch module.
If you're already familiar with nanoGPT, this new repo will be very friendly to you, as they share a lot of code logic. The new repo implements a llama-like model instead of the old-ish (at least in ML years) GPT2 model. So if you want to train a modern architecture with all the recent tricks, we now have llama2.c as a new, minimal implementation.
The goal of this repo is not actually training tiny-llamas. The original idea was to make a C-based inference engine. We have already covered the big brother of this project, llama.cpp, in a previous article. This repo aims to have a C implementation that is as pedagogical as possible, not "the best possible" implementation. Andrej makes this case explicitly in the contributing guidelines of the project's readme. No fancy stuff here; just keep things simple and understandable.
Another change is that this repo trains the models on the TinyStories dataset instead of OpenWebText (OWT). Citing the repo's readme:
"You might think that you need many billion parameter LLMs to do anything useful, but in fact very small LLMs can have surprisingly strong performance if you make the domain narrow enough. I recommend looking at the TinyStories paper for inspiration."
Training tiny-llamas
This repo implements the base llama-architecture. By tweaking the parameters, you can train small versions of the original llama architecture released by Meta.
You have some room to play with here. The key hyperparameters to take into consideration are the number of transformer blocks, the dimension of the blocks, the number of heads, and the sequence length. Depending on your available compute, you could train a tiny-tiny llama with only 6 layers, 6 heads, and dim 288, getting a 15M parameter model, something very small by today's standards.
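As a concrete reference, the 15M configuration above corresponds roughly to a hyperparameter block like the following sketch (the variable names are illustrative, and the sequence length and vocab size are assumptions; check train.py for the real defaults):

# illustrative sketch of the tiny-tiny llama hyperparameters
tiny_llama_config = dict(
    n_layers=6,        # number of transformer blocks
    n_heads=6,         # attention heads per block
    dim=288,           # hidden dimension of the blocks
    max_seq_len=256,   # sequence length (assumed default)
    vocab_size=32000,  # Llama tokenizer vocab size (assumed)
)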

If you don't want to train, you can download the model checkpoints directly from the tinyllamas Hugging Face repo.
We trained the 15M parameter tiny llama with different softmax functions. We're not training tiny llamas just for fun (well, actually, yes we are); we also want to try to understand some issues that occur in the activations when training large language models.
The softmax function debate
The softmax function is the workhorse of classification problems. It is a fundamental piece of the transformer architecture, as it is at the core of the attention mechanism:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d}}\right)V$$
I like viewing this formula as a scaled dot product of x with itself, in the form of:

$$\text{softmax}\!\left(\frac{(xQ)(xK)^T}{\sqrt{d}}\right)(xV) = \text{softmax}\!\left(\frac{x\,QK^T\,x^T}{\sqrt{d}}\right)(xV)$$
I know it is not exactly this, but in the self-attention case, Q, K, and V are linear transforms of the same x, so we are not that far off. To make things concrete, if we suppose that Q, K, and V are the identity matrix and d=1, then the formula becomes:

$$\text{softmax}(x\,x^T)\,x$$
So what Softmax is doing here is telling us which values of x to "attend" to. It's basically a quadratic form of x with an identity matrix at the center.
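To make that mental picture concrete, here is the simplified formula (Q = K = V = I and d = 1) computed directly in PyTorch; this is only a toy illustration of the equation above, not the real attention code:

import torch

x = torch.tensor([[1.0, 0.0],
                  [0.0, 1.0],
                  [1.0, 1.0]])          # three "tokens" of dimension 2

scores = x @ x.T                        # the quadratic form x·xᵀ
attn = torch.softmax(scores, dim=-1)    # which tokens to "attend" to
out = attn @ x                          # mix the values with those weights
print(attn)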
Let's use argmax instead of softmax for a second
Softmax behaves like a continuous "argmax": it picks out the position where the product is highest, but in a soft, differentiable way. Suppose that we have a vector x of 6 tokens. If we use argmax, the attention matrix could look something like this:
An upper-triangular mask also applies to the matrix, so the upper half is zero. This is because we are doing causal attention (I always read this as "casual," does it happen to you, too?), so tokens can only attend to previous tokens; we cannot attend to tokens we haven't seen yet. Reading the matrix above, the first row tells us that the first token in x attends to itself. The second row tells us that the second token of x attends to the first token, and so on. The 1 in each row marks the position in the sequence that the token attends to.
But what if we don't want to attend to previous tokens? That would mean a row full of zeros 😱. But the argmax function has to pick one value, so there will always be a 1 in the row, and here is the problem: sometimes the attention mechanism wants to not-attend to anything, and it finds ways to achieve this by attending to one specific token (or tokens)!
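To see the issue concretely, here is a toy sketch of that "hard" argmax attention next to the usual softmax version, using random scores (nothing here comes from the real model):

import torch

seq_len = 6
scores = torch.randn(seq_len, seq_len)                         # toy attention scores
causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
scores = scores.masked_fill(~causal, float("-inf"))            # causal mask

# argmax attention: every row is forced to put a 1 somewhere
hard = torch.zeros(seq_len, seq_len)
hard[torch.arange(seq_len), scores.argmax(dim=-1)] = 1.0

# softmax attention: the continuous version of the same idea
soft = torch.softmax(scores, dim=-1)
print(hard)

Every row of hard contains exactly one 1; there is simply no way for a row to express "attend to nothing", which is exactly the problem described above.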
In PyTorch the masking trick is achieved by adding a mask with negative infinity values:
import torch

max_seq_len = 6
mask = torch.full((1, 1, max_seq_len, max_seq_len), float("-inf"))
mask = torch.triu(mask, diagonal=1)

>> tensor([[[[0., -inf, -inf, -inf, -inf, -inf],
             [0., 0., -inf, -inf, -inf, -inf],
             [0., 0., 0., -inf, -inf, -inf],
             [0., 0., 0., 0., -inf, -inf],
             [0., 0., 0., 0., 0., -inf],
             [0., 0., 0., 0., 0., 0.]]]])
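Continuing from the snippet above, adding the mask to some toy attention scores and taking the softmax gives exactly the causal pattern we described: the -inf entries become zeros and every row still sums to 1.

scores = torch.randn(1, 1, max_seq_len, max_seq_len)  # toy attention scores
attn = torch.softmax(scores + mask, dim=-1)
print(attn[0, 0])  # upper triangle is 0, each row sums to 1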
It is way more complicated than this, as the softmax gives us a probability distribution instead of just one value per row. Also, we have multiple attention heads adding information to each other, so the interactions are not easy to explain. But I like my interpretation, and it gives a nice mental picture of what I may want to achieve here. This blog post from Evan Miller explains the problem in detail, and how attention heads attend to specific tokens like punctuation, spaces, and other non-word tokens when they want to let "no attention" flow.
The softmax function
Let's not make this longer than it needs to be, shall we? Here is the softmax function:

$$\text{softmax}(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}$$
One remarkable property of softmax (thanks to the exponential function) is that:

$$\text{softmax}(x + c)_i = \frac{e^{x_i + c}}{\sum_j e^{x_j + c}} = \frac{e^{c}\, e^{x_i}}{e^{c} \sum_j e^{x_j}}$$
The denominator is the same for all values of i, so the factor e^c cancels out. Hence:

$$\text{softmax}(x + c) = \text{softmax}(x)$$
Implementing this function in code is tricky because exponentials tend to overflow (have you hit those NaNs?), so we often use the translation property and compute the softmax as follows. Let $m = \max_i x_i$; then:

$$\text{softmax}(x)_i = \frac{e^{x_i - m}}{\sum_j e^{x_j - m}}$$

This way, the argument of the exponential is always non-positive, where it has much less tendency to overflow (it could underflow 😱).
Most frameworks implement softmax like this, in a two-step process: first computing the max of the vector x, and then computing the softmax on the translated vector x - max(x). I know PyTorch's C and CUDA implementations do it this way, but I couldn't find the reference in the enormous PyTorch codebase 🤣.
Computing the softmax of a vector:
Let's run a quick example; it's easier with code:
from math import exp

scores = [6, 2, 3, 10, 5, 1]

def softmax(x: list):
    exp_x = [exp(xi) for xi in x]
    return [ex / sum(exp_x) for ex in exp_x]

for i, x in enumerate(softmax(scores)):
    print(f"{i}: {x:2.2f}")

# output
# 0: 0.02
# 1: 0.00
# 2: 0.00
# 3: 0.97  <-- almost an argmax here :P
# 4: 0.01
# 5: 0.00
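Continuing the example above, here is the numerically stable version with the max subtracted before exponentiating; thanks to the translation property it returns (numerically) the same values as the naive softmax:

def stable_softmax(x: list):
    m = max(x)  # shift so the largest argument to exp is 0
    exp_x = [exp(xi - m) for xi in x]
    return [ex / sum(exp_x) for ex in exp_x]

# same probabilities as the naive version, without risking overflow
assert all(abs(a - b) < 1e-9 for a, b in zip(softmax(scores), stable_softmax(scores)))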
Researching a better softmax
PyTorch implements a pass-through mechanism in the nn.MultiheadAttention layer that almost no one (maybe Stella Biderman 🤣?) uses: add_zero_attn. If turned on, this boolean concatenates an extra zero entry to K and V before feeding the softmax. The implementation is buried in the nn.functional.multi_head_attention_forward function:
# code from the `nn.functional.multi_head_attention_forward` function
# add zero attention along batch dimension (now first)
if add_zero_attn:
    zero_attn_shape = (bsz * num_heads, 1, head_dim)
    k = torch.cat([k, torch.zeros(zero_attn_shape, dtype=k.dtype, device=k.device)], dim=1)
    v = torch.cat([v, torch.zeros(zero_attn_shape, dtype=v.dtype, device=v.device)], dim=1)
    if attn_mask is not None:
        attn_mask = pad(attn_mask, (0, 1))
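If you want to try it from the module API, it is just a flag on the layer constructor. A minimal example with toy sizes:

import torch
import torch.nn as nn

# the flag lives on the standard nn.MultiheadAttention constructor
mha = nn.MultiheadAttention(embed_dim=288, num_heads=6,
                            add_zero_attn=True, batch_first=True)
x = torch.randn(1, 6, 288)  # (batch, seq_len, dim)
out, _ = mha(x, x, x)       # self-attention with the extra zero K/V entry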
In the self-attention case, the attention matrix is no longer square (it has dimensions [seq_len, seq_len+1]). The matrix ends up with its last column full of zeros before computing the softmax; also, the value matrix V is now extended with an extra zero row at the end. This effectively enables the attention mechanism to route a zero whenever it attends to this "extra" token.
Now the values become V = [V, 0] and the softmax is taken over scores that have an extra zero at the end, so:

$$\text{softmax}([x, 0])_i = \frac{e^{x_i}}{e^0 + \sum_j e^{x_j}} = \frac{e^{x_i}}{1 + \sum_j e^{x_j}}$$
So this creates an escape route that never fades, making the function somewhat more stable. We can experiment by replacing the actual softmax with a version that adds a constant to the denominator; it can be 1 or another value. This gives birth to:

$$\text{softmax}_c(x)_i = \frac{e^{x_i}}{c + \sum_j e^{x_j}}$$
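Here is a sketch of what that modified softmax could look like in PyTorch (the name softmax_plus_c is my own; for the experiments, something like this replaces the softmax call inside the attention). Note that if you apply the max-subtraction trick for stability, the constant has to be shifted by the same amount, otherwise you end up computing a different function:

import torch

def softmax_plus_c(x: torch.Tensor, dim: int = -1, c: float = 1.0) -> torch.Tensor:
    "softmax with a constant c added to the denominator (c=1 matches the extra-zero trick)"
    m = x.max(dim=dim, keepdim=True).values  # stability shift
    exp_x = torch.exp(x - m)
    # scale c by e^{-m} so the result equals e^{x_i} / (c + sum_j e^{x_j})
    return exp_x / (c * torch.exp(-m) + exp_x.sum(dim=dim, keepdim=True))

softmax_plus_c(torch.tensor([6., 2., 3., 10., 5., 1.]), c=1.0)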
Results
We launched trainings of multiple tiny llamas using Karpathy's code; the baseline uses the default params of the training script.
Metrics
To assess the performance of the new softmax, we want to measure the distribution of values at the output of the transformer layers. We will compute two metrics: the maximum absolute value (or infinity norm) and a dispersion coefficient called kurtosis (something like a fourth-order moment).

We will compute these metrics at the end of each feed-forward network and at the end of each attention layer. The code to do this is very simple; we just add two methods to the nn.Module:
from typing import List, Tuple

import torch.nn as nn
from scipy.stats import kurtosis


class Transformer(nn.Module):  # <- This should be called Llama 🤣
    ## Original code here...

    ## at the end of the module we add these 2 methods
    def compute_attention_metrics(self) -> Tuple[List[float], List[float]]:
        "compute the max inf norm and kurtosis of the attention outputs"
        outputs = [b.attention.output.cpu() for b in self.layers]
        k = [kurtosis(o.flatten().half()) for o in outputs]
        inf_norm = [o.abs().max().item() for o in outputs]
        return inf_norm, k

    def compute_ffn_metrics(self) -> Tuple[List[float], List[float]]:
        "compute the max inf norm and kurtosis of the ffn outputs"
        outputs = [b.feed_forward.output.cpu() for b in self.layers]
        k = [kurtosis(o.flatten().half()) for o in outputs]
        inf_norm = [o.abs().max().item() for o in outputs]
        return inf_norm, k
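During training, these per-layer values can then be logged to the run. A hypothetical snippet from a training loop (the model and step variables and the metric key names are placeholders, not the actual training script):

import wandb

inf_norm, kurt = model.compute_attention_metrics()
wandb.log({f"attn/inf_norm_layer_{i}": v for i, v in enumerate(inf_norm)}, step=step)
wandb.log({f"attn/kurtosis_layer_{i}": float(v) for i, v in enumerate(kurt)}, step=step)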
15M parameter Llama (no kurtosis)
Scaling the model size to 110M
For this experiment the model is bigger and the constant is smaller: the model diverged quickly when we used c=1. A small c makes the function closer to the original softmax, so we wanted c "as big as possible" while keeping training stable, and we settled on c=1e-3.
Conclusion
I am not sure I can conclude much from this. The reported metrics were not significantly different in my small experiments, and some people have suggested that the issues start to appear when training 1B+ parameter models. We have a very active Discord with researchers and practitioners exploring this and other relevant transformer issues!
It was a nice exercise that helped me understand the inner computations of the now-standard attention formula inside every transformer architecture out there.