
Experiments: Memory Consumption

We demonstrate how the memory allocation of the various reformer variants compares to that of the transformer during training.

The main motivation of the reformer paper is that it reduces the memory footprint compared to a normal transformer with little compromise in performance. In the previous sections we have seen that the reformer can achieve performance similar to a normal transformer on many tasks, and that it scales well to deep architectures (number of layers experiment) and long sequences (validation speed experiment). But how is memory actually allocated during training?

Claim: The reformer reduces the memory footprint compared to a normal transformer. The claim is summarized in table 5 of the reformer paper.

| Model Type | Memory Complexity | Time Complexity |
| --- | --- | --- |
| Transformer | $\max(bld_{ff}, bn_hl^2)n_l$ | $(bld_{ff} + bn_hl^2)n_l$ |
| Reversible Transformer | $\max(bld_{ff}, bn_hl^2)$ | $(bld_{ff} + bn_hl^2)n_l$ |
| Chunked Reversible Transformer | $\max(bld_{model}, bn_hl^2)$ | $(bld_{ff} + bn_hl^2)n_l$ |
| LSH Transformer | $\max(bld_{ff}, bn_hln_rc)n_l$ | $(bld_{ff} + bn_hn_rlc)n_l$ |
| Reformer | $\max(bld_{model}, bn_hln_rc)$ | $(bld_{ff} + bn_hn_rlc)n_l$ |



We write $d_{model}$ and $d_{ff}$ for model depth and assume $d_{ff} \geq d_{model}$. $b$ stands for batch size, $l$ for sequence length, $n_l$ for the number of layers, $n_h$ for the number of heads and $n_r$ for the number of hash rounds. We assume $n_c = l/32$, so $4l/n_c = 128$, and we write $c = 128^2$.
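For the sequence length used in our experiments ($l = 4096$), the chunking assumption above works out to:

$$
n_c = \frac{l}{32} = \frac{4096}{32} = 128, \qquad \frac{4l}{n_c} = \frac{4 \cdot 4096}{128} = 128, \qquad c = 128^2 = 16384
$$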

To investigate the claim we log the memory allocation of various reformer combinations during 0.1% of an epoch of training on the tiny Shakespeare dataset. The sequence length for all experiments was set to 4096, with a batch size of 1. (A minimal sketch of how such per-step memory logging can be done is shown after the list below.) We investigate the following combinations:

  1. Comparing the Transformer LM, LSH LM, Reversible LM and full Reformer LM
  2. Comparing Reformer LMs with different number of hashes
  3. Comparing Reformer LMs with different number of layers
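As a rough illustration, the sketch below shows one way to record GPU memory around the forward and backward passes using PyTorch's CUDA memory statistics. This is a minimal sketch, not the exact instrumentation used for these experiments; the `train_step` function and its arguments are hypothetical stand-ins for a normal training loop.

```python
import torch

def log_memory(tag, device=0):
    # Current and peak GPU memory allocated by tensors, in MB.
    current = torch.cuda.memory_allocated(device) / 2**20
    peak = torch.cuda.max_memory_allocated(device) / 2**20
    print(f"{tag}: current={current:.1f} MB, peak={peak:.1f} MB")

def train_step(model, batch, targets, loss_fn, optimizer):
    # Hypothetical training step: log memory right after the forward
    # and backward passes to see where activations are held and released.
    optimizer.zero_grad()
    out = model(batch)
    log_memory("after forward")    # stored activations (if any) are live here
    loss = loss_fn(out, targets)
    loss.backward()
    log_memory("after backward")   # activations freed, gradients allocated
    optimizer.step()
```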


1. Comparing Transformer LM, LSH LM, Reversible LM and the full Reformer LM

The figure below shows the peak memory usage for the Transformer LM, LSH LM, Reversible LM and the full Reformer. We see that the transformer stores activations for each layer during the forward pass, and that these are gradually released as the backward pass completes. In this case we see 6 distinct jumps in memory, corresponding to the model's 6 layers.

With the LSH LM (8 hashing rounds), the memory allocation pattern mostly parallels that of the transformer, but since the LSH attention computation is cheaper, the actual memory consumption is lower. Here, too, activations are stored for each layer.

For the Reversible LM, memory doesn't accumulate over the forward pass, since activations are recomputed during the backward pass instead of being stored. But because it keeps two streams of intermediate activations per layer, the actual memory allocation per layer (i.e. the step size) is approximately twice that of the transformer. Note that we observe 4 peaks in the chart for the Reversible LM; each peak corresponds to a forward and backward pass. The timing isn't directly comparable to that of the Transformer LM, so the plot for the Reversible LM starts in the middle of a forward and backward pass.
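For intuition, here is a minimal sketch of the reversible residual coupling behind such layers (RevNet-style, not the exact implementation used in these experiments). Because the inputs can be reconstructed from the outputs, the block doesn't need to store activations for the backward pass; `f` and `g` are stand-ins for the attention and feed-forward sub-blocks.

```python
import torch
import torch.nn as nn

class ReversibleBlock(nn.Module):
    """Sketch of a reversible residual block with two activation streams."""

    def __init__(self, f: nn.Module, g: nn.Module):
        super().__init__()
        self.f, self.g = f, g

    def forward(self, x1, x2):
        y1 = x1 + self.f(x2)   # e.g. attention sub-block
        y2 = x2 + self.g(y1)   # e.g. feed-forward sub-block
        return y1, y2

    def inverse(self, y1, y2):
        # Reconstruct the inputs from the outputs; this is what lets the
        # backward pass recompute activations instead of storing them.
        x2 = y2 - self.g(y1)
        x1 = y1 - self.f(x2)
        return x1, x2

# Example stand-ins: f = g = nn.Sequential(nn.Linear(64, 64), nn.GELU())
```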

The full Reformer includes both LSH attention and reversible residual layers. Memory allocation therefore doesn't grow layer by layer like the Transformer's, and since LSH attention is cheaper than standard attention, the peak memory usage is smaller than for the Reversible LM.

[Figure: memory allocation over time for the Transformer LM, LSH LM, Reversible LM and Reformer LM]



2. Reformer Memory vs Number of Hashes

In previous sections we have seen that increasing the number of hashes leads to better performance, as the LSH-attention approximation approaches classic dot-product attention. But since we have to store the result of each hashing round, we would expect memory to grow linearly with the number of hashes. Note that the number of hashes only matters during the actual LSH-attention calculation, i.e. the intermediate shape of LSH-attention is [batch_size, n_chunks, chunk_size, chunk_size*2], where n_chunks is the product of the number of hash buckets per round and the number of hash rounds, n_chunks = n_buckets * n_hashrounds.
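To make the role of the hash rounds concrete, here is a minimal sketch of angular LSH bucketing with random rotations, loosely following the scheme described in the paper. The function name and shapes are illustrative rather than the implementation used in the experiments; each extra hash round produces another full set of bucket assignments that must be kept and attended over, which is why memory grows with the number of hashes.

```python
import torch

def lsh_hash(vecs, n_buckets, n_hash_rounds):
    """Assign each position to a bucket per hash round.

    vecs: [batch, seq_len, d_k] shared query/key vectors.
    Returns bucket ids of shape [batch, n_hash_rounds, seq_len].
    """
    batch, seq_len, d_k = vecs.shape
    # One random rotation per hash round; buckets come in +/- pairs.
    rotations = torch.randn(n_hash_rounds, d_k, n_buckets // 2, device=vecs.device)
    rotated = torch.einsum("bsd,rdh->brsh", vecs, rotations)
    return torch.argmax(torch.cat([rotated, -rotated], dim=-1), dim=-1)

# e.g. qk = torch.randn(1, 4096, 64)
#      buckets = lsh_hash(qk, n_buckets=256, n_hash_rounds=8)  # [1, 8, 4096]
```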

The figure below, from our experiments, confirms that peak memory scales linearly with the number of hashes. The memory peak happens during the forward pass, while LSH-attention is being calculated. Note that the output of LSH-attention has shape [bs, seq_len, d_model] and is independent of the number of hashes; this explains the drop in memory once the LSH-attention calculation is finished. A 6-layer ReformerLM with a sequence length of 4096 was used for this experiment.

[Figure: memory allocation over time for Reformer LMs with different numbers of hashes]



3. Reformer Memory vs Number of Layers

In this experiment we compare Reformer LMs with n_layers = 2, 4, 6, 8, 10, 12; 8 hashes are used in all cases. We see the same memory peak during the LSH-attention calculation as in the figure above, and also that each added layer increases the memory footprint by an equal amount throughout the entire process. The figure shows two forward and backward passes.

Since the reformer doesn't accumulate activations through all its layers, but only stores the intermediate activations of the current layer, the peak memory allocation grows by a roughly fixed amount for each extra layer (a minimal measurement sketch follows the figure below).

[Figure: memory allocation over time for Reformer LMs with 2, 4, 6, 8, 10 and 12 layers]
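The kind of depth sweep behind this figure could be measured with a sketch like the one below, reusing PyTorch's peak-memory counters. `make_reformer_lm` is a hypothetical stand-in for whatever constructor builds the Reformer LM used in these experiments.

```python
import torch

def peak_memory_per_step(model, batch, targets, loss_fn, device=0):
    # Reset the peak-memory counter, run one forward/backward pass,
    # and report the peak GPU memory allocated during that step (in MB).
    torch.cuda.reset_peak_memory_stats(device)
    out = model(batch)
    loss_fn(out, targets).backward()
    return torch.cuda.max_memory_allocated(device) / 2**20

# Hypothetical sweep over model depth:
# for n_layers in (2, 4, 6, 8, 10, 12):
#     model = make_reformer_lm(n_layers=n_layers).cuda()
#     print(n_layers, peak_memory_per_step(model, batch, targets, loss_fn))
```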



Summary

Our experiments have verified that the reformer has a much smaller peak memory allocation than the transformer. We have also shown that the reformer can scale to much deeper models than the transformer within a fixed memory budget. The main memory bottleneck of the reformer is the number of hash rounds used. For practical purposes this means that a user will have to strike a balance between performance and memory budget to fit their particular needs.