
Reformer Reproducibility

Submission to Reproducibility challenge 2020: "Reformer: The Efficient Transformer" by Nikita Kitaev, Łukasz Kaiser and Anselm Levskaya, accepted to ICLR 2020.

Reproducibility Summary

This paper attempts to reproduce and thus validate the central claims of the ICLR 2020 paper "Reformer: The Efficient Transformer" by Kitaev et al., notably that the techniques introduced enable performance on par with a traditional Transformer model while being much more memory-efficient and much faster on long sequences. This fast.ai community effort finds through experimentation that these claims indeed hold, despite the more limited amount of compute available to us. However, substantial coding effort was needed to implement the central aspects of the paper, with a lack of documented hyperparameters compounding this difficulty.

In brief: the memory and speed claims were reproduced; the performance claims were harder to assess conclusively, partly due to our limited computational resources and the additional hyperparameter tuning and engineering effort required to integrate the techniques into a model. We cannot rule out issues with our own implementation, and differences between frameworks (the original JAX code versus our PyTorch implementation) introduce numerical and backend differences that can cause considerable variation in performance.



Scope of Reproducibility


Research into how Transformer models can be made more computationally and memory efficient has received increased attention; for example, the Long Range Arena paper gives a thorough overview of the variety of methods proposed to date, including the Reformer.

The Reformer paper introduced a model, the "Reformer", based on Locality Sensitive Hashing (LSH) Attention, Reversible Residual Layers, Chunked Feed Forward Layers, Axial Positional Encodings, and Shared Query-Key Attention, techniques that are claimed to deliver performance on par with Transformer models while being much more memory-efficient and much faster on long sequences.

The scope of this reproducibility effort is to verify these claims of memory efficiency and speed on longer sequences. In order to validate the paper's claims, our experiments replicated the LSH Attention, Reversible Residual Layer, and Shared Query-Key Attention NLP experiments with a custom synthetic dataset described in the paper, as well as with the WMT [REFERENCE], enwik8 [REFERENCE], and tiny Shakespeare [REFERENCE] datasets. The computer vision experiments were not assessed due to a lack of available compute for this community effort.



Methodology


We reimplemented the original Transformer model (Vaswani et al.) [REF], which then provided a foundation from which to modify individual model components to test the techniques introduced in the paper. We referenced available code for our Transformer [Transformer Code References] and Reformer [Reformer Code References] implementations. During our experiment design we also referenced the authors' code to replicate the data loading and tokenization methods used, and ported them into the fastai framework.

With regard to compute, we used Tesla T4, P4, and K80 GPUs provided in Google Colaboratory, along with a single Tesla V100 GPU on DataCrunch.io, three Tesla T4 VMs on GCP (provided courtesy of [DAGsHub](attach BIBTEX Reference below)), a Titan V, a GeForce 2080 Ti, and a GeForce 3090.

The fastai framework was used for data loading and model training; experiment tracking was carried out with [Weights and Biases](attach BIBTEX Reference below); the nbdev literate programming environment [nbdev REF] was used for all development work; and Discord was used to communicate between our distributed team.



Results


We examined the main claims of the paper and observed the same relative performance between our baseline Transformer and our implementations of LSH Attention and Reversible Residual Layers, both individually and combined into the full Reformer model. Due to a lack of available computational resources we were unable to validate the absolute performance figures achieved in the paper. Nonetheless we observed that the use of Shared-QK Attention and Reversible Residual Layers did not adversely impact performance relative to a baseline Transformer, while also being more memory efficient. As sequence length was increased we observed that LSH Attention was much faster than standard attention, again as claimed in the paper. Increasing the number of hashes in LSH Attention also improved model performance, as claimed. One issue encountered was vanishing gradients when using Reversible Residual Layers with mixed precision training, possibly due to the unconventional way the backward pass is calculated.



What was easy


Implementation of the Chunked Feed Forward layers was relatively straightforward, while code for the Axial Positional Encodings was imported from the axial_positional_encoding GitHub repository [REF lucidrains].



What was difficult


This paper introduced five new techniques, meaning that substantial effort had to be made to ensure a correct reimplementation of each technique. While Reformer code resources were available, the paper was quite challenging to implement because many design decisions and hyperparameters are not fully documented.



Communication with original authors

The authors were receptive to email correspondence and clarified a number of implementation details.



References

WandB BibTeX citation:

@misc{wandb,
  title  = {Experiment Tracking with Weights and Biases},
  author = {Biewald, Lukas},
  year   = {2020},
  note   = {Software available from wandb.com},
  url    = {https://www.wandb.com/}
}

DAGsHub BibTeX citation:

@misc{dagshub,
  title  = {Data Science Collaboration with DAGsHub},
  author = {Pleban, Dean and Smoilovsky, Guy},
  year   = {2019},
  note   = {Software available from dagshub.com},
  url    = {https://dagshub.com/}
}

nbdev citation:

Howard, Jeremy, nbdev, (2020), GitHub repository, https://github.com/fastai/nbdev

Reformer Code References

Transformer Code Resources



Report



1. Introduction

Attention has become an integral part of many deep learning algorithms in the form of the Transformer architecture (https://arxiv.org/abs/1706.03762), with many applications in text, e.g. the high-profile GPT models [ref], and lately also increasingly in vision (ViT: https://arxiv.org/abs/2010.11929).

One of the main characteristics of the Transformer is its large number of parameters and costly training (see e.g. https://ruder.io/research-highlights-2020/index.html#1-scaling-up-and-down). Finding ways to match the performance of the Transformer while reducing the cost of training is therefore a democratizing effort. The Reformer attempts to achieve this by reducing the memory footprint of the Transformer, thus enabling deeper models and longer sequences within a fixed resource budget.

This particular direction of research has seen increased attention in 2019 and 2020; for example, the Long Range Arena paper gives a thorough overview of model efficiency and methods for dealing with longer sequence lengths to date, including the Reformer. The Reformer itself has received a fair amount of attention, with implementations available from Hugging Face and Phil Wang (lucidrains), among others.



2. Scope of Reproducibility

The Reformer represents a collection of techniques aimed at reducing the memory footprint of the Transformer while retaining acceptable performance. A reduced memory footprint also enables the modeling of very deep models and long sequences within a given computational budget. The main claims of the reformer paper are:

  1. Proof of concept: A full attention Transformer can solve a sequence copy task perfectly. An LSH-attention Transformer can also solve it, but with decreasing performance as the number of hashing rounds decreases. A model trained with one type of attention can also be evaluated with a different type of attention.
  2. Evaluation speed: The number of hashes of the Reformer can be increased at evaluation time to produce more accurate results. The evaluation speed measured in seconds/step grows with the number of hashes but not with increased sequence length.
  3. Attention approximation: The performance of the Reformer increases with the number of hashing rounds and is close to full attention performance at n_hashes = 8.
  4. Scaling: Deep Reformer models: a) can be trained on very long sequences using a single accelerator (GPU or TPU core) and b) have better performance.
  5. The performance of Reversible Transformers is similar to that of the Transformer on language modeling tasks.
  6. Shared queries and keys in the attention module have minimal effect on performance in a Transformer.
  7. The Reformer reduces the memory footprint compared to the Transformer.


3. Methodology

3.1 Model Description

The Reformer has five main novelties, of which we go into detail on the first two:

3.1.1 LSH-attention

In normal attention, we perform a matrix multiplication of all keys and queries. This is the same as taking the dot product of every query with every key and is the source of the quadratic cost of attention. Dot products are large when vectors are aligned. We then pass the dot products through a softmax, which makes relatively large inputs even larger. Keys and queries that are close together in our high-dimensional space will therefore end up with a high score after the softmax. If we could figure out upfront which vectors were close, we could disregard the ones that are far away, since their softmax scores would most likely be small. It turns out that Locality Sensitive Hashing (LSH) (Andoni et al., 2015) is a suitable algorithm for clustering high-dimensional vectors.
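To make the quadratic cost concrete, here is a minimal sketch of standard scaled dot-product attention (single head, no projections or dropout); it is illustrative rather than a copy of our implementation:

```python
import math
import torch

def dot_product_attention(q, k, v, mask=None):
    # q, k, v: [batch, seq_len, d_k]
    # The [seq_len, seq_len] score matrix below is the source of the
    # quadratic memory and time cost discussed above.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if mask is not None:
        scores = scores.masked_fill(~mask, float("-inf"))  # mask: True = keep
    attn = scores.softmax(dim=-1)  # large dot products dominate after the softmax
    return attn @ v
```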

LSH clustering

We could do perfect clustering by comparing each key and query, but that would still leave us with the quadratic cost. LSH instead approximates clustering probabilistically: its cost is reduced, but it cannot guarantee perfect accuracy. The intuition behind LSH is described in this blog post: https://www.pragmatic.ml/reformer-deep-dive/

  1. If we draw a random vector u and take its dot product with other vectors, we get positive and negative scores depending on whether the vectors are aligned with u or not.
  2. If we take the sign of this dot product, we have effectively created two "buckets" for our vectors: '+' and '-'.
  3. If we draw more lines ("hash rounds"), we create additional buckets in the same way. E.g. 3 lines would give us 2^3 = 8 buckets.
  4. Vectors that end up in the same bucket have a high probability of being close together, and this probability goes up as the number of hash rounds increases.

[Figure: bucketing of vectors via random projections]

Note that the Reformer paper does LSH via random projections as described in (Andoni et al.,2015).
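A minimal sketch of bucketing via random projections follows. The function name lsh_hash and the argmax-over-projections formulation are our reading of Andoni et al. and the Reformer paper, not a copy of the authors' code:

```python
import torch

def lsh_hash(x, n_buckets, n_rounds):
    # x: [batch, seq_len, d] vectors to be bucketed.
    # One independent random projection per hash round; vectors pointing in
    # similar directions are likely to land in the same bucket.
    batch, seq_len, d = x.shape
    R = torch.randn(n_rounds, d, n_buckets // 2, device=x.device)
    proj = torch.einsum("bsd,rdh->brsh", x, R)   # [batch, rounds, seq, n_buckets/2]
    proj = torch.cat([proj, -proj], dim=-1)      # [batch, rounds, seq, n_buckets]
    return proj.argmax(dim=-1)                   # bucket id per token and round
```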

LSH-attention

The next step is to add LSH clustering to our new attention mechanism. We refer to Figure 2 of the Reformer paper, which gives a clear overview of the main steps; a minimal code sketch of these steps follows the list below:

[Figure 2 from the Reformer paper: overview of the LSH attention steps]

  1. First we receive an input sequence of tokens in the original order. In normal attention, our tokens are projected into keys and queries, but here we set keys and queries to be identical (see section 3.1.2).
  2. We then use LSH clustering to produce a bucket ID for each item (i.e. key/query). We sort the items first by bucket ID and then by position in the original sequence.
  3. Group the sorted items into equal-sized chunks, since the number of items in each bucket may vary.
  4. Concatenate chunks with their previous chunk to allow for limited in-between attention. In the diagram, we also mask items from other buckets, so that only in-bucket attention is allowed.
  5. Calculate a normal dot product attention within each concatenated chunk.
  6. Un-chunk and un-sort all items.
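The sketch below walks through these steps for a simplified single-round, single-head case, reusing the lsh_hash helper sketched earlier. It keeps the bucket mask but omits causal masking, the self-attention mask, and multi-round combination, and the first chunk wraps around to the last one, so it is an illustration rather than a faithful reimplementation:

```python
import math
import torch

def lsh_attention(qk, v, n_buckets, chunk_size):
    # qk: [seq_len, d] shared queries/keys, v: [seq_len, d]
    seq_len, d = qk.shape

    # Steps 1-2: bucket each item, then sort by (bucket id, original position)
    buckets = lsh_hash(qk.unsqueeze(0), n_buckets, n_rounds=1)[0, 0]   # [seq_len]
    order = torch.argsort(buckets * seq_len + torch.arange(seq_len, device=qk.device))
    undo = torch.argsort(order)
    sqk, sv, sb = qk[order], v[order], buckets[order]

    # Step 3: split the sorted sequence into equal-sized chunks
    n_chunks = seq_len // chunk_size
    cqk = sqk.view(n_chunks, chunk_size, d)
    cv  = sv.view(n_chunks, chunk_size, d)
    cb  = sb.view(n_chunks, chunk_size)

    # Step 4: let every chunk also attend to the previous chunk
    # (chunk 0 wraps around to the last chunk in this sketch)
    kv_k = torch.cat([cqk.roll(1, dims=0), cqk], dim=1)   # [n_chunks, 2*chunk, d]
    kv_v = torch.cat([cv.roll(1, dims=0),  cv],  dim=1)
    kv_b = torch.cat([cb.roll(1, dims=0),  cb],  dim=1)

    # Step 5: normal dot-product attention inside each concatenated chunk,
    # masking out items that fall in a different bucket
    scores = torch.einsum("cid,cjd->cij", cqk, kv_k) / math.sqrt(d)
    scores = scores.masked_fill(cb[:, :, None] != kv_b[:, None, :], float("-inf"))
    out = scores.softmax(dim=-1) @ kv_v                    # [n_chunks, chunk, d]

    # Step 6: un-chunk and un-sort back to the original token order
    return out.reshape(seq_len, d)[undo]
```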

LSH attention complexity

According to the Reformer paper, the time and memory complexities of scaled dot-product and LSH attention are:

| Attention Type | Memory Complexity | Time Complexity |
|---|---|---|
| Scaled Dot-Product | $\max(bn_hld_k, bn_hl^2)$ | $\max(bn_hld_k, bn_hl^2)$ |
| LSH Attention | $\max(bn_hld_k, bn_hln_r(4l/n_c)^2)$ | $\max(bn_hld_k, bn_hn_rl(4l/n_c)^2)$ |

Where:

  • $l$ is the length of the input sequence
  • $b$ is the batch size
  • $n_h$ is the number of heads in multihead attention
  • $n_c$ is the number of LSH chunks
  • $n_r$ is the number of hash rounds
  • $d_k$ is the model dimension

Language models typically benefit from long input sequences ($l$) to give the model a longer context [ref], so we often want to make $l$ large. This means that the cost is usually dominated by the latter part of the max in the table above, i.e. $l$ is usually larger than $d_k$.

In this case dot-product attention has a quadratic cost in both time and memory that depends on $l^2$. If we want to compare LSH-attention to dot-product attention, we can compare $ln_r(4l/n_c)^2$ to $l^2$, since $b$ and $n_h$ are common terms. This means that the cost of LSH relative to normal attention is linear w.r.t. $n_r$, but decreases quadratically with $n_c$. In our implementation we did not set $n_c$ directly, but rather defined a bucket size $b_s$, where $n_c = l/b_s$.

We can therefore rewrite the LSH complexity as $n_r(4b_s)^2l$, which is linearly dependent on $l$. It also makes intuitive sense that the complexity of LSH-attention grows quadratically with $b_s$, since we do normal dot-product attention within each chunk.
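Spelling out the substitution $n_c = l/b_s$ behind this rewrite:

$$ l \, n_r \left( \frac{4l}{n_c} \right)^2 = l \, n_r \left( \frac{4l}{l/b_s} \right)^2 = n_r (4 b_s)^2 \, l $$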

3.1.2 Shared keys and queries

To ensure that the number of queries is equal to the number of keys within each bucket, we want $h(k_j) = h(q_j)$. This is achieved by setting $k_j = \frac{q_j}{\|q_j\|}$. Note that it is necessary to prevent tokens from attending to themselves, as the dot product of perfectly aligned vectors would always be the largest; this is done by masking. Sharing queries and keys also reduces the size of the attention input projection layer.
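A minimal sketch of what a shared-QK projection looks like in code (the module name SharedQKV and its dimensions are ours for illustration, not the authors' code; our actual implementation differs in detail):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedQKV(nn.Module):
    # One projection produces both queries and keys; keys are L2-normalized
    # so that h(k_j) = h(q_j), and the separate key projection is dropped.
    def __init__(self, d_model, d_head):
        super().__init__()
        self.to_qk = nn.Linear(d_model, d_head, bias=False)
        self.to_v  = nn.Linear(d_model, d_head, bias=False)

    def forward(self, x):                      # x: [batch, seq_len, d_model]
        q = self.to_qk(x)
        k = F.normalize(q, p=2, dim=-1)        # k_j = q_j / ||q_j||
        v = self.to_v(x)
        return q, k, v
```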

3.1.3 Reversible layers

Reversible Residual Layers were presented in Gomez et al. [Ref]. The idea is to make layers "reversible", i.e. each layer's activations can be computed from the next layer's activations. Adding Reversible Residual Layers enables us to perform backpropagation without storing the activations in memory. The memory requirements of architectures based on Reversible Residual blocks are independent of the number of layers in the model. Hence it allows one to train arbitrarily deep models, at the cost of increased compute.

[Figure: (a) the forward and (b) the reverse computations of a reversible residual block (Gomez et al.)]

The forward pass for a Reversible Residual Layer is given as

$y_1 = x_1 + \mathcal{F}(x_2)$

$y_2 = x_2 + \mathcal{G}(y_1)$

When doing a backward pass, the intermediate activations can be reconstructed as follows:

$x_2 = y_2 - \mathcal{G}(y_1)$

$x_1 = y_1 - \mathcal{F}(x_2)$

Applying this to the Transformer architecture, we set $\mathcal{F} = \mathbf{MultiHeadAttention}$ and $\mathcal{G} = \mathbf{FeedForward}$. $x_1$ and $x_2$ are initialized as copies of the original input.

Our implementation of Reversible Residual Layers is largely inspired by https://github.com/lucidrains/reformer-pytorch and can be found here.
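A minimal sketch of the equations above (module and method names are ours; a complete implementation also needs a custom autograd function so that activations are actually freed during the forward pass and recomputed during the backward pass):

```python
import torch.nn as nn

class ReversibleBlock(nn.Module):
    # In the Reformer, F = multi-head attention and G = feed-forward.
    def __init__(self, f, g):
        super().__init__()
        self.f, self.g = f, g

    def forward(self, x1, x2):
        y1 = x1 + self.f(x2)
        y2 = x2 + self.g(y1)
        return y1, y2

    def reconstruct(self, y1, y2):
        # recover the inputs from the outputs instead of storing them
        x2 = y2 - self.g(y1)
        x1 = y1 - self.f(x2)
        return x1, x2
```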

3.1.4 Chunked feed-forward layers

The final feed-forward layers in the Transformer can have a lot of parameters and be costly to compute in a single step. The Reformer instead processes this step in smaller chunks, effectively trading computation time for memory.
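A minimal sketch of the idea, assuming ff is any position-wise feed-forward module and the chunk count is illustrative:

```python
import torch
import torch.nn as nn

class ChunkedFeedForward(nn.Module):
    # Apply the feed-forward layer to the sequence in chunks so that the
    # large intermediate activation is materialized for only a slice of
    # the sequence at a time.
    def __init__(self, ff, n_chunks=8):
        super().__init__()
        self.ff, self.n_chunks = ff, n_chunks

    def forward(self, x):                       # x: [batch, seq_len, d_model]
        chunks = x.chunk(self.n_chunks, dim=1)  # split along the sequence axis
        return torch.cat([self.ff(c) for c in chunks], dim=1)
```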

3.1.5 Axial positional encodings

Axial positional encodings are a more efficient way of encoding positions than e.g. sinusoidal positional encodings. This is especially relevant for very long sequences.
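A minimal sketch of the factorization, assuming the maximum sequence length factors as axial_shape[0] * axial_shape[1] and the two embedding dimensions sum to d_model (class and parameter names are ours, not the authors'):

```python
import torch
import torch.nn as nn

class AxialPositionalEncoding(nn.Module):
    # Two small embedding tables over a 2D grid replace one table of size
    # n1 * n2, e.g. 64 + 64 rows instead of 4096 for a 4096-token context.
    def __init__(self, d_model, axial_shape=(64, 64), axial_dims=(32, 32)):
        super().__init__()
        assert sum(axial_dims) == d_model
        n1, n2 = axial_shape
        self.emb1 = nn.Parameter(torch.randn(n1, 1, axial_dims[0]))  # varies along axis 1
        self.emb2 = nn.Parameter(torch.randn(1, n2, axial_dims[1]))  # varies along axis 2
        self.axial_shape = axial_shape

    def forward(self, x):                        # x: [batch, seq_len, d_model]
        n1, n2 = self.axial_shape
        pos = torch.cat([self.emb1.expand(n1, n2, -1),
                         self.emb2.expand(n1, n2, -1)], dim=-1)      # [n1, n2, d_model]
        return x + pos.reshape(n1 * n2, -1)[: x.size(1)]
```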



3.2 Datasets

For all datasets, our implementation notes give full details: https://github.com/arampacha/reformer_fastai/blob/db065661af97ace71fd2c7cb177a94c127f792d6/README.md#data

WMT-14 Reference

@InProceedings{bojar-EtAl:2014:W14-33,
  author    = {Bojar, Ondrej and Buck, Christian and Federmann, Christian and Haddow, Barry and Koehn, Philipp and Leveling, Johannes and Monz, Christof and Pecina, Pavel and Post, Matt and Saint-Amand, Herve and Soricut, Radu and Specia, Lucia and Tamchyna, Ale\v{s}},
  title     = {Findings of the 2014 Workshop on Statistical Machine Translation},
  booktitle = {Proceedings of the Ninth Workshop on Statistical Machine Translation},
  month     = {June},
  year      = {2014},
  address   = {Baltimore, Maryland, USA},
  publisher = {Association for Computational Linguistics},
  pages     = {12--58},
  url       = {http://www.aclweb.org/anthology/W/W14/W14-3302}
}



3.4 Experimental setup and code

Our code is available at https://github.com/arampacha/reformer_fastai and full project documentation is at: https://arampacha.github.io/reformer_fastai/.

We have made supplementary material available under the Exploration and Experiments sections of the documentation. These resources go into greater detail on the algorithms used and the experiment setup. The latter section also includes our training script and the Config class holding the default hyperparameters used for experiments.

All experiments are logged to [Wandb](attach BIBTEX Reference) and have their configs saved locally. Finally, models are also checkpointed and saved locally.



3.5 Computational requirements

With regard to compute, we used Tesla T4, P4, and K80 GPUs provided in Google Colaboratory, along with a single Tesla V100 GPU on DataCrunch, three Tesla T4 VMs on GCP (provided courtesy of [DAGsHub](attach BIBTEX Reference)), a Titan V, a GeForce 2080 Ti, and a GeForce 3090. The total number of GPU hours for the entire project is approximately 1600, not accounting for device type or GPU utilization.



4. Results

We attempt to verify each of the claims stated in section 2.



LSH attention evaluation on synthetic task

The synthetic task consists of copying a sequence of integers of the form 0w0w, where w is a sequence of some length composed of integers from 1 to 128. The first part of the sequence is masked from the loss function, so the goal for the model is to learn that midway through the sequence it has to repeat 0w. For a more detailed explanation of the task see the experiment notebook: https://arampacha.github.io/reformer_fastai/experiment.synthetic-task.html
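A minimal sketch of how such 0w0w sequences can be generated; the exact vocabulary range, separator convention, and loss masking here follow our reading of the task description rather than the authors' data pipeline:

```python
import torch

def synthetic_batch(batch_size, w_len, vocab_size=128):
    # Build sequences of the form 0w0w, where w is a random sequence of
    # integers (0 is reserved here as the separator token).
    w = torch.randint(1, vocab_size, (batch_size, w_len))
    zeros = torch.zeros(batch_size, 1, dtype=torch.long)
    seq = torch.cat([zeros, w, zeros, w], dim=1)        # [batch, 2 * (w_len + 1)]
    x, y = seq[:, :-1], seq[:, 1:]                      # next-token prediction pairs
    loss_mask = torch.arange(y.size(1)) >= w_len        # score only the second 0w
    return x, y, loss_mask
```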

Claim: A full attention transformer can solve this task perfectly. An LSH-attention transformer can also solve it, but with decreasing performance as the number of hashing rounds decreases. A model trained with one type of attention can also be evaluated with a different type of attention.

Experiments

For this task, we used our own implementation of LSHLM: https://arampacha.github.io/reformer_fastai/reformer.html#LSHLM. For details on the LSH algorithm see our project documentation: https://arampacha.github.io/reformer_fastai/exploration.LSH.html. Note that our model uses shared keys and queries even when using full attention. We train a model with full attention and three models with LSH-attention with 1, 2, and 4 hashing rounds respectively.

We trained for 150,000 steps with a batch size of 64, a training set size of 12,800, and a validation set size of 1,280. We used hyperparameters as described in the paper, and other defaults suggested in the Trax GitHub repository: https://github.com/google/trax. See our SyntheticConfig class for experiment defaults: https://arampacha.github.io/reformer_fastai/experiment-configs.html#SyntheticConfig.

Training dynamics

Seed

From our early prototyping using a standard transformer model, we had observed that the model seemed to suddenly "understand" the task and reach zero loss and perfect accuracy as expected. However, when we started running our LSHLM model this was not always the case. The runs below illustrate this behavior. The runs are identical except for the seed, but the outcome of training is very different.

Learning rate

Similarly, the learning rate seemed to affect training as well. The plot below shows an LSHLM with a single hashing round trained with a base lr of 1e-3 and 1e-4. We used the 1-cycle policy from the fast.ai library for learning rate scheduling: https://docs.fast.ai/callback.schedule.html#Learner.fit_one_cycle






Results

We were able to train models that performed as expected. We trained four models:

  • Full attention
  • LSH with 1,2 and 4 hashing rounds

We then evaluated the models with full attention and LSH attention with 8, 4, 2, and 1 hashing rounds respectively. Results are summarized in the table below. The first column shows how the model was trained, the subsequent ones how it was evaluated. See our documentation for the analysis setup: https://arampacha.github.io/reformer_fastai/experiment.synthetic-task-analysis.html.

| Training \ Evaluation | Full Attention | LSH-8 | LSH-4 | LSH-2 | LSH-1 |
|---|---|---|---|---|---|
| Full Attention | 100.00 | 1.37 | 1.85 | 3.00 | 4.56 |
| LSH-4 | 46.54 | 99.71 | 99.77 | 93.05 | 77.62 |
| LSH-2 | 75.94 | 96.6 | 97.45 | 97.08 | 86.06 |
| LSH-1 | 70.65 | 76.61 | 79.68 | 79.34 | 56.09 |

We see that the LSH model gradually performs worse with fewer hashing rounds both in training and validation, as expected. LSH-4 seems to give a near-identical performance to full attention. Limited hyperparameter search was needed to achieve the training results obtained.

Discrepancy with results from the paper

The results from the paper were as follows:

[Table from the Reformer paper: synthetic task accuracy for models trained and evaluated with different attention types]

We can see that there are three clear differences in the tables:

  1. Our results for this particular set of runs are a bit poorer than that of the paper, especially so for LSH-1. But as noted above, we were also able to train LSH-1 to near perfection when we used a higher learning rate. This suggests that the absolute numbers in the table are a bit random, and depend on the specific training setup.
  2. Models trained with LSH Attention and validated with standard attention do much better in our experiments than in the paper.
  3. The model trained with standard attention and validated with LSH does much worse in our experiment.

One explanation could be that there are issues with our implementation, but overall the models seem to behave as expected. When comparing to other implementations, we also see similar results; see our documentation: https://arampacha.github.io/reformer_fastai/experiment.synthetic-task-comparison.html

A second possibility is that due to random factors (as we observed with the role of the seed in training) the resulting models simply behave differently. E.g. when validating the full attention model, there is no clear trend in the role of hashing rounds.

A final explanation is that there might be a mix-up in the summarizing of results, leading to swapped rows/columns in the original report.



Effect of sharing QK

Shared Query-Key Attention ("shared-QK") was used in the Reformer paper to help further reduce the memory footprint of the model.

Claim: A shared query-key space does not perform worse than regular attention

To investigate this we train a standard Transformer LM and a Transformer LM with shared-QK attention on the enwik8 dataset. The Bytes-Per-Character (BPC) metric was used. Each model had 3 layers, d_model equal to 1024, and used axial positional embeddings. They were trained for 10 epochs with a sequence length of 4096 and a batch size of 8 using the Adafactor optimizer. Gradient Accumulation was used where needed. The mean BPC of 3 training rounds was used.

After experimentation, we cannot validate the claim that shared-QK attention does not perform worse than standard attention for this experimental setting. Nor did we see the effect of shared-QK attention training slightly faster, as noted in the paper. Potentially, with additional training, shared-QK attention performance would converge with standard attention; we leave this to future work.






During the experiment phase of the synthetic task (section 3.1), we used a Transformer LM with shared query-keys. We expected this model to solve the relatively simple synthetic task to perfection, but we found it difficult to train the model to convergence even when training for 750 epochs. When we instead used a standard Transformer LM (i.e. with separate queries and keys), the model consistently converged in 7-8 epochs.

The figure below illustrates several runs with an identical setup, changing only the model seed. Out of five runs with shared query-key, only one model converged, after about 200 epochs. For our standard Transformer LM, the three runs converged within 7-8 epochs.



Effect of reversible layers

Claim:

1) Reversible Residual Layers in a Transformer enable more memory efficient training and do not come at the expense of model accuracy

To validate this claim we trained a Transformer language model and a full sequence-to-sequence Transformer using Reversible Residual Layers and compared them to their standard Transformer equivalents.

Reversible LM on enwik8

We train the Reversible Transformer LM ("ReversibleLM"), a 6-layer causal language model, on the enwik8 dataset with a sequence length of 4096, using the Adafactor optimizer. A batch size of 8 was used, via Gradient Accumulation. Experiments were run on 15GB and 12GB GPUs and training was carried out in full precision. Unfortunately, training this model on full 64K sequences from enwik8 as per the original paper was not feasible given our computational budget. The average of 3 runs for each model type was taken.

We could not validate the claim that Reversible Residual Layers do not have a significant impact on language model performance, due to the sizeable difference of 0.11 validation BPC between the ReversibleLM and the baseline TransformerLM.






Reversible Transformer on translation task

We train a full Reversible Transformer ("ReversibleTransformer"), a 12-layer sequence-to-sequence Transformer model with Reversible Residual Layers, on the WMT-14 en-de translation dataset. Given the short sequences in this dataset, a sequence length of 256 was used with a batch size of 64 and the Adam optimizer. Gradient Accumulation was used when training with a 12GB GPU. Training was carried out for 2 epochs in full precision. The one-cycle learning rate schedule from fast.ai was used, with a maximum learning rate of 1e-4, an initial div of 5, and a percent start of 12.5%.

From our experiments, we COULD / COULD NOT validate the claim that Reversible Residual Layers do not have a significant impact on Transformer model performance due to XXX






Effect of number of hashing rounds on the performance

When using LSH for chunking there is a small probability that similar tokens will end up in different buckets, and attention will therefore not be computed for these tokens. To reduce this probability we can do multiple rounds of hashing using different hashing functions.

Claim: Performance of Transformer with LSH attention increases with the number of hashing rounds and is close to full attention performance at n_hashes = 8.

To reinforce this claim we train and compare models with different numbers of hashing rounds on the enwik8 dataset.

For this experiment, we trained a 3-layer deep Transformer with LSH attention for 10 epochs. The training was done on sequences of length 4096 and an effective batch size of 8. A full list of model parameters may be found here. Refer to this notebook for full experiment setup.

We report training and validation losses for runs with a number of hashing rounds from 2 to 16 compared to baseline (full attention Transformer).

The results illustrate that training and validation losses tend to improve with an increasing number of hashing rounds. Note that both time and memory requirements also increase with n_hashes.

Also, while training losses are very close for larger n_hashes and full attention, models using LSH seem to have a slightly higher generalization error.






LSH attention evaluation speed

Claim: The number of hashes of a model with LSH-attention can be increased at evaluation time to produce more accurate results. The evaluation speed measured in seconds/step grows with the number of hashes but not with increased sequence length.

To verify the claim we test evaluation time on the synthetic task with hyperparameters as indicated in the right part of Figure 5 in the Reformer paper. See our documentation for experiment details: https://arampacha.github.io/reformer_fastai/experiment.speed-lsh_synthetic-task.html. Our results are summarized in the figure below:

[Figure: evaluation speed (seconds/step) vs. sequence length for full attention and LSH attention with different numbers of hashes]

We were unable to complete the longest sequence lengths for full attention due to out-of-memory errors on a single GPU. The results for the smaller sequences mostly match, but our full attention model seems to be a bit faster relative to LSH-attention than in the paper.

All results are slightly faster than in the paper. This could be due to the method of measurement, architecture choices, or hardware.



Deep Reformer models

It has been shown in multiple studies that deeper models generally have better performance. The Reformer is designed to be memory efficient, so it can enable training deep models (via Reversible Residual Layers) on very long sequences (via LSH attention).

Claims: 1) Deep Reformer models can be trained on very long sequences using a single accelerator (GPU or TPU core); 2) Deeper Reformer models have better performance.

enwik8 experiment

To reinforce these claims we train models of different depths on inputs with seq_len = 16,384, using the enwik8 dataset. Our experiments are designed to run on a single 12GB GPU.

Unfortunately, training very deep models is beyond our computational budget. The first claim, however, is supported by the fact that Reversible Residual Layers have $O(1)$ memory complexity in the number of layers, so we can fit very deep models at the cost of extra compute.

We trained models with 3, 6, and 12 layers for 4 epochs. There is a trend for deeper models to have lower training loss as training progresses, and it would be beneficial to train longer to see if this trend strengthens.






Memory Consumption

The main motivation of the Reformer paper is that it reduces the memory footprint compared to a normal Transformer with little compromise in performance. In the previous sections we have seen that the Reformer can achieve similar performance to a normal Transformer on many tasks, and that it scales well to deep architectures (number-of-layers experiment) and long sequences (evaluation speed experiment). But how is memory actually allocated during training?

Claim: The reformer reduces the memory footprint compared to a normal transformer. The claim is summarized in table 5 of the reformer paper.

| Model Type | Memory Complexity | Time Complexity |
|---|---|---|
| Transformer | $\max(bld_{ff}, bn_hl^2)n_l$ | $(bld_{ff} + bn_hl^2)n_l$ |
| Reversible Transformer | $\max(bld_{ff}, bn_hl^2)$ | $(bn_hld_{ff} + bn_hl^2)n_l$ |
| Chunked Reversible Transformer | $\max(bld_{model}, bn_hl^2)$ | $(bn_hld_{ff} + bn_hl^2)n_l$ |
| LSH Transformer | $\max(bld_{ff}, bn_hln_rc)n_l$ | $(bld_{ff} + bn_hn_rlc)n_l$ |
| Reformer | $\max(bld_{model}, bn_hln_rc)$ | $(bld_{ff} + bn_hn_rlc)n_l$ |



We write $d_{model}$ and $d_{ff}$ for model depth and assume $d_{ff} \geq d_{model}$. $b$ stands for batch size, $l$ for length, $n_l$ for the number of layers. We assume $n_c = l/32$ so $4l/n_c = 128$ and we write $c = 128^2$.

To investigate the claim we log the memory allocation of various Reformer combinations during 0.1% of an epoch of training on the tiny Shakespeare dataset. The sequence length for all experiments was set to 4096, with a batch size of 1. We investigate the following combinations:

  1. Comparing the Transformer LM, LSH LM, Reversible LM and full Reformer LM
  2. Comparing Reformer LMs with different number of hashes
  3. Comparing Reformer LMs with different number of layers

Our experiments have verified that the reformer has a much smaller peak memory allocation as compared to the transformer. We have also shown that the reformer can scale to much deeper models than the transformer within a fixed budget. The main bottleneck of the reformer w.r.t. memory is the number of hash rounds used. For practical purposes this means that a user will have to strike a balance between performance and budget to fit their particular need.



Comparing Transformer LM, LSH LM, Reversible LM and the full Reformer LM

The figure below shows the peak memory usage for the Transformer, LSH LM, Reversible LM, and the full Reformer. We see that the Transformer stores activations during the forward pass of training, and that these are gradually released as the backward pass completes. In this case we see 6 distinct jumps in memory, which correspond to the model's 6 layers.

With the LSH LM, with 8 hashing rounds, memory allocation is mostly parallel to the transformer, but since the LSH Attention computation is cheaper, the actual memory consumption is lower. Here we also observe that activations are stored for each layer.

For the Reversible LM, memory doesn't accumulate over the forward pass, since activations are recomputed during the backward pass. But since it needs to store 2 intermediate layers, the actual memory allocation per layer (i.e. the step size) is approximately twice that of the Transformer. Note that we observe 4 peaks in the chart for the Reversible LM. Each peak corresponds to a forward and backward pass. The timing isn't directly comparable to that of the Transformer LM, so the plot for the Reversible LM starts in the middle of a forward and backward pass.

The full Reformer includes both LSH-attention and Reversible Residual Layers. Memory allocation therefore doesn't accumulate as it does for the Transformer, and since LSH-attention is cheaper than standard attention, the peak memory usage is smaller than for the Reversible LM.

[Figure: memory allocation over time for the Transformer LM, LSH LM, Reversible LM, and Reformer LM]



Reformer Memory vs Number of Hashes

In previous sections we have seen that increasing the number of hashes leads to increased performance, as the attention approximation of LSH-attention approaches that of classic dot-product attention. But since we have to store the result of each hashing round, we would expect memory to grow linearly with the number of hashes. Note that it's only during the actual LSH-attention calculation that the number of hashes matters, i.e. the intermediate shape of LSH-attention is [batch_size, n_chunks, chunk_size, chunk_size*2], where n_chunks is the product of the number of hash buckets per round and the number of hash rounds, n_chunks = n_buckets * n_hashrounds.
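A rough back-of-the-envelope sketch of that intermediate tensor size, using the shape and the n_chunks relation given above (the chunk size and batch size here are illustrative, not the exact experiment settings):

```python
def lsh_scores_numel(bs, seq_len, chunk_size, n_hashrounds):
    # Element count of the [bs, n_chunks, chunk_size, 2 * chunk_size]
    # LSH-attention score tensor, with n_chunks = n_buckets * n_hashrounds
    # and n_buckets = seq_len // chunk_size per hash round.
    n_buckets = seq_len // chunk_size
    n_chunks = n_buckets * n_hashrounds
    return bs * n_chunks * chunk_size * 2 * chunk_size

# Doubling the number of hash rounds doubles the intermediate size:
assert lsh_scores_numel(1, 4096, 64, 8) == 2 * lsh_scores_numel(1, 4096, 64, 4)
```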

The figure below, from our experiments, confirms that peak memory scales linearly with the number of hashes. The memory peak happens during the forward pass, in the calculation of LSH-attention. Note that the output of LSH-attention is [bs, seq_len, d_model] and is independent of the number of hashes. This explains the drop in memory when the LSH-attention calculation is finished. A 6-layer ReformerLM with a sequence length of 4096 was used for this experiment.

[Figure: peak memory vs. number of hashes for a 6-layer ReformerLM with sequence length 4096]



Reformer vs Number of Layers

In this experiment we compare Reformer LMs with n_layers = 2, 4, 6, 8, 10, 12; 8 hashes are used in all cases. We see the same memory peak during the LSH-attention calculation as in the figure above, and also that each added layer increases the memory footprint equally during the entire process. The figure shows two forward and backward passes.

Since the Reformer doesn't accumulate activations through all its layers over the forward and backward passes, but only stores intermediate activations, the peak memory allocation scales linearly with each extra layer.

[Figure: memory allocation for Reformer LMs with 2 to 12 layers]



5. Discussion

This paper introduced five new techniques, meaning that substantial effort had to be made to ensure a correct reimplementation of each. While there were Reformer code resources available, the paper was quite challenging to implement because many design decisions and hyperparameters are not fully documented.

Having a team with many contributors meant that we had to improve our development process. Midway through the project we switched to nbdev [ref], a literate programming system with full CI/CD, which resulted in a much smoother development process.

LSH-Attention is also fairly complex compared to normal dot-product attention, and our barebones implementation ended up at over 200 SLOC. Our scaled dot-product attention implementation, for reference, is around 30 SLOC.

Several of the experiments are also fairly compute-intensive for a single-GPU setup, taking up to 30 hours per epoch. The enwik8 64K sequence length and imagenet-64 experiments had to be dropped altogether for this reason. We did, however, use smaller datasets instead and were able to investigate the claims this way.



