
Reformer Reproducibility - Long Draft

A fast.ai community submission to the Reproducibility challenge 2020: "Reformer: The Efficient Transformer" by Nikita Kitaev, Łukasz Kaiser and Anselm Levskaya, accepted to ICLR 2020.

Reproducibility Summary

This report attempts to reproduce, and thus validate, the central claims of the ICLR 2020 paper "Reformer: The Efficient Transformer" by Kitaev et al., notably that the techniques introduced enable performance on par with a traditional Transformer model while being much more memory-efficient and much faster on long sequences. Through experimentation, this fast.ai community research effort was able to validate the claims around speed on long sequences and also observed a reduced memory footprint.

However, our attempt to match the performance of a traditional Transformer model was not successful. Potential reasons for this include implementation differences deriving from the significant engineering effort and hyperparameter tuning required to implement these new techniques, a lack of computational resources to train the models for as long as in the original paper, or numerical differences due to implementation nuances between the authors' JAX code and our PyTorch code. Finally, substantial coding effort was needed to implement the central aspects of the paper, with a lack of documentation of hyperparameters compounding this difficulty.



Scope of Reproducibility

The Reformer paper introduced a model, the "Reformer", based on Locality Sensitive Hashing (LSH) Attention, Reversible Residual Layers, Chunked Feed Forward Layers, Axial Positional Encodings, and Shared Query-Key Attention, that is claimed to perform on par with Transformer models while being much more memory-efficient and much faster on long sequences.

The scope of this reproducibility effort is to verify these claims of memory efficiency and speed on longer sequences. In order to validate the paper's claims, our experiments replicated the LSH Attention, Reversible Residual Layers and Shared Query-Key Attention NLP experiments with the custom synthetic dataset described in the paper, as well as with the WMT-14 [REFERENCE], enwik8 [REFERENCE] and tiny Shakespeare [REFERENCE] datasets. The computer vision experiments were not assessed due to a lack of available compute for this community effort.



Methodology

We reimplemented the original Transformer model (Vaswani REF), which then provided a foundation from which to test the individual techniques introduced in the paper. We referenced available code for our Transformer [Transformer Code References] and Reformer [Reformer code references] implementations. During our experiment design we also referenced the authors' code to replicate the data loading and tokenization methods used, and ported them into the fastai framework.

The fastai framework was used for data loading and model training, experiment tracking was carried out with [Weights and Biases](attach BIBTEX Reference below), the nbdev literate programming environment [nbdev REF] was used for all development work, and Discord was used to communicate between our distributed team.

With regard to compute, a variety of cloud and local resources were used, including three Tesla T4 VMs on GCP (provided courtesy of [DAGsHub](attach BIBTEX Reference below)), Google Colaboratory, DataCrunch.io and Paperspace Gradient. All experimentation was carried out in a single-GPU setting, unlike the paper, which used 8 x V100s for some experiments. Total GPU runtime between testing and experimentation came to a little over 2 months.



Results

We examined the main claims of the paper and found that claims around speed on longer sequences as well as reduced memory footprint could be validated. As sequence length was increased we observed that LSH Attention was much faster than standard attention, as claimed in the paper. Increasing the number of hashes in LSH Attention also improved model performance, as claimed.

However, we could not match the performance of a traditional Transformer, either when testing the individual components that make up the Reformer or when testing the full Reformer itself. We must note that we could not run some experiments for as long as they were run in the original paper due to a lack of available computational resources. The underperformance of the Reformer that we observed may be due to under-training, implementation differences in the hyperparameters used, or nuances in the authors' JAX implementation vs our PyTorch implementation.

One issue encountered while training was exploding gradients when using Reversible Residual Layers with mixed-precision training, possibly due to the unconventional way the backward pass is calculated. In addition, several model settings were found to be unstable during training depending on the random seed or learning rate selected.



What was easy

Obtaining the WMT-14 and enwik8 datasets was straightforward as they are commonly used benchmarks. Reproducing the data processing and tokenization used was also fairly easy. Implementing the Chunked Feed Forward layers was relatively straightforward, while code for the Axial Positional Encodings was imported from the axial_positional_encoding GitHub repository [REF lucidrains].



What was difficult

This paper introduced 5 new techniques, meaning that substantial effort was needed to ensure a correct reimplementation of each. While Reformer code resources were available, the engineering details were quite challenging because many design decisions and hyperparameters are not fully documented in the paper. Integrating techniques such as shared-QK attention and Reversible Residual Layers also required significant hyperparameter tuning, and even then it was difficult to assess whether optimal parameters had been found.



Communication with original authors

The authors were receptive to email correspondence and clarified a number of implementation details. In hindsight, we could have communicated even more with them if needed.



Report



1.Introduction

One of the main characteristics of the Transformer is its large number of parameters and costly training (see e.g. https://ruder.io/research-highlights-2020/index.html#1-scaling-up-and-down). Finding ways of matching the performance of the Transformer while reducing the cost of training is therefore a democratizing effort. The Reformer attempts to achieve this by reducing the memory footprint of the Transformer, thus enabling deeper models and longer sequences within a fixed resource budget.



2. Scope of Reproducibility

The Reformer represents a collection of techniques aimed at reducing the memory footprint of the Transformer while retaining acceptable performance. A reduced memory footprint also enables the modelling of very deep models and long sequences within a given computational budget. The paper claims that the Reformer:

performs on par with Transformer models while being much more memory-efficient and much faster on long sequences

We break down this claim into the following sub-claims, based on the experiments in the paper:

Performance

  1. Proof of concept: A full-attention Transformer can solve a sequence copy task perfectly. An LSH-attention Transformer can also solve it, but with decreasing performance as the number of hashing rounds decreases. A model trained with one type of attention can also be evaluated with a different type of attention.
  2. Attention approximation: The performance of the Reformer increases with the number of hashing rounds and is close to full attention performance at n_hashes = 8.
  3. Scaling: Deep Reformer models: a) can be trained on very long sequences using a single accelerator (GPU or TPU core) and b) have better performance.
  4. The performance of Reversible Transformers is similar to that of the Transformer on language modelling tasks.
  5. Sharing queries and keys in the attention module has minimal effect on performance in a Transformer.
  6. The Reformer reduces the memory footprint compared to the Transformer.

Speed

  1. Evaluation speed: The number of hashes in the Reformer can be increased at evaluation time to produce more accurate results. The evaluation speed, measured in seconds/step, grows with the number of hashes but not with increased sequence length.


3. Methodology

3.1 Reformer Techniques

The Reformer paper introduces 5 techniques which can be implemented separately or combined together into a full Reformer model:

I) LSH attention

LSH attention is an approximation of full attention that is more memory- and time-efficient for longer sequences. The LSH LM used in our experiments is a traditional Transformer language model with its attention module replaced with LSH attention.

II) Shared keys and queries

Shared-QK attention was introduced to make LSH work better by ensuring that the number of queries is equal to the number of keys within each bucket.
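As a rough illustration, a shared-QK projection can be sketched in PyTorch as follows (a minimal sketch, not our exact implementation; keys are unit-normalized, as the paper does for shared-QK attention):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedQKProjection(nn.Module):
    """Minimal sketch of the shared query-key projection used in shared-QK
    attention: one linear layer produces both queries and keys, a separate
    layer produces values; keys are unit-normalized."""
    def __init__(self, d_model=1024):
        super().__init__()
        self.to_qk = nn.Linear(d_model, d_model, bias=False)  # shared for Q and K
        self.to_v = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):
        q = self.to_qk(x)                 # queries
        k = F.normalize(q, p=2, dim=-1)   # keys are the (normalized) queries
        v = self.to_v(x)
        return q, k, v
```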

III) Reversible Residual Layers

Reversible Residual Layers were first presented in Gomez et al. [REF]. Adding Reversible Residual Layers enables us to perform backpropagation without storing the activations in memory, resulting in significant memory savings at the cost of increased compute.
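The core idea can be sketched as follows (a minimal PyTorch sketch with our own naming; a real implementation wraps this in a custom autograd function so that activations are recomputed rather than stored during the backward pass):

```python
import torch.nn as nn

class ReversibleBlockSketch(nn.Module):
    """Minimal sketch of a reversible residual block (Gomez et al.).
    f and g are arbitrary sub-blocks, e.g. attention and feed-forward."""
    def __init__(self, f: nn.Module, g: nn.Module):
        super().__init__()
        self.f, self.g = f, g

    def forward(self, x1, x2):
        # forward mapping: (x1, x2) -> (y1, y2)
        y1 = x1 + self.f(x2)
        y2 = x2 + self.g(y1)
        return y1, y2

    def inverse(self, y1, y2):
        # the inputs can be exactly recovered from the outputs, so
        # activations need not be stored during the forward pass
        x2 = y2 - self.g(y1)
        x1 = y1 - self.f(x2)
        return x1, x2
```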

IV) Chunked feed-forward layers

The final feed-forward layers in the transformer can have a lot of parameters and be costly to compute in a single step. The Reformer instead processes this step in smaller chunks, effectively trading computation time for memory.
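A chunked feed-forward block can be sketched as follows (a minimal sketch; dimensions and chunk count are illustrative, not our exact configuration):

```python
import torch
import torch.nn as nn

class ChunkedFeedForward(nn.Module):
    """Feed-forward block applied over the sequence dimension in chunks,
    trading extra compute/latency for a smaller peak activation memory."""
    def __init__(self, d_model=1024, d_ff=4096, n_chunks=8):
        super().__init__()
        self.n_chunks = n_chunks
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):
        # x: [batch, seq_len, d_model]
        if self.n_chunks <= 1:
            return self.net(x)
        chunks = x.chunk(self.n_chunks, dim=1)   # split along the sequence
        return torch.cat([self.net(c) for c in chunks], dim=1)
```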

V) Axial positional encodings

Axial positional encodings are a more memory efficient way of encoding positions than standard learnable positional encodings. This is especially relevant for very long sequences.
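The idea can be sketched as follows (a minimal sketch; the shapes are illustrative and assume the sequence length factorizes as rows * cols):

```python
import torch
import torch.nn as nn

class AxialPositionalEncodingSketch(nn.Module):
    """Minimal sketch of axial positional encodings: instead of one
    [max_len, d_model] table, two smaller tables are learned over a 2D
    factorisation of the sequence (max_len = rows * cols) and combined,
    which needs far fewer parameters for long sequences."""
    def __init__(self, d_model=1024, shape=(64, 64), d_axial=(512, 512)):
        super().__init__()
        assert sum(d_axial) == d_model
        self.shape = shape
        self.row_emb = nn.Parameter(torch.randn(shape[0], 1, d_axial[0]) * 0.02)
        self.col_emb = nn.Parameter(torch.randn(1, shape[1], d_axial[1]) * 0.02)

    def forward(self, x):
        # x: [batch, seq_len, d_model], with seq_len <= rows * cols
        rows, cols = self.shape
        pos_row = self.row_emb.expand(rows, cols, -1)
        pos_col = self.col_emb.expand(rows, cols, -1)
        pos = torch.cat([pos_row, pos_col], dim=-1).reshape(rows * cols, -1)
        return x + pos[: x.shape[1]]
```

With these illustrative sizes (sequence length 4096, d_model = 1024), the two tables store 64*512 + 64*512 = 65,536 position parameters instead of 4096*1024 ≈ 4.2M for a standard learnable encoding.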

For more in-depth explanations of these techniques please see the appendix and our repository's documentation page.



3.2 Datasets used

[FOOTNOTE] For all datasets, our implementation notes give full details: https://github.com/arampacha/reformer_fastai/blob/db065661af97ace71fd2c7cb177a94c127f792d6/README.md#data



3.4 Experimental setup and code

Our code, documentation and supplementary material, such as full experiment notebooks, are available on GitHub [FOOTNOTE] https://github.com/arampacha/reformer_fastai, [FOOTNOTE] https://arampacha.github.io/reformer_fastai/. All experiments are logged to [Wandb](attach BIBTEX Reference).



3.5 Computational requirements

With regard to compute, we used Tesla T4, P4 and K80 GPUs provided by Google Colaboratory, along with a single Tesla V100 GPU on DataCrunch, three Tesla T4 VMs on GCP (provided courtesy of [DAGsHub](attach BIBTEX Reference)), and Titan V, GeForce 2080 Ti and 3090 GPUs. The total number of GPU hours for the entire project was approximately 1600, not accounting for device type or GPU utilization.



4. Results

We attempt to verify each of the claims stated in section 2.

4.1 enwik8 Experiment Settings

For all enwik8 experiments, unless noted otherwise, we train a 3-layer Transformer LM with a model dimension of 1024 and axial positional encoding, modified with the relevant Reformer technique, on the enwik8 dataset with a sequence length of 4096, using the Adafactor optimizer. A batch size of 8 was used via gradient accumulation. Experiments were run on 15GB and 12GB GPUs, and training was carried out in full precision.

[FOOTNOTE: link to configs]



4.2 LSH attention analysis on synthetic task

The synthetic task consists of copying a sequence of integers of the form 0w0w, where w is a sequence of some length composed of integers from 1 to 128. The first part of the sequence is masked from the loss function, so the goal for the model is to learn that midway through the sequence it has to repeat 0w. For a more detailed explanation of the task see the experiment notebook: https://arampacha.github.io/reformer_fastai/experiment.synthetic-task.html
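For concreteness, data for this task can be generated roughly as follows (a minimal sketch; the exact sequence length, vocabulary size and batch size used in our runs are set in the SyntheticConfig referenced below):

```python
import torch

def synthetic_batch(batch_size=64, seq_len=1024, vocab_size=128):
    """Generate a batch of 0w0w copy-task sequences.
    Each sequence is [0, w, 0, w], where w has length seq_len // 2 - 1
    and consists of integers drawn from 1..vocab_size."""
    half = seq_len // 2
    w = torch.randint(1, vocab_size + 1, (batch_size, half - 1))
    zero = torch.zeros(batch_size, 1, dtype=torch.long)
    return torch.cat([zero, w, zero, w], dim=1)   # [batch_size, seq_len]
```

Since the loss is masked over the first half, the model only gets credit for reproducing the second 0w.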

Claim: A full attention transformer can solve this task perfectly. An LSH-attention transformer can also solve it, but with decreasing performance as the number of hashing rounds decreases. A model trained with one type of attention can also be evaluated with a different type of attention.

Experiments

For this task, we used our own implementation of the LSH LM: https://arampacha.github.io/reformer_fastai/reformer.html#LSHLM. For details on the LSH algorithm see our project documentation: https://arampacha.github.io/reformer_fastai/exploration.LSH.html. Note that our model uses shared keys and queries even when using full attention. We train one model with full attention and three models with LSH attention using 1, 2 and 4 hashing rounds respectively.

We trained for 150,000 steps with a batch size of 64, a training set size of 12,800, and a validation set size of 1,280. We used hyperparameters as described in the paper, and other defaults suggested in the Trax GitHub repository: https://github.com/google/trax. See our SyntheticConfig class for experiment defaults: https://arampacha.github.io/reformer_fastai/experiment-configs.html#SyntheticConfig.

We evaluated each model with full attention and with LSH attention using 8, 4, 2 and 1 hashing rounds.



Results

Results are summarized in the table below. The first column shows how the model was trained, the subsequent ones how it was evaluated. See our documentation for analysis setup: https://arampacha.github.io/reformer_fastai/experiment.synthetic-task-analysis.html.

Training       | Eval: Full Attention | Eval: LSH-8 | Eval: LSH-4 | Eval: LSH-2 | Eval: LSH-1
Full Attention | 100.0                | 1.4         | 1.9         | 3.0         | 4.6
LSH-4          | 46.5                 | 99.7        | 99.7        | 93.05       | 77.6
LSH-2          | 75.9                 | 96.6        | 97.5        | 97.1        | 86.1
LSH-1          | 70.7                 | 76.6        | 79.7        | 79.3        | 56.1

We see that the LSH models gradually perform worse with fewer hashing rounds, both in training and evaluation, as claimed. We also see that with a higher number of hashing rounds LSH attention can approximate the performance of full attention. A limited hyperparameter search was needed to achieve the training results obtained.

Discrepancy with results from the paper

The results from the paper were as follows: [table of accuracies reproduced from the original paper]

We see that there are three clear differences in the tables:

  1. Our results for this particular set of runs are somewhat poorer than those of the paper, especially for LSH-1. But as detailed in the appendix, we were also able to train LSH-1 to near perfection when we used a higher learning rate. This suggests that the absolute numbers in the table are somewhat arbitrary and depend on the specific training setup.
  2. Models trained with LSH attention and evaluated with standard attention do much better in our experiments than in the paper.
  3. The model trained with standard attention and evaluated with LSH attention does much worse in our experiments.

One explanation could be that there are issues with our implementation, but overall the models seem to behave as expected. When we compare our implementation to other implementations we also see similar results between implementations; see our documentation for details: https://arampacha.github.io/reformer_fastai/experiment.synthetic-task-comparison.html

A second possibility is that, due to random factors (as we observed with the role of the seed in training), the resulting models simply behave differently. For example, when evaluating the full attention model with LSH attention, there is no connection between the number of hashing rounds used and performance.



4.3 Effect of sharing QK

Shared Query-Key Attention ("shared-QK") was used in the Reformer paper to help further reduce the memory footprint of the model.

Claim: A shared query-key space does not perform worse than regular attention

To investigate this we train a standard Transformer LM and a Transformer LM with shared-QK attention on the enwik8 dataset. The bits-per-character (BPC) metric was used [FOOTNOTE]. Both models were trained for 10 epochs and the mean validation BPC of 3 training rounds was taken.

Result

After experimentation, we cannot validate the claim that shared-QK attention does not perform worse than standard attention in this experimental setting. Nor did we see shared-QK attention training slightly faster, as noted in the paper. Potentially, with additional training, shared-QK attention performance will converge with standard attention; we leave this to future work.




[W&B panel: run set of 6 runs]


Supplementary Shared-QK Experiment

During the experiment phase of the synthetic task (section 4.2), we used a Transformer LM with shared query-keys. We expected this model to solve the relatively simple synthetic task to perfection, but we found it difficult to train the model to convergence even when training for 750 epochs. When we instead used a standard Transformer LM (i.e. with separate queries and keys), the model consistently converged in 7-8 epochs.

The figure below illustrates several runs with an identical setup, changing only the model seed. Out of five runs with shared query-key attention, only one model converged, after about 200 epochs. For our standard Transformer LM, the three runs converged within 7-8 epochs.




[W&B panel: run set of 8 runs]


[FOOTNOTE] The Reformer paper used bits per dimension (BPD), which is equivalent to BPC when evaluating characters.



4.4 Effect of reversible layers

Claim: Reversible Residual Layers in a Transformer enable more memory efficient training and do not come at the expense of model accuracy

To validate this claim we train a Transformer language model and a full sequence-to-sequence Transformer using Reversible Residual Layers, and compare them to their standard Transformer equivalents.

Reversible LM on enwik8

Training this model on the full 64K-token sequences from enwik8, as in the original paper, was not feasible given our computational budget. To investigate this claim we instead trained on sequences of length 4096. The mean of 3 runs for each model type was taken.

Result

We could not validate the claim that Reversible Residual Layers do not have a significant impact on language model performance, due to the sizeable difference of 0.11 validation BPC between the ReversibleLM and the baseline TransformerLM. Potentially this difference can be reduced with further training; we leave this to future work.




[W&B panel: run set of 6 runs]


4.5 Reversible Transformer on translation task

We train a full Reversible Transformer ("ReversibleTransformer"), a 12-layer sequence-to-sequence Transformer model with Reversible Residual Layers, on the WMT-14 en-de translation dataset. Given the short sequences in this dataset, a sequence length of 256 was used, with a batch size of 64 and the Adam optimizer. Gradient accumulation was used when training on a 12GB GPU. Training was carried out for 2 epochs for the ReversibleTransformer and 1 epoch for the standard Transformer, in full precision in both cases. The one-cycle learning rate schedule from fast.ai was used, with a maximum learning rate of 1e-4, an initial div of 5, and a percent start of 12.5%.

Results

From our experiments, we could not validate the claim that Reversible Residual Layers do not have a significant impact on Transformer model performance. The CorpusBLEU score of a standard Transformer trained for 1 epoch outperformed that of a ReversibleTransformer trained for 2 epochs.




[W&B panel: run set of 2 runs]


4.6 Effect of number of hashing rounds on the performance

When using LSH for chunking there is a small probability that similar tokens will end up in different buckets, in which case attention will not be computed for those tokens. To reduce this probability we can do multiple rounds of hashing using different hashing functions.
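A back-of-the-envelope calculation illustrates why (illustrative numbers only, assuming hashing rounds are independent and that a single round separates two similar tokens with probability p):

```python
# If one hashing round separates two similar items with probability p,
# then (assuming independence) all n_rounds rounds separate them with
# probability p ** n_rounds, so the chance of missing their interaction
# drops quickly as rounds are added.
p = 0.2
for n_rounds in (1, 2, 4, 8):
    print(n_rounds, p ** n_rounds)
```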

Claim: Performance of Transformer with LSH attention increases with the number of hashing rounds and is close to full attention performance at n_hashes = 8.

To validate this claim we train and compare models with different numbers of hashing rounds on the enwik8 dataset. We train a Transformer with LSH attention for 10 epochs, and report training and validation losses for runs with the number of hashing rounds ranging from 2 to 16, compared to a baseline full-attention Transformer. Losses for a single run per number-of-hashes setting are reported.

Results

Our results verify the claim and illustrate that training and validation losses tend to improve with an increasing number of hashing rounds. Note that both time and memory requirements also increase with n_hashes.

Also, while training losses are very close for larger n_hashes and full attention, models using LSH seem to have a slightly higher generalization error.




[W&B panel: run set of 5 runs]


4.7 LSH attention evaluation speed

Claim: The evaluation speed, measured in seconds/step, grows with the number of hashes but not with increased sequence length.

Experiment

To verify the claim we measure evaluation time on the synthetic task, with hyperparameters as indicated in the right part of Figure 5 of the Reformer paper. We were unable to complete the longest sequence lengths for full attention due to out-of-memory errors on a single GPU.

Results

Our results are summarized in the figure below:

[Figure: evaluation time (seconds/step) vs sequence length for full attention and LSH attention with different numbers of hashes]

The results support the claim that evaluation time per step grows with the number of hashes but remains constant as the sequence length increases.



4.8 Deep Reformer models

It has been shown in multiple studies that deeper models generally have better performance. The Reformer is designed to be memory-efficient, so it can enable the training of deep models (via Reversible Residual Layers) on very long sequences (via LSH attention).

Claims: 1) Deep Reformer models can be trained on very long sequences using a single accelerator (GPU or TPU core); 2) Deeper Reformer models have better performance.

enwik8 experiment

To test these claims we train models of different depths on inputs with seq_len = 16,384, using the enwik8 dataset. Our experiments are designed to run on a single 12GB GPU. We trained models with 3, 6 and 12 layers for 4 epochs; unfortunately, training very deep models was outside our computational budget.

Results

We confirm claim (1): deeper models can be trained on a single accelerator, as expected given that Reversible Residual Layers have O(1) memory complexity in the number of layers. For claim (2), our results are indicative of better performance for deeper models, but longer training runs are needed to confirm this.




[W&B panel: run set of 3 runs]


4.9 Memory Consumption

The main motivation of the reformer paper is that it reduces the memory footprint as compared to a normal transformer.

Claim: The reformer reduces the memory footprint compared to a normal transformer. The claim is summarized in table 5 of the reformer paper.

Experiments

To investigate the claim we log the memory allocation of various Reformer combinations during 0.1% of an epoch of training on the tiny Shakespeare dataset. The sequence length for all experiments was set to 4096, with a batch size of 1. We investigate the following combinations (a minimal measurement sketch follows the list below):

  1. Comparing the Transformer LM, LSH LM, Reversible LM and full Reformer LM
  2. Comparing Reformer LMs with different numbers of hashes
  3. Comparing Reformer LMs with different numbers of layers
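Our memory profiling scripts capture full allocation traces; the sketch below only illustrates the simplest form of such a measurement, using PyTorch's built-in CUDA memory statistics (function and argument names are ours, not our actual scripts):

```python
import torch

def peak_memory_one_step(model, xb, yb, loss_fn, device="cuda"):
    """Run a single forward/backward pass and report the peak GPU memory
    allocated, in MiB. A rough sketch of the measurement idea only."""
    model = model.to(device)
    xb, yb = xb.to(device), yb.to(device)
    torch.cuda.reset_peak_memory_stats(device)
    loss = loss_fn(model(xb), yb)
    loss.backward()
    peak_mib = torch.cuda.max_memory_allocated(device) / 2**20
    print(f"peak allocated: {peak_mib:.1f} MiB")
    return peak_mib
```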


Comparing Transformer LM, LSH LM, Reversible LM and the full Reformer LM

The figure below (left) shows the peak memory usage for the Transformer, LSH LM, Reversible LM and the full Reformer.

We see that the Transformer stores activations at each layer during the forward pass, and that these are gradually released as the backward pass is completed. In this case we see six distinct jumps in memory, corresponding to the 6 layers.

For the LSH LM, here with 8 hashing rounds, memory allocation mostly parallels that of the Transformer, but since the LSH attention computation is cheaper, the actual memory consumption is lower. Here we also observe that activations are stored for each layer.

For the ReversibleLM, memory doesn't accumulate over the forward pass, since intermediate activations are recalculated for the backward pass. Note that we observe 6 peaks for the forward pass, each corresponding to a full attention matrix computation, and another 6 higher peaks corresponding to the backward pass.

The full ReformerLM includes both LSH attention and Reversible Residual Layers. Memory allocation therefore doesn't increase with the number of layers, and since LSH attention is cheaper than standard attention, the peak memory usage is smaller than for the ReversibleLM.

[Figure: memory allocation traces. Left: TransformerLM, LSH LM, ReversibleLM and ReformerLM. Right: ReformerLM peak memory vs number of hashes]



Reformer Memory vs Number of Hashes

In previous sections we have seen that increasing the number of hashes leads to increased performance, as the attention approximation of LSH attention approaches that of scaled dot-product attention. But since we have to store the result of each hashing round, we would expect memory to grow linearly with the number of hashes. Note that it is only during the actual LSH attention calculation that the number of hashes matters, i.e. the intermediate shape of LSH attention is [batch_size, n_chunks, chunk_size, chunk_size*2], where n_chunks is the product of the number of hash buckets per round and the number of hash rounds, n_chunks = n_buckets * n_hashrounds.
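As a rough sanity check on this scaling (illustrative sizes only, and assuming one chunk per bucket per hashing round), the element count of the intermediate attention tensor grows linearly with the number of hash rounds:

```python
seq_len, chunk_size = 4096, 64
n_buckets = seq_len // chunk_size                 # buckets per hashing round (assumed)
for n_hashrounds in (1, 2, 4, 8, 16):
    n_chunks = n_buckets * n_hashrounds
    elements = n_chunks * chunk_size * (chunk_size * 2)   # per batch item
    print(f"n_hashrounds={n_hashrounds:>2}  attention elements={elements:,}")
```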

The figure above (right), from our experiments, confirms that peak memory scales linearly with the number of hashes. The memory peak happens during the backward pass, during the calculation of LSH attention. Note that the output of LSH attention is [bs, seq_len, d_model] and is independent of the number of hashes; this explains the drop in memory when the LSH attention calculation is finished. A 2-layer Reformer LM with a sequence length of 4096 was used for this experiment.



Reformer vs Number of Layers

In this experiment we compare Reformer LMs with n_layers = 2, 4, 6, 8, 10, 12; 8 hashes are used in all cases. We see the same memory peak during the LSH attention calculation as in the figure above, and also that each added layer increases the memory footprint equally during the entire process. The figure shows forward and backward passes for a layer of each of the models, matched along the x-axis.

Since the Reformer doesn't accumulate activation memory during the forward pass through all of its layers, the peak memory allocation grows only linearly with each extra layer; the additional memory is due to the additional model parameters as the number of layers increases.

[Figure: memory allocation traces for Reformer LMs with 2 to 12 layers]

Results

Our experiments have verified that the Reformer has a much smaller peak memory allocation than the Transformer. We have also shown that the Reformer can scale to much deeper models than the Transformer within a fixed budget. The main memory bottleneck of the Reformer is the number of hash rounds used.



5. Discussion and Conclusion

Claims validated

There are a number of the paper's original claims that we can confidently say we have validated, such as the claims around superior evaluation speed for longer sequence lengths, that LSH attention comes close to approximating full attention, that scaling to deeper Reformer models works on a single GPU, and that the Reformer has a reduced memory footprint compared to the Transformer.

Claims not validated

However, we cannot validate the claim that a Transformer with Reversible Residual Layers or shared-QK attention performs on par with a traditional Transformer. As noted elsewhere, there may be a number of explanations for this discrepancy, with our lack of compute compared to the original authors being the most likely factor.

Non-trivial implementation

As mentioned elsewhere in the report, this reproducibility work involved a non-trivial amount of engineering to ensure our reimplementation was as faithful to the original paper as we could make it. The paper introduced 5 new techniques, meaning that substantial effort was needed to ensure a correct reimplementation of each. While Reformer code resources were available, many design decisions and hyperparameters were not fully documented in the paper. For example, LSH attention is quite complex compared to normal dot-product attention: our implementation is over 200 lines of code, compared to our dot-product attention implementation, which is just over 30 lines.

Several of the experiments are also quite computationally intensive for a single-GPU setup, taking up to 30 hours per epoch. The 64K sequence length and imagenet-64 experiments could not be tested for this reason. Nevertheless, the use of shorter sequence lengths enabled us to investigate the claims made.

Reformer in practice

In our experimentation we observe that the Reformer is especially suited to very deep models over very long sequences, and that it enables training such models on a single GPU, thus supporting the democratization of machine learning.

Practitioners interested in using the model should note, however, that we also observed a non-negligible increase in time complexity due to the trade-off of memory for compute in the Reversible Residual Layers, and the trade-off of time for increased accuracy as the number of hashes increases. This behaviour is also observed in other reports [REF: "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention" & REF "Long Range Arena"].

Community work

As this was a community effort with many contributors, we were forced to improve our development processes; midway through the project we switched to nbdev [REF], an open-source system for literate programming with documentation, testing and CI/CD integrated with GitHub. This resulted in a much smoother development process for the team.

Reproducible Reproducibility

All of our code and documentation is available on Github [FOOTNOTE]. Of particular interest to future practitioners might be our integration of Reformer language model and sequence-to-sequence training into the fastai training loop as well as the memory profiling scripts used in our memory consumption experiment.



Acknowledgements

There are a number of contributions that we would like to acknowledge: Weights & Biases [REF] for supporting this community work by providing a team account, DAGsHub [REF] for providing computational resources for experimentation, and Phil Wang (@lucidrains on GitHub) for his PyTorch Reformer implementation, as well as a number of other transformer implementations, which made this tricky implementation much smoother at times. Finally, we would like to acknowledge that this work originates in the fastai community and to thank Jeremy Howard, Sylvain Gugger and all of the contributors to the library and the forums for continuing to make deep learning more accessible, more friendly and more performant.




REFERENCES

[References to be made nice]

WandB BibTeX Citation:

@misc{wandb, title = {Experiment Tracking with Weights and Biases}, year = {2020}, note = {Software available from wandb.com}, url={https://www.wandb.com/}, author = {Biewald, Lukas}, }

DAGsHub BibTeX Citation:

@misc{dagshub, title = {Data Science Collaboration with DAGsHub}, year = {2019}, note = {Software available from dagshub.com}, url={https://dagshub.com/}, author={Pleban, Dean and Smoilovsky, Guy}, }

nbdev Citation:

Howard, Jeremy, nbdev, (2020), GitHub repository, https://github.com/fastai/nbdev

Reformer Code References

Transformer Code Resources

WMT-14 Reference

@InProceedings{bojar-EtAl:2014:W14-33, author = {Bojar, Ondrej and Buck, Christian and Federmann, Christian and Haddow, Barry and Koehn, Philipp and Leveling, Johannes and Monz, Christof and Pecina, Pavel and Post, Matt and Saint-Amand, Herve and Soricut, Radu and Specia, Lucia and Tamchyna, Ale{\v{s}}}, title = {Findings of the 2014 Workshop on Statistical Machine Translation}, booktitle = {Proceedings of the Ninth Workshop on Statistical Machine Translation}, month = {June}, year = {2014}, address = {Baltimore, Maryland, USA}, publisher = {Association for Computational Linguistics}, pages = {12--58}, url = {http://www.aclweb.org/anthology/W/W14/W14-3302}}



Appendix

1 LSH-attention

In normal attention, we perform a matrix multiplication of all keys and queries. This is the same as taking the dot product of every query with every key, and is the source of the quadratic cost of attention. Dot products will be large when vectors are aligned. We then pass the dot products through a softmax, which makes relatively large inputs even larger. It makes sense that keys and queries that are close together in our high-dimensional space will end up with a high score after the softmax. If we could figure out upfront which vectors were close, we could disregard the ones that were far away, since their softmax scores would most likely be small. It turns out that LSH (Andoni et al., 2015) is a suitable algorithm for clustering high-dimensional vectors.

LSH clustering

We could do perfect clustering by comparing each key and query, but that would still leave us with the quadratic cost. LSH instead approximates clustering probabilistically. This means that its cost is reduced, but that it can't guarantee perfect accuracy. The intuition behind LSH is described in this blog: https://www.pragmatic.ml/reformer-deep-dive/

  1. If we draw a random vector u and take its dot product with other vectors, we will get positive and negative scores depending on whether the vectors are aligned with u or not.
  2. If we take the sign of this dot product, we have effectively created two "buckets" for our vectors: '+' and '-'.
  3. If we draw more lines ("hash rounds"), we create additional buckets in the same way. E.g. 3 lines would give us 2^3 = 8 buckets.
  4. Vectors that end up in the same bucket will have a high probability of being close together, and this probability goes up as the number of hash rounds increases.

[Figure: illustration of LSH bucketing with random hyperplanes]

Note that the Reformer paper does LSH via random projections, as described in Andoni et al. (2015).
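A minimal sketch of this bucket-assignment step could look as follows (our own naming; a full implementation draws an independent projection per hashing round and per attention head):

```python
import torch

def lsh_bucket_ids(x, n_buckets):
    """Assign an angular-LSH bucket id to each vector via random projections.
    x: [batch, seq_len, d]; returns integer bucket ids of shape [batch, seq_len]."""
    d = x.shape[-1]
    r = torch.randn(d, n_buckets // 2, device=x.device)   # random projection matrix
    proj = x @ r                                           # [batch, seq_len, n_buckets // 2]
    return torch.cat([proj, -proj], dim=-1).argmax(dim=-1)
```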

LSH-attention

The next step is to add LSH clustering to our new attention mechanism. We refer to Figure 2 of the Reformer paper, which gives a clear overview of the main steps:

[Figure 2 from the Reformer paper: simplified depiction of LSH attention]

  1. First we receive an input sequence of tokens in the original order. In normal attention, our tokens are projected into keys and queries, but here we set keys and queries to be identical (see chapter nn).
  2. We then use LSH clustering to produce a bucket ID for each item (i.e. key/query). We sort the items first by bucket ID and then by position in the original sequence.
  3. We group items into equal-sized chunks, since the number of items in each bucket may vary.
  4. We concatenate each chunk with its previous chunk to allow for limited in-between attention. In the diagram, we also mask items from other buckets, so that only in-bucket attention is allowed.
  5. We calculate normal dot-product attention within each concatenated chunk.
  6. We un-chunk and un-sort all items (a simplified sketch of steps 2-6 follows below).
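The sketch below walks through steps 2-6 for a single hashing round. It is heavily simplified: cross-bucket masking, causal masking, masking of items attending to themselves and averaging over multiple hash rounds are all omitted, names are ours, and seq_len is assumed to be divisible by chunk_size.

```python
import torch

def lsh_attention_sketch(qk, v, buckets, chunk_size=64):
    """Simplified single-round LSH attention: sort by bucket, chunk,
    attend within each chunk plus its predecessor, then un-sort.
    qk, v: [batch, seq_len, d]; buckets: [batch, seq_len] integer ids."""
    b, l, d = qk.shape
    pos = torch.arange(l, device=qk.device)

    def gather(t, idx):
        return t.gather(1, idx.unsqueeze(-1).expand_as(t))

    # step 2: sort by (bucket id, original position); keep indices to un-sort later
    order = torch.argsort(buckets * l + pos, dim=-1)        # [b, l]
    undo = torch.argsort(order, dim=-1)
    sqk, sv = gather(qk, order), gather(v, order)

    # step 3: split the sorted sequence into equal-sized chunks
    n_chunks = l // chunk_size
    sqk = sqk.view(b, n_chunks, chunk_size, d)
    sv = sv.view(b, n_chunks, chunk_size, d)

    # step 4: let every chunk also attend to the previous chunk
    keys = torch.cat([sqk.roll(1, dims=1), sqk], dim=2)     # [b, n_chunks, 2*chunk_size, d]
    vals = torch.cat([sv.roll(1, dims=1), sv], dim=2)

    # step 5: ordinary dot-product attention within each extended chunk
    attn = torch.einsum("bncd,bnkd->bnck", sqk, keys) / d ** 0.5
    out = torch.einsum("bnck,bnkd->bncd", attn.softmax(dim=-1), vals)

    # step 6: un-chunk and un-sort back to the original token order
    return gather(out.reshape(b, l, d), undo)
```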

LSH attention complexity

According to the Reformer paper, the time and memory complexities of scaled dot-product attention and LSH attention are:

Attention Type     | Memory Complexity                        | Time Complexity
Scaled Dot-Product | max(b n_h l d_k, b n_h l^2)              | max(b n_h l d_k, b n_h l^2)
LSH Attention      | max(b n_h l d_k, b n_h l n_r (4l/n_c)^2) | max(b n_h l d_k, b n_h l n_r (4l/n_c)^2)

Where:

  • l is the length of the input sequence
  • b is the batch size
  • n_h is the number of heads in multi-head attention
  • n_c is the number of LSH chunks
  • n_r is the number of hash rounds
  • d_k is the model dimension

Language models typically benefit from long input sequences (l) to give the model a longer context (ref), so we often want to make l large. This means that the cost is usually dominated by the latter term of the max in the table above, i.e. l is usually larger than d_k.

In this case dot-product attention has a quadratic cost in both time and memory that depends on l^2. If we want to compare LSH attention to dot-product attention, we can compare l n_r (4l/n_c)^2 to l^2, since b and n_h are common factors. This means that the cost of LSH attention relative to normal attention is linear w.r.t. n_r, but decreases quadratically with n_c. In our implementation we did not set n_c directly, but rather define a bucket size b_s, where n_c = l/b_s.

We can therefore rewrite the LSH complexity as n_r (4 b_s)^2 l, which is linearly dependent on l. It also makes intuitive sense that the complexity of LSH attention grows quadratically with b_s, since we do normal dot-product attention within each chunk.



2. Training dynamics

Seed

From our early prototyping using a standard Transformer model, we had observed that the model would suddenly "understand" the task and reach zero loss and perfect accuracy, as expected. However, when we started running our LSHLM model this was not always the case. The runs below illustrate this behaviour: the runs are identical except for the seed, but the outcome of training is very different.

Learning rate

Similarly, the learning rate seemed to affect training as well. The plot below shows an LSHLM with a single hashing round trained with a base learning rate of 1e-3 and of 1e-4. We used the 1-cycle policy from the fast.ai library for learning rate scheduling: https://docs.fast.ai/callback.schedule.html#Learner.fit_one_cycle




[W&B panel: run set of 4 runs]