
Experiments: Shared Query-Key Attention

We compare a standard Transformer language model to one using shared query-key attention, as proposed in the Reformer paper.

Shared Query-Key Attention ("shared-QK") was used in the Reformer paper to help further reduce the memory footprint of the model.
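To make the mechanism concrete, below is a minimal PyTorch sketch of shared-QK attention. This is our own illustration, not the code used in these experiments: queries and keys come from a single projection, saving one weight matrix and its activations, and self-attention is heavily penalised because a token would otherwise match itself most strongly.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedQKAttention(nn.Module):
    """Self-attention where queries and keys share one projection matrix (sketch)."""
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.to_qk = nn.Linear(d_model, d_model, bias=False)   # shared Q/K projection
        self.to_v = nn.Linear(d_model, d_model, bias=False)
        self.to_out = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):                                       # x: (batch, seq, d_model)
        b, n, _ = x.shape
        split = lambda t: t.view(b, n, self.n_heads, self.d_head).transpose(1, 2)
        qk, v = split(self.to_qk(x)), split(self.to_v(x))       # q == k by construction
        scores = qk @ qk.transpose(-2, -1) / self.d_head ** 0.5
        # With q == k a token matches itself most strongly, so self-attention is
        # heavily penalised and acts only as a fallback when nothing else is visible.
        eye = torch.eye(n, device=x.device, dtype=torch.bool)
        scores = scores.masked_fill(eye, -1e4)
        # Causal mask for language modelling.
        causal = torch.triu(torch.ones(n, n, device=x.device), diagonal=1).bool()
        scores = scores.masked_fill(causal, float("-inf"))
        out = F.softmax(scores, dim=-1) @ v                     # (b, heads, n, d_head)
        return self.to_out(out.transpose(1, 2).reshape(b, n, -1))
```

The Reformer's LSH variant additionally unit-normalises the shared keys before the dot product; we omit that step here for brevity.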

Claim: A shared query-key space does not perform worse than regular attention

To investigate this, we trained a standard Transformer LM and a Transformer LM with shared-QK attention on the enwik8 dataset, evaluating with the bits-per-character (BPC) metric. Each model had 3 layers, a d_model of 1024, and used axial positional embeddings. Models were trained for 10 epochs with a sequence length of 4096 and a batch size of 8 using the Adafactor optimizer, with gradient accumulation where needed. We report the mean BPC over 3 training runs.
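For reference, the setup above can be summarised as a plain config dict (field names are our own and purely illustrative, not tied to any particular training library):

```python
# Illustrative restatement of the experimental setup described above.
config = dict(
    dataset="enwik8",
    n_layers=3,
    d_model=1024,
    pos_emb="axial",        # axial positional embeddings
    seq_len=4096,
    batch_size=8,           # gradient accumulation used where needed
    optimizer="Adafactor",
    epochs=10,
    metric="bpc",           # bits-per-character
    n_runs=3,               # mean BPC over 3 training runs reported
)
```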

Based on these experiments, we cannot validate the claim that shared-QK attention performs no worse than standard attention in this setting. Nor did we observe shared-QK attention training slightly faster, as noted in the paper. With additional training, shared-QK performance may converge with that of standard attention; we leave this to future work.




[Charts: BPC vs. epoch (two panels) for TransformerLM and TransformerLM-Shared-QK; run set of 6 runs]