Experiments: Shared Query-Key Attention
Shared Query-Key Attention ("shared-QK"), in which queries and keys come from the same projection matrix, was used in the Reformer paper to further reduce the memory footprint of the model.
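As a concrete illustration, the sketch below shows a minimal single-head attention module where one projection produces both queries and keys, saving one weight matrix relative to standard attention. This is a simplified sketch rather than the Reformer implementation: the class and variable names are ours, and details such as key normalisation and masking a token's attention to itself are omitted.

```python
import torch
import torch.nn as nn

class SharedQKAttention(nn.Module):
    """Minimal single-head sketch: the same projection produces queries and keys,
    so only two weight matrices (QK and V) are learned instead of three."""
    def __init__(self, d_model):
        super().__init__()
        self.to_qk = nn.Linear(d_model, d_model, bias=False)  # shared Q/K projection
        self.to_v = nn.Linear(d_model, d_model, bias=False)
        self.scale = d_model ** -0.5

    def forward(self, x):
        qk = self.to_qk(x)                                    # (batch, seq, d_model)
        v = self.to_v(x)
        scores = torch.einsum('bid,bjd->bij', qk, qk) * self.scale
        attn = torch.softmax(scores, dim=-1)
        return attn @ v

x = torch.randn(2, 16, 64)                                    # (batch, seq, d_model)
print(SharedQKAttention(64)(x).shape)                         # torch.Size([2, 16, 64])
```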
Claim: A shared query-key space does not perform worse than regular attention
To investigate this we train a standard Transformer LM and a Transformer LM with shared-QK attention on the enwik8 dataset, evaluated with the bits-per-character (BPC) metric. Each model had 3 layers, d_model of 1024, and axial positional embeddings. The models were trained for 10 epochs with a sequence length of 4096 and a batch size of 8 using the Adafactor optimizer, with gradient accumulation used where memory required it. The mean BPC over 3 training rounds is reported.
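For reference, BPC here is the per-character cross-entropy expressed in bits rather than nats; the snippet below shows that conversion and the averaging over training rounds. The per-seed losses are placeholder numbers for illustration, not our results.

```python
import math

def bits_per_character(ce_loss_nats: float) -> float:
    # Cross-entropy from a language model is typically measured in nats;
    # dividing by ln(2) converts it to bits per character.
    return ce_loss_nats / math.log(2)

# Placeholder per-round validation losses (nats/char), purely illustrative:
seed_losses = [0.80, 0.82, 0.78]
mean_bpc = sum(bits_per_character(l) for l in seed_losses) / len(seed_losses)
print(f"mean BPC over {len(seed_losses)} rounds: {mean_bpc:.3f}")
```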
In this experimental setting we could not validate the claim that shared-QK attention does not perform worse than standard attention, nor did we observe shared-QK attention training slightly faster, as noted in the paper. Shared-QK performance may converge with that of standard attention given additional training; we leave this to future work.