Experiments: Shared Query-Key Attention
Shared Query-Key Attention ("shared-QK"), in which queries and keys come from the same projection matrix, was used in the Reformer paper to further reduce the memory footprint of the model.
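As a concrete illustration, the sketch below shows a minimal single-head attention module where one projection produces both queries and keys, saving one weight matrix relative to standard attention. This is a simplified sketch rather than the Reformer implementation: the class and variable names are ours, and details such as key normalisation and masking a token's attention to itself are omitted.

```python
import torch
import torch.nn as nn

class SharedQKAttention(nn.Module):
    """Minimal single-head sketch: the same projection produces queries and keys,
    so only two weight matrices (QK and V) are learned instead of three."""
    def __init__(self, d_model):
        super().__init__()
        self.to_qk = nn.Linear(d_model, d_model, bias=False)  # shared Q/K projection
        self.to_v = nn.Linear(d_model, d_model, bias=False)
        self.scale = d_model ** -0.5

    def forward(self, x):
        qk = self.to_qk(x)                                    # (batch, seq, d_model)
        v = self.to_v(x)
        scores = torch.einsum('bid,bjd->bij', qk, qk) * self.scale
        attn = torch.softmax(scores, dim=-1)
        return attn @ v

x = torch.randn(2, 16, 64)                                    # (batch, seq, d_model)
print(SharedQKAttention(64)(x).shape)                         # torch.Size([2, 16, 64])
```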
Claim: A shared query-key space does not perform worse than regular attention
To investigate this we train a standard Transformer LM and a Transformer LM with shared-QK attention on the enwik8 dataset, evaluated with the bits-per-character (BPC) metric. Each model had 3 layers, d_model of 1024, and axial positional embeddings. The models were trained for 10 epochs with a sequence length of 4096 and a batch size of 8 using the Adafactor optimizer, with gradient accumulation used where memory required it. The mean BPC over 3 training rounds is reported.
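For reference, BPC here is the per-character cross-entropy expressed in bits rather than nats; the snippet below shows that conversion and the averaging over training rounds. The per-seed losses are placeholder numbers for illustration, not our results.

```python
import math

def bits_per_character(ce_loss_nats: float) -> float:
    # Cross-entropy from a language model is typically measured in nats;
    # dividing by ln(2) converts it to bits per character.
    return ce_loss_nats / math.log(2)

# Placeholder per-round validation losses (nats/char), purely illustrative:
seed_losses = [0.80, 0.82, 0.78]
mean_bpc = sum(bits_per_character(l) for l in seed_losses) / len(seed_losses)
print(f"mean BPC over {len(seed_losses)} rounds: {mean_bpc:.3f}")
```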
In this experimental setting we could not validate the claim that shared-QK attention does not perform worse than standard attention, nor did we observe shared-QK attention training slightly faster, as noted in the paper. Shared-QK performance may converge with that of standard attention given additional training; we leave this to future work.