Experiments: Hashing rounds
Motivation
When using LSH for chunking, there is a small probability that similar tokens will end up in different buckets, so attention will not be computed between them. To reduce this probability, we can perform multiple rounds of hashing with different hash functions.
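As a minimal sketch (not the Reformer implementation itself), the snippet below shows angular random-projection LSH with several independent hashing rounds: two similar vectors may fall into different buckets under one random rotation, but the chance that this happens in every round shrinks as the number of rounds grows. All tensor shapes and the helper name `lsh_buckets` are illustrative choices.

```python
import torch

def lsh_buckets(x, n_buckets, n_hashes):
    """x: (seq_len, dim) token vectors -> (n_hashes, seq_len) bucket ids."""
    dim = x.shape[-1]
    # One independent random rotation per hashing round.
    rotations = torch.randn(n_hashes, dim, n_buckets // 2)
    projected = torch.einsum('sd,hdb->hsb', x, rotations)
    # Angular LSH: the bucket is the index of the largest projection
    # among [xR; -xR], as used in the Reformer paper.
    return torch.argmax(torch.cat([projected, -projected], dim=-1), dim=-1)

x = torch.nn.functional.normalize(torch.randn(4096, 64), dim=-1)
buckets = lsh_buckets(x, n_buckets=64, n_hashes=8)

# Tokens i and j get a chance to attend to each other if they share
# a bucket in at least one of the n_hashes rounds.
i, j = 10, 20
shared = (buckets[:, i] == buckets[:, j]).any()
```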
Claim: the performance of a Transformer with LSH attention increases with the number of hashing rounds and is close to full-attention performance at n_hashes = 8.
To support this claim, we train and compare models with different numbers of hashing rounds on the enwik8 dataset.
enwik8 dataset experiments
For this experiment we trained a 3-layer Transformer with LSH attention for 10 epochs. Training was done on sequences of length 4096 with an effective batch size of 8. The full list of model parameters may be found here; refer to this notebook for the full experiment setup.
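Purely for illustration, a comparable setup could be expressed with the open-source reformer-pytorch package (an assumption; the linked notebook contains the actual code). Only the sequence length 4096, the 3-layer depth, the batch size of 8, and the varying n_hashes come from the text; the model width, number of heads, and byte-level vocabulary size are placeholder choices.

```python
import torch
from reformer_pytorch import ReformerLM

model = ReformerLM(
    num_tokens=256,      # enwik8 modelled at the byte level (placeholder)
    dim=512,             # placeholder model width
    depth=3,             # 3-layer model, as in the experiment
    heads=8,             # placeholder number of attention heads
    max_seq_len=4096,    # training sequence length from the text
    n_hashes=8,          # number of LSH hashing rounds being compared
    causal=True,         # autoregressive language modelling
)

x = torch.randint(0, 256, (8, 4096))   # effective batch size of 8
logits = model(x)                       # (8, 4096, 256)
```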
We report training and validation losses for runs with the number of hashing rounds ranging from 2 to 16, compared to the baseline (a full-attention Transformer).
The results show that both training and validation loss tend to improve as the number of hashing rounds increases. Note that time and memory requirements also grow with n_hashes.
Also, while training losses are very close for larger n_hashes and full attention, models using LSH attention seem to have a slightly higher generalization error.
For comparison with the charts presented in the paper, we also report bits per character (bpc) measured on the validation data.
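Bits per character is simply the per-character cross-entropy expressed in base 2; assuming the validation loss is reported in nats (PyTorch's default for cross-entropy), the conversion is a division by ln(2), as sketched below.

```python
import math

def nats_to_bpc(loss_nats: float) -> float:
    # Convert a per-character cross-entropy loss from nats to bits.
    return loss_nats / math.log(2)

# e.g. a validation loss of 1.0 nat/char corresponds to ~1.44 bpc
print(nats_to_bpc(1.0))
```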