
Experiments: Synthetic task

Results from the synthetic task of the Reformer paper: https://arxiv.org/abs/2001.04451

The task

The synthetic task consists of copying a sequence of integers of the form 0w0w, where w is a sequence of some length composed of integers from 1 to 128. The first half of the sequence is masked from the loss function, so the model has to learn that, midway through the sequence, it should repeat 0w. For a more detailed explanation of the task see the experiment notebook: https://arampacha.github.io/reformer_fastai/experiment.synthetic-task.html
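As an illustration, a batch of such sequences can be generated in a few lines. The sketch below is ours, not the code from the experiment notebook; the sequence length and the use of CrossEntropyLoss's ignore_index for masking are assumptions made for the example.

```python
import torch

def synthetic_batch(bs=64, seq_len=1024, seed=None):
    """Build a batch of 0w0w sequences plus targets with the first half masked."""
    if seed is not None:
        torch.manual_seed(seed)
    half = seq_len // 2
    # w: integers from 1 to 128, of length half - 1 (the leading 0 completes the half)
    w = torch.randint(1, 129, (bs, half - 1))
    zero = torch.zeros(bs, 1, dtype=torch.long)
    x = torch.cat([zero, w, zero, w], dim=1)   # 0 w 0 w
    y = x.clone()
    y[:, :half] = -100                         # mask the first half (CrossEntropyLoss ignore_index)
    return x, y

x, y = synthetic_batch()
print(x.shape, (x[:, :x.shape[1] // 2] == x[:, x.shape[1] // 2:]).all())  # second half repeats the first
```

For an autoregressive model the inputs and targets are additionally shifted by one position; see the notebook for the exact setup.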

Claim: A full-attention transformer can solve this task perfectly. An LSH-attention transformer can also solve it, but performance degrades as the number of hashing rounds decreases. A model trained with one type of attention can also be evaluated with a different type of attention.

Experiments

For this task we used our own implementation of LSH LM: https://arampacha.github.io/reformer_fastai/reformer.html#LSHLM. For details on the LSH algorithm see our project documentation: https://arampacha.github.io/reformer_fastai/exploration.LSH.html. Note that our model uses shared keys and queries even when using full attention. We trained one model with full attention and three models with LSH attention using 1, 2 and 4 hashing rounds respectively.
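As a reminder of how the hashing rounds enter the picture, the core of the angular-LSH scheme from the paper (random rotations of the shared query/key vectors) can be sketched as follows. This is a simplified, batch-free version for intuition only, not the code used inside LSHLM; the function name and shapes are ours.

```python
import torch

def lsh_hash(qk, n_buckets, n_rounds):
    """Angular LSH via random rotations, as described in the Reformer paper.

    qk:       shared query/key vectors, shape (seq_len, d_k)
    returns:  bucket ids, shape (n_rounds, seq_len)
    """
    d_k = qk.shape[-1]
    # one independent random rotation per hashing round
    R = torch.randn(d_k, n_rounds, n_buckets // 2)
    rotated = torch.einsum('sd,drb->rsb', qk, R)        # (n_rounds, seq_len, n_buckets/2)
    rotated = torch.cat([rotated, -rotated], dim=-1)    # (n_rounds, seq_len, n_buckets)
    return rotated.argmax(dim=-1)                       # bucket id per round and position

buckets = lsh_hash(torch.randn(1024, 64), n_buckets=32, n_rounds=4)
print(buckets.shape)  # torch.Size([4, 1024])
```

Each additional round gives similar query/key vectors another chance to land in the same bucket, which is why accuracy is expected to improve with the number of hashes.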

We trained for 150,000 steps with a batch size of 64, a training set of 12,800 sequences and a validation set of 1,280 sequences. We used the hyperparameters described in the paper, and otherwise the defaults suggested in the trax GitHub repository: https://github.com/google/trax. See our SyntheticConfig class for the experiment defaults: https://arampacha.github.io/reformer_fastai/experiment-configs.html#SyntheticConfig.
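With 12,800 training sequences and a batch size of 64 there are 200 batches per epoch, so 150,000 steps correspond to 750 epochs. The sketch below shows how such a configuration could be written down; the class and field names are illustrative, not the actual attributes of SyntheticConfig.

```python
from dataclasses import dataclass

@dataclass
class SyntheticTaskSettings:
    # illustrative field names; see the SyntheticConfig class linked above for the real defaults
    vocab_size: int = 128
    batch_size: int = 64
    train_size: int = 12_800
    valid_size: int = 1_280
    total_steps: int = 150_000
    n_hashes: int = 1          # 1, 2 or 4 in our LSH runs

    @property
    def steps_per_epoch(self) -> int:
        return self.train_size // self.batch_size      # 200

    @property
    def n_epochs(self) -> int:
        return self.total_steps // self.steps_per_epoch  # 750

cfg = SyntheticTaskSettings()
print(cfg.steps_per_epoch, cfg.n_epochs)  # 200 750
```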



Training dynamics

Seed

During early prototyping with a standard transformer model we had observed that the model would suddenly "understand" the task and reach zero loss and perfect accuracy, as expected. However, when we started training our LSHLM this was not always the case. The runs below illustrate this behavior: they are identical except for the seed, yet the training outcomes are very different.




[Plot: training curves for two runs that differ only in the random seed]
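The two runs were launched from the same configuration with only the seed changed, roughly as in the sketch below (assuming fastai's set_seed; the seed values are illustrative and the full training loop is in the experiment notebook).

```python
from fastai.torch_core import set_seed

for seed in (42, 123):               # illustrative values; only the seed differs between runs
    set_seed(seed, reproducible=True)
    # build the dataloaders, LSHLM model and Learner exactly as before, then train
    ...
```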


Learning rate

The learning rate also had a clear effect on training. The plot below shows an LSHLM with a single hashing round trained with a base learning rate of 1e-3 and of 1e-4. We used the 1-cycle policy from the fastai library for learning rate scheduling: https://docs.fast.ai/callback.schedule.html#Learner.fit_one_cycle




[Plot: training curves for base learning rates of 1e-3 and 1e-4]
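In code, the two runs differ only in the maximum learning rate passed to fit_one_cycle. The helper below is illustrative and assumes a fastai Learner wrapping LSHLM has already been built.

```python
from fastai.text.all import *   # brings in Learner and the fit_one_cycle schedule

def train_lshlm(learn: Learner, n_epochs: int, base_lr: float):
    """Train with the 1-cycle policy: the lr ramps up towards base_lr, then anneals back down."""
    learn.fit_one_cycle(n_epochs, lr_max=base_lr)

# the two runs in the plot used base_lr=1e-3 and base_lr=1e-4; as noted in the
# discussion below, the higher learning rate was the one that trained LSH-1 well
```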


Results

We did not do an extensive hyperparameter search, but after a few iterations we were able to train models that performed as expected. We trained four models:

  • Full attention
  • LSH attention with 1, 2 and 4 hashing rounds

We then evaluated each model with full attention and with LSH attention using 8, 4, 2 and 1 hashing rounds. Results are summarized in the table below. The first column shows how the model was trained; the remaining columns show how it was evaluated. See our documentation for the analysis setup: https://arampacha.github.io/reformer_fastai/experiment.synthetic-task-analysis.html.

| Training \ Evaluation | Full Attention | LSH-8 | LSH-4 | LSH-2 | LSH-1 |
|---|---|---|---|---|---|
| Full Attention | 100.00 | 1.37 | 1.85 | 3.00 | 4.56 |
| LSH-4 | 46.54 | 99.71 | 99.77 | 93.05 | 77.62 |
| LSH-2 | 75.94 | 96.60 | 97.45 | 97.08 | 86.06 |
| LSH-1 | 70.65 | 76.61 | 79.68 | 79.34 | 56.09 |


We see that the LSH models gradually perform worse with fewer hashing rounds, both in training and in evaluation, as expected. LSH-4 gives near-identical performance to full attention when evaluated with LSH attention.
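The evaluation loop behind the table looks roughly like the sketch below. The use_lsh / n_hashes attribute names are placeholders; the actual way to switch attention type on LSHLM is shown in the analysis notebook linked above.

```python
def eval_attention_types(learn, n_hashes_list=(8, 4, 2, 1)):
    """Evaluate one trained model under full attention and several LSH settings."""
    results = {}
    model = learn.model
    model.use_lsh = False                    # placeholder attribute: full attention
    results['full'] = learn.validate()[1]    # fastai validate() returns [loss, metric, ...]
    model.use_lsh = True
    for n in n_hashes_list:
        model.n_hashes = n                   # placeholder attribute: number of hashing rounds
        results[f'lsh-{n}'] = learn.validate()[1]
    return results
```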



Discrepancy with results from the paper

The results from the paper were as follows:

[Image: results table from the Reformer paper]

We can see that there are three clear differences in the tables:

  1. Our results for this particular set of runs are somewhat poorer than those in the paper, especially for LSH-1. But as noted above, we were also able to train LSH-1 to near perfection when we used a higher learning rate. This suggests that the absolute numbers in the table are somewhat noisy and depend on the specific training setup.
  2. Models trained with LSH attention and validated with standard attention do much better in our experiments than in the paper.
  3. The model trained with standard attention and validated with LSH attention does much worse in our experiments.

One explanation could be that there are issues with our implementation, but overall the models seem to behave as expected. We also see similar results when comparing against other implementations; see our documentation: https://arampacha.github.io/reformer_fastai/experiment.synthetic-task-comparison.html

A second possibility is that, due to random factors (as we observed with the role of the seed in training), the resulting models simply behave differently. For example, when validating the full-attention model there is no clear trend in the effect of the number of hashing rounds.

A final explanation is that there might be a mix-up in the summarizing of results, leading to swapped rows/columns in the original report.