Experiments: Deep Reformer models
Motivation
Multiple studies have shown that deeper models generally perform better. Reformer is designed to be memory efficient: reversible layers make it feasible to train deep models, and LSH attention makes it feasible to train on very long sequences.
Claims: 1) Deep Reformer models can be trained on very long sequences using a single accelerator (GPU or TPU core); 2) deeper Reformer models achieve better performance.
enwik8 experiment
To test these claims we train models of different depths on inputs with seq_len = 16,384, using the enwik8 dataset. Our experiments are designed to run on a single 12 GB GPU.
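For concreteness, the sketch below shows how such a model could be instantiated at this sequence length. The choice of the reformer_pytorch library and the specific hyperparameters (dim, heads, bucket_size, n_hashes) are our assumptions for illustration, not necessarily the exact configuration used in these experiments.

```python
import torch
from reformer_pytorch import ReformerLM  # assumed third-party Reformer implementation

# Illustrative configuration: byte-level enwik8 (vocabulary of 256) with the
# full 16,384-token context. Dimensions and LSH settings are assumptions.
model = ReformerLM(
    num_tokens=256,      # byte-level vocabulary
    dim=512,
    depth=6,             # varied across 3, 6, 12 in these experiments
    heads=8,
    max_seq_len=16384,
    bucket_size=64,      # LSH attention bucket size
    n_hashes=4,          # number of hash rounds for LSH attention
    causal=True,         # autoregressive language modelling
).cuda()

tokens = torch.randint(0, 256, (1, 16384)).cuda()
logits = model(tokens)   # (1, 16384, 256)
```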
Unfortunately, training very deep models is beyond our computational budget. However, the first claim follows directly from the fact that reversible layers have $O(1)$ activation memory in the number of layers: activations are recomputed during the backward pass rather than stored, so a very deep model can fit in memory at the cost of extra compute.
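To make the memory argument concrete, below is a minimal, self-contained sketch of the reversible residual block used by Reformer (y1 = x1 + F(x2), y2 = x2 + G(y1)). It only demonstrates that inputs can be reconstructed from outputs; the actual memory savings additionally require a custom backward pass (e.g. a torch.autograd.Function) that performs this reconstruction instead of storing activations, which is omitted here.

```python
import torch
import torch.nn as nn

class ReversibleBlock(nn.Module):
    """Reversible residual block: y1 = x1 + F(x2), y2 = x2 + G(y1).
    Because the inputs can be recovered from the outputs, per-layer
    activations do not have to be kept for the backward pass."""
    def __init__(self, f: nn.Module, g: nn.Module):
        super().__init__()
        self.f = f
        self.g = g

    def forward(self, x1, x2):
        y1 = x1 + self.f(x2)
        y2 = x2 + self.g(y1)
        return y1, y2

    def inverse(self, y1, y2):
        # Recompute the inputs from the outputs.
        x2 = y2 - self.g(y1)
        x1 = y1 - self.f(x2)
        return x1, x2

# Sanity check: stack 12 blocks, then invert the stack and recover the inputs.
blocks = [ReversibleBlock(nn.Linear(64, 64), nn.Linear(64, 64)) for _ in range(12)]
x1, x2 = torch.randn(2, 64), torch.randn(2, 64)
y1, y2 = x1, x2
for b in blocks:
    y1, y2 = b(y1, y2)
with torch.no_grad():
    r1, r2 = y1, y2
    for b in reversed(blocks):
        r1, r2 = b.inverse(r1, r2)
print(torch.allclose(r1, x1, atol=1e-5), torch.allclose(r2, x2, atol=1e-5))
```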
We trained models with 3, 6, and 12 layers for 4 epochs. Deeper models show a trend toward lower training loss as training progresses; training for longer would help establish whether this trend strengthens.
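The depth sweep itself can be sketched as follows. The data pipeline (`enwik8_batches`, a hypothetical iterable of byte-level LongTensors of shape (batch, 16384)) and the optimiser settings are illustrative assumptions, not the exact training setup.

```python
import torch
from reformer_pytorch import ReformerLM  # assumed implementation, as above

def train_depth(depth, enwik8_batches, epochs=4, device="cuda"):
    """Train one byte-level Reformer LM of the given depth (sketch only)."""
    model = ReformerLM(num_tokens=256, dim=512, depth=depth,
                       max_seq_len=16384, causal=True).to(device)
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    for _ in range(epochs):
        for batch in enwik8_batches:                 # batch: LongTensor (B, 16384)
            batch = batch.to(device)
            logits = model(batch)                    # (B, 16384, 256)
            # Next-byte prediction: shift targets by one position.
            loss = torch.nn.functional.cross_entropy(
                logits[:, :-1].reshape(-1, 256), batch[:, 1:].reshape(-1))
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model

# The three depths compared in this section:
# models = {d: train_depth(d, enwik8_batches) for d in (3, 6, 12)}
```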