Experiments: Reversible Residual Layers
Motivation
Reversible Residual Layers were first introduced in the Computer Vision research community to reduce memory requirements while preserving performance. In the Reformer paper they are demonstrated to bring the same memory savings to the Transformer architecture without sacrificing accuracy.
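The memory saving comes from the reversible residual formulation: each layer's inputs can be recomputed from its outputs during the backward pass instead of being stored. Below is a minimal PyTorch sketch of such a block; the names `ReversibleBlock`, `f_block` and `g_block` are illustrative and not the implementation used in our experiments.

```python
import torch.nn as nn

class ReversibleBlock(nn.Module):
    """Sketch of a RevNet-style reversible residual block, assuming `f_block`
    and `g_block` are arbitrary sub-layers (e.g. attention and feed-forward)."""
    def __init__(self, f_block: nn.Module, g_block: nn.Module):
        super().__init__()
        self.f, self.g = f_block, g_block

    def forward(self, x1, x2):
        # Forward pass over the two activation streams:
        # y1 = x1 + F(x2), y2 = x2 + G(y1)
        y1 = x1 + self.f(x2)
        y2 = x2 + self.g(y1)
        return y1, y2

    def inverse(self, y1, y2):
        # Inputs are reconstructed from the outputs during the backward pass,
        # so activations need not be cached:
        # x2 = y2 - G(y1), x1 = y1 - F(x2)
        x2 = y2 - self.g(y1)
        x1 = y1 - self.f(x2)
        return x1, x2
```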
Claim:
1) Reversible Residual Layers in a Transformer enable more memory efficient training and do not come at the expense of model accuracy
To validate this claim we train a Transformer Language Model and a full sequence-to-sequence Transformer using Reversible Residual Layers and compare them to their standard Transformer equivalents.
Reversible Transformer LM
We train the Reversible Transformer LM ("ReversibleLM"), a 6-layer causal language model, on the enwik8 dataset with a sequence length of 4096, using the Adafactor optimizer. A batch size of 8 was achieved via Gradient Accumulation. Experiments were run on 15GB and 12GB GPUs and training was carried out in full precision. Unfortunately, training this model on full 64k-token sequences from enwik8, as in the original paper, was not feasible given our computational budget. Results are averaged over 3 runs for each model type.
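As a rough illustration of how the effective batch size of 8 was reached on limited GPU memory, the sketch below shows plain PyTorch gradient accumulation. The model, data and loss are hypothetical stand-ins, and Adam is used here only to keep the snippet self-contained (Adafactor was used in the actual runs).

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the ReversibleLM and enwik8 batches.
model = nn.Linear(16, 16)
loss_fn = nn.MSELoss()
data = [(torch.randn(1, 16), torch.randn(1, 16)) for _ in range(32)]

accum_steps = 8  # per-step batch size 1, effective batch size 8
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # Adafactor in the real experiments

optimizer.zero_grad()
for step, (x, y) in enumerate(data):
    loss = loss_fn(model(x), y)
    (loss / accum_steps).backward()   # scale so accumulated gradients match a full batch
    if (step + 1) % accum_steps == 0:
        optimizer.step()              # one optimizer update per 8 micro-batches
        optimizer.zero_grad()
```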
We could not validate the claim that Reversible Residual Layers have no significant impact on language model performance: we observed a sizeable difference of 0.11 BPC between the ReversibleLM and the baseline TransformerLM.
Reversible Transformer
We train a full Reversible Transformer ("ReversibleTransformer"), a 12-layer sequence-to-sequence Transformer model with Reversible Residual layers, on the WMT-14 en-de translation dataset. Given the short sequences in this dataset, a sequence length of 256 was used, with a batch size of 64 and the Adam optimizer. Gradient Accumulation was used when training on a 12GB GPU. Training was carried out for 2 epochs in full precision. The one-cycle learning rate schedule from fast.ai was used, with a maximum learning rate of 1e-4, an initial div of 5 and a percent start of 12.5%.
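For reference, the snippet below approximates this schedule with PyTorch's `OneCycleLR`, whose `div_factor` and `pct_start` arguments play the role of the initial div and percent start above. The model and step counts are placeholders; the actual experiments used the fast.ai implementation of one-cycle training.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the ReversibleTransformer and the WMT-14 dataloader.
model = nn.Linear(16, 16)
steps_per_epoch, epochs = 1000, 2
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# One-cycle policy: warm up for 12.5% of steps from max_lr / div_factor
# up to max_lr = 1e-4, then anneal back down.
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=1e-4,
    div_factor=5,
    pct_start=0.125,
    epochs=epochs,
    steps_per_epoch=steps_per_epoch,
)

for _ in range(epochs * steps_per_epoch):
    optimizer.step()   # (forward/backward pass elided)
    scheduler.step()   # advance the learning rate schedule once per batch
```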
From our experiments we COULD / COULD NOT validate the claim that Reversible Residual Layers do not have a significant impact on Transformer model performance due to XXX.