Experiments: Reversible Residual Layers
Motivation
Reversible Residual Layers were first introduced in the Computer Vision research community to reduce memory requirements while preserving performance. In the Reformer paper they are demonstrated to bring the same memory savings to the Transformer architecture without sacrificing accuracy.
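The memory saving comes from the reversible residual formulation: each layer's inputs can be recomputed from its outputs during the backward pass instead of being stored. Below is a minimal PyTorch sketch of such a block; the names `ReversibleBlock`, `f_block` and `g_block` are illustrative and not the implementation used in our experiments.

```python
import torch.nn as nn

class ReversibleBlock(nn.Module):
    """Sketch of a RevNet-style reversible residual block, assuming `f_block`
    and `g_block` are arbitrary sub-layers (e.g. attention and feed-forward)."""
    def __init__(self, f_block: nn.Module, g_block: nn.Module):
        super().__init__()
        self.f, self.g = f_block, g_block

    def forward(self, x1, x2):
        # Forward pass over the two activation streams:
        # y1 = x1 + F(x2), y2 = x2 + G(y1)
        y1 = x1 + self.f(x2)
        y2 = x2 + self.g(y1)
        return y1, y2

    def inverse(self, y1, y2):
        # Inputs are reconstructed from the outputs during the backward pass,
        # so activations need not be cached:
        # x2 = y2 - G(y1), x1 = y1 - F(x2)
        x2 = y2 - self.g(y1)
        x1 = y1 - self.f(x2)
        return x1, x2
```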
Claim:
1) Reversible Residual Layers in a Transformer enable more memory efficient training and do not come at the expense of model accuracy
To validate this claim we train a Transformer Language Model and a full sequence-to-sequence Transformer using Reversible Residual Layers and compare them to their standard Transformer equivalents.
Reversible Transformer LM
We train the Reversible Transformer LM ("ReversibleLM"), a 6-layer causal language model, on the enwik8 dataset with a sequence length of 4096, using the Adafactor optimizer. A batch size of 8 was achieved via Gradient Accumulation. Experiments were run on 15GB and 12GB GPUs and training was carried out in full precision. Unfortunately, training this model on full 64k-token sequences from enwik8, as in the original paper, was not feasible given our computational budget. Results are averaged over 3 runs for each model type.
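As a rough illustration of how the effective batch size of 8 was reached on limited GPU memory, the sketch below shows plain PyTorch gradient accumulation. The model, data and loss are hypothetical stand-ins, and Adam is used here only to keep the snippet self-contained (Adafactor was used in the actual runs).

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the ReversibleLM and enwik8 batches.
model = nn.Linear(16, 16)
loss_fn = nn.MSELoss()
data = [(torch.randn(1, 16), torch.randn(1, 16)) for _ in range(32)]

accum_steps = 8  # per-step batch size 1, effective batch size 8
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # Adafactor in the real experiments

optimizer.zero_grad()
for step, (x, y) in enumerate(data):
    loss = loss_fn(model(x), y)
    (loss / accum_steps).backward()   # scale so accumulated gradients match a full batch
    if (step + 1) % accum_steps == 0:
        optimizer.step()              # one optimizer update per 8 micro-batches
        optimizer.zero_grad()
```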
We could not validate the claim that Reversible Residual Layers have no significant impact on language model performance: we observed a sizeable difference of 0.11 BPC between the ReversibleLM and the baseline TransformerLM.
Reversible Transformer
We train a full Reversible Transformer ("ReversibleTransformer"), a 12-layer sequence-to-sequence Transformer model with Reversible Residual layers, on the WMT-14 en-de translation dataset. Given the short sequences in this dataset, a sequence length of 256 was used, with a batch size of 64 and the Adam optimizer. Gradient Accumulation was used when training on a 12GB GPU. Training was carried out for 2 epochs in full precision. The one-cycle learning rate schedule from fast.ai was used, with a maximum learning rate of 1e-4, an initial div of 5 and a percent start of 12.5%.
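For reference, the snippet below approximates this schedule with PyTorch's `OneCycleLR`, whose `div_factor` and `pct_start` arguments play the role of the initial div and percent start above. The model and step counts are placeholders; the actual experiments used the fast.ai implementation of one-cycle training.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the ReversibleTransformer and the WMT-14 dataloader.
model = nn.Linear(16, 16)
steps_per_epoch, epochs = 1000, 2
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# One-cycle policy: warm up for 12.5% of steps from max_lr / div_factor
# up to max_lr = 1e-4, then anneal back down.
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=1e-4,
    div_factor=5,
    pct_start=0.125,
    epochs=epochs,
    steps_per_epoch=steps_per_epoch,
)

for _ in range(epochs * steps_per_epoch):
    optimizer.step()   # (forward/backward pass elided)
    scheduler.step()   # advance the learning rate schedule once per batch
```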
From our experiments we COULD / COULD NOT validate the claim that Reversible Residual Layers do not have a significant impact on Transformer model performance due to XXX.