
Methods: Reversible Residual Layers

A brief explanation of how Reversible Residual Layers operate

Reversible Residual Layers were introduced by Gomez et al. The idea is to make layers "reversible", i.e. each layer's activations can be computed from the next layer's activations. Using Reversible Residual Layers therefore lets us perform backpropagation without storing the activations in memory: the activation memory requirements of architectures built from Reversible Residual blocks are independent of the number of layers in the model. Hence one can train arbitrarily deep models, at the cost of increased compute, since activations are recomputed during the backward pass.


rev-res-block.png: (a) the forward and (b) the reverse computations of a residual block (figure from Gomez et al.)


The forward pass for a Reversible Residual Layer is given as

y_1 = x_1 + \mathcal{F}(x_2)

y_2 = x_2 + \mathcal{G}(y_1)
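
As a minimal sketch (the function name rev_forward and the callables F and G are placeholders of ours, not names from the paper), the forward pass keeps two activation streams and applies the two residual updates in turn:

```python
def rev_forward(x1, x2, F, G):
    """Forward pass of a reversible residual block.

    F and G are arbitrary deterministic functions (e.g. small nn.Modules).
    Neither x1 nor x2 needs to be stored for the backward pass.
    """
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2
```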

When doing a backward pass, the intermediate activations can be reconstructed as follows:

x_2 = y_2 - \mathcal{G}(y_1)

x_1 = y_1 - \mathcal{F}(x_2)
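
The reconstruction mirrors the forward pass; a matching sketch (again with hypothetical names) simply undoes the two updates in reverse order:

```python
def rev_inverse(y1, y2, F, G):
    """Recover the inputs of a reversible residual block from its outputs.

    G(y1) and F(x2) are recomputed, so the reconstruction is exact whenever
    F and G are deterministic (identical ops on identical inputs).
    """
    x2 = y2 - G(y1)   # undo y2 = x2 + G(y1)
    x1 = y1 - F(x2)   # undo y1 = x1 + F(x2)
    return x1, x2
```

During backpropagation, such reconstructions are interleaved with the local gradient computations layer by layer, which is why the activation memory does not grow with depth.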



Applying this to the Transformer architecture, we set \mathcal{F} = \mathbf{MultiHeadAttention} and \mathcal{G} = \mathbf{FeedForward}; x_1 and x_2 are initialized as copies of the original input.
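
To make that concrete, here is a hedged PyTorch sketch of a single reversible Transformer block (the class name, hyperparameters, and layer choices are ours, illustrating the equations above rather than reproducing the Reformer implementation linked below):

```python
import torch
import torch.nn as nn

class ReversibleTransformerBlock(nn.Module):
    """Illustrative reversible block with F = self-attention and G = feed-forward."""

    def __init__(self, d_model=64, n_heads=4, d_ff=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def F(self, x):  # F = MultiHeadAttention (self-attention on the x2 stream)
        return self.attn(x, x, x, need_weights=False)[0]

    def G(self, x):  # G = FeedForward
        return self.ff(x)

    def forward(self, x):
        x1, x2 = x, x                  # both streams start as copies of the input
        y1 = x1 + self.F(x2)
        y2 = x2 + self.G(y1)
        return y1, y2


block = ReversibleTransformerBlock().eval()
x = torch.randn(2, 16, 64)             # (batch, sequence length, d_model)
with torch.no_grad():
    y1, y2 = block(x)
    # The inputs can be reconstructed from the outputs alone:
    x2 = y2 - block.G(y1)
    x1 = y1 - block.F(x2)
print(torch.allclose(x1, x), torch.allclose(x2, x))  # expect: True True
```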

Our implementation of Reversible Residual Layers is largely inspired by https://github.com/lucidrains/reformer-pytorch and can be found here.