
Methods: Reversible Residual Layers

A brief explanation of how Reversible Residual Layers operate

Reversible Residual Layers were introduced by Gomez et al. The idea is to make layers "reversible", i.e. each layer's activations can be computed from the next layer's activations. Using Reversible Residual Layers therefore lets us perform backpropagation without storing the activations in memory: the activation memory requirements of architectures built from Reversible Residual blocks are independent of the number of layers in the model. Hence one can train arbitrarily deep models, at the cost of increased compute, since activations are recomputed during the backward pass.


rev-res-block.png: (a) the forward and (b) the reverse computations of a residual block (figure from Gomez et al.)


The forward pass for a Reversible Residual Layer is given as

y_1 = x_1 + \mathcal{F}(x_2)

y_2 = x_2 + \mathcal{G}(y_1)
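
As a minimal sketch (the function name rev_forward and the callables F and G are placeholders of ours, not names from the paper), the forward pass keeps two activation streams and applies the two residual updates in turn:

```python
def rev_forward(x1, x2, F, G):
    """Forward pass of a reversible residual block.

    F and G are arbitrary deterministic functions (e.g. small nn.Modules).
    Neither x1 nor x2 needs to be stored for the backward pass.
    """
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2
```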

When doing a backward pass, the intermediate activations can be reconstructed as follows:

x_2 = y_2 - \mathcal{G}(y_1)

x_1 = y_1 - \mathcal{F}(x_2)
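
The reconstruction mirrors the forward pass; a matching sketch (again with hypothetical names) simply undoes the two updates in reverse order:

```python
def rev_inverse(y1, y2, F, G):
    """Recover the inputs of a reversible residual block from its outputs.

    G(y1) and F(x2) are recomputed, so the reconstruction is exact whenever
    F and G are deterministic (identical ops on identical inputs).
    """
    x2 = y2 - G(y1)   # undo y2 = x2 + G(y1)
    x1 = y1 - F(x2)   # undo y1 = x1 + F(x2)
    return x1, x2
```

During backpropagation, such reconstructions are interleaved with the local gradient computations layer by layer, which is why the activation memory does not grow with depth.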



Applying this to the Transformer architecture, we set \mathcal{F} = \mathbf{MultiHeadAttention} and \mathcal{G} = \mathbf{FeedForward}; x_1 and x_2 are initialized as copies of the original input.
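
To make that concrete, here is a hedged PyTorch sketch of a single reversible Transformer block (the class name, hyperparameters, and layer choices are ours, illustrating the equations above rather than reproducing the Reformer implementation linked below):

```python
import torch
import torch.nn as nn

class ReversibleTransformerBlock(nn.Module):
    """Illustrative reversible block with F = self-attention and G = feed-forward."""

    def __init__(self, d_model=64, n_heads=4, d_ff=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def F(self, x):  # F = MultiHeadAttention (self-attention on the x2 stream)
        return self.attn(x, x, x, need_weights=False)[0]

    def G(self, x):  # G = FeedForward
        return self.ff(x)

    def forward(self, x):
        x1, x2 = x, x                  # both streams start as copies of the input
        y1 = x1 + self.F(x2)
        y2 = x2 + self.G(y1)
        return y1, y2


block = ReversibleTransformerBlock().eval()
x = torch.randn(2, 16, 64)             # (batch, sequence length, d_model)
with torch.no_grad():
    y1, y2 = block(x)
    # The inputs can be reconstructed from the outputs alone:
    x2 = y2 - block.G(y1)
    x1 = y1 - block.F(x2)
print(torch.allclose(x1, x), torch.allclose(x2, x))  # expect: True True
```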

Our implementation of Reversible Residual Layers is largely inspired by https://github.com/lucidrains/reformer-pytorch and can be found here.