[Re] On the relationship between self-attention and convolutional layers

Our submission to ML Reproducibility Challenge 2020. Original paper "On the Relationship between Self-Attention and Convolutional Layers" by Jean-Baptiste Cordonnier, Andreas Loukas & Martin Jaggi, accepted into ICML 2020.
Nishant Prabhu
Created on April 2|Last edited on April 3
Comment
﻿
Reproducibility Summary
1	Introduction
2	Fundamentals
3	Reproducibility
4	Hierarchical attentionGiven the problems described above, we need to propose a method that can avoid the over-expressive nature of independent attention heads while still being able to learn and derive strong features from the input image across layers. We now introduce a novel attention operation which we refer to as Hierarchical Attention (HA) operation. In the following sections, we explain the core idea behind the operation and perform detailed experiments to showcase its effectiveness. 
4.1	MethodologyIn most of the transformer-based methods, independent self-attention layers are stacked sequentially and the output from one is passed onto the next to derive the QQQ﻿, KKK﻿ , and VVV﻿ vectors. This allows each attention head to freely attend to specific features and derive a better representation. Deviating from these methods, the HA operation updates only the QQQ﻿ and VVV﻿ vectors while the KKK﻿ remains the same after each attention block. Further, the weights are shared across these attention layers inducing the transformer model to iteratively refine its representation across layers. This can be considered analogous to an unrolled recurrent neural network (RNN) as we are trying to sequentially improve the representation across layers based on the previous hidden state. This helps the model hierarchically learn complex features by focusing on corresponding portions of the KKK﻿ vector, and aggregating required information from the VVV﻿ vector. Figs. 6, 7 visualize the normal attention and the hierarchical attention operations respectively. For the sake of simplicity, we do not visualize the entire inner workings of the transformer like residual connections, layer normalization, etc. This is a very simple yet effective method that can be easily adapted to any existing attention-based network as described in the next section. 
Figure 6: Normal Attention (Scaled Dot Product). The projection weights in each layer are different and independently learned (different color). QQQ﻿, KKK﻿, and VVV﻿ vectors are updated in each layer.
Figure 7: Hierarchical Attention (HA). The projection weights are shared across all the layers. Only QQQ﻿, VVV﻿ vectors are updated while the KKK﻿ remains the same in each layer.
As mentioned earlier, there has been a lot of recent hype in replacing convolutions with attention layers. Hence, we choose two other popular and similar papers - "Exploring Self-attention for Image Recognition" [2], "An Image is worth 16x16 words: Transformers for Image Recognition at scale" [3﻿﻿], and apply the proposed HA operation to justify its effectiveness. For better understanding, we briefly introduce these papers in the following subsections.
4.2	Pairwise and patchwise self-attention (SAN)Introduced by [2], pairwise self-attention is essentially a general representation of the self-attention operation. It is fundamentally a set operation, does not attach stationary weights to specific locations and is invariant to permutation and cardinality. The paper presents a number of variants of the pairwise attention that have greater expressive power than dot-product attention. Specifically, the weight computation does not collapse the channel dimension and allows the feature aggregation to adapt to each channel. It can be mathematically formulated as follows:
yi=∑jϵRiα(xi,xj)⊙β(xj)α(xi,xj)=γ(δ(xi,xj))y_i = \sum_{j \epsilon R_i}^{} \alpha(x_i, x_j) \odot \beta\left({x_j}\right) \\
    \alpha(x_i, x_j) = \gamma({\delta(x_i, x_j)})yi​=∑jϵRi​​α(xi​,xj​)⊙β(xj​)α(xi​,xj​)=γ(δ(xi​,xj​))﻿
Here, iii﻿ is the spatial index of feature vector xix_ixi​﻿, δ(.)\delta(.)δ(.)﻿ is the relation function mapped onto another vector by γ(.)\gamma(.)γ(.)﻿. The adaptive weight vectors α(xi,xj)\alpha(x_i,x_j)α(xi​,xj​)﻿ aggregate feature vectors obtained from β(.)\beta(.)β(.)﻿. An important point to note here is that δ\deltaδ﻿ can produce vectors of different dimensions when compared to β\betaβ﻿, allowing for a more expressive weight construction.
The patch-wise self attention is a variant of the pairwise operation, where xix_ixi​﻿ is replaced by a patch of feature vectors xR(i)x_{R(i)}xR(i)​﻿ which allows the weight vector to incorporate information from all the feature vectors in the patch. The equations are hence rewritten as:
yi=∑jϵRiα(xR(i))j⊙β(xj)α(xR(i))=γ(δ(xR(i))) y_i = \sum_{j \epsilon R_i}^{} \alpha(x_{R(i)})_j \odot \beta\left({x_j}\right) \\
    \alpha(x_{R(i)}) = \gamma({\delta(x_{R(i)})})yi​=∑jϵRi​​α(xR(i)​)j​⊙β(xj​)α(xR(i)​)=γ(δ(xR(i)​))﻿
﻿
4.3	Vision transformer (ViT)The Vision Transformer [3], has successfully shown that reliance on convolutions is no longer necessary and a pure transformer outperforms all convolution-based techniques significantly when pre-trained on large amounts of data. In this method, an image is split into patches which are then projected onto another representation using a trainable layer and then passed through a standard set of transformer operations as described earlier.
4.4	ResultsTo validate our intuition and to prove the effectiveness of the proposed method, we incorporate HA in all the methods described above without modifying the overall structure of the architecture﻿. Table 3 compares the accuracy for each model with its corresponding HA variant. We see significant improvements in the performance (at least 5% gain) in each case while being able to reduce the number of parameters to almost (1/5)th(1/5)^{th}(1/5)th﻿ of the original model. As mentioned earlier, transformers require sufficient training to perform equally well as convolution-based architectures. When pre-trained on large datasets (14M-300M images), transformer-based architectures achieve excellent performance and transfer to tasks with fewer data points [3]. However, for all our experiments in this report, we only focus on training these models from scratch on the CIFAR10 dataset.
﻿
Table 3: Comparison between models using normal SA and HA. Wall times are average inference times in milliseconds for the models over 300 iterations.
In Fig. 8a, we visualize the attention probabilities for a given query pixel. The relationship between self-attention and convolutions is striking as the model is attending to distinct pixels at a fixed shift from the query pixel reproducing the receptive field of the convolution operation. The initial layers attend to local patterns while the deeper layers focus on larger patterns positioned further away from the query pixel. Similarly, in Fig. 8b, the attention heads from the last two layers are no longer sparse and help capture more information. Hence the visual correctness verifies the operation and its increased performance. We also visualize the attention probabilities for SAN and ViT in Sections. B.7-﻿﻿B.8 and B.9-B.10 (Appendix) respectively. 
(a) Attention probabilities for a given query pixel (red square is on the frog head).                               (b) Average Attention visualization                       Figure 8: Attention probabilities for a model with 6 layers (rows) and 9 heads (columns) using Hierarchical Learned 2D embedding w/ content
We summarize the hierarchical operation as follows:
Enable weight sharing between the layers of the model by reusing the KKK﻿ vector and updating the Q,VQ,VQ,V﻿ vectors only. This helps the model progressively extract higher-level features.
The method of progressive refinement ensures that the attention matrices do not get over-expressive. This leads to a significant improvement over the corresponding non-hierarchical cases. 
The total number of parameters in the model is independent of the number of layers and this property helps significantly reduce the number of parameters when compared to the non-hierarchical versions. Also, this helps make deeper models without worrying about memory constraints.
Even if the model is provided more layers than are necessary, every layer learns to attend to a different pattern. The new features learned at every layer add on to those learned by the previous ones, which provides its characteristic hierarchical nature. By visualizing the attention scores on a test image, we obtain convincing evidence to support this hypothesis.
5	Conclusion
References
Appendix
A	Hyper-parameters
B	Attention visualizationWe present more examples of visualizing attention in various models.
B.1	Learned embedding with content, 9 heads
B.2	Learned embedding without content, 9 heads
B.3	Hierarchical learned embedding with content, 9 heads
B.4	Learned embedding with content, 16 heads
B.5	Learned embedding without content, 16 heads
B.6	Hierarchical learned embedding with content, 16 heads
B.7	Hierarchical SAN pairwise
B.8	Hierarchical SAN patchwise
B.9	Vision transformer (ViT)
B.10 Hierarchical Vision transformer (ViT)
C	Wandb Training LogsWe provide training logs for all our experiments here for the reader's reference. 

W&B Experiment Run Links
ResNet 18
Learned 2D w/ content; 9 heads
Learned 2D w/o content; 9 heads
Quadratic Encoding (isotropic); 9 heads
Quadratic Encoding (non-isotropic); 9 heads
Learned 2D w/ content; 16 heads
Learned 2D w/o content; 16 heads
Quadratic Encoding (non-isotropic); 16 heads
SAN (pairwise)
SAN (patchwise)
ViT
Hierarchical Learned 2D w/ content; 9 heads
Hierarchical Learned 2D w/ content; 16 heads
Hierarchical SAN (pairwise)
Hierarchical SAN (patchwise)
Hierarchical ViT
﻿
﻿
W&B Experiment Run Links
ResNet 18
Learned 2D w/ content; 9 heads
Learned 2D w/o content; 9 heads
Quadratic Encoding (isotropic); 9 heads
Quadratic Encoding (non-isotropic); 9 heads
Learned 2D w/ content; 16 heads
Learned 2D w/o content; 16 heads
Quadratic Encoding (non-isotropic); 16 heads
SAN (pairwise)
SAN (patchwise)
ViT
Hierarchical Learned 2D w/ content; 9 heads
Hierarchical Learned 2D w/ content; 16 heads
Hierarchical SAN (pairwise)
Hierarchical SAN (patchwise)
Hierarchical ViT
Add a comment