ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases

As part of this blog post, we will look into the ConViT architecture in detail and learn all about the gated positional self-attention (GPSA) layer! We will also see how the ConViT architecture gets the best of both worlds and obtains the benefits of both Transformers and CNNs.
Aman Arora

Paper | GitHub | Model Checkpoints

Credits

Thank you Stéphane d'Ascoli for getting on a call with me and helping me understand the ConViT architecture. This blog post would not have been possible without your help.

Prerequisite

In this blog post, I am going to assume that the reader has a good understanding of the Vision Transformer (ViT) architecture. Please refer to this blog for an introduction to the ViT architecture.

Introduction

Ever since the release of Vision Transformer, a large number of other architectures have also been introduced such as DeiT, CaiT, TnT, Swin Transformers, DETR, Visformer, BoTNet, HaloNet & more - that have utilized the Transformer and applied it to the field of computer vision.
But IMHO, we are still a bit further away from seeing self-attention as a complete replacement for convolutional neural networks. From what I've seen and read so far, the best model performance comes from getting the best of both worlds - using convolutions (in the earlier part) and self-attention (in the latter part) of the model architecture. Allow me to explain:
As Cordonnier, Loukas and Jaggi have shown in their work - On the Relationship between Self-Attention and Convolutional Layers - a self-attention layer with N_h heads can express a convolution of kernel size \sqrt{N_h} (for example, 9 heads can express a 3×3 convolution) if each head focuses on one of the pixels in the kernel patch. While this is theoretical evidence that self-attention can completely replace convolutions, practically speaking, the best results in Visformer, BoTNet and Stand-Alone Self-Attention in Vision Models were obtained when both convolutional and self-attention layers were utilized. A common theme is to use convolutional layers in the early parts of the network and self-attention layers in the later parts.
The ConViT research paper also builds on top of this insight and replaces the first 10 self-attention layers of the Vision Transformer with gated positional self-attention (GPSA) layers - which upon initialization act as convolutional layers and based on a gating parameter can convert to self-attention layers.
Doing so makes the earlier part of the network behave, at initialization, as a convolutional neural network, with the option to turn into a fully self-attention-based network depending on the gating parameter, which is learned during training.
As part of this blog post, we are going to be looking into the ConViT architecture in detail and also look at how the GPSA layers are different from self-attention (SA) layers.

Transformers vs Convolutions - is there any middle ground?

Recently, the success of ViT has demonstrated that the Transformer architecture can be extremely powerful in data-plentiful regimes (when huge amounts of data are available). The ViT architecture requires pretraining on huge amounts of data - the JFT-300M or ImageNet-21k datasets. This is not always possible, as practitioners might not have the hardware required to perform this pretraining.
On the other hand, we know that convolutional models such as EfficientNets can achieve strong performance with less data as well. For example, EfficientNet-B7 was able to achieve 84.7% top-1 accuracy without any external pretraining.
The practitioner is therefore confronted with a dilemma between using a convolutional model, which has a higher performance floor but a lower performance ceiling, or a self-attention-based model, which has a lower performance floor but a higher ceiling.
This leads us to the question - "can one get the best of both worlds?"
In this direction, we have seen two successful approaches before ConViT -
  1. A "hybrid" model that uses convolutional layers in the earlier layers followed by self-attention. Hybrid ViT, BotNet, Visformer are all examples of this approach.
  2. Use knowledge distillation with a convolution-based model as the teacher. DeiT is an example of this approach. Architectures such as CaiT have also utilized this approach for higher ImageNet top-1 accuracy. But this approach depends on having a strong teacher model available - which might not always be the case.
This brings us to the key question - is there any middle ground between convolutions and self-attention? And the answer is yes! It's ConViT!

Key Contributions

In ConViT, the researchers take a new step towards bridging the gap between CNNs and Transformers. The key contributions are:
  1. A new form of self-attention layer named gated positional self-attention (GPSA) layer.
  2. ConViT outperforms DeiT and offers improved sample efficiency. (Figure-1 below)
  3. Researchers performed ablations to investigate the inner workings of ConViT. These answer some key questions about Transformers in vision.
Figure-1: The ConViT architecture outperforms DeiT in both sample and parameter efficiency.
Having looked at the key contributions, we will next go through the model architecture and examine the GPSA layers in detail.

ConViT Architecture

From the paper,
The ConViT is simply a ViT, where the first 10 blocks replace SA layers by a GPSA layer with convolutional initialization.
What? Really?! That makes things easy for us to understand. As long as we understand ViT and GPSA, we are good.
Figure-2: The ConViT architecture
As can be seen in Figure-2 above, the ConViT architecture uses gated positional self-attention (GPSA) layers in the earlier part of the network, followed by self-attention (SA) layers in the later part of the network.
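If you have timm installed, you can sanity-check this layout directly. The snippet below is only a quick sketch: it assumes the model exposes its transformer blocks as m.blocks and each block's attention module as blk.attn (this matches the timm/ConViT reference implementation at the time of writing, but attribute and class names may differ across versions).

import timm

# a sketch to inspect which attention type each block uses
# (assumes m.blocks / blk.attn attribute names; these may vary across timm versions)
m = timm.create_model('convit_tiny')
print([type(blk.attn).__name__ for blk in m.blocks])
# expected: GPSA for the first 10 blocks, vanilla self-attention for the remaining blocks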
For an introduction to ViT, refer here. We look at the gated positional self-attention (GPSA) layer in detail next.

Gated Positional Self-Attention (GPSA) Layer

In this section, we are going to be looking at the GPSA layer in detail.
NOTE: This section is a bit math heavy. Please feel free to reach out to me or comment at the end of this blog post if you have any questions.
From Attention Is All You Need paper, we already know that "Attention" can be mathematically represented as:
A = Softmax(\frac{QK^{T}}{\sqrt{D_h}})
Where Q and K represent the query and key matrices. Given some input X, Q=W_{qry}X and K=W_{key}X. And D_h represents the embedding dimension (in ViT, D_h is set to 768).
Therefore, in the attention mechanism, a sequence of "query" embeddings is matched against another sequence of "key" embeddings using an inner product. The result is an attention matrix that quantifies how "relevant" Q is to K!
Finally, the output Z is given by Z = AXW_{val}^{h}, where V = XW_{val}^{h} (so Z = AV).
👉: Please don't be confused by this representation of attention. It is just another way of writing Attention(Q,K,V) = Softmax(\frac{QK^T}{\sqrt{D_h}})V
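To make the notation concrete, here is a minimal single-head sketch of the attention computation in PyTorch. The projection names follow the equations above; the toy sizes and the random input are my own illustrative assumptions, and X is laid out with one patch embedding per row, so the projections are applied on the right.

import torch
import torch.nn.functional as F

num_patches, D_h = 5, 16                 # toy sizes, for illustration only
X = torch.randn(num_patches, D_h)        # one patch embedding per row

W_qry = torch.randn(D_h, D_h)            # query projection
W_key = torch.randn(D_h, D_h)            # key projection
W_val = torch.randn(D_h, D_h)            # value projection

Q, K, V = X @ W_qry, X @ W_key, X @ W_val

# A = Softmax(Q K^T / sqrt(D_h)) - one row of attention weights per query patch
A = F.softmax(Q @ K.T / D_h ** 0.5, dim=-1)

# Z = A V (equivalently A X W_val)
Z = A @ V
print(A.shape, Z.shape)                  # (5, 5) and (5, 16)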
However, as we know already, this attention mechanism does not have any positional information with it. That is, self-attention layers do not know how the patches are placed with respect to each other.
To solve this, there are two ways to incorporate positional information.
  1. Add positional embeddings to the input patches before propagating through the self-attention layers. (equation-1 from the ViT paper)
  2. Replace self-attention with positional self-attention (PSA), using embeddings r_{i,j} that encode the relative positions of patches i and j.
PSA layer can be mathematically represented as:
A^h_{i,j} = softmax(Q^h_i K_j^{hT} + v^{hT}_{pos} r_{i,j})
In the equation above, we have content and positional interactions. The first part is the content interaction whereas the second part is the positional interaction.
❓: Can you think of why the first part represents content interaction and the second part represents positional interaction? Also, how is this different from Vanilla self-attention layer?
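To see where each term enters the computation, here is a rough PSA sketch in the same toy setting as the attention snippet above. The exact parameterization of r_{i,j} in the paper is more structured than this; the random tensors and sizes below are placeholders of my own, purely for illustration.

import torch
import torch.nn.functional as F

num_patches, D_h, d_pos = 5, 16, 3                    # toy sizes, for illustration only
Q = torch.randn(num_patches, D_h)                     # queries Q_i
K = torch.randn(num_patches, D_h)                     # keys K_j
r = torch.randn(num_patches, num_patches, d_pos)      # relative embeddings r_ij (fixed in ConViT)
v_pos = torch.randn(d_pos)                            # per-head positional vector

content = Q @ K.T                                     # content interaction  Q_i K_j^T
positional = r @ v_pos                                # positional interaction  v_pos^T r_ij

# PSA: a single softmax over the *sum* of content and positional interactions
A_psa = F.softmax(content + positional, dim=-1)
print(A_psa.shape)                                    # (5, 5)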
But the authors make two changes to positional self-attention (PSA). The first is a straightforward simplification; the second is motivated by a problem with vanilla PSA:
  1. The relative embeddings r_{i,j} are fixed and not learned during training. This reduces the number of trainable parameters, since the number of relative embeddings grows quadratically with the number of patches.
  2. In vanilla PSA, a single softmax is taken over the sum of the content and positional interactions. Since these two terms can have very different magnitudes (the content interaction is typically much larger than the positional one), the softmax tends to ignore the smaller term, and this imbalance is further magnified by the softmax operation.
To avoid (2) above, the authors introduce gated positional self-attention (GPSA) layers that sum the content and positional interactions after the softmax, with their relative importance governed by a learnable gating parameter \lambda_h.
A^h_{i,j} = (1 - \sigma(\lambda_h)) softmax(Q_i^h K_j^{hT}) + \sigma(\lambda_h) softmax(v^{hT}_{pos} r_{i,j})
The output of the GPSA layer is then computed from this gated attention matrix in the same way as before.
As can be seen, the equation above translates to the GPSA layer below.
Figure-3: Gated Positional Self-Attention
👉: I recommend that the reader pause, take a short break and make sure they can see the relation between Figure-3 and the mathematical representation of the GPSA layer above.
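Here is how the gating translates into code: two separate softmaxes, blended by \sigma(\lambda_h). Again, this is only a sketch of the equation with toy tensors of my own choosing, not the actual ConViT implementation (which also handles multiple heads, the value projection and the convolutional initialization of v_pos).

import torch
import torch.nn.functional as F

num_patches, D_h, d_pos = 5, 16, 3                    # toy sizes, for illustration only
Q = torch.randn(num_patches, D_h)
K = torch.randn(num_patches, D_h)
r = torch.randn(num_patches, num_patches, d_pos)      # fixed relative embeddings r_ij
v_pos = torch.randn(d_pos)

lam = torch.tensor(1.0, requires_grad=True)           # learnable gating parameter lambda_h (toy value)
gate = torch.sigmoid(lam)                             # sigma(lambda_h), between 0 and 1

content_attn = F.softmax(Q @ K.T, dim=-1)             # softmax(Q_i K_j^T)
positional_attn = F.softmax(r @ v_pos, dim=-1)        # softmax(v_pos^T r_ij)

# GPSA: the two attention maps are combined *after* their softmaxes
A_gpsa = (1 - gate) * content_attn + gate * positional_attn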
And that's all, really! By initializing the GPSA layers in a certain way, it becomes possible for these layers to behave like convolutional layers at initialization. Therefore, we get the best of both worlds and obtain the benefits of both Transformers and CNNs in the ConViT architecture.

Summary

Let's summarize what we have learned so far. First, we saw that in data-plentiful regimes, Transformer-based architectures outperform CNNs, but this approach requires costly pretraining. CNNs perform better than Transformers in computer vision when less data is available.
This leads us to the question - "is it possible to get the best of both worlds?"
To this end, researchers from FAIR introduce the ConViT architecture and the GPSA layer, which can be initialized as a convolutional layer and, during training, lets the model decide whether these layers want to stay convolutional. We also looked at the GPSA layer in detail and understood how the mathematical representation of GPSA relates to Figure-3.
Finally, the ConViT architecture is simply a ViT, where the first 10 blocks replace SA layers with a GPSA layer with convolutional initialization.

ConViT in PyTorch

As part of this blog post, I have also contributed the ConViT architecture to one of my favorite libraries - TIMM.
import timm
import torch

m = timm.create_model('convit_tiny')
x = torch.randn(1, 3, 224, 224)
m(x).shape
>> (1, 1000)
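If you also want ImageNet-pretrained weights, timm's usual helpers apply; a quick sketch is below (the exact set of available model names depends on your timm version).

import timm

print(timm.list_models('convit*'))                      # e.g. convit_tiny, convit_small, convit_base
m = timm.create_model('convit_tiny', pretrained=True)   # downloads pretrained ImageNet weights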

Conclusion

I hope that, as part of this blog post, I have been able to introduce the reader to a new approach to using the Transformer architecture in the field of computer vision. The ConViT architecture is different from DeiT and BoTNet in the sense that it introduces a soft inductive bias into the Transformer, as opposed to a hard inductive bias.
Please feel free to reach out to me or comment below should you have any questions! Thanks for reading!