ConViT: Paper Reading Group

The paper reading groups are supported by experiments, blogs, and code implementations! This is your chance to come and talk about the paper that interests you!
Aman Arora
After having already discussed ViT and the ResNet-RS architectures, Aman Arora from Weights & Biases is discussing the ConViT architecture!
Rewatch the recording and find all questions & answers at the bottom of the report.
We are also thrilled to announce that we will be joined by one of the authors, Stéphane d'Ascoli from Facebook AI Research (FAIR Paris)!

ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases

Blog | Paper | GitHub | Model Checkpoints


As Cordonnier, Loukas and Jaggi have shown in their work, On the Relationship between Self-Attention and Convolutional Layers, a self-attention layer with N_h heads can express a convolution of kernel size \sqrt{N_h} \times \sqrt{N_h} if each head focuses on one of the pixels in the kernel patch. While this is theoretical evidence that self-attention can completely replace convolutions, in practice the best results in Visformer, BoTNet, and Stand-Alone Self-Attention in Vision Models were reported when both convolutional and self-attention layers were used. A common theme across these works is to use convolutional layers in the early parts of the network and self-attention layers in the later parts.
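The claim that N_h heads can express a \sqrt{N_h} \times \sqrt{N_h} convolution can be checked numerically on a small example. The sketch below is an illustration, not code from either paper: it assumes hard (one-hot) attention in place of a softmax, zero padding at the image borders, and the helper names `conv2d_reference` and `attention_as_conv` are hypothetical. With 9 heads, each head attends to exactly one offset of a 3x3 kernel patch, and summing the heads reproduces the convolution.

```python
import numpy as np

def conv2d_reference(x, kernel):
    """Zero-padded 3x3 convolution (cross-correlation form).

    x: (H, W, C_in), kernel: (k, k, C_in, C_out) with k odd.
    """
    H, W, _ = x.shape
    k = kernel.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    out = np.zeros((H, W, kernel.shape[-1]))
    for dy in range(k):
        for dx in range(k):
            # Shifted view of the image times the weights for this offset.
            out += xp[dy:dy + H, dx:dx + W] @ kernel[dy, dx]
    return out

def attention_as_conv(x, kernel):
    """Same computation with one hard-attention head per kernel offset."""
    H, W, C_in = x.shape
    k = kernel.shape[0]
    pad = k // 2
    tokens = x.reshape(H * W, C_in)  # one token per pixel
    out = np.zeros((H * W, kernel.shape[-1]))
    for dy in range(k):
        for dx in range(k):
            # Head (dy, dx): the query at pixel (i, j) puts all of its
            # attention on pixel (i + dy - pad, j + dx - pad).
            attn = np.zeros((H * W, H * W))
            for i in range(H):
                for j in range(W):
                    ii, jj = i + dy - pad, j + dx - pad
                    if 0 <= ii < H and 0 <= jj < W:
                        attn[i * W + j, ii * W + jj] = 1.0
            # The head's value/output projection plays the role of the
            # kernel weights at this offset.
            out += attn @ tokens @ kernel[dy, dx]
    return out.reshape(H, W, -1)

rng = np.random.default_rng(0)
image = rng.normal(size=(5, 6, 3))
weights = rng.normal(size=(3, 3, 3, 4))
print(np.allclose(conv2d_reference(image, weights),
                  attention_as_conv(image, weights)))
```

Border queries whose target pixel falls outside the image get an all-zero attention row here, which mimics zero padding; a real softmax head would always produce rows that sum to one.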
The ConViT paper builds on this insight and replaces the first 10 self-attention layers of the Vision Transformer with gated positional self-attention (GPSA) layers. At initialization these layers act as convolutional layers, and a gating parameter allows each of them to convert to a standard self-attention layer.
As a result, the early part of the network behaves like a convolutional neural network at initialization, with the option of turning into a fully self-attention-based network depending on the gating parameter, which is learned during training.
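To make the role of the gating parameter concrete, here is a minimal sketch of how a single GPSA head could blend the two kinds of attention. It assumes the gate is a sigmoid of a learned scalar and that the content and positional scores are each passed through their own softmax before mixing; the function name `gpsa_attention` and its arguments are hypothetical, and the positional scores are random placeholders rather than the locality-based initialization a real layer would use.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gpsa_attention(content_scores, position_scores, gate):
    """Blend content and positional attention for one head.

    content_scores:  (N, N) query-key scores (standard self-attention).
    position_scores: (N, N) scores that depend only on relative position.
    gate:            learned scalar; sigmoid(gate) is the weight given
                     to the positional (convolution-like) attention.
    """
    g = sigmoid(gate)
    return (1.0 - g) * softmax(content_scores) + g * softmax(position_scores)

rng = np.random.default_rng(0)
content = rng.normal(size=(4, 4))    # placeholder content scores
position = rng.normal(size=(4, 4))   # placeholder positional scores

mostly_positional = gpsa_attention(content, position, gate=5.0)
mostly_content = gpsa_attention(content, position, gate=-5.0)
print(mostly_positional.sum(axis=-1))  # each row is still a distribution
```

Because both terms are row-normalized and the weights sum to one, the blended map remains a valid attention distribution for any gate value. A large positive gate keeps the head in its convolution-like regime, while a large negative gate recovers ordinary content-based self-attention.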
We will continue with:

Register here for June 15, 9am PT / 6pm CET / 9:30pm IST