
Dos and Don'ts of Vision Transformers (ViTs)

This article covers the lessons learned regarding Vision Transformers (ViTs), including inductive bias, so you can avoid common pitfalls and learn from what works.
Created on August 9 | Last edited on January 17
It has been demonstrated time and time again that transformers can be a good backbone for multi-modal methods, in no small part because they deal with tokens (a patch of an image, a cube of a video, a spectrogram frame, words, etc.).
For a while, it was unclear why transformers weren't working directly for vision, but then the Original Vision Transformer Paper came out, and the key realization was that transfer learning from HUGE datasets (JFT-300M/2B) helps a ton.
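Concretely, "dealing with tokens" for an image just means cutting it into fixed-size patches and linearly projecting each one into an embedding. Here's a minimal sketch in PyTorch; the patch size and embedding dimension are illustrative ViT-B-style defaults, not requirements:

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Turn an image into a sequence of patch tokens.

    A Conv2d with kernel_size == stride == patch_size is equivalent to slicing
    the image into non-overlapping patches and applying a shared linear
    projection to each one.
    """
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # x: (B, 3, 224, 224)
        x = self.proj(x)                     # (B, 768, 14, 14)
        return x.flatten(2).transpose(1, 2)  # (B, 196, 768) -- one token per patch

tokens = PatchEmbed()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 196, 768])
```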
This article covers the dos and don'ts of ViTs, to help you learn from our mistakes.
Here's what we'll be covering:

Table of Contents
  • Transformers: A Simplification
  • Dos and Don'ts of ViTs
    • Optimizers
    • A Note On Inductive Biases
  • Summary

Let's jump in!

Transformers: A Simplification

Let me attempt to simplify transformers (yet again):
  • The most powerful piece is self-attention: every token attends to every other token, so each token learns about everything else, i.e. it has a full receptive field. You can also concatenate multiple sequences together and use cross-attention. This makes the architecture easily parallelizable and leads to object-centric representations, as opposed to the well-known texture-centric representations obtained from convolutional networks.
  • A stack of feedforward layers for further processing of the attention outputs
  • A sugar-coating of layer normalization and residual connections (a minimal sketch of one such block follows this list)
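To make those three ingredients concrete, here's a minimal pre-norm encoder block sketch in PyTorch; the dimensions are illustrative, and real ViT implementations add details like dropout, positional embeddings, and a class token:

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One pre-norm transformer encoder block:
    self-attention + MLP, each wrapped in LayerNorm and a residual connection."""
    def __init__(self, dim=768, num_heads=12, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, x):                # x: (B, num_tokens, dim)
        h = self.norm1(x)
        # Self-attention: every token attends to every other token (full receptive field).
        a, _ = self.attn(h, h, h)
        x = x + a                        # residual connection around attention
        x = x + self.mlp(self.norm2(x))  # residual connection around the feedforward block
        return x

x = torch.randn(2, 196, 768)             # e.g. the patch tokens from the sketch above
print(EncoderBlock()(x).shape)           # torch.Size([2, 196, 768])
```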
Given sufficient data (300 million images!), vision transformers can give good performance at reasonable scales. Eventually, convolutional variants came out that tried to tackle small-scale datasets using certain engineering tricks.
💡 To read more about how Vision Transformers can be made to work on small datasets, refer to this article.

Dos and Don'ts of ViTs

Optimizers

  • Plain stochastic gradient descent doesn't work well here: no matter how small your ViT is, it's extremely unlikely to converge to reasonable performance with vanilla SGD.
  • Adam utilizes 3M memory (the parameters plus two moment buffers, where M is the memory taken by the model parameters), so it might be a bit intensive on your compute.
  • AdaFactor utilizes roughly M + O(1) memory by factorizing the second-moment statistics, but while it is less memory intensive, it doesn't really work well for ViTs.
  • AdamW or LARS are currently the best choices for experiments at regular scales (a hedged setup sketch follows below); Shampoo has been shown to be better for extreme-scale experiments.
For a better understanding of why Shampoo works, refer to this set of reports.
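For a sense of scale on the memory numbers above: a ViT-B/16 has roughly 86M parameters, so Adam-style state in fp32 (parameters plus two moment buffers) is about 3 × 86M × 4 bytes ≈ 1 GB before counting activations. Below is a hedged PyTorch sketch of the AdamW setup described above; the learning rate, weight decay, warmup length, and schedule horizon are illustrative placeholders, not tuned recommendations:

```python
import torch

# Stand-in for any real ViT (e.g. a stack of the encoder blocks sketched earlier).
model = torch.nn.Sequential(torch.nn.Linear(768, 768))

# AdamW decouples weight decay from the adaptive gradient update, which matters for ViTs.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3,
                              betas=(0.9, 0.999), weight_decay=0.05)

# A common pairing: linear warmup followed by cosine decay.
warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=1e-2, total_iters=500)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10_000)
scheduler = torch.optim.lr_scheduler.SequentialLR(optimizer, [warmup, cosine],
                                                  milestones=[500])

for step in range(10):                                 # training-loop skeleton
    loss = model(torch.randn(8, 768)).pow(2).mean()    # dummy loss for illustration
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()
```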


A Note On Inductive Biases

Vanilla transformers do not have the inductive bias of convolutions (locality and translation equivariance), which makes them flexible and valid for complicated geometries, but it also means they must learn those priors from data. Hybrid models along the lines of DETR, which pair a convolutional backbone with a transformer, have been shown to help.
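As a rough illustration of the hybrid idea (not the exact DETR architecture: the channel widths, layer counts, and the omission of positional encodings are simplifications), a CNN backbone supplies the convolutional priors and a transformer encoder supplies the global receptive field:

```python
import torch
import torch.nn as nn
import torchvision

# CNN backbone (ResNet-50 without its pooling/classification head)
# contributes locality and translation priors.
backbone = nn.Sequential(*list(torchvision.models.resnet50(weights=None).children())[:-2])

proj = nn.Conv2d(2048, 256, kernel_size=1)   # project CNN channels to the transformer width
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True),
    num_layers=6,
)

x = torch.randn(2, 3, 224, 224)
feats = proj(backbone(x))                     # (B, 256, 7, 7)
tokens = feats.flatten(2).transpose(1, 2)     # (B, 49, 256) -- one token per spatial location
out = encoder(tokens)                         # global self-attention over CNN features
print(out.shape)                              # torch.Size([2, 49, 256])
```

Note that DETR additionally adds positional encodings to the tokens and a transformer decoder with object queries; the sketch above only shows the backbone-plus-encoder hybrid pattern.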


Summary

In this article, you saw some tips and tricks related to Vision Transformers: what works and what doesn't. This post is meant to be a regularly updated piece containing newly learned intricacies and details. If you have a suggestion to make, feel free to comment or reach out!
To see the full suite of W&B features, please check out this short 5-minute guide. If you want more reports covering the math and from-scratch code implementations, let us know in the comments below or on our forum ✨!
Check out these other reports on Fully Connected covering other fundamental development topics like GPU Utilization and Saving Models.
