
Dos and Don'ts of Vision Transformers (ViTs)

This article covers the lessons learned regarding Vision Transformers (ViTs), including inductive bias, so you can avoid common pitfalls and learn from what works.
Created on August 9 | Last edited on January 17
It has been demonstrated time and time again that transformers can be a good backbone for multi-modal methods, in no small part because they deal with tokens (a patch of an image, a cube of a video, a spectrogram frame, words, etc.).
For a while, it was unclear why transformers weren't working directly for vision, but then the Original Vision Transformer Paper came out, and the key realization was that transfer learning from HUGE datasets (JFT-300M/2B) helps a ton.
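Concretely, "dealing with tokens" for an image just means cutting it into fixed-size patches and linearly projecting each one into an embedding. Here's a minimal sketch in PyTorch; the patch size and embedding dimension are illustrative ViT-B-style defaults, not requirements:

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Turn an image into a sequence of patch tokens.

    A Conv2d with kernel_size == stride == patch_size is equivalent to slicing
    the image into non-overlapping patches and applying a shared linear
    projection to each one.
    """
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # x: (B, 3, 224, 224)
        x = self.proj(x)                     # (B, 768, 14, 14)
        return x.flatten(2).transpose(1, 2)  # (B, 196, 768) -- one token per patch

tokens = PatchEmbed()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 196, 768])
```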
This article covers the dos and don'ts of ViTs, to help you learn from our mistakes.
Here's what we'll be covering:

Table of Contents
  • Transformers: A Simplification
  • Dos and Don'ts of ViTs
    • Optimizers
    • A Note On Inductive Biases
  • Summary

Let's jump in!

Transformers: A Simplification

Let me attempt to simplify transformers (yet again):
  • The most powerful piece is self-attention: every token attends to every other token, so each token learns about everything else, i.e. it has a full receptive field. You can also concatenate multiple sequences together and use cross-attention. This makes the architecture easily parallelizable and leads to object-centric representations, as opposed to the well-known texture-centric representations obtained from convolutional networks.
  • A stack of feedforward layers for further processing of the attention outputs
  • A sugar-coating of layer normalization and residual connections (a minimal sketch of one such block follows this list)
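To make those three ingredients concrete, here's a minimal pre-norm encoder block sketch in PyTorch; the dimensions are illustrative, and real ViT implementations add details like dropout, positional embeddings, and a class token:

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One pre-norm transformer encoder block:
    self-attention + MLP, each wrapped in LayerNorm and a residual connection."""
    def __init__(self, dim=768, num_heads=12, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, x):                # x: (B, num_tokens, dim)
        h = self.norm1(x)
        # Self-attention: every token attends to every other token (full receptive field).
        a, _ = self.attn(h, h, h)
        x = x + a                        # residual connection around attention
        x = x + self.mlp(self.norm2(x))  # residual connection around the feedforward block
        return x

x = torch.randn(2, 196, 768)             # e.g. the patch tokens from the sketch above
print(EncoderBlock()(x).shape)           # torch.Size([2, 196, 768])
```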
Given sufficient data (300 million images!), vision transformers can give good performance at reasonable scales. Eventually, convolutional variants came out that tried to tackle small-scale datasets using certain engineering tricks.
💡 To read more about how Vision Transformers can be made to work on small datasets, refer to this article.

Dos and Don'ts of ViTs

Optimizers

  • Plain stochastic gradient descent doesn't work well here: no matter how small your ViT is, it's extremely unlikely to converge to reasonable performance with vanilla SGD.
  • Adam utilizes 3M memory (the parameters plus two moment buffers, where M is the memory taken by the model parameters), so it might be a bit intensive on your compute.
  • AdaFactor utilizes roughly M + O(1) memory by factorizing the second-moment statistics, but while it is less memory intensive, it doesn't really work well for ViTs.
  • AdamW or LARS are currently the best choices for experiments at regular scales (a hedged setup sketch follows below); Shampoo has been shown to be better for extreme-scale experiments.
For a better understanding of why Shampoo works, refer to this set of reports.
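For a sense of scale on the memory numbers above: a ViT-B/16 has roughly 86M parameters, so Adam-style state in fp32 (parameters plus two moment buffers) is about 3 × 86M × 4 bytes ≈ 1 GB before counting activations. Below is a hedged PyTorch sketch of the AdamW setup described above; the learning rate, weight decay, warmup length, and schedule horizon are illustrative placeholders, not tuned recommendations:

```python
import torch

# Stand-in for any real ViT (e.g. a stack of the encoder blocks sketched earlier).
model = torch.nn.Sequential(torch.nn.Linear(768, 768))

# AdamW decouples weight decay from the adaptive gradient update, which matters for ViTs.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3,
                              betas=(0.9, 0.999), weight_decay=0.05)

# A common pairing: linear warmup followed by cosine decay.
warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=1e-2, total_iters=500)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10_000)
scheduler = torch.optim.lr_scheduler.SequentialLR(optimizer, [warmup, cosine],
                                                  milestones=[500])

for step in range(10):                                 # training-loop skeleton
    loss = model(torch.randn(8, 768)).pow(2).mean()    # dummy loss for illustration
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()
```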


A Note On Inductive Biases

Vanilla transformers do not have the inductive bias of convolutions (locality and translation equivariance), which makes them flexible and valid for complicated geometries, but it also means they must learn those priors from data. Hybrid models along the lines of DETR, which pair a convolutional backbone with a transformer, have been shown to help.
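As a rough illustration of the hybrid idea (not the exact DETR architecture: the channel widths, layer counts, and the omission of positional encodings are simplifications), a CNN backbone supplies the convolutional priors and a transformer encoder supplies the global receptive field:

```python
import torch
import torch.nn as nn
import torchvision

# CNN backbone (ResNet-50 without its pooling/classification head)
# contributes locality and translation priors.
backbone = nn.Sequential(*list(torchvision.models.resnet50(weights=None).children())[:-2])

proj = nn.Conv2d(2048, 256, kernel_size=1)   # project CNN channels to the transformer width
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True),
    num_layers=6,
)

x = torch.randn(2, 3, 224, 224)
feats = proj(backbone(x))                     # (B, 256, 7, 7)
tokens = feats.flatten(2).transpose(1, 2)     # (B, 49, 256) -- one token per spatial location
out = encoder(tokens)                         # global self-attention over CNN features
print(out.shape)                              # torch.Size([2, 49, 256])
```

Note that DETR additionally adds positional encodings to the tokens and a transformer decoder with object queries; the sketch above only shows the backbone-plus-encoder hybrid pattern.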


Summary

In this article, you saw some tips and tricks related to Vision Transformers: what works and what doesn't. This post is meant to be a regularly updated piece containing newly learned intricacies and details. If you have a suggestion to make, feel free to comment or reach out!
To see the full suite of W&B features, please check out this short 5-minute guide. If you want more reports covering the math and from-scratch code implementations, let us know in the comments below or on our forum ✨!
Check out these other reports on Fully Connected covering other fundamental development topics like GPU Utilization and Saving Models.
