DINO: Emerging Properties in Self-Supervised Vision Transformers
Breakdown of Emerging Properties in Self-Supervised Vision Transformers by Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski and Armand Joulin with Weights and Biases logging ⭐️.
Created on December 13|Last edited on February 22
🗂 Table of Contents
- 🔑 Key Takeaways
- 👋 Context / Motivation
- 🙇♂️ Method / Approach
- ✂️ Multi-Crop Strategy
- 📊 Experiments
- 1️⃣0️⃣ CIFAR-10 Experiments
- 1️⃣0️⃣0️⃣ CIFAR-100 Experiments
- 🛬 Comparing Optimizers
- 📚 References
🔑 Key Takeaways
- This study asks whether Self-Supervised Learning provides Vision Transformers[1] with new properties that stand out from convolutional networks, and underlines the importance of the momentum encoder[2], multi-crop training[3], and the use of small patches, although the emergence of segmentation masks seems to be a property shared across self-supervised methods.
- Features learned by Vision Transformers (ViT)[1] through self-supervision contain explicit information about the semantic segmentation of an image, viz. the scene layout and in particular the object boundaries, which does not emerge as clearly with supervised ViTs or with convolutional networks. This information is directly accessible in the self-attention modules of the last block.
- These features also prove to be great k-NN classifiers without any fine-tuning, linear classifier, or data augmentation. This only emerges when they are combined with other components such as the momentum encoder and multi-crop augmentation.
- The authors propose a new framework that can be interpreted as Knowledge Distillation[4] with No Labels (DINO), where the teacher network is dynamically built during training, thereby casting Knowledge Distillation as a direct self-supervised objective instead of a post-processing step.
👋 Context / Motivation
Vision Transformers[1] have been competitive but haven't delivered clear benefits over convolutional networks: they are more computationally demanding, require more training data, and their features don't exhibit unique properties. This paper asks whether the muted success of Transformers in vision can be explained by the use of supervision in their pre-training. Inspired by previous work on vision-based Self-Supervised Learning, the authors study the impact of self-supervised pre-training on ViT features.
Based on their experiments and findings, they designed a simple Self-Supervised approach that can be interpreted as a form of Knowledge Distillation[4] with No Labels (DINO). This framework simplifies Self-Supervised training by directly predicting the output of a teacher network - built with a momentum encoder[2] - by using a standard cross-entropy loss.
Interestingly, DINO works with only centering and sharpening of the teacher output to avoid collapse, whereas a predictor, advanced normalization techniques, and a contrastive loss add little benefit. Furthermore, DINO works for both ViTs and convolutional networks without any change to the architecture or the internal normalizations.
🙇♂️ Method / Approach

Figure 1: The DINO (self-DIstillation with NO labels) framework. The teacher parameters are updated with an exponential moving average (EMA) of the student parameters. Adapted from Figure 2 of the paper.
Knowledge Distillation[4] is a learning paradigm where we train a student network $g_{\theta_s}$ to match the output of a given teacher network $g_{\theta_t}$, parameterized by $\theta_s$ and $\theta_t$ respectively. The basic notion is to train a smaller network to mimic the output of a larger network in order to compress models, the key idea being that the soft probabilities output by a teacher network contain much more information about the class labels than the hard labels alone. More concretely, given an input image $x$ the teacher network produces a vector of scores $z_t$ which is then converted into ("soft") probabilities $\sigma(z_t / \tau)$. These probabilities are softened using temperature scaling (discussed below), and the loss that the student trains on is a linear combination of the cross-entropy loss with the true labels and a Knowledge Distillation loss, viz.

$\mathcal{L} = (1 - \alpha)\, H\big(y, \sigma(z_s)\big) + \alpha\, \tau^2\, H\big(\sigma(z_t / \tau), \sigma(z_s / \tau)\big)$
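The classical distillation loss above can be sketched in a few lines of NumPy. This is a minimal illustration of temperature-scaled distillation, not the DINO loss itself (DINO drops the hard-label term entirely); the function names and default values are our own.

```python
import numpy as np

def softmax(logits, tau=1.0):
    """Softmax of logits / tau; higher tau gives softer probabilities."""
    z = np.asarray(logits, dtype=float) / tau
    z = z - z.max()                      # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(z_student, z_teacher, y_onehot, tau=3.0, alpha=0.5):
    """(1 - alpha) * CE with hard labels + alpha * tau^2 * CE with soft teacher targets."""
    p_s = softmax(z_student)                       # student probs at tau = 1
    ce_hard = -(np.asarray(y_onehot) * np.log(p_s)).sum()
    p_s_tau = softmax(z_student, tau)              # softened student probs
    p_t_tau = softmax(z_teacher, tau)              # softened teacher targets
    ce_soft = -(p_t_tau * np.log(p_s_tau)).sum()
    return (1 - alpha) * ce_hard + alpha * (tau ** 2) * ce_soft
```

A student whose logits already match the teacher's incurs a lower loss than one that disagrees, which is the gradient signal distillation relies on.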
For a quick review of Knowledge Distillation please refer to "On the Efficacy of Knowledge Distillation" by Jang Hyun Cho and Bharath Hariharan [ICCV 2019]
💡
While previous works rely on a pre-trained, fixed teacher, in this method the teacher is dynamically built during training. Thus, instead of being used as a post-processing step, Knowledge Distillation is directly cast as a self-supervised objective. The authors show through experimentation that freezing the teacher network over an epoch works well, whereas simply copying the student weights to the teacher fails to converge. The best strategy turned out to be an exponential moving average (EMA), i.e. a momentum encoder, over the student weights (illustrated in Figure 1 👆🏻). The update rule used is:

$\theta_t \leftarrow \lambda \theta_t + (1 - \lambda)\, \theta_s$
where $\lambda$ follows a cosine schedule from 0.996 to 1 during training.
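The EMA update and its cosine schedule can be sketched as below. The schedule shape (half-cosine from the base value to the final value) is a common convention; the function names and the flat-list parameter representation are simplifications of what a real PyTorch training loop would do over `state_dict` tensors.

```python
import math

def cosine_momentum(step, total_steps, base=0.996, final=1.0):
    """EMA coefficient lambda on a half-cosine schedule from `base` to `final`."""
    cos = math.cos(math.pi * step / total_steps)
    return final - (final - base) * (1 + cos) / 2

def ema_update(theta_t, theta_s, lam):
    """theta_t <- lam * theta_t + (1 - lam) * theta_s, element-wise."""
    return [lam * t + (1 - lam) * s for t, s in zip(theta_t, theta_s)]
```

As $\lambda \to 1$ late in training, the teacher changes ever more slowly, effectively becoming an ensemble of past student checkpoints.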
Through experimentation the authors find that, instead of batch normalization or a predictor, DINO only needs centering and sharpening of the momentum-built teacher's output to avoid model collapse. Centering prevents one dimension from dominating but encourages collapse to the uniform distribution, whereas sharpening has the opposite effect; applied together, they cancel each other out and avoid collapse. The centering operation (which takes place after the view is fed through the teacher network) can be interpreted as adding a bias term $c$ to the teacher, i.e. $g_t(x) \leftarrow g_t(x) + c$. The center $c$ is updated with an exponential moving average (EMA) using the following rule:

$c \leftarrow m c + (1 - m) \frac{1}{B} \sum_{i=1}^{B} g_{\theta_t}(x_i)$
where $m$ is a rate parameter and $B$ is the batch size.
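The centering update is just another EMA, this time over the batch mean of the teacher's outputs rather than over weights. A minimal sketch (function name and default rate are our own):

```python
import numpy as np

def update_center(center, teacher_batch_outputs, m=0.9):
    """c <- m * c + (1 - m) * mean over the batch of teacher outputs."""
    batch_mean = np.mean(teacher_batch_outputs, axis=0)  # shape (K,)
    return m * center + (1 - m) * batch_mean
```

Subtracting this running center from the teacher logits before the softmax keeps any single output dimension from dominating across the batch.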
Given an input image $x$, both networks output probability distributions over $K$ dimensions, denoted by $P_s$ and $P_t$. These probabilities are obtained by normalizing the output of the network with a softmax function:

$P_s(x)^{(i)} = \frac{\exp\big(g_{\theta_s}(x)^{(i)} / \tau_s\big)}{\sum_{k=1}^{K} \exp\big(g_{\theta_s}(x)^{(k)} / \tau_s\big)}$

with $\tau_s > 0$ a temperature parameter that controls the sharpness of the output distribution; an analogous formula holds for $P_t$ with temperature $\tau_t$.
Given a fixed teacher network $g_{\theta_t}$, we learn to match these distributions by minimizing the cross-entropy loss w.r.t. the parameters $\theta_s$ of the student network:

$\min_{\theta_s} H\big(P_t(x), P_s(x)\big), \quad \text{where } H(a, b) = -a \log b$
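The tempered softmax and the cross-entropy it feeds can be sketched together. A lower teacher temperature sharpens $P_t$, which is exactly the "sharpening" half of the collapse-avoidance pair discussed above; the specific temperature values in the usage note are illustrative, not the paper's exact settings.

```python
import numpy as np

def tempered_softmax(logits, tau):
    """Softmax over logits / tau; smaller tau gives a sharper distribution."""
    z = np.asarray(logits, dtype=float) / tau
    z = z - z.max()                      # numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(p_t, p_s, eps=1e-12):
    """H(a, b) = -sum a * log b; minimized w.r.t. the student distribution p_s."""
    return -(np.asarray(p_t) * np.log(np.asarray(p_s) + eps)).sum()
```

For example, `tempered_softmax(logits, 0.04)` (a sharp teacher) concentrates far more mass on the argmax than `tempered_softmax(logits, 1.0)`, and the cross-entropy is smallest when the student distribution matches the teacher's.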
Different distorted views, or crops, of an image are generated using the multi-crop strategy[3]. What's that, you ask? Good question; let's dig into it.
✂️ Multi-Crop Strategy
Comparing random crops of an image plays a central role by capturing information about the relations between parts of a scene or an object. Unfortunately, increasing the number of crops (or views) quadratically increases the memory and compute requirements. Thus, in the "multi-crop strategy"[3], from a given image $x$ a set $V$ of different views of the image is generated. This set contains two "global" views $x_1^g$ and $x_2^g$ (standard-resolution crops) and several "local" (low-resolution) views. Using low-resolution crops ensures only a small increase in compute cost.
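A stripped-down sketch of view generation, assuming plain square crops from a NumPy image array. The real implementation uses `torchvision`'s `RandomResizedCrop` with distinct scale ranges for global and local views plus color and blur augmentations; the crop counts and sizes below (2 global at 224, 6 local at 96) follow the paper's typical configuration but the helper names are our own.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_crop(img, size):
    """Take a random size x size crop from an H x W x C image array."""
    h, w = img.shape[:2]
    top = int(rng.integers(0, h - size + 1))
    left = int(rng.integers(0, w - size + 1))
    return img[top:top + size, left:left + size]

def multicrop_views(img, n_local=6, global_size=224, local_size=96):
    """Two "global" views plus several low-resolution "local" views."""
    global_views = [random_crop(img, global_size) for _ in range(2)]
    local_views = [random_crop(img, local_size) for _ in range(n_local)]
    return global_views, local_views
```

Because each local view has roughly a fifth of the global side length, adding six of them costs far less than adding six more global views.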
All the crops are passed through the student, whereas only the global views are passed through the teacher, encouraging "local-to-global" correspondences. The following loss is minimized:

$\min_{\theta_s} \sum_{x \in \{x_1^g, x_2^g\}} \; \sum_{\substack{x' \in V \\ x' \neq x}} H\big(P_t(x), P_s(x')\big)$
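The double sum above can be written out directly: iterate over the teacher's two global views and pair each with every other view seen by the student. This sketch averages the terms for readability (the equation sums them); the dictionary representation keyed by view name is our own convenience.

```python
import numpy as np

def H(p_t, p_s, eps=1e-12):
    """Cross-entropy H(a, b) = -sum a * log b."""
    return -(np.asarray(p_t) * np.log(np.asarray(p_s) + eps)).sum()

def multicrop_loss(teacher_probs, student_probs):
    """Average H(P_t(x), P_s(x')) over global views x and all other views x'.

    teacher_probs: {view_name: probs} for the two global views only.
    student_probs: {view_name: probs} for every view in V.
    """
    total, n_terms = 0.0, 0
    for g_name, p_t in teacher_probs.items():
        for v_name, p_s in student_probs.items():
            if v_name == g_name:         # skip the x' == x term
                continue
            total += H(p_t, p_s)
            n_terms += 1
    return total / n_terms
```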
NOTE: Both networks share the same architecture with different sets of parameters $\theta_s$ and $\theta_t$. The student parameters $\theta_s$ are learned through Stochastic Gradient Descent by minimizing the above equation, while the teacher parameters $\theta_t$ are updated with the EMA rule described earlier.
💡
It has been known that neural networks can capture apparent visual similarity among various categories without being explicitly told to do so. This notion was extended into the instance classification paradigm, where each image is considered its own class and the model is trained to discriminate between them. As one can imagine, for huge datasets such as JFT-300M we're essentially looking at 300M classes, so this method does not scale. Noise Contrastive Estimation (NCE)[5] was therefore proposed to compare instances instead of classifying them. But even this method has its caveats, such as the need for efficient batching and the availability of a large number of images for comparison.
More recent works have shown that it is possible to learn unsupervised features without discriminating between images. The paper also cites the metric-learning formulation BYOL[6], in which features are trained by matching them to representations obtained with a momentum encoder, maximizing the agreement between two views. DINO, inspired by BYOL, operates with a different similarity-matching loss and uses the exact same architecture for the student and teacher networks. Interestingly, BYOL works even without a momentum encoder, but at the price of a drop in performance.
📊 Experiments
For the purposes of this report, CIFAR-10 and CIFAR-100 were used to train the various model architectures (Vision Transformer and ResNet variants) for multi-class image classification. The code used to train the models can be found here.
There is also a Pull Request open to the official DINO repository which proposes to add Weights and Biases Logging.
1️⃣0️⃣ CIFAR-10 Experiments
👁 Vision Transformers
🔁 ResNets
1️⃣0️⃣0️⃣ CIFAR-100 Experiments
👁 Vision Transformers
🔁 ResNets
🛬 Comparing Optimizers
📚 References