
Self-Supervised Learning: An Introduction

A brief introduction to Self-Supervised Learning, the first in an upcoming series of reports covering the topic.
Created on November 30|Last edited on December 14

🧐 What is Self-Supervised Learning?

Instead of trying to produce a programme to simulate the adult mind, why not rather try to produce one which simulates the child's? -- Alan M. Turing, Computing Machinery and Intelligence, 1950
Can you recall exactly when you developed common sense? Were you really born with it? Or is common sense an intuition we develop as we grow?
But wait, this blog is supposed to be about Self-Supervised Learning! Why all these questions?
The essence of intelligence is the ability to predict! Yann LeCun[1] talks about how we don't need to actually drive off a mountain road to know that it would be fatal. This knowledge, this ability to predict, is probably what makes us 'intelligent'.
A cliché meme on Self-Supervised Learning.
To answer the question of what Self-Supervised Learning (SSL) is: one can view it as a form of unsupervised learning where the supervision is provided by the data itself [2]. Instead of labelling the data through human intervention, the problem is framed as deriving relationships between data samples. In Self-Supervised Learning, we define a pretext task that learns to predict masked data or its features. The learned representations are then used to solve the real, or downstream, task by fine-tuning the model with a small number of labels.
Figure: In this example, Self Supervised Learning is utilized to fill the blanks in the data. This is done by making a part of the data masked while the rest is made available to the model. Source: Yann LeCun's Lecture[1]
As a pipeline, SSL can be defined as:
  • Training a model on a pretext task, without annotations.
  • Fine-tuning the model for a specific task with the help of labels.
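The two steps above can be sketched in a few lines of NumPy. This is only a toy illustration: the masked-feature pretext task, the synthetic data, and the use of least squares in place of gradient-based encoder training are all illustrative assumptions, not how real SSL systems are implemented.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data in which feature 3 is predictable from features 0 and 1,
# so a masking pretext task has a real relationship to learn.
X = rng.normal(size=(200, 8))
X[:, 3] = X[:, 0] + X[:, 1] + 0.1 * rng.normal(size=200)

# Step 1: pretext task, no annotations. Hide feature 3 and learn to
# predict it from the visible features (least squares stands in for
# gradient-based training of an encoder).
visible = np.delete(X, 3, axis=1)
hidden = X[:, 3]
W, *_ = np.linalg.lstsq(visible, hidden, rcond=None)
pretext_mse = np.mean((visible @ W - hidden) ** 2)

# Step 2: downstream task, with labels. In a real pipeline the
# pretrained encoder weights would now be fine-tuned on a (small)
# labelled dataset instead of training from scratch.
print(pretext_mse < np.var(hidden))  # the pretext task learned something
```

The point of the sketch is that step 1 never touches a label: the "supervision" (the hidden feature) is manufactured from the data itself.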
In this awesome CVPR 2021 tutorial[2], the speaker draws an analogy between the movie The Karate Kid and Self-Supervised Learning: building muscle memory through daily chores is the pretext task, while fine-tuning those acquired skills to perform karate is the downstream task!
The Karate Kid - Jacket Scene

🔍 Can we classify SSL further?

Yes! Multiple classification schemes are emerging in the domain of Self-Supervised Learning. Contrastive vs. Non-Contrastive is the scheme we discuss today!
Other interesting approaches include teacher-student methods and clustering-style self-supervised learning methods.

🌇🏙 Contrastive Self-Supervised Learning

SimCLR: A Simple Framework for Contrastive Learning of Visual Representations [3] was one of the first works to popularize the use of Contrastive Learning for Self-Supervision.

🧐 What is Contrastive Learning?

As the name suggests, you learn to contrast between two given samples in an embedding space! Say you have an image, and for that image you have two samples: one similar to it and one that isn't. The similar sample is your positive sample, while the other is the negative sample. You then train the network to differentiate between them! This clip[4] from Ishan Misra is one of my favorites for understanding the concept in simple words!
The contrastive loss attempts to minimize the distance between the embeddings of the anchor and its positive sample while maximizing the distance between the anchor and the negative sample.
$$\huge L_{contrastive}(r_a, r_p, r_n) = \max(0,\; m + d(r_a, r_p) - d(r_a, r_n))$$

This contrastive loss (this blog post has a nice summary[5]) pushes the distance between the anchor and the negative sample, $d(r_a, r_n)$, to exceed the distance between the anchor and the positive sample, $d(r_a, r_p)$, by at least the margin $m$.
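The margin loss above is short enough to write down directly. A minimal NumPy sketch, where the example vectors and the choice of Euclidean distance for $d$ are illustrative assumptions:

```python
import numpy as np

def contrastive_loss(r_a, r_p, r_n, margin=1.0):
    """max(0, m + d(a, p) - d(a, n)) with Euclidean distances."""
    d_ap = np.linalg.norm(r_a - r_p)  # anchor-positive distance
    d_an = np.linalg.norm(r_a - r_n)  # anchor-negative distance
    return max(0.0, margin + d_ap - d_an)

anchor   = np.array([1.0, 0.0])
positive = np.array([0.9, 0.1])   # close to the anchor
negative = np.array([-1.0, 0.0])  # far from the anchor

# The negative already lies beyond the margin, so the loss is zero.
print(contrastive_loss(anchor, positive, negative))  # → 0.0
```

Swapping the positive and negative samples makes the loss positive, which is exactly the gradient signal that pushes mismatched pairs apart.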
The illustration below gives an overview of the SimCLR[3] framework. In this architecture, the two networks $f(\cdot)$ and $g(\cdot)$ are trained together to realize the idea of Contrastive Learning: embeddings of augmented views of the same image are pulled together, while embeddings of different images are pushed apart, irrespective of the classes they belong to.
Figure: The SimCLR Framework
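For completeness: SimCLR's actual objective is the NT-Xent (normalized temperature-scaled cross-entropy) loss rather than the margin form shown earlier. A minimal NumPy sketch, assuming a batch layout where rows 2k and 2k+1 hold the two augmented views of image k (the toy embeddings are illustrative):

```python
import numpy as np

def nt_xent(z, temperature=0.5):
    """NT-Xent loss over 2N embeddings; rows 2k and 2k+1 are the
    two augmented views of the same image."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine similarities
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)        # a view is not its own negative
    idx = np.arange(len(z))
    pos = idx ^ 1                         # partner index: 0<->1, 2<->3, ...
    log_prob = sim[idx, pos] - np.log(np.exp(sim).sum(axis=1))
    return -log_prob.mean()

aligned  = np.array([[1., 0.], [1., 0.], [0., 1.], [0., 1.]])  # views agree
shuffled = np.array([[1., 0.], [0., 1.], [1., 0.], [0., 1.]])  # views clash
```

When the two views of each image agree, the loss is lower than when the pairing is scrambled, so minimizing it drives augmented views of the same image toward the same embedding.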
Another interesting paper that uses contrastive learning between positive and negative samples in its loss function is Momentum Contrast for Unsupervised Visual Representation Learning (MoCo)[6]. Have a look at some of the other posts on Self-Supervised Learning here on Fully Connected.


🧑‍⚖️ Non Contrastive Self-Supervised Learning

The contrast between positive and negative samples is what makes up contrastive learning, and since we are talking about 'non'-contrastive learning, we are definitely dropping the negative samples!
Non-Contrastive SSL works with only positive samples. Recently, multiple methods such as BYOL[7], SimSiam[8] and DirectPred[9] have performed remarkably well while using just the positive samples.
SimSiam[8] uses a simple Siamese network and reports great results without relying on:
  1. negative sample pairs
  2. large batches
  3. momentum encoders
Figure: BYOL[7] Architecture
In BYOL[7], we have two networks with the same architecture but different parameters: a trainable 'online' network and a 'target' network, whose weights are not trained directly but are instead updated as a moving average of the online weights.
One augmented view $t$ of the input passes through the online network while another augmented view $t'$ passes through the target network, and we obtain two embeddings. The aim is to minimize the distance between these embeddings using an MSE loss.
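The two ingredients just described can be sketched in isolation: the MSE between the two (L2-normalized) embeddings, and the moving-average update that produces the target weights. The function names and toy vectors below are illustrative assumptions.

```python
import numpy as np

def byol_loss(online_pred, target_proj):
    """BYOL-style regression objective: MSE between L2-normalized
    embeddings (equivalent to 2 - 2 * cosine similarity)."""
    p = online_pred / np.linalg.norm(online_pred, axis=-1, keepdims=True)
    t = target_proj / np.linalg.norm(target_proj, axis=-1, keepdims=True)
    return float(np.mean(np.sum((p - t) ** 2, axis=-1)))

def ema_update(target_w, online_w, tau=0.99):
    """The target network is not trained by gradients; its weights
    trail the online network as an exponential moving average."""
    return tau * target_w + (1 - tau) * online_w

v = np.array([[3.0, 4.0]])
print(byol_loss(v, 2 * v))  # → 0.0, the loss only cares about direction
```

Because the embeddings are normalized first, the loss ignores scale and depends only on the angle between the two representations.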
And yes, there will be blogs coming on all these state-of-the-art methods in SSL! Stay tuned to Fully Connected.

💪🏻 Applications of SSL

📚 Self-Supervised Learning in Natural Language Processing

Self-Supervised Learning took off in NLP in 2013 with the introduction of Word2Vec[10]. Word2Vec produces word embeddings using self-supervision by predicting a word from its surrounding words, exploiting contextual knowledge. In the skip-gram formulation, for example, given the center word we compute the probability of each word appearing in its context.
The authors also released word2vec embeddings pretrained on the roughly 100-billion-word Google News dataset. Interestingly, even though the embeddings were trained without any human labelling, they performed remarkably well! Below is an illustration of vectors from the paper depicting model performance on countries and their capitals.
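The skip-gram scoring step can be sketched in a few lines of NumPy. The tiny vocabulary and the random, untrained embedding tables are illustrative assumptions; real word2vec trains these tables on billions of words.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["paris", "france", "tokyo", "japan", "capital"]
dim = 4

# word2vec keeps two embedding tables: one for center words and one
# for context words (randomly initialized here, i.e. untrained).
W_center = rng.normal(size=(len(vocab), dim))
W_context = rng.normal(size=(len(vocab), dim))

def context_probs(center_word):
    """Skip-gram: a softmax over the vocabulary scoring how likely
    each word is to appear near the given center word."""
    v = W_center[vocab.index(center_word)]
    scores = W_context @ v
    exps = np.exp(scores - scores.max())  # numerically stable softmax
    return exps / exps.sum()

probs = context_probs("paris")  # one probability per vocabulary word
```

Training nudges both tables so that words which genuinely co-occur receive high probability, which is where the famous vector arithmetic on capitals and countries comes from.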

Check out some of our other posts on Word Embeddings here:

SSL-based transformer models like BERT and RoBERTa are pretrained with masked word prediction and then fine-tuned for specific tasks. Most state-of-the-art machine translation models, like mBART and mT5, are also built on Self-Supervised Learning.
The famous GPT models use SSL to predict the next word. ALBERT adds an interesting SSL objective in which the model judges whether the order of two given sentences is correct!
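To make the masked-word-prediction idea concrete, here is a toy stand-in that fills a blank using simple co-occurrence counts. This only illustrates the pretext task itself; it is emphatically not how BERT-style transformers are implemented.

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ate the fish".split()

def predict_masked(left, right):
    """Guess a hidden word from its immediate neighbours using
    co-occurrence counts over the corpus."""
    candidates = Counter(
        corpus[i]
        for i in range(1, len(corpus) - 1)
        if corpus[i - 1] == left and corpus[i + 1] == right
    )
    return candidates.most_common(1)[0][0] if candidates else None

# "the [MASK] sat" -> fill in the most likely word
print(predict_masked("the", "sat"))  # → cat
```

The labels here are free: every word in the corpus can serve as a masked target, which is exactly why masked-language-model pretraining scales so well.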
Another interesting work introduced by Google Research is Pegasus. This awesome blog[12] from Google AI talks about Pegasus, a self-supervised learning based model for Abstractive Text Summarization.

👁 Self-Supervised Learning in Computer Vision

Similar to NLP, where the pretext task was masked word prediction or next-word prediction, in Computer Vision we can build pretext tasks from either images or videos; the goal of predicting the unknown is the same! Pretext tasks such as predicting image rotation, solving jigsaw puzzles, or ordering the frames of a video have all been used.
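A rotation pretext task is particularly easy to sketch, because the labels come for free from the augmentation itself. The batch of random "images" and the function name below are illustrative assumptions, in the spirit of RotNet-style rotation prediction.

```python
import numpy as np

rng = np.random.default_rng(0)

def rotation_pretext_batch(images):
    """Rotation pretext task: rotate each image by 0, 90, 180, or
    270 degrees; the rotation index is a label we get for free."""
    rotated, labels = [], []
    for img in images:
        k = int(rng.integers(0, 4))   # number of 90-degree turns
        rotated.append(np.rot90(img, k))
        labels.append(k)
    return np.stack(rotated), np.array(labels)

images = rng.normal(size=(8, 32, 32))   # a dummy batch of square images
x, y = rotation_pretext_batch(images)   # inputs and their free labels
```

A classifier trained to predict `y` from `x` must learn object orientation cues, and those learned features transfer to downstream vision tasks.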
In domains such as Medical Imaging, where manual labelling is a humongous task, contrastive-learning-based approaches have contributed significantly. As discussed throughout this post, multiple approaches such as SimCLR, MoCo, and SwAV have recently been devised to tackle the challenges of learning visual features.

📖 References

  1. SimCLR : A Simple Framework for Contrastive Learning of Visual Representations by Ting Chen, Simon Kornblith, Mohammad Norouzi, Geoffrey Hinton
  2. Momentum Contrast for Unsupervised Visual Representation Learning, by Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, Ross Girshick
  3. Bootstrap your own latent: A new approach to self-supervised Learning by Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H. Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Rémi Munos, Michal Valko
  4. Distributed Representations of Words and Phrases and their Compositionality by Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean