
Self Supervised Learning

Self-supervised learning is broken into two parts:
1) First, learning useful representations from a pool of unlabelled data.
2) Second, fine-tuning those representations using labelled data.
Self-supervised learning is a machine learning process in which the model trains itself to predict one part of the input from another part of the input. It is also known as predictive or pretext learning. The unsupervised problem is transformed into a supervised one by auto-generating labels: the model learns to identify a hidden part of the input from the visible part.
For example, in natural language processing, given a few words of a sentence, self-supervised learning can be used to predict the rest of the sentence. Similarly, in a video, we can predict past or future frames from the frames that are available.
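As a toy illustration of auto-generating labels (the function below is hypothetical, not from any library): hide one word of a sentence and treat the hidden word as the label.

```python
# A minimal sketch of auto-generating labels from unlabelled text,
# using a toy word-level masking pretext task (names are hypothetical).
import random

def make_pretext_pair(sentence: str, mask_token: str = "<mask>"):
    """Hide one word of the input; the hidden word becomes the label."""
    words = sentence.split()
    idx = random.randrange(len(words))
    target = words[idx]            # auto-generated label
    masked = words.copy()
    masked[idx] = mask_token       # hidden part of the input
    return " ".join(masked), target

inputs, label = make_pretext_pair("self supervised learning needs no labels")
print(inputs, "->", label)
```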

Contrastive Learning

Contrastive learning states that for any positive pair x1 and x2, the respective encoder outputs f(x1) and f(x2) should be similar to each other, while for a negative input x3, both f(x1) and f(x2) should be dissimilar to f(x3).
The positive pair could be two crops of the same image (say, top-left and bottom-right), two frames of the same video, two augmented views (a horizontally flipped version, for instance) of the same image, etc.; the respective negatives could be a crop from a different image, a frame from a different video, an augmented view of a different image, etc.
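A minimal sketch of this constraint in PyTorch, assuming f(x1), f(x2), f(x3) are already-computed embeddings; the hinge-style formulation and the margin value are illustrative choices, not a specific paper's loss:

```python
# A hinge-style contrastive loss: pull f(x1) toward f(x2),
# push f(x3) away from f(x1) once it is more similar than a margin.
import torch
import torch.nn.functional as F

def contrastive_loss(f_x1, f_x2, f_x3, margin: float = 0.5):
    pos = 1 - F.cosine_similarity(f_x1, f_x2, dim=-1)              # want ~0
    neg = F.relu(F.cosine_similarity(f_x1, f_x3, dim=-1) - margin)  # want ~0
    return (pos + neg).mean()

f_x1, f_x2, f_x3 = (torch.randn(8, 128) for _ in range(3))
print(contrastive_loss(f_x1, f_x2, f_x3))
```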

Contrastive Predictive Coding (CPC)

The whole image is divided into a coarse grid, and given the upper few rows of the grid, the task is to predict the lower rows of the same image.
The resulting encoder is evaluated with the linear evaluation protocol: a linear classifier is trained on top of the output of the frozen encoder (g_enc) using the ImageNet dataset, and the classification accuracy of that classifier is then measured on the ImageNet validation/test set. Note that throughout the training of the linear classifier, the backbone (g_enc) is fixed and is not trained at all.
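A rough sketch of the linear evaluation protocol in PyTorch; the dummy encoder and random batch below are stand-ins for a real pretrained backbone and an ImageNet loader:

```python
# Linear evaluation: freeze the encoder, train only a linear classifier.
import torch
import torch.nn as nn

# Stand-in for a pretrained backbone; in practice g_enc is e.g. a ResNet-50.
g_enc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 2048))
for p in g_enc.parameters():
    p.requires_grad = False          # backbone stays fixed
g_enc.eval()

linear = nn.Linear(2048, 1000)       # the only trainable part
opt = torch.optim.SGD(linear.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

# One training step on a dummy batch (replace with a real data loader).
x, y = torch.randn(16, 3, 32, 32), torch.randint(0, 1000, (16,))
with torch.no_grad():                # no gradients through the encoder
    feats = g_enc(x)
loss = loss_fn(linear(feats), y)
opt.zero_grad()
loss.backward()
opt.step()
```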
The idea of image-crop discrimination was later extended to instance discrimination, which tightened the gap between self-supervised and supervised learning methods.


Instance Discrimination Method

The instance discrimination method requires that two augmented versions of the same image (a positive pair) have similar representations, while two augmented versions of different images (a negative pair) have different representations.
The augmentations can include a horizontal flip, a random crop of a certain size, color-channel distortion, a Gaussian blur, etc.
Two papers, MoCo and SimCLR, built on the idea of instance discrimination. Although the augmentations change the input image, they do not change its class (a cat is still a cat after flipping and cropping), so the representations should not change either.
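For instance, a typical two-view augmentation pipeline might look like the following torchvision sketch (the parameter values are illustrative, not taken from either paper):

```python
# Two augmented views of the same image form a positive pair.
from PIL import Image
import torchvision.transforms as T

augment = T.Compose([
    T.RandomResizedCrop(224),                # random crop of a certain size
    T.RandomHorizontalFlip(),                # horizontal flip
    T.ColorJitter(0.4, 0.4, 0.4, 0.1),       # color-channel distortion
    T.GaussianBlur(kernel_size=23),          # gaussian blur
    T.ToTensor(),
])

img = Image.new("RGB", (256, 256))           # stand-in for a real image
view1, view2 = augment(img), augment(img)    # same class, different views
```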

Examples of Contrastive Learning

#### SimCLR
It considers all the other images in the current batch as negative samples. SimCLR representations achieve a top-1 accuracy of 69.3% on ImageNet under the linear evaluation protocol.
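Below is a minimal sketch (PyTorch, not the official implementation) of how in-batch negatives turn into an NT-Xent-style loss; the tensor shapes and temperature are illustrative assumptions:

```python
# NT-Xent with in-batch negatives: for each embedding, the other view of
# the same image is the positive; all other rows in the batch are negatives.
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau: float = 0.5):
    """z1, z2: [N, D] embeddings of two views of the same N images."""
    z = F.normalize(torch.cat([z1, z2]), dim=1)   # [2N, D]
    sim = z @ z.t() / tau                         # pairwise similarities
    sim.fill_diagonal_(float("-inf"))             # exclude self-similarity
    n = z1.size(0)
    # Row i's positive sits n rows away (the other view of image i).
    targets = torch.arange(2 * n).roll(n)
    return F.cross_entropy(sim, targets)

z1, z2 = torch.randn(8, 128), torch.randn(8, 128)
print(nt_xent(z1, z2))
```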
#### MoCo (Momentum Contrast)
It keeps a separate queue of negatives (up to 65,536 past keys in the paper) and uses them to compute the InfoNCE loss (a noise-contrastive-estimation loss commonly used in self-supervised learning). This allows MoCo to be trained with smaller batch sizes without compromising accuracy.
The representations attain 71.1% top-1 accuracy on ImageNet under the linear evaluation protocol.
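A minimal sketch of InfoNCE computed against a queue of negatives, in the spirit of the MoCo pseudocode; the queue size, temperature, and embedding dimension below are illustrative assumptions:

```python
# InfoNCE with a queue: the positive logit comes from the matching key,
# the negative logits from a buffer of past keys.
import torch
import torch.nn.functional as F

def info_nce(q, k, queue, tau: float = 0.07):
    """q, k: [N, D] query/key embeddings; queue: [K, D] negatives."""
    l_pos = (q * k).sum(dim=1, keepdim=True)           # [N, 1]
    l_neg = q @ queue.t()                              # [N, K]
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    labels = torch.zeros(q.size(0), dtype=torch.long)  # positive at index 0
    return F.cross_entropy(logits, labels)

q = F.normalize(torch.randn(16, 128), dim=1)
k = F.normalize(torch.randn(16, 128), dim=1)
queue = F.normalize(torch.randn(8192, 128), dim=1)
print(info_nce(q, k, queue))
```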
#### BYOL (Bootstrap Your Own Latent)
It is based on the instance discrimination method, and it showed that using two networks (similar in spirit to MoCo's pair of encoders), good visual representations can be learnt even without negatives.
This method achieves 74.3% top-1 classification accuracy on ImageNet under the linear evaluation protocol.
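A minimal sketch of the ingredient that makes this work without negatives: the second (target) network is an exponential moving average of the online network and receives no gradients. The stand-in module and momentum value below are illustrative assumptions:

```python
# BYOL-style target network: an EMA copy of the online network.
import copy
import torch
import torch.nn as nn

online = nn.Linear(128, 64)        # stand-in for the online encoder
target = copy.deepcopy(online)     # target starts as an exact copy
for p in target.parameters():
    p.requires_grad = False        # target is never backpropped

@torch.no_grad()
def ema_update(online, target, m: float = 0.996):
    for po, pt in zip(online.parameters(), target.parameters()):
        pt.mul_(m).add_(po, alpha=1 - m)   # pt = m * pt + (1 - m) * po

ema_update(online, target)
```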

Drawbacks

Although this representation-learning approach outperforms other unsupervised methods, it still lags behind fully supervised training in classification accuracy.
