
SimCLR

A simple framework for contrastive learning of visual representations.
Created on July 28 | Last edited on August 9

References:



Paper:



Process involved:

Pretext Task:

Data Augmentation:

Augmentation applied:
  • random cropping
  • random flipping
  • color distortions
  • Gaussian blur
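
As a rough illustration of how two augmented views of one image are produced, here is a minimal sketch in plain Python. It is not the paper's pipeline: the image is a toy 8x8 grayscale grid stored as nested lists, color distortion is approximated by a single brightness factor, Gaussian blur is left out for brevity, and all function names and the `strength` parameter are hypothetical.

```python
import random

def random_crop(img, size, rng):
    """Cut a size x size window at a random position (img is a list of rows)."""
    h, w = len(img), len(img[0])
    top = rng.randint(0, h - size)
    left = rng.randint(0, w - size)
    return [row[left:left + size] for row in img[top:top + size]]

def random_flip(img, rng, p=0.5):
    """Mirror the image horizontally with probability p."""
    return [row[::-1] for row in img] if rng.random() < p else img

def brightness_jitter(img, rng, strength=0.4):
    """Crude stand-in for color distortion: scale all pixels by one random factor."""
    factor = 1.0 + rng.uniform(-strength, strength)
    return [[min(1.0, max(0.0, px * factor)) for px in row] for row in img]

def augment(img, size, rng):
    """Apply crop, then flip, then the brightness stand-in (blur omitted)."""
    return brightness_jitter(random_flip(random_crop(img, size, rng), rng), rng)

def two_views(img, size, rng):
    """Two independently augmented views of one image form a positive pair."""
    return augment(img, size, rng), augment(img, size, rng)

rng = random.Random(0)
# Toy 8x8 grayscale image with pixel values in [0, 1].
image = [[(r * 8 + c) / 63 for c in range(8)] for r in range(8)]
view_a, view_b = two_views(image, size=4, rng=rng)
```

Each call to `augment` draws its own random crop position, flip decision, and brightness factor, so `view_a` and `view_b` generally differ even though they come from the same image.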


Getting Representations [Base Encoder]:

The paper mainly uses ResNet-50 as the base encoder.



Projection head:



Training the Model [Bringing similar views closer]:

  • Calculation of Cosine Similarity
  • Loss calculation: contrastive loss (NT-Xent, the Normalized Temperature-Scaled Cross-Entropy loss)
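
The two steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the batch layout (views of the same image stored at neighbouring indices `2k` and `2k+1`), the default temperature of 0.5, and the function names are all choices made for this example.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nt_xent(z, temperature=0.5):
    """Average NT-Xent loss over a batch of 2N projections.

    Assumed layout: z[2k] and z[2k + 1] are the two views of image k.
    """
    n = len(z)
    total = 0.0
    for i in range(n):
        j = i + 1 if i % 2 == 0 else i - 1  # index of the positive partner
        # Similarities to every other sample (k != i), scaled by temperature.
        logits = [cosine_similarity(z[i], z[k]) / temperature
                  for k in range(n) if k != i]
        log_denominator = math.log(sum(math.exp(x) for x in logits))
        positive = cosine_similarity(z[i], z[j]) / temperature
        total += log_denominator - positive  # equals -log(softmax of the positive)
    return total / n

# Positive pairs point the same way: low loss.
aligned = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
# Positive pairs are orthogonal: high loss.
misaligned = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
loss_aligned = nt_xent(aligned)
loss_misaligned = nt_xent(misaligned)
```

With these toy vectors, `loss_aligned` comes out smaller than `loss_misaligned`, which is the behaviour training relies on: the loss falls as the two views of an image move closer and other images move away.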

Downstream Task:

Only the base encoder's representation is used for the downstream task; the projection head is discarded after pretraining.


Explanation through images:



This image shows contrastive self-supervised learning: augmentations of the same image (positive pairs) should have similar representations, while all other images (negative pairs) should have different representations.

Take an image of a dog as an example. Figure (a) shows a global view, which is the full image of the dog (B), together with a local view, which is a cropped part of that image (A). Figure (b) shows adjacent views: two different crops of the same image (C and D). The model needs to learn representations in which crops of the same image have similar embeddings (positive pairs).

This is the contrastive loss function. In the numerator, zi and zj are the projections of two augmentations of the same image. The denominator sums over every other sample in the batch, that is, the positive zj plus the augmented views of all the other images. The sum runs to 2N because you start with N images, and after augmenting each one twice you have twice the number of images in the batch.
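
For reference, the loss for one positive pair (i, j) can be written out as below. The notation follows the SimCLR paper; τ is the temperature, and the indicator term leaves only sample i itself out of the denominator.

```latex
\ell_{i,j} = -\log
  \frac{\exp\left(\mathrm{sim}(z_i, z_j)/\tau\right)}
       {\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\left(\mathrm{sim}(z_i, z_k)/\tau\right)},
\qquad
\mathrm{sim}(u, v) = \frac{u^{\top} v}{\lVert u \rVert \, \lVert v \rVert}
```

The batch loss is the average of this term over all positive pairs, counting both (i, j) and (j, i).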

This image compares the different types of projection head and their usefulness. A non-linear projection gives clearly better results than a linear projection or no projection at all.
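
To make the three options concrete, here is a minimal plain-Python sketch of each head applied to one encoder output `h`. It is an illustration only: the dimensions, the random weights, the omission of bias terms, and the function names are all assumptions made for this example. The non-linear variant follows the one-hidden-layer form with ReLU described in the paper.

```python
import random

def linear(x, weights):
    """Matrix-vector product: one output value per row of weights."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def relu(x):
    """Element-wise max(0, x)."""
    return [max(0.0, xi) for xi in x]

def no_head(h):
    """No projection: the contrastive loss is applied to h directly."""
    return h

def linear_head(h, w):
    """Linear projection: z = W h."""
    return linear(h, w)

def nonlinear_head(h, w1, w2):
    """Non-linear projection: z = W2 ReLU(W1 h)."""
    return linear(relu(linear(h, w1)), w2)

def random_matrix(rows, cols, rng):
    """Hypothetical helper: Gaussian-initialised weights for the demo."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
h = [0.5, -1.0, 2.0, 0.1]        # toy 4-dimensional encoder output
w = random_matrix(2, 4, rng)     # projects 4 -> 2
w1 = random_matrix(4, 4, rng)    # hidden layer, 4 -> 4
w2 = random_matrix(2, 4, rng)    # output layer, 4 -> 2

z_none = no_head(h)
z_linear = linear_head(h, w)
z_nonlinear = nonlinear_head(h, w1, w2)
```

Whichever head is used during training, the downstream task works from `h`, the encoder output, rather than from `z`.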

Key Findings:

  • Augmentations: Crop and Color Jitter
  • Base Encoder
  • Projection head: None, Linear, Non-Linear (in Fully Connected Layer)
  • Scale Up: Parameters, Epochs, Batch Size