VICReg: Variance-Invariance-Covariance Regularization for Self Supervised Learning

Intuitive and step-by-step guide to understanding the VICReg framework for Self Supervised Learning, along with interactive visualizations and code
Saurav Maheshkar
Created on February 11|Last edited on February 21
Introduction
Self supervised learning has recently emerged as a promising way to train models with rich representations without relying on large labelled datasets. It works by perturbing or masking unlabelled data and training a model to recognize the distorted or missing items, so that the data itself supplies the "pseudo-labeled" pairs.
A recent survey paper on self supervised learning (A Cookbook of Self-Supervised Learning) proposes four broad classes of methods: the Deep Metric Learning Family, the Self-Distillation Family, the Canonical Correlation Analysis Family and the Masked Modeling Family. These four classes are a useful way to organize the literature and a great starting point for learning about self supervised learning!
In this article, we will break down one such paper: [ICLR 2022] VICReg: Variance-Invariance-Covariance Regularization for Self Supervised Learning by Adrien Bardes, Jean Ponce and Yann LeCun. This framework falls under the Canonical Correlation Analysis Family, which aims to learn the relationship between variables by analyzing their cross-covariance matrices.
The TL;DR is that VICReg balances three objectives based on the covariance matrices of representations from two views: variance, invariance and covariance. Regularizing the variance along each dimension of the representation prevents collapse, the invariance term ensures that the two views are encoded similarly, and the covariance term encourages different dimensions of the representation to capture different features.
I have open-sourced implementations in both PyTorch and TensorFlow in my repository (SauravMaheshkar/sslgym) if you'd like to take any of this out for a spin.
📋 Table of Contents
Introduction
👨‍🏫 Method
👨‍💻 Code
📊 Results
👋 Summary

👨‍🏫 Method
Figure 1: Proposed VICReg Framework
The setup is similar to most contrastive methods for self supervised learning. Given a distribution of transformations $T$, we randomly sample two transformations $t, t'$ from this distribution to generate two views of an image. These views are then passed through an encoder $f_{\theta}$ to generate intermediate representations, which are in turn passed through an expander to generate the embeddings corresponding to each view. This is where our work starts!
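The pipeline above can be sketched in a few lines of PyTorch. This is only an illustrative sketch: `random_view` is a hypothetical stand-in for sampling a transformation from $T$ (the paper uses a richer set of image augmentations), and the tiny `encoder` and `expander` modules with their sizes are assumptions chosen to keep the example fast, not the ResNet50 setup used later in this article.

```python
import torch
from torch import nn


def random_view(images: torch.Tensor, generator: torch.Generator) -> torch.Tensor:
    """Hypothetical stand-in for one sampled transformation t ~ T.

    Applies a per-image random horizontal flip and a random brightness scale.
    """
    batch = images.shape[0]
    flip = torch.rand(batch, generator=generator) < 0.5
    flipped = torch.where(flip[:, None, None, None], images.flip(-1), images)
    brightness = 0.8 + 0.4 * torch.rand(batch, 1, 1, 1, generator=generator)
    return flipped * brightness


# Toy stand-ins for the encoder f_theta and the expander (assumed sizes)
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 16), nn.ReLU())
expander = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 32))

generator = torch.Generator().manual_seed(0)
images = torch.rand(4, 3, 8, 8, generator=generator)

# Two independently sampled transformations give two views of each image
view_a = random_view(images, generator)
view_b = random_view(images, generator)

# Both views pass through the same encoder and expander
z_a = expander(encoder(view_a))
z_b = expander(encoder(view_b))
print(z_a.shape, z_b.shape)
```

The two embedding batches `z_a` and `z_b` are what the loss terms described next operate on.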
The authors of VICReg propose to use a composite loss function consisting of:
Invariance Loss: the simple mean square distance between the embedding vectors.
Variance Loss: a hinge loss to maintain the standard deviation (over a batch) of each variable of the embedding above a given threshold. This term forces the embedding vectors of samples within a batch to be different.
Covariance Loss: a term that attracts the covariances (over a batch) between every pair of (centered) embedding variables towards zero. This term decorrelates the variables of each embedding and prevents an informational collapse in which the variables would vary together or be highly correlated.
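For readers who prefer equations, the three terms can be written as follows. The notation follows the VICReg paper as best we can reproduce it here: $Z$ and $Z'$ are the two batches of $n$ embeddings of dimension $d$, $z^{j}$ is the $j$-th embedding variable across the batch, $\gamma$ is the target standard deviation, and $\epsilon$ is a small constant for numerical stability.

```latex
% Invariance: mean squared distance between paired embeddings
s(Z, Z') = \frac{1}{n} \sum_{i=1}^{n} \lVert z_i - z'_i \rVert_2^2

% Variance: hinge on the per-dimension standard deviation
v(Z) = \frac{1}{d} \sum_{j=1}^{d} \max\bigl(0,\; \gamma - S(z^{j}, \epsilon)\bigr),
\qquad S(x, \epsilon) = \sqrt{\mathrm{Var}(x) + \epsilon}

% Covariance: squared off-diagonal entries of the covariance matrix
C(Z) = \frac{1}{n - 1} \sum_{i=1}^{n} (z_i - \bar{z})(z_i - \bar{z})^{\top},
\qquad c(Z) = \frac{1}{d} \sum_{i \neq j} \bigl[C(Z)\bigr]_{i,j}^{2}

% Overall objective
\ell(Z, Z') = \lambda\, s(Z, Z') + \mu\, \bigl[v(Z) + v(Z')\bigr] + \nu\, \bigl[c(Z) + c(Z')\bigr]
```

In the code in this article, $\gamma = 1$ and $\epsilon = 0.0001$, and $(\lambda, \mu, \nu)$ correspond to the coefficients (25, 25, 1). Note that the code averages the two variance terms instead of summing them, and uses a mean squared error that also averages over the embedding dimension, so its values differ from these formulas by constant factors.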
Taken together, these terms pull the embeddings of the two views close to each other while keeping every dimension of the embedding informative. Their implementations are extremely simple, as the code sample below shows.
👨‍💻 Code
VICReg follows the standard contrastive recipe for generating views from data points, and can thus be implemented in relatively few lines of PyTorch:
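import torch
import torch.nn.functional as F
from torch import nn
from torchvision import models as torchvision_models


def expander(embedding: int, mlp: str = "8192-8192-8192") -> nn.Sequential:
    # Helper not shown in the original snippet; sketched here (as an assumption)
    # as Linear -> BatchNorm -> ReLU blocks followed by a final linear layer
    dims = [embedding] + [int(d) for d in mlp.split("-")]
    layers = []
    for d_in, d_out in zip(dims[:-2], dims[1:-1]):
        layers += [nn.Linear(d_in, d_out), nn.BatchNorm1d(d_out), nn.ReLU(inplace=True)]
    layers.append(nn.Linear(dims[-2], dims[-1], bias=False))
    return nn.Sequential(*layers)


def off_diagonal(m: torch.Tensor) -> torch.Tensor:
    # Helper not shown in the original snippet: off-diagonal entries of a square matrix
    n = m.shape[0]
    return m.flatten()[:-1].view(n - 1, n + 1)[:, 1:].flatten()


class VICReg(nn.Module):
    def __init__(
        self,
        mlp: str = "8192-8192-8192",
        sim_coeff: float = 25.0,
        std_coeff: float = 25.0,
        cov_coeff: float = 1.0,
    ) -> None:
        super().__init__()
        self.num_features = int(mlp.split("-")[-1])
        self.sim_coeff, self.std_coeff, self.cov_coeff = sim_coeff, std_coeff, cov_coeff
        self.encoder = torchvision_models.resnet50(zero_init_residual=True)
        self.embedding = self.encoder.fc.weight.shape[1]
        self.encoder.fc = nn.Identity()
        # Expander is a simple MLP based NN
        self.expander = expander(self.embedding, mlp)

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # Get Embeddings
        x = self.expander(self.encoder(x))
        y = self.expander(self.encoder(y))
        batch_size = x.shape[0]

        # Calculate the Representation (Invariance) Loss
        repr_loss = F.mse_loss(x, y)

        # Center the embeddings, then compute the per-dimension std. dev.
        x = x - x.mean(dim=0)
        y = y - y.mean(dim=0)
        std_x = torch.sqrt(x.var(dim=0) + 0.0001)
        std_y = torch.sqrt(y.var(dim=0) + 0.0001)

        # Calculate the Variance Loss (Hinge Function)
        std_loss = torch.mean(F.relu(1 - std_x)) / 2 + torch.mean(F.relu(1 - std_y)) / 2

        # Get Covariance Matrix
        cov_x = (x.T @ x) / (batch_size - 1)
        cov_y = (y.T @ y) / (batch_size - 1)

        # Calculate the Covariance Loss
        cov_loss = off_diagonal(cov_x).pow(2).sum().div(
            self.num_features
        ) + off_diagonal(cov_y).pow(2).sum().div(self.num_features)

        # Weighted Sum of Invariance, Variance and Covariance Loss
        loss = (
            self.sim_coeff * repr_loss
            + self.std_coeff * std_loss
            + self.cov_coeff * cov_loss
        )
        return loss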
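To see why the variance term matters, it helps to run the three terms on a collapsed batch, where every sample maps to the same embedding. The sketch below is a standalone re-statement of the loss computation on plain tensors, without the encoder; `vicreg_terms` and this mask-based `off_diagonal` are illustrative helpers written for this example, and the batch and embedding sizes are arbitrary.

```python
import torch
import torch.nn.functional as F


def off_diagonal(m: torch.Tensor) -> torch.Tensor:
    """Return the off-diagonal entries of a square matrix as a 1-D tensor."""
    n = m.shape[0]
    mask = ~torch.eye(n, dtype=torch.bool, device=m.device)
    return m[mask]


def vicreg_terms(x: torch.Tensor, y: torch.Tensor, eps: float = 1e-4):
    """Compute the unweighted (invariance, variance, covariance) terms."""
    n, d = x.shape
    inv = F.mse_loss(x, y)
    x = x - x.mean(dim=0)
    y = y - y.mean(dim=0)
    std_x = torch.sqrt(x.var(dim=0) + eps)
    std_y = torch.sqrt(y.var(dim=0) + eps)
    var = F.relu(1 - std_x).mean() / 2 + F.relu(1 - std_y).mean() / 2
    cov_x = (x.T @ x) / (n - 1)
    cov_y = (y.T @ y) / (n - 1)
    cov = off_diagonal(cov_x).pow(2).sum() / d + off_diagonal(cov_y).pow(2).sum() / d
    return inv, var, cov


torch.manual_seed(0)
healthy = torch.randn(256, 8)
collapsed = torch.ones(256, 8)  # every sample maps to the same vector

inv_c, var_c, cov_c = vicreg_terms(collapsed, collapsed)
inv_h, var_h, cov_h = vicreg_terms(healthy, healthy + 0.01 * torch.randn(256, 8))

print(f"collapsed: inv={inv_c:.4f} var={var_c:.4f} cov={cov_c:.4f}")
print(f"healthy:   inv={inv_h:.4f} var={var_h:.4f} cov={cov_h:.4f}")
```

For the collapsed batch, the invariance and covariance terms are both zero, so on their own they would reward the trivial solution. Only the variance term, at roughly 0.99, penalizes it.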
📊 Results
The following panel graph shows a training run for training a ResNet50 encoder using the VICReg objective on the CIFAR10 dataset:
[Interactive panel: Run set 1]
💡 NOTE: If the values on the Y-axis look unusually large, that is due to the choice of coefficients. We use the same coefficients as described in the paper (25, 25, 1), which scale up the loss.
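To make the effect of these coefficients concrete, here is a small arithmetic sketch. The three term values are hypothetical numbers picked purely for illustration, not measurements from the run above, and `weighted_vicreg_loss` is a helper name made up for this example.

```python
def weighted_vicreg_loss(
    inv: float,
    var: float,
    cov: float,
    sim_coeff: float = 25.0,
    std_coeff: float = 25.0,
    cov_coeff: float = 1.0,
) -> float:
    """Combine the three terms using the paper's coefficients (25, 25, 1)."""
    return sim_coeff * inv + std_coeff * var + cov_coeff * cov


# Hypothetical term values, chosen only to illustrate the scaling
total = weighted_vicreg_loss(inv=0.5, var=0.4, cov=0.2)
print(total)  # 25 * 0.5 + 25 * 0.4 + 1 * 0.2
```

Even though each individual term here is below 1, the weighted total lands above 20, which is why the logged loss sits at a much larger scale than the raw terms.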
👋 Summary
In this article, we covered the VICReg framework for Self Supervised Learning as introduced in the paper [ICLR 2022] VICReg: Variance-Invariance-Covariance Regularization for Self Supervised Learning by Adrien Bardes, Jean Ponce and Yann LeCun. We also looked at how to implement the model in PyTorch and compared a few experiments using the provided implementation.
To see the full suite of W&B features, please check out this short 5-minute guide. If you want more reports covering the math and "from-scratch" code implementations, let us know in the comments down below or on our forum ✨!
Check out these other reports on Fully Connected covering other Self Supervised Learning topics:
Self-Supervised Learning : An Introduction
A Brief Introduction to Self Supervised Learning, the first in an upcoming series of reports covering Self Supervised Learning. 
DINO: Emerging Properties in Self-Supervised Vision Transformers
Breakdown of Emerging Properties in Self-Supervised Vision Transformers by Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski and Armand Joulin with Weights and Biases logging ⭐️.
PAWS : Semi-Supervised Learning of Visual Features by Non-Parametrically Predicting View Assignments with Support Samples
Breakdown of Semi-Supervised Learning of Visual Features by Non-Parametrically Predicting View Assignments with Support Samples by Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Armand Joulin, Nicolas Ballas, Michael Rabbat with Weights and Biases logging.
What Is Noise Contrastive Estimation Loss? A Tutorial With Code
A tutorial covering the Noise Contrastive Estimation Loss, a commonly encountered loss function in Self Supervised Learning.
Tags: Articles, Intermediate, Tutorial