VICReg: Variance-Invariance-Covariance Regularization for Self Supervised Learning
Intuitive and step-by-step guide to understanding the VICReg framework for Self Supervised Learning, along with interactive visualizations and code
Created on February 11|Last edited on February 21
Introduction
Self supervised learning has recently emerged as a promising way to train models with rich representations without relying on large labeled datasets. It works by perturbing or masking unlabelled data and training a model to match or recover the distorted or missing items, so that the data itself supplies the "pseudo-labeled" pairs.
A recent survey paper on self supervised learning (A Cookbook of Self-Supervised Learning) proposes four broad families: the Deep Metric Learning Family, the Self-Distillation Family, the Canonical Correlation Analysis Family and the Masked Modeling Family. These families give a useful way to organize papers and a good starting point for learning about self supervised learning!
In this article, we will break down one such paper: [ICLR 2022] VICReg: Variance-Invariance-Covariance Regularization for Self Supervised Learning by Adrien Bardes, Jean Ponce and Yann LeCun. This framework falls under the Canonical Correlation Analysis Family, which aims to learn the relationship between variables by analyzing their cross-covariance matrices.
The TL;DR is that VICReg balances three objectives computed on the embeddings of two views: variance, invariance and covariance. Regularizing the variance along each dimension of the embedding prevents collapse, the invariance term ensures the two views are encoded similarly, and the covariance term encourages different dimensions of the embedding to capture different features.
I have open-sourced implementations in both PyTorch and TensorFlow in my repository (SauravMaheshkar/sslgym) if you'd like to take any of this out for a spin.
👨‍🏫 Method

Figure 1: Proposed VICReg Framework
The overall architecture is similar to most contrastive methods for self supervised learning. Given a distribution of image transformations, we randomly sample two transformations from it and apply them to the same image to generate two views. These views are then passed through an encoder to generate intermediate representations, which are in turn passed through an expander to generate the embeddings corresponding to each view. This is where our work starts!
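As a rough illustration of the view-generation step, the sketch below draws two transformations from a toy distribution and applies them to the same image. The `sample_transform` helper and its flip and brightness operations are assumptions made for this example; the paper uses a richer augmentation pipeline.

```python
import torch


def sample_transform(generator: torch.Generator):
    """Draw one random transformation from a toy distribution (hypothetical helper)."""
    # Randomly decide whether to flip, and pick a brightness factor in [0.8, 1.2)
    flip = torch.rand(1, generator=generator).item() < 0.5
    brightness = 0.8 + 0.4 * torch.rand(1, generator=generator).item()

    def transform(img: torch.Tensor) -> torch.Tensor:
        out = torch.flip(img, dims=[-1]) if flip else img
        return (out * brightness).clamp(0.0, 1.0)

    return transform


g = torch.Generator().manual_seed(0)
image = torch.rand(3, 32, 32, generator=g)  # stand-in for a CIFAR10-sized image

# Two independent draws from the same distribution give two views of one image
t, t_prime = sample_transform(g), sample_transform(g)
view_x, view_y = t(image), t_prime(image)
```

Both views keep the shape of the input image, so they can be batched and fed to the same encoder.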
The authors of VICReg propose to use a composite loss function consisting of:
- Invariance Loss: the simple mean square distance between the embedding vectors.
- Variance Loss: a hinge loss to maintain the standard deviation (over a batch) of each variable of the embedding above a given threshold. This term forces the embedding vectors of samples within a batch to be different.
- Covariance Loss: a term that attracts the covariances (over a batch) between every pair of (centered) embedding variables towards zero. This term decorrelates the variables of each embedding and prevents an informational collapse in which the variables would vary together or be highly correlated.
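For reference, the three terms listed above can be written out as follows, using the notation of the VICReg paper. Here Z and Z' are the two batches of n embeddings of dimension d, z^j is the j-th embedding variable taken across the batch, gamma is the target standard deviation, epsilon is a small constant for numerical stability, and lambda, mu and nu are the loss weights.

```latex
\begin{aligned}
s(Z, Z') &= \frac{1}{n} \sum_{i=1}^{n} \lVert z_i - z'_i \rVert_2^2 \\
v(Z) &= \frac{1}{d} \sum_{j=1}^{d} \max\left(0,\ \gamma - \sqrt{\operatorname{Var}(z^j) + \epsilon}\right) \\
C(Z) &= \frac{1}{n-1} \sum_{i=1}^{n} (z_i - \bar{z})(z_i - \bar{z})^{\top} \\
c(Z) &= \frac{1}{d} \sum_{i \neq j} [C(Z)]_{i,j}^2 \\
\ell(Z, Z') &= \lambda\, s(Z, Z') + \mu\, [v(Z) + v(Z')] + \nu\, [c(Z) + c(Z')]
\end{aligned}
```

Note that the code in this article uses `F.mse_loss`, which averages over both the batch and the embedding dimension, and halves each variance term, so its constants differ slightly from the formulas above.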
Together, these terms keep the embeddings of the two views close to each other while making sure that every dimension of the embedding stays informative and that no two dimensions are redundant. Each term is simple to implement, as the code in the next section shows.
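To build some intuition for what the variance and covariance terms penalize, here is a minimal sketch that evaluates them on three toy batches. The function names `variance_term` and `covariance_term` and the toy data are assumptions made for this illustration, not part of the original implementation.

```python
import torch
import torch.nn.functional as F


def variance_term(z: torch.Tensor, gamma: float = 1.0, eps: float = 1e-4) -> torch.Tensor:
    """Hinge on the per-dimension standard deviation over the batch."""
    std = torch.sqrt(z.var(dim=0) + eps)
    return F.relu(gamma - std).mean()


def covariance_term(z: torch.Tensor) -> torch.Tensor:
    """Sum of squared off-diagonal covariances, divided by the dimension."""
    n, d = z.shape
    z = z - z.mean(dim=0)
    cov = (z.T @ z) / (n - 1)
    off_diag = cov - torch.diag(torch.diag(cov))
    return off_diag.pow(2).sum() / d


torch.manual_seed(0)
healthy = torch.randn(256, 8)                  # independent, roughly unit-variance dimensions
collapsed = torch.ones(256, 8)                 # every sample maps to the same vector
redundant = torch.randn(256, 1).repeat(1, 8)   # all dimensions carry the same signal

print("variance term   | healthy:", variance_term(healthy).item(),
      "| collapsed:", variance_term(collapsed).item())
print("covariance term | healthy:", covariance_term(healthy).item(),
      "| redundant:", covariance_term(redundant).item())
```

The collapsed batch is penalized heavily by the variance term, while the redundant batch passes the variance check but is penalized by the covariance term. This is why both regularizers are needed alongside the invariance loss.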
👨‍💻 Code
VICReg follows the standard contrastive model for generating views from data points, and thus can be implemented in relatively few lines of code in PyTorch:
import torch
import torch.nn.functional as F
from torch import nn
from torchvision import models as torchvision_models


class VICReg(nn.Module):
    def __init__(self, mlp="8192-8192-8192", sim_coeff=25.0, std_coeff=25.0, cov_coeff=1.0) -> None:
        super().__init__()
        self.num_features = int(mlp.split("-")[-1])
        self.encoder = torchvision_models.resnet50(zero_init_residual=True)
        self.embedding = self.encoder.fc.weight.shape[1]
        self.encoder.fc = nn.Identity()
        # Expander is a simple MLP; `expander` is a helper assumed to be defined elsewhere
        self.expander = expander(self.embedding)
        # Loss weights; the defaults follow the paper's (25, 25, 1)
        self.sim_coeff = sim_coeff
        self.std_coeff = std_coeff
        self.cov_coeff = cov_coeff

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        batch_size = x.shape[0]

        # Get embeddings for both views
        x = self.expander(self.encoder(x))
        y = self.expander(self.encoder(y))

        # Calculate the representation (invariance) loss
        repr_loss = F.mse_loss(x, y)

        # Center the embeddings and compute the per-dimension std. dev. over the batch
        x = x - x.mean(dim=0)
        y = y - y.mean(dim=0)
        std_x = torch.sqrt(x.var(dim=0) + 0.0001)
        std_y = torch.sqrt(y.var(dim=0) + 0.0001)

        # Calculate the variance loss (hinge function)
        std_loss = torch.mean(F.relu(1 - std_x)) / 2 + torch.mean(F.relu(1 - std_y)) / 2

        # Get the covariance matrices
        cov_x = (x.T @ x) / (batch_size - 1)
        cov_y = (y.T @ y) / (batch_size - 1)

        # Calculate the covariance loss; `off_diagonal` is a helper assumed to be defined elsewhere
        cov_loss = off_diagonal(cov_x).pow_(2).sum().div(self.num_features) + off_diagonal(cov_y).pow_(2).sum().div(self.num_features)

        # Weighted sum of the invariance, variance and covariance losses
        loss = self.sim_coeff * repr_loss + self.std_coeff * std_loss + self.cov_coeff * cov_loss
        return loss
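The snippet above relies on two helpers, `expander` and `off_diagonal`, that are not shown in this article. The sketch below is one plausible way to write them; the layer layout of the expander (Linear, BatchNorm and ReLU blocks followed by a final bias-free Linear layer) and the default `mlp` argument are assumptions, and the versions in the repository may differ.

```python
import torch
from torch import nn


def off_diagonal(m: torch.Tensor) -> torch.Tensor:
    """Return the off-diagonal entries of a square matrix as a flat tensor."""
    n, k = m.shape
    assert n == k, "expected a square matrix"
    # Dropping the last element and viewing as (n - 1, n + 1) lines the
    # diagonal entries up in the first column, which is then sliced away.
    return m.flatten()[:-1].view(n - 1, n + 1)[:, 1:].flatten()


def expander(embedding: int, mlp: str = "8192-8192-8192") -> nn.Sequential:
    """Build an MLP from a dash-separated spec of layer widths (assumed layout)."""
    dims = [embedding] + [int(width) for width in mlp.split("-")]
    layers = []
    for i in range(len(dims) - 2):
        layers += [
            nn.Linear(dims[i], dims[i + 1]),
            nn.BatchNorm1d(dims[i + 1]),
            nn.ReLU(inplace=True),
        ]
    layers.append(nn.Linear(dims[-2], dims[-1], bias=False))
    return nn.Sequential(*layers)


# Small usage example with toy sizes
matrix = torch.arange(9.0).view(3, 3)
off_diag_values = off_diagonal(matrix)

toy_expander = expander(16, "32-32-8")
toy_output = toy_expander(torch.randn(4, 16))
```

With the default spec, the expander maps the 2048-dimensional ResNet50 representation to 8192-dimensional embeddings, matching `num_features` in the class above.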
📊 Results
The following panel shows a training run of a ResNet50 encoder trained with the VICReg objective on the CIFAR10 dataset:
NOTE: If the scale of the Y-axis looks unusually large, that is because of the choice of coefficients. We use the same coefficients as described in the paper (25, 25, 1), which scales up the loss.
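As a quick worked example of how the coefficients scale the reported loss, the sketch below combines three made-up term values with the (25, 25, 1) weights. The function name `weighted_vicreg_loss` and the input values are hypothetical and are not taken from the training run.

```python
def weighted_vicreg_loss(repr_loss, std_loss, cov_loss,
                         sim_coeff=25.0, std_coeff=25.0, cov_coeff=1.0):
    """Weighted sum of the three VICReg terms (hypothetical helper)."""
    return sim_coeff * repr_loss + std_coeff * std_loss + cov_coeff * cov_loss


# Made-up term values, chosen only to show the arithmetic:
# 25 * 0.02 + 25 * 0.5 + 1 * 0.3 = 0.5 + 12.5 + 0.3
total = weighted_vicreg_loss(repr_loss=0.02, std_loss=0.5, cov_loss=0.3)
print(total)
```

Even though each individual term here is below 1, the weighted total is above 13, which is why the loss curve sits on a larger scale than the raw terms would suggest.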
👋 Summary
In this article, we covered the VICReg framework for Self Supervised Learning as introduced in the paper [ICLR 2022] VICReg: Variance-Invariance-Covariance Regularization for Self Supervised Learning by Adrien Bardes, Jean Ponce and Yann LeCun. We also looked at how to implement the model in PyTorch and at a training run using the provided implementation.
To see the full suite of W&B features, please check out this short 5-minute guide. If you want more reports covering the math and "from-scratch" code implementations, let us know in the comments down below or on our forum ✨!
Self-Supervised Learning : An Introduction
A Brief Introduction to Self Supervised Learning, the first in an upcoming series of reports covering Self Supervised Learning.
DINO: Emerging Properties in Self-Supervised Vision Transformers
Breakdown of Emerging Properties in Self-Supervised Vision Transformers by Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski and Armand Joulin with Weights and Biases logging ⭐️.
PAWS : Semi-Supervised Learning of Visual Features by Non-Parametrically Predicting View Assignments with Support Samples
Breakdown of Semi-Supervised Learning of Visual Features by Non-Parametrically Predicting View Assignments with Support Samples by Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Armand Joulin, Nicolas Ballas, Michael Rabbat with Weights and Biases logging.
What Is Noise Contrastive Estimation Loss? A Tutorial With Code
A tutorial covering the Noise Contrastive Estimation Loss, a commonly encountered loss function in Self Supervised Learning