
PAWS : Semi-Supervised Learning of Visual Features by Non-Parametrically Predicting View Assignments with Support Samples

Breakdown of Semi-Supervised Learning of Visual Features by Non-Parametrically Predicting View Assignments with Support Samples by Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Armand Joulin, Nicolas Ballas, Michael Rabbat with Weights and Biases logging.
Created on January 29 | Last edited on February 2

Original Paper | Official Github Repository | W&B Implementation


👋 First things first: Prerequisites

Pseudo-labels

This method trains a model to minimize a consistency loss, which ensures that different views of the same unlabeled instance are assigned similar pseudo-labels.
What are the pseudo-labels?

PAWS attempts to overcome the challenge of semi-supervised learning by incorporating information from a small subset of labeled images, the support samples, to generate pseudo-labels for the huge unlabelled dataset during the pre-training phase. To generate these pseudo-labels, PAWS compares the L2-normalized representations of the unlabelled views to the representations of randomly sampled support images. The labeled samples are thus utilized non-parametrically.
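To make this non-parametric step concrete, here is a rough sketch in NumPy (all names are illustrative, not from the official repo): an L2-normalized unlabelled representation is compared to normalized support representations, and the support labels are averaged by similarity to produce a soft pseudo-label.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Scale each vector to unit L2 norm.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def soft_pseudo_label(z, z_support, y_support, tau=0.1):
    """Soft pseudo-label for one unlabelled representation z.

    z:         (D,)   unlabelled-view representation
    z_support: (N, D) support representations
    y_support: (N, K) one-hot support labels
    """
    z = l2_normalize(z)
    z_support = l2_normalize(z_support)
    sims = z_support @ z                  # cosine similarities, shape (N,)
    weights = np.exp(sims / tau)
    weights /= weights.sum()              # softmax over the support samples
    return weights @ y_support            # (K,) soft class distribution

rng = np.random.default_rng(0)
z = rng.normal(size=8)
z_support = rng.normal(size=(6, 8))
y_support = np.eye(3)[rng.integers(0, 3, size=6)]  # 6 support samples, 3 classes
p = soft_pseudo_label(z, z_support, y_support)
```

The result is a valid probability distribution over classes, which can then serve as the training target for another view of the same image.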
An interesting further read on pseudo-labels for semi-supervised learning is this CVPR 2021 paper by Google AI, Brain Team: Meta Pseudo Labels [1]

Target Sharpening

Sharpening the targets encourages the network to produce confident predictions.
In the paper 'MixMatch: A Holistic Approach to Semi-Supervised Learning' [2], the authors likewise experimented with the sharpening function to improve their results. In general, sharpening is applied to reduce the entropy (or simply, the randomness) of the label distribution, making the outputs sharper and the model's predictions more confident.
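A minimal sketch of a MixMatch-style sharpening function (the exact form PAWS uses may differ; this illustrates the idea of raising probabilities to a power $1/T$ and renormalizing):

```python
import numpy as np

def sharpen(p, T=0.25):
    # Raise each probability to 1/T and renormalize; T < 1 lowers entropy.
    p = np.asarray(p, dtype=float) ** (1.0 / T)
    return p / p.sum()

p = np.array([0.5, 0.3, 0.2])
sharp = sharpen(p, T=0.25)
```

With `T=0.25`, the largest probability grows at the expense of the others, so the distribution becomes more confident while keeping the same argmax.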



🧑🏻‍🏫 Major Contributions

The major contributions that the paper proposes can be summarized as:
  1. Utilizing a small labeled support set during pre-training to achieve competitive classification accuracy for semi-supervised tasks.
  2. PAWS requires significantly less training as compared to the prior work in the domain.
  3. To avoid the collapse of all representations to a single vector, a common challenge in existing self-supervised approaches, the authors propose simply sharpening the target pseudo-labels.
  4. PAWS can be interpreted as a neural network architecture with an external memory that is trained using the assimilation & accommodation principle. During assimilation, PAWS updates the representations of new observations so that they are easily described by its external memory (or schemata), while during accommodation, PAWS updates its external memory to account for the new observations.
Let's now look at the approach to understand these contributions better!



🙇‍♂️ Approach

The goal is to leverage a large unlabelled dataset together with a smaller labeled support dataset of images to learn representations during pre-training. After pre-training on both datasets, the learned representations are fine-tuned using only the labeled support set.

Breaking it down: from an image in the unlabelled dataset, you generate two different views using random augmentations. One of these is your positive view, while the other is the anchor view. You also have a support set of augmented labeled images. An encoder is used to obtain representations of these three inputs.
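As a toy illustration of generating two views, the sketch below uses NumPy arrays standing in for images and a crude crop-and-flip augmentation (the actual repo uses much richer augmentations such as random resized crops and color jitter):

```python
import numpy as np

def random_view(img, crop=24, rng=None):
    # Toy augmentation: random crop plus random horizontal flip.
    rng = rng or np.random.default_rng()
    h, w = img.shape[:2]
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    view = img[top:top + crop, left:left + crop]
    if rng.random() < 0.5:
        view = view[:, ::-1]       # horizontal flip
    return view

img = np.arange(32 * 32 * 3, dtype=float).reshape(32, 32, 3)
rng = np.random.default_rng(0)
anchor, positive = random_view(img, rng=rng), random_view(img, rng=rng)
```

Both views derive from the same underlying image, which is what justifies pushing their class predictions toward agreement.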
Here comes the beautiful part! You use a simple nearest-neighbour classifier $\pi_d$ that measures the similarity of a given representation to those of a mini-batch of labeled samples from the support set $S$, and outputs a (soft) class label. This support set is not used to evaluate the loss term but is only utilized to assign pseudo-labels to the unlabelled images.
We take two views of the same image: one is called the anchor view, while the other is called the positive view. We obtain representations for both using an encoder, which are further used to generate class predictions, called the prediction $p$ and the target $p^+$ respectively. Since these two views come from the same image, we attempt to maximize the similarity, or the agreement! [3]
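Agreement between the anchor prediction and the (sharpened) positive target is typically measured with a cross-entropy; a minimal sketch of such a consistency term, with illustrative values:

```python
import numpy as np

def cross_entropy(target, pred, eps=1e-8):
    # H(target, pred): penalizes pred for disagreeing with target.
    return -np.sum(target * np.log(pred + eps))

p = np.array([0.6, 0.3, 0.1])         # anchor prediction
p_plus = np.array([0.9, 0.05, 0.05])  # sharpened positive target
loss = cross_entropy(p_plus, p)
```

The loss is smallest when the prediction matches the target exactly, so minimizing it pulls the two views' predictions together.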
Mathematically, the similarity classifier can be written as,
$$\huge \pi_d(z_i, z_S) = \sum_{(z_{sj},\, y_j) \in z_S} \left( \frac{d(z_i, z_{sj})}{\sum_{z_{sk} \in z_S} d(z_i, z_{sk})} \right) y_j$$

where $z_i$ is the $i^{th}$ representation in the mini-batch $z$, $z_S$ is the matrix of representations computed from the support samples, and $y_j$ is the one-hot ground-truth label vector associated with the $j^{th}$ row vector $z_{sj}$ of $z_S$.
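The formula above is just a similarity-weighted average of the support labels. A mini-batch sketch (illustrative names, assuming L2-normalized inputs and the exponential temperature-scaled cosine as $d$):

```python
import numpy as np

def pi_d(z, z_support, y_support, tau=0.1):
    """Soft nearest-neighbour classifier over a mini-batch.

    z:         (B, D) L2-normalized anchor representations
    z_support: (N, D) L2-normalized support representations
    y_support: (N, K) one-hot support labels
    Returns (B, K) soft class predictions.
    """
    sims = np.exp((z @ z_support.T) / tau)            # d(z_i, z_sk) for all pairs
    weights = sims / sims.sum(axis=1, keepdims=True)  # normalize per anchor
    return weights @ y_support                        # weighted average of labels

def normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

rng = np.random.default_rng(1)
z = normalize(rng.normal(size=(4, 16)))
z_s = normalize(rng.normal(size=(10, 16)))
y_s = np.eye(5)[rng.integers(0, 5, size=10)]
p = pi_d(z, z_s, y_s)
```

Each row of the output is a valid distribution over the $K$ classes, one per anchor in the mini-batch.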

Similarity Metric : The Exponential Temperature-Scaled Cosine

As the authors explain in [3], approaches similar to SimCLR [4] suffer from the drawback of representation collapse, where the representations generated from the augmented views of the same input might collapse into one. In PAWS, the authors overcome the challenge of representation collapse by simply sharpening the target pseudo-labels.
With this exponential temperature-scaled cosine as the similarity metric $d$, and assuming L2-normalized representations, the prediction for an anchor representation under $\pi_d$ can now be understood as:
$$\huge p_i := \pi_d(z_i, z_S) = \sigma_\tau (z_i z_S^{\top})\, y_S$$

where $\sigma_\tau(\cdot)$ is the softmax with temperature $\tau > 0$, and $p_i \in [0, 1]^K$ is the prediction for representation $z_i$.
How does this temperature affect our predictions?
Temperature is a hyperparameter: a low temperature (below 1) makes the model more confident, while a high temperature (above 1) makes it less confident.
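A quick illustration of the temperature's effect on a softmax (pure NumPy, not from the paper's code):

```python
import numpy as np

def softmax_with_temperature(logits, tau):
    # Lower tau -> sharper (more confident) distribution.
    z = np.asarray(logits, dtype=float) / tau
    z -= z.max()               # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [2.0, 1.0, 0.5]
sharp = softmax_with_temperature(logits, tau=0.1)  # confident
flat = softmax_with_temperature(logits, tau=10.0)  # close to uniform
```

At `tau=0.1` nearly all the mass lands on the largest logit, while at `tau=10.0` the distribution is close to uniform.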
In case you would like to read more, the authors have a section on 'Theoretical Guarantees' where they prove how PAWS is guaranteed to avoid the trivial collapse of representations.



📈 Experiments

For the purposes of this report, we ran an exhaustive set of experiments on the CIFAR-10 dataset, sweeping over different values of label smoothing, the number of crops, and the architecture (ResNet-50 and WideResNet-28 with a widen factor of 2).

🏠 Comparing Architectures


(Interactive W&B panel: run set of 23 runs)


✂️ Effect of number of crops


(Interactive W&B panel: run set of 23 runs)


⛷ Effect of Label-Smoothing


(Interactive W&B panel: run set of 23 runs)




📚 References