When Does Self-Supervision Improve Few-Shot Learning?

As part of ML Reproducibility Challenge Spring 2021. Made by Arjun Ashok using Weights & Biases
Arjun Ashok


Deep learning has made major advances, but largely thanks to the availability of large annotated datasets for each task. Methods such as data augmentation and regularization alleviate overfitting in low-data regimes, but do not eliminate it. This motivated research in few-shot learning, in which we aim to build a classifier that can be adapted to new classes not seen during training, with very few samples per class. In this work, we reproduce and extend the results of the paper "When Does Self-supervision Improve Few-shot Learning?", which investigates using self-supervised learning (SSL) in such low-data regimes to improve the performance of meta-learning based few-shot learners.
The paper investigates applying self-supervised learning (SSL) as a regularizer to meta-learning based few-shot learners. The authors claim that SSL tasks reduce the relative error of few-shot learners by 4%-27% even when the datasets are small, and that the improvements are greater when there is less supervision or the task is more challenging. Further, they observe that incorporating unlabelled images from other domains for SSL can hurt performance, and propose a simple algorithm to select images for SSL from other domains that provides further improvements.
Our code is available at https://github.com/ashok-arjun/fsl_ssl_working.

Scope of reproducibility

We thoroughly reproduce all the experiments and investigate whether the paper's claims hold, using the model and the six benchmark datasets used by the authors. We find that all the claims hold when the same architecture and image size as in the paper are used.
Beyond the paper, we find that the results are biased towards the architecture used: the gains do not hold when the input image size and architecture differ from those reported in the paper.
We also report results on the more practical cross-domain few-shot learning setup, where we find that self-supervision does not help ImageNet-trained few-shot learners generalize to new domains better.



The goal of a few-shot learner is to learn representations of base classes that generalize well to novel classes. To this end, the proposed framework combines meta-learning approaches for few-shot learning with self-supervised learning.
Meta-learning is synonymous with learning to learn: the learning algorithm is designed not to parameterize a function, but to parameterize a quick learner. As shown in the figure, we iterate over multiple learning tasks in the training phase to understand how to learn tasks quickly; in the test phase, we do not merely perform prediction, but instead learn an unseen task quickly.
Since learning happens at a meta level, the training phase in which we iterate over multiple tasks is called the meta-training phase, and it contains multiple training and test phases inside it. The test phase is called the meta-test phase, which also contains a training and a test phase. This is what distinguishes meta-learning from standard supervised learning. In our context, a task is a small classification problem (an episode) built from a few classes.
The original paper uses the meta-learning based prototypical networks (ProtoNet). In short, the network meta-learns to produce useful class prototypes for new tasks from very few examples. At meta-test time, we encounter new classes, and class prototypes are computed from the training examples of the test classes for classification. Since the network has meta-learned to produce good class prototypes from few samples, it can learn to classify examples from the new classes very quickly.
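The prototype computation and nearest-prototype classification just described can be sketched in NumPy (shapes and function names are our assumptions, not the authors' code):

```python
import numpy as np

def class_prototypes(support_emb, support_labels, n_way):
    """One prototype per class: the mean embedding of its support examples.
    support_emb: (N, D) embeddings; support_labels: (N,) labels in [0, n_way)."""
    return np.stack([support_emb[support_labels == c].mean(axis=0)
                     for c in range(n_way)])

def classify_queries(query_emb, protos):
    """Assign each query embedding to the class of its nearest prototype
    (squared Euclidean distance, as in ProtoNet)."""
    dists = ((query_emb[:, None, :] - protos[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)
```

In the full method the embeddings come from the meta-trained backbone; at meta-test time only these two steps are needed to classify a new task.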

Self-supervised losses

Apart from the supervised losses, the paper uses self-supervised losses that are based on data whose labels can be derived automatically without any human labelling.
Here you can see an example of a self-supervised task: the rotation task, which is very common in the literature. Without human labelling, the image is used as input and the rotation applied to it is used as the label. The algorithm learns representations from this task that generalize very well to downstream supervised tasks.
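The rotation task described above can be sketched in a few lines; assuming a batch of images in (N, H, W, C) layout, the four rotated copies and their labels are derived automatically (an illustrative sketch, not the authors' implementation):

```python
import numpy as np

def rotation_batch(images):
    """Build the self-supervised rotation task from a batch of images
    (N, H, W, C): return all four rotations and their labels 0..3,
    where label k means a rotation of k * 90 degrees."""
    rotated, labels = [], []
    for k in range(4):
        rotated.append(np.rot90(images, k=k, axes=(1, 2)))  # rotate the H-W plane
        labels.append(np.full(len(images), k))              # derived label = k
    return np.concatenate(rotated), np.concatenate(labels)
```

A 4-way classification head is then trained on these derived labels alongside the few-shot objective.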
In our case, the self-supervised tasks used are the jigsaw puzzle task and the rotation prediction task.
The paper studies self-supervised learning as a regularizer for representation learning, in the context of few-shot learning tasks.

Overall loss function

The overall loss can be denoted as

L = (1 − α) · L_s + α · L_ss

where L_s denotes the supervised loss, L_ss denotes the self-supervised loss, and α is the weight of the SSL term.
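As a minimal sketch, the combination is a convex mix of the two loss values (variable names are ours):

```python
def total_loss(l_s, l_ss, alpha):
    """Convex combination of the supervised loss l_s and the
    self-supervised loss l_ss, weighted by alpha in [0, 1]."""
    return (1 - alpha) * l_s + alpha * l_ss
```

With alpha = 0 this reduces to purely supervised training; alpha is tuned per dataset in the hyperparameter sweeps.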

Domain selection

Observing that unlabelled images from other domains do not help few-shot learners when used for self-supervision, the authors propose a domain selection algorithm.
This algorithm can be used to select images from a large corpus of unlabelled images, and can ensure that for the particular dataset (and its domain), the selected images will improve the performance of the few-shot learner.
This is done by training a logistic regression classifier on ImageNet-trained ResNet-101 features of the unlabelled data as well as of the labelled in-domain data. The unlabelled images are treated as negatives, while the in-domain images are treated as positives. We rank the unlabelled images according to the ratio P(+)/P(−), choose the top-k images (k = 10 times the size of the labelled data), and use them for SSL.
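The ranking step of this algorithm can be sketched as follows, assuming the positive-class probabilities P(+) for the unlabelled images have already been produced by the logistic regression classifier (the function name is ours):

```python
import numpy as np

def select_unlabelled_for_ssl(pos_probs, k):
    """Rank unlabelled images by the ratio P(+)/P(-) from the binary
    domain classifier (in-domain = positive) and return the indices
    of the top-k images to use for self-supervision."""
    ratio = pos_probs / (1.0 - pos_probs)    # P(+)/P(-) per image
    return np.argsort(ratio)[::-1][:k]       # indices of the k largest ratios
```

Here k would be set to 10 times the size of the labelled in-domain set, as in the paper.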

Experimental Settings

Details regarding the code

The authors provide a public implementation of their paper, built upon a popular codebase from Chen et al. We found many errors and bugs in the code, and debugging them took up a considerable part of our time. Further, the code for the domain selection algorithm was absent, so we had to reimplement it from scratch.
Our code reuses multiple files from the original codebase, corrects several errors, provides easier interfaces to train and test models, and also provides an implementation of the domain selection algorithm.
We also provide interfaces to train models with a different architecture, and to evaluate models in the more practical cross-domain setup.

Model descriptions

The authors use the well-known ResNet-18 architecture for their experiments, which takes an input of size 224 x 224.
We experiment with both ResNet-18 and Conv-4-64, a 4-layer convolutional neural network that is another popular architecture in the few-shot learning literature and takes an input of size 84 x 84.


Following the few-shot setup, each dataset is split into three disjoint sets, each having a different set of classes. A model is trained on the base set, validated on the validation set, and tested on the test set.
Following the paper, we experiment with multiple datasets across diverse domains and denote the number of classes in the base, validation, test splits in the below table:
| Dataset | Download Link | Base classes | Val classes | Test classes | Comment |
| --- | --- | --- | --- | --- | --- |
| CUB-200-2011 | https://www.kaggle.com/tarunkr/caltech-birds-2011-dataset | 64 | 12 | 20 | Publicly available |
| VGG Flowers | https://www.kaggle.com/arjun2000ashok/vggflowers/ | 51 | 26 | 26 | Pre-processed and contributed by us |
| Stanford Cars | https://www.kaggle.com/hassiahk/stanford-cars-dataset-full | 98 | 49 | 49 | Pre-processed and contributed by us |
| Stanford Dogs | https://www.kaggle.com/jessicali9530/stanford-dogs-dataset | 60 | 30 | 30 | Publicly available |
| FGVC-Aircraft | https://www.kaggle.com/seryouxblaster764/fgvc-aircraft | 50 | 25 | 25 | Publicly available |
| miniImageNet | https://www.kaggle.com/arjunashok33/miniimagenet | 64 | 16 | 20 | Pre-processed and contributed by us |
The first 5 datasets are henceforth referred to as "the small datasets". Apart from these, we also experiment with a benchmark dataset for few-shot learning, the miniImageNet dataset.
Among the small datasets, we found that there were no versions of the flowers and cars datasets that could be used directly, so we had to preprocess both from scratch. We contribute the preprocessed versions to Kaggle for public use.
With the miniImageNet dataset, we found that all directly downloadable versions contained images resized to 84x84; however, we needed a dataset that could be resized to either 84x84 or 224x224 adaptively. Hence, we had to download the ImageNet dataset (155 GB) and process it from scratch, which caused storage issues and took up a significant part of our time.
To this end, we also open-source the preprocessed miniImageNet dataset with image sizes same as that in ImageNet, to save other researchers’ time in preprocessing the dataset from scratch. To the best of our knowledge, we are the first to release such a version.
For the domain selection algorithm, the authors use the training sets of two large datasets - Open Images v5 and iNaturalist, which are 500 GB and 200 GB in size respectively. These sizes far exceeded our storage capacity, and we instead could only use the validation sets of each of the datasets as unlabelled images for self-supervision.

Hyperparameter search

Each sweep uses random search to search over 3 hyperparameters:
• Learning Rate: uniform(0.0001, 0.03)
• Batch normalization mode:
1. Use batch normalization, accumulate statistics throughout training, and use the statistics during testing
2. Use batch normalization, but do not track the running mean and variance during training; estimate them from batches during training and test
3. No batch normalization
• α, the weight of the SSL term in the loss (only where self-supervision is applied)
Each of the experimental configurations is done for ProtoNet, ProtoNet+Jigsaw, ProtoNet+Rotation and ProtoNet+Jigsaw+Rotation (4 configurations) in the 5-way 5-shot setup.
The batch normalization mode was found, very early in a few sweeps, to be 2 for all self-supervised experiments and 1 for all supervised experiments. This corroborates the paper's choice of these hyperparameters.
We do a total of 24 sweeps, each having 5-10 runs, amounting to 150 runs in sweeps alone.
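For illustration, the search space above could be expressed as a W&B-style random-search sweep configuration (key names and the α grid are our assumptions, not the authors' exact config):

```python
# Could be passed to wandb.sweep(...) in a real run; shown here as a plain dict.
sweep_config = {
    "method": "random",  # random search, as described above
    "metric": {"name": "val_accuracy", "goal": "maximize"},
    "parameters": {
        "learning_rate": {"distribution": "uniform", "min": 0.0001, "max": 0.03},
        "bn_mode": {"values": [1, 2, 3]},      # the three batch-norm modes above
        "alpha": {"values": [0.1, 0.3, 0.5]},  # SSL loss weight (illustrative grid)
    },
}
```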
The sweep for CUB dataset with Conv-4 architecture, with supervised + jigsaw losses:
The sweep for Cars dataset with Conv-4 architecture, with supervised + rotation losses

Experimental setup & Computational Requirements

We had to verify all the results across multiple datasets to make sure they held. In addition, one of our main findings beyond the paper was based on runs with a different architecture, across multiple datasets. Due to this, the total number of full runs was 150, and the runs from all the sweeps put together amounted to about 230.
Hence, the total number of runs in our project amounted to 380 runs.
We used 4 Nvidia 1080Ti GPUs for all experiments. The main experiments took approximately 700 GPU hours; together with the shorter hyperparameter sweeps, the experiments took approximately 980 hours of compute time.


We discuss and visualize all our experiments, the results from the paper, as well as our novel findings beyond the paper.

Results reproducing the original paper

In this section, we consider the same architecture that the paper uses - a ResNet-18 with an input image size of 224.

Self-supervision improves few-shot learning

Here, we successfully verify claim 1 of the paper that with no additional unlabelled data, SSL improves few-shot learning when applied as an auxiliary task. We conduct experiments with ProtoNet, ProtoNet + Jigsaw, ProtoNet + Rotation and ProtoNet+Rotation+Jigsaw, for all 6 datasets.
Below, we visualize the results from 3 datasets: aircraft, cars and miniImageNet.
The jigsaw task leads to consistent improvements for all datasets, concurring with the results of the paper. In some cases, both jigsaw and rotation improve the results of the few-shot learner.

The benefits of self-supervision increase with the difficulty of the task

We verify claim 2 of the authors that the relative gains from SSL are larger when the task is more difficult. We experiment with artificially constructed difficult tasks: low-resolution inputs (images downsampled and then upsampled back) and greyscale inputs. In addition, we also experiment to find the effect of self-supervision when only 20% of the labelled data in the base training set is used. We experiment with 3 selected datasets: dogs, cars and CUB.
We plot the validation and test accuracies below.
Experiments with artificially constructed harder tasks
As shown, within this set of hard tasks, self-supervision can have a large impact on the performance of the few-shot learner. The jigsaw task provides consistent gains in this scenario, while rotation provides smaller gains.
Experiments with less labelled data
We show the test accuracies from 2 selected datasets, across four run configurations, each with just 20% of supervised data:
With just 20% labelled data, the prototypical few-shot learner fails to generalize as well as the learner with 100% labelled data, which is expected. However, simply applying self-supervision on the unlabelled data consistently improves performance for both plotted datasets.

Unlabelled data for SSL from dissimilar domains negatively impacts the few-shot learner

We test another claim of the paper: self-supervision does not help unless the distribution, or domain, of the data used for self-supervision matches that of the labelled data.
To verify this, a portion of each dataset's data is replaced with data from other datasets, sampled at random for self-supervision. That is, keeping 20% labelled data throughout, we sample 20%, 40%, 60% and 80% of the unlabelled data from another dataset at random.
We show results for the dogs and CUB datasets below.

The proposed domain selection algorithm can alleviate this issue by learning to pick images from a large and generic pool of images

Since there was no available implementation of the domain selection algorithm, we had to reimplement it from scratch.
Here, we experimented with all 5 datasets to verify our implementation.
We show the results obtained by our implementation below, for 3 selected datasets.
The results obtained by the algorithm are compared with those with unlabelled data selected at random from the corpus, to demonstrate the algorithm's effectiveness.
all-domain denotes sampling images at random, and 80-domain denotes using the proposed algorithm for sampling unlabelled images.
One can verify that the domain selection algorithm has a pronounced effect on performance when a large corpus of out-of-domain unlabelled data is used for self-supervised learning.

Results beyond the original paper

Going beyond the paper, we redo the experiments testing the effect of self-supervision on few-shot learners, for all 5 small datasets and miniImageNet, but with a different architecture: Conv-4-64, which takes images of size 84.
Further, we also evaluate the trained models in a cross-domain setup, where no data from the target domain is available during training, and a well-trained model from a different domain must generalize directly.

Results on Conv-4

Surprisingly, we find that the reported gains do not hold true when the architecture and image size are changed.
We experiment with all 6 datasets and plot the runs of 3 selected datasets above. As seen, for every dataset, the run without any self-supervision obtains the best performance.
In order to confirm our findings, we also run the same experiments with a different seed. We find that the same results hold true again.
We conduct an ablation study, experimenting with various values of the α parameter apart from the one found by the hyperparameter search:


| Rotation | CUB | Cars |
| --- | --- | --- |
| α = 0 (no SSL) | 77.72 ± 0.71 | 67.6 ± 0.84 |
| α = 0.1 | 77.6 ± 0.73 | 66.83 ± 0.75 |
| α = 0.3 | 77.22 ± 0.9 | 65.53 ± 0.73 |
| α = 0.5 | 73.94 ± 0.81 | 65.87 ± 0.73 |


| Jigsaw | CUB | Cars |
| --- | --- | --- |
| α = 0 (no SSL) | 77.72 ± 0.71 | 67.6 ± 0.84 |
| α = 0.1 | 75.57 ± 0.73 | 62.548 ± 0.75 |
| α = 0.3 | 64.91 ± 0.9 | 51.83 ± 0.73 |
| α = 0.5 | 69.09 ± 0.45 | 60.88 ± 0.53 |
With the rotation loss, increasing α gradually decreases performance. With the jigsaw loss, there is a sharp dip in accuracy at α = 0.3. For both losses, however, the purely supervised learner retains the best performance.
This implies that the authors' basic claim that self-supervision improves few-shot learning depends heavily on the architecture and image size used in the paper.

Results on cross-domain few-shot learning

In another effort to extend the paper's results, we test our trained models on the BSCD-FSL benchmark for cross-domain few-shot learning. Note that the paper reports results only for training models with unlabelled images from other domains, at various ratios. It does not evaluate the trained models zero-shot on another domain, so we set out to do so.
The benchmark requires ImageNet-trained few-shot models to be evaluated on four cross-domain datasets: ChestX, CropDisease, EuroSAT and ISIC.
The selected datasets reflect real-world use cases for few-shot learning since collecting enough examples from the above domains is often difficult, expensive, or in some cases not possible.
We use this benchmark to find out if models trained with self-supervision provide gains over normal supervised models when tested on real-world datasets.
| Method (with ResNet-18 backbone) | ChestX | CropDisease | EuroSAT | ISIC |
| --- | --- | --- | --- | --- |
| ProtoNet | 24.32 ± 0.41 | 83.36 ± 0.63 | 76.09 ± 0.74 | 41.60 ± 0.58 |
| ProtoNet + Jigsaw | 23.97 ± 0.39 | 77.86 ± 0.69 | 72.72 ± 0.68 | 41.22 ± 0.56 |
| ProtoNet + Rotation | 23.84 ± 0.39 | 79.11 ± 0.68 | 72.47 ± 0.69 | 43.79 ± 0.61 |
| ProtoNet + Jigsaw + Rotation | 23.73 ± 0.38 | 77.39 ± 0.68 | 71.91 ± 0.7 | 40.05 ± 0.55 |
We find that a model trained with self-supervision gives gains in this zero-shot setup in only 1 out of 4 datasets.
In CropDisease and EuroSAT, models trained with self-supervision perform much worse than the fully supervised one. With the ISIC dataset, only the model trained with the rotation loss gives an increase in the performance.
With the Conv-4 backbone too, we find that on all 4 datasets, models trained with self-supervision perform much worse than the fully supervised one when evaluated zero-shot on highly dissimilar domains.
Hence, as verified by 7 out of 8 experiments, training models with self-supervision results in highly domain-specific representations, which is a danger if the model is to be tested in a cross-domain setup.


We find that the authors' central claims, as given in Section 2, hold when the same architecture is used. With the ResNet-18 model used in the paper and an input image size of 224, self-supervision provides a consistent boost.
However, going beyond the paper’s architecture, we find that the results depend heavily on the image size and architecture and do not give the same gains with Conv-4-64, another architecture common in the few-shot learning literature, with an input image size of 84.
Future work may investigate ways to boost the performance of few-shot classifiers when the input sizes are small, and may identify better architectures for small inputs. Future work may also experiment with other available architectures to find out whether self-supervision improves performance across all configurations.
Regarding claims 2 and 3, on harder tasks and on scenarios with less labelled data in the base dataset, our experiments on selected datasets verify that the claims hold with the ResNet-18 backbone. Further, we verify claim 4 of the paper by implementing the domain selection algorithm from scratch; our experiments on all five datasets show that relative gains are achieved.
Future work may also investigate whether the same claims hold when different architectures are used.
Finally, we evaluate the miniImageNet-trained models on a more practical setting of cross-domain few-shot learning and find that SSL during the training time does not help few-shot learners generalize across domains better. Future work may investigate why applying SSL results in domain-specific features, and propose methods to apply SSL in a more domain-agnostic manner.

Recommendations for future work

We recommend that future work in few-shot learning train and evaluate with multiple architectures and image sizes, to verify results more thoroughly.

What was easy

The paper was well written, easy to follow, and provided a clear description of the experiments. The authors' code implementations were relatively easy to understand and mostly reflected the experiments described in the paper.

What was difficult

Since the codebase was not fully complete, it took us a lot of time to identify and fix bugs and to reimplement the algorithms missing from the code. Further, multiple datasets required substantial preprocessing before use. The hyperparameters were numerous but each proved important, and evaluating all the claims of the paper on 5 datasets and 2 architectures was difficult due to the number of experiment configurations, resulting in a very high computational cost of 980 GPU hours.

Communication with original authors

We maintained communication with the authors throughout our implementation and training phase, spanning two months. We were able to clarify many implementation details in the original codebase, and the authors also re-ran an experiment on their side to test if the numbers match. Further, we received a lot of help regarding the implementation of the domain selection algorithm, and could also confirm the implementation with them. We acknowledge and thank the authors for their help with the reproducibility of their paper.