Interventional Few-Shot Learning
A reproduction of the recent paper Interventional Few-Shot Learning by Yue et al. (2020)
Created on January 28 | Last edited on April 22
Scope of Reproducibility
This post covers a recent paper by Yue et al. entitled Interventional Few-Shot Learning. The main claim of the paper is that by using intervention in a few-shot learning (FSL) problem, we can lower the bias coming from pre-trained models.
Moreover, the authors claim to have evidence that the pre-trained model acts as a confounding variable in FSL tasks, meaning that relying too heavily on pre-trained features when fine-tuning on the support samples may lead to more errors on a query set that differs significantly from the support set. They propose three ways of implementing the intervention, based on a backdoor adjustment for their structural causal model. These are as follows:
- Feature-wise adjustment
- Class-wise adjustment
- Combined feature and class-wise adjustment
The report focuses on the combined adjustment, as it gave the best results in almost all of the cases mentioned in the paper.
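To make the backdoor adjustment more concrete, below is a minimal, illustrative sketch of the class-wise variant (the combined variant additionally stratifies the feature dimensions). All module names, shapes, and the uniform prior over strata are our own assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn


class ClassWiseAdjustedClassifier(nn.Module):
    """Illustrative class-wise backdoor adjustment, P(Y|do(X)) ~ sum_d P(Y|X, d) P(d).

    The confounder (pre-trained knowledge) is stratified into the
    pre-training classes, represented here by their mean features,
    and the per-stratum predictions are averaged with a uniform prior.
    Names, shapes and the uniform prior are assumptions, not the
    authors' code.
    """

    def __init__(self, feat_dim, n_way, class_prototypes):
        super().__init__()
        # class_prototypes: [n_pretrain_classes, feat_dim] mean features of
        # the pre-training classes, acting as the confounder strata d.
        self.register_buffer("prototypes", class_prototypes)
        self.classifier = nn.Linear(2 * feat_dim, n_way)

    def forward(self, x):
        # x: [batch, feat_dim] features from the frozen pre-trained backbone
        per_stratum_logits = []
        for proto in self.prototypes:              # one stratum d at a time
            d = proto.expand_as(x)                 # stratum-specific context
            per_stratum_logits.append(self.classifier(torch.cat([x, d], dim=-1)))
        # uniform P(d): average the predictions over all strata
        return torch.stack(per_stratum_logits, dim=0).mean(dim=0)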
Due to compute constraints (and some dataset unknowns) we decided to restrict our reproduction to the MTL (meta-transfer learning) and SIB settings on the mini-ImageNet dataset. The goal of the original paper was to improve the results of meta-learning algorithms, so the methods should be compared relative to each other.
Methodology
The authors provided the code, which can be found here. The mini-ImageNet split was taken from here for the SIB implementation and from here, as stated by the authors in their repository.
We changed some minor elements of the pipeline to enable additional logging and hyperparameter optimization. Training was generally done on a single RTX 2080 GPU; no multi-GPU training was tested. The time needed to train one model varied from about 5 hours for the 1-shot ResNet-based MTL setting to about 70 hours for the 5-shot ResNet-based IFSL setting.
Dataset
We focus on the mini-ImageNet dataset; in the paper the authors used the tiered-ImageNet and CUB datasets as well. Mini-ImageNet contains 600 images per class over 100 classes. We followed the split proposed in [1]: 64/16/20 classes for train/val/test.
[1] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In International Conference on Learning Representations, 2017.
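Evaluation in this setting is episodic: each episode samples N classes, K support shots per class, and a number of query images. The sketch below shows a generic way such an episode could be drawn from the split above; it is an assumption for illustration, not the repository's actual data loader.

```python
import random
from collections import defaultdict


def sample_episode(labels, n_way=5, k_shot=1, n_query=15):
    """Sample one N-way K-shot episode from a flat list of class labels.

    Returns support and query index lists. A generic sketch of episodic
    sampling on top of the 64/16/20 class split, not the repository's
    data loader.
    """
    by_class = defaultdict(list)
    for idx, cls in enumerate(labels):
        by_class[cls].append(idx)

    episode_classes = random.sample(list(by_class), n_way)
    support, query = [], []
    for cls in episode_classes:
        picked = random.sample(by_class[cls], k_shot + n_query)
        support += picked[:k_shot]
        query += picked[k_shot:]
    return support, query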
What was Easy
Running the MTL and SIB algorithms was quite easy - for a basic run the provided instructions were sufficient. The only missing piece was sufficiently configurable paths - one had to look through the whole project to change paths in a given file. Adding extra logging and a hyperparameter search with the Weights & Biases framework didn't cause much trouble either.
What was difficult
The reason we focused only on the MTL and SIB algorithms was reproducibility issues in the other cases - errors such as missing *.npy files in the MAML example, which were nowhere to be found for download, or configuration issues that occurred when trying to run the LEO algorithm.
There was a problem with the consistency of the mini-ImageNet dataset download - the main repository stated that one should download it using this URL, whereas the subrepository with the MAML code stated that one should use this, which required downloading the whole of ImageNet. Due to the missing "novel.hdf5" file I couldn't reproduce the results from the MAML part of the paper. I described the issue in the GitHub repository.
The other thing to note is the length of training - it was possible to train the baselines in about 10-30 hours on a single RTX 2080, depending on the backbone architecture (ResNet or Wide-ResNet). Introducing the intervention, however, increased this time by a factor of about 5, with less efficient GPU utilisation.
Communication with original authors
I tried to contact the authors only through the repository with the official code implementation, but got no response. A link to the discussion can be found here.
Introduction
Few-shot learning has become a rapidly growing field of research in recent years. This is mostly because data for classes outside of the most common datasets is usually scarce, and there is a growing need to quickly update models for new classes. Few-shot learning is commonly divided into three cases:
- Zero-shot learning - based only on features of the new classes, without directly training on them.
- One-shot learning - training on only one example from each new class.
- Few-shot learning - training on a few examples per class, usually from 2 to 5.
The most common approach to few-shot learning is transfer learning. The problem with this approach is that parameters are not easily transferable between different tasks (e.g. classification and detection), and the model relies heavily on the data distribution found in the training set. To counter these problems, algorithms such as matching networks and model-agnostic meta-learning (MAML) were established. The latter in particular shows an interesting path - the model can learn meta-parameters instead of prediction parameters, which can later be easily transferred to different tasks. In this work, meta-transfer learning (MTL) is briefly presented along with the SIB (Empirical Bayes Transductive Meta-Learning with Synthetic Gradients) algorithm.
SIB
The idea behind SIB is to use so-called synthetic gradients in the process of learning a neural network. They are estimated by another model, e.g. a small neural network, which allows parts of the model to be trained in parallel. The benefit is constructing a better posterior distribution over the classifier, which achieves better results than the standard MAML method.
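As a rough intuition, a synthetic-gradient module predicts a gradient signal from unlabelled query logits, so the support-initialised classifier can be refined transductively. The sketch below is a toy illustration of this idea; the names and the exact update rule are our assumptions, not the paper's formulation.

```python
import torch
import torch.nn as nn


class SyntheticGradientHead(nn.Module):
    """Toy sketch of a synthetic-gradient update in the spirit of SIB.

    A small network predicts a gradient signal from the (unlabelled)
    query logits, so the linear classifier initialised on the support
    set can be refined without query labels. Illustrative assumptions
    only, not the paper's exact method.
    """

    def __init__(self, n_way, hidden=128):
        super().__init__()
        self.grad_net = nn.Sequential(
            nn.Linear(n_way, hidden), nn.ReLU(),
            nn.Linear(hidden, n_way),
        )

    def refine(self, weight, query_feats, steps=3, lr=1e-3):
        # weight: [n_way, feat_dim] classifier initialised from the support set
        # query_feats: [n_query, feat_dim] features of the unlabelled queries
        for _ in range(steps):
            logits = query_feats @ weight.t()      # [n_query, n_way]
            synth_grad = self.grad_net(logits)     # predicted dL/d(logits)
            # chain rule: dL/dW is approximated by synth_grad^T @ query_feats
            weight = weight - lr * synth_grad.t() @ query_feats
        return weight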
Meta-transfer learning
Parameter-level fine-tuning (FT) is a conventional meta-training operation, e.g. in MAML, whose updates touch all neuron parameters, W and b. In MTL, neuron-level scaling and shifting (SS) operations are learned instead. They reduce the number of learned parameters and help avoid overfitting. In addition, they keep the large-scale pre-trained parameters frozen, preventing "catastrophic forgetting".
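The sketch below illustrates the SS idea on a single convolutional layer: the pre-trained weights stay frozen and only a per-channel scale and shift are meta-learned. The module and parameter names are our own assumptions, not the MTL repository's code.

```python
import torch
import torch.nn as nn


class ScaleShiftConv2d(nn.Module):
    """Sketch of the neuron-level scaling and shifting (SS) idea in MTL.

    The pre-trained convolution stays frozen; only a per-channel scale
    and shift are meta-learned, which keeps the number of trainable
    parameters small and prevents forgetting the large-scale
    pre-training. Names are illustrative assumptions.
    """

    def __init__(self, pretrained_conv: nn.Conv2d):
        super().__init__()
        self.conv = pretrained_conv
        for p in self.conv.parameters():                       # freeze W and b
            p.requires_grad = False
        out_ch = pretrained_conv.out_channels
        self.scale = nn.Parameter(torch.ones(out_ch, 1, 1))    # SS scale
        self.shift = nn.Parameter(torch.zeros(out_ch, 1, 1))   # SS shift

    def forward(self, x):
        # frozen convolution, then learned per-channel scale and shift
        return self.conv(x) * self.scale + self.shift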
Results
For all runs we used the default hyperparameters provided in the authors' code. We tried to perform hyperparameter optimization, however for all runs the results were worse compared to the original implementation. Due to computational constraints, we focused on ResNet-based architectures only. We provide plots from training, with training loss, training accuracy and validation accuracy being the most important metrics. In the tables below we show the results obtained on the mini-ImageNet test set.
MTL results
[W&B panels: MTL training curves (run set of 4)]
Final test set performance:
| Model | Test acc 1-shot | Test acc 5-shot |
|---|---|---|
| ResNet-10 Baseline - ours | 53.01 ± 0.44 | 72.27 ± 0.36 |
| ResNet-10 IFSL - ours | 60.16 ± 0.44 | 78.03 ± 0.34 |
| ResNet-10 Baseline - paper | 58.49 ± 0.46 | 75.65 ± 0.35 |
| ResNet-10 IFSL - paper | 61.17 ± 0.45 | 78.03 ± 0.33 |
We can observe that the performance improvement in this setting is much higher (around 5 to 7 percentage points of accuracy), although our baseline results are significantly lower than those reported by the authors. Moreover, below we show the validation-set examples with the highest loss, which were wrongly predicted by the model.
[W&B panel: validation examples with the highest loss (run set of 8)]
SIB results
[W&B panels: SIB training curves (run set of 4)]
Final test set performance:
| Setting used | Test acc 1-shot | Test acc 5-shot |
|---|---|---|
| ResNet-10 Baseline - ours | 67.33 ± 0.59 | 79.01 ± 0.37 |
| ResNet-10 IFSL - ours | 68.20 ± 0.56 | 79.93 ± 0.35 |
| ResNet-10 Baseline - paper | 67.10 ± 0.56 | 78.88 ± 0.35 |
| ResNet-10 IFSL - paper | 68.85 ± 0.56 | 80.32 ± 0.35 |
We can also see that for the ResNet architecture we achieve results slightly worse than the means reported by the authors; however, excluding the 1-shot IFSL case, they fall within the reported confidence intervals.
Summary
Pros of the method
- It can be used with any meta-learning algorithm.
- Seems to improve accuracy in multiple cases and across quite different approaches (meta-learning, transfer learning, Bayesian).
Cons of the method
- Requires much more training time than the corresponding method without intervention.
- It's not easy to implement and start using across multiple settings.
According to the paper, the most important thing is the performance improvement. The authors report improvements in the range of 1.5-2 percentage points of accuracy on the test set. We observe improvements of 0.5-1 percentage points for all runs of the SIB algorithm, but a much more significant gain for the MTL algorithm. This may suggest that settings which rely more heavily on transfer learning, such as MTL, are more susceptible to the bias that comes from it. It's worth noting that in all cases the performance is actually improved.
Additional links:
Hyperparameter tuning for baseline: https://wandb.ai/freefeynman123/mtl_baseline_sweeps?workspace=user-freefeynman123