Meta Dropout: Learning to Perturb Latent Features for Better Generalization

Submission to the Reproducibility Challenge 2020 for the paper "Meta Dropout: Learning to Perturb Latent Features for Better Generalization" by Lee et al. (2020), accepted to ICLR 2020.
Joel Joseph

Reproducibility Summary

This report validates the reproducibility of the ICLR 2020 paper "Meta Dropout: Learning to Perturb Latent Features for Better Generalization" by Lee et al. (2020). It covers each aspect of reproducing the results and claims put forth in the paper. The paper primarily presents a novel regularization method that can be plugged into standard meta-learning algorithms. We could only reproduce a portion of the experiments due to a lack of clarity regarding some of the experiments in the original paper.

Scope of Reproducibility

The paper proposes a novel regularization method named Meta-dropout, which learns to perturb latent features for increased generalization in the meta-learning setting. To achieve this, Metadrop multiplies the latent features of the model by learnable Gaussian noise in a layer-by-layer fashion. Multiplying the activations by noise is a well-established regularization technique.

Methodology

For reproducing the paper, we initially used the original implementation provided by the authors. However, the code the authors provide is very limited and covers only a few experiments, which made it difficult to reproduce the rest of the paper. We used Weights & Biases as our default logging and tracking tool to store all results in a single dashboard. The compute we used was exclusively Tesla P100 GPUs provided in Kaggle Notebooks.

Results

Due to this lack of clarity, we were only able to reproduce a portion of the experiments. The paper contains three sets of experiments:
(1) Metadrop vs. MAML
(2) Metadrop vs. other noise types
(3) Metadrop vs. adversarial techniques.
We were only able to reproduce the first and second sets of experiments. Our results are within 2% of the original results. We also observe that the results confirm the main claims of the paper, which we intended to verify.

What was easy

The code for the first experiment was published by the authors on GitHub. Implementing the noise types was also straightforward. The Omniglot experiments take a trivial amount of training time, which enabled us to experiment with the implementation quite easily.

What was difficult

The authors did not provide code for the second and third sets of experiments. Also, the implementation is in TensorFlow 1.x, which initially made it difficult to work with in Kaggle Notebooks. Further, the experiments on the miniImageNet dataset take a considerable amount of training time, which limited our scope of experimentation.

Communication with original authors

The authors have so far been unresponsive to our emails. We requested the rest of the training code as well as some clarification regarding the implementations to ensure accurate reproducibility.

Introduction

This reproducibility submission is an effort to validate the ICLR 2020 paper by Lee et al. (2020), titled "Meta Dropout: Learning to Perturb Latent Features for Better Generalization". This report evaluates the central claim of the paper, which proposes a novel regularization method for the meta-learning domain that aims to increase generalization to the test dataset. We find that the central claim is valid. We also provide a public dashboard to view all the reproduced results and experiments, along with the codebase used to run those experiments.

Scope of Reproducibility

The paper proposes a novel regularization method, Meta-dropout, which, when attached to meta-learning algorithms such as MAML and Meta-SGD, outperforms them by increasing their capacity to generalize to the test set of a given task.
The paper takes inspiration from the dropout regularization method. Instead of dropping out neurons, the method multiplies the latent features of a model by learned Gaussian noise to increase its generalization capacity; a minimal sketch of this idea is given after the list below.
The paper is built on the following central claims, which we set out to verify:
(1) Adding Meta-dropout to a base meta-learner such as MAML improves few-shot classification accuracy on Omniglot and miniImageNet.
(2) Learned, multiplicative, input-dependent Gaussian noise outperforms fixed or input-independent noise variants.
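To make the mechanism concrete, below is a rough PyTorch sketch of a convolutional block whose activations are multiplied by learned, input-dependent Gaussian noise. The exact parameterization (how the noise statistics are produced and constrained) is our assumption for illustration; the authors' TensorFlow implementation differs in detail.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MetaDropoutBlock(nn.Module):
    """Sketch of a conv block with learned multiplicative noise.
    The noise parameterization is an illustrative assumption, not the
    authors' exact formulation."""

    def __init__(self, in_channels, out_channels):
        super().__init__()
        # main weights, adapted in the inner loop (as in MAML)
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
        # noise-generating weights, meta-learned in the outer loop
        self.noise_conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)

    def forward(self, x, sample_noise=True):
        h = F.relu(self.conv(x))
        if sample_noise:
            mu = self.noise_conv(x)       # input-dependent noise statistics
            eps = torch.randn_like(h)     # standard Gaussian sample
            z = F.softplus(mu + eps)      # positive multiplicative perturbation
            h = h * z                     # perturb the latent features
        return F.max_pool2d(h, 2)
```

The key point is that the noise path has its own parameters, which are only meta-learned in the outer loop, while the main weights are adapted per task in the inner loop as in MAML.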

Methodology

For initial understanding and clarity, we investigated the original implementation provided by the authors at https://github.com/haebeom-lee/metadrop. To build the full experiment pipeline for the image-classification experiments, we modified their codebase and added support for experiment tracking via Weights & Biases to verify and validate the accuracy of the claims made in the paper.
We used freely available compute resources like Kaggle Notebooks for all our experiments.

Model Descriptions

For most of the experiments, the same setup as MAML was used.
The paper conducts experiments only in the domain of image classification for meta-learning. Both Metadrop and MAML use the same model architecture as the original MAML implementation by Finn et al. for the evaluation of few-shot classification performance. It consists of 4 convolutional layers with 3 × 3 kernels ("same" padding) and either 64 (Omniglot) or 32 (miniImageNet) channels per layer. Each convolution is followed by batch normalization, ReLU, and max pooling ("valid" padding).
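For reference, here is a minimal PyTorch sketch of this encoder. The authors' code is in TensorFlow 1.x, so the framework, the function name, and the lazily-sized linear head below are our own choices for illustration.

```python
import torch.nn as nn

def make_encoder(in_channels=1, hidden=64, n_way=5):
    """Rough sketch of the standard 4-block MAML encoder described above
    (hidden=64 for Omniglot; hidden=32 and in_channels=3 for miniImageNet)."""
    layers = []
    ch = in_channels
    for _ in range(4):
        layers += [
            nn.Conv2d(ch, hidden, kernel_size=3, padding=1),  # 3x3 conv, "same" padding
            nn.BatchNorm2d(hidden),
            nn.ReLU(),
            nn.MaxPool2d(2),                                  # "valid" max pooling
        ]
        ch = hidden
    layers += [nn.Flatten(), nn.LazyLinear(n_way)]            # linear classification head
    return nn.Sequential(*layers)
```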

Datasets

The paper primarily uses two datasets: Omniglot and miniImageNet. Both datasets are handled by the original code repository; the authors uploaded a ready-made version to Dropbox, which we used in our experiments.
  1. Omniglot: from the paper "Matching Networks for One Shot Learning" by Vinyals et al. The dataset is further augmented by rotating each image by multiples of 90 degrees, yielding 4 variants per class. Downloaded from the authors' Dropbox link.
  2. miniImageNet: a subset of the original ImageNet dataset by Deng et al.

Hyper-parameters

Throughout all our experiments, we used the same hyper-parameters as in the paper. The authors clearly state all the hyper-parameters needed to train the models in the experiments.
Omniglot:
1-shot: meta batch size = 8, inner learning rate = 0.1
5-shot: meta batch size = 6, inner learning rate = 0.4
Training is for 40,000 iterations; meta learning rate = 0.001
miniImageNet:
meta batch size = 4, inner learning rate = 0.01
Training is for 60,000 iterations; meta learning rate = 0.0001
For both datasets, 5 inner-loop steps are used, and the Adam optimizer is used with gradients clipped to [-3, 3].
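For convenience, the settings above can be collected into a small configuration dictionary. The key names below are our own and do not match the flag names in the authors' code.

```python
# Hyper-parameters as stated in the paper (key names are illustrative).
OMNIGLOT = {
    "1shot": {"meta_batch_size": 8, "inner_lr": 0.1},
    "5shot": {"meta_batch_size": 6, "inner_lr": 0.4},
    "meta_lr": 1e-3,
    "train_iterations": 40_000,
}
MINIIMAGENET = {
    "meta_batch_size": 4,
    "inner_lr": 0.01,
    "meta_lr": 1e-4,
    "train_iterations": 60_000,
}
COMMON = {"inner_steps": 5, "optimizer": "Adam", "grad_clip": (-3.0, 3.0)}
```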

Experimental setup

For all the experiments, we used the free GPUs provided by Kaggle Notebooks: an NVIDIA Tesla P100 with 16 GB of GPU memory.
All the experiments were conducted based on our public reimplementation repository available at https://github.com/joeljosephjin/metadrop.

Computational requirements

Training a single model takes between 2 and 8 hours and uses less than 8 GB of RAM. The Omniglot experiments take 2-4 hours each, while the miniImageNet experiments take 4-6 hours each. In total, all the experiments took about 170 hours of compute time.

Results

To establish the claim that Meta-dropout outperforms its base models, experiments were conducted on the Omniglot and miniImageNet datasets in two different settings: 1-shot and 5-shot.

Omniglot Image Classification

Original results:

            MAML            MAML+Metadrop
1-Shot      95.23 ± 0.17    96.63 ± 0.13
5-Shot      98.38 ± 0.07    98.73 ± 0.06

Reproduced results:

            MAML            MAML+Metadrop
1-Shot      93.60 ± 0.18    95.75 ± 0.15
5-Shot      98.00 ± 0.08    98.85 ± 0.06
The reproduced results are within 2% of the original accuracies, and adding Metadrop resulted in an increase of 2.15 and 0.85 percentage points in the 1-shot and 5-shot settings respectively.

(Figures: training curves for the 1-shot and 5-shot Omniglot experiments are available on the public dashboard.)

We can also observe that in each case Metadrop takes roughly double the training time. This is expected, since Metadrop uses twice the number of trainable parameters and hence twice the compute for gradient updates.

miniImageNet Image Classification

Original results:

            MAML            MAML+Metadrop
1-Shot      49.58 ± 0.65    51.93 ± 0.67
5-Shot      64.55 ± 0.52    67.42 ± 0.52

Reproduced results:

            MAML            MAML+Metadrop
1-Shot      49.25 ± 0.60    50.90 ± 0.67
5-Shot      64.51 ± 0.52    67.18 ± 0.55
The reproduced results are within 1% of the original results, and the Metadrop model achieves an improvement of 1.65 and 2.67 percentage points in the 1-shot and 5-shot settings respectively.

(Figures: training curves for the 1-shot and 5-shot miniImageNet experiments are available on the public dashboard.)

We can see that the Metadrop models take roughly 1.5 times the training time of MAML in both settings on the miniImageNet dataset.

Hyper-parameter Sweeps

(Figures: sweep plots of MAML's inner- and outer-loop learning rates against test accuracy, and of Metadrop's inner and outer learning rates against accuracy, are available on the public dashboard.)
The hyper-parameter sweeps presented here were performed on the Omniglot dataset. We observe that, for both Metadrop and MAML, the inner learning rate has little effect on accuracy, while the outer learning rate has a clear optimal value.
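Since we used Weights & Biases for tracking, a sweep like the one above can be expressed as a W&B sweep configuration. The project name and value grids below are illustrative, not the exact ones we ran.

```python
import wandb

# Hypothetical sweep over the inner and outer (meta) learning rates.
sweep_config = {
    "method": "grid",
    "metric": {"name": "test_accuracy", "goal": "maximize"},
    "parameters": {
        "inner_lr": {"values": [0.01, 0.05, 0.1, 0.4]},
        "meta_lr": {"values": [1e-4, 5e-4, 1e-3, 5e-3]},
    },
}

sweep_id = wandb.sweep(sweep_config, project="metadrop-reproduction")
# wandb.agent(sweep_id, function=train)  # train() runs a single configuration
```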

Noise types Ablations

Original results:

            MAML    MAML+Fixed Gaussian   MAML+Weight Gaussian   MAML+Independent Gaussian   MAML+Metadrop
1-Shot      95.23   95.44                 94.32                  94.36                       96.63
5-Shot      98.38   98.99                 98.35                  98.26                       98.73

Reproduced results:

            MAML    MAML+Fixed Gaussian   MAML+Weight Gaussian   MAML+Independent Gaussian   MAML+Metadrop
1-Shot      93.60   96.41                 96.28                  96.52                       96.74
5-Shot      98.00   98.89                 98.36                  97.46                       98.85
We obtained results within 2% of the original, and our results verify the claim in the original paper that Metadrop and Fixed Gaussian are the only noise types that achieve top performance. We also observe that Fixed Gaussian is only 0.04 percentage points ahead of Metadrop in the 5-shot setting, while Metadrop is the best performer in the 1-shot setting, supporting the claim that learned multiplicative Gaussian noise is preferable to the other noise types.
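To clarify how the noise-type baselines differ, below is a minimal sketch of multiplicative noise variants applied to a layer's activations. The parameterizations here are assumptions for illustration only and may not match the exact definitions in the paper; the weight-noise variant, which perturbs the weights rather than the activations, is omitted.

```python
import torch
import torch.nn.functional as F

def perturb(h, x, noise_type, noise_net=None, learned_mu=None, sigma=0.5):
    """Illustrative multiplicative-noise variants applied to activations h.
    `noise_net` (a small network over the input x) and `learned_mu` (a free
    parameter) stand in for meta-learned noise parameters; the authors'
    exact parameterizations may differ."""
    eps = torch.randn_like(h)
    if noise_type == "fixed_gaussian":
        # fixed, non-learned scale (Gaussian-dropout-like)
        return h * (1.0 + sigma * eps)
    if noise_type == "independent_gaussian":
        # learned scale that does not depend on the input x
        return h * F.softplus(learned_mu + eps)
    if noise_type == "metadrop":
        # learned, input-dependent noise (Meta-dropout)
        return h * F.softplus(noise_net(x) + eps)
    return h  # plain MAML: no perturbation
```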

Discussion

Based on the results we obtained, it is clear that the claims in the paper hold. Although compute time is almost doubled in the case of Meta-dropout, it out-competes its base models by a significant margin. From the noise-type experiments, we also see that Meta-dropout combines two distinct enhancements, multiplicative noise and learned Gaussian noise, and that this combination is better than the other variations we explored.

What was easy

Overall the structure of the paper was easy to follow. The paper also clearly mentioned all hyper-parameters used in the main experiments and clearly defined their experimental setting, making it easier to reimplement them. Additionally, the original implementation provided by the authors served as a helpful resource for reimplementing the Noise Type experiments.

What was difficult

The authors provided code only for the experiments in Table 1 (Results on Image Classification) of the original paper. The code for Meta-SGD with Meta-dropout, the MAML ablations with other noise types, and the adversarial benchmarking was not provided in the original repository. We were able to reproduce MAML with the different noise types, but ambiguous details were missing from the original paper, so we had to conduct additional experiments to arrive at suitable hyper-parameters.

Communication with authors

We tried to contact the authors regarding the remaining code that they have not published and the ambiguous details of the experiments, but so far we have not received any reply.

Conclusion

We were able to replicate the main results of the paper, which confirm the claim that Meta-dropout is a competitive regularization method for the meta-learning setting. We were also able to verify the advantage of multiplying by learned Gaussian noise over additive and non-learnable variations. Although the compute times were significantly higher, the Meta-dropout technique increases the generalization capacity of the base algorithm considerably. We would also encourage the original authors to make more of their official code public.