iTAML: An Incremental Task-Agnostic Meta-learning Approach

A reproduction of the paper 'iTAML: An Incremental Task-Agnostic Meta-learning Approach' by Rajasegaran et al. (2020), accepted to CVPR 2020.
Joel Joseph

Reproducibility Summary

This report validates the reproducibility of the CVPR 2020 paper "iTAML: An Incremental Task-Agnostic Meta-learning Approach" by Rajasegaran et al. (2020). It covers each aspect of reproducing the results and claims put forth in the paper. The paper primarily presents a novel Meta Learning approach to Incremental Learning. The experiments were simple to reproduce thanks to the official implementation, although the results were unexpected.

Scope of Reproducibility

The paper proposes a novel gradient based Incremental Learning algorithm named Incremental Task Agnostic Meta Learning (iTAML) that aims to avoid catastrophic forgetting by balancing knowledge from old and new tasks. The algorithm draws inspiration from the Meta-Learning algorithm MAML used for Few Shot Classification. Two of the main contributions by iTAML that make it superior to being just an adaptation of MAML to the Incremental Learning arena are the Balancing Factor used to leverage old and new information and its ability to do as little as one inner update during training.

Methodology

To reproduce the paper, we used the original implementation provided by the authors. It was enough for us to verify the results on the main datasets. We used Weights & Biases as our default logging and tracking tool to store all results in a single dashboard. For compute, we used the Tesla P100 GPU available in Kaggle Notebooks for most of our experiments. We also made use of a Google Colab Pro subscription for investigating the code.

Results

We attempted to reproduce most of the experiments on the lighter datasets. The results we obtained on the MNIST, SVHN and CIFAR datasets are much lower than the results in the paper, and hence we are unable to validate the main claim of the paper regarding the superior performance of the iTAML algorithm. For this reason, we did not feel the need to try the experiments on the heavier datasets.

What was easy

Getting started with running the code was straightforward from the instructions in the readme of the official repo. The main mechanism of the algorithm was also described clearly in the paper. Low running time of experiments on the MNIST dataset enabled us to tinker with the code smoothly.

What was difficult

The model gave very low results with the hyperparameters provided in the official code repo, so we had to perform hyperparameter optimization on every experiment, which consumed significant compute. In addition, certain hyperparameters mentioned in the paper seemed to be absent from the implementation, which made the method harder to understand.

Communication with original authors

We were unable to contact the authors. We created a GitHub issue in the original repository but have not received any replies so far. We also noticed other issues in the repo, more than a year old, that ask questions about the reproducibility of the algorithm and have received no replies yet.

Introduction

This report is an attempt to validate the CVPR 2020 paper by Rajasegaran et al. (2020), titled "iTAML: An Incremental Task-Agnostic Meta-learning Approach". It evaluates the central claim of the paper, which proposes a novel algorithm for the Incremental Learning domain. The paper develops a variation of the Reptile algorithm from the Meta Learning domain. After extensive experiments and investigations we were unable to verify the main claims of the paper. We provide detailed visualizations of different behaviors of the algorithm under varying hyperparameters and update rules.

Scope of Reproducibility

The paper proposes incremental Task Agnostic Meta Learning (iTAML), a novel incremental learning algorithm that claims to outperform the existing baselines in the field. iTAML focuses on learning a generalized parameter which can quickly adapt to any given task in the continuum using a task specific inner update. This is similar to the popular Meta Learning algorithm, known as Model Agnostic Meta Learning (MAML) which made a breakthrough in the gradient based meta learning category with its introduction.
The main mechanisms of the paper are:
The Update Rule: The Reptile algorithm is a close relative of the MAML algorithm with a more insightful update rule. The Reptile update rule is defined as:

$\theta_{1} = \theta_{0} + \epsilon * \overline{(\theta_{taski} - \theta_{0})}$

where $\epsilon$ is the outer learning rate. Here, the role of the gradient is played by the average of the differences between the task specific models and the base model.
This can also be written as:

$\theta_{1} = \theta_{0} + (\epsilon / n) * \sum_0^t\theta_{taski} - \epsilon * \theta_{0}$

which can be further rewritten as:

$\theta_{1} = (1 - \epsilon) * \theta_{0} + \epsilon * \sum_0^t\theta_{taski}$ / n

where n is the total number of tasks encountered.
And the iTAML update rule is:

$\theta_{1} = (1 - \eta) * \theta_{0} + \eta * \sum_0^t\theta_{taski} / t$

We see that the two updates are almost identical. The only difference is that $\eta$ is not fixed but varies during training. $\eta$ is called the Balancing Factor.
Balancing Factor: In the Reptile update $\epsilon$ is fixed, while in iTAML $\eta$ varies during training. The purpose of a variable $\eta$ is to balance the information from old and new tasks, so that the contribution of the new task decreases with each new task iTAML encounters. This is intended to reduce Catastrophic Forgetting, which we describe in the Discussion section.

$\eta = \exp(- \beta * t / T)$

where t is the current task and T is the total number of tasks it will encounter.
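
The two update rules and the Balancing Factor above can be sketched in a few lines of plain Python. This is a minimal illustration on lists of floats, not the official implementation: the function names, the value beta = 1.0, the fixed step 0.5 and the assumption that t is counted from 1 are all ours and are not taken from the paper or the repository.

```python
import math


def balancing_factor(t, T, beta=1.0):
    """eta = exp(-beta * t / T). beta=1.0 is a placeholder, not a value from the paper."""
    return math.exp(-beta * t / T)


def meta_update(theta_0, task_thetas, step):
    """Return (1 - step) * theta_0 + step * mean(task_thetas), element-wise.

    With a fixed step this is the Reptile form; with step = eta it is the iTAML form.
    """
    n = len(task_thetas)
    mean_theta = [sum(ws) / n for ws in zip(*task_thetas)]
    return [(1 - step) * w0 + step * wm for w0, wm in zip(theta_0, mean_theta)]


# Toy base parameters and two task specific parameter vectors (hypothetical values).
theta_0 = [0.0, 1.0]
task_thetas = [[1.0, 1.0], [3.0, -1.0]]

# Reptile: fixed outer learning rate epsilon (0.5 chosen only for illustration).
reptile = meta_update(theta_0, task_thetas, step=0.5)

# iTAML: eta shrinks as the task index t grows (t assumed to run from 1 to T).
T = 5
etas = [balancing_factor(t, T) for t in range(1, T + 1)]
itaml = meta_update(theta_0, task_thetas, step=etas[1])  # eta at t = 2
```

Under these assumptions the only thing that changes between the two calls is the step size, which is the point made in the text: iTAML reuses the Reptile interpolation but lets the interpolation weight decay with the task index.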

Methodology

For initial investigations we used the official implementation provided by the authors at https://github.com/brjathu/iTAML. To track experiments and perform hyperparameter optimizations, we modified the codebase to incorporate Weights & Biases, which can be seen at https://github.com/joeljosephjin/Reproducibility-Challenge-iTAML.
We mainly used Kaggle Notebooks for our experiments. We also made use of paid subscription of Google Colab Pro for performing swift interactive experiments on GPUs. We also developed a minimal version of the codebase for basic error debugging at https://github.com/joeljosephjin/itaml-pytorch.

Model Descriptions

From the implementation, we can see that iTAML uses the same model architecture from the paper RPSNet, the SOTA in the domain. For MNIST, the architecture is two pairs of Linear and ReLU layers followed by another Linear layer. And for SVHN and CIFAR-100 datasets, the model is a combination of 3-4 layers of blocks of Convnets, BatchNorm and ReLU layers followed by a Linear layer. Kaiming Initialization was used in all the models.

Datasets

The paper evaluates the algorithm on several datasets:
MNIST and SVHN have been used in almost all previous papers in Incremental Learning. They take less than an hour to train on a GPU. MNIST has 10 classes of handwritten digits from 0 to 9. In the default configuration, the 10 classes are divided into 5 tasks which are fed sequentially to the iTAML algorithm. All the datasets are available through PyTorch's native data loaders, which can be used for regular Computer Vision purposes. For Incremental Learning, we used the open source Incremental Learning loader given by https://github.com/khurramjaved96/incremental-learning.
CIFAR-100 is used in three modes:
  1. 5 Classes per task - where there are 20 sequential tasks
  2. 10 Classes per task - with 10 sequential tasks
  3. 20 Classes per task - 5 sequential tasks
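
The task splits described above can be illustrated with a short sketch. It assumes classes are assigned to tasks in label order, which may differ from what the incremental loader actually does (it could, for example, shuffle the class order); the function name is hypothetical.

```python
def split_into_tasks(num_classes, classes_per_task):
    """Partition class labels 0..num_classes-1 into consecutive, equally sized tasks."""
    if num_classes % classes_per_task != 0:
        raise ValueError("num_classes must be divisible by classes_per_task")
    return [
        list(range(start, start + classes_per_task))
        for start in range(0, num_classes, classes_per_task)
    ]


# MNIST default configuration: 10 classes split into 5 tasks of 2 classes each.
mnist_tasks = split_into_tasks(10, 2)

# CIFAR-100 modes: number of sequential tasks for 5, 10 and 20 classes per task.
cifar_modes = {k: len(split_into_tasks(100, k)) for k in (5, 10, 20)}
```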

Hyper-parameters

Although some hyperparameters were mentioned in the paper, we did not find them very useful. Initially we used the default hyperparameters from the implementation, with which learning broke down. Hence, we performed hyperparameter optimization for each parameter to arrive at a working result.

Experimental setup

For all the light datasets, we performed the experiments in the Google Colab Pro Notebooks which give NVIDIA P100 GPUs. For the CIFAR experiments, we used Kaggle Notebooks to do overnight training.
All the experiments we performed are publicly visible in our wandb repository at https://wandb.ai/joeljosephjin/itaml. They are all based on our code base at https://github.com/joeljosephjin/Reproducibility-Challenge-iTAML.

Computational requirements

For the experiments on lighter datasets, a single NVIDIA P100 GPU can train the model in less than an hour, while the experiments on heavier datasets require 7-12 hours of training on the same GPU. Both kinds of experiments use less than 12 GB of RAM and an equal amount of memory to store the datasets.

Results

MNIST Image Classification

Methods Accuracy on MNIST
MAS 19.52%
LwF 24.17%
GEM 92.20%
DGR 91.24%
RtF 92.56%
RPS-net 96.16%
iTAML (Original) 97.95%
iTAML (Our Result) 59.65%
For the first task, the accuracy is 99%, but it then decreases to 87%, 74%, 68% and finally 60%. This indicates that the algorithm is able to adapt to a single task but the generalization capacity decreases catastrophically. It underperforms most of the baselines and is superior only to the MAS and LwF algorithms.

SVHN Image Classification

Methods Accuracy on SVHN
MAS 17.32%
LwF -
GEM 75.61%
DGR -
RtF -
RPS-net 88.91%
iTAML (Original) 93.97%
iTAML (Our Result) 71.24%
The accuracy decreases from 99% for the first task to 91%, 84%, 76% and 71% for the last task. It slightly underperforms the GEM algorithm.

CIFAR Image Classification

Methods Accuracy on CIFAR
MAS 17.32%
LwF -
GEM 75.61%
DGR -
RtF -
RPS-net 88.91%
iTAML (Original) 93.97%
iTAML (Our Result) 34.84%
Starting from an accuracy of 82% for the first task, the accuracy decreases to 35% for the final task. It only outperforms the MAS algorithm here.

Hyperparameter Optimization

iTAML on CIFAR

The Learning Rate is the rate of update of the inner loop. We obtain maximum accuracy at 1e-3.

iTAML on MNIST

Discussion

Catastrophic Forgetting is a prevalent phenomenon in the continual learning domain. As the name indicates, it causes the model to "forget" how to perform well on older tasks as it encounters and adapts to newer tasks. We see this phenomenon in the case of our experiments with the iTAML algorithm. The model performs well on the initial tasks, indicating that the model is capable of performing well on at least one task, but the accuracy decreases incrementally with each new task it sees. This can be seen in the case of the experiments on all three datasets-MNIST, SVHN and CIFAR.
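
As a small worked example of this decline, the per-task accuracies quoted in the Results section for MNIST and SVHN can be turned into step-wise and total drops. The numbers are the rounded percentages reported above; CIFAR is left out because only its first and last values are quoted. The helper name is ours.

```python
# Rounded accuracies (%) after each task, as quoted in the Results section.
reported = {
    "MNIST": [99, 87, 74, 68, 60],
    "SVHN": [99, 91, 84, 76, 71],
}


def step_drops(accs):
    """Accuracy lost between consecutive tasks, in percentage points."""
    return [prev - cur for prev, cur in zip(accs, accs[1:])]


drops = {name: step_drops(accs) for name, accs in reported.items()}
total_drop = {name: accs[0] - accs[-1] for name, accs in reported.items()}

for name in reported:
    print(name, "step drops:", drops[name], "total drop:", total_drop[name])
```

On these figures the accuracy falls at every step on both datasets, by 39 percentage points in total on MNIST and 28 on SVHN.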
We tried optimizing every set of hyperparameters given in the official implementation, but failed to obtain any promising result that can outperform the previous state of the art in the domain. This rules out the possibility that we used the wrong parameters. We also tinkered extensively with the code and were unable to find any possible bug in the implementation. We provide a few of those experiments in the Additional Experiments section. Some of these experiments gave us the impression that the Reptile meta update rule (main mechanism of the paper) is not crucial to the success of the algorithm.

What was easy

The official repository of the paper was instructive in getting started with the experiments. The main mechanism of the paper is easily understandable especially for those already familiar with the gradient based Meta Learning algorithms - MAML and Reptile. Experimenting on the MNIST dataset was convenient due to its low running time of 7-12 minutes on GPU.

What was difficult

Although the official code repo was easy to start experimenting with, the code seemed incomplete for verifying several of the secondary mechanisms described in the paper. This made the reproducibility exercise very challenging. For example, the paper describes the hyperparameter 'p', representing the size of the data continuum, as crucial in determining the performance of the model, but we were unable to find any such parameter in the implementation.
Even after extensive hyperparameter optimization experiments and investigations into the implementation, we were unable to reach the performance described in the paper.

Communication with authors

We were not able to contact the authors. We created an issue in the official GitHub repository, but have received no reply so far.

Conclusion

Although we did not find the algorithm to outperform the previous state of the art, even the lower accuracy we obtained outperforms some of the earlier baselines, which suggests that applying gradient based Reptile techniques to the Incremental Learning domain is useful. However, we encourage the authors to provide a more stable implementation and to clarify how they obtained the higher accuracies.