iTAML: An Incremental Task-Agnostic Meta-learning Approach

A reproduction of the paper 'iTAML: An Incremental Task-Agnostic Meta-learning Approach' by Rajasegaran et al. (2020), accepted to CVPR 2020.
Joel Joseph
Created on July 31|Last edited on October 15
Reproducibility Summary
This report examines the reproducibility of the CVPR 2020 paper "iTAML: An Incremental Task-Agnostic Meta-learning Approach" by Rajasegaran et al. (2020). It covers each aspect of reproducing the results and claims put forth in the paper, which presents a novel Meta Learning approach to Incremental Learning. The experiments were simple to run thanks to the official implementation, but the results we obtained were far below those reported.
Scope of Reproducibility
The paper proposes a novel gradient based Incremental Learning algorithm named Incremental Task Agnostic Meta Learning (iTAML) that aims to avoid catastrophic forgetting by balancing knowledge from old and new tasks. The algorithm draws inspiration from MAML, a Meta-Learning algorithm used for Few Shot Classification. Two contributions distinguish iTAML from a plain adaptation of MAML to Incremental Learning: the Balancing Factor, which weighs old against new information, and the ability to train with as little as one inner update.
Methodology
To reproduce the paper, we used the original implementation provided by the authors. It was enough for us to test the results on the main datasets. We used Weights & Biases as our default logging and tracking tool to store all results in a single dashboard. For compute, we used the Tesla P100 GPU available in Kaggle Notebooks for most of our experiments. We also used a Google Colab Pro subscription for investigating the code.
Results
We attempted to reproduce most of the experiments on the lighter datasets. The results we obtained on the MNIST, SVHN and CIFAR datasets are far lower than the results in the paper, and hence we are unable to validate the paper's main claim of superior performance for the iTAML algorithm. For this reason, we did not run the experiments on the heavier datasets.
What was easy
Getting started with running the code was straightforward using the instructions in the readme of the official repo. The main mechanism of the algorithm was also described clearly in the paper. The low running time of experiments on the MNIST dataset let us tinker with the code smoothly.
What was difficult
The model gave very low results with the hyperparameters provided in the official code repo, so we had to perform hyperparameter optimization for every experiment, which consumed significant compute. In addition, certain hyperparameters mentioned in the paper seemed to be absent from the implementation, which made the method harder to understand.
Communication with original authors
We were unable to contact the authors. We created a GitHub issue in the original repository but have not received any reply so far. We also noticed other issues in the repo, more than a year old, that asked about the reproducibility of the algorithm and had received no replies.
Introduction
This report is an attempt to validate the CVPR 2020 paper by Rajasegaran et al. (2020), titled "iTAML: An Incremental Task-Agnostic Meta-learning Approach". It evaluates the central claim of the paper, which proposes a novel algorithm for the Incremental Learning domain. The paper develops a variation of the Reptile algorithm from the Meta Learning domain. After extensive experiments and investigations we were unable to verify the main claims of the paper. We provide detailed visualizations of the behavior of the algorithm under varying hyperparameters and update rules.
Scope of Reproducibility
The paper proposes incremental Task Agnostic Meta Learning (iTAML), a novel incremental learning algorithm that claims to outperform the existing baselines in the field. iTAML focuses on learning a generalized set of parameters that can quickly adapt to any given task in the continuum using a task-specific inner update. This is similar to the popular Meta Learning algorithm Model Agnostic Meta Learning (MAML), whose introduction was a breakthrough in gradient based meta learning.
The main mechanisms of the paper are:
The Update Rule
Balancing Factor
The Update Rule: The Reptile algorithm is a close relative of the MAML algorithm with a simpler, first-order update rule. The Reptile update rule is defined as:
$$\theta_{1} = \theta_{0} + \epsilon \cdot \overline{(\theta_{task_i} - \theta_{0})}$$
where $\epsilon$ is the outer learning rate. Here, the average of the differences between the task-specific models and the base model plays the role of the gradient.
This can also be written as:
$$\theta_{1} = \theta_{0} + \frac{\epsilon}{n} \cdot \sum_{i=0}^{t} \theta_{task_i} - \epsilon \cdot \theta_{0}$$
which can be further rewritten as:
$$\theta_{1} = (1 - \epsilon) \cdot \theta_{0} + \epsilon \cdot \sum_{i=0}^{t} \theta_{task_i} / n$$
where n is the total number of tasks encountered. 
And the iTAML update rule is:
$$\theta_{1} = (1 - \eta) \cdot \theta_{0} + \eta \cdot \sum_{i=0}^{t} \theta_{task_i} / t$$
We see that the two updates are almost identical. The only difference is that $\eta$ is not fixed but varies during training. $\eta$ is called the Balancing Factor.
Balancing Factor: In the Reptile update $\epsilon$ is fixed, while in iTAML $\eta$ varies during training. The purpose of a variable $\eta$ is to balance the information from old and new tasks, so that the contribution of each new task decreases as iTAML encounters more tasks. This is meant to reduce Catastrophic Forgetting, which we describe in the Discussion section.
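The equivalence between the difference form and the interpolation form of the update can be checked numerically. The following is a minimal sketch in plain Python, not the authors' implementation: the function names and the toy parameter vectors are illustrative assumptions, and model parameters are treated as flat lists of floats.

```python
import math
from typing import List, Sequence

Vector = List[float]


def mean_params(vectors: Sequence[Vector]) -> Vector:
    """Element-wise average of a collection of parameter vectors."""
    n = len(vectors)
    return [sum(values) / n for values in zip(*vectors)]


def difference_form(theta0: Vector, task_params: Sequence[Vector], rate: float) -> Vector:
    """theta0 + rate * mean(theta_task_i - theta0)."""
    diffs = [[p - b for p, b in zip(params, theta0)] for params in task_params]
    avg_diff = mean_params(diffs)
    return [b + rate * d for b, d in zip(theta0, avg_diff)]


def interpolation_form(theta0: Vector, task_params: Sequence[Vector], rate: float) -> Vector:
    """(1 - rate) * theta0 + rate * mean(theta_task_i)."""
    avg = mean_params(task_params)
    return [(1.0 - rate) * b + rate * a for b, a in zip(theta0, avg)]


# Toy values chosen only for illustration (hypothetical, not from the paper).
theta0 = [0.0, 1.0, -2.0]
task_params = [[0.5, 1.5, -1.0], [1.5, 0.5, -3.0]]
rate = 0.25

updated_a = difference_form(theta0, task_params, rate)
updated_b = interpolation_form(theta0, task_params, rate)
same = all(math.isclose(x, y) for x, y in zip(updated_a, updated_b))
print(updated_a, updated_b, same)
```

With a fixed rate this corresponds to the Reptile form of the update; using a rate that changes from task to task corresponds to the iTAML form.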
$$\eta = \exp(-\beta \cdot t / T)$$
where t is the current task and T is the total number of tasks it will encounter.
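To make the schedule concrete, the short sketch below evaluates the balancing factor for a five-task sequence. It is only an illustration of the formula: the value of beta is an assumption chosen for the example, and the function name is hypothetical rather than taken from the official code.

```python
import math


def balancing_factor(t: int, total_tasks: int, beta: float) -> float:
    """eta = exp(-beta * t / T) for task index t out of T tasks."""
    return math.exp(-beta * t / total_tasks)


TOTAL_TASKS = 5  # e.g. the default MNIST split into 5 tasks
BETA = 1.0       # assumed value, for illustration only

etas = [balancing_factor(t, TOTAL_TASKS, BETA) for t in range(1, TOTAL_TASKS + 1)]
for t, eta in enumerate(etas, start=1):
    print(f"task {t}: eta = {eta:.3f}")
```

Because the exponent grows in magnitude with t, eta shrinks with every task, so later tasks move the shared parameters less than earlier ones.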
Methodology
For initial investigations we used the official implementation provided by the authors at https://github.com/brjathu/iTAML. To track experiments and perform hyperparameter optimization, we modified the codebase to incorporate Weights & Biases; the modified code is at https://github.com/joeljosephjin/Reproducibility-Challenge-iTAML.
We mainly used Kaggle Notebooks for our experiments. We also used a paid Google Colab Pro subscription for quick interactive experiments on GPUs, and developed a minimal version of the codebase for basic debugging at https://github.com/joeljosephjin/itaml-pytorch.
Model Descriptions
From the implementation, we can see that iTAML uses the same model architecture as the RPS-net paper, the SOTA in the domain. For MNIST, the architecture is two pairs of Linear and ReLU layers followed by another Linear layer. For the SVHN and CIFAR-100 datasets, the model consists of 3-4 blocks of convolutional, BatchNorm and ReLU layers followed by a Linear layer. Kaiming Initialization was used in all the models.
Datasets
The paper evaluates the algorithm on several datasets:
MNIST, SVHN and CIFAR, which are considered lighter datasets.
MS-Celeb and ImageNet, which are considered heavier datasets because of their high computation time.
MNIST and SVHN have been used in almost all previous papers in Incremental Learning. They take less than an hour to train on a GPU. MNIST has 10 classes of handwritten digits from 0 to 9. In the default configuration, the 10 classes are divided into 5 tasks which are fed sequentially to the iTAML algorithm. All the datasets are available through PyTorch's built-in dataset loaders, which can be used for regular Computer Vision purposes. For Incremental Learning, we used the open source Incremental Learning loader given by https://github.com/khurramjaved96/incremental-learning.
CIFAR-100 is used in three modes:
5 Classes per task - where there are 20 sequential tasks
10 Classes per task - with 10 sequential tasks
20 Classes per task - 5 sequential tasks
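The task splits described above can be sketched as a simple partition of class labels. This is an illustrative helper, not the loader used in the experiments: the function name is hypothetical, and assigning classes to tasks in label order is an assumption (an incremental loader may order or shuffle classes differently).

```python
from typing import List


def split_classes(num_classes: int, classes_per_task: int) -> List[List[int]]:
    """Partition class labels 0..num_classes-1 into consecutive tasks."""
    if num_classes % classes_per_task != 0:
        raise ValueError("num_classes must be divisible by classes_per_task")
    return [
        list(range(start, start + classes_per_task))
        for start in range(0, num_classes, classes_per_task)
    ]


# CIFAR-100 in the three modes listed above.
cifar_modes = {k: len(split_classes(100, k)) for k in (5, 10, 20)}
print(cifar_modes)

# MNIST default configuration: 10 classes divided into 5 tasks.
mnist_tasks = split_classes(10, 2)
print(mnist_tasks)
```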
Hyper-parameters
Although some hyperparameters were mentioned in the paper, we did not find them very useful. Initially we used the default hyperparameters from the implementation, with which learning collapsed. Hence, we performed hyperparameter optimization for each parameter to arrive at a working result.
Experimental setup
For all the light datasets, we performed the experiments in Google Colab Pro Notebooks, which provide NVIDIA P100 GPUs. For the CIFAR experiments, we used Kaggle Notebooks to do overnight training.
All the experiments we performed are publicly visible in our wandb project at https://wandb.ai/joeljosephjin/itaml. They are all based on our code base at https://github.com/joeljosephjin/Reproducibility-Challenge-iTAML.
Computational requirements
For the experiments on lighter datasets, a single NVIDIA P100 GPU trains the model in less than an hour, while the experiments on heavier datasets require 7-12 hours of training on the same GPU. Both groups of datasets use less than 12 GB of RAM and an equal amount of storage for the datasets.
Results
MNIST Image Classification
Methods	Accuracy on MNIST
MAS	19.52%
LwF	24.17%
GEM	92.20%
DGR	91.24%
RtF	92.56%
RPS-net	96.16%
iTAML (Original)	97.95%
iTAML (Our Result)	59.65%
For the first task, the accuracy is 99%, but it then decreases to 87%, 74%, 68% and finally 60%. This indicates that the algorithm is able to adapt to a single task but its ability to generalize across tasks degrades catastrophically. It underperforms most of the baselines and is superior only to the MAS and LwF algorithms.
SVHN Image Classification
Methods	Accuracy on SVHN
MAS	17.32%
LwF	-
GEM	75.61%
DGR	-
RtF	-
RPS-net	88.91%
iTAML (Original)	93.97%
iTAML (Our Result)	71.24%
The accuracy decreases from 99% for the first task to 91%, 84%, 76% and 71% for the last task. It slightly underperforms the GEM algorithm.
CIFAR Image Classification
Methods	Accuracy on CIFAR
MAS	17.32%
LwF	-
GEM	75.61%
DGR	-
RtF	-
RPS-net	88.91%
iTAML (Original)	93.97%
iTAML (Our Result)	34.84%
Starting from 82% on the first task, the accuracy decreases to 35% on the final task. Here it outperforms only the MAS algorithm.
Hyperparameter Optimization
iTAML on CIFAR
The learning rate here is the step size of the inner-loop update. We obtained the maximum accuracy at a learning rate of 1e-3.
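A search over the inner learning rate of this kind can be sketched as a plain grid search. Everything in the block is an assumption made for illustration: the candidate values, the function names, and in particular `dummy_objective`, which is a stand-in scoring function and does not train a model or reproduce any measured accuracy.

```python
import math
from typing import Callable, Dict, Iterable, Tuple


def grid_search(
    candidates: Iterable[float],
    evaluate: Callable[[float], float],
) -> Tuple[float, Dict[float, float]]:
    """Score every candidate and return the best one with all scores."""
    scores = {lr: evaluate(lr) for lr in candidates}
    best_lr = max(scores, key=scores.get)
    return best_lr, scores


def dummy_objective(lr: float) -> float:
    """Placeholder score (NOT real training): peaks when lr is near 1e-3."""
    return -abs(math.log10(lr) + 3.0)


# Hypothetical candidate grid; in a real run, evaluate would train and
# return the final accuracy for the given inner learning rate.
candidate_lrs = [1e-1, 1e-2, 1e-3, 1e-4]
best_lr, scores = grid_search(candidate_lrs, dummy_objective)
print(best_lr)
```

In practice `evaluate` would wrap a full training run and return its final accuracy, which is the expensive part of the search.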
iTAML on MNIST
[Figures: five hyperparameter sweep panels for iTAML on MNIST]
Discussion
Catastrophic Forgetting is a prevalent phenomenon in the continual learning domain. As the name indicates, it causes the model to "forget" how to perform well on older tasks as it encounters and adapts to newer tasks. We see this phenomenon in our experiments with the iTAML algorithm. The model performs well on the initial tasks, indicating that it is capable of performing well on at least one task, but the accuracy decreases with each new task it sees. This can be seen in the experiments on all three datasets: MNIST, SVHN and CIFAR.
We tried optimizing every set of hyperparameters given in the official implementation, but failed to obtain any result that outperforms the previous state of the art in the domain. This makes it unlikely that the gap is explained by a poor choice of hyperparameters. We also tinkered extensively with the code and were unable to find any bug in the implementation. We provide a few of those experiments in the Additional Experiments section. Some of these experiments gave us the impression that the Reptile meta update rule (the main mechanism of the paper) is not crucial to the success of the algorithm.
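The size of the drop can be read directly from the first-task and final-task accuracies quoted in the Results section. The snippet below only does that subtraction on the rounded figures; the variable names are illustrative.

```python
# (accuracy after first task, accuracy after final task), in percent,
# using the rounded values quoted in the Results section.
reported = {
    "MNIST": (99.0, 60.0),
    "SVHN": (99.0, 71.0),
    "CIFAR": (82.0, 35.0),
}

# Drop in percentage points between the first and the final task.
drops = {name: first - last for name, (first, last) in reported.items()}

for name, drop in drops.items():
    print(f"{name}: {drop:.0f} percentage points lost")
```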
What was easy
The official repository of the paper was instructive in getting started with the experiments. The main mechanism of the paper is easy to understand, especially for those already familiar with the gradient based Meta Learning algorithms MAML and Reptile. Experimenting on the MNIST dataset was convenient due to its low running time of 7-12 minutes on a GPU.
What was difficult
Although the official code repo was easy to start experimenting with, the code seemed incomplete for verifying several of the secondary mechanisms described in the paper. This made the reproducibility exercise very challenging. For example, the paper describes the hyperparameter 'p', which represents the size of the data continuum, as crucial in determining the performance of the model, but we were unable to find any such parameter in the implementation.
Even after extensive hyperparameter optimization experiments and investigations into the implementation, we were unable to reach the performance described in the paper.
Communication with authors
We were not able to contact the authors. We created an issue in the original GitHub repository, but have received no reply so far.
Conclusion
Although we could not reproduce results that outperform the previous baselines, even the lower accuracy we obtained outperformed some earlier baselines, such as MAS and LwF on MNIST, which suggests that gradient based Reptile-style techniques can be useful in the Incremental Learning domain. However, we would like to encourage the authors to provide a more stable implementation and to clarify how they obtained the higher accuracies.
﻿