
La-MAML: Look-ahead Meta Learning for Continual Learning

Submission for the Reproducibility Challenge 2020 for the paper La-MAML: Look-ahead Meta Learning for Continual Learning by Gupta et al., accepted to NeurIPS 2020.

Reproducibility Summary

This report validates the reproducibility of the NeurIPS 2020 paper titled La-MAML: Look-ahead Meta Learning for Continual Learning by Gupta et al. The article covers each aspect of reproducing the results and claims put forth in the paper. The paper primarily presents a novel optimization-based meta-learning algorithm for online continual learning. La-MAML achieves performance superior to other replay-based, prior-based, and meta-learning-based approaches for continual learning on real-world visual classification benchmarks (TinyImagenet and CIFAR). The main experiments were easily reproducible from the official repo.

Scope of Reproducibility

The paper proposes a novel optimization-based meta-learning algorithm named Look-ahead MAML (La-MAML) with the aim of mitigating catastrophic forgetting while being sample-efficient and robust to changes in hyperparameters. The central foundations of La-MAML are the use of per-parameter learning rates, asynchronous updates, and a sample-efficient objective.

Methodology

For reproducing the paper, we used the original implementation provided by the authors. We used Weights & Biases as our default logging and tracking tool to store all results in a single dashboard. For compute, we used the Tesla T4, P4, and K80 GPUs provided by Google Colaboratory, Kaggle, and Codeocean.

Results

We were able to replicate the main experiments presented in the paper. Overall, the results we obtained support the claims about the efficiency of the mechanisms proposed in the paper.

What was easy

Because the official GitHub repo is up to date, we were able to run the experiments without spending much time on implementation. The hyperparameters were clearly noted in the paper, which made it easy to replicate the runs.

What was difficult

Some of the experiments were difficult to run on free GPU platforms due to run-time limits and RAM constraints. In addition, the code for the experiment that computes the cosine similarity of gradients was not published in the official repo, which made that experiment difficult to reproduce.

Communication with Original Authors

We were able to communicate with the authors regarding the experiments and received useful advice for conducting some additional experiments.

Introduction

This reproducibility submission is an effort to validate the NeurIPS 2020 paper by Gupta et al. titled "La-MAML: Look-ahead Meta Learning for Continual Learning." This report evaluates the central claim of the paper, which proposes a novel, high-performing, and fast meta-learning algorithm for image classification in the continual learning setting. We also provide a public dashboard with all the reproduced results and experiments, including the codebase used to run them.

Scope of Reproducibility

Main mechanisms:
  1. Sample-Efficient Objective: derived from the online-aware meta-learning (OML) objective. It differs from the meta experience replay (MER) objective in that MER aligns all the individual pairwise gradients between tasks 1 to t, while C-MAML aligns the gradient of task t with the average of all the previous gradients. This yields a substantial efficiency gain: the experiments show C-MAML taking roughly 5 times less time than MER while also achieving slightly higher performance.
  2. Per-Parameter Learning Rates: here, we observed that the gradient (g_MAML) of the C-MAML objective with respect to the inner loop's learning rates directly reflects the alignment between the old and new tasks. It has two terms:
  • the gradient of the meta-loss on the meta-batch (current-task data + replay-buffer data), denoted g_meta;
  • the cumulative gradient from the inner updates, denoted g_traj.
The expression indicates that the gradient of the learning rates (LRs) is negative when the inner product between g_meta and g_traj is high, i.e. when the two are aligned. Negative (positive) LR gradients pull the LR magnitude up (down); a small sketch of this update is given after this list.
Additionally, La-MAML uses per-parameter LRs while C-MAML uses a single fixed LR.
  3. Asynchronous Updates: in the outer loop, the weights are updated using the freshly updated alpha (the vector of inner-loop learning rates). The alternative is to use a fixed learning rate for the outer loop, in which case the updates are called synchronous; this alternative is evaluated through the Sync-La-MAML algorithm proposed in the paper as an ablation. The asynchronous update is similar to a look-ahead search, since the step size for each parameter is adjusted based on the loss incurred after applying a hypothetical update to it. This is where the name "Look-ahead MAML" comes from.
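To make the learning-rate mechanism concrete, here is a minimal PyTorch sketch of the update direction, using illustrative tensors g_meta and g_traj and a vector of per-parameter learning rates alpha (this is our own toy illustration, not the authors' implementation):

import torch

# Toy illustration of the per-parameter LR update and the asynchronous weight
# update (our own sketch; alpha, g_meta, g_traj and eta are illustrative names).
alpha = torch.full((10,), 0.1)   # per-parameter inner-loop learning rates
g_meta = torch.randn(10)         # gradient of the meta-loss on the meta-batch
g_traj = torch.randn(10)         # cumulative gradient from the inner updates

# The LR gradient is (elementwise) negative wherever g_meta and g_traj align,
# so a gradient step on alpha increases the step size of aligned parameters.
lr_grad = -(g_meta * g_traj)
eta = 0.01                       # learning rate of the learning rates
alpha = alpha - eta * lr_grad

# Asynchronous (La-MAML-style) update: the weights are updated with the freshly
# updated alpha, clipped to be non-negative, instead of a fixed outer LR.
w = torch.randn(10)
w = w - torch.clamp(alpha, min=0.0) * g_meta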
In general, the paper is built on the following central claims:
  • La-MAML performs better than other continual learning (CL) algorithms such as MER in terms of retained accuracy (RA), backward transfer and interference (BTI), and running time
  • La-MAML is robust to the choice of its hyperparameters

Model Descriptions

For the MNIST datasets, the models have 3 linear layers, each followed by a ReLU (except the last one).
For the CIFAR and TinyImagenet datasets, the models have 3 convolutional layers with ReLU activations, followed by the same stack of linear layers as in the MNIST models.
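As a rough PyTorch sketch of these two architectures (our own reconstruction from the descriptions above; the hidden widths, channel counts, and strides are assumptions, not the authors' exact values):

import torch.nn as nn

# Rough reconstruction of the two model families (widths and strides are
# assumptions; input sizes assumed: 28x28 for MNIST, 32x32 for CIFAR).
def mnist_mlp(hidden=100, n_classes=10):
    # 3 linear layers, each but the last followed by a ReLU
    return nn.Sequential(
        nn.Flatten(),
        nn.Linear(28 * 28, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, n_classes),
    )

def cifar_cnn(channels=64, n_classes=100):
    # 3 conv-ReLU blocks followed by the same linear stack as the MNIST model
    return nn.Sequential(
        nn.Conv2d(3, channels, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.ReLU(),
        nn.Flatten(),
        nn.Linear(channels * 4 * 4, channels), nn.ReLU(),
        nn.Linear(channels, channels), nn.ReLU(),
        nn.Linear(channels, n_classes),
    )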

Datasets

The algorithms are evaluated on two groups of classification datasets: (1) variations of the MNIST dataset, which are toy continual learning benchmarks proposed in previous works; (2) real-world classification datasets (CIFAR and TinyImagenet), which are more complex and pose a more serious challenge than the toy MNIST benchmarks.
MNIST Rotations: A variant of the MNIST dataset of handwritten digits, where each task contains digits rotated by a fixed angle between 0 and 180 degrees.
MNIST Permutations: A different variant of MNIST, where each task applies a fixed permutation of the pixels. In this dataset, the input distributions of the tasks are unrelated to each other.
MNIST Many Permutations: A third variant of MNIST Permutations that has 5 times more tasks (100 in total) and 5 times fewer training examples per task (200 each).
CIFAR-100: Incremental CIFAR100, a variant of the CIFAR object recognition dataset with 100 classes, where each task introduces a new set of classes.
TinyImageNet-200: TinyImagenet is a scaled-down version of the ImageNet dataset with 200 classes.
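To illustrate how such task streams can be constructed, here is a hedged torchvision sketch of the two MNIST variants (our own construction for illustration, not the benchmark code released with the paper):

import torch
from torchvision import datasets, transforms
from torchvision.transforms import functional as TF

# Illustrative task construction; materializes the whole split, fine for a sketch.
mnist = datasets.MNIST("./data", train=True, download=True,
                       transform=transforms.ToTensor())

def rotated_task(angle):
    # MNIST Rotations: every image in a task is rotated by the same fixed angle
    return [(TF.rotate(x, angle), y) for x, y in mnist]

def permuted_task(seed):
    # MNIST Permutations: one fixed pixel permutation is applied to every image
    perm = torch.randperm(28 * 28, generator=torch.Generator().manual_seed(seed))
    return [(x.view(-1)[perm].view(1, 28, 28), y) for x, y in mnist]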

Hyperparameters

Throughout all our experiments, we used the same hyperparameters as defined in the paper. The authors clearly state every hyperparameter used to train the models in the experiments presented in the paper.

Experimental Setup

For all the small-scale experiments, we trained the models and conducted ablation studies using the free GPU resources provided by Google Colab. The GPUs we used included an NVIDIA K80, an NVIDIA P100, and an NVIDIA Tesla P4.

Computational Requirements

All of the experiments on the MNIST datasets take less than 10 minutes. The MNIST Rotations and Permutations datasets require more than 6 GB of RAM to run, and the Many Permutations dataset requires more than 12 GB of RAM. The Many Permutations experiments were run on Codeocean's NVIDIA K80 GPUs; all other MNIST experiments were run on Kaggle's P100 GPUs.
Almost all of the experiments on the real-world classification datasets (TinyImagenet and CIFAR) were run on Kaggle's P100 GPUs. Most of them take between 30 minutes and 5 hours each, though some exceed 12 hours of run time; due to the time limits on free GPUs, we ran those on slower Tesla P4 GPUs, where they took 3-4 days to complete.

Results

We were able to achieve results that validate the main claims made in the paper. The metric values we obtained were within roughly 4% of the results reported in the paper.

Pseudocode

for task in task_loader:                       # iterate over the task stream
    for ep in range(num_epochs):               # 1 epoch (single-pass) or several (multiple-pass)
        for X, Y in task:                      # incoming batches of the current task
            for glance in range(glances):      # several meta-updates per incoming batch
                model_clone = model.clone()    # fast weights for the inner loop
                bx, by = (X, Y) + sample(Memory)   # meta-batch: current batch plus replay samples
                meta_loss = 0
                for x, y in zip(X, Y):         # inner loop: one SGD step per current-task sample
                    model_clone.inner_update(x, y)
                    meta_loss += lossfn(model_clone(bx), by)   # evaluate fast weights on the meta-batch
                learning_rate.update(meta_loss)                # update the per-parameter LRs (alpha) first
                model.meta_update(meta_loss, learning_rate)    # then update the weights with the new LRs


Results on MNIST Datasets

Results on MNIST Rotations Dataset

We compare the retained accuracy (RA) of La-MAML, Sync-La-MAML, and C-MAML. Total accuracy denotes the average validation accuracy obtained by evaluating on all tasks, while current-task accuracy denotes the validation accuracy on the task that is currently being trained on.
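For clarity, here is a small sketch of how we think of these metrics, where accs[i][j] is the validation accuracy on task j after finishing training on task i (the helper names are ours, and BTI follows the usual backward-transfer definition):

# Sketch of the evaluation metrics (helper names are ours).
def retained_accuracy(accs):
    # RA: average accuracy over all tasks, measured after the final task
    final = accs[-1]
    return sum(final) / len(final)

def backward_transfer_interference(accs):
    # BTI: average change in a task's accuracy between right after it was
    # learnt and the end of training (more negative = more forgetting)
    T = len(accs)
    return sum(accs[-1][j] - accs[j][j] for j in range(T - 1)) / (T - 1)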

[W&B chart panels: run sets lamaml, sync, cmaml, and mer (5 runs each)]

From the above charts we can see that La-MAML achieves slightly better performance than MER with a much lower running time (roughly 5 times less). This is due to its sample-efficient objective function.

Results on MNIST Permutations Dataset


[W&B chart panels: run sets lamaml, sync, and cmaml (3 runs each)]


Results on MNIST Many Permutations Dataset


[W&B chart panels: run sets lamaml, sync, and cmaml]


Results on Real World Classification

Results on CIFAR-100

What is Single-Pass vs Multiple-Pass?
In the single-pass setting, the training data for a task may only be used once. This is the setup for efficient lifelong learning (LLL).
In the multiple-pass setting, the training data for a task may be used any number of times before moving on to the next task.
In code, the difference amounts to single-pass using 1 epoch per task while multiple-pass uses 10; refer to the pseudocode above to see where the epochs enter the loops, and see the minimal sketch below.
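In loop form, the difference is just the epoch count (a trivial sketch with a dummy task stream; the actual flag name in the repo may differ):

# Dummy task stream, for illustration only.
tasks = [[("X", "Y")] * 4 for _ in range(3)]   # 3 tasks, 4 batches each

num_epochs = 1        # single-pass (efficient LLL) setting
# num_epochs = 10     # multiple-pass setting

for task in tasks:
    for ep in range(num_epochs):
        for X, Y in task:
            pass      # the meta-update from the pseudocode would go here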

Single-Pass


[W&B chart panel: 9 runs]


Multiple-Pass


[W&B chart panel: 11 runs]


Results on TinyImagenet-200

Single-Pass


[W&B chart panel: 10 runs]


Multiple-Pass


[W&B chart panel: 8 runs]


Ablation Studies

We compare La-MAML with the MER algorithm, which is the current state-of-the-art (SOTA) in CL.

[W&B chart panels: run sets comparing La-MAML and MER across the benchmarks]


Hyperparameter Robustness

La-MAML


[W&B chart panels: a_i exps (6 runs) and o_l exps (6 runs)]


A Quick Thank You

Thanks to the authors for their research and for being responsive as we reproduced their work. You can read the original paper here. Additionally, you can click on the "Reproducibility Challenge" tag in the W&B Gallery to see additional reports in this project. Thanks for reading!