Meta-Consolidation for Continual Learning (MERLIN)

A reproduction of the paper 'Meta-Consolidation for Continual Learning' by K J Joseph and Vineeth N Balasubramanian, accepted in the proceedings of Neural Information Processing Systems (NeurIPS 2020).
Shambhavi Mishra

Reproducibility Summary

In this report, we attempt to reproduce the paper 'Meta-Consolidation for Continual Learning' by K J Joseph et al., accepted in the proceedings of Neural Information Processing Systems (NeurIPS 2020). The report covers each aspect of reproducing the results and claims put forth in the paper, which proposes a novel continual learning methodology called MERLIN: Meta-Consolidation for Continual Learning. The reproducibility task was feasible: the computation was inexpensive, and we were able to match the results reported in the paper.

Scope of Reproducibility

The paper proposes a novel methodology for continual learning called MERLIN: Meta-Consolidation for Continual Learning. The authors assume that the weights ψ of a neural network for solving task t come from a meta-distribution p(ψ|t). This meta-distribution is learned and consolidated incrementally. The authors operate in the challenging online continual learning setting, where a data point is seen by the model only once.

Methodology

We used the code provided by the authors in their GitHub repository. The parameters used for the models are given explicitly in a config file, which made the paper straightforward to reproduce. A run on the Split MNIST dataset took 2-3 hours on a single NVIDIA GTX 1060 GPU and around 40 minutes on an NVIDIA Tesla P100. Further details are presented in Run set 4. We used Weights & Biases as our default logging and tracking tool to store all results in a single dashboard.

Results

We reproduced the results for a single dataset, Split MNIST, as code was available only for that dataset. The results we obtained are in line with those reported in the paper.

What was easy

The paper was understandable and it was quite fascinating to follow its structure. Along with the theoretical exposition, the mathematical equations made the method easy to re-derive. In addition, the authors provided the original implementation along with the hyperparameters used, making it easy to run with very few modifications.

What was difficult

The only bottleneck we faced was the lack of compute resources, which prevented us from experimenting with all the variants of Split CIFAR-10, Split CIFAR-100 and Split Mini-ImageNet specified in the paper. Code for the baselines would also have been useful for the comparative analysis illustrated in the paper.

Communication with original authors

The authors were very responsive over email and helped us with every doubt we had during the reimplementation. They encouraged and supported the reproducibility effort.

Introduction

This reproducibility submission is an effort to validate the research paper 'Meta-Consolidation for Continual Learning' by K J Joseph et al., accepted in the proceedings of Neural Information Processing Systems (NeurIPS 2020).
Continual learning is a machine learning scenario in which a learning model must adapt to new tasks progressively while maintaining its performance on previously acquired tasks.
The authors propose MERLIN: Meta-Consolidation for Continual Learning, a new continual learning technique based on consolidation in a meta-space, namely the latent space which generates model weights for solving downstream tasks.
In this reproducibility report, we study MERLIN in detail: we run experiments with the authors' open-source code, report important details about issues encountered during reproduction, and compare the obtained results with those reported in the original paper. We report our numbers on seen test accuracy, validation accuracy, loss and average accuracy for each of the given 'k' tasks in the table and plots below.

Scope of Reproducibility

The authors claim that the weights \psi of a neural network for solving task t come from a meta-distribution p(\psi|t), where t is a representation of the task. They propose 'Meta-Consolidation', a methodology to learn this distribution and to continually adapt it to new tasks by consolidating this meta-space of model parameters whenever a new task arrives.
Major contributions listed in the paper are:
  1. A new perspective on continual learning based on the meta-distribution of model parameters and its consolidation.
  2. A method to learn this meta-distribution using a VAE with task-specific priors, allowing an ensemble of models for each task at inference (a paraphrase of the objective is sketched after this list).
  3. MERLIN outperforms well-known baselines and state-of-the-art methods on five continual learning datasets.
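Since the meta-distribution is learned with a VAE whose latent prior is task-specific, the training objective can be written, in our paraphrase (encoder q_{\phi}, decoder p_{\theta}, latent variable z, task-specific prior p(z|t)), as a task-conditioned evidence lower bound:

\log p(\psi \mid t) \;\geq\; \mathbb{E}_{q_{\phi}(z \mid \psi, t)}\!\left[\log p_{\theta}(\psi \mid z, t)\right] \;-\; D_{\mathrm{KL}}\!\left(q_{\phi}(z \mid \psi, t) \,\|\, p(z \mid t)\right)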

Methodology

Understanding the Algorithms

MERLIN tackles the problem in three steps, described below: training a set of task-specific base models, consolidating their weights in a meta-space, and sampling models from that meta-space at inference.

MERLIN: Overall Methodology

We consider a sequence of tasks T_1, T_2, \ldots, T_{k-1} that the learner has seen so far. A new task T_k is introduced at time instance k. In this step, a set of B base models is trained on random subsets of T_k^{tr} (the training split of T_k) to obtain a collection of models \Psi_k = \{\psi_k^1, \ldots, \psi_k^B\}, where each \psi_k^j is a full set of classifier weights. This collection is then used to learn a task-specific distribution over parameters with a VAE-like technique.
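A minimal sketch of this first step is given below. The helper names (make_model, train_base_models) and the training settings are our own illustration, not the authors' API; the only assumptions are a PyTorch dataset holding T_k^{tr} and a factory that builds a fresh classifier.

import copy
import random

import torch
from torch.utils.data import DataLoader, Subset

def train_base_models(task_train_data, make_model, num_base_models=5,
                      subset_fraction=0.6, lr=0.1, device='cpu'):
    # Train B base models on random subsets of the current task's training
    # data and return their weights (state_dicts).
    collected_weights = []
    n = len(task_train_data)
    for _ in range(num_base_models):
        # Draw a random subset of the task's training data.
        indices = random.sample(range(n), int(subset_fraction * n))
        loader = DataLoader(Subset(task_train_data, indices),
                            batch_size=10, shuffle=True)

        model = make_model().to(device)
        optimizer = torch.optim.SGD(model.parameters(), lr=lr)
        criterion = torch.nn.CrossEntropyLoss()

        # A single pass over the subset, in line with the online setting
        # where each data point is seen only once.
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()

        collected_weights.append(copy.deepcopy(model.state_dict()))
    return collected_weights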

Meta-Consolidation in MERLIN

In the meta-consolidation phase, model parameters are sampled from the decoder of the VAE for all tasks seen so far, each conditioned on a task-specific prior, and these samples are used to refine the overall VAE.
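A rough sketch of this consolidation loop is shown below. It assumes a conditional VAE over flattened weight vectors, weight_vae, exposing sample_prior(task, n), decode(z, task) and a forward pass returning (reconstruction, mu, logvar); these names, and the simplified loss, are our own illustration rather than the authors' exact code.

import torch
import torch.nn.functional as F

def consolidate(weight_vae, vae_optimizer, seen_tasks, samples_per_task=10, epochs=1):
    # Pseudo-rehearsal in the weight meta-space: for every task seen so far,
    # sample weight vectors from the decoder (conditioned on that task's
    # prior) and use them as training data to refine the whole VAE.
    for _ in range(epochs):
        for task in seen_tasks:
            z = weight_vae.sample_prior(task, n=samples_per_task)
            with torch.no_grad():
                pseudo_weights = weight_vae.decode(z, task)

            # Refine the VAE on the generated weight vectors: reconstruction
            # loss plus a KL term (written here against a unit Gaussian for
            # brevity; the paper conditions the prior on the task).
            recon, mu, logvar = weight_vae(pseudo_weights, task)
            recon_loss = F.mse_loss(recon, pseudo_weights)
            kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
            loss = recon_loss + kl

            vae_optimizer.zero_grad()
            loss.backward()
            vae_optimizer.step()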

MERLIN Inference

Inference involves sampling models from the learned parameter distribution for each task and evaluating them on test data.
Since the distribution over model parameters has been learned for every task encountered so far, any number of models can be sampled from it at inference/test time. This enables the proposed technique to ensemble several models at test time.
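A simple way to picture this ensembling, reusing the assumed weight_vae interface from the consolidation sketch above, is to sample a few parameter vectors for the task, load each into the classifier, and average the predicted probabilities. The authors' pipeline also involves a fine-tuning stage (note the finetune_* entries in the configuration shown later), which is omitted from this illustration.

import torch

def ensemble_predict(weight_vae, base_model, task, x, n_models=5):
    # Sample several parameter vectors for `task` from the learned
    # meta-distribution, load each into the classifier, and average
    # the softmax outputs over the ensemble.
    probs = []
    with torch.no_grad():
        z = weight_vae.sample_prior(task, n=n_models)
        weight_vectors = weight_vae.decode(z, task)
        for w in weight_vectors:
            torch.nn.utils.vector_to_parameters(w, base_model.parameters())
            probs.append(torch.softmax(base_model(x), dim=-1))
    return torch.stack(probs).mean(dim=0)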

Model Architecture

The paper proposes different base-classifier architectures for the different datasets.
These base models (their weights, to be specific) are then used to train the Variational AutoEncoder in MERLIN.
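Concretely, turning a trained base model into a VAE training example amounts to flattening its parameters into a single vector; a minimal sketch using PyTorch utilities (our illustration, not the authors' code):

import torch

def weights_to_vectors(state_dicts, template_model):
    # Convert each collected state_dict into a flat parameter vector,
    # producing a (B, weight_dim) tensor of VAE training examples.
    vectors = []
    for sd in state_dicts:
        template_model.load_state_dict(sd)
        vectors.append(
            torch.nn.utils.parameters_to_vector(template_model.parameters()).detach()
        )
    return torch.stack(vectors)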

Datasets

The paper evaluates on five benchmark datasets: Split MNIST, Permuted MNIST, Split CIFAR-10, Split CIFAR-100 and Split Mini-ImageNet.
In the Split datasets, the label space grows with each task, whereas in Permuted MNIST the input space changes with each task while the label space stays fixed. The former is known as the Class-Incremental setting, the latter as the Domain-Incremental setting. The authors claim that MERLIN works in both settings.
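The two settings can be pictured with a short sketch of how tasks are typically constructed for Split MNIST and Permuted MNIST (our illustration; the authors' data loaders may differ in details such as class ordering or permutation seeds):

import torch
from torchvision import datasets, transforms

def split_mnist_task(task_id, classes_per_task=2, root='./data'):
    # Class-incremental: task k contains only the digit classes
    # {k * classes_per_task, ..., (k + 1) * classes_per_task - 1}.
    data = datasets.MNIST(root, train=True, download=True,
                          transform=transforms.ToTensor())
    task_classes = set(range(task_id * classes_per_task,
                             (task_id + 1) * classes_per_task))
    indices = [i for i, y in enumerate(data.targets.tolist()) if y in task_classes]
    return torch.utils.data.Subset(data, indices)

def permuted_mnist_task(task_id, root='./data'):
    # Domain-incremental: every task sees all 10 digits, but the pixels are
    # shuffled by a fixed, task-specific permutation.
    g = torch.Generator().manual_seed(task_id)
    perm = torch.randperm(28 * 28, generator=g)
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Lambda(lambda x: x.view(-1)[perm].view(1, 28, 28)),
    ])
    return datasets.MNIST(root, train=True, download=True, transform=transform)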
We perform reproducibility on the Split MNIST dataset as the code for the same is publicly available.

Hyper-parameters

Throughout all our experiments, we used the same hyper-parameters as in the paper. The authors clearly state all hyper-parameters to train the models in the experiments.

Computational Requirements

To conduct the experiment, we used an NVIDIA Tesla P100 GPU and an NVIDIA GeForce GTX 1060 GPU.

Results

We used the default configuration values provided by the authors in a .yml file, reproduced below:
task: 'split_mnist'
n_tasks: 5
samples_per_task: 1000
validation_samples_per_task: 100
method:
  run_merlin: True
epochs: 1
n_finetune_epochs: 40
learning_rate: 0.1
batch_size_train: 10
batch_size_test: 128
finetune_learning_rate: 0.001
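For reference, a minimal way to load such a .yml file into a config object (the file name and attribute-style access are our assumptions, not necessarily how the authors' code handles it):

import yaml
from types import SimpleNamespace

with open('split_mnist.yml') as f:
    cfg = SimpleNamespace(**yaml.safe_load(f))

print(cfg.n_tasks, cfg.learning_rate)  # 5 0.1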
We conducted three runs of the experiment for robust testing, as detailed in the table Run set 4.
The plots illustrated below depict the accuracy, test accuracy, loss and average accuracy for each of the 5 tasks for which we reproduced the experiments.
1. Test accuracy, or 'the average accuracy for each task in tasks' (5 tasks in this case):
We observed this value to be 82.4 ± 0.7, while the paper reports 86.6 ± 1.4. The code for computing the test accuracy is attached below for clarity.
def test(model, tasks, verbose=False, mode='Test'):
    # Evaluate the model on every task seen so far.
    accuracies = []
    for task in tasks:
        test_data = MNIST('./data', task=task, mode=mode, transform=transforms.ToTensor())
        test_dataloader = DataLoader(test_data, batch_size=cfg.batch_size_test,
                                     shuffle=cfg.continual.shuffle_datapoints)
        accuracy, _ = evaluate_accuracy(model, test_dataloader, task=task)
        accuracies.append(accuracy)
        if verbose:
            log('Accuracy of task %d is %f' % (task, accuracy))
    # Average accuracy across all tasks, logged to Weights & Biases.
    acc = statistics.mean(accuracies)
    if verbose:
        log('Average accuracy for ' + str(tasks) + ' is ' + str(acc))
    wandb.log({'average accuracy for each task in tasks is': acc})
    return acc
2. The number of parameters in the classifier for the Split MNIST dataset exactly matches the figure given in the paper: 89,610.
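This figure is easy to verify: 89,610 is exactly the parameter count of a fully connected 784-100-100-10 network, which is one architecture consistent with the paper's number (our reconstruction; the exact classifier may differ in layer types or activations):

import torch.nn as nn

classifier = nn.Sequential(
    nn.Linear(28 * 28, 100), nn.ReLU(),
    nn.Linear(100, 100), nn.ReLU(),
    nn.Linear(100, 10),
)

# 78,500 + 10,100 + 1,010 = 89,610 parameters (weights + biases)
print(sum(p.numel() for p in classifier.parameters()))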

Conclusion

From our attempt at reproducibility, we conclude that MERLIN indeed delivers on the aspects highlighted in the paper. We were able to replicate the main results, which were easy to reproduce. The paper was an interesting read and, despite the mathematical complexity, could be understood easily because it is well constructed. We would also encourage the original authors to make their official code public for the other datasets as well.