
Uncertainty-Guided Continual Learning With Bayesian Neural Networks

This article attempts to validate the reproducibility of the ICLR 2020 paper 'Uncertainty-Guided Continual Learning With Bayesian Neural Networks' by Ebrahimi et al. (2020), covering each aspect of reproducing the results and claims put forth in the paper.
Created on August 24 | Last edited on June 26
The authors propose Uncertainty-guided Continual Bayesian Neural Networks (UCB), a novel approach that removes the dependence of continual learning algorithms on external representations and extra computation for measuring parameter importance.
In UCB, the learning rate adapts according to the uncertainty defined in the probability distribution of the weights in the network. The paper was well written and easy to read, and we were able to reproduce the results reported in it.
Let's dive in!

Scope of Reproducibility

The paper proposes a novel network, namely Uncertainty guided Continual Bayesian Neural Networks (UCB), where the learning rate adapts according to the uncertainty defined in the probability distribution of the weights in networks. The authors also show a variant of UCB, which uses uncertainty for weight pruning and retains task performance after pruning by saving binary masks per task.

Methodology

We used the code provided by the authors in their GitHub repository. The authors provide implementations for all the datasets, and the code is convenient to run. Training took approximately 3 hours for the MNIST dataset on a Titan XP GPU, while the P-MNIST run took around 17 hours on an NVIDIA GTX 1060 GPU. We could only run experiments for these two datasets due to computational constraints; further details are presented in the run sets below. We used Weights & Biases as our default logging and tracking tool to store all results in a single dashboard.

Results

We were able to reproduce the experiments for the 5-Split MNIST and P-MNIST datasets mentioned in the paper. For 5-Split MNIST, which we ran for the full 50 epochs, our results overlap with those reported in the paper. For P-MNIST, however, we could only run 5 epochs due to computational constraints; the experiments had not concluded even after around 17 hours, so we could not obtain the reported average accuracy per task and instead report the test accuracy. For the alternating CIFAR10/CIFAR100 dataset, we could not obtain consistent results, and thus we do not report them here.

What was easy

The paper was understandable, and its structure was easy to follow. The mathematical equations, alongside the theoretical exposition, made it straightforward to re-derive the method. Because the official GitHub repository is up to date, we were able to run the experiments without much implementation delay, and the hyperparameters were clearly noted in the paper, which made it easy to replicate the runs.

What was difficult

Conducting the experiments was computationally expensive, and we could only reproduce the results for two datasets: 5-Split MNIST and P-MNIST.

Introduction

This reproducibility submission is an effort to validate the ICLR 2020 paper 'Uncertainty-Guided Continual Learning With Bayesian Neural Networks' by Ebrahimi et al. (2020).
UCB, the proposed method, is a Bayesian neural network that uses uncertainty estimates to perform continual learning. The authors explore two approaches: learning-rate regularization and weight pruning.
In this reproducibility report, we study UCB in detail: we run experiments using the authors' open-source code, report the important details of issues encountered during reproduction, and compare the obtained results with those reported in the original paper.

Scope of Reproducibility

The claims of the paper are listed below:
  • The paper proposes to perform continual learning with Bayesian neural networks, and it develops a new methodology that exploits the uncertainty to adapt the learning rate of individual parameters.
  • The authors propose a hard-threshold variant that decides which parameters to freeze.
The authors perform thorough experiments to validate their approach on the prior art as well as benchmark datasets for the task. They also claim that, unlike the existing approaches, this paper does not rely on knowledge about task boundaries at inference time. The code for the paper is available at the official GitHub repository.

Methodology

The authors introduce Uncertainty-guided Continual learning approach with Bayesian neural networks (UCB), which uses the estimated uncertainty of the parameters' posterior distribution to softly or aggressively regulate the change in "important" parameters.
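As a rough illustration of this idea (our own sketch, not the authors' code), each weight in a Bayesian layer can be modeled as a Gaussian with mean μ and standard deviation σ obtained from a learned parameter ρ via a softplus; the importance Ω is then the inverse of the uncertainty. All values below are assumed for illustration.

```python
import numpy as np

# Illustrative sketch: each weight in a Bayesian layer is a Gaussian with
# mean mu and std sigma = log(1 + exp(rho)) (softplus keeps sigma positive).
rho = np.array([-3.0, 0.0, 2.0])   # hypothetical learned pre-softplus parameters
sigma = np.log1p(np.exp(rho))      # parameter uncertainty
omega = 1.0 / sigma                # importance: low uncertainty => important parameter
```

A weight the network is certain about (small σ) thus receives a large importance Ω and, per the method below, changes only a little on subsequent tasks.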

UCB With Learning Rate Regularization

In continual learning, forgetting is reduced by regularizing the changes in model representation based on the importance of parameters. The authors propose to regularize the changes through the learning rate of each parameter, and thus the gradient update becomes a function of its importance.
The main advantage of using the learning rate as the regularizer is that it requires no additional memory, unlike pruning techniques, nor does it require monitoring the change in parameters with respect to previously learned tasks, as standard weight-regularization methods do.
$$\alpha_\mu \leftarrow \frac{\alpha_\mu}{\Omega_\mu} \quad (1)$$
$$\alpha_\rho \leftarrow \frac{\alpha_\rho}{\Omega_\rho} \quad (2)$$
In the equations above, the authors establish that the learning rate of µ (mean of the posterior) and ρ (standard deviation for the scaled mixture Gaussian pdf of prior) for each parameter distribution is inversely proportional to its importance Ω.
Thus, importance is inversely proportional to the standard deviation σ, which represents the parameter uncertainty in the Bayesian neural network.
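Equations (1)-(2) can be illustrated with a minimal per-parameter SGD step (a sketch with assumed values, not the repository's optimizer): dividing the learning rate by Ω = 1/σ means certain (important) weights take tiny steps while uncertain ones remain free to adapt.

```python
import numpy as np

base_lr = 0.01
sigma = np.array([0.05, 0.5, 1.0])   # assumed per-parameter uncertainties
omega = 1.0 / sigma                  # importance, as defined above
lr_mu = base_lr / omega              # equation (1): alpha_mu <- alpha_mu / Omega_mu

grad = np.ones(3)                    # placeholder gradient for illustration
mu = np.zeros(3)
mu -= lr_mu * grad                   # the low-sigma (important) weight barely moves
```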

UCB Using Weight Pruning (UCB-P)

UCB-P consists of the following steps:
  • For every layer, the parameters are ranked by their SNR value; those with the lowest SNR, which are the least important parameters, are pruned (set to zero).
  • Pruned parameters are marked using a binary mask and used later to learn a new task, while the important parameters are fixed throughout the training.
  • After a task is learned, an associated binary mask is saved, which will be used during inference to recover key parameters and hence the exact performance of the desired task.
SNR is a signal processing metric that is used to discern between “useful” and “unwanted” information in a signal. The SNR can be considered as a measure of parameter relevance in neural models; the higher the SNR, the more effective or significant the parameter is to model predictions for a specific task.
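The pruning steps above can be sketched as follows (our own toy example with assumed values and an illustrative 50% pruning fraction, not the paper's setting):

```python
import numpy as np

# Sketch of UCB-P: rank parameters by SNR = |mu| / sigma and zero out
# the lowest-ranked half, recording a binary mask for the task.
mu = np.array([2.0, -0.1, 0.5, 0.01])    # hypothetical weight means
sigma = np.array([0.1, 0.5, 0.5, 0.2])   # hypothetical uncertainties
snr = np.abs(mu) / sigma                 # signal-to-noise ratio per parameter

k = snr.size // 2                        # prune the lowest 50% (illustrative choice)
mask = np.ones(mu.shape, dtype=bool)
mask[np.argsort(snr)[:k]] = False        # binary mask saved per task
pruned_mu = np.where(mask, mu, 0.0)      # pruned weights are free for the next task
```

At inference time, loading the saved mask for a task recovers exactly the parameters that were important for it.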


Results

Experimental Setup

Results : 5-Split MNIST

We first present our results for class-incremental learning on MNIST (5-Split MNIST), in which the digits 0-9 are learned in five two-class tasks: 0/1, 2/3, 4/5, 6/7, and 8/9.
The authors report:
  • Accuracy (Average Test Classification Accuracy Across All Tasks) - 99.63
  • BWT (Backward Transfer to measure forgetting) - 0.00
Our observations were:
  • Accuracy - 98.64
  • BWT - 0.00
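The five-task split described above can be sketched with a small helper (our own illustration, not taken from the authors' repo) that selects the sample indices belonging to one two-class task:

```python
# Five 2-class tasks covering the ten MNIST digits.
task_classes = [(0, 1), (2, 3), (4, 5), (6, 7), (8, 9)]

def task_indices(labels, task_id):
    """Return indices of samples whose label belongs to the given task."""
    a, b = task_classes[task_id]
    return [i for i, y in enumerate(labels) if y in (a, b)]

labels = [0, 3, 5, 8, 1, 9]            # toy labels standing in for MNIST targets
first_task = task_indices(labels, 0)   # indices of digits 0 and 1
```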

Results: Permuted MNIST

The paper has the model learn a sequence of 10 random permutations and reports the average accuracy at the end. Due to computational constraints, however, we ran only 5 epochs; the model finished these and we obtained the test accuracy, but not the average accuracy per task. The details for this run can be found in run set 2 below.
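Permuted MNIST tasks are typically generated by fixing one random pixel permutation per task; a minimal sketch (seed and structure are our own choices for illustration):

```python
import numpy as np

# Each task applies one fixed random permutation to the 784 flattened pixels.
rng = np.random.default_rng(0)          # seed chosen here for reproducibility
n_tasks, n_pixels = 10, 28 * 28
perms = [rng.permutation(n_pixels) for _ in range(n_tasks)]

def apply_perm(flat_image, task_id):
    """Reorder pixels with the task's fixed permutation."""
    return flat_image[perms[task_id]]

x = np.arange(n_pixels, dtype=np.float32)
x0 = apply_perm(x, 0)                   # same pixel values, task-specific ordering
```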

Results: Alternating CIFAR10 & CIFAR100

We refrain from reporting the results on this dataset because they were inconsistent: the test accuracy on CIFAR10 was much higher, while the CIFAR100 accuracy was comparatively very low. A similar issue was raised in this GitHub issue on the repository.


Run set 2


Conclusion

From our reproducibility attempt, we conclude that UCB introduces a novel Bayesian-neural-network approach to continual learning and successfully validates the two proposed methodologies: learning-rate regularization and pruning weights according to their importance. We were able to replicate some of the paper's results, which were easy to reproduce. The paper was an interesting read and, despite its mathematical complexity, was well constructed and easy to understand.