A reproduction of the paper 'Uncertainty-Guided Continual Learning With Bayesian Neural Networks' by Ebrahimi et al. (2020), submitted at ICLR 2020.

This report attempts to validate the reproducibility of the ICLR 2020 paper 'Uncertainty-Guided Continual Learning With Bayesian Neural Networks' by Ebrahimi et al. (2020). It covers each aspect of reproducing the results and claims put forth in the paper. The authors propose a novel approach to overcome the dependency of continual learning algorithms on an external representation and extra computation to measure the parameters’ importance by introducing Uncertainty guided Continual Bayesian Neural Networks (UCB). In UCB, the learning rate adapts according to the uncertainty defined in the probability distribution of the weights in networks. The paper was easy to read, well written and we could reproduce the results as promised in the paper.

The paper proposes Uncertainty guided Continual Bayesian Neural Networks (UCB), in which the learning rate adapts according to the uncertainty defined in the probability distribution of the network weights. The authors also present a variant of UCB that uses uncertainty for weight pruning and retains task performance after pruning by saving a binary mask per task.

We used the code provided by the authors in their GitHub repository. The authors provide implementations for all the datasets, and the code is convenient to run. Training took approximately 3 hours for the MNIST dataset on a Titan XP GPU, and around 17 hours for the PMNIST dataset on an NVIDIA GTX 1060 GPU. We could only run experiments for these two datasets due to computational constraints. Further details are presented in the Run below. We used Weights & Biases as our default logging and tracking tool to store all results in a single dashboard.

We were able to reproduce the experiments for the MNIST-5 and P-MNIST datasets mentioned in the paper. For MNIST-5, which we ran for 50 epochs, the results obtained overlap with the ones reported in the paper. For P-MNIST, however, we could only run 5 epochs due to computational constraints; we were unable to conclude the experiments even after around 17 hours and thus did not obtain the reported average accuracy per task. Instead, we report the test accuracy for the P-MNIST dataset. For the alternating CIFAR10/CIFAR100 dataset, we could not obtain consistent results and thus we do not report them here.

The paper was understandable and its structure was easy to follow. Alongside the theoretical concepts, the mathematical equations made it easier to reformulate the method. Because the official repository on GitHub is up to date, we were able to run the experiments without much delay in implementing the code. The hyperparameters were clearly noted in the paper, which made it easy to replicate the runs.

Conducting the experiments was computationally expensive and we could only reproduce the results for two datasets - MNIST-5 and P-MNIST.

This reproducibility submission is an effort to validate the ICLR 2020 paper 'Uncertainty-Guided Continual Learning With Bayesian Neural Networks' by Ebrahimi et al. (2020).

UCB, the proposed work, is a Bayesian neural network that utilizes its uncertainty predictions to perform continual learning. The authors try two approaches: learning rate regularization and weight pruning.

In this reproducibility report, we study UCB in detail. This involves running experiments with the authors' open-source code, reporting important details about issues encountered during reproduction, and comparing the obtained results with the ones reported in the original paper.

The claims of the paper are listed below:

- The paper proposes to perform continual learning with Bayesian neural networks and develops a new methodology that exploits the uncertainty to adapt the learning rate of individual parameters.
- The authors propose a hard-threshold variant that decides which parameters to freeze.

The authors perform thorough experiments to validate their approach against prior art and on benchmark datasets for the task. They also claim that, unlike existing approaches, their method does not rely on knowledge about task boundaries at inference time. The code for the paper is available at the official GitHub repository.

The authors introduce Uncertainty-guided Continual learning approach with Bayesian neural networks (UCB), which uses the estimated uncertainty of the parameters' posterior distribution to softly or aggressively regulate the change in "important" parameters.

In continual learning, forgetting is reduced by regularizing the changes in model representation based on the importance of parameters. The authors propose to regularize these changes through the learning rate of each parameter, so that the gradient update becomes a function of the parameter's importance.

The main advantage of using the learning rate as the regularizer is that it doesn't require any additional memory, unlike pruning techniques, nor does it require monitoring the change in parameters relative to the previously learnt task, as standard weight regularization methods do.

\alpha_\mu \leftarrow \frac{\alpha_\mu}{\Omega_\mu} \tag{1}

\alpha_\rho \leftarrow \frac{\alpha_\rho}{\Omega_\rho} \tag{2}

In the equations above, the authors set the learning rates of µ (mean of the posterior) and ρ (standard deviation for the scaled mixture Gaussian pdf of prior) for each parameter distribution to be inversely proportional to its importance Ω.

Importance, in turn, is defined to be inversely proportional to the standard deviation σ, which represents the parameter uncertainty in the Bayesian neural network. As a result, more uncertain parameters receive larger learning rates, while parameters the network is confident about change slowly.
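The relationship between uncertainty and learning rate described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, the softplus mapping from ρ to σ is an assumption borrowed from the common Bayes-by-Backprop parameterization, and only the learning rate of the mean is adapted here, with Ω taken to be exactly 1/σ.

```python
import math


def softplus(rho):
    """Assumed mapping from the unconstrained parameter rho to a positive sigma."""
    return math.log1p(math.exp(rho))


def scaled_learning_rates(rhos, base_lr=0.01):
    """Return one learning rate per mean parameter: base_lr / Omega, with Omega = 1 / sigma."""
    rates = []
    for rho in rhos:
        sigma = softplus(rho)
        omega = 1.0 / sigma  # importance: high when uncertainty is low
        rates.append(base_lr / omega)
    return rates


def sgd_step(mus, grads, rates):
    """Plain gradient step on the means, using a separate learning rate per parameter."""
    return [mu - lr * g for mu, g, lr in zip(mus, grads, rates)]


# Three parameters with increasing uncertainty receive increasing learning rates.
rhos = [-5.0, 0.0, 2.0]
rates = scaled_learning_rates(rhos)
```

In this sketch a parameter with a very small σ is nearly frozen, which is the soft form of protection against forgetting that the learning-rate regularization aims for.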

UCB-P, the pruning variant of UCB, consists of the following steps:

- For every layer, the parameters are ranked by their signal-to-noise ratio (SNR), and those with the lowest SNR, which are the least important, are pruned, i.e. set to zero.
- Pruned parameters are marked using a binary mask and used later to learn a new task, while the important parameters stay fixed throughout the training.
- After a task is learnt, an associated binary mask is saved, which is used during inference to recover the key parameters and hence the exact performance on the desired task.

SNR is a signal processing metric that is used to discern between “useful” and “unwanted” information in a signal. In neural models, the SNR can be thought of as a measure of parameter relevance: the higher the SNR, the more significant the parameter is to the model's predictions for a specific task.
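The pruning steps above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: SNR is taken to be |µ|/σ, the `prune_fraction` parameter and all function names are hypothetical, and the mask convention (1 keeps a parameter, 0 marks it as pruned) is chosen for this example only.

```python
def snr(mu, sigma):
    """Assumed signal-to-noise ratio of one weight distribution: |mu| / sigma."""
    return abs(mu) / sigma


def prune_mask(mus, sigmas, prune_fraction):
    """Return a binary mask for one layer: 1 keeps a parameter, 0 marks it as pruned."""
    scores = [snr(m, s) for m, s in zip(mus, sigmas)]
    n_prune = int(len(scores) * prune_fraction)
    # Indices sorted from lowest to highest SNR; the first n_prune are pruned.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    pruned = set(order[:n_prune])
    return [0 if i in pruned else 1 for i in range(len(scores))]


def apply_mask(mus, mask):
    """Zero out pruned parameters; kept parameters are left unchanged."""
    return [m * k for m, k in zip(mus, mask)]


mus = [0.9, -0.05, 0.4, 0.01]
sigmas = [0.1, 0.5, 0.2, 0.3]
mask = prune_mask(mus, sigmas, prune_fraction=0.5)
kept = apply_mask(mus, mask)
```

Saving one such mask per task is what lets inference recover the parameters that were important for that task.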

- 5-Split MNIST, 5 tasks
- Permuted MNIST, 10 permutations
- Alternating CIFAR10/100
- Sequence of 8 tasks

No pre-trained models were used in the paper. The authors clearly specify a batch size of 64 and a learning rate of 0.01, decaying it by a factor of 0.3 once the loss plateaued.
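The training schedule described above could look roughly like the following sketch. The `PlateauDecay` class is hypothetical and not taken from the authors' repository; the `patience` value is an assumption, as is the reading of "a factor of 0.3" as multiplying the learning rate by 0.3.

```python
BATCH_SIZE = 64  # batch size stated in the paper


class PlateauDecay:
    """Multiply the learning rate by `factor` after `patience` epochs without improvement."""

    def __init__(self, lr=0.01, factor=0.3, patience=2):
        self.lr = lr
        self.factor = factor
        self.patience = patience  # assumed value, not stated in this report
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, loss):
        """Record one epoch's loss and return the learning rate for the next epoch."""
        if loss < self.best:
            self.best = loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr


scheduler = PlateauDecay()
for loss in [1.0, 0.9, 0.9, 0.9]:
    current_lr = scheduler.step(loss)
```

Here the loss stops improving after the second epoch, so the learning rate is reduced once at the fourth epoch.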

The authors describe the 5-Split MNIST experiment as follows:

We first present our results for class incremental learning of MNIST (5-Split MNIST) in which we learn the digits 0 − 9 in five tasks with 2 classes at a time in 5 pairs of 0/1, 2/3, 4/5, 6/7, and 8/9.
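As an illustration of the task construction described in the quote, the sketch below groups sample indices into the five class pairs. The function name and the toy label list are hypothetical; a real run would use the MNIST labels.

```python
def split_into_tasks(labels, classes_per_task=2, num_classes=10):
    """Group sample indices by task, where task t covers classes 2t and 2t + 1."""
    tasks = [[] for _ in range(num_classes // classes_per_task)]
    for index, label in enumerate(labels):
        tasks[label // classes_per_task].append(index)
    return tasks


# Toy labels standing in for the MNIST label vector.
labels = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 8]
tasks = split_into_tasks(labels)
```

With these toy labels, the first task collects the samples labelled 0 or 1 and the last task those labelled 8 or 9.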

The authors report the following values:

- Accuracy (Average Test Classification Accuracy Across All Tasks) - 99.63
- BWT (Backward Transfer to measure forgetting) - 0.00

Our observations were:

- Accuracy - 98.64
- BWT - 0.00
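For reference, the two metrics above can be computed from a matrix of per-task test accuracies, as in this minimal sketch. The convention is an assumption made for the example: `R[j][i]` is the accuracy on task `i` after training on task `j`, and BWT is averaged over all tasks, whereas some definitions average over the first n - 1 tasks only. The numbers are made up for illustration and are not results from our runs.

```python
def average_accuracy(R):
    """Mean test accuracy over all tasks after training on the final task."""
    n = len(R)
    return sum(R[n - 1][i] for i in range(n)) / n


def backward_transfer(R):
    """Mean change in each task's accuracy between learning it and the end of training."""
    n = len(R)
    return sum(R[n - 1][i] - R[i][i] for i in range(n)) / n


# Illustrative accuracies for three tasks; entries above the diagonal are unused.
R = [
    [0.99, 0.00, 0.00],
    [0.98, 0.97, 0.00],
    [0.97, 0.96, 0.95],
]
acc = average_accuracy(R)
bwt = backward_transfer(R)
```

A BWT of zero, as in the reported numbers, means that accuracy on earlier tasks did not drop after later tasks were learnt; a negative value indicates forgetting.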

The paper has the model learn a sequence of 10 random permutations and reports the average accuracy at the end. In our case, due to computational constraints, we ran only 5 epochs, so we obtained the test accuracy but not the average accuracy per task. The details for this run can be found in run set 2 below.
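A Permuted MNIST task sequence of this kind can be generated as in the sketch below. It assumes each 28x28 image is flattened to 784 pixels and that one fixed random permutation is applied to every image of a task; the function names and the seed are hypothetical.

```python
import random


def make_permutations(num_tasks=10, num_pixels=784, seed=0):
    """Create one fixed random pixel permutation per task."""
    rng = random.Random(seed)
    permutations = []
    for _ in range(num_tasks):
        perm = list(range(num_pixels))
        rng.shuffle(perm)
        permutations.append(perm)
    return permutations


def permute_image(pixels, perm):
    """Reorder a flattened image so that output position k holds input pixel perm[k]."""
    return [pixels[j] for j in perm]


perms = make_permutations()
example = permute_image([10, 20, 30], [2, 0, 1])
```

Every task shares the same labels and differs only in how the pixels are shuffled, which is why the benchmark is commonly used to measure forgetting.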

We refrain from reporting the results on the alternating CIFAR10/100 dataset because they were inconsistent: the test accuracy on CIFAR10 was much higher, while on CIFAR100 it was comparatively very low. A similar result was raised in this GitHub issue on the repository.

From our attempt at reproducibility, we conclude that UCB introduces a novel approach through Bayesian neural networks and successfully experiments with the two proposed methodologies: learning rate regularization and pruning the weights according to their importance. We were able to replicate some of the results of the paper, which were easy to reproduce. The paper was an interesting read and, being well constructed, could be understood easily despite its mathematical complexity.