The Power of Random Features of a CNN

This report presents a number of experiments based on the ideas presented in https://arxiv.org/abs/2003.00152 by Frankle et al. Made by Sayak Paul using W&B.

Introduction

BatchNorm has been a favorite topic in the ML research community ever since it was proposed, and it is still often misunderstood, or at best only partially understood. So far, the community has mostly focused on its normalization component. It's also important to note that a BatchNorm layer has two learnable parameters: a coefficient responsible for scaling and a bias responsible for shifting. Not much work has been done to study the effect of these two parameters systematically.

Earlier this year, Jonathan Frankle and his team published a paper, Training BatchNorm and Only BatchNorm: On the Expressive Power of Random Features in CNNs. They studied how well the scale and shift parameters of the BatchNorm layers adjust themselves when all the other parameters of a CNN are kept at their random initializations. In this report, I am going to present my experiments based on the ideas presented in that paper.

Check out the code on GitHub →

Configuring the experiment

For these experiments, I used the CIFAR10 dataset and a ResNet20-based architecture tailored to the dimensions of the CIFAR10 images. Thanks to the Keras Idiomatic Programmer for the implementation of the ResNet architecture. Additionally, I used the Adam optimizer for all the experiments.

I used Colab TPUs to experiment quickly, but for consistency, I ran the same experiments on a GPU instance and the results were largely identical. The central goal of the experiments is to compare the performance of a CNN in which only the BatchNorm layers are trainable against one in which all the layers are trainable.
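The snippet below is a minimal sketch of how that freezing step can be expressed with tf.keras; the helper name freeze_all_but_batchnorm is my own and not taken from the accompanying codebase.

```python
import tensorflow as tf

def freeze_all_but_batchnorm(model):
    # Keep only the BatchNorm layers trainable; every other layer
    # stays frozen at its random initialization.
    for layer in model.layers:
        layer.trainable = isinstance(layer, tf.keras.layers.BatchNormalization)

    # Adam matches the optimizer used throughout the experiments.
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

Compiling after toggling the trainable flags ensures the change actually takes effect before training starts.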

Thanks to the ML-GDE program for the GCP credits which were used to spin up notebook instances and to save some model weights to GCS Buckets.

What's BatchNorm anyway?

Normalization is a common technique applied to input data to stabilize the training of deep neural networks. In general, the ranges of output values of the neurons of a deep network drift apart over the course of training, thereby introducing unstable training behavior. BatchNorm helps mitigate this problem by normalizing the neuron outputs, i.e. by subtracting the mean and dividing by the standard deviation across a mini-batch.

To reintroduce some variance in the outputs and allow a deep network to adapt to variations, BatchNorm is parameterized by a scale and a shift parameter. These two parameters tell a BatchNorm layer how much scaling and shifting to apply, and they are learned as part of the model training process.
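To make this concrete, here is a minimal NumPy sketch of the BatchNorm transform during training; the function name and the eps value are illustrative, not taken from any particular implementation.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # x: a mini-batch of activations with shape (batch_size, num_features).
    # Normalize each feature to zero mean and unit variance across the batch.
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    # gamma (scale) and beta (shift) are the two learnable parameters
    # that the "Training BatchNorm and Only BatchNorm" paper focuses on.
    return gamma * x_hat + beta
```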


You can find a more concrete overview of BatchNorm here.

Next up, let's dive into a comparison of performance for different flavors of ResNet20.

ResNet20 with all the layers set to trainable


ResNet20 with an LR schedule and all the layers set to trainable

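The report's charts don't spell out the exact schedule, so the snippet below is only an illustrative sketch of how a learning rate schedule can be attached in tf.keras, assuming a simple step decay.

```python
import tensorflow as tf

def step_decay(epoch, lr):
    # Illustrative step decay: cut the learning rate by 10x every 30 epochs.
    # The actual schedule used in the experiments may differ.
    return lr * 0.1 if epoch > 0 and epoch % 30 == 0 else lr

lr_callback = tf.keras.callbacks.LearningRateScheduler(step_decay)
# Passed to model.fit(..., callbacks=[lr_callback]).
```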

The Promise of BatchNorm

In this section, I'll present the results we've been waiting for: what happens if we train only the BatchNorm layers and keep all the other trainable parameters at their random initial values?

Note that in this case, the number of trainable parameters in the network is 4000 as can be seen in this notebook.
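As a quick sanity check, here is one way to arrive at that count with tf.keras; the helper name is mine, while tf.keras.backend.count_params is a standard Keras utility.

```python
import tensorflow as tf

def count_trainable_params(model):
    # Sum the number of scalars across every trainable weight tensor.
    # With only the BatchNorm layers trainable, this counts just the
    # gamma (scale) and beta (shift) vectors.
    return sum(
        tf.keras.backend.count_params(w) for w in model.trainable_weights
    )
```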


What if none of the layers are trainable?


How important is batch size?

This section demonstrates the effect of changing the batch size when we train only the BatchNorm layers.

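Such a sweep can be set up along the lines of the sketch below; the batch sizes and epoch budget are illustrative, get_resnet20() stands in for the ResNet20 builder, and freeze_all_but_batchnorm() is the helper sketched earlier in this report.

```python
import tensorflow as tf

# CIFAR10, as used throughout the report.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

for batch_size in [128, 256, 512, 1024]:
    # get_resnet20() is a placeholder for the ResNet20 builder.
    model = freeze_all_but_batchnorm(get_resnet20())
    model.fit(
        x_train, y_train,
        validation_data=(x_test, y_test),
        epochs=10,  # illustrative epoch budget
        batch_size=batch_size,
    )
```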

Keeping all layers trainable vs only the BatchNorm layers trainable

Taking the comparison to one extreme, the following three plots compare two models (one with all layers trainable vs. one with only the BatchNorm layers trainable) in terms of their training time and training progress -


Are the learned convolutional filters consistent?

Let's validate the performance of the model where just the BatchNorm layers were trained by investigating the learned convolutional filters. Specifically, we want to see whether they learn anything useful.

First we'll take a look at the 10th convolutional filter from the model where all the layers were trained.
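The snippet below is a rough sketch of how such filters can be pulled out and displayed with tf.keras and Matplotlib; selecting the layer by indexing into the model's Conv2D layers is my own convention, not necessarily what the accompanying notebook does.

```python
import matplotlib.pyplot as plt
import tensorflow as tf

def show_conv_filters(model, conv_index=10, num_filters=16):
    # Collect the Conv2D layers and grab the kernel of the chosen one.
    conv_layers = [l for l in model.layers if isinstance(l, tf.keras.layers.Conv2D)]
    kernels = conv_layers[conv_index].get_weights()[0]  # (h, w, in_channels, out_channels)

    # Min-max normalize so the filters are visible as grayscale images.
    kernels = (kernels - kernels.min()) / (kernels.max() - kernels.min() + 1e-8)

    for i in range(min(num_filters, kernels.shape[-1])):
        plt.subplot(4, 4, i + 1)
        plt.imshow(kernels[:, :, 0, i], cmap="gray")
        plt.axis("off")
    plt.show()
```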


Convolutional filters where only the BatchNorm layers were trained

The results are quite promising in this case as well -


Next steps

I invite you, the reader, to take this research forward and explore the following next steps -

Check out the code on GitHub →