# What Do Compressed Deep Neural Networks Forget?

Pruning can be a clever way to reduce a model's resource greediness. But what gets forgotten when you do? Made by Saurav Maheshkar using Weights & Biases.

### What is Neural Network Compression?

Current state-of-the-art models are famously huge and over-parameterized: they often contain more parameters than there are data points in the training set. Yet in many ways, over-parameterization is behind the success of modern-day deep learning. Think of the Switch Transformer with over a trillion parameters, or Vision Transformer-Huge with 632M parameters. These models require enormous amounts of computation and memory, which not only increases infrastructure costs but also makes deployment to resource-constrained environments such as mobile phones or smart devices challenging.
With this push towards bigger and deeper models comes the competing need for fast deployment and efficiency. One tactic that resolves some of this tension is compression. Practitioners have started focusing on neural network compression methods like pruning and quantization, and have shown that training a larger model and then pruning it beats training a smaller model from scratch.
Gale et al. (2019) beautifully demonstrated that unstructured, sparse architectures learned through pruning cannot be trained from scratch to the same test-set performance as a model trained with pruning as part of the optimization process.
Several techniques for building efficient AI have been proposed over the past few years such as:
1. Automated Design (Auto-ML)
2. Knowledge Distillation
3. Quantization
4. Tensor Decomposition
5. Pruning
We're going to dig into pruning in this report, paying special attention to a recent paper called What Do Compressed Deep Neural Networks Forget? by Sara Hooker, Aaron Courville, Gregory Clark, Yann Dauphin and Andrea Frome. Let's first start with a definition:

## ✂️ What is Neural Network Pruning?

One popular approach for reducing resource requirements at test time is Neural Network Pruning. This means systematically removing parameters (neurons, connections, etc.) from an existing network to reduce its size.
Pruning has been used as a model compression technique for quite a while now. Before the turn of the millennium, Quinlan (1986) and Mingers (1989) explored pruning methodologies for decision trees, while Sietsma and Dow (1988), Karnin (1990), and LeCun et al. (1989) provided some of the first techniques for pruning model weights, using approaches like second-order approximation of the loss surface.
Researchers have come up with many different ways of identifying and removing superfluous portions of a neural network model resulting in the development of various pruning algorithms.
These methodologies can be executed before, during, and after training and often differ across numerous dimensions.

### Some of the pruning methods are:

• Methods based on weight magnitude (Zhu and Gupta, 2017), activations, gradients, etc.
• Layer-wise vs. global & unstructured vs. structured
• Rule-based & Bayesian (Neklyudov et al., 2017)
• One-shot vs. iterative pruning (Han et al., 2015)
• Techniques like fine-tuning, reinitialization, and rewinding
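As a concrete illustration of the first family of methods, here is a minimal NumPy sketch of global magnitude pruning: rank all weights by absolute value across every layer and zero out the smallest fraction. The function and the toy weights below are illustrative, not code from any of the cited papers.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude entries until `sparsity`
    fraction of all weights (across every array) is zero."""
    flat = np.concatenate([w.ravel() for w in weights])
    k = int(sparsity * flat.size)
    if k == 0:
        return [w.copy() for w in weights]
    # Global threshold: the k-th smallest absolute value.
    threshold = np.partition(np.abs(flat), k - 1)[k - 1]
    return [np.where(np.abs(w) <= threshold, 0.0, w) for w in weights]

# Two toy "layers"; at 50% sparsity the three smallest magnitudes go to zero.
weights = [np.array([[0.5, -0.01], [0.2, 0.03]]), np.array([1.0, -0.002])]
pruned = magnitude_prune(weights, 0.5)
```

Layer-wise pruning would instead apply the threshold separately per array; structured pruning would remove whole rows, columns, or channels rather than individual entries.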
However, as Blalock et al. (2020) and Hooker et al. (2020) have pointed out, the community suffers from a lack of standardized benchmarks and metrics; moreover, metrics like test-set accuracy conceal significant differences in how individual classes and images are impacted by model compression techniques. This deficiency is substantial enough that it is hard to compare pruning techniques to one another or to determine how much progress the field has made. Let's look at some of the questions Hooker et al. raise in their paper.

## 📚 Questions Raised By The Paper

• How can networks with radically different representations and numbers of parameters have comparable top-level metrics? One possibility is that test-set accuracy is simply not a precise enough measure to capture how compression impacts the generalization properties of the model.
• Are certain types of examples or classes disproportionately impacted by model compression techniques like pruning and quantization?
• What makes performance on certain subsets of the dataset far more sensitive to varying model capacity?
• How does compression impact model sensitivity to certain types of distributional shifts, such as image corruptions and natural adversarial examples?

## 🧫 Methodology and Experimental Framework

The authors independently trained numerous models at various levels of pruning across three classification tasks and model architectures: a wide ResNet trained on CIFAR-10, a ResNet-50 trained on ImageNet, and a ResNet-18 trained on CelebA. On ImageNet and CelebA, they also evaluated three quantization techniques: float16 quantization, hybrid dynamic-range quantization with int8 weights, and fixed-point-only quantization with int8 weights calibrated on a small representative dataset.
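These three schemes differ mainly in which tensors get reduced precision and how the value ranges are calibrated. As a rough illustration of the arithmetic behind the int8 variants (a NumPy simulation of affine quantization, not the TensorFlow Lite implementation the authors used), consider:

```python
import numpy as np

def quantize_int8(x):
    """Affine (asymmetric) int8 quantization: map the observed
    float range [min, max] onto the integers [-128, 127]."""
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / 255.0 or 1.0          # avoid zero scale
    zero_point = np.round(-128 - lo / scale)  # integer offset for lo
    q = np.clip(np.round(x / scale + zero_point), -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return scale * (q.astype(np.float32) - zero_point)

# Round-tripping weights in [-1, 1] loses at most ~one quantization step.
w = np.linspace(-1.0, 1.0, 9, dtype=np.float32)
q, scale, zp = quantize_int8(w)
w_hat = dequantize(q, scale, zp)
```

Hybrid dynamic-range quantization applies this to the weights only (activations stay float), while fixed-point-only quantization also quantizes activations, using a representative dataset to estimate their ranges ahead of time.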
To minimally reproduce the results, instead of using the ResNet-18 I ran multiple experiments with the InceptionV3 architecture, using a constant-sparsity pruning schedule with s ∈ {0, 0.3, 0.5, 0.7, 0.9, 0.99}, a block size of (1, 1), and average block pooling, implemented using the TensorFlow Model Optimization Toolkit. The models were trained for binary image classification (blonde vs. non-blonde), where blonde is an under-represented group in the CelebA dataset (part of what is sometimes referred to as the "long tail" in the literature). I did not experiment with quantization in my work.

## 🖥 Application and Code

You can try out the pruned models for yourself at this link (a web application built using Streamlit; the images may take a little time to load). Alternatively, you can download the Docker image for the application; the instructions are available here.
The following code snippet was used for creating the pruned models:
```python
import tempfile

import tensorflow as tf
import tensorflow_model_optimization as tfmot
from wandb.keras import WandbCallback

# InceptionV3 architecture with pre-trained baseline weights
model = tf.keras.Sequential([...])
model.load_weights('baseline.h5')

# Constant 30% sparsity, applied from step 0
pruning_params = {
    'pruning_schedule': tfmot.sparsity.keras.ConstantSparsity(0.3, 0),
    'block_size': (1, 1),
    'block_pooling_type': 'AVG'
}
model_thirty = tfmot.sparsity.keras.prune_low_magnitude(model, **pruning_params)

log_dir_thirty = tempfile.mkdtemp()
callbacks = [
    tfmot.sparsity.keras.UpdatePruningStep(),
    tfmot.sparsity.keras.PruningSummaries(log_dir=log_dir_thirty),
    WandbCallback(data_type="image", validation_data=(x_valid, y_valid),
                  save_model=True)
]

model_thirty.compile(...)
model_thirty.fit(...)
```

### 📒 Model Performance Metrics

Below are the metrics from over 40 experiments.
All the model weights are provided with this project as WandB Artifacts to ensure reproducibility and promote further experimentation.

## 🤓 Observations

As the table shows, Top-1 accuracy drops only minimally as the compression level increases, as expected from pruning algorithms like ConstantSparsity. But comparing the numbers of parameters suggests some interesting inferences.
1. The network retains acceptable accuracy on well-represented classes even after pruning to substantial sparsity levels.
2. Upon compression, the model tends to perform poorly on under-represented classes. This suggests that, during training, a large share of the weights are devoted to learning the long tail of the distribution. In other words, "Compression disproportionately impacts model performance on the underrepresented long-tail of the data distribution." Perhaps this is one explanation for the "bigger is better" race.
3. This also suggests that we should develop better evaluation metrics that capture performance on under-represented classes as well.
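One such metric is simply per-class recall, which surfaces exactly the long-tail degradation that aggregate accuracy hides. A minimal sketch (illustrative, not the paper's evaluation code):

```python
import numpy as np

def per_class_recall(y_true, y_pred, num_classes):
    """Recall for each class: correct predictions / true instances."""
    recalls = []
    for c in range(num_classes):
        mask = (y_true == c)
        recalls.append(float((y_pred[mask] == c).mean()) if mask.any()
                       else float("nan"))
    return recalls

# Majority class 0 is predicted well; minority class 1 badly.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 1])
recalls = per_class_recall(y_true, y_pred, 2)
```

Here overall accuracy is 87.5%, which looks healthy, yet the minority class achieves only 50% recall: precisely the pattern a compressed model can exhibit while its headline accuracy barely moves.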

### ⚔️ Pruning Identified Exemplars

Apart from evaluating the impact of compression on class-level performance using Welch's t-test while controlling for any overall difference in model test-set accuracy (explored in depth in the paper), the authors also identified images that are disproportionately impacted by compression. Given the limitations of uncalibrated probabilities in deep neural networks, they focused on the level of disagreement between the predictions of compressed and non-compressed networks on a given image. The notion of pruning identified exemplars (PIEs) is of prime importance here.
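Welch's t-test compares two sample means without assuming equal variances, which fits here because the baseline and pruned model populations need not be equally noisy. A small self-contained sketch of the statistic on per-class accuracies (the numbers are toy values, not results from the paper):

```python
import math

def welch_t(a, b):
    """Welch's t statistic for two samples with unequal variances."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)  # sample variances
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    return (ma - mb) / math.sqrt(va / na + vb / nb)

# Accuracy of a given class across e.g. 5 baseline vs 5 pruned runs.
baseline = [0.94, 0.95, 0.93, 0.96, 0.94]
pruned = [0.90, 0.89, 0.91, 0.88, 0.90]
t = welch_t(baseline, pruned)
```

A large |t| for a class indicates its accuracy shift under pruning is unlikely to be run-to-run noise; the paper additionally corrects for the overall accuracy gap and for multiple comparisons.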
A set of predictions is generated using the population of models. The modal label, i.e. the class predicted most frequently for an image by the population of pruned models, is noted; if the modal label differs from the output of the non-pruned model, the image is classified as a PIE.
NOTE: There is no constraint that the non-pruned predictions for PIEs match the true label. Thus the detection of PIEs is an unsupervised protocol that can be performed at test time.
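The protocol above can be sketched in a few lines of Python; variable names are illustrative, and in practice the predictions would come from the trained model populations:

```python
from collections import Counter

def find_pies(pruned_preds, base_preds):
    """pruned_preds: one list of predicted labels per pruned model,
    each covering the same n images. base_preds: the non-pruned
    model's predictions. Returns indices of images whose modal
    pruned label disagrees with the non-pruned prediction."""
    pies = []
    for i, base in enumerate(base_preds):
        votes = Counter(preds[i] for preds in pruned_preds)
        modal_label, _ = votes.most_common(1)[0]
        if modal_label != base:
            pies.append(i)
    return pies

# Three pruned models vs. one baseline over four images:
pruned_preds = [[0, 1, 2, 1], [0, 1, 2, 2], [0, 2, 2, 2]]
base_preds = [0, 1, 1, 2]
pies = find_pies(pruned_preds, base_preds)
```

Note that true labels never appear anywhere in this function, which is exactly why PIE detection is unsupervised and usable at test time.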

### How to Reproduce These Experiments

Follow these steps to reproduce some of these experiments: