# Understanding the Effectivity of Ensembles in Deep Learning

Dissecting ensembles, one at a time. Made by Sayak Paul using Weights & Biases

## Introduction

The report explores the ideas presented in Deep Ensembles: A Loss Landscape Perspective by Stanislav Fort, Huiyi Hu, and Balaji Lakshminarayanan.

In the paper, the authors investigate one question: why do deep ensembles work better than single deep neural networks?

In their investigation, the authors find that:

• Different snapshots of the same model (i.e., the model after 1, 10, or 100 epochs of a single training run) are functionally similar. An ensemble of such snapshots is therefore less likely to explore different modes (local minima) of the optimization space.

• Different solutions of the same model (i.e., trained with a different random initialization each time) are functionally dissimilar. An ensemble of such solutions is therefore more likely to explore different modes (local minima) of the optimization space.

Inspired by their findings, in this report, we present several different insights that are useful for understanding the dynamics of deep neural networks in general.

## Revisiting the Optimization Landscape of Neural Networks

Training a neural network is a stochastic process: each time you train a network, it may not arrive at exactly the same solution as before. Neural networks are optimized using gradient-based learning, and this optimization problem is almost always non-convex. Written out formally, the problem looks like this:

$\operatorname{minimize}_{\theta} \frac{1}{m} \sum_{i=1}^{m} \ell\left(h_{\theta}\left(x_{i}\right), y_{i}\right)$

where,

• $\theta$ is the parameter vector,
• $m$ is the number of training examples,
• $h_\theta$ is the model (neural network) parameterized over $\theta$
• $\ell(h_\theta(x), y)$ is our loss function in which $x$ and $y$ correspond to an input and a label from the training dataset

Consider the figure below, which shows a sample non-convex loss landscape (typical for neural networks). As we can see, it contains multiple local minima. A single training run can only reach one of these local minima. The same neural network can end up in a different minimum each time it is trained with a different random initialization, which leads to high variance in predictions.

We can also see that these local minima lie at the same level in the loss landscape, which further suggests that if a network ends up in one of these local minima, it will yield the same kind of performance more or less.
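
To make the picture of multiple local minima concrete, here is a minimal sketch on a toy problem. The one-dimensional `loss`, the learning rate, and the initialization range are illustrative assumptions, not something taken from the paper or from our experiments. Plain gradient descent started from different random points settles in different local minima:

```python
import numpy as np

def loss(theta):
    # Toy non-convex loss with several local minima (illustrative choice).
    return np.sin(3.0 * theta) + 0.1 * theta ** 2

def grad(theta):
    # Analytic derivative of the toy loss above.
    return 3.0 * np.cos(3.0 * theta) + 0.2 * theta

def gradient_descent(theta0, lr=0.01, steps=2000):
    # Plain gradient descent from a given starting point.
    theta = theta0
    for _ in range(steps):
        theta = theta - lr * grad(theta)
    return theta

rng = np.random.default_rng(0)
inits = rng.uniform(-4.0, 4.0, size=8)  # eight different "initializations"
solutions = np.array([gradient_descent(t0) for t0 in inits])

for t0, t in zip(inits, solutions):
    print(f"init={t0:+.2f} -> theta={t:+.3f}, loss={loss(t):+.3f}")
```

Each run stops where the gradient vanishes, but which minimum it reaches depends only on where it started.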

## Mitigating the High Variance of a Single Model With Ensembling

To allow a network to cover these local minima better, we often train several versions of the same model but with different initializations. During inference, we take predictions from each of these different solutions, and we average their predictions. It works quite well in practice, and this process is referred to as ensembling. Ensembling also helps to reduce the high variance that might come from the predictions of individual models (the same network trained multiple times with different random initializations).
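
The averaging step described above can be sketched in a few lines of NumPy. The helper name `ensemble_predict` and the tiny probability arrays are hypothetical; in practice each array would hold the softmax outputs of one trained model on the test set.

```python
import numpy as np

def ensemble_predict(prob_list):
    """Average class probabilities over models, then pick the top class."""
    probs = np.stack(prob_list, axis=0)  # (num_models, num_examples, num_classes)
    mean_probs = probs.mean(axis=0)      # (num_examples, num_classes)
    return mean_probs, mean_probs.argmax(axis=1)

# Hypothetical softmax outputs of three models on two examples and three classes.
p1 = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]])
p2 = np.array([[0.2, 0.7, 0.1], [0.1, 0.3, 0.6]])
p3 = np.array([[0.5, 0.4, 0.1], [0.3, 0.3, 0.4]])

mean_probs, labels = ensemble_predict([p1, p2, p3])
print(labels)
```

Note that the ensemble's decision for an example can differ from the decision of any individual member, because it is taken on the averaged probabilities rather than by majority vote.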

To understand why ensembles work well, we need to figure out the ingredients that make them cover the loss landscape better.

Neural networks are parameterized functions, as we saw earlier. Each time we train a network, we end up in a different region of the parameter space, leading to a different optimum. The more diverse these regions, the better the coverage of different optima. So, how do we quantify this diversity?

To investigate this systematically, the authors do the following (among other things):

• They measure the cosine similarity of the weights from different runs of the same network. Cosine similarity is a widely used measure of the similarity between two vectors; it compares their orientation and ignores their magnitude (refer to the figure below). Formally, it is the dot product of the two vectors divided by the product of their norms, which equals the dot product of the two normalized vectors. The goal is to examine how similar different trajectories are (weights of the same model trained with different initializations).

In practice, we train the same model with different initializations, grab the trainable weights while ignoring the biases, flatten the weights of each layer, and collect them into a single list per model. We then apply the cosine similarity formula (NumPy implementation) to each pair of models.

    # Compute the cosine similarity between two flattened weight vectors
    import numpy as np
    from numpy.linalg import norm
    cos_sim = np.dot(weights1, weights2) / (norm(weights1) * norm(weights2))
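
The full procedure (flatten, drop biases, compare every pair) can be sketched as follows. This is a minimal sketch under stated assumptions: each model is represented as a list of NumPy arrays, one per weight tensor (for example the list returned by Keras `model.get_weights()`), and every 1-D array is treated as a bias and skipped. The helper names and the random stand-in "models" are hypothetical.

```python
import numpy as np

def flatten_weights(layer_weights):
    """Concatenate all kernels into one vector, skipping 1-D arrays (assumed biases)."""
    return np.concatenate([w.ravel() for w in layer_weights if w.ndim > 1])

def pairwise_cosine(models):
    """Return the (num_models, num_models) matrix of cosine similarities."""
    vecs = np.stack([flatten_weights(m) for m in models])
    unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    return unit @ unit.T

rng = np.random.default_rng(0)

def random_model():
    # Stand-in for a trained network: conv kernel, bias, dense kernel, bias.
    return [rng.normal(size=(3, 3, 3, 16)), rng.normal(size=16),
            rng.normal(size=(64, 10)), rng.normal(size=10)]

models = [random_model() for _ in range(3)]
sim = pairwise_cosine(models)
print(np.round(sim, 3))
```

The 1-D filter is a simplification: it would also drop other 1-D parameters such as batch normalization scales, so a real implementation may want to select weights by name instead.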

• They measure the extent to which the predictions from different runs disagree with each other. The authors want to see whether models trained with different initializations fail on the same subset (or the complete set) of the test dataset. If a model trained with different inits produces different predictions on the test dataset, we can say that its predictions are a function of its initialization.

Also, the examples that tend to confuse the model across different initializations can be called intrinsically hard examples. To find them, we first compared confusion matrices epoch-wise, i.e., across individual epochs of the same init. We then compared them solution-wise, i.e., across different solutions (inits) of the same model.

Practically, to compute the dissimilarity in predictions between two models, count the test examples on which their predicted labels agree, divide that count by the total number of test examples, and subtract the result from 1.

    # Fraction of test examples on which two models disagree
    # (len(preds1) is 10,000 for the CIFAR-10 test set)
    dissimilarity_score = 1 - np.sum(np.equal(preds1, preds2)) / len(preds1)
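
Extending the same idea to every pair of models (or snapshots) gives the matrix that the disagreement plots are built from. A minimal sketch, assuming the predicted labels are stacked into an array of shape `(num_models, num_examples)`; the function name and the toy labels are hypothetical.

```python
import numpy as np

def disagreement_matrix(pred_labels):
    """Entry (i, j) is the fraction of examples on which models i and j disagree."""
    preds = np.asarray(pred_labels)  # (num_models, num_examples)
    # Broadcasting compares every model with every other model.
    return (preds[:, None, :] != preds[None, :, :]).mean(axis=2)

# Hypothetical predicted labels of three models on four test examples.
preds = [[0, 1, 2, 2],
         [0, 1, 1, 2],
         [2, 1, 1, 0]]
d = disagreement_matrix(preds)
print(d)
```

The diagonal is always zero, and the matrix is symmetric, so only the upper triangle carries information.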


Before we dive deep into the experiments mentioned above, it is essential to review our experimental setup.

## Experimental Setup

• Dataset used (primarily): CIFAR-10
• Architectures: SmallCNN, MediumCNN, and ResNet20v1
• Dropout: 0.1 (only applicable when using SmallCNN and MediumCNN)
• Batch size: 128
• Learning rate schedule: starts at $1.6 \times 10^{-3}$ and halves every 10 epochs
• Data augmentation: Only when using ResNet20v1

Note: We did not exactly follow what is specified in section 3 of the paper. There are minor differences in our experimental setup and what the authors followed.

For convenience, below, we specify how the learning rate schedule would look and the data augmentation pipeline we followed.

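    import tensorflow as tf
    def augment(image, label):
        # Assumed step: pad to 40x40 first; a 32x32 crop of an unpadded 32x32 image would be a no-op
        image = tf.image.resize_with_crop_or_pad(image, 40, 40)
        image = tf.image.random_crop(image, size=[32, 32, 3])  # Random crop back to 32x32
        image = tf.image.random_brightness(image, max_delta=0.5)  # Random brightness
        image = tf.clip_by_value(image, 0., 1.)  # Keep pixel values in [0, 1]
        return image, label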
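
The learning rate schedule listed in the setup (start at $1.6 \times 10^{-3}$, halve every 10 epochs) can be written as a small function. This is a sketch of that rule only; the function name `step_decay` and its keyword arguments are hypothetical, and how it is hooked into training depends on the framework.

```python
def step_decay(epoch, *, initial_lr=1.6e-3, drop_every=10):
    """Learning rate for a zero-indexed epoch: halve every `drop_every` epochs."""
    return initial_lr * 0.5 ** (epoch // drop_every)

# Learning rate at the start of each 10-epoch block.
for epoch in (0, 10, 20, 30):
    print(epoch, step_decay(epoch))
```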


### Full code on GitHub →

We used Google Colab for running all of our experiments.

## Dissecting the Weight Space of a (Deep) Ensemble

Going back to our experiments, we present each experiment quantifying the diversity (cosine similarity, prediction disagreement) in two different flavors:

• Take different snapshots of a model from the same training run and perform the experiment.
• Train the model multiple times with different random initializations and perform the experiment.

Note: By snapshots, we refer to models taken from epoch 0, epoch 1, and so on from the same training run (same initialization).

## Cosine Similarity Between Snapshots of the Same Trajectory

#### Observations


• The functions (different checkpoints of the same model) along the same trajectory are similar, and this holds for all variants (small, medium, and large) of the model.

• The cosine similarity between the weights of different snapshots of the same model increases as training approaches convergence. Thus, the weights change very little once the trajectory has settled into a valley of the loss landscape.

• Checkpoints from the later stages of training differ the most from those of the initial stage; pairs in the whitish region of the plot show mild similarity.

## Cosine Similarity Between Different Initializations

#### Observations

• The models trained with different initializations (different trajectories) are entirely dissimilar in weight space. This holds for all three variants of the model.

• Thus, the initialization decides which region of the weight space the model will explore.

## Prediction Disagreement Between Snapshots of the Same Trajectory

#### Observations

• The functions (different checkpoints of the same model) along the same trajectory tend to disagree less in their predictions, further confirming that functions along the same trajectory are similar.

• The prediction dissimilarity plot shows that different snapshots of the same model grow more similar to each other as training approaches convergence (increasing epochs). Thus, one can say that the mapping $x \rightarrow y$ is already settled for many examples once the trajectory has settled into a valley of the loss landscape.

• We also observe high dissimilarity in predictions between checkpoints from the later stages of training and those from the very initial stage.

## Prediction Disagreement Between Different Initializations

#### Observations

• The predictions of the same model trained with different initializations, on the same dataset and with the same hyperparameters, disagree.

• Obviously, there is a subset of examples on which the models trained along different trajectories agree.

• There must also be a subset of intrinsically hard examples that the models trained along different trajectories misclassify similarly. We investigate this in the next section.

## Intrinsic Hardness as a Function of Initialization

Below we see that the set of examples that confuses a model changes from epoch to epoch as the optimization proceeds. We further see that this set varies when we train the model with a different initialization. Due to space constraints, we could not list the results for all the different initializations, but feel free to check them out here. This suggests that the definition of intrinsically hard examples is relative to how a model is initialized. It may further suggest that the images causing the top losses during training (epoch-wise) also change when we change the initialization of a model.

Note: You can click on the little button located at the top-left corner and play with the slider to see how the confusion matrices change with epochs.

The idea of creating an epoch-wise callback is borrowed from this tutorial.
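
For readers who want to reproduce the comparison without the interactive panel, the confusion matrix for one epoch can be computed directly from labels. This is a minimal NumPy sketch with hypothetical helper names and toy labels, not the callback we used; calling it once per epoch (or once per init) gives the matrices to compare.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Rows index the true class, columns the predicted class."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    # Unbuffered add, so repeated (true, predicted) pairs are all counted.
    np.add.at(cm, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return cm

def most_confused_pair(cm):
    """Return the (true, predicted) pair with the most misclassifications."""
    errors = cm.copy()
    np.fill_diagonal(errors, 0)  # ignore correct predictions
    return np.unravel_index(errors.argmax(), errors.shape)

# Hypothetical labels and predictions of one model at two epochs.
y_true = [0, 0, 1, 1, 2, 2]
preds_by_epoch = {1: [0, 1, 1, 2, 2, 2], 2: [0, 0, 1, 1, 2, 0]}
matrices = {e: confusion_matrix(y_true, p, 3) for e, p in preds_by_epoch.items()}

for epoch, cm in matrices.items():
    print(f"epoch {epoch}:\n{cm}")
```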

## Different Initializations and Their Paths to Optimization

We talked about different initializations of the same model and observed functional dissimilarity between them. To spice it up, let's visualize the paths of the different trajectories. The authors do so by taking three (for simplicity) different trajectories (inits) of the same model. They then take the softmax outputs from different checkpoints along each training trajectory and append them to an array. The array has shape (num_of_trajectories, num_of_epochs, num_of_test_examples, num_classes); the predictions of each checkpoint are flattened into one vector, and a 2-component t-SNE is computed over these vectors.

The predictions from all the solutions and their individual epochs were appended to a single array because they belong to the same "space". We apply a 2-component t-SNE to reduce this high-dimensional space to two dimensions. Below is the result of this experiment for the small and medium-sized CNNs. And wow!

From the plots (shown below), it is evident that models with different initializations follow different trajectories. As each trajectory approaches convergence, its checkpoints tend to cluster within a single valley. Even though the models reach similar accuracy, we can clearly see evidence of multiple minima which lie on the same plane.
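
The reshaping step before t-SNE is easy to get wrong, so here is a minimal sketch of it. The helper names are hypothetical, the perplexity value is an arbitrary assumption, and `project_tsne` assumes scikit-learn is installed (it is imported lazily so the reshaping helper works without it). Each checkpoint becomes one point, and all trajectories are embedded together so that they share one 2-D space.

```python
import numpy as np

def flatten_for_tsne(softmax_outputs):
    """Turn (trajectories, epochs, examples, classes) into one row per checkpoint."""
    preds = np.asarray(softmax_outputs)
    num_traj, num_epochs = preds.shape[:2]
    points = preds.reshape(num_traj * num_epochs, -1)
    # Trajectory id of every row, useful for coloring the scatter plot.
    traj_ids = np.repeat(np.arange(num_traj), num_epochs)
    return points, traj_ids

def project_tsne(points, perplexity=5.0, seed=0):
    """Embed all checkpoints jointly in 2-D (assumes scikit-learn is available)."""
    from sklearn.manifold import TSNE
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=seed)
    return tsne.fit_transform(points)

# Hypothetical stand-in: 3 trajectories, 4 epochs, 5 test examples, 10 classes.
rng = np.random.default_rng(0)
fake_preds = rng.dirichlet(np.ones(10), size=(3, 4, 5))
points, traj_ids = flatten_for_tsne(fake_preds)
print(points.shape, traj_ids)
```

Note that the perplexity passed to t-SNE must be smaller than the number of checkpoints being embedded.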

## Accuracy as a Function of Ensemble Size

Another interesting question the authors explore is how ensemble size affects the overall test accuracy. Below we can see that as we keep increasing the ensemble size, the model performance improves. For SmallCNN, the improvement plateaus beyond a certain ensemble size. We think this might be because a small-capacity model does not reach an optimal solution on the training dataset. Ensembling predictions does help improve model performance, but after reaching peak performance, the uncertainty from multiple suboptimal models outweighs the benefit of ensembling.


This suggests that an ensemble is able to cover the optimization landscape better than a single model, and indeed that seems to be the case.

Although this behavior is interesting, using a large ensemble of very heavy models might not be practically feasible in deployment-related situations.
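
The accuracy curve can be computed from stored predictions alone, without retraining anything. A minimal sketch, assuming the softmax outputs of all independently trained models are stacked into one array of shape `(num_models, num_examples, num_classes)`; the function name and toy numbers are hypothetical.

```python
import numpy as np

def accuracy_vs_ensemble_size(probs, y_true):
    """Accuracy of the ensemble made of the first k models, for k = 1..num_models."""
    probs = np.asarray(probs)
    y_true = np.asarray(y_true)
    sizes = np.arange(1, probs.shape[0] + 1)
    # Running mean over models: entry k-1 averages the first k models.
    running_mean = np.cumsum(probs, axis=0) / sizes[:, None, None]
    preds = running_mean.argmax(axis=2)  # (num_models, num_examples)
    return (preds == y_true).mean(axis=1)

# Hypothetical softmax outputs of three models on two examples and three classes.
probs = np.array([
    [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]],
    [[0.2, 0.7, 0.1], [0.1, 0.3, 0.6]],
    [[0.5, 0.4, 0.1], [0.3, 0.3, 0.4]],
])
acc = accuracy_vs_ensemble_size(probs, y_true=[1, 2])
print(acc)
```

The curve depends on the order in which models are added, so averaging it over several random orderings gives a smoother estimate.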

## Perturbing an Already Optimized Solution Space

In addition to the experiments based on the checkpoints along a trajectory, the authors also explore the subspace along an individual trajectory. The subspace along a trajectory is the set of functions (solutions) that lie in the function space around the explored solution and that could be reached when retraining with the same initialization. The authors use a representative set of four subspace sampling methods:

• Monte Carlo dropout
• Diagonal Gaussian approximation
• Low-rank covariance matrix Gaussian approximation
• Random subspace approximation

The authors construct their subspace around an optimized weight-space solution θ (the weights and biases of a trained neural network). Using the t-SNE plot experimental setup, they show that the sampled subspace lies in the same valley as the optimized solution, while a different solution lies in a different valley.
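
To give a feel for what sampling around θ means, here is a simplified sketch of two of the listed methods, operating on a flattened weight vector. It is only in the spirit of random subspace and diagonal Gaussian sampling: the function names, the step-size range, and the use of a handful of snapshots to fit the Gaussian are our assumptions, not the authors' implementation.

```python
import numpy as np

def sample_random_subspace(theta, num_samples, scale, rng):
    """Perturb theta along random unit directions v: theta + t * v, |t| <= scale."""
    directions = rng.normal(size=(num_samples, theta.size))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    steps = rng.uniform(-scale, scale, size=(num_samples, 1))
    return theta + steps * directions

def sample_diagonal_gaussian(snapshots, num_samples, rng):
    """Fit an independent Gaussian per weight to the snapshots and sample from it."""
    snapshots = np.asarray(snapshots)  # (num_snapshots, num_weights)
    mean, std = snapshots.mean(axis=0), snapshots.std(axis=0)
    return mean + std * rng.normal(size=(num_samples, mean.size))

rng = np.random.default_rng(0)
theta = rng.normal(size=100)  # stand-in for flattened trained weights
near_theta = sample_random_subspace(theta, num_samples=5, scale=0.1, rng=rng)

# Stand-in for the last few iterates of one training run.
recent = theta + 0.01 * rng.normal(size=(10, 100))
gaussian_samples = sample_diagonal_gaussian(recent, num_samples=5, rng=rng)
print(np.linalg.norm(near_theta - theta, axis=1))
```

Every sample stays close to θ by construction, which is the point: these methods explore one valley, while independently initialized runs land in different ones.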

The authors validate two hypotheses:

1. Ensembling solutions obtained by sub-sampling around an optimized solution improves model performance. But…
2. The relative benefit of simple ensembling (shown above) is higher, as it averages predictions over more diverse solutions.

The plot below summarizes these:

## Conclusion

Using simple experiments, the paper we discussed in this report gives us an excellent understanding of why (deep) ensembles are so powerful: they cover the optimization landscape better. Below we leave you with a couple of amazing papers in case you are interested in knowing more about different aspects of deep neural networks:

## Acknowledgements

Thanks to Yannic Kilcher for his amazing explanation video of the paper which helped us pursue our experiments.

Thanks to Balaji Lakshminarayanan for providing feedback on the initial draft of the report and rectifying our mistake on the tSNE projections.

Hope you have enjoyed reading this report. For any feedback reach out to us on Twitter: @RisingSayak and @ayushthakur0.

Sayak Paul and Ayush Thakur have contributed equally to this report.