Introduction

The report explores the ideas presented in Deep Ensembles: A Loss Landscape Perspective by Stanislav Fort, Huiyi Hu, and Balaji Lakshminarayanan.

In the paper, the authors investigate the question - why do deep ensembles work better than single deep neural networks?

In their investigation, the authors find that models trained from different random initializations explore diverse modes in function space, whereas the functions sampled along a single training trajectory (or in a small neighborhood around a solution) remain largely similar.

Inspired by their findings, in this report, we present several different insights that are useful for understanding the dynamics of deep neural networks in general.

Full code on GitHub →

Revisiting the Optimization Landscape of Neural Networks

Training a neural network is a stochastic process, i.e., each time you train a neural network, it may not arrive at exactly the same solution as before. Neural networks are optimized using gradient-based learning, and this optimization problem is almost always non-convex. Expressed formally, it looks like so -

$ \underset{\theta}{\operatorname{minimize}} \; \frac{1}{m} \sum_{i=1}^{m} \ell\left(h_{\theta}\left(x_{i}\right), y_{i}\right) $

where,

- $\theta$ denotes the parameters (weights and biases) of the network,
- $m$ is the number of training examples,
- $h_{\theta}(x_i)$ is the network's prediction for the input $x_i$,
- $y_i$ is the corresponding ground-truth label, and
- $\ell$ is the loss function (e.g., cross-entropy).

Consider the figure below, which shows a sample non-convex loss landscape (typical for neural networks). As we can see, there are multiple local minima. A trained neural network ends up in only one of these local minima at a time, and the same network can end up in a different local minimum each time it is trained with a different random initialization, exhibiting high variance in its predictions.

We can also see that these local minima lie at roughly the same level in the loss landscape, which further suggests that whichever of them a network ends up in, it will yield more or less the same performance.

Mitigating the High Variance of a Single Model With Ensembling

To cover these local minima better, we often train several copies of the same model but with different initializations. During inference, we take predictions from each of these different solutions and average them. This works quite well in practice, and the process is referred to as ensembling. Ensembling also helps reduce the high variance that comes from the predictions of the individual models (the same network trained multiple times with different random initializations).
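As a minimal sketch of this averaging step (assuming `models` is a list of independently trained `tf.keras` models that end in a softmax layer and `x` is a batch of test images; both names are ours, not the paper's):

import numpy as np

def ensemble_predict(models, x):
    # Collect the softmax probabilities from each independently trained model
    # and average them to obtain the ensemble prediction.
    probs = np.stack([model.predict(x) for model in models], axis=0)
    mean_probs = probs.mean(axis=0)    # (num_examples, num_classes)
    return mean_probs.argmax(axis=-1)  # final class predictions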

In order to understand why ensembles work so well, we need to figure out the ingredients that make them cover the loss landscape better than a single model.

Neural networks are parameterized functions, as we saw earlier. Each time we train a network, we end up at a different point in parameter space, leading to a different optimum. The more diverse these solutions, the better the coverage of the different optima. So, how do we quantify this diversity?

To investigate this systematically, the authors do the following (among other things):

- compute the cosine similarity between the weights of checkpoints taken from the same training run and from runs with different initializations,
- measure the disagreement between the predictions of those checkpoints,
- visualize the training trajectories of different initializations with t-SNE, and
- study how accuracy scales with ensemble size and how subspace sampling around a single solution compares to ensembling independent solutions.


Before we dive deep into the experiments mentioned above, it is essential to review our experimental setup.

Experimental Setup

Note: We did not exactly follow what is specified in Section 3 of the paper; there are minor differences between our experimental setup and the authors'.

For convenience, we show below how the learning rate schedule looks and the data augmentation pipeline we followed.
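To give a sense of the shape of such a schedule, here is a step-decay sketch; the boundaries and rates below are placeholders rather than the exact values we trained with (those live in the linked code):

import tensorflow as tf

steps_per_epoch = 50000 // 128  # CIFAR-10 training set size / assumed batch size
# Drop the learning rate at fixed epoch boundaries (expressed in optimizer steps).
lr_schedule = tf.keras.optimizers.schedules.PiecewiseConstantDecay(
    boundaries=[15 * steps_per_epoch, 25 * steps_per_epoch],
    values=[1e-3, 1e-4, 1e-5],
)
optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)

And the augmentation pipeline: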

def augment(image, label):
    image = tf.image.resize_with_crop_or_pad(image, 40, 40)  # Add 8 pixels of padding
    image = tf.image.random_crop(image, size=[32, 32, 3])  # Random crop back to 32x32
    image = tf.image.random_brightness(image, max_delta=0.5)  # Random brightness
    image = tf.clip_by_value(image, 0., 1.)
    return image, label
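To show how this plugs into the input pipeline (assuming `train_ds` is a `tf.data.Dataset` of CIFAR-10 image/label pairs with pixel values already scaled to [0, 1]; the batch size is an assumption), the augmentation is applied like so:

AUTO = tf.data.experimental.AUTOTUNE
BATCH_SIZE = 128  # assumed batch size

train_ds = (
    train_ds
    .shuffle(1024)
    .map(augment, num_parallel_calls=AUTO)  # apply the augmentation defined above
    .batch(BATCH_SIZE)
    .prefetch(AUTO)
)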

Full code on GitHub →

We used Google Colab for running all of our experiments.

Dissecting the Weight Space of a (Deep) Ensemble

Going back to our experiments, we are going to present them in two different flavors: comparisons between snapshots taken from a single training run, and comparisons between models trained from different random initializations.

Note: By snapshots, we refer to models taken from epoch 0, epoch 1, and so on from the same training run (same initialization).

Cosine Similarity Between the Weights (Snapshots)
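As a small sketch of how this comparison can be made (assuming `model_a` and `model_b` are two checkpoints of the same architecture; the helper names are ours), we flatten all trainable weights of each checkpoint into a single vector and compute the cosine similarity between the two vectors:

import numpy as np

def flatten_weights(model):
    # Concatenate all trainable variables of a model into one flat vector.
    return np.concatenate([w.numpy().ravel() for w in model.trainable_variables])

def weight_cosine_similarity(model_a, model_b):
    w_a, w_b = flatten_weights(model_a), flatten_weights(model_b)
    return np.dot(w_a, w_b) / (np.linalg.norm(w_a) * np.linalg.norm(w_b))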


Observations


Cosine Similarity Between the Weights (Different Inits)


Observations

Disagreement Between Predictions (Snapshots)
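Following the paper, disagreement here is the fraction of test examples on which two models predict different labels. A minimal sketch (assuming `preds_a` and `preds_b` hold the two models' softmax outputs on the same test set):

import numpy as np

def disagreement(preds_a, preds_b):
    # Fraction of test examples where the predicted classes differ.
    return np.mean(preds_a.argmax(axis=-1) != preds_b.argmax(axis=-1))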


Observations

Disagreement Between Predictions (Different Inits)


Observations

Reproduce analysis →

Intrinsic Hardness as a Function of Initialization

Below we see that the set of examples that confuses a model changes epoch by epoch as optimization proceeds. We further see that this set varies when we train the model with different initializations. We could not include results from all the different initializations due to space constraints, but feel free to check them out here. This suggests that the definition of intrinsically hard examples is relative to how a model is initialized. It may also suggest that the images that cause the top losses during training (epoch-wise) are not the same when we change the initialization of the model.

Note: You can click on the little button located at the top-left corner and play with the slider to see how the confusion matrices change with epochs.

The idea of creating an epoch-wise callback is borrowed from this tutorial.
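A minimal sketch of such a callback (assuming `x_test` and `y_test` with integer labels are in scope; the class name and exact bookkeeping are ours):

import numpy as np
import tensorflow as tf

class EpochwiseMistakes(tf.keras.callbacks.Callback):
    # Records, at the end of every epoch, which test examples the model got wrong.
    def __init__(self, x_test, y_test):
        super().__init__()
        self.x_test, self.y_test = x_test, y_test
        self.misclassified = []

    def on_epoch_end(self, epoch, logs=None):
        preds = self.model.predict(self.x_test).argmax(axis=-1)
        self.misclassified.append(np.where(preds != self.y_test)[0])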


Different Initializations and Their Paths to Optimization

We talked about different initializations of the same model and observed the functional dissimilarity between them. To spice it up, let's try to visualize the paths of the different trajectories. The authors do so by taking three (for simplicity) different trajectories (inits) of the same model. They then take the softmax outputs from different checkpoints along each training trajectory and append them to an array of shape (num_of_trajectories, num_of_epochs, num_of_test_examples, num_classes), and compute a two-component t-SNE of this array.

The predictions from all the solutions and their individual epochs are appended to a single array because they belong to the same "space". We apply two-component t-SNE to reduce this higher-dimensional space to a two-dimensional one. Below is the result of this experiment for the Small and Medium-sized CNNs. And wow!
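A sketch of this projection (assuming `predictions` is the array of checkpoint-wise softmax outputs with the shape given above):

import numpy as np
from sklearn.manifold import TSNE

num_traj, num_epochs = predictions.shape[:2]
# Each row is the full set of test predictions of one checkpoint,
# flattened into a single high-dimensional vector.
flat = predictions.reshape(num_traj * num_epochs, -1)
embedded = TSNE(n_components=2).fit_transform(flat)   # (num_traj * num_epochs, 2)
embedded = embedded.reshape(num_traj, num_epochs, 2)  # one 2D path per trajectory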

In our opinion, and also from the plots (shown below), it is evident that models with different initializations follow different trajectories. As they approach convergence, the trajectories tend to cluster around valleys in this space. Even though the models reach similar accuracy, we can clearly see evidence of multiple minima that lie at the same level.


Accuracy as a Function of Ensemble Size

Another interesting question the authors explore is how ensemble size affects overall test accuracy. Below we can see that as we keep increasing the ensemble size, test accuracy improves. For SmallCNN, after a certain point, the improvement plateaus. We think this might be because a small-capacity model does not produce an optimal solution on the training dataset; ensembling predictions does help improve performance, but after reaching peak performance, the uncertainty from multiple suboptimal models outweighs the benefit of ensembling.
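A sketch of how this curve can be produced (assuming `all_probs` holds the models' softmax outputs with shape (num_models, num_test_examples, num_classes) and `y_test` holds the integer test labels):

import numpy as np

ensemble_accuracies = []
for k in range(1, all_probs.shape[0] + 1):
    # Average the softmax outputs of the first k models and measure accuracy.
    mean_probs = all_probs[:k].mean(axis=0)
    ensemble_accuracies.append(np.mean(mean_probs.argmax(axis=-1) == y_test))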


This again suggests that an ensemble is able to cover the optimization landscape better than a single model, and that indeed seems to be the case.

Although this behavior is interesting, in deployment-related situations using a large ensemble of very heavy models might not be practically feasible.

Perturbing an Already Optimized Solution Space

In addition to the experiments based on the checkpoints along a trajectory, the authors also explore the subspace around an individual trajectory, i.e., the set of functions (solutions) that lie in the neighborhood of the explored region of weight space and that could be reached when retraining from the same initialization. The authors use a representative set of four subspace sampling methods:

- random subspace sampling,
- Monte Carlo dropout subspace,
- diagonal Gaussian subspace, and
- low-rank Gaussian subspace.

The authors construct their subspace around an optimized weight-space solution θ (the weights and biases of a trained neural network). Using the t-SNE setup from earlier, they show that the sampled subspace lies in the same valley as the optimized solution, while a solution from a different initialization lies in a different valley.
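To make the idea concrete, here is a simplified sketch of sampling around an optimized solution with an isotropic Gaussian perturbation (the paper's diagonal and low-rank variants instead fit the Gaussian from iterates along the trajectory; `model` and `sigma` are our own placeholder names):

import numpy as np
import tensorflow as tf

def sample_around_solution(model, sigma=0.01):
    # Perturb every trained weight tensor with independent Gaussian noise,
    # yielding a nearby solution in the same region of weight space.
    perturbed = tf.keras.models.clone_model(model)
    perturbed.set_weights(
        [w + np.random.normal(0.0, sigma, size=w.shape) for w in model.get_weights()]
    )
    return perturbed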

The authors validate two hypotheses -

  1. Ensembling the solutions sampled around an optimized solution provides benefits in terms of model performance. But…
  2. The relative benefit of simple ensembling (shown above) is higher, as it averages predictions over more diverse solutions.

The plot below summarizes these -

Reproduce analysis →

Conclusion

The paper we discussed in this report uses simple experiments to give us an excellent understanding of why (deep) ensembles are so powerful at covering the optimization landscape. Below we leave you with a couple of amazing papers in case you are interested in knowing more about different aspects of deep neural networks -

Acknowledgements

Thanks to Yannic Kilcher for his amazing explanation video of the paper, which helped us pursue our experiments.

Thanks to Balaji Lakshminarayanan for providing feedback on the initial draft of the report and rectifying our mistake on the t-SNE projections.

We hope you have enjoyed reading this report. For any feedback, reach out to us on Twitter: @RisingSayak and @ayushthakur0.

Sayak Paul and Ayush Thakur have contributed equally to this report.