
Binary Classification with MC-Dropout Models

An introduction to classification with Bayesian Deep Learning using the Monte-Carlo Dropout implementation.

Introduction

Uncertainty measurement for Deep Learning models may play an important role in the context of risk management. Still, to understand the mechanics and interpretation of aleatoric and epistemic uncertainty, we first analyze them in a simpler and risk-free context: cats vs. dogs image classification.

Our submodels implementation allows us to use any pre-trained model as the encoder model E. For this dataset, it is convenient to use a well-known image classification model; in particular, we use the Keras implementation of EfficientNet, in its B0 configuration, with weights pre-trained on the ImageNet dataset (https://image-net.org/).
The stochastic classifier model C is implemented by stacking three fully connected (dense) layers. As the MC-Dropout model requires, a dropout layer is added between each pair of dense layers to make the output stochastic. The following figure shows the architecture of the model used for this binary classification problem.
EfficientNetB0 with MC-Dropout implementation. In this model, EfficientNet is used as a deterministic encoder model. Its output feeds a stochastic classifier model.
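As a rough sketch, this encoder/classifier pair could be assembled in Keras as follows; the layer widths, dropout rate, input size, and the freezing of the encoder are assumptions not specified in the text. Passing training=True to the dropout layers keeps them active at inference time, which is what makes each forward pass a Monte-Carlo sample.

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

def build_encoder(input_shape=(224, 224, 3)):
    # Deterministic encoder E: EfficientNetB0 with ImageNet weights (assumed frozen).
    base = keras.applications.EfficientNetB0(
        include_top=False, weights="imagenet",
        input_shape=input_shape, pooling="avg")
    base.trainable = False
    return base

def build_stochastic_classifier(feature_dim=1280, dropout_rate=0.5):
    # Stochastic classifier C: three dense layers with dropout in between.
    # training=True keeps dropout active at prediction time (MC-Dropout).
    z = keras.Input(shape=(feature_dim,))
    x = layers.Dense(512, activation="relu")(z)
    x = layers.Dropout(dropout_rate)(x, training=True)
    x = layers.Dense(128, activation="relu")(x)
    x = layers.Dropout(dropout_rate)(x, training=True)
    probs = layers.Dense(2, activation="softmax")(x)   # cat / dog probabilities
    return keras.Model(z, probs)
```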
In the rest of this report, we discuss the training procedure and the classification performance, analyze and interpret the uncertainty, and finally share some insights into applying these models in the Her2 Scoring context.

Training Procedure

In our opinion, the most important feature of an MC-Dropout model is its seamless integration with the most commonly used deep learning frameworks, and in particular the fact that the training procedure is not altered: the weights are fitted with the classic backpropagation algorithm. The only modification to the architecture is the addition of dropout layers, which are trained as usual, i.e., at each step we take a sample of the weights by dropping some connections in the network and propagating the error through the remaining connections.
The model is trained with the RMSProp algorithm with a learning rate of 0.001; preliminary experiments showed that SGD and momentum-based optimizers resulted in underfitting and models with low generalization.
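A minimal training sketch with these settings, assuming the builder functions above and hypothetical tf.data pipelines train_ds and val_ds; the loss choice and number of epochs are also assumptions.

```python
encoder = build_encoder()
classifier = build_stochastic_classifier()

inputs = keras.Input(shape=(224, 224, 3))
model = keras.Model(inputs, classifier(encoder(inputs)))

model.compile(
    optimizer=keras.optimizers.RMSprop(learning_rate=1e-3),  # settings from the text
    loss="sparse_categorical_crossentropy",                  # assumes integer labels
    metrics=["accuracy"])
model.fit(train_ds, validation_data=val_ds, epochs=10)       # hypothetical datasets
```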

Classification Metrics

In general, the performance metrics show that this model can perfectly classify images of cats and dogs.

[W&B panels: classification metrics for the training set (14 runs) and the test set (1 run).]


Prediction examples

An MC-Dropout model predicts the class of a new input differently from a classic deep learning model. The predictive distribution, i.e., the estimated distribution over classes for a given input, is computed by sampling the trained model with T stochastic forward passes and averaging the results.
p(y=C_k|x_i,\mathcal{D}_{\text{train}}) \approx \frac{1}{T}\sum_{t=1}^T F(x_i, \hat{w}_t)
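A minimal sketch of this estimate, assuming the encoder/classifier pair built earlier; the number of passes T is an assumption.

```python
import numpy as np

def predictive_distribution(encoder, classifier, image, T=50):
    # Step 1: a single deterministic pass through the encoder.
    z = encoder(image[None, ...])
    # Step 2: T stochastic passes through the dropout classifier
    # (dropout was forced on at build time with training=True).
    samples = np.stack([classifier(z).numpy()[0] for _ in range(T)])
    return samples.mean(axis=0), samples   # predictive distribution, raw samples
```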

We implement this estimation with the 2-step predictive distribution algorithm presented previously. The following figures show the input image with the true and predicted classes (top left), the predictive distribution and epistemic uncertainty (top right), and the distribution of the stochastic forward passes by class.

[W&B panels: prediction examples from the training set (5 runs) and the evaluation set (1 run).]



Epistemic Uncertainty

The previous section has shown that the model can classify the images, but we are interested in measuring uncertainty. We start with epistemic uncertainty. This kind of uncertainty represents the ignorance about the model parameters; it is also referred to as model uncertainty because it can be interpreted as the confidence of the model in its prediction. This type of uncertainty can be reduced by adding new samples to our datasets.
There are three measurements of epistemic uncertainty: predictive entropy, mutual information, and variation ratio. For simplicity, we will focus our analysis on predictive entropy and mutual information:
\mathbb{H}[y|x,\mathcal{D}_\text{train}] = -\sum_k p(y=C_k|x_i,\mathcal{D}_{\text{train}})\log p(y=C_k|x_i,\mathcal{D}_{\text{train}})

\mathbb{I}[y|x,\mathcal{D}_\text{train}] = \mathbb{H}[y|x,\mathcal{D}_\text{train}] - \mathrm{E}_{p(w|\mathcal{D}_{\text{train}})}\big[\mathbb{H}[y|x,w]\big]

The predictive entropy \mathbb{H} captures the uncertainty in the prediction, whereas the mutual information \mathbb{I} captures the model's confidence in its output.
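Both measures can be computed directly from the stochastic forward passes; the sketch below reuses the samples array returned by the predictive_distribution() helper above.

```python
import numpy as np

def epistemic_uncertainty(samples, eps=1e-12):
    # samples: array of shape [T, num_classes] with the T softmax outputs.
    p_mean = samples.mean(axis=0)                                          # predictive distribution
    predictive_entropy = -np.sum(p_mean * np.log(p_mean + eps))            # H[y | x, D_train]
    expected_entropy = -np.mean(np.sum(samples * np.log(samples + eps), axis=1))
    mutual_information = predictive_entropy - expected_entropy             # I[y | x, D_train]
    return predictive_entropy, mutual_information
```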
We evaluate the best model obtained in the training phase on the test set. The next figure shows samples from the evaluation set with the highest predictive entropy and mutual information.

[W&B panel: evaluation-set samples with the highest predictive entropy and mutual information (1 run).]

The epistemic uncertainty results show that the mean predictive entropy and mutual information for the evaluation set are near 0, so, broadly speaking, the model has a high level of confidence without any class preference. Nonetheless, some outliers have a large epistemic uncertainty; for those samples (see the previous figure), the model has a low level of confidence in its prediction.

Aleatoric Uncertainty

The other type of uncertainty is the aleatoric uncertainty, also known as data uncertainty. This is an irreducible type of uncertainty: it is the noise inherent in the observations. Aleatoric uncertainty is categorized into homoscedastic uncertainty, which stays constant for different inputs, and heteroscedastic uncertainty, which depends on the inputs to the model and can vary with each new image, so some inputs will have higher uncertainty than others. Heteroscedastic uncertainty is especially important for computer vision applications.
By applying the method discussed in the paper "What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?", we can measure the aleatoric uncertainty of each input sample as a predictive variance, building the aleatoric model. We do this by modifying the stochastic classifier from the MC-Dropout model: removing the dropout layers and adding the predictive variance \hat{\sigma} to the model's output. The method works by adding noise drawn from a normal distribution \epsilon \sim \mathcal{N}(0, I) to the model output before the softmax function, i.e., we corrupt the output in the logits space \hat{c}.
\hat{c} = C^{\text{logits}}(z, \hat{\omega}^{(c)})

\hat{\sigma}^2 = C^{\sigma}(z, \hat{\omega}^{(c)})

\hat{p}(y|z, \hat{\omega}^{(c)}) = \text{softmax}(\hat{c})

The next figure shows the new aleatoric model architecture; we can reuse the same weights for the encoder.
Aleatoric model for binary classification; the uncertainty estimation is built into the model's output.
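A sketch of how such a classifier head could look in Keras, reusing the frozen encoder features from the earlier sketch; the layer widths are assumptions, and the variance is predicted as a log-variance for numerical stability.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_aleatoric_classifier(feature_dim=1280):
    # Same classifier shape as before but without dropout, with two heads:
    # the logits c_hat and a per-sample log-variance log(sigma_hat^2).
    z = keras.Input(shape=(feature_dim,))
    x = layers.Dense(512, activation="relu")(z)
    x = layers.Dense(128, activation="relu")(x)
    logits = layers.Dense(2, name="logits")(x)          # \hat{c}
    log_var = layers.Dense(1, name="log_variance")(x)   # log \hat{\sigma}^2
    return keras.Model(z, [logits, log_var])
```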
In order to train this model, we use the following loss function instead of the cross-entropy loss.
\tilde{p}_t(y|z, \hat{\omega}^{(c)}) = \text{softmax}(\hat{c}_i + \hat{\sigma}_i^2\cdot\epsilon_t)

\mathcal{L}_{\text{aleatoric}} = -\sum_i\log\left(\frac{1}{T}\sum_t \tilde{p}_t(y=C_i^*| z_i, \hat{\omega}^{(c)})\right)

Notice that for this model \hat{w} is fixed, so the model's outputs are deterministic; instead, we obtain stochastic outputs in the evaluation of the loss function.
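A sketch of this loss, assuming integer class labels and the logits/log-variance outputs of the classifier sketch above; the number of noise samples T is an assumption.

```python
import tensorflow as tf

def aleatoric_loss(y_true, logits, log_var, T=25):
    # Corrupt the logits with Gaussian noise scaled by the predicted variance,
    # average the softmax over T noise samples, and minimize the negative log
    # of the mean probability assigned to the true class.
    sigma2 = tf.exp(log_var)                                    # \hat{\sigma}^2, shape [B, 1]
    noise_shape = tf.concat([[T], tf.shape(logits)], axis=0)    # [T, B, num_classes]
    eps = tf.random.normal(noise_shape)                         # \epsilon_t ~ N(0, I)
    corrupted = logits[None, ...] + sigma2[None, ...] * eps     # \hat{c} + \hat{\sigma}^2 * \epsilon_t
    probs = tf.nn.softmax(corrupted, axis=-1)                   # \tilde{p}_t
    mean_probs = tf.reduce_mean(probs, axis=0)                  # average over the T samples
    one_hot = tf.one_hot(y_true, depth=2)                       # binary problem: 2 classes
    true_class_probs = tf.reduce_sum(mean_probs * one_hot, axis=-1)
    return -tf.reduce_mean(tf.math.log(true_class_probs + 1e-12))
```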
We train the aleatoric model and evaluate it on the test set.

[W&B panels: aleatoric model results (run set: 13 runs; training set: 13 runs; evaluation set 2: 2 runs).]

A higher value of predictive variance means higher noise in a given input image. The following figure shows samples from the evaluation set with the highest aleatoric uncertainty. The aleatoric uncertainty results show a difference in the distribution of uncertainty for each class: the dog class tends to have a higher uncertainty than the cat class.

[W&B panel: evaluation-set samples with the highest aleatoric uncertainty (1 run).]


The importance of uncertainty

The next figure shows some relevant examples from the evaluation set with the highest predictive entropy.
High predictive entropy examples for binary classification in the cats vs dogs dataset.
The model has low confidence with these examples, and we can infer that three types of conditions may cause this. The first row shows that an image in which only a small pixel region contains relevant information is a source of uncertainty. As humans, we also have some difficulty distinguishing these types of pictures due to their low resolution.
The second row contains images that are out of the distribution of the training set. Pictures of people or text are a real problem, primarily because this type of error can be found in official datasets used by the community.
Finally, the third row contains images of cats and dogs in odd-looking poses, like the first picture, where what seems to be a dog looking at the camera is actually the top of a cat's head, or cats and dogs with mixed features, like a dog with pointy ears or a cat with pitbull-like fur.
Dogs vs. Cats samples. Different dog breeds increase the dataset noise.
In the case of the predictive variance, samples with a large black region also present a high uncertainty. The difference between classes can be explained by the high variability of dog features in contrast to the low variability of cat features: the differences in breed, shape, size, and fur are visually larger among dogs than among cats, thus creating a noisier dataset for the dog class.

What can we expect for the Her2 Scoring dataset?

For Her2 Scoring, we expect samples with little relevant tissue or with no dyed cellular structures to present a higher epistemic uncertainty. These types of samples contain no relevant information to make a confident decision. Thus, as we aim to mitigate errors in a high-risk environment, low-confidence results can be ignored in the aggregation step or become candidates for expert evaluation.
The aleatoric uncertainty allows us to characterize the training and test sets. In the Her2 Scoring problem, we expect a higher variance for the 2+ and 4+ classes due to their overlapping features. Moreover, not all patches will contain dyed tissue, which adds noise to these classes.