In this report, we discuss supervised contrastive learning (SCL), a training methodology introduced in https://arxiv.org/abs/2004.11362 by Khosla et al. It adapts contrastive learning to the fully supervised setting: the method uses label information to cluster samples belonging to the same class in the embedding space. A linear classifier is then trained on top of these embeddings to classify the images. The authors report that this method outperforms cross-entropy.
Source: original paper
Here in this article, we will discuss the proposed methodology in detail, followed by running three experiments to check how it performs compared to Cross-Entropy.
Cross-entropy is something most of us are familiar with. It is used as a loss function for classification problems to quantify the model's error. It measures the difference between two probability distributions. Every prediction task has a predicted value, i.e., the distribution the model outputs, and a real value, i.e., the true distribution given by the labels. This error is then used to optimize the model and find the parameter values for which the predicted values are close to the actual values (ground truth).
Having said this, one wonders, **what is the difference between KL divergence and cross-entropy?** The difference lies in the formula. Let's take a closer look at them. The formula for cross-entropy loss for multiclass classification is
$-\sum_{i=1}^{N} y_i \log(\hat{y}_i)$
Where,
$N$ is the number of classes.
$y_i$ is the actual value.
$\hat{y}_i$ is the predicted value.
Whereas the formula for KL Divergence for the same model will be
$\sum_{i=1}^{N} y_i \log(y_i / \hat{y}_i)$
Let's break down the formula.
KL Divergence = $\sum_{i=1}^{N} y_i \log(y_i / \hat{y}_i)$
= $\sum_{i=1}^{N} y_i \log(y_i) - \sum_{i=1}^{N} y_i \log(\hat{y}_i)$
= Cross-Entropy - Entropy
Here the entropy of the true distribution is $-\sum_{i=1}^{N} y_i \log(y_i)$, so KL Divergence is the cross-entropy minus the entropy. The entropy depends only on the true labels and not on the model, so it stays constant during training. Hence if you optimize KL Divergence, you are optimizing cross-entropy. In practice, cross-entropy is preferred as a loss function simply because it avoids computing the redundant constant term.
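The relation between the three quantities can be checked numerically. The snippet below is a minimal NumPy sketch; the 3-class distributions are made-up values chosen only for illustration, and entropy is taken as $-\sum_i y_i \log(y_i)$.

```python
import numpy as np

def cross_entropy(y_true, y_pred):
    """Cross-entropy: -sum(y * log(y_hat))."""
    return -np.sum(y_true * np.log(y_pred))

def entropy(y_true):
    """Entropy: -sum(y * log(y)); terms with y = 0 contribute nothing."""
    nonzero = y_true > 0
    return -np.sum(y_true[nonzero] * np.log(y_true[nonzero]))

def kl_divergence(y_true, y_pred):
    """KL divergence: sum(y * log(y / y_hat)), skipping terms with y = 0."""
    nonzero = y_true > 0
    return np.sum(y_true[nonzero] * np.log(y_true[nonzero] / y_pred[nonzero]))

# Hypothetical 3-class example: a smoothed label and a model prediction.
y_true = np.array([0.8, 0.1, 0.1])
y_pred = np.array([0.6, 0.3, 0.1])

ce = cross_entropy(y_true, y_pred)
h = entropy(y_true)
kl = kl_divergence(y_true, y_pred)

# KL divergence equals cross-entropy minus the model-independent entropy.
print(np.isclose(kl, ce - h))
```

With one-hot labels the entropy is zero, so the two losses coincide exactly.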
Coming back to cross-entropy: computing the loss is an essential step because backpropagation starts from it and pushes the values of the parameters in the right direction, which is what produces a well-performing network. With a loss function that is inappropriate or works poorly, training may fail altogether. Still, cross-entropy has its problems.
It is sensitive to noise and adversarial examples, and gives poor results when these are present.
Cross-entropy leads to poor margins, so the model can give wrong results when the inputs differ even slightly from the training data.
To overcome these problems, a new approach named Supervised Contrastive Learning has been proposed. Let us see what is meant by contrastive learning.
The literal meaning of contrasting is to compare in order to show differences. In contrastive learning, we contrast similar and dissimilar things so that the machine learns to tell them apart. For instance, imagine a binary classification problem where the machine has to classify an input image as a cat or a dog. In contrastive learning, we make the machine understand which images are similar to each other and which are not.
Taking this forward, let us try to understand supervised contrastive learning.
The idea proposed in supervised contrastive learning is pretty simple: learn to map the normalized encodings of samples belonging to the same class close together, and those of samples belonging to different classes far apart. This means that all cat image embeddings are close to each other while being distant from all dog image embeddings and vice versa.
We know that the neural network first converts the image into a representation, and then uses this representation to predict the result. So if the representations are formed, keeping the idea given above in mind, it will be easier for the classifier to give accurate results.
The proposed idea is divided into two stages:
Stage 1: In this stage, the network is trained using contrastive loss. Here the images are encoded in such a way that embeddings of the same class are close and those of other classes are far apart. To do this, image labels are used. This stage has three components, namely the data augmentation module, the encoder network, and the projection network. These components are explained below separately.
Stage 2: Here, the encoder network used in Stage 1 is frozen, and the projector network is discarded. The representation learned by the encoder network is then used to train a classifier, which is nothing but a linear layer. At this stage, the cross-entropy loss is used to train the classifier to predict the labels.
Let us have a look at the components of Stage 1 of the training.
1) Data Augmentation Module.
The module transforms the input image into augmented images. For each image, two augmented images are generated, each with a different augmentation policy.
To get the first augmented image, the original image is randomly cropped and then resized back to the original image size.
To get the second augmented image, three different options were evaluated.
i). AutoAugment
ii). RandAugment
iii). SimAugment (the augmentation scheme proposed in SimCLR).
The authors also used these same data augmentation policies in stage 2 to train the linear classifier, and found that this practice gave the best results.
For every single input image, the module thus produces two different augmented images. That means if there were N sample images, this stage would return 2N images.
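How N images become 2N views can be sketched in a few lines of NumPy. This is only an illustration: the random crop-and-resize follows the description of the first view, while the horizontal flip is a deliberately simple stand-in for the second policy (it is not AutoAugment, RandAugment, or SimAugment), and all function names are hypothetical.

```python
import numpy as np

def random_crop_and_resize(image, rng, min_scale=0.5):
    """Crop a random square region and resize it back (nearest neighbour)."""
    height, width, _ = image.shape
    side = min(height, width)
    crop = int(rng.integers(int(min_scale * side), side + 1))
    top = int(rng.integers(0, height - crop + 1))
    left = int(rng.integers(0, width - crop + 1))
    patch = image[top:top + crop, left:left + crop]
    # Nearest-neighbour resize back to the original height and width.
    rows = np.arange(height) * crop // height
    cols = np.arange(width) * crop // width
    return patch[rows][:, cols]

def random_flip(image, rng):
    """Stand-in for the second policy: horizontal flip with probability 0.5."""
    return image[:, ::-1] if rng.random() < 0.5 else image

def make_two_views(batch, labels, rng):
    """Turn N images into 2N augmented views; every label appears twice."""
    views_a = np.stack([random_crop_and_resize(img, rng) for img in batch])
    views_b = np.stack([random_flip(img, rng) for img in batch])
    return np.concatenate([views_a, views_b]), np.concatenate([labels, labels])

rng = np.random.default_rng(0)
batch = rng.random((4, 32, 32, 3))   # N = 4 toy images
labels = np.array([0, 1, 0, 1])
views, view_labels = make_two_views(batch, labels, rng)
print(views.shape, view_labels.shape)  # (8, 32, 32, 3) (8,)
```

Keeping the labels aligned with the views matters here, because the supervised loss needs the class of every augmented image.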
2) Encoder Network.
It simply converts the image into a representation vector. The authors used headless ResNet-50 and ResNet-200 as the base models for the encoder network. The two augmented images of an input image, obtained from the data augmentation module, are passed separately through the same encoder, which outputs a pair of representation vectors. These outputs are normalized. This means one input image will have two representations.
3) Projection network.
It converts the representation vectors into a vector suitable for contrastive loss calculation. The authors used a multi-layer perceptron with a single hidden layer of size 2048 and an output vector of size $D_P = 128$. The encoded vectors which we get as an output from the encoder network are fed into this network. The output of this network, i.e., the projection vectors, is normalized and then used in the loss function.
Once the output vector of this projection network is sent to the supervised contrastive loss function(explained below), the loss is calculated and minimized.
This whole operation is shown diagrammatically below.
Where,
$x$ is the input image.
da1 & da2 are two different augmentation policies used in the Data Augmentation Module.
$x_1'$ & $x_2'$ are the output of the Data Augmentation Module.
$E(x_1')$ & $E(x_2')$ are the output vector of the Encoder Network.
$P(E(x_1'))$ & $P(E(x_2'))$ are the output vector of the Projector Network.
$y$ is the output.
The projected vector is sent to the supervised contrastive loss, which we will see in the next section.
The following formula gives the supervised contrastive loss function:
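The formula below is a sketch written out from the symbol definitions in this section and from our reading of the first arXiv version of the paper, so treat the exact normalization as an assumption rather than a verbatim copy. Here $y_i$ denotes the label of image $i$, and $N_{y_i}$ is the number of original images in the mini-batch carrying that label:

$$
L^{sup} = \sum_{i=1}^{2N} L^{sup}_i
$$

$$
L^{sup}_i = \frac{-1}{2N_{y_i} - 1} \sum_{j=1}^{2N} 1_{i \neq j} \cdot 1_{y_i = y_j} \cdot \log \frac{\exp(z_i \cdot z_j / \tau)}{\sum_{k=1}^{2N} 1_{i \neq k} \cdot \exp(z_i \cdot z_k / \tau)}
$$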
Where,
$N$ is the number of randomly sampled images in a mini-batch. After passing these N images through the model in Stage 1, we will get 2N images.
$i$ is the index of an arbitrary augmented image within a mini-batch.
$j$ is the index of the other augmented image originating from the same input image.
$k$ is the index of any other image in the mini-batch apart from $x_i$.
$z_i$ is the projected vector of the augmented image $x_i$, i.e., $z_i = P(E(x_i))$.
This means $z_i$ and $z_j$ are the projected vectors of the same image, and $z_k$ is the projected vector of any other image in the mini-batch.
$τ$ is a scalar temperature parameter, which is always positive.
$1_B$ is 1 iff the condition B is true, 0 otherwise.
$N_y$ is the total number of images in the minibatch with the same label $y$.
$z_i \cdot z_j$ computes an inner (dot) product between the normalized vectors.
Now to understand how this function is doing what it is expected to do, I would like to draw your attention to another topic called inner product, aka dot product. Look at the image below.
Here a & b are two vectors.
On the left, you can see that if we increase the angle (theta), the two vectors move apart, whereas on the right, if we decrease the angle between them, they come closer to each other. Keeping this in mind, let us come to dot products. The dot product between two vectors $a$ and $b$ is given by $a \cdot b = \lVert a \rVert \, \lVert b \rVert \cos\theta$, where $\theta$ is the angle between them.
If the angle is small, the cosine is large, and vice versa. Putting these things together, if we want to place two vectors far from each other in space, we have to increase the angle between them, which means making the cosine of the angle small. Ultimately, we have to make the dot product of the two vectors small. Similarly, when we want to place two vectors close to each other, we make the dot product between them large. In conclusion, the higher the dot product, the closer the vectors, and vice versa.
But wait! Is the dot product only about the cosine of the angle between the vectors? No. It also contains two more terms, the magnitudes of the vectors, and we certainly do not want the result to depend on them. We want to use the dot product to measure the closeness of two vectors in space, and for this it should be independent of the vector magnitudes. To achieve this, the authors do not use the projected vectors coming from the projector network directly; they normalize them first. The vectors then lie on a unit hypersphere, i.e., each has magnitude 1. The dot product of two such normalized vectors is exactly their cosine similarity, so it gives a clear view of their closeness. One might ask, why not compute the cosine similarity on the raw vectors instead? The answer is computation cost: normalizing each vector once up front is cheaper than recomputing the magnitudes for every pair of vectors.
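A small NumPy check makes the effect of normalization concrete. The vectors below are made-up examples: `b` points in the same direction as `a` but is ten times longer, and `c` is perpendicular to `a`.

```python
import numpy as np

def l2_normalize(vectors, eps=1e-12):
    """Scale each row to unit length so it lies on the unit hypersphere."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.maximum(norms, eps)

def cosine_similarity(a, b):
    """Cosine of the angle between two raw (unnormalized) vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([3.0, 4.0])
b = np.array([30.0, 40.0])   # same direction as a, ten times longer
c = np.array([-4.0, 3.0])    # perpendicular to a

# The raw dot product is dominated by the magnitudes.
print(np.dot(a, a), np.dot(a, b))  # 25.0 250.0

# After normalization the dot product depends on the angle only.
a_n, b_n, c_n = l2_normalize(np.stack([a, b, c]))
print(np.isclose(np.dot(a_n, b_n), cosine_similarity(a, b)))
print(np.isclose(np.dot(a_n, c_n), 0.0))
```

Before normalization, `a` looks ten times more similar to `b` than to itself; after normalization both similarities are 1, which is the behaviour the loss relies on.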
Now coming back to the loss function: minimizing the loss means maximizing the log term. In the numerator of the log term, we have the exponential of the dot product between images belonging to the same class, whereas the denominator sums the exponentials of the dot products between the anchor image and the other images in the batch, which include the images belonging to different classes. To maximize the log term, the numerator must be increased and the denominator decreased, i.e., the dot product between images of the same class is maximized while the dot product between images of different classes is minimized. Ultimately, when we minimize the loss, we bring the vectors belonging to the same class close together and push those belonging to different classes far apart.
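To make the behaviour described above tangible, here is a minimal NumPy sketch of a loss with this structure. It is not the authors' implementation: the function name and the default temperature are hypothetical, and averaging over each anchor's positives and then over anchors (rather than summing) is our assumption, which only rescales the value.

```python
import numpy as np

def supervised_contrastive_loss(z, labels, temperature=0.1):
    """Sketch of a supervised contrastive loss.

    z:      (2N, D) projected vectors, one row per augmented image.
    labels: (2N,) integer class labels.
    """
    z = z / np.linalg.norm(z, axis=1, keepdims=True)       # unit hypersphere
    not_self = ~np.eye(len(z), dtype=bool)                  # indicator i != k
    positives = (labels[:, None] == labels[None, :]) & not_self

    # Pairwise z_i . z_k / tau, with the anchor itself masked out.
    logits = np.where(not_self, z @ z.T / temperature, -np.inf)
    logits = logits - logits.max(axis=1, keepdims=True)     # numerical stability
    log_denominator = np.log(np.exp(logits).sum(axis=1, keepdims=True))
    log_prob = logits - log_denominator                     # the log term

    # Average the log term over each anchor's positives, then negate.
    n_pos = positives.sum(axis=1)
    per_anchor = -np.where(positives, log_prob, 0.0).sum(axis=1) / np.maximum(n_pos, 1)
    return per_anchor[n_pos > 0].mean()

labels = np.array([0, 0, 1, 1])
# Same-class vectors point in similar directions.
clustered = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
# Same-class vectors point in different directions.
mixed = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]])

loss_clustered = supervised_contrastive_loss(clustered, labels)
loss_mixed = supervised_contrastive_loss(mixed, labels)
print(loss_clustered < loss_mixed)  # True
```

The toy comparison shows the intended effect: embeddings that are already clustered by class receive a much smaller loss than embeddings in which the classes are interleaved.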
Now that we have seen all the components of supervised contrastive learning individually, let us quickly put them together in the next section.
In this section, we walk through the proposed methodology end to end, which should help in understanding the full picture.
Let us look at all the components step by step.
Step 0: This is the preprocessing step of the dataset. Before starting with the images, we resize them to a fixed size, for example 128x128x3. We also normalize the images.
(From here Step i.j means the jth step of the ith stage.)
Step 1.1: The dataset is sent to the data augmentation module, which applies different data augmentations (explained above) to the images. For each image in the dataset, this module produces two augmented images. The dataset is now ready to be sent to the encoder network.
Step 1.2: The encoder network has ResNet-50 or ResNet-200 (without the top) as the base network, whose output is then sent to a Dense layer with 2048 neurons. Let us say the size of a minibatch is 64. In this minibatch, there are 32 pairs of images, where the two images of a pair are augmented versions of the same original image. So an input of shape (64,128,128,3) is sent to the encoder network, and the final output has a shape of (64,2048). This output is nothing but the encoded vectors of the images. Each encoded vector has a size of 2048, and there are 2 encoded vectors for each original image. But wait, 2 encoded vectors for the same image. Isn't that useless? No, because the 2 vectors come from two augmented images of that very image. Each of these augmented images represents a different view of the data and contains some subset of the information in the original input image. In conclusion, the two encoded vectors of an image each capture a different subset of its features.
These vectors are finally normalized to lie on a unit hypersphere. In the above section, we have already discussed why we normalize the projected vector before sending it to the loss function. But normalization is done at this stage as well, because the authors found in their experiments that it consistently improved performance. The normalized output, of shape (64,2048), is the final output of the encoder network and is sent to the projector network.
Step 1.3: The projector network is an MLP with one hidden layer of size 2048 and one output layer of size 128, as suggested in the paper (for our experiments, we used a different architecture for the projection network, explained in the next section). This network gives an output of shape (64,128), which is then normalized, so the final normalized projected vectors also have shape (64,128). From here onwards, the projected vector will be called z. The output of the projector network is now sent to the loss function.
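The shapes in Steps 1.2 and 1.3 can be traced with a small NumPy sketch. Everything here is a stand-in: the encoder output is random data rather than real ResNet features, the weights are random instead of learned, biases are omitted, and the ReLU in the hidden layer is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x):
    """Scale each row to unit length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Stand-in for the normalized encoder output of one minibatch: (64, 2048).
encoded = l2_normalize(rng.normal(size=(64, 2048)))

# Randomly initialised MLP weights; in practice these are learned in stage 1.
w_hidden = rng.normal(scale=0.02, size=(2048, 2048))
w_out = rng.normal(scale=0.02, size=(2048, 128))

hidden = np.maximum(encoded @ w_hidden, 0.0)   # hidden layer with ReLU (assumed)
z = l2_normalize(hidden @ w_out)               # projected vectors on the unit sphere

print(encoded.shape, hidden.shape, z.shape)  # (64, 2048) (64, 2048) (64, 128)
```

Every row of `z` has unit length, which is what allows the loss to treat the dot product between two rows as a pure measure of direction.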
Step 1.4: At this step, the supervised contrastive loss function is used to find the loss. The output that we receive from the projector network is fed into this loss function. Let us look at the loss function again.
For each of the 64 projected vectors, we compute $L^{sup}_{i}$ and then add all of them together.
Let us see what is happening in $L^{sup}_{i}$.
Inside $L^{sup}_{i}$, we find the inner product of $z_{i}$ with every other vector in the batch but with some restrictions. These restrictions are applied with the help of some terms. Let us look at them:
Now coming to the log term of $L^{sup}_{i}$, the numerator and denominator contain terms of the form $\exp(z_i \cdot z_j / \tau)$ and $\exp(z_i \cdot z_k / \tau)$. Since every exponential is positive and the numerator's term is also one of the terms summed in the denominator, the argument of the log never goes higher than 1.
Once the loss is calculated, the optimizer comes into action and tries to minimize it. The loss is minimized by maximizing the numerator and minimizing the denominator of the log term in the loss function. Through backpropagation, the model learns its parameters in such a way that it places images belonging to the same class closer together and those belonging to different classes farther apart.
Once the training of this model is over, we discard the projector network and use the trained encoder network for the second stage of the training.
Step 2.1: From here, the second stage of the training begins where we try to train a classifier on top of the encoder network. At this stage, the projector network is discarded, and only the encoder network is used, which is frozen. One more dense layer is added next to the frozen encoder network with the size equal to the number of classes in the dataset.
The input of this new network is the same dataset, preprocessed in the same way. The same image augmentation policies can be used here for data augmentation. In their paper, the authors reported that they got the best results when they used the same augmentation in both training stages.
Let us say the mini-batch size was 64. So the output of this network, which is the final output will have a shape (64,#classes). Once we get the output, this is sent to the loss function.
Step 2.2: At this step, the loss is calculated for the second stage of the training. Here the standard cross-entropy loss function is used. The loss is then backpropagated through the network, and the parameters are learned. Notice that at this stage of training, the only trainable parameters are those of the final layer.
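Stage 2 amounts to fitting a single softmax layer on fixed features. The NumPy sketch below illustrates that idea under simplifying assumptions: the "frozen encoder output" is synthetic 2-D data, training uses plain full-batch gradient descent, and the function names and hyperparameters are hypothetical.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def train_linear_classifier(features, labels, num_classes, lr=0.5, epochs=200):
    """Fit one dense softmax layer on fixed features with cross-entropy."""
    n, dim = features.shape
    weights = np.zeros((dim, num_classes))
    bias = np.zeros(num_classes)
    one_hot = np.eye(num_classes)[labels]
    for _ in range(epochs):
        probs = softmax(features @ weights + bias)
        grad_logits = (probs - one_hot) / n        # gradient of the mean cross-entropy
        weights -= lr * features.T @ grad_logits   # only the new layer is updated
        bias -= lr * grad_logits.sum(axis=0)
    return weights, bias

# Toy stand-in for frozen encoder outputs: two separated clusters of unit vectors.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 32)
centers = np.array([[1.0, 0.0], [0.0, 1.0]])
features = centers[labels] + 0.05 * rng.normal(size=(64, 2))
features /= np.linalg.norm(features, axis=1, keepdims=True)

weights, bias = train_linear_classifier(features, labels, num_classes=2)
probs = softmax(features @ weights + bias)
accuracy = (probs.argmax(axis=1) == labels).mean()
print(probs.shape, accuracy)
```

The features are never modified during training, mirroring the frozen encoder: when stage 1 has already separated the classes, a linear layer is enough to classify them.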
In the next few sections, we will present the results of the experiments we did with three different datasets: Flowers-5, Cats-vs-Dogs, and a subset of ImageNet. This would help us validate the efficiency of the supervised contrastive learning framework.
For this experiment, a subset of ImageNet is used, which can be found here. This dataset has 1250 training images and 250 test images belonging to five classes.
All the images are resized to (128,128). The image pixels are scaled to the range of [0,1].
This experiment was done both with and without augmentation, and the results for both cases are discussed below. We did not use AutoAugment as proposed in the paper (where AutoAugment produced the best results). We used the following augmentation operations:
For the $1^{st}$ stage of training, the encoder network architecture was similar to the other two experiments' encoder network. However, the projector network had 256 neurons instead of 128 neurons.
For the $2^{nd}$ stage of training, the Dense layer had 5 neurons as the dataset had 5 classes. We used the softmax activation function in this Dense layer.
We tested this methodology with
We applied early stopping in order to control the training behavior of the linear model introduced in stage 2. For a sanity check, we also trained the linear layer without early stopping and let it overfit.
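The early stopping rule can be sketched in plain Python. This is a generic illustration, not the exact callback configuration used in our runs: the patience value and the validation-loss curve are made up.

```python
def train_with_early_stopping(validation_losses, patience=2):
    """Return (stopping epoch, best loss) for a sequence of per-epoch losses."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_loss   # stop: no improvement for `patience` epochs
    return len(validation_losses) - 1, best_loss

# Hypothetical validation-loss curve that starts overfitting after epoch 2.
stopped_at, best = train_with_early_stopping([0.9, 0.7, 0.6, 0.65, 0.7, 0.8])
print(stopped_at, best)  # 4 0.6
```

Training without early stopping corresponds to running through the whole list regardless of the validation loss, which is what the overfitting sanity check does.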
The table below summarizes the Supervised Contrastive Loss (produced in stage 1) and the final Training and Validation Accuracy (produced in stage 2) for different optimizers, learning rate strategies, and augmentation schemes. Note that many of the experiments used early stopping, but some of them did not. Please check the code for more details.
Optimizer + learning rate strategy + with or without augmentation | Supervised Contrastive Loss | Training Accuracy | Validation Accuracy |
---|---|---|---|
SGD + lr decayed + without augmentation | 0.00306 | 0.5832 | 0.4160 |
SGD + fixed lr + without augmentation | 0.1572 | 0.1976 | 0.2000 |
SGD + lr decayed + with augmentation | 0.159 | 0.172 | 0.184 |
Adam + lr decayed + without augmentation | 0.0104 | 0.984 | 0.6240 |
Adam + fixed lr + without augmentation | 0.0094 | 0.9808 | 0.6400 |
Adam + lr decayed + with augmentation | 0.00464 | 0.7544 | 0.6560 |
RMSprop + lr decayed + without augmentation | 00447 | 0.992 | 0.6920 |
RMSprop + fixed lr + without augmentation | 0.0100 | 0.9664 | 0.6360 |
RMSprop + lr decayed + with augmentation | 0.02736 | 0.657 | 0.6120 |
Let us look at the graphs of all these results, which we logged using wandb, in the observation section.
All the images are resized to (128,128). The image pixels are scaled to the range of [0,1].
A minimal version of the augmentation policy used in SimCLR is used. This policy includes:
For the $1^{st}$ stage of training, the architecture of the encoder network was similar to the encoder network used in the other two experiments. However, the projector network had 128 neurons.
For the $2^{nd}$ stage of training, the Dense layer had 5 neurons as the dataset had 5 classes. We used the softmax activation function in this Dense layer.
We tested this methodology with and without data augmentation. We only used Adam optimizer with a Cosine Decay on the learning rate in this case.
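For readers unfamiliar with cosine decay, the schedule can be written in a few lines. The sketch below follows the standard cosine decay formula; the initial learning rate and the number of decay steps are hypothetical, since the report does not state the values used.

```python
import math

def cosine_decay(step, decay_steps, initial_lr, alpha=0.0):
    """Cosine-decayed learning rate, from initial_lr down to alpha * initial_lr."""
    progress = min(step, decay_steps) / decay_steps
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return initial_lr * ((1.0 - alpha) * cosine + alpha)

# Hypothetical schedule: start at 1e-3 and decay over 1000 steps.
for step in (0, 250, 500, 750, 1000):
    print(step, cosine_decay(step, decay_steps=1000, initial_lr=1e-3))
```

The learning rate starts at its initial value, falls slowly at first, fastest around the midpoint, and flattens out again as it approaches its minimum.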
All the images are resized to (128,128). The image pixels are scaled to the range of [0,1].
No data augmentation was performed.
For the $1^{st}$ stage of training, the architecture of the encoder network was similar to the encoder network used in the other two experiments. However, the projector network had 128 neurons.
For the $2^{nd}$ stage of training, the Dense layer had 1 neuron as the dataset had 2 classes (binary classification problem).
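A single output neuron is enough for two classes because a sigmoid on one logit is equivalent to a two-class softmax. The check below assumes a sigmoid activation on that neuron, which the setup above does not state explicitly; the logit value is made up.

```python
import numpy as np

def sigmoid(logit):
    return 1.0 / (1.0 + np.exp(-logit))

def softmax(logits):
    exp = np.exp(logits - np.max(logits))
    return exp / exp.sum()

logit = 1.7  # hypothetical output of the single neuron

# A sigmoid on one logit equals a two-class softmax over [logit, 0].
p_sigmoid = sigmoid(logit)
p_softmax = softmax(np.array([logit, 0.0]))[0]
print(np.isclose(p_sigmoid, p_softmax))  # True
```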
We only used the Adam optimizer with its default configuration in tf.keras.
To further justify the efficiency of the SCL framework, we took the embeddings from the encoder network and visualized them using t-SNE. Below you can see that it has been able to learn quite discriminative representations of the ImageNet subset that we used. We got similar results for the other datasets as well.
In this final section, we want to provide you with a gist of the things that worked for us on the different datasets we experimented with -
SCL also gives us a tremendous opportunity to take an encoder trained on the ImageNet dataset and use it for transfer learning. Frameworks like this typically benefit from more data and longer training. It's almost always safe to go for a longer stage 1 training when you have more data.
When you have a good amount of labeled data, it's worth giving SCL a try. But there's a catch: SCL still needs labeled data in order to shine, which may not always be available.