
Comparing Sigmoid-MSE With Softmax Cross-Entropy for Image Classification

In this article, we look at the results from an experiment to compare sigmoid with MSE and softmax with cross-entropy for image classification.
Created on August 11 | Last edited on February 9
In this article, we're going to look at what happens when we use the sigmoid activation function with the Mean Squared Error (MSE) loss function instead of the usual choice of the softmax activation function with the categorical cross-entropy loss function for image classification. Let's get right to it!

Experiment

Let's train a simple vanilla convolutional neural network (CNN) on the CIFAR-10 dataset to perform image classification, and then compare two output configurations: sigmoid activation with MSE loss, and softmax activation with categorical cross-entropy loss.


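The two configurations can be sketched with `tf.keras`. This is a minimal sketch: the exact architecture and hyperparameters of the original runs are not shown in the article, so the layer sizes below are illustrative assumptions, not the author's actual model.

```python
import tensorflow as tf
from tensorflow.keras import layers, models


def build_cnn(output_activation: str) -> tf.keras.Model:
    # A small vanilla CNN for 32x32x3 CIFAR-10 images; layer sizes are illustrative.
    return models.Sequential([
        layers.Input(shape=(32, 32, 3)),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dense(10, activation=output_activation),  # 10 CIFAR-10 classes
    ])


# Configuration 1: softmax output trained with categorical cross-entropy.
softmax_model = build_cnn("softmax")
softmax_model.compile(optimizer="adam",
                      loss="categorical_crossentropy",
                      metrics=["accuracy"])

# Configuration 2: sigmoid output trained with MSE.
sigmoid_model = build_cnn("sigmoid")
sigmoid_model.compile(optimizer="adam",
                      loss="mse",
                      metrics=["accuracy"])

# Training is identical for both: x_train holds CIFAR-10 images and y_train
# one-hot labels, e.g. via tf.keras.datasets.cifar10 and
# tf.keras.utils.to_categorical. Then:
# model.fit(x_train, y_train, validation_data=(x_test, y_test))
```

Everything except the final activation and the loss is held fixed, so any gap between the two runs comes from the output layer and loss choice alone.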

Observations

  • The model trained with softmax activation and categorical_crossentropy loss performed better than the one trained with sigmoid activation and mse loss.
  • The gap is clearest in the test error rate.
  • Interestingly, the loss curves from the two experiments sit on very different scales, since MSE and cross-entropy measure error in different units.

A Few Things We Learned

  • Sigmoid is primarily used for binary classification and multi-label classification. In multi-label classification, there can be more than one correct answer, so the output values are NOT mutually exclusive. Because sigmoid squashes each output unit independently, the model can assign a high probability to all of its classes, some of them, or none of them.
  • Softmax is primarily used for multi-class classification. Here there is exactly one correct answer, i.e., the output values are mutually exclusive. Softmax enforces that the probabilities of the output classes sum to one, so in order to increase the probability of a particular class, the model must correspondingly decrease the probability of at least one of the other classes.
  • Mean Squared Error is a distance metric. The output tensor here is a probability distribution, and MSE may not be the right choice because it is designed for point estimates, not distributions. A regression model outputs a point estimate, which is why MSE works well there.
  • Cross-entropy is better suited to computing the loss between distributions, whether discrete or continuous, which is exactly what the output tensor is. Cross-entropy is explained well in this blog post and this YouTube video by Aurélien Géron.
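The contrast between the two activations, and between the two losses, can be seen numerically. Below is a small NumPy sketch; the logits and one-hot target are made up for illustration.

```python
import numpy as np


def sigmoid(z):
    # Squashes each entry independently into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))


def softmax(z):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(z - z.max())
    return e / e.sum()


logits = np.array([2.0, 1.0, 0.5])   # made-up scores for 3 classes
target = np.array([1.0, 0.0, 0.0])   # one-hot: class 0 is correct

p_sig = sigmoid(logits)    # entries are independent; their sum can exceed 1
p_soft = softmax(logits)   # a proper distribution; sums to exactly 1

print(p_sig.sum())   # > 1: sigmoid outputs are not mutually exclusive
print(p_soft.sum())  # 1.0: softmax enforces a shared probability budget

# Both losses evaluated on the softmax output. Note the difference in
# magnitude, which echoes the difference in loss-curve scale observed above.
mse = np.mean((p_soft - target) ** 2)
cross_entropy = -np.sum(target * np.log(p_soft))
print(mse, cross_entropy)
```

Near-miss predictions like this one are penalized much more strongly by cross-entropy than by MSE, which is one reason cross-entropy produces larger gradients and trains classifiers more effectively.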
