Classification Loss Functions: Comparing SoftMax, Cross Entropy, and More
When training a classifier, it's easy to get confused about which final layer to put on your neural network and which loss function to pair with it. This article helps you get it right.
After reading this excellent article from Sebastian Raschka about Log-Likelihood and Entropy in PyTorch, I decided to write this article to explore the different loss functions we can use when training a classifier in PyTorch. I also wanted to help users understand the best practices for classification losses when switching between PyTorch and TensorFlow-Keras.
If you'd like to follow along in the code, click the Colab button below. If you'd like the most basic summary, well, that's what the TLDR is for:
Here's what we'll be covering in this article:
Table of Contents
PyTorch 🔥
Segmentation: Another Type of Classification
TensorFlow & Keras Loss Functions
Bonus: MultiLabel Classification
PyTorch 🔥
In PyTorch, we have access to many loss functions, most of them available under the torch.nn module. Let's take a quick look at each of them.
Let's use the same simple model from Sebastian's article:
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(num_features, num_hidden),
    nn.Linear(num_hidden, num_classes),
    # missing layer here ?!
)
outs = model(inputs)
labels = torch.randint(0, num_classes, size=(inputs.shape[0],))

# which loss func 😱
loss_func(outs, labels)
This model returns real-valued outputs (logits) of shape (num_samples, num_classes).
Note: you do not need to one-hot encode the labels. The loss functions expect integer class indices.
For example, the MNIST dataset has 10 classes. If we consider a batch of 4 samples, the labels and outs look like this:
outs:
>> tensor([[-0.6463, -0.3399, -0.4934, -0.6603, -0.6330,  0.3151, -0.0421, -0.5026,  0.5083,  0.3044],
        [ 0.3952,  0.8189, -0.7942, -0.1095,  0.3726,  0.1933, -0.4391, -0.6973, -0.0887,  0.0189],
        [ 0.0693,  0.1846, -0.8829, -0.0268,  0.0059,  0.7330, -0.0757, -0.3720,  0.4267,  0.3611],
        [-0.6113,  0.2860, -0.3275, -0.3011, -0.6845,  0.1475, -0.1357, -0.0481, -0.2089, -0.7391]],
       grad_fn=<AddmmBackward0>)

labels:
>> tensor([8, 9, 5, 2])
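If you want to run these snippets end to end, here is a minimal setup sketch. The dimensions below (num_features, num_hidden, num_classes) and the fake batch are assumptions chosen to mimic flattened MNIST images, not values from the original article:

import torch
import torch.nn as nn

# assumed example dimensions: flattened 28x28 MNIST-style images, 10 classes
num_features, num_hidden, num_classes = 28 * 28, 64, 10

# a fake batch of 4 samples standing in for real data
inputs = torch.randn(4, num_features)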
NLLLoss Function
NLLLoss stands for negative log-likelihood loss and is derived from the likelihood function. To use this loss function, you need to put a nn.LogSoftmax layer at the end of the model (or use its functional form, F.log_softmax), as NLLLoss expects log-probabilities as input.
model = nn.Sequential(
    nn.Linear(num_features, num_hidden),
    nn.Linear(num_hidden, num_classes),
+   nn.LogSoftmax(dim=-1),
)

+ loss_func = nn.NLLLoss()

# works!
loss_func(outs, labels)
Generally, this is a bad idea: you may run into numerical instabilities because you compute exponentials only to take their logarithms right after.
You should use this function if your model is already constrained to output log-probabilities, but that's not the case here. See below.
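To make the instability concrete, here is a small sketch (with made-up extreme logits, not from the original article) showing how Softmax followed by log can underflow to -inf, while LogSoftmax stays finite:

import torch

# made-up extreme logits to trigger the problem
logits = torch.tensor([[100.0, 0.0, -100.0]])

# softmax then log: the smallest probability underflows to 0.0, so its log becomes -inf
print(torch.log(torch.softmax(logits, dim=-1)))   # last entry is -inf

# log_softmax works in log-space and stays finite (roughly [0, -100, -200])
print(torch.log_softmax(logits, dim=-1))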
Cross-Entropy Loss Function (a.k.a. the right way to do it)
This function combines NLLLoss with the LogSoftmax layer and benefits from the numerical properties discussed in Sebastian's article (mainly, it avoids computing an exponential only to take its logarithm afterwards).
model = nn.Sequential(
    nn.Linear(num_features, num_hidden),
    nn.Linear(num_hidden, num_classes),
-   nn.Softmax(dim=-1),
)

- loss_func = nn.NLLLoss()
+ loss_func = nn.CrossEntropyLoss()

# the right way to do it!
loss_func(outs, labels)
TLDR: Do not add a SoftMax layer; just use CrossEntropyLoss on the raw logits.
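As a quick sanity check (a sketch with random tensors, not from the original article), you can verify that CrossEntropyLoss is exactly LogSoftmax followed by NLLLoss:

import torch
import torch.nn.functional as F

torch.manual_seed(0)
outs = torch.randn(4, 10)              # fake logits: 4 samples, 10 classes
labels = torch.randint(0, 10, (4,))    # integer class indices

ce = F.cross_entropy(outs, labels)                          # takes raw logits
manual = F.nll_loss(F.log_softmax(outs, dim=-1), labels)    # explicit two-step version
print(torch.allclose(ce, manual))                           # True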
FocalLoss Function
This function is a variant of CrossEntropyLoss that down-weights easy, well-classified examples so training focuses on the harder ones. My colleague Aman wrote an excellent article about it.
This function is not in torch.nn, but we can easily implement it on top of the cross-entropy. It often performs better than CrossEntropy when you have imbalanced datasets.
import torch.nn.functional as F

class FocalLoss(nn.Module):
    "Focal loss implemented using F.cross_entropy"
    def __init__(self, gamma: float = 2.0, weight=None, reduction: str = 'mean') -> None:
        super().__init__()
        self.gamma = gamma
        self.weight = weight
        self.reduction = reduction

    def forward(self, inp: torch.Tensor, targ: torch.Tensor):
        ce_loss = F.cross_entropy(inp, targ, weight=self.weight, reduction="none")
        p_t = torch.exp(-ce_loss)
        loss = (1 - p_t)**self.gamma * ce_loss
        if self.reduction == "mean":
            loss = loss.mean()
        elif self.reduction == "sum":
            loss = loss.sum()
        return loss
Note: if you set gamma to zero, FocalLoss becomes the standard CrossEntropyLoss.
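Here is a small sanity check of that note (a sketch assuming the FocalLoss class above and random tensors): with gamma=0 the modulating factor (1 - p_t)**gamma is always 1, so the focal loss collapses to plain cross-entropy.

import torch
import torch.nn as nn

logits = torch.randn(8, 10)               # fake logits: 8 samples, 10 classes
targets = torch.randint(0, 10, (8,))

focal = FocalLoss(gamma=0.0)              # the class defined above
ce = nn.CrossEntropyLoss()
print(torch.allclose(focal(logits, targets), ce(logits, targets)))   # True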
[W&B panel grid: training curves from a run set of 20 runs comparing the loss setups above]
As you can see from the graphs above, CrossEntropyLoss combined with a SoftMax layer performs poorly. There are currently more than 12k training scripts on GitHub that make this mistake.
Segmentation: Another Type of Classification
Semantic segmentation is a task where we classify the pixels of an image one by one. Since it's a type of classification task, the same losses you use for classification work for semantic segmentation too. In a segmentation task, you want the model's output to have as many channels as there are classes:
outs = model(inputs)
outs.shape
>> (bs, n_classes, height, width)

# the classes must sit on dim=1; CrossEntropyLoss handles the extra spatial dims directly
loss_func = nn.CrossEntropyLoss()
Here, the targets are not one-hot encoded either: each pixel position simply holds the index of its class:
labels.shape
>> (bs, height, width)

# for a 4-class segmentation, with image size = (2, 5)
labels
>> tensor([[1, 3, 3, 2, 0],
           [0, 1, 2, 2, 0]])
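Putting the two shapes together, here is a minimal runnable sketch (with random tensors standing in for a real segmentation model and mask, not from the original article) showing that nn.CrossEntropyLoss handles the per-pixel case directly:

import torch
import torch.nn as nn

bs, n_classes, height, width = 4, 4, 2, 5

outs = torch.randn(bs, n_classes, height, width, requires_grad=True)  # fake model output
labels = torch.randint(0, n_classes, (bs, height, width))             # per-pixel class indices

loss_func = nn.CrossEntropyLoss()
loss = loss_func(outs, labels)   # averaged over every pixel of every image
loss.backward()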
Check out these other cool reports on how to perform semantic segmentation:
Understanding State of the Art in Deep Learning: 3D Semantic Segmentation
This model takes a point cloud representing a real-world object as input and segments the object into its different parts.
Image Masks for Semantic Segmentation Using Weights & Biases
This article explains how to log and explore semantic segmentation masks, and how to interactively visualize models' predictions with Weights & Biases.
Image Segmentation Using Keras and Weights & Biases
This article explores semantic segmentation with a UNET-like architecture in Keras and interactively visualizes the model's prediction using Weights & Biases.
Barbershop: Hair Transfer with GAN-Based Image Compositing Using Segmentation Masks
A novel GAN-based optimization method for photo-realistic hairstyle transfer
TensorFlow & Keras Loss Functions
NOTE: Keras losses expect the (labels, preds) argument order, the opposite of PyTorch's (preds, labels).
CategoricalCrossentropy Loss Function
This loss function is the cross-entropy, but it expects the targets to be one-hot encoded. You can pass the argument from_logits=False if you put a softmax at the end of the model, or from_logits=True if the model outputs raw logits. Since Keras compiles the model and the loss function together, it's up to you, and no performance penalty is paid.
from tensorflow import keras

labels = [[0, 1, 0],
          [0, 0, 1]]
preds = [[2., .1, .4],
         [1., 8., -1.]]

ce = keras.losses.CategoricalCrossentropy(from_logits=True)
ce(labels, preds).numpy()
>> 5.601112
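If your labels are stored as integer class indices, one way (an illustrative sketch, not from the original article) to build the one-hot targets this loss expects is tf.one_hot:

import tensorflow as tf

sparse_labels = tf.constant([1, 2])
one_hot_labels = tf.one_hot(sparse_labels, depth=3)   # -> [[0., 1., 0.], [0., 0., 1.]], as in the example above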
SparseCategoricalCrossentropy Loss Function
This is exactly like its PyTorch counterpart, nn.CrossEntropyLoss: you pass the labels as a tensor of class indices, not one-hot encoded.
from tensorflow import keras

labels = [[1],
          [2]]
preds = [[2., .1, .4],
         [1., 8., -1.]]

sce = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
sce(labels, preds).numpy()
>> 5.601112
This is the one I like, as it works the same way as in PyTorch: I can switch datasets/dataloaders between frameworks and the label processing does not change.
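For completeness, here is a minimal sketch (the architecture and dimensions are assumptions, not from the original article) of wiring this loss into model.compile; SparseCategoricalCrossentropy(from_logits=True) pairs with a final Dense layer that has no softmax activation:

from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(784,)),             # assumed flattened 28x28 images
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(10),                # raw logits, no softmax
])

model.compile(
    optimizer="adam",
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)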
Bonus: MultiLabel Classification
Same as before, but the sample we want to classify may belong to several classes at once, to none of them, or even to all of them. At first glance, this looks like a more complex problem, but it is actually just binary classification done independently per class.
Formally, multi-label classification is the problem of finding a model that maps inputs x to binary vectors y (assigning a value of 0 or 1 for each element (label) in y).
For example, an image of a living room could contain multiple objects at once: a chair, a sofa, a table. The image would then be labeled with all of those classes.
For a concrete example, let's take the same code as before (in PyTorch). The difference is the shape of the labels. In this case, you do need to one-hot encode your labels: a 1 if the class is present and a 0 if not, with possibly multiple 1s in the vector.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(num_features, num_hidden),
    nn.Linear(num_hidden, num_classes),
)
outs = model(inputs)
labels = torch.randint(0, 2, size=(inputs.shape[0], num_classes))

# you need to cast the labels to float 😱
labels = labels.float()

# like the binary classification case
loss_func = nn.BCEWithLogitsLoss()
loss_func(outs, labels)
✅ Don't forget to convert the labels to float.
✅ Use nn.BCEWithLogitsLoss(); as in the binary classification case, you are basically doing a binary classification per class.
✅ You can use nn.BCELoss() if you have already converted the network output to the [0, 1] interval (e.g., with a Sigmoid layer).
Example:
outs
>> tensor([[-0.2979, -0.5301,  0.5834, -0.2693, -0.1757,  0.3889,  0.2475,  0.1707,  0.2099, -0.1733],
        [-0.3503, -0.2824,  0.5611, -0.0680, -0.1362,  0.3443,  0.3388,  0.1702,  0.0308, -0.2690],
        [-0.2405, -0.1986,  0.5248, -0.0708,  0.0414,  0.2687,  0.2693,  0.1571,  0.0103, -0.2734],
        [-0.4083, -0.4492,  0.6058, -0.1106,  0.0318,  0.5059,  0.1758,  0.1347,  0.2078, -0.1721]],
       grad_fn=<AddmmBackward0>)

labels
>> tensor([[0., 1., 0., 1., 1., 1., 1., 1., 1., 1.],
        [1., 0., 0., 0., 0., 1., 0., 0., 0., 1.],
        [0., 0., 0., 0., 0., 0., 1., 1., 1., 0.],
        [1., 1., 1., 0., 0., 0., 1., 1., 1., 0.]])

loss_func = nn.BCEWithLogitsLoss()
loss_func(outs, labels)
>> tensor(0.7162, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)
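Two follow-up notes as a sketch (using the outs and labels tensors from the example above; the 0.5 threshold is an assumption to tune for your problem): BCEWithLogitsLoss is just a Sigmoid fused with BCELoss, and at inference time you typically sigmoid the logits and threshold them to get multi-hot predictions.

import torch
import torch.nn as nn

probs = torch.sigmoid(outs)                 # per-class probabilities in [0, 1]
manual = nn.BCELoss()(probs, labels)        # matches nn.BCEWithLogitsLoss()(outs, labels) up to precision
preds = (probs > 0.5).float()               # multi-hot predictions, one 0/1 per class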
Other cool W&B reports
Setting Up TensorFlow And PyTorch Using GPU On Docker
A short tutorial on setting up TensorFlow and PyTorch deep learning models on GPUs using Docker.
Deep Learning on the M1 Pro with Apple Silicon
Let's take my new Macbook Pro for a spin and see how well it performs, shall we?
How the TorchData API Works: a Tutorial with Code
Let's check the new way of building Datasets on latest PyTorch 1.11 with TorchData.
Reproducible spaCy NLP Experiments with Weights & Biases
How to use Weights & Biases and spaCy to train custom, reproducible NLP pipelines