
Classification Loss Functions: Comparing SoftMax, Cross Entropy, and More

Sometimes, when training a classifier, we get confused about which layer to put at the end of our neural network and which loss function to pair with it. This article helps you get it right.
After reading this excellent article from Sebastian Raschka about Log-Likelihood and Entropy in PyTorch, I decided to write this article to explore the different loss functions we can use when training a classifier in PyTorch. I also wanted to help users understand the best practices for classification losses when switching between PyTorch and TensorFlow-Keras.
If you'd like to follow along in the code, click the Colab button below. If you'd like the most basic summary, well, that's what the TLDR is for:



TLDR: Remove your SoftMax layer at the end of your model and use nn.CrossEntropyLoss
💡

PyTorch 🔥

In PyTorch, we have access to many loss functions, most of them available under the torch.nn module. Let's take a quick look at each of them.
Let's use the same simple model from Sebastian's article:
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(num_features, num_hidden),
                      nn.Linear(num_hidden, num_classes),
                      # missing layer here ?!
                      )
outs = model(inputs)
labels = torch.randint(0, num_classes, size=(inputs.shape[0],))

# which loss func 😱
loss_func(outs, labels)
This model returns raw, real-valued outputs (logits) of shape (num_samples, num_classes).
Note: you do not need to one-hot encode the labels. The loss functions expect integer class indices.
For example, the MNIST dataset has 10 classes. If we consider a batch of 4 samples, the labels and outs look like this:
outs:
>> tensor([[-0.6463, -0.3399, -0.4934, -0.6603, -0.6330, 0.3151, -0.0421, -0.5026, 0.5083, 0.3044],
[ 0.3952, 0.8189, -0.7942, -0.1095, 0.3726, 0.1933, -0.4391, -0.6973, -0.0887, 0.0189],
[ 0.0693, 0.1846, -0.8829, -0.0268, 0.0059, 0.7330, -0.0757, -0.3720, 0.4267, 0.3611],
[-0.6113, 0.2860, -0.3275, -0.3011, -0.6845, 0.1475, -0.1357, -0.0481, -0.2089, -0.7391]], grad_fn=<AddmmBackward0>)

labels:
>> tensor([8, 9, 5, 2])
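If you want to double-check that integer labels behave exactly like one-hot targets, here is a minimal sketch (using random logits and the label values shown above) comparing F.cross_entropy against the hand-written one-hot formulation:

import torch
import torch.nn.functional as F

num_classes = 10
outs = torch.randn(4, num_classes)           # stand-in for the logits shown above
labels = torch.tensor([8, 9, 5, 2])          # integer class indices, no one-hot needed

# what the PyTorch losses expect: logits + integer labels
loss_int = F.cross_entropy(outs, labels)

# the equivalent "textbook" form with one-hot targets
one_hot = F.one_hot(labels, num_classes).float()
loss_one_hot = -(one_hot * F.log_softmax(outs, dim=-1)).sum(dim=-1).mean()

print(torch.allclose(loss_int, loss_one_hot))   # True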

NLLLoss Function

NLLLoss stands for negative log-likelihood loss and is derived from the likelihood function. To use this loss function, you need to put a nn.LogSoftmax layer at the end of the model (or use its functional form, F.log_softmax()), as it expects log-probabilities as input.
model = nn.Sequential(nn.Linear(num_features, num_hidden),
                      nn.Linear(num_hidden, num_classes),
+                     nn.LogSoftmax(dim=-1),
                      )

+ loss_func = nn.NLLLoss()


# works!
loss_func(outs, labels)

Generally, this is a bad idea: you may run into numerical instabilities because you are computing exponentials and logarithms that largely cancel each other out.
Use this function only if your model is already constrained to output log-probabilities, which is not the case here. See below.
💡

Cross-Entropy Loss Function (a.k.a. the right way to do it)

This loss combines nn.LogSoftmax and nn.NLLLoss in a single step and benefits from the numerical properties discussed in Sebastian's article (mainly, it avoids computing an exp only to take its log right after).

model = nn.Sequential(nn.Linear(num_features, num_hidden),
                      nn.Linear(num_hidden, num_classes),
-                     nn.Softmax(dim=-1),
                      )
- loss_func = nn.NLLLoss()
+ loss_func = nn.CrossEntropyLoss()


# the right way to do it!
loss_func(outs, labels)

TLDR: Do not add a SoftMax layer; just use nn.CrossEntropyLoss
💡
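To convince yourself the two recipes are equivalent, here is a quick sketch with random logits comparing the LogSoftmax + NLLLoss combination against CrossEntropyLoss applied directly to the logits:

import torch
import torch.nn as nn

logits = torch.randn(4, 10)              # raw model outputs, no softmax applied
labels = torch.randint(0, 10, (4,))

# option 1: LogSoftmax layer + NLLLoss
loss_nll = nn.NLLLoss()(nn.LogSoftmax(dim=-1)(logits), labels)

# option 2: CrossEntropyLoss directly on the logits (preferred)
loss_ce = nn.CrossEntropyLoss()(logits, labels)

print(torch.allclose(loss_nll, loss_ce))   # True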

FocalLoss Function

This function is a variant of CrossEntropyLoss that down-weights the loss from examples the model already classifies confidently, so training focuses on the hard ones. My colleague Aman wrote an excellent article about it.
This function is not in torch.nn, but we can easily implement it on top of cross-entropy. It tends to perform better than CrossEntropyLoss when you have imbalanced datasets.
import torch.nn.functional as F

class FocalLoss(nn.Module):
    "Focal loss implemented using F.cross_entropy"
    def __init__(self, gamma: float = 2.0, weight=None, reduction: str = 'mean') -> None:
        super().__init__()
        self.gamma = gamma
        self.weight = weight
        self.reduction = reduction

    def forward(self, inp: torch.Tensor, targ: torch.Tensor):
        # per-sample cross-entropy, reduction deferred until after the focal weighting
        ce_loss = F.cross_entropy(inp, targ, weight=self.weight, reduction="none")
        # probability assigned to the true class
        p_t = torch.exp(-ce_loss)
        # down-weight well-classified examples via the (1 - p_t)**gamma factor
        loss = (1 - p_t)**self.gamma * ce_loss
        if self.reduction == "mean":
            loss = loss.mean()
        elif self.reduction == "sum":
            loss = loss.sum()
        return loss

Note: If you set gamma to zero, FocalLoss reduces to CrossEntropyLoss
💡
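A quick sanity check of the note above, using the FocalLoss class defined earlier on random logits (the exact numbers don't matter, only the equality):

import torch
import torch.nn as nn

logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))

focal = FocalLoss(gamma=0.0)             # (1 - p_t)**0 == 1, so nothing is down-weighted
ce = nn.CrossEntropyLoss()

print(torch.allclose(focal(logits, labels), ce(logits, labels)))   # True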

[W&B panel: run set of 20 training runs comparing the loss setups above]
As you can see from the graphs above, CrossEntropyLoss combined with a SoftMax layer performs poorly. There are currently more than 12k training scripts on GitHub that make this mistake.

Segmentation: Another Type of Classification

Semantic segmentation is a task where we classify each pixel of an image individually. Since it's a type of classification task, the same losses you use for classification work for semantic segmentation. In a segmentation task, you want the output of the model to have one channel per class:
outs = model(inputs)

outs.shape
>> (bs, n_classes, height, width)

# the class dimension is expected at dim 1, so this works out of the box
loss_func = nn.CrossEntropyLoss()
Here, the targets are not one-hot encoded; each pixel position simply holds the index of the corresponding class:
labels.shape
>> (bs, height, width)

# for a 4-class segmentation, with image size = (2, 5)
labels
>> tensor([[1, 3, 3, 2, 0],
           [0, 1, 2, 2, 0]])
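Putting the pieces together, here is a minimal, self-contained sketch with random tensors in place of a real segmentation model, showing that nn.CrossEntropyLoss handles the 4D outputs and 3D integer targets directly:

import torch
import torch.nn as nn

bs, n_classes, height, width = 2, 4, 2, 5

# stand-in for the model output: one channel per class
outs = torch.randn(bs, n_classes, height, width)

# one class index per pixel, not one-hot encoded
labels = torch.randint(0, n_classes, (bs, height, width))

loss_func = nn.CrossEntropyLoss()        # class dimension is expected at dim 1
print(loss_func(outs, labels))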
Tensorflow & Keras Loss Functions

In Keras, we have access to the same losses with slightly different names.
NOTE: Keras losses expect (labels, preds) order.
💡

CategoricalCrossentropy Loss Function

This loss function is the cross-entropy, but it expects the targets to be one-hot encoded. You can pass from_logits=False if you put the softmax in the model; since Keras compiles the model and the loss function together, either option works and no performance penalty is paid.
from tensorflow import keras

labels = [[0, 1, 0],
          [0, 0, 1]]
preds = [[2., .1, .4],
         [1., 8., -1.]]
ce = keras.losses.CategoricalCrossentropy(from_logits=True)
ce(labels, preds).numpy()

>> 5.601112
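If your labels come as integer class indices, one way to one-hot encode them for CategoricalCrossentropy is keras.utils.to_categorical; a quick sketch reusing the toy values above:

from tensorflow import keras

int_labels = [1, 2]                                        # integer class indices
labels = keras.utils.to_categorical(int_labels, num_classes=3)
# >> [[0., 1., 0.],
#     [0., 0., 1.]]

preds = [[2., .1, .4],
         [1., 8., -1.]]
ce = keras.losses.CategoricalCrossentropy(from_logits=True)
ce(labels, preds).numpy()
# >> 5.601112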


SparseCategoricalCrossentropy Loss Function

This is exactly like the PyTorch counterpart, nn.CrossEntropyLoss: you pass the labels as a tensor of class indices, not one-hot encoded.
from tensorflow import keras

labels = [[1],
          [2]]
preds = [[2., .1, .4],
         [1., 8., -1.]]
sce = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
sce(labels, preds).numpy()

>> 5.601112
This is the one I like, as it matches PyTorch: I can switch datasets/dataloaders between frameworks without changing the preprocessing.
💡
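In practice, you would typically pass the loss to model.compile. Here is a minimal sketch, assuming a simple Sequential model that outputs 10 raw logits (no final softmax):

from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(10),                                # raw logits, no softmax
])

model.compile(
    optimizer="adam",
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)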


Bonus: MultiLabel Classification

Same as before, but the data we want to classify may belong to none of the classes (or all of them!) at the same time. At first glance, this looks like a more complex problem, but it is actually binary classification done independently per class.
Formally, multi-label classification is the problem of finding a model that maps inputs x to binary vectors y (assigning a value of 0 or 1 for each element (label) in y).
💡
For example, an image of a living room could contain multiple objects at once: a chair, a sofa, and a table. The image would be classified accordingly.
For a concrete example, let's take the same code as before (in PyTorch). The difference is in the labels' shape. In this case, you need to multi-hot encode your labels: a 1 for each class that is present and a 0 otherwise, so the vector can contain multiple 1s.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(num_features, num_hidden),
                      nn.Linear(num_hidden, num_classes))

outs = model(inputs)
labels = torch.randint(0, 2, size=(inputs.shape[0], num_classes))

# you need to cast the labels to float 😱
labels = labels.float()

# like the binary classification case
loss_func = nn.BCEWithLogitsLoss()
loss_func(outs, labels)

✅ Don't forget to convert the labels to float.
✅ Use nn.BCEWithLogitsLoss(); as in the binary classification case, you are basically doing binary classification independently per class.
✅ You can use nn.BCELoss() if you have already mapped the network output to the [0, 1] interval (e.g., with a sigmoid).

Example:

outs
>> tensor([[-0.2979, -0.5301, 0.5834, -0.2693, -0.1757, 0.3889, 0.2475, 0.1707, 0.2099, -0.1733],
[-0.3503, -0.2824, 0.5611, -0.0680, -0.1362, 0.3443, 0.3388, 0.1702, 0.0308, -0.2690],
[-0.2405, -0.1986, 0.5248, -0.0708, 0.0414, 0.2687, 0.2693, 0.1571, 0.0103, -0.2734],
[-0.4083, -0.4492, 0.6058, -0.1106, 0.0318, 0.5059, 0.1758, 0.1347, 0.2078, -0.1721]], grad_fn=<AddmmBackward0>)

labels
>> tensor([[0., 1., 0., 1., 1., 1., 1., 1., 1., 1.],
[1., 0., 0., 0., 0., 1., 0., 0., 0., 1.],
[0., 0., 0., 0., 0., 0., 1., 1., 1., 0.],
[1., 1., 1., 0., 0., 0., 1., 1., 1., 0.]])

loss_func = nn.BCEWithLogitsLoss()
loss_func(outs, labels)
>> tensor(0.7162, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)
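To tie the last two checkmarks together, here is a quick sketch with random values showing that nn.BCEWithLogitsLoss on raw outputs matches nn.BCELoss applied after a sigmoid:

import torch
import torch.nn as nn

outs = torch.randn(4, 10)                       # raw logits
labels = torch.randint(0, 2, (4, 10)).float()   # multi-hot targets, cast to float

loss_with_logits = nn.BCEWithLogitsLoss()(outs, labels)
loss_after_sigmoid = nn.BCELoss()(torch.sigmoid(outs), labels)

print(torch.allclose(loss_with_logits, loss_after_sigmoid))   # True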

