
Classification Loss Functions: Comparing SoftMax, Cross Entropy, and More

Sometimes, when training a classifier, we get confused about which layer to put at the end of our neural network and which loss function to pair with it. This article helps you get it right.
After reading this excellent article from Sebastian Raschka about Log-Likelihood and Entropy in PyTorch, I decided to write this article to explore the different loss functions we can use when training a classifier in PyTorch. I also wanted to help users understand the best practices for classification losses when switching between PyTorch and TensorFlow-Keras.
If you'd like to follow along in the code, click the Colab button below. If you'd like the most basic summary, well, that's what the TLDR is for:



TLDR: Remove your SoftMax layer at the end of your model and use nn.CrossEntropyLoss
💡

PyTorch 🔥

In PyTorch, we have access to many loss functions, most of them available under the torch.nn module. Let's take a quick look at each of them.
Let's use the same simple model from Sebastian's article:
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(num_features, num_hidden),
                      nn.Linear(num_hidden, num_classes),
                      # missing layer here ?!
                      )
outs = model(inputs)
labels = torch.randint(0, num_classes, size=(inputs.shape[0],))

# which loss func 😱
loss_func(outs, labels)
This model returns raw, real-valued outputs (logits) of shape (num_samples, num_classes).
Note: you do not need to one-hot encode the labels. The loss functions expect integer class indices.
For example, the MNIST dataset has 10 classes. If we consider a batch of 4 samples, the labels and outs look like this:
outs:
>> tensor([[-0.6463, -0.3399, -0.4934, -0.6603, -0.6330, 0.3151, -0.0421, -0.5026, 0.5083, 0.3044],
[ 0.3952, 0.8189, -0.7942, -0.1095, 0.3726, 0.1933, -0.4391, -0.6973, -0.0887, 0.0189],
[ 0.0693, 0.1846, -0.8829, -0.0268, 0.0059, 0.7330, -0.0757, -0.3720, 0.4267, 0.3611],
[-0.6113, 0.2860, -0.3275, -0.3011, -0.6845, 0.1475, -0.1357, -0.0481, -0.2089, -0.7391]], grad_fn=<AddmmBackward0>)

labels:
>> tensor([8, 9, 5, 2])
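If you want to double-check that integer labels behave exactly like one-hot targets, here is a minimal sketch (using random logits and the label values shown above) comparing F.cross_entropy against the hand-written one-hot formulation:

import torch
import torch.nn.functional as F

num_classes = 10
outs = torch.randn(4, num_classes)           # stand-in for the logits shown above
labels = torch.tensor([8, 9, 5, 2])          # integer class indices, no one-hot needed

# what the PyTorch losses expect: logits + integer labels
loss_int = F.cross_entropy(outs, labels)

# the equivalent "textbook" form with one-hot targets
one_hot = F.one_hot(labels, num_classes).float()
loss_one_hot = -(one_hot * F.log_softmax(outs, dim=-1)).sum(dim=-1).mean()

print(torch.allclose(loss_int, loss_one_hot))   # True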

NLLLoss Function

NLLLoss stands for negative log-likelihood loss and is derived from the likelihood function. To use this loss function, you need to put a nn.LogSoftmax layer at the end of the model (or use its functional form, F.log_softmax()), as it expects log-probabilities as input.
model = nn.Sequential(nn.Linear(num_features, num_hidden),
                      nn.Linear(num_hidden, num_classes),
+                     nn.LogSoftmax(dim=-1),
                      )

+ loss_func = nn.NLLLoss()


# works!
loss_func(outs, labels)

Generally, this is a bad idea: you may run into numerical instabilities because you are computing exponentials and logarithms that largely cancel each other out.
Use this function only if your model is already constrained to output log-probabilities, which is not the case here. See below.
💡

Cross-Entropy Loss Function (a.k.a. the right way to do it)

This loss combines nn.LogSoftmax and nn.NLLLoss in a single step and benefits from the numerical properties discussed in Sebastian's article (mainly, it avoids computing an exp only to take its log right after).

model = nn.Sequential(nn.Linear(num_features, num_hidden),
                      nn.Linear(num_hidden, num_classes),
-                     nn.Softmax(dim=-1),
                      )
- loss_func = nn.NLLLoss()
+ loss_func = nn.CrossEntropyLoss()


# the right way to do it!
loss_func(outs, labels)

TLDR: Do not add a SoftMax layer; just use nn.CrossEntropyLoss
💡
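To convince yourself the two recipes are equivalent, here is a quick sketch with random logits comparing the LogSoftmax + NLLLoss combination against CrossEntropyLoss applied directly to the logits:

import torch
import torch.nn as nn

logits = torch.randn(4, 10)              # raw model outputs, no softmax applied
labels = torch.randint(0, 10, (4,))

# option 1: LogSoftmax layer + NLLLoss
loss_nll = nn.NLLLoss()(nn.LogSoftmax(dim=-1)(logits), labels)

# option 2: CrossEntropyLoss directly on the logits (preferred)
loss_ce = nn.CrossEntropyLoss()(logits, labels)

print(torch.allclose(loss_nll, loss_ce))   # True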

FocalLoss Function

This function is a variant of CrossEntropyLoss that down-weights the loss from examples the model already classifies confidently, so training focuses on the hard ones. My colleague Aman wrote an excellent article about it.
This function is not in torch.nn, but we can easily implement it on top of cross-entropy. It tends to perform better than CrossEntropyLoss when you have imbalanced datasets.
import torch.nn.functional as F

class FocalLoss(nn.Module):
    "Focal loss implemented using F.cross_entropy"
    def __init__(self, gamma: float = 2.0, weight=None, reduction: str = 'mean') -> None:
        super().__init__()
        self.gamma = gamma
        self.weight = weight
        self.reduction = reduction

    def forward(self, inp: torch.Tensor, targ: torch.Tensor):
        # per-sample cross-entropy, reduction deferred until after the focal weighting
        ce_loss = F.cross_entropy(inp, targ, weight=self.weight, reduction="none")
        # probability assigned to the true class
        p_t = torch.exp(-ce_loss)
        # down-weight well-classified examples via the (1 - p_t)**gamma factor
        loss = (1 - p_t)**self.gamma * ce_loss
        if self.reduction == "mean":
            loss = loss.mean()
        elif self.reduction == "sum":
            loss = loss.sum()
        return loss

Note: If you set gamma to zero, FocalLoss reduces to CrossEntropyLoss
💡
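A quick sanity check of the note above, using the FocalLoss class defined earlier on random logits (the exact numbers don't matter, only the equality):

import torch
import torch.nn as nn

logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))

focal = FocalLoss(gamma=0.0)             # (1 - p_t)**0 == 1, so nothing is down-weighted
ce = nn.CrossEntropyLoss()

print(torch.allclose(focal(logits, labels), ce(logits, labels)))   # True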

[W&B panel: run set of 20 training runs comparing the loss setups above]
As you can see from the graphs above, CrossEntropyLoss combined with a SoftMax layer performs poorly. There are currently more than 12k training scripts on GitHub that make this mistake.

Segmentation: Another Type of Classification

Semantic segmentation is a task where we classify each pixel of an image individually. Since it's a type of classification task, the same losses you use for classification work for semantic segmentation. In a segmentation task, you want the output of the model to have one channel per class:
outs = model(inputs)

outs.shape
>> (bs, n_classes, height, width)

# the class dimension is expected at dim 1, so this works out of the box
loss_func = nn.CrossEntropyLoss()
Here, the targets are not one-hot encoded; each pixel position simply holds the index of the corresponding class:
labels.shape
>> (bs, height, width)

# for a 4-class segmentation, with image size = (2, 5)
labels
>> tensor([[1, 3, 3, 2, 0],
           [0, 1, 2, 2, 0]])
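Putting the pieces together, here is a minimal, self-contained sketch with random tensors in place of a real segmentation model, showing that nn.CrossEntropyLoss handles the 4D outputs and 3D integer targets directly:

import torch
import torch.nn as nn

bs, n_classes, height, width = 2, 4, 2, 5

# stand-in for the model output: one channel per class
outs = torch.randn(bs, n_classes, height, width)

# one class index per pixel, not one-hot encoded
labels = torch.randint(0, n_classes, (bs, height, width))

loss_func = nn.CrossEntropyLoss()        # class dimension is expected at dim 1
print(loss_func(outs, labels))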
Tensorflow & Keras Loss Functions

In Keras, we have access to the same losses with slightly different names.
NOTE: Keras losses expect (labels, preds) order.
💡

CategoricalCrossentropy Loss Function

This loss function is the cross-entropy, but it expects the targets to be one-hot encoded. You can pass from_logits=False if you put the softmax in the model; since Keras compiles the model and the loss function together, either option works and no performance penalty is paid.
from tensorflow import keras

labels = [[0, 1, 0],
          [0, 0, 1]]
preds = [[2., .1, .4],
         [1., 8., -1.]]
ce = keras.losses.CategoricalCrossentropy(from_logits=True)
ce(labels, preds).numpy()

>> 5.601112
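If your labels come as integer class indices, one way to one-hot encode them for CategoricalCrossentropy is keras.utils.to_categorical; a quick sketch reusing the toy values above:

from tensorflow import keras

int_labels = [1, 2]                                        # integer class indices
labels = keras.utils.to_categorical(int_labels, num_classes=3)
# >> [[0., 1., 0.],
#     [0., 0., 1.]]

preds = [[2., .1, .4],
         [1., 8., -1.]]
ce = keras.losses.CategoricalCrossentropy(from_logits=True)
ce(labels, preds).numpy()
# >> 5.601112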


SparseCategoricalCrossentropy Loss Function

This is exactly like the PyTorch counterpart, nn.CrossEntropyLoss: you pass the labels as a tensor of class indices, not one-hot encoded.
from tensorflow import keras

labels = [[1],
          [2]]
preds = [[2., .1, .4],
         [1., 8., -1.]]
sce = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
sce(labels, preds).numpy()

>> 5.601112
This is the one I like, as it matches PyTorch: I can switch datasets/dataloaders between frameworks without changing the preprocessing.
💡
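In practice, you would typically pass the loss to model.compile. Here is a minimal sketch, assuming a simple Sequential model that outputs 10 raw logits (no final softmax):

from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(10),                                # raw logits, no softmax
])

model.compile(
    optimizer="adam",
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)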


Bonus: MultiLabel Classification

Same as before, but the data we want to classify may belong to none of the classes (or all of them!) at the same time. At first glance, this looks like a more complex problem, but it is actually binary classification done independently per class.
Formally, multi-label classification is the problem of finding a model that maps inputs x to binary vectors y (assigning a value of 0 or 1 for each element (label) in y).
💡
For example, an image of a living room could contain multiple objects at once: a chair, a sofa, and a table. The image would be classified accordingly.
For a concrete example, let's take the same code as before (in PyTorch). The difference is in the labels' shape. In this case, you need to multi-hot encode your labels: a 1 for each class that is present and a 0 otherwise, so the vector can contain multiple 1s.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(num_features, num_hidden),
                      nn.Linear(num_hidden, num_classes))

outs = model(inputs)
labels = torch.randint(0, 2, size=(inputs.shape[0], num_classes))

# you need to cast the labels to float 😱
labels = labels.float()

# like the binary classification case
loss_func = nn.BCEWithLogitsLoss()
loss_func(outs, labels)

✅ Don't forget to convert the labels to float.
✅ Use nn.BCEWithLogitsLoss(); as in the binary classification case, you are basically doing binary classification independently per class.
✅ You can use nn.BCELoss() if you have already mapped the network output to the [0, 1] interval (e.g., with a sigmoid).

Example:

outs
>> tensor([[-0.2979, -0.5301, 0.5834, -0.2693, -0.1757, 0.3889, 0.2475, 0.1707, 0.2099, -0.1733],
[-0.3503, -0.2824, 0.5611, -0.0680, -0.1362, 0.3443, 0.3388, 0.1702, 0.0308, -0.2690],
[-0.2405, -0.1986, 0.5248, -0.0708, 0.0414, 0.2687, 0.2693, 0.1571, 0.0103, -0.2734],
[-0.4083, -0.4492, 0.6058, -0.1106, 0.0318, 0.5059, 0.1758, 0.1347, 0.2078, -0.1721]], grad_fn=<AddmmBackward0>)

labels
>> tensor([[0., 1., 0., 1., 1., 1., 1., 1., 1., 1.],
[1., 0., 0., 0., 0., 1., 0., 0., 0., 1.],
[0., 0., 0., 0., 0., 0., 1., 1., 1., 0.],
[1., 1., 1., 0., 0., 0., 1., 1., 1., 0.]])

loss_func = nn.BCEWithLogitsLoss()
loss_func(outs, labels)
>> tensor(0.7162, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)
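To tie the last two checkmarks together, here is a quick sketch with random values showing that nn.BCEWithLogitsLoss on raw outputs matches nn.BCELoss applied after a sigmoid:

import torch
import torch.nn as nn

outs = torch.randn(4, 10)                       # raw logits
labels = torch.randint(0, 2, (4, 10)).float()   # multi-hot targets, cast to float

loss_with_logits = nn.BCEWithLogitsLoss()(outs, labels)
loss_after_sigmoid = nn.BCELoss()(torch.sigmoid(outs), labels)

print(torch.allclose(loss_with_logits, loss_after_sigmoid))   # True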

