Understanding Logits, Sigmoid, Softmax, and Cross-Entropy Loss in Deep Learning
An explainer comparing commonly used functions in ML
Introduction
Have you ever wondered how deep-learning models can differentiate cat images from dog images? Or, if you have used ChatGPT, wondered how it knows which words to predict next? Or perhaps how a consumer complaint can automatically be classified by a deep learning model and routed to the correct department(s)?
At the root of it all lies the trio of the softmax function, the sigmoid function & cross-entropy loss!
In today’s day and age, where data is the new oil and AI is everywhere, it is important to understand the basics. As part of this blog post, let’s go on a journey together: we’ll first learn about logits and the softmax & sigmoid activation functions, understand how they are used everywhere in deep learning networks and what their use cases & advantages are, and then also look at cross-entropy loss.
This blog post is not theoretical! Even though we refer to some mathematical formulas, we will explain the terms simply using PyTorch code. Some proficiency in Python will really help you understand this piece and the concepts in it completely. In the code, we will use TIMM to create our image classification models to further understand logits, the softmax activation function, cross-entropy loss & the sigmoid activation function.
Logits
You have probably heard the term logits a lot in the context of deep learning. But what are logits? The definition will become clearer in the following sections, but put simply: the raw outputs from the final layer of a deep learning network are called logits (also commonly referred to as activations).
Since deep learning networks in their most basic form consist of matrix multiplications and non-linearities, these raw outputs can range over $(-\infty, \infty)$, i.e., all of $\mathbb{R}$, where $\mathbb{R}$ represents the real numbers. Thus, these raw outputs can be infinitely negative or positive.
Since these raw outputs can’t directly be interpreted as model scores, we have activation functions that are applied on top of these raw outputs before getting the final score.
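To make this concrete, here is a minimal sketch, assuming a single torch.nn.Linear layer stands in for the final layer of a network (the layer sizes and inputs are made up purely for illustration):

import torch
import torch.nn as nn

torch.manual_seed(0)  # for a repeatable example

# A single linear layer standing in for the final layer of a network
final_layer = nn.Linear(in_features=10, out_features=3)

features = torch.randn(2, 10)   # two made-up input feature vectors
logits = final_layer(features)  # raw outputs: any real numbers, not probabilities
print(logits)                   # values can be negative or positive and don't sum to 1

The exact numbers will differ from run to run; the point is simply that these raw outputs are unbounded real numbers.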
In the next two sections, we will look at two such activation functions in detail—sigmoid and softmax—with the help of sample problems.
Sigmoid Function
Let’s say you're a data scientist working on this machine learning problem:
“Given an input image, predict whether the image is a car or not.”
Now, what would you want as the output of this deep learning model? You would want a number between 0 & 1, right? You would want the number to be as close to 1 as possible if the model thinks the input image is a car, and as close to 0 as possible if the model thinks it is not a car.

Now, as mentioned before, since a logit is a raw output from the deep learning network, and all deep learning networks consist of matrix multiplications and non-linearities like ReLU, the range of logits is $(-\infty, \infty)$; that is, a logit can be any value in $\mathbb{R}$, the real numbers.
But remember that we want the output to be between 0 and 1, not a real number that can be infinitely negative or positive. So how do we convert a logit to a number between 0 and 1?
The answer here: sigmoid. It's an activation function commonly applied to the outputs of the final layer of a deep learning network (the logits) to convert the raw outputs to values between 0 and 1.
Mathematically speaking, the sigmoid function is defined as:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
If we plot the function, it looks like below:

As can be seen above, the output value of the sigmoid function at 0 is 0.5, and the values are very close to 0 for inputs < -4 and very close to 1 for inputs > 4. Let’s look at logits and the sigmoid function in code. We'll use TIMM to create our deep learning model:
import timm, torch

x = torch.randn(1, 3, 224, 224)
m = timm.create_model('resnet18', num_classes=1)
m(x)
>> tensor([[-0.1522]], grad_fn=<AddmmBackward0>)  # Logit

torch.sigmoid(m(x))
>> tensor([[0.4620]], grad_fn=<SigmoidBackward0>)
As can be seen in the simple code example above, the raw output from the final layer of the model is -0.1522, but this can’t really be interpreted as a probability score, so after applying the sigmoid function, the final score becomes 0.4620.
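As a quick sanity check, we can plug the printed logit into the sigmoid formula by hand (treat -0.1522 as just an illustrative value, since the random input changes it from run to run):

import math

logit = -0.1522                     # the raw output printed above
score = 1 / (1 + math.exp(-logit))  # sigmoid(x) = 1 / (1 + e^(-x))
print(round(score, 4))              # 0.462, matching torch.sigmoid above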
That’s all there is to the sigmoid function. So far, you have learned what logits and the sigmoid activation function are: logits are the raw outputs from the final layer of the deep learning model, and sigmoid is an activation function that converts these raw outputs to final scores between 0 and 1.
Let’s now move on to softmax activation function.
Softmax activation function
Let’s now update our problem statement:
“Given some input image, predict whether it is a horse, a ball, a fish, a car, or a stick.”
Here, we don’t just have a single class as output that we are trying to predict, but multiple classes. We want the output for each of the classes to be between 0 and 1, and the sum of these scores to be 1. In such a case, the sigmoid activation function won’t give us these properties.
Can you guess why the sigmoid activation function won’t give us what we want? Because it doesn’t ensure that the sum of the scores equals 1. A sigmoid function simply converts all raw scores to values between 0 and 1, but doesn’t ensure that their sum is 1.
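A small sketch makes this concrete. The logit values below are made up purely for illustration; the point is that sigmoid scores do not sum to 1, while softmax scores do:

import torch

logits = torch.tensor([[-0.2, 0.5, 2.0, -1.0, 0.1]])  # arbitrary example logits

sig = torch.sigmoid(logits)          # each value squashed to (0, 1) independently
soft = torch.softmax(logits, dim=1)  # values jointly normalized across classes

print(sig.sum().item())   # > 1 here; sigmoid gives no sum-to-1 guarantee
print(soft.sum().item())  # 1.0 (up to floating-point error)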

Let’s look at the image above to clearly understand logits and the softmax activation function. As previously mentioned, the outputs we get from the final layer of a deep learning model are referred to as logits. Softmax is an activation function applied on top of the logits (the outputs from the final layer) to get the final scores/probabilities, such that the final scores are between 0 & 1 and their total sum is 1.
import timm
import torch

m = timm.create_model('resnet18', num_classes=5)
x = torch.randn(1, 3, 224, 224)
logits = m(x)
logits
>> tensor([[-0.2135, -0.0248, 3.985, -4.235, -0.1831]], grad_fn=<AddmmBackward0>)

scores = torch.softmax(logits, dim=1)
scores
>> tensor([[0.0143, 0.0173, 0.9534, 0.0003, 0.0148]], grad_fn=<SoftmaxBackward0>)
As we can see above, logits are raw outputs from the deep learning model m given some input image x.
As previously mentioned, logits can have a range of values belonging to all real numbers from negative infinity to positive infinity.
These logits are then passed to a normalizing function such as softmax, whose outputs lie in the range (0, 1) and sum to 1.
Step-by-step calculation of the softmax outputs
Let’s now look at the softmax function in detail.
Given some input vector $x$ with a total of $n$ elements, the softmax value of the $i$-th element $x_i$ is equal to:

$$\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}$$
So, let’s try to apply softmax to the raw outputs (logits) from the neural network that we created before using TIMM.
Let’s assume that the raw output logits are stored in vector X. We want to calculate the softmax value of the first element of this vector.
As per the mathematical formula:

$$\text{softmax}(x_1) = \frac{e^{x_1}}{e^{x_1} + e^{x_2} + e^{x_3} + e^{x_4} + e^{x_5}}$$

Calculating the exponential of each term separately first:

$e^{-0.2135} \approx 0.8078$
$e^{-0.0248} \approx 0.9755$
$e^{3.985} \approx 53.7853$
$e^{-4.235} \approx 0.0145$
$e^{-0.1831} \approx 0.8327$

And then the sum of the exponentials: $0.8078 + 0.9755 + 53.7853 + 0.0145 + 0.8327 \approx 56.4157$

Therefore:

$$\text{softmax}(x_1) = \frac{0.8078}{56.4157} \approx 0.0143$$

Thus, the output of the softmax activation function for the first element $-0.2135$ is approximately 0.0143.
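The same step-by-step calculation can be reproduced in code. This is just the formula above re-implemented on the printed logit values, not a new PyTorch API:

import torch

logits = torch.tensor([-0.2135, -0.0248, 3.985, -4.235, -0.1831])

exps = torch.exp(logits)            # element-wise exponentials
manual_softmax = exps / exps.sum()  # divide each exponential by their sum

print(exps.sum())                    # ≈ 56.42
print(manual_softmax)                # first element ≈ 0.0143
print(torch.softmax(logits, dim=0))  # built-in softmax gives the same values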
Some properties of the softmax function:
- It wants to "pick a thing," so you'll see it as the final normalizing layer in almost all classification models.
- The outputs will always be between 0 and 1 (see the quick sketch after this list).
- The sum of the output probabilities from the softmax function will always equal 1.
- It is differentiable, which means it can be used in gradient-based optimization methods like backpropagation.
- It can be viewed as a way of converting logits (real-valued vector) into a probability distribution.
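Here is the quick sketch promised in the list above, checking the range, sum-to-1, and differentiability properties on a made-up logit tensor:

import torch

logits = torch.tensor([[1.5, -0.3, 0.2]], requires_grad=True)  # made-up logits
probs = torch.softmax(logits, dim=1)

print(probs)             # every entry lies strictly between 0 and 1
print(probs.sum(dim=1))  # sums to 1 along the class dimension

# Differentiable: we can backpropagate through softmax
probs[0, 0].backward()
print(logits.grad)       # gradient of the first probability w.r.t. the logits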
The softmax activation function wants to pick a single class
Let’s look at the first point - which is one of the most important to understand intuitively regarding the softmax function. As famously shared by Jeremy Howard in the fast.ai course:
The softmax function wants to pick a thing. Why? The exponential of something grows really fast!
For example, $e^{2} \approx 7.39$, whereas $e^{4} \approx 54.60$. So, if you have logits where one logit is a bit bigger than the others, its softmax is going to be much bigger than the others.
Let’s say we are given the logits [0.02, -2.49, 1.25]. The exponentials of these values are [1.02, 0.08, 3.49], and the softmax values are [0.22, 0.02, 0.76]. You can see how the softmax of the logit 1.25 is much bigger than all the others.
And this is good, right? Because that’s what you want when you ask the model questions like “Which pet breed is in the image?” - you want the model to pick a single breed. And that’s exactly what softmax does.
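Here is the small example above reproduced in code (same made-up logits; the outputs shown in the comments are rounded):

import torch

logits = torch.tensor([0.02, -2.49, 1.25])

print(torch.exp(logits))             # ≈ [1.02, 0.08, 3.49]
print(torch.softmax(logits, dim=0))  # ≈ [0.22, 0.02, 0.76] - the largest logit dominates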
Negative Log Likelihood
Having looked at the sigmoid and softmax activation functions, it is now time to look at losses. Specifically, we will be uncovering PyTorch’s negative log-likelihood loss, F.nll_loss (torch.nn.functional.nll_loss).
Before we begin, let’s look at what the model is really trying to optimize: we want the model to minimize the loss. But how does that work for multi-class classification problems?
Let’s say you are working on the following problem statement:
“You are given five images, each of either a grizzly bear, a teddy bear, or a brown bear, and your job is to classify each image into the correct class.”
Let’s see what this looks like in PyTorch code:
import torch, timm
import torch.nn.functional as F

classes = ['Grizzly', 'Brown', 'Teddy']
targets = [1, 0, 2, 0, 2]  # these are your labels

[classes[idx] for idx in targets]
>> ['Brown', 'Grizzly', 'Teddy', 'Grizzly', 'Teddy']
This means that as per the labels, the first image is that of a brown bear, the second image is that of a grizzly, and so on.
m = timm.create_model('resnet18', num_classes=3, pretrained=True)
x = torch.randn(5, 3, 224, 224)  # five images of three channels of shape 224x224
logits = m(x)
logits
>> tensor([[ 0.1752,  0.0539, -0.0581],
        [ 0.0544,  0.0642,  0.0782],
        [ 0.3230,  0.0958,  0.1941],
        [ 0.1143, -0.0508,  0.0701],
        [ 0.1249,  0.0510,  0.1147]], grad_fn=<AddmmBackward0>)
As you can see above, the logits are the raw outputs from the deep learning model, ranging from $-\infty$ to $+\infty$.
outputs = torch.softmax(logits, dim=1)
outputs
>> tensor([[0.3734, **0.3308**, 0.2957],
        [**0.3296**, 0.3329, 0.3375],
        [0.3737, 0.2978, **0.3285**],
        [**0.3566**, 0.3023, 0.3411],
        [0.3426, 0.3182, **0.3391**]], grad_fn=<SoftmaxBackward0>)

idxs = torch.arange(5)
-outputs[idxs, targets]
>> tensor([-0.3308, -0.3296, -0.3285, -0.3566, -0.3391], grad_fn=<NegBackward0>)
Now, for the loss function, we are concerned with the highlighted values above, which are picked out by our targets [1, 0, 2, 0, 2]. Ideally, we want these probabilities to be as high as possible, i.e., as close to 1 as possible.
Taking the negative of these indexed values is exactly what PyTorch's F.nll_loss computes.
F.nll_loss(outputs, torch.tensor(targets), reduction='none')
>> tensor([-0.3308, -0.3296, -0.3285, -0.3566, -0.3391], grad_fn=<NllLossBackward0>)
As can be seen above, the output of F.nll_loss is the negative of the highlighted values before.
Note that we have not taken a log anywhere above: F.nll_loss does not apply the log itself, it only performs the (negative) indexing, which is why -outputs[idxs, targets] gives the same result as F.nll_loss. PyTorch leaves the log step to you because it is faster and more numerically stable to compute the log together with the softmax first, and only then call F.nll_loss.
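To make this concrete, here is a small sketch on made-up logits showing that F.nll_loss applied to log-probabilities is exactly the negative indexing we did by hand:

import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(5, 3)               # made-up logits: 5 samples, 3 classes
targets = torch.tensor([1, 0, 2, 0, 2])

log_probs = torch.log(torch.softmax(logits, dim=1))  # log of the softmax outputs

idxs = torch.arange(5)
manual = -log_probs[idxs, targets]       # the negative indexing done by hand
via_nll = F.nll_loss(log_probs, targets, reduction='none')

print(torch.allclose(manual, via_nll))   # True: nll_loss itself takes no log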
Cross Entropy Loss
The log of softmax of Logits followed by F.nll_loss is referred to as Cross Entropy Loss.
Does this sound complex? Not if you break it into steps. The four steps below are the complete definition of cross-entropy loss.
- Calculate the Logits (raw outputs from the final layer of the model)
- Take the softmax of logits.
- Take the log of outputs from step-2.
- Calculate F.nll_loss which is simply indexing as shown before, by passing in outputs from step-3 and your target values.
Let’s now see it in code.
import torch, timm
import torch.nn.functional as F

############################## step=1 ##############################
classes = ['Grizzly', 'Brown', 'Teddy']
targets = [1, 0, 2, 0, 2]  # these are your labels

m = timm.create_model('resnet18', num_classes=3, pretrained=True)
x = torch.randn(5, 3, 224, 224)  # five images of three channels of shape 224x224
logits = m(x)
logits
>> tensor([[ 0.1752,  0.0539, -0.0581],
        [ 0.0544,  0.0642,  0.0782],
        [ 0.3230,  0.0958,  0.1941],
        [ 0.1143, -0.0508,  0.0701],
        [ 0.1249,  0.0510,  0.1147]], grad_fn=<AddmmBackward0>)
Now, we have the raw outputs (logits) from our model.
############################## step=2 ##############################
torch.softmax(logits, dim=1)
>> tensor([[0.3734, 0.3308, 0.2957],
        [0.3296, 0.3329, 0.3375],
        [0.3737, 0.2978, 0.3285],
        [0.3566, 0.3023, 0.3411],
        [0.3426, 0.3182, 0.3391]], grad_fn=<SoftmaxBackward0>)
Okay, as per the next step, we need to take the log.
############################## step=3 ##############################
torch.log(torch.softmax(logits, dim=1))
>> tensor([[-0.9850, -1.1062, -1.2183],
        [-1.1099, -1.1001, -1.0861],
        [-0.9842, -1.2115, -1.1132],
        [-1.0312, -1.1963, -1.0755],
        [-1.0711, -1.1450, -1.0813]], grad_fn=<LogBackward0>)
Finally, we perform negative log likelihood using PyTorch passing in the targets and our outputs from step-3.
############################## step=4 ##############################
F.nll_loss(torch.log(torch.softmax(logits, dim=1)), torch.tensor(targets), reduction='none')
>> tensor([1.1062, 1.1099, 1.1132, 1.0312, 1.0813], grad_fn=<NllLossBackward0>)
This is the same as Cross Entropy Loss! Let’s compare the outputs with the output from Cross Entropy Loss using PyTorch.
import torch.nn as nn

loss_fn = nn.CrossEntropyLoss(reduction='none')
loss_fn(logits, torch.tensor(targets))
>> tensor([1.1062, 1.1099, 1.1132, 1.0312, 1.0813], grad_fn=<NllLossBackward0>)
As you can see the outputs are the same! Thus, we have successfully calculated Cross Entropy Loss step-by-step! As mentioned earlier, “The log of softmax of Logits followed by F.nll_loss is referred to as Cross Entropy Loss.”
Note that PyTorch provides a function called torch.log_softmax, which can be used as a replacement for torch.log(torch.softmax(...)) and is more numerically stable.
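For completeness, here is a small sketch (on made-up logits) showing that the manual chain, torch.log_softmax + F.nll_loss, and the single functional call F.cross_entropy all agree; F.cross_entropy is simply the functional counterpart of the nn.CrossEntropyLoss module used above:

import torch
import torch.nn.functional as F

logits = torch.randn(5, 3)               # made-up logits: 5 samples, 3 classes
targets = torch.tensor([1, 0, 2, 0, 2])

a = F.nll_loss(torch.log(torch.softmax(logits, dim=1)), targets, reduction='none')
b = F.nll_loss(torch.log_softmax(logits, dim=1), targets, reduction='none')
c = F.cross_entropy(logits, targets, reduction='none')  # log_softmax + nll_loss in one call

print(torch.allclose(a, b) and torch.allclose(b, c))    # True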
Conclusion
As part of this blog post, we explained what logits mean in the context of deep learning. In simple language, they are the raw outputs from the final layer of a deep learning network.
Next, we also looked at the sigmoid and softmax activation functions. The sigmoid activation function can convert all raw values to scores between 0 and 1. Softmax can do the same, but it additionally ensures that the outputs sum to 1. Thus, sigmoid is preferred for binary & multi-label classification problems, whereas softmax is preferred for multi-class classification problems where we want the model to “pick a class”.
Lastly, we looked at the negative log-likelihood function from PyTorch and calculated the Cross-Entropy Loss using 4 simple steps! We hope that by doing so, the reader has a clear and crisp understanding of these concepts. 🙂