
The Softmax Activation Function Explained

In this short tutorial, we'll explore the Softmax activation function, including its use in classification tasks, and how it relates to cross entropy loss.
Here's what we'll be covering:

Table Of Contents

What Is The Softmax Function Used For?
Why Use Softmax Instead Of The Max Or Argmax Activation Functions?
The Softmax Activation Function Expressed
Softmax + Cross-Entropy Loss (Caution: Math Alert)
Conclusion

What Is The Softmax Function Used For?

One of the most common tasks in ML is classification: given an input (image, video, text, or audio), can a model return the class it belongs to? If we use the simplest form of neural network out there, say, a multilayer perceptron (MLP), how do we convert its output into a class?
NOTE: Remember that an MLP's output is nothing but a weighted sum of inputs, i.e. $\sum_{i=1}^{N} w_i x_i \in \mathbb{R}$, a scalar value.
Essentially, we need some way to transform this number into something that can give us a notion of which class the input $\{ x_i \}$ belongs to. This is where activation functions come in (Softmax being one of the most commonly used activation functions).
For example, say we take the most common (and inarguably the most important) image classification problem: Hot Dog or Not Hot Dog 🌭. Given an image of food, our task is to classify it as "Hot Dog" or "Not Hot Dog". In essence, our task is binary classification. If we assign, say, 1 to "Hot Dog" and 0 to "Not Hot Dog", then our model should output something between 0 and 1, and based on some threshold, we can assign a class appropriately.
But what if we have a multi-class classification problem? A single 0-or-1 output won't cut it.
Enter Softmax.

Why Use Softmax Instead Of The Max Or Argmax Activation Functions?

You might be asking yourself why we should use Softmax instead of just using the maximum or argmax functions. Let's dig in.
First, consider using a "hard" max function, i.e. a function that keeps only the largest value from a given sequence of inputs. So, if we have an input like $i = \{ 4, 2 \}$, then the output would look like $z = \mathrm{hardmax}(i) = \{ 1, 0 \}$: the position of the largest value gets a 1, and all the other values are just returned as zeros. Argmax is a slightly different variant, where the function returns the index of the largest value rather than the entire list.
Softmax is a softer version of the max function (who would've guessed!). Instead of returning a binary sequence with a 1 for the max and 0 otherwise, it gives us probability values for every input rather than just zeros for the non-max entries. As you can imagine, for multi-class classification, hard 0s and 1s don't really help; what we want is a distribution of values. This is where Softmax comes in, as the sketch below illustrates.
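Here's a minimal NumPy sketch (an illustration, not code from the original article) contrasting the hard max, argmax, and Softmax on the same hypothetical input:

```python
import numpy as np

logits = np.array([4.0, 2.0])  # hypothetical inputs

# Hard max: 1 at the position of the largest value, 0 everywhere else.
hardmax = (logits == logits.max()).astype(float)

# Argmax: just the index of the largest value.
argmax = np.argmax(logits)

# Softmax: exponentiate and normalize so the outputs sum to 1.
softmax = np.exp(logits) / np.exp(logits).sum()

print(hardmax)  # [1. 0.]
print(argmax)   # 0
print(softmax)  # [0.881 0.119] (approximately)
```

Note how the Softmax output still favors the larger input but keeps a non-zero probability for the smaller one.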

The Softmax Activation Function Expressed

The Softmax activation function can be mathematically expressed as:

$$\sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$

This function outputs a sequence of probability values, thus making it useful for multi-class classification problems. For example, for a 5-class classification problem, the output from the Softmax Function might look something like this:
$$[\,0.1,\; 0.1,\; 0.2,\; 0.4,\; 0.2\,]$$

As you can see, the values sum to $1.0$. Assuming the classes have been one-hot encoded, the interpretation would be that the 4th class (or 3rd index) is the most probable, with the 3rd and 5th classes tied closely behind.
Illustration of how One-Hot Encoding would work for a sentence. Source: SauravMaheshkar/infographics
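As a quick illustration of the formula above, here's a small, numerically stable NumPy implementation (a sketch with made-up logits, not code from the article); subtracting the maximum logit before exponentiating doesn't change the result but avoids overflow:

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    """Numerically stable Softmax: shift by max(z) before exponentiating."""
    shifted = z - np.max(z)
    exps = np.exp(shifted)
    return exps / exps.sum()

# Hypothetical logits for a 5-class problem.
logits = np.array([0.5, 0.5, 1.2, 1.9, 1.2])
probs = softmax(logits)

print(probs)           # roughly [0.1, 0.1, 0.2, 0.4, 0.2]
print(probs.sum())     # 1.0
print(probs.argmax())  # 3, i.e. the 4th class is the most probable
```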

Softmax + Cross-Entropy Loss (Caution: Math Alert)

Using our definitions from the section above, say $p_1, p_2, \ldots, p_n$ represent the target (ground-truth) probabilities, $z_1, z_2, \ldots, z_n$ represent the unnormalized log probabilities (logits) output by the network, and $q_1, q_2, \ldots, q_n$ represent the Softmax outputs, i.e. $q_i = \sigma(z_i) \; \forall i$. Then, writing out the cross-entropy loss and differentiating with respect to a logit $z_i$:
$$\begin{array}{ll} J &= - \displaystyle\sum_{i} p_i \log(q_i) \\[0.75em] \dfrac{\partial J}{\partial z_i} &= \dfrac{\partial}{\partial z_i} \left\{ - \displaystyle\sum_{j} p_j \log(\sigma(z_j)) \right\} \\[0.75em] &= - \displaystyle\sum_{j} p_j \, \dfrac{\partial}{\partial z_i} \log(\sigma(z_j)) \\[0.75em] &= - \displaystyle\sum_{j \neq i} p_j \, \dfrac{\partial}{\partial z_i} \log(\sigma(z_j)) \; - \; p_i \, \dfrac{\partial}{\partial z_i} \log(\sigma(z_i)) \\[0.75em] &= - \displaystyle\sum_{j \neq i} p_j \, \dfrac{1}{\sigma(z_j)} \dfrac{\partial \sigma(z_j)}{\partial z_i} \; - \; p_i \, \dfrac{1}{\sigma(z_i)} \dfrac{\partial \sigma(z_i)}{\partial z_i} \end{array}$$


For the off-diagonal case ($j \neq i$), the derivative of the Softmax output with respect to the logit $z_i$ is:

$$\begin{array}{ll} \dfrac{\partial \sigma(z_j)}{\partial z_i} &= \dfrac{\partial}{\partial z_i} \dfrac{e^{z_j}}{\sum_k e^{z_k}} \\[0.75em] &= e^{z_j} \, \dfrac{\partial}{\partial z_i} \left[ \dfrac{1}{\sum_k e^{z_k}} \right] \\[0.75em] &= e^{z_j} \left( - \dfrac{e^{z_i}}{\left( \sum_k e^{z_k} \right)^2} \right) \\[0.75em] &= - \left( \dfrac{e^{z_j}}{\sum_{k} e^{z_k}} \right) \cdot \left( \dfrac{e^{z_i}}{\sum_{k} e^{z_k}} \right) \\[0.75em] &= - \sigma(z_j) \cdot \sigma(z_i) \end{array}$$

For the diagonal case ($j = i$), using the product rule:

$$\begin{array}{ll} \dfrac{\partial \sigma(z_i)}{\partial z_i} &= \dfrac{\partial}{\partial z_i} \dfrac{e^{z_i}}{\sum_k e^{z_k}} \\[0.75em] &= \dfrac{e^{z_i}}{\sum_k e^{z_k}} + e^{z_i} \, \dfrac{\partial}{\partial z_i} \dfrac{1}{\sum_k e^{z_k}} \\[0.75em] &= \sigma(z_i) - e^{z_i} \dfrac{e^{z_i}}{\left( \sum_k e^{z_k} \right)^2} \\[0.75em] &= \sigma(z_i) - \sigma(z_i)^2 \end{array}$$
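Together, these two cases are just the entries of the Softmax Jacobian, $\frac{\partial \sigma(z_j)}{\partial z_i} = \sigma(z_j)\left(\delta_{ij} - \sigma(z_i)\right)$. As a quick sanity check (a NumPy sketch with made-up logits, not code from the article), we can compare the analytic Jacobian against a finite-difference estimate:

```python
import numpy as np

def softmax(z):
    exps = np.exp(z - z.max())
    return exps / exps.sum()

z = np.array([0.5, 1.0, -0.3, 2.0])  # hypothetical logits
s = softmax(z)

# Analytic Jacobian: diagonal sigma_i - sigma_i^2, off-diagonal -sigma_j * sigma_i.
analytic = np.diag(s) - np.outer(s, s)

# Numerical Jacobian via central finite differences.
eps = 1e-5
numeric = np.zeros((len(z), len(z)))
for i in range(len(z)):
    dz = np.zeros_like(z)
    dz[i] = eps
    numeric[:, i] = (softmax(z + dz) - softmax(z - dz)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))  # True
```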

Now, let's put everything together:
$$\begin{array}{ll} \dfrac{\partial J}{\partial z_i} &= - \displaystyle\sum_{j \neq i} p_j \, \dfrac{1}{\sigma(z_j)} \left( - \sigma(z_j) \cdot \sigma(z_i) \right) \; - \; p_i \, \dfrac{1}{\sigma(z_i)} \left( \sigma(z_i) - \sigma(z_i)^2 \right) \\[0.75em] &= \displaystyle\sum_{j \neq i} p_j \, \sigma(z_i) - p_i \left( 1 - \sigma(z_i) \right) \\[0.75em] &= \sigma(z_i) \displaystyle\sum_{j \neq i} p_j - p_i + p_i \, \sigma(z_i) \\[0.75em] &= \sigma(z_i) \left( \displaystyle\sum_{j \neq i} p_j + p_i \right) - p_i \\[0.75em] &= \sigma(z_i) \displaystyle\sum_{j} p_j - p_i \\[0.75em] &= \sigma(z_i) - p_i \end{array}$$

The last step uses the fact that the target probabilities sum to one, $\sum_j p_j = 1$. So the gradient of the cross-entropy loss with respect to each logit is simply the Softmax output minus the target probability.
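To convince ourselves of this result, here's a short NumPy sketch (with hypothetical logits and a one-hot target, not code from the article) comparing the analytic gradient $\sigma(z) - p$ with a finite-difference estimate of the cross-entropy loss:

```python
import numpy as np

def softmax(z):
    exps = np.exp(z - z.max())
    return exps / exps.sum()

def cross_entropy(z, p):
    """J = -sum_i p_i * log(softmax(z)_i)"""
    return -np.sum(p * np.log(softmax(z)))

# Hypothetical logits and a one-hot target for a 5-class problem.
z = np.array([0.5, 0.5, 1.2, 1.9, 1.2])
p = np.array([0.0, 0.0, 0.0, 1.0, 0.0])

# Analytic gradient from the derivation above: softmax(z) - p.
analytic = softmax(z) - p

# Numerical gradient via central finite differences.
eps = 1e-5
numeric = np.zeros_like(z)
for i in range(len(z)):
    dz = np.zeros_like(z)
    dz[i] = eps
    numeric[i] = (cross_entropy(z + dz, p) - cross_entropy(z - dz, p)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))  # True
```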


Conclusion

And that wraps up our short tutorial on the Softmax Activation Function. If you have any questions or comments, please feel free to add them below.
To see the full suite of Weights & Biases features, please check out this short 5-minute guide. If you want more reports covering the math and "from-scratch" code implementations, let us know in the comments down below or on our forum ✨!
Check out these other reports on Fully Connected covering other fundamental concepts like Linear Regression, Cross Entropy Loss, and Decision Trees.
