
The Softmax Activation Function Explained

In this short tutorial, we'll explore the Softmax activation function, including its use in classification tasks, and how it relates to cross entropy loss.
Here's what we'll be covering:

Table Of Contents

What Is The Softmax Function Used For?
Why Use Softmax Instead Of The Max Or Argmax Activation Functions?
The Softmax Activation Function Expressed
Softmax + Cross-Entropy Loss (Caution: Math Alert)
Conclusion

What Is The Softmax Function Used For?

One of the most common tasks in ML is classification: given an input (image, video, text, or audio), can a model return the class it belongs to? If we use the simplest form of neural network out there, say, a multilayer perceptron (MLP), how do we convert its output into a class?
NOTE: Remember that an MLP's output is nothing but a weighted sum of inputs, i.e. $\sum_{i=1}^{N} w_i x_i \in \mathbb{R}$, a scalar value.
Essentially, we need some way to transform this number into something that can give us a notion of which class the input $\{ x_i \}$ belongs to. This is where activation functions come in (Softmax being one of the most commonly used activation functions).
For example, say we take the most common (and inarguably the most important) image classification problem: Hot Dog or Not Hot Dog 🌭. Given an image of food, our task is to classify it as "Hot Dog" or "Not Hot Dog". In essence, our task is binary classification. If we assign, say, 1 to "Hot Dog" and 0 to "Not Hot Dog", then our model should output something between 0 and 1, and based on some threshold, we can assign a class appropriately.
But what if we have a multi-class classification problem? A single 0-or-1 output won't cut it.
Enter Softmax.

Why Use Softmax Instead Of The Max Or Argmax Activation Functions?

You might be asking yourself why we should use Softmax instead of just using the maximum or argmax functions. Let's dig in.
First, consider using a "hard" max function, i.e. a function that keeps only the largest value from a given sequence of inputs. So, if we have an input like $i = \{ 4, 2 \}$, then the output would look like $z = \mathrm{hardmax}(i) = \{ 1, 0 \}$: the position of the largest value gets a 1, and all the other values are just returned as zeros. Argmax is a slightly different variant, where the function returns the index of the largest value rather than the entire list.
Softmax is a softer version of the max function (who would've guessed!). Instead of returning a binary sequence with a 1 for the max and 0 otherwise, it gives us probability values for every input rather than just zeros for the non-max entries. As you can imagine, for multi-class classification, hard 0s and 1s don't really help; what we want is a distribution of values. This is where Softmax comes in, as the sketch below illustrates.
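Here's a minimal NumPy sketch (an illustration, not code from the original article) contrasting the hard max, argmax, and Softmax on the same hypothetical input:

```python
import numpy as np

logits = np.array([4.0, 2.0])  # hypothetical inputs

# Hard max: 1 at the position of the largest value, 0 everywhere else.
hardmax = (logits == logits.max()).astype(float)

# Argmax: just the index of the largest value.
argmax = np.argmax(logits)

# Softmax: exponentiate and normalize so the outputs sum to 1.
softmax = np.exp(logits) / np.exp(logits).sum()

print(hardmax)  # [1. 0.]
print(argmax)   # 0
print(softmax)  # [0.881 0.119] (approximately)
```

Note how the Softmax output still favors the larger input but keeps a non-zero probability for the smaller one.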

The Softmax Activation Function Expressed

The Softmax activation function can be mathematically expressed as:

$$\sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$

This function outputs a sequence of probability values, thus making it useful for multi-class classification problems. For example, for a 5-class classification problem, the output from the Softmax Function might look something like this:
$$[\,0.1,\; 0.1,\; 0.2,\; 0.4,\; 0.2\,]$$

As you can see, the values sum to $1.0$. Assuming the classes have been one-hot encoded, the interpretation would be that the 4th class (or 3rd index) is the most probable, with the 3rd and 5th classes tied closely behind.
Illustration of how One-Hot Encoding would work for a sentence. Source: SauravMaheshkar/infographics
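As a quick illustration of the formula above, here's a small, numerically stable NumPy implementation (a sketch with made-up logits, not code from the article); subtracting the maximum logit before exponentiating doesn't change the result but avoids overflow:

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    """Numerically stable Softmax: shift by max(z) before exponentiating."""
    shifted = z - np.max(z)
    exps = np.exp(shifted)
    return exps / exps.sum()

# Hypothetical logits for a 5-class problem.
logits = np.array([0.5, 0.5, 1.2, 1.9, 1.2])
probs = softmax(logits)

print(probs)           # roughly [0.1, 0.1, 0.2, 0.4, 0.2]
print(probs.sum())     # 1.0
print(probs.argmax())  # 3, i.e. the 4th class is the most probable
```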

Softmax + Cross-Entropy Loss (Caution: Math Alert)

Using our definitions from the section above, say $p_1, p_2, \ldots, p_n$ represent the target (ground-truth) probabilities, $z_1, z_2, \ldots, z_n$ represent the unnormalized log probabilities (logits) output by the network, and $q_1, q_2, \ldots, q_n$ represent the Softmax outputs, i.e. $q_i = \sigma(z_i) \; \forall i$. Then, writing out the cross-entropy loss and differentiating with respect to a logit $z_i$:
$$\begin{array}{ll} J &= - \displaystyle\sum_{i} p_i \log(q_i) \\[0.75em] \dfrac{\partial J}{\partial z_i} &= \dfrac{\partial}{\partial z_i} \left\{ - \displaystyle\sum_{j} p_j \log(\sigma(z_j)) \right\} \\[0.75em] &= - \displaystyle\sum_{j} p_j \, \dfrac{\partial}{\partial z_i} \log(\sigma(z_j)) \\[0.75em] &= - \displaystyle\sum_{j \neq i} p_j \, \dfrac{\partial}{\partial z_i} \log(\sigma(z_j)) \; - \; p_i \, \dfrac{\partial}{\partial z_i} \log(\sigma(z_i)) \\[0.75em] &= - \displaystyle\sum_{j \neq i} p_j \, \dfrac{1}{\sigma(z_j)} \dfrac{\partial \sigma(z_j)}{\partial z_i} \; - \; p_i \, \dfrac{1}{\sigma(z_i)} \dfrac{\partial \sigma(z_i)}{\partial z_i} \end{array}$$


For the off-diagonal case ($j \neq i$), the derivative of the Softmax output with respect to the logit $z_i$ is:

$$\begin{array}{ll} \dfrac{\partial \sigma(z_j)}{\partial z_i} &= \dfrac{\partial}{\partial z_i} \dfrac{e^{z_j}}{\sum_k e^{z_k}} \\[0.75em] &= e^{z_j} \, \dfrac{\partial}{\partial z_i} \left[ \dfrac{1}{\sum_k e^{z_k}} \right] \\[0.75em] &= e^{z_j} \left( - \dfrac{e^{z_i}}{\left( \sum_k e^{z_k} \right)^2} \right) \\[0.75em] &= - \left( \dfrac{e^{z_j}}{\sum_{k} e^{z_k}} \right) \cdot \left( \dfrac{e^{z_i}}{\sum_{k} e^{z_k}} \right) \\[0.75em] &= - \sigma(z_j) \cdot \sigma(z_i) \end{array}$$

For the diagonal case ($j = i$), using the product rule:

$$\begin{array}{ll} \dfrac{\partial \sigma(z_i)}{\partial z_i} &= \dfrac{\partial}{\partial z_i} \dfrac{e^{z_i}}{\sum_k e^{z_k}} \\[0.75em] &= \dfrac{e^{z_i}}{\sum_k e^{z_k}} + e^{z_i} \, \dfrac{\partial}{\partial z_i} \dfrac{1}{\sum_k e^{z_k}} \\[0.75em] &= \sigma(z_i) - e^{z_i} \dfrac{e^{z_i}}{\left( \sum_k e^{z_k} \right)^2} \\[0.75em] &= \sigma(z_i) - \sigma(z_i)^2 \end{array}$$
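Together, these two cases are just the entries of the Softmax Jacobian, $\frac{\partial \sigma(z_j)}{\partial z_i} = \sigma(z_j)\left(\delta_{ij} - \sigma(z_i)\right)$. As a quick sanity check (a NumPy sketch with made-up logits, not code from the article), we can compare the analytic Jacobian against a finite-difference estimate:

```python
import numpy as np

def softmax(z):
    exps = np.exp(z - z.max())
    return exps / exps.sum()

z = np.array([0.5, 1.0, -0.3, 2.0])  # hypothetical logits
s = softmax(z)

# Analytic Jacobian: diagonal sigma_i - sigma_i^2, off-diagonal -sigma_j * sigma_i.
analytic = np.diag(s) - np.outer(s, s)

# Numerical Jacobian via central finite differences.
eps = 1e-5
numeric = np.zeros((len(z), len(z)))
for i in range(len(z)):
    dz = np.zeros_like(z)
    dz[i] = eps
    numeric[:, i] = (softmax(z + dz) - softmax(z - dz)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))  # True
```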

Now, let's put everything together:
$$\begin{array}{ll} \dfrac{\partial J}{\partial z_i} &= - \displaystyle\sum_{j \neq i} p_j \, \dfrac{1}{\sigma(z_j)} \left( - \sigma(z_j) \cdot \sigma(z_i) \right) \; - \; p_i \, \dfrac{1}{\sigma(z_i)} \left( \sigma(z_i) - \sigma(z_i)^2 \right) \\[0.75em] &= \displaystyle\sum_{j \neq i} p_j \, \sigma(z_i) - p_i \left( 1 - \sigma(z_i) \right) \\[0.75em] &= \sigma(z_i) \displaystyle\sum_{j \neq i} p_j - p_i + p_i \, \sigma(z_i) \\[0.75em] &= \sigma(z_i) \left( \displaystyle\sum_{j \neq i} p_j + p_i \right) - p_i \\[0.75em] &= \sigma(z_i) \displaystyle\sum_{j} p_j - p_i \\[0.75em] &= \sigma(z_i) - p_i \end{array}$$

The last step uses the fact that the target probabilities sum to one, $\sum_j p_j = 1$. So the gradient of the cross-entropy loss with respect to each logit is simply the Softmax output minus the target probability.
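To convince ourselves of this result, here's a short NumPy sketch (with hypothetical logits and a one-hot target, not code from the article) comparing the analytic gradient $\sigma(z) - p$ with a finite-difference estimate of the cross-entropy loss:

```python
import numpy as np

def softmax(z):
    exps = np.exp(z - z.max())
    return exps / exps.sum()

def cross_entropy(z, p):
    """J = -sum_i p_i * log(softmax(z)_i)"""
    return -np.sum(p * np.log(softmax(z)))

# Hypothetical logits and a one-hot target for a 5-class problem.
z = np.array([0.5, 0.5, 1.2, 1.9, 1.2])
p = np.array([0.0, 0.0, 0.0, 1.0, 0.0])

# Analytic gradient from the derivation above: softmax(z) - p.
analytic = softmax(z) - p

# Numerical gradient via central finite differences.
eps = 1e-5
numeric = np.zeros_like(z)
for i in range(len(z)):
    dz = np.zeros_like(z)
    dz[i] = eps
    numeric[i] = (cross_entropy(z + dz, p) - cross_entropy(z - dz, p)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))  # True
```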


Conclusion

And that wraps up our short tutorial on the Softmax Activation Function. If you have any questions or comments, please feel free to add them below.
To see the full suite of Weights & Biases features, please check out this short 5-minute guide. If you want more reports covering the math and "from-scratch" code implementations, let us know in the comments down below or on our forum ✨!
Check out these other reports on Fully Connected covering other fundamental concepts like Linear Regression, Cross Entropy Loss, and Decision Trees.
