
Why use softmax as opposed to standard normalization?

What's the fuss around the softmax activation for the output layer?

Problem

In the output layer of a neural network, it is typical to use the softmax function to approximate a probability distribution:

$$\sigma(\mathbf{z})_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}, \qquad j = 1, \dots, K$$

This is expensive to compute because of the exponents. Why not simply perform a Z transform so that all outputs are positive, and then normalize by dividing each output by the sum of all outputs? (Originally asked in this Stack Overflow thread.)
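As a concrete point of comparison, here is a minimal NumPy sketch of both schemes, assuming the "Z transform" simply shifts the outputs so the smallest one becomes zero:

```python
import numpy as np

def softmax(z):
    # Direct translation of the formula above: exponentiate each output,
    # then divide by the sum of the exponentials.
    e = np.exp(z)
    return e / e.sum()

def shift_and_normalize(z):
    # The proposed alternative, assuming the "Z transform" means shifting
    # the outputs so the smallest one becomes zero, then dividing by the sum.
    shifted = z - z.min()
    return shifted / shifted.sum()

z = np.array([1.0, 2.0, 3.0])
print(softmax(z))              # ~[0.090, 0.245, 0.665]
print(shift_and_normalize(z))  # [0.0, 0.333, 0.667]
```

Both produce non-negative values that sum to one, but they are not the same distribution: softmax keeps every class strictly positive and weights classes by the exponentiated differences between outputs, while the shifted scheme assigns exactly zero probability to the smallest output.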

Answer

The softmax function is

$$\sigma(\mathbf{z})_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}, \qquad j = 1, \dots, K.$$
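One practical note on the exponentials: since $e^{z_j - c} / \sum_k e^{z_k - c} = e^{z_j} / \sum_k e^{z_k}$ for any constant $c$, softmax is usually computed with the largest output subtracted first, which keeps the exponentials from overflowing. A minimal sketch:

```python
import numpy as np

def stable_softmax(z):
    # Subtracting the maximum does not change the result (the factor e^{-max}
    # cancels in numerator and denominator) but keeps exp() from overflowing.
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([1000.0, 1001.0, 1002.0])
print(stable_softmax(z))  # ~[0.090, 0.245, 0.665]
# A naive np.exp(z) / np.exp(z).sum() would overflow to inf and return nan here.
```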