ReLU vs. Sigmoid Function in Deep Neural Networks: Why ReLU is so Prevalent

What's all the fuss about using ReLU anyway? Made by Ayush Thakur using Weights & Biases

Problem

Most state-of-the-art models use rectified linear units (ReLU) rather than the sigmoid function as the non-linearity in deep neural networks. The question is: why? That's what we're here to find out.

Let's Investigate

We should start with a little context: historically, training deep neural nets was not possible with sigmoid-like activation functions. It was ReLU (among other things, admittedly) that made training deeper nets feasible, and it has been the default activation function for hidden layers ever since. So what exactly makes ReLU a better choice than sigmoid?
Fig 1: Sigmoid and ReLU activation functions. (Source)
Let's set up a simple experiment to see the effects of the ReLU and sigmoid activation functions. We'll train a vanilla CNN classifier on the CIFAR-10 dataset. Specifically, we'll first train our classifier with sigmoid activation in the hidden layers, then train the same classifier with ReLU activation.
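For reference, here's a minimal sketch of that experiment, assuming tf.keras and its built-in CIFAR-10 loader. The exact architecture and hyperparameters live in the linked Colab; the layer sizes below are illustrative assumptions, not the Colab's code.

```python
import tensorflow as tf

def build_classifier(activation):
    """A vanilla CNN whose hidden-layer activation is configurable."""
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation=activation, input_shape=(32, 32, 3)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation=activation),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation=activation),
        tf.keras.layers.Dense(10, activation="softmax"),  # output layer stays softmax
    ])

# Load and normalize CIFAR-10.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# Train the same model twice: first with sigmoid, then with ReLU.
for act in ("sigmoid", "relu"):
    model = build_classifier(act)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(x_train, y_train, epochs=10, validation_data=(x_test, y_test))
```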

Try out the experiments on Google Colab →

Computational Speed

ReLUs are much simpler computationally. The forward and backward passes through ReLU are both just a simple "if" statement.
Sigmoid activation, in comparison, requires computing an exponent.
This advantage is huge when dealing with big networks with many neurons, and can significantly reduce both training and evaluation times.
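To make the "simple if statement" point concrete, here's a rough NumPy sketch of the two forward/backward passes (illustrative only, not the code used in the experiment):

```python
import numpy as np

# ReLU: forward and backward are elementwise comparisons, nothing more.
def relu_forward(x):
    return np.maximum(x, 0.0)

def relu_backward(x, grad_out):
    return grad_out * (x > 0)           # derivative is either 0 or 1

# Sigmoid: the forward pass needs an exponential per element,
# and the backward pass reuses that result.
def sigmoid_forward(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_backward(x, grad_out):
    s = sigmoid_forward(x)
    return grad_out * s * (1.0 - s)     # derivative peaks at 0.25
```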
The graph above clearly shows the stark difference in training times: using sigmoid took more than double the time.

Vanishing Gradient

Additionally, sigmoid activations are easier to saturate. There is a comparatively narrow interval of inputs for which the sigmoid's derivative is sufficiently nonzero. In other words, once a sigmoid reaches either the left or right plateau, it is almost meaningless to make a backward pass through it, since the derivative is very close to 0.
On the other hand, ReLU only saturates when the input is less than 0. And even this saturation can be eliminated by using leaky ReLUs. For very deep networks, saturation hampers learning, and so ReLU provides a nice workaround.
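A quick numerical check makes that plateau concrete (an illustrative snippet, not taken from the experiment): the sigmoid's derivative collapses toward zero as the input moves away from 0, while ReLU's derivative stays at exactly 1 for any positive input.

```python
import numpy as np

def sigmoid_grad(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

def relu_grad(x):
    return float(x > 0)

for x in (0.5, 2.0, 5.0, 10.0):
    print(f"x = {x:>4}: sigmoid' = {sigmoid_grad(x):.6f}, relu' = {relu_grad(x):.1f}")
# x =  0.5: sigmoid' = 0.235004, relu' = 1.0
# x =  2.0: sigmoid' = 0.104994, relu' = 1.0
# x =  5.0: sigmoid' = 0.006648, relu' = 1.0
# x = 10.0: sigmoid' = 0.000045, relu' = 1.0
```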
To see this effect in practice, check out Visualizing and Debugging Neural Networks with PyTorch and W&B.

Convergence Speed

With a standard sigmoid activation, the derivative of the sigmoid is some fraction between 0 and 1 (at most 0.25). If you have many layers, these factors multiply and can give an overall gradient that is exponentially small, so each step of gradient descent makes only a tiny change to the weights, leading to slow convergence (the vanishing gradient problem).
In contrast, with ReLU activation, the gradient of the ReLU is either 0 or 1. That means that, after many layers, the gradient is often a product of a bunch of 1's, so the overall gradient is neither too small nor too large.
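To see how quickly that product shrinks, here's a back-of-the-envelope comparison. The per-layer factor of 0.25 is the sigmoid's maximum possible derivative, so this is an optimistic bound for sigmoid; each active ReLU unit contributes a factor of exactly 1.

```python
import numpy as np

depth = 30  # number of hidden layers the gradient must pass through

# Best case for sigmoid: every unit sits at the peak derivative of 0.25.
sigmoid_product = np.prod(np.full(depth, 0.25))

# ReLU: every active unit passes the gradient through unchanged.
relu_product = np.prod(np.ones(depth))

print(f"sigmoid, {depth} layers: {sigmoid_product:.3e}")  # ~8.7e-19
print(f"relu,    {depth} layers: {relu_product:.3e}")     # 1.000e+00
```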

Observations

In short, training with ReLU gets us better model performance and faster convergence. It's hard to argue with that.

Additional Resources

The following are some really good discussions on this topic: