What's all the fuss about using ReLU anyway? Made by Ayush Thakur using Weights & Biases

Most state-of-the-art models use rectified linear units (ReLU) rather than the sigmoid function as the non-linearity in a deep neural network. The question is why. That's what we're here to find out.

We should start with a little context: historically, training deep neural nets with sigmoid-like activation functions was notoriously difficult. It was ReLU (among other things, admittedly) that facilitated the training of deeper nets, and ever since, we have been using ReLU as the default activation function for hidden layers. So what exactly makes ReLU a better choice than sigmoid?

Fig 1: Sigmoid and ReLU activation functions. (Source)

Let's set up a simple experiment to see the effects of the ReLU and sigmoid activation functions. We'll train a vanilla CNN classifier on the CIFAR-10 dataset. Specifically, we'll first train our classifier with sigmoid activations in the hidden layers, then train the same classifier with ReLU activations.

ReLUs are much simpler computationally. The forward and backward passes through ReLU are both just a simple "if" statement.

Sigmoid activation, in comparison, requires computing an exponential.

This advantage is huge when dealing with big networks with many neurons, and can significantly reduce both training and evaluation times.
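To make the cost difference concrete, here is a minimal scalar sketch of both activations and their backward passes in plain Python. The function names and the `upstream` gradient argument are illustrative assumptions, not code from the experiment; real frameworks apply the same logic elementwise over whole tensors.

```python
import math


def relu_forward(x: float) -> float:
    # Forward pass: a single comparison.
    return x if x > 0.0 else 0.0


def relu_backward(x: float, upstream: float) -> float:
    # Backward pass: let the upstream gradient through only where x > 0.
    return upstream if x > 0.0 else 0.0


def sigmoid_forward(x: float) -> float:
    # Forward pass: needs an exponential.
    # Naive form; math.exp(-x) can overflow for very large negative x.
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_backward(x: float, upstream: float) -> float:
    # Backward pass: d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x)).
    s = sigmoid_forward(x)
    return upstream * s * (1.0 - s)
```

In this sketch the ReLU functions involve only a comparison, while both sigmoid functions go through `math.exp`.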

The graph above clearly shows the stark difference in training times here. Using sigmoid took more than double the amount of time.

Additionally, sigmoid activations are easier to saturate. There is a comparatively narrow interval of inputs for which the sigmoid's derivative is sufficiently nonzero. In other words, once a sigmoid reaches either the left or right plateau, it is almost meaningless to make a backward pass through it, since the derivative is very close to 0.
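A quick way to see how narrow that interval is: evaluate the sigmoid's derivative at a few inputs. This is a small standalone sketch; the sample points are arbitrary choices for illustration.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x: float) -> float:
    # d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x))
    s = sigmoid(x)
    return s * (1.0 - s)


# Print the derivative at a few arbitrary sample points.
for x in (-10.0, -5.0, 0.0, 5.0, 10.0):
    print(f"x = {x:6.1f}   sigmoid'(x) = {sigmoid_grad(x):.6f}")
```

The derivative peaks at 0.25 when the input is 0 and falls below 0.0001 by the time the input reaches 10 in either direction, which is the plateau behavior described above.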

On the other hand, ReLU only saturates when the input is less than 0. And even this saturation can be eliminated by using leaky ReLUs. For very deep networks, saturation hampers learning, and so ReLU provides a nice workaround.
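As a rough illustration of how a leaky ReLU removes the flat region, the sketch below compares the two gradients. The slope of 0.01 for negative inputs is an assumed value (a common default), and the function names are hypothetical.

```python
def relu_grad(x: float) -> float:
    # Plain ReLU: the gradient is exactly 0 for non-positive inputs.
    return 1.0 if x > 0.0 else 0.0


def leaky_relu(x: float, negative_slope: float = 0.01) -> float:
    # Leaky ReLU: scale negative inputs by a small slope instead of zeroing them.
    return x if x > 0.0 else negative_slope * x


def leaky_relu_grad(x: float, negative_slope: float = 0.01) -> float:
    # The gradient is never exactly 0, so some signal always flows backward.
    return 1.0 if x > 0.0 else negative_slope


# Compare the two gradients on either side of zero.
for x in (-3.0, -0.5, 0.5, 3.0):
    print(f"x = {x:5.1f}   relu' = {relu_grad(x)}   leaky_relu' = {leaky_relu_grad(x)}")
```

For negative inputs the plain ReLU gradient is 0, while the leaky variant keeps a small nonzero gradient, so a unit stuck in the negative region can still be updated.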

To see the effect of ReLU in practice, check out Visualizing and Debugging Neural Networks with PyTorch and W&B.

With a standard sigmoid activation, the gradient of the sigmoid is always a fraction between 0 and 0.25. If you have many layers, these gradients multiply and can give an overall gradient that is exponentially small, so each step of gradient descent makes only a tiny change to the weights, leading to slow convergence (the vanishing gradient problem).

In contrast, with ReLU activation, the gradient of the ReLU is either 0 or 1. That means that often, after many layers, the gradient will include the product of a bunch of 1s, and the overall gradient won't be too small or too large.
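The multiplication argument can be sketched numerically. The example below is a deliberate simplification: it ignores the weight matrices and considers only the activation's local gradient at each layer, taking the sigmoid's best case (its derivative never exceeds 0.25) and a ReLU path on which every unit is active (local gradient 1). The helper name `gradient_scale` is hypothetical.

```python
def gradient_scale(local_grad: float, depth: int) -> float:
    # Product of the same local activation gradient across `depth` layers.
    # Simplifying assumption: weights are ignored and every layer
    # contributes an identical factor.
    return local_grad ** depth


# Compare the sigmoid's best case against an all-active ReLU path.
for depth in (5, 10, 20):
    sigmoid_best_case = gradient_scale(0.25, depth)
    relu_active_path = gradient_scale(1.0, depth)
    print(f"depth = {depth:2d}   sigmoid <= {sigmoid_best_case:.3e}   relu = {relu_active_path:.1f}")
```

Even in the sigmoid's best case, the factor contributed by the activations drops below one in a million after 10 layers, whereas an all-active ReLU path leaves it at 1.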

- The model trained with ReLU converged quickly and thus took much less time than the one trained with sigmoid.
- We can clearly see overfitting in the model trained with ReLU. This is likely a consequence of the quick convergence.
- The model performance is significantly better when trained with ReLU.

In other words, training with ReLU gets us better model performance and faster convergence. It's hard to argue with that.

The following are some really good discussions on this topic: