In this article, we are going to study activation functions. We will look at why activation functions matter in a neural network and then compare different activation functions. Later on, I will show the results I got from implementing different activation functions on image classification problems. After reading this article, you will know which activation function should be used at different layers of a neural network, and why.
Activation functions, as their name suggests, are the functions that activate the neurons of a neural network. They are mathematical functions attached to neurons that decide whether or not a neuron will be activated, or fired (outputs 1), based on whether the neuron's input is relevant for the model's prediction. Most of them do so by squashing the neuron's output into the range 0 to 1 or -1 to 1 (there are some exceptions, such as ReLU, whose output is unbounded above). Activation functions also add non-linearity to the neural network.
Let’s try to understand this. Consider the following network in the diagram without any activation function.
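Here, from the first layer we get,
$h_1 = w_1 \times x_1 + w_3 \times x_2 + b_1$ and $h_2 = w_2 \times x_1 + w_4 \times x_2 + b_2$.
From the second layer, we get,
$\text{output} = w_5 \times h_1 + w_6 \times h_2$
or, $\text{output} = w_5 \times (w_1 \times x_1 + w_3 \times x_2 + b_1) + w_6 \times (w_2 \times x_1 + w_4 \times x_2 + b_2)$
or, $\text{output} = (w_5 \times w_1 + w_6 \times w_2) \times x_1 + (w_5 \times w_3 + w_6 \times w_4) \times x_2 + w_5 \times b_1 + w_6 \times b_2$
or, $\text{output} = W_1 \times x_1 + W_2 \times x_2 + B$,
where $W_1 = w_5 \times w_1 + w_6 \times w_2$, $W_2 = w_5 \times w_3 + w_6 \times w_4$ and $B = w_5 \times b_1 + w_6 \times b_2$.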
So the output of the above network is just a linear function of the inputs, exactly like a linear regression model, no matter how many layers we stack. This will not be the case if we use an activation function. An activation function introduces non-linearity into the model, and this added non-linearity is what makes a neural network different from a simple linear regression model. It gives the neural network the ability to solve complex problems or, we can say, the ability to understand complex relationships or patterns between different features. Features do not always have a linear relationship among themselves; they may have non-linear relationships too. These non-linear relationships can be understood by the neural network very well. Thanks to activation functions for that!! 👏
In a neural network, inputs are fed into the network from the input layer. In the neurons of the next layer, a **weighted sum** of the inputs is calculated and a bias is added to the sum. This sum is then passed through an activation function. The output of this activation function is the input of the next layer. As discussed above, these activation functions help the network learn complex data, approximate almost any function that maps inputs to outputs, and provide accurate predictions.
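To make the algebra above concrete, here is a small sketch in plain Python. The weight values are arbitrary numbers I picked for illustration, and the function names are my own. It checks that the two-layer network without activation gives the same output as the single collapsed linear model, and that putting a ReLU on the hidden layer breaks this equivalence.

```python
# Arbitrary illustrative parameters (assumed values, not from any trained model).
params = {"w1": 0.5, "w2": -1.0, "w3": -0.3, "w4": 0.8,
          "w5": 1.5, "w6": -0.7, "b1": 0.1, "b2": -0.2}


def two_layer_linear(x1, x2, p):
    """Two-layer network with no activation function."""
    h1 = p["w1"] * x1 + p["w3"] * x2 + p["b1"]
    h2 = p["w2"] * x1 + p["w4"] * x2 + p["b2"]
    return p["w5"] * h1 + p["w6"] * h2


def collapsed_linear(x1, x2, p):
    """The equivalent single linear model: W1*x1 + W2*x2 + B."""
    W1 = p["w5"] * p["w1"] + p["w6"] * p["w2"]
    W2 = p["w5"] * p["w3"] + p["w6"] * p["w4"]
    B = p["w5"] * p["b1"] + p["w6"] * p["b2"]
    return W1 * x1 + W2 * x2 + B


def two_layer_relu(x1, x2, p):
    """Same network, but with ReLU applied to the hidden layer."""
    h1 = max(0.0, p["w1"] * x1 + p["w3"] * x2 + p["b1"])
    h2 = max(0.0, p["w2"] * x1 + p["w4"] * x2 + p["b2"])
    return p["w5"] * h1 + p["w6"] * h2


linear_out = two_layer_linear(2.0, 1.0, params)
collapsed_out = collapsed_linear(2.0, 1.0, params)
relu_out = two_layer_relu(2.0, 1.0, params)
print(linear_out, collapsed_out, relu_out)
```

For this input the second hidden neuron is negative, so ReLU zeroes it out and the output no longer matches the collapsed linear model.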
The basic process carried out by a neuron in a neural network is:
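As a rough sketch of that process in code (the function name `neuron` and the example numbers are my own, chosen only for illustration), a single neuron computes a weighted sum, adds the bias, and passes the result through an activation function:

```python
import math


def sigmoid(z):
    """Sigmoid activation, used here only as an example activation."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus bias, passed through an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


# z = 0.5 * 1.0 + (-0.25) * 2.0 + 0.0 = 0.0, and sigmoid(0.0) = 0.5
out = neuron([1.0, 2.0], [0.5, -0.25], 0.0, sigmoid)
print(out)
```

The output of this neuron would then be one of the inputs to every neuron in the next layer.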
In this section, we will look at different activation functions. We will compare these functions based on their behaviour to know which function will perform better at different layers of the neural network. We will go through the following activation functions:
Sigmoid
Tanh
ReLU
Leaky ReLU
Parametric ReLU
ELU
Swish
Mish
Softmax
So let's start with Sigmoid.
The equation of the sigmoid function is $f(x) = \frac{1}{1 + e^{-x}}$. It is a non-linear, S-shaped function: near x = 0 a small change in x brings a large change in y, while for large positive or negative x the output flattens out towards 1 or 0. Below is the image of sigmoid and its derivative.
The sigmoid function is generally used in the output layer for binary classification, where the output is either 0 or 1. As the output of sigmoid lies between 0 and 1, the result can easily be predicted to be 1 if the value is greater than 0.5 and 0 otherwise.
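Here is a minimal sketch of sigmoid, its derivative, and the 0.5 threshold rule in plain Python. The two-branch form is my own addition to avoid overflow for very negative inputs, and `predict_label` is a hypothetical helper name.

```python
import math


def sigmoid(x):
    """f(x) = 1 / (1 + e^-x), written so that exp never overflows."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sigmoid_derivative(x):
    """f'(x) = f(x) * (1 - f(x)); largest at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)


def predict_label(x, threshold=0.5):
    """Binary decision: 1 if the sigmoid output exceeds the threshold."""
    return 1 if sigmoid(x) > threshold else 0


print(sigmoid(0.0), sigmoid_derivative(0.0))    # 0.5 0.25
print(predict_label(2.0), predict_label(-2.0))  # 1 0
```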
The equation for tanh is $f(x) = \frac{2}{1 + e^{-2x}} - 1$. It is a scaled and shifted version of sigmoid, $\tanh(x) = 2\sigma(2x) - 1$ where $\sigma$ is the sigmoid function, and it works better than sigmoid in most cases. Below is the image of tanh and its derivative.
Unlike the sigmoid function, tanh is generally used in the hidden layers. The output of tanh lies between -1 and +1, so the mean of the hidden layer activations stays close to 0. This helps learning converge more quickly.
The equation for ReLU is $f(x) = \max(0, x)$. It gives an output x if x is positive and 0 otherwise. Below is the image of ReLU and its derivative.
ReLU should be used in the hidden layers. It needs only a comparison and no exponentials, so it is computationally cheaper and faster than both sigmoid and tanh, which makes it a better choice than them. Also, ReLU outputs 0 for every negative input, so at any time only some of the neurons in the hidden layers are activated, which makes the network sparse and efficient to compute.
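The relationship between tanh and sigmoid can be checked numerically. This is only a sketch: it compares the formula given above with Python's built-in `math.tanh`, and the helper names are mine.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def tanh_from_sigmoid(x):
    """tanh written as a scaled and shifted sigmoid: 2 * sigmoid(2x) - 1."""
    return 2.0 * sigmoid(2.0 * x) - 1.0


def tanh_derivative(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    t = math.tanh(x)
    return 1.0 - t * t


for x in [-3.0, -0.5, 0.0, 0.5, 3.0]:
    print(x, tanh_from_sigmoid(x), math.tanh(x))
```

Because tanh is symmetric around 0, inputs of opposite sign give outputs of opposite sign, which is why the activations tend to average out near 0.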
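A quick sketch of ReLU and its derivative in plain Python. The list of pre-activation values is made up for illustration; it shows how negative inputs are switched off so that only some neurons stay active.

```python
def relu(x):
    """f(x) = max(0, x)."""
    return max(0.0, x)


def relu_derivative(x):
    """1 for positive inputs, 0 otherwise (the value at x = 0 is a convention)."""
    return 1.0 if x > 0 else 0.0


# Hypothetical pre-activation values of five hidden neurons.
pre_activations = [-2.0, -0.5, 0.0, 1.5, 3.0]
outputs = [relu(z) for z in pre_activations]
active_neurons = sum(1 for o in outputs if o > 0)

print(outputs)         # [0.0, 0.0, 0.0, 1.5, 3.0]
print(active_neurons)  # 2
```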
Leaky ReLU is a variant of ReLU. The equation for Leaky ReLU is $f(x) = \max(\alpha x, x)$, where $\alpha$ is a small constant (normally 0.01). Below is the image of Leaky ReLU.
Apart from this, it enjoys similar advantages as ReLU.
Similar to ReLU, Leaky ReLU should also be used in the hidden layers. But Leaky ReLU should be treated as an alternative to ReLU rather than a default replacement, because it does not always perform better than ReLU.
Parametric ReLU is another variant of ReLU. The equation for parametric ReLU is $f(x) = \max(\alpha x, x)$, but here the value of $\alpha$ is not assigned by us. It is a parameter which is learnt along with the weights and biases during training. Below is the image of Parametric ReLU.
It should be used in the hidden layers. Because $\alpha$ is learnt instead of being fixed, parametric ReLU can perform better than Leaky ReLU. But again, if we compare it with ReLU, then just like Leaky ReLU, it should be treated as an alternative to ReLU.
ELU is another activation function that tries to fix the problem of ReLU and, to some extent, it does so. The equation for ELU is $f(x) = x$ if $x > 0$, else $\alpha(e^{x} - 1)$. Here $\alpha$ is a hyperparameter whose value lies in the range [0.1, 0.3]. For any positive value of x, ELU is the same as ReLU, but in the negative region the value of y is slightly below zero and flattens out towards $-\alpha$. Below is the image of ELU.
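Leaky ReLU is a one-line change from ReLU. The sketch below uses the usual default of 0.01 for α mentioned above; the function names are my own.

```python
def leaky_relu(x, alpha=0.01):
    """f(x) = max(alpha * x, x) for 0 < alpha < 1."""
    return x if x > 0 else alpha * x


def leaky_relu_derivative(x, alpha=0.01):
    """Slope is 1 for positive inputs and alpha (not 0) for negative ones."""
    return 1.0 if x > 0 else alpha


print(leaky_relu(3.0))              # 3.0
print(leaky_relu(-2.0))             # a small negative value instead of 0
print(leaky_relu_derivative(-2.0))  # 0.01
```

Unlike ReLU, the gradient for negative inputs is small but not zero, so the neuron can still receive updates.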
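To give a feel for what "α is learnt" means, here is a toy sketch of a single gradient step on α for one input. The squared-error loss, the learning rate, and the target value are all assumptions made for this illustration; in a real framework α is updated by backpropagation through the whole network.

```python
def prelu(x, alpha):
    """Same shape as Leaky ReLU, but alpha is a trainable parameter."""
    return x if x > 0 else alpha * x


def prelu_grad_alpha(x):
    """Derivative of the output with respect to alpha: x if x <= 0, else 0."""
    return x if x <= 0 else 0.0


def sgd_step_on_alpha(x, target, alpha, lr=0.1):
    """One gradient step on alpha for the loss 0.5 * (prelu(x) - target)^2."""
    error = prelu(x, alpha) - target
    grad = error * prelu_grad_alpha(x)
    return alpha - lr * grad


alpha = 0.01
new_alpha = sgd_step_on_alpha(x=-2.0, target=-0.5, alpha=alpha)
print(alpha, new_alpha)  # alpha moves towards the value that fits the target
```

Only negative inputs contribute to the update, since α has no effect on the output for positive inputs.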
(I know you guys have already guessed it)
It should be used in the hidden layers.
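A small sketch of ELU in plain Python. The default α = 0.2 is simply a value I picked from the range mentioned above, and `math.expm1` is used for a slightly more accurate e^x − 1.

```python
import math


def elu(x, alpha=0.2):
    """f(x) = x if x > 0, else alpha * (e^x - 1)."""
    return x if x > 0 else alpha * math.expm1(x)


def elu_derivative(x, alpha=0.2):
    """1 for positive inputs, alpha * e^x for negative ones."""
    return 1.0 if x > 0 else alpha * math.exp(x)


for x in [-10.0, -1.0, 0.0, 2.0]:
    print(x, elu(x))
```

For very negative inputs the output levels off close to −α instead of growing without bound.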
Swish was introduced in 2017. The equation of swish is $f(x) = x \cdot \sigma(x)$, where $\sigma$ is the sigmoid function. Here is the image of Swish.
I know what you guys are thinking after looking at the figure of swish. Another variant of ReLU!!!! Of course, at first sight swish does look like ReLU, but that is not the case. All the activation functions we have read about till now had one thing in common: they were all monotonic in nature, which means they were either entirely non-increasing or entirely non-decreasing. But swish is a non-monotonic function. Look at the negative region. Just below 0, the function dips downwards, but as x becomes more negative it turns upward again and approaches 0. This makes swish different from the rest of the activation functions.
(you guessed it correct again.)
It is a great alternative to ReLU. The authors recorded an improvement of 0.9% and 0.6% in top-1 classification accuracy on ImageNet for Mobile NASNet-A and Inception-ResNet-v2 respectively.
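The non-monotonic dip of swish is easy to see numerically. This is just a sketch of the formula above in plain Python, with sample points I chose myself.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def swish(x):
    """f(x) = x * sigmoid(x)."""
    return x * sigmoid(x)


for x in [-5.0, -1.0, 0.0, 1.0, 10.0]:
    print(x, round(swish(x), 4))
```

The value at x = -1 is lower than the values at both x = -5 and x = 0, so the curve goes down and then comes back up, while for large positive x the output is almost equal to x, just like ReLU.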
Swish is not the only non-monotonic function here. Mish is here to accompany it. Mish was actually inspired by Swish. The equation of mish is $f(x) = x \cdot \tanh(\varsigma(x))$, where $\varsigma(x) = \ln(1 + e^{x})$ is the softplus activation function. Below is the image of Mish.
It should be used in the hidden layers. The author of mish recorded an increase in Top-1 test accuracy of 0.494% and 1.671% compared to the same network with Swish and ReLU respectively, for Squeeze Excite Net-18 on CIFAR-100 classification.
The equation for the softmax function is $f(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}$, where n is the total number of neurons in the layer. Each output of this function lies in the range [0, 1], and the outputs add up to 1.
As it forms a probability distribution, it should be used in the output layer of a network that classifies objects into multiple classes.
So by now, we have talked about activation functions inside and out. A lot of facts were put in front of you. Now it's time to check the legitimacy of these facts. In the next section, I am going to perform two experiments to make things clearer to you all.
For the first experiment, I am going to use everyone's favourite dataset: the FashionMNIST dataset. The model architecture is as follows. In this model, I have used the following six activation functions to check which one performs well.
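Here is a minimal sketch of mish built from softplus and tanh in plain Python. The overflow-safe form of softplus is my own choice and is not something the article prescribes.

```python
import math


def softplus(x):
    """ln(1 + e^x), written in a form that does not overflow for large x."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x):
    """f(x) = x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))


for x in [-5.0, -1.0, 0.0, 1.0, 10.0]:
    print(x, round(mish(x), 4))
```

Like swish, mish dips below zero for small negative inputs and comes back up towards 0 as x becomes more negative.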
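A short sketch of softmax in plain Python. Subtracting the maximum value before exponentiating is a common numerical-stability trick that I have added; it does not change the result. The example scores are made up.

```python
import math


def softmax(logits):
    """Turn a list of raw scores into probabilities that sum to 1."""
    m = max(logits)  # shift for numerical stability; the output is unchanged
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw outputs of a 3-class output layer.
probs = softmax([2.0, 1.0, 0.1])
predicted_class = probs.index(max(probs))

print(probs)
print(sum(probs), predicted_class)
```

The predicted class is simply the index with the highest probability.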
The model is trained for 30 epochs.
In case you want to get your hands dirty with some code, check out this notebook. You will also get to know how I instrumented this project with W&B and got some fabulous results.
You can check my W&B project workspace here.
Here is the colab notebook for you.
For activation functions I used
For Weight Initializers I used
And the learning rates that I used are
I ran it for 10 epochs. You can check out the result here and see how these activation functions performed with different hyperparameters.
In the previous experiment, Mish undoubtedly performed much better than the others. Let's see what happens in this experiment. Here I am using the Animal Classifier dataset, which has more realistic images. The flow of the experiment is as follows.
And this is the architecture of the model used for training.
I trained the model at two learning rates, which are 0.001 and 0.01. The notebook containing the code is here. You can check out my W&B workspace here. This model was trained for 50 epochs.
This section is inspired by Ayush Thakur's Report where he visualized the activations.
By visualizing the activations, we can get a sense of what is going on inside a neural network. To do this, a custom Keras callback is used and the results are logged using W&B. I chose the very first convolution layer of the model to see the noticeable differences with each activation function. So let's look at the results.
I hope this report was as fun for you guys to read as it was for me to prepare. Knowing about activation functions is really important because they play a major role in the training of neural networks. So it is better to be aware of the pros and cons of different activation functions beforehand. And guys, by now you definitely know a lot about activation functions, so congratulations for that🥳.
Talking about the activation functions we compared in this report, Mish performed better than the others. Swish put up an equal fight, but in the end Mish won. So Mish lived up to the claims of its author and gave really fantastic results. But Mish, being a new activation function, has very few implementations so far. On this note, I would suggest you try Mish, Swish and ReLU on other problems and find out whether Mish's brilliant performance is consistent or not.
References: