## Introduction

In this article, we are going to study activation functions. We will look at the importance of activation functions in neural networks and then compare different activation functions. Later on, I will show the results I got from implementing different activation functions in an image classification problem. After reading this article, you will know which activation function should be used at different layers of a neural network and why.

## What are Activation Functions?

Activation functions, as their name suggests, are the functions that activate the neurons of a neural network. They are mathematical functions attached to neurons that decide whether the current neuron will be activated, or fired (outputs 1), based on whether the neuron's input is relevant for the model's prediction. They typically do so by normalizing the output of a neuron to a range such as 0 to 1 or -1 to 1 (with some exceptions). Activation functions also add non-linearity to the neural network.

## But why do we need non-linearity in a neural network?

Let’s try to understand this. Consider the following network in the diagram without any activation function.

Here, from the first layer we get,

$h_1 = w_1 x_1 + w_3 x_2 + b_1$ and $h_2 = w_2 x_1 + w_4 x_2 + b_2$.

From the second layer, we get,

$\text{output} = w_5 h_1 + w_6 h_2$

or, $\text{output} = w_5 (w_1 x_1 + w_3 x_2 + b_1) + w_6 (w_2 x_1 + w_4 x_2 + b_2)$

or, $\text{output} = (w_5 w_1 + w_6 w_2) x_1 + (w_5 w_3 + w_6 w_4) x_2 + w_5 b_1 + w_6 b_2$

or, $\text{output} = W_1 x_1 + W_2 x_2 + B$,

where $W_1 = w_5 w_1 + w_6 w_2$, $W_2 = w_5 w_3 + w_6 w_4$ and $B = w_5 b_1 + w_6 b_2$.

So the output of the above network is just a linear regression model. This will not be the case if we use an activation function, which introduces non-linearity into the model. This added non-linearity is what makes a neural network differ from a simple linear regression model. It gives the neural network the ability to solve complex problems, or in other words, the ability to understand complex relationships or patterns between different features. Features need not have linear relationships among themselves; they may have non-linear relationships too, and these can be captured very well by a neural network. Thanks to activation functions for that!! :clap:
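To see this collapse concretely, here is a minimal Python sketch of the small 2-2-1 network derived above. The weight values are arbitrary, chosen only for illustration:

```python
# Arbitrary illustrative weights and biases for the tiny 2-2-1 network above
w1, w2, w3, w4 = 0.5, -1.2, 0.8, 0.3
b1, b2 = 0.1, -0.4
w5, w6 = 2.0, -0.7

def two_linear_layers(x1, x2):
    # Two stacked layers with NO activation function
    h1 = w1 * x1 + w3 * x2 + b1
    h2 = w2 * x1 + w4 * x2 + b2
    return w5 * h1 + w6 * h2

# The collapsed single linear layer, with W1, W2 and B exactly as derived above
W1 = w5 * w1 + w6 * w2
W2 = w5 * w3 + w6 * w4
B = w5 * b1 + w6 * b2

def one_linear_layer(x1, x2):
    return W1 * x1 + W2 * x2 + B

# Both give identical outputs for any input: the depth bought us nothing
assert abs(two_linear_layers(1.5, -2.0) - one_linear_layer(1.5, -2.0)) < 1e-9
```

However many linear layers you stack, without an activation function the whole network is equivalent to one linear layer.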

## How does activation function work in a neural network?

In a neural network, inputs are fed into the network from the input layer. In the neurons of the next layer, a **weighted sum** of the inputs is calculated and a bias is added to the sum. This sum is then passed through an activation function, and the output of that activation function is the input of the next layer. As we have discussed above, these activation functions can help the network learn complex data, compute and learn almost any function representing a question, and provide accurate predictions.

The basic process carried out by a neuron in a neural network is:
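A minimal sketch of that process (the `neuron` helper name is illustrative, not from any particular library): take the weighted sum, add the bias, then apply the activation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias, activation=sigmoid):
    # 1. weighted sum of inputs  2. add bias  3. pass through the activation
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

out = neuron([0.5, -1.0], [0.8, 0.2], 0.1)
assert 0.0 < out < 1.0  # a sigmoid neuron always outputs a value in (0, 1)
```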

## Comparison of different Activation Functions

In this section, we will be looking at different activation functions. We will compare these functions based on their behaviour, to know which function will perform better at different layers of a neural network. We will be going through the following activation functions:

• Sigmoid

• Tanh

• ReLU

• Leaky ReLU

• Parametric ReLU

• ELU

• Swish

• Mish

• Softmax

### Sigmoid

The equation of the sigmoid function is f(x) = 1/(1 + e^-x). It is a non-linear function where a small change in x can bring a large change in y. Below is the image of sigmoid and its derivative.

• It is differentiable at every point. This is a desired property for any activation function, since it means we can find the gradient of the sigmoid curve at any point, which helps in the backpropagation of error in the model. If an activation function is not differentiable at every point, backpropagation becomes problematic because gradients can't be found at those points.
• The output values are bound between 0 to 1. This means the activations will not be blown up.
• This function gives clear predictions. Let me explain this to you: input values greater than 2 or less than -2 bring the output close to 1 or 0 respectively.

• One of the major disadvantages of using sigmoid is the vanishing gradient problem. See in the image: for very high or very low values of x, the derivative of the sigmoid is very small. The highest value of the derivative is 0.25. If you remember, we use the chain rule in the gradient calculation for backpropagation. Due to the chain rule, as we move towards the input layer, the number of factors in the product keeps increasing, and these factors are small numbers. So the product keeps decreasing as we move towards the beginning of the network. This makes optimizing the network very difficult, since the resulting derivative is very small. It can result in the network refusing to learn further, or being too slow to reach an accurate prediction.

• The outputs aren’t zero centred. The output of this activation function always lies between 0 and 1, i.e. it is always positive. A zero centred function is one whose outputs are sometimes less than 0 (negative) and sometimes greater than 0 (positive). Due to this nature of the sigmoid function, all the gradients connected to a neuron of a layer will always be either all positive or all negative. This restricts the possible update directions of the gradients, so it takes substantially longer to converge, whereas a zero centred function helps in fast convergence.
• It saturates and kills gradients. Refer to the figure of the derivative of the sigmoid. At both positive and negative ends, the value of the gradient saturates at 0. That means for those values, the gradient will be 0 or close to 0, which simply means no learning in backpropagation.
• It is computationally expensive because of the exponential term in it.
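As a quick sanity check of the formula and the 0.25 peak of the derivative, here is a small sketch. It uses the standard identity σ'(x) = σ(x)(1 − σ(x)), which is not stated above but follows from differentiating the equation:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    # standard identity: sigma'(x) = sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

assert sigmoid(0) == 0.5
assert sigmoid_derivative(0) == 0.25          # the maximum possible gradient
assert sigmoid_derivative(10) < 1e-4          # gradient nearly vanishes far from 0
assert sigmoid(3) > 0.95 and sigmoid(-3) < 0.05  # outputs saturate towards 1 and 0
```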

#### Uses

The sigmoid function is generally used in the case of binary classification where the output is either 0 or 1. As the output of sigmoid lies between 0 and 1 so, the result can be predicted easily to be 1 if the value is greater than 0.5 and 0 otherwise.

### Tanh

The equation for tanh is f(x) = 2/(1 + e^-2x) - 1. It is a mathematically shifted version of sigmoid and works better than sigmoid in most cases. Below is the image of tanh and its derivative.

• It has similar advantages as sigmoid but better than that because it is zero centred. The output of tanh lies between -1 and 1. Hence solving one of the issues with the sigmoid.

• It also has the problem of vanishing gradient but the derivatives are steeper than that of the sigmoid. Hence making the gradients stronger for tanh than sigmoid.
• As it is similar to sigmoid, tanh is also computationally expensive.
• Similar to sigmoid, here also the gradients saturate.
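The "shifted sigmoid" relationship mentioned above can be verified numerically. The identity tanh(x) = 2σ(2x) − 1 follows directly from the equation for tanh given above:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh_via_sigmoid(x):
    # tanh is a scaled and shifted sigmoid: tanh(x) = 2 * sigmoid(2x) - 1
    return 2.0 * sigmoid(2.0 * x) - 1.0

# matches the built-in tanh for a range of inputs, and is zero centred
for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    assert abs(tanh_via_sigmoid(x) - math.tanh(x)) < 1e-12
assert tanh_via_sigmoid(-1.0) < 0.0 < tanh_via_sigmoid(1.0)
```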

#### Uses

Unlike the sigmoid function, tanh is generally used in the hidden layers. The output of tanh lies between -1 and +1. Hence the mean for the hidden layers stays near 0. This makes the learning of the hidden layer converge quickly.

### ReLU(Rectified Linear Unit)

The equation for ReLU is f(x) = max(0, x). It gives an output of x if x is positive and 0 otherwise. Below is the image of ReLU and its derivative.

• It is computationally effective as it involves simpler mathematical operations than sigmoid and tanh.
• Although it looks like a linear function, it adds non-linearity to the network, making it able to learn complex patterns.
• It doesn't suffer from the vanishing gradient problem.
• It is unbounded at the positive side. Hence removing the problem of gradient saturation.
• It provides sparsity to the network, which as a result lessens the space and time complexity. Sparse networks are generally faster than dense networks as there are fewer things to compute. Sparsity results in concise models that often have better predictive power and less overfitting/noise.

• It suffers from the dying ReLU problem. ReLU always discards negative values, i.e. the deactivations, by making them 0. Because of this, the gradient of these units also becomes 0, and by now we all know that a 0 gradient means no weight updates during backpropagation. Simply speaking, the neurons that reach this state stop responding to variations in the input or the error. This, as a result, hampers the ability of the model to fit the data properly.
• It is non-differentiable at 0.
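A minimal sketch of ReLU and its piecewise gradient, illustrating the dying ReLU behaviour described above:

```python
def relu(x):
    return max(0.0, x)

def relu_derivative(x):
    # gradient is 1 for positive inputs and 0 for negative inputs
    # (undefined at exactly 0; implementations usually just pick 0 or 1)
    return 1.0 if x > 0 else 0.0

assert relu(3.5) == 3.5 and relu(-3.5) == 0.0
# a unit stuck on the negative side gets zero gradient: the "dying ReLU" case
assert relu_derivative(-3.5) == 0.0
assert relu_derivative(3.5) == 1.0
```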

#### Uses

ReLU should be used in the hidden layers. As it is computationally less expensive and faster than sigmoid and tanh, it is a better choice than them. Also, in hidden layers only a few neurons are activated at a time, making computation efficient and easy.

### Leaky ReLU

It is a variant of ReLU. The equation for Leaky ReLU is f(x) = max(αx, x) where α is a small constant (normally 0.01). Below is the image of Leaky ReLU.

• It tries to remove the dying ReLU problem. Instead of making negative inputs 0, which was the case for ReLU, it makes the output really small but proportional to the input. Because of this, the gradient doesn't saturate to 0: if the input is negative, the gradient will be α. As a result, there will be learning for these units as well.

Apart from this, it enjoys similar advantages as ReLU.

• One disadvantage of leaky ReLU is that the value of α is a constant hyperparameter. The most used value of α is 0.01. How would you know whether it should be 0.01, 0.02 or anything else? It can be any small constant and will be data specific, but there is no provision for the neural network to choose this value itself based on the inputs.
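A minimal sketch of leaky ReLU. Note that the max(αx, x) form only works as intended when α < 1, since then αx exceeds x exactly when x is negative:

```python
def leaky_relu(x, alpha=0.01):
    # for alpha < 1: max(alpha*x, x) equals x when x >= 0, and alpha*x otherwise
    return max(alpha * x, x)

assert leaky_relu(5.0) == 5.0
# negative inputs give a small but non-zero output, so a gradient of alpha
# still flows back and the unit keeps learning (no dying ReLU)
assert leaky_relu(-5.0) == -0.05
```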

#### Uses

Similar to ReLU, Leaky ReLU should also be used in the hidden layers. But leaky ReLU should be treated as an alternative to ReLU, because it doesn't necessarily always perform better than ReLU.

### Parametric ReLU

This is another variant of ReLU. The equation for parametric ReLU is f(x) = max(αx,x). But here the value of α is not assigned by us. It is a parameter which is learnt along with weights and biases during the training period. Below is the image of Parametric ReLU.

• As mentioned above, α is a parameter. So the value of α is learnt along with the other parameters of the neural network. Hence removing the disadvantage of leaky ReLU.

#### Uses

It should be used in the hidden layers. As it is advantageous over leaky ReLU, it is possible for parametric ReLU to perform better than leaky ReLU. But again, if we compare it with ReLU, then just like leaky ReLU, it should also be treated as an alternative to ReLU.

### ELU(Exponential Linear Unit)

This is another activation function trying to fix the problem of ReLU and to some extent, it also does so. The equation for ELU is f(x) = x if x > 0 else α(e^x - 1). Here α is a hyperparameter whose value usually lies in the range [0.1, 0.3]. For any positive value of x, ELU is similar to ReLU, but in the negative region, the value of y will be slightly below zero. Below is the image of ELU.

• Unlike ReLU, it is differentiable at 0 (exactly so when α = 1).
• It bends smoothly at 0 whereas ReLU's bent was really sharp. This smoothness plays a beneficial role in optimization and generalization.
• Just like leaky ReLU and parametric ReLU, ELU also produces negative outputs, which nudges the parameters of the model in the right direction and improves the learning capacity of the model.
• The negative values of ELU push the mean activation towards zero. Doesn't it sound similar to batch normalization? It does, because it is, but here with much lower computational complexity. The learning of the model is sped up because of this shifting of the mean value towards zero: mean activations close to zero decrease the bias shift for units in the next layer, which speeds up learning by bringing the natural gradient closer to the unit natural gradient.
• Unlike leaky ReLU and parametric ReLU, ELU ensures a noise-robust deactivation state. Let me make it a bit simpler for you all. In the case of leaky ReLU and parametric ReLU, for negative values of x we get a negative value of y, which means that for very large negative inputs there is still room for a considerable negative output. This noise can create some imbalance in the model. But ELU comes to the rescue here. In ELU, the curve in the negative region is not a straight line because of the exponential term. Because of this term, the negative values saturate to some level, and as a result the model is less impacted by the noise. So there will be a taste of the negative inputs, but they will not be allowed to create any imbalance in the model.
• Because of the above characteristic of ELU, the risk of overfitting is also reduced.

(I know you guys have already guessed it)

• It is computationally expensive due to the presence of the exponential term.
• α is a hyperparameter that has to be chosen by hand.
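A minimal sketch of ELU from the equation above, showing how negative outputs saturate towards −α rather than growing without bound:

```python
import math

def elu(x, alpha=0.1):
    # alpha is a hand-picked hyperparameter, often in [0.1, 0.3] per the text
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

assert elu(5.0) == 5.0                      # identical to ReLU for positive x
# negative outputs saturate towards -alpha, damping large negative noise
assert abs(elu(-100.0) - (-0.1)) < 1e-6
assert 0.0 > elu(-1.0) > elu(-100.0)        # bounded below, unlike leaky ReLU
```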

#### Uses

It should be used in the hidden layers.

### Swish

This was introduced in 2017. The equation of swish is f(x)=x · σ(x). Here is the image of Swish.

I know what you guys are thinking after looking at the figure of swish. Another variant of ReLU!!!! Of course, at first sight swish does look like ReLU, but that is not the case. Till now, all the activation functions we have read about had one thing in common: they were all monotonic in nature, meaning they were either entirely nondecreasing or entirely nonincreasing. But swish is a non-monotonic function. Look at the negative region: just to the left of 0, the curve dips downwards, but soon it rises back up towards 0. This makes swish different from the rest of the activation functions.
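The non-monotonic dip in the negative region can be checked directly with a minimal sketch of the formula f(x) = x · σ(x):

```python
import math

def swish(x):
    # x gated by its own sigmoid: f(x) = x * sigmoid(x)
    return x * (1.0 / (1.0 + math.exp(-x)))

# Non-monotonic: moving left from 0 the output first dips below, then
# rises back towards 0, so swish(-1) sits BELOW swish(-5)
assert swish(-1.0) < swish(-5.0) < 0.0 < swish(1.0)
# Unbounded above: large positive inputs pass through almost unchanged
assert abs(swish(20.0) - 20.0) < 1e-6
```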

• Just like ReLU, swish is unbounded above. This means that for very large values, the outputs do not saturate to some maximum value (which was the case for sigmoid and tanh), so the gradient doesn't fall to 0 for large positive inputs. This makes learning more effective.
• It is bounded below. This means that as the input tends to negative infinity, the output tends to a constant. Because of this, very large negative values, which are nothing but deactivations, are effectively forgotten by swish. This feature introduces regularization into the model.
• It is non-monotonic in nature i.e. it doesn't move in one direction. In the positive region, the nature of swish is somewhat similar to that of ReLU. It is the negative region where the difference lies and this difference is nothing but the non-monotonicity. Because of this feature, it is possible for the output to still fall even if the input increases. This in return increases the information storage capacity of the model and of course the discriminative capacity. But how does it happen? It happens because swish has negative derivatives at some points and positive derivatives at others instead of all positive or all negative derivatives. This has a huge role in the success of swish over ReLU as the expressivity of the model increases.
• This function is self-gated. I know you are thinking about LSTMs now. Well, that is completely justified, because this feature was actually inspired by LSTMs. In an LSTM we had sigmoids as the gates. These gates controlled the quantity of a vector that was passed to the next stage, and this was achieved by multiplying it by the output of the sigmoid, which is nothing but a number between 0 and 1. Just like this, in swish the sigmoid controls the quantity of information going to the next layer. In the case of activations, the value passes through almost unchanged, but in the case of deactivations, the value is reduced by multiplying it with the sigmoid of the value. So, self-gated means that the gate is the sigmoid activation of the value itself: the gate is σ(x) and the value to pass through is x. With self-gating, all this is achieved with just one scalar input, whereas multi-gating would have required multiple two-scalar inputs.
• It is a smooth function. By smooth, it is meant that it doesn't change direction all of a sudden the way ReLU does near 0. Rather, it bends really smoothly from 0 towards x < 0. This smoother transition results in a smoother loss function, which allows the optimizer to go through fewer oscillations, helping in faster convergence, effective optimization and generalization.

(you guessed it correct again.)

• It is computationally expensive.

#### Uses

It is a great alternative to ReLU. The authors recorded an increase of 0.9% and 0.6% in top-1 classification accuracy on ImageNet for Mobile NASNet-A and Inception-ResNet-v2 respectively.

### Mish

Swish is not the only non-monotonic function here; Mish is here to accompany it. Mish was actually inspired by Swish. The equation of mish is f(x) = x · tanh(ς(x)), where ς(x) = ln(1 + e^x) is the softplus activation function. Below is the image of Mish.

• It has the same property of unbounded above and bounded below like swish.
• It is non-monotonic in nature and hence provides similar benefits to swish.
• It is also self-gated. Here the gate is tanh(ς(x)) instead of σ(x), but the purpose of the gate is the same. This difference of gate is the actual reason for the performance difference between swish and mish.
• It is continuously differentiable to infinite order: it falls in the class C∞, whereas ReLU falls in the class C0. By now we all know the importance of the differentiability of an activation function.
• It is self regularized. As stated by the author of Mish, a Δ(x) term present in the first derivative of Mish showcases behaviour similar to that of a regularizing preconditioner, which helps in making the optimization of deep, complex neural networks much easier. This adds an advantage over Swish.

• It is computationally very expensive.
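A minimal sketch of mish from the equation above, confirming the same unbounded-above, bounded-below, non-monotonic behaviour described for swish:

```python
import math

def softplus(x):
    # log1p(e^x) is a numerically safer way to write ln(1 + e^x) for small x
    return math.log1p(math.exp(x))

def mish(x):
    # x gated by tanh(softplus(x)) instead of Swish's sigmoid gate
    return x * math.tanh(softplus(x))

assert mish(0.0) == 0.0
assert abs(mish(10.0) - 10.0) < 1e-6     # unbounded above: ~identity for large x
assert mish(-1.0) < mish(-5.0) < 0.0     # bounded below and non-monotonic
```

(Note: this direct formula overflows for very large positive x because of `math.exp`; production implementations use more careful numerics.)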

#### Uses

It should be used in the hidden layers. The author of mish recorded an increase in top-1 test accuracy of 0.494% and 1.671% compared to the same network with Swish and ReLU respectively, for Squeeze Excite Net-18 on CIFAR-100 classification.

### Softmax

The equation for the softmax function is f(x_i) = e^(x_i) / ∑(j=1 to n) e^(x_j), where n is the total number of neurons in the layer. The outputs of this function lie in the range [0, 1] and sum to 1.

• It is able to handle multiple classes. It exponentiates the output for each class and divides by the sum over all classes, so every output lies between 0 and 1 and together they form a probability distribution. This gives a clear probability of the input belonging to any particular class.

#### Uses

As it forms a probability distribution, it should be used in the output layer of a network that is trying to classify objects belonging to multiple classes.

So by now, we have talked about activation functions inside and out. A lot of facts were put in front of you. Now it's time to check the legitimacy of these facts. In the next section, I am going to perform two experiments to make things clearer to you all.

## Experiment 1

For the first experiment, I am going to use everyone's favourite dataset: the FashionMNIST dataset. The model architecture is as follows. In this model, I have used the following six activation functions to check which one performs well.

• Sigmoid
• ReLU
• Leaky ReLU
• ELU
• Swish
• Mish

The model is trained for 30 epochs.

In case you want to get your hands dirty with some code, check out this notebook. You will also get to know how I instrumented this project with W&B and got some fabulous results.

You can check my W&B project workspace here.

## Hyperparameter Tuning

Here is the colab notebook for you.

1. For activation functions I used

• ReLU
• Swish
• Mish
2. For Weight Initializers I used

• he_normal
• he_uniform
• Orthogonal
• lecun_normal
• lecun_uniform
3. And the learning rates that I used are

• 0.1
• 0.01
• 0.001

I ran it for 10 epochs. You can check out the result here and see how these activation functions performed with different hyperparameters.

## Experiment 2

In the previous experiment, Mish undoubtedly performed much better than the others. Let's see what happens in this experiment. For this experiment, I am using the Animal Classifier dataset, which has more realistic, real-life images. The flow of the experiment is as follows.

And this is the architecture of the model used for training.

I trained the model at two learning rates, which are 0.001 and 0.01. The notebook containing the code is here. You can check out my W&B workspace here. This model was trained for 50 epochs.

## Visualizing the Activations With ReLU, Swish, and Mish

This section is inspired by Ayush Thakur's Report where he visualized the activations.

By visualizing the activations, we can get a sense of what is going on inside a neural network. To do this, a Keras custom callback is used and the results are logged using W&B. I chose the very first convolution layer of the model here to see the noticeable differences with each activation function. So let's look at the results.

## Final Words

I hope this report was as fun for you guys as it was for me to prepare. Knowing about activation functions is really very important because they play a major role in the training of neural networks. So it is better to be aware of the pros and cons of different activation functions beforehand. And guys, by now you definitely know a lot about activation functions, so congratulations for that🥳.

Talking about the activation functions we compared in this report, Mish has performed better than the other activation functions. Swish gave an equal fight to Mish, but in the end Mish won. So Mish lived up to the words of the author of its paper and gave really fantastic results. But Mish, being a new activation function, has relatively few implementations so far. On this note, I would suggest you implement Mish, Swish and ReLU in other problems and find out whether Mish's brilliant performance is consistent or not.

References: