**Author: Ayush Thakur**

In this post, we’ll see what makes a neural network underperform and ways we can debug this by visualizing the gradients and other parameters associated with model training. We’ll also discuss the problem of vanishing and exploding gradients and methods to overcome them.

Finally, we’ll see why proper weight initialization is useful, how to do it correctly, and dive into how regularization methods like dropout and batch normalization affect model performance.

Neural network bugs are really hard to catch because:

- The code never crashes, raises an exception, or even slows down.
- The network still trains and the loss will still go down.
- The values converge after a few hours, but to really poor results.

I highly recommend reading A Recipe for Training Neural Networks by Andrej Karpathy if you’d like to dive deeper into this topic.

So how can we debug our neural networks better?

There is no decisive set of steps to be followed while debugging neural networks. But here is a list of concepts that, if implemented properly, can help debug your neural networks.

We must understand the nuances of data - the type of data, the way it is stored, class balances for targets and features, value scale consistency of data, etc.

We must think about data preprocessing and try to incorporate domain knowledge into it. There are usually two occasions when data preprocessing is used:

- Data cleaning: The objective task can be achieved more easily if some parts of the data, known as artifacts, are removed.
- Data augmentation: When we have limited training data, we transform each data sample in numerous ways to be used for training the model (for example, scaling, shifting, or rotating images). This post does not focus on the issues caused by bad data preprocessing.

If we have a small dataset of 50-60 data samples, the model should overfit quickly, i.e., the loss should reach zero within 2-5 epochs. For this sanity check, be sure to remove any regularization from the model. If your model is not overfitting, it might be because it is not architected correctly or because your choice of loss is incorrect. Maybe your output layer is activated with sigmoid while you are trying to do multi-class classification. These errors can be easy to miss. Check out my notebook demonstrating this here.
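To see why a sigmoid output layer is a poor fit for multi-class classification, it can help to compare it with softmax directly: independent sigmoids do not form a probability distribution over the classes, while softmax does. The following is a minimal sketch in plain Python; the logits are made-up values used only for illustration.

```python
import math

def sigmoid(x):
    """Squash a single logit into (0, 1), independently of the other logits."""
    return 1.0 / (1.0 + math.exp(-x))

def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical raw outputs for 3 classes
sigmoid_scores = [sigmoid(z) for z in logits]
softmax_probs = softmax(logits)

print(sum(sigmoid_scores))  # generally not 1: each score is independent
print(sum(softmax_probs))   # 1 up to floating-point error
```

If the per-class scores of your model do not sum to one on a multi-class task, the output activation is worth a second look.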

So how can one avoid such errors? Keep reading.

Using fancy regularizers and schedulers may be overkill. In case of an error, it’s easier to debug a small network. Common errors include forgetting to pass tensors from one layer to another, having an unreasonable ratio of input to output neurons, etc.

If your model architecture is built on top of a standard backbone like VGG, ResNet, Inception, etc., you can use weights pre-trained on a standard dataset. If you can, find weights pre-trained on the dataset that you are working with. One interesting recent paper, Transfusion: Understanding Transfer Learning for Medical Imaging, shows that using even a few early layers from a pre-trained ImageNet model can improve both the speed of training and the final accuracy of medical imaging models.

Therefore, you should use a general-purpose pre-trained model, even if it is not in the domain of the problem you’re solving. Note that this paper also points out that the improvement from an ImageNet pre-trained model applied to medical imaging is not that great, so a head start is not guaranteed either. For more, I recommend reading this amazing blog post by Jeremy Howard.

First, make sure that you are using the right loss function for the given task. For a multi-class classifier, a binary loss function will not help improve the accuracy, so categorical cross-entropy is the right choice.

If your model started by guessing randomly (i.e., no pretrained model), check to see if the initial loss is close to your expected loss. If you’re using cross-entropy loss, check to see that your initial loss is approximately `-log(1 / n_classes)`, where `n_classes` is the number of classes.
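For a classifier that starts out guessing uniformly over the classes, the expected cross-entropy is the negative log of one over the number of classes. Here is a minimal sketch of that check in plain Python; the function names are illustrative and not taken from any library.

```python
import math

def expected_initial_loss(num_classes):
    """Cross-entropy of a uniform guess over num_classes classes."""
    return -math.log(1.0 / num_classes)

def cross_entropy(probs, target_index):
    """Cross-entropy for one sample given predicted class probabilities."""
    return -math.log(probs[target_index])

num_classes = 10  # e.g. a 10-class problem
uniform = [1.0 / num_classes] * num_classes

print(expected_initial_loss(num_classes))      # about 2.3026 for 10 classes
print(cross_entropy(uniform, target_index=3))  # same value for any target
```

If the first logged loss of an untrained model is far from this number, the initialization or the loss computation is worth inspecting.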

You can get some more suggestions here.

The learning rate determines the step size at each iteration while moving toward the minimum of a loss function. You can tweak the learning rate according to how steep or smooth your loss function is, but this can be a time- and resource-consuming step. Can you find a good learning rate automatically?

Leslie N. Smith presented a very smart and simple approach to systematically find a learning rate in a short amount of time and with minimal resources. All you need is a model and a training set. The model is initialized with a small learning rate and trained on a batch of data. The associated loss and learning rate are saved. The learning rate is then increased, either linearly or exponentially, and the model is updated with this learning rate. This repeats until a very high (maximum) learning rate is reached.
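The procedure can be sketched without any deep learning framework. The toy example below is an illustration under simplifying assumptions, not the notebook's implementation: the "model" is a single weight `w`, the loss is `(w - 3)^2`, and the names `exponential_lrs` and `toy_range_test` are hypothetical.

```python
def exponential_lrs(start_lr, end_lr, num_iter):
    """Learning rates spaced geometrically from start_lr to end_lr."""
    ratio = (end_lr / start_lr) ** (1.0 / (num_iter - 1))
    return [start_lr * ratio ** i for i in range(num_iter)]

def toy_range_test(start_lr=1e-4, end_lr=10.0, num_iter=100):
    """Range test on loss(w) = (w - 3)^2 with one update per learning rate."""
    w = 0.0
    history = []  # (learning rate, loss) pairs, like the values you would log
    for lr in exponential_lrs(start_lr, end_lr, num_iter):
        loss = (w - 3.0) ** 2
        history.append((lr, loss))
        grad = 2.0 * (w - 3.0)  # derivative of the toy loss
        w -= lr * grad          # plain gradient descent step
        if loss > 1e6:          # stop once the loss has clearly exploded
            break
    return history

history = toy_range_test()
best_lr, best_loss = min(history, key=lambda pair: pair[1])
print(f"lowest loss {best_loss:.4g} at lr={best_lr:.4g}")
```

In a real range test each step would use a fresh mini-batch and the actual model, but the shape of the loop is the same: record the loss, update, increase the learning rate.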

In this notebook, you’ll find an implementation of this approach in PyTorch. I have implemented a class `LRFinder`. The method `range_test` holds the logic described above. Using `wandb.log()`, I was able to log the learning rate and corresponding loss.

```
if logwandb:
    wandb.log({'lr': lr_schedule.get_lr()[0], 'loss': loss})
```

Use this LRFinder to automatically find the optimal learning rate for your model.

```
lr_finder = LRFinder(net, optimizer, device)
lr_finder.range_test(trainloader, end_lr=10, num_iter=100, logwandb=True)
```

You can now head to your W&B run page and find the minimum of the LR curve. Use this as your learning rate and train on the entire training set.

When the learning rate is too low, the model is not able to learn anything and the loss plateaus. When the learning rate is just large enough, the model starts learning and you will find a sudden dip in the plot. The minimum of the curve is what you are looking for as the optimal learning rate. When the learning rate is too high, the loss explodes, i.e., there is a sudden jump in the loss.

If you are using Keras to build your model, you can make use of the learning rate finder as shown in this blog by PyImageSearch. You can also refer to this blog post for an implementation in TensorFlow 2.0.

There was a major problem 10 years ago in training a deep neural network due to the use of sigmoid/tanh activation functions. To understand this problem the reader is expected to have an understanding of feedforward and backpropagation algorithms along with gradient-based optimization. I recommend that you watch this video or read this blog for a better understanding of this problem.

In a nutshell, when backpropagation is performed, the gradient of the loss with respect to the weights of each layer is calculated, and it tends to get smaller as we keep moving backwards through the network. The gradient for each layer is computed using the chain rule of differentiation. Since the derivative of the sigmoid only ranges from 0 to 0.25, the computed gradient becomes really small, and thus negligible weight updates take place. Due to this problem, the model could not converge or would take a long time to do so.
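The shrinking effect is easy to check numerically. This sketch looks only at the activation part of the chain rule and ignores the weights, which also enter the product, so treat it as an illustration of the bound rather than a full backpropagation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x)), at most 0.25 (at x = 0)."""
    s = sigmoid(x)
    return s * (1.0 - s)

def max_activation_factor(num_layers):
    """Largest possible product of sigmoid derivatives across num_layers layers."""
    return sigmoid_grad(0.0) ** num_layers

for depth in (1, 5, 10, 20):
    print(depth, max_activation_factor(depth))
```

Even in the best case, ten sigmoid layers scale the gradient by less than one millionth, which is why the earliest layers barely move.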

Suppose you are building a not-so-traditional neural network architecture. The easiest way to debug such a network is to visualize the gradients. If you are building your network using PyTorch, W&B automatically plots gradients for each layer. Check out my notebook.

You can find two models, NetwithIssue and Net, in the notebook. The first model uses sigmoid as the activation function for each layer. The latter uses ReLU. The last layer in both models uses a softmax activation function.

```
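import torch.optim as optim
import wandb

# Net, device, trainloader, testloader, classes, train and test
# are defined earlier in the notebook.
net = Net().to(device)
optimizer = optim.Adam(net.parameters())
wandb.init(project='pytorchw_b')
# Log gradients and parameter histograms for every layer.
wandb.watch(net, log='all')

for epoch in range(10):
    train(net, device, trainloader, optimizer, epoch)
    test(net, device, testloader, classes)
print('Finished Training')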
```

W&B provides first-class support for PyTorch. To automatically log gradients and store the network topology, you can call `watch` and pass in your PyTorch model. If you want to log histograms of parameter values as well, you can pass the `log='all'` argument to the `watch` method.

In this run, the model was trained for 40 epochs on the MNIST handwritten digits dataset. It eventually converged with a train-test accuracy of over 80%. Notice that the gradient is zero for most of the epochs. To see the gradient plots below, click on a run in your project and then click on the Gradients section.

ReLUs aren’t a magic bullet since they can “die” when fed with values less than zero. A large chunk of the network might stop learning if most of the neurons die within a short period of training. In such a situation, take a closer look at your initial weights or add a small initial bias to your weights. If that doesn’t work, you can try to experiment with Maxout, Leaky ReLUs and ReLU6 as illustrated in the MobileNetV2 paper.
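The difference between a ReLU and a Leaky ReLU shows up in the gradient for negative inputs. Below is a minimal sketch in plain Python; the slope of 0.01 is a common default that is assumed here, and the input values are made up.

```python
def relu_grad(x):
    """Gradient of ReLU with respect to its input: zero for x <= 0."""
    return 1.0 if x > 0 else 0.0

def leaky_relu_grad(x, negative_slope=0.01):
    """Gradient of Leaky ReLU: a small slope instead of zero for x <= 0."""
    return 1.0 if x > 0 else negative_slope

pre_activations = [-3.0, -0.5, 0.2, 1.7]  # hypothetical inputs to one unit
print([relu_grad(x) for x in pre_activations])
print([leaky_relu_grad(x) for x in pre_activations])
```

A unit whose inputs are always negative receives no gradient through a ReLU, so its weights stop updating. The small negative slope keeps some signal flowing.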

The exploding gradient problem occurs when the gradient grows exponentially as we move backwards through the layers. Here the later layers learn slower than the initial layers, unlike the vanishing gradient problem, where the earlier layers learn slower than the later layers. In practice, when gradients explode, they can become NaN because of numerical overflow, or we might see irregular oscillations in the training loss curve. With vanishing gradients the weight updates are very small, while with exploding gradients the updates are so large that the local minimum is missed and the model does not converge. You can watch this video for a better understanding of this problem or go through this blog.
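The numerical side of this can be illustrated with a few lines of plain Python. The sketch below assumes every layer multiplies the gradient by the same factor, which is a simplification, and uses hypothetical helper names. It shows how quickly the scale grows and how a simple finiteness check can flag the overflow.

```python
import math

def backprop_scale(per_layer_factor, num_layers):
    """Scale of the gradient after passing backwards through num_layers layers."""
    scale = 1.0
    for _ in range(num_layers):
        scale *= per_layer_factor  # float multiplication overflows to inf
    return scale

def all_finite(values):
    """True if no value is NaN or infinite."""
    return all(math.isfinite(v) for v in values)

print(backprop_scale(100.0, 4))    # grows by a factor of 100 per layer
print(backprop_scale(100.0, 200))  # overflows the float range

grads = [backprop_scale(100.0, n) for n in (4, 200)]
print(all_finite(grads))           # a cheap check to run during training
```

Checking gradients for non-finite values after the backward pass is a quick way to catch this problem before the loss itself turns into NaN.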

Let’s try to visualize the gradients in case of the exploding gradients. Check out this notebook here where I intentionally initialized the weights with a big value of 100, such that they would explode.

```
class NetforExplode(nn.Module):
    def __init__(self):
        super(NetforExplode, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv1.weight.data.fill_(100)
        self.conv1.bias.data.fill_(-100)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.conv2.weight.data.fill_(100)
        self.conv2.bias.data.fill_(-100)
        self.fc1 = nn.Linear(9216, 128)
        self.fc1.weight.data.fill_(100)
        self.fc1.bias.data.fill_(-100)
        self.fc2 = nn.Linear(128, 10)
        self.fc2.weight.data.fill_(100)
        self.fc2.bias.data.fill_(-100)
```

Notice how the gradients in the plots below increase exponentially going backward. The gradient value for conv1 is on the order of 10^7, while that for conv2 is on the order of 10^5. Bad weight initialization can be one reason for this problem.

Weight initialization is one of the most important aspects of training a neural network. Problems like image classification, sentiment analysis or playing Go can’t be solved using deterministic algorithms. You need a non-deterministic algorithm, one that makes careful use of randomness when making decisions during its execution. Artificial neural networks are trained using a stochastic optimization algorithm called stochastic gradient descent. Training a neural network is simply a non-deterministic search for a ‘good’ solution.

As the search process (training) unfolds, there is a risk that we get stuck in an unfavorable area of the search space. The idea of getting stuck and returning a ‘less-good’ solution is called getting stuck in a local optimum. At times, vanishing/exploding gradients prevent the network from learning. To counter this, weight initialization is one method of introducing careful randomness into the search problem. This randomness is introduced at the beginning. Using mini-batches for training with `shuffle=True` is another method of introducing randomness as the search progresses. For more clarity on the underlying concept, check out this blog.

A good initialization has many benefits. It helps gradient-based optimization algorithms reach the global minimum (though it is just one piece of the puzzle). It helps prevent vanishing/exploding gradient problems. It can speed up training as well. This blog explains the basic idea behind weight initialization well.

The choice of your initialization method depends on your activation function. To learn more about initialization check out this article.

- When using ReLU or Leaky ReLU, use He initialization, also called Kaiming initialization.
- When using SELU or ELU, use LeCun initialization.
- When using softmax or tanh, use Glorot initialization, also called Xavier initialization.

Most initialization methods come in uniform and normal distribution flavors. Check out this PyTorch doc for more info.
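The schemes listed above differ mainly in how they scale the weights from the layer's fan-in and fan-out. As a rough sketch, these are the commonly cited formulas, written in plain Python with illustrative function names and assuming the ReLU gain of sqrt(2) for He initialization.

```python
import math

def he_normal_std(fan_in):
    """Standard deviation for He (Kaiming) normal initialization with ReLU."""
    return math.sqrt(2.0 / fan_in)

def he_uniform_bound(fan_in):
    """Weights are drawn from U(-bound, bound) for He uniform with ReLU."""
    return math.sqrt(6.0 / fan_in)

def glorot_normal_std(fan_in, fan_out):
    """Standard deviation for Glorot (Xavier) normal initialization."""
    return math.sqrt(2.0 / (fan_in + fan_out))

def lecun_normal_std(fan_in):
    """Standard deviation for LeCun normal initialization."""
    return math.sqrt(1.0 / fan_in)

# A 3x3 convolution with 32 input channels and 64 output channels.
fan_in = 32 * 3 * 3
fan_out = 64 * 3 * 3

print(he_normal_std(fan_in))
print(he_uniform_bound(fan_in))
print(glorot_normal_std(fan_in, fan_out))
print(lecun_normal_std(fan_in))
```

The uniform and normal flavors of one scheme target the same variance: a uniform distribution on (-b, b) has a standard deviation of b / sqrt(3), which is why the He uniform bound is sqrt(3) times the He normal standard deviation.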

Check out my notebook to see how you can initialize weights in PyTorch.

```
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 32, 3, 1)
        torch.nn.init.kaiming_uniform_(self.conv1.weight, mode='fan_in', nonlinearity='relu')
        self.conv2 = nn.Conv2d(32, 32, 3, 1)
        torch.nn.init.kaiming_uniform_(self.conv2.weight, mode='fan_in', nonlinearity='relu')
        self.conv3 = nn.Conv2d(32, 64, 3, 1)
        torch.nn.init.kaiming_uniform_(self.conv3.weight, mode='fan_in', nonlinearity='relu')
        self.conv4 = nn.Conv2d(64, 64, 3, 1)
        torch.nn.init.kaiming_uniform_(self.conv4.weight, mode='fan_in', nonlinearity='relu')
        self.pool1 = torch.nn.MaxPool2d(2)
        self.pool2 = torch.nn.MaxPool2d(2)
        self.fc1 = nn.Linear(1600, 512)
        self.fc2 = nn.Linear(512, 128)
        self.fc3 = nn.Linear(128, 10)
```

Notice how the layers were initialized with kaiming_uniform. You’ll notice this model overfits. By simplifying the model you can easily overcome this problem.

Dropout is a regularization technique that “drops out” or “deactivates” a few neurons in the neural network at random to reduce overfitting. During training, some neurons in the layer after which dropout is applied are “turned off”. An ensemble of neural networks with fewer parameters (simpler models) reduces overfitting. Dropout simulates this phenomenon, in contrast to snapshot ensembles of networks, without the additional computational expense of training and maintaining multiple models. It introduces noise into a neural network to force it to learn to generalize well enough to deal with noise.
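To make the mechanism concrete, here is a minimal sketch of "inverted" dropout in plain Python. It is an illustration and not a replacement for a framework layer: values are zeroed with probability `p` during training, the survivors are rescaled so the expected activation stays the same, and nothing happens at evaluation time. The function assumes `0 <= p < 1`.

```python
import random

def dropout(values, p=0.5, training=True, rng=random):
    """Inverted dropout for a list of activations, assuming 0 <= p < 1.

    While training, each value is zeroed with probability p and the
    survivors are scaled by 1 / (1 - p). At evaluation time the input
    is returned unchanged.
    """
    if not training or p == 0.0:
        return list(values)
    keep_scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * keep_scale for v in values]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
activations = [1.0, 2.0, 3.0, 4.0]

print(dropout(activations, p=0.5, training=True, rng=rng))
print(dropout(activations, training=False))
```

Because a different random subset of neurons is active at every step, each update effectively trains a slightly different, smaller network.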

Let’s implement dropout and see how it affects model performance. Check out my notebook to see how one can use Batch Normalization and Dropout in PyTorch. I started with a base model to set the benchmark for this study. The implemented architecture is simple and results in overfitting.

Notice how in the plot below for the run `base_model`, the test loss eventually increases. I then applied Dropout layers with a drop rate of 0.5 after the Conv blocks. To create this layer in PyTorch, simply instantiate the `Dropout` class of `torch.nn`.

```
self.drop = torch.nn.Dropout()
```

Dropout prevented overfitting (look for the `dropout_model` run in the chart below), but the model didn’t converge quickly as expected. This means that ensemble networks take longer to learn. In the context of dropout, not every neuron is available while learning.

Batch Normalization is a technique to improve optimization. It’s good practice to normalize the input data before training on it, which prevents the learning algorithm from oscillating. The output of one layer is the input to the next layer. If this output is normalized before being used as the input, the learning process can be stabilized. This dramatically reduces the number of training epochs required to train deep networks. Batch Normalization makes normalization a part of the model architecture and is performed on mini-batches while training. Batch Normalization also allows the use of much higher learning rates and lets us be less careful about initialization.
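The core computation is small enough to write out by hand. The sketch below normalizes one feature across a mini-batch in plain Python, using the training-time batch statistics only. The scale `gamma` and shift `beta` are the learnable parameters, and the `eps` default is an assumed value for numerical safety.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch, then scale and shift it."""
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)  # biased variance
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]

batch = [10.0, 12.0, 14.0, 16.0]  # hypothetical activations of one feature
normalized = batch_norm(batch)

out_mean = sum(normalized) / len(normalized)
out_var = sum((x - out_mean) ** 2 for x in normalized) / len(normalized)
print(normalized)
print(out_mean, out_var)  # close to 0 and 1
```

Framework layers additionally keep running estimates of the mean and variance, which are used in place of the batch statistics at evaluation time.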

To create this layer in PyTorch, simply instantiate the `BatchNorm2d` class of `torch.nn`.

```
self.bn = torch.nn.BatchNorm2d(32)
```

Batch Normalization took fewer steps to converge the model (look for the run `batch_norm` in the plot below). Since the model was simple, overfitting could not be avoided.

Now let’s use both these layers together. If you are using BN and Dropout together follow this order (for more insight check out this paper).

Notice in the run `bn_drop` below that, by using both Dropout and Batch Normalization, overfitting was eliminated while the model converged quicker.

When you have a large dataset, it’s important to optimize well, and not as important to regularize well, so batch normalization is more important for large datasets. You can of course use both batch normalization and dropout at the same time, though Batch Normalization also acts as a regularizer, in some cases eliminating the need for Dropout.

- The article 'Checklist for debugging neural networks' would be a good next step.
- Unit testing neural networks is not easy. This article discusses how to unit test machine learning code.
- I highly recommend reading why are deep neural networks hard to train?
- For a more in depth explanation on gradient clipping check out how to avoid exploding gradients in neural networks with gradient clipping?
- The effects of weight initialization on neural nets by Sayak Paul, where he discusses in depth the different effects of weight initialization.

I hope that this blog will be helpful for everyone in the Machine Learning community. I’ve tried to share some insights of my own and lots of good reading material for a deeper understanding of the topics. The most important aspect of debugging a neural network is tracking your experiments so you can reproduce them later. Weights & Biases is really handy when it comes to tracking your experiments. With all the latest ways to visualize your experiments, it’s getting easier day by day.

I would like to thank Lavanya for the opportunity. I learned a lot in the process. Thank you Sayak Paul for the constant mentoring.