Exploring ResNets With W&B

Ayush Chaurasia, Contributor


In this tutorial, we’ll walk through the implementation of residual neural networks, their underlying architecture, and the theory behind skip connections. Residual networks, also known as ResNets, are one of the most widely used architectures in computer vision, and skip connections appear in almost all state-of-the-art models for object detection and image processing.

If you’d like to reproduce this analysis, check out this Colab notebook. You can access a live dashboard to explore this analysis in more detail. Finally, here’s a video tutorial on the ResNet research paper and implementation.

What is a Residual Network, really?

A residual neural network (ResNet) is an artificial neural network (ANN) that builds on constructs known from pyramidal cells in the cerebral cortex. Residual neural networks utilize skip connections, or shortcuts, to jump over some layers. Typical ResNet models are implemented with double- or triple-layer skips that contain nonlinearities (ReLU) and batch normalization in between.

When deep learning initially took off, it was clear that deeper models could approximate more complex functions, and researchers started experimenting with various deep models. Although these models set some historical benchmarks, they did not follow any pattern in their layer dimensions (AlexNet, for example). Various architectures were then proposed that gradually increased the number of channels by choosing appropriate filters. But it was noticed that deeper models didn’t perform as expected, due to one of the most notorious drawbacks of deep neural networks: vanishing gradients.

Vanishing Gradients

Vanishing gradients occur when the derivatives w.r.t. the layers at the beginning of the model shrink to values very close to zero. When this happens, the weight updates for those layers become almost insignificant, so these layers contribute far less to the learning process than they are capable of. Deeper networks can certainly learn to approximate more complex functions, but first the vanishing gradient problem must be overcome. This can be achieved with skip connections.
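A quick toy experiment can make this concrete. The sketch below (my own illustrative example, not from the original code) builds plain sigmoid MLPs of different depths and compares the gradient norm of the first layer after one backward pass:

```python
import torch
import torch.nn as nn

def first_layer_grad_norm(depth):
    """Build a plain `depth`-layer sigmoid MLP and return the gradient
    norm of its FIRST layer after a single backward pass."""
    torch.manual_seed(0)
    layers = []
    for _ in range(depth):
        layers += [nn.Linear(32, 32), nn.Sigmoid()]
    net = nn.Sequential(*layers)
    x = torch.randn(8, 32)
    net(x).sum().backward()
    return net[0].weight.grad.norm().item()

shallow = first_layer_grad_norm(3)
deep = first_layer_grad_norm(30)
# The deep network's first-layer gradient is many orders of magnitude
# smaller: each sigmoid layer multiplies the gradient by a factor < 1.
print(shallow, deep)
```

Because the sigmoid’s derivative is at most 0.25, every extra layer shrinks the gradient that reaches the early layers, which is exactly the effect skip connections are designed to counteract.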

Skip Connections

A skip connection adds the activations of one layer directly to the activations of another layer that is “deeper” in the network.

[Diagram: the activation of layer ‘l-2’ skips ahead and is added to the activation of layer ‘l’]

As represented in the diagram, the activation of layer ‘l-2’ is added directly to the activation of layer ‘l’, which is deeper in the network. This process continues throughout the network, pushing activations deep into it. It helps overcome vanishing gradients, since the activations of layers near the beginning of the network are added directly to the deeper layers.
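In code, a skip connection is just an addition in the forward pass. Here is a minimal sketch (using linear layers for brevity; real ResNet blocks use convolutions, as described below):

```python
import torch
import torch.nn as nn

class SkipBlock(nn.Module):
    """A minimal residual block: the input x is added to the output of
    two stacked layers, so the activation 'skips' over them."""
    def __init__(self, channels):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels)
        self.fc2 = nn.Linear(channels, channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.fc1(x))
        out = self.fc2(out)
        return self.relu(out + x)  # the skip connection: F(x) + x

block = SkipBlock(16)
x = torch.randn(4, 16)
y = block(x)
print(y.shape)  # torch.Size([4, 16])
```

During backpropagation, the `+ x` term gives the gradient a direct path back to earlier layers, bypassing the two stacked layers.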

Network-In-Network Block

All the different ResNet architectures are built by repeating smaller blocks. Two types of blocks are involved in the implementation of ResNets: a base block and a bottleneck block.

The base block consists of 2 convolution layers with their corresponding activations and batch normalization. The bottleneck block consists of 3 convolution layers, one of which is a 1x1 convolution, also called a network-in-network block. By using a network-in-network convolution, the number of operations required to produce an output of the desired dimension is reduced significantly. It is often used to reduce the number of depth channels, since multiplying volumes with extremely large depths is very slow. This is why 1x1 convolutions have gained huge popularity in deep networks. A well-known Stack Overflow answer neatly demonstrates the impact of 1x1 convolutions on the operation count.
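To see the savings, here is a rough multiply count for one bottleneck-style substitution (the layer sizes are illustrative numbers of my choosing, not from the original post):

```python
# Rough multiply counts for producing a 28x28x256 output from a
# 28x28x256 input, with and without a 1x1 "network-in-network" bottleneck.
# For a conv layer: H * W * out_channels * (kernel_h * kernel_w * in_channels).
H = W = 28

# Direct 3x3 convolution: 256 -> 256 channels.
direct = H * W * 256 * (3 * 3 * 256)

# Bottleneck: 1x1 reduce (256 -> 64), 3x3 conv (64 -> 64), 1x1 expand (64 -> 256).
reduce_ = H * W * 64 * (1 * 1 * 256)
conv3x3 = H * W * 64 * (3 * 3 * 64)
expand = H * W * 256 * (1 * 1 * 64)
bottleneck = reduce_ + conv3x3 + expand

print(direct // bottleneck)  # 8 -- roughly 8x fewer multiplies
```

Squeezing the channel dimension with a 1x1 convolution before the expensive 3x3 convolution, then expanding it back, is where the bottleneck block gets its efficiency.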

It’s Implementation Time!

We’ll employ all of the previous techniques to build our own implementation of ResNets from scratch in PyTorch. Although I won’t cover the code line by line in this tutorial, I’ll share the link to the implementation and walkthrough: here’s the link to the code, and I’ve covered the research paper and the implementation here. Now, let’s get into the bits that are important for our discussion. In the code, we have a class called `resnet` whose first parameter is the architecture to use as the building block: either the base block or the bottleneck block. The second parameter is a list of four numbers, representing how many of the chosen blocks to use at each channel stage (there are 4 stages, where the channel dimension is 64, 128, 256, and 512 respectively). The third parameter is the number of classes the ResNet needs to classify. To make things more concrete, let’s look at an example.

Implementing ResNet-18

To implement ResNet-18, we’ll use 2 base blocks at each of the four stages. Each base block consists of 2 convolutional layers. We’ll also add a convolutional layer at the beginning and a fully connected layer at the end, bringing the total number of layers to 18, hence the name ResNet-18. We’ll train this network on the CIFAR-10 dataset.
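Since the full implementation lives in the linked repository, here is a compact sketch of the same idea, assuming the structure described above (class and parameter names here are illustrative, not the ones in the linked code):

```python
import torch
import torch.nn as nn

class BaseBlock(nn.Module):
    """Two 3x3 convolutions with batch norm, plus a skip connection."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        # When the shape changes, project the input so it can still be added.
        self.shortcut = nn.Sequential()
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + self.shortcut(x))

class ResNet(nn.Module):
    """block: the building-block class; num_blocks: blocks per stage."""
    def __init__(self, block, num_blocks, num_classes=10):
        super().__init__()
        self.in_ch = 64
        self.conv1 = nn.Conv2d(3, 64, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(64)
        # The four stages, with channel dimensions 64, 128, 256, 512.
        self.layer1 = self._make_stage(block, 64, num_blocks[0], 1)
        self.layer2 = self._make_stage(block, 128, num_blocks[1], 2)
        self.layer3 = self._make_stage(block, 256, num_blocks[2], 2)
        self.layer4 = self._make_stage(block, 512, num_blocks[3], 2)
        self.fc = nn.Linear(512, num_classes)

    def _make_stage(self, block, out_ch, n, stride):
        strides = [stride] + [1] * (n - 1)
        layers = []
        for s in strides:
            layers.append(block(self.in_ch, out_ch, s))
            self.in_ch = out_ch
        return nn.Sequential(*layers)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.layer4(self.layer3(self.layer2(self.layer1(out))))
        out = nn.functional.adaptive_avg_pool2d(out, 1).flatten(1)
        return self.fc(out)

# ResNet-18: 1 initial conv + 4 stages * 2 blocks * 2 convs + 1 fc = 18 layers.
model = ResNet(BaseBlock, [2, 2, 2, 2], num_classes=10)
x = torch.randn(2, 3, 32, 32)  # a CIFAR-10-sized batch
logits = model(x)
print(logits.shape)  # torch.Size([2, 10])
```

Note how the same `ResNet` class can build deeper variants just by swapping the block type and the per-stage counts.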

Optimizing ResNet-18

We’ll use the Adam optimizer to optimize our ResNet-18 implementation, trying different learning rates and comparing their performance with the wandb library.

We’ll log the training loss and test accuracy of the network and then compare them. `wandb.log()` is used to log a value:

Let’s have a look at the performance of this model with various learning rates. The following graph was generated with the W&B dashboard.

Looking at the graph, it’s pretty clear that the loss is lowest, for the given number of epochs, when using a learning rate of 0.001. But loss doesn’t always capture the whole picture, so let’s have a look at the test accuracy.

This visualization confirms our previous deduction: the learning rate of 0.001 yields the best accuracy.

NOTE: The results might vary if you increase the number of epochs, change the optimizer or customize the batch size.

Moving on, let’s see how a more complex model, ResNet-50, performs on this dataset, keeping the same number of epochs. Keep in mind that more complex models generally take more epochs to train.

Implementing ResNet-50

ResNet-50 is implemented using bottleneck blocks: 3, 4, 6, and 3 of them in the four stages of the network, each containing 3 convolutional layers. As with ResNet-18, we’ll again add a convolutional layer at the beginning and a fully connected layer at the end. This brings the total number of layers to 50, hence the name ResNet-50.

We’ll use `wandb.watch()` to log the model’s parameters and gradients, so we can study them later if required.

Now, let’s move on to the optimization of the model.

Optimizing ResNet-50

We’ll again use the Adam optimizer to optimize our ResNet-50 implementation, trying different learning rates and comparing their performance with the wandb library.

Here’s a visualization of the loss, generated with the W&B dashboard.

As seen in the graph, the performance of the model with learning rates 0.001 and 0.0001 is almost the same, so we’ll need more metrics to choose a winner.

Let’s have a look at the test accuracy metric.

This visualization offers a clear distinction between the models with different learning rates. Here, the model with the minimum loss isn’t the most accurate one, which happens quite often.

Let’s now compare the performance of the 2 models.

ResNet-18 Vs ResNet-50

Let’s choose the best-performing configuration of both ResNet-18 and ResNet-50 and compare them. Following is a comparison of the networks in terms of loss minimization.

This visualization hints that ResNet-18 is better at minimizing the loss, but that won’t always be the case. Notice how the 2 graphs are almost identical, the only difference being the starting point. This means the random weight initialization has benefited ResNet-18 over ResNet-50; in a different run, ResNet-50 might perform better than ResNet-18. So, let’s look at the test accuracy graph.

This visualization makes things more concrete. Although both models end up with almost the same accuracy at the final point in the graph, there’s still something we need to address. The number of epochs is kept the same in each run for the sake of the experiment, but as mentioned previously, complex models need more epochs to train, and the layer difference between ResNet-18 and ResNet-50 is quite large. Thus, if ResNet-50 performs similarly to ResNet-18 with the same number of epochs, there is a good chance it will outperform ResNet-18 when the number of epochs is increased.

Hyperparameter Sweeps

We have manually tuned our model to select the best one under these circumstances, but in real-world settings, hyperparameter tuning is a tiring, time-consuming, yet crucial part of building a model. W&B provides a way to automatically tune the hyperparameters of your choice. This is where “parameter sweeps” come in. With parameter sweeps, you just specify the parameters you want to tune, along with their possible values, and wandb automatically runs and records the performance of all the combinations.

Let’s perform a parameter sweep for our ResNet-50 model directly on Colab.

The Sweep Configuration File

To perform a parameter sweep, you need a sweep configuration: a dictionary that contains the parameters you want to tune, along with their possible values.

Here’s the configuration dictionary that we’ll use to sweep the parameters.
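A configuration in this shape might look like the following sketch (the metric and parameter names are illustrative; they must match what your training code logs):

```python
# A sweep configuration is just a nested dictionary.
sweep_config = {
    "method": "random",  # randomly sample hyperparameter combinations
    "metric": {"name": "test accuracy", "goal": "maximize"},
    "parameters": {
        "learning_rate": {"values": [0.01, 0.001, 0.0001]},
        "optimizer": {"values": ["adam", "sgd"]},
        "epochs": {"value": 10},  # fixed here, but could be swept too
    },
}
```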

You can experiment with as many parameters as you want, but since ResNet-50 takes a long time to train, I’m sweeping only the learning rate and the optimizer. The number of epochs is set to 10, although it could also be used as a hyperparameter.

The method parameter defines how the search for the best hyperparameter values is performed. Here, we set it to ‘random’, meaning combinations are sampled at random. The metric parameter specifies the metric you want to minimize or maximize.

The next step is to create a sweep ID, which can be done as follows:

The sweep() function takes the configuration dictionary and the name of the project. Other optional parameters can be found on the documentation page.

The next step is to provide default values for the parameters we want to sweep and pass them as the config to the `init` function. The parameters mentioned here will be updated automatically on each run.

Now, we are ready to run the sweep.

The command to run the sweep is `wandb.agent()`, which takes the sweep ID we generated and the function that encapsulates the training code.

Parameter Sweeps Results

Now that the parameter sweep is done and the visualizations and other details have been saved to the W&B dashboard, let’s compare the runs with various hyperparameter values. Let’s first look at the training loss.

All of these runs were performed automatically with a single command; that’s the utility of parameter sweeps. Now we just have to compare them to find the best hyperparameter combination.

Now, let’s look at the visualization of test accuracy generated using the parameter sweeps.

As is pretty clear from the graph, the model with a learning rate of 0.001 and the Adam optimizer performs the best of the six settings.

Parameter Importance

W&B recently introduced a new visualization called parameter importance, which shows the importance of each parameter and its correlation with the metric being optimized.

Here’s the visualization of parameter importance for training loss.

Let’s also visualize the parameter importance for the test accuracy.

This brings us to the end of this tutorial. Go ahead and try to further customize the hyperparameters and improve the model. Happy coding!

