
CNN - Training and Fine tuning of pre-trained model

Learn how to use CNNs: train from scratch, fine-tune a pre-trained model, or use a pre-trained model as-is.


Part A: Training from scratch

Question 1

(a) Total number of computations done by our network

In this part, we built a small CNN model consisting of 5 convolution layers. Each convolution layer is followed by a ReLU activation and a max-pooling layer. After 5 such conv-ReLU-maxpool blocks we have one dense layer followed by the output layer containing 10 neurons (one for each of the 10 classes). The input layer is compatible with the images in the iNaturalist dataset.
Definition:
The number of channels is denoted by n
The number of filters in layer i is denoted by K_i
The dimensions of the input image are denoted by I
Let F_i be the spatial extent of every filter in layer i

Assumption:

  1. As the ReLU activation function only compares the output of every neuron in the feature map, we do not consider it a computation.
  2. The 2x2 max-pooling layer also only compares the outputs of neurons in the feature map, so it is not counted as a computation either.
In our architecture,
The dimensions of our input image are 274 x 274
The number of filters in each layer is 64
The spatial extent of every filter is 3
The number of channels used is 3
Stride, S = 1
Padding, P = 0
In this architecture we use 2-dimensional convolution layers, so each filter moves along the height and width of the input. For each of the 64 filters, 3 per-channel feature maps are produced, one for each of the 3 channels (the RGB planes), so the total number of per-channel feature maps is 64 * 3 = 192. The spatial size of a convolution layer's output is (I - F + 2P)/S + 1, so the first layer produces feature maps of size (274 - 3)/1 + 1 = 272, i.e. 272 x 272 = 73984 elements. For each output element of one filter on one channel we count 3x3 multiplications and 3x3 - 1 additions, i.e. 17 operations; this gives the factors (3x3+3x3-1) and (64x3) in the per-layer calculations below.
Layer 1 Calculation : (272 x 272) x (3x3+3x3-1) x (64x3) = 241483776
Layer 2 Calculation : (134x134) x (3x3+3x3-1) x (64x3) = 58608384
Layer 3 Calculation : (65x65) x (3x3+3x3-1) x (64x3) = 13790400
Layer 4 Calculation : (30x30) x (3x3+3x3-1) x (64x3) = 2937600
Layer 5 Calculation : (13x13) x (3x3+3x3-1) x (64x3) = 551616
The last two layers are dense layers. The first dense layer has 1024 neurons, and its input is the flattened feature vector of dimension 2304 coming from the final max-pooling layer. Since there are 1024 neurons, there are also 1024 biases, one per neuron in that layer.
The pre-activation of the first dense layer therefore requires (1024 * 2304) + 1024 = 2360320 computations, so the total number of computations for this layer is 2360320.
The final layer has 10 neurons and receives 1024 inputs from the previous layer, so its pre-activation requires 10 * 1024 + 10 = 10250 computations. The activation of the output layer produces 10 values, which adds 10 more calculations, so the total number of computations for the final layer is 10250 + 10 = 10260.
The total number of computations in the first dense layer is 2360320 and in the second dense layer 10260.
The total number of computations in the fully connected dense layers is therefore 2370580.
Hence the total number of computations done by the network is: 2370580 + 551616 + 2937600 + 13790400 + 58608384 + 241483776 = 319742356
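For reference, a short Python sketch that reproduces the per-layer counts above under the same counting convention (ReLU and max pooling ignored, every layer counted with 64 filters over 3 channels, exactly as in the calculation):

```python
# Sketch: reproduce the computation counts above under the report's
# convention (one filter on one channel costs F*F multiplications and
# F*F - 1 additions per output element; ReLU and max pooling are ignored).
def conv_output_size(i, f, s=1, p=0):
    return (i - f + 2 * p) // s + 1

size, F, filters, channels = 274, 3, 64, 3
total = 0
for layer in range(1, 6):
    out = conv_output_size(size, F)                  # spatial size after the conv
    ops = out * out * (F * F + F * F - 1) * filters * channels
    print(f"Layer {layer}: {out}x{out} -> {ops} computations")
    total += ops
    size = out // 2                                  # 2x2 max pooling halves the size

# Dense layers: weights * inputs + biases (+ 10 output activations for the last layer)
dense1 = 1024 * 2304 + 1024
dense2 = 10 * 1024 + 10 + 10
total += dense1 + dense2
print("Total:", total)                               # 319742356
```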

(b) Total number of parameters in our network

Total number of parameters in the network is : 2521354.
Here the parameters denote the weights and biases that are learned while training the model; their number depends on the architecture of the model.
Note: The code is flexible, so the number of filters, the size of the filters and the activation function in each layer can be changed, as can the number of neurons in the dense layer.
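As a rough illustration of that flexibility, a minimal Keras sketch of such a configurable model builder might look like the following (layer names and defaults are our assumptions, not the original code); `model.summary()` reports the total parameter count for a given configuration:

```python
# A minimal Keras sketch of a configurable 5-block CNN as described above.
from tensorflow.keras import layers, models

def build_cnn(input_shape=(274, 274, 3), num_filters=64, filter_size=3,
              activation="relu", dense_neurons=1024, num_classes=10):
    model = models.Sequential()
    model.add(layers.Input(shape=input_shape))
    for _ in range(5):                       # five conv-activation-maxpool blocks
        model.add(layers.Conv2D(num_filters, filter_size, activation=activation))
        model.add(layers.MaxPooling2D(pool_size=(2, 2)))
    model.add(layers.Flatten())
    model.add(layers.Dense(dense_neurons, activation=activation))
    model.add(layers.Dense(num_classes, activation="softmax"))
    return model

build_cnn().summary()   # prints layer shapes and the total parameter count
```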

Question 2

We have trained our model using the iNaturalist dataset. 10% of the training data is kept for hyperparameter tuning. Each class is equally represented in the validation data.

Hyper parameter Configurations

The following hyperparameters are used in the wandb sweep configurations to find the best configuration for our model.
  1. Drop out - Dropout is a technique used to prevent a model from overfitting. Dropout works by randomly setting the outgoing edges of hidden units (neurons that make up hidden layers) to 0 at each update of the training phase.
  2. Batch Normalization - Batch normalization (also known as batch norm) is a method used to make artificial neural networks faster and more stable through normalization of the layers' inputs by re-centering and re-scaling.
  3. Size of each filter - If a large region of pixels is needed for the network to recognise an object, bigger filters are appropriate; if the objects are small or the features are local, smaller filters (relative to the input image size) work better.
  4. Data Augmentation - In this network we preprocess the input images in the same way the ImageNet dataset is preprocessed, which gave better and faster results.
  5. Filter organisation - 'same' means keeping the number of filters fixed across the layers; 'double' means doubling the number of filters from one layer to the next, so each convolution layer has twice as many filters as the previous one.
  6. Number of filters - The number of filters is the number of neurons in a convolution layer, since each filter performs a different convolution on the layer's input.
  7. Epochs - The number of epochs plays an important role in CNN modelling, as this value is key to finding the model that represents the sample with the least error. Both the number of epochs and the batch size have to be specified before training the neural network.
  8. Dense neurons - A dense layer is a simple layer of neurons in which each neuron receives input from all the neurons of the previous layer, hence the name. The dense layers are used to classify the image based on the output of the convolutional layers.

Chosen Hyperparameters

  1. Filter size (c1, c2, c3, c4, c5) - 4, 3
  2. Batch size : 32, 64
  3. Number of filters - 32, 64
  4. Filter Organization - Same, Double
  5. Data Augmentation - True or False
  6. Batch Normalization - True or False
  7. Dense Neurons - 1024, 512, 4096
  8. Dropout fraction - 0.3, 0.2
  9. Optimizer - rmsprop, adam
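A minimal sketch of how these choices could be expressed as a wandb sweep configuration is shown below (the parameter names and the `bayes` search method are illustrative assumptions, not our exact code):

```python
# Sketch of a wandb sweep over the hyperparameters listed above.
import wandb

sweep_config = {
    "method": "bayes",
    "metric": {"name": "val_accuracy", "goal": "maximize"},
    "parameters": {
        "filter_size": {"values": [3, 4]},
        "batch_size": {"values": [32, 64]},
        "num_filters": {"values": [32, 64]},
        "filter_organisation": {"values": ["same", "double"]},
        "data_augmentation": {"values": [True, False]},
        "batch_normalization": {"values": [True, False]},
        "dense_neurons": {"values": [512, 1024, 4096]},
        "dropout": {"values": [0.2, 0.3]},
        "optimizer": {"values": ["rmsprop", "adam"]},
    },
}

sweep_id = wandb.sweep(sweep_config, project="CNN_sweep")
# wandb.agent(sweep_id, function=train)   # train() builds and fits the model
```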

Strategies to reduce the number of runs while still achieving high accuracy

Early stopping

Early stopping is one of the best strategies to reduce the number of runs while still achieving high accuracy: training is stopped when a monitored metric has stopped improving. There are other ways to reduce run time, such as reducing the image dimensions, adjusting the number of max-pooling layers, including dropout, convolution and batch normalization layers appropriately, and using GPUs to accelerate the computation. However, these methods do not guarantee a reduction in the number of runs; they only reduce the run time. There are many more strategies to reduce the number of runs while achieving higher accuracy, but below we describe what we implemented.
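As an illustration, a minimal Keras early-stopping setup might look like the following (the monitored metric and patience value are illustrative assumptions):

```python
# Sketch: stop training when the validation accuracy stops improving.
from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor="val_accuracy", patience=3,
                           restore_best_weights=True)
# model.fit(train_data, validation_data=val_data, epochs=30,
#           callbacks=[early_stop])
```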
We implemented the following strategy in our code to reduce the runtime.
  1. Initially we started the wandb sweep with the above set of hyperparameters and a low number of epochs.
  2. We then added dropout and tested the model.
  3. This was followed by batch normalization in the subsequent sweeps.
  4. At this point we were able to fix the filter size of the CNN.
  5. Finally, we tuned the filter organisation, which determines the number of filters in each layer.

The importance of data augmentation, batch normalization and dropout is discussed below.

Data augmentation is also performed to obtain a better model. Data augmentation is a strategy that enables practitioners to significantly increase the diversity of the data available for training models without actually collecting new data. Techniques such as cropping, padding and horizontal flipping are commonly used to train large neural networks. However, the results were not as promising as we expected, and the training time also increased.
So we tried something different: the preprocessing function from Keras, which transforms the images during training in the same way as the ImageNet dataset.
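A minimal sketch of this kind of input pipeline, combining ImageNet-style preprocessing with light augmentation, is shown below (the generator arguments and paths are illustrative, not our exact settings):

```python
# Sketch: ImageNet-style preprocessing plus light augmentation with Keras.
from tensorflow.keras.applications.imagenet_utils import preprocess_input
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_gen = ImageDataGenerator(
    preprocessing_function=preprocess_input,  # same preprocessing as ImageNet
    horizontal_flip=True,
    zoom_range=0.1,
)
# train_data = train_gen.flow_from_directory("inaturalist/train",
#                                            target_size=(274, 274),
#                                            batch_size=32)
```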
With this Keras preprocessing, 64 filters gave better results, and keeping the same number of filters in each layer worked better than doubling. A good value for dropout in a hidden layer is between 0.2 and 0.5.

Where are the dropouts added?

In the Keras library, we can add dropout after any hidden layer and specify a dropout rate, which determines the fraction of neurons in the preceding layer that are disabled. Usually, dropout is placed on the fully connected layers only, because they have the largest number of parameters and are therefore the most likely to co-adapt excessively and cause overfitting.
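A minimal sketch of this placement (the input shape and dropout rate are illustrative assumptions):

```python
# Sketch: dropout placed after the fully connected layer, as described above.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(6, 6, 64)),        # assumed shape of the last conv block's output
    layers.Flatten(),
    layers.Dense(1024, activation="relu"),
    layers.Dropout(0.2),                   # disables 20% of the dense layer's outputs
    layers.Dense(10, activation="softmax"),
])
```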

Why is batch normalization needed?

  1. It reduces internal covariate shift.
  2. It reduces the dependence of gradients on the scale of the parameters or their initial values.
  3. It regularizes the model and reduces the need for dropout, photometric distortions, local response normalization and other regularization techniques.
The below plots are generated automatically by using the Sweep configuration in wandb.


Accuracy vs Created plot and Correlation Summary Table


[wandb plot: Sweep CNN_sweep 1 (32 runs)]


Parallel Coordinates Plot


[wandb plot: Sweep CNN_sweep 1 (32 runs), Sweep CNN_sweep 2 (14 runs)]





Question 3

In our training, we saw the maximum accuracy when we used 1024 dense neurons. Using more filters also helps: the higher the number of filters, the more abstractions the network is able to extract from the image data. However, if the number of filters is too high, we also pick up noise from the input images, so the number of filters should be chosen wisely. This can be seen in the plot of validation accuracy vs. created above.
A simple way to prevent overfitting is to use dropout. Without dropout, the model starts memorizing the dataset, leading to a large rise in test error despite a very low training error; in other words, the model does not generalize well without dropout.
Batch normalization is a layer that allows every layer of the network to learn more independently. It reduces internal covariate shift by re-centering and re-scaling the output of the previous layer. Adding batch normalization improved our model.
A dropout rate of 0.5 to 0.8 is often recommended, but since we use RandomCrop from Keras, dropout rates between 0.3 and 0.5 work well in our model.



Importance of Batch Normalization

When all other parameters remain the same, we see lower accuracy without batch normalization and higher accuracy with it. This can be seen in the following figure.



[wandb plot: Sweep CNN_sweep 1 (2 runs), Sweep CNN_sweep 2 (14 runs)]




Importance of Data Augmentation

In all of the cases shown in the plot below, batch normalization is set to false in order to isolate the effect of data augmentation. We can see that data augmentation has less significance here; this is because we already use RandomCrop, which is why additional augmentation matters less.



[wandb plot: Sweep CNN_sweep 1 (3 runs), Sweep CNN_sweep 2 (14 runs)]




Importance of how the filters are set up

The 'same' filter organisation produced higher accuracy than the 'double' organisation. This can be seen in the plot below: both runs with the double organisation have lower accuracy than the runs with the same organisation.



[wandb plot: Sweep CNN_sweep 1 (4 runs), Sweep CNN_sweep 2 (14 runs)]



Importance of Dropout layer

We have already seen that data augmentation has less significance. Keeping that in mind, and keeping all other parameters constant, a dropout rate of 0.2 works better than a rate of 0.3, i.e. dropping fewer neurons (only 20% of the neurons disabled) works better. This can be seen in the following plot.



[wandb plot: Sweep CNN_sweep 1 (2 runs), Sweep CNN_sweep 2 (14 runs)]



Importance of Optimizers

Adam combines the best properties of the AdaGrad and RMSProp algorithms to provide an optimization algorithm that can handle sparse gradients on noisy problems. From the plot below, it is clear that for the same configuration, optimizer adam works better than the optimizer RMSProp.

Importance of Batch size

A batch size of 32 gives better accuracy than a batch size of 64. The batch size depends on the size of the images in the dataset; it should be neither too large nor too small, and should be chosen so that roughly the same number of images is used in every step of an epoch. For our dataset, a batch size of 32 works well, as the following plot shows.



[wandb plot: Sweep CNN_sweep 1 (2 runs), Sweep CNN_sweep 2 (14 runs)]




Question 4

(a) The best model from our sweep has a dropout rate of 0.2, with batch normalization set to true and data augmentation set to true. The filter size is 4 for all layers, 1024 dense neurons are used, the 'same' filter organisation is chosen and the batch size is 32; the other parameters can be seen in the Question 2 plots. The accuracy of this trained model was found to be 39.54%.
(b) A 10 x 3 grid containing sample images from the test data and the predictions made by our best model

The prediction and the true label are printed below each image, which keeps the grid clean and easy to read.

We can also see that our model performs moderately well on randomly selected images.

(c) There are 64 filters in the first layer of our model. Therefore we have plotted them in an 8 x 8 grid.

From this visualisation, it is clear that all 64 images are different, which means that each filter produces a different output and captures a different aspect of the same input. This is how the network starts learning. Thus, we have visualised all the filters in the first layer of our best model for a random image from the test set.

[wandb panel: Run sleek-water-149]




Question 5

Guided backpropagation has been applied to 10 neurons in the fifth convolution layer, and the resulting images are plotted below in a 2 x 5 grid to keep the layout clean.
Each of these individual figures corresponds to one neuron to which guided backpropagation has been applied. Numbering the grid row by row, the image at position (1,1) represents the first neuron, the image at position (1,2) the second neuron, and so on, so that the image at position (2,5) represents the tenth neuron.
It can be seen from the images that the second, seventh and ninth neurons gathered information (features) better than the others, while the fourth, sixth and tenth neurons detected features only moderately well.

Guided backpropagation visualizes fine-grained details in the image. Its premise is as follows: neurons act as detectors of particular image features, so during backpropagation the negative gradients are set to zero in order to highlight the pixels that are important to the neuron.
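A minimal TF2 sketch of guided backpropagation is given below, assuming `model` is the trained CNN, `image` is a preprocessed test image of shape (1, 274, 274, 3), and `layer_name` is the name of the fifth convolution layer (all of these names are assumptions, not the exact code we used):

```python
# Sketch of guided backpropagation with a custom ReLU gradient.
import tensorflow as tf

@tf.custom_gradient
def guided_relu(x):
    y = tf.nn.relu(x)
    def grad(dy):
        # Guided backprop: zero out negative gradients and gradients
        # flowing through negative activations.
        return tf.cast(dy > 0, dy.dtype) * tf.cast(x > 0, x.dtype) * dy
    return y, grad

def guided_backprop(model, image, layer_name, neuron_index):
    # Truncate the model at the fifth convolution layer (layer_name assumed).
    sub_model = tf.keras.Model(model.inputs,
                               model.get_layer(layer_name).output)
    # Replace the ReLU activations with the guided version.
    for layer in sub_model.layers:
        if getattr(layer, "activation", None) == tf.keras.activations.relu:
            layer.activation = guided_relu
    with tf.GradientTape() as tape:
        inputs = tf.cast(image, tf.float32)
        tape.watch(inputs)
        activation = sub_model(inputs)[..., neuron_index]
    return tape.gradient(activation, inputs)   # gradient w.r.t. the input pixels
```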


[wandb panel: Run set (108 runs)]



Question 6

This repository contains the code for Assignment 2 of the course CS6910 - Fundamentals of Deep Learning at IIT Madras, taught by Prof. Mitesh Khapra.
It is recommended to read the README.md file before using the repository.







Part B : Fine-tuning a pre-trained model

Question 1

(a) How to address if the dimensions of the images in our data is not the same as that in the ImageNet data

Convolutional neural networks require identical image sizes to work properly, but in the real world images are often not of uniform size. This problem can be solved easily.
There are many ways to address it. Most of the techniques fall into two broad classes of solutions: transformations and inherent network properties.

Transformation based techniques

In the case of variable-sized images, we can apply transformations to get the same-sized image. A few of them are discussed below.
  1. Resize - Resize the variable-sized images to a common size. This can easily be implemented in a tf.data input pipeline.
  2. Crop - We can also randomly crop the images and then resize the crops to the same size. This operation also provides very good data augmentation, and it too can easily be implemented in a tf.data input pipeline.
These are the very popular transformation methods.
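A minimal tf.data sketch of these transformations (the target size and file paths are illustrative assumptions):

```python
# Sketch: resizing variable-sized images with tf.data.
import tensorflow as tf

IMG_SIZE = (299, 299)   # e.g. the InceptionV3 input size

def load_and_resize(path, label):
    image = tf.io.decode_jpeg(tf.io.read_file(path), channels=3)
    image = tf.image.resize(image, IMG_SIZE)               # plain resize
    # image = tf.image.random_crop(image, (*IMG_SIZE, 3))  # or a random crop
    return image, label

# dataset = tf.data.Dataset.from_tensor_slices((paths, labels))
# dataset = dataset.map(load_and_resize).batch(32)
```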

Inherent Network Property

Fully convolutional networks (FCNs) have no limitation on the input size at all, because once the kernel and stride sizes are fixed, the convolution at each layer produces an output whose dimensions follow from the corresponding input.
The same effect can also be achieved with global average pooling.

(b) ImageNet has 1000 classes. The naturalist dataset has only 10 classes. How to resolve ?

ImageNet has 1000 classes, so the last layer of the pre-trained model has 1000 nodes, whereas the naturalist dataset has only 10 classes. This issue is resolved by discarding the last softmax layer of the pre-trained model and adding our own softmax layer with 10 nodes.
Our implementation is modular, so it allows us to swap in any base model (InceptionV3, InceptionResNetV2, ResNet50, Xception).
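A minimal Keras sketch of this head replacement, using InceptionV3 as an illustrative base model (the pooling layer and input size are assumptions):

```python
# Sketch: replace the 1000-class ImageNet head with a 10-class head.
from tensorflow.keras import layers, models
from tensorflow.keras.applications import InceptionV3

base = InceptionV3(weights="imagenet", include_top=False,
                   input_shape=(299, 299, 3))          # drop the 1000-class softmax
x = layers.GlobalAveragePooling2D()(base.output)
outputs = layers.Dense(10, activation="softmax")(x)    # our 10 iNaturalist classes
model = models.Model(base.input, outputs)
```

The base class can be swapped for InceptionResNetV2, ResNet50 or Xception without changing the rest of the sketch.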

Question 2

Common trick to keep training tractable

We noticed that InceptionV3, InceptionResNetV2, ResNet50 and Xception are very large models compared with the simple model we implemented in the previous part, so even fine-tuning on a small training set can be very expensive.
The following is a common trick used to keep the training tractable. Ensemble learning theory can be applied: instead of relying on a single weak learner, we use multiple weak learners and combine them, which helps a lot in increasing the model accuracy. The next question is how to combine them, which is explained in the following steps.
Three major kinds of meta-algorithms aim at combining weak learners:
Bagging and boosting usually work with homogeneous weak learners, while stacking often deals with heterogeneous weak learners. Briefly:
  1. Bagging - often considers homogeneous weak learners, learns them independently from each other in parallel, and combines them following some kind of deterministic averaging process.
  2. Boosting - often considers homogeneous weak learners, learns them sequentially in a very adaptive way (each base model depends on the previous ones), and combines them following a deterministic strategy.
  3. Stacking - often considers heterogeneous weak learners, learns them in parallel, and combines them by training a meta-model that outputs a prediction based on the predictions of the different weak models.

Strategies that we tried in our model

As training such a huge model is expensive, we followed this strategy:
  1. We froze all layers of the architecture except the last k layers (see the sketch after this list).
  2. We were also interested to see whether adding a dense layer before the last softmax layer would help keep the training tractable. If we change only the last softmax layer, the pre-trained network simply acts as a feature extractor; adding another new dense layer on top can also be done.
  3. A third option is to combine both of the above, which results in multiple strategies. This can be handled conveniently through the wandb sweep configuration, so wandb helps a lot in reducing our confusion.
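A minimal sketch of the freezing strategy in point 1 (the base model and the value of k are illustrative assumptions):

```python
# Sketch: freeze all layers except the last k layers of the pre-trained base.
from tensorflow.keras.applications import InceptionV3

base = InceptionV3(weights="imagenet", include_top=False)
k = 20                                   # illustrative: unfreeze only the last 20 layers
for layer in base.layers[:-k]:
    layer.trainable = False
for layer in base.layers[-k:]:
    layer.trainable = True

# model.compile(optimizer="adam", loss="categorical_crossentropy",
#               metrics=["accuracy"])
```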

Question 3

From the plot below, we can see that most of the higher accuracies come from InceptionResNetV2, followed by Xception; these two models work far better than the others. The training is also much faster than training a model from scratch. Adding an extra dense layer had no effect on the performance.



[wandb plot: Sweep CNN_sweep 1 (9 runs), Sweep CNN_sweep 2 (11 runs), Run set 3 (0 runs), Run set 4 (0 runs)]


From the plot below it is clear that we are able to achieve an accuracy of more than 80 percent consistently. This is why pre-trained models are so important, and it is also why in most DL applications, instead of training a model from scratch, we use a model pre-trained on a similar or related task/dataset.



[wandb plot: Sweep CNN_sweep 1 (11 runs), Sweep CNN_sweep 2 (9 runs)]


It can be seen from the image below that the loss and training accuracy have low variance across configurations, which means the choice of configuration has little impact on them. The validation accuracy, however, has high variance, and this is where more interpretation is needed. Let us look at the factors one by one.



[wandb plot: Sweep CNN_sweep 1 (11 runs), Sweep CNN_sweep 2 (9 runs)]


The plot below clearly indicates that the Xception model works better than ResNet50. Xception is an extension of the Inception architecture that replaces the standard Inception modules with depthwise separable convolutions, which is why it performs better than ResNet50. The best accuracy also comes from Xception; however, Inception remained the most consistent, as we saw in the previous plot.



[wandb plot: Sweep CNN_sweep 1 (0 runs), Sweep CNN_sweep 2 (2 runs)]

As seen above, the following plot provides further evidence that InceptionV3 is better than ResNet50: the validation loss of the ResNet50 model is higher than that of the InceptionV3 model.


[wandb plot: Sweep CNN_sweep 1 (0 runs), Sweep CNN_sweep 2 (2 runs)]




Question 4

This repository contains the code for Assignment 2 of the course CS6910 - Fundamentals of Deep Learning at IIT Madras, taught by Prof. Mitesh Khapra.
It is recommended to read the README.md file before using the repository.


Part C




Object Detection

Object detection is a computer vision technique used to identify and locate objects within an image or video.
Object detection draws bounding boxes around the detected objects, which allows us to locate where those objects are in a given scene (or how they move through it).
Our system helps traffic instructors see whether a truck is entering the wrong road, so that they can advise the truck drivers to take a diversion onto less congested roads.
The youtube link for the same video : https://youtu.be/4zeEbKGm3zk
YOLOv3 (You Only Look Once, version 3) is a real-time object detection algorithm that identifies specific objects in videos, live feeds or images. YOLO uses features learned by a deep convolutional neural network to detect objects. Version 5 is now also available; YOLOv5 was found to outperform YOLOv4 and YOLOv3 in terms of accuracy, while YOLOv3 had the fastest detection speed, with YOLOv4 and YOLOv5 being roughly the same.
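As an illustration of how such a detector can be run on extracted video frames, here is a minimal PyTorch sketch using the public ultralytics/yolov5 hub model (the model variant, frame path and class filter are illustrative assumptions, not our exact pipeline):

```python
# Sketch: detect trucks in a single video frame with a pre-trained YOLOv5 model.
import torch

model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)
results = model("frame.jpg")           # path to an extracted video frame (assumed)
detections = results.pandas().xyxy[0]  # bounding boxes, confidences, class names
trucks = detections[detections["name"] == "truck"]
print(trucks)
```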


Self Declaration

Team Members

We hereby declare that all the work presented above was contributed equally by both team mates.
OE21S024 - 50 % contribution; CS20M041 - 50 % contribution
The work includes the following:
  1. All the report works
  2. Coding and developing model
  3. Object Detection and YOLOv5 works
  4. GitHub Maintenance