Assignment 2
Learn how to use CNNs: train from scratch, finetune a pretrained model, use a pre-trained model as it is.
The two contributors to the code are:
Prithaj Banerjee - CS21S045
Kondapalli Jayavardhan - CS21S011
Created on March 31 | Last edited on April 3
Instructions
- The goal of this assignment is threefold: (i) train a CNN model from scratch and learn how to tune the hyperparameters and visualise filters (ii) finetune a pre-trained model just as you would do in many real world applications (iii) use an existing pre-trained model for a cool application.
- We strongly recommend that you work on this assignment in a team of size 2. Both the members of the team are expected to work together (in a subsequent viva both members will be expected to answer questions, explain the code, etc).
- Collaborations and discussions with other groups are strictly prohibited.
- You must use Python (numpy and pandas) for your implementation.
- You can use any and all packages from keras, pytorch, and tensorflow.
- You can run the code in a jupyter notebook on colab by enabling GPUs.
- You have to generate the report in the same format as shown below using wandb.ai. You can start by cloning this report using the clone option above. Most of the plots that we have asked for below can be (automatically) generated using the APIs provided by wandb.ai.
- You also need to provide a link to your github code as shown below. Follow good software engineering practices and set up a github repo for the project on Day 1. Please do not write all code on your local machine and push everything to github on the last day. The commits in github should reflect how the code has evolved during the course of the assignment.
- You have to check moodle regularly for updates regarding the assignment.
Problem Statement
In Part A and Part B of this assignment you will build and experiment with CNN based image classifiers using a subset of the iNaturalist dataset. In Part C you will take a pre-trained object detection model and use it for a novel application.
Part A: Training from scratch
Question 1 (5 Marks)
Build a small CNN model consisting of 5 convolution layers. Each convolution layer would be followed by a ReLU activation and a max pooling layer. Here is sample code for building one such conv-relu-maxpool block in keras.
from keras.models import Sequential
from keras.layers import Conv2D, Activation, MaxPooling2D

model = Sequential()
model.add(Conv2D(16, (3, 3), input_shape=input_shape))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
After 5 such conv-relu-maxpool blocks of layers you should have one dense layer followed by the output layer containing 10 neurons (1 for each of the 10 classes). The input layer should be compatible with the images in the iNaturalist dataset.
The code should be flexible such that the number of filters, size of filters and activation function in each layer can be changed. You should also be able to change the number of neurons in the dense layer.
Answer
We built the 5-layer CNN model with the configuration described above and made the parameters flexible to change.
The link for the notebook is: https://github.com/Doeschate/CS6910_Assignment2/blob/main/Part_A/Assignment2_PartA_Q1.ipynb
The link for the Python code, which can be run with command line arguments, is: https://github.com/Doeschate/CS6910_Assignment2/blob/main/Part_A/Assignment2_PartA_Q1.py
This code builds and prints the model; training and testing are done in separate files. A sketch of such a configurable builder is shown below.
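For reference, here is a minimal sketch of a configurable builder in Keras (an illustrative helper, not the assignment code itself; the name build_cnn and its arguments are our own):

```python
from keras.models import Sequential
from keras.layers import Conv2D, Activation, MaxPooling2D, Flatten, Dense

def build_cnn(input_shape, num_filters, kernel_sizes, activation='relu',
              pool_size=(2, 2), dense_size=64, num_classes=10):
    """Build 5 conv-activation-maxpool blocks with configurable widths."""
    model = Sequential()
    for i, (f, k) in enumerate(zip(num_filters, kernel_sizes)):
        if i == 0:
            model.add(Conv2D(f, k, input_shape=input_shape))
        else:
            model.add(Conv2D(f, k))
        model.add(Activation(activation))
        model.add(MaxPooling2D(pool_size=pool_size))
    model.add(Flatten())
    model.add(Dense(dense_size, activation=activation))
    model.add(Dense(num_classes, activation='softmax'))  # 10-way output
    return model

# e.g. 5 blocks with doubling filter counts and 3x3 kernels everywhere
model = build_cnn((256, 256, 3), [16, 32, 64, 128, 256], [(3, 3)] * 5)
```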
(a) What is the total number of computations done by your network? (assume m filters of size k x k in each layer and n neurons in the dense layer)
Answer
- We take the input size as H x W x 3
- Each layer has m kernels, each of size k x k (valid convolution, stride 1)
- Max pooling uses a k x k window with stride 1
- The dense layer has n neurons
- The output layer has 10 neurons
Layer 1
Input size : H x W x 3
Convolution output size : [H-(k-1)] x [W-(k-1)] x m
Convolution operations : [H-(k-1)] x [W-(k-1)] x k^2 x 3 x m
Activation operations : [H-(k-1)] x [W-(k-1)] x m
Max pooling operations : [H-2(k-1)] x [W-2(k-1)] x m
parameters : (k x k x 3 + 1)x m
Layer 2
Input size : [H-2(k-1)] x [W-2(k-1)] x m
Convolution output size : [H-3(k-1)] x [W-3(k-1)] x m
Convolution operations : [H-3(k-1)] x [W-3(k-1)] x k^2 x m^2
Activation operations : [H-3(k-1)] x [W-3(k-1)] x m
Max pooling operations : [H-4(k-1)] x [W-4(k-1)] x m
parameters : (k x k x m + 1)x m
Layer 3
Input size : [H-4(k-1)] x [W-4(k-1)] x m
Convolution output size : [H-5(k-1)] x [W-5(k-1)] x m
Convolution operations : [H-5(k-1)] x [W-5(k-1)] x k^2 x m^2
Activation operations : [H-5(k-1)] x [W-5(k-1)] x m
Max pooling operations : [H-6(k-1)] x [W-6(k-1)] x m
parameters : (k x k x m + 1)x m
Layer 4
Input size : [H-6(k-1)] x [W-6(k-1)] x m
Convolution output size : [H-7(k-1)] x [W-7(k-1)] x m
Convolution operations : [H-7(k-1)] x [W-7(k-1)] x k^2 x m^2
Activation operations : [H-7(k-1)] x [W-7(k-1)] x m
Max pooling operations : [H-8(k-1)] x [W-8(k-1)] x m
parameters : (k x k x m + 1) x m
Layer 5
Input size : [H-8(k-1)] x [W-8(k-1)] x m
Convolution output size : [H-9(k-1)] x [W-9(k-1)] x m
Convolution operations : [H-9(k-1)] x [W-9(k-1)] x k^2 x m^2
Activation operations : [H-9(k-1)] x [W-9(k-1)] x m
Max pooling operations : [H-10(k-1)] x [W-10(k-1)] x m
parameters : (k x k x m + 1)x m
Let p = [H-10(k-1)] x [W-10(k-1)] x m
Dense layer 1
Input size : p x 1
Output size : n x 1
Matrix operations : [n x p] multiplications + n bias additions
Activation operations : n
parameters : n x p + n
Output Layer
Input size : n x 1
Output size : 10 x 1
Matrix operations : [10 x n] multiplications + 10 bias additions
Activation operations : 10
parameters : 10 x n + 10
The total number of computations done by the network is the sum of the convolution, activation and max pooling operations in layers 1-5, plus the matrix and activation operations in the dense and output layers.
(b) What is the total number of parameters in your network? (assume m filters of size k x k in each layer and n neurons in the dense layer)
Answer
The total number of parameters in the network is the sum of the parameters of all the layers.
Total number of parameters : 3k^2m + 4m^2k^2 + np + 11n + 5m + 10
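As a sanity check, the formulas above can be evaluated for concrete H, W, k, m, n with a small script (illustrative only; it assumes valid convolutions and k x k pooling with stride 1, as in the derivation):

```python
def count_params_and_convs(H, W, k, m, n):
    """Evaluate the parameter and convolution-operation formulas above."""
    params = 3 * k**2 * m + 4 * m**2 * k**2 + 5 * m   # conv layer parameters
    h, w, in_ch, conv_ops = H, W, 3, 0
    for _ in range(5):
        h, w = h - (k - 1), w - (k - 1)               # valid convolution
        conv_ops += h * w * k**2 * in_ch * m          # multiplications
        h, w = h - (k - 1), w - (k - 1)               # k x k pool, stride 1
        in_ch = m
    p = h * w * m                                     # flattened size
    params += (n * p + n) + (10 * n + 10)             # dense + output layers
    return params, conv_ops

print(count_params_and_convs(H=256, W=256, k=3, m=32, n=128))
```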
Question 2 (10 Marks)
You will now train your model using the iNaturalist dataset. The zip file contains a train and a test folder. Set aside 10% of the training data for hyperparameter tuning. Make sure each class is equally represented in the validation data. Do not use the test data for hyperparameter tuning.
Using the sweep feature in wandb find the best hyperparameter configuration. Here are some suggestions but you are free to decide which hyperparameters you want to explore
- number of filters in each layer : 32, 64, ...
- filter organisation: same number of filters in all layers, doubling in each subsequent layer, halving in each subsequent layer, etc.
- data augmentation (easy to do in keras): Yes, No
- dropout: 20%, 30% (btw, where will you add dropout? you should read up a bit on this)
- batch normalisation: Yes, No
Based on your sweep please paste the following plots which are automatically generated by wandb:
- accuracy v/s created plot (I would like to see the number of experiments you ran to get the best configuration).
- parallel co-ordinates plot
- correlation summary table (to see the correlation of each hyperparameter with the loss/accuracy)
Also write down the hyperparameters and their values that you swept over. Smart strategies to reduce the number of runs while still achieving a high accuracy would be appreciated. Write down any unique strategy that you tried.
Answer:
The code for this question is https://github.com/Doeschate/CS6910_Assignment2/blob/main/Part_A/Assignment2_PartA_Q2.ipynb
- The Validation Accuracy v/s created and Training Accuracy v/s created plots below show the validation and training accuracy of every experiment, ordered by creation date. After splitting the training data with a 90-10 stratified train-validation split (a sketch of the split is given below the plots), we achieved around 42% validation accuracy and more than 50% training accuracy.
- The parallel coordinates plot, attached just below the Accuracy v/s created plots, shows every hyperparameter configuration against the accuracy it achieved.
- The correlation summary table, plotted below the parallel coordinates plot, summarises the correlation of each hyperparameter with the validation accuracy.
- Training loss, validation accuracy and training accuracy are also plotted below for the different experiments.
[wandb panels: run set of 30 experiments]
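For reference, a minimal sketch of the class-balanced 90-10 split described above (assuming a torchvision ImageFolder dataset; the directory path is a placeholder):

```python
from sklearn.model_selection import train_test_split
from torch.utils.data import Subset
from torchvision import datasets

full_train = datasets.ImageFolder('inaturalist/train')

# Stratify on the class labels so each class is equally represented
# in the 10% validation split.
train_idx, val_idx = train_test_split(
    list(range(len(full_train))),
    test_size=0.1,
    stratify=full_train.targets,
    random_state=42,
)
train_data = Subset(full_train, train_idx)
val_data = Subset(full_train, val_idx)
```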
We swept over the following hyperparameters (their values are listed in the sweep config below):
- Different filter sizes
- Different ways of organizing filters across layers
- Data augmentation
- Dropout
- Batch normalisation
- Different numbers of epochs
- Different pool kernel sizes
- Different pool strides
- Different dense layer sizes
- Different learning rates
- Different activation functions
- Different batch sizes
- Weight decay
- Different optimizers
The sweep config file that we used for hyperparameter tuning is attached below:
# This is the config for performing the sweep in wandb
sweep_config = {
    'name': 'Assignment2_PartA_Q2',
    'method': 'bayes',
    'metric': {'name': 'Validation Accuracy', 'goal': 'maximize'},
    'parameters': {
        'epochs': {'values': [5, 10, 20, 30, 40, 50]},
        'conv_attributes_channels': {'values': [[32,64,32,64,32], [32,32,32,32,32], [16,32,64,128,256], [32,64,128,256,512], [256,128,64,32,16], [64,64,64,64,64], [64,128,256,512,1024]]},
        'conv_attributes_kernel_size': {'values': [[3,3,5,7,9], [7,5,5,3,3], [11,7,5,3,3], [3,3,3,5,5], [3,3,3,3,3], [11,7,7,5,3], [11,9,7,5,3], [3,5,7,9,11]]},
        'pool_attributes_kernel_size': {'values': [[2,2,2,2,2], [2,2,2,1,1], [2,1,3,1,2], [3,3,3,2,2]]},
        'pool_attributes_stride': {'values': [[2,2,2,2,2], [2,2,2,1,1], [1,1,2,2,2], [1,2,1,2,1], [2,2,2,2,1]]},
        'dense_layer_size': {'values': [32, 64, 128, 256, 512]},
        'learning_rate': {'values': [0.001, 0.002, 0.0015, 0.0001, 0.00015, 0.00001]},
        'activation': {'values': ['relu', 'elu', 'gelu', 'sigmoid']},
        'dropout': {'values': [0.0, 0.2, 0.3, 0.4, 0.5]},
        'batch_normalization': {'values': [False, True]},
        'batch_size': {'values': [16, 32, 64, 128]},
        'weight_decay': {'values': [0.0, 0.00001, 0.0001]},
        'dataset_augmentation': {'values': [False, True]},
        'optimizer_name': {'values': ['Adam', 'SGD']}
    }
}
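This config is then registered and launched through wandb's sweep API, roughly as follows (the project name is illustrative, and train stands for our training function):

```python
import wandb

sweep_id = wandb.sweep(sweep_config, project='CS6910_Assignment2')
wandb.agent(sweep_id, function=train, count=30)  # run 30 experiments
```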
Strategy for reducing the total number of experiments:
Since there are many hyperparameters, each with several values, trying every combination would require an exponential number of runs. We therefore applied the following strategies to reduce the number of runs (an illustrative transform pipeline follows this list):
- Bayes search strategy: wandb provides three search strategies for choosing hyperparameters: 1) Random search, 2) Grid search and 3) Bayes search. We chose Bayes search, which models the objective with a Gaussian process and picks the next configuration to maximise the probability of improvement; here the objective we maximise is validation accuracy.
- Data augmentation and batch normalization: We initially observed that runs with batch_normalization and dataset_augmentation set to True did better than runs with them set to False, so to reduce the number of experiments we fixed both to True for the later experiments and obtained better accuracy.
- Weight decay: We also used weight decay as a hyperparameter to discourage overly complex models.
- Resize images: We resized the images to 256x256 to reduce experiment time, since smaller images take less time to process.
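For illustration, the resizing and augmentation steps could be expressed as a torchvision pipeline like the following sketch (the exact transforms in our repo may differ):

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize((256, 256)),            # fixed 256x256 input size
    transforms.RandomHorizontalFlip(),        # applied only when
    transforms.RandomRotation(15),            # dataset_augmentation=True
    transforms.ToTensor(),
])
```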
Question 3 (15 Marks)
Based on the above plots write down some insightful observations. For example,
- adding more filters in the initial layers is better
- Using bigger filters in initial layers and smaller filters in latter layers is better
- ..
(Note: I don't know if any of the above statements is true. I just wrote some random comments that came to my mind)
Answer:
Some insightful observations based on the above plots:
- Different filter sizes (how increasing/decreasing the size of the filters within or across layers helps): The sweep hyperparameter conv_attributes_kernel_size took values [[3,3,5,7,9],[7,5,5,3,3],[11,7,5,3,3],[3,3,3,5,5],[3,3,3,3,3],[11,7,7,5,3],[11,9,7,5,3],[3,5,7,9,11]]. Here [3,3,5,7,9] means a 3x3 filter in the 1st layer, 3x3 in the 2nd, 5x5 in the 3rd, 7x7 in the 4th and 9x9 in the 5th. So we tried increasing the filter size across layers ([3,3,5,7,9]), decreasing it ([7,5,5,3,3]) and keeping it constant ([3,3,3,3,3]). The parallel coordinates plot shows that increasing the filter size, as in [3,3,5,7,9], gives better accuracy than decreasing it in later layers, as in [11,7,7,5,3]; even constant filter sizes like [3,3,3,3,3] beat decreasing ones.
- Different ways of organizing filters across layers (how increasing/decreasing/not changing the number of filters helps): Through the sweep hyperparameter conv_attributes_channels we organised the number of filters in several ways: many filters initially with a gradual decrease, few filters initially with an increase, or the same number throughout. We swept over [[32,64,32,64,32],[32,32,32,32,32],[16,32,64,128,256],[32,64,128,256,512],[256,128,64,32,16],[64,64,64,64,64],[64,128,256,512,1024]]. Our best test result used many filters in the initial layers with a gradual decrease, [256,128,64,32,16]; a constant number of filters, [32,32,32,32,32], also gave good validation accuracy, whereas starting with few filters and gradually increasing, as in [64,128,256,512,1024], gave lower accuracy.
- Data augmentation (whether it helped or not): We ran the sweep with dataset_augmentation set to both True and False, and consistently got better accuracy with augmentation enabled. Augmentation therefore contributes substantially to accuracy, and the correlation table confirms it is positively correlated with validation accuracy.
- Dropout (whether it helps or not): The articles we read suggest it is better to apply dropout in the dense layers than in the convolution layers, so we used dropout in the dense layer with values [0.0, 0.2, 0.3, 0.4, 0.5]. We found that accuracy increases when dropout is used in the dense layers; our best model uses a dropout of 0.3, and the correlation table shows dropout is positively correlated with accuracy.
- Batch normalisation (whether it helps or not): We ran the sweep with batch_normalization set to both True and False, and got better accuracy with it enabled. Batch normalisation therefore contributes substantially to accuracy, and the correlation table confirms it is positively correlated with validation accuracy.
- Different numbers of epochs: We tried [5, 10, 20, 30, 40, 50] and found that the number of epochs matters. With few epochs (5 or 10) the model is under-trained and accuracy is poor; with too many (50) the model overfits and accuracy starts decreasing. Around 30-40 epochs works well for this model; our best model was trained for 30 epochs.
- Different pool kernel sizes: We tried pool_attributes_kernel_size values [[2,2,2,2,2],[2,2,2,1,1],[2,1,3,1,2],[3,3,3,2,2]] and found the best model with kernel sizes [3,3,3,2,2].
- Different pool strides: We tried pool_attributes_stride values [[2,2,2,2,2],[2,2,2,1,1],[1,1,2,2,2],[1,2,1,2,1],[2,2,2,2,1]] and got the best accuracy with strides [2,2,2,2,1].
- Size of the dense layer: We tried dense_layer_size values [32, 64, 128, 256, 512] and found it plays an important role in accuracy; a dense layer of size 128 worked best.
- Different learning rates: We swept learning_rate over [0.001, 0.002, 0.0015, 0.0001, 0.00015, 0.00001]. Lower learning rates such as 0.0001 or 0.00001 gave better accuracy than 0.001; our best model was trained with a learning rate of 0.0001.
- Different activation functions: We swept activation over ['sigmoid', 'relu', 'elu', 'gelu']. The parallel coordinates plots show that 'gelu', 'relu' and 'elu' give far better results than 'sigmoid'; in fact our highest test accuracy of 42% came with 'gelu'. The correlation summary table also shows that sigmoid is strongly negatively correlated with validation accuracy.
- Different batch sizes: Accuracy varied with batch_size, so we tried [16, 32, 64, 128]. The correlation table shows batch size is important and negatively correlated with accuracy, so smaller batches worked better; our best model was trained with a batch size of 16.
- Weight decay: We tried weight_decay values [0.0, 0.00001, 0.0001] and got the best accuracy at 0.00001.
- Different optimizers: We tried optimizer_name values ['Adam', 'SGD']. Adam gave better accuracy than SGD, and our best model was trained with Adam.
Question 4 (5 Marks)
You will now apply your best model on the test data (You shouldn't have used test data so far. All the above experiments should have been done using train and val data only).
(a) Use the best model from your sweep and report the accuracy on the test set.
Answer
The code for this question is https://github.com/Doeschate/CS6910_Assignment2/blob/main/Part_A/Assignment2_PartA_Q4.ipynb
These are the parameters for our best model:
wandb: Agent Starting Run: aiu1r4ra with config:
wandb: activation: gelu
wandb: batch_normalization: True
wandb: batch_size: 16
wandb: conv_attributes_channels: [256, 128, 64, 32, 16]
wandb: conv_attributes_kernel_size: [3, 3, 5, 7, 9]
wandb: dataset_augmentation: True
wandb: dense_layer_size: 128
wandb: dropout: 0.3
wandb: epochs: 30
wandb: learning_rate: 0.0001
wandb: optimizer_name: Adam
wandb: pool_attributes_kernel_size: [3, 3, 3, 2, 2]
wandb: pool_attributes_stride: [2, 2, 2, 2, 1]
wandb: weight_decay: 1e-05
The accuracy achieved on the test data was 42.35%.
The overall test accuracy of the network, along with the per-class test accuracy, is given below:
Accuracy of the network: 42.35 %
Accuracy of Fungi: 46.5 %
Accuracy of Insecta: 41.0 %
Accuracy of Aves: 52.0 %
Accuracy of Mammalia: 53.5 %
Accuracy of Mollusca: 27.0 %
Accuracy of Animalia: 35.5 %
Accuracy of Arachnida: 44.5 %
Accuracy of Reptilia: 44.0 %
Accuracy of Plantae: 63.0 %
Accuracy of Amphibia: 16.5 %
[wandb panels: run set of 30 experiments]
(b) Provide a 10 x 3 grid containing sample images from the test data and predictions made by your best model (more marks for presenting this grid creatively).
Answer:
True vs. predicted labels from our best model on a 10x3 grid of sample test images:

True vs Pred on 10x3 grid test images
With our model's accuracy of 42.35%, we predicted 16 out of 30 images correctly.
(c) Visualise all the filters in the first layer of your best model for a random image from the test set. If there are 64 filters in the first layer plot them in an 8 x 8 grid.
Answer
Visualising all the first layer filters of our best model

Visualization of the 1st layer filters
Visualization of the 1st layer filters on a random image

Visualization of the 1st layer filters on a random image
Commentary on what the filters learned:
The first conv2d layer of our model contains 256 filters, each of size 3x3. We first visualised the filters themselves, then chose one test image at random, applied the filters to it, and visualised the resulting feature maps.
As the first image shows, the 256 filters learned different weights: the colour pattern differs across the cells of each 3x3 kernel.
When these filters are applied to a single image, each one responds to a different portion of the image, as seen from the differing colour contrasts in the second image; together they detect different aspects of the same image, which combine into a single representation with all the features.
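For reference, a grid like the first image can be produced with matplotlib roughly as follows (a sketch assuming a PyTorch model whose first convolution layer is named conv1 and holds 256 filters of shape 3x3x3; the layer name is illustrative):

```python
import matplotlib.pyplot as plt

weights = model.conv1.weight.data.cpu()       # hypothetical layer name; [256, 3, 3, 3]
fig, axes = plt.subplots(16, 16, figsize=(16, 16))  # 256 filters in a 16x16 grid
for ax, w in zip(axes.flat, weights):
    w = (w - w.min()) / (w.max() - w.min())   # rescale weights to [0, 1] for display
    ax.imshow(w.permute(1, 2, 0).numpy())     # C,H,W -> H,W,C
    ax.axis('off')
plt.show()
```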
Question 5 (10 Marks)
Apply guided back propagation on any 10 neurons in the CONV5 layer and plot the images which excite this neuron. The idea again is to discover interesting patterns which excite some neurons. You will draw a 10 x 1 grid below with one image for each of the 10 neurons.
Answer
We applied guided backpropagation to 10 neurons in the CONV5 layer and plotted the images that excite these neurons.

Some interesting patterns in the image:
We applied guided backpropagation to 10 neurons in the CONV5 layer and plotted the images that excite them. We found that only a small portion of the image excites each neuron, and the resulting plot resembles an edge map of the original image. This is because guided backpropagation keeps only the positive gradient paths, so the plots highlight the input pixels that most strongly excite these neurons, i.e., the most important features of the image.
The code link for this question is https://github.com/Doeschate/CS6910_Assignment2/blob/main/Part_A/Assignment2_PartA_Q5.ipynb
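For reference, a minimal PyTorch sketch of guided backpropagation via backward hooks (guided_backprop, conv5 and neuron_idx are illustrative names, not the repo's API; the model's ReLUs are assumed to be non-inplace):

```python
import torch
import torch.nn as nn

def guided_backprop(model, image, conv5, neuron_idx):
    """Gradient of one CONV5 neuron w.r.t. the input, with guided ReLUs."""
    # Guided backprop: at every ReLU, pass back only positive gradients.
    hooks = [m.register_full_backward_hook(
                 lambda mod, gin, gout: (torch.clamp(gin[0], min=0.0),))
             for m in model.modules() if isinstance(m, nn.ReLU)]
    acts = {}
    fwd = conv5.register_forward_hook(lambda mod, inp, out: acts.update(out=out))

    image = image.clone().unsqueeze(0).requires_grad_(True)
    model.zero_grad()
    model(image)                                  # forward pass records conv5 output
    acts['out'].flatten()[neuron_idx].backward()  # backprop from a single neuron

    for h in hooks:
        h.remove()
    fwd.remove()
    return image.grad[0]                          # gradient image to visualise
```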
Question 6 (10 Marks)
Paste a link to your github code for Part A
- We will check for coding style, clarity in using functions and a README file with clear instructions on training and evaluating the model (the 10 marks will be based on this).
- We will also run a plagiarism check to ensure that the code is not copied (0 marks in the assignment if we find that the code is plagiarised).
- We will check the number of commits made by the two team members and then give marks accordingly. For example, if we see 70% of the commits were made by one team member then that member will get more marks in the assignment (note that this contribution will decide the marks split for the entire assignment and not just this question).
- We will also check if the training and test data has been split properly and randomly. You will get 0 marks on the assignment if we find any cheating (e.g., adding test data to training data) to get higher accuracy.
Answer
A README.md file is also provided in the Part_A folder with instructions for running the code.
Part B : Fine-tuning a pre-trained model
Question 1 (5 Marks)
In most DL applications, instead of training a model from scratch, you would use a model pre-trained on a similar/related task/dataset. From keras, you can load any model (InceptionV3, InceptionResNetV2, ResNet50, Xception, etc.) pre-trained on the ImageNet dataset. Given that ImageNet also contains many animal images, it stands to reason that a model pre-trained on ImageNet may be helpful for this task.
You will load a pre-trained model and then fine-tune it using the naturalist data that you used in the previous question. Simply put, instead of randomly initialising the weights of a network, you will use the weights resulting from training the model on the ImageNet data (keras directly provides these weights). Please answer the following questions:
(a) The dimensions of the images in your data may not be the same as that in the ImageNet data. How will you address this?
Answer
Images in the iNaturalist dataset have dimensions on the order of 800x600, 800x350, etc., while images in the ImageNet dataset are on the order of 256x256 or 224x224, so the pretrained models expect ImageNet-sized inputs (say 224x224). To match the iNaturalist dimensions to the pretrained models' input size, we resized all images to 224x224 and proceeded with training.
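For example, with torchvision transforms this resizing (plus the standard ImageNet normalisation) looks roughly like the following sketch:

```python
from torchvision import transforms

imagenet_transform = transforms.Compose([
    transforms.Resize((224, 224)),                     # match pretrained input size
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # standard ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])
```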
(b) ImageNet has 1000 classes and hence the last layer of the pre-trained model would have 1000 nodes. However, the naturalist dataset has only 10 classes. How will you address this?
Answer
Since the ImageNet dataset has 1000 classes, the last layer of a pre-trained model has 1000 nodes, but the iNaturalist data has only 10 classes. To address this, we deleted the last fully connected layer and added a new layer with 10 nodes to the pretrained models. We learned the weights of the newly created layer by training on the iNaturalist data.
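Concretely, for a torchvision ResNet50 the swap looks roughly like this sketch (fc is the name of ResNet's head attribute; other architectures name their final layer differently):

```python
import torch.nn as nn
from torchvision import models

model = models.resnet50(pretrained=True)        # ImageNet weights
model.fc = nn.Linear(model.fc.in_features, 10)  # replace 1000-way head with 10-way head
```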
Your implementation should be modular so that it allows you to swap in any model (InceptionV3, InceptionResNetV2, ResNet50, Xception).
(Note: This question is only to check the implementation. The subsequent questions will talk about how exactly you will do the fine-tuning)
Question 2 (5 Marks)
You will notice that InceptionV3, InceptionResNetV2, ResNet50, Xception are very huge models as compared to the simple model that you implemented in Part A. Even fine-tuning on a small training data may be very expensive. What is a common trick used to keep the training tractable (you will have to read up a bit on this)? Try different variants of this trick and fine-tune the model using the iNaturalist dataset. For example, '___'ing all layers except the last layer, '___'ing upto k layers and '___'ing the rest. Read up on pre-training and fine-tuning to understand what exactly these terms mean.
Write down the different strategies that you tried (simple bullet points would be fine).
Answer
In CNNs, the initial layers generally learn edges, blobs, colours, etc., the middle layers learn mid-level structures and patterns, and the final dense layers act as a classifier that predicts the input image's label from the learned features. Taking this into consideration, we tried the following two strategies:
Pre-trained (feature extraction): Take a model already trained on the ImageNet dataset, remove its last fully connected layer and add a new layer matching the iNaturalist label count. The rest of the pretrained model is fixed; we only adjust the weights of the newly added layer during backpropagation while training on the iNaturalist dataset. This is also known as using the model as a feature extractor, because the pretrained model extracts features from the data and only the newly created layer is trained to predict the label from those features. It is useful when our dataset is similar to the pretraining dataset and we have limited computing resources for training.
Fine-tuning: Take a model already trained on the ImageNet dataset and change its last fully connected layer to match the iNaturalist dataset. Now we fine-tune the weights of the pretrained network by continuing backpropagation. We can fine-tune all the layers, or freeze some of the initial layers and fine-tune only the last few; typical strategies are freezing all layers except the last one, or freezing up to k layers and training the rest. This method is useful when the pretraining dataset covers mostly generic features of the domain and our dataset is specific to a certain aspect of it.
We use the Adam optimizer with cross-entropy loss in training; the hyperparameters are:
- Pre_trained model
- unfreezed_from_last (the number of layers from the end whose weights can be updated during backpropagation)
- learning_rate
- batch_size
- epochs
- weight_decay
The strategies, in bullet points (a code sketch follows the list):
- Freezing all layers of the pretrained base model except the (newly added) last layer.
- Unfreezing the last k layers in the pretrained model and freezing the rest.
- Unfreezing all the layers in the pretrained model.
The first strategy is the feature-extractor setting described above, also called "transfer learning", in which we do not update the parameters of the pretrained model.
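A minimal sketch of these freezing strategies for a ResNet50 (an illustrative helper mirroring our unfreezed_from_last hyperparameter; "layers" here are the model's top-level children, and the helper name is our own):

```python
import torch.nn as nn
from torchvision import models

def build_finetune_model(unfreeze_from_last=0, num_classes=10):
    model = models.resnet50(pretrained=True)
    for p in model.parameters():
        p.requires_grad = False                   # freeze everything first
    children = list(model.children())
    if unfreeze_from_last > 0:
        for block in children[-unfreeze_from_last:]:
            for p in block.parameters():
                p.requires_grad = True            # unfreeze the last k blocks
    # new classification head (always trainable by default)
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model
```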
Question 3 (15 Marks)
Now finetune the model using different strategies that you discussed above and different hyperparameter choices. Based on these experiments write down some insightful inferences (once again you will find the sweep function to be useful to plot and compare different choices).
Here are some examples of inferences that you can draw:
- Using a huge pre-trained network works better than training a smaller network from scratch (as you did in Part A)
- InceptionV3 works better for this task than ResNet50
- Using a pre-trained model, leads to faster convergence as opposed to training a model from scratch
- ... ....
(Note: I don't know if any of the above statements is true. I just wrote some random comments that came to my mind)
Of course, provide appropriate evidence (in the form of plots, mostly automatically generated by wandb) for each inference. The more insightful and thorough your inferences, and the better the supporting evidence in terms of plots, the more you will score in this question.
Answer
[wandb panels: run set of 39 experiments]
Inferences:
- We tried the following huge pretrained models; their best validation accuracies were:
ResNet50 : 81.9%
ResNet18 : 75%
GoogLeNet : 73.9%
InceptionV3 : 75.8%
InceptionResNetV2 : 76.4%
ResNet50 performed better than the other models, giving 81.9% validation accuracy and 82% test accuracy.
- Using a huge pre-trained network works much better than training a smaller network from scratch as in Part A; accuracy nearly doubled.
- Accuracy improves when more layers are open for training, though more trainable layers also require more epochs. In fact, the number of trainable layers is the most important hyperparameter in the sweep. Freezing most layers keeps training computationally cheap, while unfreezing more layers increases training time and computational cost.
- Using a pre-trained model leads to faster convergence as opposed to training a model from scratch.
- The best accuracy for a pre-trained model occurs at around 7 epochs (some models take 10), whereas for Part A we required nearly 30 epochs or more.
- The negative correlation with batch size implies that training with smaller batches (8, 16) worked better than with larger batch sizes (64, 128).
- As epochs increase, validation accuracy generally increases; more epochs help in certain cases, but beyond a point the model starts to overfit.
- Using more epochs does not necessarily reduce the validation loss; sometimes we observed training accuracy increasing while validation accuracy decreased, a clear sign of overfitting.
- The positive correlation with weight decay implies that adding weight decay to our optimizer (Adam) increases accuracy by reducing overfitting.
- We tried the Adam and SGD optimizers; Adam performed better than SGD.
- Using data augmentation increases accuracy; in fact, our best validation accuracy of 82% was obtained with data augmentation.
Question 4 (10 Marks)
Paste a link to your github code for Part B
Follow the same instructions as in Question 6 of Part A.
Answer
A README.md file is also provided in the Part_B folder with instructions for running the code.
Part C : Using a pre-trained model as it is
Question 1 (15 Marks)
Object detection is the task of identifying objects (such as cars, trees, people, animals) in images. Over the past 6 years, there has been tremendous progress in object detection, with very fast and accurate models available today. In this question you will take a pre-trained YoloV3 model and use it in an application of your choice. Here is a cool demo of YoloV2 (click on the image to see the demo on youtube).
Go crazy and think of a cool application in which you can use object detection (alerting lab mates of monkeys loitering outside the lab, detecting cycles in the CRC corridor, ....). More marks if you come up with an application which has social relevance.
Make a similar demo video of your application, upload it on youtube and paste a link below (similar to the demo I have pasted above).
Also note that I do not expect you to train any model here but just use an existing model as it is. However, if you want to fine-tune the model on some application-specific data then you are free to do that (it is entirely up to you).
Notice that for this question I am not asking you to provide a github link to your code. I am giving you a free hand to take existing code and tweak it for your application. Feel free to paste the link of your code here nonetheless (if you want).
Answer
We have uploaded two videos of fire detection in public places to YouTube. We used a pretrained YoloV3 model to detect fire.
Fire Detection at Petrol Pump
Fire Detection at Airport
Social relevance :
The application is fire detection using the YoloV3 model. A sudden fire in a public place may cause loss of human life, and detecting fire as early as possible can save lives. We used the YoloV3 model to detect fire in video footage. This application can be used in real-world scenarios, for example alerting people when a fire breaks out somewhere. We show two examples of fire detection in public places: 1. fire detection at a petrol pump and 2. fire detection at an airport.
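For reference, a frame-by-frame YoloV3 detector can be run with OpenCV's DNN module roughly as follows (a sketch; the fire-detection cfg/weights file names are placeholders for whatever model files are used):

```python
import cv2
import numpy as np

net = cv2.dnn.readNetFromDarknet('yolov3-fire.cfg', 'yolov3-fire.weights')
out_layers = net.getUnconnectedOutLayersNames()

def detect_fire(frame, conf_threshold=0.5):
    """Return bounding boxes (x, y, w, h, confidence) for detections in a frame."""
    h, w = frame.shape[:2]
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416),
                                 swapRB=True, crop=False)
    net.setInput(blob)
    boxes = []
    for output in net.forward(out_layers):
        for det in output:                      # det = [cx, cy, bw, bh, obj, class scores...]
            conf = float(det[5:].max())
            if conf > conf_threshold:
                cx, cy, bw, bh = det[:4] * np.array([w, h, w, h])
                boxes.append((int(cx - bw / 2), int(cy - bh / 2),
                              int(bw), int(bh), conf))
    return boxes
```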
Self Declaration
List down the contributions of the two team members:
CS21S045: (50% contribution)
- Designing and implementing Convolution Neural Network model from scratch.
- Analysing the parallel coordinates plot and correlation plot for sweeps on our model trained from scratch.
- Writing inferences based on above analysis
- Analysing and implementing Guided Backpropagation on Convolution layer 5 neurons.
- Implemented YoloV3 Model & Object Detection
- Wrote the README file for Part A
CS21S011: (50% contribution)
- Designing and implementing Convolution Neural Network model from scratch.
- Analysing the parallel coordinates plot and correlation plot for sweeps on best pre-trained models.
- Writing inferences based on above analysis
- Training huge pre-trained models on the iNaturalist dataset.
- Implemented YoloV3 Model & Object Detection
- Wrote the README file for Part B
We, Prithaj Banerjee and Kondapalli Jayavardhan, swear on our Honour that the above declaration is correct.
Note: Your marks in the assignment will be in proportion to the above declaration. Honesty will be rewarded (Help is always given in CS6910 to those who deserve it!).
This is an opportunity for you to come clean. If one of the team members has not contributed, then it should come out clearly in the above declaration. There will be a viva after the submission. If your performance in the viva does not match the declaration above, both team members will be penalised (50% of the marks earned in the assignment).