Assignment 2
Learn how to use CNNs: train from scratch, finetune a pretrained model, use a pre-trained model as it is.
Instructions
- The goal of this assignment is threefold: (i) train a CNN model from scratch and learn how to tune the hyperparameters and visualise filters (ii) finetune a pre-trained model just as you would do in many real world applications (iii) use an existing pre-trained model for a cool application.
- We strongly recommend that you work on this assignment in a team of size 2. Both the members of the team are expected to work together (in a subsequent viva both members will be expected to answer questions, explain the code, etc).
- Collaborations and discussions with other groups are strictly prohibited.
- You must use Python (numpy and pandas) for your implementation.
- You can use any and all packages from keras, pytorch, tensorflow
- You can run the code in a jupyter notebook on colab by enabling GPUs.
- You have to generate the report in the same format as shown below using wandb.ai. You can start by cloning this report using the clone option above. Most of the plots that we have asked for below can be (automatically) generated using the APIs provided by wandb.ai
- You also need to provide a link to your github code as shown below. Follow good software engineering practices and set up a github repo for the project on Day 1. Please do not write all code on your local machine and push everything to github on the last day. The commits in github should reflect how the code has evolved during the course of the assignment.
- You have to check moodle regularly for updates regarding the assignment.
Problem Statement
In Part A and Part B of this assignment you will build and experiment with CNN based image classifiers using a subset of the iNaturalist dataset. In Part C you will take a pre-trained object detection model and use it for a novel application.
Part A: Training from scratch
Question 1 (5 Marks)
Build a small CNN model consisting of 5 convolution layers. Each convolution layer would be followed by a ReLU activation and a max pooling layer. Here is sample code for building one such conv-relu-maxpool block in keras.
from keras.models import Sequential
from keras.layers import Conv2D, Activation, MaxPooling2D

# input_shape should match the (resized) image dimensions, e.g. (224, 224, 3)
model = Sequential()
model.add(Conv2D(16, (3, 3), input_shape=input_shape))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
After 5 such conv-relu-maxpool blocks of layers you should have one dense layer followed by the output layer containing 10 neurons (1 for each of the 10 classes). The input layer should be compatible with the images in the iNaturalist dataset.
The code should be flexible such that the number of filters, size of filters and activation function in each layer can be changed. You should also be able to change the number of neurons in the dense layer.
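A minimal sketch of such a configurable builder is given below (assuming TensorFlow/Keras; the 224 x 224 default input size and padding='same' are our assumptions, chosen so that five 2 x 2 poolings leave a 7 x 7 feature map, consistent with the parameter count in (b) below):

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv2D, Activation, MaxPooling2D, Flatten, Dense

def build_cnn(input_shape=(224, 224, 3), num_filters=(32, 32, 32, 32, 32),
              kernel_sizes=((3, 3),) * 5, conv_activation='relu',
              dense_neurons=128, num_classes=10):
    # five conv-activation-maxpool blocks followed by one dense layer and a softmax output
    model = Sequential()
    for i, (f, k) in enumerate(zip(num_filters, kernel_sizes)):
        if i == 0:
            model.add(Conv2D(f, k, padding='same', input_shape=input_shape))
        else:
            model.add(Conv2D(f, k, padding='same'))
        model.add(Activation(conv_activation))
        model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Flatten())
    model.add(Dense(dense_neurons, activation=conv_activation))
    model.add(Dense(num_classes, activation='softmax'))
    return model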
(a) What is the total number of computations done by your network? (assume m filters in each layer of size k x k and n neurons in the dense layer)
The computations are shown below:

(b) What is the total number of parameters in your network? (assume m filters in each layer of size k x k and n neurons in the dense layer)
Answer:
To compute the total number of parameters, we need to consider the convolution layers, the dense layer, the output layer and the batch normalization layers; pooling and activation layers have 0 parameters.
For our model,
Number of filters = m = 32
Filter size (k*k) = (3*3)
No. of neurons in dense layer = n = 128
No. of output classes = 10
Filter size for first layer: k*k*3
Filter size for other layers: k*k*m
Parameters in each layer = (Number of filters * Filter Size) + Number of filters (required for bias)
So, for the first conv layer, number of parameters = m*(k*k*3) + m = 3mk² + m = 896
For the 2nd, 3rd, 4th and 5th layers, number of parameters = m*(k*k*m) + m = m²k² + m = 9248 each
So for these 4 layers = 9248*4 = 36992
For the dense layer, the input is flattened to 7*7*32 = 1568 values, so the number of parameters = 1568*128 + 128 (bias) = 200832
For the output layer, number of parameters = (number of output classes)*(n+1) = 10*128 + 10 = 1290
Parameters per batch normalization layer = 4*32 = 128
Number of batch normalization layers used = 10 (one after each convolution layer and one after each activation layer)
So, total number of parameters = 896 + 36992 + 200832 + 1290 + 1280 = 241290
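As a quick sanity check of the arithmetic above (a small sketch; the variable names are ours):

m, k, n, c = 32, 3, 128, 10          # filters, kernel size, dense neurons, classes
conv1 = m * (k * k * 3 + 1)          # 896
conv_rest = 4 * m * (k * k * m + 1)  # 36992
dense = (7 * 7 * m) * n + n          # 200832 (flattened 7x7x32 input)
output = n * c + c                   # 1290
bn = 10 * 4 * m                      # 1280 (10 BatchNorm layers, 4 parameters per channel)
print(conv1 + conv_rest + dense + output + bn)  # 241290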
Question 2 (10 Marks)
You will now train your model using the iNaturalist dataset. The zip file contains a train and a test folder. Set aside 10% of the training data for hyperparameter tuning. Make sure each class is equally represented in the validation data. Do not use the test data for hyperparameter tuning.
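One way to set aside a class-balanced 10% validation split is sketched below (assuming the images are organised in one sub-directory per class; the directory path and the use of scikit-learn are our assumptions, not necessarily how every team does it):

import os
from sklearn.model_selection import train_test_split

train_dir = 'inaturalist_12K/train'  # hypothetical path to the training folder
paths, labels = [], []
for cls in sorted(os.listdir(train_dir)):
    for fname in os.listdir(os.path.join(train_dir, cls)):
        paths.append(os.path.join(train_dir, cls, fname))
        labels.append(cls)

# stratify=labels keeps every class equally represented in the 10% validation split
train_paths, val_paths, train_labels, val_labels = train_test_split(
    paths, labels, test_size=0.1, stratify=labels, random_state=42)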
Using the sweep feature in wandb find the best hyperparameter configuration. Here are some suggestions but you are free to decide which hyperparameters you want to explore
- number of filters in each layer : 32, 64, ...
- filter organisation: same number of filters in all layers, doubling in each subsequent layer, halving in each subsequent layer, etc
- data augmentation (easy to do in keras): Yes, No
- dropout: 20%, 30% (btw, where will you add dropout? you should read up a bit on this)
- batch normalisation: Yes, No
Based on your sweep please paste the following plots which are automatically generated by wandb:
- accuracy v/s created plot (I would like to see the number of experiments you ran to get the best configuration).
- parallel co-ordinates plot
- correlation summary table (to see the correlation of each hyperparameter with the loss/accuracy)
Also write down the hyperparameters and their values that you swept over. Smart strategies to reduce the number of runs while still achieving a high accuracy would be appreciated. Write down any unique strategy that you tried.
Hyperparameters and their values that we swept over are:
- Epochs: 10, 12, 15
- Number of filters: 32,64
- Number of neurons in dense layer: 64, 128
- Augmentation: True, False
- Double (double the number of filters in each subsequent layer): True, False
- Learning Rate: 0.0001, 0.0002, 0.0005
- Dropout: 0.1, 0.2, 0.4
Unique strategies used are:
- We used Bayesian optimization (the Bayes sweep method) to maximize the validation accuracy (a sketch of the sweep configuration is given below).
- We used early stopping via wandb's Hyperband early termination to cut short runs that were not performing well.
- We used a GPU on Google Colab to make our runs faster.
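A sketch of the kind of sweep configuration used (the parameter names are illustrative, and train() is assumed to be a function that builds the model from wandb.config, fits it, and logs val_accuracy):

import wandb

sweep_config = {
    'method': 'bayes',  # Bayesian optimization over the search space
    'metric': {'name': 'val_accuracy', 'goal': 'maximize'},
    'early_terminate': {'type': 'hyperband', 'min_iter': 3},  # stop poorly performing runs early
    'parameters': {
        'epochs': {'values': [10, 12, 15]},
        'num_filters': {'values': [32, 64]},
        'dense_neurons': {'values': [64, 128]},
        'augmentation': {'values': [True, False]},
        'double_filters': {'values': [True, False]},
        'learning_rate': {'values': [0.0001, 0.0002, 0.0005]},
        'dropout': {'values': [0.1, 0.2, 0.4]},
    },
}

sweep_id = wandb.sweep(sweep_config, project='cnn-from-scratch')  # project name is illustrative
wandb.agent(sweep_id, function=train)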
[wandb panels for the run set of 39 sweep runs: accuracy vs. created, the parallel co-ordinates plot, and the correlation summary table]
Question 3 (15 Marks)
Based on the above plots, some insightful observations are:
- Doubling the filters is useful, as it helps in abstracting more complex patterns.
- Data augmentation is important, as it allows the model to learn from data with more variation, so the model performs better on test data.
- Batch normalization after each layer is important.
- A dropout of 10% works best for our model; it also prevents the model from overfitting on the training data.
- As the model is not very complex, more epochs are required for training; we trained our model for 20 epochs.
- Below 10 epochs the validation accuracy of the model does not rise much; it starts rising only after about 10 epochs.
- We also tried L2 regularization, but when used along with batch normalization it did not make much difference to the accuracy of the model.
- Learning rate is also an important hyperparameter. Our model performs best with a learning rate of 0.0002; with a larger learning rate the model converges faster but is less accurate.
- As the training data is not very large and only 5 convolution layers are used, the achieved accuracy is relatively low.
Question 4 (5 Marks)
You will now apply your best model on the test data (You shouldn't have used test data so far. All the above experiments should have been done using train and val data only).
(a) Use the best model from your sweep and report the accuracy on the test set.
We obtained a best test accuracy of 40.90%. This model was trained with the following configuration:
- Epochs: 20
- Number of filters: 32
- Number of neurons in dense layer: 128
- Augmentation: True
- Learning Rate: 0.0002
- Dropout: 0.2
- Batch Normalization: True
- Double (filters doubled in each subsequent layer): True
- Batch Size: 128
(b) Provide a 10 x 3 grid containing sample images from the test data and predictions made by your best model (more marks for presenting this grid creatively).
Answer: The following is a 10 x 3 grid containing sample images from the test data and the predictions made by our best model.

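A sketch of how such a grid can be produced (test_images, true_labels, class_names and the trained model are assumed to be available; the matplotlib layout is illustrative):

import numpy as np
import matplotlib.pyplot as plt

# pick 30 random test images and get the model's predictions for them
idx = np.random.choice(len(test_images), size=30, replace=False)
preds = model.predict(test_images[idx]).argmax(axis=1)

fig, axes = plt.subplots(10, 3, figsize=(9, 30))
for ax, i, p in zip(axes.ravel(), idx, preds):
    ax.imshow(test_images[i])
    ax.set_title(f'true: {class_names[true_labels[i]]} | pred: {class_names[p]}', fontsize=8)
    ax.axis('off')
plt.tight_layout()
plt.show()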
(c) Visualise all the filters in the first layer of your best model for a random image from the test set. If there are 64 filters in the first layer plot them in an 8 x 8 grid.

Random Image from Test Set

Filters in first layer
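A sketch of how these first-layer filters can be plotted (assuming the first layer of the best model is a Conv2D layer with 32 filters of size 3 x 3 x 3; the rescaling is only for display):

import matplotlib.pyplot as plt

filters, _ = model.layers[0].get_weights()  # kernel shape: (3, 3, 3, num_filters)
filters = (filters - filters.min()) / (filters.max() - filters.min())  # rescale to [0, 1]

fig, axes = plt.subplots(4, 8, figsize=(12, 6))  # 4 x 8 grid for 32 filters
for i, ax in enumerate(axes.ravel()[:filters.shape[-1]]):
    ax.imshow(filters[:, :, :, i])
    ax.axis('off')
plt.show()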

Question 5 (10 Marks)
Apply guided back propagation on any 10 neurons in the CONV5 layer and plot the images which excite these neurons. The idea again is to discover interesting patterns which excite some neurons. You will draw a 10 x 1 grid below with one image for each of the 10 neurons.

Original Image
Guided Backpropagation on 10 neurons
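A sketch of how guided backpropagation can be done in TensorFlow 2 (the layer name 'conv5', the neuron index, and the availability of model and test_image are assumptions; only the ReLU gradient override and the gradient computation for one neuron are shown):

import tensorflow as tf

@tf.custom_gradient
def guided_relu(x):
    def grad(dy):
        # pass gradients back only where both the upstream gradient and the input are positive
        return tf.cast(dy > 0, dy.dtype) * tf.cast(x > 0, dy.dtype) * dy
    return tf.nn.relu(x), grad

# sub-model up to the CONV5 block, with every ReLU swapped for the guided version
gb_model = tf.keras.Model(inputs=model.inputs, outputs=model.get_layer('conv5').output)
for layer in gb_model.layers:
    if hasattr(layer, 'activation') and layer.activation == tf.keras.activations.relu:
        layer.activation = guided_relu

img = tf.convert_to_tensor(test_image[None, ...], dtype=tf.float32)
with tf.GradientTape() as tape:
    tape.watch(img)
    conv_out = gb_model(img)
    neuron = conv_out[0, 3, 3, 7]       # one (arbitrary) neuron in the CONV5 feature map
grads = tape.gradient(neuron, img)      # visualise grads[0] as the pattern exciting this neuron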

Question 6 (10 Marks)
Paste a link to your github code for Part A
- We will check for coding style, clarity in using functions and a README file with clear instructions on training and evaluating the model (the 10 marks will be based on this).
- We will also run a plagiarism check to ensure that the code is not copied (0 marks in the assignment if we find that the code is plagiarised).
- We will check the number of commits made by the two team members and then give marks accordingly. For example, if we see 70% of the commits were made by one team member then that member will get more marks in the assignment (note that this contribution will decide the marks split for the entire assignment and not just this question).
- We will also check if the training and test data has been split properly and randomly. You will get 0 marks on the assignment if we find any cheating (e.g., adding test data to training data) to get higher accuracy.
Part B : Fine-tuning a pre-trained model
Question 1 (5 Marks)
In most DL applications, instead of training a model from scratch, you would use a model pre-trained on a similar/related task/dataset. From keras, you can load any model (InceptionV3, InceptionResNetV2, ResNet50, Xception, etc) pre-trained on the ImageNet dataset. Given that ImageNet also contains many animal images, it stands to reason that using a model pre-trained on ImageNet may be helpful for this task.
You will load a pre-trained model and then fine-tune it using the naturalist data that you used in the previous question. Simply put, instead of randomly initialising the weights of a network you will use the weights resulting from training the model on the ImageNet data (keras directly provides these weights). Please answer the following questions:
(a) The dimensions of the images in your data may not be the same as that in the ImageNet data. How will you address this?
ANS: To address this issue, we resize the images to a size that the user specifies, for example (256, 256) or (64, 64), using the resizing functionality provided by Keras.
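For example, a small sketch of this (the directory path is hypothetical; target_size does the resizing while the images are loaded):

from tensorflow.keras.preprocessing.image import ImageDataGenerator

gen = ImageDataGenerator(rescale=1. / 255)
# every image is resized to 256 x 256 on the fly while being read from the class sub-directories
train_data = gen.flow_from_directory('inaturalist_12K/train', target_size=(256, 256), batch_size=64)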
(b) ImageNet has 1000 classes and hence the last layer of the pre-trained model would have 1000 nodes. However, the naturalist dataset has only 10 classes. How will you address this?
ANS: In order to address this issue, we removed the last layer of the pre-trained base model and replaced it with a dense layer containing 10 neurons and softmax activation. In addition we also provide the user the option to add any number of dense layers (varying from 0 to n) with a variable number of neurons (again, specified by the user) between the penultimate layer of the base model and our output dense layer.
Your implementation should be modular so that it allows to swap in any model (InceptionV3, InceptionResNetV2, ResNet50, Xception).
The code gives the user the liberty to choose any of these pre-trained base models; it has been uploaded to GitHub.
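A sketch of the modular setup (assuming TensorFlow/Keras applications; the helper name, the GlobalAveragePooling2D step and the defaults are illustrative rather than our exact code):

import tensorflow as tf

BASE_MODELS = {
    'InceptionV3': tf.keras.applications.InceptionV3,
    'InceptionResNetV2': tf.keras.applications.InceptionResNetV2,
    'ResNet50': tf.keras.applications.ResNet50,
    'Xception': tf.keras.applications.Xception,
}

def build_finetune_model(base_model_name='Xception', input_shape=(256, 256, 3),
                         num_dense=1, neurons_dense=128, activ_dense='relu',
                         dropout=0.5, num_classes=10):
    # load the ImageNet-pretrained base without its 1000-way classification head
    base = BASE_MODELS[base_model_name](include_top=False, weights='imagenet',
                                        input_shape=input_shape)
    x = tf.keras.layers.GlobalAveragePooling2D()(base.output)
    for _ in range(num_dense):  # optional dense layers between the base and the output layer
        x = tf.keras.layers.Dense(neurons_dense, activation=activ_dense)(x)
    x = tf.keras.layers.Dropout(dropout)(x)
    out = tf.keras.layers.Dense(num_classes, activation='softmax')(x)  # new 10-class head
    return tf.keras.Model(inputs=base.input, outputs=out), base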
(Note: This question is only to check the implementation. The subsequent questions will talk about how exactly you will do the fine-tuning)
Question 2 (5 Marks)
You will notice that InceptionV3, InceptionResNetV2, ResNet50, Xception are very huge models as compared to the simple model that you implemented in Part A. Even fine-tuning on a small training data may be very expensive. What is a common trick used to keep the training tractable (you will have to read up a bit on this)? Try different variants of this trick and fine-tune the model using the iNaturalist dataset. For example, '___'ing all layers except the last layer, '___'ing upto k layers and '___'ing the rest. Read up on pre-training and fine-tuning to understand what exactly these terms mean.
Write down the different strategies that you tried (simple bullet points would be fine).
ANS: When we have a huge pre-trained model that has been trained on a standard large-sized dataset, we can leverage what this model has already learnt and use it as a base model to fine-tune on our dataset. This way of reusing patterns learnt for a different problem to solve our own problem, instead of training from scratch, is called transfer learning. The word "transfer" is appropriate because we are essentially transferring the basic features learnt by the pre-trained model to our model as a ready-made base to train on top of. There are two ways in which this "transfer of learning" can be achieved, and we tried both strategies (a short sketch of each is given after this list):
- Freezing all layers of the base model: here "freezing" refers to using the whole set of parameters learnt by the pre-trained model as they are, without any further training on our side.
- Training the top k layers of the base model and freezing the rest: here we train the top (i.e., last) k layers of the base model and use the parameters learnt by the remaining layers as they are. k may take values 1, 2, 3, ... up to (total number of layers in the base model) - 1; making all layers trainable would take away the essence of transfer learning.
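A short sketch of both strategies (using the base model returned by the builder sketched above; k is the number of top layers left trainable):

# Strategy 1: freeze the entire base model and train only the newly added dense/output layers
base.trainable = False

# Strategy 2: leave only the top k layers of the base model trainable, freeze the rest
k = 2
base.trainable = True
for layer in base.layers[:-k]:
    layer.trainable = False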
Question 3 (15 Marks)
Now finetune the model using different strategies that you discussed above and different hyperparameter choices. Based on these experiments write down some insightful inferences (once again you will find the sweep function to be useful to plot and compare different choices).
Here are some examples of inferences that you can draw:
- Using a huge pre-trained network works better than training a smaller network from scratch (as you did in Part A)
- InceptionV3 works better for this task than ResNet50
- Using a pre-trained model, leads to faster convergence as opposed to training a model from scratch
- ... ....
(Note: I don't know if any of the above statements is true. I just wrote some random comments that came to my mind)
Of course, provide evidence for each inference in the form of appropriate plots (mostly automatically generated by wandb). The more insightful and thorough your inferences and the better the supporting evidence (in terms of plots), the more you will score in this question.
ANS:
[wandb sweep panels for the run set of 41 runs]
Note: The grey (non-colour) points correspond to results of Part A and are not to be considered for the plot below:
[wandb sweep panel for the run set of 41 runs]
We used the Bayes sweep strategy for sweeping across the hyperparameter values with the objective of maximizing the validation accuracy (val_accuracy). We first tried out different random combinations of hyperparameter values and then swept over the most promising ones in an attempt to increase efficiency. The hyperparameters and the values over which the sweeps took place are as follows (learning rate, batch size and optimizer were fixed at 0.0001, 64 and Adam respectively):
- base_model_name: Xception, VGG19, VGG16, ResNet50, MobileNet, InceptionV3, InceptionResNetV2
- freeze_all_layers: True, False
- k (No. of top layers of base model to train): 1,2,3
- data_augment (Resizing, Rescaling, Random flip, Random rotation): True, False
- max_epochs: 3,5,10
- dropout: 0.2, 0.3, 0.5, 0.6
- num_dense (No. of dense layers between base_model and output layer): 0, 1, 2
- neurons_dense (No. of neurons in the dense layers as per num_dense): 64, 128, 256, 512, 1024
- activ_dense (Activation for the dense layers as per num_dense): relu, tanh, sigmoid
Based on the above plots the following inferences were noted:
- The Xception pre-trained model gave the best val accuracy of 77.78%. In general, Xception was seen to produce good accuracies in the range of 73-77% consistently. | Best Performing Model: Xception with all layers frozen, 2 dense layers with 128 neurons and relu activation each between the base model and the output layer, with data augmentation, dropout 0.5 and max_epochs 10.
- Other pre-trained models that produced good accuracy were InceptionV3 (val accuracy in the range 69.07% to 73.97%) and InceptionResNetV2 (val accuracy in the range 73.97% to 77.68%).
- ResNet50 was seen to perform moderately (midway between the good and poor performers) with val accuracy in the range 45.95% to 69.27%.
- MobileNet was the worst performer, with an accuracy as low as 22.22% (which is even worse than the model we built from scratch in Part A). | Worst Performing Model: MobileNet with all but one layer frozen (k=1, top layer trainable), no dense layers between the base model and the output layer, dropout 0.3 and with data augmentation.
- Among the other poor performers were VGG16 (val accuracy in the range 30-45%) and VGG19 (val accuracy in the 40% range).
- In general it was observed that using pre-trained models with ImageNet weights increased the accuracy and improved the performance as compared to the model we built from scratch in Part A.
- A maximum of around 5 epochs sufficed, as increasing it to 10 produced little to no improvement in the overall accuracy.
- With base models like Xception, even freezing all layers sufficed; in general, making the top 2-3 layers of the base model trainable produced only very minor changes in accuracy.
Question 4 (10 Marks)
Paste a link to your github code for Part B
Follow the same instructions as in Question 6 of Part A.
Part C : Using a pre-trained model as it is
Question 1 (15 Marks)
Object detection is the task of identifying objects (such as cars, trees, people, animals) in images. Over the past 6 years, there has been tremendous progress in object detection, with very fast and accurate models available today. In this question you will take a pre-trained YoloV3 model and use it in an application of your choice. Here is a cool demo of YoloV2 (click on the image to see the demo on youtube).
Go crazy and think of a cool application in which you can use object detection (alerting lab mates of monkeys loitering outside the lab, detecting cycles in the CRC corridor, ....). More marks if you come up with an application which has social relevance.
Make a similar demo video of your application, upload it on youtube and paste a link below (similar to the demo I have pasted above).
Also note that I do not expect you to train any model here but just use an existing model as it is. However, if you want to fine-tune the model on some application-specific data then you are free to do that (it is entirely up to you).
Notice that for this question I am not asking you to provide a github link to your code. I am giving you a free hand to take existing code and tweak it for your application. Feel free to paste the link of your code here nonetheless (if you want).
ANS: We first tested the pre-trained YoloV3 model on images to get a sense of how it is performing object detection. Once we had a working idea of the process, we could extend it to videos which are nothing but a collection of frames (images). We decided to apply the YoloV3 model on 3 different videos: (Unfortunately the length of videos had to be kept short due to computational limitations)
- A multi-cam view of vehicles on a heavy traffic road (fast paced video) | Social relevance: Traffic surveillance
- A video of giraffes and zebras in their natural habitat (relatively slow paced video) | Social relevance: Surveillance of wild animals in natural reserves.
- A compilation of short clips from the TV show The Big Bang Theory (normally paced video but with different challenges i.e. video with dark tone/ dark background, video with a lot of detectable objects in consecutive frames)
We observed that the YoloV3 model mostly performed well on all the three differently paced videos except for occasional glitches (for example, in the 2nd video the model detects a puff of smoke as a dog for a split second, in the 3rd video the model detects a dummy skeleton as a person for a split second & some cereal boxes as books). Despite these occasional glitches, the YoloV3 model was seen to give impressive results even with dark background videos. Also, the model was seen to be consistent in detection i.e. there was less flickering of detection boxes.
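A sketch of the frame-by-frame pipeline (assuming the standard Darknet YOLOv3 config/weights files and OpenCV's dnn module; the file names and the decoding step left as a comment are illustrative):

import cv2

net = cv2.dnn.readNetFromDarknet('yolov3.cfg', 'yolov3.weights')
out_layers = net.getUnconnectedOutLayersNames()

cap = cv2.VideoCapture('traffic.mp4')  # hypothetical input video
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # YOLOv3 expects a 416 x 416 blob with pixel values scaled to [0, 1]
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416), swapRB=True, crop=False)
    net.setInput(blob)
    detections = net.forward(out_layers)  # raw detections for this frame
    # ... decode boxes/scores, apply cv2.dnn.NMSBoxes, draw the boxes and write the frame out
cap.release()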
Given below are the results on 3 images we tested initially and thereafter the results on the 3 videos:

Image 1: The YoloV3 model detects cars, persons, bicycle.

Image 2: The YoloV3 model detects birds.

Image 3: The YoloV3 model detects persons and cars.
Original Video 1: https://drive.google.com/file/d/18Lk310ouCXWQtqKJr8OSz-5gei-yoY7D/view?usp=sharing

Video 1 Results. Click on the link above to view the video.
Original Video 2: https://drive.google.com/file/d/1giS9PrhcGIvf1tPU-RdxVmeWZ_jR01rS/view?usp=sharing

Video 2 results. Click on the link above to view the video.
Original Video 3: https://drive.google.com/file/d/1zJelEIvYRKPJOrjSsvvuQMcB-IRTH7C8/view?usp=sharing

Video 3 results. Click on the link to view the video.
Self Declaration
List down the contributions of the two team members:
For example,
CS21D002: (50% contribution)
- Part A Implementation
- Part A Report
CS21D700: (50% contribution)
- Part B (Implementation + Report)
- Part C (Implementation + Report)
We, Nency Bansal_CS21D002 and Ritwiz Kamal_CS21D700, swear on our honour that the above declaration is correct.
Note: Your marks in the assignment will be in proportion to the above declaration. Honesty will be rewarded (Help is always given in CS6910 to those who deserve it!).
This is an opportunity for you to come clean. If one of the team members has not contributed then it should come out clearly from the above declaration. There will be a viva after the submission. If your performance in the viva does not match the declaration above then both the team members will be penalised (50% of the marks earned in the assignment).