
CS6910: Assignment 2 Report: Convolutional Neural Networks

Learning how to use CNNs: by training from scratch, fine-tuning a pre-trained model, and using a pre-trained model as it is.
Created on April 2 | Last edited on April 3

TEAM MEMBERS:

  • Dipra Bhagat CS21S048
  • Subham Das CS21S058

PART A: Training From Scratch:-

Question 1 (5 Marks):

  • Building a CNN:-
A CNN consisting of 5 convolutional layers was built using TensorFlow. Each convolutional layer is followed by an activation layer and a max-pooling layer, and batch normalization and dropout were added to each of these flexible 'conv-relu-maxpool' blocks. The network ends with a dense layer and a final output layer of 10 neurons.
The iNaturalist dataset has images from 10 different classes with varying image sizes, so all images were scaled to a common size of (200, 200, 3) before training.
As instructed, the Colab notebook labelled 'Part A' in the GitHub repository has the code for this whole section. The function CNN() can build a CNN to the desired specification, with the following customizations possible (a minimal sketch follows the list):
  • Number of filters per convolutional layer
  • Size of each filter
  • Filter multiplier (to determine number of filters for successive layers)
  • Batch Normalization (On/Off)
  • Dropout
  • Size of dense layer
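
Below is a minimal sketch of such a configurable builder, in the spirit of CNN() (a hypothetical re-creation under the assumptions stated in this section; the actual function in the 'Part A' notebook may differ in details):

```python
# Minimal sketch of a configurable 5-block CNN builder (hypothetical re-creation).
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn(n_filters=32, filter_multiplier=1.0, filter_sizes=(3, 3, 5, 5, 7),
              activation="relu", batch_norm=True, dropout=0.3,
              dense_size=128, input_shape=(200, 200, 3), n_classes=10):
    model = models.Sequential([layers.Input(shape=input_shape)])
    filters = n_filters
    for k in filter_sizes:  # five conv-activation-maxpool blocks
        model.add(layers.Conv2D(max(1, int(filters)), (k, k), padding="same"))
        model.add(layers.LeakyReLU() if activation == "leaky"
                  else layers.Activation(activation))
        if batch_norm:
            model.add(layers.BatchNormalization())
        if dropout:
            model.add(layers.Dropout(dropout))
        model.add(layers.MaxPooling2D((2, 2)))
        filters *= filter_multiplier  # grow/shrink the filter count for the next block
    model.add(layers.Flatten())
    model.add(layers.Dense(dense_size, activation="relu"))
    if dropout:
        model.add(layers.Dropout(dropout))  # dropout in the dense part as well
    model.add(layers.Dense(n_classes, activation="softmax"))
    return model

model = build_cnn(n_filters=64, filter_multiplier=0.5)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="categorical_crossentropy", metrics=["accuracy"])
```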

Calculating the number of parameters and computations:
For these calculations we took an input size of (198, 198, 3) and a max-pool shape of (2, 2), following our model's architecture. Here m denotes the number of filters per convolutional layer, k the filter size, and n the size of the dense layer.

1.a) Total number of computations: $1188k^2m + 324m^2k^2 + 320m + 16384m^2n + 10n^2 + 10$
1.b) Total number of parameters: $mk^2(3 + 4m) + 128mn + 5m + 11n + 10$
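
As a quick sanity check of the parameter formula (our own illustrative substitution), taking m = 32, k = 3, n = 128 gives $32 \cdot 3^2(3 + 4 \cdot 32) + 128 \cdot 32 \cdot 128 + 5 \cdot 32 + 11 \cdot 128 + 10 = 37728 + 524288 + 160 + 1408 + 10 = 563594$ trainable parameters; a count of this kind can be verified against model.count_params() in Keras.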


Question 2 (10 Marks):

For training the model, the iNaturalist dataset was used. The training data was split and about 10% of it was kept aside as validation data for hyperparameter tuning. The ranges from which the hyperparameters were chosen are as follows:-
  • "n_filters" (Number of Filters in each layer): [16, 32, 64]
  • "filter_multiplier" (Multiplier for Filter Organisation, for reducing, increasing or keeping the number of filters constant in the subsequent layer): [0.5, 1, 2]
  • "filter_size" (Size of Filters): [3, 3, 5, 5, 7], [5 ,5, 4, 3, 2], [10, 8, 6, 4, 2], [2, 4, 6, 8, 10]
  • "augment_data" (Data Augmentation: Vertical shift, horizontal shift, shear, zoom, rotation horizontal flip, etc.): [True, False]
  • "dropout" (Dropout for preventing Overfitting: Added in fully connected layers and in each of the convolutional layers after the activation layers): [0.3, 0.5]
  • "batch_norm" (Batch Normalisation: added in each of the convolutional layers after the activation layers): [False, True]
  • "epochs": [5, 10, 30]
  • "dense_size" (Size of the dense layer): [32, 64, 128]
  • "lr" (Learning Rate): [0.01, 0.001]
  • "batch_size" (Size of each batch): [64, 128, 256]
  • "activation" (Activation Function): ["relu", "leaky"]
Adam was used as the optimizer for training and categorical cross-entropy as the loss function. The Bayesian method was used to run the sweeps; it models the objective with a Gaussian process and then chooses hyperparameters that maximize the probability of improving a specified metric (validation accuracy, which we maximize, in our case). A minimal sketch of the sweep configuration is shown below.
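
The sketch below shows how such a Bayesian sweep can be configured with wandb (the project name and the logged metric key "val_accuracy" are assumptions; the actual config may differ):

```python
# Minimal sketch of the W&B sweep setup, mirroring the ranges listed above.
import wandb

sweep_config = {
    "method": "bayes",  # Bayesian hyperparameter search
    "metric": {"name": "val_accuracy", "goal": "maximize"},
    "parameters": {
        "n_filters": {"values": [16, 32, 64]},
        "filter_multiplier": {"values": [0.5, 1, 2]},
        "augment_data": {"values": [True, False]},
        "dropout": {"values": [0.3, 0.5]},
        "batch_norm": {"values": [True, False]},
        "epochs": {"values": [5, 10, 30]},
        "dense_size": {"values": [32, 64, 128]},
        "lr": {"values": [0.01, 0.001]},
        "batch_size": {"values": [64, 128, 256]},
        "activation": {"values": ["relu", "leaky"]},
    },
}

sweep_id = wandb.sweep(sweep_config, project="cs6910-assignment2")
# wandb.agent(sweep_id, function=train, count=100)  # train() builds and fits the model
```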

We swept the hyperparameters over 100 runs using the early-stopping and run-reduction strategies described below. The highest validation accuracy obtained was 36.14%, for the following set of hyperparameters, as shown in the plot below:-

  • "n_filters": 64
  • "filter_multiplier": 0.5
  • "filter_size": [3, 3, 5, 5, 7]
  • "augment_data": True
  • "dropout": 0.5
  • "batch_norm": True
  • "epochs": 10
  • "dense_size": 128
  • "lr": 0.001
  • "batch_size": 128
  • "activation": relu


[Run: num_64_org_0.5_aug_Y_drop_0.5_norm_Y — run plot omitted]

Strategies followed for reducing the number of runs:-
  1. A preliminary dry run with a random search strategy was done to find the probable range in which the best combination of hyperparameters lay. This helped narrow down the number of combinations the sweep agents had to explore.
  2. The Bayesian method was used for the sweeps, which inherently tries to maximize the probability of improving the specified metric and thus reaches the best combination faster.
  3. Sometimes during a run the validation loss starts to rise again after an initial dip. At that point we can safely say that the best configuration will not be reached from that combination and terminate the run early. This requires a bit of manual intervention, but it is often a good strategy for reducing the number of runs, since a model that consistently gives bad results need not be trained further.


Charts:

[Sweep: Final Sweep (Bayesian), 107 runs — accuracy and correlation plots omitted]


Question 3 (15 Marks):

Following are the observations based on the above plots generated by the runs we have carried out:
  • There was an increase in performance whenever we started with a larger number of filters and decreased the count in subsequent layers. Having many filters in the initial layers helps capture more features of the image; fewer filters are faster but give lower accuracy.

  • A filter multiplier less than 1 (0.5 in our case) was found to work well. This is also reflected in the correlation data plot above.

  • Slowly increasing the filter sizes in subsequent layers improved model performance. We got the best accuracy using the following filter sizes: [(3,3), (3,3), (5,5), (5,5), (7,7)]

  • Larger dense layers performed better than smaller ones. This is also reflected in the correlation summary table, which shows that dense-layer size has a positive correlation with validation accuracy. We got the best results with a dense layer of size 128; sizes of 64 and 32 gave poor results.

  • The Adam optimizer gave significantly better performance than SGD, so we used Adam with a learning rate of 0.001.

  • A learning rate of 0.001 gave the best results; decreasing it to 0.0001 affected validation accuracy negatively. Again, this is reflected in the correlation summary table, where the learning rate has a positive correlation with validation accuracy.

  • ReLU activation gave better accuracy in our case. The correlation plot also shows that ReLU has a positive correlation with validation accuracy while LeakyReLU has a negative one.

  • Batch normalization increased the performance of the model: it normalizes the inputs to every layer towards a unit Gaussian, so the model learns from a roughly fixed distribution rather than one that changes at every step. This leads to faster training and better generalization.

  • Larger batch sizes degraded training performance, and batch size shows a negative correlation with validation accuracy in the correlation plot, although our best run used a batch size of 128.

  • Higher dropout (0.5 in our case) was shown to be good (per the correlation data), which makes sense given the natural tendency of even a moderately complex CNN to overfit. It helped the validation metrics improve.

  • With data augmentation, training accuracy decreased slightly but validation accuracy increased. Data augmentation brings variability to the dataset, makes the CNN model more robust, and helps avoid overfitting. A minimal sketch of the pipeline is given below.
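
The sketch below shows one way to set up the augmentation and the ~10% validation split with Keras (the transformation magnitudes and the dataset path "inaturalist_12K/train" are illustrative assumptions, not our exact settings):

```python
# Minimal sketch of the augmentation pipeline described above.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(
    rescale=1.0 / 255,
    width_shift_range=0.1,    # horizontal shift
    height_shift_range=0.1,   # vertical shift
    shear_range=0.2,
    zoom_range=0.2,
    rotation_range=20,
    horizontal_flip=True,
    validation_split=0.1,     # ~10% held out for validation
)

train_gen = train_datagen.flow_from_directory(
    "inaturalist_12K/train", target_size=(200, 200),
    batch_size=128, class_mode="categorical", subset="training")
val_gen = train_datagen.flow_from_directory(
    "inaturalist_12K/train", target_size=(200, 200),
    batch_size=128, class_mode="categorical", subset="validation")
```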


Question 4 (5 Marks):

Applying the best model on the test data:
  • Best Model used:
"n_filters": 32
"filter_multiplier": 0.5
"filter_size": [3, 3, 5, 5, 7]
"augment_data": True
"dropout": 0.5
"batch_norm": True
"epochs": 10
"dense_size": 128
"lr": 0.001
"batch_size": 128
"activation": relu

  • (4.a) The test accuracy obtained by using the above model on test data is 40.07%.

  • (4.b) 10 x 3 grid containing sample images from the test data and predictions made by our best model:



  • (4.c) Visualizing all the filters in the first layer of our best model for a random image from the test set:

Visualizing feature maps from the first layer:

A random image to visualize feature maps
Feature maps corresponding to the image above

Inferences from filters and feature maps:
We can clearly see that the filters have learned different abstract views of the snake: some have picked up sharp edges, others detect lines, curves, and bends. As these are first-layer filters, they learn such low-level features of the image rather than class-specific details. A minimal sketch of the visualization code is given below.
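
The sketch below shows one way to extract and plot the first-layer feature maps (it assumes `model` is the trained best model, `img` is a preprocessed test image of shape (1, 200, 200, 3), and the filter count is divisible by 4; our actual plotting code may differ):

```python
# Minimal sketch: extract and display the feature maps of the first conv layer.
import matplotlib.pyplot as plt
import tensorflow as tf

first_conv = next(l for l in model.layers
                  if isinstance(l, tf.keras.layers.Conv2D))
viz_model = tf.keras.Model(inputs=model.inputs, outputs=first_conv.output)
feature_maps = viz_model(img)                 # shape: (1, H, W, n_filters)

n = feature_maps.shape[-1]
fig, axes = plt.subplots(4, n // 4, figsize=(12, 8))
for i, ax in enumerate(axes.flat):
    ax.imshow(feature_maps[0, :, :, i], cmap="gray")
    ax.axis("off")
plt.show()
```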


Question 5 (10 Marks):

Applying guided backpropagation on any 10 neurons in the CONV5 layer and plotting the images which excite these neurons:


Some interesting patterns that were observed in the above images:
  • The first row shows different input images.
  • The second row shows the excitations of 10 neurons from the last convolutional layer for the corresponding image; black means the neuron fired, white means it stayed inactive.
  • The third row is the output after guided backpropagation.
  • For example, the first image (class Amphibia) fully excites neuron number 8.
  • Neuron 0 is mostly white across most of the images and light grey for some, meaning it rarely fires.
  • Neurons 5 and 6 are mostly blackish or dark grey, so they fire for most images.
  • Interestingly, each class of images excites the neurons differently. A minimal sketch of the guided-backpropagation procedure is given below.
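
The sketch below shows the core idea of guided backpropagation: ReLU's gradient is replaced so that only positive gradients flow back through positively activated units. It assumes `model` is the trained network built with ReLU Activation layers and `img` is a preprocessed input; the layer name "conv5" is hypothetical:

```python
# Minimal sketch of guided backpropagation in TensorFlow.
import tensorflow as tf

@tf.custom_gradient
def guided_relu(x):
    def grad(dy):
        # pass gradient only where both the activation and the gradient are positive
        return tf.cast(dy > 0, dy.dtype) * tf.cast(x > 0, dy.dtype) * dy
    return tf.nn.relu(x), grad

# Clone the trained model and swap every ReLU for the guided version.
guided = tf.keras.models.clone_model(model)
guided.set_weights(model.get_weights())
for layer in guided.layers:
    if getattr(layer, "activation", None) is tf.keras.activations.relu:
        layer.activation = guided_relu

conv5_out = guided.get_layer("conv5").output     # "conv5" is a hypothetical name
neuron_model = tf.keras.Model(guided.inputs, conv5_out)

img = tf.convert_to_tensor(img)                  # shape (1, H, W, 3)
with tf.GradientTape() as tape:
    tape.watch(img)
    acts = neuron_model(img)
    neuron = tf.reduce_max(acts[..., 8])         # e.g. neuron (channel) 8 of CONV5
grads = tape.gradient(neuron, img)               # visualize as the guided-backprop map
```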



Question 6 (10 Marks):




PART B: Fine-tuning a pre-trained model:-



Question 1(5 Marks):


(1.a) The dimensions of the images in your data may not be the same as that in the ImageNet data. How will you address this?
Ans) We resize the images, since the input images may well have different dimensions. We used ImageDataGenerator to resize (to 128 × 128) and rescale the input images to suit our needs. There are also other options, such as cropping (which might cut off the object if it is not centered, though it works for insects since they occupy a very small part of the picture) and augmentation (which generates more useful variants of previously scaled/cropped images).
The dimensions of the input images from the iNaturalist dataset were all scaled down to (128, 128, 3) before being fed into the pre-trained network. The input dimensions do not affect the convolutional layers, since the filters in these layers were trained to detect certain patterns regardless of image size, and the two datasets are very similar (both contain animal images). So the initial convolutional layers need not be retrained from scratch, and the pre-trained weights can be used directly (with a bit of fine-tuning if necessary). However, the weights of the fully connected layers towards the end of the model depend on the input size and cannot be reused directly, so they are trained from scratch on the current dataset.

(1.b) ImageNet has 1000 classes and hence the last layer of the pre-trained model would have 1000 nodes. However, the naturalist dataset has only 10 classes. How will you address this?
Ans) The ImageNet model has 1000 classes and hence 1000 nodes in its output layer. Before replacing it, we make the penultimate part of the network compatible with the new head: we flatten the base model's output and connect it to a dense layer of size 64, 128, or 256 with a ReLU or Leaky ReLU activation, to which batch normalization and dropout are also added. For the final layer, we can either replace the 1000-node output layer with our own layer of 10 nodes (representing the 10 classes) or add more layers after the base model to suit our needs. In our implementation, we replaced the output layer of the ImageNet model with our own 10-node softmax output layer.

The implementation of the pre-trained models has been made flexible so that any of the models can be swapped in easily using appropriate arguments, as in the sketch below.
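
The sketch below illustrates this flexible setup (function and argument names are illustrative; the actual implementation may differ):

```python
# Minimal sketch of a flexible pre-trained model builder with a new 10-class head.
import tensorflow as tf
from tensorflow.keras import layers, models, applications

BASE_MODELS = {
    "InceptionV3": applications.InceptionV3,
    "InceptionResNetV2": applications.InceptionResNetV2,
    "ResNet50": applications.ResNet50,
    "Xception": applications.Xception,
}

def build_pretrained(name="InceptionResNetV2", dense_size=64, dropout=0.6,
                     input_shape=(128, 128, 3), n_classes=10):
    base = BASE_MODELS[name](include_top=False, weights="imagenet",
                             input_shape=input_shape)
    base.trainable = False                        # freeze the base for pre-training
    x = layers.Flatten()(base.output)
    x = layers.Dense(dense_size, activation="relu")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Dropout(dropout)(x)
    out = layers.Dense(n_classes, activation="softmax")(x)   # 10-class head
    return models.Model(base.input, out), base
```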


Question 2(5 Marks):


2)You will notice that InceptionV3, InceptionResNetV2, ResNet50, Xception are very huge models as compared to the simple model that you implemented in Part A. Even fine-tuning on a small training data may be very expensive. What is a common trick used to keep the training tractable (you will have to read up a bit on this)? Try different variants of this trick and fine-tune the model using the iNaturalist dataset. For example, '___'ing all layers except the last layer, '___'ing upto k layers and '___'ing the rest. Read up on pre-training and fine-tuning to understand what exactly these terms mean.
Write down the different strategies that you tried (simple bullet points would be fine):

Ans) For this case, we found that freezing works very well: because the two datasets are quite similar, the pre-trained filters in the convolutional layers can still detect similar patterns on the new dataset without fine-tuning. Thus only a percentage of the layers towards the end of the model were fine-tuned, and the rest were frozen during training. The 'percentage of layers' was used as a hyperparameter for InceptionV3, InceptionResNetV2, ResNet50, and Xception, and differs for each model. Additionally, the last layer of the pre-trained model was dropped, and a dense layer as well as a final output layer were added; these two added layers were trained completely from scratch.
In compact form the strategies we employed were:

Pretraining:

  • Instantiate a base model and load pre-trained weights (of InceptionV3, InceptionResNetV2, ResNet50, Xception, etc.) into it.
  • Freeze all layers in the base model by setting trainable = False (in Keras); we are not training the whole model.
  • Attach our own set of layers (one dense layer and one output layer) to the pre-trained model and train only these layers, keeping all weights of the base model fixed.

Fine-Tuning:

  • After pre-training, the last 100 layers of the base model were unfrozen and re-trained with a very low learning rate, keeping the rest of the layers fixed (see the sketch after this list).
  • It's also critical to use a very low learning rate at this stage, because we are training a much larger model than in the first round of training, on a dataset that is typically very small. As a result, we are at risk of overfitting very quickly if we apply large weight updates. Here, we only want to readapt the pretrained weights in an incremental way.
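
The sketch below continues the hypothetical build_pretrained() example above and shows the fine-tuning step (the data generators and epoch count are assumptions):

```python
# Minimal sketch of fine-tuning: unfreeze the last 100 layers, use a very low LR.
import tensorflow as tf

model, base = build_pretrained("InceptionResNetV2")
# ... pre-train the new head here with the base frozen ...

base.trainable = True
for layer in base.layers[:-100]:          # keep everything except the last 100 layers frozen
    layer.trainable = False

model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),   # very low learning rate
              loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_gen, validation_data=val_gen, epochs=10)
```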


Question 3(15 Marks):


3) Now finetune the model using different strategies that you discussed above and different hyperparameter choices. Based on these experiments write down some insightful inferences (once again you will find the sweep function to be useful to plot and compare different choices):

Ans) The sweeps were run over the following hyperparameter choices; the results are shown in the plots below:-
  • Data Augmentation : Yes, No
  • Pre-trained Model : InceptionV3, InceptionResNetV2, ResNet50, Xception
  • Batch size for training : 32, 64
  • Size of pre-final dense layer : 64, 128, 256
  • Dropout in pre-final layer : 0.2, 0.4, 0.6
  • Epochs: 10
  • Layers frozen for training : 50%, 70%, 100%
  • Batch Normalization : Yes, No (pre-final layer)
  • Activation Function : ReLU, Leaky ReLU (pre-final layer)
  • Learning Rate: 0.001, 0.0001

Plots:

[Sweep: PartB_final_sweep, 42 runs — sweep plots omitted]

From almost 50 runs, the highest validation accuracy obtained was 58.46%. The best hyperparameters are as follows:

  • Data Augmentation : Yes
  • Pre-trained Model : InceptionResNetV2
  • Batch size for training : 64
  • Size of pre-final dense layer : 64
  • Dropout in pre-final layer : 0.6
  • Epochs: 10
  • Layer freezing point : 100 (all layers frozen except the last 100)
  • Batch Normalization : Yes
  • Activation Function : ReLU
  • Learning Rate: 0.0001


[Run: model_IRNV2_drop_0.6_batch_size_64_n_dense_64 — run plot omitted]


Inferences based on these plots:


  • The pre-trained models converge faster than a network trained from scratch.

  • Using a huge pre-trained network works better than training a smaller network from scratch as in Part A, as we can observe from the plots.

  • InceptionV3, InceptionResNetV2, and Xception have better accuracy than ResNet50 on the iNaturalist dataset.

  • InceptionResNetV2 has the best accuracy of the four.

  • Using ResNet50 does not work well at all: it has a strong negative correlation (-0.87) with validation accuracy. Three runs in the sweep used ResNet50 as the pre-trained model; the highest validation accuracy among them was only 0.1001, with the average around 0.11.

  • InceptionResNetV2 was found to be the best pre-trained model of the four, with a positive correlation of 0.31 with validation accuracy. Over 20 runs in the sweep used InceptionResNetV2, and the average validation accuracy was close to 0.55. InceptionV3 also performed very well.

  • Batches of size 64 were preferred over batches of size 128 or 256; performance worsened again if the batch size was taken too small.

  • Augmenting the training data using Keras was useful, as most of the top models used data augmentation.

  • ReLU was again found to be the best activation function.

  • If we add more layers after the base_model layers, we get better accuracy.

  • Dropout in the pre-final dense layer had a positive correlation with validation accuracy, and larger dropout values were preferred; 50% dropout was used most of the time.

  • The size of the pre-final dense layer also had a high correlation with validation accuracy; most runs in the sweep used 128 or 256 neurons. The size of this layer matters because it is trained completely from scratch on the current iNaturalist data rather than being pre-trained.



Question 4(10 Marks):




PART C: Using a Pre-Trained Model as it is:-



Question 1(15 Marks):

In this question you will use a pre-trained YoloV3 model and use it in an application of your choice:

Ans) Application: Autonomous detection of road signs and traffic signals for A.I.-assisted driving.
The pre-trained YOLOv3 model was used for object detection on a YouTube video taken from the dashboard camera of a car while driving, a setting directly relevant to self-driving cars. YOLOv3 comes in different versions that trade off frame rate against accuracy: the higher the frame rate of a model, the lower its accuracy. Hence, in real-time applications it becomes important to select a proper trade-off between frame rate and accuracy. A minimal inference sketch is given below.
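
The sketch below shows one common way to run YOLOv3 on a video with OpenCV's DNN module (file paths and the 0.5 confidence threshold are assumptions; the config, weights, and class files are the standard public Darknet releases):

```python
# Minimal sketch: YOLOv3 inference on a dashcam video with OpenCV.
import cv2
import numpy as np

net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")
out_layers = net.getUnconnectedOutLayersNames()
classes = open("coco.names").read().strip().split("\n")

cap = cv2.VideoCapture("dashcam.mp4")   # assumed path to the driving video
while True:
    ok, frame = cap.read()
    if not ok:
        break
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416),
                                 swapRB=True, crop=False)
    net.setInput(blob)
    for output in net.forward(out_layers):
        for det in output:               # det = [cx, cy, w, h, objectness, class scores...]
            scores = det[5:]
            class_id = int(np.argmax(scores))
            if scores[class_id] > 0.5:   # simple confidence threshold
                h, w = frame.shape[:2]
                cx, cy, bw, bh = det[:4] * np.array([w, h, w, h])
                x, y = int(cx - bw / 2), int(cy - bh / 2)
                cv2.rectangle(frame, (x, y), (x + int(bw), y + int(bh)), (0, 255, 0), 2)
                cv2.putText(frame, classes[class_id], (x, y - 5),
                            cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)
    cv2.imshow("YOLOv3", frame)
    if cv2.waitKey(1) == 27:             # Esc to quit
        break
cap.release()
```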


Societal Impact:
The introduction of self-driving cars has brought a surge of interest in object detection and image classification among students, researchers, and professionals. As more people use such cars, more data is generated, and the continuous improvement of high-performance processing units makes it possible to extract useful information from enormous amounts of data at high speed. Tech giants pioneering self-driving cars, e.g. Tesla, have effectively become custodians of huge amounts of confidential and sensitive customer data, which they use to increase the accuracy of their models: the more data the models see, the more situations they can anticipate, and the better the pathway they can choose when something unexpected happens.

Our effort in this area is minuscule by comparison, but it gives a clear understanding of how the model's output works and how an agent installed in a self-driving car could use such inferences to choose among the possibilities in the situation at hand. Each mistake leads to new learning, which in turn gives customers more safety and protects them from harmful choices that previously led others into danger.

Self-driving cars have not yet reached very high accuracy and are often plagued by uncertain and sometimes comical glitches, as videos on YouTube show. But if they are used as assistants for now, giving advice based on the current situation, they can be genuinely helpful: even the best of us get distracted sometimes, and an A.I. assistant that points out the traffic signals and road signs ahead would make driving easier for most people.




Self Declaration:-

Contribution:

CS21S048: (50% contribution)
  • Coding for Part A, B, and C
  • Testing the code
  • Setting up Wandb sweeps
  • Generating plots and other figures used in the report
  • Performing object detection
  • Updating GitHub repository
  • Writing report
CS21S058: (50% contribution)
  • Coding for Part A, B, and C
  • Testing the code
  • Setting up Wandb sweeps
  • Generating plots and other figures used in the report
  • Performing object detection
  • Updating GitHub repository
  • Writing report
We, Dipra Bhagat and Subham Das, swear on our honour that the above declaration is correct.