
CS6910 Assignment-1

Write your own backpropagation code and keep track of your experiments using wandb.ai
Created on February 8 | Last edited on March 12

Instructions

  • The goal of this assignment is twofold: (i) to implement and use gradient descent (and its variants) with backpropagation for a classification task, and (ii) to get familiar with wandb, a cool tool for running and keeping track of a large number of experiments.
  • We strongly recommend that you work on this assignment in a team of size 2. Both members of the team are expected to work together (in a subsequent viva, both members will be expected to answer questions, explain the code, etc.).
  • Collaborations and discussions with other groups are strictly prohibited.
  • You must use Python (numpy and pandas) for your implementation.
  • You cannot use the following packages from keras, pytorch, tensorflow: optimizers, layers
  • If you are using any packages from keras, pytorch, tensorflow then post on moodle first to check with the instructor.
  • You can run the code in a jupyter notebook on colab by enabling GPUs.
  • You have to generate the report in the same format as shown below using wandb.ai. You can start by cloning this report using the clone option above. Most of the plots that we have asked for below can be generated (automatically) using the APIs provided by wandb.ai. You will upload a link to this report on Gradescope.
  • You also need to provide a link to your GitHub code as shown below. Follow good software engineering practices and set up a GitHub repo for the project on Day 1. Please do not write all the code on your local machine and push everything to GitHub on the last day. The commits on GitHub should reflect how the code evolved over the course of the assignment.
  • You have to check moodle regularly for updates regarding the assignment.


Problem Statement

In this assignment, you need to implement a feedforward neural network and write the backpropagation code for training the network. We strongly recommend using numpy for all matrix/vector operations. You are not allowed to use any automatic differentiation packages. This network will be trained and tested using the Fashion-MNIST dataset. Specifically, given an input image (28 x 28 = 784 pixels) from the Fashion-MNIST dataset, the network will be trained to classify the image into 1 of 10 classes.



Question 1 (2 Marks)

Download the Fashion-MNIST dataset and plot 1 sample image for each class, as shown in the grid below. Use "from keras.datasets import fashion_mnist" to get the Fashion-MNIST dataset.
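A minimal sketch of how such a grid can be produced (the class names are hard-coded from the Fashion-MNIST label documentation; the figure can also be logged to wandb with wandb.Image):

```python
import matplotlib.pyplot as plt
from keras.datasets import fashion_mnist

# Fashion-MNIST label names for classes 0-9
class_names = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
               "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]

(X_train, y_train), (X_test, y_test) = fashion_mnist.load_data()

fig, axes = plt.subplots(2, 5, figsize=(10, 5))
for label, ax in enumerate(axes.flat):
    idx = (y_train == label).argmax()   # first training image of this class
    ax.imshow(X_train[idx], cmap="gray")
    ax.set_title(class_names[label])
    ax.axis("off")
plt.tight_layout()
plt.show()
```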




[wandb panel: Run set 2]


Question 2 (10 Marks)

Implement a feedforward neural network which takes images from the fashion-mnist data as input and outputs a probability distribution over the 10 classes.

Your code should be flexible so that it is easy to change the number of hidden layers and the number of neurons in each hidden layer.

We will check the code for implementation and ease of use.

Code is here
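The full class is in the linked code; the sketch below only illustrates the kind of flexible interface we aimed for (class and method names are illustrative), where the depth and width of the network are plain constructor arguments:

```python
import numpy as np

class FeedForwardNN:
    """784 -> [hidden_size] * hidden_layers -> 10, with a softmax output."""

    def __init__(self, hidden_layers=3, hidden_size=128, n_in=784, n_out=10, seed=42):
        rng = np.random.default_rng(seed)
        sizes = [n_in] + [hidden_size] * hidden_layers + [n_out]
        # fan-in scaled initialization keeps pre-activations from blowing up
        self.W = [rng.normal(0, np.sqrt(1.0 / m), (m, n))
                  for m, n in zip(sizes[:-1], sizes[1:])]
        self.b = [np.zeros(n) for n in sizes[1:]]

    def forward(self, X):
        """X: (batch, 784) array of pixel values scaled to [0, 1]."""
        h = X
        for W, b in zip(self.W[:-1], self.b[:-1]):
            h = np.tanh(h @ W + b)                       # hidden activation
        logits = h @ self.W[-1] + self.b[-1]
        logits -= logits.max(axis=1, keepdims=True)      # numerical stability
        exp = np.exp(logits)
        return exp / exp.sum(axis=1, keepdims=True)      # softmax over 10 classes
```

Changing the architecture is then a one-line change, e.g. FeedForwardNN(hidden_layers=5, hidden_size=32).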



Question 3 (18 Marks)

Implement the backpropagation algorithm with support for the following optimisation algorithms:

  • sgd
  • momentum based gradient descent
  • nesterov accelerated gradient descent
  • rmsprop
  • adam
  • nadam

(12 marks for the backpropagation framework and 1 mark for each of the six optimisation algorithms above)

We will check the code for implementation and ease of use (e.g., how easy it is to add a new optimisation algorithm such as Eve). Note that the code should be flexible enough to work with different batch sizes.

Code is here
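The actual implementations are in the linked code. Below is a minimal sketch of the pluggable interface we have in mind, shown here for vanilla SGD and Adam (names illustrative); a new algorithm such as Eve would be one more subclass:

```python
import numpy as np

class Optimizer:
    """Common interface: adding a new algorithm (e.g. Eve) means one more subclass."""
    def update(self, params, grads):
        raise NotImplementedError

class SGD(Optimizer):
    def __init__(self, lr=1e-3):
        self.lr = lr

    def update(self, params, grads):
        for p, g in zip(params, grads):
            p -= self.lr * g                      # in-place step on each array

class Adam(Optimizer):
    def __init__(self, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m, self.v, self.t = None, None, 0

    def update(self, params, grads):
        if self.m is None:                         # lazy state initialization
            self.m = [np.zeros_like(p) for p in params]
            self.v = [np.zeros_like(p) for p in params]
        self.t += 1
        for p, g, m, v in zip(params, grads, self.m, self.v):
            m[:] = self.beta1 * m + (1 - self.beta1) * g        # 1st moment
            v[:] = self.beta2 * v + (1 - self.beta2) * g * g    # 2nd moment
            m_hat = m / (1 - self.beta1 ** self.t)              # bias correction
            v_hat = v / (1 - self.beta2 ** self.t)
            p -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
```

The training loop then just calls optimizer.update(params, grads) once per mini-batch, so the choice of optimizer and the batch size stay independent of each other.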



Question 4 (10 Marks)

Use the sweep functionality provided by wandb to find the best values for the hyperparameters listed below. Use the standard train/test split of fashion_mnist (use (X_train, y_train), (X_test, y_test) = fashion_mnist.load_data()). Keep 10% of the training data aside as validation data for this hyperparameter search. Here are some suggestions for different values to try for the hyperparameters. As you can quickly see, this leads to an exponential number of combinations. You will have to think about strategies to do this hyperparameter search efficiently. Check out the options provided by wandb.sweep and write down what strategy you chose and why.

  • number of epochs: 5, 10
  • number of hidden layers: 3, 4, 5
  • size of every hidden layer: 32, 64, 128
  • weight decay (L2 regularisation): 0, 0.0005, 0.5
  • learning rate: 1e-3, 1e-4
  • optimizer: sgd, momentum, nesterov, rmsprop, adam, nadam
  • batch size: 16, 32, 64
  • weight initialization: random, Xavier
  • activation functions: sigmoid, tanh, ReLU

wandb will automatically generate the following plots. Paste these plots below using the "Add Panel to Report" feature. Make sure you use meaningful names for each sweep (e.g. hl_3_bs_16_ac_tanh to indicate that there were 3 hidden layers, the batch size was 16, and the activation function was tanh) instead of using the default names (whole-sweep, kind-sweep) given by wandb.

  • We used the Random Search strategy to search through the large space of hyperparameter configurations.
  • We did not use exhaustive Grid Search because it iterates over every possible hyperparameter setting and is computationally very heavy. Bayesian Optimization models the objective with a Gaussian process and picks the next configuration to maximize the probability of improvement; it requires a metric key to be specified, but it does not scale well to a large number of hyperparameters. Random Search lets us sample many settings without being biased towards any particular hyperparameter, giving a more holistic view of the model's performance on the given data. A sketch of our sweep setup follows.
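A sketch of the sweep setup under these choices; the train() function (which builds, trains, and logs one model per sampled configuration) and the project name are placeholders:

```python
import wandb

sweep_config = {
    "method": "random",    # alternatives: "grid", "bayes"
    "metric": {"name": "val_accuracy", "goal": "maximize"},
    "parameters": {
        "epochs":        {"values": [5, 10]},
        "hidden_layers": {"values": [3, 4, 5]},
        "hidden_size":   {"values": [32, 64, 128]},
        "weight_decay":  {"values": [0, 0.0005, 0.5]},
        "lr":            {"values": [1e-3, 1e-4]},
        "optimizer":     {"values": ["sgd", "momentum", "nesterov",
                                     "rmsprop", "adam", "nadam"]},
        "batch_size":    {"values": [16, 32, 64]},
        "init":          {"values": ["random", "xavier"]},
        "activation":    {"values": ["sigmoid", "tanh", "relu"]},
    },
}

sweep_id = wandb.sweep(sweep_config, project="cs6910-assignment-1")
wandb.agent(sweep_id, function=train, count=64)  # run 64 random configurations
```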



[wandb panel: Run set 2 (64 runs)]


Question 5 (5 marks)

We would like to see the best accuracy on the validation set across all the models that you train.

wandb automatically generates this plot, which summarises the validation accuracy of all the models that you trained. Please paste this plot below using the "Add Panel to Report" feature.

  • We achieved a highest validation accuracy of 89.47%, as shown in the following plot.



[wandb panel: Run set 2]


Question 6 (20 Marks)

Based on the different experiments that you have run we want you to make some inferences about which configurations worked and which did not.

Here again, wandb automatically generates a "Parallel co-ordinates plot" and a "correlation summary" as shown below. Learn about a "Parallel co-ordinates plot" and how to read it.

By looking at the plots that you get, write down some interesting observations (simple bullet points but should be insightful). You can also refer to the plot in Question 5 while writing these insights. For example, in the above sample plot there are many configurations which give less than 65% accuracy. I would like to zoom into those and see what is happening.

I would also like to see a recommendation for what configuration to use to get close to 95% accuracy.

Observations:

  • (Not an observation) We used a random seed of 42 to get reproducible results.
  • Weight initialization was a big factor in convergence (this is also visible in the correlation summary). With Xavier initialization, the model almost always converged to a high validation accuracy of at least 70%, whereas Random initialization gave the worst results: the model was always stuck at 10% (random guessing) and could not converge even with faster optimizers like Adam and Nadam. Results with Normal initialization fell between Xavier and Random. (A minimal sketch contrasting the two schemes follows this list.)
  • LeakyReLU activation with any of the faster optimizers produced a validation accuracy close to 80%. LeakyReLU outperformed ReLU because its gradient does not vanish in the negative region, which gave it better scope for learning.
  • We found that L2 regularization had very little impact on training. This may be because our networks are not very deep (at most 5 hidden layers). We expect L2 regularization to have more impact in models with more hidden layers (perhaps 10 or more).
  • Normalizing the data to mean 0 and variance 1 gave 3-4% better accuracy than un-normalized data.
  • We observed that at least 10 epochs are needed for proper convergence with Normal initialization, while Random initialization was still stuck even after 15 epochs; we suspect this is because Random initialization starts in a very flat region. Xavier could converge to 80% accuracy within 5 epochs. So epochs and initialization are tightly correlated.
  • Sigmoid/tanh activations saturate after a while, and the resulting vanishing gradients inhibit learning. A similar "dying neuron" problem occurs in the negative region of ReLU, which is why LeakyReLU is better.
  • MGD/NAG performed on par with Adam under comparable hyperparameter settings, while SGD failed to converge in most cases. Adam/Nadam/RMSprop/Adagrad gave good convergence (at least 75%) in most runs.
  • Hyperparameter setting which gave the best results (val_accuracy of 89.47%): hidden layers - 3, hidden units - 128, batch_size - 64, epochs - 15, activation - LeakyReLU, optimizer - Adagrad, l2 - 0.05, lr - 0.01, weight initialization - Xavier
  • Hyperparameter setting which gave the worst results (val_accuracy of 6.61%): hidden layers - 4, hidden units - 128, batch_size - 128, epochs - 5, activation - LeakyReLU, optimizer - RMSprop, l2 - 0.005, weight initialization - Normal
  • We think that using the best hyperparameter setup with Adam/Nadam, data augmentation, dropout, batch normalization, and at least 25-30 epochs would get us to 95% accuracy.
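A minimal sketch of the initialization contrast referenced above, for a single 784 -> 128 layer; it shows why plain standard-normal ("Random") weights push sigmoid/tanh units deep into saturation while Xavier-scaled weights keep pre-activations near unit variance:

```python
import numpy as np

rng = np.random.default_rng(42)
fan_in, fan_out = 784, 128

# "Random": plain standard-normal weights. Each pre-activation sums 784
# unit-variance terms, so its std is ~sqrt(784) = 28 and sigmoid/tanh saturate.
W_random = rng.normal(0.0, 1.0, (fan_in, fan_out))

# Xavier (Glorot): variance 2 / (fan_in + fan_out) keeps pre-activation
# variance near 1, so the signal neither explodes nor dies layer by layer.
W_xavier = rng.normal(0.0, np.sqrt(2.0 / (fan_in + fan_out)), (fan_in, fan_out))

x = rng.normal(0.0, 1.0, fan_in)      # a unit-variance input vector
print((x @ W_random).std())           # ~28: deep in tanh's flat tails
print((x @ W_xavier).std())           # ~1.3: stays in the responsive range
```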



[wandb panel: Run set 2 (64 runs)]


Question 7 (10 Marks)

For the best model identified above, report the accuracy on the test set of fashion_mnist and plot the confusion matrix as shown below. More marks for creativity (less marks for producing the plot shown below as it is)

Best Model Configuration:

  • no. of hidden layers - 3
  • no. of hidden units (in each layer) - 128
  • batch_size - 64
  • epochs - 15
  • activation - LeakyReLU
  • optimizer - Adagrad
  • l2 - 0.05
  • lr - 0.01
  • weight initialization - Xavier
  • Fashion MNIST train accuracy - 94.06%
  • Fashion MNIST validation accuracy - 89.47%
  • Fashion MNIST test accuracy - 88.77%
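A sketch of how such a confusion-matrix panel can be logged with wandb's built-in plot API; model, X_test_flat, and y_test are placeholders for our trained network, the flattened test images, and the test labels:

```python
import wandb

# Class names for the 10 Fashion-MNIST labels
class_names = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
               "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]

# `model`, `X_test_flat`, `y_test` are placeholders for our trained network and data
run = wandb.init(project="cs6910-assignment-1", name="best_model_confusion")
probs = model.forward(X_test_flat)            # (10000, 10) softmax outputs
preds = probs.argmax(axis=1)                  # predicted class per test image
wandb.log({"confusion_matrix": wandb.plot.confusion_matrix(
    y_true=y_test.tolist(), preds=preds.tolist(), class_names=class_names)})
run.finish()
```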



[wandb panel: Run set 2 (66 runs)]



Question 8 (5 Marks)

In all the models above you would have used cross-entropy loss. Now compare the cross-entropy loss with the squared error loss. I would again like to see some automatically generated plots or your own plots to convince me whether one is better than the other.

  • We ran the best hyperparameter configuration twice, changing only the loss function (first cross-entropy, then mean squared error). From the following plots we observed that cross-entropy is better, as it directly measures the divergence between the target and predicted probability distributions; with MSE, the outputs are not treated as a probability distribution (they need not be bounded in [0, 1], nor need they sum to 1 for each training instance). Even so, we observe that MSE and cross-entropy yield almost the same loss and accuracy during training/validation. The two losses are sketched below.
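For reference, a sketch of the two losses as typically computed for softmax outputs probs of shape (N, 10) and integer labels y (the exact implementation is in our linked code):

```python
import numpy as np

def cross_entropy(probs, y):
    """Mean negative log-probability of the true class."""
    return -np.mean(np.log(probs[np.arange(len(y)), y] + 1e-12))

def mse(probs, y):
    """Mean squared error against one-hot targets."""
    onehot = np.eye(probs.shape[1])[y]
    return np.mean((probs - onehot) ** 2)
```

With a softmax output layer, cross-entropy gives the simple pre-activation gradient probs - onehot, while the MSE gradient carries an extra softmax-derivative factor that shrinks as the outputs saturate, which is one reason cross-entropy typically trains faster.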



[wandb panel: Run set (2 runs)]


Question 9 (10 Marks)

Paste a link to your github code for this assignment

  • GitHub Link: https://github.com/VarunGumma/CS6910-Assignment-1

  • Report Link: https://wandb.ai/cs21m070_cs21m022/IITM-CS6910-Projects/reports/CS6910-Assignment-1--VmlldzoxNTM0ODEw

  • We will check for coding style, clarity in using functions and a README file with clear instructions on training and evaluating the model (the 10 marks will be based on this)

  • We will also run a plagiarism check to ensure that the code is not copied (0 marks in the assignment if we find that the code is plagiarised)

  • We will check the number of commits made by the two team members and then give marks accordingly. For example, if we see 70% of the commits were made by one team member then that member will get more marks in the assignment (note that this contribution will decide the marks split for the entire assignment and not just this question).

  • We will also check if the training and test data have been split properly and randomly. You will get 0 marks on the assignment if we find any cheating (e.g., adding test data to training data) to get higher accuracy.



Question 10 (10 Marks)

Based on your learnings above, give me 3 recommendations for what would work for the MNIST dataset (not Fashion-MNIST). Just to be clear, I am asking you to take your learnings based on extensive experimentation with one dataset and see if these learnings help on another dataset. If I give you a budget of running only 3 hyperparameter configurations, as opposed to the large number of experiments you have run above, which 3 would you use and why? Report the accuracies that you obtain using these 3 configurations.

We took the three hyperparameter configurations which gave the best accuracy on the Fashion MNIST dataset:

Configuration-1

  • no. of hidden layers: 3
  • no. of hidden units (in each layer): 128
  • activation: LeakyReLU
  • batch_size: 64
  • epochs: 15
  • l2: 0.05
  • lr: 0.01
  • optimizer: Adagrad
  • weight initialization: Xavier
  • Loss: Cross-Entropy
  • MNIST train accuracy: 99.88%
  • MNIST validation accuracy: 98.13%
  • MNIST test accuracy: 97.92%

Configuration-2

  • no. of hidden layers: 4
  • no. of hidden units (in each layer): 128
  • activation: LeakyReLU
  • batch_size: 16
  • epochs: 15
  • l2: 0.0
  • lr: 0.0001
  • optimizer: Adam
  • weight initialization: Xavier
  • Loss: Cross-Entropy
  • MNIST train accuracy: 99.40%
  • MNIST validation accuracy: 97.77%
  • MNIST test accuracy: 97.37%

Configuration-3

  • no. of hidden layers: 4
  • no. of hidden units (in each layer): 64
  • activation: ReLU
  • batch_size: 64
  • epochs: 10
  • l2: 0.05
  • lr: 0.01
  • optimizer: Adagrad
  • weight initialization: Xavier
  • Loss: Cross-Entropy
  • MNIST train accuracy: 98.90%
  • MNIST validation accuracy: 97.58%
  • MNIST test accuracy: 97.49%


Self Declaration

Contributions of the two team members:

CS21M070: (100% contribution)

  • Plotted sample images
  • Built the Feed Forward Neural Network class
  • Implemented SGD, MGD, NAG, Adagrad, RMSprop, Adam and Nadam
  • Analysed the performance of the network on various hyper-parameter settings
  • Plotted sweeps, parallel-plots, confusion matrix
  • Ran sweeps for MSE vs CrossEntropy
  • Analysed and wrote inferences for the MNIST dataset

CS21M022: (100% contribution)

  • Plotted sample images
  • Built the Feed Forward Neural Network class
  • Implemented SGD, MGD, NAG, Adagrad, RMSprop, Adam and Nadam
  • Analysed the performance of the network on various hyper-parameter settings
  • Plotted sweeps, parallel-plots, confusion matrix
  • Ran sweeps for MSE vs CrossEntropy
  • Analysed and wrote inferences for the MNIST dataset

We, Varun Gumma and Hanumantappa Budihal, swear on our honour that the above declaration is correct.