
Assignment-1 Report

Feedforward neural network implementation in Python and visualisation using wandb.
Created on February 27 | Last edited on March 13

Problem Statement

In this assignment you need to implement a feedforward neural network and write the backpropagation code for training the network. We strongly recommend using numpy for all matrix/vector operations. You are not allowed to use any automatic differentiation packages. This network will be trained and tested using the Fashion-MNIST dataset. Specifically, given an input image (28 x 28 = 784 pixels) from the Fashion-MNIST dataset, the network will be trained to classify the image into 1 of 10 classes.


Question 1 (2 Marks)

Download the Fashion-MNIST dataset and plot 1 sample image for each class as shown in the grid below. Use "from keras.datasets import fashion_mnist" to get the Fashion-MNIST dataset.
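Below is a minimal sketch of one way this can be done: it loads Fashion-MNIST, picks the first training image of each class and logs them to wandb. The project and run names here are only illustrative, not necessarily the ones used in the panel below.

```python
import numpy as np
import wandb
from keras.datasets import fashion_mnist

class_names = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
               "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]

(X_train, y_train), (X_test, y_test) = fashion_mnist.load_data()

wandb.init(project="assignment-1", name="class-samples")      # illustrative names
samples = []
for label in range(10):
    idx = np.argmax(y_train == label)                         # first image of this class
    samples.append(wandb.Image(X_train[idx], caption=class_names[label]))
wandb.log({"examples": samples})
wandb.finish()
```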



[Panel: Run Set 1 (1 run)]


Question 2 (10 Marks)

Implement a feedforward neural network which takes images from the Fashion-MNIST data as input and outputs a probability distribution over the 10 classes.
Your code should be flexible so that it is easy to change the number of hidden layers and the number of neurons in each hidden layer.
We will check the code for implementation and ease of use.
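A minimal numpy-only sketch of one way such a network can be organised is given below (not necessarily our exact implementation): the number of hidden layers and the neurons per layer are plain constructor arguments, and the output layer is a softmax over the 10 classes. The tanh hidden activation and the 0.01 weight scale are illustrative choices.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

class FeedForwardNN:
    def __init__(self, input_size=784, hidden_sizes=(64, 64, 64), output_size=10):
        sizes = [input_size, *hidden_sizes, output_size]
        self.W = [np.random.randn(m, n) * 0.01 for m, n in zip(sizes[:-1], sizes[1:])]
        self.b = [np.zeros((1, n)) for n in sizes[1:]]

    def forward(self, X):
        a = X
        for W, b in zip(self.W[:-1], self.b[:-1]):
            a = np.tanh(a @ W + b)                    # hidden layers
        return softmax(a @ self.W[-1] + self.b[-1])   # probability distribution over classes

# usage: class probabilities for a batch of 5 flattened 28x28 images
probs = FeedForwardNN(hidden_sizes=(128, 64)).forward(np.random.rand(5, 784))
```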


Question 3 (18 Marks)

Implement the backpropagation algorithm with support for the following optimisation functions
  • sgd
  • momentum based gradient descent
  • nesterov accelerated gradient descent
  • rmsprop
  • adam
  • nadam
(12 marks for the backpropagation framework and 2 marks for each of the optimisation algorithms above)
We will check the code for implementation and ease of use (e.g., how easy it is to add a new optimisation algorithm such as Eve). Note that the code should be flexible enough to work with different batch sizes.
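One way to keep the optimiser code pluggable is sketched below, assuming parameters and gradients are kept as lists of numpy arrays: every optimiser exposes only an update() method, so adding a new algorithm such as Eve means adding one more subclass. SGD and momentum are shown; the remaining optimisers follow the same pattern.

```python
import numpy as np

class Optimizer:
    # Base class: subclasses update parameters in place from gradients.
    def update(self, params, grads):
        raise NotImplementedError

class SGD(Optimizer):
    def __init__(self, lr=1e-3):
        self.lr = lr
    def update(self, params, grads):
        for p, g in zip(params, grads):
            p -= self.lr * g

class Momentum(Optimizer):
    def __init__(self, lr=1e-3, beta=0.9):
        self.lr, self.beta, self.v = lr, beta, None
    def update(self, params, grads):
        if self.v is None:
            self.v = [np.zeros_like(p) for p in params]   # one velocity buffer per parameter
        for p, g, v in zip(params, grads, self.v):
            v *= self.beta
            v += self.lr * g
            p -= v
```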


Question 4 (10 Marks)

Use the sweep functionality provided by wandb to find the best values for the hyperparameters listed below. Use the standard train/test split of fashion_mnist (use (X_train, y_train), (X_test, y_test) = fashion_mnist.load_data()). Keep 10% of the training data aside as validation data for this hyperparameter search. Here are some suggestions for different values to try for the hyperparameters. As you can quickly see, this leads to an exponential number of combinations. You will have to think about strategies to do this hyperparameter search efficiently. Check out the options provided by wandb.sweep and write down what strategy you chose and why.
  • number of epochs: 5, 10
  • number of hidden layers: 3, 4, 5
  • size of every hidden layer: 32, 64, 128
  • weight decay (L2 regularisation): 0, 0.0005, 0.5
  • learning rate: 1e-3, 1e-4
  • optimizer: sgd, momentum, nesterov, rmsprop, adam, nadam
  • batch size: 16, 32, 64
  • weight initialisation: random, Xavier
  • activation functions: sigmoid, tanh, ReLU
wandb will automatically generate the following plots. Paste these plots below using the "Add Panel to Report" feature. Make sure you use meaningful names for each sweep (e.g. hl_3_bs_16_ac_tanh to indicate that there were 3 hidden layers, the batch size was 16 and the activation function was tanh) instead of using the default names (whole-sweep, kind-sweep) given by wandb.
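A minimal sketch of such a sweep configuration is shown below, assuming a Bayesian search strategy (method="bayes") so that wandb samples promising regions instead of enumerating every combination; the metric name val_accuracy, the project name and the train() function are placeholders.

```python
import wandb

sweep_config = {
    "method": "bayes",                                    # assumed search strategy
    "metric": {"name": "val_accuracy", "goal": "maximize"},
    "parameters": {
        "epochs":        {"values": [5, 10]},
        "hidden_layers": {"values": [3, 4, 5]},
        "hidden_size":   {"values": [32, 64, 128]},
        "weight_decay":  {"values": [0, 0.0005, 0.5]},
        "learning_rate": {"values": [1e-3, 1e-4]},
        "optimizer":     {"values": ["sgd", "momentum", "nesterov", "rmsprop", "adam", "nadam"]},
        "batch_size":    {"values": [16, 32, 64]},
        "weight_init":   {"values": ["random", "xavier"]},
        "activation":    {"values": ["sigmoid", "tanh", "relu"]},
    },
}

# train() is a placeholder that builds the network from wandb.config and logs val_accuracy
# sweep_id = wandb.sweep(sweep_config, project="assignment-1")
# wandb.agent(sweep_id, function=train, count=134)
```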



[Panel: Run set 2 (134 runs)]


Question 5 (5 marks)

We would like to see the best accuracy on the validation set across all the models that you train.
wandb automatically generates this plot, which summarises the best validation accuracy of all the models that you tested. Please paste this plot below using the "Add Panel to Report" feature.



[Panel: Run set 2 (134 runs)]


Question 6 (20 Marks)

Based on the different experiments that you have run we want you to make some inferences about which configurations worked and which did not.
Here again, wandb automatically generates a "Parallel co-ordinates plot" and a "correlation summary" as shown below. Learn about a "Parallel co-ordinates plot" and how to read it.
By looking at the plots that you get, write down some interesting observations (simple bullet points but should be insightful). You can also refer to the plot in Question 5 while writing these insights. For example, in the above sample plot there are many configurations which give less than 65% accuracy. I would like to zoom into those and see what is happening.
I would also like to see a recommendation for what configuration to use to get close to 95% accuracy.



[Panel: Run set 2 (134 runs)]


Question 7 (10 Marks)

For the best model identified above, report the accuracy on the test set of fashion_mnist and plot the confusion matrix as shown below. More marks for creativity (fewer marks for producing the plot shown below as it is).
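A minimal sketch of logging the test accuracy and confusion matrix to wandb is given below; best_model and its forward() method are hypothetical placeholders for the best network found in the sweep, and the project/run names are illustrative.

```python
import numpy as np
import wandb
from keras.datasets import fashion_mnist

class_names = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
               "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]

(_, _), (X_test, y_test) = fashion_mnist.load_data()
X_test = X_test.reshape(-1, 784) / 255.0

probs = best_model.forward(X_test)        # hypothetical: best model from the sweep
y_pred = np.argmax(probs, axis=1)

wandb.init(project="assignment-1", name="best-model-test")    # illustrative names
wandb.log({
    "test_accuracy": float(np.mean(y_pred == y_test)),
    "confusion_matrix": wandb.plot.confusion_matrix(
        y_true=y_test.tolist(), preds=y_pred.tolist(), class_names=class_names),
})
wandb.finish()
```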



[Panel: Sweep 7c0nik6g (17 runs)]


Question 8 (5 Marks)

In all the models above you would have used cross entropy loss. Now compare the cross entropy loss with the squared error loss. I would again like to see some automatically generated plots or your own plots to convince me whether one is better than the other.
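For reference, a minimal numpy sketch of the two losses being compared, both computed on softmax outputs against one-hot targets and averaged over the batch:

```python
import numpy as np

def cross_entropy_loss(probs, y_onehot, eps=1e-12):
    # eps avoids log(0) when a predicted probability is exactly zero
    return -np.mean(np.sum(y_onehot * np.log(probs + eps), axis=1))

def squared_error_loss(probs, y_onehot):
    return np.mean(np.sum((probs - y_onehot) ** 2, axis=1))
```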


Question 9 (10 Marks)

GitHub links:
For a better understanding of neural networks, we each developed the neural network separately in our own GitHub repositories. Because of this, the commit histories are split across repositories, but the final wandb process was carried out equally by both of us.
Secondary git repository: https://github.com/cs20m072/cs6910





Question 10 (10 Marks)

Based on your learnings above, give me 3 recommendations for what would work for the MNIST dataset (not Fashion-MNIST). Just to be clear, I am asking you to take your learnings based on extensive experimentation with one dataset and see if these learnings help on another dataset. If I give you a budget of running only 3 hyperparameter configurations as opposed to the large number of experiments you have run above, then which 3 would you use and why? Report the accuracies that you obtain using these 3 configurations.
Learnings from the experiments:
  • A limitation of ReLU is that large weight updates can make the summed input to the activation function always negative, regardless of the input to the network. Such a node will forever output an activation value of 0.0; this is referred to as a "dying ReLU".
  • Tanh takes less time than sigmoid and has comparable performance.
  • From the limited number of random experiments, we found that Nesterov accelerated gradient descent gives relatively better validation accuracy than the other optimizers.
  • 10 epochs with a learning rate of 0.001 would be a good choice for the MNIST dataset.

[Panel: Run set (134 runs)]



Self Declaration

List down the contributions of the two team members:
CS20M072 : (55% contribution)
  • implementing the feedforward neural network ...
  • implementing SGD, nesterov, adam, nadam, momentum, rmsprop
  • implementing cross entropy and mean squared error loss

CS20M064 : (45% contribution)
  • implementing the feedforward neural network (as mentioned in Question 9, we developed the neural network separately)
  • code refactoring and object-oriented implementation
  • visualisation of graphs and plots
  • setting up the sweep in wandb
  • confusion matrix

We, Shubham and Vivek, swear on our honour that the above declaration is correct.