
CS6910 Assignment 1

Implementation of Feedforward Neural Network using NumPy library.
Created on March 15 | Last edited on March 15

Submitted by

Saish Jaiswal (CS20D405)

Ishika Gupta (CS20S057)


Instructions

  • The goal of this assignment is twofold: (i) implement and use gradient descent (and its variants) with backpropagation for a classification task, and (ii) get familiar with wandb, which is a cool tool for running and keeping track of a large number of experiments.
  • We strongly recommend that you work on this assignment in a team of size 2. Both the members of the team are expected to work together (in a subsequent viva both members will be expected to answer questions, explain the code, etc).
  • Collaborations and discussions with other groups are strictly prohibited.
  • You must use Python (numpy and pandas) for your implementation.
  • You cannot use the following packages from keras, pytorch, tensorflow: optimizers, layers
  • If you are using any packages from keras, pytorch, tensorflow then post on moodle first to check with the instructor.
  • You can run the code in a jupyter notebook on colab by enabling GPUs.
  • You have to generate the report in the same format as shown below using wandb.ai. You can start by cloning this report using the clone option above. Most of the plots that we have asked for below can be (automatically) generated using the apis provided by wandb.ai. You will upload a link to this report on gradescope.
  • You also need to provide a link to your github code as shown below. Follow good software engineering practices and set up a github repo for the project on Day 1. Please do not write all code on your local machine and push everything to github on the last day. The commits in github should reflect how the code has evolved during the course of the assignment.
  • You have to check moodle regularly for updates regarding the assignment.


Problem Statement

In this assignment, you need to implement a feedforward neural network and write the backpropagation code for training the network. We strongly recommend using numpy for all matrix/vector operations. You are not allowed to use any automatic differentiation packages. This network will be trained and tested using the Fashion-MNIST dataset. Specifically, given an input image (28 x 28 = 784 pixels) from the Fashion-MNIST dataset, the network will be trained to classify the image into 1 of 10 classes.



Question 1 (2 Marks)

Download the fashion-MNIST dataset and plot 1 sample image for each class as shown in the grid below. Use "from keras.datasets import fashion_mnist" for getting the fashion mnist dataset.
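
A minimal sketch of how such a grid of sample images can be plotted, assuming matplotlib is available; the class names follow the standard Fashion-MNIST label ordering:

```python
import matplotlib.pyplot as plt
import numpy as np
from keras.datasets import fashion_mnist

# Standard Fashion-MNIST class names (labels 0..9)
class_names = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
               "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]

(X_train, y_train), (X_test, y_test) = fashion_mnist.load_data()

fig, axes = plt.subplots(2, 5, figsize=(10, 5))
for label, ax in enumerate(axes.flat):
    # Pick the first training image belonging to this class
    idx = np.argmax(y_train == label)
    ax.imshow(X_train[idx], cmap="gray")
    ax.set_title(class_names[label])
    ax.axis("off")
plt.tight_layout()
plt.show()
```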




[Panel: one sample image for each of the 10 Fashion-MNIST classes (Run set 2, 10 runs)]


Question 2 (10 Marks)

Implement a feedforward neural network which takes images from the fashion-mnist data as input and outputs a probability distribution over the 10 classes.

Your code should be flexible so that it is easy to change the number of hidden layers and the number of neurons in each hidden layer.

We will check the code for implementation and ease of use.

The code is uploaded on GitHub.
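
Below is a minimal sketch of the kind of flexible forward pass involved, assuming sigmoid hidden activations and a softmax output; the function names and weight layout are illustrative, not our exact implementation:

```python
import numpy as np

def init_params(layer_sizes, seed=0):
    """Random initialization for a network with an arbitrary number of hidden layers."""
    rng = np.random.default_rng(seed)
    params = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        W = rng.normal(0, 0.1, size=(n_out, n_in))
        b = np.zeros((n_out, 1))
        params.append((W, b))
    return params

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    z = z - z.max(axis=0, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=0, keepdims=True)

def forward(x, params):
    """x: (784, batch) input columns; returns class probabilities of shape (10, batch)."""
    a = x
    for W, b in params[:-1]:
        a = sigmoid(W @ a + b)              # hidden layers
    W, b = params[-1]
    return softmax(W @ a + b)               # output layer: distribution over 10 classes

# e.g. 784 -> 128 -> 128 -> 10; the number and size of hidden layers is easy to change
params = init_params([784, 128, 128, 10])
probs = forward(np.random.rand(784, 32), params)   # probabilities for a batch of 32 inputs
```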



Question 3 (18 Marks)

Implement the backpropagation algorithm with support for the following optimisation functions

  • sgd
  • momentum based gradient descent
  • nesterov accelerated gradient descent
  • rmsprop
  • adam
  • nadam

(12 marks for the backpropagation framework and 2 marks for each of the optimisation algorithms above)

We will check the code for implementation and ease of use (e.g., how easy it is to add a new optimisation algorithm such as Eve). Note that the code should be flexible enough to work with different batch sizes.

The code is uploaded on GitHub.
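
To illustrate how an optimizer plugs into the backpropagation framework, here is a hedged sketch of a momentum-based update; the class name and the step(weights, grads) interface are illustrative, not the exact ones in our repository:

```python
import numpy as np

class Momentum:
    """Momentum-based gradient descent: u_t = beta*u_{t-1} + grad; w = w - eta*u_t."""
    def __init__(self, eta=1e-3, beta=0.9):
        self.eta, self.beta = eta, beta
        self.u = None   # history (velocity) terms, created lazily

    def step(self, weights, grads):
        if self.u is None:
            self.u = [np.zeros_like(w) for w in weights]
        for i, (w, g) in enumerate(zip(weights, grads)):
            self.u[i] = self.beta * self.u[i] + g
            w -= self.eta * self.u[i]       # in-place parameter update

# A new optimizer (e.g. Eve) only needs to implement the same step(weights, grads) interface.
```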



Question 4 (10 Marks)

Use the sweep functionality provided by wandb to find the best values for the hyperparameters listed below. Use the standard train/test split of fashion_mnist (use (X_train, y_train), (X_test, y_test) = fashion_mnist.load_data()). Keep 10% of the training data aside as validation data for this hyperparameter search. Here are some suggestions for different values to try for the hyperparameters. As you can quickly see, this leads to an exponential number of combinations. You will have to think about strategies to do this hyperparameter search efficiently. Check out the options provided by wandb.sweep and write down what strategy you chose and why.

  • number of epochs: 5, 10
  • number of hidden layers: 3, 4, 5
  • size of every hidden layer: 32, 64, 128
  • weight decay (L2 regularisation): 0, 0.0005, 0.5
  • learning rate: 1e-3, 1e-4
  • optimizer: sgd, momentum, nesterov, rmsprop, adam, nadam
  • batch size: 16, 32, 64
  • weight initialization: random, Xavier
  • activation functions: sigmoid, tanh, ReLU

wandb will automatically generate the following plots. Paste these plots below using the "Add Panel to Report" feature. Make sure you use meaningful names for each sweep (e.g. hl_3_bs_16_ac_tanh to indicate that there were 3 hidden layers, batch size was 16, and the activation function was tanh) instead of using the default names (whole-sweep, kind-sweep) given by wandb.

Sweep Strategy Used

  • Wandb offers three hyperparameter search strategies: Grid, Random, and Bayes.
  • We need to explore an exponentially large space of hyperparameter configurations.
  • Grid search iterates over all possible combinations of parameter values, which is computationally very expensive here.
  • Bayesian optimization uses a Gaussian process to model the objective and then chooses parameters that maximize the probability of improvement. This strategy requires a metric key to be specified, and it does not scale well to a large number of hyperparameters.
  • Random search samples random combinations of values, so we used this strategy for our hyperparameter sweeps (a sample sweep configuration is sketched below).
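
A minimal sketch of the kind of random-search sweep configuration described above; the parameter keys and the train() function are illustrative and must match whatever the training script reads from wandb.config:

```python
import wandb

sweep_config = {
    "method": "random",                     # random search strategy
    "metric": {"name": "val_accuracy", "goal": "maximize"},
    "parameters": {
        "epochs":        {"values": [5, 10]},
        "num_layers":    {"values": [3, 4, 5]},
        "hidden_size":   {"values": [32, 64, 128]},
        "weight_decay":  {"values": [0, 0.0005, 0.5]},
        "learning_rate": {"values": [1e-3, 1e-4]},
        "optimizer":     {"values": ["sgd", "momentum", "nesterov", "rmsprop", "adam", "nadam"]},
        "batch_size":    {"values": [16, 32, 64]},
        "weight_init":   {"values": ["random", "xavier"]},
        "activation":    {"values": ["sigmoid", "tanh", "relu"]},
    },
}

sweep_id = wandb.sweep(sweep_config, project="Deep-Learning")
# wandb.agent(sweep_id, function=train, count=50)  # train() reads wandb.config and logs val_accuracy
```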



[Panel: automatically generated sweep plots (Run set 2, 41 runs)]


Question 5 (5 marks)

We would like to see the best accuracy on the validation set across all the models that you train.

wandb automatically generates this plot, which summarises the accuracy of all the models that you tested. Please paste this plot below using the "Add Panel to Report" feature.

We got the highest validation accuracy of 88.417% for the best configuration, as shown in the following plot.




[Panel: summary of validation accuracy across all runs (Run set 2, 36 runs)]


Question 6 (20 Marks)

Based on the different experiments that you have run we want you to make some inferences about which configurations worked and which did not.

Here again, wandb automatically generates a "Parallel co-ordinates plot" and a "correlation summary" as shown below. Learn about a "Parallel co-ordinates plot" and how to read it.

By looking at the plots that you get, write down some interesting observations (simple bullet points but should be insightful). You can also refer to the plot in Question 5 while writing these insights. For example, in the above sample plot there are many configurations which give less than 65% accuracy. I would like to zoom into those and see what is happening.

I would also like to see a recommendation for what configuration to use to get close to 95% accuracy.

Observations

  • The Nadam optimizer and the ReLU activation function worked quite well in most cases. ReLU may perform well because its gradient is a constant (1) for positive inputs, which reduces the impact of vanishing gradients as the number of hidden layers increases.
  • We got the highest validation accuracy of 88.417% for the following configuration: Number of epochs: 5, Number of hidden layers: 5, Size of each hidden layer: 128, Learning rate: 0.001, Optimizer: nadam, Batch size: 32, Weight initialization: random, Activation: relu, Loss: cross_entropy.
  • We got the lowest validation accuracy of 10.133% for the following configuration: Number of epochs: 10, Number of hidden layers: 4, Size of each hidden layer: 32, Learning rate: 0.001, Optimizer: adam, Batch size: 16, Weight initialization: random, Activation: relu, Loss: squared_error.
  • Adam works well only with certain hyperparameter configurations.
  • For a given set of hyperparameters, the cross-entropy loss performed better than the squared-error loss. Since the outputs are probability values, cross-entropy is the better choice.
  • We have not tried L2 regularization, but the best configuration above, combined with Xavier initialization, L2 regularization, and data augmentation, may give an accuracy close to 95%.
  • We also observed that normalizing the data to zero mean and unit variance helps improve the performance of the model.



[Panel: parallel co-ordinates plot and correlation summary (Run set 2, 34 runs)]


Question 7 (10 Marks)

For the best model identified above, report the accuracy on the test set of fashion_mnist and plot the confusion matrix as shown below. More marks for creativity (fewer marks for producing the plot shown below as it is).

We got the best accuracy on the Fashion MNIST dataset for the following configuration:

  • Number of epochs: 5
  • Number of hidden layers: 5
  • Size of each hidden layer: 128
  • Learning rate: 0.001
  • Optimizer: nadam
  • Batch size: 32
  • Weight Initialization: random
  • Activation: relu
  • Loss: cross_entropy
  • Fashion MNIST Training Accuracy: 92.098%
  • Fashion MNIST Validation Accuracy: 88.417%
  • Fashion MNIST Testing Accuracy: 87.76%

The following is the confusion matrix for the same configuration.




[Panel: confusion matrix for the best configuration (Run set 2, 1 run)]
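
A minimal sketch of how such a confusion-matrix panel can be logged to wandb; here y_test and y_pred are assumed to hold the true and predicted test labels from the best model (the names are illustrative):

```python
import wandb

# Standard Fashion-MNIST class names (labels 0..9)
class_names = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
               "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]

# y_test, y_pred: true and predicted labels for the test set (assumed to be
# computed already); wandb.init(...) is assumed to have been called.
wandb.log({"confusion_matrix": wandb.plot.confusion_matrix(
    y_true=y_test, preds=y_pred, class_names=class_names)})
```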



Question 8 (5 Marks)

In all the models above you would have used cross-entropy loss. Now compare the cross-entropy loss with the squared error loss. I would again like to see some automatically generated plots or your own plots to convince me whether one is better than the other.

We used the following hyperparameter configuration:

  • Number of epochs: 5
  • Number of hidden layers: 5
  • Size of each hidden layer: 128
  • Learning rate: 0.001
  • Optimizer: nadam
  • Batch size: 32
  • Weight Initialization: random
  • Activation: relu

We ran two experiments with this same hyperparameter configuration, one using the cross-entropy loss and the other using the squared-error loss, and found that the cross-entropy loss performed better than the squared-error loss.
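
For reference, a minimal sketch of the two loss functions being compared, written in NumPy for a batch of one-hot targets Y and predicted probabilities P; the averaging convention is an assumption and may differ from our implementation:

```python
import numpy as np

def cross_entropy_loss(Y, P, eps=1e-12):
    """Mean cross-entropy: -(1/N) * sum_i sum_k Y[i,k] * log(P[i,k])."""
    return -np.mean(np.sum(Y * np.log(P + eps), axis=1))

def squared_error_loss(Y, P):
    """Mean squared error between one-hot targets and predicted probabilities."""
    return np.mean(np.sum((Y - P) ** 2, axis=1))
```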




[Panel: cross-entropy vs. squared-error loss comparison (Run set, 2 runs)]




Question 9 (10 Marks)

GitHub Repository: https://github.com/SaishJaiswal/CS6910-Deep-Learning

Wandb Report Link: https://wandb.ai/saish/Deep-Learning/reports/CS6910-Assignment-1--Vmlldzo1MzI2OTE



Question 10 (10 Marks)

Since both MNIST and Fashion-MNIST are image datasets, and Fashion-MNIST is more complex than MNIST, we can reuse the hyperparameter configurations that worked best on Fashion-MNIST to train the model on MNIST. The following are the three configurations that worked best for Fashion-MNIST among the runs in our wandb sweep.

Configuration 1

  • Number of epochs: 5
  • Number of hidden layers: 5
  • Size of each hidden layer: 128
  • Learning rate: 0.001
  • Optimizer: nadam
  • Batch size: 32
  • Weight Initialization: random
  • Activation: relu
  • Loss: cross_entropy
  • MNIST Training Accuracy: 99.46%
  • MNIST Validation Accuracy: 97.40%
  • MNIST Testing Accuracy: 97.02%

Configuration 2

  • Number of epochs: 10
  • Number of hidden layers: 3
  • Size of each hidden layer: 64
  • Learning rate: 0.001
  • Optimizer: adam
  • Batch size: 64
  • Weight Initialization: random
  • Activation: relu
  • Loss: cross_entropy
  • MNIST Training Accuracy: 99.318%
  • MNIST Validation Accuracy: 97.033%
  • MNIST Testing Accuracy: 96.92%

Configuration 3

  • Number of epochs: 5
  • Number of hidden layers: 3
  • Size of each hidden layer: 128
  • Learning rate: 0.001
  • Optimizer: sgd
  • Batch size: 32
  • Weight Initialization: xavier
  • Activation: tanh
  • Loss: cross_entropy
  • MNIST Training Accuracy: 98.5%
  • MNIST Validation Accuracy: 96.18%
  • MNIST Testing Accuracy: 95.93%


Self Declaration

Contributions of the two team members:

CS20D405: (100% contribution)

  • Plotting 10 Fashion-MNIST Images
  • Implementing FeedForward Neural Network
  • Implementing Backpropagation Algorithm
  • Implementation of the six optimizers — sgd, mgd, nag, rmsprop, adam, nadam
  • Running sweeps in wandb
  • Comparing results for different hyperparameter configurations
  • Creating Confusion Matrix plot
  • Creating Parallel Co-ordinate chart
  • Analysis of the results for Cross-entropy loss and Squared-error loss
  • Running the algorithm for MNIST dataset

CS20S057: (100% contribution)

  • Plotting 10 Fashion-MNIST Images
  • Implementing FeedForward Neural Network
  • Implementing Backpropagation Algorithm
  • Implementation of the six optimizers — sgd, mgd, nag, rmsprop, adam, nadam
  • Running sweeps in wandb
  • Comparing results for different hyperparameter configurations
  • Creating Confusion Matrix plot
  • Creating Parallel Co-ordinate chart
  • Analysis of the results for Cross-entropy loss and Squared-error loss
  • Running the algorithm for MNIST dataset

We, Saish Jaiswal and Ishika Gupta, swear on our honour that the above declaration is correct.

Number of late-days used: 2