CS6910 Assignment-1: Feedforward Neural Network Implementation from Scratch
Problem Statement
In this assignment one needs to implement a feedforward neural network and write the backpropagation code for training it. All matrix/vector operations use numpy, and no automatic differentiation packages are allowed. The network is tested on the Fashion-MNIST dataset: given an input image (28 x 28 = 784 pixels), the network is trained to classify it into 1 of 10 classes.
10 different classes in the fashion-MNIST dataset and their Sample Images
There are 10 different classes present in the Fashion-MNIST dataset. The class labels are as follows:
{0: 'T-shirt/Top', 1:'Trouser', 2:'Pullover', 3:'Dress', 4:'Coat', 5:'Sandal', 6:'Shirt', 7:'Sneaker', 8:'Bag' ,9:'Ankle Boot'}
A sample image from each class is displayed below.
Implementation of feedforward network:
A neural network class has been made which handles: initialization of weights & biases, forward propagation, backward propagation, prediction of the final output, accuracy scoring, and plotting of the loss curves (both training and validation loss) while the network is being fit.
The neural network class takes 5 arguments at initialization: num_layers (total number of layers in the network, including the input and output layers), activation function (the activation used in the hidden layers), loss ('mse' or 'cross_entropy'), batch_size, and lambda_val (the L2 regularizer value, i.e. weight decay). A minimal sketch of this interface is shown below.
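The attribute names in this sketch are illustrative, not necessarily the exact ones in the repository:

```python
import numpy as np

class NeuralNetwork:
    # Sketch of the class interface described above (illustrative; the actual
    # repository code may differ in names and details).
    def __init__(self, num_layers, activation, loss='cross_entropy',
                 batch_size=32, lambda_val=0.0):
        self.num_layers = num_layers    # total layers, including input and output layer
        self.activation = activation    # activation used in the hidden layers
        self.loss = loss                # 'mse' or 'cross_entropy'
        self.batch_size = batch_size    # mini-batch size used during fitting
        self.lambda_val = lambda_val    # L2 regularizer value (weight decay)
```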
Early stopping has also been implemented to avoid overfitting. Because of this, some models in the sweeps stopped before reaching their default number of epochs; this can be verified by comparing the epoch at which a run stopped with its configured number of epochs. A sketch of the early-stopping logic is shown below.
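A minimal sketch of such an early-stopping loop, assuming a patience of 5 epochs and hypothetical callables for training and validation (not the exact repository code):

```python
def fit_with_early_stopping(train_one_epoch, compute_val_loss, max_epochs=30, patience=5):
    """Sketch of early stopping on the validation loss. train_one_epoch and
    compute_val_loss are assumed callables supplied by the training loop."""
    best_val_loss, wait = float('inf'), 0
    for epoch in range(max_epochs):
        train_one_epoch()              # one pass over the training data
        val_loss = compute_val_loss()  # loss on the held-out validation set
        if val_loss < best_val_loss:
            best_val_loss, wait = val_loss, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch + 1       # stopped early: fewer epochs than max_epochs
    return max_epochs
```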
Initialization of Weight-Bias parameters & forward-propagation
Inside the neural network class, two functions handle parameter initialization and forward propagation: 'weight_bias_initialize' and 'forward_propagation'.
'weight_bias_initialize' takes 2 arguments: a list with the number of neurons per layer and a weight initializer. The initializer can be either 'Xavier' or 'Random'; depending on the choice, the initial weights and biases are set accordingly. The function returns 4 outputs: 'parameter' (the current parameters), 'old_parameter' (used by momentum-based and rewritten Nesterov gradient descent), and 'v' & 'm' (both copies of 'old_parameter', used by RMSprop, Adam & Nadam). All the outputs are Python dictionaries. A sketch of this function is given below.
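A minimal sketch of what this initialization might look like, assuming a dictionary layout with keys 'W1', 'b1', ... (here the extra dictionaries start as zeros, which is one reasonable choice; the repository code may differ):

```python
import numpy as np

def weight_bias_initialize(layer_sizes, initializer='Xavier', seed=0):
    """Sketch: layer_sizes lists neurons per layer, e.g. [784, 128, 128, 128, 10]."""
    rng = np.random.default_rng(seed)
    parameter = {}
    for i in range(1, len(layer_sizes)):
        fan_in, fan_out = layer_sizes[i - 1], layer_sizes[i]
        if initializer == 'Xavier':
            # Xavier/Glorot: uniform with variance scaled by fan-in and fan-out
            limit = np.sqrt(6.0 / (fan_in + fan_out))
            W = rng.uniform(-limit, limit, size=(fan_out, fan_in))
        else:  # 'Random': small Gaussian values
            W = rng.standard_normal((fan_out, fan_in)) * 0.01
        parameter[f'W{i}'] = W
        parameter[f'b{i}'] = np.zeros((fan_out, 1))
    # old_parameter (momentum/NAG) and v, m (RMSprop/Adam/Nadam) mirror the same layout
    old_parameter = {k: np.zeros_like(p) for k, p in parameter.items()}
    v = {k: np.zeros_like(p) for k, p in parameter.items()}
    m = {k: np.zeros_like(p) for k, p in parameter.items()}
    return parameter, old_parameter, v, m
```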
The 'forward_propagation' function takes 2 arguments: the training data and the current parameters (weights and biases). It returns 3 outputs: y_pred (the predicted probability distribution), the activations ($h_i$) and the pre-activations ($a_i$). The activations and pre-activations are Python dictionaries and are needed for calculating gradients during backpropagation.
One thing to keep in mind: the softmax activation function is always used at the final layer; the neural network class does not allow changing that. A minimal sketch of the forward pass is shown below.
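This sketch assumes column-major batches of shape (784, batch_size) and tanh hidden layers, with softmax fixed at the output as noted above (illustrative; the repository code may differ):

```python
import numpy as np

def softmax(a):
    # Numerically stable softmax over the class dimension (rows = classes)
    a = a - a.max(axis=0, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=0, keepdims=True)

def forward_propagation(X, parameter, num_hidden, g=np.tanh):
    """Sketch: X has shape (784, batch_size); num_hidden is the number of hidden layers.
    Returns y_pred plus the activation (h_i) and pre-activation (a_i) dictionaries."""
    H, A = {'h0': X}, {}
    for i in range(1, num_hidden + 1):
        A[f'a{i}'] = parameter[f'W{i}'] @ H[f'h{i-1}'] + parameter[f'b{i}']
        H[f'h{i}'] = g(A[f'a{i}'])
    k = num_hidden + 1
    # The output layer always uses softmax
    A[f'a{k}'] = parameter[f'W{k}'] @ H[f'h{k-1}'] + parameter[f'b{k}']
    y_pred = softmax(A[f'a{k}'])
    return y_pred, H, A
```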
Implementation of the backpropagation algorithm with different optimisation functions
The backpropagation algorithm has been implemented for both losses: cross-entropy & mean squared error.
The 'backpropagate' function in the neural network class takes 5 arguments: the predicted label (y_hat, a probability distribution), the true label (y_true), the activations ($h_i$), the pre-activations ($a_i$), and the parameters (weights and biases). The function returns a dictionary containing the gradient with respect to every weight and bias. A sketch is given below.
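A minimal sketch of such a backward pass for a mini-batch, assuming one-hot targets, the dictionary layout above, tanh hidden layers, and the L2 term folded into the weight gradients. The cross-entropy branch uses the standard softmax + cross-entropy shortcut, while the MSE branch goes through the softmax Jacobian:

```python
import numpy as np

def backpropagate(y_hat, y_true, H, A, parameter, num_hidden,
                  loss='cross_entropy', g_prime=lambda a: 1.0 - np.tanh(a) ** 2,
                  lambda_val=0.0):
    """Sketch: columns are examples, y_true is one-hot, g_prime is the derivative
    of the hidden activation. Returns gradients for every weight and bias."""
    grads = {}
    batch = y_true.shape[1]
    k = num_hidden + 1                      # index of the output layer
    if loss == 'cross_entropy':
        da = y_hat - y_true                 # gradient w.r.t. output pre-activation
    else:                                   # 'mse': multiply the error by the softmax Jacobian
        err = y_hat - y_true
        da = err * y_hat - y_hat * np.sum(err * y_hat, axis=0, keepdims=True)
    for i in range(k, 0, -1):
        grads[f'W{i}'] = da @ H[f'h{i-1}'].T / batch + lambda_val * parameter[f'W{i}']
        grads[f'b{i}'] = da.sum(axis=1, keepdims=True) / batch
        if i > 1:
            da = (parameter[f'W{i}'].T @ da) * g_prime(A[f'a{i-1}'])
    return grads
```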
Regarding optimisation, a class 'NN_optimizers' has been made which takes in 8 arguments: parameters, gradients, learning_rate, old_parameters, v, m, t and num_layers.
These optimisation functions have been implemented:
- SGD
- Momentum based gradient descent
- Nesterov accelerated gradient descent
- Rmsprop
- Adam
- Nadam
Initially, I was calling the neural network class to calculate the look-ahead gradients for Nesterov gradient descent; later I switched to the rewritten form of Nesterov gradient descent, which only needs gradients at the current parameters. Sketches of the Adam and rewritten Nesterov updates are shown below.
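For illustration, minimal sketches of two of the update rules, assuming the dictionary layout above. Here t is the step count used for Adam's bias correction, and the rewritten Nesterov form shown is the common reformulation that avoids an explicit look-ahead pass; the repository's exact code may differ:

```python
import numpy as np

def adam_update(parameter, grads, m, v, t, learning_rate=1e-4,
                beta1=0.9, beta2=0.999, eps=1e-8):
    # Sketch of one Adam step over every weight and bias
    for key in parameter:
        m[key] = beta1 * m[key] + (1 - beta1) * grads[key]        # first-moment estimate
        v[key] = beta2 * v[key] + (1 - beta2) * grads[key] ** 2   # second-moment estimate
        m_hat = m[key] / (1 - beta1 ** t)                         # bias correction
        v_hat = v[key] / (1 - beta2 ** t)
        parameter[key] -= learning_rate * m_hat / (np.sqrt(v_hat) + eps)
    return parameter, m, v

def nesterov_update(parameter, grads, u, learning_rate=1e-3, beta=0.9):
    # Sketch of the rewritten Nesterov step: gradients are taken at the current
    # parameters, and the look-ahead is folded into the update itself
    for key in parameter:
        u[key] = beta * u[key] + grads[key]
        parameter[key] -= learning_rate * (beta * u[key] + grads[key])
    return parameter, u
```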
Hyperparameter tuning using Wandb sweeps
10% of the training dataset was kept aside as a validation set for hyperparameter tuning.
The sweep functionality provided by wandb allows three search methods: grid, random, and Bayesian. Grid search was ruled out because it iterates over every combination of hyperparameter values, and the number of combinations here grows exponentially, so checking them all would take too long. Bayesian search could have been used, but it has trouble when the search space is high-dimensional. Random search was the best choice: it samples a random set of hyperparameters for each run and takes far less time for tuning, so I went with the random search strategy to find some of the best hyperparameter values. In total, 167 configurations were tried out (a sketch of the sweep configuration follows the search space list below).
Search Space:
- Number of epochs: 5, 15, 20, 30, 40
- Number of hidden layers: 3, 4, 5, 6
- Size of every hidden layer: 32, 64, 128, 256, 512
- Weight decay (L2 regularisation): 0, 0.05, 0.5
- Learning rate: 1e-2, 1e-3, 1e-4
- Optimizer: sgd, momentum, nesterov, rmsprop, adam, nadam
- Batch size: 32, 64, 128, 256
- Weight initialisation: random, Xavier
- Activation functions: sigmoid, tanh, relu, identity
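A minimal sketch of a wandb random-search sweep over this search space; the config keys and the train placeholder are assumed names, not necessarily the ones used in the repository:

```python
import wandb

def train():
    """Placeholder for the real training function: it would read wandb.config,
    build and fit the network, and log val_accuracy with wandb.log()."""
    with wandb.init():
        pass  # training code goes here

sweep_config = {
    'method': 'random',                                    # random search strategy
    'metric': {'name': 'val_accuracy', 'goal': 'maximize'},
    'parameters': {
        'epochs':        {'values': [5, 15, 20, 30, 40]},
        'num_hidden':    {'values': [3, 4, 5, 6]},
        'hidden_size':   {'values': [32, 64, 128, 256, 512]},
        'weight_decay':  {'values': [0, 0.05, 0.5]},
        'learning_rate': {'values': [1e-2, 1e-3, 1e-4]},
        'optimizer':     {'values': ['sgd', 'momentum', 'nesterov', 'rmsprop', 'adam', 'nadam']},
        'batch_size':    {'values': [32, 64, 128, 256]},
        'weight_init':   {'values': ['random', 'Xavier']},
        'activation':    {'values': ['sigmoid', 'tanh', 'relu', 'identity']},
    },
}

sweep_id = wandb.sweep(sweep_config, project='Assignment-1')
wandb.agent(sweep_id, function=train, count=167)  # 167 random configurations
```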
Best Configuration
The best validation accuracy achieved was 89.87% across all the configurations that were tried out.
The best configuration found by hyperparameter tuning is:
- Number of Neurons in every hidden layer: 128
- Number of hidden layers: 3
- Activation function: tanh
- Learning Rate: 0.0001
- Initializer: Xavier
- Optimizer: Adam
- Batch Size: 32
- Weight Decay ($\lambda$): 0
- Epochs: 30
Parallel Coordinates Plot
- Around 60 of the 167 models had a validation accuracy of less than 10%, with similar training accuracy. These models did not get trained at all; they appear to be stuck on a flat region of the error surface.
- From the parameter importance plot, it is seen that neurons per layer, number of hidden layers, learning rate, and batch size play an important role in determining the validation accuracy
- It is interesting to note that even though the identity activation function adds no non-linearity to the network, 29 of the 43 configurations using it in the hidden layers had a validation accuracy greater than 80%, the highest being 86.13%.
- The sigmoid activation function did not perform well: out of 35 configurations, only 6 had a validation accuracy greater than 80%, the highest being 87.72%. This is likely due to the vanishing gradient problem with sigmoid. Tanh and ReLU in general performed better than sigmoid and identity; most configurations using these two activations had a validation accuracy greater than 80%.
- Xavier initialization worked better than random initialization: 59 of the 79 configurations with Xavier initialization had a validation accuracy greater than 80%, while with random initialization only 28 of 88 configurations did.
- Regarding the optimisers, Adam & NAdam worked best: 18 of the 25 NAdam configurations and 22 of the 32 Adam configurations had a validation accuracy greater than 80%. Nesterov gradient descent was hit or miss: its configurations either reached a validation accuracy greater than 80% or stayed below 10%, with nothing in between. If the optimisers have to be ranked, the order would be Adam > Nadam > RMSprop = Nesterov > Momentum > SGD.
Recommendation for 95% Validation Accuracy
Based on the experiments conducted, use a neural network with Xavier initialization and tanh or ReLU activations in the hidden layers. The network can have 3-5 hidden layers with 128 neurons each. The recommended optimizer is Adam or NAdam with a low learning rate of around 0.0001. Weight decay should be 0 or very close to zero, and the batch size 32 or 64, with a sufficient number of epochs to learn the parameters.
For better generalization, data augmentation can be performed, and Dropout can also be added.
Confusion Matrix
Training Set
- Training Accuracy: 92.315%
Test Set
- Test Accuracy: 88%
- It is seen that the model had difficulty predicting the Pullover and Shirt classes: many pullovers were predicted as coats, and many shirts were predicted as T-shirts.
Cross-Entropy Vs Mean Squared Error
Another sweep with loss as Mean Squared Error was performed as well. The best validation accuracy in this case was 88.65%.
The best configuration in this case was:
- Number of Neurons in every hidden layer: 256
- Number of hidden layers: 3
- Activation function: tanh
- Learning Rate: 0.01
- Initializer: Xavier
- Optimizer: nesterov
- Batch Size: 64
- Weight Decay ($\lambda$): 0.5
- Epochs: 30
Although random search was used to tune the hyperparameters for both cross-entropy and MSE, a higher number of configurations reached a validation accuracy greater than 80% with cross-entropy loss than with MSE. Another point to note is that the best validation accuracy achieved with cross-entropy was 89.87%, while with MSE it was 88.65%. Grouping the validation accuracies of all the configurations run in the sweep also shows that cross-entropy needed fewer epochs to converge.
Here, a single configuration is compared: the configuration that gave the best validation accuracy with cross-entropy is re-run with mean squared error. The two plots below, Training Accuracy vs Epoch and Validation Accuracy vs Epoch (for MSE & cross-entropy), show that models trained with cross-entropy generally reach higher accuracy in fewer epochs.
Accuracy achieved with the same configuration but with Mean Squared Error
Training Accuracy: 92.25%
Test Accuracy: 87.73%
These are some of the reasons why cross-entropy should be preferred over mean squared error for image classification tasks.
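For reference, a minimal sketch of the two losses being compared, on one-hot targets and with the optional L2 term described earlier (assumed layout; columns are examples, not the exact repository code):

```python
import numpy as np

def compute_loss(y_hat, y_true, loss='cross_entropy', parameter=None, lambda_val=0.0):
    # Sketch of the two loss options; y_hat and y_true have shape (10, batch_size)
    batch = y_true.shape[1]
    if loss == 'cross_entropy':
        data_loss = -np.sum(y_true * np.log(y_hat + 1e-9)) / batch
    else:  # 'mse'
        data_loss = 0.5 * np.sum((y_hat - y_true) ** 2) / batch
    l2 = 0.0
    if parameter is not None and lambda_val > 0:
        # L2 regularisation (weight decay) applied to the weight matrices only
        l2 = 0.5 * lambda_val * sum(np.sum(parameter[key] ** 2)
                                    for key in parameter if key.startswith('W'))
    return data_loss + l2
```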
Github Link
Github link: https://github.com/shashwat-3004/CS6910_assignment-1
MNIST dataset
Using the parallel coordinates plot obtained from the sweeps, performance on the MNIST dataset was checked with 3 of the configurations that gave some of the highest validation accuracies on Fashion-MNIST. Based on those observations, it is recommended to use Xavier initialization, a tanh or ReLU activation function, an optimizer such as Adam or NAdam, and a low learning rate around 0.001 or 0.0001. Since MNIST is a much simpler dataset and a similar image classification task with the same number of classes, the hyperparameter configurations that worked well for Fashion-MNIST are expected to work well for MNIST too. The table below demonstrates this: high test accuracies of 97-98% can be achieved.
3 Configurations
| | Configuration-1 | Configuration-2 | Configuration-3 |
|---|---|---|---|
| Number of Neurons | 128 | 512 | 256 | 
| Number of Hidden Layers | 3 | 3 | 3 | 
| Activation Function | tanh | relu | tanh | 
| Initialization | Xavier | Xavier | Xavier | 
| Learning Rate | 0.0001 | 0.001 | 0.001 | 
| Optimizer | Adam | Nadam | Nadam | 
| Batch Size | 32 | 128 | 128 | 
| Weight Decay | 0 | 0 | 0 | 
| Epochs | 30 | 15 | 15 | 
| Training Accuracy | 99.78% | 99.71% | 99.22% | 
| Test Accuracy | 97.71% | 98.24% | 97.53% | 
Code Specifications
All the code used to run the experiments is in the GitHub repository.
A Python script called train.py in the root directory of the GitHub repository accepts the following command line arguments with the specified default values (a sketch of the corresponding argparse setup follows the table):
python train.py --wandb_entity myname --wandb_project myprojectname
| Name | Default Value | Description | 
|---|---|---|
| -wp,--wandb_project | Assignment-1 | Project name used to track experiments in Weights & Biases dashboard | 
| -we,--wandb_entity | shashwat_mm19b053 | Wandb Entity used to track experiments in the Weights & Biases dashboard. | 
| -d,--dataset | fashion_mnist | choices: ["mnist", "fashion_mnist"] | 
| -e,--epochs | 30 | Number of epochs to train neural network. | 
| -b,--batch_size | 32 | Batch size used to train neural network. | 
| -l,--loss | cross_entropy | choices: ["mse", "cross_entropy"] | 
| -o,--optimizer | adam | choices: ["sgd", "momentum", "nag", "rmsprop", "adam", "nadam"] | 
| -lr,--learning_rate | 0.0001 | Learning rate used to optimize model parameters | 
| -m,--momentum | 0.9 | Momentum used by momentum and nag optimizers. | 
| -beta,--beta | 0.9 | Beta used by rmsprop optimizer | 
| -beta1,--beta1 | 0.9 | Beta1 used by adam and nadam optimizers. | 
| -beta2,--beta2 | 0.999 | Beta2 used by adam and nadam optimizers. | 
| -eps,--epsilon | 0.0000001 | Epsilon used by optimizers. | 
| -w_d,--weight_decay | 0 | Weight decay used by optimizers. | 
| -w_i,--weight_init | Xavier | choices: ["random", "Xavier"] | 
| -nhl,--num_layers | 3 | Number of hidden layers used in feedforward neural network. | 
| -sz,--hidden_size | 128 | Number of hidden neurons in a feedforward layer. | 
| -a,--activation | tanh | choices: ["identity", "sigmoid", "tanh", "relu"] | 
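A minimal sketch of how these arguments could be wired up with argparse in train.py; only a subset of the flags from the table is shown, defaults follow the table, and the actual script may differ:

```python
import argparse

parser = argparse.ArgumentParser(description='Train a feedforward network from scratch')
parser.add_argument('-wp', '--wandb_project', default='Assignment-1')
parser.add_argument('-we', '--wandb_entity', default='shashwat_mm19b053')
parser.add_argument('-d', '--dataset', default='fashion_mnist', choices=['mnist', 'fashion_mnist'])
parser.add_argument('-e', '--epochs', type=int, default=30)
parser.add_argument('-b', '--batch_size', type=int, default=32)
parser.add_argument('-l', '--loss', default='cross_entropy', choices=['mse', 'cross_entropy'])
parser.add_argument('-o', '--optimizer', default='adam',
                    choices=['sgd', 'momentum', 'nag', 'rmsprop', 'adam', 'nadam'])
parser.add_argument('-lr', '--learning_rate', type=float, default=0.0001)
parser.add_argument('-w_i', '--weight_init', default='Xavier', choices=['random', 'Xavier'])
parser.add_argument('-nhl', '--num_layers', type=int, default=3)
parser.add_argument('-sz', '--hidden_size', type=int, default=128)
parser.add_argument('-a', '--activation', default='tanh',
                    choices=['identity', 'sigmoid', 'tanh', 'relu'])
args = parser.parse_args()  # e.g. args.learning_rate, args.optimizer, ...
```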
Self Declaration
I, Shashwat Patel (MM19B053), swear on my honour that I have written the code and the report by myself.
I have taken ideas on how to store the different parameters and how to implement forward and backward propagation using loops; the corresponding reference is listed below.
Another thing to mention: while running the sweeps I split the data using sklearn.model_selection.train_test_split(); later I wrote my own function for splitting the data, so the results on the validation data might differ slightly.
References I used:
- Idea of using loops and storing every parameter (weight, bias, activation, pre-activation & gradients) in a dictionary: https://towardsdatascience.com/building-a-deep-neural-network-from-scratch-using-numpy-4f28a1df157a
- Softmax derivative (Jacobian): https://stackoverflow.com/questions/33541930/how-to-implement-the-softmax-derivative-independently-from-any-loss-function
- Xavier initialization: https://365datascience.com/tutorials/machine-learning-tutorials/what-is-xavier-initialization/
- Bias terms in backpropagation for a mini-batch: https://datascience.stackexchange.com/questions/20139/gradients-for-bias-terms-in-backpropagation
- Learnt how to use argparse: https://docs.wandb.ai/guides/track/config
- wandb documentation on how to run sweeps: https://docs.wandb.ai/guides/track/launch , https://docs.wandb.ai/ref/python/sweep