CS6910 Assignment-1: Feedforward Neural Network Implementation from Scratch
Problem Statement
In this assignment one needs to implement a feedforward neural network and write the backpropagation code for training it. All matrix/vector operations use numpy, and no automatic differentiation packages are allowed. The network is tested on the Fashion-MNIST dataset: given an input image (28 x 28 = 784 pixels), the network is trained to classify it into 1 of 10 classes.
10 different classes in the fashion-MNIST dataset and their Sample Images
There are 10 different classes present in the Fashion-MNIST dataset. The class labels are as follows:
{0: 'T-shirt/Top', 1:'Trouser', 2:'Pullover', 3:'Dress', 4:'Coat', 5:'Sandal', 6:'Shirt', 7:'Sneaker', 8:'Bag' ,9:'Ankle Boot'}
A sample image from each class is displayed below.
Implementation of feedforward network:
A neural network class has been made which handles: initialization of weights & biases, forward propagation, backward propagation, prediction of the final output, accuracy scoring, and plotting of the loss curves (both training and validation loss) while the network is being fit.
The neural network class takes 5 arguments at initialization: num_layers (total number of layers in the network, including the input and output layers), activation function (the activation used in the hidden layers), loss ('mse' or 'cross_entropy'), batch_size, and lambda_val (the L2 regularizer value, i.e. weight decay). A minimal sketch of this interface is shown below.
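The attribute names in this sketch are illustrative, not necessarily the exact ones in the repository:

```python
import numpy as np

class NeuralNetwork:
    # Sketch of the class interface described above (illustrative; the actual
    # repository code may differ in names and details).
    def __init__(self, num_layers, activation, loss='cross_entropy',
                 batch_size=32, lambda_val=0.0):
        self.num_layers = num_layers    # total layers, including input and output layer
        self.activation = activation    # activation used in the hidden layers
        self.loss = loss                # 'mse' or 'cross_entropy'
        self.batch_size = batch_size    # mini-batch size used during fitting
        self.lambda_val = lambda_val    # L2 regularizer value (weight decay)
```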
Early stopping has also been implemented to avoid overfitting. Because of this, some models in the sweeps stopped before reaching their default number of epochs; this can be verified by comparing the epoch at which a run stopped with its configured number of epochs. A sketch of the early-stopping logic is shown below.
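A minimal sketch of such an early-stopping loop, assuming a patience of 5 epochs and hypothetical callables for training and validation (not the exact repository code):

```python
def fit_with_early_stopping(train_one_epoch, compute_val_loss, max_epochs=30, patience=5):
    """Sketch of early stopping on the validation loss. train_one_epoch and
    compute_val_loss are assumed callables supplied by the training loop."""
    best_val_loss, wait = float('inf'), 0
    for epoch in range(max_epochs):
        train_one_epoch()              # one pass over the training data
        val_loss = compute_val_loss()  # loss on the held-out validation set
        if val_loss < best_val_loss:
            best_val_loss, wait = val_loss, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch + 1       # stopped early: fewer epochs than max_epochs
    return max_epochs
```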
Initialization of Weight-Bias parameters & forward-propagation
Inside the neural network class, two functions handle parameter initialization and forward propagation: 'weight_bias_initialize' and 'forward_propagation'.
'weight_bias_initialize' takes 2 arguments: a list with the number of neurons per layer and a weight initializer. The initializer can be either 'Xavier' or 'Random'; depending on the choice, the initial weights and biases are set accordingly. The function returns 4 outputs: 'parameter' (the current parameters), 'old_parameter' (used by momentum-based and rewritten Nesterov gradient descent), and 'v' & 'm' (both copies of 'old_parameter', used by RMSprop, Adam & Nadam). All the outputs are Python dictionaries. A sketch of this function is given below.
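A minimal sketch of what this initialization might look like, assuming a dictionary layout with keys 'W1', 'b1', ... (here the extra dictionaries start as zeros, which is one reasonable choice; the repository code may differ):

```python
import numpy as np

def weight_bias_initialize(layer_sizes, initializer='Xavier', seed=0):
    """Sketch: layer_sizes lists neurons per layer, e.g. [784, 128, 128, 128, 10]."""
    rng = np.random.default_rng(seed)
    parameter = {}
    for i in range(1, len(layer_sizes)):
        fan_in, fan_out = layer_sizes[i - 1], layer_sizes[i]
        if initializer == 'Xavier':
            # Xavier/Glorot: uniform with variance scaled by fan-in and fan-out
            limit = np.sqrt(6.0 / (fan_in + fan_out))
            W = rng.uniform(-limit, limit, size=(fan_out, fan_in))
        else:  # 'Random': small Gaussian values
            W = rng.standard_normal((fan_out, fan_in)) * 0.01
        parameter[f'W{i}'] = W
        parameter[f'b{i}'] = np.zeros((fan_out, 1))
    # old_parameter (momentum/NAG) and v, m (RMSprop/Adam/Nadam) mirror the same layout
    old_parameter = {k: np.zeros_like(p) for k, p in parameter.items()}
    v = {k: np.zeros_like(p) for k, p in parameter.items()}
    m = {k: np.zeros_like(p) for k, p in parameter.items()}
    return parameter, old_parameter, v, m
```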
The 'forward_propagation' function takes 2 arguments: the training data and the current parameters (weights and biases). It returns 3 outputs: y_pred (the predicted probability distribution), the activations ($h_i$) and the pre-activations ($a_i$). The activations and pre-activations are Python dictionaries and are needed for calculating gradients during backpropagation.
One thing to keep in mind: the softmax activation function is always used at the final layer; the neural network class does not allow changing that. A minimal sketch of the forward pass is shown below.
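This sketch assumes column-major batches of shape (784, batch_size) and tanh hidden layers, with softmax fixed at the output as noted above (illustrative; the repository code may differ):

```python
import numpy as np

def softmax(a):
    # Numerically stable softmax over the class dimension (rows = classes)
    a = a - a.max(axis=0, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=0, keepdims=True)

def forward_propagation(X, parameter, num_hidden, g=np.tanh):
    """Sketch: X has shape (784, batch_size); num_hidden is the number of hidden layers.
    Returns y_pred plus the activation (h_i) and pre-activation (a_i) dictionaries."""
    H, A = {'h0': X}, {}
    for i in range(1, num_hidden + 1):
        A[f'a{i}'] = parameter[f'W{i}'] @ H[f'h{i-1}'] + parameter[f'b{i}']
        H[f'h{i}'] = g(A[f'a{i}'])
    k = num_hidden + 1
    # The output layer always uses softmax
    A[f'a{k}'] = parameter[f'W{k}'] @ H[f'h{k-1}'] + parameter[f'b{k}']
    y_pred = softmax(A[f'a{k}'])
    return y_pred, H, A
```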
Implementation of the backpropagation algorithm with different optimisation functions
The backpropagation algorithm has been implemented for both losses: cross-entropy & mean squared error.
The 'backpropagate' function in the neural network class takes 5 arguments: the predicted label (y_hat, a probability distribution), the true label (y_true), the activations ($h_i$), the pre-activations ($a_i$), and the parameters (weights and biases). The function returns a dictionary containing the gradient with respect to every weight and bias. A sketch is given below.
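A minimal sketch of such a backward pass for a mini-batch, assuming one-hot targets, the dictionary layout above, tanh hidden layers, and the L2 term folded into the weight gradients. The cross-entropy branch uses the standard softmax + cross-entropy shortcut, while the MSE branch goes through the softmax Jacobian:

```python
import numpy as np

def backpropagate(y_hat, y_true, H, A, parameter, num_hidden,
                  loss='cross_entropy', g_prime=lambda a: 1.0 - np.tanh(a) ** 2,
                  lambda_val=0.0):
    """Sketch: columns are examples, y_true is one-hot, g_prime is the derivative
    of the hidden activation. Returns gradients for every weight and bias."""
    grads = {}
    batch = y_true.shape[1]
    k = num_hidden + 1                      # index of the output layer
    if loss == 'cross_entropy':
        da = y_hat - y_true                 # gradient w.r.t. output pre-activation
    else:                                   # 'mse': multiply the error by the softmax Jacobian
        err = y_hat - y_true
        da = err * y_hat - y_hat * np.sum(err * y_hat, axis=0, keepdims=True)
    for i in range(k, 0, -1):
        grads[f'W{i}'] = da @ H[f'h{i-1}'].T / batch + lambda_val * parameter[f'W{i}']
        grads[f'b{i}'] = da.sum(axis=1, keepdims=True) / batch
        if i > 1:
            da = (parameter[f'W{i}'].T @ da) * g_prime(A[f'a{i-1}'])
    return grads
```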
Regarding optimisation, a class 'NN_optimizers' has been made which takes in 8 arguments: parameters, gradients, learning_rate, old_parameters, v, m, t and num_layers.
These optimisation functions have been implemented:
- SGD
- Momentum based gradient descent
- Nesterov accelerated gradient descent
- Rmsprop
- Adam
- Nadam
Initially, I was calling the neural network class to calculate the look-ahead gradients for Nesterov gradient descent; later I switched to the rewritten form of Nesterov gradient descent, which only needs gradients at the current parameters. Sketches of the Adam and rewritten Nesterov updates are shown below.
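For illustration, minimal sketches of two of the update rules, assuming the dictionary layout above. Here t is the step count used for Adam's bias correction, and the rewritten Nesterov form shown is the common reformulation that avoids an explicit look-ahead pass; the repository's exact code may differ:

```python
import numpy as np

def adam_update(parameter, grads, m, v, t, learning_rate=1e-4,
                beta1=0.9, beta2=0.999, eps=1e-8):
    # Sketch of one Adam step over every weight and bias
    for key in parameter:
        m[key] = beta1 * m[key] + (1 - beta1) * grads[key]        # first-moment estimate
        v[key] = beta2 * v[key] + (1 - beta2) * grads[key] ** 2   # second-moment estimate
        m_hat = m[key] / (1 - beta1 ** t)                         # bias correction
        v_hat = v[key] / (1 - beta2 ** t)
        parameter[key] -= learning_rate * m_hat / (np.sqrt(v_hat) + eps)
    return parameter, m, v

def nesterov_update(parameter, grads, u, learning_rate=1e-3, beta=0.9):
    # Sketch of the rewritten Nesterov step: gradients are taken at the current
    # parameters, and the look-ahead is folded into the update itself
    for key in parameter:
        u[key] = beta * u[key] + grads[key]
        parameter[key] -= learning_rate * (beta * u[key] + grads[key])
    return parameter, u
```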
Hyperparameter tuning using Wandb sweeps
10% of the training dataset was kept aside as a validation set for hyperparameter tuning.
The sweep functionality provided by wandb allows three search methods: grid, random, and Bayesian. Grid search was ruled out because it iterates over every combination of hyperparameter values, and the number of combinations here grows exponentially, so checking them all would take too long. Bayesian search could have been used, but it has trouble when the search space is high-dimensional. Random search was the best choice: it samples a random set of hyperparameters for each run and takes far less time for tuning, so I went with the random search strategy to find some of the best hyperparameter values. In total, 167 configurations were tried out (a sketch of the sweep configuration follows the search space list below).
Search Space:
- Number of epochs: 5, 15, 20, 30, 40
- Number of hidden layers: 3, 4, 5, 6
- Size of every hidden layer: 32, 64, 128, 256, 512
- Weight decay (L2 regularisation): 0, 0.05, 0.5
- Learning rate: 1e-2, 1e-3, 1e-4
- Optimizer: sgd, momentum, nesterov, rmsprop, adam, nadam
- Batch size: 32, 64, 128, 256
- Weight initialisation: random, Xavier
- Activation functions: sigmoid, tanh, relu, identity
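A minimal sketch of a wandb random-search sweep over this search space; the config keys and the train placeholder are assumed names, not necessarily the ones used in the repository:

```python
import wandb

def train():
    """Placeholder for the real training function: it would read wandb.config,
    build and fit the network, and log val_accuracy with wandb.log()."""
    with wandb.init():
        pass  # training code goes here

sweep_config = {
    'method': 'random',                                    # random search strategy
    'metric': {'name': 'val_accuracy', 'goal': 'maximize'},
    'parameters': {
        'epochs':        {'values': [5, 15, 20, 30, 40]},
        'num_hidden':    {'values': [3, 4, 5, 6]},
        'hidden_size':   {'values': [32, 64, 128, 256, 512]},
        'weight_decay':  {'values': [0, 0.05, 0.5]},
        'learning_rate': {'values': [1e-2, 1e-3, 1e-4]},
        'optimizer':     {'values': ['sgd', 'momentum', 'nesterov', 'rmsprop', 'adam', 'nadam']},
        'batch_size':    {'values': [32, 64, 128, 256]},
        'weight_init':   {'values': ['random', 'Xavier']},
        'activation':    {'values': ['sigmoid', 'tanh', 'relu', 'identity']},
    },
}

sweep_id = wandb.sweep(sweep_config, project='Assignment-1')
wandb.agent(sweep_id, function=train, count=167)  # 167 random configurations
```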
Best Configuration
The best validation accuracy achieved was 89.87% across all the configurations that were tried out.
The best configuration found by hyperparameter tuning is:
- Number of Neurons in every hidden layer: 128
- Number of hidden layers: 3
- Activation function: tanh
- Learning Rate: 0.0001
- Initializer: Xavier
- Optimizer: Adam
- Batch Size: 32
- Weight Decay ($\lambda$): 0
- Epochs: 30
Parallel Coordinates Plot
- Around 60 of the 167 models had a validation accuracy of less than 10%, with similar training accuracy. These models did not get trained at all; they appear to be stuck on a flat region of the error surface.
- From the parameter importance plot, it is seen that neurons per layer, number of hidden layers, learning rate, and batch size play an important role in determining the validation accuracy
- It is interesting to note that even though the identity activation function adds no non-linearity to the network, 29 of the 43 configurations using it in the hidden layers had a validation accuracy greater than 80%, the highest being 86.13%.
- The sigmoid activation function did not perform well: out of 35 configurations, only 6 had a validation accuracy greater than 80%, the highest being 87.72%. This is likely due to the vanishing gradient problem with sigmoid. Tanh and ReLU in general performed better than sigmoid and identity; most configurations using these two activations had a validation accuracy greater than 80%.
- Xavier initialization worked better than random initialization: 59 of the 79 configurations with Xavier initialization had a validation accuracy greater than 80%, while with random initialization only 28 of 88 configurations did.
- Regarding the optimisers, Adam & NAdam worked best: 18 of the 25 NAdam configurations and 22 of the 32 Adam configurations had a validation accuracy greater than 80%. Nesterov gradient descent was hit or miss: its configurations either reached a validation accuracy greater than 80% or stayed below 10%, with nothing in between. If the optimisers have to be ranked, the order would be Adam > Nadam > RMSprop = Nesterov > Momentum > SGD.
Recommendation for 95% Validation Accuracy
Based on the experiments conducted, use a neural network with Xavier initialization and tanh or ReLU activations in the hidden layers. The network can have 3-5 hidden layers with 128 neurons each. The recommended optimizer is Adam or NAdam with a low learning rate of around 0.0001. Weight decay should be 0 or very close to zero, and the batch size 32 or 64, with a sufficient number of epochs to learn the parameters.
For better generalization, data augmentation can be performed, and Dropout can also be added.
Confusion Matrix
Training Set
- Training Accuracy: 92.315%
Test Set
- Test Accuracy: 88%
- It is seen that the model had difficulty predicting the Pullover and Shirt classes: many pullovers were predicted as coats, and many shirts were predicted as T-shirts.
Cross-Entropy Vs Mean Squared Error
Another sweep with loss as Mean Squared Error was performed as well. The best validation accuracy in this case was 88.65%.
The best configuration in this case was:
- Number of Neurons in every hidden layer: 256
- Number of hidden layers: 3
- Activation function: tanh
- Learning Rate: 0.01
- Initializer: Xavier
- Optimizer: nesterov
- Batch Size: 64
- Weight Decay ($\lambda$): 0.5
- Epochs: 30
Although random search was used to tune the hyperparameters for both cross-entropy and MSE, a higher number of configurations reached a validation accuracy greater than 80% with cross-entropy loss than with MSE. Another point to note is that the best validation accuracy achieved with cross-entropy was 89.87%, while with MSE it was 88.65%. Grouping the validation accuracies of all the configurations run in the sweep also shows that cross-entropy needed fewer epochs to converge.
Here, a single configuration is compared: the configuration that gave the best validation accuracy with cross-entropy is re-run with mean squared error. The two plots below, Training Accuracy vs Epoch and Validation Accuracy vs Epoch (for MSE & cross-entropy), show that models trained with cross-entropy generally reach higher accuracy in fewer epochs.
Accuracy achieved with the same configuration but with Mean Squared Error
Training Accuracy: 92.25%
Test Accuracy: 87.73%
These are some of the reasons why cross-entropy should be preferred over mean squared error for image classification tasks.
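For reference, a minimal sketch of the two losses being compared, on one-hot targets and with the optional L2 term described earlier (assumed layout; columns are examples, not the exact repository code):

```python
import numpy as np

def compute_loss(y_hat, y_true, loss='cross_entropy', parameter=None, lambda_val=0.0):
    # Sketch of the two loss options; y_hat and y_true have shape (10, batch_size)
    batch = y_true.shape[1]
    if loss == 'cross_entropy':
        data_loss = -np.sum(y_true * np.log(y_hat + 1e-9)) / batch
    else:  # 'mse'
        data_loss = 0.5 * np.sum((y_hat - y_true) ** 2) / batch
    l2 = 0.0
    if parameter is not None and lambda_val > 0:
        # L2 regularisation (weight decay) applied to the weight matrices only
        l2 = 0.5 * lambda_val * sum(np.sum(parameter[key] ** 2)
                                    for key in parameter if key.startswith('W'))
    return data_loss + l2
```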
Github Link
Github link: https://github.com/shashwat-3004/CS6910_assignment-1
MNIST dataset
Using the parallel coordinates plot obtained from the sweeps, performance on the MNIST dataset was checked with 3 of the configurations that gave some of the highest validation accuracies on Fashion-MNIST. Based on those observations, it is recommended to use Xavier initialization, a tanh or ReLU activation function, an optimizer such as Adam or NAdam, and a low learning rate around 0.001 or 0.0001. Since MNIST is a much simpler dataset and a similar image classification task with the same number of classes, the hyperparameter configurations that worked well for Fashion-MNIST are expected to work well for MNIST too. The table below demonstrates this: high test accuracies of 97-98% can be achieved.
3 Configurations
| | Configuration-1 | Configuration-2 | Configuration-3 |
|---|---|---|---|
| Number of Neurons | 128 | 512 | 256 | 
| Number of Hidden Layers | 3 | 3 | 3 | 
| Activation Function | tanh | relu | tanh | 
| Initialization | Xavier | Xavier | Xavier | 
| Learning Rate | 0.0001 | 0.001 | 0.001 | 
| Optimizer | Adam | Nadam | Nadam | 
| Batch Size | 32 | 128 | 128 | 
| Weight Decay | 0 | 0 | 0 | 
| Epochs | 30 | 15 | 15 | 
| Training Accuracy | 99.78% | 99.71% | 99.22% | 
| Test Accuracy | 97.71% | 98.24% | 97.53% | 
Code Specifications
All the code used to run the experiments is in the GitHub repository.
A Python script called train.py in the root directory of the GitHub repository accepts the following command line arguments with the specified default values (a sketch of the corresponding argparse setup follows the table):
python train.py --wandb_entity myname --wandb_project myprojectname
| Name | Default Value | Description | 
|---|---|---|
| -wp,--wandb_project | Assignment-1 | Project name used to track experiments in Weights & Biases dashboard | 
| -we,--wandb_entity | shashwat_mm19b053 | Wandb Entity used to track experiments in the Weights & Biases dashboard. | 
| -d,--dataset | fashion_mnist | choices: ["mnist", "fashion_mnist"] | 
| -e,--epochs | 30 | Number of epochs to train neural network. | 
| -b,--batch_size | 32 | Batch size used to train neural network. | 
| -l,--loss | cross_entropy | choices: ["mse", "cross_entropy"] | 
| -o,--optimizer | adam | choices: ["sgd", "momentum", "nag", "rmsprop", "adam", "nadam"] | 
| -lr,--learning_rate | 0.0001 | Learning rate used to optimize model parameters | 
| -m,--momentum | 0.9 | Momentum used by momentum and nag optimizers. | 
| -beta,--beta | 0.9 | Beta used by rmsprop optimizer | 
| -beta1,--beta1 | 0.9 | Beta1 used by adam and nadam optimizers. | 
| -beta2,--beta2 | 0.999 | Beta2 used by adam and nadam optimizers. | 
| -eps,--epsilon | 0.0000001 | Epsilon used by optimizers. | 
| -w_d,--weight_decay | 0 | Weight decay used by optimizers. | 
| -w_i,--weight_init | Xavier | choices: ["random", "Xavier"] | 
| -nhl,--num_layers | 3 | Number of hidden layers used in feedforward neural network. | 
| -sz,--hidden_size | 128 | Number of hidden neurons in a feedforward layer. | 
| -a,--activation | tanh | choices: ["identity", "sigmoid", "tanh", "relu"] | 
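A minimal sketch of how these arguments could be wired up with argparse in train.py; only a subset of the flags from the table is shown, defaults follow the table, and the actual script may differ:

```python
import argparse

parser = argparse.ArgumentParser(description='Train a feedforward network from scratch')
parser.add_argument('-wp', '--wandb_project', default='Assignment-1')
parser.add_argument('-we', '--wandb_entity', default='shashwat_mm19b053')
parser.add_argument('-d', '--dataset', default='fashion_mnist', choices=['mnist', 'fashion_mnist'])
parser.add_argument('-e', '--epochs', type=int, default=30)
parser.add_argument('-b', '--batch_size', type=int, default=32)
parser.add_argument('-l', '--loss', default='cross_entropy', choices=['mse', 'cross_entropy'])
parser.add_argument('-o', '--optimizer', default='adam',
                    choices=['sgd', 'momentum', 'nag', 'rmsprop', 'adam', 'nadam'])
parser.add_argument('-lr', '--learning_rate', type=float, default=0.0001)
parser.add_argument('-w_i', '--weight_init', default='Xavier', choices=['random', 'Xavier'])
parser.add_argument('-nhl', '--num_layers', type=int, default=3)
parser.add_argument('-sz', '--hidden_size', type=int, default=128)
parser.add_argument('-a', '--activation', default='tanh',
                    choices=['identity', 'sigmoid', 'tanh', 'relu'])
args = parser.parse_args()  # e.g. args.learning_rate, args.optimizer, ...
```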
Self Declaration
I, Shashwat Patel (MM19B053), swear on my honour that I have written the code and the report by myself.
I have taken ideas on how to store the different parameters and how to implement forward and backward propagation using loops; the corresponding reference is listed below.
Another thing to mention: while running the sweeps I split the data using sklearn.model_selection.train_test_split(); later I wrote my own function for splitting the data, so the results on the validation data might differ slightly.
References I used:
- Idea of using loops and storing every parameter (weight, bias, activation, pre-activation & gradients) in a dictionary: https://towardsdatascience.com/building-a-deep-neural-network-from-scratch-using-numpy-4f28a1df157a
- Softmax derivative (Jacobian): https://stackoverflow.com/questions/33541930/how-to-implement-the-softmax-derivative-independently-from-any-loss-function
- Xavier initialization: https://365datascience.com/tutorials/machine-learning-tutorials/what-is-xavier-initialization/
- Bias terms in backpropagation for a mini-batch: https://datascience.stackexchange.com/questions/20139/gradients-for-bias-terms-in-backpropagation
- Learnt how to use argparse: https://docs.wandb.ai/guides/track/config
- wandb documentation on how to run sweeps: https://docs.wandb.ai/guides/track/launch , https://docs.wandb.ai/ref/python/sweep