Assignment 1
Problem Statement
Question 1 (2 Marks)
Download the fashion-MNIST dataset and plot 1 sample image for each class as shown in the grid below. Use "from keras.datasets import fashion_mnist" for getting the fashion mnist dataset.
Answer
We used "from keras.datasets import fashion_mnist" to get the dataset and plotted 1 sample image for each class as shown below with the class label for each. The code is also given below:
Code link https://github.com/jayavardhankondapalli/cs6910_assignment1/blob/main/Assignment1_Q1.py
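For reference, a minimal sketch of how such a grid can be produced (an illustrative sketch, not the exact code in the repository; the class names follow the standard Fashion-MNIST label order):
# Sketch: plot one sample image per Fashion-MNIST class (assumes matplotlib is installed).
import numpy as np
import matplotlib.pyplot as plt
from keras.datasets import fashion_mnist

(X_train, y_train), (X_test, y_test) = fashion_mnist.load_data()
class_names = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
               "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]

fig, axes = plt.subplots(2, 5, figsize=(10, 5))
for label, ax in enumerate(axes.flat):
    idx = np.argmax(y_train == label)      # index of the first training image of this class
    ax.imshow(X_train[idx], cmap="gray")
    ax.set_title(class_names[label])
    ax.axis("off")
plt.tight_layout()
plt.show()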
Question 2 (10 Marks)
Implement a feedforward neural network which takes images from the fashion-mnist data as input and outputs a probability distribution over the 10 classes.
Your code should be flexible so that it is easy to change the number of hidden layers and the number of neurons in each hidden layer.
We will check the code for implementation and ease of use.
Answer
We have implemented a feedforward neural network that takes images from the fashion-mnist data as input and outputs a probability distribution over the 10 classes. The code has been tested on the test data, and it is flexible: the number of hidden layers, the number of neurons in each hidden layer, and the other hyperparameters are easy to change.
Github link for the code: https://github.com/jayavardhankondapalli/cs6910_assignment1/blob/main/assignmetnt1_q2_q3.py
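To illustrate what we mean by flexible, here is a minimal sketch of a forward pass in which the depth and width are just arguments (the function and variable names below are illustrative placeholders, not the exact ones in the repository):
# Sketch of a configurable feedforward pass for 28x28 Fashion-MNIST images.
# Names here are illustrative placeholders, not the exact ones used in the repository.
import numpy as np

def init_params(layer_sizes, seed=0):
    # layer_sizes e.g. [784, 64, 64, 10]: input size, hidden layers, output size
    rng = np.random.default_rng(seed)
    return [(0.01 * rng.standard_normal((n_out, n_in)), np.zeros((n_out, 1)))
            for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def softmax(z):
    z = z - z.max(axis=0, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=0, keepdims=True)

def forward(params, X, activation=np.tanh):
    # X has shape (784, batch_size); returns class probabilities of shape (10, batch_size)
    a = X
    for W, b in params[:-1]:
        a = activation(W @ a + b)
    W_out, b_out = params[-1]
    return softmax(W_out @ a + b_out)

# Usage: a network with 3 hidden layers of 64 neurons each
params = init_params([784, 64, 64, 64, 10])
probs = forward(params, np.random.rand(784, 16))   # probabilities for a batch of 16 inputs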
Question 3 (18 Marks)
Implement the backpropagation algorithm with support for the following optimisation functions
- sgd
- momentum based gradient descent
- nesterov accelerated gradient descent
- rmsprop
- adam
- nadam
(12 marks for the backpropagation framework and 2 marks for each of the optimisation algorithms above)
We will check the code for implementation and ease of use (e.g., how easy it is to add a new optimisation algorithm such as Eve). Note that the code should be flexible enough to work with different batch sizes.
Answer
We have implemented the backpropagation algorithm with support for the following optimization functions
- sgd
- momentum based gradient descent
- nesterov accelerated gradient descent
- rmsprop
- adam
- nadam
Moreover, the code is flexible enough that a new optimisation algorithm such as Eve can be added by simply writing the update function for it (a sketch is given below).
Github code link is: https://github.com/jayavardhankondapalli/cs6910_assignment1/blob/main/assignmetnt1_q2_q3.py
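As a sketch of what adding a new optimiser involves, each update rule can be written as a small function selected by name, so adding Eve only means writing one more such function (the names below are illustrative, not the exact ones in the repository):
# Sketch: optimisers as interchangeable update functions (illustrative names).
import numpy as np

def sgd_update(w, grad, state, lr=1e-3):
    return w - lr * grad, state

def momentum_update(w, grad, state, lr=1e-3, beta=0.9):
    v = beta * state.get("v", 0.0) + lr * grad   # exponentially decaying history of gradients
    state["v"] = v
    return w - v, state

OPTIMIZERS = {"sgd": sgd_update, "momentum": momentum_update}
# A new optimiser (e.g. Eve) would be added by defining eve_update(...) with the same
# signature and registering it: OPTIMIZERS["eve"] = eve_update

def apply_update(name, w, grad, state, **kwargs):
    return OPTIMIZERS[name](w, grad, state, **kwargs)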
Question 4 (10 Marks)
Use the sweep functionality provided by wandb to find the best values for the hyperparameters listed below. Use the standard train/test split of fashion_mnist (use (X_train, y_train), (X_test, y_test) = fashion_mnist.load_data()). Keep 10% of the training data aside as validation data for this hyperparameter search. Here are some suggestions for different values to try for hyperparameters. As you can quickly see, this leads to an exponential number of combinations. You will have to think about strategies to do this hyperparameter search efficiently. Check out the options provided by wandb.sweep and write down what strategy you chose and why.
- number of epochs: 5, 10
- number of hidden layers: 3, 4, 5
- size of every hidden layer: 32, 64, 128
- weight decay (L2 regularisation): 0, 0.0005, 0.5
- learning rate: 1e-3, 1e-4
- optimizer: sgd, momentum, nesterov, rmsprop, adam, nadam
- batch size: 16, 32, 64
- weight initialisation: random, Xavier
- activation functions: sigmoid, tanh, ReLU
wandb will automatically generate the following plots. Paste these plots below using the "Add Panel to Report" feature. Make sure you use meaningful names for each sweep (e.g. hl_3_bs_16_ac_tanh to indicate that there were 3 hidden layers, batch size was 16, and activation function was tanh) instead of using the default names (whole-sweep, kind-sweep) given by wandb.
Answer
wandb provides the wandb.sweep() functionality, which programmatically chooses combinations of hyperparameters and removes the manual labour of picking parameter sets by hand. It also generates plots that make it easy to compare the results of different hyperparameter combinations.
wandb.sweep() supports three search strategies, specified in the sweep_config. These are:
1. grid: This strategy runs every possible combination of hyperparameters. If there are p hyperparameters, each with n values, there are n^p combinations in total, so the search takes exponential time. Hence we did not choose this method.
2. random: This strategy samples a value at random from each hyperparameter set. Although it produced some good combinations in our experiments, the choices are random rather than guided, so this method is also not our choice.
3. bayes: In this strategy we specify a metric (e.g. accuracy or loss) and a goal (e.g. maximize or minimize), and the sweep chooses hyperparameters based on the results so far in order to improve that metric. This is the most efficient of the three strategies, and hence it is our choice.
The hyperparameter values that we chose in our sweep_config are as follows:
sweep_config = {
    "method": "bayes"
}

metric = {
    "name": "val_accuracy",
    "goal": "maximize"
}
sweep_config['metric'] = metric

parameter_dict = {
    'number_of_epochs': {
        'values': [5, 10, 15]
    },
    'number_of_hidden_layers': {
        'values': [3, 4, 5]
    },
    'size_of_every_hidden_layer': {
        'values': [32, 64, 128]
    },
    'weight_decay': {
        'values': [0, 0.0005, 0.005]
    },
    'learning_rate': {
        'values': [1e-4, 1e-5]
    },
    'optimizer': {
        'values': ['stochastic', 'momentum', 'nesterov accelerated', 'rmsprop', 'adam', 'nadam']
    },
    'batch_size': {
        'values': [16, 32, 64]
    },
    'weight_initialisation': {
        'values': ['random', 'xavier']
    },
    'activation_functions': {
        'values': ['sigmoid', 'tanh', 'relu']
    }
}
sweep_config['parameters'] = parameter_dict
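With this configuration, the sweep is registered with wandb.sweep() and run with wandb.agent(); a minimal sketch follows (the project name, the body of train() and the run count are placeholders, not our exact code):
import wandb

# Sketch: register the sweep and launch an agent that calls our training routine.
sweep_id = wandb.sweep(sweep_config, project="cs6910-assignment1")   # placeholder project name

def train():
    with wandb.init() as run:
        config = wandb.config   # hyperparameter combination chosen by the Bayesian sweep
        # ... build and train the network from config, logging wandb.log({"val_accuracy": ...})

wandb.agent(sweep_id, function=train, count=50)   # count = number of sweep runs to launch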
Following are the plots generated by the sweep run with the above sweep_config.
Github code link is: https://github.com/jayavardhankondapalli/cs6910_assignment1/blob/main/assignmetnt1_q4.py
Question 5 (5 marks)
We would like to see the best accuracy on the validation set across all the models that you train.
wandb automatically generates this plot which summarises the test accuracy of all the models that you tested. Please paste this plot below using the "Add Panel to Report" feature
Answer
Github code link is: https://github.com/jayavardhankondapalli/cs6910_assignment1/blob/main/assignmetnt1_q4.py
After performing the sweep as in Question 4, we get the best validation accuracy of 80.4% for the following parameter values:
- activation_function: tanh
- batch_size: 32
- learning_rate: 0.0001
- number_of_epochs: 10
- number_of_hidden_layers: 5
- optimizer: nadam
- size_of_every_hidden_layer: 64
- weight_decay: 0.005
- weight_initialisation: xavier
- loss_function: cross_entropy
- val_accuracy: 0.804
Question 6 (20 Marks)
Based on the different experiments that you have run we want you to make some inferences about which configurations worked and which did not.
Here again, wandb automatically generates a "Parallel co-ordinates plot" and a "correlation summary" as shown below. Learn about a "Parallel co-ordinates plot" and how to read it.
By looking at the plots that you get, write down some interesting observations (simple bullet points but should be insightful). You can also refer to the plot in Question 5 while writing these insights. For example, in the above sample plot there are many configurations which give less than 65% accuracy. I would like to zoom into those and see what is happening.
I would also like to see a recommendation for what configuration to use to get close to 95% accuracy.
Answer
From the "Parallel co-ordinate plot" below we can observe the parameters of various plots and we can observe that 'nadam' and 'adam' performs well as optimizer. This is also evident from the Question 5, parameters importance plot that optimizer.value_nadam has the highest importance and positively correlated whereas for such a huge data scholastic gradient descent optimizer does not perform well because with so many data points it has a lot of oscillations. This is also evident from the parallel co-ordinate plot below that 'sgd' does not perform well.
Nextly, the second most important parameter is the size of each hidden layer that is the number of neurons present at each layer which is also positively correlated with the accuracy and our model performs well with 64 nodes in the hidden layer.
Learning rate also acts as an important factor in this regard. A lower learning rate increases the accuracy and we are using a learning rate of 0.00001.
Small batch sizes may aid generalization, but they may not be able to converge to global minima. Similarly, a big batch size can be costly in terms of both cost and generality. So we need a batch size that is neither too little nor too huge. This view is supported by a tiny negative correlation value for batch size in the correlation table. It performs well with a medium batch size of 32 with lower learning rate values.
Possible modifications to attain 95% accuracy:
From our configurations it is clear that, with a plain feedforward neural network, the test accuracy does not exceed about 85%. The Fashion-MNIST dataset used in this assignment consists of images, and convolutional neural networks are known to perform better than fully connected networks on image datasets. So an accuracy of up to about 95% could be achieved using a convolutional neural network.
Github code link is: https://github.com/jayavardhankondapalli/cs6910_assignment1/blob/main/assignmetnt1_q4.py
Question 7 (10 Marks)
For the best model identified above, report the accuracy on the test set of fashion_mnist and plot the confusion matrix as shown below. More marks for creativity (less marks for producing the plot shown below as it is)
Answer
Github code link: https://github.com/jayavardhankondapalli/cs6910_assignment1/blob/main/assignmetnt1_q7.py
The best model identified from Question 4 has the following hyperparameters:
- activation_function: tanh
- batch_size: 32
- learning_rate: 0.0001
- number_of_epochs: 10
- number_of_hidden_layers: 5
- optimizer: nadam
- size_of_every_hidden_layer: 64
- weight_decay: 0.005
- weight_initialisation: xavier
- loss_function: cross_entropy
The model trained with these parameters achieves a test accuracy of 80.05%. The confusion matrix for this model is drawn below:
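For reference, a minimal sketch of how the test accuracy and the 10x10 confusion matrix are computed from true and predicted labels (y_test and y_pred below are dummy placeholders standing in for the real model outputs):
import numpy as np

# Sketch: test accuracy and confusion matrix; rows are true classes, columns are predicted classes.
rng = np.random.default_rng(0)
y_test = rng.integers(0, 10, size=10000)   # placeholder true labels
y_pred = rng.integers(0, 10, size=10000)   # placeholder predicted labels

accuracy = np.mean(y_pred == y_test)

cm = np.zeros((10, 10), dtype=int)
for t, p in zip(y_test, y_pred):
    cm[t, p] += 1                          # count each (true, predicted) pair

print(f"test accuracy: {accuracy:.4f}")
print(cm)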
Question 8 (5 Marks)
In all the models above you would have used cross entropy loss. Now compare the cross entropy loss with the squared error loss. I would again like to see some automatically generated plots or your own plots to convince me whether one is better than the other.
Answer
The red plot is for the cross-entropy loss function; the blue plot is for the squared-error loss function.
From the accuracy graph we can see that cross entropy gives higher accuracy than the squared-error loss. In the loss graph, however, cross entropy has a higher average loss than squared error; that is, even with a higher loss value, cross entropy yields higher accuracy than squared error. Therefore cross entropy is better than squared error for this dataset. In general, when the network outputs are class probabilities, cross entropy gives better results than squared error.
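For reference, a minimal sketch of the two losses being compared, computed on softmax outputs (illustrative only; y_true is one-hot encoded and the tiny example uses 3 classes):
import numpy as np

# Sketch: the two losses compared above, averaged over a batch.
# y_true: one-hot labels of shape (batch, classes); y_prob: softmax outputs of the same shape.
def cross_entropy_loss(y_true, y_prob, eps=1e-12):
    return -np.mean(np.sum(y_true * np.log(y_prob + eps), axis=1))

def squared_error_loss(y_true, y_prob):
    return np.mean(np.sum((y_true - y_prob) ** 2, axis=1))

# Tiny 3-class example: one confident correct prediction and one confident wrong one
y_true = np.array([[0, 1, 0], [1, 0, 0]], dtype=float)
y_prob = np.array([[0.1, 0.8, 0.1], [0.1, 0.8, 0.1]], dtype=float)
print(cross_entropy_loss(y_true, y_prob))   # heavily penalises the confident wrong prediction
print(squared_error_loss(y_true, y_prob))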
Question 9 (10 Marks)
Paste a link to your github code for this assignment
Link: https://github.com/jayavardhankondapalli/cs6910_assignment1.git
The Readme file to use the git repository is also given in the git link: https://github.com/jayavardhankondapalli/cs6910_assignment1/blob/main/README.md
Question 10 (10 Marks)
Based on your learnings above, give me 3 recommendations for what would work for the MNIST dataset (not Fashion-MNIST). Just to be clear, I am asking you to take your learnings based on extensive experimentation with one dataset and see if these learnings help on another dataset. If I give you a budget of running only 3 hyperparameter configurations as opposed to the large number of experiments you have run above then which 3 would you use and why. Report the accuracies that you obtain using these 3 configurations.
Answer
Github code link: https://github.com/jayavardhankondapalli/cs6910_assignment1/blob/main/assignmetnt1_q10.py
Based on our extensive experiments on the fashion_mnist dataset, we concluded that lower learning rates give better results; the 'tanh' activation generally performs better than 'relu'; a higher number of epochs gives better accuracy; a hidden-layer size of 64 gives better results; and optimizers like 'adam' and 'nadam' perform better than 'sgd', 'mgd' (momentum) and 'nesterov'. Keeping all these points in mind, we tried three experiments with the following hyperparameter values:
- bs-32-lr-1e-05-ep-30-op-adam-nhl-5-shl-64-act-tanh-wd-0.0005-wi-xavier
- bs-64-lr-1e-05-ep-30-op-adam-nhl-5-shl-128-act-relu-wd-0-wi-xavier
- bs-64-lr-1e-05-ep-20-op-adam-nhl-5-shl-128-act-relu-wd-0-wi-xavier
So, with batch size 32, learning rate 0.00001, 30 epochs, the adam optimizer, 5 hidden layers of 64 neurons each, tanh activation, xavier weight initialisation and weight decay 0.0005, we get the best accuracy among the three configurations on the MNIST dataset. The test accuracy is close to 90% for these parameters.
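A minimal sketch of how the same pipeline is pointed at MNIST, assuming the images are flattened and normalised as for Fashion-MNIST (the config keys and the commented-out training call are placeholders, not our exact code):
from keras.datasets import mnist

# Sketch: load MNIST and apply the same preprocessing as for Fashion-MNIST (assumption).
(X_train, y_train), (X_test, y_test) = mnist.load_data()
X_train = X_train.reshape(len(X_train), -1) / 255.0   # flatten 28x28 images to 784-vectors
X_test = X_test.reshape(len(X_test), -1) / 255.0

# The best of the three configurations above, as a dictionary (placeholder keys):
best_config = {"batch_size": 32, "learning_rate": 1e-5, "epochs": 30, "optimizer": "adam",
               "num_hidden_layers": 5, "hidden_size": 64, "activation": "tanh",
               "weight_decay": 0.0005, "weight_init": "xavier"}
# train_model(X_train, y_train, X_test, y_test, **best_config)   # hypothetical call into our code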
The plots for loss and accuracy are given below:
Self Declaration
List down the contributions of the two team members:
CS21S045: (50% contribution)
- implementing forward propagation, backward propagation
- implementing the Momentum, RMSprop and Nadam gradient descent algorithms
- analysing the parallel co-ordinates plot and writing inferences
- wandb integration
- Training data analysis
- plotting the confusion matrix
- Readme File finalization
- code flexibility decisions
CS21S011: (50% contribution)
- implementing forward propagation, backward propagation
- implementing the Stochastic Gradient Descent, Nesterov and Adam gradient descent algorithms
- Squared error Vs cross entropy analysis
- setting up the sweep in wandb
- analysing the parallel co-ordinates plot and writing inferences
- Readme file finalization
- Modeling the code.
At the end of every day, we connected on Meet to discuss and share what we had learned, and we implemented the code together.
We, Prithaj_Banerjee_CS21S045 and Kondapalli_Jayavardhan_CS21S011, swear on our honour that the above declaration is correct.