CS6910 - Assignment 1
Problem Statement
In this assignment you need to implement a feedforward neural network and write the backpropagation code for training the network. We strongly recommend using numpy for all matrix/vector operations. You are not allowed to use any automatic differentiation packages. This network will be trained and tested using the Fashion-MNIST dataset. Specifically, given an input image (28 x 28 = 784 pixels) from the Fashion-MNIST dataset, the network will be trained to classify the image into 1 of 10 classes.
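For reference, a minimal sketch of how the data might be loaded and prepared, assuming keras is available purely for downloading the dataset (all training itself is done in numpy):

```python
import numpy as np
from keras.datasets import fashion_mnist

# Load Fashion-MNIST; keras is used only to fetch the data, not to train.
(x_train, y_train), (x_test, y_test) = fashion_mnist.load_data()

# Flatten each 28 x 28 image into a 784-dimensional vector and scale to [0, 1].
x_train = x_train.reshape(-1, 784).astype(np.float64) / 255.0
x_test = x_test.reshape(-1, 784).astype(np.float64) / 255.0
```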
Question 1 (2 Marks)
Question 2 (10 Marks)
A feedforward neural network has been implemented which takes images from the Fashion-MNIST dataset as input and outputs a probability distribution over the 10 classes.
The code is flexible, and the number of hidden layers and the number of neurons in each hidden layer can be changed easily.
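A minimal sketch of such a forward pass is given below; the names (`forward`, `weights`, `biases`) are illustrative rather than the exact ones used in the submitted code. Depth and layer widths follow directly from the shapes of the parameter lists:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def forward(x, weights, biases, activation=np.tanh):
    """x: (batch, 784); weights[i]: (fan_in, fan_out); returns (batch, 10) probabilities.

    The number of hidden layers is simply len(weights) - 1, so depth and
    layer sizes are controlled entirely by the parameter lists.
    """
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = activation(h @ W + b)                     # hidden layers
    return softmax(h @ weights[-1] + biases[-1])      # distribution over 10 classes
```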
Question 3 (24 Marks)
The backpropagation algorithm has been implemented with support for the following optimisers:
- sgd
- momentum based gradient descent
- nesterov accelerated gradient descent
- rmsprop
- adam
- nadam
The code is flexible enough to work with different batch sizes.
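For illustration, sketches of two of these update rules (momentum and Adam) are given below, assuming the gradient `g` of the loss with respect to a parameter `w` has already been computed by backpropagation; the state handling is simplified to arrays passed in and out, and the default hyperparameter values are illustrative:

```python
import numpy as np

def momentum_update(w, g, u, eta=1e-3, beta=0.9):
    # Momentum-based gradient descent: accumulate a decaying history of gradients.
    u = beta * u + g
    return w - eta * u, u

def adam_update(w, g, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # Adam: bias-corrected first and second moment estimates of the gradient.
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)          # t is the (1-indexed) update step
    v_hat = v / (1 - beta2 ** t)
    return w - eta * m_hat / (np.sqrt(v_hat) + eps), m, v
```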
Question 4 (10 Marks)
The following search strategies are provided by wandb:
- grid: This strategy iterates over all possible combinations of the parameter values. The number of combinations is exponential in the number of values tried per hyperparameter, so the runtime grows rapidly as the search space expands. Thus, even though this exhaustive search is guaranteed to find the best combination within the specified grid, it is not feasible in terms of runtime and/or computational power.
- random: This strategy chooses a random combination of values in each iteration. It also requires a count argument indicating the number of (random) combinations to try. Setting this count to a reasonable value makes the search much faster than the grid strategy, but at the cost of optimality. Furthermore, a sweep that does not cover a major portion of the search space may, by chance, yield unsatisfactory solutions, making this strategy unreliable, especially for small values of count.
- bayes: This strategy uses a Gaussian process to model the relationship between the hyperparameters and the chosen metric (here val_accuracy), so each new hyperparameter choice is more likely to improve upon previous ones. By ignoring the regions of the space expected to yield poor results and instead focusing on the promising regions, it can reach near-optimal solutions in far fewer iterations.
Given the trade-offs described above, I used the Bayesian strategy to search the hyperparameter space. In addition, I made a few changes to the search space specified in the problem statement, such as focusing on larger batch sizes (given the dimensionality of the input data) and higher learning rates (which help convergence given the low number of epochs used during the hyperparameter search).
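For reference, a Bayesian sweep configuration along these lines might look as follows; the parameter values shown are illustrative rather than the exact search space used, and `train` stands in for the actual training entry point:

```python
import wandb

def train():
    # Placeholder for the actual training entry point: it would call
    # wandb.init(), read hyperparameters from wandb.config, train the
    # network, and log val_accuracy after each epoch.
    pass

sweep_config = {
    "method": "bayes",  # Bayesian search, for the reasons given above
    "metric": {"name": "val_accuracy", "goal": "maximize"},
    "parameters": {
        "epochs": {"values": [10, 20]},
        "num_hidden_layers": {"values": [2, 3, 4]},
        "hidden_size": {"values": [64, 128, 256]},
        "learning_rate": {"values": [1e-3, 1e-4]},
        "batch_size": {"values": [32, 64]},  # biased towards larger batches
        "optimizer": {"values": ["sgd", "momentum", "nag", "rmsprop", "adam", "nadam"]},
        "weight_decay": {"values": [0.0, 0.0005, 0.5]},
        "activation": {"values": ["sigmoid", "tanh", "relu"]},
        "weight_init": {"values": ["random", "xavier"]},
    },
}

sweep_id = wandb.sweep(sweep_config, project="cs6910-assignment1")
wandb.agent(sweep_id, function=train, count=50)  # count bounds the number of runs
```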
Question 5 (5 marks)
Question 6 (20 Marks)
Inferences
- A weight decay of 0.5 penalizes the weight magnitudes far too strongly. This prevents the model from learning better weights and thus leads to unsatisfactory results.
- 10 epochs (and even 20 epochs) are insufficient to approach convergence; consequently, the hyperparameter sets with 20 epochs clearly gave better results. Even so, sticking to fewer epochs lets us explore more of the hyperparameter space to find a set of promising combinations that can then be exploited (with more epochs) to find the best one.
- Random initialization can cause the weights to blow up (when the weight decay is zero) or simply vanish. As a result, random initialization (along with a weight decay of 0.5) is a prominent characteristic of the models with accuracy below 75%. Xavier initialization provides initial weights whose variance is preserved across layers, and thus gives better results (see the initialization sketch after this list).
- Choosing a bigger hidden layer size provides better accuracy, likely owing to the dimensionality of the input data as well as the larger number of hidden features made available to the model.
- The choice of the number of hidden layers is a slightly tricky one. A network with few hidden layers forgoes the hierarchical feature learning a deeper network offers. At the same time, given the rather large hidden layer sizes chosen, very deep networks can overfit by memorizing the input data. Hence, although the sweeps indicate a negative correlation between the number of hidden layers and validation accuracy, models with 2-3 hidden layers yield the best results compared to fewer or more hidden layers.
- Given the fairly low number of epochs, higher learning rates tend to perform better.
- The tanh activation function far outperforms the ReLU and sigmoid activation functions.
- ReLU in particular is hit hard by random weight initialization, which occasionally produces very large or very small weights. Consequently, a notable proportion of the worst-performing models use ReLU as the activation function. Barring these external factors, it performs reasonably well.
- The sigmoid and tanh activation functions have small gradients for inputs of large magnitude, which can slow convergence. However, unlike sigmoid, tanh centres the layer outputs around zero and is therefore able to approach convergence faster.
- The Nadam optimizer performs best, followed by the Adam optimizer, since they combine features of the other optimizers. Among the rest, RMSProp and Nesterov accelerated gradient descent perform reasonably well. Given the somewhat higher learning rates of the best models, momentum-based gradient descent does not feature among the top performers.
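As referenced in the initialization inference above, a minimal numpy sketch of the two initialization schemes (the scale of the random scheme is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_init(fan_in, fan_out, scale=1.0):
    # Plain random initialization; with no weight decay the weights can grow
    # unchecked, and a poor scale can also make the activations vanish.
    return rng.normal(0.0, scale, size=(fan_in, fan_out))

def xavier_init(fan_in, fan_out):
    # Xavier/Glorot initialization: keeps the variance of the activations
    # roughly constant across layers, which is what makes it preferable here.
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))
```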
Recommendations for higher accuracy
The following hyperparameters are suggested for training a model to obtain higher accuracy:
- Epochs: 50
- Number of hidden layers: 2
- Size of hidden layers: 256
- Activation function: tanh
- Optimiser: Nadam
- Weight initialisation: Xavier
- Learning rate: 0.001
- Batch size: 64
- Weight decay: 0.0
This is in fact simply the best set of hyperparameters suggested by the sweeps used to generate the parallel coordinates plot in Question 6. The only modifications are increases in the number of epochs and in the size of the hidden layers, based on the clear rise in train accuracy as the number of epochs increases, and on the general positive correlation with hidden layer size given the dimensionality of the data.
Note that this configuration achieves an approximate train accuracy of 95%, while the validation accuracy over the final few iterations, as well as the test accuracy, oscillates in the region of 88-89%. This clearly indicates that the above configuration overfits the training data. Some measures that could bring the test accuracy closer to 95% are:
- Dataset augmentation: Given the nature of the problem (multi-class classification of images), this is an ideal strategy as it allows us to build a model that generalizes better. Furthermore, if computational cost is an issue, we can augment the data only for those classes that are frequently misclassified (as identified from the confusion matrix), allowing the model to study those particular images better.
- Early stopping: Since the validation accuracy for the above configuration starts oscillating in the final few iterations, this is a viable strategy to prevent overfitting.
- Dropout: Another regularization technique, in which a random fraction of the neurons is dropped during each training pass, preventing the network from relying too heavily on any single neuron (see the sketch after this list).
- Ensemble models: The best-performing models can each be trained for a large number of epochs (making them individually high-variance), and then bagged to eliminate that variance to a large degree, yielding a strong model that also generalizes well.
- Greedy layerwise pre-training: This allows our model to start off with weights that capture the characteristics of the data to some extent.
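As referenced in the dropout item above, a minimal sketch of inverted dropout applied to a hidden activation `h` (the drop probability is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p_drop=0.2, train=True):
    # Inverted dropout: randomly zero a fraction p_drop of the activations
    # during training and rescale the rest, so no change is needed at test time.
    if not train:
        return h
    mask = (rng.random(h.shape) >= p_drop) / (1.0 - p_drop)
    return h * mask
```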
Question 7 (10 Marks)
The best model identified by the sweep has an accuracy of 88.36% on the test set of fashion_mnist. The hyperparameters for this model are as follows:
- Epochs: 20
- Number of hidden layers: 2
- Size of hidden layers: 128
- Activation function: tanh
- Optimiser: Nadam
- Weight initialisation: Xavier
- Learning rate: 0.001
- Batch size: 64
- Weight decay: 0.0
Note: The following confusion matrix has been plotted for different values of epochs (20, 40, 60, 80, 100).
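For reference, a confusion matrix of this kind can be accumulated in numpy as sketched below, assuming `y_true` and `y_pred` are integer class-label arrays:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes=10):
    # cm[i, j] counts test images of true class i predicted as class j;
    # off-diagonal hot spots reveal frequently confused class pairs.
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(cm, (y_true, y_pred), 1)
    return cm
```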
Question 8 (5 Marks)
Squared Error Loss
Loss and Accuracy Plots
Parallel Coordinates Plot and Parameter Importance
Inferences
- The models built using squared error as the loss function slightly underperform compared with those built using cross entropy.
- A striking difference in the best models obtained using squared error as the loss function is their propensity towards a greater number of neurons (either more hidden layers or a larger hidden layer size).
- This is likely due to the nature of the loss function and the modifications required to minimise it.
- Cross entropy focuses on the output probability of the correct label, and minimising cross entropy requires us to increase this probability alone.
- On the other hand, squared error takes the probabilities of all labels (correct or incorrect) into account. Given its quadratic nature, minimising it additionally pushes the incorrect labels towards a roughly equal share of the remaining probability, rather than letting one incorrect label take a significant chunk (see the sketch after this list). This idiosyncrasy of the squared error loss demands a more complex model to optimise, hence the greater number of neurons in the best models trained with it.
- Furthermore, as the number of classes increases, the squared error loss strains the neural network further, leading to even more complex structures.
- Since the fashion_mnist dataset has only 10 classes, this difference in complexity is visible albeit small. With a larger number of classes, squared error, despite providing unbiased estimates, is clearly the worse option for the loss function.
- Cross entropy, by focusing on the probability of the correct label, requires relatively simpler models and is thus the better choice.
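The sketch referenced above makes the contrast concrete for a single (made-up) softmax output:

```python
import numpy as np

y_hat = np.array([0.6, 0.3, 0.1])   # softmax output, illustrative values
y = np.array([1.0, 0.0, 0.0])       # one-hot true label (class 0)

# Cross entropy depends only on the probability of the correct class ...
cross_entropy = -np.log(y_hat[0])            # -log(0.6) ~= 0.51

# ... while squared error penalizes every component, so it also cares how the
# remaining mass is spread over the incorrect classes.
squared_error = np.sum((y - y_hat) ** 2)     # 0.16 + 0.09 + 0.01 = 0.26
```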
Question 9 (10 Marks)
Link to the Github code for this assignment:
Question 10 (10 Marks)
Based on my experimentation on the fashion_mnist dataset, I would suggest the following configurations for the MNIST dataset:
- Configuration 1:
- Epochs: 40
- Number of hidden layers: 2
- Size of hidden layers: 256
- Activation function: tanh
- Optimiser: Nadam
- Weight initialisation: Xavier
- Learning rate: 0.001
- Batch size: 64
- Weight decay: 0.0
- Configuration 2:
- Epochs: 40
- Number of hidden layers: 2
- Size of hidden layers: 256
- Activation function: tanh
- Optimiser: Adam
- Weight initialisation: Xavier
- Learning rate: 0.001
- Batch size: 64
- Weight decay: 0.0
- Configuration 3:
- Epochs: 40
- Number of hidden layers: 3
- Size of hidden layers: 256
- Activation function: ReLU
- Optimiser: Adam
- Weight initialisation: Xavier
- Learning rate: 0.001
- Batch size: 64
- Weight decay: 0.0
The configurations suggested above are based on ideas drawn from the fashion_mnist sweeps. There is clearly a need for more epochs as well as more neurons per hidden layer (due to the input dimensionality). Increasing these, and otherwise using the best configuration obtained from the fashion_mnist sweep, yields configuration 1. Doing the same for the second and third best configurations from that sweep yields configurations 2 and 3. Since the dimensionality and structure of the input data are the same, i.e., each image is modeled by its 28 × 28 pixels (each pixel represented by a number between 0 and 255), we can reasonably transfer good hyperparameter values from the fashion_mnist sweep to MNIST.
The following accuracies were obtained using the configurations specified above:
| Configuration | Train Accuracy | Test Accuracy |
|---|---|---|
| Config. 1 | 0.99998 | 0.9799 |
| Config. 2 | 1.0 | 0.9807 |
| Config. 3 | 0.9957 | 0.9747 |
Self Declaration
I, Abdullah Mohammed, swear on my honour that I have written the code and the report by myself and have not copied it from the internet or other students.