Fashion MNIST

Explore various hyperparameters of a CNN trained on Fashion MNIST to identify 10 types of clothing. Made by Stacey Svetlichnaya using Weights & Biases

Introduction

In this project, I explore some hyperparameters of a simple convolutional neural network (CNN) to build intuition on a manageable example. The dataset is Fashion MNIST: 60,000 images spanning 10 classes of clothing (dress, shirt, sneaker, etc.).

Findings so far

Varying basic hyperparameters: batch size, dropout, learning rate

I train a small convolutional network (2 convolutional layers with max pooling followed by dropout and a fully-connected layer) on Fashion MNIST.
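For concreteness, here is a minimal sketch of this kind of architecture in Keras. The exact filter counts, kernel sizes, and optimizer below are illustrative assumptions, not the precise configuration of these runs:

```python
# Minimal sketch of the small convnet described above.
# Filter counts, kernel sizes, and optimizer are assumptions.
from tensorflow import keras
from tensorflow.keras import layers

def build_model(conv1=32, conv2=64, fc=128, dropout=0.25):
    model = keras.Sequential([
        layers.Input(shape=(28, 28, 1)),          # Fashion MNIST: 28x28 grayscale
        layers.Conv2D(conv1, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(conv2, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(dropout),
        layers.Flatten(),
        layers.Dense(fc, activation="relu"),
        layers.Dense(10, activation="softmax"),   # 10 clothing classes
    ])
    model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.01),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```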

The baseline accuracy is already impressive: 0.9366 training / 0.9146 validation (suggesting slight overfitting). What happens as we increase dropout, vary the batch size, and change the learning rate? You can see the effect of each hyperparameter on train/validation accuracy in the three tabs of the "Results of varying hyperparameters" section below (a minimal sweep script is sketched after this list).

  1. Batch size: No significant effect; the default of 32 performs well.
  2. Dropout: Increasing dropout has the predictable effect of decreasing training accuracy, with no improvement in validation accuracy.
  3. Learning rate: The baseline of 0.01, and lower values generally, perform better. Setting the learning rate too high (0.1) causes training to diverge suddenly.
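A one-at-a-time sweep like this can be scripted directly. Here is a hedged sketch using the hypothetical build_model() from the architecture sketch above, logging each run to Weights & Biases; the dropout grid shown is illustrative:

```python
# Hedged sketch of a one-at-a-time hyperparameter sweep, logging runs
# to Weights & Biases. build_model() is the hypothetical constructor
# sketched earlier; the grid values are illustrative.
import wandb
from tensorflow import keras

(x_train, y_train), (x_val, y_val) = keras.datasets.fashion_mnist.load_data()
x_train = x_train[..., None] / 255.0   # add channel axis, scale to [0, 1]
x_val = x_val[..., None] / 255.0

for dropout in [0.25, 0.4, 0.5]:
    run = wandb.init(project="fashion-mnist", config={"dropout": dropout})
    model = build_model(dropout=dropout)
    hist = model.fit(x_train, y_train, batch_size=32, epochs=10,
                     validation_data=(x_val, y_val))
    # Log the final train/validation accuracy for this configuration
    run.log({"accuracy": hist.history["accuracy"][-1],
             "val_accuracy": hist.history["val_accuracy"][-1]})
    run.finish()
```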

Results of varying hyperparameters

Varying layer size

What happens if we vary the sizes of the three layers (two convolutional, one fully-connected)? You can open the tabs in the following section to see the results.

Hidden (fc) layer size

Increasing the size of the penultimate fully-connected layer leads to lower training loss and slightly faster learning but doesn't significantly affect the validation accuracy (although a size of 512 performs well and may be worth exploring).

Layers 1 & 2

Increasing the size of both convolutional layers gives the model more predictive capacity (more parameters) and increases validation accuracy.
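To make the capacity argument concrete, you can compare parameter counts directly. This snippet reuses the hypothetical build_model() from the earlier sketch; the sizes shown are illustrative:

```python
# Parameter counts grow quickly with layer width. Sizes are
# illustrative; build_model() is the hypothetical constructor above.
for conv1, conv2 in [(16, 32), (32, 64), (64, 128)]:
    model = build_model(conv1=conv1, conv2=conv2)
    print(f"conv1={conv1}, conv2={conv2}: {model.count_params():,} parameters")
```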

Results of varying layer size

Combinations of layer sizes

Increasing all the layers while maintaining their relative sizes raises validation accuracy by about 1% over the baseline.
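One simple way to generate these scaled configurations, assuming a hypothetical (conv1, conv2, fc) baseline of (32, 64, 128) rather than the report's exact values:

```python
# Scale all three layers while keeping their relative sizes fixed.
# The baseline widths here are assumptions, not the report's values.
base = {"conv1": 32, "conv2": 64, "fc": 128}
for scale in (0.5, 1, 2):
    sizes = {name: max(1, int(width * scale)) for name, width in base.items()}
    model = build_model(**sizes)   # hypothetical constructor from above
```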

Results of different layer sizes

Next steps

Hyperparameters more generally

Below you can see a parallel coordinates chart that shows correlations between hyperparameters and an output metric of choice. In this case, I'm using validation accuracy, which you can see in the colorful column on the right. Most of the experiments so far have a high validation accuracy around 0.9. Some hyperparameters like dropout, batch size, and layer 2 size have been sampled more extensively and do not seem to have a strong effect on performance. Others, like momentum, have not been varied and could be promising candidates for further experimentation.

Parallel coordinates chart
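W&B renders this chart in the UI, but if you want a comparable view locally from exported run data, plotly can draw one. The file name and column names below are my assumptions about the export format:

```python
# Sketch: rebuild a similar parallel-coordinates view from a CSV
# export of run data. File and column names are assumptions.
import pandas as pd
import plotly.express as px

runs = pd.read_csv("runs.csv")          # hypothetical export of sweep results
fig = px.parallel_coordinates(
    runs,
    dimensions=["dropout", "batch_size", "layer_2_size", "momentum"],
    color="val_accuracy",               # the colorful right-hand column
)
fig.show()
```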

Evaluating specific examples

In this view, we can browse predictions on specific examples from different runs. One common misclassification you'll notice is between bags and shirts, e.g. because a bag handle resembles a neckline. Even a human can have trouble deciding whether some of these are a shirt or a bag.

Logged example predictions from different runs
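Logging predictions like these takes only a few lines. Here is a minimal sketch, reusing the model and preprocessed validation data from the earlier snippets:

```python
# Minimal sketch of logging per-example predictions to W&B for
# browsing. model, x_val, and y_val are assumed from earlier sketches.
import wandb

class_names = ["t-shirt/top", "trouser", "pullover", "dress", "coat",
               "sandal", "shirt", "sneaker", "bag", "ankle boot"]

preds = model.predict(x_val[:32]).argmax(axis=1)
wandb.log({"examples": [
    wandb.Image(x_val[i],
                caption=f"pred: {class_names[preds[i]]} / "
                        f"true: {class_names[y_val[i]]}")
    for i in range(32)
]})
```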