Hyperparameters of a Simple CNN Trained on Fashion MNIST
This article explores various hyperparameters of a convolutional neural network (CNN) trained on Fashion MNIST to identify 10 types of clothing
In this project, we explore some hyperparameters of a simple convolutional neural network (CNN) to build intuitions for a manageable example.
The dataset is Fashion MNIST: 60,000 images of 10 classes of clothing (dress, shirt, sneaker, etc.).
Table of Contents
- Varying Basic Hyperparameters: Batch Size, Dropout, Learning Rate
- Results of Varying Hyperparameters
- Varying Layer Size
- Results of Varying Layer Size
- Combinations of Layer Sizes
- Results of Different Layer Sizes
- Hyperparameters More Generally
- Parallel Coordinates Chart
- Evaluating Specific Examples
- Logged Example Predictions from Different Runs
Findings so far
- varying most of the straightforward hyperparameters (like batch size, dropout, and learning rate) doesn't have a strong effect on validation accuracy.
- the most promising direction is increasing layer sizes (validation accuracy increases by about 1%) and exploring different ratios of consecutive layer sizes (perhaps building up to more complex architectures).
- some of the classes are harder for a human to distinguish than others, so investigating class-specific accuracy may prove useful.

Varying Basic Hyperparameters: Batch Size, Dropout, Learning Rate
I train a small convolutional network (2 convolutional layers with max pooling followed by dropout and a fully-connected layer) on Fashion MNIST.
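For reference, here is a minimal Keras sketch of such an architecture. The filter counts, kernel size, dropout rate, and SGD optimizer are illustrative assumptions; only the overall shape (two convolutional layers with max pooling, dropout, and a fully-connected layer) comes from the description above.

```python
# Minimal sketch of the baseline architecture; filter counts, kernel
# size, dropout rate, and optimizer choice are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model(conv1=32, conv2=64, hidden=128, dropout=0.25):
    model = models.Sequential([
        layers.Conv2D(conv1, (3, 3), activation="relu", input_shape=(28, 28, 1)),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(conv2, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(dropout),
        layers.Flatten(),
        layers.Dense(hidden, activation="relu"),
        layers.Dense(10, activation="softmax"),  # 10 clothing classes
    ])
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```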
The baseline accuracy is already impressive: 0.9366 training / 0.9146 validation (the gap suggests slight overfitting). What happens as we increase dropout, vary the batch size, and change the learning rate? You can see the effect of each hyperparameter on train/val accuracy by selecting one of the three tabs in the "Results of Varying Hyperparameters" section below.
- Batch size: No significant effect; the default of 32 performs well.
- Dropout: Increasing dropout predictably decreases training accuracy without improving validation accuracy.
- Learning rate: The baseline of 0.01, and lower values generally, perform better. Setting the learning rate too high (0.1) leads to sudden divergence.
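A hypothetical W&B sweep over these three hyperparameters might look like the following; the value grids, project name, and `train` function are assumptions, not the report's actual configuration.

```python
# Hypothetical sweep config; the value grids are illustrative, not the
# exact settings behind the charts below.
import wandb

sweep_config = {
    "method": "grid",
    "metric": {"name": "val_accuracy", "goal": "maximize"},
    "parameters": {
        "batch_size": {"values": [16, 32, 64, 128]},
        "dropout": {"values": [0.1, 0.25, 0.4, 0.5, 0.6, 0.75]},
        "learning_rate": {"values": [0.001, 0.003, 0.01, 0.03, 0.1]},
    },
}
sweep_id = wandb.sweep(sweep_config, project="fashion-mnist-cnn")
# wandb.agent(sweep_id, function=train)  # train() builds and fits the model
```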
Results of Varying Hyperparameters
[Interactive panels, one tab per hyperparameter: Batch size (4 runs), Dropout (6 runs), Learning rate (5 runs)]
Varying Layer Size
What happens if we vary the sizes of the three layers (two convolutional, one fully-connected)? You can select the tabs in the following section to see the results.
Hidden (fc) Layer Size
Increasing the size of the penultimate fully-connected layer leads to lower training loss and slightly faster learning but doesn't significantly affect the validation accuracy (although a size of 512 performs well and may be worth exploring).
Layers 1 & 2
Increasing both layers gives the model more predictive capacity (more parameters) and increases validation accuracy.
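As a rough illustration of the capacity claim, using the hypothetical build_model sketch above (the widths are arbitrary):

```python
# Parameter count grows quickly as the conv layer widths increase.
for conv1, conv2 in [(16, 32), (32, 64), (64, 128)]:
    model = build_model(conv1=conv1, conv2=conv2)
    print(f"conv1={conv1}, conv2={conv2}: {model.count_params():,} params")
```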
Results of Varying Layer Size
[Interactive panels: Hidden layer size (5 runs), Layers 1 & 2 (7 runs)]
Combinations of Layer Sizes
By increasing all the layers while maintaining their relative sizes, the validation accuracy goes up by about 1% from baseline.
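With the hypothetical build_model sketch above, this kind of scaling widens the whole network while keeping the ratios between layers fixed (the baseline widths and factors are illustrative):

```python
# Scale all three layer widths by a common factor, preserving their
# relative sizes; assumes x_train, y_train, x_val, y_val are in scope.
for scale in (1, 2, 4):
    model = build_model(conv1=32 * scale, conv2=64 * scale, hidden=128 * scale)
    # model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=10)
```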
Next Steps
- consider class-specific accuracy: is the model better at identifying certain items of clothing? are other items particularly problematic? (see the per-class accuracy sketch after this list)
- explore the learning optimizer space (settings for optimizer, learning rate, decay, and momentum)
- broader architecture search: number and kinds of layers, kernel size, etc.
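As a sketch of the first item, per-class accuracy can be read off a confusion matrix (assuming a trained `model` and integer-labeled validation arrays `x_val`, `y_val`):

```python
# Per-class validation accuracy via a confusion matrix.
import numpy as np

preds = np.argmax(model.predict(x_val), axis=1)
confusion = np.zeros((10, 10), dtype=int)
for true, pred in zip(y_val, preds):
    confusion[true, pred] += 1
per_class_accuracy = confusion.diagonal() / confusion.sum(axis=1)
print(per_class_accuracy.round(3))  # one accuracy per clothing class
```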
Results of Different Layer Sizes
[Interactive panel: Layer size variations (6 runs)]
Hyperparameters More Generally
Below you can see a parallel coordinates chart that shows correlations between hyperparameters and an output metric of choice. In this case, I'm using validation accuracy, which you can see in the colorful column on the right. Most of the experiments so far have a high validation accuracy of around 0.9. Some hyperparameters like dropout, batch size, and layer 2 size have been sampled more extensively and do not seem to have a strong effect on performance. Others, like momentum, have not been varied and could be promising candidates for further experimentation.
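For hyperparameters to appear as columns in this chart, each run logs them to its config; a minimal sketch (the names and values here are illustrative):

```python
# Hyperparameters in wandb.config become columns in the parallel
# coordinates chart; logged metrics (val_accuracy) become the output axis.
import wandb

run = wandb.init(project="fashion-mnist-cnn",
                 config={"batch_size": 32, "dropout": 0.25,
                         "learning_rate": 0.01, "hidden": 128})
# ... training loop, logging metrics each epoch ...
wandb.log({"val_accuracy": 0.915})  # placeholder value
run.finish()
```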
Parallel Coordinates Chart
[Parallel coordinates chart: all 33 runs]
Evaluating Specific Examples
In this view, we can browse through predictions on specific examples from different runs. One common misclassification you'll notice as you browse is between bags and shirts, e.g. when a bag handle resembles a neckline. Even a human can have trouble with these.
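A minimal sketch of how such example predictions can be logged (assumes `x_val`, `y_val`, `preds`, and a `class_names` list are in scope):

```python
# Log a handful of validation images with predicted vs. true labels.
import wandb

examples = [
    wandb.Image(x_val[i],
                caption=f"pred: {class_names[preds[i]]} / true: {class_names[y_val[i]]}")
    for i in range(32)
]
wandb.log({"example_predictions": examples})
```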

Logged Example Predictions from Different Runs
[Example prediction panels: all 33 runs]