Sweeps Noise
TL;DR: Measure signal relative to noise
I trained 211 distinct variants of a deep net (420 training runs in total):
- one version 210 times (by accident, without a fixed random seed), and
- another 210 different versions once each (as intended).
Here I compare my "noise" and "signal" results.
Bidirectional RNN on MNIST in PyTorch
The model is a bidirectional recurrent neural network (BDRNN) trained on standard MNIST (60K train, 10K test) in PyTorch, based on this wonderful tutorial by yunjey. My fork instruments the main.py training script with Weights & Biases logging and includes a sweep.yaml config file to easily run a W&B Sweep.
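For context, here is a minimal, runnable sketch of what that instrumentation might look like; the project name, config keys, metric names, and placeholder training loop are my assumptions, not the fork's exact code:

```python
import random
import wandb

# Register the run and its hyperparameters with W&B (names are assumptions).
wandb.init(
    project="bdrnn-mnist",
    config={"num_layers": 2, "batch_size": 100, "hidden_size": 128},
)

for epoch in range(2):
    # Stand-ins for the tutorial's real train/eval steps.
    train_loss = 1.0 / (epoch + 1) + random.random() * 0.01
    test_accuracy = 0.90 + 0.01 * epoch
    wandb.log({"epoch": epoch, "train_loss": train_loss,
               "test_accuracy": test_accuracy})

wandb.finish()
```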
Experiment setup
I ran a grid search to exhaustively try combinations of the following hyperparameters (a sweep config sketch follows the list):
- layer count (3 options: 2, 3, 4)
- batch size (7 options: 64, 100, 128, 200, 256, 400, 512)
- hidden layer size (10 options: 10, 20, ..., 100)
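For illustration, here is a sketch of that grid expressed as a Python sweep config (equivalent to a sweep.yaml file); the metric name and project are assumptions, not the actual contents of my fork's config:

```python
import wandb

# Grid over 3 x 7 x 10 = 210 combinations, matching the sweep described above.
sweep_config = {
    "program": "main.py",
    "method": "grid",
    "metric": {"name": "test_accuracy", "goal": "maximize"},  # assumed metric name
    "parameters": {
        "num_layers": {"values": [2, 3, 4]},
        "batch_size": {"values": [64, 100, 128, 200, 256, 400, 512]},
        "hidden_size": {"values": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]},
    },
}
sweep_id = wandb.sweep(sweep_config, project="bdrnn-mnist")  # placeholder project
```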
The "noise", or fake version of the sweep, actually used the following values repeatedly:
- hidden_size = 128
- num_layers = 2
- batch_size = 100
All other hyperparameters stayed fixed, which keeps comparisons straightforward: in the "signal" sweep only the three grid variables change, while the "noise" runs differ only through unseeded randomness (a seeding sketch follows below).
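Since the "noise" runs differ only because nothing was seeded, here is a sketch of pinning PyTorch's main randomness sources; calling it once at the top of main.py would make repeated runs reproducible:

```python
import random
import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    """Pin every common source of randomness for reproducible runs."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # no-op without a GPU

set_seed(42)
```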
Conclusions
Context note: the observations below are based on a small number of experiments and on insights from past experience. This report is meant to showcase what is possible with W&B and to inspire further exploration.
RNNs: Balancing Generalization and Overfitting
Check exactly one of the 3 tabs below to see the effect of the three hyperparameters on accuracy. In the parallel coordinates chart, you can click and drag the hyperparameter columns to rearrange them and make the relationships between variables clearer.
What hyperparameters matter in a small RNN?
After instrumenting the RNN example with W&B, I manually tried a few combinations of my independent variables:
1. number of layers: 2, 3, 4
2. batch size: 32, 64, 128
3. hidden layer size: 32, 64, 128
The full grid space is 3 × 3 × 3 = 27 possibilities, which is much easier to test using argparse and sweeps (see later sections).
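As a sketch, exposing these variables via argparse might look like the following; the flag names and defaults are assumptions (the defaults match the fixed "noise" configuration above):

```python
import argparse

# Hypothetical CLI flags for main.py so a sweep agent can set each variable.
parser = argparse.ArgumentParser(description="BiRNN on MNIST")
parser.add_argument("--num_layers", type=int, default=2)
parser.add_argument("--batch_size", type=int, default=100)
parser.add_argument("--hidden_size", type=int, default=128)
args = parser.parse_args()
print(args)
```

A sweep agent can then launch main.py once per combination, covering all 27 runs automatically.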
1. Layer count: decrease
Grouping by layer count lets me hold the distribution of all the other independent variables constant across groups (see the parallel coordinates chart) and show how accuracy correlates with the number of layers. In this case, the more layers, the less accurate the network. This is likely due to overfitting: each extra layer pushes the number of learned parameters well beyond what is reasonable for this small dataset. This is especially true in a non-convolutional, flat-layer setting, which doesn't generalize as well as convolutional filters do on images with spatial redundancy.
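The same grouped comparison can be reproduced programmatically with the W&B public API; in this sketch, the project path and the test_accuracy metric name are placeholders for whatever your runs actually log:

```python
import pandas as pd
import wandb

# Fetch all runs from the sweep; "my-entity/bdrnn-mnist" is a placeholder path.
api = wandb.Api()
runs = api.runs("my-entity/bdrnn-mnist")

rows = [
    {
        "num_layers": run.config.get("num_layers"),
        "test_accuracy": run.summary.get("test_accuracy"),
    }
    for run in runs
]
df = pd.DataFrame(rows).dropna()

# Mean and spread of accuracy per layer count, mirroring the grouped chart.
print(df.groupby("num_layers")["test_accuracy"].agg(["mean", "std", "count"]))
```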
2. Batch size: increase
In this case, increasing the batch size looks safe and may even improve generalization. That said, the effects of batch size on training dynamics are subtle and depend heavily on the dataset and architecture.
3. Hidden layer size: keep tuning
An intermediate value seems best for the hidden layer size, balancing enough parameters for a rich representation against the risk of overfitting.
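To make that capacity trade-off concrete, here is a sketch that counts parameters as the hidden size grows, assuming an LSTM-based bidirectional model like the one in the tutorial (the exact architecture in my fork may differ):

```python
import torch.nn as nn

def birnn_param_count(hidden_size: int, num_layers: int = 2,
                      input_size: int = 28, num_classes: int = 10) -> int:
    # Assumed architecture: a stacked bidirectional LSTM over 28-pixel rows
    # of MNIST, followed by a linear classifier on the concatenated
    # forward/backward hidden states.
    lstm = nn.LSTM(input_size, hidden_size, num_layers,
                   batch_first=True, bidirectional=True)
    fc = nn.Linear(hidden_size * 2, num_classes)
    return (sum(p.numel() for p in lstm.parameters())
            + sum(p.numel() for p in fc.parameters()))

for h in (10, 50, 100):
    print(f"hidden_size={h}: {birnn_param_count(h):,} parameters")
```

Parameter count grows roughly quadratically with hidden size, which is why the sweet spot sits somewhere in the middle of the swept range for a dataset this small.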