Sweeps Noise
TL;DR: Measure signal relative to noise
I trained 211 distinct variants of a deep net (420 training runs in total):
- one version 210 times (by accident, without a fixed random seed), and
- another 210 different versions once each (as intended).
Here I compare my "noise" and "signal" results.
Bidirectional RNN on MNIST in PyTorch
The model is a bidirectional recurrent neural network (BDRNN) trained on standard MNIST (60K train, 10K test) in PyTorch, based on this wonderful tutorial by yunjey. My fork instruments the main.py training script with Weights & Biases logging and includes a sweep.yaml config file to easily run a W&B Sweep.
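For context, here is a minimal, runnable sketch of what that instrumentation might look like; the project name, config keys, metric names, and placeholder training loop are my assumptions, not the fork's exact code:

```python
import random
import wandb

# Register the run and its hyperparameters with W&B (names are assumptions).
wandb.init(
    project="bdrnn-mnist",
    config={"num_layers": 2, "batch_size": 100, "hidden_size": 128},
)

for epoch in range(2):
    # Stand-ins for the tutorial's real train/eval steps.
    train_loss = 1.0 / (epoch + 1) + random.random() * 0.01
    test_accuracy = 0.90 + 0.01 * epoch
    wandb.log({"epoch": epoch, "train_loss": train_loss,
               "test_accuracy": test_accuracy})

wandb.finish()
```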
Experiment setup
I ran a grid search to exhaustively try combinations of the following hyperparameters (a sweep config sketch follows the list):
- layer count (3 options: 2, 3, 4)
- batch size (7 options: 64, 100, 128, 200, 256, 400, 512)
- hidden layer size (10 options: 10, 20, ..., 100)
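For illustration, here is a sketch of that grid expressed as a Python sweep config (equivalent to a sweep.yaml file); the metric name and project are assumptions, not the actual contents of my fork's config:

```python
import wandb

# Grid over 3 x 7 x 10 = 210 combinations, matching the sweep described above.
sweep_config = {
    "program": "main.py",
    "method": "grid",
    "metric": {"name": "test_accuracy", "goal": "maximize"},  # assumed metric name
    "parameters": {
        "num_layers": {"values": [2, 3, 4]},
        "batch_size": {"values": [64, 100, 128, 200, 256, 400, 512]},
        "hidden_size": {"values": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]},
    },
}
sweep_id = wandb.sweep(sweep_config, project="bdrnn-mnist")  # placeholder project
```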
The "noise", or fake version of the sweep, actually used the following values repeatedly:
- hidden_size = 128
- num_layers = 2
- batch_size = 100
All other hyperparameters stayed fixed, which keeps comparisons straightforward: in the "signal" sweep only the three grid variables change, while the "noise" runs differ only through unseeded randomness (a seeding sketch follows below).
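Since the "noise" runs differ only because nothing was seeded, here is a sketch of pinning PyTorch's main randomness sources; calling it once at the top of main.py would make repeated runs reproducible:

```python
import random
import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    """Pin every common source of randomness for reproducible runs."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # no-op without a GPU

set_seed(42)
```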
Conclusions
Context note: the observations below are based on a small number of experiments and on insights from past experience. This report is meant to showcase what is possible with W&B and to inspire further exploration.
RNNs: Balancing Generalization and Overfitting
Check exactly one of the 3 tabs below to see the effect of the three hyperparameters on accuracy. In the parallel coordinates chart, you can click and drag the hyperparameter columns to rearrange them and make the relationships between variables clearer.
What hyperparameters matter in a small RNN?
After instrumenting the RNN example with W&B, I manually tried a few combinations of my independent variables:
1. number of layers: 2, 3, 4
2. batch size: 32, 64, 128
3. hidden layer size: 32, 64, 128
The full grid space is 3 × 3 × 3 = 27 possibilities, which is much easier to test using argparse and sweeps (see later sections).
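As a sketch, exposing these variables via argparse might look like the following; the flag names and defaults are assumptions (the defaults match the fixed "noise" configuration above):

```python
import argparse

# Hypothetical CLI flags for main.py so a sweep agent can set each variable.
parser = argparse.ArgumentParser(description="BiRNN on MNIST")
parser.add_argument("--num_layers", type=int, default=2)
parser.add_argument("--batch_size", type=int, default=100)
parser.add_argument("--hidden_size", type=int, default=128)
args = parser.parse_args()
print(args)
```

A sweep agent can then launch main.py once per combination, covering all 27 runs automatically.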
1. Layer count: decrease
Grouping by layer count lets me hold the distribution of all the other independent variables constant across groups (see the parallel coordinates chart) and show how accuracy correlates with the number of layers. In this case, the more layers, the less accurate the network. This is likely due to overfitting: each extra layer pushes the number of learned parameters well beyond what is reasonable for this small dataset. This is especially true in a non-convolutional, flat-layer setting, which doesn't generalize as well as convolutional filters do on images with spatial redundancy.
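The same grouped comparison can be reproduced programmatically with the W&B public API; in this sketch, the project path and the test_accuracy metric name are placeholders for whatever your runs actually log:

```python
import pandas as pd
import wandb

# Fetch all runs from the sweep; "my-entity/bdrnn-mnist" is a placeholder path.
api = wandb.Api()
runs = api.runs("my-entity/bdrnn-mnist")

rows = [
    {
        "num_layers": run.config.get("num_layers"),
        "test_accuracy": run.summary.get("test_accuracy"),
    }
    for run in runs
]
df = pd.DataFrame(rows).dropna()

# Mean and spread of accuracy per layer count, mirroring the grouped chart.
print(df.groupby("num_layers")["test_accuracy"].agg(["mean", "std", "count"]))
```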
2. Batch size: increase
In this case, increasing the batch size looks safe and may even improve generalization. That said, the effects of batch size on training dynamics are subtle and depend heavily on the dataset and architecture.
3. Hidden layer size: keep tuning
An intermediate value seems best for the hidden layer size, balancing enough parameters for a rich representation against the risk of overfitting.
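To make that capacity trade-off concrete, here is a sketch that counts parameters as the hidden size grows, assuming an LSTM-based bidirectional model like the one in the tutorial (the exact architecture in my fork may differ):

```python
import torch.nn as nn

def birnn_param_count(hidden_size: int, num_layers: int = 2,
                      input_size: int = 28, num_classes: int = 10) -> int:
    # Assumed architecture: a stacked bidirectional LSTM over 28-pixel rows
    # of MNIST, followed by a linear classifier on the concatenated
    # forward/backward hidden states.
    lstm = nn.LSTM(input_size, hidden_size, num_layers,
                   batch_first=True, bidirectional=True)
    fc = nn.Linear(hidden_size * 2, num_classes)
    return (sum(p.numel() for p in lstm.parameters())
            + sum(p.numel() for p in fc.parameters()))

for h in (10, 50, 100):
    print(f"hidden_size={h}: {birnn_param_count(h):,} parameters")
```

Parameter count grows roughly quadratically with hidden size, which is why the sweet spot sits somewhere in the middle of the swept range for a dataset this small.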