When running a hyperparameter search on a model, I often wonder if the changes I see in my key metrics, e.g. validation accuracy, are significant. I might plot the accuracy over epochs for 10 different values of, say, weight decay and see that the final accuracy only varies by about 0.5%. Is that a meaningful correlation or not? Should I explore a bigger range of values or abandon tuning this hyperparameter? Are there some other settings masking any possible impact of weight decay? One way to be more confident that my observations are meaningful and not due to random chance is to compare my experiments to a reference point, or null hypothesis. How does the change in accuracy I see when I vary a single hyperparameter compare to the change I see just from the fundamental stochasticity/randomness of the training (e.g. not setting a random seed)?
** The Signal vs. Noise Scenario **
While learning PyTorch and exploring Sweeps, I happened to train 211 variants of a network with the same code:
In the first scenario, any variance in my results is due to randomness, because all the hyperparameters are fixed. Call this case Noise. In the second scenario, the random seed is fixed, which should minimize any variance due to randomness and reflect the actual signal: changes in network accuracy caused by different hyperparameter settings. Call this case Signal.
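For the Signal case, fixing the seed might look like the sketch below. This is an assumption about how one could do it, not the seeding code used for these runs: `seed_everything` is a hypothetical helper name, and the NumPy and PyTorch calls are only made if those libraries are installed.

```python
import random


def seed_everything(seed: int) -> None:
    """Seed the common sources of randomness in a training script.

    Hypothetical helper for illustration; NumPy and PyTorch are seeded
    only if they can be imported.
    """
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)           # CPU generator
        torch.cuda.manual_seed_all(seed)  # silently ignored without CUDA
    except ImportError:
        pass


# Re-seeding with the same value reproduces the same random draws.
seed_everything(42)
first = [random.random() for _ in range(3)]
seed_everything(42)
second = [random.random() for _ in range(3)]
print(first == second)
```

Even with a fixed seed, some GPU operations can be nondeterministic, so a small amount of residual noise in the Signal case would not be surprising.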
Here I compare my Noise and Signal results, mainly looking at the relative magnitude of the variance in the models' accuracy. The variance in accuracy due to noise can serve as a reference baseline to estimate the significance of the results. If the accuracy variance is comparable across the two cases, then the effect can be equally well attributed to noise and probably isn't significant. The greater the variance in the Signal case relative to the Noise case, the less likely it is that the observed change is due to noise, and the more interesting/meaningful is that hyperparameter's influence on accuracy.
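One simple way to put a number on this "variance" is the range (max minus min) of accuracy at each epoch, either across all runs or across the mean curves of runs grouped by one hyperparameter value. The sketch below assumes each run is a dict with a `config` and a per-epoch `accuracy` list; that layout, the key name `layers`, and the accuracy values are made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def per_epoch_range(curves):
    """Max minus min across a set of accuracy curves, at each epoch."""
    return [max(vals) - min(vals) for vals in zip(*curves)]


def group_mean_curves(runs, key):
    """Average the accuracy curves of runs that share the same value of `key`."""
    groups = defaultdict(list)
    for run in runs:
        groups[run["config"][key]].append(run["accuracy"])
    return {
        value: [mean(vals) for vals in zip(*curves)]
        for value, curves in groups.items()
    }


# Made-up test accuracies (percent) for four runs over three epochs.
runs = [
    {"config": {"layers": 2}, "accuracy": [90.0, 96.0, 98.0]},
    {"config": {"layers": 2}, "accuracy": [92.0, 97.0, 98.5]},
    {"config": {"layers": 4}, "accuracy": [85.0, 96.5, 98.5]},
    {"config": {"layers": 4}, "accuracy": [95.0, 97.5, 99.0]},
]

all_runs_range = per_epoch_range([r["accuracy"] for r in runs])
group_range = per_epoch_range(list(group_mean_curves(runs, "layers").values()))

print("range across all runs, per epoch:", all_runs_range)
print("range of group means, per epoch:", group_range)
```

Running the same two measurements on the Noise runs and on the Signal runs gives the pair of numbers to compare for each hyperparameter.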
Here's a summary of the hyperparameter influence on accuracy via parallel coordinates plots. Left: Noise; right: a distinctly more coherent Signal.
** Model: Bidirectional RNN on MNIST in PyTorch **
The model is a bidirectional recurrent neural network (BDRNN) trained to identify the handwritten digits 0-9 on the standard MNIST task (60K train, 10K test images) in PyTorch, based on this tutorial by yunjey. It has an input size of 28, looking at the equivalent of one row of pixels in an MNIST image (28*28 pixels) at a time. This is not a typical approach to learning MNIST, but it serves well to illustrate the overall point: how do we know when the observed effect sizes from our model-tuning efforts are meaningful?
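A minimal sketch of such a model is shown below, assuming an LSTM cell and a linear readout from the last time step. The class name `BiRNN` and the default `hidden_size` and `num_layers` are placeholders, not the exact settings used in these experiments.

```python
import torch
import torch.nn as nn


class BiRNN(nn.Module):
    """Bidirectional LSTM that reads an MNIST image one row at a time."""

    def __init__(self, input_size=28, hidden_size=128, num_layers=2, num_classes=10):
        super().__init__()
        self.lstm = nn.LSTM(
            input_size,
            hidden_size,
            num_layers,
            batch_first=True,    # input shape: (batch, sequence, features)
            bidirectional=True,
        )
        # Forward and backward hidden states are concatenated, hence the factor 2.
        self.fc = nn.Linear(hidden_size * 2, num_classes)

    def forward(self, x):
        # x: (batch, 28 rows, 28 pixels per row)
        out, _ = self.lstm(x)
        # Classify from the output at the last time step.
        return self.fc(out[:, -1, :])


model = BiRNN()
images = torch.zeros(4, 1, 28, 28)          # dummy batch shaped like MNIST
logits = model(images.view(-1, 28, 28))     # treat each row as one time step
print(logits.shape)
```

The layer count and hidden size varied in the sweep would map onto `num_layers` and `hidden_size` in a model of this shape.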
** Experiment setup **
I ran a grid search to exhaustively try combinations of the following hyperparameters:
The "noise", or fake version of the sweep, actually used the following values repeatedly:
For a straightforward comparison, only one hyperparameter was varied at a time while all the others stayed fixed. Each experiment run trains a BDRNN for 10 epochs with a new configuration of hyperparameters. After every epoch, I log the model's accuracy on the 10K test images (the percentage of handwritten digits identified correctly via the model's highest-confidence prediction). To calculate the accuracy variance, I look at the visualizations of test accuracy over time and estimate the vertical range of the plots. Specifically, I look at
** Overall framework **
** BDRNN-specific **
Context note: the observations below are based on a small number of experiments and on insights from past experience. This report is meant to showcase what is possible with W&B and to inspire further exploration.
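As a quick check on the run counts quoted in the results, a full grid over 3 layer counts, 7 batch sizes and 10 hidden sizes gives 210 configurations, with 70, 30 and 21 runs per group respectively. The sketch below uses option indices as stand-ins for the real values; only the number of options per hyperparameter comes from the text, and the key names are hypothetical.

```python
from collections import Counter
from itertools import product

# Option indices stand in for the real hyperparameter values.
grid = {
    "layers": list(range(3)),
    "batch_size": list(range(7)),
    "hidden_size": list(range(10)),
}

# Every combination of one option per hyperparameter.
configs = [dict(zip(grid, combo)) for combo in product(*grid.values())]
print("total runs:", len(configs))

runs_per_group = {}
for name in grid:
    counts = Counter(cfg[name] for cfg in configs)
    # In a full grid, every option of a hyperparameter appears equally often.
    assert len(set(counts.values())) == 1
    runs_per_group[name] = next(iter(counts.values()))

print("runs per group:", runs_per_group)
```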
The group mean accuracy (70 runs per group for each of 3 layer count values) varies by about 0.05% throughout. The maximum variance across all 210 runs and the whole training period is around 4% at the very beginning (down to about 1% by the end).
The parallel coordinates show that more layers correspond with worse performance, although the variance in accuracy is tiny (about 0.04%). This suggests layer count has basically no effect.
The group mean accuracy (70 runs per group for each of 3 layer count values) varies by about 1.2% at the start and 0.2% by the end. The maximum variance across all 210 runs and the whole training period is around 47% at the very beginning (down to about 2.5% by the end).
The parallel coordinates show that more layers correspond with better performance, which is the opposite conclusion from the Noise scenario. It aligns with the general intuition that more free parameters allow us to learn a richer representation. Still, the effect size is tiny: 0.18% improvement on average when we increase the number of layers from 2 to 4.
The group mean accuracy (30 runs per group for each of 7 possible batch sizes) varies by only 0.1% throughout. The maximum variance across all 210 runs and the whole training period is around 4% at the very beginning (down to about 1% by the end).
The parallel coordinates show a jumble instead of a pattern: the best batch size is the smallest, the second best is the second largest batch size, and the others are worse in random order. The overall variance in final mean accuracy is tiny (about 0.12% improvement in accuracy when using the smallest batch size compared to the largest), suggesting batch size has almost no effect.
The group mean accuracy (30 runs per group for each of 7 possible batch sizes) varies by 9% at the start and about 0.4% at the end. The maximum variance is around 45%, again at the very start of training (down to about 2% by the end). This is roughly 10X the maximum variance of the Noise scenario. In both scenarios, metrics start out noisy and gradually converge to very similar values. The effect of randomization and hyperparameter settings may be stark at the beginning and become less visible with every epoch as all the other learning dynamics come into play. This is almost reassuring: the fundamental learning process works across a range of values and is not brittle. Also, the overall variance is much greater throughout the training period in the Signal case than in the Noise case.
The parallel coordinates show a much more reasonable inverse correlation: smaller batches correspond to slightly higher accuracy. The discrepancy in final mean accuracy is about 0.42% (between best and worst group by batch size), suggesting batch size has a very small effect.
The group mean accuracy (21 runs per group for each of 10 hidden layer sizes) varies by 0.2% at the start and about 0.1% at the end. The maximum variance across all runs is around 4%, right at the start (down to about 1% by the end).
The parallel coordinates show a strange pattern, with the highest accuracies corresponding to intermediate values for hidden size, while bigger and smaller hidden sizes all weaken performance. The overall effect size (difference in mean final accuracy) is 0.13%.
The group mean accuracy (21 runs per group for each of 10 hidden layer sizes) varies by 15% at the start and about 1.5% at the end. The maximum variance over the training period is around 45%, right at the start (again down to about 2% by the end of training).
The parallel coordinates show a much more reasonable relationship: increasing the hidden layer size generally increases the accuracy, with diminishing returns (less change in accuracy per fixed increase in hidden layer size). The overall effect size is 1.5%: this is how much tuning the hidden layer size can improve the model. While this is still relatively small, it could make a real difference in a benchmark or in a large-scale application.
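To compare the three hyperparameters directly, the final effect sizes reported above can be divided by their Noise counterparts. This ratio is only a rough heuristic, not a statistical test, and the helper name `signal_to_noise` is hypothetical; the numbers are the visual estimates quoted in the text.

```python
# Final effect sizes (percent test accuracy) as quoted in the text.
reported = {
    "layer count": {"noise": 0.04, "signal": 0.18},
    "batch size": {"noise": 0.12, "signal": 0.42},
    "hidden size": {"noise": 0.13, "signal": 1.5},
}


def signal_to_noise(signal, noise):
    """How many times larger the Signal effect is than the Noise baseline."""
    return signal / noise


ratios = {
    name: signal_to_noise(v["signal"], v["noise"]) for name, v in reported.items()
}

# Print from the largest ratio to the smallest.
for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: Signal effect is about {ratio:.1f}x the Noise baseline")
```

By this measure hidden size stands out most clearly from the noise, while layer count and batch size sit only a few multiples above it.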
These plots side-by-side are very similar. The main difference is the amount of variance at the start of training: 7% for Signal and 3% for Noise. The variance is also slightly greater throughout training for Signal compared to Noise. This suggests that in real, meaningful experiments, we might expect to see more variance.
Seeing these two dense parallel coordinates plots next to each other (Noise above, Signal in this section) is very helpful.