**Motivation**

When running a hyperparameter search on a model, I often wonder if the changes I see in my key metrics, e.g. validation accuracy, are significant. I might plot the accuracy over epochs for 10 different values of, say, weight decay and see that the final accuracy only varies by about 0.5%. Is that a meaningful correlation or not? Should I explore a bigger range of values or abandon tuning this hyperparameter? Are there some other settings masking any possible impact of weight decay? One way to be more confident that my observations are meaningful and not due to random chance is to compare my experiments to a reference point, or null hypothesis. How does the change in accuracy I see when I vary a single hyperparameter compare to the change I see just from the fundamental stochasticity/randomness of the training (e.g. not setting a random seed)?

**The Signal vs. Noise Scenario**

While learning PyTorch and exploring Sweeps, I happened to train 211 variants of a network with the same code:

- one version 210 times (all hyperparameters fixed, random seed varying), and
- another 210 different versions once (hyperparameters varying, random seed fixed).

In the first scenario, any variance in my results is due to randomness, because all the hyperparameters are fixed. Call this case Noise. In the second scenario, the random seed is fixed, which should minimize any variance due to randomness and reflect the actual signal: changes in network accuracy caused by different hyperparameter settings. Call this case Signal.

Here I compare my Noise and Signal results, mainly looking at **the relative magnitude of the variance in the models' accuracy**. The variance in accuracy due to noise can serve as a reference baseline to estimate the significance of the results. If the accuracy variance is comparable across the two cases, then the effect can be equally well attributed to noise and probably isn't significant. The greater the variance in the Signal case relative to the Noise case, the less likely it is that the observed change is due to noise, and the more interesting/meaningful is that hyperparameter's influence on accuracy.

Here's a summary of the hyperparameter influence on accuracy via parallel coordinate plots. Left: Noise, right: a distinctly more coherent Signal

**Model: Bidirectional RNN on MNIST in PyTorch**

The model is a bidirectional recurrent neural network (BDRNN) trained to identify the handwritten digits 0-9 on the standard MNIST task (60K train, 10K test images) in PyTorch, based on this tutorial by yunjey. It has an input size of 28, looking at the equivalent of one row of pixels in an MNIST image (28*28 pixels) at a time. This is not a typical approach to learning MNIST, but it serves well to illustrate the overall point: **how do we know when the observed effect sizes from our model-tuning efforts are meaningful?**

**Experiment setup**

My fork instruments the *main.py* training script with Weights & Biases logging and includes a *sweep.yaml* config file to easily run a W&B Sweep.

I ran a grid search to exhaustively try combinations of the following hyperparameters:

- layer count (3 options: 2, 3, 4)
- batch size (7 options: 64, 100, 128, 200, 256, 400, 512)
- hidden layer size (10 options: 10, 20, 30,...100)
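As a rough sanity check on the size of this grid, the sketch below enumerates every combination in plain Python. The parameter names `num_layers`, `batch_size`, and `hidden_size` match the ones used in this report, and the dictionary follows the `method`/`parameters`/`values` shape of a W&B grid sweep config. It is an illustration only, not the contents of the actual *sweep.yaml*, and the helper `grid_combinations` is a hypothetical name.

```python
from itertools import product

# Illustrative grid, written in the shape of a W&B sweep config.
# This is a sketch, not the actual sweep.yaml from the fork.
sweep_config = {
    "method": "grid",
    "parameters": {
        "num_layers": {"values": [2, 3, 4]},
        "batch_size": {"values": [64, 100, 128, 200, 256, 400, 512]},
        "hidden_size": {"values": list(range(10, 101, 10))},
    },
}


def grid_combinations(config):
    """Return one dict per hyperparameter combination in the grid."""
    names = list(config["parameters"])
    value_lists = [config["parameters"][name]["values"] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


combos = grid_combinations(sweep_config)
print(len(combos))  # 3 * 7 * 10 = 210
```

Each layer count appears in 70 of these combinations, each batch size in 30, and each hidden size in 21, which is where the group sizes quoted later in the report come from.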

The "noise", or fake version of the sweep, actually used the following values repeatedly:

- hidden_size = 128
- num_layers = 2
- batch_size = 100

All of the other hyperparameters stayed fixed, and only one was varied at a time, for straightforward comparison. Each experiment run trains a BDRNN for 10 epochs with a new configuration of hyperparameters. After every epoch, I log the model's accuracy on the 10K test images (the percentage of handwritten digits identified correctly via the model's highest-confidence prediction). To calculate the accuracy variance, I look at the visualizations of test accuracy over time and estimate the vertical range of the plots, so "variance" here means a visually estimated range rather than a formal statistical variance. Specifically, I look at

- variance in *group mean accuracy*: the vertical difference between two lines, where each plots the mean accuracy across a group of models testing different values for the same hyperparameter (with all other hyperparameters held constant)
- *overall accuracy variance* across individual models: the height of the less-saturated band around each mean plot line, which shows the range of values logged by individual runs within a group at a particular epoch
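The report estimates both quantities by eye from the plots. For anyone who would rather compute them from logged values, here is a minimal sketch, assuming each run is available as a dict holding its hyperparameters and a per-epoch `accuracy` list. That data layout, the function names, and the toy numbers are all made up for illustration and are not results from these experiments.

```python
from collections import defaultdict
from statistics import mean


def group_mean_spread(runs, key, epoch):
    """Gap between the highest and lowest group mean accuracy at one epoch.

    Runs are grouped by the hyperparameter named `key`.
    """
    groups = defaultdict(list)
    for run in runs:
        groups[run[key]].append(run["accuracy"][epoch])
    means = [mean(values) for values in groups.values()]
    return max(means) - min(means)


def overall_range(runs, epoch):
    """Height of the band: max minus min accuracy over all runs at one epoch."""
    values = [run["accuracy"][epoch] for run in runs]
    return max(values) - min(values)


# Made-up toy runs (two epochs each), only to show the expected input shape.
toy_runs = [
    {"num_layers": 2, "accuracy": [90.0, 97.0]},
    {"num_layers": 2, "accuracy": [92.0, 98.0]},
    {"num_layers": 4, "accuracy": [88.0, 97.5]},
    {"num_layers": 4, "accuracy": [94.0, 98.5]},
]

print(group_mean_spread(toy_runs, "num_layers", epoch=1))
print(overall_range(toy_runs, epoch=0))
```

Both functions return a range (max minus min), which matches the "vertical range" reading of variance used in this report. Swapping in `statistics.pvariance` would give a formal variance instead.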

**Overall framework**

- Comparing the variance of mean final accuracy per run group, as well as the overall accuracy variance across runs during training, sets the context (relative magnitude) for an effect size and helps differentiate signal from noise. Comparing the two explicitly can help us decide if an effect size is meaningful enough to care about, or too close to randomness to be worth exploring.
- The difference between signal and random noise is most obvious at the start of training: across all the hyperparameters tested, accuracy variance is roughly 10X higher at epoch 1 in the Signal case than in the Noise case.
- This framework would be interesting to explore for more complex training scenarios, especially given how easy it is to apply and scale with Weights & Biases Sweeps. A promising next direction would be to quantify the number of runs/samples needed to ensure statistically significant results relative to a noise baseline.

**BDRNN-specific**

- For Bidirectional RNNs trained on hand-written digits, hidden layer size appears to be the most impactful hyperparameter. Batch size has a very weak effect (lower batch sizes are slightly better), and layer count (RNN depth) is of negligible importance.
- Using this framework, hidden layer size has a meaningful effect of 1.5% on BDRNN accuracy, about 10X the Noise effect (0.13%).
- Batch size has a weak effect (0.42%), about 4X the Noise effect on accuracy (0.1%).
- Layer count has basically no effect: 0.18%, which is closer in magnitude to the Noise effects than to the other Signal effects, even though it is still 4-5 times greater than the Noise effect of layer count (0.04%).
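The ratios quoted in this list are simple divisions of the Signal effect size by the matching Noise effect size. A short sketch of that arithmetic, using only the numbers reported above (the dictionary keys and the function name `signal_to_noise` are illustrative choices):

```python
# Final-accuracy effect sizes (in percentage points) as reported in this list.
effects = {
    "hidden_size": {"signal": 1.5, "noise": 0.13},
    "batch_size": {"signal": 0.42, "noise": 0.1},
    "num_layers": {"signal": 0.18, "noise": 0.04},
}


def signal_to_noise(effect):
    """Ratio of the Signal effect size to the Noise effect size."""
    return effect["signal"] / effect["noise"]


ratios = {name: signal_to_noise(effect) for name, effect in effects.items()}

# Print the hyperparameters from largest to smallest ratio.
for name, ratio in sorted(ratios.items(), key=lambda item: -item[1]):
    print(f"{name}: {ratio:.1f}x")
```

The ratio alone does not rank the hyperparameters: layer count has a slightly higher ratio than batch size, but its absolute effect (0.18%) is barely above the other Noise effects, which is why it is treated as negligible here.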

Context note: the observations below are based on a small number of experiments and on insights from past experience. This report is meant to showcase what is possible with W&B and to inspire further exploration.

In the Noise scenario, the group mean accuracy (70 runs per group for each of 3 layer count values) varies by **about 0.05% throughout**. The maximum variance across all 210 runs and the whole training period is **around 4%** at the very beginning (down to about 1% by the end).

The parallel coordinates show that more layers correspond to worse performance, although the variance in accuracy is tiny (about **0.04%**). This suggests layer count has basically no effect.

In the Signal scenario, the group mean accuracy (70 runs per group for each of 3 layer count values) varies by **about 1.2% at the start** and **0.2% by the end**. The maximum variance across all 210 runs and the whole training period is **around 47%** at the very beginning (down to about 2.5% by the end).

The parallel coordinates show that more layers correspond to **better** performance, which is the opposite conclusion from the Noise scenario. It aligns with the general intuition that more free parameters allow us to learn a richer representation. Still, the effect size is tiny: **0.18%** improvement on average when we increase the number of layers from 2 to 4.

In the Noise scenario, the group mean accuracy (30 runs per group for each of 7 possible batch sizes) varies by **only 0.1% throughout**. The maximum variance across all 210 runs and the whole training period is **around 4%** at the very beginning (down to about 1% by the end).

The parallel coordinates show a jumble instead of a pattern: the best batch size is the smallest, the second best is the second largest batch size, and the others are worse in random order. The overall variance in final mean accuracy is tiny (about **0.12%** improvement in accuracy when using the smallest batch size compared to the largest), suggesting batch size has almost no effect.

In the Signal scenario, the group mean accuracy (30 runs per group for each of 7 possible batch sizes) varies by **9% at the start** and about **0.4% at the end**. The **maximum variance is around 45%**, again at the very start of training (down to about **2%** by the end). This is 10X the maximum variance of the Noise scenario. In both scenarios, metrics start out noisy and gradually converge to very similar values. The effect of randomization and hyperparameter settings may be stark at the beginning and less visible with every epoch as all the other learning dynamics come into play (it is almost reassuring that the fundamental learning process works across a range of values and is not brittle). Also, the overall variance is much greater throughout the training period in the Signal case than in the Noise case.

The parallel coordinates show a much more reasonable inverse correlation: smaller batches correspond to slightly higher accuracy. The discrepancy in final mean accuracy is about **0.42%** (between best and worst group by batch size), suggesting batch size has a very small effect.

In the Noise scenario, the group mean accuracy (21 runs per group for each of 10 hidden layer sizes) varies by **0.2% at the start** and about **0.1% at the end**. The maximum variance across all runs is around **4%**, right at the start (down to about **1%** by the end).

The parallel coordinates show a strange pattern, with the highest accuracies corresponding to intermediate values for hidden size, while bigger and smaller hidden sizes all weaken performance. The overall effect size (difference in mean final accuracy) is **0.13%**.

In the Signal scenario, the group mean accuracy (21 runs per group for each of 10 hidden layer sizes) varies by **15% at the start** and about **1.5% at the end**. The maximum variance over the training period is **around 45%**, right at the start (again down to **about 2%** by the end of training).

The parallel coordinates show a much more reasonable relationship—increasing the hidden layer size generally increases the accuracy, with diminishing returns (less change in accuracy per fixed increase in hidden layer size). The overall effect size is **1.5%**: this is how much tuning the hidden layer size can improve the model. While this is still relatively small, it could make a real difference in a benchmark or in a large-scale application.

These plots side-by-side are very similar. The main difference is the amount of variance at the start of training: **7% for Signal and 3% for Noise**. The variance is also slightly greater throughout training for Signal compared to Noise. This suggests that in real, meaningful experiments, we might expect to see more variance.

Seeing these two dense parallel coordinates plots next to each other—Noise above, Signal in this section—is very helpful.

- **increased hidden layer size is a predictor of accuracy**: you can see how the lower-accuracy purple is distributed evenly across the hidden size nodes in the Noise plot, but concentrated in the bottom nodes in the Signal plot. As hidden size increases in the Signal plot, the overall color gradient at each node becomes less purple and more orange (higher accuracy).
- **batch size has a weak effect**: you can see slightly more purple across nodes in Noise than in Signal. In Signal, the lower nodes (smaller batch size) have more orange lines and the higher nodes (larger batch size) have more purple.
- **layer count has essentially no effect** in this setting: the layer count nodes are very similar across the two plots. Layer count 4 is slightly more purple in Noise than in Signal, supporting the counterintuitive (and likely incorrect) observation in the Noise scenario that increased layer count decreases accuracy.
- **variance due to meaningful changes is greater than variance due to random noise**: the final accuracy range is about **3% for Signal and 0.9% for Noise**. This gives us a baseline of how much variance in this metric we can attribute to mere chance and how much greater an effect size must be before we consider it meaningful and interesting, relative to random noise.