
Hyperparameter search testing

Making sure I have wandb.ai set up in a way such that I can compare hyperparameter settings fairly easily.
Created on December 1|Last edited on December 4
Piloting a search over beta_mu in [-10, -5, -1] and over whether to use the mean or mu when calculating probabilities (3 seeds per combination). I test train because it only learns from positive data, and label entropy (entropy_pred) because it uses probabilities in its heuristic but is less compute-heavy than expected information gain.
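For concreteness, the pilot grid can be written out explicitly (the dict keys mirror the names in this report; the particular seed values are assumed):

```python
# A minimal sketch of the pilot grid. Plain itertools rather than a
# wandb sweep so the grid is explicit; the seed values are placeholders.
from itertools import product

beta_mus = [-10, -5, -1]
use_means = [False, True]
seeds = [0, 1, 2]

runs = [
    {"beta_mu": b, "use_mean": m, "seed": s}
    for b, m, s in product(beta_mus, use_means, seeds)
]
print(len(runs))  # 18 runs: 3 betas x 2 mean settings x 3 seeds
```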
Results show that using mu instead of the mean is better. At first glance (grouping by use_mean) the mean appears better, but I believe this is an artifact of mu being especially bad for one specific combination, beta = -1. Looking at the ungrouped plot (averaged over seeds only), the best results all use mu. I imagine this is because the probabilities can get closer to 1: mu can get arbitrarily close to 0, whereas the mean only gets there if sigma decreases as well. Beyond inflating performance across the board, it seems to especially help the label entropy policy, which might be because mu is a better estimate of the actual probability/confidence (?).
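As an illustration of the mu-vs-mean intuition (the exact parameterization here is my assumption, not taken from the code): suppose the model's log-probability is Normal(mu, sigma^2), clipped at 0 so that p = exp(x) stays a valid probability. The plug-in estimate exp(mu) approaches 1 as mu approaches 0, while the mean of p stays noticeably lower until sigma shrinks as well:

```python
# Assumed parameterization for illustration only: log-probability
# x ~ Normal(mu, sigma^2), clipped at 0 so p = exp(x) <= 1.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = -0.01, 1.0

x = np.minimum(rng.normal(mu, sigma, size=100_000), 0.0)
p_plugin = np.exp(min(mu, 0.0))  # mu alone can push p near 1
p_mean = np.exp(x).mean()        # dragged down while sigma stays large

print(round(p_plugin, 3), round(p_mean, 3))
```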
Results show that setting beta too high (-1) leads to poor performance for both train and label entropy, though using the mean can alleviate this (since it effectively lowers beta). The best results generally come from -5 for label entropy (faster optimization) and -10 for train (likely because train can't lower it for bad features).
Interestingly, label entropy beats out train and solves the ATR dataset! Like, solves solves, meaning F1 = 1. Results might look a lot different this time around. Perhaps this relates to the average prob true being so low in the current model, meaning the entropy/eig/kl methods are not properly calibrated.
Summary:
  • Label entropy: [use_mean=false, beta_mu=-5] consistently solves the dataset within 85 steps.
  • Train: [use_mean=false, beta_mu=-10] nearly solves the dataset on AUC but lags behind on accuracy/F1. [use_mean=false, beta_mu=-5] also works very well on AUC (but noticeably worse than -10 on accuracy/F1).
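For reference, the label-entropy heuristic can be sketched like this (my reconstruction, not the actual policy code): score each unlabeled point by the binary entropy of its predicted label probability and query the most uncertain one.

```python
# Hypothetical reconstruction of an entropy-based acquisition step.
import numpy as np

def binary_entropy(p):
    # Clip to avoid log(0); entropy peaks at p = 0.5.
    p = np.clip(np.asarray(p, dtype=float), 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def pick_most_uncertain(probs):
    # Query the point whose predicted label is closest to a coin flip.
    return int(np.argmax(binary_entropy(probs)))

print(pick_most_uncertain([0.05, 0.45, 0.99]))  # index 1: closest to 0.5
```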

A ridiculous number of plots

General description of plot sets shown:
  1. shows final performance for each hyperparameter combination in a heatmap
  2. shows entropy and train (left to right), grouped by combination, then by beta_mu alone, then by use_mean alone; top to bottom are different measures of performance
  3. shows entropy and train separately, grouped by combination
  4. shows the parameters of entropy and train, grouped by combination
  5. shows the best hyperparameter combinations for train and entropy, judged by AUC and F1
  6. shows the AUC of train and entropy against their typical alpha and beta values; we see that they do scale together
  7. shows the same as (1), except averaged over train and entropy so it reflects only the hyperparameters
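Plot set (1) can be sketched roughly as follows (the values are random placeholders, and the axis/metric layout is my choice, not the report's code):

```python
# A rough sketch of plot set 1: final F1 for each (beta_mu, use_mean)
# combination as a heatmap. Placeholder values, not the actual results.
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt
import numpy as np

beta_mus = [-10, -5, -1]
use_means = [False, True]
final_f1 = np.random.default_rng(0).uniform(0.5, 1.0, size=(2, 3))  # placeholder

fig, ax = plt.subplots()
im = ax.imshow(final_f1, vmin=0.0, vmax=1.0)
ax.set_xticks(range(len(beta_mus)))
ax.set_xticklabels([str(b) for b in beta_mus])
ax.set_yticks(range(len(use_means)))
ax.set_yticklabels([str(m) for m in use_means])
ax.set_xlabel("beta_mu")
ax.set_ylabel("use_mean")
fig.colorbar(im, ax=ax, label="final F1")
fig.savefig("f1_heatmap.png")
```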
I also have these heatmaps for all of the evaluation metrics, but I don't know how to format imported images nicely in wandb, so I've only included F1 and AUC. Let me know if you'd like to see any others!
F1 Train
F1 Label entropy
AUC Train
AUC Label entropy

I styled the top row to make it slightly easier to read:
  • train = blue
  • entropy = pink
  • use mu = dark/thick
  • use mean = light/thin
  • beta -1 = solid
  • beta -5 = dashed
  • beta -10 = dotted
Comparing lines of the same style between blue and pink compares the same hyperparameter configuration across entropy and train.
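The convention above maps onto matplotlib-style line kwargs something like this (the concrete values for "dark/thick" and "light/thin" are my choices):

```python
# Styling convention from the legend above, as plot kwargs.
LINESTYLE = {-1: "solid", -5: "dashed", -10: "dotted"}
COLOR = {"train": "tab:blue", "entropy": "tab:pink"}
WIDTH = {False: 2.5, True: 1.0}  # use_mean=False -> mu -> dark/thick

def style_for(policy, use_mean, beta_mu):
    # Return the line style for one (policy, use_mean, beta_mu) config.
    return {
        "color": COLOR[policy],
        "linewidth": WIDTH[use_mean],
        "linestyle": LINESTYLE[beta_mu],
    }

print(style_for("train", False, -5))
```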

Run set: 18 runs
