
Regularization parameters correlation analyses

Created on December 31 | Last edited on February 5
This sweep yields some preliminary observations about the hyperparameters it includes.
It was run with a Bayesian HPO strategy over the following ranges:
"steps" : {
"min": 1000,
"max": 3000,
},
"alpha" :{
"min": 0.6,
"max": 1.75
},
"emb_dim" :{
"min": 3,
"max": 80
},
"prior_std" :{
"min": .5,
"max": 2.0
},
"ell_2" :{
"min": .05,
"max": 1.0
},
"gram" :{
"min": .5,
"max": 2.0
}
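For reference, a minimal sketch of how these ranges could be expressed as a W&B Bayesian sweep configuration. The metric name "test_omse", the project name, and the train() entry point are assumptions for illustration, not taken from the sweep itself:

import wandb

# Sketch of a Bayesian sweep over the ranges listed above.
# Assumption: the optimization target is the test OMSE, logged as "test_omse".
sweep_config = {
    "method": "bayes",
    "metric": {"name": "test_omse", "goal": "minimize"},
    "parameters": {
        "steps":     {"min": 1000, "max": 3000},
        "alpha":     {"min": 0.6,  "max": 1.75},
        "emb_dim":   {"min": 3,    "max": 80},
        "prior_std": {"min": 0.5,  "max": 2.0},
        "ell_2":     {"min": 0.05, "max": 1.0},
        "gram":      {"min": 0.5,  "max": 2.0},
    },
}

def train():
    run = wandb.init()
    cfg = run.config  # hyperparameters sampled by the sweep
    # ... build the model with cfg.emb_dim, cfg.prior_std, etc., train for cfg.steps ...
    # run.log({"test_omse": ...})

if __name__ == "__main__":
    sweep_id = wandb.sweep(sweep_config, project="regularization-parameters")  # project name is hypothetical
    wandb.agent(sweep_id, function=train, count=20)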

First observation: training length

From the following charts, we see that learning happens extremely quickly at first and then slows down considerably. However, test OMSE continues to decrease even in the tail of training. On the other hand, the feature-importance plot suggests we might be overfitting, so we should either:
  • fix the training length at around 2k steps, or
  • consider implementing weight decay (see the sketch after the charts below).

[Charts: Regularization-parameters sweep 1, panels 1 and 2]
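If we go the weight-decay route, a minimal sketch of what that could look like, assuming the model is a PyTorch module; the stand-in model, learning rate, and decay value below are placeholders, not values from this sweep:

import torch
import torch.nn as nn

# Placeholder stand-in for the actual model trained in this sweep.
model = nn.Linear(16, 1)

# AdamW applies decoupled weight decay; `weight_decay` would become another
# hyperparameter to sweep over rather than the fixed placeholder used here.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)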



Second observation: regularization coefficients

The $\ell_2$ regularization coefficient might need a broader range of values; the BHPO favored values at the maximum of its range, and it showed up positive in its feature correlation despite the best runs all lying near the center of the range.
The gram regularization coefficient, however, showed only minor correlation with performance and low feature importance. The best models were concentrated at the lower end of the range, so we may tighten that range.
More important than either of these, though, is that the observed values of the gram loss and the $\ell_2$ loss are different orders of magnitude. This means we should probably rethink the ranges of their coefficients to ensure that each regularization term carries adequate weight during training. Note: as usual, the plotted losses do not include the coefficients.
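To make the point concrete, here is a rough sketch of the reasoning. The loss magnitudes below are placeholders rather than measured values, and the combined-loss form is an assumption about how the two terms enter training:

# Hypothetical typical magnitudes of the raw (coefficient-free) losses.
ell2_loss_scale = 1e-1
gram_loss_scale = 1e-3

def total_loss(data_loss, ell2_loss, gram_loss, ell_2, gram):
    # Coefficients enter here; the plotted losses exclude them.
    return data_loss + ell_2 * ell2_loss + gram * gram_loss

# With the current ranges, the gram term's maximum contribution would be roughly
# 2.0 * 1e-3 = 2e-3, while the ell_2 term's would be 1.0 * 1e-1 = 1e-1, so the
# gram term could never carry comparable weight. Scaling each coefficient range
# by the inverse of its term's typical magnitude is one way to rebalance them.
ell_2_range = (0.05 / ell2_loss_scale, 1.0 / ell2_loss_scale)  # (0.5, 10.0)
gram_range = (0.5 / gram_loss_scale, 2.0 / gram_loss_scale)    # (500.0, 2000.0)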


[Charts: Regularization-parameters sweep 1]