
[Tuning] Decoding

Created on February 25 | Last edited on February 25

High variance in tuning results


[Chart: metric vs. Step (log scale, 1–1k; y-axis 0–0.5) — run set: 7 runs]

After some investigation, we realized that the new purple run was getting a ridiculous LR schedule that likely destabilized training. We reset and rerun to see what happens, expecting results similar to, if not slightly better than, training from scratch.
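A quick way to catch a broken LR schedule before it destabilizes a run is to evaluate the schedule function at a few steps and eyeball the values. Below is a minimal sketch of a linear-warmup-plus-cosine-decay schedule; `base_lr`, `warmup_steps`, and `total_steps` are illustrative values, not the actual run's config.

```python
import math

def lr_at(step, base_lr=3e-4, warmup_steps=100, total_steps=10_000):
    """Linear warmup followed by cosine decay to zero (hypothetical config)."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))

# Spot-check a few steps; a "ridiculous" schedule shows up immediately here.
for s in [0, 50, 100, 5000, 9999]:
    print(s, f"{lr_at(s):.2e}")
```

A schedule bug (e.g. warmup never ending, or decay computed against the wrong step counter) is obvious from this table before any compute is spent.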








Supervised pretraining

Before we perform our final tuning on the context of interest, we have several options for preparing a base model.
  1. Unsupervised PT -> supervised PT
  2. Supervised PT
  3. Unsupervised/supervised joint PT
See them all here. Option 1 is best, option 2 basically matches, and option 3 oddly seems to converge first.
  • Compute/batch size is approximately matched (~40-48G, effective bsz 256).
  • Eval suggests some lagging, but these runs share the same val set, and results there are more nearly equivalent; things pull in about as expected. Curiously, joint training seems to overfit slightly earlier, but the overfit is sudden, so it looks like an unlucky local minimum.
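For comparisons like the one above, the "effective bsz 256" is typically held fixed across flows via gradient accumulation. A minimal sketch, with illustrative values rather than the actual run configs:

```python
# Hypothetical sketch: matching effective batch size across runs via
# gradient accumulation. micro_batch and accum_steps are assumptions.
micro_batch = 32          # samples per forward/backward pass
accum_steps = 8           # optimizer step every 8 micro-batches
effective_bsz = micro_batch * accum_steps  # 256

def should_step(micro_batch_idx):
    """Take an optimizer step once per effective batch."""
    return (micro_batch_idx + 1) % accum_steps == 0

print(effective_bsz)
```

This keeps the gradient noise scale comparable even when memory limits force different per-device batch sizes.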

[Chart panel — run set: 5 runs]


  1. This is the flow we hypothesize to be most stable. It isn't shown to its full advantage here, because we're doing unsupervised PT on the same Indy dataset as supervised PT; unsupervised PT should in principle be able to leverage much more data (but we don't have scaling working). In any case, it's one option.
The move to normalized targets was not helpful at convergence, but seems to slightly improve the compute-efficient regime.
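One common form of target normalization, consistent with the note above, is z-scoring the kinematic targets per dimension for training and undoing the transform at eval time. This is a hedged sketch; the array names, shapes, and the exact normalization used in the codebase are assumptions.

```python
import numpy as np

def normalize_targets(vel):
    """z-score velocity targets per dimension; vel: (time, dims) array."""
    mean, std = vel.mean(axis=0), vel.std(axis=0)
    return (vel - mean) / std, (mean, std)

def denormalize(pred, stats):
    """Map model predictions back to physical units for evaluation."""
    mean, std = stats
    return pred * std + mean
```

Normalized targets mainly change the loss scale the model sees early in training, which is one plausible reason the effect shows up in the compute-efficient regime rather than at convergence.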

[Chart panel — run set: 2 runs]


2.
Because it's important that we actually tune fairly, we take a closer look and document things. indy_causal_scratch is the supervised pretraining scenario; on the Indy target, the ODoherty baseline is ~0.56 R^2. We do beat that here (again with 50% of the training data); more importantly, the move to m/s targets (and the subsequent 10x increase in loss) is helpful. Since we are hitting reasonably good performance and this is not the end measure, I stop tuning.
These runs are untuned, so there may be HPs that allow Sup PT to match or even surpass Unsup PT -> Sup PT; however, we did not iterate much on either flow -- the normalization change was a move in favor of Sup PT, and I think it's fair to say both got a reasonable pass.
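For reference, the R^2 figure compared against the ~0.56 ODoherty baseline is the standard coefficient of determination; whether the codebase averages per kinematic dimension or pools them is an assumption here, and this sketch averages per dimension.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination, averaged over target dimensions.
    y_true, y_pred: (time, dims) arrays of kinematics."""
    ss_res = ((y_true - y_pred) ** 2).sum(axis=0)
    ss_tot = ((y_true - y_true.mean(axis=0)) ** 2).sum(axis=0)
    return float(np.mean(1 - ss_res / ss_tot))
```

A perfect decode gives 1.0, and predicting the mean gives 0.0, so ~0.56 sits well above the trivial baseline.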


[Chart panel — run set: 7 runs]

3.
Similarly, joint training was improved by normalization (note the losses are still not the same order of magnitude; unsupervised is dominant).
Low-priority pingback: check what happens when we make the weights more even (do we do better on decode, maybe?).
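That check amounts to sweeping the supervised task weight in the joint objective. A minimal sketch, assuming the joint loss is a weighted sum; `joint_loss`, the weight values, and the loss names are all hypothetical, not the codebase's actual API.

```python
# Hypothetical joint objective: unsupervised loss plus a weighted
# supervised (kinematic) term, as in the joint PT runs described above.
def joint_loss(loss_unsup, loss_sup, sup_weight):
    return loss_unsup + sup_weight * loss_sup

# Sweep from the current small weight toward even weighting and compare
# decode R^2 for each setting (training itself not run here).
for sup_weight in [0.1, 0.5, 1.0]:
    print(sup_weight, joint_loss(1.0, 2.0, sup_weight))
```

If decode improves monotonically with `sup_weight`, the small kinematic weight used in the original joint pretraining is the likely bottleneck.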

[Chart panel — run set: 9 runs]


Currently, unsupervised tuning seems on track to match supervised tuning, which seems a little too good to be true.
If it is true, though, be careful: this result might be sensitive to the small kinematic task weight we used during the original joint pretraining.


[Chart panel — run set: 6 runs]