[Scale] 100K Proof of concept
Does scaling help us improve 5K Pitt trials?
Created on February 11 | Last edited on February 14
- This initial scaling proof of concept uses the base_20 model; model size was scaled roughly in proportion to trial count, anchored to the RTT experiment, e.g.:
    - Pitt ~5K trials --> 2.5M params (not swept or otherwise tuned)
    - RTT ~30K trials --> 2.5M params
    - base_20 ~120K trials --> 12.5M params
- Status: not working, in the sense that the scaled models are not per-epoch dominant (the slopes look visually similar), unlike at smaller scales (see Multisession Pilot). This is already a modest ask, since Kaplan-style scaling predicts per-token dominance in larger models.
- This suggests either
    - An execution bug (poor tuning, or a code bug)
    - Disparate data distributions are genuinely harmful (and we need mitigation strategies).
- More or less, I still believe there is some transfer we should expect for the Pitt data, where we only have a few thousand trials.
- Saturation on Pitt data
    - We should separately evaluate whether scaling on in-distribution Pitt CO tasks helps, or whether we are already in a "high-data regime" for this Pitt context.
    - If so, we should evaluate whether larger models transfer more easily, with better robustness to novel subjects/contexts (since they must cope with many).
    - We need to be more systematic; let's revisit heterogeneity in just the MC_RTT setting.
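The "~ proportional" sizing can be made concrete by checking the params-per-trial ratio implied by each reported pairing (numbers taken directly from the bullets above; the ratio computation itself is just an illustrative check, not from the source):

```python
# Reported (trials, params) pairings from the scaling runs above.
runs = {
    "Pitt": (5_000, 2.5e6),
    "RTT": (30_000, 2.5e6),
    "base_20": (120_000, 12.5e6),
}

for name, (trials, params) in runs.items():
    # Params per trial: ~83 for RTT, ~104 for base_20, but 500 for Pitt,
    # i.e. the Pitt model is oversized relative to a strict proportional
    # rule -- consistent with "not swept or anything".
    print(f"{name}: {params / trials:.0f} params/trial")
```

The RTT and base_20 ratios roughly agree (~83 vs ~104 params/trial), while Pitt sits well above the line, which is worth keeping in mind when interpreting its saturation behavior.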
Section 1
Run set (6 runs)
Does extra token capacity help? (1 -> 4 tokens for the session embed)
- Perhaps, but not trend-changing in the converged regime.
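For context, "token capacity" here means the number of learned tokens reserved per session and prepended to the sequence. A minimal numpy sketch of going from 1 to 4 session-embed tokens (all function and variable names are hypothetical, not the actual codebase):

```python
import numpy as np

def make_session_embed(n_sessions: int, d_model: int, tokens_per_session: int = 1):
    """Learned per-session embedding table (random init stands in for training)."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(n_sessions, tokens_per_session, d_model))

def prepend_session_tokens(x, embed, session_id):
    """Prepend this session's tokens: (T, d) -> (tokens_per_session + T, d)."""
    return np.concatenate([embed[session_id], x], axis=0)

embed1 = make_session_embed(n_sessions=10, d_model=64, tokens_per_session=1)
embed4 = make_session_embed(n_sessions=10, d_model=64, tokens_per_session=4)
x = np.zeros((100, 64))  # one trial's worth of input tokens

print(prepend_session_tokens(x, embed1, 3).shape)  # (101, 64)
print(prepend_session_tokens(x, embed4, 3).shape)  # (104, 64)
```

The extra tokens give the model more per-session context capacity at a small sequence-length cost, which matches the observation that the effect washes out once training converges.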
Run set (8 runs)