[Scale] 100K Proof of concept
Does scaling help us improve 5K Pitt trials?
Created on February 11 | Last edited on February 14
- This initial scaling proof of concept uses the base_20 model; model size was scaled roughly in proportion to trial count, anchored to the RTT experiment, e.g.:
    - Pitt ~5K trials --> 2.5M params (not swept or otherwise tuned)
    - RTT ~30K trials --> 2.5M params
    - base_20 ~120K trials --> 12.5M params
- Status: not working, in the sense that the scaled models are not per-epoch dominant (the slopes look visually similar), unlike at smaller scales (see Multisession Pilot). This is already a modest ask, since Kaplan-style scaling predicts per-token dominance in larger models.
- This suggests either
    - An execution bug (poor tuning, or a code bug)
    - Disparate data distributions are genuinely harmful (and we need mitigation strategies).
- More or less, I still believe there is some transfer we should expect for the Pitt data, where we only have a few thousand trials.
- Saturation on Pitt data
    - We should separately evaluate whether scaling on in-distribution Pitt CO tasks helps, or whether we are already in a "high-data regime" for this Pitt context.
    - If so, we should evaluate whether larger models transfer more easily, with better robustness to novel subjects/contexts (since they must cope with many).
    - We need to be more systematic; let's revisit heterogeneity in just the MC_RTT setting.
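The "~ proportional" sizing can be made concrete by checking the params-per-trial ratio implied by each reported pairing (numbers taken directly from the bullets above; the ratio computation itself is just an illustrative check, not from the source):

```python
# Reported (trials, params) pairings from the scaling runs above.
runs = {
    "Pitt": (5_000, 2.5e6),
    "RTT": (30_000, 2.5e6),
    "base_20": (120_000, 12.5e6),
}

for name, (trials, params) in runs.items():
    # Params per trial: ~83 for RTT, ~104 for base_20, but 500 for Pitt,
    # i.e. the Pitt model is oversized relative to a strict proportional
    # rule -- consistent with "not swept or anything".
    print(f"{name}: {params / trials:.0f} params/trial")
```

The RTT and base_20 ratios roughly agree (~83 vs ~104 params/trial), while Pitt sits well above the line, which is worth keeping in mind when interpreting its saturation behavior.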
Section 1
Run set (6 runs)
Does extra token capacity help? (1 -> 4 tokens for the session embed)
- Perhaps, but not trend-changing in the converged regime.
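For context, "token capacity" here means the number of learned tokens reserved per session and prepended to the sequence. A minimal numpy sketch of going from 1 to 4 session-embed tokens (all function and variable names are hypothetical, not the actual codebase):

```python
import numpy as np

def make_session_embed(n_sessions: int, d_model: int, tokens_per_session: int = 1):
    """Learned per-session embedding table (random init stands in for training)."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(n_sessions, tokens_per_session, d_model))

def prepend_session_tokens(x, embed, session_id):
    """Prepend this session's tokens: (T, d) -> (tokens_per_session + T, d)."""
    return np.concatenate([embed[session_id], x], axis=0)

embed1 = make_session_embed(n_sessions=10, d_model=64, tokens_per_session=1)
embed4 = make_session_embed(n_sessions=10, d_model=64, tokens_per_session=4)
x = np.zeros((100, 64))  # one trial's worth of input tokens

print(prepend_session_tokens(x, embed1, 3).shape)  # (101, 64)
print(prepend_session_tokens(x, embed4, 3).shape)  # (104, 64)
```

The extra tokens give the model more per-session context capacity at a small sequence-length cost, which matches the observation that the effect washes out once training converges.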
Run set (8 runs)