150K Scaling Design
Created on March 25 | Last edited on March 25
We run relatively short scaling procedures and vary the key design elements.
Mask ratio
- Early on we ran some experiments on this within Indy that suggested mask ratio didn't really matter over a wide range (0.1-0.5); we revisit that finding here.
- Batch size was not fixed in those experiments; the emphasis was runtime efficiency, since the expectation was that performance would be constant across mask ratios.
- m6_150k was from a different experiment; it is 3x larger than the others and uses a larger batch size.
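The exact masking scheme used in Indy isn't shown in this note; as a minimal sketch, assuming uniform random masking over positions at a fixed ratio (function name and sequence length are illustrative):

```python
import numpy as np

def random_mask(seq_len: int, mask_ratio: float, rng: np.random.Generator) -> np.ndarray:
    """Boolean mask with ~mask_ratio of positions set to True (masked)."""
    n_mask = int(round(seq_len * mask_ratio))
    idx = rng.permutation(seq_len)[:n_mask]
    mask = np.zeros(seq_len, dtype=bool)
    mask[idx] = True
    return mask

rng = np.random.default_rng(0)
mask = random_mask(512, 0.3, rng)
print(int(mask.sum()))  # 154 masked positions out of 512
```

Sweeping `mask_ratio` over 0.1-0.5 with this kind of sampler is the design axis being varied across the runs above.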
Run set (5 runs)
Based on eval loss, everything seems comparable. However, val loss separates the runs more clearly (and since these runs share the same training set, the separation is meaningful):
- M1 (prenorm_150k) ~ M3 > M5 > M7 on val. Re-examining eval, this ordering appears to hold as well.
- Note that M1, M3, and M5 have equivalent batch sizes.
- The difference between eval and val is either a noise-scale effect (eval is noisier) or evidence that val contains data more challenging than eval (highly plausible).
- We should quickly confirm this ranking holds in transfer.
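One way to probe the noise-scale explanation is to bootstrap the standard error of the mean loss at the two set sizes. A sketch with synthetic per-example losses (the loss distribution and the eval/val set sizes are assumptions; the real per-example losses live in the run logs):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-example losses drawn from a single distribution.
# If eval is much smaller than val, its mean loss alone is noisier,
# which could blur a ranking that val resolves.
pool = rng.gumbel(loc=2.3, scale=0.5, size=20_000)

def mean_se(per_example: np.ndarray, n: int, reps: int = 300) -> float:
    """Bootstrap the standard error of the mean loss over subsets of size n."""
    means = [rng.choice(per_example, size=n, replace=False).mean()
             for _ in range(reps)]
    return float(np.std(means))

print(f"SE, eval-sized subset (n=500):   {mean_se(pool, 500):.4f}")
print(f"SE, val-sized subset (n=10,000): {mean_se(pool, 10_000):.4f}")
```

If the observed eval/val spread between runs is comparable to the eval-sized standard error, noise scale suffices as an explanation; if it is much larger, the harder-data hypothesis is favored.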