150K Scaling Design

Created on March 25|Last edited on March 25
We run relatively short scaling procedures and vary the key design elements.

Mask ratio

  • Early on we ran some experiments on this within Indy that suggested mask ratio didn't really matter over a wide range (0.1-0.5); we revisit that here.
  • Batch size was not fixed; the emphasis was runtime efficiency, since the expectation was that performance would be constant across mask ratios.
  • m6_150k came from a different experiment and is 3x larger than the others, with a greater batch size.
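As a reference point for what "mask ratio" means here, a minimal sketch of random masking is below. This is a hypothetical illustration, not the actual training code; `apply_mask` and the `<MASK>` token are assumptions.

```python
import random

def apply_mask(tokens, mask_ratio, mask_token="<MASK>", seed=None):
    """Replace a random fraction (mask_ratio) of positions with mask_token.

    Hypothetical helper for illustration only. Returns the masked sequence
    and the sorted indices that were masked.
    """
    rng = random.Random(seed)
    n_mask = int(round(len(tokens) * mask_ratio))
    masked_idx = set(rng.sample(range(len(tokens)), n_mask))
    masked = [mask_token if i in masked_idx else t
              for i, t in enumerate(tokens)]
    return masked, sorted(masked_idx)

tokens = list("ABCDEFGHIJ")
masked, idx = apply_mask(tokens, mask_ratio=0.3, seed=0)
# With mask_ratio=0.3 on 10 tokens, exactly 3 positions are masked.
```

Sweeping `mask_ratio` over 0.1-0.5 then just means rerunning training with different values of this one scalar.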

[Chart: loss vs. trainer/global_step (and epoch) for the 5 runs in the run set]

Based on eval loss, everything seems comparable. However, val loss separates the runs more clearly (and since these runs share the same training set, the separation is meaningful).
  • M1 (prenorm_150k) ≈ M3 > M5 > M7 on val. Re-examining eval, this ordering appears to hold there as well.
    • Note M1, M3, and M5 have equivalent batch sizes.
  • The difference between eval and val is either a noise-scale effect (eval is noisier) or an indication that val contains data more challenging than eval (highly plausible).
  • We should quickly confirm this ranking holds in transfer.
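Checking whether the ranking holds can be sketched mechanically: order runs by each metric and compare. The loss values below are illustrative placeholders, not the measured numbers from these runs, and `rank_runs` is a hypothetical helper.

```python
def rank_runs(losses):
    """Order run names from best (lowest loss) to worst.

    Hypothetical helper; ties keep insertion order (sorted() is stable).
    """
    return [name for name, _ in sorted(losses.items(), key=lambda kv: kv[1])]

# Illustrative values only -- not the actual measured losses.
val_loss = {"M1": 0.52, "M3": 0.53, "M5": 0.55, "M7": 0.58}
eval_loss = {"M1": 0.61, "M3": 0.61, "M5": 0.62, "M7": 0.63}

val_rank = rank_runs(val_loss)
eval_rank = rank_runs(eval_loss)
agree = val_rank == eval_rank  # does the val ordering hold on eval?
```

The same comparison against transfer-task scores would confirm (or refute) the ranking downstream.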