150K Scaling Design

Created on March 25|Last edited on March 25
We run relatively short scaling procedures and vary the key design elements.

Mask ratio

  • Early on we ran some experiments on this within Indy that suggested mask ratio didn't really matter over a wide range (0.1-0.5); we revisit that here.
  • Batch size was not fixed; the emphasis was runtime efficiency, since the expectation was that performance would be constant across mask ratios.
  • m6_150k came from a different experiment and is 3x larger than the others, with a greater batch size.
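As a reference point for what "mask ratio" means here, a minimal sketch of random masking is below. This is a hypothetical illustration, not the actual training code; `apply_mask` and the `<MASK>` token are assumptions.

```python
import random

def apply_mask(tokens, mask_ratio, mask_token="<MASK>", seed=None):
    """Replace a random fraction (mask_ratio) of positions with mask_token.

    Hypothetical helper for illustration only. Returns the masked sequence
    and the sorted indices that were masked.
    """
    rng = random.Random(seed)
    n_mask = int(round(len(tokens) * mask_ratio))
    masked_idx = set(rng.sample(range(len(tokens)), n_mask))
    masked = [mask_token if i in masked_idx else t
              for i, t in enumerate(tokens)]
    return masked, sorted(masked_idx)

tokens = list("ABCDEFGHIJ")
masked, idx = apply_mask(tokens, mask_ratio=0.3, seed=0)
# With mask_ratio=0.3 on 10 tokens, exactly 3 positions are masked.
```

Sweeping `mask_ratio` over 0.1-0.5 then just means rerunning training with different values of this one scalar.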

[Chart: loss vs. trainer/global_step (and epoch) for the 5 runs in the run set]

Based on eval loss, everything seems comparable. However, val loss separates the runs more clearly (and since these runs share the same training set, the separation is meaningful).
  • M1 (prenorm_150k) ≈ M3 > M5 > M7 on val. Re-examining eval, this ordering appears to hold there as well.
    • Note M1, M3, and M5 have equivalent batch sizes.
  • The difference between eval and val is either a noise-scale effect (eval is noisier) or an indication that val contains data more challenging than eval (highly plausible).
  • We should quickly confirm this ranking holds in transfer.
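Checking whether the ranking holds can be sketched mechanically: order runs by each metric and compare. The loss values below are illustrative placeholders, not the measured numbers from these runs, and `rank_runs` is a hypothetical helper.

```python
def rank_runs(losses):
    """Order run names from best (lowest loss) to worst.

    Hypothetical helper; ties keep insertion order (sorted() is stable).
    """
    return [name for name, _ in sorted(losses.items(), key=lambda kv: kv[1])]

# Illustrative values only -- not the actual measured losses.
val_loss = {"M1": 0.52, "M3": 0.53, "M5": 0.55, "M7": 0.58}
eval_loss = {"M1": 0.61, "M3": 0.61, "M5": 0.62, "M7": 0.63}

val_rank = rank_runs(val_loss)
eval_rank = rank_runs(eval_loss)
agree = val_rank == eval_rank  # does the val ordering hold on eval?
```

The same comparison against transfer-task scores would confirm (or refute) the ranking downstream.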