Staged Seq Length Training Test

Shortformer-style (https://arxiv.org/abs/2012.15832) staged sequence length training. Two models were trained on an equal number of tokens: one at seqlen = 512 for 90% of training and seqlen = 2048 for the remaining 10%, and one at seqlen = 2048 for all of training. N.B.: for some reason there is a bug in how wandb displays the runs; opening the loss-by-tokens or loss-by-wallclock-time plot full screen and then minimizing it should make it display correctly. Also, loss by tokens isn't 100% accurate, since the token count scales by the initial seq length, but the final losses should be at an equal number of tokens.
Created on May 18|Last edited on May 18
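The staged schedule and the token-axis caveat described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the code used for these runs: the function names are hypothetical, and the batch size (in sequences) is assumed to stay fixed across both stages, which the report does not state.

```python
def staged_steps(total_tokens, batch_size, short_len=512, long_len=2048, short_percent=90):
    """Split a token budget into (short_steps, long_steps) for a two-stage schedule.

    Assumption: batch_size counts sequences and is the same in both stages.
    """
    short_tokens = total_tokens * short_percent // 100
    long_tokens = total_tokens - short_tokens
    short_steps = short_tokens // (batch_size * short_len)
    long_steps = long_tokens // (batch_size * long_len)
    return short_steps, long_steps


def tokens_seen(step, batch_size, short_steps, short_len=512, long_len=2048):
    """Tokens actually consumed after `step` optimizer steps of the staged run."""
    in_short = min(step, short_steps)      # steps taken at the short length
    in_long = max(step - short_steps, 0)   # steps taken at the long length
    return batch_size * (in_short * short_len + in_long * long_len)


def naive_tokens_seen(step, batch_size, initial_len=512):
    """Token count that scales every step by the initial sequence length only."""
    return step * batch_size * initial_len


if __name__ == "__main__":
    total = 163_840_000  # hypothetical budget, chosen so the stages divide evenly
    short_steps, long_steps = staged_steps(total, batch_size=32)
    last = short_steps + long_steps
    print(short_steps, long_steps)
    print(tokens_seen(last, 32, short_steps), naive_tokens_seen(last, 32))
```

With a counter like `naive_tokens_seen`, the staged run's token axis undercounts once the switch to seqlen = 2048 happens, which is one way to read the caveat that loss by tokens is not fully accurate mid-run even though both runs finish on the same token budget.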

Validation


[Plot: Validation. x-axis: Step (ticks from 20M to 100M); y-axis range 4 to 7.]
group: small_staged_seqlen_owt_sample_balanced_2t8xdqp5
group: small_non_staged_seqlen__owt_sample_balanced_2eenn15d
Run set
2