Staged Seq Length Training
In the style of Shortformer (https://arxiv.org/abs/2012.15832).
Two models trained on an equal number of tokens: one at seqlen = 512 for the first 90% of training and seqlen = 2048 for the remaining 10%, and one at seqlen = 2048 for the entire run.
(The loss-by-tokens logging is somewhat inaccurate because it weights every step according to the initial sequence length, but the final token counts are equal, so the final losses are comparable.)
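The staged schedule above can be sketched as a simple step-dependent sequence-length function. This is a minimal illustration, not the code used for these runs; the function name, the 90% stage boundary as a fraction of steps, and the fixed token budget per step are all assumptions for the sketch.

```python
def seq_len_for_step(step, total_steps, stage1_frac=0.9,
                     short_len=512, long_len=2048):
    """Return the sequence length to train with at `step`.

    The first `stage1_frac` of training uses the short length; the
    remainder uses the long length (Shortformer-style staging).
    Keeping the number of tokens per step fixed (e.g. by shrinking the
    batch size when the sequence length grows) makes the token budget
    match a run trained at the long length throughout.
    """
    if step < stage1_frac * total_steps:
        return short_len
    return long_len

# Example: with 1000 total steps, steps 0-899 use 512, steps 900+ use 2048.
```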
Created on May 18 | Last edited on May 18
Validation