Staged Seq Length Training Test
Staged sequence-length training in the style of Shortformer (https://arxiv.org/abs/2012.15832).
Two models were trained on an equal number of tokens: one at seqlen = 512 for 90% of training and seqlen = 2048 for the remaining 10%, and one at seqlen = 2048 for all of training.
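As a rough illustration of the setup above, here is a minimal sketch of how the two runs could be laid out so that both consume the same token budget. It is not the code used for these runs. The names `Stage`, `staged_schedule`, and `tokens_in`, the token budget, and the batch size are all hypothetical, and the sketch assumes that "90% of training" means 90% of the token budget and that the number of sequences per step stays fixed across stages.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    seq_len: int           # sequence length used during this stage
    token_fraction: float  # share of the total token budget (assumption)


def staged_schedule(total_tokens, seqs_per_step, stages):
    """Return [(seq_len, num_steps), ...] so each stage gets its token share."""
    schedule = []
    for stage in stages:
        tokens_per_step = seqs_per_step * stage.seq_len
        steps = round(total_tokens * stage.token_fraction / tokens_per_step)
        schedule.append((stage.seq_len, steps))
    return schedule


def tokens_in(schedule, seqs_per_step):
    """Total tokens consumed by a schedule."""
    return sum(seq_len * steps * seqs_per_step for seq_len, steps in schedule)


# Hypothetical numbers, chosen only so the arithmetic is easy to follow.
TOTAL_TOKENS = 2048 * 8 * 1000
SEQS_PER_STEP = 8

# Staged run: seqlen 512 for 90% of the tokens, then seqlen 2048 for 10%.
staged = staged_schedule(
    TOTAL_TOKENS, SEQS_PER_STEP, [Stage(512, 0.9), Stage(2048, 0.1)]
)
# Baseline run: seqlen 2048 for all of training.
baseline = staged_schedule(TOTAL_TOKENS, SEQS_PER_STEP, [Stage(2048, 1.0)])

print(staged, baseline)
```

With a fixed number of sequences per step, the short-sequence stage needs four times as many steps per token as the long-sequence stage, which is why step counts alone do not make the two runs comparable.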
N.B. For some reason wandb displays the runs incorrectly. Open the loss-by-tokens or loss-by-wallclock-time plot in full screen and then minimize it, and it should display correctly.
Also, loss by tokens isn't 100% accurate, since the token axis is scaled by the initial sequence length, but the final losses should be at an equal number of tokens.
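To make the token-axis caveat concrete, the sketch below compares a token count that follows the sequence-length switch with one that multiplies the step count by the initial sequence length. How the axis is actually computed in these runs is not stated, so `naive_tokens_seen` is only an assumed model of it. The function names, the schedule, and the batch size are hypothetical.

```python
def tokens_seen(step, schedule, seqs_per_step):
    """Cumulative tokens after `step` steps, following each stage's seq_len."""
    total = 0
    remaining = step
    for seq_len, num_steps in schedule:
        used = min(remaining, num_steps)
        total += used * seq_len * seqs_per_step
        remaining -= used
        if remaining == 0:
            break
    return total


def naive_tokens_seen(step, schedule, seqs_per_step):
    """Assumed model of the inaccurate axis: every step uses the initial seq_len."""
    initial_seq_len = schedule[0][0]
    return step * initial_seq_len * seqs_per_step


# Hypothetical staged schedule: (seq_len, num_steps) per stage.
SCHEDULE = [(512, 3600), (2048, 100)]
SEQS_PER_STEP = 8

for step in (3600, 3700):
    exact = tokens_seen(step, SCHEDULE, SEQS_PER_STEP)
    naive = naive_tokens_seen(step, SCHEDULE, SEQS_PER_STEP)
    print(step, exact, naive)
```

Under these assumptions the two counts agree until the switch to the longer sequence length and diverge afterwards, so the end of the staged curve would sit to the left of its true token count.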
Created on May 18|Last edited on May 18
[Figure: Validation panels, each comparing the 2 runs]