Staged Seq Length Training Test

Shortformer-style (https://arxiv.org/abs/2012.15832) staged training. Two models were trained on an equal number of tokens: one at seqlen = 512 for the first 90% of training and seqlen = 2048 for the remaining 10%, and one at seqlen = 2048 for all of training. N.B.: for some reason there's a bug in how wandb displays the runs; open the loss by tokens / wallclock time plot full screen and then minimize it, and it should display correctly. Also, loss by tokens isn't 100% accurate, since the token count is scaled by the initial sequence length, but the final losses should be at an equal number of tokens.
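As a rough illustration of the schedule described above, here is a minimal sketch in Python. It is not the code used for these runs: the function names, the token-based switch point, and the per-stage batch layout are all assumptions made for the example.

```python
def staged_seq_len(tokens_seen, total_tokens,
                   short_len=512, long_len=2048, short_fraction=0.9):
    """Return the sequence length to train at after `tokens_seen` tokens.

    Hypothetical helper: uses `short_len` for the first `short_fraction`
    of the token budget and `long_len` for the remainder.
    """
    if tokens_seen < short_fraction * total_tokens:
        return short_len
    return long_len


def cumulative_tokens(stages):
    """Count the tokens actually consumed across training stages.

    `stages` is a list of (steps, sequences_per_batch, seq_len) tuples.
    This layout is an assumption for illustration; it shows why a token
    axis computed from the initial sequence length alone drifts once the
    sequence length changes.
    """
    total = 0
    for steps, sequences_per_batch, seq_len in stages:
        total += steps * sequences_per_batch * seq_len
    return total


if __name__ == "__main__":
    budget = 1000
    for seen in (0, 500, 950):
        print(seen, staged_seq_len(seen, budget))
    print(cumulative_tokens([(10, 4, 512), (2, 1, 2048)]))
```

Counting tokens per stage like this, rather than multiplying the step count by the initial sequence length, is one way to put both runs on a comparable token axis.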

s black

Created on May 18|Last edited on May 18

Validation
[Plot: validation/lm_loss_by_tokens. x-axis: Step (ticks 20M to 100M); y-axis ticks 4 to 7. Legend:]
group: small_staged_seqlen_owt_sample_balanced_2t8xdqp5
group: small_non_staged_seqlen__owt_sample_balanced_2eenn15d