Staged Seq Length Training
In the style of Shortformer (https://arxiv.org/abs/2012.15832).
Two models trained on an equal number of tokens: one at seqlen = 512 for the first 90% of training and seqlen = 2048 for the remaining 10%, and one at seqlen = 2048 for the entire run.
(The loss-by-tokens logging is somewhat inaccurate because it weights every step according to the initial sequence length, but the final token counts are equal, so the final losses are comparable.)
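The staged schedule above can be sketched as a simple step-dependent sequence-length function. This is a minimal illustration, not the code used for these runs; the function name, the 90% stage boundary as a fraction of steps, and the fixed token budget per step are all assumptions for the sketch.

```python
def seq_len_for_step(step, total_steps, stage1_frac=0.9,
                     short_len=512, long_len=2048):
    """Return the sequence length to train with at `step`.

    The first `stage1_frac` of training uses the short length; the
    remainder uses the long length (Shortformer-style staging).
    Keeping the number of tokens per step fixed (e.g. by shrinking the
    batch size when the sequence length grows) makes the token budget
    match a run trained at the long length throughout.
    """
    if step < stage1_frac * total_steps:
        return short_len
    return long_len

# Example: with 1000 total steps, steps 0-899 use 512, steps 900+ use 2048.
```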
Created on May 18 | Last edited on May 18
Validation