Controlled RTT Benefit comparison

Created on March 29 | Last edited on March 29

Nearly 1:1 comparison of our RTT experiments against the NLB RTT dataset: the small benefit replicates in our own setting, but there is no benefit on NLB.
In-distribution scaling shows that RTT is clearly not saturated.
Varying the model parameter count (and other hyperparameters) suggests that pretraining scores are relatively robust.
The NLB preprocessing can still be nailed down more exactly, and it may be worth understanding where the improvement comes from.
On the other hand, I'm not sure the experiment is worth the effort. Could this simply be a scale issue, or is something else at play? With such a modest effect size, none of the conclusions are very convincing.
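One way to check whether a modest difference in eval_loss is believable is to put a confidence interval on it. The sketch below is only an illustration, not something run for this report: it assumes several independent runs (e.g. seeds) per condition are available, which these notes do not confirm, and the function name `bootstrap_diff_ci` and its parameters are hypothetical.

```python
import random
from statistics import mean


def bootstrap_diff_ci(baseline, treatment, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(treatment) - mean(baseline).

    baseline, treatment: final eval_loss values, one per independent run.
    Returns (point_estimate, ci_low, ci_high).
    ASSUMPTION: runs within a condition differ only by random seed.
    """
    if len(baseline) < 2 or len(treatment) < 2:
        raise ValueError("need at least two runs per condition")
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample each condition with replacement, keeping its size.
        b = rng.choices(baseline, k=len(baseline))
        t = rng.choices(treatment, k=len(treatment))
        diffs.append(mean(t) - mean(b))
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return mean(treatment) - mean(baseline), lo, hi
```

Called with per-seed eval_loss lists for two conditions, an interval that excludes zero would support a real benefit. An interval that straddles zero would be consistent with the worry above that the effect is too small to trust.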
﻿
[Figure: eval_loss vs. trainer/global_step. Run set (7 runs): f32_nlb-b4rz44ou, mc_rtt-i0n8o24x, single_time_nlb_r300_exact-9hohu8bv, single_time_nlb_r300_full-2iwg032a, f32_10x-824eizh4, single_time_nlb-nnnow3uw, f32-wi0xe1mn]
