GMLP vs. Baseline

GMLP (blue) has more parameters than the baseline (pink), 200M vs. ~150M, and a larger batch size. Because a larger batch consumes more tokens per step, loss is plotted against tokens seen rather than steps, so the two runs are compared at the same amount of training data.

GMLP: 9x GMLP blocks + 3x local attention blocks
Baseline: 12x dense attention blocks
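As a rough sketch of the tokens-vs-steps conversion described above: the number of tokens seen is the step count times the tokens in one batch. The batch sizes and sequence length below are hypothetical placeholders, not the settings used in these runs, and the helper name `tokens_seen` is made up for illustration.

```python
def tokens_seen(step: int, batch_size: int, seq_len: int) -> int:
    """Tokens consumed after `step` optimizer steps at a fixed batch size."""
    return step * batch_size * seq_len


# Hypothetical settings, chosen only to illustrate the conversion.
# The report does not state the real batch sizes or sequence length.
RUNS = {
    "gmlp": {"batch_size": 512, "seq_len": 2048},
    "baseline": {"batch_size": 256, "seq_len": 2048},
}


def token_axis(steps, batch_size, seq_len):
    """Map a list of step indices to the matching tokens-seen values."""
    return [tokens_seen(s, batch_size, seq_len) for s in steps]


steps = [0, 1000, 2000]
gmlp_x = token_axis(steps, **RUNS["gmlp"])
baseline_x = token_axis(steps, **RUNS["baseline"])

# At the same step the larger-batch run has seen more data, which is
# why a shared step axis would make the comparison misleading.
print(gmlp_x)
print(baseline_x)
```

With these assumed values, the larger-batch run reaches any given token count in half as many steps, so curves that look offset on a step axis line up on a token axis.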

s black

Created on May 19|Last edited on May 19

[Chart: train/lm_loss_by_tokens. x-axis: tokens seen, ticks 50G to 250G (labeled "Step" in the panel); y-axis: LM loss, ticks 2 to 10.]
Run groups: gmlp_small_1_1h5cpjky (GMLP), dense_rotary_adam_final_2o1xs6nu (baseline)
