
GMLP vs. Baseline

GMLP (blue) has slightly more parameters than the baseline (pink), 200M vs. ~150M, and a larger batch size. Loss is plotted against tokens rather than steps to compensate for the batch size difference.

GMLP: 9x GMLP blocks + 3x local attention blocks
Baseline: 12x dense attention blocks
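Plotting by tokens matters because a run with a larger batch consumes more tokens per step, so curves aligned by step would favor it. Below is a minimal sketch of the conversion, assuming a fixed batch size and sequence length per run. The function name and the numeric values are hypothetical; the report does not state the actual batch sizes or sequence lengths.

```python
def steps_to_tokens(steps, batch_size, seq_len):
    """Convert optimizer step counts to cumulative tokens seen.

    Assumes every step processes batch_size sequences of seq_len tokens.
    """
    return [s * batch_size * seq_len for s in steps]


# Hypothetical settings for illustration only (not taken from the report).
logged_steps = [1000, 2000, 3000]
gmlp_tokens = steps_to_tokens(logged_steps, batch_size=512, seq_len=2048)
baseline_tokens = steps_to_tokens(logged_steps, batch_size=256, seq_len=2048)

# At the same step, the larger-batch run has seen twice as many tokens,
# so its loss curve should be compared at the matching token count instead.
for step, g, b in zip(logged_steps, gmlp_tokens, baseline_tokens):
    print(f"step {step}: gmlp={g:,} tokens, baseline={b:,} tokens")
```

With the x-axis in tokens, a point on each curve corresponds to the same amount of training data, regardless of how many steps it took to get there.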
Created on May 19|Last edited on May 19




[Plot: loss curves for the two run groups below. Horizontal axis labeled "Step", ticks 50G to 250G; vertical axis ticks 2 to 10.]
group: gmlp_small_1_1h5cpjky
group: dense_rotary_adam_final_2o1xs6nu
Run set
8


