GMLP vs. Baseline
GMLP (blue) has slightly more parameters (200M vs. ~150M) than the baseline (pink) and uses a larger batch size, so loss is plotted against tokens seen rather than training steps to compensate.
GMLP: 9x GMLP blocks + 3x local attention blocks
Baseline: 12x dense attention blocks
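Since the two runs use different batch sizes, a step in one run covers a different number of tokens than a step in the other. A minimal sketch of the step-to-token conversion behind a tokens x-axis is shown below. It assumes a constant batch size and sequence length per run; the helper `tokens_seen` and the batch size and sequence length values are hypothetical, since the report does not state them.

```python
def tokens_seen(step, batch_size, seq_len):
    """Tokens processed after `step` optimizer steps.

    Assumes batch size and sequence length stay constant during training.
    """
    return step * batch_size * seq_len


# Hypothetical settings for illustration only; the report does not give them.
baseline = {"batch_size": 256, "seq_len": 2048}
gmlp = {"batch_size": 512, "seq_len": 2048}

steps = [0, 1000, 2000]

# x-axis values in tokens for each run, taken at the same step counts
baseline_x = [tokens_seen(s, **baseline) for s in steps]
gmlp_x = [tokens_seen(s, **gmlp) for s in steps]

for s, b, g in zip(steps, baseline_x, gmlp_x):
    print(f"step {s}: baseline {b} tokens, GMLP {g} tokens")
```

With these illustrative numbers, the larger-batch run has seen twice as many tokens at any given step, which is why comparing loss at equal steps would favor it unfairly.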
Created on May 19 | Last edited on May 19