
6M Parameter models

A selection of 6M-parameter GPT-J models with varied architectures, trained on architectural design data. These are part of a larger scaling-laws experiment with models ranging from 2M to 2B parameters.
Created on April 5 | Last edited on April 5

Figure: Train loss (0.2–1.0) vs. number of tokens processed (100M–1B) for the d_model: 512 runs.
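As a rough sanity check on the "6M" label, the non-embedding parameter count of a GPT-J-style decoder stack can be estimated from `d_model` and the layer count. The sketch below uses the standard ~12·n_layer·d_model² approximation (attention projections plus a 4x-wide MLP); the 2-layer depth is an assumption for illustration, not a detail stated in this report.

```python
def gptj_nonembed_params(n_layer: int, d_model: int) -> int:
    """Approximate non-embedding parameter count for a GPT-J-style stack.

    Per block: q/k/v/out attention projections (4 * d_model^2) plus a
    4x-wide MLP (8 * d_model^2), ignoring biases, LayerNorms, and the
    token-embedding matrix.
    """
    return 12 * n_layer * d_model ** 2

# With d_model = 512 (as in the plotted run) and a hypothetical
# 2-layer config, the count lands near the 6M target:
print(gptj_nonembed_params(2, 512))  # 6291456, i.e. ~6M
```

Varying depth, width, and head count while holding this total near 6M is one way the "varied architectures" at a fixed parameter budget could be constructed.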