data efficiency hyperparams
Created on June 22 | Last edited on September 8
grouped runs
overparametrized wd
overparametrized wd (ensembles)

1.4b ensembles (based on rough optimal hps)

1.4b, 200m tokens, 16x epoched, ensembles

1.4b, 200m tokens, 32x and 16x, 6.4 wd