data efficiency hyperparams
Created on June 22|Last edited on September 8
grouped runs | overparametrized wd | overparametrized wd (ensembles) | 1.4b ensembles (based on rough optimal hps) | 1.4b, 200m tokens, 16x epoched, ensembles | 1.4b, 200m tokens, 32x and 16x, 6.4 wd
grouped runs
0
0
4
7
3
1.4b, 209M
3
Run set 7
0
overparametrized wd
3
1.4b, 209M, lr0.001
10
10
overparametrized wd (ensembles)
wd6.40
532
wd3.20
532
wd1.6
532
wd0.8
532

1.4b ensembles (based on rough optimal hps)
1.4b, 1.7B seed tokens
1779
1.4b, 838M seed tokens
1779
1.4b, 419M seed tokens
1779
1.4b, 209M seed tokens
1779
1.4b, 1.7B seed tokens
532
1.4b, 838m
532
1.4b, 419m
532
1.4b, 209m
532

1.4b, 200m tokens, 16x epoched, ensembles
Run set
1779
Run set
532

1.4b, 200m tokens, 32x and 16x, 6.4 wd
x32
1779
x16-wd6.4
1779