Data efficiency scaling laws

Created on June 11 | Last edited on June 22

[Plots: three panels with run_progress on the axis, showing the first 10, 50, and 10 runs respectively]
Run sets (run counts):
  • initial debugging: 0
  • epoch scaling bs 64: 10
  • weight decay tuning: 32
  • full scaling laws: 744
  • ensemble members: 214
  • 600m hp: 0

Main lessons so far
  • Batch size has a large effect on results.
  • A learning rate of 3e-3 works well at batch size 64; larger batch sizes may need a smaller learning rate (not yet confirmed).
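
One way to test the open question about learning rate and batch size is a small grid sweep. The sketch below is illustrative only: the base point (lr 3e-3 at batch size 64) comes from the notes above, while the other batch sizes, the multipliers, and the helper name `build_sweep` are assumptions, not settings used in these runs.

```python
import itertools

# From the notes: lr 3e-3 worked well at batch size 64.
BASE_LR = 3e-3
BASE_BATCH_SIZE = 64

# Assumptions for illustration only: these batch sizes and multipliers
# are hypothetical, not values taken from the report.
BATCH_SIZES = [BASE_BATCH_SIZE, 128, 256]
LR_MULTIPLIERS = [1.0, 0.5, 0.25]


def build_sweep(batch_sizes, multipliers, base_lr=BASE_LR):
    """Return one config dict per (batch size, learning rate) pair."""
    return [
        {"batch_size": bs, "lr": base_lr * m}
        for bs, m in itertools.product(batch_sizes, multipliers)
    ]


configs = build_sweep(BATCH_SIZES, LR_MULTIPLIERS)
for cfg in configs:
    print(cfg)
```

Comparing the best learning rate found at each batch size would show whether the optimum actually shifts downward as batch size grows.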