[Saturation]

Created on February 21 | Last edited on February 24

Maybe scaling laws won't go far for single monkeys

We are specifically looking for a linear fit of the best test-set loss as data doubles and compute scales accordingly.
Model size doesn't appear to be an issue (8 layers vs. 6 layers -- no difference).
Important caveat: it is possible data isn't scaling fast enough relative to compute.
There doesn't seem to be a world where scale1 and scale2 can hit the current rough trend line.

  • How does scale1 compare to scale2?
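The linear-fit check above can be sketched as a fit of log loss against log dataset size; a power law appears linear in log-log space, and bending at the largest sizes would signal saturation. The sizes and losses below are hypothetical placeholders, not values from these runs.

```python
import numpy as np

# Hypothetical (trials, best test loss) pairs; each step doubles the data.
sizes = np.array([1000, 2000, 4000, 8000, 16000])
losses = np.array([0.52, 0.46, 0.41, 0.37, 0.34])

# A power law loss = a * size^b is linear in log-log space:
# log(loss) = log(a) + b * log(size).
slope, intercept = np.polyfit(np.log(sizes), np.log(losses), 1)

# Residuals from the linear fit: large residuals at the biggest sizes
# would indicate the scaling trend is bending (saturating).
pred = intercept + slope * np.log(sizes)
residuals = np.log(losses) - pred
print(slope)                     # negative exponent: loss falls as data grows
print(np.abs(residuals).max())   # small if the log-log trend is roughly linear
```

`np.polyfit(..., 1)` returns coefficients highest degree first, hence `slope, intercept`.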


Here we study, for example, per-session capacity. The base models allocate 1 token of per-session context; increasing this to 8 (`8s`) reliably improves early sample efficiency (enough that it seems always worthwhile), while ablating it reduces efficiency. More per-session capacity has a minor positive effect on best achieved performance.
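As a minimal sketch of what "per-session context tokens" could mean mechanically, assuming (hypothetically) that each session owns `k` learned embeddings prepended to the input sequence (`k=1` in the base runs, `k=8` in `8s`); dimensions and names here are illustrative, not the actual model code:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_sessions, k = 64, 3, 8

# One learned context table per session: (n_sessions, k, d_model).
# In a real model these would be trainable parameters.
session_ctx = rng.normal(size=(n_sessions, k, d_model))

def prepend_session_context(tokens, session_id):
    """Concatenate the session's k context tokens ahead of the data tokens."""
    return np.concatenate([session_ctx[session_id], tokens], axis=0)

data_tokens = rng.normal(size=(10, d_model))  # 10 data tokens for one trial
seq = prepend_session_context(data_tokens, session_id=1)
print(seq.shape)  # (k + 10, d_model) -> (18, 64)
```

Raising `k` gives the model more session-specific capacity at a small sequence-length cost, which is consistent with the sample-efficiency effect described above.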

(Run set: 9 runs)


Note that the lack of a scaling result is still consistent with larger factor sizes, but our understanding of model behavior with larger factors is poor. See [Factor].


We are not model-size bottlenecked

(else the larger models should be pulling ahead of the base models below)

(Run set: 6 runs)



If there is saturation in Churchland_Maze, it's not obvious

  • We can't study scaling once we drop below 1K trials.
  • (We only have ~8K trials per monkey released for Maze; we would need Erinn's, Gallego's, or Pitt data to reach the ~40K-trial scale that seemed to saturate RTT.)
  • We anticipate that CO, being strictly easier than RTT, should saturate faster (i.e., in < 40K trials).

(Run set: 3 runs)


The 64-context-token series still looks good... (when does it break?) (low priority)

(Run set: 8 runs)