[Saturation]
Created on February 21 | Last edited on February 24
Maybe scaling laws won't go far for single monkeys.
Specifically, we are looking for a linear fit of best test-set loss as data doubles and compute scales.
Model size doesn't appear to be an issue (8 layers vs. 6 layers -- no difference).
Important caveat: it is possible data isn't scaling fast enough relative to compute.
There doesn't seem to be a scenario where scale1 and scale2 can land on the current rough trend line.
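The check described above can be sketched as an ordinary least-squares fit in log-log space (log best test-set loss vs. log trial count); the helper names and the numbers below are hypothetical, not our actual runs:

```python
import math

def fit_scaling_line(trial_counts, test_losses):
    """Least-squares fit of log(loss) = a * log(trials) + b; returns (a, b)."""
    xs = [math.log(n) for n in trial_counts]
    ys = [math.log(l) for l in test_losses]
    k = len(xs)
    mx, my = sum(xs) / k, sum(ys) / k
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return a, b

def predicted_loss(trials, a, b):
    """Loss the fitted line predicts at a given trial count."""
    return math.exp(a * math.log(trials) + b)

trials = [5_000, 10_000, 20_000, 40_000]  # hypothetical trial counts (doubling)
losses = [0.40, 0.33, 0.27, 0.22]         # hypothetical best test-set losses
a, b = fit_scaling_line(trials, losses)
# A negative slope `a` means loss still falls as data doubles; a run whose
# actual loss sits well above predicted_loss(...) hints at saturation.
```

A run set that can't reach the line's extrapolation at its scale is the "can't hit the rough line" situation described above.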

- How does scale 1 compare to scale 2?
Scaling trends will differ depending on architecture
Here we study, for example, per-session capacity. The base models allocate 1 token of per-session context; increasing this to 8 (`8s`) reliably improves early sample efficiency (enough that it seems always worthwhile to use), while ablating it reduces efficiency. There is also a minor positive effect of more per-session capacity on best achieved performance.
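As a sketch of the sweep being described (field names are illustrative, not the real experiment config schema):

```python
# Hypothetical config dicts for the per-session capacity sweep.
base = {"layers": 6, "session_tokens": 1}     # base models: 1 per-session token
more = {**base, "session_tokens": 8}          # the `8s` variant
ablated = {**base, "session_tokens": 0}       # per-session context removed
variants = [ablated, base, more]              # swept with all else held fixed
```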
Run set (9 runs)
Note the lack of a scaling result is still consistent with larger factor sizes, but our understanding of model behavior with larger factors is poor. See [Factor].
We are not model-size bottlenecked
(otherwise the larger models would be pulling ahead of the base models below)
Run set (6 runs)
If there is saturation in Churchland_Maze, it's not obvious
- We can't study scaling once we drop below 1K trials.
- (We only have ~8K trials per monkey released for Maze; we'd need Erinn's, Gallego's, or Pitt data to reach the ~40K-trial scale that seemed to saturate RTT.)
- Anticipating that CO, being strictly easier than RTT, should saturate faster (i.e., in <40K trials).
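One way to make the saturation question concrete once more data is available: track the per-doubling slope of log-loss and watch it shrink toward zero. A minimal sketch with hypothetical numbers:

```python
import math

def loglog_slopes(ns, ls):
    """Slope of log(loss) vs log(trials) between consecutive points."""
    return [(math.log(l2) - math.log(l1)) / (math.log(n2) - math.log(n1))
            for (n1, l1), (n2, l2) in zip(zip(ns, ls), zip(ns[1:], ls[1:]))]

# Hypothetical RTT-like losses flattening near 40K trials.
trials = [5_000, 10_000, 20_000, 40_000]
losses = [0.40, 0.33, 0.30, 0.295]
slopes = loglog_slopes(trials, losses)
# Negative slopes shrinking toward 0 across doublings is the saturation
# signature; a steady slope means scaling is still paying off.
```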
Run set (3 runs)
The 64-context-token series still looks good... (when does it break?) (low pri)
Run set (8 runs)