[Saturation]

Created on February 21 | Last edited on February 24

Maybe scaling laws won't go far for single monkeys

We are specifically looking for a linear fit of the best test-set loss as data doubles and compute scales accordingly.
Model size doesn't appear to be an issue (8 layers vs. 6 layers -- no difference).
Important caveat: it is possible data isn't scaling fast enough relative to compute.
There doesn't seem to be a world where scale1 and scale2 can hit the current rough trend line.

  • How does scale1 compare to scale2?
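The linear-fit check above can be sketched as a fit of log loss against log dataset size; a power law appears linear in log-log space, and bending at the largest sizes would signal saturation. The sizes and losses below are hypothetical placeholders, not values from these runs.

```python
import numpy as np

# Hypothetical (trials, best test loss) pairs; each step doubles the data.
sizes = np.array([1000, 2000, 4000, 8000, 16000])
losses = np.array([0.52, 0.46, 0.41, 0.37, 0.34])

# A power law loss = a * size^b is linear in log-log space:
# log(loss) = log(a) + b * log(size).
slope, intercept = np.polyfit(np.log(sizes), np.log(losses), 1)

# Residuals from the linear fit: large residuals at the biggest sizes
# would indicate the scaling trend is bending (saturating).
pred = intercept + slope * np.log(sizes)
residuals = np.log(losses) - pred
print(slope)                     # negative exponent: loss falls as data grows
print(np.abs(residuals).max())   # small if the log-log trend is roughly linear
```

`np.polyfit(..., 1)` returns coefficients highest degree first, hence `slope, intercept`.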


Here we study, for example, per-session capacity. The base models allocate 1 token of per-session context; increasing this to 8 (`8s`) reliably improves early sample efficiency (enough that it seems always worthwhile), while ablating it reduces efficiency. More per-session capacity has a minor positive effect on best achieved performance.
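As a minimal sketch of what "per-session context tokens" could mean mechanically, assuming (hypothetically) that each session owns `k` learned embeddings prepended to the input sequence (`k=1` in the base runs, `k=8` in `8s`); dimensions and names here are illustrative, not the actual model code:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_sessions, k = 64, 3, 8

# One learned context table per session: (n_sessions, k, d_model).
# In a real model these would be trainable parameters.
session_ctx = rng.normal(size=(n_sessions, k, d_model))

def prepend_session_context(tokens, session_id):
    """Concatenate the session's k context tokens ahead of the data tokens."""
    return np.concatenate([session_ctx[session_id], tokens], axis=0)

data_tokens = rng.normal(size=(10, d_model))  # 10 data tokens for one trial
seq = prepend_session_context(data_tokens, session_id=1)
print(seq.shape)  # (k + 10, d_model) -> (18, 64)
```

Raising `k` gives the model more session-specific capacity at a small sequence-length cost, which is consistent with the sample-efficiency effect described above.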

(Run set: 9 runs)


Note that the lack of a scaling result is still consistent with larger factor sizes, but our understanding of model behavior with larger factors is poor. See [Factor].


We are not model-size bottlenecked

(else the larger models should be pulling ahead of the base models below)

(Run set: 6 runs)



If there is saturation in Churchland_Maze, it's not obvious

  • We can't study scaling once we drop below 1K trials.
  • (We only have ~8K trials per monkey released for Maze; we would need Erinn's, Gallego's, or Pitt data to reach the ~40K-trial scale that seemed to saturate RTT.)
  • We anticipate that CO, being strictly easier than RTT, should saturate faster (i.e., in < 40K trials).

(Run set: 3 runs)


The 64-context-token series still looks good... (when does it break?) (low priority)

(Run set: 8 runs)