[Scale] In-session vs multi-session
A controlled study of our ability to scale.
Created on March 3 | Last edited on March 7
We studied aggregation with sorted units included (inconsistent units per day), and there the spacetime-factor 16 model was the best, so we continued with it here. We record separate thoughts on that finding in [Scale] Inductive Bias.
Conclusions:
- Revisiting scale demonstrates nontrivial but somewhat weak transfer mechanisms for spacetime models. We should proceed with exploiting this method.
- This holds only if we're confident it is the best mechanism we are likely to get; we demonstrate that in [Scale] Inductive Bias.
(First, on unsorted units.) We study scaling when operating only with relatively stable hash units, the best-case "stable" transfer scenario. We call it that because models can zero-shot on novel days with hash input, but not with sorted units.
Within-session data scaling slows
- We expected in-session scaling to be the best case: in-distribution scaling. Yet even here, loss doesn't seem to decrease reliably with more data.
- Supervised loss (MSE) still seems to be improving at a healthy rate, but must literally stop within at most ~8 more doublings.
- The comparison with MSE reminds us that Poisson NLL doesn't really have any reason to behave like classification NLL.
- Toggle sweep controls to see that our variants are reasonably trained and that scaling differences are nontrivial (> sweep variance), save for overly early stopping.
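The MSE/Poisson point can be made concrete with a toy sketch (synthetic data, not our recordings): even a perfectly calibrated rate prediction pays a nonzero, rate-dependent Poisson NLL floor, so there's no reason to expect classification-NLL-like behavior near convergence.

```python
import numpy as np

def poisson_nll(rates, counts):
    """Per-bin Poisson negative log-likelihood, dropping the constant
    log(k!) term (fine for comparing models on the same data)."""
    rates = np.clip(rates, 1e-8, None)  # guard against log(0)
    return float(np.mean(rates - counts * np.log(rates)))

def mse(rates, counts):
    return float(np.mean((rates - counts) ** 2))

rng = np.random.default_rng(0)
true_rates = rng.uniform(0.5, 5.0, size=1000)  # hypothetical firing rates
counts = rng.poisson(true_rates)               # observed spike counts

# Even the true rates leave a data-dependent loss floor; a constant-rate
# baseline is strictly worse under both losses, but neither floor is 0.
baseline = np.full_like(true_rates, counts.mean())
print(poisson_nll(true_rates, counts), poisson_nll(baseline, counts))
print(mse(true_rates, counts), mse(baseline, counts))
```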
[Panels: Unsupervised scaling curves]
Multisession scaling is clearly worse than in-session scaling.
- If we choose to believe that "scaling laws" still hold, the exponent must be a much worse one for multisession scaling.
- For example, ~4x as much multisession data appears needed to match the intra-session result at 200 and 400 trials.
- Does this hold more broadly? What if we scale up the amount of data in these other sessions?
- The main reason this is disappointing/surprising is that existing multisession works don't really appear to study this phenomenon; Brianna's work seems to suggest we should expect reasonable closeness of data across ~2 months.
- Scaling does still go a bit further; however, we can't quite match 1.6K trials' worth of in-session data even with 18K trials' worth of out-of-session data (i.e., with some quite nearby days included).
- This is quite striking: even though we're sampling literally from adjacent days, peak performance isn't quite reached.
- The permute-channels-per-session ablation still transfers, but saturates much earlier. This indicates the model is vulnerable to relatively superficial changes in channel structure after all.
- For unknown reasons, multi-perm doesn't seem to improve at all at 1600. Perhaps capacity? It seems our model is simply failing to transfer.
- Similarly, Loco data improves performance from what ~100 trials are worth to ~200, but saturates even sooner than permute.
- The supervised multisession story isn't as clear: oddly good at 200, oddly bad at 400, and Loco doesn't transfer at all.
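"A much worse exponent" can be checked with a quick least-squares fit in log-log space. The loss values below are illustrative placeholders (not our actual runs), chosen to mimic a fast intra curve and a slow multisession curve:

```python
import numpy as np

def power_law_exponent(n_trials, losses):
    """Fit loss ~ c * n^(-alpha) by least squares in log-log space;
    larger alpha means faster scaling with data."""
    slope, _ = np.polyfit(np.log(n_trials), np.log(losses), 1)
    return -slope

n = np.array([100, 200, 400, 800, 1600])
intra_loss = np.array([0.40, 0.36, 0.33, 0.30, 0.275])  # hypothetical
multi_loss = np.array([0.40, 0.38, 0.36, 0.345, 0.33])  # hypothetical

a_intra = power_law_exponent(n, intra_loss)
a_multi = power_law_exponent(n, multi_loss)
# With alpha_multi around alpha_intra / 2, matching one intra doubling
# takes roughly a quadrupling of multisession data, consistent with the
# "~4x as much data" observation above.
print(a_intra, a_multi)
```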
[Panels: Intra, permute, and multi scaling curves]
Given these observations, it seems likely that the observed scaling behavior (saturating or not) is in good part driven by single-session saturation.
- However, we have much more "multisession" data than single-session data - e.g., well more than 4x (it's more like 40x). So shouldn't we actually attribute saturation to multisession scaling saturation?
- See how much we improve on 100 with the full Indy dataset.
Conclusions
- Scaling is visible in most settings, but the degree varies: intra best, inter next, permute next, and inter-subject last. Each step reduces the effectiveness of added data by maybe a factor of 2x or more.
- JY: Inter-subject commentary should really be limited until we can "scale across subjects." Else there's an imbalance in distributions.
- Permuting shows us the model can leverage highly unstructured data, though it has a harder time doing so. So if we saturate, that is what it is.
- Unsupervised pretraining has more stable scaling behavior, and does work across subjects - kind of pitifully (worse than 4x), but still.
- Supervised side-by-side intra scaling indicates that a matter of 0.01 NLL spans the full gamut from trivial to SoTA decoding R2. Thus it makes sense that we don't see the power law extrapolating far beyond the current floor (R2 is capped, and we're coming up on it).
- However, the current saturation behavior is a flaw of multi-session transfer, i.e., it's not simply inevitable.
Thus, we definitely won't be setting in-session performance records.
The question now becomes, can we measure context generalization as a desirable endpoint? Or what becomes the nail we can strike with this hammer?
Other loose ends
- It's unclear how perm scaling fits into the successful-scaling narrative, but it's probably better than expected once we increase data by a few orders of magnitude.
What about sorted units, unstable APIs?
- We switch back because we were observing poor zero-shot on sorted units; does that imply different transfer properties as well?
Intra-scaling is stable as expected
[Panel: Run set 2]
Multi-session scaling is reduced by about a 2-4x factor. That is, multi_1600 now lands between single_200 and single_400, whereas previously it landed between single_400 and single_800.
multi_all lands between single_400 and single_800, whereas previously it was between single_800 and single_1600.
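The bracketing above can be turned into a rough "in-session equivalent" via log-linear interpolation on the single-session curve. A sketch with hypothetical loss values (the real curves live in the panels):

```python
import numpy as np

def in_session_equivalent(single_n, single_loss, target_loss):
    """Interpolate where target_loss falls on the single-session
    loss-vs-trials curve (linear in log2 trials), returning the
    equivalent number of in-session trials."""
    order = np.argsort(single_loss)  # np.interp needs ascending x
    log2n = np.interp(target_loss,
                      np.asarray(single_loss)[order],
                      np.log2(single_n)[order])
    return float(2.0 ** log2n)

single_n = np.array([100, 200, 400, 800, 1600])
single_loss = np.array([0.40, 0.36, 0.33, 0.30, 0.275])  # hypothetical
multi_1600_loss = 0.345                                   # hypothetical

eq = in_session_equivalent(single_n, single_loss, multi_1600_loss)
# A multi_1600 loss between single_200 and single_400 maps to a few
# hundred equivalent in-session trials, i.e. a ~4-8x data discount.
```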
Permuting is still harmful, indicating model still utilizes the relative spatial positioning on arrays.
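For reference, the permute ablation amounts to something like the following sketch (hypothetical shapes and function name, for illustration): each session gets an independent random channel permutation, preserving within-session structure but destroying cross-session channel correspondence.

```python
import numpy as np

def permute_channels_per_session(sessions, seed=0):
    """Apply an independent random channel permutation to each session's
    (trials, time, channels) array."""
    rng = np.random.default_rng(seed)
    return [x[..., rng.permutation(x.shape[-1])] for x in sessions]

# Two hypothetical sessions: 10 trials, 50 time bins, 96 channels each.
rng = np.random.default_rng(1)
sessions = [rng.poisson(1.0, size=(10, 50, 96)) for _ in range(2)]
permuted = permute_channels_per_session(sessions)
# Per-bin population counts are unchanged; only channel identity moves.
```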
Multianimal: loco_6400, acausal, ends up just shy of single_400; this matches what happened in the unsorted case (see inductive_bias). Great! This indicates that sorting produces a hard-to-transfer-to scenario rather than a hard-to-transfer-from one (Loco was always mismatched, and is now more mismatched since individual Loco sessions are sorted, yet this didn't affect the relative rate of transfer).
[Panels: Multi vs Single scaling curves]
Finally, in the sorted case, other models are much worse at reconciling sessions (sanity check).
[Panels: Arch comparisons, Baselines]
Conclusions about sorted, and overall
- Sorting will hurt cross-session transfer (here, by about 2-4x), but apparently not cross-subject (and presumably the even-more-distant cross-task) scenarios.
- However, sorting can provide somewhat substantial decode advantage. So trade off at your own risk.
- If you don't have much cross-session, just go for sort.
- If you have medium amounts of cross-session, low in-session, go for unsort.
- With infinite cross session, low in-session, go for sort.
- Our recommendation would be not to sort, so that you can better leverage the similar-context data available for the subject. Unless sorting provides a drastic boost in performance.
- Multisession transfer with sorted neurons is qualitatively similar to trying to transfer with permuted channels (in unsorted experiments), in that cross-session transfer of 16x trials only amounts to around 2x-ing in-session data.
- But note, even with sorted neurons, permuting is harmful, relative spatial consistency across sessions is better than nothing.
- If you have a lot of in-context data, do whatever you want. If not, you need to promote the transfer of whatever data you have, which means trying to standardize where convenient.