
Decoding Pilot

Need to figure out how to get fine-tuning to work. Testing on a single RTT session to kick off.
Created on February 17 | Last edited on February 18
Status: scaling supervised decoding improves, coarsely, in accordance with unsupervised losses.
  • Pingbacks on fine-grained correlation (RTT_Indy, RTT_Indy_scratch)
  • Now need to drive towards SoTA on O'Doherty.

Loose threads:
- This is low-pri because we don't actually care about the unsupervised -> supervised narrative when our unsupervised eval loss diffs are so negligible.

Lag effects exist as expected, i.e. lags of ~120-140 ms are better than no lag for probes.

  • But actual fine-tuning performs better, and heavily regularized fine-tuning is not necessary.

[Run set panel: 9 runs]

Lag unfortunately still "matters" even when fine-tuning (140 ms is still better, sample-wise, than 0/20 ms). We should just use a lag of 120 ms.

[Run set panel: 6 runs]
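Concretely, the lag shifts behavior labels forward relative to the spikes, so activity at time t predicts movement ~120 ms later. A minimal sketch, assuming 20 ms bins (hypothetical; substitute the actual bin size) and numpy arrays for binned spikes and cursor velocity:

```python
import numpy as np

def apply_lag(spikes: np.ndarray, velocity: np.ndarray, lag_ms: int, bin_ms: int = 20):
    """Align spikes at time t with velocity at t + lag (neural activity leads behavior).

    spikes:   (T, C) binned spike counts
    velocity: (T, 2) cursor velocity targets
    """
    lag_bins = lag_ms // bin_ms
    if lag_bins == 0:
        return spikes, velocity
    # Drop the trailing lag_bins of spikes and the leading lag_bins of velocity,
    # so each remaining spike bin is paired with behavior lag_ms later.
    return spikes[:-lag_bins], velocity[lag_bins:]

# e.g. a 120 ms lag with 20 ms bins trims 6 bins off each array
X, y = apply_lag(np.zeros((100, 96)), np.zeros((100, 2)), lag_ms=120)
```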

Lower dropout is better (for basic decoding tuning)

[Run set panel: 8 runs]
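For reference, the dropout scan can be expressed as a W&B sweep. A sketch with illustrative values; the parameter keys and metric name are hypothetical, not the repo's actual config schema:

```python
# Hypothetical sweep config for the dropout / weight-decay scan.
# Keys like "dropout" and "eval_r2" are placeholders for the real config names.
sweep_config = {
    "method": "grid",
    "metric": {"name": "eval_r2", "goal": "maximize"},
    "parameters": {
        "dropout": {"values": [0.0, 0.1, 0.3, 0.5]},
        "weight_decay": {"values": [0.0, 0.01]},
    },
}
# wandb.sweep(sweep_config, project="context_general_bci")  # launch via the W&B API
```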



Decoding scaling


The flat decoding strategy more or less never works, across several different fine-tuning attempts:
  • Freezing
  • Differing dropouts + weight decay
  • Joint training with 0.8 masking
  • 1 layer decoder (still 100K parameters)
The failure of flat decoding more or less naturally suggests transformer decoding (aesthetically better than concat, though that also works). Rather than toil to get flat decoding working in low-trial settings, we're simply interested in whether we can train good decoders with scaling. This is the case.
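To make the contrast concrete, here is a numpy sketch of the two readouts over one trial of backbone outputs: a flat readout that concatenates all timesteps into one big linear layer, versus a learned query that cross-attends over the sequence and pools it before a small linear head. Shapes, the single-query pooling, and all weight names are illustrative assumptions, not the repo's actual decoder:

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 50, 128                               # timesteps, backbone hidden size (assumed)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

h = rng.normal(size=(T, D))                  # backbone outputs for one trial

# Flat readout: concatenate all timesteps -> one linear map (T*D weights per output)
W_flat = rng.normal(size=(T * D, 2)) / np.sqrt(T * D)
vel_flat = h.reshape(-1) @ W_flat            # (2,) velocity estimate

# Query readout: a learned query cross-attends over the sequence, then a small head
q = rng.normal(size=(D,))                    # learned decoding query
W_k = rng.normal(size=(D, D)) / np.sqrt(D)
W_v = rng.normal(size=(D, D)) / np.sqrt(D)
attn = softmax((h @ W_k) @ q / np.sqrt(D))   # (T,) attention over timesteps
pooled = attn @ (h @ W_v)                    # (D,) attention-weighted summary
W_out = rng.normal(size=(D, 2)) / np.sqrt(D)
vel_query = pooled @ W_out                   # (2,) velocity estimate
```

The query readout's parameter count is independent of T, which is one reason it conditions better than the flat map in low-trial settings.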
Based on reference results reported on Zenodo (https://zenodo.org/record/3854034):
  • Zenodo scores interpolate to approximately 0.57 R^2 for our bin size. Diffs in setup (besides architecture/scaling):
    1. They train on 320s (we have 187 eval training trials right now)
    2. They take as input sort + unsorted hash (expected improvement)

  • The best flat and non-flat pretrained models were taken from pilot runs and evaluated here.
    • Non-flat still in progress.
    • Flat is clearly better than single-session models. Pretraining also matters for conditioning the model to learn nontrivially.
    • This doesn't tell us that pretraining performance causes eval performance; also, we're doing downstream-dataset pretraining, not big-dataset pretraining.
      • RTT Indy runs were slightly worse on pretrain, and their performance should tell us something about whether pretrain predicts downstream decoding.
    • However, this does tell us (n=2) that, coarsely, pretrain performance correlates with eval performance (so long as the downstream task is all supervised). This is an uninteresting setting, but it provides the sanity check we needed.
  • Flat essentially performs on par with expectations, which is slightly below the O'Doherty results. Achieves ~0.5 R^2.
  • A while back we trained a causal, non-flat model which achieved 0.53 R^2 (`indy_base_decode`). It was based off rtt_causal https://wandb.ai/joelye9/context_general_bci/runs/ankuy3ai/overview?workspace=user-joelye9
    • We should aim for O'Doherty parity, as a test of our pipeline polish.
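The R^2 values above are presumably the standard coefficient of determination on held-out velocity. A sketch of the variance-weighted multi-output version; whether the reference results use this convention or a per-dimension average is an assumption:

```python
import numpy as np

def r2_score(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Variance-weighted R^2 across output dims (e.g. x/y velocity).

    y_true, y_pred: (T, K) arrays of behavior and predictions.
    """
    ss_res = ((y_true - y_pred) ** 2).sum(axis=0)               # residual SS per dim
    ss_tot = ((y_true - y_true.mean(axis=0)) ** 2).sum(axis=0)  # total SS per dim
    return float(1.0 - ss_res.sum() / ss_tot.sum())

y = np.array([[0., 0.], [1., 1.], [2., 2.]])
assert r2_score(y, y) == 1.0   # perfect prediction -> R^2 = 1
```

Predicting the mean for every bin gives R^2 = 0, so 0.5 vs 0.57 is a gap in explained velocity variance, not a percentage-point accuracy gap.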



[Run set panel: 7 runs]

- Joeyo says: consistent improvement from at least including sorted units.
- I suspect remaining room for improvement will come from
1. Using all units (hopefully this closes the gap from 0.53 to 0.57)
2. Slightly more sophisticated decoding (allowing lookahead), though acausal doesn't seem to make a difference
3. More trials (we can't use this, but it'd be good to know)
4. Loco or broader scaling
- All this suggests that our EOD improvements may be on the order of ~0.05 R^2 at most. Which is still respectable, but damn.
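On point 2, "allowing lookahead" amounts to relaxing the causal attention mask by a few bins. A sketch of such a mask (the parameterization is an assumption; the repo may implement this differently):

```python
import numpy as np

def lookahead_mask(T: int, lookahead: int = 0) -> np.ndarray:
    """Boolean (T, T) attention mask: position t may attend to bins <= t + lookahead.

    lookahead=0 is strictly causal; lookahead=T-1 is fully acausal.
    """
    idx = np.arange(T)
    return idx[None, :] <= idx[:, None] + lookahead

m = lookahead_mask(4, lookahead=1)
# row 0 may attend to bins 0 and 1 only; row 3 sees everything
```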


Addendum: Tuning yields ~2 points of improvement. Going from scratch does work with an appropriate LR but doesn't lead to great conditioning; still, going from scratch is decent.

[Run set panel: 3 runs]


Causal decoding doesn't seem to hurt.

Interesting...
Won't translate to exact decode parity because of additional time; currently trying to add additional neurons to improve modeling.


[Run set panel: 2 runs]

Though decoding from scratch seems more poorly conditioned in the causal scenario.

[Run set panel: 3 runs]