Multisession, Multisubject Pilot
Goal: Understand scaling when it probably works
Created on February 13 | Last edited on February 17
When scaling up, we noticed the loss curves were fairly uninspiring as to whether anything was happening. Unsure whether I just didn't know what to look for, we wanted to check back against a scenario where we were fairly sure scaling was happening.
In old runs with multi-animal and multi-session scaling, models appear to need fewer steps and epochs to reach a given loss; the single-session model simply overfits.
- RTT_all was never run with factor, so it uses less memory and the batch size is much larger.
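The "fewer steps to a given loss" claim above can be made concrete by finding the first training step at which each run's loss curve crosses a target value. A minimal sketch, with made-up toy curves (the run names and numbers here are illustrative, not real run data):

```python
# Hypothetical sketch: quantify a leftward efficiency shift by finding the
# first step at which a loss curve drops to or below a target loss.

def steps_to_loss(curve, target):
    """Return the first step whose loss <= target, or None if never reached."""
    for step, loss in curve:
        if loss <= target:
            return step
    return None

# Toy (step, loss) curves; multi-session reaches the target earlier.
single_session = [(100, 1.0), (200, 0.6), (300, 0.45), (400, 0.42)]
multi_session = [(100, 0.9), (200, 0.5), (300, 0.40), (400, 0.38)]

target = 0.5
print(steps_to_loss(single_session, target))  # 300
print(steps_to_loss(multi_session, target))   # 200 -> leftward shift
```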
Old runs
[Run set: 8 runs]
Re-run, with sweeps
In lieu of sweeping capacity, we sweep dropout and learning rate.
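For reference, a sweep like this amounts to a small grid over the two hyperparameters. The specific values swept in these runs aren't recorded here, so the numbers below are placeholders:

```python
# Minimal sketch of a dropout x learning-rate grid sweep.
# The value lists are placeholders, not the values actually swept.
from itertools import product

dropouts = [0.1, 0.3, 0.5]           # placeholder dropout rates
learning_rates = [1e-4, 3e-4, 1e-3]  # placeholder learning rates

sweep = [{"dropout": d, "lr": lr} for d, lr in product(dropouts, learning_rates)]
print(len(sweep))  # 9 configurations
```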
Results are consistent with both multi-session and multi-subject transfer, without any particular conditioning other than providing context. No context ablation done at this point.
- Note: the masking rate is now set to 0.8 for throughput, so scores may be higher than before.
- Also, other than rtt_loco_indy_flat, all other flat references use standard infill, not asymmetric infill (config accident).
- rtt_indy should be competitive with single (if not slightly better), but needed more patience (used the default of 25 instead of 50).
- These results show scaling in session to have the expected leftward shifts in efficiency, still at 20 ms.
- Improvements from scaling are relatively small on the y-axis, on the order of 0.01.
- Duo (and loco single) is probably a bit worse than single since it didn't get much patience; it looks like it's still learning.
- Comparing rtt_indy_flat_8l and rtt_indy_flat_shuffle suggests flatness improves on the regular flat model primarily due to space masking, not capacity. (No comparison to a joint model here.)
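The masking-rate note above can be illustrated with a simple Bernoulli mask. This assumes independent per-position masking at the stated rate of 0.8; the actual infill scheme may differ, so this is only a sketch of why a higher rate gives more masked positions (and thus more training signal) per batch:

```python
# Sketch: independent Bernoulli masking at rate 0.8 (assumed scheme).
import random

def make_mask(n_tokens, mask_rate=0.8, seed=0):
    """Return a boolean mask where True marks a masked position."""
    rng = random.Random(seed)
    return [rng.random() < mask_rate for _ in range(n_tokens)]

mask = make_mask(1000)
print(sum(mask) / len(mask))  # roughly 0.8 of positions are masked
```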
[Run set: Indy all]