[Factor] When does factor size matter?
Created on February 21 | Last edited on February 21
Motivation: We believed amenable architectures support improved performance across scaling; both non-spatial inputs and high factor sizes (the continuous analog of non-spatial) should perform worse at any given amount of data.
Early on, RTT showed little dependence on spatial processing, but we have since moved to sorted RTT inputs (to achieve SoTA decoding), and early Maze experiments indicated drastic failure of spatial processing on sorted data.
So we ran a comparison with larger factor sizes (4 -> 32), expecting degradation; to our surprise, the result was not so clear.
Run set: 6 runs
- Per pass through the data, smaller factors are more efficient. They are perhaps learning more due to the more nuanced spatial masking.
- However, smaller factors require O(n^2) more attention memory (more tokens per trial means quadratic cost in sequence length), so much larger batches can be afforded with larger factors, and each step processes more data (analogous to the tradeoff of higher masking ratios: learn less per token, process more tokens).
- So comparing by steps, larger factors are advantaged. Comparing by wall-clock time (the fairest metric), larger factors appear an order of magnitude more efficient.
- What about maximum performance attained? A mixed bag.
- There seems to be no fundamental superiority of small factors here.
- We still expect that having no spatial tokens should fail, per the results in [Pilot] as well as the initial Maze results. So there are some conditions under which spatial masking is important; what are they? Can we reproduce them in sorted RTT at all?
- To be safe, we are trying even more extreme factor sizes next (pending).
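The memory tradeoff above can be sketched with some quick arithmetic. This is a minimal illustration, not the actual model code: the channel count and helper names are hypothetical, and we assume factor size means the number of channels grouped into one transformer token, so attention memory scales with the square of the resulting sequence length.

```python
# Sketch of the factor-size tradeoff (hypothetical numbers, not from the runs).

def tokens_per_trial(n_channels: int, factor_size: int) -> int:
    """Grouping `factor_size` channels into one token shortens the sequence."""
    return -(-n_channels // factor_size)  # ceiling division

def rel_attention_memory(n_channels: int, factor_size: int) -> int:
    """Self-attention memory scales with sequence length squared."""
    return tokens_per_trial(n_channels, factor_size) ** 2

channels = 128  # hypothetical array size
small, large = 4, 32  # the factor sizes compared in this report
ratio = rel_attention_memory(channels, small) / rel_attention_memory(channels, large)
print(tokens_per_trial(channels, small), tokens_per_trial(channels, large), ratio)
# 32 tokens vs. 4 tokens -> 64x more attention memory at factor size 4
```

Under these assumptions, the memory freed by the larger factor size is what lets each step carry a much bigger batch, which is the source of the wall-clock advantage noted above.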