Flat vs factorized
Factorized is much more efficient, but a flat model has more eventual potential and is more agnostic. Per Kaiming's spatiotemporal paper, the flat model might be better at scaled throughputs.
Created on February 10|Last edited on February 11
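As a rough illustration of the efficiency gap mentioned in the summary, the sketch below counts query-key pairs per attention layer for the two styles. It assumes the standard formulations: flat attends jointly over all space-time tokens, while factorized alternates attention across spatial tokens within a timestep and across timesteps within a spatial token. The function name, argument names, and the example sizes are hypothetical and not taken from the runs in this report.

```python
def attention_pairs(num_timesteps: int, num_space_tokens: int) -> dict:
    """Count query-key pairs per attention layer (ignores heads and hidden size)."""
    t, s = num_timesteps, num_space_tokens
    # Flat: every space-time token attends to every other space-time token.
    flat = (t * s) ** 2
    # Factorized: attention across space within each timestep,
    # plus attention across time within each spatial token.
    factorized = t * s**2 + s * t**2
    return {"flat": flat, "factorized": factorized, "ratio": flat / factorized}


# Hypothetical sizes, chosen only for illustration.
costs = attention_pairs(num_timesteps=50, num_space_tokens=16)
print(costs)
```

Under these assumptions the ratio simplifies to t * s / (t + s), so the gap widens as either axis grows.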
We saw early on that (flat) spacetime models were terrible on Maze, and factor models were great -- at least for BPS. Revisiting now:
- The loss curves show nearly equivalent log-scale trends, but BPS does not correlate with them.
Flat is marginally worse on the Maze 5ms (mc_maze_med) eval.
[Run set: 3 runs]
Flat ≈ factor in RTT runs.
The two are similarly comparable on relatively stable datasets (RTT, 30K).
- Flat looks poised to catch up to factor in val, so we should probably train to convergence.
- But eval looks saturated, unless there are some incoming insights...
[Run set: 4 runs]
I don't dispute that the BPS scores differ, but that's almost separate from the loss question. The loss curves indicate that scaling is clear and consistent across factor sizes and styles.
[Run set: 17 runs]
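For reference, here is a minimal sketch of how a bits-per-spike score can be computed from predicted rates. It assumes BPS here means bits per spike in the usual sense: Poisson log-likelihood of the predicted rates relative to a mean-rate null model, normalized by spike count. The report does not define the metric, so the function below and its toy inputs are hypothetical. It shows what the number measures and does not by itself explain why it diverges from loss.

```python
import math


def bits_per_spike(rates, spikes):
    """Bits per spike for one neuron, given per-bin predicted rates and spike counts."""

    def poisson_ll(r, k):
        # Poisson log-likelihood summed over bins; rates must be > 0.
        return sum(ki * math.log(ri) - ri - math.lgamma(ki + 1) for ri, ki in zip(r, k))

    total_spikes = sum(spikes)
    # Null model: a constant rate equal to the neuron's mean spike count per bin.
    null_rates = [total_spikes / len(spikes)] * len(spikes)
    gain = poisson_ll(rates, spikes) - poisson_ll(null_rates, spikes)
    return gain / (total_spikes * math.log(2))


# Toy data, for illustration only.
spikes = [0, 2, 0, 2]
informative = bits_per_spike([0.5, 1.5, 0.5, 1.5], spikes)
null_score = bits_per_spike([1.0, 1.0, 1.0, 1.0], spikes)
print(informative, null_score)
```

A model that only predicts the mean rate scores zero, and any score above zero reflects structure captured beyond that baseline.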