
Flat vs factorized

Factorized is much more efficient, but a flat model has more eventual potential and is more agnostic. Per Kaiming's spatiotemporal paper, the flat model might do better at scaled throughputs.
Created on February 10|Last edited on February 11
We saw early on that flat (spacetime) models were terrible on Maze while factorized models were great, at least as measured by BPS. Revisiting now:
  • Loss curves follow nearly equivalent trends on a log scale, but BPS does not track the loss.
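To put a rough number on the efficiency gap between the two styles, here is a minimal sketch that counts attention pairs per layer. It assumes "flat" means joint self-attention over all time x space tokens and "factorized" means separate attention over space (within each time step) and over time (within each spatial token); the token counts `T` and `S` are hypothetical and are not taken from these runs.

```python
def flat_attention_pairs(num_time: int, num_space: int) -> int:
    """Query-key pairs when every spacetime token attends to every other one."""
    n_tokens = num_time * num_space
    return n_tokens * n_tokens


def factorized_attention_pairs(num_time: int, num_space: int) -> int:
    """Query-key pairs when space and time attention are applied separately."""
    space_pairs = num_time * num_space ** 2  # S x S attention at each time step
    time_pairs = num_space * num_time ** 2   # T x T attention for each spatial token
    return space_pairs + time_pairs


# Hypothetical sizes, for illustration only.
T, S = 50, 8
flat = flat_attention_pairs(T, S)
factorized = factorized_attention_pairs(T, S)
ratio = flat / factorized  # algebraically T * S / (T + S)
print(f"flat={flat} factorized={factorized} ratio={ratio:.2f}")
```

Under these assumptions the gap grows as T * S / (T + S), so it widens as either axis gets longer. This counts attention pairs only and ignores the feed-forward layers, which cost the same per token in both styles.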


Flat is marginally worse on the Maze 5 ms (mc_maze_med) eval.


[Figure: two panels, with x-axes "Time (minutes)" and "epoch".]


Flat ~ factor in RTT runs.

The two are also comparable on relatively stable datasets (RTT, 30K).
  • Honestly, flat looks poised to catch up to factor in val; we should probably train to convergence.
  • But eval looks saturated, unless there are some incoming insights...







I don't dispute that the BPS scores differ, but that is almost separate from the loss question. The loss curves indicate that scaling is clear and consistent across factor sizes and styles.
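For readers unfamiliar with the metric, here is a minimal sketch of how BPS could be computed. It assumes BPS means bits per spike, defined as the Poisson log-likelihood gain over a per-neuron mean-rate null model, normalized by spike count. The function names and array layout (time bins x neurons) are illustrative and are not taken from the code behind these runs.

```python
import numpy as np


def poisson_nll(rates: np.ndarray, spikes: np.ndarray) -> float:
    """Poisson negative log-likelihood summed over bins.

    The log(k!) term is dropped because it cancels when two models
    are compared on the same spikes.
    """
    rates = np.clip(rates, 1e-9, None)  # avoid log(0)
    return float(np.sum(rates - spikes * np.log(rates)))


def bits_per_spike(rates: np.ndarray, spikes: np.ndarray) -> float:
    """Log-likelihood gain over a mean-rate null model, in bits per spike."""
    # Null model: each neuron fires at its own mean rate in every bin.
    null_rates = np.broadcast_to(spikes.mean(axis=0, keepdims=True), spikes.shape)
    gain_nats = poisson_nll(null_rates, spikes) - poisson_nll(rates, spikes)
    return gain_nats / (float(spikes.sum()) * np.log(2))


# Toy data: 2 time bins x 2 neurons, illustrative only.
toy_spikes = np.array([[0.0, 2.0], [4.0, 0.0]])
toy_rates = np.array([[0.1, 2.0], [4.0, 0.1]])
print(bits_per_spike(toy_rates, toy_spikes))
```

Under this definition the score is relative to a null model and normalized by the spike count of the data being scored, so it is not on the same footing as a raw training loss.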
