Flat vs factorized

Factorized is much more efficient, but a flat model has more eventual potential and is more agnostic. Per Kaiming's spatiotemporal paper, flat might be better at scaled throughputs.

Joel Ye

Created on February 10|Last edited on February 11

We saw early on that flat (spacetime) models were terrible on Maze and factor models were great -- for BPS. Revisiting now:
The losses follow nearly equivalent log-scale trends, but BPS does not track the loss.
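For a rough sense of the efficiency gap, here is a minimal sketch that counts query-key pairs per attention layer. It assumes "flat" means every (time, space) token attends to every other token, and "factorized" means one attention pass over time followed by one over space; the report does not spell out either definition. The function names and the sizes `T` and `S` are hypothetical.

```python
def attention_pairs_flat(num_timesteps: int, num_space_tokens: int) -> int:
    """Query-key pairs when all (time, space) tokens attend jointly."""
    n = num_timesteps * num_space_tokens
    return n * n


def attention_pairs_factorized(num_timesteps: int, num_space_tokens: int) -> int:
    """Query-key pairs for one temporal pass plus one spatial pass."""
    # Temporal pass: each space token attends across its own T timesteps.
    temporal = num_space_tokens * num_timesteps ** 2
    # Spatial pass: each timestep attends across its own S space tokens.
    spatial = num_timesteps * num_space_tokens ** 2
    return temporal + spatial


# Hypothetical sizes, not taken from the runs in this report.
T, S = 100, 8
flat_pairs = attention_pairs_flat(T, S)
factor_pairs = attention_pairs_factorized(T, S)
print(flat_pairs, factor_pairs, flat_pairs / factor_pairs)
```

Under these assumptions the ratio works out to T*S / (T + S), so the gap widens as either axis grows.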
Flat is marginally worse on the Maze 5ms (mc_maze_med) eval.
[Plot: val_loss vs. time (minutes). Runs: flat_maze_4-m0y4gxbe, st_maze_4-xbdpvtu7, factor_maze_4-cixrizpp]
[Plot: train_loss vs. epoch]
flat ~ factor in RTT runs. They're similarly comparable in relatively stable datasets (RTT, 30K).
Flat looks poised to catch up to factor in val, so we should probably train to convergence.
But eval looks saturated, unless there are some incoming insights...
I don't dispute that BPS scores differ, but that's almost separate from the loss question -- the loss curves indicate that scaling is clear and consistent across factor sizes and styles.
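For reference when comparing BPS against loss, below is a minimal single-neuron sketch of bits per spike, assuming BPS here is the usual Poisson log-likelihood gain over a mean-rate null model, expressed in bits and divided by the spike count. The report does not define the metric or the loss, so the function names and the toy inputs are illustrative only.

```python
import math


def poisson_log_likelihood(rates, spikes):
    """Sum of Poisson log-likelihoods of binned spike counts given rates > 0."""
    total = 0.0
    for rate, count in zip(rates, spikes):
        total += count * math.log(rate) - rate - math.lgamma(count + 1)
    return total


def bits_per_spike(rates, spikes):
    """Log-likelihood gain over a mean-rate null model, in bits per spike."""
    n_spikes = sum(spikes)
    # Null model: predict the neuron's mean count in every bin.
    null_rates = [n_spikes / len(spikes)] * len(spikes)
    gain = poisson_log_likelihood(rates, spikes) - poisson_log_likelihood(
        null_rates, spikes
    )
    return gain / (n_spikes * math.log(2))


# Toy data, not from any run in this report.
spikes = [0, 2, 0, 2]
model_rates = [0.5, 1.5, 0.5, 1.5]
print(bits_per_spike(model_rates, spikes))
```

Predicting the mean rate in every bin scores exactly zero under this definition, and rates that follow the spikes score above zero.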