[Atomicity] Compute-normalized factor size comps

Hm, BPS clouds the picture; let's just look at loss.

Created on February 10 | Last edited on February 14

There may be no regime-change difference in achieved performance, but smaller factors may be more sample-efficient in a way that justifies the extra compute. rtt_f2_m75 converges first.
[Plot: val_loss vs. Time (hours)]
rtt_f2_m75-11ljra7g
rtt_f2_m25-chki2fa7
rtt_f2_m5-da3wgabk
rtt_m75-t6ofj5ae
rtt_m25-1fp382ze
rtt_m5-1xzlgb83
[Plot: eval_loss vs. Time (hours)]

Another data point for RTT: unannotated runs use 1 GPU; the flat_<x> series uses 4 GPUs.
The best atom depends on budget.
Runs converge within a day, with no obvious winner after convergence.
Worth a follow-up to see whether the variant gaps scale correspondingly on larger data.
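
Since the two series differ in GPU count (1 vs. 4), wall-clock hours on the plots are not directly comparable. A minimal sketch of one way to put them on a shared axis, assuming "compute-normalized" just means GPU-hours = wall-clock hours × GPU count with all GPUs busy throughout. The helper names and the curves below are hypothetical placeholders, not values from these runs.

```python
from typing import List, Optional, Sequence


def gpu_hours(wall_clock_hours: Sequence[float], num_gpus: int) -> List[float]:
    """Convert wall-clock hours to GPU-hours (assumes full utilization of every GPU)."""
    return [h * num_gpus for h in wall_clock_hours]


def compute_to_reach(
    wall_clock_hours: Sequence[float],
    losses: Sequence[float],
    num_gpus: int,
    target_loss: float,
) -> Optional[float]:
    """GPU-hours at the first logged point with loss <= target_loss, or None if never reached."""
    for gh, loss in zip(gpu_hours(wall_clock_hours, num_gpus), losses):
        if loss <= target_loss:
            return gh
    return None


# Synthetic placeholder curves, NOT measurements from the runs in this report.
hours = [1.0, 2.0, 4.0, 8.0]
one_gpu_losses = [0.30, 0.20, 0.15, 0.12]
four_gpu_losses = [0.20, 0.14, 0.12, 0.11]

one_gpu_cost = compute_to_reach(hours, one_gpu_losses, num_gpus=1, target_loss=0.15)
four_gpu_cost = compute_to_reach(hours, four_gpu_losses, num_gpus=4, target_loss=0.15)
print(one_gpu_cost, four_gpu_cost)
```

In this toy case the 4-GPU run hits the target sooner in wall-clock time (2 h vs. 4 h) but costs more compute (8 vs. 4 GPU-hours), which is the kind of budget dependence noted above.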
﻿
On the other hand, who cares about the factor model if we're competitive with the best with flat models and we can scale flat more?
﻿
