[Atomicity] Compute-normalized factor size comps
Hm, BPS clouds the picture; let's just look at loss.
Created on February 10 | Last edited on February 14
There may be no regime-changing difference in achieved performance, but smaller factors may be more sample-efficient in a way that justifies the extra compute.
rtt_f2_m75 converges first.
Run set (6 runs)
Another data point for RTT: unannotated runs use 1 GPU; the flat_<x> series uses 4 GPUs.
- The best atom depends on budget.
- Convergence in a day; no obvious winner or loser after convergence.
- Worth a follow-up to see whether variant gaps scale correspondingly with larger data.
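Since the runs use different GPU counts (1 vs. 4), one rough way to compare them at equal compute is to put loss on a GPU-hours axis instead of wall-clock. The sketch below assumes identical GPUs and perfect multi-GPU scaling, which is optimistic; the helper names `gpu_hours` and `loss_at_budget` and all numbers are hypothetical, for illustration only, and are not taken from these runs.

```python
from bisect import bisect_right


def gpu_hours(wall_clock_hours, num_gpus):
    """Convert wall-clock hours to GPU-hours (a crude proxy for compute)."""
    return [h * num_gpus for h in wall_clock_hours]


def loss_at_budget(compute, losses, budget):
    """Return the last logged loss at or before `budget` GPU-hours, else None."""
    idx = bisect_right(compute, budget)
    return losses[idx - 1] if idx > 0 else None


# Hypothetical logged values, NOT results from the runs in this report.
wall_clock = [1.0, 2.0, 4.0, 8.0]
rtt_losses = [0.90, 0.80, 0.70, 0.65]   # assumed 1-GPU run
flat_losses = [0.85, 0.72, 0.66, 0.64]  # assumed 4-GPU run

rtt_compute = gpu_hours(wall_clock, num_gpus=1)   # 1, 2, 4, 8 GPU-hours
flat_compute = gpu_hours(wall_clock, num_gpus=4)  # 4, 8, 16, 32 GPU-hours

budget = 8.0  # compare both runs at the same GPU-hour budget
print("rtt :", loss_at_budget(rtt_compute, rtt_losses, budget))
print("flat:", loss_at_budget(flat_compute, flat_losses, budget))
```

Reading loss at a few budgets this way makes the "best atom depends on budget" point checkable: the ordering of variants can flip between a small and a large GPU-hour budget even when the wall-clock curves look similar.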
Run set (6 runs)
- On the other hand, who cares about the factor model if we're competitive with the best with flat models and we can scale flat more?