[Throughput] Mask Ratio effects
Masking more increases throughput (and perf/flop), but at what cost?
Created on February 10 | Last edited on February 11
Overall, the optimal mask ratio remains indeterminate and warrants more experimentation with well-tuned models.
The following plots suggest potential costs to high masking for Pitt data, but not for RTT. A consequence of the long trial structure in Pitt?
Nonetheless, high masking (most likely 0.8) can be used for iteration, and most likely transfers easily to lower masking.
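The throughput gain from higher masking can be sketched with a back-of-envelope token count. This is a minimal illustration assuming an MAE-style encoder that only processes unmasked tokens; the function names and token counts are hypothetical, not taken from these runs.

```python
# Sketch (assumption: MAE-style encoder that skips masked tokens).
# Encoder compute per sequence scales with the number of visible tokens,
# so encoder-side throughput grows roughly as 1 / (1 - mask_ratio).

def visible_tokens(n_tokens: int, mask_ratio: float) -> int:
    """Tokens the encoder actually processes (decoder cost ignored)."""
    return round(n_tokens * (1.0 - mask_ratio))

def relative_throughput(mask_ratio: float, baseline_ratio: float = 0.25) -> float:
    """Approximate encoder speedup relative to a baseline mask ratio."""
    return (1.0 - baseline_ratio) / (1.0 - mask_ratio)

for m in (0.25, 0.4, 0.8):
    print(m, visible_tokens(1000, m), round(relative_throughput(m), 2))
# 0.8 masking processes 200 of 1000 tokens, ~3.75x the 0.25-mask throughput
```

This ignores decoder and data-loading cost, so real speedups will be smaller than the encoder-only ratio.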
Pitt scaling 100K (Base)
- Unclear conclusions. Depending on the x-axis scaling, val loss looks either converged or not, and higher masking doesn't necessarily look worse. Eval does seem more clearly worse with higher masking, though that may reflect a suboptimal learning regime.
- Note it is hard to draw conclusions from this set of runs -- models and batch sizes were quite undersized -- but a more stable set of runs below produces consistent conclusions.
Run set (4 runs)
RTT (Joint) series
- These models all have token budgets adjusted so that each model receives the full 1s context of RTT segments.
- Relative to Pitt scaling, this data may be less structured; less time to take advantage of structure?
- There doesn't seem to be a difference at convergence (confirm for rtt_f2_m25).
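The token-budget adjustment described above can be sketched as follows. This is a hedged illustration: the bin rate, patches per bin, and total budget are assumed placeholder values, not the actual run configs.

```python
# Sketch of adjusting batch size so each sample spans a full 1 s RTT segment
# while the total token budget per batch stays fixed. All numbers are assumed.

def batch_size_for_budget(token_budget: int, bins_per_second: int,
                          patches_per_bin: int) -> int:
    """Segments (samples) per batch given a fixed per-batch token budget."""
    tokens_per_segment = bins_per_second * patches_per_bin  # 1 s of context
    return token_budget // tokens_per_segment

# e.g. a 16384-token budget with 50 bins/s and 8 spatial patches per bin
print(batch_size_for_budget(16384, 50, 8))  # -> 40 segments per batch
```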
Run set (10855 runs)
Pitt Scaling (150K Base)
- Mask ratio can make an appreciable difference at 40%. Note: Kaiming's (MAE) reasoning suggests lower masking may be bad due to overfitting; that is not what we see here.
Run set (8 runs)