Deciding the optimal learning rate schedule (which is cosine)
Created on June 30 | Last edited on June 30
TL;DR
We perform an ablation on the learning rate schedule and how it impacts final performance, using a 0.5B-parameter model trained on 200B tokens.
The results match our conjectures:
1. All the high learning rate runs (with a peak lr of 8e-3) are significantly better than the low learning rate baseline (with a peak lr of 8e-4).
2. Among the high learning rate runs, a longer decay phase yields better performance for WSD with a constant-lr stable phase. Consequently, linear and cosine schedules, which decay over the entire run, outperform WSD with a constant-lr stable phase.
3. WSD with an inverse square root stable phase instead favors a shorter decay phase: inverse square root + 20% decay outperforms constant + 40% decay.
We additionally observe that the cosine schedule outperforms the linear schedule.
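The schedules compared above can be sketched as a single function of training progress. This is a minimal, hypothetical sketch, not the exact code used in the ablation: the warmup length, the inverse square root horizon constant (`10` below), and the function name are all assumptions for illustration.

```python
import math

def lr_at(step, total_steps, peak_lr=8e-3, warmup_steps=1000,
          schedule="cosine", stable="constant", decay_frac=0.2):
    """Sketch of the compared schedules: cosine, linear, and WSD.

    schedule:   "cosine", "linear", or "wsd".
    stable:     shape of the WSD stable phase, "constant" or "inv_sqrt".
    decay_frac: fraction of training spent in the final linear decay (WSD only).
    """
    # Linear warmup to peak_lr, common to all schedules.
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    # Progress through the post-warmup portion, in [0, 1].
    t = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    if schedule == "cosine":
        # Decays over the whole run, reaching 0 at the end.
        return 0.5 * peak_lr * (1 + math.cos(math.pi * t))
    if schedule == "linear":
        return peak_lr * (1 - t)
    if schedule == "wsd":
        # Stable phase, then linear decay over the last decay_frac of training.
        decay_start = 1.0 - decay_frac
        if stable == "constant":
            base = peak_lr
        else:  # "inv_sqrt": lr falls off as 1/sqrt during the stable phase
            base = peak_lr / math.sqrt(1 + 10 * min(t, decay_start))
        if t < decay_start:
            return base
        return base * (1 - (t - decay_start) / decay_frac)
    raise ValueError(f"unknown schedule: {schedule}")
```

Under this parameterization, "inverse square root + 20% decay" is `schedule="wsd", stable="inv_sqrt", decay_frac=0.2`, and "constant + 40% decay" is `stable="constant", decay_frac=0.4`.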
Result
[Interactive W&B charts for run set 4636 omitted]