
Deciding the optimal lr schedule (which is cosine)

Created on June 30 | Last edited on June 30

TL;DR


We perform an ablation on the learning rate schedule and measure its impact on the final performance of a 0.5B-parameter model trained on 200B tokens.
The results match our conjecture:
1. All the high learning rate runs (peak lr 8e-3) are significantly better than the low learning rate baseline (peak lr 8e-4).
2. For the high learning rate runs, a longer decay phase yields better performance for WSD with either a constant or an inverse-square-root stable phase (both sketched below). Consequently, Linear/Cosine schedules are better than WSD with a constant-lr stable phase.
3. WSD with an inverse-square-root stable phase is better even with a shorter decay: inverse square root + 20% decay beats constant + 40% decay.
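For reference, here is a minimal sketch of the WSD (warmup-stable-decay) variants compared above. The function name, warmup fraction, and linear decay-to-zero shape are illustrative assumptions; only the peak lr (8e-3) and the decay fractions (20% / 40%) come from the runs described here.

```python
import math

def wsd_lr(step, total_steps, peak_lr=8e-3, warmup_frac=0.01,
           decay_frac=0.2, stable_shape="constant", min_lr=0.0):
    """Illustrative WSD schedule: linear warmup, a stable phase that is either
    constant or inverse-square-root, then a linear decay over the last
    `decay_frac` of training (the decay shape is an assumption)."""
    warmup_steps = max(int(warmup_frac * total_steps), 1)
    decay_steps = max(int(decay_frac * total_steps), 1)
    decay_start = total_steps - decay_steps

    def stable_lr(s):
        if stable_shape == "constant":
            return peak_lr
        # inverse-square-root stable phase, anchored so lr == peak_lr at the end of warmup
        return peak_lr * math.sqrt(warmup_steps / max(s, warmup_steps))

    if step < warmup_steps:                    # linear warmup to peak_lr
        return peak_lr * step / warmup_steps
    if step < decay_start:                     # stable phase (constant or inv-sqrt)
        return stable_lr(step)
    frac = (step - decay_start) / decay_steps  # final decay to min_lr
    return stable_lr(decay_start) * (1 - frac) + min_lr * frac
```

Under these assumptions, the two WSD settings compared in point 3 would correspond to `stable_shape="inv_sqrt", decay_frac=0.2` and `stable_shape="constant", decay_frac=0.4`.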

We additionally observe that the cosine learning rate schedule is better than the linear schedule.
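For comparison, a sketch of the linear and cosine schedules, which decay over the whole run from the same peak lr (warmup omitted for brevity; `min_lr` is an assumed parameter):

```python
import math

def cosine_lr(step, total_steps, peak_lr=8e-3, min_lr=0.0):
    # Cosine decay from peak_lr to min_lr over the full run.
    frac = step / max(total_steps, 1)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * frac))

def linear_lr(step, total_steps, peak_lr=8e-3, min_lr=0.0):
    # Linear decay from peak_lr to min_lr over the full run.
    frac = step / max(total_steps, 1)
    return peak_lr + (min_lr - peak_lr) * frac
```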



Results

