Deciding the optimal learning rate schedule (which is cosine)
Created on June 30 | Last edited on June 30
TL;DR
We perform an ablation on the learning rate schedule and how it impacts final performance, using a 0.5B-parameter model trained on 200B tokens.
The results match our conjectures:
1. All the high learning rate runs (with a peak lr of 8e-3) are significantly better than the low learning rate baseline (with a peak lr of 8e-4).
2. Among the high learning rate runs, a longer decay phase yields better performance for WSD with a constant-lr stable phase. Consequently, linear and cosine schedules, which decay over the entire run, outperform WSD with a constant-lr stable phase.
3. WSD with an inverse square root stable phase instead favors a shorter decay phase: inverse square root + 20% decay outperforms constant + 40% decay.
We additionally observe that the cosine schedule outperforms the linear schedule.
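The schedules compared above can be sketched as a single function of training progress. This is a minimal, hypothetical sketch, not the exact code used in the ablation: the warmup length, the inverse square root horizon constant (`10` below), and the function name are all assumptions for illustration.

```python
import math

def lr_at(step, total_steps, peak_lr=8e-3, warmup_steps=1000,
          schedule="cosine", stable="constant", decay_frac=0.2):
    """Sketch of the compared schedules: cosine, linear, and WSD.

    schedule:   "cosine", "linear", or "wsd".
    stable:     shape of the WSD stable phase, "constant" or "inv_sqrt".
    decay_frac: fraction of training spent in the final linear decay (WSD only).
    """
    # Linear warmup to peak_lr, common to all schedules.
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    # Progress through the post-warmup portion, in [0, 1].
    t = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    if schedule == "cosine":
        # Decays over the whole run, reaching 0 at the end.
        return 0.5 * peak_lr * (1 + math.cos(math.pi * t))
    if schedule == "linear":
        return peak_lr * (1 - t)
    if schedule == "wsd":
        # Stable phase, then linear decay over the last decay_frac of training.
        decay_start = 1.0 - decay_frac
        if stable == "constant":
            base = peak_lr
        else:  # "inv_sqrt": lr falls off as 1/sqrt during the stable phase
            base = peak_lr / math.sqrt(1 + 10 * min(t, decay_start))
        if t < decay_start:
            return base
        return base * (1 - (t - decay_start) / decay_frac)
    raise ValueError(f"unknown schedule: {schedule}")
```

Under this parameterization, "inverse square root + 20% decay" is `schedule="wsd", stable="inv_sqrt", decay_frac=0.2`, and "constant + 40% decay" is `stable="constant", decay_frac=0.4`.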
Result
[Interactive W&B charts for run set 4636 omitted]