
Fantastic Optimizers and Where to Find Them

We present a rigorous study of ten optimizers for language-model pretraining speed and reveal that realistic speedups are significantly lower than claimed. We analyze why this is the case and report new observations about optimization.
Created on May 19 | Last edited on May 19



Abstract

AdamW has long been the dominant optimizer in open-weight language model pretraining, despite numerous claims of 1.4 to 2x speedups from alternatives. We posit that two methodological shortcomings, unequal hyperparameter tuning and limited or misleading evaluation setups, have obscured fair comparisons and hindered practical adoption. To rectify this, we study ten deep learning optimizers, tuning hyperparameters across four model scales (0.1B-1.2B parameters) and data-to-model ratios (1-8x the Chinchilla optimum) in a principled manner. We show that rigorous hyperparameter tuning for each optimizer is necessary because (i) optimal hyperparameters for one optimizer may be far from optimal for another, so blindly transferring hyperparameters leads to unfair comparisons, and (ii) the speedup over a heavily tuned baseline is lower than the claimed speedup. Moreover, comparisons in a short-horizon setting alone can be misleading: the ranking of two optimizers can flip when training for longer, and their loss curves can even cross multiple times under learning rate decay. Our results show that all of the fastest optimizers use matrices as preconditioners, multiplying the gradients by matrices rather than by entry-wise scalars. Matrix-based methods such as Muon, Soap, and Kron can deliver a 30-40% stepwise speedup over well-tuned scalar-based baselines such as AdamW. Muon achieves the largest speedup in the 1x Chinchilla regime, but Soap surpasses it at data-to-model ratios of 8x or higher.
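To make the scalar- vs. matrix-preconditioning distinction concrete, here is a minimal PyTorch sketch contrasting an AdamW-style entry-wise update with a Muon-style orthogonalized update. It is an illustration only (bias correction, shape handling, and other details of the real implementations are omitted), not the code used in our experiments.

```python
import torch

def adamw_style_update(grad, m, v, lr=3e-4, beta1=0.9, beta2=0.98, eps=1e-10):
    """Scalar preconditioning: each gradient entry is rescaled by its own
    running second-moment estimate (entry-wise division).
    Bias correction is omitted for brevity."""
    m.mul_(beta1).add_(grad, alpha=1 - beta1)
    v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
    return -lr * m / (v.sqrt() + eps)

def muon_style_update(grad, momentum, lr=0.02, beta=0.95, ns_steps=5):
    """Matrix preconditioning (Muon-like sketch): the 2-D momentum matrix is
    multiplied by matrices that approximately orthogonalize it via a
    Newton-Schulz iteration, instead of being rescaled entry by entry."""
    momentum.mul_(beta).add_(grad)
    x = momentum / (momentum.norm() + 1e-7)
    a, b, c = 3.4445, -4.7750, 2.0315  # commonly used Newton-Schulz coefficients
    for _ in range(ns_steps):
        s = x @ x.mT
        x = a * x + (b * s + c * s @ s) @ x
    return -lr * x

# Toy usage on a random weight-matrix-shaped gradient:
g = torch.randn(256, 512)
m, v, mu = torch.zeros_like(g), torch.zeros_like(g), torch.zeros_like(g)
adamw_step = adamw_style_update(g, m, v)
muon_step = muon_style_update(g, mu)
```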

Methodology

General Setup


  1. Models: 130M, 300M, 520M, and 1.2B-parameter Transformers (sequence length = 4096). Detailed hyperparameters:

| Model | Params | Seq Len | Hidden Dim | Inter Dim | # Layers | # Heads |
|---|---|---|---|---|---|---|
| LLaMA-130M | 130M | 4096 | 512 | 2048 | 32 | 8 |
| LLaMA-300M | 300M | 4096 | 768 | 3072 | 32 | 12 |
| LLaMA-520M | 520M | 4096 | 1024 | 4096 | 32 | 16 |
| LLaMA-1.2B | 1.2B | 4096 | 1536 | 6144 | 32 | 24 |

2. Data: Mixture of DCLM-baseline (3.8 T tokens), StarCoder V2 (0.25 T), ProofPile 2 (0.055 T)
Tokenized with the LLaMA-3 tokenizer.
Chinchilla scaling target: ≈ 20 tokens per non-embedding parameter (see the token-budget sketch after this list).
3. Optimizers: ten methods [implementation].
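As a quick illustration of the data budgets the Chinchilla target implies, here is a small sketch. Treating the headline model sizes as non-embedding parameter counts is an assumption for illustration only, not an exact figure from the study.

```python
def token_budget(non_embedding_params: float, chinchilla_multiple: float = 1.0) -> float:
    """Chinchilla-style data budget: ~20 tokens per non-embedding parameter,
    scaled by the data-to-model ratio (1x, 2x, 4x, 8x Chinchilla)."""
    return 20 * non_embedding_params * chinchilla_multiple

# Example, assuming the headline sizes approximate non-embedding counts:
for name, n in [("130M", 130e6), ("300M", 300e6), ("520M", 520e6), ("1.2B", 1.2e9)]:
    for mult in (1, 2, 4, 8):
        print(f"{name} @ {mult}x Chinchilla -> {token_budget(n, mult) / 1e9:.1f}B tokens")
```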


Phase I: Full Hyperparameter Sweep


  • Protocol: Coordinate-descent over grids for *all* optimizer hyperparameters (learning rate, weight decay, warmup, β₁, β₂, ε, max-grad-norm, batch size).
  • Regimes:
    • 130M / 300M / 520M at 1× Chinchilla
    • 130M at 2×, 4×, 8× Chinchilla
As an example, the coordinate-descent procedure for AdamW on the 130M model at 1× Chinchilla is shown in the following table (a minimal code sketch of the procedure follows it):
| Stage | LR | WD | min_lr_ratio | Warmup | Max Grad Norm | Batch | Val. Loss |
|---|---|---|---|---|---|---|---|
| Init | 0.008 | 0.1 | 0 | 1000 | 1 | 256 | 3.298 |
| Round 1 | 0.008 | 0.1 | 0 | 2000 | 1 | 256 | 3.282 |
| Round 2 | 0.008 | 0.1 | 0 | 2000 | 1 | 128 | 3.263 |
| Best | 0.008 | 0.1 | 0 | 2000 | 2 | 128 | 3.263 |
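Below is a minimal sketch of the coordinate-descent protocol. The grids and the `train_and_eval` surrogate are illustrative stand-ins, not our actual sweep configuration.

```python
import math

# Illustrative grids; the actual study sweeps more hyperparameters
# (betas, epsilon, min_lr_ratio, ...) with optimizer-specific grids.
GRIDS = {
    "lr": [2e-3, 4e-3, 8e-3, 1.6e-2],
    "weight_decay": [0.0, 0.1, 0.2],
    "warmup": [500, 1000, 2000],
    "max_grad_norm": [1.0, 2.0],
    "batch_size": [128, 256],
}

def train_and_eval(config):
    """Stand-in for a full pretraining run that returns validation loss.
    Here it is a toy surrogate so the sketch executes end to end."""
    target = {"lr": 8e-3, "weight_decay": 0.1, "warmup": 2000,
              "max_grad_norm": 2.0, "batch_size": 128}
    return 3.263 + 0.01 * sum(
        abs(math.log(config[k] + 1e-9) - math.log(target[k] + 1e-9)) for k in target
    )

def coordinate_descent(init_config, max_rounds=5):
    best, best_loss = dict(init_config), train_and_eval(init_config)
    for _ in range(max_rounds):
        improved = False
        # One round: sweep each hyperparameter in turn, holding the others at their current best.
        for name, grid in GRIDS.items():
            for value in grid:
                if value == best[name]:
                    continue
                loss = train_and_eval({**best, name: value})
                if loss < best_loss:
                    best, best_loss, improved = {**best, name: value}, loss, True
        if not improved:  # no single-coordinate change helps -> stop
            break
    return best, best_loss

init = {"lr": 8e-3, "weight_decay": 0.1, "warmup": 1000,
        "max_grad_norm": 1.0, "batch_size": 256}
print(coordinate_descent(init))
```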



[W&B panel: run set of 61 runs]




Phase II: Sensitive-Hyperparameter Identification

  1. Identify: from Phase I, flag the hyperparameters whose optima shift with scale (e.g., learning rate, warmup length).
2. Grid-search those parameters on 300M / 520M at 2×, 4×, 8× Chinchilla.
Example: Muon on the 520M model at 8× Chinchilla.

[W&B panel: run set of 4 runs]

3. Speedup Estimation: combining the results of the two phases, we estimate each optimizer's speedup over AdamW as the equivalent amount of data AdamW would need to reach the same loss (a short sketch of this calculation follows).
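A hedged sketch of how such an equivalent-data speedup can be computed: interpolate AdamW's loss-vs-tokens curve (in log-token space) to find the token count at which AdamW matches the other optimizer's loss, then take the ratio. The numbers below are made up for illustration.

```python
import numpy as np

def data_speedup(adamw_tokens, adamw_losses, other_tokens, other_loss):
    """Equivalent-data speedup of an optimizer over AdamW:
    tokens AdamW needs for the same loss / tokens the optimizer used."""
    log_t = np.log(np.asarray(adamw_tokens, dtype=float))
    # np.interp needs increasing x, so interpolate log-tokens as a function of
    # loss with the losses sorted ascending (no extrapolation beyond the curve).
    order = np.argsort(adamw_losses)
    adamw_equiv_tokens = np.exp(
        np.interp(other_loss, np.asarray(adamw_losses)[order], log_t[order])
    )
    return adamw_equiv_tokens / other_tokens

# Illustrative (made-up) AdamW curve and candidate-optimizer result:
adamw_tokens = [1e9, 2e9, 4e9, 8e9]
adamw_losses = [3.40, 3.30, 3.22, 3.15]
print(data_speedup(adamw_tokens, adamw_losses, other_tokens=2e9, other_loss=3.25))
```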

Our results are shown in the following figures:





Phase III: Case Study

To move toward even larger-scale experiments, we first examine how well we can fit the tuned hyperparameters:
  1. Fit smooth scaling laws of the form

$$h(N, D) = \alpha\, N^{-A} D^{-B} + \beta$$


over (model size N, data size D, hyperparameter h) triples.
2. Predict optimal settings at two out-of-distribution configurations (a 1.2B model and a 16× Chinchilla run), as sketched below.
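A minimal sketch of fitting such a law with `scipy.optimize.curve_fit`; the (N, D, h) values below are synthetic placeholders, not the tuned optima from our sweeps.

```python
import numpy as np
from scipy.optimize import curve_fit

def law(ND, alpha, A, B, beta):
    """h(N, D) = alpha * N^(-A) * D^(-B) + beta."""
    N, D = ND
    return alpha * N ** (-A) * D ** (-B) + beta

# Placeholder (N, D) pairs: model size and token count for the swept regimes.
N = np.array([130e6, 300e6, 520e6, 130e6, 130e6, 130e6])
D = np.array([2.6e9, 6.0e9, 10.4e9, 5.2e9, 10.4e9, 20.8e9])
# Synthetic "tuned optima" generated from known parameters so the fit is well posed;
# in the real study, h holds the Phase I/II optima of one hyperparameter (e.g. LR).
h = law((N, D), alpha=50.0, A=0.35, B=0.15, beta=1e-3)

params, _ = curve_fit(law, (N, D), h, p0=(10.0, 0.3, 0.1, 1e-3), maxfev=50000)
print(dict(zip(["alpha", "A", "B", "beta"], params)))

# Extrapolate to an out-of-distribution setting, e.g. a 1.2B model at ~1x Chinchilla.
print(law((1.2e9, 24e9), *params))
```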


1.2B Models

To verify our scaling law, we run a full sweep on the 1.2B model; our predicted configuration performs within 6e-3 validation loss of the optimal configuration:
| learning_rate | weight_decay | min_lr_ratio | warmup | beta1 | beta2 | epsilon | max_grad_norm | train_batch_size | nesterov |
|---|---|---|---|---|---|---|---|---|---|
| 0.002 | 0.2 | 0 | 1000 | 0.9 | 0.98 | 1e-10 | 1 | 256 | False |
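For reference, a small sketch of how `warmup` and `min_lr_ratio` are commonly interpreted: linear warmup to the peak learning rate, then decay toward `min_lr_ratio * peak_lr`. The cosine shape here is an assumption for illustration and not necessarily the exact schedule used in our runs.

```python
import math

def lr_at_step(step, total_steps, peak_lr=2e-3, warmup=1000, min_lr_ratio=0.0):
    """Linear warmup, then an (assumed) cosine decay down to min_lr_ratio * peak_lr."""
    if step < warmup:
        return peak_lr * step / max(warmup, 1)
    progress = (step - warmup) / max(total_steps - warmup, 1)
    min_lr = min_lr_ratio * peak_lr
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Example with the 1.2B AdamW config above (peak_lr=2e-3, warmup=1000, min_lr_ratio=0):
print([round(lr_at_step(s, total_steps=20_000), 6) for s in (0, 500, 1000, 10_000, 20_000)])
```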



[W&B panel: run set of 11 runs]

We then run the 1.2B experiments at 1×, 2×, 4×, and 8× Chinchilla for AdamW, Nesterov AdamW, and Muon.

[W&B panel: run set of 79 runs]




16x Chinchilla

We experiment with 16x Chinchilla to see whether Muon and Soap continue to outperform AdamW and NAdamW. We observe that Soap overtakes Muon when models are heavily overtrained; in the 130M x 16x Chinchilla setting, Muon is even outperformed by Nesterov AdamW.

[W&B panel: run set of 4 runs]



[W&B panel: run set of 5 runs]




More Phenomena That We Found



Weight decay is helpful for final performance but leads to higher loss initially

Across optimizers, non-zero weight decay is preferred for final performance, but it leads to higher loss before the learning rate decays.
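One way to see the mechanism: with decoupled (AdamW-style) weight decay, every step shrinks the weights toward zero on top of the optimizer update, which costs some loss early in training but acts as a regularizer by the end. A minimal sketch of the decoupled update, for illustration only:

```python
import torch

def decoupled_weight_decay_step(param, update, lr, weight_decay):
    """AdamW-style decoupled weight decay: the shrinkage term is applied
    directly to the weights and is not fed through the preconditioner."""
    param.mul_(1 - lr * weight_decay)   # pulls weights toward zero every step
    param.add_(update, alpha=-lr)       # optimizer update, e.g. m / (sqrt(v) + eps)
    return param
```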

[W&B panel: run set of 3 runs]



High weight decay (0.5-0.7) is beneficial for Lion and Kron


[W&B panel: run set of 4 runs]



The speedup of an optimizer may vanish even if a single hyperparameter is slightly mis-set


In the 520M x 8x Chinchilla experiment, when Soap's learning rate is set to 2x its optimal value (training remains stable overall), its speedup over Mars disappears.

[W&B panel: run set of 3 runs]



Parameter norms typically track the learning rate decay when weight decay is used, regardless of the optimizer



[W&B panel: run set of 5 runs]



The gradient norm increases during learning rate decay, regardless of the optimizer


[W&B panel: run set of 4 runs]