
Fantastic Optimizers and Where to Find Them

We present a rigorous study of ten optimizers for language-model pretraining speed and reveal that realistic speedups are significantly lower than claimed. We analyze why this is the case and report new observations about optimization.
Created on May 19 | Last edited on May 19



Abstract

AdamW has long been the dominant optimizer in open-weight language model pretraining, despite numerous claims of 1.4 to 2x speedups from alternatives. We posit that two methodological shortcomings, unequal hyperparameter tuning and limited or misleading evaluation setups, have obscured fair comparisons and hindered practical adoption. To rectify this, we study ten deep learning optimizers, tuning hyperparameters across four model scales (0.1B-1.2B parameters) and data-to-model ratios (1-8x the Chinchilla optimum) in a principled manner. We show that rigorous hyperparameter tuning for each optimizer is necessary because (i) optimal hyperparameters for one optimizer may be far from optimal for another, so blindly transferring hyperparameters leads to unfair comparisons, and (ii) the speedup over a heavily tuned baseline is lower than the claimed speedup. Moreover, comparisons in a short-horizon setting alone can be misleading: the ranking of two optimizers can flip when training for longer, and their loss curves can even cross multiple times under learning rate decay. Our results show that all of the fastest optimizers use matrices as preconditioners, multiplying the gradients by matrices rather than by entry-wise scalars. Matrix-based methods such as Muon, Soap, and Kron can deliver a 30-40% stepwise speedup over well-tuned scalar-based baselines such as AdamW. Muon achieves the largest speedup in the 1x Chinchilla regime, but Soap surpasses it at data-to-model ratios of 8x or higher.
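To make the scalar- vs. matrix-preconditioning distinction concrete, here is a minimal PyTorch sketch contrasting an AdamW-style entry-wise update with a Muon-style orthogonalized update. It is an illustration only (bias correction, shape handling, and other details of the real implementations are omitted), not the code used in our experiments.

```python
import torch

def adamw_style_update(grad, m, v, lr=3e-4, beta1=0.9, beta2=0.98, eps=1e-10):
    """Scalar preconditioning: each gradient entry is rescaled by its own
    running second-moment estimate (entry-wise division).
    Bias correction is omitted for brevity."""
    m.mul_(beta1).add_(grad, alpha=1 - beta1)
    v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
    return -lr * m / (v.sqrt() + eps)

def muon_style_update(grad, momentum, lr=0.02, beta=0.95, ns_steps=5):
    """Matrix preconditioning (Muon-like sketch): the 2-D momentum matrix is
    multiplied by matrices that approximately orthogonalize it via a
    Newton-Schulz iteration, instead of being rescaled entry by entry."""
    momentum.mul_(beta).add_(grad)
    x = momentum / (momentum.norm() + 1e-7)
    a, b, c = 3.4445, -4.7750, 2.0315  # commonly used Newton-Schulz coefficients
    for _ in range(ns_steps):
        s = x @ x.mT
        x = a * x + (b * s + c * s @ s) @ x
    return -lr * x

# Toy usage on a random weight-matrix-shaped gradient:
g = torch.randn(256, 512)
m, v, mu = torch.zeros_like(g), torch.zeros_like(g), torch.zeros_like(g)
adamw_step = adamw_style_update(g, m, v)
muon_step = muon_style_update(g, mu)
```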

Methodology

General Setup


  1. Models: 130M, 300M, 520M, and 1.2B-parameter Transformers (sequence length = 4096). Detailed hyperparameters:

| Model | Params | Seq Len | Hidden Dim | Inter Dim | # Layers | # Heads |
|---|---|---|---|---|---|---|
| LLaMA-130M | 130M | 4096 | 512 | 2048 | 32 | 8 |
| LLaMA-300M | 300M | 4096 | 768 | 3072 | 32 | 12 |
| LLaMA-520M | 520M | 4096 | 1024 | 4096 | 32 | 16 |
| LLaMA-1.2B | 1.2B | 4096 | 1536 | 6144 | 32 | 24 |

2. Data: Mixture of DCLM-baseline (3.8 T tokens), StarCoder V2 (0.25 T), ProofPile 2 (0.055 T)
Tokenized with the LLaMA-3 tokenizer.
Chinchilla scaling target: ≈ 20 tokens per non-embedding parameter (see the token-budget sketch after this list).
3. Optimizers: ten methods [implementation].
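As a quick illustration of the data budgets the Chinchilla target implies, here is a small sketch. Treating the headline model sizes as non-embedding parameter counts is an assumption for illustration only, not an exact figure from the study.

```python
def token_budget(non_embedding_params: float, chinchilla_multiple: float = 1.0) -> float:
    """Chinchilla-style data budget: ~20 tokens per non-embedding parameter,
    scaled by the data-to-model ratio (1x, 2x, 4x, 8x Chinchilla)."""
    return 20 * non_embedding_params * chinchilla_multiple

# Example, assuming the headline sizes approximate non-embedding counts:
for name, n in [("130M", 130e6), ("300M", 300e6), ("520M", 520e6), ("1.2B", 1.2e9)]:
    for mult in (1, 2, 4, 8):
        print(f"{name} @ {mult}x Chinchilla -> {token_budget(n, mult) / 1e9:.1f}B tokens")
```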


Phase I: Full Hyperparameter Sweep


  • Protocol: Coordinate-descent over grids for *all* optimizer hyperparameters (learning rate, weight decay, warmup, β₁, β₂, ε, max-grad-norm, batch size).
  • Regimes:
    • 130M / 300M / 520M at 1× Chinchilla
    • 130M at 2×, 4×, 8× Chinchilla
As an example, the coordinate-descent procedure for AdamW on the 130M model at 1× Chinchilla is shown in the following table (a minimal code sketch of the procedure follows it):
| Stage | LR | WD | min_lr_ratio | Warmup | Max Grad Norm | Batch | Val. Loss |
|---|---|---|---|---|---|---|---|
| Init | 0.008 | 0.1 | 0 | 1000 | 1 | 256 | 3.298 |
| Round 1 | 0.008 | 0.1 | 0 | 2000 | 1 | 256 | 3.282 |
| Round 2 | 0.008 | 0.1 | 0 | 2000 | 1 | 128 | 3.263 |
| Best | 0.008 | 0.1 | 0 | 2000 | 2 | 128 | 3.263 |
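Below is a minimal sketch of the coordinate-descent protocol. The grids and the `train_and_eval` surrogate are illustrative stand-ins, not our actual sweep configuration.

```python
import math

# Illustrative grids; the actual study sweeps more hyperparameters
# (betas, epsilon, min_lr_ratio, ...) with optimizer-specific grids.
GRIDS = {
    "lr": [2e-3, 4e-3, 8e-3, 1.6e-2],
    "weight_decay": [0.0, 0.1, 0.2],
    "warmup": [500, 1000, 2000],
    "max_grad_norm": [1.0, 2.0],
    "batch_size": [128, 256],
}

def train_and_eval(config):
    """Stand-in for a full pretraining run that returns validation loss.
    Here it is a toy surrogate so the sketch executes end to end."""
    target = {"lr": 8e-3, "weight_decay": 0.1, "warmup": 2000,
              "max_grad_norm": 2.0, "batch_size": 128}
    return 3.263 + 0.01 * sum(
        abs(math.log(config[k] + 1e-9) - math.log(target[k] + 1e-9)) for k in target
    )

def coordinate_descent(init_config, max_rounds=5):
    best, best_loss = dict(init_config), train_and_eval(init_config)
    for _ in range(max_rounds):
        improved = False
        # One round: sweep each hyperparameter in turn, holding the others at their current best.
        for name, grid in GRIDS.items():
            for value in grid:
                if value == best[name]:
                    continue
                loss = train_and_eval({**best, name: value})
                if loss < best_loss:
                    best, best_loss, improved = {**best, name: value}, loss, True
        if not improved:  # no single-coordinate change helps -> stop
            break
    return best, best_loss

init = {"lr": 8e-3, "weight_decay": 0.1, "warmup": 1000,
        "max_grad_norm": 1.0, "batch_size": 256}
print(coordinate_descent(init))
```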



[W&B panel: run set of 61 runs]




Phase II: Sensitive-Hyperparameter Identification

  1. Identify: from Phase I, flag the hyperparameters whose optima shift with scale (e.g., learning rate, warmup length).
2. Grid-search those parameters on 300M / 520M at 2×, 4×, 8× Chinchilla.
Example: Muon on the 520M model at 8× Chinchilla.

[W&B panel: run set of 4 runs]

3. Speedup Estimation: combining the results of the two phases, we estimate each optimizer's speedup over AdamW as the equivalent amount of data AdamW would need to reach the same loss (a short sketch of this calculation follows).
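A hedged sketch of how such an equivalent-data speedup can be computed: interpolate AdamW's loss-vs-tokens curve (in log-token space) to find the token count at which AdamW matches the other optimizer's loss, then take the ratio. The numbers below are made up for illustration.

```python
import numpy as np

def data_speedup(adamw_tokens, adamw_losses, other_tokens, other_loss):
    """Equivalent-data speedup of an optimizer over AdamW:
    tokens AdamW needs for the same loss / tokens the optimizer used."""
    log_t = np.log(np.asarray(adamw_tokens, dtype=float))
    # np.interp needs increasing x, so interpolate log-tokens as a function of
    # loss with the losses sorted ascending (no extrapolation beyond the curve).
    order = np.argsort(adamw_losses)
    adamw_equiv_tokens = np.exp(
        np.interp(other_loss, np.asarray(adamw_losses)[order], log_t[order])
    )
    return adamw_equiv_tokens / other_tokens

# Illustrative (made-up) AdamW curve and candidate-optimizer result:
adamw_tokens = [1e9, 2e9, 4e9, 8e9]
adamw_losses = [3.40, 3.30, 3.22, 3.15]
print(data_speedup(adamw_tokens, adamw_losses, other_tokens=2e9, other_loss=3.25))
```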

Our results are shown in the following figures:





Phase III: Case Study

To move toward even larger-scale experiments, we first examine how well we can fit the tuned hyperparameters:
  1. Fit smooth scaling laws of the form

$$h(N, D) = \alpha\, N^{-A} D^{-B} + \beta$$


over (model size N, data size D, hyperparameter h) triples.
2. Predict optimal settings at two out-of-distribution configurations (a 1.2B model and a 16× Chinchilla run), as sketched below.
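A minimal sketch of fitting such a law with `scipy.optimize.curve_fit`; the (N, D, h) values below are synthetic placeholders, not the tuned optima from our sweeps.

```python
import numpy as np
from scipy.optimize import curve_fit

def law(ND, alpha, A, B, beta):
    """h(N, D) = alpha * N^(-A) * D^(-B) + beta."""
    N, D = ND
    return alpha * N ** (-A) * D ** (-B) + beta

# Placeholder (N, D) pairs: model size and token count for the swept regimes.
N = np.array([130e6, 300e6, 520e6, 130e6, 130e6, 130e6])
D = np.array([2.6e9, 6.0e9, 10.4e9, 5.2e9, 10.4e9, 20.8e9])
# Synthetic "tuned optima" generated from known parameters so the fit is well posed;
# in the real study, h holds the Phase I/II optima of one hyperparameter (e.g. LR).
h = law((N, D), alpha=50.0, A=0.35, B=0.15, beta=1e-3)

params, _ = curve_fit(law, (N, D), h, p0=(10.0, 0.3, 0.1, 1e-3), maxfev=50000)
print(dict(zip(["alpha", "A", "B", "beta"], params)))

# Extrapolate to an out-of-distribution setting, e.g. a 1.2B model at ~1x Chinchilla.
print(law((1.2e9, 24e9), *params))
```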


1.2B Models

To verify our scaling law, we run a full sweep on the 1.2B model; our predicted configuration performs within 6e-3 validation loss of the optimal configuration:
| learning_rate | weight_decay | min_lr_ratio | warmup | beta1 | beta2 | epsilon | max_grad_norm | train_batch_size | nesterov |
|---|---|---|---|---|---|---|---|---|---|
| 0.002 | 0.2 | 0 | 1000 | 0.9 | 0.98 | 1e-10 | 1 | 256 | False |
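For reference, a small sketch of how `warmup` and `min_lr_ratio` are commonly interpreted: linear warmup to the peak learning rate, then decay toward `min_lr_ratio * peak_lr`. The cosine shape here is an assumption for illustration and not necessarily the exact schedule used in our runs.

```python
import math

def lr_at_step(step, total_steps, peak_lr=2e-3, warmup=1000, min_lr_ratio=0.0):
    """Linear warmup, then an (assumed) cosine decay down to min_lr_ratio * peak_lr."""
    if step < warmup:
        return peak_lr * step / max(warmup, 1)
    progress = (step - warmup) / max(total_steps - warmup, 1)
    min_lr = min_lr_ratio * peak_lr
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Example with the 1.2B AdamW config above (peak_lr=2e-3, warmup=1000, min_lr_ratio=0):
print([round(lr_at_step(s, total_steps=20_000), 6) for s in (0, 500, 1000, 10_000, 20_000)])
```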



[W&B panel: run set of 11 runs]

We then run the 1.2B experiments at 1×, 2×, 4×, and 8× Chinchilla for AdamW, Nesterov AdamW, and Muon.

[W&B panel: run set of 79 runs]




16x Chinchilla

We experiment with 16x Chinchilla to see whether Muon and Soap continue to outperform AdamW and NAdamW. We observe that Soap overtakes Muon when models are heavily overtrained; in the 130M x 16x Chinchilla setting, Muon is even outperformed by Nesterov AdamW.

[W&B panel: run set of 4 runs]



[W&B panel: run set of 5 runs]




More Phenomena That We Found



Weight decay is helpful for final performance but leads to higher loss initially

Across optimizers, non-zero weight decay is preferred for final performance, but it leads to higher loss before the learning rate decays.
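One way to see the mechanism: with decoupled (AdamW-style) weight decay, every step shrinks the weights toward zero on top of the optimizer update, which costs some loss early in training but acts as a regularizer by the end. A minimal sketch of the decoupled update, for illustration only:

```python
import torch

def decoupled_weight_decay_step(param, update, lr, weight_decay):
    """AdamW-style decoupled weight decay: the shrinkage term is applied
    directly to the weights and is not fed through the preconditioner."""
    param.mul_(1 - lr * weight_decay)   # pulls weights toward zero every step
    param.add_(update, alpha=-lr)       # optimizer update, e.g. m / (sqrt(v) + eps)
    return param
```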

[W&B panel: run set of 3 runs]



High weight decay (0.5-0.7) is beneficial for Lion and Kron


[W&B panel: run set of 4 runs]



The speedup of an optimizer may vanish even if a single hyperparameter is slightly mis-set


In the 520M x 8x Chinchilla experiment, when Soap's learning rate is set to 2x its optimal value (training remains stable overall), its speedup over Mars disappears.

[W&B panel: run set of 3 runs]



Parameter norms typically track the learning rate decay when weight decay is used, regardless of the optimizer



[W&B panel: run set of 5 runs]



The gradient norm increases during learning rate decay, regardless of the optimizer


[W&B panel: run set of 4 runs]