Benchmarking Optimizers for Large Language Model Pretraining
Abstract:
The recent development of Large Language Models (LLMs) has been accompanied by an effervescence of novel ideas and methods
to better optimize the loss of deep learning models. Claims from those methods are myriad: from faster convergence to removing
reliance on certain hyperparameters. However, the diverse experimental protocols used to validate these claims make direct
comparisons between methods challenging. This study presents a comprehensive evaluation of recent optimization techniques across a range of model sizes, batch sizes, and training durations.
Correspondence to: Andrei Semenov, Matteo Pagliardini, Martin Jaggi
Date: September 1, 2025
Setup
Benchmarking & Ablations at Small Scale
Weight Decay ablation

Larger weight decay achieves significantly better results when training on fewer tokens. We observe that the majority of runs with the large weight decay of 0.5 consistently outperform those with a weight decay of 0.1 for all training durations except the long training on 16.8B tokens. Notably, and with large weight decay perform even better than with the same learning rate. We also consider a setting without weight decay. We observe that this is suboptimal for most other optimizers, while the typical weight decay of 0.1 remains the best for long training durations. An interesting pattern emerges for optimizers that treat one-dimensional and two-dimensional parameters differently, such as and . For these, runs with large weight decay (0.5) consistently underperform those with 0.1 and, in some cases, even those without weight decay. For , we attribute this effect to its algorithmic design, in which weight decay is not employed to optimize matrix parameters, in contrast to , where the observed patterns closely resemble those seen with . For , we only vary the weight decay applied to matrix parameters while keeping 0.1 for all scalar, one-dimensional, and final-layer parameters. In this case, the gap between large and small weight decay values narrows considerably faster.
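To make the parameter grouping concrete, below is a minimal sketch, assuming a standard PyTorch setup with AdamW-style decoupled weight decay, of how one might assign the larger weight decay only to matrix parameters while keeping 0.1 for scalar, one-dimensional, and final-layer parameters. The "lm_head" name and the exact group boundaries are illustrative assumptions, not our exact implementation.

```python
import torch

def build_param_groups(model, wd_matrix=0.5, wd_other=0.1):
    # Split parameters by dimensionality: 2D (matrix) weights receive the larger
    # weight decay, while biases, norms, embeddings' scalars, and the final layer keep 0.1.
    matrix_params, other_params = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        # "lm_head" is an assumed name for the final layer; adjust to your model.
        if p.ndim >= 2 and "lm_head" not in name:
            matrix_params.append(p)
        else:
            other_params.append(p)
    return [
        {"params": matrix_params, "weight_decay": wd_matrix},
        {"params": other_params, "weight_decay": wd_other},
    ]

# Example usage (hypothetical `model` variable):
# optimizer = torch.optim.AdamW(build_param_groups(model), lr=1e-3, betas=(0.9, 0.95))
```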

In (a) we observe that runs of , , and with the large weight decay of 0.5 consistently outperform the baseline with weight decay of 0.1 for all training durations except for the last one. Notably, and with large weight decay perform even better than with the same learning rate. In (b), we also consider a setting without weight decay. We observe that this is suboptimal not only for , but also for the majority of other optimizers, while the typical weight decay of 0.1 remains the best for large training durations. Importantly, in (c), we ablate the impact of weight decay on the model’s ℓ2 norm.
Logs for the Weight Decay ablation.
Gradient Norm patterns for weight decay.
Learning rate sensitivity

Optimal learning rate stability across optimizers. The optimal learning rate determined during tuning on 2.1B tokens remains consistent after a learning rate sweep on 16.8B tokens for most optimizers. In (a), we observe that sign-based methods and methods similar to them diverge as the learning rate increases. Interestingly, in (b), , , and demonstrate their best performance with a large learning rate of 0.002, while maintains remarkably consistent performance across the entire learning rate sweep.

Learning rate sensitivity. In the current setting, only , , and achieve their best performance with the large learning rate of 0.002. Conversely, and all sign-based methods ( and ) diverge at this learning rate. and show remarkably consistent performance across the learning rate sweep, and diverges for sufficiently large values of .
Logs.
Gradient Norm patterns for learning rates.
Ablation on WSD, cosine, and linear learning rate schedulers
Warmup ablation
Learning rate decaying
Optimizer-related ablations: re-tuning betas for longer training, sensitivity of optimizers to β2, the fail of Sophia, Schedule-Free & clipping, Muon & Newton-Schulz, MARS family, Signum configurations, on learning rates of Prodigy
Re-tuning betas for longer training.

Re-tuning beta parameters is significant for longer training. Our results reveal that increasing for is crucial for long runs. Without these changes, with = 0.999, found via tuning on 16k-step runs, ends up outperforming with = 0.999. Re-tuning does not change the results much: and give almost identical loss curves. However, the training dynamics change dramatically with , and the best results are obtained with a larger value of 0.9999.
Logs.
& .
Sensitivity of optimizers to β2.

Impact of beta parameters on Schedule-Free AdamW. We elaborate further on the sensitivity of the schedule-free optimizer to β2. For language modeling, Defazio et al. initially suggested using the defaults (β1 = 0.9, β2 = 0.95). Then, Hägele et al. revisited the hyperparameter tuning of the schedule-free optimizer, proposing (β1 = 0.95, β2 = 0.99), which substantially improved performance. Based on our tuning, we claim that (β1 = 0.9, β2 = 0.9999) achieves the best performance at this scale; see (b). In addition, we fix β1 = 0.9 and report results from a sweep of β2 ∈ {0.999, 0.9999, 0.99999}, showing that the large and unconventional value of β2 = 0.9999 is indeed the best in schedule-free runs. We also notice that Schedule-Free requires a slightly larger optimal β2 than all other optimizers.
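As a concrete illustration, the sketch below instantiates schedule-free AdamW with the unconventional (β1 = 0.9, β2 = 0.9999) discussed above. It assumes the open-source `schedulefree` package; the learning rate, weight decay, and warmup values shown are placeholders rather than our tuned settings.

```python
import torch.nn as nn
import schedulefree  # assumes the `schedulefree` package is installed

model = nn.Linear(8, 8)  # stand-in for the LLM

# The unconventionally large beta2 = 0.9999 performed best in our schedule-free runs.
optimizer = schedulefree.AdamWScheduleFree(
    model.parameters(),
    lr=1e-3,                 # placeholder value, not necessarily our tuned learning rate
    betas=(0.9, 0.9999),
    weight_decay=0.1,
    warmup_steps=2000,       # schedule-free handles warmup internally
)

optimizer.train()   # switch to train mode before optimization steps
# ... training loop: loss.backward(); optimizer.step(); optimizer.zero_grad() ...
optimizer.eval()    # switch to eval mode before validation or checkpointing
```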
Logs.

Prodigy is sensitive to beta parameters in the small-batch setting. In this experiment, we follow our setup with a small batch size of 32 × 512 tokens, training 124M models with the best hyperparameters while sweeping β2. Although β2 = 0.999 yields the best results in this setting, even a slight change to β2 = 0.9999 causes divergence. This occurs because (β1, β2) directly affect Prodigy's internal statistics, which determine the optimizer's effective learning rate. As shown in (b), enabling bias correction effectively resolves this instability.
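A minimal sketch of the fix described above, assuming the `prodigyopt` package, whose `use_bias_correction` flag is the relevant switch; setting lr = 1.0 follows the Prodigy convention of leaving the step size to the optimizer's internal estimate.

```python
import torch.nn as nn
from prodigyopt import Prodigy  # assumes the `prodigyopt` package is installed

model = nn.Linear(8, 8)  # stand-in for the 124M model

optimizer = Prodigy(
    model.parameters(),
    lr=1.0,                    # Prodigy estimates the actual step size internally
    betas=(0.9, 0.999),        # beta2 = 0.999 is the stable choice in the small-batch setting
    weight_decay=0.1,
    use_bias_correction=True,  # the switch that resolves the instability shown in (b)
)
```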
Logs.

ADOPT still needs a tuned β2. One of the main theoretical claims of Taniguchi et al. is that ADOPT converges with any β2. The authors verify this claim on a toy problem motivated by Reddi et al. However, in LLM training, the choice of β2 still matters significantly. Our results demonstrate that, despite the theoretical guarantees, performance strongly depends on tuning β2 in practice.
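To clarify where β2 enters, here is a simplified sketch of an ADOPT-style update: the second moment is an EMA governed by β2, and the gradient is normalized by the previous second moment before momentum is applied. This is an illustrative sketch only; the clipping and other details of the full algorithm are omitted.

```python
import torch

def adopt_step(p, g, state, lr=1e-3, beta1=0.9, beta2=0.9999, eps=1e-6):
    """Simplified ADOPT-style update (illustrative; clipping and other details
    of the full algorithm are omitted). beta2 controls the EMA of the second
    moment v, which is the quantity this ablation sweeps."""
    if "v" not in state:
        # First gradient only initializes the second moment.
        state["v"] = g.pow(2)
        state["m"] = torch.zeros_like(p)
        return
    v_prev = state["v"]
    # Normalize by the *previous* second moment, then apply momentum.
    state["m"].mul_(beta1).add_(g / v_prev.sqrt().clamp_min(eps), alpha=1 - beta1)
    p.add_(state["m"], alpha=-lr)
    state["v"].mul_(beta2).addcmul_(g, g, value=1 - beta2)
```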
Logs.
The fail of Sophia

Sophia diverges in the small-batch setting even with a sufficiently small learning rate. We train 124M Llama models with a batch size of 32 × 512 tokens for T ∈ {64, 128, 256, 384, 512, 1024}k iterations. Sophia diverges with the typical learning rate of 0.001, and even at smaller values (e.g., 3e−4, 5e−4) it still fails shortly after 2.1B tokens (≡ 128k steps). Figures (a–c) show loss, next-token prediction accuracy, and gradient norms, respectively. For both reported learning rate values, divergence occurs at nearly the same iteration (within 10k steps, ∼ 164M tokens). We do not attribute this instability to implementation bugs, since Sophia converges on larger MoE models over longer horizons. Whether this instability is related to the Chinchilla-optimal horizon remains unclear; however, with a larger batch size (256 × 512), Sophia again fails once training exceeds 16.8B tokens.

Sophia diverges in the large-batch setup when training for many iterations. In the small-batch setup, we observed that Sophia exhibited convergence issues. With a batch size of 256 × 512, Sophia initially converges reliably across all training durations for the 124M models used in our benchmarking. However, when extending training beyond 16.8B tokens, divergence reappears. To visualize this clearly, we compare the best stable run (T = 128k steps, 16.8B tokens) with the unstable one (T = 256k steps, 33.6B tokens), using identical hyperparameters. The dashed line marks the iteration t = 129720 where divergence begins. This instability raises serious concerns about the practicality of Sophia for long training runs at scale.
Logs.
Schedule-Free & clipping.

Clipping is significant for Schedule-Free. We find that gradient clipping remains a critical hyperparameter for Schedule-Free AdamW. As shown in (a), disabling clipping causes severe training instabilities. To mitigate these undesired loss dynamics, we reduced the learning rate from 0.001 to 0.0005, which stabilized training (b). However, even with this adjustment, the clipped runs still outperform those without clipping.
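For reference, a minimal sketch of the global-norm clipping used in a standard PyTorch training step; the threshold of 1.0 is a placeholder and not necessarily the value used in our runs, and `model` and `batch` are stand-ins.

```python
import torch

def training_step(model, batch, optimizer, clip_norm=1.0):
    # One optimization step with global gradient-norm clipping before the update.
    loss = model(**batch).loss        # placeholder forward pass returning a loss
    loss.backward()
    # Removing the next line reproduces the instabilities shown in (a).
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=clip_norm)
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    return loss.item()
```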
Logs.
Muon & Newton-Schulz.

Muon’s dependence on the number of Newton-Schulz iterations. We perform a short ablation targeting the final loss of Muon while varying the number of Newton-Schulz iterations. Training is done for 16k steps with a batch size of 256 × 512 tokens, sweeping the number of iterations over {1, 5, 10, 20}. We find that increasing it beyond 5 does not improve performance, while unnecessarily increasing wall-clock time.
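For context, here is a sketch of the Newton-Schulz orthogonalization loop that Muon applies to its momentum matrix; the quintic coefficients follow the widely used public Muon implementation, and `steps` is the quantity swept in this ablation. This is an illustrative sketch, not our exact training code.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximately orthogonalize the 2D matrix G via a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315       # coefficients from the public Muon implementation
    X = G / (G.norm() + eps)                 # normalize so the iteration converges
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T
    for _ in range(steps):                   # steps ∈ {1, 5, 10, 20} in our ablation
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X
```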
Logs.
MARS family.

MARS family of optimizers. We study three MARS-based algorithms: (referred to as simply MARS in our work), , and . In this ablation, our goal is to complement our benchmarking runs with experiments on other similar methods and to support the findings for these optimizers with our previous experience in tuning MARS. We train with a batch size of 256 × 512 for the same training durations as in the benchmarking of 124M models. In (a), we show that indeed outperforms the other MARS-like methods, as reported in the original paper. Interestingly, in (b), we show that the choice of learning rate scheduler for MARS-based methods also depends on the optimizer: WSD runs of outperform the same optimizer with a cosine schedule. Dashed blue and dark blue lines correspond to the baseline with cosine and WSD schedulers, respectively. Furthermore, in the same way as benefits from a longer warmup, also improves with an 8k-step (≡ 1B tokens) warmup, though the improvement is not as dramatic (c).
Logs.
Signum configurations.

Comparison of different update rules for Signum. We evaluate three variants of the Signum update: Nesterov (our default); dampening, which resembles an EMA of m_t when the dampening parameter equals the momentum; and the plain variant without Nesterov momentum or dampening. Validation perplexity is reported for two training horizons in the 256 × 512 batch size setting. The Nesterov variant corresponds to the runs included in our main benchmarking results. While Nesterov-style momentum consistently achieves the best performance, the relative perplexity gap compared to the other variants decreases as the training horizon increases.
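To make the three variants explicit, here is a minimal sketch of the corresponding momentum-buffer updates, mirroring PyTorch's SGD momentum/dampening/Nesterov semantics with the sign applied to the resulting update. Function and parameter names are illustrative.

```python
import torch

def signum_update(p, g, buf, lr=1e-3, momentum=0.9, variant="nesterov"):
    """One Signum step on parameter p with gradient g and momentum buffer buf."""
    if variant == "plain":
        # Heavy-ball momentum without dampening or Nesterov look-ahead.
        buf.mul_(momentum).add_(g)
        update = buf
    elif variant == "dampening":
        # With dampening equal to momentum, buf becomes an EMA of the gradients.
        buf.mul_(momentum).add_(g, alpha=1 - momentum)
        update = buf
    else:  # "nesterov", our default in the benchmarking runs
        buf.mul_(momentum).add_(g)
        update = g.add(buf, alpha=momentum)
    p.add_(torch.sign(update), alpha=-lr)
```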
Logs.
Sensitivity to ε.

Sensitivity to ε. Interestingly, the value ε = 1e-6 suggested by the authors is the best hyperparameter for this method. There is no noticeable difference in convergence for ε ∈ {1e−6, 1e−7, 1e−8, 1e−9, 1e−10}, but values of 1e−5 and above give much worse results.
Logs.
On learning rates of Prodigy.

The EMA sequences of Prodigy result in an effective learning rate that emulates the learning rate dynamics we are used to observing for . Fixing the peak learning rate at 1 (following Mishchenko et al.), the EMA sequences produce the effective learning rate shown in (a). The dashed line indicates the warmup duration. Across all schedulers, as well as the run without a learning rate scheduler, the warmup of (a) is consistently longer than that of (b), providing an implicit warmup. With cosine and WSD schedulers, the peak exceeds that of the run without a scheduler. Notably, the peak effective learning rates, especially for the cosine scheduler, are very close to the default value of 0.001 used for at this model scale. This demonstrates that Prodigy may guide practitioners in tuning learning rates for -like optimizers.
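A hedged sketch of how one could log this effective learning rate during a run, assuming the `prodigyopt` implementation, which keeps its distance estimate `d` in each parameter group; the product d × lr tracks the effective step size (up to Adam-style bias-correction factors).

```python
def prodigy_effective_lr(optimizer):
    """Approximate effective learning rate of a Prodigy optimizer.

    Assumes the `prodigyopt` implementation, which stores the distance
    estimate `d` in each param group; d * lr tracks the effective step size.
    """
    group = optimizer.param_groups[0]
    return group["d"] * group["lr"]

# Example usage inside a training loop (hypothetical logging call):
# wandb.log({"effective_lr": prodigy_effective_lr(optimizer)}, step=step)
```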
Logs.
On weight initialization and warmup

Weight initialization with smaller std prefers longer warmup. We compare the final loss of models trained with two weight initializations: the conventional std = 0.02 and a smaller std = 0.006, as in DeepSeek. We vary the training horizon, warmup duration, and batch size (without changing the number of iterations). Our results indicate that the smaller initialization benefits from a longer warmup, leading to better performance compared to std = 0.02. However, with a very short warmup, the conventional initialization outperforms the smaller one. Interestingly, increasing the batch size reduces the performance gap between the two initializations for longer training runs.
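For completeness, a minimal sketch of the two initializations compared above (conventional std = 0.02 vs. the smaller std = 0.006); the module filter is illustrative rather than our exact initialization code.

```python
import torch.nn as nn

def init_weights(module, std=0.02):
    """Normal init with configurable std: 0.02 (conventional) or 0.006 (DeepSeek-style)."""
    if isinstance(module, (nn.Linear, nn.Embedding)):
        nn.init.normal_(module.weight, mean=0.0, std=std)
    if isinstance(module, nn.Linear) and module.bias is not None:
        nn.init.zeros_(module.bias)

# Example usage: smaller init, which pairs best with a longer warmup.
# model.apply(lambda m: init_weights(m, std=0.006))
```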
Logs.
Benchmarking results for 124M models
Small batch size: 32 × 512.

Logs.
Larger batch size: 256 × 512.

Logs.

Scaling batch size vs. scaling the number of iterations

Scaling batch size vs. scaling the number of iterations. Our results demonstrate that (left) scaling the batch size significantly improves , , and , making them as good as even for long training on 16.8B tokens, which was not the case in Figure 5 (b), where we still observed a significant performance gap; and (right) with scaling of the number of iterations, the gap between and narrows and, finally, increases. On the other hand, as the parameter increases, the performance gap with reappears.
Logs.
Benchmarking results for 210M models: a smooth hyperparameter transfer

Ranking of optimizers for 210M models with a batch size of 256 × 512 tokens. Increasing the model size from 124M to 210M results in an almost identical ranking of optimizers compared to the results for 124M models with the 256 × 512 batch size. At this scale, we observe a smooth transition in our benchmarking.
Logs.
Scaling Up: Benchmarking & Ablations at 583M and 720M scale
On z-loss regularization

Ablation of z-loss regularization. Incorporating the z-loss regularizer does not improve the final loss or reduce the spikiness of the loss curves. Moreover, combining z-loss with a small weight decay and decaying the learning rate down to 10% of its peak further degrades overall performance. Notably, these changes can reverse the relative ranking of optimizers compared to the results reported by Vyas et al.
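For reference, a minimal sketch of the z-loss term used in this ablation: it penalizes the squared log of the softmax normalizer. The coefficient 1e-4 below is the value commonly used in the literature (e.g., PaLM), not necessarily the one used in our runs.

```python
import torch
import torch.nn.functional as F

def loss_with_z_reg(logits, targets, z_coeff=1e-4):
    """Cross-entropy plus z-loss, which penalizes log(Z)^2 where Z is the softmax normalizer."""
    ce = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
    log_z = torch.logsumexp(logits, dim=-1)   # log partition function per token
    return ce + z_coeff * log_z.pow(2).mean()
```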
Logs.
Scaling Up to 720M models & 1M tokens batch size

Comparing optimizers for training a 720M parameter LLM. We conduct runs with a batch size of 1M tokens. While previous ablations reveal that sign-based methods can outperform at sufficiently large batch sizes, this advantage does not persist when scaling the model size. On the other hand, , which also benefits from the increased batch size, together with , dominates the other optimizers by a large margin.

Ranking of optimizers for 720M Llama-based models. We plot the final validation loss obtained by the best-tuned optimizers on the FineWeb dataset. We use a batch size of 1M tokens and train multiple methods beyond and below the Chinchilla-optimal duration, which is 14.4B tokens for a model of this size. and are the best optimizers in this setup, with a noticeable gap in performance compared to other methods. We also plot the baseline in both figures to distinguish the group of methods that consistently perform worse than from the group of optimizers that outperform it for some training durations.
Logs.
Training MoE models

Ranking optimizers for 520M MoE models with a 256 × 512 batch size. We report results for models trained for both 42k iterations (left) and 336k iterations (right). The MoE configuration corresponds to that of the 124M dense model. Optimizer rankings closely mirror the results for 124M models, indicating that our benchmarking results transfer smoothly from dense models to MoEs. We also see that outperforms in the 336k-step run; however, with re-tuned beta parameters we might expect the opposite result for longer training.

Comparing optimizers for training a 520M parameter MoE. The results closely resemble those for dense 124M models. The baseline by far outperforms , , , and sign-based methods for the 44B-token training horizon. Remarkably, the same pattern we observed for dense 124M models also holds for the MoE model.
Logs.
Takeaway of our ablations, pretraining best-practices with recent optimizers
Implications of this research: Apertus 70B pretrain with AdEMAMix
Our benchmarking experiments on relatively small models (compared to production-ready scale) have motivated the choice of optimizers for the SwissAI Apertus 70B and 8B models: fully open and compliant LLMs trained on GPUs of the Alps supercomputer, on open data covering more than 1800 languages.




Extra Plots