Benchmarking Optimizers for Large Language Model Pretraining
Abstract:
The recent development of Large Language Models (LLMs) has been accompanied by an effervescence of novel ideas and methods
to better optimize the loss of deep learning models. Claims from those methods are myriad: from faster convergence to removing
reliance on certain hyperparameters. However, the diverse experimental protocols used to validate these claims make direct
comparisons between methods challenging. This study presents a comprehensive evaluation of recent optimization techniques across a range of model sizes, batch sizes, and training durations.
Correspondence to: Andrei Semenov, Matteo Pagliardini, Martin Jaggi
Date: September 1, 2025
Setup
Benchmarking & Ablations at Small Scale
Weight Decay ablation

Larger weight decay achieves significantly better results when training on fewer tokens. We observe that the majority of runs with the large weight decay of 0.5 consistently outperform those with a weight decay of 0.1 for all training durations except the long training on 16.8B tokens. Notably, and with large weight decay perform even better than with the same learning rate. We also consider a setting without weight decay. We observe that this is suboptimal for most other optimizers, while the typical weight decay of 0.1 remains the best for long training durations. An interesting pattern emerges for optimizers that treat one-dimensional and two-dimensional parameters differently, such as and . For these, runs with large weight decay (0.5) consistently underperform those with 0.1 and, in some cases, even those without weight decay. For , we attribute this effect to its algorithmic design, in which weight decay is not employed to optimize matrix parameters, in contrast to , where the observed patterns closely resemble those seen with . For , we only vary the weight decay applied to matrix parameters while keeping 0.1 for all scalar, one-dimensional, and final-layer parameters. In this case, the gap between large and small weight decay values narrows considerably faster.
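To make the parameter grouping concrete, below is a minimal sketch, assuming a standard PyTorch setup with AdamW-style decoupled weight decay, of how one might assign the larger weight decay only to matrix parameters while keeping 0.1 for scalar, one-dimensional, and final-layer parameters. The "lm_head" name and the exact group boundaries are illustrative assumptions, not our exact implementation.

```python
import torch

def build_param_groups(model, wd_matrix=0.5, wd_other=0.1):
    # Split parameters by dimensionality: 2D (matrix) weights receive the larger
    # weight decay, while biases, norms, embeddings' scalars, and the final layer keep 0.1.
    matrix_params, other_params = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        # "lm_head" is an assumed name for the final layer; adjust to your model.
        if p.ndim >= 2 and "lm_head" not in name:
            matrix_params.append(p)
        else:
            other_params.append(p)
    return [
        {"params": matrix_params, "weight_decay": wd_matrix},
        {"params": other_params, "weight_decay": wd_other},
    ]

# Example usage (hypothetical `model` variable):
# optimizer = torch.optim.AdamW(build_param_groups(model), lr=1e-3, betas=(0.9, 0.95))
```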

In (a) we observe that runs of , , and with the large weight decay of 0.5 consistently outperform the baseline with weight decay of 0.1 for all training durations except for the last one. Notably, and with large weight decay perform even better than with the same learning rate. In (b), we also consider a setting without weight decay. We observe that this is suboptimal not only for , but also for the majority of other optimizers, while the typical weight decay of 0.1 remains the best for large training durations. Importantly, in (c), we ablate the impact of weight decay on the model’s ℓ2 norm.
Logs for the Weight Decay ablation.
Gradient Norm patterns for weight decay.
Learning rate sensitivity

Optimal learning rate stability across optimizers. The optimal learning rate determined during tuning on 2.1B tokens remains consistent after a learning rate sweep on 16.8B tokens for most optimizers. In (a), we observe that sign-based methods and methods similar to them diverge as the learning rate increases. Interestingly, in (b), , , and demonstrate their best performance with a large learning rate of 0.002, while maintains remarkably consistent performance across the entire learning rate sweep.

Learning rate sensitivity. In the current setting, only , , and achieve their best performance with the large learning rate of 0.002. Conversely, and all sign-based methods ( and ) diverge at this learning rate. and show remarkably consistent performance across the learning rate sweep, and diverges for sufficiently large values of .
Logs.
Gradient Norm patterns for learning rates.
Ablation on WSD, cosine, and linear learning rate schedulers
Warmup ablation
Learning rate decaying
Optimizer-related ablations: re-tuning betas for longer training, sensitivity of optimizers to β2, the fail of Sophia, Schedule-Free & clipping, Muon & Newton-Schulz, MARS family, Signum configurations, on learning rates of Prodigy
Re-tuning betas for longer training.

Re-tuning beta parameters is significant for longer training. Our results reveal that increasing for is crucial for long runs. Without these changes, with = 0.999, found via tuning on 16k-step runs, ends up outperforming with = 0.999. Re-tuning does not change the results much: and give almost identical loss curves. However, the training dynamics change dramatically with , and the best results are obtained with a larger value of 0.9999.
Logs.
& .
Sensitivity of optimizers to β2.

Impact of beta parameters on Schedule-Free AdamW. We elaborate further on the sensitivity of the schedule-free optimizer to β2. For language modeling, Defazio et al. initially suggested using the defaults (β1 = 0.9, β2 = 0.95). Then, Hägele et al. revisited the hyperparameter tuning of the schedule-free optimizer, proposing (β1 = 0.95, β2 = 0.99), which substantially improved performance. Based on our tuning, we claim that (β1 = 0.9, β2 = 0.9999) achieves the best performance at this scale; see (b). In addition, we fix β1 = 0.9 and report results from a sweep of β2 ∈ {0.999, 0.9999, 0.99999}, showing that the large and unconventional value of β2 = 0.9999 is indeed the best in schedule-free runs. We also notice that Schedule-Free requires a slightly larger optimal β2 than all other optimizers.
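As a concrete illustration, the sketch below instantiates schedule-free AdamW with the unconventional (β1 = 0.9, β2 = 0.9999) discussed above. It assumes the open-source `schedulefree` package; the learning rate, weight decay, and warmup values shown are placeholders rather than our tuned settings.

```python
import torch.nn as nn
import schedulefree  # assumes the `schedulefree` package is installed

model = nn.Linear(8, 8)  # stand-in for the LLM

# The unconventionally large beta2 = 0.9999 performed best in our schedule-free runs.
optimizer = schedulefree.AdamWScheduleFree(
    model.parameters(),
    lr=1e-3,                 # placeholder value, not necessarily our tuned learning rate
    betas=(0.9, 0.9999),
    weight_decay=0.1,
    warmup_steps=2000,       # schedule-free handles warmup internally
)

optimizer.train()   # switch to train mode before optimization steps
# ... training loop: loss.backward(); optimizer.step(); optimizer.zero_grad() ...
optimizer.eval()    # switch to eval mode before validation or checkpointing
```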
Logs.

Prodigy is sensitive to beta parameters in the small-batch setting. In this experiment, we follow our setup with a small batch size of 32 × 512 tokens, training 124M models with the best hyperparameters while sweeping β2. Although β2 = 0.999 yields the best results in this setting, even a slight change to β2 = 0.9999 causes divergence. This occurs because (β1, β2) directly affect Prodigy's internal statistics, which determine the optimizer's effective learning rate. As shown in (b), enabling bias correction effectively resolves this instability.
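A minimal sketch of the fix described above, assuming the `prodigyopt` package, whose `use_bias_correction` flag is the relevant switch; setting lr = 1.0 follows the Prodigy convention of leaving the step size to the optimizer's internal estimate.

```python
import torch.nn as nn
from prodigyopt import Prodigy  # assumes the `prodigyopt` package is installed

model = nn.Linear(8, 8)  # stand-in for the 124M model

optimizer = Prodigy(
    model.parameters(),
    lr=1.0,                    # Prodigy estimates the actual step size internally
    betas=(0.9, 0.999),        # beta2 = 0.999 is the stable choice in the small-batch setting
    weight_decay=0.1,
    use_bias_correction=True,  # the switch that resolves the instability shown in (b)
)
```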
Logs.

ADOPT still needs a tuned β2. One of the main theoretical claims of Taniguchi et al. is that ADOPT converges with any β2. The authors verify this claim on a toy problem motivated by Reddi et al. However, in LLM training, the choice of β2 still matters significantly. Our results demonstrate that, despite the theoretical guarantees, performance strongly depends on tuning β2 in practice.
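To clarify where β2 enters, here is a simplified sketch of an ADOPT-style update: the second moment is an EMA governed by β2, and the gradient is normalized by the previous second moment before momentum is applied. This is an illustrative sketch only; the clipping and other details of the full algorithm are omitted.

```python
import torch

def adopt_step(p, g, state, lr=1e-3, beta1=0.9, beta2=0.9999, eps=1e-6):
    """Simplified ADOPT-style update (illustrative; clipping and other details
    of the full algorithm are omitted). beta2 controls the EMA of the second
    moment v, which is the quantity this ablation sweeps."""
    if "v" not in state:
        # First gradient only initializes the second moment.
        state["v"] = g.pow(2)
        state["m"] = torch.zeros_like(p)
        return
    v_prev = state["v"]
    # Normalize by the *previous* second moment, then apply momentum.
    state["m"].mul_(beta1).add_(g / v_prev.sqrt().clamp_min(eps), alpha=1 - beta1)
    p.add_(state["m"], alpha=-lr)
    state["v"].mul_(beta2).addcmul_(g, g, value=1 - beta2)
```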
Logs.
The fail of Sophia

Sophia diverges in the small-batch setting even with a sufficiently small learning rate. We train 124M Llama models with a batch size of 32 × 512 tokens for T ∈ {64, 128, 256, 384, 512, 1024}k iterations. Sophia diverges with the typical learning rate of 0.001, and even at smaller values (e.g., 3e−4, 5e−4) it still fails shortly after 2.1B tokens (≡ 128k steps). Figures (a–c) show loss, next-token prediction accuracy, and gradient norms, respectively. For both reported learning rate values, divergence occurs at nearly the same iteration (within 10k steps, ∼ 164M tokens). We do not attribute this instability to implementation bugs, since Sophia converges on larger MoE models over longer horizons. Whether this instability is related to the Chinchilla-optimal horizon remains unclear; however, with a larger batch size (256 × 512), Sophia again fails once training exceeds 16.8B tokens.

Sophia diverges in the large-batch setup when training for many iterations. In the small-batch setup, we observed that Sophia exhibited convergence issues. With a batch size of 256 × 512, Sophia initially converges reliably across all training durations for the 124M models used in our benchmarking. However, when extending training beyond 16.8B tokens, divergence reappears. To visualize this clearly, we compare the best stable run (T = 128k steps, 16.8B tokens) with the unstable one (T = 256k steps, 33.6B tokens), using identical hyperparameters. The dashed line marks the iteration t = 129720 where divergence begins. This instability raises serious concerns about the practicality of Sophia for long training runs at scale.
Logs.
Schedule-Free & clipping.

Clipping is significant for Schedule-Free. We find that gradient clipping remains a critical hyperparameter for Schedule-Free AdamW. As shown in (a), disabling clipping causes severe training instabilities. To mitigate these undesired loss dynamics, we reduced the learning rate from 0.001 to 0.0005, which stabilized training (b). However, even with this adjustment, the clipped runs still outperform those without clipping.
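For reference, a minimal sketch of the global-norm clipping used in a standard PyTorch training step; the threshold of 1.0 is a placeholder and not necessarily the value used in our runs, and `model` and `batch` are stand-ins.

```python
import torch

def training_step(model, batch, optimizer, clip_norm=1.0):
    # One optimization step with global gradient-norm clipping before the update.
    loss = model(**batch).loss        # placeholder forward pass returning a loss
    loss.backward()
    # Removing the next line reproduces the instabilities shown in (a).
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=clip_norm)
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    return loss.item()
```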
Logs.
Muon & Newton-Schulz.

Muon’s dependence on the number of Newton-Schulz iterations. We perform a short ablation targeting the final loss of Muon while varying the number of Newton-Schulz iterations. Training is done for 16k steps with a batch size of 256 × 512 tokens, sweeping the number of iterations over {1, 5, 10, 20}. We find that increasing it beyond 5 does not improve performance, while unnecessarily increasing wall-clock time.
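For context, here is a sketch of the Newton-Schulz orthogonalization loop that Muon applies to its momentum matrix; the quintic coefficients follow the widely used public Muon implementation, and `steps` is the quantity swept in this ablation. This is an illustrative sketch, not our exact training code.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximately orthogonalize the 2D matrix G via a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315       # coefficients from the public Muon implementation
    X = G / (G.norm() + eps)                 # normalize so the iteration converges
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T
    for _ in range(steps):                   # steps ∈ {1, 5, 10, 20} in our ablation
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X
```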
Logs.
MARS family.

MARS family of optimizers. We study three MARS-based algorithms: (referred to as simply MARS in our work), , and . In this ablation, our goal is to complement our benchmarking runs with experiments on other similar methods and to support the findings for these optimizers with our previous experience in tuning MARS. We train with a batch size of 256 × 512 for the same training durations as in the benchmarking of 124M models. In (a), we show that indeed outperforms the other MARS-like methods, as reported in the original paper. Interestingly, in (b), we show that the choice of learning rate scheduler for MARS-based methods also depends on the optimizer: WSD runs of outperform the same optimizer with a cosine schedule. Dashed blue and dark blue lines correspond to the baseline with cosine and WSD schedulers, respectively. Furthermore, in the same way as benefits from a longer warmup, also improves with an 8k-step (≡ 1B tokens) warmup, though the improvement is not as dramatic (c).
Logs.
Signum configurations.

Comparison of different update rules for Signum. We evaluate three variants of the Signum update: Nesterov (our default); dampening, which resembles an EMA of m_t when the dampening parameter equals the momentum; and the plain variant without Nesterov momentum or dampening. Validation perplexity is reported for two training horizons in the 256 × 512 batch size setting. The Nesterov variant corresponds to the runs included in our main benchmarking results. While Nesterov-style momentum consistently achieves the best performance, the relative perplexity gap compared to the other variants decreases as the training horizon increases.
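To make the three variants explicit, here is a minimal sketch of the corresponding momentum-buffer updates, mirroring PyTorch's SGD momentum/dampening/Nesterov semantics with the sign applied to the resulting update. Function and parameter names are illustrative.

```python
import torch

def signum_update(p, g, buf, lr=1e-3, momentum=0.9, variant="nesterov"):
    """One Signum step on parameter p with gradient g and momentum buffer buf."""
    if variant == "plain":
        # Heavy-ball momentum without dampening or Nesterov look-ahead.
        buf.mul_(momentum).add_(g)
        update = buf
    elif variant == "dampening":
        # With dampening equal to momentum, buf becomes an EMA of the gradients.
        buf.mul_(momentum).add_(g, alpha=1 - momentum)
        update = buf
    else:  # "nesterov", our default in the benchmarking runs
        buf.mul_(momentum).add_(g)
        update = g.add(buf, alpha=momentum)
    p.add_(torch.sign(update), alpha=-lr)
```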
Logs.
Sensitivity to ε.

Sensitivity to ε. Interestingly, the value ε = 1e-6 suggested by the authors is the best hyperparameter for this method. There is no noticeable difference in convergence for ε ∈ {1e−6, 1e−7, 1e−8, 1e−9, 1e−10}, but values of 1e−5 and above give much worse results.
Logs.
On learning rates of Prodigy.

The EMA sequences of Prodigy result in an effective learning rate that emulates the learning rate dynamics we are used to observing for . Fixing the peak learning rate at 1 (following Mishchenko et al.), the EMA sequences produce the effective learning rate shown in (a). The dashed line indicates the warmup duration. Across all schedulers, as well as the run without a learning rate scheduler, the warmup of (a) is consistently longer than that of (b), providing an implicit warmup. With cosine and WSD schedulers, the peak exceeds that of the run without a scheduler. Notably, the peak effective learning rates, especially for the cosine scheduler, are very close to the default value of 0.001 used for at this model scale. This demonstrates that Prodigy may guide practitioners in tuning learning rates for -like optimizers.
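A hedged sketch of how one could log this effective learning rate during a run, assuming the `prodigyopt` implementation, which keeps its distance estimate `d` in each parameter group; the product d × lr tracks the effective step size (up to Adam-style bias-correction factors).

```python
def prodigy_effective_lr(optimizer):
    """Approximate effective learning rate of a Prodigy optimizer.

    Assumes the `prodigyopt` implementation, which stores the distance
    estimate `d` in each param group; d * lr tracks the effective step size.
    """
    group = optimizer.param_groups[0]
    return group["d"] * group["lr"]

# Example usage inside a training loop (hypothetical logging call):
# wandb.log({"effective_lr": prodigy_effective_lr(optimizer)}, step=step)
```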
Logs.
On weight initialization and warmup

Weight initialization with smaller std prefers longer warmup. We compare the final loss of models trained with two weight initializations: the conventional std = 0.02 and a smaller std = 0.006, as in DeepSeek. We vary the training horizon, warmup duration, and batch size (without changing the number of iterations). Our results indicate that the smaller initialization benefits from a longer warmup, leading to better performance compared to std = 0.02. However, with a very short warmup, the conventional initialization outperforms the smaller one. Interestingly, increasing the batch size reduces the performance gap between the two initializations for longer training runs.
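For completeness, a minimal sketch of the two initializations compared above (conventional std = 0.02 vs. the smaller std = 0.006); the module filter is illustrative rather than our exact initialization code.

```python
import torch.nn as nn

def init_weights(module, std=0.02):
    """Normal init with configurable std: 0.02 (conventional) or 0.006 (DeepSeek-style)."""
    if isinstance(module, (nn.Linear, nn.Embedding)):
        nn.init.normal_(module.weight, mean=0.0, std=std)
    if isinstance(module, nn.Linear) and module.bias is not None:
        nn.init.zeros_(module.bias)

# Example usage: smaller init, which pairs best with a longer warmup.
# model.apply(lambda m: init_weights(m, std=0.006))
```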
Logs.
Benchmarking results for 124M models
Small batch size: 32 × 512.

Logs.
Larger batch size: 256 × 512.

Logs.

Scaling batch size vs. scaling the number of iterations

Scaling batch size vs. scaling the number of iterations. Our results demonstrate that (left) scaling the batch size significantly improves , , and , making them as good as even for long training on 16.8B tokens, which was not the case in Figure 5 (b), where we still observed a significant performance gap; and (right) with scaling of the number of iterations, the gap between and narrows and, finally, increases. On the other hand, as the parameter increases, the performance gap with reappears.
Logs.
Benchmarking results for 210M models: a smooth hyperparameter transfer

Ranking of optimizers for 210M models with a batch size of 256 × 512 tokens. Increasing the model size from 124M to 210M results in an almost identical ranking of optimizers compared to the results for 124M models with the 256 × 512 batch size. At this scale, we observe a smooth transition in our benchmarking.
Logs.
Scaling Up: Benchmarking & Ablations at 583M and 720M scale
On z-loss regularization

Ablation of z-loss regularization. Incorporating the z-loss regularizer does not improve the final loss or reduce the spikiness of the loss curves. Moreover, combining z-loss with a small weight decay and decaying the learning rate down to 10% of its peak further degrades overall performance. Notably, these changes can reverse the relative ranking of optimizers compared to the results reported by Vyas et al.
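For reference, a minimal sketch of the z-loss term used in this ablation: it penalizes the squared log of the softmax normalizer. The coefficient 1e-4 below is the value commonly used in the literature (e.g., PaLM), not necessarily the one used in our runs.

```python
import torch
import torch.nn.functional as F

def loss_with_z_reg(logits, targets, z_coeff=1e-4):
    """Cross-entropy plus z-loss, which penalizes log(Z)^2 where Z is the softmax normalizer."""
    ce = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
    log_z = torch.logsumexp(logits, dim=-1)   # log partition function per token
    return ce + z_coeff * log_z.pow(2).mean()
```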
Logs.
Scaling Up to 720M models & 1M tokens batch size

Comparing optimizers for training a 720M parameter LLM. We conduct runs with a batch size of 1M tokens. While previous ablations reveal that sign-based methods can outperform at sufficiently large batch sizes, this advantage does not persist when scaling the model size. On the other hand, , which also benefits from the increased batch size, together with , dominates the other optimizers by a large margin.

Ranking of optimizers for 720M Llama-based models. We plot the final validation loss obtained by the best-tuned optimizers on the FineWeb dataset. We use a batch size of 1M tokens and train multiple methods beyond and below the Chinchilla-optimal duration, which is 14.4B tokens for a model of this size. and are the best optimizers in this setup, with a noticeable gap in performance compared to other methods. We also plot the baseline in both figures to distinguish the group of methods that consistently perform worse than from the group of optimizers that outperform it for some training durations.
Logs.
Training MoE models

Ranking optimizers for 520M MoE models with a 256 × 512 batch size. We report results for models trained for both 42k iterations (left) and 336k iterations (right). The MoE configuration corresponds to that of the 124M dense model. Optimizer rankings closely mirror the results for 124M models, indicating that our benchmarking results transfer smoothly from dense models to MoEs. We also see that outperforms in the 336k-step run; however, with re-tuned beta parameters we might expect the opposite result for longer training.

Comparing optimizers for training a 520M parameter MoE. The results closely resemble those for dense 124M models. The baseline by far outperforms , , , and sign-based methods for the 44B-token training horizon. Remarkably, the same pattern we observed for dense 124M models also holds for the MoE model.
Logs.
Takeaway of our ablations, pretraining best-practices with recent optimizers
Implications of this research: Apertus 70B pretrain with AdEMAMix
Our benchmarking experiments on relatively small models (compared to production-ready scale) have motivated the choice of optimizers for the SwissAI Apertus 70B and 8B models: fully open and compliant LLMs trained on GPUs of the Alps supercomputer, on open data covering more than 1800 languages.




Extra Plots