Adan: A New Optimizer That Challenges Adam
Taking Adan optimizer for a spin on image classification
Almost every week, we hear of a new optimizer that is better than everything else. This week, we have Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models.
This new optimizer is a slight modification of Adam. Specifically, it updates the momentum terms using both the current and the previous iteration's gradients. In theory, these changes make the algorithm a bit heavier on memory, as we have to keep track of two iterations' gradients at any given time. Where you can find Adan:
- PyTorch implementation of Adan by lucidrains (Phil Wang) 🔥 can be found here: https://github.com/lucidrains/Adan-pytorch
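As a quick sketch of how the optimizer is constructed (based on lucidrains' repo as I remember it; the package name, argument names, and defaults may differ slightly in the version you install):

```python
# Sketch only: assumes `pip install adan-pytorch` and the interface from
# lucidrains' repo; argument names and defaults may differ between versions.
import torch
from adan_pytorch import Adan

model = torch.nn.Linear(10, 2)  # stand-in for any PyTorch model

# Adan keeps three momentum-style buffers (hence three betas instead of Adam's two),
# which is where the extra memory for the previous iteration's gradient comes from.
optimizer = Adan(
    model.parameters(),
    lr=1e-3,
    betas=(0.02, 0.08, 0.01),  # small values close to zero, as discussed below
    weight_decay=0.0,
)
```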
Running Adan vs Adam in PyTorch
The code to run the training is available here.
I trained my new favorite image backbone, convnext_tiny, on the Imagenette dataset (only 10 easy ImageNet classes) for a fixed 20-epoch run with default optimizer parameters. The results shown are the average of 5 runs. I am also using the one-cycle scheduler from fastai.
Everything is default, but for Adan I used 10x the learning rate of Adam, as suggested in lucidrains' PyTorch implementation of Adan.
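The linked code is the source of truth for the runs above; purely as an illustrative sketch of the comparison (my own approximation, using timm's convnext_tiny and PyTorch's OneCycleLR instead of the exact fastai pipeline, with a hypothetical Imagenette loader train_dl):

```python
# Illustrative sketch, not the exact pipeline from the linked training code.
# Assumes timm and adan-pytorch are installed and `train_dl` is an Imagenette DataLoader.
import timm
import torch
import torch.nn.functional as F
from adan_pytorch import Adan

def build(use_adan: bool, epochs: int = 20, steps_per_epoch: int = 100):
    model = timm.create_model("convnext_tiny", pretrained=False, num_classes=10)
    if use_adan:
        opt = Adan(model.parameters(), lr=1e-2)  # 10x Adam's learning rate
        max_lr = 1e-2
    else:
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)
        max_lr = 1e-3
    # One-cycle schedule, stepped once per batch
    sched = torch.optim.lr_scheduler.OneCycleLR(
        opt, max_lr=max_lr, epochs=epochs, steps_per_epoch=steps_per_epoch
    )
    return model, opt, sched

def train_one_epoch(model, opt, sched, train_dl):
    model.train()
    for xb, yb in train_dl:
        loss = F.cross_entropy(model(xb), yb)
        loss.backward()
        opt.step()
        sched.step()
        opt.zero_grad()
```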
As you can see, they both converge very similarly.
- Adam is faster, and its memory footprint is lower.
- Adan sometimes converges to a lower validation loss, but its loss curve is bumpier early on. It is around 20% slower than Adam.
Convnext_tiny
[Chart: training curves for the convnext_tiny run set (11 runs)]
I added SGD with momentum 0.9 for reference, as suggested by Jean Kaddour.
Resnet26d
[Chart: training curves for the resnet26d run set (6 runs)]
Adan hyperparameters (betas)

Okay, but what about the default values of Adan's hyperparameters? Are they optimal? The optimizer has 3 main parameters, the betas, and they control the momentum terms. They are used in equations 3, 4, and 5 of the paper, respectively, and they also appear explicitly in the parameter update in equation 7.
The default values in the implementation I am using are close to zero.
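For reference, this is roughly how I read the update rules (paraphrased from the paper from memory, so treat the exact constants as approximate); here $g_k$ is the gradient at step $k$, $\eta$ the base learning rate, $\lambda$ the weight decay, and $\epsilon$ a small constant:

```latex
% Paraphrased Adan update rules; see the paper for the exact statement.
\begin{aligned}
m_k &= (1-\beta_1)\, m_{k-1} + \beta_1\, g_k && \text{(eq. 3)} \\
v_k &= (1-\beta_2)\, v_{k-1} + \beta_2\, (g_k - g_{k-1}) && \text{(eq. 4)} \\
n_k &= (1-\beta_3)\, n_{k-1} + \beta_3\, \big[g_k + (1-\beta_2)(g_k - g_{k-1})\big]^2 && \text{(eq. 5)} \\
\theta_{k+1} &= (1+\lambda\eta)^{-1}\Big[\theta_k - \tfrac{\eta}{\sqrt{n_k}+\epsilon} \circ \big(m_k + (1-\beta_2)\, v_k\big)\Big] && \text{(eq. 7)}
\end{aligned}
```

Note that in this parameterization the betas weight the new gradient terms, so values close to zero correspond to long-memory moving averages.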
So let's run a quick hyperparameter sweep and check if we can improve on the defaults. We keep lr=0.01 fixed and use the same one-cycle scheduler as before.
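As a rough sketch of how such a sweep can be wired up with W&B (the parameter names, search ranges, metric name, and project name here are my own illustrative choices, not the ones from the actual sweep):

```python
# Illustrative W&B sweep over the three Adan betas; ranges and names are assumptions.
import wandb

sweep_config = {
    "method": "bayes",
    "metric": {"name": "valid_loss", "goal": "minimize"},
    "parameters": {
        "beta1": {"distribution": "uniform", "min": 0.005, "max": 0.2},
        "beta2": {"distribution": "uniform", "min": 0.01, "max": 0.3},
        "beta3": {"distribution": "uniform", "min": 0.005, "max": 0.1},
    },
}

def train():
    # Placeholder objective: the real function would build the model, create
    # Adan(params, lr=0.01, betas=(cfg.beta1, cfg.beta2, cfg.beta3)),
    # train with the one-cycle schedule, and log the final validation loss.
    run = wandb.init()
    cfg = run.config
    run.log({"valid_loss": cfg.beta1 + cfg.beta2 + cfg.beta3})  # dummy value for the sketch
    run.finish()

sweep_id = wandb.sweep(sweep_config, project="adan-betas-sweep")
wandb.agent(sweep_id, function=train, count=20)
```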
Optimal values
As you can see, the default betas are very hard to beat!
Conclusions
In my minimal testing setup, Adan is as good as Adam. That said, it is quite a bit slower, so I don't think I am ready to replace Adam with it. There are also other optimizers, like LookAhead and Ranger, that perform very well, so a detailed study comparing all the contenders may be needed to convince me.
Comments
Have you tried the MadGrad optimizer that I worked on? I would be interested to see how it does on this test problem; it should work well.
Thanks a lot for publishing this nice report! I'd be curious to see SGD with momentum 0.9. Are you thinking of sharing the code of your pipeline? I'd be happy to run it myself then.