Adan: A New Optimizer That Challenges Adam
Taking Adan optimizer for a spin on image classification
Almost every week, we hear of a new optimizer that is better than everything else. This week, we have Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models.
This new optimizer is a slight modification of Adam. Specifically, it updates the momentum terms using both the current and the previous iteration's gradients. In theory, these changes make the algorithm a bit heavier on memory, as we have to keep track of two iterations' gradients at any given time. Where you can find Adan:
- PyTorch implementation of Adan by lucidrains (Phil Wang) 🔥 can be found here: https://github.com/lucidrains/Adan-pytorch
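As a quick sketch of how the optimizer is constructed (based on lucidrains' repo as I remember it; the package name, argument names, and defaults may differ slightly in the version you install):

```python
# Sketch only: assumes `pip install adan-pytorch` and the interface from
# lucidrains' repo; argument names and defaults may differ between versions.
import torch
from adan_pytorch import Adan

model = torch.nn.Linear(10, 2)  # stand-in for any PyTorch model

# Adan keeps three momentum-style buffers (hence three betas instead of Adam's two),
# which is where the extra memory for the previous iteration's gradient comes from.
optimizer = Adan(
    model.parameters(),
    lr=1e-3,
    betas=(0.02, 0.08, 0.01),  # small values close to zero, as discussed below
    weight_decay=0.0,
)
```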
Running Adan vs Adam in PyTorch
The code to run the training is available here.
I trained my new favorite image backbone, convnext_tiny, on the Imagenette dataset (only 10 easy ImageNet classes) for a fixed 20-epoch run with default optimizer parameters. The results shown are the average of 5 runs. I am also using the one-cycle scheduler from fastai.
Everything is default, but for Adan I used 10x the learning rate of Adam, as suggested in lucidrains' PyTorch implementation of Adan.
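The linked code is the source of truth for the runs above; purely as an illustrative sketch of the comparison (my own approximation, using timm's convnext_tiny and PyTorch's OneCycleLR instead of the exact fastai pipeline, with a hypothetical Imagenette loader train_dl):

```python
# Illustrative sketch, not the exact pipeline from the linked training code.
# Assumes timm and adan-pytorch are installed and `train_dl` is an Imagenette DataLoader.
import timm
import torch
import torch.nn.functional as F
from adan_pytorch import Adan

def build(use_adan: bool, epochs: int = 20, steps_per_epoch: int = 100):
    model = timm.create_model("convnext_tiny", pretrained=False, num_classes=10)
    if use_adan:
        opt = Adan(model.parameters(), lr=1e-2)  # 10x Adam's learning rate
        max_lr = 1e-2
    else:
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)
        max_lr = 1e-3
    # One-cycle schedule, stepped once per batch
    sched = torch.optim.lr_scheduler.OneCycleLR(
        opt, max_lr=max_lr, epochs=epochs, steps_per_epoch=steps_per_epoch
    )
    return model, opt, sched

def train_one_epoch(model, opt, sched, train_dl):
    model.train()
    for xb, yb in train_dl:
        loss = F.cross_entropy(model(xb), yb)
        loss.backward()
        opt.step()
        sched.step()
        opt.zero_grad()
```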
As you can see, they both converge very similarly.
- Adam is faster, and its memory footprint is lower.
- Adan sometimes converges to a lower validation loss, but its loss curve is bumpier early on. It is around 20% slower than Adam.
Convnext_tiny
[Chart: training curves for the convnext_tiny run set (11 runs)]
I added SGD with momentum 0.9 for reference, as suggested by Jean Kaddour.
Resnet26d
[Chart: training curves for the resnet26d run set (6 runs)]
Adan hyperparameters (betas)

Okay, but what about the default values of Adan's hyperparameters? Are they optimal? The optimizer has 3 main parameters, the betas, and they control the momentum terms. They are used in equations 3, 4, and 5 of the paper, respectively, and they also appear explicitly in the parameter update in equation 7.
The default values in the implementation I am using are close to zero.
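For reference, this is roughly how I read the update rules (paraphrased from the paper from memory, so treat the exact constants as approximate); here $g_k$ is the gradient at step $k$, $\eta$ the base learning rate, $\lambda$ the weight decay, and $\epsilon$ a small constant:

```latex
% Paraphrased Adan update rules; see the paper for the exact statement.
\begin{aligned}
m_k &= (1-\beta_1)\, m_{k-1} + \beta_1\, g_k && \text{(eq. 3)} \\
v_k &= (1-\beta_2)\, v_{k-1} + \beta_2\, (g_k - g_{k-1}) && \text{(eq. 4)} \\
n_k &= (1-\beta_3)\, n_{k-1} + \beta_3\, \big[g_k + (1-\beta_2)(g_k - g_{k-1})\big]^2 && \text{(eq. 5)} \\
\theta_{k+1} &= (1+\lambda\eta)^{-1}\Big[\theta_k - \tfrac{\eta}{\sqrt{n_k}+\epsilon} \circ \big(m_k + (1-\beta_2)\, v_k\big)\Big] && \text{(eq. 7)}
\end{aligned}
```

Note that in this parameterization the betas weight the new gradient terms, so values close to zero correspond to long-memory moving averages.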
So let's run a quick hyperparameter sweep and check if we can improve on the defaults. We keep lr=0.01 fixed and use the same one-cycle scheduler as before.
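As a rough sketch of how such a sweep can be wired up with W&B (the parameter names, search ranges, metric name, and project name here are my own illustrative choices, not the ones from the actual sweep):

```python
# Illustrative W&B sweep over the three Adan betas; ranges and names are assumptions.
import wandb

sweep_config = {
    "method": "bayes",
    "metric": {"name": "valid_loss", "goal": "minimize"},
    "parameters": {
        "beta1": {"distribution": "uniform", "min": 0.005, "max": 0.2},
        "beta2": {"distribution": "uniform", "min": 0.01, "max": 0.3},
        "beta3": {"distribution": "uniform", "min": 0.005, "max": 0.1},
    },
}

def train():
    # Placeholder objective: the real function would build the model, create
    # Adan(params, lr=0.01, betas=(cfg.beta1, cfg.beta2, cfg.beta3)),
    # train with the one-cycle schedule, and log the final validation loss.
    run = wandb.init()
    cfg = run.config
    run.log({"valid_loss": cfg.beta1 + cfg.beta2 + cfg.beta3})  # dummy value for the sketch
    run.finish()

sweep_id = wandb.sweep(sweep_config, project="adan-betas-sweep")
wandb.agent(sweep_id, function=train, count=20)
```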
Optimal values
As you can see, the default betas are very hard to beat!
Conclusions
In my minimal testing setup, Adan is as good as Adam. That said, it is quite a bit slower, so I don't think I am ready to replace Adam with it. There are also other optimizers, like LookAhead and Ranger, that perform very well, so a detailed study comparing all the contenders may be needed to convince me.
Comments
Have you tried the MadGrad optimizer that I worked on? I would be interested to see how it does on this test problem; it should work well.
Thanks a lot for publishing this nice report! I'd be curious to see SGD with momentum 0.9. Are you thinking of sharing the code of your pipeline? I'd be happy to run it myself then.