Distributed Shampoo parameters for ViT-VQGAN
Study of configuration for Distributed Shampoo on VQGAN
Created on August 28 | Last edited on September 3
TLDR
- Nesterov momentum brings more stability and faster convergence (see the graphs below; Nesterov runs are the dotted lines).
- Optimal Distributed Shampoo settings are problem-specific (Nesterov was not useful for dalle-mini). The available settings are listed in the official implementation.
Experiments
All experiments use the same model, the same batch size, and the RMSProp Normalized graft type for Distributed Shampoo.
We search over:
- beta1: 0.9, 0.95
- beta2: 0.9, 0.95
- nesterov: True/False
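The search above is a full grid over three settings, which gives 2 x 2 x 2 = 8 combinations. A minimal sketch of enumerating that grid follows. The names `search_space` and `expand_grid` are hypothetical helpers for illustration, not part of the Distributed Shampoo implementation or of the training code used for these runs.

```python
from itertools import product

# Values taken from the search list above; the dictionary layout is an assumption.
search_space = {
    "beta1": [0.9, 0.95],
    "beta2": [0.9, 0.95],
    "nesterov": [True, False],
}


def expand_grid(space):
    """Return one dict per combination of the values in `space`."""
    keys = list(space)
    return [dict(zip(keys, values)) for values in product(*(space[k] for k in keys))]


configs = expand_grid(search_space)

for index, config in enumerate(configs):
    print(index, config)
```

Each resulting dictionary could then be passed to whatever launches a training run, with the model, batch size, and graft type held fixed.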
Run set: 8 runs
Results
Resources
Acknowledgements