
Distributed Shampoo parameters for ViT-VQGAN

Study of configuration for Distributed Shampoo on VQGAN
Created on August 28 | Last edited on September 3
We evaluate different Distributed Shampoo settings while training a ViT-VQGAN.

TLDR

  • Nesterov momentum brings more stability and faster convergence (in the graphs below, the Nesterov runs are the dotted lines).
  • Optimal Distributed Shampoo settings are problem-specific (Nesterov was not useful for dalle-mini). The possible settings are available in the official implementation.

Experiments

All experiments use the same model, the same batch size, and the RMSProp Normalized graft type for Distributed Shampoo.
We search over:
  • beta1: 0.9, 0.95
  • beta2: 0.9, 0.95
  • nesterov: True/False
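
As a minimal sketch, the search grid above can be enumerated as follows. It only builds the list of run configurations and does not call the optimizer. The dictionary keys and the `graft_type` string are illustrative names chosen for this example, not necessarily the argument names used by the official implementation.

```python
from itertools import product

# Search values as listed in the report.
search_space = {
    "beta1": [0.9, 0.95],
    "beta2": [0.9, 0.95],
    "nesterov": [True, False],
}

# Setting held fixed across all runs. The key and value are hypothetical
# labels for this sketch, not the official implementation's identifiers.
fixed = {"graft_type": "rmsprop_normalized"}


def enumerate_configs(space, fixed_settings):
    """Return one config dict per combination of the searched values."""
    keys = list(space)
    return [
        {**fixed_settings, **dict(zip(keys, values))}
        for values in product(*(space[k] for k in keys))
    ]


configs = enumerate_configs(search_space, fixed)
for index, config in enumerate(configs):
    print(index, config)
```

The full grid is 2 × 2 × 2 combinations, which is consistent with the run sets of 8 shown in the report.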

Run set: 8 runs


Results





Resources

Acknowledgements
