
Evaluation of Distributed Shampoo

Comparison of optimizers: Distributed Shampoo, Adam & Adafactor
We evaluate & compare a few optimizers for DALL·E Mini training:
  • Distributed Shampoo: A Scalable Second Order Optimization Method for Deep Learning
  • Adam: A Method for Stochastic Optimization
  • Adafactor: Adaptive Learning Rates with Sublinear Memory Cost

Comparison of Optimizers

  • DalleBart model with 200M parameters
    • DalleBart is based on Bart (a sequence-to-sequence model using an encoder/decoder transformer architecture)
    • Inputs (image captions) and outputs (encoded image tokens) use different embedding spaces
  • Implemented in JAX, optimizers use Optax
  • Batch size of 152 × 8 TPUs = 1216 (no gradient accumulation)
  • Weights in float32 / Computations in bfloat16
  • Learning rate schedule (see the code sketch after this list)
    • Warmup for 4k steps
    • Linear decay down to 0 at 50k steps
  • No weight decay (50k steps represents less than an epoch)
  • Memory limitations
    • On a single instance, Distributed Shampoo uses the most memory, so the maximum batch size is dictated by this optimizer. The batch size could be increased for the other optimizers, but we kept it constant for a fair comparison.
    • On multiple instances, the memory overhead vs Adam decreases (see "Distributed Shampoo & Memory/Compute limitations")
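
A minimal sketch of this setup in Optax is shown below: the warmup/decay schedule and the Adam and Adafactor baselines. The peak learning rate here is a placeholder (the best value differs per optimizer and comes from the grid search described below); Distributed Shampoo is configured separately (see the sketch in "Distributed Shampoo & Memory/Compute limitations").

```python
import optax

# Values taken from the setup above; peak_lr is a placeholder found by the grid search.
peak_lr = 0.01
warmup_steps = 4_000
total_steps = 50_000

# Warmup for 4k steps, then linear decay down to 0 at 50k steps.
schedule = optax.join_schedules(
    schedules=[
        optax.linear_schedule(0.0, peak_lr, warmup_steps),                # warmup
        optax.linear_schedule(peak_lr, 0.0, total_steps - warmup_steps),  # linear decay
    ],
    boundaries=[warmup_steps],
)

# Baseline optimizers (no weight decay, as noted above).
adam = optax.adam(learning_rate=schedule)
adafactor = optax.adafactor(learning_rate=schedule)
```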

Below are the most interesting runs (see more runs in "How was the search performed?").



How was the search performed?

For each optimizer, we used the same learning rate schedule and searched over a grid where the ratio between two consecutive learning rates was ≈ 3, using values such as 0.001, 0.003, 0.01, 0.03, 0.1.
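
Concretely, a ≈3× ratio corresponds to roughly half a decade between consecutive values (√10 ≈ 3.16); the values below are illustrative, and the exact range explored varied per optimizer.

```python
# Illustrative learning rate grid with a ~3x ratio between consecutive values.
learning_rate_grid = [1e-3, 3e-3, 1e-2, 3e-2, 1e-1]
```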









Distributed Shampoo & Memory/Compute limitations

Distributed Shampoo requires more memory & compute than Adam and Adafactor for the following reasons:
  • It is a second order optimizer.
  • Diagonal statistics are saved for Learning Rate Grafting (Agarwal et al.) to define step sizes per layer.
  • Second-moment statistics and preconditioners take memory quadratic in the tensor dimensions (see the rough estimate after this list).
  • Distributed Shampoo (and many other second-order methods) requires higher-precision matrix multiplies to compute the inverse-pth root.
  • Some overheads come from trade-offs imposed by software and hardware limitations, for example the requirement of static shapes, the compilation time of the inverse-pth root with float32 emulation, and the inverse-pth root computation itself (refer to the paper and the presentation at ML Collective for more details).
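
As a rough illustration of the quadratic memory cost (an illustrative calculation with a hypothetical layer shape, not a figure from the report): for a weight matrix of shape (m, n), Shampoo maintains statistics and preconditioners of shapes (m, m) and (n, n), whereas Adam stores two moment buffers of shape (m, n).

```python
# Illustrative only: optimizer-state size for a single dense layer (hypothetical shape).
m, n = 1024, 4096

shampoo_floats = 2 * (m * m + n * n)  # statistics + preconditioners, shapes (m, m) and (n, n)
adam_floats = 2 * (m * n)             # first + second moment, both of shape (m, n)

print(shampoo_floats / adam_floats)   # ~4.25x for this shape; blocking (below) reduces this
```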
These overheads are addressed in the Distributed Shampoo implementation:
  • Compute overhead per step can be reduced by increasing the batch size, using gradient accumulation, decreasing the frequency of the inverse-pth root computation, or changing the block size (see the configuration sketch after this list):
    • The preconditioners do not need to be recomputed at every step (they can be refreshed as rarely as every few hundred steps with minimal impact on results).
    • Most of the compute overhead is independent of batch size, so increasing the batch size (or using gradient accumulation) proportionally reduces the relative overhead of Distributed Shampoo.
    • The block size directly affects the computational complexity of the inverse-pth roots, and lowering it has minimal impact on convergence.
  • Memory overhead is addressed through sharding, quantization, and custom block sizes:
    • Optimizer states can be sharded across devices, so the overhead becomes minimal on a large number of TPU instances.
    • Quantization is supported and greatly reduces optimizer memory requirements.
    • The block size directly affects the memory required for gradient statistics, which is especially useful for larger layers.
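
As an example of these knobs, here is a hedged sketch using the Optax-style distributed_shampoo implementation from the google-research scalable_shampoo code. The import path and parameter names (block_size, preconditioning_compute_steps, graft_type, batch_axis_name) are quoted from memory and may differ between versions; treat the specific values as placeholders rather than the settings used in our runs.

```python
# Sketch only: names and defaults are assumptions and may vary by version.
from distributed_shampoo import distributed_shampoo, GraftingType  # assumed import path

tx = distributed_shampoo(
    learning_rate=schedule,              # same schedule as the baseline optimizers
    block_size=1024,                     # smaller blocks -> cheaper inverse-pth roots, less memory
    beta1=0.9,
    beta2=0.99,
    preconditioning_compute_steps=10,    # refresh preconditioners every N steps, not every step
    graft_type=GraftingType.RMSPROP_NORMALIZED,  # learning rate grafting from diagonal statistics
    batch_axis_name="batch",             # pmap axis used to distribute preconditioner work (assumption)
)
```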

How does quantization affect the results?

Quantization can be useful when memory is limited, especially when only a single TPU/GPU instance is available.
We take our best Distributed Shampoo experiment and compare the quantized version of the optimizer with the non-quantized one, without changing any other parameter (including batch size). When quantized, we convert states as follows:
  • diagonal statistics → bfloat16
  • momentum buffers (2x) → int8
  • statistics, preconditioners → int16 + diagonals
The fully quantized version performs worse than the non-quantized version and shows some instability.
When we don't quantize the diagonal statistics, we observe only a minor degradation vs the non-quantized run; the difference is even smaller when using elapsed time as the x-axis (since the quantized version is faster).
By default, when quantization is requested, Distributed Shampoo quantizes all optimizer states except the diagonal statistics. You can experiment with different quantization strategies, as results may be problem dependent (see the sketch below).
The memory savings from optimizer quantization let us handle models twice as large on a single TPU v3 instance (going from about 200M to 450M+ parameters). On multiple instances, quantizing the optimizer is not necessary, as its states can be efficiently sharded across instances.
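
In the implementation we used, this quantization scheme is toggled with a single flag; the flag name below (best_effort_memory_usage_reduction) is quoted from memory and may differ in other versions, so treat this as a sketch rather than a reference.

```python
# Sketch: enabling optimizer-state quantization (flag name is an assumption).
tx = distributed_shampoo(
    learning_rate=schedule,
    block_size=1024,
    preconditioning_compute_steps=10,
    best_effort_memory_usage_reduction=True,  # e.g. momentum -> int8, statistics -> int16
)
```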



Why did we perform this search?

DALL·E Mini was initially trained using Adafactor.
As the model was scaled up, we observed instabilities, which led us to search for alternative optimizers.



Resources

Acknowledgements

  • Rohan Anil for setting up the Distributed Shampoo optimizer
  • Google TPU Research Cloud (TRC) program for providing computing resources
  • Weights & Biases for providing the infrastructure for experiment tracking and model management
  • 🤗 Hugging Face for the JAX implementation of Bart