
AMP vs Custom Quantization

Accelerating multi-node Large Language Model training with per-layer selective quantization (e.g. FP32 -> FP16) of the transformer architecture.
Created on August 2 | Last edited on March 8

Full-model (all-layer) quantization on GPT2-Small

In terms of training loss:
  • 🥇 FP32 is identical to AMP-FP16.
  • 🥈 BF16 has slightly worse loss, but is the fastest to train.
  • FP16 diverges instantly, causing the loss figure below to show a double-line along 0.
In these experiments every layer was quantized to the same datatype, i.e. no mixed precision (except in AMP); a sketch of both setups is shown below. I need Pengrui's help on casting.
These figures compare: Loss (top), Throughput (right) and Memory usage (bottom).
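The difference between the two winning setups above comes down to where the cast happens. Here is a minimal sketch of both, assuming a HuggingFace-style GPT-2 interface and an AdamW optimizer (both assumptions, not the exact training script used for these runs):

```python
import torch
from torch.cuda.amp import GradScaler, autocast
from transformers import GPT2LMHeadModel  # assumed model class, stands in for GPT2-Small

# --- Full-model (all-layer) quantization: every weight lives in one dtype ---
dtype = torch.bfloat16  # torch.float16 here is the configuration that diverged
model = GPT2LMHeadModel.from_pretrained("gpt2").to("cuda", dtype)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)  # lr is illustrative

def step_full_cast(input_ids):
    # Weights, activations, and gradients all stay in `dtype`; no FP32 master copy.
    loss = model(input_ids=input_ids, labels=input_ids).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    return loss.item()

# --- AMP-FP16: FP32 master weights, per-op casting, dynamic loss scaling ---
amp_model = GPT2LMHeadModel.from_pretrained("gpt2").cuda()  # parameters remain FP32
amp_optimizer = torch.optim.AdamW(amp_model.parameters(), lr=3e-4)
scaler = GradScaler()

def step_amp(input_ids):
    with autocast(dtype=torch.float16):  # matmuls run in FP16, reductions stay FP32
        loss = amp_model(input_ids=input_ids, labels=input_ids).loss
    scaler.scale(loss).backward()        # scale the loss to avoid FP16 gradient underflow
    scaler.step(amp_optimizer)
    scaler.update()
    amp_optimizer.zero_grad(set_to_none=True)
    return loss.item()
```

The FP32 master weights plus loss scaling are what keep AMP-FP16 on the FP32 loss curve, while the all-FP16 run has nothing to stop small gradients from underflowing.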
TODOs below here:



Replications

❯ colossalai check -i
CUDA Version: 11.7
PyTorch Version: 1.12.0+cu116
CUDA Version in PyTorch Build: 11.6
PyTorch CUDA Version Match: (minor version mismatch)
CUDA Extension:

TODO:
  • Embedding-layer-only FP16 mixed precision, while minimizing GPU time spent casting between datatypes (see the sketch after this list).
  • The time to quantize and de-quantize between FP16 and FP32 is critical... others have designed specific architectures for this.
  • Record accuracy against a standard benchmark test set.
  • Repeat the same analysis for various modes & datasets.
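For the first item, one possible shape of the embedding-only scheme is to keep just the nn.Embedding weights in FP16 and up-cast their outputs once per forward pass, so the downstream FP32 layers never see FP16 activations. The helper names below (quantize_embeddings_only, time_cast) are hypothetical, and the hook-based approach is only one way to do it:

```python
import torch
import torch.nn as nn

def quantize_embeddings_only(model: nn.Module, dtype: torch.dtype = torch.float16) -> nn.Module:
    """Hypothetical helper: cast only the nn.Embedding modules; everything else stays FP32."""
    for module in model.modules():
        if isinstance(module, nn.Embedding):
            module.to(dtype)
            # A single up-cast per lookup keeps the casting cost to one kernel per
            # embedding, instead of casting inside every downstream layer.
            module.register_forward_hook(lambda mod, inp, out: out.float())
    return model

def time_cast(tensor: torch.Tensor, dtype: torch.dtype, iters: int = 100) -> float:
    """Rough CUDA-event timing of one dtype conversion, in milliseconds."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        _ = tensor.to(dtype)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

# Example (assumed tensor shape): measure the FP32 -> FP16 cast cost for a
# GPT2-Small-sized activation block.
# acts = torch.randn(8, 1024, 768, device="cuda")
# print(f"{time_cast(acts, torch.float16):.3f} ms per cast")
```

Keeping the de-quantize step inside a forward hook means the model's own forward code never has to change, which should make this easy to compare against AMP in the same training script.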

More sizes and model architectures to come

TODO:
  • GPT -- more sizes -- TP & PP & DP over 100B+ params.
  • BERT
  • ViT
  • Pengrui is doing CNNs.