AMP vs Custom Quantization
Accelerating multi-node Large Language Model training with per-layer selective quantization (e.g. FP32 -> FP16) of the transformer architecture.
Full-model (all-layer) quantization on GPT2-Small
In terms of training loss:
- 🥇 FP32 is identical to AMP-FP16.
- 🥈 BF16 has slightly worse loss, but is the fastest to train.
- ❌ FP16 diverges instantly, which is why the loss figure below shows a doubled line along 0.
In these experiments every layer was quantized to the same datatype, i.e. no mixed precision (except in AMP). I need Pengrui's help on casting.
These figures compare loss (top), throughput (right), and memory usage (bottom).
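For context, here is a minimal sketch of the two regimes being compared: casting every layer to a single datatype versus AMP with FP32 master weights. It uses a tiny stand-in model and toy objective purely for illustration; the real runs train GPT2-Small under ColossalAI.

```python
import torch
import torch.nn as nn

def toy_lm():
    # Tiny stand-in for GPT2-Small (the real runs use ColossalAI).
    return nn.Sequential(nn.Embedding(1000, 64), nn.Linear(64, 1000)).cuda()

tokens = torch.randint(0, 1000, (8, 32), device="cuda")

# --- Full-model (all-layer) quantization: every parameter in one dtype ---
model = toy_lm().to(torch.bfloat16)        # or torch.float16, or left in FP32
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
logits = model(tokens)                     # forward runs entirely in BF16
loss = nn.functional.cross_entropy(logits.float().view(-1, 1000), tokens.view(-1))
loss.backward()
opt.step(); opt.zero_grad()

# --- AMP-FP16: FP32 master weights, selected ops run in FP16 under autocast ---
model = toy_lm()                           # parameters stay FP32
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()
with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = nn.functional.cross_entropy(model(tokens).view(-1, 1000), tokens.view(-1))
scaler.scale(loss).backward()              # dynamic loss scaling avoids the FP16
scaler.step(opt); scaler.update()          # underflow that makes plain FP16 diverge
opt.zero_grad()
```

The loss scaling in the AMP path is the practical difference: without it, small gradients underflow in FP16, which matches the instant divergence seen in the all-layer FP16 run.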
Everything below this point is TODO.
Replications
❯ colossalai check -i
CUDA Version: 11.7
PyTorch Version: 1.12.0+cu116
CUDA Version in PyTorch Build: 11.6
PyTorch CUDA Version Match: ✓ (minor version mismatch)
CUDA Extension: ✓
- embedding-layer-only FP16 mixed precision, while minimizing the GPU time spent casting between datatypes (a rough sketch follows this list)
- the time to quantize and de-quantize between FP16 and FP32 is critical... others have designed architectures specifically around this
- record accuracy against a standard benchmark test set
- repeat the same analysis for various modes & datasets
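A rough sketch of the embedding-layer-only idea, assuming a plain PyTorch module (`FP16Embedding` is a hypothetical name, and the vocab/hidden sizes are GPT2-Small's): store the table in FP16 and cast only the looked-up activations back to FP32, then time the lookup + cast with CUDA events, since that per-step cast cost is what we want to minimize.

```python
import torch
import torch.nn as nn

class FP16Embedding(nn.Module):
    """Embedding table stored in FP16; activations cast back to FP32 once."""
    def __init__(self, num_embeddings, embedding_dim):
        super().__init__()
        self.emb = nn.Embedding(num_embeddings, embedding_dim).half()

    def forward(self, idx):
        # The cast touches only the (small) activation, not the large weight,
        # so the per-step casting cost should stay low -- this is what we time.
        return self.emb(idx).float()

# Time the lookup + cast with CUDA events (GPU time, not wall clock).
emb = FP16Embedding(50257, 768).cuda()                 # GPT2-Small vocab / hidden
idx = torch.randint(0, 50257, (8, 1024), device="cuda")
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
out = emb(idx)
end.record()
torch.cuda.synchronize()
print(f"embedding lookup + cast: {start.elapsed_time(end):.3f} ms, out dtype {out.dtype}")
```

Gradient handling for the FP16 table (an FP32 master copy and optimizer state) would presumably still need the proper casting support, which is the part to work out with Pengrui.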
More sizes and model architectures to come
TODO:
- GPT -- more sizes -- TP, PP & DP at 100B+ params (a hedged config sketch follows this list).
- BERT
- ViT
- Pengrui is doing CNNs.
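For the 100B+ hybrid-parallel runs, a hedged sketch of what the ColossalAI config might look like, in the style of the 0.1.x config files; the exact keys and `AMP_TYPE` import should be checked against the version reported by `colossalai check -i` above, and the parallel degrees are placeholders.

```python
# config.py -- sketch only; sizes are placeholders, keys per ColossalAI 0.1.x docs.
from colossalai.amp import AMP_TYPE

# Hybrid parallelism for 100B+ params: the data-parallel degree falls out of
# world_size / (pipeline stages * tensor-parallel size) at launch time.
parallel = dict(
    pipeline=4,                      # pipeline parallelism (PP)
    tensor=dict(size=8, mode='1d'),  # tensor parallelism (TP)
)

fp16 = dict(mode=AMP_TYPE.TORCH)     # AMP backed by torch.cuda.amp
```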