
AMP vs Custom Quantization

Accelerating multi-node Large Language Model training with per-layer selective quantization (e.g. FP32 -> FP16) of the transformer architecture.
Created on August 2 | Last edited on March 8

Full-model (all-layer) quantization on GPT2-Small

In terms of training loss:
  • 🥇 FP32 is identical to AMP-FP16.
  • 🥈 BF16 has slightly worse loss, but is the fastest to train.
  • FP16 diverges instantly, causing the loss figure below to show a double-line along 0.
In these experiments every layer was quantized to the same datatype, i.e. no mixed precision (except in AMP); a sketch of both setups is shown below. I need Pengrui's help on casting.
These figures compare: Loss (top), Throughput (right) and Memory usage (bottom).
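The difference between the two winning setups above comes down to where the cast happens. Here is a minimal sketch of both, assuming a HuggingFace-style GPT-2 interface and an AdamW optimizer (both assumptions, not the exact training script used for these runs):

```python
import torch
from torch.cuda.amp import GradScaler, autocast
from transformers import GPT2LMHeadModel  # assumed model class, stands in for GPT2-Small

# --- Full-model (all-layer) quantization: every weight lives in one dtype ---
dtype = torch.bfloat16  # torch.float16 here is the configuration that diverged
model = GPT2LMHeadModel.from_pretrained("gpt2").to("cuda", dtype)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)  # lr is illustrative

def step_full_cast(input_ids):
    # Weights, activations, and gradients all stay in `dtype`; no FP32 master copy.
    loss = model(input_ids=input_ids, labels=input_ids).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    return loss.item()

# --- AMP-FP16: FP32 master weights, per-op casting, dynamic loss scaling ---
amp_model = GPT2LMHeadModel.from_pretrained("gpt2").cuda()  # parameters remain FP32
amp_optimizer = torch.optim.AdamW(amp_model.parameters(), lr=3e-4)
scaler = GradScaler()

def step_amp(input_ids):
    with autocast(dtype=torch.float16):  # matmuls run in FP16, reductions stay FP32
        loss = amp_model(input_ids=input_ids, labels=input_ids).loss
    scaler.scale(loss).backward()        # scale the loss to avoid FP16 gradient underflow
    scaler.step(amp_optimizer)
    scaler.update()
    amp_optimizer.zero_grad(set_to_none=True)
    return loss.item()
```

The FP32 master weights plus loss scaling are what keep AMP-FP16 on the FP32 loss curve, while the all-FP16 run has nothing to stop small gradients from underflowing.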
TODOs below here:



Replications

❯ colossalai check -i
CUDA Version: 11.7
PyTorch Version: 1.12.0+cu116
CUDA Version in PyTorch Build: 11.6
PyTorch CUDA Version Match: (minor version mismatch)
CUDA Extension:

TODO:
  • Embedding-layer-only FP16 mixed precision, while minimizing GPU time spent casting between datatypes (see the sketch after this list).
  • The time to quantize and de-quantize between FP16 and FP32 is critical... others have designed specific architectures for this.
  • Record accuracy against a standard benchmark test set.
  • Repeat the same analysis for various modes & datasets.
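For the first item, one possible shape of the embedding-only scheme is to keep just the nn.Embedding weights in FP16 and up-cast their outputs once per forward pass, so the downstream FP32 layers never see FP16 activations. The helper names below (quantize_embeddings_only, time_cast) are hypothetical, and the hook-based approach is only one way to do it:

```python
import torch
import torch.nn as nn

def quantize_embeddings_only(model: nn.Module, dtype: torch.dtype = torch.float16) -> nn.Module:
    """Hypothetical helper: cast only the nn.Embedding modules; everything else stays FP32."""
    for module in model.modules():
        if isinstance(module, nn.Embedding):
            module.to(dtype)
            # A single up-cast per lookup keeps the casting cost to one kernel per
            # embedding, instead of casting inside every downstream layer.
            module.register_forward_hook(lambda mod, inp, out: out.float())
    return model

def time_cast(tensor: torch.Tensor, dtype: torch.dtype, iters: int = 100) -> float:
    """Rough CUDA-event timing of one dtype conversion, in milliseconds."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        _ = tensor.to(dtype)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

# Example (assumed tensor shape): measure the FP32 -> FP16 cast cost for a
# GPT2-Small-sized activation block.
# acts = torch.randn(8, 1024, 768, device="cuda")
# print(f"{time_cast(acts, torch.float16):.3f} ms per cast")
```

Keeping the de-quantize step inside a forward hook means the model's own forward code never has to change, which should make this easy to compare against AMP in the same training script.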

More sizes and model architectures to come

TODO:
  • GPT -- more sizes -- TP & PP & DP over 100B+ params.
  • BERT
  • ViT
  • Pengrui is doing CNNs.