Performance Study
A detailed study of the performance impact of various Megatron-DeepSpeed optimizations for training GPT.
Date: 03/24/2023
Environment @ ALCF (Polaris)
- Python 3.10.9
- torch.__version__ = 1.13.0a0+git49444c3
- DeepSpeed general environment info:
- torch install path ............... ['/soft/datascience/conda/2023-01-10/mconda3/lib/python3.10/site-packages/torch']
- torch version .................... 1.13.0a0+git49444c3
- deepspeed install path ........... ['/lus/grand/projects/datascience/foremans/locations/polaris/projects/saforem2/Megatron-DeepSpeed/venvs/polaris/2023-01-10/lib/python3.10/site-packages/deepspeed']
- deepspeed info ................... 0.8.3+6379defa, 6379defa, master
- torch cuda version ............... 11.8
- torch hip version ................ None
- nvcc version ..................... 11.8
- deepspeed wheel compiled w. ...... torch 1.13, cuda 11.8
The source code is available online at:
Performance Study
We are interested in measuring the performance impact from different combinations of the following configuration options:
- world_size
- pipeline-model-parallel-size
- tensor-model-parallel-size
- zero_optimization.stage
- flash_attention
In particular, we use throughput/samples_per_sec as our performance metric.
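For orientation, here is a minimal sketch of where these knobs live: the ZeRO stage is set in the DeepSpeed config, while the parallelism degrees and flash attention are Megatron-DeepSpeed launch arguments. The flag spellings below follow Megatron-DeepSpeed conventions, but the values are placeholders rather than the exact scripts used for these runs:

```python
# Illustrative only -- not the exact configuration used for these runs.
# The ZeRO stage lives in the DeepSpeed config (JSON file or Python dict):
ds_config = {
    "train_micro_batch_size_per_gpu": 1,   # placeholder value
    "zero_optimization": {
        "stage": 1,                        # 0, 1, 2, or 3
    },
    "fp16": {"enabled": True},
}

# Pipeline / tensor parallelism and flash attention are Megatron-DeepSpeed
# command-line arguments, e.g.:
#   --pipeline-model-parallel-size 1
#   --tensor-model-parallel-size 1
#   --use-flash-attn
# world_size is determined by the launcher (total number of ranks across nodes).
```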
Model: 2.7B Params
World Size: 16
Here we can see that (in decreasing order of performance):
- PPSIZE=1 ZERO_STAGE=1
- FLASH_ATTN=1
- FLASH_ATTN=0
- PPSIZE=1 ZERO_STAGE=2
- PPSIZE=1 ZERO_STAGE=3
- PPSIZE > 1
Our primary metric of interest is throughput/samples_per_sec.
- Note on statistics:
- We measure samples_per_sec over the first 5 training steps and aggregate these results for a given experiment
- This is repeated for multiple experiments to calculate statistics (average and error bars)
- Error bars below represent the standard deviation σ of the mean μ across experiments (see the sketch after this list)
- The relevant configuration options for a given experiment are written in text directly above their corresponding bar.
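To make the aggregation concrete, here is a minimal sketch using made-up throughput numbers (the real values come from the training logs):

```python
import numpy as np

# Hypothetical samples_per_sec values for the first 5 training steps
# of each of 3 repeated experiments with the same configuration.
experiments = [
    [11.8, 12.1, 12.0, 12.2, 11.9],
    [12.0, 12.3, 11.9, 12.1, 12.0],
    [11.7, 12.0, 12.2, 11.9, 12.1],
]

# Aggregate the first 5 steps into a single value per experiment ...
per_experiment = np.array([np.mean(steps) for steps in experiments])

# ... then compute the mean and its spread across experiments.
mu = per_experiment.mean()
sigma = per_experiment.std(ddof=1)  # error bar: std of the per-experiment means

print(f"samples_per_sec = {mu:.2f} ± {sigma:.2f}")
```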
2.7B Model
Since the 2.7B-parameter model is (relatively) small, we can train it without OOM errors using only ZeRO stage 1 (optimizer-state partitioning), with no pipeline or tensor parallelism.
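A rough back-of-the-envelope estimate suggests why: using the usual mixed-precision Adam accounting from the ZeRO paper (~16 bytes of model state per parameter) and assuming 40 GB A100s, model states alone are borderline without ZeRO but fit comfortably once the optimizer states are partitioned. (Activations, buffers, and fragmentation are ignored here, so this is only a sanity check.)

```python
# Rough per-GPU memory estimate for model states only (activations ignored),
# using the standard mixed-precision Adam accounting from the ZeRO paper:
#   2 bytes fp16 params + 2 bytes fp16 grads + 12 bytes fp32 optimizer states.
params = 2.7e9
world_size = 16
GB = 1024**3

weights_and_grads = params * (2 + 2)   # replicated on every rank
optimizer_states = params * 12         # partitioned across ranks by ZeRO stage 1

no_zero = (weights_and_grads + optimizer_states) / GB
zero_stage1 = (weights_and_grads + optimizer_states / world_size) / GB

print(f"no ZeRO:      ~{no_zero:.1f} GB per GPU")     # ~40 GB: borderline on a 40 GB A100
print(f"ZeRO stage 1: ~{zero_stage1:.1f} GB per GPU") # ~12 GB: fits comfortably
```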
💡 As expected, this configuration is indeed the most performant, and it spans the first four bars in the first chart below:
- world_size: 16, pipeline_model_parallel_size: 1, use_flash_attn: true, zero_stage: 1, checkpoint_activations: false
- world_size: 16, pipeline_model_parallel_size: 1, use_flash_attn: false, zero_stage: 1, checkpoint_activations: false
- world_size: 16, pipeline_model_parallel_size: 1, use_flash_attn: true, zero_stage: 1, checkpoint_activations: true
- world_size: 16, pipeline_model_parallel_size: 1, use_flash_attn: false, zero_stage: 1, checkpoint_activations: true
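Within these four, the checkpoint_activations: false variants come out ahead, which is consistent with activation checkpointing trading extra compute (re-running the forward pass during backward) for memory. A minimal sketch of the idea using torch.utils.checkpoint (Megatron-DeepSpeed uses its own checkpointing implementation; this is illustrative only):

```python
import torch
from torch.utils.checkpoint import checkpoint

# Illustrative only: activation checkpointing recomputes the wrapped forward
# pass during backward instead of storing its intermediate activations.
block = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
)

x = torch.randn(8, 1024, requires_grad=True)

# Without checkpointing: activations are kept for backward (more memory, no recompute).
y = block(x)

# With checkpointing: only the input is saved; the block is re-run during backward
# (less memory, extra forward compute), which shows up as lower samples_per_sec.
y_ckpt = checkpoint(block, x)
y_ckpt.sum().backward()
```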
All configs
Results
[W&B panel grid: run set of 33 runs, with throughput (samples_per_sec) and system metrics]
Pipeline Parallelism / ZeRO > 1
Zooming in and ignoring the first two entries in the plots above (i.e. PPSIZE=1, ZERO_STAGE=1, FLASH_ATTN=0|1), we can compare the remaining configurations (PPSIZE > 1 and ZERO_STAGE > 1) directly:
[W&B panel grid: run set of 33 runs, zoomed view]
Extras