
Performance Study

A detailed study investigating the performance of various Megatron-DeepSpeed optimizations for training GPT. Date: 03/24/2023



Environment @ ALCF (Polaris)

NOTE: The following experiments were done on the Polaris machine at ALCF using
  • Python 3.10.9
  • torch.__version__ = 1.13.0a0+git49444c3
  • DeepSpeed general environment info:
    • torch install path ............... ['/soft/datascience/conda/2023-01-10/mconda3/lib/python3.10/site-packages/torch']
    • torch version .................... 1.13.0a0+git49444c3
    • deepspeed install path ........... ['/lus/grand/projects/datascience/foremans/locations/polaris/projects/saforem2/Megatron-DeepSpeed/venvs/polaris/2023-01-10/lib/python3.10/site-packages/deepspeed']
    • deepspeed info ................... 0.8.3+6379defa, 6379defa, master
    • torch cuda version ............... 11.8
    • torch hip version ................ None
    • nvcc version ..................... 11.8
    • deepspeed wheel compiled w. ...... torch 1.13, cuda 11.8
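The same information can be regenerated with DeepSpeed's environment report (`ds_report`, equivalently `python -m deepspeed.env_report`); below is a minimal Python sketch that prints just the key fields:

```python
# Minimal sketch: print the key environment details summarized above.
# (The full table is produced by DeepSpeed's `ds_report` utility.)
import torch
import deepspeed

print(f"torch version ....... {torch.__version__}")
print(f"torch cuda version .. {torch.version.cuda}")
print(f"deepspeed version ... {deepspeed.__version__}")
```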
The source code is available online at:



Performance Study

We are interested in measuring the performance impact from different combinations of the following configuration options:
  • world_size
  • pipeline-model-parallel-size
  • tensor-model-parallel-size
  • zero_optimization.stage
  • flash_attention
for both the 2.7B and 20B GPT model architectures.
In particular, we use throughput/samples_per_sec as our performance metric.
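As a rough sketch of how these options enter a run (flag names follow the Megatron-DeepSpeed CLI and the DeepSpeed JSON config; the batch size and other values below are illustrative, not the exact settings used in these experiments):

```python
# Sketch: how the options above map onto Megatron-DeepSpeed arguments and the
# DeepSpeed config (illustrative values; not the exact launch script used here).
ds_config = {
    "train_micro_batch_size_per_gpu": 4,  # illustrative value
    "zero_optimization": {"stage": 1},    # zero_optimization.stage
    "fp16": {"enabled": True},
}

megatron_args = [
    "--tensor-model-parallel-size", "1",    # tensor-model-parallel-size
    "--pipeline-model-parallel-size", "1",  # pipeline-model-parallel-size
    "--use-flash-attn",                     # flash_attention
]
# world_size is determined by the launcher (the number of ranks across nodes/GPUs).
```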

Model: 2.7B Params

World Size: 16

Here we can see that (in decreasing order of performance):
  1. PPSIZE=1 ZERO_STAGE=1
    1. FLASH_ATTN=1
    2. FLASH_ATTN=0
  2. PPSIZE=1 ZERO_STAGE=2
  3. PPSIZE=1 ZERO_STAGE=3
  4. PPSIZE > 1
Our primary metric of interest is $N_{\mathrm{samples}} / \mathrm{sec}$.
  • Note on statistics:
    • We measure samples_per_sec over the first 5 training steps and aggregate these results for a given experiment.
    • This is repeated over multiple experiments to calculate statistics (average and error bars), as sketched after this list.
    • Error bars below represent the standard deviation σ about the mean μ across experiments.
  • The relevant configuration options for a given experiment are written in text directly above their corresponding bar.
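A minimal sketch of this aggregation (the samples_per_sec numbers below are hypothetical placeholders, not measured values):

```python
# Sketch of the statistics described above, using hypothetical numbers.
# Each inner list: samples_per_sec over the first 5 training steps of one experiment.
import numpy as np

experiments = [
    [11.8, 12.1, 12.0, 12.2, 11.9],  # experiment 1 (hypothetical)
    [12.0, 12.3, 11.9, 12.1, 12.2],  # experiment 2 (hypothetical)
    [11.7, 12.0, 12.1, 11.9, 12.0],  # experiment 3 (hypothetical)
]

per_experiment_mean = np.array([np.mean(run) for run in experiments])
mu = per_experiment_mean.mean()          # bar height: mean across experiments
sigma = per_experiment_mean.std(ddof=1)  # error bar: std. dev. across experiments
print(f"samples/sec = {mu:.2f} ± {sigma:.2f}")
```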
2.7B Model
Since the 2.7B parameter model is (relatively) small, we're able to get away with training it (i.e. no OOM errors) using only ZeRO stage 1 (optimizer-state partitioning), with no pipeline or tensor parallelism.
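A back-of-envelope sketch of why this fits (assuming fp16 training with Adam and the usual ~2 + 2 + 12 bytes-per-parameter breakdown of model states, with only the optimizer states partitioned at ZeRO stage 1; activation memory is ignored here):

```python
# Back-of-envelope estimate (assumptions noted above) of per-GPU model-state
# memory for the 2.7B model with ZeRO stage 1 and no model parallelism.
n_params = 2.7e9
world_size = 16

fp16_params_gb = 2 * n_params / 1e9              # fp16 weights, replicated on every rank
fp16_grads_gb = 2 * n_params / 1e9               # fp16 gradients, replicated at stage 1
optimizer_gb = 12 * n_params / world_size / 1e9  # fp32 master weights + Adam moments, partitioned

total_gb = fp16_params_gb + fp16_grads_gb + optimizer_gb
print(f"~{total_gb:.1f} GB of model states per GPU (plus activations)")  # ≈ 12.8 GB
```

Roughly 13 GB of model states per GPU leaves headroom for activations on a 40 GB A100, consistent with training without OOM using ZeRO stage 1 alone.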

💡
As expected, this is indeed the most performant configuration and spans the first four bars in the first chart below:
  1. world_size: 16, pipeline_model_parallel_size: 1, use_flash_attn: true, zero_stage: 1, checkpoint_activations: false
  2. world_size: 16, pipeline_model_parallel_size: 1, use_flash_attn: false, zero_stage: 1, checkpoint_activations: false
  3. world_size: 16, pipeline_model_parallel_size: 1, use_flash_attn: true, zero_stage: 1, checkpoint_activations: true
  4. world_size: 16, pipeline_model_parallel_size: 1, use_flash_attn: false, zero_stage: 1, checkpoint_activations: true

All configs

Results


[W&B panel grid: Run set (33 runs)]


System

Pipeline Parallelism / ZeRO > 1

Zooming in and ignoring the first two entries in the plots above (i.e. PPSIZE=1, ZERO_STAGE=1, FLASH_ATTN=0|1), we can better compare the remaining configurations:

[W&B panel grid: Run set (33 runs)]









Extras