
Performance Study

A detailed study investigating the performance of various Megatron-DeepSpeed optimizations for training GPT. Date: 03/24/2023



Environment @ ALCF (Polaris)

NOTE: The following experiments were done on the Polaris machine at ALCF using
  • Python 3.10.9
  • torch.__version__ = 1.13.0a0+git49444c3
  • DeepSpeed general environment info:
    • torch install path ............... ['/soft/datascience/conda/2023-01-10/mconda3/lib/python3.10/site-packages/torch']
    • torch version .................... 1.13.0a0+git49444c3
    • deepspeed install path ........... ['/lus/grand/projects/datascience/foremans/locations/polaris/projects/saforem2/Megatron-DeepSpeed/venvs/polaris/2023-01-10/lib/python3.10/site-packages/deepspeed']
    • deepspeed info ................... 0.8.3+6379defa, 6379defa, master
    • torch cuda version ............... 11.8
    • torch hip version ................ None
    • nvcc version ..................... 11.8
    • deepspeed wheel compiled w. ...... torch 1.13, cuda 11.8
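The same information can be regenerated with DeepSpeed's environment report (`ds_report`, equivalently `python -m deepspeed.env_report`); below is a minimal Python sketch that prints just the key fields:

```python
# Minimal sketch: print the key environment details summarized above.
# (The full table is produced by DeepSpeed's `ds_report` utility.)
import torch
import deepspeed

print(f"torch version ....... {torch.__version__}")
print(f"torch cuda version .. {torch.version.cuda}")
print(f"deepspeed version ... {deepspeed.__version__}")
```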
The source code is available online at:



Performance Study

We are interested in measuring the performance impact from different combinations of the following configuration options:
  • world_size
  • pipeline-model-parallel-size
  • tensor-model-parallel-size
  • zero_optimization.stage
  • flash_attention
for both the 2.7B and 20B GPT model architectures.
In particular, we use throughput/samples_per_sec as our performance metric.
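As a rough sketch of how these options enter a run (flag names follow the Megatron-DeepSpeed CLI and the DeepSpeed JSON config; the batch size and other values below are illustrative, not the exact settings used in these experiments):

```python
# Sketch: how the options above map onto Megatron-DeepSpeed arguments and the
# DeepSpeed config (illustrative values; not the exact launch script used here).
ds_config = {
    "train_micro_batch_size_per_gpu": 4,  # illustrative value
    "zero_optimization": {"stage": 1},    # zero_optimization.stage
    "fp16": {"enabled": True},
}

megatron_args = [
    "--tensor-model-parallel-size", "1",    # tensor-model-parallel-size
    "--pipeline-model-parallel-size", "1",  # pipeline-model-parallel-size
    "--use-flash-attn",                     # flash_attention
]
# world_size is determined by the launcher (the number of ranks across nodes/GPUs).
```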

Model: 2.7B Params

World Size: 16

Here we can see that (in decreasing order of performance):
  1. PPSIZE=1 ZERO_STAGE=1
    1. FLASH_ATTN=1
    2. FLASH_ATTN=0
  2. PPSIZE=1 ZERO_STAGE=2
  3. PPSIZE=1 ZERO_STAGE=3
  4. PPSIZE > 1
Our primary metric of interest is $N_{\mathrm{samples}} / \mathrm{sec}$.
  • Note on statistics:
    • We measure samples_per_sec over the first 5 training steps and aggregate these results for a given experiment.
    • This is repeated over multiple experiments to calculate statistics (average and error bars), as sketched after this list.
    • Error bars below represent the standard deviation σ about the mean μ across experiments.
  • The relevant configuration options for a given experiment are written in text directly above their corresponding bar.
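A minimal sketch of this aggregation (the samples_per_sec numbers below are hypothetical placeholders, not measured values):

```python
# Sketch of the statistics described above, using hypothetical numbers.
# Each inner list: samples_per_sec over the first 5 training steps of one experiment.
import numpy as np

experiments = [
    [11.8, 12.1, 12.0, 12.2, 11.9],  # experiment 1 (hypothetical)
    [12.0, 12.3, 11.9, 12.1, 12.2],  # experiment 2 (hypothetical)
    [11.7, 12.0, 12.1, 11.9, 12.0],  # experiment 3 (hypothetical)
]

per_experiment_mean = np.array([np.mean(run) for run in experiments])
mu = per_experiment_mean.mean()          # bar height: mean across experiments
sigma = per_experiment_mean.std(ddof=1)  # error bar: std. dev. across experiments
print(f"samples/sec = {mu:.2f} ± {sigma:.2f}")
```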
2.7B Model
Since the 2.7B parameter model is (relatively) small, we're able to get away with training it (i.e. no OOM errors) using only ZeRO stage 1 (optimizer-state partitioning), with no pipeline or tensor parallelism.
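A back-of-envelope sketch of why this fits (assuming fp16 training with Adam and the usual ~2 + 2 + 12 bytes-per-parameter breakdown of model states, with only the optimizer states partitioned at ZeRO stage 1; activation memory is ignored here):

```python
# Back-of-envelope estimate (assumptions noted above) of per-GPU model-state
# memory for the 2.7B model with ZeRO stage 1 and no model parallelism.
n_params = 2.7e9
world_size = 16

fp16_params_gb = 2 * n_params / 1e9              # fp16 weights, replicated on every rank
fp16_grads_gb = 2 * n_params / 1e9               # fp16 gradients, replicated at stage 1
optimizer_gb = 12 * n_params / world_size / 1e9  # fp32 master weights + Adam moments, partitioned

total_gb = fp16_params_gb + fp16_grads_gb + optimizer_gb
print(f"~{total_gb:.1f} GB of model states per GPU (plus activations)")  # ≈ 12.8 GB
```

Roughly 13 GB of model states per GPU leaves headroom for activations on a 40 GB A100, consistent with training without OOM using ZeRO stage 1 alone.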

💡
As expected, this is indeed the most performant configuration and spans the first four bars in the first chart below:
  1. world_size: 16, pipeline_model_parallel_size: 1, use_flash_attn: true, zero_stage: 1, checkpoint_activations: false
  2. world_size: 16, pipeline_model_parallel_size: 1, use_flash_attn: false, zero_stage: 1, checkpoint_activations: false
  3. world_size: 16, pipeline_model_parallel_size: 1, use_flash_attn: true, zero_stage: 1, checkpoint_activations: true
  4. world_size: 16, pipeline_model_parallel_size: 1, use_flash_attn: false, zero_stage: 1, checkpoint_activations: true

All configs

Results


[W&B panel grid: Run set (33 runs)]


System

Pipeline Parallelism / ZeRO > 1

Zooming in and ignoring the first two entries in the plots above (i.e. PPSIZE=1, ZERO_STAGE=1, FLASH_ATTN=0|1), we can better compare the remaining configurations:

[W&B panel grid: Run set (33 runs)]









Extras