
Speed-Testing Colossal AI's Parallelism

A need for speed...
Created on June 21 | Last edited on June 28


Motivation

I want to train big models fast. How can I achieve maximum throughput using parallelism?
Right now my goal is to train large PyTorch and PyTorch Lightning models as fast as possible. In the context of CodeFlare, this work would also enable auto-scaling and auto-parallelism for user-submitted PyTorch models. For many use cases there's a one-to-one mapping from PyTorch to Colossal-AI; see the helpful docs for converting PyTorch to Colossal-AI. Ideally, one could auto-scale the training of any model to the available hardware.
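To make that mapping concrete, here is a minimal sketch of wrapping an ordinary PyTorch training loop in Colossal-AI. The model, data, and config names are placeholders, and the API calls (launch_from_torch / initialize / engine) reflect the Colossal-AI version I used, so treat this as illustrative rather than the exact code behind these runs:

```python
# Launch with: torchrun --nproc_per_node=4 train.py
import colossalai
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder model and data -- stand-ins for the real GPT-2 model and dataloader.
model = nn.Linear(128, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
train_dataloader = DataLoader(
    TensorDataset(torch.randn(256, 128), torch.randint(0, 2, (256,))),
    batch_size=32,
)

# config.py holds the parallelism settings (see the Methods section below).
colossalai.launch_from_torch(config='./config.py')

# colossalai.initialize wraps model/optimizer/criterion/dataloader into an "engine".
engine, train_dataloader, _, _ = colossalai.initialize(
    model, optimizer, criterion, train_dataloader
)

# The loop swaps loss.backward() / optimizer.step() for the engine equivalents;
# everything else stays plain PyTorch.
engine.train()
for inputs, labels in train_dataloader:
    inputs, labels = inputs.cuda(), labels.cuda()
    engine.zero_grad()
    outputs = engine(inputs)
    loss = engine.criterion(outputs, labels)
    engine.backward(loss)
    engine.step()
```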


To solve both problems, I want to evaluate how the following factors determine the most efficient flavor of model parallelism (👉 please comment your ideas! 👈):
  • Model architecture (transformer vs. CNN, etc.)
  • Model framework (PyTorch ⇨ Colossal-AI or FS-DDP, TensorFlow ⇨ TF Mesh)
  • Model size
  • Data size
  • Number of GPUs available
  • Networking bandwidth
  • GPUs per node
  • Compute topology?
Here's an overview of Colossal-AI's methods that I will be testing:


Methods

Methods examined in this report (config sketch below):
  • 1D and 2D tensor parallelism, pipeline parallelism (PP), and PP_1D (pipeline + 1D tensor parallelism).
  • Colossal-AI's ZeRO-3 optimizer.
  • Vanilla training in PyTorch (with data parallel)
Methods not examined here:
  • 2.5D & 3D. They require 8 GPUs, and I could not get those allocated in time. Also, it makes comparisons less equal, but still valid.
  • Multi-node scenarios. Everything here is single-node using NCCL.
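For context, the parallelism flavor in Colossal-AI is chosen in the config file rather than in the training script. The snippet below is a rough sketch of what those configs look like; the field names follow my recollection of the 2022-era config format, so double-check them against the Colossal-AI docs before reusing them.

```python
# config.py -- illustrative configs for the methods above (use one at a time).
# Field names are from memory of the 2022-era Colossal-AI config format.

# 1D tensor parallelism across 4 GPUs
parallel = dict(tensor=dict(size=4, mode='1d'))

# 2D tensor parallelism (the tensor-parallel size must be a perfect square)
# parallel = dict(tensor=dict(size=4, mode='2d'))

# Pipeline parallelism: 4 stages, fed with micro-batches
# parallel = dict(pipeline=4)
# NUM_MICRO_BATCHES = 4

# PP_1D: 2 pipeline stages x 2-way 1D tensor parallelism = 4 GPUs total
# parallel = dict(pipeline=2, tensor=dict(size=2, mode='1d'))
```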

Code & Reproducibility

You can view the full Python file used to run these experiments here: https://wandb.ai/kastan/col_ai/runs/248q61tv/code?workspace=user-kastan
Here are the full metadata & node stats (4x A100-40GB; the node has 128 CPUs, of which only ~8 were allocated): https://wandb.ai/kastan/col_ai/groups/gpt2_pp_2gpu?workspace=user-kastan
Here's the main W&B project dashboard: https://wandb.ai/kastan/col_ai?workspace=user-kastan

Future work (your input, please!)

  1. Multi-node setting with 2.5D and 3D parallelism. This is Colossal-AI's key idea, so it must be tested thoroughly.
  2. Use different models (GPT-2-large? RoBERTa? ViT? Your favorite model??)
  3. Compare against FS-DDP from PyTorch Lightning in all of the above scenarios (see the sketch below).
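For reference, here is roughly what the Lightning side of that comparison could look like. This is a hypothetical sketch (the LightningModule is a placeholder, and the strategy name varies across Lightning versions), not code from the runs in this report:

```python
import pytorch_lightning as pl

# Hypothetical comparison run: the same GPT-2 LightningModule, sharded with FSDP.
# The strategy string differs by Lightning version
# (e.g. "ddp_fully_sharded" in older releases, "fsdp" in newer ones).
trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,
    precision=16,
    strategy="fsdp",
)
# trainer.fit(lit_gpt2_module, train_dataloader)  # placeholders, not defined here
```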

Results

✅ Colossal-AI's pipeline parallelism is by far the fastest in a single-node, multi-GPU setting. It's almost 3x faster than vanilla PyTorch.
❌ However, tensor parallelism (1D, 2D, 3D) is counter-productive when a model fits on a single GPU; it only becomes necessary for huge models.

Ideally we'd use data parallelism for everything, but that cannot scale to multi-trillion-parameter models.
Tensor-Parallelism spreads a model across multiple GPUs when it cannot fit in one GPU's memory, but it carries a large communication penalty.
Pipeline-Parallelism splits the model into sequential stages across GPUs and streams micro-batches through them, keeping every stage busy and making full use of available GPU memory.

The real world will require all of the above: careful tuning of DP, TP, PP, batch size, and micro-batch size.
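As a back-of-the-envelope sketch of how those knobs interact (the variable names here are illustrative bookkeeping, not Colossal-AI config keys):

```python
# How the parallelism degrees and batch sizes fit together on one 4-GPU node.
num_gpus = 4
tp_size = 1   # tensor-parallel degree (1 = no TP)
pp_size = 4   # pipeline-parallel degree (number of stages)
dp_size = num_gpus // (tp_size * pp_size)   # whatever is left over becomes DP

micro_batch_size = 8      # what one pipeline stage processes at a time
num_micro_batches = 16    # micro-batches in flight per optimizer step
global_batch_size = dp_size * micro_batch_size * num_micro_batches

print(f"DP={dp_size}, TP={tp_size}, PP={pp_size}, global batch={global_batch_size}")
# Rule of thumb: more micro-batches shrink the pipeline "bubble" (idle stages),
# while larger TP adds all-reduce traffic inside every layer.
```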

Maximizing Throughput & Hardware Utilization

I want to train as fast as possible. Therefore, my north star for training GPT is samples per second of training throughput. For hardware, I want to maximize the utilization of my GPU cores and memory.


There are error bars on the GPU utilization because I'm typically using 4 GPUs, and the main line represents the average. You can see I have plenty of GPU memory headroom and I'd like to train bigger models.
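W&B records GPU utilization and memory automatically in the background; the sketch below just shows how a samples-per-second metric plus per-GPU stats could be logged explicitly alongside it. The function and metric names are illustrative, not the exact code behind these charts:

```python
import time
import wandb
import pynvml  # NVIDIA NVML Python bindings

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0; repeat per GPU as needed

def log_throughput(step, batch_size, step_start_time):
    """Log samples/sec (my north-star metric) plus GPU 0 stats to W&B."""
    samples_per_sec = batch_size / (time.time() - step_start_time)
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    wandb.log({
        "throughput/samples_per_sec": samples_per_sec,
        "gpu0/utilization_pct": util.gpu,
        "gpu0/memory_used_gb": mem.used / 1e9,
    }, step=step)
```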

Model Loss

The model, GPT-2-small from the Titans library, converges quickly, in under a single epoch, in all cases.



Supplementary hardware logs

Interesting metrics include GPU wattage and the percentage of GPU time spent accessing memory.
CPU memory utilization, network traffic (bits), and disk utilization can also be helpful (but are not shown in this report).

