Speed-Testing Colossal AI's Parallelism
A need for speed...
Created on June 21 | Last edited on June 28
Motivation
I want to train big models fast. How can I achieve maximum throughput using parallelism?
💡
Right now my goal is to train large PyTorch and PyTorch Lightning models as fast as possible. In the context of CodeFlare, this work would also enable auto-scaling and auto-parallelism for user-submitted PyTorch models. For many use cases there is a one-to-one mapping from PyTorch to Colossal-AI; see the helpful docs on converting PyTorch to Colossal-AI.
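To make that mapping concrete, here is a minimal sketch of the conversion, assuming the Colossal-AI engine API from around the time of writing (the model, optimizer, criterion, data loader, and the `./config.py` path are placeholders, not the exact code used in these experiments):

```python
# Minimal sketch: wrapping a vanilla PyTorch training loop with Colossal-AI's engine.
import colossalai

def train(model, optimizer, criterion, train_loader):
    # Reads the parallel configuration (see the config sketch later in this report).
    colossalai.launch_from_torch(config='./config.py')

    # Wrap the plain PyTorch objects into a Colossal-AI engine.
    engine, train_loader, _, _ = colossalai.initialize(
        model=model,
        optimizer=optimizer,
        criterion=criterion,
        train_dataloader=train_loader,
    )

    engine.train()
    for data, label in train_loader:
        data, label = data.cuda(), label.cuda()
        engine.zero_grad()
        output = engine(data)                 # same call signature as the nn.Module
        loss = engine.criterion(output, label)
        engine.backward(loss)                 # replaces loss.backward()
        engine.step()                         # replaces optimizer.step()
```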

Ideally, one could auto-scale the training of any model to the available hardware. To solve both problems, I want to evaluate how the factors listed below determine the most efficient flavor of model parallelism (a hypothetical sweep over these factors is sketched after the list):
- 👉 Please comment your ideas! 👈
- Model architecture (transformer vs. CNN, etc.),
- Model framework (PyTorch ⇨ Colossal-AI or FS-DDP, TensorFlow ⇨ TF Mesh),
- Model size,
- Data size,
- Number of GPUs available,
- Networking bandwidth,
- GPUs per node,
- Compute topology...?
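One way to explore these factors systematically would be a W&B sweep. Here is a hypothetical sketch; the parameter names and values are illustrative, not the grid actually run in this report:

```python
# Hypothetical W&B sweep over a subset of the factors above (values are illustrative).
import wandb

sweep_config = {
    "method": "grid",
    "metric": {"name": "samples_per_sec", "goal": "maximize"},
    "parameters": {
        "parallel_mode":    {"values": ["ddp", "1d", "2d", "pp", "pp_1d", "zero3"]},
        "model":            {"values": ["gpt2-small", "gpt2-large"]},
        "num_gpus":         {"values": [2, 4]},
        "micro_batch_size": {"values": [1, 2, 4]},
    },
}

sweep_id = wandb.sweep(sweep_config, project="col_ai")
# wandb.agent(sweep_id, function=train_one_config)  # train_one_config is hypothetical
```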

Methods
Methods examined in this report (a config sketch follows these two lists):
- 1D and 2D tensor parallelism, pipeline parallelism (PP), and PP_1D (pipeline combined with 1D tensor parallelism).
- Colossal-AI's own ZeRO-3 optimizer.
- Vanilla training in PyTorch (with data parallelism).
Methods not examined here:
- 2.5D & 3D. They require 8 GPUs, and I could not get those allocated in time. Also, it makes comparisons less equal, but still valid.
- Multi-node scenarios. Everything here is single-node using NCCL.
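For context, here is a sketch of the kind of config file that selects each parallel flavor, assuming the Python-config format Colossal-AI used at the time (exact keys may differ across versions; ZeRO-3 and plain data parallelism are configured separately and omitted here):

```python
# config.py -- sketch of selecting a parallel flavor in Colossal-AI's config format.
# Pick ONE of the `parallel` blocks; the sizes assume 4 GPUs on one node.

BATCH_SIZE = 8
NUM_MICRO_BATCHES = 4          # consumed by the pipeline schedule

# Pipeline parallel (PP) across 4 GPUs:
parallel = dict(pipeline=4)

# 1D tensor parallel across 4 GPUs:
# parallel = dict(tensor=dict(size=4, mode='1d'))

# 2D tensor parallel (requires a square number of tensor-parallel GPUs):
# parallel = dict(tensor=dict(size=4, mode='2d'))

# PP_1D: pipeline across 2 GPUs, 1D tensor parallel across 2 GPUs:
# parallel = dict(pipeline=2, tensor=dict(size=2, mode='1d'))
```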
Code & Reproducibility
You can view the full python file used to run these experiments here: https://wandb.ai/kastan/col_ai/runs/248q61tv/code?workspace=user-kastan
Here are the full metadata & node stats (4× A100-40GB, with 128 CPUs on the node but only roughly 8 allocated): https://wandb.ai/kastan/col_ai/groups/gpt2_pp_2gpu?workspace=user-kastan
Future work (your input, please!)
- Multi-node settings with 2.5D and 3D parallelism. This is Colossal-AI's key idea, so it must be tested thoroughly.
- Use different models (GPT-2-large? RoBERTa? ViT? Your favorite model??)
- Compare against FS-DDP from PyTorch Lightning in all of the above scenarios.
Results
✅ Colossal-AI's pipeline parallelism is by far the fastest in a single-node, multi-GPU setting. It's almost 3x faster than vanilla PyTorch.
❌ However, tensor parallelism (1D, 2D, 3D) is counter-productive when a model fits on a single GPU; it is only necessary for huge models.
Ideally we'd use data parallelism for everything, but data parallelism alone cannot reach multi-trillion-parameter models.
Tensor parallelism spreads a model across multiple GPUs when it cannot fit in one GPU's memory, but it incurs a large communication penalty.
Pipeline parallelism splits the model into stages across GPUs and streams micro-batches through them, keeping every GPU busy and making full use of the available GPU memory.
Real-world training will require all of the above: careful tuning of DP, TP, PP, batch size, and micro-batch size (a rough heuristic is sketched below).
💡
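To illustrate that trade-off, here is a rough, hypothetical heuristic for picking a strategy from model size, GPU memory, and GPU count. The function name and thresholds are illustrative only and are not what these experiments ran:

```python
# Hypothetical heuristic for choosing a parallelism strategy (illustrative only).
def choose_strategy(model_mem_gb: float, gpu_mem_gb: float, num_gpus: int) -> str:
    # Training typically needs several times the raw model memory
    # (gradients + optimizer states); 4x is an illustrative placeholder.
    train_mem_gb = 4 * model_mem_gb

    if train_mem_gb <= gpu_mem_gb:
        # Model fits on one GPU: plain data parallelism works, and pipeline
        # parallelism with micro-batches can squeeze out extra throughput.
        return "data parallel (+ pipeline parallel for throughput)"
    elif train_mem_gb <= gpu_mem_gb * num_gpus:
        # Model fits across the node: split it with PP/TP and accept the comms cost.
        return "pipeline parallel + tensor parallel"
    else:
        # Model exceeds the node: combine DP + TP + PP across nodes.
        return "DP + TP + PP (multi-node)"

print(choose_strategy(model_mem_gb=6, gpu_mem_gb=40, num_gpus=4))
```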
Maximizing Throughput & Hardware Utilization
I want to train as fast as possible, so my north star for training GPT is training throughput in samples per second. For hardware, I want to maximize the utilization of my GPU cores and memory.
There are error bars on GPU utilization because I'm typically using 4 GPUs; the main line represents the average across them. You can see there is plenty of GPU memory headroom, so I'd like to train bigger models.
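For reference, throughput here simply means samples processed per wall-clock second. A minimal sketch of how such a metric could be computed and logged to W&B (the `step_fn` callable and the metric key are hypothetical, and `wandb.init` assumes an existing project):

```python
# Minimal sketch: measuring samples/sec and logging it to W&B.
import time
import wandb

wandb.init(project="col_ai")

def log_throughput(step_fn, batch_size: int, num_steps: int = 100):
    """step_fn() runs one training step on one batch (hypothetical callable)."""
    start = time.time()
    for _ in range(num_steps):
        step_fn()
    elapsed = time.time() - start
    wandb.log({"samples_per_sec": (num_steps * batch_size) / elapsed})
```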
Model Loss
The model, GPT-2-small from the Titans library, converges quickly, in under a single epoch, in all cases.
[Chart panel: run set of 495 runs]
Supplementary hardware logs
Interesting metrics include GPU wattage and the percentage of GPU time spent accessing memory. CPU memory utilization, network traffic (bits), and disk utilization can also be helpful (but are not shown in this report).
[Chart panel: run set of 495 runs]