Speed-Testing Colossal AI's Parallelism
A need for speed...
Created on June 21 | Last edited on June 28
Motivation
I want to train big models fast. How can I achieve maximum throughput using parallelism?
💡
Right now my goal is to train large PyTorch and PyTorch Lightning models as fast as possible. In the context of CodeFlare, this work would also enable auto-scaling and auto-parallelism for user-submitted PyTorch models. For many use cases there is a one-to-one mapping from PyTorch to Colossal-AI; see the helpful docs on converting PyTorch to Colossal-AI.
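To make that mapping concrete, here is a minimal sketch of the conversion, assuming the Colossal-AI engine API from around the time of writing (the model, optimizer, criterion, data loader, and the `./config.py` path are placeholders, not the exact code used in these experiments):

```python
# Minimal sketch: wrapping a vanilla PyTorch training loop with Colossal-AI's engine.
import colossalai

def train(model, optimizer, criterion, train_loader):
    # Reads the parallel configuration (see the config sketch later in this report).
    colossalai.launch_from_torch(config='./config.py')

    # Wrap the plain PyTorch objects into a Colossal-AI engine.
    engine, train_loader, _, _ = colossalai.initialize(
        model=model,
        optimizer=optimizer,
        criterion=criterion,
        train_dataloader=train_loader,
    )

    engine.train()
    for data, label in train_loader:
        data, label = data.cuda(), label.cuda()
        engine.zero_grad()
        output = engine(data)                 # same call signature as the nn.Module
        loss = engine.criterion(output, label)
        engine.backward(loss)                 # replaces loss.backward()
        engine.step()                         # replaces optimizer.step()
```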

Ideally, one could auto-scale the training of any model to the available hardware. To solve both problems, I want to evaluate how the factors listed below determine the most efficient flavor of model parallelism (a hypothetical sweep over these factors is sketched after the list):
- 👉 Please comment your ideas! 👈
- Model architecture (transformer vs. CNN, etc.),
- Model framework (PyTorch ⇨ Colossal-AI or FS-DDP, TensorFlow ⇨ TF Mesh),
- Model size,
- Data size,
- Number of GPUs available,
- Networking bandwidth,
- GPUs per node,
- Compute topology...?
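One way to explore these factors systematically would be a W&B sweep. Here is a hypothetical sketch; the parameter names and values are illustrative, not the grid actually run in this report:

```python
# Hypothetical W&B sweep over a subset of the factors above (values are illustrative).
import wandb

sweep_config = {
    "method": "grid",
    "metric": {"name": "samples_per_sec", "goal": "maximize"},
    "parameters": {
        "parallel_mode":    {"values": ["ddp", "1d", "2d", "pp", "pp_1d", "zero3"]},
        "model":            {"values": ["gpt2-small", "gpt2-large"]},
        "num_gpus":         {"values": [2, 4]},
        "micro_batch_size": {"values": [1, 2, 4]},
    },
}

sweep_id = wandb.sweep(sweep_config, project="col_ai")
# wandb.agent(sweep_id, function=train_one_config)  # train_one_config is hypothetical
```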

Methods
Methods examined in this report (a config sketch follows these two lists):
- 1D and 2D tensor parallelism, pipeline parallelism (PP), and PP_1D (pipeline combined with 1D tensor parallelism).
- Colossal-AI's own ZeRO-3 optimizer.
- Vanilla training in PyTorch (with data parallelism).
Methods not examined here:
- 2.5D & 3D. They require 8 GPUs, and I could not get those allocated in time. Also, it makes comparisons less equal, but still valid.
- Multi-node scenarios. Everything here is single-node using NCCL.
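For context, here is a sketch of the kind of config file that selects each parallel flavor, assuming the Python-config format Colossal-AI used at the time (exact keys may differ across versions; ZeRO-3 and plain data parallelism are configured separately and omitted here):

```python
# config.py -- sketch of selecting a parallel flavor in Colossal-AI's config format.
# Pick ONE of the `parallel` blocks; the sizes assume 4 GPUs on one node.

BATCH_SIZE = 8
NUM_MICRO_BATCHES = 4          # consumed by the pipeline schedule

# Pipeline parallel (PP) across 4 GPUs:
parallel = dict(pipeline=4)

# 1D tensor parallel across 4 GPUs:
# parallel = dict(tensor=dict(size=4, mode='1d'))

# 2D tensor parallel (requires a square number of tensor-parallel GPUs):
# parallel = dict(tensor=dict(size=4, mode='2d'))

# PP_1D: pipeline across 2 GPUs, 1D tensor parallel across 2 GPUs:
# parallel = dict(pipeline=2, tensor=dict(size=2, mode='1d'))
```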
Code & Reproducibility
You can view the full python file used to run these experiments here: https://wandb.ai/kastan/col_ai/runs/248q61tv/code?workspace=user-kastan
Here are the full metadata & node stats (4× A100-40GB, with 128 CPUs on the node but only roughly 8 allocated): https://wandb.ai/kastan/col_ai/groups/gpt2_pp_2gpu?workspace=user-kastan
Future work (your input, please!)
- Multi-node settings with 2.5D and 3D parallelism. This is Colossal-AI's key idea, so it must be tested thoroughly.
- Use different models (GPT-2-large? RoBERTa? ViT? Your favorite model??)
- Compare against FS-DDP from PyTorch Lightning in all of the above scenarios.
Results
✅ Colossal-AI's pipeline parallelism is by far the fastest in a single-node, multi-GPU setting. It's almost 3x faster than vanilla PyTorch.
❌ However, tensor parallelism (1D, 2D, 3D) is counter-productive when a model fits on a single GPU; it is only necessary for huge models.
Ideally we'd use data parallelism for everything, but data parallelism alone cannot reach multi-trillion-parameter models.
Tensor parallelism spreads a model across multiple GPUs when it cannot fit in one GPU's memory, but it incurs a large communication penalty.
Pipeline parallelism splits the model into stages across GPUs and streams micro-batches through them, keeping every GPU busy and making full use of the available GPU memory.
Real-world training will require all of the above: careful tuning of DP, TP, PP, batch size, and micro-batch size (a rough heuristic is sketched below).
💡
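To illustrate that trade-off, here is a rough, hypothetical heuristic for picking a strategy from model size, GPU memory, and GPU count. The function name and thresholds are illustrative only and are not what these experiments ran:

```python
# Hypothetical heuristic for choosing a parallelism strategy (illustrative only).
def choose_strategy(model_mem_gb: float, gpu_mem_gb: float, num_gpus: int) -> str:
    # Training typically needs several times the raw model memory
    # (gradients + optimizer states); 4x is an illustrative placeholder.
    train_mem_gb = 4 * model_mem_gb

    if train_mem_gb <= gpu_mem_gb:
        # Model fits on one GPU: plain data parallelism works, and pipeline
        # parallelism with micro-batches can squeeze out extra throughput.
        return "data parallel (+ pipeline parallel for throughput)"
    elif train_mem_gb <= gpu_mem_gb * num_gpus:
        # Model fits across the node: split it with PP/TP and accept the comms cost.
        return "pipeline parallel + tensor parallel"
    else:
        # Model exceeds the node: combine DP + TP + PP across nodes.
        return "DP + TP + PP (multi-node)"

print(choose_strategy(model_mem_gb=6, gpu_mem_gb=40, num_gpus=4))
```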
Maximizing Throughput & Hardware Utilization
I want to train as fast as possible, so my north star for training GPT is training throughput in samples per second. For hardware, I want to maximize the utilization of my GPU cores and memory.
There are error bars on GPU utilization because I'm typically using 4 GPUs; the main line represents the average across them. You can see there is plenty of GPU memory headroom, so I'd like to train bigger models.
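For reference, throughput here simply means samples processed per wall-clock second. A minimal sketch of how such a metric could be computed and logged to W&B (the `step_fn` callable and the metric key are hypothetical, and `wandb.init` assumes an existing project):

```python
# Minimal sketch: measuring samples/sec and logging it to W&B.
import time
import wandb

wandb.init(project="col_ai")

def log_throughput(step_fn, batch_size: int, num_steps: int = 100):
    """step_fn() runs one training step on one batch (hypothetical callable)."""
    start = time.time()
    for _ in range(num_steps):
        step_fn()
    elapsed = time.time() - start
    wandb.log({"samples_per_sec": (num_steps * batch_size) / elapsed})
```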
Model Loss
The model, GPT-2-small from the Titans library, converges quickly, in under a single epoch, in all cases.
[Chart panel: run set of 495 runs]
Supplementary hardware logs
Interesting metrics include GPU wattage and the percentage of GPU time spent accessing memory. CPU memory utilization, network traffic (bits), and disk utilization can also be helpful (but are not shown in this report).
[Chart panel: run set of 495 runs]