[RWKV-infctx] DeepSpeed 2 / 3 comparisons
The following compares DeepSpeed stages 2 & 3 (with / without CPU offloading), running on a single machine with 2 x A5000 GPUs (with NVLink, 24GB VRAM each, via vast.ai). All other training params were kept the same across runs (including the dataset: 1 epoch of enwiki_10k).
(Benchmark was done on 10th July 2023, with Torch 2.0.1 and CUDA 11.8.)
| DeepSpeed Strategy | Time | VRAM Usage (GPU 0 + GPU 1) | RAM Usage | Validation Loss |
|---|---|---|---|---|
| Stage 2 | 24 min 55 sec | ~22.3 + 23.8 GB | ~85 GB | 6.173 |
| Stage 3 | 29 min 12 sec | ~23.0 + 23.2 GB ^ | ~85 GB | 5.665 |
| Stage 2 + CPU offload | 43 min 08 sec | ~9.7 + 10.3 GB | ~128 GB | 6.124 |
| Stage 3 + CPU offload | 1 hr 42 min 38 sec | ~7.0 + 7.3 GB | ~145 GB | 5.668 |
^ Note that in theory stage 3 uses less VRAM than stage 2; however, it will also try to use up more VRAM than it strictly needs for "cache" items when possible, so it currently maxes out to the same level as stage 2 here.
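(If this "cache" growth is a problem on smaller cards, DeepSpeed exposes knobs to bound it. Below is a minimal sketch of the relevant `zero_optimization` settings; the values shown are illustrative, not what this benchmark used.)

```python
# Illustrative sketch only (not this benchmark's config): stage 3 keeps
# gathered / prefetched parameters resident in VRAM for speed; these
# zero_optimization knobs bound how much it is allowed to hold on to.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "stage3_max_live_parameters": 1e9,   # max params materialized at any one time
        "stage3_max_reuse_distance": 1e9,    # release params not reused within this window
        "stage3_prefetch_bucket_size": 5e8,  # how far ahead to prefetch partitioned params
    }
}
```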
The git repository and notebook can be found here: https://github.com/PicoCreator/RWKV-LM-LoRA/blob/dev-infctx-torch-compile/notebook/trainer-validation/deepspeed-2-and-3.ipynb
Torch.JIT was enabled for DeepSpeed 2, but was disabled for DeepSpeed 3 (not compatible). Torch.compile was disabled for both.
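For reference, here is a minimal sketch of how the four runs can be selected via PyTorch Lightning's built-in DeepSpeed strategy aliases (the infctx trainer is Lightning-based; the model / data objects are placeholders, and this is not the benchmark's exact launch script):

```python
import lightning.pytorch as pl  # "pytorch_lightning" on older versions

# The four strategies compared above, via Lightning's built-in aliases.
STRATEGIES = [
    "deepspeed_stage_2",
    "deepspeed_stage_2_offload",
    "deepspeed_stage_3",
    "deepspeed_stage_3_offload",
]

trainer = pl.Trainer(
    accelerator="gpu",
    devices=2,               # 2 x A5000 in this benchmark
    max_epochs=1,            # 1 epoch of enwiki_10k
    strategy=STRATEGIES[0],  # swap the index to reproduce each run
)
# trainer.fit(model, datamodule=data)  # model / data are placeholders
```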
Impact on train/loss with deepspeed 2/3
Resource consumption figures are used to compare the difference in impact each strategy has on the system within the same process, and can help estimate performance on other, similar systems.
System Resource Usage
What do DeepSpeed 2 & 3 do (with / without CPU offload)?
Instead of simply splitting the dataset being trained and keeping a full copy of everything on every GPU (aka DDP / DeepSpeed 1):

DeepSpeed 2 keeps a full copy of the model weights on each GPU, but splits the training gradient / optimizer-state memory usage across multiple GPUs, or offloads it into CPU memory (the "+ CPU offload" option).

DeepSpeed 3 takes it a step further and distributes the model weights themselves across all the GPUs, drastically lowering the VRAM requirement while drastically increasing GPU-to-GPU traffic. Gradient / optimizer-state memory is still split across multiple GPUs, with the option to offload into CPU memory (same as DeepSpeed 2).

Finally, DeepSpeed 3 also introduces options to further offload the model weights / optimizer state into CPU memory or NVMe. However, these options were not enabled or explored in these benchmarks.
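As a rough sketch of where these options live: a DeepSpeed ZeRO config can be passed to Lightning's `DeepSpeedStrategy` as a dict. The keys below are real DeepSpeed config options, but the NVMe path and the choice to offload are illustrative, not what was benchmarked:

```python
from lightning.pytorch.strategies import DeepSpeedStrategy

ds_config = {
    "zero_optimization": {
        "stage": 3,                              # 2 = shard grads/optimizer state; 3 = also shard weights
        "offload_optimizer": {"device": "cpu"},  # the "+ CPU offload" variant above
        # Stage-3-only extra: push the sharded weights out to NVMe as well
        # ("/local_nvme" is an illustrative path, not from the benchmark):
        "offload_param": {"device": "nvme", "nvme_path": "/local_nvme"},
    },
}

strategy = DeepSpeedStrategy(config=ds_config)  # then pass strategy= to pl.Trainer
```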