
[RWKV-infctx] DeepSpeed 2 / 3 comparisons

The following compares DeepSpeed stages 2 & 3 (with / without CPU offloading), running on a single 2 x A5000 machine (with NVLink, 24 GB VRAM each, via vast.ai), with all other training params kept the same (including dataset, 1 epoch, enwiki_10k).
(benchmark was done on 10th July 2023, with Torch 2.0.1 and CUDA 11.8)
| DeepSpeed Strategy | Time | VRAM Usage (GPU 0 + GPU 1) | RAM Usage | Validation Loss |
|---|---|---|---|---|
| Stage 2 | 24 mins 55 sec | ~22.3 + 23.8 GB | ~85 GB | 6.173 |
| Stage 3 | 29 mins 12 sec | ~23.0 + 23.2 GB ^ | ~85 GB | 5.665 |
| Stage 2 + CPU offload | 43 mins 08 sec | ~9.7 + 10.3 GB | ~128 GB | 6.124 |
| Stage 3 + CPU offload | 1 hr 42 mins 38 sec | ~7.0 + 7.3 GB | ~145 GB | 5.668 |



^ Note that in theory stage 3 uses less VRAM than stage 2; however, it will also opportunistically use any spare VRAM for "cache" items where possible, so it currently maxes out at the same level as stage 2 here.
Git repository and notebook can be found here: https://github.com/PicoCreator/RWKV-LM-LoRA/blob/dev-infctx-torch-compile/notebook/trainer-validation/deepspeed-2-and-3.ipynb
Torch.JIT was enabled for DeepSpeed 2, but disabled for DeepSpeed 3 (not compatible). torch.compile was disabled.
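
As a rough sketch of how the four runs were configured, assuming the PyTorch Lightning trainer used by RWKV-infctx: the four configurations correspond to Lightning's built-in DeepSpeed strategy aliases (the exact invocation and params are in the linked notebook):

```python
from lightning.pytorch import Trainer

# The four benchmarked configurations, as Lightning strategy aliases:
#   "deepspeed_stage_2"           - ZeRO stage 2
#   "deepspeed_stage_2_offload"   - ZeRO stage 2 + CPU offload
#   "deepspeed_stage_3"           - ZeRO stage 3
#   "deepspeed_stage_3_offload"   - ZeRO stage 3 + CPU offload
trainer = Trainer(
    strategy="deepspeed_stage_2",  # swapped per run
    accelerator="gpu",
    devices=2,  # 2 x A5000 in this benchmark
)
```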


Impact on train/loss with DeepSpeed 2/3


(W&B chart: train/loss across the 4 runs)

Resource consumption figures compare the differences in impact each strategy has on the same system, and can help estimate performance on other, similar systems.

System Resource Usage


(W&B chart: system resource usage across the 4 runs)


What do DeepSpeed 2 & 3 do (with / without CPU offload)?

Instead of simply splitting the dataset being trained and keeping a full copy of everything on every GPU (aka DDP / DeepSpeed 1), stages 2 and 3 shard the training state across GPUs.

DeepSpeed 2 keeps a full copy of the model weights on each GPU, but splits the gradient-descent memory usage (gradients and optimizer states) across multiple GPUs, or offloads it into CPU memory (the + CPU offload option).
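
For concreteness, here is a minimal sketch of what this looks like as a DeepSpeed config, passed through Lightning's DeepSpeedStrategy. The keys follow DeepSpeed's documented ZeRO schema; this is not the exact config from the benchmark notebook:

```python
from lightning.pytorch.strategies import DeepSpeedStrategy

# ZeRO stage 2: shard gradients + optimizer states, keep full weights per GPU.
zero2_config = {
    "zero_optimization": {
        "stage": 2,
        # Uncomment to reproduce the "Stage 2 + CPU offload" variant:
        # "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
}

strategy = DeepSpeedStrategy(config=zero2_config)
```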

DeepSpeed 3 takes it a step further and distributes the model weights across all the GPUs, drastically lowering the VRAM requirement while sharply increasing GPU-to-GPU traffic. Gradient-descent memory is still split across multiple GPUs, with the option to offload into CPU memory (same as DeepSpeed 2).

Finally, DeepSpeed 3 also introduces options to further offload the model weights / gradient-descent state into CPU memory or NVMe. However, these options were not enabled or explored in the benchmarks above.
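
As a hedged sketch of those stage-3-specific knobs, again following DeepSpeed's documented schema rather than the benchmark notebook (the NVMe path below is illustrative):

```python
# ZeRO stage 3: additionally shard the model weights themselves.
zero3_config = {
    "zero_optimization": {
        "stage": 3,
        # Same optimizer-state offload as stage 2:
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        # Stage-3 only: offload the parameters too, to CPU...
        "offload_param": {"device": "cpu", "pin_memory": True},
        # ...or to NVMe instead (path is illustrative):
        # "offload_param": {"device": "nvme", "nvme_path": "/local_nvme"},
    },
}
```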