
[RWKV-infctx] DeepSpeed 2 / 3 comparisons

The following compares DeepSpeed stages 2 & 3 (with / without CPU offloading), running on a single 2 x A5000 machine (with NVLink, 24 GB VRAM each, via vast.ai), with all other training params kept the same (including dataset, 1 epoch, enwiki_10k).
(benchmark was done on 10th July 2023, with Torch 2.0.1 and CUDA 11.8)
| DeepSpeed Strategy | Time | VRAM Usage (GPU 0 + GPU 1) | RAM Usage | Validation Loss |
|---|---|---|---|---|
| Stage 2 | 24 mins 55 sec | ~22.3 + 23.8 GB | ~85 GB | 6.173 |
| Stage 3 | 29 mins 12 sec | ~23.0 + 23.2 GB ^ | ~85 GB | 5.665 |
| Stage 2 + CPU offload | 43 mins 08 sec | ~9.7 + 10.3 GB | ~128 GB | 6.124 |
| Stage 3 + CPU offload | 1 hr 42 mins 38 sec | ~7.0 + 7.3 GB | ~145 GB | 5.668 |



^ Note that in theory stage 3 uses less VRAM than stage 2; however, it will also opportunistically use any spare VRAM for "cache" items where possible, so it currently maxes out at the same level as stage 2 here.
Git repository and notebook can be found here: https://github.com/PicoCreator/RWKV-LM-LoRA/blob/dev-infctx-torch-compile/notebook/trainer-validation/deepspeed-2-and-3.ipynb
Torch.JIT was enabled for DeepSpeed 2, but disabled for DeepSpeed 3 (not compatible). torch.compile was disabled.
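
As a rough sketch of how the four runs were configured, assuming the PyTorch Lightning trainer used by RWKV-infctx: the four configurations correspond to Lightning's built-in DeepSpeed strategy aliases (the exact invocation and params are in the linked notebook):

```python
from lightning.pytorch import Trainer

# The four benchmarked configurations, as Lightning strategy aliases:
#   "deepspeed_stage_2"           - ZeRO stage 2
#   "deepspeed_stage_2_offload"   - ZeRO stage 2 + CPU offload
#   "deepspeed_stage_3"           - ZeRO stage 3
#   "deepspeed_stage_3_offload"   - ZeRO stage 3 + CPU offload
trainer = Trainer(
    strategy="deepspeed_stage_2",  # swapped per run
    accelerator="gpu",
    devices=2,  # 2 x A5000 in this benchmark
)
```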


Impact on train/loss with DeepSpeed 2/3


(W&B chart: train/loss across the 4 runs)

Resource consumption figures compare the differences in impact each strategy has on the same system, and can help estimate performance on other, similar systems.

System Resource Usage


(W&B chart: system resource usage across the 4 runs)


What do DeepSpeed 2 & 3 do (with / without CPU offload)?

Instead of simply splitting the dataset being trained and keeping a full copy of everything on every GPU (aka DDP / DeepSpeed 1), stages 2 and 3 shard the training state across GPUs.

DeepSpeed 2 keeps a full copy of the model weights on each GPU, but splits the gradient-descent memory usage (gradients and optimizer states) across multiple GPUs, or offloads it into CPU memory (the + CPU offload option).
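
For concreteness, here is a minimal sketch of what this looks like as a DeepSpeed config, passed through Lightning's DeepSpeedStrategy. The keys follow DeepSpeed's documented ZeRO schema; this is not the exact config from the benchmark notebook:

```python
from lightning.pytorch.strategies import DeepSpeedStrategy

# ZeRO stage 2: shard gradients + optimizer states, keep full weights per GPU.
zero2_config = {
    "zero_optimization": {
        "stage": 2,
        # Uncomment to reproduce the "Stage 2 + CPU offload" variant:
        # "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
}

strategy = DeepSpeedStrategy(config=zero2_config)
```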

DeepSpeed 3 takes it a step further and distributes the model weights across all the GPUs, drastically lowering the VRAM requirement while sharply increasing GPU-to-GPU traffic. Gradient-descent memory is still split across multiple GPUs, with the option to offload into CPU memory (same as DeepSpeed 2).

Finally, DeepSpeed 3 also introduces options to further offload the model weights / gradient-descent state into CPU memory or NVMe. However, these options were not enabled or explored in the benchmarks above.
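
As a hedged sketch of those stage-3-specific knobs, again following DeepSpeed's documented schema rather than the benchmark notebook (the NVMe path below is illustrative):

```python
# ZeRO stage 3: additionally shard the model weights themselves.
zero3_config = {
    "zero_optimization": {
        "stage": 3,
        # Same optimizer-state offload as stage 2:
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        # Stage-3 only: offload the parameters too, to CPU...
        "offload_param": {"device": "cpu", "pin_memory": True},
        # ...or to NVMe instead (path is illustrative):
        # "offload_param": {"device": "nvme", "nvme_path": "/local_nvme"},
    },
}
```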