A Guide to DeepSpeed Zero With the HuggingFace Trainer
A guide for making the most out of your GPUs!
Introduction
Training large models efficiently and effectively has become a pivotal challenge for researchers and developers alike. As models grow in complexity and size, the demand for more sophisticated optimization techniques has led to the development of innovative solutions.
One such groundbreaking solution is DeepSpeed Zero, a component of the DeepSpeed library, which aims to democratize AI by enabling the training of large models on available hardware without compromising speed or requiring significant resources. This guide will delve into what DeepSpeed Zero is, its benefits, and how it works.
Keep in mind there are tons of different configurations you can create with DeepSpeed, and covering them all would take a few tutorials, so we'll focus on the core configuration details that give you the most benefit in terms of GPU memory reduction!

Table of Contents
Introduction
Table of Contents
Understanding DeepSpeed Zero
Single GPU Tactics
CPU Offloading
Gradient Checkpointing
Low Precision Data Types
Memory Efficient Optimizers
Multi-GPU Tactics
Stage 1: Sharding Optimizer States
Stage 2: Partitioning Gradients
Stage 3: Partitioning Model Parameters
Training with HuggingFace and DeepSpeed
Full Speed Ahead
Related Articles
Understanding DeepSpeed Zero
DeepSpeed Zero is part of the broader DeepSpeed library, an open-source deep learning optimization library designed to reduce computational demands and improve training efficiency. Developed by Microsoft, DeepSpeed offers a suite of features that tackle various challenges in training large-scale models. Among these, Zero Redundancy Optimizer (ZeRO) stands out for its unique approach to optimizing memory usage and scaling training to unprecedented model sizes.
As models grow, the memory required for their parameters, gradients, and optimizer states can exceed the capacity of available GPUs, hindering the ability to train larger models efficiently. ZeRO is designed to address the limitations of traditional data-parallelism and model-parallelism techniques in distributed training. ZeRO tackles this problem by partitioning model states across the available devices, significantly reducing the memory footprint per device and enabling the training of models that were previously unattainable.
Traditional data parallelism (DP), while efficient in compute and communication, suffers from poor memory efficiency due to the replication of the entire model across all processes, leading to redundant memory consumption. Model Parallelism (MP), on the other hand, partitions models for better memory efficiency but at the cost of compute and communication efficiency, often resulting in fine-grained computation and expensive communication that hampers scalability.
Both approaches statically maintain all model states throughout the training, which is inefficient since not all states are needed at all times. To address these issues, Microsoft has created ZeRO-DP (Zero Redundancy Optimizer for Data Parallelism). It combines the best of both worlds by partitioning model states across processes like MP for memory efficiency, while still retaining the computational granularity and communication efficiency of DP. This is achieved through a dynamic communication schedule, effectively removing redundant memory consumption without sacrificing the advantages of DP.
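To make the savings concrete, the ZeRO paper estimates per-GPU memory for mixed-precision Adam training at roughly 2 bytes each for the fp16 parameters and gradients plus 12 bytes of optimizer state per parameter. The short sketch below applies those estimates to a hypothetical 7B-parameter model on 8 GPUs; activations and framework overhead are not included, so treat the numbers as illustrative only.

# Rough per-GPU memory estimates for mixed-precision Adam training,
# following the partitioning scheme described in the ZeRO paper.
# Illustrative only: activations and framework overhead are not included.
def zero_memory_estimates(num_params: float, num_gpus: int) -> dict:
    params = 2 * num_params   # fp16 parameters (2 bytes each)
    grads = 2 * num_params    # fp16 gradients (2 bytes each)
    optim = 12 * num_params   # fp32 master params + Adam momentum/variance (4 + 4 + 4 bytes)
    return {
        "plain data parallelism": params + grads + optim,
        "stage 1 (optimizer states sharded)": params + grads + optim / num_gpus,
        "stage 2 (+ gradients sharded)": params + (grads + optim) / num_gpus,
        "stage 3 (+ parameters sharded)": (params + grads + optim) / num_gpus,
    }

# Hypothetical example: a 7B-parameter model trained on 8 GPUs
for name, num_bytes in zero_memory_estimates(7e9, 8).items():
    print(f"{name}: ~{num_bytes / 1e9:.1f} GB per GPU")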
Single GPU Tactics
Although many of DeepSpeed's features were designed for use in a multi-GPU setting, there are also ways to utilize the library with only a single GPU. Below, I'll first go over a few tactics that can be employed in the single GPU setting.
CPU Offloading
One tactic of interest for the single GPU setting is CPU offloading. This approach offloads network parameters and optimizer states to CPU RAM, reducing the VRAM requirements on the GPU. This is particularly useful for managing large models on systems with limited GPU memory, enabling more demanding training runs without upgrading hardware. However, it's important to note that you will need an ample amount of CPU RAM, so keep this in mind when using CPU offloading.
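As a rough sketch of what this looks like, the config below enables ZeRO stage 2 with the optimizer states offloaded to CPU RAM. It's written here as a Python dict (the HuggingFace Trainer's deepspeed argument accepts either a JSON file path or an already-loaded dict); the values are placeholders, and the stage-specific configs are covered in detail later in this guide.

# A minimal sketch of a single-GPU DeepSpeed config with optimizer offloading.
# Values are placeholders; stage-specific configs are covered later in the guide.
single_gpu_offload_config = {
    "zero_optimization": {
        "stage": 2,                  # offloading requires ZeRO stage 1, 2, or 3
        "offload_optimizer": {
            "device": "cpu",         # keep optimizer states in CPU RAM
            "pin_memory": True,      # pinned memory speeds up host-to-GPU transfers
        },
    },
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 1,
    "fp16": {"enabled": True},
}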
Gradient Checkpointing
Separately, another effective strategy for single GPU setups is activation checkpointing, also known as gradient checkpointing. This technique reduces VRAM requirements by saving only a subset of intermediate activations during the forward pass and recomputing the rest during the backward pass.
Although this can increase the computational overhead, it significantly lowers the memory footprint, making it possible to train larger models or use larger batch sizes on a single GPU. I typically enable this outside of the DeepSpeed config file by setting gradient_checkpointing to true in the HuggingFace TrainingArguments class.
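To make this concrete, here is a minimal sketch of the two common ways to enable it with HuggingFace: through TrainingArguments (the approach used later in this guide) or directly on the model. The model name is just the TinyLlama checkpoint used later in the tutorial; treat it as a placeholder.

from transformers import AutoModelForCausalLM, TrainingArguments

# Option 1: let the Trainer enable it (the approach used later in this guide)
training_args = TrainingArguments(
    output_dir="./results",
    gradient_checkpointing=True,   # recompute activations during the backward pass
)

# Option 2: enable it directly on the model (placeholder model name)
model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-step-50K-105b")
model.gradient_checkpointing_enable()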
Low Precision Data Types
In DeepSpeed, FP16 and BF16 are 16-bit precision formats used to enhance training speed and efficiency. FP16 reduces memory usage and accelerates computation by using 16-bit floating-point numbers, often substantially increasing throughput on supported hardware. BF16 trades some precision for the same dynamic range as FP32, which makes it more numerically stable than FP16 and a good fit for newer CPUs and GPUs that support it.
Both formats require hardware support for full efficiency gains, with DeepSpeed automating their implementation to balance performance with precision, enabling faster, larger-scale model training.
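For reference, these are typically enabled through the fp16 or bf16 section of the DeepSpeed config, and only one of the two should be enabled at a time. The sketch below writes the config as a Python dict; the "auto" value (supported by the HuggingFace integration) lets the Trainer fill in the setting from its own fp16/bf16 flags.

# Mixed-precision sections of a DeepSpeed config, written as a Python dict.
# Enable only one of fp16 or bf16. "auto" lets the HuggingFace Trainer fill the value
# from its own fp16/bf16 flags when the config is passed to TrainingArguments.
mixed_precision_config = {
    "fp16": {
        "enabled": "auto",    # dynamic loss scaling is applied for fp16 training
    },
    "bf16": {
        "enabled": False,     # bf16 needs hardware support (e.g., NVIDIA Ampere or newer)
    },
}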
Memory Efficient Optimizers
DeepSpeed offers a variety of high-performance optimizers tailored for different hardware and model requirements (a configuration sketch follows this list):
Adam (CPU): A versatile optimizer that adapts learning rates for each parameter, suitable for a wide range of deep learning tasks on CPU environments.
AdamW (CPU): A variant of Adam with decoupled weight decay regularization, typically leading to better training stability and model performance, particularly in models prone to overfitting.
FusedAdam (GPU): Enhances Adam by fusing operations for improved performance on GPUs, requiring Apex installation for optimal utilization.
FusedLamb (GPU): Applies the LAMB algorithm for large batch training, optimizing training efficiency on GPU with support for adaptive learning rates.
OnebitAdam (GPU): Introduces 1-bit compression to reduce communication overhead for distributed training, ideal for scaling across multiple GPUs.
ZeroOneAdam (GPU): Combines advantages of zero and one-bit quantization to further optimize distributed training efficiency and communication.
OnebitLamb (GPU): Implements the LAMB algorithm with 1-bit compression, enhancing model training scalability and reducing GPU memory footprint.
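To use one of these optimizers, you typically declare it in the optimizer section of the DeepSpeed config rather than constructing it in your training script. Below is a minimal sketch, written as a Python dict, that selects AdamW; the hyperparameter values are placeholders you would tune for your own model. If you enable optimizer offloading, DeepSpeed generally relies on its CPU Adam/AdamW implementations listed above, since the optimizer step then runs on the CPU.

# Selecting an optimizer in the DeepSpeed config (written as a Python dict).
# Hyperparameter values are placeholders; adjust them for your own model.
optimizer_config = {
    "optimizer": {
        "type": "AdamW",          # one of the optimizers listed above
        "params": {
            "lr": 2e-5,
            "betas": [0.9, 0.999],
            "eps": 1e-8,
            "weight_decay": 0.01,
        },
    },
}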
Multi-GPU Tactics
For multiple GPUs, DeepSpeed offers many more tools for reducing VRAM requirements. There are three main "stages" that are commonly used, and I'll go over each one below. Note that these stages can also be used with a single GPU when combined with CPU offloading.
Stage 1: Sharding Optimizer States
Stage 1 of DeepSpeed Zero addresses the memory usage of the optimizer states. Optimizer states, particularly for optimizers like Adam or RMSprop, can consume as much memory as the model parameters themselves. In traditional setups, each GPU maintains its own copy of the optimizer states, leading to a significant memory overhead.
Stage 1 solves this problem by sharding the optimizer states across GPUs, similar to the partitioning applied to gradients and parameters in the later stages. Each GPU holds only a part of the optimizer states, and during the update step, only the relevant optimizer states are updated based on the distributed parameters and gradients. Here is an example stage 1 configuration:
{"zero_optimization": {"stage": 1},"gradient_accumulation_steps": 1,"train_micro_batch_size_per_gpu": 1,"gradient_clipping": 1.0,"fp16": {"enabled": true}}
As I mentioned earlier, it's also possible to offload the optimizer states to the CPU. This can be accomplished by adding an offload_optimizer block inside the zero_optimization section, as shown below. It's also important to note that offloading the optimizer to the CPU is valid for ZeRO stages 1, 2, and 3!
{"zero_optimization": {"stage": 1},"offload_optimizer": {"device": "cpu"}"gradient_accumulation_steps": 1,"train_micro_batch_size_per_gpu": 1,"gradient_clipping": 1.0,"fp16": {"enabled": true}}
Stage 2: Partitioning Gradients
The second stage of DeepSpeed Zero focuses on optimizing memory usage by partitioning gradients across all GPUs involved in the training process. In traditional distributed training, each GPU stores a complete set of gradients, which can quickly lead to memory bottlenecks as model sizes increase.
Stage 2 addresses this issue by ensuring that each GPU holds only a fraction of the gradients. This means that, during the backward pass, gradients are calculated and then distributed among GPUs so that no single GPU holds the entire set of gradients. This approach significantly reduces the memory required for training, allowing for larger batch sizes or model sizes.
Here is an example of a stage 2 config JSON file:
{"zero_optimization": {"stage": 2},"gradient_accumulation_steps": 1,"train_micro_batch_size_per_gpu": 1,"gradient_clipping": 1.0,"fp16": {"enabled": true}}
Stage 3: Partitioning Model Parameters
Building on the foundation laid in the first two stages, Stage 3 extends the partitioning strategy to the model's parameters. In this stage, the parameters of the model are divided across the GPUs, so each GPU is responsible for storing and updating only a portion of the model's parameters.
This partitioning dramatically reduces the memory requirements per GPU for storing the model, enabling the training of much larger models than would otherwise be possible with available hardware.
During training, parameters needed for computations on a given GPU are dynamically communicated across GPUs, ensuring that each device has access to the necessary parameters for its portion of the computation. Additionally, you can offload the model parameters to the CPU, which further reduces GPU VRAM requirements. However, keep in mind that you will need an ample amount of CPU RAM to store the parameters, so make sure your machine is suited for these new requirements. Here is an example stage 3 configuration that offloads both the parameters and the optimizer states to the CPU, with pinned memory enabled for faster host-to-GPU transfers:
{"zero_optimization": {"stage": 3,"offload_param": {"device": "cpu","pin_memory": true},"offload_optimizer": {"device": "cpu","pin_memory": true}},"gradient_accumulation_steps": 1,"train_micro_batch_size_per_gpu": 1,"gradient_clipping": 1.0,"fp16": {"enabled": true}}
The same offloading can also be configured without pinned memory, which reduces the amount of page-locked host RAM the process claims at the cost of slower CPU-to-GPU transfers. Here is an example below:
{"zero_optimization": {"stage": 3,"offload_param": {"device": "cpu"},"offload_optimizer": {"device": "cpu"}},"gradient_accumulation_steps": 1,"train_micro_batch_size_per_gpu": 1,"gradient_clipping": 1.0,"fp16": {"enabled": true}}
Training with HuggingFace and DeepSpeed
If you're interested in testing out these configurations, here is a training script that can be used with any of the DeepSpeed configs above. Keep in mind that in order to run this script, you will need to replace the usual 'python' command with 'deepspeed': if your script is named train.py, you will run the command deepspeed train.py.
The key component of this script is passing our DeepSpeed config file to the TrainingArguments, which allows the HuggingFace Trainer to utilize the config.
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset


def main():
    # Load the tokenizer
    tokenizer = AutoTokenizer.from_pretrained('TinyLlama/TinyLlama-1.1B-step-50K-105b')
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})

    # Load dataset from the Hugging Face datasets library
    dataset = load_dataset("wikitext", "wikitext-2-raw-v1")

    # Tokenize the texts
    def tokenize_function(examples):
        return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=512)

    tokenized_datasets = dataset.map(tokenize_function, batched=True)

    # Load the data collator
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer,
        mlm=False,  # TinyLlama uses a causal (not masked) language model, similar to GPT-2
    )

    # Load the model
    model = AutoModelForCausalLM.from_pretrained('TinyLlama/TinyLlama-1.1B-step-50K-105b')
    # Resize the embeddings to account for the newly added [PAD] token
    model.resize_token_embeddings(len(tokenizer))

    # Define the training arguments
    training_args = TrainingArguments(
        output_dir='./results',
        overwrite_output_dir=True,
        num_train_epochs=3,
        per_device_train_batch_size=1,
        save_steps=10_000,
        save_total_limit=2,
        fp16=True,
        deepspeed="path_to_deepspeed_config.json",  # Path to DeepSpeed config file
        gradient_checkpointing=True,
        report_to='wandb',
    )

    # Initialize the Trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        data_collator=data_collator,
        train_dataset=tokenized_datasets["train"],
        eval_dataset=tokenized_datasets["validation"],
    )

    # Start the training
    trainer.train()

    # Save the final model and tokenizer
    model.save_pretrained('./final_model')
    tokenizer.save_pretrained('./final_model')


if __name__ == "__main__":
    main()
Note that I set the gradient_checkpointing flag to true in the training arguments, which enables the gradient checkpointing we discussed earlier. Additionally, I set the report_to arg to 'wandb' in order to log my results to Weights and Biases.
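One small convenience worth knowing: instead of pointing the deepspeed argument at a JSON file, you can pass the configuration as an already-loaded Python dict. A minimal sketch, reusing the stage 2 config from earlier, might look like this:

from transformers import TrainingArguments

# The deepspeed argument accepts either a path to a JSON file or a config dict.
# This sketch reuses the stage 2 config shown earlier, inlined as a Python dict.
ds_config = {
    "zero_optimization": {"stage": 2},
    "gradient_accumulation_steps": 1,
    "train_micro_batch_size_per_gpu": 1,
    "gradient_clipping": 1.0,
    "fp16": {"enabled": True},
}

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=1,
    fp16=True,
    deepspeed=ds_config,   # pass the dict directly instead of a file path
)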
Full Speed Ahead
Wrapping up, DeepSpeed Zero significantly changes the game in training large AI models. Its smart handling of memory lets researchers and developers push the boundaries of model sizes on regular hardware setups. With DeepSpeed Zero, the focus shifts from hardware limitations to what you can achieve with your models, simplifying the process of tackling complex AI challenges. This innovation is a big deal for anyone in the field looking to scale up their models efficiently. If you have any questions or comments, feel free to drop them below, and I hope you enjoyed this tutorial!
Related Articles
Building a RAG-Based Digital Restaurant Menu with LlamaIndex and W&B Weave
Powered by RAG, we will transform the traditional restaurant PDF menu into an AI powered interactive menu!
Fine-Tuning Mistral7B on Python Code With A Single GPU!
A tutorial for fine-tuning Mistral7B on Python Code using a single GPU!
How to Fine-Tune LLaVA on a Custom Dataset
A tutorial for fine-tuning LLaVA on your own data!
Skin Lesion Classification on HAM10000 with HuggingFace using PyTorch and W&B
Explore the use of HuggingFace, PyTorch, and W&B for classifying skin lesions with the HAM10000 dataset. We will build, train, and evaluate models for medical diagnostics!