How To Implement Gradient Accumulation in PyTorch
In this article, we learn how to implement gradient accumulation in PyTorch in a short tutorial complete with code and interactive visualizations so you can try it for yourself.
Created on June 27|Last edited on August 22
I conducted some experiments, so now it is possible to win NLP competitions on old single GPUs such as Tesla P100. Training `deberta-v3-large` with 512 input sequences length and batch size of 4.
— Vadim (@vad13irt) June 19, 2022
In this article, we'll look at how you can implement gradient accumulation in PyTorch for writing compute-efficient training loops.
Unlike TensorFlow, PyTorch makes it straightforward to write compute-efficient training loops, and gradient accumulation can be added to one with just a couple of lines of code.
For a closer look at the various mixed-precision methods available in PyTorch, you can refer to the official documentation.
Show Me the Code
Most PyTorch training loops are of the following form:
optimizer = ...

for epoch in range(...):
    for i, sample in enumerate(dataloader):
        inputs, labels = sample

        # Forward Pass
        outputs = model(inputs)

        # Compute Loss and Perform Back-propagation
        loss = loss_fn(outputs, labels)
        loss.backward()

        # Update Optimizer
        optimizer.step()
        optimizer.zero_grad()
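Note that the explicit optimizer.zero_grad() call is needed because loss.backward() adds new gradients to whatever is already stored in each parameter's .grad attribute rather than overwriting it. Here is a minimal sketch with a toy tensor (not from the article) to illustrate that default behavior:

import torch

# Toy example: backward() keeps accumulating into .grad until the gradients are cleared.
w = torch.tensor([2.0], requires_grad=True)

(w * 3).backward()   # d(3w)/dw = 3
print(w.grad)        # tensor([3.])

(w * 3).backward()   # the new gradient is added, not overwritten
print(w.grad)        # tensor([6.])

w.grad.zero_()       # roughly what optimizer.zero_grad() does for every parameter
print(w.grad)        # tensor([0.])

This accumulate-by-default behavior is exactly what we will lean on below: gradient accumulation boils down to postponing the optimizer.step() and optimizer.zero_grad() calls.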
Also, if you've been training deep learning models in PyTorch for a while, especially large language models (LLMs), you've probably run into the following CUDA error:
RuntimeError: CUDA out of memory. Tried to allocate .. MiB (.. GiB total capacity; ... GiB already allocated;... MiB free; ... GiB reserved in total by PyTorch)
For a much more detailed overview of how to prevent CUDA OOM errors, kindly refer to the following report:
While training large language models, it has been shown that larger batch sizes can lead to better convergence. But more often than not, you can't fit a bigger batch size on your machine. So what's the solution? We can use something called gradient accumulation. In the code snippet above, we take a batch of data, run a forward pass, compute the loss, run backpropagation, and immediately update the weights. With gradient accumulation, instead of updating after every batch, we let the gradients from several backward passes add up and only then run the optimizer step. In other words, we "accumulate" gradients for a number of steps before updating the weights.
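For a sense of the arithmetic (the numbers below are illustrative assumptions, not measurements): if only a small micro-batch fits in GPU memory, accumulating gradients over several of them gives the optimizer an update equivalent to a much larger batch.

# Illustrative numbers only:
micro_batch_size = 4           # what fits on the GPU in one forward/backward pass
NUM_ACCUMULATION_STEPS = 8     # backward passes accumulated before optimizer.step()
effective_batch_size = micro_batch_size * NUM_ACCUMULATION_STEPS
print(effective_batch_size)    # 32 samples contribute to each weight update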
Now that we know what gradient accumulation is, let's see how we can make it work in a PyTorch training loop:
optimizer = ...
NUM_ACCUMULATION_STEPS = ...

for epoch in range(...):
    for idx, sample in enumerate(dataloader):
        inputs, labels = sample

        # Forward Pass
        outputs = model(inputs)

        # Compute Loss and Perform Back-propagation
        loss = loss_fn(outputs, labels)

        # Normalize the Gradients
        loss = loss / NUM_ACCUMULATION_STEPS
        loss.backward()

        if ((idx + 1) % NUM_ACCUMULATION_STEPS == 0) or (idx + 1 == len(dataloader)):
            # Update Optimizer
            optimizer.step()
            optimizer.zero_grad()
That's all it takes!
- We divide the loss by the number of gradient accumulation steps, so that the accumulated gradient matches what a single large batch would have produced (the sketch after this list verifies this on a toy model).
- We only call optimizer.step() and optimizer.zero_grad() once every NUM_ACCUMULATION_STEPS batches, or on the last batch of the dataloader so the final partial accumulation still produces an update.
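If you want to convince yourself that the loss normalization is the right thing to do, here is a minimal verification sketch. The toy linear model, random data, and batch split are assumptions made for illustration; the accumulated gradient should match the gradient of a single full batch up to floating-point error.

import torch

torch.manual_seed(0)
model = torch.nn.Linear(10, 1)
loss_fn = torch.nn.MSELoss()

inputs = torch.randn(32, 10)
labels = torch.randn(32, 1)

# 1) Gradient from one full batch of 32 samples.
model.zero_grad()
loss_fn(model(inputs), labels).backward()
full_batch_grad = model.weight.grad.clone()

# 2) Gradient accumulated over 4 micro-batches of 8,
#    with each loss divided by NUM_ACCUMULATION_STEPS.
NUM_ACCUMULATION_STEPS = 4
model.zero_grad()
for micro_inputs, micro_labels in zip(inputs.chunk(NUM_ACCUMULATION_STEPS),
                                      labels.chunk(NUM_ACCUMULATION_STEPS)):
    loss = loss_fn(model(micro_inputs), micro_labels) / NUM_ACCUMULATION_STEPS
    loss.backward()

# Should print True (up to floating-point error).
print(torch.allclose(full_batch_grad, model.weight.grad, atol=1e-6))

The two gradients agree because MSELoss averages over the batch, so dividing each micro-batch loss by the number of accumulation steps reproduces the full-batch average. If your loss uses a sum reduction instead, the normalization isn't needed.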
Summary
In this article, you saw how you could implement gradient accumulation in PyTorch for writing compute-efficient training loops.
To see the full suite of W&B features, please check out this short 5-minute guide. If you want more reports covering the math and from-scratch code implementations, let us know in the comments below or on our forum ✨!
Check out these other reports on Fully Connected covering other fundamental development topics like GPU Utilization and Saving Models.
Recommended Reading
Preventing The CUDA Out Of Memory Error In PyTorch
A short tutorial on how you can avoid the "RuntimeError: CUDA out of memory" error while using the PyTorch framework.
How To Use GradScaler in PyTorch
In this article, we explore how to implement automatic gradient scaling (GradScaler) in a short tutorial complete with code and interactive visualizations.
How To Use Autocast in PyTorch
In this article, we learn how to implement Tensor Autocasting in a short tutorial, complete with code and interactive visualizations, so you can try it yourself.
How to Set Random Seeds in PyTorch and Tensorflow
Learn how to set the random seed for everything in PyTorch and Tensorflow in this short tutorial, which comes complete with code and interactive visualizations.
Setting Up TensorFlow And PyTorch Using GPU On Docker
A short tutorial on setting up TensorFlow and PyTorch deep learning models on GPUs using Docker.
How To Calculate Number of Model Parameters for PyTorch and TensorFlow Models
This article provides a short tutorial on calculating the number of parameters for TensorFlow and PyTorch deep learning models, with examples for you to follow.