
How To Implement Gradient Accumulation in PyTorch

In this article, we learn how to implement gradient accumulation in PyTorch in a short tutorial complete with code and interactive visualizations so you can try it for yourself.

In this article, we'll look at how you can implement gradient accumulation in PyTorch for writing compute-efficient training loops.
Unlike TensorFlow, PyTorch makes it straightforward to write compute-efficient training loops, and adding gradient accumulation to one takes just a couple of lines of code.
For a closer look at the various mixed precision methods available in PyTorch, you can refer to the official documentation; we'll also sketch how mixed precision composes with gradient accumulation at the end of the next section.

Table of Contents

Show Me the Code
Summary

Show Me the Code

Most PyTorch training loops are of the following form:
optimizer = ...

for epoch in range(...):
    for i, sample in enumerate(dataloader):
        inputs, labels = sample

        # Forward pass
        outputs = model(inputs)

        # Compute the loss and perform backpropagation
        loss = loss_fn(outputs, labels)
        loss.backward()

        # Update the parameters, then reset the gradients
        optimizer.step()
        optimizer.zero_grad()
Also, if you've been training deep learning models in PyTorch for a while, especially large language models (LLMs), you've probably run into the following CUDA error:
RuntimeError: CUDA out of memory. Tried to allocate .. MiB (.. GiB total capacity; ... GiB already allocated;
... MiB free; ... GiB reserved in total by PyTorch)
For a much more detailed overview of how to prevent CUDA OOM errors, refer to our dedicated report on the topic.
While training large language models, larger batch sizes have been shown to improve convergence. But more often than not, a bigger batch simply won't fit on your machine. So what's the solution? We can use something called "gradient accumulation." As shown in the code snippet above, we normally take a batch of data, compute a forward pass, run backpropagation, and immediately update the weights. With gradient accumulation, we instead defer the optimizer update: we run the forward and backward passes as usual, letting the gradients accumulate over several batches, and only then take an optimizer step. For example, with a per-step batch size of 16 and 4 accumulation steps, each optimizer update effectively sees a batch of 64 samples.
Now that we know what gradient accumulation is, let's see how to make it work in a PyTorch training loop:
optimizer = ...
NUM_ACCUMULATION_STEPS = ...

for epoch in range(...):
    for idx, sample in enumerate(dataloader):
        inputs, labels = sample

        # Forward pass
        outputs = model(inputs)

        # Compute the loss and normalize it by the number of accumulation steps
        loss = loss_fn(outputs, labels)
        loss = loss / NUM_ACCUMULATION_STEPS

        # Backward pass: gradients accumulate in .grad across iterations
        loss.backward()

        # Update the optimizer only every NUM_ACCUMULATION_STEPS iterations,
        # or at the end of the dataloader, then reset the gradients
        if ((idx + 1) % NUM_ACCUMULATION_STEPS == 0) or (idx + 1 == len(dataloader)):
            optimizer.step()
            optimizer.zero_grad()
That's all it takes!
  1. We normalize the loss by the number of gradient accumulation steps, so that the accumulated gradient matches what one large batch would produce.
  2. We update the optimizer (and zero out the gradients) only every NUM_ACCUMULATION_STEPS iterations, or at the end of the dataloader.
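To convince yourself that this is equivalent to computing the gradient of one large batch, here is a minimal, self-contained sketch. The toy linear model, random data, and micro-batch sizes are illustrative assumptions, not from this article:

import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()

inputs = torch.randn(8, 10)
labels = torch.randn(8, 1)

# Gradient of one full batch of 8 samples
loss_fn(model(inputs), labels).backward()
full_grad = model.weight.grad.clone()

# Same data, gradients accumulated over 4 micro-batches of 2 samples each
NUM_ACCUMULATION_STEPS = 4
model.zero_grad()
for micro_inputs, micro_labels in zip(inputs.chunk(NUM_ACCUMULATION_STEPS),
                                      labels.chunk(NUM_ACCUMULATION_STEPS)):
    # Normalizing each micro-batch loss makes the accumulated gradient
    # match the mean-reduced loss over the full batch
    loss = loss_fn(model(micro_inputs), micro_labels) / NUM_ACCUMULATION_STEPS
    loss.backward()

print(torch.allclose(full_grad, model.weight.grad, atol=1e-6))  # True

Because loss.backward() adds to the .grad buffers rather than overwriting them, the four normalized micro-batch gradients sum to the full-batch gradient up to floating-point error.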

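Gradient accumulation also composes naturally with the mixed precision methods mentioned earlier. Below is a hedged sketch of the standard torch.cuda.amp pattern, assuming model, optimizer, dataloader, loss_fn, and NUM_ACCUMULATION_STEPS are defined as in the loop above:

scaler = torch.cuda.amp.GradScaler()

for epoch in range(...):
    for idx, (inputs, labels) in enumerate(dataloader):
        # Run the forward pass and loss computation in mixed precision
        with torch.cuda.amp.autocast():
            outputs = model(inputs)
            loss = loss_fn(outputs, labels) / NUM_ACCUMULATION_STEPS

        # Scale the loss and accumulate scaled gradients
        scaler.scale(loss).backward()

        if ((idx + 1) % NUM_ACCUMULATION_STEPS == 0) or (idx + 1 == len(dataloader)):
            scaler.step(optimizer)  # unscales the gradients, then steps
            scaler.update()
            optimizer.zero_grad()

Note that scaler.step() and scaler.update() are only called on accumulation boundaries, mirroring where optimizer.step() sits in the unscaled loop.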
Summary

In this article, you saw how you could implement gradient accumulation in PyTorch for writing compute-efficient training loops.
To see the full suite of W&B features, please check out this short 5-minute guide. If you want more reports covering the math and from-scratch code implementations, let us know in the comments below or on our forum ✨!
Check out these other reports on Fully Connected covering other fundamental development topics like GPU Utilization and Saving Models.
