
How To Implement Gradient Accumulation in PyTorch

In this article, we learn how to implement gradient accumulation in PyTorch in a short tutorial complete with code and interactive visualizations so you can try it for yourself.

In this article, we'll look at how you can implement gradient accumulation in PyTorch for writing compute-efficient training loops.
Unlike TensorFlow, PyTorch makes it straightforward to write compute-efficient training loops, and adding gradient accumulation to one takes just a couple of lines of code.
For a closer look at the various mixed precision methods available in PyTorch, you can refer to the official documentation; we'll also sketch how mixed precision composes with gradient accumulation at the end of the next section.

Table of Contents

Show Me the Code
Summary

Show Me the Code

Most PyTorch training loops are of the following form:
optimizer = ...

for epoch in range(...):
    for i, sample in enumerate(dataloader):
        inputs, labels = sample

        # Forward pass
        outputs = model(inputs)

        # Compute the loss and perform backpropagation
        loss = loss_fn(outputs, labels)
        loss.backward()

        # Update the parameters, then reset the gradients
        optimizer.step()
        optimizer.zero_grad()
Also, if you've been training deep learning models in PyTorch for a while, especially large language models (LLMs), you've probably run into the following CUDA error:
RuntimeError: CUDA out of memory. Tried to allocate .. MiB (.. GiB total capacity; ... GiB already allocated;
... MiB free; ... GiB reserved in total by PyTorch)
For a much more detailed overview of how to prevent CUDA OOM errors, refer to our dedicated report on the topic.
While training large language models, larger batch sizes have been shown to improve convergence. But more often than not, a bigger batch simply won't fit on your machine. So what's the solution? We can use something called "gradient accumulation." As shown in the code snippet above, we normally take a batch of data, compute a forward pass, run backpropagation, and immediately update the weights. With gradient accumulation, we instead defer the optimizer update: we run the forward and backward passes as usual, letting the gradients accumulate over several batches, and only then take an optimizer step. For example, with a per-step batch size of 16 and 4 accumulation steps, each optimizer update effectively sees a batch of 64 samples.
Now that we know what gradient accumulation is, let's see how to make it work in a PyTorch training loop:
optimizer = ...
NUM_ACCUMULATION_STEPS = ...

for epoch in range(...):
    for idx, sample in enumerate(dataloader):
        inputs, labels = sample

        # Forward pass
        outputs = model(inputs)

        # Compute the loss and normalize it by the number of accumulation steps
        loss = loss_fn(outputs, labels)
        loss = loss / NUM_ACCUMULATION_STEPS

        # Backward pass: gradients accumulate in .grad across iterations
        loss.backward()

        # Update the optimizer only every NUM_ACCUMULATION_STEPS iterations,
        # or at the end of the dataloader, then reset the gradients
        if ((idx + 1) % NUM_ACCUMULATION_STEPS == 0) or (idx + 1 == len(dataloader)):
            optimizer.step()
            optimizer.zero_grad()
That's all it takes!
  1. We normalize the loss by the number of gradient accumulation steps, so that the accumulated gradient matches what one large batch would produce.
  2. We update the optimizer (and zero out the gradients) only every NUM_ACCUMULATION_STEPS iterations, or at the end of the dataloader.
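To convince yourself that this is equivalent to computing the gradient of one large batch, here is a minimal, self-contained sketch. The toy linear model, random data, and micro-batch sizes are illustrative assumptions, not from this article:

import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()

inputs = torch.randn(8, 10)
labels = torch.randn(8, 1)

# Gradient of one full batch of 8 samples
loss_fn(model(inputs), labels).backward()
full_grad = model.weight.grad.clone()

# Same data, gradients accumulated over 4 micro-batches of 2 samples each
NUM_ACCUMULATION_STEPS = 4
model.zero_grad()
for micro_inputs, micro_labels in zip(inputs.chunk(NUM_ACCUMULATION_STEPS),
                                      labels.chunk(NUM_ACCUMULATION_STEPS)):
    # Normalizing each micro-batch loss makes the accumulated gradient
    # match the mean-reduced loss over the full batch
    loss = loss_fn(model(micro_inputs), micro_labels) / NUM_ACCUMULATION_STEPS
    loss.backward()

print(torch.allclose(full_grad, model.weight.grad, atol=1e-6))  # True

Because loss.backward() adds to the .grad buffers rather than overwriting them, the four normalized micro-batch gradients sum to the full-batch gradient up to floating-point error.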

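Gradient accumulation also composes naturally with the mixed precision methods mentioned earlier. Below is a hedged sketch of the standard torch.cuda.amp pattern, assuming model, optimizer, dataloader, loss_fn, and NUM_ACCUMULATION_STEPS are defined as in the loop above:

scaler = torch.cuda.amp.GradScaler()

for epoch in range(...):
    for idx, (inputs, labels) in enumerate(dataloader):
        # Run the forward pass and loss computation in mixed precision
        with torch.cuda.amp.autocast():
            outputs = model(inputs)
            loss = loss_fn(outputs, labels) / NUM_ACCUMULATION_STEPS

        # Scale the loss and accumulate scaled gradients
        scaler.scale(loss).backward()

        if ((idx + 1) % NUM_ACCUMULATION_STEPS == 0) or (idx + 1 == len(dataloader)):
            scaler.step(optimizer)  # unscales the gradients, then steps
            scaler.update()
            optimizer.zero_grad()

Note that scaler.step() and scaler.update() are only called on accumulation boundaries, mirroring where optimizer.step() sits in the unscaled loop.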
Summary

In this article, you saw how you could implement gradient accumulation in PyTorch for writing compute-efficient training loops.
To see the full suite of W&B features, please check out this short 5-minute guide. If you want more reports covering the math and from-scratch code implementations, let us know in the comments below or on our forum ✨!
Check out these other reports on Fully Connected covering other fundamental development topics like GPU Utilization and Saving Models.
