
How To Use GradScaler in PyTorch

In this article, we explore how to implement automatic gradient scaling (GradScaler) in a short tutorial complete with code and interactive visualizations.
Created on June 14|Last edited on January 27

In this article, we'll look at how to use torch.cuda.amp.GradScaler in PyTorch to implement automatic gradient scaling and write compute-efficient training loops.
Unlike TensorFlow, PyTorch provides a simple interface for these compute-efficient methods, which we can add to an existing training loop with just a couple of lines of code.
For a closer look at the various mixed-precision methods available in PyTorch, you can refer to the official documentation.
Show Me the Code

Most PyTorch training loops are of the following form:
optimizer = ...

for epoch in range(...):
    for i, sample in enumerate(dataloader):
        inputs, labels = sample
        optimizer.zero_grad()

        # Forward pass
        outputs = model(inputs)
        # Compute the loss and back-propagate
        loss = loss_fn(outputs, labels)
        loss.backward()
        # Update the optimizer
        optimizer.step()
Also, if you've been training deep learning models in PyTorch for a while, especially large language models (LLMs), you've probably run into the following CUDA error:
RuntimeError: CUDA out of memory. Tried to allocate .. MiB (.. GiB total capacity; ... GiB already allocated;
... MiB free; ... GiB reserved in total by PyTorch)
For a much more detailed overview of how to prevent CUDA OOM errors, refer to the following reports:



One common problem when training large deep learning models in reduced precision is gradient underflow: the gradients become too small for float16 to represent and are flushed to zero, since float16 cannot encode extremely small variations.
Not to be confused with vanishing gradients, these gradients would still contribute to the learning process; they are lost purely because of the limited numeric range. To prevent this, we can scale the loss (and therefore the gradients) by some factor so that small values survive the backward pass.
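To make underflow concrete, here is a small sketch (assuming a recent PyTorch build) showing that a value easily representable in float32 is flushed to zero when cast to float16, while a scaled copy survives:

```python
import torch

# The smallest positive subnormal float16 value is about 6e-8;
# anything smaller is flushed to zero when cast down from float32.
tiny = torch.tensor(1e-8, dtype=torch.float32)

print(tiny.half().item())            # 0.0 -- the signal is lost in float16
print((tiny * 65536).half().item())  # ~6.55e-4 -- the scaled copy survives
```

The factor 65536 (2**16) here happens to be the default initial scale that GradScaler uses; the scaler then grows or shrinks this factor dynamically during training.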
Let's see how you can use GradScaler in your training loop:
scaler = torch.cuda.amp.GradScaler()
optimizer = ...

for epoch in range(...):
    for i, sample in enumerate(dataloader):
        inputs, labels = sample
        optimizer.zero_grad()

        # Forward pass
        outputs = model(inputs)
        # Compute the loss
        loss = loss_fn(outputs, labels)
        # ⭐️ Scale the loss, then back-propagate the scaled gradients
        scaler.scale(loss).backward()
        # ⭐️ Unscale the gradients and step the optimizer
        scaler.step(optimizer)
        # ⭐️ Update the scale factor for the next iteration
        scaler.update()
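As the official AMP recipe notes, GradScaler is meant to be paired with torch.autocast, which runs the forward pass in reduced precision (otherwise the gradients are float32 and there is nothing to rescue from underflow). Here is a minimal end-to-end sketch; the model, data, and hyperparameters are placeholders:

```python
import torch
import torch.nn as nn

# A toy model and batch stand in for your real training setup.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(16, 2).to(device)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Scaling only matters for float16 on CUDA; enabled=False makes the
# scaler a transparent no-op so the same loop also runs on CPU.
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

inputs = torch.randn(8, 16, device=device)
labels = torch.randint(0, 2, (8,), device=device)

optimizer.zero_grad()
# Run the forward pass (and loss) in reduced precision under autocast
with torch.autocast(device_type=device,
                    dtype=torch.float16 if device == "cuda" else torch.bfloat16):
    outputs = model(inputs)
    loss = loss_fn(outputs, labels)

scaler.scale(loss).backward()  # back-propagate the scaled loss
scaler.step(optimizer)         # unscales gradients, then optimizer.step()
scaler.update()                # adjust the scale factor for next iteration
```

Note that scaler.step(optimizer) unscales the gradients before stepping, and skips the step entirely if it finds infs or NaNs in them, so the optimizer only ever sees unscaled gradients.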
Summary

In this article, you saw how to use torch.cuda.amp.GradScaler in PyTorch to implement automatic gradient scaling for compute-efficient training loops, and how monitoring your metrics with Weights & Biases can lead to valuable insights.
To see the full suite of W&B features, check out this short 5-minute guide. If you want more reports covering the math and from-scratch code implementations, let us know in the comments below or on our forum ✨!
Check out these other reports on Fully Connected covering other fundamental development topics like GPU Utilization and Saving Models.

Daniel Turner:
If using GradScaler, how does this affect the gradients that are logged using wandb.watch()? Is there a way to log the unscaled, rather than scaled, gradients?

Bence Krénusz:
Hey there, thanks for the article. Referring to this tutorial: https://pytorch.org/tutorials/recipes/recipes/amp_recipe.html, GradScaler usage is bound to torch.autocast().