
How To Use GradScaler in PyTorch

In this article, we explore how to implement automatic gradient scaling (GradScaler) in a short tutorial complete with code and interactive visualizations.
Created on June 14|Last edited on January 27

In this article, we'll look at how to use torch.cuda.amp.GradScaler in PyTorch to implement automatic gradient scaling and write compute-efficient training loops.
Unlike TensorFlow, PyTorch provides a simple interface for these compute-efficient methods, which we can add to an existing training loop with just a couple of lines of code.
For a closer look at the various mixed-precision methods available in PyTorch, you can refer to the official documentation.
Show Me the Code

Most PyTorch training loops are of the following form:
optimizer = ...

for epoch in range(...):
    for i, sample in enumerate(dataloader):
        inputs, labels = sample
        optimizer.zero_grad()

        # Forward pass
        outputs = model(inputs)
        # Compute the loss and back-propagate
        loss = loss_fn(outputs, labels)
        loss.backward()
        # Update the optimizer
        optimizer.step()
Also, if you've been training deep learning models in PyTorch for a while, especially large language models (LLMs), you've probably run into the following CUDA error:
RuntimeError: CUDA out of memory. Tried to allocate .. MiB (.. GiB total capacity; ... GiB already allocated;
... MiB free; ... GiB reserved in total by PyTorch)
For a much more detailed overview of how to prevent CUDA OOM errors, refer to the following reports:



One common problem when training large deep learning models in reduced precision is gradient underflow: the gradients become too small for float16 to represent and are flushed to zero, since float16 cannot encode extremely small variations.
Not to be confused with vanishing gradients, these gradients would still contribute to the learning process; they are lost purely because of the limited numeric range. To prevent this, we can scale the loss (and therefore the gradients) by some factor so that small values survive the backward pass.
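To make underflow concrete, here is a small sketch (assuming a recent PyTorch build) showing that a value easily representable in float32 is flushed to zero when cast to float16, while a scaled copy survives:

```python
import torch

# The smallest positive subnormal float16 value is about 6e-8;
# anything smaller is flushed to zero when cast down from float32.
tiny = torch.tensor(1e-8, dtype=torch.float32)

print(tiny.half().item())            # 0.0 -- the signal is lost in float16
print((tiny * 65536).half().item())  # ~6.55e-4 -- the scaled copy survives
```

The factor 65536 (2**16) here happens to be the default initial scale that GradScaler uses; the scaler then grows or shrinks this factor dynamically during training.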
Let's see how you can use GradScaler in your training loop:
scaler = torch.cuda.amp.GradScaler()
optimizer = ...

for epoch in range(...):
    for i, sample in enumerate(dataloader):
        inputs, labels = sample
        optimizer.zero_grad()

        # Forward pass
        outputs = model(inputs)
        # Compute the loss
        loss = loss_fn(outputs, labels)
        # ⭐️ Scale the loss, then back-propagate the scaled gradients
        scaler.scale(loss).backward()
        # ⭐️ Unscale the gradients and step the optimizer
        scaler.step(optimizer)
        # ⭐️ Update the scale factor for the next iteration
        scaler.update()
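As the official AMP recipe notes, GradScaler is meant to be paired with torch.autocast, which runs the forward pass in reduced precision (otherwise the gradients are float32 and there is nothing to rescue from underflow). Here is a minimal end-to-end sketch; the model, data, and hyperparameters are placeholders:

```python
import torch
import torch.nn as nn

# A toy model and batch stand in for your real training setup.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(16, 2).to(device)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Scaling only matters for float16 on CUDA; enabled=False makes the
# scaler a transparent no-op so the same loop also runs on CPU.
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

inputs = torch.randn(8, 16, device=device)
labels = torch.randint(0, 2, (8,), device=device)

optimizer.zero_grad()
# Run the forward pass (and loss) in reduced precision under autocast
with torch.autocast(device_type=device,
                    dtype=torch.float16 if device == "cuda" else torch.bfloat16):
    outputs = model(inputs)
    loss = loss_fn(outputs, labels)

scaler.scale(loss).backward()  # back-propagate the scaled loss
scaler.step(optimizer)         # unscales gradients, then optimizer.step()
scaler.update()                # adjust the scale factor for next iteration
```

Note that scaler.step(optimizer) unscales the gradients before stepping, and skips the step entirely if it finds infs or NaNs in them, so the optimizer only ever sees unscaled gradients.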
Summary

In this article, you saw how to use torch.cuda.amp.GradScaler in PyTorch to implement automatic gradient scaling for compute-efficient training loops, and how monitoring your metrics with Weights & Biases can lead to valuable insights.
To see the full suite of W&B features, check out this short 5-minute guide. If you want more reports covering the math and from-scratch code implementations, let us know in the comments below or on our forum ✨!
Check out these other reports on Fully Connected covering other fundamental development topics like GPU Utilization and Saving Models.

Daniel Turner:
If using GradScaler, how does this affect the gradients that are logged using wandb.watch()? Is there a way to log the unscaled, rather than scaled, gradients?

Bence Krénusz:
Hey there, thanks for the article. Referring to this tutorial: https://pytorch.org/tutorials/recipes/recipes/amp_recipe.html, GradScaler usage is bound to torch.autocast().