How To Use GradScaler in PyTorch
In this article, we explore how to implement automatic gradient scaling (GradScaler) in a short tutorial complete with code and interactive visualizations.
In this article, we'll look at how you can use torch.cuda.amp.GradScaler in PyTorch to implement automatic gradient scaling and write more compute-efficient training loops.
Unlike TensorFlow, PyTorch provides a simple interface for these compute-efficient methods, which we can add to the training loop with just a couple of lines of code.
For a closer look at the various mixed-precision methods available in PyTorch, you can refer to the official documentation.
Show Me the Code
Most PyTorch training loops are of the following form:
optimizer = ...

for epoch in range(...):
    for i, sample in enumerate(dataloader):
        inputs, labels = sample
        optimizer.zero_grad()

        # Forward Pass
        outputs = model(inputs)

        # Compute Loss and Perform Back-propagation
        loss = loss_fn(outputs, labels)
        loss.backward()

        # Update Optimizer
        optimizer.step()
Also, if you've been training deep learning models in PyTorch for a while, especially large language models (LLMs), you've probably run into the following CUDA error:
RuntimeError: CUDA out of memory. Tried to allocate .. MiB (.. GiB total capacity; ... GiB already allocated;... MiB free; ... GiB reserved in total by PyTorch)
For a much more detailed overview of how to prevent CUDA OOM errors, refer to the reports in the Recommended Reading section below.
One common problem when training large deep learning models is underflowing gradients: gradients that are too small for float16 tensors to represent, so they are flushed to zero.
To prevent this, we can scale our gradients by some factor so they aren't flushed to zero. This is not to be confused with vanishing gradients: these gradients would still contribute to the learning process, but they are lost purely because of the limits of low-precision arithmetic.
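To make the underflow concrete, here is a small illustrative snippet (not from the original tutorial) showing how a float32 gradient value that is too small for float16 gets flushed to zero, and how scaling it first keeps it representable:

import torch

# float16 cannot represent values below roughly 6e-8 (its smallest
# subnormal), so this tiny gradient rounds to zero when cast down.
grad = torch.tensor(1e-8)                # float32
print(grad.to(torch.float16))            # tensor(0., dtype=torch.float16)

# Scaling first keeps the value representable in float16; it can be
# unscaled again before the optimizer step.
scale = 2.0 ** 16
print((grad * scale).to(torch.float16))  # non-zero float16 value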
Let's see how you can use GradScaler in your training loops:
scaler = torch.cuda.amp.GradScaler()
optimizer = ...

for epoch in range(...):
    for i, sample in enumerate(dataloader):
        inputs, labels = sample
        optimizer.zero_grad()

        # Forward Pass
        outputs = model(inputs)

        # Compute Loss and Perform Back-propagation
        loss = loss_fn(outputs, labels)

        # ⭐️ ⭐️ Scale Gradients
        scaler.scale(loss).backward()

        # ⭐️ ⭐️ Update Optimizer
        scaler.step(optimizer)
        scaler.update()
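Note that, per the official PyTorch AMP recipe, GradScaler is usually paired with torch.autocast, which runs the forward pass in float16 so there are actually low-precision gradients to protect. A minimal sketch of that pairing, with model, dataloader, loss_fn, optimizer, and num_epochs as placeholders just like in the loop above:

import torch

scaler = torch.cuda.amp.GradScaler()

for epoch in range(num_epochs):
    for inputs, labels in dataloader:
        optimizer.zero_grad()

        # Run the forward pass and loss computation in mixed precision
        with torch.autocast(device_type="cuda", dtype=torch.float16):
            outputs = model(inputs)
            loss = loss_fn(outputs, labels)

        # Backward pass on the scaled loss; step() unscales the gradients
        # and skips the update if they contain infs or NaNs, then update()
        # adjusts the scale factor for the next iteration.
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()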
For a better understanding, you can refer to this video:
Summary
In this article, you saw how to use torch.cuda.amp.GradScaler in PyTorch to implement automatic gradient scaling for compute-efficient training loops, and how monitoring your metrics with Weights & Biases can lead to valuable insights.
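If you want to track how the dynamic loss scale evolves alongside your metrics, a minimal sketch of logging with W&B from inside the training loop could look like this (the project name is a placeholder, not from the original article):

import wandb

wandb.init(project="grad-scaler-demo")   # hypothetical project name

# Inside the inner training loop, after scaler.update():
wandb.log({
    "loss": loss.item(),               # training loss for this batch
    "grad_scale": scaler.get_scale(),  # current dynamic scale factor
})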
To see the full suite of W&B features, please check out this short 5-minute guide. If you want more reports covering the math and from-scratch code implementations, let us know in the comments below or on our forum ✨!
Check out these other reports on Fully Connected covering other fundamental development topics like GPU Utilization and Saving Models.
Recommended Reading
Setting Up TensorFlow And PyTorch Using GPU On Docker
A short tutorial on setting up TensorFlow and PyTorch deep learning models on GPUs using Docker.
Preventing The CUDA Out Of Memory Error In PyTorch
A short tutorial on how you can avoid the "RuntimeError: CUDA out of memory" error while using the PyTorch framework.
How to Compare Keras Optimizers in Tensorflow for Deep Learning
A short tutorial outlining how to compare Keras optimizers for your deep learning pipelines in Tensorflow, with a Colab to help you follow along.
How to Initialize Weights in PyTorch
A short tutorial on how you can initialize weights in PyTorch with code and interactive visualizations.
Recurrent Neural Network Regularization With Keras
A short tutorial teaching how you can use regularization methods for Recurrent Neural Networks (RNNs) in Keras, with a Colab to help you follow along.
How To Calculate Number of Model Parameters for PyTorch and TensorFlow Models
This article provides a short tutorial on calculating the number of parameters for TensorFlow and PyTorch deep learning models, with examples for you to follow.