How To Use Autocast in PyTorch
In this article, we learn how to implement Tensor Autocasting in a short tutorial, complete with code and interactive visualizations, so you can try it yourself.
In this article, we'll look at how you can use torch.cuda.amp.autocast() in PyTorch to implement automatic tensor casting for writing compute-efficient training loops.
Unlike TensorFlow, PyTorch provides a simple interface to these compute-efficient methods, which we can add to the training loop with just a couple of lines of code.
For a closer look at the various mixed-precision methods available in PyTorch, you can refer to the official documentation.
Show Me the Code
Most PyTorch training loops are of the following form:
optimizer = ...

for epoch in range(...):
    for i, sample in enumerate(dataloader):
        inputs, labels = sample
        optimizer.zero_grad()

        # Forward Pass
        outputs = model(inputs)

        # Compute Loss and Perform Back-propagation
        loss = loss_fn(outputs, labels)
        loss.backward()

        # Update Optimizer
        optimizer.step()
Also, if you've been training deep learning models in PyTorch for a while, especially large language models (LLMs), you've probably run into the following CUDA error:
RuntimeError: CUDA out of memory. Tried to allocate .. MiB (.. GiB total capacity; ... GiB already allocated;... MiB free; ... GiB reserved in total by PyTorch)
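Before reaching for a fix, it helps to see how close you actually are to the limit. Here's a minimal sketch (the device index 0 is an assumption) that queries PyTorch's built-in CUDA memory counters:

import torch

if torch.cuda.is_available():
    # Memory occupied by live tensors
    allocated = torch.cuda.memory_allocated() / 1024**2
    # Memory held by PyTorch's caching allocator (the "reserved" figure in the error)
    reserved = torch.cuda.memory_reserved() / 1024**2
    # Total memory on the device
    total = torch.cuda.get_device_properties(0).total_memory / 1024**2
    print(f"Allocated: {allocated:.1f} MiB | Reserved: {reserved:.1f} MiB | Total: {total:.1f} MiB")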
For a much more detailed overview of how to prevent CUDA OOM errors, refer to the following reports:
Preventing The CUDA Out Of Memory Error In PyTorch
A short tutorial on how you can avoid the "RuntimeError: CUDA out of memory" error while using the PyTorch framework.
How To Use GradScaler in PyTorch
In this article, we explore how to implement automatic gradient scaling (GradScaler) in a short tutorial complete with code and interactive visualizations.
One of the most common causes of the OOM error when training large deep learning models is memory utilization. By default, PyTorch uses float32 to represent model parameters, and for any decently sized model, that amounts to a lot of memory. With a decent accelerator of, say, 16GB of RAM, you probably won't be able to train bigger models; you may not even manage a single forward pass through a batch of data. One possible solution is to automatically cast your tensors to a representation with a smaller memory footprint, say float16 or even integer values. Seems simple enough, right?
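To make that arithmetic concrete, here's a quick sketch, using a made-up two-layer model, that measures the parameter footprint in float32 versus float16:

import torch
import torch.nn as nn

# A hypothetical model, purely for illustration (~33.6M parameters)
model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096))

def param_bytes(m: nn.Module) -> int:
    # element_size() is 4 for float32 and 2 for float16
    return sum(p.numel() * p.element_size() for p in m.parameters())

print(f"float32: {param_bytes(model) / 1024**2:.1f} MiB")
print(f"float16: {param_bytes(model.half()) / 1024**2:.1f} MiB")  # half the footprint

Halving the bytes per parameter halves the weight footprint, and the savings compound once you count gradients and activations.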
Let's see how you can autocast using a single line of code!
optimizer = ...

for epoch in range(...):
    for i, sample in enumerate(dataloader):
        inputs, labels = sample
        optimizer.zero_grad()

        # ⭐️ ⭐️ Autocasting: run the forward pass and loss in mixed precision
        with torch.cuda.amp.autocast():
            # Forward Pass
            outputs = model(inputs)

            # Compute Loss
            loss = loss_fn(outputs, labels)

        # Back-propagation runs outside the autocast region
        loss.backward()

        # Update Optimizer
        optimizer.step()
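One caveat: float16 autocasting is usually paired with gradient scaling so that small gradient values don't underflow; the GradScaler report linked above covers this in depth. As a rough sketch of how the two combine (following the same placeholder-style loop as above):

optimizer = ...
scaler = torch.cuda.amp.GradScaler()

for epoch in range(...):
    for i, sample in enumerate(dataloader):
        inputs, labels = sample
        optimizer.zero_grad()

        with torch.cuda.amp.autocast():
            outputs = model(inputs)
            loss = loss_fn(outputs, labels)

        # Scale the loss before backward; step() unscales gradients internally
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()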
Whenever you use W&B to log metrics, it also tracks a host of system metrics every few seconds, giving us valuable insight into what's going on in our machine. You'll find the complete list of the metrics being logged here.
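Getting this for free requires nothing beyond initializing a run. As a minimal sketch (the project name here is a placeholder):

import wandb

run = wandb.init(project="autocast-demo")  # placeholder project name

for epoch in range(...):
    ...                               # training step, as in the loops above
    wandb.log({"loss": loss.item()})  # log your own metrics; system metrics
                                      # (GPU memory, utilization) are sampled
                                      # automatically in the background

run.finish()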
For our specific context here, we're trying to solve that out-of-memory issue. Let's have a look at some charts to see how Weights & Biases helps us track useful metrics:
[W&B panels: GPU memory allocation and GPU utilization charts for these runs, from a private project that cannot be shown in this report]
As we can see, for most of the runs we are consuming 92-95% of GPU memory allocation and GPU utilization. We can try to reduce this using autocasting.
Summary
In this article, you saw how you can use torch.cuda.amp.autocast() in PyTorch to implement automatic tensor casting for writing compute-efficient training loops, and how monitoring your metrics with Weights & Biases can lead to valuable insights.
To see the full suite of W&B features, please check out this short 5-minute guide. If you want more reports covering the math and from-scratch code implementations, let us know in the comments below or on our forum ✨!
Check out these other reports on Fully Connected covering fundamental development topics like GPU Utilization and Saving Models.
Recommended Reading
Preventing The CUDA Out Of Memory Error In PyTorch
A short tutorial on how you can avoid the "RuntimeError: CUDA out of memory" error while using the PyTorch framework.
How To Use GradScaler in PyTorch
In this article, we explore how to implement automatic gradient scaling (GradScaler) in a short tutorial complete with code and interactive visualizations.
Setting Up TensorFlow And PyTorch Using GPU On Docker
A short tutorial on setting up TensorFlow and PyTorch deep learning models on GPUs using Docker.
PyTorch Dropout for regularization - tutorial
Learn how to regularize your PyTorch model with Dropout, complete with a code tutorial and interactive visualizations.
How to Initialize Weights in PyTorch
A short tutorial on how you can initialize weights in PyTorch with code and interactive visualizations.
How To Use GPU with PyTorch
A short tutorial on using GPUs for your deep learning models with PyTorch, from checking availability to visualizing usage.