
How To Use Autocast in PyTorch

In this article, we learn how to implement Tensor Autocasting in a short tutorial, complete with code and interactive visualizations, so you can try it yourself.

In this article, we'll look at how you can use torch.cuda.amp.autocast() in PyTorch to implement automatic Tensor Casting for writing compute-efficient training loops.
Unlike TensorFlow, PyTorch provides a simple interface for these compute-efficient methods, which we can add to the training loop with just a couple of lines of code.
For a closer look at the various Mixed Precision methods available in PyTorch, you can refer to the official documentation.

Show Me the Code

Most PyTorch training loops are of the following form:
optimizer = ...

for epoch in range(...):
    for i, sample in enumerate(dataloader):
        inputs, labels = sample
        optimizer.zero_grad()

        # Forward Pass
        outputs = model(inputs)
        # Compute Loss and Perform Back-propagation
        loss = loss_fn(outputs, labels)
        loss.backward()
        # Update Optimizer
        optimizer.step()
Also, if you've been training deep learning models in PyTorch for a while, especially large language models (LLMs), you've probably run into the following CUDA error:
RuntimeError: CUDA out of memory. Tried to allocate .. MiB (.. GiB total capacity; ... GiB already allocated;
... MiB free; ... GiB reserved in total by PyTorch)
For a much more detailed overview of how to prevent CUDA OOM errors, kindly refer to the following reports:



One of the most common causes of the OOM error when training large deep learning models is memory utilization. By default, PyTorch uses float32 to represent model parameters, and for any decently sized model that amounts to a lot of memory. If you have a decent accelerator with, say, 16GB of RAM, you probably won't be able to train bigger models. You may not even be able to complete a single forward pass through a batch of data. One possible solution to this problem is to automatically cast your tensors to a representation with a smaller memory footprint, say float16 or even integer values. Seems simple enough, right?
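To get a feel for the savings, here's a minimal sketch (the tensor shape is just a stand-in for a model parameter) comparing the memory footprint of the same values stored as float32 and float16:
import torch

# A 1,000 x 1,000 weight matrix as a stand-in for a model parameter
w_fp32 = torch.randn(1000, 1000, dtype=torch.float32)
w_fp16 = w_fp32.to(torch.float16)

# element_size() is the number of bytes per element: 4 for float32, 2 for float16
print(w_fp32.element_size() * w_fp32.nelement())  # 4000000 bytes (~4 MB)
print(w_fp16.element_size() * w_fp16.nelement())  # 2000000 bytes (~2 MB)
The same halving applies to the activations produced during the forward pass, which often account for a large share of the memory used at training time.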
Let's see how you can autocast using a single line of code!
optimizer = ...

for epoch in range(...):
    for i, sample in enumerate(dataloader):
        inputs, labels = sample
        optimizer.zero_grad()

        # ⭐️ ⭐️ Autocasting
        with torch.cuda.amp.autocast():
            # Forward Pass
            outputs = model(inputs)
            # Compute Loss and Perform Back-propagation
            loss = loss_fn(outputs, labels)

        loss.backward()
        # Update Optimizer
        optimizer.step()
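One caveat worth flagging: when the forward pass runs in float16, small gradient values can underflow to zero during the backward pass. The official torch.cuda.amp recipe therefore pairs autocast() with a GradScaler, which scales the loss before backward() and unscales the gradients before the optimizer step. Here's a minimal sketch of the same loop with a scaler added (model, dataloader, and loss_fn are placeholders, as above):
scaler = torch.cuda.amp.GradScaler()

for epoch in range(...):
    for i, sample in enumerate(dataloader):
        inputs, labels = sample
        optimizer.zero_grad()

        # Run the forward pass and loss computation under autocast
        with torch.cuda.amp.autocast():
            outputs = model(inputs)
            loss = loss_fn(outputs, labels)

        # Scale the loss, backpropagate, then step the optimizer and update the scale factor
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()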
Whenever you use W&B to log metrics, it tracks tons of system metrics every couple of seconds. This enables us to gain valuable insight into what's going on in our machine. You'll find the complete list of the metrics being logged here.
For our specific context here, we're trying to solve that out-of-memory issue. Let's have a look at some charts to see how Weights & Biases helps us track useful metrics:

This set of panels contains runs from a private project, which cannot be shown in this report

As we can see, for most of the runs, both memory allocation and GPU utilization sit at 92-95%. We can try to reduce this using autocasting.
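If you'd like to reproduce this comparison yourself, a minimal sketch looks like the following (the project name and logged values are illustrative); once wandb.init() is called, system metrics such as GPU memory allocation and utilization are captured automatically in the background:
import wandb

run = wandb.init(project="amp-autocast-demo")  # hypothetical project name

for epoch in range(...):
    for i, sample in enumerate(dataloader):
        ...  # training step, with or without autocast
        # Log training metrics alongside the automatically captured system metrics
        wandb.log({"epoch": epoch, "loss": loss.item()})

run.finish()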

Summary

In this article, you saw how to use torch.cuda.amp.autocast() in PyTorch to implement automatic Tensor Casting for writing compute-efficient training loops, and how using Weights & Biases to monitor your metrics can lead to valuable insights.
To see the full suite of W&B features, please check out this short 5-minute guide. If you want more reports covering the math and from-scratch code implementations, let us know in the comments below or on our forum ✨!
Check out these other reports on Fully Connected covering other fundamental development topics like GPU Utilization and Saving Models.

