
Why You Should Upgrade Your Code to PyTorch 2.0

Taking PyTorch 2.0 for a spin
Created on March 21 | Last edited on April 3

Introduction

TLDR: This post is meant to inspire you to upgrade to PyTorch 2.0 today and enjoy this supercharged release! 🚀🥳
This major release brings many tools that make your workflows more powerful. It requires no API changes, and it's completely backward compatible with your existing PyTorch code. I am not going to go through everything that's new; for that, I encourage you to read the official blog post about 2.0.

[Chart: throughput (samples/sec) with torch.compile = False vs. torch.compile = True]

Two new features are very interesting for existing code bases:
  • torch.set_default_device: This lets you set your device globally! No more to("cuda") calls all over your code. Just set it up once and you're done 😎. This new feature lets you remove some boilerplate from your code. Note that this is optional, and you can keep your code as it is if you prefer.
  • And the torch.compile function, which promises much higher throughput on NVIDIA GPUs. Simply put, you wrap your model with compiled_model = torch.compile(model) and you are good to go! Typically, you get a nice performance boost of anywhere from 10% to 100%, depending on the model! 🚀 (See the short sketch right after this list.)
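Here's a minimal sketch of how the two features fit together. The toy model, shapes, and batch size are placeholders for illustration; only torch.set_default_device and torch.compile are the actual PyTorch 2.0 calls:

import torch
import torch.nn as nn

# Set the default device once (PyTorch 2.0+): new tensors and modules
# are created on "cuda" without explicit .to("cuda") calls.
torch.set_default_device("cuda")

# A toy model, just for illustration.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))

# The one extra line: compile the model. The first forward pass pays the
# compilation cost; subsequent passes should be faster.
compiled_model = torch.compile(model)

x = torch.randn(32, 128)   # created on the default device ("cuda")
out = compiled_model(x)    # runs the compiled model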
Ah, but not so fast: Is your DataLoader fast enough to handle the increased throughput?
Let's run some benchmarks and see what happens 🤓

Benchmarking PyTorch 2.0

As the most exciting new feature is torch.compile, let's instrument our existing code with it and see our GPUs melt.
Multiple users were reporting huge performance boosts from this feature while it was still on the dev branch of PyTorch. Karpathy, for one, has been using torch 2.0 for a while now in his nanoGPT repo.

Let's run two benchmarks, one on image classification and one on a transformer model.
We are going to measure three important quantities:
  • Model throughput (samples per second)
  • Model + DataLoader throughput (samples per sec)
  • Total runtime (secs)
The pseudo-code of the instrumentation looks like this:
from time import perf_counter

import wandb

for epoch in range(epochs):
    tf = perf_counter()  # marks the start of the first data fetch
    for x, y in dataloader:
        t0 = perf_counter()  # data is ready, the train step starts here
        # pytorch train steps...
        x, y = x.to("cuda"), y.to("cuda")
        out = model(x)
        loss = loss_func(out, y)
        loss.backward()
        # other stuff like optimizer, schedulers, grad scalers...
        tf_with_dataloader = perf_counter() - tf  # step time including data gathering
        tf = perf_counter()  # also marks the start of the next data fetch
        wandb.log({"samples_per_sec": len(x) / (tf - t0),
                   "samples_per_sec_dl": len(x) / tf_with_dataloader})
The code to run this benchmark can be found here.

ResNet50

Here we will train a ResNet50 from torchvision, using the same training script we used to benchmark the Apple M processors.
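Building and compiling the model takes just a couple of lines. Here is a minimal sketch (the batch size and input shape below are placeholders, not the benchmark's exact settings):

import torch
import torchvision

# Build a standard ResNet50 from torchvision and move it to the GPU.
model = torchvision.models.resnet50(weights=None).to("cuda")

# One extra line: compile it.
compiled_model = torch.compile(model)

# Dummy ImageNet-sized batch: the first call pays the compilation cost,
# later calls reflect the steady-state throughput.
x = torch.randn(64, 3, 224, 224, device="cuda")
out = compiled_model(x)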
We present two results:
  • The samples per second for the forward + backward pass (without the time spent gathering the data from the DataLoader)
  • The samples per second, including the data gathering.
We test two NVIDIA GPUs, an A100 with 40GB of VRAM and a V100 with 16GB. We also test on a single batch of data (one_batch) so we can get the maximum theoretical throughput of the pipeline.

[Charts: ResNet50 samples/sec for run sets A100, A100 - one_batch, V100, and V100 - one_batch]


☝️ Toggle the different run sets and see the difference in performance across devices.
As you can see from the graphs above, PyTorch's new torch.compile gives you a nice performance boost on your forward and backward passes! Almost 33% more samples/sec by just adding one line of code... but the DataLoader is slower for some reason. 😳 Maybe my pipeline is not up to date with the new tricks available in torchvision.
🤦‍♂️ After switching to the official PyTorch Docker image, the DataLoader problem went away. I am reminded once again that we should use Docker whenever possible.

BERT Training

We'll reuse the same training script from the Apple benchmarks. You can find the code here. For BERT, we get a huge performance boost! The overhead of compilation would be negligible on a long training run. Another thing we notice is that for BERT the DataLoader does not slow down the forward and backward pass; the Hugging Face dataset and tokenizer are very fast:

[Charts: BERT samples/sec for run sets A100 and V100]

As we can see, we pay an upfront cost for compiling the model. This should not be an issue on longer training runs, and I encourage you to try it! We did after all see a 50% performance boost.
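For reference, compiling a Hugging Face BERT model is also a one-line change. The sketch below is just an illustration (the checkpoint name, task head, and dummy batch are assumptions, not the benchmark script):

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load a BERT checkpoint with a classification head (illustrative choice).
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased").to("cuda")

# One extra line: compile the model.
compiled_model = torch.compile(model)

# A tiny dummy batch; the first call includes the compilation overhead.
batch = tokenizer(["PyTorch 2.0 is fast!"] * 8, padding=True, return_tensors="pt").to("cuda")
out = compiled_model(**batch)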
I wager they put a lot of effort into accelerating the transformer blocks. I am looking forward to trying the H100 GPUs that are supposedly optimized to compute the Multi-Head Attention (the building block of the transformer).

CPU utilization of the DataLoader

One thing I noticed is that PyTorch 2.0's DataLoader (not a DataLoader v2.0, 🤣) is not making use of the CPU cores as well as before. The A100 machine has 12 available cores, and even when passing num_workers=12, the CPUs are not going brrrrr...
This is for the ResNet50 DataLoader, which is a simple image pipeline that only has a Resize transform on top of the standard PIL.Image.open.
This is on a fresh conda env. The PyTorch Docker image works better and has more throughput than conda; I don't know why 🤷‍♂️. We check CPU utilization using htop.
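For reference, this is roughly what that pipeline looks like (the dataset path, image size, and batch size are placeholders, not the exact benchmark settings):

import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# A simple pipeline: PIL loading plus a single Resize, as described above.
tfms = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
dataset = datasets.ImageFolder("path/to/images", transform=tfms)

# num_workers controls how many CPU worker processes fetch and decode
# images in parallel; this is the knob that has to keep the faster GPU fed.
dataloader = DataLoader(dataset, batch_size=64, num_workers=12,
                        pin_memory=True, shuffle=True)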




Conclusions

The performance boost from torch.compile is not negligible, and you get it by adding just one line to your code. But this new trick that makes the model faster may force you to rework your DataLoader, as it may not be able to keep up!
  • The performance gain for the ResNet50 is around 33% on an A100 and around 25% on a V100
  • For the BERT model, the gain is around 45% on an A100 and 50% on a V100.
We didn't cover them here, but there are plenty of other cool new tricks in PyTorch 2.0, notably the global device setup. The catch is that it's not backward compatible, so if you plan to keep running your code on pre-2.0 versions, you may need to wait a bit.
I am looking forward to seeing the new GPUs from NVIDIA that are supposed to be even faster on transformer models.