
Multi-GPU Training Using PyTorch Lightning

In this article, we take a look at how to execute multi-GPU training using PyTorch Lightning and visualize GPU performance in Weights & Biases.
Created on November 13 | Last edited on November 28
A GPU is the workhorse of most deep learning workflows. If you have used TensorFlow with Keras, you may know that the same training script can train a model on multiple GPUs, or even a TPU, with minimal to no changes.
In this article, we will see how to make our PyTorch script accelerator-agnostic, i.e., how the same PyTorch code, organized with PyTorch Lightning, can be trained on multiple GPUs across multiple devices.

Introduction to the Common Workflow With PyTorch Lightning

This article is part of my PyTorch Lightning series. Before you train your model on multiple GPUs, make sure to check out the earlier reports in the series to get started with PyTorch Lightning.
PyTorch Lightning lets you decouple research from engineering. Making your PyTorch code train on multiple GPUs can be daunting if you are not experienced, and a waste of time if you want to scale your research. PyTorch Lightning is more of a "style guide" that helps you organize your PyTorch code so that you do not have to write boilerplate code, including the boilerplate for multi-GPU training.

The Common Workflow with PyTorch Lightning

  • Start with your PyTorch code and focus on the neural network aspect. This includes your data pipeline, model architecture, training loop, validation loop, testing loop, loss function, optimizer, etc.
  • Organize your data pipeline using PyTorch Lightning. The DataModule organizes the data pipeline into one shareable and reusable class. More on it here.
  • Organize your model architecture, training loop, validation loop, testing loop, optimizer(s), loss function, etc. using PyTorch Lightning. The LightningModule defines a system that groups all the research code into a single, self-contained class.
  • Define the Trainer, which abstracts away all the engineering code for us. You can specify the number of GPUs, the number of epochs, etc. It also lets you use callbacks such as Early Stopping, Model Checkpoint, etc. More on callbacks here.
In this article, we will see how easy it is to train our model on multiple GPUs.

Multi-GPU Training

I have structured the PyTorch code for image classification on the Caltech-101 dataset using PyTorch Lightning. I used my GCP account to train the classifier on two K80 GPUs, with just one minor change to the Trainer.

How To Train on Multiple GPUs

Keeping everything else the same, just pass the gpus and accelerator arguments to the PyTorch Lightning Trainer. I had access to two K80 GPUs, thus gpus=2. I was training in a Jupyter Notebook, thus accelerator='dp'. Here dp stands for Data Parallel. We will soon go into the specifics, but before that, let's visualize the system metrics using Weights & Biases.
# Initialize a trainer
trainer = pl.Trainer(max_epochs=50,
                     progress_bar_refresh_rate=20,
                     gpus=2,
                     accelerator='dp',
                     logger=wandb_logger,
                     callbacks=[early_stop_callback],
                     checkpoint_callback=checkpoint_callback)
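For reference, dp replicates the model on each GPU and splits every batch across them within a single process, which is why it works inside a notebook. When training from a script, Lightning's ddp backend (DistributedDataParallel, one process per GPU) is generally faster. A sketch of the same Trainer configured for ddp follows; it assumes the same wandb_logger and callbacks as above, uses the same Trainer API version as this article, and requires two GPUs, so treat it as a configuration fragment rather than runnable-anywhere code.

```python
# Same Trainer, but with one process per GPU via DistributedDataParallel.
# Note: 'ddp' must be launched from a script, not from a Jupyter Notebook.
trainer = pl.Trainer(max_epochs=50,
                     gpus=2,
                     accelerator='ddp',
                     logger=wandb_logger,
                     callbacks=[early_stop_callback],
                     checkpoint_callback=checkpoint_callback)
```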
Weights & Biases automatically captures GPU-related metrics. The media panels shown below cover some of the most important metrics we care about. You can also see the training and test metrics.
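These system metrics are recorded by the W&B run that the attached logger starts. A minimal sketch of wiring up the logger passed to the Trainer above, assuming wandb is installed and you are logged in; the project name is illustrative:

```python
# Attach a WandbLogger so W&B records system/GPU metrics
# alongside the training metrics logged via self.log(...).
from pytorch_lightning.loggers import WandbLogger

wandb_logger = WandbLogger(project="multi-gpu-lightning")
```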

