
Multi-GPU Training Using PyTorch Lightning

In this article, we take a look at how to execute multi-GPU training using PyTorch Lightning and visualize GPU performance in Weights & Biases.
Created on November 13 | Last edited on November 28
A GPU is the workhorse of most deep learning workflows. If you have used TensorFlow with Keras, you may know that the same training script can train a model on multiple GPUs, or even a TPU, with minimal to no changes.
In this article, we will see how to make our PyTorch script accelerator-agnostic, i.e., how the same PyTorch code, organized with PyTorch Lightning, can be trained on multiple GPUs across multiple devices.

Introduction to the Common Workflow With PyTorch Lightning

This article is part of my PyTorch Lightning series. Before you train your model on multiple GPUs, make sure to check out the earlier reports in the series to get started with PyTorch Lightning.
PyTorch Lightning lets you decouple research from engineering. Making your PyTorch code train on multiple GPUs can be daunting if you are not experienced, and a waste of time if you want to scale your research. PyTorch Lightning is more of a "style guide" that helps you organize your PyTorch code so that you do not have to write boilerplate code, including the boilerplate for multi-GPU training.

The Common Workflow with PyTorch Lightning

  • Start with your PyTorch code and focus on the neural network aspect. This includes your data pipeline, model architecture, training loop, validation loop, testing loop, loss function, optimizer, etc.
  • Organize your data pipeline using PyTorch Lightning. The DataModule organizes the data pipeline into one shareable and reusable class. More on it here.
  • Organize your model architecture, training loop, validation loop, testing loop, optimizer(s), loss function, etc. using PyTorch Lightning. The LightningModule defines a system that groups all the research code into a single, self-contained class.
  • Define the Trainer, which abstracts away all the engineering code for us. You can specify the number of GPUs, the number of epochs, etc. It also lets you use callbacks such as Early Stopping, Model Checkpoint, etc. More on callbacks here.
In this article, we will see how easy it is to train our model on multiple GPUs.

Multi-GPU Training

I have structured the PyTorch code for image classification on the Caltech-101 dataset using PyTorch Lightning. I used my GCP account to train the classifier on two K80 GPUs, with just one minor change to the Trainer.

How To Train on Multiple GPUs

Keeping everything else the same, just pass the gpus and accelerator arguments to the PyTorch Lightning Trainer. I had access to two K80 GPUs, thus gpus=2. I was training in a Jupyter Notebook, thus accelerator='dp'. Here dp stands for Data Parallel. We will soon go into the specifics, but before that, let's visualize the system metrics using Weights & Biases.
# Initialize a trainer
trainer = pl.Trainer(max_epochs=50,
                     progress_bar_refresh_rate=20,
                     gpus=2,
                     accelerator='dp',
                     logger=wandb_logger,
                     callbacks=[early_stop_callback],
                     checkpoint_callback=checkpoint_callback)
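For reference, dp replicates the model on each GPU and splits every batch across them within a single process, which is why it works inside a notebook. When training from a script, Lightning's ddp backend (DistributedDataParallel, one process per GPU) is generally faster. A sketch of the same Trainer configured for ddp follows; it assumes the same wandb_logger and callbacks as above, uses the same Trainer API version as this article, and requires two GPUs, so treat it as a configuration fragment rather than runnable-anywhere code.

```python
# Same Trainer, but with one process per GPU via DistributedDataParallel.
# Note: 'ddp' must be launched from a script, not from a Jupyter Notebook.
trainer = pl.Trainer(max_epochs=50,
                     gpus=2,
                     accelerator='ddp',
                     logger=wandb_logger,
                     callbacks=[early_stop_callback],
                     checkpoint_callback=checkpoint_callback)
```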
Weights & Biases automatically captures GPU-related metrics. The media panels shown below cover some of the most important metrics we care about. You can also see the training and test metrics.
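These system metrics are recorded by the W&B run that the attached logger starts. A minimal sketch of wiring up the logger passed to the Trainer above, assuming wandb is installed and you are logged in; the project name is illustrative:

```python
# Attach a WandbLogger so W&B records system/GPU metrics
# alongside the training metrics logged via self.log(...).
from pytorch_lightning.loggers import WandbLogger

wandb_logger = WandbLogger(project="multi-gpu-lightning")
```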

