Setting Up TensorFlow And PyTorch Using GPU On Docker
A short tutorial on setting up TensorFlow and PyTorch deep learning models on GPUs using Docker.
💡 This post assumes a basic understanding of virtual environments.
Why Use Docker?
In this report, we'll walk through how you can use Docker to make distributed training of TensorFlow and PyTorch deep learning models easier, all while helping you avoid those nasty CUDA errors.
Current state-of-the-art models are famously huge and over-parameterized: in fact, they often contain more parameters than there are data points in the training dataset. These models require enormous amounts of compute to train and therefore depend on multiprocessing and distribution modules such as torch.distributed or tf.distribute.
Even if you somehow manage to write working parallel code, it's another pain point to ensure that your CUDA version matches what your primary library supports (dependency hell ☠️) and that all of your accelerators are "visible" to it.
Docker makes this process far easier by providing preconfigured images with an optimal CUDA setup for every version. You can even build on top of these existing images and add your own custom libraries and frameworks to simplify things further.
Weights & Biases logging works well with distributed systems, even allowing you to track system metrics to ensure that all of your GPUs are being utilized.
Let's walk through a couple of examples of how you can use Docker to train TensorFlow and PyTorch pipelines.
Setting Up TensorFlow With GPU Support Using Docker
TensorFlow makes it easy to write distributed code with the tf.distribute API, but making sure that you have the correct versions of the NVIDIA GPU drivers, CUDA Toolkit, CUPTI, cuDNN, and TensorRT can be a hassle. We can use the official TensorFlow images to get a preconfigured environment with all the essential packages preinstalled.
TensorFlow provides a number of image tags depending on your use case, such as latest, nightly, and devel. Assuming you have Docker installed on your machine, you can download these images with commands like:
$ docker pull tensorflow/tensorflow
$ docker pull tensorflow/tensorflow:latest-gpu
These commands will pull the latest stable release and the latest GPU-compatible release respectively. For more detailed instructions, please refer to the official documentation.
Now, assuming you have some train.py script with an appropriate distribution strategy, such as:
import tensorflow as tf
import wandb
from wandb.keras import WandbCallback

# Initialize the run
wandb.init(project="GPU-Docker")

if __name__ == "__main__":
    # Other steps like dataset caching and preprocessing

    # MirroredStrategy replicates training across all GPUs visible to the process
    strategy = tf.distribute.MirroredStrategy()
    with strategy.scope():
        # Create and compile the model inside the strategy scope
        # model = ...
        # model.compile(...)

        # Log training metrics to W&B
        model.fit(..., callbacks=[WandbCallback()])
We can run this script inside the container, ensuring that TensorFlow uses all the GPUs available on the system, with the following command:
$ docker run --gpus all -v $PWD:/tmp -w /tmp -it tensorflow/tensorflow:latest-gpu python train.py
Let's try to understand this command:
- docker run - tells the Docker daemon to run the following command inside a container
- --gpus all - tells the daemon to make all available GPUs visible to the container
- -v $PWD:/tmp -w /tmp - mounts the current directory ($PWD) into the container at /tmp and sets /tmp as the working directory, so the container can see our script
- -it tensorflow/tensorflow:latest-gpu - tells the daemon which image to use and runs it in "interactive" mode, which gives you access to a terminal within the container
- python train.py - this command is executed within the container and therefore uses its Python interpreter to run our script
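To quickly verify that the GPUs are actually visible inside the container, you can run a one-line sanity check like the following (this snippet is our own addition, not part of the original setup):
$ docker run --gpus all -it tensorflow/tensorflow:latest-gpu \
    python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
If the printed list is empty, the container cannot see your GPUs, and it's worth double-checking your NVIDIA driver and NVIDIA Container Toolkit installation.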
If you are using other packages that are not available in the default TensorFlow image, you can create your own Docker image using the TensorFlow image as a base and install the other necessary packages within it. Then you can build that image and substitute it for the stock TensorFlow image. For an example Dockerfile, please refer to this repository.
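As a rough sketch, a custom image along those lines might look like this (a hypothetical Dockerfile, not the one from the linked repository):
# Start from the official TensorFlow GPU image
FROM tensorflow/tensorflow:latest-gpu

# Install any extra packages your pipeline needs
RUN pip install --no-cache-dir wandb

# Copy the training script into the image and set the working directory
COPY train.py /tmp/train.py
WORKDIR /tmp

CMD ["python", "train.py"]
You would then build and run it with something like docker build -t my-tf-train . followed by docker run --gpus all -it my-tf-train, where my-tf-train is just an example tag.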
Setting Up PyTorch With GPU Support Using Docker
Now, admittedly, torch.distributed is hard to get started with, but we can make distributed execution easier with PyTorch just like we did with TensorFlow.
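To make that concrete, here is a minimal sketch of a torch.distributed entry point; the linear model is a stand-in, and the dataset, sampler, optimizer, and training loop are omitted, so treat this as an outline rather than a complete recipe:
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # The launcher (e.g. torchrun) sets RANK, LOCAL_RANK, and WORLD_SIZE
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Stand-in model; replace with your own, then wrap it in DDP
    model = nn.Linear(10, 1).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    # ... dataset, DistributedSampler, optimizer, and training loop go here ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
With recent PyTorch releases, you would launch one process per GPU with torchrun --nproc_per_node=<num_gpus> train.py; older releases ship the equivalent python -m torch.distributed.run.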
Similar to TensorFlow, PyTorch also provides a bunch of great images with various versions of CUDA and cuDNN preconfigured. The simplest way to get started would be to use the latest image, although other tags are also available on their official Docker page.
The procedure for downloading the official images is the same as for TensorFlow:
$ docker pull pytorch/pytorch:latest
$ docker pull pytorch/pytorch:1.9.1-cuda11.1-cudnn8-runtime
$ docker pull pytorch/pytorch:1.9.1-cuda11.1-cudnn8-devel
These three images are representative of most of the other tags.
The latest image comes with the latest stable versions of PyTorch, CUDA, and cuDNN. There are also other tags of the form X-cudaY-cudnnZ-runtime or X-cudaY-cudnnZ-devel, where X is the PyTorch version, Y is the CUDA version, and Z is the cuDNN version. The images tagged with devel come with various compiler configurations preinstalled.
We can use similar commands to train models using Docker and to build on top of these images.
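For example, mirroring the TensorFlow command from earlier (and assuming the same train.py layout in the current directory):
$ docker run --gpus all -v $PWD:/tmp -w /tmp -it pytorch/pytorch:latest python train.py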
In a framework-agnostic manner, we can also use the official NVIDIA CUDA images, for example:
$ docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
Summary
In this article, you saw how you can set up both TensorFlow and PyTorch to train deep learning models on all of your GPUs using Docker, making distributed training easier. To see the full suite of W&B features, please check out this short 5-minute guide. If you want more reports covering the math and "from-scratch" code implementations, let us know in the comments below or on our forum ✨!
Check out these other reports on Fully Connected covering other fundamental development topics like GPU Utilization and Saving Models.
Recommended Reading
How To Use GPU with PyTorch
A short tutorial on using GPUs for your deep learning models with PyTorch, from checking availability to visualizing usage.
PyTorch Dropout for regularization - tutorial
Learn how to regularize your PyTorch model with Dropout, complete with a code tutorial and interactive visualizations
How to save and load models in PyTorch
This article is a machine learning tutorial on how to save and load your models in PyTorch using Weights & Biases for version control.
Image Classification Using PyTorch Lightning and Weights & Biases
This article provides a practical introduction on how to use PyTorch Lightning to improve the readability and reproducibility of your PyTorch code.
A Gentle Introduction To Weight Initialization for Neural Networks
An explainer and comprehensive overview of various strategies for neural network weight initialization
How to Compare Keras Optimizers in TensorFlow for Deep Learning
A short tutorial outlining how to compare Keras optimizers for your deep learning pipelines in TensorFlow, with a Colab to help you follow along.