
Setting Up TensorFlow And PyTorch Using GPU On Docker

A short tutorial on setting up TensorFlow and PyTorch deep learning models on GPUs using Docker.


💡 This post assumes a basic understanding of virtual environments.

Why Use Docker?

In this Report, we'll walk through how you can use Docker to make distributed training of TensorFlow and PyTorch deep learning models easier, all while helping you avoid those nasty CUDA errors.
Current state-of-the-art models are famously huge and over-parameterized: in fact, they often contain more parameters than there are data points in the training set. These models require enormous amounts of compute to train and therefore depend on multiprocessing and distribution modules such as torch.distributed or tf.distribute.
Even if you somehow manage to write working parallel code, it's another pain point to ensure that your CUDA version matches what your primary library supports (dependency hell ☠️) and that all of your accelerators are "visible" to it.
Docker makes this process far more manageable by providing preconfigured images with a working CUDA setup for every version. You can even build upon these existing images and add your own custom libraries and frameworks to simplify the process even further.
Weights & Biases logging works well with distributed systems, even allowing you to track system metrics to ensure that all of your GPUs are being utilized.
Let's walk through a couple of examples of how you can use Docker to train TensorFlow and PyTorch pipelines.

Setting Up TensorFlow With GPU Support Using Docker

TensorFlow makes it easy to write distributed code with the tf.distribute API, but making sure that you have the correct versions of the NVIDIA GPU drivers, CUDA Toolkit, CUPTI, cuDNN, and TensorRT can be a hassle. The official TensorFlow images give us a preconfigured environment with all of the essential packages preinstalled.
TensorFlow provides a number of images depending on your use case, such as latest, nightly, and devel. Assuming you have Docker installed on your computer, you can download these images with commands such as:
$ docker pull tensorflow/tensorflow
$ docker pull tensorflow/tensorflow:latest-gpu
These commands pull the latest stable release and the latest GPU-enabled release, respectively. For more detailed instructions, please refer to the official documentation.
Now, assume you have some train.py script with an appropriate distribution strategy, such as:
import tensorflow as tf
import wandb
from wandb.keras import WandbCallback

# Initialize the run
wandb.init(project="GPU-Docker")

if __name__ == "__main__":
    # Other steps like dataset caching and preprocessing

    # Mirror the model across all GPUs visible in the container
    strategy = tf.distribute.MirroredStrategy()

    with strategy.scope():
        # Create and compile the model inside the strategy scope
        model = ...
        model.compile(...)

    model.fit(..., callbacks=[WandbCallback()])
We can simply run this script inside the container, letting TensorFlow use all of the GPUs available on the system, with the following command:
$ docker run --gpus all -v $PWD:/tmp -w /tmp -it tensorflow/tensorflow:latest-gpu python train.py
Let's try to understand this command:
  • docker run - tells the Docker daemon to run the command that follows inside a new container
  • --gpus all - makes all of the host's GPUs available inside the container
  • -v $PWD:/tmp -w /tmp - mounts the current directory ($PWD) into the container at /tmp and sets /tmp as the working directory, so the container can see train.py
  • -it tensorflow/tensorflow:latest-gpu - specifies which image to use and runs it in "interactive" mode, which attaches your terminal to the container
  • python train.py - this command is executed within the container and therefore uses the container's Python interpreter to run our script
If you are using other packages that are not available in the default TensorFlow image, you can create your own Docker image with the TensorFlow image as the base and install the other necessary packages on top of it. We can then build that image and substitute it for the TensorFlow image in the command above. For an example Dockerfile, please refer to this repository.
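As a rough sketch, and assuming your script only needs a couple of extra pip packages (the package names, paths, and image tag below are purely illustrative), such a Dockerfile could look something like this:

FROM tensorflow/tensorflow:latest-gpu

# Extra packages on top of the base TensorFlow GPU image
# (wandb for experiment tracking; add whatever else your script needs)
RUN pip install --no-cache-dir wandb

# Copy the training script into the image
WORKDIR /app
COPY train.py /app/

You would then build and run it much like before:
$ docker build -t my-tf-gpu .
$ docker run --gpus all -it my-tf-gpu python train.py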

Setting Up PyTorch With GPU Support Using Docker

Now, admittedly, torch.distributed is harder to get started with, but we can make distributed execution with PyTorch easier just like we did with TensorFlow.
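To give a feel for that boilerplate, here is a minimal sketch of a DistributedDataParallel train.py; the tiny nn.Linear model and the project name are placeholders, and the dataset, sampler, and training loop are left out:

import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

import wandb

if __name__ == "__main__":
    # torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for every process it spawns
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Log from the rank-0 process only to avoid duplicate W&B runs
    if dist.get_rank() == 0:
        wandb.init(project="GPU-Docker")

    # Placeholder model; wrap your real model with DDP the same way
    model = nn.Linear(10, 1).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    # Dataset, DistributedSampler, optimizer, and the training loop go here
    # ...

    dist.destroy_process_group()

The script is meant to be launched with torchrun (one process per GPU), which we'll do from inside the PyTorch container below.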
Similar to TensorFlow, PyTorch also provides a bunch of great images with various versions of CUDA and cuDNN preconfigured. The simplest way to get started would be to use the latest image, although other tags are also available on their official Docker page.
The procedure to download the official images is the same as for TensorFlow:
$ docker pull pytorch/pytorch:latest
$ docker pull pytorch/pytorch:1.9.1-cuda11.1-cudnn8-runtime
$ docker pull pytorch/pytorch:1.9.1-cuda11.1-cudnn8-devel
These three images are representative of most of the other tags.
The latest image comes with the latest stable versions of PyTorch, CUDA, and cuDNN. There are also tags of the form X-cudaY-cudnnZ-runtime and X-cudaY-cudnnZ-devel, where X is the PyTorch version, Y is the CUDA version, and Z is the cuDNN version. The devel images additionally include the headers and build tools needed to compile custom CUDA extensions.
We can use similar commands to train models with Docker and to build on top of these images.
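For example, assuming the train.py sketch from earlier and a machine with 4 GPUs (adjust --nproc_per_node to match your own GPU count), the launch command might look something like this:

$ docker run --gpus all --ipc=host -v $PWD:/tmp -w /tmp -it pytorch/pytorch:latest torchrun --nproc_per_node=4 train.py

The --ipc=host flag (or a larger --shm-size) is commonly recommended for PyTorch containers because DataLoader workers communicate through shared memory, and Docker's default allocation is small.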


In a framework-agnostic manner, we can also use the official NVIDIA CUDA images, such as:
$ docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi

Summary

In this article, you saw how you can set up both TensorFlow and PyTorch to train deep learning models on all of your GPUs using Docker, making distributed training easier. To see the full suite of W&B features, please check out this short 5-minute guide. If you want more reports covering the math and "from-scratch" code implementations, let us know in the comments down below or on our forum ✨!
Check out these other reports on Fully Connected covering other fundamental development topics like GPU Utilization and Saving Models.
