Distributed training in tf.keras with W&B

Explore the ways to distribute your training workloads with minimal code changes and analyze system metrics with Weights and Biases.
Sayak Paul


In this report, I will show you how to seamlessly integrate tf.distribute.MirroredStrategy for distributing your training workloads across multiple GPUs for tf.keras models. Distributed training can be particularly very useful when you have very large datasets and the need to scale the training costs becomes very prominent with that. It becomes unrealistic to perform the training on only a single hardware accelerator (a GPU in this case), hence the need for performing distributed training.

GitHub repository

Towards the end of the report, we will be seeing two methods that can make distributed training very effective - a. pre-fetching data so that it's ready for the model to consume as soon as it finishes an epoch, b. tuning the batch size.

A huge shoutout to Martin Gorner of Google (ML Product Manager on the Kaggle team) for providing me guidance to prepare this report.


System set-up, costs, etc

We primarily have two options to perform distributed training with GCP -

Compute Engine allows you to create Virtual Machines with a number of different software and hardware configurations that might be suitable for a variety of tasks not just for training deep learning models. On the other hand, AI Platform Notebooks give us pre-configured Jupyter Lab Notebook instances with the flexibility of customization.

In my experience, I have found the process of setting up a Compute Engine instance to be more involved than spinning up an AI Platform Notebook instance. Let's see how they differ from the perspective of costs.

Here are my system configurations:

Compute Engine, for the above, would cost me -


And, AI Platform Notebooks would cost me the following -


As you can see there's a difference between the costs of the two yet the latter (AI Platform Notebooks) one is just click-and-go kind of a thing. As a practitioner, I would want my time to be spent on the things that matter with respect to my expertise, I would not want to reinvent the wheel when it's not needed. Hence, I chose to go with AI Platform Notebooks. For a more thorough coverage on setting up AI Platform Notebooks and using them, refer to this guide.

To be able to use multiple GPUs in one AI Platform Notebook instance, you would first need to apply for a quota increase. You can check out this thread to know more about it.

Show me the code

tf.distribute.MirroredStrategy does parameter updates in synchronous mode using the all-reduce algorithm by default. However, TensorFlow 2.x supports parameter updates in asynchronous modes as well. Explaining their details is out of the scope for this report. In case you are interested in learning more about them, here are some very good resources:

Okay, back to code!

As a starting point, let's first train an image classifier to distinguish between cats and dogs on a single K80 GPU. We will be using a MobileNetV2 network (pre-trained on ImageNet) as our based architecture and on its top, we will append the classification head. So, in code, it would look like so -

# Load the MobileNetV2 model but exclude the classification layers
EXTRACTOR = MobileNetV2(weights='imagenet', include_top=False,
                 input_shape=(224, 224, 3))

# We are fine-tuning
EXTRACTOR.trainable = True

# Construct the head of the model that will be placed on top of the
# the base model
class_head = EXTRACTOR.output
class_head = GlobalAveragePooling2D()(class_head)
class_head = Dense(512, activation="relu")(class_head)
class_head = Dropout(0.5)(class_head)
class_head = Dense(1)(class_head)

# Create the new model
pet_classifier = Model(inputs=EXTRACTOR.input, outputs=class_head)

Training this fella for 10 epochs gives us a good result -

Show me the code

Fine-tuning with LR schedules (single GPU)

Let's now quickly see if the learning schedule had any effect on the training.

Fine-tuning with LR schedules (single GPU)

Porting the model to multiple GPUs

Now, in order to distribute this training across the four GPUs, we first need to define the MirroredStrategy scope -

strategy = tf.distribute.MirroredStrategy()

After that, we can compile our model withing the scope context -

with strategy.scope():
    model = get_training_model()

get_training_model includes the model definition as shown above along with the compilation step. So, what we are doing is creating and compiling the model within the scope of the MirroredStrategy. After this, it's absolutely the same - you call model.fit. Our learning rate schedule will also change a bit as we will now be distributing the model parameters across four GPUs (note the Y-values).


The performance did not change much (at least in terms of accuracy and loss) as you can see in the below figure -

Porting the model to multiple GPUs

Analyzing GPU metrics

GPUs are comparatively not as cheap as other commodity hardware. So, it's important to ensure the GPU utilization is as high as possible. Let's quickly see how we are doing there. Here are the graphs for GPU utilization and the time spent by the GPU to access the memory for fetching data (single GPU). As we can see the GPU utilization high most of the times which is good.

Analyzing GPU metrics

When we use multiple GPUs, we get -

When we use multiple GPUs, we get -

Two methods to further improve performance

Method #1

TensorFlow's data API offers a number of things to further improve model training where input data streaming is a bottleneck. For example, ideally when a model is training the data for the next epoch should be ready so that the model does not need to wait for it. If it needs to wait then it introduces some bottleneck in terms of the overall training time.

prefetch allows us to instruct TensorFlow to prepare the next batch of data ready as soon as the model finishes the current epoch. It even allows us to specify the number of samples the system should fetch beforehand. But what if we would want the system to decide that for us depending on the bandwidth of the system processes and hardware. We can specify that as well with tf.data.experimental.AUTOTUNE -

# Prepare batches and randomly shuffle the training images (this time with prefetch)
train_batches = train.shuffle(1024).repeat().batch(batch_size).prefetch(tf.data.experimental.AUTOTUNE)
valid_batches = valid.repeat().batch(batch_size).prefetch(tf.data.experimental.AUTOTUNE)

Method #2

The second thing we can do is play with the batch sizes. Since we are using multiple GPUs with the synchronous parameter update scheme, each GPU will receive a slice of data and will be trained on that. So, if we use too high of a batch size it might be difficult for the GPUs to properly distribute that among each other. On the other hand, if we use too small of a batch size, the GPUs might go under-utilized. So, we would need to find the sweet spot there. Here are some general suggestions (these come from Martin Gorner's notebook)

# Calculate batch size
batch_size_per_replica = 32
batch_size = batch_size_per_replica * strategy.num_replicas_in_sync

Note that we are using a larger batch size now, in the previous experiment we used a batch size of 16.

The above-mentioned notebook by Martin contains a number of tips and tricks to optimize the model performance when using distributed training and these include:

There are a number of suggestions available in this guide as well.

Okay, enough talking! It's now time to tie the above-discussed methods together and see the results -

Two methods to further improve performance

LR Schedule + BS of 16 + Pre-fetch

LR Schedule + BS of 16 + Pre-fetch

Conclusion and next steps

It's important to keep in mind that when using distributed training, the bigger the dataset the better the utilization of the accelerators. Be it TPUs or multiple GPUs, this logic would hold true.

When using multiple GPUs (whether in a single or in multiple machines/clusters) it's very important to note the cost associated with the synchronization time needed for the multiple accelerators to coordinate between themselves. This video explains some trade-offs related to this.

I hope you got a sense of how easy it is to distribute training workloads for your tf.keras models. As the next steps, you might want to experiment with the different tips and tricks shared in this report and also in the resources I mentioned. If you are more into customizing training loops, you might want to try mixed-precision training and distributed training in there as well.

If you have any feedback to share with me, you can do so via a Tweet here. I would really appreciate it.