Distributed training in tf.keras with W&B
Introduction
In this report, I will show you how to seamlessly integrate tf.distribute.MirroredStrategy to distribute the training of your tf.keras models across multiple GPUs. Distributed training is particularly useful when your dataset is very large and the cost of training on it becomes the dominant concern - at that point, training on only a single hardware accelerator (a single GPU in this case) is no longer practical, hence the need for distributed training.
GitHub repository →
Towards the end of the report, we will look at two little hacks that can make distributed training much more effective - a. pre-fetching data so that batches are ready for the model to consume without waiting, b. tuning the batch size.
System set-up, costs, etc
We primarily have two options to perform distributed training with GCP -
- Compute Engine
- AI Platform Notebooks
Compute Engine lets you create virtual machines with a wide range of software and hardware configurations, suitable for a variety of tasks beyond training deep learning models. AI Platform Notebooks, on the other hand, give us pre-configured JupyterLab instances with the flexibility of customization.
In my experience, I have found the process of setting up a Compute Engine instance to be more involved than spinning up an AI Platform Notebook instance. Let's see how they differ from the perspective of costs.
Here are my system configurations:
- n1-standard-4 (4 vCPUs, 15 GB memory)
- 4 x Tesla K80 GPUs
- 100 GB standard persistent disk
- Deep Learning Image: Base m44
For the above configuration, Compute Engine would cost me -
And, AI Platform Notebooks would cost me the following -
As you can see, the difference in cost between the two is practically negligible, yet the latter (AI Platform Notebooks) is essentially click-and-go. As a practitioner, I want to spend my time on the things that matter with respect to my expertise, and I don't want to reinvent the wheel when it's not needed. Hence, I chose to go with AI Platform Notebooks. For a more thorough coverage of setting up AI Platform Notebooks and using them, refer to this guide.
To be able to use multiple GPUs in one AI Platform Notebook instance, you would first need to apply for a quota increase. You can check out this thread to know more about it.
Show me the code
tf.distribute.MirroredStrategy performs parameter updates synchronously, using an all-reduce algorithm by default. However, TensorFlow 2.x supports asynchronous parameter updates as well. Explaining their details is out of scope for this report; in case you are interested in learning more about them, here are some very good resources:
- Inside TensorFlow: tf.distribute.Strategy
- Distributed training with TensorFlow
- Scaling TensorFlow 2 models to multi-worker GPUs (TF Dev Summit '20)
Okay, back to code!
As a starting point, let's first train an image classifier to distinguish between cats and dogs on a single K80 GPU. We will use a MobileNetV2 network (pre-trained on ImageNet) as our base architecture and append a classification head on top of it. In code, it looks like so -
from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras.layers import GlobalAveragePooling2D, Dense, Dropout
from tensorflow.keras.models import Model

# Load the MobileNetV2 model but exclude the classification layers
EXTRACTOR = MobileNetV2(weights='imagenet', include_top=False,
                        input_shape=(224, 224, 3))
# We are not training the extractor model
EXTRACTOR.trainable = False

# Construct the head of the model that will be placed on top of
# the base model (a single logit for binary classification)
class_head = EXTRACTOR.output
class_head = GlobalAveragePooling2D()(class_head)
class_head = Dense(512, activation="relu")(class_head)
class_head = Dropout(0.5)(class_head)
class_head = Dense(1)(class_head)

# Create the new model
pet_classifier = Model(inputs=EXTRACTOR.input, outputs=class_head)
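For reference, the compilation and training step for the single-GPU run might look roughly like this. The optimizer, loss configuration, and the steps_per_epoch/validation_steps variables are assumptions on my part (the exact values live in the GitHub repository); train_batches and valid_batches are the batched datasets we prepare later in this report.
import tensorflow as tf

# A minimal training sketch (assumed hyperparameters and variable names)
pet_classifier.compile(optimizer="adam",
    # Dense(1) produces a raw logit, hence from_logits=True
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
    metrics=["accuracy"])

pet_classifier.fit(train_batches,
    validation_data=valid_batches,
    steps_per_epoch=steps_per_epoch,
    validation_steps=validation_steps,
    epochs=10)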
Training this fella for 10 epochs gives us a good result -
Porting the model to multiple GPUs
Now, in order to distribute this training across the four GPUs, we first need to define the MirroredStrategy scope -
strategy = tf.distribute.MirroredStrategy()
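It's handy to verify how many devices the strategy has picked up. As a side note, MirroredStrategy also lets you choose a specific cross-device communication implementation; the HierarchicalCopyAllReduce shown below is purely illustrative and not necessarily what my runs used.
# Confirm the number of replicas participating in synchronous training
print("Number of devices:", strategy.num_replicas_in_sync)  # 4 in our case

# Optional: explicitly choose the all-reduce implementation
# (illustrative only; the default works well on a single machine)
strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())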
After that, we can compile our model within the scope context -
with strategy.scope():
    model = get_training_model()
get_training_model includes the model definition shown above along with the compilation step (a minimal sketch is given below). After this, everything stays exactly the same - you call model.fit.
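Here is a rough sketch of what get_training_model could look like; the optimizer and loss settings are my assumptions, and the actual implementation lives in the GitHub repository.
import tensorflow as tf
from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras.layers import GlobalAveragePooling2D, Dense, Dropout
from tensorflow.keras.models import Model

def get_training_model():
    # Same MobileNetV2-based classifier as before
    extractor = MobileNetV2(weights="imagenet", include_top=False,
                            input_shape=(224, 224, 3))
    extractor.trainable = False

    class_head = GlobalAveragePooling2D()(extractor.output)
    class_head = Dense(512, activation="relu")(class_head)
    class_head = Dropout(0.5)(class_head)
    class_head = Dense(1)(class_head)

    model = Model(inputs=extractor.input, outputs=class_head)
    # Compiling inside this function means that, when it is called under
    # strategy.scope(), the optimizer and variables get mirrored correctly
    model.compile(optimizer="adam",
                  loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
                  metrics=["accuracy"])
    return model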
The performance did not change much (at least in terms of accuracy and loss), as you can see in the figure below -
Analyzing GPU metrics
GPUs are not as cheap as other commodity hardware, so it's important to ensure that GPU utilization stays as high as possible. Let's quickly see how we are doing there. Here are the graphs for GPU utilization (~75%) and the time the GPU spends accessing memory to fetch data (single GPU) -
When we use multiple GPUs, we get -
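These graphs come straight from the W&B run page - once a run is initialized, W&B records system metrics like GPU utilization and memory-access time automatically. A minimal sketch of the logging setup (the project name is hypothetical, and the step-count variables are the assumed names from earlier):
import wandb
from wandb.keras import WandbCallback

# Initializing a run starts automatic system/GPU metric collection
wandb.init(project="distributed-training-tf-keras")  # hypothetical project name

model.fit(train_batches,
    validation_data=valid_batches,
    steps_per_epoch=steps_per_epoch,
    validation_steps=validation_steps,
    epochs=10,
    callbacks=[WandbCallback()])  # also logs per-epoch losses and metrics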
Two hacks to further improve performance
Hack #1
TensorFlow's data API offers a number of utilities to speed up model training when input data streaming is the bottleneck. Ideally, while the model is training on the current batch, the data for the next one should already be prepared so that the model never has to wait for it; otherwise, the input pipeline adds to the overall training time.
prefetch allows us to instruct TensorFlow to have the next batches of data ready while the model is still busy with the current one. It even lets us specify how many batches the system should fetch ahead of time. But what if we want the system to decide that for us, based on the available bandwidth of the host processes and hardware? We can specify that as well with tf.data.experimental.AUTOTUNE -
# Prepare batches and randomly shuffle the training images (this time with prefetch)
train_batches = train.shuffle(1024).repeat().batch(batch_size).prefetch(tf.data.experimental.AUTOTUNE)
valid_batches = valid.repeat().batch(batch_size).prefetch(tf.data.experimental.AUTOTUNE)
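For completeness, here is one way the train and valid datasets referenced above could be constructed with TensorFlow Datasets; the split percentages and preprocessing are assumptions on my part, so check the GitHub repository for the exact pipeline.
import tensorflow as tf
import tensorflow_datasets as tfds

def preprocess(image, label):
    # Resize to the input size MobileNetV2 expects and scale to [-1, 1]
    image = tf.image.resize(image, (224, 224))
    image = tf.cast(image, tf.float32) / 127.5 - 1.0
    return image, label

(train, valid), info = tfds.load("cats_vs_dogs",
    split=["train[:80%]", "train[80%:]"],
    as_supervised=True, with_info=True)
train = train.map(preprocess, num_parallel_calls=tf.data.experimental.AUTOTUNE)
valid = valid.map(preprocess, num_parallel_calls=tf.data.experimental.AUTOTUNE)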
Hack #2
The second thing we can do is play with the batch size. Since we are using multiple GPUs with the synchronous parameter update scheme, each GPU receives a slice of every global batch and trains on it. If the global batch size is too large, it may not split and fit well across the GPUs; if it is too small, the GPUs will be under-utilized. So we need to find the sweet spot. Here are some general suggestions (these come from Martin Gorner's notebook):
- Start with 16 as the local batch size for each of the GPUs. Try keeping it to a multiple of 4/8 depending on the number of GPUs you are using.
- The global batch size then becomes local_batch_size * number_of_GPUs.
# Calculate batch size
batch_size_per_replica = 16
batch_size = batch_size_per_replica * strategy.num_replicas_in_sync
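One small detail that is easy to miss: because the datasets above use repeat(), model.fit needs explicit step counts per epoch. A sketch of that calculation, assuming num_train_examples and num_valid_examples come from the dataset metadata:
# With repeated datasets, fit() needs to know how long an epoch is
steps_per_epoch = num_train_examples // batch_size
validation_steps = num_valid_examples // batch_size
# These are then passed to model.fit(..., steps_per_epoch=steps_per_epoch,
#                                     validation_steps=validation_steps)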
The above-mentioned notebook contains a number of tips and tricks to optimize model performance when using distributed training, including (a short sketch of some of these follows the list):
- Using mixed-precision training and XLA-enabled compilation
- Delegating the number of parallelization threads to tf.data.experimental.AUTOTUNE
- Learning rate schedules
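As an illustration, here is roughly how mixed-precision training, XLA compilation, and a learning rate schedule could be switched on in TensorFlow 2.x. These particular settings are examples rather than the configuration used for the runs in this report, and the mixed-precision API location varies slightly across TF 2.x versions.
import tensorflow as tf

# Mixed precision: compute in float16 while keeping variables in float32
tf.keras.mixed_precision.experimental.set_policy("mixed_float16")

# Enable XLA compilation of the graphs
tf.config.optimizer.set_jit(True)

# A simple learning rate schedule (values are arbitrary examples)
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-3, decay_steps=1000, decay_rate=0.96)
optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)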
There are a number of suggestions available in this guide as well.
Okay, enough talking! It's now time to tie the above-discussed hacks together and see the results -
Conclusion and next steps
It's important to keep in mind that when using distributed training, the bigger the dataset the better the utilization of the accelerators. Be it TPUs or multiple GPUs, this logic would hold true.
When using multiple GPUs (whether in a single machine or across multiple machines/clusters), it's very important to account for the synchronization cost - the time the accelerators spend coordinating with one another. This video explains some of the trade-offs involved.
I hope you got a sense of how easy it is to distribute training workloads for your tf.keras models. As next steps, you might want to experiment with the different tips and tricks shared in this report and in the resources I mentioned. If you are more into writing custom training loops, you might want to try mixed-precision training and distributed training there as well.
If you have any feedback to share with me, you can do so via a Tweet here. I would highly appreciate it.