Distributed training in tf.keras with W&B
Introduction
In this report, I will show you how to seamlessly integrate tf.distribute.MirroredStrategy to distribute the training of your tf.keras models across multiple GPUs. Distributed training is particularly useful when your dataset is very large and the cost of training on it becomes the dominant concern - at that point, training on only a single hardware accelerator (a single GPU in this case) is no longer practical, hence the need for distributed training.
GitHub repository →
Towards the end of the report, we will look at two little hacks that can make distributed training much more effective - a. pre-fetching data so that batches are ready for the model to consume without waiting, b. tuning the batch size.
System set-up, costs, etc
We primarily have two options to perform distributed training with GCP -
- Compute Engine
- AI Platform Notebooks
Compute Engine lets you create virtual machines with a wide range of software and hardware configurations, suitable for a variety of tasks beyond training deep learning models. AI Platform Notebooks, on the other hand, give us pre-configured JupyterLab instances with the flexibility of customization.
In my experience, I have found the process of setting up a Compute Engine instance to be more involved than spinning up an AI Platform Notebook instance. Let's see how they differ from the perspective of costs.
Here are my system configurations:
- n1-standard-4 (4 vCPUs, 15 GB memory)
- 4 x Tesla K80 GPUs
- 100 GB standard persistent disk
- Deep Learning Image: Base m44
For the above configuration, Compute Engine would cost me -
And, AI Platform Notebooks would cost me the following -
As you can see, the difference in cost between the two is practically negligible, yet the latter (AI Platform Notebooks) is essentially click-and-go. As a practitioner, I want to spend my time on the things that matter with respect to my expertise, and I don't want to reinvent the wheel when it's not needed. Hence, I chose to go with AI Platform Notebooks. For a more thorough coverage of setting up AI Platform Notebooks and using them, refer to this guide.
To be able to use multiple GPUs in one AI Platform Notebook instance, you would first need to apply for a quota increase. You can check out this thread to know more about it.
Show me the code
tf.distribute.MirroredStrategy performs parameter updates synchronously, using an all-reduce algorithm by default. However, TensorFlow 2.x supports asynchronous parameter updates as well. Explaining their details is out of scope for this report; in case you are interested in learning more about them, here are some very good resources:
- Inside TensorFlow: tf.distribute.Strategy
- Distributed training with TensorFlow
- Scaling TensorFlow 2 models to multi-worker GPUs (TF Dev Summit '20)
Okay, back to code!
As a starting point, let's first train an image classifier to distinguish between cats and dogs on a single K80 GPU. We will use a MobileNetV2 network (pre-trained on ImageNet) as our base architecture and append a classification head on top of it. In code, it looks like so -
from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras.layers import GlobalAveragePooling2D, Dense, Dropout
from tensorflow.keras.models import Model

# Load the MobileNetV2 model but exclude the classification layers
EXTRACTOR = MobileNetV2(weights='imagenet', include_top=False,
                        input_shape=(224, 224, 3))
# We are not training the extractor model
EXTRACTOR.trainable = False

# Construct the head of the model that will be placed on top of
# the base model (a single logit for binary classification)
class_head = EXTRACTOR.output
class_head = GlobalAveragePooling2D()(class_head)
class_head = Dense(512, activation="relu")(class_head)
class_head = Dropout(0.5)(class_head)
class_head = Dense(1)(class_head)

# Create the new model
pet_classifier = Model(inputs=EXTRACTOR.input, outputs=class_head)
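For reference, the compilation and training step for the single-GPU run might look roughly like this. The optimizer, loss configuration, and the steps_per_epoch/validation_steps variables are assumptions on my part (the exact values live in the GitHub repository); train_batches and valid_batches are the batched datasets we prepare later in this report.
import tensorflow as tf

# A minimal training sketch (assumed hyperparameters and variable names)
pet_classifier.compile(optimizer="adam",
    # Dense(1) produces a raw logit, hence from_logits=True
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
    metrics=["accuracy"])

pet_classifier.fit(train_batches,
    validation_data=valid_batches,
    steps_per_epoch=steps_per_epoch,
    validation_steps=validation_steps,
    epochs=10)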
Training this fella for 10 epochs gives us a good result -
Porting the model to multiple GPUs
Now, in order to distribute this training across the four GPUs, we first need to define the MirroredStrategy scope -
strategy = tf.distribute.MirroredStrategy()
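It's handy to verify how many devices the strategy has picked up. As a side note, MirroredStrategy also lets you choose a specific cross-device communication implementation; the HierarchicalCopyAllReduce shown below is purely illustrative and not necessarily what my runs used.
# Confirm the number of replicas participating in synchronous training
print("Number of devices:", strategy.num_replicas_in_sync)  # 4 in our case

# Optional: explicitly choose the all-reduce implementation
# (illustrative only; the default works well on a single machine)
strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())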
After that, we can compile our model within the scope context -
with strategy.scope():
    model = get_training_model()
get_training_model includes the model definition shown above along with the compilation step (a minimal sketch is given below). After this, everything stays exactly the same - you call model.fit.
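Here is a rough sketch of what get_training_model could look like; the optimizer and loss settings are my assumptions, and the actual implementation lives in the GitHub repository.
import tensorflow as tf
from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras.layers import GlobalAveragePooling2D, Dense, Dropout
from tensorflow.keras.models import Model

def get_training_model():
    # Same MobileNetV2-based classifier as before
    extractor = MobileNetV2(weights="imagenet", include_top=False,
                            input_shape=(224, 224, 3))
    extractor.trainable = False

    class_head = GlobalAveragePooling2D()(extractor.output)
    class_head = Dense(512, activation="relu")(class_head)
    class_head = Dropout(0.5)(class_head)
    class_head = Dense(1)(class_head)

    model = Model(inputs=extractor.input, outputs=class_head)
    # Compiling inside this function means that, when it is called under
    # strategy.scope(), the optimizer and variables get mirrored correctly
    model.compile(optimizer="adam",
                  loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
                  metrics=["accuracy"])
    return model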
The performance did not change much (at least in terms of accuracy and loss), as you can see in the figure below -
Analyzing GPU metrics
GPUs are not as cheap as other commodity hardware, so it's important to ensure that GPU utilization stays as high as possible. Let's quickly see how we are doing there. Here are the graphs for GPU utilization (~75%) and the time the GPU spends accessing memory to fetch data (single GPU) -
When we use multiple GPUs, we get -
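These graphs come straight from the W&B run page - once a run is initialized, W&B records system metrics like GPU utilization and memory-access time automatically. A minimal sketch of the logging setup (the project name is hypothetical, and the step-count variables are the assumed names from earlier):
import wandb
from wandb.keras import WandbCallback

# Initializing a run starts automatic system/GPU metric collection
wandb.init(project="distributed-training-tf-keras")  # hypothetical project name

model.fit(train_batches,
    validation_data=valid_batches,
    steps_per_epoch=steps_per_epoch,
    validation_steps=validation_steps,
    epochs=10,
    callbacks=[WandbCallback()])  # also logs per-epoch losses and metrics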
Two hacks to further improve performance
Hack #1
TensorFlow's data API offers a number of utilities to speed up model training when input data streaming is the bottleneck. Ideally, while the model is training on the current batch, the data for the next one should already be prepared so that the model never has to wait for it; otherwise, the input pipeline adds to the overall training time.
prefetch allows us to instruct TensorFlow to have the next batches of data ready while the model is still busy with the current one. It even lets us specify how many batches the system should fetch ahead of time. But what if we want the system to decide that for us, based on the available bandwidth of the host processes and hardware? We can specify that as well with tf.data.experimental.AUTOTUNE -
# Prepare batches and randomly shuffle the training images (this time with prefetch)
train_batches = train.shuffle(1024).repeat().batch(batch_size).prefetch(tf.data.experimental.AUTOTUNE)
valid_batches = valid.repeat().batch(batch_size).prefetch(tf.data.experimental.AUTOTUNE)
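For completeness, here is one way the train and valid datasets referenced above could be constructed with TensorFlow Datasets; the split percentages and preprocessing are assumptions on my part, so check the GitHub repository for the exact pipeline.
import tensorflow as tf
import tensorflow_datasets as tfds

def preprocess(image, label):
    # Resize to the input size MobileNetV2 expects and scale to [-1, 1]
    image = tf.image.resize(image, (224, 224))
    image = tf.cast(image, tf.float32) / 127.5 - 1.0
    return image, label

(train, valid), info = tfds.load("cats_vs_dogs",
    split=["train[:80%]", "train[80%:]"],
    as_supervised=True, with_info=True)
train = train.map(preprocess, num_parallel_calls=tf.data.experimental.AUTOTUNE)
valid = valid.map(preprocess, num_parallel_calls=tf.data.experimental.AUTOTUNE)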
Hack #2
The second thing we can do is play with the batch size. Since we are using multiple GPUs with the synchronous parameter update scheme, each GPU receives a slice of every global batch and trains on it. If the global batch size is too large, it may not split and fit well across the GPUs; if it is too small, the GPUs will be under-utilized. So we need to find the sweet spot. Here are some general suggestions (these come from Martin Gorner's notebook):
- Start with 16 as the local batch size for each of the GPUs. Try keeping it to a multiple of 4/8 depending on the number of GPUs you are using.
- The global batch size then becomes local_batch_size * number_of_GPUs.
# Calculate batch size
batch_size_per_replica = 16
batch_size = batch_size_per_replica * strategy.num_replicas_in_sync
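One small detail that is easy to miss: because the datasets above use repeat(), model.fit needs explicit step counts per epoch. A sketch of that calculation, assuming num_train_examples and num_valid_examples come from the dataset metadata:
# With repeated datasets, fit() needs to know how long an epoch is
steps_per_epoch = num_train_examples // batch_size
validation_steps = num_valid_examples // batch_size
# These are then passed to model.fit(..., steps_per_epoch=steps_per_epoch,
#                                     validation_steps=validation_steps)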
The above-mentioned notebook contains a number of tips and tricks to optimize model performance when using distributed training, including (a short sketch of some of these follows the list):
- Using mixed-precision training and XLA-enabled compilation
- Delegating the number of parallelization threads to tf.data.experimental.AUTOTUNE
- Learning rate schedules
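As an illustration, here is roughly how mixed-precision training, XLA compilation, and a learning rate schedule could be switched on in TensorFlow 2.x. These particular settings are examples rather than the configuration used for the runs in this report, and the mixed-precision API location varies slightly across TF 2.x versions.
import tensorflow as tf

# Mixed precision: compute in float16 while keeping variables in float32
tf.keras.mixed_precision.experimental.set_policy("mixed_float16")

# Enable XLA compilation of the graphs
tf.config.optimizer.set_jit(True)

# A simple learning rate schedule (values are arbitrary examples)
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-3, decay_steps=1000, decay_rate=0.96)
optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)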
There are a number of suggestions available in this guide as well.
Okay, enough talking! It's now time to tie the above-discussed hacks together and see the results -
Conclusion and next steps
It's important to keep in mind that when using distributed training, the bigger the dataset the better the utilization of the accelerators. Be it TPUs or multiple GPUs, this logic would hold true.
When using multiple GPUs (whether in a single machine or across multiple machines/clusters), it's very important to account for the synchronization cost - the time the accelerators spend coordinating with one another. This video explains some of the trade-offs involved.
I hope you got a sense of how easy it is to distribute training workloads for your tf.keras models. As next steps, you might want to experiment with the different tips and tricks shared in this report and in the resources I mentioned. If you are more into writing custom training loops, you might want to try mixed-precision training and distributed training there as well.
If you have any feedback to share with me, you can do so via a Tweet here. I would highly appreciate it.