
Distributed Training with Weights & Biases

In this article, we explore data-parallel distributed training in Keras, trying different configurations of GPU count.
In this work log, I explore data-parallel distributed training in Keras. I try different configurations of GPU count (1, 2, 4 or 8 GPUs) and total (original)/effective (per GPU) batch size, increase the dataset size, and compare evaluation methods.
I include my notes and ideas for the next steps from different experiments to show a realistic research process.


Initial observations

  • training acceleration is linear-ish: compared to 1 GPU, training runs 1.6 times faster on 2 GPUs and 2.5 times faster on 4 GPUs
  • this is very easy to accomplish in Keras with minimal code effort and no fine-tuning: see multi_gpu_model() in Keras utils
  • tradeoff between batch size, stability, and accuracy: larger batches generally increase training speed and stabilize the training (fewer lags/stalls observed when training in the cloud). Smaller batches lead to slightly higher validation accuracy on average.
  • batch size affects validation accuracy more than GPU count: 4-5% versus 1% — this is reassuring for trusting this data-parallel training paradigm to not adversely affect the final model
  • if time-constrained, train on less data with bigger batches: training on less data (10% of full size) with larger batch sizes yields comparable/slightly higher validation accuracy than training on the full dataset with lower batch size
  • the optimal configuration, especially the effective batch size per GPU, is more subtle to tune and likely depends on the particular model

Data and model

The core model is a basic 7-layer convolutional network that predicts one of 10 animal classes (bird, mammal, reptile, etc.) on class-balanced subsets of iNaturalist 2017, typically 5000 train / 800 test images for fast iteration. You can read more about this task with more powerful models or in the context of curriculum learning.
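For concreteness, here is a minimal sketch of what such a small convnet could look like in Keras; the layer sizes, input resolution, and optimizer are assumptions, not the exact configuration used in these experiments.

```python
# A minimal sketch of a small convnet for 10-way classification.
# Layer sizes, input resolution, and optimizer are assumptions,
# not the exact architecture used in these experiments.
from tensorflow.keras import layers, models

NUM_CLASSES = 10              # bird, mammal, reptile, etc.
INPUT_SHAPE = (128, 128, 3)   # assumed image size

def build_model():
    model = models.Sequential([
        layers.Conv2D(32, 3, activation="relu", input_shape=INPUT_SHAPE),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(128, 3, activation="relu"),
        layers.GlobalAveragePooling2D(),
        layers.Dense(128, activation="relu"),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```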

Code

The code to train a data-parallel model is in this example gist.

Basic Multi-GPU in Keras



[Run set: 2 GPUs with Keras (7 runs)]


Smaller batches for accuracy, larger for speed on 2 GPUs

Keras has a built-in function for data-parallel training on multiple GPUs: multi_gpu_model in keras.utils. This is trivial to enable.
Here are examples on a local machine and on GCP with 2 GPUs.
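Below is a minimal sketch of the setup, assuming a TensorFlow/Keras version that still ships multi_gpu_model (it was removed in later TensorFlow 2.x releases) and a machine with 2 GPUs. The random arrays are placeholders for the real iNaturalist generators, and build_model is the convnet sketch above.

```python
# A minimal sketch of wrapping the model with multi_gpu_model, assuming a
# TensorFlow/Keras version that still ships it and a machine with 2 GPUs.
# The random arrays are placeholders for the real iNaturalist generators.
import numpy as np
from tensorflow.keras.utils import multi_gpu_model, to_categorical

x_train = np.random.rand(256, 128, 128, 3).astype("float32")
y_train = to_categorical(np.random.randint(0, 10, size=256), num_classes=10)

model = build_model()                         # single-GPU model from the sketch above
parallel_model = multi_gpu_model(model, gpus=2)
parallel_model.compile(optimizer="adam",
                       loss="categorical_crossentropy",
                       metrics=["accuracy"])

# The "original" batch size of 64 is split across the 2 GPUs,
# so each replica processes a sub-batch of 32.
parallel_model.fit(x_train, y_train, batch_size=64, epochs=5)
```

The parallel model is compiled and fit exactly like the original; only the wrapping call changes. When saving weights, the Keras docs suggest going through the template model (the one passed to multi_gpu_model) rather than the returned wrapper.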

Notes

  • The "original" batch size shown is for the original, non-parallel version of the model. The parallel version splits each of these batches evenly across the GPUs. The effective sub-batch size seen by each parallel copy of the model is the original batch size / 2, in this case.
  • smaller batch sizes (original 32/64, sub-batch 16/32) reach higher accuracy for a fixed epoch count than larger batch sizes (original 128/256, sub-batch 64/128)
  • local runs are slightly better than cloud (GCP) runs, likely due to more data (6400 vs 5000 images) and possibly due to the physical co-location of the GPUs
  • larger batches generally lead to faster training, with a few confounding factors as seen in the bottom graph
    • local runs (purples) train the fastest
    • not much difference between 32, 64, and 128
    • long stuck period for "batch 64 (V2, 5K train)" (orange)—generally observing high variability in runtimes on GCP
    • I initially attempted to use a clever early version of the multi_gpu_model here which requires substantially more complicated batch size adjustments in the training/validation generator

Next steps

  • increase the number of GPUs, amount of training data
  • explore the effect of batch size
  • increase the complexity of the model (especially ResNet)

Scaling to 4 GPUs

Speed up: 4 GPUs: 2.5X, 2 GPUs: 1.6X

2.5X faster training on 4 GPUs vs 1 GPU

  • the model reaches a slightly higher validation accuracy 2.5X faster when using 4 GPUs versus 1 GPU - this is the main advantage
  • train/val acc/loss are not significantly affected by parallelizing the job across 1, 2, or 4 GPUs - this is expected and reassuring
  • could continue tuning to improve the relative speed-up, which is less than linear
  • need better metrics (data throughput, batches per unit time, time to convergence) to quantify the added value of distributed training

Experiment

Train the 7-layer convnet on the main iNaturalist subset (5000 train / 800 val) as a proof of concept for the Keras multi_gpu_model function.

Notes

  • initially no noticeable difference between 1 GPU and 2 GPUs—masked by one extremely slow run, batch 64_2, which stalled for 2 hours during training for unknown reasons. Leaving it out of the average shows that 2 GPUs yield a 1.6x acceleration
  • accuracy vs batch size: consider effect of both batch size and number of GPUs
  • for 2 GPUs, batch 64 > batch 32 / 128 > batch 256, but the effect is not super clear. 64 seems to be the optimal choice for batch size
  • some combinations might be slower—resource sharing on the CPU?


[Run sets: 1 GPU (5 runs), 2 GPUs (4 runs), 4 GPUs (4 runs), All runs for bar chart (12 runs)]


Next steps

  • run with the same settings on 1, 2, and 3? GPUs
  • try with various batch sizes on 4 GPUs (64, 128)
  • how to log more details of training time, e.g. time per epoch, so we can compare throughput/speed to a fixed accuracy level? (see the timing-callback sketch after this list)
  • try running with more/fixed amounts of data? Other experiments used 6400/1280
  • consider ensembling models
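One way to address the timing question above is a small Keras callback that records per-epoch wall-clock time and rough throughput. This is an illustrative sketch: the metric names are made up, and it assumes wandb.init() has already been called for the run.

```python
# A sketch of a Keras callback that logs wall-clock time per epoch and rough
# throughput to W&B. Metric names are illustrative, and this assumes
# wandb.init() has already been called for the run.
import time

import wandb
from tensorflow.keras.callbacks import Callback

class EpochTimer(Callback):
    """Log seconds per epoch and images per second."""

    def __init__(self, train_size):
        super().__init__()
        self.train_size = train_size   # number of training images seen per epoch

    def on_epoch_begin(self, epoch, logs=None):
        self.start = time.time()

    def on_epoch_end(self, epoch, logs=None):
        elapsed = time.time() - self.start
        wandb.log({"epoch_time_sec": elapsed,
                   "images_per_sec": self.train_size / elapsed},
                  commit=False)

# Usage: model.fit(..., callbacks=[EpochTimer(train_size=5000)])
```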

Batch size matters more than GPU count

Use smaller batches when acc matters

  • batch size 64 is still best across 1/2/4 GPUs, closely followed by 128 and 32, with 256 worst. Note that the differences are very small, and averaging by training step versus time shuffles the ordering.
  • batch size has a bigger impact than GPU count: a 4-5% difference in train/val accuracy between batch sizes 64 and 256, versus ~1% between 1 and 4 GPUs (see the previous section) when training a 7-layer CNN on 5000 images to predict 1 of 10 labels. Note that the learning rate stays constant; adjusting it may equalize the disparity across batch sizes (a sketch of one common adjustment follows below).
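One common heuristic for the learning-rate adjustment mentioned above (not something tested in this report) is to scale the learning rate linearly with the batch size relative to a reference configuration; the base values below are assumptions.

```python
# A common heuristic (not tested in this report): scale the learning rate
# linearly with the batch size, relative to a reference configuration.
BASE_LR = 1e-3         # assumed learning rate tuned for the reference batch size
BASE_BATCH_SIZE = 64   # reference batch size

def scaled_lr(batch_size, base_lr=BASE_LR, base_batch=BASE_BATCH_SIZE):
    """Linear scaling rule: grow the learning rate proportionally with batch size."""
    return base_lr * batch_size / base_batch

print(scaled_lr(256))   # 0.004 -- 4x the base rate for a 4x larger batch
```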


[Run sets: Batch size 32 (3 runs), Batch size 64 (3 runs), Batch size 128 (3 runs), Batch size 256 (3 runs)]


Notes/next steps

  • batch size 32 is smaller but performs worse than 64: perhaps sub-batches of only 8 items are inefficient when split across 4 GPUs.
  • test the effects of 1) more data, 2) a bigger model (simply larger layers / a deeper net, optionally Inception-ResNet V2 / ResNet)
  • sudden jump in training loss for batch size 256 and 64, around 125 minutes in — a side effect of how run averaging works? need to run more trials to average over shifting training dynamics or different clusters?
  • this is hitting CPU limits


Training on 2 GPUs with 10X the data

If time-constrained, train with larger batches on less data

Compare performance when training with 5K (blues) vs 50K (red) images on 2 GPUs. For a fixed 50 epochs, the increase in training time is linear with the increase in dataset size and doesn't significantly improve this particular model.
The 50K version plateaus in the same amount of time it takes the 5K version to finish training (and barely start to plateau). The 50K version does reach a slightly higher max validation accuracy (49% vs 45% for the 5K case), though this decays with further training. The effect of parallelization and of using 10X more data is much more obvious when looking at the time taken to train than at epochs seen.
Note that for a given amount of training time (up to 3.5 hours), training on 5K examples with larger batch sizes outperforms training on 50K examples on validation accuracy. The 50K case eventually surpasses the 5K cases, but the difference in max validation accuracy reached is only about 4-5%.

Next steps

  • repeat with 4 GPUs
  • consider further exploration of training dataset size: tradeoff between more data, accuracy, and more epochs
  • the boldest blue run has an unexplained lag (straight segment)—resource competition?



[Run set: 50K vs 5K on 2 GPUs (5 runs)]


Train 2.5X faster on 4 GPUs, 3X on 8 GPUs?



[Run sets: 1 GPU (4 runs), 2 GPUs (4 runs), 4 GPUs (4 runs), 8 GPUs (4 runs)]



Linear-ish speedup with distributed training

Distributing over 4 GPUs, even for such a small network and dataset, gives a 2.5X speed-up relative to 1 GPU. 2 GPUs give a 1.6X speed-up relative to 1 GPU. Overall, the improvement is not linear with GPU count but still substantial. Increasing the compute to 8 GPUs doesn't improve the runtime with the batch sizes tried so far and goes against the overall trend of reducing compute time.
Note that the 8GPU runs are not directly comparable, as they use double the training data and more than double the validation data.

Next steps

  • The overall accuracy is relatively low in this proof-of-concept. What happens if we train a deeper net?
  • Run distributed training for longer / more iterations to get more reliable estimates of acceleration (though this may be task-specific)
  • Get more comfortable with distributed training approaches in TensorFlow as opposed to Keras; specific strategies may also show a greater speedup (see the sketch after this list)
  • GPU usage is uneven with basic Keras distribution: e.g. 11,000 MiB on GPU 0 and 60 MiB each on 1, 2, and 3.
  • how to quantify throughput more meaningfully?
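For reference, here is a minimal sketch of the TensorFlow-native alternative, tf.distribute.MirroredStrategy, which also splits each global batch across the visible GPUs. build_model refers to the convnet sketch earlier in this post; the commented-out fit call assumes the same placeholder data as before.

```python
# A minimal sketch of the TensorFlow-native approach: MirroredStrategy also
# splits each global batch across the visible GPUs. build_model is the convnet
# sketch from earlier in this post.
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()    # uses all visible GPUs by default
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Model creation and compilation must happen inside the strategy scope.
    model = build_model()

# As with multi_gpu_model, the global batch size is divided among the replicas:
# model.fit(x_train, y_train,
#           batch_size=64 * strategy.num_replicas_in_sync, epochs=5)
```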

Optimal batch size & GPU count? It depends.



[Run set: Vary batch size & GPU count (8 runs)]


Find an optimal batch size for the problem

An original batch size of 64 still does best on average, compared to scaling the batch size up for 8 GPUs. E.g., it may be better to run on 4 GPUs with original batch size 64 (sub-batch size 16) than on 8 GPUs with batch size 64 (sub-batch size 8), or on 4 GPUs with batch size 256 (sub-batch size 64, which is what Keras would recommend).
8 GPUs with the max batch size 512, sub-batch size 64 is still best overall (assuming one has access to this extra compute and is willing to explore the best configuration for a particular problem).
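One way to organize this kind of comparison is to enumerate (GPU count, original batch size) configurations and log each as its own W&B run. The project name and the train() entry point below are hypothetical placeholders, not the code used in this report.

```python
# One way to organize the comparison: enumerate (GPU count, original batch size)
# configurations and log each as a separate W&B run. The project name and the
# train() entry point are hypothetical placeholders.
import wandb

CONFIGS = [
    {"gpus": 4, "batch_size": 64},    # sub-batch of 16 per GPU
    {"gpus": 8, "batch_size": 64},    # sub-batch of 8 per GPU
    {"gpus": 4, "batch_size": 256},   # sub-batch of 64 per GPU (Keras-style scaling)
    {"gpus": 8, "batch_size": 512},   # sub-batch of 64 per GPU
]

for cfg in CONFIGS:
    run = wandb.init(project="keras-distributed", config=cfg, reinit=True)
    # train(gpus=cfg["gpus"], batch_size=cfg["batch_size"])  # hypothetical trainer
    run.finish()
```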

Tentative: Not much impact from fixed sub-batch size & scaled batch size




[Run set: Fixed subbatch size (4 runs)]


Sub-batch size matters at 8 GPUs, not before?

Keras recommends increasing the original batch size proportionally with GPU count. In this scenario, the sub-batch size is fixed at 64 and the original model's batch size is scaled up accordingly from 1 to 8 GPUs.
On 8 GPUs, the training is ~3x faster. There is no noticeable speedup between 1, 2, and 4 GPUs.
This is surprising—perhaps the model needs to be more complex or the data load heavier.
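In code, the Keras-recommended scheme amounts to holding the per-GPU sub-batch fixed and scaling the original batch size with GPU count:

```python
# The scaling used here: keep the per-GPU sub-batch fixed at 64 and grow the
# original batch size with the number of GPUs.
SUB_BATCH = 64
for n_gpus in (1, 2, 4, 8):
    print(f"{n_gpus} GPU(s): original batch size = {SUB_BATCH * n_gpus}")
```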

Tentative: Vary sub-batch size on 8 GPUs




[Run set: Vary batch size, 8 GPU (6 runs)]


Maximize batch size to be safe?

A larger batch size appears more reliable. Note the long relatively flat stretches of the maroon and light blue lines, as if computation temporarily slowed down (issues with the cloud, perhaps?).
More experiments need to be run before we can draw any solid conclusions.