
Distributed Training with Weights & Biases

In this article, we explore data-parallel distributed training in Keras, trying different configurations of GPU count.
In this work log, I explore data-parallel distributed training in Keras. I try different configurations of GPU count (1, 2, 4 or 8 GPUs) and total (original)/effective (per GPU) batch size, increase the dataset size, and compare evaluation methods.
I include my notes and ideas for the next steps from different experiments to show a realistic research process.


Initial observations

  • training acceleration is linear-ish: compared to 1 GPU, training runs 1.6 times faster on 2 GPUs and 2.5 times faster on 4 GPUs
  • this is very easy to accomplish in Keras with minimal code effort and no fine-tuning: see multi_gpu_model() in Keras utils
  • tradeoff between batch size, stability, and accuracy: larger batches generally increase training speed and stabilize the training (fewer lags/stalls observed when training in the cloud). Smaller batches lead to slightly higher validation accuracy on average.
  • batch size affects validation accuracy more than GPU count: 4-5% versus 1% — this is reassuring for trusting this data-parallel training paradigm to not adversely affect the final model
  • if time-constrained, train on less data with bigger batches: training on less data (10% of full size) with larger batch sizes yields comparable/slightly higher validation accuracy than training on the full dataset with lower batch size
  • the optimal configuration, especially the effective batch size per GPU, is more subtle to tune and likely depends on the particular model

Data and model

The core model is a basic 7-layer convolutional network that predicts one of 10 animal classes (bird, mammal, reptile, etc.) on class-balanced subsets of iNaturalist 2017, typically 5000 train / 800 test images for fast iteration. You can read more about this task with more powerful models or in the context of curriculum learning.
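For concreteness, here is a minimal sketch of what such a small convnet could look like in Keras; the layer sizes, input resolution, and optimizer are assumptions, not the exact configuration used in these experiments.

```python
# A minimal sketch of a small convnet for 10-way classification.
# Layer sizes, input resolution, and optimizer are assumptions,
# not the exact architecture used in these experiments.
from tensorflow.keras import layers, models

NUM_CLASSES = 10              # bird, mammal, reptile, etc.
INPUT_SHAPE = (128, 128, 3)   # assumed image size

def build_model():
    model = models.Sequential([
        layers.Conv2D(32, 3, activation="relu", input_shape=INPUT_SHAPE),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(128, 3, activation="relu"),
        layers.GlobalAveragePooling2D(),
        layers.Dense(128, activation="relu"),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```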

Code

The code to train a data-parallel model is in this example gist.

Basic Multi-GPU in Keras



[Run set: 2 GPUs with Keras (7 runs)]


Smaller batches for accuracy, larger for speed on 2 GPUs

Keras has a built-in function for data-parallel training on multiple GPUs: multi_gpu_model in keras.utils. This is trivial to enable.
Here are examples on a local machine and on GCP with 2 GPUs.
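Below is a minimal sketch of the setup, assuming a TensorFlow/Keras version that still ships multi_gpu_model (it was removed in later TensorFlow 2.x releases) and a machine with 2 GPUs. The random arrays are placeholders for the real iNaturalist generators, and build_model is the convnet sketch above.

```python
# A minimal sketch of wrapping the model with multi_gpu_model, assuming a
# TensorFlow/Keras version that still ships it and a machine with 2 GPUs.
# The random arrays are placeholders for the real iNaturalist generators.
import numpy as np
from tensorflow.keras.utils import multi_gpu_model, to_categorical

x_train = np.random.rand(256, 128, 128, 3).astype("float32")
y_train = to_categorical(np.random.randint(0, 10, size=256), num_classes=10)

model = build_model()                         # single-GPU model from the sketch above
parallel_model = multi_gpu_model(model, gpus=2)
parallel_model.compile(optimizer="adam",
                       loss="categorical_crossentropy",
                       metrics=["accuracy"])

# The "original" batch size of 64 is split across the 2 GPUs,
# so each replica processes a sub-batch of 32.
parallel_model.fit(x_train, y_train, batch_size=64, epochs=5)
```

The parallel model is compiled and fit exactly like the original; only the wrapping call changes. When saving weights, the Keras docs suggest going through the template model (the one passed to multi_gpu_model) rather than the returned wrapper.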

Notes

  • The "original" batch size shown is for the original, non-parallel version of the model. The parallel version splits each of these batches evenly across the GPUs. The effective sub-batch size seen by each parallel copy of the model is the original batch size / 2, in this case.
  • smaller batch sizes (original 32/64, sub-batch 16/32) reach higher accuracy for a fixed epoch count than larger batch sizes (original 128/256, sub-batch 64/128)
  • local runs are slightly better than cloud (GCP) runs, likely due to more data (6400 vs 5000 images) and possibly due to the physical co-location of the GPUs
  • larger batches generally lead to faster training, with a few confounding factors as seen in the bottom graph
    • local runs (purples) train the fastest
    • not much difference between 32, 64, and 128
    • long stuck period for "batch 64 (V2, 5K train)" (orange)—generally observing high variability in runtimes on GCP
    • I initially attempted to use a clever early version of the multi_gpu_model here which requires substantially more complicated batch size adjustments in the training/validation generator

Next steps

  • increase the number of GPUs, amount of training data
  • explore the effect of batch size
  • increase the complexity of the model (especially ResNet)

Scaling to 4 GPUs

Speed up: 4 GPUs: 2.5X, 2 GPUs: 1.6X

2.5X faster training on 4 GPUs vs 1 GPU

  • the model reaches a slightly higher validation accuracy 2.5X faster when using 4 GPUs versus 1 GPU - this is the main advantage
  • train/val acc/loss are not significantly affected by parallelizing the job across 1, 2, or 4 GPUs - this is expected and reassuring
  • could continue tuning to improve the relative speed-up, which is less than linear
  • need better metrics (data throughput, batches per unit time, time to convergence) to quantify the added value of distributed training

Experiment

Train the 7-layer convnet on the main iNaturalist subset (5000 train / 800 val) as a proof of concept for the Keras multi_gpu_model function.

Notes

  • initially no noticeable difference between 1 GPU and 2 GPUs—masked by one extremely slow run, batch 64_2, which stalled for 2 hours during training for unknown reasons. Leaving it out of the average shows that 2 GPUs yield a 1.6x acceleration
  • accuracy vs batch size: consider effect of both batch size and number of GPUs
  • for 2 GPUs, batch 64 > batch 32 / 128 > batch 256, but the effect is not super clear. 64 seems to be the optimal choice for batch size
  • some combinations might be slower—resource sharing on the CPU?


[Run sets: 1 GPU (5 runs), 2 GPUs (4 runs), 4 GPUs (4 runs), All runs for bar chart (12 runs)]


Next steps

  • run with the same settings on 1, 2, and 3? GPUs
  • try with various batch sizes on 4 GPUs (64, 128)
  • how to log more details of training time, e.g. time per epoch, so we can compare throughput/speed to a fixed accuracy level? (see the timing-callback sketch after this list)
  • try running with more/fixed amounts of data? Other experiments used 6400/1280
  • consider ensembling models
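One way to address the timing question above is a small Keras callback that records per-epoch wall-clock time and rough throughput. This is an illustrative sketch: the metric names are made up, and it assumes wandb.init() has already been called for the run.

```python
# A sketch of a Keras callback that logs wall-clock time per epoch and rough
# throughput to W&B. Metric names are illustrative, and this assumes
# wandb.init() has already been called for the run.
import time

import wandb
from tensorflow.keras.callbacks import Callback

class EpochTimer(Callback):
    """Log seconds per epoch and images per second."""

    def __init__(self, train_size):
        super().__init__()
        self.train_size = train_size   # number of training images seen per epoch

    def on_epoch_begin(self, epoch, logs=None):
        self.start = time.time()

    def on_epoch_end(self, epoch, logs=None):
        elapsed = time.time() - self.start
        wandb.log({"epoch_time_sec": elapsed,
                   "images_per_sec": self.train_size / elapsed},
                  commit=False)

# Usage: model.fit(..., callbacks=[EpochTimer(train_size=5000)])
```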

Batch size matters more than GPU count

Use smaller batches when acc matters

  • batch size 64 is still best across 1/2/4 GPUs, closely followed by 128 and 32, with 256 worst. Note that the differences are very small, and averaging by training step versus time shuffles the ordering.
  • batch size has a bigger impact than GPU count: a 4-5% difference in train/val accuracy between batch sizes 64 and 256, versus ~1% between 1 and 4 GPUs (see the previous section) when training a 7-layer CNN on 5000 images to predict 1 of 10 labels. Note that the learning rate stays constant; adjusting it may equalize the disparity across batch sizes (a sketch of one common adjustment follows below).
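One common heuristic for the learning-rate adjustment mentioned above (not something tested in this report) is to scale the learning rate linearly with the batch size relative to a reference configuration; the base values below are assumptions.

```python
# A common heuristic (not tested in this report): scale the learning rate
# linearly with the batch size, relative to a reference configuration.
BASE_LR = 1e-3         # assumed learning rate tuned for the reference batch size
BASE_BATCH_SIZE = 64   # reference batch size

def scaled_lr(batch_size, base_lr=BASE_LR, base_batch=BASE_BATCH_SIZE):
    """Linear scaling rule: grow the learning rate proportionally with batch size."""
    return base_lr * batch_size / base_batch

print(scaled_lr(256))   # 0.004 -- 4x the base rate for a 4x larger batch
```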


[Run sets: Batch size 32 (3 runs), Batch size 64 (3 runs), Batch size 128 (3 runs), Batch size 256 (3 runs)]


Notes/next steps

  • batch size 32 is smaller but performs worse than 64: perhaps sub-batches of only 8 items are inefficient when split across 4 GPUs.
  • test the effects of 1) more data, 2) a bigger model (simply larger layers / a deeper net, optionally Inception-ResNet V2 / ResNet)
  • sudden jump in training loss for batch size 256 and 64, around 125 minutes in — a side effect of how run averaging works? need to run more trials to average over shifting training dynamics or different clusters?
  • this is hitting CPU limits


Training on 2 GPUs with 10X the data

If time-constrained, train with larger batches on less data

Compare performance when training with 5K (blues) vs 50K (red) images on 2 GPUs. For a fixed 50 epochs, the increase in training time is linear with the increase in dataset size and doesn't significantly improve this particular model.
The 50K version plateaus in the same amount of time it takes the 5K version to finish training (and barely start to plateau). The 50K version does reach a slightly higher max validation accuracy (49% vs 45% for the 5K case), though this decays with further training. The effect of parallelization and of using 10X more data is much more obvious when looking at the time taken to train than at epochs seen.
Note that for a given amount of training time (up to 3.5 hours), training on 5K examples with larger batch sizes outperforms training on 50K examples on validation accuracy. The 50K case eventually surpasses the 5K cases, but the difference in max validation accuracy reached is only about 4-5%.

Next steps

  • repeat with 4 GPUs
  • consider further exploration of training dataset size: tradeoff between more data, accuracy, and more epochs
  • the boldest blue run has an unexplained lag (straight segment)—resource competition?



[Run set: 50K vs 5K on 2 GPUs (5 runs)]


Train 2.5X faster on 4 GPUs, 3X on 8 GPUs?



[Run sets: 1 GPU (4 runs), 2 GPUs (4 runs), 4 GPUs (4 runs), 8 GPUs (4 runs)]



Linear-ish speedup with distributed training

Distributing over 4 GPUs, even for such a small network and dataset, gives a 2.5X speed-up relative to 1 GPU. 2 GPUs give a 1.6X speed-up relative to 1 GPU. Overall, the improvement is not linear with GPU count but still substantial. Increasing the compute to 8 GPUs doesn't improve the runtime with the batch sizes tried so far and goes against the overall trend of reducing compute time.
Note that the 8GPU runs are not directly comparable, as they use double the training data and more than double the validation data.

Next steps

  • The overall accuracy is relatively low in this proof-of-concept. What happens if we train a deeper net?
  • Run distributed training for longer / more iterations to get more reliable estimates of acceleration (though this may be task-specific)
  • Get more comfortable with distributed training approaches in TensorFlow as opposed to Keras; specific strategies may also show a greater speedup (see the sketch after this list)
  • GPU usage is uneven with basic Keras distribution: e.g. 11,000 MiB on GPU 0 and 60 MiB each on 1, 2, and 3.
  • how to quantify throughput more meaningfully?
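For reference, here is a minimal sketch of the TensorFlow-native alternative, tf.distribute.MirroredStrategy, which also splits each global batch across the visible GPUs. build_model refers to the convnet sketch earlier in this post; the commented-out fit call assumes the same placeholder data as before.

```python
# A minimal sketch of the TensorFlow-native approach: MirroredStrategy also
# splits each global batch across the visible GPUs. build_model is the convnet
# sketch from earlier in this post.
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()    # uses all visible GPUs by default
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Model creation and compilation must happen inside the strategy scope.
    model = build_model()

# As with multi_gpu_model, the global batch size is divided among the replicas:
# model.fit(x_train, y_train,
#           batch_size=64 * strategy.num_replicas_in_sync, epochs=5)
```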

Optimal batch size & GPU count? It depends.



[Run set: Vary batch size & GPU count (8 runs)]


Find an optimal batch size for the problem

An original batch size of 64 still does best on average, compared to scaling the batch size up for 8 GPUs. E.g., it may be better to run on 4 GPUs with original batch size 64 (sub-batch size 16) than on 8 GPUs with batch size 64 (sub-batch size 8), or on 4 GPUs with batch size 256 (sub-batch size 64, which is what Keras would recommend).
8 GPUs with the max batch size 512, sub-batch size 64 is still best overall (assuming one has access to this extra compute and is willing to explore the best configuration for a particular problem).
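One way to organize this kind of comparison is to enumerate (GPU count, original batch size) configurations and log each as its own W&B run. The project name and the train() entry point below are hypothetical placeholders, not the code used in this report.

```python
# One way to organize the comparison: enumerate (GPU count, original batch size)
# configurations and log each as a separate W&B run. The project name and the
# train() entry point are hypothetical placeholders.
import wandb

CONFIGS = [
    {"gpus": 4, "batch_size": 64},    # sub-batch of 16 per GPU
    {"gpus": 8, "batch_size": 64},    # sub-batch of 8 per GPU
    {"gpus": 4, "batch_size": 256},   # sub-batch of 64 per GPU (Keras-style scaling)
    {"gpus": 8, "batch_size": 512},   # sub-batch of 64 per GPU
]

for cfg in CONFIGS:
    run = wandb.init(project="keras-distributed", config=cfg, reinit=True)
    # train(gpus=cfg["gpus"], batch_size=cfg["batch_size"])  # hypothetical trainer
    run.finish()
```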

Tentative: Not much impact from fixed sub-batch size & scaled batch size




[Run set: Fixed subbatch size (4 runs)]


Sub-batch size matters at 8 GPUs, not before?

Keras recommends increasing the original batch size proportionally with GPU count. In this scenario, the sub-batch size is fixed at 64 and the original model's batch size is scaled up accordingly from 1 to 8 GPUs.
On 8 GPUs, the training is ~3x faster. There is no noticeable speedup between 1, 2, and 4 GPUs.
This is surprising—perhaps the model needs to be more complex or the data load heavier.
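In code, the Keras-recommended scheme amounts to holding the per-GPU sub-batch fixed and scaling the original batch size with GPU count:

```python
# The scaling used here: keep the per-GPU sub-batch fixed at 64 and grow the
# original batch size with the number of GPUs.
SUB_BATCH = 64
for n_gpus in (1, 2, 4, 8):
    print(f"{n_gpus} GPU(s): original batch size = {SUB_BATCH * n_gpus}")
```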

Tentative: Vary sub-batch size on 8 GPUs




[Run set: Vary batch size, 8 GPU (6 runs)]


Maximize batch size to be safe?

A larger batch size appears more reliable. Note the long relatively flat stretches of the maroon and light blue lines, as if computation temporarily slowed down (issues with the cloud, perhaps?).
More experiments need to be run before we can draw any solid conclusions.