Do Batch Sizes Actually Need To Be Powers of 2?
Is the fixation on powers of 2 for efficient GPU utilization an urban myth? In this article, we explore whether this argument is true when using today's GPUs.

Fixating on powers of 2 for integer parameters is a habit we see a lot in software engineering.
For example, you may have heard the advice to choose powers of 2 for your batch sizes: Andrew Ng reports typical batch sizes of 64, 128, 256, and 512 in one of his deeplearning.ai courses, NVIDIA recommends multiples of 8 for Tensor Cores, and Goodfellow, Bengio & Courville even state in their book Deep Learning that:
Some kinds of hardware achieve better runtime with specific sizes of arrays. Especially when using GPUs, it is common for power of 2 batch sizes to offer better runtime.
Now, there is nothing wrong with these batch sizes! It can be a good exploratory tactic to change a parameter by one order of magnitude.
However, in my experience, the better runtime argument is generally not true when using today's GPUs and deep learning libraries.
If it were, we should see different training times and GPU utilization for a power-of-2 batch size compared to a batch size that is just one sample larger. Let's see if that holds up.
Table of Contents
- Casual Image Classification Test
- Testing Batch Sizes One by One
- Where Have We Gone Wrong?
- What About the Tensor Cores and Multiples of 8?
- How to Choose a Batch Size
- Conclusion
- P.S.
Casual Image Classification Test
We'll start by running ordinary image classification training on a dataset of approximately 3,000 images. The training was done in a Kaggle Notebook with an NVIDIA P100 (which does not have Tensor Cores).
The fine-tuning covered a modern ConvNeXt, a common ResNet, and a DeiT transformer model, each trained with two batch sizes that differ by a single sample (a minimal timing sketch follows the results below):
- ResNet (80 epochs, batch size 128 & 129): 26 minutes
- DeiT (40 epochs, batch size 64 & 65): 39 minutes
- ConvNeXt (40 epochs, batch size 64 & 65): 37 minutes
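If you want to reproduce this kind of comparison on your own hardware, here is a minimal PyTorch sketch. It is not the notebook code from above: torchvision's FakeData stands in for the ~3,000-image dataset, ResNet-50 stands in for all three architectures, and the epoch count is trimmed for brevity.

```python
# Minimal sketch: time two otherwise identical fine-tuning runs that differ
# only in batch size (a power of 2 vs. one sample more).
# FakeData and ResNet-50 are placeholders, not the original notebook setup.
import time

import torch
import torchvision
from torch.utils.data import DataLoader

def train_minutes(batch_size, epochs=5, device="cuda"):
    data = torchvision.datasets.FakeData(
        size=3000, image_size=(3, 224, 224),
        transform=torchvision.transforms.ToTensor(),
    )
    loader = DataLoader(data, batch_size=batch_size, shuffle=True, num_workers=2)
    model = torchvision.models.resnet50(weights=None).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.CrossEntropyLoss()

    start = time.perf_counter()
    for _ in range(epochs):
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = loss_fn(model(images), labels)
            loss.backward()
            optimizer.step()
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / 60

for bs in (128, 129):
    print(f"batch size {bs}: {train_minutes(bs):.1f} minutes")
```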
In the plot below, you'll see the batch sizes in similar colors. You'll also see the training time was almost identical for the different batch sizes:
What's more, the GPU utilization was also about the same for the different batch sizes: 72% for ResNet, 89% for ConvNeXt, and 92% for DeiT. The high GPU utilization is a strong indicator that the GPU was the limiting factor. If odd batch sizes hindered the GPU's ability to parallelize the computations, we should see it here.
We did not:
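By the way, if you want to log GPU utilization in your own runs instead of reading it off a dashboard, here is a small sketch that polls it in a background thread via pynvml (the nvidia-ml-py package). The one-second polling interval and the simple mean are arbitrary choices.

```python
# Sketch: poll GPU utilization in a background thread via pynvml
# (the nvidia-ml-py package); interval and aggregation are arbitrary choices.
import threading
import time

import pynvml

def watch_gpu_utilization(stop_event, samples, interval_s=1.0, device_index=0):
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
    while not stop_event.is_set():
        samples.append(pynvml.nvmlDeviceGetUtilizationRates(handle).gpu)
        time.sleep(interval_s)
    pynvml.nvmlShutdown()

stop_event, samples = threading.Event(), []
watcher = threading.Thread(target=watch_gpu_utilization, args=(stop_event, samples))
watcher.start()

# ... run your training loop here ...
time.sleep(5)  # stand-in for the actual training

stop_event.set()
watcher.join()
print(f"mean GPU utilization: {sum(samples) / max(len(samples), 1):.0f}%")
```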
Testing Batch Sizes One by One
The picture remains the same when testing the different batch sizes one by one. The chart below shows the runtime of training a ConvNeXt for 5 epochs with different batch sizes.
Now, as expected, the runtime decreases as the batch size increases until it hits a limit, but there are no sudden jumps at specific batch sizes. This further suggests that restricting yourself to powers of 2 is not necessary.
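You can also probe for such jumps quickly without a full training run. The sketch below times the forward and backward pass of a single linear layer for batch sizes just below and above 128; it is a microbenchmark on a placeholder layer, not the ConvNeXt experiment from the chart.

```python
# Microbenchmark sketch: forward/backward time of a single linear layer for
# batch sizes straddling 128. Not the ConvNeXt experiment, just a quick probe.
import time

import torch

device = "cuda"
layer = torch.nn.Linear(1024, 1024).to(device)

for batch_size in range(120, 137):
    x = torch.randn(batch_size, 1024, device=device)
    for _ in range(10):                      # warm-up (CUDA/cuBLAS setup)
        layer(x).sum().backward()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(100):
        layer(x).sum().backward()
    torch.cuda.synchronize()
    step_ms = (time.perf_counter() - start) / 100 * 1e3
    print(f"batch size {batch_size:3d}: {step_ms:.3f} ms per step")
```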
Where Have We Gone Wrong?
How come the rule of thumb to choose a power of 2 could not be confirmed by our experiments?
I think it is mainly because parallelization is not limited to the individual samples within a batch. Libraries like PyTorch, TensorFlow, and cuBLAS have been heavily optimized to use GPUs efficiently, so an odd batch size does not throw them off track.
Still, many papers I've read that discuss training procedures, choice of learning rates, etc., have only tested certain batch sizes. And guess which sizes? Exactly the ones that keep the myth alive.
Finally, all of us (yes, even computer scientists) like simple explanations and rules. Rationales like "Streaming Multiprocessors receive warps of 32 threads" are quickly made, sound reasonable, and spread widely.
If you would like to dive deeper into the underlying implementation details (including CUDA programming), I highly recommend MiniTorch by Cornell Tech. You can also reproduce my measurements in this Kaggle notebook and adapt them to your own ideas.
What About the Tensor Cores and Multiples of 8?
I got a hint from a reader (thanks, Mohamed Yousef!) that the P100 doesn't have Tensor Cores and that a GPU with Tensor Cores might behave differently. I repeated the runs on an RTX A4000 (192 Tensor Cores). While training was faster overall, the conclusion did not change for this setup: training ResNet50 for 80 epochs took 16 minutes for both batch sizes (more details here).
I did not nvprof the runs, so it might be that the Tensor Cores could not kick in for some reason other than the batch size. I still think it's a fair comparison, because once you need to profile your training to squeeze out the last ounce of performance, you're not relying on rules of thumb anymore anyway.
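If you do want to check which kernels actually run, torch.profiler is a lighter-weight option than a full profiler session. The sketch below profiles a single FP16 forward/backward pass on a placeholder layer; inspecting kernel names is only a rough heuristic for Tensor Core usage, not a replacement for NVIDIA's profiling tools.

```python
# Sketch: inspect the CUDA kernels of one FP16 forward/backward pass with
# torch.profiler. Kernel names are only a rough hint about Tensor Core usage.
import torch
from torch.profiler import ProfilerActivity, profile

model = torch.nn.Linear(1024, 1024).cuda().half()
x = torch.randn(129, 1024, device="cuda", dtype=torch.half)  # odd batch size

with profile(activities=[ProfilerActivity.CUDA]) as prof:
    model(x).sum().backward()

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```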
You can get more Tips for Optimizing GPU Performance Using Tensor Cores in NVIDIA's technical blog. Note that the numbers shown there are measurements for single layers, not for end-to-end training as in this post.
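For completeness, here is roughly what a Tensor-Core-friendly training step looks like in PyTorch: mixed precision via torch.cuda.amp, with an odd batch size on purpose. The tiny model and random data are placeholders, not anything from the experiments above.

```python
# Sketch of a mixed-precision training step with torch.cuda.amp — the usual
# route to Tensor Cores in PyTorch. Model, data, and batch size are placeholders.
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(512, 512), torch.nn.ReLU(), torch.nn.Linear(512, 10)
).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()
loss_fn = torch.nn.CrossEntropyLoss()

inputs = torch.randn(129, 512, device="cuda")    # odd batch size on purpose
targets = torch.randint(0, 10, (129,), device="cuda")

optimizer.zero_grad()
with torch.cuda.amp.autocast():                  # FP16 matmuls under autocast
    loss = loss_fn(model(inputs), targets)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
print(f"loss: {loss.item():.4f}")
```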
How to Choose a Batch Size
Simply put: there is no single "best" batch size, even for a given data set and model architecture. You need to trade off training time, memory usage, regularization, and accuracy.
Larger batch sizes generally train faster and consume more memory, but they might also lead to lower accuracy.
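If memory is the binding constraint, you can probe for the largest batch size that fits by catching out-of-memory errors. The helper below is a rough sketch: the model, input shape, and doubling search step are arbitrary choices (doubling is just a convenient search strategy, not a power-of-2 requirement), and it assumes a PyTorch version that exposes torch.cuda.OutOfMemoryError.

```python
# Rough sketch: find the largest batch size that fits in GPU memory by
# catching out-of-memory errors. Doubling is just a convenient search step,
# not a power-of-2 requirement. Assumes torch.cuda.OutOfMemoryError exists
# (PyTorch >= 1.13).
import torch
import torchvision

def largest_fitting_batch_size(model, input_shape, start=8, limit=4096, device="cuda"):
    model = model.to(device)
    batch_size = start
    while batch_size <= limit:
        try:
            x = torch.randn(batch_size, *input_shape, device=device)
            model(x).sum().backward()        # forward + backward, like training
            torch.cuda.synchronize()
            batch_size *= 2
        except torch.cuda.OutOfMemoryError:
            break
        finally:
            torch.cuda.empty_cache()
    return batch_size // 2

print(largest_fitting_batch_size(torchvision.models.resnet50(weights=None), (3, 224, 224)))
```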
You can find more details and links to great discussions in the blog post What's the Optimal Batch Size to Train a Neural Network? by Ayush Thakur.
Conclusion
As we have seen, using powers of 2 for the batch size is not readily advantageous in everyday training situations, which leads to the conclusion:
When choosing a batch size, measure its actual effect on training speed, accuracy, and memory consumption instead of fixating on powers of 2.
What is your experience with batch sizes? I would love to hear your feedback in the comments below!
P.S.
Of course, if you still want to stick to powers of 2 for your batch sizes, that's okay! Here are a few arguments you can use instead of better runtime:
- 🤔 limited search space for hyperparameters
- ❤️ all the others do it like this
- 🧱 it just feels more robust
- 🍒 it feels less cherry-picked
- 📖 NVIDIA recommends multiples of 8 for Tensor Cores, and almost all powers of 2 are multiples of 8, and I want to be on the safe side.
- ⚖️ 32 can be a good compromise when you are desperate and can't decide between 27 and 37.