What Makes Depthwise Separable Convolutions Faster
Exploring how Depthwise Separable Convolutions can make CNNs faster
Introduction
The convolution operation is a widely used function in Functional Analysis, Image Processing, and Deep Learning. When applied to two functions f and g, it produces a third function expressing how the shape of one is modified by the other. While it is immensely popular, especially in the domain of Deep Learning, the vanilla convolution operation is quite expensive computationally. Several neural network architectures such as Xception and MobileNet use a special type of convolution called Depthwise Separable Convolution to speed up training and inference, especially on Mobile and Embedded Devices. In this article, we will investigate exactly why Depthwise Separable Convolutions are so much faster than Vanilla Convolutions by deriving their respective computational complexities.
The Vanilla Convolution Operation
The vanilla convolution of two functions $f$ and $g$ can be mathematically defined as the following:

$(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau$

For all non-negative values of $t$ (i.e., for all values of $t$ such that $t \in [0, \infty)$), we can truncate the limits of integration, resulting in

$(f * g)(t) = \int_{0}^{t} f(\tau)\, g(t - \tau)\, d\tau$
It can also be defined as the overlap of two functions f and g as one slides over the other, performing a sum of products.
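To make this sliding sum of products concrete, here is a minimal sketch of a discrete 1-D convolution in plain Python/NumPy; the function name and example arrays are illustrative, not taken from the original article:

```python
import numpy as np

def discrete_convolution(f, g):
    """Full discrete convolution: slide the flipped g over f and sum the products."""
    n = len(f) + len(g) - 1
    out = np.zeros(n)
    for t in range(n):                      # each output position
        for k in range(len(g)):             # each term of the overlap
            if 0 <= t - k < len(f):
                out[t] += f[t - k] * g[k]   # sum of products over the overlap
    return out

f = np.array([1.0, 2.0, 3.0])
g = np.array([0.0, 1.0, 0.5])
print(discrete_convolution(f, g))  # [0.  1.  2.5 4.  1.5]
print(np.convolve(f, g))           # NumPy's reference result matches
```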

Computing the Computational Complexity
In order to compute the computational complexity of the vanilla convolution operation and the Depthwise Separable Convolution operation, we need to count the number of multiplication operations in each. This is because the binary addition of two numbers may be performed in a single clock cycle using two registers with the inputs latched and a bunch of combinational logic gates. Binary multiplication, however, requires successive shift and addition operations, which must be performed as many times as there are bits in the multiplier, and is thus a far more expensive operation.
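To illustrate why multiplication is treated as the expensive primitive, here is a minimal sketch of shift-and-add multiplication for non-negative integers: the loop body runs once per bit of the multiplier, whereas a hardware adder produces its sum in a single pass.

```python
def shift_and_add_multiply(a: int, b: int) -> int:
    """Multiply non-negative integers a and b by repeated shift-and-add,
    one iteration per bit of the multiplier b."""
    product = 0
    while b:
        if b & 1:          # the current multiplier bit is 1...
            product += a   # ...so add the shifted multiplicand
        a <<= 1            # shift the multiplicand one bit to the left
        b >>= 1            # move on to the next multiplier bit
    return product

assert shift_and_add_multiply(13, 11) == 143
```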
Computational Complexity of Vanilla Convolution
Let us consider an input volume of dimension $D_F \times D_F \times M$, where $D_F$ is the height and width of the volume and $M$ is the number of channels. In the case of a standard RGB image $M = 3$, and for a gray-scale image $M = 1$.

Let us convolve it with a tensor of shape $D_K \times D_K \times M \times N$, or equivalently with $N$ tensors of shape $D_K \times D_K \times M$, which results in an output volume of shape $D_G \times D_G \times N$.

Let us now count the number of multiplication operations for this operation.
The number of multiplication operations for a single stride across a single channel is $D_K \times D_K = D_K^2$. For $M$ channels in the initial volume, the number of multiplication operations is $M \times D_K^2$.
Sliding the kernel over a volume of $(D_F, D_F, M)$, we get a tensor of shape $(D_G, D_G, 1)$. Hence, the total number of multiplication operations for a single kernel (i.e., a single output channel) will be $D_G^2 \times M \times D_K^2$.
Since there are $N$ kernels in the convolutional filter bank, this operation is repeated $N$ times. Hence, the total number of multiplication operations for the above convolution operation becomes $N \times D_G^2 \times M \times D_K^2$.
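As a quick sanity check, the count derived above can be evaluated directly. A small helper is sketched below; the sizes used (a 32x32 RGB input and 64 kernels of size 3x3) are illustrative assumptions, not values from the article:

```python
def vanilla_conv_mults(DF, DK, M, N):
    DG = DF - DK + 1                  # output spatial size (stride 1, no padding)
    return N * DG**2 * M * DK**2      # N kernels x D_G^2 positions x (M x D_K^2) mults each

print(vanilla_conv_mults(DF=32, DK=3, M=3, N=64))  # 1,555,200 multiplications
```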
Now, let us see how, using an alternate form of the vanilla convolution operation, we can reduce this computational cost.
Computational Complexity of Depthwise Separable Convolutions
In the vanilla convolution operation, each kernel is applied to all the channels of the input volume. Depthwise Separable Convolution, however, breaks the whole operation down into two steps (sketched in code after the list below):
- Depthwise Convolution or the Filtering Stage
- Pointwise Convolution or the Combination Stage
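Before deriving the cost of each stage, here is a minimal sketch of how the two stages compose in practice. It uses PyTorch (which this article itself does not use) purely as an illustration: the depthwise stage is a grouped convolution with one group per input channel, and the pointwise stage is an ordinary 1x1 convolution. All sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

DF, DK, M, N = 32, 3, 16, 64  # illustrative: input size, kernel size, in/out channels

# Depthwise (filtering) stage: one D_K x D_K filter per input channel (groups=M)
depthwise = nn.Conv2d(in_channels=M, out_channels=M, kernel_size=DK, groups=M, bias=False)
# Pointwise (combination) stage: N filters of shape 1 x 1 x M mix the channels
pointwise = nn.Conv2d(in_channels=M, out_channels=N, kernel_size=1, bias=False)

x = torch.randn(1, M, DF, DF)
y = pointwise(depthwise(x))
print(y.shape)  # torch.Size([1, 64, 30, 30]) -> N output channels, D_G = D_F - D_K + 1
```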
Depthwise Convolution
Let us consider the same $D_F \times D_F \times M$ input volume convolving with $M$ kernels of shape $D_K \times D_K \times 1$. A single convolution with a single kernel gives a volume of $D_G \times D_G \times 1$. Repeating this $M$ times, we get $M$ such tensors, and stacking them up channel-wise, we get a single tensor of shape $D_G \times D_G \times M$.


Let us now find the computational complexity for Depthwise Convolution.
The number of multiplication operations for the convolution of a single kernel over a single stride of the input volume is $D_K \times D_K = D_K^2$.
Since the output shape for a single kernel is $D_G \times D_G \times 1$, the number of multiplication operations for convolving over a single channel of the input image is $D_G^2 \times D_K^2$.
Since there are $M$ such kernels, one for each of the $M$ input channels, the total number of multiplication operations for the Depthwise Convolution operation becomes $M \times D_G^2 \times D_K^2$.
Pointwise Convolution
For Pointwise Convolution, we convolve the $D_G \times D_G \times M$ volume with $N$ kernels of shape $1 \times 1 \times M$, producing the desired output of shape $D_G \times D_G \times N$.

Let us now find the computational complexity of the Pointwise Convolution operation.
For convolving a single kernel over a single stride of the input tensor, the number of multiplication operations is $1 \times 1 \times M = M$.
For convolving a single kernel over the whole spatial extent of the input tensor, producing a shape of $D_G \times D_G \times 1$, the number of multiplication operations is $D_G^2 \times M$.
For convolving $N$ such kernels over the whole input tensor, the number of multiplication operations is $N \times D_G^2 \times M$.
Hence, the total number of multiplication operations for Depthwise Separable Convolution is $M \times D_G^2 \times D_K^2 + N \times D_G^2 \times M = M \times D_G^2 \times (D_K^2 + N)$.
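This total can again be checked numerically with the same illustrative sizes used earlier (32x32 RGB input, 3x3 kernels, 64 output channels); the helper below is a sketch, not code from the article:

```python
def depthwise_separable_mults(DF, DK, M, N):
    DG = DF - DK + 1                  # output spatial size (stride 1, no padding)
    depthwise = M * DG**2 * DK**2     # filtering stage: M kernels of D_K x D_K x 1
    pointwise = N * DG**2 * M         # combination stage: N kernels of 1 x 1 x M
    return depthwise + pointwise

# Vanilla convolution needed 1,555,200 multiplications for these sizes;
# the depthwise separable version needs roughly 7.9x fewer.
print(depthwise_separable_mults(DF=32, DK=3, M=3, N=64))  # 197,100
```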
Comparing Vanilla Convolution with Depthwise Separable Convolution
Let us now compare the complexity of the Depthwise Separable Convolution operation with that of the Vanilla Convolution operation. Taking the ratio of the number of multiplication operations of the former to that of the latter, we get

$\dfrac{M \times D_G^2 \times D_K^2 + N \times D_G^2 \times M}{N \times D_G^2 \times M \times D_K^2}$

or,

$\dfrac{D_K^2 + N}{N \times D_K^2}$

or,

$\dfrac{1}{N} + \dfrac{1}{D_K^2}$
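A quick numeric check (again with illustrative sizes, not values from the article) confirms that the simplified form equals the full ratio:

```python
DF, DK, M, N = 32, 3, 16, 64           # illustrative values
DG = DF - DK + 1

full_ratio = (M * DG**2 * DK**2 + N * DG**2 * M) / (N * DG**2 * M * DK**2)
simplified = 1 / N + 1 / DK**2
print(full_ratio, simplified)           # both ~0.1267
assert abs(full_ratio - simplified) < 1e-12
```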
Now, let us fix $N$, consider kernel dimensions DK = [2 ** i for i in range(5, 11)], and visualize how the ratio varies.
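A minimal sketch of such a visualization is shown below; since the fixed value of $N$ is not recoverable from the text, N = 64 is an assumption:

```python
import matplotlib.pyplot as plt

N = 64                                   # assumed fixed number of output channels
DK = [2 ** i for i in range(5, 11)]      # kernel dimensions 32, 64, ..., 1024
ratios = [1 / N + 1 / dk**2 for dk in DK]

plt.plot(DK, ratios, marker="o")
plt.xscale("log", base=2)
plt.xlabel("Kernel dimension $D_K$")
plt.ylabel("DSC multiplications / Vanilla multiplications")
plt.title("Cost ratio of Depthwise Separable to Vanilla Convolution")
plt.show()
```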

Note that the ratio of the computational complexity of Depthwise Separable Convolution to that of Vanilla Convolution is always much less than 1, and it decreases further with increasing kernel dimension, making Depthwise Separable Convolution much faster than Vanilla Convolution.
Further Reading
For more information on Depthwise Separable Convolutions, you may refer to:
Rethinking Depthwise Separable Convolutions: How Intra-Kernel Correlations Lead to Improved MobileNets (CVPR 2020)
Submission for Reproducibility Challenge 2021 for the paper "Rethinking Depthwise Separable Convolutions: How Intra-Kernel Correlations Lead to Improved MobileNets" by Daniel Haase and Manuel Amthor.
Dynamic Group Convolution for Accelerating Convolutional Neural Networks (ECCV 2020)
A reproduction of the paper 'Dynamic Group Convolution for Accelerating Convolutional Neural Networks' by Zhuo Su et al., accepted to ECCV 2020.
Dynamic Convolutions: Exploiting Spatial Sparsity for Faster Inference (CVPR 2020)
A reproduction of the 2020 CVPR paper entitled 'Dynamic Convolutions: Exploiting Spatial Sparsity for Faster Inference' by Thomas Verelst et al.
ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks
Submission for Reproducibility Challenge 2020 for the paper "ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks" by Wang et al. (2020), accepted to CVPR 2020.