What Makes Depthwise Separable Convolutions Faster

Exploring how Depthwise Separable Convolutions can make CNNs faster
Created on March 31|Last edited on March 31
Comment
﻿
IntroductionThe Convolution operation is a widely used function in Functional Analysis, Image Processing Deep Learning. The convolution operation when applied to two functions f and g, produces a third function expressing how the shape of one is modified by the other. While it is immensely popular, especially in the domain of Deep Learning, the vanilla convolution operation is quite expensive computationally. Several Neural Network architectures such as Xception and MobileNet use a special type of convolution called Depthwise Separable Convolution to speed up training and inference, especially on Mobile and Embedded Devices. In this article, we would investigate exactly why Depthwise Separable Convolutions are so much faster than Vanilla Convolutions by deriving their respective computational complexity.
﻿
The Vanilla Convolution OperationThe vanilla convolution function can be mathematically defined as the following:
(f⊛g)(t)=∫−∞∞f(τ)g(t−τ)dτ\large{(f \circledast g)(t) = \int_{- \infty}^{\infty} f(\tau) g(t - \tau) d\tau}(f⊛g)(t)=∫−∞∞​f(τ)g(t−τ)dτ﻿
﻿
For all non-negative values of t (i.e, for all values of t such that t ∈ [0, ∞) ), we could truncate the limits of integration resulting in,
(f⊛g)(t)=∫0tf(τ)g(t−τ)dτ\large{(f \circledast g)(t) = \int_{0}^{t} f(\tau) g(t - \tau) d\tau}(f⊛g)(t)=∫0t​f(τ)g(t−τ)dτ﻿
﻿
It can also be defined as the overlap of two functions f and g as one slides over the other, performing a sum of products.
Source: https://en.wikipedia.org/wiki/Convolution﻿
﻿
Convolution as Sum of Products. Source: Intuitively Understanding Convolutions for Deep Learning﻿
﻿
Computing the Computational ComplexityIn order to compute the computational complexity of the vanilla convolutional operation and the Depthwise Separable Convolution operation, we would need to count the number of multiplication operations for a convolution. This is because the Binary Addition of two numbers may be performed in a single clock cycle using two registers with the inputs latched and a bunch of combinatorial logic gates. Binary multiplication, however, requires successive shift and addition operations which must be performed as many times as there are bits in the multiplier and is thus a more expensive operation.
Computational Complexity of Vanilla ConvolutionLet us consider an input volume VVV﻿ of the dimension (DV,DV,N)(D_{V}, D_{V}, N)(DV​,DV​,N)﻿ where DVD_{V}DV​﻿ is the height and width of the volume and NNN﻿ is the number of channels. In the case of a standard RGB image N=3N = 3N=3﻿ and for a gray-scale image N=1N = 1N=1﻿.
﻿
﻿
﻿
Let us convolve VVV﻿ with a tensor of shape (Dk,Dk,N)(D_{k}, D_{k}, N)(Dk​,Dk​,N)﻿ or NNN﻿ tensors with the shape (Dk,Dk)(D_{k}, D_{k})(Dk​,Dk​)﻿ which results in a volume GGG﻿ of shape (DG,DG,N)(D_{G}, D_{G}, N)(DG​,DG​,N)﻿.
﻿
﻿
﻿
Let us now count the number of multiplication operations for this operation.
The number of multiplication operations for a single stride across a single channel is Dk∗DkD_{k} * D_{k}Dk​∗Dk​﻿. For MMM﻿ channels in the initial volume, the number of multiplication operations is (Dk)2∗M(D_{k})^{2} * M(Dk​)2∗M﻿.
Sliding the kernel over a volume of (DVD_{V}DV​﻿, DVD_{V}DV​﻿, MMM﻿), we get a tensor of shape (DGD_{G}DG​﻿, DGD_{G}DG​﻿, NNN﻿). Hence the total number of multiplication operations for a single channel of the convolution kernel will be (DG)2∗(Dk)2∗M(D_{G})^{2} * (D_{k})^{2} * M(DG​)2∗(Dk​)2∗M﻿.
Since there are NNN﻿ channels in the convolutional kernel, this operation is repeated NNN﻿ times. Hence, the total number of multiplication operations for the above convolution operation becomes N∗(DG)2∗(Dk)2∗MN * (D_{G})^{2} * (D_{k})^{2} * MN∗(DG​)2∗(Dk​)2∗M﻿.
Now, let us see how using an alternate form of the vanilla convolution operation, we can reduce time complexity.
Computational Complexity of Depthwise Separable ConvolutionsIn the vanilla convolution operation all, the kernel is applied to all the channels of the input volume. However, Depthwise Separable Convolutions breaks down the whole operation into 2 steps:
Depthwise Convolution or the Filtering Stage
Pointwise Convolution or the Combination Stage
Depthwise ConvolutionLet us consider the same input volume (DV,DV,M)(D_{V}, D_{V}, M)(DV​,DV​,M)﻿ convolving with MMM﻿ number of(DK,DK)(D_{K}, D_{K})(DK​,DK​)﻿ kernels. A single convolution with a single kernel gives a volume of (DG,DG,1)(D_{G}, D_{G}, 1)(DG​,DG​,1)﻿. Repeating this NNN﻿ times, we get NNN﻿ such tensors and stacking them up channel-wise, we get a single tensor of shape (DG,DG,M)(D_{G}, D_{G}, M)(DG​,DG​,M)﻿.
﻿
﻿
﻿
﻿
﻿
Let us now find the computational complexity for Depthwise Convolution.
The number of multiplication operations for the convolution of a single (DK,DK)(D_{K}, D_{K})(DK​,DK​)﻿ kernel over a single stride over the input volume is (DK)2(D_{K})^{2}(DK​)2﻿.
Since the output shape is (DG,DG)(D_{G}, D_{G})(DG​,DG​)﻿, the number of multiplication operations for convolving over a single channel of the input image is (DG)2∗(DK)2(D_{G})^{2} * (D_{K})^{2}(DG​)2∗(DK​)2﻿.
Now there are MMM﻿ number of kernels for convolving with MMM﻿ number of channels, the number of multiplication operations for Depthwise Convolution operation becomes M∗(DG)2∗(DK)2M * (D_{G})^{2} * (D_{K})^{2}M∗(DG​)2∗(DK​)2﻿.
Pointwise ConvolutionFor Pointwise Convolution, we convolve the (DG,DG,M)(D_{G}, D_{G}, M)(DG​,DG​,M)﻿ volume with NNN﻿ kernels of (1,1,M)(1, 1, M)(1,1,M)﻿ producing the desired output of shape (DV,DV,N)(D_{V}, D_{V}, N)(DV​,DV​,N)﻿.
﻿
﻿
﻿
Let us now find the computational complexity of the Pointwise Convolution operation.
For convolving a single kernel over a single stride of the input image, the number of multiplication operations is 1∗1∗M=M1 * 1 * M = M1∗1∗M=M﻿.
For convolving a single kernel over a single channel of the input tensor producing a shape of (DG,DG)(D_{G}, D_{G})(DG​,DG​)﻿, the number of multiplication operations = M∗(DG)2M * (D_{G})^{2}M∗(DG​)2﻿.
For convolving NNN﻿ number of kernels over the whole of input tensor, the number of multiplication operations is N∗M∗(DG)2N * M * (D_{G})^{2}N∗M∗(DG​)2﻿.
Comparing Vanilla Convolution with Depthwise Separable ConvolutionLet us take the ratios between the Complexity of the Vanilla Convolution operation and that of the Depthwise Separable Convolution operation. If we take the ratio of the number of multiplication operations for both cases, we get...
convvanillaconvdsc=N∗(DG)2∗(DK)2∗MM∗(DG)2∗[(DK)2+N]\LARGE{\frac{conv_{vanilla}}{conv_{dsc}} = \frac{N * (D_{G})^{2} * (D_{K})^{2} * M}{M * (D_{G})^{2} * [(D_{K})^{2} + N]}}convdsc​convvanilla​​=M∗(DG​)2∗[(DK​)2+N]N∗(DG​)2∗(DK​)2∗M​﻿
or,
convvanillaconvdsc=(DK)2∗M(DK)2+N\LARGE{\frac{conv_{vanilla}}{conv_{dsc}} = \frac{(D_{K})^{2} * M}{(D_{K})^{2} + N}}convdsc​convvanilla​​=(DK​)2+N(DK​)2∗M​﻿
or,
convvanillaconvdsc=1(DK)2+1N\LARGE{\frac{conv_{vanilla}}{conv_{dsc}} = \frac{1}{(D_{K})^{2}} + \frac{1}{N}}convdsc​convvanilla​​=(DK​)21​+N1​﻿
Now, let us consider N=3N = 3N=3﻿ and DK = [2 ** i for i in range(5, 11)] and visualize how the ratio varies.
﻿
Note that the ratio of computation complexity of Vanilla Convolution to that of Depthwise Separable Convolution is always much less than 1 and it decreases with increasing Kernel Dimension, making it much faster compared to Vanilla Convolution.
﻿
Further ReadingFor more information on Depthwise Separable Convolutions, you may refer to:
﻿Xception: Deep Learning with Depthwise Separable Convolutions﻿
﻿MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications﻿﻿﻿
﻿One Model To Learn Them All﻿
﻿
Similar Resources on Fully-Connected
Rethinking Depthwise Separable Convolutions: How Intra-Kernel Correlations Lead to Improved MobileNets (CVPR 2020)
Submission for Reproducibility Challenge 2021 for the paper "Rethinking Depthwise Separable Convolutions: How Intra-Kernel Correlations Lead to Improved MobileNets" by Daniel Haase and Manuel Amthor.
Dynamic Group Convolution for Accelerating Convolutional Neural Networks (ECCV 2020) 
A reproduction of the paper 'Dynamic Group Convolution for Accelerating Convolutional Neural Networks ' by Zhuo Su et. al, accepted to ECCV (2020).
Dynamic Convolutions: Exploiting Spatial Sparsity for Faster Inference (CVPR 2020)
A reproduction of the 2020 CVPR paper entitled 'Dynamic Convolutions: Exploiting Spatial Sparsity for Faster Inference ' by Thomas Verelst et. al
ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks
Submission for Reproducibility Challenge 2020 for the paper "ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks" by Wang et al. (2020), accepted to CVPR 2020.
﻿
﻿
Add a comment