Skip to main content

What Makes Depthwise Separable Convolutions Faster

Exploring how Depthwise Separable Convolutions can make CNNs faster
Created on March 31|Last edited on March 31

Introduction

The Convolution operation is a widely used function in Functional Analysis, Image Processing Deep Learning. The convolution operation when applied to two functions f and g, produces a third function expressing how the shape of one is modified by the other. While it is immensely popular, especially in the domain of Deep Learning, the vanilla convolution operation is quite expensive computationally. Several Neural Network architectures such as Xception and MobileNet use a special type of convolution called Depthwise Separable Convolution to speed up training and inference, especially on Mobile and Embedded Devices. In this article, we would investigate exactly why Depthwise Separable Convolutions are so much faster than Vanilla Convolutions by deriving their respective computational complexity.



The Vanilla Convolution Operation

The vanilla convolution function can be mathematically defined as the following:
(fg)(t)=f(τ)g(tτ)dτ\large{(f \circledast g)(t) = \int_{- \infty}^{\infty} f(\tau) g(t - \tau) d\tau}


For all non-negative values of t (i.e, for all values of t such that t ∈ [0, ∞) ), we could truncate the limits of integration resulting in,
(fg)(t)=0tf(τ)g(tτ)dτ\large{(f \circledast g)(t) = \int_{0}^{t} f(\tau) g(t - \tau) d\tau}


It can also be defined as the overlap of two functions f and g as one slides over the other, performing a sum of products.

Convolution as Sum of Products. Source: Intuitively Understanding Convolutions for Deep Learning



Computing the Computational Complexity

In order to compute the computational complexity of the vanilla convolutional operation and the Depthwise Separable Convolution operation, we would need to count the number of multiplication operations for a convolution. This is because the Binary Addition of two numbers may be performed in a single clock cycle using two registers with the inputs latched and a bunch of combinatorial logic gates. Binary multiplication, however, requires successive shift and addition operations which must be performed as many times as there are bits in the multiplier and is thus a more expensive operation.

Computational Complexity of Vanilla Convolution

Let us consider an input volume VV of the dimension (DV,DV,N)(D_{V}, D_{V}, N) where DVD_{V} is the height and width of the volume and NN is the number of channels. In the case of a standard RGB image N=3N = 3 and for a gray-scale image N=1N = 1.



Let us convolve VV with a tensor of shape (Dk,Dk,N)(D_{k}, D_{k}, N) or NN tensors with the shape (Dk,Dk)(D_{k}, D_{k}) which results in a volume GG of shape (DG,DG,N)(D_{G}, D_{G}, N).



Let us now count the number of multiplication operations for this operation.
The number of multiplication operations for a single stride across a single channel is DkDkD_{k} * D_{k}. For MM channels in the initial volume, the number of multiplication operations is (Dk)2M(D_{k})^{2} * M.
Sliding the kernel over a volume of (DVD_{V}, DVD_{V}, MM), we get a tensor of shape (DGD_{G}, DGD_{G}, NN). Hence the total number of multiplication operations for a single channel of the convolution kernel will be (DG)2(Dk)2M(D_{G})^{2} * (D_{k})^{2} * M.
Since there are NN channels in the convolutional kernel, this operation is repeated NN times. Hence, the total number of multiplication operations for the above convolution operation becomes N(DG)2(Dk)2MN * (D_{G})^{2} * (D_{k})^{2} * M.
Now, let us see how using an alternate form of the vanilla convolution operation, we can reduce time complexity.

Computational Complexity of Depthwise Separable Convolutions

In the vanilla convolution operation all, the kernel is applied to all the channels of the input volume. However, Depthwise Separable Convolutions breaks down the whole operation into 2 steps:
  • Depthwise Convolution or the Filtering Stage
  • Pointwise Convolution or the Combination Stage

Depthwise Convolution

Let us consider the same input volume (DV,DV,M)(D_{V}, D_{V}, M) convolving with MM number of(DK,DK)(D_{K}, D_{K}) kernels. A single convolution with a single kernel gives a volume of (DG,DG,1)(D_{G}, D_{G}, 1). Repeating this NN times, we get NN such tensors and stacking them up channel-wise, we get a single tensor of shape (DG,DG,M)(D_{G}, D_{G}, M).





Let us now find the computational complexity for Depthwise Convolution.
The number of multiplication operations for the convolution of a single (DK,DK)(D_{K}, D_{K}) kernel over a single stride over the input volume is (DK)2(D_{K})^{2}.
Since the output shape is (DG,DG)(D_{G}, D_{G}), the number of multiplication operations for convolving over a single channel of the input image is (DG)2(DK)2(D_{G})^{2} * (D_{K})^{2}.
Now there are MM number of kernels for convolving with MM number of channels, the number of multiplication operations for Depthwise Convolution operation becomes M(DG)2(DK)2M * (D_{G})^{2} * (D_{K})^{2}.

Pointwise Convolution

For Pointwise Convolution, we convolve the (DG,DG,M)(D_{G}, D_{G}, M) volume with NN kernels of (1,1,M)(1, 1, M) producing the desired output of shape (DV,DV,N)(D_{V}, D_{V}, N).



Let us now find the computational complexity of the Pointwise Convolution operation.
For convolving a single kernel over a single stride of the input image, the number of multiplication operations is 11M=M1 * 1 * M = M.
For convolving a single kernel over a single channel of the input tensor producing a shape of (DG,DG)(D_{G}, D_{G}), the number of multiplication operations = M(DG)2M * (D_{G})^{2}.
For convolving NN number of kernels over the whole of input tensor, the number of multiplication operations is NM(DG)2N * M * (D_{G})^{2}.

Comparing Vanilla Convolution with Depthwise Separable Convolution

Let us take the ratios between the Complexity of the Vanilla Convolution operation and that of the Depthwise Separable Convolution operation. If we take the ratio of the number of multiplication operations for both cases, we get...
convvanillaconvdsc=N(DG)2(DK)2MM(DG)2[(DK)2+N]\LARGE{\frac{conv_{vanilla}}{conv_{dsc}} = \frac{N * (D_{G})^{2} * (D_{K})^{2} * M}{M * (D_{G})^{2} * [(D_{K})^{2} + N]}}

or,
convvanillaconvdsc=(DK)2M(DK)2+N\LARGE{\frac{conv_{vanilla}}{conv_{dsc}} = \frac{(D_{K})^{2} * M}{(D_{K})^{2} + N}}

or,
convvanillaconvdsc=1(DK)2+1N\LARGE{\frac{conv_{vanilla}}{conv_{dsc}} = \frac{1}{(D_{K})^{2}} + \frac{1}{N}}

Now, let us consider N=3N = 3 and DK = [2 ** i for i in range(5, 11)] and visualize how the ratio varies.

Note that the ratio of computation complexity of Vanilla Convolution to that of Depthwise Separable Convolution is always much less than 1 and it decreases with increasing Kernel Dimension, making it much faster compared to Vanilla Convolution.



Further Reading

For more information on Depthwise Separable Convolutions, you may refer to:



Similar Resources on Fully-Connected