What's the Difference Between Strided Convolution and Pooling?
In this article, we'll do a quick comparison of the benefits and drawbacks of two ways to downscale an input tensor: pooling and strided convolutions. This is a translated version of the article. Feel free to report any possible mistranslations in the comments section.
Created on August 26|Last edited on August 26
Strided convolution and pooling serve the same purpose: downsampling, or compressing, information. Each has its own benefits and drawbacks, and in this article we'll look at both techniques.
conv_layer = tf.keras.layers.Conv2D(filters, kernel_size, strides=(1, 1), padding='valid', activation=None)
Here, strides specifies how many pixels the kernel moves at each step along the height and width of the input tensor.
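The effect of strides on the output size can be sketched with a small helper. This is a minimal illustration, assuming 'valid' padding (no padding) as in the call above; conv_output_size is a hypothetical helper name, not part of Keras.

```python
def conv_output_size(input_size: int, kernel_size: int, stride: int) -> int:
    """Output length along one axis for a 'valid' (unpadded) convolution."""
    return (input_size - kernel_size) // stride + 1


# A 32-pixel axis convolved with a 3-pixel kernel:
print(conv_output_size(32, 3, 1))  # stride 1 -> 30
print(conv_output_size(32, 3, 2))  # stride 2 -> 15
```

Doubling the stride roughly halves the output length along each axis, which is why a stride of 2 can stand in for a 2x2 pooling layer.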
Let's dive in!
Downsampling the Input Tensor
We can downsample our input tensor using two common methods:
- Changing the strides parameter from its default of (1, 1) to, say, (2, 2). The kernel then moves 2 pixels at a time along the x- and y-axes.
- Using spatial pooling layers such as max pooling (MaxPooling2D) or average pooling (AveragePooling2D).
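To make the second option concrete, here is a simplified pure-Python sketch of what 2x2 max and average pooling compute on a single channel. It assumes non-overlapping windows (stride equal to the pool size, which is the Keras default); pool2d and mean are hypothetical helper names, and this is not how Keras implements pooling internally.

```python
def pool2d(grid, size, reduce_fn):
    """Apply non-overlapping size x size pooling to a 2D list of numbers."""
    rows, cols = len(grid), len(grid[0])
    pooled = []
    for r in range(0, rows - size + 1, size):
        row = []
        for c in range(0, cols - size + 1, size):
            # Collect every value inside the current window
            window = [grid[r + i][c + j] for i in range(size) for j in range(size)]
            row.append(reduce_fn(window))
        pooled.append(row)
    return pooled


def mean(values):
    return sum(values) / len(values)


grid = [
    [1, 3, 2, 0],
    [5, 7, 4, 6],
    [9, 8, 1, 1],
    [2, 4, 3, 5],
]

max_pooled = pool2d(grid, 2, max)   # keeps the largest value per window
avg_pooled = pool2d(grid, 2, mean)  # keeps the average value per window
print(max_pooled)  # [[7, 6], [9, 5]]
print(avg_pooled)  # [[4.0, 3.0], [5.75, 2.5]]
```

Note that pooling has no learnable weights: it only summarizes each window, whereas a strided convolution learns how to combine the values it sees.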
So what's the difference? Let's examine that with a pair of simple models.
Using a Pooling Layer
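from tensorflow.keras.layers import Input, Conv2D, MaxPooling2D
from tensorflow.keras.models import Model

def simple_cnn_with_pool():
    inputs = Input(shape=(32, 32, 3))
    # 3x3 convolution with stride 1, followed by 2x2 max pooling to downsample
    conv1 = Conv2D(filters=32, kernel_size=(3, 3), strides=(1, 1), activation='relu')(inputs)
    pool1 = MaxPooling2D(pool_size=2)(conv1)
    return Model(inputs=inputs, outputs=pool1)

Let's see the summary of this model.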

Fig 1: Model summary of a model with a pooling layer
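Notice that the spatial dimensions shrink while the number of channels grows: the convolution expands the 3 input channels to 32, and the pooling layer then halves the height and width. Now, can we do the same thing without the pooling layer?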
Using Strided Convolutions
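from tensorflow.keras.layers import Input, Conv2D
from tensorflow.keras.models import Model

def simple_cnn_without_pool():
    inputs = Input(shape=(32, 32, 3))
    # The stride of 2 does the downsampling; there is no pooling layer
    conv1 = Conv2D(filters=32, kernel_size=(3, 3), strides=(2, 2), activation='relu')(inputs)
    return Model(inputs=inputs, outputs=conv1)

Notice the use of strides=(2,2) and the absence of a pooling layer. Let's see the summary of this model.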

Fig 2: Model summary of a model without a pooling layer and with strides of (2,2)
- We have the same number of trainable parameters in both cases.
- We have the same output tensor shape in both cases.
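These two observations can be checked by hand without building the models. The following is a back-of-the-envelope sketch for the 32x32x3 input and the single 3x3, 32-filter convolution used above; the helper names are hypothetical, and the multiply-accumulate count is a rough estimate that ignores the bias additions, the activation, and the pooling comparisons.

```python
def conv_output_size(input_size, kernel_size, stride):
    # 'valid' padding: the kernel must fit entirely inside the input
    return (input_size - kernel_size) // stride + 1


def conv_params(kernel_size, in_channels, filters):
    # One kernel_size x kernel_size x in_channels kernel per filter, plus one bias each
    return kernel_size * kernel_size * in_channels * filters + filters


def conv_macs(output_size, kernel_size, in_channels, filters):
    # One multiply-accumulate per kernel weight at every output position
    return output_size * output_size * filters * kernel_size * kernel_size * in_channels


# Model with pooling: stride-1 convolution, then 2x2 max pooling
conv_out = conv_output_size(32, 3, 1)
pooled_out = conv_out // 2  # pooling halves height and width, adds no parameters

# Model without pooling: stride-2 convolution
strided_out = conv_output_size(32, 3, 2)

params = conv_params(3, 3, 32)  # identical for both models
macs_with_pool = conv_macs(conv_out, 3, 3, 32)
macs_strided = conv_macs(strided_out, 3, 3, 32)

print(pooled_out, strided_out)       # 15 15
print(params)                        # 896
print(macs_with_pool, macs_strided)  # 777600 194400
```

Under these assumptions both models produce a 15x15x32 output from 896 trainable parameters, but the stride-1 convolution evaluates the kernel at four times as many positions before pooling discards most of them.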
Shouldn't pooling then be considered an additional computational operation to achieve the same result?
To Pool or Not To Pool?
The answer is: it depends.
It depends on the model architecture, the dataset, and the task. Here are two case studies:
- When convolutions with strides are better than pooling: The first layer in ResNet uses a strided convolution, which significantly reduces the computation required by all subsequent layers. A single 7x7 convolutional layer with a stride of 2 takes the place of three stacked 3x3 convolutional layers while covering the same receptive field. SqueezeNet uses a similar technique.
- When pooling is better than convolutions: In ResNet, the residual blocks contain 1x1 convolutional layers, which make gradient backpropagation harder. The FishNet paper investigated this and addressed it by replacing the 1x1 convolutional layers with pooling and simple upscaling, identity, and concatenation operations.
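The receptive-field claim in the first case can be verified with a short calculation. This is a minimal sketch using the standard recurrence for stacked convolutions; receptive_field is a hypothetical helper, and the three 3x3 layers are assumed to use a stride of 1.

```python
def receptive_field(layers):
    """Receptive field of a stack of layers given as (kernel_size, stride) pairs,
    listed from the input side to the output side."""
    field = 1  # a single output unit initially "sees" one input pixel
    jump = 1   # distance in input pixels between neighboring units at this depth
    for kernel_size, stride in layers:
        field += (kernel_size - 1) * jump
        jump *= stride
    return field


three_3x3 = receptive_field([(3, 1), (3, 1), (3, 1)])
one_7x7 = receptive_field([(7, 2)])
print(three_3x3)  # 7
print(one_7x7)    # 7
```

Both configurations see a 7x7 patch of the input, but the strided 7x7 layer also halves the spatial resolution in the same step.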