
Are fully connected and convolution layers equivalent? If so, how?

As part of this post, we look at the Convolution and Linear layers in MS Excel and compare results from Excel with PyTorch implementations.
Recently, I asked this very question on Twitter.

In a quest to find the answer, I ended up implementing the Convolution and Linear layer operation in Microsoft Excel. As part of this blog post, I'll explain the Conv1d and Linear layer operation in detail and also share how the two layers are equivalent.
Also, this post explains the convolution operation without using any funky mathematical formulas.

The key point to note about the convolution operation

For a gentle introduction to Convolutions, please refer to the following blogs:
I only want to highlight that a 2-D convolution does not mean the kernel is 2-D too; it could very well be a 3-D kernel. Rather, a 2-D convolution means that the kernel moves along 2 dimensions/axes.
Let's assume we have a 3-channel input image of shape 224x224. Below is an example of a 2-D convolution.
Note that the convolution kernel itself is a cube, but the convolution is considered 2-D because the kernel moves along only two dimensions of the image (width & height).
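To make this concrete, here is a minimal PyTorch sketch (my own illustration, not part of the Excel workbook) showing that a 2-D convolution over a 3-channel image uses a 3-D kernel, yet still counts as 2-D because it only slides along height and width:
import torch.nn as nn

# A single 3x3 filter over a 3-channel image: the filter itself is a 3x3x3 cube
# (channels x height x width), but it only moves along height and width.
conv2d = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=3)
conv2d.weight.shape

>> torch.Size([1, 3, 3, 3])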

Conv1d operation in MS Excel

Let's assume we have an input matrix of shape 3x3.
Figure-1: Input matrix
Now, as per the PyTorch docs, Conv1d expects the input to be of shape (N, C_in, L_in), where N represents the batch size, C_in the number of input channels, and L_in the length of the input sequence.
Let's assume the batch size is 1, in which case we could represent the above 3x3 matrix as a tensor of shape 1x3x3. Great! Let's try a few things in PyTorch now.
import torch
import torch.nn as nn

x = torch.tensor([[
    [0.608502902, 0.833821936, 0.793526093],
    [0.905739925, 0.95717805, 0.809504577],
    [0.930305136, 0.966781513, 0.440928965]]])
x.shape

>> torch.Size([1, 3, 3])
So, our input has the shape 1x3x3 which matches the expectation of Conv1d operation in PyTorch. Let's perform the convolution operation now and check the output shape.
# kernel size is 1, output channels are 5
conv = nn.Conv1d(3, 5, 1, bias=False)
conv(x).shape
>> torch.Size([1, 5, 3])
Were you able to guess the output shape yourself? If so, great! You already have good intuition about convolutions. If not, that's okay - we will develop an intuition for the convolution operation as part of this blog post.
❓: Why is the output shape `1x5x3` when the kernel size is 1 and the number of output channels is 5?
When we set the number of input channels to 3 and the kernel size to 1, we are telling PyTorch to create filters of shape 3x1. Also, because we set the number of output channels to 5, PyTorch will create 5 such kernels, each of shape 3x1. That is, the convolution weights should be of shape 5x3x1, which represents a total of 5 convolution filters, each of shape 3x1.
Let's confirm that in PyTorch.
conv = nn.Conv1d(3, 5, 1, bias=False)
conv.weight.shape

>> torch.Size([5, 3, 1])
That's great! This matches our expectations. Exactly how this operation takes place has been shown below in MS Excel.
Figure-2: Conv1d operation with input channels 3, output channels 5, and kernel size 1
Let's say all 5 filters have the weights [0.484249785, 0.419076606, 0.108487291]. These filters have been shown with different backgrounds in Figure-2 above. Essentially, every position in the input leads to 5 outputs because we have a total of 5 filters!
This way, we get a 1x5x3 output where the calculation for each of the items has been shown above in Figure-2.
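As a quick sanity check of Figure-2, the very first output cell is just the dot product of the (shared) filter weights with the input values at the first position across the 3 channels. A tiny sketch using the numbers above:
w = [0.484249785, 0.419076606, 0.108487291]
pos0 = [0.608502902, 0.905739925, 0.930305136]  # first value of each input channel
sum(wi * xi for wi, xi in zip(w, pos0))

>> 0.7751681...  # rounds to the 0.7752 shown in Figure-2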
Let's check if the output matches that in PyTorch.
conv = nn.Conv1d(3, 5, 1, bias=False)
_conv_weight = torch.tensor([[0.484249785, 0.419076606, 0.108487291]])
conv_weight = torch.cat(5*[_conv_weight])
conv_weight = conv_weight.view(conv.weight.shape)

with torch.no_grad():
    conv.weight.copy_(conv_weight)

conv(x)

>>
tensor([[[0.7752, 0.9098, 0.7713],
[0.7752, 0.9098, 0.7713],
[0.7752, 0.9098, 0.7713],
[0.7752, 0.9098, 0.7713],
[0.7752, 0.9098, 0.7713]]], grad_fn=<SqueezeBackward1>)
As can be seen, the PyTorch output matches ours from MS Excel! Thus, this is exactly what goes on in a Conv1d operation!
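In fact, with a kernel size of 1, the whole Conv1d operation boils down to a single matrix multiplication of the 5x3 weight matrix with the input. Here is a minimal sketch (reusing the conv and x defined above) that reproduces the same output manually:
# Drop the trailing kernel dimension (5x3x1 -> 5x3) and matrix-multiply with the input
manual_out = conv.weight.squeeze(-1) @ x
torch.allclose(manual_out, conv(x))

>> True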

Fully-Connected layer in MS Excel

But what about the Linear/fully connected layer? At the end of the day, all we want to know is: how are the Conv1d and fully connected layers equivalent?
So first, let's perform the operation in PyTorch and check the output shape.
lin = nn.Linear(3, 5, bias=False)
lin(x.transpose(-1, -2)).transpose(-1, -2).shape

>> torch.Size([1, 5, 3])
The output shape is the same as the output from the Conv1d operation before! But what are these extra transpose operations?
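The short answer: nn.Linear multiplies along the last dimension of its input, whereas Conv1d mixes the channel dimension (dimension 1). A quick sketch to see which dimension the Linear layer consumes:
lin = nn.Linear(3, 5, bias=False)
# nn.Linear maps the last dimension: 3 input features -> 5 output features
lin.weight.shape

>> torch.Size([5, 3])
Since the 3 channels of our 1x3x3 input live in dimension 1, we first transpose them into the last dimension, apply the Linear layer, and then transpose the result back so it lines up with the Conv1d output.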
Let's see this in MS Excel again! To get equivalence with Conv1d, we need to transpose the input, as can be seen from the code I first shared in my tweet.
Figure-3: PyTorch code to showcase that Conv1d and Linear layer operations are equivalent.
So after we transpose, our input looks like below:
Figure-4: Input after transpose
And since we have a Linear layer with 3 input features and 5 output features, the weight matrix looks like this:
Figure-5: Linear layer weight matrix
Therefore, when we do the row-by-column matrix multiplication, we can see that the results are equivalent to the Conv1d output.
This has been shown in Figure-6 below.
Figure-6: Linear layer operation in MS Excel
Let's check if the output matches that in PyTorch.
lin = nn.Linear(3, 5, bias=False)
with torch.no_grad():
    lin.weight.copy_(conv_weight.view(lin.weight.shape))

out_1 = lin(x.transpose(1,2))
out_1

>>
tensor([[[0.7752, 0.7752, 0.7752, 0.7752, 0.7752],
[0.9098, 0.9098, 0.9098, 0.9098, 0.9098],
[0.7713, 0.7713, 0.7713, 0.7713, 0.7713]]],
grad_fn=<UnsafeViewBackward>)
This is the exact same output as in MS Excel! Now, to get the output to match that of the Conv1d operation, all we do is transpose it once again.
out_2 = out_1.transpose(1,2)
out_2

>>
tensor([[[0.7752, 0.9098, 0.7713],
[0.7752, 0.9098, 0.7713],
[0.7752, 0.9098, 0.7713],
[0.7752, 0.9098, 0.7713],
[0.7752, 0.9098, 0.7713]]], grad_fn=<TransposeBackward0>)
Thus, we get the same output as our Conv1d operation.
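As a final sanity check (assuming the conv and lin layers with the copied weights from above are still in scope), we can confirm numerically that the two paths agree:
torch.allclose(conv(x), out_2)

>> True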

Conclusion

Therefore, we can safely say that the Conv1d operation and the fully connected/linear layer operation are equivalent, given that the kernel size is 1.
Can you imagine what the output would look like if the kernel size were greater than 1? I was able to perform this operation in Excel; I think you should try to do the same!
All the code above has been shared as a GitHub Gist here.
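(As a hint, in case you get stuck: for a kernel size greater than 1, the same equivalence still holds once the input is unfolded into sliding windows, so that each output position again becomes a single dot product. A rough sketch follows; the unfolding and reshaping here are my own, not taken from the original workbook or Gist.)
conv = nn.Conv1d(3, 5, kernel_size=2, bias=False)  # L_out = 3 - 2 + 1 = 2
lin = nn.Linear(3 * 2, 5, bias=False)              # each output sees 2 positions x 3 channels
with torch.no_grad():
    lin.weight.copy_(conv.weight.reshape(5, -1))

windows = x.unfold(2, 2, 1)                        # (1, 3, 2, 2): sliding windows of length 2
windows = windows.permute(0, 2, 1, 3).reshape(1, 2, 6)
torch.allclose(conv(x), lin(windows).transpose(1, 2))

>> True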