MLPs are All You Need: Back to Square One?

In the past few months there have been various papers proposing MLP based architectures without Attention or Convolutions. This report analyses the paper 'MLP-Mixer: An all-MLP Architecture for Vision' by Ilya Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer and others.
Saurav Maheshkar

Link to the paper →

Github Repository / Python Package with Flax Implementation

Recently, a lot of research has been published around MLPs without attention or convolutions, with interesting new methods like RepMLP, ResMLP (CVPR 2021), "Can Attention Enable MLPs To Catch Up With CNNs?" (CVM 2021) and "Are Pre-trained Convolutions Better than Pre-trained Transformers?" (ACL 2021). This renewed interest appears to be widely shared, with giants like Google and Facebook and universities like Tsinghua and Monash submitting multiple papers on the topic to top-tier conferences. (Collection of recent MLP based work)
Figure 1: A comparison of the recent MLP based Architectures
The strong performance of recent vision architectures is often attributed to attention or convolutions. Multi-layer perceptrons are good at capturing long-range dependencies and positional patterns, but admittedly fall behind when it comes to learning local features, which is where CNNs shine. An interesting new perspective of viewing convolutions as a "sparse FC with shared parameters" was proposed by Ding et al. This perspective opens up a new way of looking at architectures. In this report we'll look at one such paper, which explores the idea of using convolutions with an extremely small kernel size of (1,1), essentially turning convolutions into standard matrix multiplications applied independently to each spatial location. This modification alone doesn't allow for the aggregation of spatial information. To compensate, the authors propose dense matrix multiplications that are applied to every feature across all spatial locations.
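To make this concrete, here is a minimal JAX sketch (with made-up shapes, not the paper's code) showing that a (1,1) convolution is just a matrix multiplication over the channel dimension at each spatial location, and that mixing information across locations requires a second multiplication over the flattened spatial axis:

import jax
import jax.numpy as jnp

k1, k2, k3 = jax.random.split(jax.random.PRNGKey(0), 3)
x = jax.random.normal(k1, (2, 4, 4, 3))         # [N, H, W, C] dummy "image"
w_channel = jax.random.normal(k2, (3, 8))       # a (1, 1) conv kernel is just a [C, D] matrix

# A (1, 1) convolution applies the same [C, D] matrix at every (h, w) location
# independently -- no spatial information is exchanged.
per_location = jnp.einsum("nhwc,cd->nhwd", x, w_channel)

# To aggregate spatial information, a second multiplication acts across the
# flattened spatial axis (the "token" axis), shared over all channels.
tokens = x.reshape(2, 16, 3)                    # [N, S, C] with S = H * W
w_spatial = jax.random.normal(k3, (16, 16))     # [S, S]
spatial_mix = jnp.einsum("nsc,st->ntc", tokens, w_spatial)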

Model Architecture

Figure 2: An overview of the MLP-Mixer architecture

1️⃣ MLP Block

Each MLP Block consists of two fully-connected layers with a non-linearity (GELU in this case) applied independently to each row of the input data tensor. It's important to note that the hidden widths of the MLP Blocks are chosen independently of the number of input patches. The computational complexity of the network is therefore linear in the number of input patches, unlike vision transformers, whose complexity is quadratic; overall, the complexity is linear in the number of pixels, as with CNNs. Below is an implementation of the MLP Block in Flax, together with the shared imports used by all the snippets in this report. (For the entire code see the associated GitHub repository.)
from typing import Any

import flax.linen as nn
import jax.numpy as jnp
from einops import rearrange

# Simple type aliases used throughout the snippets below
Array = Any
Dtype = Any


class MlpBlock(nn.Module):
    """A Flax Module consisting of two fully-connected layers with a GELU layer in between.

    Attributes:
        mlp_dim: No of output dimensions for the first FC
        approximate: If True, uses the approximate formulation of GELU
        dtype: the dtype of the computation (default: float32)
    """
    mlp_dim: int
    approximate: bool = True
    dtype: Dtype = jnp.float32

    @nn.compact
    def __call__(self, x) -> Array:
        y = nn.Dense(features=self.mlp_dim, dtype=self.dtype)(x)
        y = nn.gelu(y, approximate=self.approximate)
        out = nn.Dense(features=x.shape[-1], dtype=self.dtype)(y)
        return out
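As a quick sanity check (not part of the paper's code), we can initialize the block on a dummy batch of patch embeddings and confirm that the output retains the input's channel dimension, since the second Dense projects back to x.shape[-1]:

import jax

block = MlpBlock(mlp_dim=256)
dummy = jnp.ones((1, 196, 512))                  # [batch, tokens, channels]
params = block.init(jax.random.PRNGKey(0), dummy)
out = block.apply(params, dummy)
print(out.shape)                                 # (1, 196, 512)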

2️⃣ Mixer Block

Figure 3: A Pictographic View of the Mixer Block Architecture with some of its salient features.
A Mixer Block takes as input a sequence of S non-overlapping image patches, each projected to a hidden dimension C. The authors refer to this as a real-valued input table X \in \mathbb{R}^{S \times C}. If the input image has a resolution of (H, W) and we use square patches of resolution (P, P), then the number of patches follows directly from conserving the total area:
S = HW / P^2
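For example, a 224 \times 224 image split into 16 \times 16 patches gives S = 224 \cdot 224 / 16^2 = 196 tokens.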
A Mixer Block consists of two MLP Blocks:
  1. Token Mixing Block: This block acts along the columns of the input table X (after performing X^T). It maps \mathbb{R}^S \mapsto \mathbb{R}^S, where S is the input sequence length and is shared across all columns.
  2. Channel Mixing Block: This block acts along the rows of the input table X. It maps \mathbb{R}^C \mapsto \mathbb{R}^C and is shared across all rows.
Parameter tying within each MLP Block is a key design choice of this architecture: the same token-mixing MLP (the same kernel / parameters) is shared across all columns, and the same channel-mixing MLP is shared across all rows. This gives the model positional invariance, a prominent feature of convolutions. This "parameter tying" also keeps the parameter count from growing too quickly as C or S increases and, as reported, leads to significant memory savings.
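To get a rough sense of those savings, here is a back-of-the-envelope sketch with dimensions at roughly Mixer-B/16 scale (ignoring biases and LayerNorm); the "untied" variant is purely hypothetical and only there for comparison:

S, C = 196, 768         # tokens and channels, roughly Mixer-B/16 scale
D_S, D_C = 384, 3072    # hidden widths of the token- and channel-mixing MLPs

# Tied (what MLP-Mixer does): one token-mixing MLP shared by all columns and
# one channel-mixing MLP shared by all rows.
tied_token = S * D_S + D_S * S            # weights of the token-mixing MLP
tied_channel = C * D_C + D_C * C          # weights of the channel-mixing MLP

# Untied (hypothetical): a separate MLP for every column / every row.
untied_token = C * tied_token
untied_channel = S * tied_channel

print(f"tied:   {tied_token + tied_channel:,}")      # ~4.9 million weights per block
print(f"untied: {untied_token + untied_channel:,}")  # ~1.0 billion weights per block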
Another thing to note about the Mixer Block architecture is that it has an "isotropic" design, meaning all layers of the block take an input of the same size (width). This is a common choice for Transformers and RNNs; CNNs, on the other hand, have a "pyramidal" structure, where deeper layers operate at a lower resolution but with more channels.
Unlike Vision Transformers, MLP-Mixer doesn't use positional embeddings, because the Token Mixing Block is sensitive to the order of the input tokens and can therefore learn to represent "locations". MLP-Mixer also uses standard components such as Layer Normalization and skip connections, along with a classifier head.
Below is an implementation of the Mixer Block in Flax. (For the entire code see the associated github repository)
class MixerBlock(nn.Module):
    """A Flax Module to act as the mixer block layer for the MLP-Mixer Architecture.

    Attributes:
        tokens_mlp_dim: No of dimensions for the MLP Block 1
        channels_mlp_dim: No of dimensions for the MLP Block 2
        approximate: If True, uses the approximate formulation of GELU in each MLP Block
        dtype: the dtype of the computation (default: float32)
    """
    tokens_mlp_dim: int
    channels_mlp_dim: int
    approximate: bool = True
    dtype: Dtype = jnp.float32

    @nn.compact
    def __call__(self, x) -> Array:
        # Layer Normalization
        y = nn.LayerNorm(dtype=self.dtype)(x)
        # Transpose: [batch, tokens, channels] -> [batch, channels, tokens]
        y = jnp.swapaxes(y, 1, 2)
        # MLP 1: token mixing (acts along the token axis)
        y = MlpBlock(
            mlp_dim=self.tokens_mlp_dim,
            approximate=self.approximate,
            dtype=self.dtype,
            name="token_mixing",
        )(y)
        # Transpose back: [batch, channels, tokens] -> [batch, tokens, channels]
        y = jnp.swapaxes(y, 1, 2)
        # Skip connection
        x = x + y
        # Layer Normalization
        y = nn.LayerNorm(dtype=self.dtype)(x)
        # MLP 2: channel mixing, with skip connection
        out = x + MlpBlock(
            mlp_dim=self.channels_mlp_dim,
            approximate=self.approximate,
            dtype=self.dtype,
            name="channel_mixing",
        )(y)
        return out
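As another quick (hypothetical) shape check: the block maps a [batch, tokens, channels] input to an output of exactly the same shape, which is what allows it to be stacked an arbitrary number of times:

mixer_block = MixerBlock(tokens_mlp_dim=384, channels_mlp_dim=3072)
dummy = jnp.ones((1, 196, 768))                  # [batch, tokens, channels]
params = mixer_block.init(jax.random.PRNGKey(0), dummy)
print(mixer_block.apply(params, dummy).shape)    # (1, 196, 768)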

🏠 Complete Model

The model also includes a "Per-Patch Fully Connected layer", which converts the input image patches into fixed-length vectors. This forms our input table (X \in \mathbb{R}^{S \times C}), which is then passed through a stack of Mixer Blocks. Finally, a classification head is added to produce the desired output.
Below is an implementation of the MLPMixer in Flax. (For the entire code see the associated github repository)
class MlpMixer(nn.Module):
    """Flax Module for the MLP-Mixer Architecture.

    Attributes:
        patches: Patch configuration
        num_classes: No of classes for the output head
        num_blocks: No of Blocks of Mixers to use
        hidden_dim: No of Hidden Dimension for the Patch-Wise Convolution Layer
        tokens_mlp_dim: No of dimensions for the MLP Block 1
        channels_mlp_dim: No of dimensions for the MLP Block 2
        approximate: If True, uses the approximate formulation of GELU in each MLP Block
        dtype: the dtype of the computation (default: float32)
    """
    patches: Any
    num_classes: int
    num_blocks: int
    hidden_dim: int
    tokens_mlp_dim: int
    channels_mlp_dim: int
    approximate: bool = True
    dtype: Dtype = jnp.float32

    @nn.compact
    def __call__(self, inputs, *, train) -> Array:
        del train
        # Per-Patch Fully Connected Layer (a convolution whose kernel and stride
        # both equal the patch size)
        x = nn.Conv(
            features=self.hidden_dim,
            kernel_size=self.patches.size,
            strides=self.patches.size,
            dtype=self.dtype,
            name="stem",
        )(inputs)
        x = rearrange(x, "n h w c -> n (h w) c")
        # num_blocks x Mixer Blocks
        for _ in range(self.num_blocks):
            x = MixerBlock(
                tokens_mlp_dim=self.tokens_mlp_dim,
                channels_mlp_dim=self.channels_mlp_dim,
                approximate=self.approximate,
                dtype=self.dtype,
            )(x)
        # Output Head
        x = nn.LayerNorm(dtype=self.dtype, name="pre_head_layer_norm")(x)
        x = jnp.mean(x, axis=1, dtype=self.dtype)
        return nn.Dense(
            self.num_classes,
            kernel_init=nn.initializers.zeros,
            dtype=self.dtype,
            name="head",
        )(x)
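Finally, a hypothetical end-to-end check. Here the "patches" attribute is assumed to be any object exposing a size tuple with the patch height and width (the upstream code typically passes an ml_collections ConfigDict here), and the other numbers roughly follow the Mixer-B/16 configuration:

from types import SimpleNamespace

model = MlpMixer(
    patches=SimpleNamespace(size=(16, 16)),   # stand-in for the patch config
    num_classes=1000,
    num_blocks=12,
    hidden_dim=768,
    tokens_mlp_dim=384,
    channels_mlp_dim=3072,
)
images = jnp.ones((1, 224, 224, 3))           # [batch, height, width, channels]
variables = model.init(jax.random.PRNGKey(0), images, train=False)
logits = model.apply(variables, images, train=False)
print(logits.shape)                           # (1, 1000)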

Results

The following graph shows the fine-tuning performance of a Mixer-B/16 fine-tuned at an input size of 224, compared with a ViT-B/32 fine-tuned at input sizes of 224, 128, 64 and 32. As is evident from the graph, Mixer performs just as well as ViT when fine-tuned, and let's not forget that Mixer has neither convolutions nor attention. Being able to deliver reasonably similar performance is impressive, to say the least.

Some Interesting Results from the Paper 🧐

Effect of Scale 🪜

There are two ways to scale an MLP-Mixer model outlined in the paper:
  1. Increase the model size (viz. number of layers, hidden dimensions, MLP widths) when pre-training.
  2. Increase the input image resolution when fine-tuning.
Figure 4: Role of Model Scale
The authors report that when trained on ImageNet from scratch, Mixer lies around 3% behind ViT. But as the pre-training dataset grows in size, Mixer's performance steadily increases. When pre-trained on JFT-300M, Mixer lies around 0.3% behind ViT while being ~2x faster.

PreTraining Dataset Size 📦

The authors report that pre-training on larger datasets improves Mixer's performance. When pre-trained on smaller subsets of JFT-300M, all Mixer models overfit. BiT overfits less, possibly because of the inductive biases associated with convolutions. But as the size of the pre-training dataset increases, Mixer's performance keeps growing while BiT's plateaus.
Figure 5: PreTrained Dataset size effect on performance
Compared with ViT, the relative improvement is even more distinct: Mixer models improve more with dataset size than ViT does. The explanation the authors give is the difference in inductive biases:
" Self-Attention layers in ViT lead to certain properties of the learned functions that are less compatible with the true underlying distribution than those discovered with the Mixer architecture "

Visualizations 👀

The first few layers of convolutional neural networks are known to learn detectors that act on pixels in a particular local region of the image. In contrast, Mixer allows for global information exchange in the Token-Mixing blocks (the first MLP), which enable communication between different spatial locations.
Figure 6: Some weights of the MLP blocks. It's important to note that, in contrast to convolutional kernels where each weight corresponds to a pixel, here each weight corresponds to a particular 16x16 patch.
Figure 6 shows the weights of the first few Token Mixing blocks of a Mixer trained on JFT-300M. Some of the learned features act on the entire image, whereas others operate on smaller portions of it. The first few blocks capture local interactions, whereas the "deeper" blocks learn features across larger regions of the image.

Conclusion

In this paper the authors propose a remarkably simple architecture for vision. Although the model doesn't improve upon current SOTA performance, it performs competitively, especially when scaled. Continuing recent work in the field, this paper is one of many that raise the question "Is attention necessary?" and hopefully encourages others to think of new, interesting architectures beyond the infinite well of convolutions and attention.