The Evolution Of Mobile CNN Architectures

Looking back at core breakthroughs in neural network-based mobile computer vision models. Made by Carlo Lepelaars using Weights & Biases


Over the past years, convolutional neural networks (CNNs) have become more feasible for usage in embedded and mobile devices. Tools like Tensorflow Lite and ONNX have further accelerated opportunities for the application of deep learning in embedded devices. For computer vision, we often use specialized CNN architectures for these applications. This report gives an overview of the available mobile CNN architectures. We will also evaluate a subset of these models on a dataset from a recent Kaggle competition. The use case will be detecting diseases in apple trees using data from Kaggle's Plant Pathology 2020 competition.

**Check out the accompanying Kaggle Notebook** →


MobileNet (2017)

MobileNets were one of the first initiatives to build CNN architectures that can easily be deployed in mobile applications. One of the main innovations is the depthwise separable convolution, which is visualized below. A separable convolution splits a normal convolution kernel into two smaller kernels. For example, a spatially separable convolution replaces a 3x3 kernel with a 3x1 and a 1x3 kernel. This separation reduces the number of operations needed to perform the convolution and is therefore much more efficient. However, not every kernel can be separated along the spatial dimensions, so it is more common to separate along the depth (channel) dimension: a depthwise convolution filters each input channel independently, and a 1x1 pointwise convolution then combines the channels. This depthwise separable convolution is used in MobileNet. The paper also introduces a width multiplier that lets you easily scale the CNN model depending on your use case. The model became an easy solution for mobile object detection, face detection, and image classification applications.

For a more in-depth look into depthwise separable convolutions, check out this blog post by Chi-Feng Wang.
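To make the efficiency gain concrete, here is a small sketch in Keras (the layer shapes are made up for illustration) that compares the parameter count of a standard 3x3 convolution with a depthwise separable one:

```python
import tensorflow as tf

# Hypothetical shapes for illustration: a 32x32 feature map, 64 -> 128 channels.
inputs = tf.keras.Input(shape=(32, 32, 64))

standard = tf.keras.layers.Conv2D(128, 3, padding="same")(inputs)
separable = tf.keras.layers.SeparableConv2D(128, 3, padding="same")(inputs)

standard_params = tf.keras.Model(inputs, standard).count_params()
separable_params = tf.keras.Model(inputs, separable).count_params()

# Standard 3x3 conv:       3*3*64*128 + 128 bias            = 73,856 parameters
# Depthwise separable:     3*3*64 (depthwise) + 64*128 + 128 =  8,896 parameters
print(standard_params, separable_params)
```

The depthwise separable version needs roughly 8x fewer parameters for the same input/output shape, which is where MobileNet's speedup comes from.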


ShuffleNet (2017) / ShuffleNetV2 (2018)

ShuffleNet advanced the state of the art for mobile CNN architectures by introducing pointwise group convolutions and channel shuffle. Pointwise group convolutions speed up the 1x1 convolutions that are common in mobile CNN architectures. However, these group convolutions have the side effect that outputs from a particular channel are only derived from a small fraction of the input channels. The channel shuffle operation mitigates this side effect by dividing the channels from each group into multiple subgroups and mixing them across groups. The ShuffleNetV2 paper lays out several practical guidelines for efficient CNN architectures. It further optimizes the architecture with changes in the bottleneck and by introducing channel split. In the image below, we can see the differences between ShuffleNetV1 (a., b.) and ShuffleNetV2 (c., d.).


MobileNetV2 (2018)

V2 of the MobileNet series introduced inverted residuals and linear bottlenecks to improve the performance of MobileNets. Inverted residuals allow the network to compute (ReLU) activations more efficiently and to preserve more information after activation. To preserve this information, it is important that the last layer of the bottleneck has a linear activation. The figure below, from the original MobileNetV2 paper, shows the bottleneck with inverted residuals. Thicker blocks in this figure have more channels.


Furthermore, the ReLU6 activation is introduced here to speed up the activations for low-precision calculations and to make them more suitable for quantization. The ReLU6 activation is defined as:

$\mathrm{ReLU6}(x) = \min(\max(x, 0), 6)$
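The definition translates directly into code. A minimal numpy sketch:

```python
import numpy as np

def relu6(x):
    # ReLU6(x) = min(max(x, 0), 6): a ReLU capped at 6, which keeps
    # activations in a small fixed range for low-precision arithmetic.
    return np.minimum(np.maximum(x, 0.0), 6.0)

x = np.array([-2.0, 0.0, 3.0, 8.0])
print(relu6(x))  # [0. 0. 3. 6.]
```

Because the output is bounded to $[0, 6]$, the activation can be represented with few bits, which is what makes it quantization-friendly.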

Along with object detection and image classification, the paper also demonstrated promising results for efficient semantic segmentation using techniques from the DeepLabV3 paper.

NASNetMobile (2018)

The NASNet research aimed to find an optimal CNN architecture using reinforcement learning. NAS stands for Neural Architecture Search and is a technique developed at Google Brain for searching through a space of neural network configurations. NAS was used with standard datasets like CIFAR10 and ImageNet to optimize CNNs for different sizes. The reduced version is called NASNetMobile. Below you see the best reduced convolutional cell derived with NAS on CIFAR10.


FBNet (2018)

FBNet stands for Facebook-Berkeley-Nets and introduces a family of device-specific (e.g., iPhone X and Samsung Galaxy S8) mobile architectures derived with neural architecture search. The main innovation here is a differentiable version of neural architecture search (DNAS), which is visualized in the figure below.


EfficientNetB0 (2019)

The EfficientNet research explores how to efficiently scale CNN architectures by calculating compound scaling parameters. The smallest version of EfficientNet is EfficientNetB0. This architecture is similar to NASNetMobile, but mainly utilizes the bottlenecks we have seen in MobileNetV2. Furthermore, this research adds squeeze-and-excite optimization and the powerful Swish activation function. Swish is an activation that was optimized using reinforcement learning (NAS) and is defined as:

$\mathrm{Swish}(x) = x \cdot \mathrm{sigmoid}(\beta x)$

Where $\beta$ can be defined as a constant or as a trainable parameter.
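A minimal numpy sketch of the definition, with $\beta$ as a keyword argument (in a deep learning framework it could instead be a trainable parameter):

```python
import numpy as np

def swish(x, beta=1.0):
    # Swish(x) = x * sigmoid(beta * x).
    # Written as x / (1 + exp(-beta * x)), which is the same expression.
    return x / (1.0 + np.exp(-beta * x))

# With beta = 1 this is also known as the SiLU activation;
# as beta grows large, Swish approaches ReLU.
print(swish(np.array([-1.0, 0.0, 1.0])))
```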

MobileNetV3 (2019)

MobileNetV3 introduces several tricks to optimize MobileNetV2, both in speed and performance. The architecture was optimized for mobile phone CPUs using a combination of hardware-aware NAS and the NetAdapt algorithm. For semantic segmentation, MobileNetV3 features an efficient decoder called "Lite Reduced Atrous Spatial Pyramid Pooling" (LR-ASPP). Furthermore, the Hard Swish activation is used to improve over ReLU6 activations and as a more efficient alternative to the normal Swish activation. The Hard Swish activation is defined as:

$\mathrm{hswish}(x) = x \cdot \frac{\mathrm{ReLU6}(x+3)}{6}$
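Hard Swish replaces the sigmoid in Swish with a piecewise-linear approximation built from ReLU6, so it only needs additions, multiplications, and clipping. A minimal numpy sketch:

```python
import numpy as np

def relu6(x):
    return np.minimum(np.maximum(x, 0.0), 6.0)

def hard_swish(x):
    # hswish(x) = x * ReLU6(x + 3) / 6: a cheap, sigmoid-free
    # approximation of Swish that is friendly to mobile CPUs.
    return x * relu6(x + 3.0) / 6.0

# Behaves like the identity for x >= 3 and outputs 0 for x <= -3.
print(hard_swish(np.array([-4.0, -3.0, 0.0, 3.0, 5.0])))
```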

GhostNet (2020)

The GhostNet architecture by Huawei's Noah's Ark Lab is regarded as a state-of-the-art mobile architecture. The core feature of GhostNet is the "Ghost module". The Ghost module first uses an ordinary convolution to generate a set of intrinsic feature maps, and then applies cheap linear operations to these maps to generate additional "ghost" feature maps. Because of these cheap linear operations, the architecture requires far fewer FLOPs and parameters than ordinary convolutions, while still being able to learn powerful representations of the data. Furthermore, the Ghost module features an identity mapping to preserve the intrinsic feature maps. See the figure below for the differences between an ordinary convolution and a Ghost module.
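The idea can be sketched in a few lines of Keras. This is a simplified illustration, not the official implementation: the function name, defaults, and the choice of a depthwise convolution as the "cheap linear operation" are assumptions in the spirit of the paper.

```python
import tensorflow as tf

def ghost_module(x, out_channels, ratio=2, dw_size=3):
    # An ordinary (pointwise) convolution produces a few intrinsic feature maps.
    intrinsic_channels = out_channels // ratio
    primary = tf.keras.layers.Conv2D(intrinsic_channels, 1,
                                     padding="same", use_bias=False)(x)
    # A cheap depthwise convolution then generates "ghost" feature maps
    # from the intrinsic ones.
    ghost = tf.keras.layers.DepthwiseConv2D(dw_size, padding="same",
                                            use_bias=False)(primary)
    # Concatenating the intrinsic maps acts as the identity mapping
    # that preserves them in the output.
    return tf.keras.layers.Concatenate()([primary, ghost])

inputs = tf.keras.Input(shape=(32, 32, 16))
outputs = ghost_module(inputs, out_channels=32)
model = tf.keras.Model(inputs, outputs)
print(model.output_shape)  # (None, 32, 32, 32)
```

With `ratio=2`, only half of the output channels are produced by a full convolution; the rest come from the much cheaper depthwise pass.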


As we have seen with EfficientNetB0, squeeze-and-excite optimization is also used in the residual layers of GhostNet's bottlenecks. Despite the advances in activation functions that we have discussed, GhostNet still uses the regular ReLU activation.



We use data from Kaggle's Plant Pathology 2020 competition. This dataset contains around 1800 images of apple tree leaves with 4 classes denoting diseases in apple trees. Below is a sample of images labeled with the "Multiple diseases" class. Note that this is a relatively small dataset, so we are likely to get much better results if we leverage transfer learning methods. To learn more about the dataset itself, check out Tarun Paparaju's excellent exploratory notebook for this competition.


Experimental Setup

In this experiment we will evaluate MobileNet, MobileNetV2, NASNetMobile, EfficientNetB0 and GhostNet. Our main metrics of interest are the validation accuracy, the inference speed per image, and the number of parameters in a model.

The inference speed is measured by predicting all images in our test generator and calculating the time it takes to predict a single image using a Tensorflow Dataset generator. Note that this is quite a crude method because the time may vary based on CPU/GPU specifics and other running processes. However, it gives us a reasonable ballpark estimate of how fast the model is.
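A sketch of this kind of benchmark is shown below. The model choice, input size, and batch size here are arbitrary assumptions for illustration, not the notebook's exact setup; note the warm-up pass so graph tracing is not counted.

```python
import time
import numpy as np
import tensorflow as tf

# Hypothetical benchmark setup: a small randomly initialized model
# and a batch of random images stand in for the real test generator.
model = tf.keras.applications.MobileNetV2(weights=None, input_shape=(96, 96, 3))
images = np.random.rand(64, 96, 96, 3).astype("float32")
dataset = tf.data.Dataset.from_tensor_slices(images).batch(16)

_ = model.predict(dataset, verbose=0)  # warm-up run to exclude tracing overhead

start = time.perf_counter()
_ = model.predict(dataset, verbose=0)
elapsed = time.perf_counter() - start

per_image_ms = 1000 * elapsed / len(images)
print(f"{per_image_ms:.2f} ms per image")
```

As noted above, the absolute numbers depend heavily on the hardware and other running processes, so only relative comparisons on the same machine are meaningful.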

This experiment uses pre-trained ImageNet weights for the MobileNet and NASNetMobile architectures, and Noisy Student weights for fine-tuning EfficientNetB0. GhostNet will be trained from Glorot uniform weight initialization, using sunnyyeah's implementation of GhostNet for Tensorflow 2. Our learning rate schedule is an exponential decay schedule with 5 warmup epochs. For GhostNet, we use a higher learning rate because we have to train the network from random weight initialization. Our data augmentations are horizontal and vertical flips.
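A schedule of this shape can be sketched as follows. The constants (peak learning rate, decay rate) are placeholder assumptions, not the notebook's values; the function could be plugged into `tf.keras.callbacks.LearningRateScheduler`.

```python
def lr_schedule(epoch, max_lr=1e-3, warmup_epochs=5, decay_rate=0.9):
    """Linear warmup over the first epochs, then exponential decay.
    Hypothetical constants for illustration only."""
    if epoch < warmup_epochs:
        # Ramp up linearly from max_lr/warmup_epochs to max_lr.
        return max_lr * (epoch + 1) / warmup_epochs
    # After warmup, decay exponentially from the peak learning rate.
    return max_lr * decay_rate ** (epoch - warmup_epochs)

# First 8 epochs: ramps up to 1e-3 over 5 epochs, then decays by 0.9x.
print([round(lr_schedule(e), 6) for e in range(8)])
```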

Check out this Kaggle Notebook to see the full experimental setup →


EfficientNetB0 comes out best in terms of validation accuracy and is fast, although it has relatively many parameters. MobileNetV2 gives good results and is the smallest model in our selection. Unfortunately, we did not manage to reproduce state-of-the-art results for GhostNet on our image classification problem. One likely reason is that we could not leverage pre-trained ImageNet weights for this model.



In our experiment, EfficientNetB0 achieves the highest validation accuracy of all models. The difference in accuracy with the second-best model, MobileNetV2, is 1%. As we expected, MobileNetV2 converges faster and is more accurate compared to the first version of MobileNet. However, MobileNetV2 is slightly slower than MobileNet when we benchmark on the GPU in Kaggle Notebooks. Note that MobileNetV2 could still be faster on mobile devices. This discrepancy is because depthwise separable convolutions are currently not supported in cuDNN.

If working memory is very limited, then MobileNetV2 would be the best choice out of the models in our experiment. If this is not a hard constraint, then EfficientNetB0 seems to be the best architecture for mobile applications currently. Unfortunately, it is not clear from our experiments if the clever GhostNet architecture also yields state-of-the-art results beyond academic datasets like CIFAR10 and ImageNet. We hope to see more experiments on GhostNet by the community to see if this architecture also shines on real-world mobile computer vision applications.

Final Thoughts

In this report, we have not evaluated the effect of techniques like model pruning and quantization on these CNN architectures. Note that there is a separate trade-off between speed and accuracy with these techniques, and some mobile architectures may be better suited for pruning or quantization.

We hope this introduction gives you a good overview of mobile CNN architectures! If so, be sure to check out the accompanying Kaggle notebook for this report.

If you have any questions or feedback, feel free to comment below. You can also contact me on Twitter @carlolepelaars.

Check out the accompanying Kaggle Notebook →