
The Evolution of Mobile CNN Architectures

A look back at the key breakthroughs in neural-network-based mobile computer vision models
This report is a translation of "The Evolution Of Mobile CNN Architectures" by Carlo Lepelaars.

Introduction

Over the past few years, it has become much more feasible to use convolutional neural networks (CNNs) on embedded and mobile devices. Tools such as TensorFlow Lite and ONNX offer further opportunities to accelerate deep learning on embedded devices. For computer vision, we typically use dedicated CNN architectures for these applications. This report gives an overview of the available mobile CNN architectures, and we will also evaluate a selection of these models on the dataset of a recent Kaggle competition. The use case is detecting diseases in apple trees using data from Kaggle's Plant Pathology 2020 competition.

Check out the accompanying Kaggle Notebook →




Model Architectures

MobileNet (2017)

MobileNets was one of the first initiatives to build CNN models that can easily be deployed in mobile applications. One of the main innovations is the depthwise separable convolution, illustrated in the figure below. A separable convolution splits an ordinary convolution kernel into two kernels. For example, instead of a single 3x3 kernel, we get a 3x1 and a 1x3 kernel. This separation reduces the number of operations needed to perform a convolution and is therefore more efficient. However, separation in the spatial dimensions is not always possible, so separation in the depth (channel) dimension is more common. This depthwise separable convolution is what MobileNet uses. The paper also introduces a width multiplier, which lets you easily scale the CNN model to your use case. The model became a convenient solution for mobile object detection, face detection, and image classification applications.

For a deeper dive into depthwise separable convolutions, check out this blog post by Chi-Feng Wang.

[Figure: standard convolution vs. depthwise separable convolution]
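To make the savings concrete, here is a minimal Keras sketch (the shapes and layer choices are illustrative, not taken from the MobileNet code) comparing a standard 3x3 convolution with a depthwise separable one:

```python
import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(224, 224, 32))

# Standard 3x3 convolution: 3*3*32*64 = 18,432 weights (plus biases).
standard = layers.Conv2D(64, kernel_size=3, padding="same")(inputs)

# Depthwise separable convolution: a 3x3 depthwise convolution per channel
# (3*3*32 = 288 weights) followed by a 1x1 pointwise convolution
# (1*1*32*64 = 2,048 weights) -- about 8x fewer weights and multiply-adds here.
depthwise = layers.DepthwiseConv2D(kernel_size=3, padding="same")(inputs)
separable = layers.Conv2D(64, kernel_size=1)(depthwise)

model = tf.keras.Model(inputs, [standard, separable])
model.summary()  # compare the parameter counts of the two branches
```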

ShuffleNet (2017) / ShuffleNetV2 (2018)

ShuffleNet improved the state of the art for mobile CNN architectures by introducing pointwise group convolutions and channel shuffle. Pointwise group convolutions are used to speed up the 1x1 convolutions that are common in mobile CNN architectures. However, these convolutions have the side effect that the output of a particular channel is derived from only a small fraction of the input channels. We can mitigate this side effect by dividing the channels of each group into several subgroups, which is the channel shuffle operation. ShuffleNetV2 lists several practical guidelines for efficient CNN architectures. It further optimizes the architecture by changing the bottleneck and introducing a channel split. In the figure below, we can see the differences between ShuffleNetV1 (a, b) and ShuffleNetV2 (c, d).

[Figure: ShuffleNetV1 (a, b) and ShuffleNetV2 (c, d) building blocks]
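The channel shuffle operation itself boils down to a reshape and a transpose. A minimal TensorFlow 2 sketch (the function name and the group count are illustrative):

```python
import tensorflow as tf

def channel_shuffle(x, groups):
    """Shuffle channels so that each group receives channels from every other group."""
    _, height, width, channels = x.shape
    channels_per_group = channels // groups
    # Split channels into (groups, channels_per_group), swap the two axes,
    # and flatten back: the channels of each group end up interleaved across groups.
    x = tf.reshape(x, [-1, height, width, groups, channels_per_group])
    x = tf.transpose(x, [0, 1, 2, 4, 3])
    return tf.reshape(x, [-1, height, width, channels])

# Example: a feature map with 8 channels shuffled across 2 groups.
features = tf.random.normal([1, 56, 56, 8])
shuffled = channel_shuffle(features, groups=2)
print(shuffled.shape)  # (1, 56, 56, 8)
```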

MobileNetV2 (2018)

V2 of the MobileNet family introduces inverted residuals and linear bottlenecks to improve the performance of MobileNets. Inverted residuals allow the network to compute (ReLU) activations more efficiently and to preserve more information after activation. To preserve this information, it is important that the last activation in the bottleneck is a linear activation. The figure below, taken from the original MobileNetV2 paper, shows the bottleneck with the inverted residual. The thicker blocks in this figure have more channels.

[Figure: MobileNetV2 bottleneck with inverted residual]
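As a rough sketch of such a block in Keras (the expansion factor, channel counts, and layer choices are illustrative; the real implementation ships as tf.keras.applications.MobileNetV2): expand with a 1x1 convolution, filter with a 3x3 depthwise convolution, then project back down with a linear 1x1 convolution and add the residual.

```python
import tensorflow as tf
from tensorflow.keras import layers

def inverted_residual(x, expansion=6, stride=1):
    """Sketch of a MobileNetV2-style inverted residual block with a linear bottleneck."""
    in_channels = x.shape[-1]
    # 1x1 expansion to a wider representation, with ReLU6.
    h = layers.Conv2D(in_channels * expansion, 1, use_bias=False)(x)
    h = layers.BatchNormalization()(h)
    h = layers.ReLU(max_value=6.0)(h)
    # 3x3 depthwise convolution on the expanded representation.
    h = layers.DepthwiseConv2D(3, strides=stride, padding="same", use_bias=False)(h)
    h = layers.BatchNormalization()(h)
    h = layers.ReLU(max_value=6.0)(h)
    # Linear 1x1 projection back to the narrow bottleneck: no activation here,
    # so the information in the bottleneck is preserved.
    h = layers.Conv2D(in_channels, 1, use_bias=False)(h)
    h = layers.BatchNormalization()(h)
    # Residual connection only when spatial size and channel count match.
    return layers.Add()([x, h]) if stride == 1 else h

inputs = tf.keras.Input(shape=(56, 56, 24))
outputs = inverted_residual(inputs)
tf.keras.Model(inputs, outputs).summary()
```

The key point is the absence of a non-linearity after the final 1x1 projection, which is what makes the bottleneck "linear".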

Furthermore, the ReLU6 activation is introduced here to speed up the activations for low-precision calculations and to make them more suitable for quantization. The ReLU6 activation is defined as:

ReLU6(x) = min(max(x, 0), 6)

Along with object detection and image classification, the paper also demonstrated promising results for efficient semantic segmentation using techniques from the DeepLabV3 paper.

NASNetMobile (2018)

The NASNet research aimed to find an optimal CNN architecture using reinforcement learning. NAS stands for Neural Architecture Search and is a technique developed at Google Brain for searching through a space of neural network configurations. NAS was used with standard datasets like CIFAR10 and ImageNet to optimize CNNs for different sizes. The reduced version is called NASNetMobile. Below you see the best reduced convolutional cell derived with NAS on CIFAR10.

[Figure: best reduced convolutional cell found with NAS on CIFAR10]

FBNet (2018)

FBNet stands for Facebook-Berkeley-Nets and introduces a family of device-specific (e.g., iPhone X and Samsung Galaxy S8) mobile architectures derived with neural architecture search. The main innovation here is a differentiable version of neural architecture search (DNAS), which is visualized in the figure below.

[Figure: differentiable neural architecture search (DNAS)]

EfficientNetB0 (2019)

The EfficientNet research explores how to scale CNN architectures efficiently by calculating compound scaling parameters. The smallest version of EfficientNet is EfficientNetB0. This architecture is similar to NASNetMobile but mainly utilizes the bottlenecks we have seen in MobileNetV2. Furthermore, this research also adds squeeze-and-excite optimization and the powerful Swish activation function. This is an activation that was optimized using reinforcement learning (NAS) and is defined as:

Swish(x) = x · sigmoid(βx)

where β can be defined as a constant or as a trainable parameter.
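To illustrate the compound scaling mentioned above: the EfficientNet paper fixes base coefficients for depth, width, and resolution and then scales all three together with a single compound coefficient φ. A small sketch using the coefficients reported in the paper (the helper function itself is illustrative):

```python
# Compound scaling sketch: alpha/beta/gamma are the base coefficients reported
# in the EfficientNet paper; phi is the compound coefficient chosen per model.
alpha, beta, gamma = 1.2, 1.1, 1.15  # depth, width, resolution

def compound_scale(phi):
    depth_mult = alpha ** phi       # multiply the number of layers by this
    width_mult = beta ** phi        # multiply the number of channels by this
    resolution_mult = gamma ** phi  # multiply the input resolution by this
    return depth_mult, width_mult, resolution_mult

for phi in range(4):  # phi = 0 leaves the baseline (B0) unchanged
    d, w, r = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}")
```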

MobileNetV3 (2019)

MobileNetV3 introduces several tricks to optimize MobileNetV2, both in speed and performance. The architecture was optimized for mobile phone CPUs using both hardware-aware NAS and the NetAdapt algorithm. For semantic segmentation, MobileNetV3 has an efficient decoder called Lite Reduced Atrous Spatial Pyramid Pooling (LR-ASPP). Furthermore, the Hard Swish activation is used to improve over ReLU6 activations and as a more efficient alternative to the normal Swish activation. The Hard Swish activation is defined as:

hswish(x) = x · ReLU6(x + 3) / 6
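Since ReLU6 ships with TensorFlow, the three activations discussed above can be written directly from their formulas. A minimal sketch (the swish variant with an explicit β is illustrative; Keras also provides a built-in swish with β fixed to 1):

```python
import tensorflow as tf

def relu6(x):
    # ReLU6(x) = min(max(x, 0), 6)
    return tf.nn.relu6(x)

def swish(x, beta=1.0):
    # Swish(x) = x * sigmoid(beta * x); beta could also be a trainable variable
    return x * tf.sigmoid(beta * x)

def hard_swish(x):
    # hswish(x) = x * ReLU6(x + 3) / 6 -- a cheap, quantization-friendly
    # piecewise-linear approximation of swish
    return x * tf.nn.relu6(x + 3.0) / 6.0

x = tf.linspace(-6.0, 6.0, 7)
print(relu6(x).numpy())
print(swish(x).numpy())
print(hard_swish(x).numpy())
```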

GhostNet (2020)

The GhostNet architecture by Huawei's Noah's Ark Lab is regarded as the state of the art among mobile architectures. The core feature of GhostNet is the "Ghost module". The Ghost module first uses ordinary convolutions to generate a set of feature maps and then applies cheap linear operations to augment them with additional "ghost features". Because of these linear operations, the architecture requires far fewer FLOPs and parameters for its convolutions, while it is still able to learn powerful representations of the data. Furthermore, the Ghost module features an identity mapping to preserve the initial feature maps. See the figure below for the differences between an ordinary convolution and a Ghost module.

[Figure: an ordinary convolution vs. a Ghost module]

As we have seen with EfficientNetB0, squeeze-and-excite optimization is also used in the residual layers of GhostNet's bottlenecks. Despite the advances in activation functions that we have discussed, GhostNet still uses the regular ReLU activation.
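A rough Keras sketch of the idea behind a Ghost module (the ratio, kernel size, and layer choices are illustrative; see the linked implementation for the actual layers): a small ordinary convolution produces the primary feature maps, a cheap depthwise convolution produces the ghost features, and the two are concatenated.

```python
import tensorflow as tf
from tensorflow.keras import layers

def ghost_module(x, out_channels, ratio=2, dw_kernel=3):
    """Sketch of a Ghost module: a few ordinary convolutions + cheap linear ops."""
    primary_channels = out_channels // ratio
    # Ordinary convolution producing only a fraction of the output channels.
    primary = layers.Conv2D(primary_channels, 1, use_bias=False)(x)
    primary = layers.BatchNormalization()(primary)
    primary = layers.ReLU()(primary)
    # Cheap linear operation (a depthwise convolution) generating the "ghost"
    # feature maps from the primary ones.
    ghost = layers.DepthwiseConv2D(dw_kernel, padding="same", use_bias=False)(primary)
    ghost = layers.BatchNormalization()(ghost)
    ghost = layers.ReLU()(ghost)
    # The identity mapping of the primary maps is preserved by the concatenation,
    # so with ratio=2 the output has out_channels feature maps in total.
    return layers.Concatenate()([primary, ghost])

inputs = tf.keras.Input(shape=(56, 56, 16))
outputs = ghost_module(inputs, out_channels=32)
tf.keras.Model(inputs, outputs).summary()
```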

Experiments

Dataset

We use data from Kaggle's Plant Pathology 2020 competition. This dataset contains around 1800 images of apple tree leaves with 4 classes denoting diseases in apple trees. Below is a sample of images labeled with the "Multiple diseases" class. Note that this is a relatively small dataset and that if we leverage transfer learning methods, we are likely to get much better results. To learn more about the dataset itself, check out Tarun Paparaju's excellent exploratory notebook for this competition.

[Figure: apple tree leaves labeled with the "Multiple diseases" class]



Experimental Setup

In this experiment we will evaluate MobileNet, MobileNetV2, NASNetMobile, EfficientNetB0 and GhostNet. Our main metrics of interest are the validation accuracy, the inference speed per image, and the number of parameters in a model.

The inference speed is measured by predicting all images in our test generator and calculating the time it takes to predict a single image using a Tensorflow Dataset generator. Note that this is quite a crude method because the time may vary based on CPU/GPU specifics and other running processes. However, it gives us a reasonable ballpark estimate of how fast the model is.
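A minimal sketch of how such a per-image timing could be computed (the model and dataset names are placeholders, not the notebook's actual code):

```python
import time
import tensorflow as tf

def time_per_image(model, dataset, num_images):
    """Crude per-image inference time: predict the whole dataset and divide."""
    start = time.time()
    model.predict(dataset)          # run inference over the full test generator
    elapsed = time.time() - start
    return elapsed / num_images     # average seconds per image

# Hypothetical usage with a tf.data pipeline of test images:
# test_ds = tf.data.Dataset.from_tensor_slices(test_images).batch(32)
# print(f"{time_per_image(model, test_ds, len(test_images)) * 1000:.2f} ms / image")
```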

This experiment uses pre-trained ImageNet weights for the MobileNet and NASNetMobile architectures and Noisy Student weights for fine-tuning EfficientNetB0. GhostNet is trained from Glorot uniform weight initialization, and we use sunnyyeah's implementation of GhostNet for Tensorflow 2. Our learning rate schedule is an exponential decay schedule with 5 warm-up epochs. For GhostNet, we use a higher learning rate because we have to train the network from random weight initialization. Our data augmentations are horizontal and vertical flips.
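As a sketch of what this setup could look like in TensorFlow 2 (the hyperparameter values, warm-up schedule, and classification head are illustrative rather than the exact notebook code):

```python
import tensorflow as tf
from tensorflow.keras import layers

NUM_CLASSES = 4
WARMUP_EPOCHS = 5
STEPS_PER_EPOCH = 50  # illustrative

# MobileNetV2 backbone with pre-trained ImageNet weights and a small classification head.
backbone = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet", pooling="avg")
model = tf.keras.Sequential([backbone, layers.Dense(NUM_CLASSES, activation="softmax")])

# Exponential decay with a linear warm-up over the first epochs.
class WarmupExponentialDecay(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, base_lr=1e-3, decay_rate=0.96):
        self.base_lr, self.decay_rate = base_lr, decay_rate
        self.warmup_steps = WARMUP_EPOCHS * STEPS_PER_EPOCH

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        warmup_lr = self.base_lr * step / self.warmup_steps
        decayed_lr = self.base_lr * self.decay_rate ** ((step - self.warmup_steps) / STEPS_PER_EPOCH)
        return tf.where(step < self.warmup_steps, warmup_lr, decayed_lr)

model.compile(optimizer=tf.keras.optimizers.Adam(WarmupExponentialDecay()),
              loss="categorical_crossentropy", metrics=["accuracy"])

# Horizontal and vertical flips as data augmentation on a tf.data pipeline.
def augment(image, label):
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_flip_up_down(image)
    return image, label
# train_ds = train_ds.map(augment)  # hypothetical training dataset
```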

Check out this Kaggle Notebook to see the full experimental setup →

Results

EfficientNetB0 comes out on top in terms of validation accuracy and is fast, although it has relatively many parameters. MobileNetV2 gives good results and is the smallest of our selection. Unfortunately, we did not manage to reproduce state-of-the-art results for GhostNet on our image classification problem. One likely reason is that we could not leverage pre-trained ImageNet weights for this model.




[Run set: 5 runs]


Conclusion

In our experiment, EfficientNetB0 achieves the highest validation accuracy of all models. The difference in accuracy with the second-best model, MobileNetV2, is 1%. As we expected, MobileNetV2 converges faster and is more accurate compared to the first version of MobileNet. However, MobileNetV2 is slightly slower than MobileNet when we benchmark on the GPU in Kaggle Notebooks. Note that MobileNetV2 could still be faster on mobile devices. This discrepancy is because depthwise separable convolutions are currently not supported in cuDNN.

If working memory is very limited, then MobileNetV2 would be the best choice out of the models in our experiment. If this is not a hard constraint, then EfficientNetB0 seems to be the best architecture for mobile applications currently. Unfortunately, it is not clear from our experiments if the clever GhostNet architecture also yields state-of-the-art results beyond academic datasets like CIFAR10 and ImageNet. We hope to see more experiments on GhostNet by the community to see if this architecture also shines on real-world mobile computer vision applications.

Final Thoughts

In this report, we have not evaluated the effect of techniques like model pruning and quantization on these CNN architectures. Note that there is a separate trade-off between speed and accuracy with these techniques, and some mobile architectures may be better suited for pruning or quantization.


We hope this introduction gives you a good overview of mobile CNN architectures! If so, be sure to check out the accompanying Kaggle notebook for this report.

If you have any questions or feedback, feel free to comment below. You can also contact me on Twitter @carlolepelaars.

Check out the accompanying Kaggle Notebook →

