
Apple Releases New Mobile Friendly Vision Transformers

You no longer need to pay for expensive GPUs in the cloud: Apple's new vision transformers are designed for quick and easy deployment on the edge!
Vision Transformers (ViTs) have become a prominent tool in the field of image analysis, driving state-of-the-art performance across various tasks. However, traditional ViTs are often associated with high computational and memory requirements. Recent hybrid architectures have attempted to balance the trade-off between accuracy and latency, but the search for an optimal solution continues. In this context, Apple's introduction of FastViT has set a new standard for mobile image networks.

Design and Architecture of FastViT

The architecture of FastViT revolves around three core principles: removing skip connections via structural reparameterization, employing train-time overparameterization, and utilizing large kernel convolutions.

Novel Token Mixing Operator: RepMixer

At the heart of FastViT lies RepMixer, a token mixing operator built on structural reparameterization: the block trains with a skip connection, then folds it into a single depthwise convolution at inference, eliminating the memory access costs that skip connections incur.
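To make this concrete, here is a minimal PyTorch sketch of the reparameterization trick. It is an illustrative simplification, not the actual block from the ml-fastvit repo: during training the block computes x + BN(DWConv(x)), and at inference the BatchNorm and the skip connection are both folded into a single depthwise convolution. (The paper's RepMixer arranges its normalization slightly differently, but the folding math is analogous.)

```python
import torch
import torch.nn as nn

class RepMixerSketch(nn.Module):
    """Illustrative RepMixer-style block.
    Train time:  y = x + BN(DWConv(x))   -- skip connection present
    Inference:   y = DWConv'(x)          -- skip folded into the kernel
    """

    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        self.dim, self.k = dim, kernel_size
        self.conv = nn.Conv2d(dim, dim, kernel_size, padding=kernel_size // 2,
                              groups=dim, bias=False)  # depthwise
        self.bn = nn.BatchNorm2d(dim)
        self.reparam = None  # set by reparameterize()

    def forward(self, x):
        if self.reparam is not None:
            return self.reparam(x)  # one conv, no skip: fewer memory accesses
        return x + self.bn(self.conv(x))

    @torch.no_grad()
    def reparameterize(self):
        # 1) Fold BatchNorm into the conv (exact, since BN follows the conv):
        #    w' = w * gamma / sqrt(var + eps),  b' = beta - mean * gamma / sqrt(var + eps)
        scale = self.bn.weight / (self.bn.running_var + self.bn.eps).sqrt()
        w = self.conv.weight * scale.view(-1, 1, 1, 1)
        b = self.bn.bias - self.bn.running_mean * scale
        # 2) Absorb the identity skip by adding 1 at each kernel's center tap.
        w[:, 0, self.k // 2, self.k // 2] += 1.0
        self.reparam = nn.Conv2d(self.dim, self.dim, self.k, padding=self.k // 2,
                                 groups=self.dim, bias=True)
        self.reparam.weight.copy_(w)
        self.reparam.bias.copy_(b)

# The two forms are numerically equivalent:
m = RepMixerSketch(8).eval()
x = torch.randn(1, 8, 32, 32)
y = m(x)
m.reparameterize()
assert torch.allclose(y, m(x), atol=1e-5)
```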

Efficiency Techniques

Train-time Overparameterization: FastViT adds extra parallel branches during training, letting the model learn richer representations; these branches are folded back into a single operator at inference, so the added capacity costs nothing at deployment (see the first sketch after this list).
Factorized Convolutions: Replacing dense convolutions with a factorized version (a depthwise convolution followed by a pointwise one) cuts parameters and latency without hurting accuracy (see the second sketch after this list).
Large Kernel Convolutions: Large kernel depthwise convolutions substitute for self-attention layers in the early stages, improving latency without a significant impact on overall accuracy.
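For the first point, here is a minimal sketch of train-time overparameterization in the spirit of the MobileOne-style blocks FastViT builds on: several parallel depthwise branches add capacity during training, and because convolution is linear they collapse into one kernel at inference. The branch count and the omission of per-branch BatchNorm are simplifications for illustration.

```python
import torch
import torch.nn as nn

class OverparamDWConv(nn.Module):
    """Parallel depthwise 3x3 branches at train time, one merged conv at inference."""

    def __init__(self, dim: int, num_branches: int = 3):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim, bias=True)
            for _ in range(num_branches)
        ])
        self.merged = None

    def forward(self, x):
        if self.merged is not None:
            return self.merged(x)                  # single conv at deployment
        return sum(b(x) for b in self.branches)   # extra capacity at train time

    @torch.no_grad()
    def reparameterize(self):
        # Convolution is linear, so summed branches collapse to summed kernels.
        w = sum(b.weight for b in self.branches)
        bias = sum(b.bias for b in self.branches)
        dim = w.shape[0]
        self.merged = nn.Conv2d(dim, dim, 3, padding=1, groups=dim, bias=True)
        self.merged.weight.copy_(w)
        self.merged.bias.copy_(bias)

block = OverparamDWConv(16).eval()
x = torch.randn(1, 16, 8, 8)
y = block(x)
block.reparameterize()
assert torch.allclose(y, block(x), atol=1e-5)
```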
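For the second and third points, a quick parameter count shows why factorizing a dense convolution into depthwise plus pointwise is so effective, here with the kind of large 7x7 kernel FastViT uses in place of early self-attention (the dimensions are illustrative, not taken from the paper):

```python
import torch.nn as nn

dim, k = 64, 7  # channels and a large 7x7 kernel

# Dense kxk convolution: dim * dim * k * k weights.
dense = nn.Conv2d(dim, dim, k, padding=k // 2)

# Factorized: depthwise kxk (spatial mixing) + pointwise 1x1 (channel mixing).
factorized = nn.Sequential(
    nn.Conv2d(dim, dim, k, padding=k // 2, groups=dim),  # dim * k * k weights
    nn.Conv2d(dim, dim, 1),                              # dim * dim weights
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(dense), count(factorized))  # 200768 vs 7360 -- roughly 27x fewer
```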



Performance and Evaluation

FastViT's performance is impressive, especially when measured against other recent models. At comparable ImageNet accuracy, the architecture is 3.5x faster than CMT and 4.9x faster than EfficientNet. The mid-size models reach 82.6 percent top-1 accuracy on ImageNet classification while maintaining real-time performance on an iPhone 12, and the family comes in several sizes so users can pick the latency-accuracy point that fits their needs.

Results on ImageNet-1k classification

Robustness

One notable aspect of FastViT is its robustness to corruption and out-of-distribution samples. This quality enables more reliable real-world applications, strengthening its value as a go-to architecture.

Application and Real-World Deployment

The application of FastViT extends across various computer vision tasks, including image classification, detection, segmentation, and 3D mesh regression. Its architecture allows for considerable latency improvements on both mobile devices (e.g., iPhone 12 Pro) and desktop GPUs.

Conclusion

Apple's FastViT represents a significant step forward in the quest for the optimal latency-accuracy trade-off in vision transformers. Its novel design, efficiency techniques, and robust performance have set a new benchmark for hybrid transformer architectures. By releasing models in several different sizes with varying performance on the ImageNet dataset, FastViT provides flexibility to cater to diverse computational needs for deployment on the edge!
Code and models related to FastViT can be found at Apple's GitHub repository: https://github.com/apple/ml-fastvit