Meta Releases Quantized Llama 3.2 Models for Mobile Devices

Llama just got faster!
Meta has introduced its first quantized versions of the Llama models, specifically targeting enhanced speed and reduced memory usage on mobile devices. This development reflects Meta's commitment to making advanced AI more accessible by optimizing models for edge computing, enabling developers to deploy high-quality AI solutions on smartphones and other resource-constrained hardware.

Overview of Quantized Llama Models

The quantized versions of Llama 3.2, available in both 1B and 3B configurations, achieve significant efficiency improvements. These models maintain the same level of quality as the original Llama models while offering 2-4x faster inference and a 56% reduction in model size. Memory usage has also been cut by an average of 41%, enhancing their viability for mobile devices with limited runtime memory. The quantized models prioritize short-context applications, supporting contexts of up to 8,000 tokens, without sacrificing safety or accuracy.
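To put the size reduction in perspective, here is a back-of-envelope sizing sketch in Python. The parameter count and per-weight byte costs are illustrative assumptions, not figures from Meta's release; the reported 56% reduction reflects Meta's specific mixed-precision scheme rather than uniform 4-bit weights.

```python
# Rough weight-memory estimate for a ~1.2B-parameter model.
# Illustrative only: the real scheme keeps some layers (e.g. embeddings)
# at higher precision, so the actual reduction lands near Meta's 56%.
params = 1.23e9                      # approximate Llama 3.2 1B parameter count
bf16_gb = params * 2 / 1e9           # 16 bits = 2 bytes per weight
int4_gb = params * 0.5 / 1e9         # 4 bits = 0.5 bytes per weight
print(f"BF16 weights: {bf16_gb:.2f} GB")
print(f"INT4 weights: {int4_gb:.2f} GB ({1 - int4_gb / bf16_gb:.0%} smaller)")
```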

Quantization Techniques and Performance

Meta employs two methods for quantizing Llama: Quantization-Aware Training (QAT) with LoRA adapters, and SpinQuant. QAT combined with LoRA delivers the highest model quality in low-precision settings, while SpinQuant offers flexibility, allowing developers to quantize models post-training without access to the original training data. Both methods are integrated into the Llama Stack and supported by PyTorch's ExecuTorch framework, ensuring portability across Qualcomm and MediaTek SoCs powered by Arm CPUs. The models have been optimized further using Arm's KleidiAI kernels, with ongoing efforts to enhance performance through NPU utilization.
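To make the QAT idea concrete, here is a minimal PyTorch sketch of the core mechanism: weights are "fake-quantized" during the forward pass so the network learns parameters that survive low-precision inference, with a straight-through estimator passing gradients through the rounding step. This is a simplified illustration of the general technique, not Meta's actual recipe, which also involves LoRA adapters and finer-grained scales.

```python
import torch
import torch.nn as nn

class FakeQuant(torch.autograd.Function):
    """Symmetric per-tensor fake quantization with a straight-through
    estimator, the core trick behind quantization-aware training."""
    @staticmethod
    def forward(ctx, x, num_bits=4):
        qmax = 2 ** (num_bits - 1) - 1
        scale = x.abs().max().clamp(min=1e-8) / qmax
        # Round onto the integer grid, then map back to float ("fake" quantize).
        return (x / scale).round().clamp(-qmax - 1, qmax) * scale

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: pretend rounding was the identity.
        return grad_output, None

class QATLinear(nn.Linear):
    """Linear layer whose weights are fake-quantized in the forward pass,
    so training learns weights that tolerate low-precision inference."""
    def forward(self, x):
        w_q = FakeQuant.apply(self.weight, 4)  # simulate an int4 weight grid
        return nn.functional.linear(x, w_q, self.bias)

layer = QATLinear(64, 64)
out = layer(torch.randn(2, 64))
out.sum().backward()  # gradients flow through the STE
```

SpinQuant, by contrast, is applied after training: it learns rotations that smooth out weight outliers so that straightforward post-training quantization loses less accuracy.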

Deployment and Results

Quantized Llama models were benchmarked on the OnePlus 12, an Android device, showing 2.5x faster decoding and 4.2x faster prefill. Meta also verified similar performance on other smartphones, including the Samsung S24+ and iOS devices, although formal performance metrics for the latter are still under review. Because inference runs entirely on-device, these models also improve privacy, aligning with the growing demand for edge computing.
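On-device deployment goes through ExecuTorch, whose export flow is sketched below. This assumes the executorch Python package is installed; the toy module stands in for a real quantized Llama checkpoint, and the resulting .pte file is what the mobile runtime loads.

```python
# Minimal sketch of the ExecuTorch export flow (assumes `executorch`
# is installed). A tiny module stands in for a quantized Llama model.
import torch
from executorch.exir import to_edge

class TinyModel(torch.nn.Module):
    def forward(self, x):
        return torch.nn.functional.relu(x)

# Capture the model as a portable graph, lower it to ExecuTorch's edge
# dialect, and serialize it for the on-device runtime.
exported = torch.export.export(TinyModel(), (torch.randn(1, 8),))
program = to_edge(exported).to_executorch()
with open("tiny_model.pte", "wb") as f:
    f.write(program.buffer)
```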

Broader Implications and Future Directions

Meta's quantization strategy aligns with the increasing need for efficient AI models suitable for real-world mobile applications. As community developers continue to adopt and customize Llama models, Meta envisions these quantized versions as a critical step toward democratizing AI by reducing barriers related to computational resources. The release also strengthens Meta’s collaborations with industry partners, including Qualcomm, Arm, and MediaTek, and further integrates Llama into platforms like Hugging Face. Looking ahead, Meta aims to extend the capabilities of these models through NPU optimizations and deepen community engagement in advancing AI-driven mobile experiences.
With quantized Llama models now available on llama.com and Hugging Face, Meta invites developers to explore new possibilities in AI while delivering seamless, privacy-conscious experiences on mobile devices.
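As a starting point, a quantized checkpoint can be pulled from Hugging Face with huggingface_hub. The repo id below follows Meta's published naming for the SpinQuant variant but should be verified against the Llama 3.2 collection; the models are gated, so an accepted license and access token are required.

```python
# Minimal sketch: download a quantized Llama 3.2 checkpoint.
# The repo id is an assumption based on Meta's naming convention;
# verify it in the Llama 3.2 collection on Hugging Face.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="meta-llama/Llama-3.2-1B-Instruct-SpinQuant_INT4_EO8",
    token="hf_...",  # placeholder: gated repo requires an approved request
)
print(f"Checkpoint downloaded to {local_dir}")
```

From there, the checkpoint can be run through the ExecuTorch flow sketched earlier.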