Meta Researchers Present MobileLLM
Lightweight LLMs for the Mobile Platform!
The growing need for efficient large language models (LLMs) on mobile devices is driven by rising cloud costs and latency concerns. Models such as those powering ChatGPT are typically deployed in cloud environments, which is expensive and introduces latency. Running these models on-device can significantly reduce operational costs and response times, making them more practical for everyday use. The recent paper "MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases" introduces approaches that make LLMs efficient enough for mobile deployment.
Challenging Traditional Scaling Laws
Traditional scaling laws, as proposed by Kaplan et al. (2020), suggest that model performance scales predictably with the number of parameters, dataset size, and computational budget, implying that simply making a model larger will improve its performance. MobileLLM challenges this notion for models with fewer than a billion parameters. The authors demonstrate that for small LLMs, increasing model depth (the number of layers) is more beneficial than increasing width (the number of units per layer). This finding contradicts the established belief that, at a fixed parameter budget, how those parameters are arranged across the model matters little.
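To make the depth-versus-width trade-off concrete, the back-of-the-envelope calculation below compares two hypothetical transformer configurations with a similar parameter budget. The layer counts and hidden sizes are illustrative only, not the paper's exact settings.

```python
# Rough transformer parameter count (attention + feed-forward only,
# ignoring embeddings, norms, and biases) for two configurations with a
# comparable budget. Both shapes are made up for illustration.

def transformer_params(n_layers: int, d_model: int, ffn_mult: float = 4.0) -> int:
    attn = 4 * d_model * d_model                 # Q, K, V, and output projections
    ffn = int(2 * ffn_mult * d_model * d_model)  # up- and down-projections
    return n_layers * (attn + ffn)

wide_shallow = transformer_params(n_layers=12, d_model=768)  # ~85M parameters
deep_thin = transformer_params(n_layers=30, d_model=512)     # ~94M parameters

print(f"wide & shallow (12 x 768): {wide_shallow / 1e6:.1f}M")
print(f"deep & thin   (30 x 512): {deep_thin / 1e6:.1f}M")
```

At a roughly equal budget, MobileLLM's finding is that the deeper, thinner shape is the better use of those parameters.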
Model Specs
MobileLLM focuses on sub-billion parameter models, specifically MobileLLM-125M (approximately 125 million parameters) and MobileLLM-350M (approximately 350 million parameters). These models are designed to be deep and thin: they have many layers but fewer units per layer. This architecture helps capture more abstract and hierarchical features, improving performance across a range of tasks.

A key innovation in MobileLLM is immediate block-wise weight sharing. This method shares weights between adjacent layers in the model, significantly reducing the total number of unique parameters, which improves memory efficiency without compromising performance.

Weight-Sharing Techniques
The immediate block-wise weight-sharing approach has adjacent layers share the same set of weights: rather than each layer holding its own parameters, every two adjacent layers reuse one block's weights. This reduces the total number of unique weights and makes the model more memory efficient. The shared block is computed twice, which incurs a slight increase in computational latency, but because the weights can remain in fast local memory for the second pass, the overall efficiency gain outweighs this overhead.
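As a rough illustration of the idea, here is a minimal PyTorch sketch of immediate block-wise weight sharing. It uses a generic transformer encoder block as a stand-in for MobileLLM's actual decoder layers, and the block count and dimensions are placeholders rather than the released architecture.

```python
import torch
import torch.nn as nn

class BlockwiseSharedStack(nn.Module):
    """Minimal sketch of immediate block-wise weight sharing: each block is
    stored once but executed twice in a row, so n_unique stored blocks give
    the compute depth of 2 * n_unique layers."""

    def __init__(self, n_unique: int, d_model: int, n_heads: int):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_unique)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            x = block(x)  # first pass through the block
            x = block(x)  # immediate reuse: same weights, computed again
        return x

# 15 stored blocks behave like a 30-layer stack in compute,
# but only 15 layers' worth of weights live in memory.
model = BlockwiseSharedStack(n_unique=15, d_model=512, n_heads=8)
x = torch.randn(1, 64, 512)
print(model(x).shape)  # torch.Size([1, 64, 512])
```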
Experimental Results and Performance
The MobileLLM models were extensively tested against existing models and showed significant improvements. On zero-shot reasoning tasks, MobileLLM outperformed existing sub-billion parameter models: for example, MobileLLM-125M achieved a notably higher average accuracy than OPT-125M and GPT-Neo-125M. Specifically, MobileLLM-125M and MobileLLM-350M achieved 2.7% and 4.3% higher accuracy than the previous state-of-the-art 125M and 350M models, respectively. MobileLLM-350M also outperformed larger models such as BLOOM-560M and showed comparable performance to even larger models on certain tasks.
On-Device Testing
To validate practical deployment, the authors tested the models on an iPhone 13 using the Metal Performance Shaders (MPS) backend. The latency analysis covered loading time, initialization time, and execution time for MobileLLM-125M, MobileLLM-LS-125M (with weight sharing), and, for comparison, a 60-layer model without weight sharing that matches MobileLLM-LS in effective depth. The results showed that MobileLLM-LS incurs only a slight overhead relative to the base MobileLLM-125M while remaining far cheaper than the 60-layer non-shared model: loading time was 68.6 ms for the 60-layer model versus 43.6 ms for MobileLLM-LS, initialization time was 3347.7 ms versus 1388.2 ms, and execution time was 29.0 ms versus 16.0 ms. This demonstrates that the weight-sharing approach preserves the benefits of added depth while keeping model size and latency close to the smaller baseline.
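The authors' on-phone benchmarking pipeline is not published, but the sketch below shows what the three reported timings correspond to, using PyTorch's MPS backend on Apple hardware as a rough desktop analogue. The checkpoint file name is hypothetical.

```python
import time
import torch

# Rough analogue of the paper's latency breakdown on PyTorch's MPS backend.
# "mobilellm_125m.pt" is a hypothetical exported checkpoint; the authors'
# actual iPhone 13 measurement pipeline is not public.
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

t0 = time.perf_counter()
model = torch.jit.load("mobilellm_125m.pt")  # loading: read weights from disk
t1 = time.perf_counter()
model = model.to(device).eval()              # initialization: move weights to the accelerator
t2 = time.perf_counter()

tokens = torch.randint(0, 32_000, (1, 64), device=device)
with torch.no_grad():
    _ = model(tokens)                        # execution: one forward pass
if device.type == "mps":
    torch.mps.synchronize()                  # wait for queued GPU work to finish
t3 = time.perf_counter()

print(f"load {1e3 * (t1 - t0):.1f} ms | "
      f"init {1e3 * (t2 - t1):.1f} ms | "
      f"exec {1e3 * (t3 - t2):.1f} ms")
```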

Conclusion
MobileLLM represents a meaningful advance in optimizing language models for on-device use. By challenging traditional scaling laws and introducing a block-wise weight-sharing technique, the authors have created models that are both efficient and high-performing. The practical testing on an iPhone 13 shows that deploying these models on mobile devices is feasible, making them a viable way to cut latency and operational costs in real-world applications. The 125M and 350M parameter variants in particular deliver strong results across a variety of tasks while remaining small enough for on-device deployment. This work opens new avenues for running capable language models on mobile hardware without relying on cloud-based resources.
Tags: ML News