Apple Unveils MM1.5: A New iOS Virtual Assistant?
A new multimodal LLM with special UI skills!
The MM1.5 family introduces a series of multimodal large language models (MLLMs) designed to enhance capabilities in visual comprehension, text-rich image understanding, and multi-image reasoning. Building on the success of its predecessor, MM1, this new generation focuses on optimizing data mixtures, dynamic image processing, and continual fine-tuning to achieve high performance across a variety of multimodal tasks. With sizes ranging from 1 billion to 30 billion parameters, MM1.5 offers both dense and mixture-of-experts (MoE) variants, making it versatile for applications from mobile devices to cloud deployments.
Architecture Overview
MM1.5’s architecture pairs a vision encoder based on the CLIP framework with a dynamic image-splitting scheme: high-resolution images are divided into sub-images, and the encoder converts each sub-image into a sequence of visual tokens, allowing efficient encoding of both large and small-scale inputs. The encoded visual tokens are then integrated with the language model via a vision-language connector, which aligns image representations with text so the model can produce grounded text outputs. This setup equips MM1.5 to handle complex visual tasks such as object localization, text recognition, and spatial reasoning.
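To make the splitting step concrete, here is a minimal Python sketch of how a high-resolution image might be tiled before encoding. The tile size, the fixed 2×2 grid, and the inclusion of a downscaled overview image are illustrative assumptions, not details confirmed by the paper.

```python
from PIL import Image

TILE = 336  # assumed input resolution of the CLIP-style vision encoder

def split_image(img: Image.Image, grid: tuple[int, int] = (2, 2)) -> list[Image.Image]:
    """Tile a high-resolution image into encoder-sized sub-images.

    Returns a downscaled overview of the full image followed by the crops;
    each returned tile would be encoded into its own sequence of visual
    tokens and concatenated before reaching the language model.
    """
    w, h = img.size
    cols, rows = grid
    tiles = [img.resize((TILE, TILE))]  # low-resolution overview of the whole image
    for r in range(rows):
        for c in range(cols):
            box = (c * w // cols, r * h // rows,
                   (c + 1) * w // cols, (r + 1) * h // rows)
            tiles.append(img.crop(box).resize((TILE, TILE)))
    return tiles
```

A production implementation would pick the grid dynamically from the image’s aspect ratio and resolution rather than using a fixed layout, which is the "dynamic" part of the technique.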
A key feature that enhances MM1.5’s visual understanding is its use of coordinate tokens. These tokens represent specific regions within an image using (x1, y1, x2, y2) tuples, where each tuple defines the top-left and bottom-right corners of a bounding box around an object or area of interest. By incorporating these tokens, MM1.5 can localize its text outputs to particular image regions, making it capable of tasks such as referring to a precise area (“the object in the top-left corner”) or interacting with UI elements in a graphical interface. The coordinate tokens empower the model to perform fine-grained visual referring and grounding, enabling it to generate responses that are not just contextually accurate, but spatially precise as well.
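A simple way to picture coordinate tokens is as normalized box coordinates written inline with the text. The snippet below is a hedged sketch: the bin count, normalization, and string format are assumptions for illustration rather than MM1.5’s actual token vocabulary.

```python
def box_to_tokens(box, img_w, img_h, bins=1000):
    """Convert a pixel-space (x1, y1, x2, y2) bounding box into a discrete
    coordinate string that can sit inline with ordinary text tokens.

    The bin count and output format here are illustrative assumptions.
    """
    x1, y1, x2, y2 = box
    norm = [
        round(x1 / img_w * (bins - 1)),
        round(y1 / img_h * (bins - 1)),
        round(x2 / img_w * (bins - 1)),
        round(y2 / img_h * (bins - 1)),
    ]
    return "(" + ",".join(str(v) for v in norm) + ")"

# Example: ground a question to a region of a 1920x1080 screenshot.
prompt = f"What is the object at {box_to_tokens((50, 40, 300, 200), 1920, 1080)}?"
print(prompt)  # -> What is the object at (26,37,156,185)?
```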

MM1.5-Video and UI Specialization
MM1.5 introduces two specialized variants: MM1.5-Video and MM1.5-UI. MM1.5-Video leverages the base model’s inherent multi-image reasoning to extend its functionality to video understanding: it processes multiple frames at once, enabling temporal reasoning and question answering about video content. MM1.5-UI, meanwhile, is optimized for user interface comprehension, handling dense graphical elements and on-screen text to support tasks like identifying clickable buttons or summarizing screen content. This makes it particularly valuable for UI testing and digital assistant interfaces.
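Because MM1.5-Video reuses the base model’s multi-image interface, feeding it a video essentially means sampling frames and presenting them as an image sequence. The sketch below shows one plausible preprocessing step using OpenCV; the uniform sampling strategy and the eight-frame budget are assumptions, not the paper’s recipe.

```python
import cv2  # OpenCV, used here only for frame extraction

def sample_frames(video_path: str, num_frames: int = 8):
    """Uniformly sample frames so a video can be fed to the model as a
    multi-image prompt. The frame budget is an illustrative choice."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(num_frames):
        idx = i * max(total - 1, 1) // max(num_frames - 1, 1)
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames  # each frame is then encoded like any other image
```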
Pioneering User Interface Understanding
One standout application of the MM1.5 family is MM1.5-UI, a specialized variant that excels in user interface (UI) comprehension tasks. Fine-tuned with UI-specific data, MM1.5-UI demonstrates strong capabilities in understanding and interacting with the graphical user interfaces of devices such as smartphones and computers. By leveraging its advanced image processing and text comprehension features, MM1.5-UI is capable of identifying on-screen elements, interpreting their functionality, and responding to complex user queries about the interface layout.
MM1.5-UI’s performance sets a new standard in several key benchmarks. For example, it can recognize and refer to specific text elements within densely populated screens, such as locating a "Settings" button or identifying clickable options. It has also shown impressive results in distinguishing between different types of UI components like sliders, buttons, and checkboxes. In recent evaluations, MM1.5-UI achieved state-of-the-art performance across multiple tasks, outperforming other models by a significant margin in areas like widget captioning, screen summarization, and tap prediction on UI elements. Its ability to maintain context-aware, multi-turn conversations about the screen's layout and elements further demonstrates its proficiency in providing valuable assistance to users navigating complex digital environments.
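In a UI-testing setting, a tap-prediction query might return a grounded box that a harness then converts into a tap point. The sketch below assumes the coordinate format from the architecture section and a hypothetical query_mm15_ui helper; neither reflects an official MM1.5 API.

```python
import re

def parse_box(response: str):
    """Extract the first (x1,y1,x2,y2) tuple from a model response and
    return the center of the box as a tap coordinate.

    Assumes the coordinate-token format sketched earlier; real model
    output formatting may differ.
    """
    match = re.search(r"\((\d+),(\d+),(\d+),(\d+)\)", response)
    if match is None:
        return None
    x1, y1, x2, y2 = map(int, match.groups())
    return (x1 + x2) // 2, (y1 + y2) // 2  # tap at the box center

# Hypothetical usage in a UI-testing harness:
# response = query_mm15_ui(screenshot, "Where is the Settings button?")
response = "The Settings button is at (120,905,360,965)."
print(parse_box(response))  # -> (240, 935)
```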
Benchmarks
The MM1.5 family has been evaluated against a broad range of benchmarks, demonstrating its versatility and strong performance across multimodal, video, and UI understanding tasks. Key highlights of MM1.5’s capabilities include state-of-the-art results in text-rich image understanding, multi-image reasoning, and user interface interaction.
For text-rich tasks, MM1.5 surpassed its predecessor MM1 and other contemporary models across several key benchmarks. It achieved top scores on the DocVQA benchmark for document understanding, improving from 75.8 to 91.4, and significantly improved on InfoVQA for infographic comprehension, where its score rose from 47.3 to 67.3. On knowledge-intensive benchmarks such as MMMU for complex multimodal reasoning, MM1.5-30B showed a marked improvement over previous models, highlighting its strength in handling intricate textual and visual information.

In the domain of multi-image and video understanding, MM1.5-Video excelled in MVBench, a video reasoning benchmark, consistently scoring higher than other models of similar scale, such as MiniCPM-V and Phi-3-Vision. It also demonstrated superior in-context learning abilities on the VL-ICL benchmark, with its performance surpassing other models, particularly in zero-shot and few-shot settings.
The MM1.5-UI variant sets a new standard for UI understanding. On public UI benchmarks like screen2words (screen-level captioning) and widget captioning, MM1.5-UI achieved a notable performance boost over previous leading models. In the Ferret-UI elementary UI tasks, MM1.5-UI outperformed the Ferret-UI 13B model by a significant margin, achieving over 9 points higher on average in Android and iOS tasks. It particularly excelled in text-based UI tasks, leveraging its advanced OCR and spatial reasoning capabilities to understand and interact with complex UI layouts.

Overall, MM1.5’s benchmark performance underscores its robustness and applicability across a diverse set of multimodal tasks, establishing it as a leading model for both general-purpose and specialized applications in the multimodal AI space.
Tags: ML News