Apple's 'Secret' Multimodal LLM: Ferret
Apple quietly released a pretty impressive multimodal LLM!
In October, Apple introduced Ferret, a Multimodal Large Language Model (MLLM) designed to understand spatial referring in any form within an image and to ground open-vocabulary descriptions. Although the release was quiet, the model has recently been gaining a lot of attention, and deservedly so. The key feature of Ferret is its hybrid region representation, which integrates discrete coordinates with continuous visual features, enabling it to represent a wide range of region types: points, bounding boxes, and free-form shapes such as scribbles, polygons, and masks.
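To make the idea of a hybrid region representation more concrete, here is a minimal sketch of what such a structure could look like. The class and function names are illustrative placeholders, not Ferret's actual code: the point is simply that each referred region carries both discretized coordinates (written into the prompt) and a continuous feature pooled from the image feature map.

```python
from dataclasses import dataclass
from typing import Optional

import torch


@dataclass
class HybridRegion:
    """Illustrative hybrid region representation (not Ferret's real API).

    `coords` holds the discretized coordinates that go into the text prompt,
    while `feature` is a continuous embedding pooled from the image features
    inside the region. Free-form shapes (scribbles, polygons, masks) are
    handled via an optional binary mask.
    """
    coords: torch.Tensor                  # e.g. [x1, y1, x2, y2] for a box, [x, y] for a point
    feature: torch.Tensor                 # continuous region feature, shape (hidden_dim,)
    mask: Optional[torch.Tensor] = None   # binary (H, W) mask for free-form regions


def pool_region_feature(image_feats: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Naive masked average pooling over an (H, W, D) feature map, standing in
    for the much richer spatial-aware visual sampler described below."""
    weights = mask.float().unsqueeze(-1)                       # (H, W, 1)
    return (image_feats * weights).sum(dim=(0, 1)) / weights.sum().clamp(min=1.0)
```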

The Architecture
Ferret's architecture comprises an image encoder to extract image embeddings, a spatial-aware visual sampler to produce continuous regional features, and an LLM to jointly model image, text, and region features.

Image Encoder: This component is responsible for converting the raw image into a set of meaningful data points or embeddings. It works like an advanced scanner, capturing essential visual elements such as shapes, colors, and textures. This process is crucial for enabling the AI to 'see' and interpret the visual aspects of an image.
Spatial-Aware Visual Sampler: This part of Ferret deals with the spatial information within images. Unlike standard models that may only recognize simple shapes or areas, Ferret's visual sampler is designed to handle a wide range of shapes and formats, from points and rectangles to more complex, free-form areas. This capability is particularly important for interpreting images with irregular or non-standard regions.
Large Language Model: The LLM in Ferret bridges the gap between visual and textual information. It integrates the image embeddings from the image encoder and the data from the spatial-aware visual sampler with textual input. This integration allows Ferret to understand and respond to queries that involve both textual descriptions and specific regions within an image.
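To show how these three components might fit together, here is a hedged, schematic sketch of a single forward pass. All module and argument names are assumptions made for illustration; Ferret's released implementation will differ in its details (e.g., how visual tokens are interleaved with text tokens).

```python
import torch
import torch.nn as nn


class FerretStyleModel(nn.Module):
    """Schematic composition of the three components described above.
    Module names are illustrative placeholders, not Ferret's actual code."""

    def __init__(self, image_encoder: nn.Module, visual_sampler: nn.Module,
                 projector: nn.Module, llm: nn.Module):
        super().__init__()
        self.image_encoder = image_encoder    # e.g. a CLIP-style ViT
        self.visual_sampler = visual_sampler  # spatial-aware region feature extractor
        self.projector = projector            # maps vision features into the LLM embedding space
        self.llm = llm                        # decoder-only language model (e.g. Vicuna)

    def forward(self, image, region_masks, text_token_embeds):
        # 1. Encode the image into a grid of patch embeddings.
        image_feats = self.image_encoder(image)                        # (num_patches, d_vision)

        # 2. Extract one continuous feature per referred region.
        region_feats = self.visual_sampler(image_feats, region_masks)  # (num_regions, d_vision)

        # 3. Project visual features into the LLM's token space and combine
        #    them with the text tokens (simplified here to a plain concatenation).
        visual_tokens = self.projector(torch.cat([image_feats, region_feats], dim=0))
        inputs = torch.cat([visual_tokens, text_token_embeds], dim=0).unsqueeze(0)

        # 4. Let the LLM jointly model image, region, and text tokens
        #    (assumes a Hugging Face-style `inputs_embeds` interface).
        return self.llm(inputs_embeds=inputs)
```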
Training Details
Ferret was initialized with CLIP-ViT-L/14 as the image encoder and Vicuna as the LLM, with the projection layer weights taken from LLaVA. The visual sampler was randomly initialized. Training was conducted on the GRIT dataset for three epochs and took approximately 2.5 to 5 days on 8 A100 GPUs, depending on whether the 7B or 13B Vicuna variant was used (Ferret-7B and Ferret-13B, respectively).
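As a rough illustration of that initialization recipe, the sketch below pulls comparable public checkpoints from Hugging Face. The specific checkpoint names, and loading the projection layer here rather than copying LLaVA's pretrained weights, are assumptions for demonstration only.

```python
import torch.nn as nn
from transformers import AutoModelForCausalLM, CLIPVisionModel

# Assumed public checkpoints standing in for the paper's CLIP-ViT-L/14 and Vicuna weights.
vision_encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
llm = AutoModelForCausalLM.from_pretrained("lmsys/vicuna-7b-v1.5")

# Projection layer mapping vision features into the LLM's hidden size.
# In Ferret this layer starts from LLaVA's pretrained projection weights,
# which would be loaded into it here instead of using the random init below.
projector = nn.Linear(vision_encoder.config.hidden_size, llm.config.hidden_size)

# The spatial-aware visual sampler is the only component trained from a random
# initialization; training then runs for three epochs on GRIT across 8 A100 GPUs.
```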
Results and Comparison with LLaVA
Ferret exhibited superior performance across all evaluated tasks. It excelled particularly in tasks requiring referring and grounding capabilities, demonstrating strong spatial understanding and commonsense reasoning. This performance indicates a significant advancement over existing MLLMs, including LLaVA, in tasks involving multimodal interaction and spatial reasoning.

In summary, Apple's Ferret MLLM stands out with its innovative approach to multimodal interaction, blending spatial awareness with language understanding in a way that significantly enhances performance in tasks requiring detailed spatial referencing and grounding.