Apple Releases 4M-21: An Advanced Any-to-Any Vision Model
A new "any-to-any" model from Apple and the Swiss Federal Institute of Technology Lausanne (EPFL)!
The paper "4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities" tackles the challenges of training multimodal, multitask models, proposing a single model that is trained on a diverse set of modalities and tasks without compromising performance.
Problem Addressed
Multitask learning often suffers from negative transfer, where training on multiple tasks degrades performance on individual ones, and typically requires careful balancing of losses or gradients. In addition, existing models are usually trained on only a handful of modalities, which limits their out-of-the-box capabilities. 4M-21 aims to overcome these limitations by substantially increasing the number of tasks and modalities a single model can handle, improving its applicability and efficiency in real-world scenarios.
Methodology
The core innovation of the 4M-21 model is its use of discrete tokenization to unify the representation of very different modalities, making it possible to train a single model on diverse inputs. Tokenization is tailored to each modality type:
- ViT Tokenizer: image-like modalities such as RGB images, edges, and feature maps are mapped into small grids of discrete tokens by Vision Transformer (ViT)-based VQ-VAE models.
- MLP Tokenizer: non-spatial modalities such as 3D human poses and global embeddings from models like DINOv2 and ImageBind are encoded with Bottleneck MLP-based discrete VAEs using Memcodes quantization.
- Text Tokenizer: text and text-representable modalities (bounding boxes, color palettes, metadata) are encoded into sequences of discrete tokens with a WordPiece tokenizer.
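To make the first of these concrete, here is a minimal, illustrative sketch of VQ-VAE-style image tokenization: encode an image into a grid of embeddings, then snap each embedding to its nearest codebook entry to get discrete token ids. The stand-in convolutional encoder, codebook size, and patch geometry are assumptions for illustration, not the paper's actual architecture or hyperparameters.

```python
# Toy sketch of discrete tokenization for an image-like modality.
# A strided convolution stands in for the ViT-based VQ-VAE encoder.
import torch
import torch.nn as nn


class ToyImageTokenizer(nn.Module):
    def __init__(self, codebook_size=1024, dim=64, patch=16):
        super().__init__()
        # Stand-in encoder: maps a 224x224 image to a 14x14 grid of embeddings.
        self.encoder = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        # Learned codebook of discrete token embeddings.
        self.codebook = nn.Embedding(codebook_size, dim)

    @torch.no_grad()
    def forward(self, images):                      # images: (B, 3, 224, 224)
        z = self.encoder(images)                    # (B, dim, 14, 14)
        z = z.flatten(2).transpose(1, 2)            # (B, 196, dim)
        # Nearest-neighbour lookup into the codebook -> discrete token ids.
        codes = self.codebook.weight[None].expand(z.size(0), -1, -1)
        dists = torch.cdist(z, codes)               # (B, 196, codebook_size)
        return dists.argmin(dim=-1)                 # (B, 196) integer tokens


tokenizer = ToyImageTokenizer()
tokens = tokenizer(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 196]) -- a small grid of discrete tokens
```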

This tokenization process improves training stability, enables full parameter sharing, and reduces computational complexity by compressing dense modalities into sparse token sequences. The 4M-21 model supports an extensive range of modalities:
- RGB: tokenized and pixel versions of RGB images, along with extracted color palettes for conditional generation.
- Geometric: surface normals, depth, and 3D human poses & shapes, providing essential information about scene geometry.
- Semantic: semantic segmentation and bounding boxes, capturing scene semantics, with pseudo labels extracted from models like Mask2Former and SAM.
- Edges: Canny edges and SAM edges, which carry important information about scene layout and semantics.
- Feature maps: embeddings from models like CLIP, DINOv2, and ImageBind, which have strong transfer-learning and retrieval capabilities.
- Metadata: semantic, geometric, and image-processing metadata extracted from RGB images and other modalities.
- Text: captions from datasets like CC12M and COYO700M, along with web text from the C4 corpus, encoded using WordPiece tokenizers and T5-XXL embeddings.
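Once every modality has its own tokenizer, the streams can be flattened into one discrete sequence for a single model to consume. The sketch below shows the general idea; the modality names, example token ids, and the "[MOD]" marker scheme are illustrative assumptions, not the paper's exact sequence format.

```python
# Hedged sketch: interleave per-modality token streams into one flat sequence.
from typing import Dict, List

# Per-modality token id streams, as they might come out of the tokenizers above.
example_tokens: Dict[str, List[int]] = {
    "rgb":     [101, 57, 900, 12],   # ViT VQ-VAE tokens (image grid)
    "depth":   [44, 3, 871, 230],    # another image-like modality
    "caption": [7, 8, 9],            # WordPiece text tokens
    "palette": [15, 2],              # text-tokenized color palette
}


def build_sequence(tokens_by_modality: Dict[str, List[int]]) -> List[str]:
    """Interleave modalities into one flat sequence with modality markers."""
    sequence: List[str] = []
    for name, ids in tokens_by_modality.items():
        sequence.append(f"[{name.upper()}]")      # modality start marker
        sequence.extend(str(i) for i in ids)      # the discrete tokens
    return sequence


print(build_sequence(example_tokens))
```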

Datasets and Results
The model was trained and evaluated on a variety of datasets, demonstrating its versatility and performance across different tasks. Key datasets include:
- CC12M: large-scale multimodal pre-training data with captions.
- COYO700M: another large-scale training dataset offering diverse samples with pseudo labels.
- C4: a text corpus used for language modeling.
- DIODE: surface normal and depth estimation.
- COCO: semantic and instance segmentation.
- 3DPW: 3D human keypoint estimation.
- ImageNet-1K: k-NN retrieval.
The evaluation covers surface normal and depth estimation, semantic and instance segmentation, k-NN retrieval, and 3D human keypoint estimation, and 4M-21 performs competitively with specialized models across these tasks. It achieves performance comparable to state-of-the-art models like Omnidata on surface normal estimation and matches models like MiDaS DPT on depth estimation. In semantic segmentation it shows strong results, often outperforming UnifiedIO, and it performs well on instance segmentation using SAM instances. On k-NN retrieval it approaches the upper bound set by the tokenizer's reconstruction quality, and on 3D human keypoint estimation it compares well with 4D-Humans.
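As a concrete example of one of these evaluations, below is an illustrative sketch of the mean angular error metric commonly used for surface normal estimation. The array shapes and random inputs are placeholders; this is not the paper's evaluation code.

```python
# Mean angular error between predicted and ground-truth normal maps, in degrees.
import numpy as np


def mean_angular_error(pred: np.ndarray, gt: np.ndarray) -> float:
    """pred, gt: (H, W, 3) surface normal maps."""
    # Normalize to unit vectors so the dot product is a pure cosine.
    pred = pred / (np.linalg.norm(pred, axis=-1, keepdims=True) + 1e-8)
    gt = gt / (np.linalg.norm(gt, axis=-1, keepdims=True) + 1e-8)
    cos = np.clip(np.sum(pred * gt, axis=-1), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos)).mean())


pred = np.random.randn(240, 320, 3)   # placeholder prediction
gt = np.random.randn(240, 320, 3)     # placeholder ground truth
print(f"mean angular error: {mean_angular_error(pred, gt):.1f} deg")
```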
Multimodal Capabilities
The 4M-21 model also excels at multimodal retrieval: it can predict global embeddings from any input modality and use them for retrieval. This allows retrieving data across modalities more effectively than previous models and offers finer control over what is retrieved. The model's ability to handle diverse tasks and modalities without performance degradation highlights its potential for real-world applications where efficiency and versatility are crucial.
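A minimal sketch of what such retrieval looks like in practice follows: embed a query from any modality into a global embedding, then rank a gallery by cosine similarity. The `embed` stub and embedding dimension are assumptions standing in for the model's predicted DINOv2/ImageBind-style embeddings.

```python
# Cosine-similarity retrieval over predicted global embeddings (illustrative).
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM = 768  # assumed embedding size


def embed(example) -> np.ndarray:
    """Stub for the model's any-modality -> global-embedding prediction."""
    return rng.standard_normal(EMB_DIM)


def retrieve(query_emb: np.ndarray, gallery: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k most similar gallery items."""
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    return np.argsort(g @ q)[::-1][:k]


gallery = np.stack([embed(None) for _ in range(1000)])    # e.g. RGB images
query = embed("a depth map or caption used as the query")
print(retrieve(query, gallery, k=5))
```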
Conclusion
The 4M-21 model represents a significant advance in multimodal and multitask learning. By leveraging discrete tokenization and multimodal training, it addresses the limitations of existing models and delivers robust performance across a wide array of tasks and modalities. The open-sourcing of the models and training code enables broader research and development. 4M-21 sets a new benchmark for any-to-any vision models and underscores the potential of scalable, unified models to handle complex, real-world scenarios.
Tags: ML News