Google's PaLI-3 Sets New Benchmarks in Vision-Language Modeling with Fewer Parameters
Google presents some interesting findings from a new model and training procedure for vision-language models!
In a compelling new paper, Google researchers have unveiled PaLI-3, a vision-language model (VLM) that outperforms its larger counterparts. While the current trend in VLM development focuses on increasing the number of parameters to improve performance, PaLI-3 disrupts this narrative. With just 5 billion parameters, this model performs exceptionally well across multiple benchmarks, particularly in tasks that require text understanding and object localization.
The Architecture

Figure: Architecture of PaLI-3
Visual Component (SigLIP: 2B Vision Model)
At its core, the vision component of PaLI-3 utilizes a Vision Transformer (ViT-G/14) with approximately 2 billion parameters. This ViT is pretrained with the SigLIP recipe, a contrastive pretraining method.
The SigLIP approach trains two separate embedding models: one for images (the ViT-G/14) and one for texts. The two are trained so that a binary classifier can correctly determine whether a given image and text pair belong together (contrastive learning).
This training method is similar to CLIP and ALIGN but is more efficient, scalable, and robust. After this pretraining, the SigLIP text embedding transformer is discarded, and only the ViT is carried over into PaLI-3.
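To make the contrastive objective concrete, here is a minimal NumPy sketch of a SigLIP-style pairwise sigmoid loss. This is an illustration rather than the paper's implementation: the function name is made up, and the temperature and bias are fixed constants here, whereas SigLIP learns them during training.

```python
import numpy as np

def siglip_style_loss(img_emb, txt_emb, temperature=10.0, bias=-10.0):
    """Sketch of a SigLIP-style pairwise sigmoid loss (illustrative, not official).

    img_emb, txt_emb: [batch, dim] L2-normalized embeddings from the image and
    text towers. Every image-text pair in the batch is treated as an independent
    binary classification: matching (diagonal) pairs are positives, all other
    combinations are negatives.
    """
    logits = temperature * img_emb @ txt_emb.T + bias   # [batch, batch] pairwise scores
    labels = 2.0 * np.eye(len(img_emb)) - 1.0           # +1 on the diagonal, -1 elsewhere
    # softplus(-labels * logits) == -log sigmoid(labels * logits)
    return np.log1p(np.exp(-labels * logits)).mean()
```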
Full PaLI Model
After the image is processed by the Vision Transformer, the resulting visual tokens are linearly projected and combined with the embedded text tokens from the input.
Combined, these visual and text tokens feed into a 3-billion parameter UL2 encoder-decoder language model. The language model then generates the desired textual output based on the visual and textual input.
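To illustrate this data flow, here is a shape-level PyTorch sketch. The dimensions and the stand-in layers for the ViT and UL2 blocks are placeholders; only the wiring (image encoder, linear projection, concatenation with text tokens, encoder-decoder) mirrors the paper's description.

```python
import torch
import torch.nn as nn

class PaLI3FlowSketch(nn.Module):
    """Shape-level sketch of PaLI-3's data flow (not the released model).
    The vision encoder and UL2 encoder-decoder are replaced by simple stand-ins;
    all dimensions are placeholders."""

    def __init__(self, d_vision=1664, d_model=2048):
        super().__init__()
        # Stand-in for the SigLIP-pretrained ViT-G/14 image encoder (text tower discarded).
        self.vision_encoder = nn.Linear(d_vision, d_vision)
        # Linear projection of visual tokens into the language model's embedding space.
        self.visual_proj = nn.Linear(d_vision, d_model)
        # Stand-ins for the 3B UL2 encoder-decoder.
        self.encoder = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)

    def forward(self, image_patches, text_token_embeddings, target_embeddings):
        visual_tokens = self.vision_encoder(image_patches)      # [B, N_img, d_vision]
        visual_tokens = self.visual_proj(visual_tokens)         # [B, N_img, d_model]
        # Concatenate projected visual tokens with the embedded input text tokens.
        multimodal_input = torch.cat([visual_tokens, text_token_embeddings], dim=1)
        memory = self.encoder(multimodal_input)                 # encode the fused sequence
        return self.decoder(target_embeddings, memory)          # decode the textual output

# Usage with random tensors, just to check shapes:
model = PaLI3FlowSketch()
img = torch.randn(2, 256, 1664)   # pretend ViT patch features
txt = torch.randn(2, 32, 2048)    # pretend embedded input text tokens
tgt = torch.randn(2, 16, 2048)    # pretend decoder inputs
out = model(img, txt, tgt)        # -> [2, 16, 2048]
```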
Contrastive Learning
The vision encoder for PaLI-3 was pre-trained using contrastive objectives rather than classification objectives. This strategic pre-training has enabled the model to achieve state-of-the-art (SOTA) results in various tasks, including visually-situated text understanding and object localization. Moreover, the model is trained on a web-scale image-text dataset, which further boosts its performance.
The Training Procedure
The procedure for training the model involves a few stages (a condensed sketch of the full schedule follows the list):
Stage 0: Unimodal Pretraining - The image encoder is contrastively pre-trained using SigLIP on web image-text pairs. About 40% of pairs are retained after filtering. The image encoder is trained at 224x224 resolution. Concurrently, a 3B UL2 text encoder-decoder is trained using a mixture of denoisers approach.
Stage 1: Multimodal Training - The image encoder is merged with the text encoder-decoder and trained on a mixture of multimodal tasks. The image encoder remains frozen, and images are kept at their native 224x224 resolution. The primary dataset for this stage is WebLI. Other data includes multilingual captioning on CC3M-35L, WebLI OCR, cross-lingual VQA, and object detection. The training mixture is further enriched with PDF documents and web images in over 100 languages.
Stage 2: Resolution Increase - The model's input resolution is increased to better capture image details. The entire model, including the image encoder, is fine-tuned on a curriculum of increasing resolutions, with checkpoints kept at 812x812 and 1064x1064. The training data for this stage primarily comprises visually-situated text and object detection.
Task Specialization - The model is specifically fine-tuned for each task. The image encoder remains fixed, and for most tasks, the 812x812 resolution checkpoint is used. However, for two tasks related to document understanding, the resolution is increased to 1064x1064.
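Here is the condensed sketch mentioned above: a hypothetical Python summary of the stages. Field names, groupings, and the exact dataset lists are illustrative shorthand for the paper's description, not a reproduction of its training configuration.

```python
# Hypothetical summary of PaLI-3's training schedule (illustrative field names).
TRAINING_STAGES = [
    {"stage": "0: unimodal pretraining", "resolution": 224, "image_encoder": "trained",
     "objective": "SigLIP contrastive loss (vision) + UL2 mixture of denoisers (text)"},
    {"stage": "1: multimodal training", "resolution": 224, "image_encoder": "frozen",
     "objective": "mixture of multimodal tasks (WebLI, CC3M-35L captioning, OCR, VQA, detection)"},
    {"stage": "2: resolution increase", "resolution": [812, 1064], "image_encoder": "trained",
     "objective": "visually-situated text + object detection, checkpoints kept at each resolution"},
    {"stage": "3: task specialization", "resolution": 812, "image_encoder": "frozen",
     "objective": "per-task fine-tuning (1064x1064 for two document-understanding tasks)"},
]

# Example: list which stages keep the image encoder frozen.
frozen = [s["stage"] for s in TRAINING_STAGES if s["image_encoder"] == "frozen"]
print(frozen)  # ['1: multimodal training', '3: task specialization']
```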
Results
Visually-Situated Text Understanding: This area involves understanding text within the context of images, using benchmarks such as TextCaps, TextVQA, and InfographicVQA, which cover natural images, documents, and infographics. PaLI-3 achieves state-of-the-art (SOTA) results on most of these benchmarks, both with and without external OCR input. However, it lags behind on AI2D and ChartQA, benchmarks that require more complex reasoning.
Natural Image Understanding: This task centers on interpreting the visual elements in natural images and generating captions or answering questions about them. Benchmarks used include COCO captions, VQAv2, OKVQA, and TallyQA. Despite its smaller size, PaLI-3 outperforms almost all competing models, trailing only slightly behind much larger models such as PaLI-X.
Video Captioning and Question Answering: This involves fine-tuning PaLI-3 on benchmarks like MSR-VTT and ActivityNet Captions for video captioning, and NExT-QA and ActivityNet-QA for video question answering. Unlike other models, PaLI-3 was not pre-trained with video data but still performs competitively, indicating its robust capabilities across different media types.
Overall
In closing, PaLI-3 sets a new standard for efficiency and effectiveness in vision-language modeling. Its strong performance across diverse benchmarks, achieved with fewer parameters, makes it an attractive option for real-world applications. It not only challenges the notion that bigger is always better but also opens new avenues for multilingual and multimedia capabilities.