EAGLE: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
Combining the strengths of multiple encoders to achieve optimal multimodal performance!
Eagle is a family of multimodal large language models (MLLMs) developed to enhance the visual perception capabilities of language models by integrating multiple vision encoders. This approach addresses a crucial challenge in multimodal models: accurately interpreting complex visual information, which is essential for tasks such as optical character recognition (OCR), document analysis, and visual question answering. While existing MLLMs have made strides in these areas, there is often a lack of systematic comparisons that deeply explore the integration of multiple vision experts. Eagle fills this gap by conducting a comprehensive exploration of the design space for MLLMs that use a mixture of vision encoders, ultimately leading to a more streamlined and effective design. The result is a model family that outperforms other leading open-source models across various benchmarks.
Vision-Centric Design
Eagle’s design emphasizes a vision-centric approach, leveraging a mixture of complementary vision encoders combined in a simple yet effective fusion architecture. The study reveals that straightforwardly concatenating visual tokens from different encoders can be just as effective as more complex mixing strategies. To enhance this design, Eagle introduces a novel Pre-Alignment stage, which aligns each vision encoder individually with a frozen language model before the encoders are trained together. This pre-alignment bridges the representational gap between encoders trained on diverse tasks, leading to a more coherent and stable integration. The systematic exploration of various design choices and thorough ablation studies allow Eagle to achieve superior results compared to other state-of-the-art models.
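To make the idea concrete, here is a minimal sketch of such a mixture-of-encoders front end in PyTorch. It assumes each encoder returns a sequence of visual tokens and uses a simple per-encoder linear projector; the module names and the sequence-level concatenation shown here are illustrative stand-ins, not Eagle's exact implementation:

```python
import torch
import torch.nn as nn

class MixtureOfEncoders(nn.Module):
    """Minimal sketch of a mixture-of-encoders front end.

    Each vision encoder turns an image into a sequence of visual tokens,
    a per-encoder projector maps those tokens into the language model's
    embedding space, and the token sequences are simply concatenated.
    Module and dimension names are illustrative, not Eagle's own.
    """

    def __init__(self, encoders, encoder_dims, llm_dim):
        super().__init__()
        self.encoders = nn.ModuleList(encoders)
        self.projectors = nn.ModuleList(
            [nn.Linear(dim, llm_dim) for dim in encoder_dims]
        )

    def forward(self, image):
        token_sets = []
        for encoder, projector in zip(self.encoders, self.projectors):
            tokens = encoder(image)            # (batch, seq_len_i, dim_i)
            token_sets.append(projector(tokens))
        # Straightforward fusion: append token sequences end to end.
        return torch.cat(token_sets, dim=1)    # (batch, sum(seq_len_i), llm_dim)
```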
High-Resolution Adaptation and Fusion Strategy
One of Eagle’s key innovations is its emphasis on high-resolution adaptation, allowing the model to process visual inputs at higher resolutions, which is critical for capturing fine-grained details in tasks like OCR. Unlike previous methods that often keep vision encoders frozen, Eagle’s approach involves "unlocking" these encoders during training, significantly improving performance, especially when adapting to higher input resolutions that differ from the encoders’ pre-training settings. The study compares various fusion strategies, including sequence-based and complex attention methods, finding that Channel Concatenation—a method that combines visual tokens along the channel dimension—offers the best balance of simplicity, efficiency, and performance. This approach maintains high throughput while achieving superior results, making it the preferred strategy for integrating multiple vision experts.
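As a rough illustration of channel concatenation, the sketch below resizes each encoder's spatial feature map to a shared grid, stacks the maps along the channel axis, and flattens the result into one visual token per spatial location. The bilinear interpolation and the final `projector` (assumed to be a linear layer mapping the combined channel dimension into the language model's embedding size) are assumptions for illustration, not details confirmed by the paper:

```python
import torch
import torch.nn.functional as F

def channel_concat_fusion(feature_maps, projector):
    """Sketch of channel concatenation over spatial feature maps.

    Assumes each encoder returns a map of shape (batch, C_i, H_i, W_i).
    Maps are resized to a shared grid, stacked along the channel axis,
    and flattened into one visual token per spatial location, so the
    token count does not grow with the number of encoders.
    """
    target_hw = feature_maps[0].shape[-2:]  # reuse the first map's grid
    aligned = [
        F.interpolate(fm, size=target_hw, mode="bilinear", align_corners=False)
        for fm in feature_maps
    ]
    fused = torch.cat(aligned, dim=1)            # (batch, sum(C_i), H, W)
    tokens = fused.flatten(2).transpose(1, 2)    # (batch, H*W, sum(C_i))
    return projector(tokens)                     # (batch, H*W, llm_dim)
```

Because every encoder contributes channels rather than extra tokens, the visual sequence fed to the language model stays the same length no matter how many experts are mixed in, which is what keeps throughput high.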

Pre-Alignment of Vision Experts
The Pre-Alignment stage is a critical step in Eagle’s design, specifically aimed at addressing the representational inconsistencies between vision encoders trained on different tasks. Vision encoders, or experts, specialize in distinct areas such as object detection, text recognition, or segmentation, each developing unique feature sets that do not inherently align with the language model’s text-based representations. In the Pre-Alignment stage, each vision expert is fine-tuned individually with a frozen language model using supervised tasks such as predicting text tokens from visual inputs. This process aligns the vision encoder’s output with the language model’s text embeddings, creating a common representational space before the final joint training. Freezing the language model during this step ensures that adjustments focus solely on the vision encoder, preserving the model’s linguistic capabilities while the visual features are tuned to fit them seamlessly. This stage stabilizes training, especially when integrating multiple task-specific encoders, and enhances the synergy between visual and linguistic features, leading to better performance across a wide range of multimodal tasks.
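The following sketch shows the core mechanics of pre-aligning a single expert. It assumes a language model whose forward call accepts prepended visual tokens and returns a standard next-token prediction loss; the function and argument names are hypothetical:

```python
import torch

def pre_align_expert(vision_encoder, projector, language_model,
                     dataloader, optimizer):
    """Sketch of Pre-Alignment for a single vision expert.

    The language model is frozen; only the encoder and its projector
    receive gradients, so the expert learns to emit features the LLM
    can already interpret. The LLM call signature is an assumption.
    """
    for param in language_model.parameters():
        param.requires_grad = False  # keep linguistic capabilities intact

    vision_encoder.train()
    projector.train()
    for images, input_ids, labels in dataloader:
        visual_tokens = projector(vision_encoder(images))
        # Assumed interface: the LLM prepends the visual tokens and
        # returns a next-token prediction loss over the text labels.
        loss = language_model(visual_tokens=visual_tokens,
                              input_ids=input_ids, labels=labels).loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

Running this loop once per expert, before the experts are ever trained together, is what closes the representational gap that otherwise destabilizes joint training.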
Integration of Multiple Vision Encoders
Eagle systematically incorporates multiple vision encoders trained on various tasks, including object detection, segmentation, and text recognition, using a step-by-step greedy approach to identify the optimal combination of vision experts. This strategy consistently improves the model’s performance by leveraging the strengths of each encoder, making Eagle particularly effective for complex visual tasks that require a broad range of visual understanding. The integration of diverse vision experts allows Eagle to handle intricate multimodal challenges, such as document analysis and high-resolution OCR, with greater precision and effectiveness.
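A minimal version of this greedy selection might look like the sketch below, where `evaluate` is an assumed callback that trains and scores a model on held-out benchmarks for a given set of experts:

```python
def greedy_expert_search(base_expert, candidates, evaluate):
    """Sketch of step-by-step greedy selection of vision experts.

    Starting from a base encoder, each round adds whichever remaining
    candidate most improves a benchmark score, and stops once no
    addition helps. `evaluate` is an assumed callback that trains and
    scores a model built from the given set of experts.
    """
    selected = [base_expert]
    best_score = evaluate(selected)
    remaining = list(candidates)
    while remaining:
        round_scores = [(evaluate(selected + [e]), e) for e in remaining]
        score, best = max(round_scores, key=lambda pair: pair[0])
        if score <= best_score:
            break  # no remaining expert improves the mixture
        selected.append(best)
        remaining.remove(best)
        best_score = score
    return selected
```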
Performance and Benchmarks
Eagle’s design and training strategies translate into state-of-the-art performance across numerous benchmarks. The model excels in visual question answering (VQA) tasks, such as GQA and VQAv2, as well as in OCR and document understanding benchmarks like TextVQA and OCRBench. By supporting high-resolution visual inputs and integrating multiple task-specific encoders, Eagle demonstrates robust visual perception and reasoning skills, outperforming other leading MLLMs. The combination of high-resolution adaptation, Pre-Alignment, and optimized fusion strategies allows Eagle to maintain a straightforward yet highly effective design, making it capable of processing complex visual information without requiring elaborate image decomposition techniques.
Conclusion
Eagle sets a new standard in the design of multimodal large language models by systematically exploring the integration of multiple vision encoders. Unlike previous efforts that focused primarily on novel fusion architectures, Eagle emphasizes the importance of fundamental design choices, such as high-resolution adaptation and pre-alignment of vision experts. This approach not only improves performance but also ensures stability and coherence when combining diverse visual representations. By identifying efficient training recipes and fusion methods, Eagle offers a powerful, scalable solution for enhancing the visual perception capabilities of MLLMs, providing a robust foundation for future research and development in the field of multimodal AI.