OmniVision-968M: An Ultra-Lightweight Multimodal Model Built on Qwen 2.5
OmniVision-968M is a compact, sub-billion-parameter multimodal model designed for edge devices. It accepts both image and text inputs and builds on the LLaVA architecture to cut the number of tokens it must process, improve accuracy, and run efficiently in environments with limited computational power.
Token Compression and Enhanced Efficiency
OmniVision achieves much of its computational efficiency by compressing each image from 729 tokens to 81 before they reach the language model, a 9x reduction that significantly lowers latency and resource use. Because the language model sees far fewer tokens per image, the model processes images faster without sacrificing performance, and its reduced computational demand lets it run smoothly on edge devices, making advanced multimodal AI more accessible.
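The post doesn't spell out the compression mechanism, but a 9x reduction from 729 to 81 tokens is consistent with folding each group of nine neighboring patch embeddings into one wider token before projection. The sketch below illustrates that idea in PyTorch; the `compress_image_tokens` helper and the 1152-dimensional SigLIP hidden size are assumptions for illustration, not confirmed implementation details.

```python
import torch

def compress_image_tokens(embeddings: torch.Tensor, factor: int = 9) -> torch.Tensor:
    """Fold groups of `factor` neighboring tokens into one wider token.

    embeddings: [batch, 729, dim] patch embeddings from the vision encoder.
    Returns:    [batch, 81, dim * factor], ready for the projection layer.
    """
    batch, num_tokens, dim = embeddings.shape
    assert num_tokens % factor == 0, "token count must divide evenly by the factor"
    return embeddings.reshape(batch, num_tokens // factor, dim * factor)

# 729 patch tokens -> 81 compressed tokens (9x fewer inputs for the language model)
vision_out = torch.randn(1, 729, 1152)   # assumed SigLIP-400M hidden size
compressed = compress_image_tokens(vision_out)
print(compressed.shape)                  # torch.Size([1, 81, 10368])
```

The language model's attention cost grows with sequence length, so shrinking each image's footprint from 729 tokens to 81 is where most of the latency savings would come from.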
Accurate Output and Reduced Hallucinations with DPO
Accuracy improvements come from Direct Preference Optimization (DPO) training on reliable data sources, which minimizes hallucinations, a common failure mode in generative models. During DPO training, a teacher model produces subtle but accuracy-critical corrections to the base model's outputs, and OmniVision is trained to prefer the corrected answers, so its responses stay on point without drifting from the model's core capabilities. This precision-focused approach lets OmniVision handle diverse real-world applications effectively.
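Concretely, DPO optimizes a contrastive objective over pairs of answers, here the teacher-corrected answer versus the original output, without training a separate reward model. Below is a minimal sketch of the standard DPO loss in PyTorch; it shows the general technique rather than Nexa AI's exact training code, and the tensor names are placeholders.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    """Standard DPO objective (Rafailov et al., 2023).

    Each argument is a tensor of per-sequence log-probabilities: the policy
    (the model being trained) and a frozen reference model, each scored on
    the chosen (corrected) and rejected (original) answers.
    """
    # Margin by which each model prefers the chosen answer over the rejected one
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    # Push the policy's margin above the reference's, scaled by beta
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()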
Architecture and Training Stages
OmniVision's architecture consists of three components: the Qwen2.5-0.5B-Instruct base language model for text processing, the SigLIP-400M vision encoder for generating image embeddings, and a projection layer that aligns those embeddings with the language model's token space. Training proceeds in three stages: pretraining on image-caption pairs to establish basic vision-language alignment, supervised fine-tuning on image-based question-answering datasets to improve contextual understanding, and a final DPO phase that refines outputs using teacher-corrected answer pairs.
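To make the data flow concrete, here is a hypothetical sketch of how those three components could be wired together. The class name, the 1152-dimensional SigLIP output, the 896-dimensional Qwen2.5-0.5B embedding space, and the single linear projector are all assumptions for illustration; the actual projection layer may differ.

```python
import torch
import torch.nn as nn

class OmniVisionSketch(nn.Module):
    """Hypothetical wiring: SigLIP encoder -> token compression -> projector -> Qwen LM."""

    def __init__(self, vision_encoder, language_model,
                 vision_dim=1152, lm_dim=896, factor=9):
        super().__init__()
        self.vision_encoder = vision_encoder    # SigLIP-400M image encoder
        self.language_model = language_model    # Qwen2.5-0.5B-Instruct
        self.factor = factor
        # Projection layer: maps compressed vision features into the LM's token space
        self.projector = nn.Linear(vision_dim * factor, lm_dim)

    def forward(self, pixel_values, text_embeds):
        img = self.vision_encoder(pixel_values)               # [B, 729, vision_dim]
        b, n, d = img.shape
        # 9x token compression: 729 patch tokens -> 81 wider tokens
        img = img.reshape(b, n // self.factor, d * self.factor)
        img = self.projector(img)                             # [B, 81, lm_dim]
        # Prepend the projected image tokens to the text embeddings and run the LM
        inputs = torch.cat([img, text_embeds], dim=1)
        return self.language_model(inputs_embeds=inputs)
```

In this sketch only the projector would be trained from scratch during pretraining, with the encoder and language model adapted in the later fine-tuning stages, which matches the staged recipe described above.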
Benchmark Performance and Real-World Application
In benchmarks against models such as nanoLLAVA and Qwen2-VL-2B, OmniVision performs strongly on tasks including MM-VET, ChartQA, and ScienceQA, and it consistently outperforms the similarly sized nanoLLAVA on visual-linguistic benchmarks. With applications in edge devices and portable AI, OmniVision is positioned to support innovations in fields ranging from interactive media to autonomous technology.
Looking Forward
As OmniVision continues to develop, Nexa AI plans to expand DPO training and strengthen the model's document and text processing abilities, with the goal of refining OmniVision into a fully optimized, production-ready solution for real-time, on-device multimodal processing. Through these advancements, Nexa AI aims to bring powerful, compact AI capabilities to a wider array of applications.