
Introducing VisionLLM: A New Method for Multi-Modal LLMs

Humans are multimodal — so shouldn't AI be the same? Here, we discuss the future of vision models, and the recent progress that's been made in multi-modal LLMs.
Multi-modal AI models represent a significant leap forward in the ability of machines to understand and engage with the world. They bring together various forms of data, enabling a more holistic understanding of inputs and allowing for a richer set of outputs.
The implications of this advancement are vast, with potential applications stretching across numerous sectors and industries. Among the most compelling applications are in robotics and autonomous vehicles. In these areas, machines need to make sense of complex environments that include a multitude of different sensory inputs.
Robots must understand and navigate physical spaces, interpreting both the visual and auditory cues they encounter. Similarly, autonomous vehicles must process real-time data from multiple sources, including visual information from cameras and location data from GPS systems, and then make rapid, safe decisions based on that data.

The Future of Vision Models

As the landscape of computer vision models continues to evolve, a technique called visual prompt tuning has recently emerged and is rapidly gaining traction. This approach allows for flexible execution of pure vision tasks across a variety of applications. One drawback of visual prompts, however, is that they demand more tedious interaction from the user: a dedicated interface is required to create them, and the overall experience is inferior to simply describing the task in text.

Speaking Pixelish?

VisionLLM represents a particularly innovative approach. It pairs traditional image processing techniques with text-focused language models. Images are interpreted as a "foreign language" and transformed into token representations that align more effectively with language prompts. Rather than relying on fixed-size patch embeddings, VisionLLM employs a language-guided image tokenizer, providing a flexible encoding of visual information that is tailored to task-specific language instructions.
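To make the idea concrete, here is a minimal conceptual sketch of that interface. The objects and methods below (image_tokenizer, llm.tokenize, llm.generate) are assumed names for illustration, not the released VisionLLM API:

```python
# Conceptual sketch only: image_tokenizer and llm are hypothetical objects
# standing in for the language-guided tokenizer and the LLM decoder.
def run_vision_llm(image, instruction, image_tokenizer, llm):
    # Language-guided tokenization: the instruction shapes which visual content
    # gets encoded, in contrast to fixed-size patch embeddings.
    image_tokens = image_tokenizer(image, instruction)  # sequence of visual tokens
    text_tokens = llm.tokenize(instruction)             # ordinary text tokens
    # The LLM consumes the image tokens as if they were words of a foreign language.
    return llm.generate(image_tokens, text_tokens)
```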

Technical Overview

Given an image, the system feeds it into an image backbone such as ResNet, which extracts visual features at four different scales. In parallel, a text encoder such as BERT extracts language features F_l from the provided prompt. The two streams of features, visual and language, are then intertwined through cross-attention, producing multi-scale, language-aware visual features. This step aligns the two modalities, helping the model relate the language instructions or prompts to the image content.
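The fusion step can be sketched in a few lines of PyTorch. This is an illustrative approximation under assumed shapes and dimensions, not the paper's exact module: each visual scale attends to the projected language features through standard cross-attention.

```python
import torch
import torch.nn as nn

class LanguageGuidedFusion(nn.Module):
    """Illustrative sketch: inject language features into each visual
    feature scale via cross-attention. Dimensions are assumptions."""

    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_feats, lang_feats):
        # visual_feats: list of 4 tensors (B, H_i*W_i, dim), one per backbone scale
        # lang_feats:   (B, num_text_tokens, dim), e.g. projected BERT outputs
        fused = []
        for feat in visual_feats:
            # Visual features attend to the language features, yielding
            # multi-scale, language-aware visual features.
            attn_out, _ = self.cross_attn(query=feat, key=lang_feats, value=lang_feats)
            fused.append(self.norm(feat + attn_out))
        return fused
```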
Subsequently, VisionLLM employs a transformer-based network, such as Deformable DETR, furnished with M randomly initialized queries. These queries act as learnable probes the model uses to extract a deeper understanding of the image. Operating on the multi-scale, language-aware visual features, this network generates image tokens. Each token, a distinct unit of meaning, is represented by an embedding (semantic information) and a location (positional information).
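Below is a rough sketch of that query-based decoding step. A vanilla transformer decoder stands in for Deformable DETR, and the query count, location head, and shapes are assumptions for illustration only:

```python
import torch
import torch.nn as nn

class QueryBasedImageTokenizer(nn.Module):
    """Sketch of the query-decoder idea: M randomly initialized queries attend
    to the fused visual features, and each output token carries an embedding
    plus a predicted location. Not the paper's Deformable DETR implementation."""

    def __init__(self, dim=256, num_queries=100, num_layers=6, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        layer = nn.TransformerDecoderLayer(dim, num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.loc_head = nn.Linear(dim, 4)   # e.g. a normalized box (cx, cy, w, h)

    def forward(self, fused_feats):
        # fused_feats: (B, sum_i H_i*W_i, dim) -- multi-scale features flattened
        # and concatenated along the token dimension.
        b = fused_feats.shape[0]
        queries = self.queries.unsqueeze(0).expand(b, -1, -1)
        tokens = self.decoder(tgt=queries, memory=fused_feats)  # (B, M, dim)
        locations = self.loc_head(tokens).sigmoid()             # positional info
        return tokens, locations   # embedding + location for each image token
```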
Example Prompts for the Model
This design represents images in a way that is independent of the input resolution and focuses on extracting visual representations that carry information relevant to the language prompt. An example of such a language instruction is "Segment all the objects of category set <class> within the <range> of the image and generate a list of the format (c, x1, y1, x2, y2, ..., x8, y8)", where c is the predicted class and the coordinate pairs describe the object's boundary points relative to its center.
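As a purely hypothetical illustration of how such an instruction and its structured output might be handled, the snippet below fills in the template and parses a response in that format. The placeholder values, example response, and parser are assumptions for demonstration, not taken from the paper:

```python
# Fill the instruction template with hypothetical placeholder values.
instruction = (
    "Segment all the objects of category set <class> within the <range> of the "
    "image and generate a list of the format (c, x1, y1, x2, y2, ..., x8, y8)"
).replace("<class>", "{'person', 'car'}").replace("<range>", "whole extent")

# A model response in that format might look like this (made-up coordinates):
response = ("(car, 0.12, 0.30, 0.18, 0.28, 0.25, 0.33, 0.27, 0.41, "
            "0.24, 0.49, 0.17, 0.52, 0.11, 0.47, 0.09, 0.38)")

def parse_segmentation(line: str):
    # Split "(c, x1, y1, ..., x8, y8)" into a class label and 8 (x, y) points.
    parts = [p.strip() for p in line.strip("() ").split(",")]
    cls, coords = parts[0], [float(v) for v in parts[1:]]
    points = list(zip(coords[0::2], coords[1::2]))
    return cls, points

print(parse_segmentation(response))
```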
System Architecture


Multi-Modal Progress

VisionLLM's approach has a significant impact on the usability of vision models: the system understands and represents images in a manner that is not only free from the restrictions of a fixed input resolution but also meaningful in relation to the language prompts.
This level of specificity enhances the model's understanding and makes the application of vision models more versatile and robust. Overall, VisionLLM introduces a new paradigm for vision models, bridging the gap between visual and language features, and marking an exciting turning point in the future of vision models.