A Simple and Efficient Technique for Building Multi-Modal Models
Researchers unveil a clever way to reuse existing pretrained models to achieve high performance on multi-modal tasks
In exciting work from Stanford and Contextual AI, researchers have unveiled a framework known as LENS (Large Language Models ENhanced to See). The system is designed to augment frozen LLMs (large language models), enabling them to handle not only text but also images and vision-and-language tasks by building on the already impressive natural language understanding capabilities of LLMs.
The LENS Advantage
LENS stands out from traditional multi-modal methodologies by establishing an integrated framework in which an LLM's "reasoning module" operates on textual data produced by a set of independent "vision modules". Crucially, LENS eliminates the computational cost of aligning the visual and text domains through additional joint pretraining on multimodal data, a step often required in prior work on vision-and-language tasks.
Because it skips this expensive and time-consuming alignment pretraining, a significant hurdle for previous models, LENS can leverage pre-existing vision foundation models without additional fine-tuning, a noteworthy advantage that offers flexibility and reduces computational cost.
The framework works by taking an image and using the vision modules to extract as much textual information as possible to describe it, spanning objects, attributes, and captions, without confining that information to specific task directives. The frozen LLM then interprets these generic prompts combined with task-specific prompts, allowing it to carry out object recognition or visual reasoning tasks.
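To make the flow concrete, here is a minimal Python sketch of how the vision modules' textual outputs might be assembled into a single prompt for the frozen LLM. The `VisualDescription` structure, field names, and prompt wording are illustrative assumptions, not the exact format used by the authors.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VisualDescription:
    """Textual signals extracted from an image by the vision modules (illustrative)."""
    tags: List[str]        # from the Tag Module
    attributes: List[str]  # from the Attributes Module
    captions: List[str]    # from the Intensive Captioner


def build_prompt(desc: VisualDescription, question: str) -> str:
    """Combine generic visual information with a task-specific instruction."""
    return (
        "Tags: " + ", ".join(desc.tags) + "\n"
        "Attributes: " + ", ".join(desc.attributes) + "\n"
        "Captions: " + " ".join(desc.captions) + "\n"
        f"Question: {question}\n"
        "Short answer:"
    )


# Example usage with made-up module outputs:
desc = VisualDescription(
    tags=["dog", "frisbee", "park"],
    attributes=["dog: golden fur, floppy ears", "frisbee: round, plastic"],
    captions=["A dog leaps to catch a frisbee in a grassy park."],
)
print(build_prompt(desc, "What is the animal doing?"))
```

The resulting string is what the frozen LLM sees; no gradients ever flow back into the vision modules or the language model.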

Visual Vocabularies
One of the integral facets of LENS is the employment of visual vocabularies that translate an image into textual information, subsequently handled by a pre-existing LLM. Researchers created vocabularies for common objects and attributes for this very purpose.
Tagging is achieved through a comprehensive tag vocabulary amassed from various image classification, object detection, and semantic segmentation datasets. For the derivation of attributes, GPT-3, a large language model, was utilized to generate descriptions of the visual attributes that distinguish each object category within the object vocabulary.
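As a rough illustration, the attribute vocabulary could be built by prompting a language model once per object category and keeping its answers. The `query_llm` helper and the prompt wording below are assumptions for the sketch; the paper used GPT-3, but any LLM interface with the same shape would do.

```python
from typing import Callable, Dict, List


def build_attribute_vocabulary(
    categories: List[str],
    query_llm: Callable[[str], str],  # hypothetical: any function returning an LLM completion
) -> Dict[str, List[str]]:
    """Ask an LLM to describe distinguishing visual features for each object category.

    The prompt below is illustrative; the paper's exact wording may differ.
    """
    vocabulary: Dict[str, List[str]] = {}
    for category in categories:
        prompt = (
            f"What are useful visual features for distinguishing a {category} "
            "in a photo? List short phrases, one per line."
        )
        completion = query_llm(prompt)
        # Keep non-empty lines as the attribute phrases for this category.
        vocabulary[category] = [
            line.strip("- ").strip() for line in completion.splitlines() if line.strip()
        ]
    return vocabulary


# Example with a stubbed LLM so the sketch runs end to end:
fake_llm = lambda prompt: "- four wheels\n- metal body\n- windshield"
print(build_attribute_vocabulary(["car"], fake_llm))
```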
Technical Overview
The LENS system hinges on four critical components: three distinct vision modules and a reasoning module, each performing a specific function based on the task at hand.
Tag Module: Given an image, this module assigns descriptive tags drawn from the tag vocabulary. A vision encoder (CLIP) is used to select the most appropriate tags for each image.
Attributes Module: This module identifies and assigns pertinent attributes to the objects within the image. A contrastively pretrained vision encoder, CLIP, is used here, complemented by task-specific prompts.
Intensive Captioner: An image captioning model (BLIP) generates a large number of captions per image, capturing the diverse aspects of the visual content; these captions are passed to the "reasoning module" without any alterations.
Reasoning Module: The reasoning module is a frozen LLM that generates responses based on the textual descriptions supplied by the vision modules, in conjunction with task-specific instructions.
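Here is a minimal sketch of how the three vision modules could be wired up with off-the-shelf Hugging Face checkpoints. The model names, the tiny example vocabularies, and the top-k settings are assumptions for illustration and do not reflect the paper's exact configuration.

```python
import torch
from PIL import Image
from transformers import (
    CLIPModel,
    CLIPProcessor,
    BlipForConditionalGeneration,
    BlipProcessor,
)

# Assumed checkpoints; the paper's exact choices may differ.
clip_model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
blip_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
blip_model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")


def rank_texts_with_clip(image: Image.Image, texts: list, top_k: int = 5) -> list:
    """Score candidate texts against the image with CLIP and keep the best matches."""
    inputs = clip_processor(text=texts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = clip_model(**inputs).logits_per_image[0]  # one score per candidate text
    best = logits.topk(min(top_k, len(texts))).indices.tolist()
    return [texts[i] for i in best]


def intensive_captions(image: Image.Image, num_captions: int = 5) -> list:
    """Sample several diverse captions with BLIP."""
    inputs = blip_processor(images=image, return_tensors="pt")
    with torch.no_grad():
        out = blip_model.generate(
            **inputs, do_sample=True, top_p=0.9, max_new_tokens=30,
            num_return_sequences=num_captions,
        )
    return blip_processor.batch_decode(out, skip_special_tokens=True)


image = Image.open("example.jpg")  # placeholder path

# Tag Module: pick the best-matching entries from a (tiny, illustrative) tag vocabulary.
tag_vocabulary = ["dog", "cat", "frisbee", "car", "tree"]
tags = rank_texts_with_clip(image, [f"a photo of a {t}" for t in tag_vocabulary], top_k=3)

# Attributes Module: rank LLM-generated attribute phrases against the image.
attribute_vocabulary = ["a dog with golden fur", "a dog with floppy ears", "a round plastic frisbee"]
attributes = rank_texts_with_clip(image, attribute_vocabulary, top_k=2)

# Intensive Captioner: many captions covering different aspects of the image.
captions = intensive_captions(image)

print(tags, attributes, captions)
```

The tags, attributes, and captions produced here are exactly the kind of text that the prompt-assembly sketch earlier in the post would hand to the frozen LLM.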

A Clever Technique
Performance analysis has shown that LENS excels in object recognition tasks, often matching or exceeding existing models like CLIP. While LENS may not always surpass every other method, it stands out for its balance of efficiency and performance. By avoiding costly pretraining and making efficient use of existing vision models, LENS optimizes resource use, and its modular design suggests it could be combined with other techniques for potentially improved outcomes.