
Predicting the Next Image?

Researchers create a new type of vision model trained much like modern LLMs! Outlined in the paper "Sequential Modeling Enables Scalable Learning for Large Vision Models."
A recent paper from UC Berkeley and Johns Hopkins University, "Sequential Modeling Enables Scalable Learning for Large Vision Models," introduces a new approach to training Large Vision Models (LVMs) without relying on any linguistic data. The key concept is the use of "visual sentences" to represent a wide array of visual data, such as raw images, videos, and annotated data like semantic segmentations and depth reconstructions. These visual elements are converted into token sequences, which are then used to train the model with a cross-entropy loss for next-token prediction, just like modern LLMs such as GPT or LLaMA. The resulting model scales effectively with model size and data diversity, and it can solve a variety of vision tasks when given suitable visual prompts at test time.
The work acknowledges the impact of Large Language Models like GPT and LLaMA, and poses a question about the feasibility of developing a Large Vision Model. The aim is to replicate the key features of LLMs, such as scaling with big data and flexible task specification through prompting, but solely from pixel data, without involving any linguistic components.

Image Tokenization

The methodology converts images into discrete tokens using a VQGAN (Vector Quantized Generative Adversarial Network), which produces semantic tokens. The encoder maps each input image to a sequence of discrete tokens drawn from a learned codebook, and the decoder maps tokens back to pixels. This tokenization is performed independently for each image, allowing the tokenizer to be trained separately from the downstream Transformer model.
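To make the tokenization step concrete, here is a minimal sketch of a VQGAN-style encoder that maps a 256×256 image to a 16×16 grid of discrete token ids via nearest-codebook lookup. The class name, layer sizes, and the assumed codebook of 8192 entries / 256 tokens per image are illustrative, not the authors' released code.

```python
import torch
import torch.nn as nn

class ImageTokenizer(nn.Module):
    """Toy VQGAN-style tokenizer: 256x256 image -> 16x16 grid of token ids."""

    def __init__(self, codebook_size=8192, latent_dim=64):
        super().__init__()
        # Toy convolutional encoder that downsamples 256x256 -> 16x16.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, latent_dim, kernel_size=4, stride=4),           # 256 -> 64
            nn.ReLU(),
            nn.Conv2d(latent_dim, latent_dim, kernel_size=4, stride=4),  # 64 -> 16
        )
        # Learned codebook of discrete visual "words".
        self.codebook = nn.Embedding(codebook_size, latent_dim)

    @torch.no_grad()
    def encode(self, images):
        """Map images (B, 3, 256, 256) to discrete token ids (B, 256)."""
        z = self.encoder(images)                    # (B, D, 16, 16)
        z = z.flatten(2).transpose(1, 2)            # (B, 256, D)
        flat = z.reshape(-1, z.size(-1))            # (B*256, D)
        # Vector quantization: pick the nearest codebook entry per grid position.
        dists = torch.cdist(flat, self.codebook.weight)     # (B*256, K)
        return dists.argmin(dim=-1).view(z.size(0), -1)     # (B, 256)

tokenizer = ImageTokenizer()
images = torch.rand(2, 3, 256, 256)     # two RGB images in [0, 1]
tokens = tokenizer.encode(images)       # (2, 256) token ids, each in [0, 8192)
```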

Sequence Modeling of Visual Sentences

The visual sentences, once tokenized, are treated as unified sequences by concatenating the tokens from multiple images into a one-dimensional sequence. The Transformer model used is akin to autoregressive language models, trained to predict the next token in the sequence. This approach enables the model to infer relationships between images from context, fostering its ability to generalize to unseen visual sentence structures.
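A minimal sketch of this objective: the token ids of several images are concatenated into one 1-D sequence (with an end-of-sentence token appended), and a causal Transformer is trained with cross-entropy to predict each next token, exactly as in language modeling. The tiny model below is a stand-in for the far larger Transformer trained in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = 8192 + 1          # codebook entries + an end-of-sentence token
EOS_ID = 8192

class TinyCausalTransformer(nn.Module):
    """Small autoregressive Transformer over visual tokens (illustrative only)."""

    def __init__(self, vocab=VOCAB, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, tokens):                       # tokens: (B, L)
        x = self.embed(tokens)
        # Causal mask: each position may only attend to earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        return self.head(self.blocks(x, mask=mask))  # (B, L, vocab)

# A "visual sentence": tokens of four images concatenated, then EOS.
images_tokens = [torch.randint(0, 8192, (256,)) for _ in range(4)]
sentence = torch.cat(images_tokens + [torch.tensor([EOS_ID])]).unsqueeze(0)

model = TinyCausalTransformer()
logits = model(sentence[:, :-1])                     # predict token t+1 from tokens <= t
loss = F.cross_entropy(logits.reshape(-1, VOCAB), sentence[:, 1:].reshape(-1))
loss.backward()
```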



Annotations

You may be wondering: if the model only produces outputs in image space, how exactly can it carry out traditional tasks like detection or pose estimation? Well, it's actually quite simple, as all annotations are represented as images, with different types of annotations handled through tailored methods:

Object Detection: Annotations are created by overlaying color-coded bounding boxes around each object.
Human Pose: Human skeletons are rendered in pixel space, adhering to predefined formats.
Other Types of Annotations: This includes semantic segmentation maps, edge maps, depth maps, and surface normal maps, all of which are already represented in image form.
This systematic approach to handling annotations enables the creation of a unified structure for diverse visual data, facilitating the training and evaluation of the Large Vision Model (LVM).
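For example, a detection annotation can be turned into an ordinary RGB image simply by drawing color-coded boxes onto the frame, after which it is tokenized like any other image. The per-class color map and box format below are assumptions for illustration, not the paper's exact rendering procedure.

```python
from PIL import Image, ImageDraw

# Hypothetical per-class color map; the paper's actual palette may differ.
CLASS_COLORS = {"person": (255, 0, 0), "car": (0, 0, 255)}

def render_detection(frame: Image.Image, boxes) -> Image.Image:
    """Overlay color-coded boxes; boxes is a list of (label, x0, y0, x1, y1)."""
    canvas = frame.copy()
    draw = ImageDraw.Draw(canvas)
    for label, x0, y0, x1, y1 in boxes:
        draw.rectangle([x0, y0, x1, y1], outline=CLASS_COLORS[label], width=3)
    return canvas   # an ordinary RGB image, ready for the VQGAN tokenizer

frame = Image.new("RGB", (256, 256), (128, 128, 128))      # placeholder photo
annotation = render_detection(frame, [("person", 40, 30, 120, 220),
                                      ("car", 130, 140, 240, 230)])
```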

Data

A significant contribution of this research is the assembly of the Unified Vision Dataset v1 (UVDv1), which is composed of various sources of visual data: unlabelled images, images with visual annotations, unlabelled and annotated videos, and 3D synthetic objects. This dataset, containing 1.64 billion images, aims to provide the diversity and volume of data necessary for training LVMs. The concept of "visual sentences" is introduced as a unified format for this diverse visual data, with each visual sentence comprising a sequence of one or more images followed by an end-of-sentence token.
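As an illustration of that unified format, the sketch below packs two of the dataset's data types into the same sentence structure: a run of consecutive video frames, and an image paired with its rendered annotation, each followed by the end-of-sentence token. The pairing layout and EOS id are assumptions made for clarity.

```python
import torch

EOS_ID = 8192   # one id beyond the codebook, used here as the end-of-sentence token

def make_visual_sentence(per_image_tokens):
    """Concatenate (256,) token tensors, one per image, and append EOS."""
    return torch.cat(list(per_image_tokens) + [torch.tensor([EOS_ID])])

# Unlabelled video: a handful of consecutive frames form one sentence.
video_sentence = make_visual_sentence(
    torch.randint(0, 8192, (256,)) for _ in range(8))

# Annotated image: the photo's tokens followed by its rendered annotation's tokens.
photo_tokens = torch.randint(0, 8192, (256,))
annotation_tokens = torch.randint(0, 8192, (256,))
annotated_sentence = make_visual_sentence([photo_tokens, annotation_tokens])
```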

Scalability

The study evaluates the trained model's scalability and its ability to perform a range of prompted tasks. Results indicate that the model's performance improves with increasing size, demonstrating strong scalability with both larger models and more data. The model also shows promising results on downstream tasks like semantic segmentation, depth estimation, surface normal estimation, and edge detection.



Sequential and Analogy Prompting

The model's capabilities are further explored through sequential and analogy prompting. Sequential prompting involves predicting the next image in a sequence, demonstrating the model's inferential abilities regarding spatial positioning, viewpoint, and object understanding. Analogy prompting challenges the model to comprehend and respond to analogies of varying lengths and complexities, showcasing its advanced interpretative abilities.
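A rough sketch of how analogy prompting could work at test time: a few (input, output) example pairs are tokenized and concatenated, the query image's tokens are appended, and the model autoregressively generates the next 256 tokens, which are decoded back to pixels. Here `model`, `tokenizer`, and the greedy decoding loop are stand-ins, not the authors' inference code.

```python
import torch

@torch.no_grad()
def analogy_prompt(model, tokenizer, example_pairs, query_image,
                   tokens_per_image=256):
    """example_pairs: list of (input image, target image), e.g. (photo, depth map)."""
    prompt = []
    for inp, out in example_pairs:
        prompt.append(tokenizer.encode(inp.unsqueeze(0))[0])
        prompt.append(tokenizer.encode(out.unsqueeze(0))[0])
    prompt.append(tokenizer.encode(query_image.unsqueeze(0))[0])
    tokens = torch.cat(prompt).unsqueeze(0)          # (1, prompt_length)

    # Greedy next-token decoding; the paper's sampling strategy may differ.
    for _ in range(tokens_per_image):
        logits = model(tokens)                       # (1, L, vocab)
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_id], dim=1)

    generated = tokens[:, -tokens_per_image:]        # the predicted image's tokens
    return tokenizer.decode(generated)               # back to pixel space
```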

In summary, this paper presents an innovative approach to Large Vision Models using sequential modeling of visual sentences, demonstrating its scalability and adaptability to a range of vision tasks.
The Paper: "Sequential Modeling Enables Scalable Learning for Large Vision Models"