
Show-o: A Unified Transformer for Multimodal Understanding and Generation

The future of multimodal models?
Show-o is a new transformer model that unifies multimodal understanding and generation in a single framework. Unlike traditional models that specialize in either understanding tasks, such as visual question answering, or generation tasks, such as text-to-image synthesis, Show-o integrates both capabilities. It combines autoregressive modeling for sequential text prediction with discrete diffusion modeling for visual content, allowing it to handle mixed-modality inputs such as text and images. This makes Show-o versatile and efficient across a wide range of vision-language tasks, including text-guided inpainting and extrapolation, image captioning, and video keyframe generation.

Motivation for a Unified Model

Existing multimodal AI approaches often separate understanding and generation into distinct models. For example, LLaVA is designed for understanding tasks and processes inputs autoregressively, while Stable Diffusion relies on continuous diffusion to generate visual content. Keeping a separate specialized model for each task limits flexibility and efficiency. Show-o addresses this by combining autoregressive text generation and discrete diffusion image modeling in a single transformer, allowing it to flexibly handle diverse inputs and tasks with one set of weights instead of several specialized models.

Tokenization Strategy in Show-o

To enable its unified approach, Show-o uses a specialized tokenizer that processes both text and images into discrete tokens. For text, the tokenizer uses a standard pre-trained language model tokenizer, similar to those found in GPT-like architectures. This tokenizer converts text into discrete tokens, such as words or subwords, and represents them as embeddings in a high-dimensional space.
The image tokenizer in Show-o operates differently. Based on the MAGVIT-v2 model, it divides an image (such as a 256x256-pixel image) into smaller patches, such as 16x16-pixel patches. This results in 256 patches for the entire image. Each patch is then mapped to one of the 8,192 entries in a learned codebook. The mapping process involves matching the visual features of each patch to the closest entry in the codebook, where each entry is a vector representing specific visual patterns like textures or colors. The image is therefore encoded as a sequence of discrete tokens, similar to how a sentence is represented as a sequence of words.
Show-o’s tokenizer unifies both text and image data by representing them as sequences of discrete tokens. This unified tokenization allows the model to process mixed-modality inputs in the same way, enabling it to easily switch between tasks like visual question answering, text-to-image generation, and other multimodal applications.
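To make the image-quantization step described above concrete, here is a minimal sketch of how a MAGVIT-v2-style tokenizer might map patch features to discrete ids, assuming a learned codebook of 8,192 entries; the tensor shapes and names (`codebook`, `quantize`, the embedding width) are illustrative rather than taken from the Show-o codebase.

```python
import torch

# Illustrative sizes: a 256x256 image encoded as a 16x16 grid of patch features,
# quantized against a codebook of 8,192 entries (as described above).
num_codes, code_dim = 8192, 256              # code_dim is an assumed embedding width
codebook = torch.randn(num_codes, code_dim)  # stands in for the learned codebook

def quantize(patch_features: torch.Tensor) -> torch.Tensor:
    """Map each patch feature vector to the index of its nearest codebook entry."""
    # patch_features: (num_patches, code_dim), e.g. (256, code_dim) for a 16x16 grid
    distances = torch.cdist(patch_features, codebook)  # (num_patches, num_codes)
    return distances.argmin(dim=-1)                    # discrete image tokens

image_tokens = quantize(torch.randn(256, code_dim))
print(image_tokens.shape)  # torch.Size([256]) -- the image as a "sentence" of tokens
```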

Architecture of Show-o

Show-o builds on the architecture of large language models but incorporates an omni-attention mechanism to handle both text and image data. The omni-attention mechanism combines two types of attention: causal attention and full attention. Causal attention is used for text tokens, where each token only attends to the tokens that came before it, which is essential for autoregressive text generation. Full attention is used for image tokens, allowing all tokens to attend to each other. This is crucial for image generation tasks that require understanding the entire context of the image, such as inpainting or extrapolation.
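The combined mask is easy to picture in code. The sketch below is a simplified illustration rather than Show-o's actual implementation: it builds a boolean attention mask for a sequence in which image tokens may attend to every other image token while text tokens attend only to earlier positions. The function name and layout assumptions are hypothetical.

```python
import torch

def omni_attention_mask(is_image: torch.Tensor) -> torch.Tensor:
    """Combine causal and full attention (sketch).

    is_image: (seq_len,) boolean vector marking which positions hold image tokens.
    Returns a (seq_len, seq_len) boolean mask where entry (i, j) is True if
    position i may attend to position j.
    """
    seq_len = is_image.shape[0]
    # Causal attention: every token can see itself and earlier positions.
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # Full attention within the image block: any image token can see any other.
    full_image = is_image.unsqueeze(1) & is_image.unsqueeze(0)
    return causal | full_image

# Example: 4 text tokens followed by 4 image tokens.
mask = omni_attention_mask(torch.tensor([False] * 4 + [True] * 4))
print(mask.int())  # lower-triangular for text rows, fully open within the image block
```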

Show-o starts with a pre-trained large language model and extends its embedding layer to incorporate image tokens. Each of the 8,192 image tokens has a corresponding embedding in the model’s high-dimensional space, allowing the transformer to handle both text and image data within the same framework. To perform both understanding and generation tasks, Show-o uses a unified prompting strategy that formats various types of input data, including text and images, into structured sequences. Special task tokens, such as [MMU] for multimodal understanding and [T2I] for text-to-image generation, indicate the type of task, while other tokens mark the beginning and end of text and image sequences. This strategy allows the model to flexibly manage different inputs and outputs without needing separate fine-tuning for each task.
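As a rough illustration of this prompting strategy, the snippet below lays out a text-to-image and a multimodal-understanding sequence. The [T2I] and [MMU] task tokens come from the paper; the [SOT]/[EOT]/[SOI]/[EOI] boundary markers, the placeholder image-token strings, and the helper names are assumptions made for the example.

```python
def format_t2i(text_tokens, image_tokens):
    """Text-to-image: the task token, then the prompt text, then the image tokens."""
    return ["[T2I]", "[SOT]", *text_tokens, "[EOT]", "[SOI]", *image_tokens, "[EOI]"]

def format_mmu(image_tokens, question_tokens):
    """Multimodal understanding: the task token, then the image, then the question."""
    return ["[MMU]", "[SOI]", *image_tokens, "[EOI]", "[SOT]", *question_tokens, "[EOT]"]

print(format_t2i(["a", "red", "bus"], ["<img_17>", "<img_901>", "<img_4055>"]))
```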

Training Objectives and Loss Functions

Show-o is trained using a combination of autoregressive and discrete diffusion modeling objectives designed to handle both text and image data efficiently. For autoregressive tasks like text generation and multimodal understanding, Show-o uses a Next Token Prediction (NTP) loss. This loss maximizes the likelihood of predicting the next text token in a sequence given all preceding tokens, including both text and image tokens. The NTP loss is represented mathematically as the sum of the logarithm of the conditional probability of each token given all previous tokens in the sequence. This loss helps the model learn dependencies between tokens to generate coherent text.
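In code, the NTP objective is just a cross-entropy over the text positions. The sketch below assumes the model has already produced one row of logits per position; the function name and masking convention are illustrative.

```python
import torch
import torch.nn.functional as F

def ntp_loss(logits: torch.Tensor, tokens: torch.Tensor, is_text: torch.Tensor) -> torch.Tensor:
    """Next Token Prediction loss (sketch).

    logits:  (seq_len, vocab_size) model outputs, one row per position
    tokens:  (seq_len,) ground-truth token ids (text and image interleaved)
    is_text: (seq_len,) True where the token is a text token
    The logits at position i, which have seen all earlier text and image tokens,
    are trained to predict the token at position i + 1 when that token is text.
    """
    pred_logits = logits[:-1]   # position i predicts token i + 1
    targets = tokens[1:]
    text_targets = is_text[1:]  # only text tokens contribute to the NTP loss
    return F.cross_entropy(pred_logits[text_targets], targets[text_targets])
```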
For tasks involving image generation, Show-o employs a Mask Token Prediction (MTP) loss. A random subset of the image tokens is replaced with a special mask token, and the model is trained to predict the original tokens at those positions from the masked sequence. The MTP loss maximizes the likelihood of the masked tokens given the surrounding unmasked image tokens and any conditioning text, teaching the model to recover, or denoise, the original image tokens from a corrupted version. Because the corruption is applied by masking in one pass rather than through a long chain of noising steps, this objective corresponds to a discrete diffusion formulation.
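Here is a minimal sketch of that objective, under the assumption of a reserved mask-token id and a model that maps a text prompt plus (partially masked) image tokens to one row of logits per image position; both the signature and the constant are hypothetical.

```python
import torch
import torch.nn.functional as F

MASK_TOKEN_ID = 8192  # hypothetical id reserved for the special [MASK] token

def mtp_loss(model, text_tokens, image_tokens, mask_ratio=0.5):
    """Mask Token Prediction loss (sketch).

    A random subset of the image tokens is replaced by the mask token; the model
    predicts the original ids at exactly those positions, conditioned on the
    surviving image tokens and the text prompt.
    """
    corrupted = image_tokens.clone()
    is_masked = torch.rand(image_tokens.shape) < mask_ratio
    corrupted[is_masked] = MASK_TOKEN_ID

    logits = model(text_tokens, corrupted)  # assumed: (num_image_tokens, vocab_size)
    return F.cross_entropy(logits[is_masked], image_tokens[is_masked])
```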

Discrete Diffusion Modeling in Show-o

Traditional diffusion models, such as Denoising Diffusion Probabilistic Models (DDPMs), add Gaussian noise to a continuous latent space over many steps and then denoise it step by step. Show-o instead uses a discrete diffusion process: image tokens are corrupted by randomly masking a fraction of them. Unlike continuous diffusion, which builds up corruption gradually over many timesteps, this masking is applied in a single pass by replacing the selected tokens with a mask token.
The model is trained to predict the original tokens at the masked positions from the unmasked tokens and any associated text. At generation time, the same model fills in masked tokens in parallel and refines its predictions over a small number of passes, far fewer than the hundreds or thousands of iterations typical of continuous diffusion models. The process keeps the essence of diffusion, corruption followed by reconstruction, while simplifying it into a much cheaper discrete procedure.
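At inference time this corresponds to a MaskGIT-style mask-and-predict loop. The sketch below is a simplified illustration under the same hypothetical mask-token id and model signature as above; Show-o's actual sampling schedule and guidance details differ.

```python
import torch

MASK_TOKEN_ID = 8192  # same hypothetical [MASK] id as in the training sketch

@torch.no_grad()
def fill_masked_tokens(model, text_tokens, tokens, steps=16):
    """Parallel mask-and-predict decoding (sketch).

    Every position holding MASK_TOKEN_ID is filled over a small number of
    refinement steps: the most confident predictions are committed first and the
    remaining positions stay masked for the next pass.
    """
    tokens = tokens.clone()
    for step in range(steps):
        logits = model(text_tokens, tokens)        # assumed: (num_tokens, vocab_size)
        probs, preds = logits.softmax(-1).max(-1)  # confidence and best guess per position

        still_masked = tokens == MASK_TOKEN_ID
        keep = int(still_masked.sum() * (step + 1) / steps)  # commit a growing share
        idx = still_masked.nonzero(as_tuple=True)[0]
        best = idx[probs[idx].topk(keep).indices]
        tokens[best] = preds[best]
    return tokens

# Text-to-image generation: start from a fully masked 16x16 grid of image tokens.
# image_tokens = fill_masked_tokens(model, prompt_tokens,
#                                   torch.full((256,), MASK_TOKEN_ID, dtype=torch.long))
```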

Show-o’s Training Data

Show-o's training data spans both text and image domains. Text-only data, such as the roughly one billion instances of RefinedWeb, maintains the model's language understanding and generation capabilities. Class-labeled image data, such as the roughly 1.3 million images of ImageNet-1K, helps Show-o associate class names with visual patterns, enabling class-conditional image generation. Paired image-text datasets, such as CC12M and LAION-Aesthetics-12M, provide the image-caption pairs that teach Show-o to map between visual features and language. Together, this data enables multimodal tasks such as generating images from text descriptions or producing detailed captions for given images.

Performance and Applications

Show-o demonstrates competitive performance across various multimodal benchmarks. In tasks like image captioning and visual question answering, it achieves results comparable to or better than existing models with larger parameter counts. In text-to-image generation, Show-o holds up well against dedicated generative models such as DALL-E 2 and SD3, which is notable given its comparatively small size and the small number of denoising passes its discrete diffusion decoding requires.
Show-o can perform text-guided inpainting and extrapolation without any additional fine-tuning. Given an input image with specific regions masked and a text prompt, Show-o fills in the missing areas by generating the appropriate image tokens that match the context and description provided. It can also generate video keyframes guided by text descriptions. For example, given a sequence of interleaved text descriptions and initial video frames, Show-o can predict the next set of frames or text, maintaining temporal consistency and visual coherence. This capability is particularly useful for tasks such as video generation with narrative guidance.
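Under the same hypothetical setup, inpainting and extrapolation reduce to choosing which image tokens to mask before decoding: only the patches to be repainted (or the blank border patches for extrapolation) are set to the mask token, and the rest of the image conditions the prediction. The helper below reuses the `fill_masked_tokens` sketch from the previous section; none of these names are Show-o's actual API.

```python
import torch

MASK_TOKEN_ID = 8192  # hypothetical [MASK] id, as in the earlier sketches

def inpaint(model, prompt_tokens, image_tokens, region_mask, steps=16):
    """Text-guided inpainting as masked-token prediction (sketch).

    region_mask: boolean vector over the image-token grid, True for the patches
    the user wants repainted. Untouched patches and the text prompt provide the
    surrounding context the model conditions on.
    """
    tokens = image_tokens.clone()
    tokens[region_mask] = MASK_TOKEN_ID
    return fill_masked_tokens(model, prompt_tokens, tokens, steps=steps)
```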

Conclusion and Future Directions

Show-o represents a major advancement in multimodal AI by unifying understanding and generation capabilities in a single model using a combination of autoregressive and discrete diffusion techniques. This allows Show-o to handle a wide range of tasks, from text-to-image generation to visual question answering, in a more flexible and efficient manner. Future developments may focus on expanding Show-o’s capabilities further, such as enabling more complex forms of mixed-modality generation, like generating long-form videos from detailed text descriptions, and enhancing its ability to generalize across even more diverse datasets and tasks.

