Meta's New Multimodal LLM: Transfusion
A new architecture for multimodal modeling
Transfusion is a novel method for training a single, unified model that handles both discrete (text) and continuous (image) data by combining language modeling and diffusion techniques. Developed by a team from Meta, Waymo, and the University of Southern California, Transfusion integrates next-token prediction and diffusion objectives to train a single transformer capable of generating high-quality text and images. The model scales better than previous methods that relied on quantizing images into discrete tokens, and it maintains strong performance across multiple benchmarks by using modality-specific encoding and decoding layers.
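To make the setup concrete, here's a minimal sketch (in PyTorch) of how a mixed text-and-image sequence might be assembled for a single transformer. The module names, dimensions, and the simple linear patch projection are illustrative placeholders, not the paper's actual components (the paper uses a VAE to produce latent patches, encoded with linear or U-Net layers):

```python
import torch
import torch.nn as nn

# Illustrative dimensions; not the paper's actual hyperparameters.
vocab_size, d_model, patch_dim = 32000, 512, 64

embed = nn.Embedding(vocab_size, d_model)  # discrete text tokens -> vectors
patchify = nn.Linear(patch_dim, d_model)   # continuous latent patches -> vectors

def build_sequence(text_ids, image_latents):
    """Project both modalities into the same embedding space and
    concatenate them into one sequence for the shared transformer."""
    text_vecs = embed(text_ids)           # (n_text_tokens, d_model)
    image_vecs = patchify(image_latents)  # (n_patches, d_model)
    # The paper brackets each image with special BOI/EOI tokens; omitted here.
    return torch.cat([text_vecs, image_vecs], dim=0)

# Example: 4 text tokens followed by a 3-patch image -> a 7-step sequence.
seq = build_sequence(torch.randint(0, vocab_size, (4,)),
                     torch.randn(3, patch_dim))
```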
Multimodal Challenges and Existing Solutions
Multimodal generative models face the complex task of processing both discrete elements, like text, and continuous data, such as images. Traditionally, discrete modalities have been dominated by language models trained with next-token prediction, while continuous data has been best handled by diffusion models. Previous attempts to bridge these approaches have often resulted in complex architectures with separately trained components, or have quantized images into discrete tokens, which can lose information. Transfusion instead trains a single model that predicts text tokens and diffuses images, integrating both modalities without sacrificing data fidelity. Here's the loss for Transfusion, a simple weighted sum of the two objectives:

$$\mathcal{L}_{\text{Transfusion}} = \mathcal{L}_{\text{LM}} + \lambda \cdot \mathcal{L}_{\text{DDPM}}$$
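As a rough illustration, this combined objective can be computed as below. This is a hedged sketch rather than the paper's training code: the tensor shapes, the noise-prediction interface, and the default value of lambda are my assumptions.

```python
import torch.nn.functional as F

def transfusion_loss(text_logits, text_targets, eps_pred, eps_true, lam=1.0):
    """Sum of the language-modeling and diffusion losses.

    text_logits: (batch, n_text, vocab) next-token predictions
    text_targets: (batch, n_text) ground-truth token ids
    eps_pred, eps_true: predicted vs. actual noise on image latent patches
    lam: balancing coefficient (the paper's lambda; default assumed here)
    """
    # Next-token cross-entropy over the text positions.
    l_lm = F.cross_entropy(text_logits.transpose(1, 2), text_targets)
    # Standard DDPM noise-prediction MSE over the image positions.
    l_ddpm = F.mse_loss(eps_pred, eps_true)
    return l_lm + lam * l_ddpm
```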
Key Features of Transfusion
Transfusion introduces a novel approach where a single transformer processes both text and images within the same training sequence. Text is handled with the next-token prediction objective, while images are processed through diffusion. The model uses causal attention for text and bidirectional attention within images, so every patch of an image can attend to every other patch of that image while the sequence as a whole remains autoregressive across modalities. To boost performance, Transfusion uses U-Net layers for image encoding and decoding, which significantly improves image generation quality compared to simpler linear projections.
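This attention pattern is easy to express as a mask. The sketch below is my own illustration, not code from the paper: it starts from a standard causal mask and opens up bidirectional attention within each image's span of patches.

```python
import torch

def transfusion_attention_mask(seq_len, image_spans):
    """Boolean attention mask: True means position i may attend to position j.

    image_spans: (start, end) index pairs (end exclusive) marking where
    image patches sit in the mixed sequence.
    """
    # Causal (lower-triangular) attention across the whole sequence.
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # Within each image, let every patch attend to every other patch,
    # including ones that come later in the sequence.
    for start, end in image_spans:
        mask[start:end, start:end] = True
    return mask

# Example: 4 text tokens, a 3-patch image, then 2 more text tokens.
mask = transfusion_attention_mask(9, image_spans=[(4, 7)])
```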
Performance and Scaling
Experiments demonstrate that Transfusion models scale efficiently across model sizes, from smaller configurations up to 7 billion parameters. The 7B Transfusion model, trained on 2 trillion multimodal tokens, shows competitive performance in both text and image generation, outperforming existing methods like Chameleon in computational efficiency: the authors report roughly twice Chameleon's performance on text-to-image and image-to-text tasks at lower computational cost.

Comparison with Existing Models
Transfusion's unique architecture allows it to rival dedicated diffusion models such as DALL-E 2, SDXL, and DeepFloyd while also generating text at a level comparable to leading language models like Llama. It thereby combines the strengths of diffusion and language modeling in a single unified model that can produce diverse outputs across modalities. Transfusion's results on benchmarks like GenEval further highlight its effectiveness, matching or exceeding state-of-the-art models in both text and image generation.
Potential Applications and Future Directions
Transfusion's ability to handle text and image data seamlessly opens up possibilities for applications that require multimodal understanding and generation, such as interactive AI, creative content generation, and integrated AI assistants. Its success also suggests a direction for future research: training models end-to-end on mixed modalities to improve adaptability and performance across a wide range of tasks.
Conclusion
Transfusion represents a significant step forward in multimodal AI, demonstrating that a single model can effectively integrate language modeling and diffusion techniques to handle diverse data types. By unifying these approaches, Transfusion sets a new standard for multimodal generative models, offering a scalable, efficient solution that bridges the gap between discrete and continuous data generation.
The Paper: "Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model" (https://arxiv.org/abs/2408.11039)