Researchers Speed up Diffusion Modeling 17.5x
Created on October 14|Last edited on October 14
A new advance in diffusion models has emerged with a technique called REPresentation Alignment (REPA), which speeds up the convergence of diffusion transformers by as much as 17.5x. Researchers from KAIST, Korea University, Scaled Foundations, and New York University introduced the technique in their paper, "Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think."
Background on Diffusion Models
Diffusion models generate high-dimensional data such as images and videos by learning to reverse a gradual noising process. They have gained significant attention for their ability to produce detailed, coherent images. Despite this success, diffusion models have struggled with efficient training because learning good representations from noisy data is hard. Historically, they have needed extensive training, often millions of steps, to reach peak performance.
The Problem of Representation Learning
The core of the problem lies in the difficulty of learning high-quality internal representations during training. Diffusion models must learn to extract relevant features from corrupted data, and this process has proven inefficient. Although some diffusion models naturally develop useful internal representations, their quality has lagged behind that of representations learned by recent self-supervised methods. This inefficiency contributes to the long training times seen in these models.
Introducing REPresentation Alignment (REPA)
REPA operates by aligning the noisy intermediate representations in diffusion models with clean external representations. Through this alignment, the diffusion model becomes more effective at learning key features early in the training process. This allows the model to focus more on capturing high-frequency details in later stages, accelerating the entire training process.
Instead of relying solely on the diffusion model to learn these representations from scratch, REPA aligns the diffusion transformer's internal states with clean representations obtained from pretrained visual encoders. By distilling high-quality representations from external models into the diffusion transformer, REPA ensures the model learns to recognize important features faster, even when working with noisy input data.

REPA functions as an auxiliary objective: it adds a regularization term to the diffusion model's primary training loss. The model is still trained on the core denoising task of predicting noise or reconstructing the original data, while the added term encourages the transformer's intermediate representations to align with clean visual representations from a pretrained encoder such as DINOv2.
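The combined objective can be sketched in a few lines of PyTorch. This is a minimal illustration, not the authors' code: the function names, the use of patch-wise cosine similarity, and the weighting factor `lam` are assumptions chosen to mirror the description above (a denoising loss plus an alignment regularizer computed through a small projection head).

```python
import torch
import torch.nn.functional as F

def repa_alignment_loss(hidden_states, encoder_feats, projector):
    """Alignment term: push projected diffusion-transformer hidden states
    toward clean features from a frozen pretrained encoder.

    hidden_states: (B, N, D_model) intermediate transformer activations
    encoder_feats: (B, N, D_enc)   patch features from the pretrained encoder
    projector:     maps D_model -> D_enc (e.g., a small MLP)
    """
    projected = projector(hidden_states)                          # (B, N, D_enc)
    sim = F.cosine_similarity(projected, encoder_feats, dim=-1)   # (B, N)
    return -sim.mean()  # maximizing similarity = minimizing negative similarity

def repa_total_loss(denoising_loss, hidden_states, encoder_feats,
                    projector, lam=0.5):
    """Core denoising objective plus the weighted alignment regularizer."""
    return denoising_loss + lam * repa_alignment_loss(
        hidden_states, encoder_feats, projector)
```

In practice the pretrained encoder is kept frozen, so the alignment term only shapes the diffusion transformer's internal states; the projection head is discarded at inference time.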
Performance Gains with REPA
When applied to popular diffusion transformers such as DiTs (Diffusion Transformers) and SiTs (Scalable Interpolant Transformers), REPA showed striking improvements. For instance, a SiT-XL model that traditionally required 7 million steps to reach a given performance level could achieve similar results in fewer than 400,000 steps. The gains are not only in speed but also in the quality of the generated images, with the final model achieving a state-of-the-art FID of 1.42 on ImageNet 256x256.
Key Results and Future Applications
The impact of REPA goes beyond faster convergence. With diffusion models becoming a cornerstone of image and video generation tasks, the ability to train these models more efficiently will significantly reduce computational costs and democratize access to high-quality generative AI. Additionally, the alignment with external pretrained representations opens the door to more flexible and robust generative models, potentially improving results across various tasks from art creation to scientific simulations.
This breakthrough suggests that diffusion transformers may not need to learn everything from scratch. By leveraging existing knowledge through REPA, diffusion models can achieve high performance much faster, making this approach a potential game-changer in generative AI.
Tags: ML News