
Stanford Researchers Present New Breakthrough For Diffusion Models

Stanford researchers introduce a novel way to train diffusion models
Created on February 16|Last edited on February 16
Researchers from Stanford have unveiled a new architecture that can be attached to existing pre-trained diffusion models. It enables efficient fine-tuning of large diffusion models and adds support for alternative conditioning inputs ("conditional input forms"), such as hand-drawn images, alongside traditional text prompts. In practice, this lets creative designers communicate with diffusion models through more intuitive inputs, like hand-drawn features, rather than relying on text prompts alone.

What's New?

The core contribution of the paper "Adding Conditional Control to Text-to-Image Diffusion Models" is called a ControlNet. A ControlNet can be added to "network blocks" within the original foundation model. The authors define a network block as a set of neural network layers that are frequently grouped together and used as a unit in the overall architecture.
Examples include commonly used blocks like a ResNet block or a multi-head attention block. Before explaining the ControlNet, it's helpful to visualize the block, as seen below.




As seen above, the ControlNet takes as input a conditioning vector c, an embedding that describes the extra input created by the user (e.g., a hand-drawn image). This conditioning is applied to several blocks in the original network.
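At the level of a single block, the idea can be sketched as follows: the original block stays frozen, a trainable copy of it receives the conditioning features through a zero-initialized convolution, and the copy's output is added back to the frozen block's output through a second zero-initialized convolution. The PyTorch-style sketch below illustrates that structure under simplifying assumptions; ControlledBlock, zero_conv, and the assumption that c has already been encoded to the same shape as the block's input are illustrative choices, not the authors' implementation.

```python
import copy

import torch
import torch.nn as nn


def zero_conv(channels):
    # 1x1 convolution whose weights and bias start at exactly zero,
    # so it contributes nothing to the output before training begins.
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv


class ControlledBlock(nn.Module):
    """Block-level sketch: a frozen original block plus a trainable copy gated by zero convolutions."""

    def __init__(self, pretrained_block, channels):
        super().__init__()
        self.trainable = copy.deepcopy(pretrained_block)  # trainable copy of the block
        self.frozen = pretrained_block
        for p in self.frozen.parameters():
            p.requires_grad = False  # original weights stay locked

        self.zero_in = zero_conv(channels)   # injects the condition c into the copy
        self.zero_out = zero_conv(channels)  # gates the copy's contribution

    def forward(self, x, c):
        # Frozen path: identical to the original pre-trained block.
        y = self.frozen(x)
        # Trainable path: the condition enters through one zero convolution,
        # and the copy's output is added back through another.
        y_c = self.zero_out(self.trainable(x + self.zero_in(c)))
        return y + y_c
```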
Because the weights of the original network blocks are frozen, overfitting is avoided even with small amounts of training data. This matters because many of the datasets applicable to conditional image generation tasks are far smaller than the text-image datasets commonly used to train traditional diffusion models. The use of zero convolutions also allows the parameters of the trainable network blocks to be learned gradually, starting from zero, so no noise is added to the original deep features at the start of training.
Before a single optimization step is performed, the zero convolutions contain strictly zero-valued parameters, so the ControlNet behaves as if the trainable portion were not present. This leads to faster training and performance comparable to fine-tuning a traditional diffusion model.
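This identity-at-initialization property can be checked directly with the ControlledBlock sketch above; the stand-in convolutional block and tensor shapes below are illustrative assumptions.

```python
# Illustrative check: before any optimization step, the zero convolutions
# output zeros, so the wrapped block reproduces the frozen block exactly.
block = nn.Conv2d(64, 64, kernel_size=3, padding=1)  # stand-in "network block"
controlled = ControlledBlock(block, channels=64)

x = torch.randn(1, 64, 32, 32)  # deep features from the diffusion model
c = torch.randn(1, 64, 32, 32)  # encoded conditioning input (e.g., a sketch)

with torch.no_grad():
    assert torch.allclose(controlled(x, c), block(x))
```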

Cornerstone Paper

This work will likely be viewed as a major contribution to the world of image generation research, as it serves as a scalable way for humans to communicate with image generation models.
As AI models increase in capability, the need for methods that make them accessible to everyday consumers grows. The authors state that the new method is practical even on consumer GPUs, such as an NVIDIA RTX 3090 Ti, and can match the performance of methods trained with orders of magnitude more compute.

The paper: "Adding Conditional Control to Text-to-Image Diffusion Models" — https://arxiv.org/abs/2302.05543

Tags: ML News