Mid-U Guidance: Fast Classifier Guidance for Latent Diffusion Models
Introducing a new method for diffusion model guidance with various advantages over existing methods, demonstrated by adding aesthetic guidance to Stable Diffusion.

Example images created with SD2.1 + mid-u guidance with an aesthetic loss
Classifier guidance allows us to add additional control to diffusion model sampling. Examples include using CLIP to generate samples that match a text or image prompt, or using a pre-trained image classification model to generate specific classes of images. However, existing methods typically operate on image inputs, adding a costly performance penalty when applied to latent diffusion models.
This article introduces a new, faster approach to model guidance and demonstrates it through the example of aesthetic guidance: we train a model on human aesthetic preferences and apply it at inference time to an existing Stable Diffusion model, producing more pleasing outputs with very little computational overhead.
Here's what we'll be covering:
Table of Contents
- Introduction to Classifier Guidance
- Implementation
- Training
- Sample Results
- Evaluation
- Beyond Aesthetic Guidance
- Conclusion
Introduction to Classifier Guidance
Model-based guidance (a.k.a. classifier guidance) is a technique whereby an additional model is used to steer the generation process of a diffusion model. The classifier is used to calculate some kind of loss signal (for example, how well the generated image matches a text prompt or image class) and the derivative of this loss signal is used to update the noisy input x between inference steps.
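In code, the core of that update looks roughly like the sketch below, where `guidance_loss_fn`, `guidance_scale` and the noisy sample `x` are placeholders for whatever loss, strength and sample a particular setup uses:

```python
import torch

def classifier_guidance_step(x, guidance_loss_fn, guidance_scale):
    """One conceptual guidance update: nudge the noisy sample x so as to
    reduce a classifier-based loss. The loss function and scale are placeholders."""
    x = x.detach().requires_grad_(True)
    loss = guidance_loss_fn(x)                    # e.g. "how far is x from the target class?"
    grad = torch.autograd.grad(loss, x)[0]        # backprop through the auxiliary model
    return (x - guidance_scale * grad).detach()   # move x in the direction that lowers the loss
```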
Since the models used typically operate in pixel space, this operation is compute- and memory-intensive: it requires tracing gradients back through the classification model, the VAE decoder (in the case of latent diffusion models) and the diffusion UNet itself at each inference step.

Comparing the normal approach to guidance of latent diffusion models (a) with the proposed mid-u technique (b), which requires far less computation and memory due to a much shorter path through which the gradients of the guidance loss must be traced.
The core insight behind mid-u guidance is that the diffusion UNet itself creates rich representations of its inputs internally, which we can exploit as a starting point for our classifier. Because the UNet takes in the timestep and prompt as additional conditioning at multiple stages, these internal representations capture not only a rich representation of the current noisy input x but also encode information about the prompt and the noise level, which will be useful additional information for the guidance models to work with.
Implementation
In this article we use the output activations of the mid-block in Stable Diffusion's UNet, but in theory any internal features would likely be good candidates. The diagram below shows the UNet architecture and the point at which we save the internal features:

Architecture of the UNet in Stable Diffusion
To save these outputs during the normal forward pass of the UNet, we can register a forward hook with the relevant module like so:
```python
import torch
from diffusers import StableDiffusionPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-base"
).to(device)

# Store the mid-block's output on the module itself every time the UNet runs
def hook_fn(module, input, output):
    module.output = output

pipe.unet.mid_block.register_forward_hook(hook_fn)
```
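To check that the hook is capturing what we expect, we can run a single UNet forward pass on random latents. This is just a quick sanity check; the prompt and timestep below are arbitrary placeholders:

```python
# Quick sanity check (assumes the pipeline and hook from above)
prompt_ids = pipe.tokenizer(
    "a photo", padding="max_length",
    max_length=pipe.tokenizer.model_max_length, return_tensors="pt",
).input_ids.to(device)
text_emb = pipe.text_encoder(prompt_ids)[0]      # (1, 77, 1024) for SD 2.1

latents = torch.randn(1, 4, 64, 64).to(device)   # latents for a 512px image
t = torch.tensor([500], device=device)           # arbitrary timestep

with torch.no_grad():
    pipe.unet(latents, t, encoder_hidden_states=text_emb)

print(pipe.unet.mid_block.output.shape)          # torch.Size([1, 1280, 8, 8])
```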
For a 512px square image these mid-block outputs have shape (1280, 8, 8) per image. We can build a classification model on top of these features in a number of ways. Here is the architecture used for the examples in this article: several convolution and pooling layers reduce the input down to 512 features, which are then fed through a couple of linear layers to produce the final output in the desired shape (10 classes for this example).
```python
import torch.nn as nn

# Small classifier head on top of the (1280, 8, 8) mid-block features
model = nn.Sequential(
    nn.Conv2d(1280, 256, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2, 2),
    nn.Conv2d(256, 128, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(output_size=(2, 2)), nn.Flatten(),   # -> 128 * 2 * 2 = 512 features
    nn.Linear(128 * 4, 64), nn.ReLU(),
    nn.Linear(64, 10),                                        # 10 rating classes
)
```
To train this model as an aesthetic classifier, we can follow these steps (sketched in code after the list):
- Load a dataset of images with prompts or captions and aesthetic ratings (for example, the Simulacra Aesthetic Captions dataset used later in this article)
- Encode a batch of images and add noise to the latents
- Feed these through the UNet alongside the prompts
- Store the mid-u outputs using the hook shown above
- Feed these mid-u outputs to the score model as the input with the rating as the prediction target
- Repeat a bunch of times
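Here is a rough sketch of how those (mid-block output, rating) pairs could be generated. This is not the exact code used for the article; `dataset` is assumed to yield (image, caption, rating) examples, with images already preprocessed to the range Stable Diffusion's VAE expects:

```python
# Sketch: generate (mid-block features, rating) training pairs.
# Assumes `pipe` (with the hook registered) and a `dataset` of
# (image, caption, rating) examples, both placeholders here.
import torch

features, targets = [], []
with torch.no_grad():
    for image, caption, rating in dataset:
        # Encode the image to latents with the VAE (scaled as in Stable Diffusion)
        latents = pipe.vae.encode(image.unsqueeze(0).to(device)).latent_dist.sample()
        latents = latents * pipe.vae.config.scaling_factor

        # Add noise for a random timestep, as during diffusion training
        t = torch.randint(0, pipe.scheduler.config.num_train_timesteps, (1,), device=device)
        noisy_latents = pipe.scheduler.add_noise(latents, torch.randn_like(latents), t)

        # Run the UNet with the caption; the hook stores the mid-block output
        text_ids = pipe.tokenizer(caption, padding="max_length",
                                  max_length=pipe.tokenizer.model_max_length,
                                  return_tensors="pt").input_ids.to(device)
        text_emb = pipe.text_encoder(text_ids)[0]
        pipe.unet(noisy_latents, t, encoder_hidden_states=text_emb)

        features.append(pipe.unet.mid_block.output.squeeze(0).cpu())
        targets.append(rating)
```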

"A horse on the beach" generated without guidance (left) vs with the additional aesthetic guidance (right), starting from the same initial seed.
At inference, we apply the usual sampling loop but add additional code to pass the mid-block outputs through the score model, calculate the gradient of the aesthetic loss, and use these gradients (with some scaling to control guidance strength) to modify the noisy latents in a direction that hopefully increases the aesthetic appeal of the final result. You can see the code in this minimal inference notebook.
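In outline, the extra guidance step could look something like the sketch below. This is not the notebook's exact code: the "negative expected rating" loss used here is one plausible way to turn the ten class logits into a single aesthetic score, and classifier-free guidance is omitted for brevity.

```python
# Sketch of one denoising step with mid-u guidance added. Assumes `pipe`
# with the hook registered, the trained `aesthetic_model`, the current
# `latents`, timestep `t` and prompt embeddings `text_emb`.
midu_scale = 50  # guidance strength (placeholder value)

with torch.enable_grad():
    latents = latents.detach().requires_grad_(True)

    # UNet forward pass; the hook captures the mid-block features
    noise_pred = pipe.unet(latents, t, encoder_hidden_states=text_emb).sample

    # Aesthetic loss: negative expected rating under the predicted distribution
    logits = aesthetic_model(pipe.unet.mid_block.output)
    ratings = torch.arange(1, 11, device=logits.device, dtype=logits.dtype)
    loss = -(logits.softmax(-1) * ratings).sum(-1).mean()

    # The guidance gradients only travel back from the mid-block to the input
    # (no VAE decoder or pixel-space classifier in the path)
    grad = torch.autograd.grad(loss, latents)[0]

# Nudge the latents towards a higher predicted aesthetic score,
# then take the usual scheduler step
latents = (latents - midu_scale * grad).detach()
latents = pipe.scheduler.step(noise_pred.detach(), t, latents).prev_sample
```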
Training
Training was carried out using the miniai library, which is being developed as part of an ongoing course with FastAI. The training examples of the form (mid-block output, rating) were generated in advance, and loaded in batches of 128.
The model was initialised with Kaiming init and trained for a single epoch using the Adam optimizer and a one_cycle learning rate schedule. Two versions were trained: one on mid-block features from images with different amounts of noise added, and one that skips the noising step.
Predicting the aesthetic rating from the noisy versions is potentially more useful since the model will be operating on noisy inputs during inference.
The data is based on 128,000 images from the Simulacra Aesthetic Captions dataset. The ratings are extremely noisy, since this was crowd-sourced from many different contributors and many images only received a single (possibly biased) rating.
We could have treated this as a regression problem, but instead chose to train it as a classification task (predicting the rating as one of 10 classes) and later interpret the model outputs into a single score.
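Putting those choices together, a plain-PyTorch equivalent of the training loop might look roughly like this. The article used miniai, so this is only a sketch; `features` and `targets` are the pre-generated pairs from earlier, and the ratings are assumed to be integers from 1 to 10 mapped onto classes 0-9:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

# Pre-generated (mid-block features, rating) pairs, loaded in batches of 128
features_t = torch.stack(features)                        # (N, 1280, 8, 8)
targets_t = torch.tensor(targets, dtype=torch.long) - 1   # ratings 1-10 -> classes 0-9 (assumption)
loader = DataLoader(TensorDataset(features_t, targets_t), batch_size=128, shuffle=True)

# Kaiming initialisation for the conv/linear weights
def init_weights(m):
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        nn.init.kaiming_normal_(m.weight)
model.apply(init_weights)
model.to(device)

opt = torch.optim.Adam(model.parameters(), lr=1e-3)       # lr is a placeholder
sched = torch.optim.lr_scheduler.OneCycleLR(opt, max_lr=1e-3, total_steps=len(loader))

# Single epoch, cross-entropy over the 10 rating classes
for feats, labels in loader:
    logits = model(feats.to(device))
    loss = F.cross_entropy(logits, labels.to(device))
    opt.zero_grad()
    loss.backward()
    opt.step()
    sched.step()
```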
This approach could be adapted for class-based guidance, or the targets and loss function could be switched out for some different objective such as aligning with CLIP text embeddings.
Sample Results
Below are some example images with different guidance scales logged in a W&B Table. A fixed seed is used for each row to allow for better comparison.
We can also use a negative guidance scale to create 'unaesthetic' images, which is more entertaining than it should be:

"James Bond" with no guidance (left) vs guidance with a negative scale to get an unaesthetic output (right)
All of these examples are based on the Stable Diffusion 2.1 Base model.
Evaluation

The minimal rating interface has the user select their preferred image, implemented with ipywidgets in a Colab notebook
We can ask human volunteers to choose between two images generated from the same seed and prompt, one with our guidance technique and one without, to estimate preference.
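A minimal version of that rating interface could be put together with ipywidgets along these lines. This is a sketch only, not the exact notebook code; the image arguments are placeholders for a guided/unguided pair:

```python
import io
import ipywidgets as widgets
from IPython.display import display

def to_widget(pil_image):
    """Convert a PIL image into an ipywidgets Image."""
    buf = io.BytesIO()
    pil_image.save(buf, format="png")
    return widgets.Image(value=buf.getvalue(), width=256)

def rate_pair(image_a, image_b, choices):
    """Show two images side by side and record which one the user prefers."""
    left = widgets.Button(description="Prefer left")
    right = widgets.Button(description="Prefer right")
    left.on_click(lambda _: choices.append("left"))
    right.on_click(lambda _: choices.append("right"))
    display(widgets.HBox([widgets.VBox([to_widget(image_a), left]),
                          widgets.VBox([to_widget(image_b), right])]))

choices = []
# rate_pair(unguided_image, guided_image, choices)  # shuffle sides in practice
```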
In a small test with my family picking favorites, people preferred the outputs from the aesthetic guidance around 70% of the time, and tended to have stronger preferences when the guidance scale was higher (at scale=10 and below the effect was almost indistinguishable, with preference scores dropping to ~50%).
A larger study with a better model and more thorough exploration of the effect of guidance scale is recommended before you take these results too seriously!
Beyond Aesthetic Guidance
Aesthetic guidance is a great way to demonstrate some of the potential of this kind of guidance, but it is by no means the only potential use-case.
Some additional ideas that can leverage this approach:
- Generating images of a specific class (e.g. 'German Shepherd') using a classification model.
- CLIP guidance by training a model to produce CLIP embeddings that align with an existing CLIP text or image encoder.
- Discriminator guidance, in which a classifier is trained to discriminate between real images and synthetic ones based on these mid-block features.
Discriminator guidance in particular seems a fruitful direction for further exploration, with the potential to reduce the kinds of noticeable artefacts that plague current-generation models.
Conclusion
Training models for guidance on the internal representations of a latent diffusion model seems like a promising avenue for exploration. My hope is that this article encourages the community to play around with this concept further, since I myself have run out of time for this project at present.