
Multi-Resolution Noise for Diffusion Model Training

Fixing a potential issue with current approaches to diffusion model training by using a new noising approach
Created on February 28 | Last edited on May 8
Stable Diffusion (left) can't create super-dark images, until we add our trick (right)!
This report proposes a new noising approach that adds multi-resolution noise to an image or latent image during diffusion model training. A model trained with this technique can generate stunning images with a very different aesthetic to the usual diffusion model outputs. This seems like a promising direction for future research.

What's in a Noise Schedule?

Recent work such as 'On the Importance of Noise Scheduling for Diffusion Models' by Ting Chen has drawn attention to an oft-overlooked aspect of diffusion model training: the noise schedule and its relation to the image resolution and input scaling. As Chen's paper (and the concurrent work 'simple diffusion') show, addressing these nuanced design decisions can make a huge improvement in the quality and training stability of diffusion models.

Adding the same amount of noise to images at different resolutions illustrates how resolution affects the 'signal-to-noise' ratio, and thus how rapidly information is destroyed
One core issue behind this is that the random noise used in the diffusion process is inherently high-frequency, while images contain a lot of low-frequency components, meaning there is high redundancy between adjacent pixels. Adjusting the noise schedule to shift the signal-to-noise ratio fixes part of the problem, but a core issue remains: high-frequency details are destroyed far faster than low-frequency details.
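To make that concrete, here is a quick numerical sketch (my own illustration, not from either paper): add unit Gaussian noise to a smooth test signal and measure how well its low- and high-frequency bands survive, via correlation with the clean signal's corresponding bands:

import torch
import torch.nn.functional as F

# A smooth-ish test signal: cumulative sums give strong low-frequency content
x = torch.randn(1, 3, 256, 256).cumsum(-1).cumsum(-2)
x = (x - x.mean()) / x.std()
noisy = x + torch.randn_like(x)  # one heavy step of Gaussian noising

def lowpass(t, blocks=8):
    # Crude low-pass filter: average over large blocks, then upsample back
    small = F.avg_pool2d(t, kernel_size=t.shape[-1] // blocks)
    return F.interpolate(small, size=t.shape[-2:], mode='bilinear')

for name, band in [("low", lowpass), ("high", lambda t: t - lowpass(t))]:
    a, b = band(x).flatten(), band(noisy).flatten()
    print(name, torch.corrcoef(torch.stack([a, b]))[0, 1].item())

The low band stays strongly correlated with the clean signal while the high band is drowned out almost immediately, which is exactly the imbalance described above.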
In this report I introduce a new noising approach that builds on the idea of 'offset noise', using multi-resolution noise to degrade all aspects of the image signal. The resulting model can generate more diverse images than regular Stable Diffusion, including the extremely light or dark images that have historically been hard to achieve without resorting to a large number of sampling steps.

Offset Noise

The idea of offset noise was introduced in this blog post by Nicholas Gutenberg. The video below covers the topic well, and prompted me to take a closer look at the topic of offset noise (which I had previously skipped past, assuming it didn't make much difference).


The solution Nicholas came up with is to rewrite the noising function from noise = torch.randn_like(latents) to noise = torch.randn_like(latents) + 0.1 * torch.randn(latents.shape[0], latents.shape[1], 1, 1).
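Wrapped up as a drop-in function (my own packaging of that snippet; the function name, the strength parameter and the device/dtype handling are assumptions rather than part of the original post), it might look like this:

import torch

def offset_noise_like(latents, offset_strength=0.1):
    noise = torch.randn_like(latents)
    # One extra random value per sample and per channel, broadcast across all
    # pixels, shifts the mean of the noise so the model also sees the average
    # colour of the image corrupted.
    offset = torch.randn(latents.shape[0], latents.shape[1], 1, 1,
                         device=latents.device, dtype=latents.dtype)
    return noise + offset_strength * offset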
This means that the model learns to change the zero-frequency component (i.e. the average color of the image) much more quickly, giving it more freedom to create extremely dark or extremely light images. Here is a before (left) vs after (right) comparison on some example prompts from the blog post:
Images generated with SD (left) vs a version fine-tuned with offset noise (right). Note the left images are never extremely light or dark.
He ends the post with an appeal: "So I'd like to conclude this with a request to those involved in training these large models: please incorporate a little bit of offset noise like this into the training process the next time you do a big run. It should significantly increase the expressive range of the models, allowing much better results for things like logos, cut-out figures, naturally bright and dark scenes, scenes with strongly colored lighting, etc. It's a very easy trick!" It didn't take long for the community to oblige, and soon Illuminati Diffusion v1.1 was being used to create amazing images.

Noise at different frequencies

The offset noise trick addresses part of the issue by effectively combining very high-frequency noise with extremely low-frequency noise (the offset). But it felt like there should be an even better way, one that mixes noise at many different frequencies to erase all parts of the signal more evenly. Enter 'Pyramid Noise':
Comparing the noise used in normal DM training (left) with pyramid noise (right)
The idea is to create noise at several resolutions and add them together, optionally scaling down the lower-resolution components by some factor. Here is the algorithm in code:
import random

import torch
import torch.nn as nn

def pyramid_noise_like(x, discount=0.9):
    b, c, w, h = x.shape # EDIT: w and h get over-written, rename for a different variant!
    u = nn.Upsample(size=(w, h), mode='bilinear')
    noise = torch.randn_like(x)
    for i in range(10):
        r = random.random()*2 + 2 # Rather than always going 2x, use a random scale factor in [2, 4)
        w, h = max(1, int(w/(r**i))), max(1, int(h/(r**i)))
        noise += u(torch.randn(b, c, w, h).to(x)) * discount**i
        if w == 1 or h == 1: break # Lowest resolution is 1x1
    return noise/noise.std() # Scaled back to roughly unit variance
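As a quick sanity check, you can call this on a stand-in latent tensor (the shape below assumes Stable Diffusion's 4-channel latents for 512x512 images; this usage snippet is mine, not from the training script):

latents = torch.randn(4, 4, 64, 64)  # stand-in for VAE-encoded latents
noise = pyramid_noise_like(latents, discount=0.9)
print(noise.shape, noise.std())  # same shape as the input, roughly unit std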
This is just my first idea. You could also play with Perlin noise or come up with your own approach. It's worth noting that others in the AI art community and I were playing with "perlin init" as early as 2020 with CLIP-guided diffusion, and even before that I was stacking noise at different levels with my ImStack library. So ideas around working with multi-resolution image stacks and/or multi-resolution noise have been floating about for a while without getting much notice...

Training

For an initial test, I followed the Hugging Face fine-tuning example with only a few minor modifications to the original training script:
  • Using pyramid_noise_like(latents) instead of torch.randn_like(latents) to generate the noise used to corrupt the images during training (see the sketch after this list)
  • Logging some stats to W&B. I later realized that I could have just passed in `--report_to=wandb` since the script already supports W&B logging!
  • Tweaking the dataloader code to accommodate the data format I had handy.
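For context, here is a hedged sketch of where that first change lands. The variable names (noise_scheduler, latents, timesteps) follow the Hugging Face train_text_to_image.py example, but the tensors below are stand-ins so the snippet runs on its own rather than being the actual script:

import torch
from diffusers import DDPMScheduler

noise_scheduler = DDPMScheduler(num_train_timesteps=1000)
latents = torch.randn(4, 4, 64, 64)  # stand-in for VAE-encoded image latents
timesteps = torch.randint(0, noise_scheduler.config.num_train_timesteps,
                          (latents.shape[0],))

noise = pyramid_noise_like(latents)  # was: noise = torch.randn_like(latents)
noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)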
You can see my exact training settings in the run details. Update: a second training run with a discount factor of 0.8 in the pyramid noise, a lower learning rate and a larger batch size seems to have turned out even better.



Like Nicholas, I'm going to leave the hard work of further exploring this to those with more time and compute on their hands ;)

Results

Even after a few minutes of training, the model was able to generate much darker or lighter images than base Stable Diffusion.
Prompt "A white image" - after only a few hundred training steps of an initial test run
The hard part was avoiding over-fitting, to keep the power of the base SD model intact. I kept the learning rate fairly low and made sure to use EMA in the hope of avoiding anything too catastrophic, and I trained on a dataset large enough that no image was seen more than once during training.

[W&B panel: run set (2 runs)]

Above is a selection of prompts run through both vanilla Stable Diffusion and my fine-tuned version trained with pyramid noise. Some are a little glitchy, but I love the dark torch-lit streets! Which do you prefer?

Conclusions

If you'd like to try the model, you can find it on Hugging Face here (update, better model here) although this current iteration is mostly a proof-of-concept. If you build on this idea, please cite this report and message me or tag me on Twitter (@johnowhitaker) so that I can see if this has been useful!
Hopefully this post has shown that there is still a lot for us to learn about how to best train diffusion models. I hope this inspires some follow-on work, and that together we can keep on making these amazing tools even better :)

PS

If you're exploring this, some extra notes:
  • I've yet to test this successfully on longer runs. It needs some thought around preserving variance, how you tweak the sampling approach (using pyramid noise for the initial latents as well is probably needed to get a nice variety of outputs), etc.
  • I have noticed that setting the discount factor lower (0.8 or 0.6; EDIT: even lower, like 0.5, is good!) seems to help, which makes sense: as it stands, this corrupts images much more than the original noise formulation. It's probably worth re-thinking the actual noise schedule too; all TBD in future experiments.
  • This pyramid noise is just my hacky first implementation of the idea. Also look at pink noise, Perlin noise and other ways to corrupt the signal at different frequencies...
  • Some discussion and experiments around this are happening in the IlluminatiAI Discord.
  • I have yet to try this with a from-scratch diffusion model but will update here when I do.