A Technical Guide to Diffusion Models for Audio Generation
Diffusion models are jumping from images to audio. Here's a look at their history, their architecture, and how they're being applied to this new domain
Introduction
Diffusion models are having a moment. Likely the most notable of these is the recently-released Stable Diffusion model, but that feels like just the beginning. In fact, with recent work by the team at Harmonai, diffusion models are jumping domains, from image generation to audio generation.
In this report, we'll take a look at some of the technical underpinnings of diffusion models, focusing first on their history, then their architecture, and finishing with a light tutorial on audio generation with the Harmonai Colab. Let's jump right in, shall we?
A Brief History of Diffusion Models
Diffusion models work by destroying and then recovering (or noising, then de-noising) the data they're trained on. More technically, they're inspired by non-equilibrium thermodynamics, as described by Sohl-Dickstein et al. [1].
Diffusion models are a special case of Markov random fields (MRFs), in which a Markov chain of diffusion steps slowly adds noise to the sample data. The model then learns to reverse the diffusion process, constructing novel data samples out of that noise. In their 2015 paper, Deep Unsupervised Learning using Nonequilibrium Thermodynamics, the authors show that you can have a model learn to reverse a diffusion process that perturbs data with noise, resulting in novel data. Independently of this line of research, Song et al. began working on score-based generative modeling around 2019, an approach which, like diffusion models, perturbs data with multiple scales of noise.
At the time, however, researchers didn't view the two fields, score-based generative models and diffusion models, as anything more than superficially related. That changed in 2020, when researchers showed that the evidence lower bound (ELBO) used to train diffusion probabilistic models, an objective that lets you rewrite intractable statistical inference problems as tractable optimization problems [2], is essentially equivalent to the score-matching objectives used in score-based generative modeling [3].
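Schematically, the connection looks like this (with the weighting terms glossed over; see the reading list below for full derivations): the simplified diffusion objective trains a network to predict the noise added to a sample, the score-matching objective trains a network to predict the score of the noised data, and the two targets differ only by a rescaling.

$$
L_{\text{simple}} = \mathbb{E}_{t,\,x_0,\,\epsilon}\Big[\big\lVert \epsilon - \epsilon_\theta(x_t, t) \big\rVert^2\Big],
\qquad
L_{\text{score}} = \mathbb{E}_{t,\,x_0,\,x_t}\Big[\lambda(t)\,\big\lVert s_\theta(x_t, t) - \nabla_{x_t}\log q(x_t \mid x_0) \big\rVert^2\Big],
$$

$$
\text{with}\quad s_\theta(x_t, t) = -\,\frac{\epsilon_\theta(x_t, t)}{\sqrt{1 - \bar{\alpha}_t}}.
$$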
In their ICLR 2021 paper, Song et al. showed that score-based generative models and diffusion probabilistic models can both be viewed as "discretizations to stochastic differential equations determined by score functions." Ongoing work on diffusion models has found applications not only in image reconstruction (for example, reconstructing medical imagery) but also in domains as varied as molecule generation and defending against adversarial attacks on 3D point clouds, which could prove important for autonomous vehicles.
Let's take a look at the architecture behind most diffusion models today.
General Diffusion Model Architecture
Diffusion models tend to rely on the U-Net architecture as their backbone. Looking at the illustration below from the original U-Net paper, we can imagine training an image diffusion model with a U-Net: the left-hand side of the network is the down-sampling path, the right-hand side is the up-sampling path, and skip connections carry features from each of the four down-sampling stages over to the corresponding up-sampling stages.

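To make that shape concrete, here's a minimal sketch of a U-Net-style encoder-decoder in PyTorch, with skip connections joining the down-sampling and up-sampling paths. The depth and channel counts are illustrative assumptions, and a real diffusion backbone would also condition on the timestep (and often on text or other embeddings), which is omitted here.

```python
# A minimal U-Net-style backbone: a down-sampling path, an up-sampling path,
# and skip connections joining matching resolutions. Channel counts and depth
# are illustrative, not the configuration of any particular diffusion model.
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    def __init__(self, in_channels=3, base=32):
        super().__init__()
        self.pool = nn.MaxPool2d(2)
        self.down1 = conv_block(in_channels, base)        # full resolution
        self.down2 = conv_block(base, base * 2)           # 1/2 resolution
        self.bottleneck = conv_block(base * 2, base * 4)  # 1/4 resolution
        self.up2 = nn.ConvTranspose2d(base * 4, base * 2, kernel_size=2, stride=2)
        self.dec2 = conv_block(base * 4, base * 2)        # skip + upsampled features
        self.up1 = nn.ConvTranspose2d(base * 2, base, kernel_size=2, stride=2)
        self.dec1 = conv_block(base * 2, base)
        self.out = nn.Conv2d(base, in_channels, kernel_size=1)

    def forward(self, x):
        d1 = self.down1(x)                  # skip feature at full resolution
        d2 = self.down2(self.pool(d1))      # skip feature at 1/2 resolution
        b = self.bottleneck(self.pool(d2))
        u2 = self.dec2(torch.cat([self.up2(b), d2], dim=1))   # skip connection
        u1 = self.dec1(torch.cat([self.up1(u2), d1], dim=1))  # skip connection
        return self.out(u1)

# Example: a batch of 64x64 RGB inputs comes out at the same shape
net = TinyUNet()
x = torch.randn(8, 3, 64, 64)
print(net(x).shape)  # torch.Size([8, 3, 64, 64])
```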
The diffusion model, like other generative models, learns to slowly denoise a data sample, starting from a sample that is made up entirely of noise. As shown in the image below, the forward Markov-chain diffusion process, q, adds Gaussian noise to a source image until a 'pure noise' image is created. Then the reverse denoising process, pθ, runs in the opposite direction, and at its conclusion you're left with a noise-free image.

From the Denoising Diffusion Probabilistic Models paper by Ho et al.: https://arxiv.org/pdf/2006.11239.pdf

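The forward process has a convenient closed form: you can jump straight to any noise level t by blending the clean sample with Gaussian noise. Here's a small sketch of that closed form and of one DDPM-style training step in PyTorch, assuming a linear beta schedule; denoiser is a stand-in for a U-Net that predicts the added noise.

```python
# Sketch of the DDPM-style forward (noising) process and one training step.
# `denoiser` stands in for a network that predicts the noise added at step t.
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)               # linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)  # cumulative product, "alpha bar"

def q_sample(x0, t, noise):
    """Forward process q(x_t | x_0): blend clean data with Gaussian noise."""
    a_bar = alphas_cumprod[t].view(-1, *([1] * (x0.dim() - 1)))  # broadcast over data dims
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

def training_step(denoiser, x0):
    """One denoising-objective step: the network predicts the noise that was added."""
    t = torch.randint(0, T, (x0.shape[0],))   # a random timestep per sample
    noise = torch.randn_like(x0)
    x_t = q_sample(x0, t, noise)
    pred = denoiser(x_t, t)                   # network's noise prediction
    return torch.nn.functional.mse_loss(pred, noise)
```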
For excellent, more mathematically grounded reviews of diffusion models, we recommend:
- Lilian Weng's What are Diffusion Models? blog post
- Yang Song's Generative Modeling by Estimating Gradients of the Data Distribution blog post
DIY Audio Diffusion
The application of diffusion models to the domain of audio is a relatively nascent area of research: beginning in the first half of 2021, a body of research emerged regarding diffusion models for:
- de-noising text-to-speech,
- creating probabilistic models for text-to-speech,
- performing (neural audio) upsampling using a diffusion model,
- generating audio outright, producing voice or musical outputs.
The computational complexity of some of these large diffusion models means that training a model from scratch is often out of reach for the home hobbyist, but you can experiment with pre-trained models, much as you can choose various settings for the image diffusion models. If you're working on developing an audio diffusion model from scratch, consider using the U-Net model from the Imagen repository, as it's less resource-intensive, but keep in mind the fidelity implications discussed on that Imagen repository issue page.
Thankfully, the researchers at Harmonai have released their inference repository, which lets you either generate various kinds of audio data from scratch using their already-trained diffusion model or take an existing piece of music and apply one of several new styles to it:
- honk, a style trained on recordings of Canada Geese
- glitch, an industrial-sounding music style
- unlocked, a style derived from hundreds of out-of-print LPs across many decades, drawn from 'unlocked' recordings provided by the Internet Archive
- and more!
Using Your Own Audio to Make New Pieces
To take your own music files (.wav or .flac) and generate new music in various styles, please see this Colaboratory Notebook:
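If you'd like to prepare files outside the notebook first, the main practical detail is matching the sample rate and clip length that the pre-trained checkpoint expects. Below is a minimal sketch using torchaudio; the target sample rate, clip length, and file name are placeholder assumptions, so substitute the values listed for your chosen checkpoint.

```python
# Load a local .wav/.flac file and resample it to the rate a pre-trained
# checkpoint expects. TARGET_SR and N_SAMPLES are placeholders: check the
# model card for your chosen checkpoint and substitute its real values.
import torch
import torchaudio

TARGET_SR = 48_000     # assumed target sample rate
N_SAMPLES = 2 ** 18    # assumed clip length in samples

waveform, sr = torchaudio.load("my_song.wav")  # shape: (channels, samples)
if sr != TARGET_SR:
    waveform = torchaudio.functional.resample(waveform, sr, TARGET_SR)

# Trim or zero-pad to the fixed length the model was trained on
if waveform.shape[-1] >= N_SAMPLES:
    waveform = waveform[..., :N_SAMPLES]
else:
    pad = N_SAMPLES - waveform.shape[-1]
    waveform = torch.nn.functional.pad(waveform, (0, pad))

torchaudio.save("my_song_prepared.wav", waveform, TARGET_SR)
```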
In the coming weeks and months, more advanced functionality will be showcased, so be sure to check Harmonai.org and Weights & Biases for updates from the audio diffusion universe!
Supplementary Materials
The charts below show an experimental training loss curve from working through these repositories: audio-diffusion-pytorch and audio-diffusion-pytorch-trainer. Weights & Biases allows you to quickly and easily maintain a system of record for all your experiment tracking needs, from single-person teams to dozens of collaborators. For readers who want to roll up their sleeves, we recommend checking out those two audio-diffusion repositories for diffusion-related experiments: the trainer repo lets you train your own model, while the audio-diffusion-pytorch repo provides inference code. If you just want to play around with new music styles in a few minutes, please see the Colab notebook in the prior section.
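To give a flavor of the audio-diffusion-pytorch API, here's a short sketch based on the repository's README around the time of writing; the class names, arguments, and tensor shapes below reflect that snapshot and may have changed since, so treat the repo's current README as the source of truth.

```python
# Sketch of training and sampling with audio-diffusion-pytorch, based on the
# repo README at the time of writing; verify names and arguments against the
# current README, as the API has evolved across versions.
import torch
from audio_diffusion_pytorch import AudioDiffusionModel

model = AudioDiffusionModel(in_channels=1)   # mono audio

# "Training": the model returns a diffusion loss for a batch of waveforms
x = torch.randn(2, 1, 2 ** 18)               # (batch, channels, samples)
loss = model(x)
loss.backward()

# Sampling: start from pure noise and denoise over a number of steps
noise = torch.randn(2, 1, 2 ** 18)
sampled = model.sample(noise=noise, num_steps=50)  # same shape as the noise
```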
References
[1] Deep Unsupervised Learning using Nonequilibrium Thermodynamics, Sohl-Dickstein et al., 2015. https://arxiv.org/abs/1503.03585
[2] ELBO reference page by Yunfan Jiang. https://yunfanj.com/blog/2021/01/11/ELBO.html
[3] Denoising Diffusion Probabilistic Models, Ho, Jain, and Abbeel, 2020. https://arxiv.org/abs/2006.11239