A Gentle Introduction to Dance Diffusion
Diffusion models are everywhere for images, but have yet to gain real traction in audio generation. That's changing thanks to Harmonai.
What is Dance Diffusion?
Dance Diffusion is a family of audio-generating machine learning models created by Harmonai, a community-driven organization (and part of Stability AI) whose mission is to develop open-source generative audio tools for producers and musicians.
You can use a pre-trained Dance Diffusion model (or train your own Dance Diffusion model) to generate random audio samples in a particular style, regenerate a given audio sample, or interpolate between two different audio samples.
A Dance Diffusion model is, as its name suggests, a diffusion model.
What is a diffusion model?
A diffusion model is a type of machine learning model that generates novel data by learning how to “destroy” (called “forward diffusion” or “noising”) and “recover” (called “reverse diffusion” or “de-noising”) the data that the model is trained on.
During the training process, the model gets better and better at faithfully recovering the data that it had previously destroyed.
This technical blog post by NVIDIA illustrates the forward and reverse processes. The model iteratively adds noise to some piece of data (like an image of a cat) until the data is pure noise, and then iteratively removes noise until the image is restored to its original form.
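To make the forward (noising) process concrete, here is a minimal sketch in Python. It is not Harmonai's training code; it simply applies an assumed linear noise schedule to a toy 1-D signal standing in for an audio clip.

```python
# Toy illustration of forward diffusion (noising) on a 1-D signal.
# This is NOT Harmonai's training code; it only shows the idea of
# gradually mixing clean data with Gaussian noise over many steps.
import numpy as np

rng = np.random.default_rng(0)

x0 = np.sin(np.linspace(0, 8 * np.pi, 1024))  # "clean" data (a stand-in for an audio clip)
num_steps = 100
alphas = np.linspace(1.0, 0.0, num_steps)     # assumed linear schedule: "all signal" -> "all noise"

noised = []
for alpha in alphas:
    noise = rng.standard_normal(x0.shape)
    # Each step is a weighted mix of the original signal and fresh Gaussian noise.
    x_t = np.sqrt(alpha) * x0 + np.sqrt(1.0 - alpha) * noise
    noised.append(x_t)

# noised[0] is essentially the clean signal; noised[-1] is essentially pure noise.
# During training, the network sees x_t and learns to estimate the noise it contains.
```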

What if you asked the model to restore random noise?
As it turns out, when you pass random noise to a trained diffusion model, the reverse diffusion process de-noises the input into something of the same type as the data the model has learned to recreate. In other words, this is how a diffusion model generates novel data!
Since Dance Diffusion models are trained on audio, they learn to generate audio.
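Generation, then, is just the reverse loop run on random noise. The sketch below shows the structure of that loop; `denoise_step` is a hypothetical placeholder for a trained Dance Diffusion network, not Harmonai's actual sampler.

```python
# Schematic reverse-diffusion (sampling) loop: start from pure noise and let a
# trained denoiser pull it toward realistic data, one small step at a time.
import numpy as np

rng = np.random.default_rng(0)
num_steps = 100

def denoise_step(x, t, num_steps):
    # Placeholder: a real model would estimate (and remove) the noise present at step t.
    # Here we just shrink the sample slightly so the loop runs end to end.
    return x * (1.0 - 1.0 / (num_steps - t + 1))

x = rng.standard_normal(1024)           # start from pure Gaussian noise
for t in reversed(range(num_steps)):
    x = denoise_step(x, t, num_steps)   # each step removes a little of the estimated noise

# After the loop, x is the model's generated sample; for Dance Diffusion, a raw audio clip.
```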
Here you can listen to some piano music samples created entirely by a Dance Diffusion model.
What is Dance Diffusion trained on?
There are currently six publicly available Dance Diffusion models, each trained on a different dataset of audio files.
Since the data that a diffusion model is trained on — and therefore learns how to recover — affects the type of data that it later generates, audio samples created by, for example, the maestro-150k model will always sound like piano music, and not like guitar or trumpet music.
Zach Evans, the creator of Dance Diffusion, has released a Google Colab notebook for open beta access to the Dance Diffusion models listed above. Evans has also written a Colab notebook where you can fine-tune, or customize, a Dance Diffusion model on your own dataset for greater control over the generated audio clips.
Edit: This post has been deleted! If you'd like to try using or fine-tuning a Dance Diffusion model for yourself, check out Reddit user u/Stapler_Enthusiast's detailed Dance Diffusion tutorial.
And lastly, if you'd like to create your own music samples using one of Harmonai's available models, we have a Colab link for you below!
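If you would rather generate samples from a script than from a Colab notebook, the Hugging Face diffusers library also provides a DanceDiffusionPipeline that wraps Harmonai's released checkpoints. The sketch below assumes diffusers and scipy are installed and that the maestro-150k checkpoint is published on the Hub under the model id shown; swap in a different model id for the other checkpoints.

```python
# Sketch: generating an audio clip with a pre-trained Dance Diffusion checkpoint
# via the Hugging Face diffusers library (assumes `diffusers` and `scipy` are installed).
from diffusers import DanceDiffusionPipeline
import scipy.io.wavfile

# Assumed Hub model id for the maestro-150k checkpoint mentioned above.
pipe = DanceDiffusionPipeline.from_pretrained("harmonai/maestro-150k")
pipe = pipe.to("cuda")  # optional; CPU also works, just slower

# Sample a few seconds of audio starting from pure noise.
output = pipe(audio_length_in_s=4.0, num_inference_steps=100)
audio = output.audios[0]  # numpy array with shape (channels, samples)

# Save the result as a WAV file at the model's native sample rate.
sample_rate = pipe.unet.config.sample_rate
scipy.io.wavfile.write("maestro_sample.wav", sample_rate, audio.transpose())
```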