
Making a Theme Song for Our Podcast Using Dance Diffusion

How we used Stability AI's Dance Diffusion model to write generative music for our recent podcast episode–and how you can create your own music with Dance Diffusion
We've been having a ton of fun exploring Stability AI's Dance Diffusion model. With their CEO Emad Mostaque joining us on our recent Gradient Dissent episode, we figured we'd put that exploration to good use and write a song with generated samples for his appearance.
This report will walk you through that project from generating our samples to what we learned about Dance Diffusion to putting everything together in a quick 50-second song. And if you'd like to try any of this out yourself, here's a handy colab:


What is Dance Diffusion?

Dance Diffusion is, of course, a diffusion model. It's just concerned with audio, not images or video.
With text-to-image models like DALL-E, Stable Diffusion, Craiyon, and the artier Midjourney, we get surprisingly high-quality images with just a little bit of clever prompt engineering.
But unlike these text-to-image models that generate high-quality images, Dance Diffusion doesn't output fully realized songs or even parts of songs (say, a bridge or a chorus). In other words: you don't get a finished product. What it gives you is high-fidelity audio samples that you can use to actually craft songs.
This is a bit different from what something like Stable Diffusion gives you but from a creative perspective, it's honestly more enjoyable. While Stable Diffusion is all about prompt engineering, Dance Diffusion is a little more like crate digging: we're looking for a small clip we can use as the foundation for our song. In other words, we're looking for ingredients, not a complete work.
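If you want a rough mental model for what's happening under the hood, here's a heavily simplified sketch of unconditional diffusion sampling over a raw waveform. To be clear: the model call, noise schedule, and chunk length below are placeholder assumptions for illustration, not the actual Dance Diffusion code or API.

```python
# Illustrative only: the general shape of unconditional diffusion sampling over raw audio.
# `denoiser` stands in for a trained noise-prediction network; this is NOT the real
# Dance Diffusion code or API, and the schedule/shapes here are placeholder assumptions.
import torch

SAMPLE_RATE = 48_000                 # assumed fixed sample rate
CHUNK_SECONDS = 5                    # the model emits fixed-length clips, not whole songs
STEPS = 100                          # number of denoising steps

# A simple linear noise schedule (an assumption; real models tune this carefully).
betas = torch.linspace(1e-4, 0.02, STEPS)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

@torch.no_grad()
def sample_clip(denoiser: torch.nn.Module) -> torch.Tensor:
    x = torch.randn(1, 2, SAMPLE_RATE * CHUNK_SECONDS)       # start from pure stereo noise
    for t in reversed(range(STEPS)):
        a_t = alpha_bars[t]
        a_prev = alpha_bars[t - 1] if t > 0 else torch.tensor(1.0)
        eps = denoiser(x, torch.tensor([t]))                  # hypothetical model signature
        x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()        # model's guess at the clean audio
        x = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps    # deterministic DDIM-style update
    return x.clamp(-1.0, 1.0)                                 # a few seconds of raw audio
```

The takeaway is just that the output is a fixed-length chunk of raw audio: a sample, not a song.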
For fun, we're also going to place one arbitrary restriction on ourselves. Though we could beat up, slow down, transform, loop, or otherwise manipulate our melodic samples, we're going to avoid that. It feels more appropriate to take something directly from a model and work around that than to lean into production tricks. We're trying to see what the model can do, not what our DAW can do.
Okay. Let's make some drums.

Generating Percussion

The Dance Diffusion colab has a few different models you can choose from out of the box, ranging from Jonathan Mann tunes to Canadian geese yelling at you. For our project today, we'll begin with their "glitch 440k" model, which sounds like exactly what it is.
Why start there? Well, glitches and noise can make some pretty decent drum one-shots. Check out sample 32 in the panel below, for example. That's a solid rimshot and we'll make that the basis of our snare. The rest of our first run had a good amount of static and noise (the model was trained on glitches, after all), but samples like that can be good for sound design so they're worth keeping around for now. In fact, with a little fussing, white noise can be turned into a serviceable hi-hat.
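As an aside, here's roughly what that "little fussing" can look like: shape a burst of white noise with a fast decay envelope, then high-pass it so only the top end survives. This is a generic sound-design sketch, not something from the Dance Diffusion colab.

```python
# A generic white-noise hi-hat sketch (not from the Dance Diffusion colab): shape noise
# with a fast exponential decay, then high-pass it so only the top end survives.
import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, lfilter

SAMPLE_RATE = 44_100
DURATION = 0.08                                   # 80 ms is plenty for a closed hat

t = np.linspace(0, DURATION, int(SAMPLE_RATE * DURATION), endpoint=False)
noise = np.random.uniform(-1.0, 1.0, t.shape)
hat = noise * np.exp(-t * 60)                     # fast decay so it reads as a hat, not a hiss

# High-pass around 6 kHz to strip the low/mid rumble.
b, a = butter(4, 6000 / (SAMPLE_RATE / 2), btype="highpass")
hat = lfilter(b, a, hat)
hat /= np.max(np.abs(hat))                        # normalize before writing to disk

wavfile.write("hihat.wav", SAMPLE_RATE, (hat * 32767).astype(np.int16))
```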

[Embedded W&B audio panel: run set of 4 runs with the generated glitch samples]

We have some other workable nuggets here. Sample 6 has a kind of 808 cowbell thing going on. Our eighth sample feels almost like a percussion loop and we'll actually end up starting the song with that after a little light stretching and quantizing. We didn't really get any low end or kicks here, but we're resourceful, so we'll likely try to generate some of those later or take some creative liberties. Dance Diffusion isn't going to write this song, after all. We're just looking for ingredients.
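For the curious, the light stretching mentioned above is only a couple of lines. Here's one way to do it with librosa; the filename and tempos are made up for illustration.

```python
# Nudging a generated loop toward a session tempo with librosa. The filename and the
# tempos here are made up; estimate the clip's real tempo by ear or with a beat tracker.
import librosa
import soundfile as sf

y, sr = librosa.load("sample_08.wav", sr=None)    # hypothetical export of our eighth sample

estimated_bpm = 92.0                              # assumption: wherever the clip actually sits
target_bpm = 100.0                                # the tempo of our session

# rate > 1 speeds the audio up, rate < 1 slows it down.
stretched = librosa.effects.time_stretch(y, rate=target_bpm / estimated_bpm)
sf.write("sample_08_100bpm.wav", stretched, sr)
```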
A final, vitally important note: we name training runs here at W&B with a bank of arbitrary adjectives and nouns. This first run gave us this name, which portends well:


Let's Get a Melody

While you can fine-tune Dance Diffusion on your own song or sample library, it felt more appropriate for our purposes here to pull something directly from their colab. We have a few options for training data:
  • Jonathan Mann: Jonathan Mann's been recording a song a day for several thousand days. Back-of-the-envelope math says that's several thousand songs. The thing is, his music is also predominantly acoustic, and that generally makes for less flexible samples, so we'll use something else.
  • Maestro: This library's a bit more promising but it's predominantly (or entirely) piano music. We want something a little more exotic to see how the model performs.
  • Unlocked Recordings: From the folks behind the Wayback Machine, here's our best bet. This is a bit of a hodgepodge, sampling everything from spoken word to baroque to, uh, what looks like Third Reich war songs. Okay. A little troubling but let's just power through. The point here: we're getting a lot of variety so we're going to get some interesting samples out of the thing.
In fact, let's get out of those bullet points and listen to what we got in our first pass:

[Embedded W&B audio panel: run snowy-vortex-2, generated samples]

Our first small training run is a bit loaded with overwrought, classical-adjacent samples, which just feels a little severe for a podcast introduction. This training set has a decent amount of spoken word recordings, so we end up with samples like number 9: human-sounding speech in no particular language. We actually used one of these in our final product, just as a little ear candy.
Still, we haven't found something that feels quite right. One reason? A lot of these samples were a bit staticky. The source material is likely older and, because of that, prone to bad recordings or a lot of vinyl noise, so this is to be expected. In fact, when we ran the colab for those Jonathan Mann tunes, we got much cleaner-sounding samples. Past that–and this is a subjective conclusion–we just didn't quite get anything particularly inspiring from our first run. So we ran the colab one more time. And then we got sample 14:

[Embedded W&B audio panel: run toasty-totem-3, generated samples]

Now we're talking.
For starters, we got much higher fidelity on this sample. There's way less hiss up top, but there's still some air in there. We're all over the place in terms of tempo and time signature, but again, we just need a bit to do what we want. The sample feels a bit split down the middle: the first second or so feels like the end of a phrase, then we get into something jauntier. Either way: this sounds better than our other 99 options and it has some genuine sway going on. This will be our main ingredient.
Let's sum up:

What We're Using

So with less than 150 generated samples, we've got a handful of unique elements for our project:
  • A rimshot / stick-click snare
  • A few pieces of spoken word
  • A series of eight or nine glitches
  • Some white noise for a hi-hat
  • One really solid sample we'll use as our base
  • Two or three melodic vocal samples we'll pepper in and see how they work

Putting It All Together

This is a machine learning blog, not an Ableton blog, so we won't spend too much time on the nitty-gritty there. That said, a few points worth noting before you hear the finished product:
  • About 80% of our sounds here are directly from those few short Dance Diffusion training runs.
  • Specifically, all the melodic content and a lot of the higher drum bits came from our runs. We used a little low end to help with our kick, used a rimshot as part of our snare, and then cheated to add a few hand drums for vibe-related considerations.
  • We have no clue what anyone is saying in any of the vocal samples. There's something kind of intoxicating about that honestly but just don't expect Dance Diffusion to spit out great lyrics. It won't. But it will give you a surprising melody or ten.
Anyway, here's everything all put together:



What We Learned About Dance Diffusion

Though our project by no means exhausted what Dance Diffusion can do–in fact, we really just ran their colab a handful of times with a minor tweak here or there–we learned a couple high-level things about the model and what you can do with it.

Quality

When we talked to the team behind Dance Diffusion earlier this year, Zach kept underlining that he wanted the model to create production-ready audio. This is in stark contrast to something like OpenAI's Jukebox, which is really quite remarkable in its own way but produces music that sounds heavily compressed and underwater.
Dance Diffusion was far better here. Our glitch loops were varied and interesting and we hit high and low frequencies with real fidelity. Our melodic samples had some fuzz and pop but we're fairly confident that's because they were trained on older recordings (vinyl in, vinyl out). Admittedly, this is a guess but if you listen to the Unlocked Recordings samples, that's certainly what they sound like.

Expectations

We've touched on this a couple times, but don't go in expecting a finished song or even a fully formed loop. Some of what we got was a bit hard to pin down from a tempo and time signature perspective so you need to be prepared to do some fussing and massaging.
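A quick tempo estimate gives you a starting point for that massaging. Here's a small sketch with librosa; the filename is hypothetical, and short, loose clips like these will often fool the tracker, so treat the number as a suggestion.

```python
# Rough tempo and beat estimate for a generated clip; treat it as a starting point only.
import librosa

y, sr = librosa.load("toasty_totem_sample_14.wav", sr=None)   # hypothetical filename
tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
beat_times = librosa.frames_to_time(beat_frames, sr=sr)

print("Estimated tempo (BPM):", tempo)
print("First few beats (seconds):", beat_times[:4])
```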

Could you make an entire song out of just Dance Diffusion samples?

While this wasn't a stated goal of the project when we interviewed their team earlier this year, the answer is yes, even with the out-of-the-box training sets. Obviously, it depends a lot on what kind of songs you're trying to make. Which brings us to:

Next Steps

One thing we didn't experiment with at all is fine-tuning Dance Diffusion on your music or sample library. There's no reason we shouldn't be able to point the model at, say, a folder of kick drums and get a bunch of novel, useful samples. Intuitively, percussion seems like the best place to start, but we're excited to try it out with weird home recordings, bad singing, foreign-language spoken word, etc. Training with higher fidelity samples should also give us higher fidelity generations (getting rid of the vinyl in, vinyl out aspect).
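We haven't run this yet, but the prep work is easy enough to sketch: gather your one-shots, resample them to a single rate, and pad or trim them to a uniform chunk length. The sample rate and chunk length below are assumptions for illustration, not confirmed Dance Diffusion training settings.

```python
# Sketch of prepping a folder of kick drums for fine-tuning. The sample rate and chunk
# length are placeholder assumptions, not confirmed Dance Diffusion training settings.
from pathlib import Path
import torch
import torchaudio

TARGET_SR = 48_000
CHUNK_SAMPLES = TARGET_SR * 2                     # assume ~2-second training chunks

def prepare_clip(path: Path) -> torch.Tensor:
    waveform, sr = torchaudio.load(str(path))
    if sr != TARGET_SR:
        waveform = torchaudio.functional.resample(waveform, sr, TARGET_SR)
    waveform = waveform.mean(dim=0, keepdim=True)             # collapse to mono for simplicity
    if waveform.shape[-1] < CHUNK_SAMPLES:                    # pad short one-shots with silence
        waveform = torch.nn.functional.pad(waveform, (0, CHUNK_SAMPLES - waveform.shape[-1]))
    return waveform[..., :CHUNK_SAMPLES]                      # trim anything longer

clips = [prepare_clip(p) for p in sorted(Path("kicks").glob("*.wav"))]
dataset = torch.stack(clips)                      # (num_clips, 1, CHUNK_SAMPLES), ready to train on
```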

Conclusion

We had a lot of fun with this project and it feels like we're just scratching the surface of what you can do with Dance Diffusion. The main takeaway is that the model really spits out high-quality, novel samples, which means you're working with unique source material, not sampling or using something that's been recycled umpteen times by other people. And, if you actually want to release music and have royalty concerns, a machine-generated sample frees you from that worry.
So should you try it? If you made it this far in the report, the answer is "of course, why aren't you doing it already?" Training off the colab was a breeze and poking through the generations really did feel like sci-fi crate digging. If you want to get started, give it a shot below. And if you make anything you're into, give it a share. We'd love to take a listen.




