Interview: Harmonai, Dance Diffusion and The Audio Generation Revolution
Zach Evans and Dr. Scott Hawley join us to discuss Harmonai and their new Dance Diffusion model
Created on September 23 | Last edited on October 4
Late last week, we had the pleasure of interviewing the core team at Harmonai about their latest work bringing diffusion models to audio. We're thrilled to present it here!
Listen to the Harmonai core team's vision for how their Dance Diffusion model is the first step on the journey to high-quality music and audio generation for everyone.
For more see this link: http://wandb.me/harmonai
Specifically, you'll hear from Zach Evans and Dr. Scott Hawley as they discuss their open source audio generation work and their newly released audio diffusion model, Dance Diffusion, with Morgan and Justin from Weights & Biases.
And if you want to jump right in and start generating music samples with Harmonai, there's a Colab notebook for that:
Read the full transcript below
Morgan:
Zach, Scott, thanks very much for joining. We'll dive straight into it. Before we get into a bunch of the technical questions I have around things you can do with audio generation, we'd love to hear a little bit about how Harmonai came into being: how you two started to collaborate, how many of you are on the project, and how long it's been going.
Zach:
A bit of background: I got into the AI text-to-image space through Disco Diffusion and the OpenAI GLIDE notebook. I was proficient in Python, but not really doing a whole lot of AI stuff. I basically dove into the code in the notebooks, saw what I could tinker with, and figured out how to make some changes and actually get stuff going there. I got really interested in making audio-reactive music videos with Dance Diffusion (sorry, with Disco Diffusion), and then realized that to do so it needed to be significantly faster, because ten minutes per frame is just not gonna work for trying stuff out. I dove more into the code and research behind Disco Diffusion and started talking with Katherine Crowson on Discord, getting up the courage to ask her a question after seeing her name pop up in all the different notebooks. I was able to figure out some of the actual implementations behind these things and bend them to my will a little bit.

After working on that for a little while, I realized a lot of the same math could be applied to music. I got into music production three or four years ago, and really started focusing on it at the beginning of lockdown, after COVID hit. I basically started spending all of my time on music producer servers on Discord. So when I realized that a lot of this text-to-image stuff could be applied to audio, that became a new passion, and I thought, 'I need to make that happen.' I'm not a visual artist, I can't draw to save my life, but it was so personally and creatively validating to be able to use these image synthesis tools to have creative output, and I realized the same creative accessibility could be given to musicians. That's much more my people; my friends were all producers and DJs. All of a sudden I realized that was the purpose: I needed to bring these kinds of tools to the music space.

I talked with Katherine and asked her if she'd done any audio diffusion stuff. She sent me some examples; she had tried it on speech and it kind of worked, but it didn't actually make coherent speech, so she saw it as a failure. It was coming out like garbled speech from The Sims. And I was like:
'No, no, that's the best sound design tool ever. You're just not using it on the right things.' I took some training code and turned it up to high resolution, 48kHz, because I needed high-resolution stereo. So much machine learning audio work is 16kHz mono, which works for speech and piano music, but I'm trying to do modern EDM: you need high resolution, you need stereo. Basically, I took that same code, applied it to audio, took some of the other image diffusion code, and learned enough PyTorch and enough of the math behind it to change 2D to 1D in the right places. Applying some of my knowledge from music production, and really just throwing stuff at the GPU and seeing what stuck, I was able to get something going. I was really inspired by the whole open source AI movement and wanted to be part of it. I worked at Microsoft for seven years before this, doing some front-end design, and this is just so much more aligned with my passions: being able to really give back to the music community with cool tools and things to help them with their creative process.
Morgan:
Amazing, that's a great story. And where does Scott come in, and everyone else involved in Harmonai?
Zach:
Scott can definitely tell his own origin story in Harmonai, but as I was building the Discord server (I'll talk later about how that came to be, how I got in contact with Emad, and how it became a part of Stability), I was basically given the opportunity to make a community around this. I started looking around to see who the movers and shakers were and who was already doing stuff: who could I enable, bring into the fold, and ask to be on this journey together? Someone referred me to Scott, who had posted a message in, I believe, the EleutherAI or LAION server about audio research, and it was like, 'there's someone who is interested in this.' So we sent him an invite and brought him in. He can tell the rest.
Scott:
I've been involved with machine learning and audio fully professionally since 2016. I first got hip to it back in 2013 when I saw Accusonus release a plugin for drum source separation, and I was like, 'this is magic. This is gonna change the world. I need to learn this.' I wasn't planning on doing it myself; I just knew it was gonna affect my students' lives, as I teach audio engineering students. Over the years I put out academic papers, and at one point I worked with a company, Art+Logic, to produce tools for producers and musicians. I definitely wanted to stay on the side of empowering creators, and definitely in the audio space.

I work by myself and I have tiny GPUs. Last year I got addicted to the text-to-image stuff with Katherine Crowson's work, the VQGAN-CLIP, and then some guys, Chris Donahue and Ethan Manilow, had shown that if you pull apart what's inside of Jukebox, it's semantically interesting information. I was like, 'Yeah! That's the ticket. That's what I want to do.' I wanna do not so much text-to-music, but text-to-music-production. I wanna say 'give me a little more reverb' or 'give me a little more this or that,' because that's what my students do. This was last year, but then at some point a bunch of things just all converged this spring. Stella and Katherine put out their paper on VQGAN-CLIP, finally, a year after so many of us were already into it.

I was never doing the Disco Diffusion thing, by the way; I'm definitely a latecomer on the diffusion stuff. For pretty much all your diffusion questions, you'll be talking to Zach today. But I wrote to her, or to them, and said, 'Hey, if you guys ever wanna do audio, I have ideas.' As I was talking to Stella and somebody else from Stability (Stella was involved with EleutherAI, and somebody else was involved with Stability as well), there was this one week where like four Discord servers, all about machine learning and audio and production, all came together in the same week. I said, 'Are you people not talking to each other?' I thought Zach and Stella and this other person were talking to each other, and no, it was all completely independent things happening at once.

Essentially what happened with Harmonai is I thought: here are people to work with, and they've got compute and I've got ideas. Actually, at the time I thought I was gonna have access to more data, which I'm still working on. I'm working on consensual data, with explicit consent; we'll get to that later. A lot of it worked out where I realized the way for me to really get up to speed on all of this is that in helping Zach do what he's doing, I'll be training myself so that I can do what I really want to do in addition to all that. We can fulfill multiple projects and multiple goals simultaneously.
Morgan:
Amazing, a match made in heaven. So I guess the most mature project you have going at the moment would be Dance Diffusion; we'd love to hear a little bit more about that. I think we'll leave some notes for people so they can read up on the basics of things like diffusion, but if you wanted to give 30 seconds on how diffusion works, and then specifically, what are the nuances involved in doing it for audio?
Zach:
At a very basic level (because it gets into very high-level math very quickly): you give the model noise and it makes it not-noise, in the form of whatever it has seen. That's unconditional diffusion. So Dance Diffusion is an unconditional diffusion model, compared to things like Stable Diffusion, which is conditioned on text. The reason I'm starting with an unconditional diffusion model is that's kind of what started the whole image diffusion frenzy: the unconditional ImageNet model released by OpenAI. People were able to take that unconditional model, which just makes something (it learns about some distribution of data and can turn noise into that), and then take other classifiers, really anything that'll give you a strong gradient, and use that to guide the denoising process. That's what CLIP-guided diffusion was: taking that unconditional diffusion model and using CLIP, which is able to compare pictures, basically saying how well a caption matches an image, and saying, all right, now simultaneously optimize that this image matches this caption. The caption's not changing, so you'd better change the image to be more like that.

In terms of audio, I'm a music producer, so I think of sidechain compression and sidechain effects, where one thing is affecting the other. I like to call Disco Diffusion or Stable Diffusion a denoiser where the text input is the sidechain. Putting it in that context, because of that text input, it could be anything: anything that can give you a gradient to say 'make this better,' anything to optimize at the same time. Putting out an unconditional diffusion model allows that space of research to happen, all of that guidance. People can find different things. There was the CRASH model put out by Sony CSL, which was diffusion for high-resolution drum synthesis. They had a classifier to tell between kicks, snares, and cymbals, and using just that classifier they were able to tell it to make a kick, a snare, or a cymbal. There are already so many existing music classifiers and tools like that which could be hooked up to any unconditional audio diffusion model to do so many cool things. That's one of the things we're really trying to look for with Dance Diffusion, kind of in the same vein as Disco Diffusion.

People generally see Disco Diffusion as one of the first big open source text-to-image notebooks, and most importantly, it included animation support. But what I really see Disco Diffusion as was a crowdsourced hyperparameter search on guidance losses, because Disco Diffusion didn't have just the text: it had things like losses for saturation and color and total variation, all these different things to change. And what that became was everyone trying to get in there: if I change this 0.01 to 0.001, it changes this. Basically having this ensemble of guidance losses, and having everyone get in there, messing around and tweaking those parameters to find what's actually good. That's the kind of thing I want to create for the audio world. If we can give out these models that are unconditional diffusion and then say, 'here's guidance, throw stuff in there,' we're gonna see an explosion of creativity here. Make a model that is trained on a good amount of data and throw in a genre classifier: will it start making things in that genre?
Quite possibly! Of course, it is trickier when you get to things like the classifiers all being 16kHz mono while our audio is 48kHz stereo, but there are ways around that; you can deal with it. With the Dance Diffusion models I've put out, there's not one massive model we have trained on a hundred thousand hours of audio data. It's not a Jukebox. What we have is a variety of different models from the same architecture: they're all the same thing, just fine-tuned on different datasets that are either public domain or given to us by the community.

So that gives out the architecture, it gives out 'here's a basic list of things you can do with it, here are a few sample things,' but we also put out a fine-tuning notebook. People can take that same model and fine-tune it on their own data, on their own content. That enables this extra level of interaction, of personal research and creativity, that I think will be really interesting to expose people to. I think a lot of people right now are used to the plug-and-play, gachapon-style 'type in a text prompt and get out some fun stuff,' which is great fun. I really hope we get that at some point, but there's a whole lot of research we have to do to get there. I see Dance Diffusion as us putting out the catalyst to have happen to audio what has happened to images in the last two years. It's not that this is going to be the be-all and end-all model, far from it.

One of the models is trained on audio from Jonathan Mann, who we are working with. He has the Guinness world record for most consecutive days writing a song; he just hit 5,000. So we've got almost 5,000 songs from the same guy, great training data that you don't get from a lot of people, and he was generous enough to give us all that audio data and allow us to release a model trained on it. There's also a pack from Glitch with Friends, a producer community, where we were basically finding people to provide us with good audio training data, with consent, training on that, and releasing it, then seeing where the rest of the community takes it.
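To make the guidance idea above concrete, here is a minimal sketch of classifier-guided sampling with an unconditional model. The `denoiser` and `classifier` below are placeholders rather than the actual Dance Diffusion or CRASH code, and the update rule is deliberately simplified; the point is only that anything that returns a useful gradient can steer an unconditional denoising loop.

```python
# A minimal sketch of classifier guidance, not the actual Dance Diffusion or CRASH code.
# `denoiser(x, t) -> predicted noise` and `classifier(x) -> logits` are placeholders for
# a pretrained unconditional model and any differentiable classifier over raw audio.
import torch

def classifier_guided_sample(denoiser, classifier, target_class, steps=100,
                             channels=2, length=65536, guidance_scale=100.0):
    x = torch.randn(1, channels, length)               # start from pure noise
    for t in torch.linspace(1.0, 0.0, steps):
        with torch.no_grad():
            eps = denoiser(x, t.expand(1))             # unconditional denoising direction
        # guidance: gradient of log p(target_class | x) with respect to the audio itself
        x_in = x.detach().requires_grad_(True)
        log_prob = torch.log_softmax(classifier(x_in), dim=-1)[:, target_class].sum()
        grad = torch.autograd.grad(log_prob, x_in)[0]
        eps = eps - guidance_scale * grad              # steer the noise estimate toward the class
        x = x.detach() - (1.0 / steps) * eps           # crude Euler-style update, for illustration
    return x
```

In Zach's sidechain analogy, the classifier gradient is the sidechain input: a genre classifier, a kick/snare/cymbal classifier, or any other differentiable model can pull the same unconditional model toward its target.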
Morgan:
Amazing. We're on the launch pad, you're gonna be the spark, and we'll see where the rocket goes.
Zach:
Yeah, that's the goal. It's hard to plan for this stuff because we have no idea where the community's going to take it. We're seeing that with Stable Diffusion: there was a Twitter thread I just saw saying Stable Diffusion's been out for a month and there's already a year's worth of product releases based on it. We did a kind of sneak launch of Dance Diffusion; it's been on GitHub for a while. We haven't been talking about it or publicizing it yet, mostly so we can get the infrastructure in place to handle the community that will inevitably form around it. The Disco Diffusion server got made and within a week it was like 3,000 people. We want to have an open server for people talking about Dance Diffusion and Harmonai, but we had to make sure we had the people in place to withstand that incoming herd of excited people, and to be able to properly organize them and make sure people can properly contribute to these things.
Morgan:
Cool. I'd love to chat about how people can get involved in a bit, but just going back to the model itself: I'm curious what nuances or tweaks you had to make to get it to work for audio. And also, how was it dealing with mono versus stereo and different resolutions?
Zach:
With the model we have now, we're not doing a whole lot of extra bells and whistles for audio specifically. I'm a bit of an unsupervised learning purist when it comes to audio. A lot of audio work will use mel spectrograms or MFCCs, throwing out the phase data, which, as someone who makes plucks and drones, is absurd to me; throwing out phase is throwing out the baby with the bathwater. What if we just threw the data at a big U-Net and saw what it learned? Can it learn all this stuff? With diffusion, the way we're training it, there are no big reconstruction losses: you're training it to predict the score gradient, the noise. It's almost a data-agnostic objective. I was really curious to see how good we could get if we don't do things like turning it into an STFT, which definitely would simplify it, but let's push the boundaries here. We haven't done that much other than changing 2D to 1D, adding more layers, and figuring out what chunk size and sample rate you wanna throw in there. It's simple, but it's powerful and big.

One of the downsides is that with audio versus image data, you get differences in how the different frequency ranges are interpreted. Humans hear sound logarithmically in terms of frequency, and with music, things like hi-hats might be only at 18kHz and above, which also means if you downsample it, you lose the instrument. With our model, we're seeing that it will reconstruct the low end better than the high end. That's really common in audio models. If it's loud and low, it's really easy for the model to find: you've got these big waves, you can get a little bit of error on there and it's still just fine. You get a little bit of error on the high end and it blows it all out of proportion. You've seen diffusion models that have an STFT as a backbone, to get the complex STFT, the frequency and phase information, and work in that space, which adds an implicit bias that basically has the model understand there is a band split here, there are different frequency ranges, and they all matter kind of equally in their own proportions. I've been enjoying the naive model way of saying: 'here's a bunch of data, figure out what I just showed you and make me things like that; I'll make you big enough that you can do it.'

I've only been in the machine learning world a short time; I started training models with my first audio diffusion model back in March. So it's a nice ease-in to have the first ones be pretty easy: make the numbers big, don't run out of memory, throw a bunch of data at it, and it'll learn what you show it. That sounded good, so I did that over and over again, watching my Weights & Biases reports daily, just waiting for it to get better. We had these spectrograms in there that Scott made some utilities for. Scott's been fantastic at making a bunch of extra utilities to get the visualizations and stuff working with Weights & Biases, and these 3D plots of our embeddings. When you're training these models, it's good to watch the harmonics come in from the bottom: oh, it's learning more, it's learning the middle now, it's learning the high end.
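For readers who want to see what that "change 2D to 1D and predict the noise" recipe looks like in code, here is a minimal, self-contained sketch in PyTorch. The tiny convolutional network stands in for the real 1D U-Net, and the cosine noise schedule and chunk size are illustrative choices, not the actual Dance Diffusion configuration.

```python
# A minimal sketch, not the actual Dance Diffusion architecture or schedule: a small 1D
# convolutional network stands in for the real U-Net, fed raw stereo chunks and trained
# with a noise-prediction objective on the waveform itself (no spectrograms, phase intact).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDenoiser1d(nn.Module):
    """Toy stand-in for a 1D U-Net: noisy stereo audio plus a timestep in, predicted noise out."""
    def __init__(self, channels=2, width=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(channels + 1, width, 5, padding=2), nn.GELU(),
            nn.Conv1d(width, width, 5, padding=2), nn.GELU(),
            nn.Conv1d(width, channels, 5, padding=2),
        )

    def forward(self, x, t):
        t_map = t[:, None, None].expand(-1, 1, x.shape[-1])   # broadcast time as a channel
        return self.net(torch.cat([x, t_map], dim=1))

model = TinyDenoiser1d()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

def train_step(audio):
    """audio: [batch, 2, samples], raw stereo waveform in [-1, 1]."""
    t = torch.rand(audio.shape[0])                            # random diffusion times
    alpha, sigma = torch.cos(t * torch.pi / 2), torch.sin(t * torch.pi / 2)
    noise = torch.randn_like(audio)
    noisy = alpha[:, None, None] * audio + sigma[:, None, None] * noise
    loss = F.mse_loss(model(noisy, t), noise)                 # predict the injected noise
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# one step on two random ~1.4-second stereo chunks at 48kHz
print(train_step(torch.randn(2, 2, 65536) * 0.1))
```

Swapping an image U-Net's 2D convolutions for 1D ones, as in the stand-in above, is essentially the "2D to 1D in the right places" change described here; the diffusion objective itself stays the same.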
Morgan:
I was actually curious about the 3D point clouds where you were visualizing your embeddings. What would you get out of that? Often embeddings are amorphous and hard to interpret.
Zach:
I will let Scott answer that because he made them.
Scott:
I'm not sure how much I wanna be quoted in a publication on this, so I'm going to watch my language right now and be deliberately very careful. Right now it's still an open area of research what the structure of these embeddings is, and what constitutes a good structure for the embeddings. Really, right now we're using it as a diagnostic tool: are the values at least reasonable? They're not exploding, they're not all going to zero? Are different sounds ending up in different parts of the embedding space? That's usually a good thing. If we just have an amorphous ball of everything, then maybe that's not doing much good. At least, as I mentioned, I wanna use these embeddings for interesting stuff at some point, so: how interesting are they just in the process of training? Then also, since we're building the diffusion model, at least some of our models will take those embeddings and then do diffusion from that: we will generate the audio from the embeddings. We wanna be able to compare what our system is doing, not just on the final output, but can we see inside? (technical issues)
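As a rough idea of how this kind of diagnostic can be logged, here is a hedged sketch (not Scott's actual utilities): project a batch of high-dimensional embeddings down to three principal components and send them to Weights & Biases as a 3D point cloud, so you can see whether different sounds spread out or collapse together. The project name in the usage comment is a made-up example.

```python
# A hedged sketch of the diagnostic described above, not the actual Harmonai utilities:
# reduce a batch of audio embeddings to 3D with PCA and log them as a W&B point cloud.
import torch
import wandb

def log_embedding_cloud(embeddings, step, name="embedding_cloud"):
    """embeddings: [n_points, dim] tensor, e.g. one latent vector per audio chunk."""
    emb = embeddings.detach().float().cpu()
    emb = emb - emb.mean(dim=0)                 # center before projecting
    _, _, v = torch.pca_lowrank(emb, q=3)       # top-3 principal directions
    points = (emb @ v[:, :3]).numpy()           # [n_points, 3] for the 3D viewer
    wandb.log({name: wandb.Object3D(points)}, step=step)

# usage (hypothetical project name):
# wandb.init(project="audio-diffusion-debug")
# log_embedding_cloud(torch.randn(512, 128), step=0)
```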
Zach:
There was some code that Katherine and I had written, and I kind of combined them together to get a convolutional encoder that produces a latent series, then use that as conditioning for a diffusion model and co-train all of it. You basically end up with an autoencoder, but it's a really strong autoencoder because it's multiple passes, because it's diffusion-based: it's like an autoencoder that keeps refining itself. You can get really good compressed spaces, really good results. It just works differently. I've been trying to train a SoundStream replication and other kinds of models, and having an iterative autoencoder gives different kinds of artifacts, as opposed to the downsampling artifacts or the horrible hum of some autoencoders. It gives a bit more of a noisy high end, almost a bit more natural. Also, because it's diffusion, you can make a giant diffusion model as the decoder and have it learn a whole bunch of stuff. That's been really interesting research, and it's also how we get those embedding clouds: you can see your embeddings starting out all noisy, then over time you start seeing them clustering and you see the differences. We'll have a batch of 16 go through and you'll see they're all organizing themselves; you see this one's a wider embedding cloud and that audio is more stereo, maybe that's what it's learning. You get a kind of intuition on the interpretability of the embeddings there. Also, having an autoencoder like that lets you do latent diffusion: that gives us an embedding space that we can then diffuse in, diffuse much longer sequences, and basically increase the power of the diffusion models by stacking diffusion in the autoencoder space.
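Here is a minimal sketch of that diffusion-autoencoder setup, using the same kind of toy 1D convolutional stand-ins as earlier rather than Harmonai's actual code: an encoder compresses the audio into a latent sequence, and a denoiser conditioned on that latent is trained with the usual noise-prediction objective, so the two are co-trained end to end.

```python
# A toy sketch of a diffusion autoencoder: a strided-conv encoder produces a compressed
# latent "series", and the denoiser must lean on that latent to reconstruct the clip
# from noise. Architecture and schedule are illustrative, not Harmonai's.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder1d(nn.Module):
    def __init__(self, channels=2, latent=32):
        super().__init__()
        self.net = nn.Sequential(                      # 64x downsampling via 3 strided convs
            nn.Conv1d(channels, 64, 9, stride=4, padding=4), nn.GELU(),
            nn.Conv1d(64, 64, 9, stride=4, padding=4), nn.GELU(),
            nn.Conv1d(64, latent, 9, stride=4, padding=4),
        )
    def forward(self, x):
        return self.net(x)                             # [batch, latent, samples/64]

class CondDenoiser1d(nn.Module):
    def __init__(self, channels=2, latent=32, width=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(channels + latent + 1, width, 5, padding=2), nn.GELU(),
            nn.Conv1d(width, channels, 5, padding=2),
        )
    def forward(self, x, z, t):
        z_up = F.interpolate(z, size=x.shape[-1], mode="nearest")   # stretch latent to audio rate
        t_map = t[:, None, None].expand(-1, 1, x.shape[-1])
        return self.net(torch.cat([x, z_up, t_map], dim=1))

enc, dec = Encoder1d(), CondDenoiser1d()
opt = torch.optim.AdamW(list(enc.parameters()) + list(dec.parameters()), lr=1e-4)

def train_step(audio):
    z = enc(audio)                                     # compressed latent series
    t = torch.rand(audio.shape[0])
    alpha, sigma = torch.cos(t * torch.pi / 2), torch.sin(t * torch.pi / 2)
    noise = torch.randn_like(audio)
    noisy = alpha[:, None, None] * audio + sigma[:, None, None] * noise
    loss = F.mse_loss(dec(noisy, z, t), noise)         # encoder and decoder co-trained
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

print(train_step(torch.randn(2, 2, 65536) * 0.1))
```

A second diffusion model could then be trained inside that compressed latent space, which is the latent diffusion idea Zach mentions for reaching much longer sequences.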
Morgan:
That sounds insane. I love it. You touched on this earlier, and Scott did too: the data you're training these models on. I'm curious about the variety of audio and music styles you've trained your model on. Also, you mentioned it earlier: are there issues around copyright, and what do you think the reception will be from musicians and artists once this is released?
Zach:
For a lot of our internal training data, I've got a lot of electronic music that I've been testing stuff out on, because it tends to be full-spectrum, high-resolution stereo with extra stuff going on at the high end and the low end. It's really good to test on; there's a whole range of distorted signals. Compared to just training on piano music, which ends up being a bit easier because it's all piano, when you throw in a bunch of dubstep, it's every distortion possible under the sun. That gives you kind of a hard mode for this. I haven't released anything with that data, because we have to do a lot more evaluation around memorization and overfitting. I don't want to put out a model that is trained on a bunch of copyrighted data and spits it back out. That would be bad from a copyright perspective, that would be bad legally, and that would be bad science: it just wouldn't be a very good model if it's overfitting and memorizing things. Because we don't have a full suite right now to really tell that a model has not memorized anything, and because these models can be susceptible to memorization, I haven't released any models trained on copyrighted data that I would be concerned about. That is going to continue to be true for us until we figure out how to not step on that landmine.

That's one of the really big things that's held back music machine learning: music copyright is something you don't want to go up against. That's what brought down Napster. That's what brought down all the fun things: they blow up, they're really cool, and then the RIAA, or whoever, comes in and says, 'Nope, you aren't doing that.' I hope for a long and prosperous future for Harmonai and our projects, so I want to make sure we fully understand the implications of training on and releasing models trained on copyrighted data before doing so.

In terms of artists' reactions, I'm talking with a lot of producers, and when I show them these tools they are blown away. They love it. Right now we're working on high-resolution audio, and the diffusion model we have out there will only make about one and a half to three seconds of audio at a time. That's not very good if you wanna make a full song: having three seconds of a song, okay, cool, we got half a line of a verse and it sounded kind of like a person. But now throw in drums, throw in loops, throw in little clips, and all of a sudden it becomes heaven for someone who uses sampling. We had a model trained on a bunch of old records, old ballroom dance music and songs from the thirties and forties. I've got a friend who makes hip hop, and I gave him that model; he was able to keep running it, get samples out of it, and use those in his tracks. It gives us an extra level of sampling. I'm basically making tools that are more useful to producers first, as opposed to 'here is a thing that can make a beautiful song and, oh no, artists are threatened.' Not really. Artists are much more empowered by these tools first, because they can use them. You still have to have composition, you still have to have taste. The outputs aren't well mixed, they're gonna be a little bit noisy. Can you use that? When I've talked to artists and producers about this, they love it. It kind of helps that the easier thing to do right now is make a sample generator, not a song generator.
So you kind of get a win-win there: it's really useful for the artists, they can train it on their own stuff, and it's not gonna be copying a full song. It can't. Eventually I would love to have a model that is trained on all known music so we can say, 'give me 50% Bruno Mars, 50% Beethoven.' That'd be great. That would be so fun to play around with, for inspiration, for whatever, just to see what it would spit out. You can do things like that with Jukebox. But of course, as we're seeing, particularly with the AI art stuff in visual art, there are a lot of concerns around that. I don't wanna get into the whole conversation now, because, boy, is that sticky.

There are things like Holly Herndon and Mat Dryhurst's Spawning platform, where they're putting together opt-in and opt-out lists of artist content. What we're seeing from a lot of the artists we're talking to is that when they hear what we're doing, they want to throw their data at us, because they really want this to happen. I think we can get enough data from enthusiastically consenting artists and producers, things that we can clear to be legally scraped and trained on and that are publicly available, plus deals with people who own the data, getting it in through proper channels. We can get sufficiently high-quality models out there without having to train on things that end up being damaging or concerning to anybody. It's a constant conversation, a constant effort to find the balance between 'it would be really cool to have this' and 'let's make sure we're not causing more problems.' The music industry is already very exploitative; it's already everyone fighting for a few dollars. My friends are all DJs and producers. I wouldn't want to take away their last chance of making money in this; I want to empower all of them to have more ways to make more art and make new sounds and new things. I have skin in that game. That's my community.
Morgan:
Scott, I think you wanted to add something in there?
Scott:
If you know where to look, there do exist some high-quality open source audio datasets that come with a Creative Commons license. I was part of developing one a few years ago for a system called SignalTrain, and there are some folks from NYU who have released some things. It's not sufficient to train an 'LMM,' a large music model, but essentially, by aggregating stuff that people give us, people like Jonathan Mann and other friends of Zach's, plus some of these open source datasets that have already been published with an explicit Creative Commons license or something like that, we have terabytes of material that we can train on legally already. It's not the commercial stuff. It doesn't know words, it doesn't know how to do Taylor Swift or Bruno Mars, but it does know some other things.
Morgan:
Justin is here. He's the actual music expert, at least on the Weights & Biases team, so I wanna make sure he has time for a couple of questions. You probably know better than me what sidechain is.
Zach:
Justin, let's do it.
Justin:
That was a great description; you touched on a lot of the questions I was going to have. I was curious how faithfully it might recreate some of the training data, but the fact that it's spitting out smaller samples kind of answered that question. You did touch on this, but I wanted to be a little clearer: we were big fans of the OpenAI Jukebox stuff, but I found a lot of that music sounded kind of underwater and gargled. You could hear that it was supposed to be Elton John, the chord progression felt like an Elton John song, but it sounded like this (muffled sounds). I think you touched on it when you were talking about hi-hats and how high they sit: were there particular things the model really struggled with, or was it really just lower stuff = easier and higher stuff = harder?
Zach:
In general, what we saw very quickly with the diffusion models is that they do great with percussion. They do great with anything that's transient-heavy and more noise-like, and they struggled for a while with anything harmonic, until we figured it out (I don't even know what actually got it to start making harmonic content better), because harmonics are basically the opposite of noise: actual tonal content. In terms of generated audio sounding like it's underwater, that basically means it has a bad high end; it's not reconstructing the high-end harmonics faithfully. I'm sure someone who's been doing this longer could give more detailed reasons why you end up getting more high-end error in general. With an autoregressive model like Jukebox, I'm sure there's a much better mathematical reason for this, but it's just easier to predict the low and mid-range stuff, and the high end ends up suffering for it.

To get something that isn't just good at the low end and garbled at the high end, I guess you would want a more full-band encoding of things, or to use an STFT, where instead of the time domain you give the same amount of importance to the different parts of the frequency spectrum, basically telling the model: 'Hey, I care about it like this, I care about it in these amounts.' You can do things like perceptual weighting to change the EQ curve to be more like the human auditory system, so that when you do the direct comparison for that loss function, it's weighted more towards how we perceive things and doesn't necessarily prioritize the low end as much.

I also think, and maybe this is a hot take, it matters to have that standard of quality where you won't release something that has a bad high end. I kind of have to at some point, because it doesn't always come together like you want it to, but I would much rather put in a lot more time to address those issues than say, 'nah, that's good enough.' I'm saying that now, but I'm sure I'm gonna put something like that out eventually and admit, 'I couldn't figure it out; the high end isn't good yet.' That's one of the things I'm trying to do with Harmonai: have that high standard of quality.

A bit more background: one of the production communities I was in was for the artist Kill the Noise, who is a very well-known dubstep producer, has worked with Skrillex, and has been doing it for a long time. I was kind of his Twitch co-host for a bit, and I was on his Twitch stream when he was playing around with Jukebox and DDSP from Google Magenta. He's a dubstep producer; he can take anything and make it sound good, and he knows pretty much all the different effects to put on things to make them work. But messing with Jukebox and DDSP, the conclusion after a few hours was that this was cool, but it's not quite ready for prime time; it's not quite good enough for me (Kill the Noise) to actually use in my songs. So coming into Harmonai, I wanna make things that are usable by producers. If it's just a cool research project but not usable by musicians, that's not good enough for me. I don't come from the academic sphere, I come from the producer sphere. So if this got 1% better performance on some baseline, that's neat, but does it make dubstep? Can it make a good snare? I am much more interested in applied, practical research for producers and artists, with that bar of quality set higher.
It needed to be at least a 44.1kHz sample rate; 48kHz helps. One of the things I learned talking with one of the guys on the Dadabots team, who we're working with (in terms of neural generative music they're like the biggest names, and they've been doing it for like 10 years), is that when you use a higher sample rate, because there's that error in the high end, if you keep pushing the high end past our hearing range then the error goes up there too. Obviously it's more compute-intensive to do higher sample rates, but you get a little bit of 'well, it sounds bad to dogs.' It's a trade-off, of course, because you're trading off the sample rate against the length of time it can generate.

Having that high-end content, having a 48kHz model, I want 48kHz stereo. I wanna be able to put in cool wide bass sounds and have it make me variations on that. Like Disco Diffusion, that was one of the things I just saw in my head early on that I needed to make. I need to be able to get variations on snares. I need to be able to put in a snare drum and say 'give me 10 of these' without having to tweak the compression and EQ on all the little parts. Here are these, give me more of them. Because that's already what people are looking to sample libraries for with drums: I want a thing that sounds kind of like this, but isn't this. I was working on a song, trying to make a psytrance track, and I needed some drum fills. I had this KSHMR pack from Splice, and I needed a fill that sounds just like this one, but I'd already used that one and I didn't have another that sounds just like it. To learn how to make one requires eight more years of music production skills. While doing that and looking at Disco Diffusion, I was like: it's making variations of things right here, the math is there, someone has to do this for drums. I'll do it! I saw the utility immediately.
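To illustrate the perceptual-weighting idea mentioned a moment ago, here is a hedged sketch of a frequency-weighted spectral comparison: it measures error on an A-weighted STFT magnitude, so the comparison roughly follows human hearing sensitivity instead of letting the loud low end dominate a raw waveform loss. This is an illustration of the concept, not a loss taken from Dance Diffusion.

```python
# A hedged sketch of perceptually weighted spectral comparison (illustrative only).
import torch

def a_weighting(freqs_hz):
    """Standard A-weighting curve as a linear gain per frequency bin."""
    f2 = freqs_hz ** 2
    ra = (12194.0 ** 2 * f2 ** 2) / (
        (f2 + 20.6 ** 2)
        * torch.sqrt((f2 + 107.7 ** 2) * (f2 + 737.9 ** 2))
        * (f2 + 12194.0 ** 2)
    )
    return ra / ra.max()                                     # normalize to [0, 1]

def weighted_spectral_loss(pred, target, sample_rate=48000, n_fft=2048):
    """pred/target: [batch, channels, samples]; loss on A-weighted STFT magnitudes."""
    window = torch.hann_window(n_fft)
    def mag(x):
        spec = torch.stft(x.flatten(0, 1), n_fft, hop_length=n_fft // 4,
                          window=window, return_complex=True)
        return spec.abs()
    freqs = torch.fft.rfftfreq(n_fft, d=1.0 / sample_rate)   # Hz per STFT bin
    w = a_weighting(freqs)[None, :, None]                    # [1, bins, 1]
    return torch.mean(w * (mag(pred) - mag(target)) ** 2)

print(weighted_spectral_loss(torch.randn(1, 2, 48000), torch.randn(1, 2, 48000)))
```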
Justin:
You mentioned not wanting to release things that people could misuse, and we had a question on here that we walked past, but are there things... the only one I know folks use pretty regularly is a stem splitter, which I think is a pretty cool tool. Is there anything that you guys are using? Is there anything you could put in a song and it would sound acceptable? You mentioned OpenAI's Jukebox was not at that level, but I'm curious if there's any other research or models in the field that you've found inspiring or up to that quality threshold.
Scott:
Can I jump in for a second? I just wanna say one other way in which Zach and I are on the same page. I am an honorary member of the audio engineering department at Belmont; all my students and my colleagues are pro audio engineers, and I go to AES conferences. Having that high standard of audio quality is an absolute core value. So much of the machine learning world, when they get into doing musical audio, puts out some piano or something, and it just sounds, I'll use the phrase, it sounds like ass. You can quote me on that. They think that going from 16kHz to 44kHz or 48kHz, 'oh yeah, that's going to solve the problem, that's easy.' No, it's not. It's non-trivial. I mean, yes, the upsampling stuff is getting quite good now, and there is some hope that maybe we can just take some 16kHz audio and it will fill in the highs. There's a lot of hope for that, but until you demonstrate it to me, I'm gonna remain skeptical. And then so much of the audio world in machine learning is purely text and speech based. That's another thing that doesn't always translate to music, and until I hear it as music, I'm gonna need to be convinced.
Zach:
I am not impressed.
Scott:
Yeah, in terms of other people doing great work, I've always been a fan of Accusonus; I think they're amazing. I think iZotope does really great work. These aren't necessarily, well, I don't know at what stage something counts as a generative model versus a non-generative model. I'm gonna pause my thoughts there and turn it back over to Zach.
Zach:
Yeah, that was actually one of the reasons I wanted to get into this. I was pretty aware of the VST and plugin ecosystem out there and what the cutting-edge stuff people are using is, and there just didn't seem to be an obvious game-changing, groundbreaking AI plugin. There are a lot of things that say they use machine learning, and it's like, you've got some classifiers in there, that's cool, but it's not big deep learning stuff. There's also cool stuff like Neural DSP, which has done very cool things and actually uses neural networks. But it seemed to me that AI audio was a lot more buzzwords than the kind of magic I was seeing in the image space, and I was like, okay, where is the magic? Where's the step function, the 'holy crap, that's so good, this changes everything' stuff? I feel like we're at the point now where we can have those things.

I think one of the really hard parts about it is that, from a purely technical standpoint, digital audio processing is done on the CPU. It's serial processing that is not GPU-enabled. I'm not too familiar with GPU Audio, and they have some plans in that space, but the software itself is not made for large-scale AI stuff. I don't wanna be dismissing things like Neutone, the new VST from Qosmo: it's basically a host for open source audio models made to run in the DAW, and they've got RAVE models in there. There is really cool stuff; it's not to say it's not possible. It just takes a lot more optimization and much smaller models to let them run real-time, low-latency inside the software. That's a big blocker. Dance Diffusion is not gonna run in a DAW unless someone does some really cool stuff, because it's not real-time software. Diffusion in general is hard to do in real time, because it takes multiple passes: you have to go through it a hundred times, through a giant U-Net, and you're not gonna do that in the 40 milliseconds you get for real-time processing on a CPU. But things like RAVE, where it's a one-shot encoder-decoder, are much more suitable for that, and you can get real-time TorchScript stuff running; people are putting a lot of effort into that.

It's been really cool to see people get some of the cutting-edge stuff into the software running on CPUs. One of the big blockers is that, even if someone is making these big models, like we were saying, in the ML world audio is a tiny part, audio that isn't speech is a tiny part of that, and music that isn't classical music is a tiny part of that. Then the people who also have the skills to get something running on a CPU, that's like five people, and I'm very happy to know three of them. In terms of my favorite things, honestly, I feel like I haven't even been making music for months now, because I'm trying to make one song that is every song. In terms of what's out there that producers are using, I guess part of it is that a lot of the cool stuff isn't popular yet, and a lot of the popular stuff isn't really that cool as AI. A big 'AI plugin' might be a cool plugin, maybe, but a lot of it's still algorithmic. Things like RAVE are still pretty small.
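As a rough illustration of that real-time argument, here is a toy timing sketch: the same small stand-in network is run once per audio buffer, as a RAVE-style one-shot encoder-decoder would be, and then run 100 times to mimic a multi-pass diffusion loop, with both compared against the roughly 43 ms that a 2048-sample buffer gives you at 48kHz. The model and numbers are illustrative only.

```python
# A toy sketch of the real-time budget argument above; figures are illustrative only.
import time
import torch
import torch.nn as nn

buffer_samples = 2048                       # one DAW processing block
budget_ms = 1000 * buffer_samples / 48000   # ~42.7 ms at 48kHz

net = nn.Sequential(                        # stand-in for a small audio model
    nn.Conv1d(2, 64, 5, padding=2), nn.GELU(),
    nn.Conv1d(64, 64, 5, padding=2), nn.GELU(),
    nn.Conv1d(64, 2, 5, padding=2),
)
net.eval()
x = torch.randn(1, 2, buffer_samples)

def time_ms(fn, repeats=20):
    start = time.perf_counter()
    for _ in range(repeats):
        with torch.no_grad():
            fn()
    return 1000 * (time.perf_counter() - start) / repeats

one_shot = time_ms(lambda: net(x))                                  # RAVE-style single pass
diffusion_like = time_ms(lambda: [net(x) for _ in range(100)], 3)   # 100 denoising passes

print(f"budget per buffer: {budget_ms:.1f} ms")
print(f"one pass:          {one_shot:.1f} ms")
print(f"100 passes:        {diffusion_like:.1f} ms")
```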
Morgan:
I'm curious: with this upcoming Dance Diffusion release, if you could, on release day, snap your fingers and shape a project or a cool use case, something you would hope the community would build, what would that be?
Zach:
Ooh. I'm really curious to see what happens when people start working on guidance. That was one of the coolest things I saw with Disco Diffusion and those libraries: someone would find some other model that I hadn't even heard of and go, oh, you can just hook this up, and it actually works great. I wanna see someone hook up genre guidance to something, train some model on more music, or even try it on ours, because these guidance things can pull things out of distribution. The first CLIP-guided diffusion models were able to make things because CLIP knew about them, not because the diffusion model knew about them. If I can take the model trained on Jonathan Mann's music, give it a classifier and say 'now make this pop country,' and it makes him sing a different genre, that'd be crazy to see: all of a sudden, oh crap, we can already do this, train on someone's discography, pull in this other thing, and it will change their genre. I think there are a lot of things the model itself can do just out of the box, in terms of what people will do with it. There's Ethan Manilow's TagBox repo, which was using music tags to guide Jukebox. That was more for source separation, not necessarily for genre things, but something like that: hooking up existing models to it and seeing what pipelines and combinations you can make.

Speaking of pipelines, there was someone who took the Jukebox code, because Dance Diffusion can kind of act as an upsampler: you take audio, noise it, and denoise it back into high-resolution audio space. They were able to take Jukebox's level-two outputs, which are the pre-upsampled outputs, and upsample them with Dance Diffusion. So instead of taking 18 hours or whatever of that slow transformer to upsample it, they can hook it up to my model and make it sound like Jonathan Mann.
Scott:
Send me a link to that when you get a chance.
Zach:
I'll get you a link to that in a second. That was really cool, and it was from the Jukebox community. There's a Jukebox Discord community run by Broccolu that I remember stumbling upon a couple of years back. I went and visited again a few weeks ago, and they already had a channel for Harmonai. They'd already found Dance Diffusion, they were already playing with it, fine-tuning their own models, making forks of the notebook. I love this community so much. That's the incredible thing here: I can put in all of this work, then just put it out, and it'll just get better. I could sit back and watch it get better. I'm gonna keep working on more things because I love it and I want to, but it's really nice. This is not something I was able to do in a closed company like Microsoft: say I wanna make a cool thing, and not have to figure out how to hide it and market it and make a thing around it and productize it and make sure no one else takes it. I can just say, here it is, here's everything behind it, who wants to make it better, have fun, and people will do that. It's great. It's a very privileged position to be in, to work for Stability and be able to do these things, make open source models, and basically get paid to try to make cool stuff.
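For the curious, here is a hedged sketch of the noise-it-and-denoise-it upsampling trick Zach described a moment ago. The `denoiser` is a placeholder for a pretrained high-resolution model (it is not the actual Dance Diffusion API), and the partial-noising level and Euler-style update are simplified for illustration.

```python
# A hedged sketch of "diffusion as upsampler": naively resample lower-quality audio to
# 48kHz, push it partway back toward noise, then let a high-resolution denoiser finish
# the remaining steps so it fills in plausible high end. `denoiser(x, t)` is a placeholder.
import torch
import torch.nn.functional as F

@torch.no_grad()
def diffusion_upsample(lowres_audio, lowres_sr, denoiser, target_sr=48000,
                       noise_level=0.4, steps=50):
    """lowres_audio: [batch, channels, samples] at lowres_sr."""
    new_len = int(lowres_audio.shape[-1] * target_sr / lowres_sr)
    x = F.interpolate(lowres_audio, size=new_len, mode="linear", align_corners=False)

    # partially re-noise: keep the broad structure, destroy the (missing) fine detail
    t = torch.tensor(noise_level)
    alpha, sigma = torch.cos(t * torch.pi / 2), torch.sin(t * torch.pi / 2)
    x = alpha * x + sigma * torch.randn_like(x)

    # run only the tail of the denoising schedule, from noise_level down to 0
    for t_step in torch.linspace(noise_level, 0.0, steps):
        eps = denoiser(x, t_step.expand(x.shape[0]))
        x = x - (noise_level / steps) * eps      # crude Euler-style update, for illustration
    return x

# usage (with a hypothetical pretrained model):
# hires = diffusion_upsample(lofi_clip, lowres_sr=16000, denoiser=my_pretrained_model)
```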
Morgan:
Yeah, it's an amazing time we're living in with all this open source ML going on. It's a fun ride. You've been super generous with your time; I just have a couple more quick questions. First, where can people find Harmonai? Where can they get involved?
Zach:
Right now we are about to launch our open-invite public server, called Harmonai Play. Hopefully that'll be in the next week or two, and hopefully by the time this goes out we'll have the link for everyone to join; it's a Discord server. Our website is harmonai.org. There is currently a signup list that will get you an invite to the Discord server, and soon that will turn into a link to the server. So harmonai.org is soon to be the way into the Harmonai Play Discord server, which is our community server to discuss Dance Diffusion and AI/ML stuff in general. We also have a closed R&D server we've been working in. I would love for that to be open so everyone can look at it, but trying to manage a work server that is also the main community server is very tricky; we wanna have some separation of our work and play, hence Harmonai Play. It's still gonna be full of people doing the research and making new notebooks, so we'll figure out that split. That's where you'll be able to find us: harmonai.org, not Harmon.AI, which is a different organization. Hopefully that'll all be available quite soon. Harmonai.org/sample-generator is the repo for Dance Diffusion; there you'll find the inference notebook and the fine-tuning notebook.
Morgan:
One last question: what's coming after Dance Diffusion? Can you drop any hints about what else you're working on?
Zach:
I also wanna know. Essentially, Dance Diffusion was me taking code similar to Katherine Crowson's CC12M diffusion, which actually powered the early versions of Midjourney and was very, very good stuff, kind of the step past Disco Diffusion. Since I started working on this, there have been a lot of advancements in diffusion: things like the paper from NVIDIA, 'Elucidating the Design Space of Diffusion-Based Generative Models,' and this guy we met on Twitter, Flavio Schneider, who made this audio-diffusion-pytorch repo that is everything I would've made had I known what I wanted to make. So we brought him in and we're working with him. Basically, there is already a whole new codebase of way better stuff to run these things on, and we're still running more tests every day; I've got three running right now on Weights & Biases. Basically: better, more, and faster. Things like the diffusion autoencoders: once we have something that is better and faster, I will happily put that out.
Scott:
I have a notebook that I'll probably release soon as well, based on Zach's autoencoder work, which lets you play around with the audio embeddings and then listen to the results. The main reason I haven't released it yet is that it's clunky and not friendly, and also I want a chance to discover some of the cool stuff myself. But I do want to follow Zach's philosophy of outsourcing a lot of this exploration to the open source community. There's a little bit of tension as an academic, since I'm supposed to put out papers every now and then, but the Harmonai philosophy under Zach's leadership is great in that we're so open: we'll talk to anybody, we'll share with anybody. We have people who are part of companies on our server because we're just so collaborative, and that's really cool.
Zach:
Yeah, and that helps us in a lot of ways: obviously we can get help from people who already have full-time jobs. I like to say I'm much more interested in collaboration than competition. I think we all win if we get the cool stuff out there and being built upon. Also, feel free to take some of this stuff, Scott, and write papers on it; I'm clearly not writing papers. It's nice to be able to get help from people at other research labs and other companies, because there are a lot of people who want to help with this stuff but can't get projects approved at their company, or don't even work in ML professionally and don't wanna quit their job but wanna help out. So having a space where people can come together and work on it, mostly out of passion, with ideally also a path to working with us (we're still working on that), it's nice to not gatekeep the research.

That was what happened to me when I first got into this stuff. When I first learned about DDSP, I went wild. I needed to learn all this stuff, I needed to know: this is clearly the future, I have to be on top of this, this is the wave that is coming. I was looking around like, where is the community for this? That's actually part of why I started Harmonai. I have been in music communities, I know there are producer communities where you get in there and watch someone making music in Ableton on voice chat, ask questions, learn things from them, and have these challenges. Where is that for music ML? And there wasn't one. There was Valerio Velardo's The Sound of AI, which was cool but more focused on things like classifiers and more traditional music ML. And then there was Google Magenta, or Queen Mary University of London, and I'm not gonna go work for Google and I'm not gonna move to London, so I guess it's just not for me, I guess I have to just not do this. That was basically the beginning of 2021: I had the realization that I want to do this, but there's nowhere I can do it, nowhere for me to go take part in this as an individual researcher.

That all changed with the notebook culture, with Colab, with things like Disco Diffusion: all right, this is chance number two. While messing around with Disco Diffusion and being in the server there, I was talking with Gandamu, who is one of the other devs on Disco Diffusion, and he was like, 'Hey, you're doing some practical research. You should go talk with Emad. He's giving out A100 compute to people doing open source research.' And I was like, research? I'm having fun. What? Oh yeah, I guess technically this is research. So I happened upon him in some other conversation and ended up getting in contact with him. I basically said, I want to do this stuff full time, I'm working at Microsoft, can you compete with their salaries? And he was like, yep, DM me. And I'm like, all right.

So then it came to looking around again. At that point I saw EleutherAI had the large language model stuff, LAION had the big datasets, and the image stuff sat kind of between those two. All right, cool, where is the audio thing? All I could find was one channel in the LAION server for audio datasets. That's it: out of like 30 channels across multiple servers, there was one channel for audio. It was pretty obvious at that point that I maybe had to make a community for it. I have been a Discord community manager, I know how to do that, I know producers, I'm in the space now. I needed this to exist and no one else was making it, so I guess it's me, I guess I'll do it.
Scott:
Can I jump in for a second as well? Later on, I got on these servers. I knew about EleutherAI too, of course, and didn't really fit in there; with LAION I was like, 'Hey, these are supercomputing people. I'm a supercomputing person. I speak this language. This is great. Oh, but they're not doing the audio that I'm interested in. Where are the people doing real music?' That was another cool thing about Zach and I finding each other.
Zach:
It kind of helps that there was just a need for it. There was a vacuum in the open source AI space for a place that had the people, the passion, and most importantly, the compute. Like Scott was saying, when I first made Harmonai, that was the same week a bunch of other Discord servers were starting up. Okay, if I'm making a new server, yet another AI art Discord server, I need a reason for people to be in my place and not the other ones, which were all more open-invite and already had known researchers. Am I just making another Sound of AI? What makes this different? A compute cluster is a very, very good value proposition: here's this server, and we've got A100s. That was definitely a lot of help early on in getting people excited and interested in the community, being able to say you can actually act on the things you want to do here, we won't stop you. We'll have our own standards for who gets onto the cluster, and in terms of 'I wanna make music,' I can't sell that to Alexa for 10 million, no, I know it doesn't happen just like that, but there are a lot of excited researchers, and it's nice to have a space to say, yes, I agree that's cool, come help us.
Morgan:
I love that. That's amazing. And, lads, I love the energy you two have; I can totally see you on stage someday, in some way, some form, playing a stadium with some of this music. Well, you've been super generous with your time, and I think we're way over what we planned, so we can probably wrap it there. Thanks again for taking the time; we're really looking forward to working with you guys and listening to some great beats.
Zach:
For sure. Thanks for having us.