Harmonai's New Audio Diffusion Model: Stable Audio
A new model for generating audio!
Stability AI has made waves in the AI world with systems like Stable Diffusion that can generate realistic images from text prompts. Now, the company's generative audio research lab Harmonai has unveiled a new system called Stable Audio that can generate high-fidelity stereo audio faster than real time.
The evolution of diffusion-based generative models has significantly advanced generative AI, improving both the quality and controllability of generated media. Among these models, latent diffusion models offer superior speed for both training and inference. Yet generating audio, particularly full-length songs, has remained a difficult challenge because conventional diffusion models produce fixed-size outputs.
The Architecture
Stable Audio is built on a latent diffusion model, which runs the diffusion process in a compressed latent space rather than directly on raw audio samples.
First, a variational autoencoder (VAE) compresses the audio into a compact latent representation. The VAE uses a fully-convolutional neural network architecture to enable encoding/decoding of audio clips of arbitrary lengths while preserving quality.
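To make this concrete, here is a minimal sketch of a fully-convolutional audio VAE in PyTorch. The layer sizes, downsampling factor, and latent dimension are illustrative assumptions, not Stable Audio's actual architecture; the point is that a purely convolutional encoder/decoder pair can handle waveforms of any length.

```python
import torch
import torch.nn as nn

class AudioVAE(nn.Module):
    """Illustrative fully-convolutional audio VAE (not Stable Audio's actual architecture).

    Because every layer is convolutional, the encoder and decoder accept
    waveforms of arbitrary length: a clip of T samples is compressed to a
    latent sequence of roughly T / 32 frames with this layer stack.
    """

    def __init__(self, channels=2, latent_dim=64, hidden=128):
        super().__init__()
        # Strided 1-D convolutions downsample the waveform in time.
        self.encoder = nn.Sequential(
            nn.Conv1d(channels, hidden, kernel_size=7, stride=2, padding=3),
            nn.GELU(),
            nn.Conv1d(hidden, hidden, kernel_size=7, stride=4, padding=3),
            nn.GELU(),
            nn.Conv1d(hidden, 2 * latent_dim, kernel_size=7, stride=4, padding=3),
        )
        # Transposed convolutions mirror the encoder to reconstruct audio.
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(latent_dim, hidden, kernel_size=8, stride=4, padding=2),
            nn.GELU(),
            nn.ConvTranspose1d(hidden, hidden, kernel_size=8, stride=4, padding=2),
            nn.GELU(),
            nn.ConvTranspose1d(hidden, channels, kernel_size=8, stride=2, padding=3),
        )

    def encode(self, audio):
        mean, logvar = self.encoder(audio).chunk(2, dim=1)
        return mean + torch.randn_like(mean) * (0.5 * logvar).exp()  # reparameterization

    def decode(self, latents):
        return self.decoder(latents)

# A 10-second stereo clip at 44.1 kHz becomes a much shorter latent sequence.
vae = AudioVAE()
clip = torch.randn(1, 2, 441_000)
latents = vae.encode(clip)            # shape: (1, 64, ~13_800)
reconstruction = vae.decode(latents)  # back to a stereo waveform
```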
To enable text conditioning, Stable Audio uses the text encoder from a CLAP (Contrastive Language-Audio Pretraining) model trained on its dataset to extract semantic features from prompts. This helps the model learn the relationships between words and sounds.
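As a rough illustration of this step, the snippet below extracts per-token features from a prompt using an off-the-shelf CLIP text encoder from Hugging Face as a stand-in; Stable Audio trains its own CLAP-style encoder rather than using this particular model.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# Stand-in text encoder; Stable Audio trains its own CLAP-style encoder on its dataset.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

prompt = "ambient electronic track, warm pads, 120 BPM"
tokens = tokenizer(prompt, padding="max_length", truncation=True, return_tensors="pt")

with torch.no_grad():
    # Per-token hidden states are what a diffusion U-Net would cross-attend to.
    text_features = text_encoder(**tokens).last_hidden_state  # (1, seq_len, 512)
```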
In addition, the model is conditioned on timing embeddings representing the start time and total length of the desired output. This allows for generating audio of flexible duration.
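One plausible way to implement this conditioning, sketched below, is to project the start time and total length (in seconds) into the conditioning dimension and append them as extra tokens; the module and its parameters are hypothetical, not taken from the Stable Audio code.

```python
import torch
import torch.nn as nn

class TimingConditioner(nn.Module):
    """Hypothetical timing-embedding module (illustrative, not Stable Audio's exact code).

    Embeds two scalars: the start time of the training crop within its source
    file and the total length of the desired output, so the model can learn to
    place intros, endings, and so on at sensible positions.
    """

    def __init__(self, cond_dim=512):
        super().__init__()
        self.start_proj = nn.Linear(1, cond_dim)
        self.length_proj = nn.Linear(1, cond_dim)

    def forward(self, start_seconds, total_seconds):
        start = self.start_proj(torch.tensor([[start_seconds]], dtype=torch.float32))
        length = self.length_proj(torch.tensor([[total_seconds]], dtype=torch.float32))
        # Two extra conditioning tokens, appended alongside the text features.
        return torch.stack([start, length], dim=1)  # (1, 2, cond_dim)

# Ask for a 95-second clip that begins at the start of the piece.
timing_tokens = TimingConditioner()(start_seconds=0.0, total_seconds=95.0)
```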
The conditioned diffusion model itself is a 900-million parameter convolutional U-Net architecture. It leverages residual blocks, self-attention, and cross-attention layers to iteratively denoise the latent audio conditioned on text and timing. Memory optimizations to the attention modules enable the model to process longer sequences.
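Putting the pieces together, sampling from a conditional latent diffusion model follows a loop of the kind below: a generic DDIM-style denoising loop over the latents, conditioned on the text and timing tokens. The `unet` callable, noise schedule, and step count here are assumptions for illustration; the released system uses its own sampler and a far larger U-Net.

```python
import torch

@torch.no_grad()
def sample_latents(unet, shape, text_features, timing_tokens, num_steps=100):
    """Generic iterative denoising loop for a conditional latent diffusion model.

    `unet` is assumed to predict the noise in `latents` given the diffusion
    step and the conditioning (text + timing) it cross-attends to; the real
    Stable Audio sampler and noise schedule may differ.
    """
    cond = torch.cat([text_features, timing_tokens], dim=1)
    latents = torch.randn(shape)                      # start from pure noise
    betas = torch.linspace(1e-4, 0.02, num_steps)     # simple linear schedule
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

    for t in reversed(range(num_steps)):
        noise_pred = unet(latents, t, cond)           # predicted noise at step t
        alpha_bar = alphas_cumprod[t]
        # Estimate the clean latents, then step to the previous noise level
        # using the same predicted noise (deterministic DDIM-style update).
        latents = (latents - (1 - alpha_bar).sqrt() * noise_pred) / alpha_bar.sqrt()
        if t > 0:
            prev_bar = alphas_cumprod[t - 1]
            latents = prev_bar.sqrt() * latents + (1 - prev_bar).sqrt() * noise_pred

    return latents  # decode with the VAE to obtain a stereo waveform
```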


Training
The system was trained on a dataset of over 800,000 audio files containing music, sound effects, and instrumental samples. In tests, Stable Audio could generate 95 seconds of stereo audio at 44.1 kHz in under one second on an NVIDIA A100 GPU.
Can it Talk?
The training data consisted of music, sound effects, and instrumental samples, with no mention of speech, so the model likely does not have strong out-of-the-box capabilities for generating natural human speech.
Open To All
Harmonai says Stable Audio represents the state of the art in AI audio generation. The lab plans to release open-source versions of the system to advance audio AI research. Generating controllable, high-quality audio on demand could open up new creative possibilities across many industries.
The announcement: https://stability.ai/research/stable-audio-efficient-timing-latent-diffusion