Stability AI Releases Stable Audio Open

Created on June 6 | Last edited on June 6

Stability AI has released Stable Audio Open, an open-source model that generates short audio samples and sound effects from text prompts. The release gives sound designers, musicians, and the broader creative community a tool for producing high-quality audio.

Audio Generation

Stable Audio Open generates up to 47 seconds of high-quality audio from a simple text description. It excels at drum beats, instrument riffs, ambient sounds, foley recordings, and other samples used in music production and sound design. It is not optimized for full songs, coherent melodies, or vocals, so it is better suited to sound design than to complete music production. A key feature is that the model can be fine-tuned on custom audio data. For instance, a drummer could fine-tune it on recordings of their own drum sessions to generate new beats in their style.
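The prompt and duration constraints described above can be sketched in a few lines of Python. This is a minimal sketch, not the official API: the helper name `build_conditioning` is hypothetical, and the dictionary keys (`prompt`, `seconds_start`, `seconds_total`) are an assumption modeled on the conditioning format shown in the stable-audio-tools examples. Only the 47-second ceiling comes from the description above.

```python
MAX_SECONDS = 47  # upper bound on clip length stated for Stable Audio Open


def build_conditioning(prompt: str, seconds_total: float) -> list[dict]:
    """Return a one-item conditioning list for a text-to-audio request.

    Hypothetical helper: the key names are assumed, not confirmed by the article.
    """
    prompt = prompt.strip()
    if not prompt:
        raise ValueError("prompt must not be empty")
    if not 0 < seconds_total <= MAX_SECONDS:
        raise ValueError(
            f"seconds_total must be in (0, {MAX_SECONDS}], got {seconds_total}"
        )
    return [{"prompt": prompt, "seconds_start": 0, "seconds_total": seconds_total}]


# A short production element, the kind of request the model is aimed at.
request = build_conditioning("128 BPM tech house drum loop", seconds_total=30)
print(request)
```

Requests longer than 47 seconds are rejected up front rather than silently truncated, which keeps the limit visible to whoever writes the prompt.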

Architecture

Stable Audio Open 1.0 is a latent diffusion model based on a transformer architecture. It uses a pre-trained T5 model (t5-base) for text conditioning: the text prompt is converted into numerical embeddings that guide the audio generation process. The model was trained on 486,492 audio recordings, 472,618 from Freesound and 13,874 from the Free Music Archive (FMA). All audio files are licensed under CC0, CC BY, or CC Sampling+, which respects creator rights while still providing a large training set.

To ensure that no unauthorized copyrighted music was included, the dataset went through an extensive verification process. Music samples in the Freesound subset were identified using the PANNs music classifier. Samples with a high probability of containing music were then analyzed by Audible Magic’s identification services, and suspected copyrighted music was removed. For the FMA subset, a metadata search against a large database of copyrighted music was performed, and flagged content was reviewed manually. After this process, the Freesound subset consists of 266,324 CC0, 194,840 CC BY, and 11,454 CC Sampling+ recordings, and the FMA subset consists of 8,967 CC BY and 4,907 CC0 tracks.
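The counts quoted above can be cross-checked with a short tally. The sketch below uses only figures from the text; the dictionary layout and variable names are illustrative. It checks that the per-license counts add up to the per-source totals and to the overall 486,492 recordings.

```python
# Per-license recording counts, as quoted in the text above.
license_counts = {
    "Freesound": {"CC0": 266_324, "CC BY": 194_840, "CC Sampling+": 11_454},
    "FMA": {"CC BY": 8_967, "CC0": 4_907},
}

# Per-source and overall totals, as quoted in the text above.
stated_source_totals = {"Freesound": 472_618, "FMA": 13_874}
stated_total = 486_492

# Recompute the totals from the license breakdown.
source_totals = {src: sum(counts.values()) for src, counts in license_counts.items()}
grand_total = sum(source_totals.values())

for src, total in source_totals.items():
    share = 100 * total / grand_total
    print(f"{src}: {total:,} recordings ({share:.1f}% of the dataset)")
print(f"Total: {grand_total:,}")

# True only if the breakdown matches every stated total.
consistent = source_totals == stated_source_totals and grand_total == stated_total
print("Counts consistent:", consistent)
```

The breakdown is internally consistent, and it shows that Freesound supplies about 97% of the training recordings.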

How is Stable Audio Open Different from Stable Audio?

The commercial Stable Audio product generates high-quality full tracks of up to three minutes with coherent musical structure, and offers advanced capabilities such as audio-to-audio generation and multi-part musical compositions. In contrast, Stable Audio Open specializes in shorter audio samples, sound effects, and production elements. While it can create brief musical clips, it is not designed for complete songs, melodies, or vocals. The open model offers a look into generative AI for sound design, with a focus on responsible development in collaboration with creative communities.

Available on Hugging Face

The model weights for Stable Audio Open are available on Hugging Face. Sound designers, musicians, developers, and audio enthusiasts are encouraged to download the model, explore its capabilities, and provide feedback. This release is a step forward in open and responsible audio generation, and the development team looks forward to continued research and collaboration with creative communities. To stay updated on the project, follow Stability AI on Twitter, Instagram, and LinkedIn, and join its Discord community.
Tags: ML News