
A Guide to Generating Music using AudioCraft

This article provides a comprehensive guide to using state-of-the-art music and audio generation models using AudioCraft from Meta, along with Weights & Biases.
Created on September 13 | Last edited on October 2
AudioCraft is a PyTorch library from Meta Research for deep learning research on audio generation. AudioCraft contains inference and training code for two state-of-the-art AI generative models producing high-quality audio: AudioGen and MusicGen, both of which boast significant improvements over similar existing models such as MusicLM. In this report, we'll explore the following:
  • The architectures of MusicGen and AudioGen
  • How we can leverage the easy-to-use API offered by AudioCraft to generate audio using these models
  • How we can manage our audio generation experiments using Weights & Biases
  • How we can use the interactive audio player and waveform visualizer from Weights & Biases to analyze and listen to the generated audio
As a note, you can generate your own audio with the accompanying interactive Colab notebook on a free-tier GPU!
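If you just want to see the moving parts, here is a minimal sketch of what that workflow looks like: load a pre-trained MusicGen checkpoint through AudioCraft's Python API, generate audio from a text prompt, and log the result to Weights & Biases so it shows up in the interactive audio player. The prompt, checkpoint size, and project name below are arbitrary choices for illustration.

```python
import wandb
from audiocraft.models import MusicGen

# Load a small pre-trained MusicGen checkpoint (larger variants also exist).
model = MusicGen.get_pretrained("facebook/musicgen-small")
model.set_generation_params(duration=8)  # seconds of audio to generate

# Generate audio from a text prompt; the output is a batch of waveforms.
prompts = ["lo-fi hip hop beat with mellow piano"]
wav = model.generate(prompts)  # shape: [batch, channels, samples]

# Log the generated clips to Weights & Biases for interactive playback.
run = wandb.init(project="audiocraft-musicgen")  # hypothetical project name
for prompt, audio in zip(prompts, wav):
    run.log({
        "generated_audio": wandb.Audio(
            audio.squeeze().cpu().numpy(),
            sample_rate=model.sample_rate,
            caption=prompt,
        )
    })
run.finish()
```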



And, since this is a GenAI report, we know what you want upfront: some amazing audio to get you started. We got you.


Music generated by MusicGen
Audio generated by AudioGen



🎶 A Brief Overview of MusicGen

The architecture of MusicGen was proposed in the paper Simple and Controllable Music Generation by researchers from Meta Research, in order to tackle the task of high-quality conditional music generation at 32 kHz. MusicGen is a single language model that operates over several streams of compressed discrete music representation, i.e., tokens.
Some notable details about the architecture of MusicGen:
  • MusicGen comprises a single-stage transformer language model together with efficient token interleaving patterns, which eliminates the need for cascading several models (e.g., hierarchically or via upsampling). This enables MusicGen to generate high-quality samples while being conditioned on a textual description or melodic features, allowing better control over the generated output.
  • MusicGen is modeled over the quantized units from an EnCodec audio tokenizer, which provides high-fidelity reconstruction from a low-frame-rate discrete representation. Compression models like EnCodec employ Residual Vector Quantization (RVQ), which results in several parallel streams. Under this setting, each stream comprises discrete tokens originating from different learned codebooks.
  • The paper also proposes a novel modeling framework that generalizes to various codebook interleaving patterns, and the authors explore several variants. Through these patterns, the model can leverage the internal structure of the quantized audio tokens.
  • MusicGen supports conditional generation based on either text or melody. For encoding text, the authors experiment with a pre-trained T5 encoder, FLAN-T5, and CLAP. For melody conditioning, the authors experiment with controlling the melodic structure via joint conditioning on the input’s chromagram and text description (see the sketch after this list).
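As a rough illustration of the melody-conditioned variant, the sketch below loads the musicgen-melody checkpoint and conditions generation on both a text description and a reference melody via AudioCraft's generate_with_chroma API. The file path and prompt are placeholders.

```python
import torchaudio
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

# The melody-conditioned MusicGen checkpoint.
model = MusicGen.get_pretrained("facebook/musicgen-melody")
model.set_generation_params(duration=8)

# Load a reference melody; its chromagram conditions the generation.
melody, sr = torchaudio.load("reference_melody.wav")  # placeholder file path

wav = model.generate_with_chroma(
    descriptions=["an upbeat acoustic folk track"],
    melody_wavs=melody[None],  # add a batch dimension: [1, channels, samples]
    melody_sample_rate=sr,
)

# Write the result to disk with loudness normalization.
audio_write("melody_conditioned_0", wav[0].cpu(), model.sample_rate, strategy="loudness")
```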


The codebook interleaving patterns


🥁 MultiBand Diffusion using MusicGen

AudioCraft also lets us pair the MusicGen models with MultiBand Diffusion, which decodes the tokens produced by the EnCodec tokenizer into waveform audio. MultiBand Diffusion is a high-fidelity, multi-band, diffusion-based framework that generates any type of audio modality from low-bitrate discrete representations, proposed in the paper From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion. At an equal bit rate, the MultiBand Diffusion approach outperforms state-of-the-art generative techniques in terms of perceptual quality.
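Here is a minimal sketch of how this can look with a recent AudioCraft release: we ask MusicGen to also return its EnCodec tokens, then decode those tokens with the MultiBand Diffusion decoder instead of EnCodec's own decoder. The prompt is an arbitrary example.

```python
from audiocraft.models import MusicGen, MultiBandDiffusion
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained("facebook/musicgen-small")
model.set_generation_params(duration=8)
mbd = MultiBandDiffusion.get_mbd_musicgen()  # diffusion decoder matched to MusicGen's tokenizer

# Ask for the discrete EnCodec tokens alongside the default decoded waveform.
wav_encodec, tokens = model.generate(
    ["a gentle ambient pad with slow strings"], return_tokens=True
)

# Decode the same tokens with MultiBand Diffusion for higher perceptual quality.
wav_diffusion = mbd.tokens_to_wav(tokens)

audio_write("musicgen_encodec_0", wav_encodec[0].cpu(), model.sample_rate, strategy="loudness")
audio_write("musicgen_mbd_0", wav_diffusion[0].cpu(), model.sample_rate, strategy="loudness")
```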

🎸 Examples of Music Generated by MusicGen + MultiBand Diffusion


Examples of Music Generated by MusicGen


📢 A Brief Overview of AudioGen

The architecture of AudioGen was proposed in the paper AudioGen: Textually Guided Audio Generation by researchers from Meta Research, in order to tackle the task of generating high-quality audio conditioned on descriptive text captions. Some notable details about the architecture of AudioGen include:
  • AudioGen is an auto-regressive generative model that operates on a learned discrete audio representation.
  • The training of AudioGen consists of two main steps: first, learning a discrete representation of the raw audio using an auto-encoding method; second, training a Transformer language model over the codes obtained from the audio encoder, conditioned on textual features.
  • At inference time, a new set of audio tokens is sampled from the language model given the text features. These tokens can then be decoded into the waveform domain using the decoder component (see the sketch after this list).
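A minimal sketch of running this text-to-audio pipeline through AudioCraft's AudioGen API is shown below; the prompts and generation length are arbitrary examples.

```python
from audiocraft.models import AudioGen
from audiocraft.data.audio import audio_write

# Load a pre-trained AudioGen checkpoint.
model = AudioGen.get_pretrained("facebook/audiogen-medium")
model.set_generation_params(duration=5)  # seconds of audio to generate

# Text-conditioned generation: the LM samples audio tokens given the text
# features, and the decoder turns them back into waveforms.
descriptions = ["dog barking in the distance", "rain falling on a tin roof"]
wav = model.generate(descriptions)

for idx, one_wav in enumerate(wav):
    audio_write(f"audiogen_{idx}", one_wav.cpu(), model.sample_rate, strategy="loudness")
```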


General Overview of the AudioGen System

Due to the way audio travels through a medium, differentiating “objects” can be a difficult task (e.g., separating multiple people speaking simultaneously). This is further complicated by real-world recording conditions (e.g., background noise, reverberation, etc.). Scarce text annotations impose another constraint, limiting the ability to scale models. Finally, modeling high-fidelity audio requires encoding audio at a high sampling rate, leading to extremely long sequences. To alleviate the aforementioned challenges, the AudioGen paper proposes an augmentation technique that mixes different audio samples, driving the model to internally learn to separate multiple sources.
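To make the mixing idea concrete, here is a toy sketch that overlays two clips at a chosen signal-to-noise ratio and merges their captions. This is only an illustration of the concept, not the exact augmentation recipe from the AudioGen paper.

```python
import torch

def mix_examples(wav_a: torch.Tensor, wav_b: torch.Tensor,
                 caption_a: str, caption_b: str, snr_db: float = 0.0):
    """Toy mixing augmentation: overlay two clips and merge their captions.

    Illustrative only; the paper's actual recipe (random SNRs, offsets, etc.)
    differs in its details.
    """
    # Trim both clips to the same length so they can be summed.
    length = min(wav_a.shape[-1], wav_b.shape[-1])
    wav_a, wav_b = wav_a[..., :length], wav_b[..., :length]

    # Scale the second clip so the mixture has the requested SNR.
    gain = (wav_a.pow(2).mean() / (wav_b.pow(2).mean() + 1e-8)).sqrt()
    gain = gain / (10 ** (snr_db / 20))
    mixed = wav_a + gain * wav_b

    return mixed, f"{caption_a} and {caption_b}"
```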


🔊 Examples of Audio Generated by AudioGen


Examples of Audio Generated by AudioGen


🏁 Conclusion

  • In this report, we explored the text-conditional music and audio generation capabilities of MusicGen and AudioGen, two SoTA models available in AudioCraft.
  • We briefly explored the details of the architectures of MusicGen and AudioGen.
  • We also explored MultiBand Diffusion with MusicGen, which enables us to generate any type of audio modality (such as speech, music, etc.) from low-bitrate discrete representations.
  • We explored some amazing music and non-music audio generated by MusicGen and AudioGen using the interactive audio player and waveform visualizer offered by Weights & Biases, along with their respective spectrograms.
  • Kudos to Atanu for contributing the spectrogram visualization logic.

