Music Generation with Google's MusicLM
Google has recently announced MusicLM, a language model that can turn descriptive text about music into actual music!
Google has recently announced a language model that can turn descriptive text about music into actual music! As this article points out, AI music generation isn't anything new. There have been other attempts like Dance Diffusion (check out this W&B article!), OpenAI Jukebox, and many more.
What makes MusicLM different? According to the authors, their approach outperforms all previous music generation systems and adheres better to the text. I encourage you to check out the samples they have released here!
How does it work?

MusicLM builds on three pretrained models: SoundStream, w2v-BERT, and MuLan. SoundStream is an end-to-end encoder-decoder framework that uses Residual Vector Quantization (RVQ), which you can think of as a way to compress audio into discrete tokens at a low bitrate while keeping the reconstructed audio high quality. w2v-BERT combines an NLP training objective called Masked Language Modeling (MLM) with a contrastive loss to learn speech representations that capture both local acoustic detail and longer-range context.
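If RVQ is unfamiliar, here's a minimal sketch of the idea: each codebook quantizes whatever residual the previous codebooks couldn't capture, so a frame of audio becomes a short stack of discrete token IDs. All of the sizes and names below are made up for illustration and aren't taken from the SoundStream paper.

```python
import numpy as np

# Toy illustration of residual vector quantization (RVQ).
# Codebook count, codebook size, and embedding dimension are assumptions for clarity.
rng = np.random.default_rng(0)

num_quantizers = 4      # number of residual codebooks (assumed)
codebook_size = 256     # entries per codebook (assumed)
dim = 64                # embedding dimension of one audio frame (assumed)

codebooks = rng.normal(size=(num_quantizers, codebook_size, dim))

def rvq_encode(frame):
    """Quantize one frame into `num_quantizers` token IDs, each refining the residual."""
    residual = frame
    token_ids = []
    for codebook in codebooks:
        # Pick the nearest codebook entry for the current residual.
        distances = np.linalg.norm(codebook - residual, axis=1)
        idx = int(np.argmin(distances))
        token_ids.append(idx)
        # Subtract the chosen entry so the next codebook models what's left over.
        residual = residual - codebook[idx]
    return token_ids

def rvq_decode(token_ids):
    """Reconstruct the frame by summing the selected entry from each codebook."""
    return sum(codebooks[q][idx] for q, idx in enumerate(token_ids))

frame = rng.normal(size=dim)
tokens = rvq_encode(frame)
reconstruction = rvq_decode(tokens)
print(tokens, np.linalg.norm(frame - reconstruction))
```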
While these two models focus purely on audio, MuLan is trained contrastively to jointly understand audio and text (i.e., does this piece of text describe this audio?).
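To make the contrastive part concrete, here's a minimal CLIP-style sketch of this kind of objective, with random tensors standing in for MuLan's audio and text towers: matching audio-text pairs are pulled together while mismatched pairs are pushed apart. The batch size, dimensions, and temperature below are arbitrary.

```python
import torch
import torch.nn.functional as F

# Pretend tower outputs: in reality these come from large audio and text encoders.
batch, dim = 8, 32
audio_emb = F.normalize(torch.randn(batch, dim), dim=-1)
text_emb = F.normalize(torch.randn(batch, dim), dim=-1)

temperature = 0.07
logits = audio_emb @ text_emb.T / temperature  # pairwise audio-text similarities

# The i-th audio clip matches the i-th caption, so the target is the diagonal.
targets = torch.arange(batch)
loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
print(loss.item())
```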
These three models are combined into what the authors call a hierarchical sequence-to-sequence modeling task, shown below.

The diagram on the left details training and the right one details inference.
For training, they use a decoder-only transformer to map MuLan audio tokens to the semantic tokens from w2v-BERT. They presumably train a second decoder-only transformer to map both the MuLan audio tokens and the semantic tokens to the acoustic tokens from SoundStream. So, if I'm reading it correctly, the entire training framework ingests target audio and learns to predict its acoustic tokens.
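Here's how I'd sketch those two stages in heavily simplified code. Everything below is a stand-in: a tiny GRU instead of a decoder-only transformer, and random token IDs instead of real MuLan, w2v-BERT, and SoundStream outputs. The point is only to show which tokens condition which predictions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab = 1024  # assumed shared token vocabulary size, for illustration only
dim = 64

class TinyAutoregressiveLM(nn.Module):
    """Stand-in for a decoder-only transformer: predicts target tokens given a prefix."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab)

    def forward(self, prefix_tokens, target_tokens):
        # Teacher forcing: feed prefix + targets, predict each target token in turn.
        x = self.embed(torch.cat([prefix_tokens, target_tokens], dim=1))
        h, _ = self.rnn(x)
        logits = self.head(h[:, prefix_tokens.size(1) - 1:-1])
        return F.cross_entropy(logits.reshape(-1, vocab), target_tokens.reshape(-1))

semantic_model = TinyAutoregressiveLM()   # stage 1: MuLan tokens -> semantic tokens
acoustic_model = TinyAutoregressiveLM()   # stage 2: MuLan + semantic -> acoustic tokens

# Fake tokenized batch (in reality these come from MuLan, w2v-BERT, and SoundStream).
mulan_tokens = torch.randint(0, vocab, (2, 12))
semantic_tokens = torch.randint(0, vocab, (2, 25))
acoustic_tokens = torch.randint(0, vocab, (2, 50))

loss_semantic = semantic_model(mulan_tokens, semantic_tokens)
loss_acoustic = acoustic_model(
    torch.cat([mulan_tokens, semantic_tokens], dim=1), acoustic_tokens)
print(loss_semantic.item(), loss_acoustic.item())
```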
For inference, they use the text tower of MuLan to embed the text prompt into MuLan tokens. The first trained transformer maps these to semantic tokens, and the second combines the MuLan tokens and semantic tokens to produce acoustic tokens, which SoundStream finally decodes into audio.
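As a sketch rather than the real API, the inference flow then looks roughly like this. Every function below is a dummy that returns random tensors, but the chaining of tokens mirrors the description above.

```python
import torch

vocab = 1024  # assumed vocabulary size, for illustration only

def mulan_text_tokens(prompt: str) -> torch.Tensor:
    return torch.randint(0, vocab, (1, 12))   # stand-in for MuLan's text tower

def generate_semantic(mulan_tokens: torch.Tensor) -> torch.Tensor:
    return torch.randint(0, vocab, (1, 25))   # stand-in for the semantic-stage transformer

def generate_acoustic(prefix: torch.Tensor) -> torch.Tensor:
    return torch.randint(0, vocab, (1, 50))   # stand-in for the acoustic-stage transformer

def soundstream_decode(acoustic_tokens: torch.Tensor) -> torch.Tensor:
    return torch.randn(1, 24_000)             # stand-in for SoundStream's decoder (~1 s of audio)

def generate_music(prompt: str) -> torch.Tensor:
    mulan = mulan_text_tokens(prompt)
    semantic = generate_semantic(mulan)
    acoustic = generate_acoustic(torch.cat([mulan, semantic], dim=1))
    return soundstream_decode(acoustic)

waveform = generate_music("a calming violin melody backed by a distorted guitar riff")
print(waveform.shape)
```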
From what I understand, they have three models, each generating embeddings for a different component: text-audio, semantic, and acoustic. These act as extremely sophisticated dictionaries, and the two decoder-only transformers trained on top are responsible for looking things up in those dictionaries and learning the correct mappings between them.
References
- Agostinelli, Andrea, et al. “MusicLM: Generating Music From Text.” arXiv preprint arXiv:2301.11325, 2023.
- Wiggers, Kyle. “Google Created an AI That Can Generate Music from Text Descriptions, but Won't Release It.” TechCrunch, 27 Jan. 2023.