
Music Generation with Google's MusicLM

Google has recently released MusicLM, a language model that can turn descriptive text about music into actual music!
Google has recently released a language model that can turn descriptive text about music into actual music! As this article points out, AI music generation isn't anything new: there have been other attempts like Dance Diffusion (check out this W&B article!), OpenAI's Jukebox, and many more.
What makes MusicLM different? According to the authors, their approach outperforms previous music generation systems in both audio quality and adherence to the text description. I encourage you to check out the samples they have released here!

How does it work?

MusicLM is composed of three models: SoundStream, W2v-BERT, and MuLan. From MusicLM's paper:


SoundStream is an end-to-end encoder-decoder neural audio codec that uses Residual Vector Quantization (RVQ), which you can think of as a way to represent audio with discrete tokens at a low bitrate while preserving the quality of the reconstruction. W2v-BERT combines Masked Language Modeling (MLM), a training objective borrowed from NLP, with a contrastive loss to train a model that learns contextualized speech representations.
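To make the RVQ idea a bit more concrete, here is a toy NumPy sketch, not Google's implementation: the codebooks here are random, whereas SoundStream learns them end to end. Each stage quantizes the residual left over by the previous one.

```python
import numpy as np

rng = np.random.default_rng(0)
num_stages, codebook_size, dim = 4, 256, 8
# Random stand-in codebooks; SoundStream learns these jointly with the codec.
codebooks = rng.normal(size=(num_stages, codebook_size, dim))

def rvq_encode(x, codebooks):
    """Return one code index per stage for a single vector x."""
    residual = x.copy()
    indices = []
    for codebook in codebooks:
        # Pick the codeword closest to the current residual...
        idx = np.argmin(np.linalg.norm(codebook - residual, axis=1))
        indices.append(idx)
        # ...and pass what is left over to the next stage.
        residual = residual - codebook[idx]
    return indices

def rvq_decode(indices, codebooks):
    """Sum the chosen codewords from every stage to reconstruct x."""
    return sum(codebook[idx] for codebook, idx in zip(codebooks, indices))

x = rng.normal(size=dim)
codes = rvq_encode(x, codebooks)
x_hat = rvq_decode(codes, codebooks)
print(codes, np.linalg.norm(x - x_hat))
```

In this toy setup every vector is described by four 8-bit indices, and adding or dropping stages trades bitrate against reconstruction quality.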
While SoundStream and W2v-BERT focus purely on audio, MuLan is trained contrastively to jointly embed audio and text, i.e. to learn whether a given piece of text is associated with a given piece of audio.
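For intuition, here is a minimal CLIP-style contrastive loss in PyTorch. It is only a sketch of the idea behind MuLan, not its actual code; the real model uses large audio and text towers trained on music-text pairs.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(text_emb, audio_emb, temperature=0.07):
    # Cosine similarities between every text and every audio clip in the batch.
    text_emb = F.normalize(text_emb, dim=-1)
    audio_emb = F.normalize(audio_emb, dim=-1)
    logits = text_emb @ audio_emb.T / temperature
    # The i-th text matches the i-th audio clip; everything else is a negative.
    targets = torch.arange(len(logits))
    # Symmetric cross-entropy over text->audio and audio->text directions.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

text_emb = torch.randn(4, 128)   # stand-in outputs of a text tower
audio_emb = torch.randn(4, 128)  # stand-in outputs of an audio tower
print(contrastive_loss(text_emb, audio_emb))
```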
These three models are combined into what the authors call a hierarchical sequence-to-sequence modeling task, shown below.


The diagram on the left details training and the right one details inference.
For training, they use a decoder-only Transformer to map MuLan audio tokens to the semantic tokens produced by W2v-BERT. Presumably a second decoder-only Transformer then maps both the MuLan audio tokens and the W2v-BERT semantic tokens to the acoustic tokens produced by SoundStream. So, if I understand correctly, the entire training framework ingests the target audio and learns to predict its acoustic tokens.
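The layout below is how I picture the two stages lining up their tokens during training. It is purely illustrative (the token values and variable names are made up, not MusicLM's code); each stage is plain next-token prediction with a decoder-only Transformer over the concatenated sequence.

```python
mulan_audio_tokens = [7, 42, 3]        # M_A: audio tokens from MuLan
semantic_tokens    = [11, 5, 19, 8]    # S:   semantic tokens from W2v-BERT
acoustic_tokens    = [2, 2, 9, 31, 4]  # A:   acoustic tokens from SoundStream

# Stage 1 (semantic modeling): predict S given M_A.
stage1_context, stage1_targets = mulan_audio_tokens, semantic_tokens

# Stage 2 (acoustic modeling): predict A given M_A and S.
stage2_context = mulan_audio_tokens + semantic_tokens
stage2_targets = acoustic_tokens

print(stage1_context, "->", stage1_targets)
print(stage2_context, "->", stage2_targets)
```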
For inference, they use the text tower of the MuLan model to embed the text prompt, denoted M_T. Their trained Transformers map this to semantic tokens S, and then both M_T and S are mapped to acoustic tokens, which are decoded by SoundStream to produce the audio.
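Because MuLan embeds audio and text into a shared space, the Transformers trained on MuLan audio tokens can be conditioned on MuLan text tokens at inference time. Putting it together, the pipeline looks roughly like the stub below; every function here is a placeholder standing in for a large model, not a real MusicLM API.

```python
def mulan_text_tower(prompt):                 # text -> MuLan conditioning tokens M_T
    return [7, 42, 3]                         # dummy tokens for illustration

def sample_semantic(mulan_text_tokens):       # decoder-only LM: M_T -> S
    return [11, 5, 19, 8]

def sample_acoustic(mulan_text_tokens, semantic_tokens):  # (M_T, S) -> acoustic tokens
    return [2, 2, 9, 31, 4]

def soundstream_decode(acoustic_tokens):      # acoustic tokens -> waveform samples
    return [0.0] * 24000                      # dummy one-second waveform

def generate_music(prompt):
    m_t = mulan_text_tower(prompt)            # 1. embed the text prompt with MuLan
    s = sample_semantic(m_t)                  # 2. sample semantic tokens
    a = sample_acoustic(m_t, s)               # 3. sample acoustic tokens
    return soundstream_decode(a)              # 4. decode to audio with SoundStream

audio = generate_music("relaxing jazz with a mellow saxophone solo")
print(len(audio), "samples")
```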
From what I understand, there are three models, each generating tokens for a different component: text, semantic, and acoustic. They act as extremely complex dictionaries, and the two decoder-only Transformers trained on top are responsible for looking things up in these dictionaries and learning the correct mapping between them.
If you'd like to hear the results for yourself, do check out the website where they host a few audio demos!
