
Generating and Interpolating Music Snippets with MusicVAE

An introduction to using variational auto-encoders to model the distribution of short music snippets
Created on January 29|Last edited on July 15

Check out this example to hear how MusicVAE interpolates smoothly between two well-known children's songs, "Twinkle Twinkle Little Star" and "Mary Had a Little Lamb".




Introduction

In the last report, we looked at how Music Transformer can generate minute-long piano performances from scratch. However, Music Transformer offers limited controllability - for example, it does not support creative applications such as mixing and matching two composed pieces of music (commonly known as interpolation).

In this report, we will look at the MusicVAE paper by Roberts et al., which proposes modeling a latent distribution of music segments, thereby enabling smooth interpolation between different music segments. It is also one of the first uses of variational autoencoders in the symbolic music generation domain.

Background - Variational Autoencoders (VAE)

Before diving into MusicVAE, we first discuss the key points of variational autoencoders.

An autoencoder is a type of neural network architecture that learns a compact, low-dimensional representation of a data point in an unsupervised manner. It consists of an encoder, which projects an input data point $X$ to a lower-dimensional latent vector $z$, and a decoder, which aims to reconstruct the data point from the compressed low-dimensional feature vector.

Training an autoencoder is done by optimizing a loss function that forces the network to reconstruct the input data point, i.e.:

$$L = \mathrm{MSE}(f_{decode}(f_{encode}(X)), X)$$
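
As a minimal sketch of this setup (with hypothetical layer sizes, not tied to any particular model in this report), an autoencoder and its reconstruction loss might look like this in PyTorch:

```python
import torch
import torch.nn as nn

# Hypothetical sizes: 130-dimensional inputs compressed to a 16-dimensional code.
encoder = nn.Sequential(nn.Linear(130, 64), nn.ReLU(), nn.Linear(64, 16))
decoder = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 130))

x = torch.randn(8, 130)                  # a batch of data points X
z = encoder(x)                           # compressed latent codes z
x_hat = decoder(z)                       # reconstructions of X
loss = nn.functional.mse_loss(x_hat, x)  # L = MSE(f_decode(f_encode(X)), X)
```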

A variational autoencoder (VAE) has very similar traits to an autoencoder, but the distribution of its latent variable, $p(z)$, is constrained to conform to a standard Gaussian prior, i.e. $z \sim N(0, I)$. Hence, from the data generation perspective of a VAE, data points are generated via a two-step process:

  1. First, a latent code is sampled from the prior $p(z)$: $z \sim p(z)$
  2. Then, a data point is generated by decoding the latent code: $X \sim p(X|z)$

To enforce this property, the objective function to train a VAE is to maximize the evidence lower bound (ELBO) of the marginal likelihood $p(X)$ with the following equation:

$$L = E_{q(z|X)}[\log p(X|z)] - D_{KL}(q(z|X) \,||\, p(z))$$

If we compare this loss function to the autoencoder loss function:

  • The first term resembles the reconstruction loss - in a VAE, the posterior distribution $q(z|X)$ and the likelihood $p(X|z)$ correspond to the encoder and the decoder respectively.

  • The additional second term in the VAE loss is a KL divergence term, which is commonly used to measure the "difference" between two probability distributions. Here, it pushes the posterior distribution $q(z|X)$ closer to the prior distribution $p(z)$, which is commonly a standard Gaussian distribution $N(0, I)$. A sketch of this loss in code is given after this list.
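
Assuming the encoder outputs a mean `mu` and a log-variance `logvar` for a diagonal Gaussian posterior (a common parameterization, not specific to MusicVAE), the negative ELBO can be sketched as follows; the closed-form KL term is specific to the Gaussian case:

```python
import torch
import torch.nn.functional as F

def negative_elbo(x_hat, x, mu, logvar):
    """Loss to minimize: reconstruction term plus KL(q(z|X) || N(0, I))."""
    # Reconstruction term, corresponding to E_q[log p(X|z)] (realized here as a squared error).
    recon = F.mse_loss(x_hat, x, reduction="sum")
    # Closed-form KL divergence between N(mu, diag(sigma^2)) and N(0, I).
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```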

In simple terms, we can think of VAEs as a "regularized" version of autoencoders, since autoencoders do not enforce any constraint on the distribution of the encoded latent variable. Hence, the encode-decode process is also slightly different:

  • During encoding, instead of encoding a single latent vector, we encode two vectors that represent the mean $\mu$ and the standard deviation $\sigma$ of the underlying distribution;
  • During decoding, we first sample a latent vector from this distribution (more precisely, via the reparameterization trick: $\epsilon \sim N(0, I)$, $z = \mu + \sigma \cdot \epsilon$). Then, we feed the latent vector into the decoder to reconstruct the input. This encode-sample-decode flow is sketched in the code after this list.
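
Below is a minimal sketch of that flow, assuming simple feed-forward encoder and decoder layers with hypothetical sizes (the actual MusicVAE uses recurrent networks, described in the next section):

```python
import torch
import torch.nn as nn

class ToyVAE(nn.Module):
    """Toy VAE illustrating the two-vector encoding and the reparameterization trick."""

    def __init__(self, input_dim=130, hidden_dim=64, latent_dim=16):
        super().__init__()
        self.encoder = nn.Linear(input_dim, hidden_dim)
        self.to_mu = nn.Linear(hidden_dim, latent_dim)      # predicts mu
        self.to_logvar = nn.Linear(hidden_dim, latent_dim)  # predicts log(sigma^2)
        self.decoder = nn.Linear(latent_dim, input_dim)

    def forward(self, x):
        h = torch.relu(self.encoder(x))
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I).
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * logvar) * eps
        return self.decoder(z), mu, logvar
```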

For a more detailed derivation of the VAE objective (ELBO) and its background in variational inference, kindly refer to this paper. We will provide a more detailed explanation of VAEs in upcoming reports.

Why VAEs instead of autoencoders for generation?

For generative applications, VAEs are commonly favored over autoencoders for the following reasons:

  1. Firstly, VAEs provide us with a prior distribution from which to sample the latent variable. This makes generating new data points (e.g. new face images, new music snippets) much more convenient than with autoencoders, where there isn't an intuitive way to sample new latent codes.

  2. Secondly, because the latent space is pushed toward a Gaussian distribution, we can expect a "smoother" latent space than that of an autoencoder, which commonly contains more "latent holes". This is particularly important for the mix-and-match application, which requires latent-space interpolation - that is, "traversing a path" through the latent space - since the "latent hole" regions generate unrealistic outputs (e.g. deformed face images, incoherent music snippets).

MusicVAE

We now dive into the details of MusicVAE. For the encoder and decoder, MusicVAE uses recurrent neural networks (more precisely, LSTMs), borrowing the idea from recurrent VAEs for text generation.

  • The encoder $q(z|X)$ processes an input sequence and produces a sequence of hidden states. The mean and standard deviation, $\mu$ and $\sigma$, are derived from the final hidden state $h_T$ by two separate feed-forward networks, i.e. $\mu = f_1(h_T)$, $\sigma = f_2(h_T)$.

  • The decoder $p(X|z)$ first uses the sampled latent vector $z$ to set the initial state of the decoder RNN. It then autoregressively generates the output sequence from that initial state. During training, the output sequence is trained to reconstruct the input sequence. A simplified sketch of this recurrent encoder/decoder pair is shown after this list.
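
The sketch below illustrates this pairing under simplifying assumptions (single-layer LSTMs, hypothetical layer sizes, teacher forcing in the decoder); it is not the exact MusicVAE architecture:

```python
import torch
import torch.nn as nn

class RecurrentVAE(nn.Module):
    """Simplified sketch of a recurrent VAE encoder/decoder pair."""

    def __init__(self, vocab=130, hidden=256, latent=64):
        super().__init__()
        self.encoder = nn.LSTM(vocab, hidden, batch_first=True, bidirectional=True)
        self.to_mu = nn.Linear(2 * hidden, latent)     # f_1(h_T)
        self.to_sigma = nn.Linear(2 * hidden, latent)  # f_2(h_T)
        self.z_to_state = nn.Linear(latent, hidden)    # z sets the decoder's initial state
        self.decoder = nn.LSTM(vocab, hidden, batch_first=True)
        self.output = nn.Linear(hidden, vocab)

    def encode(self, x):                       # x: (batch, T, 130) one-hot sequence
        _, (h, _) = self.encoder(x)
        h_T = torch.cat([h[0], h[1]], dim=-1)  # concat forward/backward final states
        return self.to_mu(h_T), nn.functional.softplus(self.to_sigma(h_T))

    def decode(self, z, x_shifted):            # teacher forcing with shifted inputs
        h0 = torch.tanh(self.z_to_state(z)).unsqueeze(0)
        c0 = torch.zeros_like(h0)
        out, _ = self.decoder(x_shifted, (h0, c0))
        return self.output(out)                # logits over the 130 tokens per step
```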

In this work, MusicVAE is used to generate monophonic melodies and drum lines. We will focus on monophonic melodies in this report. A melody is represented as a $(T, 130)$ sequence matrix, in which the 130-dimensional output space (one-hot vectors) consists of 128 "note-on" tokens for the 128 MIDI pitches, plus single tokens for "note-off" (releasing the note being played) and "rest" (nothing played). MusicVAE experiments with two different melody lengths: 2-bar ($T = 32$) and 16-bar ($T = 256$); we will discuss the difference in performance in a later section.
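
To make this representation concrete, here is a hypothetical helper (not Magenta's own data pipeline) that turns a per-step event list into the $(T, 130)$ one-hot matrix described above:

```python
import numpy as np

NOTE_OFF, REST = 128, 129  # token indices after the 128 MIDI "note-on" pitches

def melody_to_onehot(events, T=32):
    """Convert a per-step event list (MIDI pitch, 'off', or 'rest') into a (T, 130) matrix."""
    X = np.zeros((T, 130), dtype=np.float32)
    for t, event in enumerate(events[:T]):
        if event == "rest":
            X[t, REST] = 1.0
        elif event == "off":
            X[t, NOTE_OFF] = 1.0
        else:                    # an integer MIDI pitch in [0, 127]
            X[t, int(event)] = 1.0
    return X

# A toy 2-bar melody at 16 steps per bar: play C4, release it, rest, then play E4.
example = [60, "off", "rest", "rest", 64] + ["rest"] * 27
print(melody_to_onehot(example).shape)  # (32, 130)
```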

Problems with the vanilla RNN MusicVAE

The authors report two main limitations of using vanilla RNNs for MusicVAE:

  1. Because the decoder is autoregressive (producing the output at the current step based on previous steps), it is powerful enough to simply disregard the latent code. With the latent code ignored, the KL divergence term of the ELBO can be trivially set to zero. This is undesirable, as the model then cannot efficiently learn the latent variable distribution.

  2. Because the model compresses the entire sequence into a single latent vector, it creates a very tight bottleneck. As reported in the paper, although the vanilla recurrent VAE is able to reconstruct 2-bar melodies with decent accuracy, this approach begins to fail on longer sequences (e.g. 16 bars).

Solution

MusicVAE introduces a hierarchical decoder: since we are generating 16 bars of output, the authors suggest first generating an intermediate latent vector (called the "conductor" in the paper) for each bar, i.e. $[z^\prime_1, ..., z^\prime_{16}] = RNN(z)$. Then, for each bar, another RNN layer generates the actual outputs based on that bar's intermediate latent vector.

The reasoning is that the longer the output sequence, the more the influence of the latent state vanishes (resembling the vanishing gradient problem in long sequences). Hence, the hierarchical decoder first generates a shorter sequence of intermediate latent codes, which mitigates this vanishing issue. Then, for each bar, the output is generated solely from that bar's intermediate latent vector, which is itself derived from the latent code. This forces the generation process not to ignore the latent state, leading to better learning of the latent distribution. A sketch of this two-level decoder is given below.
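
The following is a simplified sketch of the two-level decoding loop, with hypothetical dimensions; unlike the real MusicVAE decoder, it omits the autoregressive feeding of previous outputs within each bar:

```python
import torch
import torch.nn as nn

class ConductorDecoder(nn.Module):
    """Hierarchical decoder sketch: a conductor RNN emits one embedding per bar,
    and a lower-level RNN decodes each bar from its own embedding."""

    def __init__(self, latent=64, conductor_dim=128, hidden=256, vocab=130,
                 n_bars=16, steps_per_bar=16):
        super().__init__()
        self.n_bars, self.steps_per_bar = n_bars, steps_per_bar
        self.conductor = nn.LSTM(latent, conductor_dim, batch_first=True)
        self.bar_decoder = nn.LSTM(conductor_dim, hidden, batch_first=True)
        self.output = nn.Linear(hidden, vocab)

    def forward(self, z):                                 # z: (batch, latent)
        # Conductor: unroll the latent code into one embedding per bar, [z'_1, ..., z'_16].
        conductor_in = z.unsqueeze(1).expand(-1, self.n_bars, -1).contiguous()
        bar_embeddings, _ = self.conductor(conductor_in)  # (batch, n_bars, conductor_dim)
        bars = []
        for b in range(self.n_bars):
            # Each bar is decoded only from its own conductor embedding.
            step_in = bar_embeddings[:, b:b + 1, :].expand(
                -1, self.steps_per_bar, -1).contiguous()
            out, _ = self.bar_decoder(step_in)
            bars.append(self.output(out))                 # (batch, steps_per_bar, vocab)
        return torch.cat(bars, dim=1)                     # (batch, 256, vocab)
```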

Interpolation

Now that we are able to learn a meaningful latent distribution of melodies, interpolating between two distinct melodies A and B takes the following steps:

  1. Obtain the latent codes for both A and B, i.e. $z_A$, $z_B$.
  2. "Slide" between $z_A$ and $z_B$ in the latent space and obtain $N$ latent codes along the interpolation path, i.e. $z_\alpha = z_A + \alpha(z_B - z_A)$, then decode each of them (see the sketch after this list).

The authors demonstrate that interpolating in the latent space results in a much smoother, more coherent transformation from the source melody to the target melody than interpolating in the data space. This shows that the learned latent distribution captures meaningful, compact information about the structure of the music snippets.

Generation

In the attached Colab notebook, we generate 2-bar and 16-bar melodies using both the vanilla recurrent VAE and MusicVAE. The generated audio files, as well as the corresponding piano-roll plots, are logged below using the Weights & Biases wandb library.
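
As a rough sketch of how such logging might look with the wandb library (the project name and file paths below are placeholders, not the ones used in the notebook):

```python
import wandb

run = wandb.init(project="musicvae-melodies")  # placeholder project name

# Log a generated audio clip and its piano-roll plot (file paths are placeholders).
run.log({
    "generated_melody": wandb.Audio("melody_16bar.wav", caption="16-bar MusicVAE sample"),
    "piano_roll": wandb.Image("melody_16bar_pianoroll.png"),
})

run.finish()
```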

Reproduce the results with this Colab notebook



Generate & Interpolate for 2 Bars






Generate & Interpolate for 16 Bars






Summary

MusicVAE provides a viable solution for modeling the distribution of music snippets on a compact, low-dimensional manifold, enabling creative applications such as generating melodies from scratch and mixing and matching melodies of very different styles. Such a model increases the level of controllability of music generation systems, giving users finer-grained control over generating music in their desired style.

However, although MusicVAE is able to generate 16 bars of music, it is still far from generating polyphonic (playing multiple notes at the same time, instead of a single note), minute-long sequences the way Music Transformer does. Can we combine the strengths of Music Transformer (generating long, coherent sequences) and MusicVAE (a high level of controllability)? We will look at a different model related to this topic in the next report.

If you are interested in learning more, kindly refer to the original paper and the MusicVAE blog post.

Do share some of your melody snippets and interpolations with us! Tweet us at @weights_biases and @gudgud96.

