Generative models have long shown great promise across domains, from text generation (GPT-2) and image-to-image translation (CycleGAN) to musical composition (MuseNet).

OpenAI's recent Jukebox paper (arxiv link) extends the work of MuseNet to generate novel audio samples. While MuseNet was trained on MIDI data, a format that carries information about each musical note such as notation, pitch, and velocity, Jukebox is trained on raw audio.

The key breakthrough of the model is that it learns both high-level features, such as composition style and genre, and low-level features, such as pace, notation, and pitch, without any additional explicit information.

In this report, we explore 2 main setups:

  1. Sampling from a trained model by feeding in new audio samples, and analyzing the upsampled and noisy audio output.
  2. Analyzing the training behavior of the Jukebox model.

Reproduce results in this colab →

First up, we'll look at some example results. After sampling with the Country genre and the Zac Brown Band as the artist, we get the following audio file and its chromagram.

Example Results

Core Ideas

Let's now look at the steps needed to produce this audio. The Jukebox paper builds on three core ideas:

  1. Compressing raw audio, whose input sequences are extremely long, into a low-dimensional space with an encoder based on the multi-scale VQ-VAE (Razavi et al., 2019).
  2. Learning to generate audio codes in the compressed space with an autoregressive model, a variant of the Sparse Transformer (Child et al., 2019; Vaswani et al., 2017), trained with maximum likelihood estimation.
  3. Training autoregressive upsamplers to recreate the information lost at each level of compression.

VQ-VAE architecture for compressing information into low dimensional space

A key part of the VQ-VAE architecture is the idea of different compression levels. Each level independently encodes the input. The top level retains only the most essential musical information, while the bottom level produces the highest-quality reconstruction. (We will see how the signals at these levels differ in the sampling section.)

(Image from Jukebox Paper Appendix B.1)
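
To make the idea of compression levels a little more concrete, here is a minimal sketch of the quantization step in NumPy. The 44.1 kHz sample rate and the 8x, 32x and 128x compression factors follow the Jukebox paper; everything else (the function names `codes_per_second` and `quantize`, the toy codebook size, the random stand-in latents) is an assumption made for illustration. The real model uses learned convolutional encoders and a much larger codebook.

```python
import numpy as np

SAMPLE_RATE = 44_100                                   # Hz, as in the Jukebox paper
HOP_LENGTHS = {"bottom": 8, "middle": 32, "top": 128}  # compression factor per level


def codes_per_second(hop_length, sample_rate=SAMPLE_RATE):
    """Number of discrete codes needed to represent one second of audio."""
    return sample_rate // hop_length


def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook vector.

    latents:  (T, D) array standing in for encoder outputs (hypothetical data)
    codebook: (K, D) array of embedding vectors
    returns:  (T,) integer codes and the (T, D) quantized vectors
    """
    # Squared Euclidean distance between every latent and every codebook entry.
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    codes = dists.argmin(axis=1)
    return codes, codebook[codes]


rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 4))   # toy codebook: 16 entries of dimension 4
latents = rng.normal(size=(10, 4))    # toy encoder output: 10 time steps
codes, quantized = quantize(latents, codebook)

for level, hop in HOP_LENGTHS.items():
    print(f"{level:>6}: {codes_per_second(hop)} codes per second of audio")
```

The top level needs far fewer codes per second than the bottom level, which is why it can only keep the most essential information.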


Priors for predicting audio in compressed space

The second part of the puzzle is learning to predict sequences in the compressed space.

The autoregressive Sparse Transformer learns the distribution of the codes encoded by the VQ-VAE at each compression level, so the priors also come in different levels. A top-level prior generates the most compressed codes and learns semantics and melodies, while two other priors upsample the codes.
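
The autoregressive part can be sketched in a few lines of plain Python: each code is sampled conditioned on all the codes before it. This is only a toy. `toy_logits` is a hypothetical stand-in for a trained prior (it simply favours repeating the previous code), whereas the real priors are Sparse Transformers that are also conditioned on artist, genre, lyrics and, for the upsamplers, on the codes of the level above.

```python
import math
import random


def toy_logits(previous_codes, vocab_size):
    """Hypothetical stand-in for a trained prior: favours repeating the last code."""
    logits = [0.0] * vocab_size
    if previous_codes:
        logits[previous_codes[-1]] = 2.0
    return logits


def sample_autoregressively(n_codes, vocab_size, rng, temperature=1.0):
    """Draw codes one at a time, each conditioned on all codes sampled so far."""
    codes = []
    for _ in range(n_codes):
        logits = toy_logits(codes, vocab_size)
        # Softmax with temperature; random.choices normalizes the weights for us.
        weights = [math.exp(logit / temperature) for logit in logits]
        next_code = rng.choices(range(vocab_size), weights=weights, k=1)[0]
        codes.append(next_code)
    return codes


rng = random.Random(0)
top_codes = sample_autoregressively(n_codes=8, vocab_size=16, rng=rng)
print(top_codes)
```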


In this section we take a pre-trained model and feed in sample audio files, as well as genre, artist and lyrics as prompts. Each prompt generates specific sets of audio files at varied representations (from high compression, to upsampled audio).

We then analyze the Chromagram of the audio files, which gives us information about the pitch, as opposed to a Spectrogram that gives us information about the spectrum of frequencies with respect to time. Both are good signals, and we look at Chromagrams as an alternative to the analysis in OpenAI's blog post.
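
To illustrate the difference, here is a rough NumPy sketch that computes a magnitude spectrogram and then folds its frequency bins into 12 pitch classes to get a chromagram. This is a simplified construction written for this report, not the method used by any particular audio library, and the function names and parameters (`n_fft`, `hop`, the 20 Hz cutoff) are our own assumptions.

```python
import numpy as np


def magnitude_spectrogram(signal, n_fft=2048, hop=512):
    """Magnitude STFT with a Hann window: rows are frequency bins, columns are frames."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T


def chromagram(spectrogram, sample_rate, n_fft=2048):
    """Fold frequency bins into 12 pitch classes (0 = C, ..., 9 = A, ..., 11 = B)."""
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    chroma = np.zeros((12, spectrogram.shape[1]))
    for bin_index, freq in enumerate(freqs):
        if freq < 20.0:  # skip DC and sub-audible bins
            continue
        midi_note = 69 + 12 * np.log2(freq / 440.0)  # A4 = 440 Hz = MIDI note 69
        pitch_class = int(round(midi_note)) % 12
        chroma[pitch_class] += spectrogram[bin_index]
    return chroma


sample_rate = 22_050
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440.0 * t)  # one second of A4

spec = magnitude_spectrogram(tone)
chroma = chromagram(spec, sample_rate)
print("spectrogram shape:", spec.shape)
print("chromagram shape:", chroma.shape)
print("strongest pitch class:", int(chroma.sum(axis=1).argmax()))
```

The spectrogram keeps one row per frequency bin, while the chromagram keeps only 12 rows, one per pitch class, regardless of octave.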

Link to OpenAI Jukebox Repository

There is also a fork of this repository that instruments training and sampling with Weights & Biases. Since both take more than 4 hours, even on GPUs more powerful than the Nvidia K80, Weights & Biases stores the metrics we need, TensorBoard runs, generated audio files and intermediate model representations.

Link to Weights & Biases Instrumented Fork of Jukebox

Try generating your own music samples!

Link to Weights & Biases Colab

P.S: Contributions to the fork are very welcome!


Sampling with Audio Prompts

The next type of sampling we do is based on Audio Prompts. We use samples from the FMA Music Dataset (GitHub link) to prompt 3 audio files with varying genres. The sample audio files we used are given below.

Key Learning: While Jukebox wasn't able to create completely new audio from these prompts, it was able to recognize similarities between genres. This could be because the prompts have long durations. Single audio file prompts with long durations did much worse.

Reproduce results in this colab →

Training the VQ-VAE

Now that we have seen how the sampling works, let's try to simulate a short training run for the VQ-VAE.

The Jukebox paper describes the training process in detail: the VQ-VAE has about 2 million parameters and takes multiple days to train on Nvidia V100 GPUs.

In this experiment, we simulate a short run to observe the behavior of spectral convergence, spectral loss and entropy, three important metrics used in the paper.

Training Setup

We train the model on 280 audio files from the FMA dataset (Source). The training files are instrumented with Weights & Biases in this repository.

You can create the Anaconda environment with conda env create -f jukebox.yml. The training script is an interactive bash script. Run it with bash and enter your Weights & Biases configuration when prompted.


Spectral Loss: Mathematically defined as $L_{spec} = \| |\mathrm{STFT}(x)| - |\mathrm{STFT}(\hat{x})| \|_2$. The spectral loss penalizes the norm of the difference between the spectrograms of the input and the reconstructed signal, encouraging the model to reconstruct mid-to-high frequencies. A perfect reconstruction would have zero loss. We would expect the spectral loss to decrease as training progresses, although erratically.
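
As a rough illustration of the formula, the sketch below computes the spectral loss for a single STFT resolution in NumPy. The window size, hop length and test signals are arbitrary choices of ours, and the helper names are hypothetical. The paper sums this loss over several STFT settings, which we skip here.

```python
import numpy as np


def stft_magnitude(signal, n_fft=256, hop=64):
    """Magnitude of a Hann-windowed short-time Fourier transform."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))


def spectral_loss(x, x_hat, n_fft=256, hop=64):
    """L2 norm of the difference between the two magnitude spectrograms."""
    diff = stft_magnitude(x, n_fft, hop) - stft_magnitude(x_hat, n_fft, hop)
    return float(np.linalg.norm(diff))


rng = np.random.default_rng(0)
t = np.arange(4096) / 4096
x = np.sin(2 * np.pi * 220 * t)                      # clean reference signal
slightly_noisy = x + 0.01 * rng.normal(size=x.shape)
very_noisy = x + 0.5 * rng.normal(size=x.shape)

print("perfect reconstruction:", spectral_loss(x, x))
print("slightly noisy:", spectral_loss(x, slightly_noisy))
print("very noisy:", spectral_loss(x, very_noisy))
```

Because only magnitudes are compared, the loss ignores phase: a sign-flipped copy of the signal scores (numerically) zero as well.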

Entropy: A measure of uniformity of predictions. Adding Genre and Artist information to the learning task reduces entropy and allows us to drive the model towards a specific style of music. In the graph, we don't see much change in entropy, which explains the noisy output given by the network after the short training is complete.
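
For intuition, here is a small sketch of Shannon entropy over a sequence of discrete codes. We are assuming the logged metric behaves like this quantity. The function name and the toy code sequences are made up for the example.

```python
import math
from collections import Counter


def entropy_bits(codes):
    """Shannon entropy (in bits) of the empirical distribution over codes."""
    counts = Counter(codes)
    total = len(codes)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


uniform_codes = [0, 1, 2, 3, 0, 1, 2, 3]  # every code used equally often
peaked_codes = [0, 0, 0, 0, 0, 0, 0, 1]   # one code dominates

print("uniform:", entropy_bits(uniform_codes))
print("peaked:", entropy_bits(peaked_codes))
```

The more uniform the distribution, the higher the entropy. A distribution concentrated on a few codes has low entropy.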

Spectral Convergence: Another proxy for reconstruction quality and fidelity to the original input signal. Reconstruction fidelity degrades with higher compression, and training can also suffer from codebook collapse, where all encodings get mapped to a few embedding vectors while the other embedding vectors go unused. Jukebox mitigates this with random restarts, i.e., by randomly resetting a codebook vector when its usage falls below a threshold. In our run, the spectral convergence decreases slowly, implying better fidelity over time.
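
Both ideas can be sketched briefly in NumPy. Spectral convergence is commonly defined as the norm of the spectrogram error relative to the norm of the reference spectrogram, and we assume that definition here. The random restart function is our own simplified version of the idea described above: the usage counts, the threshold and the function names are hypothetical, and under-used vectors are reset to randomly chosen encoder outputs from the current batch.

```python
import numpy as np


def spectral_convergence(spec, spec_hat, eps=1e-8):
    """Relative spectral error: ||S - S_hat|| / ||S|| (0 means a perfect match)."""
    return float(np.linalg.norm(spec - spec_hat) / (np.linalg.norm(spec) + eps))


def restart_dead_codes(codebook, usage, encoder_outputs, threshold, rng):
    """Reset codebook vectors whose usage fell below `threshold`.

    Each under-used vector is replaced by a randomly chosen encoder output
    from the current batch, moving it back into a region the encoder uses.
    """
    codebook = codebook.copy()
    dead = np.flatnonzero(usage < threshold)
    picks = rng.integers(0, len(encoder_outputs), size=len(dead))
    codebook[dead] = encoder_outputs[picks]
    return codebook, dead


rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))                   # toy codebook
encoder_outputs = rng.normal(loc=5.0, size=(32, 4))  # toy batch of latents
usage = np.array([10.0, 0.0, 7.0, 0.2, 3.0, 0.0, 9.0, 4.0])  # hypothetical usage counts

new_codebook, dead = restart_dead_codes(codebook, usage, encoder_outputs, threshold=1.0, rng=rng)
print("reset entries:", dead.tolist())

reference = np.abs(rng.normal(size=(5, 6)))          # stand-in magnitude spectrogram
print("convergence vs. itself:", spectral_convergence(reference, reference))
```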

Reproduce results in this colab →

Summary and Conclusion

Jukebox is a great paper to experiment with and appreciate the power of autoencoders. While this report only scratched the surface, there are a lot of additional resources you can use to explore Jukebox!

Do let us know what music you create! Tweet to us @weights_biases and @IshaanMalhi