Experiments With OpenAI Jukebox On Weights & Biases

In this article, we explore generative models that create music from raw audio by sampling a trained model and analyzing the training behavior of the Jukebox model.

Ishaan Malhi

Created on June 7|Last edited on November 21


Generative models have long shown great promise in various domains, from text generation (GPT-2) and image-to-image translation (CycleGAN) to musical composition (MuseNet).
OpenAI's recent Jukebox paper (arxiv link) builds on the work of MuseNet to generate unique audio samples. While MuseNet was trained on MIDI data, a format that carries information about each musical note such as notation, pitch, and velocity, Jukebox is trained on raw audio.
The key breakthrough of the model is that it is able to learn both high-level features such as composition style and genre and low-level features such as pace, notation, and pitch without any additional information.
In this article, we explore two main setups:
Sampling a trained model by feeding in new audio samples and analyzing the upsampled and noisy audio output.
Analyzing the training behavior of the Jukebox model.
Reproduce results in this colab →
First up, we'll look at some example results. After sampling with the Country genre and the Zac Brown Band as the artist, we get the following audio file and its Chromagram.
Table of Contents
Example Results
Core Ideas
Sampling
Sampling with Audio Prompts
Training the VQ-VAE
Summary and Conclusion
﻿
Example Results﻿
Core Ideas
Let's now look at the steps needed to get this audio. The Jukebox paper builds on a few core ideas:
Raw audio forms very long input sequences. Jukebox handles this with an encoder based on the multi-scale VQ-VAE (Razavi et al., 2019), which compresses the audio into a low-dimensional space.
Audio codes in the compressed space are generated by an autoregressive model, a variant of the Sparse Transformer (Child et al., 2019; Vaswani et al., 2017), trained with maximum likelihood estimation.
Autoregressive upsamplers are trained to recreate the information lost at each level of compression.
VQ-VAE architecture for compressing information into low-dimensional space
A key part of the VQ-VAE architecture is the idea of different compression levels. Each level encodes the input independently. The top level keeps only the most essential musical information, while the bottom level gives the highest-quality reconstruction. (We will see how the signals at these levels differ in the Sampling section.)
(Image from Jukebox Paper Appendix B.1)
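To make the compression step concrete, here is a minimal NumPy sketch of the vector-quantization lookup at the heart of a VQ-VAE: each encoder output is replaced by the index of its nearest codebook vector. The function name, array shapes, and toy codebook are illustrative assumptions, not code from the Jukebox repository.

```python
import numpy as np

def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook vector.

    latents:  (T, D) array of encoder outputs (hypothetical shapes).
    codebook: (K, D) array of embedding vectors.
    Returns (codes, quantized): integer codes of shape (T,) and the
    looked-up vectors of shape (T, D).
    """
    # Squared Euclidean distance between every latent and every codebook entry.
    diffs = latents[:, None, :] - codebook[None, :, :]
    dists = np.sum(diffs ** 2, axis=-1)   # shape (T, K)
    codes = np.argmin(dists, axis=1)      # nearest entry per time step
    return codes, codebook[codes]

# Toy example: 4 two-dimensional latents and a codebook with 3 entries.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]])
latents = np.array([[0.1, -0.1], [0.9, 1.2], [-0.8, 0.7], [0.4, 0.3]])
codes, quantized = quantize(latents, codebook)
print(codes)  # [0 1 2 0]
```

The sequence of integer codes is what the priors model. A higher compression level simply produces fewer codes for the same stretch of audio.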
﻿
﻿
﻿
Priors for predicting audio in compressed space
The second part of the puzzle is learning to predict sequences in the compressed space.
The autoregressive Sparse Transformer learns the distribution of the codes encoded by the VQ-VAE at each compression level, so the priors also come in levels. A top-level prior generates the most compressed codes and learns semantics and melodies, while two other priors upsample the codes.
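The autoregressive idea can be illustrated with a toy sampler in which each new code is drawn from a distribution conditioned on the codes generated so far. In this sketch, `bigram_prior` is a hypothetical stand-in for the trained Sparse Transformer prior (it only looks at the previous code), and the codebook size of 4 is an assumption chosen for readability.

```python
import numpy as np

def sample_codes(next_code_probs, length, rng, start_code=0):
    """Ancestral sampling: draw one code at a time, conditioned on the past."""
    codes = [start_code]
    while len(codes) < length:
        probs = next_code_probs(codes)  # p(z_t | z_<t)
        codes.append(int(rng.choice(len(probs), p=probs)))
    return codes

# Hypothetical stand-in for a trained prior: a bigram table over 4 codes.
# Row i holds the distribution of the next code given that the last code was i.
BIGRAM = np.array([
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
    [0.7, 0.1, 0.1, 0.1],
])

def bigram_prior(history):
    """Return next-code probabilities given the codes sampled so far."""
    return BIGRAM[history[-1]]

rng = np.random.default_rng(0)
codes = sample_codes(bigram_prior, length=16, rng=rng)
print(codes)
```

A real prior conditions on a much longer history, and on genre, artist, and lyrics, but the sampling loop has the same shape.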
Sampling
In this section, we take a pre-trained model and feed in sample audio files, as well as genre, artist, and lyrics, as prompts. Each prompt generates a set of audio files at several levels of representation (from highly compressed to upsampled audio).
We then analyze the Chromagram of each audio file, which shows how energy is distributed across pitch classes over time, as opposed to a Spectrogram, which shows the full spectrum of frequencies over time. Both are useful signals; we look at Chromagrams as an alternative to the analysis in OpenAI's blog post.
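As a rough illustration of how a Chromagram differs from a Spectrogram, the sketch below computes a magnitude spectrogram with NumPy and then folds its frequency bins into 12 pitch classes. The window size, hop length, and test tone are assumptions made for the example; in practice a library routine such as `librosa.feature.chroma_stft` would normally be used instead.

```python
import numpy as np

def stft_magnitude(signal, n_fft=2048, hop=512):
    """Magnitude spectrogram, shape (n_fft // 2 + 1, n_frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T

def chromagram(signal, sr, n_fft=2048, hop=512):
    """Fold spectrogram energy into 12 pitch classes (0 = C, ..., 11 = B)."""
    mag = stft_magnitude(signal, n_fft, hop)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    chroma = np.zeros((12, mag.shape[1]))
    # Skip the DC bin, which has no defined pitch.
    for k in range(1, len(freqs)):
        midi = 69 + 12 * np.log2(freqs[k] / 440.0)  # A4 = 440 Hz = MIDI 69
        chroma[int(round(midi)) % 12] += mag[k] ** 2
    # Normalize each frame so that its strongest pitch class equals 1.
    return chroma / np.maximum(chroma.max(axis=0, keepdims=True), 1e-12)

sr = 22050
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440.0 * t)  # one second of A4
chroma = chromagram(tone, sr)
print(chroma.mean(axis=1).argmax())   # pitch class 9 is A
```

The Spectrogram keeps every frequency bin, while the Chromagram discards the octave and keeps only which of the 12 notes is sounding.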
﻿Link to OpenAI Jukebox Repository﻿
There is also a fork of this repository that is instrumented with Weights & Biases for both training and sampling. Since each of these takes more than 4 hours, even on GPUs faster than the Nvidia K80, Weights & Biases stores the metrics we need, TensorBoard runs, generated audio files, and intermediate model representations.
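For readers who have not logged audio to Weights & Biases before, the following is a minimal sketch of what such instrumentation can look like. It is not taken from the fork: the helper names, the metric keys, and the project name mentioned afterwards are assumptions. The only library calls used are `wandb.Audio` and the run's `log` method.

```python
import numpy as np

def to_float_audio(samples):
    """Return the waveform as float32, rescaled only if its peak exceeds 1.0."""
    samples = np.asarray(samples, dtype=np.float32)
    peak = np.max(np.abs(samples)) if samples.size else 0.0
    return samples / peak if peak > 1.0 else samples

def log_sample(run, samples, sample_rate, level, step):
    """Log one generated clip to an active W&B run (hypothetical metric keys)."""
    # Imported lazily so that to_float_audio can be used without wandb installed.
    # Encoding NumPy arrays as audio may also require the soundfile package.
    import wandb
    audio = wandb.Audio(to_float_audio(samples), sample_rate=sample_rate,
                        caption=f"level {level}")
    run.log({f"audio/level_{level}": audio, "sample/level": level}, step=step)

# Placeholder for a model output: one second of noise at 44.1 kHz.
clip = np.random.default_rng(0).uniform(-1.0, 1.0, size=44100)
print(to_float_audio(clip).dtype)  # float32
```

With a run created by `wandb.init(project="jukebox-experiments")` (a placeholder project name), calling `log_sample(run, clip, 44100, level=0, step=0)` would attach the clip to that run so it can be played back from the dashboard.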
﻿Link to Weights & Biases Instrumented Fork of Jukebox﻿
Try generating your own music samples!
﻿Link to Weights & Biases Colab﻿
P.S. Contributions to the fork are very welcome!
﻿
Sampling with Audio Prompts
The next type of sampling we do is based on audio prompts. We use samples from the FMA Music Dataset (GitHub link) to prompt the model with 3 audio files of varying genres. The sample audio files we used are given below.
Key Learning: While Jukebox wasn't able to create completely new audio from these prompts, it was able to recognize similarities between genres. This could be because the prompts have long durations. Prompts made from a single long audio file did much worse.
﻿Reproduce results in this colab →﻿﻿
﻿
Training the VQ-VAE
Now that we have seen how sampling works, let's simulate a short training run for the VQ-VAE.
The Jukebox paper describes the training process in detail: the VQ-VAE has over 2 million parameters, and training it takes multiple days on Nvidia V100 GPUs.
In this experiment, we run a short training job to observe the behavior of spectral convergence, spectral loss, and entropy, three important quantities used in the paper.
Training Setup
We train the model on 280 audio files from the FMA dataset (Source). The training files are instrumented with Weights & Biases in this repository.
You can create the Anaconda environment with conda env create -f jukebox.yml. The training script is an interactive bash script: run bash train.sh and enter your Weights & Biases configuration when prompted.
Terms
Spectral Loss: Mathematically defined as $L_{spec} = \| |STFT(x)| - |STFT(\hat{x})| \|$. The spectral loss penalizes the norm of the difference between the Spectrograms of the input and the reconstructed signal, encouraging the model to reconstruct mid-to-high frequencies. A perfect reconstruction would have zero loss. As training progresses, we would expect the spectral loss to decrease, although erratically.
Entropy: A measure of the uniformity of predictions. Adding genre and artist information to the learning task reduces entropy and allows us to steer the model towards a specific style of music. In the graph, we don't see much change in entropy, which helps explain the noisy output the network gives after the short training run.
Spectral Convergence: Another proxy for reconstruction quality and fidelity to the original input signal. Reconstruction fidelity degrades with higher compression, which can lead to codebook collapse, where all encodings get mapped to a few embedding vectors while the other embedding vectors go unused. Jukebox mitigates this with random restarts, i.e., by randomly resetting a codebook vector when a threshold condition is met. We can see that the spectral convergence decreases slowly, implying better fidelity over time.
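The three terms above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the code used in the Jukebox repository: the norm is taken to be the L2 norm, spectral convergence is taken to be the spectral loss divided by the norm of the target spectrogram (a common definition), entropy is computed in bits over how often each code is used, and the STFT settings are arbitrary.

```python
import numpy as np

def stft_magnitude(signal, n_fft=1024, hop=256):
    """Magnitude spectrogram, shape (n_frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

def spectral_loss(x, x_hat):
    """L2 norm of the difference between the two magnitude spectrograms."""
    return float(np.linalg.norm(stft_magnitude(x) - stft_magnitude(x_hat)))

def spectral_convergence(x, x_hat, eps=1e-12):
    """Spectral loss relative to the norm of the target spectrogram."""
    target_norm = np.linalg.norm(stft_magnitude(x))
    return spectral_loss(x, x_hat) / (target_norm + eps)

def code_entropy(codes, codebook_size):
    """Shannon entropy (in bits) of how often each codebook entry is used."""
    counts = np.bincount(codes, minlength=codebook_size)
    probs = counts[counts > 0] / counts.sum()
    return float(np.sum(probs * np.log2(1.0 / probs)))

rng = np.random.default_rng(0)
t = np.arange(8192) / 8192
x = np.sin(2 * np.pi * 220 * t)                 # clean test tone
noisy = x + 0.1 * rng.standard_normal(x.shape)  # imperfect "reconstruction"

print(spectral_loss(x, x))                      # 0.0 for a perfect reconstruction
print(spectral_convergence(x, noisy))           # small positive value
print(code_entropy(np.array([0, 1, 2, 3]), 4))  # 2.0 bits: all codes used equally
print(code_entropy(np.array([0, 0, 0, 0]), 4))  # 0.0 bits: collapsed codebook
```

The last two lines show why entropy is a useful warning sign for codebook collapse: when all encodings map to one embedding vector, the usage entropy drops to zero.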
﻿Reproduce results in this colab →﻿﻿
﻿
Summary and Conclusion
Jukebox is a great paper to experiment with and to appreciate the power of autoencoders. While this article only scratches the surface, there are a lot of additional resources you can use to explore Jukebox:
Directory of music created by the Jukebox model trained by OpenAI
OpenAI's blog post introducing Jukebox
Paper on arXiv
Do let us know what music you create! Tweet to us @weights_biases and @IshaanMalhi﻿
﻿
﻿


Tags: Intermediate, Audio, Music Generation, OpenAI, Experiment, Panels, Plots
