Generating Piano Music With Music Transformer
In this article, we explore how Music Transformer, by Google Magenta, generates piano music from scratch, logging our experiments using Weights & Biases.
Learning to generate music with machine learning algorithms has gained tremendous interest in the past five years. With the emergence of deep learning, neural network architectures such as convolutional neural networks (CNNs), long short-term memory networks (LSTMs), and restricted Boltzmann machines (RBMs) have become popular choices for algorithmic music generation (for more information, refer to this survey paper).
In this report, we will look at the Music Transformer paper, authored by Huang et al. from Google Magenta, which proposes a state-of-the-art, language-model-based architecture for music generation. It is one of the first works to bring Transformers, which have seen tremendous success in NLP, to the symbolic music generation domain.
Background
Generally, there are two main approaches to representing music in a generation task. The first is to represent music as audio waveforms; common data representations here include the raw waveform, spectrograms, Mel spectrograms, the constant-Q transform (CQT), and so on. The second is to represent music as symbolic event tokens.
This can come in the form of MIDI events (which are used in this report), piano rolls, text (e.g. ABC notation), and so on. A previous W&B report introduces JukeBox, which uses the audio-based approach for generating music. In this report, we will introduce Music Transformer, which uses the symbolic approach instead.
Representing piano music as event tokens
MIDI events represent each granular action of playing the piano as an event token. In this work, the following four types of events are modelled:
- NOTE_ON - pressing on a note
- NOTE_OFF - releasing the note
- TIME_SHIFT - moving forward to the next time step
- SET_VELOCITY - setting how fast the notes should be played
In this work, a total of 128 NOTE_ON and 128 NOTE_OFF events are introduced, corresponding to the 128 MIDI note pitches, as well as 100 TIME_SHIFT events (from 10 ms to 1 second) and 128 SET_VELOCITY events corresponding to 128 different note velocities. An example from the paper is depicted below.
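To make this concrete, here is a minimal, illustrative sketch (not Magenta's actual implementation) of how a single note might be serialized into this event vocabulary. The token-id layout below is an assumption for illustration only.

```python
# A minimal sketch of serializing one piano note into the four event types.
# The block layout of token ids is hypothetical, not Magenta's actual scheme.

NOTE_ON_OFFSET = 0         # 128 NOTE_ON tokens: ids 0-127
NOTE_OFF_OFFSET = 128      # 128 NOTE_OFF tokens: ids 128-255
TIME_SHIFT_OFFSET = 256    # 100 TIME_SHIFT tokens (10 ms .. 1 s): ids 256-355
SET_VELOCITY_OFFSET = 356  # 128 SET_VELOCITY tokens: ids 356-483

def encode_note(pitch: int, duration_ms: int, velocity: int) -> list[int]:
    """Encode one note as [SET_VELOCITY, NOTE_ON, TIME_SHIFT..., NOTE_OFF]."""
    tokens = [SET_VELOCITY_OFFSET + velocity, NOTE_ON_OFFSET + pitch]
    # Break the duration into TIME_SHIFT steps of at most 1 second (10 ms units).
    remaining = duration_ms // 10
    while remaining > 0:
        step = min(remaining, 100)                # at most 1 s per TIME_SHIFT token
        tokens.append(TIME_SHIFT_OFFSET + step - 1)
        remaining -= step
    tokens.append(NOTE_OFF_OFFSET + pitch)
    return tokens

# Middle C (pitch 60) held for half a second at velocity 80:
print(encode_note(pitch=60, duration_ms=500, velocity=80))
```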
With this representation, we can see a clear correspondence to tokenization in natural language processing:
- A music piece corresponds to a sentence or paragraph;
- Each note event corresponds to a word in the sentence / paragraph;
- All possible note events form the "vocabulary set".
It is therefore reasonable to use Transformers as the architecture of our generative model, since we are treating music generation like a language generation task.
Music Transformer vs. Vanilla Transformer
Vanilla Transformers are notoriously poor at handling long sequences due to their quadratic memory requirement. This is a huge problem for music generation in particular, because a minute-long composition can easily contain thousands of MIDI event tokens.
Also, the authors argued that relative position information is important in music applications, because music often consists of structured phrases such as repetition, scales, and arpeggios. Hence, the model should be able to capture relative position information in a more efficient way.
With the above considerations, the authors improved the Transformer model with the following changes.
1 - Relative attention
The main difference between the two Transformers lies in the self-attention mechanism. The vanilla Transformer relies on scaled dot-product attention, given by the equation below:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{D}}\right)V$$

where $Q$, $K$, $V$ represent the query, key, and value tensors, each with shape $(L, D)$, where $L$ is the sequence length and $D$ is the number of dimensions used in the model.
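As a quick reference, here is a minimal NumPy sketch of scaled dot-product attention for a single head, with $Q$, $K$, $V$ of shape $(L, D)$ as above (illustrative only, not the paper's implementation):

```python
# Scaled dot-product attention for a single head (illustrative sketch).
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    L, D = Q.shape
    logits = Q @ K.T / np.sqrt(D)                     # (L, L) attention logits
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # (L, D) output

L, D = 8, 16
Q, K, V = (np.random.randn(L, D) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)    # (8, 16)
```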
Relative attention, proposed by Shaw et al., allows the model to be informed by how far apart two positions are in a sequence. To incorporate this information, the equation is modified as below:

$$\text{RelativeAttention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top + S^{rel}}{\sqrt{D}}\right)V$$

The additional term, $S^{rel}$, is an $(L, L)$-shaped tensor whose value at position $(i, j)$ depends only on the relative distance between positions $i$ and $j$ in the length-$L$ sequence. Since it is relative in nature, if $j_1 - i_1 = j_2 - i_2$, then $S^{rel}_{i_1 j_1} = S^{rel}_{i_2 j_2}$.
So how can we obtain $S^{rel}$?
- First, we initialize a fixed set of embeddings $E^r$, which contains a unique embedding $e_r$ for each relative distance $r$.
- Then, we form a 3D tensor $R$ of shape $(L, L, D)$, where $R_{ij} = e_{j-i}$.
- Finally, we compute $S^{rel} = QR^\top$ as a batch of $L$ matrix multiplications, so that $S^{rel}_{ij} = Q_i \cdot R_{ij}$.
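The following is a minimal NumPy sketch of this naive construction of $S^{rel}$ via the intermediate tensor $R$ (illustrative only; the variable names and the unmasked setting are assumptions, not the paper's code):

```python
# Naive relative attention term: build the (L, L, D) tensor R explicitly.
import numpy as np

L, D = 8, 16
Q = np.random.randn(L, D)
# One embedding per relative distance r = j - i, for r in [-(L-1), ..., L-1].
E = np.random.randn(2 * L - 1, D)

# Form R of shape (L, L, D) with R[i, j] = e_{j - i}.
R = np.zeros((L, L, D))
for i in range(L):
    for j in range(L):
        R[i, j] = E[j - i + L - 1]

# S_rel[i, j] = Q[i] . R[i, j]  -> a batch of L matrix products, shape (L, L).
S_rel = np.einsum('id,ijd->ij', Q, R)
print(S_rel.shape)  # (8, 8)
```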
2 - Memory-efficient implementation
From the diagram above, it is obvious that the intermediate tensor $R$ requires a memory footprint of $O(L^2 D)$, which is infeasible for long sequences. Hence, the authors proposed a "skewing" trick that obtains $S^{rel}$ without computing $R$, keeping the memory footprint within $O(LD)$.
The steps are as follows:
- Multiply $Q$ by $E^{r\top}$;
- Pad a dummy column vector before the leftmost column;
- Reshape the matrix to have shape $(L+1, L)$;
- Slice the matrix to keep the last $L$ rows, which corresponds to $S^{rel}$.
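Here is a minimal NumPy sketch of the skewing procedure (illustrative only, not the paper's code). In the masked-attention setting the relative embeddings only need to cover distances $-(L-1)$ to $0$; entries above the diagonal come out as garbage, but they are masked away by causal attention anyway:

```python
# The "skewing" trick: obtain S_rel without materializing the (L, L, D) tensor R.
import numpy as np

L, D = 8, 16
Q = np.random.randn(L, D)
# Relative embeddings for distances -(L-1), ..., -1, 0 (rows in that order).
E = np.random.randn(L, D)

QE = Q @ E.T                            # step 1: (L, L)
padded = np.pad(QE, ((0, 0), (1, 0)))   # step 2: dummy column on the left -> (L, L+1)
reshaped = padded.reshape(L + 1, L)     # step 3: (L+1, L)
S_rel = reshaped[1:, :]                 # step 4: keep the last L rows -> (L, L)

# Sanity check on the lower triangle: S_rel[i, j] should equal Q[i] . e_{j-i}.
i, j = 5, 2
print(np.allclose(S_rel[i, j], Q[i] @ E[j - i + L - 1]))  # True
```

Only the $(L, L)$ matrix $QE^{r\top}$ and the $(L, D)$ embedding table are ever stored, which is where the $O(LD)$ intermediate memory saving comes from.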
Results
As shown in the paper, the proposed Music Transformer architecture has some clear advantages over the piano music generated by a vanilla Transformer and an LSTM-based model (known as PerformanceRNN):
- Comparing the Transformers to the LSTM, the Transformer-based models are generally better at preserving and reusing the primer motif. Thanks to relative attention, Music Transformer creates phrases that are repeated and varied, whereas the vanilla Transformer uses the motif in a more uniform fashion. The LSTM model uses the motif initially but soon drifts off to other material.
- Comparing the vanilla Transformer to Music Transformer, relative attention is able to generalize to lengths longer than those seen during training. The vanilla Transformer seems to deteriorate beyond its training length, while Music Transformer is still able to generate coherent musical structure.
Generation
There are three different modes for generating music with the pre-trained Music Transformer models: (i) generating from scratch, (ii) generating from a primer melody, and (iii) generating accompaniment for a melody.
We start with Google Magenta's provided Colab notebook, as it already handles pre-trained model loading, tensor2tensor framework initialization, and problem definition. The generated audio files, as well as the corresponding piano roll plots, are logged below using the Weights & Biases wandb library.
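The exact logging code lives in the Colab notebook; a minimal sketch of how a generated sample could be logged with the wandb library might look like the following (the project name, file names, and placeholder waveform are assumptions for illustration):

```python
# A minimal sketch of logging one generated sample to Weights & Biases.
# Assumes the generated MIDI has already been rendered to a waveform and a
# piano-roll image; all names here are illustrative, not the notebook's code.
import numpy as np
import wandb

wandb.init(project="music-transformer-demo")

sample_rate = 16000
waveform = np.random.uniform(-1, 1, size=sample_rate * 5)  # placeholder 5-second clip

wandb.log({
    "generated_audio": wandb.Audio(waveform, sample_rate=sample_rate,
                                   caption="Generated from scratch"),
    "piano_roll": wandb.Image("piano_roll.png", caption="Piano roll plot"),
})
wandb.finish()
```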
Reproduce the results with this Colab notebook
Generating from scratch
Generating from a primer melody
Generating accompaniment for a given melody
Summary
Music Transformer is a great paper that brings the power of language models to the domain of symbolic music generation, and it is able to generate longer piano music with coherent musical structure and style.
With the emergence of various linear-complexity Transformers, it would also be interesting to investigate the impact of replacing the Music Transformer architecture with candidates such as Linformer and Transformers are RNNs, as they share the advantage of a linear-time attention mechanism.
If you are interested in learning more, refer to the original paper and the Music Transformer blog post.