Generating Piano Music With Music Transformer
In this article, we explore how Music Transformer, by Google Magenta, generates piano music from scratch, logging our experiments using Weights & Biases.
Learning to generate music with machine learning algorithms has gained tremendous interest in the past five years. With the emergence of deep learning, neural network architectures such as convolutional neural networks (CNNs), long short-term memory networks (LSTMs), and restricted Boltzmann machines (RBMs) have become popular choices for algorithmic music generation (for more information, refer to this survey paper).
In this report, we will look at the Music Transformer paper, authored by Huang et al. from Google Magenta, which proposes a state-of-the-art, language-model-based architecture for music generation. It is one of the first works to bring Transformers, which have seen tremendous success in NLP, to the symbolic music generation domain.
Background
Generally, there are two main approaches to representing music in a generation task. The first is to represent music as audio waveforms; common data representations here include the raw waveform, spectrograms, Mel spectrograms, the constant-Q transform (CQT), and so on. The second is to represent music as symbolic event tokens.
This can come in the form of MIDI events (which are used in this report), piano rolls, text (e.g. ABC notation), and so on. A previous W&B report introduces JukeBox, which uses the audio-based approach for generating music. In this report, we will introduce Music Transformer, which uses the symbolic approach instead.
Representing piano music as event tokens
MIDI events represent each granular action of playing the piano as an event token. In this work, the following four types of events are modelled:
- NOTE_ON - pressing on a note
- NOTE_OFF - releasing the note
- TIME_SHIFT - moving forward to the next time step
- SET_VELOCITY - setting how fast the notes should be played
In this work, a total of 128 NOTE_ON and 128 NOTE_OFF events are introduced, corresponding to the 128 MIDI note pitches, as well as 100 TIME_SHIFT events (from 10 ms to 1 second) and 128 SET_VELOCITY events corresponding to 128 different note velocities. An example from the paper is depicted below.
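To make this concrete, here is a minimal, illustrative sketch (not Magenta's actual implementation) of how a single note might be serialized into this event vocabulary. The token-id layout below is an assumption for illustration only.

```python
# A minimal sketch of serializing one piano note into the four event types.
# The block layout of token ids is hypothetical, not Magenta's actual scheme.

NOTE_ON_OFFSET = 0         # 128 NOTE_ON tokens: ids 0-127
NOTE_OFF_OFFSET = 128      # 128 NOTE_OFF tokens: ids 128-255
TIME_SHIFT_OFFSET = 256    # 100 TIME_SHIFT tokens (10 ms .. 1 s): ids 256-355
SET_VELOCITY_OFFSET = 356  # 128 SET_VELOCITY tokens: ids 356-483

def encode_note(pitch: int, duration_ms: int, velocity: int) -> list[int]:
    """Encode one note as [SET_VELOCITY, NOTE_ON, TIME_SHIFT..., NOTE_OFF]."""
    tokens = [SET_VELOCITY_OFFSET + velocity, NOTE_ON_OFFSET + pitch]
    # Break the duration into TIME_SHIFT steps of at most 1 second (10 ms units).
    remaining = duration_ms // 10
    while remaining > 0:
        step = min(remaining, 100)                # at most 1 s per TIME_SHIFT token
        tokens.append(TIME_SHIFT_OFFSET + step - 1)
        remaining -= step
    tokens.append(NOTE_OFF_OFFSET + pitch)
    return tokens

# Middle C (pitch 60) held for half a second at velocity 80:
print(encode_note(pitch=60, duration_ms=500, velocity=80))
```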
With this representation, we can see a clear correspondence to tokenization in natural language processing:
- A music piece corresponds to a sentence or paragraph;
- Each note event corresponds to a word in the sentence / paragraph;
- All possible note events form the "vocabulary set".
It is therefore reasonable to use Transformers as the architecture of our generative model, since we are treating music generation like a language generation task.
Music Transformer vs. Vanilla Transformer
Vanilla Transformers are notoriously poor at handling long sequences due to their quadratic memory requirement. This is a huge problem for music generation in particular, because a minute-long composition can easily contain thousands of MIDI event tokens.
Also, the authors argued that relative position information is important in music applications, because music often consists of structured phrases such as repetition, scales, and arpeggios. Hence, the model should be able to capture relative position information in a more efficient way.
With the above considerations, the authors improved the Transformer model with the following changes.
1 - Relative attention
The main difference between the two Transformers lies in the self-attention mechanism. The vanilla Transformer relies on scaled dot-product attention, given by the equation below:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{D}}\right)V$$

where $Q$, $K$, $V$ represent the query, key, and value tensors, each with shape $(L, D)$, where $L$ is the sequence length and $D$ is the number of dimensions used in the model.
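As a quick reference, here is a minimal NumPy sketch of scaled dot-product attention for a single head, with $Q$, $K$, $V$ of shape $(L, D)$ as above (illustrative only, not the paper's implementation):

```python
# Scaled dot-product attention for a single head (illustrative sketch).
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    L, D = Q.shape
    logits = Q @ K.T / np.sqrt(D)                     # (L, L) attention logits
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # (L, D) output

L, D = 8, 16
Q, K, V = (np.random.randn(L, D) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)    # (8, 16)
```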
Relative attention, proposed by Shaw et al., allows the model to be informed by how far apart two positions are in a sequence. To incorporate this information, the equation is modified as below:

$$\text{RelativeAttention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top + S^{rel}}{\sqrt{D}}\right)V$$

The additional term, $S^{rel}$, is an $(L, L)$-shaped tensor whose value at position $(i, j)$ depends only on the relative distance between positions $i$ and $j$ in the length-$L$ sequence. Since it is relative in nature, if $j_1 - i_1 = j_2 - i_2$, then $S^{rel}_{i_1 j_1} = S^{rel}_{i_2 j_2}$.
So how can we obtain $S^{rel}$?
- First, we initialize a fixed set of embeddings $E^r$, which contains a unique embedding $e_r$ for each relative distance $r$.
- Then, we form a 3D tensor $R$ of shape $(L, L, D)$, where $R_{ij} = e_{j-i}$.
- Finally, we compute $S^{rel} = QR^\top$ as a batch of $L$ matrix multiplications, so that $S^{rel}_{ij} = Q_i \cdot R_{ij}$.
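The following is a minimal NumPy sketch of this naive construction of $S^{rel}$ via the intermediate tensor $R$ (illustrative only; the variable names and the unmasked setting are assumptions, not the paper's code):

```python
# Naive relative attention term: build the (L, L, D) tensor R explicitly.
import numpy as np

L, D = 8, 16
Q = np.random.randn(L, D)
# One embedding per relative distance r = j - i, for r in [-(L-1), ..., L-1].
E = np.random.randn(2 * L - 1, D)

# Form R of shape (L, L, D) with R[i, j] = e_{j - i}.
R = np.zeros((L, L, D))
for i in range(L):
    for j in range(L):
        R[i, j] = E[j - i + L - 1]

# S_rel[i, j] = Q[i] . R[i, j]  -> a batch of L matrix products, shape (L, L).
S_rel = np.einsum('id,ijd->ij', Q, R)
print(S_rel.shape)  # (8, 8)
```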
2 - Memory-efficient implementation
From the diagram above, it is obvious that the intermediate tensor $R$ requires a memory footprint of $O(L^2 D)$, which is infeasible for long sequences. Hence, the authors proposed a "skewing" trick that obtains $S^{rel}$ without computing $R$, keeping the memory footprint within $O(LD)$.
The steps are as follows:
- Multiply $Q$ by $E^{r\top}$;
- Pad a dummy column vector before the leftmost column;
- Reshape the matrix to have shape $(L+1, L)$;
- Slice the matrix to keep the last $L$ rows, which corresponds to $S^{rel}$.
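Here is a minimal NumPy sketch of the skewing procedure (illustrative only, not the paper's code). In the masked-attention setting the relative embeddings only need to cover distances $-(L-1)$ to $0$; entries above the diagonal come out as garbage, but they are masked away by causal attention anyway:

```python
# The "skewing" trick: obtain S_rel without materializing the (L, L, D) tensor R.
import numpy as np

L, D = 8, 16
Q = np.random.randn(L, D)
# Relative embeddings for distances -(L-1), ..., -1, 0 (rows in that order).
E = np.random.randn(L, D)

QE = Q @ E.T                            # step 1: (L, L)
padded = np.pad(QE, ((0, 0), (1, 0)))   # step 2: dummy column on the left -> (L, L+1)
reshaped = padded.reshape(L + 1, L)     # step 3: (L+1, L)
S_rel = reshaped[1:, :]                 # step 4: keep the last L rows -> (L, L)

# Sanity check on the lower triangle: S_rel[i, j] should equal Q[i] . e_{j-i}.
i, j = 5, 2
print(np.allclose(S_rel[i, j], Q[i] @ E[j - i + L - 1]))  # True
```

Only the $(L, L)$ matrix $QE^{r\top}$ and the $(L, D)$ embedding table are ever stored, which is where the $O(LD)$ intermediate memory saving comes from.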
Results
As shown in the paper, the proposed Music Transformer architecture has some clear advantages over the piano music generated by a vanilla Transformer and an LSTM-based model (known as PerformanceRNN):
- Comparing the Transformers to the LSTM, the Transformer-based models are generally better at preserving and reusing the primer motif. Thanks to relative attention, Music Transformer creates phrases that are repeated and varied, whereas the vanilla Transformer uses the motif in a more uniform fashion. The LSTM model uses the motif initially but soon drifts off to other material.
- Comparing the vanilla Transformer to Music Transformer, relative attention is able to generalize to lengths longer than those seen during training. The vanilla Transformer seems to deteriorate beyond its training length, while Music Transformer is still able to generate coherent musical structure.
Generation
There are three different modes for generating music with the pre-trained Music Transformer models: (i) generating from scratch, (ii) generating from a primer melody, and (iii) generating accompaniment for a melody.
We start with Google Magenta's provided Colab notebook, as it already handles pre-trained model loading, tensor2tensor framework initialization, and problem definition. The generated audio files, as well as the corresponding piano roll plots, are logged below using the Weights & Biases wandb library.
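The exact logging code lives in the Colab notebook; a minimal sketch of how a generated sample could be logged with the wandb library might look like the following (the project name, file names, and placeholder waveform are assumptions for illustration):

```python
# A minimal sketch of logging one generated sample to Weights & Biases.
# Assumes the generated MIDI has already been rendered to a waveform and a
# piano-roll image; all names here are illustrative, not the notebook's code.
import numpy as np
import wandb

wandb.init(project="music-transformer-demo")

sample_rate = 16000
waveform = np.random.uniform(-1, 1, size=sample_rate * 5)  # placeholder 5-second clip

wandb.log({
    "generated_audio": wandb.Audio(waveform, sample_rate=sample_rate,
                                   caption="Generated from scratch"),
    "piano_roll": wandb.Image("piano_roll.png", caption="Piano roll plot"),
})
wandb.finish()
```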
Reproduce the results with this Colab notebook
Generating from scratch
Generating from a primer melody
Generating accompaniment for a given melody
Summary
Music Transformer is a great paper that brings the power of language models to the domain of symbolic music generation, and it is able to generate longer piano music with coherent musical structure and style.
With the emergence of various linear-complexity Transformers, it would also be interesting to investigate the impact of replacing the Music Transformer architecture with candidates such as Linformer and Transformers are RNNs, as they share the advantage of a linear-time attention mechanism.
If you are interested in learning more, refer to the original paper and the Music Transformer blog post.