WhistleGen V2: Making Music With Transformers
This article dives into how the second version of WhistleGen works and explores how it enables you to generate folk music in ABC notation using minGPT.
How often have you thought, "I wish I could generate folk music at the push of a button?" All the time? Me too! Enter WhistleGen, a tool that does exactly that.
Some Random Generated Examples
This is made possible by training a transformer model on a dataset of traditional tunes. This article will dive into how it works, but if you're in a hurry to make some jolly AI-generated music, then you can also jump straight to the demo space and take the model for a spin.
The Data
ABC music notation is a standard designed as an accessible, text-based way to share music, especially folk music. Notes are represented as letters, and various other symbols act as modifiers, rests, repeats, and so on (more info here).
Tunes are traditionally shared in simplified form, and it is down to the performer to add ornamentation and embellishments. This means that the ABC format is usually a very concise representation of a tune, which, in turn, makes it a good candidate for language modeling!
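To make that concrete, here's a small hand-written sketch of a tune in ABC. The `X:`, `T:`, `M:`, and `K:` header fields give the tune's index, title, meter, and key; the body spells out the melody, with `|: ... :|` marking a repeated section:

```
X:1
T:An Example Jig
M:6/8
K:D
|:A|d2e f2d|e2f g2e|f2g a2f|e3 d2:|
```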

Viewing an example of a generated tune with https://editor.drawthedots.com/. ABC notation on the left, the resulting score on the right. The title, time signature, and key were specified as inputs.
In my previous attempt at a project like this, I scraped a website called thesession.org which has a decent collection of traditional Irish tunes.
This time around, I wanted even more training data. I hit the jackpot with this collection compiled by José A. Oliveira. I downloaded all the linked zip files and then extracted and merged the contents with a couple of shell commands: `for z in *.zip; do unzip "$z"; done` and then `cat *.abc > all_tunes.txt`. The result is a single 8MB text file with thousands of songs, one after another.
The Approach
We're going to train a fairly standard language model, which will try to predict the next token in a sequence - except that instead of the tokens being words or sub-words, they'll be individual characters in a string of ABC notation.
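In code, that amounts to building a vocabulary of the distinct characters in the corpus and mapping each to an integer id - roughly what a character-level dataset does under the hood (the variable names here are my own):

```python
# Build a character-level vocabulary from the merged tunes file
text = open('all_tunes.txt', encoding='utf-8').read()
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}  # character -> token id
itos = {i: ch for i, ch in enumerate(chars)}  # token id -> character

encode = lambda s: [stoi[c] for c in s]
decode = lambda ids: ''.join(itos[i] for i in ids)
print(f'{len(text):,} characters, vocab size {len(chars)}')
```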
The architecture is a decoder-only transformer. For the implementation, I chose Andrej Karpathy's minGPT library, since I've used it before and know it comes with a character-level example that is a good starting point for this project.

The approach (diagram adapted from https://jalammar.github.io/illustrated-gpt2/, an excellent resource for understanding GPT style transformers).
Once we have a trained model, we can generate new sequences auto-regressively by repeatedly feeding in a sequence, picking the next token based on the model's predictions, and then appending this token to the sequence.
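Here's a minimal sketch of that loop, assuming a minGPT-style model whose forward pass returns `(logits, loss)` and reusing the `encode`/`decode` helpers from above (in practice, minGPT ships a `generate()` helper that does essentially this):

```python
import torch

@torch.no_grad()
def sample_tune(model, prompt, max_new_tokens=500, temperature=1.0, block_size=128):
    # Encode the prompt (e.g. ABC headers like "T:My Tune\nM:6/8\nK:D\n")
    idx = torch.tensor([encode(prompt)], dtype=torch.long)
    for _ in range(max_new_tokens):
        # Crop the context to the model's window and get next-character logits
        logits, _ = model(idx[:, -block_size:])
        probs = torch.softmax(logits[:, -1, :] / temperature, dim=-1)
        # Sample one character and append it to the running sequence
        next_id = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_id], dim=1)
    return decode(idx[0].tolist())
```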
Training
I modified the trainer class included with minGPT to add W&B logging. A notebook with the code is available here. Training on this dataset can take hours, and since I was pretty happy with even the early baseline results, I didn't go too crazy exploring different configurations. The plots below show the two key models:
- A small version with 6 layers, 8 transformer heads, and an embedding size of 256, trained with a batch size of 512. ~4.8 million parameters.
- A larger model with 8 layers, 8 heads, and an embedding dimension of 512 trained with a batch size of 327 (in honor of Thomas' recent post on the subject). ~25 million parameters.
For comparison, the smallest GPT-2 model has 12 layers, 12 heads, and an embedding size of 768, for a total of 117 million parameters. Training something this size on our tiny dataset is left as an exercise for the reader!
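For reference, here's how the smaller configuration might be set up with the refactored minGPT (the block size is my assumption, and `chars` comes from the vocabulary snippet earlier; check the library for the exact API of your revision):

```python
from mingpt.model import GPT

config = GPT.get_default_config()
config.model_type = None        # set the dimensions explicitly instead of using a preset
config.n_layer = 6
config.n_head = 8
config.n_embd = 256
config.vocab_size = len(chars)  # number of distinct characters in the ABC corpus
config.block_size = 128         # context length, in characters
model = GPT(config)             # ~4.8M parameters
```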
PS: It's worth noting that after a few years of relative inactivity, Karpathy is working on a PR that refactors things and moves the examples into their own mini-projects. Replicating this training (without W&B logging) can now be done by simply running `chargpt.py` with the ABC tunes file as an argument. I recommend this over attempting to run my messy notebook, and I include my code only for transparency :)
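Incidentally, the refactored Trainer exposes a callback hook, so W&B logging no longer requires modifying the class at all. A sketch of the idea, assuming the model from above and a CharDataset-style `train_dataset` (the project name is a placeholder):

```python
import wandb
from mingpt.trainer import Trainer

wandb.init(project='whistlegen-v2')  # placeholder project name

train_config = Trainer.get_default_config()
train_config.batch_size = 512
trainer = Trainer(train_config, model, train_dataset)

def log_to_wandb(trainer):
    # The refactored Trainer sets .loss and .iter_num on every batch
    wandb.log({'loss': trainer.loss.item()}, step=trainer.iter_num)

trainer.set_callback('on_batch_end', log_to_wandb)
trainer.run()
```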
Evaluation
Here you can see example outputs from the model as it trains, logged using wandb.Html with the abcjs library handling the rendering of the score and the MIDI player.
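The logging itself only takes a few lines: wrap the ABC string in a snippet of HTML that asks abcjs to render it, then log that with `wandb.Html`. A sketch of the idea, with the CDN pin and element id being my own choices:

```python
import wandb

abc_tune = 'X:1\nT:Generated\nM:6/8\nK:D\n|:A|d2e f2d|e2f g2e|f2g a2f|e3 d2:|'

# Embed the ABC string in HTML that loads abcjs and renders it as a score,
# then log it to W&B as a rich media panel
html = f'''
<div id="score"></div>
<script src="https://cdn.jsdelivr.net/npm/abcjs@6/dist/abcjs-basic-min.js"></script>
<script>ABCJS.renderAbc("score", {abc_tune!r});</script>
'''
wandb.log({'generated_tune': wandb.Html(html)})
```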
If you look back at the early generations, many are disjointed and incoherent (not helped by some bugs in my preview code). After a number of steps, the model starts to do a pretty good job, and we see some very believable songs being generated.
If you're seeing a blank preview or no working player, try a different step (I like 'Pub Troohed To The Barcaroe' at step 9100, for example) or switch to the demo space instead.
I was worried that the model might be overfitting, especially since the dataset does contain multiple versions of some songs. Some observations:
- Searching through the training data for specific patterns shows that most outputs appear to be fairly novel, although there will sometimes be a short phrase or two that comes from existing songs.
- A few outputs are indeed VERY close to tunes from the training data, so if you're using this for anything serious, please check before you claim something as original.
- The model is much more prone to verbatim copying of text such as annotator emails or collection names, which appear many times throughout the training data.
- The test set is taken from the end of the dataset, which is sorted by tune name, so it contains many songs that aren't present in the training data. The test loss is higher than the training loss, but the comparison is tricky, since the way sequences are sampled during training also creates a LOT of repetition/overlap.

I also did some fairly informal human evaluations. I used to play in an Irish band, so I know some domain experts! Initial impressions are that the outputs often sound nearly believable but retain enough quirks that we can still spot that these aren't made by humans.
Try It Yourself
I made a demo space where you can experiment with generating your own songs using an early checkpoint of the smaller model. It'll be updated once the current training runs finish. Give it a go here.

The ABC notation is simple enough that you'll soon get the hang of tweaking the output - with a little fiddling you can get some lovely tunes going, and it feels like co-writing with an eager but weird assistant :)
Final Thoughts
I've explored music generation once or twice before, and while the systems have been pretty impressive, I've never seen much coherence from them in terms of global structure. This time was different, thanks in part to how concise ABC notation is. The outputs often have consistent parts with repeats. There are leitmotifs that appear throughout the music, with variations just as we'd see in traditional tunes. And besides the comically bad text in titles and descriptions, I'm hard-pressed to tell that some of these outputs are indeed AI-generated.
Thank you for coming along on this journey with me and making it all the way through this report. As always, you're welcome to reach out to me @johnowhitaker with comments, questions, or suggestions. And now, I think it's only right that we end with these sweet words straight from the model's mouth:
Ellern Froll
Lot I chang in Adar, sprink Night
There a cast of thu aron is for swer?
They a stay two mestlants of marry
And she can but a bonnyaran was she's awants as see sunked.
It sawn felt may bonnie, I sed
If an he charty fain on hight
I chound and summ.
PS: This project and report were motivated by the ongoing Weights & Biases #blogathon. It's a great excuse to write something up, and you should definitely join in! :)