Polyphonic Note Identification
Project Overview
This project is my attempt at polyphonic note identification: given raw audio, the model identifies the musical notes being played. For example, feeding the following audio clip of a piano into the model produces a list of musical notes. Click the play button to hear the clip.
This is the output, which is fully correct for this example. If you play these notes on a piano, it will sound just like the audio sample.
['C4', 'E4', 'G4', 'C5', 'E5']
Another example shows that even highly time-localized signals are identified well. I am not a great guitarist, so this is my best attempt at squeezing 8 notes into a 1.2-second clip:
Again, the notes of the C Lydian scale are identified correctly:
['C3', 'D3', 'E3', 'F#3', 'G3', 'A3', 'B3', 'C4']
The rest of the post will explain the architecture of this model.

Model Architecture Overview
Personal Motivation for the Project
Music and the guitar are my hobbies, so I play and listen to many kinds of music. As an ordinary person without perfect pitch, I usually do not know which notes and chords are present in the music I hear. Working them out requires a time-consuming process of trial and error on my guitar before I can piece together the rough harmonic structure of a song. Out of curiosity and interest, I decided to explore the idea of a mobile app that would help with this. I teamed up with a few friends who took care of the app-development side while I got to work on a machine-learning model.
I was interested in digital signal processing, so there were a number of things I wanted to try putting together:
- Wavelet Transforms
- Convolutional Nets
- Domain-Specific Enhancements
- Residual Connections
Model design
Wavelet Transforms
The first challenge I tackled was designing meaningful input features for whatever machine-learning model comes later. Many models that take audio as input use the Short-Time Fourier Transform (STFT) for input features; it is a common enhancement of the plain Fourier Transform that addresses signals being highly localized in time. However, I think Wavelet Transforms provide a more flexible and targeted approach to extracting time-localized signals. To implement this, I construct a filter bank of F Morlet wavelets at ascending frequencies. This filter bank of size F is convolved with the input signal to produce F frequency-response arrays that are fed into the rest of the neural network. I use Morlet wavelets as suggested by Kumar, N., & Kumar, R. (2020). The wavelet frequencies are spaced geometrically in base 2, modelled after human perception of musical octaves: each octave up doubles the frequency.
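Below is a minimal sketch of this filter bank in PyTorch. The specific values of `F_FILTERS`, `KERNEL_LEN`, the sample rate, the frequency range, and the `cycles` bandwidth parameter are all assumptions for illustration; the post does not specify them.

```python
import numpy as np
import torch
import torch.nn.functional as F_nn

SAMPLE_RATE = 16000   # assumed sample rate
F_FILTERS = 64        # number of wavelet filters ("F" in the text)
KERNEL_LEN = 1023     # filter length in samples (assumed)

def morlet(freq_hz, n=KERNEL_LEN, sr=SAMPLE_RATE, cycles=8.0):
    """Real-valued Morlet wavelet: a cosine under a Gaussian envelope.

    `cycles` sets the time/frequency trade-off: the Gaussian width is
    chosen so that roughly `cycles` oscillations fit under the envelope.
    """
    t = (np.arange(n) - n // 2) / sr
    sigma = cycles / (2 * np.pi * freq_hz)
    w = np.exp(-t**2 / (2 * sigma**2)) * np.cos(2 * np.pi * freq_hz * t)
    return w / np.linalg.norm(w)  # unit norm so filters are comparable

# Frequencies spaced geometrically in base 2 to mirror musical octaves,
# here spanning six octaves up from C2 (~65.4 Hz).
freqs = 65.41 * 2.0 ** np.linspace(0, 6, F_FILTERS)
bank = torch.tensor(np.stack([morlet(f) for f in freqs]), dtype=torch.float32)

def wavelet_features(signal: torch.Tensor) -> torch.Tensor:
    """Convolve a (batch, samples) signal with the bank -> (batch, F, time)."""
    x = signal.unsqueeze(1)                                # (batch, 1, samples)
    y = F_nn.conv1d(x, bank.unsqueeze(1), padding=KERNEL_LEN // 2)
    return y.abs()                                         # magnitude response
```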
The rest of this section elaborates on details related to signal processing. To check the degree of redundancy of the convolution filters (the Morlet wavelets), I computed pairwise inner products of all of them to see how close to orthogonal they are. As one would expect of wavelet filters constructed in ascending frequency, wavelets that differ greatly in frequency are nearly orthogonal (their inner product is close to zero), and vice versa. The inner product values are plotted on the upper triangle of the plot below, where brighter means higher (less orthogonal). Filter frequencies increase from left to right and top to bottom. The brightest spots correspond to inner products of each filter with itself, which should be high. The similarity values decrease steadily as the frequency gap between compared filters increases.

Orthogonality of Wavelet Kernels. Frequencies increase from left to right and top to bottom.
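The redundancy check itself is simple; here is a sketch that reuses the `bank` of unit-norm wavelets from the previous snippet to produce a heatmap like the one above.

```python
import matplotlib.pyplot as plt

gram = (bank @ bank.T).abs().numpy()   # (F, F) pairwise inner products
upper = np.triu(gram)                  # keep only the upper triangle

plt.imshow(upper)                      # brighter = higher = less orthogonal
plt.title("Pairwise inner products of Morlet wavelets")
plt.xlabel("filter index (frequency increases to the right)")
plt.ylabel("filter index (frequency increases downward)")
plt.colorbar()
plt.show()
```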
Since Wavelet Transforms are a generalization of the Fourier Transform, we can perform a similar analysis for simple sine/cosine filters that would more closely resemble the Fourier Transform.

Orthogonality of Fourier-like Kernels. Frequencies increase from left to right and top to bottom.
The simple sine/cosine filters turn out to have a more complicated redundancy pattern: similarity does not decrease monotonically as the frequency gap between compared filters grows. My guess is that this is caused by spectral leakage, and a simple STFT without proper windowing would likely run into the same problem. Wavelet transforms are thus potentially better than the STFT, though it would take a more rigorous investigation to know whether they actually perform better.
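For completeness, the same Gram-matrix analysis can be run on plain, rectangular-windowed sinusoids, which resemble a naive STFT basis. This sketch reuses `KERNEL_LEN`, `SAMPLE_RATE`, `freqs`, and `np` from the earlier snippets.

```python
def sine_kernel(freq_hz, n=KERNEL_LEN, sr=SAMPLE_RATE):
    """Unwindowed sinusoid, as in a naive STFT with a rectangular window."""
    t = (np.arange(n) - n // 2) / sr
    k = np.cos(2 * np.pi * freq_hz * t)
    return k / np.linalg.norm(k)

sine_bank = np.stack([sine_kernel(f) for f in freqs])
sine_gram = np.abs(sine_bank @ sine_bank.T)
# Unlike the wavelet Gram matrix, the off-diagonal mass here does not fall
# off monotonically with frequency separation -- the spectral-leakage
# pattern discussed above.
```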
Convolutional Neural Networks
CNNs are typically used for image-related tasks, and the features generated by the wavelet filter bank form meaningful images:

Image-like features of "Piano C Major Chord" audio clip. Frequencies increase from top to bottom. Horizontal time axis progresses from left to right.
['C4', 'E4', 'G4', 'C5', 'E5']
Notice that there are 5 "lines" ranging from dim to bright, and that the number of lines matches the number of notes in the audio clip. Indeed, each of those lines corresponds to one of the predicted notes. From this point onwards, we can apply standard image-recognition techniques to achieve polyphonic note identification.
Domain-Specific Enhancements
Musical sounds are very feature-rich. One defining characteristic of sounds from musical instruments is timbre. Most people, even without any musical training, can tell the difference between a middle C played on a piano and a middle C played on a guitar. The timbre of these instruments plays a key role in telling them apart, and timbre is largely characterized by the dominant overtones. From a frequency point of view, the two instruments produce very similar sets of frequencies; what differs is the strength of each frequency. In other words, the two instruments share the same harmonic series: the same fundamental frequency, with overtones at integer multiples of that fundamental.
Since the image fed into the model has a specific structure where frequency increases from top to bottom, we can use "tall" kernels, spanning many frequency rows at once, to let the model learn the overtone structures of various instruments.
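As a sketch, a tall kernel is just a 2D convolution whose kernel is much larger along the frequency axis than along the time axis. The exact dimensions below are assumptions; the post does not state the sizes used.

```python
import torch.nn as nn

# Input: (batch, 1, F, time) feature images from the wavelet transform.
tall_conv = nn.Conv2d(
    in_channels=1,
    out_channels=16,
    kernel_size=(25, 3),   # tall in frequency, narrow in time (assumed sizes)
    padding=(12, 1),       # preserve the spatial dimensions
)
# A kernel 25 rows tall can see a fundamental and several of its overtones
# in a single receptive field, since the rows are log-spaced frequencies.
```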
Residual Connections
So far, much effort has gone into engineering input features. We want a deep model that can efficiently build higher abstractions on top of these rich input features. An established technique for doing so is to introduce an "identity function bias" by adding residual connections to the network, as suggested by He, K., Zhang, X., Ren, S., & Sun, J. (2016). Batch normalization is also included to attenuate the effects of internal covariate shift (Ioffe, S., & Szegedy, C., 2015).
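A minimal residual block in this spirit might look as follows; the actual block in the model may differ in width, depth, and kernel shape.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two conv layers with batch norm and an identity shortcut."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)  # the residual (identity) connection
```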
Putting things together

Model Architecture
The diagram above shows the overall design of the model. To evaluate its performance, I use a test dataset of 2496 labelled audio clips from various kinds of songs. Each clip is 1.2 seconds long and contains a varying number of notes. I measure accuracy by counting the notes correctly identified and dividing by the total number of notes in the entire test dataset. By this measure, the model achieves 82% accuracy. The model has since been exported to an Android app; my friends and I have spent some time playing around with it, and it works decently well, with occasional mistakes.
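Concretely, the metric can be computed as below, assuming (this encoding is my assumption, not stated in the post) that predictions and labels are multi-hot vectors over the note vocabulary.

```python
import torch

def note_accuracy(pred: torch.Tensor, target: torch.Tensor) -> float:
    """pred, target: (num_clips, num_notes) binary multi-hot tensors.

    Counts labelled notes that the model correctly identifies, divided by
    the total number of labelled notes across the whole test dataset.
    """
    correct = (pred.bool() & target.bool()).sum().item()
    total = target.bool().sum().item()
    return correct / total
```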
Implementation Details
The model is written in Python using the PyTorch framework. It is quite lightweight, so I was able to build and train it on my personal laptop. Training and test data are synthesized from a dataset of MIDI files. Training runs for a small number of epochs with the Adam optimizer.
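A bare-bones version of this training setup is sketched below. The placeholder model, data shapes, and hyperparameters are all assumptions chosen so the snippet runs end to end; they are not the values used in the actual project.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Dummy stand-ins: 64x128 wavelet feature images, 88 candidate notes.
features = torch.randn(256, 1, 64, 128)
labels = (torch.rand(256, 88) > 0.9).float()
loader = DataLoader(TensorDataset(features, labels), batch_size=32)

model = nn.Sequential(nn.Flatten(), nn.Linear(64 * 128, 88))  # placeholder net
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.BCEWithLogitsLoss()  # one sigmoid output per candidate note

for epoch in range(5):  # "a small number of epochs"
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
```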
References
- Kumar, N., & Kumar, R. (2020). Wavelet transform-based multipitch estimation in polyphonic music. Heliyon, 6(1), e03243.
- He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778).
- Ioffe, S., & Szegedy, C. (2015, June). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning (pp. 448-456). PMLR.