Fine-tuning ResNet-18 for Audio Classification

Reproducible experiments using fastai, fastaudio, and wandb sweeps to explore hyperparameters for achieving high accuracy on the ESC-50 dataset.
John Hartquist


One common challenge that all machine learning researchers face at one point or another is that of hyperparameter tuning. Once the data is obtained, cleaned, and transformed, one must choose which type of algorithm to apply to the specific problem at hand, and most of them have at least a few hyperparameters that can affect the final results drastically. This is especially true when training deep learning models as one needs to define the architecture of the network, which learning rate to use, the size of training batches, the number of epochs to train for, and potentially many others. Trying all the combinations is generally very expensive both in terms of time and in resource utilization.

Even when many experiments take place, usually only the best results are reported, and it is not always clear exactly which hyperparameters were used or how many different configurations were tested. While it is becoming more and more common for researchers to publish the code used along with their work, I believe tools like Weights and Biases (the platform I'm writing this report on) have the potential to bring even more transparency and reproducibility to ML research. It is now possible not only to visualize a single training session, but also to compare statistics across many training sessions, and to save the exact code and all of the parameters used for each one.

In order to familiarize myself with this tool, I decided to apply it to a problem I'm already familiar with. About two years ago, I wrote a blog post about using the PyTorch and fastai libraries to generate spectrograms for audio classification at training time. Since then, fastai version 2 has been released, and a module called fastaudio has been created to streamline the whole process. The creators of fastaudio also set up a small competition repository based on the ESC-50 dataset, a collection of 2000 audio samples that are labelled with 50 different classes. I thought it would be fun to train some baseline models and try to beat some of the existing benchmarks (86.50% accuracy is the highest listed at the time of writing).


The general approach used in fastaudio is to transform raw audio files into log-mel spectrograms, which are two-dimensional, image-like representations. These can be used with traditional computer vision models to produce relatively good results. Since the conversion from audio to log-mel spectrogram also requires a number of hyperparameters (window size, hop length, number of mels, FFT size, etc.), I figured that it could be useful for others to see how these knobs affect classification accuracy. As one example, the authors of this paper were able to increase the accuracy on their task from 88.9% to 96.9% simply by adjusting the parameters used to generate their spectrograms. Since the ESC-50 dataset is so small, it is possible to train a pretty good model in only a few minutes, making it great for running many experiments.

The Experiment

Experiment Setup

The goal of this exercise was primarily to find out how varying each of the spectrogram parameters would affect classification accuracy. After some initial experiments, I decided I would focus on the transfer learning case, and fine-tune pre-trained ResNet-18 models on the image-like spectrograms. While spectrograms are different from real images in a lot of ways, I was able to achieve pretty good accuracy in a short amount of time, whereas training models from scratch ended up taking 3-4 times as long to get to a similar accuracy.


I used the ESC-50 dataset which comes with its own cross validation splits. For all of the hyperparameter experiments, I used fold 1 as the validation set. Then at the end of the experiment, once I found a good combination of parameters, I would run many trials, using each fold as the validation set in turn to report the final mean accuracy over all the runs.
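The fold rotation described above can be sketched in a few lines. This is only an illustration, not the code used in the experiment: the `clips` records and the `split_by_fold` helper are hypothetical stand-ins for the dataset's metadata, and the even spread of 2000 clips over 5 folds is an assumption.

```python
def split_by_fold(clips, valid_fold):
    """Hold out one fold for validation and train on the rest."""
    train = [c for c in clips if c["fold"] != valid_fold]
    valid = [c for c in clips if c["fold"] == valid_fold]
    return train, valid

# Hypothetical stand-in for the metadata: 2000 clips spread evenly over 5 folds.
clips = [{"filename": f"clip_{i}.wav", "fold": i % 5 + 1} for i in range(2000)]

# Hyperparameter sweeps: fold 1 is always the validation set.
train, valid = split_by_fold(clips, valid_fold=1)

# Final evaluation: each fold takes a turn as the validation set.
fold_sizes = {fold: len(split_by_fold(clips, fold)[1]) for fold in range(1, 6)}
```

Keeping the validation fold fixed during the sweeps means differences between runs come from the hyperparameters rather than from the split.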


As mentioned in the introduction, I used the new version of fastai along with fastaudio to train the models, and wandb to track the progress of each one. Fastai makes this integration super easy: all that is required is to add a callback to the main Learner object. The "Sweep" feature of wandb came in very handy, as it gave me the ability to define which parameters I'd like to modify for each experiment, and then turn on an agent (or many agents) to train a model for every possible combination of those parameters (grid search). I rented a cloud GPU server with 8 V100 GPUs for about 24 hours, and ended up training 1430 models in total.


When we train deep learning models, we generally get varying results even when training with the exact same hyperparameters. This is due to the randomness introduced primarily by weight initialization, as well as by shuffling the training data. Fortunately, fastai comes with a set_seed function (docs) that sets the seeds for NumPy, PyTorch, and random. By giving each run a specific seed, it can be reproduced exactly. Weights and Biases' Python client, wandb, also comes with a parameter named save_code that can be turned on when initializing a run, allowing the script or notebook to be saved along with the training statistics for that specific run. If the script lives in a git repository, wandb also saves the commit hash, along with any parameters that were passed in if it was run as part of a sweep.
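The effect of seeding can be illustrated with the standard library alone. This is a minimal sketch of the idea, not fastai's set_seed; the `shuffled_indices` helper is hypothetical and stands in for the shuffling of training data.

```python
import random

def shuffled_indices(n_items, seed):
    """Return a shuffled list of indices that depends only on the seed."""
    rng = random.Random(seed)  # independent generator, seeded explicitly
    indices = list(range(n_items))
    rng.shuffle(indices)
    return indices

run_a = shuffled_indices(10, seed=1)
run_b = shuffled_indices(10, seed=1)  # same seed: identical order
run_c = shuffled_indices(10, seed=2)  # different seed: almost certainly another order
```

Recording the seed alongside the other parameters of a run is what makes the run repeatable later.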

Training Script

All of the code for this experiment can be found in the associated GitHub repo. At the beginning of the script, a bunch of default parameters are defined and passed to wandb.init:

run_config = dict(
    # spectrum

    # model

    # training

    # data
)

Then, a sweep configuration file can be defined that replaces specific parameters over the course of many runs. For example, to test with many different batch sizes, the sweep file can look like this:

method: grid
project: fastaudio-esc-50
parameters:
  batch_size:
    values: [8, 16, 32, 64, 96, 128, 192, 256]

In order to run each configuration multiple times, I added a trial_num parameter, so that each batch_size is run once for every trial_num value, allowing me to average the results over 5 trials each. Therefore, the following will produce 40 total runs:

parameters:
  batch_size:
    values: [8, 16, 32, 64, 96, 128, 192, 256]
  trial_num:
    values: [1, 2, 3, 4, 5]
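To see where the 40 runs come from, here is a small sketch of how a grid search expands into individual configurations. The `expand_grid` helper is hypothetical and only imitates the behaviour of a grid sweep; it is not part of wandb.

```python
from itertools import product

# Every batch_size is paired with every trial_num.
parameters = {
    "batch_size": [8, 16, 32, 64, 96, 128, 192, 256],
    "trial_num": [1, 2, 3, 4, 5],
}

def expand_grid(parameters):
    """Return one config dict per combination of parameter values."""
    names = list(parameters)
    return [dict(zip(names, combo)) for combo in product(*parameters.values())]

runs = expand_grid(parameters)
print(len(runs))  # 8 batch sizes x 5 trials = 40
```

Because trial_num does not change the model, it only serves to repeat each batch_size five times.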

Hyperparameters Tested

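For more information about the different parameters used to generate spectrograms, I highly recommend the YouTube series Audio Signal Processing for Machine Learning by Valerio Velardo. For this experiment, I ran a sweep for each of the following hyperparameters: hop length, window length, number of mel bands (n_mels), FFT size (n_fft), f_max, spectrogram normalization, mixup, batch size, number of epochs, and model architecture.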

In each sweep I averaged the results over 5 trials and tested each configuration by running for 10, 20, and 80 epochs. In all cases I used the default learning rate of 0.01, and the following results all correspond to the 80 epoch versions unless otherwise specified. I also used a sample rate of 44,100 Hz as that is the sample rate of the raw audio. For each of the following graphs, you can expand the "run set" to view all the individual runs, including training statistics, hardware utilization, parameters, and the code that was executed.

Sweep Results


Hop length is measured in samples here, and 441 samples corresponds to 10 milliseconds at a sampling rate of 44,100 Hz. In this sweep, we can see that a hop_length of 308 samples (about 7 ms) had the highest average accuracy overall, while 529 produced the lowest validation loss. (The black bars on the right show the standard deviation over the group of 5 trial runs.) Hop length is important because it directly affects how "wide" the image is: while a hop_length of 5 ms might have a slightly higher accuracy than 10 ms, the images will be twice as wide, and therefore take up twice as much GPU memory.
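The relationship between hop length, milliseconds, and image width can be checked with a little arithmetic. This sketch assumes 5-second clips and a centered STFT (one frame per hop, plus one); both are assumptions, and the exact frame count depends on the library's padding settings. The hop of 220 samples is included only as the "about 5 ms" case.

```python
SAMPLE_RATE = 44_100  # Hz, as used in the experiment
CLIP_SECONDS = 5      # assumption: every clip is 5 seconds long

def hop_ms(hop_length, sample_rate=SAMPLE_RATE):
    """Hop length in milliseconds."""
    return 1000 * hop_length / sample_rate

def n_frames(hop_length, sample_rate=SAMPLE_RATE, seconds=CLIP_SECONDS):
    """Spectrogram width, assuming a centered STFT."""
    return (sample_rate * seconds) // hop_length + 1

for hop in (220, 308, 441, 529):
    print(hop, round(hop_ms(hop), 2), n_frames(hop))
```

Halving the hop from 441 to roughly 220 samples doubles the number of frames, which is why the smaller hop costs about twice the memory.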



When generating a spectrogram with most audio libraries, if the window length is not set, it defaults to the FFT size. Usually the FFT size is set to be a power of 2, as it is computationally more efficient (though I recently learned that this is not necessarily the case, depending on the implementation). The window length cannot be larger than the FFT size, but if it is shorter, then the FFT buffer is generally zero-padded. The window length is a very important parameter for determining the time-frequency tradeoff. Smaller window lengths will result in good time resolution, but poor frequency resolution, especially in the low frequencies. In the same way, large window sizes give good frequency resolution, but they average over a longer period of time.

In this sweep, we see that 2205 samples (50ms) produces the best average accuracy, though I was surprised to see that 4410 (100ms) was almost as good.
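The tradeoff can be made concrete with a rule of thumb: a window that lasts T seconds can separate frequencies roughly 1/T Hz apart. This sketch ignores the shape of the window function, so the numbers are approximate, and the helper names are hypothetical.

```python
SAMPLE_RATE = 44_100  # Hz

def window_ms(win_length, sample_rate=SAMPLE_RATE):
    """Duration of the analysis window in milliseconds."""
    return 1000 * win_length / sample_rate

def freq_resolution_hz(win_length, sample_rate=SAMPLE_RATE):
    """Rough frequency resolution: the reciprocal of the window duration."""
    return sample_rate / win_length

for win in (1024, 2205, 4410):
    print(win, round(window_ms(win), 1), round(freq_resolution_hz(win), 1))
```

Going from 2205 to 4410 samples halves the frequency spacing from about 20 Hz to about 10 Hz, at the price of smearing each frame over 100 ms instead of 50 ms.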



If hop_length determines the width of the spectrogram, then the height is determined by n_mels. After first producing a regular STFT using an FFT size of, say, 2048, the result is a spectrum with 1025 FFT bins, varying over time. When we convert to a mel spectrogram, those bins are logarithmically compressed down to n_mels bands. This sweep showed that 32 is definitely too few, but other than that it does not affect accuracy too much. A value of 128, which is the default with fastaudio, gives the highest mean accuracy, and a value of 160 gives the lowest mean validation loss. I am curious whether a larger network architecture might be able to take advantage of more mel bands.
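The compression from FFT bins to mel bands can be sketched as follows. The HTK-style mel formula used here is an assumption (libraries differ in the exact variant), and the helper names are hypothetical. The point is only that bands equally spaced in mel become wider in Hz as frequency increases.

```python
import math

def hz_to_mel(f_hz):
    """HTK-style mel formula (an assumption; other variants exist)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_edges(n_mels, f_min, f_max):
    """n_mels + 2 edge frequencies (Hz), equally spaced on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_mels + 2)]

n_fft = 2048
n_bins = n_fft // 2 + 1  # 1025 linear frequency bins
edges = mel_band_edges(n_mels=128, f_min=0.0, f_max=22_050.0)
widths = [edges[i + 2] - edges[i] for i in range(128)]  # band widths in Hz
print(n_bins, round(widths[0], 1), round(widths[-1], 1))
```

The lowest band spans only a few tens of Hz while the highest spans more than a kilohertz, so the top bands summarize many more FFT bins than the bottom ones.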



FFT size is an interesting parameter. At first glance, it might not be obvious why it would make a difference as long as it was larger than the win_length. After all, however many frequency bins are created are just compressed down to n_mels anyway, so it doesn't directly affect the shape of the spectrogram.

I believe that the reason is due to how we create the mel bands. The lower frequency mel bands are generated from a small number of FFT bins, while the higher frequency mel bands are generated from many FFT bins. When you use a higher n_fft but keep the win_length the same, you don't actually get any more information from the raw signal, as the FFT buffer is just zero-padded. What happens, however, is that the resulting frequency spectrum is interpolated. The same information is spread out across more FFT bins, so when the mel bands are created (especially the low frequency ones), they get a more accurate representation, since they are averaged over more values.
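One way to probe this explanation is to count how many FFT bins land inside each mel band for different values of n_fft. This is a sketch under assumptions: an HTK-style mel formula, 128 bands, f_max at the Nyquist frequency, and triangular bands that span two adjacent edge intervals. The helper names are hypothetical, and the library used in the experiment may build its filterbank differently.

```python
import math

def mel_band_edges(n_mels, f_max):
    """Edge frequencies (Hz) of n_mels bands, equally spaced in mel (HTK formula)."""
    top = 2595.0 * math.log10(1.0 + f_max / 700.0)
    return [700.0 * (10.0 ** (top * i / (n_mels + 1) / 2595.0) - 1.0)
            for i in range(n_mels + 2)]

def bins_per_band(n_fft, sample_rate=44_100, n_mels=128):
    """Count the FFT bins that fall strictly inside each mel band."""
    edges = mel_band_edges(n_mels, sample_rate / 2)
    bin_hz = [k * sample_rate / n_fft for k in range(n_fft // 2 + 1)]
    return [sum(1 for f in bin_hz if edges[i] < f < edges[i + 2])
            for i in range(n_mels)]

for n_fft in (1024, 4096):
    counts = bins_per_band(n_fft)
    print(n_fft, "lowest band:", counts[0], "highest band:", counts[-1])
```

With the smaller n_fft, the lowest bands are narrower than the spacing between FFT bins, so they are built from very few values; the larger n_fft places several interpolated bins inside each of them.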

In this sweep, I set win_length to 1024, and we can see over a 4% improvement in classification accuracy when moving between n_fft 1024 and 4096. After that, increasing n_fft further does not seem to help. I believe this is the single greatest finding throughout this experiment, and helps to explain why I was able to exceed some of the public benchmarks for ESC-50.



When converting from a regular spectrogram to a mel spectrogram, you can specify f_min and f_max, which define the range of frequencies that the bands will be split between. f_min defaults to 0, which I used in all cases. f_max, however, defaults to the Nyquist frequency, which is equal to half the sampling rate. It is common for researchers to downsample their audio to 16 kHz to minimize memory usage and speed up preprocessing. However, if you select hop_length and win_length with respect to fixed time intervals (e.g. using milliseconds instead of samples), the size of the resulting spectrogram will be the same. Therefore, using a higher sampling rate will allow you to have a higher f_max, say 22.05 kHz for a sample rate of 44.1 kHz, whereas you would be limited to 8 kHz with a sample rate of 16 kHz. Depending on the nature of the data, you may or may not care about information in the high frequencies.
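The claim that the spectrogram size stays the same can be checked numerically. This sketch assumes 5-second clips, a 10 ms hop, a 50 ms window, and a centered STFT; the `spectrogram_geometry` helper is hypothetical.

```python
CLIP_SECONDS = 5  # assumption: every clip is 5 seconds long

def spectrogram_geometry(sample_rate, hop_ms=10, win_ms=50):
    """Derive sample-based settings from time-based ones."""
    hop_length = round(sample_rate * hop_ms / 1000)
    win_length = round(sample_rate * win_ms / 1000)
    return {
        "hop_length": hop_length,
        "win_length": win_length,
        "n_frames": (sample_rate * CLIP_SECONDS) // hop_length + 1,
        "f_max_limit": sample_rate / 2,  # Nyquist frequency
    }

low = spectrogram_geometry(16_000)
high = spectrogram_geometry(44_100)
print(low)
print(high)
```

Both sample rates give the same number of frames, so with a fixed n_mels the images have identical shapes; only the upper limit for f_max differs.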

In this dataset, lower f_max seemed to correlate with lower accuracy, but it was not very definitive. 18,000 Hz had the lowest validation loss on average.



When doing transfer learning with fastai, by default, the data is normalized using the same statistics as the data the model was originally trained on. The values in a spectrogram are distributed very differently from those in a normal image. Chris Kroenke wrote an awesome article about normalizing spectrograms, and I borrowed his code for this sweep, specifically using the "Global Normalization" technique. My results are similar to his, and show a clear improvement in both validation loss and accuracy when normalizing the data to have a mean of 0 and standard deviation of 1.
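The idea of normalizing with a single mean and standard deviation for the whole dataset can be sketched with the standard library. This is not the borrowed code mentioned above, just an illustration; the tiny spectrograms are made-up values that exist only to exercise the functions.

```python
from statistics import fmean, pstdev

def global_stats(spectrograms):
    """Mean and standard deviation over every value in every spectrogram."""
    values = [v for spec in spectrograms for row in spec for v in row]
    return fmean(values), pstdev(values)

def normalize(spec, mean, std):
    """Shift and scale one spectrogram with the dataset-wide statistics."""
    return [[(v - mean) / std for v in row] for row in spec]

# Made-up 2x2 "spectrograms" (in dB), purely illustrative.
train_specs = [
    [[-80.0, -60.0], [-40.0, -20.0]],
    [[-70.0, -50.0], [-30.0, -10.0]],
]
mean, std = global_stats(train_specs)
normalized = [normalize(spec, mean, std) for spec in train_specs]
```

In practice the statistics would be computed once on the training set and then reused unchanged for the validation data.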



I hadn't heard of mixup until experimenting with the baseline tutorials on the fastaudio Audio-Competition repo. It is a data augmentation technique that can significantly improve classification accuracy. You can read the original paper here. Since an implementation is built into fastai, I was able to use it with a single line of code, and it clearly helped. This is the only data augmentation in this entire set of experiments, and you can see that a value of only 0.1 produces the highest accuracy and the lowest validation loss.
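For readers who have not seen mixup either, the core idea is to train on weighted blends of two examples and their labels. The sketch below follows the general recipe from the paper, not fastai's implementation, and it assumes that the swept value of 0.1 is the alpha parameter of the Beta distribution. The `mixup_pair` helper is hypothetical.

```python
import random

def mixup_pair(x1, y1, x2, y2, alpha=0.1, rng=random):
    """Blend two examples; the weight lam is drawn from Beta(alpha, alpha)."""
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam

rng = random.Random(0)  # seeded so the example is repeatable
x, y, lam = mixup_pair([1.0, 2.0], [1.0, 0.0], [3.0, 4.0], [0.0, 1.0], rng=rng)
```

A small alpha concentrates lam near 0 or 1, so most blended examples stay close to one of the two originals.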



Usually it is recommended to use a batch_size as large as possible as long as you don't run out of GPU memory. In this case, because the dataset is so small (only 2000 examples), 64 seems to be the best value. This could be because there are more steps per epoch with a smaller batch.
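The steps-per-epoch argument is easy to quantify. This sketch assumes 1600 training clips (2000 clips in 5 equal folds, with one fold held out) and keeps the last partial batch; a data loader that drops it would give slightly smaller numbers.

```python
import math

TRAIN_SIZE = 1600  # assumption: 2000 clips, 5 equal folds, one held out

def steps_per_epoch(batch_size, n_items=TRAIN_SIZE):
    """Number of weight updates per epoch (the last, partial batch is kept)."""
    return math.ceil(n_items / batch_size)

steps = {bs: steps_per_epoch(bs) for bs in (8, 16, 32, 64, 96, 128, 192, 256)}
print(steps)
```

At a batch size of 256 the model gets only 7 updates per epoch, compared with 25 at a batch size of 64.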



I ran all of the previous configurations using 10, 20, and 80 epochs. While training for 80 epochs was almost always better than training for 20 epochs, the difference was not significant. Because fastai uses one-cycle training by default (paper by Leslie Smith), good accuracy is sometimes possible with a small number of epochs, and running for more epochs does not necessarily produce better results.

In this sweep we can see that while 20 epochs produced the lowest validation loss, 50 epochs achieved the highest accuracy, and training for 150 epochs for over 23 minutes was even worse than training for 20 epochs for only 3.5 minutes!
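For context, a one-cycle schedule warms the learning rate up to a peak and then anneals it to a very small value, all within the chosen number of epochs. The sketch below is a generic version with cosine interpolation; the constants (pct_start, div, div_final) are illustrative assumptions and may not match what fastai uses, and the peak of 0.01 is simply the learning rate quoted in this report.

```python
import math

def one_cycle_lr(step, total_steps, lr_max=0.01, pct_start=0.25,
                 div=25.0, div_final=1e5):
    """Learning rate at `step`: cosine warm-up to lr_max, then cosine decay."""
    def cos_interp(start, end, pos):
        # pos = 0 gives start, pos = 1 gives end
        return end + (start - end) * (1 + math.cos(math.pi * pos)) / 2

    warmup_steps = pct_start * total_steps
    if step < warmup_steps:
        return cos_interp(lr_max / div, lr_max, step / warmup_steps)
    pos = (step - warmup_steps) / (total_steps - warmup_steps)
    return cos_interp(lr_max, lr_max / div_final, pos)

total = 80 * 25  # e.g. 80 epochs of 25 steps each
schedule = [one_cycle_lr(s, total) for s in range(total + 1)]
```

Since the whole schedule is stretched to fit the requested number of epochs, a short run still ends with a fully annealed learning rate, which may be part of why 20 epochs can already do well.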



Now that I had a good idea of how various hyperparameters affected performance on ResNet-18, I wanted to see what kind of gains in performance I could achieve by simply swapping out the pre-trained model for something a little more powerful.

As we increase the size of the ResNet model, we see that our training loss gets better until we reach the very large ResNet-152. The validation loss is the best with ResNet-34, and then gets worse with ResNet-50 and ResNet-101, which may be a sign of overfitting. I have a feeling that these larger models may also just need to be trained longer, as they each have a very large number of parameters. It is also worth noticing how much longer it takes to train the larger models for the same number of epochs with the same batch sizes.

We see a similar pattern with the DenseNet models with DenseNet-161 performing the best of all the architectures, achieving an impressive 91.4% accuracy!


Final Results

To wrap up, I decided to measure the accuracy of a set of hyperparameters on both resnet18 and on densenet161. To do this, I ran 5 trials for each of the 5 folds, and averaged the results.


For the resnet18 trial, I averaged 25 runs for each of 10, 20, and 80 epochs (75 runs in total). Here is the corresponding sweep configuration:

method: grid
project: fastaudio-esc-50
    value: 64
    value: 44100
    value: 308 
    value: 2205
    value: 224
    value: 4096
    value: True
    value: 0.1
    value: 18000
    value: resnet18
    values: [10, 20, 80] 
    values: [1, 2, 3, 4, 5]
    values: [1, 2, 3, 4, 5]
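The total of 75 runs follows from the entries that list several values; every fixed entry contributes a factor of one. A minimal sketch of that arithmetic, where the parameter names are assumptions chosen for illustration:

```python
from math import prod

# Only parameters with several values multiply the number of runs.
# The names below are assumptions; the value lists come from the sweep.
varying = {
    "n_epochs": [10, 20, 80],
    "fold": [1, 2, 3, 4, 5],
    "trial_num": [1, 2, 3, 4, 5],
}
total_runs = prod(len(values) for values in varying.values())
runs_per_epoch_setting = total_runs // len(varying["n_epochs"])
print(total_runs, runs_per_epoch_setting)  # 75 25
```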

These results show that even training for only 20 epochs for just under 4 minutes, we get 86.14%. For comparison, the highest accuracy listed on the official ESC-50 is 86.5%. When training for 80 epochs, we reach 89.54% in a little over 14 minutes. That's pretty good for using a model pre-trained on images, and using no data augmentation except for MixUp!

(grouped by n_epoch)


(grouped by fold; only 80 epoch runs)



The best results overall can be seen with the same set of parameters with DenseNet-161. Averaged over 5 runs for each of the 5 folds and training for 80 epochs each, we see an average accuracy of 91.47%.

(grouped by fold; only 80 epoch runs)



After training 1,430 models and analyzing the results with the Weights and Biases platform, I was able to find hyperparameters to fine-tune a ResNet-18 model, pre-trained on image data, to reach 89.5% accuracy in a little over 14 minutes of training, or 86.1% in under 4 minutes. I imagine that by using more specialized models informed by domain expertise, it will be possible to reach even higher levels of accuracy using similar spectrogram settings.

Throughout this project I learned a lot about both designing experiments and training deep learning models. I'm grateful for all the open-source libraries and communities behind them that make this kind of research both possible and accessible. In the future I plan to try out these techniques on larger audio classification datasets as well as comparing different strategies for data augmentation as related to audio.

I sincerely thank you for taking the time to read through this report, and you can find all the code I used here. Please let me know if you have any questions, comments, or corrections.
