
Selecting the best models

In this report, the process of selecting the best models for further training is discussed.
Created on August 18 | Last edited on August 30


Introduction

After running the first few experiments and trying to fix the offset issue, as discussed at length in this report, we need to select from the latest sweep the sets of hyperparameters or runs that we consider best, in order to train our model further and evaluate it. The sweep discussed in this report is aa0yaae1, which has 39 runs with different hyperparameters.

[Parallel-coordinates plot for run set aa0yaae1: batch_size, d_model, dim_feedforward, dropout, learning_rate, n_heads, n_layers and loss_multiplier against test_loss]
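For reference, a sweep over these hyperparameters could be configured roughly as shown below. This is only a sketch: the search method, distributions and value ranges are placeholders and do not reproduce the actual aa0yaae1 configuration.

```python
import wandb

# Hypothetical sweep configuration; method, distributions and ranges are placeholders,
# not the ones actually used for aa0yaae1.
sweep_config = {
    "method": "random",
    "metric": {"name": "test_loss", "goal": "minimize"},
    "parameters": {
        "batch_size":      {"values": [16, 32, 64]},
        "d_model":         {"values": [64, 128, 256, 512]},
        "dim_feedforward": {"values": [64, 128, 256]},
        "dropout":         {"min": 0.1, "max": 0.25},
        "learning_rate":   {"min": 0.02, "max": 0.04},
        "n_heads":         {"values": [2, 4, 8]},
        "n_layers":        {"values": [6, 8, 10, 12]},
        "loss_multiplier": {"min": 0.25, "max": 0.9},
    },
}

sweep_id = wandb.sweep(sweep_config, project="<project>")  # placeholder project name
# wandb.agent(sweep_id, function=train_fn)  # train_fn would be the training entry point
```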


Model selection

From losses to hit accuracies

Up to this point, we had been looking at the test loss values (since the test subset is not used to train the model) to determine whether a model was doing well or not. However, now that the loss values are weighted and each run uses a different weight (loss_multiplier in the plot shown above), these values are no longer comparable across runs. A run could have a very low test loss simply because its loss_multiplier is low and still be one of the worst combinations of hyperparameters, as happens with the absurd-sweep-24 run:

[Run panel: absurd-sweep-24]

As can be seen and heard above, this run generates hits on every timestep and voice for almost every pattern, so we need another metric that gives us an idea of which of these runs we should pay more attention to. Since the weights don't affect the hit accuracy, it could be a good metric to look at. Looking at the runs below, the poor example we just discussed (absurd-sweep-24) has a very low hit accuracy, reinforcing the idea that this could be a decisive metric for choosing the best hyperparameter sets. Perhaps a better approach would have been to run one sweep to find a good value for the loss_multiplier (or penalty), and then a second sweep in which all runs share that loss_multiplier to figure out the rest of the hyperparameters.

[Run set: aa0yaae1]
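To make this concrete, here is a minimal sketch of why the weighted test loss depends on the run while the hit accuracy does not. The function names and the exact loss composition are assumptions for illustration, not the actual training code:

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()  # hits: binary target per (timestep, voice)
mse = nn.MSELoss()            # velocities and offsets: continuous targets

def weighted_loss(h_logits, v_pred, o_pred, h_true, v_true, o_true, loss_multiplier):
    # loss_multiplier rescales part of the loss, so two runs with different
    # multipliers report test losses on different scales and cannot be compared.
    hit_loss = bce(h_logits, h_true)
    velocity_loss = mse(v_pred, v_true)
    offset_loss = mse(o_pred, o_true)
    return loss_multiplier * hit_loss + velocity_loss + offset_loss

def hit_accuracy(h_logits, h_true):
    # Fraction of (timestep, voice) cells where the thresholded prediction matches
    # the ground truth; independent of loss_multiplier, hence comparable across runs.
    h_pred = (torch.sigmoid(h_logits) > 0.5).float()
    return (h_pred == h_true).float().mean().item()
```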

Let's take a look at the top 12 runs when sorted by highest hit accuracy (test set):

[Panel: top runs by test_hit_accuracy]


Patterns with triplets

By listening to examples from the selected runs and looking at the velocity heatmaps presented above, we can see that they closely resemble the ground truth set. One thing to note is that, since our dataset only contains 4/4 examples and the grid has 4 timesteps per beat, in examples that use triplets (e.g. some of the funk patterns) our model's predictions diverge more from the ground truths or expected patterns - as can be heard in the synthesized audio presented below.

[Panel: top runs by test_hit_accuracy (synthesized audio)]
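To illustrate why triplets are a problem for a grid of 4 timesteps per beat, the toy example below quantizes the onsets of an eighth-note triplet onto that grid (how the actual preprocessing handles these onsets may differ):

```python
# Onsets of an eighth-note triplet within one beat, as fractions of a beat.
steps_per_beat = 4
triplet_onsets = [0.0, 1 / 3, 2 / 3]

for t in triplet_onsets:
    nearest_step = round(t * steps_per_beat)
    offset = t * steps_per_beat - nearest_step  # residual the offset channel has to absorb
    print(f"onset at {t:.3f} beats -> step {nearest_step}, offset {offset:+.3f} steps")

# The second and third onsets land a third of a step away from the grid, so the model
# would have to represent them with unusually large offsets, which it rarely sees in a
# 4/4 dataset quantized to 4 steps per beat.
```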



Hit accuracy + offset loss

Apart from this issue, which stems from the limitations and bias of our data, listening to most of the audio suggests that these models are a good sample of hyperparameter sets that could work well for our task. To reduce them to a smaller set, we sorted them by lowest test_offset_loss and took the top four runs, ensuring that the offset variability is reasonably acceptable:

[Panel: top runs by test_hit_accuracy + lowest test_offset_loss]
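This two-stage selection can also be reproduced programmatically with the W&B API. The sketch below assumes `<entity>` and `<project>` placeholders and that test_hit_accuracy and test_offset_loss are present in each run's summary:

```python
import wandb

api = wandb.Api()
sweep = api.sweep("<entity>/<project>/aa0yaae1")  # placeholder entity/project

# Keep only runs that logged both metrics.
runs = [r for r in sweep.runs
        if "test_hit_accuracy" in r.summary and "test_offset_loss" in r.summary]

# Stage 1: top 12 runs by test hit accuracy.
top12 = sorted(runs, key=lambda r: r.summary["test_hit_accuracy"], reverse=True)[:12]

# Stage 2: of those, the 4 runs with the lowest test offset loss.
selected = sorted(top12, key=lambda r: r.summary["test_offset_loss"])[:4]

for r in selected:
    print(r.name, r.summary["test_hit_accuracy"], r.summary["test_offset_loss"])
```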


We will re-train these 4 models with additional tracking, both to observe the learning process and to select the epoch at which each model is at its sweet spot: good enough at tap2drum while still being able to generalize (underfitting vs. overfitting).

Best epoch

A good way to estimate how long each of these experiments should be trained for is to look at the loss curves, comparing those from the train set with those from the test set. As the training loss goes down during training, the test loss also goes down, but if the model is trained for long enough, at some point the test loss starts to go up again. This happens because the model learns to reproduce the examples from the train set so well that it can no longer generalize to examples it has not seen before. The ideal epoch would be where the distance between the test loss and the training loss is minimal, i.e. closest to 0 in the graph shown below on the right. In the graphs below, an exponential moving average has been applied to reduce noise and make the trends easier to see.

[Panel: selected runs (loss curves)]

Looking at this graph, we get an idea of how long to train the model for each set of hyperparameters, although once the models are retrained the best epoch should be recalculated for the final selection and evaluation, since the weights are initialized randomly for each training run and the learning process may differ.
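One simple way to automate this choice, assuming the per-epoch train and test losses for a run are available as arrays, is to smooth the test curve with the same kind of exponential moving average and take the epoch where it bottoms out; the train/test gap and which checkpoints were actually saved still need to be checked by hand. The curves below are dummy data, just to make the sketch runnable:

```python
import numpy as np

def ema(values, alpha=0.1):
    """Exponential moving average, similar to the smoothing applied in the plots above."""
    smoothed, avg = [], values[0]
    for v in values:
        avg = alpha * v + (1 - alpha) * avg
        smoothed.append(avg)
    return np.array(smoothed)

# Dummy curves; in practice these would be the per-epoch losses logged to W&B
# for one of the selected runs.
train_loss = np.linspace(1.0, 0.2, 100)
test_loss = np.concatenate([np.linspace(1.0, 0.5, 60), np.linspace(0.5, 0.8, 40)])

smoothed_test = ema(test_loss)
best_epoch = int(np.argmin(smoothed_test))  # epoch just before the test loss starts rising
print("candidate sweet-spot epoch:", best_epoch)
```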

Re-trained models

These are the four retrained models:

[Run set: the four re-trained models]

Looking at the plot shown above, we can select the epochs at which we will evaluate our models. Taking into account the distance between the test loss and the training loss, the checkpoints that were logged (models were not saved at every single epoch), and the test_loss trend, the models selected for evaluation are: