Training FastSpeech2
Created on November 28|Last edited on November 29
I took the code from the seminar and the code from the repo used there. I implemented attention in the spirit of Karpathy's MinGPT and switched the transformer blocks to pre-norm. I also replaced all the bash scripts with Python code that automatically downloads and parses the training data, downloads the trained WaveGlow model, etc.
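Roughly, a pre-norm block looks like the sketch below. This is a minimal PyTorch illustration, not the exact module from my code; the dimensions, head count and dropout are placeholders.

```python
import torch
import torch.nn as nn


class PreNormFFTBlock(nn.Module):
    """Minimal pre-norm transformer block: LayerNorm before each sub-layer,
    residual connection after it (as opposed to post-norm in the original repo)."""

    def __init__(self, d_model: int = 256, n_heads: int = 2, d_ff: int = 1024, dropout: float = 0.1):
        super().__init__()
        self.ln_attn = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ln_ff = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout),
        )

    def forward(self, x, key_padding_mask=None):
        # Normalize before attention, add the residual after
        h = self.ln_attn(x)
        attn_out, _ = self.attn(h, h, h, key_padding_mask=key_padding_mask)
        x = x + attn_out
        # Same pattern for the feed-forward sub-layer
        x = x + self.ff(self.ln_ff(x))
        return x
```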
To implement the pitch and energy adaptors from FastSpeech2, I extract the pitch (f0) from the audio with pyworld and compute the energy as the norm of the STFT frames.
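A sketch of how that extraction can look; the hop size, FFT size and pyworld settings here are placeholders, not necessarily the ones I used:

```python
import numpy as np
import pyworld as pw
import torch


def extract_pitch_and_energy(wav: np.ndarray, sr: int, hop_length: int = 256, n_fft: int = 1024):
    """Pitch (f0) per frame from pyworld, energy as the L2 norm of each STFT frame."""
    wav64 = wav.astype(np.float64)  # pyworld expects float64 audio
    f0, t = pw.dio(wav64, sr, frame_period=hop_length / sr * 1000)
    f0 = pw.stonemask(wav64, f0, t, sr)  # refine the raw f0 track

    spec = torch.stft(
        torch.from_numpy(wav).float(),
        n_fft=n_fft,
        hop_length=hop_length,
        window=torch.hann_window(n_fft),
        return_complex=True,
    )
    energy = spec.abs().norm(dim=0).numpy()  # (frames,)
    return f0, energy
```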
To predict the pitch and the energy I use the same module architecture as for duration prediction. The values are quantized into 256 bins, logarithmically spaced between the min and max statistics computed while loading the dataset into memory. The quantized values index an embedding table, and the resulting embeddings are added to the length-regulated encoder output.
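Something along these lines (a sketch, not my exact module; `v_min`/`v_max` stand for the dataset statistics, and I'm assuming the minimum is clamped above zero so the log is defined):

```python
import math

import torch
import torch.nn as nn


class VarianceEmbedding(nn.Module):
    """Quantize pitch/energy into 256 log-spaced bins and look up an embedding."""

    def __init__(self, v_min: float, v_max: float, d_model: int = 256, n_bins: int = 256):
        super().__init__()
        # Log-spaced bin boundaries between the dataset min and max (v_min must be > 0)
        boundaries = torch.exp(torch.linspace(math.log(v_min), math.log(v_max), n_bins - 1))
        self.register_buffer("boundaries", boundaries)
        self.embedding = nn.Embedding(n_bins, d_model)

    def forward(self, values: torch.Tensor) -> torch.Tensor:
        # values: (batch, time) — ground truth at train time, predictor output at inference
        idx = torch.bucketize(values, self.boundaries)
        return self.embedding(idx)  # (batch, time, d_model), added to the length-regulated output
```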
I train for 15 hours with the OneCycleLR scheduler and 4000 steps of warm-up.
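In PyTorch terms the schedule looks roughly like the snippet below; the total step count, max LR and optimizer are placeholders (and `model` stands for the FastSpeech2 model), only the 4000-step warm-up matches the run:

```python
import torch
from torch.optim.lr_scheduler import OneCycleLR

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # `model` assumed to exist

total_steps = 80_000   # placeholder
warmup_steps = 4_000
scheduler = OneCycleLR(
    optimizer,
    max_lr=1e-3,       # placeholder
    total_steps=total_steps,
    pct_start=warmup_steps / total_steps,  # fraction of steps spent increasing the LR
    anneal_strategy="cos",
)
```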
The predictions during training start sounding the same as the original spectrograms, but when I try to do inference on custom text it sounds abominable. An hour and a half before the deadline I discover that I'm not using the energy embeddings at all: the pitch embeddings were used instead. The energy predictor wasn't used either; the pitch predictor took its place. In hindsight the mistake was obvious. Even as it was happening, it was clear to me that there was a trade-off between the energy and the pitch predictions: you can see the pitch loss improving while the energy loss gets much worse.
Just trying to imagine what sort of optimization hell my model ended up in after training for 15 hours to predict both energy and pitch at the same time, and to cram both pitch and energy information into one set of embeddings, brings me to tears.
I fixed that mistake and changed the loss computation for pitch and energy. I also added a prediction mix-in to the variance adaptor, so that during training, for 5% of the samples, the spectrogram is predicted using the predicted pitch or energy values instead of the ground-truth ones. After that I "fine-tune" the checkpoint from the previous run for a bit. It gets somewhat better.
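The mix-in itself can be as simple as this sketch (a per-sample coin flip between the ground truth and the detached prediction; names are illustrative):

```python
import torch


def mix_in(target: torch.Tensor, predicted: torch.Tensor, p: float = 0.05) -> torch.Tensor:
    """For a fraction p of the samples in the batch, replace the ground-truth
    pitch/energy with the model's own (detached) prediction before embedding."""
    use_pred = torch.rand(target.size(0), device=target.device) < p   # (batch,)
    use_pred = use_pred.view(-1, *([1] * (target.dim() - 1)))         # broadcast over time
    return torch.where(use_pred, predicted, target)
```

So at train time the embedding lookup would see something like `mix_in(pitch_target, pitch_pred.detach())` instead of the raw target.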
I don't know what's happening with the duration loss there; maybe the learning rate is too high.