
Homework report

hw#4 p.2, Skoltech DL course

Instead of an introduction

I love and hate this task at the same time. It was very attractive, very big, and very unusual: I had never worked with sound before. However, mostly due to technical issues, it took almost two weeks, one of which was holidays.

Bugs

To start with, instead of writing my own "buggy" code from scratch, I decided to wrap my experiments into a PyTorch LightningModule. I assumed it would let me avoid unwanted issues with training and validation loops, device mismatches, and so on. And what do you think happened? I wrote two training substeps and called them in the main training_step, but forgot to return the loss from training_step itself. I spent three days looking for this error, sure that the nan in the progress bar was a PyTorch Lightning bug, because I had checked the losses computed in the substeps. I was totally wrong. However, while reading through the code five times, I fixed some other minor errors along the way, e.g. I changed the addition of duration/2 to its subtraction. There were many other bugs that I no longer remember, but the last one was the following.
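For the record, here is a minimal sketch of that bug (the module and substep names are illustrative, not the actual homework code): each substep computes a perfectly valid loss, but without the final return, Lightning receives None from training_step and skips optimization, which is why the progress bar showed nan while the substep losses looked fine.

```python
import torch
import pytorch_lightning as pl

class BuggyTTSModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(10, 1)

    def _spectrogram_substep(self, batch):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def _duration_substep(self, batch):
        x, y = batch
        return torch.nn.functional.l1_loss(self.layer(x), y)

    def training_step(self, batch, batch_idx):
        loss = self._spectrogram_substep(batch) + self._duration_substep(batch)
        self.log("train_loss", loss, prog_bar=True)
        return loss  # the bug: without this return, Lightning silently skips the batch

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)
```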
Then I got a CUDA device-side assert error during validation (you'll see that the training duration line is quite short because of this). I was smarter this time: after two unsuccessful debugging attempts, I switched the device to CPU and waited 30 minutes for a readable traceback. It turned out that during full inference (without ground-truth durations), the DurationModel predicts durations that sum to a longer interval than the longest one seen in the train/validation datasets. I fixed the issue by simply truncating the inputs to the Decoder (the mistake happened in this part, not in the AlignmentModel that the GPU traceback originally pointed to). I hope this won't lead to unexpected results. By the way, this is another reason to remove learnable positional embeddings from the homework: a learnable embedding table has a fixed maximum length, and positions beyond it trigger exactly this kind of indexing assert.
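The fix itself fits in a few lines. A minimal sketch, assuming the decoder input is a (batch, time, channels) tensor produced by length regulation; the helper name and shapes are my illustration, not the homework API:

```python
import torch

def clamp_decoder_input(expanded: torch.Tensor, max_len: int) -> torch.Tensor:
    """Truncate length-regulated features so that predicted durations can
    never address positions beyond the learnable positional embedding table."""
    if expanded.size(1) > max_len:
        expanded = expanded[:, :max_len, :]  # drop frames past the supported range
    return expanded
```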

Runs history

I had some technical problems: I couldn't sit in front of my laptop to keep Colab from disconnecting, so some runs simply died.
Many other runs were done before the return fix, so they were meaningless.
The first runs didn't train the DurationModel, but they still produced quite good audio. At one point I tried inserting an activation between the Transformer and the ResBlock, but it had no effect on performance.
The last runs, where I started training the DurationModel, showed that it's better to train it simultaneously with the main TTS part, which is what I did in the final run. I also tried replacing one dense layer with an MLP, but didn't get better results.
Also, in the final run I used neither gradient clipping nor the Noam policy and achieved better performance: the model kept improving (slowly) even after 30 epochs, whereas with clipping and Noam it plateaued after ~20 epochs. So I don't understand the strong advice to use these techniques and am disappointed that it was in the notebook.
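For context, the Noam policy is the learning-rate schedule from "Attention Is All You Need": linear warmup followed by inverse-square-root decay. A minimal sketch (the d_model and warmup defaults here are illustrative, not the notebook's values):

```python
def noam_lr(step: int, d_model: int = 256, warmup: int = 4000) -> float:
    """lr = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)."""
    step = max(step, 1)  # avoid division by zero on the very first step
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```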

[Run set panel: 4 runs]

I think the model's performance is not optimal and more experiments could be done, but I am happy with the quality of the obtained audio: it's a good result for me. Sorry, I did the final run using the unchanged notebook and forgot to pass the vocoder to train_tts. Some validation audios obtained with and without duration information are shown below.

[Run set panel: 4 runs]

I also logged some predicted spectrograms. The spectrograms with and without ground-truth durations look similar, but some small differences can be noticed, especially on long examples.
[Run set panel: 4 runs]


Hardest things

  1. Not missing any details in the notebook. Sometimes I just didn't read to the end and implemented things incorrectly.
  2. Being lucky enough to get a Colab accelerator that runs twice as fast. Otherwise I had to wait 3-4 hours for an experiment to finish (longer than one lesson at Skoltech).
  3. Not being lazy and writing proper code for logging and experimenting that accounts for all the details that may arise during training/inference. Otherwise I would have spent much more time rewriting technical stuff after the next such detail completely annoyed me.

Learned things

I improved my skills with PyTorch Lightning and wandb. I also learned that I should think in advance about which things are permanent and should be saved/hardcoded, and which things are temporary and genuinely need to be recomputed on every launch; that would save a lot of time. There are also things I didn't learn but realized I should definitely learn soon.
  1. I should improve checkpoint syncing with wandb (I had problems with it once and didn't even try this time).
  2. I should learn a tool for hyperparameter optimization, e.g. wandb sweeps (?); a minimal sketch follows this list.
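For my future self, here is roughly what a minimal wandb sweep looks like; the parameter names, ranges, and project name are hypothetical placeholders, not values from this homework:

```python
import wandb

sweep_config = {
    "method": "bayes",  # Bayesian search over the parameter space
    "metric": {"name": "val_loss", "goal": "minimize"},
    "parameters": {
        "lr": {"min": 1e-5, "max": 1e-3},
        "batch_size": {"values": [16, 32, 64]},
    },
}

def train():
    run = wandb.init()
    cfg = run.config
    # A real training loop would go here, reading cfg.lr and cfg.batch_size;
    # we just log a dummy metric so the sweep has something to optimize.
    run.log({"val_loss": cfg.lr * cfg.batch_size})
    run.finish()

sweep_id = wandb.sweep(sweep_config, project="tts-hw4")  # hypothetical project name
wandb.agent(sweep_id, function=train, count=10)
```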

Questions

I have more questions about the first part of the homework, because the things that arose there were more unexpected for me: I had already worked in NLP, yet things like the connection between initialization and the expected input were not so clear to me.
I would also like to take a look at the vocoder's details, because it's quite a heavy model that does magic, transforming spectrogram images into sound.

What would I like to add to this assignment?

I would be grateful to have some prewritten utils for training and logging: I would like to concentrate more on theory than on this sort of thing.