Report: HiFi-GAN
Created on January 27|Last edited on January 28
HiFi-GAN is a generative adversarial network for speech synthesis. HiFi-GAN consists of one generator and two discriminators: multi-scale and multi-period discriminators. The generator and discriminators are trained adversarially, along with two additional losses for improving training stability and model performance.
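The report does not spell out the loss terms, so the following is only a minimal sketch of how such objectives are commonly combined. It assumes the formulation of the original HiFi-GAN paper: a least-squares adversarial loss, plus a feature-matching loss and a mel-spectrogram loss as the two additional terms, weighted by 2 and 45. The function names and the plain-Python score lists are hypothetical stand-ins for discriminator outputs, not code from this project.

```python
def mean(values):
    """Arithmetic mean of an iterable of floats."""
    values = list(values)
    return sum(values) / len(values)


def discriminator_loss(real_scores, fake_scores):
    """Least-squares GAN loss: push real scores to 1 and fake scores to 0."""
    real_term = mean((r - 1.0) ** 2 for r in real_scores)
    fake_term = mean(f ** 2 for f in fake_scores)
    return real_term + fake_term


def generator_adversarial_loss(fake_scores):
    """Least-squares GAN loss for the generator: push fake scores to 1."""
    return mean((f - 1.0) ** 2 for f in fake_scores)


def generator_total_loss(fake_scores, feature_matching, mel_l1,
                         lambda_fm=2.0, lambda_mel=45.0):
    """Adversarial term plus the two weighted auxiliary terms.

    feature_matching and mel_l1 are assumed to be precomputed scalars;
    the default weights are the paper's values, not confirmed for this run.
    """
    adversarial = generator_adversarial_loss(fake_scores)
    return adversarial + lambda_fm * feature_matching + lambda_mel * mel_l1
```

In the paper's setup these terms are summed over every sub-discriminator of the multi-scale and multi-period discriminators; the sketch shows a single one for brevity.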
The generator is a fully convolutional neural network. It uses a mel-spectrogram as input and upsamples it through transposed convolutions until the length of the output sequence matches the temporal resolution of raw waveforms.
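To make the length bookkeeping concrete, the sketch below applies the standard output-length formula for a 1-D transposed convolution (dilation 1, no output padding) layer by layer. The upsampling rates (8, 8, 2, 2) and kernel sizes (16, 16, 4, 4) are assumptions taken from the original HiFi-GAN V1 configuration and are not confirmed for this report; the function names are hypothetical.

```python
def transposed_conv1d_length(length_in, kernel_size, stride, padding):
    """Output length of a 1-D transposed convolution (dilation 1, no output padding)."""
    return (length_in - 1) * stride - 2 * padding + kernel_size


def generator_output_length(n_frames,
                            upsample_rates=(8, 8, 2, 2),
                            kernel_sizes=(16, 16, 4, 4)):
    """Number of waveform samples produced from n_frames mel-spectrogram frames.

    The rates and kernel sizes are assumed defaults, not values from this report.
    """
    length = n_frames
    for stride, kernel in zip(upsample_rates, kernel_sizes):
        # This padding makes each layer multiply the length by exactly `stride`.
        padding = (kernel - stride) // 2
        length = transposed_conv1d_length(length, kernel, stride, padding)
    return length
```

With these assumed settings every mel frame expands to 8 * 8 * 2 * 2 = 256 samples, so 32 frames give 8192 samples. The product of the upsampling rates has to equal the hop size used to compute the mel-spectrogram for the output to line up with the raw waveform.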
Training details
The model was trained for 30 epochs (approximately 12 hours in a Kaggle kernel) with a batch size of 3. Each audio clip was truncated to 12031 samples by taking a random window from the recording. Both the generator and the discriminators were trained with AdamW at a learning rate of 2e-4.
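The random-window truncation can be sketched roughly as follows. Only the segment length of 12031 samples comes from the details above; the function name, the use of plain Python sequences, and the zero-padding of clips shorter than the window are assumptions made for illustration.

```python
import random

SEGMENT_LENGTH = 12031  # window size stated in the training details


def random_window(waveform, segment_length=SEGMENT_LENGTH, rng=random):
    """Return a random contiguous window of segment_length samples.

    Clips shorter than the window are zero-padded on the right; this
    handling of short clips is an assumption, not taken from the report.
    """
    if len(waveform) <= segment_length:
        return list(waveform) + [0.0] * (segment_length - len(waveform))
    # Pick a start index so the window fits entirely inside the clip.
    start = rng.randrange(len(waveform) - segment_length + 1)
    return list(waveform[start:start + segment_length])
```

Drawing a fresh window each time a clip is loaded means the model sees different parts of the same recording across epochs, while every batch keeps a fixed shape.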
Inference examples