Meta-TTS fine-tuning
Meta-TTS is a multi-speaker FastSpeech2 model meta-trained on the LibriTTS-100 dataset for quick adaptation to new speakers.
Fine-tuning dataset size: 1min vs 40min
Comparison of samples produced by models fine-tuned with 1 minute vs. 40 minutes of training data for the "Morgan angry" voice.
Sample transcript: "This is a Voicemod text-to-speech trial of a zero-shot sample."
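For reference, duration-limited subsets like these can be built by accumulating utterances until a time budget is met. A minimal Python sketch, where the utterance list and its durations are hypothetical placeholders rather than the actual data-prep code:

```python
def take_minutes(utterances, minutes):
    """Select (path, seconds) pairs until `minutes` of audio is accumulated."""
    budget = minutes * 60.0
    chosen = []
    for path, seconds in utterances:
        if budget <= 0:
            break
        chosen.append(path)
        budget -= seconds
    return chosen

# Hypothetical usage: morgan_angry_utts is a list of (wav_path, duration_s) pairs.
# subset_1min  = take_minutes(morgan_angry_utts, 1)
# subset_40min = take_minutes(morgan_angry_utts, 40)
```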
Speaker embedding initialization
There are several approaches to initializing the new speaker's embedding (see the sketch after this list):
- starting from an arbitrary pretrained speaker's embedding
- averaging the pretrained speaker embeddings
- using the shared speaker embedding from the meta-model
- random initialization
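A minimal PyTorch sketch of the four strategies; the table shape, the `shared` vector, and the assignment at the end are illustrative stand-ins, not the actual Meta-TTS API:

```python
import torch

d = 256                                  # embedding dimension (illustrative)
pretrained = torch.randn(247, d)         # stand-in for the pretrained speaker table
shared = torch.zeros(d)                  # stand-in for the meta-model's shared embedding

inits = {
    "arbitrary": pretrained[0].clone(),   # copy one pretrained speaker's embedding
    "average":   pretrained.mean(dim=0),  # centroid of all pretrained speakers
    "shared":    shared.clone(),          # single embedding learned at meta-training
    "random":    torch.randn(d) * 0.01,   # small random vector
}

# The chosen vector seeds the new speaker's row before fine-tuning, e.g.:
# model.speaker_emb.weight.data[new_speaker_id] = inits["average"]
```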
Different loudness
Warning: the loudness of the audio samples below varies significantly.
Different "moods"
Fine-tuning different layers
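A common way to run this comparison is to freeze everything except the chosen modules. A hedged PyTorch sketch, where the parameter-name prefixes (e.g. "speaker_emb", "decoder") are assumptions rather than the actual Meta-TTS attribute names:

```python
import torch.nn as nn

def set_trainable(model: nn.Module, prefixes: list[str]) -> None:
    """Freeze all parameters, then unfreeze those whose names start
    with one of the given prefixes."""
    for name, param in model.named_parameters():
        param.requires_grad = any(name.startswith(p) for p in prefixes)

# Hypothetical usage: fine-tune only the speaker embedding and the decoder,
# and give the optimizer just the trainable parameters.
# set_trainable(model, ["speaker_emb", "decoder"])
# optimizer = torch.optim.Adam(
#     (p for p in model.parameters() if p.requires_grad), lr=1e-4
# )
```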
LJSpeech results
These results were obtained with a fine-tuning dataset of ~2700 utterances (LJ002–LJ010).