
Meta-TTS fine-tuning

Meta-TTS is a multi-lingual FastSpeech2 model meta-trained on the LibriTTS-100 dataset for quick adaptation to new speakers.

Fine-tuning dataset size: 1min vs 40min

Comparison of samples produced by models fine-tuned on 1 minute vs. 40 minutes of training data for the Morgan angry voice.

Audio panel ("phrase1.wav"), sample transcript: "This is a voicemod T T Speach trial of zero sample"
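One straightforward way to set up this comparison is to accumulate utterances until a duration budget is reached. Below is a minimal sketch of that idea; the helper name and the list of (wav_path, duration) pairs are illustrative assumptions, not the exact pipeline used for these runs.

```python
import random


def select_subset(utterances, budget_seconds, seed=0):
    """Pick utterances until their total duration reaches the budget.

    `utterances` is a list of (wav_path, duration_seconds) tuples.
    Illustrative helper, not the exact selection procedure of this report.
    """
    rng = random.Random(seed)
    shuffled = utterances[:]
    rng.shuffle(shuffled)

    subset, total = [], 0.0
    for wav_path, duration in shuffled:
        if total >= budget_seconds:
            break
        subset.append(wav_path)
        total += duration
    return subset, total


# e.g. 1-minute vs. 40-minute fine-tuning sets for the "Morgan angry" voice:
# one_min, _ = select_subset(morgan_angry_utterances, budget_seconds=60)
# forty_min, _ = select_subset(morgan_angry_utterances, budget_seconds=40 * 60)
```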



Speaker embedding initialization

There are several approaches to speaker embedding initialization (see the sketch after this list):
    • starting from arbitrary pretrained speaker
    • averaging pretrained speaker embeddings
    • shared speaker embedding from meta-model
    • random init
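A minimal PyTorch sketch of these four strategies, assuming the pretrained checkpoint exposes its speaker table as an `nn.Embedding` and, for the meta-model case, a single shared embedding vector; the names and checkpoint layout are assumptions, not the actual Meta-TTS code.

```python
from typing import Optional

import torch
import torch.nn as nn


def init_new_speaker_embedding(
    pretrained_table: nn.Embedding,
    strategy: str = "average",
    speaker_id: int = 0,
    shared_embedding: Optional[torch.Tensor] = None,
) -> nn.Parameter:
    """Return an embedding vector for a new, unseen speaker."""
    if strategy == "pretrained_speaker":
        # start from an arbitrary pretrained speaker
        vec = pretrained_table.weight[speaker_id].detach().clone()
    elif strategy == "average":
        # average of all pretrained speaker embeddings
        vec = pretrained_table.weight.detach().mean(dim=0)
    elif strategy == "shared":
        # shared speaker embedding learned by the meta-model
        vec = shared_embedding.detach().clone()
    elif strategy == "random":
        vec = torch.randn(pretrained_table.embedding_dim) * 0.01
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    # this parameter is then fine-tuned on the new speaker's data
    return nn.Parameter(vec)
```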



Different loudness

Warning: the loudness of the displayed audio samples varies significantly.
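If the loudness differences make listening comparisons harder, samples can be level-matched before logging. A small sketch using simple RMS normalization with `numpy` and `soundfile`; the library choice and target level are assumptions, and the samples in this report were not normalized this way.

```python
import numpy as np
import soundfile as sf


def rms_normalize(path_in: str, path_out: str, target_dbfs: float = -23.0) -> None:
    """Scale a wav file to a target RMS level (in dBFS), clipping to avoid overflow."""
    audio, sr = sf.read(path_in)
    rms = np.sqrt(np.mean(audio ** 2)) + 1e-9
    target_rms = 10 ** (target_dbfs / 20.0)
    normalized = np.clip(audio * (target_rms / rms), -1.0, 1.0)
    sf.write(path_out, normalized, sr)
```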



Different "moods"




Fine-tuning different layers


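A common way to set up these runs is to freeze everything except the layers under study. Below is a minimal PyTorch sketch; the FastSpeech2-style submodule names (`speaker_emb`, `decoder`, etc.) are assumptions and may differ from the actual Meta-TTS module names.

```python
import torch
import torch.nn as nn


def freeze_except(model: nn.Module, trainable_prefixes) -> None:
    """Freeze all parameters except those whose name starts with one of the prefixes."""
    for name, param in model.named_parameters():
        param.requires_grad = any(name.startswith(p) for p in trainable_prefixes)


# e.g. fine-tune only the speaker embedding and the decoder:
# freeze_except(model, ("speaker_emb", "decoder"))
# optimizer = torch.optim.Adam(
#     (p for p in model.parameters() if p.requires_grad), lr=1e-4
# )
```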


LJSpeech results

These results were obtained with a fine-tuning dataset of ~2,700 utterances (chapters LJ002–LJ010).
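This subset can be reproduced by filtering LJSpeech's `metadata.csv` on the chapter prefix of each file ID. A sketch assuming the standard LJSpeech-1.1 layout (pipe-delimited `id|transcript|normalized_transcript` rows); it is illustrative, not the exact preprocessing used here.

```python
def lj_subset(metadata_path: str, chapters=range(2, 11)):
    """Yield (file_id, normalized_text) for utterances in chapters LJ002..LJ010."""
    wanted = {f"LJ{c:03d}" for c in chapters}  # {"LJ002", ..., "LJ010"}
    with open(metadata_path, encoding="utf-8") as f:
        for line in f:
            file_id, _, normalized = line.rstrip("\n").split("|")
            if file_id.split("-")[0] in wanted:
                yield file_id, normalized


# utterances = list(lj_subset("LJSpeech-1.1/metadata.csv"))  # ~2,700 items
```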
