
Meta-TTS fine-tuning

Meta-TTS is a multi-lingual FastSpeech2 model meta-trained on the LibriTTS-100 dataset for quick adaptation to new speakers.

Fine-tuning dataset size: 1min vs 40min

Comparison of samples produced by models fine-tuned on 1 minute vs. 40 minutes of training data for the Morgan angry voice.

Audio panel ("phrase1.wav"), sample transcript: "This is a voicemod T T Speach trial of zero sample"
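One straightforward way to set up this comparison is to accumulate utterances until a duration budget is reached. Below is a minimal sketch of that idea; the helper name and the list of (wav_path, duration) pairs are illustrative assumptions, not the exact pipeline used for these runs.

```python
import random


def select_subset(utterances, budget_seconds, seed=0):
    """Pick utterances until their total duration reaches the budget.

    `utterances` is a list of (wav_path, duration_seconds) tuples.
    Illustrative helper, not the exact selection procedure of this report.
    """
    rng = random.Random(seed)
    shuffled = utterances[:]
    rng.shuffle(shuffled)

    subset, total = [], 0.0
    for wav_path, duration in shuffled:
        if total >= budget_seconds:
            break
        subset.append(wav_path)
        total += duration
    return subset, total


# e.g. 1-minute vs. 40-minute fine-tuning sets for the "Morgan angry" voice:
# one_min, _ = select_subset(morgan_angry_utterances, budget_seconds=60)
# forty_min, _ = select_subset(morgan_angry_utterances, budget_seconds=40 * 60)
```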



Speaker embedding initialization

There are several approaches to speaker embedding initialization (see the sketch after this list):
    • starting from arbitrary pretrained speaker
    • averaging pretrained speaker embeddings
    • shared speaker embedding from meta-model
    • random init
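A minimal PyTorch sketch of these four strategies, assuming the pretrained checkpoint exposes its speaker table as an `nn.Embedding` and, for the meta-model case, a single shared embedding vector; the names and checkpoint layout are assumptions, not the actual Meta-TTS code.

```python
from typing import Optional

import torch
import torch.nn as nn


def init_new_speaker_embedding(
    pretrained_table: nn.Embedding,
    strategy: str = "average",
    speaker_id: int = 0,
    shared_embedding: Optional[torch.Tensor] = None,
) -> nn.Parameter:
    """Return an embedding vector for a new, unseen speaker."""
    if strategy == "pretrained_speaker":
        # start from an arbitrary pretrained speaker
        vec = pretrained_table.weight[speaker_id].detach().clone()
    elif strategy == "average":
        # average of all pretrained speaker embeddings
        vec = pretrained_table.weight.detach().mean(dim=0)
    elif strategy == "shared":
        # shared speaker embedding learned by the meta-model
        vec = shared_embedding.detach().clone()
    elif strategy == "random":
        vec = torch.randn(pretrained_table.embedding_dim) * 0.01
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    # this parameter is then fine-tuned on the new speaker's data
    return nn.Parameter(vec)
```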



Different loudness

Warning: the loudness of the displayed audio samples varies significantly.
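If the loudness differences make listening comparisons harder, samples can be level-matched before logging. A small sketch using simple RMS normalization with `numpy` and `soundfile`; the library choice and target level are assumptions, and the samples in this report were not normalized this way.

```python
import numpy as np
import soundfile as sf


def rms_normalize(path_in: str, path_out: str, target_dbfs: float = -23.0) -> None:
    """Scale a wav file to a target RMS level (in dBFS), clipping to avoid overflow."""
    audio, sr = sf.read(path_in)
    rms = np.sqrt(np.mean(audio ** 2)) + 1e-9
    target_rms = 10 ** (target_dbfs / 20.0)
    normalized = np.clip(audio * (target_rms / rms), -1.0, 1.0)
    sf.write(path_out, normalized, sr)
```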



Different "moods"




Fine-tuning different layers


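A common way to set up these runs is to freeze everything except the layers under study. Below is a minimal PyTorch sketch; the FastSpeech2-style submodule names (`speaker_emb`, `decoder`, etc.) are assumptions and may differ from the actual Meta-TTS module names.

```python
import torch
import torch.nn as nn


def freeze_except(model: nn.Module, trainable_prefixes) -> None:
    """Freeze all parameters except those whose name starts with one of the prefixes."""
    for name, param in model.named_parameters():
        param.requires_grad = any(name.startswith(p) for p in trainable_prefixes)


# e.g. fine-tune only the speaker embedding and the decoder:
# freeze_except(model, ("speaker_emb", "decoder"))
# optimizer = torch.optim.Adam(
#     (p for p in model.parameters() if p.requires_grad), lr=1e-4
# )
```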


LJSpeech results

These results were obtained with a fine-tuning dataset of ~2,700 utterances (chapters LJ002–LJ010).
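This subset can be reproduced by filtering LJSpeech's `metadata.csv` on the chapter prefix of each file ID. A sketch assuming the standard LJSpeech-1.1 layout (pipe-delimited `id|transcript|normalized_transcript` rows); it is illustrative, not the exact preprocessing used here.

```python
def lj_subset(metadata_path: str, chapters=range(2, 11)):
    """Yield (file_id, normalized_text) for utterances in chapters LJ002..LJ010."""
    wanted = {f"LJ{c:03d}" for c in chapters}  # {"LJ002", ..., "LJ010"}
    with open(metadata_path, encoding="utf-8") as f:
        for line in f:
            file_id, _, normalized = line.rstrip("\n").split("|")
            if file_id.split("-")[0] in wanted:
                yield file_id, normalized


# utterances = list(lj_subset("LJSpeech-1.1/metadata.csv"))  # ~2,700 items
```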
