Text2Speech
Training a model to convert automatically generated image captions into intelligible speech.
Tacotron2 🌮
💡 Check out Tacotron2's PyTorch Hub model page: https://pytorch.org/hub/nvidia_deeplearningexamples_tacotron2/
Tacotron2 is a seq2seq model, implemented here by NVIDIA, that converts sequences of characters into mel spectrograms: representations of sound tailored to the way humans perceive frequency spectra.

Tacotron2 architecture diagram
Our team needs to be able to retrain this model on new datasets in order to produce different speaker voices, or to fine-tune the model specifically for captions. We will make use of a pretrained neural network called WaveGlow to convert the mel spectrograms predicted by our Tacotron2 variant into intelligible speech.
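As a rough sketch of how these pieces fit together, the snippet below loads the pretrained Tacotron2 and WaveGlow checkpoints and runs them back to back on a caption, following the usage example on the PyTorch Hub page linked above. The input caption is illustrative; in our pipeline, the pretrained Tacotron2 checkpoint would be swapped for our fine-tuned variant.

```python
import torch

# Load the pretrained models from PyTorch Hub (a CUDA device is assumed for fp16)
tacotron2 = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_tacotron2', model_math='fp16')
tacotron2 = tacotron2.to('cuda').eval()

waveglow = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_waveglow', model_math='fp16')
waveglow = waveglow.remove_weightnorm(waveglow)
waveglow = waveglow.to('cuda').eval()

# Turn a caption (an illustrative example) into a padded character sequence
utils = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_tts_utils')
sequences, lengths = utils.prepare_input_sequence(["A dog catching a frisbee in the park."])

# Characters -> mel spectrogram -> waveform
with torch.no_grad():
    mel, _, _ = tacotron2.infer(sequences, lengths)
    audio = waveglow.infer(mel)

audio_numpy = audio[0].data.cpu().numpy()  # 22,050 Hz waveform
```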
We have updated the codebase to register datasets as Weights & Biases Artifacts and construct versioned train/validation splits of each dataset. We have additionally modified the code so that performance metrics and model predictions are logged to Weights & Biases so the whole team can see a clear picture of how each candidate model behaves.
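For reference, the Artifacts pattern looks roughly like the sketch below; the artifact, directory, and file names here are illustrative, not the ones used in our codebase.

```python
import wandb

# Register the raw dataset as a versioned Artifact
run = wandb.init(project="text2speech", job_type="register-dataset")
raw = wandb.Artifact("caption-audio-pairs", type="dataset")
raw.add_dir("data/raw")  # audio clips + caption transcripts
run.log_artifact(raw)
run.finish()

# Build a versioned train/validation split that records its lineage back to the raw dataset
run = wandb.init(project="text2speech", job_type="split-dataset")
run.use_artifact("caption-audio-pairs:latest")
split = wandb.Artifact("caption-audio-splits", type="split")
split.add_file("data/splits/train_filelist.txt")
split.add_file("data/splits/val_filelist.txt")
run.log_artifact(split)
run.finish()
```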
Training Metrics 📈
The metrics below are captured by adding wandb.log into the inner loop of the training process. Calling, for example, wandb.log({"accuracy": 0.9, "loss": 3.2}) will stream those metrics to Weights & Biases, where they are stored, organized, and visualized for you and your collaborators.
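A minimal version of that hook in the inner loop is sketched below; the metric names and values are placeholders rather than the exact keys logged by our Tacotron2 training script.

```python
import wandb

# Minimal sketch: stream metrics from the inner training loop to W&B.
# The loop body stands in for the real Tacotron2 training step.
wandb.init(project="text2speech", config={"learning_rate": 1e-3, "batch_size": 48})

for step in range(100):
    loss = 1.0 / (step + 1)   # placeholder for the real training loss
    grad_norm = 0.5           # placeholder for the real gradient norm
    wandb.log({"train/loss": loss, "train/grad_norm": grad_norm})

wandb.finish()
```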
In addition to the training curves, we can add custom visualizations using Weave Plot. Below, we use it to construct a box plot showing the distribution of loss values logged by each model and a bar chart showing the minimum loss value reached by each model.
System Metrics ⚙️
When tracking experiments with wandb, you can also see how your system is behaving over the course of a run. These metrics are automatically captured over time and can help you understand how to optimize your training process for your hardware based on the power draw, memory utilization, core utilization, and temperature.
Visualizing Predictions 🧐
The wandb.Table object allows you to log your datasets and model predictions, including rich media types, in a structured way. In this project, we are logging a dataframe that contains Tacotron2's predictions on the validation data. In each row of the dataframe, we log the training step, the sentence, the ground truth audio and spectrogram, and the spectrogram predicted by our model.
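A minimal sketch of how such a table can be built during validation is shown below; the column names, array shapes, and 22,050 Hz sample rate are illustrative stand-ins for the real validation outputs.

```python
import numpy as np
import wandb

run = wandb.init(project="text2speech", job_type="validate")

# One row per validation example: step, text, ground truth audio, and both spectrograms
table = wandb.Table(columns=["step", "sentence", "audio_gt", "spectrogram_gt", "spectrogram_pred"])

# Placeholder data standing in for real validation outputs
step = 1000
sentence = "A dog catching a frisbee in the park."
waveform = np.zeros(22050, dtype=np.float32)   # ground truth audio
mel_gt = np.random.rand(80, 400)               # ground truth mel spectrogram
mel_pred = np.random.rand(80, 400)             # predicted mel spectrogram

table.add_data(
    step,
    sentence,
    wandb.Audio(waveform, sample_rate=22050),
    wandb.Image(mel_gt),
    wandb.Image(mel_pred),
)

run.log({"validation_predictions": table})
run.finish()
```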
By using the group by feature of Tables, we can aggregate the spectrograms that are logged at each validation step for each example in the validation set, and get a view of how the spectrograms predicted by the model evolve over the course of training.
Inference 🎶
We can also visualize predictions that our model makes on unlabeled data, i.e. text with no available spoken recording. In the table below, click through the options in the audio column and listen to how our different models perform 🎤
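A table like that can be assembled with the same wandb.Table pattern, one row per model per caption; the model names, caption, and audio below are placeholders for the real inference outputs.

```python
import numpy as np
import wandb

run = wandb.init(project="text2speech", job_type="inference")

# Compare how different checkpoints read the same unlabeled caption
table = wandb.Table(columns=["model", "caption", "audio"])
caption = "A group of people riding bikes down a city street."

for model_name in ["tacotron2-baseline", "tacotron2-finetuned"]:
    audio = np.zeros(22050, dtype=np.float32)  # stand-in for running this model on the caption
    table.add_data(model_name, caption, wandb.Audio(audio, sample_rate=22050))

run.log({"inference_audio": table})
run.finish()
```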
End to End 🤝
The purpose of our text-to-speech model is to convert automatically generated captions into intelligible audio. We can pull predicted captions, run Tacotron2 inference on them, and then join the resulting audio to the original Table, which includes the source image, true caption, and attention plots. By including the inference results from multiple models and grouping by true caption, we can create a tabular view that shows all relevant information about an example and compares the predictions that our different text-to-speech models are making. Clicking through the Predicted Audio tab on the far right will cycle you through the audio predicted for the caption in that row of the table.
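One way to assemble such a view, sketched below with hypothetical artifact and table names, is to pull the captioning model's prediction table out of an artifact, synthesize audio for each predicted caption, and log a new table that carries both.

```python
import numpy as np
import wandb

run = wandb.init(project="text2speech", job_type="end-to-end")

# Pull the captioning model's prediction table from an artifact
# (artifact/table names and column layout are hypothetical)
captions_art = run.use_artifact("caption-predictions:latest")
captions = captions_art.get("predictions")  # a wandb.Table: image, true caption, predicted caption

joined = wandb.Table(columns=["image", "true_caption", "predicted_caption", "predicted_audio"])
for image, true_caption, predicted_caption in captions.data:
    # Stand-in for Tacotron2 + WaveGlow inference on predicted_caption
    audio = np.zeros(22050, dtype=np.float32)
    joined.add_data(image, true_caption, predicted_caption,
                    wandb.Audio(audio, sample_rate=22050))

run.log({"end_to_end": joined})
run.finish()
```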