Text2Speech
Training a model to convert automatically generated image captions into intelligible speech.
Tacotron2 🌮
💡 Check out Tacotron2's PyTorch Hub model page: https://pytorch.org/hub/nvidia_deeplearningexamples_tacotron2/
Tacotron2 is a seq2seq model, implemented here by NVIDIA, that converts sequences of characters into mel spectrograms: representations of sound tailored to the way humans perceive frequency spectra.

Tacotron2 architecture diagram
Our team needs to be able to retrain this model on new datasets in order to produce different speaker voices, or to fine-tune the model specifically for captions. We will make use of a pretrained neural network called WaveGlow to convert the mel spectrograms predicted by our Tacotron2 variant into intelligible speech.
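As a rough sketch of how these pieces fit together, the snippet below loads the pretrained Tacotron2 and WaveGlow checkpoints and runs them back to back on a caption, following the usage example on the PyTorch Hub page linked above. The input caption is illustrative; in our pipeline, the pretrained Tacotron2 checkpoint would be swapped for our fine-tuned variant.

```python
import torch

# Load the pretrained models from PyTorch Hub (a CUDA device is assumed for fp16)
tacotron2 = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_tacotron2', model_math='fp16')
tacotron2 = tacotron2.to('cuda').eval()

waveglow = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_waveglow', model_math='fp16')
waveglow = waveglow.remove_weightnorm(waveglow)
waveglow = waveglow.to('cuda').eval()

# Turn a caption (an illustrative example) into a padded character sequence
utils = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_tts_utils')
sequences, lengths = utils.prepare_input_sequence(["A dog catching a frisbee in the park."])

# Characters -> mel spectrogram -> waveform
with torch.no_grad():
    mel, _, _ = tacotron2.infer(sequences, lengths)
    audio = waveglow.infer(mel)

audio_numpy = audio[0].data.cpu().numpy()  # 22,050 Hz waveform
```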
We have updated the codebase to register datasets as Weights & Biases Artifacts and construct versioned train/validation splits of each dataset. We have additionally modified the code so that performance metrics and model predictions are logged to Weights & Biases so the whole team can see a clear picture of how each candidate model behaves.
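For reference, the Artifacts pattern looks roughly like the sketch below; the artifact, directory, and file names here are illustrative, not the ones used in our codebase.

```python
import wandb

# Register the raw dataset as a versioned Artifact
run = wandb.init(project="text2speech", job_type="register-dataset")
raw = wandb.Artifact("caption-audio-pairs", type="dataset")
raw.add_dir("data/raw")  # audio clips + caption transcripts
run.log_artifact(raw)
run.finish()

# Build a versioned train/validation split that records its lineage back to the raw dataset
run = wandb.init(project="text2speech", job_type="split-dataset")
run.use_artifact("caption-audio-pairs:latest")
split = wandb.Artifact("caption-audio-splits", type="split")
split.add_file("data/splits/train_filelist.txt")
split.add_file("data/splits/val_filelist.txt")
run.log_artifact(split)
run.finish()
```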
Training Metrics 📈
The metrics below are captured by adding wandb.log into the inner loop of the training process. Calling, for example, wandb.log({"accuracy": 0.9, "loss": 3.2}) will stream those metrics to Weights & Biases, where they are stored, organized, and visualized for you and your collaborators.
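A minimal version of that hook in the inner loop is sketched below; the metric names and values are placeholders rather than the exact keys logged by our Tacotron2 training script.

```python
import wandb

# Minimal sketch: stream metrics from the inner training loop to W&B.
# The loop body stands in for the real Tacotron2 training step.
wandb.init(project="text2speech", config={"learning_rate": 1e-3, "batch_size": 48})

for step in range(100):
    loss = 1.0 / (step + 1)   # placeholder for the real training loss
    grad_norm = 0.5           # placeholder for the real gradient norm
    wandb.log({"train/loss": loss, "train/grad_norm": grad_norm})

wandb.finish()
```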
In addition to the training curves, we can add custom visualizations using Weave Plot. Below, we use it to construct a box plot showing the distribution of loss values logged by each model and a bar chart showing the minimum loss value reached by each model.
System Metrics ⚙️
When tracking experiments with wandb, you can also see how your system is behaving over the course of a run. These metrics are automatically captured over time and can help you understand how to optimize your training process for your hardware based on the power draw, memory utilization, core utilization, and temperature.
Visualizing Predictions 🧐
The wandb.Table object allows you to log your datasets and model predictions, including rich media types, in a structured way. In this project, we are logging a dataframe that contains Tacotron2's predictions on the validation data. In each row of the dataframe, we log the training step, the sentence, the ground truth audio and spectrogram, and the spectrogram predicted by our model.
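A minimal sketch of how such a table can be built during validation is shown below; the column names, array shapes, and 22,050 Hz sample rate are illustrative stand-ins for the real validation outputs.

```python
import numpy as np
import wandb

run = wandb.init(project="text2speech", job_type="validate")

# One row per validation example: step, text, ground truth audio, and both spectrograms
table = wandb.Table(columns=["step", "sentence", "audio_gt", "spectrogram_gt", "spectrogram_pred"])

# Placeholder data standing in for real validation outputs
step = 1000
sentence = "A dog catching a frisbee in the park."
waveform = np.zeros(22050, dtype=np.float32)   # ground truth audio
mel_gt = np.random.rand(80, 400)               # ground truth mel spectrogram
mel_pred = np.random.rand(80, 400)             # predicted mel spectrogram

table.add_data(
    step,
    sentence,
    wandb.Audio(waveform, sample_rate=22050),
    wandb.Image(mel_gt),
    wandb.Image(mel_pred),
)

run.log({"validation_predictions": table})
run.finish()
```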
By using the group by feature of Tables, we can aggregate the spectrograms that are logged at each validation step for each example in the validation set, and get a view of how the spectrograms predicted by the model evolve over the course of training.
Inference 🎶
We can also visualize predictions that our model makes on unlabeled data, i.e. text with no available spoken recording. In the table below, click through the options in the audio column and listen to how our different models perform 🎤
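A table like that can be assembled with the same wandb.Table pattern, one row per model per caption; the model names, caption, and audio below are placeholders for the real inference outputs.

```python
import numpy as np
import wandb

run = wandb.init(project="text2speech", job_type="inference")

# Compare how different checkpoints read the same unlabeled caption
table = wandb.Table(columns=["model", "caption", "audio"])
caption = "A group of people riding bikes down a city street."

for model_name in ["tacotron2-baseline", "tacotron2-finetuned"]:
    audio = np.zeros(22050, dtype=np.float32)  # stand-in for running this model on the caption
    table.add_data(model_name, caption, wandb.Audio(audio, sample_rate=22050))

run.log({"inference_audio": table})
run.finish()
```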
End to End 🤝
The purpose of our text-to-speech model is to convert automatically generated captions into intelligible audio. We can pull predicted captions, run Tacotron2 inference on them, and then join the resulting audio to the original Table, which includes the source image, true caption, and attention plots. By including the inference results from multiple models and grouping by true caption, we can create a tabular view that shows all relevant information about an example and compares the predictions that our different text-to-speech models are making. Clicking through the Predicted Audio tab on the far right will cycle you through the audio predicted for the caption in that row of the table.
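One way to assemble such a view, sketched below with hypothetical artifact and table names, is to pull the captioning model's prediction table out of an artifact, synthesize audio for each predicted caption, and log a new table that carries both.

```python
import numpy as np
import wandb

run = wandb.init(project="text2speech", job_type="end-to-end")

# Pull the captioning model's prediction table from an artifact
# (artifact/table names and column layout are hypothetical)
captions_art = run.use_artifact("caption-predictions:latest")
captions = captions_art.get("predictions")  # a wandb.Table: image, true caption, predicted caption

joined = wandb.Table(columns=["image", "true_caption", "predicted_caption", "predicted_audio"])
for image, true_caption, predicted_caption in captions.data:
    # Stand-in for Tacotron2 + WaveGlow inference on predicted_caption
    audio = np.zeros(22050, dtype=np.float32)
    joined.add_data(image, true_caption, predicted_caption,
                    wandb.Audio(audio, sample_rate=22050))

run.log({"end_to_end": joined})
run.finish()
```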