
SpeechMatrix: Meta AI's New Dataset For Speech-To-Speech Translation

Meta AI has released SpeechMatrix, a new open-source dataset for building textless speech-to-speech translation models. Alongside it come numerous pretrained translation models, as well as the models used in its creation.
Many speech-to-speech translation models rely on intermediate transcription and text-to-text translation steps. Recently, Meta AI and others have been working to make truly textless speech-to-speech models a mainstream reality.
To make speech-to-speech model development easier, Meta AI is now releasing a new dataset (along with various models) called SpeechMatrix, which contains speech alignments across 136 language pairs, with an average of 1,537 hours of source speech in each translation direction, or roughly 418,000 hours of speech data in total (136 pairs × 2 directions × ~1,537 hours).


SpeechMatrix

SpeechMatrix was built from the VoxPopuli dataset, a collection of European Parliament event recordings gathered over a little more than a decade. Because it would be unreasonable for a human to go through it all and hand-pair clips across every language combination, the researchers needed an automated solution.
The researchers put together SpeechLASER, a set of models based on LASER. LASER learns how sentences (in text form) express their meaning and encodes that meaning into a language-agnostic embedding space. SpeechLASER does the same thing for speech, allowing speech pairing to be automated.
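To give a concrete sense of how embedding-based pairing works, here is a minimal NumPy sketch of margin-based similarity scoring in the style of LASER bitext mining, applied to speech embeddings. The array shapes, the k value, and the toy data are illustrative assumptions, not SpeechMatrix's actual mining configuration.

```python
import numpy as np

def normalize(x):
    # L2-normalize embeddings so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def margin_scores(src_emb, tgt_emb, k=4):
    """Margin-based (ratio) scoring in the style of LASER bitext mining.

    src_emb, tgt_emb: (n, d) and (m, d) arrays of language-agnostic speech
    embeddings (e.g. from a SpeechLASER-like encoder). Returns an (n, m)
    matrix of margin scores; higher means a more likely pairing.
    """
    src, tgt = normalize(src_emb), normalize(tgt_emb)
    sim = src @ tgt.T  # cosine similarity matrix
    # Average similarity to the k nearest neighbours in each direction.
    knn_src = np.sort(sim, axis=1)[:, -k:].mean(axis=1, keepdims=True)
    knn_tgt = np.sort(sim, axis=0)[-k:, :].mean(axis=0, keepdims=True)
    return sim / (0.5 * (knn_src + knn_tgt))

# Toy usage: pair each source segment with its best-scoring target segment.
rng = np.random.default_rng(0)
src, tgt = rng.normal(size=(5, 1024)), rng.normal(size=(8, 1024))
best = margin_scores(src, tgt).argmax(axis=1)
print(best)  # index of the most similar target segment for each source segment
```

The margin term (dividing by the average similarity of each segment's nearest neighbours) is what makes this more robust than raw cosine similarity: it penalizes "hub" segments that look similar to everything.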
Using SpeechMatrix, the researchers trained numerous bilingual textless speech-to-speech translation models. These pretrained models can be plugged into the accompanying vocoder models, built on mHuBERT speech units, to produce audible speech in the output language.
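That two-stage structure (source speech to discrete units, then units to a target-language waveform) can be sketched as follows. The classes below are stand-ins, not the released fairseq checkpoints; the unit vocabulary size and the 320-sample frame hop are illustrative assumptions, and the real loading and inference code lives in the SpeechMatrix GitHub repository.

```python
import torch

class SpeechToUnitModel:
    """Stand-in for a textless speech-to-unit translation model."""
    def translate(self, waveform: torch.Tensor) -> torch.Tensor:
        # A real model maps source speech to a sequence of discrete
        # target-language units (e.g. mHuBERT cluster IDs).
        return torch.randint(0, 1000, (waveform.shape[-1] // 320,))

class UnitVocoder:
    """Stand-in for a unit-based vocoder that renders units as audio."""
    def synthesize(self, units: torch.Tensor) -> torch.Tensor:
        # A real vocoder turns unit IDs back into a 16 kHz waveform.
        return torch.zeros(units.shape[0] * 320)

# Two-stage textless pipeline: speech -> units -> speech.
source_speech = torch.randn(16000 * 3)           # 3 s of 16 kHz source audio
units = SpeechToUnitModel().translate(source_speech)
target_speech = UnitVocoder().synthesize(units)  # waveform in the target language
print(units.shape, target_speech.shape)
```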
Everything to do with SpeechMatrix, including the speech dataset, the pretrained speech-to-speech models, and the various models and methods used in its creation, is open source and available for download from the GitHub repository.
The researchers hope that this work can help others develop textless speech-to-speech models for underserved languages, especially those which don't have much written content. If you would like all the details on SpeechMatrix, consider reading the full research paper.
If you'd like to try out a demo for speech-to-speech translation using models built with SpeechMatrix, check out the Gradio demo on Hugging Face.
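The Hugging Face demo is the quickest way to hear the models in action, but if you're curious how such a demo is typically wired up, here is a minimal Gradio sketch. The translate_speech function is a placeholder that would call a real SpeechMatrix pipeline; as written it simply echoes the input audio.

```python
import gradio as gr

def translate_speech(audio_path):
    # Placeholder: run a speech-to-speech pipeline (like the one sketched
    # above) and return the path to the translated audio file.
    return audio_path  # echoes the input until a real model is plugged in

demo = gr.Interface(
    fn=translate_speech,
    inputs=gr.Audio(type="filepath", label="Source speech"),
    outputs=gr.Audio(label="Translated speech"),
    title="Speech-to-speech translation (SpeechMatrix models)",
)

if __name__ == "__main__":
    demo.launch()
```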
Tags: ML News