Meta Releases New Open Source Multi-Modal Speech Translation Models
Meta, continuing its focus on large multimodal AI models, has released a new model that can translate and transcribe speech and text across nearly 100 languages!
Meta has just introduced a new model named SeamlessM4T. This technology is designed to translate and transcribe across text and speech, and it supports almost 100 languages across its various translation functions. SeamlessM4T covers automatic speech recognition, speech-to-text translation, speech-to-speech translation, text-to-text translation, and text-to-speech translation. Known for its commitment to open research, Meta has released the model and related tools under a Creative Commons license.
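To give a concrete sense of the task interface, here is a minimal inference sketch. It assumes the Hugging Face transformers integration of SeamlessM4T (the AutoProcessor and SeamlessM4TModel classes and the facebook/hf-seamless-m4t-medium checkpoint); the officially released seamless_communication package exposes a similar predict-style API, so treat the details below as one possible setup rather than the only way to run the model.
```python
# Minimal SeamlessM4T inference sketch via the Hugging Face transformers
# integration (assumed here; checkpoint name and language codes are illustrative).
import torch
from transformers import AutoProcessor, SeamlessM4TModel

processor = AutoProcessor.from_pretrained("facebook/hf-seamless-m4t-medium")
model = SeamlessM4TModel.from_pretrained("facebook/hf-seamless-m4t-medium")

# Text-to-text translation: English -> French.
text_inputs = processor(text="Machine translation is improving quickly.",
                        src_lang="eng", return_tensors="pt")
text_tokens = model.generate(**text_inputs, tgt_lang="fra", generate_speech=False)
print(processor.decode(text_tokens[0].tolist()[0], skip_special_tokens=True))

# Text-to-speech translation: the same input rendered as Spanish speech.
speech_out = model.generate(**text_inputs, tgt_lang="spa")[0]
waveform = speech_out.cpu().numpy().squeeze()  # 16 kHz mono audio array
```
The same model object handles the other tasks as well: feeding audio features from the processor instead of text covers speech-to-text and speech-to-speech translation.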
So Many Languages...
SeamlessM4T was developed to overcome the limitations of existing translation systems, which typically cover only a fraction of the world's languages. Its design integrates a series of sequential components to handle the various translation tasks.
Model Architecture
The centerpiece of the SeamlessM4T approach is the multitask UnitY model architecture, which directly generates translated text and speech. It extends the vanilla UnitY model to cover automatic speech recognition as well as text-to-text, text-to-speech, speech-to-text, and speech-to-speech translation. The architecture consists of three main sequential components: text and speech encoders that recognize input in nearly 100 languages, a text decoder that renders the meaning into text in nearly 100 languages, and a text-to-unit model that decodes the output into discrete acoustic units for 36 speech languages. The decoded units are then converted into speech using a multilingual HiFi-GAN unit vocoder.
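To make the component layout concrete, the sketch below wires the three stages together as plain PyTorch modules. All class and parameter names are hypothetical placeholders for illustration, not the actual SeamlessM4T implementation.
```python
# Schematic of the multitask UnitY data flow (illustrative placeholder modules,
# not the real SeamlessM4T code).
import torch
import torch.nn as nn


class UnitYSketch(nn.Module):
    def __init__(self, speech_encoder, text_encoder, text_decoder, t2u_model, unit_vocoder):
        super().__init__()
        self.speech_encoder = speech_encoder  # e.g. w2v-BERT 2.0 speech encoder
        self.text_encoder = text_encoder      # e.g. NLLB-style text encoder
        self.text_decoder = text_decoder      # shared multilingual text decoder
        self.t2u_model = t2u_model            # text-to-unit encoder-decoder
        self.unit_vocoder = unit_vocoder      # multilingual HiFi-GAN unit vocoder

    def forward(self, audio=None, text=None, tgt_lang="fra"):
        # 1) Encode either modality into a shared representation.
        encoded = self.speech_encoder(audio) if audio is not None else self.text_encoder(text)
        # 2) First pass: decode translated text in the target language.
        text_tokens = self.text_decoder(encoded, tgt_lang=tgt_lang)
        # 3) Second pass: map text tokens to discrete acoustic units,
        #    then vocode the units into a waveform.
        units = self.t2u_model(text_tokens)
        waveform = self.unit_vocoder(units, tgt_lang=tgt_lang)
        return text_tokens, waveform
```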

Speech is processed by a self-supervised speech encoder, w2v-BERT 2.0, an improved version of w2v-BERT that learns to find structure and meaning in speech by analyzing millions of hours of multilingual audio. The encoder breaks the audio signal into smaller parts and builds an internal representation of what is being said, with a length adaptor mapping the long sequence of speech frames to a roughly word-level representation. Text, in turn, is processed by an encoder based on the NLLB model, trained to understand text in nearly 100 languages and produce representations useful for translation.
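The length adaptor exists because encoded speech sequences are far longer than the corresponding text. A common way to bridge that gap is a small stack of strided convolutions that downsamples the speech representation; the sketch below shows the idea with made-up dimensions and is not the exact adaptor used in SeamlessM4T.
```python
# Illustrative length adaptor: strided 1D convolutions that shrink a long
# sequence of speech-encoder states toward text-like lengths.
# Dimensions and the downsampling factor are arbitrary for the example.
import torch
import torch.nn as nn


class LengthAdaptorSketch(nn.Module):
    def __init__(self, dim=1024, num_layers=2, stride=2):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Conv1d(dim, dim, kernel_size=3, stride=stride, padding=1)
            for _ in range(num_layers)
        )

    def forward(self, x):  # x: (batch, time, dim) speech-encoder output
        x = x.transpose(1, 2)        # -> (batch, dim, time) for Conv1d
        for conv in self.layers:
            x = torch.relu(conv(x))  # each layer roughly halves the length
        return x.transpose(1, 2)     # back to (batch, time', dim)


states = torch.randn(1, 400, 1024)          # ~4 s of speech frames (illustrative)
print(LengthAdaptorSketch()(states).shape)  # torch.Size([1, 100, 1024])
```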
Producing text and speech happens in two passes. The text decoder takes the encoded speech or text representation and handles tasks like automatic speech recognition and multilingual text translation. The text-to-unit (T2U) component of UnitY then generates discrete speech units from that text output, and a multilingual HiFi-GAN unit vocoder converts the units into audio waveforms.
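The final step of turning discrete units back into audio can be pictured as an embedding lookup followed by transposed-convolution upsampling, which is the general shape of a HiFi-GAN-style unit vocoder. The sketch below is a toy version of that idea with arbitrary sizes and no GAN training, not Meta's actual vocoder.
```python
# Toy unit vocoder: discrete acoustic unit IDs -> embeddings -> upsampled
# waveform, mirroring the general structure of a HiFi-GAN unit vocoder
# (illustrative only; sizes and layers are arbitrary).
import torch
import torch.nn as nn


class UnitVocoderSketch(nn.Module):
    def __init__(self, num_units=10000, dim=256):
        super().__init__()
        self.unit_embed = nn.Embedding(num_units, dim)
        # Each ConvTranspose1d doubles the temporal resolution; stacking them
        # upsamples unit-rate features toward the audio sample rate.
        self.upsample = nn.Sequential(
            nn.ConvTranspose1d(dim, dim // 2, kernel_size=4, stride=2, padding=1),
            nn.LeakyReLU(0.1),
            nn.ConvTranspose1d(dim // 2, dim // 4, kernel_size=4, stride=2, padding=1),
            nn.LeakyReLU(0.1),
            nn.Conv1d(dim // 4, 1, kernel_size=7, padding=3),
            nn.Tanh(),  # waveform samples in [-1, 1]
        )

    def forward(self, units):               # units: (batch, sequence_length)
        x = self.unit_embed(units)           # (batch, T, dim)
        x = x.transpose(1, 2)                # (batch, dim, T)
        return self.upsample(x).squeeze(1)   # (batch, T * 4) toy waveform


units = torch.randint(0, 10000, (1, 50))
print(UnitVocoderSketch()(units).shape)  # torch.Size([1, 200])
```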
Scaling
Data scaling plays a significant role in this architecture, as data-driven models like SeamlessM4T benefit from large amounts of high-quality end-to-end data. To tackle speech translation for 100 languages, the researchers built upon pioneering work on text-to-text mining using a similarity measure in a joint embedding space, along with initial work in speech mining. They created a new massively multilingual and multimodal text embedding space named SONAR, which substantially outperforms existing approaches. Mining was performed on data from publicly available repositories; in total, they were able to automatically align more than 443,000 hours of speech with texts and create about 29,000 hours of speech-to-speech alignments.
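Parallel data mining of this kind boils down to embedding speech and text in the same space and keeping pairs whose similarity stands out from the background. The snippet below sketches a margin-based cosine scoring of that sort; the embeddings, threshold, and function names are hypothetical stand-ins for a SONAR-like pipeline rather than Meta's actual mining code.
```python
# Sketch of margin-based mining in a joint speech/text embedding space
# (hypothetical embeddings and threshold; not the actual SONAR pipeline).
import torch
import torch.nn.functional as F


def mine_pairs(speech_emb, text_emb, k=4, threshold=1.06):
    """Return (speech_idx, text_idx) pairs whose margin score exceeds threshold."""
    s = F.normalize(speech_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    sim = s @ t.T                                 # cosine similarities
    # Average similarity to each item's k nearest neighbours ("background").
    bg_s = sim.topk(k, dim=1).values.mean(dim=1, keepdim=True)
    bg_t = sim.topk(k, dim=0).values.mean(dim=0, keepdim=True)
    margin = sim / ((bg_s + bg_t) / 2)            # ratio-margin score
    best_text = margin.argmax(dim=1)              # best candidate per speech segment
    keep = margin.max(dim=1).values > threshold
    return [(i, best_text[i].item()) for i in torch.nonzero(keep).flatten().tolist()]


# Toy usage with random "embeddings" standing in for SONAR outputs.
pairs = mine_pairs(torch.randn(100, 512), torch.randn(200, 512))
```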
Results
SeamlessM4T achieves state-of-the-art results across many languages, and its multitask support yields substantial improvements, especially for low- and mid-resource languages. Its robustness to background noise and speaker variation in speech-to-text tasks marks a noticeable advance over current systems. Meta's approach of merging so many modalities in a single SeamlessM4T model is intriguing, and the open release of these models is a great step forward, allowing researchers to build on this work without bearing the expensive compute costs of training such systems from scratch. By enabling widespread access to these tools, Meta is contributing to a future where technological advancements are more accessible and potentially transformative on a global scale.