MetaVoice-1B: A New Multilingual Voice Cloning Model

A new lightweight model for multilingual voice cloning
Created on February 7 | Last edited on February 7
MetaVoice has released MetaVoice-1B, its latest text-to-speech (TTS) model. The model has 1.2 billion parameters and was trained on 100,000 hours of speech data. A standout feature is its support for voice cloning across languages, with an emphasis on fine-tuning. According to MetaVoice, the model reproduces emotional speech rhythm and tone in English and produces natural, realistic output without generating unintended or irrelevant content, commonly referred to as "hallucinations."

The Architecture

The model generates detailed speech representations using a combination of causal and non-causal transformers, then converts these representations into an audible waveform through a multi-band diffusion process. Background noise introduced by the diffusion step is removed with DeepFilterNet, resulting in high-quality, realistic speech that can mimic a specific voice from minimal input data.
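
To make the order of operations concrete, the stages described above can be sketched as a chain of functions. This is only an illustration of the data flow, not MetaVoice's code: every function and label below is a hypothetical placeholder, no real model is called, and running the causal transformer before the non-causal one is an assumption the article does not spell out.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Artifact:
    """Label for the current representation plus the stages applied so far."""
    kind: str
    history: List[str] = field(default_factory=list)


def _stage(name: str, produces: str) -> Callable[[Artifact], Artifact]:
    """Build a placeholder stage that only records that it ran."""
    def run(artifact: Artifact) -> Artifact:
        return Artifact(kind=produces, history=artifact.history + [name])
    return run


# Hypothetical stage names mirroring the article's description.
# Assumption: the causal transformer runs before the non-causal one.
causal_transformer = _stage("causal_transformer", "partial_speech_representation")
non_causal_transformer = _stage("non_causal_transformer", "speech_representation")
multi_band_diffusion = _stage("multi_band_diffusion", "raw_waveform")
deepfilternet_cleanup = _stage("deepfilternet_cleanup", "clean_waveform")

PIPELINE = [
    causal_transformer,
    non_causal_transformer,
    multi_band_diffusion,
    deepfilternet_cleanup,
]


def synthesize(text: str) -> Artifact:
    """Pass a text prompt through every placeholder stage in order."""
    artifact = Artifact(kind=f"text:{text}")
    for stage in PIPELINE:
        artifact = stage(artifact)
    return artifact
```

Calling `synthesize("hello").history` lists the four stage names in the order they ran, which is a quick way to see that noise cleanup comes last, after the waveform already exists.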

Zero-shot Cloning

The model supports zero-shot cloning of American and British voices, requiring only a 30-second reference audio sample and no additional training to replicate a voice. This makes both short- and long-form audio generation more accessible for a range of applications. For cross-lingual voice cloning, MetaVoice reports success with minimal training data for Indian speakers.
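
As a practical illustration of the 30-second figure, a reference clip's length can be checked before it is used for cloning. The sketch below relies only on Python's standard `wave` module and assumes an uncompressed PCM WAV file. The helper names and the threshold constant are hypothetical, and the article does not say whether MetaVoice itself enforces a minimum length.

```python
import io
import wave

# Figure quoted in the article; treating it as a hard minimum is an assumption.
MIN_REFERENCE_SECONDS = 30.0


def clip_duration_seconds(wav_file) -> float:
    """Return the duration in seconds of a PCM WAV file (path or file-like object)."""
    with wave.open(wav_file, "rb") as reader:
        return reader.getnframes() / reader.getframerate()


def is_long_enough(wav_file, minimum: float = MIN_REFERENCE_SECONDS) -> bool:
    """Check whether a reference clip meets the assumed minimum length."""
    return clip_duration_seconds(wav_file) >= minimum


def make_silent_wav(seconds: float, sample_rate: int = 16000) -> io.BytesIO:
    """Build an in-memory mono 16-bit silent WAV, used only to exercise the check."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as writer:
        writer.setnchannels(1)
        writer.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        writer.setframerate(sample_rate)
        writer.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    buffer.seek(0)
    return buffer
```

With a real recording, `is_long_enough("reference.wav")` would be the only call needed; the file name here is a placeholder.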

Open Source

MetaVoice-1B is released under the Apache 2.0 license, which lets developers and businesses use, modify, and distribute the model freely, including commercially, provided they keep the license and attribution notices. This is particularly significant for those looking to integrate sophisticated voice technology into their services or products, since it gives them a tool for improving user interaction and engagement through high-quality speech synthesis.

Risks

The advent of sophisticated text-to-speech technologies like MetaVoice-1B, capable of realistic voice cloning, brings with it the risk of increased spam calls and voice phishing attacks. This technology's ability to replicate human voices with minimal reference audio can be exploited for malicious purposes, posing challenges for telecommunications security. As such, there's a pressing need for enhanced security measures, regulatory oversight, and public awareness to mitigate potential misuse, ensuring that advancements in voice synthesis serve to benefit society rather than compromise personal privacy and security.


Tags: ML News