From Meatzza Extravaganza to 100 million phone calls: How Rime built the world's most realistic text-to-speech model with W&B

"For every model that we’ve ever trained, W&B has been a part of that process. I would say W&B is a crucial part of the experiment paradigm."
Lilly Clifford
Founder and CEO

While Lilly Clifford was a PhD student in Stanford’s linguistics department, she had a simple but powerful motivation: “When I call a major telecom company’s customer service line, it shouldn’t suck.” The more she learned about the state of enterprise voice AI, the more she realized how limited the status quo was, inspiring her to co-found Rime.

Rime recently released Arcana, the world’s most realistic spoken language model, now powering over 100 million phone calls per month at some of the world’s biggest enterprises. For someone as passionate about linguistics and human speech as Clifford is, tackling the unique challenges of text-to-speech AI models has been a perfect marriage of theoretical knowledge and practical problem-solving, as well as a fascinating exploration of what makes human speech truly human.


“I learned very quickly that it doesn’t matter how lifelike the model sounds, or how fast it is – if it can’t pronounce Meatzza Extravaganza, it’s not going to work,” said Clifford. “The previous status quo of customers picking one North American English voice they hate the least and putting that into production, I find that very offensive. When a customer calls, they should hear a voice they respond to. That’s what Arcana is.”

Read on to learn more about the process and challenges of building the world’s best multimodal, autoregressive text-to-speech (TTS) model, and how Rime leaned on Weights & Biases to help them through those challenges.

The technical challenges of training TTS models and building Arcana

“Training text-to-speech models is not for the faint of heart,” said Clifford in what may be an understatement. “The interesting thing about speech is just how dense it is; in a given second of audio, there may be 100 tokens, and you’re running autoregressive decoding to predict those 100 tokens. The amount of things that can go wrong when you’re decoding that many tokens is just a lot. So that really expands the problem space, from an architectural standpoint, a training standpoint, and a product standpoint.”
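To make that density concrete, here is a minimal sketch of what autoregressive decoding at roughly 100 audio tokens per second implies; the model, prompt, and token counts are hypothetical stand-ins, not Rime’s actual stack:

```python
import torch

TOKENS_PER_SECOND = 100  # illustrative density from the quote above
SECONDS_OF_AUDIO = 10

def decode_autoregressively(model, prompt_ids: torch.Tensor) -> torch.Tensor:
    """Greedy autoregressive decoding: every new audio token is conditioned
    on all previously generated tokens, so one bad prediction can cascade
    through the rest of the utterance."""
    generated = prompt_ids
    for _ in range(TOKENS_PER_SECOND * SECONDS_OF_AUDIO):
        logits = model(generated)                    # (1, seq_len, vocab)
        next_token = logits[:, -1].argmax(-1, keepdim=True)
        generated = torch.cat([generated, next_token], dim=-1)
    return generated  # 1,000 sequential forward passes for 10s of audio
```

A thousand sequential decoding steps for ten seconds of speech is why the problem space is so large: architecture, training, and product decisions all have to keep that loop fast and stable.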

Building a TTS model for enterprise customers and use cases created a whole new set of requirements and edge cases. Those customers need models that can handle:

  • Domain-specific pronunciations (the aforementioned Meatzza Extravaganza being the perfect example)
  • Natural prosody for structured data, such as reading phone numbers with human-like rhythm
  • Last-mile customization to allow non-linguists to solve pronunciation edge cases
  • Multilingual code-switching to seamlessly switch between languages mid-sentence if necessary
  • And of course, cost-effective deployment to run on customers’ existing GPU infrastructure

Rime’s previous leading model, Mist v2, remains the fastest and most customizable TTS model for high-volume business applications, but it was non-autoregressive and wasn’t well suited to modeling the paralinguistic side of human speech: natural, realistic mouth sounds and other subtle cues. When it came to building Arcana, Rime wanted two products in the market with voice parity, creating a model-routing opportunity to switch between the two depending on what each use case needed.

With Arcana, the team saw an opportunity to leverage advancements in next-token prediction to achieve a new level of realism. The goal was a spoken language model that could infer emotion from context, laugh and sigh, use filler words like “um” naturally, and sound as human as possible. That required novel approaches to model architecture, training, and, especially, data.

“We’ve been collecting an extremely large proprietary dataset of conversational speech, prioritizing authenticity and diversity,” said Clifford. “We’ve done a lot of in-house annotation, from multilingual PhD annotators, to label things like coughing and sneezing and laughter with 98-100% accuracy.” This meticulous approach helped Rime capture the subtle nuances of speech including accents, emotions, and conversational dynamics.

Architecturally, Arcana pairs an LLM backbone trained on extensive text and audio data with a high-resolution codec that captures nuanced acoustic detail, using a tokenization process to represent audio features effectively. The pre-trained LLM autoregressively decodes flattened codec representations, ordered from coarsest to finest, which enables faster-than-real-time synthesis.
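Rime hasn’t published the exact scheme, but “flattened codec representations, ordered from coarsest to finest” typically means interleaving the codes from each quantizer level of a neural codec into a single token stream. A minimal sketch, with hypothetical shapes and codebook size:

```python
import numpy as np

CODEBOOK_SIZE = 1024  # hypothetical codebook size per quantizer level

def flatten_codec_frames(frames: np.ndarray) -> np.ndarray:
    """Flatten a (num_frames, num_codebooks) grid of codec codes into one
    token stream, emitting each frame's codes coarsest-first so an LLM
    can decode them autoregressively."""
    num_frames, num_codebooks = frames.shape
    # Shift each codebook into its own vocabulary range so the model can
    # tell which quantizer level a given token belongs to.
    offsets = np.arange(num_codebooks) * CODEBOOK_SIZE
    return (frames + offsets).reshape(-1)

# Example: 3 frames of audio, 4 quantizer levels (coarse -> fine)
codes = np.random.randint(0, CODEBOOK_SIZE, size=(3, 4))
tokens = flatten_codec_frames(codes)  # length 12, frame-major order
```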

When it came to training Arcana, Rime used a three-stage process. The team took open-source LLM backbones and did additional pre-training to learn general linguistic and acoustic patterns. They then fine-tuned on their massive proprietary dataset, giving Arcana world-class realism and emergent capabilities. Finally, they took their most exemplary speakers and optimized the model for conversation and reliability, resulting in Arcana’s eight flagship voices, along with voice consistency with Mist v2.
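The exact recipe isn’t public, but those three stages map naturally onto a staged fine-tuning loop. Below is an illustrative outline in which every dataset name, learning rate, and helper function is a hypothetical placeholder:

```python
# Illustrative three-stage schedule mirroring the process described above
STAGES = [
    {"name": "continued-pretraining", "data": "text_audio_corpus",     "lr": 1e-4},
    {"name": "proprietary-finetune",  "data": "conversational_speech", "lr": 3e-5},
    {"name": "flagship-voice-polish", "data": "exemplary_speakers",    "lr": 1e-5},
]

for stage in STAGES:
    dataset = load_dataset(stage["data"])       # placeholder loader
    train(model, dataset, lr=stage["lr"])       # placeholder trainer
    save_checkpoint(model, tag=stage["name"])   # placeholder saver
```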

All of this training was done on Weights & Biases (W&B) Models.

How W&B has made a huge impact at Rime

As Clifford noted, speaking from years of in-the-trenches work, training TTS models is tricky for a variety of reasons. Fortunately, dating back to her time as a PhD student at Stanford, Weights & Biases has been an integral part of Clifford’s life and work.

“When I was starting the company, I said we need to buy and use Weights & Biases,” said Clifford. “For every model that we’ve ever trained, W&B has been a part of that process. I would say W&B is a crucial part of the experiment paradigm.”

“As a small but growing team of 12, with five machine learning engineers, W&B is our single source of truth for our team to make decisions about where our priorities lie,” continued Clifford. “Did this experiment go well? Let’s refer to W&B. Did it not go well? Let’s look in W&B and make a decision. From a product velocity and decision-making velocity standpoint, I don’t see how we could have succeeded without W&B – when we run an experiment, it needs to be logged to W&B.”
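In practice, the “every experiment gets logged” rule is only a few lines of code. A minimal sketch of the standard W&B pattern, where the project name, data loader, model, and training step are hypothetical:

```python
import wandb

# Start a tracked run; config values become searchable in the W&B UI
run = wandb.init(project="arcana-tts", config={"lr": 3e-4, "batch_size": 32})

for step, batch in enumerate(train_loader):     # train_loader: placeholder
    loss = train_step(model, batch)             # train_step, model: placeholders
    wandb.log({"train/loss": loss}, step=step)  # one point on the loss curve

run.finish()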

In addition to logging all experiments in W&B Models, the Rime team also relies on W&B as a great sanity check, especially when dealing with some of the nuances of TTS models.

“With TTS models, you can’t compare the quality of one model against another, checkpoint to checkpoint,” explained Clifford. “You have to develop an intuition, and W&B is really good at helping us with that.”

For every run, the Rime team logs not just loss curves but also sample audio as model artifacts. When their human annotators run a comparative mean opinion score test, they can determine, with the help of W&B, the best checkpoint they’ve trained.
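Logging audio alongside the loss curve is straightforward with W&B’s built-in media types; a minimal sketch, where the waveform, sample rate, and eval metric are placeholders:

```python
import wandb

run = wandb.init(project="arcana-tts")
# waveform: a 1-D NumPy array of synthesized speech (placeholder)
wandb.log({
    "eval/loss": eval_loss,  # placeholder metric
    "samples/meatzza_extravaganza": wandb.Audio(
        waveform, sample_rate=24_000, caption="checkpoint-12000"
    ),
})
run.finish()
```

Each checkpoint’s samples then sit next to its metrics, which is what makes the comparative listening tests easy to trace back to a specific run.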

In addition to logging and tracking all experiments in W&B, Clifford and her MLEs lean heavily on Sweeps and Custom Views. Sometimes the team’s experiments run on cadences long enough that hyperparameter tuning needs to be automated via Sweeps.

“When we need to train a model in 10 minutes, there’s no way we can do that without Sweeps,” said Clifford. “And sometimes, apples-to-apples comparisons with ML models are shockingly hard to find, so for our team being able to use Custom Views to quickly identify two runs and say we want to compare them, that’s a really special and useful aspect of W&B.”
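A Sweeps setup like the one Clifford describes takes a short config plus an agent; the metric name, parameter ranges, and train function here are hypothetical:

```python
import wandb

sweep_config = {
    "method": "bayes",  # Bayesian search over the hyperparameter space
    "metric": {"name": "eval/loss", "goal": "minimize"},
    "parameters": {
        "lr":           {"min": 1e-5, "max": 1e-3},
        "warmup_steps": {"values": [500, 1000, 2000]},
    },
}

sweep_id = wandb.sweep(sweep_config, project="arcana-tts")
wandb.agent(sweep_id, function=train, count=20)  # train(): placeholder
```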

Next steps at Rime and the future of TTS

Both the cutting-edge Arcana and Mist v2 have led to explosive growth at Rime, with the startup recently doubling its number of customers, including on-prem customers, something that caught Clifford by surprise. Sophisticated customers are deftly using both models in the same conversations, across core use cases like customer support, fast-food ordering, outbound calling, and audiobook narration, and across industries as diverse as retail, financial services, and healthcare.

Arcana’s technical performance has also been impressive, delivering 200ms time to first audio token and roughly 300ms public cloud latency. This faster-than-real-time synthesis delivers the bleeding-edge capabilities that top enterprises need from their speech models.

And Clifford and the team are just getting started.

“Arcana is just another step in our ambitious roadmap at Rime,” said Clifford. “We’re excited to push the frontier of what’s possible with voice AI, continually improving both realism and business impact.”

Check out Arcana and try it out for yourself, with live chat available on the Rime website.