ElevenLabs launches "Eleven v3"
Created on June 6 | Last edited on June 6
ElevenLabs has introduced Eleven v3 (alpha), a new version of its text-to-speech (TTS) model that pushes expressiveness and dialogue realism well beyond earlier iterations. Marketed as “the most expressive Text to Speech model,” v3 is designed to handle complex emotional tone shifts, natural timing, and layered delivery, creating the illusion of true human conversation in audio generation. The alpha release is an early public version, signaling that while it’s ready to try, it’s still undergoing refinement based on user feedback and internal iteration.
Expressive Control and Audio Tags
A key innovation in v3 is its use of inline audio tags that let users embed expressive controls directly in the text. These tags allow adjustments to pitch, emotion, and delivery pace, as well as effects like emphasis or background sound. For creators and developers, this enables voice output that can shift from calm narration to intense shouting or laughter within a single piece of dialogue. The goal is to offer not just voice synthesis but performance: voice acting by prompt.
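As a rough illustration of the pattern, tags travel inline with the script itself, so one text field carries both the words and the performance direction. The sketch below uses the standard ElevenLabs text-to-speech REST endpoint; the model identifier, the specific tag names, and the voice ID are illustrative placeholders rather than values confirmed in the announcement.

```python
import requests

API_KEY = "your-elevenlabs-api-key"  # placeholder credential
VOICE_ID = "your-voice-id"           # placeholder voice ID

# Expressive controls are written inline as bracketed audio tags.
# The tag names here ([whispers], [excited], [laughs]) are illustrative.
text = (
    "[whispers] I wasn't sure we'd make it... "
    "[excited] but we did it! [laughs]"
)

response = requests.post(
    f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
    headers={"xi-api-key": API_KEY, "Content-Type": "application/json"},
    json={
        "text": text,
        "model_id": "eleven_v3",  # assumed identifier for the v3 alpha model
    },
)
response.raise_for_status()

# The endpoint returns raw audio bytes (MP3 by default).
with open("output.mp3", "wb") as f:
    f.write(response.content)
```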
Dynamic Dialogue and Multi-Speaker Features
The model also introduces more natural multi-speaker interaction, an area where earlier versions were limited. Dialogue Mode in v3 allows multiple speakers to respond to each other with appropriate emotional and contextual flow. Rather than producing disconnected lines, the system can generate overlapping, interrupting, or even cooperative speech patterns that sound more like a podcast or live conversation. An improved timing model is meant to make these overlaps feel intentional rather than robotic.
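A multi-speaker request would plausibly pair each line of the script with a voice. The endpoint path and payload shape below are assumptions modeled on the single-speaker API, not confirmed details from the announcement; the voice IDs and tags are again placeholders.

```python
import requests

API_KEY = "your-elevenlabs-api-key"  # placeholder credential

# Each entry pairs one speaker's line (with inline tags) to a voice.
# The /v1/text-to-dialogue path and "inputs" payload shape are assumptions
# extrapolated from the single-speaker endpoint.
dialogue = [
    {"voice_id": "voice-a", "text": "[curious] So... what do you think of it?"},
    {"voice_id": "voice-b", "text": "[laughs] Honestly? [excited] It's amazing."},
]

response = requests.post(
    "https://api.elevenlabs.io/v1/text-to-dialogue",
    headers={"xi-api-key": API_KEY, "Content-Type": "application/json"},
    json={"inputs": dialogue, "model_id": "eleven_v3"},
)
response.raise_for_status()

# A single rendered track, with turn-taking and overlaps handled by the model.
with open("dialogue.mp3", "wb") as f:
    f.write(response.content)
```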
Language Expansion and Global Reach
Eleven v3 also vastly expands language support. While version 2 supported 29 languages, v3 supports over 70, targeting broad international accessibility. This includes major world languages like English, Portuguese, and Chinese, but also smaller language groups, enabling more inclusive and localized content generation. Importantly, the emotional and prosodic range appears to carry over well into non-English languages, an area where TTS models have traditionally been weak.
Comparing v3 to Previous Versions
Compared to Eleven v2, v3 brings substantial upgrades. Audio tag support in v2 was limited to basic elements like pauses or breaks. In v3, the tag library is expanded to cover full emotional direction, layered effects, and spatial audio design. The performance leap is especially notable in dialogue realism and narrative flow. While v2 produced technically accurate voice output, v3 delivers voice acting—tone, nuance, and dynamic variation over time.
Availability, API Access, and Use Cases
Eleven v3 is available to try in its alpha form, and developers can access it via API. The platform is clearly targeting creators, studios, and enterprises that rely on high-quality voice generation, such as for dubbing, podcasts, virtual assistants, or immersive storytelling. An 80% discount is being offered, though the full pricing and access model isn’t detailed in the promo materials. Use cases range from automated call centers to interactive games and audio entertainment, especially where emotional range is crucial.
In short, Eleven v3 positions itself not just as a technical upgrade, but as a shift toward AI-generated performance. Whether it succeeds in the long term will depend on user adoption and how the model handles creative edge cases in more complex use scenarios.