
OpenAI Rolls Out Next-Gen Audio Models for Developers

Created on March 21 | Last edited on March 21
OpenAI has launched gpt-4o-transcribe, gpt-4o-mini-transcribe, and gpt-4o-mini-tts in its API, improving both speech-to-text and text-to-speech capabilities. These models offer better transcription accuracy and more expressive AI-generated voices.
Alongside the release of its next-generation audio models, OpenAI has introduced OpenAI.fm, an interactive demo allowing developers to experiment with its advanced text-to-speech capabilities. This tool provides users with a way to test different AI-generated voices, apply unique vocal styles, and hear how the models interpret scripted dialogue. The demo highlights OpenAI’s focus on making AI voices more expressive, engaging, and adaptable for various applications.


New speech-to-text improvements

OpenAI has released new models called gpt-4o-transcribe and gpt-4o-mini-transcribe, which offer major improvements in transcription accuracy. These models reduce word error rates significantly compared to earlier Whisper models, particularly in difficult conditions like noisy rooms, diverse accents, and fluctuating speech speeds. This makes them better suited for real-world applications such as transcribing customer service calls or meetings. Benchmark tests using the multilingual FLEURS dataset show that these models consistently outperform Whisper v2 and v3, as well as competing systems like Gemini and Nova.
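To give a rough sense of what this looks like in practice (not taken from the announcement itself), the new models slot into the same transcription endpoint the OpenAI Python SDK already exposes for Whisper; the file name below is a placeholder, and gpt-4o-mini-transcribe can be swapped in for a cheaper, faster variant:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Transcribe a local recording with the new speech-to-text model.
# "meeting.mp3" is a placeholder file name.
with open("meeting.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
    )

print(transcript.text)
```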



Customizable text-to-speech

The latest text-to-speech model, gpt-4o-mini-tts, introduces a new level of control. For the first time, developers can guide not just the words an AI says, but also how it says them. Whether the voice needs to sound professional, soothing, dramatic, or empathetic, this steerability opens up new use cases in storytelling, virtual assistance, and automated support systems. While the model uses synthetic voices, OpenAI is monitoring their use closely to ensure they remain within predefined and safe boundaries.
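As a sketch of how that steerability surfaces through the API's speech endpoint in the OpenAI Python SDK, the call below passes delivery guidance alongside the text; the voice name and instructions are illustrative choices, not prescriptions from the announcement:

```python
from openai import OpenAI

client = OpenAI()

# Generate speech with gpt-4o-mini-tts, steering delivery via instructions.
# The voice name and instruction text are illustrative placeholders.
with client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="coral",
    input="Thanks for calling. I can help you reset your password right away.",
    instructions="Speak in a calm, empathetic customer-support tone.",
) as response:
    response.stream_to_file("reply.mp3")
```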

Technical innovations powering the update

Several technical advancements underpin these new models. First, the models are built on GPT-4o and GPT-4o-mini architectures and are trained on large, high-quality, audio-centric datasets. This targeted pretraining allows the models to better understand speech nuances. Reinforcement learning has been used heavily on the speech-to-text side, improving accuracy and reducing hallucinations. For the smaller models, OpenAI has applied advanced distillation techniques—transferring knowledge from larger systems using synthetic training data that mimic real-world conversations.

API availability and what’s next

These new audio models are already available in OpenAI's API for any developer to use, and they can be combined with the existing Agents SDK to quickly build conversational agents that respond by voice. For applications that need fast voice responses, such as live customer support or translation, OpenAI recommends its real-time speech-to-speech API instead. Looking ahead, the company plans to further improve the models' intelligence and may eventually allow developers to upload custom voices. OpenAI is also engaging with policymakers and creatives on the broader implications of synthetic voice technology, and plans to keep investing in multimodal tools that combine speech, text, and even video.
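A minimal sketch of the Agents SDK route, assuming the voice extras of the openai-agents Python package (pip install 'openai-agents[voice]') and following the shape of its voice quickstart; the agent prompt and the silent audio buffer are placeholders, and exact signatures should be checked against the SDK docs:

```python
import asyncio
import numpy as np

from agents import Agent
from agents.voice import AudioInput, SingleAgentVoiceWorkflow, VoicePipeline

# A text agent wrapped in a voice pipeline: audio in -> transcription ->
# agent response -> synthesized audio out.
agent = Agent(
    name="Support agent",
    instructions="Answer briefly and politely.",  # placeholder prompt
)

async def main() -> None:
    pipeline = VoicePipeline(workflow=SingleAgentVoiceWorkflow(agent))
    # Three seconds of silence stands in for real microphone input.
    audio = AudioInput(buffer=np.zeros(24000 * 3, dtype=np.int16))
    result = await pipeline.run(audio)
    async for event in result.stream():
        if event.type == "voice_stream_event_audio":
            ...  # hand event.data to your audio playback of choice

asyncio.run(main())
```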
This release marks another step toward more natural, conversational AI—bringing developers closer to creating voice agents that feel less like bots and more like people.
Tags: ML News