NaturalSpeech: Unmatched Human-Like Text-To-Speech Waveform Generation Model
NaturalSpeech, a newly developed text-to-speech model, produces speech that is more human-like than any existing TTS model.
Microsoft researchers Tan et al. have released a new paper describing a text-to-speech model that produces speech patterns more human-like than those of existing models.
NaturalSpeech, the model at the center of this paper, introduces a number of architectural improvements over existing TTS models, including phoneme pre-training, a fully differentiable durator, a bidirectional prior/posterior module, and a memory-based VAE. Together, these allow it to produce waveforms of human-like speech so convincing that the researchers' judgment tests found no statistically significant difference from recordings of actual human speech.

Figure 1: System overview of NaturalSpeech.
A web page with demos of NaturalSpeech's TTS capabilities was released alongside the paper and is available here: https://speechresearch.github.io/naturalspeech/
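To make the pipeline in Figure 1 slightly more concrete, here is a minimal, hypothetical sketch of the overall flow (phoneme encoding, duration-based upsampling, waveform decoding) written in PyTorch. All layer choices, sizes, and module names below are illustrative assumptions, not the authors' implementation, which also includes components such as the bidirectional prior/posterior and the memory-based VAE.

```python
import torch
import torch.nn as nn

class NaturalSpeechSketch(nn.Module):
    """Toy stand-in for the text -> waveform pipeline sketched in Figure 1."""

    def __init__(self, n_phonemes=100, hidden=192):
        super().__init__()
        self.embedding = nn.Embedding(n_phonemes, hidden)
        encoder_layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=2, batch_first=True)
        self.phoneme_encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)
        self.duration_predictor = nn.Linear(hidden, 1)   # stand-in for the differentiable durator
        self.decoder = nn.Sequential(                    # stand-in for the waveform decoder
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, 256),
        )

    def forward(self, phoneme_ids):
        # phoneme_ids: (1, T_text) -- single utterance for simplicity
        h = self.phoneme_encoder(self.embedding(phoneme_ids))       # (1, T_text, hidden)
        durations = self.duration_predictor(h).squeeze(-1).exp()    # predicted frames per phoneme
        repeats = durations.round().long().clamp(min=1)[0]          # (T_text,)
        frames = torch.repeat_interleave(h, repeats, dim=1)         # length-regulated hidden frames
        return self.decoder(frames)                                 # (1, T_frames, 256) waveform stand-in

model = NaturalSpeechSketch()
out = model(torch.randint(0, 100, (1, 12)))
print(out.shape)
```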
Defining human-level quality speech
By the definition given in the paper, TTS audio is considered to have human-level quality if its quality score shows no statistically significant difference from that of the corresponding real human recording in the test set.
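To illustrate that criterion, the toy example below runs a paired significance test on made-up quality ratings for recordings versus TTS output. The scores are invented, and the Wilcoxon signed-rank test is used here simply as a common choice for paired ratings, not as a claim about the paper's exact procedure.

```python
from scipy.stats import wilcoxon

# Invented per-sentence quality ratings for paired clips (recording vs. TTS).
human_scores = [4.6, 4.5, 4.7, 4.4, 4.6, 4.5, 4.8, 4.5, 4.6, 4.7, 4.4, 4.6]
tts_scores   = [4.5, 4.6, 4.6, 4.3, 4.7, 4.4, 4.7, 4.6, 4.5, 4.6, 4.5, 4.7]

stat, p_value = wilcoxon(human_scores, tts_scores)
print(f"p-value = {p_value:.3f}")
# If the p-value is above the chosen threshold (e.g. 0.05), the quality
# difference is not statistically significant under this test.
```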
To gauge whether generated TTS audio is sufficiently human-like, the researchers established a system of subjective human judgment of audio samples rather than relying on objective automated metrics such as PESQ, STOI, and SI-SDR, which they found unreliable.
In the subjective evaluation, a panel of judges individually compares matching pairs of audio clips against each other to determine voice quality.
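As a rough illustration of how such judgments are aggregated, the sketch below averages hypothetical judge ratings into a MOS (absolute quality on a 1-5 scale) and a CMOS (pairwise comparison score relative to the reference recording). The numbers are made up for illustration; the exact rating protocol is described in the paper.

```python
from statistics import mean

# Invented ratings from five hypothetical judges.
mos_ratings = {
    "human recording": [5, 4, 5, 4, 5],   # absolute quality on a 1-5 scale
    "tts system":      [4, 4, 5, 4, 4],
}
# Side-by-side comparison vs. the recording (negative = worse than the recording).
cmos_ratings = [-1, 0, 0, 1, 0]

for system, ratings in mos_ratings.items():
    print(f"MOS  {system}: {mean(ratings):.2f}")
print(f"CMOS tts system vs. recording: {mean(cmos_ratings):+.2f}")
```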
Human vs. NaturalSpeech vs. other TTS models
Exhaustive detail on the production of NaturalSpeech, including model architecture, hyperparameters, training process, hardware used, and more, is available in Section 3 of the paper. NaturalSpeech was trained on LJSpeech, a dataset consisting of 24 hours of annotated human speech; the dataset was modified to fit the model's training requirements.
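For readers who want to poke at the training data themselves, the snippet below loads LJSpeech through torchaudio's dataset wrapper. This is just a convenient way to inspect the corpus and is not the preprocessing pipeline the authors describe.

```python
import torchaudio

# Download and open the LJSpeech corpus (~24 hours of single-speaker English speech).
dataset = torchaudio.datasets.LJSPEECH(root="./data", download=True)

# Each item is (waveform, sample_rate, transcript, normalized_transcript).
waveform, sample_rate, transcript, normalized_transcript = dataset[0]
print(f"{len(dataset)} clips; first clip is {waveform.shape[-1] / sample_rate:.1f}s at {sample_rate} Hz")
print(normalized_transcript)
```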
NaturalSpeech was put up against four other models, all evaluated under the same guidelines described above. The table below lists the MOS (mean opinion score) and CMOS (comparative mean opinion score) determined through the judging process. A MOS closer to 5 and a CMOS closer to 0 indicate speech closer to human quality.

Find out more
Check out the demo webpage to hear comparisons between NaturalSpeech's generated speech and real human recordings: https://speechresearch.github.io/naturalspeech/