Microsoft’s research team has unveiled VALL-E 2, a new AI speech synthesis system capable of generating “human-level” voices that are indistinguishable from the source speaker, from just a few seconds of audio.
“(VALL-E 2 is) the latest advancement in neural codec language models that marks a milestone in zero-shot text-to-speech (TTS), achieving human parity for the first time,” the research paper reads. The system builds on its predecessor, VALL-E, which was introduced in early 2023. Neural codec language models represent speech as sequences of discrete codes, letting a language model predict audio tokens much the way text models predict words.
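Under the hood, that means audio is first compressed into discrete tokens by a neural codec; the VALL-E line of work uses Meta’s open-source EnCodec for this step. Below is a minimal sketch, written against the published `encodec` Python package, of how a short prompt recording becomes the token sequence such a model conditions on. The file name is a placeholder, and VALL-E 2 itself is not publicly available.

```python
# Sketch: turning raw audio into the discrete codec tokens a neural codec
# language model operates on. Uses Meta's open-source `encodec` package;
# "prompt.wav" is a placeholder path.
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

model = EncodecModel.encodec_model_24khz()   # pretrained 24 kHz codec
model.set_target_bandwidth(6.0)              # 6 kbps -> 8 codebooks per frame

wav, sr = torchaudio.load("prompt.wav")      # a few seconds of the speaker
wav = convert_audio(wav, sr, model.sample_rate, model.channels)

with torch.no_grad():
    frames = model.encode(wav.unsqueeze(0))  # list of (codes, scale) tuples

codes = torch.cat([c for c, _ in frames], dim=-1)
print(codes.shape)  # (batch, n_codebooks, n_frames): speech as token IDs
```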
According to the team, what sets VALL-E 2 apart from other voice cloning techniques is its “Repetition Aware Sampling” decoding method, which adaptively switches between sampling strategies. These changes improve stability and address the most common failure modes of earlier generative voice systems, such as getting stuck in loops on repeated words.
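The paper describes the idea as a per-step decoding rule: draw a token by nucleus (top-p) sampling, and if that token has been repeating too often in a recent window, fall back to plain random sampling from the full distribution. Here is a minimal sketch of that rule; the function name, window size, and threshold defaults are illustrative stand-ins, not the paper’s exact hyperparameters.

```python
# Sketch of the repetition-aware sampling rule described in the VALL-E 2
# paper. Default values (top_p, window, max_ratio) are illustrative.
import numpy as np

def repetition_aware_sample(probs, history, top_p=0.9, window=16,
                            max_ratio=0.5, rng=None):
    """Pick the next codec token given the model's distribution `probs`
    (assumed to sum to 1) and the tokens decoded so far (`history`)."""
    rng = rng or np.random.default_rng()

    # Step 1: default decoding is nucleus (top-p) sampling.
    order = np.argsort(probs)[::-1]
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    nucleus = order[:cutoff]
    token = int(rng.choice(nucleus, p=probs[nucleus] / probs[nucleus].sum()))

    # Step 2: if the sampled token already dominates the recent window,
    # adaptively switch to random sampling from the full distribution,
    # breaking the repetition loops that plague autoregressive decoders.
    recent = list(history[-window:])
    if recent and recent.count(token) / len(recent) > max_ratio:
        token = int(rng.choice(len(probs), p=probs))

    return token
```

In a full decoder, a function like this would be called once per autoregressive step of the codec language model, with `probs` coming from the model’s output at that step.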
“VALL-E 2 consistently synthesizes high-quality speech, even for sentences that are traditionally difficult due to their complexity or repetitive phrases,” the researchers wrote, noting that the technology could help generate speech for people who are losing the ability to speak.
As impressive as it is, the tool will not be made available to the public.
“We currently have no plans to integrate VALL-E 2 into a product or broaden public access,” Microsoft said in its ethics statement, noting that such tools carry risks such as impersonating voices without consent and using convincing AI voices in scams and other criminal activities.
The research team stressed the need for a standard way to digitally label AI-generated audio, acknowledging that detecting such content with high accuracy remains a challenge.
“If the model is generalized to unseen speakers in the real world, it should include a protocol to ensure that the speaker approves of the use of his or her voice and a model for detecting synthesized speech,” they wrote.
That being said, VALL-E 2’s results stand out against other tools. In the research team’s evaluations, its generated speech surpassed human benchmarks for robustness, naturalness, and speaker similarity.
VALL-E 2 was able to achieve these results with just three seconds of audio, though the research team noted that “using 10-second speech samples resulted in even better quality.”
Microsoft isn’t the only AI company that has showcased cutting-edge AI models without publishing them. Meta’s Voicebox and OpenAI’s Voice Engine are two impressive voice cloners facing similar restrictions.
“There are many interesting use cases for generative speech models, but due to the potential risks of misuse, we are not making the Voicebox model or code publicly available at this time,” a Meta AI spokesperson told Decrypt last year.
OpenAI, likewise, said it wants to address safety concerns before releasing its synthetic voice model more broadly.
“Consistent with our approach to AI safety and our voluntary commitments, we are choosing to preview this technology, but not to release it broadly at this time,” OpenAI explained in an official blog post.
This call for ethical guidelines is spreading throughout the AI community, especially as regulators begin to worry about the impact of generative AI on our daily lives.
Edited by Ryan Ozawa.