Microsoft has announced a new tool that it says is capable of replicating a person's voice after analysing just three seconds of sample audio.

The tool, which Microsoft calls VALL-E, is what the company describes as a "neural codec language model," built on top of EnCodec, an audio codec technology that Meta announced in October 2022. But what makes it so interesting is Microsoft's belief that VALL-E is capable of preserving emotional tone, mimicking what it hears in that three-second sample clip.

As for how Microsoft does all of that, the company says that "VALL-E generates the corresponding acoustic tokens conditioned on the acoustic tokens of the 3-second enrolled recording and the phoneme prompt, which constrain the speaker and content information respectively." After that, "the generated acoustic tokens are used to synthesize the final waveform with the corresponding neural codec decoder."
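To make that description a little more concrete, here is a minimal structural sketch of the pipeline Microsoft outlines, written in Python. It is not the real VALL-E implementation: every class and function below (NeuralCodec, AcousticLanguageModel, phonemize, synthesize) is a hypothetical placeholder standing in for the neural codec encoder/decoder and the token-generating language model.

```python
# Hypothetical sketch of the VALL-E-style pipeline described above.
# The real system uses trained neural networks; these stubs only show
# how the pieces are wired together.

import numpy as np


def phonemize(text: str) -> list[str]:
    # Placeholder: a real system would run a grapheme-to-phoneme converter.
    return text.lower().split()


class NeuralCodec:
    """Stand-in for a neural audio codec encoder/decoder."""

    def encode(self, waveform: np.ndarray) -> np.ndarray:
        # Placeholder: map the 3-second enrolled recording to a short
        # sequence of discrete acoustic tokens.
        return (np.clip(np.abs(waveform[:75]), 0, 1) * 1023).astype(np.int64)

    def decode(self, acoustic_tokens: np.ndarray) -> np.ndarray:
        # Placeholder: synthesize the final waveform from acoustic tokens.
        return acoustic_tokens.astype(np.float32) / 1023.0


class AcousticLanguageModel:
    """Stand-in for the language model that predicts acoustic tokens."""

    def generate(self, prompt_tokens: np.ndarray, phonemes: list[str]) -> np.ndarray:
        # Placeholder: condition on the enrolment tokens (speaker identity)
        # and the phoneme prompt (content) to predict new acoustic tokens.
        rng = np.random.default_rng(len(phonemes))
        return rng.integers(0, 1024, size=75 * len(phonemes))


def synthesize(enrolled_audio: np.ndarray, text: str) -> np.ndarray:
    codec = NeuralCodec()
    lm = AcousticLanguageModel()

    prompt_tokens = codec.encode(enrolled_audio)   # constrains the speaker
    phonemes = phonemize(text)                     # constrains the content
    generated_tokens = lm.generate(prompt_tokens, phonemes)
    return codec.decode(generated_tokens)          # final waveform


if __name__ == "__main__":
    three_second_clip = np.random.randn(24_000 * 3)  # fake 3 s of 24 kHz audio
    audio_out = synthesize(three_second_clip, "hello world")
    print(audio_out.shape)
```

The point of the sketch is the data flow: the short sample only has to be encoded once into acoustic tokens, after which the language model does the heavy lifting of predicting new tokens that the codec decoder turns back into audio.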

Microsoft says that the AI tool was trained on a Meta-assembled audio library (LibriLight) containing 60,000 hours of English-language speech. The recordings come from more than 7,000 individual speakers, helping to ensure that the AI was trained on a wide range of voices.

You can get a feel for how well VALL-E works over on Microsoft's sample website, where plenty of audio clips are offered for your delectation.

Ars Technica reports that "in addition to preserving a speaker's vocal timbre and emotional tone, VALL-E can also imitate the 'acoustic environment' of the sample audio." That means that if the sample came from a particular environment, such as a crackly phone call, the replicated audio VALL-E kicks out will sound like it came from that same environment.

Notably, Microsoft has chosen not to let people test the VALL-E tool out for themselves. It's been suggested that the company is concerned that people will get up to no good with it - a concern that might well have merit.

"Since VALL-E could synthesize speech that maintains speaker identity, it may carry potential risks in misuse of the model, such as spoofing voice identification or impersonating a specific speaker," Microsoft warns. And yes, theoretically you could feed VALL-E a clip of a deceased person and have it talk back to you.