With just 3.7 seconds of audio, a new AI algorithm developed by Chinese tech giant Baidu can clone a pretty believable fake voice. Much like the rapid development of machine learning software that democratized the creation of fake videos, this research shows why it's getting harder to believe any piece of media on the internet.
Researchers at the tech giant unveiled their latest advancement in Deep Voice, a system developed for cloning voices. A year ago, the technology needed around 30 minutes of audio to create a new, fake audio clip. Now, it can create even better results with just a few seconds of training material.
Of course, the more training samples it gets, the better the output: Results from a single sample still sound a bit garbled, but they don’t sound much worse than a low-quality audio file might.
Here’s a sample of a British male speaker:
And here’s Deep Voice creating an American voice using that sample:
Here’s a three-second clip of a woman’s voice:
Using just that sample, Deep Voice made this clip:
Using 100 samples, the voice sounds almost as good as the original:
You can listen to the rest of the samples and the AI-generated results here.
The system can change a female voice to a male one, and a British accent to an American one, demonstrating that AI can learn to mimic different styles of speaking and take the personalization of text-to-speech to a new level. “Voice cloning is expected to have significant applications in the direction of personalization in human-machine interfaces,” the researchers write in a Baidu blog post on the study.
This iteration of Deep Voice is the latest in a series of developments in AI-generated voice mimicry in recent years. In 2016, Adobe demonstrated its VoCo software, which could generate speech from text after listening to a voice for 20 minutes. Montreal-based AI startup Lyrebird claims it can do text-to-speech using just one minute of audio.
These technologies represent the kind of leaps in AI that researchers and theorists raised concerns about when deepfakes democratized machine learning-generated videos. If all that’s needed is a few seconds of someone’s voice and a dataset of their face, it becomes relatively simple to fabricate an entire interview, press conference, or news segment.