A woman looks at the camera and says, “Knowledge is one thing, virtue is another.” Then, she says, “Knowledge is virtue.” The same person, with the same voice, makes two conflicting statements, but she only said the first in real life. The second statement is the work of an AI system that took audio of her speech and turned it into a video.
Researchers from Nanyang Technological University in Singapore, the National Laboratory of Pattern Recognition in China, and artificial intelligence software company SenseTime developed the method for creating deepfakes from audio sources. Basically, the AI takes an audio clip of someone speaking, and a video of another person (or the same person), and generates realistic footage of the person saying the words from the source audio. The person in the video becomes a puppet for the original voice.
To do this, the researchers first create a three-dimensional face model on every frame of the target video to extract the geometry, pose, and expressions of the face, according to the paper, "Everybody’s Talkin’: Let Me Talk as You Want," published this month to the arXiv preprint server. From there, they draw 2D landmarks of the face, focusing especially on mouth movements. That way, instead of training on the entire scene, the algorithm trains only on the facial features and leaves the background sharp.
They then reconstruct a 3D face mesh whose lip movements match the phonemes, or individual sounds, in the source audio, similar to how recent text-to-video methods work.
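The idea of keeping a face's shape and head position while replacing only its expression can be sketched in a few lines of Python. This is a rough illustration of the description above, not the researchers' code: the names `FaceParams` and `retarget` are hypothetical, and it assumes the source audio has already been converted into one set of expression values per video frame, a step the sketch does not attempt.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class FaceParams:
    """Hypothetical per-frame face description (an assumption for illustration)."""
    geometry: Tuple[float, ...]    # shape of the face
    pose: Tuple[float, ...]        # head rotation and position
    expression: Tuple[float, ...]  # includes the shape of the mouth


def retarget(
    target_frames: List[FaceParams],
    audio_expressions: List[Tuple[float, ...]],
) -> List[FaceParams]:
    """Keep each target frame's geometry and pose; swap in the
    expression values derived from the source audio."""
    # Stop at the shorter of the two sequences.
    n = min(len(target_frames), len(audio_expressions))
    return [
        replace(target_frames[i], expression=audio_expressions[i])
        for i in range(n)
    ]
```

In this sketch, the person in the target video keeps their own face and head movement, and only the expression values change from frame to frame.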
The researchers say this method creates "very deceptive" audio-visual results. Compared to past deepfake methods like Face2Face from 2016 and "Synthesizing Obama" from 2017, the results are crisper, with fewer artifacts visible to the naked eye. In an online poll of 100 participants conducted by the researchers, 55 percent of generated videos were rated as "real."
The researchers say this is the first end-to-end learnable audio-based video editing method. If you're going to make your deepfake speak, however, the voice can make or break its believability: the deepfakes of Mark Zuckerberg last year, for example, had a voice that was comically unrealistic. Faked audio has been a focus of AI engineers and deepfake developers for years, and algorithmically generated voices alone can sound incredibly real. A generated voice mimicking Jordan Peterson was so realistic that Peterson himself threatened to sue its creator.
Because this method can use the real voice of the person you're trying to deepfake, and splice their words up into whatever you want, it's another leap forward in deepfake realism.
On Thursday, the Bulletin of the Atomic Scientists, stewards of the Doomsday Clock, included deepfakes as a reason why we're closer to the end of the world than ever, saying that the emergence of algorithmically generated video "threatens to further undermine the ability of citizens and decision makers to separate truth from fiction." But sussing out truth from fiction might be relatively low on the list of today's AI-related concerns: for example, SenseTime, one of the companies that developed this research, was recently implicated in developing technology that helped the Chinese government profile a Muslim minority group.
In this paper, at least, the researchers seem to be aware of the risks highly customizable and realistic deepfakes pose to society.
"We do acknowledge the potential of such forward-looking technology being misused or abused for various malevolent purposes," including media manipulation and propaganda, the researchers write in their paper. "Therefore, we strongly advocate and support all safeguarding measures against such exploitative practices... Working in concert, we shall be able to promote cutting-edge and innovative technologies without compromising the personal interest of the general public."