“Look,” the voice of Jordan Peterson says in a YouTube video. “If you had one shot, or one opportunity, to seize everything you ever wanted in one moment—Would you capture it? Or just let it slip?”
It sounds like something the author of 12 Rules for Life might say, assuming he’s a fan of Eminem’s classic 2002 track, “Lose Yourself.” But Peterson, a controversial Canadian professor known for his lectures about a supposed “crisis in masculinity,” isn’t actually saying any of these things. It’s a voice generated using machine learning techniques to make it sound like he’s rapping.
Using just six hours of Peterson talking, the creator, who goes by Miles, employed machine learning techniques like audio style-transfer to make this haunting cover. In the description of the video, they say that they implemented techniques from two recent papers on the arXiv preprint server.
The papers both deal with methods for using AI to model style in end-to-end text-to-speech (TTS) systems and to match prosody: the rhythm, sound, stress and intonation of spoken language. That’s how this audio not only sounds like Peterson’s voice but also matches the cadence of his real speaking style.
“The model is given thousands of short audio clips and their transcripts of a speaker, and through hours of computation can learn how to synthesize speech in the style of that speaker,” Miles told me in a Reddit message. “This is how any [machine learning] project works (lots of data is provided and a model will learn how to analyze new instances of that data or generate more of it).”
It’s also similar to how deepfakes—AI-generated fake videos—are created, in that one of these fake audio clips requires a ton of clear, good-quality audio samples from the source.
Peterson talks a lot, usually into earbud microphones vlog-style, or on stages, or lecturing classes. In his YouTube videos, he’s almost always in a quiet room, and goes for minutes or hours at a time, generally uninterrupted while he monologues or interviews other people. That makes him an ideal candidate for building a machine learning dataset.
Miles told me that while their background is more in business analytics than programming, they’ve been studying machine learning (and TTS technology, especially) for several months.
This sounds a lot like what the creator of “deepfakes” told me in 2017: that he’s not a professional machine learning researcher, just someone with an interest in learning more about AI techniques. Both instances show that this technology is becoming increasingly accessible as at-home machine learning hobbyists tinker with open-source code.
“I got interested in this field after coming across a project from Facebook researchers called VoiceLoop where they had imitated Trump and Zuckerberg’s voices with some success and was intrigued by the idea of being able to do that,” Miles said.
VoiceLoop is a neural TTS method developed by Facebook Research in 2018 that can transform text into speech in voices sampled from real-world recordings. The code is available online, so anyone can use it to make their own voices from scratch.
Miles isn’t the first amateur AI whisperer to try this at home: Others have created their own voice models using programs like Lyrebird and Modulate.ai. And to make fake voice detection easier, Google released a library of synthetic voices last year, spoken by its deep learning-driven TTS models.
When technology arises that so closely mimics elements that are innately human—our speaking styles, personalities, voices, faces, and even mannerisms—it’s tempting to feel our palms get sweaty, knees weak, and arms heavy. But don’t lose yourself: While these advancements may push us uncomfortably close to the uncanny valley, whether we fall for these fakes is a societal challenge, much more than it is a technical one.