This Company Wants to Match Nonverbal People to More Human Digital Voices

The technology for text-to-speech communication has been limiting for those who can't speak.
February 27, 2017, 1:00pm

For those of us who have always been verbal, a unique voice is something we take for granted. Maybe we sound a little like our parents, or sometimes people mistake us for our siblings, but for the most part, we own our voices: they are an extension of our bodies, our physical footprint on the world.

But for many years, people who use communication devices to speak have been grappling with substandard technology to express themselves. Non-verbal kids have had to communicate through text-to-speech machines using the voice of a 30-year-old woman called Heather or an adult man named Ryan, synthesized voices that are readily available for download for anyone who wants them.


Although companies like Apple have sunk thousands of hours into creating lifelike voices for their AI (hello, Siri), and although GPS companies have paid celebrities like Kevin Hart to give you directions, the same technology hasn't yet been made available to the ten million people who rely on text-to-speech devices for every verbal communication.

Now, a company in Massachusetts is trying to give people without the ability to speak their own unique voice. VocaliD blends vocal sounds made by the non-verbal with hours of recordings donated by verbal individuals to create distinctive, synthesized voices that reflect the recipients' age, nationality, and even (possibly) aspects of their personalities.

"We're trying to make a one-of-a-kind knockoff," explained Rupal Patel when I met her at the VocaliD office in Belmont, Massachusetts.  She had just finished demonstrating how to bank your voice in the VocaliD system.

Rupal Patel demonstrates the VocaliD system. Image: Katy Kelleher

The way VocaliD creates voices is complicated. Clients upload vocal sounds like "ahh" and "eee" using a simple, $20 headset with a microphone. These recordings are entered into the VocaliD system, which searches through the voice bank of uploaded voices to find one that matches.

While other companies like Acapela Group and CereProc also offer customized voices, VocaliD is the only one that uses vocal sounds collected directly from the client. These voices can be programmed into text-to-speech devices or used through a text-to-speech app. (VocaliD voices are compatible with Windows devices, and according to the company's website, the team is currently working to develop apps for iOS and Android.)


Clients then enter certain information (age, weight, height, and country of origin, all of which affect your voice), which helps the VocaliD staff find a donor voice that matches their criteria. Typically, three voices are selected and presented to the client, who can listen to samples of each voice. Once they choose one they like, their own vocal sounds are blended with the donor voice to create an entirely new voice.

The individualized voices don't sound perfectly human—they still have a robotic edge to them, a flattening of vocal patterns that lodges the sound firmly in the aural uncanny valley. There's not quite enough rhythm, and the emphasis is sometimes placed on the wrong syllable, particularly with longer words. But for Patel's clients, this is still a vast improvement. Their every expression is no longer mass produced—it's their voice, digitized.

Screenshot from the VocaliD system.

Kara Flack, director of business development and sales at VocaliD, has a daughter with cerebral palsy. She knows what it's like to be in a room full of people using the same voice, six Heathers and a Ryan, kids communicating in the emotionless robotic tones of anonymous adults.

When her daughter Maeve first received a voice from VocaliD, she couldn't wait to show her friends. "The day she got it, she walked into her classroom and said, 'Hi.' Her friends were like, 'Is that your new voice, Maeve?' She's just a ten-year-old kid, but she likes to have her own identity."


These days, Maeve uses her voice all the time. The family just bought a Google Home, and Maeve can't stop talking to it. Her mother has heard her saying one sentence over and over: "Okay Google, play Taylor Swift."


On the day I visited VocaliD, the office was busier than usual. A 9-year-old boy in Syracuse named Leo was about to get his first voice. While this is a feel-good story, Patel tempers any expectations I might have of seeing a tear-jerking miracle.

"It's not like when someone gets a cochlear implant and hears for the first time," she says. "You don't get to see the kid's eyes widening as they hear their voice. Most of them have been using text-to-speech devices for a while now."

At a nearby computer, Geoff Meltzner lets me hear the voice created for Leo. It is high-pitched yet boyish (a few sentences sound downright playful). Meltzner is in charge of algorithm and "custom voice creation efforts" at VocaliD. He has a PhD in Speech and Hearing Biosciences and Technology from MIT.

Meltzner looking at voice soundwaves. Image: Katy Kelleher

Before he came to the startup, Meltzner was working in defense for Burlington-based BAE Systems on a program called silence to speech recognition. "We would put sensors on your face, let you mouth the words, and it would recognize what you were saying," he explains. "That's where I met Rupal—she came in as an expert in speech pathology."

Although VocaliD is doing work for people who can't speak, this is still a for-profit company. While Patel advocates for using cutting-edge technology to solve social problems, the company isn't giving voices away for free. A distinct voice (which can be used on most text-to-speech devices, but not all) currently costs $1,250.


"We're working really hard to bring the price down," Patel says. "Right now, we are funded through government contracts that help us commercialize the research, but we are working with a very bare-bones structure." She adds, "Technologically, [creating a voice] is a very big lift."

If the company becomes more efficient (and if it can build a larger voice bank), she says, it may be possible to make custom-built voices cheaper and more readily available. And that will create actual impact.

Screenshot from VocaliD

Patel argues that many people overlook people with disabilities—it's not even that they're viewed as lesser, but they're not really seen at all. Empathy often arises from the sharing of stories and life experiences. When you can't hear someone, it can be harder to empathize with them.

"When you hear someone's unique voice, you understand that it's them in there," she says. "You start attributing things to them that you might have not seen before, things that were there all along. Like intelligence. Like humor."


I've never liked my voice. I'm a victim of some fierce vocal fry. My voice creaks and sometimes squeaks, and I have an accent that even Patel can't place. Not from here, not from there, Massachusetts clipped and Midwestern flat.

It's difficult not to analyze it as I speak with Patel about the cadences, pauses, stutters, and melodies that make our voices so distinctive, so human. But where I see flaws, things I thought could disqualify me from donating my voice, she sees distinguishing features, characteristics that many of her clients want.


Four years ago, when Patel was creating her first voices for clients, she noticed something surprising. She had seven clients, and 21 voices. Her team had worked to smooth out the articulation, clear up consonants and vowels, and polish the voices until they were clear and relatively melodic.

Screenshot from VocaliD

And yet, when they presented the options to their clients, five out of seven picked the less polished options—the voices with flaws, the ones that hadn't been quite as worked over. "We were scratching our heads. Why wouldn't they pick the most clear and understandable voices?"

Now, she knows. She's worked with people who have lost their voices to cancer and degenerative diseases. She's had requests for voices with lisps, stutters, and accents (and yes, vocal fry).

"That is who they were—or who they are," she explains. "Rather than saying it's not acceptable to have a voice like that, and so much of society does tell us that there is something wrong with our voices, we're seeing that these are natural variations. That's what makes us think we can relate to someone. It's real."