Tech by VICE

Here's What's Holding Back Your Universal Translator

Google and Microsoft are racing to create a real-life Babel fish, but there's still a long way to go.

by Matthew Braga
Jan 13 2015, 6:55pm

New efforts from Google and Microsoft are cool, but they're a long way from science fiction. Photo: Mika Hiltunen

In our science fiction future, everyone has a hand-held device that can translate their dulcet tones into a language that anyone, anywhere can understand. In our present day, such a feat is still really damn hard to pull off.

Nevertheless, both Microsoft and Google are in the early stages of trying to make real-time translation a reality. Microsoft has been teasing a new feature called Skype Translator that can translate a Spanish speaker into a synthesized English voice (and vice-versa) live during a voice or video call. Google—not to be outdone, according to The New York Times—will soon update its Google Translate app with the ability to detect if someone is speaking a popular foreign language, and translate their speech into text in real-time too.

It's fascinating stuff—imagine travelling anywhere in the world without fear of being misunderstood!—but doing this sort of work quickly and accurately is still a puzzle that hasn't really been solved.

"The reason that real-time [translation] is difficult for most of us is that it's really a matter of probabilities," said Gerald Penn, associate chair of the University of Toronto's department of computer science, and a specialist in natural language processing.

In a modern speech recognition system, a computer is typically trained on a language model—essentially, a database of what people are most likely to say, and in what order. Using this model, a computer gathers speech data from a microphone, and makes some educated guesses about what was actually said.

"The modern approach is not to make the guess right away," Penn explained, "but to collect the evidence, and then rank it, score it, and augment it." The challenge is performing this process fast and accurately enough that you can create the illusion of a conversation, where the translation appears to happen in real-time.
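Penn's "collect the evidence, then rank it" approach can be sketched with a toy bigram language model that scores competing transcription hypotheses. Everything here—the word pairs, the probabilities, the hypotheses—is made up for illustration; real systems use far larger models and acoustic scores as well:

```python
import math

# Toy bigram language model: log P(word | previous word).
# These probabilities are illustrative, not from any real system.
BIGRAM_LOGPROB = {
    ("<s>", "recognize"): math.log(0.6),
    ("<s>", "wreck"): math.log(0.1),
    ("recognize", "speech"): math.log(0.7),
    ("wreck", "a"): math.log(0.5),
    ("a", "nice"): math.log(0.4),
    ("nice", "beach"): math.log(0.3),
}

def lm_score(words):
    """Sum the log-probability of each word given its predecessor."""
    score = 0.0
    prev = "<s>"  # sentence-start marker
    for w in words:
        # Unseen bigrams get a small smoothing penalty instead of zero.
        score += BIGRAM_LOGPROB.get((prev, w), math.log(1e-4))
        prev = w
    return score

# The acoustic front end proposes several hypotheses for the same audio;
# the language model helps decide which word sequence is more plausible.
hypotheses = [
    ["recognize", "speech"],
    ["wreck", "a", "nice", "beach"],
]
ranked = sorted(hypotheses, key=lm_score, reverse=True)
```

Here the model prefers "recognize speech" because that word sequence is more probable under its training data, even though both hypotheses might sound nearly identical to the microphone.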

Part of the reason current speech recognition software, like Google's voice search or Apple's Siri, appears to recognize speech and convert it to text so quickly, says Penn, is that its search space is limited. In other words, people tend to use a relatively restrained vocabulary when they search, and so Google's language model is geared towards this.

And not only are there fewer words the system needs to recognize—meaning it can make its guesses faster—but the speech input is often relatively high-quality, too. People tend to speak more slowly and enunciate when they know they're talking to a machine.

In a language translation scenario, however, the process of recognizing speech is more complex. The most obvious difference is that, rather than training the computer on a limited language model of query-like language, the computer must be trained on a wider-ranging model of conversational speech. As a result, the search space can be rather large, and the number of probabilities to evaluate quite high. The challenge, according to Penn, is figuring out how big the search space will be—how many hypotheses the system is willing to entertain at any one time, how many ranked solutions it's able to keep in memory—and making a tradeoff between speed and accuracy.
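The tradeoff Penn describes—how many hypotheses to entertain, how many ranked solutions to keep in memory—is commonly implemented as a beam search, where a "beam width" caps the number of hypotheses kept at each step. A minimal sketch, with an illustrative scoring function of my own invention:

```python
def beam_search_step(hypotheses, extensions, score, beam_width):
    """Extend each partial hypothesis with each candidate word,
    then keep only the top beam_width results.

    hypotheses: list of (word_sequence, cumulative_score) pairs
    extensions: candidate next words proposed by the acoustic model
    score: function(sequence, word) -> incremental log-probability
    """
    candidates = []
    for seq, s in hypotheses:
        for word in extensions:
            candidates.append((seq + [word], s + score(seq, word)))
    # A wider beam keeps more hypotheses (more accurate, but slower);
    # a narrower beam prunes aggressively (faster, but riskier).
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[:beam_width]

# One step over a toy example: the acoustic model proposes three
# candidate words, but we only keep the two best-scoring hypotheses.
def toy_score(seq, word):
    return {"speech": -0.5, "beach": -2.0, "peach": -3.0}[word]

kept = beam_search_step([([], 0.0)], ["speech", "beach", "peach"],
                        toy_score, beam_width=2)
```

Widening the beam is exactly the speed-for-accuracy trade the article describes: more hypotheses survive each step, so the right answer is less likely to be pruned away, but every subsequent step has more work to do.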

Audio quality is a concern, too, as conversations between two people rarely take place in a vacuum. There might be background noise—maybe a child screaming, or a police siren. Perhaps one of the participants is too far from the microphone, or doesn't pronounce a word perfectly. In general, people speak more quickly and casually with one another than when speaking to a machine. "All of those things yield to some error in what was actually detected," says Penn.

And on top of all this is the challenge of translation itself. Machine translation is already quite good, assuming you're feeding the translation engine complete sentences or paragraphs of text. But in real-time translation, that's obviously not the case, and you can't feed the translation engine word-by-word, either. Context in language is key, and the quicker the speech recognition engine can recognize sequences of words that can be accurately translated, the faster that translation can take place.
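One plausible way to split that difference—not word-by-word, but not waiting for full paragraphs either—is to buffer recognized words until a likely phrase boundary, such as a pause or a piece of punctuation, and then hand each chunk to the translation engine. This is a simplified sketch of that buffering idea, not a description of how Google or Microsoft actually do it:

```python
def chunk_for_translation(tokens, boundary_marks=(".", ",", "?", "!")):
    """Group a stream of recognized tokens into translatable chunks,
    splitting at punctuation that suggests a phrase boundary."""
    chunks, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok in boundary_marks:
            chunks.append(current)
            current = []
    if current:  # flush any trailing partial phrase
        chunks.append(current)
    return chunks
```

Each chunk gives the translation engine enough surrounding context to resolve ambiguous words, while still being short enough to translate before the speaker has finished their next thought.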

In spite of all this, what Google and Microsoft have managed to pull off is no small feat. Even if the experience is a bit, as the Times' Quentin Hardy described it, "as if two telemarketers were using walkie-talkies," it's a tantalizing glimpse of what's to come.

Or, as Google Translate would say, "es una tentadora idea de lo que está por venir."