Google may have DeepMind, but Baidu, China's homegrown Google, has Deep Speech.
Deep Speech, which debuted in December 2015, is a speech recognition system that uses an artificial neural network to translate audio input directly to transcribed output. By contrast, most speech recognition systems, including Siri, use multiple, engineer-crafted steps to make translations.
The system has learned how to recognize and transcribe both English and Mandarin, and according to a Baidu paper released in February 2016, it transcribes speech more accurately than most native Mandarin speakers. Baidu announced earlier in April that it will begin rolling out the Deep Speech technology in collaboration with Peel, a smart remote app that will be available in both English and Mandarin for Android, followed by iOS.
True, Deep Speech hasn't received the same amount of press as AlphaGo, Google's champion deep-learning-based Go program, but speech recognition technology might forever alter how humans interact with their mobile devices within the next decade, especially for users in China.
While English speakers tend to find typing roman characters relatively fast and easy, typing in Mandarin is typically more time consuming, said Adam Coates, director of Baidu's Silicon Valley AI Lab in Sunnyvale, California.
There are over 80,000 Chinese characters, though most contemporary Mandarin speakers use only between 1,000 and 3,500, and each character generally represents one 'word' or meaning. To help make typing in Chinese easier, Mandarin speakers use various forms of input editors to type in "pinyin," the standard system of romanized spelling for transliterating Chinese.
In 2015, 89 percent of China's internet-using population was mobile, compared to 75.1 percent in North America, according to We Are Social and Statista respectively. In addition, the ways in which Chinese users interact with their mobile phones differ from those of most English-speaking users, according to Adweek. Not only do they use more transcription software, but they also stream more videos and engage more often with mobile ads.
"In China because there are a lot of interface challenges, a lot of habits of mobile users are much more sophisticated than in the US, because that's their main access to internet," Coates said. For example, Chinese users are accustomed to paying at vending machines with their phones or QR codes, something Coates feels awkward doing.
For this reason, he thinks Chinese users will adopt speech-to-text tools like Deep Speech more rapidly than Americans have adopted tools like Siri or Google Now.
Deep Speech is even able to transcribe "hybrid speech," a reference to the combination of Mandarin and English speech used by many Mandarin speakers, Coates said. "'IPhone' is a very popular word, and because the system is entirely data-driven it actually learns to do hybrid transcription on its own," Coates said. "It has English and Mandarin characters and it just learns when someone says 'I own an iPhone' and says it in Mandarin, it will actually switch to English and print out 'iPhone' in roman characters."
In Coates's vision, users in China and elsewhere will use voice to unlock doors, turn on lights, speak to their cars and much more in the near future. It's his lab's goal to get at least 100 million users, and with more than 900 million Mandarin speakers struggling to type on their phones, that may not be as ambitious as it sounds.