How DARPA Plans to Decrypt the Languages That Computers Still Don't Understand

In the weeks that followed the 2010 earthquake in Haiti, it wasn’t just money that locals needed to rebuild, but people with whom they could speak. Even when medicine and clean water were available, foreign troops and aid workers couldn’t converse with locals about where those supplies were needed most. With far too few human translators available, hopes for more effective disaster relief fell to machine translation, but the Haitian Creole spoken by many of the country’s displaced people was largely unknown to computational linguistics.

About 10 million people speak Haitian Creole, but in the parlance of linguistics it is still a “low resource” language. These languages are mostly absent from the cross-referenced linguistic databases that feed modern translation software, have few written texts available for study, and aren’t widely used online. And yet they make up the vast majority of the world’s more than 7,000 linguistic divisions, and often dominate the most conflict-ridden nations on Earth.

That’s why DARPA is working on LORELEI, or Low Resource Languages for Emergent Incidents. It’s distinct from past military language projects in that it doesn’t seek to translate speech, but to decrypt it. With LORELEI, the hope is to build both a hardware and software platform that can be deployed amidst unknown languages, chew through massive amounts of speech and text, and automatically produce some real understanding of their meaning.

DARPA’s plan is to share LORELEI with the general public so it can simultaneously gather plenty of data while assisting some of the most desperate situations worldwide. In Nigeria, foreign troops hunt Boko Haram terrorists through areas using as many as 44 distinct languages. Ebola aid workers must try to treat patients in 19 distinct African languages. Even in the United States, Central American refugee children have been found to speak more than 20 languages. And of course LORELEI will also assist American troops in places like Afghanistan, where human translation has struggled to allow effective local diplomacy and intelligence gathering.

LORELEI program manager Dr. Boyan Onyshkevych told Motherboard that DARPA’s proposed system could allow for “much more detailed and timely assistance” in situations such as these. They want LORELEI to be able to output very basic results within as little as a day of exposure to a new language. The names of places or emotional states, for example, would be conveyed more like intelligence reports than conversational transcripts. It wouldn’t provide the sort of robust verbal bridge needed for aid workers to soothe a panicking Ebola patient, at least not at first, but it would allow a worker to figure out which village the patient came from and how many people live there.

Dr. Bonnie Dorr, a computational linguist at the Institute for Human and Machine Cognition, said that the challenge for low-resource analysis is twofold. You can’t just build a dataset—you have to understand it, too. “In [low resource] speech, you have no idea what’s coming your way,” she told Motherboard. “If you find documents, you have no idea what the nature of those documents are.”

The first step is to collect data. Lots of it. That data might come in the form of an aid worker recording a conversation with a refugee, a soldier taking pictures of signs and business ledgers, or a group of low-resource language speakers working with expert linguists. It’s also crucial to associate data with as much metadata as possible. For example, a string of meaningless words is much easier to analyze if you know it was shouted, and even easier if you know that it was shouted by a father at his son. A cryptic hand-written document is easier to decrypt if you know it concerns the local mayor.

To make sense of all the raw information, researchers often turn to linguistic universals, or laws that pervade virtually all of known human language. In short declarative sentences, for example, the subject of the sentence will almost always precede the object. This can be a powerful technique when used together with metadata; if an algorithm knows that it’s analyzing a short declarative sentence and has the likely subject or object already labelled, then some tentative word identification can begin.
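As a rough illustration of that idea, here is a minimal sketch, not anything DARPA or LORELEI has described. It assumes each short declarative sentence arrives already split into words, with one word labelled from metadata as the subject or the object. The function names and the sample words are invented for the example.

```python
from collections import Counter


def candidate_roles(tokens, known_index, known_role):
    """Use the subject-before-object tendency on one short sentence.

    tokens      -- words of the sentence, in order (hypothetical input)
    known_index -- position of the word already labelled from metadata
    known_role  -- "subject" or "object"
    """
    if known_role == "subject":
        # The object most likely comes somewhere after the subject.
        return {"object": tokens[known_index + 1:]}
    if known_role == "object":
        # The subject most likely comes somewhere before the object.
        return {"subject": tokens[:known_index]}
    raise ValueError("known_role must be 'subject' or 'object'")


def tally_votes(sentences):
    """Count how often each word is a candidate for each role."""
    votes = {"subject": Counter(), "object": Counter()}
    for tokens, known_index, known_role in sentences:
        guesses = candidate_roles(tokens, known_index, known_role)
        for role, words in guesses.items():
            votes[role].update(words)
    return votes


# Invented words standing in for an unknown language.
sentences = [
    (["mira", "tolu", "ka"], 0, "subject"),
    (["sen", "tolu", "ka"], 0, "subject"),
    (["mira", "dopa", "ka"], 2, "object"),
]
votes = tally_votes(sentences)
print(votes["object"].most_common())
print(votes["subject"].most_common())
```

Each sentence only narrows the possibilities, so the counts are tentative votes rather than identifications. A real system would need far more data and a way to weigh conflicting evidence.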

Word identification is just the start, though. The meaning and placement of even a few words can allow statistical guesses about a conversation’s topic, or the relationship between two speakers—but it’s also easy to be led astray. Dorr said that the challenge for natural language programmers is in many ways harder than classical codebreaking. “People don’t say what they mean,” she lamented. “They change what they mean from day to day… where they are, and even what the major happenings of the day have been.”

One coping strategy is to look at a low-resource language through the lens of a better-known parent language, called a bridge or pivot language. A language which shares some of its evolutionary history with Arabic, for example, might share some of the same rules of grammatical organization. Pivot languages provide tentative, case-by-case laws that can be applied to speed decryption, but it’s not a foolproof solution. Places like Nigeria have hundreds of languages from dozens of distinct lineages, making it harder to apply the pivot approach—and many of the most important pivot languages are themselves low-resource at present.

Nevertheless, LORELEI is a huge step in the right direction. Teaching computers a new language has always taken years of expert analysis and millions of dollars, according to Onyshkevych, and “LORELEI aims to solve this problem.”

The project should officially kick off in May.