‘Russia’s Google’ Is Collecting a Database of Rare Languages
Street scene in Tomsk, Russia. Image: Adam Jones Adam63/Wikimedia Commons
Languages, like species, experience cycles of evolution, proliferation, and often extinction. An estimated 400 languages have died out over the last century, and thousands more are at risk of vanishing by 2100, according to the United Nations' endangered language list. Every dead language represents the loss of invaluable cultural capital, which is why experts passionately support preservation of all extant languages before it's too late.
Search engines are becoming a crucial part of the effort to prevent a looming linguistic mass extinction event. With their sophisticated digital translation tools, developed through machine learning, these engines can efficiently analyze large language databases and preserve them for posterity.
One of the leaders of this charge is Yandex, a Russian information technology giant often referred to as "Russia's Google," which is a bit of a misnomer since the two companies are actually fierce rivals in Europe.
Where Google Translate currently offers 100+ languages, Yandex Translate is close behind with a catalogue of 90. But Yandex has outpaced Google in processing major regional languages within Russia, including Tatar (5.5 million speakers) and Bashkir (1.2 million speakers). The company is also working with linguists to preserve some of Russia's most vulnerable languages, like Mari (500,000 speakers), Udmurt (324,000 speakers), and Hill Mari (23,000 speakers), all of which are available only on Yandex Translate.
"Usually, it's the many language enthusiasts who ask us to translate some language," Anton Dvorkovich, a developer at Yandex Translate, told me over email. (This exchange was mediated by Yandex spokesperson Matvey Kireev, who translated Dvorkovich's answers from Russian to English.) "That's what happened with the Mari language, where we had a huge help from the Mari Research Institute of Language, Literature, and History," an academic cultural preservation center located in Yoshkar-Ola, Russia.
Dvorkovich and his colleagues are not focused solely on languages within Russia. They are also working toward cementing Yandex as a world leader in the preservation of endangered languages that can't be found on other translation services.
"We believe that doing what we do helps to preserve rare languages for future generations and allows people to see just how culturally diverse and beautiful our world is," Dvorkovich said. "This is also a story about seeing how peoples of the world have been influencing each other for a very long time—we have never lived in a bubble, and we can see that by analyzing various languages."
A good example of this global language exchange is the little-known Caribbean Creole language Papiamento, which was recently added to Yandex Translate at the suggestion of one of the company's Netherlands-based employees, who happened to be one of an estimated 270,000 native Papiamento speakers in the world.
Papiamento is a melting pot language that has borrowed from Spanish, Portuguese, Dutch, English, Native American, and African vocabulary groups. Refining all of these ancestral contributions into the extant version of the language proved to be an interesting challenge for Yandex's machine translators.
By comparing millions of samples of its parent languages against Papiamento excerpts, these automatic translators learn key patterns and apply them to new words and phrases. Take this visualization below, which illustrates Papiamento's plural marker, "nan," lined up against the Spanish pattern of pluralizing with "s" or "es." (The Russian headings marked "папьяменто" and "испанский" mean "Papiamento" and "Spanish.")
Whenever a word does not seem to be pluralized with "nan," the automatic translator notes the exception and looks for a better analog (thus, the question mark sequence).
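To make the idea concrete, here is a toy sketch in Python of lining up plural markers and flagging exceptions. It is not Yandex's system, which the company describes only in general terms. The function names and the short word list are hypothetical, chosen for illustration, and a real translator would learn such patterns statistically from millions of examples rather than from hand-written suffix rules.

```python
def split_plural(word, markers):
    """Return (stem, marker) for the longest matching suffix, else (word, None)."""
    for marker in sorted(markers, key=len, reverse=True):
        if word.endswith(marker) and len(word) > len(marker):
            return word[: -len(marker)], marker
    return word, None


def align_plurals(pairs):
    """Split (Papiamento, Spanish) word pairs into aligned plurals and exceptions."""
    aligned, exceptions = [], []
    for papiamento, spanish in pairs:
        pap_stem, pap_marker = split_plural(papiamento, ["nan"])
        spa_stem, spa_marker = split_plural(spanish, ["es", "s"])
        if pap_marker and spa_marker:
            # Both words carry a recognizable plural marker: record the match.
            aligned.append((pap_stem, pap_marker, spa_stem, spa_marker))
        else:
            # No "nan" (or no Spanish plural ending): set the pair aside
            # so a better analog can be looked for.
            exceptions.append((papiamento, spanish))
    return aligned, exceptions


# Illustrative word pairs only, not data from Yandex.
pairs = [
    ("kasnan", "casas"),
    ("bukinan", "libros"),
    ("flornan", "flores"),
    ("hende", "gente"),
]
aligned, exceptions = align_plurals(pairs)
```

In this sketch, the first three pairs are matched on their plural markers, while the last pair, which has no "nan," is set aside as an exception, much like the question marks in the visualization.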
Yandex's algorithms can also analyze complex differences in grammatical structure between languages in order to translate them more accurately. This visualization shows the process of translating the English sentence "I saw a cat" into the Uzbek phrase "bir mushuk ko'rdim" with the help of Turkish-English parallel texts.
These advanced machine learning tools can even crack fictional languages. Taking a hint from Microsoft's search engine Bing, which debuted Klingon translation in 2013, Yandex Translate became the first engine to add Sindarin Elvish, or "Eglathrin" as it's known to native elven speakers. Sindarin is one of many languages dreamed up by J.R.R. Tolkien for his Middle-earth epics.
Dvorkovich said the company may add more fictional languages in the future, but the main focus for now is to continue cataloguing and preserving threatened languages.
"Our job is never really done," he told me, "because there are so many endangered languages spoken by even fewer people. We believe no matter the language and the number of people who speak it, the cultural value of that language should never be underestimated."