Delmaine Donson via Getty Images
A “shocking” amount of the internet is machine-translated garbage, particularly in languages spoken in Africa and the Global South, a new study has found. Researchers at the Amazon Web Services AI lab found that over half of the sentences on the web have been translated into two or more languages, with quality often degrading through successive rounds of poor machine translation (MT), which they said raised “serious concerns” about the training of large language models.
“We actually got interested in this topic because several colleagues who work in MT and are native speakers of low resource languages noted that much of the internet in their native language appeared to be MT generated,” Mehak Dhaliwal, a former applied science intern at AWS and a current PhD student at the University of California, Santa Barbara, told Motherboard. “So the insight really came from the low-resource language speakers, and we did the study to understand the issue better and see how widespread it was.”

“With that said, everyone should be cognizant that content they view on the web may have been generated by a machine,” Dhaliwal added.

The study, which was submitted to the pre-print server arXiv last Thursday, generated a corpus of 6.38 billion sentences scraped from the web. It looked at patterns of multi-way parallelism, which describes sets of sentences that are direct translations of one another in three or more languages. It found that most of the internet is translated: 57.1 percent of the sentences in the corpus were multi-way parallel in at least three languages.

Like all machine learning efforts, machine translation is affected by human bias, and skews toward languages spoken in the Western world and the Global North. Because of this, the quality of the translations varies wildly, with “low-resource” languages from places like Africa having insufficient training data to produce accurate text.
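The parallelism metric the paper works with can be sketched in a few lines of Python. In this toy version, a “cluster” is a set of languages in which the same sentence appears, and a language’s average parallelism is the mean cluster size over clusters that include it; the clusters and language codes below are invented for illustration and are not from the study’s corpus:

```python
# Toy sketch of the multi-way parallelism metric described above.
# A cluster groups sentences that are translations of one another;
# a sentence's parallelism is the number of languages in its cluster.
# All data here is invented for illustration.
from statistics import mean

clusters = [
    {"en", "fr"},                    # 2-way parallel
    {"en", "fr", "de"},              # 3-way parallel ("multi-way")
    {"en", "fr", "de", "wo", "xh"},  # 5-way parallel
]

def average_parallelism(language: str) -> float:
    """Mean cluster size over clusters containing `language`."""
    sizes = [len(c) for c in clusters if language in c]
    return mean(sizes) if sizes else 0.0

print(average_parallelism("en"))  # mean of 2, 3, 5
print(average_parallelism("wo"))  # only in the 5-way cluster: 5.0
```

On this toy data the low-resource language (Wolof, "wo") shows up only in a highly parallel cluster, mirroring the study’s finding that low-resource languages have higher average parallelism than high-resource ones.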
“In general, we observed that most languages tend to have parallel data in the highest-resource languages,” Dhaliwal told Motherboard in an email. “Sentences are more likely to have translations in French than a low resource language, simply by virtue of there being much more data in French than a low resource language.”

High-resource languages, like English or French, tended to have an average parallelism of 4, meaning that sentences had translational equivalents in three other languages. Low-resource languages, like the African languages Wolof or Xhosa, had an average parallelism of 8.6. Additionally, lower-resource languages tended to have much worse translations. “We find that highly multi-way parallel translations are significantly lower quality than 2-way parallel translation,” the researchers state in the paper. “The more languages a sentence has been translated into, the lower quality the translations are, suggesting a higher prevalence of machine translation.”

In highly multi-way parallel languages, the study also found a selection bias toward shorter, “more predictable” sentences of between 5 and 10 words. Because the sentences were so short, the researchers found it difficult to characterize their quality. However, “searching the web for the sentences was enlightening,” the study stated. “The vast majority came from articles that we characterized as low quality, requiring little or no expertise or advance effort to create, on topics like being taken more seriously at work, being careful about your choices, six tips for new boat owners, deciding to be happy, etc.”

The researchers argued that the selection bias toward short sentences from low-quality articles was due to “low quality content (likely produced to generate ad revenue) being translated via MT en masse into many lower resource languages (again likely for the purpose of generating ad revenue).
It also suggests that such data originates in English and is translated into other languages.”

This means that a large portion of the internet in lower-resource languages is poorly machine-translated, which raises questions about the development of large language models in those languages, the researchers said. “Modern AI is enabled by huge amounts of training data, typically several hundred billion tokens to a few trillion tokens,” the study states. “Training at this scale is only possible with web-scraped data. Our findings raise numerous concerns for multilingual model builders: Fluency (especially across sentences) and accuracy are lower for MT data, which could produce less fluent models with more hallucinations, and the selection bias indicates the data may be of lower quality, even before considering MT errors.”