The World's Second Largest Wikipedia Is Written Almost Entirely by One Bot

'Wikipedia consensus is that an unedited machine translation, left as a Wikipedia article, is worse than nothing.'
Image: Cathryn Virginia 

Kyle Wilson is an administrator on the English Wikipedia and a global user renamer. He does not receive payment from the Wikimedia Foundation, nor does he take part in paid editing, broadly construed. You can follow him on Twitter @kwilsonmg.

Wikipedia’s founding goal is to make knowledge freely available online in as many languages as possible. To date, that knowledge has mostly been written in English. Wikipedia’s different language versions are called "editions," and the English edition recently surpassed 6 million articles. Having over a million articles is a feat that only 16 of the 309 editions have accomplished.

The Cebuano Wikipedia is the second largest edition of Wikipedia, trailing the English version by just over 630,000 articles and leading the Swedish and German editions by over 1.64 and 2.98 million articles, respectively. Its position is peculiar given that, according to the Encyclopedia Britannica, there are only approximately 16.5 million speakers of the language in the Philippines. Despite having over 5.37 million articles, it has only 6 administrators and 14 active users. The English edition, by comparison, has 1,143 administrators and 137,368 active users for over 6 million articles, at the time of writing.

According to research by Motherboard and comments from several global administrators (highly trusted users who specialize in combating vandalism across Wikipedia editions), this is due to the use of bots: automated tools that primarily carry out repetitive and mundane tasks but can also be used to generate Wikipedia entries. According to a paper published in the journal Proceedings of the ACM on Human-Computer Interaction, there are approximately 1,601 of these bots in existence across Wikipedia editions. While the English Wikipedia and other editions restrict these tools to such routine work, some editions have taken to using them to write content.

While this may not seem like an issue, when the majority of an edition’s content is written by a single bot, it can negatively impact the quality of the edition. The particular bot writing the Cebuano edition is called “Lsjbot” and was created by the Swedish physicist Sverker Johansson. His creation is responsible for over 24 million of the edition’s 29.5 million edits and, according to research done by Guilherme Morandini, another global administrator, has created 5,331,028 of the edition’s 5,378,570 articles, or 99.12 percent of its article creations. According to that same research, all but five of the edition’s top 35 editors are bots, with no human editors in the top 10. Based on this, Morandini argued that bots have taken over the Cebuano edition from human editors.

“Bots are the product of people,” Vermont, a long-time global administrator who asked to be referred to by their Wikipedia username, said. “They have not taken over any project; rather, they have simply disincentivized article creation with vast amounts of stub [articles].” Vermont also pointed out that Lsjbot has made “more edits…than there are speakers of Cebuano.”

Riley Huntley, a new global administrator, compiled a sample of 1,000 random articles that Lsjbot created. Of the articles in that sample that Motherboard reviewed, the majority were surprisingly well constructed.

According to Johansson, his bot operates on a few basic principles. To begin, he selects a semantic domain, an area of meaning and the words used to describe it. For instance, the domain “body” would include “foot,” “hand,” “face,” and so on. The next step in the process is to find machine-readable databases covering the domain; these provide the basic facts about each subtopic (foot, hand, face, etc.) to include within the articles. The machine-readable database that Lsjbot used for geography-based articles, for example, is called GeoNames.
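Johansson did not share Lsjbot’s source code, but this first stage, pulling structured facts about each subtopic from a machine-readable database, can be sketched in a few lines of Python against GeoNames’ public search API. This is a minimal illustration, not Lsjbot’s implementation; the `requests` library and the placeholder account name are assumptions (GeoNames requires a free registered username).

```python
import requests

GEONAMES_USER = "example_user"  # placeholder: GeoNames requires a registered account name

def fetch_place_facts(name):
    """Look up basic facts about a place via the public GeoNames search API."""
    resp = requests.get(
        "http://api.geonames.org/searchJSON",
        params={"q": name, "maxRows": 1, "username": GEONAMES_USER},
        timeout=10,
    )
    resp.raise_for_status()
    results = resp.json().get("geonames", [])
    return results[0] if results else {}

# A returned record carries fields such as name, countryName, population, lat, lng.
facts = fetch_place_facts("Cebu City")
```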

Once this information is obtained, the next step is to write formulaic, generic, and reusable templated sentences with spots for specific information; this will express, in text, the various facts for each article. The bot then fills in these sentences with the information from the machine-readable databases and adds infoboxes (like the sidebars seen on most developed biographies on Wikipedia), categories, and links to other articles as appropriate. Once this is all complete, the last step is to save the edit, thus uploading the content to the Wikipedia edition in question.
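Continuing that sketch, the templating and publishing stages might look like the following. The sentence template, the `render_stub` helper, and the use of the pywikibot framework are illustrative assumptions; pywikibot is a widely used framework for Wikipedia bots, but nothing here confirms Lsjbot is built on it.

```python
import pywikibot  # common Wikipedia bot framework; its use here is an assumption

# A formulaic, reusable sentence with slots for specific facts.
STUB_TEMPLATE = (
    "{name} is a populated place in {country} with a population of "
    "{population}, located at coordinates {lat}, {lng}."
)

def render_stub(record):
    """Fill the templated sentence with facts from a database record."""
    return STUB_TEMPLATE.format(
        name=record.get("name", "unknown"),
        country=record.get("countryName", "unknown"),
        population=record.get("population", "unknown"),
        lat=record.get("lat", "?"),
        lng=record.get("lng", "?"),
    )

facts = fetch_place_facts("Cebu City")  # helper from the previous sketch

# The final step: save the edit, publishing the stub to the target edition.
site = pywikibot.Site("ceb", "wikipedia")   # the Cebuano Wikipedia
page = pywikibot.Page(site, facts["name"])
page.text = render_stub(facts)              # infoboxes, categories, and links would be appended here
page.save(summary="Bot: create templated stub article")
```

Run in a loop over every record in a database like GeoNames, a script of this shape is how a single operator can end up with millions of article creations.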

Johansson said—and Motherboard verified by checking the bot's contributions log—that Lsjbot is currently doing maintenance work on the Cebuano Wikipedia and “no major” article creation projects are currently underway.

Lsjbot is responsible for the creation of articles about various species on the Cebuano, Swedish, and Waray-Waray Wikipedias. When asked why Lsjbot has stopped its article creations, Johansson responded that “opinions shifted” within the Swedish Wikipedia community and Waray-Waray editors were unable to form a consensus about the automatic creation of articles.

When reached for comment, the Wikimedia Foundation (the charity responsible for maintaining Wikipedia’s servers, software, and outreach) acknowledged the knowledge gap between editions, which limits access to information for those who only speak underrepresented languages. In an email to Motherboard, the Wikimedia Foundation’s Adora Svitak stated that the Foundation is attempting to resolve this by “providing local language communities with tools, resources, and partnerships.” These include resources and platforms, such as Wikimedia Cloud Services, for developers wishing to create bots and other tools. According to Svitak, however, policies around bots and their permitted uses are strictly up to the individual communities themselves. She also spoke of technical developments to help ease the burden on editors translating content, most notably the “content translation” tool, which has been used to publish over 500,000 articles.

When asked how he felt about the Wikimedia Foundation’s work on addressing these issues and the disparity between editions, Vermont stated that while the Foundation does conduct outreach, its progress toward “actually making any sort of difference” with the socioeconomic factors preventing users from contributing is “nonexistent.”

With this perceived lack of support, communities have taken to generating content through various means. Some have chosen to focus heavily on quality, whereas others prefer to have short, one- or two-sentence “stub” articles on as many topics as possible. When machine translations, such as those created with the content translation tool, are left unedited, this can cause problems. For instance, “village pump” when put through Google Translate can become “bomb the village” in Portuguese. While this example comes from a Wikimedia community consultation, errors like it can end up just as easily in “live” Wikipedia articles. “Wikipedia consensus is that an unedited machine translation, left as a Wikipedia article, is worse than nothing,” according to the English Wikipedia’s translation guide.

Lsjbot isn't the only automated way, or necessarily the best one, to help people create Wikipedia articles in different languages. Another tool, which relies on more human input, was created in 2018 by João Alexandre Peschanski and Érica Azzellini, who also co-wrote a paper on content transclusion bots. It was based on a more specialized framework created a year earlier by Richard Knipel, Wikimedian-in-Residence at the Metropolitan Museum of Art, for a “Museum of Babel” to help build Wikipedia articles for every possible work in an art collection.

Peschanski and Azzellini’s tool, Mbabel, automatically generates article drafts based on information stored in the semantic web database Wikidata, an open online database hosted by the Wikimedia Foundation and designed to be readable by automated software. Unlike the Foundation’s content translation tool, Mbabel does not allow for the direct publishing of articles. Instead, it puts generated content on a “user test page on Wikipedia,” with the intent that users then expand the basic templated information Mbabel supplies.
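Mbabel’s own code is not reproduced here, but the general pattern it relies on, reading an item’s structured statements from Wikidata and rendering them as draft text, can be sketched against Wikidata’s public EntityData endpoint. The item ID Q42 (Douglas Adams) serves only as a well-known stand-in, and the draft sentence is an invented example:

```python
import requests

def fetch_entity(qid):
    """Fetch a Wikidata item via the public EntityData endpoint."""
    url = f"https://www.wikidata.org/wiki/Special:EntityData/{qid}.json"
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return resp.json()["entities"][qid]

entity = fetch_entity("Q42")  # Q42 = Douglas Adams, used as a stand-in item
label = entity["labels"]["en"]["value"]
description = entity["descriptions"]["en"]["value"]

# Render a minimal draft sentence destined for a user test page, not direct publication.
draft = f"'''{label}''' is {description}."
print(draft)
```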

The demo article created with Mbabel that Azzellini shared with Motherboard is on the Portuguese Wikipedia and covers the Paulista Museum in São Paulo, Brazil. It was generated solely from the content available in the museum’s Wikidata entry. Mbabel is also capable of compiling information from multiple Wikidata entries, as was done for an article on the 2016 Brazilian elections.

This approach, however, does have drawbacks. Because it relies so heavily on Wikidata entries, the quality of the content Mbabel produces depends directly on the quality of the underlying Wikidata.

“Of course, each community should decide how to deal with bot-written content, but from my point of view, it's not beneficial for the Wikipedia project to deliver this kind of text [using the kind of templated information Mbabel creates] on the main domain as something equivalent to an encyclopedic article,” Azzellini said. “It can discredit other Wikipedia entries related to automatic creation of content or even the Wikipedia quality.”

There is still room for improvement when it comes to making Mbabel’s entries sound more human and getting the grammar and pronouns right. For example, inserting a sentence in Portuguese to say that someone was a film director can get complicated. Whereas in English the director’s gender doesn’t change the sentence structure, in Portuguese the structure is very much gender-dependent. This has forced Azzellini to stick with writing those sorts of sentences in the passive voice, playing it “safe” with the translations. She stressed, however, that “Mbabel doesn’t work as a bot and depends directly on human editing to be published.”
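The gender problem Azzellini describes can be illustrated with hypothetical templates (these are not Mbabel’s actual sentences). In English, one template fits any director; in Portuguese, the active-voice sentence needs the subject’s gender, while a rephrased neutral construction sidesteps the agreement entirely:

```python
# Active voice: the article and noun must agree with the subject's gender.
ACTIVE_PT = {
    "male":   "{name} foi um diretor de cinema.",    # "was a (male) film director"
    "female": "{name} foi uma diretora de cinema.",  # "was a (female) film director"
}

# A "safe" gender-neutral rephrasing that avoids the agreement problem:
# "Film directing is among {name}'s occupations."
NEUTRAL_PT = "A direção de cinema está entre as ocupações de {name}."

def director_sentence(name, gender=None):
    """Use a gendered template when the gender is known, else the neutral one."""
    template = ACTIVE_PT.get(gender, NEUTRAL_PT)
    return template.format(name=name)
```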

In its infancy, the English Wikipedia was similar to what the Cebuano edition is now, though with significantly fewer articles. A large number of its articles were also bot-generated. Since 2006, the English Wikipedia has had a “bot approvals group,” which supervises the approval of bots allowed to run and helps to enforce the bot policy, originally created in 2002. Since 2010, the English Wikipedia bot policy has included a section preventing the use of bots to generate content in the vast majority of cases.

Having the majority of an edition’s content written by a single bot is a double-edged sword. It can lead to credible concerns over quality, but it is also arguably better than nothing. Ultimately, more human editors knowledgeable in multiple languages are needed to help expand content and to review, improve, and clean up bot-made articles. At present, this is a daunting task, given that the Cebuano edition has only 14 active users and 5,331,028 bot-created articles.

“The problem for me is not whether or not to use… a templated information,” Azzellini said, “but to not critically think about where it is coming from and relying on the template as a definitive text instead of expanding and improving the content with your human capabilities of search, critical sense, analysis and review.”

Vermont ultimately views the Cebuano Wikipedia edition as a “pilot wiki” of sorts for the “idea of an article-creating bot.” He firmly believes that more work is needed to perfect the ability of bots to write. For the foreseeable future, he said, humans are needed to control article content and quality. “I’m of the opinion that bots could, at some point, do everything that a human can.”