It’s a tension as old as Wikipedia itself: the “anyone problem.” Most of the time, letting anyone write and edit works out just fine for Wikipedia. But beyond the fast and furious editing of controversial topics (politics, religion, circumcision), there’s also the issue of “anyone” including people with less than pure motives. How do you distinguish someone who’s just innocently doing a lousy job from someone who’s intentionally deceptive? That’s what researchers from the University of Alabama at Birmingham are trying to do by analyzing the Wikipedian justice system.
Far from the Wild West that our high school teachers warned us it was, Wikipedia has rules. It forbids both sockpuppet accounts and the practice of “paid advocacy editing,” in which someone is paid to edit articles for promotional purposes. So it’s not like the Wikimedia Foundation caught anyone by surprise when it announced in October that it had shut down over 250 accounts for exactly those violations.
While Wikipedia’s hive of editors can usually sniff out and correct an endorsement, figuring out whether someone is working from an alias or sockpuppet account is trickier. Right now, it depends on another editor manually noticing suspicious behavior and reporting it up the Wikipedia hierarchy.
The most telling evidence, a user’s IP addresses, can be checked only by special administrators, and even then it’s easy to see how doing so could be read as a violation of privacy. As the backlash against Google’s requirement that YouTube commenters sign up for Google+ accounts reminded us, users love anonymity.
But what if there were a way to judge whether someone was sockpuppeting around without ever touching their IP address? What if a roving algorithm could be taught to recognize sockpuppetry based only on what was written?
The UAB team compiled data from Wikipedia’s sockpuppet investigation page and from discussion pages where sockpuppets were known to roam, in order to create a large “corpus” where they and future programmers can practice building and teaching a machine-based sockpuppet predator. Wikipedia was a desirable place to start because, according to the researchers, “the [sockpuppet] authors were not aware of someone collecting their writings to study attribution, thus this new data set will allow the study of deceptive writing in the wild.”
Drawing on this new corpus, the UAB team developed an algorithm that examined some 230 features of an editor’s writing, things like grammar, punctuation, and other syntactic markers, taken from articles’ discussion sections, where authorial voice is slightly clearer even if the posts are brief. It identified sockpuppet accounts with about 75 percent accuracy. Not bad for a start.
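To make that concrete, here is a minimal sketch of what stylometric authorship attribution looks like in practice. The article doesn’t list the UAB team’s 230 features, so this stand-in uses character n-grams, which capture punctuation and spelling habits, plus a linear SVM; the comments, account names, and the suspect text below are all hypothetical.

```python
# A toy stylometric attribution pipeline, NOT the UAB team's actual method.
# Character n-grams stand in for their 230 grammar/punctuation/syntax features.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical training data: short discussion-page comments labeled by the
# account that wrote them.
comments = [
    "Per WP:RS, that source is clearly unreliable.",
    "clearly unreliable source , see WP:RS !!",
    "I have reverted the edit pending further discussion.",
    "reverted again!! stop pushing your POV",
]
authors = ["AccountA", "AccountB", "AccountA", "AccountB"]

# Character 2-4-grams pick up the punctuation, spacing, and spelling quirks
# that tend to survive even when a writer switches accounts.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LinearSVC(),
)
model.fit(comments, authors)

# If a "new" account's comment is attributed to an existing account, that
# pair becomes a candidate sockpuppet for human review.
suspect = "stop reverting!! the source is fine , no POV here"
print(model.predict([suspect]))  # e.g. ['AccountB']
```

The design idea is the same one the researchers exploit: you don’t need an IP address if the writing itself is the fingerprint.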
The study the researchers published starts with Wikipedia, but it points toward algorithms that need an average of only about 500 words to match a text to its author:
“This type of authorship attribution of short text has potential applications in identifying terrorists in web forums, online discussion boards, phone text messages, tweets and other social media interactions where comments and text tend to be brief and short in length.”
At this point you might begin to feel a bit iffier about defrocking sockpuppets. Sure, catching people cheating at Wikipedia seems as noble as anything, especially because Wikipedia shuts down only the sockpuppet accounts, not the main ones, when it catches someone running afoul of its rules. But if revealing an IP address is a violation of privacy, how would a program like the ones proposed not be one too?
Many YouTube commenters had their own reasons for not wanting their comments linked to their names. Was that because, as Josh Constine wrote for TechCrunch, YouTube comments were “a haven for homophobia and racism” that the anonymous commenting system permitted? Or is it conceivable that people want anonymity for less vicious reasons?
And this is where the Wikipedia “anyone problem” kind of expands out to become everyone’s “anyone problem.” Once tools like this exist, will anyone be able to use them?