


Beware of Cheap Data

Social scientists often miss the inherent biases in social media data-streams.
Image: archives.gov

Beware of easy data. The massive, cheap datasets promised by social media pipelines like Twitter are likely offering dangerous distortions of the real world.

This, anyway, is the conclusion of a pair of computer scientists, Derek Ruths and Juergen Pfeffer, based at McGill University and Carnegie Mellon University, respectively, as described in the current issue of Science. With thousands of papers based on social media data now being published each year—compared to a handful just five years ago—the situation might even be viewed as quite dire. Imagine astronomers, newly armed with telescopes, trying to chart the movements and development of galaxies without understanding the influence of black holes: a hidden gravitational influence—or hidden bias.


Bias is the key term as we attempt to extract meaningful observations from the non-stop social media avalanche of conversations, pronouncements, locations, images, categories, and on and on. In the face of these sheer volumes, it's easy to delude oneself into thinking that those volumes are capable of delivering the random (or otherwise specified) sample needed to conduct good research.
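The statistical point is worth making concrete. Here is a minimal simulation, with entirely made-up numbers, of what happens when a "sample" is drawn only from a platform's user base rather than from the population at large: the estimate settles on the platform's skew, no matter how much data is collected.

```python
import random

random.seed(0)

# Hypothetical rates for illustration: 20% of the general public holds
# some opinion, but the subgroup active on the platform skews to 50%.
def holds_opinion(on_platform):
    p = 0.5 if on_platform else 0.2
    return random.random() < p

# A genuinely random population sample, assuming 40% use the platform.
def true_rate(n):
    hits = 0
    for _ in range(n):
        on_platform = random.random() < 0.4
        hits += holds_opinion(on_platform)
    return hits / n

# The "cheap data" sample: only platform users, however many we collect.
def platform_rate(n):
    return sum(holds_opinion(True) for _ in range(n)) / n

print(round(true_rate(100_000), 2))      # ~0.32
print(round(platform_rate(100_000), 2))  # ~0.50, and more volume won't close the gap
```

The platform-only estimate converges, but to the wrong number: volume reduces random error, not selection bias.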

Ruths and Pfeffer liken our present state of social media-based inquiry to the early days of telephone polling. Infamously, the Chicago Tribune trusted its new circa-1948 sampling methods enough to publish the post-election headline "Dewey Defeats Truman," only to learn shortly thereafter that Truman had actually won and that the polling methods had enormously oversampled Dewey supporters.

"Not everything that can be labeled as 'Big Data' is automatically great," Juergen notes in a statement. "People want to say something about what's happening in the world and social media is a quick way to tap into that. You get the behavior of millions of people—for free."


As Pfeffer and Ruths explain, social science researchers often underestimate the degree to which different social media platforms are favored by certain segments of the population. Instagram, for example, is slanted toward 20-something African-Americans, Latinos, women, and urban dwellers, while Pinterest is big with women in households with incomes greater than $100,000.


Making the situation worse is that social media feeds are usually the result of some proprietary filtering process. What goes in is not necessarily what comes out, and researchers often don't have a way of knowing what exactly happens in between.

"Developers of online social platforms are building tools to serve a specific, practical purpose—not necessarily to represent social behavior or provide good data for research," Pfeffer and Ruths write. "So the way data are stored and served can destroy aspects of the human behavior of interest."

Google, for example, stores only the final, auto-completed queries that users submit, not the text they originally typed. Twitter, meanwhile, takes apart retweet chains by connecting every RT back to the original source rather than to the post that actually triggered it. Is that "natural"?
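The retweet flattening is easy to picture with a toy cascade. In this sketch (the record layout is invented for illustration, not Twitter's actual schema), user C retweets user B, who retweeted the original post by A, yet the served data points both retweets back to A:

```python
# Records as a Twitter-style feed would serve them: every retweet is
# attributed to the ORIGINAL tweet, not the one the user actually saw.
served = [
    {"id": "B", "retweet_of": "A"},
    {"id": "C", "retweet_of": "A"},  # C really retweeted B, but that hop is gone
]

# The real diffusion path, unrecoverable from the served records:
actual_path = [("A", "B"), ("B", "C")]

flattened = [(r["retweet_of"], r["id"]) for r in served]
print(flattened)  # [('A', 'B'), ('A', 'C')] -- the B->C hop has vanished
```

A researcher studying how information spreads person-to-person would reconstruct a star around A, when the real structure was a chain.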

Rather than sampling a pure data stream, social scientists might well just be sampling some algorithm. There are "embedded researchers," as Pfeffer calls scientists with special behind-the-scenes access, but this creates a new problem of a divided research community. Some can see the real feeds; others get an illusion.

The effect is that independent researchers lack the needed access to naked data, while the researchers that do have access remain at the mercy of the social media platform, which obviously will have its own interests in mind as much as basic social research.

Yet another fundamental problem is that the names and handles behind social media data streams mostly remain unverified. There are PR shills, fake accounts, bots, dummy accounts-for-hire, and other things that might be considered background noise, except for the difficulty of separating that noise from the desired signal. Researchers may examine larger and larger datasets in the hope that the noise washes out, but it's not quite that easy.
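Why doesn't more data help? Because bots are a systematic offset, not random noise. A quick sketch, again with invented numbers: suppose 25% of real users express some sentiment, while a bot farm posting it 100% of the time makes up 10% of accounts.

```python
import random

random.seed(1)

# Made-up rates for illustration: 10% of accounts are bots that always
# post the sentiment; real users post it 25% of the time.
def observed_rate(n):
    hits = 0
    for _ in range(n):
        if random.random() < 0.10:        # bot account
            hits += 1
        else:                             # human account
            hits += random.random() < 0.25
    return hits / n

for n in (1_000, 1_000_000):
    print(n, round(observed_rate(n), 3))
# The estimate converges -- but to roughly 0.325, not the true human
# rate of 0.25. Bigger samples shrink the variance, never the offset.
```

The gap between 0.325 and 0.25 is pure bot contamination, and it is exactly the same at a thousand posts as at a million.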

Other researchers might not even bother, of course. A Twitter feed looks so easy, looks so random.

The answer lies mostly in social scientists acting more like actual scientists, or at least teaming up with scientists able to apply statistical and machine learning techniques to noisy data. And yet, if the allure of social media datasets in the first place is their cheapness and availability, the more likely outcome may simply be a whole lot less social media-based research. That would probably be a good thing.