AI System Sorts News Articles By Whether or Not They Contain Actual Information

How much "news" is actually new?
Image: Gabe Bornstein

There’s a thing in journalism now where news is very often reframed in terms of personal anecdote and/or hot take. In an effort to have something new and clickable to say, we reach for the easiest, closest thing at hand, which is, well, ourselves: our opinions and experiences.

I worry about this a lot! I do it (and am doing it right now), and I think it’s not always for ill. But in a broader sense it’s worth wondering to what degree the overall news feed is being diluted by news stories that are not “content dense.” That is, what’s the real ratio between signal and noise, objectively speaking? To start, we’d need a reasonably objective metric of content density and a reasonably objective mechanism for evaluating news stories in terms of that metric.


In a recent paper published in the Journal of Artificial Intelligence Research, computer scientists Ani Nenkova and Yinfei Yang, of the University of Pennsylvania and Google, respectively, describe a new machine learning approach to classifying written journalism according to a formalized idea of “content density.” Their system classified news stories with an average accuracy of around 80 percent across a wide range of domains, from international relations and business to sports and science journalism, when evaluated against a ground truth dataset of already correctly classified news articles.

At a high level this works like most any other machine learning system. Start with a big batch of data—news articles, in this case—and then give each item an annotation saying whether or not that item falls within a particular category. In particular, the study focused on article leads, the first paragraph or two in a story traditionally intended to summarize its contents and engage the reader. Articles were drawn from an existing New York Times linguistic dataset consisting of original articles combined with metadata and short informative summaries written by researchers.

So, the first task was to take a whole bunch of NYT articles (just over 50,000) and compare their lead paragraphs to the aforementioned short summaries. The difference between a lead and its summary can be viewed as an indicator of information richness. We can presume the summaries maximize content density (that’s why they exist) and so they can act as a benchmark to compare article leads against. The actual content quantification was done in terms of another existing dataset containing big lists of words more or less likely to convey content (high content density: “official,” “united,” “today”; low content density: “man,” “day,” “world”).
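As a rough illustration of the word-list idea, the sketch below scores a passage by how many of its words fall in a high-content list versus a low-content list. The two tiny lists reuse only the example words quoted above. The function name `content_score` and the formula (high-content hits minus low-content hits, divided by total words) are assumptions made for illustration, not the scoring used in the paper.

```python
import re

# Toy word lists built only from the examples quoted in the article;
# the real lists are far larger.
HIGH_CONTENT = {"official", "united", "today"}
LOW_CONTENT = {"man", "day", "world"}


def content_score(text: str) -> float:
    """Return (high hits - low hits) / word count; 0.0 for empty text.

    Hypothetical scoring formula, for illustration only.
    """
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    high = sum(word in HIGH_CONTENT for word in words)
    low = sum(word in LOW_CONTENT for word in words)
    return (high - low) / len(words)


if __name__ == "__main__":
    print(content_score("An official said today that talks would resume."))
    print(content_score("A man walked out into the world one day."))
```

With these toy lists, the first sentence scores above zero and the second below it, which is the kind of contrast the word lists are meant to capture.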

So, we can imagine that each summary and article in a pair gets a score and the content density of a story is in the difference between these two scorings. These initial evaluations were done both via an automated system (mostly) and by the researchers themselves and Amazon Mechanical Turk workers (about 1,000 articles). In the end, we wind up with a big batch of news articles labeled as content dense or not and this is what gets fed to the machine learning algorithm, which basically builds its own internal abstract representation of what is and isn’t content dense.
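The labeling step described above can be sketched as a comparison of two scores. Here `label_lead`, the `max_gap` threshold, and its default value are all hypothetical; the article does not give the paper's actual rule, so this only shows the general shape of the idea.

```python
def label_lead(lead_score: float, summary_score: float, max_gap: float = 0.1) -> str:
    """Label a lead by how far its score falls below the summary's score.

    The summary is treated as the content-dense benchmark. `max_gap` is an
    arbitrary illustrative threshold, not a value from the paper.
    """
    gap = summary_score - lead_score
    return "content-dense" if gap <= max_gap else "not content-dense"


if __name__ == "__main__":
    print(label_lead(lead_score=0.22, summary_score=0.25))
    print(label_lead(lead_score=0.02, summary_score=0.25))
```

Labels produced this way, rather than by hand, are what would let a dataset of tens of thousands of articles be annotated mostly automatically.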

Interestingly, this varies a bit depending on the journalism domain. “In sports and science, the distribution of content-dense scores is clearly skewed towards the non content-dense end of the spectrum,” the study notes. “In these domains writers more often resort to the use of creative and indirect language meant to provoke readers’ interest.” (LOL.)

The model was then evaluated against a subset of labeled data that had been set aside for validation purposes. This is where we get the 80 percent statistic, which in the grand scheme of machine learning is OK verging on good. Across the total set of analyzed articles, only about half were found to have content-dense leads. Make of that what you will. (Sadly, there doesn’t seem to be an existing linguistic dataset for Fox News, yet.)
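For readers unfamiliar with held-out evaluation, here is a minimal sketch of the general technique: set part of the labeled data aside, then measure what fraction of the model's predictions on that part match the known labels. The helper names `split_holdout` and `accuracy`, the 20 percent holdout fraction, and the sample labels are illustrative assumptions, not details from the paper.

```python
import random


def split_holdout(items, holdout_fraction=0.2, seed=0):
    """Shuffle a copy of `items` and return (training, held_out) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) - int(len(shuffled) * holdout_fraction)
    return shuffled[:cut], shuffled[cut:]


def accuracy(predicted, actual):
    """Fraction of predictions that match the ground-truth labels."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty label lists of equal length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


if __name__ == "__main__":
    training, held_out = split_holdout(range(10))
    print(len(training), len(held_out))
    # Made-up labels: four of five predictions match the ground truth.
    predicted = ["dense", "not", "dense", "dense", "not"]
    actual = ["dense", "not", "not", "dense", "not"]
    print(accuracy(predicted, actual))
```

An accuracy of 0.8 in this toy example corresponds to the "80 percent" framing: four out of every five held-out items labeled the same way as the ground truth.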

“We have confirmed that the automatic annotation of data captures distinctions in informativeness as perceived by people,” the paper concludes. “We also show proof-of-concept experiments that show how the approach can be used to improve single-document summarization of news and the generation of summary snippets in news-browsing applications. In future work the task can be extended to more fine-grained levels, with predictions on sentence level and the predictor will be integrated in a fully functioning summarization system.”