Anytime there is a major event going down, my first reaction is to check Twitter to see what it's all about. By simply manipulating a few choice keywords I am able to get a real-time feed of updates from around the world related to the topic, whether it's the Olympics or the latest climate-fueled disaster.
This is a great way to get up-to-the-minute news in theory, but in reality, filtering through millions of tweets is incredibly inefficient. In an effort to combat the explosion of unstructured textual data on the internet, a team of researchers at the Georgia Institute of Technology has developed a new data visualization technique, called SentenceTree, that helps Twitter users cut through the bullshit.
The goal of SentenceTree is to give Twitter users a meta-level view of what is being discussed about a given topic. It uses a proprietary algorithm to parse hundreds of thousands of related tweets and compress all this textual data into a single "uber-tweet" of between 100 and 200 words.
"We're trying to capture this large number of tweets and communicate their theme, contents, ideas through a very concise and relatively direct visualization," John Stasko, a professor of interactive computing at Georgia Institute of Technology, told Motherboard. "Most of the prior approaches to textual data visualization use some kind a word cloud or something like that, but you're really just getting the individual words there. SentenceTree fills in the blanks a little bit more between those words and also provides more context."
The visualization tool, which was created by Stasko's PhD student Mengdie Hu, is implemented in a web browser and works in near real time. The first time SentenceTree was released into the wild was during the 2014 World Cup, when it analyzed nearly 250,000 tweets sent during a 15-minute window.
As you might expect from the resulting word tree (available on GitHub), this trial occurred during the opening match of the World Cup, which saw Brazil face off against Croatia. The first goal of the Cup was scored by Brazil's Marcelo, against Brazil. The tweets in the visualization also capture the dismay that this event triggered, with keywords like 'own goal' and 'bad' prominently factoring into the visualization.
"SentenceTree does the logical text analysis to throw away stop words like 'the,' 'a,' 'and,' etc," said Stasko. "The meaty content words that remain are analyzed by frequency so you see the words that are sized more frequently. But unlike a word cloud one word may appear multiple times in there if it's being used in different contexts."
The end result is an easy-to-understand uber-tweet that gives users an instant idea of what is being discussed in real time and the general context of this discussion. Although the demo was originally applied only to tweets, Stasko said this method of analysis can be applied to any large unstructured textual database. While Stasko doesn't think it will completely kill hashtag culture, it could render the indexing function of hashtags pretty superfluous, since the algorithm targets words in context rather than individual words on their own.
"Our goals is fast triage and fast understanding," said Stasko. "We're trying to help investigative journalists, law enforcement, and academic researchers. Anybody who is going to come across a large collection of text documents will have a usage scenario for SentenceTree."