Researchers Find 'Anonymized' Data Is Even Less Anonymous Than We Thought

Last fall, AdBlock Plus creator Wladimir Palant revealed that Avast was using its popular antivirus software to collect and sell user data. While the effort was eventually shuttered, Avast CEO Ondrej Vlcek first downplayed the scandal, assuring the public the collected data had been “anonymized”—or stripped of any obvious identifiers like names or phone numbers.

“We absolutely do not allow any advertisers or any third party…to get any access through Avast or any data that would allow the third party to target that specific individual,” Vlcek said.

But analysis from students at Harvard University shows that anonymization isn’t the magic bullet companies like to pretend it is.

Dasha Metropolitansky and Kian Attari, two students at the Harvard John A. Paulson School of Engineering and Applied Sciences, recently built a tool that combs through vast troves of consumer data exposed in breaches. They developed it for a class paper they’ve yet to publish.

“The program takes in a list of personally identifiable information, such as a list of emails or usernames, and searches across the leaks for all the credential data it can find for each person,” Attari said in a press release.

They told Motherboard their tool analyzed thousands of datasets from data scandals ranging from the 2015 hack of Experian to the hacks and breaches that have plagued services from MyHeritage to porn websites. Although many of these datasets contain “anonymized” data, the students say that identifying actual users wasn’t all that difficult.

“An individual leak is like a puzzle piece,” Harvard researcher Dasha Metropolitansky told Motherboard. “On its own, it isn’t particularly powerful, but when multiple leaks are brought together, they form a surprisingly clear picture of our identities. People may move on from these leaks, but hackers have long memories.”

For example, while one company might only store usernames, passwords, email addresses, and other basic account information, another company may have stored information on your browsing or location data. Independently they may not identify you, but collectively they reveal numerous intimate details even your closest friends and family may not know.

“We showed that an ‘anonymized’ dataset from one place can easily be linked to a non-anonymized dataset from somewhere else via a column that appears in both datasets,” Metropolitansky said. “So we shouldn’t assume that our personal information is safe just because a company claims to limit how much they collect and store.”
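The kind of linkage Metropolitansky describes can be illustrated with a toy example. The sketch below is not the students' tool; every record, column name, and the `link_on_column` helper are invented for illustration. It only shows how a column shared by two datasets can attach a name to a supposedly anonymous row.

```python
# Toy linkage attack: all records below are invented for illustration.

# "Anonymized" dataset: no names, but it keeps an email hash as a key.
anonymized_rows = [
    {"email_hash": "a1f3", "zip": "02138", "sites_visited": 412},
    {"email_hash": "9c2e", "zip": "10001", "sites_visited": 57},
    {"email_hash": "77b0", "zip": "94110", "sites_visited": 230},
]

# A separate breach that pairs the same key with a real identity.
identified_rows = [
    {"email_hash": "9c2e", "name": "Alice Example"},
    {"email_hash": "a1f3", "name": "Bob Example"},
]


def link_on_column(anonymized, identified, key):
    """Join two lists of dict records on a shared column (hypothetical helper)."""
    # Index the identified records by the shared key for fast lookups.
    by_key = {row[key]: row for row in identified}
    linked = []
    for row in anonymized:
        match = by_key.get(row[key])
        if match is not None:
            # Merge both records; the "anonymous" row now carries a name.
            linked.append({**row, **match})
    return linked


linked = link_on_column(anonymized_rows, identified_rows, "email_hash")
for record in linked:
    print(record["name"], record["zip"], record["sites_visited"])
```

In this made-up data, two of the three "anonymous" rows pick up a name. The third stays unlinked only because its key does not appear in the second dataset.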

The students told Motherboard they were “astonished” by the sheer volume of total data now available online and on the dark web. Metropolitansky and Attari said that even with privacy scandals now a weekly occurrence, the public is dramatically underestimating the impact on privacy and security these leaks, hacks, and breaches have in total.

Previous studies have shown that even within a single anonymized dataset, identifying users isn’t all that difficult.

In one 2019 UK study, researchers were able to develop a machine learning model capable of correctly identifying 99.98 percent of Americans in any anonymized dataset using just 15 characteristics. A different MIT study of anonymized credit card data found that users could be identified 90 percent of the time using just four relatively vague points of information.

A German study looking at anonymized user vehicle data found that 15 minutes’ worth of data from brake pedal use could let researchers identify the right driver, out of 15 options, roughly 90 percent of the time. A 2017 Stanford and Princeton study showed that deanonymizing user social networking data was also relatively simple.
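Why a handful of vague attributes can single someone out is easier to see with a small example. The sketch below does not reproduce any of the studies above; the records and the `unique_fraction` helper are invented. It simply counts how many rows become one of a kind as more columns are considered together.

```python
from collections import Counter

# Invented records with no names: (zip code, birth year, gender).
records = [
    ("02138", 1990, "F"),
    ("02138", 1990, "M"),
    ("02138", 1985, "F"),
    ("10001", 1990, "F"),
    ("10001", 1990, "F"),
    ("10001", 1972, "M"),
]


def unique_fraction(rows, columns):
    """Fraction of rows whose values in `columns` appear exactly once."""
    # Keep only the chosen columns from each row.
    projected = [tuple(row[i] for i in columns) for row in rows]
    counts = Counter(projected)
    unique = sum(1 for combo in projected if counts[combo] == 1)
    return unique / len(rows)


zip_only = unique_fraction(records, [0])
zip_and_year = unique_fraction(records, [0, 1])
all_three = unique_fraction(records, [0, 1, 2])
print(zip_only, zip_and_year, all_three)
```

With zip code alone, nobody in this toy data is unique. Adding birth year singles out two of six rows, and adding gender singles out four of six, even though no column is identifying on its own.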

Individually these data breaches are problematic—cumulatively they’re a bit of a nightmare.

Metropolitansky and Attari also found that despite repeated warnings, the public still isn’t using unique passwords or password managers. Of the 96,000 passwords contained in one of the program’s output datasets, just 26,000 were unique.
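To put those figures in proportion, the short sketch below works out the share. It assumes "unique" means distinct password values, which the article does not spell out. The `reuse_summary` helper and the sample list are invented to show how such a count could be made.

```python
from collections import Counter

# Figures reported in the article.
total_passwords = 96_000
unique_passwords = 26_000

# Share of entries that are distinct values, under the assumption above.
unique_share = unique_passwords / total_passwords
print(f"{unique_share:.1%} of the passwords were unique")


def reuse_summary(passwords):
    """Count distinct and repeated entries in a list of passwords."""
    counts = Counter(passwords)
    distinct = len(counts)
    return {
        "total": len(passwords),
        "distinct": distinct,
        # Entries that repeat a value already present in the list.
        "repeats": len(passwords) - distinct,
        "most_common": counts.most_common(1)[0],
    }


# Invented sample, not taken from any real leak.
sample = ["123456", "hunter2", "123456", "password", "123456", "password"]
summary = reuse_summary(sample)
print(summary)
```

On that reading, only about 27 percent of the passwords were distinct, and the remaining 70,000 entries repeated a password found elsewhere in the same dataset.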

The problem is compounded by the fact that the United States still doesn’t have even a basic privacy law for the internet era, thanks in part to relentless lobbying from a cross-industry coalition of corporations eager to keep this profitable status quo intact. As a result, penalties for data breaches and lax security are often too pathetic to drive meaningful change.

Harvard’s researchers told Motherboard there are several restrictions a meaningful U.S. privacy law could implement to potentially mitigate the harm, including barring unauthorized employees from accessing data, maintaining better records on data collection and retention, and decentralizing data storage (not keeping corporate and consumer data on the same server).

Until then, we’re left relying on corporations that have repeatedly proven their privacy promises aren’t worth all that much.