In November, a federal court ruling revealed that CSIS, Canada's CIA analog, operated a secret metadata collection program for a decade. Metadata is all of the information (time stamps, locations, names, and numbers) wrapped around our digital communications.
The police line is often that you shouldn't worry because they're "just" collecting metadata. But as privacy advocates and technologists have noted over and over, metadata can reveal a lot of very personal information. Now, researchers from Norwegian telecom Telenor, the MIT Media Lab, and big data nonprofit Flowminder have concluded that metadata from your cell phone can reveal if you're unemployed, or even what you do for a living.
In a paper posted to the arXiv preprint server, and not yet peer reviewed, the researchers describe how they used metadata (again, not the content of communications) from a telecom in a South Asian country to guess an individual's occupation. The researchers say they can't divulge the company or the nation. The system ended up being 67.5 percent accurate overall, with the "clerk" profession peaking at 73.5 percent prediction accuracy.
The researchers' goal was to design a system that can determine employment statistics in developing countries without solid data. As Telenor researcher Pål Sundsøy told me in an email, it's possible to feed anyone's formatted cell phone metadata into the system and have it predict whether you fit into one of the 18 profession "groups" they identified—a student, an agriculture worker, a landlord, etc.
The system relies on deep learning, a type of software that trains itself to look for patterns in large amounts of data.
"As such applications emerge it is important to be transparent around the decision making process—especially as intelligent machines make errors sometimes, too," Sundsøy wrote. "In the field of social sciences this includes always validating the methodology to actual ground truth data, and use it as a complementary source of insight."
The researchers obtained cellular network logs for 76,000 people from the telecom company. These logs included metadata such as phone model, level of interaction per phone contact, which cell phone towers subscribers connected to and when, the number of sent and received text messages, and how much they spent on their phone plan. Next, the researchers matched the metadata with individuals from two national surveys that asked respondents to reveal their occupation. This allowed them to tease out the defining characteristics of cell phone usage for people with different jobs.
They then used this dataset to train a deep learning algorithm to look for those patterns in metadata, and found that it was shockingly accurate—but not perfect by any means. And, Sundsøy wrote, this is just one study with a very specific dataset and method of collecting it, so there are limitations to what they could find.
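For a concrete picture of that match-then-train process, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the records are invented toy values rather than data from the study, the three features and all function names are hypothetical, and a simple nearest-centroid classifier stands in for the researchers' deep learning model, whose details aren't described here.

```python
from collections import defaultdict

# Invented toy records, NOT data from the study.
# Each row: (subscriber_id, texts_per_day, towers_per_day, monthly_spend)
metadata = [
    ("a1", 40.0, 2.0, 5.0),
    ("a2", 35.0, 3.0, 6.0),
    ("b1", 5.0, 9.0, 12.0),
    ("b2", 8.0, 8.0, 11.0),
    ("a3", 38.0, 2.0, 4.0),
    ("b3", 6.0, 10.0, 13.0),
]

# Hypothetical survey answers keyed by the same subscriber id.
survey = {
    "a1": "student", "a2": "student", "a3": "student",
    "b1": "clerk", "b2": "clerk", "b3": "clerk",
}


def join_labels(rows, answers):
    """Pair each feature vector with the occupation its owner reported."""
    return [(list(row[1:]), answers[row[0]]) for row in rows if row[0] in answers]


def train_centroids(examples):
    """Average the feature vectors of each occupation (stand-in for a deep model)."""
    sums, counts = {}, defaultdict(int)
    for features, label in examples:
        previous = sums.get(label, [0.0] * len(features))
        sums[label] = [s + f for s, f in zip(previous, features)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


def predict(centroids, features):
    """Return the occupation whose average profile is closest to this subscriber."""
    def squared_distance(label):
        return sum((c - f) ** 2 for c, f in zip(centroids[label], features))
    return min(centroids, key=squared_distance)


def accuracy_report(centroids, examples):
    """Compute overall accuracy and accuracy for each occupation separately."""
    correct, total = defaultdict(int), defaultdict(int)
    for features, label in examples:
        total[label] += 1
        if predict(centroids, features) == label:
            correct[label] += 1
    overall = sum(correct.values()) / sum(total.values())
    per_class = {label: correct[label] / total[label] for label in total}
    return overall, per_class


examples = join_labels(metadata, survey)
train, held_out = examples[:4], examples[4:]  # keep two records for evaluation
centroids = train_centroids(train)
overall, per_class = accuracy_report(centroids, held_out)
print(overall, per_class)
```

The per-occupation breakdown is the same kind of figure as the "clerk" accuracy quoted above, reported alongside the overall number. Results on toy data this clean say nothing about real-world performance.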
"In future research it's important to understand how often the model needs to be retrained, as market and technological conditions might change the signals of human behaviour," Sundsøy explained.
"I think there are many ways accuracy of such studies can be improved," he continued. "One way is to combine more data sources, and build even smarter variables as input to models. Another way is to let the machine decide which variables are smart with minimal input from the user."
Regardless of future applications, it's clear that metadata is never "just" metadata—it contains everything a determined person needs to find out all sorts of sensitive things about your life, like if you're spending all day working retail, or eating Cheetos.