One of the many paradoxes of our techno-future involves the ever-accelerating field of machine learning and the ever-shrinking realm of personal privacy. Machine learning is predicated on the existence of training sets, vast collections of data that can be used to build new and better algorithms, but, clearly, excavating deeper and deeper data comes at the cost of individual identification. You are your data, right?
The question, then, is how to collect and use these troves of information while keeping the owners of that information truly anonymous. It's a contradiction, but maybe not necessarily so. Could there be a third way? Enter differential privacy, a long-theorized but sparingly implemented principle that can be defined very simply, courtesy of Northwestern University's Anthony Tockar, as follows:
The risk to one's privacy should not substantially increase as a result of participating in a statistical database. Thus an attacker should not be able to learn any information about any participant that they could not learn if the participant had opted out of the database. One could then state with some confidence that there is a low risk of any individual's privacy being compromised as a result of their participation in the database.
Differential privacy is more than just a nice-sounding principle. It's a method, or series of methods. It's something that can be proven and explained mathematically.
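For readers who want the math, one standard way to state the guarantee is the condition below. The notation is assumed here rather than taken from Tockar's text: M is a randomized query mechanism, D and D' are any two databases that differ in one person's record, S is any set of possible outputs, and ε is a small positive privacy parameter.

```latex
\[
\Pr[\,M(D) \in S\,] \;\le\; e^{\varepsilon} \cdot \Pr[\,M(D') \in S\,]
\]
```

Read informally, the smaller ε is, the less any one person's presence or absence can shift the probability of any answer the database gives.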
First, we should clarify what's meant by privacy in this context. In our databases, just assume that actual identities and corresponding data points are scrambled. We might know some individual is in the database, but the corresponding info isn't clear.
That doesn't mean the data is truly anonymized, however. Tockar gives this example:
Suppose you have access to a database that allows you to compute the total income of all residents in a certain area. If you knew that Mr. White was going to move to another area, simply querying this database before and after his move would allow you to deduce his income.
In other words, it's possible to infer supposedly hidden information about Mr. White. This is a general problem with anonymous data: accidentally revealing private info. We might not have names within an anonymous dataset, but we have forms. And from forms, a reasonably clever data trawler might come up with names.
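The before-and-after trick with Mr. White can be sketched in a few lines of Python. This is only an illustration: the resident names and incomes are invented, and `total_income` is a hypothetical stand-in for whatever aggregate query a real database might expose.

```python
# Toy data, invented for illustration only.
neighborhood = {
    "Mr. White": 1_000_000,
    "Resident A": 60_000,
    "Resident B": 45_000,
    "Resident C": 72_000,
}


def total_income(residents):
    """The only query the database exposes: an aggregate sum, no names."""
    return sum(residents.values())


# Query once while Mr. White still lives in the area.
before = total_income(neighborhood)

# Mr. White moves away; the attacker learns this from outside the database.
after_move = {name: income for name, income in neighborhood.items()
              if name != "Mr. White"}
after = total_income(after_move)

# The difference between two "anonymous" answers is one person's income.
inferred_income = before - after
print(inferred_income)
```

Neither query names anyone, yet subtracting one answer from the other recovers exactly one person's figure.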
As Tockar explains, the solution is noise. It's possible to apply some noise-generating mechanism to related datasets such that queries sensitive enough to reveal a dataset participant's identity give output with enough perturbation to hide the individual. (Phew.)
If we added enough noise to our example above in just the right places, we might instead see a whole bunch of imaginary people making the same move, masking the real Mr. White. The idea is that the system finds situations like this, where the resolution of some database aspect might reveal an individual, and adds noisy padding. This masks the individual while keeping the overall data intact.
This can be done using statistical distributions. Rather than output information as discrete data spikes, which might be traceable back to individuals, the database gives information in terms of smooth probabilistic curves. This noise is applied in proportion to the most different individual in a given data set. This person is the outlier in every situation, a career anomaly. If the average income in a neighborhood is $60,000, this person makes $120,000. If a neighborhood is 98 percent white, this person is African-American. You can see how being an anomaly becomes a privacy contradiction. This exceptional person's data is where we find the most noise.
"In our Mr. White example above, let's assume the total income in his original neighborhood is $50 million," Tockar writes. "After he leaves, this figure drops to $49 million. Therefore, one can infer that his true income is $1 million. To keep his income private, we have to ensure the query response is noisy enough to 'hide' this information. In fact, to ensure we privatize income for all people in our dataset we need to make sure the richest person is protected as well. As it turns out, Mr. White was the richest person in his neighborhood, so the sensitivity is $1 million."
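One common way to add noise in proportion to sensitivity is the Laplace mechanism, sketched below with the figures from the Mr. White example. Several things here are assumptions rather than anything stated above: the function names are hypothetical, the privacy parameter `EPSILON` is an arbitrary illustrative value, and the choice of the Laplace distribution itself is just one option that could be used.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a Laplace distribution centered on zero.

    The difference of two independent exponential draws with mean `scale`
    follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_total(true_total, sensitivity, epsilon, rng):
    """Answer a sum query with noise scaled to sensitivity / epsilon."""
    scale = sensitivity / epsilon
    return true_total + laplace_noise(scale, rng)


rng = random.Random(0)     # seeded only so the sketch is repeatable
SENSITIVITY = 1_000_000    # the richest resident's income, per the example
EPSILON = 0.5              # assumed value; smaller means more noise

before = noisy_total(50_000_000, SENSITIVITY, EPSILON, rng)
after = noisy_total(49_000_000, SENSITIVITY, EPSILON, rng)

# The difference no longer pins down a single person's income.
print(round(before - after))
```

With these settings the noise on each answer is typically on the order of a couple of million dollars, which is small next to a $50 million total but large enough to blur a single $1 million income.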
This system isn't perfect. Sometimes we want data at super-sensitive resolutions, and sometimes the compromise just doesn't work out. But the stakes are probably higher than most people would think; in 2006, a Stanford paper found that 63 percent of Americans could likely be identified using only their zip code, gender, and date of birth. (Note, however, that this percentage is down from 81 percent in 1990.) In one classic incident, the medical records of former Massachusetts governor William Weld were pulled from a supposedly anonymized database by comparing overlaps in that database with a voter registration database. Anonymity is a highly precarious thing.
Something similar to the Massachusetts incident was achieved by researchers at the University of Texas at Austin using Netflix's anonymized training database. Simply correlating that database, containing viewer ratings and whatnot, with the non-anonymized database of IMDb revealed personal details aplenty. "Using the Internet Movie Database as the source of background knowledge," the team reported, "we successfully identified the Netflix records of known users, uncovering their apparent political preferences and other potentially sensitive information."
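Both incidents rest on the same move: joining a "nameless" dataset to a named one on the fields they share. A minimal sketch follows, assuming zip code, gender, and date of birth as the shared fields. Every record is invented, and the field and function names are hypothetical.

```python
# All records are invented for illustration.
anonymized_medical = [
    {"zip": "11111", "gender": "F", "dob": "1970-03-02", "diagnosis": "asthma"},
    {"zip": "11111", "gender": "M", "dob": "1985-07-19", "diagnosis": "diabetes"},
    {"zip": "22222", "gender": "M", "dob": "1990-11-30", "diagnosis": "migraine"},
]

voter_roll = [
    {"name": "Voter A", "zip": "11111", "gender": "F", "dob": "1970-03-02"},
    {"name": "Voter B", "zip": "11111", "gender": "M", "dob": "1985-07-19"},
    {"name": "Voter C", "zip": "11111", "gender": "M", "dob": "1985-07-19"},
    {"name": "Voter D", "zip": "33333", "gender": "F", "dob": "1962-05-05"},
]


def quasi_id(record):
    """The fields both datasets share, used as a makeshift join key."""
    return (record["zip"], record["gender"], record["dob"])


def link(anonymized, identified):
    """Attach names to anonymized rows whose key matches exactly one person."""
    names_by_key = {}
    for person in identified:
        names_by_key.setdefault(quasi_id(person), []).append(person["name"])

    matches = {}
    for row in anonymized:
        names = names_by_key.get(quasi_id(row), [])
        if len(names) == 1:  # a unique match means re-identification
            matches[names[0]] = row["diagnosis"]
    return matches


reidentified = link(anonymized_medical, voter_roll)
print(reidentified)
```

In this toy data only Voter A is exposed. Voters B and C share the same three fields and so hide each other, which is the same crowd effect that added noise tries to create on purpose.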
The general idea of added noise applies in a machine learning context as well, according to a recent review study posted to arXiv. An algorithm then isn't fed a pure data stream; it's fed one that's been muddied a bit by statistical noise. The algorithm is still able to learn and do its job, but it's able to remain oblivious to high-sensitivity data. This might even help solve the machine learning challenge known as overfitting, where an algorithm becomes too sensitive to a dataset and loses its ability to handle and process new data points or data points outside of its now-rigid strictures.
"Differential privacy thrives because it is natural, it is not domain-specific, and it enjoys fruitful interplay with other fields," a 2011 Microsoft Research paper declared. "This flexibility gives hope for a principled approach to privacy in cases, like private data analysis, where traditional notions of cryptographic security are inappropriate or impracticable."
In a way, data wants to be anonymized. The purpose of datasets is the examination of groups, not individuals, but this side effect persists. As long as it continues to persist, the data-driven future only gets more precarious.