Algorithms that assess people’s likelihood to reoffend as part of the bail-setting process in criminal cases are, to be frank, really scary.
We don’t know very much about how they work—the companies that make them are intensely secretive about what makes their products tick—and studies have suggested that they can harbor racial prejudices. Yet, these algorithms provide judges with information that is used to decide the course of somebody’s life.
Now, a new study published Wednesday in Science Advances by Dartmouth College computer science professor Hany Farid and former student Julia Dressel aims, in the authors’ words, to “cast significant doubt on the entire effort of algorithmic recidivism prediction.” In short, bail algorithms don’t appear to perform any better than human beings.
Read More: AI Could Resurrect a Racist Housing Policy
According to their study, COMPAS—one of the most popular algorithms used by courts in the US and elsewhere to predict recidivism—is no more accurate than 20 people asked to estimate the likelihood of recidivism in an online survey. Additionally, COMPAS didn’t outperform a simple linear predictor algorithm armed with just two inputs: age and number of crimes committed. COMPAS, in contrast, uses 137 unique inputs to make decisions, the study's authors write.
In a statement released after the study was published, Equivant—the company behind COMPAS—argued that COMPAS in fact only uses six inputs, and that the rest are "needs factors that are NOT used as predictors in the COMPAS risk assessment." In response, the authors wrote to me in an email that "regardless how many features are used by COMPAS, the fact is that a simple predictor with only two features and people responding to an online survey are as accurate as COMPAS."
“Our point isn’t that it's good or bad,” said co-author Farid over the phone. “But we would like the courts to understand that the weight they give these risk assessments should be based on an understanding that the accuracy from this commercial black box software is exactly the same as asking a bunch of people to respond to an online survey.”
The baseline accuracy of online respondents estimating recidivism within two years was 63 percent, the authors report, while COMPAS’ is 65 percent (a finding based on a dataset covering its use in Broward County, Florida, between 2013 and 2014). The simple linear algorithm with just two inputs had an accuracy of 66 percent. It’s worth noting that many researchers prefer to gauge accuracy with a different statistical measure known as AUC-ROC—even using this measure, though, online survey respondents managed an AUC-ROC value of .71, while COMPAS achieved .70.
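The study’s actual two-input model isn’t reproduced in the article, but the general idea of a linear predictor over just age and prior offense count can be sketched. Everything below—the toy data, the feature scaling, the training settings—is made up for illustration, not taken from the study:

```python
import math

def sigmoid(z):
    # Clamp extreme inputs so math.exp never overflows
    if z < -30:
        return 0.0
    if z > 30:
        return 1.0
    return 1.0 / (1.0 + math.exp(-z))

def features(age, priors):
    # Crude rescaling so gradient descent stays numerically stable
    return age / 100.0, priors / 10.0

def train(rows, lr=0.5, epochs=5000):
    """Fit a two-feature logistic regression on (age, priors, reoffended) rows."""
    w_age = w_priors = b = 0.0
    for _ in range(epochs):
        for age, priors, y in rows:
            x1, x2 = features(age, priors)
            p = sigmoid(w_age * x1 + w_priors * x2 + b)
            err = p - y
            w_age -= lr * err * x1
            w_priors -= lr * err * x2
            b -= lr * err
    return w_age, w_priors, b

def predict(weights, age, priors):
    """True if the model estimates at least a 50 percent chance of reoffending."""
    w_age, w_priors, b = weights
    x1, x2 = features(age, priors)
    return sigmoid(w_age * x1 + w_priors * x2 + b) >= 0.5

# Entirely made-up rows: (age, number of prior offenses, reoffended within two years?)
toy = [
    (20, 5, 1), (22, 3, 1), (25, 4, 1), (23, 6, 1),
    (30, 0, 0), (45, 1, 0), (50, 0, 0), (60, 2, 0),
]
weights = train(toy)
```

The point of the sketch is how little machinery is involved: two numbers per person and a weighted sum, versus the 137 inputs the study attributes to COMPAS.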
"The findings of 'virtually equal predictive accuracy' in this study, instead of being a criticism of the COMPAS assessment," Equivant wrote in an online statement, "actually adds to a growing number of independent studies that have confirmed that COMPAS achieves good predictability and matches the increasingly accepted AUC standard of 0.70 for well-designed risk assessment tools used in criminal justice."
In response, the authors wrote me that .70 AUC is indeed the industry standard, but noted that their study participants nonetheless managed .71. "Therefore, regardless of the preferred measure of predictive performance, COMPAS and the human participants are indistinguishable," they wrote.
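AUC-ROC, the measure both sides cite, has a concrete interpretation: the probability that a randomly chosen person who reoffended was assigned a higher risk score than a randomly chosen person who did not (ties counting half). A minimal illustration, with hypothetical risk scores rather than any real COMPAS output:

```python
def auc_roc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs where the positive outranks
    the negative; ties count as half a win."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical scores for people who did / did not reoffend
reoffended = [0.8, 0.7, 0.6, 0.55]
did_not = [0.5, 0.4, 0.65, 0.3]
print(auc_roc(reoffended, did_not))  # → 0.875
```

On this scale, the .70 versus .71 gap the authors report between COMPAS and untrained survey respondents amounts to about one percentage point of these pairwise comparisons.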
According to the study’s authors, their work suggests a cap on the accuracy of predictions about people’s futures based on historical data, whether the predictions are made by people or machines. Indeed, the whole idea of predicting someone’s behavior two years out may be wrongheaded, Farid said. Regardless, the overall point stands: these automated techniques are no better than humans.
A potential caveat, however: According to Sam Corbett-Davies—a Stanford PhD student who has done research on the risks posed by bail algorithms—predictions based solely on select historical data (whether it’s done by algorithms or not) are often still more accurate than those that include more subjective factors like how a judge feels about tattoos.
"Judges are exposed to much more information: they can talk to defendants, assess their demeanor, see their tattoos, and ask about their upbringing or family life,” Corbett-Davies wrote me in an email. “All these extra factors are mostly useless, but they allow human biases to seep into judges' decisions. Multiple studies have looked at thousands of judge decisions and found that algorithms based on very few factors can significantly outperform judges."
In other words, human "intuition" based on a grab bag of subjective factors may still be less accurate than algorithms (or even humans) just looking at select historical information about a person.
Still, Farid and Dressel’s findings are, at the very least, an indictment of how companies armed with flashy advertising and a staunch refusal to reveal their secret sauce have managed to flood the criminal justice system with algorithms that help to decide people’s futures without publicly vetted evidence of their accuracy.
Indeed, study co-author Julia Dressel told me over the phone, the last published study that specifically compared the accuracy of algorithms versus that of humans in predicting recidivism (that they could find, anyway) was done in Canada in 1984. A few things have changed since then.
“Companies should have to prove that these algorithms are actually accurate and effective,” Dressel said. “I think the main step forward is recognizing that we need to be a bit wary of machine learning and artificial intelligence. And though these words sound impressive, and they can do really great things, we have to hold these technologies to a high standard.”
UPDATE: Equivant initially did not respond to Motherboard's request for comment, but after publication released a statement that criticized the study published in Science Advances by Hany Farid and Julia Dressel. The company claimed that the researchers misstated the number of inputs COMPAS uses, and questioned their methodology. We asked Equivant for more details, but it declined. The story has been updated with Equivant's response and additional comments from the authors defending their work.