An AI Paper Published in a Major Journal Dabbles in Phrenology

A paper in 'Nature' that claimed to evaluate 'trustworthiness' based on physical features has sparked major backlash.

On Friday, a trio of evolutionary psychology researchers published a research paper in Nature that sought to use machine learning to track historical changes in "trustworthiness" using facial expressions in portraits. The experiment was widely panned online as a digital revival of racist practices such as phrenology and physiognomy, which claimed to discern character from physical characteristics.


At its core, the paper is concerned with "research linking facial morphological traits to important social outcomes" and uses portraits from the past few centuries, along with selfies from the past few years, to conduct its experiment. To that end, the researchers used machine learning to train an algorithm to analyze why and how judgements of trustworthiness were made, specifically in European portraits over time. In addition to that core question, they investigated whether people from richer nations were more likely to have “trustworthy” portraits.

On Twitter, the researchers shared their study and said that they designed “an algorithm to automatically generate trustworthiness evaluations for the facial action units (smile, eye brows, etc.).” The tweet was shared with an image from the study that resembles outdated and debunked diagrams from a well-known phrenology booklet from 1902 that promised "to acquaint all with the elements of human nature and enable them to read these elements in all men, women and children in all countries."

Quickly, this sparked a backlash as a flood of researchers pointed to a deeply flawed set of assumptions, questionable methodology and analysis, superficial engagement with art history, and a disregard for sociology and economics. Critics also accused the project of being the latest to simply use machine learning to train an algorithm to be racist.


The co-authors of this study did not respond to Motherboard’s request for comment.

It has long been understood that people consistently and (un)consciously make judgements about a person's personality based on facial features, despite there being no evidence of a relationship. Predictably, then, the study’s conclusions are weak; for example, the finding that "trustworthiness displays in portraits increased throughout history" seems to simply be saying that the closer in time a portrait is to us, the more trustworthy we would rate it. 

The claim that "trustworthiness displays in portraits increased with affluence" is more problematic. The study relies on a 2014 publication from the Maddison Project, a collaborative effort by historians to build on economic historian Angus Maddison’s attempts to reconstruct medieval economic data. A more recent Maddison Project publication from 2018 emphasizes that, in the years since, the collaborative has realized we "urgently need a new approach to Maddison's historical statistics" because Maddison's traditional method was found to result in significant distortions and contradictions.

There are more issues, however. Take, for example, the fact that there are no art historian (or historian) co-authors. As one historian pointed out on Twitter, the paper makes questionable claims about European social trust such as "religious tolerance increased, witch hunts abated, honor killings and revenge lost their appeal and intellectual freedom became a central value of modern countries." The major source for these claims is Steven Pinker’s The Better Angels of Our Nature, itself criticized as a deeply flawed exercise in “wishful thinking.”


The study also does not account for the intentions of artists or subjects, the context of certain portraits and art styles, or their changes as the art itself changed. Take another user’s thread on various portraits and styles with which the study fails to adequately engage. If you were to view a portrait of, say, Henry VII, your perception of its trustworthiness would be shaped not only by your personal biases but also by the intentions of Henry and his painter. As the thread explains, Henry was a king who "wanted to look like he could crush you like a bug if you opposed him."

The study’s conclusions on trustworthiness, then, don’t really jibe with reality. A portrait of Thomas Cranmer was found to have low trustworthiness by the algorithm, and one of Sir Matthew Wood was found to have high trustworthiness. As one writer explained on Twitter, Cranmer was "martyred for renouncing a recantation extracted under torture" while Wood "finagled an inheritance by seducing the 'feeble-minded' daughter of a prominent banker."

Or consider the source of data: the National Portrait Gallery collection and the Web Gallery of Art, which boast some 1,962 and 4,106 pieces of art, respectively. These are huge and rich datasets, but they are also explicitly curated ones. The study does not question its datasets and how they were constructed; curation obviously favors certain art styles, time periods, and artists. Instead, the study analyzes the degree of democratization present when and where the portraits were painted, and relies on Maddison’s likely flawed historical statistics when trying to measure economic indicators.

The algorithm can’t actually detect trustworthiness, according to one statistician’s Twitter thread, in which he calculates that its ability to detect "trustworthy" or "dominant" faces is only 5 percent better than simply saying every face is equally trustworthy. The algorithm’s inherent flaws are made even worse by incomplete data. The core claim that the increase in trustworthiness is "more strongly associated with GDP per capita than institutional change" is undermined by the fact that while the portraits span from 1500 to now, the economic data only begins in 1800. This means nearly 42 percent of the economic data from this analysis is missing.

Altogether, it’s not clear there is any value in this sort of experiment. If anything, it seems destined to end up being used in attempts to legitimize digital reskins of physiognomy and phrenology, much in the way police departments tried to use empirical analysis to legitimize their racial profiling.