Scary 'Emergent' AI Abilities Are Just a 'Mirage' Produced by Researchers, Stanford Study Says

"There's no giant leap of capability," the researchers said.

In a new paper, Stanford researchers say they have shown that so-called "emergent abilities" in AI models—when a large model suddenly displays an ability it ostensibly was not designed to possess—are actually a "mirage" produced by researchers. 

Many researchers and industry leaders, such as Google CEO Sundar Pichai, have perpetuated the idea that large language models like GPT-4 and Google's Bard can suddenly spit out knowledge that they weren't programmed to know. A 60 Minutes segment from April 16 claimed that Bard was able to translate Bengali even though it was not trained to, and that AI models are "teaching themselves skills that they weren't expected to have." Microsoft researchers, too, claimed that OpenAI's GPT-4 language model showed “sparks of artificial general intelligence,” saying that the AI could “solve novel and difficult tasks…without needing any special prompting.” Such claims not only hype up the AI models that companies hope to profit from, but also stoke fears of losing control of an AI that suddenly eclipses human intelligence.


Co-authors Rylan Schaeffer, Brando Miranda, and Sanmi Koyejo present an explanation for emergent abilities in their paper, posted to the arXiv preprint server on Friday. The authors write that “for a particular task and model family, when analyzing fixed model outputs, one can choose a metric which leads to the inference of an emergent ability or another metric which does not.” 

Though the model the researchers discuss is GPT-3, a previous iteration in the GPT family, they compare their findings to previous papers that also focused on GPT-3 when defining its emergent abilities. The researchers found that AI abilities only appeared to emerge suddenly when people used specific metrics to measure performance on the task. They wrote that a person’s choice of a "non-linear" or "discontinuous" measurement can result in what appear to be sharp and unpredictable changes that are then falsely labeled as emergent abilities, when in reality the performance curve is increasing smoothly. The authors write that this is compounded by researchers not having enough data on small models (perhaps they really are capable of the supposedly emergent task) and not enough on large models, either.

A discontinuous metric is something like a “Multiple Choice Grade,” which is the metric that produced the most supposed emergent abilities. Linear metrics, on the other hand, include things like “Token Edit Distance,” which measures how many token-level edits separate a model's output from the target answer, and “Brier Score,” which measures the accuracy of a forecasted probability. What the researchers found was that when they changed the measurement of their outputs from a nonlinear to a linear metric, the model's progress appeared predictable and smooth, nixing the supposed "emergent" property of its abilities.
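To make the contrast concrete, here is a minimal Python sketch that scores the same fixed outputs with both kinds of metric. Everything in it is an assumption made for illustration: the four "models," the probabilities they assign to each answer option, and the helper names `multiple_choice_grade` and `brier_score` are hypothetical and are not taken from the paper. The Brier score here is computed as the sum of squared differences between the predicted probabilities and the correct answer, where lower is better.

```python
# Hypothetical probabilities that four models of increasing size assign to
# the four options of one multiple-choice question. Option 0 is correct.
# These numbers are invented for illustration, not taken from the paper.
CORRECT = 0
model_outputs = {
    "small":  [0.20, 0.50, 0.20, 0.10],
    "medium": [0.30, 0.45, 0.15, 0.10],
    "large":  [0.40, 0.42, 0.10, 0.08],
    "larger": [0.50, 0.35, 0.10, 0.05],
}


def multiple_choice_grade(probs, correct):
    """Discontinuous metric: 1 if the top-ranked option is correct, else 0."""
    top_choice = max(range(len(probs)), key=lambda i: probs[i])
    return 1 if top_choice == correct else 0


def brier_score(probs, correct):
    """Continuous metric: squared error against the one-hot answer (lower is better)."""
    return sum((p - (1.0 if i == correct else 0.0)) ** 2 for i, p in enumerate(probs))


grades = [multiple_choice_grade(p, CORRECT) for p in model_outputs.values()]
briers = [brier_score(p, CORRECT) for p in model_outputs.values()]

for name, grade, brier in zip(model_outputs, grades, briers):
    print(f"{name:>6}: multiple choice grade = {grade}, Brier score = {brier:.3f}")
```

With these made-up numbers, the multiple choice grade stays at 0 for the first three models and flips to 1 for the last, while the Brier score falls at every step. The outputs are the same in both cases; only the metric differs.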


An example that the researchers debunk in the paper is GPT-3’s ability to perform integer arithmetic tasks such as adding two five-digit integers. 
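A rough way to see how the choice of metric can shape this example is to assume, purely for illustration, that a model gets each digit of the answer right independently with some probability. The per-digit accuracies below, the assumed answer length, and the function name `exact_match_rate` are hypothetical and are not figures from the paper.

```python
# Hypothetical per-digit accuracies for models of increasing size.
# These numbers are made up for illustration, not taken from the paper.
per_digit_accuracy = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95]

# Assumed answer length: the sum of two five-digit integers has 5 or 6 digits.
ANSWER_DIGITS = 6


def exact_match_rate(p, n_digits):
    """Chance that all n_digits are right, assuming each digit is
    right independently with probability p (a simplifying assumption)."""
    return p ** n_digits


exact_rates = [exact_match_rate(p, ANSWER_DIGITS) for p in per_digit_accuracy]

for p, rate in zip(per_digit_accuracy, exact_rates):
    print(f"per-digit accuracy {p:.2f} -> exact-match rate {rate:.3f}")
```

Under these assumptions, the per-digit accuracy climbs steadily from 0.50 to 0.95, while the all-or-nothing exact-match rate goes from under 2 percent to over 70 percent, with most of the rise packed into the last few steps. A reader looking only at the exact-match numbers could mistake that for a sudden jump.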

“I think the mental picture a lot of people have is that small AI models can’t do this task whatsoever, but above some critical scale, AI models immediately and suddenly become capable of doing addition very well,” the authors of the paper told Motherboard in a joint email. “This mental picture suggests cause for concern. It suggests that you might have one model that is well behaved and trustworthy, but if you train the next model with more data or with more parameters, the next model might (unpredictably) become toxic or deceptive or malicious. These are some of the concerns that are leading people to believe we might unexpectedly lose control of AI."

"What we found, instead, is that there's no giant leap of capability," the authors continued. "When we reconsidered the metrics we use to evaluate these tools, we found that they increased their capabilities gradually, and in predictable ways.”

The authors conclude the paper by encouraging other researchers to consider tasks and metrics separately, to consider the metric’s effect on the error rate, and to recognize that the better-suited metric may be different from the automated one. The paper also suggests that other researchers take a step back from being overeager about the abilities of large language models. “When making claims about capabilities of large models, including proper controls is critical,” the authors wrote in the paper.

“We emphasize that researchers need to comprehend the implications of their chosen metric and should not be taken aback when their choices lead to predictable outcomes. To provide a tangible example, imagine evaluating baseball players based on their ability to hit a baseball a certain distance,” the researchers said. “If we use a metric like ‘average distance’ for each player, the distribution of players' scores will likely appear smooth and continuous. However, if we opt for a discontinuous metric like ‘whether a player's average distance exceeds 325 feet,’ then many players will score 0, while only the best players will score 1. Both metrics are valid, but it's important not to be surprised when the latter metric yields a discontinuous outcome. This understanding can guide researchers in making more informed decisions about the metrics they choose to evaluate emergent abilities.”
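The baseball analogy can be written out in a few lines of Python. The player distances below are invented for illustration; only the 325-foot cutoff comes from the researchers' example, and the function name `exceeds_threshold` is hypothetical.

```python
# Hypothetical average hit distances in feet, one per player.
# Invented for illustration; only the 325-foot cutoff is from the quote.
average_distance_ft = [250, 275, 300, 320, 330, 350]
THRESHOLD_FT = 325


def exceeds_threshold(distance_ft, threshold_ft=THRESHOLD_FT):
    """Discontinuous metric: 1 if the average distance beats the cutoff, else 0."""
    return 1 if distance_ft > threshold_ft else 0


scores = [exceeds_threshold(d) for d in average_distance_ft]

print("average distance:", average_distance_ft)
print("exceeds 325 ft:  ", scores)  # [0, 0, 0, 0, 1, 1]
```

The distances rise in even steps, but the thresholded scores sit at 0 and then switch to 1 once a player crosses 325 feet, which is the kind of discontinuity the researchers say should not come as a surprise.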

Many AI ethicists have been outspoken about corporate-backed research papers that contain overblown hype and promotion. Emily M. Bender, a professor of linguistics at the University of Washington, called the Microsoft paper about the emerging sparks of AGI “speculative fiction” and the GPT-4 technical report something that ignores the most basic risk mitigation strategies.