Why Can't Scientists Reproduce a Definition of 'Reproducibility'?

A large part of the scientific endeavor hinges on what would seem to be a simple concept: reproducibility. If some research produces a result that the researcher would like to claim as meaningful, another researcher should be able to reproduce that result given the same or very similar conditions. That would seem to be an obvious prerequisite for calling something “true” or, as is much more likely in scientific research, strongly indicative of truth. But, as obvious as reproducibility might seem, within science its meaning is unsettled.

A survey conducted by the journal Nature and published last month found that, of 1,500 scientists polled, 70 percent had been unable to replicate another scientist’s results, while 50 percent had been unable to replicate even their own. Fifty-two percent of the scientists agreed that a significant crisis of reproducibility exists, yet only 31 percent agreed that failure to reproduce published research results means that those results are probably wrong.

The reproducibility problem can be traced, at least in part, to the meaning of the term itself, a problem covered in a panel session at the Best Practices of Biomedical Research: Improving Reproducibility and Transparency of Preclinical Research Conference earlier this month in Bethesda, Maryland. Ferric Fang, a microbiologist at the University of Washington in Seattle, noted three common definitions in use that are not necessarily in agreement.

The first seems obvious enough: Research results should be mostly consistent across slight variations in experimental set-up. A minor change in the experiment should probably not produce a drastic change in the experiment’s results, compared to other experiments. “Reproduction is taking the idea of a scientific project and showing that it is robust enough to survive various sorts of analysis,” Fang said, according to Nature News.

A second definition widely in use is much narrower. A scientist should be able to get the same results in an experiment when duplicating that experiment exactly. This is sort of the opposite of robustness: it allows for very fragile experimental setups and does not require the results to survive across different contexts. A third definition says that results don’t have to be reproduced to be reproducible. That is, so long as the scientists conducting an experiment provide sufficient information in their study that another scientist could exactly replicate it, then that is chill, scientifically.

They all sound reasonable on their own, I think. But taken together, the disagreement becomes clear, as does the reproducibility problem itself.

A paper published earlier this year by researchers at the Meta-Research Innovation Center at Stanford argued that reproducibility should maybe not be a single unified concept in the first place, and instead proposed a trio of reproducibility types roughly corresponding to the three versions outlined by Fang above.

“The causes of and remedies for what is called poor reproducibility, in any scientific field, require a clear specification of the kind of reproducibility being discussed (methods, results, or inferences), a proper understanding of how it affects knowledge claims, scientific investigation of its causes, and an improved understanding of the limitations of statistical significance as a criterion for claims,” the Stanford group wrote in Science Translational Medicine.

A white paper by the American Society for Cell Biology likewise dismissed reproducibility as a catch-all term, suggesting instead a four-legged definition.

From Nature:

According to this paper, “analytic replication” refers to attempts to reproduce results by reanalysing original data; “direct replication” refers to efforts to use the same conditions, materials and methods as an original experiment; “systematic replication” describes efforts to produce the same findings using different experimental conditions (such as trying an experiment in a different cell line or mouse strain), and “conceptual replication”, which refers to attempts to demonstrate the general validity of a concept, perhaps even using different organisms.

Across the far-flung fields of scientific research, the landscape only gets more complex. For example, the Reproducibility Project: Psychology, which attempted to replicate 100 psychology studies in order to rate psychology publications, used five indicators of successful replication, while a recent neuroimaging best practices paper came up with 10 levels of reproducibility across three general categories: “measurement stability,” “analytical stability,” and “generalizability.”

Differentiating reproducibility seems reasonable, but maybe it’s worth returning briefly to the source of the idea itself, which is generally agreed to be Robert Boyle, the OG chemist and discoverer of Boyle’s law. He put reproducibility in terms of a legal analogy: “For, though the testimony of a single witness shall not suffice to prove the accused party guilty of murder; yet the testimony of two witnesses, though but of equal credit . . . shall ordinarily suffice to prove a man guilty.”