When a new scientific discovery is announced, how can you determine the level of confidence that the claim deserves? In a series of three articles, I want to consider this question.

Claims are based on evidence, and different kinds and quantities of evidence provide different levels of confidence. If the evidence is conflicting, the evidence that provides higher confidence must be prioritized. If the available evidence can provide only low confidence, any claims should be stated tentatively.

Because medicine involves life-or-death outcomes, it is arguably the application of science with the greatest potential benefit to society. As a result, we have established regulatory bodies (like the FDA) and demanded that medical guidelines be backed by strong evidence. This intense pressure has led doctors to establish hierarchical levels of medical evidence: clear descriptions of which kinds of evidence provide higher confidence.1 For example, as evidence for the benefit of a medication, all doctors would agree that a well-conducted randomized controlled clinical trial provides much greater confidence than a retrospective observational study, in which a database is mined to summarize outcomes for people who chose to take or not to take the medication.

Six Criteria for Confidence

Other areas of science that have not faced this intense pressure have not established hierarchical levels of evidence. Although the specific practices that apply to medicine, such as randomized controlled clinical trials, cannot be directly applied to many other applications of science, the well-established hierarchy of evidence in medicine provides general concepts that can be applied to all applications of science. These general concepts can be summarized as six criteria that can be applied to assess any type of evidence:2

1. Is the evidence repeatable? Repeatedly arriving at the same result inherently builds confidence.
2. Can the evidence be directly measured or observed? The more direct the measurement, the higher the confidence. For example, measurements of black holes in space can only be obtained very indirectly, whereas the orbit of the moon can be directly observed.
3. Was the evidence obtained through prospective study? Designing experiments in advance provides the opportunity to block out confounding factors, meaning anything that could get in the way of uncovering the truth. If done successfully, prospective experimentation can directly establish what caused the results that were observed.
4. Was bias minimized? Everyone has biases, and bias is devastating to good science. Bias can generally be assessed by determining who stands to benefit from a given scientific claim. Active steps should be taken to remove bias, such as conducting measurements through a blinded, independent third party.
5. Were assumptions minimized and openly disclosed? In science, assumptions are used to save time and money. Assumptions should first be minimized, and any that remain should be openly disclosed and justified. Hidden assumptions are devastating to science.
6. Were the claims reasonable? Claims should flow directly from the evidence, not from extrapolation or amplification. The results of an experiment should not be overstated; strictly, they apply only to the experimental conditions that were studied. Proper use of hedging language, like “these results suggest…,” can convey an appropriate level of confidence. Language that implies extremes, like “always,” “never,” or “conclusive proof,” is often a warning sign of an overstated claim, suggesting a conversion from scientist to salesperson.

The first three criteria speak to the quality of the scientific study, whereas the latter three criteria speak more to the quality of the scientist. The criteria do not provide a simple binary assessment of confidence; they provide gradations of the level of confidence associated with evidence. Evidence that provides higher confidence must be prioritized over evidence that provides lower confidence. Thankfully, doctors do this all the time.

Learning from a Piece of Humble Pie

I gained a deeper appreciation of these criteria through a hard lesson. My colleagues and I wanted to study whether a particular feature of an implanted cardiac device benefited patients by preventing the development of a dangerous heart rhythm (atrial fibrillation). Looking retrospectively through a database of more than 37,000 patients, we found a 54 percent reduction in the risk of developing atrial fibrillation when the feature was enabled.3 In our manuscript, we openly disclosed the chief limitation of our study: because this was a retrospective observational study, patients were not randomly assigned to have the feature enabled or disabled, and it was not possible to determine why a given patient had the feature enabled or disabled. Doctors may have decided to enable the feature in patients who already had a lower risk of developing atrial fibrillation. Because our study was not prospectively designed, we could not exclude this potential confounding factor, known generally as selection bias. Also, as an employee of the company that made the product, I was biased, and our study adopted the assumption that selection bias was not significant. Our study did not stack up well against the six criteria.

Unfortunately, our findings were not repeated in a subsequent prospective, randomized, controlled clinical trial.4 Even though our study was ten times larger than the randomized controlled trial, the randomization and prospective design eliminated selection bias and provided a higher level of confidence. All doctors would agree.
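To make the role of selection bias concrete, here is a minimal simulation sketch. It is not a model of our actual study or data. Every number in it is an assumption chosen for illustration: half of the hypothetical patients are high risk, the outcome rates are 30 percent and 10 percent, and doctors are assumed to enable the feature in 80 percent of low-risk patients but only 20 percent of high-risk patients. The function name simulate and its parameters are likewise hypothetical. Most importantly, the feature is given no true effect at all.

```python
import random


def simulate(n_patients, randomized, seed=0):
    """Return (event rate with feature enabled, event rate with feature disabled).

    Hypothetical model: the feature has NO true effect on the outcome.
    All probabilities below are illustrative assumptions, not study data.
    """
    rng = random.Random(seed)
    events = {True: 0, False: 0}
    counts = {True: 0, False: 0}
    for _ in range(n_patients):
        high_risk = rng.random() < 0.5  # assumed: half the patients are high risk
        if randomized:
            # Prospective randomized design: a coin flip, independent of risk.
            enabled = rng.random() < 0.5
        else:
            # Retrospective data: assumed doctors favor enabling in low-risk patients.
            enabled = rng.random() < (0.2 if high_risk else 0.8)
        # The outcome depends only on baseline risk, never on the feature.
        had_event = rng.random() < (0.30 if high_risk else 0.10)
        counts[enabled] += 1
        events[enabled] += had_event
    return events[True] / counts[True], events[False] / counts[False]


def relative_reduction(rate_enabled, rate_disabled):
    """Apparent relative risk reduction associated with enabling the feature."""
    return 1 - rate_enabled / rate_disabled


if __name__ == "__main__":
    for label, randomized in (("observational", False), ("randomized", True)):
        on, off = simulate(100_000, randomized)
        print(label, round(on, 3), round(off, 3),
              round(relative_reduction(on, off), 3))
```

Under these assumed numbers, the expected event rates in the observational data are 14 percent with the feature enabled and 26 percent with it disabled, an apparent reduction of nearly half produced entirely by who received the feature. With random assignment, both groups are expected to show the same 20 percent rate, so the apparent benefit should vanish. The sketch only illustrates how such a discrepancy can arise; it does not show that this is what happened in our study.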

Standardized Levels of Evidence

Appropriate prioritization of evidence is a fundamental component of epistemology, and other fields of science should follow the example set by the medical community and agree upon standardized levels of evidence.

In the next article in this series, we will use these criteria to assess the evidence that is commonly used to support evolution.

