Medical Evidence Blog: Why Most True Research Findings Are Useless

In his provocative essay in PLOS Medicine over a decade ago, Ioannidis argued that most published research findings are false, owing to a variety of errors such as p-hacking, data dredging, fraud, selective publication, researcher degrees of freedom, and many more. In my permutation of his essay, I will go a step further and suggest that even if we limit our scrutiny to tentatively true research findings (scientific truth being inherently tentative), most research findings are useless.

My choice of the word "useless" may seem provocative, and even untenable, but it is intended to have an exquisitely specific meaning: I mean useless in an economic sense of "having zero or negligible net utility", in the tradition of Expected Utility Theory [EUT], for individual decision making. This does not mean that true findings are useless for the incremental accrual of scientific knowledge and understanding. True research findings may be very valuable from the perspective of scientific progress, but still useless for individual decision making, whether it is the individual trying to determine what to eat to promote a long healthy life, or the physician trying to decide what to do for a patient in the ICU with delirium. When evaluating a research finding that is thought to be true, and may at first blush seem important and useful, it is necessary to make a distinction between scientific utility and decisional utility. Here I will argue that while many "true" research findings may have scientific utility, they have little decisional utility, and thus are "useless".

The first reason most true findings are useless, especially in the biological sciences, is that most true findings that are reported are associational rather than causal. Associations are important for scientific understanding, incremental accrual of knowledge, and especially "hypothesis generation", but they do not help decision makers solve practical problems. Decision makers must take action to alter outcomes in their environment, and doing so requires knowledge of causal factors rather than associations. While associations may hint at or represent underlying causal pathways, it is often just as likely that they do not, that they are spurious, or that the actual causal pathway runs opposite to what the association or its interpretation appears to suggest. When decisions have actual consequences, placing a bet on an association is not generally justified. Note that I said "an" association. While an individual association may be insufficient for decision making, the accrual of knowledge and the application of specific frameworks such as the Bradford Hill criteria, as was done for smoking and lung cancer, can be used to inform decisions. This case is the exception rather than the rule, especially because of the strength of that particular association (discussed below).

The second reason most true research findings are useless is that they report outcomes that are (at best) surrogates for what we wish to know to make decisions. A popular category of these studies, on exercise and health, is reported almost weekly in the New York Times by Gretchen Reynolds and Jane Brody. In an article called "A Positive Outlook May Be Good For Your Health", trending right now, Brody writes:

"There is no longer any doubt that what happens in the brain influences what happens in the body. When facing a health crisis, actively cultivating positive emotions can boost the immune system and counter depression. Studies have shown an indisputable link between having a positive outlook and health benefits like lower blood pressure, less heart disease, better weight control and healthier blood sugar levels."

Actively cultivating positive emotions may "boost the immune system" in the lab in some artificial setting based on some surrogate measure, but it does not give us information about what we want to know: does it make us healthier and more resistant to disease? The rest of that selection involves the highly fraught associations problem described above. Chicken, egg, or neither?

Also trending right now is Gretchen Reynolds's article "Why Deep Breathing May Keep Us Calm":

"The scientists confirmed that idea in a remarkable study published last year in Nature, in which they bred mice with a single type of pacemaker cell that could be disabled. When they injected the animals with a virus that killed only those cells, the mice stopped sighing, the researchers discovered. Mice, like people, normally sigh every few minutes, even if we and they are unaware of doing so. Without instructions from these cells, the sighing stopped."

What is happening in a mouse's brain (or muscle, or liver, or heart) may be scientifically interesting, but it provides no actionable information for the individual decision maker. Studies using surrogate outcomes or surrogate subjects (i.e., animal models) are simply not reliable for humans making decisions in life. They serve mainly to reinforce the intuitions denoted by the headlines of the above articles. Even if we knew beyond a doubt that sighing were essential for life (it may be!), knowing this does not allow us to reasonably infer that we should do more of it than we are naturally inclined to, or that it may "keep us calm".

The third reason most true research findings are useless has received short shrift because of Ioannidis' article and the ensuing assault on false positive research results and the "replicability crisis" (never mind that almost all of those nonreplicable psych studies would have been useless for the individual even if true, for the reasons described above). For a true research finding to have decisional utility on the individual level, the finding has to have an effect size large enough that it is worth acting upon, given the costs of those actions, even if we assume 100% certainty of the truth of the finding. An enormous impediment to evaluating the magnitude of a research effect is that it is often not reported by either the lay press or academic writers. Positive results, as measured by the fallible P-value, are stated as truth without mention of the magnitude of the effect, or are exaggerated by reporting relative rather than absolute effects, the latter being necessary for decision making in accord with EUT. This article called "Good News for Older Mothers", trending in the New York Times this week and written by a physician, reports numerous positive research results without once mentioning the magnitude of the positive effects:

"In the two earlier studies, there was a negative association; maternal age 35-39 at birth was associated with poorer cognitive scores in the children, tested a decade later; the children who had been born to mothers 25-29 did better. On the other hand, for the most recent study, that association was reversed; the children born to the 35- to 39-year-olds did significantly better on the cognitive testing than the children born to the younger mothers."

In addition to the problem of association, we are told that in different epochs children of mothers of different ages do either "poorer" or "better". When I read that article, I immediately suspected that the statistically significant differences referred to (had they not been significant they would not have been published, right?) were likely to be very small. The standard deviation (SD) for IQ scores is 15 points; thus a 3 point difference represents one fifth of an SD of IQ. My guess was that any differences that were found were on the order of 3 points. Perhaps not negligible, but nothing to write home about and probably not cause for any celebration even if there were something you could do about it. The abstract of the index report is here and it says:

"Results For the 1958-1970 cohorts, maternal ages 35-39 were associated with 0.06 (95% CI: -0.13, 0.00) and 0.12 (95% CI: -0.20, -0.03) standard deviations (SD) lower cognitive ability compared to mothers in the reference category (25-29), while for the 2001 cohort with 0.16 (95% CI: 0.09-0.23) SD higher cognitive ability."

As I suspected, the differences are small (on the order of one tenth to one fifth of an SD). The effect is small enough that I'm not sure there was a "problem" for the 1958-1970 cohorts, in which mothers who delayed maternity sacrificed 1-2 IQ points in their children, or that there is suddenly "good news for older mothers" because the children of older mothers in the 2001 cohort have gained a couple of IQ points, assuming the research result is true.
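The conversion from SD units to IQ points can be sketched in a few lines of Python. This is only an illustration: it assumes the conventional IQ scale with an SD of 15 points mentioned above, the function name `sd_to_iq_points` is my own invention, and the inputs are the three point estimates quoted from the abstract, written with a negative sign where the abstract says "lower".

```python
# Assumption: IQ is scaled so that one standard deviation equals 15 points.
IQ_SD = 15.0


def sd_to_iq_points(effect_sd: float, iq_sd: float = IQ_SD) -> float:
    """Convert an effect expressed in SD units into IQ points."""
    return effect_sd * iq_sd


# Point estimates quoted in the abstract (negative = lower cognitive ability
# for children of mothers aged 35-39 versus the 25-29 reference group).
quoted_effects_sd = {
    "1958-1970 cohorts, first estimate": -0.06,
    "1958-1970 cohorts, second estimate": -0.12,
    "2001 cohort": 0.16,
}

for label, effect_sd in quoted_effects_sd.items():
    points = sd_to_iq_points(effect_sd)
    print(f"{label}: {effect_sd:+.2f} SD is about {points:+.1f} IQ points")
```

Under that assumption, the three estimates work out to roughly -0.9, -1.8, and +2.4 IQ points, which is where the "1-2 IQ points" and "a couple of IQ points" figures come from.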

In this excellent JAMA article raining criticism on surrogate endpoints, Lipska and Krumholz, two accomplished and respected researchers, fail to report the magnitude of the effects they tout:

"For example, treatment with empagliflozin (a sodium-glucose cotransporter 2 inhibitor) and treatment with liraglutide (a glucagon-like peptide 1 [GLP-1] agonist) both significantly reduced the risk of major cardiovascular events, mortality from cardiovascular causes, and mortality from any cause when compared with placebo.⁴,⁵"

Reference 4 is the FDA-mandated safety trial by Marso. Table 1 shows that the rate of the primary outcome (death from cardiovascular causes, nonfatal myocardial infarction, or nonfatal stroke) was 3.4 events per 100 persons per year for liraglutide versus 3.9 for placebo, a difference of 0.5 events per 100 persons per year, or one half of one percentage point. For those who prefer the number needed to treat (NNT), that's 200 persons who need to be treated for a year to prevent one event with liraglutide. An absolute difference of 0.5 percentage points per year would not sway me to take liraglutide versus another drug were I choosing a treatment for diabetes, unless cost, side effects, and convenience were also all aligned in its favor. I have called this problem the "therapeutic paradox", not realizing that Geoffrey Rose dubbed it the "prevention paradox" 30 years before I did. These paradoxes result because minuscule effects on the individual level are not enough to sway decisions, but on the population level, thousands of events are prevented if a particular course is preferred, because of scaling. An elaboration of this concept is beyond our scope here and interested readers are referred to the links above. The point is that there can be a true positive research result that is of such a small magnitude that it is negligible or useless to guide decision making on the individual level. Other factors will have much greater weight, whether the decision is when to have a child or what drug to take to control your diabetes.
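The arithmetic behind these figures can be sketched in Python. This is a rough illustration under stated assumptions, not an analysis of the trial: it treats the quoted event rates per 100 persons per year as approximate annual risks, the function names are my own, and the population of one million is a made-up number used only to show the scaling effect.

```python
def absolute_risk_reduction(control_rate: float, treated_rate: float) -> float:
    """Difference in event rates, expressed as a proportion per year."""
    return control_rate - treated_rate


def relative_risk_reduction(control_rate: float, treated_rate: float) -> float:
    """Reduction expressed as a fraction of the control rate."""
    return (control_rate - treated_rate) / control_rate


def number_needed_to_treat(arr: float) -> float:
    """Persons treated for one year to prevent one event."""
    if arr <= 0:
        raise ValueError("NNT is undefined without an absolute benefit")
    return 1.0 / arr


# Rates quoted above: events per 100 persons per year.
placebo_rate = 3.9 / 100
liraglutide_rate = 3.4 / 100

arr = absolute_risk_reduction(placebo_rate, liraglutide_rate)
rrr = relative_risk_reduction(placebo_rate, liraglutide_rate)
nnt = number_needed_to_treat(arr)

# Hypothetical population size, chosen only to illustrate scaling.
HYPOTHETICAL_POPULATION = 1_000_000
events_prevented = arr * HYPOTHETICAL_POPULATION

print(f"Absolute risk reduction: {arr:.1%} per year")
print(f"Relative risk reduction: {rrr:.1%}")
print(f"Number needed to treat: {nnt:.0f} per year")
print(f"Events prevented per year in {HYPOTHETICAL_POPULATION:,} people: {events_prevented:,.0f}")
```

The same difference reads as about 0.5% in absolute terms but about 13% in relative terms, which is why relative effects sound more impressive. The last line shows the paradox: an effect that is negligible for one person still adds up to thousands of events across a large enough population.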

In these three ways (and perhaps others) even true research findings are not adequate to guide decisions and are often best ignored. Even if we assume that a finding is true, a course which Ioannidis and others have strenuously warned us against, most research findings are useless for the individual faced with a decision because they report associations rather than causal pathways, they report surrogate outcomes or surrogate experimental models, or they report "positive" findings that are simply too small in magnitude to matter for individual decisions.

Thursday, April 6, 2017
