Yesterday I read a story in the NYT about Warren Buffett and how the Oracle of Omaha has trailed the S&P 500 in four of the last five years. It was based on an analysis by a statistician who runs a blog called Statistical Ideas, which has a post on p-values that links to a Nature article from a couple of months back describing how we can be misled by p-values. All of this got me thinking.

We have a dual problem in medical research: a.) we conceive alternative hypotheses that cannot be confirmed in large trials free of bias; and b.) we cannot replicate the findings of positive trials. What are the reasons for this?

Trials can be negative either because of 1.) a lack of statistical power, or 2.) because they are testing therapies that just don't work, often on the basis of poorly conceived or naive hypotheses. Studies can be impossible to replicate because 3.) the results represent a "false positive" or type I error, or 4.) they had inherent bias which drove the positive results. Since Warren Buffett got me thinking about this post, I will extend an analogy:

- underpowering (type II errors) is analogous to a start-up that can't get enough venture capital
- hypotheses that just don't work are like the dot-com bubble company pets.com
- false positive results (type I errors) are like pets.com *before* the bubble bursts
- inherent study bias is analogous to a company that is cooking the books, like Enron (think intensive insulEnron therapy)

Lack of power, number 1 (which, ironically, is often due to lack of funding), deserves little further comment, and this blog is a virtual treatise on number 4, study bias and cooking the books. But why, save for our inclination toward irrational exuberance, do we keep investing in pets.com without realizing it? There are two very important and related reasons.

The first is that we are willfully or unwittingly ignorant of the failure rate of start-ups, that is, new and unproven therapies. That failure rate is very high, which means that the prior probability of success is correspondingly low. (If we were going to give a business loan for one of these ventures, we should charge a very high interest rate to account for a high rate of default.) I estimate, without data, just as a gestalt, that the prior probability of actual success of any therapy without biological precedent (basically, anything other than a me-too drug) is less than 10%, probably within an order of magnitude of 1% in either direction. But we get seduced by a good story, and then we make our second error: we are fooled by the data. This is where the p-value comes in.

Because we overestimate the prior (pre-test) probability of success, we let p-values appear to confirm alternative hypotheses that are in fact false. The Nature article describes this elegantly, and mathematically inclined readers can reference several original expositions of the concepts here and here. The astonishingly instructive and useful result of the latter article is a nomogram based on prior probability distributions of the null and alternative hypotheses that can be used to estimate the posterior probability of the null (or alternative) hypothesis, beginning with a "guess" of the prior probability of the null. Here's that nomogram (it's an open access article; you can open it in another browser window by right-clicking this link for Alt-Tab reference):

On the left is the prior probability of the *null hypothesis*, in the middle is the p-value of the frequentist data from your trial, and on the right is the *minimum* posterior probability of the *null*. (Power has some effect too, but it's less than you might think, so it can generally be ignored.) Let's say your prior probability of the alternative hypothesis is a generous 50% (corresponding to a prior of the null of 50%). Draw the line from 50% on the left through 0.05 in the middle, and behold! The *minimum* posterior probability of the null is 30%! And *minimum* is a pivotal word here, because it means that it could be more than that.

But the situation is worse than that. If I'm correct and the prior probability of the alternative hypothesis is on the order of 10% or less (corresponding to a 90% probability of the null), and you draw the line from 90% on the left through 0.05 in the middle, you are off the darned chart! Your *minimum* posterior probability of the null exceeds 50%! If your p-value is 0.01, you're still at a minimum of 50% posterior probability of the null, and even if your p-value is 0.001, there is a minimum 15% probability that the null is true. You can also use this nomogram backwards, assuming a high posterior probability of a therapy that is "known" to not work, and back-extrapolate what we should have used as the prior. Try it with the PROWESS data relating to drotrecogin-alfa. Even with a generous 50% posterior and the PROWESS p-value of 0.007, we see that the prior probability that Xigris would work was less than 2 percent, similar to what I have been saying for all immunomodulatory therapies for sepsis.
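
For readers who would rather compute than draw lines, here is a rough sketch of the forward use of the nomogram. It rests on an assumption of mine: that the chart is built on the commonly used lower bound on the Bayes factor for the null, -e · p · ln(p), which holds for p below 1/e. The function names are made up for illustration and are not taken from the linked article.

```python
import math


def min_bayes_factor(p):
    """Assumed lower bound on the Bayes factor for the null: -e * p * ln(p)."""
    if not 0 < p < 1 / math.e:
        raise ValueError("the bound applies only for 0 < p < 1/e")
    return -math.e * p * math.log(p)


def min_posterior_null(prior_null, p):
    """Minimum posterior probability of the null, given a prior and a p-value."""
    if not 0 < prior_null < 1:
        raise ValueError("prior_null must lie strictly between 0 and 1")
    prior_odds = prior_null / (1 - prior_null)          # odds in favor of the null
    posterior_odds = prior_odds * min_bayes_factor(p)   # Bayes' rule on the odds scale
    return posterior_odds / (1 + posterior_odds)        # convert odds back to probability


# The scenarios walked through above: (prior probability of the null, p-value)
for prior_null, p in [(0.5, 0.05), (0.9, 0.05), (0.9, 0.01), (0.9, 0.001)]:
    post = min_posterior_null(prior_null, p)
    print(f"prior null {prior_null:.0%}, p = {p}: minimum posterior null {post:.0%}")
```

Under that assumed bound, the four scenarios come out at about 29%, 79%, 53% and 14%, in line with the readings taken off the chart above. Treat this as a way to play with priors, not as a substitute for the nomogram itself.
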
This nomogram beautifully demonstrates how sensitive the posterior probability is to the prior probability and reveals a startling and disturbing fact: many of the "positive" studies that we base our decisions upon are actually totally uninformative. Now we have the complete picture, and the mystery of negative studies and non-replicability is largely solved. Even if we're adequately powered, even if we're free from bias, we have this seemingly intractable problem of unknown priors, which makes our significance tests unreliable for decision making. And, worst of all, the prior probability of the null hypothesis is likely to be quite high across a wide spectrum of disciplines and therapies.

Behold how the Oracle of Omaha still delivers his prophecy far better than we, even if he has had a rough run of late.
