Saturday, May 1, 2010

Everyone likes their own brand - Delta Inflation: A bias in the design of RCTs in Critical Care

At long last, our article describing a bias in the design of RCTs in Critical Care Medicine (CCM) has been published (see: ). Interested readers are directed to the original manuscript. I'm not in the business of critiquing my own work on my own blog, but I will at least provide a summary.

When investigators design a trial and perform power and sample size (SS) calculations, they must estimate or predict a priori the [dichotomous] effect size in, say, mortality (the primary endpoint of most trials in CCM). Ideally, this number is based on preliminary data or on a minimal clinically important difference (MCID). Unfortunately, it does not usually happen that way. Instead, investigators choose the number of patients they think they can recruit with the available funding, time, and resources, and then calculate the effect size they could detect with that number of patients at 80-90% power and (usually) an alpha level of 0.05.
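The arithmetic described above can be made concrete with a short sketch. The formula below is the standard normal-approximation sample size calculation for comparing two proportions; the function name and the 40% baseline mortality are illustrative assumptions, not numbers from the article.

```python
from statistics import NormalDist  # stdlib normal quantiles (Python 3.8+)

def n_per_group(p_control, p_treat, alpha=0.05, power=0.80):
    """Patients per arm for a two-sided two-proportion z-test,
    using the simple normal approximation (unpooled variances)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # critical value for two-sided alpha
    z_beta = z.inv_cdf(power)            # quantile corresponding to target power
    var = p_control * (1 - p_control) + p_treat * (1 - p_treat)
    delta = p_control - p_treat          # absolute risk reduction sought
    return (z_alpha + z_beta) ** 2 * var / delta ** 2

# Seeking a 10% absolute mortality reduction from a 40% baseline:
print(round(n_per_group(0.40, 0.30)))   # roughly 350 patients per arm
# Halving the target delta to 5% roughly quadruples the sample size:
print(round(n_per_group(0.40, 0.35)))
```

Run in reverse, this arithmetic is the mechanism of the bias: a fixed, affordable n pins down the delta a trial can detect, so delta ends up chosen to fit the budget rather than the biology.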

If power and SS calculations were being performed ideally, and investigators were using preliminary data or published data on similar therapies to predict delta for their trial, we would expect that, over the course of many trials, they would be just as likely to OVERESTIMATE observed delta as to underestimate it. If this were the case, we would expect random scatter around a line representing unity in a graph of observed versus predicted delta (see Figure 1 in the article). If, on the other hand, predicted delta uniformly exceeds observed delta, there is directional bias in its estimation. Indeed, this is exactly what we found. This is no different from the weatherman consistently overpredicting the probability of precipitation, a horserace handicapper consistently setting odds on winning horses that are too long, or Tiger Woods consistently putting too far to the right. Bias is bias. And it is unequivocally demonstrated in Figure 1.

Another point, which we unfortunately failed to emphasize in the article, is that if the predicted deltas were being guided by an MCID, then the MCID for mortality should be the same across the board. It is not (Figure 1 again). It ranges from a 3% to a 20% absolute reduction in mortality. Moreover, in Figure 1, note the clustering around round numbers like 10% - how many fingers or toes you have should not determine the effect size you seek when you design an RCT.

We sincerely hope that this article stimulates some debate in the critical care community about the statistical design of RCTs, and indeed about which primary endpoints are chosen for them. Chasing mortality is looking more and more like a fool's errand.

Caption for Figure 1:
Figure 1. Plot of observed versus predicted delta (with associated 95% confidence intervals for observed delta) of 38 trials included in the analysis. Point estimates of treatment effect (deltas) are represented by green circles for non-statistically significant trials and red triangles for statistically significant trials. Numbers within the circles and triangles refer to the trials as referenced in Additional file 1. The blue ‘unity line’ with a slope equal to one indicates perfect concordance between observed and predicted delta; for visual clarity and to reduce distortions, the slope is reduced to zero (and the x-axis is horizontally expanded) where multiple predicted deltas have the same value and where 95% confidence intervals cross the unity line. If predictions of delta were accurate and without bias, values of observed delta would be symmetrically scattered above and below the unity line. If there is directional bias, values will fall predominantly on one side of the line, as they do in the figure.


  1. Scott, there is no reason that the delta used in sample size calculations has to be a prediction. In the case of a drug trial, for example, it might be the delta which would be needed to be superior to a competitor, or to justify moving forward with development. In a sense, deltas are always arbitrary. If I choose a sample size to give me 80% power of detecting a delta of 10 units, for example, then at the same time I also have a power of 70% for detecting a delta of 9, and 90% for a delta of 12, etc. When we consider the continuum of possible deltas and pick a point to use in sample size calculations, there are no restrictions on how we pick it, as long as the trial passes ethical muster.
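The continuum of powers described in the comment above is easy to verify numerically. This is a minimal sketch using the same normal approximation as standard two-proportion power calculations; the function name and the 40% control-arm mortality are illustrative assumptions.

```python
from statistics import NormalDist  # stdlib, Python 3.8+

def power_two_prop(p_control, p_treat, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test
    (normal approximation, ignoring the negligible far tail)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    var = p_control * (1 - p_control) + p_treat * (1 - p_treat)
    shift = abs(p_control - p_treat) * (n_per_arm / var) ** 0.5
    return z.cdf(shift - z_alpha)

# One fixed sample size (here 350 per arm) carries a whole curve of powers,
# one for every delta on the continuum:
for p_treat in (0.36, 0.34, 0.32, 0.30):
    delta = 0.40 - p_treat
    print(f"delta {delta:.2f}: power {power_two_prop(0.40, p_treat, 350):.2f}")
```

A sample size chosen for 80% power at a 10% delta simultaneously has much less power for smaller deltas and more for larger ones; the single delta quoted in a protocol is one point picked off this curve.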

  2. Tom -

    Delta IS a prediction. It is a prediction of the size of the effect. You think it is not? Why not use a delta of 1%? Or 50%? Because the SS required for 1% is insurmountable for most trials, and because 50% is unrealistic for almost all therapies, with resulting nonsensically wide confidence intervals. So it is an inevitable conclusion that not just ANY delta is acceptable. Which raises the question - what is a REASONABLE delta? A reasonable delta is one that is anchored to the biologically plausible effect of the therapy studied, or to the MCID.

  3. A couple of other things have occurred to me about this graph since we published this paper. Of course, the data points to the right of the graph, with the largest predicted and observed deltas, have wider confidence intervals - which is expected, since they came from smaller sample sizes. But consider the phenomenon of regression to the mean. These data points also lie far to the extreme right of the mean observed delta of all trials - and if regression to the mean applies here, then were these studies repeated, the observed deltas would probably be much smaller.

    Also, I realized that we should have tabulated which among the "positive, stat sig" studies (there were few of them) were confirmed versus not confirmed on repeat study. That would winnow the "positive" trials in our study down to an even more discouraging number. (The original Leuven study would fall out, some would say Xigris would fall out, the Annane steroid study would fall out [wait, was the unadjusted analysis even positive?], etc...)
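The regression-to-the-mean point in the first paragraph of this comment can be illustrated with a quick simulation (an editorial sketch, not an analysis from the paper): every simulated trial has the same true 5% absolute mortality reduction, yet the trials selected for having observed an "impressive" delta of 10% or more shrink back toward 5% on replication.

```python
import random
from statistics import mean

random.seed(1)  # reproducible illustration

def observed_delta(true_delta, n=200, p_control=0.40):
    """Simulate one two-arm trial; return the observed absolute risk reduction."""
    deaths_control = sum(random.random() < p_control for _ in range(n))
    deaths_treat = sum(random.random() < p_control - true_delta for _ in range(n))
    return (deaths_control - deaths_treat) / n

trials = [observed_delta(0.05) for _ in range(5000)]
extreme = [d for d in trials if d >= 0.10]          # the "impressive" first results
repeats = [observed_delta(0.05) for _ in extreme]   # the same trials run again
print(f"selected trials: mean delta {mean(extreme):.3f}")
print(f"on replication:  mean delta {mean(repeats):.3f}")
```

Selection on an extreme observed delta guarantees the follow-up looks disappointing even when the therapy works exactly as well the second time, which is one more reason to distrust the rightmost points in Figure 1.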