A recent article in JAMA (http://jama.ama-assn.org/cgi/content/abstract/303/9/865) reports a meta-analysis of three trials comparing a strategy of higher versus lower PEEP (positive end-expiratory pressure) in Acute Lung Injury (ALI – a less severe form of lung injury) and Acute Respiratory Distress Syndrome (ARDS – a more severe form, at least as measured by oxygenation, one facet of its effects on physiology). The results of this impeccably conducted analysis are interesting enough (high PEEP is beneficial in a pre-specified subgroup analysis of ARDS patients, but may be harmful in the subgroup with less severe ALI), but I am more struck by the discussion as it pertains to the future of trials in critical care medicine – a discussion echoed by the editorialist (http://jama.ama-assn.org/cgi/content/extract/303/9/883).

The trials included in this meta-analysis lacked statistical precision for two principal reasons: 1.) they used the typical cookbook approach to sample size determination, choosing a delta of 10% without any justification whatsoever for this number (thus the studies were guilty of DELTA INFLATION); 2.) according to the authors of the meta-analysis, two of the three trials were stopped early for futility, further decreasing the statistical precision of already effectively underpowered trials. The resulting 95% CIs for delta in these trials thus extended to −10% (in the ARDSnet ALVEOLI trial; i.e., high PEEP may increase mortality by up to 10%) and to +10% (in the Mercat and Meade trials; i.e., higher PEEP may decrease mortality by upwards of 10%).

Because of the lack of statistical precision of these trials, the authors of the meta-analysis appropriately used individual patient data from the trials as meta-analytical fodder, with a likely useful result – high PEEP is probably best reserved for the sickest patients with ARDS, and avoided for those with ALI. (Why there is an interaction between severity of lung injury and response to PEEP is open for speculation, and is an interesting topic in itself.) What interests me more than this main result is the authors' and editorialist's suggestion that we should be doing “prospective meta-analyses” or at least design our trials so that they easily lend themselves to this application should we later decide to do so. Which raises the question: why not just mount a bigger trial from the outset, choosing a realistic delta and disallowing early stopping for “futility”?

(It is useful to note that the term futility is unhappily married to, or better yet enslaved by, alpha (the threshold P-value for statistical significance). A trial is deemed futile if there is no hope of crossing the alpha/P-value threshold. But it is certainly not futile to continue enrolling patients if each additional accrual increases the statistical precision of the final result by narrowing the 95% CI of delta. Indeed, I’m beginning to think that the whole concept of “futility” is a specious one - unless you're a funding agency.)
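To put numbers on that point, here is a sketch with a purely hypothetical trial in which both arms happen to run at 30% mortality (my illustrative figures, not data from any of the three trials): the 95% CI of delta keeps narrowing with every additional accrual, whether or not alpha remains reachable.

```python
from math import sqrt

def half_width(p, n):
    """95% CI half-width for a risk difference when both arms show
    mortality p with n patients per arm (normal approximation)."""
    return 1.96 * sqrt(2 * p * (1 - p) / n)

# A trial stopped "for futility" at 400/arm leaves a much wider CI
# than one carried on to 1000/arm, even if delta never budges:
for n in (400, 700, 1000):
    print(n, round(half_width(0.30, n), 3))
```

Stopping at 400 per arm leaves the CI roughly ±6.4 percentage points wide; carrying on to 1000 narrows it to about ±4 points – hardly a futile exercise.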

Large trials may be cumbersome, but they are not impossible. The SAFE investigators (http://content.nejm.org/cgi/content/abstract/350/22/2247) enrolled ~7000 patients seeking a delta of 3% in a trial involving 16 ICUs in two countries. Moreover, a prospective meta-analysis doesn’t reduce the number of patients required; it simply splits the population into quanta and epochs, which will hinder homogeneity in the meta-analysis if enrollment and protocols are not standardized or if temporal trends in care and outcomes come into play. If enrollment and protocols ARE standardized, it is worth asking: why not just do one large study from the outset, using a realistic delta and sample size? Why not coordinate all the data (American, French, Canadian, whatever) through a prospective RCT rather than a prospective meta-analysis?
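Here is a back-of-the-envelope sketch of why delta dominates sample size (the baseline mortalities and 90% power are my assumptions for illustration, not figures taken from the papers):

```python
from math import sqrt

def n_per_arm(p1, p2, z_alpha=1.96, z_power=1.2816):
    """Per-arm n for a two-proportion comparison (normal approximation);
    defaults: two-sided alpha 0.05, power 90%."""
    pbar = (p1 + p2) / 2
    num = (z_alpha * sqrt(2 * pbar * (1 - pbar))
           + z_power * sqrt(p1 * (1 - p1) + p2 * (1 - p2)))
    return (num / (p1 - p2)) ** 2

# A 10-point delta from a 30% baseline: a few hundred patients per arm.
print(round(n_per_arm(0.30, 0.20)))
# A 3-point delta from a ~21% baseline (a SAFE-like trial): thousands per arm.
print(round(n_per_arm(0.21, 0.18)))
```

Under these assumptions the 3% delta demands roughly nine times the enrollment of the 10% delta – which is why delta-inflated trials are cheap, and why SAFE needed ~7000 patients.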

Here’s my biggest gripe with the prospective meta-analysis – in essence, you are taking multiple looks at the data, one look after each trial is completed (I’m not even counting intra-trial interim analyses), but you’re not correcting for the multiple comparisons. And most likely, once there is a substantial positive trial, it will not be repeated, for a number of reasons such as all the hand-waving about it not being ethical to repeat it and randomize people to no treatment (notwithstanding that repeatability is one of the cardinal features of science). Think ARMA (http://content.nejm.org/cgi/content/extract/343/11/812). There were smaller trials leading up to it, but once ARMA was positive, no additional noteworthy trials sought to test low tidal volume ventilation for ARDS. So, if we’re going to stop conducting trials for our “prospective meta-analysis”, what will our early stopping rule be? When will we stop our sequence of trials? Will we require a P-value of 0.001 or less after the first look at the data (that is, after the first trial is completed)? Doubtful. As soon as a significant result is found in a soundly designed trial, further earnest trials of the therapy will cease and victory will be declared. Only when there is a failure or a “near-miss” will we want a “do-over” to create more fodder for our “prospective meta-analysis”. We will keep chasing the result we seek until we find it, nitpicking design and enrollment details of “failed” trials along the way to justify the continued search for the “real” result with a bigger and better trial.

If we’re going to go to the trouble of coordinating a prospective meta-analysis, I don’t understand why we wouldn’t just coordinate an adequately powered RCT based on a realistic delta (itself based on an MCID or preliminary data) and carry it to its pre-specified enrollment endpoint, “futility stopping rules” be damned. With the statistical precision that would result, we could examine the 95% CI of the resulting delta to answer the practical questions that clinicians want answered, even if our P-value were insufficient to satisfy the staunchest of statisticians. Perhaps the best thing about such a study is that the force of its statistical precision would incapacitate single-center trialists, delta inflationists, and meta-analysts alike.

## Tuesday, March 23, 2010

## Friday, March 5, 2010

### Levo your Dopa at the Door - how study design influences our interpretation of reality

Another excellent critical care article was published this week in NEJM, the SOAP II study: http://content.nejm.org/cgi/content/short/362/9/779 . In this RCT of norepinephrine (norepi, levophed, or "levo" for short) versus dopamine ("dopa" for short) for the treatment of shock, the authors tried to resolve the longstanding uncertainty and debate surrounding the treatment of patients in various shock states. Proponents of any agent in this debate have often hung their hats on extrapolations of physiological and pharmacological principles to intact humans, leading to colloquialisms such as "leave-em-dead" for levophed and "renal-dose dopamine". This blog has previously emphasized the frailty of pathophysiological reasoning, the same reasoning which has irresistibly drawn cardiologists and nephrologists to dopamine because of its presumed beneficial effects on cardiac and urine output, and, by association, outcomes.

Hopefully all docs with a horse in this race will take note of the outcome of this study. In its simplest, most straightforward, and technically correct interpretation, levo was not superior to dopa in terms of an effect on mortality, but was indeed superior in terms of side effects, particularly cardiac arrhythmias (a secondary endpoint). The direction of the mortality trend was in favor of levo, consistent with observational data (the SOAP I study by many of the same authors) showing reduced mortality with levo compared with dopa in the treatment of shock. As followers of this blog also know, the interpretation of "negative" studies (that is, MOST studies in critical care medicine - more on that in a future post) can be more challenging than the interpretation of positive studies, because "absence of evidence is not evidence of absence".

We could go to the statistical analysis section, and I could harp on the choice of delta, the decision to base it on a relative risk reduction, the failure to predict a baseline mortality, etc. (I will note that at least the authors defended their delta based on prior data, something that is a rarity - again, a future post will focus on this.) But, let's just be practical and examine the 95% CI of the mortality difference (the primary endpoint) and try to determine whether it contains or excludes any clinically meaningful values that may allow us to compare these two treatments. First, we have to go to the raw data and find the 95% CI of the ARR, because the Odds Ratio can inflate small differences as you know. That is, if the baseline is 1%, then a statistically significant odds ratio of 1.4 is not meaningful because it represents only a 0.4% increase in the outcome - minuscule. With Stata, we find that the ARR is 4.0%, with a 95% CI of -0.76% (favors dopamine) to +8.8% (favors levo). Wowza! Suppose we say that a 3% difference in mortality in either direction is our threshold for CLINICAL significance. This 95% CI includes a whole swath of values between 3% and 8.8% that are of interest to us, and they are all in favor of levo. (Recall that perhaps the most lauded trial in critical care medicine, the ARDSnet ARMA study, reduced mortality by just 9%.) At the other end of the spectrum, the range of values in favor of dopa is quite narrow indeed - from 0% to -0.76%, all well below our threshold for clinical significance (that is, the minimal clinically important difference or MCID) of 3%. So indeed, this study surely seems to suggest that if we ever choose between these two widely available and commonly used agents, the cake goes to levo, hands down. I hardly need a statistically significant result with a 95% CI like this one!
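For those without Stata handy, the same interval falls out of a plain Wald calculation. A sketch (the arm sizes and mortality proportions below are approximate figures from the published report):

```python
from math import sqrt

# Approximate SOAP II 28-day figures: dopamine 52.5% mortality among 858,
# norepinephrine 48.5% among 821 (proportions as published).
p_dopa, n_dopa = 0.525, 858
p_levo, n_levo = 0.485, 821

arr = p_dopa - p_levo  # absolute risk reduction with levo
se = sqrt(p_dopa * (1 - p_dopa) / n_dopa + p_levo * (1 - p_levo) / n_levo)
lo, hi = arr - 1.96 * se, arr + 1.96 * se

print(round(arr, 3))               # 0.04
print(round(lo, 3), round(hi, 3))  # roughly -0.008 to 0.088
```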

So, then, why was the study deemed "negative"? There are a few reasons. Firstly, the trial is probably guilty of "delta inflation" whereby investigators seek a pre-specified delta that is larger than is realistic. While they used, ostensibly, 7%, the value found in the observational SOAP I study, they did not account for regression to the mean, or allow any buffer for the finding of a smaller difference. However, one can hardly blame them. Had they looked instead for 6%, and had the 4% trend continued for additional enrollees, 300 additional patients in each group (or about 1150 in each arm) would have been required and the final P-value would have still fallen short at 0.06. Only if they had sought a 5% delta, which would have DOUBLED the sample size to 1600 per arm, would they have achieved a statistically significant result with 4% ARR, with P=0.024. Such is the magnitude of the necessary increase in sample size as you seek smaller and smaller deltas.
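A standard two-proportion approximation (my assumptions: 80% power, two-sided alpha of 0.05, 52.5% baseline mortality) shows how steeply n climbs as delta shrinks:

```python
from math import sqrt

def n_per_arm(p1, p2):
    """Per-arm n, two-sided alpha 0.05, power 80% (normal approximation)."""
    z_a, z_b = 1.96, 0.8416
    pbar = (p1 + p2) / 2
    num = (z_a * sqrt(2 * pbar * (1 - pbar))
           + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2)))
    return (num / (p1 - p2)) ** 2

baseline = 0.525
for delta in (0.07, 0.06, 0.05):
    print(delta, round(n_per_arm(baseline, baseline - delta)))
```

Under these assumptions the requirement runs from roughly 800 per arm at a 7% delta to roughly 1570 per arm at 5% – the same ballpark as the figures quoted above, and a near-doubling for a two-point change in delta.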

Which brings me to the second issue. If delta inflation leads to negative studies, and logistical and financial constraints prohibit the enrollment of massive numbers of patients, what is an investigator to do? Sadly, the poor investigator wishing to publish in the NEJM or indeed any peer reviewed journal is hamstrung by conventions that few these days even really understand anymore: namely, the mandatory use of 0.05 for alpha and "doubly significant" power calculations for hypothesis testing. I will not comment more on the latter other than to say that interested readers can google this and find some interesting, if arcane, material. As regards the former, a few comments.

The choice of 0.05 for the type 1 error rate (the probability that we will reject the null hypothesis based on the data and falsely conclude that one therapy is superior to the other) and of 10-20% for the type 2 error rate (power 80-90%; the probability that the alternative hypothesis is really true but we will fail to detect it) derive from the traditional assumption, which is itself an omission bias, that it is better in the name of safety to keep new agents out of practice by setting a more stringent requirement for accepting efficacy than for rejecting it. This asymmetry in the design of trials is of dubious rationality from the outset (because it is an omission bias), but it is especially nettlesome when the trial compares two agents already in widespread use. In a trial of a new drug against placebo, we want to set the hurdle high for declaring efficacy, especially when the drug might have side effects; with levo versus dopa, the real risk is that we'll continue to consider them equivalent choices when there is strong reason to favor one over the other based on either previous or current data. This is NOT a trial of treatment versus no treatment of shock; this trial assumes that you're going to treat the shock with SOMETHING. In a trial such as this one, one could make a strong argument that a P-value of 0.10 should be the threshold for statistical significance. In my mind it should have been.
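To put a number on that argument, using the same approximate SOAP II figures as above: the observed 4% ARR corresponds to a two-sided P of about 0.10 under a simple Wald test - "negative" at the conventional alpha, positive at the threshold argued for here.

```python
from math import sqrt, erf

# Approximate SOAP II figures: dopamine 52.5% of 858, norepi 48.5% of 821.
p1, n1, p2, n2 = 0.525, 858, 0.485, 821

arr = p1 - p2
se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
z = arr / se
# Two-sided P from the normal CDF, Phi(z) = 0.5 * (1 + erf(z / sqrt(2)))
p_two_sided = 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))

print(round(z, 2))            # ~1.64
print(round(p_two_sided, 2))  # ~0.10
```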

But as long as the perspicacious consumer of the literature and reader of this blog takes P-values with a grain of salt and pays careful attention to the confidence intervals and the MCID (whatever that may be for the individual), s/he will not be misled by the deeply entrenched convention of alpha at 0.05, power at 90%, and delta wildly inflated to keep the editors and funding agencies mollified.
