At many institutions, Journal Clubs meet to dissect a trial after its results are published, looking for flaws, biases, shortcomings, and limitations. Beyond disseminating the informational content of the articles reviewed, Journal Clubs serve as a reiteration and extension of the limitations section of the article's discussion. Unless they result in a letter to the editor, or a new peer-reviewed article about the limitations of the trial that was discussed, the debates of Journal Club begin a headlong recession into obscurity soon after the meeting adjourns.
The proliferation and popularity of online media have led to what amounts to a real-time, longitudinally documented Journal Club. Named “post-publication peer review” (PPPR), it consists of blog posts, podcasts and videocasts, comments on research journal websites, remarks on online media outlets, and websites dedicated specifically to PPPR. Like a traditional Journal Club, PPPR seeks to redress any deficiencies in the traditional peer review process that lead to shortcomings or errors in the reporting or interpretation of a research study.
PPPR following publication of a “positive” trial, that is, one where the authors conclude that their a priori criteria for rejecting the null hypothesis were met, is oftentimes directed at identifying a host of biases in the design, conduct, and analysis of the trial that may have led to a “false positive” trial. False positive trials are those in which either a type I error has occurred (the null hypothesis was rejected even though it is true and no difference between groups exists), or the structure of the experiment was biased in such a way that the experiment and its statistics cannot be informative. The biases that cause structural problems in a trial are manifold, and I may attempt to delineate them at some point in the future. Because it is a simpler task, I will here attempt to list a differential diagnosis that people may use in PPPRs of “negative” trials.
Far and away, the principal reason that a trial “fails” to show a positive result is that the null hypothesis is true and there is no difference between the populations of patients randomized to the different groups. For medical trials of an intervention, this means there is no effect of the intervention on the assigned primary outcome. I can say that it is the principal reason because, since a widely cited 1978 article in the NEJM, most trials have used sample size calculations based on a dual hypothesis model, and they are generally powered at the 80-90% level. If the power of the trial to find a difference of a given amount (called “delta” or the effect size of the therapy) is 90%, the probability that the trial will find a difference as great or greater than delta if such a difference truly exists is 90%. If it’s there, you’ll probably find it. If you didn’t find it, it’s probably not there.
The next reason that a trial may fail to find a difference, at least in critical care trials, is a phenomenon dubbed delta inflation. Delta inflation occurs when the delta value used in power calculations is overestimated, and it occurs because smaller deltas (less than 5-10%) require increasingly large sample sizes (>1000-5000 patients), which may be prohibitive for investigators based on cost or logistics. Instead of abandoning the trial altogether, investigators settle for looking for an unrealistically large delta because it minimizes sample sizes and makes the trial feasible. If a trial with delta inflation that was powered at 90% to find a 10% difference in mortality is negative, our best move is to look at the confidence interval surrounding the result to see if it includes values that may be important to us. If it does, we may be inclined to advocate that a larger trial using a smaller delta be performed. Negative trials with delta inflation provide us with some evidence of the lack of efficacy of the therapy, but less than if delta inflation were not operative.
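To make the sample size pressure behind delta inflation concrete, here is a minimal sketch using the usual normal-approximation formula for comparing two proportions. The 40% control-group mortality, the two-sided alpha of 0.05, and the function name `n_per_group` are illustrative assumptions of mine, not values taken from any particular trial.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(p_control, delta, alpha=0.05, power=0.90):
    """Approximate patients per arm needed to detect an absolute risk
    reduction of `delta` (normal approximation, two-sided test)."""
    p_treat = p_control - delta
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for type I error
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    variance = p_control * (1 - p_control) + p_treat * (1 - p_treat)
    return ceil((z_alpha + z_beta) ** 2 * variance / delta ** 2)


# Hypothetical control mortality of 40%; shrink the delta and watch n grow.
for delta in (0.10, 0.05, 0.03):
    n = n_per_group(0.40, delta)
    print(f"delta = {delta:.0%}: about {n} patients per group")
```

Because delta enters the formula squared in the denominator, halving the delta roughly quadruples the required enrollment, which is the arithmetic that tempts investigators toward an optimistic delta.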
If the primary endpoint is mortality (as it often is in trials in critical care) and there is no difference between the intervention groups, we have to ask ourselves if it is reasonable to expect the intervention to affect mortality. For example, it is quite possible that right heart or Swan-Ganz catheters, which have repeatedly been shown to have no impact on mortality, may have other benefits that were not measured in the trials and may not be measurable. Indeed, it is worth asking if mortality is the appropriate endpoint for any trial of a diagnostic test or intervention, because there are few diagnostic tests or interventions which have been shown to improve mortality. Diagnosis is so far upstream from intervention and outcomes that any positive effect on mortality may be diluted to the point that it is not detectable without extremely large sample sizes. The same logic has been used to argue that trials of prevention of diseases (such as ARDS) are doomed from the outset.
If there was no difference in the primary outcomes of the trial, we could look at other outcomes (secondary outcomes) to see if there is a suggestion of a difference there. This approach is fraught with problems including multiple comparisons and, like inspecting the confidence interval of a trial with possible delta inflation, is useful only for “hypothesis generation” for a future trial. Positive results in secondary outcomes and subgroup analyses should not override the results of the primary analysis.
If the enrollment goal is not met, or the trial is stopped early for futility, a trial designed with 90% power may have only 60% power. If the primary outcome is evaluated using a time-to-event analysis and follow-up is incomplete, or fewer events occur than were assumed in the sample size calculations, then the trial likewise does not have its stated power, and the Type II error rate is effectively increased.
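The loss of power from under-enrollment can be sketched with the same normal approximation. The event rates (40% versus 30%) and the planned 473 patients per group are hypothetical numbers chosen so that the full trial has roughly 90% power; `power_two_proportions` is a helper name I made up for this illustration.

```python
from math import sqrt
from statistics import NormalDist


def power_two_proportions(p_control, p_treat, n_per_group, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two proportions.
    Ignores the negligible chance of rejecting in the wrong direction."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    se = sqrt((p_control * (1 - p_control) + p_treat * (1 - p_treat)) / n_per_group)
    return NormalDist().cdf(abs(p_control - p_treat) / se - z_alpha)


# Hypothetical trial: planned 473 per group, but enrollment may fall short.
for enrolled in (473, 355, 237):
    achieved = power_two_proportions(0.40, 0.30, enrolled)
    print(f"{enrolled} per group: power is roughly {achieved:.0%}")
```

Under these assumptions, enrolling only about half of the planned patients drops the power from roughly 90% to somewhere in the low 60s, in line with the figures quoted above.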
Another category of problems that can cause a trial to be falsely negative results from what I will call the “problem of separation”. Problems of separation occur when the two groups do not get treatments that differ enough in their intensity to allow separation of the populations to be observed in test statistics. For example, the ARDSnet ARMA trial showed a difference in mortality when comparing 6 cc/kg PIBW to 12 cc/kg PIBW, but it is likely that a trial comparing 6 cc/kg and 7 cc/kg PIBW would fail because the difference in the dose between the groups causes a problem of separation. Similarly, the TTM (Targeted Temperature Management) trial has been criticized for temperature targets that were too close together (33 degrees versus 36 degrees Celsius). A problem of separation can also occur if there is a lot of crossover between groups such that placebo patients get active treatment and active treatment patients stop taking treatment.
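The crossover version of the problem of separation can be illustrated with a little arithmetic. This sketch assumes, as a simplification of mine, that a patient who switches arms experiences the event rate of the treatment actually received; the 15% crossover rates, the event rates, and the sample size are all hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def observed_rates(p_control, p_treat, control_crossover, treat_dropout):
    """Event rates seen in an intention-to-treat analysis when a fraction of
    each arm ends up receiving the other arm's treatment."""
    obs_control = (1 - control_crossover) * p_control + control_crossover * p_treat
    obs_treat = (1 - treat_dropout) * p_treat + treat_dropout * p_control
    return obs_control, obs_treat


def power_two_proportions(p_control, p_treat, n_per_group, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two proportions."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    se = sqrt((p_control * (1 - p_control) + p_treat * (1 - p_treat)) / n_per_group)
    return NormalDist().cdf(abs(p_control - p_treat) / se - z_alpha)


# Hypothetical: true rates 40% vs 30%, 15% crossover in each direction.
obs_c, obs_t = observed_rates(0.40, 0.30, 0.15, 0.15)
print(f"observed difference: {obs_c - obs_t:.3f} (true difference 0.100)")
print(f"power without crossover: {power_two_proportions(0.40, 0.30, 473):.0%}")
print(f"power with crossover:    {power_two_proportions(obs_c, obs_t, 473):.0%}")
```

In this simple model the observed difference shrinks by the sum of the two crossover fractions, so a trial sized for the true delta is quietly underpowered for the diluted one.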
Finally (unless there are scenarios that I’ve neglected) a false negative trial may arise if the patients enrolled in the trial are either “too sick” or “not sick enough”. Imagine two trials of an intervention for an infection. In one trial, patients are enrolled only if they are on three vasopressors and both groups have mortality exceeding 95%. When all patients are bound to die, the intervention may not show any effect. In the other trial, patients with very mild infection are enrolled and more than 95% of them survive. When everybody is going to survive, the intervention may not show an effect except with very large sample sizes. By this logic, the probability of an intervention showing an effect plotted against the severity of illness will reveal an inverted U-shaped curve. Such a curve probably describes responses to antidepressant medications.
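One way to see how an inverted U could arise is to assume that the intervention multiplies the odds of death by a constant factor regardless of baseline risk. That constant odds ratio is purely my assumption for illustration, as are the odds ratio of 0.65 and the 473 patients per group; the argument above is qualitative and does not depend on these numbers.

```python
from math import sqrt
from statistics import NormalDist


def treated_risk(p_control, odds_ratio):
    """Risk in the treated arm if treatment scales the odds by `odds_ratio`."""
    odds = p_control / (1 - p_control) * odds_ratio
    return odds / (1 + odds)


def power_two_proportions(p_control, p_treat, n_per_group, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two proportions."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    se = sqrt((p_control * (1 - p_control) + p_treat * (1 - p_treat)) / n_per_group)
    return NormalDist().cdf(abs(p_control - p_treat) / se - z_alpha)


# Sweep baseline mortality from "not sick enough" to "too sick".
for baseline in (0.05, 0.25, 0.50, 0.75, 0.95):
    p_treat = treated_risk(baseline, odds_ratio=0.65)
    pw = power_two_proportions(baseline, p_treat, n_per_group=473)
    print(f"baseline {baseline:.0%}: absolute difference "
          f"{baseline - p_treat:.3f}, power roughly {pw:.0%}")
```

Under this assumption the absolute difference, and with it the power, peaks at intermediate baseline risk and falls away at both extremes.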
Here’s a summary differential diagnosis for the autopsy of a dead RCT that can be used in Journal Clubs and PPPR:
- The therapy truly does not work
- Type II error
- Delta inflation
- Wrong endpoint studied
- Failure of separation (insufficient difference in dosing intensity, or crossover between groups)
- Wrong patient populations
“To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination......to say what the experiment died of” - RA Fisher, found in Mayo’s book Statistical Inference as Severe Testing, p. 294