Thursday, August 13, 2009

The enemy of good evidence is better evidence: Aspirin, colorectal cancer, and knowing when enough is enough

An epidemiological study of the impact of aspirin (ASA) on outcomes from colorectal carcinoma (CRCA) in JAMA has made quite a splash, which has extended to the lay press (see Chan et al). I read this study, which normally would not have been of interest to me, because I knew that it was an epidemiological study and suspected that numerous methodological flaws would limit the conclusions that one could draw from it. And I admit that I was surprised that it has all the trappings of a methodologically superb study, complete in almost every way, including reporting of the Wald test for the assumption of proportional hazards, reporting of assessment for collinearity and overfitting of the model, etc. It's all there, everything that one could want.

I should start by stating that the hypothesis that ASA might influence the course of these cancers, which express COX-2, is biologically plausible. I am no expert in this area, so I will take it as granted that the basic science evidence is sound enough to inflate the pre-test probability of an effect of ASA to a non-negligible level. Moreover, as pointed out by the authors, other smaller epidemiological investigations have suggested that ASA might improve outcomes from CRCA. The authors of the current investigation found a [marginally] statistically significant reduction of approximately 0.3 in the hazard of death in patients who took ASA after a diagnosis of CRCA, but not before.

Without delving into the details (knowing that one might find the devil there), I found the conclusions the authors made interesting, namely that additional investigations and randomized controlled trials will be needed before we can recommend ASA to patients diagnosed with CRCA. This struck me as a bit odd, depending upon what our goals are. If our goal is to further study the mechanisms of this disease in pursuit of the truth of the EFFICACY of ASA (see previous blog entry on vertebroplasty for the distinctions between efficacy and effectiveness research), then fine, we need a randomized controlled trial to eliminate all the potential confounding that is inherent in the current study, most notably the possibility that patients who took ASA are different from those who didn't in some important way that also influences outcome. But I'm prepared to accept that there is ample evidence that ASA benefits this condition and that if I had CRCA, the risks of not taking ASA far exceed the risks of taking it, and I would shun participation in any study in which I might be randomized to placebo. This may sound heretical, but allow me to explain my thinking.

I do worry that something that "makes sense" biologically and which is bolstered by epidemiological data might prove to be spurious, as happened in the decades-long saga of Premarin-prevention which came to a close with the Women's Health Initiative (WHI) study. But there are important differences here. Premarin had known side effects (clotting, increased risk of breast cancer) and it was being used long-term for the PREVENTION of remote diseases that would afflict women in the [distant] future. ASA has a proven safety profile spanning over a century, and patients with CRCA have a near-term risk of DEATH from it. So, even though both Premarin and ASA might be used on the basis of fallible epidemiological data, the two situations are not comparable. (I am also reminded of the ongoing debates and study of NAC for prevention of contrast nephropathy, which I think have gone on for far too long. There is ample evidence that it might help, and no evidence of adverse effects or excessive cost. When is enough enough?)

I just think we have become too beholden to certain mantras (like RCTs being the end-all-be-all or mortality being the only acceptable outcome measure), and we don't look at different situations with an independently critical eye. This is not low tidal volume ventilation, where the critical care community needs unassailable evidence of efficacy to be convinced to administer it to patients who will have little say in the tidal volume their doctor uses to ventilate them. These are cognizant patients with cancer, this is a widely available over-the-counter drug, and this is a disease which makes people feel desperate, desperate enough to enroll in trials of experimental and toxic therapies. The minor side effects of ASA are the LEAST of their worries, especially considering that most of the patients in the cohort examined in this study were using ASA for analgesia! If they are generally not concerned about side effects when it is used for arthritis, how can we justify withholding or not recommending it for patients with CANCER whose LIVES may be saved by it?

If I were a patient with CRCA, I would take ASA (in fact I already take ASA!) and I would scoff at attempts to enroll me into a trial where I might receive placebo. The purists in pursuit of efficacy and mechanisms and the perfect trial be damned. I would much rather have a gastrointestinal hemorrhage than an early death from CRCA. That's just me. Others may appraise the risks and values of the various outcomes differently. And if they want to enroll in a trial, more power to them, so long as the investigators have adequately and accurately informed them of the existing data and the risks of both ASA and placebo, in the specific context of their specific disease and given the epidemiological data. Otherwise, their enrollment is probably ethically precarious, especially if they would go home and take an ASA for a more benign condition without another thought about it.

Tuesday, August 11, 2009

Vertebroplasty: Absence of Evidence Yields to Evidence of Absence. It Takes a Sham to Discover a Sham but how will I Get a Sham if I Need One?

"When in doubt, cut it out" is one simplified heuristic (rule of thumb) of surgery. We observe that removing a necrotic gallbladder or correcting some other anatomic aberration improves patient outcomes, and extending that observation (via inductive thinking) to other situations has misled us before. It is simply not always that simple. While it makes sense that arthroscopic removal of scar tissue in an osteoarthritic knee will improve patients' symptoms, some investigators had the courage to challenge that assumption, and reported in 2002 that, when compared to sham surgery, knee arthroscopy did not benefit patients.

In a beautiful extension of that line of critical thinking, two groups of investigators in last week's NEJM challenged the widely and ardently held assumption that vertebroplasty improves patient pain and symptom scores. These two similar studies compared vertebroplasty to a sham procedure (control group) in order to control for the powerful placebo effect that accounts for part of the benefit of many medical and surgical interventions, and which is almost assuredly responsible for the reported and observed benefits of such "alternative and complementary medicines" as acupuncture.

There is no difference. In these adequately powered trials (80% power to detect a 2.5 and a 1.5 point difference on the pain scales respectively), the 95% confidence intervals for delta (the difference between the groups in pain scores) were -0.7 to +1.8 at 3 months in the first study and -0.3 to +1.7 at 1 month in the second study. Given that the minimal clinically important difference in the pain score is considered to be 1.5 points, these two studies all but rule out a clinically significant difference between the procedure and sham. They also show that there is no statistically significant difference between the two, but the former is more important to us as clinicians given that the study is negative. And this is exactly how we should approach a negative study: by asking "does the 95% confidence interval for the observed delta include a clinically important difference?" If it does not, we can be reasonably assured that the study was adequately powered to answer the question that we as practitioners are most interested in. If it does include such a value, we must assume that for us, given our judgment of clinical value, the study is not helpful and essentially underpowered. Note also that by looking at delta this way, we can determine the statistical precision (power) of the study: powerful studies will result in narrow(er) confidence intervals, and underpowered studies will result in wide(r) ones.
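The question posed above can be written down as a small check. What follows is only a sketch under stated assumptions: the function name and the sign convention (positive delta favors the procedure) are mine, not the trials'. With the intervals exactly as quoted, the upper bounds (1.8 and 1.7) sit just above the 1.5 point threshold, which is presumably why "all but rule out" is the right wording rather than "rule out".

```python
def interpret_negative_study(ci_low, ci_high, mcid):
    """Classify a trial by where its 95% CI for delta sits relative to the MCID.

    ci_low, ci_high: bounds of the 95% confidence interval for delta.
        Assumed convention (not from the trials): positive favors the procedure.
    mcid: minimal clinically important difference, as a positive number.
    """
    if ci_low > ci_high:
        raise ValueError("ci_low must not exceed ci_high")
    includes_zero = ci_low <= 0 <= ci_high
    # A clinically important benefit remains plausible if the interval reaches the MCID.
    includes_mcid = ci_high >= mcid
    return {
        "statistically_significant": not includes_zero,
        "clinically_important_benefit_excluded": not includes_mcid,
        "ci_width": ci_high - ci_low,  # narrower interval = more precise study
    }


# Intervals and MCID as quoted in the text above.
studies = {
    "first study, 3 months": (-0.7, 1.8),
    "second study, 1 month": (-0.3, 1.7),
}
for name, (low, high) in studies.items():
    print(name, interpret_negative_study(low, high, mcid=1.5))
```

A reader who judged the clinically important difference to be 2 points rather than 1.5 would find both intervals exclude it outright; the verdict depends on the value judgment fed into `mcid`.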

These results reinforce the importance of the placebo effect in medical care, and the limitations of inductive thinking in determining the efficacy of a therapy. We must be careful - things that "make sense" do not always work.

But there is a twist of irony in this saga, and something a bit concerning about this whole approach to determining the truth using studies such as these with impeccable internal validity: they lead beguilingly to the message that because the therapy is not beneficial compared to sham, it is of no use. But, very unfortunately and very importantly, that is not a clinically relevant question, because we will not now adopt sham procedures as an alternative to vertebroplasty! These data will either be ignored by the true believers of vertebroplasty, or touted by devotees of evidence-based medicine as confirmation that "vertebroplasty doesn't work". If we fall in the latter camp, we will give patients medical therapy that, I wager, will not have as strong a placebo effect as surgery. And thus, an immaculately conceived study such as this becomes its own bugaboo, because in achieving unassailable internal validity, it sacrifices its relevance to clinical practice, insomuch as the placebo effect is powerful and useful and desirable. What a shame, and what a quandary from which there is no obvious escape.

If I were a patient with such a fracture (and ironically I have indeed suffered 2 vertebral fractures [oh, the pain!]), I would try to talk my surgeon into performing a sham procedure (to avoid the costs and potential side effects of the cement).....but then I would know, and would the "placebo" really work?

Wednesday, August 5, 2009

Defining sample size for an a priori unidentifiable population: Tricks of the Tricksters

During a recent review of critical care literature for a paper on trial design, I noted that a few trials (and groups) have pulled a fast one and apparently slipped it by the witting or unwitting reviewers and editors. This has arisen in the case of two therapies which have something in common: efficacy is expected in a targeted population that cannot be identified at the outset. What's more, both of the therapies are thought to require early initiation for maximal efficacy, at a time when the specific target population cannot be identified. These two therapies are intensive insulin therapy (IIT) and corticosteroids for septic shock (CSS). In the case of IIT, the authors (Greet Van den Berghe et al) believe that IIT will be most effective in a population that remains in the ICU for at least some specified time, say 3 or 5 days. That is, "the therapy needs time to work." The problem is that there is no way to tell in advance how long a person will remain in the ICU. The same problem crops up for CSS because the authors (Annane, et al) wish to target non-responders to ACTH, but they cannot identify them at the outset; and they also believe that "early" administration is essential for efficacy. The solution used by both of these groups for this problem raises some interesting and troubling questions about the design of these trials and other trials like them in the future.

An "intention-to-treat" population must be identified at the trial outset. You need to have some a priori identifiable population that you target and you must analyze that population. If you don't do that, you can have selective dropout or crossovers that undermine your randomization and with it one of the basic guarantors of freedom from bias in your trial. Suppose that you had a therapy that you thought would reduce mortality, but only in patients that live at least 3 days, based on the reasoning that if you died prior to day three, you were too sick to be saved by anything. And suppose that you thought also that for your therapy to work, it had to be administered early. Suppose further that you enroll 1000 patients but 30% of them (300) die prior to day three. Would it be fair to exclude these 300 and analyze the data only for the 700 patients who lived past three days (some of whom die later than three days)? Even if you think it is allowable to do so, does the power of your trial derive from 700 patients or 1000? What if your therapy leads to excess deaths in the first three days? Even if you are correct that your therapy improves late(r) mortality, what if there are other side effects that are constant with respect to time? Do we analyze only the 700 survivors when we analyze these side effects, or do we analyze the entire population, the "intention-to-treat" population?
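To make the danger concrete, here is a small arithmetic sketch. The counts are invented purely for illustration and come from no trial; they split the 1000 hypothetical patients into two arms of 500 and let the therapy cause some excess early deaths while improving late mortality among those who survive past day three.

```python
# Hypothetical counts, invented for illustration only (not from any trial).
arms = {
    "control":   {"n": 500, "early_deaths": 150, "late_deaths": 105},
    "treatment": {"n": 500, "early_deaths": 180, "late_deaths": 80},
}


def summarize(arm):
    """Return mortality under a survivors-only analysis and under intention-to-treat."""
    survivors = arm["n"] - arm["early_deaths"]  # patients alive past day three
    return {
        # Excludes early deaths from both numerator and denominator.
        "survivor_only_mortality": arm["late_deaths"] / survivors,
        # Counts every randomized patient and every death.
        "itt_mortality": (arm["early_deaths"] + arm["late_deaths"]) / arm["n"],
    }


results = {name: summarize(arm) for name, arm in arms.items()}
for name, res in results.items():
    print(name, res)
```

With these made-up numbers, the survivors-only analysis shows late mortality of 30% in the control arm against 25% in the treatment arm, an apparent benefit, while the intention-to-treat analysis shows 51% against 52%, because the excess early deaths were quietly dropped from the first comparison.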

In essence what you are saying when you design such a trial is that you think that the early deaths will "dilute out" the effect of your therapy, much as people who drop out of a trial or do not take their assigned antihypertensive pills dilute out an effect in a trial. But in these trials, you would account for drop-out rates and non-compliance by raising your sample size. Which is exactly what you should do if you think that early deaths, ACTH responders, or early-departures from the ICU will dilute out your effect. You raise your sample size.
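How much the sample size must rise can be sketched with the usual normal-approximation formula for comparing two proportions. The event rates and the 70% target fraction below are hypothetical, and the sketch makes two simplifying assumptions of my own: patients outside the target population get no effect at all, and they share the same baseline event rate as the target population.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm(p_control, p_treated, alpha=0.05, power=0.80):
    """Approximate per-arm sample size for two proportions (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treated * (1 - p_treated)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p_control - p_treated) ** 2)


def diluted_treated_rate(p_control, p_treated_in_target, target_fraction):
    """ITT event rate in the treated arm when only target_fraction of enrollees benefit.

    Simplifying assumptions (mine): no effect outside the target population,
    and the same baseline event rate inside and outside it.
    """
    effect_in_target = p_control - p_treated_in_target
    return p_control - target_fraction * effect_in_target


# Hypothetical inputs: 40% control mortality, 30% in treated target patients,
# and only 70% of enrolled patients turn out to belong to the target population.
p_control, p_target, fraction = 0.40, 0.30, 0.70
p_itt = diluted_treated_rate(p_control, p_target, fraction)

n_naive = n_per_arm(p_control, p_target)   # powered as if everyone benefits
n_diluted = n_per_arm(p_control, p_itt)    # powered for the diluted ITT effect
print(p_itt, n_naive, n_diluted)
```

Because the required sample size scales roughly with the inverse square of the effect, shrinking the effect to 70% of its size in the target population about doubles the number of patients needed. Powering the trial on the undiluted effect leaves it with far less than the advertised power.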

But what I have discovered in the case of the IIT trials is that the authors wish to have their cake and eat it too. In these trials, they power the trial as if the effect they seek in the sub-population will exist in the intention-to-treat population (inadequate information is provided in the 2001 study). In the case of CSS, I cannot even justify the power calculations that are provided in the manuscript, and another concerning problem arises. First, note that in Table 4 ADJUSTED odds ratios are reported, so these are not raw data. There appears to be a trend toward benefit in the overall group in terms of an ADJUSTED odds ratio with an associated P-value of 0.09. But look at the responders versus non-responders. While (AFTER ADJUSTMENT) there is a statistically significant benefit in non-responders (10% reduction in mortality), there is a trend towards HARM in the responders (10% increase in mortality)! [I will not even delve into the issue of presenting risk as odds when the event rate is high, as it is here, and how it inflates the apparent relative benefit.] This is just the issue we are concerned about when we analyze what are basically subgroups, even if they are prospectively defined subgroups. A subgroup is NOT an intention-to-treat population. If we focus on the subgroup, we risk ignoring harmful effects in the other patients in the trial, and we understate the number needed to treat. We also run the risk of ending up with an underpowered trial, because patients who are enrolled but who a posteriori don't fit our target population are essentially drop-outs and should have been accounted for in sample size calculations.
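The bracketed aside about odds is easy to illustrate. The mortality figures below are hypothetical round numbers chosen only to show the arithmetic, not the values reported in the manuscript.

```python
def odds(p):
    """Convert a probability into odds."""
    return p / (1 - p)


def risk_ratio(p_treated, p_control):
    return p_treated / p_control


def odds_ratio(p_treated, p_control):
    return odds(p_treated) / odds(p_control)


# Hypothetical event rates, for illustration only.
scenarios = {
    "high event rate": (0.50, 0.60),  # (treated, control) mortality
    "low event rate": (0.05, 0.06),
}
for name, (p_treated, p_control) in scenarios.items():
    rr = risk_ratio(p_treated, p_control)
    or_ = odds_ratio(p_treated, p_control)
    print(f"{name}: risk ratio {rr:.2f}, odds ratio {or_:.2f}")
```

Both scenarios have the same risk ratio, about 0.83. When events are rare, the odds ratio (about 0.82) is nearly identical to it. When mortality is around 50 to 60 percent, the odds ratio falls to about 0.67, which reads like a one-third reduction when the relative reduction in risk is only one-sixth.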

This is very similar to what happened in an early trial of a biological agent for sepsis. The agent, HA-1A human monoclonal antibody against endotoxin, was effective in the subgroup of patients with gram negative infections, which of course could not be prospectively identified. It was not effective in the overall population. It was never approved and never entered into clinical use, because, like the investigators, clinicians have no way of knowing a priori which patients have gram negative infections and which do not, so their experience with the clinical use of the agent is more properly represented by the investigation's result in the overall population.

[I am reminded here of the 2004 Rumbak study in Critical Care Medicine in which a prediction was made as to who would require 14 or more days of mechanical ventilation as a requirement for entry into a study which randomized patients to tracheostomy or conventional care on day 2. In this study, an investigator made the prediction of length of mechanical ventilation, based on unspecified criteria, which was a major shortcoming of the study in spite of the fact that the investigator was correct in about 80% of cases.]

I propose several solutions to this problem. Firstly, studies should be powered for the expected effect in the overall population, and this effect should account for dilution caused by enrollment of patients who a posteriori are not the target population (e.g., ACTH responders or early departures from the ICU). Secondly, only overall results from the intention-to-treat population should be presented and heeded by clinicians. And thirdly, efforts to better identify the target population a priori should be undertaken. Surely Van den Berghe et al by now have sufficient data to predict who will remain in the ICU for more than 3-5 days. And surely those studying CSS could require a response or non-response to a rapid ACTH test as a requirement for enrollment or exclusion.