Monday, September 21, 2009

The unreliable asymmetric design of the RE-LY trial of Dabigatran: Heads I win, tails you lose



I'm growing weary of this. I hope it stops. We can adapt the diagram of non-inferiority shenanigans from the Gefitinib trial (see http://medicalevidence.blogspot.com/2009/09/theres-no-such-thing-as-free-lunch.html ) to last week's trial of dabigatran, which arrived in the NEJM as another ridiculously designed non-inferiority trial (see http://content.nejm.org/cgi/content/short/361/12/1139 ). Here we go again.

These jokers, lulled by the corporate siren song of Boehringer Ingelheim, had the utter unmitigated gall to declare a delta of 1.46 (relative risk) as the margin of non-inferiority! Unbelievable! To say that a 46% difference in the rate of stroke or arterial clot is clinically non-significant! Seriously!?

They justified this felonious choice on the basis of trials comparing warfarin to PLACEBO as analyzed in a 10-year-old meta-analysis. It is obvious (or should be to the sentient) that an ex-post difference between a therapy and placebo in superiority trials does not apply to non-inferiority trials of two active agents. Any ex-post finding could be simply fortuitously large and may have nothing to do with the MCID (minimal clinically important difference) that is SUPPOSED to guide the choice of delta in a non-inferiority trial (NIT). That warfarin SMOKED placebo in terms of stroke prevention does NOT mean that something that does not SMOKE warfarin is non-inferior to warfarin. This kind of duplicitous justification is surely not what the CONSORT authors had in mind when they recommended a referenced justification for delta.

That aside, on to the study and the figure. First, we're testing two doses, so there are multiple comparisons, but we'll let that slide for our purposes. Look at the point estimate and 95% CI for the 110 mg dose in the figure (let's bracket the fact that they used one-sided 97.5% CIs - it's immaterial to this discussion). There is a non-statistically significant difference between dabigatran and warfarin for this dose, with a P-value of 0.34. But note that in Table 2 of the article, they declare that the P-value for "non-inferiority" is <0.001 [I've never even seen this done before, and I will have to look to see if we can find a precedent for reporting a P-value for "non-inferiority"]. Well, apparently this just means that the RR point estimate for 110 mg versus warfarin is statistically significantly different from a RR of 1.46. It does NOT mean, though it misleadingly suggests, that the comparison between the two drugs on stroke and arterial clot is highly clinically significant; it is not. This "P-value for non-inferiority" is just an artificial comparison: had we set the margin of non-inferiority at an even more ridiculously large value, we could have made the "P-value for non-inferiority" as small as we liked, simply by inflating the margin of non-inferiority! So this is a useless number, unless your goal is to create an artificial and exaggerated impression of the difference between these two agents.
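To see how mechanical this number is, here is a minimal sketch (Python). The point estimate and CI are illustrative values approximating the published 110 mg result (RR ~0.91, 95% CI ~0.74 to 1.11), so treat them as assumptions; the point is that the same data yield whatever "P-value for non-inferiority" you like, depending only on the margin you declare.

    # "P-value for non-inferiority" as a function of the chosen margin.
    # rr_hat and its CI are illustrative approximations, not exact trial data.
    from math import log, sqrt, erf

    def phi(z):  # standard normal CDF
        return 0.5 * (1.0 + erf(z / sqrt(2.0)))

    rr_hat, ci_lo, ci_hi = 0.91, 0.74, 1.11
    se = (log(ci_hi) - log(ci_lo)) / (2 * 1.96)  # SE of log(RR), backed out of the CI

    for margin in (1.10, 1.20, 1.46, 2.00):
        z = (log(rr_hat) - log(margin)) / se     # test of H0: RR >= margin
        print(f"margin {margin:.2f}: one-sided P for 'non-inferiority' = {phi(z):.1e}")

    # The P-value falls from ~3e-02 (margin 1.10) to ~2e-06 (margin 1.46) to
    # ~1e-14 (margin 2.00) without the data changing at all.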

Now let's look at the 150 mg dose. Indeed, it is statistically significantly different from warfarin (I shall resist using the term "superior" here), and thus the authors claim superiority. But here again, the 95% CI is narrower than the margin of non-inferiority, and had the results gone in the other direction (in favor of warfarin), as in Scenarios 3 and 4, we would have still claimed non-inferiority, even though warfarin would have been statistically significantly "better than" dabigatran! So it is unfair to claim superiority on the basis of a statistically significant result favoring dabigatran, but that's what they do. This is the problem that is likely to crop up when you make your margin of non-inferiority excessively wide, which you are wont to do if you wish to stack the deck in favor of your therapy.

But here's the real rub. Imagine if the world were the mirror image of what it is now and dabigatran were the existing agent for prevention of stroke in A-fib, and warfarin were the new kid on the block. If the makers of warfarin had designed this trial AND GOTTEN THE EXACT SAME DATA, they would have said (look at the left of the figure and the dashed red line there) that warfarin is non-inferior to the 110 mg dose of dabigatran, but that it was not non-inferior to the 150 mg dose of dabigatran. They would NOT have claimed that dabigatran was superior to warfarin, nor that warfarin was inferior to dabigatran, because the 95% CI of the difference between warfarin and dabigatran 150 mg crosses the pre-specified margin of non-inferiority. And to claim superiority of dabigatran, the 95% CI of the difference would have to fall all the way to the left of the dashed red line on the left. (See Piaggio, JAMA, 2006.)

The claims that result from a given dataset should not depend on who designs the trial, and which way the asymmetry of interpretation goes. But as long as we allow asymmetry in the interpretation of data, they shall. Heads they win, tails we lose.

Tuesday, September 15, 2009

Plavix (clopidogrel), step aside, and prasugrel (Effient), watch your back: Ticagrelor proves that some "me-too" drugs are truly superior

Another breakthrough is reported in last week's NEJM: http://content.nejm.org/cgi/content/abstract/361/11/1045 . Wallentin et al report the results of the PLATO trial showing that ticagrelor, a new reversible inhibitor of P2Y12, is superior to Plavix in just about every imaginable way. Moreover, when you compare the results of this trial to the trial of prasugrel (Effient, recently approved, about which I blogged here: http://medicalevidence.blogspot.com/2007/11/plavix-defeated-prasugrel-is-superior.html ), it appears that ticagrelor is going to be preferable to prasugrel in at least 2 ways: 1.) a larger population can benefit (AMI versus just patients undergoing PCI); and 2.) less bleeding, which may be a result of reversible rather than irreversible inhibition of P2Y12.

I will rarely be using either of these drugs or Plavix, because I rarely treat AMI or patients undergoing PCI. My interest in this trial and that of prasugrel stems from the fact that in the cases of these two agents, the sponsoring company indeed succeeded in making what is in essence a "me-too" drug that is superior to earlier-to-market agents. They did not monkey around with this non-inferiority trial crap like anidulafungin and gefitinib and just about every antihypertensive that has come to market in the past 10 years; they actually took Plavix to task and beat it, legitimately. For this, and for the sheer size of the trial and its superb design, they deserve to be commended.


One take-home message here, and from other posts on this blog, is "beware the non-inferiority trial". There are a number of reasons that a company will choose to do a non-inferiority trial (NIT) rather than a superiority trial:

1.) As in the last post (http://medicalevidence.blogspot.com/2009/09/theres-no-such-thing-as-free-lunch.html ), running a NIT often allows you to have your cake and eat it too - you can make it easy to claim non-inferiority (wide delta) AND make the criterion for superiority (of your agent) more lenient than the inferiority criterion, a conspicuous asymmetry that ruffles my feathers again and again.

2.) You don't run the risk of people saying after the fact "that stuff doesn't work," even though absence of evidence does not constitute evidence of absence.

3.) You have great latitude with delta in a NIT, and that's appealing from a sample size standpoint.

4.) You don't actually have to have a better product - which might not even be your goal, which is rather to get market share for an essentially identical product.

5.) Maybe you can't recruit enough patients to do a superiority trial.

The ticagrelor trial recruited over 18,000 patients. You can look at this in two ways. One is that the difference they're trying to demonstrate is quite small, so what does it matter to you? (If you take this view, you should be especially dismissive of NITs, since they're not trying to show any difference at all.) The other is that if you can recruit 18,000 patients into a trial, even a multinational trial, the problem that is being treated must be quite prevalent, and thus the opportunity for impact from a superior treatment, even one with a small advantage, is much greater.

It is much easier and more likely, in a given period of time, to treat 50 acute MIs and save a life with ticagrelor (compared to Plavix; NNT = 1/0.02 = 50) than it is to find 8 patients with invasive candidiasis and treat them with anidulafungin (compared to fluconazole; NNT = 1/0.12 ≈ 8; see Reboli et al: http://content.nejm.org/cgi/reprint/356/24/2472.pdf ), and in the latter case, you're not saving a life but rather just preventing a treatment failure. Thus, compared to anidulafungin, with its limited scope of application and limited impact, a drug like ticagrelor has much more public health impact. You should simply pay more attention to larger trials; there's more likely to be something important going on there. By inference, the conditions they are treating are likely to be a "bigger deal".
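A quick back-of-the-envelope on those NNT figures (a sketch in Python; the absolute risk reductions are the rough figures quoted above, not exact trial data):

    # Number needed to treat (NNT) = 1 / absolute risk reduction (ARR).
    def nnt(arr: float) -> float:
        return 1.0 / arr

    print(nnt(0.02))  # ticagrelor vs Plavix, ~2% ARR in death -> NNT ~50
    print(nnt(0.12))  # anidulafungin vs fluconazole, ~12% ARR in treatment
                      # failure -> NNT ~8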

Of course, perhaps I'm giving the industry too much credit in the cases of prasugrel and ticagrelor. Did they really have much of a choice? Probably not. Generally, when you do a non-inferiority trial, you try to show non-inferiority and also something like a preferable dosing schedule, reduced cost, or fewer side effects. That way, when the trial is done (if you have shown non-inferiority), you can say, "yeah, they have basically the same effect on xyz, but my drug has better [side effects, dosing, etc.]". Because of the enhanced potency of prasugrel and ticagrelor, they knew there would be more bleeding and that this would cause alarm. So they needed to show improved mortality (or similar) to show that that bleeding cost is worth paying. Regardless, it is refreshing to see that the industry is indeed designing drugs with demonstrable benefits over existing agents. I am highly confident that the FDA will find ticagrelor to be approvable, and I wager that it will quickly supplant prasugrel. I also wager that when clopidogrel goes generic (soon), it will be a boon for patients, who can know that they are sacrificing very little (2% efficacy compared to ticagrelor or prasugrel) for a large cost savings. For most people, this trade-off will be well worth it. For those fortunate enough to have insurance or another way of paying for ticagrelor, more power to them.

Sunday, September 6, 2009

There's no such thing as a free lunch - unless you're running a non-inferiority trial. Gefitinib for pulmonary adenocarcinoma


A 20% difference in some outcome is either clinically relevant, or it is not. If A is worse than B by 19% and that's NOT clinically relevant and significant, then A being better than B by 19% must also NOT be clinically relevant and significant. But that is not how the authors of trials such as this one see it: http://content.nejm.org/cgi/content/short/361/10/947 . According to Mok and co-conspirators, if gefitinib is no worse in regard to progression-free survival than Carboplatin-Paclitaxel based on a 95% confidence interval that does not include 20% (that is, it may be up to 19.9% worse, but no more), then they call the battle a draw and say that the two competitors are equally efficacious. However, if the trend is in the other direction, that is, in favor of gefitinib BY ANY AMOUNT HOWEVER SMALL (as long as it's statistically significant), they declare gefitinib the victor and call it a day. It is only because of widespread lack of familiarity with non-inferiority methods that they can get away with a free lunch like this. A 19% difference is either significant, or it is not. I have commented on this before, and it should come as no surprise that these trials are usually used to test proprietary agents (http://content.nejm.org/cgi/content/extract/357/13/1347 ). Note also that in trials of adult critical illness, the most commonly sought mortality benefit is about 10% (more data on this forthcoming in an article soon to be submitted and hopefully published). So it's a difficult argument to sustain that something is "non-inferior" if it is less than 20% worse than something else. Designers of critical care trials will tell you that a 10% difference, often much less, is clinically significant.

I have created a figure to demonstrate the important nuances of non-inferiority trials using the gefitinib trial as an example. (I have adapted this from the Piaggio 2006 JAMA article of the CONSORT statement for the reporting of non-inferiority trials - a statement that has been largely ignored: http://jama.ama-assn.org/cgi/content/abstract/295/10/1152?lookupType=volpage&vol=295&fp=1152&view=short .) The authors specified delta, or the margin of non-inferiority, to be 20%. I have already made it clear that I don't buy this, but we needn't challenge this value to make war with their conclusions, although challenging it is certainly worthwhile, even if it is not my current focus. This 20% delta corresponds to a hazard ratio of 1.2, as seen in the figure demarcated by a dashed red line on the right. If the hazard ratio (for progression or death) demonstrated by the data in the trial were 1.2, that would mean that gefitinib is 20% worse than comparator. The purpose of a non-inferiority trial is to EXCLUDE a difference as large as delta, the pre-specified margin of non-inferiority. So, to demonstrate non-inferiority, the authors must show that the 95% confidence interval for the hazard ratio falls all the way to the left of that dashed red line at HR of 1.2 on the right. They certainly achieved this goal. Their data, represented by the lowermost point estimate and 95% CI, falls entirely to the left of the pre-specified margin of non-inferiority (the right red dashed line). I have no arguments with this. Accepting ANY margin of non-inferiority (delta), gefitinib is non-inferior to the comparator. What I take exception to is the conclusion that gefitinib is SUPERIOR to comparator, a conclusion that is predicated in part on the chosen delta, to which we are beholden as we make such conclusions.

First, let's look at [hypothetical] Scenario 1. Because the chosen delta was 20% wide (and that's pretty wide - coincidentally, that's the exact width of the confidence interval of the observed data), it is entirely possible that the point estimate could have fallen as pictured for Scenario 1 with the entire CI between an HR of 1 and 1.2, the pre-specified margin of non-inferiority. This creates the highly uncomfortable situation in which the criterion for non-inferiority is fulfilled, AND the comparator is statistically significantly better than gefitinib!!! This could have happened! And it's more likely to happen the larger you make delta. The lesson here is that the wider you make delta, the more dubious your conclusions are. Deltas of 20% in a non-inferiority trial are ludicrous.

Now let's look at Scenarios 2 and 3. In these hypothetical scenarios, comparator is again statistically significantly better than gefitinib, but now we cannot claim non-inferiority because the upper CI falls to the right of delta (red dashed line on the right). But because our 95% confidence interval includes values of HR less than 1.2 and our delta of 20% implies (or rather states) that we consider differences of less than 20% to be clinically irrelevant, we cannot technically claim superiority of comparator over gefitinib either. The result is dubious. While there is a statistically significant difference in the point estimate, the 95% CI contains clinically irrelevant values and we are left in limbo, groping for a situation like Scenario 4, in which comparator is clearly superior to gefitinib, and the 95% CI lies all the way to the right of the HR of 1.2.

Pretend you're in Organic Chemistry again, and visualize the mirror image (enantiomer) of Scenario 4. That is what is required to show superiority of gefitinib over comparator - a point estimate for the HR whose 95% CI falls entirely beyond negative delta (-20%), that is, entirely below an HR of 0.8. The actual results come close to Scenario 5, but not quite, and therefore the authors are NOT justified in claiming superiority. To do so is to try to have a free lunch, to have their cake and eat it too.
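The symmetric reading of these scenarios can be made explicit in a few lines. Here is a minimal sketch (Python) that classifies a 95% CI for the hazard ratio according to the Piaggio/CONSORT logic, using the post's margin of 1.2 and its mirror image at 0.8; the scenario CIs are hypothetical numbers of the sort one would read off the figure.

    # Symmetric interpretation of a non-inferiority comparison (HR < 1 favors
    # the new drug). delta = 1.2 and its mirror image 0.8, as in the post.
    def interpret(ci_lo, ci_hi, margin=1.2, mirror=0.8):
        if ci_hi < mirror:
            return "new drug SUPERIOR (CI entirely below the mirror margin)"
        if ci_lo > margin:
            return "new drug INFERIOR (comparator superior)"
        if ci_hi < margin:
            if ci_lo > 1.0:
                return "paradox: comparator statistically better, yet 'non-inferior'"
            return "non-inferior, but NOT superior"
        return "inconclusive (CI crosses the margin)"

    # Hypothetical CIs in the spirit of the figure's scenarios:
    for lo, hi in ((1.02, 1.18),   # Scenario 1: the uncomfortable paradox
                   (1.05, 1.30),   # Scenarios 2/3: significant yet inconclusive
                   (1.25, 1.50),   # Scenario 4: comparator clearly superior
                   (0.65, 0.85)):  # like the actual result: non-inferior, but
        print((lo, hi), interpret(lo, hi))  # the CI does not clear 0.8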

You see, the larger you make delta, the easier it is to achieve non-inferiority. But the more likely it is also that you might find a statistically significant difference favoring comparator rather than the preferred drug which creates a serious conundrum and paradox for you. At the very least, if you're going to make delta large, you should be bound by your honor and your allegiance to logic and science to make damned sure that to claim superiority, your 95% confidence interval must not include negative delta. If not, shame on you. Eat your free lunch if you will, but know that the ireful brow of logic and reason is bent unfavorably upon you.


Saturday, September 5, 2009

Troponin I, Troponin T, Troponin is the Woe of Me

As a critical care physician, I have not infrequently been called to the emergency department to admit a patient on the basis of "abnormal laboratory tests" with no synthesis, no assimilation of the various results into any semblance of a unifying diagnosis. It is bad enough that patients' chests are no longer auscultated, respiratory rates and patterns not noted, neck veins not examined, etc. It is worse that the portable chest film (often incorrectly interpreted), the arterial blood gas (also often incorrectly interpreted), and the BNP level have supplanted any sort of logical and systematic approach to diagnosing a patient's problem. If we are going to replace physical examination with BNPs and d-dimers, we should at least insist that practitioners have one iota of familiarity with Bayes' Theorem, pre-test probabilities, and the proper interpretation of test results.

Thus I raised at least one brow slightly on August 27th when the NEJM reported two studies of highly sensitive troponin assays for the "early diagnosis of myocardial infarction" (wasn't troponin sensitive enough already? see: http://content.nejm.org/cgi/content/abstract/361/9/858 and http://content.nejm.org/cgi/content/abstract/361/9/868 ). Without commenting on the studies' methodological quality specifically, I will emphasize some pitfalls and caveats related to the adoption of this "advance" in clinical practice, especially that outside of the setting of an appropriately aged person with risk factors who presents to an acute care setting with SYMPTOMS SUGGESTIVE OF MYOCARDIAL INFARCTION.

In such a patient, say a 59 year old male with hypertension, diabetes and a family history of coronary artery disease, who presents to the ED with chest pain, we (and our cardiology colleagues) are justified in having a high degree of confidence in the results of this test based on these and a decade or more of other data. But I suspect that only the MINORITY of cardiac troponin tests at my institution are ordered for that kind of indication. Rather, it is used as a screening test for just about any patient presenting to the ED who is ill enough to warrant admission. And that's where the problem has its roots. Our confidence in the diagnostic accuracy of this test in the APPROPRIATE SETTING (read appropriate clinical pre-test probability) should not extend to other scenarios, but all too often it does, and it makes a real conundrum when it is positive in those other scenarios. Here's why.
Suppose that we have a pregnancy test that is evaluated in women who have had a sexual encounter and who have missed two menstrual periods, and it is found to be 99.9% sensitive and 99.9% specific. (I will bracket for now the possibility that you could have a 100% sensitive and/or specific test.) Now suppose that you administer this test to 10,000 MEN. Does a positive test mean that a man is pregnant? Heavens no! He probably has testicular cancer or some other malady. This somewhat silly example is actually quite useful to reinforce the principle that no matter how good a test is, if it is not used in the appropriate scenario, the results are likely to be misleading.

Likewise, consider this test's use in a woman who has not missed a menstrual cycle - does a negative test mean that she is not pregnant? Perhaps not, since the sensitivity was determined in a population that had missed 2 cycles. If a woman were obviously 24 weeks pregnant and the test was negative, what would we think? It is important to bear in mind that these tests are NOT direct tests for the conditions we seek to diagnose, but are tests of ASSOCIATED biological phenomena, and insomuch as our understanding of those phenomena is limited or there is variation in them, the tests are liable to be fallible. A negative test in a woman with a fetus in utero may mean that the sample was mishandled, that the testing reagents were expired, that there is an interfering antibody, etc. Tests are not perfect, and indeed are highly prone to be misleading if not used in the appropriate clinical scenario.
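The arithmetic behind this is just Bayes' Theorem. Here is a minimal sketch (Python, using the made-up 99.9%/99.9% test from the example above):

    # Positive predictive value (PPV) as a function of pre-test probability.
    # Sensitivity and specificity are the toy values from the example above.
    def ppv(sens, spec, pretest):
        true_pos = sens * pretest
        false_pos = (1 - spec) * (1 - pretest)
        return true_pos / (true_pos + false_pos)

    sens = spec = 0.999
    for pretest in (0.50, 0.01, 0.0):  # missed two periods; no missed period; men
        print(f"pre-test probability {pretest:.2f}: PPV = {ppv(sens, spec, pretest):.3f}")

    # PPV falls from ~0.999 to ~0.910 to exactly 0 as the pre-test probability
    # falls - the same test, wildly different meaning of a positive result.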

And thus we return to cardiac troponins. In the patients I'm called to admit to the ICU who have sepsis, PE, COPD, pneumonia, respiratory failure, renal failure, or metabolic acidosis, a mildly positive troponin - a COMMON occurrence - is almost ALWAYS an epiphenomenon of critical illness rather than an acute myocardial infarction. Moreover, the pursuit of diagnosis via cardiac catheterization or empiric treatment with antiplatelet agents and anticoagulants is almost always a therapeutic misadventure in these patients, who are at much greater risk of bleeding and renal failure from these interventions, which are expected to have a much reduced positive utility for them. More often than not, I would just rather not know the results of a troponin test outside the setting of isolated acute chest pain. Other practitioners should be acutely aware of the patient populations in which these tests were studied, and the significant limitations of using these highly sensitive tests in other clinical scenarios.

Thursday, August 13, 2009

The enemy of good evidence is better evidence: Aspirin, colorectal cancer, and knowing when enough is enough

An epidemiological study of the impact of aspirin (ASA) on outcomes from colorectal carcinoma (CRCA) in JAMA has made quite a splash which has extended to the lay press (see Chan et al: http://jama.ama-assn.org/cgi/content/short/302/6/649?home ; and http://www.nytimes.com/2009/08/12/health/research/12aspirin.html ). I read this study, which normally would not have been of interest to me, because I knew that it was an epidemiological study and suspected that numerous methodological flaws would limit the conclusions that one could draw from it. And I admit that I was surprised that it has all the trappings of a methodologically superb study, complete in almost every way, including reporting of the Wald test for the assumption of proportional hazards, reporting of assessment for collinearity and overfitting of the model etc. It's all there, everything that one could want.


I should start by stating that there is biological plausibility to the hypothesis that ASA might influence the course of these cancers, which express COX-2. I am no expert in this area, so I will take it as granted that the basic science evidence is sound enough to inflate the pre-test probability of an effect of ASA to a non-negligible level. Moreover, as pointed out by the authors, other smaller epidemiological investigations have suggested that ASA might improve outcomes from CRCA. The authors of the current investigation found a [marginally] statistically significant reduction in the hazard of death of approximately 30% (a hazard ratio of roughly 0.7) in patients who took ASA after a diagnosis of CRCA, but not before.

Without delving into the details (knowing that one might find the devil there), I found the conclusions the authors made interesting, namely that additional investigations and randomized controlled trials will be needed before we can recommend ASA to patients diagnosed with CRCA. This struck me as a bit odd, depending upon what our goals are. If our goal is to further study the mechanisms of this disease in pursuit of the truth of the EFFICACY of ASA (see the previous blog entry on vertebroplasty for the distinctions between efficacy and effectiveness research), then fine, we need a randomized controlled trial to eliminate all the potential confounding that is inherent in the current study, most notably the possibility that patients who took ASA are different from those who didn't in some important way that also influences outcome. But I'm prepared to accept that there is ample evidence that ASA benefits this condition, and that if I had CRCA, the risks of not taking ASA would far exceed the risks of taking it, and I would shun participation in any study in which I might be randomized to placebo. This may sound heretical, but allow me to explain my thinking.

I do worry that something that "makes sense" biologically and which is bolstered by epidemiological data might prove to be spurious, as happened in the decades-long saga of Premarin-prevention which came to a close with the Women's Health Initiative (WHI) study. But there are important differences here. Premarin had known side effects (clotting, increased risk of breast cancer) and it was being used long-term for the PREVENTION of remote diseases that would afflict women in the [distant] future. ASA has a proven safety profile spanning over a century, and patients with CRCA have a near-term risk of DEATH from it. So, even though both premarin and ASA might be used on the basis of fallible epidemiological data, there are important differences that we must consider. (I am also reminded of the ongoing debates and study of NAC for prevention of contrast nephropathy, which I think has gone on for far too long. There is ample evidence that it might help, and no evidence of adverse effects or excessive cost. When is enough enough?)

I just think we have become too beholden to certain mantras (like RCTs being the end-all-be-all or mortality being the only acceptable outcome measure), and we don't look at different situations with an independently critical eye. This is not low tidal volume ventilation where the critical care community needs unassailable evidence of efficacy to be convinced to administer it to patients who will have little say in the tidal volume their doctor uses to ventilate them. These are cognizant patients with cancer, this is a widely available over-the-counter drug, and this is a disease which makes people feel desperate, desperate enough to enroll in trials of experimental and toxic therapies. The minor side effects of ASA are the LEAST of their worries, especially considering that most of the patients in the cohort examined in this trial were using ASA for analgesia! If they are generally not concerned about side effects when it is used for arthritis, how can we justify withholding or not recommending it for patients with CANCER whose LIVES may be saved by it?

If I were a patient with CRCA, I would take ASA (in fact I already take ASA!) and I would scoff at attempts to enroll me in a trial where I might receive placebo. The purists in pursuit of efficacy and mechanisms and the perfect trial be damned. I would much rather have a gastrointestinal hemorrhage than an early death from CRCA. That's just me. Others may appraise the risks and values of the various outcomes differently. And if they want to enroll in a trial, more power to them, so long as the investigators have adequately and accurately informed them of the existing data and the risks of both ASA and placebo, in the specific context of their specific disease and given the epidemiological data. Otherwise, their enrollment is probably ethically precarious, especially if they would go home and take an ASA for a more benign condition without another thought about it.

Tuesday, August 11, 2009

Vertebroplasty: Absence of Evidence Yields to Evidence of Absence. It Takes a Sham to Discover a Sham, but How Will I Get a Sham if I Need One?

"When in doubt, cut it out" is one simplified heuristic (rule of thumb) of surgery. Extension (via inductive thinking) of the observation that removing a necrotic gallbladder or correcting some other anatomic aberration causes improvement in patient outcomes to other situations has misled us before. It is simply not always that simple. While it makes sense that arthroscopic removal of scar tissue in an osteoarthritic knee will improve patients' symptoms, alas, some investigators had the courage to challenge that assumption, and reported in 2002 that when compared to sham surgery, knee arthroscopy did not benefit patients. (See http://content.nejm.org/cgi/content/abstract/347/2/81.)

In a beautiful extension of that line of critical thinking, two groups of investigators in last week's NEJM challenged the widely and ardently held assumption that vertebroplasty improves patient pain and symptom scores. (See http://content.nejm.org/cgi/content/abstract/361/6/557 ; and http://content.nejm.org/cgi/content/abstract/361/6/569 .) These two similar studies compared vertebroplasty to a sham procedure (control group) in order to control for the powerful placebo effect that accounts for part of the benefit of many medical and surgical interventions, and which is almost assuredly responsible for the reported and observed benefits of such "alternative and complementary medicines" as acupuncture.

There is no difference. In these adequately powered trials (80% power to detect a 2.5- and a 1.5-point difference on the pain scales, respectively), the 95% confidence intervals for delta (the difference between the groups in pain scores) were -0.7 to +1.8 at 3 months in the first study and -0.3 to +1.7 at 1 month in the second study. Given that the minimal clinically important difference in the pain score is considered to be 1.5 points, these two studies all but rule out a clinically significant difference between the procedure and sham. They also show that there is no statistically significant difference between the two, but the former is more important to us as clinicians, given that the study is negative. And this is exactly how we should approach a negative study: by asking "does the 95% confidence interval for the observed delta include a clinically important difference?" If it does not, we can be reasonably assured that the study was adequately powered to answer the question that we as practitioners are most interested in. If it does include such a value, we must assume that, for us, given our judgment of clinical value, the study is not helpful and essentially underpowered. Note also that by looking at delta this way, we can determine the statistical precision (power) of the study - powerful studies will result in narrow(er) confidence intervals, and underpowered studies will result in wide(r) ones.
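That approach to a negative study can be written down in a few lines (a sketch in Python, using the confidence intervals and the 1.5-point MCID quoted above):

    # For a negative trial, ask whether the 95% CI for delta (procedure minus
    # sham) includes the minimal clinically important difference (MCID).
    MCID = 1.5  # points on the pain scale, as quoted above

    trials = {"study 1 (3 months)": (-0.7, 1.8),
              "study 2 (1 month)": (-0.3, 1.7)}

    for name, (lo, hi) in trials.items():
        if hi < MCID:
            print(f"{name}: CI ({lo}, {hi}) excludes a clinically important difference")
        else:
            print(f"{name}: CI ({lo}, {hi}) barely reaches past the MCID "
                  f"(upper limit {hi} vs {MCID})")

    # Strictly, both upper limits just exceed 1.5 - hence the trials "all but"
    # rule out a clinically significant difference, rather than exclude it outright.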

These results reinforce the importance of the placebo effect in medical care, and the limitations of inductive thinking in determining the efficacy of a therapy. We must be careful - things that "make sense" do not always work.

But there is a twist of irony in this saga, and something a bit concerning about this whole approach to determining the truth using studies such as these with impeccable internal validity: they lead beguilingly to the message that because the therapy is not beneficial compared to sham, it is of no use. But, very unfortunately and very importantly, that is not a clinically relevant question, because we will not now adopt sham procedures as an alternative to vertebroplasty! These data will either be ignored by the true believers of vertebroplasty, or touted by devotees of evidence based medicine as confirmation that "vertebroplasty doesn't work". If we fall in the latter camp, we will give patients medical therapy that, I wager, will not have as strong a placebo effect as surgery. And thus, an immaculately conceived study such as this becomes its own bugaboo, because in achieving unassailable internal validity, it estranges its relevance to clinical practice insomuch as the placebo effect is powerful and useful and desirable. What a shame, and what a quandary from which there is no obvious escape.

If I were a patient with such a fracture (and ironically I have indeed suffered 2 vertebral fractures [oh, the pain!]), I would try to talk my surgeon into performing a sham procedure (to avoid the costs and potential side effects of the cement)... but then I would know, and would the "placebo" really work?

Wednesday, August 5, 2009

Defining sample size for an a priori unidentifiable population: Tricks of the Tricksters

During a recent review of the critical care literature for a paper on trial design, a few trials (and groups) were noted to have pulled a fast one and apparently slipped it by the witting or unwitting reviewers and editors. This has arisen in the case of two therapies which have in common a targeted population in which efficacy is expected, but which cannot be identified at the outset. What's more, both of the therapies are thought to require early initiation for maximal efficacy, at a time when the specific target population cannot be identified. These two therapies are intensive insulin therapy (IIT) and corticosteroids for septic shock (CSS). In the case of IIT, the authors (Greet Van den Berghe et al) believe that IIT will be most effective in a population that remains in the ICU for at least some specified time, say 3 or 5 days. That is, "the therapy needs time to work." The problem is that there is no way to tell how long a person will remain in the ICU in advance. The same problem crops up for CSS because the authors (Annane et al) wish to target non-responders to ACTH, but they cannot identify them at the outset; and they also believe that "early" administration is essential for efficacy. The solution used by both of these groups for this problem raises some interesting and troubling questions about the design of these trials and other trials like them in the future.

An "intention-to-treat" population must be identified at the trial outset. You need to have some a priori identifiable population that you target and you must analyze that population. If you don't do that, you can have selective dropout or crossovers that undermine your randomization and with it one of the basic guarantors of freedom from bias in your trial. Suppose that you had a therapy that you thought would reduce mortality, but only in patients that live at least 3 days, based on the reasoning that if you died prior to day three, you were too sick to be saved by anything. And suppose that you thought also that for your therapy to work, it had to be administered early. Suppose further that you enroll 1000 patients but 30% of them (300) die prior to day three. Would it be fair to exclude these 300 and analyze the data only for the 700 patients who lived past three days (some of whom die later than three days)? Even if you think it is allowable to do so, does the power of your trial derive from 700 patients or 1000? What if your therapy leads to excess deaths in the first three days? Even if you are correct that your therapy improves late(r) mortality, what if there are other side effects that are constant with respect to time? Do we analyze the entire population when we analyze these side effects or do we analyze the entire population, the "intention-to-treat" population?

In essence what you are saying when you design such a trial is that you think that the early deaths will "dilute out" the effect of your therapy, much as people who drop out of a trial or do not take their assigned antihypertensive pills dilute out an effect in a trial. But in those trials, you would account for drop-out rates and non-compliance by raising your sample size. Which is exactly what you should do if you think that early deaths, ACTH responders, or early departures from the ICU will dilute out your effect: you raise your sample size.
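Here is what that accounting looks like in a minimal sketch (Python, using the standard normal-approximation sample-size formula for two proportions; the control event rate, effect size, and dilution fractions are hypothetical):

    # If a fraction d of enrollees cannot benefit (early deaths, ACTH
    # responders, early ICU departures), the intention-to-treat effect is
    # diluted to (1 - d) * effect, and required sample size grows ~1/(1-d)^2.
    from math import sqrt

    def n_per_arm(p1, arr, z_alpha=1.96, z_beta=0.84):  # 2-sided 0.05, 80% power
        p2 = p1 - arr
        pbar = (p1 + p2) / 2
        num = z_alpha * sqrt(2 * pbar * (1 - pbar)) + z_beta * sqrt(
            p1 * (1 - p1) + p2 * (1 - p2))
        return (num / arr) ** 2

    control_rate, undiluted_arr = 0.30, 0.05  # hypothetical values
    for d in (0.0, 0.3, 0.5):                 # fraction of non-target enrollees
        arr = undiluted_arr * (1 - d)
        print(f"dilution {d:.0%}: ~{n_per_arm(control_rate, arr):.0f} per arm")

    # ~1250 per arm with no dilution, ~2600 at 30% dilution, ~5100 at 50%.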

But what I have discovered in the case of the IIT trials is that the authors wish to have their cake and eat it too. In these trials, they power the trial as if the effect they seek in the sub-population will exist in the intention-to-treat population (e.g., http://content.nejm.org/cgi/content/abstract/354/5/449 ; inadequate information is provided in the 2001 study). In the case of CSS (http://jama.ama-assn.org/cgi/content/abstract/288/7/862?maxtoshow=&HITS=10&hits=10&RESULTFORMAT=&fulltext=annane+septic+shock&searchid=1&FIRSTINDEX=0&resourcetype=HWCIT ), I cannot even justify the power calculations that are provided in the manuscript, but another concerning problem occurs. First, note that in Table 4 ADJUSTED odds ratios are reported, so these are not raw data. Overall there appears to be a trend toward benefit in the overall group in terms of an ADJUSTED odds ratio, with an associated P-value of 0.09. But look at the responders versus non-responders. While (AFTER ADJUSTMENT) there is a statistically significant benefit in non-responders (10% reduction in mortality), there is a trend toward HARM in the responders (10% increase in mortality)! [I will not even delve into the issue of presenting risk as odds when the event rate is high, as it is here, and how it inflates the apparent relative benefit.] This is just the issue we are concerned about when we analyze what are basically subgroups, even if they are prospectively defined subgroups. A subgroup is NOT an intention-to-treat population, and if we focus on the subgroup, we risk ignoring harmful effects in the other patients in the trial, we understate the true number needed to treat, and we run the risk of ending up with an underpowered trial because we have ignored the fact that enrolled patients who don't a posteriori fit our target population are essentially drop-outs and should have been accounted for in sample size calculations.

This is very similar to what happened in an early trial of a biological agent for sepsis (http://content.nejm.org/cgi/content/abstract/324/7/429 ). The agent, HA-1A human monoclonal antibody against endotoxin, was effective in the subgroup of patients with gram-negative infections, which of course could not be prospectively identified. It was not effective in the overall population. It was never approved and never entered into clinical use because, like the investigators, clinicians would have no way of knowing a priori which patients have gram-negative infections and which do not, so their experience with the clinical use of the agent is more properly represented by the investigation's result in the overall population.

[I am reminded here of the 2004 Rumbak study in Critical Care Medicine in which a prediction was made as to who would require 14 or more days of mechanical ventilation as a requirement for entry into a study which randomized patients to tracheostomy or conventional care on day 2. In this study, an investigator made the prediction of length of mechanical ventilation, based on unspecified criteria, which was a major shortcoming of the study in spite of the fact that the investigator was correct in about 80% of cases. See: http://journals.lww.com/ccmjournal/pages/articleviewer.aspx?year=2004&issue=08000&article=00009&type=abstract ]

I propose several solutions to this problem. Firstly, studies should be powered for the expected effect in the overall population, and this effect should account for dilution caused by enrollment of patients who a posteriori are not the target population (e.g., ACTH responders or early departures from the ICU). Secondly, only overall results from the intention-to-treat population should be presented and heeded by clinicians. And thirdly, efforts to better identify the target population a priori should be undertaken. Surely Van den Berghe's group by now has sufficient data to predict who will remain in the ICU for more than 3-5 days. And surely those studying CSS could require a response or non-response to a rapid ACTH test as a requirement for enrollment or exclusion.

Friday, July 10, 2009

Happy Anniversary to the Blog! Two Years Old!

The medical evidence blog has turned out to be a fruitful experience for me and hopefully for others. The idea was conceived while I was at OSU auditing a course on capital punishment in the law school taught by the wonderful Douglas Berman, JD, who used a blog as part of the course material and who created the prominent Sentencing Law and Policy blog. That formative and enriching experience led me to create this blog to ruffle feathers in the medical evidence community, as an alternative to the numerous and sundry letters to the editor of the NEJM which I had theretofore been writing. (Every now and again I lose the ability to restrain myself and submit a letter in spite of the blog.) The experiment has paid off, I hope, and hopefully this blog provides fodder for thoughtful clinicians and researchers, as well as physicians in training and journal clubs. I hope that the tradition of the first two years will continue into perpetuity and we will beat the bushes of evidence on this blog as we strive to understand the truth and the limitations of what is currently known, using our logic and our sense of reason to guide us. Thank you to all who have followed this blog for the encouragement to keep it going.

Cheers, Scott


Thursday, July 9, 2009

No Sham Needed in Sham Trials: Polymyxin B Hemoperfusion in Abdominal Septic Shock (Alternative Title: How Meddling Ethicists Ruin Everything)

This is a superlative article to jab at to demonstrate some interesting points about randomized controlled trials that have more basis in hope than reason and whose very design threatens to invalidate their findings: http://jama.ama-assn.org/cgi/content/abstract/301/23/2445?maxtoshow=&HITS=10&hits=10&RESULTFORMAT=&fulltext=polymyxin&searchid=1&FIRSTINDEX=0&resourcetype=HWCIT . Because endotoxin has an important role in the pathogenesis of gram-negative sepsis, there has been interest in interfering with it or removing it in the hopes of abating the untoward effects of the sepsis inflammatory cascade. Having learned from previous experiences/studies (e.g., http://content.nejm.org/cgi/content/abstract/324/7/429 ) the hazards of taking a poorly defined and heterogeneous illness (namely sepsis) and using a therapy that is expected to work in only a subset of patients with the illness (those with a gram-negative source), the authors chose to study abdominal sepsis because they expected that the majority of such patients would have gram-negatives as a causative or contributory source of infection. They randomized such patients to receive standard care (not well defined) or the insertion of a dialysis catheter with subsequent hemoperfusion over a Polymyxin B-impregnated surface, because this agent is known to adsorb endotoxin. The basic biological hypothesis is that removing endotoxin in this fashion will ameliorate the untoward effects of the sepsis inflammatory cascade in such a way as to improve blood pressure, other physiological parameters, and, hopefully, mortality as well. There is reason to begin one's reading of this report with robust skepticism. The history of modern molecular medicine, for well over 25 years, has been polluted with the vast detritus of innumerable failed sepsis trials founded on hypotheses related to modulation of the sepsis cascade. During this period, only one agent has been shown to be efficacious, and even its efficacy remains highly doubtful to perhaps the majority of intensivists (myself excluded; see: http://content.nejm.org/cgi/content/abstract/344/10/699 ).


Mortality was not the primary endpoint in this trial, but rather was used for the early stopping rule. Even though I am currently writing an article suggesting that mortality may not be a good endpoint for trials of critical illness, this trial reminds me why the critical care community has selected this endpoint as the bona fide gold standard. Who cares if this invasive therapy increases your MAP from the already acceptable level of ~77 mmHg to the supertarget level of 86 mmHg? Who cares if it reduces your pressor requirements? Why would a patient, upon awakening from critical illness, thank his doctors for inserting a large dialysis catheter in him to keep his BP a little higher than it otherwise would have been? Why would he rather have a giant hole in his neck (or worse - GROIN!) than a little more levophed? If it doesn't save your life or make your life better when you recover, why do you care? We desperately need to begin to study concepts such as "return to full functionality at three (or six) months" or "recovery without persistent organ failures at x, y, z months". (This latter term I would define as not needing ongoing therapy for the support of any lingering organ failure after critical illness [that did not exist in the premorbid state], such as oxygen therapy, tracheostomy, dialysis, etc.) Should I be counted as a "save" if my existence after the interventions of the "saviors" is constituted by residence in a nursing home, dependent on others for my care, with waxing and waning lucidity? What does society think about these questions? We should begin to ask.

And we segue to the stopping issue, which I find especially intriguing. Basing the stopping rule on a mortality difference seems to validate my points above, namely that the primary endpoint (MAP) is basically a worthless one - if it were not, or if it were not trumped by mortality, why would we not base stopping of the trial on MAP? (And if this is a Phase II or pilot trial, it should be named accordingly, methinks.)

This small trial was stopped on the basis of a mortality difference significant at P=0.026, with the stopping boundary at P<0.029. I will point out again on this blog, for those not familiar with it, this pivotal article warning of the hazards of early stopping rules (http://jama.ama-assn.org/cgi/content/abstract/294/17/2203 ).

But here's the real rub. When they got these results at the first and only planned interim analysis, (deep breath), they consulted with an ethicist. The ethicist said that it is unethical to continue the trial because to do so would be to deny this presumably effective therapy to the control group. But does ANYONE in his or her right state of mind agree that this therapy is effective on the basis of these data? And if these data are not conclusive, does not that condemn future participants in a future trial to the same unfair treatment, namely randomization to placebo? Does not stopping the trial early just shift the burden to other people?

It does worse. It invalidates to a large degree the altruistic motives of the participants (or their surrogates) in the current trial, because stopping it early invalidated it scientifically (per the above referenced article) and because stopping it early necessitates the performance of yet another larger trial where participants will be randomized to placebo, and which, it is fair to suspect, will demonstrate this therapy to be useless - which is tantamount to harmful in the net, because of the risk of catheters and the wasted resources in performing yet another trial. Likewise, if we assume that this therapy IS beneficial, stopping it has reduced NET utility to current participants, because now NOBODY is receiving the therapy. So, from a consequentialist or utilitarian standpoint, overall utility is reduced and net harm has resulted from stopping the trial.

What if the investigators of this trial had made it more scientifically valid from the outset by using a sham hemoperfusion device (an approach that itself would have caused an ethical maelstrom)? And what if the sham group proved superior in terms of mortality - would the ethicists have argued for stopping the trial because continuing it would mean depriving patients of sham therapy? Would there have been a call for providing sham therapy to all patients with surgically intervened abdominal sepsis? I write this with my tongue in my cheek, but the ludicrousness of it does seem to drive home the point that the premature stopping of this trial is neither ethically clear-cut nor obligatory, and that from a utilitarian standpoint, net negative utility (for society and for participants - for everyone!) has resulted from this move.

And that segues me to the issue of sham procedures. It is abundantly obvious that patients with a dialysis catheter inserted for this trial (probably put in by an investigator, but not stated in the manuscript) will be likely to receive more vigilant care.
This is the whole reason that protocols were developed in critical care research, as a result of the early ECMO trials (Morris et al, 1994), where it was recognized that you would have all sorts of confounding by the inability to blind treating physicians in such a study. While it is not feasible to blind an ECMO study, the investigators of this study do little to convince us that blinding was not possible and feasible, and they make light of the differences in care that may have resulted from lack of blinding. Moreover, they do not report on the use of protocols for patient care that may/could have minimized the impact of lack of blinding, and in a GLARING omission, they do not describe fluid balance in these patients, a highly discretionary aspect of care that clearly could have influenced the primary outcome and which could have been differential between groups because of the lack of blinding and sham procedures. Unbelievable! (As an afterthought, even the mere increased stimulation [tactile, auditory, or visual] of patients in the intervention group, by more nursing or physician presence in the room, may have led to increases in blood pressure.)

There are also some smaller points, such as the fact that by my count 10 patients (not accounting for multiple organisms) in the intervention group had gram-positive or fungal infections, making it difficult to imagine how the therapy could have influenced these patients. What if patients without gram-negative organisms isolated are excluded from the analysis? Does the effect persist? What is the p-value for mortality then?

And that point segues me to a final point - if our biologically plausible hypothesis is that reducing endotoxin levels with this therapy leads to improvements in parameters of interest, why, for the love of God, did we not measure and report endotoxin levels, perform secondary analyses of the effect of the therapy as a function of endotoxin levels, and report data on whether these levels were reduced by the therapy, thus supporting the most fundamental assumption of the biological hypothesis upon which the entire study is predicated?
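On the early-stopping point, a toy simulation makes the hazard concrete. This is a sketch only: it assumes a single interim look at half the data with a two-sided boundary of p < 0.029, and hypothetical effect and noise values chosen to give a moderately powered trial; it is not a reconstruction of this trial's actual monitoring plan.

    # Toy simulation: trials stopped early for benefit overestimate the effect.
    import random

    random.seed(0)
    true_effect = 0.5    # hypothetical effect, arbitrary units
    se_interim = 0.354   # SE of the estimate at the halfway look (hypothetical)
    z_boundary = 2.18    # corresponds to two-sided p < 0.029 at the interim

    stopped_estimates = []
    for _ in range(100_000):
        interim_est = random.gauss(true_effect, se_interim)
        if abs(interim_est) / se_interim > z_boundary:  # trial stops early
            stopped_estimates.append(interim_est)

    print(f"true effect: {true_effect}")
    print(f"mean estimate in trials stopped early: "
          f"{sum(stopped_estimates) / len(stopped_estimates):.2f}")
    # ~0.97 here: trials that stop at the interim look report roughly double
    # the true effect, because stopping selects for a lucky high estimate.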

Saturday, June 20, 2009

Randomized controlled trial of an intervention to reduce gun-related violence: A Parody

I am incredibly disappointed that the journal that I consider to be the very pinnacle of medical evidence continues to print ideological propaganda without any regard whatever to evidence and logic when it suits the editorial agenda http://content.nejm.org/cgi/content/extract/360/22/2360. Unadulterated propaganda pieces related to capital punishment, abortion, and gun control are shamelessly and predictably aligned with a singular political stance, and evidence and logic are eschewed entirely in favor of dogmatic and sanctimonious deontology. Without slinging any more mud on my favorite journal, I will demonstrate this in the following parody:

ARTICLE TITLE:
Efficacy of a gun control policy in reducing gun-related violence: A multi-state, multi-center, randomized controlled trial.

BACKGROUND:
Gun related violence results in tens of thousands of deaths (mostly suicides and homicides) each year. Interventions to reduce the toll of gun-related violence are desperately needed.

METHODS:
We used CDC data on gun-related deaths over the last decade to identify populations at risk for gun-related violence. However, our inclusion criteria did not comport with NIH-funding guidelines about inclusion of women and minorities and vulnerable populations such as former prisoners and felons and people with mental disabilities, some of which were over-represented and some of which were under-represented in the at-risk group we identified. Therefore, we dropped inclusion and exclusion criteria altogether, and randomized the entire populations of several states to the intervention (moratorium on firearms ownership defined as a complete ban imposed by state legislatures coupled with Directly Observed Confiscation) versus control (no moratorium or ban). Causes of deaths in each group were tracked and adjudicated by medical examiners in each state.

RESULTS:
The two populations were well matched on baseline demographic characteristics. There was no difference in the gun-related fatality rate between the intervention and control groups (20.1 per 100,000 in the intervention group and 20.2 per 100,000 in the control group; P=0.98) based on an intention-to-treat analysis. There was considerable cross-over between groups, and this potentially explains the failure of the intervention to produce the intended result. In subjects who crossed over from the intervention to the control group (hereafter called "criminals"), the odds of gun-related violence increased by a factor of 1000.42 (p=0.00001). Many criminals were responsible for more than one gun-related death and crossed over multiple times from intervention to control. There was wide variability in the rates of gun-related violence on the basis of geography and other factors, with fatality rates 10-100 times higher in Baltimore, MD than in Provo, UT.

CONCLUSIONS:
An intervention to reduce gun-related violence failed to achieve this goal, largely as a result of cross-over from the intervention to the control group by "criminals". These criminals undermined the efficacy of the intervention. Moreover, the high geographic variability in gun related violence suggests that factors unrelated to the availability of firearms may drive gun-related violence rates. Future studies in limiting gun-related violence should focus on at-risk groups identified through crime statistics, and should not be NIH funded. Moreover, recrudescent crossover in future studies should be limited by incarceration of criminals for life without parole. Future studies might also focus on more traditional ways of preventing recrudescent cross-over (such as capital punishment). The Personalized Healthcare movement might also provide guidance on how to deal with this challenging problem.

Monday, May 11, 2009

Autism, Vaccines, and The Tragedy of the Commons: Whose Tragedy and Whose Commons?

In last week's NEJM, there is an article about the purported perils of foregoing vaccinations for your kids. The article is here: http://content.nejm.org/cgi/content/full/360/19/1981 .

There are a few points that I think deserve to be made about this issue. First, I digress to outline briefly the idea of "The Tragedy of the Commons."

The Tragedy of the Commons refers to the notion that "commons" such as parks or, more traditionally, "grazing areas" will be more fruitfully enjoyed by all if they are used responsibly. If everybody grazes as many sheep as s/he pleases on the commons, soon enough there will be no grass for the sheep to eat. So it stands to reason that one should graze his sheep responsibly and sparingly on the commons. Paradoxically, there is little incentive to exercise such restraint. Because insomuch as you do, your neighbor does not, and the sparing of the commons effected by you is obliterated by your neighbor, or his neighbor, etc. As when passing a sign enjoining you not to walk on the grass, you are wont to say "ah, but what difference will it make?", and your neighbor might respond "yeah, but if we all did that....". The sign is there to regulate the commons that would be depredated were it not for some social policy forcing restraint. So long as the MAJORITY refrains from treading on the lush monocultured turf, it will remain lush. But after a certain threshold number of defectors trammels it, the commons is lost.
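For reference, that critical threshold has a standard back-of-the-envelope form: outbreaks die out once the fraction immune exceeds 1 - 1/R0, where R0 is the basic reproduction number. A minimal sketch (Python; the R0 range of roughly 12 to 18 for measles is the commonly cited figure, taken here as an assumption):

    # Herd-immunity threshold: above 1 - 1/R0 immune, each case infects
    # fewer than one new person on average, and outbreaks die out.
    for r0 in (12, 15, 18):  # commonly cited range for measles
        print(f"R0 = {r0}: herd-immunity threshold ~ {1 - 1 / r0:.0%}")

    # ~92-94% - which is why a modest number of defectors can break the commons.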

And such, I will demonstrate, is the issue with refusing vaccinations. The threat that results is not so much to the unvaccinated child, but rather to the commons - to the herd immunity. So far, it seems to me, the medical and public health establishments have sought to appeal to the sensitivities of parents to their own children's welfare rather than to supplicate them to "do what's right for society." To me, this is an overtly disingenuous approach. The vaccination of any individual child, when the baseline vaccination rate is above some critical threshold, is an act of social responsibility much more than it is something essential for the health of the individual child. I suspect that some vaccine-refusing parents (let's call them Refuseniks, shall we?) recognize this, and this recognition, combined with a tendency for rebellion, creates an impetus for refusal, especially if they think that the vaccine may cause autism or some other untoward effect. Let's look at some numbers.

First let's start with an estimate of the incidence of Measles with and without vaccination (if you take issue with these estimates and the resulting conclusions, please furnish your own numbers with a reference):

Measles with vaccination:
0.0000010000000 per annum
Measles without vaccination:
0.0002500000000

Even though this is a 250x increase, it is still only an absolute increase of:
0.0002490000000

So, if you fail to vaccinate your child, you increase his/her risk of measles by only about 0.025% per annum.

But the case fatality rate for measles is only about 0.3%. So, you increase your child's risk of death from measles by only:
0.0000007470000

That's a very small number, my friends.

Now let's also say that you're concerned about the risk of autism, for whatever reason, even a specious one. And you ask your pediatrician who is skeptical, so s/he refers you to the most recent good quality epidemiological data, the Danish data from NEJM in 2002: http://content.nejm.org/cgi/content/abstract/347/19/1477 .

In this study, the upper bound of the 95% CI for the association of MMR with autism was 1.24. Thus, a 24% increase in the risk of autism is certainly within the range of plausibility based on these data. The base rate of autism in this study was:

Base rate of autism:
0.0005880000000
Rate of autism with a 24% increase (assuming it may be as high as the upper confidence limit):
0.0007290000000
Absolute increase in autism rate:
0.0001410000000

Now, I realise that autism may not be as bad as death for a child, but this POTENTIAL increase in autism, consistent with good data, far overshadows the risk of death from Measles attributable to failure to vaccinate your child.
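For the numerically inclined, here is the arithmetic above as a minimal Python sketch, using the estimates quoted in this post (rerun it with your own numbers if you take issue with mine):

# A minimal sketch of the arithmetic above, using this post's estimates.
measles_risk_vaccinated = 0.000001      # annual risk of measles with vaccination
measles_risk_unvaccinated = 0.00025     # annual risk of measles without vaccination
case_fatality_rate = 0.003              # ~0.3% of measles cases are fatal

delta_measles = measles_risk_unvaccinated - measles_risk_vaccinated
delta_death = delta_measles * case_fatality_rate

autism_base = 0.000588                  # base rate of autism (Madsen et al.)
autism_upper = autism_base * 1.24       # assuming the upper 95% confidence limit
delta_autism = autism_upper - autism_base

print(f"Added annual measles risk:        {delta_measles:.9f}")   # 0.000249000
print(f"Added annual measles-death risk:  {delta_death:.9f}")     # 0.000000747
print(f"Potential added autism risk:      {delta_autism:.9f}")    # 0.000141120
print(f"Autism increase / death increase: {delta_autism / delta_death:.0f}x")  # ~189x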

So it stands to reason that, if a person has, for whatever reasons, a value system that makes autism a grave concern for them, they are NOT acting terribly far outside the bounds of rationality by refusing vaccination for their individual child.

Now if their child has siblings, and/or they live in a community where there is a high rate of vaccination refusal, these numbers are out the window and the individual child risk is much harder to calculate and probably much higher.
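How high can refusal rates climb before this particular commons collapses? A standard back-of-the-envelope formula for the herd immunity threshold is 1 - 1/R0. The sketch below assumes the commonly cited R0 range of roughly 12-18 for measles; that range is my assumption, not a figure from the NEJM article:

# Herd immunity threshold: the fraction of the population that must be immune
# so that each case infects, on average, fewer than one other person.
def herd_immunity_threshold(r0):
    return 1 - 1 / r0

for r0 in (12, 15, 18):   # commonly cited R0 estimates for measles (assumption)
    print(f"R0 = {r0}: ~{herd_immunity_threshold(r0):.1%} of the population must be immune")
# R0 = 12: ~91.7%; R0 = 15: ~93.3%; R0 = 18: ~94.4%

With thresholds north of 90%, it takes only a modest contingent of Refuseniks to trammel the turf.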

(I recognize also that I have used the ANNUAL measles risk, which accumulates over the years of childhood; this may sway the numbers in favor of vaccination, since the putative autism risk from the vaccine is a one-time exposure.)

I do not mean to imply here that I am against vaccination (I am not), nor that I believe that autism is caused by MMR or other vaccines (I do not), but I think 4 points are germane to this conversation which may be emblematic of other issues in public health where officials are apt to take a paternalistic stance:

1.) The individual child's absolute risk of death from Measles is VERY small, as is the increase in risk from failure to be vaccinated.

2.) The risk of autism from MMR based on the Madsen data has a wide confidence interval which does not exclude what some parents may think is a meaningful increased risk of 24%. The meaningfulness of this risk may be especially important in the context of comparing it with another very small risk, such as that of death or disability from measles, or motor vehicle accidents.

3.) The refusal to vaccinate is more of a social responsibility issue, a Tragedy of the Commons, than it is an individual patient safety and health issue. (Such is also the case with PPDs, TB, and INH prophylaxis, but don't get me started on that.)

4.) The risks that parents take for their children through vaccination refusal are similar to the risks they take via motor vehicle travel. We are not encouraging parents to cut in half the number of miles they drive with their children per annum to reduce the risk of death from MVAs from 0.000145 to half of that, so why are we so adamant about their getting MMR? Because it's an issue of the commons, not the individual.

And if it is an issue of civic responsibility, we should frame it as such, rather than guilt-tripping parents about exposing their children to risk via neglect. Just like driving a massive Ford Excursion, where your children may be safer but everybody else's are worse off (because of the size of your projectile or its impact on the environment), vaccination is better for the commons, if not for your own children.

Thursday, April 30, 2009

Luck that Looks Like Logic? Statins (Rosuvastatin), the Cholesterol Hypothesis, and Causal Pathways

The Cholesterol Hypothesis (CH) - namely, that the association between elevated cholesterol (LDL) and cardiovascular disease and events is a CAUSAL one, and thus that intervening to lower cholesterol prevents these diseases - has seduced mainstream medicine for decades. However, much if not most of the evidence for the causality of cholesterol in atherogenesis and its reversal by lowering cholesterol derives from studies of "Statins" or HMG-CoA-reductase inhibitors; indeed, the evidence that lowering LDL cholesterol (or raising HDL) through other pathways has salutary effects on cardiovascular outcomes is scant at best, as has been chronicled on this blog (see posts on torcetrapib and ezetimibe/Vytorin). Not myself immune to the beguiling allure of the CH, I admit that I take Niacin, in spite of normal HDL levels and scant to no trustworthy evidence that, in addition to raising HDL and lowering LDL, it will have any primary (or secondary or tertiary) preventative effects for me.

In yesterday's NEJM, Glynn et al report the results of an analysis of a secondary endpoint from the JUPITER trial of Rosuvastatin (http://content.nejm.org/cgi/content/abstract/360/18/1851 ). The primary aim of the trial was to determine if Rosuvastatin was effective for primary prevention of cardiovascular events in people with normal cholesterol levels and elevated CRP levels. The secondary endpoint described in the article was the occurrence of venous thromboembolism during the study period. Because I see no obvious evidence of foul play, and because this study was simply impeccably designed, conducted, and reported, I'm going to hereafter ignore the fact that it was industry sponsored, and that there is probably some motive of "off-label promotion by proxy" (http://medicalevidence.blogspot.com/2008/06/off-label-promotion-by-proxy-how-nejm.html ) here...

Lo and behold: Rosuvastatin lowered venous thromboembolism rates. The difficulties posed by ascertainment of this outcome notwithstanding, this trial provides convincing evidence of a statistically significant reduction in DVT and PE event rates (which were very low - about 0.2 events per 100 person-years) during the four-year period of study. And this does not make a whole lot of sense from the standpoint of the CH. There's something more going on. Like an anti-inflammatory property of Statins. Which is very interesting and noteworthy and worthwhile in its own right. But I'm more interested in what kind of light this sheds on the validity of the CH.

Because of my interest in the frailty of the normalization hypothesis/heuristic (the notion that you just measure something, raise or lower it to the normal range, and make things ALL better), I am obviously a reserved skeptic of the Cholesterol Hypothesis, which was bolstered by if not altogether reared by data from trials of statins. And these new data, combined with emerging evidence that statins may have salutary effects on lung inflammation in ARDS and COPD, among perhaps others, make me wonder - was it just pure LUCK rather than a triumph of LOGIC that the first widely tested and marketed drug for cholesterol happened to both reduce cardiovascular endpoints AND lower cholesterol, even though not necessarily as part of the same causal pathway? Is it just "true, true, and unrelated"? Is it the anti-inflammatory properties, or some other piece of the complex biochemical effects of these drugs on the body, that leads to their clinical benefits? Other examples come to mind: Is blood pressure lowering just an epiphenomenon of another primary ACE-inhibitor effect on heart failure? That these effects appear superficially and intuitively related does not mean that they lie on an obvious causal pathway.

What if things had happened another way? What if Statins had eluded discovery for another 20-30 years? What if study of the cholesterol hypothesis had meanwhile proceeded through evaluation of Cholestyramine, Colestipol, Niacin, and other drugs, and what if it had been "disconfirmed" by the failure of these agents to reduce cardiovascular outcomes? These hypotheticals will be answerable only after more study of Statins and other drugs as well as their mechanisms. The data presented by the Harvard group, as well as their other work with CRP, are but one leg of a long journey toward elucidation of the biological mechanisms of atherogenesis, coagulation, and downstream clinical events.

Tuesday, April 21, 2009

Judicial use of DNA "evidence" and Misuse of Statistics: The Prosecutor's Fallacy

A recent article in the NYT described the adoption by the judicial system of a technology that began as a biomedical research tool (I resist to some extent the notion that DNA technology has directly been a boon to clinical patient care). (See: http://www.nytimes.com/2009/04/19/us/19DNA.html.) This powerful technology, when used appropriately in appropriate circumstances, provides damning evidence of guilt because of its high specificity - the probability of a coincidental match is stated to be as low as 1x10-9. Thus, in a case such as that of the infamous (and nefarious) OJ Simpson, in which there is strong suspicion of guilt BEFORE the DNA evidence is evaluated, a positive match, in the absence of laboratory error or misconduct (neither of which can be routinely discounted - see: http://www.nytimes.com/2001/09/26/us/police-chemist-accused-of-shoddy-work-is-fired.html) essentially proves, beyond any reasonable doubt, the genetic identity of the person to whom the sample belongs. (Yes, that does indeed mean that OJ Simpson is the perpetrator of the heinous murder of Nicole Brown Simpson, he said unapologetically.)

In the case of old OJ, he was one among perhaps 10 - let's generously say 100 - suspects. Let's assume that the LAPD had their act together (this also requires a leap of faith) and that the perpetrator is among the suspects that have been rounded up, but that we have no evidence to differentiate their respective probabilities of guilt. Thus, each of the 100 has a 1% probability of being guilty, on the basis of circumstantial evidence alone, or a relation to or relationship with the victim(s), or just being in the wrong place at the wrong time, whatever. Given that 1% probability of guilt, we can make a 2x2 table representing the probability of guilt given a positive test, which is ultimately what we want to know. I don't know the sensitivity of DNA fingerprinting, but it doesn't really matter because the high specificity of the test drives the likelihood ratio. I will assume it's 50% for simplicity:


In this "population" of 100 suspects (by suspects, I mean persons whose probability of having committed the crime is enhanced over that of a random member of the overall population by virtue of other evidence), even if all 100 suspects have equiprobable guilt, a DNA "match" is damning indeed and all but assures the guilt of the matching suspect (with the caveats mentioned above.)

But consider a different situation, one in which there are no convincing suspects. Suppose that the law enforcement authorities compare a biological sample with a large DNA database to look for a match. Note that we do not use the term "suspect" here - because it implies that there is some suspicion that has limited this population from the overall population. When a database (of unsuspected persons) is canvassed, no such suspicion exists. Rather, a fishing expedition ensues, and the probabilities, when computed, come out quite different. Suppose there are DNA samples from 100 million individuals in the database, and the entire database is canvassed. Now our 2x2 table looks like this:


Whereas in our previous example of a population of "suspects" guilt was all but assured based on a "match", in this example of canvassing a database, guilt is dubious. But what do you suppose will happen in such an investigation? Who will suspend his judgment and conduct a fair investigation of this "matching" individual, who is now a "suspect" based only on "evidence" from this misused test? How tempting will it be for detectives to selectively gather information and see reality through the distorted lens of the "infallible" DNA testing? How can such a person hope to exonerate himself?
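In lieu of the tables, here is the same Bayesian arithmetic as a minimal Python sketch. The 50% sensitivity and the one-in-a-billion coincidental match rate are as assumed above; the database prior of 1 in 100 million is my own illustrative assumption:

# P(guilty | DNA match) via Bayes' rule, under the assumptions stated above.
def prob_guilty_given_match(prior, sensitivity=0.5, coincidental_match_rate=1e-9):
    true_pos = sensitivity * prior
    false_pos = coincidental_match_rate * (1 - prior)
    return true_pos / (true_pos + false_pos)

# Scenario 1: 100 suspects with equiprobable guilt (prior = 1/100).
print(f"Suspect pool of 100:     {prob_guilty_given_match(1 / 100):.7f}")          # 0.9999998

# Scenario 2: canvassing a database of 100 million unsuspected persons
# (illustrative prior for any one individual = 1/100,000,000).
print(f"Database of 100 million: {prob_guilty_given_match(1 / 100_000_000):.3f}")  # 0.833

Even granting that the perpetrator is in the database, on these assumptions roughly one match in six would be coincidental - hardly "beyond a reasonable doubt", and a far cry from the near-certainty of the suspect-pool scenario.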

This is the Prosecutor's Fallacy. It bolsters arguments by the ACLU and others that the trend of snowballing DNA sample collection should be curtailed, and that limits should be placed on canvassing efforts to solve crimes.

One way to limit the impact of the Prosecutor's Fallacy and false positive "matches" from canvassing efforts would be to force investigators to assign certain profiles to the imaginary "suspect" whom they hope to find in the database and to canvass a subgroup of the database that matches those characteristics. For example, if the crime occurred in Seattle, the canvassing effort could be limited to a subset of the database that lived in or near Seattle, since it is unlikely that a person in Baltimore committed the crime. Other characteristics that are probabilistically associated with certain crimes could be used to limit broad canvassing efforts.

As the use of medical technology expands both inside and outside medicine, we have a responsibility to utilize it wisely and rationally. The strategy of database screening and canvassing is reckless, unwise, and unjust, and should be summarily and duly curtailed.

Wednesday, April 8, 2009

The PSA Screening Quagmire - If Ignorance is Bliss then 'Tis Folly to be Wise?

The March 26th NEJM was a veritable treasure trove of interesting evidence, so I can't stop after praising NICE-SUGAR and railing on intensive insulin therapy. If 6000 patients (40,000 screened) seemed like a commendable and daunting study to conduct, consider that the PLCO Project Team randomized over 76,000 US men to screening versus control (http://content.nejm.org/cgi/reprint/360/13/1310.pdf) and the ERSPC Investigators randomized over 162,000 European men in a "real-time meta-analysis" of sorts (wherein multiple simultaneous studies were conducted with similar but different enrollment requirements and combined; see: http://content.nejm.org/cgi/reprint/360/13/1320.pdf). This is, as the editorialist points out, a "Herculean effort", and that is fitting and poignant, because ongoing PSA screening efforts in current clinical practice represent a Herculean effort to reduce the morbidity and mortality of this disease. This reinforces the importance of the research question: are we wasting our time? Are we doing more harm than good?

The lay press was quick to start trumpeting the downfall of PSA screening with headlines such as "Prostate Test Found to Save Few Lives" . But for all their might, both of these studies give me, a longtime critic of cancer screening efforts, a good bit of pause. (Pulmonologists may be prone to "sour grapes" as a result of the failures of screening for lung cancer.)

Before I summarize briefly the studies and point out some interesting aspects of each, allow me to indulge in a few asides. First, I direct you to this interesting article in Medical Decision Making, "Cure Me Even if it Kills Me". This wonderful study in judgment and decision making shows how difficult it is for patients to live with the knowledge that there is a cancer, however small, growing in them. They want it out. And they want it out even if they are demonstrably worse off with it cut out or x-rayed out or whatever. It turns out that patients have a value for "getting rid of it" that probably arises from the emotional costs of living knowing there's a cancer in you. I highly recommend that anyone interested in cancer screening or treatment read this article.

This article invokes in me an unforgettable patient from my residency whom we screened in compliance with VA mandates at the time. Sure enough, this patient with heart disease had a mildly elevated PSA, and sure enough he had a cancer on biopsy. And we discussed treatments in concert with our Urology colleagues. While he had many options, this patient agonized and brooded and could not live with the thought of a cancer in him. He proceeded with radical prostatectomy, the most drastic of his options. And I will never forget that look of crestfallen resignation every time I saw him after that surgery, because he thereafter came to clinic in diapers, having been rendered incontinent and impotent by that surgery. He was more full of self-flagellating regret than any other patient I have seen in my career. This poor man and his experience certainly jaded me at a young age and made me highly attuned to the pitfalls of PSA screening.

Against this backdrop where cancer is the most feared diagnosis in medicine, we feel an urge towards action to screen and prevent, even when there is a marginal net benefit of cancer screening, and even when other greater opportunities for improving health exist. I need not go into the literature about [ir]rational risk appraisal other than to say that our overly-exuberant fear of cancer (relative to other concerns) almost certainly leads to unrealistic hopes for screening and prevention. Hence the great interest in and attention to these two studies.

In summary, the PLCO study showed no reduction in prostate-cancer-related mortality from DRE (digital rectal examination) and PSA screening. Absence of evidence is not evidence of absence, however, and a few points about this study deserve to be made:

~Because of high (and increasing) screening rates in the control group, this was essentially a study of the "dose" of screening. The dose in the control group was ~45% and that in the screening group was ~85%. So the question that the study asked was not really "does screening work?" but rather "does doubling the dose of screening work?". Had there been a favorable trend in this study, I would have been tempted to double the effect size of the screening to infer the true effect, reasoning that if increasing screening from 40% to 80% reduces prostate cancer mortality by x%, then increasing screening from 0% to 80% would reduce it by 2x% (see the sketch after these points). Alas, there was no such trend in this study, which was underpowered.

~I am very wary of studies that have cause-specific mortality as an endpoint. There's just too much room for adjudication bias, as the editorialist points out. Moreover, if you reduce prostate cancer mortality but overall mortality is unchanged, what do I, as a potential patient, care? Great, you saved me from prostate cancer and I died at about the same time I would have, but from an MI or a CVA instead? We have to be careful about whether our goals are good ones - the goal should not be to "fight cancer" but rather to "improve overall health". The latter, I admit, is a much less enticing and invigorating banner. We like to feel like we're fighting. (Admittedly, overall mortality appears not to differ in this study, but I'm at a loss as to what's really being reported in Table 4.) The DSMB for the ERSPC trial argues here that cancer-specific mortality is most appropriate for screening trials because of dilution by other causes of mortality, and because screening for a specific cancer can only be expected to reduce mortality from that cancer. From an efficacy standpoint, I agree, but from an effectiveness standpoint, this position causes me to squint and tilt my head askance.

~It is so very interesting that this study was stopped not for futility, nor for harm, nor for efficacy, but because it was deemed necessary to release the data on account of their [potential] impact on public health. And what has been the impact of those data? Utter confusion. That increasing screening from 40% to 80% does not improve prostate-specific mortality does not say to me that we should reduce screening to 0%. In fact, I don't know what to do, nor what to make of these data, especially in the context of the next study.
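Regarding the first point above, the "dose of screening" extrapolation amounts to simple linear scaling. A minimal sketch follows; the 10% relative reduction is hypothetical, purely for illustration (PLCO observed no such benefit):

# Linear "dose of screening" extrapolation, as proposed above.
control_dose = 0.40        # ~40-45% screening contamination in the control group
intervention_dose = 0.80   # ~80-85% screening in the intervention group

observed_rrr = 0.10        # HYPOTHETICAL 10% relative risk reduction, for illustration

# Scale from the observed dose difference (80% - 40%) to the full difference (80% - 0%).
scaling = (intervention_dose - 0.0) / (intervention_dose - control_dose)
print(f"Scaling factor: {scaling:.1f}x")                                     # 2.0x
print(f"Inferred RRR of screening vs. no screening: {observed_rrr * scaling:.0%}")  # 20%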

In the ERSPC trial, investigators found a 20% reduction in prostate cancer deaths with screening with PSA alone in Europe. The same caveats regarding adjudication of this outcome notwithstanding, there are some very curious aspects of this trial that merit attention:

~This trial was, as I stated above, a "real-time meta-analysis" with many slightly different studies combined for analysis. I don't know what this does to internal or external validity because this is such an unfamiliar approach to me, but I'll be pondering it for a while I'm sure.

~I am concerned that I don't fully understand the way that interim analyses were performed in this trial, what the early stopping rules were, and whether a one-sided or two-sided alpha was used. Reference 6 states that it was one-sided, but the index article says two-sided. Someone will have to help me out with the O'Brien-Fleming alpha spending function and let me know if 1% spending at each analysis is par for the course.

~As noted by the editorialist, we are not told what the "contamination rate" of screening in the control group is. If it is high, we might use my method described above to infer the actual impact of screening.

~Look at the survival curves that diverge and then appear to converge again at a low hazard rate. Is it any wonder that there is no impact on overall mortality?


So where does this all leave us? We have a population of physicians and patients who yearn for effective screening and believe in it, so much so that it is hard to conduct an uncontaminated study of screening. We have a US study that was stopped prematurely in order to inform public health, but which is inadequate to inform it. We have a European study which shows a benefit near the a priori expected benefit, but which has a bizarre design and is missing important data that we would like to consider before accepting the results. We have no hint of a benefit on overall mortality. We have lukewarm conclusions from both groups, and want desperately to know what the associated morbidities in each group are. We are spending vast amounts of resources and incurring an enormous emotional toll on men who live in fear after a positive PSA test, many of whom pay dearly ("a pound of flesh") to exorcise that fear. And we have a public over-reaction to the results of these studies which merely increases our quandary.

If ignorance is bliss, then truly 'tis folly to be wise. Perhaps this saying applies equally to individual patients, and the investigation of PSA screening in these large-scale trials. For my own part, this is one aspect of my health that I shall leave to fate and destiny, while I focus on more directly remediable aspects of preventive health, ones where the prevention is pleasurable (running and enjoying a Mediterranean diet) rather than painful (prostatectomy).

Sunday, April 5, 2009

Another [the final?] nail in the coffin of intensive insulin therapy (Leuven Protocol) - and redoubled scrutiny of single center studies

In the March 26th edition of the NEJM, the NICE-SUGAR study investigators publish the results of yet another study of intensive insulin therapy in critically ill patients: http://content.nejm.org/cgi/content/abstract/360/13/1283 .

This article is of great interest to critical care practitioners because intensive insulin therapy (the Leuven Protocol) or some diluted or half-hearted version of it has become a de facto standard of care in ICUs across the nation and indeed worldwide, and because it is an incredibly well-designed and well-conducted study. My own interest derives also from my own [prescient] letter to the editor of the NEJM after the second Van den Berghe study (http://content.nejm.org/cgi/content/extract/354/19/2069 ), the criticisms I levied against this therapy on this blog after another follow-up study recently showed negative results (http://medicalevidence.blogspot.com/2008/01/jumping-gun-with-intensive-insulin.html ), and a recent paper railing against the "normalization heuristic" (http://www.medical-hypotheses.com/article/S0306-9877(09)00033-4/abstract ). The results of this study also add to the growing evidence that intensive control of hyperglycemia in other settings may not be beneficial (see the ACCORD and ADVANCE studies.)

The current study was designed to largely mirror the enrollment criteria and outcome definitions of the previous studies, had excellent follow-up, had well described and simple statistical analyses with ample power, and is well reported. Key differences between it and the original Van den Berghe study were the lack of high-calorie parenteral glucose infusions, and its multicenter design. This latter characteristic may be pivotal in understanding why the initially promising Leuven Protocol results have not panned out on subsequent study.

The results of this study can be summarized simply by saying that this therapy appears to be of NO benefit and probably actually kills patients, in addition to markedly increasing the rate of very severe hypoglycemia (a 6.3% absolute increase, P<0.001). In contrast to Van den Berghe's second study in medical patients, there were no favorable trends towards reduction in ICU length of stay, time on the ventilator, or organ failures. In short, this therapy appears to be a complete flop.
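To put the hypoglycemia finding in number-needed-to-harm terms, here is a minimal sketch; the 6.8% vs. 0.5% event rates are my recollection of the trial's figures, consistent with the 6.3% difference above:

# Number needed to harm (NNH) for severe hypoglycemia with intensive therapy.
rate_intensive = 0.068      # severe hypoglycemia with intensive insulin (recalled figure)
rate_conventional = 0.005   # severe hypoglycemia with conventional control (recalled figure)

absolute_risk_increase = rate_intensive - rate_conventional
nnh = 1 / absolute_risk_increase
print(f"Absolute risk increase: {absolute_risk_increase:.1%}")  # 6.3%
print(f"NNH: ~{nnh:.0f}")  # ~16: one extra severe episode for every ~16 patients treated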

So why the difference? Why did this therapy, which in 2001 appeared to have such promise that it enjoyed rapid and widespread [and premature] adoption, fail to withstand the basic test of science, namely repeatability? I think that medical history will judge two factors to be responsible. Firstly, the massive dextrose infusions markedly jeopardized the external validity of the first (positive) Van den Berghe study - it's not that intensive insulin saves you from your illness; it saves you from the harmful caloric infusions used in the surgical patients in that study.

Secondly, and this is related to the first, single center studies also compromise external validity. In a single center, local practice patterns may be uniform and idiosyncratic, so that the benefit of any therapy tested in such a center may also be idiosyncratic. Moreover, and I dare say, investigators at a single center may have more decisional latitude and control or influence over enrollment, ascertainment of outcomes, and clinical care of enrolled patients. The so-called "trial effect", whereby patients enrolled in a trial receive superior care and have superior outcomes, may be more likely in single center studies. Such effects are of increased concern in trials where total blinding/masking of treatment assignment is not possible. (Recall that in the Van den Berghe study, an endocrinologist was consulted for insulin adjustments; in the current trial, a computerized algorithm controlled the adjustments.) Moreover still, for single center studies, investigators and the institution itself may have more "riding on" the outcome of the study, and collective equipoise may not exist. As an "analogy of extremes", just for illustrative purposes: if you wanted to design a trial in which you could subversively influence outcomes in a way that would not be apparent from the outside, would you design a single center study (at your own institution, where your cronies are) or a large multicenter, multinational study? Which design would allow you to have more influence?

I LOVE the authors' concluding statement that "a clinical trial targeting a perceived risk factor is a test of a complex strategy that may have profound effects beyond its effect on the risk factor." This resonates beautifully with our conceptualization of the "normalization heuristic" and harkens to Ben Franklin's sage old saw that "He is the best physician who knows the worthlessness of the most medicines." I think that we now have more than ample data to assure us that intensive insulin therapy (i.e., targeting a blood sugar of 80-108) is a worthless medicine, and should be largely if not wholly abandoned.

Addendum 4/7/09: Also note the scrutiny of the only other "positive" study (with mortality as the primary endpoint) in critical care in the last decade: Rivers et al; see: http://online.wsj.com/article/SB121867179036438865.html .

Saturday, March 14, 2009

"Statistical Slop": What billiards can teach us about multiple comparisons and the need to assign primary endpoints

Anyone who has played pool knows that you have to call your shots before you make them. This rule is intended to decrease the probability of "getting lucky" by just hitting the cue ball as hard as you can, expecting that the more it bounces around the table, the more likely it is that one of your many balls will fall through chance alone. Sinking a ball without first calling it is referred to colloquially as "slop" or a "slop shot".

The underlying logic is that you know best which shot you're MOST likely to successfully make, so not only does that increase the prior probability of a skilled versus a lucky shot (especially if it is a complex shot, such as one "off the rail"), but also it effectively reduces the number of chances the cue ball has to sink one of your balls without you losing your turn. It reduces those multiple chances to one single chance.

Likewise, a clinical trialist must focus on one "primary outcome" for two reasons: 1.) because preliminary data (if available), background knowledge, and logic will allow him to select the variable with the highest "pre-test probability" of causing the null hypothesis to be rejected, meaning that the post-test probability of the alternative hypothesis is enhanced; and 2.) because it reduces the probability of finding "significant" associations among multiple variables through chance alone. Today I came across a cute little experiment that drives this point home quite well. The abstract can be found here on pubmed: http://www.ncbi.nlm.nih.gov/pubmed/16895820?ordinalpos=4&itool=EntrezSystem2.PEntrez.Pubmed.Pubmed_ResultsPanel.Pubmed_DefaultReportPanel.Pubmed_RVDocSum .


In it, the authors describe "dredging" a Canadian database and looking for correlations between astrological signs and various diagnoses. Significant associations were found between the Leo sign and gastrointestinal hemorrhage, and between the Sagittarius sign and humerus fracture. With this "analogy of extremes", as I like to call them, you can clearly see how the failure to define a prospective primary endpoint can lead to statistical slop. (Nobody would have been able to predict a priori that it would be THOSE two diagnoses associated with THOSE two signs!) Failure to PROSPECTIVELY identify ONE primary endpoint led to multiple chances for chance associations. Moreover, because there were no preliminary data upon which to base a primary hypothesis, the prior probability of any given alternative hypothesis is markedly reduced, and thus the posterior probability of the alternative hypothesis remains low IN SPITE OF the statistically significant result.
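The arithmetic of uncalled shots is unforgiving. A minimal sketch, assuming independent tests at alpha = 0.05 (which dredged database comparisons only approximate):

# Familywise error: probability of at least one spurious "significant" result.
alpha = 0.05
for m in (1, 12, 24, 100):   # m = number of independent comparisons
    p_at_least_one = 1 - (1 - alpha) ** m
    print(f"{m:>3} comparisons: P(>=1 false positive) = {p_at_least_one:.0%}")
# 1: 5%, 12: 46%, 24: 71%, 100: 99%

With 12 signs crossed against scores of diagnoses, a couple of "significant" astrological associations are not just possible but practically guaranteed.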

It is for this very reason that "positive" or significant associations among non-primary endpoint variables in clinical trials are considered "hypothesis generating" rather than hypothesis confirming. Requiring additional studies of these associations as primary endpoints is like telling your slop shot partner in the pool hall "that's great, but I need to see you do that double rail shot again to believe that it's skill rather than luck."

Reproducibility of results is indeed the hallmark of good science.

Tuesday, March 10, 2009

PCI versus CABG - Superiority is in the heart of the angina sufferer

In the current issue of the NEJM, Serruys et al describe the results of a multicenter RCT comparing PCI with CABG for severe coronary artery disease: http://content.nejm.org/cgi/content/full/360/10/961. The trial, which was designed by the [profiteering] makers of drug-coated stents, was a non-inferiority trial intended to show the non-inferiority (NOT the equivalence) of PCI (new treatment) to CABG (standard treatment). Alas, the authors appear to misunderstand the design and reporting of non-inferiority trials, and mistakenly declare CABG as superior to PCI as a result of this study. This error will be the subject of a forthcoming letter to the editor of the NEJM.

The findings of the study can be summarized as follows: compared to PCI, CABG led to a 5.6% absolute reduction in the combined endpoint of death from any cause, stroke, myocardial infarction, or repeat revascularization (P=0.002). The caveats regarding non-inferiority trials notwithstanding, there are other reasons to call into question the interpretation that CABG is superior to PCI, and I will enumerate some of these below.

1.) The study used a ONE-SIDED 95% confidence interval - shame, shame, shame. See: http://jama.ama-assn.org/cgi/content/abstract/295/10/1152 .
2.) Table 1 is conspicuous for the absence of cost data. The post-procedural hospital stay was 6 days longer for CABG than PCI, and the procedural time was twice as long - both highly statistically and clinically significant. I recognize that it would be somewhat specious to provide mean costs because it was a multinational study and there would likely be substantial dispersion of costs among countries, but neglecting the data altogether seems a glaring omission of a very important variable if we are to rationally compare these two procedures.
3.) Numbers needed to treat are mentioned in the text for variables such as death and myocardial infarction that were not individually statistically significant. This is misleading. The significance of the composite endpoint does not allow one to infer that the individual components are significant (they were not), and I don't think it's conventional to report NNTs for non-significant outcomes (see the sketch after this list).
4.) Table 2 lists significant deficiencies and discrepancies in pharmacological medical management at discharge, which are inadequately explained, as mentioned by the editorialist.
5.) Table 2 also demonstrates a five-fold increase in amiodarone use and a three-fold increase in warfarin use at discharge among patients in the CABG group. I infer this to represent an increase in the rate of atrial fibrillation in the CABG patients, but because the rates are not reported, I am kept wondering.
6.) Neurocognitive functioning and the incidence of deficits (if measured), a known complication of bypass, are not reported.
7.) It is mentioned in the discussion that after consent, more patients randomized to CABG compared to PCI withdrew consent, a tacit admission of the wariness of patients to submit to this more invasive procedure.
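Regarding point 3, here is why quoting NNTs for non-significant components misleads, as a minimal Python sketch. The event counts below are my approximations of the reported composite result, plus a purely hypothetical non-significant component (not the trial's exact figures):

from math import sqrt

# Wald 95% CI for an absolute risk reduction (p1 - p2), and the implied NNT.
def risk_diff_ci(events1, n1, events2, n2, z=1.96):
    p1, p2 = events1 / n1, events2 / n2
    diff = p1 - p2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return diff, diff - z * se, diff + z * se

# Composite endpoint, approximating the reported ~5.6% absolute difference:
diff, lo, hi = risk_diff_ci(161, 900, 111, 900)
print(f"ARR {diff:.1%} (95% CI {lo:.1%} to {hi:.1%}), NNT ~{1 / diff:.0f}")
# ARR 5.6% (95% CI 2.3% to 8.9%), NNT ~18

# A HYPOTHETICAL non-significant component (e.g., death or MI):
diff, lo, hi = risk_diff_ci(39, 900, 32, 900)
print(f"ARR {diff:.1%} (95% CI {lo:.1%} to {hi:.1%})")
# ARR 0.8% (95% CI -1.0% to 2.6%): the CI crosses zero, so the implied "NNT"
# interval passes through infinity, and quoting a single NNT manufactures
# precision that the data do not support.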

In all, what this trial does for me is to remind me to be wary of an overly simplistic interpretation of complex data and of a tendency toward dichotomous thinking - superior versus inferior, good versus bad, etc.

One interpretation of the data is that a 3.4-hour bypass surgery and 9 days in the hospital !MIGHT! save you from an extra 1.7-hour PCI and another 3 days in the hospital (on top of your initial commitment of 1.7 hours of PCI and 3 days in the hospital) if you wind up requiring revascularization, the primary [only] driver of the composite endpoint. And in payment for this dubiously useful exchange, you must submit to a ~2% increase in the risk of stroke, have a cracked chest, risk surgical wound infection (the rate of which is also not reported), pay an unknown (but probably large) increased financial cost, and accept a probably large increased risk of atrial fibrillation, and therefore be discharged on amiodarone and coumadin with their high rates of side effects and drug-drug interactions, while coincidentally risking discharge on inadequate medical pharmacological management.

Looked at from this perspective, one sees that beauty is truly in the eye of the beholder.

Monday, March 9, 2009

Money talks and Chantix (varenicline) walks - the role of financial incentives in inducing healthful behavior

I usually try to keep the posts current, but I missed a WONDERFUL article a few weeks ago in the NEJM, one that is pivotal in its own right, but especially in the context of good decision making about therapeutic choices and opportunity costs.

The article, by Volpp et al., entitled "A Randomized, Controlled Trial of Financial Incentives for Smoking Cessation", can be found here: http://content.nejm.org/cgi/content/abstract/360/7/699
In summary, smokers at a large US company, where a smoking cessation program existed before the research began, were randomized to receive additional information about the program, versus the same information plus a financial incentive of up to $750 for successfully stopping smoking. At 9-12 months, smoking cessation was about 10% higher in the financial incentive group (14.7% vs. 5.0%, P<0.001).

In the 2006 JAMA article on varenicline (Chantix) by Gonzales et al (http://jama.ama-assn.org/cgi/reprint/296/1/47.pdf ), the cessation rates at weeks 9-52 were 8.4% for placebo and 21.9% for varenicline, an absolute gain of 13.5%. (Similar results were reported in the study by Jorenby et al: http://jama.ama-assn.org/cgi/content/abstract/296/1/56?maxtoshow=&HITS=10&hits=10&RESULTFORMAT=&fulltext=varenicline&searchid=1&FIRSTINDEX=0&resourcetype=HWCIT ) Now, given that this branded pharmaceutical sells for ~$120 for a 30 day supply, and that, based on the article by Tonstad (http://jama.ama-assn.org/cgi/reprint/296/1/64.pdf ), many patients are continued on varenicline for 24 weeks or more, the cost of a course of treatment with the drug is approximately $720, just about the same as the financial incentives used in the index article.

And all of this begs the question: Is it better to pay ~$720 for 6 months of treatment with a drug that has [potentially serious] side effects to achieve a ~13.5% absolute increase in quit rates, or to pay patients up to $750 to quit smoking to achieve a ~10% increase without harmful side effects, and in fact with POSITIVE side effects (money to spend on pleasurable alternatives to smoking or other necessities)?
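Some rough arithmetic on cost per incremental quit, using the figures above (the assumption that the full $750 is paid only to confirmed quitters is mine; the trial actually paid in installments):

# Cost per incremental quit: varenicline vs. financial incentives.
drug_cost = 720.0            # ~$120/month x 6 months of varenicline
drug_gain = 0.135            # 21.9% vs. 8.4% quit rate (Gonzales et al.)
print(f"Varenicline: ~${drug_cost / drug_gain:,.0f} per incremental quit")    # ~$5,333

incentive = 750.0            # ASSUMED paid only on confirmed cessation
quit_rate = 0.147            # quit rate in the incentive arm (Volpp et al.)
incentive_gain = 0.147 - 0.050
expected_cost = incentive * quit_rate    # expected payout per enrolled smoker
print(f"Incentives:  ~${expected_cost / incentive_gain:,.0f} per incremental quit")  # ~$1,137

On these assumptions the incentives are not merely competitive; they are severalfold cheaper per incremental quit.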

The choice is clear to me, and, having failed Chantix, I now consider whether I should offer my brother payment to quit smoking. (I expect to receive a call as soon as he reads this, especially since I haven't mentioned the cotinine tests yet.)

And all of this begs the more important question of why we seek drugs to solve behavioral problems when good old-fashioned greenbacks will do the trick just fine. Why bother with Meridia and Rimonabant and all the other weight loss drugs when we might be able to pay people to lose weight? (See: http://jama.ama-assn.org/cgi/content/abstract/300/22/2631 .) Perhaps one part of Obama's stimulus bill can allocate funds to additional such experiments, or better yet, to such a social program.

One answer to this question is that the financial incentive to study financial incentives is not as great as the financial incentive to find another profitable pill to treat social ills. (There is, after all, a "pipeline deficiency" in a number of Big Pharma companies that has led to several mergers and proposed mergers, such as the announcement today of a possible merger of MRK and SGP, two of my personal favorites.) Yet this study sets the stage for more such research. If we are going to pay one way or another, I for one would rather that we be paying people to volitionally change their behavior, rather than paying via third party to reinforce the notion that there is "a pill for everything". As Ben Franklin said, "S/He is the best physician who knows the worthlessness of the most medicines."