Monday, September 21, 2009

The unreliable asymmetric design of the RE-LY trial of Dabigatran: Heads I win, tails you lose

I'm growing weary of this. I hope it stops. We can adapt the diagram of non-inferiority shenanigans from the Gefitinib trial (see ) to last week's trial of dabigatran, which arrived on the scene in the NEJM as another ridiculously designed non-inferiority trial (see ). Here we go again.

These jokers, lulled by the corporate siren song of Boehringer Ingelheim, had the utter unmitigated gall to declare a delta of 1.46 (relative risk) as the margin of non-inferiority! Unbelievable! To say that a 46% difference in the rate of stroke or arterial clot is clinically non-significant! Seriously!?

They justified this felonious choice on the basis of trials comparing warfarin to PLACEBO as analyzed in a 10-year-old meta-analysis. It is obvious (or should be to the sentient) that an ex-post difference between a therapy and placebo in superiority trials does not apply to non-inferiority trials of two active agents. Any ex-post finding could be simply fortuitously large and may have nothing to do with the MCID (minimal clinically important difference) that is SUPPOSED to guide the choice of delta in a non-inferiority trial (NIT). That warfarin SMOKED placebo in terms of stroke prevention does NOT mean that something that does not SMOKE warfarin is non-inferior to warfarin. This kind of duplicitous justification is surely not what the CONSORT authors had in mind when they recommended a referenced justification for delta.

That aside, on to the study and the figure. First, we're testing two doses, so there are multiple comparisons, but we'll let that slide for our purposes. Look at the point estimate and 95% CI for the 110 mg dose in the figure (let's bracket the fact that they used one-sided 97.5% CIs - it's immaterial to this discussion). There is a non-statistically significant difference between dabigatran and warfarin for this dose, with a P-value of 0.34. But note that in Table 2 of the article, they declare that the P-value for "non-inferiority" is <0.001 [I've never even seen this done before, and I will have to look to see if we can find a precedent for reporting a P-value for "non-inferiority"]. Well, apparently this just means that the RR point estimate for 110 mg versus warfarin is statistically significantly different from a RR of 1.46. It does NOT mean, though it misleadingly suggests, that the comparison between the two drugs on stroke and arterial clot is highly clinically significant. This "P-value for non-inferiority" is just an artificial comparison: had we set the margin of non-inferiority at an even more ridiculously large value, we could have made the "P-value for non-inferiority" as small as we liked, simply by inflating the margin! So this is a useless number, unless your goal is to create an artificial and exaggerated impression of the difference between these two agents.
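To see how hollow that number is, here is a minimal sketch (stdlib Python) of how a one-sided "P-value for non-inferiority" shrinks as the margin is inflated. The RR of 0.91 with 95% CI 0.74-1.11 is an illustrative approximation of the 110 mg comparison, not the exact trial data:

```python
import math

def norm_cdf(z):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def p_noninferiority(rr, ci_lo, ci_hi, delta):
    # One-sided test of H0: true RR >= delta ("inferior by at least delta"),
    # with the standard error of log(RR) backed out of the 95% CI.
    se = (math.log(ci_hi) - math.log(ci_lo)) / (2 * 1.96)
    z = (math.log(rr) - math.log(delta)) / se
    return norm_cdf(z)

# Illustrative numbers resembling the 110 mg comparison (NOT the exact trial data)
rr, lo, hi = 0.91, 0.74, 1.11
print(p_noninferiority(rr, lo, hi, 1.46))  # tiny: "non-inferior" vs a margin of 1.46
print(p_noninferiority(rr, lo, hi, 2.00))  # tinier still with an inflated margin
print(p_noninferiority(rr, lo, hi, 1.00))  # vs no margin at all: unimpressive
```

The same data yield any "P-value for non-inferiority" you like; only the chosen margin changes.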

Now let's look at the 150 mg dose. Indeed, it is statistically significantly different from warfarin (I shall resist using the term "superior" here), and thus the authors claim superiority. But here again, the 95% CI is narrower than the margin of non-inferiority, and had the results gone the other direction (in favor of warfarin), as in Scenarios 3 and 4, we would have still claimed non-inferiority, even though warfarin would have been statistically significantly "better than" dabigatran! So it is unfair to claim superiority on the basis of a statistically significant result favoring dabigatran, but that's what they do. This is the problem that is likely to crop up when you make your margin of non-inferiority excessively wide, which you are wont to do if you wish to stack the deck in favor of your therapy.

But here's the real rub. Imagine if the world were the mirror image of what it is now and dabigatran were the existing agent for prevention of stroke in A-fib, and warfarin were the new kid on the block. If the makers of warfarin had designed this trial AND GOTTEN THE EXACT SAME DATA, they would have said (look at the left of the figure and the dashed red line there) that warfarin is non-inferior to the 110 mg dose of dabigatran, but that it was not non-inferior to the 150 mg dose of dabigatran. They would NOT have claimed that dabigatran was superior to warfarin, nor that warfarin was inferior to dabigatran, because the 95% CI of the difference between warfarin and dabigatran 150 mg crosses the pre-specified margin of non-inferiority. And to claim superiority of dabigatran, the 95% CI of the difference would have to fall all the way to the left of the dashed red line on the left. (See Piaggio, JAMA, 2006.)

The claims that result from a given dataset should not depend on who designs the trial, and which way the asymmetry of interpretation goes. But as long as we allow asymmetry in the interpretation of data, they shall. Heads they win, tails we lose.

Tuesday, September 15, 2009

Plavix (clopidogrel), step aside, and prasugrel (Effient), watch your back: Ticagrelor proves that some "me-too" drugs are truly superior

Another breakthrough is reported in last week's NEJM: . Wallentin et al report the results of the PLATO trial showing that ticagrelor, a new reversible inhibitor of P2Y12, is superior to Plavix in just about every imaginable way. Moreover, when you compare the results of this trial to the trial of prasugrel (Effient, recently approved, about which I blogged here: ), it appears that ticagrelor is going to be preferable to prasugrel in at least 2 ways: 1.) a larger population can benefit (AMI versus just patients undergoing PCI); and 2.) less bleeding, which may be a result of reversible rather than irreversible inhibition of P2Y12.

I will rarely be using either of these drugs or Plavix because I rarely treat AMI or patients undergoing PCI. My interest in this trial and that of prasugrel stems from the fact that in the cases of these two agents, the sponsoring company indeed succeeded in making what is in essence a "me-too" drug that is superior to an earlier-to-market agent(s). They did not monkey around with this non-inferiority trial crap like anidulafungin and gefitinib and just about every antihypertensive that has come to market in the past 10 years; they actually took Plavix to task and beat it, legitimately. For this, and for the sheer size of the trial and its superb design, they deserve to be commended.

One take-home message here, and from other posts on this blog, is "beware the non-inferiority trial". There are a number of reasons that a company will choose to do a non-inferiority trial (NIT) rather than a superiority trial. First, as in the last post ( ), running a NIT often allows you to have your cake and eat it too - you can make it easy to claim non-inferiority (wide delta) AND make the criterion for superiority (of your agent) more lenient than the inferiority criterion, a conspicuous asymmetry that ruffles my feathers again and again. Second, you don't run the risk of people saying after the fact "that stuff doesn't work," even though absence of evidence does not constitute evidence of absence. Third, you have great latitude with delta in a NIT, and that's appealing from a sample size standpoint. Fourth, you don't actually have to have a better product, which might not even be your goal, which is rather to get market share for an essentially identical product. Fifth, maybe you can't recruit enough patients to do a superiority trial.

The ticagrelor trial recruited over 18,000 patients. You can look at this in two ways. One is that the difference they're trying to demonstrate is quite small, so what does it matter to you? (If you take this view, you should be especially dismissive of NITs, since they're not trying to show any difference at all.) The other is that if you can recruit 18,000 patients into a trial, even a multinational trial, the problem being treated must be quite prevalent, and thus the opportunity for impact from a superior treatment, even one with a small advantage, is much greater.
It is much easier and more likely, in a given period of time, to treat 50 acute MIs and save a life with ticagrelor (compared to Plavix - NNT=50=[1/0.02]) than it is to find 8 patients with invasive candidiasis and treat them with anidulafungin (compared to fluconazole; [1/0.12~8]; see Reboli et al: ), and in the latter case, you're not saving a life but rather just preventing a treatment failure. Thus, compared to anidulafungin, with its limited scope of application and limited impact, a drug like ticagrelor has much more public health impact. You should simply pay more attention to larger trials; there's more likely to be something important going on there. By inference, the conditions they are treating are likely to be a "bigger deal".
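The NNT arithmetic here is just the reciprocal of the absolute risk reduction; a minimal sketch using the approximate figures quoted above:

```python
def nnt(arr):
    # number needed to treat = 1 / absolute risk reduction
    return 1.0 / arr

# Approximate absolute risk reductions from the back-of-envelope numbers above
print(nnt(0.02))  # ticagrelor vs Plavix: treat ~50 to prevent one event
print(nnt(0.12))  # anidulafungin vs fluconazole: treat ~8 to prevent one failure
```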

Of course, perhaps I'm giving the industry too much credit in the cases of prasugrel and ticagrelor. Did they really have much of a choice? Probably not. Generally, when you do a non-inferiority trial, you try to show non-inferiority and also something like preferable dosing schedules, reduced cost or side effects. That way, when the trial is done (if you have shown non-inferiority), you can say, "yeah, they have basically the same effect on xyz, but my drug has better [side effects, dosing, etc.]". Because of the enhanced potency of prasugrel and ticagrelor, they knew there would be more bleeding and that this would cause alarm. So they needed to show improved mortality (or similar) to show that that bleeding cost is worth paying. Regardless, it is refreshing to see that the industry is indeed designing drugs with demonstrable benefits over existing agents. I am highly confident that the FDA will find ticagrelor to be approvable, and I wager that it will quickly supplant prasugrel. I also wager that when clopidogrel goes generic (soon), it will be a boon for patients who can know that they are sacrificing very little (2% efficacy compared to ticagrelor or prasugrel) for a large cost savings. For most people, this trade-off will be well worth it. For those fortunate enough to have insurance or another way of paying for ticagrelor, more power to them.
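As a coda to the list of reasons above: the sample-size latitude that a wide delta buys (reason three) is easy to quantify with the standard normal-approximation formula for comparing two proportions. This is a minimal sketch; the 5% event rate and the margins are assumed illustrative values, not taken from any of these trials:

```python
from math import ceil
from statistics import NormalDist

def ni_n_per_arm(p, margin, alpha=0.025, power=0.80):
    # Normal-approximation sample size per arm for a non-inferiority
    # comparison of two proportions, assuming equal true event rates p,
    # a margin on the risk-difference scale, and one-sided alpha.
    z = NormalDist().inv_cdf
    return 2 * p * (1 - p) * (z(1 - alpha) + z(power)) ** 2 / margin ** 2

# Assumed 5% event rate (illustrative only)
print(ceil(ni_n_per_arm(0.05, 0.02)))  # tight 2% margin: ~1,865 per arm
print(ceil(ni_n_per_arm(0.05, 0.04)))  # doubled margin: ~467 per arm
```

Doubling the margin cuts the required enrollment fourfold, which is precisely why a lenient delta is so seductive to sponsors.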

Sunday, September 6, 2009

There's no such thing as a free lunch - unless you're running a non-inferiority trial. Gefitinib for pulmonary adenocarcinoma

A 20% difference in some outcome is either clinically relevant, or it is not. If A is worse than B by 19% and that's NOT clinically relevant and significant, then A being better than B by 19% must also NOT be clinically relevant and significant. But that is not how the authors of trials such as this one see it: . According to Mok and co-conspirators, if gefitinib is no worse in regard to progression-free survival than Carboplatin-Paclitaxel based on a 95% confidence interval that does not include 20% (that is, it may be up to 19.9% worse, but no more), then they call the battle a draw and say that the two competitors are equally efficacious. However, if the trend is in the other direction, that is, in favor of gefitinib BY ANY AMOUNT HOWEVER SMALL (as long as it's statistically significant), they declare gefitinib the victor and call it a day. It is only because of widespread lack of familiarity with non-inferiority methods that they can get away with a free lunch like this. A 19% difference is either significant, or it is not. I have commented on this before, and it should come as no surprise that these trials are usually used to test proprietary agents ( ). Note also that in trials of adult critical illness, the most commonly sought mortality benefit is about 10% (more data on this forthcoming in an article soon to be submitted and hopefully published). So it's a difficult argument to sustain that something is "non-inferior" if it is less than 20% worse than something else. Designers of critical care trials will tell you that a 10% difference, often much less, is clinically significant.

I have created a figure to demonstrate the important nuances of non-inferiority trials using the gefitinib trial as an example. (I have adapted this from the Piaggio 2006 JAMA article of the CONSORT statement for the reporting of non-inferiority trials - a statement that has been largely ignored: .) The authors specified delta, or the margin of non-inferiority, to be 20%. I have already made it clear that I don't buy this, but we needn't challenge this value to make war with their conclusions, although challenging it is certainly worthwhile, even if it is not my current focus. This 20% delta corresponds to a hazard ratio of 1.2, as seen in the figure demarcated by a dashed red line on the right. If the hazard ratio (for progression or death) demonstrated by the data in the trial were 1.2, that would mean that gefitinib is 20% worse than comparator. The purpose of a non-inferiority trial is to EXCLUDE a difference as large as delta, the pre-specified margin of non-inferiority. So, to demonstrate non-inferiority, the authors must show that the 95% confidence interval for the hazard ratio falls all the way to the left of that dashed red line at HR of 1.2 on the right. They certainly achieved this goal. Their data, represented by the lowermost point estimate and 95% CI, falls entirely to the left of the pre-specified margin of non-inferiority (the right red dashed line). I have no arguments with this. Accepting ANY margin of non-inferiority (delta), gefitinib is non-inferior to the comparator. What I take exception to is the conclusion that gefitinib is SUPERIOR to comparator, a conclusion that is predicated in part on the chosen delta, to which we are beholden as we make such conclusions.

First, let's look at [hypothetical] Scenario 1. Because the chosen delta was 20% wide (and that's pretty wide - coincidentally, that's the exact width of the confidence interval of the observed data), it is entirely possible that the point estimate could have fallen as pictured for Scenario 1 with the entire CI between an HR of 1 and 1.2, the pre-specified margin of non-inferiority. This creates the highly uncomfortable situation in which the criterion for non-inferiority is fulfilled, AND the comparator is statistically significantly better than gefitinib!!! This could have happened! And it's more likely to happen the larger you make delta. The lesson here is that the wider you make delta, the more dubious your conclusions are. Deltas of 20% in a non-inferiority trial are ludicrous.

Now let's look at Scenarios 2 and 3. In these hypothetical scenarios, comparator is again statistically significantly better than gefitinib, but now we cannot claim non-inferiority because the upper CI falls to the right of delta (red dashed line on the right). But because our 95% confidence interval includes values of HR less than 1.2 and our delta of 20% implies (or rather states) that we consider differences of less than 20% to be clinically irrelevant, we cannot technically claim superiority of comparator over gefitinib either. The result is dubious. While there is a statistically significant difference in the point estimate, the 95% CI contains clinically irrelevant values and we are left in limbo, groping for a situation like Scenario 4, in which comparator is clearly superior to gefitinib, and the 95% CI lies all the way to the right of the HR of 1.2.

Pretend you're in Organic Chemistry again, and visualize the mirror image (enantiomer) of Scenario 4. That is what is required to show superiority of gefitinib over comparator - a point estimate for the HR whose 95% CI does not include negative delta (-20%, an HR of 0.8). The actual results come close to Scenario 5, but not quite, and therefore the authors are NOT justified in claiming superiority. To do so is to try to have a free lunch, to have their cake and eat it too.
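Under the symmetric reading argued for here (and in Piaggio, JAMA, 2006), the verdict follows mechanically from where the CI falls relative to delta and its mirror image. A minimal sketch; the HR of 0.74 with CI 0.65-0.85 is an illustrative approximation of the trial's progression-free survival result, not the exact figure:

```python
def classify_hr_ci(lo, hi, upper_margin=1.2, lower_margin=0.8):
    # Symmetric reading of a hazard-ratio 95% CI (new drug vs comparator).
    # upper_margin is the pre-specified non-inferiority delta (HR 1.2 here);
    # lower_margin is its mirror image (HR 0.8), which a symmetric claim of
    # superiority for the new drug must exclude entirely.
    if hi < lower_margin:
        return "new drug superior"      # whole CI beyond the mirrored delta
    if lo > upper_margin:
        return "comparator superior"    # whole CI beyond delta (Scenario 4)
    if hi < upper_margin:
        return "new drug non-inferior"  # CI excludes delta
    return "inconclusive"               # CI straddles delta (Scenarios 2-3)

print(classify_hr_ci(0.65, 0.85))  # non-inferior, but NOT superior by the symmetric rule
print(classify_hr_ci(1.02, 1.18))  # Scenario 1: "non-inferior" despite favoring comparator
```

The first call is the crux: an upper CI limit of 0.85 does not exclude 0.8, so the symmetric rule yields non-inferiority only, not the superiority the authors claimed.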

You see, the larger you make delta, the easier it is to achieve non-inferiority. But the more likely it is also that you might find a statistically significant difference favoring comparator rather than the preferred drug which creates a serious conundrum and paradox for you. At the very least, if you're going to make delta large, you should be bound by your honor and your allegiance to logic and science to make damned sure that to claim superiority, your 95% confidence interval must not include negative delta. If not, shame on you. Eat your free lunch if you will, but know that the ireful brow of logic and reason is bent unfavorably upon you.

Saturday, September 5, 2009

Troponin I, Troponin T, Troponin is the Woe of Me

As a critical care physician, I have not infrequently been called to the emergency department to admit a patient on the basis of "abnormal laboratory tests" with no synthesis, no assimilation of the various results into any semblance of a unifying diagnosis. It is bad enough that patients' chests are no longer auscultated, respiratory rates and patterns not noted, neck veins not examined, etc. It is worse that the portable chest film (often incorrectly interpreted), the arterial blood gas (also often incorrectly interpreted), and the BNP level have supplanted any sort of logical and systematic approach to diagnosing a patient's problem. If we are going to replace physical examination with BNPs and d-dimers, we should at least insist that practitioners have one iota of familiarity with Bayes' Theorem and pre-test probabilities and the proper interpretation of test results.

Thus I raised at least one brow slightly on August 27th when the NEJM reported two studies of highly sensitive troponin assays for the "early diagnosis of myocardial infarction" (wasn't troponin sensitive enough already? see: and ). Without commenting on the studies' methodological quality specifically, I will emphasize some pitfalls and caveats related to the adoption of this "advance" in clinical practice, especially that outside of the setting of an appropriately aged person with risk factors who presents to an acute care setting with SYMPTOMS SUGGESTIVE OF MYOCARDIAL INFARCTION.

In such a patient, say a 59 year old male with hypertension, diabetes and a family history of coronary artery disease, who presents to the ED with chest pain, we (and our cardiology colleagues) are justified in having a high degree of confidence in the results of this test based on these and a decade or more of other data. But I suspect that only the MINORITY of cardiac troponin tests at my institution are ordered for that kind of indication. Rather, it is used as a screening test for just about any patient presenting to the ED who is ill enough to warrant admission. And that's where the problem has its roots. Our confidence in the diagnostic accuracy of this test in the APPROPRIATE SETTING (read appropriate clinical pre-test probability) should not extend to other scenarios, but all too often it does, and it makes a real conundrum when it is positive in those other scenarios. Here's why.

Suppose that we have a pregnancy test that is evaluated in women who have had a sexual encounter and who have missed two menstrual periods, and it is found to be 99.9% sensitive and 99.9% specific. (I will bracket for now the possibility that you could have a 100% sensitive and/or specific test.) Now suppose that you administer this test to 10,000 MEN. Does a positive test mean that a man is pregnant? Heavens no! He probably has testicular cancer or some other malady. This somewhat silly example is actually quite useful to reinforce the principle that no matter how good a test is, if it is not used in the appropriate scenario, the results are likely to be misleading. Likewise, consider this test's use in a woman who has not missed a menstrual cycle - does a negative test mean that she is not pregnant? Perhaps not, since the sensitivity was determined in a population that had missed two cycles. If a woman were obviously 24 weeks pregnant and the test was negative, what would we think? It is important to bear in mind that these tests are NOT direct tests for the conditions we seek to diagnose, but are tests of ASSOCIATED biological phenomena, and insomuch as our understanding of those phenomena is limited or there is variation in them, the tests are liable to be fallible. A negative test in a woman with a fetus in utero may mean that the sample was mishandled, that the testing reagents were expired, that there is an interfering antibody, etc. Tests are not perfect, and indeed are highly prone to be misleading if not used in the appropriate clinical scenario.
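The arithmetic behind this is just Bayes' theorem: the positive predictive value collapses as the pre-test probability falls, no matter how good the test. A minimal sketch with the hypothetical 99.9%/99.9% test above:

```python
def ppv(sens, spec, prev):
    # positive predictive value via Bayes' theorem
    true_pos = sens * prev
    false_pos = (1 - spec) * (1 - prev)
    return true_pos / (true_pos + false_pos)

print(ppv(0.999, 0.999, 0.5))    # high pre-test probability: a positive is ~99.9% reliable
print(ppv(0.999, 0.999, 0.001))  # 1-in-1,000 pre-test probability: a coin flip (~50%)
print(ppv(0.999, 0.999, 0.0))    # in men, every positive is a false positive
```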

And thus we return to cardiac troponins. In the patients I'm called to admit to the ICU who have sepsis, PE, COPD, pneumonia, respiratory failure, renal failure, or metabolic acidosis, a mildly positive troponin - a COMMON occurrence - is almost ALWAYS an epiphenomenon of critical illness rather than an acute myocardial infarction. Moreover, the pursuit of diagnosis via cardiac catheterization, or empiric treatment with antiplatelet agents and anticoagulants, is almost always a therapeutic misadventure in these patients, who are at much greater risk of bleeding and renal failure from these interventions, which are expected to have a much reduced positive utility for them. More often than not, I would just rather not know the results of a troponin test outside the setting of isolated acute chest pain. Other practitioners should be acutely aware of the patient populations in which these tests are performed, and the significant limitations of using these highly sensitive tests in other clinical scenarios.