Canadian Medical Association Journal 1995; 152: 169-173

Paper reprints of the full text may be obtained from: Dr. Gordon Guyatt, Rm. 2C12, McMaster University Health Sciences Centre, 1200 Main St. W, Hamilton ON L8N 3Z5.

- Abstract
- Résumé
- Introduction
- Interpreting study results: how should we treat heart failure?
- Solving the problem: confidence intervals
- Using confidence intervals to interpret study results
- Interpreting trials that appear to be "negative"
- Interpreting trials that appear to be "positive"
- Was the trial large enough?
- Conclusions
- References

Introduction

In our first article in this series we explained hypothesis testing, which involves estimating the likelihood that observed results of an experiment would have occurred by chance if a null hypothesis — that there was no difference between the effects of a treatment and a control condition — were true. The limitations of hypothesis testing have been increasingly recognized, and an alternative approach, called estimation, is becoming more popular. Several authors [1-5] have outlined the concepts that we will introduce in this article, and their discussions may be read to supplement our explanation.

An example from our first article illustrates the limitations of the hypothesis-testing approach. In the results of this trial, the decision to reject the null hypothesis rests on the analysis one prefers.

Interpreting study results: how should we treat heart failure?

If one considered this study an exercise in hypothesis testing and adopted the usual threshold for
Type I error of *p* = 0.05, one would conclude that chance is an adequate explanation for
the study results. One would classify this as a "negative" study, i.e., a study showing no important
difference between treatment and control groups. However, the investigators also used their data
to conduct a "survival analysis," which is generally more sensitive than a test of the difference in
proportions. The *p* value for mortality obtained from the survival analysis was 0.08, a
result that leads to the same conclusion as the simpler χ² test. However, the authors also reported
that the *p* value associated with differences in mortality after 2 years ("a point
predetermined to be a major end point of the trial") was 0.016.

The reader could be excused for experiencing a little confusion. Do these results mean that this is a "positive" study supporting the use of an angiotensin-converting-enzyme (ACE) inhibitor (enalapril) rather than the combination of hydralazine and nitrates or a "negative" study leaving open the choice of drug treatments?
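The "test of the difference in proportions" referred to above is a χ² test on a 2 × 2 table of deaths and survivors in each group. A minimal sketch follows; the counts are illustrative only, not the trial's actual data, and it uses the fact that for 1 degree of freedom the upper-tail probability of the χ² distribution equals erfc(√(χ²/2)).

```python
import math

def chi_square_2x2(a, b, c, d):
    """Chi-squared test (1 df, no continuity correction) for a 2x2 table:
    a, b = deaths and survivors in group 1; c, d = in group 2."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 df, the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

# Hypothetical counts, not the trial's data:
chi2, p = chi_square_2x2(33, 170, 49, 151)
print(f"chi2 = {chi2:.2f}, p = {p:.3f}")
```

With identical proportions in the two groups the statistic is 0 and the *p* value is 1, as expected.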

Solving the problem: confidence intervals

Suppose that we have a coin that may or may not be biased. That is, the true probability of heads on any toss of the coin may be 0.5, but it may also be as high as 1.0 in favour of heads (every toss will yield heads) or in favour of tails (every toss will yield tails). We conduct an experiment to determine the true nature of the coin.

We begin by tossing the coin twice, and we observe one head and one tail. At this point, our best estimate of the probability of heads on any given coin toss is the value we have obtained (known as the "point estimate"), which is 0.5 in this case. But what is the plausible range within which the true probability of finding a head on any individual coin toss may lie? This range is very wide, and on the basis of this experiment most people would think that the probability may be as high or higher than 0.9, or as low or lower than 0.1. In other words, if the true probability of heads on any given coin toss is 0.9, it would not be surprising if, in any sample of two coin tosses, one were heads and one tails. So, after two coin tosses we are not much further ahead in determining the true nature of the coin.

We proceed with another eight coin tosses; after a total of 10, we have observed five heads and five tails. Our best estimate of the true probability of heads on any given coin toss remains 0.5, the point estimate. The range within which the true probability of heads may plausibly lie has, however, narrowed. It is no longer plausible that the true probability of heads is as great as 0.9; with such a high probability, it would be very unlikely that one would observe 5 tails in a sample of 10 coin tosses. People's sense of the range of plausible probabilities may differ, but most would agree that a probability greater than 0.8 or less than 0.2 is very unlikely.
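The intuition that a heavily biased coin would very rarely yield an even split can be checked directly with the binomial distribution; a short sketch:

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k heads in n tosses when P(heads) = p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# If the coin truly lands heads 90% of the time, 5 heads in 10 tosses
# is very unlikely:
print(binomial_pmf(5, 10, 0.9))  # about 0.0015

# But one head and one tail in only two tosses is unremarkable even
# for that heavily biased coin:
print(binomial_pmf(1, 2, 0.9))   # about 0.18
```

This is why two tosses tell us almost nothing about the coin, while 10 tosses already make extreme bias implausible.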

On the basis of 10 coin tosses, it is clear that values between 0.2 and 0.8 are not all equally plausible. The most likely value of the true probability is the point estimate, 0.5, but probabilities close to that point estimate (0.4 or 0.6, for instance) are also likely. The further the value from the point estimate, the less likely it represents the truth.

Ten tosses have still left us with considerable uncertainty about our coin, and so we conduct another 40 repetitions. After 50 coin tosses, we have observed 25 heads and 25 tails, and our point estimate remains 0.5. We now believe that the coin is very unlikely to be extremely biased, and our estimate of the range of probabilities that is reasonably consistent with 25 heads in 50 coin tosses is 0.35 to 0.65. This is still a wide range, and we may persist with another 50 repetitions. If after 100 tosses we had observed 50 heads we might guess that the true probability is unlikely to be more extreme than 0.40 or 0.60. If we were willing to endure the tedium of 1000 coin tosses, and we observed 500 heads, we would be very confident (but still not certain) that our coin is minimally, if at all, biased.

In this experiment we have used common sense to generate confidence intervals around an observed proportion (0.5). In each case, the confidence interval represents the range within which the truth plausibly lies. The smaller the sample, the wider the confidence interval. As the sample becomes larger, we are increasingly certain that the truth is not far from the point estimate we have observed from our experiment.

Since people's "common-sense" estimates of the plausible range differ considerably, we can turn to statistical techniques for precise estimation of confidence intervals. To use these techniques we must be more specific about what we mean by "plausible." In our coin-toss example, we could ask: What is the range of probabilities within which, 95% of the time, the true probability would lie? The actual 95% confidence intervals around the observed proportion of 0.5 for our coin-toss experiment are given in Table 1. If we do not need to be so certain, we could instead ask about the range within which the true value would lie 90% of the time. This 90% confidence interval, also presented in Table 1, is somewhat narrower.

Table 1: Confidence intervals around a proportion of 0.5 in a coin-toss experiment

| Number of coin tosses | Observed result | 95% confidence interval | 90% confidence interval |
|---|---|---|---|
| 2 | 1 head, 1 tail | 0.01 to 0.99 | 0.03 to 0.98 |
| 10 | 5 heads, 5 tails | 0.19 to 0.81 | 0.22 to 0.78 |
| 50 | 25 heads, 25 tails | 0.36 to 0.65 | 0.38 to 0.62 |
| 100 | 50 heads, 50 tails | 0.40 to 0.60 | 0.41 to 0.59 |
| 1000 | 500 heads, 500 tails | 0.47 to 0.53 | 0.47 to 0.53 |

The coin-toss example also illustrates how the confidence interval tells us whether the sample is large enough to answer the research question. If you wanted to be reasonably sure that any bias in the coin is no greater than 10% (that is, that the confidence interval lies within 10% of the point estimate), you would need approximately 100 coin tosses. If you needed greater precision (within 3% of the point estimate), 1000 coin tosses would be required. To obtain greater precision, all you must do is make more measurements. In clinical research, this involves enrolling more subjects or increasing the number of measurements in each subject enrolled. (But take care: increasing precision by enlarging the sample or increasing the number of measurements does not compensate for poor study design [7-9].)
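For moderate sample sizes and proportions near 0.5, these intervals can be approximated with the normal-approximation (Wald) formula p ± z·√(p(1 − p)/n); a minimal sketch (exact methods such as Clopper-Pearson are needed for very small samples, like the two-toss case):

```python
import math

def wald_ci(successes, n, z=1.96):
    """Normal-approximation (Wald) confidence interval for a proportion.

    z = 1.96 gives a 95% interval, z = 1.645 a 90% interval.
    Adequate for moderate n and p near 0.5; use an exact method
    (e.g. Clopper-Pearson) for very small samples such as n = 2.
    """
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)

# Approximately reproduce the larger rows of the coin-toss table:
for n in (10, 50, 100, 1000):
    lo, hi = wald_ci(n // 2, n)
    print(f"{n:5d} tosses: 95% CI {lo:.2f} to {hi:.2f}")
```

Note how the interval narrows roughly with the square root of the sample size: quadrupling the number of tosses halves the width.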

Using confidence intervals to interpret study results

How can we now interpret the study results? The most likely value for the difference in mortality between the two vasodilator regimens is 5%, but the true difference may be up to 1.2% in favour of hydralazine and nitrates or up to 12% in favour of the ACE inhibitor. Values farther from 5% are less and less probable. We can conclude that patients offered ACE inhibitors most likely (but not certainly) will die later than patients offered hydralazine and nitrates; however, the magnitude of the difference in expected survival may be trivial or large. This way of understanding the results avoids the yes-no dichotomy that results from hypothesis testing, the expenditure of time and energy to evaluate the legitimacy of the authors' end point of mortality after 2 years, and consideration of whether the study is "positive" or "negative" on the basis of the results. One can conclude that, all else being equal, an ACE inhibitor is the appropriate choice for patients with heart failure, but that the strength of this inference is weak. The toxic effects and cost of the drugs, and evidence from other studies, would all bear on the treatment decision. Since several large randomized trials have now shown that a benefit is gained from the use of ACE inhibitors in patients with heart failure [10,11], one can confidently recommend this class of agents as the treatment of choice.

Interpreting trials that appear to be "negative"

The results of this large trial, involving more than 500 patients, may appear to exclude any possible benefit from ASA. However, the 95% confidence interval around the point estimate of 2% in favour of placebo extends from 7% in favour of placebo to 3% in favour of ASA. If, in fact, taking ASA would have spared 3% of patients a stroke, one would certainly want to administer the drug: by treating 33 patients, one stroke could be prevented. Thus, one can conclude that the Swedish study did not exclude a clinically important benefit and, in that sense, did not have a large enough sample.
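The arithmetic behind "treating 33 patients to prevent one stroke" is the number needed to treat, the reciprocal of the absolute risk reduction; a minimal sketch:

```python
def number_needed_to_treat(absolute_risk_reduction):
    """Patients who must be treated to prevent one adverse event:
    the reciprocal of the absolute risk reduction."""
    if absolute_risk_reduction <= 0:
        raise ValueError("treatment must reduce absolute risk")
    return 1.0 / absolute_risk_reduction

# A 3% absolute reduction in stroke risk, the boundary of the
# confidence interval that favours ASA:
print(round(number_needed_to_treat(0.03)))  # 33
```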

As this example emphasizes, many subjects are needed in order to generate precise estimates of treatment effects; this is why clinicians are turning more and more to rigorous meta-analyses that pool data from the most valid studies [14]. In the case of ASA prophylaxis for recurrent stroke, such a meta-analysis showed that antiplatelet agents given to patients with a previous transient ischemic attack (TIA) or stroke reduced the risk of a subsequent TIA or stroke by approximately 25% (confidence interval approximately 19% to 31%). This benefit is great enough that most clinicians will want to treat such patients with ASA [15].

This example also illustrates that, when one sees the results of an apparently "negative" trial (one that, in a hypothesis-testing framework, would fail to exclude the null hypothesis), one should pay particular attention to the upper end of the confidence interval, that is, the end that suggests the largest benefit from treatment. If even the smallest benefit of clinical importance lies above the upper boundary of the confidence interval, the trial is definitively negative. In contrast, if clinically important benefits fall within the confidence interval, the trial has not ruled out the possibility that the treatment is worthwhile.


Was the trial large enough?

Consider four hypothetical trials, A, B, C and D, each represented as a distribution of the possible true differences in mortality between treatment and control groups. The highest point of each distribution represents the result actually observed (the point estimate). In trials A and B the investigators observed that mortality was 5% lower in the experimental group than in the control group. In trials C and D they observed that mortality was 1% higher in the experimental group than in the control group.

The distributions of the likelihood of possible true results of each trial are based on the point estimate and the size of the sample. The point estimate is the single value that is most likely to represent the true effect. As you can see, values farther from the results observed are less likely than values closer to the point estimate to represent the true difference in mortality.

Now, suppose we assume that an absolute reduction in mortality of greater than 1% means that treatment is warranted (that is, such a result is clinically important), and a reduction of less than 1% means that treatment is not warranted (that is, the result is trivial). For example, if the experimental treatment results in a true reduction in mortality from 5% to 4% or less, we would want to use the treatment. If, on the other hand, the true reduction in mortality were from 5% to 4.5%, we would consider the benefit of the experimental treatment not to be worth the associated toxic effects and cost. What are the implications of this decision for the interpretation of the results of the four studies?

In trial A the entire distribution and, hence, the entire 95% confidence interval lies above the threshold risk reduction of 1%. We can therefore be confident that the true treatment effect is above our threshold, and we have a definitive "positive" trial. That is, we can be very confident that the true reduction in risk is greater — probably appreciably greater — than 1%; this leaves little doubt that we should administer the treatment to our patients. The sample size in this trial was adequate to show that the treatment provides a clinically important benefit.

Trial B has the same point estimate of treatment effect as trial A (5%) and is also "positive"
(*p* < 0.05). In a hypothesis test, the null hypothesis would be rejected. However, more
than 2.5% of the distribution is to the left of the 1% threshold. In other words, the 95%
confidence interval includes values less than 1%. This means that the data are consistent with an
absolute risk reduction of less than 1%, so we are left with some doubt that the treatment effect is
really greater than our threshold. This trial is still "positive," but its results are not definitive. The
sample in this trial was inadequate to establish definitively the appropriateness of administering the
experimental treatment.

Trial C is "negative"; its results would not lead to the rejection of the null hypothesis in a hypothesis test. The investigators observed mortality 1% higher in the treatment group than in the control group. The entire distribution and, therefore, the 95% confidence interval lie to the left of our 1% threshold. Because the upper limit of the distribution is at our 1% threshold, we can be very confident that, if there is a positive effect, it is trivial. The trial has excluded any clinically important benefit of treatment, and it can be considered definitive. We can therefore decide against the use of the experimental treatment, at least for this type of patient.

The result of trial D shows the same difference in absolute risk as that of trial C: mortality 1% higher in the experimental group than in the control group. However, trial D had a smaller sample and, as a result, a much wider distribution and confidence interval. Since an appreciable portion of the confidence interval lies to the right of our 1% threshold, it is plausible (although unlikely) that the true effect of the experimental treatment is a reduction in mortality of greater than 1%. Although we would refrain from using this treatment (indeed, the most likely conclusion is that it kills people), we cannot completely dismiss it. Trial D was not definitive, and a trial involving more patients is required to exclude a clinically important treatment effect.
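The reasoning applied to trials A through D amounts to comparing the confidence interval with the threshold of clinical importance; a schematic sketch (the interval values are illustrative, not the trials' data):

```python
def interpret_trial(ci_lower, ci_upper, mcid=0.01):
    """Classify a trial from the 95% confidence interval for the
    absolute risk reduction (positive values favour treatment),
    given a minimal clinically important difference (mcid)."""
    if ci_lower >= mcid:
        return "definitive positive"          # whole CI above threshold (trial A)
    if ci_upper <= mcid:
        return "definitive negative"          # important benefit excluded (trial C)
    if ci_lower > 0:
        return "positive but not definitive"  # benefit likely, size uncertain (trial B)
    return "not definitive"                   # cannot exclude important benefit (trial D)

# Illustrative intervals for the four trials:
print(interpret_trial(0.03, 0.07))    # definitive positive
print(interpret_trial(0.002, 0.098))  # positive but not definitive
print(interpret_trial(-0.03, 0.008))  # definitive negative
print(interpret_trial(-0.06, 0.04))   # not definitive
```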

Conclusions

In this discussion we have examined absolute differences in proportions of patients who died while receiving two different treatments. In the next article in this series, we will explain how to interpret other ways investigators present treatment effects, including odds ratios and relative risk.

References

1. Simon R: Confidence intervals for reporting results of clinical trials. *Ann Intern Med* 1986; 105: 429-435
2. Gardner MJ, Altman DG (eds): *Statistics with Confidence: Confidence Intervals and Statistical Guidelines*, British Medical Journal, London, England, 1989
3. Bulpitt CJ: Confidence intervals. *Lancet* 1987; 1: 494-497
4. Pocock SJ, Hughes MD: Estimation issues in clinical trials and overviews. *Stat Med* 1990; 9: 657-671
5. Braitman LE: Confidence intervals assess both clinical significance and statistical significance. *Ann Intern Med* 1991; 114: 515-517
6. Cohn JN, Johnson G, Ziesche S et al: A comparison of enalapril with hydralazine-isosorbide dinitrate in the treatment of chronic congestive heart failure. *N Engl J Med* 1991; 325: 303-310
7. Oxman AD, Sackett DL, Guyatt GH and the Evidence-Based Medicine Working Group: Users' guides to the medical literature: I. How to get started. *JAMA* 1993; 270: 2093-2095
8. Guyatt GH, Sackett DL, Cook DJ and the Evidence-Based Medicine Working Group: Users' guides to the medical literature: II. How to use an article about therapy or prevention. A. Are the results of the study valid? *JAMA* 1993; 270: 2598-2601
9. Guyatt GH, Sackett DL, Cook DJ and the Evidence-Based Medicine Working Group: Users' guides to the medical literature: II. How to use an article about therapy or prevention. B. What were the results and will they help me in caring for my patients? *JAMA* 1994; 271: 59-63
10. Mulrow CD, Mulrow JP, Linn WD et al: Relative efficacy of vasodilator therapy in chronic congestive heart failure. *JAMA* 1988; 259: 3422-3426
11. The SOLVD Investigators: Effect of enalapril on survival in patients with reduced left ventricular ejection fractions and congestive heart failure. *N Engl J Med* 1991; 325: 293-302
12. Sackett DL, Haynes RB, Guyatt GH et al: *Clinical Epidemiology, a Basic Science for Clinical Medicine*, Little, Brown and Company, Boston, 1991: 218-220
13. Britton M, Helmers C, Samuelsson K: High-dose salicylic acid after cerebral infarction: a Swedish co-operative study. *Stroke* 1987; 18: 325
14. Oxman AD, Cook DJ, Guyatt GH and the Evidence-Based Medicine Working Group: Users' guides to the medical literature: VI. How to use an overview. *JAMA* 1994; 272: 1367-1371
15. Antiplatelet Trialists' Collaboration: Secondary prevention of vascular disease by prolonged antiplatelet treatment. *BMJ* 1988; 296: 320-331