
Understanding Effect Size: How It's Measured and What It Means


Stephen V. Faraone, PhD
Medscape Psychiatry & Mental Health. 2008; ©2008 Medscape

Posted 02/14/2008




In any branch of biomedical science, it is imperative that research data be translated into results that are easily interpretable. Although the results of inferential statistical analyses provide crucial information on the reliability of a result (eg, the difference between 2 groups, treatments, or conditions), values such as the F-statistic, the correlation coefficient, the chi-square statistic, or the P value convey little information about the size of an observed effect. As such, it is impossible to determine from such statistics how, for example, a novel treatment evaluated in 1 study compares in relative terms with the efficacy of other established or emerging treatments for the same condition. The need for "translatable" evidence from research studies is most acute in the evaluation of clinical trials, because these trials hold the potential to immediately influence the prescribing habits of practitioners and the course of treatment for patients. When evaluating such studies, the examination of "effect sizes" can be a useful adjunct to traditional hypothesis testing with inferential statistics.


Advocates of evidence-based medicine urge physicians to base treatment choices on the best evidence from systematic research about the efficacy and adverse effects of treatment alternatives. Ideally, physicians could compare different medications or procedures by referring to randomized, double-blind, placebo-controlled studies that compared the treatments directly. Although individual treatments may be well researched, such comparative studies are rare in many areas of medicine. When direct comparative trials are not available, the best evidence comes from comparing randomized, double-blind, placebo-controlled studies of each treatment. This article reviews methods for comparing treatments across studies and provides examples of how they can be used to make treatment decisions.


Achieving Clinically Meaningful Comparisons Between Disparate Studies


One cannot compare 2 treatments based on their degree of statistical significance when each is compared with placebo. This is because the level of statistical significance is influenced by both the magnitude of the treatment effect and the size of the study. For example, consider 2 studies testing different medications for the same disorder. Both studies found that 70% of patients improved with drug compared with 30% with placebo, so they clearly show the same clinical effect. Study 1 tested 50 patients and 50 controls, whereas study 2 tested 25 patients and 25 controls. Because of this difference in sample size, the result of study 1 was far more statistically significant (P = .00006) than that of study 2 (P = .005). It would be a mistake to conclude that, because the results of study 1 reached a greater level of statistical significance than those of study 2, the first medication is more efficacious.
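
To make this concrete, the following short Python sketch (not part of the original studies; the counts are rounded to whole patients, so the P values only approximate the figures quoted above) reproduces the logic of the example with a chi-square test:

# Chi-square test on each hypothetical 2x2 table of improved vs not improved.
from scipy.stats import chi2_contingency

# Rows: drug, placebo; columns: improved, not improved.
study1 = [[35, 15], [15, 35]]   # 50 patients per arm: 70% vs 30% improved
study2 = [[18, 7], [7, 18]]     # 25 patients per arm: roughly 70% vs 30% improved

for name, table in (("Study 1", study1), ("Study 2", study2)):
    chi2, p, _, _ = chi2_contingency(table, correction=False)
    print(f"{name}: chi-square = {chi2:.2f}, P = {p:.5f}")

# The clinical effect is identical in both studies; only the sample size
# differs, yet study 1 reaches a far smaller P value than study 2.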


This interpretative problem with statistical significance is solved by the concept of effect size, which was developed to allow clinically meaningful comparisons of efficacy between treatment trials. Without using this concept, comparing 2 treatment trials can be difficult. As the name suggests, an effect-size estimate can place an easily interpretable value on the direction and magnitude of an effect of a treatment, a difference between 2 treatment groups, or any other numerical comparison or contrast.


Consider this example: If 1 study measured the efficacy of back pain treatment using a rating scale for pain intensity and another used the dose of pain medication taken, we cannot compare the results because a 1-point reduction in pain intensity is not the same as a 1-point reduction in medication dose. Even if 2 studies use the same measure, we cannot simply compare mean change scores between drug and placebo because these studies may differ in their precision of measurement. We should have more faith in precise measures than imprecise ones. These problems of differing scales of measurement and differing precision of measurement make it difficult to compare treatment studies. Fortunately, these problems are overcome by the computation of effect size, which provides the difference in improvement between drug and placebo adjusted for the scale and accuracy of the measurements used in each study.


The Effect-Size Equations


Several measures of effect size are in current use.[1] Two of the most common are the standardized mean difference (SMD), which is used for continuous measures such as a pain intensity rating scale, and the number needed to treat, which is used for binary outcomes such as responder vs nonresponder. We compute the SMD as the difference in improvement between drug and placebo divided by the standard deviation of the measurements:


SMD = (Drug Improvement - Placebo Improvement) / Standard Deviation


The numerator of the formula makes clear that the SMD accounts for the observed drug vs placebo difference. Using this numerator, an SMD of 0 means that drug and placebo have equivalent effects, SMDs greater than 0 indicate the degree to which a given drug is more efficacious than placebo, and SMDs less than 0 indicate the degree to which a given drug is worse than placebo. The denominator shows that the scale of measurement (as indexed by the standard deviation) is equally important. Although the inclusion of the drug vs placebo difference requires no explanation, the reason for including the standard deviation is less obvious. The standard deviation adjusts the drug vs placebo difference for the scale and precision of measurement. Scale is important because different studies will use different types of measures. For example, consider 2 studies that each used a pain intensity rating scale. In 1 study, the scores ranged from 0 to 100; in the other, they ranged from 0 to 10. Intuitively, a 5-point change on the first scale is less impressive than a 5-point change on the second. The standard deviation adjusts for this type of scale problem. Regarding precision, consider the example of 2 studies that use the same scale ranging from 0 to 100. Now assume that study 1 trains its raters very well, which results in very precise measurement, and that study 2 does a poor job of training. Under these conditions, ratings from the first study will be less variable (ie, use a smaller range of numbers) than ratings from the second study. The standard deviation corrects for such problems when computing the SMD measure of effect size.
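
As an illustration, the following Python sketch computes the SMD from two sets of hypothetical improvement scores. The formula above does not specify which standard deviation to use; a pooled standard deviation, one common choice, is assumed here:

import numpy as np

def standardized_mean_difference(drug_improvement, placebo_improvement):
    # SMD = (mean drug improvement - mean placebo improvement) / pooled SD
    drug = np.asarray(drug_improvement, dtype=float)
    placebo = np.asarray(placebo_improvement, dtype=float)
    n1, n2 = len(drug), len(placebo)
    pooled_sd = np.sqrt(((n1 - 1) * drug.var(ddof=1) +
                         (n2 - 1) * placebo.var(ddof=1)) / (n1 + n2 - 2))
    return (drug.mean() - placebo.mean()) / pooled_sd

# Hypothetical improvements on a 0-to-100 pain intensity scale.
drug_scores = [55, 30, 45, 20, 50, 40]
placebo_scores = [35, 20, 45, 15, 40, 25]
print(round(standardized_mean_difference(drug_scores, placebo_scores), 2))  # prints 0.8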


Interpreting the Standardized Mean Difference


Because the job of the SMD is to compare results across different studies using different measures, it cannot be interpreted as a difference in rating scale points or percent improvement. How then are we to interpret SMDs? One approach is to use the widely accepted guidelines of Cohen.[2] For SMDs, he defined 0.2 as small, 0.5 as medium, and 0.8 as large. He gives the mean height difference between 15- and 16-year-old girls (about half an inch) as an example of a small effect size. The height difference between 14- and 18-year-old girls (about 1 inch) is his example of a medium effect size, and the height difference between 13- and 18-year-old girls (about 1 and a half inches) is a large effect size. Another example can be made from differences in intelligence as measured by the Wechsler IQ scales, which have a standard deviation of 15 points. If a medication increased IQ by 3 points, we would say it had a small effect. We would call an increase of 7.5 IQ points a medium effect, and an increase of 12 IQ points a large effect. It is easy to compute the IQ point equivalent of any SMD: simply multiply the SMD by 15.
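
For readers who find a small utility helpful, the following hypothetical Python helper simply applies Cohen's benchmarks and the multiply-by-15 IQ conversion described above; the "negligible" label for SMDs below 0.2 is an assumption added for illustration, not part of Cohen's guidelines:

def describe_smd(smd):
    # Cohen's benchmarks: 0.2 small, 0.5 medium, 0.8 large.
    iq_points = round(smd * 15, 1)  # the Wechsler IQ scales have SD = 15
    if abs(smd) >= 0.8:
        label = "large"
    elif abs(smd) >= 0.5:
        label = "medium"
    elif abs(smd) >= 0.2:
        label = "small"
    else:
        label = "negligible"  # illustrative label, not one of Cohen's
    return label, iq_points

print(describe_smd(0.5))  # ('medium', 7.5), matching the example above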


The SMD is not a perfect measure of effect size (such a measure does not exist!). Although it provides an excellent method of computing a standard measure of effect across different studies, it cannot compare studies that express outcomes as a binary variable such as responder vs nonresponder or improved vs not improved. One intuitive solution to this problem is to use the percent of patients who respond or improve in the treated group as a comparative index of efficacy. In this case intuition is wrong. Although easy to understand, percent improvement statistics cannot be meaningfully interpreted without reference to the percent improvement observed in a placebo group.


One statistic used to deal with this problem is the Number Needed to Treat (NNT), which is the number of patients who need to be treated to prevent 1 bad outcome. It is computed as follows:


Number Needed to Treat = 100 / (Percent Improved on Drug - Percent Improved on Placebo)


When the formula gives a noninteger number, we round it to the next higher number. The number computed from the formula is the number of patients a physician would need to treat to be assured that the treatment has led to 1 positive outcome that would not be seen with a placebo. For example, if the NNT for a treatment is 10, we would need to treat 10 patients before achieving a positive response that we could attribute to the treatment. Ideally, the NNT would be 1, which would mean that every patient treated benefits from the medication. But because of nonresponse in some treated patients and the presence of placebo responses, the NNT usually exceeds 1.
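
A minimal Python sketch of this calculation, using the hypothetical 70% vs 30% improvement rates from the earlier example, might look like this:

import math

def number_needed_to_treat(percent_improved_drug, percent_improved_placebo):
    # NNT = 100 / (percent improved on drug - percent improved on placebo),
    # rounded up to the next whole number of patients.
    return math.ceil(100 / (percent_improved_drug - percent_improved_placebo))

print(number_needed_to_treat(70, 30))  # 100 / 40 = 2.5, rounded up to 3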


The literature on treatments for chronic low back pain (CLBP) provides a useful example of how effect size can be used to compare treatments. In a review of treatments that sought to educate patients about lifestyle changes to prevent CLBP, Di Fabio and colleagues examined 19 randomized controlled trials.[3] They found that training patients in a "back school" had little effect (SMD = 0.07) on the improvement of pain. Such programs were more useful when they were embedded in comprehensive rehabilitation programs, which produced a larger (but still small) SMD of 0.28. The efficacy of back school was better for inpatients (SMD = 0.32) than for outpatients, for whom an SMD of 0.01 shows that the treatment is essentially useless.


Manheimer and colleagues[4] examined 33 randomized controlled trials of the effectiveness of acupuncture for CLBP. Acupuncture was significantly more effective than sham treatment (SMD = 0.54) and than no-treatment controls (SMD = 0.69). Another commonly evaluated treatment for CLBP is exercise. In a review of exercise studies, McLain and colleagues[5] found a moderate-sized effect (SMD = 0.48) of exercise on CLBP. In an updated review of that literature, Kool and colleagues[6] found an SMD about half as large as previously observed (SMD = 0.24). Pharmacologic treatments for CLBP have been evaluated extensively, and at least 2 reviews of such studies have provided estimates of the effect size attributable to such treatments. The first evaluated the efficacy of the muscle relaxant cyclobenzaprine on 5 domains of pain.[7] Reductions in pain were highly reliable, but the magnitude of these reductions was only moderate, with SMDs ranging from 0.38 to 0.58 across the 5 pain dimensions. A review of antidepressant treatment for CLBP[8] also showed a moderate effect (SMD = 0.41). Interestingly, the effect size for a surgical procedure was low (SMD = 0.22) and not statistically significant.[9]


The example of CLBP is instructive in several ways. Effect sizes from different types of treatment provide clear guidance: (1) pharmacotherapy is among the most effective but also the most problematic treatments for CLBP; (2) surgery is among the least effective treatments; and (3) acupuncture and several other alternative, low-cost therapies for CLBP may be equally or more effective than more invasive and/or expensive treatments. More generally, this example shows that estimates of effect size are useful when comparing similar treatments for a condition.


Using the Effect Size to Understand the Efficacy of Attention-Deficit/Hyperactivity Disorder Medications


Faraone and colleagues[10] analyzed the published literature on the pharmacotherapy of attention-deficit/hyperactivity disorder (ADHD) to describe the variability of drug-placebo effect sizes across medications. We conducted a literature search to identify double-blind, placebo-controlled studies of youth with ADHD published after 1979. Meta-analysis regression assessed the influence of medication type and study design features on medication effects. Twenty-nine trials met criteria and were included in the meta-analysis. These trials studied 15 drugs using 17 different outcome measures of hyperactive, inattentive, impulsive, or oppositional behavior. The most commonly studied treatments included both methylphenidate and amphetamine compounds.


Effect sizes for stimulants were significantly greater than those for other medications, but the presence of confounds and their interaction with medication class suggests that, in the absence of confirmatory head-to-head studies, caution is warranted when comparing the effects of different medications across studies. Although head-to-head trials are needed to make definitive statements about efficacy differences, our results comparing stimulants and nonstimulants are compatible with the efficacy differences between atomoxetine and mixed amphetamine salts reported by Wigal and colleagues[11,12] and with the conclusions of a previous review limited to a smaller subset of studies that excluded short-acting stimulants.[13] There were no differences between long- and short-acting stimulants, but this finding is limited by the fact that the studies did not examine duration of action.


Recently, lisdexamfetamine (LDX) was approved by the US Food and Drug Administration for the treatment of ADHD; it is the first prodrug stimulant approved for this indication. LDX is notable because, in its registration trial, it yielded effect sizes of 1.21, 1.34, and 1.60 in the 30-, 50-, and 70-mg daily dose groups, respectively.[14] These values are greater than the effect sizes reported for other stimulants.[10] If these results can be replicated, they would suggest that LDX is superior to other stimulant medications.


Limitations of Using Effect Sizes to Compare Treatments


Although using effect sizes to compare different treatments is much better than making qualitative comparisons of different studies, the method has several limitations. Comparing effect sizes only makes sense if the studies being compared are similar on any study design features that might increase or decrease the effect size; such comparisons are questionable for studies that differ substantially on design features that influence drug-placebo differences. For example, if a study of drug A used double-blind methodology and found a smaller effect size than an unblinded study of drug B, we could not be sure whether the difference in effect size was due to differences in drug efficacy or to differences in methodology. If the endpoint outcome measures differ dramatically among studies, that could also lead to spurious results. For example, if 1 study used a highly reliable and well-validated outcome measure whereas the other used a measure of questionable reliability and validity, comparing effect sizes would not be meaningful without adjusting for these differences.


Although the efficacy effect size is a useful tool for comparing treatments, physicians must also consider adverse events along with efficacy when choosing treatments for their patients. And our discussion of efficacy does not diminish the importance of the many other questions physicians must consider when prescribing treatments. Are patients taking other treatments that may interact with a proposed treatment? Do they have a coexisting disorder that suggests the need for combined treatments? Have they undergone previous trials with any of the potential treatments? If so, what were the effects? These and other questions remind us that, although quantitative methods such as the computation of effect size play a crucial role in the practice of evidence-based medicine, they will never fully replace the evidence collected by informed physicians seeking to optimize the care of their patients.


This activity is supported by an independent educational grant from Shire.



References











  1. Faraone SV, Biederman J, Spencer TJ, Wilens TE. The drug-placebo response curve: a new method for assessing drug effects in clinical trials. J Clin Psychopharmacol. 2000;20:673-679.

  2. Cohen J. Statistical Power Analysis for the Behavioral Sciences. 2nd ed. Hillsdale, NJ: Erlbaum; 1988.

  3. Di Fabio RP. Efficacy of comprehensive rehabilitation programs and back school for patients with low back pain: a meta-analysis. Phys Ther. 1995;75:865-878.

  4. Manheimer E, White A, Berman B, Forys K, Ernst E. Meta-analysis: acupuncture for low back pain. Ann Intern Med. 2005;142:651-663.

  5. McLain K, Powers C, Thayer P, Seymour RJ. Effectiveness of exercise versus normal activity on acute low back pain: an integrative synthesis and meta-analysis. Online J Knowl Synth Nurs. 1999;6:7.

  6. Kool J, de Bie R, Oesch P, Knüsel O, van den Brandt P, Bachmann S. Exercise reduces sick leave in patients with non-acute non-specific low back pain: a meta-analysis. J Rehabil Med. 2004;36:49-62.

  7. Browning R, Jackson JL, O'Malley PG. Cyclobenzaprine and back pain: a meta-analysis. Arch Intern Med. 2001;161:1613-1620.

  8. Salerno SM, Browning R, Jackson JL. The effect of antidepressant treatment on chronic back pain: a meta-analysis. Arch Intern Med. 2002;162:19-24.

  9. Ibrahim T, Tleyjeh IM, Gabbar O. Surgical versus non-surgical treatment of chronic low back pain: a meta-analysis of randomised trials. Int Orthop. 2008;32:107-113.

  10. Faraone SV, Biederman J, Spencer TJ, Aleardi M. Comparing the efficacy of medications for ADHD using meta-analysis. MedGenMed. 2006;8:4.

  11. Faraone SV, Wigal S, Hodgkins P. Forecasting three-month outcomes in a laboratory school comparison of mixed amphetamine salts extended release (Adderall XR) and atomoxetine (Strattera) in school-aged children with attention-deficit/hyperactivity disorder. J Atten Disord. 2007;11:74-82.

  12. Wigal SB, Wigal TL, McGough JJ, et al. A laboratory school comparison of mixed amphetamine salts extended release (Adderall XR) and atomoxetine (Strattera) in school-aged children with attention-deficit/hyperactivity disorder. J Atten Disord. 2005;9:275-289.

  13. Banaschewski T, Coghill D, Santosh P, et al. Long-acting medications for the hyperkinetic disorders: a systematic review and European treatment guideline. Eur Child Adolesc Psychiatry. 2006;15:476-495.

  14. Biederman J, Krishnan S, Zhang Y, McGough JJ, Findling RL. Efficacy and safety of lisdexamfetamine (LDX; NRP104) in children with attention-deficit/hyperactivity disorder: a phase 3, randomized, multicenter, double-blind, parallel-group study. Clin Ther. 2007;29:450-463.






Stephen Faraone, PhD, Professor of Psychiatry, SUNY Upstate Medical University, Syracuse, New York



Disclosure: Stephen Faraone, PhD, has disclosed that he has received grants for clinical research and that he has also served as an advisor or consultant to Shire. Dr. Faraone has also disclosed that he has received grants for educational activities from McNeil and Shire.

