Understanding Statistical Power and Effect Sizes

Learn how statistical power determines sample size for finding specific effects. Explore a priori and post hoc power analyses, emphasizing effect size over statistical significance. Discover how reliability, outliers, and analysis choices impact power.

dunn

Presentation Transcript


  1. Effect Sizes and Power Review

  2. Statistical Power • Statistical power refers to the probability of detecting an effect of a particular size • Specifically, it is 1 − β, where β is the type II error rate • Probability of rejecting the null hypothesis if it is false • It is a function of type I error rate, sample size, and effect size • Its utility lies in helping us determine the sample size needed to find an effect size of a certain magnitude
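
To make the dependence on type I error rate, sample size, and effect size concrete, here is a minimal sketch. It assumes a two-sided one-sample test and uses a normal approximation in place of the noncentral t distribution, so the values are slightly optimistic for small n. The function name `approx_power` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(d, n, alpha=0.05):
    """Approximate power of a two-sided one-sample test (normal approximation)."""
    z = NormalDist()
    delta = d * sqrt(n)                  # where the test statistic is centered if H0 is false
    z_crit = z.inv_cdf(1 - alpha / 2)    # critical value fixed by the type I error rate
    # Probability of landing in either rejection region when the true shift is delta
    return (1 - z.cdf(z_crit - delta)) + z.cdf(-z_crit - delta)

# Power rises with sample size for a fixed effect size
for n in (20, 50, 100):
    print(n, round(approx_power(0.33, n), 3))
```

When d is 0 the function returns alpha, which is the type I error rate, as it should.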

  3. Two kinds of power analysis • A priori • Used when planning your study • What sample size is needed to obtain a certain level of power? • Post hoc • Used when evaluating a study • What chance did you have of significant results? • Not really useful • If you do the power analysis and conduct your analysis accordingly then you did what you could. To say afterward, “I would have found a difference but didn’t have enough power” isn’t going to impress anyone.

  4. A priori power • Can use the relationship of n, d, and δ (the noncentrality parameter, i.e. what the sampling distribution is centered on if H0 is false) plus our specified α to calculate how many subjects we need to run • Decide on your α level • Decide an acceptable level of power/type II error rate • Figure out the effect size you are looking for • Calculate n
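
The four steps can be sketched as a single calculation. This is an illustration under stated assumptions, not the slide author's procedure: a two-sided one-sample test, a normal approximation to the required δ (the sum of the critical z for α and the z for the desired power), and the relation δ = d√n. The function name `n_for_power` is hypothetical.

```python
from math import ceil
from statistics import NormalDist

def n_for_power(d, power=0.80, alpha=0.05):
    """Smallest n giving roughly the requested power (normal approximation)."""
    z = NormalDist()
    # Required noncentrality: critical value for alpha plus the z-score for power
    delta = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    # delta = d * sqrt(n)  =>  n = (delta / d) ** 2, rounded up to a whole subject
    return ceil((delta / d) ** 2)

for effect in (0.2, 0.5, 0.8):
    print(effect, n_for_power(effect))
```

Because it uses z rather than t quantiles, the result is a lower bound; dedicated power software will typically ask for a few more subjects.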

  5. A priori Effect Size? • Figure out an effect size before I run my experiment? • Several ways to do this: • Base it on substantive knowledge • What you know about the situation and scale of measurement • Base it on previous research • Use conventions

  6. An acceptable level of power? • Why not set power at .99? • Practicalities • Howell shows how for a 1 sample t test, and an effect size d of 0.33: • Power = .80, then n = 72 • Power = .95, then n = 119 • Power = .99, then n = 162 • Cost of increasing power (usually done through increasing n) can be high
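
The three sample sizes can be reproduced from n = (δ/d)² for a one-sample t test. The δ values below (2.80, 3.60, 4.20) are an assumption: they are the rounded noncentrality values that yield the slide's numbers, and they are presumed to match the entries in Howell's power table for α = .05, two-tailed.

```python
def n_from_delta(delta, d):
    """One-sample t test: delta = d * sqrt(n), so n = (delta / d) ** 2."""
    return round((delta / d) ** 2)

d = 0.33
# Assumed tabled noncentrality values for each target power (alpha = .05, two-tailed)
assumed_delta = {0.80: 2.80, 0.95: 3.60, 0.99: 4.20}

for power, delta in assumed_delta.items():
    print(power, n_from_delta(delta, d))
```

Moving from power .80 to .99 more than doubles the required n, which is the cost the slide refers to.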

  7. Howell’s general rule • Look for big effects or • Use big samples • You may now start to understand how little power many of the studies in psych have considering they are often looking for small effects • Many seem to think that if they use the central limit theorem rule of thumb (n=30), which doesn’t even hold that often, that power is solved too • This is clearly not the case

  8. Post hoc power: the power of the actual study • If you fail to reject the null hypothesis, you might want to know what chance you had of finding a significant result – defending the failure • As many point out this is a little dubious • One thing we can understand regarding the power of a particular study at hand is that it can be affected by a number of issues such as • Reliability of measurement • An increase in reliability can actually result in power increasing or decreasing as we will see later, though here I stress the decrease due to unreliable measures • Outliers • Skewness • Unequal N for group comparisons • The analysis chosen

  9. Something to consider • Doing a sample size calculation is nice in that it gives a sense of what to shoot for, but rarely if ever do the data or circumstances bear out such that it provides a perfect estimate for our needs • Mike’s sample size calculation for all studies: • The sample size needed is the largest N you can obtain based on practical considerations (e.g. time, money) • Also, even the useful form of power analysis (for sample size calculation) involves statistical significance as its focus • While it gives you something to shoot for, our real interest regards the effect size itself and how comfortable we are with its estimation • Emphasizing effect size over statistical significance in a sense de-emphasizes the power problem

  10. Always a relationship • Commonly define the null hypothesis as ‘no difference’ or ‘no relationship’ • There is always a non-zero relationship (to some decimal place) seen in sample data • As such obtaining statistical significance can be seen as just a matter of sample size • Furthermore, the importance and magnitude of an effect are not reflected (because of the role of sample size in probability value attained)
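
A small sketch of why significance can be "just a matter of sample size": hold a trivially small correlation fixed and let n grow. It assumes the usual t statistic for a correlation, t = r√(n − 2)/√(1 − r²), and approximates the p-value with the normal distribution, which is reasonable at these sample sizes. The value r = 0.02 and the function name `approx_p_for_r` are illustrative choices.

```python
from math import sqrt
from statistics import NormalDist

def approx_p_for_r(r, n):
    """Two-sided p-value for a correlation r at sample size n (normal approximation to t)."""
    t = r * sqrt(n - 2) / sqrt(1 - r ** 2)
    return 2 * (1 - NormalDist().cdf(abs(t)))

# Same tiny effect (r = 0.02, i.e. 0.04% of variance), increasingly large samples
for n in (100, 1_000, 10_000, 100_000):
    print(n, approx_p_for_r(0.02, n))
```

The effect size never changes across the rows; only the p-value does.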

  11. What should we be doing? • Want to make sure we have looked hard enough for the difference – power analysis • Figure out how big the thing we are looking for is – effect size

  12. Effect Size • There are different ways to speak about the relationship between variables, but in general effect size refers to the practical, rather than statistical, significance • This is what we are really interested in • No one cares about the statistical particulars if the effect is real and will change the way we think about things and how we act • However, the effect size, like our other measures, varies from sample to sample • I.e. if we did a study 5 times, we would get 5 different effect sizes • So while we are primarily interested in effect size, we will need to be cautious in our interpretation there too, and use other available evidence also to come to our final conclusions

  13. Calculating effect size • Different statistical tests have different effect sizes developed for them • However, the general principle is the same • Effect size refers to the magnitude of the impact of the independent variable (factor) on the outcome variable

  14. Thinking about effect size again • d family: Focused on standardized mean differences • Allows comparison across samples and variables with differing variance • Equivalent to z scores • Note sometimes no need to standardize (units of the scale have inherent meaning) • r family: Variance-accounted-for • Amount of variance explained versus the total • d family and r family

  15. Example: Cohen’s d – Differences Between Means • Used with independent samples t test • Cohen initially suggested one could use either sample standard deviation, since they should both be equal in the population according to our assumptions. In practice people now use the pooled standard deviation (the square root of the pooled variance). • Variations of this are for control group settings, dependent samples, more than two groups… but the notion of standardized mean difference is the same
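
A minimal sketch of Cohen's d for two independent samples, standardizing the mean difference by the square root of the pooled variance as described above. The function name and the scores are hypothetical, made up purely for illustration.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(group1, group2):
    """Standardized mean difference using the pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    # Pooled variance: sample variances weighted by their degrees of freedom
    pooled_var = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / (n1 + n2 - 2)
    return (mean(group1) - mean(group2)) / sqrt(pooled_var)

# Hypothetical scores
treatment = [24, 27, 22, 30, 26, 25]
control = [21, 23, 20, 26, 22, 24]
print(round(cohens_d(treatment, control), 2))
```

Swapping the argument order flips the sign of d, so the direction of the comparison should be stated when reporting it.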

  16. Cohen’s d – Differences Between Means • Relationship to t • Relationship to r_pb (the point-biserial correlation) • p and q are the proportions of the total sample that each group makes up. With equal groups, p = .5 and q = .5.
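
The slide's equations did not survive in the transcript. The forms below are the commonly given relationships for two independent groups with a pooled standard deviation, offered as a likely reconstruction rather than a copy of the original slide. Here n_1 and n_2 are the group sizes, and p and q are the group proportions.

```latex
d = t\,\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}
\qquad\qquad
r_{pb} = \frac{d}{\sqrt{d^{2} + \dfrac{1}{pq}}}
```

With equal groups, 1/(pq) = 4, so the second expression reduces to r_pb = d / √(d² + 4).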

  17. Characterizing effect size • Cohen emphasized that the interpretation of effects requires the researcher to consider things narrowly in terms of the specific area of inquiry • Evaluation of effect sizes inherently requires a personal value judgment regarding the practical or clinical importance of the effects • Even though rules of thumb exist, use only as a last resort and be wary of “mindlessly invoking” these criteria

  18. Association • A measure of association describes the amount of covariation between the independent and dependent variables • It is expressed in an unsquared metric or a squared metric: the former is usually a correlation, the latter a variance-accounted-for effect size • We can apply the measure to continuous data (r and R²), categorical predictors with a continuous DV (η²), and strictly categorical settings (e.g. phi) • Again the notion is the same: a measure of linear association which, if squared, provides a measure of the variance in the DV that can be accounted for by the predictor
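
A short sketch of the two metrics, assuming the textbook definitions: Pearson's r for two continuous variables, and eta squared as between-group sum of squares over total sum of squares for a categorical predictor. The function names and data are hypothetical. With exactly two groups coded 0/1, r squared and eta squared coincide, which illustrates that "the notion is the same".

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Unsquared metric: linear correlation between two variables."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def eta_squared(groups):
    """Squared metric: share of total variance in the DV accounted for by group membership."""
    scores = [s for g in groups for s in g]
    grand = mean(scores)
    ss_total = sum((s - grand) ** 2 for s in scores)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    return ss_between / ss_total

# Hypothetical two-group data, also expressed as a 0/1 predictor
low, high = [1, 2, 3], [4, 5, 6]
coded = [0, 0, 0, 1, 1, 1]
print(pearson_r(coded, low + high) ** 2, eta_squared([low, high]))
```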

  19. Case-level effect sizes for group differences • Indexes such as Cohen’s d and η² estimate effect size at the group or variable level only • However, it is often of interest to estimate differences at the case level • Case-level indexes of group distinctiveness are proportions of scores from one group versus another that fall above or below a reference point • Examples: Cohen’s U indexes, common language effect size, tail ratios • Reference points can be relative (e.g., a certain number of standard deviations above or below the mean in the combined frequency distribution) or more absolute (e.g., the cutting score on an admissions test) • Note that all three effect size types applicable to the group difference setting can be converted from one to another; which one we use for communication is just a matter of preference
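
As an illustration of moving from a group-level d to case-level statements, the sketch below assumes both groups are normally distributed with equal variances. Under that assumption Cohen's U3 is Φ(d), and the common language effect size is Φ(d/√2). The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def cohens_u3(d):
    """Proportion of the lower group falling below the mean of the higher group."""
    return NormalDist().cdf(d)

def common_language_es(d):
    """Probability that a random case from the higher group beats one from the lower group."""
    return NormalDist().cdf(d / sqrt(2))

for d in (0.2, 0.5, 0.8):
    print(d, round(cohens_u3(d), 3), round(common_language_es(d), 3))
```

When the normality or equal-variance assumption is doubtful, the same proportions can be counted directly from the raw scores instead.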

  20. Confidence Intervals for Effect Size • Effect size statistics such as Cohen’s d and η² have complex distributions • General form is the same as any CI

  21. Confidence Intervals for Effect Size • Traditional methods of interval estimation rely on approximate standard errors assuming large sample sizes • We need a computer program to help us find the correct noncentrality parameters to use in calculating exact confidence intervals for effect sizes • Both standalone programs (Steiger) and statistical packages (R) can do this for us, and thus provide a measure of effect while noting the uncertainty with that estimate
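
For contrast with the exact approach, here is a sketch of the traditional large-sample interval for Cohen's d. It assumes the widely used approximate standard error SE(d) ≈ √((n1 + n2)/(n1·n2) + d²/(2(n1 + n2))) and a normal critical value. It does not compute the exact noncentral-t interval the slide recommends; that still needs dedicated software. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def approx_ci_for_d(d, n1, n2, level=0.95):
    """Large-sample confidence interval for Cohen's d: estimate +/- z * SE."""
    se = sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return d - z_crit * se, d + z_crit * se

# The same observed d is far less certain in a small study
for n in (20, 50, 200):
    low, high = approx_ci_for_d(0.5, n, n)
    print(n, round(low, 2), round(high, 2))
```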

  22. Limitations of effect size measures • Variability across samples • No more a limitation than other statistics, but one needs to be fully aware of this • Just because you found a moderate effect doesn’t mean that there is one • Standardized mean differences: • Heterogeneity of within-conditions variances across studies can limit their usefulness—the unstandardized contrast may be better in this case • Measures of association: • Correlations can be affected by sample variances and whether the samples are independent or not, the design is balanced or not, or the factors are fixed or not • Also affected by artifacts such as missing observations, range restriction, categorization of continuous variables, and measurement error (see Hunter & Schmidt, 1994, for various corrections) • Variance-accounted-for indexes can make some effects look smaller than they really are in terms of their substantive significance

  23. Limitations of effect size measures • How to fool yourself with effect size estimation: • 1. Measure effect size only at the group level • 2. Apply generic definitions of effect size magnitude without first looking to the literature in your area • 3. Believe that an effect size judged as “large” according to generic definitions must be an important result and that a “small” effect is unimportant • 4. Ignore the question of how theoretical or practical significance should be gauged in your research area • 5. Estimate effect size only for statistically significant results

  24. Limitations of effect size measures • 6. Believe that finding large effects somehow lessens the need for replication • 7. Forget that effect sizes are subject to sampling error • 8. Forget that effect sizes for fixed factors are specific to the particular levels selected for study • 9. Forget that standardized effect sizes encapsulate other quantities such as the unstandardized effect size, error variance, and experimental design • 10. As a journal editor or reviewer, substitute effect size magnitude for statistical significance as a criterion for whether a work is published

  25. Recommendations • Report effect sizes along with statistical significance • Report confidence intervals • Use graphics • Use common sense combined with theoretical considerations • Do not rely on any one result to support your conclusions
