Content Outline:

I. Definition of Effect Size for the Mean
A. Importance
B. Approaches
C. Critical assumptions

II. Calculating Effect Size
A. Cohen’s d
B. Pearson Product r
C. Odds ratio

III. Advantages and Disadvantages
A. Advantages and Disadvantages of Standardized Effect Sizes
B. Comparative Table of Pros and Cons of Standardized Effect Sizes

IV. Resources

V. References

I. Definition of effect size

What is effect size?

Effect size (ES) is a measure of the size or magnitude of a treatment effect. Unlike tests of significance, which answer the question "Is the difference between these groups due to chance?", measures of effect size tell us how strong the relationship between the groups is. Effect size answers the question "How big is the difference between the group means?" It matters for decision making because effect size has practical implications. A finding can be statistically significant, but whether that significant difference is of practical concern is another question.

Cohen (1988) outlines benchmarks for understanding effect size. He suggests that r = .10 (r² = .01) and d = .20 are small effects (also see Valentine & Cooper, 2003). A d of .50 and an r of .30 (r² = .09) are medium-sized effects, and large effects are generally those equal to or larger than d = .80 and r = .50 (r² = .25).
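The benchmarks above can be applied mechanically, as in the following minimal sketch. The function name `label_effect` and the label "below small" for values under the smallest cut-off are assumptions made for illustration; Cohen's benchmarks as quoted here do not name that range.

```python
# Cohen's (1988) benchmark cut-offs as quoted in the text above,
# listed from largest to smallest.
D_BENCHMARKS = [(0.8, "large"), (0.5, "medium"), (0.2, "small")]
R_BENCHMARKS = [(0.5, "large"), (0.3, "medium"), (0.1, "small")]


def label_effect(value, benchmarks):
    """Return the benchmark label for an effect size (the sign is ignored)."""
    magnitude = abs(value)
    for cutoff, label in benchmarks:
        if magnitude >= cutoff:
            return label
    # No benchmark name exists for values under the "small" cut-off;
    # "below small" is a placeholder assumed for this sketch.
    return "below small"


print(label_effect(0.5, D_BENCHMARKS))   # medium
print(label_effect(0.25, R_BENCHMARKS))  # small
```

As the section on advantages and disadvantages stresses, these labels are conventions and should be read in context rather than treated as fixed rules.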

Why is effect size important?

Effect size is important because it provides information about the size or magnitude of the effect. Effect size can help us assess change and understand the practical significance of a treatment. If, for example, students in a school district who attend a math tutoring program are found to do better on a standardized math test than students who did not attend the program, decision makers in the district will want to know "How much better?" If the students who received tutoring did only marginally better, administrators may not implement a district-wide tutoring program for math. However, if the effect size of the tutoring program is large, administrators will have to consider how to implement the program for all students. This example demonstrates the practical importance of effect size. Statistical significance is not enough when considering the relationship between two populations; effect size helps us understand the real importance of the relationship between groups.

Journals increasingly require authors to report effect sizes. The American Psychological Association (APA) made the following change to its publication guidelines: "When reporting inferential statistics (e.g. t tests, F tests, and chi-square), include information about the obtained magnitude or value of the test statistic, the degrees of freedom, the probability of obtaining a value as extreme or more extreme than the one obtained, and the direction of the effect" (Wilkinson & APA Task Force on Statistical Inference, 1999).

Critical Assumptions

  • Normal distribution (assumption of normality)
  • Homogeneity of variance (homoscedasticity). Effect size is sensitive to heterogeneity of variance, i.e. differing variances (heteroscedasticity).
  • Reliability of measures

Comparative Table of Three Effect Size Measures


Sources: Becker, L.A. (2000) and Valentine, J. & Cooper, H. (2003).

II. Calculating Effect Size

There are several ways to calculate effect size, and not all of them pertain to the mean. In CEP 932, for instance, we have learned how to calculate the effect size for the Chi-Square Goodness of Fit test. We do not deal with that effect size measure (w) here (although Cohen's benchmarks for r also apply to w). Rather, we present calculations for Cohen's d, Pearson's r, and the odds ratio.

All of the calculations are based on hypothetical data. Here's the scenario:
An educational researcher wants to know if students' online reading comprehension scores improve as a result of a month-long classroom-based intervention designed to help students become more critical readers of the Internet.

One class of 30 students received the intervention treatment. A control class of 30 students did not receive the intervention treatment. At pre-test, the mean scores for the two groups on the standardized test of reading comprehension were not statistically different. However, at post-test, an independent samples t-test revealed a statistically significant difference. Here are the statistics:

Descriptive Statistics:

Mean Post-test Online Reading Comprehension Score
Standard Deviation

Independent Samples T-Test:
t observed = 4.238, p <0.001

Cohen's d

Cohen's d is calculated by dividing the difference between the two means by their pooled standard deviation.
The formula is: $d = \frac{M_1 - M_2}{SD_{pooled}}$, where $M_1$ and $M_2$ are the two group means.

To calculate the pooled standard deviation, follow these steps:


Cohen's d = (12.22 - 9.55) / 2.395
Cohen's d = 1.11

As noted above, Cohen (1988) stated that effect sizes of 0.8 or above are considered large. See L. Becker's notes on effect size for a comprehensive table that reviews the range of effect sizes and their meanings. In this case, therefore, the calculated effect size of 1.11 is considered large. It would seem that the intervention had a sizable, positive effect on the treatment group of students.
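The calculation above can be sketched in a few lines of Python. The means and the pooled standard deviation are the values reported in the text. The `pooled_sd` helper uses the common degrees-of-freedom weighted formula, which is an assumption (the text does not say which pooling formula was used), and the group standard deviations passed to it are hypothetical values, not the study's.

```python
import math


def pooled_sd(sd1, n1, sd2, n2):
    """Pooled standard deviation, weighting each variance by its degrees of freedom."""
    pooled_variance = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    return math.sqrt(pooled_variance)


def cohens_d(mean1, mean2, sd_pooled):
    """Difference between two means expressed in pooled-standard-deviation units."""
    return (mean1 - mean2) / sd_pooled


# Values reported in the text: means 12.22 and 9.55, pooled SD 2.395.
d = cohens_d(12.22, 9.55, 2.395)
print(round(d, 2))  # 1.11

# Hypothetical group SDs (NOT from the study), used only to exercise pooled_sd.
print(pooled_sd(2.0, 30, 3.0, 30))
```

With equal group sizes, as here, the weighted formula reduces to the square root of the average of the two variances.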

Pearson's r (aka Pearson product-moment correlation coefficient)

As noted above in the comparative table of these three measures of effect size, Pearson's r is a correlation coefficient. It is designed to answer three questions:
  1. How strong is the relationship between these two variables?
  2. Is the linear relationship positive or negative? Another way to express this question is "Which values of each variable are associated with values of the other variable?" (see CEP 932 Topic 8 .ppt, slide 2)
  3. What is the shape of the relationship? Is it linear or curvilinear?

Values of Pearson's r range from -1 to +1. An r value of -1 indicates that the relationship between the two variables is perfectly linear and that as values of x increase, values of y decrease (negative slope). An r value of +1 indicates that the relationship between the two variables is perfectly linear and that as values of x increase, values of y increase (positive slope). An r value of 0 indicates no linear relationship. The scatterplot (see below) for an r value of 0 might appear circular or curvilinear (e.g., hyperbolic). Since r is a measure of linear relationship, it does not detect curvilinear relationships.


Bowman, R. (2009, April 28).

Calculating Pearson's r
The formula for calculating Pearson's r is:
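One standard way to write Pearson's r is as the sum of the cross-products of the deviations of x and y from their means, divided by the square root of the product of the two sums of squared deviations. The sketch below implements that form; the function name `pearson_r` and the data are made up for illustration and are not the study's scores.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r from the deviation-score form of the formula."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    cross = sum(a * b for a, b in zip(dx, dy))  # sum of cross-products
    ss_x = sum(a * a for a in dx)               # sum of squares for x
    ss_y = sum(b * b for b in dy)               # sum of squares for y
    if ss_x == 0 or ss_y == 0:
        raise ValueError("r is undefined when a variable has no variance")
    return cross / math.sqrt(ss_x * ss_y)


x = [-2, -1, 0, 1, 2]  # made-up illustration values
print(pearson_r(x, [2 * v + 1 for v in x]))  # perfectly linear, positive slope
print(pearson_r(x, [-3 * v for v in x]))     # perfectly linear, negative slope
print(pearson_r(x, [v * v for v in x]))      # curvilinear (a parabola)
```

The three calls correspond to the cases described above: a perfect positive line, a perfect negative line, and a curvilinear relationship that r does not detect even though y is completely determined by x.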

If we return to the scenario of two groups of students, one of which received an online reading comprehension intervention, we can test the strength, shape and direction of the relationship between two variables. In this case the two variables are (1) treatment group post-test score on online reading comprehension test and (2) control group score on the same online reading comprehension test. If there is a linear correlation between the two variables, we will be able to see the effect of the treatment - which, for this hypothetical experiment, was a month-long intervention designed to help the treatment students become more critical readers of the Internet.

Looking at the purple cell highlighted in the SPSS Output table, we see that the Pearson r (correlation) between these two variables is 0.092. Following Cohen's rule of thumb for effect sizes for the correlation coefficient, an effect is considered small when r² = 0.01. If we square 0.092, we get r² = 0.008, which falls just below even that benchmark. The scatterplot for these data demonstrates this fact visually.

Why does Cohen's d suggest a large effect and Pearson's r suggest a negligible effect?
The difference between the results of Cohen's d and Pearson's r for these data highlights the purpose of each measure. Cohen's d is designed to measure the separation between group means, whereas the Pearson product-moment correlation coefficient, r, expresses the degree to which x varies with y. If we think about the data, we know that the pair-wise comparisons for each data point are actually comparing the scores of individual students from the treatment and control groups. The Pearson statistic compares Student 1, Treatment with Student 1, Control. Even if the effect of the intervention is considerable between the groups (Cohen's d), it makes sense that the performance of Student 1, Treatment wouldn't necessarily PREDICT the performance of Student 1, Control. For this reason, it is easy to see that you can't really predict y from x in this case. Moreover, this example demonstrates the point that Pearson's r is most appropriate for correlated data. Had we compared Student 1, Treatment's pre-test scores with her post-test scores, we might have seen a sizeable effect.
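A small made-up data set illustrates the point: two groups can be far apart on average (large d) while the arbitrarily paired scores show no linear relationship at all (r = 0). The scores below are invented for this sketch and are not the study's data; with equal group sizes, the pooled variance is taken as the average of the two sample variances.

```python
import math
import statistics

# Invented scores; the pairing of the i-th control student with the
# i-th treatment student is arbitrary, as in the scenario above.
control = [8, 9, 10, 11, 12]
treatment = [13, 11, 12, 11, 13]

# Cohen's d: separation between the group means in pooled-SD units.
sd_pooled = math.sqrt(
    (statistics.variance(control) + statistics.variance(treatment)) / 2
)
d = (statistics.mean(treatment) - statistics.mean(control)) / sd_pooled

# Pearson's r: linear association across the arbitrary pairs.
mean_c = statistics.mean(control)
mean_t = statistics.mean(treatment)
cross = sum((c - mean_c) * (t - mean_t) for c, t in zip(control, treatment))
ss_c = sum((c - mean_c) ** 2 for c in control)
ss_t = sum((t - mean_t) ** 2 for t in treatment)
r = cross / math.sqrt(ss_c * ss_t)

print(round(d, 2))  # large by Cohen's benchmarks
print(r)            # no linear relationship between the paired scores
```

The two statistics answer different questions, so a large d alongside a near-zero r is not a contradiction.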

Odds Ratio

As noted in the comparative table above, the odds ratio is a way to compare the odds of one outcome vs. another. We can use the odds ratio for two variables when each variable has two possible outcomes. The odds ratio is useful because it is easily understood. There is no "Cohen's rule of thumb" to memorize. Rather, the odds ratio states how many times greater the odds of an outcome are in one group than in the other.

Using the data for the post-test scores of treatment and control students on the test of online reading comprehension, we can set up an odds ratio. The researchers for this study are interested in the odds of treatment and control students getting a score of 12 or higher on the reading comprehension test, the maximum score for which is 15. Here, we can return to the raw data and just tally up the number of students in both groups who got a score of 12, 13, 14 or 15 on the reading comprehension test. Here are the results:

Treatment (n=30)
Control (n=30)

To calculate the odds ratio, we first have to establish the odds of a score of 12 or more for the treatment and the control groups. For the treatment group, students were twice as likely (2/1) to score 12 or more as to score less than 12. For the control group, however, the odds were almost reversed: scores of 12 or more occurred only 0.3 times as often as scores below 12.

Now, we can calculate the odds ratio between the two groups with a simple equation: $\text{odds ratio} = 2 / 0.3 \approx 6.67$

From this calculation, we see that the odds of a treatment student scoring 12 or higher at post-test were 6.67 times the odds for a control student. This example also demonstrates the practical significance of the odds ratio as a measure of effect size. By doing a simple calculation of the odds of each outcome occurring within each group and then comparing those odds, we get a very real sense of how effective the intervention was in teaching students to read the Internet more critically. Cohen's d and the odds ratio use different scales and are not directly comparable as a result, but this example does demonstrate how their findings can be compared. Practically speaking, the result of the odds ratio is consistent with the finding of Cohen's d: the treatment intervention increased student scores, and to a sizable degree.
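The odds and odds ratio arithmetic above can be sketched as follows. The treatment counts (20 and 10) are inferred from the reported odds of 2/1 with n = 30. The control odds are entered directly as the rounded value 0.3 given in the text. The function names are illustrative.

```python
def odds(hits, misses):
    """Odds of an outcome: how often it occurred per time it did not occur."""
    if misses == 0:
        raise ValueError("odds are undefined when the outcome always occurs")
    return hits / misses


def odds_ratio(odds_a, odds_b):
    """How many times greater the odds are in group A than in group B."""
    return odds_a / odds_b


# Treatment: odds of 2/1 with n = 30 correspond to 20 students scoring
# 12 or higher and 10 scoring below 12 (counts inferred, see above).
treatment_odds = odds(20, 10)

# Control: only the rounded odds of 0.3 are reported, so use them directly.
control_odds = 0.3

print(round(odds_ratio(treatment_odds, control_odds), 2))  # 6.67
```

An odds ratio of 1 would mean the odds are the same in both groups; values above 1 favor the first group.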

III. Advantages and Disadvantages

Standardized Effect Size Measures or Practically Significant Measures?

There seems to be a debate in the statistics community about whether to use standardized measures of effect size (such as Cohen's d) or practical measures such as the odds ratio.

Advantage of standardized effect size:

According to Valentine & Cooper (2003) standardized effect sizes allow for comparisons by combining the estimates from different effect size measurement metrics that have different measurement scales. For example, as part of a research study, outcomes may be reported using different measures such as odds ratio and correlation coefficient. When standardized effect sizes are used, these estimates can be compared to reach a conclusion.
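As a rough illustration of putting estimates from different metrics on a common scale, the sketch below uses two commonly cited approximate conversions. Neither formula comes from the sources cited on this page, and both rest on assumptions: converting d to r assumes equal group sizes, and converting an odds ratio to d assumes an underlying logistic distribution. The function names are illustrative.

```python
import math


def d_to_r(d):
    """Approximate r from Cohen's d, assuming two groups of equal size."""
    return d / math.sqrt(d * d + 4)


def odds_ratio_to_d(odds_ratio):
    """Approximate Cohen's d from an odds ratio (logistic assumption)."""
    return math.log(odds_ratio) * math.sqrt(3) / math.pi


# Values from the worked examples in section II.
print(round(d_to_r(1.11), 3))
print(round(odds_ratio_to_d(6.67), 3))
```

Note that the r produced by `d_to_r` is the correlation between group membership and score, not the pair-wise correlation between treatment and control students computed in section II.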

Disadvantages of standardized effect size:

Standardized effect sizes may be "less easily understood" (Fern & Monroe, 1996). Without a thorough understanding of what the standardized measures mean, researchers may interpret them differently and reach inconsistent conclusions on similar research questions. This may simply be due to differences in the assumptions on which the research is based (Fern & Monroe, 1996).
On a similar note, words such as "small", "medium", and "large" used in Cohen's test interpretations need to be interpreted carefully as differences arising in context may change the meaning of these words. What may be "small" in one context may mean something else in another context.

The following summary lists the pros and cons of effect size measures.

Advantages:

  • Standardized effect sizes are generally easy to calculate and, because of their benefits, are widely used.
  • Effect size is readily understood and applicable in the social sciences and education. It also contributes to a body of literature that researchers can build upon.
  • The standards of "small", "medium" and "large" are quite beneficial in an analysis. In addition, these standards were derived after extensive research and do reflect "realistic conventions" (Lenth, 2004).
  • Effect size analysis helps in creating tables that display information coherently, allowing easy comparisons against benchmarks.
  • Effect size is useful in "quantifying effects measured on unfamiliar or arbitrary scales and for comparing the relative sizes of effects from different studies" (Coe, 2002).
  • Practitioners can feel comfortable with effect sizes and relate to them more easily than to significance testing (Neill, 2008).
  • Effect sizes can be very helpful in practical situations, and they aid decision making.
  • When we are only interested in sample results, significance testing is not needed. In such cases, effect sizes are "sufficient and suitable" (Neill, 2008).
  • In certain situations, effect sizes are more informative than significance testing. For example, when the sample size is small, significance testing can mislead because it is subject to Type II errors (Neill, 2008).

Disadvantages:

  • Effect size interpretation can be problematic if not done appropriately. For example, Cohen describes an effect size of 0.5 as medium, but applied out of context this label can mislead: what is medium in one case may not be in another (Coe, 2002).
  • The practical importance of an effect depends on its overall benefits and costs. For example, in the field of education, an improvement with an effect size of 0.1 may seem insignificant, but over time the cumulative effect may become very large, and that could mean a significant improvement (Coe, 2002). This aspect is easily overlooked when an effect is judged by its size alone.
  • There can be problems in the interpretation of standardized effect sizes when a sample does not come from a Normal distribution, or when the measurements or data used are unreliable (Coe, 2002).
  • Different measures of effect size use different scales, and comparing them can be confusing. An odds ratio of 3 means something quite different from a Cohen's d of 3.
  • Standardized effect sizes are recommended ONLY when the metrics used for comparison really have no practical meaning, e.g. a GRE score of 540. When the units are practically meaningful, the APA recommends using practical, unstandardized measures of effect size such as the raw difference between group means.

IV. Resources

Effect size calculator for multiple regression: http://www.danielsoper.com/statcalc/calc05.aspx

Effect size calculator for Cohen's d and r: http://web.uccs.edu/lbecker/Psy590/escalc3.htm

Effect size generator software: http://www.clintools.com/victims/resources/software/effectsize/effect_size_generator.html

Effect Size Calculator 1.0 freeware - Macintosh Universal application that calculates effect sizes (Cohen's d, r, Glass's Δ, Common Language Effect Size) given appropriate means and standard deviations: http://mac.wareseeker.com/Education/effect-size-calculator-1.0.zip/329400

V. References

Barnette, J. (2006). Effect size and measures of association. 2006 Summer Evaluation Institute. Retrieved from http://www.eval.org/SummerInstitute/06SIHandouts/SI06.Barnette.TR2.Online.pdf

Becker, L. A. (2000). Effect size (ES). Psychology 590 course lecture PowerPoint slides. University of Colorado. Retrieved from http://web.uccs.edu/lbecker/Psy590/es.htm

Bowman, R. (2009, April 28). Statistics Notes-Correlation. Retrieved August 15, 2009 from http://rchsbowman.wordpress.com/2009/04/28/statistics-notes-correlation/

Coe, R. (2002). It's the effect size, stupid: what effect size is and why it is important. Paper presented at the British Educational Research Association Conference, Exeter, September. Retrieved from http://www.cemcentre.org/Documents/CEM%20Extra/EBE/ESguide.pdf

Cohen, J. (1988). Statistical power analysis for the behavioral sciences. NY: Academic Press.

Fern, E. F. & Monroe, K. B. (1996). Effect-size estimates: Issues and problems in interpretation. Journal of Consumer Research, 23(2), 89-105. Retrieved from http://www.jstor.org/stable/2489707

Glass, G. V., McGaw, B. & Smith, M. L. (1981). Meta-analysis in social research. London: Sage.

Lenth, R. (2004). Two Sample size practices that I do not recommend. Paper presented at the Joint Statistical Meetings of the American Statistical Association, Indianapolis. Retrieved from http://www.cs.uiowa.edu/~rlenth/Power/2badHabits.pdf

Neill, J. (2008). Why use effect sizes instead of significance testing in program evaluation. Retrieved from http://wilderdom.com/research/effectsizes.html

Valentine, J. & Cooper, H. (2003). Effect size substantive interpretation guidelines: Issues in the interpretation of effect sizes. Washington, DC: What Works Clearinghouse. Retrieved from http://ies.ed.gov/ncee/wwc/pdf/essig.pdf

Wilkinson, L., & The APA Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54, 594-604. Retrieved from http://www.loyola.edu/library/ref/articles/Wilkinson.pdf