Power Analysis, Statistical Significance, & Effect Size

If you plan to use inferential statistics (e.g., t-tests or ANOVA) to analyze your evaluation results, you should first conduct a power analysis to determine what size sample you will need. This page describes what power is, as well as what you will need to calculate it.

What is power?

To understand power, it is helpful to review what inferential statistics test. When you conduct an inferential statistical test, you are often comparing two hypotheses:

  • The null hypothesis – This hypothesis predicts that your program will not have an effect on your variable of interest. For example, if you are measuring students’ level of concern for the environment before and after a field trip, the null hypothesis is that their level of concern will remain the same.
  • The alternative hypothesis – This hypothesis predicts that you will find a difference between groups. Using the example above, the alternative hypothesis is that students’ post-trip level of concern for the environment will differ from their pre-trip level of concern.

Statistical tests look for evidence that you can reject the null hypothesis and conclude that your program had an effect. With any statistical test, however, there is always the possibility that you will find a difference between groups when one does not actually exist. This is called a Type I error. Likewise, it is possible that when a difference does exist, the test will not be able to identify it. This type of mistake is called a Type II error.

Power refers to the probability that your test will find a statistically significant difference when such a difference actually exists. In other words, power is the probability that you will reject the null hypothesis when you should (and thus avoid a Type II error). It is generally accepted that power should be .8 or greater; that is, you should have an 80% or greater chance of finding a statistically significant difference when there is one.
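One way to build intuition for power is to simulate it. The sketch below is a minimal illustration, assuming Python with NumPy and SciPy (the means, standard deviation, and sample sizes are invented for the example): it repeatedly draws samples from two populations whose means genuinely differ and counts how often an independent t-test detects that difference.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Invented populations: the program truly raises scores by 5 points
control_mean, treatment_mean, sd = 70, 75, 10
n_per_group = 50          # planned sample size per group
alpha = 0.05
n_simulations = 10_000

rejections = 0
for _ in range(n_simulations):
    control = rng.normal(control_mean, sd, n_per_group)
    treatment = rng.normal(treatment_mean, sd, n_per_group)
    _, p = stats.ttest_ind(treatment, control)
    if p < alpha:         # the test correctly rejected the null hypothesis
        rejections += 1

print(f"Estimated power: {rejections / n_simulations:.2f}")
```

With these invented numbers the estimated power comes out around .70, below the accepted .8 threshold; a power analysis run before data collection would flag this and prompt a larger sample.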

Increase your sample size to be on the safe side!

How do I use power calculations to determine my sample size?

Generally speaking, as your sample size increases, so does the power of your test. This should intuitively make sense as a larger sample means that you have collected more information -- which makes it easier to correctly reject the null hypothesis when you should.

To ensure that your sample size is big enough, you will need to conduct a power analysis. Unfortunately, these calculations are not easy to do by hand, so unless you are a statistics whiz, you will want the help of a software program. Several programs that can run power calculations are available for free on the Internet.

For any power calculation, you will need to know:

  • The type of test you plan to use (e.g., independent t-test, paired t-test, ANOVA, or regression; see Step 6 if you are not familiar with these tests),
  • The alpha value or significance level you are using (usually 0.01 or 0.05; see the next section of this page for more information),
  • The expected effect size (see the last section of this page for more information), and
  • The sample size you are planning to use.

When these values are entered, a power value between 0 and 1 will be generated. If the power is less than 0.8, you will need to increase your sample size.
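Statistical software can run this calculation directly. Here is a minimal sketch assuming Python with the statsmodels package (one free option among many; the effect size and sample size below are placeholder values), which takes the inputs listed above and returns power for an independent t-test:

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()  # power calculations for the independent t-test

power = analysis.power(
    effect_size=0.5,  # expected effect size (see the last section of this page)
    nobs1=40,         # planned sample size per group
    alpha=0.05,       # significance level
)
print(f"Power: {power:.2f}")  # below 0.8 here, so a larger sample is needed
```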

What is statistical significance?

There is always some likelihood that the changes you observe in your participants’ knowledge, attitudes, and behaviors are due to chance rather than to the program. Testing for statistical significance helps you learn how likely it is that these changes occurred randomly and do not represent differences due to the program.

To learn whether the difference is statistically significant, you will have to compare the probability value you get from your test (the p-value) to the critical probability value you determined ahead of time (the alpha level). If the p-value is less than the alpha level, you can conclude that the difference you observed is statistically significant.

P-value: the probability of obtaining results at least as extreme as yours if your program actually had no effect (that is, if the null hypothesis were true). P-values range from 0 to 1. The lower the p-value, the less likely it is that the difference you observed is due to chance alone.

Alpha (α) level: the error rate you are willing to accept. Alpha is often set at .05 or .01. The alpha level is also known as the Type I error rate. An alpha of .05 means you are willing to accept a 5% chance of concluding that your program had an effect when it actually did not.
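To make the comparison concrete, here is a minimal sketch assuming Python with SciPy (the pre- and post-program scores are invented): it runs a paired t-test and compares the resulting p-value to an alpha level of .05 chosen ahead of time.

```python
from scipy import stats

# Invented pre- and post-program scores for the same ten participants
pre  = [72, 68, 75, 80, 64, 70, 77, 69, 73, 71]
post = [78, 74, 75, 85, 70, 72, 83, 74, 79, 76]

alpha = 0.05                                  # chosen before looking at the data
t_stat, p_value = stats.ttest_rel(post, pre)  # paired (dependent) t-test

if p_value < alpha:
    print(f"p = {p_value:.3f} < {alpha}: the difference is statistically significant")
else:
    print(f"p = {p_value:.3f} >= {alpha}: no statistically significant difference")
```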

What alpha value should I use to calculate power?

In most social science fields, an alpha level of .05 is the accepted threshold for statistical significance, and it is the most common alpha level used in EE evaluations.

The following resources provide more information on statistical significance:

Statistical Significance
Creative Research Systems (2000).
Beginner
This page provides an introduction to what statistical significance means in easy-to-understand language, including descriptions and examples of p-values and alpha values, and several common errors in statistical significance testing. Part 2 provides a more advanced discussion of the meaning of statistical significance numbers.

Statistical Significance
Statpac (2005).
Beginner
This page introduces statistical significance and explains the difference between one-tailed and two-tailed significance tests. The site also describes the procedure used to test for significance (including the p-value).

What is effect size?

When a difference is statistically significant, it does not necessarily mean that it is big, important, or helpful in decision-making. It simply means you can be confident that there is a difference. Let’s say, for example, that you evaluate the effect of an EE activity on student knowledge using pre- and posttests. The mean score on the pretest was 83 out of 100, while the mean score on the posttest was 84. Although you find that the difference in scores is statistically significant (because of a large sample size), the difference is very slight, suggesting that the program did not lead to a meaningful increase in student knowledge.
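The pre/post example above is easy to reproduce. The sketch below, a minimal simulation assuming Python with NumPy and SciPy (all numbers invented to mirror the 83-versus-84 scenario), shows how a very large sample can make a one-point gain statistically significant even though the effect is trivial:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 5000                             # a very large sample

# Simulated test scores: pretest mean of about 83, average gain of just 1 point
pre = rng.normal(83, 10, n)
post = pre + rng.normal(1, 10, n)

_, p_value = stats.ttest_rel(post, pre)
effect_size = (post.mean() - pre.mean()) / pre.std(ddof=1)

print(f"p = {p_value:.4f}")                 # tiny: statistically significant
print(f"effect size = {effect_size:.2f}")   # around 0.1: trivial to small
```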

To know if an observed difference is not only statistically significant but also important or meaningful, you will need to calculate its effect size. Rather than reporting the difference in terms of, for example, the number of points earned on a test or the number of pounds of recycling collected, effect size is standardized. In other words, all effect sizes are calculated on a common scale -- which allows you to compare the effectiveness of different programs on the same outcome.

How do I calculate effect size?

There are different ways to calculate effect size depending on the evaluation design you use. Generally, effect size is calculated by taking the difference between the two groups (e.g., the mean of the treatment group minus the mean of the control group) and dividing it by the standard deviation of one of the groups. For example, in an evaluation with a treatment group and a control group:

effect size = (mean of treatment group − mean of control group) / standard deviation of control group
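In code, the formula is a one-liner. A minimal sketch, assuming Python with NumPy and invented scores for the two groups:

```python
import numpy as np

# Invented post-program scores for a treatment and a control group
treatment = np.array([84, 70, 88, 76, 90, 68, 85, 79])
control   = np.array([76, 62, 80, 70, 82, 60, 79, 71])

# Effect size: difference in means over the control group's standard deviation
effect_size = (treatment.mean() - control.mean()) / control.std(ddof=1)
print(f"Effect size: {effect_size:.2f}")
```

By the guide below, a value like the one this example produces (about 0.9) would count as a large effect.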

To interpret the resulting number, most social scientists use this general guide developed by Cohen:

  • < 0.1 = trivial effect
  • 0.1 - 0.3 = small effect
  • 0.3 - 0.5 = moderate effect
  • > 0.5 = large effect

How do I estimate effect size for calculating power?

Because effect size can only be calculated after you collect data from program participants, you will have to use an estimate for the power analysis. Common practice is to use a value of 0.5 as it indicates a moderate to large difference.
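You can also run the power analysis "in reverse" to find the sample size needed to reach a power of .8. A minimal sketch, again assuming Python with statsmodels and using the conventional estimated effect size of 0.5:

```python
from statsmodels.stats.power import TTestIndPower

# Solve for the per-group sample size that reaches a power of 0.8,
# given the estimated effect size of 0.5 and an alpha level of .05
n_per_group = TTestIndPower().solve_power(
    effect_size=0.5,
    alpha=0.05,
    power=0.8,
)
print(f"Required sample size per group: {n_per_group:.0f}")  # about 64
```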

For more information on effect size, see:

Effect Size Resources
Coe, R. (2000). Curriculum, Evaluation, and Management Center
Intermediate/Advanced
This page offers three useful resources on effect size: 1) a brief introduction to the concept, 2) a more thorough guide to effect size, which explains how to interpret effect sizes, discusses the relationship between significance and effect size, and discusses the factors that influence effect size, and 3) an effect size calculator with an accompanying user's guide.

Effect Size (ES)
Becker, L. (2000).
Intermediate/Advanced
This website provides an overview of what effect size is (including Cohen’s definition of effect size). It also discusses how to measure effect size for two independent groups, for two dependent groups, and when conducting Analysis of Variance. Several effect size calculators are also provided.

References

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum.

Smith, M. (2004). Is it the sample size of the sample as a fraction of the population that matters? Journal of Statistics Education, 12(2). Retrieved September 14, 2006, from http://www.amstat.org/publications/jse/v12n2/smith.html

Patton, M. Q. (1990). Qualitative research and evaluation methods. London: Sage Publications.