The purpose of this article is to present the concepts involved in selecting a minimum sample size of subjects to represent a larger population. In determining sample size, it is important that the sample studied adequately represents the population to which the researcher is generalizing. The size of the study should be considered early in the planning phase of a research study. Often no formal sample size is ever calculated. Instead, the number of subjects available to the investigators during some period of time determines the size of the study. Many clinical trials that do not carefully consider sample size requirements lack the power and ability to detect intervention effects of fairly substantial magnitude and clinical importance.
Since the cost of a study is partially dependent on the number of subjects sampled, it is important to determine the smallest number of subjects required to yield valid results. Therefore, two key elements in a research design are the methodology used to select a sample and the minimum number of subjects chosen.
Three methods of determining the minimum sample size are presented. The first method is based on the requirement of a specific level of significance. The second method involves power and effect-size concepts. The third method is based on a required percent change due to the treatments.
The inherent limitations of these techniques are reviewed, and additional reading is suggested.
The first two questions most researchers ask once a research project has been defined are: "How many subjects will I need to complete my study?" and "How will I select them?"
This article, "Part II," will attempt to present the factors relevant to determining the minimum sample size. "Part I," which was published in the Summer issue (7:3:105-12), addressed the issues related to selecting subjects for a research project (1).
In clinical research it would be ideal to include an entire population of interest when conducting a study; this enables a generalization to be made about that population as a whole. Because of cost, inaccessibility and time constraints, it usually is impossible to test all members of a population. Therefore, samples are drawn from the population for testing purposes, and statistics are computed so the results can be generalized to the larger population.
An estimate of the number of subjects or observations needed in a study is important to researchers to avoid discarding an effective intervention. If a characteristic varies, the validity of any estimate of the parameter will be reduced as the size of the sample is decreased.
Researchers have developed a number of techniques in which only a small portion of the total population is sampled, and attempts to generalize the results and conclusions for the entire population are made.
Many clinical studies do not achieve their intended purposes because the researcher is unable to enroll enough subjects. Therefore, at some point in planning a study, consideration should be given to sample size. Without some idea of how large a difference is to be detected, how much variation is present and what risks are to be tolerated, the best alternative is to take as large a sample as possible.
In practice, sample size often is arbitrary. Since a researcher is interested in learning about a population, the larger the sample studied, the more likely the measured findings will be representative of the population parameters. The researcher is less likely to obtain negative results or make incorrect inferences about the collected data when samples are large rather than small. However, the larger the sample, the more costly the study in terms of time and money. Therefore, effort must be made to assess the probability of adequately sampling the population before the data are collected.
Currier simplifies the task by stating that when studying relationships (correlation), at least 30 subjects should be gathered, whereas in experimental studies involving the comparison of groups, a minimum of 15 subjects is desirable (2).
This method of determining sample size is arbitrary and not well-founded. Each study should be considered on its own merits. The decision should be based on the judgment of the researcher or an advisory committee.
Many clinical trials that do not carefully consider sample-size requirements lack the power or ability to detect intervention effects of fairly substantial magnitude and clinical importance.
The statistical tradition of testing for differences assumes the researcher wishes to guard against two types of errors: type I (rejecting the null hypothesis [Ho] when it is true; in other words, claiming a difference among groups exists when in fact it does not) and type II (accepting the Ho when it is false; i.e., claiming no difference among groups exists when in fact it does). The probability of a type I error is denoted alpha, and the probability of a type II error is designated beta. The quantity alpha is referred to as the level of significance of the test. The quantity (1 - beta) is called the power of the test. Since neither type of error is desirable, both alpha and beta should be small.
Power is also defined as the probability of correctly rejecting the null hypothesis (3). The null hypothesis is a statement of no difference or no relationship between variables, interventions or devices (3).
Freiman et al. reviewed the power of 71 published randomized controlled clinical trials that had failed to find significant differences between groups (4). "Sixty-seven of the trials had a greater than 10-percent risk of missing a true 25-percent therapeutic improvement, and with the same risk, 50 of the trials could have missed a 50-percent improvement" (2). The danger in studies with low statistical power (sample sizes too small) is that interventions that could be beneficial are discarded without adequate testing and may never be considered again.
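The relationship between sample size and power can be sketched numerically. The following Python fragment is only an illustration under stated assumptions: it uses a normal approximation to a two-sided comparison of two group means, and the function name `approx_power`, the effect size of .50 and the group sizes are hypothetical choices, not values from the studies cited.

```python
from math import sqrt
from statistics import NormalDist  # standard library, Python 3.8+


def approx_power(n_per_group, effect_size, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison of means.

    Normal approximation; the very small chance of rejecting in the
    wrong direction is ignored.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)            # about 1.96 for alpha = .05
    shift = effect_size * sqrt(n_per_group / 2)  # standardized mean difference
    return z.cdf(shift - z_crit)


# Hypothetical effect size of .50 at several group sizes.
for n in (15, 30, 60, 120):
    print(n, round(approx_power(n, 0.50), 2))
```

Under these assumptions, 15 subjects per group give a power of roughly .28, so a true effect of that size would be missed in almost three of every four such studies.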
Three approaches to determining sample size are presented here. In all methods, some knowledge of the behavior of the data, such as the standard deviation or variance, is necessary. The variance is simply the square of the standard deviation (3). If a published study used a similar population, then it may be possible to use the variance from that study. However, if such a study cannot be found, a pilot study with a small sample size must be conducted to obtain an estimate of the variance for calculating sample size. In either case, an estimate of the variance must be calculated.
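As a brief sketch of how a pilot sample yields these estimates, the fragment below computes the sample standard deviation and variance with Python's standard library. The five measurements are hypothetical placeholders, not the pilot data of this article.

```python
from statistics import mean, stdev, variance

# HYPOTHETICAL pilot measurements (ml O2/kg/min); not the article's data.
pilot = [12.0, 15.5, 9.8, 14.2, 17.1]

s = stdev(pilot)       # sample standard deviation (n - 1 in the denominator)
s2 = variance(pilot)   # sample variance, equal to the square of s

print(round(mean(pilot), 2), round(s, 2), round(s2, 2))
```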
A large study is being planned involving individuals with incomplete spinal injuries wearing biomechanically equivalent metal or plastic ankle-foot orthoses (AFOs). The energy cost of the patients will be measured to determine which type of AFO requires significantly less energy. For scheduling and budgetary purposes, an estimate of the minimum sample size must be determined before beginning the study.
The three sample-size methods will be compared with a data set from the example. The example used involves a pilot study of a small group (n = 5) of patients with incomplete spinal injuries who were tested for their energy cost of walking (milliliters of oxygen consumed per kilogram body weight per minute) while wearing metal or plastic AFOs. The results were as follows:
|
This method was adapted from Borg and Gall (2,5) and requires knowledge of a previously determined standard deviation (variance) and a t-test value at an estimated sample size. The number of subjects needed (in each group) to detect a significant difference between these two clinical situations can be calculated as follows:
$$n = \frac{2 s^{2} t^{2}}{D^{2}} \tag{1}$$
where
n = minimum number of subjects needed to achieve significance at 0.05
s = average standard deviation for the two groups (In the above example, the average standard deviation for metal and plastic AFOs is [3.4 + 4.4]/2 = 3.9.)
t = t-test value (For the AFO example above with a two-tailed distribution (2,3), .05 level of significance and 30 subjects, the t-value is 1.96. This value may be obtained from a t-test table in any statistics text (2,3). The selection of 30 subjects is arbitrary, but was chosen in this case since the two sample means and variances are so similar it was appropriate to estimate at a higher level.)
D = half of the mean standard deviation of the two groups (In this example, D = 3.9/2 = 1.95.)
Using Equation (1), it is possible to calculate the minimum sample size as follows:
n = [2(3.9)^2 x 1.96^2] / 1.95^2
n = (28.08 x 3.84)/3.80
n = 107.83/3.80
n = 28.4
Using method 1, a minimum of 28 subjects in each group is required to detect a significant difference in the energy cost of the subjects walking with plastic vs. metal AFOs.
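The method 1 calculation can also be sketched in a few lines of Python. The function name `n_for_significance` is a hypothetical label; the inputs are the s, t and D values defined above. Because D is defined here as s/2, the expression reduces to 8 times t squared.

```python
def n_for_significance(s, t, d):
    """Per-group sample size: n = 2 * s^2 * t^2 / D^2."""
    return 2 * s**2 * t**2 / d**2


s = (3.4 + 4.4) / 2   # average standard deviation from the example (3.9)
t = 1.96              # tabled value used in the example
d = s / 2             # D, half the average standard deviation (1.95)

n = n_for_significance(s, t, d)
print(round(n, 1))
```

Evaluated directly, these inputs give about 30.7, since 2(3.9)^2 is 30.42. That is somewhat higher than the 28.4 obtained by hand above, so the hand arithmetic is worth rechecking before the figure is used for planning.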
Researchers often are concerned with power analysis. In a power analysis there are five statistical elements: level of significance (alpha), sample size (n), sample variance (s^2), effect size (epsilon) and power (1 - beta). These elements are related in such a way that given any four, the fifth can be readily determined. Power analysis can be used to determine the level of power achieved in a particular study (given a known effect size and sample size) or to estimate sample size (given an estimated effect size and the desired level of power). The terms are defined as follows:
For the example, it is assumed the level of significance = .05, and the hypothesis to be tested is nondirectional (two-sided). First, calculate epsilon, the effect-size index. The calculation of epsilon (for the unpaired t-test with equal variances) involves using the means for the energy cost for each type of AFO (X-bar) and the mean of the two standard deviations (s-bar):
$$\epsilon = \frac{\left|\bar{X}_{1} - \bar{X}_{2}\right|}{\bar{s}} = .74 \tag{2}$$
This result gives a relatively large effect-size index. An effect size equal to .74 suggests the difference between the means is 74 percent of the standard deviation.
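The effect-size index described above, the difference between the two means divided by the mean of the two standard deviations, can be sketched as follows. The function name `effect_size` is hypothetical. The standard deviations are those of the example, but the two means are placeholder values chosen only so that the result lands near .74; they are not the pilot-study means.

```python
def effect_size(mean_a, mean_b, sd_a, sd_b):
    """Effect-size index: |difference in means| / mean of the two SDs."""
    s_bar = (sd_a + sd_b) / 2
    return abs(mean_a - mean_b) / s_bar


# SDs (3.4, 4.4) come from the example; the means 15.0 and 12.1 are
# HYPOTHETICAL placeholders, not the pilot-study values.
eps = effect_size(15.0, 12.1, 3.4, 4.4)
print(round(eps, 2))
```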
To calculate sample size, it is necessary to refer to a power table that matches the two criteria of alpha and direction, i.e., .05, two-tailed (see Table 1).
The two coordinates of Table 1 are power (left column) and effect size (epsilon) (top row), with the cells containing the minimum number of subjects needed to meet the intersecting requirements. As can be seen from the power table, the selection of power has a major impact on the minimum sample size. The sample size estimate gained from method 1 indicated that 28 subjects would be required for each group, for any one of the three sets of coordinates shown in Table 2.
When power is low a large sample is needed to establish significance. The same is true when the means are very close and the ratio of the means/standard deviation is small. However, when the converse is true, i.e., if the effect size is large and the power is high, then valid statistics can be obtained using a smaller sample size.
Likewise, for a given effect size (.74 in our example), increasing the power requires increasing the sample size. For example, using the effect size column of .70 and the sample size calculated from method 1, it would be necessary to increase the number of subjects to 33 (power .80), 44 (power .90) or 76 (power .99). These increases represent an 18- to 171-percent increase over the number of subjects calculated earlier (n = 28) when ignoring the effect of power.
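When a power table is not at hand, a similar estimate can be sketched with a normal approximation. This is not the procedure used to build the power table; the function name `n_for_power` and the approximation itself are assumptions made for illustration.

```python
from math import ceil
from statistics import NormalDist  # standard library, Python 3.8+


def n_for_power(effect_size, power, alpha=0.05):
    """Approximate per-group n for a two-sided, two-group comparison.

    Normal approximation: n = 2 * (z_alpha + z_beta)^2 / effect_size^2,
    rounded up to the next whole subject.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # two-sided critical deviate
    z_beta = z.inv_cdf(power)            # deviate for the desired power
    return ceil(2 * (z_alpha + z_beta) ** 2 / effect_size**2)


for power in (0.80, 0.90, 0.99):
    print(power, n_for_power(0.70, power))
```

For an effect size of .70 this approximation gives 33, 43 and 75 subjects per group, within about one subject of the tabled figures of 33, 44 and 76 quoted above.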
The third approach to determining sample size involves calculating the minimum number of subjects required to detect a certain amount of change in the group means, whereas the two prior methods required that significance be established based on variability alone.
To calculate the minimum sample size required to detect a difference among group means, four quantities must be specified: 1) the size of the difference to be detected, delta; 2) the level of significance, alpha; 3) the chance of not detecting a difference of delta units, beta; and 4) the standard deviation, sigma.
Schlesselman (5) provides the following equation for determining sample size based on degree of impact of treatment:
$$n = \frac{2\sigma^{2}\left(Z_{\alpha} + Z_{\beta}\right)^{2}}{\delta^{2}} \tag{3}$$
The quantities Zalpha and Zbeta are unit normal deviates corresponding to the level of significance alpha and the type II error beta. Table 3 gives values for Zalpha and Zbeta for a range of values of alpha and beta. The deviates Zalpha and Zbeta correspond to the probability in the upper tail of the unit normal distribution. The quantity sigma is the standard deviation.
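Unit normal deviates of this kind can also be computed rather than read from a table. The sketch below uses Python's standard library. The function name `upper_tail_deviate` and the list of probabilities are illustrative assumptions, not a reproduction of Table 3. For a two-sided test at alpha = .05, the upper-tail probability is .025.

```python
from statistics import NormalDist  # standard library, Python 3.8+


def upper_tail_deviate(p):
    """Unit normal deviate that leaves probability p in the upper tail."""
    return NormalDist().inv_cdf(1 - p)


# Illustrative upper-tail probabilities; .025 corresponds to a
# two-sided alpha of .05.
for p in (0.005, 0.025, 0.05, 0.10, 0.20, 0.30):
    print(p, round(upper_tail_deviate(p), 2))
```

An upper-tail probability of .025 gives 1.96, .30 gives .52 and .10 gives 1.28.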
To illustrate Table 3, Equation (3) and Table 2 for determining the minimum number of subjects required, consider again the plastic vs. metal AFO comparison. The following are stipulated as given:
The following is required: What is the minimum number of subjects required for the sample? The solution is:
n = 2(5^2)(1.96 + .52)^2 / 3.6^2
n = 50(6.15/9.18)
n = 50 (.67)
n = 33.5 or 34 subjects
n = 2(5^2)(1.96 + 1.28)^2 / 3.6^2
n = 50 (10.5/9.18)
n = 50 (1.14)
n = 57 subjects
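The two calculations above share the form n = 2 sigma^2 (Zalpha + Zbeta)^2 / delta^2 and can be sketched as a small Python function. The function name `n_for_change` is hypothetical. The value of sigma (5) and the deviates are taken from the worked lines. As an assumption, delta is set so that delta squared equals 9.18, the denominator that appears in the worked arithmetic.

```python
def n_for_change(sigma, delta, z_alpha, z_beta):
    """Per-group n: 2 * sigma^2 * (Z_alpha + Z_beta)^2 / delta^2."""
    return 2 * sigma**2 * (z_alpha + z_beta) ** 2 / delta**2


sigma = 5.0
# ASSUMPTION: delta chosen so that delta^2 = 9.18, the denominator
# used in the worked arithmetic.
delta = 9.18 ** 0.5

n_a = n_for_change(sigma, delta, 1.96, 0.52)   # first calculation
n_b = n_for_change(sigma, delta, 1.96, 1.28)   # second calculation
print(round(n_a, 1), round(n_b, 1))
```

With these inputs the function returns about 33.5 and 57.2, in line with the two hand calculations; moving from a deviate of .52 to 1.28 for beta raises the requirement by roughly 70 percent.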
Three methods of estimating minimum sample size for a research study have been presented. The first method is based on the requirement for a specific level of significance. The second method involves the power and effect-size concepts. The third method is based on a required percent change as a result of treatments.
The influence of sample size on the power of a test is critical. The larger the sample, the greater the statistical power given a good research design and correct sampling techniques. Smaller samples are less likely to be good representations of population characteristics; therefore, true differences between groups are less likely to be recognized. When very small samples are used, as is often the case in clinical research, power is substantially reduced, and there is serious risk that an effective intervention will be lost.
By specifying a level of significance and a desired power in the planning stages of a study, a researcher can estimate how many subjects are needed to detect a significant difference for an expected effect size. The larger the effect size, the smaller the required sample of subjects. When the sample-size estimate is beyond realistic limits, a researcher may try to redesign the study by controlling variability in the sample or increasing effect size, or the researcher may decide not to conduct the study given the unlikelihood of obtaining significant results.
The lack of planning for the minimum sample size and overall power often results in a high probability of errors and needlessly wasted efforts. It is interesting to note the power of nonsignificant test results reported in the literature. In some cases, the clinical significance of a study will be greater than suggested by the statistical outcome as a result of the lack of power in the analysis.
A few of the inherent limitations to these techniques must be emphasized. For example, computing preliminary ideas about the magnitudes of the quantities to be estimated, or the differences to be detected and their standard deviations, is no easy task. Few studies are undertaken in total ignorance, and accumulated experience can be a guide in these matters. More serious is that no account has been taken of interrelationships among a constellation of variables. Such interrelationships usually are unknown or poorly understood. Although determining these interrelationships can be a Herculean task, it's a step in the right direction.
Another matter is that the outlined techniques assume the data have an approximately normal distribution (bell curve). The interested reader is referred to Pasternack (7), Cochran (8) or Bates (9) for additional material on how to handle these special situations.
THOMAS R. LUNSFORD, MSE, CO, is director of the orthotic department at The Institute for Rehabilitation and Research in Houston, Texas, and assistant professor of physical medicine and rehabilitation at Baylor College of Medicine.
BRENDA RAE LUNSFORD, MS, MAPT, is visiting assistant professor at Texas Woman's University in Houston, Texas, and physical therapist II at The Institute for Rehabilitation and Research.
References: