RESEARCH FORUM--The Research Sample,
Part II: Sample Size
Brenda Rae Lunsford, MS, MAPT
Thomas R. Lunsford, MSE, CO
ABSTRACT
The purpose of this article is to present the concepts involved
in selecting a minimum sample size of subjects to represent a
larger population. In determining sample size, it is important
that the sample studied adequately represents the population
to which the researcher is generalizing. The size of the study
should be considered early in the planning phase of a research study. Often no formal sample size is ever calculated.
Instead, the number of subjects available to the investigators
during some period of time determines the size of the study.
Many clinical trials that do not carefully consider sample
size requirements lack the power and ability to detect intervention effects of fairly substantial magnitude and clinical
importance.
Since the cost of a study is partially dependent on the
number of subjects sampled, it is important to determine the
fewest number of subjects required to yield valid results.
Therefore, two key elements in a research design are the
methodology used to select a sample and the minimum number of subjects chosen.
Three methods of determining the minimum sample size
are presented. The first method is based on the requirement
of a specific level of significance. The second method involves power and effect-size concepts. The third method is
based on a required percent change due to the treatments.
The inherent limitations of these techniques are reviewed,
and additional reading is suggested.
Introduction
The first two questions most researchers ask once a research project has been defined are: "How many subjects
will I need to complete my study?" and "How will I select
them?"
This article, "Part II," will attempt to present the factors
relevant to determining the minimum sample size. "Part I,"
which was published in the Summer JPO (7:3:105-12), addressed the issues related to selecting subjects for a research project (1).
In clinical research it would be ideal to include an entire
population of interest when conducting a study; this enables a generalization to be made about that population as
a whole. Because of cost, inaccessibility and time constraints, it usually is impossible to test all members of a population. Therefore, samples are drawn from the population
for testing purposes, and statistics are computed so the results can be generalized to the larger population.
An estimate of the number of subjects or observations
needed in a study is important to researchers to avoid discarding an effective intervention. If a characteristic varies,
the validity of any estimate of the parameter will be reduced as the size of the sample is decreased.
Researchers have developed a number of techniques in
which only a small portion of the total population is sampled, and attempts to generalize the results and conclusions
for the entire population are made.
Many clinical studies do not achieve their intended purposes because the researcher is unable to enroll enough
subjects. Therefore, at some point in planning a study, consideration should be given to sample size. Without some
idea of how large a difference is to be detected, how much
variation is present and what risks are to be tolerated, the
best alternative is to take as large a sample as possible.
In practice, sample size often is arbitrary. Since a researcher is interested in learning about a population, the
larger the sample studied, the more likely the measured
findings will be representative of the population parameters. The researcher is less likely to obtain negative results
or make incorrect inferences about the collected data when
samples are large rather than small. However, the larger
the sample, the more costly the study in terms of time and
money. Therefore, effort must be made to assess the probability of adequately sampling the population before the data are collected.
Currier simplifies the task by stating that when studying
relationships (correlation), at least 30 subjects should be
gathered, whereas in experimental studies involving the
comparison of groups, a minimum of 15 subjects is desirable (2).
This method of determining sample size is arbitrary and
not well-founded. Each study should be considered on its
own merits. The decision should be based on the judgment
of the researcher or an advisory committee.
Many clinical trials that do not carefully consider sample-size requirements lack the power or ability to detect intervention effects of fairly substantial magnitude and clinical importance.
The statistical tradition of testing for differences assumes
the researcher wishes to guard against two types of errors:
type I (reject the null hypothesis [H0] when it is true; in other
words, claim a difference among groups exists when in fact
it does not) and type II (accept H0 when it is false, i.e.,
claim no difference among groups exists when in fact it
does). The probability of a type I error is denoted alpha, and the
probability of a type II error is designated beta. The quantity
alpha is referred to as the level of significance of the test. The
quantity (1 - beta) is called the power of the test. Since neither
type of error is desirable, both alpha and beta should be small.
Power is also defined as the probability of correctly rejecting the null hypothesis (3). The null hypothesis is a
statement of no difference or no relationship between variables, interventions or devices (3).
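The two error rates can be made concrete with a small simulation. The Python sketch below is illustrative only: it assumes two groups drawn from normal populations with a known common standard deviation and applies a two-sided z-test, which is a simplification and not a procedure taken from this article. The function name, the group size of 15 and the true difference of 0.5 standard deviations are hypothetical choices.

```python
import random
from statistics import NormalDist


def rejection_rate(true_diff, n, sigma=1.0, alpha=0.05, trials=4000, seed=1):
    """Fraction of simulated two-group studies that reject H0.

    Assumes a two-sided z-test with known sigma (a simplifying assumption).
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se = sigma * (2 / n) ** 0.5  # standard error of the difference in means
    rejections = 0
    for _ in range(trials):
        group_a = [rng.gauss(0.0, sigma) for _ in range(n)]
        group_b = [rng.gauss(true_diff, sigma) for _ in range(n)]
        z = (sum(group_b) / n - sum(group_a) / n) / se
        if abs(z) > z_crit:
            rejections += 1
    return rejections / trials


# H0 true: every rejection is a type I error, so the rate estimates alpha.
type_i_rate = rejection_rate(true_diff=0.0, n=15)

# H0 false: every rejection is correct, so the rate estimates power (1 - beta).
power = rejection_rate(true_diff=0.5, n=15)
type_ii_rate = 1 - power

print(f"type I rate ~ {type_i_rate:.3f}, power ~ {power:.3f}, type II rate ~ {type_ii_rate:.3f}")
```

With no true difference the rejection rate should sit near alpha. With a true difference of half a standard deviation and only 15 subjects per group, most simulated studies fail to reject, which is the type II error discussed above.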
Freiman et al. reviewed the power of 71 published randomized controlled clinical trials that had failed to find significant differences between groups (4). "Sixty-seven of the trials had a greater than 10-percent risk of missing a true 25-percent therapeutic improvement, and with the same risk,
50 of the trials could have missed a 50-percent improvement" (2). The danger in studies with low statistical power
(sample sizes too small) is that interventions that could be beneficial are discarded without adequate testing and may never be considered again.
Determining Sample Size
Three approaches to determining sample size are presented here. In all methods, some knowledge of the behavior of
the data, such as the standard deviation or variance, is necessary. The variance is simply
the square of the standard deviation (3). If a published
study used a similar population, then it may be possible to
use the variance from that study. However, if such a study
cannot be found, a pilot study with a small sample size must
be conducted to obtain an estimate of the variance for calculating sample size. In either case, an estimate of the variance must be calculated.
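As a rough illustration of the pilot-study step, the following Python sketch estimates the standard deviation and variance from a small sample. The five measurements are hypothetical placeholders and are not the pilot data from this article; the variable names are likewise assumptions made for the example.

```python
from statistics import mean, stdev, variance

# Hypothetical pilot measurements (ml O2/kg/min); NOT the article's pilot data.
pilot_energy_cost = [14.2, 18.9, 11.5, 20.3, 16.1]

s = stdev(pilot_energy_cost)             # sample standard deviation (n - 1 denominator)
s_squared = variance(pilot_energy_cost)  # sample variance (n - 1 denominator)

print(f"mean = {mean(pilot_energy_cost):.2f}")
print(f"s = {s:.2f}, s^2 = {s_squared:.2f}")
```

The sample variance returned here equals the square of the sample standard deviation, so either quantity can be carried into the sample-size calculations.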
- Method 1: In the first method, the minimum number of
subjects needed in each group of a study can be estimated
based on a strategy of detecting a significant difference
(2,5).
- Method 2: The second method depends on effect size
and power (4).
- Method 3: The third method answers the question,
"How many subjects are needed to detect a certain percent
change or effect due to treatment?" (6).
Example Calculations
A large study is being planned involving incomplete spinal
injured individuals wearing biomechanically equivalent
metal or plastic ankle-foot orthoses (AFOs). The energy
cost of the patients will be measured to determine which
type of AFO requires significantly less energy. For scheduling and budgetary purposes, an estimate of the minimum
sample size must be determined before beginning the
study.
The three sample-size methods will be compared with a
data set from the example. The example used involves a pilot study of a small group (n=5) of incomplete spinal injured patients who were tested for their energy cost of
walking (milliliters of oxygen consumed per kilogram body
weight per minute) while wearing metal or plastic AFOs.
The results were as follows:
Method 1
This method was adapted from Borg and Gall (2,5) and requires knowledge of a previously determined standard deviation (variance) and a t-test value at an estimated sample
size. The number of subjects needed (in each group) to detect a significant difference between these two clinical situations can be calculated as follows:
$$ n = \frac{2 s^2 t^2}{D^2} \tag{1} $$
where
n = minimum number of subjects needed to achieve
significance at 0.05
s = average standard deviation for the two groups (In
the above example, the average standard deviation
for metal and plastic AFOs = [3.4 + 4.4]/2 = 3.9.)
t = t-test value (For the AFO example above with a
two-tailed distribution (2,3), .05 level of significance and 30 subjects, the t-value is 1.96. This value
may be obtained from a t-test table in any statistics
text (2,3). The selection of 30 subjects is arbitrary,
but was chosen in this case since the two sample
means and variances are so similar it was appropriate to estimate at a higher level.)
D = half of the mean standard deviation of the two
groups (In this example, D = 3.9/2 = 1.95.)
Using Equation (1), it is possible to calculate the minimum
sample size as follows:
n = [2(3.9)^2 x 1.96^2] / 1.95^2
n = (28.08 x 3.84)/3.80
n = 107.83/3.80
n = 28.4
Using method 1, a minimum of 28 subjects in each group is
required to detect a significant difference in the energy cost
of the subjects walking with plastic vs. metal AFOs.
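The method 1 calculation can be checked with a few lines of Python. This is a minimal sketch that assumes the expression used in the worked arithmetic above, n = 2 s^2 t^2 / D^2; the function and variable names are hypothetical.

```python
from math import ceil


def method1_sample_size(s, t, d):
    """Subjects per group from n = 2 * s^2 * t^2 / D^2 (form assumed from the worked example)."""
    return 2 * s ** 2 * t ** 2 / d ** 2


s_bar = (3.4 + 4.4) / 2   # average standard deviation of the two groups
t_value = 1.96            # tabled value used in the example
d = s_bar / 2             # D is defined as half the mean standard deviation

n = method1_sample_size(s_bar, t_value, d)
subjects_per_group = ceil(n)  # round up to whole subjects

print(f"n = {n:.1f}, rounded up to {subjects_per_group}")
```

Because D is defined as s/2, the expression reduces to 8t^2, so s cancels out of the result. Evaluated directly, 2(3.9)^2 is 30.42 and n comes to about 30.7, slightly higher than the 28.4 reached by hand above, so the hand arithmetic is worth rechecking before the figure is used for planning.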
Method 2
Researchers often are concerned with power analysis. In a
power analysis there are five statistical elements: level of
significance, sample size (n), sample variance (s^2), effect
size (epsilon) and power (1 - beta). These elements are related in
such a way that given any four, the fifth can be readily determined. Power analysis can be used to determine the level of power achieved in a particular study (given a known
effect size and sample size) or to estimate sample size (given an estimated effect size and the desired level of power).
The terms are defined as follows:
- Level of significance is the level of risk a researcher is
willing to take in making an incorrect assumption about the
null hypothesis.
- Sample variance is the measure of spread of the data.
- Effect size (epsilon) is the measure of the magnitude of differences between the sample means (3). Effect size is difficult to apply directly to statistical formulas since it is expressed in units
of measurement; therefore, an index of effect size has been
developed that enables a more universal application. This
index is unit-free, like the t, F or chi-square statistics (3). The
effect-size index therefore is the ratio of the difference between two sample means to a common standard deviation. To supplement this definition, the effect-size index has
been tabled as follows:
- Large effect size (epsilon = .80) implies there is a large degree of separation and therefore very little potential
overlap between groups. These differences should be obvious by observation, and statistics should be applied only to legitimately document them.
- Medium effect size (epsilon = .50) implies an observable
difference may be noted by the trained observer. Testing
is necessary to verify the differences.
- Small effect size (epsilon = .20) implies the differences, if
any, are so small as to be invisible to the observer, and
testing is necessary to identify differences in performance and/or behaviors.
- Power (1 - beta) is the probability that a researcher will
correctly reject the null hypothesis.
Three constants must be determined before this method can be applied to the
same data set above:
- The level of significance (alpha) for testing, i.e., .05, .01, etc.
In the medical field it is common to set alpha = .05.
- Whether the hypothesis implies a direction or is nondirectional. A nondirectional hypothesis could be, "The energy
cost of walking is significantly different for spinal cord injured patients when wearing plastic AFOs vs. metal AFOs."
In this case the researcher does not care which is better;
he/she is simply interested in determining "Is there a difference?" A directional hypothesis could be, "The energy
cost of walking is significantly less for spinal cord injured
patients when wearing plastic vs. metal AFOs." In this case
the researcher does care and is interested in determining
"Which type of AFO is significantly better in terms of energy cost?"
- The power desired (e.g., .7, .8, .9 or .99).
For the example, it is assumed the level of significance =
.05, and the hypothesis to be tested is nondirectional (two-sided). First calculate epsilon, the effect-size index. The calculation of epsilon (for the unpaired t-test with equal variances) involves using the means for the energy cost for each type of
AFO (X-bar) and the mean of the two standard deviations
(s-bar):
$$ \epsilon = \frac{|\bar{X}_{\mathrm{metal}} - \bar{X}_{\mathrm{plastic}}|}{\bar{s}} = .74 \tag{2} $$
This result gives a relatively large effect-size index epsilon. An effect size equal to .74 suggests the difference between the
means is 74 percent of the standard deviation.
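The effect-size index is simple to compute once the group means are known. The Python sketch below uses the standard deviations quoted earlier (3.4 and 4.4), but the two group means are hypothetical values chosen only so that the index lands near .74; they are not the pilot results, and the function name is an assumption.

```python
def effect_size_index(mean_a, mean_b, sd_a, sd_b):
    """Absolute difference in means divided by the mean of the two standard deviations."""
    s_bar = (sd_a + sd_b) / 2
    return abs(mean_a - mean_b) / s_bar


# Hypothetical group means (ml O2/kg/min); NOT the article's pilot data.
mean_metal = 20.0
mean_plastic = 17.1

epsilon = effect_size_index(mean_metal, mean_plastic, sd_a=3.4, sd_b=4.4)
print(f"epsilon = {epsilon:.2f}")
```

Since the index is a ratio of two quantities in the same units, it is unit-free and can be compared with the small (.20), medium (.50) and large (.80) benchmarks listed above.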
To calculate sample size it is necessary to refer to a power table that matches the two criteria of alpha and direction, i.e.,
.05, two-tailed (see Table 1).
The two coordinates of Table 1 are power (left column)
and effect size (epsilon) (top row) with the cells containing the
minimum number of subjects needed to meet the intersecting requirements. As can be seen from the power table, the
selection of power has a major impact on the minimum
sample size. The sample size estimate gained from method
1 indicated that 28 subjects would be required for each
group, for any one of the three sets of coordinates shown in
Table 2.
A large sample is needed to establish
significance when the means are very
close and the ratio of the difference between the means to the standard deviation is small.
However, when the converse is true, i.e., if the effect size is
large, valid statistics can be obtained using a smaller sample size.
Likewise, given the effect size (.74 in our example),
if it is desired to increase the power, it is necessary to increase the sample size. For example, using the effect size
column of .70, and the calculated sample size from method
1, it would be necessary to increase the number of subjects
to 33 (power .80), 44 (power .90) or 76 (power .99).
These increases represent an 18- to 171-percent increase
over the number of subjects calculated earlier (n = 28)
when ignoring the effect of power.
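When a power table is not at hand, the table lookup can be approximated in code. The Python sketch below uses the common normal-approximation formula n = 2(z_alpha + z_beta)^2 / epsilon^2 for a two-group comparison. This formula is a stand-in assumed for illustration, not the source of the article's power table, and the function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size, power, alpha=0.05, two_sided=True):
    """Approximate subjects per group via the normal approximation (an assumption)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2) if two_sided else z.inv_cdf(1 - alpha)
    z_beta = z.inv_cdf(power)
    return ceil(2 * (z_alpha + z_beta) ** 2 / effect_size ** 2)


for target_power in (0.80, 0.90, 0.99):
    print(target_power, n_per_group(0.70, target_power))
```

For an effect size of .70 this gives 33, 43 and 75 subjects per group at powers of .80, .90 and .99. Those figures are equal to or one below the tabled values of 33, 44 and 76 quoted above, as expected, because tables based on the t distribution include a small-sample correction that the normal approximation omits.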
Method 3
The third approach to determining sample size involves
calculating the minimum number of subjects required to
detect a certain amount of change in the group means,
whereas the two prior methods required that significance
be established based on variability alone.
To calculate the minimum sample size required to detect
a difference among group means, four quantities must be
specified: 1) the size of the difference to be detected, delta; 2) the level of significance, alpha; 3) the chance of not detecting a difference of delta units, beta; and 4) the standard deviation, sigma.
Schlesselman (6) provides the following equation for determining sample size based on degree of impact of treatment:
$$ n = \frac{2 \sigma^2 (Z_\alpha + Z_\beta)^2}{\delta^2} \tag{3} $$
The quantities Zalpha and Zbeta are unit normal deviates corresponding to the level of significance alpha and the type II error
beta. Table 3 gives values for Zalpha and Zbeta for a range of values
of alpha and beta. The deviates Zalpha and Zbeta correspond to the probability in the upper tail of the unit normal distribution. The
quantity sigma is the standard deviation.
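The unit normal deviates can also be computed directly rather than read from a table. The Python sketch below uses the standard library's inverse normal distribution; the helper name is hypothetical. It assumes that the value of 1.96 used in the worked example corresponds to a two-sided test at alpha = .05, that is, an upper-tail probability of .025.

```python
from statistics import NormalDist


def upper_tail_z(p):
    """Unit normal deviate with probability p in the upper tail."""
    return NormalDist().inv_cdf(1 - p)


z_alpha = upper_tail_z(0.05 / 2)  # two-sided alpha = .05 (assumed reading of the example)
z_beta_30 = upper_tail_z(0.30)    # beta = .30, i.e. power = .70
z_beta_10 = upper_tail_z(0.10)    # beta = .10, i.e. power = .90

print(f"{z_alpha:.2f} {z_beta_30:.2f} {z_beta_10:.2f}")  # prints "1.96 0.52 1.28"
```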
To illustrate the use of Table 3, Equation (3) and Table 2 for determining the minimum number of subjects required, consider again the plastic vs. metal AFO comparison. The following are stipulated as given:
- A 20-percent improvement in walking velocity is required when the subjects are wearing plastic vs. metal
AFOs.
- The level of significance is specified to be alpha = .05.
- The power is specified to be (1 - beta) = .70 (i.e., beta = .30).
- The average velocity of subjects wearing metal AFOs
is X-bar_metal = 18 m/min.
- The standard deviation of this velocity is sigma = 5 m/min.
- The null hypothesis is there is no difference in the subjects'
velocity when wearing metal or plastic AFOs, i.e., H0: X-bar_plastic - X-bar_metal = 0.
- The alternative hypothesis is the subjects' velocities when
wearing plastic AFOs are greater than when wearing metal AFOs, i.e., H1: X-bar_plastic - X-bar_metal > 0.
The following is required: What is the minimum number
of subjects required for the sample? The solution is:
- From Table 2, Zalpha = 1.96 for alpha = .05.
- From Table 2, Zbeta = .52 for beta = .30.
- The change in walking velocity is delta = 20 percent of 18
m/min = 3.6 m/min.
- Using Equation (3), the minimum number of subjects
required by the sample is given by
n = 2(5^2)(1.96 + .52)^2/3.6^2
n = 50(6.15/12.96)
n = 50(.4745)
n = 23.7 or 24 subjects
If the power is instead specified as .90 (beta = .10, Zbeta = 1.28):
n = 2(5^2)(1.96 + 1.28)^2/3.6^2
n = 50(10.5/12.96)
n = 50(.81)
n = 40.5 or 41 subjects
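The method 3 arithmetic can be verified with a short Python sketch. It assumes Equation (3) has the form implied by the substitution above, n = 2 sigma^2 (Zalpha + Zbeta)^2 / delta^2; the function and variable names are hypothetical.

```python
from math import ceil


def method3_sample_size(sigma, z_alpha, z_beta, delta):
    """Subjects needed from n = 2 * sigma^2 * (z_alpha + z_beta)^2 / delta^2 (assumed form)."""
    return 2 * sigma ** 2 * (z_alpha + z_beta) ** 2 / delta ** 2


delta = 0.20 * 18  # 20 percent of 18 m/min = 3.6 m/min

n_power_70 = method3_sample_size(sigma=5, z_alpha=1.96, z_beta=0.52, delta=delta)
n_power_90 = method3_sample_size(sigma=5, z_alpha=1.96, z_beta=1.28, delta=delta)

print(f"power .70: n = {n_power_70:.1f} -> {ceil(n_power_70)} subjects")
print(f"power .90: n = {n_power_90:.1f} -> {ceil(n_power_90)} subjects")
```

The denominator is 3.6 squared, or 12.96, so the two cases evaluate to about 23.7 and 40.5 before rounding up to whole subjects. Raising the required power from .70 to .90 increases the estimate by roughly 70 percent.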
Summary and Conclusions
Three methods of estimating minimum sample size for a research study have been presented. The first method is
based on the requirement for a specific level of significance. The second method involves the power and effect-size concepts. The third method is based on a required percent change as a result of treatments.
The influence of sample size on the power of a test is critical. The larger the sample, the greater the statistical power
given a good research design and correct sampling techniques. Smaller samples are less likely to be good representations of population characteristics; therefore, true differences between groups are less likely to be recognized.
When very small samples are used, as is often the case in
clinical research, power is substantially reduced, and there
is serious risk that an effective intervention will be lost.
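The relationship between sample size and power can be sketched numerically. The Python example below uses a normal approximation for a two-sided, two-group comparison at alpha = .05. The approximation, the function name and the medium effect size of .50 are assumptions made for illustration, not values taken from the pilot study.

```python
from statistics import NormalDist


def approx_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-group test (normal approximation)."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = effect_size * (n_per_group / 2) ** 0.5  # standardized true difference
    # Probability that the test statistic falls in either rejection tail.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


powers = {n: approx_power(0.50, n) for n in (5, 15, 30, 64)}
for n, p in powers.items():
    print(f"n per group = {n:3d}: power ~ {p:.2f}")
```

Under these assumptions, a study with 5 subjects per group would detect a medium effect only about one time in eight, while roughly 64 per group are needed before power exceeds .80.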
By specifying a level of significance and a desired power
in the planning stages of a study, a researcher can estimate
how many subjects are needed to detect a significant difference for an expected effect size. The larger the effect
size, the smaller the required sample of subjects. When the
sample-size estimate is beyond realistic limits, a researcher
may try to redesign the study by controlling variability in
the sample or increasing effect size, or the researcher may
decide not to conduct the study given the unlikelihood of
obtaining significant results.
The lack of planning for the minimum sample size and
overall power often results in a high probability of errors
and needlessly wasted efforts. It is interesting to note the
power of nonsignificant test results reported in the literature. In some cases, the clinical significance of a study will
be greater than suggested by the statistical outcome as a result of the lack of power in the analysis.
A few of the inherent limitations to these techniques
must be emphasized. For example, computing preliminary
ideas about the magnitudes of the quantities to be estimated, or the differences to be detected and their standard deviations, is no easy task. Few studies are undertaken in total
ignorance, and accumulated experience can be a guide in
these matters. More serious is that no account has been taken of interrelationships among a constellation of variables.
Such interrelationships usually are unknown or poorly understood. Although determining these interrelationships
can be a Herculean task, it's a step in the right direction.
Another matter is that the outlined techniques assume the
data have an approximately normal distribution (bell curve).
The interested reader is referred to Pasternack (7),
Cochran (8) or Bates (9) for additional material on how to
handle these special situations.
THOMAS R. LUNSFORD, MSE, CO, is director of the orthotic
department at The Institute for Rehabilitation and Research in
Houston, Texas, and assistant professor of physical medicine and
rehabilitation at Baylor College of Medicine.
BRENDA RAE LUNSFORD, MS, MAPT is visiting assistant
professor at Texas Woman's University in Houston, Texas, and
physical therapist II at The Institute for Rehabilitation and Research.
References:
- Lunsford TR, Lunsford BR. The research sample, part I:
sampling. JPO 1995; 7:3:105-12.
- Currier DP. Elements of research in physical therapy. 2nd ed.
Baltimore: Williams and Wilkins, 1984.
- Portney LG, Watkins MP. Foundations of clinical research.
Applications to practice. Norwalk, Conn.: Appleton and Lange,
1993.
- Freiman JA, Chalmers TC, Smith H Jr. et al. The importance
of beta, the type II error and sample size in the design and interpretation of the randomized control trial: survey of 71 "negative"
trials. N Engl J Med 1978; 299:690-4.
- Borg WR, Gall MD. Educational research: an introduction.
3rd ed. New York: Longman, 1979.
- Schlesselman JJ. Planning a longitudinal study: I. sample size
determination. J Chron Dis 1973; 26:553-60.
- Pasternack BS, Gilbert HS. Planning the duration of long-term survival time studies designed for accrual by cohorts. J
Chron Dis 1971; 24:681-700.
- Cochran WG. The planning of observational studies of human populations. J R Stat Soc 1965; 128A:234-65.
- Bates PB. Longitudinal and cross-sectional sequences in the
study of age and generation effects. Hum Dev 1968; 11:145-71.