RESEARCH FORUM--The Research Sample, Part II: Sample Size

Brenda Rae Lunsford, MS, MAPT
Thomas R. Lunsford, MSE, CO

ABSTRACT

The purpose of this article is to present the concepts involved in selecting a minimum sample size of subjects to represent a larger population. In determining sample size, it is important that the sample studied adequately represents the population to which the researcher is generalizing. The size of the study should be considered early in the planning phase of a research study. Often no formal sample size is ever calculated. Instead, the number of subjects available to the investigators during some period of time determines the size of the study. Many clinical trials that do not carefully consider sample size requirements lack the power and ability to detect intervention effects of fairly substantial magnitude and clinical importance.

Since the cost of a study is partially dependent on the number of subjects sampled, it is important to determine the fewest number of subjects required to yield valid results. Therefore, two key elements in a research design are the methodology used to select a sample and the minimum number of subjects chosen.

Three methods of determining the minimum sample size are presented. The first method is based on the requirement of a specific level of significance. The second method involves power and effect-size concepts. The third method is based on a required percent change due to the treatments.

The inherent limitations of these techniques are reviewed, and additional reading is suggested.

Introduction

The first two questions most researchers ask once a research project has been defined are: "How many subjects will I need to complete my study?" and "How will I select them?"

This article, "Part II," will attempt to present the factors relevant to determining the minimum sample size. "Part I," which was published in the Summer 1995 issue (7:3:105-12), addressed the issues related to selecting subjects for a research project (1).

In clinical research it would be ideal to include an entire population of interest when conducting a study; this enables a generalization to be made about that population as a whole. Because of cost, inaccessibility and time constraints, it usually is impossible to test all members of a population. Therefore, samples are drawn from the population for testing purposes, and statistics are computed so the results can be generalized to the larger population.

An estimate of the number of subjects or observations needed in a study is important to researchers to avoid discarding an effective intervention. If a characteristic varies, the validity of any estimate of the parameter will be reduced as the size of the sample is decreased.

Researchers have developed a number of techniques in which only a small portion of the total population is sampled and the results and conclusions are generalized to the entire population.

Many clinical studies do not achieve their intended purposes because the researcher is unable to enroll enough subjects. Therefore, at some point in planning a study, consideration should be given to sample size. Without some idea of how large a difference is to be detected, how much variation is present and what risks are to be tolerated, the best alternative is to take as large a sample as possible.

In practice, sample size often is arbitrary. Since a researcher is interested in learning about a population, the larger the sample studied, the more likely the measured findings will be representative of the population parameters. The researcher is less likely to obtain negative results or make incorrect inferences about the collected data when samples are large rather than small. However, the larger the sample, the more costly the study in terms of time and money. Therefore, effort must be made to assess the probability of adequately sampling the population before the data are collected.

Currier simplifies the task by stating that when studying relationships (correlation), at least 30 subjects should be gathered, whereas in experimental studies involving the comparison of groups, a minimum of 15 subjects is desirable (2).

This method of determining sample size is arbitrary and not well-founded. Each study should be considered on its own merits. The decision should be based on the judgment of the researcher or an advisory committee.

Many clinical trials that do not carefully consider sample-size requirements lack the power or ability to detect intervention effects of fairly substantial magnitude and clinical importance.

The statistical tradition of testing for differences assumes the researcher wishes to guard against two types of errors: type I (rejecting the null hypothesis [Ho] when it is true; in other words, claiming a difference among groups exists when in fact it does not) and type II (accepting the Ho when it is false; i.e., claiming no difference among groups exists when in fact one does). The probability of a type I error is denoted alpha, and the probability of a type II error is designated beta. The quantity alpha is referred to as the level of significance of the test. The quantity (1 - beta) is called the power of the test. Since neither type of error is desirable, both alpha and beta should be small.

Power is also defined as the probability of correctly rejecting the null hypothesis (3). The null hypothesis is a statement of no difference or no relationship between variables, interventions or devices (3).
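
To make these definitions concrete, power can be estimated by simulation: repeat a two-group experiment many times under an assumed true difference and count how often the null hypothesis is correctly rejected. The following Python sketch (with purely illustrative values for the difference, standard deviation and group size; they are not taken from any particular study) demonstrates the idea:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    alpha = 0.05       # accepted risk of a type I error
    true_diff = 2.9    # assumed true difference between the group means
    sd = 3.9           # assumed common standard deviation
    n = 30             # subjects per group
    trials = 5000      # number of simulated experiments

    rejections = 0
    for _ in range(trials):
        a = rng.normal(0.0, sd, n)        # control group
        b = rng.normal(true_diff, sd, n)  # treatment group
        _, p = stats.ttest_ind(a, b)      # unpaired, two-tailed t-test
        if p < alpha:
            rejections += 1

    # the fraction of correct rejections approximates the power, 1 - beta
    print("empirical power:", rejections / trials)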

Freiman et al. reviewed the power of 71 published randomized controlled clinical trials that had failed to find significant differences between groups (4). "Sixty-seven of the trials had a greater than 10-percent risk of missing a true 25-percent therapeutic improvement, and with the same risk, 50 of the trials could have missed a 50-percent improvement" (2). The danger in studies with low statistical power (sample sizes too small) is that interventions that could be beneficial are discarded without adequate testing and may never be considered again.

Determining Sample Size

Three approaches to determining sample size are presented here. In all methods, some knowledge of the behavior of the data, such as the standard deviation or variance, is necessary; the variance is simply the square of the standard deviation (3). If a published study used a similar population, it may be possible to use the variance from that study. If such a study cannot be found, a pilot study with a small sample must be conducted to obtain an estimate of the variance for calculating sample size. In either case, an estimate of the variance is required.
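
When a pilot study must be conducted, the estimate itself is straightforward. A minimal Python sketch, using hypothetical pilot measurements rather than the study data, follows:

    import statistics

    # hypothetical pilot measurements of energy cost (ml O2/kg/min)
    pilot = [10.2, 14.8, 12.5, 16.1, 11.9]

    s = statistics.stdev(pilot)        # sample standard deviation
    var = statistics.variance(pilot)   # sample variance, the square of s
    print(f"s = {s:.2f}, s^2 = {var:.2f}")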

  • Method 1: In the first method, the minimum number of subjects needed in each group of a study can be estimated based on a strategy of detecting a significant difference (2,5).
  • Method 2: The second method depends on effect size and power (4).
  • Method 3: The third method answers the question, "How many subjects are needed to detect a certain percent change or effect due to treatment?" (6).

Example Calculations

A large study is being planned involving incomplete spinal injured individuals wearing biomechanically equivalent metal or plastic ankle-foot orthoses (AFOs). The energy cost of the patients will be measured to determine which type of AFO requires significantly less energy. For scheduling and budgetary purposes, an estimate of the minimum sample size must be determined before beginning the study.

The three sample-size methods will be compared with a data set from the example. The example used involves a pilot study of a small group (n=5) of incomplete spinal injured patients who were tested for their energy cost of walking (milliliters of oxygen consumed per kilogram body weight per minute) while wearing metal or plastic AFOs. The results were as follows:

[Pilot results table not reproduced here; from the calculations that follow, the standard deviations of energy cost were 3.4 (metal AFOs) and 4.4 (plastic AFOs) ml O2/kg/min.]

Method 1

This method was adapted from Borg and Gall (2,5) and requires knowledge of a previously determined standard deviation and a t-test value at an estimated sample size. The number of subjects needed (in each group) to detect a significant difference between these two clinical situations can be calculated as follows:

n = 2s^2t^2 / D^2    (1)

where

n = minimum number of subjects needed to achieve significance at 0.05

s = average standard deviation for the two groups (In the above example, the average standard deviation for metal and plastic AFOs is [3.4 + 4.4]/2 = 3.9.)

t = t-test value (For the AFO example above, with a two-tailed distribution, a .05 level of significance and 30 subjects, the t-value is 1.96. This value may be obtained from a t-test table in any statistics text (2,3). The selection of 30 subjects is arbitrary but was chosen in this case since the two sample means and variances are so similar it was appropriate to estimate at a higher level.)

D = half of the mean standard deviation of the two groups (In this example, D = 3.9/2 = 1.95.)

Using Equation (1), it is possible to calculate the minimum sample size as follows:

n = [2(3.9)^2 x (1.96)^2] / (1.95)^2

n = (30.42 x 3.84) / 3.80

n = 116.81 / 3.80

n = 30.7

Using method 1, a minimum of 31 subjects (rounding up) in each group is required to detect a significant difference in the energy cost of the subjects walking with plastic vs. metal AFOs.
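
The calculation can also be written as a short Python function; the inputs shown are those of the example, with the t-value assumed to be read from a t-table:

    import math

    def method1_n(s, t, d):
        # Equation (1): n = 2 * s^2 * t^2 / D^2, rounded up to a whole subject
        return math.ceil(2 * s**2 * t**2 / d**2)

    s = (3.4 + 4.4) / 2        # average standard deviation, 3.9
    t = 1.96                   # two-tailed t-value at .05 and ~30 subjects
    d = s / 2                  # difference to be detected, 1.95
    print(method1_n(s, t, d))  # 31 subjects per group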

Method 2

Researchers often are concerned with power analysis. In a power analysis there are five statistical elements: level of significance (alpha), sample size (n), sample variance (s^2), effect size (epsilon) and power (1 - beta). These elements are related in such a way that given any four, the fifth can be readily determined. Power analysis can be used to determine the level of power achieved in a particular study (given a known effect size and sample size) or to estimate sample size (given an estimated effect size and the desired level of power). The terms are defined as follows:

  • Level of significance is the level of risk a researcher is willing to take of incorrectly rejecting a true null hypothesis.
  • Sample variance is the measure of spread of the data.
  • Effect size (epsilon) is a measure of the magnitude of the difference between the sample means (3). Effect size is difficult to apply directly to statistical formulas since it is expressed in the units of the measurement; therefore, an index of effect size has been developed that enables a more universal application. This index is unit-free, like the t, F or chi-square statistics (3). The effect-size index is the ratio of the difference between the two sample means to a common standard deviation. To supplement this definition, the effect-size index has been tabled as follows:
    • Large effect size (epsilon = .80) implies there is a large degree of separation and therefore very little potential overlap between groups. These differences should be obvious by observation, and statistics should be applied only to legitimately document them.
    • Medium effect size (epsilon = .50) implies an observable difference may be noted by the trained observer. Testing is necessary to verify the differences.
    • Small effect size (epsilon = .20) implies the differences, if any, are so small as to be invisible to the observer, and testing is necessary to identify differences in performance and/or behaviors.
  • Power (1 - beta) is the probability that a researcher will correctly reject the null hypothesis. Three constants must be determined before this method can be applied to the same data set above:
    1. The level of significance (alpha) for testing, i.e., .05, .01, etc. In the medical field it is common to set alpha = .05.
    2. Whether the hypothesis infers direction or is nondirectional. A nondirectional hypothesis could be, "The energy cost of walking is significantly different for spinal cord injured patients when wearing plastic AFOs vs. metal AFOs." In this case the researcher does not care which is better; he/she is simply interested in determining "Is there a difference?" A directional hypothesis could be, "The energy cost of walking is significantly less for spinal cord injured patients when wearing plastic vs. metal AFOs." In this case the researcher does care and is interested in determining "Which type of AFO is significantly better in terms of energy cost?"
    3. The power desired (e.g., .7, .8, .9 or .99).

For the example, it is assumed the level of significance is .05, and the hypothesis to be tested is nondirectional (two-tailed). First, calculate epsilon, the effect-size index. The calculation of epsilon (for the unpaired t-test with equal variances) involves the means of the energy cost for each type of AFO (X-bar) and the mean of the two standard deviations (s-bar):

epsilon = (X-bar_1 - X-bar_2) / s-bar = .74    (2)

This result is a relatively large effect-size index. An effect size equal to .74 suggests the difference between the means is 74 percent of the standard deviation.
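
As a sketch of this computation, the effect-size index can be expressed as a small Python function. Because the pilot means are not reproduced above, the means used below are hypothetical values chosen only so the result matches the .74 of the example:

    def effect_size_index(mean_a, mean_b, sd_a, sd_b):
        # ratio of the difference between means to the mean standard deviation
        s_bar = (sd_a + sd_b) / 2
        return abs(mean_a - mean_b) / s_bar

    # hypothetical means; standard deviations are those of the AFO pilot data
    print(round(effect_size_index(38.0, 35.1, 3.4, 4.4), 2))  # 0.74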

To calculate sample size it is necessary to refer to a power table that matches the two criteria of alpha and direction, i.e., .05, two-tailed (see Table 1).

The two coordinates of Table 1 are power (left column) and effect size (epsilon) (top row), with the cells containing the minimum number of subjects needed to meet the intersecting requirements. As can be seen from the power table, the selection of power has a major impact on the minimum sample size. The sample-size estimate gained from method 1 indicated that 31 subjects would be required for each group, for any one of the three sets of coordinates shown in Table 2.

When the desired power is high, a large sample is needed to establish significance. The same is true when the means are very close and the ratio of the difference between the means to the standard deviation is small. Conversely, when the effect size is large, adequate power can be achieved with a smaller sample.

Likewise, for a given effect size (.74 in our example), increasing the desired power requires increasing the sample size. For example, using the effect-size column of .70 and the sample size calculated in method 1, it would be necessary to increase the number of subjects to 33 (power = .80), 44 (power = .90) or 76 (power = .99). These represent increases of roughly 6 percent to 145 percent over the number of subjects calculated earlier (n = 31), which ignored the effect of power.
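
The table lookup can be approximated in Python using unit normal deviates in place of the t distribution; because of that approximation, the sketch below yields somewhat smaller sample sizes than the tabled values:

    import math
    from scipy.stats import norm

    def n_per_group(epsilon, power, alpha=0.05):
        # normal approximation: n = 2 * ((Z_alpha/2 + Z_beta) / epsilon)^2
        z_alpha = norm.ppf(1 - alpha / 2)  # 1.96 for alpha = .05, two-tailed
        z_beta = norm.ppf(power)           # e.g., .84 for power = .80
        return math.ceil(2 * ((z_alpha + z_beta) / epsilon) ** 2)

    for power in (0.80, 0.90, 0.99):
        print(power, n_per_group(0.74, power))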

Method 3

The third approach to determining sample size involves calculating the minimum number of subjects required to detect a certain amount of change in the group means, whereas the two prior methods required that significance be established based on variability alone.

To calculate the minimum sample size required to detect a difference among group means, four quantities must be specified: 1) the size of the difference to be detected (delta); 2) the level of significance (alpha); 3) the chance of not detecting a difference of delta units (beta); and 4) the standard deviation (sigma).

Schlesselman (6) provides the following equation for determining sample size based on the degree of impact of treatment:

n = 2(sigma^2)(Zalpha + Zbeta)^2 / delta^2    (3)

The quantities Zalpha and Zbeta are unit normal deviates corresponding to the level of significance alpha and the type II error beta; each corresponds to the probability in the upper tail of the unit normal distribution. Table 3 gives values of Zalpha and Zbeta for a range of values of alpha and beta. The quantity sigma is the standard deviation.

To illustrate the use of Equation (3) and Table 3 in determining the minimum number of subjects required, consider again the plastic vs. metal AFO comparison. The following are stipulated as given:

  1. A 20-percent improvement in walking velocity is required when the subjects are wearing plastic vs. metal AFOs.
  2. The level of significance is specified to be alpha = .05.
  3. The power is specified to be (1 - beta) = .70 (i.e., beta = .30).
  4. The average velocity of subjects wearing metal AFOs is X-bar_metal = 18 m/min.
  5. The standard deviation of this velocity is sigma = 5 m/min.
  6. The null hypothesis is there is no difference in the subjects' velocity when wearing metal or plastic AFOs, i.e., Ho: X-bar_plastic - X-bar_metal = 0.
  7. The alternative hypothesis is the subjects' velocities when wearing plastic AFOs are greater than when wearing metal AFOs, i.e., H1: X-bar_plastic - X-bar_metal > 0.

The question to be answered is: What is the minimum number of subjects required for the sample? The solution is:

  1. From Table 3, Zalpha = 1.96 for alpha = .05.
  2. From Table 3, Zbeta = .52 for beta = .30.
  3. The change in walking velocity is delta = 20 percent of 18 m/min = 3.6 m/min.
  4. Using Equation (3), the minimum number of subjects required by the sample is given by
    n = 2(5^2)(1.96 + .52)^2 / 3.6^2
    n = 50(6.15/12.96)
    n = 50(.47)
    n = 23.7, or 24 subjects

If the desired power is increased to .90 (beta = .10), then Zbeta = 1.28 and the minimum sample size increases:

n = 2(5^2)(1.96 + 1.28)^2 / 3.6^2
n = 50(10.50/12.96)
n = 50(.81)
n = 40.5, or 41 subjects
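
Both calculations follow directly from Equation (3). A minimal Python sketch, with the unit normal deviates computed from the normal distribution rather than read from Table 3:

    import math
    from scipy.stats import norm

    def method3_n(sigma, delta, alpha=0.05, beta=0.30):
        # Equation (3): n = 2 * sigma^2 * (Z_alpha + Z_beta)^2 / delta^2
        z_alpha = norm.ppf(1 - alpha / 2)  # 1.96 for alpha = .05, two-tailed
        z_beta = norm.ppf(1 - beta)        # .52 for beta = .30
        return math.ceil(2 * sigma**2 * (z_alpha + z_beta) ** 2 / delta**2)

    print(method3_n(sigma=5, delta=3.6, beta=0.30))  # 24 subjects (power = .70)
    print(method3_n(sigma=5, delta=3.6, beta=0.10))  # 41 subjects (power = .90)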

Summary and Conclusions

Three methods of estimating minimum sample size for a research study have been presented. The first method is based on the requirement for a specific level of significance. The second method involves the power and effect-size concepts. The third method is based on a required percent change as a result of treatments.

The influence of sample size on the power of a test is critical. The larger the sample, the greater the statistical power, given a good research design and correct sampling techniques. Smaller samples are less likely to be good representations of population characteristics; therefore, true differences between groups are less likely to be recognized. When very small samples are used, as is often the case in clinical research, power is substantially reduced, and there is serious risk that an effective intervention will be lost.

By specifying a level of significance and a desired power in the planning stages of a study, a researcher can estimate how many subjects are needed to detect a significant difference for an expected effect size. The larger the effect size, the smaller the required sample of subjects. When the sample-size estimate is beyond realistic limits, a researcher may try to redesign the study by controlling variability in the sample or increasing effect size, or the researcher may decide not to conduct the study given the unlikelihood of obtaining significant results.

The lack of planning for the minimum sample size and overall power often results in a high probability of error and wasted effort. It is instructive to examine the power of nonsignificant test results reported in the literature. In some cases, the clinical significance of a study will be greater than the statistical outcome suggests because of a lack of power in the analysis.

A few of the inherent limitations of these techniques must be emphasized. For example, forming preliminary ideas about the magnitudes of the quantities to be estimated, or of the differences to be detected and their standard deviations, is no easy task. Few studies are undertaken in total ignorance, however, and accumulated experience can be a guide in these matters. More serious is that no account has been taken of the interrelationships among a constellation of variables. Such interrelationships usually are unknown or poorly understood. Although determining them can be a Herculean task, it is a step in the right direction.

Another matter is that the outlined techniques assume the data are approximately normally distributed (bell curve). The interested reader is referred to Pasternack (7), Cochran (8) or Baltes (9) for additional material on how to handle these special situations.


THOMAS R. LUNSFORD, MSE, CO, is director of the orthotic department at The Institute for Rehabilitation and Research in Houston, Texas, and assistant professor of physical medicine and rehabilitation at Baylor College of Medicine.

BRENDA RAE LUNSFORD, MS, MAPT, is visiting assistant professor at Texas Woman's University in Houston, Texas, and physical therapist II at The Institute for Rehabilitation and Research.

References:

  1. Lunsford TR, Lunsford BR. The research sample, part I: sampling. JPO 1995; 7:3:105-12.
  2. Currier DP. Elements of research in physical therapy. 2nd ed. Baltimore: Williams and Wilkins, 1984.
  3. Portney LG, Watkins MP. Foundations of clinical research: applications to practice. Norwalk, Conn.: Appleton and Lange, 1993.
  4. Freiman JA, Chalmers TC, Smith H Jr., et al. The importance of beta, the type II error and sample size in the design and interpretation of the randomized control trial: survey of 71 "negative" trials. N Engl J Med 1978; 299:690-4.
  5. Borg WR, Gall MD. Educational research: an introduction. 3rd ed. New York: Longman, 1979.
  6. Schlesselman JJ. Planning a longitudinal study: I. sample size determination. J Chron Dis 1973; 26:553-60.
  7. Pasternack BS, Gilbert HS. Planning the duration of long-term survival time studies designed for accrual by cohorts. J Chron Dis 1971; 24:681-700.
  8. Cochran WG. The planning of observational studies of human populations. J R Stat Soc 1965; 128A:234-65.
  9. Baltes PB. Longitudinal and cross-sectional sequences in the study of age and generation effects. Hum Dev 1968; 11:145-71.