RESEARCH FORUM--
Methodology: Parametric
Data Analysis
Thomas R. Lunsford, MSE, CO
Brenda Rae Lunsford, MS, MAPT
ABSTRACT
The purpose of this article is to present the concepts involved
in analyzing parametric data. The word parametric, or parameter, relates to the nature of data, i.e., the assumptions
about particular data. The primary assumptions are that the
data are randomly drawn, that the population is normally
distributed and that there is homogeneity among variances.
Parametric tests are more stringent than nonparametric tests,
and the results tend to be more powerful.
Theory concerning hypothesis testing is reviewed, and the
distinction is made between the null and alternative hypotheses. The null hypothesis assumes no difference exists
between two devices being tested while the alternative assumes a difference. The goal of the statistical test is to accept
or reject the null hypothesis. However, it can be difficult to
choose the correct statistical test to apply to the data.
The most frequently applied statistical tests are the t-test
and the analysis of variance (ANOVA). Two types of t-tests
(independent and paired) and the one-way ANOVA are discussed with examples. Since a proliferation of statistical
software packages is now available to perform calculations, the reader is encouraged to focus on learning which
test to apply rather than on unwieldy mathematical equations.
Reading about or conducting statistical tests can be frustrating. Nevertheless, to aid in the growth of O&P research,
the authors encourage readers planning to conduct or read
research to consider the views presented in this article on
parametric testing and those that will be presented in a future
article on nonparametric testing.
Introduction
Most clinical research involves the collection of some form
of quantitative data. The purpose of collecting data is to obtain information that will allow one to infer or draw conclusions about the specific characteristics of a certain large
group of subjects or events based on the observation of a
few (1-4). The concept of screening raw data for its distribution was presented in a previous article (5). To select the
proper statistical test it is important to know how the data
are distributed.
In the development of modern statistical theory some of
the first techniques of inference were formed around a given set of standard parameters pertaining to a population
that was to be analyzed (1,2). The primary assumption of
any parametric study is that the data are randomly drawn from
a normally distributed population. A second assumption is
that the population variances of the samples being compared are equal or homogeneous (1-4). In using small samples it often is difficult to achieve this standard; however,
tests for homogeneity of variance can be used to substantiate this assumption (2,6).
Another feature or characteristic of parametric statistics
is that the data are measured on either ratio or interval scales;
i.e., the data values can be added, divided, multiplied and/or
subtracted, and follow the rules of mathematics (2,6,7).
When "numbers" are used to substitute for scores, as is
often done in rating human performance, arithmetic distortions can occur (2,7). For example, when substituting the
following numerals to describe or document the alpha
character ratings of mild, moderate or severe (1 = mild, 2 =
moderate and 3 = severe), adding or dividing such numbers
would cause arithmetic flaws (2,7).
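The distortion can be illustrated with a short sketch. The ratings, the two groups and the second coding scheme below are hypothetical and were chosen only to make the point; both codings preserve the order mild < moderate < severe, yet they lead to different conclusions when the numerals are averaged.

```python
from statistics import mean

# Two order-preserving codings of the same ordinal scale.
# coding_b is a hypothetical alternative, just as arbitrary as coding_a.
coding_a = {"mild": 1, "moderate": 2, "severe": 3}
coding_b = {"mild": 1, "moderate": 2, "severe": 10}

# Hypothetical ratings for two small groups of patients.
group_x = ["mild", "severe"]
group_y = ["moderate", "moderate"]


def coded_mean(ratings, coding):
    """Average the numerals substituted for the ratings."""
    return mean(coding[rating] for rating in ratings)


for name, coding in (("coding_a", coding_a), ("coding_b", coding_b)):
    print(name, coded_mean(group_x, coding), coded_mean(group_y, coding))
```

Under the first coding the two group means are equal; under the second, group X appears far more severe. Because the choice of numerals is arbitrary, a conclusion based on their mean is arbitrary as well.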
When statistical conditions fail to meet the basic requirements stated above, a researcher may default to distribution-free or nonparametric statistical techniques (1,2).
These nonparametric techniques make fewer assumptions
about the population data and can be legitimately used
with nominal and ordinal data. For example, in comparing
two groups, a parametric test would evaluate the differences between the means of two sets of scores while an
equivalent nonparametric test would evaluate the difference in two median scores (1).
The advantage of using parametric tests is their assumptions are more stringent, thereby making the study results
more powerful and requiring less qualification on the part
of the researcher in drawing conclusions about the data.
The advantage of the nonparametric techniques is the data
do not have to meet requirements as stringent as those for
parametric techniques. Nonparametric testing techniques,
which will be the subject of a future article, enable data to
be tested that otherwise would be unsuitable for analysis.
Since the validity of parametric statistics is based on specific assumptions, it is important that all data be screened
for their distribution and variance prior to analysis (5). Also, certain requirements must be met in the sampling of the
data for each statistical test. For example, the power depends on several conditions (1,2,5,6):
- All observations must be independent; i.e., all subjects or
samples must have an equal chance of being selected
(randomly selected).
- The sample population must be normally distributed.
- The variances of the groups being compared must be homogeneous or uniform in their distribution.
- The data type must be continuous (interval or ratio); i.e.,
the data must enable legitimate arithmetic operations.
The optimum opportunity to test a research hypothesis
results when all of these conditions are satisfied.
Hypothesis Testing
Understanding the null and alternative hypotheses is necessary for conducting or reading experimental research.
The basis for experimental research is the stating and testing of a hypothesis (2,8). Research studies often are initiated by clinicians who believe, based on knowledge of basic
sciences and clinical observations, that a certain orthosis or
prosthesis is more effective than another. This conjecture is
formally stated as the null and the alternative hypotheses (2,4).
Statement of the Null and Alternative Hypotheses
The null hypothesis assumes no effect or difference will result from the experiment. This hypothesis states there is no
difference between devices tested based on a comparison
of a measurable trait. In most cases, the O&P researcher
tries to disprove the null hypothesis to demonstrate that
one device is better than another.
For example, the question may involve the effectiveness
of a new pressure garment. The clinicians using this new
garment speculate it is more effective in reducing edema
than the garment currently being used. The data collected
may include circumferential measurements of the leg,
which is a ratio level of measurement. Since the null hypothesis is tested statistically, the conjecture is restated in
the form of the null hypothesis as follows:
Null Hypothesis: There is no difference in the mean
circumference of the leg (CIRC) when using the new
versus the old pressure garment.
The common notation for this concept is given below:
$$H_0: \mathrm{CIRC}_{\mathrm{old}} = \mathrm{CIRC}_{\mathrm{new}}$$
This statement says the mean circumference of the leg
when wearing the old pressure garment is equal to the
mean circumference with the new; i.e., there is no expected
difference. If it turned out that an analysis of the data indicated there were no significant differences in leg circumferences, then the null hypothesis would not be rejected but
accepted. If the analysis indicated a significant difference,
then the null hypothesis would be rejected.
The alternative hypothesis (H1) is stated in one of three
ways depending on the expectation of the researchers.
$$H_1: \mathrm{CIRC}_{\mathrm{old}} \neq \mathrm{CIRC}_{\mathrm{new}}$$
This notation implies there is an expected difference in
circumference when using the two different pressure garments, but there is no indication of which is better.
$$H_1: \mathrm{CIRC}_{\mathrm{old}} < \mathrm{CIRC}_{\mathrm{new}}$$
This notation states the mean circumference of the leg
while wearing the old pressure garment is less than that of
the circumference measured when wearing the new pressure garment. This implies the expected difference is that
the old pressure garment will control edema better because the leg circumference will be less when using the old
garment when compared to using the new pressure garment.
$$H_1: \mathrm{CIRC}_{\mathrm{old}} > \mathrm{CIRC}_{\mathrm{new}}$$
This notation states the mean circumference of the leg
when wearing the new pressure garment is less than the
measured circumference with the old pressure garment.
This implies the expected difference is the new pressure
garment will control edema better because the circumference will be less when using that garment when compared
to using the old pressure garment. The latter two alternative hypotheses also are known as directional hypotheses
since a direction is implied.
Testing of the null hypothesis is performed by evaluating
the mean circumferential measurements for the two types
of garments in a sample of subjects representative of the
population. The null hypothesis could be either not rejected (accepted) or rejected. Accepting or rejecting the null
hypothesis does not prove the hypothesis is true or false; it
merely states the probability of arriving at the same results
if the experiment is performed again under the same conditions.
Errors in Hypothesis Testing
The decision to reject or not reject the null hypothesis is
based on the results of objective statistical procedures;
however, this objectivity does not guarantee a correct decision will be made. Because such decisions are based on
sample data only, it always is possible the true relationship
between experimental populations is not accurately reflected in the statistical outcome (2-4).
Hypothesis testing will always result in one of two decisions: rejecting or not rejecting the null hypothesis. Any one
decision can be correct or incorrect. Therefore, it is possible
to classify four possible decision outcomes, as shown in
Table A (2). If we accept H0 when it is in fact true (observed
differences are really due to chance), we have made a correct decision (see Table A).
If H0 is rejected when it is false (differences are real), a
correct decision is made. If, however, H0 is rejected when it
is true, a Type I error is made (2-4). In this case it has been
concluded that a true difference exists when, in fact, the differences are due to chance, not to the orthoses or prostheses. Having committed this type of statistical error, the researcher might decide to use a device that is not more effective or better than the conventional device.
Conversely, if H0 is accepted when it is false, a Type II error is committed (2-4). In this case, the researcher would
have concluded the differences are due to chance when, in
fact, one device was better than the other. In this situation,
an effective or improved device might be ignored or a potentially fruitful line of research might be abandoned.
In any statistical analysis one of these two types of errors
might be committed. The importance of one type of error
over the other is relative. Historically, statisticians and researchers have focused attention on Type I error as the primary basis of hypothesis testing; however, the consequences
of failing to recognize an effective treatment may be equally important. Although researchers never know for sure if
they are committing one or the other type of error, they can
take steps to decrease the probability of committing either.
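A small simulation can make the Type I error concrete. The sketch below is not part of the original analysis: it assumes a paired design with 10 subjects, draws difference scores from a population in which the null hypothesis is true, and counts how often the test rejects anyway. The function names are hypothetical, and the critical value 2.262 is the standard two-tailed .05 table value for 9 degrees of freedom.

```python
import random
from math import sqrt
from statistics import mean, stdev


def one_sample_t(values):
    """t = mean / (Sd / sqrt(n)), testing a true mean of zero."""
    return mean(values) / (stdev(values) / sqrt(len(values)))


def type_i_error_rate(trials=4000, n=10, critical_t=2.262, seed=1):
    """Fraction of simulated experiments that reject a TRUE null hypothesis."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        # Difference scores come from a population whose true mean is zero,
        # so every rejection counted here is a Type I error.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if abs(one_sample_t(sample)) > critical_t:
            rejections += 1
    return rejections / trials


rate = type_i_error_rate()
print(f"Observed Type I error rate: {rate:.3f}")
```

With a .05 criterion the observed rate should settle near 5 percent, which is the long-run share of true null hypotheses that would be rejected by chance alone.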
Determining Significance Levels
The investigator determines the minimal value for rejection of H0 by establishing the level of significance, designated by alpha (2-4,9). This level of significance is the
probability of committing the Type I error previously described. When determining alpha,
it is useful to review a probability distribution graph that
can be divided into an acceptance region and a rejection region (see Figure 1). Two horizontal axes are shown: one for
the hypothetical variable measured (ankle-joint height)
and the other for units of standard deviations, Sd, away
from the mean (+1 Sd, +2 Sd, +3 Sd, -1 Sd, etc.) (2-4). The
curve is bell-shaped and symmetrical with zero at the
mean, and each standard deviation unit is equal to 1 (2-4).
The values that fall into the rejection region are those
values less likely to occur if the null hypothesis is true. The
level of significance is equal to the probability of a value
falling in this portion of the distribution. The significance
level also can be thought of as the proportion of the total
area under the curve that constitutes the rejection region
(2-4,9). Because the total area under the curve in Figure 1
is 1, an alpha-value of .05 is equal to 5 percent of the total area.
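The link between the significance level and area under the curve can be checked numerically. The following is a minimal sketch that assumes the standard normal curve described above (mean of zero, standard deviation of 1); the function names are hypothetical. For small samples the t distribution has heavier tails, so the cutoffs used here apply to the normal curve only.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)


def two_tailed_area(z):
    """Area under the standard normal curve beyond -z and +z combined."""
    return 2.0 * (1.0 - standard_normal.cdf(abs(z)))


def one_tailed_area(z):
    """Area under the standard normal curve above +z."""
    return 1.0 - standard_normal.cdf(z)


print(round(two_tailed_area(1.96), 3))   # about .05 split between two tails
print(round(one_tailed_area(1.645), 3))  # about .05 in a single tail
```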
The desired level of significance depends on the consequences of either accepting or rejecting the null hypothesis. An alpha level of .05 is most commonly used. If rejecting
the null hypothesis involves using a more time consuming
and expensive device, then a lower significance level, such
as alpha = .01, may be desired. In this situation, the investigator needs very convincing evidence to justify major
changes. A higher significance level, such as alpha = .1, may be
desired if the consequences of error involve minimal
changes that will be time- and cost-efficient.
One-Sided Versus Two-Sided Tests
The way the alternative hypothesis is stated determines
whether a one-sided or two-sided test (2-5,9) should be
used. Using the previous example, when the alternative hypothesis is stated as an inequality, either device could be expected to be more effective. The rejection area is equally divided into the two tails, one on either end of the probability distribution (see Figure 2A). If a significance level of .05
is chosen, areas of .025 in each tail are designated as rejection areas. Tests of this nature are referred to as "two-sided" or "two-tailed" tests (2).
If the alternative hypothesis states the results of one device will be greater than those of the other device, then the
rejection area is contained in only one tail of the probability distribution. This type of test is referred to as a "one-sided" test (see Figure 2B). Whether a one-sided or two-sided test is performed will determine the critical value the
test statistic must exceed to be termed significant. For a
two-sided test, the critical values are further from the midpoint and, therefore, will be larger in absolute value than
the critical value for a one-sided test.
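The difference between the two critical values can be seen in a brief sketch. It assumes the normal curve rather than the t distribution (whose critical values depend on the degrees of freedom and are read from a table), and the function name is hypothetical.

```python
from statistics import NormalDist


def critical_z(alpha, two_sided):
    """Critical value of the standard normal curve for a given alpha."""
    # A two-sided test places half of alpha in each tail.
    tail_area = alpha / 2.0 if two_sided else alpha
    return NormalDist().inv_cdf(1.0 - tail_area)


one_sided = critical_z(0.05, two_sided=False)
two_sided = critical_z(0.05, two_sided=True)
print(f"one-sided: {one_sided:.3f}, two-sided: {two_sided:.3f}")
```

At alpha = .05 the one-sided cutoff is roughly 1.645 and the two-sided cutoff roughly 1.960, so the same test statistic can be significant under a one-sided test but not under a two-sided test.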
After data have been collected for the variable(s) of interest within a sample of subjects representative of the population of interest, a test statistic is calculated from the raw
data. This test statistic produces a t or F value. A t-statistic
is used to test or compare the means of two independent or
paired groups of subjects, and the F statistic is used to compare more than two groups (2-4,8,9).
Preparing Data for Analysis (Descriptive Statistics)
As data are collected they become a compilation of numbers representing empirical observations and exist in what
is called raw form (2-4). For these data to be useful as an indication of group performance, they must be organized,
summarized and analyzed so their meaning can be communicated.
The first step in analyzing data is screening the raw data
for errors and distribution (5). The second step is summarizing the data so they can be communicated in a meaningful manner. The shape, central tendency and variability
within a set of data should be presented as descriptive statistics that should include the number of subjects and the
mean and standard deviation of the variables of interest
(2,5). For example, Table B provides a brief, hypothetical
example summarizing the raw data derived from a group of
50 subjects comprising 20 females and 30 males, each of
whom had his/her age and three geometric variables of
his/her feet and ankles measured and recorded.
Seeing data in this form is much more meaningful than
trying to make sense of four columns and 50 rows of numbers. Once the data have been screened and summarized,
they can be evaluated or tested. The method of evaluation
depends on the design of the research project.
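As an illustration of this summarizing step, the sketch below computes the number of subjects, the mean and the sample standard deviation for one variable. The ankle-joint heights are hypothetical values, not the data behind Table B, and the function name is an assumption made for the example.

```python
from statistics import mean, stdev


def describe(values):
    """Return the number of subjects, the mean and the sample standard deviation."""
    return {"n": len(values), "mean": mean(values), "sd": stdev(values)}


# Hypothetical ankle-joint heights in centimeters (not the data behind Table B).
ankle_height_cm = [7.1, 6.8, 7.4, 7.0, 6.9, 7.3, 7.2, 6.7]

summary = describe(ankle_height_cm)
print(f"n = {summary['n']}, mean = {summary['mean']:.2f}, Sd = {summary['sd']:.2f}")
```

The same function could be applied to each variable, and to each subgroup (for example, females and males), to build a table of descriptive statistics.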
In the following sections the appropriate statistical test
used in association with the more common research designs, as well as the assumptions (parameters) that are associated with each test, will be identified. The mathematical
equations will be presented to aid understanding of how
the comparisons or relationships are being evaluated. However, the purpose of this article is not to demonstrate the
mathematical calculations involved in the statistical test but
to provide an understanding of which test should be used
for any given type of research design or question. Most statistical software packages will automatically perform the
complex calculations; the important job for the researcher
is to select the correct test.
Statistical Tests
To enable better understanding of the concepts described
above, three examples will be provided that will encompass
the subject of comparison testing. The design, assumptions
and test statistics for the paired t-test, two-sample or independent t-test, and an analysis of variance (ANOVA) will
be presented. Future articles will present the concepts and
procedures related to correlation and regression.
t-Tests
Orthotists and prosthetists commonly perform research designed to determine a difference between two different
models of the same devices (variables), such as types of
prosthetic feet, knee-joint locking mechanisms, ankle joints,
socket liner materials, pressure garments, cervical orthoses,
etc. The variables of interest (which are measures of interval or ratio data) correspond to such characteristics as velocity, force, angle, circumference or pressure.
When the study design is such that two independent
groups of subjects are to be compared, such as the difference in the velocity between a group of females and a
group of males using a certain prosthetic foot, the proper
test is the independent or two-sample t-test (2-4,8,9,11).
When the study design is such that the same group of
subjects is tested before and after specific intervention,
such as a change in prosthetic feet, then the correct statistical test is the paired t-test. The latter type of test will be reviewed first.
Paired Comparisons (Paired t-Test)
Pairing may occur in one of three possible ways. First, each
subject may be used as his/her own control: He or she is
tested, treated, then retested. A second method of pairing
is using identical twins; a third method involves what is
known as matching.
Matching is the selection of one group of subjects to receive one device and another group that is as closely
matched as possible to the first group (taking into consideration age, gender, race, ethnic group, income group, diagnosis, experience with the device being tested, etc.) to receive another device (2).
For example, a device could be tested on a group of
matched 6-year-old subjects. The first group could wear
white shoes and be asked to jump as high as possible, and
the second, matched group could wear black shoes and be
asked to perform the same task as the first group. The research question could be, "Does the color of shoes have
anything to do with how high the paired groups of matched
6-year-olds can jump?"
A paired comparison experiment is an effective way to
reduce the natural variability that exists among subjects
when comparing treatments. For example, in the study illustrated below, pairing eliminated the difference in the
natural self-selected walking velocities between the subjects.
Study Design: To illustrate the use of a paired t-test,
a hypothetical research question is posed: "What will be the
effect on the walking efficiency of a group of geriatric patients if their prosthetic feet are changed from SACH
to one of the newer dynamic response feet (DRF)?" The
velocity of walking was selected as the variable to measure since it is a simple but valid indicator of efficiency.
To answer this hypothetical question, subject selection was made by choosing every other patient that came
to the clinic until 10 patients were selected. To solve the
problem of greater familiarity with the SACH foot, the
subjects who were experienced SACH-foot wearers were
measured for walking velocity with their SACH foot at
the beginning of the study. Once they had been fit with
the new foot, they were allowed to use it for a minimum
of two months before velocity measurements were taken.
The hypotheses for this study are stated as follows:
The null hypothesis states there is no difference in
the mean walking velocity when the geriatric subjects
walk with either the SACH foot or the DRF. The null hypothesis (which is tested statistically) is correctly stated
as follows:
$$H_0: \mathrm{VEL}_{\mathrm{SACH}} = \mathrm{VEL}_{\mathrm{DRF}}$$
However, clinical researchers believe the DRF will
make walking easier. Therefore, the velocity with the
DRF is expected to be greater. Correctly stated, the alternative hypothesis implies the mean walking velocity
of geriatric subjects with the SACH foot is less than the
mean walking velocity of the subjects with the DRF, or:
$$H_1: \mathrm{VEL}_{\mathrm{SACH}} < \mathrm{VEL}_{\mathrm{DRF}}$$
The hypothetical results are given in Table C.
The first column in Table C is the subject number;
the second and third columns are the velocities obtained
for each geriatric subject when walking with the SACH
and DRF, respectively. The fourth column contains the
difference in the value in column two from the value in
column three. The mean and standard deviations (Sd)
are calculated for each of the data columns at the bottom
of the table.
Note the standard deviations for the velocities
(columns two and three) are quite large. This is due to the
natural variability among individuals and is to be expected in clinical research. However, there is a relatively small
value for standard deviation in the difference column
(column four). This occurs when the velocity change within a single subject is compared and the greater variability
between subjects is eliminated.
This design consists of measuring walking velocity in
a group of geriatric subjects using SACH feet, then replacing each SACH foot with a DRF and re-measuring
the subjects' walking velocity. This is the purest use of before and after (or pre- and post-) testing; i.e., a specific
variable such as walking velocity is measured, a change
in device or treatment is made (such as a new foot), and
the velocity is re-measured.
In this example, the paired t-test is used to measure
the difference or change in velocity between the two devices or treatments.
Assumptions: The parameters assumed when comparing the paired data are that the subjects were randomly selected from a larger population and the testing
was done in a manner that assured all subjects had equal
opportunity for familiarity in both testing situations (2,8).
Test Statistic: In this case, the test statistic is the t-test
and is based on the ratio of the mean of the difference
scores and the variability of those scores (4,10). The
equation for the t-value is given by
$$t = \frac{\bar{d}}{S_d / \sqrt{n}} \qquad (1)$$
where
$\bar{d}$ = the mean of the differences
$S_d / \sqrt{n}$ = the standard error of the differences (4,5), where $S_d$ is the standard deviation of the n difference scores
$n - 1$ = the degrees of freedom, which always are (n - 1) for the paired t-test, where n is the number of pairs of scores (4)
Calculation of this equation yields a "t-value," which
can be compared to a table of critical values of t in a statistics text. Statistical software can be used to perform
the calculation in Equation (1). For example, the value of
t can be determined from Equation (1) by plugging in
values from Table C for the variables. The mean of the
differences, d, is 2.72, and the sample standard deviation
of the differences, Sd, is 2.67. The number of paired subjects, n, is 10.
$$t = \frac{2.72}{2.67 / \sqrt{10}} \approx 3.2$$
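The arithmetic of Equation (1) can be checked in a few lines of Python. This is a minimal sketch: the function names are hypothetical, and only the summary values quoted in the text (mean difference 2.72, standard deviation of the differences 2.67, 10 pairs) are used because the raw data of Table C are not reproduced here.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_from_summary(mean_diff, sd_diff, n_pairs):
    """Equation (1): t = d-bar / (Sd / sqrt(n)), with df = n - 1."""
    standard_error = sd_diff / sqrt(n_pairs)
    return mean_diff / standard_error, n_pairs - 1


def paired_t(before, after):
    """Same statistic computed from raw paired scores (after minus before)."""
    differences = [a - b for b, a in zip(before, after)]
    return paired_t_from_summary(mean(differences), stdev(differences), len(differences))


# Summary values quoted in the text: d-bar = 2.72, Sd = 2.67, n = 10.
t_value, df = paired_t_from_summary(2.72, 2.67, 10)
print(f"t = {t_value:.2f}, df = {df}")
```

When the raw velocities are available, `paired_t(sach_velocities, drf_velocities)` would give the same statistic directly from the two columns of paired scores.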
A standard table of critical t-values in a text (2) will
appear as shown in Table D.
This table of critical values of t contains predetermined values of t for both one-sided and two-sided testing (2). These values are related to the normal distribution. To use this table, first find the row that matches the
degrees of freedom for the test [in this case df = (n - 1) =
9]. Next locate the column that matches both the alpha-value and
the choice of a one- or two-tailed test. Finally, compare the tabled t-value with the calculated t-value.
In this case, it is appropriate to use the one-tailed alpha
since the alternative hypothesis implied direction; i.e.,
H1: VELSACH < VELDRF. Since this is a small sample using human subjects, alpha = .05 is selected as the level of significance for testing. The results are summarized below:
Calculated t (df = 9) = 3.2
Tabled t (df = 9, 1-tailed) = 1.83
The calculated t-value is larger than the tabled t-value, placing the test-statistic value in the region of rejection (see Figure 4). Therefore, the null hypothesis is
rejected. The result of this test is stated as follows:
"When geriatric subjects were tested walking with
both a SACH foot and DRF, their walking velocity was
significantly faster when walking with the DRF than
when walking with a SACH foot."
This result would be reported in a journal article as
not only being significant but being significant at p < .05.
This means that, if the two prosthetic feet truly did not differ, the probability (p)
of obtaining a difference at least this large by chance alone
is less than 5 percent.
This hypothetical study shows there is a statistical
difference in the average walking velocity when using the
DRF over the SACH foot. The question that still must be
answered is whether the result of the study is clinically
significant: Is a difference in walking velocity of one meter per minute fast enough to justify the time and expense of changing geriatric subjects from the SACH foot
to the DRF?
Comparing Two Sample Means
(Independent or two-sample t-test)
Clinical researchers often are interested in finding the differences between two separate groups of subjects for a specific characteristic or variable. One method of establishing
independence in groups is comparing a group of healthy
subjects for a specific characteristic and a group with a
known pathology.
Another way to establish independence is to use two
groups of subjects that are not matched or paired to each
other on any variable of interest as described in the previous section. In both instances, the independent or two-sample t-test is required (2-4,8-11).
For two independent groups, the degrees of freedom are
given by df = (n1 + n2 - 2) (2,11). As the degrees of freedom become larger, the size of the critical value of t becomes smaller (7). This implies H0 will be rejected with a
smaller critical value of t (4,7). The disadvantage is that
there is larger between-subjects variance. The following
provides an example of the two-sample t-test.
Study Design: As an illustration of the application of
the independent t-test, another hypothetical research
question is presented. In this case the clinician wished to
evaluate the effect of age on the efficiency of walking
with the DRF. The hypothetical research design in this
case consists of two different groups of subjects grouped
by age: Group I (age <45) and Group II (age ≥45). Independence is established by having two distinctly different age groups. All subjects were selected randomly
from a city-wide population of amputees.
The null hypothesis states there is no expected difference between the mean walking velocity in subjects
less than 45 years of age and that of subjects 45 or older.
In this case, the researcher did not have a preconceived notion of how age would affect the walking efficiency of the subjects and was interested only in determining if a difference did exist in fact. Therefore, the alternative hypothesis is stated as follows:
$$H_1: \mathrm{VEL}_{\mathrm{I}} \neq \mathrm{VEL}_{\mathrm{II}}$$
This alternative hypothesis implies the two groups
are not equal though it does not suggest in which direction.
Table E yields the hypothetical raw data for these
two groups; their means and standard deviations are located at the bottom of the table.
The first column in Table E is the subject number for
Group I; the group's raw data values for velocity are listed in the second column. Column three contains the subject numbers for Group II; its raw data values are presented in column four. Because the two groups are independent, their respective means and standard deviations
are presented separately at the bottom of the table.
Assumptions: The parameters required of this design
include the assumptions of randomization, normal distribution and homogeneity of variances. For the test results to
be valid, these assumptions must be adhered to and factors
that may affect the internal validity (9) must be controlled
so the outcome is not biased (9). If the effect of age is of
interest, it is important to control other factors besides age
that may affect the result. For example, care must be taken to include subjects who are similarly familiar with their
prostheses. A group should not have seven patients who
have been wearers for 10 years and three who have been
wearing their prostheses for only two weeks.
Test Statistic: The test statistic used with this research
design is called the two-sample or independent t-test and
is based on the ratio of the difference between the two
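means and their variances (2-4,8). This test is different
from the paired t-test, which evaluates the mean of the differences, in that it calls for the evaluation of
the difference between the means. The equation for the independent t-statistic is as follows:
$$t = \frac{\bar{x}_{I} - \bar{x}_{II}}{\sqrt{s_p^2 \left( \frac{1}{n_{I}} + \frac{1}{n_{II}} \right)}} \qquad (2)$$
where the pooled variance is found from the following
equation (calculation not shown):
$$s_p^2 = \frac{(n_{I} - 1)\, s_{I}^2 + (n_{II} - 1)\, s_{II}^2}{n_{I} + n_{II} - 2} \qquad (3)$$
where
$N - 2$ = degrees of freedom, where N is the total size of the sample, $N = n_{I} + n_{II}$
$\bar{x}_{I}$ = mean of Group I
$\bar{x}_{II}$ = mean of Group II
$s_p^2$ = pooled variance
$s_{I}^2$, $s_{II}^2$ = sample variances of Group I and Group II
$1/n_{I}$ = reciprocal of number of subjects in Group I
$1/n_{II}$ = reciprocal of number of subjects in Group II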
Selecting the correct test statistic involves recognition of the parameter or assumption of the equality of
variances. Most computer programs will provide a test
for homogeneity of variances and will give a choice of using either a pooled or separate variance. (Since the equation for the separate variance is complicated and beyond
the scope of this article, it will not be shown but can be
found in most statistical texts (2).)
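The pooled-variance calculation can be sketched as follows. The code assumes the standard pooled form of the two-sample t-test, in which each group's sample variance is weighted by its degrees of freedom; the function names and the velocities are hypothetical and are not the values of Table E.

```python
from math import sqrt
from statistics import mean, variance


def pooled_variance(group_1, group_2):
    """Weight each group's sample variance by its degrees of freedom."""
    n1, n2 = len(group_1), len(group_2)
    weighted = (n1 - 1) * variance(group_1) + (n2 - 1) * variance(group_2)
    return weighted / (n1 + n2 - 2)


def independent_t(group_1, group_2):
    """Return the pooled-variance t-value and its degrees of freedom."""
    n1, n2 = len(group_1), len(group_2)
    sp2 = pooled_variance(group_1, group_2)
    standard_error = sqrt(sp2 * (1 / n1 + 1 / n2))
    return (mean(group_1) - mean(group_2)) / standard_error, n1 + n2 - 2


# Hypothetical walking velocities (not the values of Table E).
group_i = [62, 58, 65, 70, 61]
group_ii = [60, 55, 66, 59, 63]

t_value, df = independent_t(group_i, group_ii)
print(f"t = {t_value:.2f}, df = {df}")
```

The absolute value of the calculated t is then compared with the tabled two-tailed critical value for the resulting degrees of freedom.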
The data shown in Table E would be tested using the
pooled variance rule. As before, a completed calculation
is provided.
Substituting values for the variables in Equations
(2) and (3) gives
The table of critical values of t as published in statistical texts gives a t-value of 2.101 for a two-sided test at
alpha = .05, df = 18, as shown in Table F.
Calculated t (df = 18) = .75
Tabled t (df = 18, two-tailed, alpha = .05) = 2.1
In this case, the calculated t-value is smaller than the
tabled t-value, which places the test statistic well within
the acceptance region (see Figure 5). Therefore, there is
insufficient evidence to reject the null hypothesis, and it
is accepted. This result is summarized as, "There were no
differences found when subjects who were habitual users
of the DRF were tested by age groups (<45, ≥45 years)
for velocity of walking; i.e., the efficiency of walking does
not seem to be affected by age."
Comparing More Than Two Sample Means:
Analysis of Variance (ANOVA)
As knowledge and clinical theory have advanced, more
complex research designs have emerged. The ANOVA was
created to enable the comparison of three or more groups
(2-4,11). ANOVA is used to determine whether the observed differences among a group of means are greater than
expected by chance (2,11) and is based on the F-statistic, which is similar
to the t-test in that it is a ratio of the variability between the
groups to the variability of the subjects within each group.
One would "expect" little variability within groups aside
from the variability being tested if all assumptions are met;
i.e., subjects are randomly selected with controls in place
for factors affecting internal validity. However, the unknown variability is that which occurs between the groups,
which is the effect being tested or observed by the investigator (2,11).
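The ratio described above can be written out for the simplest case, a one-way ANOVA. This is a minimal sketch under the usual textbook definitions of the between-group and within-group mean squares; the function name and the three sets of scores are hypothetical.

```python
from statistics import mean


def one_way_f(groups):
    """F = between-group mean square / within-group mean square."""
    all_values = [value for group in groups for value in group]
    grand_mean = mean(all_values)
    k, n_total = len(groups), len(all_values)

    # Variability of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variability of the subjects around their own group mean.
    ss_within = sum((value - mean(g)) ** 2 for g in groups for value in g)

    df_between, df_within = k - 1, n_total - k
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within


# Hypothetical scores for three devices (illustration only).
device_scores = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
f_value, df_between, df_within = one_way_f(device_scores)
print(f"F({df_between}, {df_within}) = {f_value:.2f}")
```

A large F indicates that the group means differ by more than the scatter of subjects within the groups would suggest; the calculated value is compared with a tabled critical F for the two degrees of freedom.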
Since the mathematics involved in calculating an ANOVA
is a jump in complexity over the formulas illustrated above,
researchers often employ a two-sample t-test to compare the
means of the performance of three devices as follows:
Device 1 versus Device 2
Device 1 versus Device 3
Device 2 versus Device 3
However, this method is unacceptable because the probability associated with the t-test is based on the assumption
that only one test is performed (12). When more than one
test is performed, the probability that at least one of the
comparisons will appear significant by chance increases with the number of possible pairings of means. This leads to an increased probability of making a Type I error, yielding the false conclusion that a
significant difference exists when it does not (2). For example, if a .05 alpha-level is selected for each of the three tests, the probability of making at least one Type I error will not be
$$1 - .95 = .05$$
but rather
$$1 - (.95)^3 = .143$$
Therefore, multiple t-tests performed using alpha = .05 will
lead to erroneous conclusions since the actual alpha value will
be alpha = .143 (12).
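The inflation of alpha can be checked with a short Python sketch. It assumes, as the calculation above implicitly does, that the individual tests are independent; the function names are illustrative only.

```python
from math import comb


def n_pairings(n_groups: int) -> int:
    """Number of distinct pairs of group means that could be compared."""
    return comb(n_groups, 2)


def familywise_alpha(alpha: float, n_tests: int) -> float:
    """Probability of at least one Type I error over n_tests independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


for groups in (2, 3, 4, 5):
    tests = n_pairings(groups)
    overall = familywise_alpha(0.05, tests)
    print(f"{groups} groups: {tests} pairwise t-tests, overall alpha = {overall:.3f}")
```

For three groups this gives the .143 quoted above, and the overall alpha keeps growing as more groups (and therefore more pairings) are added.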
Following is a review of a few terms commonly associated with the ANOVA statistic to aid in further understanding of ANOVA.
- Variance refers to the differences observed when one
measures almost any dependent variable such as height,
weight, force, length, circumference, etc.
- Independent variable is the variable that is manipulated
or controlled.
- Dependent variable is the variable being measured or affected by the manipulation of the independent variable.
- Factor is an independent variable that results in grouping.
- Level is defined by the number of different members of
each factor.
For example, in a previous example regarding how high
6-year-olds can jump, the dependent variable is jumping
height, the independent variable is the color of shoes, the
factor is shoe color and the number of levels is determined
by further grouping (not just shoes but perhaps gender as
well). If gender were included, there would be four groups, as
shown in Table C.
There are two factors: factor A, shoe color (which
has two levels, black and white), and factor B, gender
of subjects (which has two levels within each shoe color,
giving four shoe-color-by-gender groups in all). More complex designs of
ANOVA may have several factors, each with several levels.
The mathematics becomes correspondingly more complex when designs
involve multiple factors and multiple levels.
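One way to keep factors, levels and groups straight is to enumerate the design. The Python sketch below does this for the shoe-color example; the shoe colors come from the text, while the gender labels and variable names are assumptions made for illustration.

```python
from itertools import product

factors = {
    "shoe_color": ["black", "white"],  # factor A, levels named in the text
    "gender": ["female", "male"],      # factor B, level labels assumed here
}

# Number of levels per factor
n_levels = {name: len(levels) for name, levels in factors.items()}

# Every combination of one level from each factor is one group (cell)
cells = list(product(*factors.values()))

print(n_levels)
for shoe_color, gender in cells:
    print(shoe_color, gender)
```

Two factors with two levels each produce four groups, and adding a factor or a level multiplies the number of groups accordingly.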
Since there are many forms of ANOVA and the purpose
of this article is to acquaint the reader with categories of
testing, the following example will be a one-way ANOVA
for independent samples.
Study Design: The concept of comparison testing
will be expanded with another hypothetical example.
In a central fabrication laboratory some of the
technicians have noticed an increased number of returns
due to problems of fracture with some of the plastic ankle-foot orthoses (AFOs) produced by a particular facility. The laboratory director decided it would be appropriate to engage in a formal analysis to determine why
the plastic AFOs were failing. One of the differences in
types of AFOs produced was the manufacturer of the
plastic sheet stock. Another difference involved the trim
lines. Since one trim line was used most often, it was decided to first test the effect of the manufacturer.
Five AFOs were made from plastic sheet stock supplied by each of three manufacturers. The AFOs were
placed into three groups according to the source of the sheet stock: Group I (Manufacturer I),
Group II (Manufacturer II) and Group III (Manufacturer III). These AFOs were then tested using a surrogate walking machine, and the time-to-failure was measured.
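The null hypothesis of this study indicates the mean
time-to-failure (MTF) for each of three groups will be
the same:
$$H_0: \mu_{\mathrm{I}} = \mu_{\mathrm{II}} = \mu_{\mathrm{III}}$$
The alternative hypothesis can be stated in multiple combinations of inequality, for example, that none
of the means are equal:
$$H_a: \mu_{\mathrm{I}} \neq \mu_{\mathrm{II}} \neq \mu_{\mathrm{III}}$$
or that some combinations are unequal, for example:
$$H_a: \mu_{\mathrm{I}} = \mu_{\mathrm{II}} \neq \mu_{\mathrm{III}}$$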
Assumptions: Four major assumptions must be fulfilled to apply ANOVA: The data are normally distributed, the variances are homogeneous (normally distributed with equivalent variances), the measurements are
independent (i.e., sampling is random) and the null hypothesis is true (2,11).
Test Statistic: The data are shown in Table H with
summary information at the bottom of the table (2). The
term k = 3 refers to the number of plastic manufacturers
(groups) being compared, and N = 15 is the total number
of AFOs. Note the mean time-to-failure for each manufacturer's plastic sheet stock is presented without the
standard deviation. In addition to the mean time-to-failure of each group, the sum of each column of raw data
and the sum of the squares of the raw data are shown.
For Manufacturer I (Group I), the sum of individual
MTFs equals 65, the sum of the squares of the MTF
equals 875, and the mean MTF equals 13.
A summarized version of the calculation of this
table follows to help the reader understand the origin of
the ANOVA tables that are so often published in research literature.
The first step is to add the MTFs and the squared MTFs of all
three groups:
$$\sum \mathrm{MTF} = 65 + 55 + 30 = 150$$
$$\sum \mathrm{MTF}^2 = 875 + 655 + 190 = 1{,}720$$
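Next, these values are substituted into the following
equations, resulting in what is known as partitioning of
the variance (2). The first partition is defined as the total
sums of squares, which includes the total number of subjects (N), symbolized as SSt:
$$SS_t = \sum \mathrm{MTF}^2 - \frac{(\sum \mathrm{MTF})^2}{N} = 1{,}720 - \frac{(150)^2}{15} = 1{,}720 - 1{,}500 = 220$$
The next partition is calculated on the sums of
squares for the between-groups effect, symbolized as
SSb:
$$SS_b = \sum_{j=1}^{k} \frac{(\sum \mathrm{MTF}_j)^2}{n_j} - \frac{(\sum \mathrm{MTF})^2}{N} = \frac{(65)^2 + (55)^2 + (30)^2}{5} - 1{,}500 = 1{,}630 - 1{,}500 = 130$$
The last partition is known as the within-groups error effect, symbolized as SSw:
$$SS_w = SS_t - SS_b = 220 - 130 = 90$$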
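The partitioning can be reproduced from the column totals alone. The following Python sketch assumes the standard computational formulas for a one-way ANOVA with equal group sizes and uses only the sums and sums of squares quoted in the text; the function and variable names are illustrative.

```python
# Column totals quoted in the text: sum of MTF and sum of squared MTF per group
group_sums = {"I": 65, "II": 55, "III": 30}
group_sums_sq = {"I": 875, "II": 655, "III": 190}
n_per_group = 5  # five AFOs per manufacturer


def partition_sums_of_squares(sums, sums_sq, n):
    """Return (SS total, SS between, SS within) for equal-sized groups."""
    total_n = n * len(sums)
    correction = sum(sums.values()) ** 2 / total_n  # (sum of all MTF)^2 / N
    ss_total = sum(sums_sq.values()) - correction
    ss_between = sum(s ** 2 / n for s in sums.values()) - correction
    ss_within = ss_total - ss_between
    return ss_total, ss_between, ss_within


ss_t, ss_b, ss_w = partition_sums_of_squares(group_sums, group_sums_sq, n_per_group)

# Cross-check: SS within is also the sum of each group's own sum of squares
ss_w_direct = sum(group_sums_sq[g] - group_sums[g] ** 2 / n_per_group for g in group_sums)

print(ss_t, ss_b, ss_w, ss_w_direct)
```

The cross-check computes the within-groups sum of squares group by group, which should agree with the value obtained by subtraction.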
The F-statistic
First, the degrees of freedom (df) need to be defined
since they are the most complicated thus far. There are
three kinds of df when calculating an ANOVA.
- Total degrees of freedom is always one less than
the total number of subjects and is symbolized as follows:
$$df_t = N - 1 = 15 - 1 = 14$$
- The degrees of freedom associated with the between-groups variability is always one less than the number of groups:
$$df_b = k - 1 = 3 - 1 = 2$$
- The degrees of freedom associated with the within-groups variability is a multiple of the df for each
group, which is (n - 1). The notation for this source of
variance is:
$$df_w = k(n - 1) = 3(5 - 1) = 12$$
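Next, the mean squares between groups
and within groups are calculated by dividing the sums of squares by their respective df,
which provides the two variance estimates:
$$MS_b = \frac{SS_b}{df_b} = \frac{130}{2} = 65$$
$$MS_w = \frac{SS_w}{df_w} = \frac{90}{12} = 7.5$$
The last ratio calculated is the F-ratio, which is derived from the mean square values above:
$$F = \frac{MS_b}{MS_w} = \frac{65}{7.5} = 8.67$$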
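The steps from sums of squares to the F-ratio can be sketched in Python as follows. The inputs 130 (between groups) and 90 (within groups) are the sums of squares that follow from the column totals quoted in the text (sums of 65, 55 and 30; sums of squares of 875, 655 and 190), with k = 3 and N = 15. The function name and dictionary keys are assumptions made for illustration.

```python
def anova_table(ss_between: float, ss_within: float, k: int, total_n: int) -> dict:
    """Build the entries of a one-way ANOVA table from the sums of squares."""
    df_between = k - 1           # one less than the number of groups
    df_within = total_n - k      # equals k * (n - 1) for equal group sizes
    df_total = total_n - 1       # one less than the total number of subjects
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return {
        "df_between": df_between,
        "df_within": df_within,
        "df_total": df_total,
        "ms_between": ms_between,
        "ms_within": ms_within,
        "f_ratio": ms_between / ms_within,
    }


table = anova_table(ss_between=130.0, ss_within=90.0, k=3, total_n=15)
for name, value in table.items():
    print(f"{name}: {value:.2f}")
```

The resulting F-ratio rounds to 8.67 with 2 and 12 degrees of freedom.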
The calculations performed above are summarized
in Table I. This is commonly referred to as an ANOVA
table and should now be easier to read with some idea of
where the numbers came from and what they mean.
Calculated F (df = 2,12) 8.67
Tabled F (df = 2,12, alpha = .05) 3.89
The value for the tabled F was obtained in the same
manner as the values for the tabled t in previous examples (statistical texts) (2-4).
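When a printed table is not at hand, the tabled value can be approximated in software. The Python sketch below relies on a special case rather than a general routine: when the between-groups df is exactly 2, the upper-tail probability of F has the closed form P(F > x) = (1 + 2x/df_w)^(-df_w/2). It does not apply to other numerators, for which a statistical package would be needed. The function names are illustrative.

```python
def f_critical_df1_is_2(alpha: float, df_within: int) -> float:
    """Upper-tail critical value of F with 2 and df_within degrees of freedom.

    Valid only when the numerator (between-groups) df equals 2.
    """
    return (df_within / 2.0) * (alpha ** (-2.0 / df_within) - 1.0)


def f_p_value_df1_is_2(f_value: float, df_within: int) -> float:
    """P(F > f_value) for 2 and df_within degrees of freedom."""
    return (1.0 + 2.0 * f_value / df_within) ** (-df_within / 2.0)


critical_f = f_critical_df1_is_2(alpha=0.05, df_within=12)
p_value = f_p_value_df1_is_2(f_value=8.67, df_within=12)

print(f"critical F = {critical_f:.2f}")
print(f"p-value for F = 8.67: {p_value:.4f}")
print("reject H0" if 8.67 > critical_f else "do not reject H0")
```

The critical value agrees with the tabled 3.89 to two decimals, and the p-value falls well below .05.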
In this case, the calculated F-value is larger than the
tabled value, which places the test statistic in the region
of rejection (see Figure 6). It is therefore legitimate to reject the null hypothesis. This result is stated as follows:
"The mean time-to-failure of the AFOs made from
the plastic sheet stock supplied by three manufacturers is
significantly different."
Critical values of F: As with the t-value, the calculated F-value is compared to a critical value to determine
significance. Unlike the bell-shaped curve of the normal
distribution, the curve derived for the F-distribution is
positively skewed, with most of its area toward the left and a long tail extending to the right (see Figure 6). This skew arises because the
F-ratio is based on squared values, which are always positive. As shown in Figure 6, the central point of the tabled
F is at 1, with the limits at 0 at the left and infinity (∞) at the right. These limits are explained by the
manner in which the ratios are developed. Recall that the F-statistic is a ratio of observed (between) to expected
(within) values and is represented as:
$$F = \frac{MS_b}{MS_w}$$
If the observed and expected values are identical,
then this ratio equals 1. However, if the observed value is very
small with respect to the expected value, then the ratio
becomes a very small number approaching 0. In contrast,
the ratio is large if the observed value is large with respect to the expected value. For example:
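As a purely illustrative sketch of these three cases, the Python snippet below evaluates the ratio for made-up mean squares. The numbers are hypothetical and are not taken from the AFO data.

```python
def f_ratio(ms_between: float, ms_within: float) -> float:
    """Ratio of observed (between-groups) to expected (within-groups) variance."""
    return ms_between / ms_within


# Hypothetical (between, within) mean squares chosen only to show the behavior
cases = {
    "observed equals expected": (10.0, 10.0),
    "observed much smaller than expected": (0.1, 10.0),
    "observed much larger than expected": (100.0, 10.0),
}

ratios = {label: f_ratio(msb, msw) for label, (msb, msw) in cases.items()}
for label, value in ratios.items():
    print(f"{label}: F = {value:g}")
```

Because both mean squares are non-negative, the ratio can approach 0 but never fall below it, while nothing limits how large it can become.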
If the null hypothesis is accepted when using ANOVA (i.e., there are no significant differences), then the
analysis is complete. However, when a significant difference is revealed, as above, then it is appropriate to conduct post-hoc or multiple comparisons testing. Post-hoc
testing in the preceding example would reveal exactly
which group means were different; i.e., Group I may be
equal to Group II but different from Group III, and
Group II may be different from Group III. Post-hoc
analysis will be the subject of a future article.
Summary
The t-test and analysis of variance are based on several assumptions about the nature of data. These assumptions
have been reviewed and include random selection of subjects, normally distributed data and, when there is more
than one mean, homogeneity of variance. When clinical experiments are performed with very small samples, the data
may sufficiently violate these assumptions to warrant
transforming the data to a different scale of measurement
that better reflects the appropriate characteristics for statistical analysis, or it may be appropriate to use nonparametric statistics that do not make the same demands on
the data.
The simplest experimental comparison involves the use
of two independent groups created by random assignment.
This design allows the researcher to assume all individual
differences are evenly distributed among the groups so the
groups are equivalent at the start of the experiment. Statistically, the groups are considered random samples of the
same population; any observed differences among them,
therefore, should be the result of sampling error or chance.
After the application of some independent variable, the researcher wants to determine if the groups are still from the
same population, or if their means can be considered significantly different. This determination is made through a
test of statistical significance.
The t-test is the appropriate test for comparing two means. The
mathematical basis for the test varies depending on whether an
independent or a paired design is used. When it is desired to compare three
or more means, it is necessary to use ANOVA. These procedures are based on parametric operations and subject to
all assumptions underlying parametric statistics.
THOMAS R. LUNSFORD, MSE, CO, is president of Lone Star
Orthotics Inc. at The Institute for Rehabilitation and Research in
Houston, Texas, and assistant professor of physical medicine and
rehabilitation at Baylor College of Medicine.
BRENDA RAE LUNSFORD, MS, MAPT is visiting assistant
professor at Texas Woman's University in Houston, Texas, and
physical therapist II at The Institute for Rehabilitation and Research.
References:
1. Siegel S. Nonparametric statistics for the behavioral sciences. McGraw-Hill, Series in Psychology, 1956.
2. Portney LG, Watkins MP. Foundations of clinical research: applications and practice. Norwalk, Calif.: Appleton & Lange, 1993.
3. Mattson DE. Statistics: difficult concepts, understandable explanations. Mosby Co., 1981.
4. Colton T. Statistics in medicine. Little Brown & Co., 1974.
5. Lunsford BR. Statistics: screening and data summary. JPO 1993; 5:4:125-30.
6. Lunsford BR. Methodology: variables and levels of measurement. JPO 1993; 5:4:121-4.
7. Lehmkuhl D. Mixing one part common sense with each part statistics in planning the design and reporting the results of clinical research in physical therapy. Phys Ther 1987; 67:12:1851-3.
8. Huck SW, Cormier WH, Bounds WG. Reading statistics and research. Harper Collins, 1974.
9. Ferguson GA. Statistical analysis in psychology and education, 5th ed. New York: McGraw-Hill, 1981.
10. Dunn OJ. Basic statistics: a primer for the biomedical sciences, 2d ed. New York: John Wiley & Sons, 1977.
11. APTA. Reading tips for reports on research: an anthology. Amer Phys Ther Assn, 1986.
12. Pairwise mean comparisons in 7D. BMDP communications newsletter, October 1983; 16:3.