American Academy of Orthotists & Prosthetists - Providing Better Care Through Knowledge
Home > JPO > 1996 Vol. 8, Num. 2 > pp. 65-76

RESEARCH FORUM-- Methodology: Parametric Data Analysis

Thomas R. Lunsford, MSE, CO
Brenda Rae Lunsford, MS, MAPT

ABSTRACT

The purpose of this article is to present the concepts involved in analyzing parametric data. The word parametric, or parameter, relates to the nature of data, i.e., the assumptions about particular data. The primary assumptions are that the data are randomly drawn, that the population is normally distributed and that there is homogeneity among variances. Parametric tests are more stringent than nonparametric tests, and the results tend to be more powerful.

Theory concerning hypothesis testing is reviewed, and the distinction is made between the null and alternative hypotheses. The null hypothesis assumes no difference exists between two devices being tested while the alternative assumes a difference. The goal of the statistical test is to accept or reject the null hypothesis. However, it can be difficult to choose the correct statistical test to apply to the data.

The most frequently applied statistical tests are the t-test and the analysis of variance (ANOVA). Two types of t-tests (independent and paired) and the one-way ANOVA are discussed with examples. Since a proliferation of statistical software packages is now available to perform calculations, the reader is encouraged to focus on learning which test to apply rather than on unwieldy mathematical equations.

Reading about or conducting statistical tests can be frustrating. Nevertheless, to aid in the growth of O&P research, the authors encourage readers planning to conduct or read research to consider the views presented in this article on parametric testing and those that will be presented in a future article on nonparametric testing.

Introduction

Most clinical research involves the collection of some form of quantitative data. The purpose of collecting data is to obtain information that will allow one to infer or draw conclusions about the specific characteristics of a certain large group of subjects or events based on the observation of a few (1-4). The concept of screening raw data for its distribution was presented in a previous article (5). To select the proper statistical test it is important to know how the data are distributed.

In the development of modern statistical theory some of the first techniques of inference were formed around a given set of standard parameters pertaining to a population that was to be analyzed (1,2). The primary assumption of any parametric study is that the data are randomly drawn from a normally distributed population. A second assumption is that the variances of the populations from which the compared samples are drawn are equal, or homogeneous (1-4). With small samples it often is difficult to achieve this standard; however, tests for homogeneity of variance can be used to substantiate this assumption (2,6).

Another characteristic of parametric statistics is that the data are measured on either ratio or interval scales; i.e., the data values can be added, divided, multiplied and/or subtracted, and follow the rules of mathematics (2,6,7).

When "numbers" are used to substitute for scores, as is often done in rating human performance, arithmetic distortions can occur (2,7). For example, when numerals are substituted for the ratings of mild, moderate or severe (1 = mild, 2 = moderate and 3 = severe), adding or dividing such numbers would cause arithmetic flaws (2,7).
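The distortion described above can be illustrated with a minimal sketch. The ratings, the two groups and the second coding (1, 2, 10) are hypothetical and are not drawn from the article; they serve only to show that a comparison of means can reverse when an equally arbitrary coding is chosen, while the medians are unaffected.

```python
from statistics import mean, median

# Hypothetical severity ratings for two small groups (illustrative only).
group_a = ["mild", "mild", "severe"]
group_b = ["moderate", "moderate", "moderate"]

# Two equally arbitrary numeric codings that preserve the same rank order.
coding_1 = {"mild": 1, "moderate": 2, "severe": 3}
coding_2 = {"mild": 1, "moderate": 2, "severe": 10}

def summarize(ratings, coding):
    """Return (mean, median) of the ratings under a given numeric coding."""
    scores = [coding[r] for r in ratings]
    return mean(scores), median(scores)

for name, coding in (("coding_1", coding_1), ("coding_2", coding_2)):
    mean_a, median_a = summarize(group_a, coding)
    mean_b, median_b = summarize(group_b, coding)
    print(name, "means:", round(mean_a, 2), round(mean_b, 2),
          "medians:", median_a, median_b)
```

Under the first coding Group B has the higher mean; under the second, Group A does. The medians are the same under both codings, which is one reason order-based summaries are preferred for ordinal ratings.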

When statistical conditions fail to meet the basic requirements stated above, a researcher may default to distribution-free or nonparametric statistical techniques (1,2). These nonparametric techniques make fewer assumptions about the population data and can be legitimately used with nominal and ordinal data. For example, in comparing two groups, a parametric test would evaluate the differences between the means of two sets of scores while an equivalent nonparametric test would evaluate the difference in two median scores (1).

The advantage of using parametric tests is that their assumptions are more stringent, thereby making the study results more powerful and requiring less qualification on the part of the researcher in drawing conclusions about the data. The advantage of the nonparametric techniques is that the data do not have to meet requirements as stringent as those for parametric techniques. Nonparametric testing techniques, which will be the subject of a future article, enable data to be tested that otherwise would be unsuitable for analysis.

Since the validity of parametric statistics is based on specific assumptions, it is important that all data be screened for their distribution and variance prior to analysis (5). Also, certain requirements must be met in the sampling of the data for each statistical test. For example, the power of a parametric test depends on several conditions (1,2,5,6):

  • All observations must be independent; i.e., all subjects or samples must have an equal chance of being selected (randomly selected).
  • The sample population must be normally distributed.
  • The variances of the groups being compared must be homogeneous or uniform in their distribution.
  • The data type must be continuous (interval or ratio); i.e., the data must enable legitimate arithmetic operations.

The optimum opportunity to test a research hypothesis results when all of these conditions are satisfied.
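As a rough illustration of screening two groups before analysis, the sketch below summarizes each sample and reports the ratio of the larger sample variance to the smaller. The data and the function names are hypothetical. The ratio is only a crude screen under the assumption of interval-level data; it is not a formal test of homogeneity of variance.

```python
from statistics import mean, stdev, variance

def describe(sample):
    """Return n, mean, standard deviation and variance of a sample."""
    return {"n": len(sample), "mean": mean(sample),
            "sd": stdev(sample), "variance": variance(sample)}

def variance_ratio(sample_1, sample_2):
    """Ratio of the larger sample variance to the smaller one (always >= 1)."""
    v1, v2 = variance(sample_1), variance(sample_2)
    return max(v1, v2) / min(v1, v2)

# Hypothetical interval-level measurements for two groups (illustrative only).
group_1 = [1.0, 2.0, 3.0, 4.0, 5.0]
group_2 = [2.0, 4.0, 6.0, 8.0, 10.0]

print(describe(group_1))
print(describe(group_2))
print("variance ratio:", variance_ratio(group_1, group_2))  # 4.0
```

A ratio near 1 is consistent with homogeneous variances; a large ratio suggests that a formal homogeneity test, or a nonparametric alternative, should be considered.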

Hypothesis Testing

Understanding the null and alternative hypotheses is necessary for conducting or reading experimental research. The basis for experimental research is the stating and testing of a hypothesis (2,8). Research studies often are initiated by clinicians who believe, based on knowledge of basic sciences and clinical observations, that a certain orthosis or prosthesis is more effective than another. This conjecture is then restated formally as the null and the alternative hypotheses (2,4).

Statement of the Null and Alternative Hypotheses

The null hypothesis assumes no effect or difference will result from the experiment. This hypothesis states there is no difference between devices tested based on a comparison of a measurable trait. In most cases, the O&P researcher tries to disprove the null hypothesis to demonstrate that one device is better than another.

For example, the question may involve the effectiveness of a new pressure garment. The clinicians using this new garment speculate it is more effective in reducing edema than the garment currently being used. The data collected may include circumferential measurements of the leg, which is a ratio level of measurement. Since the null hypothesis is tested statistically, the conjecture is restated in the form of the null hypothesis as follows:

Null Hypothesis: There is no difference in the mean circumference of the leg (CIRC) when using the new versus the old pressure garment.

The common notation for this concept is given below:

$$ H_0: \mathrm{CIRC}_{\mathrm{old}} = \mathrm{CIRC}_{\mathrm{new}} $$

This statement says the mean circumference of the leg when wearing the old pressure garment is equal to the mean circumference with the new; i.e., there is no expected difference. If it turned out that an analysis of the data indicated there were no significant differences in leg circumferences, then the null hypothesis would not be rejected but accepted. If the analysis indicated a significant difference, then the null hypothesis would be rejected.

The alternative hypothesis (H1) is stated in one of three ways depending on the expectation of the researchers.

$$ H_1: \mathrm{CIRC}_{\mathrm{old}} \neq \mathrm{CIRC}_{\mathrm{new}} $$

This notation implies there is an expected difference in circumference when using the two different pressure garments, but there is no indication of which is better.

$$ H_1: \mathrm{CIRC}_{\mathrm{old}} < \mathrm{CIRC}_{\mathrm{new}} $$

This notation states the mean circumference of the leg while wearing the old pressure garment is less than the mean circumference measured while wearing the new pressure garment. This implies the old pressure garment is expected to control edema better, because the leg circumference will be smaller with the old garment than with the new one.

$$ H_1: \mathrm{CIRC}_{\mathrm{old}} > \mathrm{CIRC}_{\mathrm{new}} $$

This notation states the mean circumference of the leg when wearing the new pressure garment is less than the measured circumference with the old pressure garment. This implies the expected difference is the new pressure garment will control edema better because the circumference will be less when using that garment when compared to using the old pressure garment. The latter two alternative hypotheses also are known as directional hypotheses since a direction is implied.

Testing of the null hypothesis is performed by evaluating the mean circumferential measurements for the two types of garments in a sample of subjects representative of the population. The null hypothesis could be either not rejected (accepted) or rejected. Accepting or rejecting the null hypothesis does not prove the hypothesis is true or false; it merely states the probability of arriving at the same results if the experiment is performed again under the same conditions.

Errors in Hypothesis Testing

The decision to reject or not reject the null hypothesis is based on the results of objective statistical procedures; however, this objectivity does not guarantee a correct decision will be made. Because such decisions are based on sample data only, it always is possible the true relationship between experimental populations is not accurately reflected in the statistical outcome (2-4).

Hypothesis testing will always result in one of two decisions: rejecting or not rejecting the null hypothesis. Any one decision can be correct or incorrect. Therefore, it is possible to classify four possible decision outcomes, as shown in Table A (2). If we accept H0 when it is in fact true (observed differences are really due to chance), we have made a correct decision (see Table A).

If H0 is rejected when it is false (differences are real), a correct decision also is made. If, however, H0 is rejected when it is true, a Type I error is made (2-4). In this case it has been concluded that a true difference exists when, in fact, the differences are due to chance, not to the orthoses or prostheses. Having committed this type of statistical error, the researcher might decide to use a device that is not more effective or better than the conventional device.

Conversely, if H0 is accepted when it is false, a Type II error is committed (2-4). In this case, the researcher would have concluded the differences are due to chance when, in fact, one device was better than the other. In this situation, an effective or improved device might be ignored or a potentially fruitful line of research might be abandoned.

In any statistical analysis one of these two types of errors might be committed. The importance of one type of error over the other is relative. Historically, statisticians and researchers have focused attention on Type I error as the primary basis of hypothesis testing; however, the consequences of failing to recognize an effective treatment may be equally important. Although researchers never know for sure if they are committing one or the other type of error, they can take steps to decrease the probability of committing either.
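The meaning of a Type I error can be made concrete with a small simulation. The sketch below is an assumption-laden illustration rather than a method from the article: both samples are drawn from the same normal population, so the null hypothesis is true by construction, and a two-sided z-test with a known standard deviation of 1 stands in for the t-test to keep the code within the Python standard library. The function name, sample size, number of trials and seed are all hypothetical choices.

```python
import random
from statistics import NormalDist, mean

def type_i_error_rate(trials=2000, n=20, alpha=0.05, seed=1):
    """Estimate how often a true null hypothesis is rejected.

    Both samples come from the same normal population (mean 0, sd 1),
    so every rejection is, by construction, a Type I error.
    """
    rng = random.Random(seed)
    # Two-sided critical value of the standard normal distribution.
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(trials):
        a = [rng.gauss(0, 1) for _ in range(n)]
        b = [rng.gauss(0, 1) for _ in range(n)]
        # z statistic, using the known population sd of 1 in both groups.
        z = (mean(a) - mean(b)) / (2 / n) ** 0.5
        if abs(z) > z_crit:
            rejections += 1
    return rejections / trials

print(type_i_error_rate())
```

The proportion of rejections should settle close to the chosen alpha, which is the sense in which the significance level controls the Type I error rate.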

Determining Significance Levels

The investigator determines the minimal value for rejection of H0 by establishing the level of significance, designated by alpha (2-4,9). This level of significance is the probability of committing the Type I error previously described. When determining alpha, it is useful to review a probability distribution graph that can be divided into an acceptance region and a rejection region (see Figure 1). Two horizontal axes are shown: one for the hypothetical variable measured (ankle-joint height) and the other for units of standard deviations, Sd, away from the mean (+1 Sd, +2 Sd, +3 Sd, -1 Sd, etc.) (2-4). The curve is bell-shaped and symmetrical with zero at the mean, and each standard deviation unit is equal to 1 (2-4).

The values that fall into the rejection region are those values less likely to occur if the null hypothesis is true. The level of significance is equal to the probability of a value falling in this portion of the distribution. The significance level also can be thought of as the proportion of the total area under the curve that constitutes the rejection region (2-4,9). Because the total area under the curve in Figure 1 is 1, an alpha-value of .05 is equal to 5 percent of the total area.

The desired level of significance depends on the consequences of either accepting or rejecting the null hypothesis. An alpha level of .05 is most commonly used. If rejecting the null hypothesis involves using a more time-consuming and expensive device, then a lower significance level, such as alpha = .01, may be desired. In this situation, the investigator needs very convincing evidence to justify major changes. A higher significance level, such as alpha = .1, may be desired if the consequences of error involve minimal changes that will be time- and cost-efficient.

One-Sided Versus Two-Sided Tests

The way the alternative hypothesis is stated determines whether a one-sided or two-sided test should be used (2-5,9). Using the previous example, when the alternative hypothesis is stated as an inequality, either device could be expected to be more effective. The rejection area is equally divided into the two tails, one on either end of the probability distribution (see Figure 2A). If a significance level of .05 is chosen, areas of .025 in each tail are designated as rejection areas. Tests of this nature are referred to as "two-sided" or "two-tailed" tests (2).

If the alternative hypothesis states the results of one device will be greater than those of the other device, then the rejection area is contained in only one tail of the probability distribution. This type of test is referred to as a "one-sided" test (see Figure 2B). Whether a one-sided or two-sided test is performed will determine the critical value the test statistic must exceed to be termed significant. For a two-sided test, the critical values are further from the midpoint and, therefore, will be larger in absolute value than the critical value for a one-sided test.
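The relationship between alpha and the critical value can be checked numerically. The following sketch assumes the standard normal curve described for Figure 1 (mean 0, standard deviation 1) and uses the `NormalDist` class from the Python standard library; the function name `critical_values` is hypothetical. Critical values for small samples come from the t distribution instead and are somewhat larger.

```python
from statistics import NormalDist

def critical_values(alpha=0.05):
    """Return (one_sided, two_sided) critical values in Sd units."""
    z = NormalDist()  # standard normal: mean 0, standard deviation 1
    one_sided = z.inv_cdf(1 - alpha)       # all of alpha in one tail
    two_sided = z.inv_cdf(1 - alpha / 2)   # alpha split between two tails
    return one_sided, two_sided

one_sided, two_sided = critical_values(0.05)
print(round(one_sided, 3), round(two_sided, 3))  # 1.645 1.96
```

At alpha = .05 the two-sided critical value lies further from the mean than the one-sided value, as stated above.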

After data have been collected for the variable(s) of interest within a sample of subjects representative of the population of interest, a test statistic is calculated from the raw data. This test statistic produces a t or F value. A t-statistic is used to test or compare the means of two independent or paired groups of subjects, and the F statistic is used to compare more than two groups (2-4,8,9).

Preparing Data for Analysis (Descriptive Statistics)

As data are collected they become a compilation of numbers representing empirical observations and exist in what is called raw form (2-4). For these data to be useful as an indication of group performance, they must be organized, summarized and analyzed so their meaning can be communicated.

The first step in analyzing data is screening the raw data for errors and distribution (5). The second step is summarizing the data so they can be communicated in a meaningful manner. The shape, central tendency and variability within a set of data should be presented as descriptive statistics that should include the number of subjects and the mean and standard deviation of the variables of interest (2,5). For example, Table B provides a brief, hypothetical example summarizing the raw data derived from a group of 50 subjects comprised of 20 females and 30 males, each of whom had his/her age and three geometric variables of his/her feet and ankles measured and recorded.

Seeing data in this form is much more meaningful than trying to make sense of four columns and 50 rows of numbers. Once the data have been screened and summarized, they can be evaluated or tested. The method of evaluation depends on the design of the research project.

In the following sections the appropriate statistical test used in association with the more common research designs, as well as the assumptions (parameters) that are associated with each test, will be identified. The mathematical equations will be presented to aid understanding of how the comparisons or relationships are being evaluated. However, the purpose of this article is not to demonstrate the mathematical calculations involved in the statistical test but to provide an understanding of which test should be used for any given type of research design or question. Most statistical software packages will automatically perform the complex calculations; the important job for the researcher is to select the correct test.

Statistical Tests

To enable better understanding of the concepts described above, three examples will be provided that will encompass the subject of comparison testing. The design, assumptions and test statistics for the paired t-test, two-sample or independent t-test, and an analysis of variance (ANOVA) will be presented. Future articles will present the concepts and procedures related to correlation and regression.

t-Tests

Orthotists and prosthetists commonly perform research designed to determine a difference between two different models of the same devices (variables), such as types of prosthetic feet, knee-joint locking mechanisms, ankle joints, socket liner materials, pressure garments, cervical orthoses, etc. The variables of interest (which are measures of interval or ratio data) correspond to such characteristics as velocity, force, angle, circumference or pressure.

When the study design is such that two independent groups of subjects are to be compared, such as the difference in the velocity between a group of females and a group of males using a certain prosthetic foot, the proper test is the independent or two-sample t-test (2-4,8,9,11).

When the study design is such that the same group of subjects is tested before and after specific intervention, such as a change in prosthetic feet, then the correct statistical test is the paired t-test. The latter type of test will be reviewed first.

Paired Comparisons (Paired t-Test)

Pairing may occur in one of three possible ways. First, each subject may be used as his/her own control: He or she is tested, treated, then retested. A second method of pairing is using identical twins; a third method involves what is known as matching.

Matching is the selection of one group of subjects to receive one device and another group that is as closely matched as possible to the first group (taking into consideration age, gender, race, ethnic group, income group, diagnosis, experience with the device being tested, etc.) to receive another device (2).

For example, a device could be tested on a group of matched 6-year-old subjects. The first group could wear white shoes and be asked to jump as high as possible, and the second, matched group could wear black shoes and be asked to perform the same task as the first group. The research question could be, "Does the color of shoes have anything to do with how high the paired groups of matched 6-year-olds can jump?"

A paired comparison experiment is an effective way to reduce the natural variability that exists among subjects when comparing treatments. For example, in the study illustrated below, pairing eliminated the difference in the natural self-selected walking velocities between the subjects.

Study Design: To illustrate the use of a paired t-test, a hypothetical research question is posed: "What will be the effect on walking efficiency in a group of geriatric patients if their prosthetic feet are changed from SACH to one of the newer dynamic response feet (DRF)?" The velocity of walking was selected as the variable to measure since it is a simple but valid indicator of efficiency.

To answer this hypothetical question, subject selection was made by choosing every other patient that came to the clinic until 10 patients were selected. To solve the problem of greater familiarity with the SACH foot, the subjects who were experienced SACH-foot wearers were measured for walking velocity with their SACH foot at the beginning of the study. Once they had been fit with the new foot, they were allowed to use it for a minimum of two months before velocity measurements were taken.

The hypotheses for this study are stated as follows:

The null hypothesis states there is no difference in the mean walking velocity (VEL) when the geriatric subjects walk with either the SACH foot or the DRF. The null hypothesis (which is tested statistically) is correctly stated as follows:

$$ H_0: \mathrm{VEL}_{\mathrm{SACH}} = \mathrm{VEL}_{\mathrm{DRF}} $$

However, clinical researchers believe the DRF will make walking easier. Therefore, the velocity with the DRF is expected to be greater. Correctly stated, the alternative hypothesis implies the mean walking velocity of geriatric subjects with the SACH foot is less than the mean walking velocity of the subjects with the DRF, or:

$$ H_1: \mathrm{VEL}_{\mathrm{SACH}} < \mathrm{VEL}_{\mathrm{DRF}} $$

The hypothetical results are given in Table C.

The first column in Table C is the subject number; the second and third columns are the velocities obtained for each geriatric subject when walking with the SACH and DRF, respectively. The fourth column contains the difference in the value in column two from the value in column three. The mean and standard deviations (Sd) are calculated for each of the data columns at the bottom of the table.

Note the standard deviations for the velocities (columns two and three) are quite large. This is due to the natural variability among individuals and is to be expected in clinical research. However, there is a relatively small value for standard deviation in the difference column (column four). This occurs when the velocity change within a single subject is compared and the greater variability between subjects is eliminated.
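The effect of pairing on variability can be seen in a small numerical sketch. The velocities below are hypothetical and are not the values from Table C; they were chosen only to show a large spread between subjects alongside a small spread in the within-subject differences.

```python
from statistics import stdev

# Hypothetical walking velocities for five subjects (illustrative only;
# these are not the values from the article's Table C).
vel_sach = [40.0, 55.0, 62.0, 48.0, 70.0]
vel_drf = [43.0, 57.0, 66.0, 50.0, 73.0]

# Within-subject change: DRF velocity minus SACH velocity.
differences = [drf - sach for sach, drf in zip(vel_sach, vel_drf)]

print("Sd SACH:", round(stdev(vel_sach), 2))
print("Sd DRF:", round(stdev(vel_drf), 2))
print("Sd of differences:", round(stdev(differences), 2))
```

Because each subject serves as his or her own control, the standard deviation of the differences is far smaller than the standard deviation of either velocity column.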

This design consists of measuring walking velocity in a group of geriatric subjects using SACH feet, then replacing each SACH foot with a DRF and re-measuring the subjects' walking velocity. This is the purest use of before and after (or pre- and post-) testing; i.e., a specific variable such as walking velocity is measured, a change in device or treatment is made (such as a new foot), and the velocity is re-measured.

In this example, the paired t-test is used to measure the difference or change in velocity between the two devices or treatments.

Assumptions: The parameters assumed when comparing the paired data are that the subjects were randomly selected from a larger population and the testing was done in a manner that assured all subjects had equal opportunity for familiarity in both testing situations (2,8).

Test Statistic: In this case, the test statistic is the t-test and is based on the ratio of the mean of the difference scores to the variability of those scores (4,10). The equation for the t-value is given by

$$ t = \frac{\bar{d}}{S_d / \sqrt{n}} \qquad (1) $$

where

$\bar{d}$ = the mean of the differences
$S_d$ = the standard deviation of the differences for the n subjects
$S_d / \sqrt{n}$ = the standard error of the difference of these scores (4,5)
$n - 1$ = degrees of freedom, which always are (n - 1) for the paired t-test, where n is the number of pairs of scores (4)

Calculation of this equation yields a "t-value," which can be compared to a table of critical values of t in a statistics text. Statistical software can be used to perform the calculation in Equation (1). For example, the value of t can be determined from Equation (1) by plugging in values from Table C for the variables. The mean of the differences, d, is 2.72, and the sample standard deviation of the differences, Sd, is 2.67. The number of paired subjects, n, is 10.

$$ t = \frac{2.72}{2.67 / \sqrt{10}} = \frac{2.72}{0.844} \approx 3.2 $$

A standard table of critical t-values in a text (2) will appear as shown in Table D.

This table of critical values of t contains predetermined values of t for both one-sided and two-sided testing (2). These values are related to the normal distribution. To use this table, first find the row that matches the degrees of freedom for the test [in this case df = (n - 1) = 9]. Next locate the column that matches both the alpha-value and the choice of a one- or two-tailed test. Finally, compare the tabled t-value with the calculated t-value.

In this case, it is appropriate to use the one-tailed alpha since the alternative hypothesis implied direction; i.e., H1: VEL_SACH < VEL_DRF. Since this is a small sample using human subjects, alpha = .05 is selected as the level of significance for testing. The results are summarized below:

Calculated t (df = 9) = 3.2
Tabled t (df = 9, 1-tailed, alpha = .05) = 1.83

The calculated t-value is larger than the tabled t-value, placing the test-statistic value in the region of rejection (see Figure 4). Therefore, the null hypothesis is rejected. The result of this test is stated as follows:

"When geriatric subjects were tested walking with both a SACH foot and DRF, their walking velocity was significantly faster when walking with the DRF than when walking with a SACH foot."

This result would be reported in a journal article as not only being significant but being significant at p < .05. This means the probability (p) that this result occurred by chance and not due to the difference in prosthetic feet is less than 5 percent.
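The paired t calculation can be reproduced from the summary values quoted in the text (mean difference 2.72, standard deviation of the differences 2.67, n = 10). The sketch below is one possible way to do so with only the Python standard library. The function names are hypothetical, and the one-tailed p-value is obtained by numerically integrating the Student's t density with Simpson's rule, an approximation that a statistical package would normally replace with a built-in routine.

```python
import math

def paired_t(mean_diff, sd_diff, n):
    """Mean difference divided by its standard error (paired t-value)."""
    return mean_diff / (sd_diff / math.sqrt(n))

def t_pdf(x, df):
    """Probability density of Student's t distribution."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def one_tailed_p(t, df, steps=2000):
    """Upper-tail area beyond |t|, via Simpson's rule on [0, |t|]."""
    upper = abs(t)
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_0_to_t = total * h / 3
    # The density is symmetric, so half of the total area lies above zero.
    return 0.5 - area_0_to_t

t_value = paired_t(mean_diff=2.72, sd_diff=2.67, n=10)
p_value = one_tailed_p(t_value, df=9)
print(round(t_value, 2))   # 3.22
print(p_value < 0.05)      # True
```

The computed t-value agrees with the value of 3.2 reported above, and the one-tailed p-value falls below the selected alpha of .05.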

This hypothetical study shows there is a statistical difference in the average walking velocity when using the DRF over the SACH foot. The question that still must be answered is whether the result of the study is clinically significant: Is a difference in walking velocity of one meter per minute fast enough to justify the time and expense of changing geriatric subjects from the SACH foot to the DRF?

Comparing Two Sample Means (Independent or two-sample t-test)

Clinical researchers often are interested in finding the differences between two separate groups of subjects for a specific characteristic or variable. One method of establishing independence in groups is comparing a group of healthy subjects for a specific characteristic and a group with a known pathology.

Another way to establish independence is to use two groups of subjects that are not matched or paired to each other on any variable of interest as described in the previous section. In both instances, the independent or two-sample t-test is required (2-4,8-11).

For two independent groups, the degrees of freedom are given by df = (n1 + n2 - 2) (2,11). As the degrees of freedom become larger, the size of the critical value of t becomes smaller (7). This implies H0 will be rejected with a smaller critical value of t (4,7). The disadvantage is that there is larger between-subjects variance. The following provides an example of the two-sample t-test.

Study Design: As an illustration of the application of the independent t-test, another hypothetical research question is presented. In this case the clinician wished to evaluate the effect of age on the efficiency of walking with the DRF. The hypothetical research design in this case consists of two different groups of subjects grouped by age: Group I (age <45) and Group II (age ≥45). Independence is established by having two distinctly different age groups. All subjects were selected randomly from a city-wide population of amputees.

The null hypothesis states there is no expected difference between the mean walking velocity in subjects less than 45 years of age and that of subjects 45 or older.

$$ H_0: \mathrm{VEL}_{I} = \mathrm{VEL}_{II} $$

In this case, the researcher did not have a preconceived notion of how age would affect the walking efficiency of the subjects and was interested only in determining if a difference did exist in fact. Therefore, the alternative hypothesis is stated as follows:

$$ H_1: \mathrm{VEL}_{I} \neq \mathrm{VEL}_{II} $$

This alternative hypothesis implies the two groups are not equal though it does not suggest in which direction.

Table E yields the hypothetical raw data for these two groups; their means and standard deviations are located at the bottom of the table.

The first column in Table E is the subject number for Group I; the group's raw data values for velocity are listed in the second column. Column three contains the subject numbers for Group II; its raw data values are presented in column four. Because the two groups are independent, their respective means and standard deviations are presented separately at the bottom of the table.

Assumptions: The parameters required of this design include the assumptions of randomization, normal distribution and homogeneity of variances. For the test results to be valid, these assumptions must be adhered to and factors that may affect the internal validity must be controlled so the outcome is not biased (9). If the effect of age is of interest, it is important to control other factors besides age that may affect the result. For example, care must be taken to include subjects who are similarly familiar with their prostheses. A group should not have seven patients who have been wearers for 10 years and three who have been wearing their prostheses for only two weeks.

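Test Statistic: The test statistic used with this research design is called the two-sample or independent t-test and is based on the ratio of the difference between the two means to their pooled variability (2-4,8). This test differs from the paired t-test, which evaluates the mean of the differences; the independent t-test evaluates the difference between the two group means. The equation for the independent t-statistic is as follows:

$$ t = \frac{\bar{x}_{I} - \bar{x}_{II}}{\sqrt{s_p^2 \left( \frac{1}{n_{I}} + \frac{1}{n_{II}} \right)}} \qquad (2) $$

where the pooled variance is found from the following equation (calculation not shown):

$$ s_p^2 = \frac{(n_{I} - 1)\, s_{I}^2 + (n_{II} - 1)\, s_{II}^2}{n_{I} + n_{II} - 2} \qquad (3) $$

where

$N - 2$ = degrees of freedom, where N is the total size of the sample ($n_{I} + n_{II}$)
$\bar{x}_{I}$ = mean of Group I
$\bar{x}_{II}$ = mean of Group II
$s_{I}^2$, $s_{II}^2$ = variances of Group I and Group II
$s_p^2$ = pooled variance
$1/n_{I}$ = reciprocal of number of subjects in Group I
$1/n_{II}$ = reciprocal of number of subjects in Group II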

Selecting the correct test statistic involves recognition of the parameter or assumption of the equality of variances. Most computer programs will provide a test for homogeneity of variances and will give a choice of using either a pooled or separate variance. (Since the equation for the separate variance is complicated and beyond the scope of this article, it will not be shown but can be found in most statistical texts (2).)

The data shown in Table E would be tested using the pooled variance rule. As before, a completed calculation is provided.

Substituting values from Table E for the variables in Equations (2) and (3) gives a calculated t-value of .75.

The table of critical values of t as published in statistical texts gives a t-value of 2.101 for a two-sided test at alpha = .05, df = 18, as shown in Table F.

Calculated t (df = 18) = .75
Tabled t (df = 18, two-tailed, alpha = .05) = 2.101

In this case, the calculated t-value is smaller than the tabled t-value, which places the test statistic well within the acceptance region (see Figure 5). Therefore, there is insufficient evidence to reject the null hypothesis, and it is accepted. This result is summarized as, "There were no differences found when subjects who were habitual users of the DRF were tested by age groups (<45, ≥45 years) for velocity of walking; i.e., the efficiency of walking does not seem to be affected by age."
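For readers who wish to see the pooled-variance calculation carried out, a minimal sketch follows. It implements the usual pooled-variance form of the two-sample t-statistic with the Python standard library. The velocities are hypothetical and are not the Table E data, so the printed t-value will not match the .75 reported above; the function and variable names are likewise illustrative.

```python
import math
from statistics import mean, variance

def pooled_variance(group_1, group_2):
    """Sample variances weighted by their degrees of freedom."""
    n1, n2 = len(group_1), len(group_2)
    return ((n1 - 1) * variance(group_1) + (n2 - 1) * variance(group_2)) / (n1 + n2 - 2)

def independent_t(group_1, group_2):
    """Pooled-variance two-sample t-test: returns (t, degrees of freedom)."""
    n1, n2 = len(group_1), len(group_2)
    sp2 = pooled_variance(group_1, group_2)
    t = (mean(group_1) - mean(group_2)) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

# Hypothetical velocities (illustrative only; not the article's Table E).
under_45 = [61.0, 58.0, 66.0, 70.0, 55.0]
age_45_plus = [59.0, 54.0, 63.0, 68.0, 52.0]

t_value, df = independent_t(under_45, age_45_plus)
print(round(t_value, 2), df)
```

The absolute value of the returned t would then be compared with the tabled two-tailed critical value for the returned degrees of freedom.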

Comparing More Than Two Sample Means: Analysis of Variance (ANOVA)

As knowledge and clinical theory have advanced, more complex research designs have emerged. The ANOVA was created to enable the comparison of three or more groups (2-4,11). ANOVA is used to determine whether the observed differences among a group of means are greater than expected (2,11) and is based on the F-statistic, which is similar to the t-statistic in that it is a ratio: the variability between the groups divided by the variability of the subjects within each group. One would "expect" little variability within groups aside from the variability being tested if all assumptions are met; i.e., subjects are randomly selected with controls in place for factors affecting internal validity. However, the unknown variability is that which occurs between the groups, which is the effect being tested or observed by the investigator (2,11).
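The ratio underlying the F-statistic can be sketched in a few lines. The example below is a minimal illustration of a one-way ANOVA using only the Python standard library; the function name and the three sets of scores are hypothetical and were chosen so the arithmetic can be followed by hand.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean([x for g in groups for x in g])

    # Between-groups sum of squares: group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-groups sum of squares: subjects around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between, df_within = k - 1, n_total - k
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within

# Hypothetical scores for three devices (illustrative only).
devices = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]]
print(one_way_anova_f(devices))  # (3.0, 2, 6)
```

Here the between-groups mean square is 3 and the within-groups mean square is 1, giving F = 3.0 with 2 and 6 degrees of freedom; the result would be compared with a tabled critical value of F.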

Since the mathematics involved in calculating an ANOVA is a jump in complexity over the formulas illustrated above, researchers often employ a series of two-sample t-tests to compare the mean performance of three devices in pairs, as follows:

Device 1 vs. Device 2
Device 1 vs. Device 3
Device 2 vs. Device 3

However, this method is unacceptable because the probability associated with the t-test is based on the assumption that only one test is performed (12). When more than one test is performed, the probability that at least one comparison will appear significant increases with the number of possible pairings of means. This leads to an increased probability of making a Type I error, yielding the false conclusion that a significant difference exists when it does not (2). For example, if a .05 alpha-level is selected for testing, the actual probability of a Type I error across three tests will not be

$$1 - .95 = .05$$

but rather

$$1 - (.95)^3 = .143$$

Therefore, three t-tests each performed at alpha = .05 can lead to erroneous conclusions because the actual alpha across the set of tests is approximately .143 (12).
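
The inflation of alpha can be reproduced with a short Python sketch. It assumes, as the calculation above does, that the tests are treated as independent; the device labels and the function name `familywise_alpha` are hypothetical and used only for illustration.

```python
from itertools import combinations

def familywise_alpha(alpha, n_tests):
    """Probability of at least one Type I error across n_tests independent tests."""
    return 1 - (1 - alpha) ** n_tests

devices = ["Device 1", "Device 2", "Device 3"]  # hypothetical labels
pairings = list(combinations(devices, 2))       # every two-device comparison

overall_alpha = familywise_alpha(0.05, len(pairings))
print(f"{len(pairings)} pairwise tests -> overall alpha = {overall_alpha:.3f}")
```

Three devices give three pairings, and the overall alpha works out to about .143, matching the value quoted above.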

Following is a review of a few terms commonly associated with the ANOVA statistic to aid in further understanding of ANOVA.

  • Variance refers to the differences observed when one measures almost any dependent variable such as height, weight, force, length, circumference, etc.
  • Independent variable is the variable that is manipulated or controlled.
  • Dependent variable is the variable being measured or affected by the manipulation of the independent variable.
  • Factor is an independent variable that results in grouping.
  • Level is defined by the number of different members of each factor.

For example, in a previous example regarding how high 6-year-olds can jump, the dependent variable is jumping height, the independent variable is the color of shoes, the factor is shoe color and the number of levels is determined by further grouping (not just shoes but perhaps gender as well). If gender were included, there would be four levels as shown in Table C .

There are two factors: factor A, shoe color (which has two levels, black and white), and factor B, gender of subjects (which has two levels within each shoe color, giving the four groups described above). More complex designs of ANOVA may have several factors, each with several levels. The complexity of the mathematics involved when designs involve multiple factors and multiple levels is obvious.
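
As a small illustration of how the groups in a two-factor design arise, the sketch below crosses the two factors from the jumping example. The gender labels are assumptions, since the text names only the shoe colors.

```python
from itertools import product

shoe_colors = ["black", "white"]  # factor A (levels named in the text)
genders = ["female", "male"]      # factor B (labels assumed for illustration)

# Crossing the two factors yields one group per combination of levels.
groups = list(product(shoe_colors, genders))
for shoe_color, gender in groups:
    print(f"{shoe_color} shoes, {gender}")
print(f"Number of groups: {len(groups)}")
```

Adding a third factor, or more levels per factor, multiplies the number of groups in the same way, which is why the arithmetic grows quickly in complex designs.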

Since there are many forms of ANOVA and the purpose of this article is to acquaint the reader with categories of testing, the following example will be a one-way ANOVA for independent samples.

Study Design: The concept of comparison testing will be expanded with another hypothetical example.

In a central fabrication laboratory some of the technicians have noticed an increased number of returns due to problems of fracture with some of the plastic ankle-foot orthoses (AFOs) produced by a particular facility. The laboratory director decided it would be appropriate to engage in a formal analysis to determine why the plastic AFOs were failing. One of the differences in types of AFOs produced was the manufacturer of the plastic sheet stock. Another difference involved the trim lines. Since one trim line was used most often, it was decided to first test the effect of the manufacturer.

Five AFOs were made from plastic sheet stock supplied by each of three manufacturers. The AFOs were placed into three groups according to the source of the sheet stock: Group I (Manufacturer I), Group II (Manufacturer II) and Group III (Manufacturer III). These AFOs were then tested using a surrogate walking machine, and the time-to-failure was measured.

The null hypothesis of this study states that the mean time-to-failure (MTF) for each of the three groups will be the same:

$$H_0: \mathrm{MTF}_{I} = \mathrm{MTF}_{II} = \mathrm{MTF}_{III}$$

The alternative hypothesis can be stated in multiple combinations of inequality; for example, that none of the means are equal:

$$H_1: \mathrm{MTF}_{I} \neq \mathrm{MTF}_{II} \neq \mathrm{MTF}_{III}$$

or that some combinations are unequal, for example:



Assumptions: Four major assumptions must be fulfilled to apply ANOVA: The data are normally distributed, the variances are homogeneous (normally distributed with equivalent variances), the measurements are independent (i.e., sampling is random) and the null hypothesis is true (2,11).

Test Statistic: The data are shown in Table H with summary information at the bottom of the table (2). The term k = 3 refers to the number of plastic manufacturers (groups) being compared, and N = 15 is the total number of AFOs. Note the mean time-to-failure for each manufacturer's plastic sheet stock is presented without the standard deviation. In addition to the mean time-to-failure of each group, the sum of each column of raw data and the sum of the squares of the raw data are shown. For Manufacturer I (Group I), the sum of individual MTFs equals 65, the sum of the squares of the MTF equals 875, and the mean MTF equals 13.
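
Although the standard deviations are not tabulated, they can be recovered from the column sums. The sketch below is a hedged illustration: it uses the sums and sums of squares reported for the three groups (65 and 875, 55 and 655, 30 and 190) with five AFOs per group, and the function name `mean_and_sd` is an assumption.

```python
import math

def mean_and_sd(total, total_of_squares, n):
    """Mean and sample standard deviation from a column sum and sum of squares."""
    mean = total / n
    # Sum of squared deviations about the group mean.
    sum_sq_dev = total_of_squares - total ** 2 / n
    return mean, math.sqrt(sum_sq_dev / (n - 1))

N_PER_GROUP = 5
# (sum of MTF, sum of squared MTF) for each group, as reported in the text.
summary = {"I": (65, 875), "II": (55, 655), "III": (30, 190)}

results = {group: mean_and_sd(total, squares, N_PER_GROUP)
           for group, (total, squares) in summary.items()}
for group, (mean, sd) in results.items():
    print(f"Group {group}: mean MTF = {mean:.1f}, SD = {sd:.2f}")
```

Comparing the three standard deviations gives an informal first look at the homogeneity-of-variance assumption before the ANOVA is run.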

A summarized version of the calculation of this table follows to help the reader understand the origin of the ANOVA tables that are so often published in research literature.

The first step is to add the MTFs and the squared MTFs of all three groups:

$$\sum \mathrm{MTF} = 65 + 55 + 30 = 150$$

$$\sum \mathrm{MTF}^2 = 875 + 655 + 190 = 1{,}720$$

Next, these values are substituted into the following equations, resulting in what is known as partitioning of the variance (2). The first partition is defined as the total sums of squares, which includes the total number of subjects (N), symbolized as SSt:

$$SS_t = \sum \mathrm{MTF}^2 - \frac{(\sum \mathrm{MTF})^2}{N} = 1{,}720 - \frac{(150)^2}{15} = 1{,}720 - 1{,}500 = 220$$

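The next partition is calculated on the sums of squares for the between-groups effect, symbolized as SSb:

$$SS_b = \frac{(65)^2 + (55)^2 + (30)^2}{5} - \frac{(\sum \mathrm{MTF})^2}{N} = 1{,}630 - 1{,}500 = 130$$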

The last partition is known as the within-groups error effect, symbolized as SSw.

$$SS_w = SS_t - SS_b = 220 - 130 = 90$$
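
The three partitions can be checked numerically from the column sums reported for Table H. The following Python sketch is illustrative only; the function name `partition_sums_of_squares` is an assumption, and it takes the within-groups value as the total minus the between-groups value.

```python
def partition_sums_of_squares(group_sums, group_sizes, total_of_squares):
    """Return (SS total, SS between, SS within) from group sums and sizes."""
    n_total = sum(group_sizes)
    # Correction term: (grand sum)^2 / N, shared by both partitions.
    correction = sum(group_sums) ** 2 / n_total
    ss_total = total_of_squares - correction
    ss_between = sum(s ** 2 / n for s, n in zip(group_sums, group_sizes)) - correction
    ss_within = ss_total - ss_between
    return ss_total, ss_between, ss_within

# Column sums for Groups I, II and III; five AFOs per group; sum of squares 1,720.
ss_total, ss_between, ss_within = partition_sums_of_squares(
    group_sums=[65, 55, 30], group_sizes=[5, 5, 5], total_of_squares=1720)
print(f"SSt = {ss_total:g}, SSb = {ss_between:g}, SSw = {ss_within:g}")
```

The between-groups and within-groups values always add up to the total, which is what is meant by partitioning the variance.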



The F-statistic

First, the degrees of freedom (df) need to be defined since they are the most complicated thus far. There are three levels of df when calculating an ANOVA.

  • Total degrees of freedom is always one less than the total number of subjects and is symbolized as follows:

    $$df_t = N - 1 = 15 - 1 = 14$$

  • The degrees of freedom associated with the between-groups variability is always one less than the number of groups:

    $$df_b = k - 1 = 3 - 1 = 2$$

  • The degrees of freedom associated with the within-groups variability is a multiple of the df for each group, which is (n - 1). The notation for this source of variance is:

    $$df_w = k(n - 1) = 3(5 - 1) = 12$$

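Next, the mean squares between groups and within groups are calculated by dividing each sum of squares by its respective df, which gives the two variance estimates that form the ratio:

$$MS_b = \frac{SS_b}{df_b} = \frac{130}{2} = 65$$

$$MS_w = \frac{SS_w}{df_w} = \frac{90}{12} = 7.5$$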

The last ratio calculated is the F-ratio, which is derived from the mean square values above:

$$F = \frac{MS_b}{MS_w} = \frac{65}{7.5} = 8.67$$

The calculations performed above are summarized in Table I. This is commonly referred to as an ANOVA table and should now be easier to read with some idea of where the numbers came from and what they mean.
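
For readers who prefer to see the table assembled step by step, the sketch below builds the rows of a one-way ANOVA table from the sums of squares of this example (between groups 130, within groups 90, k = 3, N = 15). The function name `anova_table` and the column layout are assumptions for illustration, not a reproduction of Table I.

```python
def anova_table(ss_between, ss_within, k, n_total):
    """Rows of a one-way ANOVA table: (source, SS, df, MS, F)."""
    df_between = k - 1
    df_within = n_total - k  # equals k(n - 1) when every group has n members
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return [
        ("Between groups", ss_between, df_between, ms_between, ms_between / ms_within),
        ("Within groups", ss_within, df_within, ms_within, None),
        ("Total", ss_between + ss_within, n_total - 1, None, None),
    ]

rows = anova_table(ss_between=130, ss_within=90, k=3, n_total=15)

print(f"{'Source':<16}{'SS':>6}{'df':>4}{'MS':>8}{'F':>8}")
for source, ss, df, ms, f_ratio in rows:
    ms_text = "" if ms is None else f"{ms:.2f}"
    f_text = "" if f_ratio is None else f"{f_ratio:.2f}"
    print(f"{source:<16}{ss:>6}{df:>4}{ms_text:>8}{f_text:>8}")
```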

Calculated F (df = 2, 12) = 8.67

Tabled F (df = 2, 12; alpha = .05) = 3.89

The value for the tabled F was obtained in the same manner as the values for the tabled t in previous examples (statistical texts) (2-4).
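
The tabled value can also be checked without a printed table in this particular case. The sketch below relies on a closed form for the upper tail of the F-distribution that holds only when the between-groups df equals 2, as it does here; it is not a general-purpose routine, and both function names are assumptions.

```python
def f_tail_probability_df2(f_value, df_within):
    """P(F > f_value) for an F-distribution with 2 and df_within degrees of freedom."""
    return (1 + 2 * f_value / df_within) ** (-df_within / 2)

def f_critical_df2(alpha, df_within):
    """Critical F for 2 and df_within degrees of freedom (inverse of the tail formula)."""
    return (df_within / 2) * (alpha ** (-2 / df_within) - 1)

critical_f = f_critical_df2(alpha=0.05, df_within=12)
tail_probability = f_tail_probability_df2(8.67, df_within=12)

print(f"Critical F (df = 2, 12; alpha = .05) = {critical_f:.2f}")
print(f"Calculated F exceeds the critical value: {8.67 > critical_f}")
print(f"Tail probability is below alpha: {tail_probability < 0.05}")
```

For designs with more than three groups the between-groups df is no longer 2, and a statistical table or software package is needed instead.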

In this case, the calculated F-value is larger than the tabled value, which places the test statistic in the region of rejection (see Figure 6). It is therefore legitimate to reject the null hypothesis. This result is stated as follows:

"The mean time-to-failure of the AFOs made from the plastic sheet stock supplied by three manufacturers is significantly different."

Critical values of F: As with the t-value, the calculated F-value is compared to a critical value to determine significance. Unlike the bell-shaped curve of the normal distribution, the curve of the F-distribution is positively skewed, with its peak toward the left and a long tail extending to the right (see Figure 6). This skew arises because the F-ratio is based on squared values, which are always positive. As shown in Figure 6, the central point of the tabled F is at 1, with limits of 0 at the left and infinity (∞) at the right. These limits are explained by the manner in which the ratios are developed. Recall that the F-statistic is a ratio of observed (between) to expected (within) values and is represented as:

$$F = \frac{MS_b}{MS_w}$$

If the observed and expected values are identical, then this ratio equals 1. However, if the observed value is very small with respect to the expected value, then the ratio becomes a very small number approaching 0. In contrast, the ratio is large if the observed value is large with respect to the expected value. For example:



If the null hypothesis is accepted when using ANOVA (i.e., there are no significant differences), then the analysis is complete. However, when a significant difference is revealed, as above, then it is appropriate to conduct post-hoc or multiple comparisons testing. Post-hoc testing in the preceding example would reveal exactly which group means were different; i.e., Group I may be equal to Group II but different from Group III, and Group II may be different from Group III. Post-hoc analysis will be the subject of a future article.

Summary

The t-test and analysis of variance are based on several assumptions about the nature of data. These assumptions have been reviewed and include random selection of subjects, normally distributed data and, when there is more than one mean, homogeneity of variance. When clinical experiments are performed with very small samples, the data may sufficiently violate these assumptions to warrant transforming the data to a different scale of measurement that better reflects the appropriate characteristics for statistical analysis, or it may be appropriate to use nonparametric statistics that do not make the same demands on the data.

The simplest experimental comparison involves the use of two independent groups created by random assignment. This design allows the researcher to assume all individual differences are evenly distributed among the groups so the groups are equivalent at the start of the experiment. Statistically, the groups are considered random samples of the same population; any observed differences among them, therefore, should be the result of sampling error or chance. After the application of some independent variable, the researcher wants to determine if the groups are still from the same population, or if their means can be considered significantly different. This determination is made through a test of statistical significance.

The best comparison of two means is the t-test. The mathematical basis for the tests varies depending on which type of design is used. When it is desired to compare three or more means, it is necessary to use ANOVA. These procedures are based on parametric operations and subject to all assumptions underlying parametric statistics.


THOMAS R. LUNSFORD, MSE, CO, is president of Lone Star Orthotics Inc. at The Institute for Rehabilitation and Research in Houston, Texas, and assistant professor of physical medicine and rehabilitation at Baylor College of Medicine.

BRENDA RAE LUNSFORD, MS, MAPT, is visiting assistant professor at Texas Woman's University in Houston, Texas, and physical therapist II at The Institute for Rehabilitation and Research.

References:

  1. Siegel S. Nonparametric statistics for the behavioral sciences. McGraw-Hill, Series in Psychology, 1956.
  2. Portney LG, Watkins MP. Foundations of clinical research: applications to practice. Norwalk, Conn.: Appleton & Lange, 1993.
  3. Mattson DE. Statistics: difficult concepts, understandable explanations. Mosby Co., 1981.
  4. Colton T. Statistics in medicine. Little Brown & Co., 1974.
  5. Lunsford BR. Statistics: screening and data summary. JPO 1993; 5:4:125-30.
  6. Lunsford BR. Methodology: variables and levels of measurement. JPO 1993; 5:4:121-4.
  7. Lehmkuhl D. Mixing one part common sense with each part statistics in planning the design and reporting the results of clinical research in physical therapy. Phys Ther 1987;67:12:1851-3.
  8. Huck SW, Cormier WH, Bounds WG. Reading statistics and research. Harper Collins, 1974.
  9. Ferguson GA. Statistical analysis in psychology and education, 5th ed. New York: McGraw-Hill, 1981.
  10. Dunn OJ. Basic statistics: a primer for the biomedical sciences, 2d ed. New York: John Wiley & Sons, 1977.
  11. APTA. Reading tips for reports on research: an anthology. Amer Phys Ther Assn, 1986.
  12. Pairwise mean comparisons in 7D. BMDP communications newsletter, October 1983; 16:3.


 


 

Copyright © American Academy of Orthotists & Prosthetists (AAOP)
All rights reserved.
