This article discusses the basic characteristics of nonparametric statistical tests, contrasting them with the characteristics of parametric statistical tests. Examples for performing nonparametric statistical tests on practitioners' own data also are included.
Nonparametric tests can be used with data that are of the nominal (e.g., characteristics such as right and left, male and female) and ordinal (e.g., mild, moderate and severe) levels of measurement, which may not follow the normal distribution curve or comply with other assumptions required of data analyzed by parametric statistical methods. However, the results of analyzing data with these nonparametric statistical methods can yield important information about the degree to which qualities of one group of data differ from those of another group.
Statistical tests have been developed to permit comparisons regarding the degree to which qualities of one group of data differ from those of another group. Each statistical test is based on certain assumptions about the population(s) from which the data are drawn. If a particular statistical test is used to analyze data collected from a sample that does not meet the expected assumptions, then the conclusions drawn from the results of the test will be flawed.
The two classes of statistical tests are called parametric and nonparametric. The word parametric comes from "metric," meaning to measure, and "para," meaning beside or closely related; the combined term refers to the assumptions about the population from which the measurements were obtained. Nonparametric data do not meet such rigid assumptions. Nonparametric tests sometimes are referred to as "distribution-free." That is, the data can be drawn from a sample that may not follow the normal distribution (1).
Before a parametric test can be undertaken, it must be ascertained that: 1) the samples are random (i.e., each member of the population has an equal chance of being selected for measurement); 2) the scores are independent of each other; 3) the experiments are repeatable with constancy of measurements from experiment to experiment; 4) the data are normally distributed; and 5) the samples have similar variances (1,2). Parametric statistics use mean values, standard deviation and variance to estimate differences between measurements that characterize particular populations.
The two major types of parametric tests are Student's t-tests ("Student" was the pseudonym of W. S. Gosset, the statistician who developed the test) and analyses of variance (ANOVA). Nonparametric tests use rank or frequency information to draw conclusions about differences between populations. Parametric tests usually are assumed to be more powerful than nonparametric tests. However, parametric tests cannot always be used to analyze the significance of differences because the assumptions on which they are based are not always met.
A statistical test can never establish the truth of a hypothesis with 100-percent certainty. Typically, the hypothesis is specified in the form of a "null hypothesis," i.e., the score characterizing one group of measurements does not differ (within an acceptable margin of measurement error) from the score characterizing another group. Note the hypothesis does not state the two scores are the same; rather, it states no significant difference can be detected. Performing the statistical procedure yields a test result that helps one reach a decision that 1) the scores are not significantly different (the hypothesis is not rejected) or 2) the difference in the scores is too great to be explained by chance (the hypothesis is rejected).
Rejecting the hypothesis when it actually is true is called a Type-I error. Failure to reject the hypothesis when it is false is termed a Type-II error. For convenience and simplicity, a 5-percent risk of making a Type-I error has become conventional; one should be correct 95 out of 100 times when using the listed value in the probability tables to accept or reject the hypothesis.
In statistics, robustness is the degree to which a test can stray from the assumptions before changing the confidence you have in the result of the statistical test you have used. Choosing a nonparametric test trades the power, or large sample size (3), of the parametric test for robustness (4). Further, a method requiring few and weak assumptions about the population(s) being sampled is less dependable than the corresponding parametric method and increases the chances of committing a Type-II error (5,6).
Independence or dependence of samples concerns whether the different sets of numbers being compared are independent or dependent of each other (7). Sets are independent when values in one set tell nothing about values in another set. When two or more groups consist of different, unrelated individuals, the observations made about the samples are independent. When the sets of numbers consist of repeated measures on the same individuals, they are said to be dependent. Similarly, if male and female characteristics are compared using brother-sister pairs, the samples are dependent. Matching two or more groups of individuals on factors such as income, education, age, height and weight also yields dependent samples.
This type of selection sometimes is difficult to confirm. However, so long as the data sets used in the analysis are relatively normally distributed, the robustness of most parametric tests still provides an appropriate level of rejection of the null hypothesis.
The variances of the data from each group being compared must be equal (homogeneous); this condition can be tested statistically (8). If the variances are found to differ significantly, then nonparametric tests must be used.
Parametric tests require data from which means and variances can be calculated, i.e., interval and ratio data. Some statisticians also support the use of parametric tests with ordinal-scaled values because the distribution of ordinal data often is approximately normal. As long as the actual data meet the parametric assumptions, regardless of the origin of the numbers, then parametric tests can be conducted. As is the case with all statistical tests of differences, the researcher must interpret parametric statistical conclusions based on ordinal data in light of their clinical or practical implications.
Nonparametric tests are used in the behavioral sciences when there is no basis for assuming certain types of distributions. Siegel has advocated that nonparametric tests be used for nominal and ordinal levels of measurements while parametric tests be used for analyzing interval and ratio data (4).
On the other hand, Williamsen has argued that statistical tests are selected to meet certain goals or to answer specific questions rather than to match certain levels of measurement with parametric or nonparametric procedures (9). This view currently is prevalent among many statisticians.
In practice, levels of measurement sometimes are "downgraded" from ratio and interval scales to ordinal or nominal scales for the convenience of a measuring instrument or interpretation. For example, muscular strength (measured with a force gauge and considering the length of the lever arm through which the force is acting) is a variable that yields ratio data because a true zero point exists in the level of measurement. Muscular strength is absent with paralysis (true zero point). The manual muscle test converts the ratio characteristic of force into an ordinal scale by assigning grades of relative position (normal, good, fair, poor, trace, zero; or 5, 4, 3, 2, 1, 0).
Nonparametric tests should not be substituted for parametric tests when parametric tests are more appropriate. Nonparametric tests should be used when the assumptions of parametric tests cannot be met, when very small numbers of data are used, and when no basis exists for assuming certain types or shapes of distributions (9).
Nonparametric tests are used if data can only be classified, counted or ordered, for example, rating staff on performance or comparing results from manual muscle tests. These tests should not be used in determining precision or accuracy of instruments because the tests are lacking in both areas.
Nonparametric tests usually can be performed quickly and easily without automated instruments (calculators and computers). They are designed for small numbers of data, including counts, classifications and ratings. They are easier to understand and explain.
Calculations of nonparametric tests generally are easy to perform and apply, and they have certain intuitive appeal as shortcut techniques. Nonparametric tests are relatively robust and can be used effectively for determining relationships and significance of differences using behavioral research methods.
Parametric tests are more powerful than nonparametric tests and deal with continuous variables whereas nonparametric tests often deal with discrete variables (10). Using results from analyses of nonparametric tests for making inferences should be done with caution because small numbers of data are used, and no assumptions about parent populations are made. The ease of calculation and reduced concern for assumptions have been referred to as "quick and dirty statistical procedures" (11).
Descriptive statistics involve tabulating, depicting and describing collections of data. These data may be either quantitative, such as measures of leg length (variables that are characterized by an underlying continuum) or representative of qualitative variables, such as gender, vocational status or personality type.
Collections of data generally must be summarized in some fashion to be more easily understood. Descriptive statistics serve as the means to describe, summarize and reduce to manageable form the properties of an otherwise unwieldy mass of data. Descriptive statistics used to characterize data analyzed by parametric tests include the mean, standard deviation and variance.
Those descriptive statistics used to characterize data analyzed by nonparametric tests include the mode, median and percentile rank:
where Ri is the rank of the observation Xi (ranked from highest to lowest), and n is the number of observations in the distribution. The median is the 50th percentile.
In statistics, the mean or median commonly is used when dealing with measurement data. The mode most often is useful when dealing with data more appropriately handled with classification procedures (e.g., mild, moderate, severe).
Correlation coefficients are used to reveal the nature and extent of association between two variables. Each method used to determine a correlation coefficient has conditions that must be met for its use to be appropriate. The first step in analyzing a relationship always is selection of the proper measure of association based on the conditions of the study and the hypothesis to be tested.
Measures of association are useful for a variety of studies. Correlation coefficients are used in exploratory studies to determine relationships among variables in new study areas. The results of such studies allow investigators to formulate further research questions or hypotheses to delve more deeply into the study area. In some studies, the hypotheses focus on associations between selected variables, and the correlation coefficients serve to test these hypotheses.
Similarly, hypotheses based on expected associations among variables make important contributions to theory building.
Finally, correlation coefficients are used to manage threats to validity in experimental and quasi-experimental studies. They can be used to test the credibility of findings when groups have been compared by checking on the association of independent and extraneous variables with the dependent variable.
Spearman's rank order correlation coefficient rho is a nonparametric method of computing correlation from ranks. The method is similar to that used to compute Pearson's correlation coefficient (a parametric test), with the computed value rho providing an index of relation between two groups of ranks.
If the original scores are ranks, the computed index will be similar in value to that computed by the Pearson (product moment) method. The difference between the two methods is the product moment method assigns weight to the magnitude of each score whereas the rank method focuses on the ordinal position of each score (9). The coefficient of rank correlation (rho) ranges from +1, when paired ranks are changing in the same order, to -1, when ranks are changing in reverse order. A score of zero indicates the paired ranks are occurring at random. The equation for rank correlation is:
$$\rho = 1 - \frac{6\sum d^2}{n(n^2 - 1)} \qquad (2)$$
where d is the difference between each subject's rank for the two variables being studied, $\sum d^2$ is the sum of squared differences between ranks, 6 is a constant and n is the number of paired scores.
Suppose 10 students are drawn at random from a large class; each student has been rated on a 10-point scale for a recent clinical experience, and each student has a grade-point average (GPA) on file. The coefficient of ranks can be computed to determine the extent of agreement between the two sets of scores (clinical experience ratings and GPA). In the rank correlation method, the raw scores are replaced by assigning an appropriate rank to each score of each set. Ranks for each set correspond to the total number of scores in that set.
Step 1. Make a table of the subjects' scores and ranks for the two variables of interest and subtract the ranks to determine the difference (diff) for each pair of ranks. Square each of these differences and sum the squared values (see Table A ).
This example illustrates what happens when scores are similar (tied ranks). When tied ranks occur (e.g., column Y of Table A), each score is assigned the average rank the tied scores occupy (a higher rank is better). The GPA of 3.2, for example, had two scores occupying ranks 5 and 6. The average rank for the score 3.2 is obtained by adding the ranks and dividing by the number of ranks occupied (e.g., (5 + 6) ÷ 2 = 5.5).
Step 2. Substitute the calculated value of $\sum d^2$ in Equation (2) and solve for rho:
Consulting a textbook of statistics that provides a table of values for rho, one finds a minimum rho value of 0.746 is needed to be considered significant at the .05 level of significance. Thus, the correlation coefficient rho of 0.867 confirms a statistically significant correlation between the two sets of rankings (a conclusion that will be incorrect less than five times out of 100).
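The rank-correlation steps can be sketched in a few lines of Python. This is a minimal illustration, not the article's own worked table: the ratings and GPAs below are hypothetical values chosen only to show how a tie (two GPAs of 3.2) receives the average rank of 5.5, and the function names are assumptions introduced for the example.

```python
def average_ranks(scores):
    """Rank scores from highest (rank 1) to lowest; tied scores share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next score is tied with the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        shared = (pos + 1 + end + 1) / 2  # average of the ranks the tie occupies
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def spearman_rho(x, y):
    """Rank correlation: rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Hypothetical data for 10 students (not the values of Table A).
ratings = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
gpas = [3.9, 3.7, 3.8, 3.5, 3.2, 3.2, 3.0, 2.8, 2.9, 2.5]

gpa_ranks = average_ranks(gpas)
rho = spearman_rho(ratings, gpas)
print(gpa_ranks)
print(round(rho, 3))
```

The computed rho still has to be compared with a tabled critical value for the given n, as described in the text.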
Kendall's rank correlation tau (τ) is another nonparametric measure of association. When relatively large numbers of ties exist in a set of rankings, Kendall's tau is preferred over Spearman's rho. The formula and procedures for calculating it have been adapted from Siegel (4).
$$\tau = \frac{S}{\frac{1}{2}N(N - 1)} \qquad (3)$$
where S is the actual score, $\frac{1}{2}N(N - 1)$ is the maximum possible score, and N = the number of objects or individuals ranked on both X and Y characteristics.
The value of S can be determined by arranging the first set of measurements (see Table B and Table C) into their natural order (e.g., 1, 2, 3, 4, 5) and aligning the second set of measurements under them (e.g., 2, 1, 4, 5, 3). Starting with the first number on the left in the bottom row, the number of ranks to its right that are larger is counted. The derivations of the actual score and the maximum possible score are illustrated in the example that follows: Two orthotists rank the fit of an "off-the-shelf" ankle-foot orthosis (AFO) on five different patients. The two sets of rankings appear in Table B.
Step 1. Rearrange the data so the first orthotist's rankings fall in a natural (increasing) order and the second orthotist's rankings are tabulated in the same order (see Table C ).
Step 2. Compare the first ranking of orthotist 2 with every ranking to its right, assigning a +1 to every pair in which the order is natural and a -1 to every pair in which the order is unnatural:
Repeat for each subsequent ranking of orthotist 2:
Step 3. Add these measures of "disarray" (sum = 4) and enter this sum in the above formula as a substitute for S.
Step 4. The value of N = 5. Thus, Equation (3) becomes:
$$\tau = \frac{4}{\frac{1}{2}(5)(5 - 1)} = \frac{4}{10} = 0.400$$
Step 5. The statistical significance can be determined by two procedures, depending on sample size.
If N is equal to or less than 10, use a probability table such as that found in the appendix of a textbook on statistics to find the statistical significance of τ. In this example, the table of probability indicates a probability score (p-value) of 0.242 for a τ value of 0.400. Thus, this test supports the conclusion that the ratings of the two orthotists are not significantly correlated.
For situations in which N is greater than 10, a z score can be computed for the τ obtained and the statistical significance of the correlation read from a corresponding table of z scores:
$$z = \frac{\tau}{\sqrt{\dfrac{2(2N + 5)}{9N(N - 1)}}}$$
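The pair-counting procedure for S lends itself to a short sketch. The code below assumes the first ranking is already in natural order and that there are no ties; it uses the illustrative ordering 2, 1, 4, 5, 3 mentioned in the text, which is assumed here (not confirmed) to correspond to Table C. The function names are hypothetical.

```python
import math


def kendall_s(second):
    """Sum +1 for every pair in natural order and -1 for every pair out of order.

    `second` is the second judge's ranking, listed in the order that puts the
    first judge's ranking into natural (increasing) order. Ties are not handled.
    """
    s = 0
    for i in range(len(second)):
        for j in range(i + 1, len(second)):
            s += 1 if second[j] > second[i] else -1
    return s


def kendall_tau(second):
    """tau = S / (N * (N - 1) / 2)."""
    n = len(second)
    return kendall_s(second) / (0.5 * n * (n - 1))


def tau_z(tau, n):
    """Normal approximation for tau, intended for N greater than 10."""
    return tau / math.sqrt(2 * (2 * n + 5) / (9 * n * (n - 1)))


orthotist2 = [2, 1, 4, 5, 3]  # illustrative ordering from the text
print(kendall_s(orthotist2))    # S
print(kendall_tau(orthotist2))  # tau
```

With only five patients the z approximation is not appropriate, so `tau_z` is included only to show the form of the large-sample calculation.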
The Chi-square test of independence is a nonparametric test designed to determine whether two variables are independent or related. This test is designed to be used with data that are expressed as frequencies; it should not be used to analyze data expressed as proportions (percentages) unless they are first converted to frequencies.
The application of Chi-square to contingency tables can best be illustrated by working through an example. Suppose a sample of new graduates of an orthotic educational program and orthotists with more than five years of clinical experience were asked whether research should be a part of every orthotist's practice. The replies were recorded as "Agree" or "Disagree."
Step 1. Organize the data into the form of a 2 x 2 contingency table (see Table D ). Note the table includes row totals, column totals and the grand total of subjects included in the sample.
The actual numbers of "Agree" responses were 82 from recent graduates and 30 from experienced orthotists. The numbers disagreeing with the statement were 12 and 66, respectively.
The rationale that underlies Chi-square is based on the differences between the observed and the expected frequencies. The observed frequencies are the data produced by the survey. The expected frequencies are computed on the assumption that no difference existed between the groups except that resulting from chance occurrences.
Step 2. The expected frequencies are computed as follows:
Cross-tabulation and the computation of Chi-square can be made when the variables are nominal as well as ordinal, interval or ratio, and the Chi-square statistic is useful for discrete or continuous variables. However, it is assumed that data occur in every category; thus, no cell may have an observed frequency of zero. The formula for the degrees of freedom for calculating Chi-square and the contingency coefficient is:
$$df = (k - 1)(r - 1)$$
where k = number of columns in the contingency table and r = number of rows in the contingency table.
Step 3. The Chi-square (χ²) is calculated using Equation (6):
$$\chi^2 = \sum \frac{(O - E)^2}{E} \qquad (6)$$
where O is the observed number of cases found in the ith row of the jth column, and E is the expected frequency obtained by multiplying the two marginal totals for each cell and dividing the product by N.
Step 4. The Chi-square is computed by finding the difference between the observed and expected frequencies in each cell, squaring that difference and dividing by the expected frequency of that cell. The results for the cells are then added (see Table E), and the total is the value of Chi-square (χ² = 61.84).
Step 5. Consulting a table of Chi-square values in a textbook of statistics, using 1 degree of freedom and the 0.05-level of significance, we find that a minimum value of 3.84 is needed for the observed frequency to be considered significantly different from the expected frequency. In this example, the value of χ² greatly exceeds that minimum value; thus, the observed values are significantly different from the expected values.
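Steps 1 through 4 can be sketched as follows, using the response counts given in the text (82 and 30 agreeing, 12 and 66 disagreeing). The function name and return layout are assumptions made for the example. Run on these counts, the sketch gives a Chi-square of roughly 61.5, close to but not identical with the 61.84 quoted in the text; either value is far above the 3.84 criterion.

```python
def chi_square(table):
    """Return (chi-square, degrees of freedom, expected frequencies) for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected frequency of a cell = row total * column total / grand total.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(col_totals) - 1)
    return chi2, df, expected


# Rows: Agree, Disagree. Columns: recent graduates, experienced orthotists.
observed = [[82, 30], [12, 66]]
chi2, df, expected = chi_square(observed)
print(round(chi2, 2), df)
print([[round(e, 2) for e in row] for row in expected])
```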
The use of the Chi-square statistic has important limitations. Although no association is indicated by a zero, a perfect association is not indicated by a 1.00. Moreover, the size of Chi-square is influenced by both the size of the contingency table and the size of the sample.
The addition of rows and columns as a table grows is accompanied by larger and larger values of Chi-square, even when the association remains essentially constant. If the sample size is tripled while everything else remains the same, the value of Chi-square is tripled. Degrees of freedom depend on the number of rows and columns, not the sample size; thus, inflated values of Chi-square occur for large samples, leading the investigator to conclude the differences between observed and expected frequencies are more significant than warranted. The Chi-square is designed for use with relatively small samples and a limited number of rows and columns.
The correlation coefficient phi corrects for the size of the sample when the table size is 2 x 2. The equation is:
$$\phi = \sqrt{\frac{\chi^2}{N}}$$
Phi is 0 when no relationship exists and 1 when variables are related perfectly. When tables are greater than 2 x 2, Phi has no upper limit and is not a suitable statistic to use. The statistical significance of Phi may be tested by calculating a corresponding Chi-square value and assigning 1 degree of freedom to it (12):
$$\chi^2 = N\phi^2$$
Cramer's V is an adjusted Phi, modified to be suitable for tables larger than 2 x 2. The value of V is zero when no relationship exists and 1 when a perfect relationship exists. The equation for Cramer's V (13) is:
$$V = \sqrt{\frac{\chi^2}{N(k - 1)}}$$
Thus, when 2 x 2 tables are involved, Phi may provide a more useful measure of the relationship between the two variables than that provided by Chi-square. For tables larger than 2 x 2, Cramer's V is the statistic of choice (k = the smaller of the number of rows and the number of columns in the table).
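As a rough illustration of how Phi and Cramer's V rescale Chi-square, the sketch below applies them to the value reported for the survey example (Chi-square = 61.84, N = 190). The function names are hypothetical, and the V function takes k as the smaller of the number of rows and columns, which is the commonly used definition and is an assumption here.

```python
import math


def phi_coefficient(chi2, n):
    """Phi for a 2 x 2 table: square root of chi-square divided by sample size."""
    return math.sqrt(chi2 / n)


def cramers_v(chi2, n, rows, cols):
    """Cramer's V, with k taken as the smaller of rows and cols (assumed definition)."""
    k = min(rows, cols)
    return math.sqrt(chi2 / (n * (k - 1)))


phi = phi_coefficient(61.84, 190)
v = cramers_v(61.84, 190, rows=2, cols=2)
print(round(phi, 3), round(v, 3))
```

For a 2 x 2 table the two statistics coincide, which is why V is needed only for larger tables.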
Two-Group Design: Chi-square (2 x 2)
The Chi-square comparison of differences between two groups is one of the better known and commonly used statistical procedures. The same procedure for Chi-square (χ²) as described above can be used to test for the significance of differences in two groups of data that are expressed as frequencies.
Suppose researchers wanted to determine if the proportion of trauma patients being referred for orthotic services in a particular hospital was significantly different from the proportion being referred for orthotic services in another hospital with a similar mix of patients. During a specific 12-month period, the orthotic department in Hospital A filled 238 requests for orthoses from a pool of 2,222 patients, and the orthotic department in Hospital B filled 221 requests for orthoses from a pool of 1,238 patients. First, the data are organized into a 2 x 2 contingency table (see Table F).
As before, the general equation for Chi-square is:
$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
To compute χ² for a contingency table, simply square the difference between the observed and expected frequencies in each cell and divide by the expected frequency of that cell. Finally, total the cells to obtain the χ² value (see Table G). χ² = 23.72, which is evidence that the experiences of the two hospitals are significantly different.
The Chi-square median test can be used to determine if the medians of two groups are different. For example, all of the male patients fitted with an AFO to correct foot-drop following the onset of hemiplegia were asked to rate the comfort of their footwear when walking with the AFO. Forty-four patients were evaluated; 32 wore normal leather shoes and 12 wore tennis shoes.
Comfort was rated on a nine-point scale (the larger the score, the greater the comfort in walking), and the evaluation was made six months after fitting the AFO and with the subject walking 50 yards. The median comfort rating of the 44 patients was 7.3. The number of subjects rating their comfort above or below the grand median is shown in Table H .
The Chi-square computation viewing the leather shoe and tennis shoe wearers as random samples is shown below. The ratings are discrete units; each patient's rating appears only once, and the ratings are independent.
$$\chi^2 = n\left(\sum_{r}\sum_{c}\frac{n_{rc}^2}{n_r n_c} - 1\right)$$
where n is the total number of observations, $n_{rc}$ is the number of observations in the rcth cell of the contingency table, $n_r$ is the number of observations in the rth row of the table, and $n_c$ is the number of observations in the cth column of the table.
A table of Chi-square values is then consulted to determine if this calculated value of Chi-square is sufficiently large to represent a statistically significant difference between the median scores. The degree of freedom is (R - 1)(C - 1) = (1)(1) = 1. In this example, a Chi-square of 3.84 or larger would be needed (at a .05 level of significance) to justify the conclusion that the comfort levels of the two different types of footwear were significantly different.
Tukey's quick test is used to determine if the results of two different interventions produced the same or different effects. Suppose a sample of 20 patients with limitation in elbow extension on one side that exceeded 50 degrees was treated with one of two methods for reducing contractures (7). Subgroup A, consisting of 10 patients, was treated with serial casting over a period of one month; Subgroup B, also consisting of 10 patients, was treated with an adjustable splint worn 18 hours a day for one month. The increased range of motion for subjects in the two groups is shown in Table I .
Tukey's quick test is applied by identifying the group containing the largest value and the group containing the smallest value in the two groups. In this example, Group B contains the largest value of either group (41), and Group A contains the smallest value (14). The values in Group B that are larger than the largest value in Group A (36) are counted and the count recorded (in this example, there are 2). Next, the values in Group A that are smaller than the smallest value in Group B (18) are counted (there are 2). The two counts are added, and, if the sum is equal to or greater than 7, we conclude the effects of the two treatments are different. If the sum is less than 7, we conclude the effects of the two treatments are not different. In the present example, the sum of the two counts equals 4; therefore, we conclude the effects of the two interventions are not different. In the event the largest and the smallest values occur in the same group, we conclude automatically that the two treatments did not have different effects. In Tukey's test the number 7 is a constant and is the criterion value to be used with any set of data.
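The counting rule can be sketched as below. The two lists are hypothetical values, not the contents of Table I; they were chosen only to match the features quoted in the text (largest value 41 in Group B, smallest value 14 in Group A, largest Group A value 36, smallest Group B value 18). The function name is an assumption, and ties at the extremes are simply treated as "no difference."

```python
def tukey_quick_count(a, b):
    """Return the summed end counts, or 0 when one group holds both extremes."""
    if max(a) < max(b) and min(a) < min(b):
        low, high = a, b
    elif max(b) < max(a) and min(b) < min(a):
        low, high = b, a
    else:
        return 0  # largest and smallest values fall in the same group (or tie)
    upper = sum(1 for v in high if v > max(low))  # values above the other group's maximum
    lower = sum(1 for v in low if v < min(high))  # values below the other group's minimum
    return upper + lower


# Hypothetical gains in range of motion (degrees).
group_a = [14, 16, 19, 22, 25, 27, 30, 32, 34, 36]
group_b = [18, 20, 23, 26, 28, 31, 33, 35, 38, 41]

count = tukey_quick_count(group_a, group_b)
print(count, "different" if count >= 7 else "not different")
```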
The Mann-Whitney U-test is a rank test for two independent samples, each with a small number of subjects. This test is a good alternative to the parametric t-test. Suppose measurements of the height of the ankle joint axis (in millimeters) in a group of patients receiving services in Orthotic Clinic A are compared with measurements taken from a group of patients in Orthotic Clinic B to determine if they are comparable. Because of the small number of cases, a nonparametric test is selected. The measurements are assigned a rank in ascending order of height, with a rank of 1 being the smallest value:
The ranks then are ordered according to their identity.
The value of the Mann-Whitney U-test is found by determining the number of A scores preceding each B score and summing these counts over all of the B scores (rank 2A precedes 3B = +1; ranks 2A and 4A precede 5B = +2; ranks 2A, 4A and 6A precede 7B = +3, and so on). In this example, U = 18. Consulting a Mann-Whitney U-test table for nB = 7 (larger sample size), locate the U value (18) on the left-hand margin and nA = 5 at the top of the table. The tabled probability associated with this U value is 0.562, which is not statistically significant (i.e., the distributions of ankle heights in the two clinics are not shown to differ). This procedure is appropriate only when the larger sample size is 8 or smaller. Different procedures and tables are used for samples ranging between 9 and 20 and larger than 20, respectively. Procedures and tables for the Mann-Whitney U-test can be found in Siegel and Castellan (14).
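The counting of A scores that precede each B score can be sketched as follows. The ordering of clinic labels below is hypothetical: it is one arrangement of five A and seven B measurements that agrees with the first few ranks described in the text and yields U = 18, and it should not be read as the article's actual table. The function name is an assumption.

```python
def mann_whitney_u(labels):
    """Count, for every B score, how many A scores precede it.

    `labels` lists the group label of each measurement in ascending rank order.
    Ties between groups are not handled in this sketch.
    """
    u = 0
    a_seen = 0
    for label in labels:
        if label == "A":
            a_seen += 1
        else:
            u += a_seen  # every A seen so far precedes this B
    return u


# Hypothetical rank order, smallest measurement first.
order = list("BABABABBABAB")
u = mann_whitney_u(order)
print(order.count("A"), order.count("B"), u)
```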
The Wilcoxon matched pairs/rank test is an alternate form of the Mann-Whitney test that is used when the samples are dependent. For purposes of illustration, presume the time to ambulate 25 meters is measured with a stopwatch when the patient is wearing a new type of lightweight KAFO and again when wearing a conventional metal KAFO. The ambulation times for each patient are tabulated, and the absolute difference between each pair of numbers is calculated. The nonzero differences then are ranked according to their absolute values and separated into ranks associated with positive and negative differences. Table J shows the sums of the positive and negative ranks for this example.
As in the case with the Mann-Whitney procedures for analyzing differences between independent samples, the resulting score, in this case called a T value, is used to look up the statistical significance of the differences in a table (for example, Table G in the appendix of Reference 4). In this example, a T value of 8 or more indicates that the two situations are significantly different, and the subjects walked more quickly when wearing the lightweight KAFO.
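The ranking of paired differences can be sketched as below. The ambulation times are hypothetical (Table J is not reproduced), the function name is an assumption, and T is taken here as the smaller of the two rank sums, which is the convention in Siegel's tables and may not match every textbook table.

```python
def signed_rank_sums(first, second):
    """Return (sum of ranks of positive differences, sum of ranks of negative differences)."""
    diffs = [a - b for a, b in zip(first, second) if a != b]  # drop zero differences
    diffs.sort(key=abs)
    pos_sum = neg_sum = 0.0
    i = 0
    while i < len(diffs):
        j = i
        # Group differences with the same absolute value so they share an average rank.
        while j + 1 < len(diffs) and abs(diffs[j + 1]) == abs(diffs[i]):
            j += 1
        rank = (i + 1 + j + 1) / 2
        for d in diffs[i:j + 1]:
            if d > 0:
                pos_sum += rank
            else:
                neg_sum += rank
        i = j + 1
    return pos_sum, neg_sum


# Hypothetical times (seconds) to walk 25 meters for eight patients.
conventional = [32.0, 28.5, 40.0, 35.5, 30.0, 38.0, 29.0, 33.0]
lightweight = [29.0, 27.0, 41.0, 31.0, 30.0, 33.5, 26.5, 31.0]

pos_sum, neg_sum = signed_rank_sums(conventional, lightweight)
t_value = min(pos_sum, neg_sum)
print(pos_sum, neg_sum, t_value)
```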
This method functions like the conventional one-way analysis of variance. The null hypothesis is tested to determine if the differences among samples show true population differences or whether they represent chance variations to be expected among several samples from the same population. The test is based on the assumptions that ranks within each sample constitute a random sample for the set of numbers 1 through N (15) and that the variable being tested has a continuous distribution (4). Scores in all samples are combined and arranged in order of magnitude so that ranks can be given to each score. The lowest score is assigned the rank of 1. The scores then are replaced in their respective samples with appropriate ranks. The ranks for each sample are summed. The assumption is that mean rank sums (R) are equal for all samples and equal to the mean of the N ranks, (N + 1)/2, if the samples (K) are from the same population (16). Both equal- and unequal-sized samples can be used in this test because the sums of sample ranks (ΣR) are pooled in the equation. The statistic H used in this test can be defined by the equation:
$$H = \frac{12}{N(N + 1)} \sum_{j=1}^{K} \frac{R_j^2}{n_j} - 3(N + 1)$$
where $R_j$ is the sum of the ranks in the jth sample, $n_j$ is the number of scores in the jth sample, and N is the number of scores in all samples combined. The random sample distribution of H is approximated by a Chi-square distribution of K-1 degrees of freedom, where K is the number of samples. The Chi-square probability can be found in appendix tables published in Reference 11. The Kruskal-Wallis One-Way Analysis of Variance by Ranks is used when assumptions for the parametric Analysis of Variance are not suitable for the data, or when the level of data is less than interval measures.
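The pooling, ranking and summing steps for H can be sketched as follows. The three samples are hypothetical scores chosen for illustration, the function name is an assumption, and no correction for ties is applied beyond assigning average ranks.

```python
def kruskal_wallis_h(samples):
    """H = 12 / (N (N + 1)) * sum(R_j^2 / n_j) - 3 (N + 1), with the lowest score ranked 1."""
    pooled = sorted(v for s in samples for v in s)
    rank_of = {}
    i = 0
    while i < len(pooled):
        j = i
        # Tied scores share the average of the ranks they occupy.
        while j + 1 < len(pooled) and pooled[j + 1] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j + 1) / 2
        i = j + 1
    n = len(pooled)
    total = sum(sum(rank_of[v] for v in s) ** 2 / len(s) for s in samples)
    return 12 / (n * (n + 1)) * total - 3 * (n + 1)


# Hypothetical scores for three unequal-sized samples.
samples = [[12, 15, 11], [18, 20, 17, 22], [25, 19, 24]]
h = kruskal_wallis_h(samples)
df = len(samples) - 1
print(round(h, 3), df)
```

The resulting H is then referred to a Chi-square table with K - 1 degrees of freedom, as described above.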
This article provides students and clinicians in the field of prosthetics/orthotics with basic information about the distinctions between parametric and nonparametric statistical methods. Knowledge of these distinctions is essential in reaching a decision about which statistical method would be appropriate for testing the strength of a correlation between two sets of data or determining if the differences between two sets of observations are great enough to be considered significant from a statistical point of view. One still must make the judgment if the difference is of clinical significance.
Some of the most commonly used nonparametric statistical methods have been described in sufficient detail that readers should be able to use the methods to answer questions pertinent to their own practices, using data accumulated in their own settings.
L. DON LEHMKUHL, PHD, FAPTA, is associate professor in the department of physical medicine and rehabilitation at Baylor College of Medicine, Houston, TX 77030.