Statistics: Screening and Data Summary
Brenda Rae Lunsford, MS, PT
ABSTRACT
To gain meaningful results from a research effort, data must be collected and
then analyzed correctly. While rigorous
research methods must be adhered to
during the data collection process, similar
efforts must be made to assure data are
handled correctly during analysis.
The importance of thorough screening and proofing cannot be overestimated. In computerized data management
there are numerous opportunities to err.
If one does not examine the data carefully, valid results may not be obtained
(1).
This article will present the first two
steps in the statistical management of
data: proofing and screening. Proofing
refers to the clerical steps required, such
as checking for recording errors and
making sure data recorded manually
are attached to the correct subjects, observations and variables when entered
into a computer. Screening refers to
technical data management such as
checking that the data conform to the
mathematical assumptions necessary
for subsequent analysis.
Introduction
Each statistical test relies on mathematical assumptions that, if not adhered
to, will render data analysis invalid
(2,3). Specifically, it is important to
know if the data are normally distributed. Since most analyses performed on
clinical-medical studies rely on the normal distribution, the bell curve shape
will be discussed in detail. There are
other distributions of statistical importance that are beyond the scope of this
article (2).
Extreme values (outliers) need to be
identified, and the researcher should
decide if such occurrences are part of
the population about which generalizations will be made. Frequently, outliers
are the result of poor control during the
early stages of a project and should be
discarded. The presence of undefined
strata or subgroups (e.g., percentage
of burn, diagnostic classification, etc.)
needs to be identified.
A thorough proofing/screening process is followed by the first step in
data analysis, which is the data summary or description (1). The logical sequence in the statistical process is to:
- summarize data using single-variable (univariate) summaries such as
measures of central tendency
(mean, median, mode), spread (range,
minimum, maximum), variability
(standard deviation or variance), and the shape of the distribution
(e.g., normal, skewed);
- proceed with two-sample (bivariate) comparisons, such as Student's t-test or two-sample correlations; followed by
- multivariate comparisons (ANOVA), correlations, regression analysis,
etc.
Proofing
Assuming that manually recorded data
are entered into a computer, it is important to print the data set and proof
for accuracy. This step is often omitted. A number entered as 1,000 instead
of 10 can change an important result.
The data set also should be checked for
observations that are misaligned with
their subjects. Inconsistencies such as the number of subjects with
BK amputations being greater than the
total sample of all amputees can later
prove embarrassing.
For example, consider the highlighted areas in the hypothetical data set of
Table A. The age of subject "CD" as
210 years is obviously a recording error
that must be corrected, as is the "G" in
the first column, gender, for subject
AB. This is an important correction to
make since later, during analysis, you
may wish to sort by sex ("M" and
"F"), and this subject's data would not be
included.
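Clerical checks of this kind can be partly automated. The following is a minimal sketch, assuming each subject is stored as a record with hypothetical field names (`id`, `gender`, `age`) and an assumed plausible age range of 0 to 120 years; the records are illustrative and are not Table A.

```python
# Hypothetical records; field names and values are illustrative only.
records = [
    {"id": "AB", "gender": "G", "age": 20},
    {"id": "CD", "gender": "M", "age": 210},
    {"id": "EF", "gender": "F", "age": 35},
]

VALID_GENDER = {"M", "F"}   # codes the analysis expects
AGE_LIMITS = (0, 120)       # assumed plausible range, in years


def proof(rows):
    """Return (subject id, field, value) for every entry that needs review."""
    problems = []
    low, high = AGE_LIMITS
    for row in rows:
        if row["gender"] not in VALID_GENDER:
            problems.append((row["id"], "gender", row["gender"]))
        if not low <= row["age"] <= high:
            problems.append((row["id"], "age", row["age"]))
    return problems


flagged = proof(records)
for entry in flagged:
    print(entry)
```

A check like this only flags entries for review; deciding whether a flagged value is a recording error still requires going back to the original data sheets.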
With the errant age of 210 years, the
average age for the subjects is 67.75
years with a standard deviation of
95.57 years. By correcting the 210 to
21, the average age is 20.5 years and
the standard deviation is 11.8 years. If
one did not proof the data set and simply ran the computer analysis, this error would have invalidated the outcome.
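The effect of a single entry error on the summary can be checked directly. In the sketch below, the four ages are assumed values chosen to be consistent with the figures quoted above (Table A is not reproduced here), and the sample standard deviation uses the n - 1 divisor.

```python
import statistics

# Assumed ages, chosen only to be consistent with the quoted summary figures.
ages_as_entered = [210, 6, 20, 35]
ages_corrected = [21, 6, 20, 35]   # 210 corrected to 21


def summarize(values):
    """Return the mean and the sample standard deviation (n - 1 divisor)."""
    return statistics.mean(values), statistics.stdev(values)


mean_bad, sd_bad = summarize(ages_as_entered)
mean_ok, sd_ok = summarize(ages_corrected)

print(f"as entered: mean {mean_bad:.2f}, SD {sd_bad:.2f}")
print(f"corrected:  mean {mean_ok:.2f}, SD {sd_ok:.2f}")
```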
The other two areas of concern are
the single burn diagnosis mixed within
a sample of spinal cord injury, and the
single 6-year-old subject amid 20- to
35-year-olds.
If the entries are in error, they must
be corrected. However, if the recording is correct, then decisions need to be
made about the sample. Data in the
case of the 6-year-old subject were removed because he was not part of the
population to which this researcher wished to
generalize. The case of the subject with
the burn diagnosis was kept because
the variables of interest were related to
range-of-motion limitations similar to
the remaining subjects and not related
to the diagnosis.
Correcting the erroneous age and
deleting the 6-year-old subject results
in the data set shown in Table B. The
standard deviations of age and velocity
become smaller, 8.4 vs. 95.6 and 9.0 vs.
11.4, respectively, as the data become
more homogeneous. The importance
of this will become apparent as the
analyses of the data are discussed. For
more complex data sets there are computerized statistical techniques that allow detection of outliers (4). This also
illustrates the value of the researcher
being intimately involved with the
proofing/screening process.
Screening
To enable a realistic example, data
from a study of 80 spinal-injured subjects will be used. This study evaluates
the relationship between heart rate and
velocity of locomotion by wheelchair
and walking.
Descriptive Statistics
The first analysis of the data included
the calculation of the descriptive statistics (i.e., mean, median, standard deviation, minimum, maximum and skewness). There are two reasons for doing
this: first, to screen the data, and second,
to provide a quick summary result. For
example, if you had just finished collecting data on the walking velocity of a
group of patients with a new style of
prosthesis, how would you describe
how they performed? Reciting each
patient's velocity, heart rate, etc.,
would not allow meaningful inferences
to be made. Therefore, the first task in
data analysis is to organize the data
into some meaningful arrangement.
One of the easiest and most useful
steps is to produce a summary table as
shown in Table C.
To examine this table we will first
define the terms then examine the
data.
Variable: " . . . anything that takes
on different values from time to
time . . . " (5). Specifically:
WCHR: Heart rate achieved while
propelling a wheelchair.
WALKHR: Heart rate achieved
while walking.
WCVEL: Velocity attained while
propelling a wheelchair.
WALKVEL: Velocity attained
while walking.
n: Number of subjects
Central Tendency: There are three
measures of central tendency, the
mean, median and mode. When a distribution is normal, they are equal.
- Mean: The mean is the average value of a given
variable in the sample (3,5,6,7) and is the most common measure of central tendency for
sample distributions (Equation 1). The
mean is precisely defined and the most
stable of the measures of central tendency. When extreme values are present, however, the mean is not the best representative of the data (3). For example,
in a group of spinal cord injured patients the mean age was 28.4 years
while the median age was 24.2 years.
Because two older subjects in the sample pulled
the mean upward, the median in this case was a
better measure of the average age.
- Median: The median is defined as
the middle data value of a set of sample
data. Since the median is not affected
by extreme values, it is a better measure when a distribution is not symmetrical
(3). If data are distributed perfectly symmetrically,
the mean and median are the same
(3,6,7). In the example, 3 is the median
value (see Figure 1).
- Mode: The mode is the most common value in a set of sample data. This
is the least useful measure of central
tendency in biomedical research since
it is really sensitive only to counts (3).
The mode may not exist in continuous
data where the measurement instrument is sensitive, and there are no duplicate data values (3). In the example,
4 is the mode (see Figure 2).
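The three measures can be compared on a small data set. The values below are hypothetical (the figures are not reproduced here) and were chosen so that the median is 3 and the mode is 4; adding one extreme value shows how the mean shifts while the median barely moves.

```python
import statistics

values = [1, 2, 3, 4, 4]          # hypothetical data set

mean_v = statistics.mean(values)      # arithmetic average
median_v = statistics.median(values)  # middle value of the sorted data
mode_v = statistics.mode(values)      # most frequent value

# Append one extreme value and recompute.
with_outlier = values + [40]
mean_out = statistics.mean(with_outlier)
median_out = statistics.median(with_outlier)

print(f"mean {mean_v}, median {median_v}, mode {mode_v}")
print(f"with outlier: mean {mean_out}, median {median_out}")
```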
Standard Deviation: To understand
standard deviation a brief discussion of
normal distribution is warranted.
When data are distributed "normally,"
it means the data are distributed symmetrically about the mean, which lies at the
center (at zero when the data are standardized), and the data are distributed
such that a bell curve is formed. The
partitions of a bell curve are such that
about 68 percent of the data fall within
one standard deviation of the mean,
about 95 percent fall
within two standard deviations, and about 99.7 percent fall within
three standard deviations
(see Figure 3) (6,7,8).
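The proportions under the bell curve can be verified numerically. This sketch uses the standard identity that, for a normal distribution, the fraction of values within k standard deviations of the mean equals erf(k / sqrt(2)); the function name `coverage` is introduced here for illustration.

```python
import math


def coverage(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} SD: {coverage(k) * 100:.1f} percent")
```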
The sample standard deviation is the
positive square root of the variance
(see Equation 2). This is the most commonly used measure of variability
(2,6,7,8). Because it is expressed in the same units as the data, its values are closer to the
data values of interest than those of the variance, and it is an easier number to relate
to.
Variance: The variance is the mean of
the squared differences from the mean
of the distribution (see Equation 3).
Mathematically, the variance is the
square of the standard deviation. This
number gives information about the
distribution (spread) of the sample
data. The variance is the term commonly used in the mathematical calculations performed in statistical testing
(6,7,8).
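The equations referred to in the text are not reproduced here. The standard textbook definitions are given below as a reference sketch; the n - 1 divisor in the variance is the usual sample convention and is an assumption about the form the original equations took.

```latex
% Sample mean of n observations x_1, ..., x_n
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i

% Sample standard deviation: positive square root of the variance
s = \sqrt{s^{2}}

% Sample variance (n - 1 divisor assumed)
s^{2} = \frac{1}{n-1} \sum_{i=1}^{n} \left( x_i - \bar{x} \right)^{2}
```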
Minimum: The smallest value of a
variable.
Maximum: The largest value of a
variable.
For the purposes of screening it is
preferable to use minimum and maximum instead of range, since they make it possible to spot extreme values that
might be erroneous, such as the age of
210 in the earlier example. Many prefer to summarize using range; however, range alone is not useful in screening for
extreme values.
Skewness: Skewness occurs when
data are distributed unevenly about the
mean with a higher concentration at
one end (5,6,7). If the tail of the data is
to the left, the data are said to be
skewed to the left; if the
tail is to the right, they are skewed to the right (see Figure 4) (1,5,9).
The skew value for normally distributed data is zero (9). Values greater
than zero indicate a skew
to the right; values less than zero indicate a
skew to the left (9).
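The sign convention can be illustrated with a short calculation. The sketch below uses the moment coefficient of skewness (third central moment divided by the second central moment raised to the power 1.5); statistical packages may apply a small-sample adjustment, so this is not necessarily the formula used by the package cited in the text. The data are hypothetical.

```python
import statistics


def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5, using population moments."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


symmetric = [1, 2, 3, 4, 5]          # no tail: skewness of zero
right_tail = [1, 2, 3, 4, 20]        # one high value: positive skewness
left_tail = [-20, -4, -3, -2, -1]    # one low value: negative skewness

for name, data in [("symmetric", symmetric),
                   ("right tail", right_tail),
                   ("left tail", left_tail)]:
    print(f"{name}: {skewness(data):.3f}")
```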
Now return to the data set shown in
Table C. Those familiar with the diagnosis of incomplete paraplegia or
quadriplegia are aware it is possible to
experience the extremes of having very
little preservation of neuromuscular
function or to remain mostly intact
with little loss. For a patient to attempt
to walk with weak muscular control
and sensory deprivation can be such a
struggle that a velocity of 4 meters per
minute (MIN column) is not surprising.
However, examination of the MAX
column reveals a heart rate of 188 during wheelchair propulsion. That heart
rate value is of concern since it would
be greater than 90 percent of predicted
maximum for a 20-year-old subject
(10)!
Next, the 1.659 skewness score for
WCHR is high since a value of 0 corresponds to a normal distribution (i.e., a
symmetrical bell curve) (9). Another
hint of trouble is the differences between the mean and median for both
WCHR and WALKVEL. The fact that
the WCHR distribution curve has a
skew to the right and that the mean and
median have a difference of 4.2 units
gives one cause to examine the data
further.
The case is not so clear for the variable
WALKVEL, whose mean and median
are 11.2 units apart while the data are
not substantially skewed (skew -.065).
Also, the standard deviation of
WALKVEL is quite large, being 59 percent of the mean, which suggests a very
wide curve. The next step is to evaluate
the cause of the extreme skew for the
variable WCHR and the marked difference between mean and median values
for WALKVEL.
Distribution
The first step in screening a variable is
to analyze the distribution of the data
by frequency listing and/or histogram.
Perusal of the "value" column for WCHR indicates there is a reasonable continuum
of data from the upper 70s through the
low 130s, and a gap between 132 and
188. It is obvious there is an extreme
value at the upper end causing the skew
(see Figure 5). By plotting the frequency histogram of
these data in groups of five, you gain a clearer visual impression of the shape of the distribution
(see Figure 6).
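A frequency listing and a grouped histogram of this kind can be produced with a few lines of code. The sketch below uses hypothetical heart rates that mimic the pattern described (a continuum from the upper 70s to 132, then a single value of 188) and an interval width of five; it is not the study's data.

```python
from collections import Counter

# Hypothetical heart rates (beats per minute); not the study's data.
heart_rates = [78, 84, 84, 91, 95, 95, 95, 102, 108, 110, 117, 124, 132, 188]
WIDTH = 5  # interval width used for grouping


def frequency_listing(values):
    """Return (value, count) pairs in ascending order of value."""
    return sorted(Counter(values).items())


def binned_counts(values, width=WIDTH):
    """Count values per interval [start, start + width), keeping empty intervals."""
    counts = Counter((v // width) * width for v in values)
    first, last = min(counts), max(counts)
    return [(start, counts.get(start, 0))
            for start in range(first, last + width, width)]


listing = frequency_listing(heart_rates)
bins = binned_counts(heart_rates)

# Text histogram: the run of empty intervals exposes the isolated extreme value.
for start, count in bins:
    print(f"{start:3d}-{start + WIDTH - 1:3d} | {'*' * count}")
```

Keeping the empty intervals in the output is deliberate: the gap between the main body of the data and the extreme value is what the screening step is looking for.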
After this discovery it was learned
that the subject with the heart rate of
188 was having a medical problem that
caused erratic heart rates. Since the
cause of this high rate was unrelated to
the purpose of the research, this subject's data were removed. Table D
shows the improvement of the data
once this outlier was removed.
As a result, there was considerable
improvement in skewness, the standard
deviation became smaller and the median is now closer to the mean. These
data parameters are now more acceptable with respect to the requirement of
a normal distribution (2,6).
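The effect of removing a single outlier on the summary statistics can be seen by computing them before and after. The heart rates below are hypothetical and are not the values behind Table D; the `summary` helper is introduced only for illustration.

```python
import statistics


def summary(values):
    """Return the screening statistics for one variable as a dictionary."""
    mean = statistics.fmean(values)
    median = statistics.median(values)
    return {
        "n": len(values),
        "mean": mean,
        "median": median,
        "sd": statistics.stdev(values),   # sample SD, n - 1 divisor
        "min": min(values),
        "max": max(values),
        "gap": mean - median,             # mean minus median
    }


# Hypothetical heart rates with one extreme value of 188.
before = [78, 84, 84, 91, 95, 95, 95, 102, 108, 110, 117, 124, 132, 188]
after = [v for v in before if v != 188]

stats_before = summary(before)
stats_after = summary(after)

for label, stats in (("before", stats_before), ("after", stats_after)):
    print(label, {k: round(v, 2) for k, v in stats.items()})
```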
Now the data parameters for
WALKVEL will be analyzed using the
same techniques. In perusing the value
column for the WALKVEL data set
(see Figure 7), the relatively small values of 4 and 5 might be cause for concern; however, there is no significant
skew to these data (i.e., -0.65). The
only remarkable observation of the
raw data is the discontinuity between
the walking velocities of 37 and 50. The
11.2 discrepancy between the mean
(38.8) and median (50) walking velocity values is also cause for concern (see
Figure 7). This is an example of where
a graphic representation is helpful.
The frequency histogram of WALKVEL clearly shows a bimodal distribution (see Figure 8). In other words, an
underlying factor is causing these data
to separate into subgroups.
Another way to identify a subgrouping is to plot the data of WALKHR vs
WALKVEL (see Figure 9). The
WALKHR vs WALKVEL plot reveals
two distinct
groups that should be evaluated to see
if there is a significant difference between them. This finding sheds a completely different light on these data and
is only discovered through careful
screening. If one had obtained the
means and then moved directly to analysis, a significant error would have
been made and an important result
overlooked.
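A discontinuity of this kind can also be located programmatically. The sketch below splits a sorted variable at its largest gap; the velocities are hypothetical, chosen only to reproduce the discontinuity between 37 and 50 described above. This is a crude screening heuristic, not a test of bimodality, and whether the resulting groups differ significantly still has to be evaluated separately.

```python
def split_at_largest_gap(values):
    """Sort the values and split them where consecutive values differ most."""
    ordered = sorted(values)
    # Pair each gap with the index of its lower neighbour.
    gaps = [(b - a, i) for i, (a, b) in enumerate(zip(ordered, ordered[1:]))]
    width, idx = max(gaps)
    return ordered[:idx + 1], ordered[idx + 1:], width


# Hypothetical walking velocities (meters per minute); not the study's data.
walk_velocity = [4, 5, 12, 18, 22, 25, 30, 33, 37, 50, 52, 55, 58, 60, 63, 66]

slow, fast, gap = split_at_largest_gap(walk_velocity)
print(f"lower group ends at {slow[-1]}, upper group starts at {fast[0]}, gap {gap}")
```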
This project revealed the subgrouping was the result of the combined
impact of the variables muscle
strength and proprioception. This
trend continued even when a larger
sample was achieved. It is important to
remember, however, that when working with small samples apparent differences may disappear or be reduced as
the number of subjects in the sample is
increased (3,11).
Summary
When data are transcribed, error is
possible. To comply with the mandate
of valid research, it is imperative that
all data entry be checked for accuracy.
The value of clerical proofing for this
type of error cannot be stressed
enough.
For correct inferences to be made, it
is critical that the data follow the assumptions of the statistic to be applied.
Very often one is left with data that
either have a very large spread, do not
look normally distributed, are skewed
or have apparent outliers. While these
conditions may be the result of poor
control, large variability and/or a very
small sample, not all is lost. There are
two primary categories of statistical
testing, parametric and non-parametric.
The parametric category of tests requires full and rigorous adherence to
the assumption of normality, i.e.,
(n,0), which means the distribution is
normal, and the mean is at the center.
The non-parametric category also is
frequently called distribution-free statistics (3,12).
The implication here is that data do
not have to be normally distributed for
the test to provide valid results.
Screening, therefore, does not determine if you can test but rather how you
will test and if legitimate measures
must be taken in managing the data
prior to testing. Since research establishes professional standards and is
used to guide others in patient care,
ethical conduct requires that published
information be of the highest integrity
with the best efforts made toward establishing correct information.
Brenda Rae Lunsford, MS, PT, is a visiting assistant professor in the school of physical therapy at Texas Woman's University in Houston.
References:
1. Hill MA. Annotated computer output for data screening. BMDP Technical Report 77, UCLA, 1981.
2. Sokal RR, Rohlf FJ. Biometry. 2nd ed. San Francisco: WH Freeman & Co., 1981:400-14.
3. Currier DP. Elements of research in physical therapy. 2nd ed. Baltimore: Williams & Wilkins, 1984:152, 278-94.
4. Afifi AA, Azen SP. Statistical analysis: A computer-oriented approach. New York: Academic Press Inc., 1972:281-3.
5. Dominowski RL. Research methods. Englewood Cliffs, N.J.: Prentice-Hall Inc., 1980:4:124-63.
6. Dunn OJ. Basic statistics: A primer for the biomedical sciences. 2nd ed. New York: John Wiley and Sons, 1977:5:38-49.
7. Brown FL, Amos JR, Mink OG. Statistical concepts: A basic program. 2nd ed. New York: Harper & Row, 1975:28-32.
8. Goldstein A. Biostatistics: An introductory text. New York: Macmillan, 1964:3444.
9. Bostrom A, Kahn T. Crunch Statistical Package. Vol. I & II, 1991:590.
10. Lunsford BR. Clinical indicators of endurance. Phys Ther 1978;58:6:704-9.
11. Schlesselman JJ. Planning a longitudinal study: I. Sample size determination. J Chron Dis 1973;26:553-60.
12. Siegel S. Nonparametric statistics for the behavioral sciences. New York: McGraw-Hill, 1956:1-34.