RESEARCH FORUM--Measurement
Characteristics and Sources
of Measurement Error
Jan Gwyer, PhD, PT
ABSTRACT
Orthotic and prosthetic clinicians make clinical decisions
daily that affect their patients' lives and the resources of the
health-care system. These decisions are largely based on information gained by taking clinical measurements of patient
characteristics. Each measurement can be evaluated in terms
of its reliability, validity, responsiveness and usefulness in the
clinic. Clinicians and researchers studying clinical phenomena must have an understanding of these qualities of clinical
measures and the sources of error that can make their data
less useful.
Measurement errors occur in all systematic investigations
of natural phenomena. Neither the advanced state of our instrumented technology nor precision in our procedures can
prevent random errors. Thus, it is important that orthotists
and prosthetists familiarize themselves with the characteristics and occurrences of measurement errors in practice and
research. This understanding will provide the basis to evaluate the impact of error and bias on measures and improve
the clinical decisions made when reading or conducting clinical research.
Introduction
A major goal of clinical research is to search out the truth
of the matter at hand. For example, when researching the
range-of-motion-limiting effects of four types of cervical orthoses, it is important to know as precisely as possible how
much axial rotation is permitted by each orthosis. When
such a study was conducted by Lunsford et al. (1), 10 subjects wearing four different cervical orthoses and wearing
no orthoses were studied to obtain measures of allowable
axial rotation. Their results showed the most effective orthosis limited axial rotation of the cervical spine to approximately 25 degrees of motion. Those who read this study
may conclude that 25 degrees is the true amount of axial
rotation allowed by the orthosis and may be persuaded to
select this orthosis over others when limitation of cervical
rotation is the primary patient concern.
Clinicians who plan to conduct clinical research, or who
read and critique published research [such as the Lunsford
et al. study (1)], should pause to ask themselves several
questions about the clinical measures involved. What are
the measurement properties of these measures, and what
are the sources of error that might affect our confidence in
them? The intent of this article is to build on previous papers that provide an introduction to measurement and
variables (2) and illustrate common clinical measures (3).
The study by Lunsford et al. referenced throughout this
article will provide a framework for illustrating concepts of
measurement theory and sources of measurement error in
experimental research.
Measurement Science
Measurement is a process so familiar to clinicians that to
stop and analyze the science underlying the skill of measurement may seem unnecessary. But if measurement is, as
Rothstein says, the very way in which clinicians communicate, document treatments and claim credibility for clinical
decisions (4), then an understanding of how to best practice
the science of measurement is crucial.
Measurement is the complex process by which rules are
followed to assign numbers to events or phenomena (5).
Measurements vary in their intuitive appeal from measuring limb girth in centimeters to measuring the amount of
assistance a patient requires to don an orthosis on a scale of
zero to five. The former methodology seems much more
obviously "true" or "real" and is more easily communicated consistently and with confidence between practitioners
than the latter. Researchers strive to increase the validity
and reliability of all measurements.
Validity
Many familiar definitions of measurement validity illustrate
the desired properties of clinical measurements: accuracy,
truthfulness, veracity, legitimacy, correctness and soundness. Less clear is the definition of measurement validity as
the extent to which a measure or score actually measures
what it is intended to measure. This definition reminds the
clinician that the accuracy of a measurement tool is not the
only concern. Clinicians must have a clear picture of the
context in which the measurement is taken to establish
measurement validity. This context is most easily identified
by asking the following question: What is the purpose of
this measure, or what do I intend to do with this information once obtained? Some examples of purposes for measurements include to evaluate physical performance, to diagnose, or to predict future limitations or disabilities. An
understanding of the purpose of a measure provides the
guidance needed to select the types of measurement validity most appropriate for the clinical measure.
A brief overview of types of measurement validity will
be presented here and illustrated with our study example.
The reader is referred to Sim and Arnell's review of measurement validity in clinical research (6) for a detailed analysis of the types of measurement validity, illustrated with clinical measures familiar to orthotic and prosthetic practitioners.
Face validity, which exists for the majority of clinical measures in use in prosthetics and orthotics, is perhaps the easiest form of measurement validity to establish. A clinical
measure has face validity when the clinician and many of
his/her peers can agree that the measurement process seems
to do what it is intended to do and has a useful purpose.
When a piece of deformable material is dropped into the
bottom of a patellar tendon-bearing (PTB) socket, and the
patient is asked to stand and then remove the material
from the socket, a measurement process to evaluate
whether or not there is total contact in the prosthetic socket has been developed. The measurement is of the nominal
level with two possible responses, either yes or no, to the
question of whether there exists total contact of the residual limb in the socket. This measure probably enjoys substantial face validity among prosthetists. Conversely, if a
clinician plans to remove the deformable material, measure
the height of it, use a formula including the patient's height
and weight and translate that into a percentage of weightbearing accomplished by the patient, the clinician may lose
the reader's confidence in the face validity of this measure.
In the Lunsford et al. study presented here, the authors
have provided the reader with both pictures and text documenting the clinical measurement of cervical spine axial rotation in this study. From this information, a determination
about the face validity of this method of measuring cervical
spine axial rotation can be made.
The next two types of measurement validity are classified
as theoretical validity by Sim and Arnell (6): construct and
content validity. Construct validity is established when there
is agreement on a clear theoretical logical argument underlying the measure (7). Content validity is established when
the measurement contains all, or a sample of all, appropriate elements of the construct of interest. Payton (8), in his
commentary on Sim and Arnell's work, agrees with this classification of construct and content validity as theoretical validity and suggests that these types of measurement validity
are best confronted by theoreticians whereas the criterion-related validities are more suited for clinical studies. Although not referenced in their article, Lunsford et al. are
probably relying on the published discourse documenting
the construct and content validity of spinal range-of-motion
measures (4,9).
The last type of measurement validity is termed criterion-related validity and has two sub-classifications: predictive
and concurrent validity. Both types of criterion-related validity are concerned with the clinician's confidence in the
judgments he/she makes based on the measure. The purpose of measurement must be clearly stated here. Criterion-related validity is established by selecting a criterion
against which the clinician will evaluate the performance of
the clinical measure.
For many aspects of physical performance of interest to
orthotists and prosthetists more than one type of clinical
measure may be used. Some clinical measures have more of the types of measurement validity just described (face, content, construct) reported for them and serve well as a criterion against which to hold a newly developed measure. If
the purpose of a clinical measurement is to classify or evaluate some aspect of physical performance, then the concurrent validity of the measure must be established. If the purpose of a clinical measure is to predict future disability or
functional performance, then predictive validity for the
measure must be pursued. This is done by collecting data
with the clinical measure and then, at some point in the future, collecting data on the same subjects with the criterion
measure. For example, a clinician might collect data on the
type and thickness of the material used in several types of
ankle-foot orthoses (AFOs) and predict the number of
months to failure of the orthoses in a group of walkers.
Although no aspects of measurement validity are discussed by the authors of the sample study, several methods
of establishing the criterion-related validity for the measures of cervical axial rotation could be investigated. Cervical axial rotation measured in degrees is a measurement familiar to the readers, but the manner in which these measures were taken is not a common clinical practice, and so
the validity of the measures is a reasonable concern for the
reader. We can assume that the purpose of this measure is
to evaluate the available cervical spine axial rotation.
The authors could have increased the readers' confidence
in this specific measurement of cervical spine axial rotation
by comparing measurements obtained in this fashion (protractor measurements taken on a video monitor screen)
with measurements taken from radiographs or from gravity-referenced goniometers while the subjects were under the
same orthotic conditions. Although not a practical clinical
measure, serial radiographs serve as an excellent criterion
measure against which a clinical measure can be judged. The
question "Does this clinical measurement capture all cervical axial rotation and only cervical rotation, ruling out thoracic spine rotation and temporomandibular joint motion?"
can be answered by comparing the clinical measure to serial
radiographs.
Reliability
Reliability also has many common definitions familiar to
most O&P clinicians: repeatability, stability, consistency, reproducibility and dependability are just a few synonyms.
Since O&P professionals often check their measurements
more than once during fabrication processes, they are familiar with the concept of hoping that sequential measurements
of the same limb segment agree with each other, and even
more familiar with the consequences in time and material
when their measurements fail them. When this scenario happens and an assistive device must be refabricated, it is often
assumed that a change took place in the patient between
measurement and fitting, and this is an important concept to
clarify during a discussion of measurement reliability.
Classical definitions of all of the following types of reliability are based on one or more examiners making judgments about the same performance on one or more occasions. In other words, true measurement reliability can only be assessed if the performance of interest can be perfectly captured in some fashion (e.g., videotape) such that it
cannot be changed while it is presented for rating to one or
more judges over one or more time periods.
However, difficulty occurs when the assessment of reliability is brought into the clinical arena where one performance often cannot be captured in a useful manner. For example, some amounts of muscle force production cannot be
assessed from a videotape. Therefore, the logical comparison of repeated measurements taken on patients (i.e., reliability estimates) must be based on the premise that no
change has taken place in the phenomenon under study between repeated measurements or trials. Since this is essentially impossible with human beings, a dilemma is created
right from the start.
Despite this truth about the variability in human performance, inquiries into the reliability of measures must be designed with the maximum confidence that no change could
reasonably be expected in the performance to be measured
in the time interval between measurements. While acknowledging that normal fluctuations in human performance are inherent and undefinable in these analyses, this
assumption allows the difference in scores recorded at two
points in time to be logically attributed to errors in measurement. If researchers can identify a reason why the performance under study changed between measurements,
then that subject's performance must be eliminated from
the assessment of the measure's reliability.
As with validity, several types of reliability can be established for a clinical measure. Instrument reliability, often
called test-retest reliability, should be pursued when very little interaction occurs between the measurement tool and the
investigator. Mechanical measurement tools that are either
self-applied by the subject or require no judgments in application or reading on the part of the clinician (e.g., digital
readouts) are suitable for this type of investigation. This
process, often referred to as instrument calibration, should
be carried out before any data collection is undertaken. (Calibration may also include an assessment of measurement
validity if comparisons are made to a standard measure.)
More typically, important interaction and judgment are
required of the tester, and in these instances both intrarater
and interrater reliability can be established. Intrarater reliability assesses the agreement of two or more ratings performed by the same examiner over some specified period
of time. The determination of the time separation of the
two or more measures should take into consideration the
preceding assumptions about changes that can be expected
in the performance of interest and the typical pattern of
measurements found in clinical practice. The time separation must be documented for each measure and could be
characterized generally as closely separated (e.g., minutes
or days) or widely separated (days or weeks) assessments
of intrarater reliability as appropriate for each measure. Interrater reliability addresses one of the main purposes of
measurement: to communicate consistently with each other. This type of reliability requires assessments by two or
more raters of the same patient performance.
Lunsford et al. give the reader no assessment of the reliability of their measures of cervical axial rotation. Measurement reliability in an experimental design is often investigated in a pilot study prior to beginning formal data collection.
From the description of their procedures, it is not known if
one or more testers performed the cervical spine range-of-motion measurements from the video screen using a protractor. If only one investigator took all measurements, then
the reader assumes there is no error introduced that would
need to be addressed by establishing interrater reliability
prior to the study. However, in the case of one rater, the reader's confidence in the measurements could be enhanced by
knowing the intrarater reliability of the measurements.
In the procedures, it is noted that the subject performed
10 repetitions of axial rotation. It can be assumed that this
procedure was used to enhance the reliability of the measurements, but since the data collection section does not include a description of how an individual subject's score was
derived, the reader does not know whether the rater measured all 10 trials, trials 3 to 5 or just one trial. If the rater
measured more than one trial, the data exist to investigate
the intrarater reliability of the measure.
Quantification of Measurement Reliability and Validity
Several statistical techniques are available to clinical researchers that allow them to quantify the reliability of their
clinical measures. Reliability estimates for measurements
taken in a study can be assessed with a variety of correlation coefficients, which analyze the relative reliability or absolute reliability or agreement of a set of scores. The specific type of correlation coefficient chosen to analyze reliability data depends on the level of measurement of the data, the type of reliability estimate desired and the number
of raters involved in the study. The reader is referred to
chapters 9 and 10 of the text by Domholdt (9) for an excellent discussion of correlational analyses and experimental
design. Correlation coefficients range in value from -1.0 to
+1.0, indicating strength and direction of the relationship.
Coefficients close to zero indicate no relationship, whereas
those close to -1.0 or +1.0 indicate strong indirect (inverse) or direct
relationships, respectively.
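As an illustration only, the following sketch computes one such coefficient, the Pearson product-moment correlation, for two sets of ratings of the same subjects. The scores are hypothetical and are not taken from the sample study, and the function name `pearson_r` is chosen here for convenience. As noted above, a different coefficient may be more appropriate depending on the level of measurement and the type of reliability estimate desired.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squared deviations from the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one set of scores does not vary")
    return sxy / sqrt(sxx * syy)

# Hypothetical axial rotation scores (degrees) for five subjects, each rated twice
first_rating = [24.0, 31.0, 27.0, 35.0, 29.0]
second_rating = [25.0, 30.0, 28.0, 36.0, 28.0]

r = pearson_r(first_rating, second_rating)
print(f"r = {r:.2f}")
```

A value of r near +1.0 for these hypothetical ratings would suggest that subjects keep the same relative standing from one rating to the next. It does not, by itself, show that the two ratings agree in absolute terms.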
In a similar fashion, correlation coefficients can be used
to quantify the association between two sets of scores derived from two different measures, and this forms the basis
of the analysis of measurement validity. In the sample
study, if the scores obtained on each subject for cervical
spine axial rotation were compared to scores obtained using cervical spine radiographs, the resulting correlation coefficient would be interpreted as a measure of criterion-related validity for the measurements in the study.
Measurement Error
Knowledge gained from the study of measurement science
will make clinicians more or less certain of their interpretations of the research summarized above and more or less confident in the values reported as the true amount of axial rotation permitted by the orthoses. For example, measurement theory shows one can never absolutely measure the
true quantity of a concept. Every measure taken by clinicians or scientists has a shadow component, termed error.
The error associated with a measurement is defined as
the difference between the unknowable true score and the
observed score recorded while taking measurements. Since
in theory the true score always remains unknown, it is crucial to estimate the errors associated with observed scores,
or measures, as a means of establishing confidence in the
measuring devices and procedures. This can be done by taking repeated measures of the same phenomenon and then
describing the various observed scores. If repeated observed scores are consistent, it is assumed that the measurement error is small and that the observed scores closely approximate the true score (9). The measurement device
and procedures are declared reliable, and one of the major
pitfalls of clinical research, measurement error, has been
overcome.
Measurement error is often categorized as occurring either randomly or systematically in an experiment. Random
errors are inconsistent discrepancies that occur by chance
in a study. They are not found to follow any pattern that
could introduce bias into the results; they are simply naturally occurring events that detract from the precision of
clinical measures (10). If the researcher is inexperienced
with the measurements to be taken and is uncertain about
his/her judgments, the possibility of random error is introduced. Research procedures often include repeated trials
for measurements to decrease these types of random errors
so an average of several trials may be entered for the subject's score. This method will provide a score that will more
closely approximate the subject's true score than does any
one trial score (10). Consistent errors that persist from one
subject to another are considered systematic errors.
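The benefit of averaging repeated trials can be illustrated with a small simulation. This is a sketch under stated assumptions rather than an analysis of any real data: the true score, the size of the random error and the assumption that random errors are normally distributed around zero are all hypothetical choices made for the example.

```python
import random
from statistics import mean, pstdev

random.seed(1)  # fixed seed so the sketch gives the same result on each run

TRUE_SCORE = 25.0     # hypothetical true axial rotation, in degrees
ERROR_SD = 2.0        # hypothetical spread of the random error, in degrees
N_TRIALS = 10         # trials averaged to form one subject score
N_SIMULATIONS = 2000  # number of simulated scores of each kind

def observed_score():
    """One observed score: the true score plus a random error (assumed normal)."""
    return TRUE_SCORE + random.gauss(0.0, ERROR_SD)

# Scores based on a single trial
single_trial_scores = [observed_score() for _ in range(N_SIMULATIONS)]

# Scores based on the average of N_TRIALS trials
averaged_scores = [
    mean(observed_score() for _ in range(N_TRIALS))
    for _ in range(N_SIMULATIONS)
]

# How far the scores scatter around the true score in each case
spread_single = pstdev(single_trial_scores)
spread_averaged = pstdev(averaged_scores)
print(f"single trial: {spread_single:.2f}  averaged: {spread_averaged:.2f}")
```

Under these assumptions the averaged scores scatter less widely around the true score than the single-trial scores do. Averaging offers no such protection against a systematic error, which shifts every trial in the same direction.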
Both random error and systematic error will undermine
the validity of the clinical measure (6). Systematic error is of particular concern because it can go undetected in an assessment of reliability; thus, a clinician may find that the clinical measure is reliable and proceed with its use, unaware that its scores are consistently in error. In the study example, range-of-motion measurements were taken by placing
a precision protractor on a video monitor's screen and
measuring both beginning and ending angular measurements. If the numbers marked on the protractor were in error by three degrees, then all measurements made with that
protractor would be off by three degrees in the same direction. One can see that this systematic error would not affect
the reliability of the measurements taken by the investigators, but statements about the average amount of axial rotation permitted by an orthosis would carry with them the error of three degrees. This illustrates the intimate relationship between reliability and validity and the influence of
measurement error on both of these important characteristics of measures.
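The three-degree protractor example can be worked through with a few lines of arithmetic. The readings below are hypothetical and serve only to show why a constant error leaves the agreement between repeated ratings untouched while shifting the reported average.

```python
from statistics import mean

OFFSET = 3.0  # the three-degree protractor error from the example above

# Hypothetical readings (degrees) as they would be with a correctly marked protractor
first_rating = [24.0, 31.0, 27.0, 35.0, 29.0]
second_rating = [25.0, 30.0, 28.0, 36.0, 28.0]

# The same readings taken with the mismarked protractor
biased_first = [score + OFFSET for score in first_rating]
biased_second = [score + OFFSET for score in second_rating]

# Differences between repeated ratings: the basis for judging consistency
diffs_accurate = [a - b for a, b in zip(first_rating, second_rating)]
diffs_biased = [a - b for a, b in zip(biased_first, biased_second)]

# Change in the reported average caused by the mismarked protractor
shift_in_mean = mean(biased_first) - mean(first_rating)

print("differences unchanged:", diffs_accurate == diffs_biased)
print(f"shift in mean: {shift_in_mean:.1f} degrees")
```

Because the offset is added to every reading, it cancels out of any comparison between repeated ratings, so a reliability analysis cannot reveal it. Only a comparison against an independent criterion measure would expose the shift in the average.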
Quantification of Measurement Error
Several statistical techniques are available to clinical researchers to allow them to quantify the errors associated
with their measures. In instances where the research project's purpose is to predict measures of central tendency
(e.g., mean) of a variable for a set of subjects so that the researcher and reader may generalize this value to a population of similar patients who were not studied, both confidence intervals and the standard error of the mean are very
useful tools (9). Applying these tools in the sample study
would allow the investigator to report the associated error
(e.g., +/- 5 degrees) along with the estimate of average cervical axial rotation available in each orthotic condition. These
analyses improve the reader's confidence in the reported
means, especially those derived from studies with small
sample sizes.
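A minimal sketch of these two tools follows, assuming 10 subjects as in the sample study. The rotation values are hypothetical, not data from Lunsford et al., and the interval is based on the t distribution, a common choice for small samples. The critical value 2.262 is the two-tailed 95 percent t value for 9 degrees of freedom.

```python
from math import sqrt
from statistics import mean, stdev

# Two-tailed 95% critical t value for 10 subjects (9 degrees of freedom)
T_CRITICAL_DF9 = 2.262

# Hypothetical axial rotation (degrees) for 10 subjects in one orthotic condition
rotation = [22.0, 27.0, 25.0, 30.0, 24.0, 26.0, 21.0, 28.0, 23.0, 24.0]

n = len(rotation)
sample_mean = mean(rotation)
sample_sd = stdev(rotation)      # sample standard deviation (n - 1 denominator)
sem = sample_sd / sqrt(n)        # standard error of the mean
margin = T_CRITICAL_DF9 * sem    # half-width of the 95% confidence interval

lower = sample_mean - margin
upper = sample_mean + margin
print(f"mean = {sample_mean:.1f}, SEM = {sem:.2f}, 95% CI = {lower:.1f} to {upper:.1f}")
```

The standard error shrinks as the number of subjects grows, which is one reason the reported error matters most for studies with small samples.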
Other Sources of Error in Research
Measurement error is considered the primary source of error in most research designs. Errors in sampling, instrumentation, procedures and data analysis also occur, so
a thorough review of all aspects of the research design
should be undertaken before beginning an experiment. The
search for errors of procedure and technique can be facilitated by conducting a pilot study.
Particular attention should be paid to all problems of procedure that
might affect the measurements, including inconsistent
stabilization of subjects in equipment, inconsistent instructions given to subjects, inaccurate procedures for reading
measurement dials or gauges (parallax), inadequate procedures for initiating timing sequences and equipment failures,
along with plans for backup equipment. If the interaction between
the subject and any research equipment is novel, procedures should allow the subject to become thoroughly familiar with the equipment prior to the recording of any
measurements to eliminate learning effects in the study.
As quickly becomes obvious, not all errors can be completely eliminated from clinical investigations. Attempts to
control sources of errors in measurement and procedures
of research are not unlike the tension between internal and
external experimental validity discussed by Lunsford (11).
The application of too great an effort to control error may
enhance reliability but somewhat decrease validity. Reasonable efforts to assess measurement properties should be
expected of those conducting clinical investigations.
Implications for Practice
This discussion of measurement and error has been set in
the context of the performance of clinical research. Clinicians' expectations of the reliability of clinical measures as
used in daily practice should be no less than those of research colleagues. Practitioners, clinicians and researchers
share high expectations of what clinical measures should
do. Another important clinical consideration relates to the
nature of the theoretical validity of clinical measures: Do
we currently have the most useful clinical measures available to us in orthotics and prosthetics?
The clinician community has an important insight into
the content of useful and appropriate clinical measures given the current climate of health-care delivery systems. To
paraphrase Campbell (12), how good does the gait of a patient with a transfemoral amputation have to be at the completion of rehabilitation to result in long-term use of his/her
prosthesis as a community ambulator rather than in a
wheelchair?
More clinical measures that can answer this type of
question (by providing measurements with this type of predictive validity) are needed. What types or collections of
measures should be taken to predict successful ambulation
with bilateral KAFOs in patients who have experienced
spinal cord injuries? In developing such an assessment, a
major concern is content validity: Has a sampling of
all predictive measures been included? When clinicians
ask themselves these questions, they are attempting to
make their measurements accurate, efficient and useful to
their patients.
Clinical decisions made by prosthetists and orthotists, in
conjunction with other rehabilitation team members, have
significant consequences for patients and for the health-care system. These consequences include an impact on the
functional capabilities of the patients and the allotment of
considerable time and financial resources of the health-care
system. In the future, O&P professionals will be held even
more accountable for their clinical decisions, and the quality of clinical measures has a direct effect on the adequacy
of these clinical decisions.
Each clinician must participate in a constant evaluation
of the characteristics of a clinical measure to ensure it is
performing as needed. Campbell states, "As we develop
new clinical tests, we must assess their validity in predicting
disability in the community and link observed functional
limitations to the need for assistive technology or other
compensations for permanently limited function" (12). If
the outcomes of clinical decisions are no better than the
toss of a coin, or the judgments of an uninformed insurance
carrier, then O&P professionals may lose the ability to
make these judgments in the future.
References:
1. Lunsford TR, Davidson M, Lunsford BR. The effectiveness of four contemporary cervical orthoses in restricting cervical motion. JPO October 1994; 6:4:93-9.
2. Lunsford BR. Methodology: Variables and levels of measurement. JPO October 1993; 5:4:121-4.
3. Shurr DG. Practical clinical measures. JPO October 1993; 5:4:131-3.
4. Rothstein JM. Measurement in physical therapy. New York: Churchill Livingstone, 1985.
5. Payton OD. Research: The validation of clinical practice, 3rd ed. Philadelphia: F.A. Davis Co., 1993.
6. Sim J, Arnell P. Measurement validity in physical therapy research. 1993: 73:2:102-15.
7. Rothstein JM. Reliability and validity: Implications for research. In: Bork CE (ed.), Research in Physical Therapy. Philadelphia: J.B. Lippincott Co., 1992.
8. Payton OD. Commentary on Sim J and Arnell P: Measurement validity in physical therapy research. 1993: 73:2:102-15.
9. Domholdt E. Physical therapy research: Principles and applications. Philadelphia: W.B. Saunders Co., 1993.
10. Currier DP. Elements of research in physical therapy. 3rd ed. Baltimore: Williams and Wilkins, 1990.
11. Lunsford TR. Clinical research. JPO October 1993; 5:4:101-11.
12. Campbell SK. Commentary on Sim J and Arnell P: Measurement validity in physical therapy research. 1993: 73:2:102-15.