
RESEARCH FORUM--Measurement Characteristics and Sources of Measurement Error

Jan Gwyer, PhD, PT

ABSTRACT

Orthotic and prosthetic clinicians make clinical decisions daily that affect their patients' lives and the resources of the health-care system. These decisions are largely based on information gained by taking clinical measurements of patient characteristics. Each measurement can be evaluated in terms of its reliability, validity, responsiveness and usefulness in the clinic. Clinicians and researchers studying clinical phenomena must have an understanding of these qualities of clinical measures and the sources of error that can make their data less useful.

Measurement errors occur in all systematic investigations of natural phenomena. Neither the advanced state of our instrumentation nor the precision of our procedures can prevent random errors. Thus, it is important that orthotists and prosthetists familiarize themselves with the characteristics and occurrences of measurement errors in practice and research. This understanding will provide the basis to evaluate the impact of error and bias on measures and improve the clinical decisions made when reading or conducting clinical research.

Introduction

A major goal of clinical research is to search out the truth of the matter at hand. For example, when researching the range-of-motion-limiting effects of four types of cervical orthoses, it is important to know as precisely as possible how much axial rotation is permitted by each orthosis. When such a study was conducted by Lunsford et al. (1), 10 subjects wearing four different cervical orthoses and wearing no orthoses were studied to obtain measures of allowable axial rotation. Their results showed the most effective orthosis limited axial rotation of the cervical spine to approximately 25 degrees of motion. Those who read this study may conclude that 25 degrees is the true amount of axial rotation allowed by the orthosis and may be persuaded to select this orthosis over others when limitation of cervical rotation is the primary patient concern.

Clinicians who plan to conduct clinical research, or who read and critique published research [such as the Lunsford et al. study (1)], should pause to ask themselves several questions about the clinical measures involved. What are the measurement properties of these measures, and what are the sources of error that might affect our confidence in them? The intent of this article is to build on previous papers that provide an introduction to measurement and variables (2) and illustrate common clinical measures (3). The study by Lunsford et al. referenced throughout this article will provide a framework for illustrating concepts of measurement theory and sources of measurement error in experimental research.

Measurement Science

Measurement is a process so familiar to clinicians that to stop and analyze the science underlying the skill of measurement may seem unnecessary. But if measurement is, as Rothstein says, the very way in which clinicians communicate, document treatments and claim credibility for clinical decisions (4), then an understanding of how to best practice the science of measurement is crucial.

Measurement is the complex process by which rules are followed to assign numbers to events or phenomena (5). Measurements vary in their intuitive appeal from measuring limb girth in centimeters to measuring the amount of assistance a patient requires to don an orthosis on a scale of zero to five. The former methodology seems much more obviously "true" or "real" and is more easily communicated consistently and with confidence between practitioners than the latter. Researchers strive to increase the validity and reliability of all measurements.

Validity

Many familiar definitions of measurement validity illustrate the desired properties of clinical measurements: accuracy, truthfulness, veracity, legitimacy, correctness and soundness. Less clear is the definition of measurement validity as the extent to which a measure or score actually measures what it is intended to measure. This definition reminds the clinician that the accuracy of a measurement tool is not the only concern. Clinicians must have a clear picture of the context in which the measurement is taken to establish measurement validity. This context is most easily identified by asking the following question: What is the purpose of this measure, or what do I intend to do with this information once obtained? Some examples of purposes for measurements include to evaluate physical performance, to diagnose, or to predict future limitations or disabilities. An understanding of the purpose of a measure provides the guidance needed to select the types of measurement validity most appropriate for the clinical measure.

A brief overview of the types of measurement validity is presented here and illustrated with our study example. Readers seeking a detailed analysis of the types of measurement validity are referred to Sim and Arnell's review of measurement validity in clinical research (6).

Face validity, which exists for the majority of clinical measures in use in prosthetics and orthotics, is perhaps the easiest form of measurement validity to establish. A clinical measure has face validity when the clinician and many of his/her peers can agree that the measurement process seems to do what it is intended to do and has a useful purpose.

When a piece of deformable material is dropped into the bottom of a patellar tendon-bearing (PTB) socket, and the patient is asked to stand and then remove the material from the socket, a measurement process to evaluate whether or not there is total contact in the prosthetic socket has been developed. The measurement is of the nominal level with two possible responses, either yes or no, to the question of whether there exists total contact of the residual limb in the socket. This measure probably enjoys substantial face validity among prosthetists. Conversely, if a clinician plans to remove the deformable material, measure the height of it, use a formula including the patient's height and weight and translate that into a percentage of weightbearing accomplished by the patient, the clinician may lose the reader's confidence in the face validity of this measure.

In the Lunsford et al. study presented here, the authors have provided the reader with both pictures and text documenting the clinical measurement of cervical spine axial rotation in this study. From this information, a determination about the face validity of this method of measuring cervical spine axial rotation can be made.

The next two types of measurement validity are classified as theoretical validity by Sim and Arnell (6): construct and content validity. Construct validity is established when there is agreement on a clear theoretical logical argument underlying the measure (7). Content validity is established when the measurement contains all, or a sample of all, appropriate elements of the construct of interest. Payton (8), in his commentary on Sim and Arnell's work, agrees with this classification of construct and content validity as theoretical validity and suggests that these types of measurement validity are best confronted by theoreticians whereas the criterion-related validities are more suited for clinical studies. Although not referenced in their article, Lunsford et al. are probably relying on the published discourse documenting the construct and content validity of spinal range-of-motion measures (4,9).

The last type of measurement validity is termed criterion-related validity and has two sub-classifications: predictive and concurrent validity. Both types of criterion-related validity are concerned with the clinician's confidence in the judgments he/she makes based on the measure. The purpose of measurement must be clearly stated here. Criterion-related validity is established by selecting a criterion against which the clinician will evaluate the performance of the clinical measure.

For many aspects of physical performance of interest to orthotists and prosthetists, more than one type of clinical measure may be used. Some clinical measures have more of the types of measurement validity just described (face, content, construct) documented in the literature and serve well as a criterion against which to hold a newly developed measure. If the purpose of a clinical measurement is to classify or evaluate some aspect of physical performance, then the concurrent validity of the measure must be established. This is done by collecting data with the clinical measure and the criterion measure at essentially the same time and comparing the two sets of scores. If the purpose of a clinical measure is to predict future disability or functional performance, then predictive validity for the measure must be pursued. This is done by collecting data with the clinical measure and then, at some point in the future, collecting data on the same subjects with the criterion measure. For example, a clinician might collect data on the type and thickness of the material used in several types of ankle-foot orthoses (AFOs) and use those data to predict the number of months to failure of the orthoses in a group of walkers.

Although no aspects of measurement validity are discussed by the authors of the sample study, several methods of establishing the criterion-related validity for the measures of cervical axial rotation could be investigated. Cervical axial rotation measured in degrees is a measurement familiar to the readers, but the manner in which these measures were taken is not a common clinical practice, and so the validity of the measures is a reasonable concern for the reader. We can assume that the purpose of this measure is to evaluate the available cervical spine axial rotation.

The authors could have increased the readers' confidence in this specific measurement of cervical spine axial rotation by comparing measurements obtained in this fashion (protractor measurements taken on a video monitor screen) with measurements taken from radiographs or from gravity-referenced goniometers while the subjects were under the same orthotic conditions. Although not a practical clinical measure, serial radiographs serve as an excellent criterion measure against which a clinical measure can be judged. The question "Does this clinical measurement capture all cervical axial rotation and only cervical rotation, ruling out thoracic spine rotation and temporomandibular joint motion?" can be answered by comparing the clinical measure to serial radiographs.

Reliability

Reliability also has many common definitions familiar to most O&P clinicians: repeatability, stability, consistency, reproducibility and dependability are just a few synonyms. Since O&P professionals often check their measurements more than once during fabrication processes, they are familiar with the concept of hoping that sequential measurements of the same limb segment agree with each other, and even more familiar with the consequences in time and material when their measurements fail them. When this scenario happens and an assistive device must be refabricated, it is often assumed that a change took place in the patient between measurement and fitting, and this is an important concept to clarify during a discussion of measurement reliability.

Classical definitions of all of the following types of reliability are based on one or more examiners making judgments about the same performance on one or more occasions. In other words, true measurement reliability can only be assessed if the performance of interest can be perfectly captured in some fashion (e.g., videotape) such that it cannot be changed while it is presented for rating to one or more judges over one or more time periods.

However, difficulty occurs when the assessment of reliability is brought into the clinical arena, where one performance often cannot be captured in a useful manner. For example, some amounts of muscle force production cannot be assessed from a videotape. Therefore, the logical comparison of repeated measurements taken on patients (i.e., reliability estimates) must be based on the premise that no change has taken place in the phenomenon under study between repeated measurements or trials. Since this is essentially impossible with human beings, a dilemma is created right from the start.

Despite this truth about the variability in human performance, inquiries into the reliability of measures must be designed with the maximum confidence that no change could reasonably be expected in the performance to be measured in the time interval between measurements. While acknowledging that normal fluctuations in human performance are inherent and undefinable in these analyses, this assumption allows the difference in scores recorded at two points in time to be logically attributed to errors in measurement. If researchers can identify a reason why the performance under study changed between measurements, then that subject's performance must be eliminated from the assessment of the measure's reliability.

As with validity, several types of reliability can be established for a clinical measure. Instrument reliability, often called test-retest reliability, should be pursued when very little interaction occurs between the measurement tool and the investigator. Mechanical measurement tools that are either self-applied by the subject or require no judgments in application or reading on the part of the clinician (e.g., digital readouts) are suitable for this type of investigation. This process, often referred to as instrument calibration, should be carried out before any data collection is undertaken. (Calibration may also include an assessment of measurement validity if comparisons are made to a standard measure.)

More typically, important interaction and judgment are required of the tester, and in these instances both intrarater and interrater reliability can be established. Intrarater reliability assesses the agreement of two or more ratings performed by the same examiner over some specified period of time. The determination of the time separation of the two or more measures should take into consideration the preceding assumptions about changes that can be expected in the performance of interest and the typical pattern of measurements found in clinical practice. The time separation must be documented for each measure and could be characterized generally as closely separated (e.g., minutes or hours) or widely separated (e.g., days or weeks) assessments of intrarater reliability as appropriate for each measure. Interrater reliability addresses one of the main purposes of measurement: to communicate consistently with each other. This type of reliability requires assessments by two or more raters of the same patient performance.
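As an illustration of interrater reliability for a nominal measure such as the yes/no judgment of total contact described earlier, the following minimal sketch computes percent agreement and Cohen's kappa for two raters. The ratings are invented for illustration, and kappa is one commonly used agreement statistic for nominal data; it is not a method named in this article.

```python
def agreement_and_kappa(ratings_a, ratings_b):
    """Return (percent agreement, Cohen's kappa) for two raters.

    Both lists must rate the same performances in the same order.
    Kappa is undefined when chance agreement equals 1 (both raters
    always give the same single category); that case is not handled here.
    """
    n = len(ratings_a)
    categories = set(ratings_a) | set(ratings_b)
    # Proportion of performances on which the two raters agree.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from each rater's category frequencies.
    expected = sum(
        (ratings_a.count(c) / n) * (ratings_b.count(c) / n)
        for c in categories
    )
    return observed, (observed - expected) / (1 - expected)


# Hypothetical data: two prosthetists judge total contact in 10 sockets.
rater_a = ["yes", "yes", "no", "yes", "no", "yes", "yes", "no", "yes", "yes"]
rater_b = ["yes", "yes", "no", "yes", "yes", "yes", "no", "no", "yes", "yes"]

percent_agreement, kappa = agreement_and_kappa(rater_a, rater_b)
print(f"agreement = {percent_agreement:.2f}, kappa = {kappa:.2f}")
```

Kappa is lower than raw agreement here because part of the observed agreement would be expected by chance alone when both raters answer "yes" most of the time.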

Lunsford et al. give the reader no assessment of the reliability of their measures of cervical axial rotation. Measurement reliability in an experimental design is often investigated in a pilot study prior to beginning formal data collection. From the description of their procedures, it is not known if one or more testers performed the cervical spine range-of-motion measurements from the video screen using a protractor. If only one investigator took all measurements, then the reader assumes there is no error introduced that would need to be addressed by establishing interrater reliability prior to the study. However, in the case of one rater, the reader's confidence in the measurements could be enhanced by knowing the intrarater reliability of the measurements.

In the procedures, it is noted that the subject performed 10 repetitions of axial rotation. It can be assumed that this procedure was used to enhance the reliability of the measurements, but since the data collection section does not include a description of how an individual subject's score was derived, the reader does not know whether the rater measured all 10 trials, trials 3 to 5 or just one trial. If the rater measured more than one trial, the data exist to investigate the intrarater reliability of the measure.

Quantification of Measurement Reliability and Validity

Several statistical techniques are available to clinical researchers that allow them to quantify the reliability of their clinical measures. Reliability estimates for measurements taken in a study can be assessed with a variety of correlation coefficients, which analyze the relative reliability or absolute reliability or agreement of a set of scores. The specific type of correlation coefficient chosen to analyze reliability data depends on the level of measurement of the data, the type of reliability estimate desired and the number of raters involved in the study. The reader is referred to chapters 9 and 10 of the text by Domholdt (9) for an excellent discussion of correlational analyses and experimental design. Correlation coefficients range in value from -1.0 to +1.0, indicating strength and direction of the relationship. Coefficients close to zero indicate no relationship, whereas those close to -1 or +1 indicate strong inverse or direct relationships, respectively.
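The distinction between relative and absolute reliability can be sketched with a small example. The code below assumes one rater measured the same five subjects in two sessions; the values are invented and are not data from Lunsford et al. It uses the Pearson coefficient as one possible index of relative reliability and the mean absolute difference as a simple index of absolute agreement. Other coefficients may be more appropriate depending on the level of measurement and the number of raters.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation for two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


def mean_abs_difference(x, y):
    """Average size of the disagreement between paired scores."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


# Hypothetical axial rotation (degrees), same rater, two sessions.
session_1 = [24.0, 31.0, 27.0, 35.0, 29.0]
session_2 = [25.0, 30.0, 28.0, 36.0, 28.0]

r = pearson_r(session_1, session_2)
mad = mean_abs_difference(session_1, session_2)
print(f"r = {r:.3f}, mean absolute difference = {mad:.1f} degrees")
```

A high coefficient shows that subjects keep their rank order across sessions; the mean absolute difference shows, in degrees, how far apart paired scores actually are.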

In a similar fashion, correlation coefficients can be used to quantify the association between two sets of scores derived from two different measures, and this forms the basis of the analysis of measurement validity. In the sample study, if the scores obtained on each subject for cervical spine axial rotation were compared to scores obtained using cervical spine radiographs, the resulting correlation coefficient would be interpreted as a measure of criterion-related validity for the measurements in the study.

Measurement Error

Knowledge gained from the study of measurement science will make clinicians more or less certain of their interpretations of the research summarized above and their confidence in the values reported as the true amount of axial rotation permitted by the orthoses. For example, measurement theory shows one can never absolutely measure the true quantity of a concept. Every measure taken by clinicians or scientists has a shadow component, termed error.

The error associated with a measurement is defined as the difference between the unknowable true score and the observed score recorded while taking measurements. Since in theory the true score always remains unknown, it is crucial to estimate the errors associated with observed scores, or measures, as a means of establishing confidence in the measuring devices and procedures. This can be done by taking repeated measures of the same phenomenon and then describing the various observed scores. If repeated observed scores are consistent, it is assumed that the measurement error is small and that the observed scores closely approximate the true score (9). The measurement device and procedures are declared reliable, and one of the major pitfalls of clinical research, measurement error, has been overcome.

Measurement error is often categorized as occurring either randomly or systematically in an experiment. Random errors are inconsistent discrepancies that occur by chance in a study. They are not found to follow any pattern that could introduce bias into the results; they are simply naturally occurring events that detract from the precision of clinical measures (10). If the researcher is inexperienced with the measurements to be taken and is uncertain about his/her judgments, the possibility of random error is introduced. Research procedures often include repeated trials for measurements to decrease these types of random errors so an average of several trials may be entered for the subject's score. This method will provide a score that will more closely approximate the subject's true score than does any one trial score (10). Consistent errors that persist from one subject to another are considered systematic errors.
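The benefit of averaging repeated trials can be sketched with a small simulation. The code assumes each observed score equals a fixed true score plus normally distributed random error; the true score, the error spread and the number of simulated sessions are all hypothetical values chosen for illustration.

```python
import random
from statistics import mean, stdev

random.seed(7)  # fixed seed so the sketch gives the same numbers each run

TRUE_SCORE_DEG = 25.0  # hypothetical true axial rotation for one subject
ERROR_SD_DEG = 4.0     # hypothetical spread of the random error
N_TRIALS = 10          # trials averaged per session
N_SESSIONS = 2000      # simulated measurement sessions


def observed_trial():
    """One observed score: the true score plus random error."""
    return TRUE_SCORE_DEG + random.gauss(0.0, ERROR_SD_DEG)


# Score each session from a single trial.
single_trial_scores = [observed_trial() for _ in range(N_SESSIONS)]

# Score each session as the average of N_TRIALS trials.
averaged_scores = [
    mean([observed_trial() for _ in range(N_TRIALS)])
    for _ in range(N_SESSIONS)
]

spread_single = stdev(single_trial_scores)
spread_averaged = stdev(averaged_scores)
print(f"spread of single-trial scores: {spread_single:.2f} degrees")
print(f"spread of 10-trial averages:   {spread_averaged:.2f} degrees")
```

Under these assumptions the averaged scores cluster more tightly around the true score than single-trial scores do. Averaging does nothing for an error that is the same on every trial.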

Both random error and systematic error will undermine the validity of the clinical measure (6). Systematic error is of particular concern because it leaves reliability intact and can therefore go undetected; a clinician may find the clinical measure reliable, assume it is sound and proceed with its use. In the study example, range-of-motion measurements were taken by placing a precision protractor on a video monitor's screen and measuring both beginning and ending angular measurements. If the numbers marked on the protractor were in error by three degrees, then all measurements made with that protractor would be off by three degrees in the same direction. This systematic error would not affect the reliability of the measurements taken by the investigators, but statements about the average amount of axial rotation permitted by an orthosis would carry with them the error of three degrees. This illustrates the intimate relationship between reliability and validity and the influence of measurement error on both of these important characteristics of measures.
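The protractor example can be made concrete with a short sketch. The readings are invented, and the code assumes the three-degree error is added to every measurement. Trial-to-trial differences, on which reliability estimates rest, are unchanged by the offset, while the reported mean shifts by the full three degrees.

```python
from statistics import mean

OFFSET_DEG = 3.0  # constant error of the mislabelled protractor

# Hypothetical readings (degrees) from a correctly marked protractor,
# two trials per subject.
trial_1 = [24.0, 31.0, 27.0, 35.0, 29.0]
trial_2 = [25.0, 30.0, 28.0, 36.0, 28.0]

# The same trials as read from the mislabelled protractor.
biased_1 = [x + OFFSET_DEG for x in trial_1]
biased_2 = [x + OFFSET_DEG for x in trial_2]


def trial_differences(first, second):
    """Per-subject change from the first trial to the second."""
    return [b - a for a, b in zip(first, second)]


diffs_correct = trial_differences(trial_1, trial_2)
diffs_biased = trial_differences(biased_1, biased_2)
mean_shift = mean(biased_1) - mean(trial_1)

print("trial-to-trial differences unchanged:", diffs_correct == diffs_biased)
print(f"shift in reported mean: {mean_shift:.1f} degrees")
```

A reliability analysis of the biased readings would look exactly like one of the correct readings, which is why comparison with a criterion measure is needed to expose this kind of error.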

Quantification of Measurement Error

Several statistical techniques are available to clinical researchers to allow them to quantify the errors associated with their measures. In instances where the research project's purpose is to estimate measures of central tendency (e.g., mean) of a variable for a set of subjects so that the researcher and reader may generalize this value to a population of similar patients who were not studied, both confidence intervals and the standard error of the mean are very useful tools (9). Applying these tools in the sample study would allow the investigator to report the associated error (e.g., +/- 5 degrees) along with the estimate of average cervical axial rotation available in each orthotic condition. These analyses improve the reader's confidence in the reported means, especially those derived from studies with small sample sizes.
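A minimal sketch of these two tools follows, using invented rotation values for 10 subjects in a single orthotic condition (not data from the sample study). The interval assumes the scores are roughly normally distributed and uses the tabled two-sided 95% critical value of Student's t for 9 degrees of freedom.

```python
import math
from statistics import mean, stdev

# Hypothetical axial rotation (degrees) for 10 subjects in one orthosis.
rotation_deg = [22.0, 24.0, 25.0, 26.0, 27.0, 23.0, 28.0, 25.0, 24.0, 26.0]

# Two-sided 95% critical value of Student's t for n - 1 = 9 degrees of
# freedom, taken from a standard t table.
T_CRIT_95_DF9 = 2.262

n = len(rotation_deg)
sample_mean = mean(rotation_deg)
sample_sd = stdev(rotation_deg)    # sample standard deviation (n - 1)
sem = sample_sd / math.sqrt(n)     # standard error of the mean
margin = T_CRIT_95_DF9 * sem       # half-width of the 95% interval
ci_low = sample_mean - margin
ci_high = sample_mean + margin

print(f"mean = {sample_mean:.1f} degrees, SEM = {sem:.2f}")
print(f"95% CI: {ci_low:.1f} to {ci_high:.1f} degrees")
```

With a different sample size the critical value changes, so the constant above applies only to samples of 10.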

Other Sources of Error in Research

Measurement error is considered the primary source of error in most research designs. Errors in sampling, instrumentation, procedures and data analysis also occur, so a thorough review of all aspects of the research design should be undertaken before beginning an experiment. The search for errors of procedure and technique can be facilitated by conducting a pilot study.

Particular attention should be paid to all procedures that might affect the measurements, including lack of consistent stabilization of subjects in equipment, inconsistent instructions given to subjects, inaccurate procedures for reading measurement dials or gauges (parallax), inadequate procedures for initiating timing sequences, equipment failures and plans for backup equipment. If the interaction between the subject and any research equipment is novel, procedures should allow the subject to become thoroughly familiar with the equipment prior to the recording of any measurements to eliminate learning effects in the study.

As quickly becomes obvious, not all errors can be completely eliminated from clinical investigations. Attempts to control sources of errors in measurement and procedures of research are not unlike the tension between internal and external experimental validity discussed by Lunsford (11). The application of too great an effort to control error may enhance reliability but somewhat decrease validity. Reasonable efforts to assess measurement properties should be expected of those conducting clinical investigations.

Implications for Practice

This discussion of measurement and error has been set in the context of the performance of clinical research. Clinicians' expectations of the reliability of clinical measures as used in daily practice should be no less than those of research colleagues. Practitioners, clinicians and researchers share high expectations of what clinical measures should do. Another important clinical consideration relates to the nature of the theoretical validity of clinical measures: Do we currently have the most useful clinical measures available to us in orthotics and prosthetics?

The clinician community has an important insight into the content of useful and appropriate clinical measures given the current climate of health-care delivery systems. To paraphrase Campbell (12), how good does the gait of a patient with a transfemoral amputation have to be at the completion of rehabilitation to result in long-term use of his/her prosthesis as a community ambulator rather than in a wheelchair?

More clinical measures that can answer this type of question (by providing this type of measurement predictive validity) are needed. What types or collection of measures should be taken to predict successful ambulation with bilateral KAFOs in patients who have experienced spinal cord injuries? In developing such an assessment, a major concern is content validity: Has a sampling of all predictive measures been included? When clinicians ask themselves these questions, they are attempting to make their measurements accurate, efficient and useful to their patients.

Clinical decisions made by prosthetists and orthotists, in conjunction with other rehabilitation team members, have significant consequences for patients and for the health-care system. These consequences include an impact on the functional capabilities of the patients and the allotment of considerable time and financial resources of the health-care system. In the future, O&P professionals will be held even more accountable for their clinical decisions, and the quality of clinical measures has a direct effect on the adequacy of these clinical decisions.

Each clinician must participate in a constant evaluation of the characteristics of a clinical measure to ensure it is performing as needed. Campbell states, "As we develop new clinical tests, we must assess their validity in predicting disability in the community and link observed functional limitations to the need for assistive technology or other compensations for permanently limited function" (12). If the outcomes of clinical decisions are no better than the toss of a coin, or the judgments of an uninformed insurance carrier, then O&P professionals may lose the ability to make these judgments in the future.


References:

  1. Lunsford TR, Davidson M, Lunsford BR. The effectiveness of four contemporary cervical orthoses in restricting cervical motion. JPO October 1994; 6:4:93-9.
  2. Lunsford BR. Methodology: Variables and levels of measurement. JPO October 1993; 5:4:121-4.
  3. Shurr DG. Practical clinical measures. JPO October 1993; 5:4:131-3.
  4. Rothstein JM. Measurement in physical therapy. New York: Churchill Livingstone, 1985.
  5. Payton OD. Research: The validation of clinical practice, 3rd ed. Philadelphia: F.A. Davis Co., 1993.
  6. Sim J, Arnell P. Measurement validity in physical therapy research. 1993: 73:2:102-15.
  7. Rothstein JM. Reliability and validity: Implications for research. In: Bork CE (ed.), Research in Physical Therapy. Philadelphia: J.B. Lippincott Co., 1992.
  8. Payton OD. Commentary on Sim J and Arnell P: Measurement validity in physical therapy research. 1993: 73:2:102-15.
  9. Domholdt E. Physical therapy research: Principles and applications. Philadelphia: W.B. Saunders Co., 1993.
  10. Currier DP. Elements of research in physical therapy. 3rd ed. Baltimore: Williams and Wilkins, 1990.
  11. Lunsford TR. Clinical research. JPO October 1993; 5:4:101-11.
  12. Campbell SK. Commentary on Sim J and Arnell P: Measurement validity in physical therapy research. 1993: 73:2:102-15.


 

JPO 1995; Vol. 7, Num. 3, pp. 100-104

 

Copyright © American Academy of Orthotists & Prosthetists (AAOP)
All rights reserved. See disclaimer
