Continued article from The Behavioral Measurement Letter, Vol. 7, No. 2, Winter 2002
Fred B. Bryant
Selecting appropriate measurement instruments is among the tasks researchers most frequently face. Yet, surprisingly little has been written about how best to go about the process of instrument selection. Given the prevalence of self-report methods of measurement in the social sciences, the task of selecting an instrument most often involves choosing from among a set of seemingly relevant questionnaires, surveys, inventories, checklists, and scales.
For example, a researcher who wishes to measure depression in college students might locate dozens of potentially useful self-report instruments designed to assess this construct. Indeed, the September 2001 release of the Health and Psychosocial Instruments (HaPI) database contains 105 primary records of self-report instruments with the term “depression” in the title, excluding measures developed for use with children or the elderly, those written in foreign languages, and those assessing attitudes toward, knowledge of, or reactions to depression rather than depression per se. The seemingly appropriate instruments thus identified include the Beck Depression Inventory (BDI; Beck, 1987), Hamilton Rating Scale for Depression (HAM-D; Hamilton, 1960), Center for Epidemiologic Studies Depression Scale (CES-D; National Institute of Mental Health, 1977), and Self-Rating Depression Scale (SDS; Zung, 1965), to name just a few. How should the researcher decide which to use?
One strategy for selecting instruments is to employ only those most commonly used in published studies. Not only is this strategy simple and straightforward, but some researchers follow it in the hope of increasing the likelihood that their research will be published. However, it limits conceptual definitions to those created within the theoretical frameworks of commonly used instruments. Over time, this effectively constricts the generalizability of research on these constructs. Further, all measurement instruments tap irrelevancies that have nothing to do with the constructs they are intended to assess. Using only a single measure of a particular research construct makes it impossible to know the degree to which the irrelevancies in the measure affect the obtained results. Moreover, diversity in operationalization helps us better understand not only what we’re measuring, but also what we’re not measuring. Thus, in the long run, employing only the most commonly used instruments limits and weakens the body of scientific knowledge.
Although it is difficult to devise a universally applicable set of rules for selecting measurement instruments, it is possible to suggest some general guidelines that researchers can use in choosing appropriate self-report measures. What follows, then, is a set of precepts and principles for selecting instruments for research purposes, along with concrete examples illustrating each. Note that the order of presentation is not intended to reflect the guidelines’ relative importance in the process of instrument selection. In fact, each principle is essential in selecting the right measurement tool for the job.
- Before choosing an instrument, define the construct you want to measure as precisely as feasible. Ideally, researchers should not rely merely on a label or descriptive term to represent the construct they wish to assess, but should instead define the construct of interest clearly and precisely in theory-relevant terms at the outset. Being unable to specify beforehand what it is you want to measure makes it hard to know whether or not a particular instrument is an appropriate measure of the target construct (Bryant, 2000). Potentially useful strategies for defining research constructs are to draw on published literature reviews that synthesize available theories concerning a particular construct, or to review the published literature on one’s own in search of alternative theoretical definitions. If you can find alternative conceptual definitions of the target construct, then you can choose from among them a particular conception that resonates with your own thinking. The process of explicitly conceptualizing the construct that you wish to measure is known as “preoperational explication” (Cook & Campbell, 1979).
Imagine, for example, that a clinical researcher wants to use a self-report measure of shyness. A wide variety of potentially relevant measures can be found in the Health and Psychosocial Instruments database, depending on how the researcher conceptually defines shyness. Is shyness (for which there are 30 primary records in the database) synonymous with introversion (30 primary records), timidity (17 primary records), emotional insecurity (2 primary records), social anxiety (63 primary records), social fear (1 primary record), social phobia (29 primary records), social avoidance (4 primary records), or social isolation (76 primary records)? The clearer and more precise the initial conceptual definition, the easier it will be to find appropriate measurement tools. An added benefit of precisely specifying target constructs at the outset is that it helps to focus the research.
Although precise preoperational explication is the ideal when selecting measures, in actual practice it is often difficult beforehand to specify a clear conceptual definition of the target construct. Many times the published literature does not provide explicit alternatives, which forces researchers to explicate constructs on their own, the equivalent of trying to define an unknown word without having a dictionary. As a result, researchers often begin by selecting a particular instrument that appears useful, thus adopting by default the conceptual definition of the target construct that underlies the chosen instrument. In such cases, an available instrument dictates one’s conceptual definition post hoc.
- Choose the instrument that is designed to tap an underlying construct whose conceptual definition most closely matches your own. Carefully consider the theoretical framework on which the originators based their instrument. Select an instrument that stems from a theory that defines the construct the same way you do, or at least in a way that does not contradict your conceptual orientation, for the theoretical underpinnings of the instrument should be compatible with your own conceptual framework.
Consider a sociologist and a psychologist, each of whom wants to measure guilt. The most appropriate self-report instrument in each case will be the one whose underlying conceptual definition most closely corresponds to that of the researcher. The sociologist, on the one hand, might be studying people’s reactions to homeless adults from a macro-level, sociocultural perspective. If so, then she might begin by defining guilt to be a prosocial emotion experienced when one perceives oneself as being better off than another person who is disadvantaged. Consistent with this conceptualization is Montada and Schneider’s (1989) three-item measure of “existential guilt,” conceived as prosocial feelings in reaction to the socially disadvantaged. The psychologist, on the other hand, might be studying personality from a micro-level, individual perspective. If so, then she might begin by defining guilt to be a dispositional feeling of regret or culpability in reaction to perceived personal or moral transgressions. Consistent with this conceptual definition is Kugler and Jones’ (1992) 98-item Guilt Inventory, which includes a separate subscale specifically tapping personal guilt as an unpleasant emotional trait. Clearly, instruments must have underlying conceptual definitions that match your own conceptual framework (Brockway & Bryant, 1998).
- Never decide to use a particular instrument based solely on its title. Just because the name of an instrument includes the target construct does not guarantee that it either defines this construct the same way you do or even measures the construct at all. Don’t let the title lead you to select an inappropriate instrument.
As a case in point, consider a developmental psychologist who wants to measure parents’ psychological involvement in their families. Based on its promising title, the researcher decides to use the Parent/Family Involvement Index (PFII; Cone, DeLawyer, & Wolfe, 1985). After obtaining the instrument and inspecting its constituent items, the researcher realizes to his chagrin that the PFII requires a knowledgeable informant (e.g., a teacher or teacher’s aide) to indicate whether or not the particular parent of a school-aged child has engaged in each of 63 specific educational activities. Based on an underlying conceptualization of family involvement in psychological terms, a more appropriate instrument would be a measure developed by Yogev and Brett (1985) that assesses parental involvement in terms of the degree to which parents identify psychologically with and are committed to family roles. This example shows clearly that you can’t judge an instrument by its title any more than you can judge a book by its cover (Brockway & Bryant, 1997).
- Choose an instrument for which there is evidence of reliability and validity. Reliability is measurement that is accurate and consistent. Good reliability in measurement strengthens observed statistical relationships — the more reliable the instrument, the smaller will be the error in measurement, and the closer observed results will be to actual results. For example, imagine a medical researcher who wants to determine whether an experimental antipyretic agent reduces fever more rapidly than available antipyretics, but who is using an unreliable thermometer that gives different readings over time, even when body temperature is stable. These inconsistencies in measurement make it nearly impossible to assess temperature accurately, and greatly decrease the likelihood of finding any experimental effects.
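The claim that reliability strengthens observed relationships can be made concrete with the classical attenuation formula, under which the observed correlation equals the true correlation multiplied by the square root of the product of the two measures' reliabilities. The following is a minimal sketch of that formula; the function name and the numbers are illustrative assumptions, not values from any study cited here.

```python
from math import sqrt


def attenuated_correlation(true_r, reliability_x, reliability_y):
    """Expected observed correlation under the classical attenuation formula.

    true_r: correlation between the error-free constructs (hypothetical input).
    reliability_x, reliability_y: reliability coefficients in [0, 1].
    """
    for rel in (reliability_x, reliability_y):
        if not 0 <= rel <= 1:
            raise ValueError("reliability must lie between 0 and 1")
    return true_r * sqrt(reliability_x * reliability_y)


# Hypothetical illustration: the same true relationship measured with
# a highly reliable instrument versus an unreliable one.
with_reliable_measure = attenuated_correlation(0.50, 0.90, 0.90)
with_unreliable_measure = attenuated_correlation(0.50, 0.40, 0.90)
print(round(with_reliable_measure, 2), round(with_unreliable_measure, 2))
```

The lower the reliability of either measure, the more the observed correlation shrinks toward zero, which is why the unreliable thermometer makes experimental effects harder to detect.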
Data supporting the validity of an instrument increase one’s confidence that it really measures what it is designed to measure. For example, the medical researcher referred to above can be more confident that the thermometer actually measures temperature if its readings correlate highly with those of infrared-telemetry or body-scanning devices. Although instrument developers sometimes report reliability and validity data, such empirical evidence is often available only in published studies that have used the given measure. As a rule, avoid judging the validity of an instrument by the content of its constituent items. What an instrument appears to measure “on its face” (i.e., face validity) is not necessarily what it actually measures. As in the case of an instrument’s title, what you see is not necessarily what you get.
Judging the quality of research evidence concerning measurement reliability and validity can be difficult and confusing. There are various types of reliability (e.g., internal consistency, split-half, parallel-forms, interrater, test-retest) and of reliability coefficients (e.g., Cronbach’s alpha, coefficient kappa, intraclass correlation, KR-20). Similarly, there are numerous types of validity (e.g., construct, concurrent, criterion-referenced, convergent, discriminant). Thus a host of specialized statistical tools has been developed to quantify both reliability (Strube, 2000) and validity (Bryant, 2000). Given the numerous types of reliability and reliability coefficients, validity, and tools used to assess reliability and validity, instrument selection requires at least a basic understanding of psychometrics and of principles of scale validation in order to make informed judgments of instrument quality. When there is no evidence concerning an instrument’s reliability or validity, measurement becomes a “shot in the dark.”
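As one concrete instance of the coefficients mentioned above, Cronbach’s alpha can be computed directly from item-level responses using its standard formula: k/(k-1) times one minus the ratio of summed item variances to the variance of total scores. The sketch below is illustrative only; the function name and the response data are hypothetical, and a real analysis would normally rely on an established statistics package.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    k = len(item_scores[0])
    if k < 2 or len(item_scores) < 2:
        raise ValueError("need at least two items and two respondents")
    # Sample variance of each item across respondents.
    item_vars = [variance([row[i] for row in item_scores]) for i in range(k)]
    # Sample variance of the summed total score across respondents.
    total_var = variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical responses: five respondents answering a three-item scale.
responses = [
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [1, 2, 2],
    [3, 3, 3],
]
alpha = cronbach_alpha(responses)
print(round(alpha, 3))
```

Alpha reflects internal consistency only; it says nothing about test-retest or interrater reliability, which require different designs and coefficients.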
- Given a choice among alternatives, select an instrument whose breadth of coverage matches your own conceptual scope. If you define your target construct as having a wide range of requisite constituent components, then choose an instrument whose items tap a broad spectrum of content relating to those components. On the other hand, if you define your target construct in a way that specifies a narrower set of conceptual components, then choose an instrument that has a more restrictive and specific content.
Breadth of coverage varies widely across available instruments. For example, to measure coronary-prone Type A behavior, alternatives include: the Student Jenkins Activity Survey (Glass, 1977) that taps the Type A components of hard-driving competitiveness and speed-impatience; the Type A Self-Rating Inventory (Blumenthal, Herman, O’Toole, Haney, Williams, & Barefoot, 1985) that taps hard-drivingness and extraversion; the Type A/Type B Questionnaire (Eysenck & Fulker, 1983) that taps tension, ambition, and activity; the Time Urgency and Perpetual Activation Scale (Wright, 1988) that taps activity and time urgency; and the Self-Assessment Questionnaire (Thoresen, Friedman, Powell, Gill, & Ulmer, 1985) that taps hostility, time urgency, and competitiveness. Clearly, your choice of instrument depends on the specific components of Type A behavior that you want to investigate.
The choice between general versus specific measures of a given construct may also depend on your particular research question. Consider the process of coping, for example. Numerous instruments exist for measuring people’s general style of coping in response to stress. However, if you want to study coping in relation to a specific problem or stressor, a host of other measures have been developed to assess individual differences in coping with such specific concerns as arthritis, asthma, cancer, chronic pain, diabetes, heart disease, hypertension, stroke, multiple sclerosis, spinal cord injury, HIV/AIDS, rape, sexual abuse, abuse, sexual harassment, pregnancy, miscarriage, childbirth, economic pressure, job stress, unemployment, depression, bereavement, natural disasters, prison confinement, test anxiety, and war trauma, to name just a few. Compared to a broad-band measure of general coping, a narrow-band measure of coping that is specific to a particular stressor would be expected to show stronger relationships with reactions to that specific stressor.
- Select an instrument that provides a level of measurement appropriate to your research goals. Some instruments are based on a theoretical model in which the underlying construct is assumed to be “unitary.” Such instruments provide only a general, global “total score” that summarizes the overall level of responses. Other instruments are based on a theoretical framework in which the latent construct is considered to be “multidimensional.” Such instruments provide multiple “subscale scores,” each of which taps a separate dimension of the underlying construct. Thus, if you want to gather global summary information about a target construct, then use a unitary instrument. If you want to gather information about multiple facets of a target construct, then use a multidimensional instrument.
Imagine two nursing researchers, each of whom wishes to measure patients’ life satisfaction. One seeks a global summary of patients’ overall life satisfaction, whereas the other wants to compare levels of satisfaction across important aspects of patients’ lives. The first researcher could use the five-item Satisfaction with Life Scale (Diener, Emmons, Larsen, & Griffin, 1985) to obtain a global total score. The second researcher could use the 66-item Quality of Life Index (Ferrans & Powell, 1985) to obtain individual scores for four separate satisfaction subscales: Health/Functioning, Socioeconomic, Psychological/Spiritual, and Family.
- Choose an instrument with a time frame and response format that meet your needs. Don’t use a “trait” measure (i.e., an instrument that defines the underlying construct as a stable disposition) to assess a transient, situational “state” that you expect to change over time. And don’t modify the labels on a response scale (e.g., from “rarely” to “never”) or the time frame of measurement (e.g., from “in general” or “on average” to “during the past day” or “in the past hour”) unless you have no other recourse. Substantial changes in an instrument’s response format or time frame can compromise its construct validity and therefore require revalidation.
In measuring hostility, for example, the choice of an appropriate instrument would depend on whether you conceive of hostility as a transitory, variable state or a stable, dispositional trait. An appropriate tool for measuring “state” hostility might be the 35-item Current Mood Scale (Anderson, Deuser, & DeNeve, 1995), which is designed to assess situational hostility, whereas an appropriate tool for measuring “trait” hostility might be the 50-item Cook-Medley Hostility Scale (Cook & Medley, 1954), which is based on the Minnesota Multiphasic Personality Inventory (MMPI; Hathaway & McKinley, 1989) and conceptualizes hostility as a personality trait. Clearly, when selecting instruments you need to distinguish carefully between states and traits.
- Match the reading level required to understand the items in the instruments you select to the age of the intended respondents. In studying children or adolescents, for example, avoid using an instrument designed for use with adults. When in doubt, use word-processing or linguistic software to determine the reading ability level required to understand an instrument’s constituent items.
Imagine a researcher who wants to measure depression in children. Depending on the average age of the subjects, the researcher could choose from a variety of different self-report instruments specifically designed to assess depression in children of various ages, including those with a first-grade reading level (Children’s Depression Inventory; Kovacs, 1985), children age 7-13 (Negative Affect Self-Statement Questionnaire; Ronan, Kendall, & Rowe, 1994), children age 8-12 (Childhood Depression Assessment Tool; Brady, Nelms, Albright, & Murphy, 1984), or children age 8-13 (Depression Self-Rating Scale; Asarnow, Carlson, & Guthrie, 1987). In any case, the researcher studying childhood depression should avoid adopting an instrument designed to tap depression in adults.
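For researchers who want a rough, do-it-yourself check of reading level, the widely used Flesch-Kincaid grade-level formula can be applied to an instrument’s items. The sketch below is an approximation under stated assumptions: the syllable counter is a crude vowel-group heuristic introduced here for illustration, the two sample items are invented, and dedicated readability software will generally give more trustworthy estimates.

```python
import re


def count_syllables(word):
    """Rough heuristic: count runs of consecutive vowels (assumption, not exact)."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing 'e' as silent (e.g. "make"), but not in "-le" endings.
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_kincaid_grade(text):
    """Approximate U.S. school grade level needed to read the text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence")
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)


# Invented items for illustration; not taken from any published instrument.
simple_item = "I feel sad."
complex_item = "I frequently experience overwhelming melancholy."
print(flesch_kincaid_grade(simple_item) < flesch_kincaid_grade(complex_item))
```

Very short items produce unstable estimates, so it is usually more informative to score the full set of items and instructions together.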
- Never use an instrument unless you know how you’ll score it and how you’ll analyze it. This rule may seem self-evident, but it is sometimes violated unintentionally. No matter how interesting or important an instrument seems, it is useless unless you can convert responses to it into meaningful data. Sometimes the scoring rules are difficult to obtain or hard to follow, particularly when an instrument consists of multiple composite subscales, reverse-scored items, or item-specific scoring weights. This suggests that researchers should first make sure they know how to score an instrument before they administer it.
Consider the SF-12 (Ware, Kosinski, & Keller, 1996), a 12-item self-report instrument designed to measure functional health status. At first glance, it might appear simple enough to score this instrument by simply summing or averaging responses to its constituent items. But the test manual for the SF-12 (Ware, 1993) specifies a complex set of mathematical computations designed to weight and combine the 12 item scores to produce composite scores reflecting mental, physical, and total functioning. Users cannot score the SF-12 correctly unless they have access to the detailed scoring instructions contained in the test manual. Clearly, then, administering an instrument is one thing, but scoring it can be an entirely different matter.
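To illustrate why scoring rules need to be settled before administration, here is a minimal sketch of scoring a purely hypothetical six-item instrument with a 1-to-5 response scale, two reverse-scored items, and two subscales. Every detail (item numbers, subscale names, the use of subscale means) is an assumption made for illustration; it is not the SF-12 algorithm, which must be taken from the test manual.

```python
SCALE_MIN, SCALE_MAX = 1, 5              # hypothetical response scale
REVERSE_SCORED = {2, 5}                  # hypothetical reverse-keyed items
SUBSCALES = {                            # hypothetical subscale composition
    "subscale_a": [1, 2, 3],
    "subscale_b": [4, 5, 6],
}


def score_respondent(responses):
    """Return subscale means for one respondent.

    responses: dict mapping item number -> raw response on the 1-5 scale.
    """
    recoded = {}
    for item, raw in responses.items():
        if not SCALE_MIN <= raw <= SCALE_MAX:
            raise ValueError(f"item {item}: response {raw} is out of range")
        # Reverse scoring flips the scale: 1 <-> 5, 2 <-> 4, 3 stays 3.
        recoded[item] = SCALE_MIN + SCALE_MAX - raw if item in REVERSE_SCORED else raw
    return {
        name: sum(recoded[i] for i in items) / len(items)
        for name, items in SUBSCALES.items()
    }


raw_responses = {1: 4, 2: 2, 3: 5, 4: 1, 5: 5, 6: 2}
scores = score_respondent(raw_responses)
print(scores)
```

Simply averaging the raw responses would ignore the reverse-keyed items and yield different, misleading scores, which is the kind of error that consulting the scoring instructions beforehand prevents.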
- Rather than choosing only one measure, when feasible use multiple measures of the construct you wish to assess. A central tenet of classical measurement theory is that any single way of measuring a construct has unavoidable idiosyncrasies that are unique to the measure and have nothing to do with the underlying conceptual variable. By studying what multiple measures of the same construct have in common, researchers can converge or triangulate on the referent construct. Using multiple measures also allows an assessment of the generalizability of results across alternative operational or conceptual definitions, to probe the generality versus specificity of effects. And in the long run, using multiple instruments will advance our understanding of the targeted construct much further than simply using a single instrument.
Even when following the ten guidelines for instrument selection discussed above, you will still sometimes face difficult, highly subjective decisions in choosing appropriate measures. For example, which mood measure is more appropriate: one that uses a seven-point response scale, or one that uses a four-point response scale; one that includes a specific label for each individual point on its response scale, or one that has labels only at its endpoints; one that assesses the frequency with which respondents experience certain feelings, or one that taps the percentage of time respondents experience certain feelings? Given the subjectivity of such decisions, it makes all the more sense to use multiple measures whenever possible so as to evaluate the generalizability of results across alternative operational definitions of the same underlying construct.
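One simple way to examine what multiple measures of the same construct have in common is to inspect the correlations among them. The sketch below computes pairwise Pearson correlations for three hypothetical mood measures; the measure names and scores are invented for illustration, and a full multitrait-multimethod or latent-variable analysis would go considerably further.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two score lists of equal length (at least 2)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical total scores for six respondents on three mood measures.
measures = {
    "measure_a": [12, 18, 9, 22, 15, 11],
    "measure_b": [30, 41, 25, 47, 33, 29],
    "measure_c": [3, 5, 2, 6, 5, 3],
}

# Pairwise correlations: high values suggest the measures converge.
convergence = {
    (m1, m2): pearson_r(measures[m1], measures[m2])
    for m1, m2 in combinations(measures, 2)
}
for pair, r in convergence.items():
    print(pair, round(r, 2))
```

If results hold across measures that correlate well with one another, the findings are less likely to reflect the idiosyncrasies of any single instrument.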
To recap, I have suggested “Ten Commandments” for selecting self-report instruments:
- Before choosing an instrument, define the construct you want to measure as precisely as feasible.
- Choose the instrument that is designed to tap an underlying construct whose conceptual definition most closely matches your own.
- Never decide to use a particular instrument based solely on its title.
- Choose an instrument for which there is evidence of reliability and validity.
- Given a choice among alternatives, select an instrument whose breadth of coverage matches your own conceptual scope.
- Select an instrument that provides a level of measurement appropriate to your research goals.
- Choose an instrument with a time frame and response format that meet your needs.
- Match the reading level required to understand the items in the instruments you select to the age of the intended respondents.
- Never use an instrument unless you know how you’ll score it and how you’ll analyze it.
- Rather than choosing only one measure, when feasible use multiple measures of the construct you wish to assess.
Following these guidelines will help you to select instruments wisely.
Obtaining Instruments Once Identified
Adhering to these ten guiding principles can help you identify appropriate measurement instruments, but then you need to obtain them and permission to use them. Indeed, physical availability may ultimately dictate instrument choice. An instrument may be unavailable for a variety of reasons, including copyright restrictions, being out-of-print, or death of the primary author. Given such obstacles, rather than trying to contact the original developer to obtain copies of an instrument, it may be best to contact BMDS, Behavioral Measurement Database Services, creator of the HaPI database, to secure permission to use an instrument, and to obtain a hardcopy of it along with any scoring instructions. For a reasonable fee, BMDS will perform these services.
Anderson, C.A., Deuser, W.E., & DeNeve, K.M. (1995). Hot temperatures, hostile affect, hostile cognition, and arousal: Tests of a general model of affective aggression. Personality and Social Psychology Bulletin, 21, 434-448.
Asarnow, J.R., Carlson, G.A., & Guthrie, D. (1987). Coping strategies, self-perceptions, hopelessness, and perceived family environments in depressed and suicidal children. Journal of Consulting and Clinical Psychology, 55, 361-366.
Beck, A.T. (1987). Beck Depression Inventory (BDI). San Antonio, TX: Psychological Corporation.
Blumenthal, J.A., Herman, S., O’Toole, L.C., Haney, T.L., Williams, R.B., Jr., & Barefoot, J.C. (1985). Development of a brief self-report measure of the Type A (coronary prone) behavior pattern. Journal of Psychosomatic Research, 29, 265-274.
Brady, M.A., Nelms, B.C., Albright, A.V., & Murphy, C.M. (1984). Childhood depression: Development of a screening tool. Pediatric Nursing, 10, 222-225, 227.
Brockway, J.H., & Bryant, F.B. (1997). Teaching the process of instrument selection in family research. Family Science Review, 10, 182-194.
Brockway, J.H., & Bryant, F.B. (1998). You can’t judge a measure by its label: Teaching students how to locate, evaluate, and select appropriate instruments. Teaching of Psychology, 25, 121-123.
Bryant, F.B. (2000). Assessing the validity of measurement. In L.G. Grimm & P.R. Yarnold (Eds.), Reading and understanding more multivariate statistics (pp. 99-146). Washington, DC: American Psychological Association.
Cook, T.D., & Campbell, D.T. (1979). Quasi-experimentation: Design & analysis issues for field settings. Chicago: Rand McNally.
Cook, W.W., & Medley, D.M. (1954). Proposed hostility and pharisaic-virtue scales for the MMPI. Journal of Applied Psychology, 38, 414-419.
Cone, J.D., DeLawyer, D.D., & Wolfe, V.V. (1985). Assessing parent participation: The Parent/Family Involvement Index. Exceptional Children, 51, 417-424.
Diener, E., Emmons, R.A., Larsen, R.J., & Griffin, S. (1985). The Satisfaction with Life Scale. Journal of Personality Assessment, 49, 71-75.
Eysenck, H., & Fulker, D. (1983). The components of Type A behavior and its genetic determinants. Personality and Individual Differences, 4, 499-505.
Ferrans, C.E., & Powell, M.J. (1985). Quality of Life Index: Development and psychometric properties. Advances in Nursing Science, 8, 15-24.
Glass, D.C. (1977). Behavior patterns, stress, and coronary disease. Hillsdale, NJ: Lawrence Erlbaum.
Hamilton, M. (1960). A rating scale for depression. Journal of Neurology, Neurosurgery, and Psychiatry, 23, 56-62.
Hathaway, S.R., & McKinley, J.C. (1989). Minnesota Multiphasic Personality Inventory (MMPI). Minneapolis, MN: NCS Assessments.
Kovacs, M. (1985). The Children’s Depression Inventory (CDI). Psychopharmacology Bulletin, 21, 995-998.
Kugler, K., & Jones, W.H. (1992). On conceptualizing and assessing guilt. Journal of Personality and Social Psychology, 62, 318-327.
Montada, L., & Schneider, A. (1989). Justice and emotional reactions to the disadvantaged. Social Justice Research, 3, 313-344.
National Institute of Mental Health (1977). Center for Epidemiologic Studies Depression Scale (CES-D). Rockville, MD: Author.
Ronan, K.R., Kendall, P.C., & Rowe, M. (1994). Negative affectivity in children: Development and validation of a self-statement questionnaire. Cognitive Therapy and Research, 18, 509-528.
Strube, M.J. (2000). Reliability and generalizability theory. In L.G. Grimm & P.R. Yarnold (Eds.), Reading and understanding more multivariate statistics (pp. 23-66). Washington, DC: American Psychological Association.
Thoresen, C.E., Friedman, M., Powell, L.H., Gill, J.J., & Ulmer, D. (1985). Altering the Type A behavior pattern in postinfarction patients. Journal of Cardiopulmonary Rehabilitation, 5, 258-266.
Ware, J.E., Jr. (1993). SF-12 Health Survey (SF-12). Boston, MA: Medical Outcomes Trust.
Ware, J., Jr., Kosinski, M., & Keller, S.D. (1996). A 12-Item Short-Form Health Survey: Construction of scales and preliminary tests of reliability and validity. Medical Care, 34, 220-233.
Wright, L. (1988). The Type A behavior pattern and coronary artery disease: Quest for the active ingredients and the elusive mechanism. American Psychologist, 43, 2-14.
Yogev, S., & Brett, J. (1985). Patterns of work and family involvement among single- and dual-earner couples. Journal of Applied Psychology, 70, 754-768.
Zung, W.W.K. (1965). A Self-Rating Depression Scale. Archives of General Psychiatry, 12, 371-379.
Fred Bryant is Professor of Psychology at Loyola University, Chicago. He has roughly 90 professional publications in the areas of social psychology, personality psychology, measurement, and behavioral medicine. In addition, he has coedited 6 books, including Methodological Issues in Applied Social Psychology (1993; New York: Plenum). Dr. Bryant has extensive consulting experience in a wide variety of applied settings, including work as a research consultant for numerous marketing firms, medical schools, and public school systems; a methodological expert for the U.S. General Accounting Office; and an expert witness in several federal court cases involving social science research evidence. He is currently on the Editorial Board of the journal Basic and Applied Social Psychology. His current research interests include happiness, the measurement of cognition and emotion, and structural equation modeling.