



Bootstrapping the Way to Valid Diagnostic Interviews


Continued article from The Behavioral Measurement Letter, Vol. 5, No. 2, Spring 1998


Lee Robins


From World War II to the present, there has been a major effort to develop instruments that assess psychiatric disorders. The original goal was to improve the fairness with which young men were drafted into military service during the war. Different draft boards exempted men for psychoneurosis at such startlingly different rates that it became obvious that the clinicians doing the induction examination must have used greatly differing criteria (Starr, 1950). Since World War II, many diagnostic instruments have been written to serve a variety of purposes: (a) to help primary care physicians recognize which patients need mental health services, (b) to estimate the proportion of the population who have psychiatric disorders but are not receiving treatment, (c) to demonstrate that provision of mental health care can reduce the rate of care-seeking and hospitalization, (d) to study the causes and consequences of psychiatric disorders, (e) to select cases for drug trials with a particular diagnosis of interest and rule out those with comorbid conditions, (f) to estimate the cost in beds and services that would be required to treat all those who have a mental disorder, and (g) to describe the target population who might be expected to benefit if such an increase in services became available.

Early instruments developed for assessing mental illness asked local general practitioners and mental health professionals to name persons they knew of who had mental disorders (Helgason, 1964). In early government surveys, one family member was asked to report on the mental health of the whole household (U.S. Department of Health, Education, and Welfare, 1969). Over time, instruments to be used with a proband have become highly standardized, closely tied to the official psychiatric diagnostic manuals, and designed to be given by lay interviewers and scored by computers. Various factors contributed to these changes, including technological advances, concern about costs of data collection and analysis, and, especially, publication of the third edition of the Diagnostic and Statistical Manual (American Psychiatric Association, 1980). The 1952 and 1968 versions of the DSM had defined diagnostic criteria so imprecisely that it was difficult to get agreement as to whether an informant’s disorder did or did not fulfill them. Moreover, published studies showed that general practitioners frequently failed to ask about psychiatric symptoms (Goldberg, Kay, & Thompson, 1976) and that psychiatrists often neglected comorbid conditions while concentrating on the presenting complaint (Helzer, Clayton, Pambakian, & Woodruff, 1978). These findings underscored the need for direct interviews with a random sample of the population, treated and untreated, if the true prevalence of psychiatric disorders was to be ascertained.

The design of diagnostic assessment instruments changed greatly as social scientists from a tradition of public opinion research entered the field of mental health survey research. These researchers demonstrated that accurate estimation of the prevalence of rare disorders would require very large samples, making it infeasible to supplement proband interviews with information from a family member and physician. They also feared that variation in the wording of questions would change results, and so they precisely specified the questions to be asked. They introduced psychometrics, such as indicators of validity and reliability, to evaluation of interview instruments. These developments led to a shift from using interviewers with clinical experience, who were skilled in tailoring questions to the particular respondent, to using lay interviewers who could be trained to ask questions in a standard fashion. Using lay interviewers also reduced costs, an important consideration given the very large samples needed.

Early studies using standardized interview questions sometimes did not attempt diagnoses, but rather assigned respondents to a degree of mental illness (including none) (Kirkpatrick & Michael, 1962). Clinicians at first were asked to review and make judgments based on the data collected by non-clinicians (Leighton, Leighton, & Danley, 1966). Eventually this review of the configuration of symptoms was assigned to a computer, eliminating the need for a clinician. The first computer programs were devised to approximate the decisions of the clinicians who had previously done the assessments, or to reflect the investigator’s judgment as to the critical symptoms of specific disorders and severity of impairment (Wing & Sturt, 1978). Then, when the DSM-III (American Psychiatric Association, 1980) provided clear diagnostic algorithms, computer programs were written to combine answers to standardized questions according to the Manual’s rules to arrive at a diagnosis (Robins, Helzer, Croughan, Williams, & Spitzer, 1980).

When assessments were being made directly by a clinician, validity was generally not investigated because there was no higher authority to appeal to than the clinician who had made the initial assessment. However, even in those early days, some studies included procedures intended to improve validity. Tomas Helgason, for example, perhaps because he was still a young resident at the time he did his landmark study, Epidemiology of Mental Disorders in Iceland (1964), asked five Danish clinicians with whom he had trained to validate his diagnostic decisions. In early diagnostic studies at Washington University, psychiatrists did all of the interviewing. As a clinician talked with a respondent, he/she checked off, on a list organized by diagnoses, those symptoms that met criteria for both clinical significance and absence of a demonstrable physical cause. The clinician then reviewed the symptoms and made a diagnosis using a cutoff point for the number of symptoms required that had been set by consensus among the psychiatrist researchers. Next, the clinician interviewer submitted to a second clinician the list of positive symptoms along with an approximately verbatim record of the respondent’s current problems, treatment history and past episodes of disorder, which had been elicited before questions were asked about specific symptoms. This second clinician, from whom the initial diagnosis was withheld, made a diagnosis independently. Then the two clinicians met to reconcile any differences in their diagnoses. Such consensus diagnoses were assumed to be valid because they had been agreed to by both psychiatrists, although the symptoms had been ascertained by only one. With the advent of video recording, the second psychiatrist could see and hear the respondent but, still, did not collect data independently.

The current standard method for clinical validation is to have a clinician interview the subject independently and make a diagnostic assessment. If the initial interviewer is also a clinician, this permits determination of reliability. If the initial interviewer is not a clinician, then the clinician’s interview becomes the “gold standard” for determining validity.

Another strategy for assessing validity of a new diagnostic interview requires that the respondent be assessed by multiple instruments. In this assessment of concurrent validity, the new instrument is judged to be valid if its diagnoses agree with those obtained using older instruments. A third strategy compares the diagnoses found by the interview with diagnoses found in a patient’s record. This technique has limited utility in general population studies because many persons with psychiatric disorders have never been treated and so have no records. Here, the interview is judged to be valid if all diagnoses found in the record are also obtained through the interview. Note that finding additional diagnoses by interview should not be considered evidence for lack of validity as a structured interview covering the respondent’s lifetime often identifies disorders that do not appear in the record because the attending physician failed to pursue them, thinking them irrelevant to the presenting complaint.

The methods described so far are external validators. A simpler method for assessing the validity of an interview is to assess its face validity. Realistically, however, this could not be done until the introduction in 1980 of clearly defined criteria in the Diagnostic and Statistical Manual. Only then could one be sure that questions were faithful to the diagnostic criteria they were intended to assess. In the Diagnostic Interview Schedule, the DIS (Robins et al., 1995), and The Composite International Diagnostic Interview, the CIDI (World Health Organization, 1997), each question is labeled so as to show which diagnostic criterion it relates to in the DSM-IV (American Psychiatric Association, 1994) or the International Classification of Diseases-10th Edition, the ICD-10 (World Health Organization, 1993), respectively, enabling each user to evaluate its face validity.

However, the most common use of the term “validity” in psychiatry even today remains how well a standardized instrument administered by a lay interviewer and scored by computer corresponds with a clinician’s diagnosis following an interview with the same respondent. Studies using this method vary in the restrictions placed on the clinician conducting the interview, ranging from none (the clinician uses his/her best clinical judgment) through prescribed use of a standard list of topics and the DSM’s algorithms as in, for example, the Structured Clinical Interview for DSM-III-R, the SCID (Spitzer et al., 1989), or the Schedules for Clinical Assessment in Neuropsychiatry, the SCAN (World Health Organization, 1991). The tighter the control over what topics the clinician must cover and how the symptoms must be combined, the more closely the interview done by the clinician resembles the standardized interview being validated, and the less it resembles what the clinician does in his/her daily practice, where he/she chooses which symptoms to ask about, gives them differential weights, and decides whether they constitute a familiar syndrome.

Treating the psychiatrist’s judgment as the “gold standard” seems a bit ironic, since standardized instruments were developed in the first place because of evidence that psychiatrists often disagree among themselves. Clinicians’ reliability is greatly improved when they are forced to follow the rules laid down in the Manual, even though the rules may conflict with their own clinical judgment. But controlling the way the clinicians make their diagnosis forfeits, at least in part, the claim of clinical judgment to being the standard. The problem is more fundamental than that, however, for when clinicians are free to exercise their clinical judgment so that the Manual’s rules are not followed exactly, it becomes impossible to tell whether disagreement between the interview’s diagnoses and the clinicians’ diagnoses means that the interview instrument does not properly operationalize the Manual, and so is invalid, or whether the Manual’s diagnostic criteria are being found to be invalid.

Assessing validity of a new instrument by comparing the diagnoses obtained to those obtained using pre-existing instruments has equal difficulties, for the new interview would not have been written had existing instruments been considered valid. Sometimes older instruments become invalid when the DSM is revised and the diagnostic criteria changed. In other cases, the older instruments are judged to be invalid because they are not “state-of-the-art.” Perhaps they fail to distinguish clinically significant psychiatric symptoms from those caused by the stresses and strains of daily living, physical illness or injury, or substance abuse.

I want to suggest a different approach for assessing validity that is patterned on what is called “back translation,” a method for showing that a translation of text from one language to another is satisfactory. Here, after the initial translation is done, a second translator unfamiliar with the original document translates the initial translation back into the original language. If the second translation is identical to or yields the same meaning as the original, the first translation is considered to be correct. While there are reasons to argue that this is an insufficient test of the validity of a translation, a somewhat parallel approach to validating instruments has appeal. It involves translating interview questions written in technical language in the DSM or ICD into language that most lay persons can understand, and achieving consensus among experts that the questions mean what the Manual says.

The procedure for assessing validity in this way is as follows: First, the protocol’s authors attempt to achieve face validity by reaching consensus that the questions they wrote to operationalize a criterion match the criterion as written and are sufficient to encompass its breadth. Next, they submit those questions to an outside expert panel, with whom they must reach consensus on the questions’ face validity. The third step is a field test where questions are posed to respondents who belong to the population of interest. For questions they answer positively, respondents are asked to describe the experiences that led them to acknowledge the symptom. For questions answered negatively, respondents are asked whether they have had any experiences somewhat like the symptom inquired about, and if so, they are asked to describe them and to explain how they decided that those experiences did not warrant a positive response. The fourth step is to present these positive and below threshold examples to the authors and group of experts who previously agreed that the questions matched the criteria in the Manual. They then decide together, question by question, whether the respondents’ decisions about what did and did not constitute a symptom fall within or outside the meaning of the criterion. For questions where the correspondence is judged not to be satisfactory, they rewrite the questions and then repeat the third and fourth steps.

One advantage of this approach is that it does not dismiss an interview instrument as being invalid if it fails the field test, but rather allows for revision until the questions mean to respondents what the authors intended them to mean. In addition, it can be used in developing training materials because it forces explicitness in specifying what each question is intended to express and provides examples of symptoms that do and do not meet diagnostic criteria.

There is a last essential step in assessing validity. It is not enough that each question designed to assess criterion symptoms means to respondents what experts think is specified in the written criterion, for the criteria must also be correctly combined to yield a valid diagnosis. This work is now done by computer programs that map onto algorithms in the Manual. Error in these computer programs can occur through simple mistyping of a variable name in one program step. Another common source of error, one that is harder to recognize, is a computer program’s failure to distinguish properly between negative responses and responses that are truly “indeterminate” (such as “don’t know” responses, and non-responses due to refusals to answer or the interviewer’s inadvertent skipping over a question). The result is underestimation of prevalence because the denominator is inflated with cases where there are indeterminate responses. Such cases should have been dropped from consideration in assessing validity because it cannot be determined whether they do or do not fulfill diagnostic criteria.
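The distinction between negative and indeterminate responses can be made concrete with a minimal sketch. It assumes a hypothetical coding in which each symptom answer is `True`, `False`, or `None` (indeterminate) and a diagnosis that requires a fixed number of positive symptoms; the function names are illustrative and are not taken from any actual DIS or CIDI scoring program.

```python
from typing import Optional, Sequence

# Hypothetical coding: True = symptom present, False = symptom absent,
# None = indeterminate ("don't know", refusal, or a skipped question).
Response = Optional[bool]


def meets_threshold(responses: Sequence[Response], required: int) -> Optional[bool]:
    """Return True/False when the outcome is certain, None when it is not."""
    positives = sum(1 for r in responses if r is True)
    unknowns = sum(1 for r in responses if r is None)
    if positives >= required:
        return True    # met regardless of how the unknowns would have been answered
    if positives + unknowns < required:
        return False   # cannot be met even if every unknown were positive
    return None        # outcome depends on answers we do not have


def prevalence(diagnoses: Sequence[Optional[bool]]) -> Optional[float]:
    """Prevalence among determinate cases only; indeterminate cases are dropped."""
    determinate = [d for d in diagnoses if d is not None]
    if not determinate:
        return None
    return sum(1 for d in determinate if d) / len(determinate)


def naive_prevalence(diagnoses: Sequence[Optional[bool]]) -> float:
    """The error described above: indeterminate cases stay in the denominator."""
    return sum(1 for d in diagnoses if d is True) / len(diagnoses)
```

With one positive, one negative, and two indeterminate cases, `prevalence` gives 0.5, while `naive_prevalence` gives 0.25, because the two indeterminate cases are silently counted as negatives.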

Computer programs used to analyze interview and questionnaire data are so complex that it is virtually impossible to write a program that is error-free. Some errors are found in the course of using the programs, but others may remain undetected. The best way to check the computer algorithms used to score questions and combine them into a diagnosis is to use independently written computer programs based on the same edition of the Manual to analyze the same set of interviews. Computers can be made to randomly generate logically consistent responses, creating very large sets of artificial interviews that contain examples of all, or nearly all, possible response patterns (Marcus and Robins, in press). If the results of applying the two independently written diagnostic scoring programs to these “interviews” are diagnoses that disagree, then one or both of the programs are invalid. When all the steps in constructing a diagnosis are saved by the program, it is possible to locate the step that caused the two programs to diverge. Remarkably, there is no literature suggesting that such systematic testing of diagnostic scoring programs has ever been tried.
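The double-scoring check described above might be sketched as follows. Everything here is an illustrative assumption: the four symptom items, the threshold of three, and the two toy scoring programs (one seeded with a deliberate bug) stand in for real, independently written diagnostic programs. The sketch also draws answers at random without enforcing the skip-pattern consistency that real artificial interviews would need.

```python
import random
from typing import Dict, List, Optional, Tuple

SYMPTOMS = ["s1", "s2", "s3", "s4"]             # hypothetical symptom items
REQUIRED = 3                                    # hypothetical symptom threshold
STEPS = ["positives", "unknowns", "diagnosis"]  # order in which steps are saved

Interview = Dict[str, Optional[bool]]
Trace = Dict[str, object]


def random_interview(rng: random.Random) -> Interview:
    """One artificial interview: each answer is yes, no, or indeterminate (None)."""
    return {s: rng.choice([True, False, None]) for s in SYMPTOMS}


def program_a(interview: Interview) -> Trace:
    """First scoring program: keeps indeterminate answers separate from negatives."""
    positives = sum(1 for s in SYMPTOMS if interview[s] is True)
    unknowns = sum(1 for s in SYMPTOMS if interview[s] is None)
    if positives >= REQUIRED:
        diagnosis: Optional[bool] = True
    elif positives + unknowns < REQUIRED:
        diagnosis = False
    else:
        diagnosis = None
    return {"positives": positives, "unknowns": unknowns, "diagnosis": diagnosis}


def program_b(interview: Interview) -> Trace:
    """Second scoring program, with a seeded bug for demonstration."""
    positives = sum(1 for v in interview.values() if v is True)
    unknowns = 0  # seeded bug: indeterminate answers are treated as negatives
    diagnosis = positives >= REQUIRED
    return {"positives": positives, "unknowns": unknowns, "diagnosis": diagnosis}


def first_divergence(a: Trace, b: Trace) -> Optional[str]:
    """Name of the first saved step at which the two programs differ."""
    for step in STEPS:
        if a[step] != b[step]:
            return step
    return None


def compare(n: int, seed: int) -> List[Tuple[Interview, Optional[str]]]:
    """Score n artificial interviews with both programs; collect disagreements."""
    rng = random.Random(seed)
    disagreements = []
    for _ in range(n):
        interview = random_interview(rng)
        a, b = program_a(interview), program_b(interview)
        if a["diagnosis"] != b["diagnosis"]:
            disagreements.append((interview, first_divergence(a, b)))
    return disagreements
```

Calling `compare(1000, seed=0)` returns the interviews on which the two toy programs reach different diagnoses, each paired with the first saved step where their traces part ways; in this sketch that step is always `"unknowns"`, which points directly at the seeded bug.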

Given the increasing number and sophistication of computer programs to generate and analyze interview data, there is much work to be done in this area. Until interview scoring algorithms as well as the interview questions are validated, claims for the validity of a diagnostic interview remain dubious.




American Psychiatric Association. (1980). Diagnostic and statistical manual: Version III. Washington, DC: Author.

American Psychiatric Association. (1987). Diagnostic and statistical manual: Version III-R. Washington, DC: Author.

American Psychiatric Association. (1994). Diagnostic and statistical manual: Version IV. Washington, DC: Author.

Goldberg, D., Kay, C., & Thompson, L. (1976). Psychiatric morbidity in general practice and the community. Psychological Medicine, 6, 565-569.

Helgason, T. (1964). Epidemiology of mental disorders in Iceland. Acta Psychiatrica Scandinavica, 40, (Suppl.), 173.

Helzer, J. E., Clayton, P. J., Pambakian, R., & Woodruff, R. A., Jr. (1978). Concurrent diagnostic validity of a structured psychiatric interview. Archives of General Psychiatry, 35, 849-853.

Kirkpatrick, P., & Michael, S. T. (1962). Study methods. In L. Srole, T. S. Langner, S. T. Michael, M. K. Opler, & T. A. C. Rennie. Mental health in the Metropolis. New York: McGraw Hill.

Leighton, A. H., Leighton, D. C., & Danley, R. A. (1966). Validity in mental health. Canadian Psychiatric Association Journal, 11, 167-178.

Marcus, S., & Robins, L. N. (in press). Detecting errors in a scoring program by double diagnosis of a computer-generated sample. Social Psychiatry and Psychiatric Epidemiology.

Robins, L. N., Helzer, J., Croughan, J., Williams, J., & Spitzer, R. L. (1980). The NIMH Diagnostic Interview Schedule: Version II Computer Programs. St. Louis, MO: Washington University, School of Medicine.

Robins, L. N., Cottler, L., Bucholz, K., & Compton, W. (1995). The Diagnostic Interview Schedule: Version IV. St. Louis, MO: Washington University, School of Medicine.

Spitzer, R. L., Williams, J. B. W., Gibbon, M., & First, M. B. (1989). Structured Clinical Interview for DSM-III-R. New York: New York State Psychiatric Institute.

Starr, S. A. (1950). The screening of psychoneurotics: Comparison of psychiatric diagnoses and test scores at all induction stations. In S. A. Stouffer, L. Guttman, E. A. Suchman, P. F. Lazarsfeld, S. A. Starr, & J. A. Clausen. Measurement and prediction. New York: John Wiley.

U.S. Department of Health, Education, and Welfare. (1969). Vital and Health Statistics. Data from the National Health Chronic Conditions Causing Activity Limitation. Washington, DC: Author.

Wing, J. K., & Sturt, E. (1978). The PSE-ID-CATEGO System supplementary manual. London, UK: MRC Social Psychiatry Unit.

World Health Organization. (1997). The Composite International Diagnostic Interview (CIDI 2.1). Geneva: Author.

World Health Organization. (1993). The ICD-10 Classification of Mental and Behavioral Disorders. Geneva: Author.

World Health Organization. (1995). Schedules for Clinical Assessment in Neuropsychiatry (SCAN). Geneva: Author.


Lee Robins is University Professor of Social Science and Professor of Social Science in Psychiatry at Washington University School of Medicine. She is the principal author of the Diagnostic Interview Schedule (DIS), and is on the Editorial Boards for the World Health Organization’s Composite International Diagnostic Interview (CIDI), which was based on the DIS, and the Diagnostic Interview for Children (DISC). These instruments have been translated into many languages and are used in large epidemiologic studies both in the United States and abroad. As President of the American Psychopathological Association in 1987-88, she chose validity as the topic of the annual meeting, and (with James Barrett) published papers presented there in The Validity of Psychiatric Diagnoses (New York, Raven Press, 1989). In addition to her work in the area of standardized psychiatric diagnostic instruments, she has done research on conduct, antisocial personality, and substance abuse.

