Personality/Psychopathology Measurement and IRT: promising opportunities

Explore the potential of item response theory (IRT) in improving understanding of personality questionnaires, with a focus on applications in clinical psychology and personnel selection. Discuss the advantages of IRT over classical test theory (CTT) and the challenges in implementing IRT in personality assessment.

lewisk

  1. Personality/Psychopathology Measurement and IRT: promising opportunities • Rob Meijer

  2. Personality Assessment • Diagnosis of personality and personality disorders requires an evaluation of the individual • Self-report and peer-report questionnaires are often used to determine personality traits • Contexts: health care, clinical psychology, personnel selection and development

  3. Topic • How can item response theory (IRT) improve understanding of personality questionnaires? • Discuss several applications and what I have learned from collaborating with clinical, personality, and I/O psychologists • There are not enough research projects that convincingly communicate the relative superiority of the IRT approach in the personality domain

  4. Topic • IRT is mainly applied in educational and cognitive measurement • The purpose of cognitive assessment is precise and valid scaling of individual differences • In (applied) personality assessment the purpose is test score interpretation and prediction of wide-ranging behavior • In cognitive assessment there is a large domain (e.g., spelling) from which items are sampled; in personality assessment many domains are restricted. There are only a limited number of indicators of, e.g., social introversion, friendliness, or narcissism.

  5. Topic • When IRT is transported from cognitive abilities into typical performance assessment, special issues and problems arise • E.g., limited indicators (items), underlying distribution not normal

  6. Personality Assessment • In 2002 and 2003, 20 of 39 research articles in JEM and 32 of 52 in APM involved IRT • 2 of 122 articles in Journal of Personality Assessment and 6 of 106 articles in Psychological Assessment included IRT • Partly due to different psychometrics prevalent in the two fields

  7. CTT • CTT scale construction: item difficulty, item discrimination, and reliability • Drawback: reliability and the standard error of measurement (SEM) are constant for all respondents

  8. IRT • IRT assumes that a person has a true location on a continuous latent dimension (theta). Theta is assumed to probabilistically cause how a person responds to an item • The equation that relates theta to the probability of endorsing an item is the item response function (IRF), shown here for dichotomous item scores

  9. IRT • Item difficulty (b) is the point on the latent variable scale where the probability of endorsing the item equals .50 • Item discrimination (a) is proportional to the slope of the IRF • Important feature: IRT estimates the joint relation between person properties and item properties • a is usually between .5 and 2.5 • b is usually between -2.5 (easy) and +2.5 (difficult)

  10. Item Response Functions (IRF)
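
The IRF described on the previous slides can be sketched in a few lines. The slides do not say which IRT model was used; this sketch assumes the two-parameter logistic (2PL) model, and the function name and the example parameter values are illustrative only.

```python
import math


def irf_2pl(theta, a, b):
    """Probability of endorsing a dichotomous item under the 2PL model.

    theta: person location on the latent trait
    a:     item discrimination (proportional to the slope of the IRF at b)
    b:     item difficulty (the theta at which the probability is .50)
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


# Hypothetical item with a = 1.5 and b = 0.5, evaluated over a range of theta.
for theta in (-2.0, -1.0, 0.0, 0.5, 1.0, 2.0):
    print(f"theta = {theta:+.1f}  P(endorse) = {irf_2pl(theta, 1.5, 0.5):.3f}")
```

At theta = b the function returns exactly .50, which matches the definition of item difficulty given above; a larger a makes the curve steeper around that point.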

  11. IRT Assumptions • Unidimensionality • Monotonic relation between trait level and probability of endorsing an item • Statistical evaluation of model-to-data goodness of fit

  12. Item and scale analysis • CTT: item discrimination, item difficulty, reliability • IRT: item analysis is done in a similar way but item discrimination, difficulty, and reliability are examined in a more powerful way • Instead of test reliability, item information plays an important role

  13. Item and Test information • Information indicates how well an item discriminates among respondents who are at different levels of the latent variable • Items provide different amounts of information at different ranges of the latent variable • (1) Item information is additive across items: test information function • (2) information is inversely related to the SEM

  14. Item and scale analysis

  15. Item and Test information • The amount of information an item provides is determined by the item discrimination • The location on the latent trait where information is maximized is determined by the item difficulty
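
The relations between item information, test information, and the SEM can be made concrete with a short sketch. It again assumes the 2PL model (not stated on the slides), for which item information is a² · P · (1 − P); the function names and the three example items are hypothetical.

```python
import math


def irf_2pl(theta, a, b):
    """2PL probability of endorsing an item (assumed model)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta, a, b):
    """2PL item information: a^2 * P * (1 - P); it peaks at theta = b."""
    p = irf_2pl(theta, a, b)
    return a * a * p * (1.0 - p)


def scale_information(theta, items):
    """Test information: item information is additive across items."""
    return sum(item_information(theta, a, b) for a, b in items)


def sem(theta, items):
    """Standard error of measurement: 1 / sqrt(test information)."""
    return 1.0 / math.sqrt(scale_information(theta, items))


# Hypothetical (a, b) pairs for a three-item scale.
items = [(1.5, -1.0), (1.0, 0.0), (2.0, 1.0)]
for theta in (-2.0, 0.0, 2.0):
    print(f"theta = {theta:+.1f}  info = {scale_information(theta, items):.3f}"
          f"  SEM = {sem(theta, items):.3f}")
```

Unlike the single SEM of CTT, the SEM here changes with theta: it is smallest where the items' difficulties are located and grows in trait ranges that the items do not cover.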

  16. Item Information

  17. Item Information

  18. Polytomous scores • Graded response model (GRM) for Likert data • a-parameter: its magnitude reflects the degree to which the item is related to the trait • Two or more location parameters, b1, b2, ... (their number equals the number of response categories minus one); they reflect the spacing of the response categories along the trait scale • Thus for m = 5 answer categories there are four location parameters: b1, b2, b3, and b4
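
A minimal sketch of how the GRM turns one a-parameter and m − 1 location parameters into m category probabilities. It follows Samejima's standard formulation, in which each location parameter defines a cumulative "this category or higher" curve; the function name and the parameter values are hypothetical and not taken from the slides.

```python
import math


def grm_category_probs(theta, a, thresholds):
    """Category probabilities under the graded response model.

    thresholds: location parameters b1 < b2 < ... (m - 1 values for
    m response categories), assumed to be in increasing order.
    """
    # Cumulative curves: P(response in category k or higher), k = 1..m-1.
    cumulative = [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
    # Pad with P(lowest category or higher) = 1 and P(above highest) = 0.
    bounds = [1.0] + cumulative + [0.0]
    # Each category probability is the difference of adjacent cumulative curves.
    return [bounds[k] - bounds[k + 1] for k in range(len(bounds) - 1)]


# Hypothetical 5-category Likert item: one a and four location parameters.
probs = grm_category_probs(theta=0.0, a=1.5, thresholds=[-2.0, -1.0, 0.5, 1.5])
print([round(p, 3) for p in probs])
```

Plotting these five probabilities against theta gives the option response curves; widely spaced location parameters spread the curves out along the trait scale.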

  19. Depression items • Item 2: I have recently considered killing myself • Item 3: I am sometimes down in the dumps

  20. Option response curves

  21. Example 1: Construct validity of clinical scales • Can we use scales as a diagnostic instrument to classify persons into different categories? • In clinical psychology/psychiatry many rating scales are constructed so that they cover DSM-IV(TR) categories. On the basis of a scale, a person is classified into categories such as no versus mild versus severe mental illness • Because diagnostic criteria influence how psychiatric disorders are recognized, researched, and treated, it is very important to ensure their empirical validity

  22. Practical Features • Clinical change (degree of change within the individual): to measure this, the scale should discriminate in the trait range of interest • Scale should be discriminating around cut-off scores • Diagnostic Interview-Expanded Substance Scale • IRT analysis to investigate the quality of the scale • Can the scale be used as a diagnostic instrument?

  23. Alcohol use disorder (Langenbucher et al., 2004)

  24. Cocaine use disorder

  25. Conclusion • Dense clustering of symptom item response functions implies that a number of criteria (items) of substance abuse carry the same information • Measurement is precise in only a narrow trait range • Trichotomous diagnostic scheme of the DSM-IV (undiagnosed, dependence, abuse) is not supported; only impaired/less impaired can be distinguished

  26. Conclusion • Additional severe criteria (items) are needed to reliably and broadly identify serious degrees of addictive pathology • Additional mild criteria are needed for screening and prevention and for establishing base rates (epidemiology) • But are constructs fully continuous? And can we find measures (items) across the entire range?

  27. Quasi-traits • Researchers often assume that all constructs are fully continuous, defined at both ends of the construct • IRT modeling shows that many personality constructs used in clinical scales (psychopathology) are highly skewed or quasi-traits • For example, self-esteem

  28. Quasi traits • One explanation is that this is not due to poor items or options but due to the nature of the self-esteem construct; items only differentiate between people with low self-esteem because this is the only end of the construct that is meaningful • Future research should clarify whether we can write items that also discriminate at the medium levels of the latent trait

  29. Example 2: Type D personality • What is the effect of narrow-band constructs combined with limited item pools on the construct validity of our scales? • When only a few items have high slopes and the remainder have low slopes, care should be taken in interpreting the latent trait.

  30. Context • Influence of psychological factors on health, illness, and death • Psychosomatic research on cardiac disease needs to include personality • Distress as a risk factor • High levels of distress are linked to anxiety, stress, and anger → vital exhaustion • DS-14: 7 items Negative Affect + 7 items Social Inhibition • Type D: score above the median on both scales; increased risk

  31. Example 2: Type D • Negative Affect (NA): tendency to experience aversive emotional states with feelings of dysphoria, tension, and worry (α = .88; factor loadings .6-.8) • Social Inhibition (SI): tendency to inhibit self-expression in social interactions in order to avoid disapproval by others (α = .86; factor loadings .6-.8) (Emons, Meijer, Denollet, 2006)

  32. Example 2: Type D • Variable pattern of slopes may be problematic • (1) The dysphoria items NA7, NA4, and NA2 dominate the construct; the remaining items are less important • (2) A practitioner should be very careful in interpreting the underlying construct: NA = dysphoria, and in particular “I am often down in the dumps” • (3) The latent trait does not reflect variance on a common latent variable shared by the other items, but reflects individual differences on the items with the highest slopes

  33. Example 3: Validity of test scores • Test score validity: validity scales, e.g., the F-scale in the MMPI, consist of items endorsed infrequently in the normal population; high scores invalidate the interpretation of the MMPI • Can we identify and interpret invalid test scores by studying the configuration of individual item scores by means of fit statistics proposed in the context of item response theory (IRT)? (Meijer, Egberink, Emons, Sijtsma, 2008)

  34. Context • On the basis of an IRT model observed and expected item scores can be compared and many unexpected item scores alert the researcher that the total score may not adequately reflect the trait being measured. • Gap between psychometric characteristics of several statistical tests and measures on the one hand and the articles that describe the practical usefulness of these measures on the other hand.

  35. Context • Try to integrate psychometric analysis with information from qualitative sources to make judgments about the validity of an individual’s test score. And replication !! • Explore the usefulness of person-fit statistics to identify invalid test scores using real data, and • Validate information obtained from IRT using personality theory and qualitative data obtained from observation and interviews

  36. Rationale of the method • When measuring, e.g., depression (suicidal ideation), every person who endorses the statement “I have recently considered killing myself” is expected to also endorse the statement “I don’t seem to care what happens to me” (relative to the previous item, this item is less extreme, or more popular) • However, in practice, when analyzing personality data, “errors” against this perfect pattern are found • Many errors may point to invalid person scaling

  37. Fit statistics • 0100100000000001001011001010010100001100 X+ = 12 • 1001000010110010111111000000000000000000 X+ = 12 • 0101110111001010001011110001011111000000 X+ = 20 • 1111110111111111101101000100000000000000 X+ = 20 • Many statistics exist; we used several statistical tests, among them normed Guttman errors (ZGE)
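
The idea behind Guttman errors can be sketched as follows. The sketch assumes that the items in each pattern are ordered from most to least popular, and it norms the error count by its maximum r · (k − r) for a person with r endorsements on k items. The slides do not give the exact ZGE formula, so this is an illustration of the principle rather than the statistic used in the study; the function names are hypothetical.

```python
def guttman_errors(pattern):
    """Count item pairs in which a more popular item is not endorsed (0)
    while a less popular item is endorsed (1).

    pattern: list of 0/1 scores, assumed ordered from most to least popular.
    """
    errors = 0
    zeros_seen = 0
    for score in pattern:
        if score == 1:
            # Every earlier (more popular) unendorsed item forms one error pair.
            errors += zeros_seen
        else:
            zeros_seen += 1
    return errors


def normed_guttman_errors(pattern):
    """Guttman errors divided by the maximum possible number, r * (k - r)."""
    k = len(pattern)
    r = sum(pattern)
    max_errors = r * (k - r)
    if max_errors == 0:  # all items or no items endorsed: no error is possible
        return 0.0
    return guttman_errors(pattern) / max_errors


# The first two patterns from the slide; both have total score X+ = 12.
for text in ("0100100000000001001011001010010100001100",
             "1001000010110010111111000000000000000000"):
    scores = [int(ch) for ch in text]
    print(sum(scores), guttman_errors(scores), round(normed_guttman_errors(scores), 3))
```

Two respondents with the same total score can therefore differ strongly in fit: the more the endorsements are concentrated on the popular items, the fewer errors are counted.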

  38. Data • Harter’s Self-Perception Profile for Children (SPPC), polytomous item scores (4 point scale) • Intended to determine how children between 8 and 12 years of age judge their own functioning in several specific domains and how they judge their global self-worth • 6 subscales each consisting of 6 items: Scholastic Competence (SC), Social Acceptance (SA), Athletic Competence (AC), Physical Appearance (PA), Behavioral Conduct (BC), Global Self-worth (GS)

  39. Procedure • 611 children between 6 and 12 years of age • Inspection of model fit • Calculation of person-fit statistics • Interviewing teachers, and children, and observation of children • Re-administration of the SPPC

  40. Results • In general, young children (8/9 years of age) scored less consistently than older children • Asking children to select the personality statements that best describe them may be relatively complex, especially for young children • They should understand the meaning of these statements and they should also have a frame of reference similar to that of older children. We observed that the meaning of some items was problematic, and that inconsistent answering behavior was often due to learning disability

  41. Results • Older children more often than younger children choose the categories 2 and 3. • Older girls more often than older boys choose the 2 and 3 options. We speculate that these shifts point at a more differentiated self-concept for older children as compared to young children and at a more differentiated self-concept for girls than boys

  42. Profiles • Similar Profiles with different item score patterns

  43. Profiles • Child 275: very inconsistent item score pattern (SC:422124, SA:444414, AC:411444, PA:313414, BC:124443, GS:344143) • Child 94: consistent (SC:112422, SA:443423, AC:444322, PA:222242, BC:333333, GS:424233) • Child 242: consistent (SC:222232, SA:443333, AC:333343, PA:322233, BC:433223, GS:343333)

  44. Re-administration • As expected, the ZGE scores collected at the second administration were lower than the ZGE scores collected at the first administration • 8 out of the 27 children again produced irregular item score patterns • For 4 children this was due to cognitive problems: learning disability, problems with reading comprehension skills, and/or lexical processing speed • For 2 other children this may be due to the home situation: they come from troubled homes, they have difficult relations with their parents and, perhaps as a result of this, they are very insecure.

  45. Conclusions • In clinical practice and applied research, the fundamental question often is not whether unexpected item score patterns exist but whether the patterns have any theoretical or applied validity • Because nothing in a (statistical) fit procedure guarantees that identified patterns are associated with external criteria or diagnostic categories, it is important to use information from other sources. Thus, one may combine information from fit statistics with information obtained from other subtest scores (score profiles), interviews, and/or observation.
