
Applications of IRT Models


abeni

Presentation Transcript


  1. Applications of IRT Models DIF and CAT

  2. Which of these is the situation of a biased test? • The average score for males and females on an item is not the same. • The correlation between males’ scores on an item is stronger than that for the females’ scores. • A group of males and females with exactly the same ability achieve different scores on an item.

  3. Disentangling the Terminology • Item impact • Item impact is evident when examinees from different groups have differing probabilities of responding correctly to (or endorsing) an item because there are true differences between the groups in the underlying ability being measured by the item. • DIF • The differential probability of a correct response for examinees at the same trait level but from different groups. • DIF occurs when examinees from different groups show differing probabilities of success on (or endorsing) the item after matching on the underlying ability that the item is intended to measure. • Item bias • Item bias occurs when examinees of one group are less likely to answer an item correctly (or endorse an item) than examinees of another group because of some characteristic of the test item or testing situation that is not relevant to the test purpose. • Adverse Impact • Adverse impact is a legal term describing the situation in which group differences in test performance result in disproportionate examinee selection or related decisions (e.g., promotion). This is not evidence for test bias.

  4. No DIF

  5. There are two types of DIF • Uniform DIF • One group (here, the reference group) has a higher probability of a correct response than the other (focal) group at every ability level. • Non-uniform DIF • The direction of the advantage in one group’s likelihood of a correct response changes in different regions of the ability scale.

  6. Uniform DIF

  7. Non uniform DIF

  8. Differential Test Functioning

  9. Relationship between IRT and CTST models • A relationship has been shown between the 2PL normal ogive IRT model and the single-factor FA model (Lord & Novick, 1968) • The b-parameter equals the threshold parameter divided by the item factor loading • The discrimination parameter equals the factor loading divided by the square root of the item's uniqueness (one minus the communality) • Highly discriminating items will have high factor loadings
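
The conversion on slide 9 can be sketched numerically. This is a minimal illustration, assuming a standardized single-factor model for a dichotomous item and the standard normal-ogive result a = loading / sqrt(1 - loading^2), b = threshold / loading. The function name and the example values are hypothetical.

```python
import math

def fa_to_irt(loading, threshold):
    """Convert single-factor parameters to normal-ogive 2PL parameters.

    Assumes a standardized latent response (unit variance), so the
    item's uniqueness is 1 - loading**2.
    """
    uniqueness = 1.0 - loading ** 2
    a = loading / math.sqrt(uniqueness)  # discrimination
    b = threshold / loading              # difficulty
    return a, b

# Made-up loadings: discrimination grows quickly as the loading rises.
for lam in (0.3, 0.5, 0.7, 0.9):
    a, b = fa_to_irt(lam, threshold=0.35)
    print(f"loading={lam:.1f}  a={a:.3f}  b={b:.3f}")
```

The loop shows the last bullet in numbers: a larger factor loading always gives a larger discrimination parameter.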

  10. Examining Measurement Invariance in CTST • Examining factorial invariance • Configural invariance • Zero and non-zero loading patterns are the same across groups • Pattern (metric) invariance • The factor loadings are equal across groups • Scalar (strong) invariance • The factor loadings and intercepts are equal across groups • Any group differences in means can be attributed to the common factors, which allows for meaningful group mean comparisons • Strict invariance • Factor loadings, intercepts, and unique variances are equal across groups • Any systematic differences in group means, variances, or covariances are due to the common factors

  11. Examining DIF in IRT • IRT tests of DIF examine whether the item response curve (IRC) is the same for the reference group as it is for the focal group. • The focal group is the smaller group in question (the minority group). • The reference group is the larger group that generally has the established parameters. • If the curves differ, the probability of a correct response for an individual with ability x in one group differs from that for an individual with the same ability x in the other group. • DTF (differential test functioning) refers to a difference in the test characteristic curves, obtained by summing the item response functions for each group. • DTF is perhaps more important for selection because decisions are made based on test scores, not individual item responses.
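
The idea that a test characteristic curve is the sum of the item response functions can be sketched as follows. This is only an illustration, assuming 2PL logistic items with scaling constant D = 1.7 and parameters already on a common scale. All item parameters are made up.

```python
import math

def irf_2pl(theta, a, b, D=1.7):
    """Probability of a correct response under a 2PL logistic model."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def tcc(theta, items):
    """Test characteristic curve: expected number-correct score at theta."""
    return sum(irf_2pl(theta, a, b) for a, b in items)

# Hypothetical (a, b) pairs; the second item is harder for the focal group.
reference = [(1.0, -0.5), (1.2, 0.0), (0.8, 0.6)]
focal = [(1.0, -0.5), (1.2, 0.4), (0.8, 0.6)]

for theta in (-2, -1, 0, 1, 2):
    gap = tcc(theta, reference) - tcc(theta, focal)
    print(f"theta={theta:+d}  TCC difference={gap:.3f}")
```

A nonzero difference at a given ability level means that examinees of equal ability have different expected test scores in the two groups.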

  12. Procedures for Detecting DIF/DTF • Parametric Procedures • Compare item parameters from two groups of examinees • Lord’s Chi-Square • Likelihood Ratio Test • Compare IRFs from two groups of examinees by measuring areas between them • Raju’s Area Measures

  13. Likelihood Ratio Test • The test statistic is distributed as a chi-square with degrees of freedom equal to the difference in the number of parameters estimated in the compact and the augmented model • The compact model assumes item parameters are the same for both groups • The augmented model constrains the parameters of the anchor items to be equal across groups, but allows items of interest to have parameters that vary across groups
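
The model comparison can be sketched as below. It assumes the two log-likelihoods have already been obtained from an IRT calibration program. The function names and numeric values are hypothetical, and the chi-square tail probability is computed with a closed-form recurrence for integer degrees of freedom.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{j < df/2} (x/2)^j / j!
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= half / j
            total += term
        return math.exp(-half) * total
    # Odd df: start from the df = 1 tail and add one term per two df.
    total = math.erfc(math.sqrt(half))
    for j in range(1, (df - 1) // 2 + 1):
        total += math.exp(-half) * half ** (j - 0.5) / math.gamma(j + 0.5)
    return total

def lr_dif_test(loglik_compact, loglik_augmented, n_freed):
    """G^2 statistic and p-value; n_freed = extra parameters in the augmented model."""
    g2 = -2.0 * (loglik_compact - loglik_augmented)
    return g2, chi2_sf(g2, n_freed)

# Made-up log-likelihoods; freeing a and b for one studied item gives df = 2.
g2, p = lr_dif_test(-5123.4, -5118.1, n_freed=2)
print(f"G2={g2:.2f}  p={p:.4f}")
```

A small p-value suggests that letting the studied item's parameters differ across groups improves fit, which is the evidence of DIF.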

  14. Raju’s Area Measures • Signed and unsigned areas • Indicate the area between two IRCs • Require separate calibrations of the item parameters in each group, followed by a linear transformation to put them on the same scale
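
The signed and unsigned areas can be approximated numerically. This sketch is not Raju's closed-form solution: it uses a simple midpoint rule over a finite ability range, assumes 2PL logistic items with D = 1.7, and assumes the two sets of parameters are already on a common scale. All values are made up.

```python
import math

def irf_2pl(theta, a, b, D=1.7):
    """Probability of a correct response under a 2PL logistic model."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def area_measures(ref, focal, lo=-8.0, hi=8.0, steps=3200):
    """Approximate signed and unsigned areas between two IRCs (midpoint rule)."""
    width = (hi - lo) / steps
    signed = unsigned = 0.0
    for i in range(steps):
        theta = lo + (i + 0.5) * width
        diff = irf_2pl(theta, *ref) - irf_2pl(theta, *focal)
        signed += diff * width
        unsigned += abs(diff) * width
    return signed, unsigned

# Uniform DIF: same a, different b. The curves never cross.
print(area_measures((1.0, 0.0), (1.0, 0.5)))
# Non-uniform DIF: same b, different a. The curves cross at theta = b.
print(area_measures((1.5, 0.0), (0.7, 0.0)))
```

When the curves cross, positive and negative regions cancel in the signed area, so the unsigned area is the one that reveals non-uniform DIF.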

  15. Procedures for Detecting DIF/DTF • Nonparametric Procedures • Bivariate frequencies between item responses and group memberships, conditional on levels of the ability or trait estimate • Simultaneous Item Bias Test (SIBTEST) • Mantel-Haenszel (MH) • Logistic Regression

  16. Procedures for Detecting DIF/DTF • Simultaneous Item Bias Test (SIBTEST) • Examinees are matched on an estimated true score • Creates a weighted mean difference between the reference and focal groups, which is then tested statistically • The means are adjusted to correct for differences in the ability distributions with a regression correction procedure • Some research has examined how the Type I error rates of this procedure change when the percentage of DIF items is large

  17. SIBTEST

  18. Mantel-Haenszel (MH) • Compares the item performance of two groups that were previously matched on the ability scale • Total test score can be used as the matching variable • For each item, K 2x2 contingency tables are made, one for each of K ability levels • DIF is shown if the odds of correctly answering the item at a given score level differ between the two groups

  19. Mantel-Haenszel (MH)

  20. Mantel-Haenszel (MH) • Items are classified by the magnitude of the DIF statistic ΔαMH • Type A items – negligible DIF with |ΔαMH| < 1 • Type B items – moderate DIF with 1 <= |ΔαMH| <= 1.5, and the MH test is statistically significant • Type C items – large DIF with |ΔαMH| > 1.5
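
The MH computation and the A/B/C labels can be sketched as below. It assumes the common odds ratio is pooled over the K score-level tables and put on the ETS delta scale as -2.35 * ln(alpha). The counts are made up, the significance flag is supplied by the caller rather than computed, and treating non-significant items as Type A is an assumption following the usual ETS convention.

```python
import math

def mantel_haenszel(tables):
    """Common odds ratio and delta from K 2x2 tables.

    Each table is (ref_right, ref_wrong, focal_right, focal_wrong)
    for one matched score level.
    """
    num = den = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        if n == 0:
            continue  # skip empty score levels
        num += a * d / n
        den += b * c / n
    alpha = num / den
    delta = -2.35 * math.log(alpha)
    return alpha, delta

def ets_category(delta, significant):
    """Label an item A, B, or C from |delta| and a significance flag."""
    size = abs(delta)
    if size < 1.0 or not significant:
        return "A"
    if size <= 1.5:
        return "B"
    return "C"

# Hypothetical counts at three score levels.
tables = [(30, 20, 20, 30), (40, 10, 30, 20), (45, 5, 40, 10)]
alpha, delta = mantel_haenszel(tables)
print(f"alpha={alpha:.3f}  delta={delta:.3f}  type={ets_category(delta, True)}")
```

An odds ratio above 1 (negative delta) means the item favors the reference group after matching on score level.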

  21. Logistic Regression • If the group effect is significant and the interaction is not, then there is uniform DIF • If the interaction is significant, then there is non-uniform DIF • Conduct model comparisons by adding each successive model term

  22. Computerized Adaptive Testing (CAT) • To obtain equal precision of measurement to that of a linear test, but with greater efficiency. • Give people only the items that are informative about them. • Reduce testing time and opportunity for error.

  23. CAT System

  24. Issues of Research in a CAT system. • Early Issues • Precision of measurement • Estimation procedure, Prior estimates • Equivalence • Reliability of Estimate, Test Form Equivalence (Test Information), Testing Mode • Efficiency • Item selection methods, Test length • Newer Issues • Security • Item exposure • Testlet models

  25. Item Exposure and Item Selection Methods • Sympson-Hetter • Directly controls item exposure probabilistically • Places a filter between item selection and item administration • Items are administered below a prespecified maximum exposure rate • P(S) probability that an item is selected as the best item • P(A) probability that an item is administered • P(A|S) conditional probability that an item is administered given that it is selected • Item exposure parameter • P(A)=P(A|S)*P(S)<=rmax • P(A|S) is easy to determine if P(S) is known, but P(S) must be determined through an iterative process
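
The filter between selection and administration can be sketched as follows. This is a simplified illustration: the selection probabilities P(S) are supplied as made-up values, whereas in practice they come from repeated simulated administrations, with the adjustment step repeated until the exposure rates settle. All function and item names are hypothetical.

```python
import random

def update_exposure_parameters(p_selected, r_max):
    """One adjustment step: choose P(A|S) so that P(A|S) * P(S) <= r_max."""
    return {item: (r_max / ps if ps > r_max else 1.0)
            for item, ps in p_selected.items()}

def administer(ranked_items, p_admin_given_selected, rng):
    """Walk down the ranked candidates until one passes the exposure filter."""
    for item in ranked_items:
        if rng.random() < p_admin_given_selected[item]:
            return item
    return ranked_items[-1]  # assumption: fall back to the last candidate

# Made-up selection rates from a hypothetical simulation run.
p_selected = {"item_1": 0.50, "item_2": 0.10, "item_3": 0.25}
k = update_exposure_parameters(p_selected, r_max=0.20)
print(k)

rng = random.Random(42)
print(administer(["item_1", "item_3", "item_2"], k, rng))
```

Items that are rarely selected keep P(A|S) = 1, so only the most popular items are held back.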

  26. Item Exposure and Item Selection Methods • Conditional Sympson-Hetter or SLC (Stocking and Lewis, 1998) • SH controls item exposure for a population, but at various ability levels the exposure rates can be quite high • P(A|S) is determined at specific trait levels rather than across a population

  27. Item Exposure and Item Selection Methods • a-stratified design (STR CAT; Chang & Ying, 1996, 1999) • Partition the item pool into multiple levels and stages according to the discrimination parameters • Start with the less discriminating items • Then use a b-matching item selection procedure • This approach seems to improve item pool utilization and balance item exposure rates • It is less computationally complex • No other restrictions on item exposure are imposed
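
The stratify-then-match idea can be sketched as below. This is a minimal illustration, not the authors' implementation: the pool is split into equal-sized strata by ascending discrimination, and within a stratum the unused item with difficulty closest to the current ability estimate is chosen. The pool and all names are made up.

```python
def stratify_by_a(items, n_strata):
    """Split (item_id, a, b) tuples into strata ordered from low to high a."""
    ordered = sorted(items, key=lambda it: it[1])
    size, extra = divmod(len(ordered), n_strata)
    strata, start = [], 0
    for k in range(n_strata):
        end = start + size + (1 if k < extra else 0)
        strata.append(ordered[start:end])
        start = end
    return strata

def select_item(stratum, theta_hat, used):
    """b-matching: unused item whose difficulty is closest to theta_hat.

    Raises ValueError if every item in the stratum has been used.
    """
    candidates = [it for it in stratum if it[0] not in used]
    return min(candidates, key=lambda it: abs(it[2] - theta_hat))

# Hypothetical pool of (item_id, a, b).
pool = [("i1", 0.5, -1.0), ("i2", 1.8, 0.2), ("i3", 0.7, 0.4),
        ("i4", 1.2, -0.3), ("i5", 0.9, 1.1), ("i6", 1.5, 0.0)]
strata = stratify_by_a(pool, n_strata=3)

# Early stage: draw from the least discriminating stratum.
first = select_item(strata[0], theta_hat=0.3, used=set())
print(first)
```

Later stages would move on to strata[1] and strata[2], saving the most discriminating items for when the ability estimate is more stable.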
