
Presentation Transcript


  1. THE GOOD, THE BAD AND THE UGLY. F. Kaftandjieva

  2. F. Kaftandjieva

  3. CEFR

  4. Bad Practice vs. Good Practice

  5. Terminology: Alignment, Anchoring, Calibration, Projection, Scaling, Comparability, Concordance, Linking, Benchmarking, Prediction, Equating, Moderation

  6. Milestones in Comparability. 1904: Spearman, “The proof and measurement of association between two things.”

  7. Milestones in Comparability. 1951: Flanagan, “Scores on two or more tests may be said to be comparable for a certain population if they show identical distributions for that population.”

  8. Milestones in Comparability. 1971: Angoff, ‘Scales, norms, and equivalent scores’: Equating, Calibration, Comparability.

  9. Milestones in Comparability. 1992/1993: Linking (Mislevy, Linn).

  10. Milestones in Comparability. 1997/2001: Alignment (Webb, Porter).

  11. Alignment • Alignment refers to the degree of match between test content and the standards • Dimensions of alignment: Content, Depth, Emphasis, Performance, Accessibility

  12. Alignment • Alignment is related to content validity • Specification (Manual – Ch. 4) • “Specification … can be seen as a qualitative method. … There are also quantitative methods for content validation but this manual does not require their use.” (p. 2) • 24 pages of forms • Outcome: “A chart profiling coverage graphically in terms of levels and categories of CEF.” (p. 7) • Why? How? • Crocker, L. et al. (1989). Quantitative Methods for Assessing the Fit Between Test and Curriculum. Applied Measurement in Education, 2(2), 179-194.

  13. Alignment (Porter, 2004): alignment index = 0.235. www.ncrel.org
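
The 0.235 figure is an alignment index in Porter's sense: one minus half the total absolute discrepancy between the content proportions of the test and of the standards. A minimal sketch in Python, assuming both documents have already been summarized as proportion vectors over the same content matrix (function name and data are illustrative, not from the slides):

    def porter_alignment_index(test_props, standards_props):
        """Porter (2004) alignment index: 1 - (sum of |x_i - y_i|) / 2.

        1.0 means identical content distributions; 0.0 means no overlap.
        Both inputs are proportions and must each sum to 1.
        """
        discrepancy = sum(abs(x - y) for x, y in zip(test_props, standards_props))
        return 1.0 - discrepancy / 2.0

    # Hypothetical four-cell content matrix, flattened to proportions:
    print(porter_alignment_index([0.40, 0.30, 0.20, 0.10],
                                 [0.25, 0.25, 0.25, 0.25]))  # -> 0.8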

  14. Milestones in Comparability, revisited: Linking (Mislevy, Linn).

  15. Mislevy & Linn: Linking Assessments. Equating vs. Linking.

  16. The Good & The Bad in Calibration

  17. Model – Data Fit

  18. Model – Data Fit

  19. Model – Data Fit: Reality vs. Models

  20. Sample-Free Estimation
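
“Sample-free” estimation refers to the invariance property of the Rasch model used for calibration: in theory, item difficulty estimates do not depend on the ability distribution of the calibration sample. A minimal sketch of the Rasch item response function (illustrative only, not from the slides):

    import math

    def rasch_p(theta, b):
        """Probability of a correct response under the Rasch model.

        Depends only on the difference between person ability (theta)
        and item difficulty (b), both expressed on the same logit scale.
        """
        return 1.0 / (1.0 + math.exp(-(theta - b)))

    # The probability is the same wherever the pair sits on the scale,
    # as long as theta - b is the same:
    print(rasch_p(theta=1.0, b=0.0))  # ~0.73
    print(rasch_p(theta=2.5, b=1.5))  # ~0.73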

  21. The ruler (θ scale)

  22. The ruler (θ scale)

  23. The ruler (θ scale)

  24. The ruler (θ scale): absolute zero, boiling water

  25. The ruler (θ scale): F° = 1.8 * C° + 32; C° = (F° – 32) / 1.8
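
The temperature formulas make the point of the ruler analogy: two interval scales measuring the same property differ only by a linear transformation, and the same holds for two θ scales that have been linked. A short sketch; the linking constants A and B are hypothetical (e.g. as produced by mean/sigma linking), not values from the slides:

    def c_to_f(c):
        """F = 1.8 * C + 32, as on the slide."""
        return 1.8 * c + 32

    def f_to_c(f):
        """C = (F - 32) / 1.8, the inverse transformation."""
        return (f - 32) / 1.8

    def link_theta(theta_x, A=1.1, B=-0.2):
        """Map a theta from scale X onto scale Y: theta_y = A * theta_x + B."""
        return A * theta_x + B

    print(c_to_f(100))  # 212.0 -> boiling water reads high on both rulers
    print(f_to_c(32))   # 0.0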

  26. Mislevy & Linn: Linking Assessments

  27. Standard Setting

  28. The Ugly

  29. Fact 1: Human judgment is the epicenter of every standard-setting method (Berk, 1995).

  30. When Ugliness turns to Beauty

  31. When Ugliness turns to Beauty

  32. Fact 2: The cut-off points on the latent continuum do not possess any objective reality outside and independently of our minds. They are mental constructs, which can differ from person to person.

  33. Consequently: Whether the levels themselves are set at the proper points is a most contentious issue and depends on the defensibility of the procedures used for determining them (Messick, 1994).

  34. Defensibility: Claims vs. Evidence

  35. Defensibility: Claims vs. Evidence (A2) • National Standards: “Understands manuals for devices used in their everyday life.” • CEF – A2: “Can understand simple instructions on equipment encountered in everyday life – such as a public telephone.” (p. 70)

  36. Defensibility: Claims vs. Evidence • Cambridge ESOL • DIALANG • Finnish Matriculation • CIEP (TCF) • CELI Università per Stranieri di Perugia • Goethe-Institut • TestDaF Institut • WBT (Zertifikat Deutsch) • 75% of the institutions provide only claims about items’ CEF levels.

  37. Defensibility: Claims vs. Evidence • Common Practice (Buckendahl et al., 2000) • External evaluation of the alignment of 12 tests by 2 publishers • Publisher reports: no description of the exact procedure followed; reports include only the match between items and standards • Evaluation study: at least 10 judges per test • Comparison results: % of agreement 26%-55%; overestimation of the match by test publishers.
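
The 26%-55% figures are plain percent-agreement statistics between the publishers' claimed item-standard matches and the external panel's judgments. A minimal sketch with hypothetical classifications (the data below are invented for illustration):

    def percent_agreement(claims, judgments):
        """Share of items (in %) on which two classifications coincide."""
        matches = sum(c == j for c, j in zip(claims, judgments))
        return 100.0 * matches / len(claims)

    # Hypothetical CEF levels claimed by a publisher vs. judged by a panel:
    publisher = ["A2", "B1", "B1", "B2", "A2", "B1"]
    panel     = ["A2", "A2", "B1", "B1", "A2", "B2"]
    print(percent_agreement(publisher, panel))  # 50.0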

  38. Standards for Educational and Psychological Testing, 1999. Standard 1.7: “When a validation rests in part on the opinions or decisions of expert judges, observers or raters, procedures for selecting such experts and for eliciting judgments or ratings should be fully described. The description of procedures should include any training and instruction provided, should indicate whether participants reached their decisions independently, and should report the level of agreement reached. If participants interacted with one another or exchanged information, the procedures through which they may have influenced one another should be set forth.”

  39. Evaluation Criteria. Hambleton, R. (2001). Setting Performance Standards on Educational Assessments and Criteria for Evaluating the Process. In: Cizek, G. (Ed.), Setting Performance Standards: Concepts, Methods and Perspectives. Lawrence Erlbaum Ass., 89-116. • A list of 20 questions as evaluation criteria • Planning & Documentation: 4 (20%) • Judgments: 11 (55%) • Standard Setting Method: 5 (25%)

  40. Judges • Because standard-setting inevitably involves human judgment, a central issue is who is to make these judgments, that is, whose values are to be embodied in the standards (Messick, 1994).

  41. Selection of Judges • The judges should have the right qualifications, but some other criteria, such as occupation, working experience, age and sex, may be taken into account, because ‘… although ensuring expertise is critical, sampling from relevant different constituencies may be an important consideration if the testing procedures and passing scores are to be politically acceptable’ (Maurer & Alexander, 1992).

  42. Number of Judges • Livingston & Zieky (1982) suggest that the number of judges be not less than 5. • Based on court cases in the USA, Biddle (1993) recommends 7 to 10 Subject Matter Experts to be used in the judgment session. • As a general rule, Hurtz & Hertz (1999) recommend 10 to 15 raters to be sampled. • 10 judges is a minimum number, according to the Manual (p. 94).

  43. Training Session • The weakest point • How much? Until it hurts (Berk, 1995) • Main focus: intra-judge consistency • Evaluation forms (Hambleton, 2001) • Feedback

  44. Training Session: Feedback Form

  45. Training Session: Feedback Form

  46. Standard Setting Method • Good Practice • The most appropriate • Due diligence • Field tested • Reality check • Validity evidence • More than one

  47. Standard Setting Method • Probably the only point of agreement among standard-setting gurus is that there is hardly any agreement between results of any two standard-setting methods, even when applied to the same test under seemingly identical conditions (Berk, 1995).

  48. He that increaseth knowledge increaseth sorrow (Ecclesiastes 1:18). Examinee-centered methods vs. test-centered methods: where to set the B1/B2 cut-off?

  49. He that increaseth knowledge increaseth sorrow (Ecclesiastes 1:18). Test-centered methods vs. examinee-centered methods: where to set the B1/B2 cut-off?

  50. Instead of Conclusion • In sum, it may seem that providing valid grounds for valid inferences in standards-based educational assessment is a costly and complicated enterprise. But when the consequences of the assessment affect accountability decisions and educational policy, this needs to be weighed against the costs of uninformed or invalid inferences (Messick, 1994). • Butterfly Effect: change one thing, change everything!
