Non-Linear Unidimensional Scale Trajectories Through Multidimensional Content Spaces. A Critical Examination of the Common Psychometric Claims of Unidimensionality, Linearity, and Interval-Level Measurement. Joseph Martineau Qi Diao Yang Lu Dipendra Subedi Samuel Drake Feng-Hsien Pang
Non-Linear Unidimensional Scale Trajectories Through Multidimensional Content Spaces A Critical Examination of the Common Psychometric Claims of Unidimensionality, Linearity, and Interval-Level Measurement Joseph Martineau Qi Diao Yang Lu Dipendra Subedi Samuel Drake Feng-Hsien Pang Kyle Ward Shu-Chuan Kao Tian Song Tianli Li Xin Li Yan Zheng Authorship is alphabetical after the first four
Author Affiliations • Michigan Department of Education • Joseph Martineau • Kyle Ward • Michigan State University • Qi Diao • Samuel Drake • Shu-Chuan Kao • Tianli Li • Xin Li • Yang Lu • Feng-Hsien Pang • Tian Song • Dipendra Subedi • Yan Zheng
First, a Little Academic Genealogy • Why the concern with these concepts? • Michigan State University students and grads • Mentored by Reckase • General theme is that the real world is more complicated than we would like • Stimulated by discussions of… • Multidimensionality • Implications of violations of unidimensionality assumptions of traditional psychometric models • This talk could be entitled “The real world is so much more complicated than we psychometricians anticipated, it’s not even funny.” • Conflicts with at least the title of the next presentation
First, a Little Academic Genealogy • This work is not the responsibility of Reckase, but is clearly an outgrowth of that work • The focus of this work is not to knock down traditional psychometric models, but to challenge claims based on those models, to encourage more valid use of scales resulting from those models, and to move beyond those models if we want to continue making those claims • Martineau (2004, 2006a) showed that violations of the assumptions of linearity and unidimensionality can create very misleading results when using vertically scaled data • This presentation identifies trajectories within and across grades and investigates the difficulties posed by non-linearity of scales, unmodeled dimensionality, and less than interval-level measurement
Content Specifications in a Grade 3-8 Mathematics Assessment Blueprint • Grade 3-8 Fall 2005 Michigan Educational Assessment Program (MEAP) Mathematics • Based on a hierarchical content standard structure as follows: • Content Area (e.g. Mathematics) • Strand (e.g. Geometry) • Domain (e.g. Transformation & Symmetry) • Benchmark (e.g. Recognize that transformed shapes are still the same shape)
The Illusion of Unidimensionality • Current dimensionality assessment procedures under-detect dimensionality because they… • Rely on detection of dominant dimensionality • Rely on the assumption that shared variance is a sign of shared construct representation (and therefore dominant dimensionality) • However, shared variance can result both from shared construct representation and from exogenous causal variables • The uniqueness of one construct (the unshared variance) is identified as a nuisance factor that may be ignored • The shared variance of multiple constructs is interpreted as a single construct, even though that shared variance will decrease if the constructs respond differently to an exogenous variable
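To illustrate the point about exogenous causal variables, here is a minimal simulation sketch. It is not taken from the cited studies: the variable names, the loadings (2.0 and 0.5), and the sample size are all assumptions chosen for illustration. Two constructs with fully independent unique parts correlate strongly only because both respond to the same exogenous variable, and the correlation drops once one construct responds less strongly.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000

# Hypothetical exogenous cause (e.g. overall exposure to instruction).
exogenous = rng.normal(size=n)

# Unique (unshared) parts of two distinct constructs; independent by construction.
unique_a = rng.normal(size=n)
unique_b = rng.normal(size=n)

# Case 1: both constructs respond equally to the exogenous variable.
construct_a = 2.0 * exogenous + unique_a
construct_b = 2.0 * exogenous + unique_b
r_equal_response = np.corrcoef(construct_a, construct_b)[0, 1]

# Case 2: construct B responds much less to the same exogenous variable.
construct_b_weak = 0.5 * exogenous + unique_b
r_different_response = np.corrcoef(construct_a, construct_b_weak)[0, 1]

# Theoretical correlations under these assumed loadings are 0.8 and 0.4.
print(round(r_equal_response, 2), round(r_different_response, 2))
```

Under these assumptions the constructs share no content at all, yet in the first case a dominant-dimension procedure would see one strong common factor.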
The Illusion of Unidimensionality • Martineau (2006b) and Tan & Gierl (2006) show that current dimensionality assessment procedures under-detect dimensionality when dimensions are correlated and/or factor structure is complex • Martineau & Reckase (2006b) show that strongly multidimensional data (e.g. 10 or more factors) are identified as unidimensional using traditional methods when factors are moderately to strongly correlated
Empirical Analyses of Unidimensionality and Multidimensionality • Winter 2004 Grade 4 Michigan Educational Assessment Program (MEAP) Mathematics Test • See Martineau, Mapuranga, & Ward, 2006 (AERA paper) • Meets unidimensionality criteria • Contains four interpretable dimensions • Reasonably similar to intended content structure (a diffuse test blueprint indicating X% of items for each of five strands without further specificity)
Empirical Analyses of Unidimensionality and Multidimensionality • Martineau, Mapuranga, & Ward (2006) • Four dimensions (correlated 0.58 to 0.77) • Pattern/rule generation/recognition • Specific mathematics vocabulary • Combinations • Match data to source • Six content clusters • Rule and pattern recognition (dimension 1) • Match data to source (dimension 4) • Identify/carry out rule based operations (dimensions 1 & 4) • Combinations (dimensions 1 and 3) • Counting units (low loadings on all dimensions) • Mathematics vocabulary (dimension 2)
Empirical Analyses of Unidimensionality and Multidimensionality • Fall 2005 Grade 3-8 MEAP Mathematics Tests • Meets unidimensionality criteria (Rasch model fits without any concerns, EFA identifies a single dimension) • Unable to interpret with small number of dimensions • Contains between 9 and 12 dimensions per grade (with only 4-5 dimensions having sufficient contributing items to produce a reliable scale) • Reasonably similar to intended content structure (a very tight test blueprint indicating 3 items per content standard with 20 content standards) • Dimensionality is reasonably representative of the level of specificity in the test blueprint
Empirical Analyses of Unidimensionality and Multidimensionality • Results & Implications • Multiple dimensions are identifiable even when traditional methods identify a single dimension • Dimensions differ from each other in non-trivial ways (e.g. along the lines of theoretically defined content standards as represented by test blueprints) • If there are reasonable theoretical reasons to expect that different types of content within a subject matter may respond differently to an exogenous variable… • The different types of content should be scaled separately • If not, the results may be misleading and harmful to the educational community
The Illusion of Linearity and Interval-Level Measurement • From Reckase’s invited address at the APA 1989 annual meeting, used with permission • Trajectory of a 3-PL unidimensional ACT mathematics scale through a two-dimensional mathematics achievement space • It is possible that the unidimensional scale is linear, or that one or both of the multidimensional scales are linear, but not all three! • The meaning of the unidimensional scale changes depending upon location • When the scale is non-linear, it is also not equal-interval by definition
Non-linear Unidimensional Trajectories through Multidimensional Content Space • Plotting the trajectories • Obtain overall “unidimensional” scores • Calibrate unidimensional scales for each strand within each grade where there were 10 or more items • Divide the overall scale into quantiles (between 50 and 100) with equal numbers of students in each quantile • Calculate the mean strand scores for each quantile • Plot the mean strand scores against each other
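The plotting steps above can be sketched as follows. This is an illustrative outline only, not the authors' code: the function name, the synthetic data, and the curved relationship between the two strands are assumptions, and the calibration of the overall and strand scales is taken as already done.

```python
import numpy as np

def strand_means_by_quantile(overall, strands, n_quantiles=50):
    """Mean strand scores within equal-count quantile groups of the overall score.

    overall: shape (n_students,), overall "unidimensional" scores
    strands: shape (n_students, n_strands), separately calibrated strand scores
    Returns an array of shape (n_quantiles, n_strands), ordered low to high.
    """
    overall = np.asarray(overall, dtype=float)
    strands = np.asarray(strands, dtype=float)
    order = np.argsort(overall, kind="stable")    # students sorted by overall score
    groups = np.array_split(order, n_quantiles)   # near-equal numbers of students per group
    return np.vstack([strands[g].mean(axis=0) for g in groups])

# Synthetic illustration (hypothetical data, not MEAP results).
rng = np.random.default_rng(0)
theta = rng.normal(size=5000)
algebra = theta + 0.3 * rng.normal(size=5000)
geometry = np.tanh(theta) + 0.3 * rng.normal(size=5000)   # flattens at the extremes
overall = (algebra + geometry) / 2

trajectory = strand_means_by_quantile(
    overall, np.column_stack([algebra, geometry]), n_quantiles=50
)
```

Plotting `trajectory[:, 0]` against `trajectory[:, 1]` gives the path of the overall scale through the two-strand space; a straight line would indicate a linear traversal.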
Unidimensional Trajectories through Multidimensional Content Space
Another View of Non-Linear Trajectories • Vertically Scale all Mathematics 3-8 Items • Use Unidimensional, Vertically Scaled Item Parameters to Calculate Student Strand Scores • Calculate Centiles Across and Within Grades • Create Plots
Yet Another View of Non-Linear Measurement • Reference composites • Easiest third of items on grade 6 • Hardest third of items on grade 6 • Easiest third of items on grade 7 • Hardest third of items on grade 7 • Identifies the dimensions best measured by the top and bottom thirds of the grade 6 and 7 tests
A Caveat About Psychometric Claims • Corporations with a monetary interest tend to make the claims described below often and with impunity • Academics are less likely to make these claims, but some do still make them • When academics make these claims, they tend to make them under more stringent (but still insufficiently stringent) conditions
Psychometric Claim 1: Advanced Psychometric Procedures Produce (Essentially) Unidimensional Scales • Rasch/IRT Point of View • “…[IRT scales are] meaningful only if each and every question contributes to the measure of a single attribute.” • Bond & Fox, 2001, p. 25 • If there is a single dominant dimension in the data, the data can be successfully modeled as unidimensional. • Nandakumar, 1991 • Items that are clearly multidimensional can be selected to model a composite dimension that satisfies essential unidimensionality needs. • Reckase, Ackerman, & Carlson, 1988 • The implication is that if the data fit the selected model (e.g. meets unidimensionality and fit tests), it is reasonable to treat the resulting scale as unidimensional from top to bottom
Psychometric Claim 1: Advanced Psychometric Procedures Produce (Essentially) Unidimensional Scales • Examples • “We place all of our test items on the RIT scale according to their difficulty.” • NWEA website, 9/18/2006 • http://www.nwea.org/assessments/researchbased.asp • “the developmental standard score is also a number that describes a student's location on an achievement continuum.” • ITBS website, 9/18/2006 • http://www.education.uiowa.edu/itp/itbs/itbs_interp_score.htm • “The Quantile Framework measures student mathematical achievement and concept/application solvability on the same scale ….” • Quantiles website, 9/18/2006 • http://www.quantiles.com/DesktopDefault.aspx?view=fa&tabindex=4&tabid=22
Psychometric Claim 1: Advantages of Claiming Unidimensional Scales • Unidimensionality is a great simplifier of… • Measurement models • Interpretation of the resulting scale • Explanation of the scale to non-technical audiences • Statistical analyses of the resulting scale • Collaboration with substantive researchers • Instrument development • Meeting profit margins • Unidimensionality is a convenient selling point for… • Clients • Researchers • Policymakers • Managers • Shareholders • Ourselves
The Illusion of Unidimensionality For example, if the following two models both approximately capture the shared variance in items 1 through N, we select the less complex model to be parsimonious. So, what’s wrong with that?
The Illusion of Unidimensionality Let’s try this one on for size: This model on the left is, of course, a ridiculous econometric model. The model on the right is simplistic, but more realistic. But how does the left model fare as a psychometric model?
The Illusion of Unidimensionality Let’s try this one on for size: The left model fares very well in the psychometric community. But so does the model on the right. So how does the model on the left cause problems for education?
How the Claim of Unidimensionality Can Harm the Educational Community • When the available data is “unidimensional,” studies are likely to be performed using general unidimensional outcomes • If there is any reasonable theory that different dimensions will respond differently to a given educational intervention, using general unidimensional outcomes is misleading • To visualize the effect of incorrect claims of unidimensionality • Assume that the “unidimensional” scale traverses a multidimensional space in a linear fashion • Understand that if the traversal is non-linear, the problems become greater (see Martineau, 2006a)
Effects of Unmodeled Multidimensionality • Best Case Scenario • An essentially unidimensional reading scale traverses the decoding/comprehension space linearly • Scale measures mostly decoding, but is also moderately sensitive to differences in comprehension • Randomized experiment, students assigned to phonics versus whole language instruction (the two groups start out at the same location in the decoding/comprehension space) • Pre-post assessment on the same unidimensional reading instrument (problems become worse if a different level of the assessment is used; see Martineau, 2004, 2006a) • Phonics has a strong positive effect on decoding (effect size 0.4), no effect on comprehension; whole language has opposite effects
Effects of Unmodeled Multidimensionality • Misleading results • Phonics instruction increases reading ability over whole language instruction • Accurate results with multidimensional instruments • Phonics instruction increases decoding ability over whole language instruction • Whole language instruction increases comprehension ability over phonics instruction • What are the policy ramifications of these misleading results? Can anyone say “Educational Pendulum?” • If the assessment were an essentially unidimensional measure of mostly comprehension but some decoding, the misleading results would go in exactly the opposite (but equally misleading) direction
Alternate Effects of Unmodeled Multidimensionality • Misleading results • It does not matter whether one uses phonics or whole language instruction • Accurate results with multidimensional instruments • Phonics instruction increases decoding ability over whole language instruction • Whole language instruction increases comprehension ability over phonics instruction • What are the policy ramifications of these misleading results? • If the assessment were an essentially unidimensional measure of mostly comprehension but some decoding, the misleading results would go in exactly the opposite (but equally misleading) direction
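The phonics versus whole language scenario can be worked through numerically. The sketch below rests on assumptions not stated in the slides: the "unidimensional" scale is treated as a simple weighted composite of decoding and comprehension with weights summing to 1, "opposite effects" is read as 0.0 on decoding and 0.4 on comprehension for whole language, and the weights 0.8, 0.5, and 0.2 are hypothetical. The result is expressed in strand-level effect-size units without rescaling the composite.

```python
def composite_effect(weight_decoding, effect_decoding, effect_comprehension):
    """Effect seen on a linear composite that weights decoding by
    weight_decoding and comprehension by (1 - weight_decoding)."""
    return (weight_decoding * effect_decoding
            + (1.0 - weight_decoding) * effect_comprehension)

def phonics_minus_whole_language(weight_decoding, effect_size=0.4):
    """Apparent advantage of phonics over whole language on the composite."""
    phonics = composite_effect(weight_decoding, effect_size, 0.0)
    whole_language = composite_effect(weight_decoding, 0.0, effect_size)
    return phonics - whole_language

# Hypothetical weights: mostly decoding, balanced, mostly comprehension.
for w in (0.8, 0.5, 0.2):
    print(w, round(phonics_minus_whole_language(w), 2))
```

Under these assumptions the same pair of true effects yields an apparent phonics advantage, no apparent difference, or an apparent whole language advantage, depending only on how the composite happens to weight the two dimensions.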
Psychometric Claim 2: Advanced Psychometric Procedures Produce Linear, Interval Level Scales • If a scale is unidimensional and measures at the interval level, the scale is then linear by definition • Therefore • If we claim that a scale is unidimensional AND • If we claim that a scale is interval level THEN • We also claim that the scale is linear
Psychometric Claim 2: Advanced Psychometric Procedures Produce Linear, Interval Level Scales • Rasch Point of View • “…estimates derived from Rasch procedures are located on an interval scale….” • Bond & Fox, 2001, p. 119 • This claim is well accepted in the Rasch literature • Examples • “You can liken the [NWEA Rasch scale] to a meterstick which is comprised of equal units of measurement, centimeters.” • NWEA website, 9/18/2006 • http://www.nwea.org/assessments/researchbased.asp • “Measurements…are now reportable in a common unit, a Lexile, which is similar to the degree calibrations on a thermometer….” • Lexile website, 9/18/2006 • p. 14 of http://www.lexile.com/lexilearticles/objective-measurement-reading-response.pdf • “Pearson Educational Measurement’s PASeries reading assessment is reported on the Lexile scale….” • Pearson website, 9/18/2006 • http://www.pearsonpaseries.com/downloads/overview/PASeriesOverview.html
Psychometric Claim 2: Advanced Psychometric Procedures Produce Linear, Interval Level Scales • This claim is more controversial in the more general psychometric world, but the claim is made nonetheless • Claims for: • “[IRT models] are calibrated to an equal-interval scale….” • METRIC website, 9/18/2006 • http://www.metric.research.med.va.gov/learn/theories/theories_irt.asp • “[An IRT] scale…[is not] interval…although it is popular and reasonable to assume [it is].” • Hambleton, Swaminathan, and Rogers, 1991, p. 87 • Claims against: • “[The equal interval assumption] is…either demonstrably wrong…or arbitrary….” • Cliff, 1991, p. 37 • “When the characteristic…cannot be directly observed, claims of equal-interval properties…are not testable and are therefore meaningless.” • Zwick, 1992, p. 209
Psychometric Claim 2: Advanced Psychometric Procedures Produce Linear, Interval Level Scales • General Psychometric Examples • “…[TerraNova K-12] Scale Scores [derived from 3-PL IRT models] are equal-interval scores….” • CTB McGraw-Hill website, accessed 9/18/2006 • http://www.ctb.com/static/about_assessment/popup_f6.jsp • “…the [ITBS] developmental standard score scale…mirrors reality better….[than Grade Equivalents in that] growth is usually not as great at the upper grades as it is at the lower grades.” • Iowa Testing Program website, accessed 9/18/2006 • http://www.education.uiowa.edu/itp/itbs/itbs_interp_score.htm • 1998 NAEP Technical manual describes several analyses that assume the scale scores are equal interval • NCES website, accessed 9/18/2006 • http://nces.ed.gov/nationsreportcard/pdf/main1998/2001509.pdf
Psychometric Claim 2: Advantages of Claiming Linear, Interval-Level Scales • Scale linearity and level of measurement determine the types of analyses that are appropriate • Non-linear, nominal/ordinal measurement • Requires categorical analyses (discarding some data to create categories) or non-linear transformations • Interpretation is more difficult • Categorical/non-linear analyses are less well known and developed • Linear, interval/ratio level measurement • Allows for linear variable analyses without transformation • Interpretation is clearer • Analyses of linear interval/ratio variables are better known and better developed (e.g. HLM, ANOVA, Factor Analysis, Regression) • If we have linear, interval/ratio level measurement, then studies of education based on those measures are very much like studies of physical, directly measurable phenomena
How the Claim of Linearity/Interval-Level Measurement Can Harm the Educational Community • Non-linearity may dramatically increase the problems of un-modeled dimensionality • Scenario demonstrates how this can happen
How the Claim of Linearity/Interval-Level Measurement Can Harm the Educational Community • Scenario • The MEAP grade 8 mathematics scale traverses the algebra/geometry space non-linearly (as taken from the empirical trajectories shown earlier) • Scale measures mostly geometry near the bottom, and changes to mostly algebra near the top • Quasi-experiment, matched samples (based on the unidimensional math pre-test) from convenience populations are assigned to use or not use 3-D manipulatives • Pre-post assessment on the same unidimensional mathematics instrument (problems become worse if a different level of the assessment is used; see Martineau, 2004, 2006a) • Treatment has a strong positive effect on geometry gains (effect size 0.5), no effect on algebra gains
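A rough numerical sketch of why this scenario is troublesome follows. Everything about the weighting function is an assumption made for illustration: the logistic shape, the 0.8 and 0.2 geometry weights at the bottom and top of the scale, and the location units are hypothetical and are not the empirical MEAP trajectory. The sketch treats the scale locally as a weighted composite of geometry and algebra whose weights change with scale location.

```python
import math

def geometry_weight(location, low_end=0.8, high_end=0.2, midpoint=0.0, steepness=1.5):
    """Hypothetical share of the scale's sensitivity that goes to geometry
    at a given scale location (mostly geometry low, mostly algebra high)."""
    s = 1.0 / (1.0 + math.exp(-steepness * (location - midpoint)))
    return low_end + (high_end - low_end) * s

def apparent_gain(location, geometry_effect=0.5, algebra_effect=0.0):
    """Gain registered on the overall scale for students near `location`,
    treating the scale locally as a weighted composite of the two strands."""
    w = geometry_weight(location)
    return w * geometry_effect + (1.0 - w) * algebra_effect

# Same true treatment effect, different starting locations on the scale.
for location in (-2.0, 0.0, 2.0):
    print(location, round(apparent_gain(location), 3))
```

Under these assumptions an identical geometry effect of 0.5 appears large for low-scoring students and small for high-scoring students, so the apparent treatment effect depends on where the matched samples sit on the scale.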