The Design of Statistical Specifications for a Test Mark D. Reckase Michigan State University
Procedures for Test Design • Test design has been considered to be a subjective, artistic endeavor. • But, with the development of item response theory, test design has become more scientific. • Lord suggested that tests be constructed to match a target information function. • Very sophisticated methods have been developed to select items to match target information functions. • Little work has been done on the design of test information functions.
Purposes for this Paper • Present methodology for designing target information functions or item difficulty distributions for a test. • Demonstrate that methodology for several common testing situations. • Measure all examinees from a normal distribution of the trait to a desired level of precision. • Measure a range of a trait to a desired level of precision.
Basic Concepts • If an examinee's θ is known, the optimal test should contain a set of items that provide the required information at that θ. • Information from an item covers a range of θ, so items that are optimal for one person supply some information for other persons. • The general approach is to randomly select persons from the target population, then select optimal items for each person. • For each additional person, select only the additional items needed to reach the information target.
Example • Suppose target examinee population is N(0,1). • Randomly select examinee. • Information equivalent to reliability .90 is 10. • Select items until information 10 is reached assuming Rasch model (b = θ). • Randomly select additional examinees. • Select items for those examinees until a test length of 50 is reached.
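The steps of this example can be sketched as a short simulation. This is one possible reading of the procedure, not the author's implementation: the function name `design_one_sample`, the `max_draws` safety cap, and the scaling constant D = 1.7 are all assumptions introduced for illustration.

```python
import math
import random

D = 1.7  # assumed scaling constant; not given in the slides


def item_info(theta, b, d=D):
    """Rasch item information at theta for an item of difficulty b."""
    p = 1.0 / (1.0 + math.exp(-d * (theta - b)))
    return d * d * p * (1.0 - p)


def design_one_sample(rng, target_info=10.0, test_length=50, max_draws=100_000):
    """Return item difficulties chosen for one random sample of examinees."""
    difficulties = []
    for _ in range(max_draws):  # safety cap so the loop always ends
        if len(difficulties) >= test_length:
            break
        theta = rng.gauss(0.0, 1.0)  # examinee drawn from N(0, 1)
        # Information this examinee already receives from earlier items.
        info = sum(item_info(theta, b) for b in difficulties)
        # Add items at b = theta only until this examinee reaches the target.
        while info < target_info and len(difficulties) < test_length:
            difficulties.append(theta)
            info += item_info(theta, theta)
    return difficulties


items = design_one_sample(random.Random(1))
```

Under these assumptions the first examinee receives a block of identical items, and later examinees receive fewer because they already collect information from items placed nearby.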
Results -- Comments • Results are from one sample of 6 examinees randomly selected. • 14 items needed for first examinee. • Other examinees need fewer additional items because of overlap of information functions. • Need to consider the effects of sampling variation.
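The figures 10 and 14 can be checked with a little arithmetic. The sketch below assumes a unit-variance trait, the convention reliability = 1 - (error variance / trait variance) with error variance equal to 1 / information, and a scaling constant D = 1.7. None of these is stated in the slides; they are assumptions that happen to reproduce the reported numbers.

```python
import math


def info_for_reliability(reliability, trait_variance=1.0):
    """Information needed so that 1 - error_var / trait_var equals reliability."""
    error_variance = trait_variance * (1.0 - reliability)
    return 1.0 / error_variance


def items_needed(target_info, d):
    """Items at b = theta needed to reach target_info (each gives d^2 / 4)."""
    max_item_info = d * d * 0.25
    return math.ceil(target_info / max_item_info)


print(info_for_reliability(0.90))  # approximately 10
print(items_needed(10.0, d=1.7))   # 14
print(items_needed(10.0, d=1.0))   # 40
```

With D = 1.7 each perfectly targeted item supplies 0.7225 units of information, so 14 items are needed; with D = 1.0 the same target would require 40 items, which suggests (but does not prove) that the reported result used the 1.7 scaling.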
The Complete Process • Create ideal set of items for a sample. • Replicate the process many times (500 seems to work well). • Average information functions from the samples. • Average number of items in .2-unit bins to determine difficulty spread. • Check specifications against target.
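The replication and averaging steps above might look like the following sketch. It is an illustration under stated assumptions rather than the author's code: `design_one_sample` and `average_design` are hypothetical names, the θ grid from -4 to 4 is an arbitrary choice, and D = 1.7 is an assumed scaling constant.

```python
import math
import random
from collections import Counter

D = 1.7  # assumed scaling constant; not given in the slides


def item_info(theta, b, d=D):
    p = 1.0 / (1.0 + math.exp(-d * (theta - b)))
    return d * d * p * (1.0 - p)


def design_one_sample(rng, target_info=10.0, test_length=50, max_draws=100_000):
    """Ideal item set for one random sample of N(0, 1) examinees."""
    difficulties = []
    for _ in range(max_draws):  # safety cap so the loop always ends
        if len(difficulties) >= test_length:
            break
        theta = rng.gauss(0.0, 1.0)
        info = sum(item_info(theta, b) for b in difficulties)
        while info < target_info and len(difficulties) < test_length:
            difficulties.append(theta)
            info += item_info(theta, theta)
    return difficulties


def average_design(replications=500, bin_width=0.2, seed=0):
    """Average the information function and binned item counts over samples."""
    rng = random.Random(seed)
    grid = [x / 10.0 for x in range(-40, 41)]  # theta from -4.0 to 4.0
    info_sum = [0.0] * len(grid)
    bin_counts = Counter()
    for _ in range(replications):
        items = design_one_sample(rng)
        for i, theta in enumerate(grid):
            info_sum[i] += sum(item_info(theta, b) for b in items)
        for b in items:
            bin_counts[math.floor(b / bin_width)] += 1
    avg_info = [s / replications for s in info_sum]
    # Keys are the lower edges of the bins; values are average items per bin.
    avg_items = {round(k * bin_width, 6): n / replications
                 for k, n in sorted(bin_counts.items())}
    return grid, avg_info, avg_items
```

Calling `average_design()` returns the θ grid, the averaged information function, and the average number of items per .2-unit bin; rounding the bin averages to whole items would give a difficulty specification that can be checked against the target.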
Conditions for Rasch-based Design • N(0,1) trait distribution • 50 item test • Rasch model • 500 replications • Minimum information 10
Comments • Minimum information requirement met from -2.3 to 2.3. • Information accumulates to higher values in the middle of the distribution. • Difficulty distribution is essentially rectangular. • Test information exceeds the target because item numbers are rounded upward in many cases.
Process Can Help Select Test Length • Run process for different test lengths. • Also can consider forcing selection of first examinee at 0.0. • What test length allows criteria to be met?
Results – Test Length • With increased test length, the information function widens and increases in height. • Test length of 15 is too short to meet requirements unless it is focused at 0.0. • Forcing first examinee at 0.0 makes information function narrower and more peaked. • 75 items is the maximum number of items that makes sense for the criteria specified here.
Test Designed to Measure with Precision over a Range • Brian Junker suggested the following procedure. • Select range • Pick items at extremes of range • Fill in with items between extremes to yield flat information function • Continue until information criterion is reached over entire range
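One way to read this procedure is as a greedy fill: start with items at the two extremes, then repeatedly add an item wherever the current information is lowest until the whole range meets the criterion. The sketch below implements that reading only; the slides do not give the exact fill-in rule, so the greedy rule, the 0.1 grid step, the function name `design_for_range`, and D = 1.7 are all assumptions.

```python
import math

D = 1.7  # assumed scaling constant; not given in the slides


def item_info(theta, b, d=D):
    p = 1.0 / (1.0 + math.exp(-d * (theta - b)))
    return d * d * p * (1.0 - p)


def design_for_range(low, high, target_info=10.0, step=0.1):
    """Greedy sketch: fill the range until information >= target everywhere."""
    n_points = int(round((high - low) / step)) + 1
    grid = [low + i * step for i in range(n_points)]
    items = [low, high]  # start with items at the extremes of the range
    info = [sum(item_info(t, b) for b in items) for t in grid]
    while min(info) < target_info:
        # Place the next item where the current information is lowest.
        weakest = info.index(min(info))
        b = grid[weakest]
        items.append(b)
        info = [v + item_info(t, b) for v, t in zip(info, grid)]
    return sorted(items), grid, info


items, grid, info = design_for_range(-2.0, 2.0)
```

Because items near the edges of the range lose part of their information outside it, a rule of this kind tends to place extra items toward the extremes, which is one way a non-normal difficulty distribution can arise.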
Specifications Counter to Traditional Specifications • Most tests have normal distributions of difficulties. • These results seem very odd compared to traditional results. • Need to investigate further. • What is distribution of scores? • What is distribution of p-values?
Odd Results • Distribution of scores is near normal. • Distribution of p-values mirrors b-parameter distribution. • Extreme item difficulties (p-values) are .08 and .92. • It is surprising that these items yield a near-normal distribution of scores. • Look at test characteristic curve.
Test Characteristic Curve • Test characteristic curve is virtually linear from -2 to 2. • When the curve is linear, the form of the distribution of θ is mapped to the estimated true score scale. • In this case, since the θ distribution was normal, so is the number-correct score distribution.
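The linearity claim can be explored numerically by comparing a test characteristic curve with the straight line through its values at -2 and 2. The sketch below uses a hypothetical 50-item test with evenly spaced difficulties as a stand-in, not the item set from the slides, and again assumes D = 1.7; the function names are illustrative.

```python
import math

D = 1.7  # assumed scaling constant; not given in the slides


def rasch_prob(theta, b, d=D):
    return 1.0 / (1.0 + math.exp(-d * (theta - b)))


def tcc(theta, difficulties):
    """Test characteristic curve: expected number-correct score at theta."""
    return sum(rasch_prob(theta, b) for b in difficulties)


def max_deviation_from_line(difficulties, low=-2.0, high=2.0, step=0.1):
    """Largest gap between the TCC and the chord joining its end points."""
    y_low, y_high = tcc(low, difficulties), tcc(high, difficulties)
    slope = (y_high - y_low) / (high - low)
    n_points = int(round((high - low) / step)) + 1
    worst = 0.0
    for i in range(n_points):
        theta = low + i * step
        line = y_low + slope * (theta - low)
        worst = max(worst, abs(tcc(theta, difficulties) - line))
    return worst


# Hypothetical 50-item test with evenly spaced difficulties from -2.45 to 2.45.
difficulties = [-2.45 + 0.1 * i for i in range(50)]
print(max_deviation_from_line(difficulties))
```

The printed value is in number-correct units, so it can be judged against the 50-point score scale: a small value means the expected score is close to a linear function of θ over the range, and a linear transformation of a normal θ distribution stays normal.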
Conclusions • A process has been developed for designing target information functions and item difficulty distributions for tests. • The process suggests that either a rectangular or a U-shaped distribution is appropriate if it is desired to measure with equal precision over a range. • The number of items needed is related to the range of the scale that needs to be measured. • The U-shaped item difficulty distribution works best if it is desired to recover the underlying θ distribution. • The results are quite different from those of traditional test development procedures.