Reliability in Scales

Reliability in Scales • Reliability is a question of consistency • Do we get the same numbers on repeated measurements? • Low reliability: reaction time • High reliability: measuring weight • Psychological tests fall in between

Reliability & Measurement error • Measures are not perfectly reliable because they have error • The “built in accuracy” of the scale • Pokemon wristwatch vs. USN atomic clock • We can express this as: • X = T + e • X = your measurement • T = the “True score” • e = the error involved in measuring it (+ or -)

Example: the effect of e • Imagine we have someone with a “true” int score of 100. • If your int scale has a large e, then your measurements will vary a lot (say from 60 all the way to 130) • If your scale has a small e, your measurements will vary a little (say from 90 to 110)

Measurements as distributions • Think of e as the variance of a distribution of measurements, with T (the true score) as its mean • Small e - scores clustered close to true score • Large e - scores all over the place! (hard to say what the true score is)
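The relation X = T + e can be made concrete with a small simulation. This is only a sketch: it assumes the error e is normally distributed around zero, and the standard deviations of 3 and 15 are illustrative values chosen here, not figures from the slides. The function name `simulate_measurements` is hypothetical.

```python
import random
import statistics

def simulate_measurements(true_score, error_sd, n, seed=0):
    """Return n simulated measurements X = T + e.

    Assumption: e is drawn from a normal distribution with mean 0
    and standard deviation error_sd.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [true_score + rng.gauss(0, error_sd) for _ in range(n)]

# Same true score (T = 100), measured with a small-e and a large-e scale.
small_e = simulate_measurements(true_score=100, error_sd=3, n=1000)
large_e = simulate_measurements(true_score=100, error_sd=15, n=1000)

for label, scores in (("small e", small_e), ("large e", large_e)):
    print(label,
          "mean:", round(statistics.mean(scores), 1),
          "SD:", round(statistics.stdev(scores), 1),
          "range:", round(min(scores)), "to", round(max(scores)))
```

Both sets of measurements centre on the true score, but the large-e set is spread much more widely, which is why a single measurement from that scale says little about T.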

More on the error • Measures with a large e are dodgy (hides the true score) • We can reduce the size of e, but not eliminate it completely • Measuring reliability is measuring the impact of e

Different forms of reliability • Reliability (“effect of e”) can be very hard to conceptualise • To help, we break it up into 2 subclasses • Temporal stability • If I measure you today and tomorrow, do I get the same result? • Internal consistency • Are all the questions in the test measuring the same thing?

Temporal stability • The big idea: If I test you now, and then I test you tomorrow, I should get the same result • Why have it? • Can’t measure changes otherwise! • Tells us that we can trust results (small time error) • Tells us that there is no learning effect

Measuring temporal stability • How can we measure if a test is temporally stable? • The problem: we have 2 sets of scores. We need to see if they are the same • Solution: Use a correlation. If the two sets are strongly related, then they are basically the same

Example: Correlations & Stability • Imagine a test with ten questions, and a person does it twice (on Monday and Wednesday): • M: 5 6 5 3 4 8 6 4 8 7 • W: 4 7 4 3 4 8 6 3 9 5 • Are these scores the same? (r = 0.897)

Example: correlations & stability • Now imagine a crappy scale: • M: 5 6 5 3 4 8 6 4 8 7 • W: 5 3 2 2 8 5 8 1 4 4 • Are these scores basically the same? (r = 0.211)
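The two r values in these examples can be reproduced with a few lines of code. The sketch below computes Pearson's correlation by hand from the Monday and Wednesday scores given on the slides; the function and variable names are chosen here for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares of the deviations
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

monday = [5, 6, 5, 3, 4, 8, 6, 4, 8, 7]
wednesday_stable = [4, 7, 4, 3, 4, 8, 6, 3, 9, 5]    # first example
wednesday_unstable = [5, 3, 2, 2, 8, 5, 8, 1, 4, 4]  # second example

r_stable = pearson_r(monday, wednesday_stable)
r_unstable = pearson_r(monday, wednesday_unstable)
print(round(r_stable, 3), round(r_unstable, 3))  # prints 0.897 0.211
```

A value near 1 means the two sets of scores rise and fall together; a value near 0 means Monday's scores tell you almost nothing about Wednesday's.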

Different approaches to stability • There are two main ways of testing temporal stability • Test-retest method: give the same test to the same people • Alternate forms: give a highly similar test to the same people

Test-retest method • Method: • 1. Select a group of people • 2. Give them your test • 3. Get them to come back later • 4. Give them the test again • 5. Correlate the two sets of results to see if they agree

Things to note • It must be the same people • We want to know that if client X returns, we can measure that person again • The amount of time between tests depends on your requirements • The correlation value must be very high - above 0.85

Why it works • We get 2 results from each person to compare • This means we can rely on the test to work for the same people • We use a lot of people in our assessment • This means that we can rely on the test, regardless of who our client is • The correlation tells us the degree to which the 2 tests agree (r² is the % of variance they share)
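As a rough sketch of the test-retest check, the code below correlates two sittings of the same test and compares the result with the 0.85 threshold mentioned earlier. The scores for the eight people are invented purely for illustration, and the names `first_sitting`, `second_sitting` and `THRESHOLD` are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

THRESHOLD = 0.85  # cut-off used in these slides

# Invented total scores: one entry per person, same order in both lists.
first_sitting = [10, 12, 15, 18, 20, 22, 25, 28]
second_sitting = [11, 12, 14, 19, 21, 21, 26, 27]

r = pearson_r(first_sitting, second_sitting)
r_squared = r ** 2  # share of variance the two sittings have in common

print("r =", round(r, 3), "r squared =", round(r_squared, 3))
print("temporally stable" if r > THRESHOLD else "not temporally stable")
```

Note that the two lists must be in the same person order; the correlation is meaningless if person 3's first score is paired with someone else's second score.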

The learning effect • What if you have a test where learning/practice can affect your score? • E.g. class test • The test-retest method will tend to yield poorer correlations • People tend to score higher marks the second time around, and not everyone improves by the same amount • This will make it look as if temporal stability is poor!

Alternate forms reliability • Answer: do test-retest, but don’t use the same test twice • Use a highly similar test • In order for this to work, both forms must be equally difficult • The more similar, the better

Making alternate forms of a test • Simple to make both forms about equally difficult • Make twice as many questions as you will want in the test • Randomly divide them up into two halves • Each half is a test! • The random division makes it likely that both forms are about equally difficult
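The random division described above might look like the sketch below. The item labels `Q1` to `Q20` and the function name `make_alternate_forms` are placeholders invented for this example; the seed is fixed only so the split can be reproduced.

```python
import random

def make_alternate_forms(item_pool, seed=None):
    """Randomly divide an item pool into two equally sized forms."""
    if len(item_pool) % 2 != 0:
        raise ValueError("item pool must contain an even number of items")
    shuffled = list(item_pool)             # copy, so the pool is untouched
    random.Random(seed).shuffle(shuffled)  # random order of all items
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Write twice as many questions as one form needs (here 20 for a 10-item test).
pool = [f"Q{i}" for i in range(1, 21)]
form_a, form_b = make_alternate_forms(pool, seed=42)

print("Form A:", form_a)
print("Form B:", form_b)
```

Random assignment balances difficulty on average rather than exactly, so with a small item pool the two forms can still differ somewhat by chance.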

The procedure: alternate forms • Once you have your 2 forms: • Collect a sample of people • Give them the first form of the test • wait a while • Give them the second form of the test • Correlate the results • If the correlation is high (> .85), you have stability

Which to use: alternate forms or test-retest? • If you are measuring something which can be learned/perfected by practice - alternate forms • If not, you could choose • Test-retest is preferable, removes confound about difficulty • In many cases, you don’t really know if learning is an issue • Alternate forms is “safer”, but poorer statistically

What if you don’t have temporal stability? • Temporal stability is not required for all tests • Most important for tests which work longitudinally • Very important if you want to track changes over time • Excludes all “once off” tests (eg. aptitude tests)

Internal consistency • A different type of reliability • The big idea: Are all the questions in my test tapping into the same thing? • (or, are some questions irrelevant) • All tests require this property

Why it’s important • Imagine we have an arithmetic ability test, with 4 questions: • 1. What is 5 x 3? • 2. What is 12 + 2 - 5? • 3. What is the capital of Ukraine? • 4. What is 5 x 2 + 3?

Why it’s important • Item 3 does not contribute to measuring arithmetic • Someone who is a maths whiz (should get 4/4) might only get 3/4 • A complete maths idiot (should get 0/4) could get 1/4 • It does not belong in this test! • If we include it in our total, it will confuse us • Items such as this become “third variables”

How do we know if an item belongs? • We need to figure out if a particular item is testing the same thing as the others • We can correlate the item’s scores with the scores of some other item we do know belongs • High correlation (above 0.85) - it tests the same thing • Low correlation (below 0.85) - it measures something else

Our example again • Some people who know maths, will also know geography • But not everyone! • Correlate Q1 to Q3 - it will be weak • Those who know arithmetic will know how to do the other items • Correlate Q1 to Q2 or Q4, all will give a high correlation

Doing it for real • Problem: how do we know which items are suspect? • Any item could be at fault • Not always obvious • Solution - check them all • Split half method • Cronbach’s Alpha

Split half approach • Basic idea: check one half of the test against the other half • If first half correlates well to the other half, then they are tapping into the same thing • Problem to overcome: each half of the test must be the same difficulty

Split half - procedure • Give a bunch of people your test • Decide on how to split the test in half • Correlate the halves • If the correlation is high (above 0.85), the test is reliable
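The procedure can be sketched in code as follows, using an odd/even split as one possible way of halving the test. The score table (six people, six items, each scored 1 to 5) is invented for illustration, and `split_half_r` is a hypothetical helper name.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def split_half_r(scores):
    """Correlate each person's odd-item total with their even-item total."""
    odd_totals = [sum(row[0::2]) for row in scores]   # items 1, 3, 5, ...
    even_totals = [sum(row[1::2]) for row in scores]  # items 2, 4, 6, ...
    return pearson_r(odd_totals, even_totals)

# Invented data: one row per person, one column per item.
scores = [
    [5, 4, 5, 4, 5, 4],
    [4, 4, 3, 4, 4, 3],
    [3, 3, 3, 2, 3, 3],
    [2, 2, 1, 2, 2, 1],
    [1, 2, 1, 1, 2, 1],
    [4, 5, 4, 5, 4, 4],
]

r_halves = split_half_r(scores)
print("split-half r =", round(r_halves, 3))
```

Swapping the slices for `row[:3]` and `row[3:]` gives a first-half versus last-half split instead; the choice of split is exactly the question the next slide raises.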

Where to split? • Problem: how do we split the test? • First 10 Q vs last 10? • Odd numbered Q vs Even numbered Q? • Any method is acceptable, as long as the halves are of equivalent difficulty • How do you show that? • Not by correlation - paradox! • (low r could be difficulty or reliability!)

Cronbach’s α coefficient • A major problem with split-half approach • How do you know that inside a half there aren’t a few bad items? • Catches most, but not all • Solution: Select another half to split at • But: if you have the same number of bad items in each half, they balance out - hidden!

The splitting headache • Imagine you have a few bad items, evenly spread in the test • If you use a first 3 / last 3 split, you end up with one bad item in each half, so they are balanced out (hidden) • If you use an even/odd split, they are balanced out as well (hidden) • How do you split? • (In the figure, black bars are bad items)

A solution to splitting • Remember: we don’t know which the bad ones are • Can’t make bizarre splits to work around them • Solution: brute force! • Work out the correlation for every possible split, and average them out!

Cronbach’s α • Not to be confused with α (prob of Type I error) from significance tests! • Works out the correlation for every possible split into halves, and averages them out • Impossible for bad items to “hide” by balancing out

Interpreting Cronbach’s α • Gives numbers between 0 and 1 • Needs to be very high (above 0.9) • It is a measure of homogeneity of the test • If your test is designed to measure more than one thing, the score will be low
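In practice the coefficient is rarely computed by literally averaging every split. The sketch below uses the standard variance-based formula instead, α = k/(k-1) × (1 - sum of item variances / variance of total scores), where k is the number of items. This is a swapped-in shortcut for the brute-force idea on the slides, not the averaging procedure itself. The score table is invented for illustration.

```python
import statistics

def cronbach_alpha(scores):
    """Cronbach's alpha for a score table: one row per person, one column per item."""
    k = len(scores[0])          # number of items
    items = list(zip(*scores))  # transpose: one tuple of scores per item
    item_variances = [statistics.variance(item) for item in items]
    total_variance = statistics.variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

# Invented data: six people answering six items, each scored 1 to 5.
scores = [
    [5, 4, 5, 4, 5, 4],
    [4, 4, 3, 4, 4, 3],
    [3, 3, 3, 2, 3, 3],
    [2, 2, 1, 2, 2, 1],
    [1, 2, 1, 1, 2, 1],
    [4, 5, 4, 5, 4, 4],
]

alpha = cronbach_alpha(scores)
print("alpha =", round(alpha, 2))
```

The items in this made-up table all move together from person to person, so α comes out above the 0.9 mark; replacing one column with unrelated scores pulls it down.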

Other forms of reliability • Kuder-Richardson formula 20 (KR20) • Like Cronbach’s alpha, but specialized for correct/incorrect type answers • Inter-scorer reliability • for judgement tests • to what degree do several judges agree on the answer • expressed as a correlation
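For the correct/incorrect case, KR20 can be sketched with its usual formula, KR20 = k/(k-1) × (1 - Σ p·q / variance of total scores), where p is the proportion of people answering an item correctly and q = 1 - p. The answer table below is invented for illustration, and the population variance of the totals is used here, which is one common convention.

```python
import statistics

def kr20(answers):
    """Kuder-Richardson formula 20 for 0/1 answers: one row per person, one column per item."""
    n_people = len(answers)
    k = len(answers[0])  # number of items
    # p = proportion correct for each item, q = 1 - p
    p_values = [sum(item) / n_people for item in zip(*answers)]
    sum_pq = sum(p * (1 - p) for p in p_values)
    # Population variance of the total scores (an assumed convention)
    total_variance = statistics.pvariance([sum(row) for row in answers])
    return (k / (k - 1)) * (1 - sum_pq / total_variance)

# Invented data: 1 = correct, 0 = incorrect; five people, four items.
answers = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

print("KR20 =", round(kr20(answers), 2))  # prints KR20 = 0.8
```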
