Reliability in Scales • Reliability is a question of consistency • do we get the same numbers on repeated measurements? • Low reliability: reaction time • High reliability: measuring weight • Psychological tests fall in between
Reliability & Measurement error • Measures are not perfectly reliable because they have error • The “built in accuracy” of the scale • Pokemon wristwatch vs. USN atomic clock • We can express this as: • X = T + e • X = your measurement • T = the “True score” • e = the error involved in measuring it (+ or -)
Example: the effect of e • Imagine we have someone with a “true” intelligence score of 100. • If your intelligence scale has a large e, then your measurements will vary a lot (say from 60 all the way to 130) • If your scale has a small e, your measurements will vary a little (say from 90 to 110)
Measurements as distributions • Think of e as the variance of a distribution of measurements X, centred on the true score T • Small e: scores clustered close to the true score • Large e: scores all over the place! (hard to say what the true score is)
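A minimal simulation sketch of X = T + e, to make the small-e versus large-e picture concrete. It assumes the error is normally distributed with mean 0, which the slides do not state; the function name `simulate_measurements` and the error sizes 5 and 20 are illustrative choices, not values from the source.

```python
import random
import statistics


def simulate_measurements(true_score, error_sd, n, seed):
    """Return n simulated measurements X = T + e.

    Assumption (not from the slides): e is drawn from a normal
    distribution with mean 0 and standard deviation error_sd.
    """
    rng = random.Random(seed)
    return [true_score + rng.gauss(0, error_sd) for _ in range(n)]


# The same person (true score 100) measured 1000 times on two hypothetical scales.
small_e = simulate_measurements(true_score=100, error_sd=5, n=1000, seed=1)
large_e = simulate_measurements(true_score=100, error_sd=20, n=1000, seed=1)

print("small e: mean", round(statistics.mean(small_e), 1),
      "spread (sd)", round(statistics.stdev(small_e), 1))
print("large e: mean", round(statistics.mean(large_e), 1),
      "spread (sd)", round(statistics.stdev(large_e), 1))
```

Both sets of measurements centre on roughly the same value, but the large-e scale scatters them much more widely, so any single measurement says less about the true score.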
More on the error • Measures with a large e are dodgy (hides the true score) • We can reduce the size of e, but not eliminate it completely • Measuring reliability is measuring the impact of e
Different forms of reliability • Reliability (“effect of e”) can be very hard to conceptualise • To help, we break it up into 2 subclasses • Temporal stability • If I measure you today and tomorrow, do I get the same result? • Internal consistency • Are all the questions in the test measuring the same thing?
Temporal stability • The big idea: If I test you now, and then I test you tomorrow, I should get the same result • Why have it? • Can’t measure changes otherwise! • Tells us that we can trust results (small time error) • Tells us that there is no learning effect
Measuring temporal stability • How can we measure if a test is temporally stable? • The problem: we have 2 sets of scores. We need to see if they are the same • Solution: Use a correlation. If the two sets are strongly related, then they are basically the same
Example: Correlations & Stability • Imagine a test with ten questions, and a person does it twice (on Monday and Wednesday): • M: 5 6 5 3 4 8 6 4 8 7 • W: 4 7 4 3 4 8 6 3 9 5 • Are these scores the same? (r = 0.897)
Example: correlations & stability • Now imagine a crappy scale: • M: 5 6 5 3 4 8 6 4 8 7 • W: 5 3 2 2 8 5 8 1 4 4 • Are these scores basically the same? (r = 0.211)
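The two r values in these examples can be reproduced with a short Pearson correlation sketch. The helper name `pearson_r` is my own; the score lists are the Monday/Wednesday numbers from the slides.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equally long lists of scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations, and sums of squared deviations.
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / math.sqrt(ss_x * ss_y)


monday = [5, 6, 5, 3, 4, 8, 6, 4, 8, 7]
wednesday_good = [4, 7, 4, 3, 4, 8, 6, 3, 9, 5]   # the stable test
wednesday_bad = [5, 3, 2, 2, 8, 5, 8, 1, 4, 4]    # the unstable scale

r_good = pearson_r(monday, wednesday_good)
r_bad = pearson_r(monday, wednesday_bad)
print(round(r_good, 3))  # 0.897
print(round(r_bad, 3))   # 0.211
```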
Different approaches to stability • There are two main ways of testing temporal stability • Test-retest method: give the same test to the same people • Alternate forms: give a highly similar test to the same people
Test-retest method • Method: • 1. Select a group of people • 2. Give them your test • 3. Get them to come back later • 4. Give them the test again • 5. Correlate the results to see if the two sets of scores agree
Things to note • It must be the same people • We want to know that if client X returns, we can measure that person again • The amount of time between tests depends on your requirements • The correlation value must be very high - above 0.85
Why it works • We get 2 results from each person to compare • this means we can rely on the test to work for the same people • We use a lot of people in our assessment • this means that we can rely on the test, regardless of who our client is • The correlation tells us the degree to which the 2 tests agree (r² is the proportion of variance they share)
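As a worked illustration, squaring the two correlations from the Monday/Wednesday examples (r = 0.897 and r = 0.211) shows how differently the two scales behave once r is turned into a proportion of shared variance:

```latex
\begin{align*}
r = 0.897 &\;\Rightarrow\; r^2 = 0.897^2 \approx 0.805 \quad (\text{about } 80\% \text{ shared variance}) \\
r = 0.211 &\;\Rightarrow\; r^2 = 0.211^2 \approx 0.045 \quad (\text{about } 4.5\% \text{ shared variance})
\end{align*}
```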
The learning effect • What if you have a test where learning/practice can affect your score? • Eg. class test • The test-retest method will tend to yield poor correlations • people tend to score higher marks the second time around, and not all by the same amount • This will make it look as if temporal stability is poor!
Alternate forms reliability • Answer: do test-retest, but don’t use the same test twice • Use a highly similar test • In order for this to work, both forms must be equally difficult • The more similar, the better
Making alternate forms of a test • Simple to make both forms about equally difficult • Make twice as many questions as you will want in the test • Randomly divide them up into two halves • Each half is a test! • The random division makes both forms equally difficult on average
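The random division step can be sketched as follows. The function name `make_alternate_forms` and the placeholder item pool are hypothetical; the optional seed is only there to make the split reproducible.

```python
import random


def make_alternate_forms(item_pool, seed=None):
    """Randomly divide a pool of questions into two equally sized forms.

    item_pool should hold twice as many questions as one form needs.
    """
    rng = random.Random(seed)
    shuffled = list(item_pool)   # copy, so the original pool is left untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Hypothetical pool of 20 questions for a 10-question test.
pool = [f"Q{i}" for i in range(1, 21)]
form_a, form_b = make_alternate_forms(pool, seed=42)
print(form_a)
print(form_b)
```

Every question lands in exactly one form, and neither form is favoured when the harder questions are dealt out.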
The procedure: alternate forms • Once you have your 2 forms: • Collect a sample of people • Give them the first form of the test • wait a while • Give them the second form of the test • Correlate the results • If the correlation is high (> .85), you have stability
Which to use: alternate forms or test-retest? • If you are measuring something which can be learned/perfected by practice - alternate forms • If not, you could choose • Test-retest is preferable, removes confound about difficulty • In many cases, you don’t really know if learning is an issue • Alternate forms is “safer”, but poorer statistically
What if you don’t have temporal stability? • Temporal stability is not required for all tests • Most important for tests which work longitudinally • Very important if you want to track changes over time • Excludes all “once off” tests (eg. aptitude tests)
Internal consistency • A different type of reliability • The big idea: Are all the questions in my test tapping into the same thing? • (or, are some questions irrelevant) • All tests require this property
Why it’s important • Imagine we have an arithmetic ability test, with 4 questions: • 1. What is 5 x 3 • 2. What is 12 + 2 - 5 • 3. What is the capital of the Ukraine • 4. What is 5 x 2 + 3
Why it’s important • Item 3 does not contribute to measuring arithmetic • Someone who is a maths wiz (should get 4/4) might only get 3/4 • A complete maths idiot (should get 0/4) could get 1/4 • It does not belong in this test! • If we include it in our total, it will confuse us • Items such as this become “third variables”
How do we know if an item belongs? • We need to figure out if a particular item is testing the same thing as the others • We can correlate the item’s scores with the scores of some other item we do know belongs • High correlation (above 0.85) - it tests the same thing • Low correlation (below 0.85) - it measures something else
Our example again • Some people who know maths, will also know geography • But not everyone! • Correlate Q1 to Q3 - it will be weak • Those who know arithmetic will know how to do the other items • Correlate Q1 to Q2 or Q4, all will give a high correlation
Doing it for real • Problem: how do we know which items are suspect? • Any item could be at fault • Not always obvious • Solution - check them all • Split half method • Cronbach’s Alpha
Split half approach • Basic idea: check one half of the test against the other half • If first half correlates well to the other half, then they are tapping into the same thing • Problem to overcome: each half of the test must be the same difficulty
Split half - procedure • Give a bunch of people your test • Decide on how to split the test in half • Correlate the halves • If the correlation is high (above 0.85), the test is reliable
Where to split? • Problem: how do we split the test? • First 10 Q vs last 10? • Odd numbered Q vs Even numbered Q? • Any method is acceptable, as long as the halves are of equivalent difficulty • How do you show that? • Not by correlation - paradox! • (low r could be difficulty or reliability!)
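A minimal sketch of the split-half procedure using an odd/even split. The response matrix is made-up data (8 people, 6 right/wrong items), and the helper names `pearson_r` and `odd_even_halves` are my own.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equally long lists of scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / math.sqrt(ss_x * ss_y)


def odd_even_halves(responses):
    """Return each person's score on odd-numbered and even-numbered questions."""
    odd_scores = [sum(person[0::2]) for person in responses]   # Q1, Q3, Q5, ...
    even_scores = [sum(person[1::2]) for person in responses]  # Q2, Q4, Q6, ...
    return odd_scores, even_scores


# Hypothetical data: one row per person, one column per question (1 = correct).
responses = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 1],
    [1, 1, 0, 1, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

odd_scores, even_scores = odd_even_halves(responses)
split_half_r = pearson_r(odd_scores, even_scores)
print(round(split_half_r, 2))  # 0.78
```

For this made-up data the halves correlate at about 0.78, which would fall short of the 0.85 cut-off used in these slides.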
Cronbach’s coefficient • A major problem with split-half approach • How do you know that inside a half there aren’t a few bad items? • Catches most, but not all • Solution: Select another half to split at • But: if you have the same number of bad items in each half, they balance out - hidden!
The splitting headache • Imagine you have a few bad items, evenly spread in the test • If you use a first 3 / last 3 split, you end up with one bad item in each half, so they are balanced out (hidden) • If you use an even/odd split, they are balanced out as well (hidden) • How do you split? • (Figure: black bars are bad items)
A solution to splitting • Remember: we don’t know which the bad ones are • Can’t make bizarre splits to work around them • Solution: brute force! • Work out the correlations between every possible split, and average them out!
Cronbach’s α • Not to be confused with α (prob of Type I error) from significance tests! • Works out the correlation between each half and each other half, and averages them out • Impossible for bad items to “hide” by balancing out
Interpreting Cronbach’s α • Usually gives numbers between 0 and 1 (it can go negative if items are negatively related to each other) • Needs to be very high (above 0.9) • It is a measure of homogeneity of the test • If your test is designed to measure more than one thing, the score will be low
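A minimal sketch of computing Cronbach's alpha. Rather than literally averaging over every split, it uses the standard variance-based formula, alpha = k/(k-1) × (1 − sum of item variances / variance of total scores), which is how the coefficient is usually calculated in practice. The ratings are made-up data (5 people, 1 to 5 scale) and the function name `cronbach_alpha` is my own.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of per-person lists of item scores."""
    k = len(responses[0])                      # number of items
    items = list(zip(*responses))              # one tuple of scores per item
    sum_item_var = sum(variance(item) for item in items)
    total_var = variance([sum(person) for person in responses])
    return (k / (k - 1)) * (1 - sum_item_var / total_var)


# Hypothetical data: 4 items that all tap the same thing.
consistent = [
    [4, 5, 4, 5],
    [3, 3, 4, 3],
    [2, 2, 1, 2],
    [5, 4, 5, 5],
    [1, 2, 2, 1],
]

# Same people, with one extra item that is unrelated to the rest.
unrelated_item = [3, 1, 5, 2, 4]
with_bad_item = [row + [extra] for row, extra in zip(consistent, unrelated_item)]

alpha_consistent = cronbach_alpha(consistent)
alpha_with_bad_item = cronbach_alpha(with_bad_item)
print(round(alpha_consistent, 2))      # 0.96
print(round(alpha_with_bad_item, 2))   # 0.65
```

Adding a single item that does not belong pulls alpha from about 0.96 down to about 0.65 in this toy data, which is the kind of drop the coefficient is meant to expose.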
Other forms of reliability • Kuder-Richardson formula 20 (KR20) • Like Cronbach’s alpha, but specialized for correct/incorrect type answers • Inter-scorer reliability • for judgement tests • to what degree do several judges agree on the answer • expressed as a correlation
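A minimal sketch of KR-20 for right/wrong items, using the usual formula KR-20 = k/(k-1) × (1 − Σpq / variance of total scores), where p is the proportion of people getting an item right and q = 1 − p. The response matrix is made-up data and the function name `kr20` is my own.

```python
from statistics import pvariance


def kr20(responses):
    """KR-20 for a list of per-person lists of 0/1 item scores."""
    n = len(responses)
    k = len(responses[0])
    sum_pq = 0.0
    for i in range(k):
        p = sum(person[i] for person in responses) / n   # proportion correct
        sum_pq += p * (1 - p)                            # item variance for 0/1 data
    # Population variance of total scores, to match the p*q item variances.
    total_var = pvariance([sum(person) for person in responses])
    return (k / (k - 1)) * (1 - sum_pq / total_var)


# Hypothetical data: one row per person, one column per question (1 = correct).
responses = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 1],
    [1, 1, 0, 1, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

kr20_value = kr20(responses)
print(round(kr20_value, 2))  # 0.79
```

For 0/1 items this gives the same number as Cronbach's alpha, since p × q is exactly the variance of a right/wrong item.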