Reliability in Scales


Presentation Transcript


  1. Reliability in Scales • Reliability is a question of consistency • do we get the same numbers on repeated measurements? • Low reliability: reaction time • High reliability: measuring weight • Psychological tests fall in between

  2. Reliability & Measurement error • Measures are not perfectly reliable because they have error • The “built in accuracy” of the scale • Pokemon wristwatch vs. USN atomic clock • We can express this as: • X = T + e • X = your measurement • T = the “True score” • e = the error involved in measuring it (+ or -)

  3. Example: the effect of e • Imagine we have someone with a “true” intelligence score of 100 • If your intelligence scale has a large e, then your measurements will vary a lot (say from 60 all the way to 130) • If your scale has a small e, your measurements will vary only a little (say from 90 to 110)

  4. Measurements as distributions • Think of e as the variance of a distribution of repeated measurements X, centred on the true score T • Small e - scores clustered close to the true score • Large e - scores all over the place! (hard to say what the true score is)

  5. More on the error • Measures with a large e are dodgy (hides the true score) • We can reduce the size of e, but not eliminate it completely • Measuring reliability is measuring the impact of e

  6. Different forms of reliability • Reliability (“effect of e”) can be very hard to conceptualise • To help, we break it up into 2 subclasses • Temporal stability • If I measure you today and tomorrow, do I get the same result? • Internal consistency • Are all the questions in the test measuring the same thing?

  7. Temporal stability • The big idea: If I test you now, and then I test you tomorrow, I should get the same result • Why have it? • Can’t measure changes otherwise! • Tells us that we can trust results (small time error) • Tells us that there is no learning effect

  8. Measuring temporal stability • How can we measure if a test is temporally stable? • The problem: we have 2 sets of scores. We need to see if they are the same • Solution: Use a correlation. If the two sets are strongly related, then they are basically the same

  9. Example: Correlations & Stability • Imagine a test with ten questions, and a person does it twice (on Monday and Wednesday): • M: 5 6 5 3 4 8 6 4 8 7 • W: 4 7 4 3 4 8 6 3 9 5 • Are these scores the same? (r = 0.897)

  10. Example: correlations &amp; stability • Now imagine a crappy scale: • M: 5 6 5 3 4 8 6 4 8 7 • W: 5 3 2 2 8 5 8 1 4 4 • Are these scores basically the same? (r = 0.211)
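Both r values from these two slides can be checked directly; a quick sketch using only the Python standard library:

```python
from statistics import correlation  # available from Python 3.10

monday     = [5, 6, 5, 3, 4, 8, 6, 4, 8, 7]
wed_good   = [4, 7, 4, 3, 4, 8, 6, 3, 9, 5]  # the stable scale (slide 9)
wed_crappy = [5, 3, 2, 2, 8, 5, 8, 1, 4, 4]  # the crappy scale (slide 10)

print(round(correlation(monday, wed_good), 3))    # 0.897 -> strongly related
print(round(correlation(monday, wed_crappy), 3))  # 0.211 -> barely related
```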

  11. Different approaches to stability • There are two main ways of testing temporal stability • Test-retest method: give the same test to the same people • Alternate forms: give a highly similar test to the same people

  12. Test-retest method • Method: • 1. Select a group of people • 2. Give them your test • 3. Get them to come back later • 4. Give them the test again • 5. Correlate the results to see if they agree (a sketch follows below)
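Wrapped up as a helper, the whole procedure reduces to one correlation plus the cut-off the next slide recommends (the function name and the 0.85 default are illustrative, not a standard API):

```python
from statistics import correlation  # available from Python 3.10

def test_retest_reliability(scores_t1, scores_t2, threshold=0.85):
    """Correlate the same people's scores from two sessions.

    scores_t1[i] and scores_t2[i] must belong to the same person.
    """
    r = correlation(scores_t1, scores_t2)
    return r, r > threshold

# The slide-9 scores, treated as first and second sessions
r, stable = test_retest_reliability([5, 6, 5, 3, 4, 8, 6, 4, 8, 7],
                                    [4, 7, 4, 3, 4, 8, 6, 3, 9, 5])
print(round(r, 3), stable)  # 0.897 True -> above the 0.85 rule of thumb
```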

  13. Things to note • It must be the same people • We want to know that if client X returns, we can measure that person again • The amount of time between tests depends on your requirements • The correlation value must be very high - above 0.85

  14. Why it works • We get 2 results from each person to compare • this means we can rely on the test to work for the same people • We use a lot of people in our assessment • this means that we can rely on the test, regardless of who our client is • The correlation tells us the degree to which the 2 tests agree (R², the squared correlation, is the % they agree: r = 0.897 gives R² ≈ 0.80, i.e. about 80%)

  15. The learning effect • What if you have a test where learning/practice can affect your score? • e.g. a class test • The test-retest method will always yield poor correlations • people will score higher marks the second time around, and not all by the same amount, which scrambles the rankings • This will make it look as if temporal stability is poor!

  16. Alternate forms reliability • Answer: do test-retest, but don’t use the same test twice • Use a highly similar test • In order for this to work, both forms must be equally difficult • The more similar, the better

  17. Making alternate forms of a test • It is simple to ensure both forms are equally difficult • Write twice as many questions as you will want in the test • Randomly divide them up into two halves • Each half is a form of the test! • The random division keeps both forms equally difficult on average (a sketch follows below)
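A minimal sketch of the random division (the 20-item pool and its labels are invented):

```python
import random

random.seed(42)

# A hypothetical pool of twice as many items as one form needs
item_pool = [f"Q{i}" for i in range(1, 21)]  # 20 items -> two 10-item forms

random.shuffle(item_pool)  # random order, so neither half collects all the hard items
form_a = item_pool[:10]    # first half of the shuffled pool -> Form A
form_b = item_pool[10:]    # second half -> Form B

print(form_a)
print(form_b)
```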

  18. The procedure: alternate forms • Once you have your 2 forms: • Collect a sample of people • Give them the first form of the test • wait a while • Give them the second form of the test • Correlate the results • If the correlation is high (> .85), you have stability

  19. Which to use: alternate forms or test-retest? • If you are measuring something which can be learned/perfected by practice - alternate forms • If not, you could choose • Test-retest is preferable: it removes the confound of unequal difficulty • In many cases, you don’t really know if learning is an issue • Alternate forms is “safer”, but poorer statistically

  20. What if you don’t have temporal stability? • Temporal stability is not required for all tests • Most important for tests which work longitudinally • Very important if you want to track changes over time • Excludes all “once-off” tests (e.g. aptitude tests)

  21. Internal consistency • A different type of reliability • The big idea: Are all the questions in my test tapping into the same thing? • (or, are some questions irrelevant) • All tests require this property

  22. Why it’s important • Imagine we have an arithmetic ability test, with 4 questions: • 1. What is 5 x 3? • 2. What is 12 + 2 - 5? • 3. What is the capital of Ukraine? • 4. What is 5 x 2 + 3?

  23. Why it’s important • Item 3 does not contribute to measuring arithmetic • Someone who is a maths whiz (should get 4/4) might only get 3/4 • A complete maths idiot (should get 0/4) could get 1/4 • It does not belong in this test! • If we include it in our total, it will confuse us • Items such as this become “third variables”

  24. How do we know if an item belongs? • We need to figure out if a particular item is testing the same thing as the others • We can correlate the item’s scores with the scores of some other item we do know belongs • High correlation (above 0.85) - it tests the same thing • Low correlation (below 0.85) - it measures something else

  25. Our example again • Some people who know maths will also know geography • But not everyone! • Correlate Q1 to Q3 - it will be weak • Those who know arithmetic will know how to do the other items • Correlate Q1 to Q2 or Q4 - both will give a high correlation (a simulation follows below)
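A hypothetical simulation of that logic (the cohort, traits, and item probabilities are all invented): two items driven by the same arithmetic skill correlate, while the geography item does not.

```python
import random
from statistics import correlation  # available from Python 3.10

random.seed(0)

# Simulated cohort: arithmetic skill and geography knowledge vary independently
n = 200
arith = [random.random() for _ in range(n)]  # chance of acing an arithmetic item
geo = [random.random() for _ in range(n)]    # chance of knowing the capital

def answer(p):
    """1 if the item is answered correctly, 0 otherwise."""
    return 1 if random.random() < p else 0

q1 = [answer(p) for p in arith]  # "What is 5 x 3?"
q2 = [answer(p) for p in arith]  # "What is 12 + 2 - 5?"
q3 = [answer(p) for p in geo]    # "What is the capital of Ukraine?"

print(round(correlation(q1, q2), 2))  # clearly positive: both tap arithmetic
print(round(correlation(q1, q3), 2))  # near zero: Q3 taps something else
```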

  26. Doing it for real • Problem: how do we know which items are suspect? • Any item could be at fault • Not always obvious • Solution - check them all • Split half method • Cronbach’s Alpha

  27. Split half approach • Basic idea: check one half of the test against the other half • If first half correlates well to the other half, then they are tapping into the same thing • Problem to overcome: each half of the test must be the same difficulty

  28. Split half - procedure • Give a bunch of people your test • Decide on how to split the test in half • Correlate the halves • If the correlation is high (above 0.85), the test is reliable
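A sketch of the split-half procedure on invented 0/1 answer data, using an odd/even split:

```python
from statistics import correlation  # available from Python 3.10

answers = [  # hypothetical answers: one row per person, six items each
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 0],
    [1, 1, 0, 1, 1, 1],
    [0, 1, 1, 0, 1, 1],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

# Odd/even split: items 1, 3, 5 vs. items 2, 4, 6
half_a = [row[0] + row[2] + row[4] for row in answers]  # each person's odd-item total
half_b = [row[1] + row[3] + row[5] for row in answers]  # each person's even-item total

print(round(correlation(half_a, half_b), 2))  # 0.84: just under the 0.85 cut-off
```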

  29. Where to split? • Problem: how do we split the test? • First 10 Q vs last 10? • Odd numbered Q vs Even numbered Q? • Any method is acceptable, as long as the halves are of equivalent difficulty • How do you show that? • Not by correlation - paradox! • (low r could be difficulty or reliability!)

  30. Cronbach’s α coefficient • A major problem with the split-half approach • How do you know that inside a half there aren’t a few bad items? • Splitting catches most bad items, but not all • Solution: select another place to split at • But: if you have the same number of bad items in each half, they balance out - hidden!

  31. The splitting headache • Imagine you have a few bad items, evenly spread through the test • If you use a first-3/last-3 split, you end up with one bad item in each half, so they are balanced out (hidden) • If you use an even/odd split, they are balanced out as well (hidden) • How do you split? • [Figure: the test’s items drawn as bars; black bars are the bad items]

  32. A solution to splitting • Remember: we don’t know which the bad ones are • Can’t make bizarre splits to work around them • Solution: brute force! • Work out the correlations between every possible split, and average them out!

  33. Cronbach’s α • Not to be confused with α (the probability of a Type I error) from significance tests! • Works out the correlation between each half and every other possible half, and averages them out • Impossible for bad items to “hide” by balancing out
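In practice α is rarely computed by brute-force splitting; the standard shortcut uses item variances, α = k/(k-1) × (1 - Σ s²_item / s²_total). A sketch on the same invented answer data as the split-half example:

```python
from statistics import pvariance

def cronbach_alpha(answers):
    """Cronbach's alpha from raw item scores (one row per person)."""
    k = len(answers[0])                                   # number of items
    items = list(zip(*answers))                           # one column per item
    item_vars = sum(pvariance(col) for col in items)      # sum of item variances
    total_var = pvariance([sum(row) for row in answers])  # variance of total scores
    return (k / (k - 1)) * (1 - item_vars / total_var)

answers = [  # hypothetical answers: one row per person, six items each
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 0],
    [1, 1, 0, 1, 1, 1],
    [0, 1, 1, 0, 1, 1],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
print(round(cronbach_alpha(answers), 2))  # 0.82 for this toy data
```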

  34. Interpreting Cronbach’s  • Gives numbers between 0 and 1 • Needs to be very high (above 0.9) • It is a measure of homogeneity of the test • If your test is designed to measure more than one thing, the score will be low

  35. Other forms of reliability • Kuder-Richardson formula 20 (KR20) • Like Cronbach’s alpha, but specialized for correct/incorrect type answers • Inter-scorer reliability • for judgement tests • to what degree do several judges agree on the answer • expressed as a correlation
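KR-20 can be sketched the same way; for 0/1 items it gives the same value as Cronbach's α, via KR20 = k/(k-1) × (1 - Σ p·q / s²_total). The data below are invented:

```python
from statistics import pvariance

def kr20(answers):
    """KR-20 for dichotomous (0/1) items: one row of answers per person."""
    k = len(answers[0])
    items = list(zip(*answers))
    # p = proportion correct per item, q = 1 - p
    pq = sum((sum(col) / len(col)) * (1 - sum(col) / len(col)) for col in items)
    total_var = pvariance([sum(row) for row in answers])
    return (k / (k - 1)) * (1 - pq / total_var)

exam = [  # hypothetical correct/incorrect answers: five people, four items
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]
print(round(kr20(exam), 2))  # 0.70 for this toy data
```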
