Reliability and Validity


Measurement Error • Whatever measurement we make of a psychological construct, we make it with some amount of error • Any observed score for an individual is their true score plus error • There are different types of “error”, but here we are concerned with a measure’s inability to capture the true response for an individual • Observed score = True score + Error of measurement

Reliability • Reliability refers to a measure’s ability to capture an individual’s true score, i.e. to distinguish accurately one person from another • A reliable measure will be consistent, but consistency is better seen as a by-product of reliability: with perfect consistency (everyone gets the same score, and gets it repeatedly), reliability coefficients could not be calculated • There would be no variance/covariance from which to compute a correlation • The error in our analyses is due to individual differences but also to the measure not being perfectly reliable
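
A minimal simulation sketch of the idea above, assuming the classical test theory reading in which reliability is the share of observed-score variance that is true-score variance. The means, standard deviations, sample size, and helper names (`pearson`, `simulate_reliability`) are illustrative assumptions, not values from the slides.

```python
import random
import statistics


def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulate_reliability(n=5000, true_sd=10.0, error_sd=5.0, seed=1):
    """Simulate Observed = True + Error for two administrations of a measure."""
    rng = random.Random(seed)
    true = [rng.gauss(100.0, true_sd) for _ in range(n)]
    # Two administrations share the true score but have independent errors.
    obs1 = [t + rng.gauss(0.0, error_sd) for t in true]
    obs2 = [t + rng.gauss(0.0, error_sd) for t in true]
    theoretical = true_sd**2 / (true_sd**2 + error_sd**2)
    variance_ratio = statistics.pvariance(true) / statistics.pvariance(obs1)
    return theoretical, variance_ratio, pearson(obs1, obs2)


theoretical, variance_ratio, parallel_r = simulate_reliability()
print(f"theoretical reliability:       {theoretical:.3f}")
print(f"var(true) / var(observed):     {variance_ratio:.3f}")
print(f"correlation of two measurings: {parallel_r:.3f}")
```

Under these assumptions the three numbers should land close to one another. If every simulated person had the same score, the variances in `pearson` would be zero and the coefficient would be undefined, which is the "no variance/covariance" point made above.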

Reliability • Criteria of reliability • Test-retest • Test components (internal consistency) • Test-retest reliability • Consistency of measurement for individuals over time • Individuals score similarly, e.g. today and 6 months from now • Issues • Memory • If the two administrations are too close in time, the correlation between scores may reflect memory of item responses rather than the true score • Chance covariation • In a sample, any two variables will almost always show some non-zero correlation by chance alone • Reliability is not constant across subsets of a population • General IQ scores: good reliability • IQ scores for college students: less reliable • Restriction of range, fewer individual differences
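
The restriction-of-range point can be illustrated with a small simulation. This is a sketch under assumed values (an IQ-like scale with mean 100 and SD 15, a hypothetical cutoff of 115 standing in for a selected subgroup, and a fixed error SD); none of these numbers come from the slides.

```python
import random


def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def retest_correlation(true_scores, error_sd, rng):
    """Correlate two administrations that share true scores but not errors."""
    time1 = [t + rng.gauss(0.0, error_sd) for t in true_scores]
    time2 = [t + rng.gauss(0.0, error_sd) for t in true_scores]
    return pearson(time1, time2)


rng = random.Random(42)
# Assumed general population of true scores.
population = [rng.gauss(100.0, 15.0) for _ in range(20000)]
# Hypothetical selected subgroup: only high true scores, so less spread.
restricted = [t for t in population if t > 115.0]

r_full = retest_correlation(population, 7.5, rng)
r_restricted = retest_correlation(restricted, 7.5, rng)
print(f"test-retest r, full population:  {r_full:.2f}")
print(f"test-retest r, restricted group: {r_restricted:.2f}")
```

The measurement error is identical in both groups. Only the spread of true scores differs, which is why the same instrument looks less reliable in the restricted group.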

Internal Consistency • We can get a sort of average correlation among items to assess the reliability of some measure • As one would intuitively assume, having more measures of something is better than having few • Adding items that correlate with one another will increase the test’s reliability
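
One common way to turn an average inter-item correlation into a reliability coefficient is the standardized form of Cronbach's alpha. The slides do not name a specific coefficient, so treating it as alpha is an assumption here, as are the function name and the example value of 0.3 for the mean correlation.

```python
def standardized_alpha(k, mean_r):
    """Standardized alpha for k items with mean inter-item correlation mean_r."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return k * mean_r / (1.0 + (k - 1) * mean_r)


# Hold the average inter-item correlation fixed and vary the test length.
mean_r = 0.3  # assumed value, for illustration only
for k in (2, 5, 10, 20):
    print(f"{k:2d} items -> alpha = {standardized_alpha(k, mean_r):.3f}")
```

With the average correlation held constant, the coefficient rises as items are added, which matches the claim that more inter-correlated items increase reliability.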

What’s good reliability? • While we have conventions, it really depends • As mentioned, the reliability of a measure may differ across groups of people • We may need to compare reliability to that of existing measures deemed ‘good’, and get interval estimates to assess the uncertainty in our reliability estimate • Note also that reliability estimates are biased upward and so are a bit optimistic • Also, many of our techniques do not take the reliability of our measures into account, and poor reliability can result in lower statistical power, i.e. an increase in type II error • Though technically, increasing reliability can potentially also lower power
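
Two of the points above can be sketched numerically. The first function gives an approximate interval for a reliability estimate that is itself a correlation (such as test-retest), using the Fisher z transformation. The second applies Spearman's attenuation formula to show how unreliable measures shrink an observed correlation, which is one route to lower power. Both function names and all input values are illustrative assumptions. Other reliability coefficients need other interval methods.

```python
import math
from statistics import NormalDist


def fisher_interval(r, n, level=0.95):
    """Approximate confidence interval for a correlation via Fisher's z."""
    if not -1.0 < r < 1.0 or n <= 3:
        raise ValueError("need -1 < r < 1 and n > 3")
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    crit = NormalDist().inv_cdf(0.5 + level / 2.0)
    return math.tanh(z - crit * se), math.tanh(z + crit * se)


def attenuated_r(true_r, rel_x, rel_y):
    """Expected observed correlation given the reliabilities of x and y."""
    return true_r * math.sqrt(rel_x * rel_y)


# Hypothetical test-retest estimate of 0.80 from 50 people.
low, high = fisher_interval(0.80, 50)
print(f"95% interval for r = 0.80, n = 50: ({low:.2f}, {high:.2f})")

# Hypothetical true correlation of 0.50, both measures with reliability 0.70.
print(f"expected observed r: {attenuated_r(0.50, 0.70, 0.70):.2f}")
```

The width of the interval at modest sample sizes is a reminder that a single reliability estimate carries real uncertainty.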

Replication and Reliability • While reliability implies replicability, assessing reliability does not provide a probability of replication • Note also that statistical significance is not a measure of reliability or replicability • Replication is perhaps not conducted as much as it should be in psychology, for a number of reasons • Practical concerns, lack of publishing outlets etc. • Furthermore, knowing that our estimates are themselves biased and variable, we might in many cases not even expect consistent research findings • In psychology, many people spend a lot of time debating back and forth about the merits of some theory, citing cases where it did or did not replicate • However, the lack of replication could be due to low power, low reliability, problem data, incorrectly carrying out the experiment etc. • In other words, the result failed to replicate because of methodology, not because the theory was wrong
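
A rough simulation sketch of how a real effect can fail to replicate for methodological reasons. It assumes a true correlation of 0.3, samples of 50, and a significance check based on the Fisher z approximation rather than an exact test. These values and the helper names are illustrative assumptions.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def significant(r, n, crit=1.96):
    """Approximate two-sided test of r = 0 at the 5% level (Fisher z)."""
    return abs(math.atanh(r)) * math.sqrt(n - 3) > crit


def replication_rate(rho, reliability, n=50, studies=1000, seed=0):
    """Share of simulated studies that detect a true correlation rho."""
    rng = random.Random(seed)
    # Error SD chosen so var(true) / var(observed) equals the reliability.
    error_sd = math.sqrt((1.0 - reliability) / reliability)
    hits = 0
    for _ in range(studies):
        xs, ys = [], []
        for _ in range(n):
            tx = rng.gauss(0.0, 1.0)
            ty = rho * tx + math.sqrt(1.0 - rho**2) * rng.gauss(0.0, 1.0)
            xs.append(tx + rng.gauss(0.0, error_sd))
            ys.append(ty + rng.gauss(0.0, error_sd))
        if significant(pearson(xs, ys), n):
            hits += 1
    return hits / studies


high = replication_rate(0.3, reliability=1.0)
low = replication_rate(0.3, reliability=0.6)
print(f"share significant, perfect reliability: {high:.2f}")
print(f"share significant, reliability of 0.6:  {low:.2f}")
```

The effect is real in every simulated study, yet a sizable share of them miss it, and the share grows as reliability drops.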

Factors affecting the utility of replications • You can’t step in the same river twice! • Heraclitus • When • Later replications provide less new information, but they can still contribute greatly to the overall assessment of an effect • Meta-analysis • How • There is no perfect replication (different people involved, time it takes to conduct etc.) • Doing an ‘exact’ replication gives us more confidence in the original finding (should it hold), but may not offer much in the way of generalization • Example: doing a gender difference study at UNT over and over. Does it work for non-college folk? People outside of Texas?

Factors affecting the utility of replications • By whom • It is well known that those with a vested interest in some idea tend to find confirming evidence more often than those without one • Replications by others are still carried out by people with an interest in that research topic, and so may have a ‘precorrelation’ inherent in their attempt • Direct: correlation of attributes of persons involved • Indirect: correlation of data to be obtained • Gist: we can’t have truly independent replication attempts, but must strive to minimize bias • The more independent replication attempts are, the more informative they will be

Validity • Validity refers to the question of whether our measurements are actually capturing the construct we think they are • While we can obtain specific statistics for reliability (even different types), validity is more of a global assessment based on the evidence available • We can have reliable measurements that are invalid • Classic example: a scale that is consistent and able to distinguish one person from the next, but is actually off by 5 pounds
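
The bathroom-scale example can be made concrete with a short simulation. The weights, the noise level, and the sample size below are made-up values used only to illustrate the distinction.

```python
import random
import statistics


def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(3)
# Simulated true weights in pounds (assumed distribution).
true_weights = [rng.gauss(160.0, 25.0) for _ in range(1000)]
# The scale adds a constant 5 pounds plus a little random noise.
readings = [w + 5.0 + rng.gauss(0.0, 0.5) for w in true_weights]

r = pearson(true_weights, readings)
bias = statistics.fmean([rd - w for rd, w in zip(readings, true_weights)])
print(f"correlation with true weight: {r:.4f}")
print(f"average error in pounds:      {bias:.2f}")
```

The correlation is close to 1, so the scale orders people almost perfectly, while every reading is still about 5 pounds too high. A correlation-based reliability statistic cannot see the constant offset.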

Validity Criteria in Psychological Testing • Content validity • Criterion validity • Concurrent • Predictive • Construct-related validity • Convergent • Discriminant • Content validity • Items represent the kinds of material (or content areas) they are supposed to represent • Are the questions any good, in the sense that they cover all domains of a given construct? • E.g. job satisfaction = salary, relationship w/ boss, relationship w/ coworkers etc.

Validity Criteria in Psychological Testing • Criterion validity • The degree to which the measure correlates with various outcomes • Does some new personality measure correlate with the Big 5? • Concurrent • Criterion is in the present • Measure of ADHD and current scholastic behavioral problems • Predictive • Criterion is in the future • SAT and college GPA

Validity Criteria in Psychological Testing • Construct-related validity • To what extent is it an actual measure of the construct of interest? • Convergent • Correlates well with other measures of the construct • Depression scale correlates well with other depression scales • Discriminant • Is distinguished from related but distinct constructs • Depression scale != stress scale
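
A simulated illustration of the convergent and discriminant pattern described above. The latent correlation of 0.4 between depression and stress, the noise level, and the scale names are all hypothetical. No real instruments or data are involved.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(7)
latent_corr = 0.4  # assumed correlation between the two constructs
dep_a, dep_b, stress = [], [], []
for _ in range(2000):
    d = rng.gauss(0.0, 1.0)  # latent depression
    s = latent_corr * d + math.sqrt(1.0 - latent_corr**2) * rng.gauss(0.0, 1.0)
    # Two depression scales and one stress scale, each with its own error.
    dep_a.append(d + rng.gauss(0.0, 0.5))
    dep_b.append(d + rng.gauss(0.0, 0.5))
    stress.append(s + rng.gauss(0.0, 0.5))

convergent = pearson(dep_a, dep_b)
discriminant = pearson(dep_a, stress)
print(f"depression A vs depression B: {convergent:.2f}")
print(f"depression A vs stress:       {discriminant:.2f}")
```

Evidence for construct validity would look like this pattern: a high correlation between measures of the same construct and a clearly lower one with a related but distinct construct.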

Validity Criteria in Experimentation • Statistical conclusion validity • Is there a relationship between X and Y to begin with? • Correlation is our starting point (i.e. correlation isn’t causation, but is a prerequisite for inferring it) • Related to this is the question of whether the study was sufficiently sensitive to pick up on the correlation • Internal validity • Has the study been conducted so as to rule out other effects which were controllable? • Poor instruments, experimenter bias • External validity • Will the relationship be seen in other settings? • Construct validity • Same concerns as before • Ex. Is reaction time an appropriate measure of learning?

Summary • Reliability and validity are key concerns in psychological research • Part of the problem in psychology is the lack of reliable measures of the things we are interested in • Assuming that they are valid to begin with, we must always press for more reliable measures if we are to progress scientifically • This means letting go of supposed ‘standards’ when they are no longer as useful and looking for ways to improve current ones
