Can five be enough? Sample sizes in usability tests

Can five be enough? Sample sizes in usability tests • Paul Cairns and Caroline Jarrett

Problem: usability studies have small samples • Good experiments: 30+ participants (Ps) • Typical usability studies: ~5 Ps • Moving to 3 Ps! • How?! What?! • Typically, I have conniptions • CJ asked me to solve it!

UX people like small samples • Common practice (CJ) • Krug (2010), 3 (“a morning a month”) • Tullis & Albert, 6 to 8 (formative) • Virzi (1992) • Nielsen (1993) • 7 experts ≈ 5 experts

Use probabilities to suggest sample sizes • Total number of problems, K • Probability of problem discovery, p • Find n so that 1 - (1-p)^n reaches x% of the K problems • Binomial distribution • n is our sample size • p = 0.16, 0.22, 0.41, 0.6, n ≈ 5
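A minimal sketch of the discovery model on this slide, assuming every problem is found independently by each participant with the same probability p. The function names and the 80% target are illustrative assumptions, not taken from the slides; the p values are the ones quoted above.

```python
import math


def discovery_proportion(p: float, n: int) -> float:
    """Expected share of problems found by n participants: 1 - (1 - p)^n."""
    return 1.0 - (1.0 - p) ** n


def min_sample_size(p: float, target: float) -> int:
    """Smallest n for which 1 - (1 - p)^n >= target."""
    if not (0.0 < p < 1.0) or not (0.0 < target < 1.0):
        raise ValueError("p and target must lie strictly between 0 and 1")
    # Closed-form starting point, then adjust for floating-point rounding.
    n = max(1, math.ceil(math.log(1.0 - target) / math.log(1.0 - p)))
    while discovery_proportion(p, n) < target:
        n += 1
    while n > 1 and discovery_proportion(p, n - 1) >= target:
        n -= 1
    return n


# Assumed target: find 80% of the K problems (hypothetical choice).
for p in (0.16, 0.22, 0.41, 0.6):
    print(f"p = {p}: n = {min_sample_size(p, 0.8)}")
```

The answer is sensitive to both p and the chosen target, so a different target gives different sample sizes.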

The models can be refined • What is p for your system? • p can be small (Spool & Schroeder, 2001) • Bootstrap • Is p constant for all problems? • More complex models • Are all participants equally good? • Tend to increase n
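One way to see why the refinements tend to increase n: if p differs between problems, the expected share found is the average of 1 - (1 - p_i)^n over the problems, which is never higher than the constant-p figure for the same mean p. The sketch below uses made-up p values and an assumed 80% target purely for illustration; it is not one of the published refined models.

```python
def expected_share_found(ps: list[float], n: int) -> float:
    """Average of 1 - (1 - p_i)^n over problems with discovery rates ps."""
    return sum(1.0 - (1.0 - p) ** n for p in ps) / len(ps)


def min_n(ps: list[float], target: float, max_n: int = 1000) -> int:
    """Smallest n whose expected share of problems found reaches target."""
    for n in range(1, max_n + 1):
        if expected_share_found(ps, n) >= target:
            return n
    raise ValueError("target not reached within max_n participants")


# Hypothetical discovery rates: both sets have mean p = 0.3.
constant_p = [0.3, 0.3, 0.3, 0.3]
mixed_p = [0.1, 0.1, 0.5, 0.5]

print("constant p:", min_n(constant_p, 0.8))
print("mixed p:   ", min_n(mixed_p, 0.8))
```

With these illustrative numbers the hard-to-find problems dominate, so the mixed case needs more participants than the constant-p case.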

The models have conceptual flaws • Is p meaningful? • Independence of discovery • Discovery is probabilistic • What’s the probability space? • Problem classification

A usability test can be an experiment • Conduct like an experiment • Need an alternative hypothesis • Measure (quantitatively) one thing • Carefully defined tasks • Manipulate the interface • Use statistics to identify true variation

Example questions • Is task quicker on new design? • Does design increase click-throughs? • Are errors below a threshold rate? • Is performance comparable in new design? • Can you prove this design is worth it?

Why use an experiment? • Good for: when reasoning is not enough; good beliefs for improvements; finessing • Not for: show-stoppers; large effects; anything but the alternative hypothesis

Usability tests are more about better designs • Move to new technology • Design well • Reach a point of plausibility • Competing considerations • Test!

There are different argument styles • Deduction: X causes Y; X hence Y • Induction: From instances of X and Y, when I see X, I infer Y. • Abduction: X causes Y; Y hence? • Explanation seeking • Peirce: “matted felt of pure hypothesis” • Sherlock Holmes does abduction!

Solutions arise from abduction • Users act in response to system • Features cause good/bad outcomes • Abduce explanations • More experience, better explanations

So what should be the sample size? • H = “X is good” • Null: p(H) = 0.5 • Five people are enough: • H does not hold for 5 people • (0.5)^5 = 1/32 < 1/20 hence significant • This is false!
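The arithmetic behind the "five is enough" argument can be checked with a short sketch. It only reproduces the calculation on the slide, which the slide itself goes on to reject; the function name and the use of exact fractions are illustrative choices.

```python
from fractions import Fraction

ALPHA = Fraction(1, 20)  # the conventional 5% significance level


def prob_h_fails_for_all(n: int, p_h: Fraction = Fraction(1, 2)) -> Fraction:
    """Probability under the null (p(H) = 0.5) that H fails for all n people."""
    return (1 - p_h) ** n


# Smallest n for which the all-fail outcome drops below the 5% level.
smallest_n = next(n for n in range(1, 100) if prob_h_fails_for_all(n) < ALPHA)

print(prob_h_fails_for_all(5))  # exact value for five people
print(smallest_n)
```

Four people give 1/16, which is above 1/20, so five is the first sample size at which this particular calculation crosses the threshold.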

Usability tests sit in a cloud of hypotheses • Usability as a privative • Every feature is contingently usable • Any falsification forces revision (Popper) • Kuhnian resistance • Neo-Popperian (Deutsch) • Falsification + narratives (explanations)

Sample size depends on explanation • Plausible sample sizes • Show-stopper: 1 • Unexpected but plausible: 3-5 • No explanation: many • Different behaviour, same explanation • ROI

Why usability tests might look like experiments • Control in experiments • Causal attribution • Coverage • Observation • Piggy-backing

Questions?
