Teaching the statistical investigation process with simulation-based inference Beth Chance, Cal Poly - San Luis Obispo; Nathan Tintle, Dordt College
Introductions • Beth • Nathan
Goals • What/why SBI? (11:00-11:30 ET) • One proportion examples and where to from here (11:30-11:50) • Q+A (11:50-12:00) • Two group simulation (12:00-12:15) • How to assess and what student performance looks like (12:15-12:35) • How to get started/get more information; Q+A (12:35-12:45)
Brief and select history of stat ed • Consensus approach for intro stats by late 1990s, but nexus in early 1980s • Descriptive Statistics • Probability/Design/Sampling Distributions • Inference (testing and intervals) • GAISE College Report (2005) • Six pedagogical suggestions for Stat 101: Conceptual understanding, Active learning, Real data, Statistical literacy and thinking, Use technology, and Use assessments for learning
Brief history of stat ed • No real pressure to change content • Major changes • Increased computational resources for data collection and analysis • Recognition of the utility of simulation to enhance student understanding of random processes • Assessment results illustrating that students don’t really (a) improve much pre to post-course on standardized tests of statistical thinking or (b) retain much from a typical introductory statistics course
Intro Stat as a Cathedral of Tweaks (A George Cobb analogy) Boswell, a biographer, famously described Samuel Johnson as a “cathedral of tics” due to his gesticulations and tics Thesis: The usual normal distribution-worshipping intro course is a cathedral of tweaks.
The orthodox doctrine • The orthodox doctrine is simple • Central limit theorem justifies use of normal distribution • If observed statistic is in the tails (>2SEs), reject null • Confidence interval is estimate +/- 2SEs
The Cathedral of Tweaks (a) One tower: z vs t If we know the population SD we use z If we estimate the SD we use t… except for proportions; then we use z, not t, even when we estimate the SD… …except when you do tests on proportions, then use the null value
Still More Tweaks • Another tower: If your data set is not normal you may need to transform • Another tower: If you work with small samples there are guidelines for when you can use methods based on the normal, e.g., n > 30, or np > 5 and n(1-p) > 5
(b) Another tower: n vs. (n-1) vs. (n-2) The SE is SD/√n, because there are n observations… …except for estimating the SD when we divide by n-1 …except for regression when we use n-2
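Purely as our own illustration (not from the slides), the short sketch below shows where each divisor in that list turns up in code: n in the SE of the mean, n-1 in the sample SD, and n-2 in the residual SE of a simple regression. The data values are arbitrary.

```python
# Arbitrary toy data, used only to show the three divisors
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(y)

sd = np.std(y, ddof=1)            # sample SD: divide by n - 1
se_mean = sd / np.sqrt(n)         # SE of the mean: SD / sqrt(n)

slope, intercept = np.polyfit(x, y, 1)               # simple linear regression
residuals = y - (intercept + slope * x)
se_resid = np.sqrt(np.sum(residuals**2) / (n - 2))   # residual SE: divide by n - 2

print(se_mean, se_resid)
```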
The consequence • Few students ever leave our course seeing statistics as a complete, coherent picture
The consequence • The better students may get a fuzzy impression
The consequence • All too many noses stay too close to the canvas, and see disconnected details
A potential solution? • ‘Simulation-based methods’ = simulation, bootstrapping and/or permutation tests (Alt: Resampling, Randomization, etc.) • Use of these methods to: • Estimate/approximate the null distribution for significance tests • Estimate/approximate the margin of error for confidence intervals
General trends • Momentum behind simulation-based approach to inference in last 8-10 years • Cobb 2005 talk (USCOTS) • Cobb 2007 paper (TISE) • 2011 USCOTS: The Next Big Thing • Continued workshops, sessions – e.g., numerous at eCOTS!
General trends • Recent curricula • Lock5 (theory and randomization, more traditional sequence of topics) • Tintle et al. ISI (theory and randomization, four pillars of inference and then chapters based on type of data) • CATALST (emphasis on modelling) • OpenIntro • Others: Statistical Reasoning in Sports (Tabor; geared to HS students)
General trends • Many sessions at conferences talking about approach, benefits, questions/concerns • Assessment: Multiple papers (Tintle et al. 2011, Tintle et al. 2012, Tintle et al. 2014, Chance et al. 2014, Swanson et al. 2014); Better on many things, do no harm on others; more papers coming
Set-up: Can dogs understand human cues? • A dog is shown two cups (on ground, 2.5 meters from dog) and then given a choice of which one to approach. • Before approaching the cups the researcher leans in one direction or the other • The dog (Harley) chooses the correct cup 9 out of 10 times • Is the dog ‘understanding’ the researcher?
Questions for students • What do you think? • Why?
In class dialogue • Probably ‘understanding’ the researcher • Assuming some things about the study design • Not always the same cup; same color/kinds of cups; object underneath doesn’t have a scent, etc. • Why ‘understanding the researcher’? • 9 out of 10 is ‘convincing’ • Why convincing? • Unlikely to happen by chance
In class dialogue • What about people not convinced? How would you convince them of your ‘gut feeling’ that 9 out of 10 is ‘rare’ and ‘not likely to happen by chance’? • What would happen by chance is 5 or 6 or 4 correct, or … • Flip a coin
In class tactile simulation • Flip coins • Students come to front and put dots on dotplot • Illustrate that 9 out of 10 heads is rare, confirming intuition that 9 out of 10 correct is rare
Applet • http://math.hope.edu/isi (or our Wiley textbook site Introduction to Statistical Investigations) for links to rossmanchance.com applets • One proportion applet demo
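For readers without the applet handy, the sketch below mimics the coin-flip simulation the One Proportion applet performs for the dog-cues example; it is our own minimal version, not the applet itself. Under the "just guessing" model each of the 10 picks is a fair coin flip, and the p-value is the proportion of repetitions with 9 or more correct.

```python
import numpy as np

rng = np.random.default_rng(1)

n_picks = 10          # cups presented to the dog
observed = 9          # Harley's observed number of correct picks
n_repetitions = 10_000

# Each repetition: count the "heads" in 10 fair coin flips
simulated_counts = rng.binomial(n=n_picks, p=0.5, size=n_repetitions)

# Approximate p-value: how often chance alone gives 9 or more correct
p_value = np.mean(simulated_counts >= observed)
print(f"Approximate p-value: {p_value:.4f}")   # roughly 0.01
```

Nine or more correct shows up in only about 1% of repetitions, matching the intuition built with the tactile coin flipping.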
Take homes • Logic of inference very early in the class • No technical lingo • Follow-up with 6 out of 10. Mechanical arm points at a cup. Dog just guessing?
Another quick example • Eight out of the last 10 patients with heart transplants at St. George’s Hospital died within 30 days. Made news because heart transplant surgeries were suspended pending an investigation • Historical national data is ~15% 30-day mortality rate after heart transplant • What do you think? Would you suspend heart transplants at that hospital? Could there be another explanation? • How can we investigate the “random chance” explanation?
St. George’s • Simulation • Coin tossing? • Roll a die? • Spinner? • Observations • Where is the distribution centered? • Why is it not symmetric? • Do I care? • Where does 8 fall in this distribution?
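As a hedged sketch of the chance model behind those questions (again our own code, not the classroom applet), assume each of the 10 patients independently had the historical 15% chance of dying within 30 days and see where 8 deaths falls.

```python
import numpy as np

rng = np.random.default_rng(2)

n_patients = 10
historical_rate = 0.15
observed_deaths = 8
n_repetitions = 10_000

simulated_deaths = rng.binomial(n=n_patients, p=historical_rate, size=n_repetitions)

# The null distribution is centered near 1.5 deaths and is right-skewed
print("Mean of simulated counts:", simulated_deaths.mean())

# 8 or more deaths essentially never happens under the 15% model;
# this proportion is typically 0 even in 10,000 repetitions
p_value = np.mean(simulated_deaths >= observed_deaths)
print(f"Proportion of repetitions with 8 or more deaths: {p_value:.5f}")
```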
Take homes • Follow-up: 71 out of 361 patients at St. George’s died since 1986 (19.67%)
Take homes • Where do you go from here? • P-value/null-alt hypothesis language • What impacts strength of evidence • Standardized statistics • Normal approx. to binomial (“theory-based approach”) • St. George’s • Process to population
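To illustrate the standardized-statistic and normal-approximation ideas just listed, here is one possible theory-based follow-up using the 71-out-of-361 St. George's numbers; treating those numbers this way is our own example, not a worked solution from the curriculum.

```python
from math import sqrt
from scipy.stats import norm

n = 361
p_hat = 71 / n            # about 0.197
p_0 = 0.15                # historical 30-day mortality rate

se_null = sqrt(p_0 * (1 - p_0) / n)   # SE under the null value, as in a one-proportion z-test
z = (p_hat - p_0) / se_null
p_value = 1 - norm.cdf(z)             # one-sided: evidence the rate exceeds 15%

print(f"z = {z:.2f}, one-sided p-value = {p_value:.4f}")   # z ≈ 2.48, p ≈ 0.006
```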
Take homes • Have them design their own simulations for a while • Technology – not a black box; directly connects to tactile in class simulation • Contrast with traditional approach • Lots of probability scaffolding; abstract theory; disconnection from real data; technical language and notation, etc. • Less ‘spiraling’ and less opportunity to do inference (the main objective?)
More take homes • SBI • Integration of GAISE (content and pedagogy) • Keeping data front and center (e.g., 6 steps of inference) • Build on strong conceptual foundation of what inference is • Layer confidence intervals, generalizability and causation on top of this foundation • Through choice of examples they see many other important issues dealing with data collection and messy data, but always in the context of a full statistical investigation
In our course… • Chapter 1 – simulating one proportion (logic of inference – significance testing) • Chapter 2- importance of random samples (scope of inference - generalizing) (one proportion) • Chapter 3- estimation (logic of inference - confidence intervals) (one proportion) • Chapter 4 – randomized experiments vs. observational studies (scope of inference – causation) (two groups) • Chapters 5-7 – comparing two groups (proportions, quant variable, paired) • Chapters 8-10 – comparing multiple groups/regression (association)
In our course… • Chapters 5-10 • Focus on overall statistical process • Six steps • Integrated presentation of descriptive and inferential statistics • Shuffling to break the association • 3S process: Statistic, Simulate, Strength of Evidence • Theory-based approaches predict ‘what would happen if you simulated’ (more or less) and have valid predictions if certain data conditions are met • Simplified versions of those conditions, can always verify with simulation!
Lingering effects of sleep deprivation • Participants were trained on a visual discrimination task on the computer and then half were not allowed to sleep that night. Everyone got as much sleep as they wanted on nights 2 and 3 and then the subjects were retested. The response variable is the improvement in their reaction times (positive values indicate how much faster they were on the task the second time)
Lingering effects of sleep deprivation • Key question: Could this have happened by random chance alone? • Now: randomness is from the random assignment in the experiment • So what do we need to know? • How does our statistic behave by random chance alone when there really is no treatment effect? • How can we simulate this?
Lingering effects of sleep deprivation • Key question: Could this have happened by random chance alone? • Students take 21 index cards and write down each improvement score • The cards are shuffled and 11 are dealt to be the “sleep deprived” group and the remaining 10 are the “unrestricted sleep” group • Assuming there is nothing special about which group you are assigned to, your outcome would not change; that is, there is no treatment effect • After each shuffle we calculate the new statistic and produce a distribution of the different values of the statistic under this model
Lingering effects of sleep deprivation • Applet demo
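The sketch below is a minimal code version of that card-shuffling simulation (the applet does the same thing interactively). The improvement scores here are made-up placeholders for illustration only, not the study's data; with the real scores in place, the logic is identical.

```python
import numpy as np

def shuffle_null_distribution(group1, group2, n_shuffles=10_000, seed=3):
    """Pool the scores, re-shuffle, re-deal to the two group sizes, and
    record the difference in group means for each shuffle."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([group1, group2])
    n1 = len(group1)
    stats = np.empty(n_shuffles)
    for i in range(n_shuffles):
        rng.shuffle(pooled)                              # one shuffle of the 21 cards
        stats[i] = pooled[:n1].mean() - pooled[n1:].mean()
    return stats

# Placeholder improvement scores (NOT the real data): 10 unrestricted, 11 deprived
unrestricted = np.array([20.0, 12.0, 31.0, 8.0, 25.0, 15.0, 28.0, 10.0, 19.0, 22.0])
deprived = np.array([3.0, -9.0, 5.0, 12.0, -6.0, 0.0, 8.0, -4.0, 14.0, 2.0, 6.0])

observed = unrestricted.mean() - deprived.mean()
null_stats = shuffle_null_distribution(unrestricted, deprived)
p_value = np.mean(null_stats >= observed)                # one-sided p-value
print(f"observed difference = {observed:.2f}, approximate p-value = {p_value:.4f}")
```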
Follow up: plots comparing the original (real) data to the shuffled (“fake”) data
Take home messages • Core logic of inference is the same • From this point on, practically a “downhill” slope • Standardized statistic is simply statistic/SE (SE from simulation) • “Quick and dirty” 95% CI is simply +/- 2*SE (SE from simulation) • Alternative choice of statistic is nice and easy • “Why are we using the mean instead of the median if the median is better?” • Students are ‘ready’ to confront different situations • Theory-based is convenient prediction when certain conditions are met – overlay of distribution
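As a minimal sketch of the first two follow-ups in that list, and assuming null_stats and observed are the shuffled statistics and observed difference from the previous sketch (a null distribution centered near zero), the standardized statistic and quick interval come directly from the simulation output:

```python
import numpy as np

def standardized_statistic(observed, null_stats):
    # statistic divided by the SE estimated from the simulated null distribution
    return observed / np.std(null_stats)

def quick_ci(observed, null_stats):
    # "quick and dirty" 95% interval: statistic +/- 2 * SE (SE from simulation)
    se = np.std(null_stats)
    return observed - 2 * se, observed + 2 * se
```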
How do you do assessment? • May ask students to use applets on exam • Applets can be used on personal devices; most can be downloaded locally in advance • But not required • Can be asked to interpret results • Can be asked to design the simulation • Do ask more conceptual questions about logic and scope of inference • Interpretation of p-value
What kinds of questions do you ask? • Screen capture and fill in blanks/interpret output • “What values would you use in the applet to…” • “Which graph represents the null distribution?” (e.g., where centered) • “Circle the dots that represent the p-value.” or “Indicate on the graph how to find the p-value.” • “Based on the simulated null distribution, how strong is the evidence against the null hypothesis?” • What-if questions • Show a skewed simulated distribution and ask “what’s wrong” with the theory-based p-value • How would the null distribution change if we increased the sample size?
Another example assessment question • Two different approaches were taken in order to yield a p-value. • Option #1. 1000 sets of 20 “coin tosses” were generated where the probability of heads was 10%. Out of the 1000 sets of tosses 129 sets had at least 4 heads occur, and so a p-value of 0.129 is obtained, showing little evidence that more than 10% of Dordt students study more than 35 hours a week. • Option #2. The Theory-Based Inference applet was used, generating a z-score of 1.49 with a p-value of 0.068, yielding moderate evidence that more than 10% of Dordt students study more than 35 hours a week.
Another example assessment question Student question: Briefly explain which p-value (Option #1 or Option #2) is more valid and why.
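A sketch of how the two p-values in that question arise, assuming the sample behind it was 20 students of whom 4 studied more than 35 hours a week (the counts implied by both options):

```python
import numpy as np
from math import sqrt
from scipy.stats import norm

rng = np.random.default_rng(4)
n, observed, p_0 = 20, 4, 0.10

# Option 1: simulation-based p-value (sets of 20 "coin tosses" with 10% heads)
sims = rng.binomial(n=n, p=p_0, size=1000)
p_sim = np.mean(sims >= observed)            # about 0.13, like the quoted 0.129

# Option 2: theory-based (normal approximation) p-value
z = (observed / n - p_0) / sqrt(p_0 * (1 - p_0) / n)
p_theory = 1 - norm.cdf(z)                   # z = 1.49, p = 0.068 as quoted

print(f"simulation p ≈ {p_sim:.3f}; theory-based z = {z:.2f}, p = {p_theory:.3f}")
```

With n·p₀ = 2, the usual validity guideline for the normal approximation fails and the simulated null distribution is right-skewed, which is presumably what the item asks students to notice.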
Assessment results • Major assessment studies are underway • Evidence is mounting for • Improved student conceptual understanding of numerous inferential outcomes • “No harm” on other outcomes • For both stronger and weaker students • Regardless of institution or level of instructor experience with SBI
Dordt’s before and after story • Methods • Traditional curriculum (Moore 2010): 94 students; spring 2011 • New curriculum (ISI, 2011 version): 155 students; fall 2011 and spring 2012 • All students completed the 40-question CAOS test during the first week of the semester and again during the last week of the semester. Students were given course credit for completing the assessment test, but not for their performance, and the test was administered electronically outside of class. • Two instructors taught the course each semester; one instructor was the same in all semesters, while the other differed between spring 2011 and fall 2011/spring 2012
Dordt’s before and after story • Overall performance (pre-test vs. post-test gains shown in chart) • Very similar to Tintle et al. (2011) results at another institution • Approximately twice the gains using the new curriculum as compared to traditional (11.6% vs. 5.6%; p < 0.001)