
Bayes Factors and inference

This text discusses the use of Bayes Factors in the analysis of a study on face emotion and working memory, providing guidelines for interpreting evidence and demonstrating the importance of null findings. It also covers Bayesian meta-analysis as a way to combine replication studies.


Presentation Transcript


  1. Bayes Factors and inference
  Greg Francis • PSY 626: Bayesian Statistics for Psychological Science • Fall 2018 • Purdue University

  2. Bayes Factor
  • The ratio of the likelihood of the data under the null to the likelihood under the alternative
  • Nothing special about the null: a Bayes Factor can compare any two models
  • Each likelihood is averaged across the possible parameter values specified by that model's prior distribution

  3. What does it mean?
  • Guidelines:

    BF         Evidence
    1 – 3      Anecdotal
    3 – 10     Substantial
    10 – 30    Strong
    30 – 100   Very strong
    >100       Decisive

  4. Evidence for the null
  • BF01 > 1 implies (some) support for the null hypothesis
  • Evidence for “invariances”
  • This is more or less impossible for NHST
  • It is a useful measure
  • Consider a recent study in Psychological Science:
  • Liu, Wang, Wang & Jiang (2016). Conscious Access to Suppressed Threatening Information Is Modulated by Working Memory

  5. Working memory face emotion
  • Explored whether keeping a face in working memory influenced its visibility under continuous flash suppression
  • To ensure subjects kept the face in memory, they were tested on its identity

  6. Working memory face emotion
  • Different types of face emotions: fearful face, neutral face
  • No significant difference in correct responses (same/different) across emotions:
  • Experiment 1: t(11) = -1.74, p = 0.110
  • If we compute the JZS Bayes Factor we get:

    > ttest.tstat(t=-1.74, n1=12, simple=TRUE)
          B10
    0.9240776

  • This is anecdotal support for the null hypothesis
  • You would want B10 < 1/3 for substantial support for the null

  7. Replications
  • Experiment 3: t(11) = -1.62, p = .133
  • Experiment 4: t(13) = -1.37, p = .195
  • Converting to JZS Bayes Factors suggests these are modest support for the null
  • Experiment 3:

    > ttest.tstat(t=-1.62, n1=12, simple=TRUE)
          B10
    0.8033315

  • Experiment 4:

    > ttest.tstat(t=-1.37, n1=14, simple=TRUE)
          B10
    0.5857839

  8. The null result matters
  • The authors wanted to demonstrate that faces with different emotions were equivalently represented in working memory
  • But that emotion differently affected visibility during the flash suppression part of a trial
  • Experiment 1:
  • Reaction times for seeing a face during continuous flash suppression were shorter for fearful faces than for neutral faces
  • Main effect of emotion: F(1, 11) = 5.06, p = 0.046
  • Reaction times were shorter when the emotion of the face during continuous flash suppression matched the emotion of the face in working memory
  • Main effect of congruency: F(1, 11) = 11.86, p = 0.005

  9. Main effects
  • We will talk about a Bayesian ANOVA later, but we can consider the t-test equivalent of these tests (for a one-degree-of-freedom effect, t = sqrt(F)):
  • Effect of emotion:

    > ttest.tstat(t=sqrt(5.06), n1=12, simple=TRUE)
          B10
    1.769459

  • Suggests anecdotal support for the alternative hypothesis
  • Effect of congruency:

    > ttest.tstat(t=sqrt(11.86), n1=12, simple=TRUE)
          B10
    9.664241

  • Suggests substantial support for the alternative hypothesis

  10. Evidence
  • It is generally harder to get convincing evidence (BF > 3 or BF > 10) than to get p < .05
  • Interaction: F(1, 11) = 4.36, p = .061
  • Contrasts:
  • RT for fearful faces shorter if congruent with working memory: t(11) = -3.59, p = .004
  • RT for neutral faces unaffected by congruency: t(11) = -0.45
  • Bayesian interpretations of t-tests:

    > ttest.tstat(t=-3.59, n1=12, simple=TRUE)
          B10
    11.94693

    > ttest.tstat(t=-0.45, n1=12, simple=TRUE)
          B10
    0.3136903

  11. Substantial Evidence
  • For a two-sample t-test (n1=n2=10), a BF > 3 corresponds to p < 0.022
  • For a two-sample t-test (n1=n2=100), a BF > 3 corresponds to p < 0.012
  • For a two-sample t-test (n1=n2=1000), a BF > 3 corresponds to p < 0.004

  12. Strong Evidence
  • For a two-sample t-test (n1=n2=10), a BF > 10 corresponds to p < 0.004
  • For a two-sample t-test (n1=n2=100), a BF > 10 corresponds to p < 0.003
  • For a two-sample t-test (n1=n2=1000), a BF > 10 corresponds to p < 0.001
  • Of course, if you change your prior you change these values (but not much)
  • Setting the scale parameter r = sqrt(2) (ultrawide) gives:
  • For a two-sample t-test (n1=n2=10), a BF > 10 corresponds to p < 0.005
  • For a two-sample t-test (n1=n2=100), a BF > 10 corresponds to p < 0.0017
  • For a two-sample t-test (n1=n2=1000), a BF > 10 corresponds to p < 0.00054
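
  These thresholds can be recovered numerically by finding the t value at which the Bayes Factor crosses the criterion and converting that t to a p value. Below is a minimal sketch, assuming the BayesFactor package's ttest.tstat used on the earlier slides; the helper name p_for_bf is my own:

    library(BayesFactor)

    # Find the p-value threshold at which BF10 crosses a criterion for a
    # two-sample t-test, by inverting ttest.tstat with uniroot
    p_for_bf <- function(bf, n1, n2, rscale = "medium") {
      f <- function(t) ttest.tstat(t = t, n1 = n1, n2 = n2,
                                   rscale = rscale, simple = TRUE) - bf
      t_crit <- uniroot(f, interval = c(0.1, 20))$root  # critical t value
      2 * pt(-t_crit, df = n1 + n2 - 2)                 # two-sided p at that t
    }

    p_for_bf(3, 10, 10)                          # ~.022, as on the previous slide
    p_for_bf(10, 100, 100)                       # ~.003
    p_for_bf(10, 10, 10, rscale = "ultrawide")   # ~.005 with the wider prior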

  13. Bayesian meta-analysis
  • Rouder & Morey (2011) identified how to combine replication studies to produce a JZS Bayes Factor that accumulates the information across experiments
  • The formula for a one-sample, one-tailed t-test, for M experiments, is:

    BF_{10} = \frac{\int_0^\infty \left[ \prod_{i=1}^{M} g\!\left(t_i \mid n_i - 1,\ \sqrt{n_i}\,\delta\right) \right] f(\delta)\, d\delta}{\prod_{i=1}^{M} g\!\left(t_i \mid n_i - 1,\ 0\right)}

  • f(δ) is the Cauchy (or, for the one-tailed case, half-Cauchy) prior density on the standardized effect size δ
  • g(t | ν, ncp) is the non-central t density with ν degrees of freedom and noncentrality parameter ncp
  • It looks complicated, but it is easy enough to calculate, as the sketch below shows
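
  Here is a minimal numerical-integration sketch of the formula above. It is the two-sided version, which matches the meta.ttestBF default used on the next slide; the one-tailed variant would integrate over δ > 0 against a half-Cauchy prior. The function name meta_bf10 is my own:

    # Meta-analytic JZS Bayes factor by direct numerical integration
    meta_bf10 <- function(tvals, nvals, r = sqrt(2) / 2) {
      # Null model: product of central t densities
      null_lik <- prod(dt(tvals, df = nvals - 1))
      # Alternative: product of non-central t densities, averaged over
      # the Cauchy prior on the standardized effect size delta
      integrand <- Vectorize(function(delta) {
        prod(dt(tvals, df = nvals - 1, ncp = sqrt(nvals) * delta)) *
          dcauchy(delta, scale = r)
      })
      alt_lik <- integrate(integrand, -Inf, Inf)$value
      alt_lik / null_lik
    }

    meta_bf10(c(-1.74, -1.62, -1.37), c(12, 12, 14))  # ~4.41, as on the next slide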

  14. Bayesian meta-analysis
  • Consider the null results on face emotion and memorability:
  • Experiment 1: t(11) = -1.74, p = 0.110
  • Experiment 3: t(11) = -1.62, p = .133
  • Experiment 4: t(13) = -1.37, p = .195
  • Combined, these give substantial support for the alternative!

    > tvalues <- c(-1.74, -1.62, -1.37)
    > nvalues <- c(12, 12, 14)
    > meta.ttestBF(t=tvalues, n1=nvalues)
    Bayes factor analysis
    --------------
    [1] Alt., r=0.707 : 4.414733 ±0%

    Against denominator:
      Null, d = 0
    ---
    Bayes factor type: BFmetat, JZS

  15. Equivalent statistics
  • Bayes Factors are not magic; they use the very same information as other approaches to statistical inference
  • Consider a variety of statistics for different inferential methods:
  • Standardized effect size (Cohen's d, Hedges' g)
  • Confidence interval for d or g
  • JZS Bayes Factor
  • Akaike Information Criterion (AIC)
  • Bayesian Information Criterion (BIC)

  16. Equivalent statistics
  • For a two-sample t-test with known sample sizes n1 and n2, all of these statistics are mathematically equivalent to each other
  • Given one statistic, you can compute all the others
  • You should use the statistic that is appropriate for the inference you want to make

  17. Equivalent statistics
  • Each of these statistics is a “sufficient statistic” for the population effect size δ
  • A data set provides an estimate, d, of the population effect size δ
  • It is “sufficient” because knowing the whole data set provides no more information about δ than just knowing d

  18. Equivalent statistics: d, t, p
  • Any invertible transformation of a sufficient statistic is also sufficient
  • For example, for a two-sample t-test:

    t = d\,\sqrt{\frac{n_1 n_2}{n_1 + n_2}}, \qquad d = t\,\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}

  • Similarly, a t value corresponds to a unique p value (given the degrees of freedom)
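
  As a concrete illustration, these conversions are one-liners in R (the helper names are my own; p_from_t assumes a two-sided test):

    # Invertible mappings between d, t, and p for a two-sample t-test
    d_from_t <- function(t, n1, n2) t * sqrt(1 / n1 + 1 / n2)
    t_from_d <- function(d, n1, n2) d / sqrt(1 / n1 + 1 / n2)
    p_from_t <- function(t, n1, n2) 2 * pt(-abs(t), df = n1 + n2 - 2)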

  19. Equivalent statistics: CIs
  • The variance of Cohen's d is a function of only the sample sizes and d
  • This means that if you know d and the sample sizes, you can compute either limit of a confidence interval of d
  • If you know either limit of a confidence interval of d, you can also compute d
  • You get no more information about the data set by reporting a confidence interval of d than by reporting a p value
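
  A sketch of that computation, using one common large-sample approximation to the variance of d (the Hedges & Olkin formula; other approximations exist, so treat this as illustrative rather than the slide's exact method):

    # Approximate confidence interval for Cohen's d from d and the sample sizes
    d_ci <- function(d, n1, n2, level = 0.95) {
      v <- (n1 + n2) / (n1 * n2) + d^2 / (2 * (n1 + n2))  # Var(d) approximation
      z <- qnorm(1 - (1 - level) / 2)
      c(lower = d - z * sqrt(v), upper = d + z * sqrt(v))
    }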

  20. Equivalent statistics: Likelihood
  • Many statistics are based on likelihood
  • Essentially the “probability” of the observed data, given a specific model (not quite a probability, because a specific value of a continuous variable has probability zero, so it is a product of probability density function values)
  • For a two-sample t-test, the alternative hypothesis (full model) is that score i from group s (1 or 2) is

    X_{is} = \mu_s + \epsilon_{is}, \qquad \epsilon_{is} \sim N(0, \sigma^2)

  • with a different mean \mu_s for each group s
  • The likelihood for the full model is then:

    L_F = \prod_{s=1}^{2} \prod_{i=1}^{n_s} \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(X_{is} - \mu_s)^2}{2\sigma^2}\right)

  21. Equivalent statistics: Likelihood
  • For a two-sample t-test, the null hypothesis (reduced model) is that score i from group s (1 or 2) is

    X_{is} = \mu + \epsilon_{is}, \qquad \epsilon_{is} \sim N(0, \sigma^2)

  • with the same mean \mu for each group s
  • These calculations always use estimates of the mean and standard deviation that maximize the likelihood value for that model

  22. Equivalent statistics: Likelihood
  • Compare the full (alternative) model against the reduced (null) model with the log likelihood ratio:

    \Lambda = \ln\!\left(\frac{L_F}{L_R}\right)

  • Because the reduced model is a special case of the full model, L_F \geq L_R, so \Lambda \geq 0
  • If \Lambda is sufficiently big, you can argue that the full model is better than the reduced model
  • This is the basis of the likelihood ratio test

  23. Equivalent statistics: t, Likelihood
  • No new information here
  • Let n = n1 + n2. Then:

    \Lambda = \frac{n}{2} \ln\!\left(1 + \frac{t^2}{n - 2}\right)
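
  In R, that is a one-liner (the function name is my own; it is reused in the sketches below):

    # Log likelihood ratio for a two-sample t-test, directly from t and n = n1 + n2
    lambda_from_t <- function(t, n) (n / 2) * log(1 + t^2 / (n - 2))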

  24. Equivalent statistics: AIC
  • As we saw earlier, just adding complexity to a model will make its claims unreplicable
  • The model ends up “explaining” random noise
  • The model will poorly predict future random samples
  • A better approach is to adjust the likelihood to take into account the complexity of the model
  • Models are penalized for complexity
  • Akaike Information Criterion (AIC), for a model with k parameters and maximized likelihood L:

    \mathrm{AIC} = 2k - 2\ln(L)

  • Smaller (more negative) values are better

  25. Equivalent statistics: AIC
  • For a two-sample t-test, we can compare the full (alternative, 3 parameters) model and the reduced (null, 2 parameters) model:

    \Delta\mathrm{AIC} = \mathrm{AIC}_R - \mathrm{AIC}_F = 2\Lambda - 2

  • When ΔAIC > 0, choose the full model
  • When ΔAIC < 0, choose the null model

  26. Equivalent statistics: AIC
  • For small sample sizes, you will do better with a “corrected” formula:

    \mathrm{AIC}_c = \mathrm{AIC} + \frac{2k(k+1)}{n - k - 1}

  • So, for a two-sample t-test:

    \Delta\mathrm{AIC}_c = 2\Lambda - 2 + \frac{12}{n - 3} - \frac{24}{n - 4}

  • When ΔAICc > 0, choose the full model
  • When ΔAICc < 0, choose the null model
  • The chosen model is expected to do the better job of predicting future data
  • This does not mean it will do a “good” job; maybe both models are bad
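
  Combining the formulas above (a sketch; it reuses lambda_from_t from the earlier snippet):

    # Delta AIC and delta AICc for the two-sample t-test
    # (full model k = 3, null model k = 2; positive values favor the full model)
    delta_aic  <- function(t, n) 2 * lambda_from_t(t, n) - 2
    delta_aicc <- function(t, n) delta_aic(t, n) + 12 / (n - 3) - 24 / (n - 4)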

  27. Equivalent statistics: AIC
  • Model selection based on AIC is appropriate when you want to predict future data but do not have a lot of confidence that you have an appropriate model
  • You expect the model to change with future data
  • Perhaps guided by the current model
  • To me, this feels like a lot of research in experimental psychology
  • The calculations are based on the very same information in a data set as the t-value, d-value, and p-value

  28. Equivalent statistics: AIC
  • Inference based on AIC is actually more lenient than the traditional criterion for p-values
  • For a two-sample t-test with large n, ΔAIC > 0 corresponds to roughly p < .16

  29. Equivalent statistics: BIC
  • Decisions based on AIC are not guaranteed to pick the “correct” model
  • An alternative complexity correction does better in this regard
  • Bayesian Information Criterion, for a model with k parameters, maximized likelihood L, and n data points:

    \mathrm{BIC} = k\ln(n) - 2\ln(L)

  • For a two-sample t-test:

    \Delta\mathrm{BIC} = \mathrm{BIC}_R - \mathrm{BIC}_F = 2\Lambda - \ln(n)

  • When ΔBIC > 0, choose the full model; when ΔBIC < 0, choose the null model
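
  The corresponding calculation in R (again reusing lambda_from_t from above):

    # Delta BIC for the two-sample t-test; positive values favor the full model
    delta_bic <- function(t, n) 2 * lambda_from_t(t, n) - log(n)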

  30. Equivalent statistics: BIC
  • Inference based on BIC is much more stringent than the traditional criterion for p-values

  31. Equivalent statistics: JZS BF
  • AIC and BIC use the “best” (maximum likelihood) model parameters
  • A fully Bayesian approach is to average the likelihood across plausible parameter values
  • This requires a prior probability density function
  • Compute the ratio of the average likelihood for the full (alternative) model to the average likelihood for the reduced (null) model: this is the Bayes Factor
  • The JZS prior is a Cauchy distribution on the standardized effect size
  • Its Bayes Factor is simply a function of t and the sample sizes
  • It contains no more information about the data set than a p-value

  32. Equivalent statistics: JZS BF
  • Inference based on the JZS Bayes Factor is much more stringent than the traditional criterion for p-values

  33. Equivalent statistics: JZS BF
  • Model selection based on BIC and the JZS Bayes Factor is consistent: with enough data, it is guaranteed to select the “true” model, provided that model is among those being tested
  • So, if you think you understand a situation well enough that you can identify plausible “true” models, then the BIC or Bayes Factor process is a good choice for identifying the true model

  34. Equivalent statistics
  • I created a web site to do the conversions between statistics: http://psych.purdue.edu/~gfrancis/EquivalentStatistics/
  • It also computes other relevant statistics (e.g., post hoc power)

  35. Equivalent statistics
  • The various statistics are equivalent, but that does not mean you should report whatever you want
  • It means you should think very carefully about your analysis:
  • Do you want to predict future data?
  • Do you think you can identify the “true” model?
  • Do you want to control the Type I error rate?
  • Do you want to estimate the effect size?
  • You also need to think carefully about whether you can satisfy the requirements of the inference:
  • Can you avoid optional stopping in data collection?
  • Is your prior informative?

  36. What should we do?
  • The first step is to identify what you want to do
  • Not as easy as it seems
  • “Produce a significant result” is not an appropriate answer
  • Your options are basically:
  • 1) Control Type I error: identify an appropriate sample size and fix it; identify the appropriate analyses and adjust the significance criterion appropriately; do not include data from any other studies (past or future)
  • 2) Estimate an effect size: sample until you have a precise enough measurement; you have to figure out what “precise enough” means; explore/describe the data without drawing conclusions
  • 3) Find the “true” model: sample until the Bayes Factor provides overwhelming evidence for one model versus the others; you have to identify prior distributions of “belief” in those models; you have to believe that the true model is among the set being considered
  • 4) Find the model that best predicts future data: use machine learning techniques such as cross-validation, or an information criterion; be willing to accept that your current model is probably wrong

  37. Equivalent statistics
  • Common statistics are equivalent with regard to the information in the data set
  • But no method of statistical inference is appropriate for every situation
  • The choice of what to do can give radically different answers to seemingly similar questions
  • For n1 = n2 = 250, d = 0.183:
  • p = 0.04
  • ΔBIC = -2.03 (evidence for the null)
  • ΔAICc = 2.16 (the full model better predicts future data than the null model)
  • JZS Bayes Factor = 0.755 (weak evidence that slightly favors the null model)
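
  These numbers can be reproduced from t alone, using the helper functions sketched on the earlier slides plus ttest.tstat from the BayesFactor package (a check, not new information):

    n1 <- 250; n2 <- 250; n <- n1 + n2
    t <- t_from_d(0.183, n1, n2)                          # ~2.05
    p_from_t(t, n1, n2)                                   # ~0.04
    delta_bic(t, n)                                       # ~ -2.03
    delta_aicc(t, n)                                      # ~  2.16
    ttest.tstat(t = t, n1 = n1, n2 = n2, simple = TRUE)   # ~0.755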

  38. What should we do?
  • Do you even need to make a decision? (choose a model, reject a null)
  • Oftentimes the decision of a hypothesis test is really just a description of the data
  • When you make a decision you need to consider the context (weigh probabilities and utilities)
  • For example, suppose a teacher needs to improve mean reading scores by 7 points for a class of 30 students
  • Approach A (compared to current method): mean improvement = 6, s = 5, d = 1.2, giving P(Mean > 7) = 0.14
  • Approach B (compared to current method): mean improvement = 5, s = 50, d = 0.1, giving P(Mean > 7) = 0.41
  • Approach B, despite its much smaller effect size, is more likely to reach the target
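
  A quick sketch of those probabilities, reconstructing the mean improvements from d = mean/s (6 for A, 5 for B) and using the normal sampling distribution of the class mean:

    # Probability that the class mean improvement exceeds the 7-point target
    p_target <- function(mu, s, n = 30, target = 7) {
      se <- s / sqrt(n)                  # standard error of the class mean
      pnorm(target, mean = mu, sd = se, lower.tail = FALSE)
    }

    p_target(6, 5)    # Approach A: ~0.14
    p_target(5, 50)   # Approach B: ~0.41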

  39. Conclusions
  • These differences make sense because science involves many different activities at different stages of investigation:
  • Discovery
  • Theorizing
  • Verification
  • Prediction
  • Testing
  • Bayes Factors fit into part (but not all) of these activities
