
Model checks for complex hierarchical models



  1. Model checks for complex hierarchical models. Alex Lewin and Sylvia Richardson, Imperial College Centre for Biostatistics

  2. Background and Aims • Many complex models used in bioinformatics • Classification/clustering can be greatly affected by choice of distributions • Our approach: exploit the structure of the model to perform predictive checks • hierarchical models generally involve exchangeability assumptions • mixture models are partially exchangeable

  3. Outline of Talk • Mixture model for gene expression data • Model checks for mixture model • distribution for gene-specific variances • different mixture priors • Future work: model checks for a clustering and variable selection model (Tadesse et al. 2005)

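  4. Hierarchical mixture model for gene expression data • Data: paired log differences between 2 conditions • $y_{gr} \mid \delta_g, \sigma_g \sim N(\delta_g, \sigma_g^2)$ • $\delta_g \mid \eta \sim \sum_j w_j h_j(\eta_j)$, where $\delta_g$ is the differential effect for gene g • $\sigma_g^2 \mid \mu, \tau \sim f(\mu, \tau)$, where $\sigma_g^2$ is the variance for each gene • $w \sim \text{Dirichlet}(1, \dots, 1)$, various priors for $\delta_g$, $\sigma_g$ • g = gene, r = replicate, j = mixture component • Model graph nodes: $\eta_j$, $w_j$, $\mu, \tau$, $\delta_g$, $\sigma_g$, $\bar{y}_g$, $S_g$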
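As a rough illustration of the hierarchical model on this slide, the Python sketch below simulates data from a three-component version of it. Several choices are assumptions made for illustration and are not taken from the talk: the outer components are Gam(1.5, η) and the null is N(0, ε) (the form that appears later in the talk), f(μ, τ) is read as a Gamma with shape μ and rate τ, ε is read as a standard deviation, and the function name and all numeric defaults are hypothetical.

```python
import numpy as np

def simulate_mixture_data(n_genes=1000, n_rep=6, w=(0.1, 0.8, 0.1),
                          eta_minus=2.0, eta_plus=2.0, eps=0.05,
                          mu=2.0, tau=2.0, seed=0):
    """Simulate y_gr from a 3-component mixture model (illustrative assumptions)."""
    rng = np.random.default_rng(seed)
    # Allocation z_g: 0 = under-expressed, 1 = null, 2 = over-expressed
    z = rng.choice(3, size=n_genes, p=w)
    under, null, over = (z == 0), (z == 1), (z == 2)

    # delta_g | z_g = j ~ h_j(eta_j); Gamma rate eta -> NumPy scale 1/eta
    delta = np.empty(n_genes)
    delta[under] = -rng.gamma(1.5, 1.0 / eta_minus, under.sum())
    delta[null] = rng.normal(0.0, eps, null.sum())
    delta[over] = rng.gamma(1.5, 1.0 / eta_plus, over.sum())

    # sigma_g^2 | mu, tau ~ f(mu, tau); assumed Gamma(shape=mu, rate=tau)
    sigma2 = rng.gamma(shape=mu, scale=1.0 / tau, size=n_genes)

    # y_gr | delta_g, sigma_g ~ N(delta_g, sigma_g^2)
    y = rng.normal(delta[:, None], np.sqrt(sigma2)[:, None],
                   size=(n_genes, n_rep))
    return y, z, delta, sigma2
```

Data simulated this way can serve as a known-truth input when trying out the predictive checks described later in the talk.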

  5. Mixture model for gene expression data • Many mixture models have been proposed for gene expression data • Set-up is similar to variable selection prior: point mass + alternative distribution • Particular choices for alternative: • Normal (Lönnstedt and Speed) • Uniform (Parmigiani et al) • many others …

  6. Mixture model for gene expression data • Allow for asymmetry in over- and under-expressed genes ⇒ 3-component mixture model: $\delta_g \mid \eta \sim w_1 h_1(\eta_1) + w_2 h_2(\eta_2) + w_3 h_3(\eta_3)$ • Data: 6 knock-out and 5 wildtype mice, MAS5.0 processed data

  7. Mixture model for gene expression data Classify each gene into mixture components using posterior probabilities

  8. Choice of mixture prior affects classification results

  9. Outline of Talk • Mixture model for gene expression data • Model checks for mixture model • distribution for gene-specific variances • different mixture priors • Future work: model checks for a clustering and variable selection model (Tadesse et al. 2005)

  10. Predictive model checks • Predict new data from the model • Use posterior predictive distribution • Condition on hyperparameters (‘mixed predictive’ * ⇒ not very conservative) • Get Bayesian p-value for each gene/marker/sample • Use all p-values together (100s or 1000s) to assess model fit • * Gelman, Meng and Stern 1995; Marshall and Spiegelhalter 2003

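  11. Checking distribution for gene variances • Bayesian p-value for gene g: $p_g = \text{Prob}(S^{\text{mpred}} > S_g^{\text{obs}} \mid \text{data})$ • All genes are exchangeable ⇒ histogram of p-values for all genes together • Diagram: the posterior predictive $S_g^{\text{ppred}}$ is generated from the gene's own $\sigma_g$; the mixed predictive $S^{\text{mpred}}$ is generated from a new $\sigma^{\text{pred}}$ drawn given the posterior of $\mu, \tau$; both are compared with $S_g^{\text{obs}}$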

  12. ‘Mixed’ v. ‘posterior’ predictive • Predictive p-values for data simulated from the model • Histograms should be Uniform • Mixed predictive distribution much less conservative than posterior predictive • Figure panels: "Using gene-specific distributions" and "Using global distribution"

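  13. Checking different variance models • Model differential expression between 3 transgenic and 3 wildtype mice • Variance models compared: $\sigma_g^2 \mid \mu, \tau \sim \text{Gam}(\mu, \tau)$ with $\mu$ fixed; $\sigma_g^2 = \sigma^2$ for all genes; $\sigma_g^2 \mid \mu, \tau \sim \text{Gam}(\mu, \tau)$; $\sigma_g^2 \mid \mu, \tau \sim \text{logNorm}(\mu, \tau)$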

  14. μ,τ g σg σpred mixed pred. Smpred ybarg Sgobs Implementation (MCMC) niter = no. MCMC iterations m = (no. replicates – 1)/2 pg= 0 for t = 1,…,niter { σtpred  f(μt,τt) Stmpred  Gam( m, m(σtpred)-2 ) pg pg + I[ Stmpred > Sgobs ] } pg pg / niter Just two extra parameters predicted at each iteration

  15. Outline of Talk • Mixture model for gene expression data • Model checks for mixture model • distribution for gene-specific variances • different mixture priors • Future work: model checks for a clustering and variable selection model (Tadesse et al. 2005)

  16. Checking mixture prior • $\delta_g \mid \eta \sim w_1 h_1(\eta_1) + w_2 h_2(\eta_2) + w_3 h_3(\eta_3)$ • OR: $\delta_g \mid \eta, z_g = j \sim h_j(\eta_j)$, j = 1,…,3, with $P(z_g = j) = w_j$ • Model checking: focus on separate mixture components

  17. Issues for mixture model checking • $\delta_g \mid \eta, z_g = j \sim h_j(\eta_j)$, j = 1,…,3 • Think about MCMC iterations … • Mixture component is estimated from genes currently assigned to that component • Can only define p-value for given gene and mix. component when the gene is assigned to that component (i.e. condition on $z_g$ in p-value) • So check each component using only the genes currently assigned (i.e. condition on $z_g$ in histogram)

  18. Predictive checks for mixture model • Bayesian p-value for gene g and mix. component j: $p_{gj} = \text{Prob}(\bar{y}_{gj}^{\text{mpred}} > \bar{y}_g^{\text{obs}} \mid \text{data}, z_g = j)$ • Genes assigned to the same mix. component are exchangeable • histogram of p-values for each mix. component separately • histogram for component j made only from genes with large $P(z_g = j)$ • Model graph nodes: $\mu, \tau$, $\eta_j$, $w_j$, $\sigma_g$, $\delta_g$, $\delta_j^{\text{pred}}$, $\bar{y}_{gj}^{\text{mpred}}$, $S_g$, $\bar{y}_g$

  19. Condition on classification to check separate components • Predictive p-values for data simulated from the model • Figure panels: "All genes with P(zg = j) > 0" and "Only genes with P(zg = j) > 0.5" • Effectively we condition on a best classification

  20. Checking different mixture distributions • Model: $w_1 \text{Unif}(-\eta_-, 0) + w_2 \delta(0) + w_3 \text{Unif}(0, \eta_+)$, where $\delta(0)$ is a point mass at zero • Outer mix. components skewed too much away from zero • Null component too narrow

  21. Checking different mixture distributions • Model: $w_1 \text{Gam}^-(1.5, \eta_-) + w_2 \delta(0) + w_3 \text{Gam}^+(1.5, \eta_+)$ • Outer components skewed in the opposite direction • Null still too narrow?

  22. Checking different mixture distributions • Model: $w_1 \text{Gam}^-(1.5, \eta_-) + w_2 N(0, \varepsilon) + w_3 \text{Gam}^+(1.5, \eta_+)$ • Better fit for all components

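  23. Implementation • Model graph nodes: $w_j$, $\mu, \tau$, $\eta_j$, $\sigma_g$, $\delta_g$, $\delta_j^{\text{pred}}$, $\bar{y}_{gj}^{\text{mpred}}$, $\bar{y}_g$, $S_g$ • Pseudocode:
      p_gj = 0
      for t = 1,…,niter {
        δ_j^(t,pred) ~ h_j(η_j^t)   for j = 1,…,3
        ybar_g^(t,mpred) ~ N( δ_j^(t,pred), σ_g^2 / nrep )   for j = z_g^t
        p_gj ← p_gj + I[ ybar_g^(t,mpred) > ybar_g^obs ]   for j = z_g^t
      }
      p_gj ← p_gj / niter(z_g = j)
      • Need ≈ ngenes extra parameters at each iteration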
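A hedged Python sketch of the component-wise loop on this slide follows. It assumes the MCMC output has been saved as arrays: `delta_pred[t, j]` is a draw already generated from h_j(η_j) at iteration t, `sigma2_draws[t, g]` is the current variance of gene g, and `z_draws[t, g]` is its current allocation (0-based). These array names, the function name, and the choice to return NaN for gene/component pairs that never occur are all illustrative assumptions.

```python
import numpy as np

def mixture_pred_pvalues(delta_pred, sigma2_draws, z_draws, ybar_obs, n_rep, seed=0):
    """p_gj = Prob(ybar_gj_mpred > ybar_g_obs | data, z_g = j), per gene and component."""
    rng = np.random.default_rng(seed)
    ybar_obs = np.asarray(ybar_obs, dtype=float)
    n_iter, n_comp = delta_pred.shape
    n_genes = len(ybar_obs)
    exceed = np.zeros((n_genes, n_comp))
    assigned = np.zeros((n_genes, n_comp))
    genes = np.arange(n_genes)
    for t in range(n_iter):
        z_t = z_draws[t]
        # Predict a mean for each gene from the component it is currently assigned to
        ybar_pred = rng.normal(delta_pred[t, z_t],
                               np.sqrt(sigma2_draws[t] / n_rep))
        exceed[genes, z_t] += ybar_pred > ybar_obs
        assigned[genes, z_t] += 1
    # Divide by the number of iterations with z_g = j; NaN if never assigned
    with np.errstate(invalid="ignore", divide="ignore"):
        p = exceed / assigned
    post_prob = assigned / n_iter  # estimate of P(z_g = j | data)
    return p, post_prob
```

The second return value makes it easy to build the histogram for component j from only those genes with, say, `post_prob[:, j] > 0.5`, which is the conditioning on a best classification described earlier in the talk.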

  24. Summary of model checking procedure • Find part of model where individuals are assumed to be exchangeable (so information is shared) • Choose test statistic T (eg. sample mean or variance) • Predict Tpred from distribution for exchangeable individuals (whole posterior for Tpred) • Compare observed Ti for each individual i to distribution of Tpred • For checking mixture components, condition on the best classification
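The final comparison step relies on looking at all the p-values together, which the talk does with histograms. The small Python helper below is one possible way to summarise such a histogram numerically; the function name, the number of bins, and the use of the largest deviation from the uniform proportion as a summary are assumptions for illustration, not part of the procedure described in the talk.

```python
import numpy as np

def pvalue_histogram(p, n_bins=10):
    """Bin proportions of predictive p-values and their largest gap from uniform."""
    p = np.asarray(p, dtype=float)
    p = p[~np.isnan(p)]  # drop undefined p-values (e.g. gene never in a component)
    counts, _ = np.histogram(p, bins=n_bins, range=(0.0, 1.0))
    props = counts / counts.sum()
    # Under a well-fitting model each bin should hold roughly 1/n_bins of the genes
    max_dev = np.abs(props - 1.0 / n_bins).max()
    return props, max_dev
```

Because the p-values share predictive draws they are not independent, so a summary like this is best read as a rough guide alongside the histogram itself rather than as a formal test.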

  25. Outline of Talk • Mixture model for gene expression data • Model checks for mixture model • distribution for gene-specific variances • different mixture priors • Future work: model checks for a clustering and variable selection model (Tadesse et al. 2005)

  26. Clustering and variable selection (Tadesse et al. 2005) • $y_i$ = vector of gene expression for each sample i = 1,…,n • Multivariate mixture model for clustering samples: $y_i \mid z_i = j \sim \text{MVN}(\zeta_j, \Lambda_j)$, j = 1,…,J, with $P(z_i = j) = w_j$ • No. of mix. components (J) is estimated in the model • Aim to select genes which are informative for clustering the samples

  27. Clustering and variable selection (Tadesse et al. 2005) • Likelihood conditional on allocation to mixture (equation not reproduced in the transcript) • $\gamma$ = vector of indices of selected variables • $\gamma'$ = vector of indices of variables not used to cluster samples • Conjugate priors on multivariate means and covariance matrices • $P(\gamma_g = 1) = \varphi$ • i = sample, g = gene, j = mix. component

  28. Clustering and variable selection (Tadesse et al. 2005) • Model checking: want to check the distribution for each mixture component separately (conditional on J) • In addition, need to condition on a given variable selection • Clearly impossible computationally • Model graph nodes: $J$, $\mu_j(\gamma)$, $\Sigma_j(\gamma)$, $w_j$, $\eta(\gamma)$, $\Omega(\gamma)$, $\varphi$, $y_i$, $y(\gamma)_j^{\text{pred}}$ • i = sample, g = gene, j = mix. component

  29. Computing predictive p-values • Run model with no prediction • Find the best configuration: • set of selected variables ($\gamma$) • no. mixture components J • allocation of samples to mixture components $z_i$ • Re-run model with $\gamma$, J and $z_i$ fixed, calculating predictive p-values $p_{ij} = \text{Prob}(T_j^{\text{pred}} > T_i^{\text{obs}} \mid \text{data}, z_i = j, J, \gamma)$, where $T = |y|^2$ (for example)
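The second pass on this slide might look roughly like the Python sketch below, which computes the p-values from saved draws once the configuration has been fixed. It is a sketch under assumptions rather than the method of Tadesse et al.: the component means and covariances over the selected variables are assumed to be available as arrays `mean_draws` and `cov_draws` from the re-run, `gamma` is taken as a boolean mask of selected variables, allocations are 0-based, and all names are hypothetical.

```python
import numpy as np

def clustering_pred_pvalues(y, z, gamma, mean_draws, cov_draws, seed=0):
    """p_i = Prob(T_pred_j > T_obs_i | data, z_i = j, J, gamma), with T = |y|^2.

    y: (n_samples, n_genes) data; z: fixed allocation of each sample;
    gamma: boolean mask of selected variables (d = gamma.sum());
    mean_draws: (n_iter, J, d); cov_draws: (n_iter, J, d, d).
    """
    rng = np.random.default_rng(seed)
    # Observed statistic uses only the selected variables
    t_obs = (y[:, gamma] ** 2).sum(axis=1)
    n_iter, n_comp = mean_draws.shape[:2]
    exceed = np.zeros(len(z))
    for t in range(n_iter):
        t_pred = np.empty(n_comp)
        for j in range(n_comp):
            # One predicted sample per component per iteration
            y_pred = rng.multivariate_normal(mean_draws[t, j], cov_draws[t, j])
            t_pred[j] = (y_pred ** 2).sum()
        # Each sample is compared with the prediction from its own component
        exceed += t_pred[z] > t_obs
    return exceed / n_iter
```

Fixing the configuration first means the loop never has to handle a changing number of components or a changing variable set, which is what makes the check feasible.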

  30. Conclusions • Choice of model distributions can greatly influence results of clustering and classification • For models where information is shared across individuals, predictive checks can be used as an alternative to cross-validation • Should be possible to do this even for quite complex models (if you can fit the model, you can check it)

  31. Acknowledgements • Collaborators on BBSRC Exploiting Genomics Grant: Natalia Bochkina, Clare Marshall, Peter Green • Meeting on model checking in Cambridge: David Spiegelhalter, Shaun Seaman • BBSRC Exploiting Genomics Grant • Paper and software at http://www.bgx.org.uk/
