Nonparametric Bayesian Methods for Genetic Inference

Eric P. Xing (MLD, LTI and CSD), School of Computer Science, Carnegie Mellon University.

Presentation Transcript


  1. Nonparametric Bayesian Methods for Genetic Inference • Eric P. Xing, MLD, LTI and CSD, School of Computer Science, Carnegie Mellon University

  2. Genome Polymorphisms

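  3. (Figure: ancestors along a time axis)

  4. (Figure: ancestors along a time axis)

  5. (Figure: ancestors along a time axis, down to the present)

  6. (Figure: ancestors along a time axis, down to the present)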

  7. The ancestral chromosome: TCGAGGTATTAAC

  8. The SNPs (Single Nucleotide Polymorphisms) • Chromosomes: TCGAGGTATTAAC, TCTAGGTATTAAC, TCGAGGCATTAAC, TCTAGGTGTTAAC, TCGAGGTATTAGC, TCTAGGTATCAAC • The starred sites on the slide are the variable positions 3, 7, 8, 10 and 12 • Each DNA site is called a "locus" • Each variant is called an "allele"

  9. The haplotypes • Chromosomes: TCGAGGTATTAAC, TCTAGGTATTAAC, TCGAGGCATTAAC, TCTAGGTGTTAAC, TCGAGGTATTAGC, TCTAGGTATCAAC • With the SNP sites masked (█), all six share the backbone TC█AGG██T█A█C • Haplotypes are useful markers for studying disease association or genome evolution: landmarks, indicators, covariates, causes …
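
The step from chromosomes to SNP sites to haplotypes on slides 7 to 9 can be sketched in a few lines. This is only an illustration on the six sequences listed above; the function names `snp_sites` and `haplotypes` are made up for this example and do not come from the presentation.

```python
# Aligned chromosomes copied from the slides; the first is the ancestral one.
chromosomes = [
    "TCGAGGTATTAAC",
    "TCTAGGTATTAAC",
    "TCGAGGCATTAAC",
    "TCTAGGTGTTAAC",
    "TCGAGGTATTAGC",
    "TCTAGGTATCAAC",
]


def snp_sites(seqs):
    """Return 0-based indices of columns where more than one allele occurs."""
    return [j for j, column in enumerate(zip(*seqs)) if len(set(column)) > 1]


def haplotypes(seqs, sites):
    """Keep only the alleles found at the given SNP sites of each sequence."""
    return ["".join(s[j] for j in sites) for s in seqs]


sites = snp_sites(chromosomes)       # polymorphic columns
haps = haplotypes(chromosomes, sites)  # one short string per chromosome
print(sites)
print(haps)
```

The indices are 0-based, so they are one less than the 1-based positions a reader would count off on the slide.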

  10. Genetic Inference • Determine genetic markers • Haplotype inference • Reveal genome inheritance events • Recombination hotspot identification • Deconvolve population structure • Ancestral spectrum analysis

  11. Outline • Haplotype Inference • Dirichlet Process for phasing a single population • Hierarchical DP for phasing multiple populations • Linkage-disequilibrium analysis • Hidden Markov DP for identifying recombination hotspots • Population structure analysis • Admixture model • HMDP models

  12. Genetic Inference • Determine genetic markers • Haplotype inference • Reveal genome inheritance events • Recombination hotspot identification • Deconvolve population structure • Ancestral spectrum analysis

  13. Haplotype Ambiguity • Sequencing a heterozygous diploid individual (chromosomes Cp and Cm) • Genotype g: pairs of alleles, whose associations to chromosomes are unknown • Haplotype h ≡ (h1, h2): possible associations of alleles to chromosomes
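
To make the ambiguity concrete, the sketch below enumerates every haplotype pair compatible with a genotype. The genotype used here is hypothetical (the one drawn on the slide cannot be recovered from the transcript), and `consistent_haplotype_pairs` is an illustrative name.

```python
from itertools import product


def consistent_haplotype_pairs(genotype):
    """Enumerate unordered haplotype pairs (h1, h2) compatible with a genotype.

    genotype: list of 2-tuples, the unordered allele pair observed at each locus.
    """
    pairs = set()
    # At each locus choose which allele goes to h1; the other one goes to h2.
    for choice in product((0, 1), repeat=len(genotype)):
        h1 = "".join(g[c] for g, c in zip(genotype, choice))
        h2 = "".join(g[1 - c] for g, c in zip(genotype, choice))
        pairs.add(tuple(sorted((h1, h2))))  # (h1, h2) and (h2, h1) are the same
    return sorted(pairs)


# Hypothetical genotype with two heterozygous loci (the 2nd and the 4th).
genotype = [("T", "T"), ("C", "G"), ("A", "A"), ("A", "T")]
phasings = consistent_haplotype_pairs(genotype)
print(phasings)
```

A genotype with k heterozygous loci (k ≥ 1) has 2^(k-1) distinct phasings, which is why data from a single individual cannot resolve the haplotypes.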

  14. Haplotype Inference • The rationale: parsimony • Many haplotypes are shared in a population • Data for many individuals allows phasing SNP genotypes • Genotypes shown on the slide: 1/0 0/1 1/1 0/1 0/1 1/1 0/1 0/1 0/1 1/1 • (Figure: two candidate sets of phased 0/1 haplotypes) • One solution seems 'better' since it uses fewer haplotypes

  15. A Finite (Mixture of) Allele Model • The probability of a genotype g: • Standard settings: • |H| = K << 2^J: fixed-size population haplotype pool • p(h1, h2) = p(h1) p(h2) = f1 f2: Hardy-Weinberg equilibrium • Problem: K? H? • (Figure: population haplotype pool; haplotype model Hn1, Hn2; genotyping model Gn)
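
The formula for the genotype probability did not survive in the transcript. A minimal sketch of what such a finite mixture usually computes, under two assumptions that are not stated on the slide: haplotypes are binary strings, and genotyping is noiseless, so p(g | h1, h2) is 1 when h1 and h2 add up to g and 0 otherwise. The pool and its frequencies are hypothetical.

```python
def genotype_probability(genotype, pool):
    """p(g) = sum over ordered pairs (h1, h2) of f[h1] * f[h2] * 1[h1 + h2 == g].

    genotype: tuple with the number of 1-alleles (0, 1 or 2) at each locus.
    pool: dict mapping each haplotype in the pool H to its frequency f.
    Haplotype pairs are drawn independently (Hardy-Weinberg equilibrium).
    """
    total = 0.0
    for h1, f1 in pool.items():
        for h2, f2 in pool.items():
            if all(int(a) + int(b) == g for a, b, g in zip(h1, h2, genotype)):
                total += f1 * f2
    return total


# Hypothetical pool with K = 3 haplotypes over J = 2 loci.
pool = {"00": 0.5, "01": 0.2, "11": 0.3}
print(genotype_probability((1, 1), pool))  # only 00 + 11 explains this genotype
```

The "Problem: K? H?" bullet is visible here: both the pool size and its contents have to be fixed before `genotype_probability` can be evaluated.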

  16. The PAC Model • The joint probability of all haplotypes h1, h2, …, hn: • Problem: • Ordering? • Ancestor? • (Figure: haplotype nodes H1,1, H1,2, H2,1, H2,2, Hn,1, Hn,2 with their genotype nodes)

  17. An Infinite (Mixture of) Allele Model • How? • Via a nonparametric hierarchical Bayesian formalism! • (Figure: plate diagram with Ak, θk over ∞ components and Hn1, Hn2, Gn over N individuals)

  18. Dirichlet Process • A CDF, G, on possible worlds of random partitions follows a Dirichlet Process if for any measurable finite partition (φ1, φ2, …, φm): (G(φ1), G(φ2), …, G(φm)) ~ Dirichlet(αG0(φ1), …, αG0(φm)), where G0 is the base measure and α is the scale parameter • Thus a Dirichlet Process G defines a distribution over distributions • (Figure: possible worlds of partitions)

  19. DP: a Pólya urn Process • Joint: • Marginal: • "Infinite" • Self-reinforcing property • Exchangeable partition of samples
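
The urn scheme behind this slide can be simulated directly. The sketch below uses the standard Pólya urn rule for a DP with scale parameter α: the next sample joins an existing color with probability proportional to its count and starts a new color with probability proportional to α. The function name and the choice of `random.Random` are illustrative.

```python
import random


def polya_urn_partition(n, alpha, rng):
    """Sequentially assign n samples to colors (clusters) with the Polya urn rule.

    Sample i (0-based) joins existing color k with probability
    counts[k] / (i + alpha) and starts a new color with probability
    alpha / (i + alpha).
    """
    counts = []       # counts[k] = number of balls of color k so far
    assignments = []  # color chosen for each sample, in order
    for i in range(n):
        u = rng.random() * (i + alpha)
        acc = 0.0
        for k, c in enumerate(counts):
            acc += c
            if u < acc:
                break
        else:
            # u fell into the final slice of width alpha: open a new color.
            k = len(counts)
            counts.append(0)
        counts[k] += 1
        assignments.append(k)
    return assignments, counts


assignments, counts = polya_urn_partition(50, alpha=1.0, rng=random.Random(0))
print(len(counts), "colors for 50 samples")
```

Large colors attract more balls (the self-reinforcing property), while the α slice keeps the number of colors unbounded as n grows.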

  20. Clustering and DP Mixture • We can associate ancestors (i.e., mixture components) with the colors in the Pólya urn and thereby define an infinite clustering of the haplotypes (i.e., balls) • (Figure: balls 1, 3, 2, 4, 5, 6, … colored by components {a1, θ1}, {a2, θ2}, {a3, θ3}, …)

  21. Dirichlet Process Mixture of Haplotypes (Xing et al. ICML 2004) • A Hierarchical Bayesian Infinite Allele model • DP (α, G0, G): infinite mixture components Ak, θk (for population haplotypes) • Likelihood model (for individual haplotypes and genotypes): Hn1, Hn2, Gn

  22. Population Genetic Basis of IAM • Kingman coalescent process with fixed (large) population size • New population haplotype alleles emerge along all branches of the coalescence tree at rate α/2 per unit length • ⇒ Ewens Sampling Formula: an exchangeable random partition of individuals • Coalescent with mutation ⇒ Dirichlet Process Mixture • (Figure labels: infinite mixtures; "star" genealogies)
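
One consequence of the Ewens sampling formula that is easy to check numerically: in a sample of n haplotypes, the expected number of distinct alleles (clusters) is the sum of α / (α + i) for i = 0, …, n − 1. The helper name below is illustrative.

```python
def expected_num_clusters(n, alpha):
    """E[K_n] under the Ewens sampling formula / Polya urn with parameter alpha."""
    return sum(alpha / (alpha + i) for i in range(n))


for n in (10, 100, 1000):
    print(n, round(expected_num_clusters(n, alpha=1.0), 2))
```

The expectation grows roughly like α log n, so the number of population haplotypes increases slowly with sample size instead of being fixed in advance.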

  23. Inheritance and Observation Models • Single-locus mutation model • Noisy observation model • (Figure labels: ancestral pool, haplotypes, genotype)

  24. Pólya urn MCMC for Haplotype Inference • Posterior ∝ Prior × Likelihood • Gibbs sampling for exploring the posterior distribution under the proposed model • Integrate out the parameters such as or , and sample and • Gibbs sampling algorithm: draw samples of each random variable to be sampled given values of all the remaining variables
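
A minimal sketch of one Gibbs step of this kind: computing the conditional probabilities for reassigning a single haplotype to an ancestor. Several things here are assumptions, not taken from the slides: the mutation model copies the ancestor's allele with probability `theta` and otherwise picks one of the other alleles uniformly, `theta` is held fixed instead of being integrated out, and the base measure over new ancestors is uniform. The function names are hypothetical.

```python
def mutation_likelihood(h, a, theta, n_alleles=2):
    """p(h | a): each locus is inherited unchanged with probability theta."""
    p = 1.0
    for x, y in zip(h, a):
        p *= theta if x == y else (1.0 - theta) / (n_alleles - 1)
    return p


def assignment_weights(h, ancestors, counts, alpha, theta, n_alleles=2):
    """Normalized Gibbs weights for haplotype h.

    ancestors[k], counts[k]: ancestor haplotype k and the number of OTHER
    haplotypes currently assigned to it. The last weight returned is the
    probability of founding a new ancestor.
    """
    weights = [
        n * mutation_likelihood(h, a, theta, n_alleles)
        for a, n in zip(ancestors, counts)
    ]
    # Under a uniform base measure, averaging p(h | a) over all possible
    # ancestors a gives (1 / n_alleles) per locus.
    weights.append(alpha * (1.0 / n_alleles) ** len(h))
    z = sum(weights)
    return [w / z for w in weights]


w = assignment_weights("0110", ["0110", "1001"], [3, 2], alpha=1.0, theta=0.9)
print(w)
```

Each weight is the Pólya urn prior (a count, or α for a new ancestor) times the likelihood, which is the "Prior × Likelihood" structure named on the slide.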

  25. Results - HapMap Data • DP vs. Finite Mixture via EM • (Figure legend: DP, EM)

  26. Extensions of the DP haplotyper

  27. Multi-population Genetic Demography • Inference done separately, or jointly?

  28. Multi-population Genetic Demography • Pool everything together and solve 1 hap problem? • --- ignore population structures • Solve 4 hap problems separately? • --- data fragmentation • Co-clustering … solve 4 coupled hap problems jointly

  29. Hierarchical DP Mixture (Xing et al. ICML 2006) • (Figure: top-level parameters γ and H; population-level measures G1, G2, G3, G4 for populations 1 to 4)

  30. A Hierarchical Pólya Urn Sampler • Two-level Pólya urn scheme • Draws from the stock urn define a Dirichlet Process DP(γ, H) • Conditioning on DP(γ, H), the mj-th draws from the m-th bottom-level urn also form a Dirichlet measure
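
The two-level scheme can be sketched as follows. Each population m has its own bottom-level urn; a draw either repeats a color already in that urn or falls through to the shared stock urn, which in turn either repeats one of its colors or creates a brand-new one. The bottom-level concentration `tau` is a name chosen for this sketch (the slide only names γ and H), and colors are labeled with consecutive integers for simplicity.

```python
import random


def hierarchical_urn_draw(m, bottom_counts, stock_counts, tau, gamma, rng):
    """Draw one color for population m from a two-level Polya urn.

    bottom_counts[m][k]: balls of color k in population m's urn.
    stock_counts[k]: balls of color k in the shared stock urn.
    Both structures are updated in place; the chosen color is returned.
    """
    urn = bottom_counts[m]
    u = rng.random() * (sum(urn.values()) + tau)
    for k, c in urn.items():
        u -= c
        if u < 0:
            urn[k] += 1  # reuse a color already present in this population
            return k
    # Otherwise consult the stock urn shared by all populations.
    v = rng.random() * (sum(stock_counts.values()) + gamma)
    chosen = None
    for k, c in stock_counts.items():
        v -= c
        if v < 0:
            chosen = k
            break
    if chosen is None:
        chosen = len(stock_counts)  # brand-new color, i.e. a new ancestor
    stock_counts[chosen] = stock_counts.get(chosen, 0) + 1
    urn[chosen] = urn.get(chosen, 0) + 1
    return chosen


rng = random.Random(1)
bottom = [{} for _ in range(3)]  # three hypothetical populations
stock = {}
draws = [hierarchical_urn_draw(i % 3, bottom, stock, 1.0, 1.0, rng) for i in range(60)]
print(sorted(stock), [sorted(b) for b in bottom])
```

Because new colors can only enter a population through the stock urn, different populations end up sharing ancestors, which is the coupling that the joint multi-population analysis relies on.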

  31. Results - Simulated Data • 5 populations with 20 individuals each (two kinds of mutation rates) • 5 populations share parts of their ancestral haplotypes • the sequence length = 10 • (Figure: haplotype error)

  32. Results on Simulated Data • Estimation of K

  33. Results - International HapMap DB • Different sample sizes, and different # of sub-populations

  34. Genetic Inference • Determine genetic markers • Haplotype inference • Reveal genome inheritance events • Recombination hotspot identification • Deconvolve population structure • Ancestral spectrum analysis

  35. Modeling Haplotype Structure • Inhomogeneous transition model • (Figure: underlying haplotypes, with numeric labels 10 12 8 11 11 8 14 5 10 1) • Open issues: • Where are the boundaries? How many haplotypes per block? • Genetically unreasonable to assume different # of haplotypes for different blocks

  36. Inheritance Model • Each individual haplotype is a mosaic of ancestral haplotypes • (Figure: ancestral chromosomes (K=5) and individual chromosomes, with x marks along the chromosomes)

  37. The Hidden Markov Model • Transition process: recombination (K × K matrix from Ct to Ct+1 over ancestors 1, 2, …, K) • Emission process: mutation • How many recombining ancestors?
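
For a fixed number K of ancestors, the likelihood of one haplotype under such an HMM can be computed with the forward algorithm. The parameterization below is an assumption made for this sketch, not the one from the talk: a single recombination probability `r` per adjacent pair of loci (on recombination the new ancestor is picked uniformly) and a per-locus copying fidelity `theta`.

```python
def forward_likelihood(h, ancestors, r, theta, n_alleles=2):
    """p(h) when h is a mosaic of the given ancestral haplotypes.

    Hidden state C_t: index of the ancestor copied at locus t.
    Transition: keep the ancestor w.p. 1 - r, else jump uniformly (recombination).
    Emission: copy the ancestor's allele w.p. theta, else mutate uniformly.
    """
    K = len(ancestors)

    def emit(k, t):
        same = ancestors[k][t] == h[t]
        return theta if same else (1.0 - theta) / (n_alleles - 1)

    # Uniform initial distribution over the K ancestors.
    f = [emit(k, 0) / K for k in range(K)]
    for t in range(1, len(h)):
        total = sum(f)
        f = [((1.0 - r) * f[k] + r * total / K) * emit(k, t) for k in range(K)]
    return sum(f)


# Hypothetical example: a haplotype that switches ancestor halfway through.
print(forward_likelihood("0011", ["0000", "1111"], r=0.2, theta=0.9))
```

The closing question on the slide is exactly the weakness of this sketch: K has to be supplied up front.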

  38. Recall DP Mixture (Xing et al. ICML 2004, 2006) • A Hierarchical Bayesian Infinite Allele model • DP (α, G0, G): infinite mixture components Ak, θk (for population haplotypes) • Likelihood model (for individual haplotypes and genotypes): Hn1, Hn2, Gn

  39. Hidden Markov Dirichlet Process (Xing and Sohn, Bayesian Analysis, 2007; Sohn and Xing, ISMB 2007) • Hidden Markov Dirichlet process mixtures • Extension of HMM model to infinite ancestral space • Infinite dimensional transition matrix (from Ct to Ct+1) • Each row of the transition matrix is modeled with a DP

  40. HMDP as a Graphical Model • Ancestor allele reconstruction • Inferring population structure • Inferring recombination hotspot • (Figure: ancestors A (∞ of them), hidden states C1, C2, C3, …, CN (∞ possible values), haplotype H)

  41. MCMC Inference • Gibbs sampling • Block sampler: • Pólya urn sampler: posterior transition probability

  42. Recombination Analysis • Recombination hotspot detection • (Figure: threshold for hotspot detection)

  43. Recombination Analysis • (Figure panels: CEU, YRI, HCB+JPT, HapMap4)

  44. Genetic Inference • Determine genetic markers • Haplotype inference • Reveal genome inheritance events • Recombination hotspot identification • Deconvolve population structure • Ancestral spectrum analysis

  45. Genetic Population Structure • How to display population structure? • Structure: ancestral proportion • Genetic Structure of Human Populations (Rosenberg et al. 2002)

  46. Variable Number of Tandem Repeats (VNTR) Polymorphism

  47. The Admixture Model • Admixture of "ancestral frequency profiles (AP)" • No distinction between ancestral and current alleles • Does not model mutation and chromosomal recombination • Ancestral populations represented as allele frequency profiles (Structure 2.1) • (Figure values: 0.4 0.2 0.3 0.3 0.7 0.2 0.3 0.1 0.5; 0.02 0.08 0.90)
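
In an admixture of allele frequency profiles, the probability of observing an allele is the individual's admixing vector averaged over the ancestral profiles: p(x) = Σk πk βk[x]. A small sketch with hypothetical numbers (they are not the values printed on the slide) and illustrative function names:

```python
import math


def allele_probability(admixing, profiles, allele):
    """p(x) = sum_k pi_k * beta_k[x] at a single locus."""
    return sum(pi_k * beta_k[allele] for pi_k, beta_k in zip(admixing, profiles))


def log_likelihood(observed, admixing, profiles_by_locus):
    """Log-likelihood of an individual's alleles, treating every copy as independent.

    observed[l]: the allele copies seen at locus l (two for a diploid).
    profiles_by_locus[l][k]: allele frequency profile of ancestral population k.
    """
    return sum(
        math.log(allele_probability(admixing, profiles_by_locus[l], x))
        for l, alleles in enumerate(observed)
        for x in alleles
    )


# Hypothetical individual: 50% / 30% / 20% ancestry from three populations.
admixing = [0.5, 0.3, 0.2]
locus_profiles = [{"A": 0.9, "B": 0.1}, {"A": 0.2, "B": 0.8}, {"A": 0.5, "B": 0.5}]
print(allele_probability(admixing, locus_profiles, "A"))
```

Nothing in `allele_probability` separates an ancestral allele from a mutated copy of it, which is the limitation the slide points out.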

  48. From Structure to mStruct (Shringarpure and Xing, ICML 2008) • From admixture of APs to admixture of MIMs • MIM: population-specific Mixture of Inheritance Models • The inheritance model: • Microsatellite: • SNPs:

  49. Variational Inference • The joint: • We can sample z, c, and θ as in Structure (slow) • Alternatively, we approximate p(z, c, θ | x) by q(z, c, θ) = q(z) q(c) q(θ) • Minimizing KL(q ‖ p): • Fixed-point iteration …
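
The model-specific updates are not in the transcript, but the generic mean-field identities behind "minimizing KL" and "fixed-point iteration" are standard and can be written down for the factorization q(z) q(c) q(θ). The lower-bound symbol \mathcal{L} is notation introduced here.

```latex
\begin{align}
\log p(x) &= \mathcal{L}(q) + \mathrm{KL}\bigl(q(z, c, \theta) \,\|\, p(z, c, \theta \mid x)\bigr), \\
\mathcal{L}(q) &= \mathbb{E}_{q}\bigl[\log p(x, z, c, \theta)\bigr]
                - \mathbb{E}_{q}\bigl[\log q(z, c, \theta)\bigr], \\
q^{*}(z) &\propto \exp\Bigl(\mathbb{E}_{q(c)\,q(\theta)}\bigl[\log p(x, z, c, \theta)\bigr]\Bigr).
\end{align}
```

Since log p(x) does not depend on q, minimizing the KL term is the same as maximizing the bound; the third line, together with the analogous updates for q(c) and q(θ), is cycled until the factors stop changing.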

  50. Accuracy of Admixing Vector Estimation
