
A Coalescent-based Method for Population Tree Inference with Haplotypes



Presentation Transcript


  1. A Coalescent-based Method for Population Tree Inference with Haplotypes. Yufeng Wu, Dept. of Computer Science & Engineering, University of Connecticut, USA. Cold Spring Harbor Asia Meeting, Suzhou, China, 2014.

  2. Population tree: the history of population splits (including their order and times); it is not known. Locus (gene): a genomic region. Coalescent genealogical tree: the underlying genetic model, with coalescence of gene lineages and mutations along branches. H: haplotypes at SNPs. Example with one haplotype from each of populations A, B, C and D: a (A): AAGCCAATTCCGAACAAGA; b (B): ACGCCAATTCCGGACAAGA; c (C): ACGCCTATTCCGGACAAGA; d (D): AAGCCAATTCCGAACCAGA. At the four SNP sites (1 to 4) these read a: AAAA, b: CAGA, c: CTGA, d: AAAC. Population tree inference: given haplotypes H from multiple loci, infer the population tree T. MLE of T: find the T maximizing P(H|T), the probability of H given T under coalescent models. Challenge: P(H|T) is difficult to compute, even for a single population. [Figure: population tree over A, B, C, D with an embedded genealogy of a, b, c, d; time axis.]

  3. SNP vs. Haplotype. Common simplification: treat haplotypes as unlinked variants (SNPs), so that P(H|T) ≈ P(S1|T) P(S2|T) P(S3|T) …, where Si is the ith SNP of H. See, e.g., SNAPP (Bryant et al., MBE, 2012) and TreeMix (Pickrell and Pritchard, PLoS Genet, 2012). Single SNPs: potential loss of the information contained in haplotypes. This talk: likelihood-based population tree inference from haplotypes. Assumptions: (1) no intra-locus recombination and (2) the infinite sites model of mutation. Fact 1: haplotypes H imply a unique (non-bifurcating) genealogical tree, called the perfect phylogeny TH. Example over SNP sites 1 to 4: a: AAAA, b: CAGA, c: CTGA, d: AAAC. Fact 2: under the infinite sites model, P(H|T) = P(TH|T). Unfortunately, computing P(TH|T) is still non-trivial. [Figure: perfect phylogeny TH for haplotypes a, b, c, d, with mutation sites labeling the branches.]
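A minimal sketch of Fact 1 for the four example haplotypes, not the talk's implementation. It assumes biallelic sites and uses the standard four-gamete test to check that a perfect phylogeny exists, then lists which samples carry the derived allele at each site. The function names are hypothetical, and the ancestral haplotype AAAA is an assumption made for illustration.

```python
from itertools import combinations

def admits_perfect_phylogeny(haplotypes):
    """Four-gamete test for biallelic sites: every pair of sites
    may show at most 3 distinct allele combinations."""
    n_sites = len(haplotypes[0])
    for i, j in combinations(range(n_sites), 2):
        gametes = {(h[i], h[j]) for h in haplotypes}
        if len(gametes) > 3:
            return False
    return True

def derived_clades(names, haplotypes, ancestral):
    """For each site, the set of samples carrying the non-ancestral allele."""
    clades = []
    for site, anc in enumerate(ancestral):
        carriers = frozenset(n for n, h in zip(names, haplotypes) if h[site] != anc)
        clades.append(carriers)
    return clades

names = ["a", "b", "c", "d"]
haps = ["AAAA", "CAGA", "CTGA", "AAAC"]   # example haplotypes from the slide
ancestral = "AAAA"                        # assumed ancestral state (illustrative)

print(admits_perfect_phylogeny(haps))
for site, clade in enumerate(derived_clades(names, haps, ancestral), start=1):
    print(site, sorted(clade))
```

Under these assumptions, sites 1 and 3 both group b with c, site 2 is private to c, and site 4 is private to d, so the tree has an unresolved node joining a, d and the (b, c) clade.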

  4. Simplification of Likelihood. G': the genealogical topology implied by haplotypes H, obtained by ignoring mutations on the genealogy G. Key assumption: P(G|T) ≈ P(G'|T). Inference of population tree T: maximize P(G'1|T) P(G'2|T) P(G'3|T) …, where G'i is the gene genealogical topology of the ith locus. From here on, G refers to the genealogical topology. Genealogical topology G and population tree T: if gene lineages b and c coalesce first, populations B and C are likely to be more closely related. But not always: under incomplete lineage sorting, the gene tree topology is stochastic. [Figure: genealogy G with mutations 1 to 4 and its topology G' on leaves a, b, c, d; a gene tree embedded in a population tree over A, B, C, D.]
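The inference objective on this slide can be written compactly as follows. The symbol L for the number of loci and the hat notation for the estimate are introduced here for illustration; the logarithmic form is the usual equivalent way to maximize a product of probabilities.

```latex
\hat{T}
  \;=\; \arg\max_{T} \prod_{i=1}^{L} P\!\left(G'_i \mid T\right)
  \;=\; \arg\max_{T} \sum_{i=1}^{L} \log P\!\left(G'_i \mid T\right)
```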

  5. STELLSH: infer population trees from haplotypes. Gene tree probability P(G|T): the probability of observing a gene tree topology G for population tree T under coalescent theory. When G is bifurcating, P(G|T) can be computed (relatively) efficiently by the STELLS algorithm (Wu, Evolution, 2012) and used in inference. Issue: the perfect phylogeny from haplotypes is usually non-bifurcating. Gene tree probability for a non-bifurcating topology: the sum over all compatible bifurcating topologies. This can be computed more efficiently: Wu, manuscript, 2014. STELLSH: maximize the probability of all gene topologies by optimizing the topology and branch lengths of the population tree (e.g., by nearest neighbor interchange).
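To give a sense of how many terms the sum over compatible bifurcating topologies contains, here is a small sketch that counts them. It is not the STELLS or STELLSH algorithm. It relies on the standard fact that a node with k children can be resolved into (2k-3)!! rooted bifurcating shapes, and it assumes a hypothetical nested-tuple encoding of trees with string leaves.

```python
def rooted_binary_count(k):
    """Number of rooted bifurcating topologies on k labelled subtrees: (2k-3)!!."""
    result = 1
    for m in range(3, 2 * k - 2, 2):
        result *= m
    return result

def count_resolutions(tree):
    """Count bifurcating trees compatible with a (possibly multifurcating)
    tree given as nested tuples; leaves are strings."""
    if isinstance(tree, str):
        return 1
    total = rooted_binary_count(len(tree))   # ways to resolve this node
    for child in tree:
        total *= count_resolutions(child)    # independent choices below it
    return total

# Non-bifurcating topology on a, b, c, d with an unresolved root (illustrative).
example = ("a", "d", ("b", "c"))
print(count_resolutions(example))

# A fully unresolved (star) tree on 6 leaves.
print(count_resolutions(("a", "b", "c", "d", "e", "f")))
```

The count grows quickly with the degree of unresolved nodes, which is why summing term by term becomes impractical for larger samples.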

  6. Simulation. Population tree: same tree topologies. Haplotypes: simulated with Hudson's ms (which supports the island model). • Multiple alleles per population per gene • Various population tree heights (0.1, 0.5 and 1.0 coalescent units) • Number of loci: 10, 50, 100, 200, 500. Inference • STELLSH: infer the population tree from haplotypes. Evaluation • Topological error of inferred population trees. Results: Assume no migration and no intra-locus recombination. Accuracy: higher with more loci. Moderate migration or recombination: inference remains accurate. Strong migration or high recombination: less accurate. [Figure: inference error plotted against number of loci.]
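The slide does not say which measure of topological error was used. As one common choice, the sketch below computes a Robinson-Foulds style distance between two rooted trees by counting clades present in one tree but not the other. The metric, the nested-tuple encoding and the example trees are all assumptions made for illustration.

```python
def clades(tree):
    """Return (leaf set, set of clades) for a rooted tree given as nested
    tuples with string leaves. Single leaves are not counted as clades."""
    if isinstance(tree, str):
        return frozenset([tree]), set()
    leaves = frozenset()
    found = set()
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found |= child_clades
    found.add(leaves)
    return leaves, found

def rf_distance(tree1, tree2):
    """Number of clades found in exactly one of the two rooted trees."""
    leaves1, clades1 = clades(tree1)
    leaves2, clades2 = clades(tree2)
    if leaves1 != leaves2:
        raise ValueError("trees must have the same leaf set")
    return len(clades1 ^ clades2)

# Hypothetical true and inferred population trees over A, B, C, D.
true_tree = ((("A", "B"), "C"), "D")
inferred_tree = ((("A", "C"), "B"), "D")
print(rf_distance(true_tree, inferred_tree))
```

A distance of zero means the inferred topology matches the true one exactly; averaging this value over simulation replicates gives one possible error curve per number of loci.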

  7. Comparison with TreeMix. Simulated data. STELLSH (solid lines): up to 4 alleles per population. TreeMix (dashed lines): up to 100 alleles per population. STELLSH is more accurate than TreeMix, even though TreeMix uses 25 times more data. Also analyzed part of the 1000 Genomes Project data to infer population trees for 10 populations: CHB, JPT, CHS, CEU, TSI, FIN, GBR, IBS, YRI, and LWK. • Conclusion: • Haplotypes can be more informative than individual SNPs • Simplifying the likelihood function may lead to faster algorithms for use in inference. Paper: "A Coalescent-based Method for Population Tree Inference with Haplotypes", Yufeng Wu, submitted for publication, 2014. Paper: "Coalescent-based Species Tree Inference from Gene Tree Topologies Under Incomplete Lineage Sorting by Maximum Likelihood", Yufeng Wu, Evolution, v. 66 (3), p. 763-775, 2012. Research supported by the National Science Foundation under grants IIS-0803440 and CCF-1116175.
