A Coalescent-based Method for Population Tree Inference with Haplotypes. Yufeng Wu Dept. of Computer Science & Engineering University of Connecticut, USA. Population Tree: Population split history (including order and time); not known. Coalescence. Locus (gene): genomic region. Mutation.
A Coalescent-based Method for Population Tree Inference with Haplotypes
Yufeng Wu
Dept. of Computer Science & Engineering, University of Connecticut, USA
Cold Spring Harbor Asia Meeting, Suzhou, China, 2014
Population tree: population split history (including order and time); not known.
Locus (gene): genomic region.
Coalescent genealogical tree: underlying genetic model.
[Figure: population tree over populations A, B, C, D with the genealogy of gene lineages a, b, c, d; labels: Coalescence, Mutation, Time]
H: haplotypes at SNPs
a (A): AAGCCAATTCCGAACAAGA
b (B): ACGCCAATTCCGGACAAGA
c (C): ACGCCTATTCCGGACAAGA
d (D): AAGCCAATTCCGAACCAGA
At the four SNP sites (1 2 3 4): AAAA, CAGA, CTGA, AAAC.
Population tree inference: given haplotypes H from multiple loci, infer the population tree T.
MLE of T: find the T maximizing P(H|T), where P(H|T) is the probability of H given T under coalescent models.
Challenge: P(H|T) is difficult to compute even for a single population.
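The reduction from aligned sequences to SNP haplotypes on this slide can be sketched in a few lines. This is only an illustration: the function name `snp_haplotypes` is hypothetical and is not part of STELLSH or any other tool named in the talk.

```python
def snp_haplotypes(seqs):
    """Keep only the variable (SNP) columns of equal-length aligned sequences.

    Returns the 1-based positions of the variable columns and, for each
    sequence name, its alleles at those columns.
    """
    names = list(seqs)
    length = len(seqs[names[0]])
    if any(len(seqs[n]) != length for n in names):
        raise ValueError("sequences must be aligned to the same length")
    # A column is a SNP if more than one distinct base is observed in it.
    sites = [i for i in range(length)
             if len({seqs[n][i] for n in names}) > 1]
    haps = {n: "".join(seqs[n][i] for i in sites) for n in names}
    return [i + 1 for i in sites], haps


# The four sequences shown on the slide.
slide_seqs = {
    "a": "AAGCCAATTCCGAACAAGA",
    "b": "ACGCCAATTCCGGACAAGA",
    "c": "ACGCCTATTCCGGACAAGA",
    "d": "AAGCCAATTCCGAACCAGA",
}
positions, haplotypes = snp_haplotypes(slide_seqs)
print(positions)   # [2, 6, 13, 16]
print(haplotypes)  # {'a': 'AAAA', 'b': 'CAGA', 'c': 'CTGA', 'd': 'AAAC'}
```

The four variable columns are what the slide numbers as SNPs 1 to 4.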
SNP vs. Haplotype
Common simplification: treat haplotypes as unlinked variants (SNPs), so that P(H|T) ≈ P(S_1|T) P(S_2|T) P(S_3|T) …, where S_i is the ith SNP of H. See, e.g., SNAPP (Bryant et al., MBE, 2012) and TreeMix (Pickrell and Pritchard, PLoS Genet, 2012).
Single SNPs: potential loss of the information carried by haplotypes.
This talk: likelihood-based population tree inference from haplotypes.
Assumptions: (1) no intra-locus recombination and (2) the infinite sites model of mutations.
Fact 1: haplotypes H imply a unique (non-bifurcating) genealogical tree, called the perfect phylogeny T_H.
Example (SNP sites 1 2 3 4): a: AAAA, b: CAGA, c: CTGA, d: AAAC.
[Figure: perfect phylogeny T_H for the example; labels: AAAA, sites 1 to 4, leaves a, c, b, d]
Fact 2: under the infinite sites model, P(H|T) = P(T_H|T).
Unfortunately, computing P(T_H|T) is still non-trivial.
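Fact 1 can be checked on the example haplotypes with a small sketch. It assumes the ancestral haplotype is AAAA (the label shown in the figure; treating it as the root state is an assumption here). The helper names `derived_sets` and `admits_perfect_phylogeny` are hypothetical. Under the infinite sites model without recombination, the sets of haplotypes carrying the derived allele at each site must be nested or disjoint; those sets are the clades of the perfect phylogeny.

```python
from itertools import combinations


def derived_sets(haps, ancestral):
    """For each site, the set of haplotypes that differ from the ancestral allele."""
    return [frozenset(name for name, h in haps.items() if h[j] != ancestral[j])
            for j in range(len(ancestral))]


def admits_perfect_phylogeny(haps, ancestral):
    """True if every pair of per-site derived sets is nested or disjoint."""
    sets = derived_sets(haps, ancestral)
    return all(x.isdisjoint(y) or x <= y or y <= x
               for x, y in combinations(sets, 2))


haps = {"a": "AAAA", "b": "CAGA", "c": "CTGA", "d": "AAAC"}
root = "AAAA"  # assumed ancestral haplotype

print(admits_perfect_phylogeny(haps, root))  # True
# Distinct clades induced by the mutations: {b, c}, {c} and {d}.
print(sorted(sorted(s) for s in set(derived_sets(haps, root))))
```

Sites 1 and 3 define the same clade {b, c}, site 2 defines {c}, and site 4 defines {d}. Lineage a carries no mutation, so the root has three children (a, d, and the clade of b and c), which is why the tree is non-bifurcating.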
Simplification of Likelihood
G': genealogical topology implied by haplotypes H.
Ignore mutations on genealogy G. Key assumption: P(G|T) ≈ P(G'|T).
Inference of population tree T: maximize P(G'_1|T) P(G'_2|T) P(G'_3|T) …, where G'_i is the gene genealogical topology of the ith locus.
From here on, G refers to the genealogical topology.
[Figure: the genealogy with mutations 1, 3, 4, 2 next to the topology obtained by ignoring mutations; leaves a, c, b, d]
Genealogical topology G and population tree T:
• Gene lineages b and c coalesce first, so populations B and C are likely to be more closely related.
• But not always…
[Figure: genealogy of a, c, b, d drawn inside the population tree of A, B, D, C]
Incomplete lineage sorting: the gene tree topology is stochastic.
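The stochasticity of gene tree topologies, and the product-of-loci likelihood, can be illustrated on the simplest textbook case rather than the four-population tree of the slides: three populations with tree ((B,C),A), one sampled lineage per population, and an internal branch of length t in coalescent units. In that case the gene tree matches the population tree with probability 1 - (2/3)e^(-t), and each of the two discordant topologies has probability (1/3)e^(-t). The function names and the locus counts below are made up for illustration; this is not the STELLS computation.

```python
import math


def three_taxon_topology_probs(t):
    """Gene tree topology probabilities for population tree ((B,C),A).

    t is the internal branch length in coalescent units; one lineage is
    sampled per population.
    """
    p_discordant = math.exp(-t) / 3.0
    return {
        "((b,c),a)": 1.0 - 2.0 * p_discordant,  # matches the population tree
        "((a,b),c)": p_discordant,
        "((a,c),b)": p_discordant,
    }


def log_likelihood(observed_topologies, t):
    """Sum of log P(G_i | T) over independent loci."""
    probs = three_taxon_topology_probs(t)
    return sum(math.log(probs[g]) for g in observed_topologies)


# Hypothetical data: 10 loci, 8 of them concordant with ((B,C),A).
loci = ["((b,c),a)"] * 8 + ["((a,b),c)", "((a,c),b)"]

# Crude grid search for the branch length that maximizes the likelihood.
grid = [i / 100 for i in range(1, 501)]
t_hat = max(grid, key=lambda t: log_likelihood(loci, t))
print(t_hat)
```

With a short internal branch the three topologies are almost equally likely, so a single locus says little about the population tree; multiplying probabilities across loci is what recovers the signal.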
STELLSH: infer population trees from haplotypes
Gene tree probability P(G|T): for a population tree T and a gene tree topology G, the probability of observing G given T under coalescent theory.
When G is bifurcating, P(G|T) can be computed (relatively) efficiently by the STELLS algorithm (Wu, Evolution, 2012) and used in inference.
Issue: the perfect phylogeny from haplotypes is usually non-bifurcating.
Gene tree probability for a non-bifurcating topology: the sum over all compatible bifurcating topologies. It can be computed more efficiently (Wu, manuscript, 2014).
STELLSH: maximizes the probability of all gene topologies by optimizing the topology and branch lengths of the population tree (e.g., by nearest neighbor interchange).
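The definition of the gene tree probability for a non-bifurcating topology can be made concrete by brute force: enumerate every bifurcating topology compatible with the polytomies and add up their probabilities. The sketch below does exactly that and is deliberately naive; it is not the more efficient computation referred to on the slide. Trees are nested tuples with string leaves, and `prob_of_binary` stands in for a bifurcating gene tree probability routine such as STELLS (a hypothetical callback, not a real API).

```python
from itertools import combinations, product


def binary_trees(items):
    """Yield every rooted binary tree whose leaves are the given items."""
    items = list(items)
    if len(items) == 1:
        yield items[0]
        return
    first, rest = items[0], items[1:]
    # The first item always goes to the left side, so each unordered
    # split of the items is produced exactly once.
    for size in range(len(rest)):  # the right side keeps at least one item
        for idx in combinations(range(len(rest)), size):
            left = [first] + [rest[i] for i in idx]
            right = [rest[i] for i in range(len(rest)) if i not in idx]
            for left_tree in binary_trees(left):
                for right_tree in binary_trees(right):
                    yield (left_tree, right_tree)


def resolutions(tree):
    """Yield every bifurcating tree compatible with a multifurcating tree."""
    if isinstance(tree, str):
        yield tree
        return
    child_options = [list(resolutions(child)) for child in tree]
    for choice in product(*child_options):
        yield from binary_trees(choice)


def polytomy_probability(tree, prob_of_binary):
    """Sum a bifurcating-tree probability over all compatible resolutions."""
    return sum(prob_of_binary(g) for g in resolutions(tree))


# Topology of the example perfect phylogeny: the root has children a, d
# and the clade (b, c).
example = ("a", "d", ("b", "c"))
for g in resolutions(example):
    print(g)
```

A node with k children has (2k-3)!! resolutions, so the count grows quickly with the size of the polytomy, which is why a faster method matters in practice.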
Simulation
• Population tree: same tree topologies
• Haplotypes: simulated with Hudson's ms (which supports the island model)
• Multiple alleles per population per gene
• Various population tree heights (0.1, 0.5 and 1.0 coalescent units)
• Number of loci: 10, 50, 100, 200, 500
Inference
• STELLSH: infer population tree from haplotypes
Evaluation
• Topological error of inferred population trees
[Figure: inference error vs. number of loci]
Assume: no migration; no intra-locus recombination. Accuracy: higher with more loci.
Moderate migration or recombination: inference remains accurate.
Strong migration or high recombination: less accurate.
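The slides do not say how topological error is measured. One common choice, shown here purely as an assumption, is the Robinson-Foulds distance on rooted trees: the number of clades found in one tree but not in the other. Trees are nested tuples with string leaves, and the function names are hypothetical.

```python
def clades(tree):
    """Return the leaf sets of all internal nodes except the root."""
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(leaves(child) for child in node))
        found.add(below)
        return below

    root = leaves(tree)
    found.discard(root)  # the root clade is shared by every tree on these leaves
    return found


def rf_distance(tree1, tree2):
    """Rooted Robinson-Foulds distance: size of the clade symmetric difference."""
    return len(clades(tree1) ^ clades(tree2))


true_tree = ((("A", "B"), "C"), "D")
inferred = ((("A", "C"), "B"), "D")
print(rf_distance(true_tree, inferred))  # 2
```

A distance of 0 means the inferred topology matches the true one regardless of how the children are ordered.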
Compare with TreeMix
Simulation data
• STELLSH (solid lines): up to 4 alleles per population
• TreeMix (dashed lines): up to 100 alleles per population
STELLSH: more accurate than TreeMix, even though TreeMix uses 25 times more data.
Also analyzed part of the 1000 Genomes Project data to infer population trees for 10 populations: CHB, JPT, CHS, CEU, TSI, FIN, GBR, IBS, YRI, and LWK.
Conclusion
• Haplotypes can be more informative than individual SNPs.
• Simplifying the likelihood function may lead to faster algorithms for use in inference.
Paper: "A Coalescent-based Method for Population Tree Inference with Haplotypes", Yufeng Wu, submitted for publication, 2014.
Paper: "Coalescent-based Species Tree Inference from Gene Tree Topologies Under Incomplete Lineage Sorting by Maximum Likelihood", Yufeng Wu, Evolution, v. 66 (3), p. 763-775, 2012.
Research supported by the National Science Foundation under grants IIS-0803440 and CCF-1116175.