Finding Genes based on Comparative Genomics

Robin Raffard. November 30th, 2004. CS 374.

Presentation Transcript


  1. Finding Genes based on Comparative Genomics. Robin Raffard, November 30th, 2004, CS 374

  2. References. Main references: • Multiple-sequence functional annotation and the generalized hidden Markov phylogeny. McAuliffe J., Pachter L., Jordan M. 2004. • Computational identification of evolutionarily conserved exons. Siepel A., Haussler D. 2004. Additional references: • Phylogenetic shadowing of primate sequences to find functional regions of the human genome. Boffelli D., McAuliffe J., Ovcharenko D., Lewis K., Ovcharenko I., Pachter L., Rubin E. • A hidden Markov model approach to variation among sites in rate of evolution. Felsenstein J., Churchill G. • Statistical Methods in Bioinformatics (Statistics for Biology and Health series). Ewens W., Grant G.

  3. Problem formulation: DNA consists of genes (functional sequences) separated by intergenic regions (nonfunctional sequences). [Figure: DNA sequence ATCATTACGCGGCTTAGCCCTTATAGCGATACGATGACAGATGACAA with Gene 1, Gene 2, and Gene 3 marked, separated by intergenic regions]

  4. Problem formulation: DNA consists of genes (functional sequences) separated by intergenic regions (nonfunctional sequences). [Figure: Gene 1, Gene 2, Gene 3 marked along the DNA]

  5. Problem formulation: DNA consists of genes (functional sequences) separated by intergenic regions (nonfunctional sequences). [Figure: Gene 1, Gene 2, Gene 3 marked along the DNA]

  6. Problem formulation: DNA consists of genes (functional sequences) separated by intergenic regions (nonfunctional sequences). Problem: find genes using comparative genomics. Key: exons are conserved through evolution. [Figure: Gene 1, Gene 2, Gene 3 marked along the DNA]

  7. In Practice
     >human
     AGTGAGACACGACGAGCCTACTATCAGGACGAGAGCAGGAGAGTGATGATGAGTAGCGCACAGCGACGATCATCACGAGAGAGTAAGAAGCAGTGATGATGTAGAGCGACGAGAGCACAGCGGCGACTACTACTAGG
     >mouse
     AGTGTGTCTCGTCGTGCCTACTTTCAGGACGAGAGCAGGTGAGTGTTGATGAGTTGCGCTCTGCGACGTTCATCTCGAGTGAGTTAGAAAGTGAAGGTATAACACAAGGTGTGAAGGCAGTGATGATGTAGAGCGACGAGAGCACAGCGGCGGGATGATATATCTAGGAGGATGCCCAATTTTTTTTT
     >platypus
     CTCTGCGGCGTTCGTCTCGGGTGGGTTGGGGGGTGGGGGTGTGGCGCAAGGTGTGAAGCACGACGACGATCTACGACGAGCGAGTGATGAGAGTGATGAGCGACGACGAGCACTAGAAGCGACGACTACTATCGACGAGCAGCCGAGATGATGATGAAAGAGAGAGAA

  8. 2 Questions • 1st question: Which genomes should be compared: human/mouse or human/primates? • 2nd question: How can genes be extracted from this comparison?

  9. Outline • Human/Mouse vs Human/Primate • Advantages of Human/Mouse • Advantages of Human/Primate • Conclusion • Gene Finding • Phylogenetic tree • Hidden Markov Chain • Hidden Markov Phylogeny • Contributions of the 2 papers

  10. Functional sequences in Human/Mouse/Primates [Plot axes: % similarity; DNA sequence]
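
A similarity curve of the kind plotted on this slide can be approximated with a sliding-window percent identity. The sketch below is only an illustration: it assumes the two sequences are already aligned to the same length, and the function name `window_identity` and the window size are hypothetical choices, not taken from the talk.

```python
def window_identity(seq_a, seq_b, window=10):
    """Percent identity in each sliding window of two aligned sequences.

    Assumes seq_a and seq_b are already aligned (same length, '-' for gaps).
    A gap never counts as a match.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not 0 < window <= len(seq_a):
        raise ValueError("window must be between 1 and the alignment length")

    # 1 where the two sequences agree on a real nucleotide, 0 elsewhere.
    matches = [int(a == b and a != "-") for a, b in zip(seq_a, seq_b)]

    # Running sum so each window costs O(1) after the first.
    running = sum(matches[:window])
    scores = [100.0 * running / window]
    for i in range(window, len(matches)):
        running += matches[i] - matches[i - window]
        scores.append(100.0 * running / window)
    return scores


if __name__ == "__main__":
    # First 20 bases of the human and mouse sequences from slide 7,
    # treated here as if they were already aligned (an assumption).
    human = "AGTGAGACACGACGAGCCTA"
    mouse = "AGTGTGTCTCGTCGTGCCTA"
    print(window_identity(human, mouse, window=5))
```

Windows with high identity are candidates for conserved, possibly functional, sequence; where to put a cutoff is a separate choice that this sketch does not make.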

  11. Advantage of Human/Mouse Easy to figure out what the functional sequences are

  12. Disadvantage of Human/Mouse: Some human genes are not present in the mouse genome, so they cannot be found from a Human/Mouse comparison. [Figure labels: Human, Mouse]

  13. Human/Primates

  14. Phylogenetic shadowing

  15. Phylogenetic shadowing on real data [Plot axes: likelihood of mutation (log); DNA sequence]

  16. Motivating example: the apo(a) gene • Plasma protein • Important cardiovascular disease risk predictor [Figure labels: Absent, Present]

  17. Phylogenetic shadowing of apo(a) [Plot axes: likelihood of mutation (log); DNA sequence]

  18. So Human/Mouse or Human/Primate? • Old genes: Human/Mouse (non-coding sequences differ strongly) • New genes: Human/Primate (straightforward alignment of coding sequences)

  19. Outline • Human/Mouse vs Human/Primate • Advantages of Human/Mouse • Advantages of Human/Primate • Conclusion • Gene Finding • Phylogenetic tree • Hidden Markov Chain • Hidden Markov Phylogeny • Contributions of the 2 papers

  20. Naive way of extracting genes. Drawbacks: • Is not flexible/probabilistic. • Does not respect gene structure.

  21. 1st step: Phylogenetic tree. Given a nucleotide, is it functional or not? [Figure labels: Nucleotide 1, Nucleotide 2, Species]

  22. Primate phylogeny [Figure: tree labeled with nucleotides T T A A G A]

  23. Primate phylogeny. Observed nucleotides: A A T A G A. • Which nucleotide? • Which rate α? [Figure: tree with further nodes labeled A A C A]

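  24. Algorithm • Given the observed nucleotides, find the most likely rate α. • Mathematically, $\hat{\alpha} = \arg\max_{\alpha} \mathrm{Prob}(\text{observed nucleotides} \mid \alpha)$ • Therefore, [equation not preserved in the transcript]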
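
The rate search on this slide can be sketched as a likelihood maximization over a fixed tree. The code below is an assumption-laden illustration, not the method of the cited papers: it assumes a Jukes-Cantor substitution model, a uniform prior at the root, a hypothetical three-species tree with made-up branch lengths, and a simple grid search over candidate rates. The column likelihood is computed with Felsenstein's pruning recursion.

```python
import math

BASES = "ACGT"


def jc_prob(a, b, t):
    """Jukes-Cantor probability that base a becomes base b over branch length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e


def partial(node, column, rate):
    """Return {base: P(observed leaves below node | node carries base)}.

    A leaf is a species name; an internal node is a tuple
    (left_child, left_branch_length, right_child, right_branch_length).
    The rate multiplies every branch length.
    """
    if isinstance(node, str):
        return {x: 1.0 if x == column[node] else 0.0 for x in BASES}
    left, t_left, right, t_right = node
    p_left = partial(left, column, rate)
    p_right = partial(right, column, rate)
    out = {}
    for x in BASES:
        s_left = sum(jc_prob(x, y, rate * t_left) * p_left[y] for y in BASES)
        s_right = sum(jc_prob(x, y, rate * t_right) * p_right[y] for y in BASES)
        out[x] = s_left * s_right
    return out


def column_likelihood(tree, column, rate):
    """Likelihood of one alignment column, with a uniform prior at the root."""
    p = partial(tree, column, rate)
    return sum(0.25 * p[x] for x in BASES)


def best_rate(tree, column, rates):
    """Grid search: the candidate rate with the highest column likelihood."""
    return max(rates, key=lambda r: column_likelihood(tree, column, r))


if __name__ == "__main__":
    # Hypothetical tree and branch lengths, for illustration only.
    tree = (("human", 0.1, "chimp", 0.1), 0.1, "baboon", 0.2)
    candidates = [0.1, 1.0, 5.0]
    conserved = {"human": "A", "chimp": "A", "baboon": "A"}
    variable = {"human": "A", "chimp": "G", "baboon": "T"}
    print(best_rate(tree, conserved, candidates))
    print(best_rate(tree, variable, candidates))
```

A column where all species agree is best explained by a slow rate, while a column where they all differ is best explained by a fast one; that contrast is what lets the rate act as a per-nucleotide signal of function.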

  25. Phylogenetic tree: Results Drawback: No biological model built in

  26. Gene structure. A gene finder should respect gene structure: • Promoter region about 50 bases upstream of the gene • TATA: start of transcription • 5' untranslated region • 3' untranslated region

  27. Gene Model [Figure: state diagram with states S1 to S6, labeled TATA, Exon, and Intron]

  28. Hidden Markov Chain Model. Composed of: • A sequence of unobservable states S1, S2, S3, …, Sn, where each Si is exon or intron. The jump from Si to Si+1 follows a Markov chain: P(Si+1 | Si) • A sequence of (sequences of) letters O1, O2, O3, …, On, which are emitted by the states according to P(Oi | Si) and which are observed. [Figure: chain S1 … S7 emitting O1 … O7 = ACGTACG…, with the transition P(S5 | S4) and the emission P(O1 | S1) labeled]
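
To make the two ingredients of the model concrete, the sketch below evaluates the joint probability Prob(S, O) of one state path and one letter sequence as a product of one start term, the transitions P(Si+1 | Si), and the emissions P(Oi | Si). All numeric values in `START`, `TRANS`, and `EMIT` are made up for illustration and are not taken from the talk or the papers.

```python
import math

# Hypothetical parameters for a two-state exon/intron chain.
START = {"exon": 0.5, "intron": 0.5}
TRANS = {
    "exon": {"exon": 0.9, "intron": 0.1},
    "intron": {"exon": 0.1, "intron": 0.9},
}
EMIT = {
    "exon": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "intron": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}


def joint_log_prob(states, letters):
    """log Prob(S, O) for one state path and one observed letter sequence."""
    if not states or len(states) != len(letters):
        raise ValueError("need one state per letter and at least one letter")
    # First position: start probability times emission.
    logp = math.log(START[states[0]]) + math.log(EMIT[states[0]][letters[0]])
    # Each later position: transition from the previous state, then emission.
    for prev, cur, letter in zip(states, states[1:], letters[1:]):
        logp += math.log(TRANS[prev][cur]) + math.log(EMIT[cur][letter])
    return logp


if __name__ == "__main__":
    path = ["exon", "exon", "intron", "intron"]
    print(joint_log_prob(path, "CGAT"))
```

Working in log space avoids numerical underflow on long sequences, which matters once the chain runs over thousands of nucleotides.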

  29. Viterbi Algorithm • Given the observed sequence of letters O1, …, On, find the most likely sequence of unobservable states S1, …, Sn. • Mathematically, find $\arg\max_{S_1,\dots,S_n} \mathrm{Prob}(S_1,\dots,S_n, O_1,\dots,O_n)$. • 2 steps: • Compute max Prob(S,O) via dynamic programming: max Prob(S1,…,Si+1,O) = f ( max Prob(S1,…,Si,O) ) • Find a sequence of states which achieves the optimum: Si = argmax max Prob(S1,…,Si,O).
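
The two steps above can be sketched as follows: a forward dynamic-programming pass that keeps, for every state, the best log-probability of any path ending there, and a backward pass that follows stored pointers to recover the path. The function signature and the toy parameters are assumptions made for this illustration.

```python
import math


def viterbi(letters, states, start, trans, emit):
    """Return (most likely state path, its log joint probability)."""
    if not letters:
        raise ValueError("need at least one observed letter")

    # Step 1: dynamic programming over positions.
    # best[s] = max log Prob of any path that ends in state s at this position.
    best = {s: math.log(start[s]) + math.log(emit[s][letters[0]]) for s in states}
    back = []  # back[i][s] = best predecessor of state s at position i + 1
    for letter in letters[1:]:
        new_best, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[p] + math.log(trans[p][s]))
            new_best[s] = (
                best[prev] + math.log(trans[prev][s]) + math.log(emit[s][letter])
            )
            pointers[s] = prev
        best = new_best
        back.append(pointers)

    # Step 2: backtrack from the best final state.
    last = max(states, key=lambda s: best[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]


if __name__ == "__main__":
    # Hypothetical toy parameters: exons favor C/G, introns favor A/T.
    states = ("exon", "intron")
    start = {"exon": 0.5, "intron": 0.5}
    trans = {
        "exon": {"exon": 0.9, "intron": 0.1},
        "intron": {"exon": 0.1, "intron": 0.9},
    }
    emit = {
        "exon": {"A": 0.05, "C": 0.45, "G": 0.45, "T": 0.05},
        "intron": {"A": 0.45, "C": 0.05, "G": 0.05, "T": 0.45},
    }
    print(viterbi("CCGGAATT", states, start, trans, emit))
```

The running time is proportional to the sequence length times the square of the number of states, which is why the recursion is preferred over enumerating all state paths.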

  30. Generalized hidden Markov phylogeny. Combines the 2 concepts: Hidden Markov chain + Phylogenetic tree = Generalized hidden Markov phylogeny

  31. Global Method • Get a series of DNA sequences • Align them • Build the Generalized Hidden Markov Model • Train the parameters on sample genes • Find the hidden states: Si • The coding sequences are the exons
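
A much-simplified version of this pipeline can be sketched by letting each hidden state emit a whole alignment column, with the emission probability given by a phylogenetic likelihood at a state-specific rate. The code below is an illustration under strong assumptions: a Jukes-Cantor model, a hypothetical tree, hand-picked rates instead of trained parameters, only two states, and a plain (not generalized) hidden Markov chain without explicit state durations. It is not the implementation from either paper.

```python
import math

BASES = "ACGT"


def jc(a, b, t):
    """Jukes-Cantor substitution probability over branch length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e


def partial(node, column, rate):
    """Pruning recursion; a leaf is a name, an internal node is
    (left, left_length, right, right_length)."""
    if isinstance(node, str):
        return {x: float(x == column[node]) for x in BASES}
    left, t_left, right, t_right = node
    pl, pr = partial(left, column, rate), partial(right, column, rate)
    return {
        x: sum(jc(x, y, rate * t_left) * pl[y] for y in BASES)
        * sum(jc(x, y, rate * t_right) * pr[y] for y in BASES)
        for x in BASES
    }


def column_loglik(tree, column, rate):
    """Log-likelihood of one column; rate must be positive."""
    return math.log(sum(0.25 * p for p in partial(tree, column, rate).values()))


def annotate(alignment, tree, rates, stay=0.9):
    """Viterbi over alignment columns; rates maps state name -> rate."""
    states = list(rates)
    if len(states) < 2:
        raise ValueError("need at least two states")
    length = len(next(iter(alignment.values())))
    if length == 0:
        raise ValueError("alignment is empty")
    columns = [{sp: seq[i] for sp, seq in alignment.items()} for i in range(length)]
    switch = (1.0 - stay) / (len(states) - 1)

    def log_trans(prev, cur):
        return math.log(stay if prev == cur else switch)

    best = {
        s: -math.log(len(states)) + column_loglik(tree, columns[0], rates[s])
        for s in states
    }
    back = []
    for col in columns[1:]:
        new_best, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[p] + log_trans(p, s))
            new_best[s] = (
                best[prev] + log_trans(prev, s) + column_loglik(tree, col, rates[s])
            )
            pointers[s] = prev
        best = new_best
        back.append(pointers)
    path = [max(states, key=lambda s: best[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


if __name__ == "__main__":
    # Hypothetical aligned sequences, tree, and rates, for illustration only.
    alignment = {
        "human": "ACGTACACGTAC",
        "chimp": "ACGTACCGTACG",
        "baboon": "ACGTACGTACGT",
    }
    tree = (("human", 0.1, "chimp", 0.1), 0.1, "baboon", 0.2)
    print(annotate(alignment, tree, {"exon": 0.1, "noncoding": 5.0}))
```

In a real gene finder the rates, transition probabilities, and state set would be trained on sample genes, as the list above says, and the state space would encode the gene structure rather than two labels.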

  32. Contributions of the 1st paper • First to implement the Hidden Markov Phylogeny on the Primate/Human phylogeny. • Requires only 5 primate species. • Able to find the apo(a) gene. [Figure label: Gene Finders]

  33. Contributions of the 2nd paper: Implements a more sophisticated Hidden Markov Phylogeny on the Human/Mouse phylogeny • Context-dependent phylogenetic models (high-order Markov chain: the emission of one state also depends on the neighboring states). More computationally expensive but more accurate. • Explicit modeling of conserved non-coding sequences. • Modeling of insertions and deletions.

  34. Results of the 2nd paper [Figure labels: Gene Finders]

  35. Conclusion • Genes are found by comparing genomes. • Mouse/Human for old genes • Primate/Human for recent genes • In either case, the same tool extracts the coding sequences: Hidden Markov Phylogeny • Future: Improve the Markov model, sequence more genomes.

  36. Thank you! Questions ?
