Finding Genes based on Comparative Genomics

Finding Genes based on Comparative Genomics. Robin Raffard. November 30th, 2004. CS 374

References. Main references: • Multiple-sequence functional annotation and the generalized hidden Markov phylogeny. McAuliffe J., Pachter L., Jordan M. 2004. • Computational identification of evolutionarily conserved exons. Siepel A., Haussler D. 2004. Additional references: • Phylogenetic shadowing of primate sequences to find functional regions of the human genome. Boffelli D., McAuliffe J., Ovcharenko D., Lewis K., Ovcharenko I., Pachter L., Rubin E. • A hidden Markov model approach to variation among sites in rate of evolution. Felsenstein J., Churchill G. • Statistics for Biology and Health. Ewens W., Grant G.

Problem formulation: DNA consists of genes (functional sequences) separated by intergenic regions (nonfunctional sequences). [Figure: DNA strand ATCATTACGCGGCTTAGCCCTTATAGCGATACGATGACAGATGACAA with Gene 1, Gene 2, and Gene 3 marked, separated by intergenic regions]


Problem formulation. Problem: Find genes using comparative genomics. Key: Exons are conserved through evolution.

In Practice >human AGTGAGACACGACGAGCCTACTATCAGGACGAGAGCAGGAGAGTGATGATGAGTAGCGCACAGCGACGATCATCACGAGAGAGTAAGAAGCAGTGATGATGTAGAGCGACGAGAGCACAGCGGCGACTACTACTAGG >mouse AGTGTGTCTCGTCGTGCCTACTTTCAGGACGAGAGCAGGTGAGTGTTGATGAGTTGCGCTCTGCGACGTTCATCTCGAGTGAGTTAGAAAGTGAAGGTATAACACAAGGTGTGAAGGCAGTGATGATGTAGAGCGACGAGAGCACAGCGGCGGGATGATATATCTAGGAGGATGCCCAATTTTTTTTT >platypus CTCTGCGGCGTTCGTCTCGGGTGGGTTGGGGGGTGGGGGTGTGGCGCAAGGTGTGAAGCACGACGACGATCTACGACGAGCGAGTGATGAGAGTGATGAGCGACGACGAGCACTAGAAGCGACGACTACTATCGACGAGCAGCCGAGATGATGATGAAAGAGAGAGAA

2 Questions • 1st question: Which genomes to compare: human/mouse or human/primate? • 2nd question: How to extract genes from this comparison?

Outline • Human/Mouse vs Human/Primate • Advantages of Human/Mouse • Advantages of Human/Primate • Conclusion • Gene Finding • Phylogenetic tree • Hidden Markov Chain • Hidden Markov Phylogeny • Contributions of the 2 papers

Functional sequences in Human/Mouse/Primates [Figure: % similarity plotted along the DNA sequence]

Advantage of Human/Mouse: Functional sequences are easy to identify, because non-coding sequences have diverged strongly while functional ones remain conserved.

Disadvantage of Human/Mouse: Some human genes are not present in the mouse genome, so they cannot be found by a Human/Mouse comparison. [Figure labels: Human, Mouse]

Human/Primates

Phylogenetic shadowing

Phylogenetic shadowing on real data [Figure: likelihood of mutation (log) plotted along the DNA sequence]

Motivating Example: Gene apo(a) • Plasma protein • Important cardiovascular disease risk predictor [Figure labels: Absent, Present]

Phylogenetic shadowing of apo(a) [Figure: likelihood of mutation (log) plotted along the DNA sequence]

So Human/Mouse or Human/Primate? • Old genes: Human/Mouse (non-coding sequences differ strongly) • New genes: Human/Primate (straightforward alignment of coding sequences)


Naive way of extracting genes. Drawbacks: • Is not flexible/probabilistic. • Does not respect gene structure.
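The slide does not spell out the naive method. As a rough illustration only, the sketch below assumes it means cutting an aligned pair of sequences into windows and keeping the windows whose percent identity passes a fixed threshold. The function names, the window size, and the threshold are all hypothetical. It makes both drawbacks visible: the decision is a hard cutoff rather than a probability, and nothing forces the kept windows to look like a gene.

```python
def window_identity(seq_a, seq_b, window):
    """Fraction of matching positions in each non-overlapping window of an alignment."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    scores = []
    for start in range(0, len(seq_a) - window + 1, window):
        pairs = zip(seq_a[start:start + window], seq_b[start:start + window])
        scores.append(sum(a == b for a, b in pairs) / window)
    return scores


def naive_gene_windows(seq_a, seq_b, window=10, threshold=0.8):
    """Indices of windows whose identity reaches the (assumed) threshold."""
    scores = window_identity(seq_a, seq_b, window)
    return [i for i, score in enumerate(scores) if score >= threshold]


# Toy aligned pair: the first window is identical, the second fully diverged.
human = "ACGTACGTAC" + "AAAAAAAAAA"
mouse = "ACGTACGTAC" + "CCCCCCCCCC"
print(naive_gene_windows(human, mouse))
```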

1st step: Phylogenetic tree. Given a nucleotide, is it functional or not? [Figure: Nucleotide 1 and Nucleotide 2 shown across Species]

Primate phylogeny [Figure: phylogenetic tree with nucleotides T, T, A, A, G, A]

Primate phylogeny • Which nucleotide? • Which rate α? [Figure: phylogenetic tree with observed nucleotides A, A, T, A, G, A; other labels on the tree: A, A, C, A]

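Algorithm • Given the observed nucleotides, find the most likely rate α. • Mathematically, α* = argmax over α of Prob(observed nucleotides | α). • Therefore, a low α* indicates a slowly mutating, likely functional nucleotide.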
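A minimal sketch of this rate estimation for a single alignment column, under assumptions that are not from the slides: a Jukes-Cantor substitution model, a uniform prior on the root nucleotide, a made-up three-species tree, and a grid search over candidate rates. All names (`TREE`, `jc_prob`, `most_likely_rate`) are hypothetical. The likelihood of the observed leaves is computed bottom-up on the tree (Felsenstein's pruning recursion), with every branch length scaled by α.

```python
import math

NUCLEOTIDES = "ACGT"

# Hypothetical toy tree (not the primate tree from the slides).
# A leaf is a species name; an internal node is a list of (child, branch_length).
TREE = [([("human", 0.1), ("chimp", 0.1)], 0.1), ("baboon", 0.3)]


def jc_prob(a, b, t):
    """Jukes-Cantor probability that nucleotide a becomes b along a branch of length t."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay


def partial_likelihoods(node, observed, alpha):
    """Map each nucleotide x to Prob(observed leaves below node | node carries x)."""
    if isinstance(node, str):  # leaf: the nucleotide is observed
        return {x: 1.0 if observed[node] == x else 0.0 for x in NUCLEOTIDES}
    result = {x: 1.0 for x in NUCLEOTIDES}
    for child, length in node:
        below = partial_likelihoods(child, observed, alpha)
        for x in NUCLEOTIDES:
            result[x] *= sum(jc_prob(x, y, alpha * length) * below[y]
                             for y in NUCLEOTIDES)
    return result


def site_likelihood(observed, alpha, tree=TREE):
    """Prob(observed nucleotides | alpha), with a uniform prior on the root nucleotide."""
    root = partial_likelihoods(tree, observed, alpha)
    return sum(0.25 * root[x] for x in NUCLEOTIDES)


def most_likely_rate(observed, candidate_rates, tree=TREE):
    """Grid search for the rate alpha that maximizes the site likelihood."""
    return max(candidate_rates, key=lambda alpha: site_likelihood(observed, alpha, tree))


rates = [0.1, 0.5, 1.0, 2.0, 5.0]
conserved = {"human": "A", "chimp": "A", "baboon": "A"}
variable = {"human": "A", "chimp": "C", "baboon": "G"}
print(most_likely_rate(conserved, rates), most_likely_rate(variable, rates))
```

A column that is identical in all species should select the smallest candidate rate, and a column that differs in every species a larger one. A grid search is used here only for simplicity; a continuous optimizer would serve the same purpose.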

Phylogenetic tree: Results. Drawback: no biological model is built in.

Gene structure. A gene finder should respect: • Promoter region, about 50 bases upstream of the gene • 5’ untranslated region • 3’ untranslated region • TATA: start of transcription

Gene Model [Figure: state diagram with states S1 to S6; labels include TATA, Exon, and Intron]

Hidden Markov Chain Model. Composed of: • A sequence of unobservable states S1, S2, S3, …, Sn, where each Si is exon or intron. The jump from Si to Si+1 follows a Markov chain: P(Si+1 | Si). • A sequence of letters (or of sequences of letters) O1, O2, O3, …, On, which are emitted by the states according to P(Oi | Si) and which are observed. [Figure: chain of states S1 to S7 with a transition such as P(S5 | S4); each state Si emits Oi, for example with P(O1 | S1); O1 … O7 = ACGTACG…]
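To make the two ingredients concrete, here is a small sketch that evaluates the joint probability Prob(S, O) of one state path and one letter sequence as a product of the start probability, the transitions P(Si+1 | Si), and the emissions P(Oi | Si). The two-state model and every number in it are invented for illustration, and each state emits a single letter, which is the plain (non-generalized) case.

```python
import math

# Hypothetical toy parameters, chosen only for illustration.
START = {"exon": 0.5, "intron": 0.5}
TRANSITION = {  # TRANSITION[prev][cur] = P(cur | prev)
    "exon": {"exon": 0.9, "intron": 0.1},
    "intron": {"exon": 0.1, "intron": 0.9},
}
EMISSION = {  # EMISSION[state][letter] = P(letter | state)
    "exon": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "intron": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}


def joint_log_prob(states, letters):
    """log Prob(S, O) for a state path and the letters it emits."""
    logp = math.log(START[states[0]]) + math.log(EMISSION[states[0]][letters[0]])
    for prev, cur, letter in zip(states, states[1:], letters[1:]):
        logp += math.log(TRANSITION[prev][cur]) + math.log(EMISSION[cur][letter])
    return logp


# Prob(exon, exon; C, G) = 0.5 * 0.3 * 0.9 * 0.3
print(math.exp(joint_log_prob(["exon", "exon"], "CG")))
```

Working in log space avoids numerical underflow once the sequences become long.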

Viterbi Algorithm • Given a sequence of observed letters O1, …, On, find the most likely sequence of unobservable states S1, …, Sn. • Mathematically, find S* = argmax over S1,…,Sn of Prob(S,O). • 2 steps: • Compute max Prob(S,O) via dynamic programming: max Prob(S1,…,Si+1,O) = f ( max Prob(S1,…,Si,O) ) • Find a sequence of states which achieves the optimum, by backtracking: Si = argmax max Prob(S1,…,Si,O).
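The two steps can be sketched as follows for a plain HMM in which each state emits one letter. This is an illustrative implementation, not code from either paper; the toy parameters are made up, and all probabilities are assumed to be strictly positive so that the logarithms are defined.

```python
import math


def viterbi(letters, states, start, transition, emission):
    """Return the most likely state path and its log joint probability."""
    # Step 1: dynamic programming. best[s] = max over paths ending in s of log Prob(S, O).
    best = {s: math.log(start[s]) + math.log(emission[s][letters[0]]) for s in states}
    pointers = []
    for letter in letters[1:]:
        scores, step = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[p] + math.log(transition[p][s]))
            step[s] = prev
            scores[s] = (best[prev] + math.log(transition[prev][s])
                         + math.log(emission[s][letter]))
        best = scores
        pointers.append(step)
    # Step 2: backtrack from the best final state.
    state = max(states, key=lambda s: best[s])
    final_score = best[state]
    path = [state]
    for step in reversed(pointers):
        state = step[state]
        path.append(state)
    return path[::-1], final_score


# Hypothetical toy model: exons are G/C rich, introns are A/T rich.
STATES = ["exon", "intron"]
START = {"exon": 0.5, "intron": 0.5}
TRANSITION = {"exon": {"exon": 0.9, "intron": 0.1},
              "intron": {"exon": 0.1, "intron": 0.9}}
EMISSION = {"exon": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
            "intron": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}}

path, score = viterbi("ATATGCGCGC", STATES, START, TRANSITION, EMISSION)
print(path)
```

On this toy input the A/T-rich prefix should be labeled intron and the G/C-rich suffix exon, with a single state change between them.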

Generalized hidden Markov phylogeny. Combines the 2 concepts: Hidden Markov chain + Phylogenetic tree = Generalized hidden Markov phylogeny
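A heavily simplified sketch of the combination: each hidden state carries its own evolutionary rate, the emission probability of an alignment column is the likelihood of that column on the tree at the state's rate, and Viterbi decoding labels the columns. Everything specific here is an assumption made for illustration: the Jukes-Cantor model, the three-species tree, the two rates, and the transition probability. It is also a plain HMM in which each state emits one column, so it omits the "generalized" part (states emitting whole segments) and is not the model of either paper.

```python
import math

NUCLEOTIDES = "ACGT"
# Hypothetical toy tree: a leaf is a species name, an internal node is a
# list of (child, branch_length) pairs.
TREE = [([("human", 0.1), ("chimp", 0.1)], 0.1), ("baboon", 0.3)]
RATES = {"exon": 0.2, "intron": 2.0}  # assumed: exons mutate slowly
STAY = 0.9                            # assumed probability of keeping the same state


def jc_prob(a, b, t):
    """Jukes-Cantor substitution probability along a branch of length t."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay


def partial_likelihoods(node, column, rate):
    """Map nucleotide x to Prob(leaves below node | node carries x)."""
    if isinstance(node, str):
        return {x: 1.0 if column[node] == x else 0.0 for x in NUCLEOTIDES}
    result = {x: 1.0 for x in NUCLEOTIDES}
    for child, length in node:
        below = partial_likelihoods(child, column, rate)
        for x in NUCLEOTIDES:
            result[x] *= sum(jc_prob(x, y, rate * length) * below[y]
                             for y in NUCLEOTIDES)
    return result


def log_emission(state, column):
    """log Prob(alignment column | state): tree likelihood at the state's rate."""
    root = partial_likelihoods(TREE, column, RATES[state])
    return math.log(sum(0.25 * root[x] for x in NUCLEOTIDES))


def log_transition(prev, cur):
    """Two-state chain: stay with probability STAY, switch otherwise."""
    return math.log(STAY if prev == cur else 1.0 - STAY)


def annotate(alignment):
    """Viterbi labeling of the columns of an alignment {species: sequence}."""
    species = list(alignment)
    columns = [{sp: alignment[sp][i] for sp in species}
               for i in range(len(alignment[species[0]]))]
    states = list(RATES)
    best = {s: math.log(1.0 / len(states)) + log_emission(s, columns[0])
            for s in states}
    pointers = []
    for column in columns[1:]:
        scores, step = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[p] + log_transition(p, s))
            step[s] = prev
            scores[s] = best[prev] + log_transition(prev, s) + log_emission(s, column)
        best = scores
        pointers.append(step)
    state = max(states, key=lambda s: best[s])
    path = [state]
    for step in reversed(pointers):
        state = step[state]
        path.append(state)
    return path[::-1]


alignment = {"human": "ACGTACACGTAC", "chimp": "ACGTACCGTACG", "baboon": "ACGTACGTACGT"}
print(annotate(alignment))
```

In the toy alignment the first six columns are identical across species and the last six differ in every species, so the conserved block should come out as exon and the divergent block as intron.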

Global Method • Get a series of DNA sequences • Align them • Build the Generalized Hidden Markov Model • Train the parameters on sample genes • Find the hidden states: Si • The coding sequences are the exons

Contributions of the 1st paper • 1st to implement the Hidden Markov Phylogeny on the Primate/Human phylogeny. • Requires only 5 primate species. • Able to sequence the apo(a) gene. [Figure: Gene Finders]

Contributions of the 2nd paper. Implements a sophisticated Hidden Markov Phylogeny on the Human/Mouse phylogeny: • Context-dependent phylogenetic models (high-order Markov chain: the emission of one state also depends on the neighboring states). More computationally expensive, but performs better. • Explicit modeling of conserved non-coding sequences. • Modeling of insertions and deletions.

Results of the 2nd paper [Figure: Gene Finders]

Conclusion • Genes can be found by comparing genomes. • Mouse/Human for old genes • Primate/Human for recent genes • In either case, the same tool extracts the coding sequences: Hidden Markov Phylogeny • Future: improve the Markov model, sequence more genomes.

Thank you! Questions?
