
Eukaryotic Gene Finding

Adapted in part from http://online.itp.ucsb.edu/online/infobio01/burge/



  1. Eukaryotic Gene Finding Adapted in part from http://online.itp.ucsb.edu/online/infobio01/burge/

  2. Prokaryotic vs. Eukaryotic Genes
     Prokaryotes:
     • small genomes
     • high gene density
     • no introns (or splicing)
     • no RNA processing
     • similar promoters
     • overlapping genes
     Eukaryotes:
     • large genomes
     • low gene density
     • introns (splicing)
     • RNA processing
     • heterogeneous promoters
     • polyadenylation

  3. Pre-mRNA Splicing [Figure: exon definition and intron definition, followed by assembly of the spliceosome and catalysis. Labels: snRNPs, SR proteins, 5' splice signal, branch signal, polyY, 3' splice signal, exonic enhancers, exonic repressor, intronic enhancers, intronic repressor]

  4. Some Statistics
     • On average, a vertebrate gene is about 30 kb long
     • The coding region takes up about 1 kb
     • Exon sizes vary from tens of base pairs to kilobases
     • An average 5' UTR is about 750 bp
     • An average 3' UTR is about 450 bp, but both can be much longer

  5. Human Splice Signal Motifs [Figure: motifs for the 5' splice signal and the 3' splice signal]

  6. Semi-Markov HMM Model

  7. GenScan HSMM (hidden semi-Markov model)

  8. GenScan States
     • N – intergenic region
     • P – promoter
     • F – 5' untranslated region
     • Esngl – single exon (intronless) (translation start -> stop codon)
     • Einit – initial exon (translation start -> donor splice site)
     • Ek – phase k internal exon (acceptor splice site -> donor splice site)
     • Eterm – terminal exon (acceptor splice site -> stop codon)
     • Ik – phase k intron: 0 – between codons; 1 – after the first base of a codon; 2 – after the second base of a codon
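The intron phase definition on this slide can be made concrete with a little arithmetic: the phase of an intron is the number of coding bases accumulated before it, modulo 3. The following is a minimal sketch, not GenScan code; the function name `intron_phases` and the example exon lengths are hypothetical and chosen only for illustration.

```python
def intron_phases(coding_exon_lengths):
    """Return the phase (0, 1 or 2) of each intron in a gene.

    coding_exon_lengths lists the number of coding bases in each exon,
    in transcript order. An intron follows every exon except the last.
    Phase 0: intron falls between codons.
    Phase 1: intron falls after the first base of a codon.
    Phase 2: intron falls after the second base of a codon.
    """
    phases = []
    coding_so_far = 0
    for length in coding_exon_lengths[:-1]:
        coding_so_far += length
        phases.append(coding_so_far % 3)
    return phases


# Hypothetical gene with four coding exons (204 coding bases = 68 codons).
example_lengths = [100, 50, 33, 21]
example_phases = intron_phases(example_lengths)
```

Because a downstream exon must resume the reading frame where the previous exon left it, a gene model that tracks intron phase keeps the predicted coding sequence in frame across splice junctions.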

  9. GenScan features
     • Models both strands at once
     • Each state may output a string of symbols (according to some probability distribution)
     • Explicit intron/exon length modeling
     • Advanced splice site modeling
     • Parameters learned from annotated genes
     • Separate parameter training for different CpG content groups
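Two of these features, states that emit whole strings and explicit length modeling, are what distinguish a semi-Markov HMM from an ordinary HMM. The sketch below shows a Viterbi-style decoder for a toy semi-Markov model. Everything in it is an assumption made for illustration: the two states (`AT_rich`, `GC_rich`), their emission probabilities, the uniform length distribution capped at `MAX_LEN`, and the rule that a segment must be followed by a segment of the other state. It is not the GenScan model or its parameters.

```python
import math

# Hypothetical toy model: two composition states, each emitting a whole segment.
STATES = ("AT_rich", "GC_rich")
EMIT = {
    "AT_rich": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
    "GC_rich": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4},
}
MAX_LEN = 10  # assumed maximum segment length


def log_duration(state, d):
    # Assumed length distribution: uniform over 1..MAX_LEN for every state.
    return math.log(1.0 / MAX_LEN)


def log_emit(state, segment):
    # Bases within a segment are treated as independent draws.
    return sum(math.log(EMIT[state][base]) for base in segment)


def segment_sequence(seq):
    """Return the most probable parse as a list of (state, start, end)."""
    n = len(seq)
    # best[t][s] = (log-probability, back-pointer) of the best parse of
    # seq[:t] whose last segment is in state s.
    best = [{s: (-math.inf, None) for s in STATES} for _ in range(n + 1)]
    for t in range(1, n + 1):
        for s in STATES:
            for d in range(1, min(MAX_LEN, t) + 1):
                start = t - d
                local = log_duration(s, d) + log_emit(s, seq[start:t])
                if start == 0:
                    # First segment: uniform initial state distribution.
                    cand = math.log(1.0 / len(STATES)) + local
                    if cand > best[t][s][0]:
                        best[t][s] = (cand, (0, None))
                    continue
                for prev in STATES:
                    if prev == s:
                        continue  # no self-transitions between segments
                    cand = best[start][prev][0] + local
                    if cand > best[t][s][0]:
                        best[t][s] = (cand, (start, prev))
    # Trace back from the best final state.
    parse = []
    t = n
    state = max(STATES, key=lambda s: best[n][s][0])
    while t > 0:
        _, (start, prev) = best[t][state]
        parse.append((state, start, t))
        t, state = start, prev
    return parse[::-1]


toy_parse = segment_sequence("AAAAAAGGGGGGAAAAAA")
```

The inner loop over segment lengths `d` is the cost of explicit length modeling: decoding takes time proportional to sequence length times the maximum segment length, rather than sequence length alone.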

  10. GenScan Signal Modeling
      • PSSM (position-specific scoring matrix): P(S) = P1(S1) • P2(S2) • … • Pn(Sn), used for:
        • PolyA signal
        • Translation initiation/termination signal
        • Promoters
      • WAM (weight array model): P(S) = P1(S1) • P2(S2|S1) • … • Pn(Sn|Sn-1), used for:
        • 5' and 3' splice sites
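The two formulas differ only in whether each position depends on its left neighbor. The sketch below estimates both models from a handful of aligned sites and evaluates P(S) for a candidate site. The training sites in `SITES`, the pseudocount of 1, and all function names are hypothetical choices for illustration; they are not GenScan's training data or parameters.

```python
import math
from collections import Counter

BASES = "ACGT"

# Hypothetical aligned signal sites (all the same length).
SITES = ["GTAAGT", "GTGAGT", "GTAAGA", "GTAAGT"]


def _normalize(counts, pseudo):
    total = sum(counts[b] for b in BASES) + pseudo * len(BASES)
    return {b: (counts[b] + pseudo) / total for b in BASES}


def train_pssm(sites, pseudo=1.0):
    """One independent base distribution per position: P_i(S_i)."""
    length = len(sites[0])
    return [_normalize(Counter(s[i] for s in sites), pseudo) for i in range(length)]


def train_wam(sites, pseudo=1.0):
    """P_1(S_1) plus, for each later position, P_i(S_i | S_{i-1})."""
    length = len(sites[0])
    first = _normalize(Counter(s[0] for s in sites), pseudo)
    conditionals = []
    for i in range(1, length):
        table = {}
        for prev in BASES:
            counts = Counter(s[i] for s in sites if s[i - 1] == prev)
            table[prev] = _normalize(counts, pseudo)
        conditionals.append(table)
    return first, conditionals


def pssm_prob(pssm, s):
    # P(S) = P1(S1) * P2(S2) * ... * Pn(Sn)
    return math.prod(pssm[i][base] for i, base in enumerate(s))


def wam_prob(wam, s):
    # P(S) = P1(S1) * P2(S2|S1) * ... * Pn(Sn|Sn-1)
    first, conditionals = wam
    p = first[s[0]]
    for i in range(1, len(s)):
        p *= conditionals[i - 1][s[i - 1]][s[i]]
    return p


pssm = train_pssm(SITES)
wam = train_wam(SITES)
```

For example, `wam_prob(wam, "GTAAGT")` is larger than `wam_prob(wam, "CCCCCC")` under these toy sites. The WAM needs four times as many parameters per position as the PSSM, so it is worth the cost mainly where neighboring positions are known to be correlated and enough training sites are available.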

  11. HMM-based Gene Finding
      • GENSCAN (Burge 1997)
      • FGENESH (Solovyev 1997)
      • HMMgene (Krogh 1997)
      • GENIE (Kulp 1996)
      • GENMARK (Borodovsky & McIninch 1993)
      • VEIL (Henderson, Salzberg, & Fasman 1997)

  12. GenomeScan
      • Idea: enhance gene prediction with external information. DNA regions with homology to known proteins are more likely to be coding exons.
      • Combine probabilistic 'extrinsic' information (BLAST hits) with a probabilistic model of gene structure/composition (GenScan)
      • Focus on the 'typical case' in which homologous but not identical proteins are available

  13. GeneWise [Birney, Amitai]
      • Motivation: use a good database of the protein world (PFAM) to help annotate genomic DNA
      • The GeneWise algorithm aligns a profile HMM directly to the DNA

  14. Sample GeneWise Output

  15. Developing GeneWise Model
      • Start with a PFAM domain HMM
      • Replace AA emissions with codon emissions
      • Allow for sequencing errors (deletions/insertions)
      • Add a 3-state intron model
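The second step, replacing amino-acid emissions with codon emissions, can be sketched as follows. This assumes the standard genetic code and, purely for simplicity, splits each amino acid's probability evenly among its synonymous codons; the slides do not say how GeneWise weights synonymous codons, so treat that split and the function name `codon_emissions` as illustrative assumptions.

```python
# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def codon_emissions(aa_probs):
    """Turn an amino-acid emission distribution into a codon distribution.

    aa_probs maps one-letter amino-acid codes to probabilities, as in one
    match state of a protein profile HMM. Each amino acid's probability is
    divided evenly among its synonymous codons (an assumed simplification).
    """
    synonyms = {}
    for codon, aa in CODON_TABLE.items():
        synonyms.setdefault(aa, []).append(codon)
    emissions = {}
    for aa, p in aa_probs.items():
        for codon in synonyms[aa]:
            emissions[codon] = p / len(synonyms[aa])
    return emissions


# Hypothetical match state that emits Met or Leu with equal probability.
example = codon_emissions({"M": 0.5, "L": 0.5})
```

Here `example["ATG"]` is 0.5 because methionine has a single codon, while each of the six leucine codons receives 0.5 / 6. After this conversion the state consumes three DNA bases per emission, which is what lets the profile be aligned to genomic DNA directly.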

  16. GeneWise Model

  17. GeneWise Intron Model [Figure: intron model with labels 5' site, central, PY tract, spacer, 3' site]

  18. GeneWise Model
      • Viterbi algorithm -> "best" alignment of DNA to protein domain
      • Alignment gives exact exon-intron boundaries
      • Parameters learned from species-specific statistics

  19. GeneWise problems
      • Provides only a partial prediction, and only where the homology lies
      • Does not find "more" genes
      • Pseudogenes and retrotransposons are picked up
      • CPU intensive
      • Solution: pre-filter with BLAST

  20. Summary
      • Genes are complex structures which are difficult to predict with the required level of accuracy/confidence
      • Different approaches to gene finding:
        • Ab initio: GenScan
        • Ab initio modified by BLAST homologies: GenomeScan
        • Homology guided: GeneWise
