
Gene Prediction

Gene Prediction. Chengwei Luo, Amanda McCook, Nadeem Bulsara, Phillip Lee, Neha Gupta, and Divya Anjan Kumar. Outline: Introduction • Protein-coding gene prediction • RNA gene prediction • Modification and finishing • Project schema


Presentation Transcript


  1. Gene Prediction • Chengwei Luo, Amanda McCook, Nadeem Bulsara, • Phillip Lee, Neha Gupta, and Divya Anjan Kumar

  2. Gene Prediction • Introduction • Protein-coding gene prediction • RNA gene prediction • Modification and finishing • Project schema

  3. Gene Prediction • Introduction • Protein-coding gene prediction • RNA gene prediction • Modification and finishing • Project schema

  4. Why gene prediction? What about the experimental way?

  5. Why gene prediction? • Exponential growth of sequences • New sequencing technology • Metagenomics: only ~1% of microbes grow in the lab

  6. How to do it?

  7. How to do it? It is a complicated task; let's break it into parts

  8. How to do it? It is a complicated task; let's break it into parts • Genome

  9. How to do it? It is a complicated task; let's break it into parts • Genome

  10. How to do it? Protein-coding gene prediction • Homology search: Phillip Lee & Divya Anjan Kumar • ab initio approach: Nadeem Bulsara & Neha Gupta

  11. How to do it? RNA gene prediction: Amanda McCook & Chengwei Luo • tRNA • rRNA • sRNA

  12. Gene Prediction • Introduction • Protein-coding gene prediction • RNA gene prediction • Modification and finishing • Project schema

  13. Homology Search

  14. Homology Search

  15. Strategy

  16. Open reading frame (ORF)

  17. How/Why find ORF?

  18. How/Why find ORF?

  19. How/Why find ORF?

  20. Protein Database Searches

  21. SWISSPROT statistics

  22. Pfam statistics • 11,912 families, with 1,808 new families and 236 families deleted • Updated to include metagenomic samples • Built on multiple sequence alignments (MSA) and hidden Markov models (HMM) • Only 63% of the Pfam families match the proteins in SWISSPROT and TrEMBL

  23. Domain searches

  24. Integrating the results • 3 possible outcomes: complete consensus, partial consensus, no consensus • How do we choose? Scores like E-values, percentage similarity, relevance
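The three outcomes and the E-value tie-breaking above can be sketched in a few lines of Python. This is only an illustration: the function names, the input format (one label per search, and dictionaries with `label`, `evalue`, and `similarity` keys), and the choice to break E-value ties by percentage similarity are all assumptions, not something the slides specify.

```python
from collections import Counter


def classify_consensus(calls):
    """Classify the labels that several searches assigned to one ORF.

    calls: list of annotation labels, one per search (hypothetical format).
    Returns "complete", "partial", or "none".
    """
    counts = Counter(calls)
    top = counts.most_common(1)[0][1]  # size of the largest agreeing group
    if top == len(calls):
        return "complete"
    if top > 1:
        return "partial"
    return "none"


def pick_best(hits):
    """Pick the hit with the lowest E-value; ties go to higher similarity.

    hits: list of dicts with keys 'label', 'evalue', 'similarity'
    (an assumed record layout, not a real tool's output format).
    """
    return min(hits, key=lambda h: (h["evalue"], -h["similarity"]))


hits = [
    {"label": "dnaA", "evalue": 1e-40, "similarity": 82.0},
    {"label": "dnaA", "evalue": 1e-55, "similarity": 90.0},
    {"label": "hypothetical protein", "evalue": 1e-5, "similarity": 35.0},
]
outcome = classify_consensus([h["label"] for h in hits])
best = pick_best(hits)
```

"Relevance" from the slide is left out here because it needs a judgment that a score alone cannot supply.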

  25. Limitations of Extrinsic Prediction

  26. ab initio Prediction

  27. Homology Search is not Enough! • Biased and incomplete database • Sequenced genomes are not evenly distributed on the tree of life, and do not reflect its diversity either. • (Figure annotation: number of sequenced genomes clustered here)

  28. ab initio Gene Prediction

  29. Features

  30. ORFs (6 frames)
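As a rough illustration of what scanning ORFs in six frames means, the sketch below walks the three frames of the forward strand and the three frames of the reverse complement. It is a minimal example, not the method of any tool named in this deck. The start codon set (ATG, GTG, TTG), the `min_codons` cutoff, and all function names are assumptions made for the example.

```python
STOPS = {"TAA", "TAG", "TGA"}
STARTS = {"ATG", "GTG", "TTG"}  # assumed bacterial start codons

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def orfs_in_frame(seq, frame, min_codons):
    """Yield (start, end) for ORFs in one frame of seq.

    start is the index of the first start codon seen since the last stop;
    end is the index just past the stop codon.
    """
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon in STARTS:
            start = i
        elif start is not None and codon in STOPS:
            if (i + 3 - start) // 3 >= min_codons:
                yield start, i + 3
            start = None


def six_frame_orfs(seq, min_codons=3):
    """Return (strand, frame, start, end) tuples.

    Coordinates refer to the strand that was scanned, so "-" strand
    positions index into the reverse complement.
    """
    seq = seq.upper()
    found = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for start, end in orfs_in_frame(s, frame, min_codons):
                found.append((strand, frame, start, end))
    return found


orfs = six_frame_orfs("CCATGAAATTTTAGCC")
```

A real pipeline would use a much larger `min_codons` and would map reverse-strand coordinates back to the forward strand.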

  31. Codon Statistics

  32. Features (Contd.)

  33. Probabilistic View

  34. Supervised Techniques

  35. Unsupervised Techniques

  36. Commonly Used Tools • GeneMark • GLIMMER • EasyGene • PRODIGAL

  37. GeneMark • Developed in 1993 at Georgia Institute of Technology as the first gene finding tool. • Used Markov chains to represent the statistics of coding and noncoding reading frames, based on dicodon statistics. • Shortcoming: inability to find exact gene boundaries

  38. GeneMark.hmm

  39. GeneMark.hmm • 9 hidden states were defined: • Typical gene in the direct strand • Typical gene in the reverse strand • Atypical gene in the direct strand • Atypical gene in the reverse strand • Non-coding (intergenic) region • Start codons in the direct strand • Stop codons in the direct strand • Start codons in the reverse strand • Stop codons in the reverse strand • The probability that a functional sequence X underlies the DNA sequence S is calculated as P(X|S) = P(x1, x2, ..., xL | b1, b2, ..., bL), where xi is the hidden state at position i and bi is the base at position i. • The Viterbi algorithm then finds the functional sequence X* such that P(X*|S) is the largest among all possible values of X. • A ribosome binding site (RBS) model was also added to improve accuracy in the prediction of translational start sites.
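To make the Viterbi step concrete, here is a small log-space implementation run on a toy two-state model. It is a sketch only: the two states ("N" for non-coding, "C" for coding) and every probability below are invented for the example and are not the nine-state GeneMark.hmm model or its parameters.

```python
import math


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return the most probable hidden-state path (all inputs in log space)."""
    # best[t][s]: log-probability of the best path ending in state s at t
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + log_trans[p][s]) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = score + log_emit[s][observations[t]]
            back[t][s] = prev
    # Trace back from the best final state
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


def logs(table):
    """Convert a dict of probabilities to log-probabilities."""
    return {k: math.log(v) for k, v in table.items()}


# Toy parameters (assumptions): "C" favors G/C, "N" favors A/T.
STATES = ("N", "C")
LOG_START = logs({"N": 0.5, "C": 0.5})
LOG_TRANS = {"N": logs({"N": 0.9, "C": 0.1}), "C": logs({"N": 0.1, "C": 0.9})}
LOG_EMIT = {
    "N": logs({"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1}),
    "C": logs({"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4}),
}

path = viterbi("AAAAGGGG", STATES, LOG_START, LOG_TRANS, LOG_EMIT)
```

With these toy numbers the decoded path switches from "N" to "C" exactly where the sequence changes from A to G. Working in log space avoids numerical underflow on genome-length inputs.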

  40. GeneMark • The RBS feature overcomes the problem of inexact gene boundaries by defining a five-position nucleotide matrix based on an alignment of 325 E. coli genes whose RBS signals have already been annotated. • Uses the consensus sequence AGGAG to search upstream of any alternative start codons for genes predicted by the HMM. GeneMarkS • Considered the best gene prediction tool. • Based on unsupervised learning. • Even in prokaryotic genomes gene overlaps are quite common
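The consensus search upstream of a candidate start codon might look like the sketch below. It uses simple mismatch counting against AGGAG instead of the position matrix described above, and the function name, the 20-base window, and the one-mismatch tolerance are assumptions chosen for illustration.

```python
def find_rbs(seq, start, motif="AGGAG", window=20, max_mismatches=1):
    """Find the best motif match in the window upstream of a start codon.

    seq: DNA string; start: index of the first base of the start codon.
    Returns (position, mismatches) for the best match, or None.
    window and max_mismatches are illustrative assumptions.
    """
    lo = max(0, start - window)
    best = None
    # The match must end before the start codon begins
    for i in range(lo, start - len(motif) + 1):
        candidate = seq[i:i + len(motif)]
        mismatches = sum(a != b for a, b in zip(candidate, motif))
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (i, mismatches)
    return best


hit = find_rbs("TTTAGGAGGTTTTTATGAAA", start=14)
```

Scoring each candidate start codon this way gives one extra signal for choosing among alternative starts of the same ORF.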

  41. GLIMMER • Maintained by Steven Salzberg and Art Delcher at the University of Maryland, College Park • Used IMMs (Interpolated Markov Models) for the first time. • Predictions based on variable context (oligomers of variable lengths). • More flexible than fixed-order Markov models. Principle: an IMM combines probabilities based on the 0, 1, ..., k previous bases; in this case k = 8 is used, but only for oligomers that occur frequently. For rarely occurring oligomers, 5th order or lower may be used instead.
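The idea of mixing orders 0 through k can be sketched with a toy model. This is a simplified illustration, not GLIMMER's actual weighting scheme: here the weight given to each longer context simply grows with how often that context was seen in training, up to a hypothetical `threshold`. The class name and both parameters are assumptions.

```python
from collections import defaultdict

BASES = "ACGT"


class SimpleIMM:
    """Toy interpolated Markov model over DNA (simplified weighting)."""

    def __init__(self, max_order=8, threshold=400):
        self.max_order = max_order
        self.threshold = threshold  # count at which a context is fully trusted
        # counts[context][base]: times base followed context in training
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, sequences):
        """Count every base after each of its contexts of order 0..max_order."""
        for seq in sequences:
            for i, base in enumerate(seq):
                for k in range(min(i, self.max_order) + 1):
                    self.counts[seq[i - k:i]][base] += 1

    def prob(self, context, base):
        """Interpolated P(base | context), mixing orders 0..len(context)."""
        context = context[-self.max_order:] if self.max_order else ""
        # Order 0, with add-one smoothing so no base gets zero probability
        c0 = self.counts[""]
        p = (c0[base] + 1) / (sum(c0.values()) + len(BASES))
        for k in range(1, len(context) + 1):
            ctx = self.counts.get(context[-k:])
            total = sum(ctx.values()) if ctx else 0
            if total == 0:
                break  # longer contexts were never seen either
            lam = min(1.0, total / self.threshold)
            # Frequent contexts dominate; rare ones fall back to lower orders
            p = lam * ctx[base] / total + (1 - lam) * p
        return p


model = SimpleIMM(max_order=2, threshold=2)
model.train(["ACGTACGTACGT"])
p_g = model.prob("AC", "G")
```

Because each step is a convex combination of two distributions, the probabilities over the four bases always sum to 1.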

  42. Glimmer development • Glimmer 2 (1999): increased the sensitivity of prediction by adding the concept of the ICM (Interpolated Context Model). • Glimmer 3 (2007): overcomes the shortcomings of previous models by taking into account the sum of the RBS score, the IMM coding potential, and a score for start codons, which depends on the relative frequency of each possible start codon in the same training set used for RBS determination. • The algorithm uses reverse scoring of the IMM, scoring every ORF (open reading frame) in reverse, from the stop codon to the start codon. • The score is the sum of the log-likelihoods of the bases contained in the ORF.

  43. Glimmer3.02

  44. PRODIGAL: Prokaryotic Dynamic Programming Gene Finding Algorithm • Developed at Oak Ridge National Laboratory and the University of Tennessee

  45. PRODIGAL-Features

  46. PRODIGAL-Features

  47. EasyGene • Developed at the University of Copenhagen • Statistical significance is the measure for gene prediction.

  48. Comparison of Different Tools

  49. Gene Prediction • Introduction • Protein-coding gene prediction • RNA gene prediction • Modification and finishing • Project schema

  50. RNA Gene Prediction
