Comparative Genomics. Comparative Gene Prediction in the Human Genome. Maribel Hernandez Rosales
What is Comparative Genomics? • Comparative genomics is the analysis and comparison of genomes from different species. • The purpose is to gain a better understanding of how species have evolved and to determine the function of genes and noncoding regions of the genome. • Researchers have learned a great deal about the function of human genes by examining their counterparts in simpler model organisms such as the mouse. • Genome researchers look at many different features when comparing genomes: sequence similarity, gene location, the length and number of coding regions (called exons) within genes, the amount of noncoding DNA in each genome, and highly conserved regions maintained in organisms as simple as bacteria and as complex as humans. • Comparative genomics involves the use of computer programs that can line up multiple genomes and look for regions of similarity among them.
What are the comparative genome sizes of humans and other organisms being studied?
Comparative Gene Prediction • GenScan: ab initio gene prediction. • GeneWise, Procrustes: homology guided. • Rosetta, SGP-1 (Syntenic Gene Prediction), CEM (Conserved Exon Method): gene prediction and sequence alignment are clearly separated. • GenomeScan: ab initio modified by BLAST homologies. • SGP-2, TwinScan, SLAM, DoubleScan: modification of the GenScan scoring scheme to incorporate similarity to known proteins.
GenScan • A general probabilistic model for the gene structure of human genomic sequences. • Identifies genes by finding their complete exon/intron structures in genomic DNA. • Includes the capacity to predict multiple genes in a sequence, to deal with partial as well as complete genes, and to predict consistent sets of genes occurring on either or both DNA strands. • Markov model of coding regions: predictions do not depend on the presence of a similar gene in the protein sequence databases, so they complement the information provided by homology-based gene identification methods (BLASTX). • Maximal Dependence Decomposition (MDD): a new statistical model of donor and acceptor splice sites that captures important dependencies between signal positions.
Pre-mRNA Splicing [Figure: splicing signals on the pre-mRNA (5’ splice signal, branch signal, polyY tract, 3’ splice signal), SR proteins, exonic and intronic enhancers and repressors; exon definition and intron definition, followed by assembly of the spliceosome and catalysis]
GenScan HMM • N - intergenic region • P - promoter • F - 5’ untranslated region • Esngl – single exon (intronless) (translation start -> stop codon) • Einit – initial exon (translation start -> donor splice site) • Ek – phase k internal exon (acceptor splice site -> donor splice site) • Eterm – terminal exon (acceptor splice site -> stop codon) • Ik – phase k intron: 0 – between codons; 1 – after the first base of a codon; 2 – after the second base of a codon
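The intron phase bookkeeping in the state list above can be illustrated with a minimal sketch. It assumes the input is the list of coding lengths of a gene's consecutive exons (UTR bases excluded); the function name `intron_phases` is hypothetical and is not part of GenScan.

```python
def intron_phases(cds_exon_lengths):
    """Return the phase (0, 1 or 2) of each intron of a gene.

    cds_exon_lengths: coding lengths of consecutive exons, in
    transcription order (an assumed input format for this sketch).
    Phase 0: intron lies between two codons.
    Phase 1: intron falls after the first base of a codon.
    Phase 2: intron falls after the second base of a codon.
    """
    phases = []
    coding_bases = 0
    # No intron follows the last exon, so it is skipped.
    for length in cds_exon_lengths[:-1]:
        coding_bases += length
        phases.append(coding_bases % 3)
    return phases


if __name__ == "__main__":
    # Three coding exons of 100, 50 and 33 bases (183 bases = 61 codons).
    print(intron_phases([100, 50, 33]))
```

The phase is what forces the HMM to keep separate Ik and Ek states: an exon that follows a phase 1 intron has to start reading at the second base of a codon.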
GenScan Features • Model both strands at once • Each state may output a string of symbols (according to some probability distribution). • Explicit intron/exon length modeling • Advanced splice site modeling • Parameters learned from annotated genes • Prediction of multiple genes in a sequence (partial or complete).
GenomeScan • We can enhance our gene prediction by using external information: DNA regions with homology to known proteins are more likely to be coding exons. • Combine probabilistic ‘extrinsic’ information (BLAST hits) with a probabilistic model of gene structure/composition (GenScan). • Focus on ‘typical case’ when homologous but not identical proteins are available.
GeneWise • Motivation: use a good database of protein domains (Pfam) to help annotate genomic DNA • The GeneWise algorithm aligns a profile HMM directly to the DNA
GeneWise • Start with a PFAM domain HMM • Replace AA emissions with codon emissions • Allow for sequencing errors (deletions/ insertions) • Add a 3-state intron model
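The step "replace AA emissions with codon emissions" can be sketched as follows. This is only an illustration under a loud assumption: each amino acid's emission probability is split uniformly among its synonymous codons in the standard genetic code. How GeneWise actually weights codons is not stated in the slides, and the names `CODON_TABLE` and `codon_emissions` are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codon_emissions(aa_emissions):
    """Turn one HMM state's amino-acid emissions into codon emissions.

    aa_emissions: dict mapping one-letter amino acids to probabilities.
    Assumption of this sketch: uniform usage of synonymous codons.
    """
    synonymous = {}
    for codon, aa in CODON_TABLE.items():
        synonymous.setdefault(aa, []).append(codon)

    emissions = {}
    for aa, prob in aa_emissions.items():
        codons = synonymous[aa]
        for codon in codons:
            emissions[codon] = prob / len(codons)
    return emissions


if __name__ == "__main__":
    # A toy match state that emits Met or Leu with equal probability.
    for codon, prob in sorted(codon_emissions({"M": 0.5, "L": 0.5}).items()):
        print(codon, round(prob, 4))
```

The total probability mass of the state is preserved, so the rest of the profile HMM (transitions, insert and delete states) can stay as it is.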
GeneWise Intron Model [Figure: intron model with labels 5’ site, central, PY tract, spacer, 3’ site]
GeneWise Features & Problems • “Best” alignment of DNA to protein domain • Alignment gives exact exon-intron boundaries • Parameters learned from species-specific statistics • Only provides partial prediction, and only where the homology lies • Does not find “more” genes • Pseudogenes, Retrotransposons picked up • CPU intensive • Solution: Pre-filter with BLAST
Rosetta • Gene prediction is separated from sequence alignment. • First, two homologous genomic sequences are aligned with the global alignment program GLASS. Then gene structures (splice sites, exon number and length, etc.) are predicted that are compatible with this alignment, meaning that predicted exons fall in the aligned regions.
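The compatibility condition (predicted exons must fall in aligned regions) can be sketched as a simple interval filter. This is not Rosetta's actual procedure, which also scores splice sites, exon number and length; it only illustrates the constraint. Coordinates are assumed to be inclusive (start, end) pairs, and `exons_in_aligned_regions` is a hypothetical name.

```python
def exons_in_aligned_regions(candidate_exons, aligned_regions):
    """Keep candidate exons that lie entirely inside an aligned region.

    candidate_exons: list of (start, end) pairs, inclusive coordinates.
    aligned_regions: list of (start, end) pairs from the alignment step.
    """
    return [
        (start, end)
        for start, end in candidate_exons
        if any(
            a_start <= start and end <= a_end
            for a_start, a_end in aligned_regions
        )
    ]


if __name__ == "__main__":
    aligned = [(100, 300), (500, 650)]
    candidates = [(120, 200), (280, 350), (510, 640)]
    # (280, 350) runs past the end of the first aligned region.
    print(exons_in_aligned_regions(candidates, aligned))
```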
Syntenic Gene Prediction • This approach does not require the comparison of two homologous genomic sequences. • A query sequence from a target genome is compared against a collection of sequences from a second (informant, reference) genome, and the results of the comparison are used to modify the scores of the exons produced by underlying "ab initio" gene prediction algorithms. • Gene prediction and sequence alignment are separated.
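The idea of modifying ab initio exon scores with the comparison results can be sketched as below. The scoring rule is an assumption made for illustration: each exon gets a bonus proportional to the number of its bases covered by similarity hits, used here as a stand-in for the real similarity score. The `weight` parameter and the function name `rescore_exons` are hypothetical, and the hit intervals are assumed to be non-overlapping.

```python
def rescore_exons(exons, hits, weight=0.1):
    """Add a similarity bonus to ab initio exon scores.

    exons: list of (start, end, score), inclusive coordinates.
    hits:  list of (start, end) query intervals covered by similarity
           to the informant genome; assumed not to overlap each other.
    weight: bonus per covered base (illustrative value only).
    """
    rescored = []
    for start, end, score in exons:
        covered = sum(
            max(0, min(end, h_end) - max(start, h_start) + 1)
            for h_start, h_end in hits
        )
        rescored.append((start, end, score + weight * covered))
    return rescored


if __name__ == "__main__":
    exons = [(10, 19, 1.0), (50, 59, 2.0)]
    hits = [(15, 30)]
    # Only the first exon overlaps a hit (bases 15 to 19).
    for exon in rescore_exons(exons, hits):
        print(exon)
```

Because only the scores change, the downstream step that assembles exons into gene structures can stay the same as in the ab initio program.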
SGP-2 [Figure: tracks along the query sequence showing tblastx HSPs, HSP projections, geneid exons, and SGP exons]
Gene prediction programs predict a large number of genes • Almost every mouse gene has a human orthologue
Orthologous human and mouse genes have conserved exonic structure. • 85% of the orthologous pairs have an identical number of exons • 91% of the orthologous exons have identical length • 99.5% of the orthologous exons have identical phase • there are a few cases of intron insertion/deletion (22)
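The three quantities compared above (exon number, exon length, exon phase) can be computed for one orthologous pair with a short sketch. The input format (coding-exon lengths in transcription order) and the name `compare_exon_structure` are assumptions, and "phase" is taken here to mean the cumulative coding length modulo 3 at the end of each exon.

```python
from itertools import accumulate


def compare_exon_structure(lengths_a, lengths_b):
    """Compare the coding-exon structure of two orthologous genes.

    lengths_a, lengths_b: coding-exon lengths in transcription order.
    Exons are paired by position, so per-exon comparisons are only
    made when both genes have the same number of exons.
    """
    same_count = len(lengths_a) == len(lengths_b)
    if not same_count:
        return {"same_exon_count": False,
                "exons_same_length": 0,
                "exons_same_phase": 0}

    same_length = sum(a == b for a, b in zip(lengths_a, lengths_b))
    # Phase at the end of each exon: cumulative coding length mod 3.
    phases_a = [n % 3 for n in accumulate(lengths_a)]
    phases_b = [n % 3 for n in accumulate(lengths_b)]
    same_phase = sum(p == q for p, q in zip(phases_a, phases_b))
    return {"same_exon_count": True,
            "exons_same_length": same_length,
            "exons_same_phase": same_phase}


if __name__ == "__main__":
    human = [100, 50, 33]
    mouse = [100, 53, 30]  # the second exon is one codon longer
    print(compare_exon_structure(human, mouse))
```

In the toy pair, only one exon has an identical length but all three have an identical phase, which mirrors why phase is conserved more often than length: inserting or deleting whole codons changes length without changing phase.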
Summary • Genes are complex structures that are difficult to predict with the required level of accuracy/confidence • Different approaches to gene finding improve the accuracy/confidence of the predictions: • Ab initio: GenScan • Ab initio modified by BLAST homologies: GenomeScan • Homology guided: GeneWise • Gene prediction and sequence alignment performed separately: Rosetta • Ab initio with similarity to known proteins: SGP-2