Genome Bioinformatics

Genome Bioinformatics. Tyler Alioto Center for Genomic Regulation Barcelona, Spain. Node 1 of the INB. GN1 Bioinformática y Genómica Genome Bioinformatic Lab, CRG Roderic Guigó (PI). Themes. Gene prediction ab initio => GeneID dual-genome => SGP2 u12 introns => GeneID v1.3 and U12DB

Presentation Transcript


  1. Genome Bioinformatics Tyler Alioto Center for Genomic Regulation Barcelona, Spain

  2. Node 1 of the INB • GN1 Bioinformática y Genómica • Genome Bioinformatic Lab, CRG • Roderic Guigó (PI) INB Roadshow - Pamplona

  3. Themes
     • Gene prediction
       • ab initio => GeneID
       • dual-genome => SGP2
       • U12 introns => GeneID v1.3 and U12DB
       • combiner => GenePC
     • Genome feature visualization
       • gff2ps
     • Alternative splicing
       • ASTALAVISTA
     • Gene expression regulatory elements
       • meta and mmeta alignment

  4. Eukaryotic gene structure

  5. Eukaryotic gene structure (figure labels: upstream regulator, promoter, exons, introns, donor, acceptor, downstream regulator)

  6. The Splicing Code

  7. Gene Prediction Strategies
     • Expressed sequence (cDNA) or protein sequence available?
       • Yes → Spliced alignment
         • BLAT, Exonerate, est_genome, spidey, GMAP, Genewise
       • No → Integrated gene prediction
         • Informant genome(s) available?
           • Yes → Dual or n-genome de novo predictors:
             • SGP2, Twinscan, NSCAN
             • (Genomescan: same or cross-genome protein blastx)
           • No → ab initio predictors
             • geneid, genscan, augustus, fgenesh, genemark, etc.
     • Many newer gene predictors can run in multiple modes depending on the evidence available.
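The decision tree on this slide can be written down as a small function. This is only an illustrative sketch: the function name, its arguments and the returned labels are made up for the example, and the tool lists are simply the ones named on the slide.

```python
from typing import List, Tuple

# Tools named on the slide, grouped by strategy (labels are illustrative).
TOOLS = {
    "spliced alignment": ["BLAT", "Exonerate", "est_genome", "spidey", "GMAP", "Genewise"],
    "dual- or n-genome de novo prediction": ["SGP2", "Twinscan", "NSCAN"],
    "ab initio prediction": ["geneid", "genscan", "augustus", "fgenesh", "genemark"],
}


def choose_strategy(has_transcript_or_protein: bool,
                    has_informant_genome: bool) -> Tuple[str, List[str]]:
    """Walk the slide's decision tree and return (strategy, candidate tools)."""
    if has_transcript_or_protein:
        strategy = "spliced alignment"
    elif has_informant_genome:
        strategy = "dual- or n-genome de novo prediction"
    else:
        strategy = "ab initio prediction"
    return strategy, TOOLS[strategy]


if __name__ == "__main__":
    print(choose_strategy(False, True))
```

In practice the choice is less clear-cut, since (as the last bullet says) many newer predictors combine several kinds of evidence.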

  8. Frameworks for gene prediction
     • Hierarchical exon-building and chaining
     • Hidden Markov Models (many flavors)
       • HMM, GHMM, GPHMM, Phylo-HMM
     • Conditional Random Fields (new!)
       • Conrad, Contrast... and, no doubt, more to come
     All of them involve finding the optimal path of exons using dynamic programming (e.g. GenAmic, Viterbi algorithms)

  9. How does GeneID approach gene prediction?

  10. The gene prediction problem (figure: sites a1-a4 and d1-d5, candidate exons e1-e8, and genes assembled from subsets of those exons)

  11. GeneID
      • Geneid follows a hierarchical structure: signal → exon → gene
      • Exon score: score of exon-defining signals + protein-coding potential (log-likelihood ratios)
      • Dynamic programming algorithm: maximize score of assembled exons → assembled gene
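To make the "maximize score of assembled exons" step concrete, here is a deliberately simplified chaining sketch. It is not geneid's actual algorithm: it only requires that chained exons do not overlap, and it ignores reading-frame compatibility and exon-type ordering, both of which a real gene assembler has to enforce. The exon tuples and the function name are invented for the example.

```python
from typing import List, Tuple

Exon = Tuple[int, int, float]  # (begin, end, score), 1-based inclusive coordinates


def best_chain(exons: List[Exon]) -> Tuple[float, List[Exon]]:
    """Return the highest-scoring chain of non-overlapping exons.

    Simple O(n^2) dynamic programming: best[i] holds the best chain that
    ends with the i-th exon (exons sorted by end coordinate).
    """
    ordered = sorted(exons, key=lambda e: e[1])
    best: List[Tuple[float, List[Exon]]] = []
    for i, (begin, _end, score) in enumerate(ordered):
        top_score, top_chain = 0.0, []
        for j in range(i):
            # Exon j may precede exon i only if it ends before exon i begins.
            if ordered[j][1] < begin and best[j][0] > top_score:
                top_score, top_chain = best[j]
        best.append((top_score + score, top_chain + [ordered[i]]))
    if not best:
        return 0.0, []
    return max(best, key=lambda item: item[0])


if __name__ == "__main__":
    candidates = [(1, 100, 2.0), (50, 150, 3.0), (160, 200, 1.5), (120, 180, 0.5)]
    print(best_chain(candidates))
```

The quadratic inner loop keeps the sketch short; an assembler meant for whole chromosomes would need something that scales better with the number of candidate exons.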

  12. Training GeneID
      GAGGTAAAC TCCGTAAGT CAGGTTGGA ACAGTCAGT TAGGTCATT TAGGTACTG ATGGTAACT CAGGTATAC TGTGTGAGT AAGGTAAGT
      ATGGCAGGGACCGTGACGGAAGCCTGGGATGTGGCAGTATTTGCTGCCCGACGGCGCAAT
      GATGAAGACGACACCACAAGGGATAGCTTGTTCACTTATACCAACAGCAACAATACCCGG
      GGCCCCTTTGAAGGTCCAAACTATCACATTGCGCCACGCTGGGTCTACAATATCACTTCT
      GTCTGGATGATTTTTGTGGTCATCGCTTCAATCTTCACCAATGGTTTGGTATTGGTGGCC
      ACTGCCAAATTCAAGAAGCTACGGCATCCTCTGAACTGGATTCTGGTAAACTTGGCGATA
      GCTGATCTGGGTGAGACGGTTATTGCCAGTACCATCAGTGTCATCAACCAGATCTCTGGC
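The ten 9-mers on this slide all carry GT at positions 4-5, so they can serve as a toy training set for a signal model. The sketch below builds a position weight matrix of log-likelihood ratios from them. It is only an illustration of the idea: the uniform background (0.25 per base), the pseudocount and the function names are assumptions of this example, and a real training run would use far more sites and a background estimated from real sequence.

```python
import math
from typing import Dict, List

# The ten aligned 9-mers shown on the slide.
SITES = [
    "GAGGTAAAC", "TCCGTAAGT", "CAGGTTGGA", "ACAGTCAGT", "TAGGTCATT",
    "TAGGTACTG", "ATGGTAACT", "CAGGTATAC", "TGTGTGAGT", "AAGGTAAGT",
]


def train_pwm(sites: List[str], pseudocount: float = 1.0,
              background: float = 0.25) -> List[Dict[str, float]]:
    """One dict per position: base -> log(P(base | site) / P(base | background))."""
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        pwm.append({
            base: math.log(((column.count(base) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm


def score_site(pwm: List[Dict[str, float]], candidate: str) -> float:
    """Sum the per-position log-likelihood ratios for a candidate 9-mer."""
    return sum(column[base] for column, base in zip(pwm, candidate))


if __name__ == "__main__":
    pwm = train_pwm(SITES)
    for candidate in ("AAGGTAAGT", "AAGCCAAGT"):
        print(candidate, round(score_site(pwm, candidate), 2))
```

A candidate that keeps the invariant GT scores well above one that breaks it, which is the behaviour a signal score is meant to capture.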

  13. Running GeneID: command line or on geneid server

      NAME
          geneid - a program to annotate genomic sequences

      SYNOPSIS
          geneid [-bdaefitnxszr] [-DA] [-Z] [-p gene_prefix] [-G] [-3] [-X] [-M] [-m]
                 [-WCF] [-o] [-j lower_bound_coord] [-k upper_bound_coord]
                 [-O <gff_exons_file>] [-R <gff_annotation-file>] [-S <gff_homology_file>]
                 [-P <parameter_file>] [-E exonweight] [-V evidence_exonweight]
                 [-Bv] [-h] <locus_seq_in_fasta_format>

      RELEASE
          geneid v 1.3

      OPTIONS
          -b: Output Start codons
          -d: Output Donor splice sites
          -a: Output Acceptor splice sites
          -e: Output Stop codons
          -f: Output Initial exons
          -i: Output Internal exons
          -t: Output Terminal exons
          -n: Output introns
          -s: Output Single genes
          -x: Output all predicted exons
          -z: Output Open Reading Frames
          -D: Output genomic sequence of exons in predicted genes
          -A: Output amino acid sequence derived from predicted CDS
          -p: Prefix this value to the names of predicted genes, peptides and CDS
          -G: Use GFF format to print predictions
          -3: Use GFF3 format to print predictions
          -X: Use extended-format to print gene predictions
          -M: Use XML format to print gene predictions
          -m: Show DTD for XML-format output
          -j: Begin prediction at this coordinate
          -k: End prediction at this coordinate
          -W: Only Forward sense prediction (Watson)
          -C: Only Reverse sense prediction (Crick)
          -U: Allow U12 introns (Requires appropriate U12 parameters to be set in the parameter file)
          -r: Use recursive splicing
          -F: Force the prediction of one gene structure
          -o: Only running exon prediction (disable gene prediction)
          -O <exons_filename>: Only running gene prediction (not exon prediction)
          -Z: Activate Open Reading Frames searching
          -R <exons_filename>: Provide annotations to improve predictions
          -S <HSP_filename>: Using information from protein sequence alignments to improve predictions
          -E: Add this value to the exon weight parameter (see parameter file)
          -V: Add this value to the score of evidence exons
          -P <parameter_file>: Use other than default parameter file (human)
          -B: Display memory required to execute geneid given a sequence
          -v: Verbose. Display info messages
          -h: Show this help

      AUTHORS
          geneid_v1.3 has been developed by Enrique Blanco, Tyler Alioto and Roderic Guigo.
          Parameter files have been created by Genis Parra and Tyler Alioto.
          Any bug or suggestion can be reported to geneid@imim.es
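As a usage illustration, the sketch below assembles a geneid command line from a few of the options listed above (-P, -G, -3, -v) and runs it with Python's subprocess module. The file names seq.fa and human.param are placeholders, not files shipped with the tutorial, and the helper functions are invented for this example. Running it requires a geneid binary on the PATH.

```python
import subprocess
from typing import List, Optional


def geneid_command(fasta_path: str, parameter_file: Optional[str] = None,
                   gff3: bool = False, verbose: bool = False) -> List[str]:
    """Build an argument list using only flags from the help text above."""
    cmd = ["geneid"]
    if parameter_file is not None:
        cmd += ["-P", parameter_file]      # non-default parameter file
    cmd.append("-3" if gff3 else "-G")      # GFF3 or GFF output
    if verbose:
        cmd.append("-v")
    cmd.append(fasta_path)                  # locus sequence in FASTA format
    return cmd


def run_geneid(fasta_path: str, **options) -> str:
    """Run geneid and return its standard output (raises if geneid fails)."""
    result = subprocess.run(geneid_command(fasta_path, **options),
                            capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    # Placeholder file names; substitute real paths before running geneid.
    print(" ".join(geneid_command("seq.fa", parameter_file="human.param", gff3=True)))
```

Passing the arguments as a list rather than a shell string avoids quoting problems with unusual file names.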

  14. GeneID output
      ## gff-version 2
      ## date Mon Nov 26 14:37:15 2007
      ## source-version: geneid v 1.2 -- geneid@imim.es
      # Sequence HS307871 - Length = 4514 bps
      # Optimal Gene Structure. 1 genes. Score = 16.20
      # Gene 1 (Forward). 9 exons. 391 aa. Score = 16.20
      HS307871  geneid_v1.2  Internal  1710  1860  -0.11  +  0  HS307871_1
      HS307871  geneid_v1.2  Internal  1976  2055   0.24  +  2  HS307871_1
      HS307871  geneid_v1.2  Internal  2132  2194   0.44  +  0  HS307871_1
      HS307871  geneid_v1.2  Internal  2434  2682   4.66  +  0  HS307871_1
      HS307871  geneid_v1.2  Internal  2749  2910   3.19  +  0  HS307871_1
      HS307871  geneid_v1.2  Internal  3279  3416   0.97  +  0  HS307871_1
      HS307871  geneid_v1.2  Internal  3576  3676   3.23  +  0  HS307871_1
      HS307871  geneid_v1.2  Internal  3780  3846  -0.96  +  1  HS307871_1
      HS307871  geneid_v1.2  Terminal  4179  4340   4.55  +  0  HS307871_1
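A quick consistency check on this output: taking the begin and end coordinates of the nine exons as four-digit values (1710 to 1860, and so on), the exon lengths should add up to three nucleotides per amino acid of the reported 391 aa product. The short sketch below does that arithmetic; the variable and function names are only for this example.

```python
from typing import List, Tuple

# (begin, end) of the nine exons in the geneid output above, 1-based inclusive.
EXONS: List[Tuple[int, int]] = [
    (1710, 1860), (1976, 2055), (2132, 2194), (2434, 2682), (2749, 2910),
    (3279, 3416), (3576, 3676), (3780, 3846), (4179, 4340),
]


def coding_length(exons: List[Tuple[int, int]]) -> int:
    """Total number of nucleotides covered by the exons."""
    return sum(end - begin + 1 for begin, end in exons)


if __name__ == "__main__":
    total = coding_length(EXONS)
    print(total, "nt =", total // 3, "codons, remainder", total % 3)
```

The total is 1173 nt, which is exactly 3 × 391, in agreement with the "391 aa" in the header line.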

  15. GFF: a standard annotation format
      • Stands for: Gene Finding Format -or- General Feature Format
      • Designed as a single-line record for describing features on DNA sequence; originally used for gene prediction output
      • 9 tab-delimited fields common to all versions:
        seq source feature begin end score strand frame group
      • The group field differs between versions, but in every case no tabs are allowed
        • GFF2: group is a unique description, usually the gene name.
          NCOA1
        • GFF2.5 / GTF (Gene Transfer Format): tag-value pairs introduced; start_codon and stop_codon are required features for CDS
          transcript_id "NM_056789"; gene_id "NCOA1"
        • GFF3: Capitalized tags follow Sequence Ontology (SO) relationships; FASTA seqs can be embedded
          ID=NM_056789_exon1; Parent=NM_056789; note="5' UTR exon"
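Because every GFF version shares the same nine tab-delimited fields, a record can be read with a few lines of code. The following is a minimal sketch, not a full GFF parser: the record type, the function names and the handling of "." as a missing score are choices made for this example, and the GFF3 helper only splits simple tag=value pairs without decoding escaped characters.

```python
from typing import Dict, NamedTuple, Optional


class GffRecord(NamedTuple):
    seq: str
    source: str
    feature: str
    begin: int
    end: int
    score: Optional[float]
    strand: str
    frame: str
    group: str


def parse_gff_line(line: str) -> Optional[GffRecord]:
    """Parse one GFF line; return None for blank lines and '#' comments."""
    if not line.strip() or line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-delimited fields, got {len(fields)}")
    seq, source, feature, begin, end, score, strand, frame, group = fields
    return GffRecord(seq, source, feature, int(begin), int(end),
                     None if score == "." else float(score),
                     strand, frame, group)


def parse_gff3_group(group: str) -> Dict[str, str]:
    """Split a GFF3-style group field ('ID=x;Parent=y') into a dict."""
    pairs = (item.strip() for item in group.split(";"))
    return dict(item.split("=", 1) for item in pairs if item)


if __name__ == "__main__":
    line = "HS307871\tgeneid_v1.2\tInternal\t1710\t1860\t-0.11\t+\t0\tHS307871_1"
    print(parse_gff_line(line))
    print(parse_gff3_group("ID=NM_056789_exon1; Parent=NM_056789"))
```

Keeping the group field as a raw string in the record and parsing it separately reflects the point made on the slide: the first eight fields are stable, while the ninth depends on the GFF version.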

  16. GeneID output (same output as on slide 14)

  17. Visualizing features with gff2ps (figure generated by Josep Abril)

  18. Visualizing features on UCSC genome browser (custom tracks)
      • If "your" genome is served by UCSC, this is a good option because:
        • browsing is dynamic
        • access to other annotations
        • can view DNA sequence
        • can do complex intersections and filtering
      • gff2ps is good when:
        • your genome is not on UCSC
        • you want more flexible layout options
        • you want to run it 'offline'
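For the custom-track route, GFF lines are typically uploaded with a "track" header line in front of them. The sketch below prepends such a line and drops comment lines. The track name and description are placeholders, the function name is invented for this example, and the sequence names in the GFF must match the chromosome names used by the browser's assembly (for example chr7), which geneid output for an arbitrary FASTA entry will usually not do without renaming.

```python
from typing import Iterable


def as_custom_track(gff_lines: Iterable[str], name: str, description: str) -> str:
    """Return custom-track text: one 'track' header line followed by GFF records."""
    header = f'track name="{name}" description="{description}" visibility=2'
    records = [line.rstrip("\n") for line in gff_lines
               if line.strip() and not line.startswith("#")]
    return "\n".join([header] + records) + "\n"


if __name__ == "__main__":
    gff = [
        "## gff-version 2\n",
        "chr7\tgeneid_v1.2\tInternal\t1710\t1860\t-0.11\t+\t0\tgene_1\n",
    ]
    print(as_custom_track(gff, "geneid", "geneid predictions"), end="")
```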

  19. Extensions to GeneID • Syntenic Gene Prediction (dual-genome) • Evidence-based (constrained) gene prediction • U12 intron detection • Combining gene predictions • Selenoprotein gene prediction

  20. Syntenic Gene Prediction: SGP2

  21. Minor splicing and U12 introns • U12 introns make up a minor proportion of all introns (~0.33% in human, less in insects) • But they can be found in 2-3% of genes • Normally ignored, but this causes annotation problems • Easy to predict due to highly conserved donor and branch sites
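The two figures on this slide (about 0.33% of introns, but 2-3% of genes) are consistent with each other, because a gene with many introns has many chances to contain a U12 intron. A back-of-the-envelope sketch, assuming introns are independent and assuming roughly 8 introns per gene (that number is an assumption of this example, not a value from the slides):

```python
def fraction_of_genes_with_u12(u12_fraction: float, introns_per_gene: int) -> float:
    """P(at least one U12 intron in a gene), treating introns as independent."""
    return 1.0 - (1.0 - u12_fraction) ** introns_per_gene


if __name__ == "__main__":
    # 0.33% of introns are U12-type (slide); 8 introns per gene is an assumed value.
    print(f"{fraction_of_genes_with_u12(0.0033, 8):.1%}")
```

With these inputs the estimate comes out at about 2.6%, inside the 2-3% range quoted above.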

  22. Splice Signal Profiles: major and minor

  23. Gathering U12 Introns (flowchart, run for Human and Fruit Fly: align to EST/mRNA and predict on the genome, score, merge with all annotated introns (ENSEMBL?), then ortholog search (17 species) + spliced alignment, ending in the published U12DB; counts shown in the figure: 2084, 563, 568, 385, 658, 597)

  25. Coming Soon: GenePC, a Gene Prediction Combiner

  26. Tutorial Homepage • http://genome.imim.es/courses/Pamplona07/
      GBL Homepage • http://genome.imim.es/
