Computational Biology: Genome annotation formats

1. Computational Biology: Genome annotation formats. October 2004. Ian Holmes, Department of Bioengineering, University of California, Berkeley. From an original lecture by Irmtraud Meyer.

2. Overview:
- What is genome annotation?
- In which format can a genome annotation be saved to files?
- Definition of the gff genome annotation format
- Other genome annotation formats
- Application: evaluating the performance of a gene prediction program
- Exercises

3. What is genome annotation? Genome annotation is the localisation of functional elements in a genomic sequence, for example the location of:
- protein coding genes
- tRNA and other RNA genes
- promoters
- ...

4. Example 1: protein coding genes

5. Formats for saving annotations: Motivation: to save information on a gene, a format should be able to record:
- the location of the gene in the genome
- the position of its exon-intron boundaries
- the strand of DNA on which the gene lies
- the source of annotation
- the completeness of the gene structure

6. The GFF format: GFF = Genefinding File Format, a format used to save gene structures. Idea: divide a gene into its constituents:
- Exon: transcribed sections of a gene
- CDS: translated sections of a gene
- Start_Codon
- Stop_Codon

7. The GFF format:

8. The GFF format: Format of each gff-line:
name source feature start end score strand frame group
where:
- name: the name of the sequence (string)
- source: the name of the source of annotation (string)
- feature: feature type: "Exon", "CDS", "Start_Codon", "Stop_Codon" (string)
- start: start position of feature (integer)
- end: end position of feature (integer)
- score: score (rational number) associated with feature, set to "." if score not used
- strand: strand on which feature lies, possible values are "+" or "-"
- frame: "0", "1" or "2" for CDS, Start_Codon and Stop_Codon, "." for Exon
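As a minimal sketch of the nine-field layout above, the following Python splits one tab-delimited gff line into named fields. The names `GffRecord` and `parse_gff_line` are hypothetical and chosen for illustration only; mapping "." to `None` for score and frame is likewise a choice made here, not part of the format definition.

```python
from typing import NamedTuple, Optional


class GffRecord(NamedTuple):
    """One gff line; field names follow the slide (hypothetical type)."""
    name: str
    source: str
    feature: str
    start: int
    end: int
    score: Optional[float]
    strand: str
    frame: Optional[int]
    group: str


def parse_gff_line(line: str) -> GffRecord:
    """Split a tab-delimited gff line into its nine fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-delimited fields, got {len(fields)}")
    name, source, feature, start, end, score, strand, frame, group = fields
    return GffRecord(
        name=name,
        source=source,
        feature=feature,
        start=int(start),
        end=int(end),
        # "." means "no score" / "no frame"; represent both as None here.
        score=None if score == "." else float(score),
        strand=strand,
        frame=None if frame == "." else int(frame),
        group=group,
    )


record = parse_gff_line(
    "Chr1\tsrc\tCDS\t380\t401\t.\t+\t0\tgene_id 1; transcript_id 1; exon_number 2"
)
print(record.feature, record.start, record.end, record.frame)
```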

9. The GFF format: remarks
- the fields in a gff line are tab delimited
- start <= end (important to keep in mind when dealing with genes on the reverse strand!)
- the start and end positions are the corresponding positions on the "+" strand
- definition of frame for CDS, Start_Codon and Stop_Codon features:
  - "0": first nucleotide in feature has codon position 0
  - "1": first nucleotide in feature has codon position 2
  - "2": first nucleotide in feature has codon position 1
  => note that the frame of a CDS is NOT its length modulo 3 and that the frame of a Start_Codon and Stop_Codon always has to be "0" (Why?)
- Exons do not have a frame, use "." as the value of their frame
- if there is no score associated with a feature, use "."
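One way to read the frame table above is that the frame counts the nucleotides to skip before the next codon begins. The sketch below encodes that reading; the function name `frame_from_codon_position` is hypothetical, and codon positions are numbered 0, 1, 2 as on the slide.

```python
def frame_from_codon_position(codon_position: int) -> int:
    """Frame of a feature whose first nucleotide has the given codon position."""
    if codon_position not in (0, 1, 2):
        raise ValueError("codon position must be 0, 1 or 2")
    # Number of nucleotides left in the current codon before a new one starts.
    return (3 - codon_position) % 3


for position in (0, 1, 2):
    print("codon position", position, "-> frame", frame_from_codon_position(position))
```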

10. The GFF format: more remarks
- the terminal CDS does not comprise the positions of the Stop_Codon, as the Stop_Codon is not translated
- the initial CDS does comprise the positions of the Start_Codon, as it is translated
- the order of lines in a gff file is irrelevant, although it makes sense to group them by genes

11. The GFF format: Example 2: A valid description of this gene in gff format is for example:
Chr1 src Exon 150 200 . + . gene_id 1; transcript_id 1; exon_number 1
Chr1 src Exon 300 401 . + . gene_id 1; transcript_id 1; exon_number 2
Chr1 src CDS 380 401 . + 0 gene_id 1; transcript_id 1; exon_number 2
Chr1 src Exon 501 650 . + . gene_id 1; transcript_id 1; exon_number 3
Chr1 src CDS 501 650 . + 2 gene_id 1; transcript_id 1; exon_number 3
Chr1 src Exon 700 800 . + . gene_id 1; transcript_id 1; exon_number 4
Chr1 src CDS 700 707 . + 2 gene_id 1; transcript_id 1; exon_number 4
Chr1 src Exon 900 1000 . + . gene_id 1; transcript_id 1; exon_number 5
Chr1 src Start_Codon 380 382 . + 0 gene_id 1; transcript_id 1; exon_number 2
Chr1 src Stop_Codon 708 710 . + 0 gene_id 1; transcript_id 1; exon_number 4

12. The GFF format: Example 3: a gene on the reverse strand. A valid description of this gene in gff format is for example:
Chr22 src Exon 649 700 . - . gene_id 1; transcript_id 1; exon_number 1
Chr22 src CDS 649 700 . - 0 gene_id 1; transcript_id 1; exon_number 1
Chr22 src Exon 351 500 . - . gene_id 1; transcript_id 1; exon_number 2
Chr22 src CDS 351 500 . - 2 gene_id 1; transcript_id 1; exon_number 2
Chr22 src Exon 150 250 . - . gene_id 1; transcript_id 1; exon_number 3
Chr22 src CDS 153 250 . - 2 gene_id 1; transcript_id 1; exon_number 3
Chr22 src Start_Codon 698 700 . - 0 gene_id 1; transcript_id 1; exon_number 1
Chr22 src Stop_Codon 150 152 . - 0 gene_id 1; transcript_id 1; exon_number 3
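The CDS frames in examples 2 and 3 can be recomputed from the CDS coordinates and the strand alone. The sketch below assumes a complete gene whose initial CDS begins with the start codon; `cds_frames` is a hypothetical helper name. The key step for the reverse strand is to walk the CDSs from the highest coordinates down, since that is the 5' to 3' order of the transcript.

```python
def cds_frames(cds_spans, strand):
    """Return {(start, end): frame} for the CDS spans of one complete gene.

    cds_spans: list of (start, end) pairs in "+" strand coordinates.
    strand: "+" or "-".
    """
    # Order the CDSs 5' to 3' along the transcript: ascending coordinates on
    # the "+" strand, descending coordinates on the "-" strand.
    ordered = sorted(cds_spans, reverse=(strand == "-"))
    frames = {}
    translated = 0  # coding nucleotides seen before the current CDS
    for start, end in ordered:
        codon_position = translated % 3
        frames[(start, end)] = (3 - codon_position) % 3
        translated += end - start + 1
    return frames


example_2 = cds_frames([(380, 401), (501, 650), (700, 707)], "+")
example_3 = cds_frames([(649, 700), (351, 500), (153, 250)], "-")
print(example_2)
print(example_3)
```

Both calls give frames 0, 2, 2 in transcript order, matching the frame column of the CDS lines in the two examples.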

13. Other genome annotation formats:
- DAS = XML version of GFF
  - uses tags to delimit fields, not whitespace
  - a little more structured
- GAME = Genome Annotation Markup Elements
The format definition can be found at: http://www.bioxml.org/Projects/game

14. Uses of a genome annotation format:
- exchanging annotation information
- checking an annotation
- comparing different annotations
- visualising an annotation, see for example www.ensembl.org

15. Evaluating the performance of a gene prediction program:

16. Evaluation on different levels:

17. Evaluation on different levels (cont'd):

18. Measures of performance:

19. Exercises:
1.) Check that you can reproduce the frames of the CDS lines in example 3, knowing the positions of the CDSs, the start codon and the stop codon.
2.) What do the terms (# tp + # fp) and (# tp + # fn) stand for?
3.) Looking at a gff entry of a gene, can you deduce if the annotation of the gene is complete?
4.) In which interval of numbers do the values of sensitivity and specificity fall?

20. Exercises:
5.) This exercise prepares you for the practicals following this lecture: You are collaborating with colleagues abroad who send you a gff file with the genes of their genome annotation as well as a fasta file with the corresponding genome sequence.
a) How do you check the gff file for errors? Which checks can you think of?
b) Outline the structure of (i.e. write the pseudocode for) a program which checks the gff file for errors.
6.) You are given a gff file with an annotation predicted by a gene prediction program.
a) Which information do you require to evaluate the performance of the gene prediction program?
b) Outline the structure of a program which evaluates the performance of a gene prediction program by comparing the predicted genes (contained in gff format in file 1) to the known genes (contained in gff format in file 2) (see example 4).

21. Answers to exercises:
1.) Look at the gff lines with features CDS and Start_Codon in example 3:
- the CDS with exon_number 1 is the initial, i.e. 5'-most, CDS of the gene as it starts with a start codon, so it has frame 0
- the initial CDS has length 700 - 649 + 1 = 52 = 17 * 3 + 1 => the next CDS with exon_number 2 starts with codon position 1 => the next CDS has frame 2
- the second CDS has length 500 - 351 + 1 = 150 = 50 * 3 => the next CDS with exon_number 3 starts with the same codon position => the next CDS has frame 2
2.) (# tp + # fp) is the number of predicted features; (# tp + # fn) is the number of annotated features.
3.) A gff entry for a gene only tells you if the protein coding part of the gene is complete. If the gff entry comprises the start and stop codon of the gene, its protein coding part is complete. A gff entry does not show if the information on the untranslated exons is complete.

22. Answers to exercises: 4.) The values for sensitivity ((# tp) / (# tp + # fn)) and specificity ((# tp) / (# tp + # fp)) lie between 0 and 1. The sensitivity is 1 only if (# fn) = 0 and the specificity is 1 only if (# fp) = 0.
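The two formulas above can be sketched in Python as follows. This is an illustration under an assumption not stated in the slides: a predicted feature counts as a true positive only if its (start, end) pair matches an annotated feature exactly. The function name and the predicted coordinates are made up for the example; the annotated coordinates are the CDSs of example 2.

```python
def sensitivity_specificity(predicted, annotated):
    """Return (sensitivity, specificity) as defined on the slide.

    predicted, annotated: iterables of (start, end) pairs.
    Assumption: a prediction is correct only on an exact coordinate match.
    """
    predicted, annotated = set(predicted), set(annotated)
    tp = len(predicted & annotated)   # predicted and annotated
    fp = len(predicted - annotated)   # predicted but not annotated
    fn = len(annotated - predicted)   # annotated but not predicted
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tp / (tp + fp) if tp + fp else float("nan")
    return sensitivity, specificity


annotated_cds = [(380, 401), (501, 650), (700, 707)]
predicted_cds = [(380, 401), (501, 650), (700, 710), (900, 950)]  # hypothetical
sens, spec = sensitivity_specificity(predicted_cds, annotated_cds)
print(f"sensitivity = {sens:.3f}, specificity = {spec:.3f}")
```

Here # tp = 2, # fp = 2 and # fn = 1, so the sensitivity is 2/3 and the specificity is 1/2.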

23. Answer to exercise 5: Note: This exercise is about checking the annotation given in gff format, NOT the gff format itself!
a)
- checking the annotation in the gff file is best done if the corresponding DNA sequences are available, as this allows more checks to be performed, so for the practical you can assume that you are given a gff and the corresponding fasta file containing the DNA sequences
- possible checks of the annotation are:
  - Is the start codon correct (if it exists)?
  - Is the stop codon correct (if it exists)?
  - Are there no in-frame stop codons within the CDS?
  - Do the splice sites look fine?
  - For complete genes: Is the sum of CDS lengths a multiple of 3?
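Several of the checks listed above can be run on the spliced coding sequence once it has been extracted. The sketch below assumes the standard genetic code (start codon ATG; stop codons TAA, TAG, TGA), which the slides do not spell out, and takes as input the concatenated CDSs followed by the stop codon, read 5' to 3'. The function name is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


def check_coding_sequence(coding_seq):
    """Return a list of problems found in a spliced CDS plus its stop codon."""
    errors = []
    if len(coding_seq) % 3 != 0:
        errors.append("length is not a multiple of 3")
    # Split into complete codons, ignoring any trailing partial codon.
    codons = [coding_seq[i:i + 3] for i in range(0, len(coding_seq) - 2, 3)]
    if not codons or codons[0] != "ATG":
        errors.append("start codon is not ATG")
    if not codons or codons[-1] not in STOP_CODONS:
        errors.append("stop codon is not TAA, TAG or TGA")
    # Every codon between the first and the last must be a sense codon.
    if any(codon in STOP_CODONS for codon in codons[1:-1]):
        errors.append("in-frame stop codon within the CDS")
    return errors


print(check_coding_sequence("ATGAAATAA"))
print(check_coding_sequence("ATGTAAAAATGA"))
```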

24. Answer to exercise 5 (cont'd): b) For the program which checks the annotation you may assume the following, which you do not have to check:
- sequence names in the fasta file are unique
- use of gff format is correct
You may assume the following in your program, but should check it:
- DNA sequences consist of A, C, G, T letters only
- all genes are complete, i.e. comprise a start and stop codon
- splice sites are either GT-AG (consensus) or GC-AG
- there is exactly one gene associated with each fasta file sequence
Some things to keep in mind:
- genes can lie on the forward "+" or the reverse "-" strand
- the DNA sequences in the fasta file are the "+" strand sequences
- the coordinates in the fasta and the gff file are absolute coordinates, but in your program you may prefer to make some calculations in relative coordinates (i.e. the first sequence position being 1 and the last being length_of_sequence)
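The points about strands, coordinates and splice sites can be illustrated with a few helper functions. This is a sketch only: all function names are hypothetical, and the `seq_start` parameter (the absolute coordinate of the first nucleotide of the fasta sequence) is an assumption about how absolute coordinates would be converted to relative ones.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def is_dna(seq):
    """True if the sequence is non-empty and uses only A, C, G, T."""
    return bool(seq) and set(seq) <= set("ACGT")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def feature_sequence(plus_strand, start, end, strand, seq_start=1):
    """Sequence of a feature, read 5' to 3' on its own strand.

    start and end are absolute, inclusive "+" strand coordinates.
    """
    rel_start = start - seq_start + 1   # relative coordinates, first position = 1
    rel_end = end - seq_start + 1
    piece = plus_strand[rel_start - 1:rel_end]
    return reverse_complement(piece) if strand == "-" else piece


def splice_sites_ok(plus_strand, intron_start, intron_end, strand, seq_start=1):
    """True if the intron starts with GT or GC and ends with AG."""
    intron = feature_sequence(plus_strand, intron_start, intron_end, strand, seq_start)
    return intron[:2] in ("GT", "GC") and intron[-2:] == "AG"


forward = "AAGTCCCCAGTT"            # intron GTCCCCAG at positions 3 to 10
reverse = reverse_complement(forward)
print(splice_sites_ok(forward, 3, 10, "+"))
print(splice_sites_ok(reverse, 3, 10, "-"))
print(splice_sites_ok(forward, 103, 110, "+", seq_start=101))
```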

25. Pseudocode (outline of the program):
1.) read all of the fasta file and get all DNA sequences and headers
2.) for each entry in the fasta file:
  a) check fasta entry:
    i) length of DNA sequence equals length indicated in header? If not, report error and go to next sequence (=: rerr&gonext)
    ii) DNA sequence consists of A, C, G, T letters only? If not, rerr&gonext
  b) read gff lines for that sequence name:
    i) check gff lines exist: if not, rerr&gonext
    ii) check there is exactly one gene associated with fasta entry: if not, rerr&gonext
    iii) check if gene is complete: if not, rerr&gonext
    iv) check if sum of CDS lengths is a multiple of 3: if not, rerr&gonext
    v) check if start codon correct: if not, report error
    vi) check if stop codon correct: if not, report error
    vii) check that there are no in-frame stop codons: if there are any, report error
    viii) if relevant, check if splice sites are ok: if not, report error
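A compact Python sketch of this outline follows. It is not the practical's reference solution. It works on strings rather than files, and it assumes that gff coordinates are relative to the fasta sequence, that the first gff group item is the gene id, and that the standard start and stop codons apply. It omits step a) i), because the header layout is not given here, and step b) viii). All names and the toy data are hypothetical.

```python
from collections import defaultdict

STOPS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def read_fasta(text):
    """Return {name: sequence}; the name is the first word of each header."""
    seqs, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].split()[0]
            seqs[name] = ""
        elif line and name is not None:
            seqs[name] += line.upper()
    return seqs


def read_gff(text):
    """Return {sequence name: [(feature, start, end, strand, group), ...]}."""
    by_name = defaultdict(list)
    for line in text.splitlines():
        if line.strip():
            name, _, feature, start, end, _, strand, _, group = line.split("\t")
            by_name[name].append((feature, int(start), int(end), strand, group))
    return by_name


def extract(seq, start, end, strand):
    """Feature sequence read 5' to 3' on its own strand (1-based, inclusive)."""
    piece = seq[start - 1:end]
    return piece.translate(COMPLEMENT)[::-1] if strand == "-" else piece


def check_entry(seq, lines):
    """Steps a) ii) and b) i) to vii): return the errors for one fasta entry."""
    if set(seq) - set("ACGT"):
        return ["sequence contains letters other than A, C, G, T"]
    if not lines:
        return ["no gff lines for this sequence"]
    genes = {line[4].split(";")[0].strip() for line in lines}
    if len(genes) != 1:
        return [f"expected exactly one gene, found {len(genes)}"]
    strand = lines[0][3]
    starts = [line for line in lines if line[0] == "Start_Codon"]
    stops = [line for line in lines if line[0] == "Stop_Codon"]
    if len(starts) != 1 or len(stops) != 1:
        return ["gene is incomplete (needs one start and one stop codon)"]
    # CDSs in transcript order: descending coordinates on the "-" strand.
    cds = sorted(((line[1], line[2]) for line in lines if line[0] == "CDS"),
                 reverse=(strand == "-"))
    coding = "".join(extract(seq, s, e, strand) for s, e in cds)
    if len(coding) % 3 != 0:
        return ["sum of CDS lengths is not a multiple of 3"]
    errors = []  # from here on, report errors but keep checking
    if extract(seq, starts[0][1], starts[0][2], strand) != "ATG":
        errors.append("start codon is not ATG")
    if extract(seq, stops[0][1], stops[0][2], strand) not in STOPS:
        errors.append("stop codon is not TAA, TAG or TGA")
    if any(coding[i:i + 3] in STOPS for i in range(0, len(coding), 3)):
        errors.append("in-frame stop codon within the CDS")
    return errors


def check_annotation(fasta_text, gff_text):
    """Return {sequence name: list of errors} for every fasta entry."""
    gff = read_gff(gff_text)
    return {name: check_entry(seq, gff.get(name, []))
            for name, seq in read_fasta(fasta_text).items()}


# Toy data: seq1 holds a two-CDS gene (ATGA + AA, then stop codon TAA).
fasta = ">seq1\nCCATGAGTAAAGAATAACC\n>seq2\nACGTN\n"
group = "gene_id 1; transcript_id 1; exon_number 1"
gff = "\n".join("\t".join(fields) for fields in [
    ("seq1", "src", "CDS", "3", "6", ".", "+", "0", group),
    ("seq1", "src", "CDS", "13", "14", ".", "+", "2", group),
    ("seq1", "src", "Start_Codon", "3", "5", ".", "+", "0", group),
    ("seq1", "src", "Stop_Codon", "15", "17", ".", "+", "0", group),
])
report = check_annotation(fasta, gff)
for name, errors in report.items():
    print(name, errors or "ok")
```

Checks i) to iv) return immediately, mirroring "rerr&gonext", while checks v) to vii) accumulate their messages so that all of them are reported for a gene.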

26. Info on input files and functions:

27. Remark about the fasta header lines:

28. Answer to exercise 5b:
