Practical Information on Genome Annotation: Understanding the Possibilities and Pitfalls

Introduction to genome annotation - practical information Some possibilities and some pitfalls

Practical info • Coffee breaks • Lunch • Dinner at Mesa Wednesday 18.00

Understanding annotation Some possibilities and some pitfalls Henrik Lantz, BILS/SciLifeLab

Lecture synopsis • What is annotation? • Structural genome annotation • Types of data used • Transcriptome annotation • Functional annotation

What is annotation? • My definition: Identification of regions of interest in sequence data • A stricter definition: Using multiple lines of evidence to identify regions of interest in sequence data • Gene prediction: Inferring the most likely gene models (no external data used)

From a genome…

…to an annotated gene

GFF file format

GFF3 file format
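As a rough illustration of what a GFF3 record holds, the sketch below splits one feature line into the nine tab-separated columns defined by the GFF3 format and unpacks the key=value attribute column. The example line (scaffold name, IDs, coordinates) is made up for illustration and does not come from a real annotation.

```python
from typing import NamedTuple
from urllib.parse import unquote


class Gff3Feature(NamedTuple):
    seqid: str
    source: str
    type: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    score: str
    strand: str
    phase: str
    attributes: dict


def parse_gff3_line(line: str) -> Gff3Feature:
    """Split one tab-separated GFF3 feature line into its nine columns."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 columns, got {len(cols)}")
    attributes = {}
    for pair in cols[8].split(";"):
        if pair:
            key, _, value = pair.partition("=")
            # GFF3 percent-encodes reserved characters in attributes.
            attributes[unquote(key)] = unquote(value)
    return Gff3Feature(cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]),
                       cols[5], cols[6], cols[7], attributes)


# Hypothetical example line, invented for this sketch.
example = "scaffold_1\tmaker\texon\t1200\t1450\t.\t+\t.\tID=exon1;Parent=mRNA1"
feature = parse_gff3_line(example)
print(feature.type, feature.start, feature.end, feature.attributes["Parent"])
```

The Parent attribute is what links exons to their mRNA and mRNAs to their gene, which is how a gene model with several isoforms is expressed in a flat file.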

GTF file format

Why is annotation important? Example: Differential expression [Figure: mapped reads from condition 1 and from condition 2 aligned to the genome]

Why is annotation important? [Figure: RNA-seq reads aligned to the genome]

There are two major parts of annotation • 1) Structural: Find out where the regions of interest (usually genes) are in the genome and what they look like. How many exons/introns? UTRs? Isoforms? • 2) Functional: Find out what the regions do. What do they code for?

Many open reading frames possible
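To make the point concrete, here is a minimal sketch that lists every ATG-to-stop open reading frame in all six reading frames of a DNA string. It is a naive scan, not how any particular gene finder works; the function names and the min_codons threshold are illustrative assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq: str, min_codons: int = 2):
    """Yield (strand, frame, start, end) for ATG...stop ORFs in six frames.

    Coordinates are 0-based, end-exclusive, and refer to the strand that
    was scanned (for "-" they index the reverse complement).
    """
    forward = seq.upper()
    for strand, s in (("+", forward), ("-", reverse_complement(forward))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    # Length is counted in codons, stop codon included.
                    if (i + 3 - start) // 3 >= min_codons:
                        yield strand, frame, start, i + 3
                    start = None


# Toy sequence invented for this sketch.
for orf in find_orfs("CCATGAAATGACC"):
    print(orf)
```

On real genomic sequence a scan like this returns far more candidates than there are genes, which is why ORFs alone are not enough to call gene models.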

Difficult in practice

Combine data - use Maker! • External data - proteins, RNA-seq (incl. ESTs) • Ab initio gene finders • (Lift-overs from closely related genomes) => Combined annotation

Before annotation – check assembly quality • The quality of the assembly will heavily influence the quality of the annotation • SNP-errors can change start/stop-codons • Indels can cause frame-shifts • Annotation tools often have problems with incomplete loci • And of course, if a locus is completely missing from the assembly, it cannot be annotated
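The effect of assembly errors on a coding sequence can be shown with a small worked example. The sketch below translates a toy coding sequence with the standard genetic code, then repeats the translation after a single-base substitution and after a single-base deletion. The sequences are invented for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(cds: str) -> str:
    """Translate from position 0; '*' is a stop, a trailing partial codon is dropped."""
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, len(cds) - 2, 3))


reference = "ATGGCCAAGTTGTAA"                     # toy CDS: M A K L stop
snp = reference[:6] + "T" + reference[7:]         # AAG -> TAG at the third codon
deletion = reference[:3] + reference[4:]          # one base lost after the ATG

print(translate(reference))  # MAKL*
print(translate(snp))        # MA*L*  (premature stop codon)
print(translate(deletion))   # MPSC   (frame shifted, original stop lost)
```

A single wrong base is enough to truncate a gene model, and a single missing base changes every downstream codon.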

Assembly validation using CEGMA/BUSCO • CEGMA is now deprecated; BUSCO is actively developed • Both look for core genes: CEGMA uses 248 core genes, BUSCO uses gene sets for phylogenetic groups, with up to 3000 genes • Both report the percentage of complete genes, which is extrapolated to the amount of gene space assembled

BUSCO output

CEGMA output
            #Prots  %Completeness  -  #Total  Average  %Ortho
Complete       233          93.95  -     265     1.14    9.87
  Group 1       60          90.91  -      66     1.10    6.67
  Group 2       52          92.86  -      58     1.12   11.54
  Group 3       59          96.72  -      71     1.20   13.56
  Group 4       62          95.38  -      70     1.13    8.06
Partial        238          95.97  -     277     1.16   12.18
  Group 1       62          93.94  -      69     1.11    6.45
  Group 2       54          96.43  -      61     1.13   12.96
  Group 3       60          98.36  -      75     1.25   18.33
  Group 4       62          95.38  -      72     1.16   11.29
# These results are based on the set of genes selected by Genis Parra
# Prots = number of 248 ultra-conserved CEGs present in genome
# %Completeness = percentage of 248 ultra-conserved CEGs present
# Total = total number of CEGs present including putative orthologs
# Average = average number of orthologs per CEG
# %Ortho = percentage of detected CEGS that have more than 1 ortholog
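The summary columns in the CEGMA report follow directly from the counts. As a small worked example, the sketch below recomputes %Completeness and Average for the "Complete" and "Partial" rows, using the numbers in the output above and the legend's definition of 248 core genes. The helper name percent is an illustrative choice.

```python
def percent(part: int, whole: int) -> float:
    """Percentage rounded to two decimals, as in the CEGMA report."""
    return round(100 * part / whole, 2)


CEG_TOTAL = 248                      # number of ultra-conserved CEGs

complete_prots, complete_total = 233, 265
partial_prots, partial_total = 238, 277

print(percent(complete_prots, CEG_TOTAL))         # 93.95
print(percent(partial_prots, CEG_TOTAL))          # 95.97
print(round(complete_total / complete_prots, 2))  # 1.14 orthologs per CEG
print(round(partial_total / partial_prots, 2))    # 1.16 orthologs per CEG
```

The same arithmetic applies to a BUSCO summary: the fraction of the expected gene set found complete is read as an estimate of how much of the gene space made it into the assembly.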

Repeat masking • RepeatModeler to find new repeats • http://www.repeatmasker.org/RepeatModeler/ • RepeatMasker to mask known repeats • http://www.repeatmasker.org • Repeat coordinates are given to Maker as a GFF file
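To show what masking by coordinates means in practice, here is a minimal sketch that lower-cases (soft-masks) repeat intervals in a sequence. It assumes 1-based, inclusive coordinates as used in GFF; the choice of soft masking and the function name are assumptions for illustration and do not describe what RepeatMasker or Maker do internally.

```python
def soft_mask(seq: str, repeats: list[tuple[int, int]]) -> str:
    """Lower-case every repeat interval (1-based, inclusive, as in GFF)."""
    chars = list(seq.upper())
    for start, end in repeats:
        # Convert to 0-based slicing and clip to the sequence length.
        for i in range(max(start - 1, 0), min(end, len(chars))):
            chars[i] = chars[i].lower()
    return "".join(chars)


# Toy sequence and interval invented for this sketch.
print(soft_mask("ACGTACGTAC", [(3, 5)]))  # ACgtaCGTAC
```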

Types of data used Proteins Transcripts • Known amino acid sequences from other organisms • Assembled from RNA-seq or downloaded ESTs

Proteins • Conserved in sequence => conserved annotation with little noise • Proteins from model organisms often used => bias? • Proteins can be incomplete => problems as many annotation procedures are heavily dependent on protein alignments
>ENSTGUP00000017616 pep:novel chromosome:taeGut3.2.4:8_random:2849599:2959678:-1 gene:ENSTGUG00000017338 transcript:ENSTGUT00000018018 gene_biotype:protein_coding transcript_biotype:protein_coding
RSPNATEYNWHHLRYPKIPERLNPPAAAGPALSTAEGWMLPWGNGQHPLLARAPGKGRER
DGKELIKKPKTFKFTFLKKKKKKKKKTFK
>ENSTGUP00000017615 pep:novel chromosome:taeGut3.2.4:23_random:205321:209117:1 gene:ENSTGUG00000017337 transcript:ENSTGUT00000018017 gene_biotype:protein_coding transcript_biotype:protein_coding
PDLRELVLMFEHLHRVRNGGFRNSEVKKWPDRSPPPYHSFTPAQKSFSLAGCSGESTKMG
IKERMRLSSSQRQGSRGRQQHLGPPLHRSPSPEDVAEATSPTKVQKSWSFNDRTRFRASL
RLKPRIPAEGDCPPEDSGEERSSPCDLTFEDIMPAVKTLIRAVRILKFLVAKRKFKETLR
PYDVKDVIEQYSAGHLDMLGRIKSLQTRVEQIVGRDRALPADKKVREKGEKPALEAELVD
ELSMMGRVVKVERQVQSIEHKLDLLLGLYSRCLRKGSANSLVLAAVRVPPGEPDVTSDYQ
SPVEHEDISTSAQSLSISRLASTNMD
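One quick way to spot possibly incomplete proteins in an evidence set is to check whether each sequence starts with methionine. The sketch below reads FASTA text and flags records that do not. This is only a heuristic and an assumption of this example, not a rule from the slides; the first two records use the opening residues of the two zebra finch proteins shown above, and the third record is made up.

```python
def read_fasta(text: str) -> dict:
    """Return {record id: sequence} from FASTA-formatted text."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].split()[0]   # id is the first word of the header
            records[name] = ""
        elif line and name is not None:
            records[name] += line
    return records


def looks_incomplete(protein: str) -> bool:
    """Heuristic: a full-length protein is expected to start with M."""
    return not protein.startswith("M")


fasta_text = """>ENSTGUP00000017616
RSPNATEYNW
>ENSTGUP00000017615
PDLRELVLMF
>hypothetical_complete_protein
MKTAYIAKQR
"""

flagged = [name for name, seq in read_fasta(fasta_text).items()
           if looks_incomplete(seq)]
print(flagged)
```

Both proteins from the slide are flagged, which matches the point that evidence proteins downloaded from databases can be fragments.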

Protein sequences are aligned to the genome

Proteins • Maker will align proteins for you: Blast -> Exonerate • Blast is not structure aware, Exonerate is (splice sites, start/stop codons) • Preferred file-format: fasta

RNA-seq [Figure: DNA with exons, introns and UTRs (introns begin with GT and end with AG; start codon ATG; stop codon TAG, TAA or TGA) -> transcription -> pre-mRNA with poly-A tail -> splicing -> mRNA -> translation]

RNA-seq • Should always be included in an annotation project • From the same organism as the genomic data => unbiased • Can be very noisy (tissue/species dependent), can include pre-mRNA • Sample different tissues or life stages if possible • Avoid gonads; muscle or liver is good

Spliced reads [Figure: DNA with exons, introns and UTRs (introns begin with GT and end with AG; start codon ATG; stop codon TAG, TAA or TGA) -> transcription -> pre-mRNA with poly-A tail -> splicing -> mRNA -> translation]

RNA-seq - Spliced reads
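A spliced read implies an intron, and a simple sanity check is whether that intron has the canonical GT...AG boundaries shown in the diagram. The sketch below tests this for a plus-strand intron given 1-based, inclusive coordinates. The function name and the toy sequence are illustrative assumptions, and real aligners also accept non-canonical splice sites.

```python
def has_canonical_splice_sites(genome: str, intron_start: int, intron_end: int) -> bool:
    """True if the plus-strand intron starts with GT and ends with AG.

    Coordinates are 1-based and inclusive, as in GFF.
    """
    intron = genome[intron_start - 1:intron_end].upper()
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")


# Toy locus invented for this sketch: exon, intron, exon.
exon1, intron, exon2 = "ATGGCC", "GTAAGTTTTCAG", "GAATAA"
genome = exon1 + intron + exon2

print(has_canonical_splice_sites(genome, 7, 18))   # intron exactly as placed
print(has_canonical_splice_sites(genome, 8, 19))   # boundaries shifted by one base
```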

Pre-mRNA [Figure: DNA with exons, introns and UTRs (start codon ATG; stop codon TAG, TAA or TGA) -> transcription -> pre-mRNA with poly-A tail -> splicing -> mRNA -> translation]

Pre-mRNA

A lot is transcribed in a cell

How to use RNA-seq • Maker will align transcripts (ESTs), but these need to be assembled first. • Cufflinks/Stringtie: mapped reads -> transcripts

How to use RNA-seq • Maker will align transcripts (ESTs), but these need to be assembled first. • Cufflinks/Stringtie: mapped reads -> transcripts • Trinity: assembles transcripts without a genome

Mapped Trinity-assembled transcripts

How to use RNA-seq • Maker will align transcripts (ESTs), but these need to be assembled first. • Cufflinks/Stringtie: mapped reads -> transcripts • Trinity: assembles transcripts without a genome • PASA can be used to improve transcript quality

Liftovers are very useful for orthology determination • Kraken • Align the two genomes (Satsuma) and then transfer annotations between aligned regions

General recommendations • Always combine different types of evidence! • One single method is not enough! • Use Maker!

Or get help - NBIS assembly and annotation team • Five people working with assembly and annotation • Deliver high quality annotations • Enable visualization and manual curation through a web interface • Also available for consultation • http://nbis.se/support/supportform/index.php
