Introduction to the Gene Ontology: A User’s Guide

COST Functional Modeling Workshop, 22-24 April, Helsinki

Introduction to GO • The Gene Ontology Consortium • The Gene Ontology • A GO annotation example • GO evidence codes • no GO vs ND • Making Annotations • Multiple annotations - the gene association (ga) file • Sources of GO

The Gene Ontology Consortium

http://www.geneontology.org/

The GO Consortium provides: • central repository for ontology updates and annotations • central mechanism for changing GO terms (adding, editing, deleting) • quality checking for annotations • consistency checks for how annotations are made by different groups • central source of information for users • co-ordination of annotation effort

GO Consortium and GO Groups: • groups decide gene product set to annotate • biocurator training • tool development mostly by groups • many non-consortium groups • education and training by groups • outreach to biocurators/databases by GOC

Annotation Strategy • Experimental data • many species have a body of published, experimental data • Detailed, species-specific annotation: ‘depth’ • Requires manual annotation of literature - slow • Computational analysis • Can be automated - faster • Gives ‘breadth’ of coverage across the genome • Annotations are general • Relatively few annotation pipelines

Releasing GO Annotations • GO annotations are stored at individual databases • Sanity checks as data is entered – is all the data required filled in? • Databases do quality control (QC) checks and submit to GO • GO Consortium runs additional QC and collates annotations • Checked annotations are picked up by GO users • eg. public databases, genome browsers, array vendors, GO expression analysis tools

Annotation release flow (diagram; components as labelled on the slide)
• AgBase: AgBase biocurators • AgBase biocuration interface (‘sanity’ check) • AgBase database • AgBase Quality Checks & Releases • picked up by GO analysis tools and microarray developers
• EBI GOA Project: UniProt database (‘sanity’ check) • QuickGO browser • picked up by GO analysis tools and microarray developers
• GO Consortium: GO Consortium database (‘sanity’ check & GOC QC) • AmiGO browser • picked up by public databases, GO analysis tools and microarray developers
• ‘sanity’ check: checks to ensure all appropriate information is captured, no obsolete GO:IDs are used, etc.
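The ‘sanity’ check described above (all required information captured, no obsolete GO:IDs) can be sketched in a few lines. This is only an illustration: the field names, the list of required fields and the `obsolete_ids` set are assumptions made for the example, not the actual checks run by AgBase, EBI GOA or the GO Consortium.

```python
import re

# GO identifiers are written as "GO:" followed by seven digits.
GO_ID_PATTERN = re.compile(r"^GO:\d{7}$")

# Hypothetical set of fields a curation interface might require.
REQUIRED_FIELDS = ("gene_product", "go_id", "evidence", "reference")


def sanity_check(annotation, obsolete_ids=frozenset()):
    """Return a list of problems found in one annotation (empty if none)."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not annotation.get(field):
            problems.append(f"missing {field}")
    go_id = annotation.get("go_id", "")
    if go_id and not GO_ID_PATTERN.match(go_id):
        problems.append(f"malformed GO ID: {go_id}")
    if go_id in obsolete_ids:
        problems.append(f"obsolete GO ID: {go_id}")
    return problems
```

An annotation that returns an empty list passes; anything else would go back to the biocurator before release.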

The Gene Ontology

Gene Ontology (GO) • Not about genes! • Gene products: genes, transcripts, ncRNA, proteins • The GO describes gene product function • Not a single ontology • Biological Process (BP or P) • Molecular Function (MF or F) • Cellular Component (CC or C) • de facto method for functional annotation • Widely used for functional genomics (high throughput).

What the GO doesn’t do: • Does not describe individual gene products • e.g. cytochrome c is not in the GO but oxidoreductase activity is • Does not describe mutants or diseases, e.g. oncogenesis. • Does not include sequence attributes, e.g., exons, introns, protein domains. • Is not a database of sequences.

What is the Gene Ontology? • assigns functions to gene products at different levels, depending on how much is known about a gene product • is used for a diverse range of species • structured to be queried at different levels, eg: • find all the chicken gene products in the genome that are involved in signal transduction • zoom in on all the receptor tyrosine kinases • human readable GO function has a digital tag to allow computational analysis of large datasets • “a controlled vocabulary that can be applied to all organisms even as knowledge of gene and protein roles in cells is accumulating and changing”

Ontologies • relationships between terms • digital identifier (for computers) • description (for humans)
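A minimal sketch of what “relationships between terms” buys you: each term has an identifier and a name, and following parent links lets a query at a general level pick up the more specific terms beneath it. The three terms are real GO terms, but the parent links are written by hand for illustration and are an assumption, not taken from a GO release.

```python
# Toy ontology fragment: identifier (for computers), name (for humans),
# and parent links (relationships between terms).
# The parent links are illustrative assumptions.
TERMS = {
    "GO:0008150": {"name": "biological process", "parents": []},
    "GO:0008610": {"name": "lipid biosynthetic process", "parents": ["GO:0008150"]},
    "GO:0006633": {"name": "fatty acid biosynthetic process", "parents": ["GO:0008610"]},
}


def ancestors(term_id, terms=TERMS):
    """Return the set of all terms reachable by following parent links."""
    seen = set()
    stack = list(terms[term_id]["parents"])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(terms[current]["parents"])
    return seen


def is_under(term_id, query_id, terms=TERMS):
    """True if term_id is the query term itself or one of its descendants."""
    return term_id == query_id or query_id in ancestors(term_id, terms)
```

With this, a gene product annotated to the specific term is also found by a query for the general term, which is how “zooming in” and “zooming out” work.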

A GO Annotation example

A GO Annotation Example: NDUFAB1 (UniProt P52505), Bovine NADH dehydrogenase (ubiquinone) 1, alpha/beta subcomplex, 1, 8kDa
Biological Process (BP or P) • GO:0006633 fatty acid biosynthetic process (TAS) • GO:0006120 mitochondrial electron transport, NADH to ubiquinone (TAS) • GO:0008610 lipid biosynthetic process (IEA)
Molecular Function (MF or F) • GO:0005504 fatty acid binding (IDA) • GO:0008137 NADH dehydrogenase (ubiquinone) activity (TAS) • GO:0016491 oxidoreductase activity (TAS) • GO:0000036 acyl carrier activity (IEA)
Cellular Component (CC or C) • GO:0005759 mitochondrial matrix (IDA) • GO:0005747 mitochondrial respiratory chain complex I (IDA) • GO:0005739 mitochondrion (IEA)

A GO Annotation Example: NDUFAB1 (UniProt P52505), Bovine NADH dehydrogenase (ubiquinone) 1, alpha/beta subcomplex, 1, 8kDa
Each annotation consists of: • GO:ID (unique) • GO term name • GO evidence code • aspect or ontology
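The four parts of an annotation (GO:ID, term name, aspect, evidence code) map naturally onto a small record type. The sketch below holds the NDUFAB1 annotations from the example; the `Annotation` type and `by_aspect` helper are names invented for this illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Annotation(NamedTuple):
    go_id: str     # unique GO identifier
    term: str      # GO term name
    aspect: str    # P, F or C
    evidence: str  # GO evidence code


# NDUFAB1 (UniProt P52505) annotations as listed in the example.
NDUFAB1 = [
    Annotation("GO:0006633", "fatty acid biosynthetic process", "P", "TAS"),
    Annotation("GO:0006120", "mitochondrial electron transport, NADH to ubiquinone", "P", "TAS"),
    Annotation("GO:0008610", "lipid biosynthetic process", "P", "IEA"),
    Annotation("GO:0005504", "fatty acid binding", "F", "IDA"),
    Annotation("GO:0008137", "NADH dehydrogenase (ubiquinone) activity", "F", "TAS"),
    Annotation("GO:0016491", "oxidoreductase activity", "F", "TAS"),
    Annotation("GO:0000036", "acyl carrier activity", "F", "IEA"),
    Annotation("GO:0005759", "mitochondrial matrix", "C", "IDA"),
    Annotation("GO:0005747", "mitochondrial respiratory chain complex I", "C", "IDA"),
    Annotation("GO:0005739", "mitochondrion", "C", "IEA"),
]


def by_aspect(annotations):
    """Group annotations by ontology aspect (P, F, C), keeping input order."""
    grouped = defaultdict(list)
    for annotation in annotations:
        grouped[annotation.aspect].append(annotation)
    return dict(grouped)
```

For example, `by_aspect(NDUFAB1)["C"]` gives the three cellular component annotations.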

GO Evidence Codes & Making Annotations

Why record GO evidence code? • GO did not initially record evidence for functional assertion: • NR: Not Recorded • “inferred from…” • deduce or conclude (information) from evidence and reasoning • provides information about the support for associating a gene product with a function • different experiments allow us to draw different conclusions • reliability

Types of GO Evidence Codes • Experimental Evidence Codes • Computational Analysis Evidence Codes • Author Statement Evidence Codes • Curator Statement Evidence Codes • Automatically-assigned Evidence Codes • Obsolete Evidence Codes

Guide to GO Evidence Codes: http://www.geneontology.org/GO.evidence.shtml
GO EVIDENCE CODES
Direct Evidence Codes • IDA - inferred from direct assay • IEP - inferred from expression pattern • IGI - inferred from genetic interaction • IMP - inferred from mutant phenotype • IPI - inferred from physical interaction
Indirect Evidence Codes, inferred from literature • IGC - inferred from genomic context • TAS - traceable author statement • NAS - non-traceable author statement • IC - inferred by curator
Indirect Evidence Codes, inferred by sequence analysis • RCA - inferred from reviewed computational analysis • IS* - inferred from sequence* • IEA - inferred from electronic annotation
* IS* covers: • ISS - inferred from sequence or structural similarity • ISA - inferred from sequence alignment • ISO - inferred from sequence orthology • ISM - inferred from sequence model
Other • NR - not recorded (historical) • ND - no biological data available
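When working with a dataset it is often useful to count annotations by type of evidence. The sketch below assigns the codes listed above to the categories named on the “Types of GO Evidence Codes” slide; the assignment follows the GO evidence code guide as understood at the time of writing and should be treated as an assumption to check against the current guide.

```python
from collections import Counter

# Code -> category lookup; the grouping is an assumption to verify
# against the current GO evidence code documentation.
EVIDENCE_CATEGORY = {
    "IDA": "Experimental", "IEP": "Experimental", "IGI": "Experimental",
    "IMP": "Experimental", "IPI": "Experimental",
    "ISS": "Computational Analysis", "ISO": "Computational Analysis",
    "ISA": "Computational Analysis", "ISM": "Computational Analysis",
    "IGC": "Computational Analysis", "RCA": "Computational Analysis",
    "TAS": "Author Statement", "NAS": "Author Statement",
    "IC": "Curator Statement", "ND": "Curator Statement",
    "IEA": "Automatically-assigned",
    "NR": "Obsolete",
}


def category(code):
    """Return the evidence category for a code, or 'Unknown' if not listed."""
    return EVIDENCE_CATEGORY.get(code, "Unknown")


def count_by_category(codes):
    """Count how many annotations fall into each evidence category."""
    return dict(Counter(category(code) for code in codes))
```

Applied to the evidence codes of a whole annotation set, this gives a quick picture of how much of the annotation is experimental and how much is electronic.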

GO Mapping Example (NDUFAB1) • Biocuration of literature • detailed function • “depth” • slower (manual)

Biocuration of Literature: detailed gene function • Find a paper about the protein (example: P05147, PMID: 2976880) • Read the paper to get experimental evidence of function • Use the most specific term possible • The experiment assayed kinase activity: use the IDA evidence code

GO Mapping Example (NDUFAB1) • Biocuration of literature • detailed function • “depth” • slower (manual) • Sequence analysis • rapid (computational) • “breadth” of coverage • less detailed • uses the IS* codes: ISS - inferred from sequence or structural similarity, ISA - inferred from sequence alignment, ISO - inferred from sequence orthology, ISM - inferred from sequence model

Computational Analysis Evidence In the beginning: • IGC: Inferred from Genomic Context • e.g. operons • RCA: inferred from Reviewed Computational Analysis • computational analyses that integrate datasets of several types • ISS: Inferred from Sequence or Structural Similarity

Computational Analysis Evidence • Then different types of sequence analysis were added: • ISS: Inferred from Sequence or Structural Similarity • ISO: Inferred from Sequence Orthology • ISA: Inferred from Sequence Alignment • ISM: Inferred from Sequence Model

Computational Analysis Evidence • Phylogenetic analysis codes added: • IBA: Inferred from Biological aspect of Ancestor • IBD: Inferred from Biological aspect of Descendant • IKR: Inferred from Key Residues • characterized by the loss of key sequence residues - implies a NOT annotation • IRD: Inferred from Rapid Divergence • characterized by rapid divergence from ancestral sequence – implies a NOT annotation

Unknown Function vs No GO • ND – no data • Biocurators have tried to add GO but there is no functional data available • Previously: “process_unknown”, “function_unknown”, “component_unknown” • Now: “biological process”, “molecular function”, “cellular component” • No annotations (including no “ND”): biocurators have not annotated • this is important for your dataset: what % has GO?
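The distinction above matters when you ask what percentage of your dataset has GO. A minimal sketch, assuming you already have a mapping from gene product ID to the list of evidence codes on its annotations (the function name and input shape are invented for this example):

```python
def go_coverage(gene_products, evidence_by_product):
    """Split a dataset into annotated, ND-only and unannotated percentages.

    gene_products: list of gene product IDs in the dataset.
    evidence_by_product: dict mapping ID -> list of evidence codes.
    """
    counts = {"annotated": 0, "nd_only": 0, "unannotated": 0}
    for product in gene_products:
        codes = evidence_by_product.get(product, [])
        if not codes:
            # No annotation at all: biocurators have not looked at it yet.
            counts["unannotated"] += 1
        elif all(code == "ND" for code in codes):
            # Curated, but no functional data was available.
            counts["nd_only"] += 1
        else:
            counts["annotated"] += 1
    total = len(gene_products)
    if total == 0:
        return {key: 0.0 for key in counts}
    return {key: 100.0 * value / total for key, value in counts.items()}
```

Reporting the three numbers separately keeps “looked for and found nothing” apart from “never looked”.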

Multiple Annotations: gene association files

The gene association (ga) file • standard file format used to capture GO annotation data • tab-delimited file containing 17* fields of information: • Information about the gene product (database, accession, name, symbol, synonyms, species) • information about the function: • GO ID, ontology, reference, evidence, qualifiers, context (with/from) • data about the functional annotation • date, annotator * GO Annotation File Format 2.0 has two additional columns compared to GAF 1.0: annotation extension (column 16) and gene product form ID (column 17).

http://www.geneontology.org/GO.format.gaf-2_0.shtml

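Figure: example gene association file (an additional column was added to this example), with the columns grouped as: • gene product information • function information • metadata: when & who
Callouts on the figure: • with/from: used to give more specific information about the evidence code (not always displayed) • qualifier: used to qualify the annotation (not always displayed)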
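To make the file layout concrete, here is a minimal sketch of reading one line of a GAF 2.0 file into named fields. The column order follows the GAF 2.0 specification linked above; the short field names are chosen for this example, and the reference, date and assigned-by values in the example line are placeholders, not real annotation data.

```python
# The 17 GAF 2.0 columns, in file order (short names are this example's own).
GAF_COLUMNS = [
    "db", "db_object_id", "db_object_symbol", "qualifier", "go_id",
    "db_reference", "evidence_code", "with_from", "aspect",
    "db_object_name", "db_object_synonym", "db_object_type", "taxon",
    "date", "assigned_by", "annotation_extension", "gene_product_form_id",
]


def parse_gaf_line(line):
    """Parse one GAF line into a dict; return None for '!' comment lines."""
    if line.startswith("!"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 15:
        raise ValueError(f"expected at least 15 columns, got {len(fields)}")
    # GAF 1.0 lines have 15 columns; pad so both versions give 17 keys.
    fields += [""] * (len(GAF_COLUMNS) - len(fields))
    return dict(zip(GAF_COLUMNS, fields))


# Example line; reference, date and assigned_by are placeholder values.
EXAMPLE_LINE = "\t".join([
    "UniProtKB", "P52505", "NDUFAB1", "", "GO:0005759",
    "PMID:0000000", "IDA", "", "C",
    "NADH dehydrogenase (ubiquinone) 1, alpha/beta subcomplex, 1, 8kDa",
    "", "protein", "taxon:9913", "20100101", "AgBase", "", "",
])
```

Calling `parse_gaf_line(EXAMPLE_LINE)` returns a dict in which, for instance, the `go_id` and `evidence_code` keys hold the GO ID and evidence code of that annotation.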

Gene association files • GO Consortium ga files • many organism specific files • also includes EBI GOA files • EBI GOA ga files • UniProt file contains GO annotation for all species represented in UniProtKB • AgBase ga files • organism specific files • AgBase GOC file – submitted to GO Consortium & EBI GOA • AgBase Community file – GO annotations not yet submitted or not supported / annotations provided by researchers • all files are quality checked

http://www.geneontology.org

http://www.ebi.ac.uk/GOA/

http://www.agbase.msstate.edu/

Sources of GO • Primary sources of GO: from the GO Consortium (GOC) & GOC members • most up to date • most comprehensive • Secondary sources: other resources that use GO provided by GOC members • public databases (eg. NCBI, UniProtKB) • genome browsers (eg. Ensembl) • array vendors (eg. Affymetrix) • GO expression analysis tools

Sources of GO annotation • Different tools and databases display the GO annotations differently. • GO terms are continually changing and GO annotations are continually added, so you need to know when the GO annotations you are using were last updated.
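If the annotations come as a gene association file, one rough way to see how current they are is to look at the most recent annotation date in the file. The sketch below assumes the date is in column 14 in YYYYMMDD form, as in the GAF format; it reports when the newest annotation was made, which is not necessarily when the provider last refreshed the file.

```python
from datetime import datetime


def latest_annotation_date(gaf_lines):
    """Return the most recent annotation date in the lines, or None."""
    latest = None
    for line in gaf_lines:
        if line.startswith("!") or not line.strip():
            continue  # skip comment and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 14:
            continue  # not a complete annotation line
        try:
            current = datetime.strptime(fields[13], "%Y%m%d").date()
        except ValueError:
            continue  # unparseable date; ignore this line
        if latest is None or current > latest:
            latest = current
    return latest
```

The function accepts any iterable of lines, so an open file object can be passed directly.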

Secondary Sources of GO annotation • EXAMPLES: • public databases (eg. NCBI, UniProtKB) • genome browsers (eg. Ensembl) • array vendors (eg. Affymetrix) • CONSIDERATIONS: • What is the original source? • When was it last updated? • Are evidence codes displayed?
