Literature Data Mining and Protein Ontology Development at the Protein Information Resource (PIR)

Hu ZZ (1), Mani I (2), Liu H (3), Vijay-Shanker K (4), Hermoso V (1), Nikolskaya A (1), Natale DA (1), and Wu CH (1)

(1) Protein Information Resource, Georgetown University Medical Center, 3900 Reservoir Road, NW, Washington, DC 20057
(2) Georgetown University, 37th and O Streets, NW, Washington, DC 20057
(3) University of Maryland at Baltimore County, Baltimore, MD 21250
(4) Department of Computer and Information Sciences, University of Delaware, Newark, DE 19716

ABSTRACT

An integrated protein literature mining resource, iProLINK, has been developed at PIR to provide data sources for Natural Language Processing (NLP) research on bibliography mapping, annotation extraction, protein named-entity recognition, and protein ontology development. A rule-based text-mining system, RLIMS-P, is used to extract protein phosphorylation information from MEDLINE abstracts to assist database annotation. An online BioThesaurus has been developed for protein/gene name mapping and to assist with protein named-entity recognition. A protein ontology based on the PIRSF family classification has been developed to complement other ontologies.

INTRODUCTION

• As the volume of scientific literature grows rapidly, literature data mining becomes increasingly critical to facilitate genome/proteome annotation and to improve the quality of biological databases. Annotations derived from experimentally verified data in the literature are of special value to UniProtKB (the UniProt Knowledgebase). One objective of UniProtKB is to provide accurate, consistent, and rich annotation of protein sequence and function. Relevant to this goal are literature-based curation and the development and adoption of ontologies and controlled vocabularies.
• Literature-Based Curation: Extract Reliable Information from Literature
• Protein properties: protein function, domains and sites, developmental stages, catalytic activity, binding and modified residues, regulation, induction, pathways, tissue specificity, subcellular location, quaternary structure…
• This ensures high-quality, accurate, and up-to-date experimental data for each protein, but it is a major bottleneck.
• Ontologies/Controlled Vocabularies: For Information Integration and Knowledge Management
• UniProtKB entries will be annotated using widely accepted biological ontologies and other controlled vocabularies, e.g. Gene Ontology (GO) and EC nomenclature.
• The Protein Information Resource has been collaborating with several NLP research groups to develop text-mining methodologies for extracting information from biological literature and to develop a protein ontology.

PIR: Integrated Protein Informatics Resource for Genomic/Proteomic Research (http://pir.georgetown.edu)

UniProt: Central international database of protein sequence and function (http://www.uniprot.org)

iProLINK: An integrated protein resource for literature mining

http://pir.georgetown.edu/iprolink/

1. Bibliography mapping - UniProt mapped citations
2. Annotation extraction - annotation tagged literature
3. Protein entity recognition - dictionary, tagged literature
4. Protein ontology development - PIRSF-based ontology

Resources provided:
• Testing and benchmarking dataset
• RLIMS-P text mining tool
• Protein dictionaries
• Name tagging guideline
• Protein ontology

RLIMS-P

Rule-based LIterature Mining System for Protein Phosphorylation (Bioinformatics. 2005 Jun 1;21(11):2759-65)

• Protein phosphorylation annotation extraction
• Manual tagging assisted with computational extraction
• Training sets of positive and negative samples

[Figure: RLIMS-P processing pipeline. Labels: Abstracts, Full-Length Texts, Preprocessing, Sentence extraction, Part of speech tagging, Acronym detection, Term recognition, Entity Recognition, Phrase Detection, Noun and verb group detection, Other syntactic structure detection, Semantic Type Classification, Relation Identification, Nominal level relation, Verbal level relation, Post-Processing, Extracted Annotations, Tagged Abstracts.]

Pattern 1: <AGENT> <VG-active-phosphorylate> <THEME> (in/at <SITE>)?
Example: ATR/FRP-1 also phosphorylated p53 in Ser 15

Benchmarking of RLIMS-P: high recall for paper retrieval and high precision for information extraction.

Applications:
• UniProtKB site feature annotation
• Proteomics MS data analysis: protein identification

Online RLIMS-P text-mining tool (version 1.0)

http://pir.georgetown.edu/iprolink/rlimsp/

1. Search interface
2. Summary table with top hit of all sites
3. All sites and tagged text evidence

BioThesaurus

• Biological entity tagging
• Name mapping
• Database annotation
• Literature mining
• Gateway to other resources

[Figure: BioThesaurus construction. Labels: UniProt (UniProtKB, UniRef90/50, PIR-PSD), NCBI (Entrez Gene, RefSeq, GenPept), Genome (FlyBase, WormBase, MGD, SGD, RGD), Other (HUGO, EC, OMIM), UMLS, iProClass, Name Extraction, Raw Thesaurus, Name Filtering, Highly Ambiguous, Nonsensical Terms, Semantic Typing, UniProtKB Entries: Protein/Gene Names & Synonyms.]

[Table: BioThesaurus v1.0 (m = million). Liu et al., 2005, submitted.]

Web-based BioThesaurus

http://pir.georgetown.edu/iprolink/biothesaurus/

• Gene/protein name mapping
• Search synonyms
• Resolve name ambiguity
• Underlying ID mapping

[Figure: BioThesaurus report, UniProtKB entry P35625 (May 2005).]
Example 1. Name ambiguity of TIMP3
Example 2. Name ambiguity of CLIM1

Protein Name Tagging

• Tagging guideline versions 1.0 and 2.0
• Generation of domain expert-tagged corpora
• Inter-coder reliability (upper bound of machine tagging)
• Dictionary pre-tagging: F-measure 0.412 (precision 0.372, recall 0.462)
• Advantages of pre-tagging: it helps standardize tagging and its extent, reduces tagger fatigue, and improves inter-coder reliability.
• BioThesaurus for pre-tagging

PIRSF Protein Family Classification

• PIRSF: A network structure from superfamilies to subfamilies that reflects the evolutionary relationships of full-length proteins
• Definitions
• Basic unit = Homeomorphic Family
• Homeomorphic: Full-length similarity, common domain architecture
• Network Structure: Flexible number of levels with varying degrees of sequence conservation

PIRSF-Based Protein Ontology

• PIRSF family hierarchy based on evolutionary relationships
• Standardized PIRSF family names as hierarchical protein ontology
• DAG network structure for the PIRSF family classification system

PIRSF in DAG View

• DynGO viewer (Liu et al., 2005, submitted)
• DAG file: ftp://ftp.pir.georgetown.edu/pir_databases/pirsf/dagfiles/
• Superimpose GO and PIRSF hierarchies
• Bidirectional display (GO- or PIRSF-centric views)

[Figure: GO-centric view and PIRSF-centric view. Label: Estrogen receptor alpha (PIRSF50001).]

PIRSF to GO Mapping

• Mapped 5363 curated PIRSF homeomorphic families and subfamilies to the GO hierarchy
• 68% of the PIRSF families and subfamilies map to GO leaf nodes
• 2329 PIRSFs have shared GO leaf nodes

Protein Ontology Can Complement GO

• Complements GO: the PIRSF-based ontology can be used to analyze GO branches and concepts and to provide links between the GO sub-ontologies
• Two cases: analyze GO branches and concepts and identify missing GO nodes
• Case I. Nuclear receptor superfamily
• Case II. IGF-binding protein superfamily
• Expanding a Node: Identification of GO subtrees that need expansion if GO concepts are too broad
• IGFBP subfamilies
• High- vs. low-affinity binding for IGF between IGFBP and IGFBPrP

Exploration of Gene and Protein Ontology

• Systematic links between the three GO sub-ontologies based on shared annotations at different protein family levels, e.g., linking molecular function and biological process:
• estrogen receptor binding (molecular function) and
• estrogen receptor signaling pathway (biological process)

Summary

• The PIR iProLINK literature mining resource provides annotated data sets for NLP research on annotation extraction and protein ontology development.
• RLIMS-P is a text-mining tool for extracting protein phosphorylation information from PubMed literature. By coupling high recall for paper retrieval with high precision for information extraction, RLIMS-P can be applied to UniProtKB protein feature annotation.
• BioThesaurus can be used for name mapping and to resolve name synonymy and ambiguity.
• The PIRSF-based protein ontology can complement GO by identifying missing GO concepts/nodes, and it provides systematic links between the three GO sub-ontologies.

Acknowledgements

• Research Projects
• NIH: NHGRI/NIGMS/NLM/NIMH/NCRR/NIDCR (UniProt)
• NSF: SEIII (Entity Tagging)
• NSF: ITR (Ontology)
• Collaborators
• I. Mani from Georgetown University Department of Linguistics on protein name recognition and protein name ontology.
• H. Liu from University of Maryland Department of Information Systems on protein name recognition and text mining.
• K. Vijay-Shanker from University of Delaware Department of Computer and Information Sciences on text mining of protein phosphorylation features.
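As a rough illustration of the RLIMS-P rule "Pattern 1: <AGENT> <VG-active-phosphorylate> <THEME> (in/at <SITE>)?", the sketch below approximates it with a single regular expression and applies it to the poster's example sentence. It is a simplified stand-in and not the RLIMS-P implementation: the real system works on part-of-speech tags, noun and verb groups, and semantic types, whereas this regex treats agent and theme as single tokens. The names `PATTERN_1` and `extract_phosphorylation`, and the restriction of sites to Ser/Thr/Tyr residues, are assumptions made for illustration.

```python
import re

# Hypothetical, simplified approximation of RLIMS-P Pattern 1:
#   <AGENT> <VG-active-phosphorylate> <THEME> (in/at <SITE>)?
# Assumptions: agent and theme are single tokens; the verb group is an
# active form of "phosphorylate", optionally preceded by "also";
# a site is a Ser/Thr/Tyr residue followed by a position number.
PATTERN_1 = re.compile(
    r"(?P<agent>[\w/-]+)\s+"
    r"(?:also\s+)?phosphorylate[sd]?\s+"
    r"(?P<theme>[\w/-]+)"
    r"(?:\s+(?:in|at)\s+(?P<site>(?:Ser|Thr|Tyr)\s?\d+))?"
)


def extract_phosphorylation(sentence):
    """Return agent, theme and site (site may be None), or None if no match."""
    match = PATTERN_1.search(sentence)
    if match is None:
        return None
    return {
        "agent": match.group("agent"),
        "theme": match.group("theme"),
        "site": match.group("site"),
    }


if __name__ == "__main__":
    # Example sentence taken from the poster.
    print(extract_phosphorylation("ATR/FRP-1 also phosphorylated p53 in Ser 15"))
    # {'agent': 'ATR/FRP-1', 'theme': 'p53', 'site': 'Ser 15'}
```

The optional trailing group mirrors the `(in/at <SITE>)?` part of the pattern: a sentence that names a kinase and substrate but no residue still yields an agent and theme, with the site left empty.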

