Literature Mining and Ontology
BMI/IBGP 705 Winter, 2012
Yang Xiang, Ph.D. in Computer Science
yxiang@bmi.osu.edu
Department of Biomedical Informatics
The Ohio State University
Outline
• What is Literature Mining?
• Popular Tools for Literature Mining
• Basic Techniques
  • Information Retrieval (Indexing): Expediting searching
  • Linguistic Processing
  • Other Processing
• What is Ontology?
  • Simple ontology examples
  • Gene ontology
  • Unified Medical Language System
  • Ontology use and indexing
• Applications of Literature Mining and Ontology
What is Literature (Text) Mining?
• The purposes of literature mining
  • Find relevant documents
  • Discover knowledge (what is knowledge?)
    • e.g., opinion mining (sentiment analysis)
    • e.g., document similarity
• The advantages of computer-based literature mining
  • Simply put, computers can search many more documents!
  • Computers can ‘think’ and discover knowledge.
• We will focus on biomedical literature mining in what follows.
Why Is Literature Mining So Popular in Biomedical Science?
• Biomedical science studies natural subjects.
  • Species
  • Genes
  • Phenotypes
  • Diseases
  • …
Popular Tools for Biomedical Literature Mining – Document search
• Google
  • Google Scholar: http://scholar.google.com
• ISI Web of Knowledge
  • www.isiknowledge.com
• PubMed
  • www.ncbi.nlm.nih.gov/pubmed
• Scopus
  • www.scopus.com
Tools for Biomedical Literature Mining – Knowledge discovery
• The Gene Ontology
  • http://www.geneontology.org/
• Gene answer
  • www.geneanswers.com
Techniques Behind Literature Mining
• Interdisciplinary
  • Computer Science
    • Information retrieval
    • Data mining
    • Natural Language Processing
    • Machine learning
  • Library Science
  • Biomedical Science
  • Linguistics
    • Computational linguistics
  • Statistics
  • And more!
• Two main research areas (with some overlap)
  • Information Retrieval
  • Natural Language Processing
Basic Text Search Algorithm
• Example: find the string "world" in the text "… Hello, world …".
• Assume the text size is n.
• Assume the search string size is m.
• How can we design an efficient algorithm to find all matches in the text?
  • Brute-force algorithm: O(mn).
  • Boyer-Moore heuristics: O(mn) in the worst case, but fast in most cases for English text.
  • KMP (Knuth-Morris-Pratt) algorithm: O(m+n).
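To make the O(m+n) bound concrete, here is a minimal sketch of the KMP search in Python. The function names `build_failure` and `kmp_search` are chosen for this illustration and do not come from the slides.

```python
def build_failure(pattern):
    """failure[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it. Computed in O(m)."""
    failure = [0] * len(pattern)
    k = 0  # length of the currently matched prefix
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = failure[k - 1]  # fall back to a shorter matching prefix
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k
    return failure


def kmp_search(text, pattern):
    """Return the start index of every occurrence of pattern in text, in O(m + n)."""
    if not pattern:
        return []
    failure = build_failure(pattern)
    matches = []
    k = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = failure[k - 1]  # reuse earlier comparisons instead of rescanning text
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = failure[k - 1]  # keep going to find overlapping matches
    return matches


print(kmp_search("Hello, world", "world"))
```

The text pointer `i` never moves backward, which is what separates KMP from the brute-force scan.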
Information Retrieval (Indexing)
• Archiving (preprocessing) documents for fast search
• Preprocessing time
• Query time
• Index size
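One common way to preprocess documents for fast search is an inverted index, sketched below. The slides do not name a specific index structure, so this choice, the toy documents, and the names `build_index` and `search` are assumptions made for illustration.

```python
from collections import defaultdict


def build_index(docs):
    """Preprocessing step: map each word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            word = word.strip(".,;:!?")
            if word:
                index[word].add(doc_id)
    return index


def search(index, query):
    """Query step: return ids of documents containing every query word."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical toy collection.
docs = {
    1: "Gene1 interacts with Gene2.",
    2: "Coffee and health.",
    3: "Gene2 regulates coffee metabolism.",
}
index = build_index(docs)
print(search(index, "coffee gene2"))
```

The trade-off named on the slide shows up directly: building the index costs time and memory once, while each query becomes a few set lookups instead of a scan of every document.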
Programming language processing (C++, Java, etc.)
• Lexical analysis, e.g., of the statement y=x+10;
• Syntax analysis (parse tree):
  assignment
    identifier: y
    operator: =
    expression
      expression
        identifier: x
      +
      expression
        number: 10
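The lexical-analysis step can be sketched with a small regular-expression tokenizer. The token names and the `tokenize` function are assumptions for this illustration; real compilers use far richer token sets, and syntax analysis is not covered here.

```python
import re

# Hypothetical token categories, tried in this order.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENTIFIER", r"[A-Za-z_]\w*"),
    ("OPERATOR", r"[=+\-*/]"),
    ("SEMICOLON", r";"),
    ("SKIP", r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(source):
    """Split source text into (token_type, lexeme) pairs, dropping whitespace."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected character {source[pos]!r} at position {pos}")
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


for token in tokenize("y=x+10;"):
    print(token)
```

A parser would then consume this token stream to build the tree of assignment, operator, and expression nodes.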
Natural Language Processing
• Lexical level
  • Stemming (including lemmatizing): find the root of a word, e.g., swimming, swam, swim, swimmer → swim
    • Stemming rules may vary (a balance between overstemming and understemming)
    • Typical algorithm: the Porter stemming algorithm
  • Alias, synonym
• Grammatical level
  • Parsing, e.g., "…We find Gene1 interacts with Gene2…"
    Sentence
      Noun phrase: Gene1
      Verb phrase
        Verb: interact
        Noun phrase: Gene2
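The stemming idea can be illustrated with a toy suffix stripper. This is not the Porter algorithm: the suffix list, the irregular-form table, and the `stem` function are simplifying assumptions made only to show how rules map word variants to one root.

```python
# Hypothetical lookup for irregular forms (the "lemmatizing" part).
IRREGULAR = {"swam": "swim", "swum": "swim"}
# Hypothetical suffix rules, tried in order.
SUFFIXES = ("ing", "er", "s")


def stem(word):
    """Toy stemmer: irregular lookup first, then strip one known suffix."""
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    for suffix in SUFFIXES:
        # Require at least three letters to remain after stripping.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            root = word[: -len(suffix)]
            # Undo consonant doubling, e.g. "swimm" -> "swim".
            if len(root) >= 2 and root[-1] == root[-2] and root[-1] not in "aeiou":
                root = root[:-1]
            return root
    return word


for w in ["swimming", "swam", "swim", "swimmer"]:
    print(w, "->", stem(w))
```

The same rules turn "butter" into "but", a small example of the overstemming that rule design has to balance against understemming.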
Statistical and Data Mining Processing
• Statistical
  • Count the word frequency
  • Count the expression frequency
• Data Mining
  • Mining the set of frequent words
  • Association rule mining
• Applications
  • Document similarity
  • Automatic summarization
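Word-frequency counts feed directly into document similarity. The sketch below uses cosine similarity over raw word counts. The slides do not specify a similarity measure, so that choice and the names `word_counts` and `cosine_similarity` are assumptions.

```python
import math
from collections import Counter


def word_counts(text):
    """Count how often each whitespace-separated word occurs (case-insensitive)."""
    return Counter(text.lower().split())


def cosine_similarity(counts_a, counts_b):
    """Cosine of the angle between two word-count vectors, in [0, 1]."""
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical example sentences.
doc1 = word_counts("gene1 interacts with gene2")
doc2 = word_counts("gene2 interacts with gene3")
print(cosine_similarity(doc1, doc2))
```

In practice the counts are often computed after stemming and with common words removed, so that similarity reflects content rather than grammar.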
Document Classification
• E.g., classify all documents related to coffee and health.
• Various machine learning algorithms can be applied here.
• Figure: coffee-and-health documents split into documents showing benefits (e.g., cardioprotective, laxative, …) and documents showing risks (e.g., cholesterol, anxiety, …).
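As one possible instance of such a machine learning algorithm, here is a minimal multinomial Naive Bayes sketch with add-one smoothing. The slides do not name a specific classifier, so Naive Bayes is an illustrative choice, and the four training sentences are invented for this example.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Collect per-label word counts, per-label document counts, and the vocabulary."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocabulary = set()
    for text, label in labeled_docs:
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        label_counts[label] += 1
        vocabulary.update(tokens)
    return word_counts, label_counts, vocabulary


def classify(text, model):
    """Return the label with the highest log-probability under Naive Bayes."""
    word_counts, label_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    scores = {}
    for label, doc_count in label_counts.items():
        score = math.log(doc_count / total_docs)  # log prior
        label_total = sum(word_counts[label].values())
        for token in text.lower().split():
            if token in vocabulary:  # ignore words never seen in training
                # Add-one (Laplace) smoothing avoids log(0).
                score += math.log(
                    (word_counts[label][token] + 1) / (label_total + len(vocabulary))
                )
        scores[label] = score
    return max(scores, key=scores.get)


# Invented training data, for illustration only.
training = [
    ("coffee shows cardioprotective effect", "benefit"),
    ("coffee has laxative benefit", "benefit"),
    ("coffee raises cholesterol", "risk"),
    ("coffee linked to anxiety", "risk"),
]
model = train(training)
print(classify("cardioprotective effect of coffee", model))
```

A real classifier would be trained on many labeled abstracts and evaluated on held-out documents.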
Accuracy vs. Relevancy in Pattern Recognition/Machine Learning
• Precision = |{relevant docs} ∩ {retrieved docs}| / |{retrieved docs}|
• Recall = |{relevant docs} ∩ {retrieved docs}| / |{relevant docs}|
• Fall-out = |{nonrelevant docs} ∩ {retrieved docs}| / |{nonrelevant docs}|
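The three ratios translate directly into set operations. In this sketch the function name `retrieval_metrics` and the ten-document collection are hypothetical. The collection is needed to derive the nonrelevant set used by fall-out.

```python
def retrieval_metrics(relevant, retrieved, collection):
    """Return (precision, recall, fall_out) for one query.

    relevant   -- ids of documents that are truly relevant
    retrieved  -- ids of documents the system returned
    collection -- ids of all documents
    """
    relevant, retrieved, collection = set(relevant), set(retrieved), set(collection)
    nonrelevant = collection - relevant
    hits = relevant & retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    fall_out = len(nonrelevant & retrieved) / len(nonrelevant) if nonrelevant else 0.0
    return precision, recall, fall_out


# Hypothetical query over documents 1..10: four are relevant, three are retrieved.
precision, recall, fall_out = retrieval_metrics(
    relevant={1, 2, 3, 4}, retrieved={3, 4, 5}, collection=range(1, 11)
)
print(precision, recall, fall_out)
```

Here two of the three retrieved documents are relevant (precision 2/3), two of the four relevant documents were found (recall 1/2), and one of the six nonrelevant documents was returned (fall-out 1/6).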
Ontology
• In philosophy, ontology is a systematic account of existence.
• In information science, an ontology is a representation of concepts and their relationships, often as a directed graph.
Ontology Example: Scientific classification
• Kingdom: Animalia
• Phylum: Chordata, Hemichordata, …
• Class: Actinopterygii, Sarcopterygii, …
• Subclass: Neopterygii, Chondrostei
• Infraclass: Teleostei, …
• Order: Cypriniformes, …
• Family: Cyprinidae, …
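Ontology Example (Informal)
• Figure: an informal fish ontology. Node labels: fish; salt water, fresh water; Asian, Europe, North American, …; native, invasive; Common Carp, Crappie, Mirror Carp. (The edges between nodes are not recoverable.)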
Represent Ontology by Graphs
• Directed Acyclic Graph (DAG): most ontologies fall into this type.
• Directed tree
• Figure: nested classes of graphs, with trees inside DAGs inside directed graphs.
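The difference between a tree and a DAG can be checked mechanically. The sketch below stores an ontology as an adjacency dictionary, tests acyclicity with Kahn's topological sort, and applies a one-parent rule for trees. The example edges are an assumption, loosely modeled on terms from the Gene Ontology slide, with edges pointing from the general term to the specific one.

```python
from collections import deque


def indegrees(graph):
    """Number of incoming edges for every vertex in the graph."""
    degree = {v: 0 for v in graph}
    for targets in graph.values():
        for t in targets:
            degree[t] = degree.get(t, 0) + 1
    return degree


def is_dag(graph):
    """Kahn's algorithm: a directed graph is acyclic iff all vertices can be peeled off."""
    degree = indegrees(graph)
    queue = deque(v for v, d in degree.items() if d == 0)
    removed = 0
    while queue:
        v = queue.popleft()
        removed += 1
        for t in graph.get(v, ()):
            degree[t] -= 1
            if degree[t] == 0:
                queue.append(t)
    return removed == len(degree)


def is_tree(graph):
    """Directed tree: acyclic, exactly one root, every other vertex has one parent."""
    degree = indegrees(graph)
    roots = [v for v, d in degree.items() if d == 0]
    return is_dag(graph) and len(roots) == 1 and all(d <= 1 for d in degree.values())


# Hypothetical ontology fragment; "DNA helicase" has two parents.
ontology = {
    "molecular function": ["nucleic acid binding", "enzyme"],
    "nucleic acid binding": ["DNA binding"],
    "enzyme": ["helicase"],
    "DNA binding": ["DNA helicase"],
    "helicase": ["DNA helicase"],
    "DNA helicase": ["ATP-dependent DNA helicase"],
    "ATP-dependent DNA helicase": [],
}
print(is_dag(ontology), is_tree(ontology))
```

A term with two parents keeps the graph acyclic but breaks the tree property, which is why a DAG is the more common fit for ontologies.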
Gene Ontology (GO) Consortium
• Figure: example GO terms, including DNA metabolism, cell, and a molecular function hierarchy (nucleic acid binding, DNA binding, enzyme, helicase, DNA helicase, ATP-dependent DNA helicase, …).
Reference: Gene Ontology: tool for the unification of biology. Nature Genetics, 2000. http://dx.doi.org/10.1038/75556
Unified Medical Language System (UMLS)
• A compendium of controlled vocabularies in the biomedical sciences (since 1986). It contains:
  • Metathesaurus
  • Semantic Network
  • SPECIALIST Lexicon
• Maintained by the U.S. National Library of Medicine
• Website: http://www.nlm.nih.gov/research/umls/
UMLS - Metathesaurus
• Number of biomedical concepts: > 1 million
• Drawn from over 100 incorporated controlled source vocabularies:
  • ICD (International Statistical Classification of Diseases and Related Health Problems)
  • MeSH (Medical Subject Headings)
  • SNOMED CT (Systematized Nomenclature of Medicine – Clinical Terms)
  • LOINC (Logical Observation Identifiers Names and Codes)
  • Gene Ontology
  • OMIM (Online Mendelian Inheritance in Man)
  • …
• http://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/release/source_vocabularies.html
UMLS - Semantic Network
• Semantic types (categories)
  • Entity
    • Physical Object
      • Organism
      • …
    • …
  • Event
    • Activity
      • Behavior
      • …
    • …
• Semantic relationships (connecting two concepts)
  • isa
  • associated_with
    • physically_related_to
      • part_of
      • …
    • spatially_related_to
      • location_of
      • …
    • …
• Example figure: Drug A and Disease B are linked by treats / treated_by; Disease B and Gene A are linked by disease_is_marked_by_gene.
• http://www.nlm.nih.gov/research/umls/META3_current_semantic_types.html
• http://www.clres.com/semrels/umls_relation_list.html
Use of ontology systems
• Statistical
  • Gene ontology enrichment test
• Indexing
  • Reachability
  • Distance
  • Path
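A gene ontology enrichment test is often carried out as a one-sided hypergeometric test. The slides do not say which test is meant, so the sketch below is one common formulation under that assumption. The function name and the numbers in the example are hypothetical.

```python
from math import comb  # available in Python 3.8+


def enrichment_p_value(population, annotated, sample, hits):
    """P(X >= hits) for a hypergeometric X.

    population -- number of genes in the background set
    annotated  -- background genes annotated with the GO term
    sample     -- number of genes in the study list
    hits       -- study-list genes annotated with the GO term
    """
    total = comb(population, sample)
    upper = min(annotated, sample)
    favorable = sum(
        comb(annotated, i) * comb(population - annotated, sample - i)
        for i in range(hits, upper + 1)
    )
    return favorable / total


# Hypothetical numbers: 10 background genes, 5 carry the term,
# and all 5 genes in the study list carry it.
print(enrichment_p_value(population=10, annotated=5, sample=5, hits=5))
```

A small p-value suggests the term appears in the study list more often than chance would explain. With many GO terms tested at once, a multiple-testing correction is normally applied as well.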
Reachability
• The problem: given two vertices u and v in a directed graph G, is there a path from u to v?
• Example queries: Query(1, 11)? Yes. Query(3, 9)? No.
• Figure: example directed graph on vertices 1 to 15 (edges not recoverable).
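Without any index, a reachability query can be answered by a plain graph traversal, sketched below with depth-first search. The edges of the slide's figure did not survive, so `GRAPH` is a hypothetical 15-vertex graph chosen to agree with the two example answers on the slide.

```python
# Hypothetical directed graph on vertices 1..15 (edges are an assumption).
GRAPH = {
    1: [3, 4], 2: [4, 5], 3: [6, 7], 4: [7, 8], 5: [8, 9],
    6: [10], 7: [11], 8: [12], 9: [12], 10: [13],
    11: [13, 14], 12: [14], 13: [15], 14: [15], 15: [],
}


def reachable(graph, source, target):
    """Iterative depth-first search: is there a directed path from source to target?"""
    stack = [source]
    seen = {source}
    while stack:
        v = stack.pop()
        if v == target:
            return True
        for w in graph.get(v, ()):
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return False


print(reachable(GRAPH, 1, 11))
print(reachable(GRAPH, 3, 9))
```

Each query costs time proportional to the size of the graph in the worst case, which is the motivation for precomputed reachability indexes on very large graphs.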
Distance
• The problem: given two vertices u and v in a (directed) graph G, what is the distance from u to v?
• Example query: d_G(1, 11) = 3
• Figure: example directed graph on vertices 1 to 15 (edges not recoverable).
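For unweighted graphs, the distance query can be answered by breadth-first search, as in this sketch. The graph's edges are hypothetical (the figure's edges are not recoverable) and were chosen so that the distance from 1 to 11 is 3, as on the slide.

```python
from collections import deque

# Hypothetical directed graph on vertices 1..15 (edges are an assumption).
GRAPH = {
    1: [3, 4], 2: [4, 5], 3: [6, 7], 4: [7, 8], 5: [8, 9],
    6: [10], 7: [11], 8: [12], 9: [12], 10: [13],
    11: [13, 14], 12: [14], 13: [15], 14: [15], 15: [],
}


def distance(graph, source, target):
    """Number of edges on a shortest directed path, or None if target is unreachable."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        if v == target:
            return dist[v]
        for w in graph.get(v, ()):
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return None


print(distance(GRAPH, 1, 11))
```

Breadth-first search visits vertices in order of increasing distance, so the first time the target is dequeued its recorded distance is the shortest one.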
Path
• The problem: given two vertices u and v in a (directed) graph G, what is a path (or what are the paths) connecting u to v?
• Example query: find a path from 1 to 11.
• Figure: example directed graph on vertices 1 to 15 (edges not recoverable).
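A path query can reuse breadth-first search by remembering each vertex's predecessor and walking back from the target. The graph below is hypothetical, since the figure's edges are not recoverable, and `find_path` is a name chosen for this illustration.

```python
from collections import deque

# Hypothetical directed graph on vertices 1..15 (edges are an assumption).
GRAPH = {
    1: [3, 4], 2: [4, 5], 3: [6, 7], 4: [7, 8], 5: [8, 9],
    6: [10], 7: [11], 8: [12], 9: [12], 10: [13],
    11: [13, 14], 12: [14], 13: [15], 14: [15], 15: [],
}


def find_path(graph, source, target):
    """Return one shortest directed path as a list of vertices, or None if none exists."""
    parent = {source: None}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        if v == target:
            path = []
            while v is not None:  # walk predecessors back to the source
                path.append(v)
                v = parent[v]
            return path[::-1]
        for w in graph.get(v, ()):
            if w not in parent:
                parent[w] = v
                queue.append(w)
    return None


print(find_path(GRAPH, 1, 11))
```

This returns a single shortest path. Enumerating all paths between two vertices can be far more expensive, because the number of paths may grow exponentially with graph size.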
The estimated difficulty of building very efficient graph-database indexing schemes (based on current research)
References:
• R. Jin, Y. Xiang, N. Ruan, H. Wang, "Efficiently Answering Reachability Queries on Very Large Directed Graphs", Proc. of ACM SIGMOD Conference, Vancouver, June 9-12, 2008, pp. 595-608.
• R. Jin, Y. Xiang, N. Ruan, D. Fuhry, "3-HOP: A High-Compression Indexing Scheme for Reachability Query", Proc. of ACM SIGMOD Conference, Providence, Rhode Island, June 29-July 2, 2009, pp. 813-826.
Applications of Literature Mining and Ontology - I
• Build confirmed gene-phenotype relations
  • Human Phenotype Ontology (HPO)
  • Built from the Online Mendelian Inheritance in Man (OMIM) database.
  • http://human-phenotype-ontology.org/
Reference: Robinson PN, Mundlos S. The Human Phenotype Ontology. Clinical Genetics 77(6) 2010: 525–534. http://dx.doi.org/10.1111/j.1399-0004.2010.01436.x
Applications of Literature Mining and Ontology - II
• MetaMap program and CKC mining
  • MetaMap: maps biomedical text to the UMLS Metathesaurus.
  • A CKC (Conceptual Knowledge Construct) represents a path connecting several concepts in the UMLS.
  • Knowledge discovery using MetaMap and CKC mining.
• Figure labels: Literature, MetaMap, CKCs, bio-molecular, phenotypes.
References:
• Aronson, A.: Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. In: AMIA Symposium, p. 17 (2001)
• Payne, P., Borlawsky, T., Kwok, A., Greaves, A.: Supporting the design of translational clinical studies through the generation and verification of conceptual knowledge-anchored hypotheses. In: AMIA Annual Symposium Proceedings, p. 566 (2008)
Applications of Literature Mining and Ontology - III
• kDLS – Index the UMLS graph
  • "C1415882:IDS gene" – has_manifestation "C0004096:Asthma" – clinically_associated_with "C0002871:Anemia" – clinically_associated_with "C0023434:Chronic Lymphocytic Leukemia"
  • "C1415882:IDS gene" – has_manifestation "C0018802:Congestive heart failure" – clinically_associated_with "C0002871:Anemia" – clinically_associated_with "C0023434:Chronic Lymphocytic Leukemia"
Reference: Yang Xiang, Kewei Lu, Stephen L James, Tara B Borlawsky, Kun Huang, Philip RO Payne. k-Neighborhood decentralization: A comprehensive solution to index the UMLS for large scale knowledge discovery. Journal of Biomedical Informatics. In press (available online Dec 2011).