
Using ontologies for text processing




Presentation Transcript


  1. Using ontologies for text processing

  2. Overview
     • Thesis: Ontologies (or even more elaborate knowledge bases) are required to solve the lexical ambiguity problem
     • Describe the lexical ambiguity problem and its central importance in natural language processing
     • Demonstrate how GO, combined with Direct Memory Access Parsing, provides a simple solution to some instances of this problem
     • Argue that no alternative is likely to work as well

  3. Lexical Ambiguity
     • A word (character string) means different things in different contexts
     • How can a program disambiguate (tell which is meant)?
     • Widespread problem even in “simple” bioNLP
       • DNA vs. mRNA vs. protein [Hatzivassiloglou et al. 2001]
       • Gene symbol vs. non-gene acronym [Pustejovsky et al. 2001], [Chang et al. 2002], [Liu and Friedman 2003], [Schwartz and Hearst 2003]
       • Gene/product vs. any other noun [Tanabe and Wilbur 2002]

  4. A particular example
     • “Hunk” can be a
       • Cell type: human natural killer
       • Gene: hormonally upregulated Neu-associated kinase
       • Medical abbreviation: radiographic/orthopedic joint classification system
       • Non-technical English: a large lump, piece, or portion
     All occur in Medline documents (e.g. “hunk of metal” in an article on ambulance design)

  5. How do ontologies help?
     • The idea that knowledge is relevant to understanding words in context is controversial only among linguists, but…
     • The Direct Memory Access Parsing (DMAP) technique [Martin, 1991] [Fitzgerald, 2000] demonstrates the power of a knowledge-based method for disambiguation
     • GO and similar efforts make DMAP (or other knowledge-based methods) practical today

  6. What is DMAP?
     • Conceptual parser
       • Maps from text to conceptual representations organized in packaging and abstraction hierarchies (like GO)
       • In contrast to: pure syntactic parsers, pattern matching and machine learning systems
     • Conceptual representations include lexical patterns that specify how to recognize the concept in text
       • Patterns consist of text literals and/or references to other concepts
       • Organized around concepts, not words; no independent lexicon
     • Recognition creates expectations for related concepts

  7. A real example
     “…Hunk expression is restricted to subsets of cells…” [Gardner et al. 2000]

     ID: cell-type-HUNK
     IS-A: cell-type
     lex: human natural killer; HUNK

     ID: gene-26559
     IS-A: gene
     lex: hormonally upregulated Neu-associated kinase

     ID: GO-0006350
     lex: transcription; expression

     ID: gene-expression
     slots: expressed-item: gene; mechanism: expression
     lex: (gene) (expression)

     RESULTS: HUNK hormonally upregulated neu tumor-associated kinase

  8. DMAP output with and without context
     Hunk alone: ambiguous
       (parse '(Hunk))
       e-gene-26559                     begin: 1 end: 1
       e-cell-type-HUNK                 begin: 1 end: 1
     Hunk expression: not ambiguous
       (parse '(Hunk expression))
       c-gene-expression-1              begin: 1 end: 2
         expressed-item: e-gene-26559   begin: 1 end: 1
         mechanism: GO:0006350          begin: 2 end: 2
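
The contrast on slide 8 can be imitated with a very small sketch. This is not the DMAP implementation (real DMAP is incremental and expectation-driven); it is a toy, assuming a hand-written knowledge base in which `ENTITIES`, `COMPOSITES`, `candidates`, and `parse` are all hypothetical names and only the concept IDs come from the slides. It shows how a composite concept whose slots are typed can select one reading of an ambiguous token.

```python
# Toy knowledge base. Concept IDs follow the slides; the dictionary
# layout and the lowercase one-word "lex" entries are assumptions.
ENTITIES = {
    "e-gene-26559": {"isa": "gene", "lex": ["hunk"]},
    "e-cell-type-HUNK": {"isa": "cell-type", "lex": ["hunk"]},
    "GO:0006350": {"isa": "expression", "lex": ["expression", "transcription"]},
}

# A composite concept is a sequence of (slot name, required concept type).
COMPOSITES = {
    "c-gene-expression": [("expressed-item", "gene"), ("mechanism", "expression")],
}


def candidates(token):
    """Return every entity concept whose lexical patterns include the token."""
    return [cid for cid, c in ENTITIES.items() if token.lower() in c["lex"]]


def parse(tokens):
    """Prefer a composite concept spanning all tokens; otherwise list
    every entity reading of every token."""
    for cid, slots in COMPOSITES.items():
        if len(slots) != len(tokens):
            continue
        fillers = {}
        for (slot, required_type), token in zip(slots, tokens):
            matches = [e for e in candidates(token)
                       if ENTITIES[e]["isa"] == required_type]
            if len(matches) != 1:
                break  # slot unfilled or still ambiguous
            fillers[slot] = matches[0]
        else:
            return [(cid, fillers)]
    return [(e, {}) for token in tokens for e in candidates(token)]


print(parse(["Hunk"]))                # two readings: gene and cell type
print(parse(["Hunk", "expression"]))  # one reading: gene expression
```

The slot type on `expressed-item` does the work: the cell-type reading of "Hunk" cannot fill a slot that requires a gene, so only the gene reading survives.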

  9. DMAP can handle much more complex constructions
     “Hunk is expressed in mouse epithelial cells during cell proliferation.”
       c-localized-gene-expression
         expressed-item: e-gene-26559
         mechanism: GO:0006350
         where: c-epithelial-cell
           taxon: ncbi_10090
         when: GO:0008283
     But this uses our enriched knowledge base, not just GO

  10. Even just DMAP/GO is a big win
      • Recall: 7,042 ambiguous symbols for 9,723 genes
      • Straightforward to disambiguate symbols that map to 2 or more genes when:
        • Each ambiguous gene referent has GO annotations, and
        • There is no overlap between the annotations for the genes
      • 3,333 of the symbols (for 4,715 of the genes) have this feature: nearly half the problem is solved!
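
The two conditions on slide 10 amount to a set-disjointness test. The sketch below is one way to express it, assuming annotations are available as a mapping from gene to a set of GO term IDs; the function names, symbols, and gene names (`SYM1`, `geneA`, and so on) are invented for illustration and are not data from the talk.

```python
def pairwise_disjoint(sets):
    """True if no element occurs in more than one of the given sets."""
    seen = set()
    for s in sets:
        if seen & s:
            return False
        seen |= s
    return True


def go_disambiguable(symbol_to_genes, gene_to_go):
    """Symbols mapping to 2+ genes where every gene has GO annotations
    and no annotation is shared between the genes."""
    result = set()
    for symbol, genes in symbol_to_genes.items():
        if len(genes) < 2:
            continue
        annotations = [gene_to_go.get(g, set()) for g in genes]
        if all(annotations) and pairwise_disjoint(annotations):
            result.add(symbol)
    return result


def resolve(symbol, context_terms, symbol_to_genes, gene_to_go):
    """Pick the one gene whose annotations intersect the GO terms
    recognized in the surrounding text, or None if that fails."""
    hits = [g for g in symbol_to_genes[symbol]
            if gene_to_go.get(g, set()) & context_terms]
    return hits[0] if len(hits) == 1 else None


# Invented toy data (hypothetical symbols and genes).
symbol_to_genes = {
    "SYM1": ["geneA", "geneB"],   # disjoint annotations
    "SYM2": ["geneC", "geneD"],   # overlapping annotations
    "SYM3": ["geneE", "geneF"],   # geneF has no annotations
}
gene_to_go = {
    "geneA": {"GO:0006350"},
    "geneB": {"GO:0008283"},
    "geneC": {"GO:0016477"},
    "geneD": {"GO:0016477", "GO:0008283"},
    "geneE": {"GO:0006350"},
}

print(go_disambiguable(symbol_to_genes, gene_to_go))
print(resolve("SYM1", {"GO:0008283"}, symbol_to_genes, gene_to_go))

# Proportions behind "nearly half", using the counts on the slide.
print(round(3333 / 7042, 3), round(4715 / 9723, 3))
```

With the slide's counts, the fractions come to roughly 47% of the symbols and 48% of the genes.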

  11. Compare the alternatives
      • Statistical or machine learning approaches
        • Must avoid being fooled by the word “cells” in the example
        • Scalability: need statistics for many covariates of every ambiguous word; doesn’t exploit the abstraction hierarchy
      • Full syntactic parse doesn’t disambiguate at all!
      • Cascaded FSTs, pattern-matching, etc.
        • Where is the source of knowledge for these?
      • Much DMAP lexical information can be taken directly from GO (and LocusLink, etc.)

  12. Acknowledgments
      • Philip V. Ogren
      • Daniel J. McGoldrick
      • Christoffer S. Crosby
      • Jens Eberlein
      • George K. Acquaah-Mensah
      • I/NET’s (http://inetmi.com) CM / CMP software
      • Support from Wyeth Genetics Institute, NIAAA
      http://compbio.uchsc.edu

  13. Biognosticopoea representation of the hunk gene

  14. Attachment ambiguity
      • “These findings suggest that FAK functions in the regulation of cell migration and cell proliferation.” (Gilmore and Romer 1996:1209)
      • What does FAK do?
      • ALMOST RIGHT:
        • FAK functions in the regulation of cell migration
        • FAK functions in cell proliferation
      • RIGHT:
        • FAK functions in the regulation of cell migration
        • FAK functions in the regulation of cell proliferation

  15. Attachment ambiguity
      GO-0016477
        isA go-process
        lex: cell migration
      GO-0008283
        isA go-process
        lex: cell proliferation
      GO-0042127
        isA go-process
        lex: regulation of cell proliferation
             regulation of ((go-process) and)* cell proliferation
      GO-0030334
        lex: regulation of cell migration
             regulation of ((go-process) and)* cell migration

  16. Attachment ambiguity
      (parse '(These findings suggest that FAK functions in the regulation of cell migration and cell proliferation))
      GO:0030334   begin: 9 end: 12
      GO:0042127   begin: 9 end: 15
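
The pattern `regulation of ((go-process) and)* cell proliferation` from slide 15 can be sketched as a small token matcher. This is an illustration under assumptions, not the DMAP matcher: `GO_PROCESSES`, `REGULATION_OF`, and `find_regulation` are hypothetical names, and the lexicon holds only the two processes named on the slides. Positions are reported 1-based and inclusive, as in the parse output.

```python
import re

# Lexical patterns for the go-process concepts on slide 15.
GO_PROCESSES = {
    "GO:0016477": ["cell", "migration"],
    "GO:0008283": ["cell", "proliferation"],
}

# Each "regulation of X" concept, keyed to the process it regulates.
REGULATION_OF = {
    "GO:0030334": "GO:0016477",
    "GO:0042127": "GO:0008283",
}


def find_regulation(sentence):
    """Match 'regulation of ((go-process) and)* <target process>'.

    Returns (concept id, begin, end) triples with 1-based inclusive
    token positions.
    """
    tokens = re.findall(r"[a-z]+", sentence.lower())
    hits = []
    for start in range(len(tokens) - 1):
        if tokens[start:start + 2] != ["regulation", "of"]:
            continue
        for reg_id, target_id in REGULATION_OF.items():
            target = GO_PROCESSES[target_id]
            pos = start + 2
            while True:
                if tokens[pos:pos + len(target)] == target:
                    hits.append((reg_id, start + 1, pos + len(target)))
                    break
                # Otherwise try to skip one "(go-process) and" group.
                for lex in GO_PROCESSES.values():
                    n = len(lex)
                    if (tokens[pos:pos + n] == lex
                            and tokens[pos + n:pos + n + 1] == ["and"]):
                        pos += n + 1
                        break
                else:
                    break  # nothing to skip: this pattern does not match
    return hits


sentence = ("These findings suggest that FAK functions in the regulation "
            "of cell migration and cell proliferation")
for hit in find_regulation(sentence):
    print(hit)
```

Because the optional `(go-process) and` group lets the second pattern reach across "cell migration and", both regulation concepts start at "regulation", which is the RIGHT reading from slide 14.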

  17. What do we have so far?
      • Gene Ontology
      • UMLS
      • MeSH
      • …

  18. What more do we need?
      • Family
      • Location
        • Macroanatomical
        • Subcellular localization
      • Structure
      • Function
      • Disease associations
      • Protein/protein interactions
      • …

  19. Where can we get it?
      • GO definitions
      • UMLS definitions
      • MeSH notes
      • Biomedical literature

  20. If you don’t like DMAP…
      • Full syntactic parse first
      • Cascaded FSTs
      • “A little syntax, a little semantics”
      • Machine learning
      • Pattern-matching
      All can benefit from an ontology/KB
