1 / 41

Research in the Verspoor Lab

Research in the Verspoor Lab. Text Mining. Information extraction from the biomedical literature Entity recognition and normalization Relation and event extraction Last time, I promised that we would look at: Ontologies as constraints for information extraction. Making BioNLP relevant.

edita
Download Presentation

Research in the Verspoor Lab

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Research in the Verspoor Lab

  2. Text Mining • Information extraction from the biomedical literature • Entity recognition and normalization • Relation and event extraction • Last time, I promised that we would look at: • Ontologies as constraints for information extraction

  3. Making BioNLP relevant • Recognition of OBO terms, relations • CRAFT corpus (first release later this year)

  4. OpenDMAP extracts typed relations from the literature • Concept recognition tool • Connect ontological terms to literature instances • Built on Protégé knowledge representation system • Language patterns associated with concepts and slots • Patterns can contain text literals, other concepts, constraints (conceptual or syntactic), ordering information, or outputs of other processing. • Linked to many text analysis engines via UIMA • Best performance in BioCreative II IPS task • >500,000 instances of three predicates (with arguments) extracted from Medline Abstracts • [Hunter, et al., 2008] http://bionlp.sourceforge.net

  5. freetext ontology patterns OpenDMAP extracted information OpenDMAP

  6. freetext ontology patterns OpenDMAP protein protein interaction: interactor1: cyclin E2 interactor2: cdk2 extracted information OpenDMAP Cyclin E2 interacts with Cdk2 in a functional kinase complex. <ontology> Protein protein interaction := [int1] interacts with [int2]

  7. PROTÉGÉ ONTOLOGY CLASS: protein protein interaction SLOT: interactor1 TYPE: molecule SLOT: interactor2 TYPE: molecule PATTERNS {c-interact} := [interactor1] interacts with [interactor2] {c-interact} := [interactor1] is bound by [interactor2] … OpenDMAP OpenDMAP

  8. BioCreative II Example • Some BioCreative patterns for interact {c-interact} := [interactor1]{w-is}{w-interact-verb1}{w-preposition} the? [interactor2]; {w-is} := is, are, was, were; {w-interact-verb1} := co-immunoprecipitate, co-immunoprecipitates, co-immunoprecipitated, co-localize, co-localizes, co-localized; {w-preposition} := among, between, by, of, with, to; • Matched text: PMID 16494873, SENT_ID 16494873_114 Upon precipitation of the SOX10 protein with anti-HA antibody, Western blot detection revealed expression of UBC9-V5 (25 kDa) in the sample (Fig. 1, line 6), indicating that {UBC9wasco-immunoprecipitatedwithSOX10}. INTERACTOR_1: UBC9 resolved to UniprotID: UBC9_RAT INTERACTOR_2: SOX10 resolved to UniProtID: SOX10_RAT {c-interact}:= [UBC9_RAT]interactor_1, [SOX10_RAT]interactor_2

  9. BioCreative Results • 359 full-text articles in the test set • 385 interaction assertions produced • Performance averaged per article (to avoid dominance of a few assertion-heavy articles) P = 0.39, R = 0.31,F = 0.29 • Best result in the evaluation! • F score 10% higher than next-scoring system • F score > 3 standard deviations above mean • Recall 20% higher than next-scoring system

  10. BioCreative conclusions • Information extraction in biomedical text is hard • Linguistic variability in how concepts are expressed • Complex concepts with multiple “slots” • OpenDMAP advances the state of the art • Use of an ontology grounds the search for information • Flexibility of the pattern language to incorporate constraints at different levels (conceptual, lexical, word order, linguistic)

  11. BioNLP’09: Methods Protein_transport := [TRANSPORTED-ENTITY] translocation @(from {DET}? [TRANSPORT-ORIGIN]) @(to {DET}? [TRANSPORT-DESTINATION]) Bax translocationto mitochondriafrom the cytosol Bax translocationfrom the cytosolto the mitochondria Slide credit: Kevin B. Cohen

  12. BioNLP’09: Methods Protein_transport := [TRANSPORTED-ENTITY] translocation @(from {DET}? [TRANSPORT-ORIGIN]) @(to {DET}? [TRANSPORT-DESTINATION]) Protein (Sequence Ontology) Cellular Component (Gene Ontology) Slide credit: Kevin B. Cohen

  13. BioNLP’09: Methods Slide credit: Kevin B. Cohen

  14. BioNLP’09: Methods • All event types represented as frames • Elements from ontology constrain every slot EVENT TYPE: REGULATION AtLoc: instance of biological_entity Cause: instance of protein CSite: instance of biological_concept or polypeptide_region Event_action: instance of trigger_word or detection_method Site: instance of biological_concept or polypeptide_region Theme: instance of protein or biological_process ToLoc: instance of biological_entity Sequence Ontology Molecular Interaction Ontology Gene Ontology Cell Cycle Ontology Slide credit: Kevin B. Cohen

  15. BioNLP’09: Methods Partial view of ontology—reality is a little bit less clean Slide credit: Kevin B. Cohen

  16. BioNLP’09: Methods BTO: BRENDA Tissue Ontology CCO: Cell Cycle Ontology CTO: Cell Type Ontology GO: Gene Ontology SO: Sequence Ontology Slide credit: Kevin B. Cohen

  17. BioNLP’09: Methods • Manual pattern-writing • Before availability of training data: based on native speaker intuitions, examples from PubMed, and variations on same, as in Cohen et al. (2004) • After release of training data: based on examination of corpus data, targeting high-frequency predicates only • Nominalizations predominated; used insights from Cohen et al. (2008) regarding Theme placement • Protein binding rules re-used from BioCreative II protein-protein interaction task • Eschewed use of wildcards Slide credit: Kevin B. Cohen

  18. BioNLP’09: Results Task 1: P 10 points higher than second-highest Task 2: P 14 points higher than second-highest Task 3: P 3.4 points lower than highest (3/6) Slide credit: Kevin B. Cohen

  19. BioNLP’09: Results Unofficial results: contribution of bug repairs Still the highest precision (#2 was 62.21) Slide credit: Kevin B. Cohen

  20. BioNLP’09: Results • Contribution of coördination-handling • Bug-fixed results: F 27.62 (Task 1) • Without coordination-handling: F 24.72 • Decrease in F of 2.9 without coördination-handling Slide credit: Kevin B. Cohen

  21. Syntax helps • 125I-labeled C3b was covalently deposited on CR2, when hemolytically active 125I-labeled C3 was added to Raji cells preincubated with iC3, factor B, properdin, and factor D, thus proving functionality of CR2-bound C3 convertase. <cr2> BINDS <c3 convertase> • CD8alpha(alpha) binds one HLA-A2/peptide molecule, interfacing with the alpha2 and alpha3 domains of HLA-A2 and also contacting beta2-microglobulin. <cd8alpha ( alpha )> BINDS <hla a2 / peptide molecule> • The binding of 109Cd to metallothionein and the thiol density of the protein were determined after incubation of a purified Zn/Cd-metallothionein preparation with either hydrogen peroxide alone, or with a number of free radical generating systems. <109cd> BINDS <metallothionein> • Although these shifts in alpha3 may provide a synergistic modulation of affinity, the binding of CD8 to MHC is clearly consistent with an avidity-based contribution from CD8 to TCR- peptide-MHC interactions. <Cd8> BINDS <major histocompatibility complex>

  22. More complex examples • Complex noun phrases • The inactive C3 (iC3), which forms spontaneously in serum in low amounts by reaction of native C3 with H2O, binds noncovalently to the N-terminal part of CR2. <inactive c3> BINDS <cr2> • RelB binds transcriptionally active kappaB motifs in the TNF-alpha promoter in normal cells, and in vitro studies with macrophages isolated from RelB- deficient animals revealed impaired production of TNF-alpha in response to LPS and IFN-gamma. <relb> BINDS <tnf - alpha promoter> • Negation • TNP-BSA, however, did not bind to the CD4 receptor. <trinitrophenyl-bovine serum albumin> DOES_NOT_BIND <cd4 receptor> • Similarly, when cells expressing the wild type FSHR were treated with tunicamycin to prevent N-linked glycosylation, the resulting nonglycosylated FSHR was not able to bind FSH. <resulting nonglycosylated fsh receptor> DOES_NOT_BIND <follicle-stimulating hormone>

  23. Coordination isparticularly hard In contrast both the S4GGnM-R and the Man-R are able to bind Man-BSA. <mannose receptor> BINDS <man bsa> <s4ggnm - r> BINDS <man bsa> Purified recombinant NC1, like authentic NC1, also bound specifically to fibronectin, collagen type I, and a laminin 5/6 complex.<authentic nc1> BINDS <laminin 5 / 6 complex><authentic nc1> BINDS <collagen type I><authentic nc1> BINDS <fibronectin><purified recombinant nc1> BINDS <laminin 5 / 6 complex><purified recombinant nc1> BINDS <collagen type I><purified recombinant nc1> BINDS <fibronectin> The nonvisual arrestins, beta-arrestin and arrestin3, but not visual arrestin, bind specifically to a glutathione S-transferase-clathrin terminal domain fusion protein. *<Arrestin3> BINDS <glutathione s-transferase-clathrin terminal domain fusion protein><beta arrestin> BINDS <glutathione s-transferase-clathrin terminal domain fusion protein><nonvisual arrestin> BINDS <glutathione s-transferase-clathrin terminal domain fusion protein>

  24. BioNLP Shared Task ‘11 • Extension of BioNLP’09 tasks • Generalization to full text (from abstracts) • Additional event types: post-translational modifications and catalysis • Methods: • Based on empirically derived patterns • Derived from training data + manual refinement • Using dependency relations (syntax) • Work of Haibin Liu (postdoc)

  25. Integrating background knowledge • Can improve OpenDMAP precision with minimal cost to recall • Take advantage of background knowledge • Tighten constraints on slot fillers in the ontology • No change to existing patterns • Proof of concept: • Distinguish among several types of protein activation (enzyme and receptor) in GeneRIFs • Utilize Gene Ontology annotations

  26. Refining selectional restrictions TP: [GeneRIF 104155 ] an ER stress induces the activation of [caspase-12_protein - catalytic activity]activated_entity via [caspase-3_protein]activator prevented FP: [GeneRIF 105594] factor Xa can induce mesangial cell proliferation through the activation of ERK_proteinvia PAR2_protein in mesangial cells

  27. Results

  28. Biological entities • Genes (and their products) are particularly valuable to recognize, but are not the only entities of interest: • Diseases • Drugs, Chemicals, and other treatments • Anatomical and other locations • Time and temporal relationships • Methods and evidence • Molecular functions, biological processes

  29. Biological Concept Recognition

  30. Two dictionary-based toolstested against CRAFT • UIMA ConceptMapper http://incubator.apache.org/uima/sandbox.html#concept.mapper.annotator • stemming and case matching relaxation • non-contiguous spans • ignore stopwords • order-independent lookup • Open Biomedical Annotator http://bioportal.bioontology.org/annotator • ignore stopwords • partial word matches

  31. Best run results • CM/CTO: stemming + FindAllMatches: false • OBA/CTO: using default stop words • CM/GO_CC: stemming + caseMatch: insensitive • CM/ChEBI: caseMatch: sensitive

  32. Concept Matching Conclusions • The kinds of terms in the ontology matter • The strategies used in the dictionary matching tools matter • OpenDMAP will support strategies that go beyond dictionary matching …

  33. Evaluation via Test Suite • Big picture: How to evaluate ontology concept recognition systems? • Traditional approach: “corpus” • Expensive • Time-consuming to produce • Redundancy for some things… • …underrepresentation of others • Immediate (narrow) goal of this work: Use techniques from software testing and descriptive linguistics to build test suites that: • Control test data • Eliminate redundancy • Systematic coverage (Oepen 1998) • Immediate (broad) goal of this work: Are there general principles for test suite design? Slide credit: Kevin B. Cohen

  34. Methods • Steps: develop “catalogue” of dimensions along which terms vary • Use insights from linguistics and from how we know concept recognition systems work • Structural aspects: length • Content aspects: typography, orthography, lexical contents (function words)… • …to build a structured set of test cases • Also compare to other test suite work (Cohen et al. 2004) to look for common principles Slide credit: Kevin B. Cohen

  35. Structured test suite Canonical Non-canonical GO:0000133 Polarisomes GO:0000108 Repairosomes GO:0000786 Nucleosomes GO:0001660 Fevers GO:0001726 Ruffles GO:0005623 Cells GO:0005694 Chromosomes GO:0005814 Centrioles GO:0005874 Microtubules • GO:0000133 Polarisome • GO:0000108 Repairosome • GO:0000786 Nucleosome • GO:0001660 Fever • GO:0001726 Ruffle • GO:0005623 Cell • GO:0005694 Chromosome • GO:0005814 Centriole • GO:0005874 Microtubule indution of apoptosis -> apoptosis induction (Syntax) cell migration -> cell migrated (Part of speech) ensheathment of neurons -> ensheathment of some neurons Slide credit: Kevin B. Cohen

  36. Methods/Results • Gene Ontology, revision 9/24/2009 • Canonical: 188 • Non-canonical: 117 • Observation: • 5:1 “dirty” versus 5:1 “clean” is mark of “mature” testing • Applied publicly available concept recognition system Slide credit: Kevin B. Cohen

  37. Results • 97.9% of canonical terms were recognized • All exceptions contain the word in • No non-canonical terms were recognized • What would it take to recognize the error pattern with canonical terms with a corpus-based approach?? • General principles: Length, ortho/typography (numerals/punctuation), function/stopwords, syntactic context Slide credit: Kevin B. Cohen

More Related