
A Cascaded Approach to Normalising Gene Mentions in Biomedical Literature

Hui Yang, Goran Nenadic, John Keane. School of Computer Science; Manchester Interdisciplinary BioCentre. G.Nenadic@manchester.ac.uk


Presentation Transcript


  1. A Cascaded Approach to Normalising Gene Mentions in Biomedical Literature
     Hui Yang, Goran Nenadic, John Keane
     School of Computer Science
     Manchester Interdisciplinary BioCentre
     G.Nenadic@manchester.ac.uk

  2. Identification of gene names
     • Gene/protein names are essential for integrating and exploring bio-literature
       • e.g. building/browsing regulatory networks
     • Two-step process
       • recognise gene mentions in text
       • map these to a referent database
     • Gene name variability and ambiguity: “biologists would rather share a toothbrush than a gene name”

  3. Outline
     • Overview – why a cascaded approach
     • Dictionary for matching
     • Exact-like matching
     • Approximate matching
     • Experiments
     • Summary and conclusions

  4. Why a cascaded approach?
     • Overall aim: improve recall, but with a controlled loss of precision
     • Give your best shot first
       • intuitively: try “exact” and exact-like matches first, then try approximations
       • experimentally: find an optimal sequence
     • Apply further (less reliable) steps only to (still) unmatched gene mentions
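The idea on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: `run_cascade`, the toy dictionary, the placeholder identifiers, and the two matchers are all hypothetical. The point is only that each later, less reliable step sees just the mentions that earlier steps left unmatched.

```python
from typing import Callable, Dict, List, Optional

# A matcher takes a mention and returns a gene identifier, or None if no match.
Matcher = Callable[[str], Optional[str]]


def run_cascade(mentions: List[str], steps: List[Matcher]) -> Dict[str, Optional[str]]:
    """Apply matchers in order; later steps only see still-unmatched mentions."""
    results: Dict[str, Optional[str]] = {m: None for m in mentions}
    for step in steps:
        for mention in mentions:
            if results[mention] is None:
                results[mention] = step(mention)
    return results


# Hypothetical toy dictionary with placeholder identifiers (not real database IDs).
TOY_DICT = {"IL-17E": "GENE:A", "p53": "GENE:B"}


def exact(mention: str) -> Optional[str]:
    """Most reliable step: the mention must equal a synonym exactly."""
    return TOY_DICT.get(mention)


def case_insensitive(mention: str) -> Optional[str]:
    """Less reliable stand-in for a later step: ignore letter case."""
    lowered = {key.lower(): value for key, value in TOY_DICT.items()}
    return lowered.get(mention.lower())


matched = run_cascade(["IL-17E", "P53", "unknown gene"], [exact, case_insensitive])
```

Here "IL-17E" is resolved by the first step, "P53" only by the second, and "unknown gene" stays unmatched. Reordering or dropping entries in the `steps` list is how a step would be switched on or off in this sketch.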

  5. Dictionary re-engineering
     • Automatically generate gene name synonyms from existing DBs (e.g. Entrez Gene, UniProt)
       • use a set of (generic, non-organism-specific) rules to generate canonical representations of synonyms
         alphaCP-4 protein → alphaCP4_protein
         RP13-16H11.4 → RP13_6H11.4
         Rev-ErbAalpha → Rev_ErbAalpha
         ST3GALVI → ST3GALVI
     • Two versions: preserve original synonyms as well as normalised canonical forms
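The slide does not list the actual rewriting rules, so the following is only a guess at what such rules might look like. The function `canonical_form`, the helper `build_dictionaries`, and both regular-expression rules are assumptions chosen to reproduce the alphaCP-4 protein, Rev-ErbAalpha, and ST3GALVI examples; they are not taken from the authors' system.

```python
import re
from typing import Dict, Tuple


def canonical_form(synonym: str) -> str:
    """Illustrative canonicalisation; both rules below are assumptions."""
    # Assumed rule 1: drop a hyphen that joins a letter to a following digit.
    s = re.sub(r"(?<=[A-Za-z])-(?=\d)", "", synonym.strip())
    # Assumed rule 2: turn remaining hyphens and whitespace runs into underscores.
    return re.sub(r"[-\s]+", "_", s)


def build_dictionaries(entries: Dict[str, str]) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Keep two versions: original synonyms and their canonical forms."""
    original = dict(entries)
    normalised = {canonical_form(name): gene_id for name, gene_id in entries.items()}
    return original, normalised


# Placeholder identifiers, for illustration only.
original, normalised = build_dictionaries(
    {"alphaCP-4 protein": "GENE:A", "Rev-ErbAalpha": "GENE:B", "ST3GALVI": "GENE:C"}
)
```

Keeping both dictionaries lets a matcher try the untouched synonyms first and fall back to the canonical forms afterwards.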

  6. Pre-processing gene mentions
     • Generate a set of canonical representations of a gene mention
       • analogous to dictionary re-engineering
       • but with some add-ons and differences
     • resolving potential acronyms: interleukin (IL)-17E → interleukin-17E, IL-17E
     • resolving gene name coordinations, e.g. ORP 3 to 6, ORP 3 and 6
     • token-based normalisation
       • Roman numbers, acronyms, Greek letters
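Two of these pre-processing steps can be sketched as follows. The function names, the regular expressions, and the small Greek and Roman lookup tables are assumptions made for illustration; the slide gives only the interleukin (IL)-17E example and the token classes, not the rules themselves.

```python
import re
from typing import List

# Assumed lookup tables; a real system would cover far more cases.
GREEK = {"α": "alpha", "β": "beta", "γ": "gamma"}
# Single-letter numerals such as "I" and "V" are left out because they are ambiguous.
ROMAN = {"II": "2", "III": "3", "IV": "4", "VI": "6"}


def resolve_acronym(mention: str) -> List[str]:
    """Split 'long form (ACRONYM)rest' into a long-form and a short-form variant."""
    m = re.match(r"^(.*?)\s*\(([A-Za-z0-9]+)\)(.*)$", mention)
    if not m:
        return [mention]
    long_form, acronym, rest = m.groups()
    return [long_form + rest, acronym + rest]


def normalise_tokens(mention: str) -> List[str]:
    """Split into letter and digit tokens, then normalise Roman and Greek tokens."""
    tokens = re.findall(r"[^\W\d_]+|\d+", mention)
    normalised = []
    for tok in tokens:
        if tok in ROMAN:
            normalised.append(ROMAN[tok])
        elif tok in GREEK:
            normalised.append(GREEK[tok])
        else:
            normalised.append(tok.lower())
    return normalised


variants = resolve_acronym("interleukin (IL)-17E")
tokens = normalise_tokens("TGF-β II")
```

In this sketch `variants` holds the two forms shown on the slide, and `tokens` is a lower-cased token list in which the Greek letter and the Roman number have been rewritten. Coordination handling is left out because the slide does not show the intended output.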

  7. 1st stage: exact matching
     • Step E1: Match original dictionary and original mentions
     • Step E2: Match normalised dictionary and normalised mentions
     • Step E3: Match normalised dictionary and token-based normalised mentions

  8. 2nd stage: approximate matching
     • component-based comparisons
       • relevant (specific) component classes (Digit, Greek-Letter, Roman-Number, Chemical etc. tokens)
     • Step A1a: Component permutation (order) is ignored
     • Step A1b: Non-relevant components missing from a synonym are ignored

  9. 2nd stage: approximate matching (continued)
     • Step A2a: One non-relevant extra component in a synonym is ignored
     • Step A2b: One non-relevant extra component in a synonym is ignored if all relevant components are matched
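A rough sketch of two of the approximate steps (A1a and A2a), treating a mention and a synonym as lists of already normalised components. The `is_relevant` test is a simplified stand-in for the slide's relevant component classes, and all function names here are hypothetical.

```python
from collections import Counter
from typing import List

GREEK_NAMES = {"alpha", "beta", "gamma", "delta"}


def is_relevant(token: str) -> bool:
    """Assumed stand-in for the relevant classes: digits and Greek-letter names."""
    return token.isdigit() or token in GREEK_NAMES


def match_ignoring_order(mention: List[str], synonym: List[str]) -> bool:
    """Step A1a idea: same components, in any order."""
    return Counter(mention) == Counter(synonym)


def match_one_extra_nonrelevant(mention: List[str], synonym: List[str]) -> bool:
    """Step A2a idea: the synonym has exactly one extra, non-relevant component."""
    m, s = Counter(mention), Counter(synonym)
    if m - s:  # the mention has a component the synonym lacks
        return False
    extra = list((s - m).elements())
    return len(extra) == 1 and not is_relevant(extra[0])


same_components = match_ignoring_order(["receptor", "il", "2"], ["il", "2", "receptor"])
extra_generic = match_one_extra_nonrelevant(["il", "2"], ["il", "2", "protein"])
extra_specific = match_one_extra_nonrelevant(["il", "2"], ["il", "2", "beta"])
```

Under these assumptions the extra token "protein" is ignored, while the extra token "beta" blocks the match, because a specific component may point to a different gene.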

  10. [Figure: flowchart of the cascade. Original, normalised, and token-normalised mentions are matched against original and normalised synonyms in this order: original vs. original; normalised vs. normalised; normalised vs. token-normalised; ignore word permutations; ignore one missing non-relevant component; ignore one extra non-relevant component; ignore one extra non-relevant component if all relevant components are matched.]

  11. Experiments: experimental context
     • BioCreative II data set
     • Map human genes to Entrez Gene

  12. Results: exact-like matching

  13. Results: approximate matching

  14. Cumulative performance
     • Precision: 0.93
     • Recall: 0.69
     • F-measure: 0.79
     • For comparison (BioCreative II test data)
       • Precision: 0.94
       • Recall: 0.72
       • F-measure: 0.81
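The F-measure figures can be checked against the precision and recall values, assuming the slide reports the balanced F-measure (the harmonic mean of the two). The helper name `f_measure` is ours. Because the slide values are rounded to two decimals, a recomputed score can differ from the reported one by about 0.01.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


cumulative = f_measure(0.93, 0.69)
print(f"{cumulative:.3f}")  # prints 0.792
```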

  15. Some conclusions
     • Exact-like matching achieves 0.76 F-measure (0.96 P, 0.64 R)
     • Approximate matching improves recall by only 10-15%
       • ignoring word order is effective (both recall- and precision-wise), as is ignoring one extra non-relevant component (recall)
     • Some approaches are consistent across different test sets, some are not
       • e.g. precision of approximate matching: 0.63 – 0.78; recall of exact matching: 0.59 – 0.68

  16. Summary
     • Simple yet effective approach
       • cascaded approach with reliable matching strategies which can be switched on and off
       • some are good for precision, some for recall
       • can be easily used for other species
     • More work needed on
       • gene name coordination and enumerations
       • acronyms/symbols embedded in mentions
       • species identification

  17. Acknowledgements
     • Partially funded by UK BBSRC (Project “Mining Term Associations from Literature to Support Knowledge Discovery in Biology”)
     • Manchester Interdisciplinary Biocentre (Irena Spasic)
     • Faculty of Life Sciences (Casey Bergman)
     • National Centre for Text Mining (NaCTeM)

