Automatic template creation for biomedical information extraction: theory and practice David Corney UCL Computer Science 2nd May 2006
Motivation • Biologists need tools to process literature • This requires knowledge of domain and of computational linguistics • Option 1: become a computational linguist • Option 2: collaborate with a computational linguist • Option 3: use a tool that provides linguistic knowledge • Aim: to “de-skill” template creation – McLinguist
Scale of the problem • 672,963 citations added to PubMed in 2005 • c. 91% of these citations are “journal articles” • Mean length of a journal article: 5600 words • Roughly 110 words per second are published in life sciences journals
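The publication rate can be checked from the three figures on the slide. The short calculation below is only a sanity check under the assumption of a 365-day year; with the rounded inputs quoted above it comes out at roughly 109 words per second, close to the rate stated on the slide.

```python
# Figures quoted on the slide (all rounded).
citations_2005 = 672_963      # citations added to PubMed in 2005
journal_fraction = 0.91       # share that are "journal articles"
mean_words = 5600             # mean length of a journal article

# Assumption: a 365-day year.
seconds_per_year = 365 * 24 * 60 * 60

words_per_year = citations_2005 * journal_fraction * mean_words
words_per_second = words_per_year / seconds_per_year

print(f"{words_per_year:,.0f} words per year")
print(f"{words_per_second:.1f} words per second")
```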
Background • BioRAT project – 2001-? • Sponsors: GSK (initially) and BBSRC (now) • Aim: to make biomedical information extraction practical for life sciences researchers • Uses GATE (ANNIE) • Software is available as a “permanent prototype” • Variety of groups using / evaluating system • http://bioinf.cs.ucl.ac.uk/biorat
Document: “Swi6 was phosphorylated by Hrr25 kinase...” • Sentence splitter • Token splitter: <Swi6> <was> <phosphorylated> <by> <Hrr25> <kinase> • Semantic tagging (named entity recognition): <Swi6 protein> <phosphorylated interaction> <Hrr25 protein> • Part-of-speech tagging: <Swi6 noun> <phosphorylated verb> <by preposition> <Hrr25 noun> <kinase noun> • Pattern recognition: <protein> <interaction> <preposition> <protein> • Information extraction • Database of relevant facts
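The pipeline on this slide can be illustrated with a toy version in plain Python. This is a minimal sketch, not the GATE/ANNIE components that BioRAT uses: the gazetteer, the preposition list, the tagging rules and the function names are all hypothetical, and tokens without a tag of interest (such as "was") are simply dropped before pattern matching.

```python
import re

# Hypothetical toy gazetteer; a real system would use large curated lists.
GAZETTEER = {"Swi6": "protein", "Hrr25": "protein", "phosphorylated": "interaction"}
PREPOSITIONS = {"by", "with", "to"}

# The pattern from the slide: <protein> <interaction> <preposition> <protein>
PATTERN = ["protein", "interaction", "preposition", "protein"]


def split_sentences(document):
    """Very rough sentence splitter: break after ., ! or ? followed by space."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]


def tokenise(sentence):
    """Split a sentence into alphanumeric tokens."""
    return re.findall(r"[A-Za-z0-9]+", sentence)


def tag(token):
    """Assign a single coarse tag to a token (toy stand-in for NER and POS tagging)."""
    if token in GAZETTEER:
        return GAZETTEER[token]
    if token in PREPOSITIONS:
        return "preposition"
    return "other"


def extract(document):
    """Return (protein, interaction, protein) facts found in the document."""
    facts = []
    for sentence in split_sentences(document):
        tagged = [(t, tag(t)) for t in tokenise(sentence)]
        # Simplification: ignore tokens that carry no tag of interest.
        tagged = [(t, g) for t, g in tagged if g != "other"]
        for i in range(len(tagged) - len(PATTERN) + 1):
            window = tagged[i:i + len(PATTERN)]
            if [g for _, g in window] == PATTERN:
                facts.append((window[0][0], window[1][0], window[3][0]))
    return facts


facts = extract("Swi6 was phosphorylated by Hrr25 kinase.")
print(facts)
```

For the example sentence this yields a single fact linking Swi6 and Hrr25 through "phosphorylated", which is the kind of record that would be stored in the database of relevant facts.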
Making template creation easier • Biologists are not computational linguists • But have great domain knowledge • Several approaches being developed • Template design tool • Stupid templates + Smart filters = ...? • A logical framework and its implementation
Option 1: Template authoring tool • A tool to allow users to create their own templates, without learning a new language • Need a simple way to define patterns • Need to give rapid feedback • Interface needs to be intuitive
Authoring tool: Summary • Easier than writing templates directly • But still requires user to learn a new tool • Constraints on what can be done • Maybe useful for prototyping
Option 2: Learning filters • User defines concepts of interest • Software generates a variety of templates • Very general and imprecise • Templates applied to a corpus • User then marks results as correct or incorrect • Software then learns ‘filters’ to remove false positives • May learn to ignore negative findings (“there is no evidence that...”) • Or learn to focus on the start of the paper
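The filter-learning idea can be sketched with a deliberately crude learner. The slides do not say which learning method is used, so the approach below (collecting cue words that only ever occur in matches the user rejected) is an assumption chosen for brevity, and the function names and example sentences are hypothetical.

```python
def learn_filter(labelled):
    """Learn cue words from user feedback.

    labelled: list of (matched_sentence, is_correct) pairs.
    Returns the words seen only in sentences marked incorrect.
    """
    seen_in_correct, seen_in_incorrect = set(), set()
    for sentence, is_correct in labelled:
        words = set(sentence.lower().split())
        if is_correct:
            seen_in_correct |= words
        else:
            seen_in_incorrect |= words
    return seen_in_incorrect - seen_in_correct


def passes(sentence, cues):
    """A match survives the filter if it contains none of the learned cue words."""
    return not (set(sentence.lower().split()) & cues)


# Hypothetical feedback from a user marking template matches.
labelled = [
    ("A binds B", True),
    ("C binds D in yeast", True),
    ("there is no evidence that A binds C", False),
]
cues = learn_filter(labelled)

print(sorted(cues))
print(passes("E binds F in yeast", cues))
print(passes("no evidence that E binds F", cues))
```

With only three labelled examples the filter already rejects the negative finding, but it would also reject any sentence containing "is" or "that", which illustrates why many examples are needed before the learned filter is reliable.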
Filter learning: Summary • User has less learning to do than with an authoring tool • Each decision is relatively straightforward • But requires a lot of effort • Computer needs many examples before it can successfully learn what is relevant
Moving on • Three methods for creating templates • Template design tool • Stupid templates + smart filters • A logical framework and its implementation • Questions?
Option 3: Machine learning via a logical framework • Fully automatic template creation • Sentence level: • User provides set of one or more interesting sentences • Computer generalises these to a more abstract pattern • Paper level: • User provides a set of interesting papers • Computer creates templates that match information in those papers (and not in an irrelevant corpus)
A Framework for Information Extraction • Aim: to describe formally the space of all templates for any set of one or more phrases • We can then use machine learning to search for templates • Formally define “word”, “attribute”, “template”, “fragment”, “match” etc. • Describe how to create, modify and evaluate templates • Possible search algorithms
Information extraction templates • A template is a pattern of words and their attributes • It is a list of word-attributes that correspond to sequences of words found in that order • Good templates match interesting fragments (true positives) with as few irrelevant fragments (false positives) as possible
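A template of this kind can be modelled as a list of (attribute, value) pairs that must match consecutive tokens. The sketch below is one possible representation, not the BioRAT data model: the `Token` fields, the attribute names and the hand-tagged example are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    word: str
    pos: str       # part of speech
    gaz: str = ""  # gazetteer category, if any


def element_matches(element, token):
    """A template element (attribute, value) matches a token with that attribute value."""
    attribute, value = element
    return getattr(token, attribute) == value


def find_matches(template, tokens):
    """Return every run of consecutive tokens matched, in order, by the template."""
    n = len(template)
    return [
        tokens[i:i + n]
        for i in range(len(tokens) - n + 1)
        if all(element_matches(e, t) for e, t in zip(template, tokens[i:i + n]))
    ]


# Hand-tagged tokens for "Rad53p binds to Dbf4p" (tags are illustrative).
tokens = [
    Token("Rad53p", "noun", "protein"),
    Token("binds", "verb", "prot_binding"),
    Token("to", "preposition"),
    Token("Dbf4p", "noun", "protein"),
]

# Mixes gazetteer, literal-word and part-of-speech constraints.
template = [("gaz", "protein"), ("word", "binds"), ("pos", "preposition"), ("gaz", "protein")]

matches = find_matches(template, tokens)
print([[t.word for t in fragment] for fragment in matches])
```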
Superset generalisation • Each generalisation creates a new template that matches everything that its parent matches • Including the same true-positive and false-positive fragments • So if a template has too many false positives, then so will all of its descendants • We can feed knowledge forward as we evaluate templates • Count the numbers of true-positive and false-positive matches • A parent's match counts are a lower bound on its children's match counts
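The superset property can be checked on a small example. In this sketch a template element is generalised in one of two ways, from a literal word to its gazetteer category or from either of those to a single-token wildcard; the gazetteer, the generalisation steps and the fragments are hypothetical and much simpler than the framework in the paper.

```python
# Hypothetical gazetteer mapping words to categories.
GAZ = {"Rad53p": "protein", "Dbf4p": "protein", "Stat3": "protein", "binds": "prot_binding"}


def tok(word):
    """Build a token carrying its word and (possibly missing) gazetteer category."""
    return {"word": word, "gaz": GAZ.get(word)}


def element_matches(element, token):
    attribute, value = element
    return attribute == "any" or token.get(attribute) == value


def matches(template, fragment):
    return len(template) == len(fragment) and all(
        element_matches(e, t) for e, t in zip(template, fragment)
    )


def children(template):
    """Generalise each element one step: word -> gazetteer category -> wildcard."""
    result = []
    for i, (attribute, value) in enumerate(template):
        if attribute == "word" and value in GAZ:
            new = ("gaz", GAZ[value])
        elif attribute in ("word", "gaz"):
            new = ("any", None)
        else:
            continue  # already a wildcard
        result.append(template[:i] + (new,) + template[i + 1:])
    return result


fragments = [
    [tok(w) for w in ("Rad53p", "binds", "Dbf4p")],
    [tok(w) for w in ("Stat3", "binds", "Dbf4p")],
    [tok(w) for w in ("Stat3", "cleaves", "Dbf4p")],
]


def matched_indices(template):
    return {i for i, fragment in enumerate(fragments) if matches(template, fragment)}


parent = (("word", "Rad53p"), ("word", "binds"), ("word", "Dbf4p"))
parent_set = matched_indices(parent)
child_sets = [matched_indices(child) for child in children(parent)]

print(parent_set, child_sets)
```

Every child matches at least the fragments its parent matched, so a child can never have fewer true positives or fewer false positives than its parent.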
Searching for good templates • Start with a “seed” phrase • Creates a trivial but precise template • Create several new templates through generalisation • Increases recall • Each template has at least one true positive • Need to choose which template to generalise next • Select generalisation that maintains precision • Compare performance on “relevant” and “irrelevant” corpora • Assume every match in “relevant” corpus is true positive • Assume every match in “irrelevant” corpus is false positive
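The scoring assumption in the last three bullets can be written down directly: matches in the relevant corpus count as true positives and matches in the irrelevant corpus as false positives. The sketch below uses whole words and a single-word wildcard `*` as a stand-in for real templates; the sentences and the `count_matches` and `score` helpers are hypothetical.

```python
def count_matches(template, sentence):
    """Count positions where the template matches consecutive words ('*' matches any word)."""
    words = sentence.split()
    n = len(template)
    return sum(
        all(t == "*" or t == w for t, w in zip(template, words[i:i + n]))
        for i in range(len(words) - n + 1)
    )


def score(template, relevant, irrelevant):
    """Return (TP, FP, precision) under the corpus-level labelling assumption."""
    tp = sum(count_matches(template, s) for s in relevant)
    fp = sum(count_matches(template, s) for s in irrelevant)
    precision = tp / (tp + fp) if tp + fp else 0.0
    return tp, fp, precision


# Tiny hypothetical corpora.
relevant = ["Rad53p binds Dbf4p", "Bas1p binds Bas2p"]
irrelevant = ["glue binds paper", "patients received placebo"]

seed = ("Rad53p", "binds", "Dbf4p")
narrow = ("*", "binds", "Dbf4p")
broad = ("*", "binds", "*")

for template in (seed, narrow, broad):
    print(template, score(template, relevant, irrelevant))
```

Here the seed is precise but matches only itself, while the broadest template gains recall at the cost of a false positive; choosing which generalisation to expand next is a trade-off between those two counts.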
Implementation • Aim is to aid real-world applications • Focus of IE is on biomedical text • Templates as described are already part of the “BioRAT” system • Search algorithm implemented but still being evaluated • Start with seed phrase, and generalise gradually • Some explorations have been carried out
Example results • Start with: • Seed phrase “Rad53p protein binds to Dbf4p” • Positive corpus: 500 abstracts on protein-protein interactions (from the DIP database) • Neutral corpus: first 500 abstracts dated September 05 • Results:
Example results • After 30 iterations, we get 286 TP and 4 FP • [GAZ: protein] [WILD: ?????] [GAZ: prot_binding, main] [WILD:???] [GAZ: protein, sp_gene] • Example matches • Protein kinase C delta associates with and phosphorylates Stat3 in an interleukin-6-dependent manner. • Furthermore, Stat3 was phosphorylated by PKC delta in vivo on Ser-727... • ...efficient transcription of yeast AMP biosynthesis genes requires interaction between Bas1p and Bas2p which is promoted... • ...RII alpha fused to endonexin II formed dimers but did not bind MAP2. • A protein interaction map for cell polarity development.
Over-generalisation • Compromise between true-positives and false-positives • After 50 iterations, we get 2621 TP and 151 FP • [WILD: ???] [WILD: ?????] [GAZ: prot_binding, main] [WILD:?????] [WILD: *]
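The trade-off can be quantified from the counts reported on the two results slides, using precision = TP / (TP + FP). Note that these counts rest on the corpus-level assumption that every match in the positive corpus is a true positive; recall cannot be computed from the figures given.

```python
def precision(tp, fp):
    """Fraction of matches that fall in the positive corpus."""
    return tp / (tp + fp)


# Counts reported on the slides.
p30 = precision(286, 4)      # after 30 iterations
p50 = precision(2621, 151)   # after 50 iterations

print(f"30 iterations: precision {p30:.3f}")
print(f"50 iterations: precision {p50:.3f}")
```

Between 30 and 50 iterations the number of matches grows roughly ninefold while precision falls from about 0.99 to about 0.95.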
Machine learning via framework: Summary • Requires less effort from the user • Just providing a few examples to get things going • But less user input may lead to less reliable results • May need to be combined with previous methods • Could also start with several positive examples • Interactive search • Negative examples
Measuring document similarity • Create random templates from one document • Search for matching fragments in a second document • Similar documents will have similar numbers of matches • Templates will capture semantics as well as word frequencies • Cf. vectors of word frequencies and TF.IDF
References • Corney, D. P. A., Byrne, E. L., Buxton, B. F. and Jones, D. T. (2005) "A Logical Framework for Template Creation and Information Extraction", Foundations of Semantic Oriented Data and Web Mining workshop, part of ICDM2005 (the Fifth IEEE International Conference on Data Mining). • Corney, D. P. A., Buxton, B. F., Langdon, W. B. and Jones, D. T. (2004) "BioRAT: Extracting Biological Information from Full-length Papers", Bioinformatics, 20(17), pp. 3206-3213. • http://www.cs.ucl.ac.uk/staff/d.corney/publications.html
BioRAT Acknowledgements • Profs. Bernard Buxton, David Jones (PI) • Framework with Dr. Emma Byrne • Funding and support from • BBSRC, Inpharmatica and GlaxoSmithKline • BioRAT (Biological Research Assistant for Text Mining) • http://bioinf.cs.ucl.ac.uk/biorat • d.corney@cs.ucl.ac.uk
Search algorithm • Maintain sets of unevaluated and evaluated templates • Start with seed phrase • Generalise each term every possible way • Evaluate “most promising” template • Find matches in positive and neutral corpora • Generate each possible child template • Repeat for next template
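The loop on this slide can be sketched as a best-first search. This is an illustration under stated assumptions rather than the BioRAT implementation: templates are tuples of words, the only generalisation is replacing one word with a single-word wildcard `*`, a child's "promise" is taken to be its parent's precision and true-positive count, and the corpora, function names and precision threshold are hypothetical.

```python
import heapq


def count_matches(template, sentence):
    """Count positions where the template matches consecutive words ('*' matches any word)."""
    words = sentence.split()
    n = len(template)
    return sum(
        all(t == "*" or t == w for t, w in zip(template, words[i:i + n]))
        for i in range(len(words) - n + 1)
    )


def children(template):
    """Generalise each term in turn by replacing it with a wildcard."""
    return [template[:i] + ("*",) + template[i + 1:]
            for i, term in enumerate(template) if term != "*"]


def search(seed, positive, neutral, max_iterations=20):
    """Best-first search; returns {template: (TP, FP)} for every evaluated template."""
    evaluated = {}
    frontier = [(-1.0, 0, seed)]  # unevaluated templates, most promising first
    queued = {seed}
    for _ in range(max_iterations):
        if not frontier:
            break
        _, _, template = heapq.heappop(frontier)
        tp = sum(count_matches(template, s) for s in positive)
        fp = sum(count_matches(template, s) for s in neutral)
        evaluated[template] = (tp, fp)
        precision = tp / (tp + fp) if tp + fp else 0.0
        for child in children(template):
            if child not in queued:
                queued.add(child)
                # Children inherit their parent's score as their priority.
                heapq.heappush(frontier, (-precision, -tp, child))
    return evaluated


def best(evaluated, min_precision):
    """Template with the most true positives among those meeting the precision threshold."""
    ok = {t: tp for t, (tp, fp) in evaluated.items()
          if tp and tp / (tp + fp) >= min_precision}
    return max(ok, key=ok.get)


positive = ["Rad53p binds Dbf4p", "Bas1p binds Bas2p", "Stat3 binds PKC"]
neutral = ["glue binds paper", "patients received placebo"]

evaluated = search(("Rad53p", "binds", "Dbf4p"), positive, neutral)
winner = best(evaluated, min_precision=0.7)
print(winner, evaluated[winner])
```

On this toy data the search evaluates all eight possible templates and selects the one that keeps "binds" but wildcards both arguments; the fully wildcarded template matches more neutral text and falls below the threshold.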
Inheritance • [Figure: template inheritance diagram; key: TP+FPN]