Introduction to Web Science Harvesting the SW
Six challenges of the Knowledge Life Cycle • Acquire • Model • Reuse • Retrieve • Publish • Maintain
A couple of approaches … • Active learning to reduce annotation burden • Supervised learning • Adaptive IE • The Melita methodology • Automatic annotation of large repositories • Largely unsupervised • Armadillo
The Seminar Announcements Task • Created by Carnegie Mellon School of Computer Science • How to extract • Speaker • Location • Start Time • End Time • from seminar announcements received by email
Seminar Announcements Example Dr. Steals presents in Dean Hall at one am. becomes <speaker>Dr. Steals</speaker> presents in <location>Dean Hall</location> at <stime>one am</stime>.
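As a minimal sketch (not part of the original task materials), the tagged output above could be consumed like this; the regex helper and the tag list are illustrative assumptions:

```python
import re

# Illustrative only: pull the tagged fields out of an inline-annotated announcement.
ANNOTATED = ("<speaker>Dr. Steals</speaker> presents in "
             "<location>Dean Hall</location> at <stime>one am</stime>.")

def extract_fields(text, tags=("speaker", "location", "stime", "etime")):
    fields = {}
    for tag in tags:
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text)
        if match:
            fields[tag] = match.group(1)
    return fields

print(extract_fields(ANNOTATED))
# {'speaker': 'Dr. Steals', 'location': 'Dean Hall', 'stime': 'one am'}
```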
Information Extraction Measures • Precision: how many of the retrieved documents are relevant? • Recall: how many of all the relevant documents were retrieved? • F-measure: the weighted harmonic mean of precision and recall
IE Measures Examples • If I ask the librarian to search for books on cars, there are 10 relevant books in the library and out of the 8 he found, only 4 seem to be relevant books. What is his precision, recall and f-measure?
IE Measures Answers • If I ask the librarian to search for books on cars, there are 10 relevant books in the library and out of the 8 he found, only 4 seem to be relevant books. What is his precision, recall and f-measure? • Precision = 4/8 = 50% • Recall = 4/10 = 40% • F =(2*50*40)/(50+40) = 44.4%
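The same arithmetic as a small, self-contained sketch (the function name is illustrative); it reproduces the librarian numbers above:

```python
def precision_recall_f(retrieved_relevant, retrieved_total, relevant_total):
    """Standard IR/IE measures; F is the harmonic mean of precision and recall."""
    precision = retrieved_relevant / retrieved_total
    recall = retrieved_relevant / relevant_total
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f

# Librarian example: 8 books retrieved, 4 of them relevant, 10 relevant in the library.
p, r, f = precision_recall_f(4, 8, 10)
print(f"P={p:.0%}  R={r:.0%}  F={f:.1%}")   # P=50%  R=40%  F=44.4%
```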
Adaptive IE • What is IE? • Automated ways of extracting structured information from unstructured or partially structured machine-readable files • What is AIE? • Performs the tasks of traditional IE • Exploits the power of Machine Learning in order to adapt to • complex domains having large amounts of domain-dependent data • different sub-language features • different text genres • Considers the Usability and Accessibility of the system important
Amilcare • Tool for adaptive IE from Web-related texts • Specifically designed for document annotation • Based on the (LP)2 algorithm (Linguistic Patterns by Learning Patterns) • Covering algorithm based on lazy NLP • Trains with a limited amount of examples • Effective on different text types • free texts • semi-structured texts • structured texts • Uses GATE and ANNIE for preprocessing
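A toy sketch of the covering idea such rule learners rely on, under strong simplifications and not the actual (LP)2 algorithm: repeatedly keep the contextual pattern that covers the most still-uncovered annotated examples, remove what it covers, and stop when coverage gets too thin:

```python
from collections import Counter

def cover(examples, min_coverage=2):
    """Toy covering loop: 'rules' are literal (left word, right word) context
    pairs seen around an annotation boundary. Real (LP)2 rules also generalise
    over lemmas, POS tags and gazetteer classes; this is only the skeleton."""
    uncovered = list(examples)
    rules = []
    while uncovered:
        best, n = Counter(uncovered).most_common(1)[0]
        if n < min_coverage:
            break                      # too rare to generalise from reliably
        rules.append(best)
        uncovered = [e for e in uncovered if e != best]
    return rules

# e.g. contexts observed to the left/right of <stime> boundaries
print(cover([("at", "am"), ("at", "am"), ("at", "pm"), ("by", "noon")]))
```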
CMU: detailed results • Best overall accuracy • Best result on speaker field • No results below 75%
GATE • General Architecture for Text Engineering • provides a software infrastructure for researchers and developers working in NLP • Contains • Tokeniser • Gazetteers • Sentence Splitter • POS Tagger • Semantic Tagger (ANNIE) • Co-reference Resolution • Multilingual support • Protégé • WEKA • many more exist and can be added • http://www.gate.ac.uk
Annotation • Current practice of annotation for knowledge identification and extraction • is complex • is time consuming • needs annotation by experts • Goal: reduce the burden of text annotation for Knowledge Management
Different Annotation Systems • SGML • TEX • Xanadu • CoNote • ComMentor • JotBot • Third Voice • Annotate.net • The Annotation Engine • Alembic • The Gate Annotation Tool • iMarkup, Yawas • MnM, S-CREAM
Melita • Tool for assisted automatic annotation • Uses an Adaptive IE engine to learn how to annotate (no rule writing needed to adapt the system) • User: annotates document samples • IE System: • Trains while the user annotates • Generalizes over seen cases • Provides preliminary annotation for new documents • Performs smart ordering of documents • Advantages • Annotates trivial or previously seen cases • Focuses slow/expensive user activity on unseen cases • User mainly validates extracted information • Simpler & less error prone / speeds up corpus annotation • The system learns how to improve its capabilities
Methodology: Melita Bootstrap Phase • Bare text → user annotates → Amilcare learns in the background
Methodology: Melita Checking Phase • Bare text → Amilcare annotates → user annotates → Amilcare learns in the background from missing tags and mistakes
Methodology: Melita Support Phase • Bare text → Amilcare annotates → user corrects → corrections are used to retrain
Smart ordering of Documents • Bare text → user annotates → system learns the annotations • Tries to annotate all the documents and selects the documents with partial annotations as the next ones to present to the user
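The three phases and the smart ordering amount to a loop like the following sketch; the learner, user and coverage function are placeholders, not Melita's actual interfaces:

```python
def melita_loop(documents, learner, user):
    """Illustrative Melita-style annotate/train/suggest cycle (not the tool's API)."""
    pool = list(documents)
    while pool:
        # Pre-annotate everything; prefer a document that is only partially
        # annotated, so user effort goes to what the learner cannot do yet.
        scored = [(doc, learner.annotate(doc)) for doc in pool]
        partial = [ds for ds in scored if 0 < coverage(ds[1]) < 1] or scored
        doc, suggestions = partial[0]
        pool.remove(doc)

        # The user mostly validates; corrections become new training material.
        gold = user.review(doc, suggestions)
        learner.train(doc, gold)       # retraining happens in the background

def coverage(suggestions):
    """Fraction of expected slots the learner filled (placeholder heuristic)."""
    return sum(1 for s in suggestions if s is not None) / max(len(suggestions), 1)
```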
Intrusivity • An evolving system is difficult to control • Goal: • Avoiding unwelcome/unreliable suggestions • Adapting proactivity to user’s needs • Method: • Allow users to tune proactivity • Monitor user reactions to suggestions
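One simple way such tuning could be realised, offered as an assumption rather than Melita's actual mechanism: only surface suggestions above a user-controlled confidence threshold, and nudge the threshold according to how suggestions are received:

```python
def filter_suggestions(suggestions, threshold):
    """Show only suggestions the learner is confident about.
    suggestions: list of (annotation, confidence) pairs."""
    return [a for a, conf in suggestions if conf >= threshold]

def adjust_threshold(threshold, accepted, rejected, step=0.05):
    """Become more proactive when suggestions are accepted, less when rejected."""
    if rejected > accepted:
        return min(threshold + step, 0.95)
    return max(threshold - step, 0.05)
```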
Methodology: Melita Control Panel • Ontology panel, defining the concepts • Document panel
Results
Future Work • Research better ways of annotating concepts in documents • Optimise document ordering to maximise the discovery of new tags • Allow users to edit the rules • Learn to discover relationships! • Not only suggest but also correct user annotations!
Annotation for the Semantic Web • The Semantic Web requires document annotation • Current approaches • Manual (e.g. Ontomat) or semi-automatic (MnM, S-CREAM, Melita) • BUT: manual/semi-automatic annotation of • large, diverse repositories • containing different and sparse information is unfeasible • E.g. a Web site of 1,600 pages
Redundancy • Information on the Web (or in large repositories) is redundant • Information is repeated in different superficial formats • Databases/ontologies • Structured pages (e.g. produced by databases) • Largely structured pages (bibliography pages) • Unstructured pages (free texts)
The Idea • Largely unsupervised annotation of documents • Based on Adaptive Information Extraction • Bootstrapped using the redundancy of information • Method • Use the structured information (easier to extract) to bootstrap learning on less structured sources (more difficult to extract)
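A hedged sketch of that bootstrap, with placeholder components: items already known from a structured source become automatic annotations on the site's pages, and those annotations train an extractor for the cases the structured source misses:

```python
def bootstrap(pages, structured_seeds, learner):
    """Illustrative redundancy-based bootstrapping; all components are placeholders.
    structured_seeds: items already known from a database, ontology or Citeseer."""
    training = []
    for page in pages:
        # Seed annotation: trivially mark occurrences of already-known items.
        spans = [(page.find(item), item) for item in structured_seeds if item in page]
        if spans:
            training.append((page, spans))

    learner.train(training)                            # learn from the easy cases
    return [learner.extract(page) for page in pages]   # then harvest the rest
```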
Example: Extracting Bibliographies • Mines web sites to extract bibliographies from personal pages • Tasks: • Finding people's names • Finding home pages • Finding personal bibliography pages • Extracting bibliography references • Sources • NE Recognition (GATE's ANNIE) • Citeseer/UniTrier (largely incomplete bibliographies) • Google • Homepagesearch
Mining Web sites (1) • Mines the site looking for people's names • Uses • generic patterns (NER) • Citeseer for likely bigrams • Looks for structured lists of names • Annotates known names • Trains on the annotations to discover the HTML structure of the page • Recovers all names and hyperlinks
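A minimal illustration of the "train on annotations to discover the HTML structure" step, assuming an unrealistically regular page (the real system learns far richer wrappers): find the markup wrapping a seed name and reapply it to recover the remaining entries:

```python
import re
from html import unescape

def induce_wrapper(page_html, seed_names):
    """Toy wrapper induction: find which tag immediately wraps a seed name,
    then extract everything wrapped by the same kind of tag."""
    for name in seed_names:
        m = re.search(r"<(\w+)[^>]*>\s*" + re.escape(name) + r"\s*</\1>", page_html)
        if m:
            tag = m.group(1)
            pattern = rf"<{tag}[^>]*>\s*(.*?)\s*</{tag}>"
            return [unescape(x) for x in re.findall(pattern, page_html)]
    return []

html = '<li><a href="/p/ann">Ann Smith</a></li><li><a href="/p/bob">Bob Jones</a></li>'
print(induce_wrapper(html, ["Ann Smith"]))   # ['Ann Smith', 'Bob Jones']
```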
Experimental Results II - Sheffield • People • discovering who works in the department • using Information Integration • Total present in site 139 • Using generic patterns + online repositories • 35 correct, 5 wrong • Precision 35 / 40 = 87.5 % • Recall 35 / 139 = 25.2 % • F-measure 39.1 % • Errors • A. Schriffin • Eugenio Moggi • Peter Gray
Experimental Results IE - Sheffield • People • using Information Extraction • Total present in site 139 • 116 correct, 8 wrong • Precision 116 / 124 = 93.5 % • Recall 116 / 139 = 83.5 % • F-measure 88.2 % • Errors • Speech and Hearing • European Network • Department Of • Enhancements – Lists, Postprocessor • Position Paper • The Network • To System
Experimental Results - Edinburgh • People • using Information Integration • Total present in site 216 • Using generic patterns + online repositories • 11 correct, 2 wrong • Precision 11 / 13 = 84.6 % • Recall 11 / 216 = 5.1 % • F-measure 9.6 % • using Information Extraction • 153 correct, 10 wrong • Precision 153 / 163 = 93.9 % • Recall 153 / 216 = 70.8 % • F-measure 80.7 %
Experimental Results - Aberdeen • People • using Information Integration • Total present in site 70 • Using generic patterns + online repositories • 21 correct, 1 wrong • Precision 21 / 22 = 95.5 % • Recall 21 / 70 = 30.0 % • F-measure 45.7 % • using Information Extraction • 63 correct, 2 wrong • Precision 63 / 65 = 96.9 % • Recall 63 / 70 = 90.0 % • F-measure 93.3 %
Mining Web sites (2) • Annotates known papers • Trains on annotations to discover the HTML structure • Recovers co-authoring information
Experimental Results (1) • Papers • discovering publications in the department • using Information Integration • Total present in site 320 • Using generic patterns + online repositories • 151 correct, 1 wrong • Precision 151 / 152 = 99 % • Recall 151 / 320 = 47 % • F-measure 64 % • Errors - Garbage in database!! @misc{ computer-mining, author = "Department Of Computer", title = "Mining Web Sites Using Adaptive Information Extraction Alexiei Dingli and Fabio Ciravegna and David Guthrie and Yorick Wilks", url = "citeseer.nj.nec.com/582939.html" }
Experimental Results (2) • Papers • using Information Extraction • Total present in site 320 • 214 correct, 3 wrong • Precision 214 / 217 = 99 % • Recall 214 / 320 = 67 % • F-measure 80 % • Errors • Wrong boundaries in detection of paper names! • Names of workshops mistaken as paper names!
Artists domain • Task • Given the name of an artist, find all the paintings of that artist. • Created for the ArtEquAKT project
User Role • Providing … • A URL • List of services • Already wrapped (e.g. Google is in default library) • Train wrappers using examples • Examples of fillers (e.g. project names) • In case … • Correcting intermediate results • Reactivating Armadillo when paused
Armadillo • Library of known services (e.g. Google, Citeseer) • Tools for training learners for other structured sources • Tools for bootstrapping learning • From un/structured sources • No user annotation • Multi-strategy acquisition of information using redundancy • User-driven revision of results • With re-learning after user correction
Rationale • Armadillo learns how to extract information from large repositories by integrating information from diverse and distributed resources • Uses: • Ontology population • Information highlighting • Document enrichment • Enhancing user experience
IE for SW: The Vision • Automatic annotation services • For a specific ontology • Constantly re-indexing/re-annotating documents • Semantic search engine • Effects: • No annotation in the document • Just as today's indexes are not stored in the documents • No legacy with the past • Annotation with the latest version of the ontology • Multiple annotations for a single document • Simplifies maintenance • Page changed but not re-annotated
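As a speculative sketch of that vision (all names here are hypothetical): annotations live in an external store keyed by document URL and ontology version, so documents are never modified and re-annotation against a new ontology version simply adds another entry:

```python
from dataclasses import dataclass, field

@dataclass
class AnnotationStore:
    """Hypothetical stand-off annotation index: documents stay untouched,
    annotations are kept separately, per document and per ontology version."""
    index: dict = field(default_factory=dict)

    def annotate(self, url, ontology_version, annotations):
        self.index.setdefault(url, {})[ontology_version] = annotations

    def lookup(self, url, ontology_version):
        # A page may carry annotations for several ontology versions at once.
        return self.index.get(url, {}).get(ontology_version, [])

store = AnnotationStore()
store.annotate("http://example.org/seminar.html", "v2",
               [("speaker", "Dr. Steals"), ("location", "Dean Hall")])
print(store.lookup("http://example.org/seminar.html", "v2"))
```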