
Data Integration: A “Killer App” for Multi-Strategy Learning

Data Integration: A “Killer App” for Multi-Strategy Learning. Pedro Domingos. Joint work with AnHai Doan & Alon Levy, Department of Computer Science & Engineering, University of Washington. Overview: data integration & XML; schema matching; multi-strategy learning; prototype system & experiments.





Presentation Transcript


  1. Data Integration: A “Killer App” for Multi-Strategy Learning • Pedro Domingos • Joint work with AnHai Doan & Alon Levy • Department of Computer Science & Engineering, University of Washington

  2. Overview • Data integration & XML • Schema matching • Multi-strategy learning • Prototype system & experiments • Related work • Future work • Summary

  3. Data Integration • [Figure: the query “Find houses with four bathrooms and price under $500,000” is posed against a mediated schema; one wrapper per source connects the mediated schema to the source schemas of realestate.com, homeseekers.com and superhomes.com]

  4. Why Data Integration Matters • Very active area in database & AI • research / workshops • start-ups • Large organizations • multiple databases with differing schemas • Data warehousing • The Web: HTML sources • The Web: XML sources

  5. XML • Extensible Markup Language • introduced in 1996 • The standard for data publishing & exchange • replaces HTML & proprietary formats • embraced by database/web/e-commerce communities • XML versus HTML • both use tags to mark up data elements • HTML tags specify format • XML tags define meaning • relationships among elements provided via nesting

  6. Example • HTML: <h1> Residential Listings </h1> <ul> House For Sale <li> location: Seattle, WA, USA <li> agent-phone: (206) 729 0831 <li> listed-price: $250,000 <li> comments: Fantastic house ... </ul> <hr> <ul> House For Sale ... </ul> ... • XML: <residential-listings> <house> <location> <city> Seattle </city> <state> WA </state> <country> USA </country> </location> <agent-phone> (206) 729 0831 </agent-phone> <listed-price> $250,000 </listed-price> <comments> Fantastic house ... </comments> </house> ... </residential-listings>

  7. XML DTD • Document Type Definition • BNF grammar • constraints on element structure: type, order, # of times • A DTD can be visualized as a tree • A real-estate DTD: <!ELEMENT residential-listings (house*)> <!ELEMENT house (location?, agent-phone, listed-price, comments?)> <!ELEMENT location (city, state, country?)>

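  8. Semantic Mappings between Schemas • Mediated & source schemas = XML DTDs • [Figure: two DTD trees with mappings between their elements. Mediated schema: house with address, contact (name, phone), num-baths, amenities. Source schema: house with location, contact-info (agent-name, agent-phone), full-baths, half-baths, handicap-equipped]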

  9. Map of the Problem • [Figure: diagram of the source-description problem. Labels: source descriptions; schema matching, data translation, scope, completeness, reliability, query capability; 1-1 mappings, complex mappings; leaf elements, higher-level elements]

  10. Current State of Affairs • Largely done by hand • labor intensive & error prone • key bottleneck in building applications • Will only be exacerbated • data sharing & XML become pervasive • proliferation of DTDs • translation of legacy data • Need automatic approaches to scale up!

  11. Our Approach • Use machine learning to match schemas • Basic idea 1. create training data • manually map a set of sources to mediated schema 2. train system on training data • learns from • name of schema elements • format of values • frequency of words & symbols • characteristics of value distribution • proximity, position, structure, ... 3. system proposes mappings for subsequent sources

  12. Example • Mediated schema: address, phone, price, description • Source realestate.com: <house> <location> Seattle, WA </location> <agent-phone> (206) 729 0831 </agent-phone> <listed-price> $250,000 </listed-price> <comments> Fantastic house ... </comments> </house> ... • Values per source element • location: Seattle, WA; Seattle, WA; Dallas, TX; ... • agent-phone: (206) 729 0831; (206) 321 4571; (214) 722 4035; ... • listed-price: $250,000; $162,000; $180,000; ... • comments: Fantastic house ...; Great ...; Hurry! ...; ...

  13. Multi-Strategy Learning • Use a set of base learners • each exploits certain types of information • Match schema elements of a new source • apply the learners • combine their predictions using a meta-learner • Meta-learner • measures base learner accuracy on training data • weighs each learner based on its accuracy
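
The weighting idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two toy base learners (`symbol_learner`, `length_learner`), the validation examples and the choice of plain accuracy as the weight are all hypothetical and are not taken from the LSD system.

```python
def top_label(prediction):
    """Return the label with the highest confidence score."""
    return max(prediction, key=prediction.get)


def learner_accuracy(learner, labeled_examples):
    """Fraction of labeled examples whose top prediction is correct."""
    hits = sum(1 for value, label in labeled_examples
               if top_label(learner(value)) == label)
    return hits / len(labeled_examples)


# Two hypothetical base learners, each exploiting one kind of evidence.
def symbol_learner(value):
    if "$" in value:
        return {"price": 0.9, "phone": 0.1}
    return {"price": 0.2, "phone": 0.8}


def length_learner(value):
    if len(value) > 10:
        return {"phone": 0.6, "price": 0.4}
    return {"phone": 0.4, "price": 0.6}


# Hypothetical labeled examples used to measure each learner.
VALIDATION = [
    ("$250,000", "price"),
    ("(206) 729 0831", "phone"),
    ("$1,250,000,000", "price"),
    ("555 0100", "phone"),
]

LEARNERS = {"symbol": symbol_learner, "length": length_learner}

# One weight per base learner: its measured accuracy.
weights = {name: learner_accuracy(fn, VALIDATION)
           for name, fn in LEARNERS.items()}
print(weights)
```

On these toy examples the symbol-based learner gets every example right and the length-based learner gets half of them right, so the meta-learner would trust the former twice as much.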

  14. Learners • Input • schema information: name, proximity, structure, ... • data information: value, format, ... • Output • prediction weighted by confidence score • Example learners • name matcher • agent-name => (name,0.7), (phone,0.3) • Naive Bayes • “Seattle, WA” => (address,0.8), (name,0.2) • “Great location ...” => (description,0.9), (address,0.1)

  15. Training the Learners • Mediated schema: address, phone, price, description • Source realestate.com (elements location, agent-phone, listed-price, comments): <house> <location> Seattle, WA </location> <agent-phone> (206) 729 0831 </agent-phone> <listed-price> $ 250,000 </listed-price> <comments> Fantastic house ... </comments> </house> ... • Name Matcher training examples: (location, address), (agent-phone, phone), (listed-price, price), (comments, description), ... • Naive Bayes training examples: (“Seattle, WA”, address), (“(206) 729 0831”, phone), (“$ 250,000”, price), (“Fantastic house ...”, description), ...
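
A minimal sketch of how the two kinds of training examples on this slide could be produced from a manually mapped source. The `<listings>` wrapper element, the second listing and the function name `build_training_examples` are assumptions made for illustration; only the tag names and the tag-to-element mapping come from the slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical source document in the style of the realestate.com example.
SOURCE_XML = """
<listings>
  <house>
    <location>Seattle, WA</location>
    <agent-phone>(206) 729 0831</agent-phone>
    <listed-price>$250,000</listed-price>
    <comments>Fantastic house</comments>
  </house>
  <house>
    <location>Dallas, TX</location>
    <agent-phone>(214) 722 4035</agent-phone>
    <listed-price>$180,000</listed-price>
    <comments>Great location</comments>
  </house>
</listings>
"""

# Manual mapping from source tags to mediated-schema elements.
MANUAL_MAPPING = {
    "location": "address",
    "agent-phone": "phone",
    "listed-price": "price",
    "comments": "description",
}


def build_training_examples(xml_text, mapping):
    """Return (name examples, value examples) for the base learners."""
    root = ET.fromstring(xml_text)
    # Name-based learners train on (source tag, mediated element) pairs.
    name_examples = sorted(mapping.items())
    # Content-based learners train on (data value, mediated element) pairs.
    value_examples = []
    for elem in root.iter():
        label = mapping.get(elem.tag)
        text = (elem.text or "").strip()
        if label is not None and text:
            value_examples.append((text, label))
    return name_examples, value_examples


name_examples, value_examples = build_training_examples(SOURCE_XML, MANUAL_MAPPING)
print(name_examples)
print(value_examples)
```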

  16. Applying the Learned Models • Mediated schema: address, phone, price, description • [Figure: for the homes.com element area, with values Seattle, WA; Kent, WA; Austin, TX; Seattle, WA, the Name Matcher and Naive Bayes predictions for each value are merged by the Meta-learner (per-value results: address, address, description, address); a Combiner then assigns the element as a whole to address]

  17. The LSD System • Base learners/modules • name matcher • Naive Bayes • WHIRL nearest-neighbor classifier [Cohen&Hirsh-KDD98] • county-name recognizer • Meta-learner • stacking [Ting&Witten99, Wolpert92]

  18. Name Matcher • Matches based on names • including all names on path from root to current node • allowing synonyms • Good for ... • specific, descriptive names: agent-phone, listed-price • Bad for ... • vacuous names: item, listings • partially specified, ambiguous names: office (for “office phone”)
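
A rough sketch of a name matcher along these lines. The slide does not say how LSD scores name similarity, so the token-overlap (Jaccard) score used here is a stand-in, and the synonym table and function names are hypothetical.

```python
import re

# Hypothetical synonym table mapping source words to mediated-schema words.
SYNONYMS = {"location": "address", "comments": "description", "telephone": "phone"}


def tokens(path):
    """Split every name on the root-to-node path into canonical words."""
    words = re.findall(r"[a-z]+", " ".join(path).lower())
    return {SYNONYMS.get(word, word) for word in words}


def name_matcher(source_path, mediated_names):
    """Score each mediated element by word overlap with the source path."""
    src = tokens(source_path)
    scores = {}
    for name in mediated_names:
        tgt = tokens([name])
        union = src | tgt
        scores[name] = len(src & tgt) / len(union) if union else 0.0
    total = sum(scores.values())
    if total == 0:
        # Vacuous names carry no evidence: fall back to a uniform guess.
        return {name: 1.0 / len(mediated_names) for name in mediated_names}
    return {name: score / total for name, score in scores.items()}


MEDIATED = ["address", "phone", "price", "description"]
descriptive = name_matcher(["house", "agent-phone"], MEDIATED)
vacuous = name_matcher(["listings", "item"], MEDIATED)
print(descriptive)
print(vacuous)
```

The second call illustrates the weakness listed on the slide: a vacuous name such as item shares no words with any mediated element, so the matcher has nothing to go on.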

  19. Naive Bayes Learner • Exploits frequencies of words & symbols • Good for ... • elements with words/symbols that are strongly indicative • examples: • “fantastic” & “great” in house descriptions • $ in prices, parentheses in phone numbers • Bad for ... • short, numeric elements: num-baths, num-bedrooms
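
A small sketch of a Naive Bayes learner over words and symbols, assuming a multinomial model with add-one smoothing. The tokenizer, the class name and the training values are illustrative choices, not details taken from LSD.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Words, digit runs and single symbols such as "$" or "(" are tokens."""
    return re.findall(r"[a-z]+|\d+|[^\w\s]", text.lower())


class NaiveBayes:
    def fit(self, examples):
        self.token_counts = defaultdict(Counter)
        self.label_counts = Counter()
        for text, label in examples:
            self.label_counts[label] += 1
            self.token_counts[label].update(tokenize(text))
        self.vocab = {tok for counts in self.token_counts.values() for tok in counts}
        return self

    def predict(self, text):
        """Return a confidence score per label, normalized to sum to 1."""
        n = sum(self.label_counts.values())
        log_scores = {}
        for label, count in self.label_counts.items():
            total = sum(self.token_counts[label].values())
            score = math.log(count / n)
            for tok in tokenize(text):
                # Add-one smoothing, with one extra slot for unseen tokens.
                p = (self.token_counts[label][tok] + 1) / (total + len(self.vocab) + 1)
                score += math.log(p)
            log_scores[label] = score
        top = max(log_scores.values())
        exp = {label: math.exp(s - top) for label, s in log_scores.items()}
        z = sum(exp.values())
        return {label: v / z for label, v in exp.items()}


# Hypothetical training values in the style of the real-estate example.
TRAINING = [
    ("Seattle, WA", "address"), ("Dallas, TX", "address"),
    ("(206) 729 0831", "phone"), ("(214) 722 4035", "phone"),
    ("$250,000", "price"), ("$162,000", "price"),
    ("Fantastic house", "description"), ("Great location, hurry!", "description"),
]

nb = NaiveBayes().fit(TRAINING)
phone_pred = nb.predict("(425) 555 0100")
price_pred = nb.predict("$310,000")
print(phone_pred)
print(price_pred)
```

Neither test value appears in the training data; the parentheses and the dollar sign are what tip the prediction, which is the kind of indicative symbol the slide refers to.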

  20. WHIRL Nearest-Neighbor Classifier • Similarity-based • stores all examples seen so far • classifies a new example based on similarity to training examples • IR document similarity metric • Good for ... • long, textual elements: house description, names • limited, descriptive set of values: color (blue, red, ...) • Bad for ... • short, numeric elements: num-baths, num-bedrooms
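
The similarity-based idea can be sketched with TF-IDF vectors and cosine similarity, a common IR document similarity metric. This is a simplified stand-in and not the WHIRL algorithm itself: the idf formula, the rule of scoring a label by its single most similar stored example, and all names and data below are assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


class TfidfNearestNeighbor:
    def fit(self, examples):
        """Store every training example as a unit-length TF-IDF vector."""
        docs = [Counter(tokenize(text)) for text, _ in examples]
        self.labels = [label for _, label in examples]
        n = len(docs)
        df = Counter(tok for doc in docs for tok in doc)
        self.idf = {tok: math.log(1 + n / count) for tok, count in df.items()}
        self.vectors = [self._vector(doc) for doc in docs]
        return self

    def _vector(self, counts):
        vec = {t: c * self.idf[t] for t, c in counts.items() if t in self.idf}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        return {t: w / norm for t, w in vec.items()} if norm else {}

    def predict(self, text):
        """Score each label by its most similar stored example (cosine)."""
        query = self._vector(Counter(tokenize(text)))
        best = {}
        for vec, label in zip(self.vectors, self.labels):
            sim = sum(w * vec.get(t, 0.0) for t, w in query.items())
            best[label] = max(best.get(label, 0.0), sim)
        return best


# Hypothetical stored examples.
EXAMPLES = [
    ("Fantastic house with great view", "description"),
    ("Great location close to schools", "description"),
    ("blue", "color"), ("red", "color"),
    ("Richard Smith", "name"), ("Mary Jones", "name"),
]

knn = TfidfNearestNeighbor().fit(EXAMPLES)
text_pred = knn.predict("great house near schools")
color_pred = knn.predict("blue")
numeric_pred = knn.predict("3")
print(text_pred)
print(color_pred)
print(numeric_pred)
```

The last call shows the weakness noted on the slide: a short numeric value shares no terms with any stored example, so every label scores zero.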

  21. County-Name Recognizer • Stores all county names, obtained from the Web • Verifies if the input name is a county name • Essential to matching a county-name element
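
A recognizer of this kind amounts to a dictionary lookup. The sketch below assumes a tiny hard-coded set of county names in place of the list obtained from the Web; the output label `county-name` and the function names are hypothetical.

```python
# A few real U.S. county names standing in for the full list from the Web.
COUNTY_NAMES = {"king", "pierce", "snohomish", "travis", "dallas"}


def normalize(value):
    """Lower-case the value and drop a trailing " county" if present."""
    v = value.strip().lower()
    if v.endswith(" county"):
        v = v[: -len(" county")].strip()
    return v


def county_recognizer(values):
    """Confidence that an element holds county names: fraction of known values."""
    cleaned = [normalize(v) for v in values if v.strip()]
    if not cleaned:
        return {"county-name": 0.0}
    hits = sum(1 for v in cleaned if v in COUNTY_NAMES)
    return {"county-name": hits / len(cleaned)}


county_pred = county_recognizer(["King", "Pierce County", "Travis"])
other_pred = county_recognizer(["Seattle, WA", "$250,000"])
print(county_pred)
print(other_pred)
```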

  22. Meta-Learner: Stacking • Training • uses training data to learn weights • one for each (base learner, mediated-schema element) • Combining predictions • for each mediated-schema element • computes weighted sum of base-learner confidence scores • picks mediated-schema element with highest sum
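
The combination step described here can be written as a weighted sum per mediated-schema element. In this sketch the weights are made-up numbers standing in for values learned during training, and the confidence scores are hypothetical as well.

```python
def combine(predictions, weights):
    """Weighted sum of base-learner confidences per mediated-schema element.

    predictions: {learner: {element: confidence}}
    weights:     {(learner, element): weight}
    """
    totals = {}
    for learner, scores in predictions.items():
        for element, confidence in scores.items():
            w = weights.get((learner, element), 0.0)
            totals[element] = totals.get(element, 0.0) + w * confidence
    # The element with the highest weighted sum is the proposed match.
    best = max(totals, key=totals.get)
    return best, totals


# Hypothetical base-learner outputs for one source element.
predictions = {
    "name_matcher": {"address": 0.3, "description": 0.7},
    "naive_bayes": {"address": 0.8, "description": 0.2},
}

# Hypothetical learned weights, one per (base learner, mediated-schema element).
weights = {
    ("name_matcher", "address"): 0.4,
    ("name_matcher", "description"): 0.4,
    ("naive_bayes", "address"): 0.9,
    ("naive_bayes", "description"): 0.9,
}

best, totals = combine(predictions, weights)
print(best, totals)
```

Because the weights are indexed by element as well as by learner, a learner can be trusted for one mediated element and discounted for another.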

  23. Experiments

  24. Reasons for Incorrect Matchings • Unfamiliarity • suburb • solution: add a suburb-name recognizer • Insufficient information • correctly identified the general type • failed to pinpoint the exact type • <agent-name>Richard Smith</agent-name><phone> (206) 234 5412 </phone> • solution: add a proximity learner

  25. Experiments: Summary • Multi-strategy learning • better performance than any single learner • Accuracy of 100% unlikely to be reached • difficult even for humans • Lots of room for improvement • more learners • better learning algorithms

  26. Related Work • Rule-based approaches • TRANSCM [Milo&Zohar98], ARTEMIS [Castano&Antonellis99], [Palopoli et al. 98] • utilize only schema information • Learner-based approaches • SEMINT [Li&Clifton94], ILA [Perkowitz&Etzioni95] • employ a single learner, limited applicability

  27. Future Work • [Figure: diagram of the source-description problem. Labels: source descriptions; schema matching, data translation, scope, completeness, reliability, query capability; 1-1 mappings, complex mappings; leaf elements, higher-level elements]

  28. Future Work • Improve matching accuracy • more learners, more domains • Incorporate domain knowledge • semantic integrity constraints • concept hierarchy of mediated-schema elements • Learn with structured data

  29. Learning with Structured Data • Each example with >1 level of structure • Generative model for XML • XML classifier • XML: “killer app” for relational learning

  30. Summary • Schema matching • automated by learning • Multi-strategy learning is essential • handles different types of data • incorporates different types of domain knowledge • easy to incorporate new learners • alleviates effects of noise & dirty data • Implemented LSD • promising results with initial experiments
