Data Integration: A “Killer App” for Multi-Strategy Learning

Pedro Domingos • Joint work with AnHai Doan & Alon Levy • Department of Computer Science & Engineering, University of Washington

Overview • Data integration & XML • Schema matching • Multi-strategy learning • Prototype system & experiments • Related work • Future work • Summary

Data Integration • Example query: find houses with four bathrooms and price under $500,000 • The query is posed against a single mediated schema • Each source (realestate.com, homeseekers.com, superhomes.com) has its own source schema and is accessed through a wrapper

Why Data Integration Matters • Very active area in database & AI • research / workshops • start-ups • Large organizations • multiple databases with differing schemas • Data warehousing • The Web: HTML sources • The Web: XML sources

XML • Extensible Markup Language • introduced in 1996 • The standard for data publishing & exchange • replaces HTML & proprietary formats • embraced by database/web/e-commerce communities • XML versus HTML • both use tags to mark up data elements • HTML tags specify format • XML tags define meaning • relationships among elements provided via nesting

Example: the same listing in HTML and in XML
HTML:
<h1> Residential Listings </h1>
<ul> House For Sale
  <li> location: Seattle, WA, USA
  <li> agent-phone: (206) 729 0831
  <li> listed-price: $250,000
  <li> comments: Fantastic house ...
</ul>
<hr>
<ul> House For Sale ... </ul>
...
XML:
<residential-listings>
  <house>
    <location>
      <city> Seattle </city>
      <state> WA </state>
      <country> USA </country>
    </location>
    <agent-phone> (206) 729 0831 </agent-phone>
    <listed-price> $250,000 </listed-price>
    <comments> Fantastic house ... </comments>
  </house>
  ...
</residential-listings>
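Because XML tags carry meaning, the values of each element can be pulled out mechanically. A minimal sketch, assuming Python's standard xml.etree.ElementTree module and a trimmed copy of the listing above; the helper name leaf_values is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Trimmed copy of the slide's XML example (ellipses removed so it parses).
SAMPLE = """
<residential-listings>
  <house>
    <location>
      <city>Seattle</city> <state>WA</state> <country>USA</country>
    </location>
    <agent-phone>(206) 729 0831</agent-phone>
    <listed-price>$250,000</listed-price>
    <comments>Fantastic house</comments>
  </house>
</residential-listings>
"""

def leaf_values(xml_text):
    """Map each leaf tag name to the list of text values found under it."""
    values = defaultdict(list)
    for elem in ET.fromstring(xml_text).iter():
        if len(elem) == 0:  # no child elements, so this is a leaf
            values[elem.tag].append((elem.text or "").strip())
    return dict(values)

columns = leaf_values(SAMPLE)
print(columns["city"], columns["listed-price"])
```

Doing the same for the HTML version would need page-specific rules, since its tags describe only layout.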

XML DTD • Document Type Definition • BNF grammar • constraints on element structure: type, order, number of occurrences • A DTD can be visualized as a tree • A real-estate DTD:
<!ELEMENT residential-listings (house*)>
<!ELEMENT house (location?, agent-phone, listed-price, comments?)>
<!ELEMENT location (city, state, country?)>
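The order and occurrence constraints in a content model behave like a regular expression over child tag names. A minimal sketch under that reading; it handles only comma-separated names with optional ?, * or + suffixes (no nested groups or alternatives), and both function names are hypothetical.

```python
import re

def content_model_to_regex(model):
    """Translate a simple DTD content model such as '(city, state, country?)'
    into a regex over a sequence of child tag names, each followed by a space."""
    parts = []
    for item in model.strip("() ").split(","):
        item = item.strip()
        suffix = item[-1] if item[-1] in "?*+" else ""
        name = item.rstrip("?*+")
        parts.append("(?:%s )%s" % (re.escape(name), suffix))
    return re.compile("".join(parts))

def children_valid(model, child_tags):
    """Check whether a list of child tag names satisfies the content model."""
    sequence = "".join(tag + " " for tag in child_tags)
    return content_model_to_regex(model).fullmatch(sequence) is not None

HOUSE = "(location?, agent-phone, listed-price, comments?)"
print(children_valid(HOUSE, ["agent-phone", "listed-price"]))   # optional parts omitted
print(children_valid(HOUSE, ["listed-price", "agent-phone"]))   # wrong order
```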

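Semantic Mappings between Schemas • Mediated & source schemas = XML DTDs • [Figure: two DTD trees whose elements are to be mapped to each other] • Tree 1: house with children address, contact (name, phone), num-baths, amenities • Tree 2: house with children location, contact-info (agent-name, agent-phone), full-baths, half-baths, handicap-equipped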

Map of the Problem • source descriptions • schema matching • data translation • scope • completeness • reliability • query capability • 1-1 mappings vs. complex mappings • leaf elements vs. higher-level elements

Current State of Affairs • Schema matching is largely done by hand • labor intensive & error prone • key bottleneck in building applications • The problem will only get worse as • data sharing & XML become pervasive • DTDs proliferate • legacy data must be translated • Need automatic approaches to scale up!

Our Approach • Use machine learning to match schemas • Basic idea • 1. create training data: manually map a set of sources to the mediated schema • 2. train system on training data; it learns from: names of schema elements, format of values, frequency of words & symbols, characteristics of value distribution, proximity, position, structure, ... • 3. system proposes mappings for subsequent sources

Example
Mediated schema: address, phone, price, description
Source realestate.com:
<house>
  <location> Seattle, WA </location>
  <agent-phone> (206) 729 0831 </agent-phone>
  <listed-price> $250,000 </listed-price>
  <comments> Fantastic house ... </comments>
</house>
...
Data values collected for each source element:
location: Seattle, WA; Seattle, WA; Dallas, TX; ...
agent-phone: (206) 729 0831; (206) 321 4571; (214) 722 4035; ...
listed-price: $250,000; $162,000; $180,000; ...
comments: Fantastic house ...; Great ...; Hurry! ...; ...

Multi-Strategy Learning • Use a set of base learners • each exploits certain types of information • Match schema elements of a new source • apply the learners • combine their predictions using a meta-learner • Meta-learner • measures base learner accuracy on training data • weighs each learner based on its accuracy

Learners • Input • schema information: name, proximity, structure, ... • data information: value, format, ... • Output • prediction weighted by confidence score • Example learners • name matcher • agent-name => (name,0.7), (phone,0.3) • Naive Bayes • “Seattle, WA” => (address,0.8), (name,0.2) • “Great location ...” => (description,0.9), (address,0.1)

Training the Learners
Mediated schema: address, phone, price, description
Source realestate.com (elements: location, agent-phone, listed-price, comments):
<house>
  <location> Seattle, WA </location>
  <agent-phone> (206) 729 0831 </agent-phone>
  <listed-price> $250,000 </listed-price>
  <comments> Fantastic house ... </comments>
</house>
...
Training examples for the Name Matcher: (location, address), (agent-phone, phone), (listed-price, price), (comments, description), ...
Training examples for Naive Bayes: (“Seattle, WA”, address), (“(206) 729 0831”, phone), (“$250,000”, price), (“Fantastic house ...”, description), ...
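The two kinds of training examples can be derived from one hand-made mapping plus the source data. A minimal sketch, assuming each listing has already been flattened into a dictionary from source tag to value; the function name build_training_sets and this data layout are hypothetical.

```python
def build_training_sets(mapping, listings):
    """mapping: source tag -> mediated-schema element (supplied by hand).
    listings: list of dicts, each mapping a source tag to one data value.
    Returns (name examples, value examples) as lists of (input, label) pairs."""
    name_examples = sorted(mapping.items())
    value_examples = [
        (value, mapping[tag])
        for listing in listings
        for tag, value in listing.items()
        if tag in mapping  # skip source tags with no manual mapping
    ]
    return name_examples, value_examples

mapping = {"location": "address", "agent-phone": "phone",
           "listed-price": "price", "comments": "description"}
listings = [{"location": "Seattle, WA", "agent-phone": "(206) 729 0831",
             "listed-price": "$250,000", "comments": "Fantastic house"}]
names, values = build_training_sets(mapping, listings)
print(names)
print(values)
```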

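Applying the Learned Models • New source: homes.com; mediated schema: address, phone, price, description • Source element area with values: Seattle, WA; Kent, WA; Austin, TX; Seattle, WA • For each value, the meta-learner merges the Name Matcher and Naive Bayes predictions: address, address, description, address • The combiner merges the per-value predictions into one mapping for the element: area => address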

The LSD System • Base learners/modules • name matcher • Naive Bayes • WHIRL nearest-neighbor classifier [Cohen&Hirsh-KDD98] • county-name recognizer • Meta-learner • stacking [Ting&Witten99, Wolpert92]

Name Matcher • Matches based on names • including all names on path from root to current node • allowing synonyms • Good for ... • specific, descriptive names: agent-phone, listed-price • Bad for ... • vacuous names: item, listings • partially specified, ambiguous names: office (for “office phone”)
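One simple way to realize name-based matching is token overlap between the source path and each mediated-schema element name. The sketch below uses Jaccard similarity with a tiny synonym table; the similarity measure, the table contents, and the function names are illustrative assumptions, not the matcher used in LSD.

```python
import re

# Hypothetical synonym table: word -> extra words treated as equivalent.
SYNONYMS = {"location": {"address"}, "comments": {"description"}}

def tokens(path):
    """Split a root-to-node path such as 'house/agent-phone' into word
    tokens and add any known synonyms."""
    words = set(re.findall(r"[a-z]+", path.lower()))
    for word in list(words):
        words |= SYNONYMS.get(word, set())
    return words

def name_match(path, mediated_elements):
    """Score each mediated-schema element by Jaccard overlap of name tokens,
    then normalise the scores to sum to 1 (uniform if nothing overlaps)."""
    source = tokens(path)
    raw = {}
    for element in mediated_elements:
        target = tokens(element)
        raw[element] = len(source & target) / len(source | target)
    total = sum(raw.values())
    if total == 0:
        return {e: 1 / len(raw) for e in raw}
    return {e: s / total for e, s in raw.items()}

MEDIATED = ["address", "phone", "price", "description"]
print(name_match("house/agent-phone", MEDIATED))
print(name_match("item", MEDIATED))  # vacuous name: no evidence either way
```

A vacuous name such as item yields a uniform score, which is the "bad for" case on the slide.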

Naive Bayes Learner • Exploits frequencies of words & symbols • Good for ... • elements with words/symbols that are strongly indicative • examples: • “fantastic” & “great” in house descriptions • $ in prices, parentheses in phone numbers • Bad for ... • short, numeric elements: num-baths, num-bedrooms
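To make the word-and-symbol frequency idea concrete, here is a minimal multinomial Naive Bayes sketch with add-one smoothing. The tokenizer, the class name, and the training values (taken from the earlier example slides) are illustrative assumptions rather than LSD's actual implementation.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase words, digit runs, and individual symbols such as $ ( ) ,"""
    return re.findall(r"[a-z]+|\d+|[^\sa-z\d]", text.lower())

class NaiveBayes:
    def fit(self, examples):
        """examples: list of (text, label) pairs."""
        self.label_counts = Counter(label for _, label in examples)
        self.token_counts = defaultdict(Counter)
        for text, label in examples:
            self.token_counts[label].update(tokenize(text))
        self.vocab = {t for c in self.token_counts.values() for t in c}
        return self

    def predict(self, text):
        """Return {label: confidence}, with confidences summing to 1."""
        n = sum(self.label_counts.values())
        log_scores = {}
        for label, count in self.label_counts.items():
            total = sum(self.token_counts[label].values())
            score = math.log(count / n)  # class prior
            for tok in tokenize(text):
                # add-one smoothing so unseen tokens do not zero the product
                score += math.log((self.token_counts[label][tok] + 1)
                                  / (total + len(self.vocab) + 1))
            log_scores[label] = score
        top = max(log_scores.values())
        exp = {l: math.exp(s - top) for l, s in log_scores.items()}
        z = sum(exp.values())
        return {l: v / z for l, v in exp.items()}

nb = NaiveBayes().fit([
    ("Seattle, WA", "address"), ("Dallas, TX", "address"),
    ("(206) 729 0831", "phone"), ("(206) 321 4571", "phone"),
    ("$250,000", "price"), ("$162,000", "price"),
    ("Fantastic house", "description"), ("Great location", "description"),
])
print(nb.predict("(214) 722 4035"))
```

The parentheses carry the phone prediction here even though none of the digit groups were seen in training.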

WHIRL Nearest-Neighbor Classifier • Similarity-based • stores all examples seen so far • classifies a new example based on similarity to training examples • IR document similarity metric • Good for ... • long, textual elements: house description, names • limited, descriptive set of values: color (blue, red, ...) • Bad for ... • short, numeric elements: num-baths, num-bedrooms
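The similarity-based idea can be sketched with TF-IDF vectors and cosine similarity, a common IR document similarity metric. This is a simplified stand-in, not WHIRL itself; the class name and the sample texts are assumptions.

```python
import math
import re
from collections import Counter

def words(text):
    return re.findall(r"[a-z]+", text.lower())

class TfidfNearestNeighbor:
    def fit(self, examples):
        """Store every (text, label) example together with its TF-IDF vector."""
        docs = [Counter(words(text)) for text, _ in examples]
        df = Counter(w for doc in docs for w in doc)  # document frequency
        self.idf = {w: math.log(1 + len(docs) / n) for w, n in df.items()}
        self.stored = [(self._vector(doc), label)
                       for doc, (_, label) in zip(docs, examples)]
        return self

    def _vector(self, counts):
        """Unit-length TF-IDF vector; words unseen in training are dropped."""
        vec = {w: c * self.idf[w] for w, c in counts.items() if w in self.idf}
        norm = math.sqrt(sum(v * v for v in vec.values()))
        return {w: v / norm for w, v in vec.items()} if norm else {}

    def predict(self, text):
        """Return {label: best cosine similarity to any stored example}."""
        query = self._vector(Counter(words(text)))
        scores = {}
        for vec, label in self.stored:
            sim = sum(v * vec.get(w, 0.0) for w, v in query.items())
            scores[label] = max(scores.get(label, 0.0), sim)
        return scores

knn = TfidfNearestNeighbor().fit([
    ("Fantastic house with great view", "description"),
    ("Great location close to schools", "description"),
    ("Richard Smith", "name"),
    ("Mary Jones", "name"),
])
print(knn.predict("great house near schools"))
print(knn.predict("3"))  # short numeric value: no words to compare
```

The second call returns zero similarity for every label, which illustrates why short numeric elements are a poor fit for this learner.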

County-Name Recognizer • Stores all county names, obtained from the Web • Verifies if the input name is a county name • Essential to matching a county-name element
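A recognizer of this kind reduces to a dictionary lookup. A minimal sketch, assuming a tiny illustrative sample of county names in place of the full list gathered from the Web; the function names and the county-name element label are hypothetical.

```python
# Illustrative sample only; a real recognizer would store the full list.
COUNTY_NAMES = {"king", "pierce", "snohomish", "travis", "dallas"}

def normalise(value):
    """Lowercase, trim, and drop a trailing ' county' if present."""
    value = value.strip().lower()
    if value.endswith(" county"):
        value = value[: -len(" county")]
    return value

def county_recognizer(values, element="county-name", known=COUNTY_NAMES):
    """Confidence that a column of values holds county names:
    the fraction of values found in the stored list."""
    if not values:
        return {element: 0.0}
    hits = sum(1 for v in values if normalise(v) in known)
    return {element: hits / len(values)}

print(county_recognizer(["King", "Pierce County", "Travis"]))
print(county_recognizer(["Seattle, WA", "King"]))
```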

Meta-Learner: Stacking • Training • uses training data to learn weights • one for each (base learner, mediated-schema element) • Combining predictions • for each mediated-schema element • computes weighted sum of base-learner confidence scores • picks mediated-schema element with highest sum
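The combining step above can be sketched directly: one weight per (base learner, mediated-schema element), a weighted sum of confidences, and an argmax. For the weights, the sketch uses a simple accuracy-based estimate as a stand-in; stacking implementations often fit them by regression instead, and nothing here is claimed to be LSD's exact procedure. All names and numbers are illustrative.

```python
from collections import defaultdict

def estimate_weights(predictions, truths):
    """predictions: {learner: [{element: confidence}, ...]} on held-out
    training examples; truths: the correct elements, in the same order.
    Weight for (learner, element) = fraction of times the learner was right
    when its top-scoring prediction was that element."""
    weights = {}
    for learner, preds in predictions.items():
        right, total = defaultdict(int), defaultdict(int)
        for pred, truth in zip(preds, truths):
            top = max(pred, key=pred.get)
            total[top] += 1
            right[top] += (top == truth)
        for element in total:
            weights[(learner, element)] = right[element] / total[element]
    return weights

def combine(weights, learner_outputs, default=0.0):
    """Weighted sum of base-learner confidences per mediated-schema element;
    returns the best element and all combined scores."""
    combined = defaultdict(float)
    for learner, pred in learner_outputs.items():
        for element, confidence in pred.items():
            combined[element] += weights.get((learner, element), default) * confidence
    best = max(combined, key=combined.get)
    return best, dict(combined)

weights = {("name", "address"): 0.4, ("name", "description"): 0.4,
           ("nb", "address"): 0.9, ("nb", "description"): 0.8}
outputs = {"name": {"address": 0.2, "description": 0.8},
           "nb": {"address": 0.8, "description": 0.2}}
best, scores = combine(weights, outputs)
print(best, scores)
```

In this made-up case the name matcher prefers description, but the more heavily weighted Naive Bayes prediction tips the combined result to address.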

Experiments

Reasons for Incorrect Matchings • Unfamiliarity • suburb • solution: add a suburb-name recognizer • Insufficient information • correctly identified the general type • failed to pinpoint the exact type • <agent-name>Richard Smith</agent-name><phone> (206) 234 5412 </phone> • solution: add a proximity learner

Experiments: Summary • Multi-strategy learning • better performance than any single learner • Accuracy of 100% unlikely to be reached • difficult even for humans • Lots of room for improvement • more learners • better learning algorithms

Related Work • Rule-based approaches • TRANSCM [Milo&Zohar98], ARTEMIS [Castano&Antonellis99], [Palopoli et al. 98] • utilize only schema information • Learner-based approaches • SEMINT [Li&Clifton94], ILA [Perkowitz&Etzioni95] • employ a single learner, limited applicability

Future Work • Map of the problem revisited • source descriptions • schema matching • data translation • scope • completeness • reliability • query capability • 1-1 mappings vs. complex mappings • leaf elements vs. higher-level elements

Future Work • Improve matching accuracy • more learners, more domains • Incorporate domain knowledge • semantic integrity constraints • concept hierarchy of mediated-schema elements • Learn with structured data

Learning with Structured Data • Each example with >1 level of structure • Generative model for XML • XML classifier • XML: “killer app” for relational learning

Summary • Schema matching • automated by learning • Multi-strategy learning is essential • handles different types of data • incorporates different types of domain knowledge • easy to incorporate new learners • alleviates effects of noise & dirty data • Implemented LSD • promising results with initial experiments
