Piazza: Data Management Infrastructure for Semantic Web Applications

Alon Y. Halevy, Zachary G. Ives, Peter Mork, Igor Tatarinov
Speaker: Sergey Chernov
Tutor: Jens Graupmann
Peer-to-Peer Information Systems – WS 03/04

Outline
1. Introduction. Semantic Web
2. Piazza: system overview
3. Implementation details
   3.1 Mapping language
   3.2 Query answering algorithm
4. Conclusions

Introduction
• Goal:
  • Data integration and knowledge management
• Problem:
  • Web data lacks machine-understandable semantics
• Solution:
  • Semantic Web?

The Semantic Web*
• Web sites include structural annotations.
• You can pose meaningful queries on them.
• Ontologies provide the semantic glue.
• Internal implementation of web sites is left open.
• Agents perform tasks:
  • Query one or more web sites
  • Perform updates (e.g., set schedules)
  • Coordinate actions
  • Trust each other (or not)
• I.e., agents operating on a gigantic heterogeneous distributed database.
(*View by A. Halevy)

General requirements
• Robust infrastructure for querying
  • Peer data management systems
• Facilitate mapping between different structures. Need tools for:
  • Locating relevant structures
  • Easily joining the semantic web
• Get data into structured form
  • Should we worry about the legacy web?

Using views for specifying mappings
• Local-As-View (LAV): data sources are described as views over the mediated schema.
• Global-As-View (GAV): the mediated schema is described as a set of views over the data sources.
[Figure: two diagrams, each showing a mediated schema together with Site A, Site B, and Site C]
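The difference between the two styles can be sketched in a few lines of Python. This is a toy illustration, not Piazza code: the relation `Pub(title, year)`, the site names, the sample tuples, and the year-range form of the LAV descriptions are all assumptions made for the example.

```python
# Hypothetical source contents: (title, year) tuples held by two sites.
SOURCES = {
    "site_a": {("Piazza", 2003), ("Tukwila", 1999)},
    "site_b": {("Strudel", 1997)},
}

# GAV: the mediated relation Pub is defined as a view (here, a union)
# over the data sources.
def pub_gav():
    return set().union(*SOURCES.values())

def answer_gav(lo, hi):
    """Answer 'Pub tuples with lo <= year <= hi' by unfolding the GAV view."""
    return {(t, y) for (t, y) in pub_gav() if lo <= y <= hi}

# LAV: each source is described as a view over the mediated relation Pub.
# Assumed description format: the site holds Pub tuples with year in [lo, hi].
LAV_DESCRIPTIONS = {"site_a": (1999, 9999), "site_b": (0, 1998)}

def relevant_sources(lo, hi):
    """Sources whose LAV description can overlap the queried year range."""
    return sorted(
        name
        for name, (d_lo, d_hi) in LAV_DESCRIPTIONS.items()
        if d_lo <= hi and lo <= d_hi
    )

def answer_lav(lo, hi):
    """Answer the same query by consulting only the relevant sources."""
    result = set()
    for name in relevant_sources(lo, hi):
        result |= {(t, y) for (t, y) in SOURCES[name] if lo <= y <= hi}
    return result

print(answer_gav(2000, 2005))
print(relevant_sources(2000, 2005))
print(answer_lav(2000, 2005))
```

With GAV the query is answered by expanding the view definition; with LAV the source descriptions are used to decide which sources can contribute at all. Real LAV query answering requires rewriting queries using views, which this sketch does not attempt.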

Mapping
• Mapping "AB" specifies how structured data in the schema of node A is represented in the schema of node B.
[Figure: Site A, Site B, Site C, and a mediated schema, connected by mappings labeled "A-MS", "MS-A", "C-MS", "MS-C", "AB", "BA", "BC", and "CB"]

Piazza: Peer Data-Management System
• Goal:
  • Large-scale autonomous sharing of structured data
• Peer data management system (PDMS)
  • Autonomous peers export data in their own schemas
  • Pair-wise mappings between peers
• Generalization of a data integration system
• NOT a P2P file sharing system

Relationship of PDMS to…
• P2P overlay networks (the "Structured World")
• Data integration systems (no central logical mediated schema)
• Federated databases (scale, ad-hoc nature)
• Distributed databases (no central administration)

Representing Data
• A spectrum of possibilities:
  • Relational tables, some integrity constraints
  • XML: can encode relational, hierarchical
    • XQuery: emerging standard query language (SQL for XML)
  • RDF: "XML on drugs"
    • Sees only the logic; ignores other aspects
  • DAML+OIL
    • Full-blown knowledge representation language
• They all have semantics; just different expressive powers.
• We keep the data simple. Mappings between data at different peers are more complex.

Peer Data Management
• Mappings are query expressions
  • DbResearcher(x) Researcher(x), Area(x, DB)
  • DbResearcher(x), Office(x, DBLab) = DbLabMember(x)
[Figure: network of five peers (DB Projects, MIT, UW, Stanford, UCB), each exporting its own relational schema. Relations shown, in the order they appear in the figure:]
  Area(areaID, name, descr)
  Project(projID, name, sponsor)
  ProjArea(projID, areaID)
  Pubs(pubID, projName, title, venue, year)
  Author(pubID, author)
  Member(projName, member)
  Members(memID, name)
  Projects(projID, name, startDate)
  ProjFaculty(projID, facID)
  ProjStudents(projID, studID)
  …
  Direction(dirID, name)
  Project(pID, dirID, name)
  …
  Project(projID, name, descr)
  Student(studID, name, status)
  Faculty(facID, name, rank, office)
  Advisor(facID, studID)
  ProjMember(projID, memberID)
  Paper(papID, title, forum, year)
  Author(authorID, paperID)
  Area(areaID, name, descr)
  Project(projID, areaID, name)
  Pub(pubID, title, venue, year)
  PubAuthor(pubID, authorID)
  PubProj(pubID, projID)
  Member(memID, projID, name, pos)
  Alumn(name, year, thesis)

Piazza mapping language (1)
• XML/XML example

  <pubs>
    <book>
      {: $a IN document("source.xml")/authors/author,
         $t IN $a/publication/title,
         $typ IN $a/publication/pub-type
         WHERE $typ = "book" :}
      <title> { $t } </title>
      <author>
        <name> {: $a/full-name :} </name>
      </author>
    </book>
  </pubs>

Target schema:
  pubs
    book*
      title
      author*
        name
      publisher*
        name

Source schema:
  authors
    author*
      full-name
      publication*
        title
        pub-type
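To make the effect of this mapping concrete, here is a minimal Python sketch that performs a comparable transformation with the standard library's `xml.etree.ElementTree`. It is not the Piazza engine. The sample document and the function name `map_authors_to_pubs` are invented for illustration, and the sketch reads `title` and `pub-type` from the same `publication` element, which is an assumption about the mapping's intent.

```python
import xml.etree.ElementTree as ET

# Hypothetical contents of source.xml, following the source schema.
SOURCE_XML = """
<authors>
  <author>
    <full-name>Alice</full-name>
    <publication><title>Data Integration</title><pub-type>book</pub-type></publication>
    <publication><title>P2P Queries</title><pub-type>article</pub-type></publication>
  </author>
  <author>
    <full-name>Bob</full-name>
    <publication><title>Data Integration</title><pub-type>book</pub-type></publication>
  </author>
</authors>
"""

def map_authors_to_pubs(source_root):
    """Build a <pubs> tree with one <book> per (author, book publication) pair."""
    pubs = ET.Element("pubs")
    for author in source_root.findall("author"):
        for publication in author.findall("publication"):
            # WHERE $typ = "book"
            if publication.findtext("pub-type") != "book":
                continue
            book = ET.SubElement(pubs, "book")
            ET.SubElement(book, "title").text = publication.findtext("title")
            author_el = ET.SubElement(book, "author")
            ET.SubElement(author_el, "name").text = author.findtext("full-name")
    return pubs

source_root = ET.fromstring(SOURCE_XML)
pubs = map_authors_to_pubs(source_root)
print(ET.tostring(pubs, encoding="unicode"))
```

Note that the co-authored book appears twice in the output, once per author, because every variable binding produces its own `<book>` element.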

Piazza mapping language (2)
• piazza:id attribute

  <pubs>
    <book piazza:id={$t}>
      {: $a IN document("source.xml")/authors/author,
         $t IN $a/publication/title,
         $typ IN $a/publication/pub-type
         WHERE $typ = "book" :}
      <title piazza:id={$t}> { $t } </title>
      <author piazza:id={$t}>
        <name> {: $a/full-name :} </name>
      </author>
    </book>
  </pubs>

Target schema:
  pubs
    book*
      title
      author*
        name
      publisher*
        name

Source schema:
  authors
    author*
      full-name
      publication*
        title
        pub-type
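One way to read the `piazza:id` attribute is that two output elements with the same tag and the same id value are the same element, so bindings that share a title contribute to a single `<book>`. The Python sketch below follows the slide literally under that assumed reading; the helper `get_or_create`, the sample document, and the rule that `title` and `pub-type` come from the same `publication` are illustrative assumptions, not Piazza's implementation.

```python
import xml.etree.ElementTree as ET

# Hypothetical contents of source.xml: two authors share one book.
SOURCE_XML = """
<authors>
  <author>
    <full-name>Alice</full-name>
    <publication><title>Data Integration</title><pub-type>book</pub-type></publication>
  </author>
  <author>
    <full-name>Bob</full-name>
    <publication><title>Data Integration</title><pub-type>book</pub-type></publication>
  </author>
</authors>
"""

def get_or_create(parent, tag, pid, registry):
    """Return the element registered for (tag, pid), creating it on first use."""
    key = (tag, pid)
    if key not in registry:
        registry[key] = ET.SubElement(parent, tag)
    return registry[key]

def map_with_ids(source_root):
    """Apply the mapping, merging elements that share a tag and an id value."""
    pubs = ET.Element("pubs")
    registry = {}
    for author in source_root.findall("author"):
        for publication in author.findall("publication"):
            if publication.findtext("pub-type") != "book":
                continue
            t = publication.findtext("title")
            book = get_or_create(pubs, "book", t, registry)      # <book piazza:id={$t}>
            title = get_or_create(book, "title", t, registry)    # <title piazza:id={$t}>
            title.text = t
            author_el = get_or_create(book, "author", t, registry)  # <author piazza:id={$t}>
            ET.SubElement(author_el, "name").text = author.findtext("full-name")
    return pubs

pubs = map_with_ids(ET.fromstring(SOURCE_XML))
print(ET.tostring(pubs, encoding="unicode"))
```

Under this reading the output contains one `<book>` for the shared title, with both names collected beneath it, instead of one `<book>` per binding.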

Piazza mapping language (3)
• Partial mapping

  <pubs>
    <book piazza:id={$t}>
      {: $a IN document("source.xml")/authors/author,
         $t IN $a/publication/title,
         $typ IN $a/publication/pub-type
         WHERE $typ = "book"
         PROPERTY $t >= 'A' AND $t < 'B' :}
      [: <publisher>
           <name>
             {: PROPERTY $this IN {"PrintersInc", "PubsInc"} :}
           </name>
         </publisher> :]
    </book>
  </pubs>

Target schema:
  pubs
    book*
      title
      author*
        name
      publisher*
        name

Source schema:
  authors
    author*
      full-name
      publication*
        title
        pub-type

Query Answering Algorithm
• Problem
  • Evaluate query Q at P1 given a network of mappings
  • Reformulate the query over all relevant peers
• Chaining of mappings using a combination of query composition and query rewriting
  • QP1(x) :- DbResearcher(x)
• Query composition
  • M: DbResearcher(x) Researcher(x), Area(x, DB) ⇒ QP2(x) Researcher(x), Area(x, DB)
• Query rewriting
  • M: DbResearcher(x), Office(x, DBLab) = DbLabMember(x) ⇒ QP3(x) DbLabMember(x)
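The composition step can be sketched as simple rule unfolding. The Python below is an illustration only: it assumes the first mapping can be read as a rule that defines DbResearcher in terms of P2's relations, it represents atoms as `(predicate, args)` tuples, and the function name `unfold` is invented. It does not cover the rewriting step, which needs view-based query rewriting.

```python
# An atom is (predicate, args); a rule maps a head predicate to
# (head_args, body_atoms). Variables are plain lowercase strings.
RULES = {
    # Assumed reading of mapping M: DbResearcher(x) is defined by
    # Researcher(x), Area(x, "DB") at peer P2.
    "DbResearcher": (("x",), [("Researcher", ("x",)), ("Area", ("x", "DB"))]),
}

def unfold(query_body, rules):
    """Replace each atom that has a defining rule by that rule's body.

    Limitation of this sketch: body variables that do not occur in the
    rule head are not renamed, so it is only safe for rules like the one
    above, where every body variable also appears in the head.
    """
    result = []
    for pred, args in query_body:
        if pred not in rules:
            result.append((pred, args))
            continue
        head_args, body = rules[pred]
        subst = dict(zip(head_args, args))  # head variable -> query term
        for b_pred, b_args in body:
            result.append((b_pred, tuple(subst.get(a, a) for a in b_args)))
    return result

# QP1(y) :- DbResearcher(y)
q_p1 = [("DbResearcher", ("y",))]
q_p2 = unfold(q_p1, RULES)
print(q_p2)
```

Applied to the query at P1, the sketch yields a body that mentions only P2's relations, which is the shape of QP2 on the slide.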

Query Reformulation (1)

Query:

  <result> {
    for $faculty in /S1/people/faculty,
        $name in $faculty/name/text(),
        $advisee in $faculty/advisee/text()
    where $name = "Ullman"
    return <student> {$advisee} </student> }
  </result>

Mapping:

  <S2>
    <people> {: $people = /S1/people :}
      <faculty> {: $name = $people/faculty/name/text() :}
        { $name }
      </faculty>
      <student> {: $student = $people/student/text() :}
        <name> { $student } </name>
        <advisor> {: $faculty = $people/faculty,
                     $name = $faculty/name/text(),
                     $advisee = $faculty/advisee/text()
                     where $advisee = $student :}
          { $name }
        </advisor>
      </student>
    </people>
  </S2>

Query Reformulation (2)

Query:

  <result> {
    for $faculty in /S1/people/faculty,
        $name in $faculty/name/text(),
        $advisee in $faculty/advisee/text()
    where $name = "Ullman"
    return <student> {$advisee} </student> }
  </result>

[Figure: query tree pattern over S1/people/faculty with children name and advisee, the condition $name = "Ullman", and the output <result>/<student> {$advisee}. Mapping tree pattern over S1/people producing <S2>/<people> with <faculty> {$name}, <student> containing <name> {$student}, and <advisor> {$name} under the condition $advisee = $student.]

Query Reformulation (3)

Reformulated query:

  <result> {
    for $student in /S2/people/student,
        $advisor in $student/advisor/text(),
        $name in $student/name/text()
    where $advisor = "Ullman"
    return <student> { $name } </student> }
  </result>

[Figure: query tree pattern over S1/people/faculty with children name and advisee, the condition $name = "Ullman", and the output <result>/<student> {$advisee}. Mapping tree pattern over S1/people producing <S2>/<people> with <faculty> {$name}, <student> containing <name> {$student}, and <advisor> {$name} under the condition $advisee = $student.]
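As a sanity check on the reformulation, the following Python sketch materializes an S2 document from a small S1 document according to the mapping from slide (1), then runs the S1 query and the S2 query and compares the answers. The sample data and function names are invented for illustration, and the sketch evaluates the queries by hand with `xml.etree.ElementTree` instead of using an XQuery engine.

```python
import xml.etree.ElementTree as ET

# Hypothetical S1 instance.
S1_XML = """
<S1><people>
  <faculty><name>Ullman</name><advisee>Ann</advisee><advisee>Bob</advisee></faculty>
  <faculty><name>Widom</name><advisee>Cy</advisee></faculty>
  <student>Ann</student><student>Bob</student><student>Cy</student>
</people></S1>
"""

def apply_mapping(s1):
    """Materialize S2 from S1 following the S1-to-S2 mapping."""
    s2 = ET.Element("S2")
    for people in s1.findall("people"):
        out_people = ET.SubElement(s2, "people")
        for fac in people.findall("faculty"):
            for name in fac.findall("name"):
                ET.SubElement(out_people, "faculty").text = name.text
        for stu in people.findall("student"):
            out_stu = ET.SubElement(out_people, "student")
            ET.SubElement(out_stu, "name").text = stu.text
            # <advisor>: faculty names whose advisee equals this student
            for fac in people.findall("faculty"):
                for name in fac.findall("name"):
                    for advisee in fac.findall("advisee"):
                        if advisee.text == stu.text:
                            ET.SubElement(out_stu, "advisor").text = name.text
    return s2

def query_s1(s1, who):
    """Advisees of faculty member `who`, asked over S1."""
    return sorted(
        adv.text
        for fac in s1.findall("people/faculty")
        for name in fac.findall("name")
        if name.text == who
        for adv in fac.findall("advisee")
    )

def query_s2(s2, who):
    """Names of students whose advisor is `who`, asked over S2."""
    return sorted(
        name.text
        for stu in s2.findall("people/student")
        for advisor in stu.findall("advisor")
        if advisor.text == who
        for name in stu.findall("name")
    )

s1 = ET.fromstring(S1_XML)
s2 = apply_mapping(s1)
print(query_s1(s1, "Ullman"), query_s2(s2, "Ullman"))
```

The two queries agree on this data because every advisee also appears as a `student` element in S1; an advisee with no matching `student` entry would be visible to the S1 query only.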

Reformulation times
• Table 1: The test queries and their respective running times.

Current and the Future
• Current status
  • Demo scenario using XML
  • Looking at real domains (Bio dbs, NASA dbs)
• Future work
  • More efficient reformulation algorithm
  • Semantic network analysis: eliminate redundant and inconsistent mappings
  • Query caching to speed up query evaluation

Conclusions
• Mapping language for mapping between sets of XML source nodes with different document structures
• Architecture that uses the transitive closure of mappings to answer queries
• Algorithm for query answering over this transitive closure of mappings, which is able to follow mappings in both forward and reverse directions
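The idea of following mappings transitively, in either direction, can be illustrated with a breadth-first search over the mapping graph. This Python sketch is an assumption-laden toy: the peer names are borrowed from the example network, but the mapping edges listed in `MAPPINGS` and the function name `reachable_peers` are hypothetical, and the sketch only finds mapping chains; it does not reformulate queries along them.

```python
from collections import deque

# Hypothetical directed mappings (source peer, target peer).
MAPPINGS = [
    ("UW", "DB Projects"),
    ("DB Projects", "MIT"),
    ("UW", "Stanford"),
    ("Stanford", "UCB"),
]

def reachable_peers(start, mappings):
    """Return {peer: chain}, where chain lists the (from, to, direction) steps
    taken from `start`, following each mapping forward or in reverse."""
    neighbours = {}
    for src, dst in mappings:
        neighbours.setdefault(src, []).append((dst, "forward"))
        neighbours.setdefault(dst, []).append((src, "reverse"))

    chains = {start: []}
    queue = deque([start])
    while queue:
        peer = queue.popleft()
        for nxt, direction in neighbours.get(peer, []):
            if nxt not in chains:  # first visit gives a shortest chain
                chains[nxt] = chains[peer] + [(peer, nxt, direction)]
                queue.append(nxt)
    return chains

chains = reachable_peers("MIT", MAPPINGS)
for peer, chain in chains.items():
    print(peer, chain)
```

Starting from MIT, every peer in this toy network is reachable, but only because the first two steps traverse mappings in the reverse direction.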

Thank You!

