Efficient Top-k Search across Heterogeneous XML Data Sources

Jianxin Li¹, Chengfei Liu¹, Jeffrey Xu Yu², Rui Zhou¹ • ¹Swinburne University of Technology • ²Chinese University of Hong Kong

Outline • Motivation • Related Work • Preliminary and Problem Statement • BT-based Scheduling Strategy • Case Study • Experiments • Conclusions

Motivation • Top-k queries • Approximate answers are required when exact results cannot be found. • Returning a large number of results is not desirable. • Multiple XML data sources • As XML data becomes widely used, users are sometimes interested in results retrieved from several data sources at the same time. • Answering top-k queries over multiple XML data sources is still an open problem.

Related Work • Top-k queries in XML • Amelie Marian et al. Adaptive processing of top-k queries in XML. ICDE 2005. • Martin Theobald et al. An efficient and versatile query engine for TopX search. VLDB 2005. • Raghav Kaushik et al. On the integration of structure indexes and inverted lists. SIGMOD 2004. • Top-k queries in relational databases • Upper, MPro, TPUT, etc. • We focus on top-k queries over multiple XML data sources!

Preliminary – XML Query Relaxation • XML data and relevant schemas • Fig. 1: bookshop S1 • Fig. 2: schema d1 of S1 • Fig. 3: bookshop S2 • Fig. 4: schema d2 of S2

Preliminary – XML Query Relaxation • Relaxed results • Fig. 5: an original query q • Fig. 6: a relaxed query to d1 (RankScore = 2.28) • Fig. 7: a relaxed query to d2 (RankScore = 4.88) • We keep the changed weight for each edge in relaxed queries.

Problem Statement • Given a weighted query q and a number of data sources {S1, S2, …, Sn} conforming to DTDs {d1, d2, …, dn}, let {q1, q2, …, qn} be the set of weighted relaxed query templates of q w.r.t. the set of DTDs. • Our aim is to efficiently find the top-k results by scheduling the evaluation of {q1, q2, …, qn} over {S1, S2, …, Sn}.

BT-based Scheduling Strategy • Data source determination and switching • Result determination • Edge selection

Data source determination and switching • Compute the ranking scores {U(1), …, U(n)} of the relaxed queries {q1, q2, …, qn} w.r.t. data sources {S1, S2, …, Sn}. • Sort the ranking scores in descending order as U = {U(k1), …, U(kn)}. • Evaluate data source Sk1 first and take U(k2) as the current threshold σ. • Example: U(1) = 2.28 for the relaxed query q1 w.r.t. d1 and U(2) = 4.88 for the relaxed query q2 w.r.t. d2, so the threshold is σ = 2.28.
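The source-ordering step can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class and method names (`SourceScheduler`, `orderByScore`, `threshold`) are hypothetical, the scores are sorted highest first (as the slide's example suggests), and how the ranking scores are computed is not shown here.

```java
import java.util.Arrays;

// Hypothetical helper; names are illustrative and not taken from the slides.
final class SourceScheduler {

    /** Returns 0-based source indices ordered by ranking score, highest first (assumed order). */
    static Integer[] orderByScore(double[] scores) {
        Integer[] order = new Integer[scores.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Sort indices so that the source with the largest U(i) comes first.
        Arrays.sort(order, (a, b) -> Double.compare(scores[b], scores[a]));
        return order;
    }

    /** Threshold sigma = score of the second-ranked source; negative infinity if there is none. */
    static double threshold(double[] scores, Integer[] order) {
        return order.length > 1 ? scores[order[1]] : Double.NEGATIVE_INFINITY;
    }

    public static void main(String[] args) {
        double[] scores = {2.28, 4.88};            // U(1), U(2) from the slide
        Integer[] order = orderByScore(scores);
        System.out.println("evaluate S" + (order[0] + 1)
                + " first, sigma = " + threshold(scores, order));
    }
}
```

With the slide's values, the sketch picks S2 as the first source and uses U(1) = 2.28 as the threshold.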

Result determination • We adjust the lower bound L and upper bound U during query evaluation. When L becomes equal to or larger than the current threshold, we can process the current candidates as follows: • The number of candidates is equal to k – Stop • The number of candidates is less than k – Continue to search • The number of candidates is larger than k – Refine candidates
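The three cases above can be written as a small decision function. This is a minimal sketch, not the authors' implementation: the names (`ResultDetermination`, `Action`, `decide`) are hypothetical, and the `KEEP_EVALUATING` outcome for L < σ is an assumption, since the slide only describes what happens once L reaches the threshold.

```java
// Hypothetical decision helper; names and the KEEP_EVALUATING case are assumptions.
final class ResultDetermination {

    enum Action { KEEP_EVALUATING, STOP, CONTINUE_SEARCH, REFINE_CANDIDATES }

    /**
     * Decides what to do with the current candidates.
     *
     * @param lowerBound     current lower bound L
     * @param threshold      current threshold sigma
     * @param candidateCount number of current candidates
     * @param k              number of results requested
     */
    static Action decide(double lowerBound, double threshold, int candidateCount, int k) {
        if (lowerBound < threshold) {
            return Action.KEEP_EVALUATING;     // assumed: bound not yet reached, keep evaluating
        }
        if (candidateCount == k) {
            return Action.STOP;                // exactly k candidates: done
        }
        if (candidateCount < k) {
            return Action.CONTINUE_SEARCH;     // fewer than k: continue to search
        }
        return Action.REFINE_CANDIDATES;       // more than k: refine the candidates
    }

    public static void main(String[] args) {
        System.out.println(decide(3.5, 2.28, 2, 2));
    }
}
```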

Edge selection • Random • Min_weight • Max_weight
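The slide only names the three strategies, so the following Java sketch rests on an assumed reading: each remaining query edge carries a weight, and the strategy picks the next edge to evaluate at random, by smallest weight, or by largest weight. All names (`EdgeSelection`, `Strategy`, `select`) are hypothetical.

```java
import java.util.Random;

// Hypothetical sketch of the three strategies named on the slide; semantics are assumed.
final class EdgeSelection {

    enum Strategy { RANDOM, MIN_WEIGHT, MAX_WEIGHT }

    /** Returns the index of the edge to evaluate next among the remaining edge weights. */
    static int select(double[] weights, Strategy strategy, Random rng) {
        if (weights.length == 0) {
            throw new IllegalArgumentException("no edges left to select");
        }
        if (strategy == Strategy.RANDOM) {
            return rng.nextInt(weights.length);
        }
        int best = 0;
        for (int i = 1; i < weights.length; i++) {
            boolean better = (strategy == Strategy.MIN_WEIGHT)
                    ? weights[i] < weights[best]
                    : weights[i] > weights[best];
            if (better) {
                best = i;   // ties keep the earliest edge
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] weights = {0.5, 1.2, 0.3};   // illustrative weights, not from the slides
        System.out.println(select(weights, Strategy.MIN_WEIGHT, new Random()));
        System.out.println(select(weights, Strategy.MAX_WEIGHT, new Random()));
    }
}
```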

Case Study • Setting: U(2) = 4.88, σ = 2.28. Figure labels: nodes book, title, info, price, year; books B1, B2, B4. • L(2) = 1.70 < σ • L(2)(G1) = 3.5 > σ and L(2)(G3) = 4.4 > σ: top-1 and top-2 results found. • L(2)(G2) = 1.70 < σ, U(2)(G2) = 3.08 > σ • L(2)(G4) = 1.70 < σ, U(2)(G4) = 2.18 < σ • Switch data source to search for the top-3 result.

Experiments • Experimental setup • We ran all algorithms in Java on an Intel P4 3 GHz PC with 512 MB of memory. The Wutka DTD parser was used to analyze the structures of the DTDs. • Dataset and selected queries • We used the XMark XML data generator to produce the dataset. • Three queries were designed: • q1: //item[./description/parlist] • q2: //item[./description/parlist/mailbox/mail[./text]] • q3: //item[./mailbox/mail/text[./keyword and ./xxx] and ./name and ./xxx]
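To make the shape of these queries concrete, the sketch below evaluates q1 exactly (without relaxation) against a tiny hand-written document using the standard `javax.xml.xpath` API. The XML fragment and the class name `XPathDemo` are made up for illustration; the slides do not say which XPath engine the authors used.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; the XML below is illustrative, not XMark output.
final class XPathDemo {

    /** Counts the nodes that an XPath query selects in the given XML text. */
    static int countMatches(String xml, String query) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    query, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            return nodes.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<site>"
                + "<item><description><parlist/></description></item>"
                + "<item><description><text/></description></item>"
                + "</site>";
        // q1 from the slide: items whose description contains a parlist.
        System.out.println(countMatches(xml, "//item[./description/parlist]"));
    }
}
```

Only the first item satisfies q1 here; the second is the kind of near miss that a relaxed query could still rank.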

Experiments • Figure: static sort vs. dynamic sort (varying top-k size) • Figure: no schedule vs. BT schedule (varying top-k size)

Conclusions • Contributions: • Proposed a BT-based scheduling strategy for evaluating top-k queries over multiple XML data sources; • Results are output immediately, without waiting for the end of query evaluation; • Implemented the relevant algorithms and demonstrated their effectiveness and efficiency on XMark data sets.

Thanks & Questions
