Efficient Top-k Search across Heterogeneous XML Data Sources
Jianxin Li (1), Chengfei Liu (1), Jeffrey Xu Yu (2), Rui Zhou (1)
(1) Swinburne University of Technology; (2) Chinese University of Hong Kong
Outline • Motivation • Related Work • Preliminary and Problem Statement • BT-based Scheduling Strategy • Case Study • Experiments • Conclusions
Motivation
• Top-k queries
  • Approximate answers are needed when no exact results can be found.
  • Returning a large number of results is undesirable.
• Multiple XML data sources
  • As XML data becomes widely used, users are sometimes interested in results retrieved from several data sources at the same time.
  • Answering top-k queries over multiple XML data sources is still an open problem.
Related Work
• Top-k queries in XML
  • Amelie Marian et al. Adaptive processing of top-k queries in XML. ICDE 2005.
  • Martin Theobald et al. An efficient and versatile query engine for TopX search. VLDB 2005.
  • Raghav Kaushik et al. On the integration of structure indexes and inverted lists. SIGMOD 2004.
• Top-k queries in relational databases
  • Upper, MPro, TPUT, etc.
• We focus on top-k queries over multiple XML data sources.
Preliminary – XML Query Relaxation
• XML data and relevant schemas
  • Fig.1: bookshop S1
  • Fig.2: schema d1 of S1
  • Fig.3: bookshop S2
  • Fig.4: schema d2 of S2
Preliminary – XML Query Relaxation
• Relaxed results
  • Fig.5: an original query q
  • Fig.6: a relaxed query to d1 (RankScore = 2.28)
  • Fig.7: a relaxed query to d2 (RankScore = 4.88)
• Each edge in a relaxed query keeps its changed weight.
Problem Statement
• Given a weighted query q and a number of data sources {S1, S2, …, Sn} conforming to DTDs {d1, d2, …, dn}, let {q1, q2, …, qn} be the set of weighted relaxed query templates of q w.r.t. the set of DTDs. Our aim is to find the top-k results efficiently by scheduling the evaluation of {q1, q2, …, qn} over {S1, S2, …, Sn}.
BT-based Scheduling Strategy • Data source determination and switching • Result determination • Edge selection
Data source determination and switching
• Compute the ranking scores {U(1), …, U(n)} of the relaxed queries {q1, q2, …, qn} w.r.t. the data sources {S1, S2, …, Sn}.
• Sort the ranking scores from highest to lowest as U = {U(k1), …, U(kn)}.
• Take the data source Sk1 as the first to be evaluated and U(k2) as the current threshold σ.
• Example: the relaxed query q1 w.r.t. d1 has U(1) = 2.28 and the relaxed query q2 w.r.t. d2 has U(2) = 4.88, so S2 is evaluated first with threshold σ = U(1) = 2.28.
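The selection step on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `pick_source_and_threshold` and the dictionary layout are hypothetical, and the descending sort order is inferred from the slide's example (U(2) = 4.88 is evaluated first, with σ = 2.28), not taken from the authors' implementation.

```python
def pick_source_and_threshold(upper_bounds):
    """Pick the data source to evaluate first and the initial threshold.

    upper_bounds maps a data source name to the ranking score U(i) of its
    relaxed query (hypothetical layout, chosen for this sketch).
    Returns (first source, threshold sigma, full evaluation order).
    """
    if len(upper_bounds) < 2:
        raise ValueError("need at least two data sources to derive a threshold")
    # Assumption: scores are sorted from highest to lowest.
    ranked = sorted(upper_bounds.items(), key=lambda item: item[1], reverse=True)
    first_source = ranked[0][0]
    sigma = ranked[1][1]  # the second-best score becomes the threshold
    order = [name for name, _ in ranked]
    return first_source, sigma, order


# Scores taken from the slide: U(1) = 2.28, U(2) = 4.88.
source, sigma, order = pick_source_and_threshold({"S1": 2.28, "S2": 4.88})
print(source, sigma, order)  # S2 2.28 ['S2', 'S1']
```

Using the second-best score as σ means that any candidate from the first source whose lower bound reaches σ cannot be beaten by a result from the remaining sources.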
Result determination
• We adjust the lower bound L and the upper bound U during query evaluation. Once L is greater than or equal to the current threshold σ, the current candidates are processed as follows:
  • The number of candidates equals k: stop.
  • The number of candidates is less than k: continue searching.
  • The number of candidates is larger than k: refine the candidates.
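The three cases above can be written as a small decision function. This is a minimal sketch, not the authors' code: the function name `next_action`, its parameters, and the returned labels are hypothetical, and the extra "evaluate" case (keep tightening the bounds while L is still below σ) is an assumption about what happens before the rule applies.

```python
def next_action(lower_bound, threshold, num_candidates, k):
    """Decide what to do with the current candidates.

    lower_bound    -- current lower bound L of the candidates
    threshold      -- current threshold sigma
    num_candidates -- number of candidates found so far
    k              -- number of results requested
    """
    if lower_bound < threshold:
        # Assumption: the rule does not apply yet, so evaluation goes on.
        return "evaluate"
    if num_candidates == k:
        return "stop"
    if num_candidates < k:
        return "continue"
    return "refine"


# Bound values from the case study slide; candidate counts are made up.
print(next_action(1.70, 2.28, 3, 2))  # evaluate
print(next_action(3.5, 2.28, 2, 2))   # stop
print(next_action(3.5, 2.28, 1, 2))   # continue
print(next_action(3.5, 2.28, 3, 2))   # refine
```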
Edge selection • Random • Min_weight • Max_weight
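The three edge selection strategies can be illustrated as follows. The slide only names the strategies, so everything else here is an assumption made for the sketch: query edges are modeled as hypothetical `(name, weight)` pairs, and `select_edge` is an illustrative helper rather than part of the presented system.

```python
import random


def select_edge(edges, strategy, rng=random):
    """Choose the next query edge to evaluate.

    edges    -- non-empty list of (edge name, weight) pairs (hypothetical model)
    strategy -- "random", "min_weight" or "max_weight"
    rng      -- source of randomness, replaceable for reproducible runs
    """
    if not edges:
        raise ValueError("no edges left to select")
    if strategy == "random":
        return rng.choice(edges)
    if strategy == "min_weight":
        return min(edges, key=lambda edge: edge[1])
    if strategy == "max_weight":
        return max(edges, key=lambda edge: edge[1])
    raise ValueError(f"unknown strategy: {strategy}")


# Edge names and weights below are invented for illustration only.
edges = [("book/title", 0.9), ("book/info", 0.5), ("info/price", 0.7)]
print(select_edge(edges, "min_weight"))  # ('book/info', 0.5)
print(select_edge(edges, "max_weight"))  # ('book/title', 0.9)
```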
Case Study
• [Figure: evaluation steps over the book tree (nodes: book, title, info, price, year; candidates B1, B2, B4), with U(2) = 4.88 and σ = 2.28]
• Bounds shown in the figure:
  • L(2) = 1.70 < σ
  • L(2)(G1) = 3.5 > σ
  • L(2)(G2) = 1.70 < σ, U(2)(G2) = 3.08 > σ
  • L(2)(G3) = 4.4 > σ
  • L(2)(G4) = 1.70 < σ, U(2)(G4) = 2.18 < σ
• Annotations in the figure: "Top-1 result found!", "Top-2 result found!", "Switching data source to search for the top-3 result!"
Experiments
• Experimental setup
  • All algorithms were implemented in Java and run on an Intel P4 3 GHz PC with 512 MB of memory. The Wutka DTD parser was used to analyze the structures of the DTDs.
• Dataset and selected queries
  • We used the XMark XML data generator to produce the dataset.
  • Three queries were designed:
    • q1: //item[./description/parlist]
    • q2: //item[./description/parlist/mailbox/mail[./text]]
    • q3: //item[./mailbox/mail/text[./keyword and ./xxx] and ./name and ./xxx]
Experiments
• [Figure: static sort vs. dynamic sort, varying top-k size]
• [Figure: no schedule vs. BT schedule, varying top-k size]
Conclusions
• Contributions:
  • Proposed a BT-based scheduling strategy for evaluating top-k queries over multiple XML data sources;
  • Results are output as soon as they are determined, without waiting for query evaluation to finish;
  • Implemented the relevant algorithms and demonstrated their effectiveness and efficiency on XMark data sets.