Structure and Content Scoring for XML

Amélie Marian (Columbia University). Joint work with: Sihem Amer-Yahia (AT&T Research Labs), Nick Koudas (University of Toronto), Divesh Srivastava (AT&T Research Labs), David Toman (University of Waterloo).

Presentation Transcript


  1. Structure and Content Scoring for XML
     Amélie Marian (Columbia University)
     Joint work with: Sihem Amer-Yahia (AT&T Research Labs), Nick Koudas (University of Toronto), Divesh Srivastava (AT&T Research Labs), David Toman (University of Waterloo)

  2. Motivations: XML Data Heterogeneity
     • Heterogeneous XML data about books
     • Query: book[./info[./title="Great Expectations" and ./author="Dickens"] and ./edition="paperback"]
     • Query root node: distinguished node
     [Figure: the query tree pattern (book with children info and edition; info with children title and author) and heterogeneous book data trees that arrange info, author (Dickens), title (Great Expectations), and edition (paperback) differently]
     Amélie Marian - Columbia University

  3. XML Query Relaxation [Amer-Yahia et al. EDBT’02]
     • Tree pattern relaxations:
       • Leaf node deletion
       • Edge generalization
       • Subtree promotion
     [Figure: the book query tree pattern and book data trees, with the label "edition?"]

  4. Motivations
     • Top-k query processing suitable for relaxed XML queries over heterogeneous collections
     • Return k XML nodes that are closest to query structure
     • Opportunity for more efficient query processing
     • Need scoring mechanism to identify best k answers

  5. Contributions
     • Scoring mechanism for XML queries
     • Data structures for top-k query processing
     • Experimental evaluation

  6. Scoring Functions Critical for Top-k Query Processing
     • Top-k answer quality depends on scoring function
     • Efficient top-k query processing requires scoring function:
       • Monotonic
       • Fast to compute
     • Little attention given to scoring functions for structured and semi-structured data
       • Extensively studied over text data (e.g., tf.idf)
     • Proposed scoring function inspired by tf.idf for XML data

  7. Adaptation of tf.idf to XML Queries

  8. Scoring Function for XML Approximate Matches
     • Required properties:
       • Exact matches should be scored higher than relaxed matches (idf)
       • Distinguished nodes with several matches should be ranked higher than those with fewer matches (tf)
     • How to combine tf and idf?
       • tf.idf, as used by IR, violates the above properties
       • Ranking based on idf, then breaking ties using tf, satisfies the properties
     [Figure: example matches (a) and (b) against the book query, annotated with score(a) >= score(b) and score(a) <= score(b)]
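Ranking on idf first and using tf only to break ties amounts to a lexicographic ordering on (idf, tf). The sketch below illustrates that ordering and contrasts it with the plain tf × idf product. The `Answer` record, the node names, and all numbers are hypothetical, chosen only to show how the product can let a heavily relaxed match overtake an exact one.

```python
from typing import NamedTuple


class Answer(NamedTuple):
    node: str   # identifier of a distinguished node (hypothetical)
    idf: float  # idf of the query pattern this node matches
    tf: int     # number of matches rooted at this node


def rank(answers, k):
    """Order by idf descending, then by tf descending to break idf ties."""
    return sorted(answers, key=lambda a: (a.idf, a.tf), reverse=True)[:k]


answers = [
    Answer("book1", idf=1.2, tf=1),  # exact match, one occurrence
    Answer("book2", idf=1.0, tf=5),  # relaxed match, many occurrences
    Answer("book3", idf=1.2, tf=3),  # exact match, several occurrences
]

# Lexicographic ranking keeps both exact matches ahead of the relaxed one.
top = rank(answers, k=2)

# The plain product tf * idf promotes the relaxed match instead.
by_product = sorted(answers, key=lambda a: a.idf * a.tf, reverse=True)
```

Here `top` holds book3 followed by book1, while `by_product` starts with book2, the relaxed match, which is the violation the slide refers to.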

  9. A Family of Scoring Methods for XML
     • Twig predicate: high quality, expensive computation
     • Path predicates
     • Binary predicates: low quality, fast computation
     [Figure: the book query and its decompositions into smaller tree patterns over info, edition (paperback), author (Dickens), and title (Great Expectations), shown as sums of patterns rooted at book]

  10. Contributions
      • Scoring mechanism for XML queries
      • Data structures for top-k query processing
      • Experimental evaluation

  11. Matrix Representation of Twigs
      • Twigs (queries and tuples) can be represented by matrices that capture all relationships in the query
      • Matrix subsumption is used to compare tuples and queries
      [Figure: a query over nodes a, b, c, d, and e, and a partial tuple over a1, b1, c1, d1, and e1, with their matrices; entries use the symbols /, //, =, and X, and the tuple variants are labeled "not joined with e yet", "no matches for e", and "e1 matches"]

  12. Representing Relaxed Query Patterns: DAG Structure
      • Each child is more relaxed (has more matches) than its parent
      • idf of a child is no higher than the idf of its parent
      • idf scores are accessible in constant time for any match (complete or partial) using a hash function
      • Exhaustive algorithm to build the DAG
      [Figure: DAG of relaxed query patterns over nodes a, b, and c]

  13. Information Stored in the DAG
      • idf score information: idf = (1 + |a|) / (1 + |a_p|), where |a_p| is the number of a nodes that satisfy the query predicate
      • For query processing:
        • Best possible score from here
        • Best possible score after each remaining join operation
        • Number of matches (useful for tf)
      [Figure: DAG of relaxed query patterns over nodes a, b, and c, annotated with idf values 1.228, 1.2, 1.195, 1.167, 1.195, 1.167, 1.156, 1.049, 1.156, and 1]
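The idf formula on this slide is easy to check numerically. The sketch below assumes that |a| is the total number of a nodes in the collection, which the slide does not spell out, and uses made-up counts; the function and variable names are hypothetical.

```python
def idf(total_nodes: int, matching_nodes: int) -> float:
    """Compute (1 + |a|) / (1 + |a_p|).

    total_nodes is |a| (assumed here to be the total number of a nodes);
    matching_nodes is |a_p|, the number of a nodes satisfying the predicate.
    """
    if not 0 <= matching_nodes <= total_nodes:
        raise ValueError("expected 0 <= matching_nodes <= total_nodes")
    return (1 + total_nodes) / (1 + matching_nodes)


# Hypothetical counts: 99 a nodes; more relaxed predicates match more of them.
TOTAL = 99
exact = idf(TOTAL, 19)    # 100 / 20
relaxed = idf(TOTAL, 49)  # 100 / 50
weakest = idf(TOTAL, 99)  # 100 / 100: every a node matches
```

With these counts the values are 5.0, 2.0, and 1.0. A predicate that every a node satisfies gets idf 1, matching the value 1 that appears among the idf annotations in the figure, and a predicate with fewer matches always gets a higher idf.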

  14. Query Processing Using the DAG
      • Benefits:
        • Score computation done in a preprocessing phase (using exact or approximate information)
        • Score access during query processing done in constant time
        • Additional information needed for query processing precomputed and accessed in constant time (e.g., score upper bound)
      • tf estimated at runtime based on available information
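The split between a preprocessing phase and constant-time score access can be sketched with a hash table keyed by relaxed pattern. Everything concrete here is an assumption made for illustration: the four-pattern DAG, the pattern strings used as keys, the match counts, and the function names. The talk does not specify how patterns are keyed or stored.

```python
# Hypothetical relaxation DAG: each pattern maps to its more relaxed children.
DAG = {
    "a[./b and ./c]": ["a[.//b and ./c]", "a[./b and .//c]"],
    "a[.//b and ./c]": ["a[.//b and .//c]"],
    "a[./b and .//c]": ["a[.//b and .//c]"],
    "a[.//b and .//c]": [],
}

# Hypothetical match counts gathered during preprocessing.
MATCHES = {
    "a[./b and ./c]": 4,
    "a[.//b and ./c]": 9,
    "a[./b and .//c]": 7,
    "a[.//b and .//c]": 19,
}
TOTAL_A_NODES = 19


def preprocess(dag, matches, total):
    """Compute the idf of every DAG node once, before query processing."""
    return {p: (1 + total) / (1 + matches[p]) for p in dag}


IDF = preprocess(DAG, MATCHES, TOTAL_A_NODES)


def score(pattern):
    """Look up a precomputed idf; a dict lookup is constant time on average."""
    return IDF[pattern]


def is_monotone(dag, idf):
    """Check that no child has a higher idf than its parent."""
    return all(
        idf[child] <= idf[parent]
        for parent, children in dag.items()
        for child in children
    )
```

With these counts the exact pattern scores 4.0, the two single-step relaxations score 2.0 and 2.5, and the fully generalized pattern scores 1.0, so `is_monotone(DAG, IDF)` holds. Other precomputed values mentioned on the slide, such as score upper bounds, could be stored in the same table under the same key.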

  15. Quality/Space/Time Tradeoff
      • Binary Predicates
        • Smaller DAG (O(4q))
        • Faster preprocessing (and processing)
        • Lower quality (fewer possible scores)
      • Path Predicates and Twig
        • DAG is O(4q^2/2) in space (still reasonable in practice)
        • More preprocessing
        • Higher quality (more differences between scores)

  16. Contributions
      • Scoring mechanism for XML queries
      • Data structures for top-k query processing
      • Experimental evaluation

  17. Experimental Setup
      • Data:
        • Synthetic heterogeneous document collections generated with ToXgene
        • Real dataset: Wall Street Journal Treebank corpora
      • Pregenerated queries exhibiting different sizes, query structures, and predicates
      • Measures:
        • DAG size
        • DAG preprocessing time
        • Query processing time
        • Precision (percentage of top-k answers that are actual top-k answers, as given by Twig)
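The precision measure listed above compares the top-k answers of a cheaper scoring method against the top-k answers produced by twig scoring. A minimal sketch, assuming both rankings are plain lists of answer identifiers and ignoring ties at rank k (how ties were handled in the experiments is not stated); the function name and sample identifiers are hypothetical.

```python
def precision_at_k(candidate, reference, k):
    """Fraction of the candidate top-k that also appears in the reference top-k."""
    if k <= 0:
        raise ValueError("k must be positive")
    reference_top = set(reference[:k])
    hits = sum(1 for answer in candidate[:k] if answer in reference_top)
    return hits / k


# Hypothetical rankings: twig scoring is the reference, binary the candidate.
twig_ranking = ["n1", "n2", "n3", "n4"]
binary_ranking = ["n1", "n3", "n5", "n2"]
precision = precision_at_k(binary_ranking, twig_ranking, k=3)
```

Two of the candidate's top three answers (n1 and n3) are in the reference top three, so the result is 2/3. Multiply by 100 to report it as a percentage, as the slide does.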

  18. XML Scoring Precision

  19. XML Scoring Preprocessing Time

  20. XML Scoring Real data

  21. Conclusions
      • Scoring method for XML queries
        • Inspired by tf.idf
        • Accounts for structure and content
        • Accounts for structural relaxations
      • Efficient data structures to compute and access scores during top-k query processing
        • DAG
        • Matrix representation of queries and tuples
      • Evaluation of the scoring methods' tradeoffs
        • Answer quality vs. preprocessing time

  22. Related Work
      • IR Scoring
        • Content only
      • XML Scoring
        • Content with structure
        • XIRQL [XML&IR’00], JuruXML [SIGIR’03], IR-CADG [WebDB’04]
        • None of these techniques account for structural relaxations (with the exception of our previous work [ICDE’05])
      • XML Structural Relaxation
        • FleXPath [SIGMOD’04], Kanza and Sagiv [PODS’01], Schlieder [EDBT’02], Delobel and Rousset [FMII’01]

  23. Future Work
      • Streaming scenarios
        • Incremental updates on DAG
        • Approximate scoring
      • Integration with approximate text scoring
        • Extend proposed XML scoring function to handle text content approximation (e.g., misspellings)
        • Unify structure and content score
      • Quality evaluation (INEX)
