
Keyword Proximity Search in Complex Data Graphs

This paper discusses the problem of keyword proximity search in complex data graphs and proposes a system and algorithm for extracting meaningful parts of data without knowing the schema. The paper also addresses the challenges of redundancy and repeated information in highly cyclic data graphs.


Presentation Transcript


  1. Keyword Proximity Search in Complex Data Graphs • Konstantin Golenberg • Benny Kimelfeld • Yehoshua Sagiv • The Selim and Rachel Benin School of Engineering and Computer Science

  2. Schema-Free Extraction of Data • Nowadays: exposure to many databases • Different types (relational, XML, RDF…) • Different schemas • Traditional querying paradigms (e.g., SQL, XQuery, SPARQL) are not easy to use and, moreover, require a thorough understanding of the schema • Goal: Enable users to instantly pose (inaccurate) queries without knowing the schema • The natural (and popular) option: Keyword Search • Problem: Inherently different from standard IR

  3. Keyword Proximity Search (KPS) • The Goal: Extract meaningful parts of data w.r.t. the keywords • Data have varying degrees of structure • Relational (w/ foreign keys), XML (w/ id-references) • Natural representation by a graph • Usually, data-centric rather than document-centric • A query is a set of keywords • No structural constraints • Related work: Agrawal et al., ICDE’02 • Hristidis et al., VLDB’02,03, ICDE’03 • Bhalotia et al., VLDB’05 • Kacholia et al., VLDB’06 • Ding et al., ICDE’07 • Liu et al., SIGMOD’06 • Wang et al., VLDB’06 • Luo et al., SIGMOD’07 …

  4. Example: Search in RDB • Query: Belgium, Brussels • Tables: Cities, Organizations, Countries, Memberships

  5. Brussels is the capital city of Belgium • Query: Belgium, Brussels • Tables: Cities, Organizations, Countries, Memberships

  6. Brussels hosts EU and Belgium is a member • Query: Belgium, Brussels • Tables: Cities, Organizations, Countries, Memberships

  7. Example: Search in XML • Query: Yannakakis, Approximation

  8. Yannakakis wrote a paper about Approximation • Query: Yannakakis, Approximation

  9. Yannakakis is cited by a paper about Approximation • Query: Yannakakis, Approximation

  10. Data Graphs • Structural and keyword nodes • Edges and nodes may have weights (weak relationships are penalized by large weights) • Each keyword has one occurrence in the data graph (technical)

  11. Queries • Queries are sets of keywords from the data graph • Example: Q = {Summers, Cohen, coffee}

  12. An Answer is a Reduced Subtree • In this paper, an answer is a subtree of the data graph • Contains all keywords of the query • Has no redundant edges (and nodes) • 3 variants: directed, undirected, strong (undirected, keywords are leaves)
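
The "no redundant edges" condition can be illustrated for the undirected variant. The sketch below assumes the input edges already form a tree with at least one edge, and it relies on the standard observation that an undirected tree containing all the keywords has no removable part exactly when every leaf is a query keyword. The function name and the node labels are hypothetical and are not taken from the slides.

```python
from collections import defaultdict


def is_reduced_undirected(tree_edges, query_keywords):
    """Check the undirected-variant condition on a candidate answer.

    tree_edges: iterable of (u, v) pairs, assumed to form a tree.
    query_keywords: the keyword nodes of the query.
    """
    keywords = set(query_keywords)
    degree = defaultdict(int)
    for u, v in tree_edges:
        degree[u] += 1
        degree[v] += 1
    nodes = set(degree)
    # The answer must contain every keyword of the query.
    if not keywords <= nodes:
        return False
    # A non-keyword leaf could be removed, so it would be redundant.
    leaves = {n for n in nodes if degree[n] == 1}
    return leaves <= keywords


query = {"Summers", "Cohen"}
print(is_reduced_undirected([("Summers", "p1"), ("p1", "Cohen")], query))
print(is_reduced_undirected(
    [("Summers", "p1"), ("p1", "Cohen"), ("p1", "x")], query))
```

The second call adds a dangling non-keyword node `x`, so the tree is no longer reduced.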

  13. Previous Solutions • Lack of guarantees • Highly relevant answers might be missed, and / or • Inefficient algorithms • Rather simple data sets – a (very) small number of relevant answers • They considered data that are essentially collections of entities, namely, DBLP, IMDB, Lyrics, etc. • An answer is usually within the scope of an entity → e.g., the keywords appear in a single movie • Crucial problems ignored • In particular, the “repeated information” problem • Especially pervasive in complex data graphs

  14. Contributions • A system for keyword proximity search • An algorithm for generating answers with guarantees • Does not miss (valuable) answers • Efficient (polynomial delay) • Answers generated in a 2-approximate order by height • A ranking technique that is aware of the repeated-information problem • Gives preference to answers with low similarity to earlier ones • Experimentation over a highly-cyclic data graph • The Mondial database • Many “meaningful” connections among keywords

  15. The MONDIAL Database • Institute for Informatics • Georg-August-Universität Göttingen http://www.dbis.informatik.uni-goettingen.de/Mondial/

  16. Challenges • Huge no. of answers; not instantiated! • Not simple to generate all relevant answers, even if ranking is ignored • For practical ranking functions, enumerating the answers in ranked order is probably impossible • For example, finding the smallest answer is the intractable Steiner-tree problem • Redundancy / repeated information • Many answers are very similar (altogether, they provide little information) • Crucial in complex (highly cyclic) data graphs • We employ a two-phase architecture

  17. Architecture: Generator + Ranker • Interface: search(keywords), next k answers • Answer Generator: generates the next M·k answers (simplified ranking function; simplified ranking at first [Bhalotia et al., ICDE’02, VLDB’05]) • Ranker: ranks all answers generated up to now (minus the printed ones) • Output: top-k answers (relative to those that have already been printed)
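
The generator-plus-ranker loop can be sketched as follows. This is only an illustration of the control flow described on the slide: `search_batches`, `rank_key`, and the integer "answers" are hypothetical stand-ins, and the real generator and ranking function are not shown in the transcript.

```python
from itertools import islice


def search_batches(answer_generator, rank_key, k, M):
    """Yield successive lists of up to k answers.

    Each round pulls the next M*k answers from the generator, ranks
    everything generated so far that has not been printed yet, and
    prints (yields) the best k of those.
    """
    pool = []  # generated but not yet printed
    while True:
        pool.extend(islice(answer_generator, M * k))
        if not pool:
            return
        pool.sort(key=rank_key)
        printed, pool = pool[:k], pool[k:]
        yield printed


# Toy run: answers are numbers and a smaller number ranks higher.
generator = iter([5, 3, 9, 1, 7, 2, 8, 4])
for batch in search_batches(generator, rank_key=lambda a: a, k=2, M=2):
    print(batch)
```

Note that each batch is a top-k only relative to what has been generated and not yet printed, which is why a larger M gives the ranker more candidates per round.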

  18. Generating the Top Answers: Not Trivial! To demonstrate the difficulty of generating the “good” (top) answers, let’s see how existing approaches operate on a simple example:

  19. Find the Answers in this Example!

  20. The BANKS Approach [Bhalotia et al., ICDE’02, VLDB’05] • Answers are directed subtrees • ∀ nodes v (in a “good” order) and keyword occurrences: • Generate the min-height subtree emanating from v

  21. The BANKS Approach [Bhalotia et al., ICDE’02, VLDB’05] • Answers are directed subtrees • ∀ nodes v (in a “good” order) and keyword occurrences: • Generate the min-height subtree emanating from v • What about this answer? Never generated!

  22. The NUITS Approach [Ding et al., ICDE’07] • Answers are undirected subtrees • ∀ nodes v (in a “good” order): • Generate the min-weight subtree that includes v

  23. The NUITS Approach [Ding et al., ICDE’07] • Answers are undirected subtrees • ∀ nodes v (in a “good” order): • Generate the min-weight subtree that includes v • This node is redundant: it is actually the previous answer!

  24. The NUITS Approach [Ding et al., ICDE’07] • Answers are undirected subtrees • ∀ nodes v (in a “good” order): • Generate the min-weight subtree that includes v • This node is redundant: again, the previous answer!

  25. The NUITS Approach [Ding et al., ICDE’07] • Answers are undirected subtrees • ∀ nodes v (in a “good” order): • Generate the min-weight subtree that includes v • What about this answer? Never generated! • Severe limit on # of generated answers! (≤ one per node)

  26. The DISCOVER / DBXplorer Approach [Hristidis et al., VLDB’02,03, ICDE’03] [Agrawal et al., ICDE’02] • ∀ possible queries Q (from the schema) in inc. size: • Evaluate Q over the database • Easy to implement! DBMS queries; no in-memory graph algorithms • All answers are generated in ranked order!

  27. The DISCOVER / DBXplorer Approach [Hristidis et al., VLDB’02,03, ICDE’03] [Agrawal et al., ICDE’02] • ∀ possible queries Q (from the schema) in inc. size: • Evaluate Q over the database • Inefficient! Worst case: exponential in the data, and many queries do not generate any answer at all • Limited ranking! By the query weight (rather than the answer weight)

  28. We Need Generators w/ Guarantees! • All answers are generated • In particular, each of the “relevant” answers is produced at some point (100% recall is achievable) • Controlled order of answers • For instance, increasing weight, increasing height, approximate (what is the ratio?) / heuristic order • Efficiency • The top-k answers should be generated efficiently • Bound on time between successive answers

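  29. Order by Increasing Weight / Height • [Figure: if one answer precedes another, then its weight / height is ≤ that of the later answer] • Yields the top-k answers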

  30. Approximate and Heuristic Orders • Heuristic order: Intuitively, expected to be close to the optimal order, but there is no guarantee • Approximate order: There is a provable bound on the extent to which the actual order can deviate from the optimal one

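  31. C-Approximate Order (inc. Weight / Height) • [Figure: if one answer precedes another, then its weight / height is ≤ C times that of the later answer] • Yields a C-approximation of the top-k answers [Fagin et al., PODS’01]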
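
A small checker makes the C-approximate condition concrete. It assumes the usual reading for increasing orders: whenever one answer is printed before another, the earlier score (weight or height) is at most C times the later one. The function name and the sample scores are hypothetical.

```python
def is_c_approximate_order(scores, c):
    """Return True if, for every i < j, scores[i] <= c * scores[j].

    scores: weights (or heights) of the answers in printing order.
    """
    worst_so_far = None  # largest score printed so far
    for s in scores:
        # It suffices to compare each score with the largest earlier one.
        if worst_so_far is not None and worst_so_far > c * s:
            return False
        worst_so_far = s if worst_so_far is None else max(worst_so_far, s)
    return True


print(is_c_approximate_order([2, 1, 3, 4], c=2))  # out of order, within factor 2
print(is_c_approximate_order([2, 1, 3, 4], c=1))  # not an exact order
```

With C = 1 the check reduces to the exact increasing order of the previous slide.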

  32. Our Approach • PODS’06: Enum. by (exact / approx) inc. weight • Problem: Repeated application of Steiner-tree alg’s • “Heavy” – hard to implement efficiently • Here: Follow the basic approach of PODS’06 • But, we adopt the BANKS idea of using height (≠ weight) for the enumeration order • Recall: BANKS might miss highly relevant answers • Thus, we bypass Steiner trees and obtain a much faster algorithm • Our alg. has all 3 guarantees: answers are not missed, approximate order, poly. delay

  33. An Overview of the Algorithm • Task: Enum. by (2-approx.) increasing height: Lawler / Yen method • Task: Find (a 2-approx. of) the shortest answer under constraints: the intricate part … • Types of constraints: • Inclusion: “include edge e” • Exclusion: “exclude edge e” • Task: Find the shortest answer (w/o constraints): backward-search (Dijkstra) iterators (~ BANKS)
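
The top layer, the Lawler-style partitioning with inclusion and exclusion constraints, can be sketched on a toy problem. The sketch below is not the authors' algorithm: the constrained solver is a brute-force stand-in over a precomputed list of candidate answers, it is exact rather than 2-approximate, and all names (`lawler_enumerate`, `make_solver`, the sample paths) are hypothetical. It assumes that no answer's edge set strictly contains another's, which holds for reduced subtrees.

```python
import heapq
from itertools import count


def lawler_enumerate(best_under):
    """Enumerate answers by increasing cost via constraint partitioning.

    best_under(include, exclude) returns (cost, edges) for the cheapest
    answer that contains every edge in `include` and no edge in
    `exclude`, or None if no such answer exists.
    """
    tie = count()  # tie-breaker so the heap never compares edge sets
    heap = []
    first = best_under(frozenset(), frozenset())
    if first is not None:
        heapq.heappush(heap, (first[0], next(tie), first[1],
                              frozenset(), frozenset()))
    while heap:
        cost, _, edges, inc, exc = heapq.heappop(heap)
        yield cost, edges
        # Split the rest of this subspace: the i-th part keeps the first
        # i-1 free edges of the printed answer and excludes the i-th.
        fixed = set(inc)
        for e in edges:
            if e in inc:
                continue
            sub_exc = exc | {e}
            sub = best_under(frozenset(fixed), sub_exc)
            if sub is not None:
                heapq.heappush(heap, (sub[0], next(tie), sub[1],
                                      frozenset(fixed), sub_exc))
            fixed.add(e)


def make_solver(candidates):
    """Brute-force constrained solver over (cost, edges) candidates."""
    def best_under(include, exclude):
        ok = [(c, p) for c, p in candidates
              if include <= set(p) and not (exclude & set(p))]
        return min(ok) if ok else None
    return best_under


sample = [
    (3, (("s", "a"), ("a", "t"))),
    (2, (("s", "t"),)),
    (5, (("s", "a"), ("a", "b"), ("b", "t"))),
]
for cost, edges in lawler_enumerate(make_solver(sample)):
    print(cost, edges)
```

Because the subspaces are disjoint, each answer is printed once; replacing the exact solver with a C-approximate one is what leads to a C-approximate order.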

  34. Finding an Answer under Constraints • Inclusion: “include edge e” • Exclusion: “exclude edge e” • Handling exclusion constraints is easy: simply remove the excluded edges from the graph
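
Exclusion constraints combine naturally with a backward (Dijkstra) search. The sketch below assumes directed answers and takes the height of an answer to be the largest root-to-keyword path weight; it only reports the best root and its height and does not build the subtree. Function names and the sample graph are hypothetical.

```python
import heapq
from collections import defaultdict


def backward_distances(edges, target):
    """Dijkstra over reversed edges: distance from each node TO target."""
    reverse = defaultdict(list)
    for u, v, w in edges:
        reverse[v].append((u, w))
    dist = {target: 0.0}
    heap = [(0.0, target)]
    while heap:
        d, x = heapq.heappop(heap)
        if d > dist.get(x, float("inf")):
            continue  # stale heap entry
        for y, w in reverse[x]:
            nd = d + w
            if nd < dist.get(y, float("inf")):
                dist[y] = nd
                heapq.heappush(heap, (nd, y))
    return dist


def best_root(edges, keywords, excluded=frozenset()):
    """Return (root, height) minimizing the max distance to a keyword."""
    # Exclusion constraints: drop the excluded edges before searching.
    allowed = [(u, v, w) for (u, v, w) in edges if (u, v) not in excluded]
    per_keyword = [backward_distances(allowed, k) for k in keywords]
    # Only nodes that reach every keyword can serve as a root.
    candidates = set(per_keyword[0])
    for dist in per_keyword[1:]:
        candidates &= set(dist)
    best = None
    for node in sorted(candidates):
        height = max(dist[node] for dist in per_keyword)
        if best is None or height < best[1]:
            best = (node, height)
    return best


graph = [("r", "a", 1), ("r", "b", 1), ("s", "a", 1), ("s", "b", 3)]
print(best_root(graph, ["a", "b"]))
print(best_root(graph, ["a", "b"], excluded={("r", "b")}))
```

Excluding the edge from `r` to `b` forces the search onto the taller tree rooted at `s`.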

  35. Inclusion Constraints are the Problem • Inclusion: “include edge e” • Exclusion: “exclude edge e” • Take the shortest subtree that contains the keywords and satisfies the constraints • But it is not an answer! • Not reduced (has redundancy: a redundant edge) • Moreover, includes a previously printed answer • Sometimes, no answer at all!

  36. The Correct Answer • Inclusion: “include edge e” • Exclusion: “exclude edge e” • Technique: • 1. Generate a min-height subtree (as in the wrong solution) • 2. Not an answer? → modify • Intricate to guarantee 2-approx. • Details in the proceedings

  37. Running Times Each entry is an avg. of 4 queries

  38. Alg. Order vs. Weight Order How many answers are generated in order to obtain the top-k (among 1000) according to weight? Each entry is an avg. of 4 queries

  39. Effective Approx. Ratio: Height • [Plots for 2 keywords and 3 keywords; x-axis: k (answers); y-axis: effective approx. ratio, worst / best (among first k)]

  40. Effective Approx. Ratio: Height • [Plots for 4 keywords and 5 keywords; x-axis: k (answers); y-axis: effective approx. ratio, worst / best (among first k)]

  41. Effective Approx. Ratio: Weight • [Plots for 2 keywords and 3 keywords; x-axis: k (answers); y-axis: effective approx. ratio, worst / best (among first k)]

  42. Effective Approx. Ratio: Weight • [Plots for 4 keywords and 5 keywords; x-axis: k (answers); y-axis: effective approx. ratio, worst / best (among first k)]

  43. The Basic Ranking Function • $\mathrm{weight}(a) = \sum_{\text{node} \in a} \mathrm{weight}(\text{node}) + \sum_{\text{edge} \in a} \mathrm{weight}(\text{edge})$

  44. Determining the Weight of an Edge • An org. enters many countries → weak connection (large weight) • Many org’s enter a country → weak connection (large weight) • Strong connection (small weight) • Strongest!

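  45. The Basic Ranking Function (cont’d) • $\mathrm{weight}(a) = \sum_{\text{node} \in a} \mathrm{weight}(\text{node}) + \sum_{\text{edge} \in a} \mathrm{weight}(\text{edge})$ • weight(node) = fixed (1) • $\mathrm{weight}(\text{edge}) = \log\bigl(1 + \alpha \cdot \mathrm{out}(v_1 \to t_2) + (1 - \alpha) \cdot \mathrm{in}(t_1 \to v_2)\bigr)$, where edge = $(v_1, v_2)$ and $\mathrm{tag}(v_i) = t_i$ • $\mathrm{out}(v_1 \to t_2)$ = # $t_2$ nodes with edges from $v_1$ • $\mathrm{in}(t_1 \to v_2)$ = # $t_1$ nodes with edges to $v_2$ • Relevant answers but …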
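
The edge-weight formula can be evaluated on a tiny graph to see how fan-out penalizes a connection. The sketch assumes a natural logarithm and α = 0.5, neither of which is stated on the slide, and the node names and tags are made up for illustration.

```python
import math


def edge_weight(edges, tag, v1, v2, alpha=0.5):
    """log(1 + alpha*out(v1->t2) + (1-alpha)*in(t1->v2)) for edge (v1, v2).

    edges: all directed edges (u, v) of the data graph.
    tag: mapping from node to its tag (e.g. table or element name).
    """
    t1, t2 = tag[v1], tag[v2]
    # Number of t2-tagged nodes with an edge from v1.
    out_count = len({v for (u, v) in edges if u == v1 and tag[v] == t2})
    # Number of t1-tagged nodes with an edge to v2.
    in_count = len({u for (u, v) in edges if v == v2 and tag[u] == t1})
    return math.log(1 + alpha * out_count + (1 - alpha) * in_count)


def answer_weight(nodes, answer_edges, edges, tag, alpha=0.5):
    """Sum of node weights (fixed at 1) and edge weights of an answer."""
    return len(nodes) + sum(edge_weight(edges, tag, u, v, alpha)
                            for (u, v) in answer_edges)


tag = {"org": "organization", "c1": "country", "c2": "country",
       "c3": "country", "cap1": "city"}
edges = [("org", "c1"), ("org", "c2"), ("org", "c3"), ("c1", "cap1")]

print(edge_weight(edges, tag, "org", "c1"))   # org points to 3 countries
print(edge_weight(edges, tag, "c1", "cap1"))  # one-to-one connection
print(answer_weight({"org", "c1", "cap1"},
                    [("org", "c1"), ("c1", "cap1")], edges, tag))
```

The organization-to-country edge gets the larger weight because the organization points to three countries, matching the "weak connection" intuition of the previous slide.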

  46. Answers with High Similarity
