
Minimal Probing: Supporting Expensive Predicates for Top-k Queries

Minimal Probing: Supporting Expensive Predicates for Top-k Queries. Kevin C. Chang, Seung-won Hwang, Univ. of Illinois at Urbana-Champaign. Context: Top-k Queries. Ranked queries return top-k results, unlike Boolean queries. Crucial for retrieving data by “soft” conditions.


Presentation Transcript


  1. Minimal Probing: Supporting Expensive Predicates for Top-k Queries. Kevin C. Chang, Seung-won Hwang, Univ. of Illinois at Urbana-Champaign

  2. Context: Top-k Queries
     • Ranked queries return top-k results, unlike Boolean queries
     • Crucial for retrieving data by “soft” conditions
       • relevance: e.g., text search engines
       • similarity: e.g., multimedia databases
       • preference: e.g., e-commerce product search
     • Example scenario: preference query for finding a house:
       • select h.id from house h where new(age), cheap(price, size), large(size) order by min(new, cheap, large) stop after 5
       • annotations: the where clause lists the predicates, order by gives the scoring function, stop after gives k, the retrieval size
     • Observation: Crucial to support expensive predicates
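To make the example query concrete, the sketch below evaluates it naively in Python: every predicate is computed for every house, and the k best are kept. The predicate bodies, the constants inside them, and the sample rows are assumptions for illustration only; the slide names `new`, `cheap`, and `large` but does not define them.

```python
import heapq

# Hypothetical scoring predicates returning values in [0, 1].
def new(age):
    return max(0.0, 1.0 - age / 50.0)

def cheap(price, size):
    return max(0.0, 1.0 - (price / size) / 400.0)  # based on price per unit of size

def large(size):
    return min(1.0, size / 4000.0)

def top_k(houses, k):
    """Score every house with F = min(new, cheap, large) and keep the k best."""
    def score(h):
        return min(new(h["age"]), cheap(h["price"], h["size"]), large(h["size"]))
    return heapq.nlargest(k, houses, key=score)

# Made-up rows, not from the presentation.
houses = [
    {"id": 1, "age": 5, "price": 300_000, "size": 2000},
    {"id": 2, "age": 40, "price": 150_000, "size": 1500},
    {"id": 3, "age": 10, "price": 500_000, "size": 3500},
]
best_ids = [h["id"] for h in top_k(houses, 2)]
```

This version evaluates every predicate on every object, which is exactly the cost the presentation sets out to avoid when predicates are expensive.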

  3. Problem: Expensive Predicates
     • Expensive predicates
       • no pre-computed indexes for zero-time sorted-access
       • need a probe to evaluate each object (similar to sequential scan)
     • Unified abstraction for:
       • user-defined functions: functional extensibility
         • query conditions can be arbitrary, user-specific
         • e.g., cheap(price,size)
       • external predicates: data extensibility
         • source interface may require one probe per object
         • e.g., safe(zip) accesses crime rate from apbnews.com
       • fuzzy joins
         • associations of relations can be arbitrary
         • e.g., close(house.zip, park.zip)

  4. Current Limitations: “Sort-Merge” Framework
     • Require sorted access of search predicates.
     • To “simulate” sorted access, require complete probing
       • are these probes necessary?
     • Goal: Minimize probe cost
     • Example (F = min(new, cheap, large), k = 1): the sort step produces one sorted list per predicate, and the merge algorithm combines them into the top-k output
       • new (search predicate): a:0.90, b:0.80, c:0.70, d:0.60, e:0.50
       • cheap (expensive predicate): d:0.90, a:0.85, b:0.78, c:0.75, e:0.70
       • large (expensive predicate): b:0.90, d:0.90, e:0.80, a:0.75, c:0.20
       • top-k output: b:0.78
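A small sketch of what complete probing costs on this example. The scores are copied from the slide; the function name and the probe counter are illustrative assumptions, and the code stands in for the sort-merge baseline rather than reproducing any particular merge algorithm.

```python
# Scores copied from the slide's example.
NEW = {"a": 0.90, "b": 0.80, "c": 0.70, "d": 0.60, "e": 0.50}    # search predicate
CHEAP = {"d": 0.90, "a": 0.85, "b": 0.78, "c": 0.75, "e": 0.70}  # expensive predicate
LARGE = {"b": 0.90, "d": 0.90, "e": 0.80, "a": 0.75, "c": 0.20}  # expensive predicate

def complete_probing_top_k(k):
    """Probe both expensive predicates on every object, then rank by F = min."""
    probes = 0
    scores = {}
    for oid, x in NEW.items():
        cheap = CHEAP[oid]   # one probe
        large = LARGE[oid]   # one probe
        probes += 2
        scores[oid] = min(x, cheap, large)
    ranked = sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:k], probes

top, probes = complete_probing_top_k(1)
```

With five objects and two expensive predicates this spends ten probes to return a single answer, b with score 0.78.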

  5. Motivation: Solution Space
     • Assume sequential probing
     • Algorithm skeleton:
         do:
           schedule next obj o, pred p
           probe pr(o,p)
         until (top-k identified)
     • Figure: grid of predicates (p1, p2, p3) against objects (a, b, c)

  6. Our framework: Separate, Global Predicate Scheduling
     Two important decisions on framework:
     • Separate predicate scheduling
       • scheduling as separate “optimization” phase before probing
       • avoid run-time scheduling overhead
     • Global predicate scheduling
       • scheduling based on global info (predicate selectivities)
       • lack of per-object information to justify per-object scheduling
       • avoid per-object scheduling overhead
     • Simple framework and algorithm
       • and efficient!
       • allow essentially A* framework, for given predicate schedule
       • enable formal analysis: optimality, scalability

  7. Simple Framework
     • Separate, global predicate scheduling, e.g., H=(p1,p2,p3)
     • Algorithm skeleton:
         find global schedule H
         do:
           schedule next obj o
           probe pr(o, next(o,H))
         until (top-k identified)
     • Figure: grid of predicates (p1, p2, p3) against objects (a, b, c)

  8. Challenges for Minimizing Probing
     • Predicate scheduling before probing
       • how to identify the best H?
     • Object scheduling during probing
       • how to find next object to probe, for achieving “minimal probing” with respect to H?
     • Algorithm skeleton:
         find global schedule H
         do:
           schedule next obj o
           probe pr(o, next(o,H))
         until (top-k identified)

  9. Challenge 1: Object Scheduling
     • Goal: Perform only necessary probes
     • Necessary probes:
       • A probe is necessary if top-k answers cannot be determined by any algorithm without it, regardless of the outcomes of other probes.
       • Question 1: Given a probe pr(o, next(o,H)), how to determine if it is necessary?
     • Probe-optimal algorithm
       • An algorithm is probe-optimal if it performs only the necessary probes.
       • Question 2: How to identify necessary probes in order to design such an algorithm?

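  10. Question 1: Is this Probe Necessary?
     • k=1, F=min(x,p1,p2); suppose H=(p1,p2)
     • Table (OID, x, p1, p2, F=min(x,p1,p2)): x scores are a 0.9, b 0.8, c 0.7, d 0.6, e 0.5; with unprobed p1 and p2 taken as 1, the F values are a: 0.9 (top 1?), b: ≤ 0.8, c: 0.7, d: 0.6, e: 0.5
     • Is the probe marked “?” necessary? Maybe not!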

  11. Question 1: Is this Probe Necessary?
     • k=1, F=min(x,p1,p2); suppose H=(p1,p2)
     • Theorem: Probe pr(o,p) is absolutely necessary, if o is among the current top-k in terms of ceiling scores.
     • Table (OID, x, p1, p2, F=min(x,p1,p2)): x scores are a 0.9, b 0.8, c 0.7, d 0.6, e 0.5; with unprobed p1 and p2 taken as 1, the ceiling scores are a: ≤ 0.9 (top 1?), b: 0.8, c: 0.7, d: 0.6, e: 0.5
     • The probe marked “?” on object a is necessary!
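The theorem can be checked mechanically. The sketch below assumes F = min, so an object's ceiling score is the minimum of the scores known so far, with every unprobed predicate counted as 1. The function names and the `state` layout are hypothetical, and ties in ceiling score are broken arbitrarily.

```python
def ceiling(x, probed):
    """Upper bound on F = min(x, p1, p2): unprobed predicates are assumed to score 1.0."""
    return min([x] + list(probed))

def has_necessary_probe(oid, state, k, n_preds):
    """True if `oid` still has an unprobed predicate and ranks in the current top-k by ceiling."""
    ranked = sorted(state, key=lambda o: ceiling(*state[o]), reverse=True)
    incomplete = len(state[oid][1]) < n_preds
    return incomplete and oid in ranked[:k]

# state: oid -> (x score, list of expensive-predicate scores probed so far)
state = {"a": (0.9, []), "b": (0.8, []), "c": (0.7, []), "d": (0.6, []), "e": (0.5, [])}

a_needed = has_necessary_probe("a", state, k=1, n_preds=2)
b_needed = has_necessary_probe("b", state, k=1, n_preds=2)
```

Before any probing, only object a qualifies for k=1: its ceiling of 0.9 tops the list, so no algorithm can settle the answer without probing it.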

  12. Question 2: Probe-optimal object scheduling
     • Objects in current top-k must be further probed
     • Probe-optimal object scheduling: Algorithm MPro
       • use a priority queue with ceiling scores as priorities
     • Example trace (k=1); each row shows the queue, ordered by ceiling score, after the listed probe:
         initial:          a:0.9   b:0.8   c:0.7  d:0.6  e:0.5
         pr(a,p1) = 0.85:  a:0.85  b:0.8   c:0.7  d:0.6  e:0.5
         pr(a,p2) = 0.75:  b:0.8   a:0.75  c:0.7  d:0.6  e:0.5
         pr(b,p1) = 0.78:  b:0.78  a:0.75  c:0.7  d:0.6  e:0.5
         pr(b,p2) = 0.90:  b:0.78  a:0.75  c:0.7  d:0.6  e:0.5
     • top 1: b:0.78
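A minimal sketch of the priority-queue loop, assuming F = min and a fixed global schedule H. The function signature, the heap entry layout, and the `probe` callback are assumptions made for illustration, not the authors' implementation; the probe values are the ones listed on the slide.

```python
import heapq

def mpro(x_scores, probe, H, k):
    """Return (top-k list, number of probes) for F = min(x, p1, ..., pn).

    x_scores: oid -> score of the search predicate
    probe(oid, pred): evaluates one expensive predicate for one object
    H: global predicate schedule, e.g. ("p1", "p2")
    """
    # Heap entries: (-ceiling score, oid, number of predicates in H already probed).
    heap = [(-x, oid, 0) for oid, x in x_scores.items()]
    heapq.heapify(heap)
    results, probes = [], 0
    while heap and len(results) < k:
        neg_ceiling, oid, done = heapq.heappop(heap)
        if done == len(H):
            # Fully probed and still on top: the ceiling is the exact score.
            results.append((oid, -neg_ceiling))
            continue
        score = probe(oid, H[done])
        probes += 1
        heapq.heappush(heap, (-min(-neg_ceiling, score), oid, done + 1))
    return results, probes

# Probe outcomes listed on the slide; no other probe is requested for k = 1.
PROBES = {("a", "p1"): 0.85, ("a", "p2"): 0.75, ("b", "p1"): 0.78, ("b", "p2"): 0.90}
x_scores = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.6, "e": 0.5}

results, probes = mpro(x_scores, lambda o, p: PROBES[(o, p)], ("p1", "p2"), k=1)
```

On this data the loop probes a twice and b twice, then returns b with score 0.78. Objects c, d, and e are never probed because their ceilings never reach the top of the queue.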

  13. Challenge 2: Predicate Scheduling
     • Scheduling problem
       • find minimal cost schedule from permutations
     • Challenges
       • selectivity estimation:
         • dynamic predicates
         • aggregate selectivities (context-dependent)
       • scheduling computation:
         • NP-hard
     • Our approach:
       • on-line sampling to estimate selectivities
       • greedy selection to schedule predicates
     • Result: 0.1% sampling achieves almost the best schedule
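One way the sampling-plus-greedy idea could look in code. Everything specific here is an assumption for illustration: the threshold `theta`, the definition of aggregate selectivity as the fraction of sampled objects whose minimum score stays at or above `theta`, the per-predicate `costs`, and the "cost per filtered fraction" ranking rule. The slide states only that selectivities are estimated by on-line sampling and predicates are chosen greedily.

```python
import random

def draw_sample(objects, rate, seed=0):
    """On-line sampling: pick a small random fraction of the objects to probe fully."""
    size = max(1, int(len(objects) * rate))
    return random.Random(seed).sample(objects, size)

def aggregate_selectivity(sample, preds, theta):
    """Fraction of sampled objects whose min score over `preds` is at least theta."""
    if not preds:
        return 1.0
    passed = sum(1 for obj in sample if min(obj[p] for p in preds) >= theta)
    return passed / len(sample)

def greedy_schedule(sample, costs, theta):
    """Repeatedly append the predicate with the lowest cost per fraction of objects filtered."""
    schedule = []
    remaining = sorted(costs)
    while remaining:
        before = aggregate_selectivity(sample, schedule, theta)

        def cost_per_filtered(p):
            after = aggregate_selectivity(sample, schedule + [p], theta)
            return costs[p] / (before - after + 1e-9)

        best = min(remaining, key=cost_per_filtered)
        schedule.append(best)
        remaining.remove(best)
    return schedule

# Made-up, fully probed objects; rate=1.0 only because the toy database is tiny.
db = [
    {"p1": 0.9, "p2": 0.2},
    {"p1": 0.8, "p2": 0.9},
    {"p1": 0.9, "p2": 0.1},
    {"p1": 0.3, "p2": 0.4},
]
sample = draw_sample(db, rate=1.0)
schedule = greedy_schedule(sample, costs={"p1": 1.0, "p2": 1.0}, theta=0.5)
```

With equal costs, p2 is scheduled first because it filters out three of the four sampled objects, while p1 filters out only one.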

  14. Experiment Results
     • Practical performance of MPro
       • cost proportional to the retrieval size k
       • significant speedup for small k
     • Impact of performance factors
       • database size: sublinear cost scalability
       • score distribution and scoring function: see paper
     • Figure labels: “6 hour”, “2 min”

  15. Demo: House Search
     • Data: All houses on sale in Illinois (N=20990)
       • from www.realtor.com
       • objects: house(id, price, size, bed, bath, zip, city)
     • Query: F = Average(n, c, r)
       • n (nearcity): close to Chicago
       • c (cheap): “reasonable” price for its size
       • r (roomy): prefer 4-6 rooms

  16. Summary of Contributions (more in the paper)
     • Abstraction:
       • for user-defined, external, and fuzzy join predicates
     • Framework and algorithm:
       • sampling-based global scheduling
       • probe-optimal algorithm MPro
       • extensions of MPro: fuzzy joins, parallel MPro, approximation
     • Principles/Theorems:
       • necessary-probe principle
       • probe-optimality of MPro
       • analytical scalability of MPro
     • Extensive experiments

  17. Thank You!

  18. Parallel MPro: Overview
     • Probe-parallel MPro
       • Perform k necessary probes concurrently
       • Up to k-fold speedup
     • Data-parallel MPro
       • Partition data into s chunks
       • Up to s-fold speedup
     • Figure: several MPro instances feed a Merge step that produces the top-k
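The data-parallel variant relies on a simple property: the global top-k is always contained in the union of the per-chunk top-k lists. The sketch below illustrates only that partition-and-merge step. `local_top_k` is a plain-scoring stand-in for running MPro on one chunk, the function names are hypothetical, and the scores are the F = min values that follow from the deck's running example.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def local_top_k(chunk, k):
    """Stand-in for running MPro on one chunk of (oid, score) pairs."""
    return heapq.nlargest(k, chunk, key=lambda pair: pair[1])

def data_parallel_top_k(scored, k, s):
    """Partition into s chunks, take each chunk's top-k concurrently, then merge."""
    chunks = [scored[i::s] for i in range(s)]
    with ThreadPoolExecutor(max_workers=s) as pool:
        partial = list(pool.map(lambda chunk: local_top_k(chunk, k), chunks))
    merged = [pair for part in partial for pair in part]
    return heapq.nlargest(k, merged, key=lambda pair: pair[1])

scored = [("a", 0.75), ("b", 0.78), ("c", 0.20), ("d", 0.60), ("e", 0.50)]
top2 = data_parallel_top_k(scored, k=2, s=2)
```

Each chunk contributes at most k candidates, so the merge step handles at most s times k pairs regardless of database size.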

  19. Scalability (plots; surviving panel labels: N=1000, N=10000, N=100000, and k=100 with N=1000, k=1000 with N=10000, k=10000 with N=100000)

  20. Comparison (figure; only the labels “T” and “O” survive)
