
Probabilistic Ranking of Database Query Results

Presentation Transcript


  1. Probabilistic Ranking of Database Query Results Surajit Chaudhuri, Microsoft Research Gautam Das, Microsoft Research Vagelis Hristidis, Florida International University Gerhard Weikum, MPI Informatik Presented by Raghunath Ravi Sivaramakrishnan Subramani CSE@UTA

  2. Roadmap • Motivation • Key Problems • System Architecture • Construction of Ranking Function • Implementation • Experiments • Conclusion and open problems

  3. Motivation • Many-answers problem • Two alternative solutions: query reformulation or automatic ranking • Apply the probabilistic IR model to DB tuple ranking

  4. Example: Realtor Database • House Attributes: Price, City, Bedrooms, Bathrooms, SchoolDistrict, Waterfront, BoatDock, Year • Query: City = 'Seattle' AND Waterfront = TRUE • Too many results! • Intuitively, houses with a lower Price, more Bedrooms, or a BoatDock are generally preferable

  5. Rank According to Unspecified Attributes The score of a result tuple t depends on • Global Score: global importance of the unspecified attribute values [CIDR2003] • E.g., newer Houses are generally preferred • Conditional Score: correlations between the specified and unspecified attribute values • E.g., Waterfront = TRUE suggests BoatDock, many Bedrooms, and a good SchoolDistrict

  6. Roadmap • Motivation • Key Problems • System Architecture • Construction of Ranking Function • Implementation • Experiments • Conclusion and open problems

  7. Key Problems • Given a query Q, how to combine the global and conditional scores into a ranking function? Use Probabilistic Information Retrieval (PIR). • How to calculate the global and conditional scores? Use the query workload and the data.

  8. Roadmap • Motivation • Key Problems • System Architecture • Construction of Ranking Function • Implementation • Experiments • Conclusion and open problems

  9. System Architecture

  10. Roadmap • Motivation • Key Problems • System Architecture • Construction of Ranking Function • Implementation • Experiments • Conclusion and open problems

  11. PIR Review • Bayes' Rule • Product Rule • Document (tuple) t, query Q • R: relevant documents • R̄ = D - R: irrelevant documents
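
The formula images on this slide do not survive in the transcript. As a reconstruction (standard forms of the two rules named above, plus the relevance-odds score that PIR ranks by; check against the paper):

$$ p(R \mid t) = \frac{p(t \mid R)\, p(R)}{p(t)} \qquad \text{(Bayes' rule)} $$

$$ p(a, b \mid c) = p(a \mid c)\, p(b \mid a, c) \qquad \text{(product rule)} $$

$$ \mathrm{Score}(t) = \frac{p(R \mid t)}{p(\bar{R} \mid t)}, \qquad \bar{R} = D \setminus R $$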

  12. Adaptation of PIR to DB • A tuple t is treated as a document • Partition t into t(X), the specified attribute values, and t(Y), the unspecified attribute values • t(X) and t(Y) are written as X and Y for short (e.g., for the query City = 'Seattle' AND Waterfront = TRUE, X is {Seattle, TRUE} and Y holds the remaining attribute values of t) • Derive from an initial scoring function until the final ranking function is obtained

  13. Preliminary Derivation
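
The derivation on this slide was an equation image. A sketch of the preliminary step, reconstructed from the quantities introduced on the previous slides (a hedged reconstruction; the paper is the authority): apply Bayes' rule to the relevance odds, rewrite t as its specified part X and unspecified part Y, and expand with the product rule.

$$ \mathrm{Score}(t) = \frac{p(R \mid t)}{p(\bar{R} \mid t)} = \frac{p(t \mid R)\, p(R)}{p(t \mid \bar{R})\, p(\bar{R})} \;\propto\; \frac{p(X, Y \mid R)}{p(X, Y \mid \bar{R})} = \frac{p(Y \mid X, R)\, p(X \mid R)}{p(Y \mid X, \bar{R})\, p(X \mid \bar{R})} $$

Every tuple in the answer set shares the same specified values X, so the factor p(X | R) / p(X | R̄) is constant across results and can be dropped, leaving Score(t) ∝ p(Y | X, R) / p(Y | X, R̄).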

  14. Limited Independence Assumptions • Given a query Q and a tuple t, the X (and Y) values within themselves are assumed to be independent, though dependencies between the X and Y values are allowed

  15. Continuing Derivation
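
The equations here were also images. Under the limited independence assumptions of slide 14, and approximating the relevant set R by the workload W and the irrelevant set R̄ by the whole database D, the final ranking function takes the form below. This reconstruction is consistent with the four atomic probabilities listed on the next slide, but the intermediate algebra should be taken from the paper.

$$ \mathrm{Score}(t) \;\propto\; \underbrace{\prod_{y \in Y} \frac{p(y \mid W)}{p(y \mid D)}}_{\text{global part}} \;\cdot\; \underbrace{\prod_{y \in Y} \prod_{x \in X} \frac{p(x \mid y, W)}{p(x \mid y, D)}}_{\text{conditional part}} $$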

  16. Pre-computing Atomic Probabilities in the Ranking Function • Relative frequency in W (computed from the workload): (# of tuples in W that contain x, y) / total # of tuples in W • Relative frequency in D (computed from the data): (# of tuples in D that contain x, y) / total # of tuples in D
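
In symbols, the relative-frequency estimates look roughly as follows. The denominators of the conditional estimates are written in the usual conditional-frequency form; the slide's layout does not make them fully explicit, so treat them as an assumption to verify against the paper.

$$ \hat{p}(y \mid W) = \frac{|\{t \in W : y \in t\}|}{|W|} \qquad \hat{p}(y \mid D) = \frac{|\{t \in D : y \in t\}|}{|D|} $$

$$ \hat{p}(x \mid y, W) = \frac{|\{t \in W : x \in t \wedge y \in t\}|}{|\{t \in W : y \in t\}|} \qquad \hat{p}(x \mid y, D) = \frac{|\{t \in D : x \in t \wedge y \in t\}|}{|\{t \in D : y \in t\}|} $$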

  17. Roadmap • Motivation • Key Problems • System Architecture • Construction of Ranking Function • Implementation • Experiments • Conclusion and open problems

  18. Architecture of Ranking Systems

  19. Scan Algorithm Preprocessing (Atomic Probabilities Module) • Computes and indexes the quantities P(y | W), P(y | D), P(x | y, W), and P(x | y, D) for all distinct values x and y Execution • Select the tuples that satisfy the query • Scan and compute the score for each result tuple • Return the top-K tuples
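
A minimal sketch of the Scan algorithm's execution phase in Python, assuming the atomic probability tables have already been precomputed. The function and variable names, the dictionary key scheme, and the 1e-6 smoothing default are this sketch's assumptions, not the paper's (the original system was C# over SQL Server).

```python
import heapq

def scan_rank(result_tuples, query_x, glob_w, glob_d, cond_w, cond_d, k=10):
    """Score every tuple that satisfies the query and return the top-k.

    result_tuples : list of dicts {attribute: value}, already filtered by the query
    query_x       : dict {attribute: value} of the specified (X) values
    glob_w/glob_d : dicts mapping a value y to P(y | W) and P(y | D)
    cond_w/cond_d : dicts mapping a pair (x, y) to P(x | y, W) and P(x | y, D)
    Values are assumed globally distinct across attributes for simplicity.
    """
    def score(t):
        s = 1.0
        for attr, y in t.items():
            if attr in query_x:                      # specified (X) values do not contribute
                continue
            # global part: p(y | W) / p(y | D)
            s *= glob_w.get(y, 1e-6) / glob_d.get(y, 1e-6)
            # conditional part: product over specified values x of p(x | y, W) / p(x | y, D)
            for x in query_x.values():
                s *= cond_w.get((x, y), 1e-6) / cond_d.get((x, y), 1e-6)
        return s

    return heapq.nlargest(k, result_tuples, key=score)
```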

  20. Beyond the Scan Algorithm • The Scan algorithm is inefficient when there are many tuples in the answer set • The other extreme: pre-compute the top-K tuples for all possible queries; still infeasible in practice • Trade-off solution: pre-compute ranked lists of tuples for all possible atomic queries, and at query time merge the ranked lists to get the top-K tuples

  21. Output from the Index Module • CondList Cx: {AttName, AttVal, TID, CondScore}, with a B+ tree index on (AttName, AttVal, CondScore) • GlobList Gx: {AttName, AttVal, TID, GlobScore}, with a B+ tree index on (AttName, AttVal, GlobScore)
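
As a rough illustration of the two list record shapes (field names are from the slide; representing them as Python dataclasses rather than database tables with B+ tree indexes is purely this sketch's choice):

```python
from dataclasses import dataclass

# In the paper these records live in tables indexed by (AttName, AttVal, Score),
# so that entries for one attribute value can be read in descending score order.

@dataclass
class CondEntry:        # one row of a conditional list Cx
    att_name: str
    att_val: str
    tid: int            # id of a tuple containing this attribute value
    cond_score: float   # conditional-score contribution of that tuple

@dataclass
class GlobEntry:        # one row of a global list Gx
    att_name: str
    att_val: str
    tid: int
    glob_score: float   # global-score contribution of that tuple
```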

  22. Index Module

  23. Preprocessing Component Preprocessing • For each distinct value x in the database, calculate and store the conditional (Cx) and global (Gx) lists as follows • For each tuple t containing x, calculate its conditional and global scores and add them to Cx and Gx respectively • Sort Cx and Gx by decreasing score Execution • Query Q: X1 = x1 AND … AND Xs = xs • Execute the Threshold Algorithm [Fag01] on the lists Cx1, …, Cxs, and Gxb, where Gxb is the shortest list among Gx1, …, Gxs (a sketch follows below)
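
A compact Threshold Algorithm [Fag01] sketch for the list-merge step. Assumptions of this sketch: each ranked list is available both in descending score order (sorted access) and by tuple id (random access), every list contains every candidate tuple, and the overall score is the product of per-list scores (a monotone aggregate, as TA requires). In the paper the lists would be Cx1, …, Cxs plus the shortest global list Gxb; here they are generic.

```python
import heapq

def list_merge_top_k(lists, k=10):
    """Threshold Algorithm over ranked lists; lists[i] maps tid -> score."""
    sorted_lists = [sorted(l.items(), key=lambda kv: -kv[1]) for l in lists]
    seen = set()
    top = []                                   # min-heap of (score, tid)
    for depth in range(len(sorted_lists[0])):
        frontier = []                          # scores at this sorted-access depth
        for sl in sorted_lists:
            tid, s = sl[depth]
            frontier.append(s)
            if tid in seen:
                continue
            seen.add(tid)
            # random access: fetch this tuple's score from every list and aggregate
            agg = 1.0
            for l in lists:
                agg *= l[tid]
            if len(top) < k:
                heapq.heappush(top, (agg, tid))
            else:
                heapq.heappushpop(top, (agg, tid))
        # threshold: best possible score of any tuple not yet seen
        threshold = 1.0
        for s in frontier:
            threshold *= s
        if len(top) == k and top[0][0] >= threshold:
            break                              # the current top-k cannot be beaten
    return sorted(top, reverse=True)           # (score, tid) pairs, best first
```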

  24. List Merge Algorithm

  25. Roadmap • Motivation • Key Problems • System Architecture • Construction of Ranking Function • Implementation • Experiments • Conclusion and open problems

  26. Experimental Setup • Datasets: • MSR HomeAdvisor Seattle (http://houseandhome.msn.com/) • Internet Movie Database (http://www.imdb.com) • Software and Hardware: • Microsoft SQL Server 2000 RDBMS • P4 2.8-GHz PC, 1 GB RAM • C#, connected to the RDBMS through DAO

  27. Quality Experiments • Conducted on Seattle Homes and Movies tables • Collect a workload from users • Compare Conditional Ranking Method in the paper with the Global Method [CIDR03]

  28. Quality Experiment-Average Precision • For each query Qi , generate a set Hi of 30 tuples likely to contain a good mix of relevant and irrelevant tuples • Let each user mark 10 tuples in Hi as most relevant to Qi • Measure how closely the 10 tuples marked by the user match the 10 tuples returned by each algorithm
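
One plausible way to compute the per-query measure described above, assuming it is the fraction of the user's 10 relevant tuples that appear in an algorithm's top-10 (the paper may define the precision measure slightly differently):

```python
def overlap_precision(user_relevant, algo_top10):
    """Fraction of the user's marked tuples that the algorithm's top-10 recovers."""
    return len(set(user_relevant) & set(algo_top10)) / len(user_relevant)
```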

  29. Quality Experiment- Fraction of Users Preferring Each Algorithm • 5 new queries • Users were given the top-5 results

  30. Performance Experiments • Datasets • Compare 2 Algorithms: • Scan algorithm • List Merge algorithm

  31. Performance Experiments – Pre-computation Time

  32. Performance Experiments – Execution Time

  33. Roadmap • Motivation • Key Problems • System Architecture • Construction of Ranking Function • Implementation • Experiments • Conclusion and open problems

  34. Conclusions and Future Work Conclusions • A completely automated approach for the many-answers problem that leverages data and workload statistics and correlations • Based on PIR Drawbacks • Multiple-table queries • Non-categorical attributes Future Work • The empty-answer problem • Handling plain-text attributes

  35. Questions?
