Query Flocks: A Generalization of Association-Rule Mining

D. Tsur, J.D. Ullman, S. Abiteboul, C. Clifton, R. Motwani, S. Nestorov, A. Rosenthal

Presentation Transcript


  1. Query Flocks: A Generalization of Association-Rule Mining D. Tsur, J.D. Ullman, S. Abiteboul, C. Clifton, R. Motwani, S. Nestorov, A. Rosenthal

  2. Motivations • Market basket analysis has been successful, partially due to the a-priori optimization • Extend this trick to a more general context • efficiently mine large databases for patterns • use parametrized queries with a filter condition • spend most of the time evaluating the “interesting” cases

  3. Query Flocks • Two parts: • generate parametrized queries (parameters are denoted by names starting with $) • filter the results of the queries • Result is the set of tuples which are “acceptable” assignments of values for the parameters

  4. Market Basket Example • Datalog query: answer(B) :- baskets(B,$1) AND baskets(B,$2) • Filter: COUNT(answer.B) >= 20 • Find all pairs of items that appear in at least 20 market baskets • Result is all pairs of items ($1,$2) such that at least 20 baskets have both items
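
A minimal Python sketch of what this flock computes, evaluated naively without any a-priori pruning. The function name `frequent_pairs`, the toy data, and the lowered threshold of 2 are assumptions made for illustration; the sketch also orders each pair so that ($1,$2) and ($2,$1) are counted once, which the Datalog form above does not require.

```python
from collections import defaultdict
from itertools import combinations

def frequent_pairs(baskets, support):
    """Return {(item1, item2): count} for pairs found in >= support baskets.

    `baskets` is an iterable of (basket_id, item) tuples, mirroring the
    relation baskets(B, Item). Hypothetical helper, not from the slides.
    """
    items_by_basket = defaultdict(set)
    for bid, item in baskets:
        items_by_basket[bid].add(item)

    pair_counts = defaultdict(int)
    for items in items_by_basket.values():
        # Each unordered pair is one assignment of ($1, $2); sorting keeps
        # (a, b) and (b, a) from being counted separately.
        for pair in combinations(sorted(items), 2):
            pair_counts[pair] += 1

    # The filter: keep only assignments with enough supporting baskets.
    return {pair: n for pair, n in pair_counts.items() if n >= support}

# Toy data (hypothetical), with the threshold lowered from 20 to 2.
toy_baskets = {
    (1, "beer"), (1, "diapers"), (1, "milk"),
    (2, "beer"), (2, "diapers"),
    (3, "beer"), (3, "milk"),
    (4, "diapers"),
}
result = frequent_pairs(toy_baskets, support=2)
print(result)
```

On this toy data only (beer, diapers) and (beer, milk) reach the threshold; (diapers, milk) appears in a single basket and is filtered out.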

  5. Why Not SQL? • The same query in SQL: SELECT i1.Item, i2.Item FROM baskets i1, baskets i2 WHERE i1.Item < i2.Item AND i1.BID = i2.BID GROUP BY i1.Item, i2.Item HAVING 20 <= COUNT(i1.BID) • The a-priori trick is not implemented by conventional optimizers • Claim: the necessary code optimizations could be implemented in SQL systems
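
To try the SQL form directly, the query can be run against an in-memory SQLite database from Python. This is only a sketch: the table layout (BID, Item) is inferred from the column names in the query, the rows are made-up toy data, and the threshold is passed as a parameter and lowered from 20 to 2.

```python
import sqlite3

# Hypothetical toy rows for baskets(BID, Item).
rows = [
    (1, "beer"), (1, "diapers"), (1, "milk"),
    (2, "beer"), (2, "diapers"),
    (3, "beer"), (3, "milk"),
    (4, "diapers"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE baskets (BID INTEGER, Item TEXT)")
conn.executemany("INSERT INTO baskets VALUES (?, ?)", rows)

# The query from the slide, with the support threshold as a parameter.
query = """
SELECT i1.Item, i2.Item
FROM baskets i1, baskets i2
WHERE i1.Item < i2.Item AND i1.BID = i2.BID
GROUP BY i1.Item, i2.Item
HAVING ? <= COUNT(i1.BID)
"""
# Sort in Python so the output order does not depend on the engine.
pairs = sorted(conn.execute(query, (2,)).fetchall())
conn.close()
print(pairs)
```

The engine evaluates the self-join as written; nothing here prunes infrequent items before the join, which is the gap the slide points out.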

  6. Generalizing the A-Priori Technique • First evaluate a less expensive query and eliminate certain answers • Use a subset of the subgoals of the query • This subset must form a safe query
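
For the market basket flock, the idea can be sketched in Python as two passes. The first pass evaluates the cheaper query built from a single subgoal, answer(B) :- baskets(B,$1), and discards items that cannot reach the threshold; the second pass evaluates the full query over the survivors only. The function name, toy data, and threshold of 2 are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs_apriori(baskets, support):
    """Two-pass evaluation of the pair flock with a-priori pruning.

    `baskets` is an iterable of (basket_id, item) tuples.
    Hypothetical helper, not taken from the slides.
    """
    baskets = set(baskets)  # drop duplicate (basket, item) rows

    # Pass 1: the cheaper one-subgoal query. An item in fewer than
    # `support` baskets cannot be part of any qualifying pair.
    item_counts = Counter(item for _, item in baskets)
    ok_items = {item for item, n in item_counts.items() if n >= support}

    # Pass 2: the full query, restricted to surviving items.
    items_by_basket = {}
    for bid, item in baskets:
        if item in ok_items:
            items_by_basket.setdefault(bid, set()).add(item)

    pair_counts = Counter()
    for items in items_by_basket.values():
        pair_counts.update(combinations(sorted(items), 2))

    return {pair: n for pair, n in pair_counts.items() if n >= support}

# Toy data (hypothetical): caviar and milk are pruned in pass 1.
toy_baskets = [
    (1, "beer"), (1, "diapers"), (1, "caviar"),
    (2, "beer"), (2, "diapers"),
    (3, "beer"), (3, "milk"),
]
pruned_result = frequent_pairs_apriori(toy_baskets, support=2)
print(pruned_result)
```

The pruning is sound because a pair can only appear in as many baskets as its rarer item does, so the one-subgoal count is an upper bound on the pair count.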

  7. Safe Queries • Every variable in the head must appear in a nonnegated, nonarithmetic subgoal • Every variable in a negated subgoal must also appear in a nonnegated, nonarithmetic subgoal • Every variable in an arithmetic subgoal must also appear in a nonnegated, nonarithmetic subgoal
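
The three conditions can be checked mechanically. The sketch below uses a made-up representation: each subgoal is tagged "positive", "negated", or "arithmetic" and carries the set of names it mentions. It treats parameters such as $s like ordinary variables, which is an assumption on my part rather than something the slide states.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Subgoal:
    # Hypothetical representation, not from the slides.
    kind: str         # "positive", "negated", or "arithmetic"
    terms: frozenset  # variables and parameters, e.g. {"P", "$s"}

def is_safe(head_vars, subgoals):
    """Check the three safety conditions for a single rule."""
    # Names bound by nonnegated, nonarithmetic subgoals.
    bound = set()
    for goal in subgoals:
        if goal.kind == "positive":
            bound |= goal.terms

    # Condition 1: every head variable is bound.
    if not set(head_vars) <= bound:
        return False
    # Conditions 2 and 3: names in negated or arithmetic subgoals are bound.
    return all(goal.terms <= bound
               for goal in subgoals if goal.kind != "positive")

# Subgoals written in the style of answer(P) :- ... rules.
diagnosed = Subgoal("positive", frozenset({"P", "D"}))
exhibits = Subgoal("positive", frozenset({"P", "$s"}))
not_causes = Subgoal("negated", frozenset({"D", "$s"}))

safe_rule = is_safe({"P"}, [diagnosed, exhibits, not_causes])
unsafe_rule = is_safe({"P"}, [diagnosed, not_causes])
print(safe_rule, unsafe_rule)
```

The second rule is rejected because $s occurs only inside the negated subgoal, so no positive subgoal restricts its values.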

  8. Example • Relations: diagnosed(patient, disease), exhibits(patient, symptom), treatments(patient, medicine), causes(disease, symptom) • Query: answer(P) :- exhibits(P,$s) AND treatments(P,$m) AND diagnosed(P,D) AND NOT causes(D,$s) • Find symptoms $s and medicines $m such that many (at least 20) patients exhibit the symptom and are taking the medicine, but their disease does not explain the symptom

  9. Some Safe Subqueries • answer(P) :- exhibits(P,$s). 20+ patients exhibit the symptom • answer(P) :- treatments(P,$m). 20+ patients were given the medicine • answer(P) :- diagnosed(P,D) AND exhibits(P,$s) AND NOT causes(D,$s). 20+ patients have an unexplained symptom • answer(P) :- exhibits(P,$s) AND treatments(P,$m). 20+ patients are taking the medicine and exhibit the symptom

  10. A Formal Query Plan Using a Sequence of Filter Steps • okS($s) := FILTER($s, answer(P) :- exhibits(P,$s), COUNT(answer.P) >= 20); • okM($m) := FILTER($m, answer(P) :- treatments(P,$m), COUNT(answer.P) >= 20); • ok($s,$m) := FILTER({$s,$m}, answer(P) :- okS($s) AND okM($m) AND diagnosed(P,D) AND exhibits(P,$s) AND treatments(P,$m) AND NOT causes(D,$s), COUNT(answer.P) >= 20);
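
A rough Python rendering of this three-step plan, with each relation held as a set of tuples. The helper names `filter_step` and `run_plan`, the toy data, and the threshold of 2 are assumptions made for illustration; the sketch mirrors the order of the FILTER steps and is not meant as an efficient evaluator.

```python
from collections import defaultdict

def filter_step(patients_by_assignment, support):
    """Keep parameter assignments backed by at least `support` patients."""
    return {a for a, pts in patients_by_assignment.items()
            if len(pts) >= support}

def run_plan(diagnosed, exhibits, treatments, causes, support):
    # okS($s): symptoms exhibited by enough patients.
    by_symptom = defaultdict(set)
    for p, s in exhibits:
        by_symptom[s].add(p)
    ok_s = filter_step(by_symptom, support)

    # okM($m): medicines taken by enough patients.
    by_medicine = defaultdict(set)
    for p, m in treatments:
        by_medicine[m].add(p)
    ok_m = filter_step(by_medicine, support)

    # ok($s,$m): the full query, restricted to the surviving values.
    diseases = defaultdict(set)
    for p, d in diagnosed:
        diseases[p].add(d)
    medicines = defaultdict(set)
    for p, m in treatments:
        if m in ok_m:
            medicines[p].add(m)

    answer = defaultdict(set)
    for p, s in exhibits:
        if s not in ok_s:
            continue
        # diagnosed(P,D) AND NOT causes(D,$s): some disease of the
        # patient must fail to explain the symptom.
        if not any((d, s) not in causes for d in diseases.get(p, ())):
            continue
        for m in medicines.get(p, ()):
            answer[(s, m)].add(p)
    return filter_step(answer, support)

# Toy relations (hypothetical), threshold lowered from 20 to 2.
diagnosed = {("p1", "flu"), ("p2", "flu"), ("p3", "cold")}
exhibits = {("p1", "rash"), ("p2", "rash"), ("p3", "rash"),
            ("p1", "fever"), ("p2", "fever")}
treatments = {("p1", "aspirin"), ("p2", "aspirin"), ("p3", "syrup")}
causes = {("flu", "fever"), ("cold", "rash")}

ok = run_plan(diagnosed, exhibits, treatments, causes, support=2)
print(ok)
```

In the toy data, syrup is dropped by the okM step, fever is explained by flu for every patient who has it, and only the pair (rash, aspirin) keeps two supporting patients.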

  11. But Which Subqueries Are Best? • Depends on sizes of relations, and numbers of patients, diseases, etc. • Use heuristics for restricting the search for a query plan

  12. A Dynamic Technique • Use the sizes of the intermediate relations, after computation, to decide whether to filter • if the relation size gives an average number of tuples per value assignment that is much lower than previous steps, filter • if the set of parameters has not been seen before, compare number of tuples per value assignment with support threshold

  13. Example • Step 1: compare the number of patients with the number of symptoms • Step 2: compare the number of patients with the number of medicines • Step 3: compare the size of the relation with symptoms * medicines • Step 4: compare the number of patients in the relation from step 3 with the number of patients from the leaf • Step 5: must be done to get the query result • [Figure: query plan tree with nodes numbered 1 to 5 over the subgoals exhibits(P,$s), treatments(P,$m), diagnosed(P,D), and NOT causes(D,$s)]

  14. Summary • This is a way of describing operations on large-scale databases • flocks consist of parametrized queries and filters for the results of the queries • exploit the a-priori algorithm with subqueries • use techniques for limiting the search for query plans