Query Flocks: A Generalization of Association-Rule Mining

D. Tsur, J.D. Ullman, S. Abiteboul, C. Clifton, R. Motwani, S. Nestorov, A. Rosenthal

Motivations • Market basket analysis has been successful, partially due to the a-priori optimization • Extend this trick to a more general context • efficiently mine large databases for patterns • use parametrized queries with a filter condition • spend most of the time evaluating the “interesting” cases

Query Flocks • Two parts: • generate parametrized queries (parameters are denoted by names starting with $) • filter the results of the queries • Result is the set of tuples which are “acceptable” assignments of values for the parameters

Market Basket Example Datalog query: answer(B) :- baskets(B,$1) AND baskets(B,$2) Filter: COUNT(answer.B) >= 20 • Find all pairs of items that appear in at least 20 market baskets • Result is all pairs of items ($1,$2) such that at least 20 baskets have both items
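A minimal Python sketch of evaluating this flock directly, without any optimization, may help make the semantics concrete. The function name `frequent_pairs` and the in-memory list of `(basket_id, item)` tuples are assumptions for illustration and are not part of the original slides. Like the SQL form of the query, the sketch reports each unordered pair of distinct items once.

```python
from collections import defaultdict
from itertools import combinations


def frequent_pairs(baskets, support):
    """Return item pairs that occur together in at least `support` baskets.

    `baskets` is an iterable of (basket_id, item) tuples, mirroring the
    baskets(B, Item) relation. Hypothetical helper, for illustration only.
    """
    # Group items by basket so each basket is visited once.
    items_by_basket = defaultdict(set)
    for basket_id, item in baskets:
        items_by_basket[basket_id].add(item)

    # Count, for every candidate assignment ($1, $2), how many baskets satisfy it.
    counts = defaultdict(int)
    for items in items_by_basket.values():
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1

    # Filter step: keep only assignments whose answer set is large enough.
    return {pair for pair, n in counts.items() if n >= support}
```

The cost of this direct evaluation grows with the number of item pairs per basket, which is the cost the a-priori trick is meant to reduce.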

Why Not SQL? The same query in SQL: SELECT i1.Item, i2.Item FROM baskets i1, baskets i2 WHERE i1.Item < i2.Item AND i1.BID = i2.BID GROUP BY i1.Item, i2.Item HAVING 20 <= COUNT(i1.BID) • A-Priori trick is not implemented by conventional optimizers • Claim: necessary code optimizations could be implemented in SQL systems

Generalizing the A-Priori Technique • First evaluate a less expensive query and eliminate certain answers • Use a subset of the subgoals of the query • This subset must form a safe query
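For the market basket flock, the idea above can be sketched as follows: the cheaper subquery answer(B) :- baskets(B,$1) is evaluated first, and items that fail the filter are dropped before any pairs are counted. This is a sketch under assumptions: the name `frequent_pairs_apriori` and the in-memory tuple representation are hypothetical, not taken from the slides.

```python
from collections import defaultdict
from itertools import combinations


def frequent_pairs_apriori(baskets, support):
    """Frequent item pairs, pruning with a cheaper single-item subquery first.

    `baskets` is an iterable of (basket_id, item) tuples. Hypothetical helper.
    """
    items_by_basket = defaultdict(set)
    baskets_by_item = defaultdict(set)
    for basket_id, item in baskets:
        items_by_basket[basket_id].add(item)
        baskets_by_item[item].add(basket_id)

    # Step 1: evaluate the subquery that uses only one subgoal and filter it.
    # An item in fewer than `support` baskets cannot be part of a frequent pair.
    frequent_items = {
        item for item, ids in baskets_by_item.items() if len(ids) >= support
    }

    # Step 2: evaluate the full query, but only over surviving items.
    counts = defaultdict(int)
    for items in items_by_basket.values():
        survivors = sorted(items & frequent_items)
        for pair in combinations(survivors, 2):
            counts[pair] += 1

    return {pair for pair, n in counts.items() if n >= support}
```

The result is the same as for direct evaluation; only the number of candidate pairs that get counted changes.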

Safe Queries • Every variable in the head must appear in a nonnegated, nonarithmetic subgoal • Every variable in a negated subgoal must also appear in a nonnegated subgoal • Every variable in an arithmetic subgoal must also appear in a nonnegated, nonarithmetic subgoal
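The three conditions can be checked mechanically. The sketch below assumes a hypothetical encoding in which each subgoal is a `(kind, variables)` pair with kind `"pos"` (nonnegated, nonarithmetic), `"neg"` (negated), or `"arith"` (arithmetic), and it treats parameters such as `$s` like ordinary variables. For the second condition it uses the stricter reading that the covering subgoal is also nonarithmetic. None of these names come from the slides.

```python
def is_safe(head_vars, subgoals):
    """Check the safety conditions for a single Datalog-style rule.

    `subgoals` is a list of (kind, variables) pairs, where kind is one of
    "pos", "neg", or "arith". Hypothetical encoding, for illustration only.
    """
    # Variables bound by some nonnegated, nonarithmetic subgoal.
    bound = set()
    for kind, variables in subgoals:
        if kind == "pos":
            bound.update(variables)

    # Condition 1: every head variable is bound.
    if not set(head_vars) <= bound:
        return False

    # Conditions 2 and 3: variables of negated and arithmetic subgoals are bound.
    for kind, variables in subgoals:
        if kind in ("neg", "arith") and not set(variables) <= bound:
            return False
    return True
```

For example, a rule with body diagnosed(P,D) AND NOT causes(D,$s) is rejected because `$s` occurs only in the negated subgoal, while adding exhibits(P,$s) makes it safe.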

Example Relations: diagnosed(patient, disease) exhibits(patient, symptom) treatments(patient, medicine) causes(disease, symptom) Query: answer(P) :- exhibits(P,$s) AND treatments(P,$m) AND diagnosed(P,D) AND NOT causes(D,$s) • Find symptoms $s and medicines $m such that many (at least 20) patients exhibit the symptom and are taking the medicine, but their disease does not explain the symptom

Some Safe Subqueries • answer(P) :- exhibits(P,$s). 20+ patients exhibit the symptom • answer(P) :- treatments(P,$m). 20+ patients were given the medicine • answer(P) :- diagnosed(P,D) AND exhibits(P,$s) AND NOT causes(D,$s). 20+ patients have an unexplained symptom • answer(P) :- exhibits(P,$s) AND treatments(P,$m). 20+ patients are taking the medicine and exhibit the symptom

A Formal Query Plan Using A Sequence of Filter Steps okS($s) := FILTER($s, answer(P) :- exhibits(P,$s), COUNT(answer.P) >= 20); okM($m) := FILTER($m, answer(P) :- treatments(P,$m), COUNT(answer.P) >= 20); ok($s,$m) := FILTER({$s,$m}, answer(P) :- okS($s) AND okM($m) AND diagnosed(P,D) AND exhibits(P,$s) AND treatments(P,$m) AND NOT causes(D,$s), COUNT(answer.P) >= 20);

But Which Subqueries Are Best? • Depends on sizes of relations, and numbers of patients, diseases, etc. • Use heuristics for restricting the search for a query plan

A Dynamic Technique • Use the sizes of the intermediate relations, after computation, to decide whether to filter • if the relation size gives an average number of tuples per value assignment that is much lower than previous steps, filter • if the set of parameters has not been seen before, compare number of tuples per value assignment with support threshold
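One way to read the two rules above is as a small decision function applied after each intermediate relation is computed. The sketch below is only an interpretation: the name `should_filter`, the `drop_factor` used to decide what counts as "much lower", and the choice to filter a new parameter set when its average falls below the support threshold are all assumptions that the slides do not specify.

```python
def should_filter(num_tuples, num_assignments, support,
                  previous_ratio=None, drop_factor=0.5):
    """Decide heuristically whether to add a filter step at this point.

    num_tuples: size of the intermediate relation just computed.
    num_assignments: number of distinct parameter-value assignments in it.
    previous_ratio: tuples per assignment at the previous step for the same
        parameter set, or None if this parameter set has not been seen before.
    drop_factor: assumed cutoff for "much lower" (hypothetical value).
    """
    if num_assignments == 0:
        return False  # nothing to filter
    ratio = num_tuples / num_assignments

    if previous_ratio is None:
        # New parameter set: compare the average with the support threshold.
        # Assumption: a low average suggests many assignments will be eliminated.
        return ratio < support

    # Seen before: filter only if the average dropped sharply.
    return ratio < drop_factor * previous_ratio
```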

Example 1. Compare number of patients with number of symptoms 2. Compare number of patients with number of medicines 3. Compare size of the relation with (number of symptoms) * (number of medicines) 4. Compare number of patients in the relation from step 3 with number of patients from the leaf 5. Must be done to get the query result [Figure: query plan tree with nodes numbered 1 to 5; leaves exhibits(P,$s) (1) and treatments(P,$m) (2); above them diagnosed(P,D) (3), NOT causes(D,$s) (4), and the root (5)]

Summary • This is a way of describing operations on large-scale databases • flocks consist of parametrized queries and filters for the results of the queries • exploit the a-priori algorithm with subqueries • use techniques for limiting the search for query plans
