Faster Query Answering in Probabilistic Databases using Read-Once Functions

Sudeepa Roy
Joint work with Vittorio Perduca and Val Tannen
University of Pennsylvania

Probabilistic Databases
• Possible worlds model
  • Each possible world w is a standard database instance and has a probability P[w]
• Compact representation D based on independence assumptions

Query Semantics in Probabilistic Databases
• (wlog) Boolean query q
• Traditional database: q(D) ∈ {true, false}
• Probabilistic database: P[q(D)] = ∑_{w : q(w) = true} P[w]
• Goal: efficiently evaluate P[q(D)]
  • Data complexity; want time polynomial in n = |D|
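The possible-worlds semantics above can be sketched as a brute-force sum. This is only an illustration of the definition, assuming a tuple-independent database; it takes time exponential in the number of tuples, which is exactly what the rest of the talk tries to avoid. The names query_probability, tuple_probs and query_holds are hypothetical.

```python
from itertools import product

def query_probability(tuple_probs, query_holds):
    """Sum P[w] over all possible worlds w in which the query is true."""
    names = list(tuple_probs)
    total = 0.0
    for bits in product([False, True], repeat=len(names)):
        # P[w]: present tuples contribute p, absent tuples contribute 1 - p.
        p_world = 1.0
        for name, present in zip(names, bits):
            p = tuple_probs[name]
            p_world *= p if present else 1.0 - p
        world = {name for name, present in zip(names, bits) if present}
        if query_holds(world):
            total += p_world
    return total

# Hypothetical two-tuple database; the query asks whether both tuples are present.
probs = {"t1": 0.5, "t2": 0.5}
both_present = query_probability(probs, lambda w: {"t1", "t2"} <= w)
```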

Computation of P[q(D)]
• Can we efficiently compute P[q(D)]?
  • No: in general it is #P-hard
• Dalvi-Suciu ’04 ff.: positive queries can be partitioned into
  • Safe queries: safe plans run in poly-time on all instances
  • Unsafe queries: data complexity is #P-hard
    • Includes very simple queries like R(x) S(x, y) T(y)
• Given q as input, we can efficiently decide whether q is safe
• BUT: for unsafe queries, probabilities on some instances can still be computed efficiently
• Our approach: take both q and D as input

Restrictions
• Conjunctive query without self-join (CQ-)
  • q() := R(x), S(x, y), T(y)
  • (This is the H0 query from Suciu’s keynote)
• Tuple-independent representation D
  • Tuple t annotated by P[t]
[Figure: database D with tables R, S, T and a Probability column, next to one possible world w]
P[w] = 0.3 (1 – 0.4) (1 – 0.6) 0.1 (1 – 0.5) (1 – 0.2) (1 – 0.1) 0.7 (1 – 0.8) (1 – 0.4)

Query Answering in Two Steps: Example
[Figure: database D with tables R, S, T, with an event variable and a probability for each tuple]
q() := R(x), S(x, y), T(y)
• Event variables for tuples
• Step 1 (easy): event expression for q(D), or “lineage”
  • E = w1 v1 u1 + w2 v2 u1 + w3 v3 u2 + w3 v4 u3
  • The “form” of the expression depends on the query plan; here π()((R ⋈ S) ⋈ T)
• Step 2 (hard): compute P[q(D)] = P[E]
  • given Pr[w1] = 0.3, Pr[v1] = 0.4, …
• This work: take advantage of read-once expressions
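Step 1 can be sketched as a nested-loop join that records which tuples were combined. The attribute values (a1, b1, ...) are hypothetical, chosen only so that the join reproduces the lineage E on the slide; only the event variables w, v, u come from the talk.

```python
# Hypothetical attribute values; each tuple is keyed by its event variable.
R = {"w1": "a1", "w2": "a2", "w3": "a3"}                 # R(x)
S = {"v1": ("a1", "b1"), "v2": ("a2", "b1"),
     "v3": ("a3", "b2"), "v4": ("a3", "b3")}             # S(x, y)
T = {"u1": "b1", "u2": "b2", "u3": "b3"}                 # T(y)

def lineage(R, S, T):
    """DNF lineage of q() := R(x), S(x, y), T(y), one term per join result."""
    terms = []
    for w, x in R.items():
        for v, (sx, sy) in S.items():
            if sx != x:
                continue
            for u, y in T.items():
                if y == sy:
                    terms.append((w, v, u))
    return terms

E = lineage(R, S, T)
print(" + ".join(" ".join(term) for term in E))
```

The printed DNF has one product term per joining triple of tuples, which is why it can grow quickly with the database size.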

Read-Once Boolean Expressions
[Figure: expression tree of a read-once form]
• Expression in read-once form: every variable occurs exactly once
  • e.g. ((x + y)z + w)(u + v)
  • Linear-time probability computation
    • P(x y) = P(x) P(y)
    • P(x + y) = 1 – (1 – P(x)) (1 – P(y))
• Read-once expression: has an equivalent read-once form
  • e.g. xzu + xzv + yzu + yzv + wu + wv [in DNF, as large as O(n^|q|)]
  • xzu + xzv + (yz + w)(u + v) [not in DNF, can be much smaller]
• Non-read-once expressions: no read-once form
  • e.g. xy + yz + zx, xy + yz + zw
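The linear-time computation can be sketched as one recursive pass over the read-once form, applying the two rules above. The rules are valid here because the operands of each + and product share no variables and the variables are independent. The nested-tuple encoding and the probabilities of 0.5 are assumptions made for the example.

```python
def prob(expr, p):
    """Probability of a read-once form, given independent variable probabilities p."""
    if isinstance(expr, str):          # a variable
        return p[expr]
    op, *children = expr
    if op == "and":                    # P(x y) = P(x) P(y)
        result = 1.0
        for child in children:
            result *= prob(child, p)
        return result
    if op == "or":                     # P(x + y) = 1 - (1 - P(x)) (1 - P(y))
        none_true = 1.0
        for child in children:
            none_true *= 1.0 - prob(child, p)
        return 1.0 - none_true
    raise ValueError(f"unknown operator: {op!r}")

# ((x + y) z + w) (u + v)
form = ("and", ("or", ("and", ("or", "x", "y"), "z"), "w"), ("or", "u", "v"))
p = {name: 0.5 for name in "xyzwuv"}   # hypothetical probabilities
result = prob(form, p)
```

Each variable is visited once, so the running time is linear in the size of the read-once form.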

Read-Once Event Expressions
• Safe plans for safe queries directly produce expressions in read-once form (Olteanu-Huang ’08)
• Unsafe queries can also produce read-once expressions
• Our example is read-once
  • E = w1 v1 u1 + w2 v2 u1 + w3 v3 u2 + w3 v4 u3
      = (w1 v1 + w2 v2) u1 + w3 (v3 u2 + v4 u3)
  • Corresponds to the unsafe query q() := R(x), S(x, y), T(y)
  • No query plan can produce the read-once form directly
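The claimed equality of the DNF and the read-once form can be checked by brute force over all 2^10 truth assignments. This is a sanity check written for this example only, not part of the method in the talk.

```python
from itertools import product

VARS = ["w1", "w2", "w3", "v1", "v2", "v3", "v4", "u1", "u2", "u3"]

def dnf(a):
    """E = w1 v1 u1 + w2 v2 u1 + w3 v3 u2 + w3 v4 u3"""
    return ((a["w1"] and a["v1"] and a["u1"]) or
            (a["w2"] and a["v2"] and a["u1"]) or
            (a["w3"] and a["v3"] and a["u2"]) or
            (a["w3"] and a["v4"] and a["u3"]))

def read_once(a):
    """(w1 v1 + w2 v2) u1 + w3 (v3 u2 + v4 u3)"""
    left = ((a["w1"] and a["v1"]) or (a["w2"] and a["v2"])) and a["u1"]
    right = a["w3"] and ((a["v3"] and a["u2"]) or (a["v4"] and a["u3"]))
    return left or right

# Compare the two forms on every truth assignment.
equivalent = all(
    dnf(dict(zip(VARS, bits))) == read_once(dict(zip(VARS, bits)))
    for bits in product([False, True], repeat=len(VARS))
)
```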

Problem Definition
• Given
  • a Boolean CQ- query q,
  • a tuple-independent database D,
• Can we efficiently decide whether the event expression corresponding to q(D) is read-once?
• If yes, can we compute the read-once form efficiently?
  • (then P[q(D)] can be computed efficiently)

Read-once-ness: only a sufficient condition to efficiently compute P[q(D)]
[Figure: E is read-once → read-once form of E can be computed efficiently → P[E] can be computed efficiently]
• The converse fails, e.g. E = x1 x2 + x2 x3 + x3 x4 + …
  • Not read-once
  • P[E] can be computed in poly-time using dynamic programming
• Moreover, see the detailed analysis in Jha-Suciu ’11 using OBDD, FBDD, d-DNNF
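The slide does not spell out the dynamic program, so the following is one possible sketch for the chain E = x1 x2 + x2 x3 + … + x(n-1) xn, assuming independent variables. It tracks the probability that no adjacent pair is both true, split by the value of the most recent variable. The function name chain_probability is hypothetical.

```python
def chain_probability(p):
    """P[x1 x2 + x2 x3 + ... + x(n-1) xn] for independent variables, p[i] = P[x(i+1)]."""
    if len(p) < 2:
        return 0.0  # no product term at all
    # ok_false / ok_true: probability that no adjacent pair seen so far is
    # both true, and the most recent variable is false / true.
    ok_false, ok_true = 1.0 - p[0], p[0]
    for pi in p[1:]:
        ok_false, ok_true = (ok_false + ok_true) * (1.0 - pi), ok_false * pi
    return 1.0 - (ok_false + ok_true)
```

For three variables with probability 0.5 each, E = x2 (x1 + x3), so the result should be 0.5 × 0.75 = 0.375, which is what the recurrence yields.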

Outline
• Background
  • Existing characterization of read-once expressions
  • Co-occurrence graphs
• Our contributions
  • Co-table graph
  • Step 1: computation of the co-table graph
  • Step 2: computation of the read-once form
• Related work, future work and conclusion

Characterization of Read-once Expressions
• A positive Boolean expression is read-once if and only if its “co-occurrence graph” is P4-free (no induced simple path on four vertices) and “normal” (Gurvich ’77, ’91)
• Can be checked (and the read-once form computed) in poly-time if the expression is given in DNF (Golumbic-Mintz-Rotics ’06)

Co-occurrence Graph: GCO
[Figure: co-occurrence graph on the vertices x, y, z]
• Graph with the variables of the expression as vertices
• 1. Express the Boolean expression in irredundant DNF
  • xy + xyz + zx → xy + zx
• 2. Put an edge between two variables if they co-occur in a disjunct
• Can be easily computed if the expression is in DNF
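The two steps above, plus a naive test for an induced P4, can be sketched as follows. Terms are given as strings of single-letter variables, which is an encoding chosen for this example. The P4 test enumerates all ordered 4-tuples of vertices and is meant only for tiny inputs. P4-freeness alone does not settle read-once-ness in general: xy + yz + zx has a P4-free co-occurrence graph (a triangle) but fails the normality condition.

```python
from itertools import combinations, permutations

def irredundant(terms):
    """Drop every term that strictly contains another term (absorption)."""
    terms = {frozenset(t) for t in terms}
    return {t for t in terms if not any(other < t for other in terms)}

def co_occurrence_graph(terms):
    """Edges between variables that share a term of the irredundant DNF."""
    edges = set()
    for term in irredundant(terms):
        for a, b in combinations(sorted(term), 2):
            edges.add(frozenset((a, b)))
    return edges

def has_induced_p4(edges):
    """True if some four vertices induce a simple path a - b - c - d."""
    vertices = sorted({v for e in edges for v in e})

    def adjacent(a, b):
        return frozenset((a, b)) in edges

    for a, b, c, d in permutations(vertices, 4):
        if (adjacent(a, b) and adjacent(b, c) and adjacent(c, d)
                and not adjacent(a, c) and not adjacent(a, d)
                and not adjacent(b, d)):
            return True
    return False

g = co_occurrence_graph(["xy", "xyz", "zx"])      # the slide's example
path = co_occurrence_graph(["xy", "yz", "zw"])    # a non-read-once example
```

Here g has only the edges x-y and x-z because the redundant term xyz is absorbed by xy, while path is itself an induced P4.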

Our Contributions
[Figure: Our framework. Path (1): compute GCO, then use existing read-once testing algorithms; uses Gurvich’s characterization. Path (2): compute GCT, then use our read-once testing algorithm; uses an alternative. Path (2) is faster than path (1).]
• 1. The DNF of the event expression is not needed for CQ-
  • GCO can be computed directly from “provenance DAGs”
• 2. We do not need to compute GCO
  • A subgraph of GCO suffices: the “co-table graph” GCT

Provenance DAG
[Figure: provenance DAG for the query plan, over the event variables w1, w2, w3, v1, v2, v3, v4, u1, u2, u3]
• Query q() := R(x), S(x, y), T(y)
• Query plan π()((R ⋈ S) ⋈ T)
• E = w1 v1 u1 + w2 v2 u1 + w3 v3 u2 + w3 v4 u3
• Event expressions, called “lineage” (Suciu keynote), are a form of provenance (Green-Karvounarakis-Tannen ’07)
• We use provenance DAGs (Green et al. ’07)

Co-Table Graph: GCT
[Figure: GCO and GCT side by side, on the vertices w1, w2, w3, v1, v2, v3, v4, u1, u2, u3]
• E = w1 v1 u1 + w2 v2 u1 + w3 v3 u2 + w3 v4 u3, for q() := R(x), S(x, y), T(y)
• Subgraph of GCO: |GCT| ≤ |GCO|
• Put an edge between two co-occurring variables only if their tables share variables in q
• e.g. q() := R(x), S(y)
  • R, S have n tuples each: GCO has n² edges, GCT has zero!
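The definition can be illustrated by filtering co-occurrence edges with the table-sharing condition. This sketch starts from the DNF terms purely for illustration; the talk computes GCT from the provenance DAG without building the DNF. The names co_table_graph, table_of and query_vars are hypothetical.

```python
from itertools import combinations

def co_table_graph(terms, table_of, query_vars):
    """Keep a co-occurrence edge only if the two tables share a query variable."""
    edges = set()
    for term in terms:
        for a, b in combinations(term, 2):
            if query_vars[table_of[a]] & query_vars[table_of[b]]:
                edges.add(frozenset((a, b)))
    return edges

# q() := R(x), S(x, y), T(y) with the lineage from the slide
terms = [("w1", "v1", "u1"), ("w2", "v2", "u1"),
         ("w3", "v3", "u2"), ("w3", "v4", "u3")]
# Event variables w*, v*, u* annotate tuples of R, S, T respectively.
table_of = {name: {"w": "R", "v": "S", "u": "T"}[name[0]]
            for term in terms for name in term}
query_vars = {"R": {"x"}, "S": {"x", "y"}, "T": {"y"}}
g_ct = co_table_graph(terms, table_of, query_vars)
```

For this example the R-T edges (such as w1-u1) are dropped because R and T share no variable in q, leaving 8 of the 12 co-occurrence edges.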

Our Algorithm
• Input: provenance DAG H
  • Obtained from the query plan
• Step 1: compute GCT
  • (the same procedure can compute GCO as well)
• Step 2: compute the read-once form (if possible)
  • Otherwise output that the event expression is not read-once

Step 1: Computing GCT
[Figure: provenance DAG for E = xy + xz, with leaves x, y, z]
• Theorem: two variables are adjacent in GCT if and only if their least common ancestor set contains a product-node in the provenance DAG
• The proof critically uses the no-self-join assumption
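The adjacency test in the theorem can be sketched on a small hand-built DAG for E = xy + xz. Both the DAG shape (p1 = x·y, p2 = x·z, s = p1 + p2) and the reading of "least common ancestor set" as the common ancestors with no other common ancestor below them are assumptions made for this example. The code is a naive illustration and does not reflect the running time of the actual algorithm.

```python
def ancestors(node, parents):
    """All proper ancestors of node (parents maps a node to its parent list)."""
    seen, stack = set(), list(parents.get(node, []))
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(parents.get(n, []))
    return seen

def least_common_ancestors(a, b, parents):
    """Common ancestors that have no other common ancestor below them."""
    common = ancestors(a, parents) & ancestors(b, parents)
    return {c for c in common
            if not any(c in ancestors(d, parents) for d in common if d != c)}

def adjacent(a, b, parents, kind):
    """Theorem's test: some least common ancestor is a product-node."""
    return any(kind[c] == "product"
               for c in least_common_ancestors(a, b, parents))

# Hypothetical DAG for E = xy + xz
parents = {"x": ["p1", "p2"], "y": ["p1"], "z": ["p2"],
           "p1": ["s"], "p2": ["s"]}
kind = {"p1": "product", "p2": "product", "s": "sum"}
```

On this DAG, x-y and x-z come out adjacent, while y and z meet only at the sum-node s and are not adjacent.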

Step 2: Computing the Read-once Form
[Figure: row decomposition keeps the query q and splits E into E = E1 + E2 + E3; table decomposition splits q into q1, q2 and E into E = E1 E2]
• Input: GCT
• Alternate between row decomposition and table decomposition
• Recursive computation
  • Exactly one of the two can be done at each recursion level; otherwise E is not read-once
• The proof critically uses the no-union assumption
• Sound and complete

Example: Row Decomposition
q() := R(x), S(x, y), T(y)
• E = w1 v1 u1 + w2 v2 u1 + w3 v3 u2 + w3 v4 u3
[Figure: GCT over tables R, S, T splits into two parts combined by +: {w1, w2, v1, v2, u1} (rows R1, S1, T1) and {w3, v3, v4, u2, u3}]
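In this example the two parts of the row decomposition coincide with the connected components of GCT, so the step can be sketched as a BFS. Treating row decomposition as a connected-components computation is an assumption that matches this example. The edge list is GCT as derived from the definition (co-occurring R-S and S-T pairs), since the figure itself is not recoverable from the slide text.

```python
from collections import deque

def components(vertices, edges):
    """Connected components of an undirected graph, found by BFS."""
    neighbours = {v: set() for v in vertices}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, result = set(), []
    for start in vertices:
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), set()
        while queue:
            v = queue.popleft()
            component.add(v)
            for n in neighbours[v] - seen:
                seen.add(n)
                queue.append(n)
        result.append(component)
    return result

# GCT edges for the running example
edges = [("w1", "v1"), ("v1", "u1"), ("w2", "v2"), ("v2", "u1"),
         ("w3", "v3"), ("v3", "u2"), ("w3", "v4"), ("v4", "u3")]
vertices = ["w1", "w2", "w3", "v1", "v2", "v3", "v4", "u1", "u2", "u3"]
parts = components(vertices, edges)
```

The two components correspond to the two summands of the final expression, (w1 v1 + w2 v2) u1 and w3 (v3 u2 + v4 u3).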

Example: Table Decomposition
q() := R(x), S(x, y), T(y) decomposes into q1() := R(x), S(x, y1) and q2() := T(y2)
[Figure: the part on rows R1, S1, T1 with variables w1, w2, v1, v2, u1 splits into (w1 v1 + w2 v2) and u1, giving (w1 v1 + w2 v2) u1]
Final expression: (w1 v1 + w2 v2) u1 + w3 (v3 u2 + v4 u3)

Overall Time Complexity
• Input: provenance DAG H
• Step 1: compute GCT or GCO
  • Time complexity ≈ O(n mH + WH mCO)
  • mH = #edges in H, WH = width of H, mCO = #edges in GCO, mCT = #edges in GCT
• Step 2: compute the read-once form (if possible)
  • Using our algorithm: O((mCT + n) min(|q|, √n)); data complexity O(mCT + n)
  • Using existing algorithms: O(mCO + n), where mCT ≤ mCO
• Analysis uses a “charging argument”
  • Bound the recursion depth and the total time at each recursion level
• Summary
  • Step 1 is more expensive
  • Step 2 is linear
    • in |GCO| for existing algorithms
    • in |GCT| for our algorithm
    • |GCT| ≤ |GCO|

Related Work
• Sen-Deshpande-Getoor ’10
  • Independent work, considers the same problem
  • Shows that the “normality” check is not needed for CQ-
  • Tests P4-freeness using “lineage trees” without computing the co-occurrence graph
• Our work
  • Computes the co-occurrence graph without DNF computation
    • Existing algorithms can then be used
    • Was an open question in Sen-Deshpande-Getoor ’10
  • Obtains a faster and simpler algorithm
    • Time complexity comparison in the paper
    • Uses BFS/DFS, easier to implement
    • Uses compact provenance DAGs instead of lineage trees

Other Related Work
• Semantics of probabilistic query answering
  • Fuhr-Rölleke ’97, Zimányi ’97
• Dichotomy of CQ-, CQ and UCQ queries
  • Dalvi-Suciu ’04, ’07, Dalvi-Schnaitter-Suciu ’10
• Knowledge compilation techniques
  • Olteanu-Huang ’08
  • Jha-Olteanu-Suciu ’10
  • Jha-Suciu ’11
  • Fink-Olteanu ’11

Conclusion and Future Work
• Can the co-occurrence/co-table graph be computed as a pre-processing step?
  • This is the more expensive step
  • Akin to building indexes on databases, but depends on the query’s “join pattern”
  • Cache the already computed GCT with the join pattern
• How to handle
  • Larger classes of queries (UCQ?) and database models (disjoint-independent?)
  • Other efficient knowledge-compilation forms

Thank You. Questions?
