Characterizing Memory Requirements for Queries over Continuous Data Streams
Arvind Arasu, Brian Babcock*, Shivnath Babu, Jon McAlister, Jennifer Widom
Stanford University
*Speaker

Continuous Data Streams
• Network traffic data
• Transaction logs
  • Call records, Web logs, ...
• Financial data
• Sensor networks
• Scientific data
  • Astronomy, Biology, ...

A DBMS for Data Streams?
• Lots of existing work in data streams
  • Mostly special-purpose applications
• We’re building a general-purpose “data stream management system” (DSMS)
  http://www-db.stanford.edu/stream/

RDBMS                                      | DSMS
Relations stored on disk                   | Tuples read once & discarded
Data access patterns controlled by RDBMS   | Data must be processed as it arrives
Query answers are relations                | Query answers are streams

Query Execution Model
1. Client registers query
2. Tuples arrive on streams (S, T), are read and discarded, and answers are returned to the client
• Limited-size “scratch space” available in DSMS memory
[Figure: client, input streams S and T, and the DSMS with its memory]

Our Problem: Given a data stream query, determine how much memory is required to evaluate it.
Queries We Consider
• SPJ queries: π_L(σ_P(S1 × S2 × … × Sn))
• Projection is either duplicate-preserving or duplicate-eliminating
• Selection predicates are conjunctions of:
  • Si.A Op Sj.B -or- Si.A Op k
  • Op ∈ {>, ≥, =, <, ≤}
• All attributes are integers

An example with no joins
SELECT cust_id FROM orders WHERE amt > 5
• Requires no “scratch” memory
• Each tuple is independent
• Tuples in the answer are streamed away
SELECT DISTINCT cust_id FROM orders WHERE amt > 5
• Requires unbounded memory
• All cust_ids must be remembered
SELECT DISTINCT cust_id FROM orders WHERE amt > 5 AND cust_id > 1000 AND cust_id < 9999
• Requires bounded memory
• Remember cust_ids from 1000-9999

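The three variants above can be sketched in Python to make the memory behavior concrete. This is only an illustration, assuming each `orders` tuple is a `(cust_id, amt)` pair; the function names are hypothetical and not taken from the talk.

```python
def select_all(orders):
    """SELECT cust_id ... WHERE amt > 5: no state is kept between tuples."""
    for cust_id, amt in orders:
        if amt > 5:
            yield cust_id


def select_distinct(orders):
    """SELECT DISTINCT ...: the 'seen' set can grow without limit."""
    seen = set()
    for cust_id, amt in orders:
        if amt > 5 and cust_id not in seen:
            seen.add(cust_id)
            yield cust_id


def select_distinct_ranged(orders, lo=1000, hi=9999):
    """Adds lo < cust_id < hi: one flag per id in range, so state is fixed-size."""
    seen = bytearray(hi - lo - 1)  # flags for ids lo+1 .. hi-1
    for cust_id, amt in orders:
        if amt > 5 and lo < cust_id < hi and not seen[cust_id - lo - 1]:
            seen[cust_id - lo - 1] = 1
            yield cust_id


orders = [(1500, 10), (1500, 7), (20000, 9), (1500, 3), (2000, 6)]
print(list(select_all(orders)))
print(list(select_distinct(orders)))
print(list(select_distinct_ranged(orders)))
```

The only difference between the last two functions is the size of the state: a set that depends on the input versus a flag array whose size is fixed by the constants in the query.
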
An example with an equijoin
SELECT R.prod_id
FROM orders O, returns R
WHERE O.order_num = R.order_num
  AND R.prod_id >= 100 AND R.prod_id < 199
  AND O.order_num > 1000 AND O.order_num < 1103

An example with an inequality
SELECT DISTINCT O.prod_id
FROM orders O, inventory I
WHERE O.amt > I.qty
  AND O.prod_id >= 100 AND O.prod_id < 300

prod_id  MAX(amt)  |  prod_id  MIN(amt)
100      45        |  100      17
101      21        |  101      12
...      ...       |  ...      ...
299      2         |  299      36

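One way to see why a bounded summary is enough here is the following Python sketch. It assumes the query exactly as written in the slide text (no predicate ties inventory tuples to a product), so it keeps a single running minimum of `I.qty` rather than the per-product right-hand table shown on the slide; that simplification, the class name, and the method names are assumptions made for illustration.

```python
class InequalityJoinSynopsis:
    """State for SELECT DISTINCT O.prod_id ... WHERE O.amt > I.qty, 100 <= prod_id < 300."""

    def __init__(self, lo=100, hi=300):
        self.lo, self.hi = lo, hi
        self.min_qty = None   # smallest I.qty seen so far
        self.max_amt = {}     # prod_id -> largest O.amt seen, for ids not yet output
        self.emitted = set()  # prod_ids already in the answer

    def on_order(self, prod_id, amt):
        """Process one orders tuple; return the prod_ids newly added to the answer."""
        if not (self.lo <= prod_id < self.hi) or prod_id in self.emitted:
            return []
        if self.min_qty is not None and amt > self.min_qty:
            self.emitted.add(prod_id)
            self.max_amt.pop(prod_id, None)
            return [prod_id]
        self.max_amt[prod_id] = max(amt, self.max_amt.get(prod_id, amt))
        return []

    def on_inventory(self, qty):
        """Process one inventory tuple; return the prod_ids newly added to the answer."""
        if self.min_qty is None or qty < self.min_qty:
            self.min_qty = qty
        out = sorted(p for p, a in self.max_amt.items() if a > self.min_qty)
        for p in out:
            self.emitted.add(p)
            del self.max_amt[p]
        return out


syn = InequalityJoinSynopsis()
syn.on_order(100, 45)
syn.on_order(101, 21)
print(syn.on_inventory(30))  # product 100 now has an order with amt > qty
```

The state never exceeds one entry per product id in the range 100 to 299 plus one integer, regardless of how many tuples arrive.
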
“Locally Totally Ordered” Queries
• LTO queries: SPJ queries with additional predicates applied
  • For each stream, stipulate a total order for all attributes in the stream & all constants
  • Only allow tuples whose attribute values follow that ordering
• All SPJ queries can be written as a union of LTO queries

Example of an LTO query
Stream S: (A, B)
Stream T: (C, D)
Original query:
SELECT S.A, T.C FROM S, T WHERE S.B > 12
One corresponding LTO query, with a total order stipulated within each stream:
SELECT S.A, T.C FROM S, T WHERE S.B > 12 AND S.A = S.B AND T.D < T.C AND T.C < 12

MinRef and MaxRef
For each stream S in the query:
MinRef(S) = { S.A : S.A < T.B is a necessary inequality in the predicate }
MaxRef(S) is defined symmetrically, for necessary inequalities of the form S.A > T.B.
Example:
SELECT T.C FROM S, T WHERE S.A < 5 AND 5 < T.C AND S.A < T.C
{ S.A < 5, 5 < T.C } => S.A < T.C, so S.A < T.C is not a necessary inequality.

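A rough Python sketch of the "necessary inequality" test follows. It treats each predicate `x < y` as a directed edge and calls an inequality unnecessary when it is reachable by transitivity from the remaining ones. This is an assumption-laden simplification: it handles only strict `<` predicates, the function names are hypothetical, and a path-based check over integers is sound but not complete (it can miss some implications).

```python
from collections import defaultdict


def implied(lhs, rhs, others):
    """True if lhs < rhs follows from the '<' pairs in 'others' by transitivity."""
    graph = defaultdict(set)
    for a, b in others:
        graph[a].add(b)
    # Integer constants are ordered among themselves.
    terms = {lhs, rhs} | {t for pair in others for t in pair}
    consts = sorted(t for t in terms if isinstance(t, int))
    for small, large in zip(consts, consts[1:]):
        graph[small].add(large)
    # Depth-first search for a chain lhs < ... < rhs.
    stack, seen = [lhs], {lhs}
    while stack:
        node = stack.pop()
        for nxt in graph[node]:
            if nxt == rhs:
                return True
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


def is_necessary(predicates, i):
    """True if predicates[i] is not implied by the other predicates."""
    lhs, rhs = predicates[i]
    return not implied(lhs, rhs, predicates[:i] + predicates[i + 1:])


# The example above: S.A < 5 AND 5 < T.C AND S.A < T.C
preds = [("S.A", 5), (5, "T.C"), ("S.A", "T.C")]
print([is_necessary(preds, i) for i in range(len(preds))])
```
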
Bounded-Memory Conditions
1. All attributes in the projection list must be bounded.
2. All attributes participating in equijoins must be bounded.
3. In each stream S, |MinRef(S)| + |MaxRef(S)| must be:
   • = 0, for SELECT
   • ≤ 1, for SELECT DISTINCT

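The three conditions translate directly into a small check. The sketch below is illustrative only: it assumes the caller has already worked out which attributes are bounded and what MinRef and MaxRef contain for each stream, and the function name and argument layout are hypothetical.

```python
def bounded_memory(projection, equijoin_attrs, bounded, min_ref, max_ref, distinct):
    """Apply the three bounded-memory conditions to precomputed inputs.

    projection, equijoin_attrs: iterables of attribute names
    bounded: set of attribute names known to be bounded
    min_ref, max_ref: dicts mapping stream name -> set of attributes
    distinct: True for SELECT DISTINCT, False for SELECT
    """
    # Condition 1: every projected attribute is bounded.
    if not set(projection) <= bounded:
        return False
    # Condition 2: every equijoin attribute is bounded.
    if not set(equijoin_attrs) <= bounded:
        return False
    # Condition 3: per-stream limit on |MinRef(S)| + |MaxRef(S)|.
    limit = 1 if distinct else 0
    streams = set(min_ref) | set(max_ref)
    return all(
        len(min_ref.get(s, ())) + len(max_ref.get(s, ())) <= limit
        for s in streams
    )


# The inequality-join query over orders O and inventory I, with and without DISTINCT.
args = dict(
    projection=["O.prod_id"],
    equijoin_attrs=[],
    bounded={"O.prod_id"},
    min_ref={"I": {"I.qty"}},
    max_ref={"O": {"O.amt"}},
)
print(bounded_memory(distinct=True, **args), bounded_memory(distinct=False, **args))
```

Keeping the analysis (boundedness, MinRef, MaxRef) outside the function keeps the check itself a literal reading of the slide.
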
An unbounded example
SELECT DISTINCT T.E FROM S, T WHERE T.E = 10 AND S.A < T.C AND S.B < T.D
S: (A,B)    tuples: (0, 0), (-1, 1), (-2, 2), …, (-c, c), ...
T: (C,D,E)  tuple:  (1-c, 1+c, 10)

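A short Python check shows why this input is "bad": the T tuple (1-c, 1+c, 10) joins with exactly one S tuple, (-c, c), so forgetting any S tuple can change the answer. The helper names below are hypothetical and the stream is truncated to 50 tuples purely for the demonstration.

```python
def matches(s_stream, t_tuple):
    """Return the S tuples that join with t_tuple under the query's predicate."""
    c_val, d_val, e_val = t_tuple
    if e_val != 10:
        return []
    return [(a, b) for a, b in s_stream if a < c_val and b < d_val]


def probe(c):
    """The T tuple from the slide for a given c."""
    return (1 - c, 1 + c, 10)


s_stream = [(-i, i) for i in range(50)]

# Each probe joins with exactly the one S tuple it was built for.
print(all(matches(s_stream, probe(c)) == [(-c, c)] for c in range(50)))

# If (-7, 7) had been forgotten, the probe for c = 7 would find no match.
forgotten = [s for s in s_stream if s != (-7, 7)]
print(matches(forgotten, probe(7)))
```
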
Conclusion
• We consider SPJ queries over data streams
• We identify which queries can and cannot be evaluated using bounded memory
• For queries that can, we provide an execution strategy based on synopses.
• For queries that cannot, we provide examples of “bad” input streams.
Full paper at http://www-db.stanford.edu/???
E-mail: babcock@cs.stanford.edu