Data Streams • Definition: Data arriving continuously, usually just by insertions of new elements. The size of the stream is not known a priori, and may be unbounded. • Hot research area
Data Streams • Applications: • Phone call records in AT&T • Network Monitoring • Financial Applications (Stock Quotes) • Web Applications (Data Clicks) • Sensor Networks
Continuous Queries • Mainly used in data stream environments • Defined once, and run until the user terminates them • Example: Give me the names of the stocks whose value increased by at least 5% over the last hour
What is the Problem (1)? • Q is a selection: then the size of the answer A may be unbounded. Thus, we cannot guarantee we can store it.
What is the Problem (2)? • Q is a self-join: If we want to provide only NEW results, then we need unlimited storage to guarantee no duplicates exist in result
What is the Problem (3)? • Q contains aggregation: then tuples in A might be deleted by newly observed tuples. Example:

    Select A, sum(B)
    From Stream X
    Group by A
    Having sum(B) > 100

What if B < 0? A group already reported in the answer may fall back below the threshold and have to be removed.
What is the Problem (4)? • What if we can delete tuples in the Stream? • What if Q contains a blocking operator near the top (example: aggregation)? • Online Aggregation Techniques useful
Related Areas • Data Approximation: limits the size of Scratch and Store • Grouping Continuous Queries submitted over the same sources • Adaptive Query Processing (data sources may be providing elements at varying rates) • Partial Results: Give partial results to the user (the query may run forever) • Data Mining: Can the algorithms be modified to use one scan of the data and still provide good results?
Initial Approaches (1) • Typical Approach: Limit expressiveness of query language to limit size of Store, Scratch • Alert (1991): • Triggers on Append-Only (Active) Tables • Event-condition-action triggers • Event: Cursor on Active Table • Condition: From and Where Clause of Rule • Action: Select Clause of Rule (typically called a function) • Triggers were expressed as continuous queries • User was responsible for monitoring the size of tables
Initial Approaches (2) • Tapestry (1992): • Introduced the notion of Continuous Queries • Used a subset of SQL (TQL) • Query Q was converted to the Minimum Monotone Bounding Query QM, where QM(t) = Union of Q(τ) for all τ <= t • Then QM was converted to an Incremental query QI • Problems: • Duplicate tuples were returned • Aggregation Queries were not supported • No Outer-Joins allowed
Initial Approaches (3) • Chronicle Data Model (1995): • Data Streams referred as Chronicles (append-only) • Assumptions: • A new tuple is not joined with previously seen tuples. • At most a constant number of tuples from a relation R can join with Chronicle C. • Achievement: Incremental maintenance of views in time independent of the Chronicle size
Materialized Views • Work on Self-Maintenance: important to limit the size of Scratch. If a view is self-maintainable, any auxiliary storage must occupy bounded space • Work on Data Expiration: important for knowing when to move elements from Scratch to Throw.
Data Approximation • The area where most work is being done nowadays • Problem: We cannot afford O(N) space/time cost per element to solve a problem; we want solutions close to O(poly(log N)). • Sampling • Histograms • Wavelets • Sketching Techniques
Sampling • Easiest to implement and use • Reservoir Sampling: the dominant algorithm (see the sketch below) • Used for any problem (but with serious limitations, especially in the case of joins) • Stratified Sampling (sampling data at different rates) • Reduces variance in the data • Reduces error in Group-By Queries.
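A minimal sketch of reservoir sampling (the classic Algorithm R): it maintains a uniform random sample of fixed size k over a stream of unknown, possibly unbounded length.

    import random

    def reservoir_sample(stream, k):
        """Keep a uniform random sample of k elements from a stream
        of unknown length (Algorithm R)."""
        reservoir = []
        for n, item in enumerate(stream):
            if n < k:
                reservoir.append(item)      # fill the reservoir first
            else:
                j = random.randint(0, n)    # item survives with prob k/(n+1)
                if j < k:
                    reservoir[j] = item     # evict a random resident
        return reservoir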
Histograms • V-Optimal: • Gilbert et al. removed the sortedness restriction: time/space O(poly(B, log N, 1/ε)) using sketches • Equi-Width: • Compute quantiles in O((1/ε) log(εN)) space with precision εN • Correlated Aggregates: • AGG-D{Y : Predicate(X, AGG-I(X))} • AGG-D: Count or Sum • AGG-I: Min, Max or Average • Reallocate the histogram based on arriving tuples and the AGG-I (e.g., if we want min and are storing [min, min + ε] in the histogram and a new min arrives, throw away the previous histogram; see the sketch below).
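A hedged sketch of that last idea, assuming the simplest case AGG-D = Count and AGG-I = Min: tuples are counted while their X value lies within ε of the running minimum, and the summary is discarded whenever a new minimum invalidates the tracked range (so the count is approximate).

    def correlated_count(stream, eps):
        """Approximate Count{y : x <= min(x) + eps} over (x, y) pairs.
        When a new minimum makes the tracked range [min, min + eps]
        obsolete, the previous summary is thrown away."""
        cur_min = float('inf')
        count = 0
        for x, y in stream:
            if x < cur_min:
                if x + eps < cur_min:   # old range fully invalidated
                    count = 0           # throw away previous summary
                cur_min = x
            if x <= cur_min + eps:
                count += 1
        return count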
Wavelets • Used for Signal Decomposition (good if measured aggregate follows a signal) • Matias, Vitter: Incremental Maintenance of top Wavelet coefficients • Gilbert et al: Point, Range Queries with wavelets
Sketching Techniques (1) • Main idea: If getting the exact value of a variable V requires O(N) time, then use approximation: • Define a random variable R with expected value equal to that of V, and small variance. • Example (self-join size): • Select 4-wise independent ±1 variables ξi (i = 1, …, |dom(A)|) • Define Z = X², where X = Σi f(i)ξi and f(i) is the frequency of the i-th value • The result is the median of s2 variables Yj, where each Yj is the average of s1 such estimators (averaging boosts accuracy, the median boosts the confidence)
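A minimal sketch of this self-join estimator (the AMS scheme); full random sign tables stand in for the 4-wise independent family the analysis actually requires, and names are illustrative.

    import random
    import statistics

    def sign_table(seed, domain):
        # one +/-1 value per domain element; a stand-in for 4-wise
        # independent xi_i variables
        rng = random.Random(seed)
        return [rng.choice((-1, 1)) for _ in range(domain)]

    def ams_f2(stream, domain, s1=16, s2=5):
        """One-pass estimate of F2 = sum_i f(i)^2 (self-join size):
        median of s2 averages of s1 basic estimators Z = X^2,
        X = sum_i f(i) * xi_i."""
        signs = [[sign_table(j * s1 + k, domain) for k in range(s1)]
                 for j in range(s2)]
        xs = [[0] * s1 for _ in range(s2)]
        for i in stream:                      # i is a value in 0..domain-1
            for j in range(s2):
                for k in range(s1):
                    xs[j][k] += signs[j][k][i]
        averages = [sum(x * x for x in row) / s1 for row in xs]
        return statistics.median(averages)    # median boosts confidence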
Sketching Techniques (2) • Answer Complex Aggregate Queries • Frequency moments Fk = Σi mi^k capture statistics of the data • mi: the frequency of occurrence of value i • F0: number of distinct values • F1: total number of elements • F2: Gini index (useful for self-joins) • The L1, L2 norms of a vector can be computed similarly to F2 • Quantiles (combination of histograms, sketches)
Grouping Continuous Queries • Goal: Group similar queries over data sources, to eliminate common processing and minimize response time and storage needed. • Niagara (joint work of Wisconsin, Oregon) • Tukwila (Washington) • Telegraph (Berkeley)
Niagara (1) • Supports thousands of queries over XML sources. • Features: • Incremental Grouping of Queries • Supports queries evaluated when data sources change (change-based) • Supports queries evaluated at specific intervals (timer-based) • Timer-based queries are harder to group because of overlapping time intervals • Change-based queries have better response times but waste more resources.
Niagara (2) • Why Group Queries: • Share Computation • Test multiple “Fire” actions together • Group plans can be kept in memory more easily
Niagara - Key ideas • Query Expression Signature • Query Plan (generated by Niagara parser)
Group • Group signature (common signature of all queries in a plan) • Group Constant Table (signature constants of all the queries in the group, and destination buffers) • Group plan: the query plan shared by all queries in the group.
Incremental Grouping • Create signature of new query. Place in lower parts of signature the most selective predicates. • Insert new query in the Group which best matches its signature bottom-up • If no match is found, create new group for this query • Store any timer information, and the data sources needed for this query
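A hedged sketch of the insertion step just described; the query fields ('signature', 'plan', 'constants', 'dest') are illustrative, not Niagara's actual representation.

    def add_query(groups, query):
        """Insert a query into the group whose expression signature it
        matches; create a new group (and group plan) otherwise."""
        group = groups.get(query['signature'])
        if group is None:
            # no matching group: this query's plan becomes the group plan
            group = {'plan': query['plan'], 'constant_table': []}
            groups[query['signature']] = group
        # signature constants and destination buffer go into the
        # group constant table
        group['constant_table'].append((query['constants'], query['dest']))
        return group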
Other Issues (1) • Why write output to a file, and not use pipelining? • Pipelining would fire all actions, even when none needed to be fired • Pipelining does not work for timer-based queries, where results need to be buffered • The Split operator may become a bottleneck if outputs are consumed at widely different rates • The query plan becomes too complex for the optimizer
Other Issues (2) • Selection Operator above, or below Joins? • Below only if selections are very selective • Else, better to have one join • Range queries? • Treated like equality queries: save the lower and upper bound • Output goes to one common sorted file to eliminate duplicates.
Tukwila • Adaptive Query Processing over autonomous data sources • Periodically change the query plan if the output of operators is not satisfactory • At the end perform cleanup; some calculated results may have to be thrown away.
Telegraph • Adaptive Query Engine based on the Eddy concept • Queries are over autonomous sources on the internet • The environment is unpredictable and data rates may differ significantly during query execution; therefore query processing SHOULD be adaptive • Can also help produce partial results
Eddy • Eddy: Routes tuples to operators for processing, gets them back and routes them again…
Eddy – Knowing State of Tuples • Passes Tuples by Reference to Operators (avoids copying) • When Eddy does not have any more input tuples, it polls the sources for more input. • Tuples need to be augmented with additional information: • Ready Bits: Which operators need to be applied • Done Bits: Which operators have been applied • Queries Completed: Signals if tuple has been output or rejected by the query • Completion Mask (per query): To know when a tuple can be output for a query (completion mask & done bits = mask)
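A minimal sketch of that bookkeeping (field and function names are illustrative):

    class EddyTuple:
        def __init__(self, data, n_ops):
            self.data = data
            self.ready = (1 << n_ops) - 1   # operators still to be applied
            self.done = 0                   # operators already applied

    def mark_done(t, op):
        t.done |= 1 << op                   # operator op has been applied
        t.ready &= ~(1 << op)

    def complete_for(t, completion_mask):
        # tuple can be output for a query when mask & done bits = mask
        return (t.done & completion_mask) == completion_mask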
Eddy – Other Details • Queries with no joins are partitioned per data source (to save space in the bits required) • Queries with Disjunctions (OR’s) are transformed into conjunctive normal form (and of or’s). • Range/exact predicates are found in Grouped filter
Joins - SteMs • SteMs: Multiway-Pipelined Joins • Double-Pipelined Joins maintain a hash index on each relation • When N relations are joined, at least N-2 in-flight indices are needed for intermediate results, even for left-deep trees • The previous approach cannot change the query plan without re-computing the intermediate indices.
SteMs - Functionality • Keeps a hash-table (or other index) on one data source • Can have tuples inserted into it (passed from the eddy) • Can be probed. Intermediate tuples (results of the join) are returned to the eddy with the appropriate bits set • Tuples have sequence numbers. A tuple X can join with tuples in SteM M only if the indexed tuples have lower sequence numbers than X (arrived earlier).
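A simplified sketch of a SteM over a hash index, with the sequence-number filter on probes (tuples here are plain dicts keyed by column name):

    from collections import defaultdict

    class SteM:
        def __init__(self, key):
            self.key = key
            self.index = defaultdict(list)   # join key -> [(seq, tuple)]

        def insert(self, tup, seq):
            self.index[tup[self.key]].append((seq, tup))

        def probe(self, tup, seq):
            # join only with tuples that arrived earlier (lower sequence
            # number), so re-routed tuples never produce duplicates
            return [(other, tup) for s, other in self.index[tup[self.key]]
                    if s < seq]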
Telegraph – Routing • How to route between operators? • Route to operator with smaller queue • Route to more selective operators (ticket scheme)
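A sketch of the ticket scheme, assuming the eddy credits an operator one ticket when it routes a tuple to it and debits one when the tuple comes back, so selective operators (which swallow tuples) accumulate tickets:

    import random

    def pick_operator(tickets, eligible):
        # weight eligible operators by their tickets; +1 avoids starving
        # an operator that currently holds none
        weights = [tickets[op] + 1 for op in eligible]
        return random.choices(eligible, weights=weights)[0]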
Partial Results - Telegraph • Idea: When a tuple returns to the eddy, the tuple may contribute to the final result (fields may be missing because of joins not performed yet). • Present the tuple anyway; the missing fields will be filled in later. • A tuple is guaranteed to be in the result if referential constraints (foreign keys) exist. These are not usual in web sources. • It might be useful to present to the user tuples that have no matching fields (as in an outer join).
Partial Results - Telegraph • Results are presented in a tabular representation • The user can: • Re-arrange columns • Drill down (add columns) or roll up (remove columns) • Assume the current area of focus is where the user needs more tuples • Weight tuples based on: • Selected columns and their order • Selected values for some dimension • The eddy sorts tuples according to their benefit to the result and schedules them accordingly
Partial Results – Other methods • Online Aggregation: Present the current aggregate with error bounds, and continuously refine the result • Previous approaches involved changing some blocking operators to be able to produce partial results: • Join (use symmetric hash-join; see the sketch below) • Nest • Average • Except
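A minimal sketch of the symmetric hash join: one hash table per input, and every arriving tuple is inserted into its own table and probed against the other, so results stream out without waiting for either input to finish.

    from collections import defaultdict

    def symmetric_hash_join(stream, key):
        """`stream` yields ('L', tuple) or ('R', tuple) in arrival order;
        join results are emitted as soon as both halves have arrived."""
        tables = {'L': defaultdict(list), 'R': defaultdict(list)}
        for side, tup in stream:
            other = 'R' if side == 'L' else 'L'
            tables[side][tup[key]].append(tup)      # remember this tuple
            for match in tables[other][tup[key]]:   # probe the other side
                yield (tup, match) if side == 'L' else (match, tup)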
Data Mining (1) • General problem: Data mining techniques usually require: • Entire dataset to be present (in memory or in disk) • Multiple passes of the data • Too much time per data element
Data Mining (2) • New algorithms should: • Require small constant time per record • Use a fixed amount of memory • Use one scan of the data • Provide a useful model at all times • Produce a model close to the one that multiple passes would produce if the dataset were available offline • Alter the model when the generating phenomenon changes over time
Decision Trees • Input: A set of examples (x, v), where x is a vector of D attributes and v is a discrete class label • Find at each node the best attribute to split on • Hoeffding bounds are useful here: • Consider a variable r with range R • N independent observations • The computed average r' differs from the true average of r by at most ε with probability 1-δ, where ε = sqrt(R² ln(1/δ) / (2N))
Hoeffding Tree • At each node maintain counts for each attribute X, each value Xi of X, and each class • Let G(Xi) be the heuristic measure used to choose test attributes (for example, the Gini index) • Consider the two attributes A, B with maximum G • If G(A) – G(B) > ε, then with probability 1-δ, A is the correct attribute to split on • Memory needed = O(dvc) (dimensions, values, classes) • Can prove that the produced tree is very close to the optimal tree.
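A sketch of the split test, using the bound from the previous slide:

    import math

    def hoeffding_bound(R, delta, n):
        # the sample mean of n observations of a variable with range R is
        # within eps of the true mean with probability 1 - delta
        return math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))

    def should_split(g_best, g_second, R, delta, n):
        # split once the best attribute's G lead over the runner-up
        # exceeds the Hoeffding bound
        return g_best - g_second > hoeffding_bound(R, delta, n)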
VFDT Tree • Extension of Hoeffding tree • Breaks ties more aggressively (if they delay splitting) • Computes G after nmin tuples arrive (splits are not that often anyway) • Remove least promising leaf nodes if a memory problem exists (they may be reactivated later) • Drops attributes from consideration if at the beginning their G value is very small
CVFDT System • The source producing examples may significantly change behavior • In some nodes of the tree, the current splitting attribute may no longer be the best • Grow alternate subtrees, but keep the previous one, since at the beginning the alternate tree is small and will probably give worse results • Periodically use a batch of samples to evaluate the quality of the trees • When an alternate tree becomes better than the old one, remove the old one • CVFDT also has smaller memory requirements than VFDT over sliding-window samples.
OLAP • On-Line Analytic Processing • Requires processing very large quantities of data to produce result • Usually updates are done in batch, and sometimes when system is offline • Organization of data extremely important (query response times, and mainly update times can vary by several orders of magnitude)
Terminology • Dimension • Measure • Aggregate Function • Hierarchy • What is the CUBE operator: • All 2^D possible views (D = number of dimensions), if no hierarchies exist • Π(hi + 1) views, where hi is the number of hierarchy levels of dimension i, if hierarchies exist
Cube Representations - MOLAP • MOLAP: Multi-dimensional array • Good for dense cubes, as it does not store the attributes of each tuple • Bad for sparse (high-dimensional) cubes • Needs no indexing if stored as is • Typical methods save the dense set of dimensions in MOLAP mode, and index the remaining dimensions with other methods • How to store? Chunk the array into blocks to speed up range queries (see the sketch below)
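A sketch of the chunked layout (names and layout are illustrative): cells map to fixed-size blocks so that a range query touches only the chunks it overlaps.

    def chunk_id(coords, chunk_dims, chunks_per_dim):
        """Map a cell's coordinates to the number of the chunk holding it,
        for a dense d-dimensional array stored block by block.
        chunk_dims: chunk side length per dimension;
        chunks_per_dim: number of chunks along each dimension."""
        idx = 0
        for c, size, n in zip(coords, chunk_dims, chunks_per_dim):
            idx = idx * n + c // size       # row-major order over chunks
        return idx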
Cube Representations - ROLAP • Store views in relations • Needs to index the produced relations, otherwise queries will be slow • Indexes slow down updates • Issues: If limited space is available, which views should be stored? • Store the fact table, and the smaller views (those that have performed the most aggregation) • Queries usually specify few dimensions • These views are the more expensive ones to compute on-the-fly
Research issues- ROLAP • How to compute the CUBE • Compute views from smaller parent • Share sort orders • Exhibit locality • Can the size of the cube be limited? • Prefix redundancy (Cube Forests) • Suffix Redundancy (Dwarf) • Approximation Techniques (Wavelets)
Research Issues - ROLAP • How to speed up selected classes of queries: Range-Sum, Count, … Different structures exist for each case (Partial Sum, Dynamic Data Cube) • How to best represent hierarchical data. Almost no research here.