Integrating Distributed Data Streams

Integrating Distributed Data Streams Alasdair J G Gray Supervisors: M Howard Williams, Werner Nutt 7th June 2007

Overview • The problem • Limits of current technology • Proposed system • Architecture • Query Answering • Performance • Conclusions

Streams of Data • Main sources: sensors • Characteristics: unbounded, append only, frequency • Managed by: sensor networks, network/Grid monitoring, ubiquitous/pervasive computing environments • Figure label: Reading

Streams are everywhere Internet

Grid Monitoring • Resources supplied by various institutions • Resources publish status information • Scheduler must allocate jobs to resources • Bookkeeping tracks resource usage • Users track job progress • Figure labels: Bookkeeping, Job progress, Grid Monitoring data

Requirements Ability to: • Publish distributed streams of data • Query multiple streams with no knowledge of • Existence of source streams • Location of individual streams • Access methods to individual streams • Scale to large numbers of users and sources

Data Integration System • Several distributed data sources • Users send query to Mediator • Mediator translates user query into sub-queries and combines results of sub-queries • Only for stored data sources • Figure: Mediator over DB1, DB2, DB3

Stream Management System • Data streams into the server • Server applies long-standing queries to the streams • Answers streamed out • Users need to know which streams exist • Centralised server

Solution • Need a system that combines: • Ability to access multiple sources without specific source knowledge (data integration) • Ability to process streams of data (stream processing) • A Stream Integration System!

Stream Integration System • Producers publish streams of data • Consumers query for streams of data • Registry matches consumer requests with publications • Figure: Consumers and Producers connected through the Registry

Publishing Monitoring Data • Stream data can be represented in terms of relations with • Keys: “what” and “where” • Measurements: the “value” • Timestamps: “when” For example, Network ThroughPut • One reading is a tuple in the relation
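As a rough illustration of the tuple layout described above, one reading could be modelled as follows. This is only a sketch: the class name `NtpReading` and the attribute names are assumptions (the query conditions on later slides use `from`, `tool`, `psize` and `latency`; `dest` is hypothetical), and treating `psize` as part of the key is also an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NtpReading:
    # Keys: "what" and "where" (names assumed for illustration)
    source: str      # "from" on the slides; renamed because `from` is a Python keyword
    dest: str        # hypothetical destination site
    tool: str
    psize: int
    # Measurement: the "value"
    latency: float
    # Timestamp: "when"
    timestamp: float

    def key(self):
        """Key values identifying the channel this reading belongs to."""
        return (self.source, self.dest, self.tool, self.psize)


# One reading is one tuple in the relation.
reading = NtpReading("hw", "ral", "ping", 1024, 8.5, 1181203200.0)
```

Readings that share the same `key()` value belong to the same channel and differ only in measurement and timestamp.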

Consuming Monitoring Data • Users are interested in how the grid changes over time. For example, • Latency for large packets sent from hw • Links with a low latency as recorded by the PingER tool • These can be expressed as SQL selection queries
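The two example queries can be written as SQL selections. The sketch below runs them once over an in-memory SQLite table purely to show their shape; in the system described here they would be continuous queries over streams. The table name `ntp`, the `"to"` column and the sample rows are assumptions, and the thresholds are taken from the query planning slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "from" and "to" are SQL keywords, so they are quoted as identifiers.
conn.execute(
    'CREATE TABLE ntp ("from" TEXT, "to" TEXT, tool TEXT, '
    "psize INTEGER, latency REAL, timestamp REAL)"
)
sample_rows = [
    ("hw", "ral", "ping", 2048, 12.0, 1.0),
    ("hw", "an", "udp", 512, 4.0, 2.0),
    ("ral", "hw", "ping", 1024, 7.5, 3.0),
]
conn.executemany("INSERT INTO ntp VALUES (?, ?, ?, ?, ?, ?)", sample_rows)

# q1: latency for large packets sent from hw
Q1 = """SELECT * FROM ntp WHERE "from" = 'hw' AND psize >= 1024"""
# q2: links with a low latency as recorded by the ping tool
Q2 = """SELECT * FROM ntp WHERE tool = 'ping' AND latency <= 10.0"""

q1_rows = conn.execute(Q1).fetchall()
q2_rows = conn.execute(Q2).fetchall()
```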

What is an Answer to a Query? • Global relations contain no tuples (virtual relation) • Need to translate into query over sources • An answer stream should be • Sound • Complete • Duplicate free • Weakly ordered: all tuples that share the same key value will be in timestamp order • Order in general is difficult in a distributed setting • Weak order sufficient for more complex queries such as aggregates
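The weak order and duplicate-freeness properties can be checked on a finite prefix of an answer stream. This is a minimal sketch, assuming each tuple is reduced to a hypothetical `(key, timestamp)` pair and that equal timestamps for the same key are allowed.

```python
def is_weakly_ordered(stream):
    """Return True if tuples sharing a key appear in timestamp order.

    `stream` is an iterable of (key, timestamp) pairs in arrival order.
    """
    last_timestamp = {}
    for key, timestamp in stream:
        if key in last_timestamp and timestamp < last_timestamp[key]:
            return False
        last_timestamp[key] = timestamp
    return True


def is_duplicate_free(stream):
    """Return True if no (key, timestamp) pair occurs twice."""
    seen = set()
    for item in stream:
        if item in seen:
            return False
        seen.add(item)
    return True


# Interleaved channels: not globally ordered, but weakly ordered.
interleaved = [("hw-ral", 1), ("ral-hw", 5), ("hw-ral", 3), ("ral-hw", 6)]
# The same channel arriving out of order violates weak order.
out_of_order = [("hw-ral", 8), ("hw-ral", 3)]
```

Only per-key order is checked, which is why the interleaved stream passes even though its timestamps are not globally sorted.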

Query Planning: Consumer Query • q1: from='hw' ∧ psize≥1024 • q2: tool='ping' ∧ latency≤10.0 • Satisfiability used to find relevant producers, e.g. for q1: from='hw' ∧ psize≥1024 ∧ from='ral' ∧ tool='ping' (S3, unsatisfiable) and from='hw' ∧ psize≥1024 ∧ from='hw' ∧ tool='udp' (S1, satisfiable) • Producers: S1: from='hw' ∧ tool='udp'; S2: from='hw' ∧ tool='ping'; S3: from='ral' ∧ tool='ping'; S4: from='ral' ∧ tool='udp'; S5: from='an' ∧ tool='ping'
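The satisfiability test on this slide can be sketched as follows. The representation is an assumption made for illustration: a condition is a conjunction stored as a dict, with a string for an equality constraint and a closed `(low, high)` interval for a numeric constraint, and each attribute is assumed to be used with one constraint type only.

```python
import math


def conjoin(c1, c2):
    """Return the conjunction of two conditions, or None if unsatisfiable."""
    result = dict(c1)
    for attr, constraint in c2.items():
        if attr not in result:
            result[attr] = constraint
        elif isinstance(constraint, tuple):
            # Numeric constraint: intersect the two closed intervals.
            low = max(result[attr][0], constraint[0])
            high = min(result[attr][1], constraint[1])
            if low > high:
                return None
            result[attr] = (low, high)
        elif result[attr] != constraint:
            # Two different required values for the same attribute.
            return None
    return result


def relevant_producers(query, producers):
    """Names of producers whose condition is satisfiable together with the query."""
    return [name for name, cond in producers.items()
            if conjoin(query, cond) is not None]


producers = {
    "S1": {"from": "hw", "tool": "udp"},
    "S2": {"from": "hw", "tool": "ping"},
    "S3": {"from": "ral", "tool": "ping"},
    "S4": {"from": "ral", "tool": "udp"},
    "S5": {"from": "an", "tool": "ping"},
}
q1 = {"from": "hw", "psize": (1024, math.inf)}
q2 = {"tool": "ping", "latency": (-math.inf, 10.0)}
```

With these inputs, `relevant_producers(q1, producers)` selects the two producers at `hw`, and `relevant_producers(q2, producers)` selects the three `ping` producers.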

Scalability is an Issue • Problem: every consumer contacting every producer of interest does not scale • Even a small Grid of fewer than a dozen sites has problems • Grids may contain thousands of resources, for example the Large Hadron Collider Computing Grid (LCG)

Republishers Allow the System to Scale • A republisher • Consumes answers to a selection query • Merges "trickles" into streams • Publishes: answer stream, latest-state answer, history • Problem: choice in where to obtain information • Figure: Republisher above Producer S1 and Producer S2
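The three things a republisher publishes can be pictured with a small in-process sketch. The class and method names are hypothetical, tuples are plain dicts with an assumed `timestamp` field, and networking, registration and storage limits are all left out.

```python
class Republisher:
    """Toy republisher: consumes tuples matching a selection condition."""

    def __init__(self, condition):
        self.condition = condition   # selection predicate over a tuple (dict)
        self.history = []            # every accepted tuple, in arrival order
        self.latest_state = {}       # key -> most recent tuple for that key
        self.subscribers = []        # callbacks receiving the answer stream

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def on_tuple(self, key, tup):
        """Handle one tuple arriving from a producer or another republisher."""
        if not self.condition(tup):
            return
        self.history.append(tup)
        current = self.latest_state.get(key)
        if current is None or tup["timestamp"] >= current["timestamp"]:
            self.latest_state[key] = tup
        # Forward on the merged answer stream.
        for callback in self.subscribers:
            callback(tup)


# R1 merges the "trickles" from the producers at hw into one stream.
r1 = Republisher(lambda t: t["from"] == "hw")
answer_stream = []
r1.subscribe(answer_stream.append)
r1.on_tuple(("hw", "udp"), {"from": "hw", "tool": "udp", "latency": 4.0, "timestamp": 1})
r1.on_tuple(("hw", "ping"), {"from": "hw", "tool": "ping", "latency": 8.0, "timestamp": 2})
r1.on_tuple(("ral", "ping"), {"from": "ral", "tool": "ping", "latency": 6.0, "timestamp": 3})
r1.on_tuple(("hw", "ping"), {"from": "hw", "tool": "ping", "latency": 3.0, "timestamp": 4})
```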

Query Planning in the Presence of Republishers • q1: from='hw' ∧ psize≥1024 • q2: tool='ping' ∧ latency≤10.0 • Meta query plan contains choice • Query plan uses one of R1 or R3 • Find all relevant publishers • Rank according to data provided • Republishers: R1: from='hw'; R2: from='ral'; R3: from='hw' ∧ tool='ping' • Producers: S1: from='hw' ∧ tool='udp'; S2: from='hw' ∧ tool='ping'; S3: from='ral' ∧ tool='ping'; S4: from='ral' ∧ tool='udp'; S5: from='an' ∧ tool='ping'
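One way to read "find all relevant publishers, rank according to data provided" is sketched below. It is an interpretation, not the planner from the talk: publishers are compared by what they can supply for the query (their condition conjoined with the query), publishers covered by a republisher that supplies strictly more are dropped, and republishers that supply the same data form a choice group. The dict-based condition format and all function names are assumptions.

```python
import math


def conjoin(c1, c2):
    """Conjunction of two conditions, or None if unsatisfiable."""
    result = dict(c1)
    for attr, cons in c2.items():
        if attr not in result:
            result[attr] = cons
        elif isinstance(cons, tuple):
            low, high = max(result[attr][0], cons[0]), min(result[attr][1], cons[1])
            if low > high:
                return None
            result[attr] = (low, high)
        elif result[attr] != cons:
            return None
    return result


def entails(c1, c2):
    """True if every tuple satisfying c1 also satisfies c2."""
    for attr, cons in c2.items():
        if attr not in c1:
            return False
        have = c1[attr]
        if isinstance(cons, tuple):
            if not (cons[0] <= have[0] and have[1] <= cons[1]):
                return False
        elif have != cons:
            return False
    return True


def meta_query_plan(query, republishers, producers):
    """Return groups of publishers; a group with several members is a choice."""
    supplied = {}
    for name, cond in {**republishers, **producers}.items():
        restricted = conjoin(query, cond)
        if restricted is not None:            # relevant publisher
            supplied[name] = restricted

    def dominated(p):
        for r in republishers:
            if r == p or r not in supplied:
                continue
            if entails(supplied[p], supplied[r]):
                equal = p in republishers and entails(supplied[r], supplied[p])
                if not equal:
                    return True
        return False

    groups = []
    for p in (name for name in supplied if not dominated(name)):
        for group in groups:
            g = group[0]
            if (p in republishers and g in republishers
                    and entails(supplied[p], supplied[g])
                    and entails(supplied[g], supplied[p])):
                group.append(p)
                break
        else:
            groups.append([p])
    return groups


republishers = {
    "R1": {"from": "hw"},
    "R2": {"from": "ral"},
    "R3": {"from": "hw", "tool": "ping"},
}
producers = {
    "S1": {"from": "hw", "tool": "udp"},
    "S2": {"from": "hw", "tool": "ping"},
    "S3": {"from": "ral", "tool": "ping"},
    "S4": {"from": "ral", "tool": "udp"},
    "S5": {"from": "an", "tool": "ping"},
}
q1 = {"from": "hw", "psize": (1024, math.inf)}
q2 = {"tool": "ping", "latency": (-math.inf, 10.0)}
```

Under these assumptions q1 is answered from R1 alone, while the plan for q2 contains a choice between R1 and R3 plus R2 and the producer S5, which no republisher covers.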

Weak Order is not Guaranteed • q2: tool='ping' ∧ latency≤10.0 • Tuples for same channel • (3) published before (8) • Arrive at consumer in wrong order: (8), (3) • Figure labels: S2: from='hw' ∧ tool='ping'; republishers with conditions latency≤5.0 and latency>5.0; slow link

Generating Well Formed Query Plans • A publisher is relevant for a global query if • Conditions are satisfiable, and • All measurements that agree on their key values come from the same publisher • The measurement condition can be checked using entailment. • Previous example was well formed.
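The entailment check on the measurement condition can be sketched for interval conditions. This is a simplified reading that ignores the key part of the conditions: a publisher passes if every measurement restriction it imposes is implied by the query, so it cannot hold back part of a channel the query wants. The `Interval` class and `measurement_entailed` function are hypothetical names.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    lo: float = -math.inf
    hi: float = math.inf
    lo_open: bool = False   # True means a strict lower bound (>)
    hi_open: bool = False   # True means a strict upper bound (<)

    def contains(self, other):
        """True if every value in `other` also lies in this interval."""
        lo_ok = self.lo < other.lo or (
            self.lo == other.lo and (not self.lo_open or other.lo_open))
        hi_ok = other.hi < self.hi or (
            self.hi == other.hi and (not self.hi_open or other.hi_open))
        return lo_ok and hi_ok


UNBOUNDED = Interval()


def measurement_entailed(query_meas, publisher_meas):
    """True if the query's measurement condition entails the publisher's."""
    return all(
        interval.contains(query_meas.get(attr, UNBOUNDED))
        for attr, interval in publisher_meas.items()
    )


q2_meas = {"latency": Interval(hi=10.0)}                   # latency <= 10.0
low_only = {"latency": Interval(hi=5.0)}                   # latency <= 5.0
high_only = {"latency": Interval(lo=5.0, lo_open=True)}    # latency > 5.0
unrestricted = {}                                          # no measurement condition
```

Here the two republishers that split a channel at latency 5.0 both fail the check for q2, whereas a publisher with no measurement restriction passes.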

Query Re-Planning • Queries are long-lived • Set of publishers can change • Query plans should reflect changes

How does a new Republisher affect our Consumers? • q1: from='hw' ∧ psize≥1024 • q2: tool='ping' ∧ latency≤10.0 • Find consumers for which R4 is relevant • Compare R4 to publishers in Meta Query Plan • Republishers: R4: TRUE; R1: from='hw'; R2: from='ral'; R3: from='hw' ∧ tool='ping' • Producers: S1: from='hw' ∧ tool='udp'; S2: from='hw' ∧ tool='ping'; S3: from='ral' ∧ tool='ping'; S4: from='ral' ∧ tool='udp'; S5: from='an' ∧ tool='ping'

Planning a Republisher Query • Applying Consumer planning techniques results in a problem • Problem: • Hierarchy contains cycles • Republishers disconnected from Producers • Republishers: R4: TRUE; R1: from='hw'; R2: from='ral'; R3: from='hw' ∧ tool='ping' • Producers: S1: from='hw' ∧ tool='udp'; S2: from='hw' ∧ tool='ping'; S3: from='ral' ∧ tool='ping'; S4: from='ral' ∧ tool='udp'; S5: from='an' ∧ tool='ping'

Desirable Properties for a Hierarchy • Correctness: streams answer queries • Cycle freeness: loops can lead to duplicates • Uniqueness: hierarchy defined for a set of publishers • Local planning: Publishers and Consumers only need to communicate with the Registry

Generating Well Formed Hierarchies • Need a stricter relevance criterion • R1 can consume from R2 iff • Everything R2 offers is relevant to R1, and • R1 offers something R2 does not. • Can be checked by entailment • Ensures • No loops in the hierarchy • Republishers connected to the Producers
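For republishers whose conditions are conjunctions of equalities, the stricter criterion amounts to strict entailment between conditions. The sketch below assumes that representation (a dict of required values, with the empty dict standing for TRUE) and only considers republisher-to-republisher links; producers would still be matched with the ordinary relevance test. Function names are hypothetical.

```python
def entails(c1, c2):
    """True if every tuple satisfying c1 also satisfies c2 (equality-only conditions)."""
    return all(attr in c1 and c1[attr] == value for attr, value in c2.items())


def can_consume_from(consumer_cond, source_cond):
    """Stricter relevance: the source offers only relevant data,
    and the consumer offers something the source does not."""
    everything_relevant = entails(source_cond, consumer_cond)
    offers_more = not entails(consumer_cond, source_cond)
    return everything_relevant and offers_more


def republisher_links(republishers):
    """Map each republisher to the republishers it may consume from."""
    return {
        name: sorted(other for other, other_cond in republishers.items()
                     if other != name and can_consume_from(cond, other_cond))
        for name, cond in republishers.items()
    }


republishers = {
    "R1": {"from": "hw"},
    "R2": {"from": "ral"},
    "R3": {"from": "hw", "tool": "ping"},
    "R4": {},  # TRUE
}
links = republisher_links(republishers)
```

Because strict entailment is irreflexive and transitive, no two republishers can consume from each other, which is one way to see why the resulting hierarchy has no loops.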

Re-Planning, Re-Visited! • Stricter relevance criterion • Republishers only consume from publishers below them • R4 is not relevant for R1 • Republishers: R4: TRUE; R1: from='hw'; R2: from='ral'; R3: from='hw' ∧ tool='ping' • Producers: S1: from='hw' ∧ tool='udp'; S2: from='hw' ∧ tool='ping'; S3: from='ral' ∧ tool='ping'; S4: from='ral' ∧ tool='udp'; S5: from='an' ∧ tool='ping'

Republishers' Effect on Latency • Tuple published by producer • Tuple passes through some number of republishers • Tuple arrives at consumer • Republishers add to the time taken!

Performance Measure

Conclusions • Distributed streams of data are increasingly being made available • Distributed users interested in multiple streams • Developed a system for • Publishing distributed data streams • Querying multiple stream sources without source knowledge • Republishers required to allow system to scale

Future Work • Increase complexity of query language • Integrate stored and stream sources
