Integrating Distributed Data Streams Alasdair J G Gray Supervisors: M Howard Williams, Werner Nutt 7th June 2007
Overview • The problem • Limits of current technology • Proposed system • Architecture • Query Answering • Performance • Conclusions
Streams of Data • Main sources: sensors • Characteristics: • Unbounded • Append only • Frequency • Managed by: • Sensor networks • Network/Grid monitoring • Ubiquitous/Pervasive computing environments [Diagram label: Reading]
Streams are everywhere Internet
Grid Monitoring • Resources supplied by various institutions • Resources publish status information • Scheduler must allocate jobs to resources • Bookkeeping tracks resource usage • Users track job progress [Diagram labels: Bookkeeping, Job progress, Grid, Monitoring data]
Requirements Ability to: • Publish distributed streams of data • Query multiple streams with no knowledge of • Existence of source streams • Location of individual streams • Access methods to individual streams • Scale to large numbers of users and sources
Data Integration System • Several distributed data sources • Users send query to Mediator • Mediator • Translates user query into sub-queries • Combines results of sub-queries • Only for stored data sources [Diagram labels: Mediator, DB1, DB2, DB3]
Stream Management System • Data streams into the server • Server applies long-standing queries to the streams • Answers streamed out • Users need to know which streams exist • Centralised server
Solution • Need a system that combines: • Ability to access multiple sources without specific source knowledge (data integration) • Ability to process streams of data (stream processing) • A Stream Integration System!
Stream Integration System • Producers publish streams of data • Consumers query for streams of data • Registry matches consumer requests with publications [Diagram labels: Consumer, Consumer, Registry, Producer, Producer, Producer]
Publishing Monitoring Data • Stream data can be represented in terms of relations with • Keys: “what” and “where” • Measurements: the “value” • Timestamps: “when” For example, Network ThroughPut • One reading is a tuple in the relation
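A minimal sketch of one reading as a tuple in such a relation. The schema is an assumption: the attribute names are borrowed from the example queries later in the talk, `src` stands in for `from` (a Python keyword), and the split into key and measurement attributes is a guess rather than something the slide states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    """One tuple of a hypothetical Network ThroughPut relation."""
    src: str          # key: "where" the measurement was taken from
    tool: str         # key: "what" tool produced it
    psize: int        # key: packet size used for the test
    latency: float    # measurement: the "value"
    timestamp: float  # "when" the reading was taken

    def key(self):
        """Return the key part; readings sharing it belong to one channel."""
        return (self.src, self.tool, self.psize)


reading = Reading(src="hw", tool="ping", psize=1024, latency=7.5, timestamp=100.0)
```

Making the class frozen keeps readings hashable, which suits an append-only stream where tuples are never updated in place.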
Consuming Monitoring Data • Users are interested in how the grid changes over time. For example, • Latency for large packets sent from hw • Links with a low latency as recorded by the PingER tool • These can be expressed as SQL selection queries
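The two example queries can be written down as SQL selections. The sketch below runs them with Python's `sqlite3` module against a small stored snapshot, purely to show the query semantics; the table name `ntp`, its columns, and the sample rows are assumptions, and the thresholds (1024 and 10.0) are the ones used in the planning slides.

```python
import sqlite3

# Hypothetical schema and rows; "from" is quoted because it is an SQL keyword.
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE ntp ("from" TEXT, tool TEXT, psize INTEGER, '
    "latency REAL, timestamp REAL)"
)
conn.executemany(
    "INSERT INTO ntp VALUES (?, ?, ?, ?, ?)",
    [
        ("hw", "ping", 1024, 7.5, 1.0),
        ("hw", "udp", 512, 3.0, 2.0),
        ("ral", "ping", 64, 12.0, 3.0),
        ("an", "ping", 2048, 4.0, 4.0),
    ],
)

# q1: latency for large packets sent from hw.
q1 = """SELECT * FROM ntp WHERE "from" = 'hw' AND psize >= 1024 ORDER BY timestamp"""
# q2: links with a low latency as recorded by a ping-based tool.
q2 = """SELECT * FROM ntp WHERE tool = 'ping' AND latency <= 10.0 ORDER BY timestamp"""

q1_rows = conn.execute(q1).fetchall()
q2_rows = conn.execute(q2).fetchall()
```

In the stream setting the same selections would be long-lived queries over tuples as they arrive, not one-off queries over a stored table.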
What is an Answer to a Query? • Global relations contain no tuples (virtual relation) • Need to translate into query over sources • An answer stream should be • Sound • Complete • Duplicate free • Weakly ordered: all tuples that share the same key value will be in timestamp order • Order in general is difficult in a distributed setting • Weak order sufficient for more complex queries such as aggregates
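The weak-order and duplicate-freeness properties can be checked on a finite prefix of an answer stream. This is a sketch only: tuples are assumed to be `(key, timestamp, value)` triples, and the function names are illustrative.

```python
def is_weakly_ordered(stream):
    """True if tuples sharing a key appear in non-decreasing timestamp order."""
    last_seen = {}  # key -> latest timestamp seen so far for that key
    for key, timestamp, _value in stream:
        if key in last_seen and timestamp < last_seen[key]:
            return False
        last_seen[key] = timestamp
    return True


def is_duplicate_free(stream):
    """True if no tuple occurs more than once."""
    seen = set()
    for item in stream:
        if item in seen:
            return False
        seen.add(item)
    return True


# Timestamps interleave across keys (5 arrives before 3), yet each key is ordered.
ok = [("a", 1, 0.5), ("b", 5, 0.7), ("a", 3, 0.6), ("b", 6, 0.8)]
# Key "b" goes backwards in time (5 then 2), so weak order is violated.
bad = [("a", 1, 0.5), ("b", 5, 0.7), ("a", 3, 0.6), ("b", 2, 0.8)]
```

The first example is not globally ordered but is weakly ordered, which is the distinction the slide draws.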
Query Planning: Consumer Query • Satisfiability used to find relevant producers q1: from='hw' ∧ psize≥1024 q2: tool='ping' ∧ latency≤10.0 S1: from='hw' ∧ tool='udp' S2: from='hw' ∧ tool='ping' S3: from='ral' ∧ tool='ping' S4: from='ral' ∧ tool='udp' S5: from='an' ∧ tool='ping'
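A minimal sketch of the satisfiability test, assuming conditions are conjunctions of atoms of the form `attr = v`, `attr >= v`, or `attr <= v`. The representation and function names are illustrative, and equality and range atoms are assumed never to mix string and numeric values on the same attribute. A producer is treated as relevant when its view condition conjoined with the query condition can be satisfied.

```python
import math


def satisfiable(atoms):
    """Return True if a conjunction of (attr, op, value) atoms can hold."""
    equal = {}  # attr -> required value
    lower = {}  # attr -> greatest inclusive lower bound
    upper = {}  # attr -> least inclusive upper bound
    for attr, op, value in atoms:
        if op == "=":
            if attr in equal and equal[attr] != value:
                return False  # two different constants for one attribute
            equal[attr] = value
        elif op == ">=":
            lower[attr] = max(lower.get(attr, -math.inf), value)
        elif op == "<=":
            upper[attr] = min(upper.get(attr, math.inf), value)
        else:
            raise ValueError(f"unsupported operator: {op}")
    for attr in set(lower) | set(upper):
        low = lower.get(attr, -math.inf)
        high = upper.get(attr, math.inf)
        if low > high:
            return False  # empty range
        if attr in equal and not (low <= equal[attr] <= high):
            return False  # required constant falls outside the range
    return True


# Producer view conditions and consumer queries from the slide.
PRODUCERS = {
    "S1": [("from", "=", "hw"), ("tool", "=", "udp")],
    "S2": [("from", "=", "hw"), ("tool", "=", "ping")],
    "S3": [("from", "=", "ral"), ("tool", "=", "ping")],
    "S4": [("from", "=", "ral"), ("tool", "=", "udp")],
    "S5": [("from", "=", "an"), ("tool", "=", "ping")],
}
Q1 = [("from", "=", "hw"), ("psize", ">=", 1024)]
Q2 = [("tool", "=", "ping"), ("latency", "<=", 10.0)]


def relevant_producers(query):
    """Names of producers whose view is consistent with the query."""
    return [name for name, view in PRODUCERS.items() if satisfiable(query + view)]
```

Under these assumptions q1 is consistent with S1 and S2, and q2 with S2, S3, and S5.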
Scalability is an Issue • Problem: Every consumer contacting every producer of interest does not scale • Even a small Grid of fewer than a dozen sites has problems • Grids may contain thousands of resources • For example, the Large Hadron Collider Computing Grid (LCG)
Republishers Allow the System to Scale • A republisher • Consumes answers to a selection query • Merges "trickles" into streams • Publishes • Answer stream • Latest-state answer • History • Problem: Choice in where to obtain information [Diagram labels: Republisher, Producer S1, Producer S2]
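The republisher's three outputs can be sketched as follows. This is an illustration under stated assumptions, not the system's implementation: tuples are `(key, timestamp, value)` triples, the selection condition is a plain Python predicate, and the two trickles are merged by timestamp with `heapq.merge` only to keep the example deterministic (a real republisher would handle tuples as they arrive).

```python
import heapq


class Republisher:
    """Hypothetical republisher keeping an answer stream, latest state, and history."""

    def __init__(self, condition):
        self.condition = condition  # predicate implementing the selection query
        self.history = []           # every tuple accepted so far
        self.latest = {}            # key -> most recent tuple for that key

    def consume(self, reading):
        """Accept one tuple; return it if it is republished, else None."""
        if not self.condition(reading):
            return None
        key, timestamp, _value = reading
        self.history.append(reading)
        current = self.latest.get(key)
        if current is None or timestamp >= current[1]:
            self.latest[key] = reading
        return reading


# Two low-rate "trickles" from producers S1 and S2; keys are (from, tool) pairs.
s1 = [(("hw", "udp"), 1, 3.0), (("hw", "udp"), 4, 3.5)]
s2 = [(("hw", "ping"), 2, 7.5), (("hw", "ping"), 3, 6.0)]

republisher = Republisher(lambda reading: reading[0][0] == "hw")
merged = heapq.merge(s1, s2, key=lambda reading: reading[1])
answer_stream = [r for r in map(republisher.consume, merged) if r is not None]
```

After the run, `republisher.latest` holds one tuple per key and `republisher.history` holds all four tuples, corresponding to the latest-state and history queries.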
Query Planning in the Presence of Republishers • Meta query plan contains choice • Query plan uses one of R1 or R3 • Find all relevant publishers • Rank according to data provided q1: from='hw' ∧ psize≥1024 q2: tool='ping' ∧ latency≤10.0 R1: from='hw' R2: from='ral' R3: from='hw' ∧ tool='ping' S1: from='hw' ∧ tool='udp' S2: from='hw' ∧ tool='ping' S3: from='ral' ∧ tool='ping' S4: from='ral' ∧ tool='udp' S5: from='an' ∧ tool='ping'
Weak Order is not Guaranteed q2: tool='ping' ∧ latency≤10.0 • Tuples for same channel • (3) published before (8) • Arrive at consumer in wrong order: (8), (3) [Diagram labels: S2: from='hw' ∧ tool='ping', latency≤5.0, latency>5.0, slow link]
Generating Well Formed Query Plans • A publisher is relevant for a global query if • Conditions are satisfiable, and • All measurements that agree on their key values come from the same publisher • The measurement condition can be checked using entailment. • Previous example was well formed.
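One possible reading of the two conditions, sketched under explicit assumptions: a condition has a key part (equalities, as a dict) and a measurement part (a closed interval on `latency`), and the measurement condition is taken to mean that the query's measurement condition entails the publisher's, so the publisher never filters out a measurement the query wants. The slides do not spell out this formulation, and the publishers `R_LOW` and `R_HIGH` are hypothetical stand-ins for the two paths in the weak-order example.

```python
import math

ANY = (-math.inf, math.inf)  # no restriction on the measurement


def keys_consistent(query_keys, publisher_keys):
    """True if the key equalities of query and publisher do not contradict."""
    return all(publisher_keys.get(attr, value) == value
               for attr, value in query_keys.items())


def measurement_entailed(query_range, publisher_range):
    """True if every measurement the query accepts is also offered by the publisher."""
    return (publisher_range[0] <= query_range[0]
            and query_range[1] <= publisher_range[1])


def strongly_relevant(query, publisher):
    """Both conditions: consistent keys and an entailed measurement condition."""
    return (keys_consistent(query["keys"], publisher["keys"])
            and measurement_entailed(query["latency"], publisher["latency"]))


Q2 = {"keys": {"tool": "ping"}, "latency": (-math.inf, 10.0)}
S2 = {"keys": {"from": "hw", "tool": "ping"}, "latency": ANY}
# Hypothetical publishers that split one channel by latency (closed-interval approximation).
R_LOW = {"keys": {"from": "hw", "tool": "ping"}, "latency": (-math.inf, 5.0)}
R_HIGH = {"keys": {"from": "hw", "tool": "ping"}, "latency": (5.0, math.inf)}
```

With this check S2 qualifies for q2, while neither latency-split publisher does, so tuples of one channel cannot reach the consumer over two different paths.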
Query Re-Planning • Queries are long-lived • Set of publishers can change • Query plans should reflect changes
How does a new Republisher affect our Consumers? • Find consumers for which R4 is relevant • Compare R4 to publishers in Meta Query Plan q1: from='hw' ∧ psize≥1024 q2: tool='ping' ∧ latency≤10.0 R4: TRUE R1: from='hw' R2: from='ral' R3: from='hw' ∧ tool='ping' S1: from='hw' ∧ tool='udp' S2: from='hw' ∧ tool='ping' S3: from='ral' ∧ tool='ping' S4: from='ral' ∧ tool='udp' S5: from='an' ∧ tool='ping'
Planning a Republisher Query • Applying Consumer planning techniques results in a problem • Problem: • Hierarchy contains cycles • Republishers disconnected from Producers R4: TRUE R1: from='hw' R2: from='ral' R3: from='hw' ∧ tool='ping' S1: from='hw' ∧ tool='udp' S2: from='hw' ∧ tool='ping' S3: from='ral' ∧ tool='ping' S4: from='ral' ∧ tool='udp' S5: from='an' ∧ tool='ping'
Desirable Properties for a Hierarchy • Correctness: streams answer queries • Cycle freeness: loops can lead to duplicates • Uniqueness: hierarchy defined for a set of publishers • Local planning: Publishers and Consumers only need to communicate with the Registry
Generating Well Formed Hierarchies • Need a stricter relevance criterion • R1 can consume from R2 iff • Everything R2 offers is relevant to R1, and • R1 offers something R2 does not. • Can be checked by entailment • Ensures • No loops in the hierarchy • Republishers connected to the Producers
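The stricter criterion can be sketched with a simple entailment test. The sketch assumes view conditions are conjunctions of equalities, stored as dicts with the empty dict standing for TRUE; "everything R2 offers is relevant to R1" is read as R2's condition entailing R1's, and "R1 offers something R2 does not" as R1's condition not entailing R2's. The publisher names and conditions are the ones used on the planning slides.

```python
def entails(stronger, weaker):
    """True if every equality in `weaker` also appears in `stronger`."""
    return weaker.items() <= stronger.items()


def can_consume_from(r1, r2):
    """R1 may consume from R2 iff R2's view is strictly contained in R1's."""
    return entails(r2, r1) and not entails(r1, r2)


PUBLISHERS = {
    "R4": {},  # TRUE
    "R1": {"from": "hw"},
    "R2": {"from": "ral"},
    "R3": {"from": "hw", "tool": "ping"},
    "S1": {"from": "hw", "tool": "udp"},
    "S2": {"from": "hw", "tool": "ping"},
    "S3": {"from": "ral", "tool": "ping"},
    "S4": {"from": "ral", "tool": "udp"},
    "S5": {"from": "an", "tool": "ping"},
}


def sources_for(name):
    """Publishers that the named republisher may consume from."""
    view = PUBLISHERS[name]
    return [other for other, other_view in PUBLISHERS.items()
            if other != name and can_consume_from(view, other_view)]
```

Because strict containment is asymmetric, two publishers can never consume from each other, which is what rules out loops. In this simplified form a republisher never consumes from a producer with an identical condition (R3 and S2 here); how the actual system handles that case is not covered by the slides.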
Re-Planning, Re-Visited! • Stricter relevance criterion • Republishers only consume from publishers below them • R4 is not relevant for R1 R4: TRUE R1: from='hw' R2: from='ral' R3: from='hw' ∧ tool='ping' S1: from='hw' ∧ tool='udp' S2: from='hw' ∧ tool='ping' S3: from='ral' ∧ tool='ping' S4: from='ral' ∧ tool='udp' S5: from='an' ∧ tool='ping'
Republishers' Effect on Latency • Tuple published by producer • Tuple passes through some number of republishers • Tuple arrives at consumer • Republishers add to the time taken!
Conclusions • Distributed streams of data are increasingly being made available • Distributed users are interested in multiple streams • Developed a system for • Publishing distributed data streams • Querying multiple stream sources without source knowledge • Republishers are required to allow the system to scale
Future Work • Increase complexity of query language • Integrate stored and stream sources