XML Routing in Data Dissemination Networks

1. XML Routing in Data Dissemination Networks Guoli Li, Shuang Hou, Hans-Arno Jacobsen Middleware Systems Research Group University of Toronto Presenter: Zhengdao Xu Project: XML-ToPSS. Part of the Toronto Publish/Subscribe Systems family

2. XML Routing in Data Dissemination Networks This project applies the content-based routing to XML (publications) and XPath (subscriptions). The objective is to efficiently route large numbers of XML documents throughout a distributed overlay of XML Routers (many documents per time, high change in subscriber interests, many new subscribers � .) XML Content processing is become a market force IBM Datapower, SolaceSystems, (and other recent acquisitions by Intel, Cisco et al.) of hardware-based XML processing technology, including content-based routing technology Similarly, much interest in software-based processing of XML against XPath/Xquery et al. Telcos as customers to hardware-based XMl Routers to offer content-providers data dissemination networks Cisco�s AON initiative is about similar value-added dissemination networks on the Internet and how to support them in the next generation of Internet routers XML routing is therefore a very timely, relevant, and important problem. This project applies the content-based routing to XML (publications) and XPath (subscriptions). The objective is to efficiently route large numbers of XML documents throughout a distributed overlay of XML Routers (many documents per time, high change in subscriber interests, many new subscribers � .) XML Content processing is become a market force IBM Datapower, SolaceSystems, (and other recent acquisitions by Intel, Cisco et al.) of hardware-based XML processing technology, including content-based routing technology Similarly, much interest in software-based processing of XML against XPath/Xquery et al. Telcos as customers to hardware-based XMl Routers to offer content-providers data dissemination networks Cisco�s AON initiative is about similar value-added dissemination networks on the Internet and how to support them in the next generation of Internet routers XML routing is therefore a very timely, relevant, and important problem.

3. XML Routing in Data Dissemination Networks Content-based Routing Challenges Adapt the use of advertisements to optimize information dissemination for XML content routing. Adapt covering and merging optimizations for XML content routing. Deliver relevant XML documents to a large and dynamically changing group of consumers. Advertisements (types of schemas) are well understood in content-based publish/subscribe (routing), however, it is not clear how to apply them to XML data. An advertisement summarizes the kind of information a data source (publisher) will publish. Simply using an XML Schema is not sufficient, since it is not clear how to apply known content-based routing algorithms to XML Schemas i.e., how to match an advertisement against a subscription (XML Schema against an XPath expression). Moreover, an XML Schema maybe too general (too broad) and not give the publisher sufficient expressiveness to narrow down his event space (the number of possible publications it will publish). Furthermore, covering / merging are common optimizations developed in p/s. Again, it is not clear how to apply these operations to the context of XPath. Covering is query consumption (known for XPath, but in the general case NP complete) So heuristics are required to solve the problem efficiently. Merging combines XPath expressions to reduce storage / matching cost. Unclear how to do this for XPath. Advertisements (types of schemas) are well understood in content-based publish/subscribe (routing), however, it is not clear how to apply them to XML data. An advertisement summarizes the kind of information a data source (publisher) will publish. Simply using an XML Schema is not sufficient, since it is not clear how to apply known content-based routing algorithms to XML Schemas i.e., how to match an advertisement against a subscription (XML Schema against an XPath expression). Moreover, an XML Schema maybe too general (too broad) and not give the publisher sufficient expressiveness to narrow down his event space (the number of possible publications it will publish). Furthermore, covering / merging are common optimizations developed in p/s. Again, it is not clear how to apply these operations to the context of XPath. Covering is query consumption (known for XPath, but in the general case NP complete) So heuristics are required to solve the problem efficiently. Merging combines XPath expressions to reduce storage / matching cost. Unclear how to do this for XPath.

4. XML Routing in Data Dissemination Networks Overview of Dissemination Networks This figure provides an overview of the dissemination network this paper assumes. In the overlay broker network, each broker only knows their neighbors. However, none of the clients -neither data producers nor data consumers know about each other or about the network topology, except the router they connect to.This figure provides an overview of the dissemination network this paper assumes. In the overlay broker network, each broker only knows their neighbors. However, none of the clients -neither data producers nor data consumers know about each other or about the network topology, except the router they connect to.

5. XML Routing in Data Dissemination Networks Content-based Routing This figure shows a scenario for advertisement-based content-based routing. The subscription routing table (SRT) consisting of <advertisement, last-hop> tuples stores advertisements in order to route subscriptions. Publications will trace back along the path setup by subscriptions to interested subscribers. The publication routing table (PRT) maintains the path information. For example, adv_1 is broadcast in the network, and is stored on each broker of the network with different last hop. Consequently, subscriptions that match adv_1 will be routed according to these last hops (e.g., sub_1 is routed along the link 5-4-3-1). Note that the sub_1 will not be forwarded to brokers 2 an 6 since the adv_1 indicates that the publisher is from broker1. Therefore, the pub_1 is routed along a reverse path 1-3-4-5 to the subscriber. This figure can also be used to explain how the covering-based routing can remove redundant subscriptions from the network in order to maintain a compacted routing table and reduce the network traffic. For example, if sub_1 covers sub_2, we can safely remove sub_2 on broker3 and gain a compacter routing table while maintaining the same information delivery behavior, for all the publications matching sub_2 must match sub_1. This figure shows a scenario for advertisement-based content-based routing. The subscription routing table (SRT) consisting of <advertisement, last-hop> tuples stores advertisements in order to route subscriptions. Publications will trace back along the path setup by subscriptions to interested subscribers. The publication routing table (PRT) maintains the path information. For example, adv_1 is broadcast in the network, and is stored on each broker of the network with different last hop. Consequently, subscriptions that match adv_1 will be routed according to these last hops (e.g., sub_1 is routed along the link 5-4-3-1). Note that the sub_1 will not be forwarded to brokers 2 an 6 since the adv_1 indicates that the publisher is from broker1. Therefore, the pub_1 is routed along a reverse path 1-3-4-5 to the subscriber. This figure can also be used to explain how the covering-based routing can remove redundant subscriptions from the network in order to maintain a compacted routing table and reduce the network traffic. For example, if sub_1 covers sub_2, we can safely remove sub_2 on broker3 and gain a compacter routing table while maintaining the same information delivery behavior, for all the publications matching sub_2 must match sub_1.

6. XML Routing in Data Dissemination Networks Language Model Representation Advertisement Non-recursive advertisement is extracted from a non-recursive DTD: a = /t1/t2/�/tn-1/tn Recursive advertisement Simple: a = a1(a2)+a3 Series: a = a1(a2)+a3(a4)+a5 Embedded: a = a1(a2(a3)+a4)+a5 Publication: e = /t1/t2/�/tn Subscription Absolute simple XPE: s = /c/d/*/e Relative simple XPE: s = c/d/*/e Descendant operators in XPE: s = s1//s2//�//sn We use the abstract XPath expression without //- operators as the format of advertisement in the context of XML/XPath data routing. Advertisements are a system internal mechanism, which is not exposed to the application or the user. In our approach, advertisements are derived from the DTD, since the DTD contains all possible paths from the root to the leaves appearing in related XML documents. We call an advertisement a non-recursive advertisement if it is extracted from a non-recursive DTD. a = /t_1/t_2/�/t_n, where, t_i can be either an element or a wildcard. A DTD is recursive if it has some element that is defined in terms of itself, directly or indirectly, e.g., the NITF DTD is recursive. We define an advertisement as a recursive advertisement if it has recursive elements defined in a DTD. An advertisement may have multiple recursive parts that appear in sequence or are embedded in each other. We classify the recursive advertisements to above three categories. We use the common interpretation of an XML document as a tree of nodes and consider each path from the root node to a leaf node. Thus, we decompose each XML document into a set of XML paths and each path is represented as e=/t_1/t_2/�/t_n, where t_i is the XML element name. We discuss three cases of subscriptions in our paper. They are absolute simple XPE, which only contains parent-child (/) and wildcard (*) operators, relative simple XPE, which is similar to absolute simple XPE except for the first operator that is the XPE is relative, and XPE with descendant (//) operators, which splits the XPE into sub-XPEs without //-operator. We use the abstract XPath expression without //- operators as the format of advertisement in the context of XML/XPath data routing. Advertisements are a system internal mechanism, which is not exposed to the application or the user. In our approach, advertisements are derived from the DTD, since the DTD contains all possible paths from the root to the leaves appearing in related XML documents. We call an advertisement a non-recursive advertisement if it is extracted from a non-recursive DTD. a = /t_1/t_2/�/t_n, where, t_i can be either an element or a wildcard. A DTD is recursive if it has some element that is defined in terms of itself, directly or indirectly, e.g., the NITF DTD is recursive. We define an advertisement as a recursive advertisement if it has recursive elements defined in a DTD. An advertisement may have multiple recursive parts that appear in sequence or are embedded in each other. We classify the recursive advertisements to above three categories. We use the common interpretation of an XML document as a tree of nodes and consider each path from the root node to a leaf node. Thus, we decompose each XML document into a set of XML paths and each path is represented as e=/t_1/t_2/�/t_n, where t_i is the XML element name. We discuss three cases of subscriptions in our paper. They are absolute simple XPE, which only contains parent-child (/) and wildcard (*) operators, relative simple XPE, which is similar to absolute simple XPE except for the first operator that is the XPE is relative, and XPE with descendant (//) operators, which splits the XPE into sub-XPEs without //-operator.

7. XML Routing in Data Dissemination Networks Subscription Covering Data Structure Maintain XPEs by identifying the covering relations among them This figure describes a novel data structure called subscription tree for maintaining subscriptions. The data structure captures the covering relations among subscriptions. It can speed up the publication and subscription matching as well. In the subscription tree, a subscription at a node in the tree covers all subscriptions in its subtree. Since a covering relation defines a partial order among susbcriptions, a tree data structure can not capture all the covering relations. A subscription node can have only one parent in the tree, but it may be covered by several subscriptions. We allow each node having a set of super pointers, which indicate the covering relations with nodes outside its subtree. Super pointers are shortcuts to subscriptions that the node covers. If there is no covering relation among a set of subscriptions, subscriptions can be merged into a new subscription to create a more concise routing table. In the figure, /a/b/a, /a/b/b and /a/b/d can be merged and they are represented by a new node /a/b/*. There was a super pointer should be removed because there is no covering relations between the pointer owner and the merger. Another case is , /b/d and /b/e are merged into /b/*, their children are the new node�s children. The super pointer at /b/d/a was not changed.This figure describes a novel data structure called subscription tree for maintaining subscriptions. The data structure captures the covering relations among subscriptions. It can speed up the publication and subscription matching as well. In the subscription tree, a subscription at a node in the tree covers all subscriptions in its subtree. Since a covering relation defines a partial order among susbcriptions, a tree data structure can not capture all the covering relations. A subscription node can have only one parent in the tree, but it may be covered by several subscriptions. We allow each node having a set of super pointers, which indicate the covering relations with nodes outside its subtree. Super pointers are shortcuts to subscriptions that the node covers. If there is no covering relation among a set of subscriptions, subscriptions can be merged into a new subscription to create a more concise routing table. In the figure, /a/b/a, /a/b/b and /a/b/d can be merged and they are represented by a new node /a/b/*. There was a super pointer should be removed because there is no covering relations between the pointer owner and the merger. Another case is , /b/d and /b/e are merged into /b/*, their children are the new node�s children. The super pointer at /b/d/a was not changed.

8. XML Routing in Data Dissemination Networks Evaluation Two data sets for NITF which include 100,000 XPEs each. They have different probability of �*� and �//� occurring in the XPE. The routing table size is reduced dramatically by covering optimization. Publication routing performance is improved by up to 85% because of a compacter routing table. The covering optimization can reduce the routing table size dramatically, and therefore, improve the publication routing performance by up to 85% because of a compacter routing table. To verify this fact, we generate two data sets for NITF which include 100,000 XPEs each. We vary the probability of �*� and �//� occurring at a location step to generate two data sets A and B with different covering rates 90% and 50%, respectively. The results suggest that the covering algorithm perform better on data sets with higher degree of overlap. That is, covering technique gains more benefit when subscribers in a community has similar interests. For the experiment of publication routing time, we generate 500 XML documents and extract 23,098 publications from these documents. The table shows the routing time of the publications against 100,000 XPEs. The measurements are obtained by averaging the time taken to route all publications. After applying the covering algorithm, the routing time for Set A and B are reduced by 84.6% and 47.5%, respectively. Note: the performance of original matching system has been evaluated against YFilter in the previous work[2].The covering optimization can reduce the routing table size dramatically, and therefore, improve the publication routing performance by up to 85% because of a compacter routing table. To verify this fact, we generate two data sets for NITF which include 100,000 XPEs each. We vary the probability of �*� and �//� occurring at a location step to generate two data sets A and B with different covering rates 90% and 50%, respectively. The results suggest that the covering algorithm perform better on data sets with higher degree of overlap. That is, covering technique gains more benefit when subscribers in a community has similar interests. For the experiment of publication routing time, we generate 500 XML documents and extract 23,098 publications from these documents. The table shows the routing time of the publications against 100,000 XPEs. The measurements are obtained by averaging the time taken to route all publications. After applying the covering algorithm, the routing time for Set A and B are reduced by 84.6% and 47.5%, respectively. Note: the performance of original matching system has been evaluated against YFilter in the previous work[2].

9. XML Routing in Data Dissemination Networks Evaluation A real 127 XML router network is built to investigate the impact of advertisement-based routing and covering techniques on network traffic. Advertisements are used to avoid the flooding of XPath queries, and can reduce the overall network traffic up to 75.6% In this experiment, we investigate the impact of advertisement-based routing and covering techniques on network traffic, given a tree-like broker topology. The broker overly network is a tree in which each broker is connected to 2 subordinate brokers. We build two real overlays. One has three levels, which consists of 7 brokers. The other broker overlay has seven levels with there are 64 leaf brokers, and 127 brokers in total. Each leaf broker is connected with a subscriber. Publishers randomly connect to the broker overlays. We generate 1000 distinct XPEs for each subscriber using the PSD DTD, and 50 XML documents for the publishers. 4,182 publications are extracted from these documents. The table shows the total number of messages in two broker overlays generated by one publisher. These messages, including advertisements, publications and subscriptions, are received by all brokers in the network under different routing strategies. As can be seen, the two advertisement-based routing methods significantly reduce the network traffic, because in this case a subscription is not flooded, and it is only forwarded to brokers that are on a path from the respective subscriber to the publisher. The introduction of advertisements reduces the network traffic to 68.5% and 75.6% for non-covering-based and covering-based routing strategies, respectively. Moreover, applying both advertisement-based routing and covering-based routing technique can reduce the overall network traffic to 66.2% and 49.9% in the two topologies. The experiment suggests that using advertisement to avoid subscription flooding and removing redundant queries by exploring covering relations among subscriptions can reduce the network traffic and save system resources. We gain more benefit in a larger broker network. The scalability of the system has been improved.In this experiment, we investigate the impact of advertisement-based routing and covering techniques on network traffic, given a tree-like broker topology. The broker overly network is a tree in which each broker is connected to 2 subordinate brokers. We build two real overlays. One has three levels, which consists of 7 brokers. The other broker overlay has seven levels with there are 64 leaf brokers, and 127 brokers in total. Each leaf broker is connected with a subscriber. Publishers randomly connect to the broker overlays. We generate 1000 distinct XPEs for each subscriber using the PSD DTD, and 50 XML documents for the publishers. 4,182 publications are extracted from these documents. The table shows the total number of messages in two broker overlays generated by one publisher. These messages, including advertisements, publications and subscriptions, are received by all brokers in the network under different routing strategies. As can be seen, the two advertisement-based routing methods significantly reduce the network traffic, because in this case a subscription is not flooded, and it is only forwarded to brokers that are on a path from the respective subscriber to the publisher. The introduction of advertisements reduces the network traffic to 68.5% and 75.6% for non-covering-based and covering-based routing strategies, respectively. Moreover, applying both advertisement-based routing and covering-based routing technique can reduce the overall network traffic to 66.2% and 49.9% in the two topologies. The experiment suggests that using advertisement to avoid subscription flooding and removing redundant queries by exploring covering relations among subscriptions can reduce the network traffic and save system resources. We gain more benefit in a larger broker network. The scalability of the system has been improved.

10. XML Routing in Data Dissemination Networks Conclusions Investigate advertisement-based routing for XML data dissemination networks. Present algorithms to determine the covering relations between arbitrary XPEs, and a novel data structure to maintain covering & merging relationships among XPEs. Propose rules to merge similar XPEs in order to further reduce the routing table size. Perform experimental evaluation on a real 127 broker overlay to demonstrate the approach. Reduce routing table by up to 90% Improve routing latency by roughly 85%

11. XML Routing in Data Dissemination Networks References [1] XML-based Toronto Publish/Subscribe System:xml-topss.msrg.utoronto.ca [2] Shuang Hou and H.-A. Jacobsen. Predicate-based filtering of XPath expressions. ICDE 2006 [3] Guoli Li, Shuang Hou and H.-A. Jacobsen. Routing of XML and XPath queries in data dissemination networks. Technical Report. University of Toronto, Oct 2006 [4] PADRES project is a distributed content-based publish/subscribe middleware that is designed for enterprise-grade event management applications (non XML-based routing.) padres.msrg.utoronto.ca/Padres/WebHome

XML Routing in Data Dissemination Networks

XML Routing in Data Dissemination Networks

Presentation Transcript

Routing in Sensor Networks

Data Dissemination in Vehicular Networks

Innovations in Data Dissemination

Routing in Optical Networks

Data Dissemination

Routing and Data Dissemination

User-Centric Data Dissemination in Disruption Tolerant Networks

Routing in wireless networks

Data dissemination

Data Dissemination in Vehicular Ad Hoc Networks

Routing in Mesh Networks

Tree-Pattern Aggregation for Scalable XML Data Dissemination

Routing of XML and XPath Queries in Data Dissemination Networks

Routing in Sensor Networks

Data Dissemination and Fusion In Sensor Networks

Routing in Mobile Networks

Data Dissemination and Fusion in Sensor Networks

Routing and Data Dissemination

XML Data Dissemination using Automata on top of Structured Overlay Networks