
Adaptations to Peer to Peer Networks




  1. An analysis of: Adaptations to Peer to Peer Networks. Improving Search in Peer-to-Peer Networks: Beverly Yang, Hector Garcia-Molina, Computer Science Department, Stanford University. An Adaptive Peer-to-Peer Network for Distributed Caching of OLAP Results: Panos Kalnis, Wee Siong Ng, Beng Chin Ooi, Dimitris Papadias, Kian-Lee Tan, Department of Computer Science, Hong Kong University of Science and Technology; Department of Computer Science, National University of Singapore. Presentation by: Timothy Shih, Undergraduate Computer Science Major, Yale University, Class of 2003

  2. Outline • Background • Distributed Caching of OLAP results • Improving Search Techniques • Conclusion

  3. Background – P2P networks Definition: Distributed systems in which nodes of equal roles and capabilities exchange information and services directly with each other. Benefits and Strengths - Sharing large volumes of data - Self-organization - Load-balancing - Adaptation - Fault Tolerance - Highly Scalable - No Central Administration

  4. Background – P2P networks (cont’d) Usability of a P2P network depends on: • Searching for data • Retrieving data P2P networks can be divided into two categories: • Storage/Archival – focus is upon availability of the data • Loose – focus is upon searching for and sharing data

  5. Loose Networks Characteristics • Persistence and availability are not guaranteed • Design incorporates a wide range of users from non-cooperating organizations • Users are ad-hoc and may be geographically dispersed Result • Search techniques can afford to have looser guarantees • Search techniques cannot afford to strictly regulate the network • Capable of “richer” queries (e.g. keyword searches with regular expressions) • Cost of retrieving data across locations may be prohibitive

  6. Existing Loose Networks FastTrack – Morpheus, Kazaa • 300,000+ Users • “Super-peer” nodes act as a centralized resource for a small number of clients • These “Super-peer” nodes connect to form a pure P2P network Napster - Napster • 160,000 Users (Feb. – 1.6 million!) • “Hybrid” P2P system • Searches performed on a centralized server which indexes a fraction of the available files. File-sharing is P2P. Gnutella and Freenet – Clip2, Limewire • 40,000 Users • Pure P2P, all nodes can receive, forward, and satisfy queries

  7. How to improve pure P2P? Employ a database system on top of the network • Nodes form distributed cache • Query satisfied by a single peer or partial results from many peers • Efficient use of available bandwidth Improve Search Techniques • Make queries more efficient • Generate as little load as possible • Provide good user experience

  8. Distributed Caching of OLAP Results Definition: On-line Analytical Processing (OLAP) - a class of tools that can extract and present multidimensional data (“data cubes”) from different points of view. Basic Idea • Peers cache results of OLAP queries (gained from the data warehouse) • Upon initiating a new query, if it cannot be answered locally, it is propagated through the network until the answer is found. • An answer may be constructed from the partial results of many peers

  9. Data Cubes Summary • “hypercubes”: N-dimensional view of data • Each dimension may be hierarchical, containing multiple attributes. (e.g. Time may be divided into years, months, weeks, etc.) • The values of the cells of the cube represent some “measures” of interest. • A hypercube consists of 2^n summaries of activities • There are n! paths from the origin • OLAP queries the hypercube, which may be fully materialized, partially materialized, or not materialized.
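As a quick sanity check on the 2^n figure, here is a short Python sketch (dimension names are illustrative) that enumerates every group-by view of an n-dimensional cube, one per subset of dimensions:

```python
from itertools import combinations

def groupby_views(dimensions):
    """Enumerate all group-by views of an n-dimensional cube:
    one view per subset of the dimensions, 2^n views in total."""
    views = []
    for r in range(len(dimensions) + 1):
        views.extend(combinations(dimensions, r))
    return views

# A 3-dimensional cube has 2^3 = 8 views, from the empty (ALL)
# view up to the fully detailed (product, supplier, consumer) view.
views = groupby_views(["product", "supplier", "consumer"])
```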

  10. Multidimensional Data Simple visual representation of 3-dimensional data. 3 Dimensions – Sales, Year, Products. X is the “measure” of interest. Tabular representation of 6-dimensional data: • 3 Dimensions – Sales Rep, Quarter, Location • 3 Attributes – Location subdivided into Country and City; Quarter subdivided into months • Sales is the “measure” of interest

  11. Data Cube Lattices Lattice for a 3-dimensional data cube. Dimensions are: Product, Consumer, Supplier • Each node in the lattice represents a possible group-by query • Edges represent interdependencies among the nodes • There is a path from node ui to uj if ui can be used to answer uj • So for node psc there are 6 million rows, etc. Therefore, a fully materialized data cube consists of 19 million rows.
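The edge rule of the lattice can be stated compactly: view ui can be used to answer view uj exactly when uj's group-by dimensions are a subset of ui's. A minimal sketch:

```python
def can_answer(u_i, u_j):
    """There is a path from view u_i to view u_j in the data-cube
    lattice iff the dimensions of u_j are a subset of those of u_i."""
    return set(u_j) <= set(u_i)

# psc can answer pc (aggregate supplier away), but p cannot answer ps.
```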

  12. On-Line Analytical Processing (OLAP) • OLAP query extracts information from hypercubes • Materializing entire cube results in efficient queries but takes up a lot of space. • Several Techniques for accelerating OLAP. Basic Idea – Partial Materialization - Static Approach : Partial views selected when warehouse is created. (e.g. Greedy algorithms based on statistical properties of expected workload) - Dynamic Approach : Semantic Data Caching – results of previous queries stored with their semantic information.

  13. Dynamic OLAP optimization systems Watchman • Semantic Cache Manager • Stores results of query + query string • Subsequent queries answered by cached data if there is an exact match on the query string Dynamat • Stores fragments • Fragments are aggregate query results in a finer granularity than views • Fragments may be aggregated to answer more general queries • Same caching policy as Watchman

  14. Chunking • Decomposition of multidimensional space into chunks • Chunk – smallest piece of cached information • When processing query, system computes set of chunks required to answer it. • Chunks subdivided into two subsets based on whether they are cached or not • To answer query, system requests missing chunks from warehouse. • PeerOLAP uses chunking - uniformity of semantic regions - efficient space utilization
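A minimal sketch of the chunk-splitting step described above (chunk ids here are plain strings; in practice they identify multidimensional regions):

```python
def split_chunks(query_chunks, cache):
    """Divide the chunks a query needs into those already cached
    locally and those that must be fetched elsewhere."""
    cached = [c for c in query_chunks if c in cache]
    missing = [c for c in query_chunks if c not in cache]
    return cached, missing
```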

  15. PeerOLAP Architecture Purpose (two-fold): • Accelerate OLAP queries through a distributed cache • Extend Peer-to-Peer networks to support execution of complex queries across multiple sources and to use intermediate results to answer consecutive queries. Features: • Employs mobile agents • Shares data at a finer granularity as well as computational power • Dynamic reconfiguration • Employs Location Independent Global Names Lookup (LIGLO) servers

  16. PeerOLAP network example • P2 requests (c1,c2,c3) • Neighbors compute an estimate of the cost of retrieving the chunks • Each message is assigned a maximum # of hops • After the query is processed, P2 decides which chunks to cache • Each chunk is assigned a benefit value; non-beneficial chunks are evicted • Chunks may be cached at neighbors • Neighbors are periodically reassigned based on the benefit metric

  17. Architecture of a Peer • Local cache is organized as a chunk file • Cache control module implements the admission/replacement policy • P2P platform implements low-level communication among peers, and between peers and the warehouse

  18. Cost Model Let c be a chunk and size(c) be its size in tuples. S(c,P) denotes the cost of computing c at node P. N(c,Q -> P) denotes the network cost of transferring c from Q to P: N(c,Q -> P) = Cn(Q -> P) / k + size(c) / Tr(Q -> P) where Cn(Q -> P) is the cost of establishing a connection from Q to P, k is the number of chunks transferred over that connection, and Tr(Q -> P) is the transfer rate from Q to P. T(c,Q -> P) denotes the total cost of answering c at P using data from Q: T(c,Q -> P) = S(c,Q) + N(c,Q -> P)
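The cost model translates directly into code. This sketch assumes consistent units (e.g. tuples and tuples per second); the function and parameter names are illustrative:

```python
def network_cost(size_c, conn_cost, k, transfer_rate):
    """N(c, Q->P): connection-setup cost amortized over the k chunks
    transferred on this connection, plus the transfer time."""
    return conn_cost / k + size_c / transfer_rate

def total_cost(compute_cost, size_c, conn_cost, k, transfer_rate):
    """T(c, Q->P) = S(c, Q) + N(c, Q->P)."""
    return compute_cost + network_cost(size_c, conn_cost, k, transfer_rate)

# e.g. S = 5, Cn = 10 amortized over k = 2 chunks, 100 tuples at rate 50:
# T = 5 + 10/2 + 100/50 = 12
```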

  19. Query Processing Query q has the form: SELECT <grouping predicates> AGR(measure) FROM data WHERE <selection predicates> GROUP BY <grouping predicates> Let σ denote the set of selection predicates and γ denote the set of grouping predicates. v is a representative view of q if the set of dimensions of v is σ ∪ γ. A PeerOLAP node computes the results of such queries by first accessing the set C of chunks at the same level of granularity as v and then performing the necessary selections and aggregations on them
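The representative-view rule is a one-liner; a tiny sketch (names illustrative):

```python
def representative_view(selection_dims, grouping_dims):
    """Dimensions of the representative view of q: the union of the
    selection-predicate and grouping-predicate dimension sets."""
    return set(selection_dims) | set(grouping_dims)
```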

  20. Eager Query Processing (EQP) Node P issues query q. 1. Query q is decomposed into chunks of the same granularity as the representative view. Call denotes this set of chunks. 2. P checks its local cache. Clocal denotes the set found locally; let Cmiss denote the rest. 3. P sends a message to neighbors Q1, …, Qk asking for Cmiss. If Qi has a subset of Cmiss it calculates the cost T(ci, Qi->P) for each chunk and sends the estimates to P. If Qi has no chunks, it does not respond. In either case, Qi propagates the request for Cmiss to its neighbors recursively up to the maximum # of hops. 4. P receives responses for a period t, after which no more results are accepted. 5. Let Cpeer be the subset of Cmiss that was found in the network. P retrieves Cpeer in the following manner: chunk ci is randomly selected from Cpeer and assigned to Qi, where Qi is the peer that provides ci at the lowest cost. Qj is the peer that provides the next randomly selected chunk cj at the lowest cost. If Qi also contains cj, check whether the total cost T({ci, cj}, Qi) < T(ci, Qi) + T(cj, Qj); if so, assign both chunks to Qi. Continue this process for all the remaining chunks in Cpeer. 6. P initializes connections to the peers defined in the execution plan and requests the corresponding chunks. Let Cevicted be the set of chunks not returned in this manner. 7. The set CDW of chunks still missing is: CDW = Cmiss – (Cpeer – Cevicted). P retrieves CDW from the warehouse. 8. P composes the answer and returns it to the user. The new chunks are sent to cache control and network reconfiguration is performed if necessary.
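Steps 5-7 can be sketched in a few lines. This is a deliberately simplified version: it assigns each missing chunk to the cheapest peer that offered it and sends the rest to the warehouse, omitting the paper's pairwise check T({ci,cj},Qi) < T(ci,Qi) + T(cj,Qj) that batches chunks onto one connection. The data-structure shape of `offers` is an assumption:

```python
def build_execution_plan(c_miss, offers):
    """Assign each missing chunk to the cheapest offering peer.
    `offers` maps peer -> {chunk: estimated cost} (illustrative layout).
    Chunks nobody offered fall back to the warehouse (step 7)."""
    plan, from_warehouse = {}, []
    for chunk in c_miss:
        candidates = [(cost[chunk], peer) for peer, cost in offers.items()
                      if chunk in cost]
        if candidates:
            _, best_peer = min(candidates)       # cheapest estimate wins
            plan.setdefault(best_peer, []).append(chunk)
        else:
            from_warehouse.append(chunk)         # no peer has it
    return plan, from_warehouse
```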

  21. Lazy Query Processing (LQP) Same as EQP except for step 3: 3. P sends a message to neighbors Q1, …, Qk asking for Cmiss. If Qi has a subset of Cmiss it calculates the cost T(ci, Qi->P) for each chunk and sends the estimates to P. If Qi has no chunks, it does not respond. In either case, Qi propagates the request for Cmiss to its neighbors recursively up to the maximum # of hops. Changed to: 3. P sends a message to neighbors Q1, …, Qk asking for Cmiss, but each neighbor propagates the request only to its most beneficial neighbor. In addition, if Qi can answer some of the chunks, it removes them from the propagated message. Note: If each peer has k neighbors, the # of messages is O(k * hmax), as opposed to O(k^hmax) for EQP.
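The message-count comparison can be sanity-checked numerically. A small sketch under the simplifying assumption that the overlay behaves like a k-ary tree with no duplicate paths:

```python
def messages_eqp(k, h_max):
    """EQP floods: every node forwards to all k of its neighbors,
    giving roughly k + k^2 + ... + k^h_max messages, i.e. O(k^h_max)."""
    return sum(k ** h for h in range(1, h_max + 1))

def messages_lqp(k, h_max):
    """LQP: k messages at the first hop, then each of the k chains
    forwards to a single most-beneficial neighbor per remaining hop,
    i.e. O(k * h_max)."""
    return k * h_max

# With k = 4 neighbors and h_max = 3 hops: 84 messages vs 12.
```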

  22. Caching Policy Caching policy is defined by assigning a benefit metric B() to every chunk c at peer P. H(P -> Q) is # hops from P to Q a is a constant representing overhead of sending one message Replication is allowed but only if necessary.

  23. Admission and Replacement Algorithm Follows the Least Benefit First (LBF) algorithm. • Each cached chunk is assigned a weight W() which is initially equal to the computed benefit of that chunk • W() is decreased every time a new chunk is considered for admission • W() is restored whenever that chunk is accessed again • LBF marks victims to evict that will release enough space for the new chunk. • Victims are evicted only if the benefit of storing the new chunk exceeds the combined benefit of the victims. • LBF controls the local cache of each peer
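A minimal sketch of the LBF admission decision, under stated simplifications: the periodic decay and restore of W() on access are omitted, and the cache layout (`chunk -> (size, weight)`) is an assumed structure, not the paper's:

```python
def lbf_admit(cache, chunk, size, benefit, capacity):
    """Least Benefit First admission sketch. Mark victims in ascending
    weight order until the new chunk fits; admit only if its benefit
    exceeds the combined weight of the marked victims."""
    used = sum(s for s, _ in cache.values())
    free = capacity - used
    victims, freed = [], 0
    for victim, (vsize, _) in sorted(cache.items(), key=lambda kv: kv[1][1]):
        if free + freed >= size:
            break                       # enough space already marked
        victims.append(victim)
        freed += vsize
    if free + freed < size:
        return False                    # cannot fit even after evictions
    if victims and benefit <= sum(cache[v][1] for v in victims):
        return False                    # victims are worth more; reject
    for v in victims:
        del cache[v]
    cache[chunk] = (size, benefit)
    return True
```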

  24. Caching Policies Isolated Caching Policy (ICP) • Assumes that peer P is autonomous and will attempt to benefit from other peers in a greedy manner • P publishes its cache contents but does not count hits on its cache from other peers. (e.g. if neighbor Q requests chunk ci, P will not restore ci's weight to its original value) Hit Aware Caching Policy (HACP) • Attempts to ensure caches cooperate to minimize total query cost • P counts hits on its cache from other peers. Voluntary Caching • Exploits underutilized resources that may exist in other peers • If P cannot store chunk c it will ask whether its neighbors can store it before evicting it. • Voluntary caching may work in conjunction with either ICP (v-ICP) or HACP (v-HACP).

  25. Network Reorganization • Attempt to optimize network structure by creating virtual neighborhoods of peers with similar query patterns • Maintain statistics on slightly more than ncmax connections. • After pre-specified k requests are served, evict neighbor if there is a more beneficial neighbor nearby.

  26. Evaluation (DCSR) • All evaluations done on two data sets: SYNTH and APB • APB contains 3.5G tuples; SYNTH contains 69M tuples • Total space chunked such that each chunk at a dimensional range was proportional to the number of values at that level • Size of largest chunk was 1M tuples • Detailed Cost Saving Ratio (DCSR): wcost(qi) is the cost of answering query qi in the worst case (the peer has no cache)

  27. PeerOLAP vs. Client-Side Cache - 10 Peers all connected to each other (clique network) - Cache size varied from .001% to 10% of total data cube size - Query set – 20K queries following 80-20 rule - Each peer issues same number of queries - Naïve configuration: LQP and ICP

  28. Results - Results from APB and SYNTH datasets follow similar trend for all experiments. - Central Cache represents optimal case of the system - Clique network achieves near-optimal performance

  29. Results (realistic configuration) • Each peer connected to 4 others only • Max hops allowed is 3 • Cache size of each peer set to 1% • Policies remain the same (LQP with ICP) • Number of peers varied from 10 to 100 • Peers divided into groups of 10 • Each group is provided a separate query set following 90-10 rule

  30. Evaluation of Query Optimization • Network of 100 peers, each with cache set to 1% • Caching policy is ICP and network reorganization is disabled • Query set consists of 10 groups of 10 peers, each accessing a different hot region • Fix # hops to 3, vary neighbors per peer • Fix # neighbors to 4, vary # of hops • Performance metric is based on total execution cost; it provides no information about response time. EQP transmits a large number of messages to the network

  31. Comparison of Caching Policies • Clique network of 10 peers • 20K queries following 90-10 • # of queries not the same for all peers • In each dataset Qk, one peer receives k queries and the rest are divided among the remaining 9. • Q90 query set with varying cache from 1 to 10%

  32. Evaluation of Caching Policies • Although HACP not beneficial in itself, it combines well with voluntary cache

  33. Effect of network reorganization • Network of 100 peers • Cache size set to 1% • LQP and ICP • Max # hops set to 5 • Peer reorganizes neighbors after 40 queries (first graph) • Peer has 4 neighbors (second graph)

  34. PeerOLAP conclusions • PeerOLAP optimizes the typical client-server architecture by constructing a large virtual cache which can benefit all peers • Fully distributed and highly scalable; the network has no specific structure and the participation of peers is unpredictable • Significant performance gains through query optimization techniques, caching policies, network reconfiguration Extensions and shortcomings: • Only considers cached chunks at the same level of aggregation as the query • Network reconfiguration algorithm (LFU) not sophisticated • Clustering problem – identifying neighborhoods of peers with similar access patterns

  35. Improving Search Techniques (Metrics) Cost (Given a set of queries Qrep) • Average Aggregate Bandwidth • Average Aggregate Processing Cost Quality • Number of Results • Satisfaction (Given number of results Z) • Time to satisfaction Note: In general, there is a tradeoff between cost and quality.

  36. Current Search Techniques Gnutella • Breadth First Search traversal (BFS) • Depth Limit (D) = system-wide TTL of a message in hops • Each node forwards query to all neighbors if # hops < D Freenet • Depth First Search traversal (DFS) • Depth Limit (D) • Each node forwards query to a single neighbor, then waits • If query not satisfied, forward to another neighbor • Otherwise, forward results back to source
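Gnutella's flood can be sketched with a standard breadth-first traversal; the graph layout (adjacency dict) and the `matches` predicate are illustrative assumptions:

```python
from collections import deque

def bfs_query(graph, source, matches, ttl):
    """Gnutella-style flood: each node forwards the query to all of its
    neighbors while the hop count is below the system-wide TTL D.
    Returns the nodes whose local data satisfied the query."""
    results, seen = [], {source}
    frontier = deque([(source, 0)])
    while frontier:
        node, hops = frontier.popleft()
        if matches(node):
            results.append(node)
        if hops < ttl:
            for nb in graph.get(node, ()):
                if nb not in seen:
                    seen.add(nb)
                    frontier.append((nb, hops + 1))
    return results
```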

  37. Problems with Searching BFS • Ideal if Quality = # results • If Quality = satisfaction, waste of resources DFS • Minimizes cost • Sequential execution = poor response time (worst case is exponential in D) Existing techniques fall on opposite extremes of bandwidth/processing cost and response time. - Goal is to find “middle ground” while maintaining quality

  38. Proposed Broadcast Policies Iterative Deepening • Assumption: Quality = satisfaction • Institute a system-wide policy P = {a, b, c} where # iterations = 3 • Establish a waiting time W specifying the time between iterations • Source S initiates a BFS of depth a, waits for W, then sends a Resend with a TTL of a. • Nodes receiving the Resend will forward it. If a node is at depth a, it drops the Resend and unfreezes the query with TTL b – a.
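The policy above can be sketched as follows. This is a simplification: a real node freezes the query at each depth and unfreezes it on a Resend, whereas this sketch simply re-runs a depth-limited BFS per iteration; graph layout and `matches` are illustrative:

```python
from collections import deque

def bfs_to_depth(graph, source, matches, depth):
    """Plain BFS truncated at `depth` hops (the Gnutella-style flood)."""
    results, seen, frontier = [], {source}, deque([(source, 0)])
    while frontier:
        node, hops = frontier.popleft()
        if matches(node):
            results.append(node)
        if hops < depth:
            for nb in graph.get(node, ()):
                if nb not in seen:
                    seen.add(nb)
                    frontier.append((nb, hops + 1))
    return results

def iterative_deepening(graph, source, matches, policy, wanted):
    """Try each depth of the policy P = [a, b, c] in turn; stop at the
    first depth whose results satisfy the user (>= wanted results)."""
    results = []
    for depth in policy:
        results = bfs_to_depth(graph, source, matches, depth)
        if len(results) >= wanted:
            return results, depth
    return results, policy[-1]
```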

  39. Proposed Broadcast Policies(cont’d) Directed BFS • Attempt to minimize response time • Each node maintains simple statistics on its neighbors • Source S sends query to a specific subset of neighbors • These nodes then perform BFS on all of its neighbors • Possible Heuristics for intelligent selection: - highest number of results from previous queries - response message takes lowest average # hops - forwarded largest # messages to source - shortest message queue
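One of the selection heuristics listed above (highest number of results from previous queries) reduces to a sort over per-neighbor statistics; a minimal sketch with an assumed stats layout:

```python
def select_neighbors(result_counts, n):
    """Directed BFS neighbor selection: pick the n neighbors that
    returned the most results on past queries."""
    return sorted(result_counts, key=result_counts.get, reverse=True)[:n]
```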

  40. Proposed Broadcast Policies(cont’d) Local Indices • Attempt to maintain satisfaction and # results while keeping cost low • Each node n maintains an index of the data of all nodes within r hops of itself where r is a system-wide variable known as the radius • When a node receives a query, it processes it on behalf of all nodes within r hops of itself • Institute system-wide policy P= {D1, D2} where only the nodes at the specified depth process the query • To create and maintain indices, steps must be taken whenever a node joins, leaves, or updates its data.
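Index construction for a single node can be sketched as a BFS bounded by the radius r; the data layout (each node owning a set of items) is an illustrative assumption:

```python
from collections import deque

def build_local_index(graph, data, node, r):
    """Local Indices sketch: a node indexes the data of every peer
    within r hops, so it can process queries on their behalf."""
    index = set(data.get(node, ()))
    seen, frontier = {node}, deque([(node, 0)])
    while frontier:
        cur, hops = frontier.popleft()
        if hops < r:
            for nb in graph.get(cur, ()):
                if nb not in seen:
                    seen.add(nb)
                    index |= set(data.get(nb, ()))
                    frontier.append((nb, hops + 1))
    return index
```

In a real deployment this index must be maintained incrementally as peers join, leave, or update their data, as the slide notes.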

  41. Relative Performance

  42. Recommendations • Iterative Deepening can be applied immediately resulting in less bandwidth and processing consumption without sacrificing satisfaction of the query. However, it is 90% slower than BFS • Directed BFS can be applied immediately resulting in less bandwidth and processing consumption. However it is 40% slower than BFS and only has an 86% chance of satisfaction • Local Indices cannot be applied immediately, however it is approx as fast as BFS while resulting in significant savings in bandwidth and processing consumption.

  43. Conclusion • PeerOLAP is more of a proof-of-concept than an immediately applicable solution. The creators are attempting to combine the benefits of a multidimensional database with a P2P network, thereby increasing the effectiveness of both technologies. However, the quality of the results does not present a sufficiently broad overview of the benefits it provides for either. Both the network reconfiguration concept and the algorithms (LBF, LFU) may be optimizable, resulting in a better solution. • Improving current search techniques by applying ID or D-BFS does result in significant savings of bandwidth and processing power. However, there is a trade-off in time to satisfaction, and given the simplicity of the “improvement” it seems to be the P2P designer’s prerogative whether or not to implement the solution. Local Indices are an admirable step in the right direction, but they require a restructuring of currently available networks.

  44. Data Collection (EXTRA SLIDES) • Evaluations based on the Gnutella network • Some metrics can be directly measured (e.g. satisfaction of a BFS query) • Some metrics can only be indirectly measured (e.g. Local Indices): collect performance data for queries under BFS, then combine the data using analysis to estimate performance

  45. Iterative Deepening (Symbols) Representative set of (500) queries Qrep selected randomly from a large set of 500,000 observed queries

  46. Directed BFS (Symbols)

  47. Calculating Costs – sizes of messages

  48. Bandwidth Cost

  49. Calculating Costs – Costs of actions

  50. Experiments
