
Large Scale Sharing

Presentation Transcript


  1. Large Scale Sharing
  Marco F. Duarte
  COMP 520: Distributed Systems
  September 19, 2004

  2. Introduction
  • P2P sharing systems are very popular
  • In P2P, all nodes have identical capabilities and responsibilities
  • Popular approaches are partially centralized, scale poorly, or do not provide the desired anonymity
  • Scalability is critical for these systems
  • Hence the need for decentralized, load-balancing architectures

  3. Features desired in a P2P sharing system
  • Decentralized architecture – no single point of failure
  • Scalability – bandwidth and load balancing
  • Fault tolerance – content replication
  • Anonymity for users – posters, readers, storers
  • Resilience against DoS attacks

  4. Freenet provides anonymity
  • Requester and provider identities are not carried explicitly in messages
  • The presence of a file at a node does not imply authorship
  • Popular files are replicated to improve locality
  • Freenet does not intend to provide permanent storage

  5. Freenet Queries
  • Files receive FileIDs (the 160-bit SHA-1 hash of a “file identifier” string)
  • Queries carry a pseudo-unique random identifier (QueryID) and a hops-to-live count
  • Each node’s routing table maps previously retrieved FileIDs to the nodes they were retrieved from
  • At each hop, a query is forwarded to the entry with the FileID closest to the requested key; QueryIDs are used to detect routing loops
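As a concrete illustration of the keying scheme, here is a minimal Python sketch: a FileID is the 160-bit SHA-1 hash of the file’s identifier string, and a query is forwarded to the routing-table entry whose FileID is closest to the requested key. The numeric-distance notion of closeness and the table layout are simplifying assumptions, not Freenet’s exact data structures.

```python
import hashlib

def file_id(identifier: str) -> int:
    """160-bit FileID: the SHA-1 hash of the file's identifier string."""
    return int.from_bytes(hashlib.sha1(identifier.encode()).digest(), "big")

def next_node(routing_table: dict, target: int) -> str:
    """Forward the query to the node associated with the known FileID
    numerically closest to the target key.  (Numeric distance is a
    simplification of Freenet's key-closeness notion.)"""
    best = min(routing_table, key=lambda fid: abs(fid - target))
    return routing_table[best]

# Hypothetical routing table: previously retrieved FileIDs -> node address.
table = {file_id("song.mp3"): "node-a", file_id("paper.pdf"): "node-b"}
print(next_node(table, file_id("thesis.pdf")))
```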

  6. Freenet Queries: Lookups and Stores
  [Figure: a query travels hop by hop from the requester to the node holding the file; the reply retraces the path.]
  • Copies of the file are stored at all nodes along the reply path
  • A routing-table record for the file is added at each of those nodes
  • Writes first perform a lookup, then insert the file along the query path if no match is found
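The lookup-and-replicate behavior can be sketched as follows. The Node class, its fields, and the neighbor ordering are illustrative assumptions rather than Freenet’s actual structures, but the skeleton shows the three mechanisms from the slides: hops-to-live, QueryID loop detection, and replication along the reply path.

```python
# An illustrative sketch, not Freenet's actual API.
class Node:
    def __init__(self, name):
        self.name = name
        self.store = {}          # FileID -> file contents
        self.neighbors = []      # known peers, assumed ordered by key closeness
        self.seen = set()        # QueryIDs already handled (loop detection)

    def lookup(self, file_id, query_id, htl):
        if query_id in self.seen or htl == 0:
            return None          # loop detected or hops-to-live exhausted
        self.seen.add(query_id)
        if file_id in self.store:
            return self.store[file_id]
        for peer in self.neighbors:
            data = peer.lookup(file_id, query_id, htl - 1)
            if data is not None:
                self.store[file_id] = data   # replicate along the reply path
                return data
        return None
```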

  7. Freenet Properties
  • FileID-based clustering improves routing as usage increases
  • LRU-like capacity management: rarely used files are purged from the system
  • The random nature of FileIDs allows for diversity of information at nodes
  • Attempts to supplant existing files lead to propagation of the real file
  • Anonymity features:
    • File ownership is assumed randomly by other nodes
    • Minimal routing information is kept at each hop
    • A hops-to-live count of 1 is decremented only probabilistically, so a query’s true endpoint cannot be inferred
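The LRU-like purging policy above amounts to evicting the least recently requested file once a node’s datastore fills. A minimal sketch using Python’s OrderedDict; the class name and capacity accounting are assumptions for illustration.

```python
from collections import OrderedDict

class NodeStore:
    """Sketch of LRU-like capacity management: when the datastore is
    full, the least recently requested file is purged."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.files = OrderedDict()   # FileID -> contents, oldest first

    def get(self, fid):
        if fid in self.files:
            self.files.move_to_end(fid)   # mark as recently used
            return self.files[fid]
        return None

    def put(self, fid, data):
        self.files[fid] = data
        self.files.move_to_end(fid)
        while len(self.files) > self.capacity:
            self.files.popitem(last=False)  # evict least recently used
```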

  8. Freenet Problems
  • Files stored in the network may never be found
  • Freenet does not provide reliable storage
  • No notion of locality in routing
  • The reported simulations do not involve file insertion or node discovery

  9. PAST: Reliable Distributed Storage
  • Customizable file persistence
  • High availability and load balancing
  • Efficient routing and storage allocation
  • Uses FileIDs generated from hashes, as in Freenet
  • Uses owner credentials to verify the identity of authors
  • Interface: Insert, Lookup, Reclaim

  10. PAST Architecture
  • FileID computed from a hash of the filename, the owner’s public key, and a random salt
  • Each node receives a pseudorandom NodeID, independent of the node’s properties
  • On insert, the owner specifies the number k of replicas of the file to store in the system
  • The file is stored at the k nodes with NodeIDs numerically closest to the FileID
  • Routing is provided by Pastry
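A sketch of the two ID computations and replica placement, assuming 160-bit SHA-1 IDs and numeric distance on the circular ID space; the function names are illustrative.

```python
import hashlib, os

def past_file_id(name: str, owner_pubkey: bytes, salt: bytes) -> int:
    """FileID: hash of the filename, the owner's public key, and a salt."""
    return int.from_bytes(
        hashlib.sha1(name.encode() + owner_pubkey + salt).digest(), "big")

def replica_nodes(node_ids, fid: int, k: int):
    """The k nodes whose NodeIDs are numerically closest to the FileID
    hold the replicas; distance is taken on the circular ID space."""
    M = 160                                   # IDs live in 0 .. 2**M - 1
    dist = lambda a, b: min((a - b) % 2**M, (b - a) % 2**M)
    return sorted(node_ids, key=lambda n: dist(n, fid))[:k]

# Illustrative use with a made-up key and a fresh random salt.
fid = past_file_id("report.pdf", b"owner-public-key", os.urandom(20))
```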

  11. Pastry: Routing for P2P Networks
  • Paths with fewer than ⌈log_{2^b} N⌉ hops (N nodes; b a configuration parameter)
  • Delivery guaranteed unless ⌊l/2⌋ nodes with adjacent NodeIDs fail simultaneously
  • Flexible proximity metric
  • Each node maintains:
    • Leaf set – the l nodes with numerically closest NodeIDs
    • Routing table – a set of neighbors organized by NodeID prefix
    • Neighborhood set – the l nodes closest under the proximity metric
  • Each NodeID is paired with its network address
  • Direct routes to neighbors and to the l closest NodeIDs

  12. Pastry: Example
  • The routing table is organized by similarity (shared prefix) with the node’s own NodeID
  • The neighborhood set is used for node addition and recovery
  • Queries are forwarded to a numerically closer node: prefer a longer shared NodeID prefix, then NodeID proximity
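The forwarding rule can be sketched as follows for base-4 NodeIDs, matching the examples on the next slides. The routing-table layout (routing_table[i][d] holds a node sharing i leading digits with ours and having digit d next) and the leaf-set check are simplified assumptions about Pastry’s actual structures.

```python
def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading digits a and b share."""
    n = 0
    while n < len(a) and a[n] == b[n]:
        n += 1
    return n

def next_hop(my_id: str, key: str, routing_table, leaf_set):
    """One Pastry forwarding step (sketch)."""
    # Fixed-length base-4 strings compare lexicographically as numbers.
    if leaf_set and min(leaf_set) <= key <= max(leaf_set):
        # Key falls in the leaf set: deliver to the numerically closest node.
        return min(leaf_set + [my_id],
                   key=lambda n: abs(int(n, 4) - int(key, 4)))
    i = shared_prefix_len(my_id, key)
    entry = routing_table[i].get(key[i])
    if entry is not None:
        return entry   # extends the shared prefix by one digit
    # Rare case: Pastry falls back to any known node numerically closer
    # to the key; omitted in this sketch.
    return None

# Hypothetical state for node 1033 routing a query for key 1202:
table = [dict(), {"2": "1202"}, dict(), dict()]
print(next_hop("1033", "1202", table, ["0302", "1123"]))  # -> 1202
```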

  13. Pastry Routing Table
  [Figure: node 3133’s routing state on the circular NodeID space (0 = 2^M): its leaf set, neighborhood set, and routing table, drawn among nodes 0231, 0302, 1033, 1123, 1202, 1311, 2031, 2121, 2300, 3013, and 3321.]

  14. Pastry Routing Example
  [Figure: a query for key 3133 is forwarded around the ring, each hop matching a longer NodeID prefix, until it reaches node 3133; other nodes exist but are not shown.]

  15. Pastry Node Insertion Example
  [Figure: new node 3130 joins the ring; it builds its leaf set, routing table, and neighborhood set from nearby nodes, and existing nodes (e.g. 3133) add 3130 to their state.]

  16. Pastry Node Removal Example
  [Figure: a node leaves the ring; its neighbors (3013, 3133, and 3321 shown) repair their leaf sets and routing tables.]

  17. PAST Insertions
  • fileID = Insert(name, owner-credentials, k, file)
  [Figure: the owner routes an insert request for FileID 3130 through the ring; the file and its certificate are stored at the k nodes with NodeIDs closest to 3130.]

  18. PAST Insertions (continued)
  • fileID = Insert(name, owner-credentials, k, file)
  [Figure: each of the k storing nodes issues a store receipt, which is routed back to the owner.]

  19. PAST Semantics
  • file = Lookup(fileID)
    • The request is routed toward NodeID = FileID
    • The first of the k closest nodes reached returns the file and its credentials
  • Reclaim(fileID, owner-credentials)
    • Same routing semantics as Insert
    • The owner issues a reclaim certificate
    • Storing nodes issue reclaim receipts
  • Changes in leaf sets trigger changes in replica locations
  • A new node creates “pointers” to files it should contain; migration is gradual
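The three operations reduce to the following interface sketch (Python signatures only; the routing and receipt logic is elided, and the names are illustrative rather than PAST’s actual API).

```python
class Past:
    """Sketch of the PAST client interface from the slides."""

    def insert(self, name, owner_credentials, k, data) -> int:
        """Route the file to the k nodes closest to its FileID;
        returns the fileID once k store receipts arrive."""
        ...

    def lookup(self, file_id) -> bytes:
        """Route toward NodeID = fileID; the first of the k closest
        nodes reached returns the file and its credentials."""
        ...

    def reclaim(self, file_id, owner_credentials) -> None:
        """Owner issues a reclaim certificate; storing nodes answer
        with reclaim receipts and may later discard the replicas."""
        ...
```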

  20. Load Balancing in PAST: Replica Diversion
  [Figure: leaf sets of nodes 3130 and 3201; a node without room for a replica diverts it to a node in its leaf set and keeps a pointer to the diverted copy.]

  21. Load Balancing in PAST: File Diversion
  [Figure: leaf sets of nodes 3130 and 3201; when an insert cannot be satisfied, the owner changes the FileID by changing the salt, diverting the entire file to a different leaf set.]
  • Policies govern the acceptance of replicas and diverted replicas, and the selection of the node for a diverted replica
  • A node accepts a replica only if the ratio of file size to its free space is below a threshold: t_pri for primary replicas, t_div for diverted replicas
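The threshold test can be sketched as below. The default values for t_pri and t_div are assumptions for illustration (the evaluation slides that follow use t_div = 0.05).

```python
def accepts(file_size: int, free_space: int, diverted: bool,
            t_pri: float = 0.1, t_div: float = 0.05) -> bool:
    """Sketch of PAST's acceptance test: a replica is accepted only if
    file size over free space stays below the threshold; diverted
    replicas face the stricter t_div.  Default values are assumed."""
    if free_space <= 0:
        return False
    threshold = t_div if diverted else t_pri
    return file_size / free_space < threshold
```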

  22. Caching in PAST
  • Highly popular files may demand more replicas than the owner specified
  • Files located “far away” need to be fetched only once if cached locally
  • Unused disk space is allocated as cache
  • Caching performance degrades gradually as utilization increases
  • The cache insertion policy is similar to the diversion policies
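The “unused disk space as cache” idea can be sketched as follows: cached copies fill leftover space and are evicted first whenever a primary replica needs room, which is what makes performance degrade gradually rather than abruptly as utilization rises. The class and its accounting are illustrative assumptions.

```python
class NodeDisk:
    """Sketch: leftover space holds evictable cached copies."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.replicas = {}   # fileID -> size (authoritative copies)
        self.cache = {}      # fileID -> size (evictable popular copies)

    def used(self) -> int:
        return sum(self.replicas.values()) + sum(self.cache.values())

    def store_replica(self, fid: int, size: int) -> bool:
        # Evict cached files until the replica fits.  Eviction order is
        # arbitrary here; PAST's actual replacement policy is weighted.
        while self.used() + size > self.capacity and self.cache:
            self.cache.pop(next(iter(self.cache)))
        if self.used() + size > self.capacity:
            return False     # no room: the replica must be diverted
        self.replicas[fid] = size
        return True
```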

  23. PAST Performance: t_pri comparison, t_div = 0.05

  24. PAST Performance: t_pri comparison, t_div = 0.05

  25. PAST Performance: Ratio of File Diversions

  26. PAST Performance: Ratio of Replica Diversions

  27. PAST Performance: Failed Insertions

  28. PAST Performance: Cache Hits

  29. Conclusions
  • Content-based routing improves the scalability of distributed storage systems
  • Distributed systems need user authentication
  • Caching is crucial for system performance
  • Diversion allows for graceful performance degradation
  • Still needed: file mutability and file search/indexing services
