
Scalable Group Communication for the Internet



  1. Scalable Group Communication for the Internet Idit Keidar MIT Lab for Computer Science Theory of Distributed Systems Group The main part of this talk is joint work with Sussman, Marzullo and Dolev

  2. Collaborators: Tal Anker, Ziv Bar-Joseph, Gregory Chockler, Danny Dolev, Alan Fekete, Nabil Huleihel, Kyle Ingols, Roger Khazan, Carl Livadas, Nancy Lynch, Keith Marzullo, Yoav Sasson, Jeremy Sussman, Alex Shvartsman, Igor Tarashchanskiy, Roman Vitenberg, Esti Yeger-Lotem

  3. Outline • Motivation • Group communication - background • A novel architecture for scalable group communication services in WAN • A new scalable group membership algorithm • Specification • Algorithm • Implementation • Performance • Conclusions

  4. Modern Distributed Applications (in WANs) • Highly available servers • Web • Video-on-Demand • Collaborative computing • Shared white-board, shared editor, etc. • Military command and control • On-line strategy games • Stock market

  5. Important Issues in Building Distributed Applications • Consistency of view • Same picture of game, same shared file • Fault tolerance, high availability • Performance • Conflicts with consistency? • Scalability • Topology - WAN, long unpredictable delays • Number of participants

  6. Generic Primitives - Middleware, “Building Blocks” • E.g., total order, group communication • Abstract away difficulties, e.g., • Total order - a basis for replication • Mask failures • Important issues: • Well specified semantics - complete • Performance

  7. Research Approach • Rigorous modeling, specification, proofs, performance analysis • Implementation and performance tuning • Services → Applications • Specific examples → General observations

  8. Group Communication [Diagram: a process issues Send(G) to multicast to group G] • Group abstraction - a group of processes is one logical entity • Dynamic groups (join, leave, crash) • Systems: Ensemble, Horus, ISIS, Newtop, Psync, Sphynx, Relacs, RMP, Totem, Transis

  9. Virtual Synchrony [Birman, Joseph 87] • Group members all see events in the same order • Events: messages, process crash/join • Powerful abstraction for replication • Framework for fault tolerance, high availability • Basic component: group membership • Reports changes in the set of group members

  10. Example: Highly Available VoD [Anker, Dolev, Keidar, ICDCS 1999] • Dynamic set of servers • Clients talk to the "abstract" service • A server can crash; the client shouldn't know

  11. VoD Service: Exploiting Group Communication • Group abstraction for connection establishment and transparent migration (with simple clients) • Membership services detect conditions for migration - fault tolerance and load balancing • Reliable group multicast among servers for consistently sharing information • Virtual synchrony allows servers to agree upon migration immediately (no message exchange) • Reliable messages for control • Server: ~2500 C++ lines • All fault tolerance logic at the server

  12. Related Projects • Group communication: Specification Survey 99; Architecture for Group Membership in WAN, DIMACS 98; Optimistic VS, SRDS 00; Virtual Synchrony, ICDCS 00; Dynamic Voting, PODC 97; Moshe: Group Membership, ICDCS 00; QoS Support, TINA 96, OPODIS 00; Inheritance-based Modeling, ICSE 00 • Applications: Highly Available VoD, ICDCS 99; Object Replication, PODC 96; CSCW, NGITS 97

  13. A Scalable Architecture for Group Membership in WANs Tal Anker, Gregory Chockler, Danny Dolev, Idit Keidar DIMACS Workshop 1998

  14. Scalable Membership Architecture • Dedicated distributed membership servers “divide and conquer” • Servers involved only in membership changes • Members communicate with each other directly (implement “virtual synchrony”) • Two levels of membership • Notification Service NSView - “who is around” • Agreed membership views

  15. Architecture [Diagram: group members interact with local Notification Service (NS) servers; the NS reports the NSView ("who is around") on failure/join/leave to the membership servers, which deliver an agreed view - a member set and identifier, e.g., <{A,B,C,D,E}, 7>]

  16. The Notification Service (NS) • Group members send requests: • join(Group G), • leave(Group G) directly to (local) NS • NS detects faults (member / domain) • Information propagated to all NS servers • NS servers notify membership servers of new NSView
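To make the request/notification flow above concrete, here is a minimal Python sketch of an NS front end. The class and method names (NotificationService, on_nsview, report_fault) are assumptions made for illustration, not the actual CONGRESS or Moshe API; the slide only specifies join/leave requests, fault detection, and NSView propagation.

```python
# Hypothetical sketch of the NS interface described above; names are
# assumptions, not the actual API. A local NS accepts join/leave requests
# (and its own fault detections) and pushes the resulting NSView to the
# membership servers that subscribed to it.

class NotificationService:
    def __init__(self):
        self.groups = {}        # group name -> set of members currently "around"
        self.subscribers = []   # membership servers to notify of NSView changes

    def join(self, group, member):
        """A group member asks its local NS to join a group."""
        self.groups.setdefault(group, set()).add(member)
        self._notify(group)

    def leave(self, group, member):
        """A group member asks its local NS to leave a group."""
        self.groups.get(group, set()).discard(member)
        self._notify(group)

    def report_fault(self, group, member):
        """The NS detects a member (or domain) fault on its own."""
        self.leave(group, member)

    def subscribe(self, membership_server):
        """A membership server registers for NSView notifications."""
        self.subscribers.append(membership_server)

    def _notify(self, group):
        # Propagate the new NSView ("who is around") to all membership servers.
        nsview = frozenset(self.groups.get(group, set()))
        for server in self.subscribers:
            server.on_nsview(group, nsview)
```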

  17. The NS Communication: Reliable FIFO Links • Membership servers can send each other messages using the NS • FIFO order: if S1 sends m1 and later m2, then any server that receives both receives m1 first • Reliable links: if S1 sends m to S2, then eventually either S2 receives m or S1 suspects S2 (and all of its clients)
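A small sketch of this inter-server channel contract, again with hypothetical names (NSChannel): per-pair FIFO queues make the ordering guarantee explicit, and suspect stands in for the case where delivery cannot be ensured.

```python
# Sketch of the FIFO/reliable channel contract stated above; names are
# hypothetical. Messages between a sender/receiver pair sit in one FIFO
# queue, so if S1 sends m1 before m2, a receiver that gets both gets m1
# first; when delivery cannot be ensured, the sender suspects the receiver.

from collections import deque

class NSChannel:
    def __init__(self):
        self.queues = {}        # (sender, receiver) -> FIFO queue of messages
        self.suspected = set()  # (sender, receiver) pairs the sender gave up on

    def send(self, sender, receiver, msg):
        self.queues.setdefault((sender, receiver), deque()).append(msg)

    def receive(self, sender, receiver):
        """Return the next message from sender, in the order it was sent."""
        q = self.queues.get((sender, receiver))
        return q.popleft() if q else None

    def suspect(self, sender, receiver):
        """Sender suspects receiver (and, implicitly, all of its clients)."""
        self.suspected.add((sender, receiver))
        self.queues.pop((sender, receiver), None)
```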

  18. Moshe: A Group Membership Algorithm for WANs Idit Keidar, Jeremy Sussman Keith Marzullo, Danny Dolev ICDCS 2000

  19. Membership in WAN: the Challenge • Message latency is large and unpredictable • Frequent message loss • Time-out failure detection is inaccurate • We use a notification service (NS) for WANs • Number of communication rounds matters • Algorithms may change views frequently • View changes require communication for state transfer, which is costly in WAN

  20. Moshe’s Novel Concepts • Designed for WANs from the ground up • Previous systems emerged from LAN • Avoids delivery of “obsolete” views • Views that are known to be changing • Not always terminating (but NS is) • Runs in a single round (“typically”)

  21. Member-Server Interaction

  22. Moshe Guarantees • View = <members, identifier> • Identifier is monotonically increasing • No delivery of obsolete views • Conditional liveness property - agreement on views: if "all" eventually have the same last NSView, then "all" eventually agree on the last view • Composable • Allows reasoning about individual components • Useful for applications
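The view abstraction from this slide fits in a few lines. The names below (View, next_view) are assumptions; the sketch only encodes what the slide states: a view is a <members, identifier> pair whose identifiers grow monotonically.

```python
# Sketch of the view abstraction above: a <members, identifier> pair with a
# monotonically increasing identifier. Names are hypothetical.

from dataclasses import dataclass
from typing import FrozenSet, Optional

@dataclass(frozen=True)
class View:
    members: FrozenSet[str]   # the agreed member set
    identifier: int           # grows monotonically across delivered views

def next_view(previous: Optional[View], members, proposed_id: int) -> View:
    """Build a view whose identifier exceeds every previously delivered one."""
    floor = previous.identifier + 1 if previous else 0
    return View(frozenset(members), max(proposed_id, floor))
```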

  23. Moshe Operation: Typical Case • In response to a new NSView (member set): • send a proposal carrying the NSView to the other servers • send startChange to the local members (clients) • Once proposals for this NSView arrive from the servers of all NSView members, deliver to the local members a view with: • members = NSView • an identifier higher than all previous ones
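A sketch of the three steps above, with hypothetical names (MosheServerSketch, start_change, deliver_view, servers_of) and deliberately simplified plumbing; it is not Moshe's actual code, only the typical-case flow from the slide.

```python
# Sketch of Moshe's typical (one-round) case as described above. All names
# and the message plumbing are assumptions made for illustration.

class MosheServerSketch:
    def __init__(self, my_id, channel, local_members, servers_of):
        self.my_id = my_id
        self.channel = channel            # sends ('proposal', nsview) messages
        self.local_members = local_members
        self.servers_of = servers_of      # maps an NSView to its servers
        self.last_view_id = 0
        self.current = frozenset()
        self.proposals = {}               # server id -> proposed NSView

    def on_nsview(self, nsview):
        """React to a new NSView from the notification service."""
        self.current = frozenset(nsview)
        self.proposals = {}
        # 1. Send a proposal carrying the NSView to the other servers.
        for server in self.servers_of(self.current):
            self.channel.send(self.my_id, server, ('proposal', self.current))
        # 2. Tell local members that a view change has started.
        for member in self.local_members:
            member.start_change()

    def on_proposal(self, sender, nsview):
        if nsview != self.current:
            return                        # handled by the out-of-sync cases
        self.proposals[sender] = nsview
        # 3. With proposals from all servers of the NSView members, deliver
        #    <members = NSView, identifier higher than all previous ones>.
        if set(self.proposals) >= set(self.servers_of(self.current)):
            self.last_view_id += 1
            for member in self.local_members:
                member.deliver_view(self.current, self.last_view_id)
```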

  24. Goal: Self-Stabilizing • Once the same last NSView is received by all servers: • all send proposals for this NSView • all the proposals reach all the servers • all servers use these proposals to deliver the same view • And they live happily ever after!

  25. Out-of-Sync Case: Unexpected Proposal [Diagram: servers A, B, C; NS events +C and -C; proposals from B and C reach A, which did not send one] • To avoid deadlock: A must respond

  26. Out-of-Sync Case: Unexpected Proposal [Diagram: servers A, B, C exchange proposals and views across NS events +C, +AB, -C; an extra proposal arrives after views have been delivered] • Extra proposals are redundant; responding with a proposal may cause live-lock

  27. Out-of-Sync Case: Missing Proposal [Diagram: servers A, B, C exchange proposals and views across NS events +C, +AB, -C; one server waits for a proposal that never arrives] • This case was exposed by the correctness proof

  28. Missing Proposal Detection [Diagram: servers A, B, C keep a proposal number (PropNum: 1) and the numbers used for the last view (Used: 1,1,1); after -C and +C, a server finds Used[C] = 1 = PropNum, detects the missing proposal, and moves to proposal number 2]

  29. Handling Out-of-Sync Cases: "Slow Agreement" • Also sends proposals, tagged "SA" • Invoked upon detection of blocking or upon receipt of an "SA" proposal • Upon receipt of an "SA" proposal with a number bigger than PropNum, respond with the same number • Deliver a view only with a "full house" of proposals with the same number
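A sketch of these rules, with assumed names (SlowAgreementSketch, send and deliver callbacks): it shows the "SA" tagging, the echo of a larger proposal number, and the "full house" delivery condition, and nothing beyond what the slide states.

```python
# Sketch of the "Slow Agreement" rules above; names and plumbing are
# hypothetical. Proposals are tagged 'SA' and carry a number; a bigger
# incoming number is echoed, and a view is delivered only on a full house
# of proposals with the same number.

class SlowAgreementSketch:
    def __init__(self, my_id, servers, send, deliver):
        self.my_id = my_id
        self.servers = set(servers)   # servers of the current NSView
        self.send = send              # send(dest, msg) callback
        self.deliver = deliver        # deliver(view_members, number) callback
        self.prop_num = 0
        self.sa_props = {}            # server id -> last SA number received

    def start(self, nsview):
        """Invoked upon blocking detection or receipt of an 'SA' proposal."""
        self.prop_num += 1
        self._propose(nsview)

    def on_sa_proposal(self, sender, nsview, number):
        self.sa_props[sender] = number
        if number > self.prop_num:
            # Respond with the same (bigger) number.
            self.prop_num = number
            self._propose(nsview)
        self._maybe_deliver(nsview)

    def _propose(self, nsview):
        for s in self.servers - {self.my_id}:
            self.send(s, ('SA', nsview, self.prop_num))
        self.sa_props[self.my_id] = self.prop_num
        self._maybe_deliver(nsview)

    def _maybe_deliver(self, nsview):
        # Full house: a same-numbered proposal from every server.
        if all(self.sa_props.get(s) == self.prop_num for s in self.servers):
            self.deliver(nsview, self.prop_num)
```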

  30. How Typical is the “typical” Case? • Depends on the notification service (NS) • Classify NS good behaviors: symmetric and transitive perception of failures • Transitivity depends on logical topology, how suspicions propagate • Typical case should be very common • Need to measure
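One way to make this classification concrete: given each server's set of suspected servers, check that suspicions are symmetric and that connectivity (not suspecting each other) is transitive. The representation and the exact formalization below are illustrative assumptions, not the paper's definitions.

```python
# Illustrative check of the "symmetric and transitive perception of failures"
# classification above. The suspicion-set representation is an assumption.

def connected(suspects, a, b):
    """a and b are connected if neither suspects the other."""
    return b not in suspects.get(a, set()) and a not in suspects.get(b, set())

def is_symmetric(suspects):
    """a suspects b exactly when b suspects a."""
    return all(a in suspects.get(b, set())
               for a, bs in suspects.items() for b in bs)

def is_transitive(suspects, servers):
    """If a is connected to b and b to c, then a is connected to c."""
    return all(connected(suspects, a, c)
               for a in servers for b in servers for c in servers
               if connected(suspects, a, b) and connected(suspects, b, c))

# Example: a clean partition {A, B} vs {C} is symmetric and transitive,
# which is the kind of good behavior under which the one-round case applies.
perception = {'A': {'C'}, 'B': {'C'}, 'C': {'A', 'B'}}
print(is_symmetric(perception), is_transitive(perception, ['A', 'B', 'C']))
```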

  31. Implementation • Use CONGRESS [Anker et al] • NS for WAN • Always symmetric, can be non-transitive • Logical topology can be configured • Moshe servers extend CONGRESS servers • Socket interface with processes

  32. The Experiment • Run over the Internet • In the US: MIT, Cornell (CU), UCSD • In Taiwan: NTU • In Israel: HUJI • Run for 10 days in one configuration, 2.5 days in another • 10 clients at each location • continuously join/leave 10 groups

  33. Two Experiment Configurations

  34. Percentage of "Typical" Cases • Configuration 1: • MIT: 10,786 views, 10,661 in one round - 98.84% • Other sites: 98.8%, 98.9%, 98.97%, 98.85% • Configuration 2: • MIT: 2,559 views, 2,555 in one round - 99.84% • Other sites: 99.82%, 99.79%, 99.81%, 99.84% • Overwhelming majority for one round! • Depends on topology → can scale

  35. Performance: Surprise! [Histogram of Moshe duration (number of runs vs. milliseconds): MIT, configuration 1, runs up to 4 seconds (97%)]

  36. Performance: Part II [Histogram of Moshe duration (number of runs vs. milliseconds): MIT, configuration 2, runs up to 3 seconds (99.7%)]

  37. Performance over the Internet: What is Going on? • Without message loss, the running time is close to the biggest round-trip time, ~650 ms - as expected • Message loss has a big impact • Configuration 2 has much less loss, hence more cases of good performance

  38. "Slow" versus "Typical" • Slow can take 1 or 2 rounds once it is run, depending on PropNum • Slow after NE: • the one-round algorithm runs first, then detection, then slow • without loss - 900 ms, 40% more than usual • Slow without NE: • detection by an unexpected proposal • only the slow algorithm is run • runs in less time than one-round

  39. Unstable Periods: No Obsolete Views • "Unstable" = constant changes, or connected processes differing in failure detection • Configuration 1: • 379 of the 10,786 views took ≥ 4 seconds - 3.5% • 167 took ≥ 20 seconds - 1.5% • Longest running time: 32 minutes • Configuration 2: • 14 of 2,559 views took ≥ 4 seconds - 0.5% • Longest running time: 31 seconds

  40. Scalability Measurements • Controlled experiment at MIT and UCSD • Prototype NS, based on TCP/IP (Sasson) • Inject faults to test “slow” case • Vary number of members, servers • Measure end-to-end latencies at member, from join/leave/suspicion to corresponding view • Average of 150 (50 slow) runs

  41. End-to-End Latency: Scalable! • Member scalability: 4 servers (constant) • Server and member scalability: 4-14 servers

  42. Conclusion: Moshe Features • Avoiding obsolete views • A single round • 98% of the time in one configuration • 99.8% of the time in another • Using a notification service for WANs • Good abstraction • Flexibility to configure multiple ways • Future work: configure more ways • Scalable “divide and conquer” architecture

  43. Retrospective: Role of Theory • Specification • Possible to implement • Useful for applications (composable) • Specification can be met in one round “typically” (unlike Consensus) • Correctness proof exposes subtleties • Need to avoid live-lock • Two types of detection mechanisms needed

  44. Future Work: The QoS Challenge • Some distributed applications require QoS • Guaranteed available bandwidth • Bounded delay, bounded jitter • The membership algorithm terminates in one round under certain circumstances • Can we leverage that to guarantee QoS under certain assumptions? • Can other primitives guarantee QoS?
