
Acknowledgement: slides include content from Hedera and MP-TCP authors

This topic explores load balancing in cloud data centers beyond ECMP through central scheduling (Hedera) and distributed scheduling (MP-TCP).

  1. CS434/534: Topics in Network Systems. Cloud Data Centers: Load Balancing beyond ECMP through Scheduling: Central Scheduling (Hedera) and Distributed Scheduling (MP-TCP). Yang (Richard) Yang, Computer Science Department, Yale University, 208A Watson. Email: yry@cs.yale.edu, http://zoo.cs.yale.edu/classes/cs434/ • Acknowledgement: slides include content from Hedera and MP-TCP authors

  2. Outline • Admin and recap • Cloud data center (CDC) networks • Background, high-level goal • Traditional CDC vs the one-big switch abstraction • VL2 design and implementation • Load-aware DC load balancing (scheduling) • Central scheduling: Hedera • Distributed scheduling by end hosts: MP-TCP

  3. Admin • PS1 update • Please set up meetings on potential projects

  4. Recap: VL2 L2 Semantics, Design and Implementation • Servers use flat names • Routing uses locators (ToR addresses) • [Figure: the VL2 Directory Service maps flat names (x, y, z) to ToR locators (ToR1-ToR4); hosts look up a name and encapsulate payloads to the returned ToR]

  5. Recap: VL2 Isolation/LB • All Int switches are assigned the same anycast address (IANY) • Hosts add two encapsulation headers • Switches use ECMP to forward traffic • [Figure: encapsulated traffic from x to y and z via an intermediate switch addressed IANY]

  6. Recap: ECMP Load Balancing Effectiveness • VL2 routes traffic using an ECMP hash of the flow 5-tuple (e.g., H(f) % 3 picks one of three uplinks) • Two flows collide when their hashes map to the same link • Collisions are bad for elephant flows • One small extension (Presto): break each flow into flowcells
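
As a quick illustration of the flow-hash idea above, here is a minimal sketch (not VL2's or any switch's actual implementation; the CRC hash and the field encoding are assumptions):

    # Python sketch: ECMP next-hop selection by hashing the flow 5-tuple.
    # Every packet of a flow hashes to the same value, so an elephant flow
    # stays pinned to one uplink regardless of how loaded that uplink is.
    import zlib

    def ecmp_next_hop(src_ip, dst_ip, src_port, dst_port, proto, num_paths):
        five_tuple = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
        return zlib.crc32(five_tuple) % num_paths

    # Two different flows can still collide onto the same uplink:
    print(ecmp_next_hop("10.0.0.1", "10.0.1.1", 5001, 80, 6, num_paths=3))
    print(ecmp_next_hop("10.0.0.2", "10.0.1.2", 5002, 80, 6, num_paths=3))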

  7. Outline • Admin and recap • Cloud data center (CDC) networks • Background, high-level goal • Traditional CDC vs the one-big switch abstraction • VL2 design and implementation • Overview • Topology • Control • layer 2 semantics • ECMP/VLB load balancing/performance isolation • Extension: Presto • Load-aware DC load balancing (scheduling)

  8. Easy-to-Understand but Important Problem • DC networks have many ECMP paths between servers, and a first-cut solution for DC routing is flow-hash-based load balancing (ECMP), but this may be insufficient • It is agnostic to available resources • Long-lasting collisions between long (elephant) flows • The seriousness of the problem depends on the particular topology • More serious if there are limited up/down paths

  9. ECMP Collision Problem in a Simple K-ary Fat Tree • Many equal-cost paths go up to the core switches • But only one path down from each core switch • [Figure: one source S and one destination D]

  10. ECMP Collision Problem in a Simple K-ary Fat Tree • ECMP collisions are possible in two different ways • On the upward path • On the downward path • [Figure: sources S1-S4 and destinations D1-D4]

  11. Impact of Collisions • On average, 61% of bisection bandwidth is wasted on a network of 27K servers • [Figure: same fat-tree topology with sources S1-S4 and destinations D1-D4]

  12. Outline • Admin and recap • Cloud data center (CDC) networks • Background, high-level goal • Traditional CDC vs the one-big switch abstraction • VL2 design and implementation • Load-aware DC load balancing (scheduling) • Central scheduling: Hedera

  13. Discussion • What issues would you need to address if you introduced a central scheduler to solve the ECMP collision problem?

  14. Hedera Architecture • Detect large flows • Flows that need bandwidth but are network-limited • Estimate flow demands • Compute the demands large flows would have if there were no network limit • Place flows • Use the estimated demands to heuristically find a better placement of large flows on the ECMP paths • [Pipeline: Detect Large Flows -> Estimate Flow Demands -> Place Flows]

  15. Elephant Detection • The scheduler continually polls edge switches for per-flow byte counts • Flows exceeding a rate threshold (B/s) are “large” • > 10% of a host's link capacity (i.e., > 100 Mbps on 1 Gbps links) • What if a host carries only mice? • Default ECMP load balancing is efficient for small flows
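
A minimal sketch of this detection step (the polling interval, the data structures, and the 1 Gbps link assumption are mine, not Hedera's code):

    # Python sketch: classify flows as elephants from periodic byte-count polls.
    LINK_CAPACITY_BPS = 1e9                    # assume 1 Gbps host links
    THRESHOLD_BPS = 0.1 * LINK_CAPACITY_BPS    # "large" = > 10% of link capacity
    POLL_INTERVAL_S = 5.0                      # assumed polling period

    def detect_elephants(prev_bytes, curr_bytes):
        """prev_bytes / curr_bytes: dicts mapping a flow 5-tuple -> cumulative bytes."""
        elephants = []
        for flow, now in curr_bytes.items():
            rate_bps = 8 * (now - prev_bytes.get(flow, 0)) / POLL_INTERVAL_S
            if rate_bps > THRESHOLD_BPS:
                elephants.append(flow)
        return elephants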

  16. Demand Estimation: Problem • Flows can be constrained by the network while they are being identified • The measured flow rate can therefore be misleading • Need to find a flow's “natural” bandwidth requirement when it is not limited by the network • Q: what are the measured rates of the flows (A->X, A->Y; B->Y, C->Y)? • [Figure: hosts A, B, C sending to X, Y, with links that also carry 3 other flows]

  17. Demand Estimation • Hedera solution: assume no network limit and allocate capacity among flows using max-min fairness • Given the traffic matrix of large flows, iteratively modify each flow's demand at its source and destination: • Each sender equally distributes its bandwidth among outgoing flows that are not receiver-limited • Each receiver whose capacity is exceeded decreases the excess equally among its incoming flows • Repeat until all flows converge • Guaranteed to converge in O(|F|) time
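
A simplified sketch of the iteration described above (it captures the sender and receiver passes but is not the paper's exact pseudocode; link capacities are normalized to 1.0):

    # Python sketch: iterative demand estimation for large flows.
    def estimate_demands(flow_pairs):
        # flow_pairs: list of (src, dst) host pairs, one entry per large flow
        demand = {i: 0.0 for i in range(len(flow_pairs))}
        rl = {i: False for i in range(len(flow_pairs))}   # receiver-limited?
        for _ in range(100):                               # bounded number of passes
            old = dict(demand)
            # Sender pass: each source splits its remaining capacity equally
            # among its outgoing flows that are not receiver-limited.
            for src in {s for s, _ in flow_pairs}:
                out = [i for i, (s, _) in enumerate(flow_pairs) if s == src]
                free = 1.0 - sum(demand[i] for i in out if rl[i])
                open_flows = [i for i in out if not rl[i]]
                for i in open_flows:
                    demand[i] = free / len(open_flows)
            # Receiver pass: an oversubscribed destination caps each incoming
            # flow at an equal share and marks it receiver-limited.
            for dst in {d for _, d in flow_pairs}:
                inn = [i for i, (_, d) in enumerate(flow_pairs) if d == dst]
                if sum(demand[i] for i in inn) > 1.0:
                    share = 1.0 / len(inn)
                    for i in inn:
                        if demand[i] >= share:
                            demand[i], rl[i] = share, True
            if demand == old:
                break
        return demand

    # The example from the previous slide: A sends to X and Y; B and C send to Y.
    print(estimate_demands([("A", "X"), ("A", "Y"), ("B", "Y"), ("C", "Y")]))
    # -> roughly: A->X gets 2/3, while A->Y, B->Y, C->Y each get 1/3 (per this sketch)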

  18. Hedera Demand Estimation • [Figure: example, sender step, with hosts A, B, C and receivers X, Y]

  19. Hedera Demand Estimation • [Figure: example, receiver step, with hosts A, B, C and receivers X, Y]

  20. Hedera Demand Estimation • [Figure: example, second sender step]

  21. Hedera Demand Estimation • [Figure: example, second receiver step]

  22. Flow Placement • Find a good allocation of paths for the set of large flows, such that the average bisection bandwidth of the flows is maximized • Discussion: any placement algorithm from 434/534 that applies?

  23. Global First Fit • When a new large flow is detected, linearly search all possible paths from S to D • Place the flow on the first path whose component links can fit it
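
A minimal sketch of this rule (the path and reservation data structures are assumptions for illustration):

    # Python sketch: Global First Fit placement of a newly detected large flow.
    # paths[(s, d)] lists candidate paths, each a list of link IDs;
    # reserved[link] is the bandwidth already committed (capacities = 1.0).
    def global_first_fit(flow, paths, reserved):
        src, dst, demand = flow
        for path in paths[(src, dst)]:
            if all(reserved.get(link, 0.0) + demand <= 1.0 for link in path):
                for link in path:                  # reserve capacity on every hop
                    reserved[link] = reserved.get(link, 0.0) + demand
                return path                        # first path that fits wins
        return None                                # no candidate path fits right now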

  24. Global First Fit: Example • [Figure: scheduler placing Flows A, B, C onto core switches 0-3]

  25. Global First Fit Effectiveness • The first-fit heuristic for bin packing has an approximation factor of 2 • https://en.wikipedia.org/wiki/Bin_packing_problem#First-fit_algorithm

  26. Simulated Annealing • An approach for solving problems with local minima • Annealing: slowly cooling metal to give it nice properties like ductility, homogeneity, etc. • Heating puts the material into a high-energy state (shakes things up) • Slow cooling lets the crystalline structure settle into a low-energy state (gradient descent) • [Figure: an energy landscape f(x) with local minima]

  27. Simulated Annealing • Four things to specify: • State space • Neighboring states • Energy • Temperature

  28. Simulated Annealing Structure
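
A minimal sketch of that generic structure, which the next two slides specialize for flow placement (the Metropolis-style acceptance rule and the linear cooling schedule here are assumptions about what the slide shows):

    # Python sketch: generic simulated annealing skeleton. For flow placement,
    # `state` would be a mapping of flows to paths, `neighbor` would swap two
    # assignments, and `energy` would be the total exceeded link capacity.
    import math, random

    def simulated_annealing(init_state, neighbor, energy, iterations=1000):
        state, e = init_state, energy(init_state)
        best, best_e = state, e
        for i in range(iterations):
            temp = 1.0 - i / iterations           # temperature ~ iterations left
            cand = neighbor(state)
            cand_e = energy(cand)
            # Always accept improvements; accept worse states with a
            # probability that shrinks as the temperature drops.
            if cand_e < e or random.random() < math.exp((e - cand_e) / max(temp, 1e-9)):
                state, e = cand, cand_e
            if e < best_e:
                best, best_e = state, e
        return best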

  29. Flow Placement using Simulated Annealing • State: all possible mappings of flows to paths • Constrained to reduce the state-space size: each destination is assigned to a single core switch (for a k-ary fat tree, compare the number of hosts vs. core switches) • Neighbor state: swap paths between 2 hosts • Within the same pod • Within the same ToR • etc.

  30. Simulated Annealing • Energy function: total exceeded bandwidth capacity • Computed using the estimated demands of the flows • Goal: minimize the exceeded capacity • Temperature: iterations left • Fixed number of iterations (1000s)

  31. Simulated Annealing • Example run: 3 flows, 3 iterations • [Figure: scheduler assigning Flows A, B, C to core switches 0-3 over three iterations]

  32. Simulated Annealing • The final state is published to the switches and used as the initial state for the next round • [Figure: final assignment of Flows A, B, C to core switches]

  33. Evaluation

  34. Evaluation: Data Shuffle • 16 hosts: 120 GB all-to-all in-memory shuffle • Hedera achieves 39% better bisection bandwidth than ECMP and 88% of an ideal non-blocking switch

  35. Reactiveness • Demand estimation: 27K hosts, 250K flows, converges in < 200 ms • Simulated annealing: asymptotically dependent on the number of flows + number of iterations • 50K flows and 10K iterations: 11 ms • Most of the final bisection bandwidth is reached within the first few hundred iterations • Scheduler control loop: polling + estimation + SA = 145 ms for 27K hosts

  36. Limitations • Dynamic workloads: when large-flow turnover is faster than the control loop, the scheduler is continually chasing the traffic matrix • Need to include a penalty term for unnecessary SA flow re-assignments

  37. Discussion • What do you like about the Hedera design? • What do you not like about the Hedera design?

  38. Outline • Admin and recap • Cloud data center (CDC) networks • Background, high-level goal • Traditional CDC vs the one-big switch abstraction • VL2 design and implementation • Load-aware DC load balancing (scheduling) • Central scheduling: Hedera • Distributed scheduling by end hosts: MP-TCP

  39. Multi-Path TCP: Basic Idea • Instead of a central scheduler, end hosts compute flow rates in a distributed manner • Discussion: benefits of end-host control? • [Figure: two separate paths behaving, logically, as a single resource pool]

  40. End Host Control

  41. MPTCP Beyond DC: Multihomed Web Server • [Figure: a multihomed web server with two 100 Mb/s links; 2 TCP flows at 50 Mb/s each on one link, 4 TCP flows at 25 Mb/s each on the other]

  42. MPTCP Beyond DC: WiFi + Cellular • [Figure: a mobile device using a WiFi path and a cellular path simultaneously]

  43. Outline • Admin and recap • Cloud data center (CDC) networks • Background, high-level goal • Traditional CDC vs the one-big switch abstraction • VL2 design and implementation • Load-aware DC load balancing (scheduling) • Central scheduling: Hedera • Distributed scheduling by end hosts: MP-TCP • Motivation • Background: TCP CC (resource allocation)

  44. Congestion Control (CC): Overview • High-level: an algorithm at the end hosts to control the transmission rates • [Figure: flows 1 and 2 sharing a topology of six routers with link capacities of 5-20 Mbps]

  45. The Desired Properties of a Congestion Control Scheme • Efficiency: close to full utilization but low delay • fast convergence after disturbance • Fairness (resource sharing) • Distributedness (no central knowledge for scalability)

  46. Simple Model for TCP CC Design • n users send at rates x1, ..., xn into a shared resource with target load Xgoal • The network returns a binary congestion signal d: d = 1 if sum of xi > Xgoal, else d = 0 • Flows observe the congestion signal d and locally take actions to adjust their rates

  47. Linear Control • Proposed by Chiu and Jain (1988) • Considers the simplest class of control strategy: the next rate is a linear function of the current rate and the congestion signal • Discussion: what values should the parameters take?
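
Written out explicitly (the standard form from Chiu and Jain, which the slide presumably displays), each user i applies a linear update to its rate, with one set of coefficients for increase and another for decrease:

    xi(t+1) = aI + bI * xi(t)   if d = 0 (no congestion: increase)
    xi(t+1) = aD + bD * xi(t)   if d = 1 (congestion: decrease)

The discussion question is then: which choices of (aI, bI, aD, bD) move the system toward both efficiency and fairness?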

  48. State Space of Two Flows • [Figure: the (x1, x2) plane with the fairness line x1 = x2 and the efficiency line x1 + x2 = C; points above the efficiency line are overload, points below are underload; the system starts at x(0)]

  49. Distributed Linear Rule • [Figure: vector diagram showing how linear updates from a state x0 move the system with respect to the efficiency and fairness lines and their intersection]

  50. Implication: Congestion (Overload) Case • To move closer to both efficiency and fairness after each update, the rate decrease must be a multiplicative decrease (MD) • aD = 0 • bD < 1
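
To see these parameter choices in action, here is a small two-flow simulation combining the multiplicative decrease above with the standard additive-increase rule for the underload case (aI > 0, bI = 1); the specific constants (C = 100, aI = 1, bD = 0.5) are illustrative assumptions:

    # Python sketch: two flows under AIMD converge toward the intersection of
    # the efficiency line (x1 + x2 = C) and the fairness line (x1 = x2).
    C = 100.0              # assumed bottleneck capacity (Xgoal)
    a_I = 1.0              # additive increase step (aI > 0, bI = 1)
    b_D = 0.5              # multiplicative decrease factor (aD = 0, bD < 1)

    x1, x2 = 80.0, 10.0    # start far from the fairness line
    for t in range(200):
        if x1 + x2 > C:                    # congestion signal d = 1
            x1, x2 = b_D * x1, b_D * x2    # MD shrinks the gap between the rates
        else:                              # d = 0
            x1, x2 = x1 + a_I, x2 + a_I    # AI leaves the gap unchanged
    print(round(x1, 1), round(x2, 1))      # the two rates end up nearly equal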
