This topic explores load balancing in cloud data centers beyond ECMP through central scheduling (Hedera) and distributed scheduling (MP-TCP).
CS434/534: Topics in Network Systems
Cloud Data Centers: Load Balancing beyond ECMP through Scheduling: Central Scheduling (Hedera) and Distributed Scheduling (MP-TCP)
Yang (Richard) Yang, Computer Science Department, Yale University, 208A Watson
Email: yry@cs.yale.edu
http://zoo.cs.yale.edu/classes/cs434/
Acknowledgement: slides include content from the Hedera and MP-TCP authors
Outline • Admin and recap • Cloud data center (CDC) networks • Background, high-level goal • Traditional CDC vs the one-big switch abstraction • VL2 design and implementation • Load-aware DC load balancing (scheduling) • Central scheduling: Hedera • Distributed scheduling by end hosts: MP-TCP
Admin • PS1 update • Please set up meetings on potential projects
Recap: VL2 L2 Semantics, Design and Implementation • (Figure: the VL2 directory service maps flat server names x, y, z to ToR locators; the sender looks up the destination's ToR and encapsulates the payload toward it.) • Servers use flat names • Routing uses locators (ToR addresses)
Recap: VL2 Isolation/LB • (Figure: a host encapsulates each packet first to the intermediate-switch anycast address IANY and then to the destination ToR.) • All Int switches are assigned the same anycast address • Hosts add two encapsulation headers • Switches use ECMP to forward traffic
Recap: ECMP Load Balancing Effectiveness • VL2 routes traffic using an ECMP hash of the 5-tuple, e.g., choosing uplink H(f) % 3 when there are 3 equal-cost paths • Collisions happen when two flows hash to the same path • Collisions are especially bad for elephant flows • One small extension (Presto): break each flow into flowcells
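To make the collision mechanism concrete, here is a minimal Python sketch of flow-hash ECMP path selection. The hash function and field layout are illustrative, not any particular switch's implementation.

```python
# Minimal sketch of flow-hash-based ECMP path selection (illustrative, not a
# vendor hash): the 5-tuple is hashed once, so every packet of a flow --
# mouse or elephant -- follows the same uplink regardless of load.
import hashlib

def ecmp_path(src_ip, dst_ip, src_port, dst_port, proto, num_paths):
    """Pick one of num_paths equal-cost uplinks for this flow."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_paths

# Two elephant flows from different hosts can still collide on the same uplink:
f1 = ecmp_path("10.0.1.2", "10.0.3.2", 40001, 80, 6, num_paths=4)
f2 = ecmp_path("10.0.1.3", "10.0.3.4", 40002, 80, 6, num_paths=4)
print(f1, f2)  # if equal, both elephants share one path, halving their rates
```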
Outline • Admin and recap • Cloud data center (CDC) networks • Background, high-level goal • Traditional CDC vs the one-big switch abstraction • VL2 design and implementation • Overview • Topology • Control • layer 2 semantics • ECMP/VLB load balancing/performance isolation • Extension: Presto • Load-aware DC load balancing (scheduling)
An Easy-to-Understand but Important Problem • DC networks have many ECMP paths between servers, and a first-cut solution for DC routing is flow-hash-based load balancing (ECMP), but this may be insufficient • It is agnostic to available resources • Long-lasting collisions can occur between long (elephant) flows • The seriousness of the problem depends on the particular topology • It is more serious when there are limited up/down paths
ECMP Collision Problem in a Simple K-ary Topology • Many equal-cost paths go up to the core switches • But there is only one path down from each core switch to a given destination • (Figure: source S and destination D.)
ECMP Collision Problem in a Simple K-ary Topology • ECMP collisions are possible in two different ways: on the upward path and on the downward path • (Figure: sender/receiver pairs S1-S4 and D1-D4 whose flows collide on shared links.)
Impacts of Collisions • An average of 61% of bisection bandwidth is wasted on a network of 27K servers
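The scale of the loss can be seen with a toy Monte Carlo model (this is an illustration only, not the paper's simulation setup): hash unit-demand elephant flows uniformly onto a set of equal-cost core links and measure how much bisection bandwidth is lost to sharing.

```python
# Toy model of ECMP collisions: N unit-demand elephant flows are hashed
# uniformly onto K unit-capacity core links; colliding flows share a link.
import random
from collections import Counter

def wasted_fraction(num_flows, num_paths, trials=1000):
    lost = 0.0
    for _ in range(trials):
        # Hash each flow onto a random core link (uniform, independent).
        paths = Counter(random.randrange(num_paths) for _ in range(num_flows))
        # Each used link delivers 1 unit in total, shared by its colliding
        # flows, so achieved bisection bandwidth = number of distinct links used.
        achieved = len(paths)
        lost += 1.0 - achieved / min(num_flows, num_paths)
    return lost / trials

print(round(wasted_fraction(num_flows=16, num_paths=16), 2))
# Roughly a third of the bisection bandwidth is lost even in this small case.
```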
Outline • Admin and recap • Cloud data center (CDC) networks • Background, high-level goal • Traditional CDC vs the one-big switch abstraction • VL2 design and implementation • Load-aware DC load balancing (scheduling) • Central scheduling: Hedera
Discussion • What issues must be addressed if you introduce a central scheduler?
Hedera Architecture • Detect large flows • Flows that need bandwidth but are network-limited • Estimate flow demands • Compute the demands of large flows as if there were no network limit • Place flows • Use the estimated demands to heuristically find a better placement of large flows on the ECMP paths • (Pipeline: Detect Large Flows → Estimate Flow Demands → Place Flows)
Elephant Detection • The scheduler continually polls edge switches for per-flow byte counts • Flows exceeding a rate threshold are deemed "large" • > 10% of the host's link capacity (i.e., > 100 Mbps on a 1 Gbps link) • What if a host carries only mice? • Default ECMP load balancing is efficient for small flows
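A minimal sketch of this detection step, assuming the scheduler receives periodic cumulative byte counts per flow from the edge switches (the polling interface shown here is hypothetical, not a real switch API):

```python
# Threshold-based elephant detection from polled byte counters.
LINK_CAPACITY_BPS = 1_000_000_000          # 1 Gbps host link (assumed)
THRESHOLD_BPS = 0.10 * LINK_CAPACITY_BPS   # "large" = > 10% of link capacity

def detect_elephants(prev_bytes, curr_bytes, interval_s):
    """prev/curr map a flow's 5-tuple -> cumulative byte count at the edge switch."""
    elephants = []
    for flow, now in curr_bytes.items():
        rate_bps = 8 * (now - prev_bytes.get(flow, 0)) / interval_s
        if rate_bps > THRESHOLD_BPS:
            elephants.append(flow)
    return elephants

# Example poll, 5 s apart: flow "A" sent 125 MB (200 Mb/s), flow "B" only 1 MB.
prev = {"A": 0, "B": 0}
curr = {"A": 125_000_000, "B": 1_000_000}
print(detect_elephants(prev, curr, interval_s=5.0))  # ['A']
```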
Demand Estimation: The Problem • Flows can be constrained by the network at the moment they are identified • The measured flow rate can therefore be misleading • We need a flow's "natural" bandwidth requirement when it is not limited by the network • Q: what are the measured rates of the flows (A->X, A->Y; B->Y, C->Y)? • (Figure: hosts A, B, C and receivers X, Y; some links also carry 3 other flows.)
Demand Estimation • Hedera's solution: assume no network limit and allocate capacity among flows using max-min fairness • Given the traffic matrix of large flows, modify each flow's size at its source and destination iteratively (a sketch follows the worked example below): • A sender equally distributes its bandwidth among outgoing flows that are not receiver-limited • A network-limited receiver decreases the exceeded capacity equally among its incoming flows • Repeat until all flows converge • Guaranteed to converge in O(|F|) time
Hedera Demand Estimation: Worked Example • (Figures: the estimator alternates sender and receiver iterations on the flows among hosts A, B, C and receivers X, Y until the estimated demands converge.)
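A simplified Python sketch of the estimator for the slide's example is below. It normalizes host capacity to 1.0 and, unlike the paper's exact pseudocode, does not redistribute slack left by small receiver-limited flows; it is only meant to show the alternating sender/receiver iterations.

```python
# Simplified sketch of Hedera-style max-min fair demand estimation.
# "flow_pairs" is a list of (src, dst) pairs; capacities are normalized to 1.0.
def estimate_demands(flow_pairs, iters=20):
    demand = {f: 0.0 for f in flow_pairs}
    limited = {f: False for f in flow_pairs}   # receiver-limited flows
    senders = {s for s, _ in flow_pairs}
    receivers = {d for _, d in flow_pairs}
    for _ in range(iters):
        # Sender step: split leftover capacity equally among non-limited flows.
        for s in senders:
            out = [f for f in flow_pairs if f[0] == s]
            fixed = sum(demand[f] for f in out if limited[f])
            free = [f for f in out if not limited[f]]
            if not free:
                continue
            for f in free:
                demand[f] = max(0.0, 1.0 - fixed) / len(free)
        # Receiver step: if a receiver is oversubscribed, cap incoming flows
        # at their equal share and mark them receiver-limited.
        for r in receivers:
            inc = [f for f in flow_pairs if f[1] == r]
            if sum(demand[f] for f in inc) > 1.0:
                share = 1.0 / len(inc)
                for f in inc:
                    if demand[f] > share:
                        demand[f] = share
                        limited[f] = True
    return demand

# Slide's example: A sends to X and Y; B and C each send to Y.
flows = [("A", "X"), ("A", "Y"), ("B", "Y"), ("C", "Y")]
print(estimate_demands(flows))
# A->Y, B->Y, C->Y converge to ~1/3 each; A->X picks up the rest (~2/3).
```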
Flow Placement • Find a good allocation of paths for the set of large flows, such that the average bisection bandwidth of the flows is maximized • Discussion: Any 434/534 placement algorithm?
Global First Fit • When a new flow is detected, linearly search all possible paths from S to D • Place the flow on the first path whose component links can all fit that flow
Global First Fit: Example • (Figure: the scheduler considers candidate core paths 0-3 for flows A, B, and C and places each flow on the first path with enough spare capacity.)
Global First Fit Effectiveness • First fit of bin packing has an approximation factor of 2 • https://en.wikipedia.org/wiki/Bin_packing_problem#First-fit_algorithm
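For concreteness, here is a small sketch of Global First Fit under simplified assumptions: candidate paths are given explicitly as lists of link ids and every link has unit capacity; the topology is illustrative, not Hedera's path enumeration.

```python
# Global First Fit: scan a flow's candidate paths in order and reserve
# the first one where every link still has enough spare capacity.
def global_first_fit(flows, paths_of, link_capacity=1.0):
    """flows: dict flow -> demand; paths_of: flow -> list of candidate paths,
    each path being a tuple of link ids."""
    reserved = {}                      # link id -> reserved bandwidth
    placement = {}
    for flow, demand in flows.items():
        for path in paths_of(flow):
            if all(reserved.get(link, 0.0) + demand <= link_capacity
                   for link in path):
                for link in path:
                    reserved[link] = reserved.get(link, 0.0) + demand
                placement[flow] = path
                break                  # first fit: stop at the first path that fits
    return placement

# Toy example: two elephants from the same rack, two candidate core paths.
flows = {"f1": 0.6, "f2": 0.6}
candidates = {"f1": [("up0", "core0", "down0"), ("up1", "core1", "down1")],
              "f2": [("up0", "core0", "down0"), ("up1", "core1", "down1")]}
print(global_first_fit(flows, candidates.get))
# f1 takes the core0 path; f2 no longer fits there and is placed via core1.
```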
Simulated Annealing • An approach to solving optimization problems with local minima • Annealing: slowly cooling a metal to give it desirable properties such as ductility and homogeneity • Heating enters a high-energy state (shakes things up) • Slow cooling lets the crystalline structure settle into a low-energy state (gradient descent) • (Figure: an energy landscape f(x) with local minima.)
Simulated Annealing • 4 specifications • State space • Neighboring states • Energy • Temperature
Flow Placement using Simulated Annealing • State: all possible mappings of flows to paths • Constrained to reduce the state-space size: each destination is assigned to a single core switch (for a k-ary fat tree, compare #hosts vs. #core switches) • Neighbor state: swap the paths assigned to 2 hosts • within the same pod, • within the same ToR, • etc.
Simulated Annealing • Energy function: total exceeded bandwidth capacity • computed using the estimated demands of the flows • goal: minimize the exceeded capacity • Temperature: iterations left • a fixed number of iterations (1000s)
Simulated Annealing: Example • Example run: 3 flows, 3 iterations • (Figure: the scheduler explores core assignments for flows A, B, and C among cores 0-3 across the iterations.)
Simulated Annealing • The final state is published to the switches and used as the initial state for the next round • (Figure: final core assignments for flows A, B, and C.)
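The following is a rough, self-contained sketch of the annealing loop with energy defined as total exceeded core-link capacity. It simplifies the neighbor move to reassigning a single flow's core (rather than swapping two hosts' assignments) and uses an illustrative cooling schedule, so it is a sketch of the technique rather than Hedera's exact algorithm.

```python
# Simulated-annealing flow placement: state = flow -> core assignment,
# energy = total demand above capacity summed over core links.
import math
import random

def exceeded_capacity(assignment, demands, capacity=1.0):
    """Energy: sum of demand above capacity on each core link."""
    load = {}
    for flow, core in assignment.items():
        load[core] = load.get(core, 0.0) + demands[flow]
    return sum(max(0.0, l - capacity) for l in load.values())

def anneal(demands, cores, iters=1000, seed=0):
    rng = random.Random(seed)
    state = {f: rng.choice(cores) for f in demands}      # initial placement
    energy = exceeded_capacity(state, demands)
    best, best_energy = dict(state), energy
    for i in range(iters):
        temp = 1.0 - i / iters                           # "temperature" = iterations left
        flow = rng.choice(list(demands))                 # neighbor: move one flow
        old_core = state[flow]
        state[flow] = rng.choice(cores)
        new_energy = exceeded_capacity(state, demands)
        delta = new_energy - energy
        # Always accept improvements; accept worse moves with prob. exp(-delta/temp).
        if delta <= 0 or rng.random() < math.exp(-delta / max(temp, 1e-6)):
            energy = new_energy
            if energy < best_energy:
                best, best_energy = dict(state), energy
        else:
            state[flow] = old_core                       # reject: undo the move
    return best, best_energy

demands = {"A": 0.6, "B": 0.6, "C": 0.6, "D": 0.6}
print(anneal(demands, cores=["core0", "core1", "core2", "core3"]))
# With 4 cores and 4 flows of 0.6, a good placement has zero exceeded capacity.
```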
Evaluation: Data Shuffle • 16 hosts: 120 GB all-to-all in-memory shuffle • Hedera achieves 39% better bisection bandwidth than ECMP, 88% of an ideal non-blocking switch
Reactiveness • Demand Estimation: • 27K hosts, 250K flows, converges < 200ms • Simulated Annealing: • Asymptotically dependent on # of flows + # iterations • 50K flows and 10K iter: 11ms • Most of final bisection BW: first few hundred iterations • Scheduler control loop: • Polling + Estimation + SA = 145ms for 27K hosts
Limitations • With dynamic workloads, large-flow turnover can be faster than the control loop • The scheduler will be continually chasing the traffic matrix • Need to include a penalty term for unnecessary SA flow re-assignments
Discussion • What do you like about the Hedera design? • What do you not like about the Hedera design?
Outline • Admin and recap • Cloud data center (CDC) networks • Background, high-level goal • Traditional CDC vs the one-big switch abstraction • VL2 design and implementation • Load-aware DC load balancing (scheduling) • Central scheduling: Hedera • Distributed scheduling by end hosts: MP-TCP
Multi-Path TCP: Basic Idea • Instead of a central scheduler, end hosts distributedly compute flow rates • Discussion: what are the benefits of end-host control? • (Figure: two separate paths are logically pooled into a single resource.)
MPTCP Beyond DC: Multihomed Web Server • (Figure: a server with two 100 Mb/s links; one link carries 2 TCP flows at 50 Mb/s each, the other carries 4 TCP flows at 25 Mb/s each.)
MPTCP Beyond DC: WiFi + Cellular • (Figure: a mobile device using both a WiFi path and a cellular path.)
Outline • Admin and recap • Cloud data center (CDC) networks • Background, high-level goal • Traditional CDC vs the one-big switch abstraction • VL2 design and implementation • Load-aware DC load balancing (scheduling) • Central scheduling: Hedera • Distributed scheduling by end hosts: MP-TCP • Motivation • Background: TCP CC (resource allocation)
Congestion Control (CC): Overview • (Figure: two flows traversing a network of six routers over links of 5-20 Mbps.) • High-level: an algorithm at end hosts to control the transmission rates.
The Desired Properties of a Congestion Control Scheme • Efficiency: close to full utilization but low delay • fast convergence after disturbance • Fairness (resource sharing) • Distributedness (no central knowledge for scalability)
Simple Model for TCP CC Design • n users send at rates x1, ..., xn into a shared resource with capacity Xgoal • The network generates a binary congestion signal d: d = 1 if sum_i xi > Xgoal, else d = 0 • Flows observe the congestion signal d and locally take actions to adjust their rates.
Linear Control • Proposed by Chiu and Jain (1988) • Considers the simplest class of control strategy, a linear update driven by the feedback signal d: • increase (d = 0): xi(t+1) = aI + bI * xi(t) • decrease (d = 1): xi(t+1) = aD + bD * xi(t) • Discussion: what values should the parameters aI, bI, aD, bD take?
State Space of Two Flows • (Figure: the (x1, x2) plane with the fairness line x1 = x2 and the efficiency line x1 + x2 = C; points above the efficiency line are overload, points below are underload; the system starts at x(0).)
(Figure: trajectory of the distributed linear rule starting from x0, alternating congestion/efficiency and fairness adjustments toward the intersection of the efficiency and fairness lines.)
Implication: the Congestion (Overload) Case • In order to get closer to both efficiency and fairness after each update, the rate decrease must be a multiplicative decrease (MD): • aD = 0 • bD < 1
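A small simulation, using illustrative parameter values (not values from the slides), shows why AIMD (additive increase aI > 0, bI = 1; multiplicative decrease aD = 0, bD < 1) drives two flows toward both the efficiency line and the fairness line:

```python
# Chiu-Jain linear control for two flows sharing capacity Xgoal = 1.0,
# with AIMD parameters: additive increase aI, multiplicative decrease bD.
def aimd(x1=0.1, x2=0.7, x_goal=1.0, a_i=0.01, b_d=0.5, steps=200):
    trace = []
    for _ in range(steps):
        if x1 + x2 > x_goal:            # congestion signal d = 1: multiplicative decrease
            x1, x2 = b_d * x1, b_d * x2
        else:                           # d = 0: additive increase
            x1, x2 = x1 + a_i, x2 + a_i
        trace.append((x1, x2))
    return trace

final_x1, final_x2 = aimd()[-1]
print(round(final_x1, 3), round(final_x2, 3))
# The rates oscillate around the efficiency line x1 + x2 = 1 while each
# multiplicative decrease shrinks the gap |x1 - x2|, so the system
# converges toward the fairness line x1 = x2.
```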