Quality of Service in Computer Networks Alex Shpiner, Mellanox Technologies High-performance communication, BGU, 2017
What is Quality of Service (QoS)? • Network configuration that aims to provide optimal end-to-end performance for users, according to the class of service of the traffic. • High throughput • Low latency • Fairness • Minimal or no packet loss • Minimal or no jitter • Jitter: variability (stability) of throughput or latency over time • A specific class of service may prioritize one property over the others • High throughput for storage • Low latency for control • Low jitter for real-time audio
QoS Components • Flow control – eliminates packet loss caused by overflowing buffers. • Congestion control – prevents or reacts to congestion, reducing or controlling its effect on overall throughput in the network. • Service differentiation – applies differential handling to various traffic classes according to service priority.
InfiniBand vs. RoCE (RDMA over Ethernet, or IB over Ethernet) • RDMA is natively implemented as part of InfiniBand • Requires the end-to-end network to support IB (IB switches) • Conventional TCP traffic runs over an Ethernet-based network • Consolidating them is desirable • Requirement: RDMA over Ethernet, running over commodity network infrastructure • Solution: RoCE • RoCE version 2 packet format (commodity Ethernet/IP headers followed by InfiniBand-specific headers): Ethernet | IP | UDP | InfiniBand L4 | Payload
Network congestion [Diagram: three flows, each at 100% of link capacity C, converge on a single output link of capacity C, offering 300% load]
What does perfect congestion control achieve? [Diagram: each of the three flows is throttled to 33%, keeping the bottleneck link at 100% utilization]
Lossy vs. Lossless Network • When a buffer overflows: • In a lossy network: the packet is dropped • In a lossless network: flow control prevents packets from being dropped (see next slides) • High-performance networks are mostly lossless • Lossy networks require an end-to-end transport that is able to detect and retransmit lost data – this takes computation resources and adds protocol overhead. • If the packet-drop rate is not negligible, significant bandwidth may be lost on retransmissions. • To avoid drops, large costly buffers are used, which introduce latency as they fill up.
Flow Control in Ethernet • Link-layer protocol (2nd layer in OSI) • Switch to neighboring switch/NIC; NIC to switch. • When the buffer fills up, the receiver sends a pause frame to the sender. • When the buffer empties, the receiver sends an unpause frame to the sender. • Pausing granularity is per priority • Called Priority Flow Control (PFC) • 8 priorities can be defined
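The pause/unpause behavior above can be sketched as a small model. This is a simplification, not the 802.1Qbb wire protocol: the class name and the XOFF/XON threshold values are illustrative assumptions, and real switches express pause duration in quanta.

```python
class PfcReceiver:
    """Sketch of a PFC receive buffer with XOFF/XON thresholds
    (hypothetical model; threshold values are illustrative)."""

    def __init__(self, xoff=8, xon=4):
        self.xoff = xoff        # occupancy at which a pause frame is sent
        self.xon = xon          # occupancy at which an unpause frame is sent
        self.occupancy = 0
        self.paused = False     # True while upstream is being paused

    def enqueue(self, pkts=1):
        # Arriving packets fill the buffer; crossing XOFF pauses upstream.
        self.occupancy += pkts
        if not self.paused and self.occupancy >= self.xoff:
            self.paused = True  # model: send PAUSE frame to the sender
        return self.paused

    def dequeue(self, pkts=1):
        # Draining the buffer below XON resumes upstream.
        self.occupancy = max(0, self.occupancy - pkts)
        if self.paused and self.occupancy <= self.xon:
            self.paused = False  # model: send unpause frame to the sender
        return self.paused
```

The gap between XOFF and XON provides hysteresis, so the link is not paused and resumed on every single packet.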
Flow Control in InfiniBand • InfiniBand defines a point-to-point credit-based flow-control scheme. • The receiving switch continuously informs its neighbor of the amount of free space in its buffer. • Contrary to Ethernet flow control, where a pause frame is sent only when buffer occupancy crosses a threshold. • A packet is never sent unless there is room for it; this ensures that packet loss can only result from link-transmission errors. [Diagram: transmitter MUX and receiver DeMUX with per-VL buffers (VL0 … VL15); credit packets for a VL flow back over the physical link against data packets on that VL]
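The credit rule above ("never send without room at the receiver") can be sketched as a minimal model of one virtual lane. The class and method names are hypothetical; real IB credits are exchanged in flow-control packets per VL.

```python
class CreditLink:
    """Sketch of credit-based flow control on one virtual lane
    (hypothetical model, not the InfiniBand wire protocol)."""

    def __init__(self, buffer_slots):
        # The receiver advertises its free buffer space as credits.
        self.credits = buffer_slots

    def try_send(self, packet_slots=1):
        # A packet is sent only if the receiver has room for it,
        # so a packet is never dropped for lack of buffer space.
        if self.credits >= packet_slots:
            self.credits -= packet_slots
            return True
        return False            # transmitter must wait for credits

    def on_credit_update(self, freed_slots):
        # The receiver returns credits as it drains its buffer.
        self.credits += freed_slots
```

Contrast with PFC: here the sender stops by default when credits run out, instead of being stopped by an explicit pause frame after the buffer has already filled.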
Why flow control is not enough: Congestion Spreading • PFC stops links (per priority), not specific flows. [Diagram: a flow sharing the paused link is also stopped, since pause control does not distinguish between flows; it gets 33% instead of 67%]
Congestion Spreading – another example [Diagram: hosts A–F transmit through Switch 1 and Switch 2 to the same receiver X (congested flows); host G transmits a victim flow through the same switches to another receiver Y] • What is the throughput of the victim flow?
Congestion Spreading • Congestion might spread over the whole network [Chart: throughput (Gbps) per host. The congested flows H1–H3 each get 40/3 ≈ 13.3 Gbps; the victim flow H'→R' is also limited to 13.3 Gbps instead of the ideal 40 − 13.3 = 26.6 Gbps]
Credit Loop Deadlock • Do not transmit if there is not enough buffer to receive • Forwarding may create a cyclic buffer dependency • This is called a “credit loop” • If the buffers at the head of the loop fill up with packets that stay on the loop, the loop deadlocks [Diagram: four switches in a ring; flows f1–f4 each cross two switches, creating a cyclic dependency between their buffers]
Solutions to credit loop deadlock • Constraints on routing rules over known topologies • Prevent turns that might cause a credit loop • Not defined for a general topology • Emptying the switch buffer • Losslessness is not preserved • And more…
Parking Lot Effect • Parking lot unfairness: in the chart below, the receiving sequence at Recv is A, B, A, C, A, B, A, D, … • Each merge point gives its local flow half of the link, so A gets 1/2 of the bandwidth, B gets 1/4, and C and D get 1/8 each. A, B, A, C, A, B, A, D, A, B, A, C, A, B, A, D, … B, C, B, D, B, C, B, D, … C, D, C, D, … D, D, D, …
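The merged sequences above can be reproduced by chaining 1:1 round-robin merges, one per switch hop. This is a sketch under the assumption that every switch alternates fairly between its local flow and the upstream aggregate; the function names are hypothetical.

```python
from collections import Counter
from itertools import islice, repeat


def merge_rr(local, upstream):
    """One switch: alternate 1:1 between the locally attached flow
    and the aggregate stream arriving from upstream."""
    for pkt in upstream:
        yield local
        yield pkt


def parking_lot(flows_far_to_near):
    """Chain the merges, farthest host first (e.g. D, C, B, A)."""
    stream = repeat(flows_far_to_near[0])
    for flow in flows_far_to_near[1:]:
        stream = merge_rr(flow, stream)
    return stream


# The receive sequence at Recv, as on the slide:
seq = list(islice(parking_lot(["D", "C", "B", "A"]), 8))
# -> ["A", "B", "A", "C", "A", "B", "A", "D"]
shares = Counter(islice(parking_lot(["D", "C", "B", "A"]), 800))
```

Counting a long prefix confirms the geometric shares: A receives 1/2 of the packets, B 1/4, and C and D 1/8 each, even though all four flows are "equal".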
Congestion Control • Congestion control prevents or reacts to congestion, reducing its effect on overall network performance. • Kicks in when arriving traffic exceeds the output link capacity at some point in the network • A buffer is used to absorb the excess traffic • Congestion control aims to reduce buffer usage while: • preserving bottleneck link utilization • preserving fairness • Works at end-to-end flow granularity • Throttles the rate of specific flow(s), hence does not create victims • Contrary to link-level flow control, which stops all the traffic on the link (per priority)
Congestion Control Design Alternatives • How to identify congestion: • Packet drop • Cons: long queues, retransmissions • Delay • Cons: backward delay, timestamping, unfair stable point • ECN (Explicit Congestion Notification) (next slides) • Cons: requires switch support • Acknowledgements (ACKs) • Cons: false notification in case of reordering • Timeout (used with ACKs) • Control parameter: • Rate (inter-packet delay) • Cannot provide a bound on the number of packets in the network • Window (number of packets in flight) • Rate ≈ window [pkts] / round-trip time [sec] • RTT-unfair
TCP Basics (New Reno) • Window-based algorithm • Keeps the number of un-acked packets < CWND • ACK-based algorithm • TCP uses ACKs to notify the sender about successful packet arrival • ACK X acknowledges arrival of packets 0…X−1 (cumulative ACK) • CWND fluctuates based on ACK arrivals • Rate increase • Upon ACK arrival: CWND += 1/CWND • (Slow start: CWND += 1) • Rate decrease • 3 duplicate ACKs: CWND = CWND/2 • Timeout: CWND = 1 MSS
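The window updates on this slide can be sketched as a small state machine. This is a minimal model in units of MSS, assuming the ssthresh handling of standard TCP; it omits fast recovery and all packet handling, and the class name is hypothetical.

```python
class NewRenoCwnd:
    """Sketch of New Reno congestion-window arithmetic (units: MSS).
    Simplified model; not a full TCP implementation."""

    def __init__(self, ssthresh=64.0):
        self.cwnd = 1.0
        self.ssthresh = ssthresh

    def on_ack(self):
        if self.cwnd < self.ssthresh:
            self.cwnd += 1.0              # slow start: +1 MSS per ACK
        else:
            self.cwnd += 1.0 / self.cwnd  # congestion avoidance: ~+1 MSS per RTT

    def on_triple_dup_ack(self):
        # Fast retransmit signal: halve the window.
        self.cwnd = max(self.cwnd / 2, 1.0)
        self.ssthresh = self.cwnd

    def on_timeout(self):
        # Severe congestion signal: restart from one segment.
        self.ssthresh = max(self.cwnd / 2, 2.0)
        self.cwnd = 1.0
```

Note the asymmetry: growth is additive (per RTT, once past ssthresh) while decrease is multiplicative, which is what makes the classic AIMD sawtooth.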
Explicit Congestion Notification (ECN) • Switch-based enhancement that is used by end hosts • Allows end-to-end congestion notification without dropping packets • Supported by most advanced data-center switches • Uses two bits in the IP header (in the Diffserv field): • 00 – not ECN-capable • 01/10 – ECN-capable • 11 – congestion encountered • Upon congestion, the switch changes 01/10 to 11 • ECN marking follows a probabilistic function of queue length • Longer queue => higher marking probability
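A common shape for the probabilistic marking function is the RED-style linear ramp between two queue thresholds. A minimal sketch, assuming linear marking; the threshold and probability constants are illustrative, and real switches expose them as configuration.

```python
import random


def mark_probability(qlen, k_min=20, k_max=80, p_max=0.1):
    """Marking probability as a linear ramp in queue length
    (RED-style sketch; constants are illustrative assumptions)."""
    if qlen <= k_min:
        return 0.0                 # short queue: never mark
    if qlen >= k_max:
        return 1.0                 # long queue: always mark
    return p_max * (qlen - k_min) / (k_max - k_min)


def maybe_mark_ecn(ip_ecn_bits, qlen, rng=random.random):
    """Switch-side marking: 0b00 = not ECN-capable,
    0b01/0b10 = ECN-capable, 0b11 = congestion encountered."""
    if ip_ecn_bits in (0b01, 0b10) and rng() < mark_probability(qlen):
        return 0b11                # rewrite ECT to CE
    return ip_ecn_bits             # non-ECN traffic is left untouched
```

Non-ECN-capable packets (00) pass through unmarked; in a real lossy switch they would instead be candidates for early drop.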
DCTCP – Data Center TCP • Suggested and implemented by Microsoft • Uses ECN marking • Smooths the rate and queue usage: • α is a moving average of the fraction of ECN-marked packets received in the last window: α ← (1 − g)·α + g·F • F = marked packets / total packets in the last window • g = moving-average (EWMA) gain parameter • The window is reduced in proportion to α: CWND ← CWND·(1 − α/2)
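The two DCTCP update rules can be sketched directly. This models only the window arithmetic from the slide, not a full TCP stack; the class name and the default g value are assumptions (the DCTCP paper suggests g = 1/16).

```python
class DctcpSender:
    """Sketch of DCTCP's per-window update: EWMA of the marked
    fraction, then a proportional window cut. Model only."""

    def __init__(self, cwnd=10.0, g=1.0 / 16):
        self.cwnd = cwnd
        self.g = g              # EWMA gain for the alpha estimate
        self.alpha = 0.0        # estimated extent of congestion

    def on_window_acked(self, total_pkts, marked_pkts):
        # F: fraction of ECN-marked packets in the last window.
        f = marked_pkts / total_pkts
        # alpha <- (1 - g) * alpha + g * F
        self.alpha = (1 - self.g) * self.alpha + self.g * f
        if marked_pkts:
            # CWND <- CWND * (1 - alpha / 2): cut in proportion to
            # congestion extent, instead of New Reno's fixed halving.
            self.cwnd = max(1.0, self.cwnd * (1 - self.alpha / 2))
```

When every packet is marked, α converges to 1 and the cut approaches New Reno's CWND/2; when few packets are marked, the cut is gentle, which keeps queues short without sacrificing throughput.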
RDMA over Ethernet (RoCEv2) Packet Format • Commodity Ethernet/IP headers followed by the InfiniBand-specific headers + payload: Ethernet | IP | UDP | InfiniBand L4 | Payload • The ECN field in the IP header is used to mark congestion (the same field used for TCP) • In a pure InfiniBand packet, the FECN bit in the BTH (Base Transport Header) is used to mark congestion
IB/RoCE Congestion Control Algorithm: Congestion Point • Congestion Point (switch): marks ECN bits in the packet header based on queue length • Standard functionality supported by all commodity switches; also used for TCP [Diagram: sender NIC (Reaction Point, RP) → switch (Congestion Point) → receiver NIC (Notification Point, NP); congested traffic leaves the switch ECN-marked]
IB/RoCE Congestion Control Algorithm: Notification Point • Notification Point (receiver NIC): when an ECN-marked packet arrives, sends a CNP (Congestion Notification Packet) back to the sender • The CNP identifies a flow (QP)
IB/RoCE Congestion Control Algorithm: Reaction Point • Reaction Point (sender NIC): throttles the sending rate based on the arrival of Congestion Notification Packets (CNPs)
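The reaction point's behavior can be sketched in the style of DCQCN-like rate control: a multiplicative decrease when a CNP arrives, and gradual recovery toward a target rate when no CNPs are seen. This is a rough model under assumed dynamics, not the NIC's actual algorithm or default parameters.

```python
class ReactionPoint:
    """Sketch of a sender-NIC reaction point: cut rate on CNP,
    recover toward the pre-cut target otherwise. Illustrative
    model; constants and update rules are assumptions."""

    def __init__(self, line_rate_gbps=100.0, g=1.0 / 16):
        self.line_rate = line_rate_gbps
        self.rate = line_rate_gbps      # current sending rate
        self.target = line_rate_gbps    # rate to recover toward
        self.alpha = 1.0                # congestion estimate
        self.g = g

    def on_cnp(self):
        # Remember where we were, then cut proportionally to alpha.
        self.target = self.rate
        self.rate *= 1 - self.alpha / 2
        self.alpha = (1 - self.g) * self.alpha + self.g  # congestion seen

    def on_timer_no_cnp(self):
        # No congestion feedback this period: decay the estimate
        # and close half the gap to the target rate.
        self.alpha *= 1 - self.g
        self.rate = min((self.rate + self.target) / 2, self.line_rate)
```

The key contrast with flow control: only the flow (QP) named in the CNP is throttled, so well-behaved flows sharing the same link are not paused.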
Service Differentiation • A specific class of service may prioritize one property over the others • High throughput for storage • Low latency for control • Low jitter for real-time audio • Service differentiation applies differential handling to various traffic classes according to service priority • In-packet classification • Service Level (SL) field in the InfiniBand packet • Parallel queues • In InfiniBand: Virtual Lanes (VLs) • Network devices implement SL-to-VL mapping • Differential service for every queue: • Weighted round robin, strict priority (high or low) • Bounds on buffer utilization
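The weighted-round-robin service mentioned above can be sketched as a byte-budgeted scan over per-class queues. A minimal sketch: the function name, queue layout, and weights are illustrative assumptions (real hardware arbiters work on descriptors, not Python deques).

```python
from collections import deque


def weighted_round_robin(queues, weights, rounds=1):
    """Sketch of byte-weighted round robin.
    queues:  class -> deque of (pkt_id, size_bytes)
    weights: class -> byte budget per round
    Returns the packet ids in transmission order."""
    sent = []
    for _ in range(rounds):
        for cls, quota in weights.items():
            budget = quota
            q = queues[cls]
            # Serve this class until its byte budget for the round
            # is exhausted or its queue is empty.
            while q and budget >= q[0][1]:
                pkt_id, size = q.popleft()
                budget -= size
                sent.append(pkt_id)
    return sent
```

With weights 2000 vs. 1000 bytes, the first class sends roughly twice the bytes of the second per round, which is exactly the proportional-share behavior WRR is meant to give.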
Service Levels in InfiniBand • The SL is a field in the Local Route Header of the packet indicating the service class of the packet, enabling the implementation of differentiated services. • SL-to-VL mapping: every service level is assigned a VL. • While the VL chosen for a given service level may differ along a packet's path (in different switches), the service level remains constant. • E.g., in one switch the packet can have the highest priority; in another switch, the lowest.
High and Low Priority VLs in InfiniBand • InfiniBand defines a two-level hierarchy of VL service: • High priority vs. low priority • Weighted round robin inside each priority level • By assigning a packet to a service level and mapping that service level to a particular virtual lane, packets can be classified with either high or low priority. • High-priority traffic will preempt low-priority traffic… • To ensure forward progress of low-priority packets, a Limit of High Priority (LHP) is defined: the maximum number of high-priority packets that may be scheduled on high-priority lanes before a packet from a low-priority lane is selected. • Arbitration between individual virtual lanes of the same priority is carried out using a weighted fair arbitration scheme. • Each virtual lane is scheduled in table order and assigned a weight indicating the number of bytes it is allowed to transmit during its turn. [Diagram: high- and low-priority lane groups feeding a MUX; output pattern … L H H H L H H H L … under the LHP counter]
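The LHP rule can be sketched as a small arbiter: serve high priority first, but after LHP consecutive high-priority packets, let one low-priority packet through. A simplified sketch; it collapses each priority level to one queue (ignoring the per-VL weighted arbitration within a level), and the function name is hypothetical.

```python
from collections import deque


def dual_priority_arbiter(high, low, lhp=3):
    """Sketch of two-level VL arbitration with a Limit of High
    Priority: high preempts low, but after `lhp` consecutive
    high-priority packets one low-priority packet is scheduled,
    guaranteeing forward progress for low priority."""
    out = []
    high_run = 0  # consecutive high-priority packets sent
    while high or low:
        if high and (high_run < lhp or not low):
            out.append(high.popleft())
            high_run += 1
        else:
            out.append(low.popleft())
            high_run = 0       # low-priority turn resets the counter
    return out
```

With lhp=3 the output interleaves three high-priority packets per low-priority one, matching the L H H H L H H H pattern in the slide's diagram.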