TCP/IP Masterclass, or: So TCP works … but still the users ask: Where is my throughput?
Richard Hughes-Jones, The University of Manchester
www.hep.man.ac.uk/~rich/ then “Talks”
GEANT2 Network Performance Workshop, 11-12 Jan 200, R. Hughes-Jones Manchester
Layers & IP
The Transport Layer 4: TCP
• TCP (RFC 793, RFC 1122) provides:
• Connection-oriented service over IP
• During setup the two ends agree on details
• Explicit teardown
• Multiple connections allowed
• Reliable end-to-end byte stream delivery over an unreliable network
• It takes care of: lost packets, duplicated packets, out-of-order packets
• TCP provides:
• Data buffering
• Flow control
• Error detection & handling
• Limits network congestion
The TCP Segment Format
[Figure: frame layout: Frame header | IP header | TCP header | Application data | FCS]
[Figure: TCP header, 20 bytes, bit positions 0 to 31: Source port, Destination port; Sequence number; Acknowledgement number; Hlen, Resv, Code, Window; Checksum, Urgent ptr; Options (if any), Padding]
TCP Segment Format – cont.
• Source/Dest port: TCP port numbers to ID applications at both ends of connection
• Sequence number: first byte in segment from sender’s byte stream
• Acknowledgement: identifies the number of the byte the sender of this (ACK) segment expects to receive next
• Code: used to determine segment purpose, e.g. SYN, ACK, FIN, URG
• Window: advertises how much data this station is willing to accept. Can depend on buffer space remaining.
• Options: used for window scaling, SACK, timestamps, maximum segment size etc.
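The field layout above can be illustrated with a short parser. This is a minimal sketch, assuming the classic 20-byte header with six code bits; the names `TcpHeader`, `FLAG_BITS` and `parse_tcp_header` are made up for the example and options are not decoded.

```python
import struct
from typing import FrozenSet, NamedTuple

# Bit masks for the six classic code (flag) bits, held in the low bits of the
# 16-bit Hlen/Resv/Code word.
FLAG_BITS = {"FIN": 0x01, "SYN": 0x02, "RST": 0x04,
             "PSH": 0x08, "ACK": 0x10, "URG": 0x20}


class TcpHeader(NamedTuple):
    src_port: int
    dst_port: int
    seq: int
    ack: int
    header_len: int          # header length in bytes (Hlen * 4)
    flags: FrozenSet[str]
    window: int
    checksum: int
    urgent_ptr: int


def parse_tcp_header(segment: bytes) -> TcpHeader:
    """Decode the fixed 20-byte part of a TCP header (network byte order)."""
    if len(segment) < 20:
        raise ValueError("a TCP header is at least 20 bytes")
    src, dst, seq, ack, hlen_code, window, checksum, urgent = struct.unpack(
        "!HHIIHHHH", segment[:20])
    header_len = (hlen_code >> 12) * 4   # Hlen counts 32-bit words
    flags = frozenset(name for name, bit in FLAG_BITS.items() if hlen_code & bit)
    return TcpHeader(src, dst, seq, ack, header_len, flags,
                     window, checksum, urgent)
```

A header longer than 20 bytes (`header_len > 20`) means options such as window scaling or SACK follow the fixed part.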
TCP – providing reliability
[Figure: sender/receiver time diagram. Segment n (sequence 1024, length 1024) is answered by ACK 2048; segment n+1 (sequence 2048, length 1024) is answered by ACK 3072; each exchange takes one RTT.]
• Positive acknowledgement (ACK) of each received segment
• Sender keeps record of each segment sent
• Sender awaits an ACK – “I am ready to receive byte 2048 and beyond”
• Sender starts timer when it sends segment – so can re-transmit
• Inefficient – sender has to wait
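Why waiting for each ACK is inefficient can be seen with a little arithmetic. The sketch below assumes an idealised stop-and-wait sender that transmits one segment and then idles until its ACK returns; the function name and the example link speed and rtt are illustrative assumptions.

```python
def stop_and_wait_throughput_bps(segment_bytes: int, rtt_s: float,
                                 link_bps: float) -> float:
    """Throughput when only one segment may be outstanding at a time."""
    wire_time_s = segment_bytes * 8 / link_bps   # time to clock the segment out
    return segment_bytes * 8 / (rtt_s + wire_time_s)


if __name__ == "__main__":
    # 1024-byte segments on an assumed 1 Gbit/s link with a 6 ms rtt.
    rate = stop_and_wait_throughput_bps(1024, 0.006, 1e9)
    print(f"{rate / 1e6:.2f} Mbit/s")
```

With these numbers the sender achieves roughly 1.4 Mbit/s on a 1 Gbit/s link, which is why a window of several segments in flight is needed.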
Flow Control: Sender – Congestion Window
[Figure: sender buffer with sliding TCP cwnd. Regions: data sent and ACKed; sent data buffered waiting for ACK; unsent data that may be transmitted immediately; data to be sent, waiting for the window to open (application writes here). A received ACK advances the trailing edge; the receiver’s advertised window advances the leading edge; the sending host advances a marker as data is transmitted.]
• Uses Congestion window, cwnd, a sliding window to control the data flow
• Byte count giving highest byte that can be sent without an ACK
• Transmit buffer size and Advertised Receive buffer size important.
• ACK gives next sequence no to receive AND the available space in the receive buffer
• Timer kept for each packet
Flow Control: Receiver – Lost Data
[Figure: receiver buffer with sliding window. Regions: data given to application (application reads here); ACKed but not given to user; received but not ACKed; lost data. Markers: last ACK given, next byte expected, expected sequence no. The receiver’s advertised window advances the leading edge.]
• If new data is received with a sequence number ≠ next byte expected, a Duplicate ACK is sent with the expected sequence number
How it works: TCP Slowstart
[Figure: CWND vs time. Slow start: exponential increase; congestion avoidance: linear increase; packet loss; timeout; retransmit: slow start again.]
• Probe the network - get a rough estimate of the optimal congestion window size
• The larger the window size, the higher the throughput
• Throughput = Window size / Round-trip Time
• Exponentially increase the congestion window size until a packet is lost
• cwnd initially 1 MTU then increased by 1 MTU for each ACK received
• Send 1st packet get 1 ACK increase cwnd to 2
• Send 2 packets get 2 ACKs increase cwnd to 4
• Time to reach cwnd size W: T_W = RTT*log2(W) (not exactly slow!)
• Rate doubles each RTT
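The two relations on this slide, Throughput = Window / RTT and T_W = RTT*log2(W), can be checked numerically. This is a sketch under the slide's idealised model (cwnd doubles every RTT, no loss, no delayed ACKs); the function names are illustrative.

```python
import math


def throughput_bps(window_bytes: float, rtt_s: float) -> float:
    """Throughput = window size / round-trip time, in bits per second."""
    return window_bytes * 8 / rtt_s


def slow_start_rounds(target_segments: int) -> int:
    """Count RTTs for cwnd to grow from 1 segment to the target by doubling."""
    cwnd, rounds = 1, 0
    while cwnd < target_segments:
        cwnd *= 2
        rounds += 1
    return rounds


def slow_start_time_s(target_segments: int, rtt_s: float) -> float:
    """Closed form from the slide: T_W = RTT * log2(W)."""
    return rtt_s * math.log2(target_segments)


if __name__ == "__main__":
    print(slow_start_rounds(1024))          # RTTs to reach 1024 segments
    print(slow_start_time_s(1024, 0.2))     # seconds at a 200 ms rtt
    print(throughput_bps(65535, 0.1))       # 64 kbyte window, 100 ms rtt
```

The last call illustrates the window limit: an unscaled 64 kbyte window over a 100 ms path cannot exceed about 5 Mbit/s, whatever the link speed.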
How it works: TCP Congestion Avoidance
[Figure: CWND vs time. Slow start: exponential increase; congestion avoidance: linear increase; packet loss; timeout; retransmit: slow start again.]
• Additive increase: starting from the rough estimate, linearly increase the congestion window size to probe for additional available bandwidth
• cwnd increased by 1 segment per rtt
• cwnd increased by 1/cwnd for each ACK – linear increase in rate
• TCP takes packet loss as indication of congestion!
• Multiplicative decrease: cut the congestion window size aggressively if a packet is lost
• Standard TCP reduces cwnd by a factor of 0.5
• Slow start to Congestion Avoidance transition determined by ssthresh
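The per-ACK and per-loss rules above can be written out directly. This is a minimal sketch with cwnd measured in segments; it assumes one ACK per segment and ignores ssthresh, and the function names are illustrative.

```python
def on_ack(cwnd: float, a: float = 1.0) -> float:
    """Additive increase: each ACK adds a/cwnd, about a segments per RTT."""
    return cwnd + a / cwnd


def on_loss(cwnd: float, b: float = 0.5) -> float:
    """Multiplicative decrease: remove the fraction b of the window."""
    return cwnd - b * cwnd


def one_rtt_without_loss(cwnd: float) -> float:
    """Apply the per-ACK rule once for every segment in the current window."""
    for _ in range(int(cwnd)):
        cwnd = on_ack(cwnd)
    return cwnd


if __name__ == "__main__":
    print(one_rtt_without_loss(100.0))   # just under 101 segments
    print(on_loss(100.0))                # 50.0 segments
```

One loss-free round trip moves a 100-segment window to just under 101 segments, while a single loss takes it straight to 50.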
TCP Fast Retransmit & Recovery
• Duplicate ACKs are due to lost segments or segments out of order.
• Fast Retransmit: If the receiver transmits 3 duplicate ACKs (i.e. it received 3 additional segments without getting the one expected)
• Sender re-transmits the missing segment
• Set ssthresh to 0.5*cwnd – so enter congestion avoidance phase
• Set cwnd = (0.5*cwnd + 3) – the 3 dup ACKs
• Increase cwnd by 1 segment when get duplicate ACKs
• Keep sending new data if allowed by cwnd
• Set cwnd to half original value on new ACK
• No need to go into “slow start” again
• At the steady state, cwnd oscillates around the optimal window size
• With a retransmission timeout, slow start is triggered again
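The bullet points above describe a small state machine, which can be sketched as follows. This is an illustration of the slide's steps only, with cwnd in segments; the class `RenoSender` and its method names are invented for the example, and window growth outside recovery is deliberately not modelled.

```python
from dataclasses import dataclass


@dataclass
class RenoSender:
    cwnd: float                      # congestion window, in segments
    ssthresh: float = float("inf")
    dup_acks: int = 0
    in_recovery: bool = False

    def on_dup_ack(self) -> bool:
        """Handle a duplicate ACK; return True if the missing segment
        should be retransmitted now."""
        self.dup_acks += 1
        if self.in_recovery:
            self.cwnd += 1           # each further dup ACK inflates cwnd by 1
            return False
        if self.dup_acks == 3:
            self.ssthresh = 0.5 * self.cwnd
            self.cwnd = self.ssthresh + 3   # half the window plus the 3 dup ACKs
            self.in_recovery = True
            return True
        return False

    def on_new_ack(self) -> None:
        """A new ACK ends recovery: cwnd drops to half its original value."""
        if self.in_recovery:
            self.cwnd = self.ssthresh
            self.in_recovery = False
        self.dup_acks = 0

    def on_timeout(self) -> None:
        """A retransmission timeout triggers slow start again."""
        self.ssthresh = 0.5 * self.cwnd
        self.cwnd = 1.0
        self.dup_acks = 0
        self.in_recovery = False
```

Starting from `RenoSender(cwnd=20.0)`, the third duplicate ACK requests a retransmit and leaves cwnd at 13; the next new ACK settles it at 10 without returning to slow start.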
TCP: Simple Tuning - Filling the Pipe
[Figure: sender/receiver time diagram showing RTT, ACK, and segment time on wire = bits in segment / BW]
• Remember, TCP has to hold a copy of data in flight
• Optimal (TCP buffer) window size depends on:
• Bandwidth end to end, i.e. min(BW_links), AKA bottleneck bandwidth
• Round Trip Time (RTT)
• The number of bytes in flight to fill the entire path:
• Bandwidth*Delay Product BDP = RTT*BW
• Tuning the window can increase throughput by orders of magnitude
• Windows also used for flow control
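The buffer size implied by BDP = RTT*BW is easy to compute. A minimal sketch, assuming link speeds are given in bit/s and the result is wanted in bytes; the function names and the link list in the demo are illustrative.

```python
from typing import Iterable


def bottleneck_bps(link_speeds_bps: Iterable[float]) -> float:
    """End-to-end bandwidth is the slowest link on the path: min(BW_links)."""
    return min(link_speeds_bps)


def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bandwidth*Delay Product: bytes in flight needed to fill the path."""
    return bandwidth_bps * rtt_s / 8


if __name__ == "__main__":
    # Assumed path: 1 Gbit/s access links around a 2.5 Gbit/s core, rtt 177 ms.
    bw = bottleneck_bps([1e9, 2.5e9, 1e9])
    print(f"{bdp_bytes(bw, 0.177) / 1e6:.1f} Mbytes")
```

For a 1 Gbit/s bottleneck and a 177 ms rtt this gives about 22 Mbytes, in line with the 22 Mbyte socket size used for the London-Chicago-London transfers later in the talk.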
Standard TCP (Reno) – What’s the problem?
• TCP has 2 phases:
• Slowstart: probe the network to estimate the available BW; exponential growth
• Congestion Avoidance: main data transfer phase – transfer rate grows “slowly”
• AIMD and High Bandwidth – Long Distance networks: poor performance of TCP in high bandwidth wide area networks is due in part to the TCP congestion control algorithm.
• For each ack in a RTT without loss: cwnd -> cwnd + a / cwnd - Additive Increase, a=1
• For each window experiencing loss: cwnd -> cwnd – b (cwnd) - Multiplicative Decrease, b= ½
• Packet loss is a killer !!
TCP (Reno) – Details of problem #1
• Time for TCP to recover its throughput from 1 lost 1500 byte packet given by:
• for rtt of ~200 ms @ 1 Gbit/s: 2 min
[Plot labels: UK 6 ms, Europe 25 ms, USA 150 ms; 1.6 s, 26 s, 28 min]
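The recovery time can be estimated from the AIMD rules alone. The sketch below is a reconstruction under stated assumptions, not necessarily the expression used on the slide: the window that fills the link is C*RTT/packet size, a loss halves it, and it then regains one packet per RTT, so recovery takes C*RTT^2 / (2*packet size). The function name is illustrative.

```python
def reno_recovery_time_s(link_bps: float, rtt_s: float,
                         packet_bytes: int = 1500) -> float:
    """Time for standard TCP to climb back to full rate after one loss."""
    window_pkts = link_bps * rtt_s / (packet_bytes * 8)   # packets filling the pipe
    rtts_needed = window_pkts / 2                         # +1 packet per RTT from W/2
    return rtts_needed * rtt_s


if __name__ == "__main__":
    for rtt_ms in (6, 25, 200):
        t = reno_recovery_time_s(1e9, rtt_ms / 1000)
        print(f"rtt {rtt_ms} ms: {t:.1f} s")
```

At 1 Gbit/s this gives about 1.5 s for 6 ms, 26 s for 25 ms and roughly 28 min for 200 ms, which lines up with the 1.6 s, 26 s and 28 min values that appear on the slide; the slide's other figures may rest on different assumptions. The quadratic dependence on rtt is the key point.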
Investigation of new TCP Stacks
• The AIMD Algorithm – Standard TCP (Reno)
• For each ack in a RTT without loss: cwnd -> cwnd + a / cwnd - Additive Increase, a=1
• For each window experiencing loss: cwnd -> cwnd – b (cwnd) - Multiplicative Decrease, b= ½
• High Speed TCP: a and b vary depending on current cwnd using a table
• a increases more rapidly with larger cwnd – returns to the ‘optimal’ cwnd size sooner for the network path
• b decreases less aggressively and, as a consequence, so does the cwnd. The effect is that there is not such a decrease in throughput.
• Scalable TCP: a and b are fixed adjustments for the increase and decrease of cwnd
• a = 1/100 – the increase is greater than TCP Reno
• b = 1/8 – the decrease on loss is less than TCP Reno
• Scalable over any link speed.
• Fast TCP: uses round trip time as well as packet loss to indicate congestion, with rapid convergence to fair equilibrium for throughput.
• HSTCP-LP, H-TCP, BiC-TCP
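The effect of the Scalable TCP constants can be compared with Reno by counting round trips to recover from one loss. This is an approximate sketch: it assumes Scalable adds a = 1/100 of a packet per ACK, so the window grows by a factor (1 + a) per RTT, and that Reno adds one packet per RTT. The function names are illustrative.

```python
import math


def reno_recovery_rtts(cwnd_pkts: float, a: float = 1.0, b: float = 0.5) -> float:
    """Reno loses b*cwnd packets and regains a packets per RTT."""
    return b * cwnd_pkts / a


def scalable_recovery_rtts(a: float = 0.01, b: float = 0.125) -> float:
    """Scalable TCP drops to (1-b)*cwnd and grows by (1+a) per RTT,
    so the number of RTTs does not depend on cwnd."""
    return math.log(1 / (1 - b)) / math.log(1 + a)


if __name__ == "__main__":
    # About 16667 packets of 1500 bytes fill 1 Gbit/s at a 200 ms rtt.
    print(reno_recovery_rtts(16667))
    print(scalable_recovery_rtts())
```

Reno needs thousands of round trips at this window size, while Scalable TCP needs about 13 regardless of the window, which is one way to read "scalable over any link speed".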
Let’s check out this theory about new TCP stacks. Does it matter? Does it work?
Packet Loss with new TCP Stacks
• TCP Response Function
• Throughput vs Loss Rate – further to right: faster recovery
• Drop packets in kernel
[Plots: MB-NG rtt 6 ms; DataTAG rtt 120 ms]
Packet Loss and new TCP Stacks
• TCP Response Function
• UKLight London-Chicago-London rtt 177 ms
• 2.6.6 Kernel
• Agreement with theory good
• Some new stacks good at high loss rates
High Throughput Demonstrations
• Send data with TCP
• Drop Packets
• Monitor TCP with Web100
[Figure: test setup. Labels: man03, lon01; Manchester (Geneva), London (Chicago); rtt 6.2 ms, (Geneva) rtt 128 ms; Dual Xeon 2.2 GHz hosts at each end; 1 GEth; Cisco 7609; Cisco GSR; 2.5 Gbit SDH MB-NG Core]
High Performance TCP – MB-NG
• Drop 1 in 25,000
• rtt 6.2 ms
• Recover in 1.6 s
[Plots: Standard, HighSpeed, Scalable]
High Performance TCP – DataTAG
• Different TCP stacks tested on the DataTAG Network
• rtt 128 ms
• Drop 1 in 10^6
• High-Speed: rapid recovery
• Scalable: very fast recovery
• Standard: recovery would take ~20 mins
FAST demo via OMNInet and DataTAG
FAST Demo: Cheng Jin, David Wei (Caltech)
A. Adriaanse, C. Jin, D. Wei (Caltech); J. Mambretti, F. Yeh (Northwestern); S. Ravot (Caltech/CERN)
[Figure: network map. Labels: San Diego workstations and FAST display; NU-E (Leverone); OMNInet with photonic switches and Nortel Passport 8600 (10GE); StarLight-Chicago; CalTech Cisco 7609; Alcatel 1670; OC-48 DataTAG, 7,000 km; CERN Cisco 7609; CERN-Geneva workstations; 2 x GE host links; Layer 2 path and Layer 2/3 path]
FAST TCP vs newReno
• Traffic flow Channel #1: newReno. Utilization: 70%
• Traffic flow Channel #2: FAST. Utilization: 90%
Problem #2: Is TCP fair? Look at Round Trip Times & Max Transfer Unit
MTU and Fairness
[Figure: Host #1 and Host #2 at CERN (GVA), each on 1 GE into a GbE switch; 1 GE bottleneck to router R; POS 2.5 Gbps to router R at Starlight (Chi); Host #1 and Host #2 at Starlight on 1 GE]
• Two TCP streams share a 1 Gb/s bottleneck
• RTT = 117 ms
• MTU = 3000 Bytes; Avg. throughput over a period of 7000 s = 243 Mb/s
• MTU = 9000 Bytes; Avg. throughput over a period of 7000 s = 464 Mb/s
• Link utilization: 70.7 %
Sylvain Ravot, DataTag 2003
RTT and Fairness
[Figure: Host #1 and Host #2 at CERN (GVA), each on 1 GE into a GbE switch; 1 GE bottleneck to router R; POS 2.5 Gb/s to Starlight (Chi); POS 10 Gb/s and 10GE onward to Sunnyvale; one receiving host at Starlight and one at Sunnyvale, each on 1 GE]
• Two TCP streams share a 1 Gb/s bottleneck
• CERN <-> Sunnyvale RTT = 181 ms; Avg. throughput over a period of 7000 s = 202 Mb/s
• CERN <-> Starlight RTT = 117 ms; Avg. throughput over a period of 7000 s = 514 Mb/s
• MTU = 9000 bytes
• Link utilization = 71.6 %
Sylvain Ravot, DataTag 2003
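The utilization figures on the two fairness slides follow directly from the quoted per-stream averages. A small sketch of that arithmetic, assuming the 1 Gb/s bottleneck stated on the slides; the function names are illustrative.

```python
from typing import Sequence


def link_utilisation(stream_mbps: Sequence[float],
                     bottleneck_mbps: float = 1000.0) -> float:
    """Fraction of the bottleneck used by all streams together."""
    return sum(stream_mbps) / bottleneck_mbps


def share_ratio(stream_mbps: Sequence[float]) -> float:
    """How many times more the best stream gets than the worst."""
    return max(stream_mbps) / min(stream_mbps)


if __name__ == "__main__":
    mtu_test = [243.0, 464.0]   # MTU 3000 vs MTU 9000, same rtt
    rtt_test = [202.0, 514.0]   # rtt 181 ms vs rtt 117 ms, same MTU
    for name, streams in (("MTU", mtu_test), ("RTT", rtt_test)):
        print(name, f"{link_utilisation(streams):.1%}",
              f"{share_ratio(streams):.2f}x")
```

The sums reproduce the 70.7 % and 71.6 % on the slides, and the ratios show the favoured stream getting about 1.9 times (larger MTU) and 2.5 times (shorter rtt) the throughput of the other.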
Problem #n: Do TCP Flows Share the Bandwidth?
Test of TCP Sharing: Methodology (1 Gbit/s)
[Figure: SLAC sends TCP/UDP traffic (iperf or UDT) through a bottleneck to iperf receivers at Caltech/UFL/CERN, with ICMP ping traffic at 1/s; timeline marked in 4 min and 2 min regions]
• Chose 3 paths from SLAC (California)
• Caltech (10 ms), Univ Florida (80 ms), CERN (180 ms)
• Used iperf/TCP and UDT/UDP to generate traffic
• Each run was 16 minutes, in 7 regions
Les Cottrell, PFLDnet 2005
TCP Reno single stream
• Low performance on fast long distance paths
• AIMD (add a=1 pkt to cwnd / RTT, decrease cwnd by factor b=0.5 in congestion)
• Net effect: recovers slowly, does not effectively use available bandwidth, so poor throughput
• Unequal sharing
[Plot: SLAC to CERN. Annotations: congestion has a dramatic effect; recovery is slow; RTT increases when achieves best throughput; remaining flows do not take up slack when flow removed; increase recovery rate]
Les Cottrell, PFLDnet 2005
SLAC-CERN Fast
• As well as packet loss, FAST uses RTT to detect congestion
• RTT is very stable: σ(RTT) ~ 9ms vs 37±0.14ms for the others
[Plot annotations: 2nd flow never gets equal share of bandwidth; big drops in throughput which take several seconds to recover from]
SLAC-CERN Hamilton TCP
• One of the best performers
• Throughput is high
• Big effects on RTT when achieves best throughput
• Flows share equally
[Plot annotations: two flows share equally; appears to need >1 flow to achieve best throughput; > 2 flows appears less stable]
Problem #n+1: To SACK or not to SACK?
The SACK Algorithm
• SACK Rationale
• Non-contiguous blocks of data can be ACKed
• Sender transmits just lost packets
• Helps when multiple packets lost in one TCP window
• The SACK processing is inefficient for large bandwidth delay products
• Sender write queue (linked list) walked for:
• Each SACK block
• To mark lost packets
• To re-transmit
• Processing so long input Q becomes full
• Get Timeouts
[Plots: Standard SACKs, rtt 150 ms, HS-TCP; SACKs updated, rtt 150 ms. Hosts: Dell 1650 2.8 GHz, PCI-X 133 MHz, Intel Pro/1000. Doug Leith, Yee-Ting Li]
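The cost described above scales with the number of packets in flight. The sketch below is a worst-case model, assuming the sender walks its whole write queue once for every SACK block in an ACK; the 1 Gbit/s rate and the three SACK blocks per ACK are assumptions for illustration, not measurements from the talk.

```python
def packets_in_flight(link_bps: float, rtt_s: float,
                      packet_bytes: int = 1500) -> int:
    """Length of the sender's write queue when the pipe is full."""
    return round(link_bps * rtt_s / (packet_bytes * 8))


def queue_entries_visited(queue_len: int, sack_blocks: int) -> int:
    """Worst case: one full walk of the linked list per SACK block."""
    return queue_len * sack_blocks


if __name__ == "__main__":
    queue = packets_in_flight(1e9, 0.150)        # rtt 150 ms as on the slide
    print(queue)                                 # packets queued
    print(queue_entries_visited(queue, 3))       # list nodes touched per ACK
```

With roughly 12500 packets queued, every incoming ACK that carries SACK blocks can cost tens of thousands of list-node visits, which is how processing can fall behind the arrival rate of ACKs.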
SACK …
• Look into what’s happening at the algorithmic level with web100:
• Strange hiccups in cwnd: only correlation is SACK arrivals
[Plot: Scalable TCP on MB-NG with 200 Mbit/s CBR background. Yee-Ting Li]
Real Applications on Real Networks
• Disk-2-disk applications on real networks
• Memory-2-memory tests
• Transatlantic disk-2-disk at Gigabit speeds
• Remote Computing Farms
• The effect of TCP
• The effect of distance
• Radio Astronomy e-VLBI
• Leave for Ralph’s talk
iperf Throughput + Web100
• BaBar on Production network
• Standard TCP
• 425 Mbit/s
• DupACKs 350-400 – re-transmits
• SuperMicro on MB-NG network
• HighSpeed TCP
• Linespeed 940 Mbit/s
• DupACK ? <10 (expect ~400)
Applications: Throughput Mbit/s
• HighSpeed TCP
• 2 GByte file RAID5
• SuperMicro + SuperJANET
• bbcp
• bbftp
• Apache
• Gridftp
• Previous work used RAID0 (not disk limited)
bbftp: What else is going on? Scalable TCP
• BaBar + SuperJANET
• Instantaneous 200 – 600 Mbit/s
• Disk-mem ~590 Mbit/s: remember the end host
• SuperMicro + SuperJANET
• Instantaneous 0 - 550 Mbit/s
• Congestion window – duplicate ACK
• Throughput variation not TCP related?
• Disk speed / bus transfer
• Application architecture
Transatlantic Disk to Disk Transfers With UKLight, SuperComputing 2004
SC2004 UKLIGHT Overview
[Figure: network map. Labels: SC2004 SLAC Booth (Cisco 6509); Caltech Booth (UltraLight IP, Caltech 7600); NLR Lambda NLR-PITT-STAR-10GE-16; Chicago Starlight; UKLight 10G, four 1GE channels; Surfnet/EuroLink 10G, two 1GE channels; Amsterdam; ULCC UKLight; UCL network, UCL HEP; Manchester, MB-NG 7600 OSR; K2 and Ci nodes]
Transatlantic Ethernet: TCP Throughput Tests
• Supermicro X5DPE-G2 PCs
• Dual 2.9 GHz Xeon CPU, FSB 533 MHz
• 1500 byte MTU
• 2.6.6 Linux Kernel
• Memory-memory TCP throughput
• Standard TCP
• Wire rate throughput of 940 Mbit/s
• First 10 sec
• Work in progress to study:
• Implementation detail
• Advanced stacks
• Effect of packet loss
• Sharing
SC2004 Disk-Disk bbftp
• bbftp file transfer program uses TCP/IP
• UKLight: Path: London-Chicago-London; PCs: Supermicro + 3Ware RAID0
• MTU 1500 bytes; Socket size 22 Mbytes; rtt 177 ms; SACK off
• Move a 2 GByte file
• Web100 plots:
• Standard TCP
• Average 825 Mbit/s
• (bbcp: 670 Mbit/s)
• Scalable TCP
• Average 875 Mbit/s
• (bbcp: 701 Mbit/s, ~4.5 s of overhead)
• Disk-TCP-Disk at 1 Gbit/s
Network & Disk Interactions (work in progress)
• Hosts:
• Supermicro X5DPE-G2 motherboards
• dual 2.8 GHz Xeon CPUs with 512 k byte cache and 1 M byte memory
• 3Ware 8506-8 controller on 133 MHz PCI-X bus configured as RAID0
• six 74.3 GByte Western Digital Raptor WD740 SATA disks, 64k byte stripe size
• Measure memory to RAID0 transfer rates with & without UDP traffic
• Disk write: 1735 Mbit/s
• Disk write + 1500 MTU UDP: 1218 Mbit/s, drop of 30%
• Disk write + 9000 MTU UDP: 1400 Mbit/s, drop of 19%
[Plots: % CPU kernel mode]
Remote Computing Farms in the ATLAS TDAQ Experiment
ATLAS Remote Farms – Network Connectivity
ATLAS Application Protocol
[Figure: time diagram between the Event Filter Daemon (EFD) and the SFI and SFO: request event, send event data, process event, request buffer, send OK, send processed event; the request-response time is histogrammed]
• Event Request
• EFD requests an event from SFI
• SFI replies with the event, ~2 Mbytes
• Processing of event
• Return of computation
• EF asks SFO for buffer space
• SFO sends OK
• EF transfers results of the computation
• tcpmon - instrumented TCP request-response program emulates the Event Filter EFD to SFI communication.
tcpmon: TCP Activity Manc-CERN Req-Resp
• Round trip time 20 ms
• 64 byte Request (green), 1 Mbyte Response (blue)
• TCP in slow start
• 1st event takes 19 rtt or ~380 ms
• TCP Congestion window gets re-set on each Request
• TCP stack RFC 2581 & RFC 2861 reduction of Cwnd after inactivity
• Even after 10 s, each response takes 13 rtt or ~260 ms
• Transfer achievable throughput 120 Mbit/s
tcpmon: TCP Activity Manc-CERN Req-Resp, TCP stack tuned
• Round trip time 20 ms
• 64 byte Request (green), 1 Mbyte Response (blue)
• TCP starts in slow start
• 1st event takes 19 rtt or ~380 ms
• TCP Congestion window grows nicely
• Response takes 2 rtt after ~1.5 s
• Rate ~10/s (with 50 ms wait)
• Transfer achievable throughput grows to 800 Mbit/s
• Data transferred WHEN the application requires the data
tcpmon: TCP Activity Alberta-CERN Req-Resp, TCP stack tuned
• Round trip time 150 ms
• 64 byte Request (green), 1 Mbyte Response (blue)
• TCP starts in slow start
• 1st event takes 11 rtt or ~1.67 s
• TCP Congestion window in slow start to ~1.8 s, then congestion avoidance
• Response in 2 rtt after ~2.5 s
• Rate 2.2/s (with 50 ms wait)
• Transfer achievable throughput grows slowly from 250 to 800 Mbit/s
Summary & Conclusions
• Standard TCP not optimum for high throughput long distance links
• Packet loss is a killer for TCP
  • Check on campus links & equipment, and access links to backbones
  • Users need to collaborate with the Campus Network Teams
  • DANTE PERT
• New stacks are stable and give better response & performance
  • Still need to set the TCP buffer sizes!
  • Check other kernel settings, e.g. window-scale maximum
  • Watch for “TCP Stack implementation Enhancements”
• TCP tries to be fair
  • Large MTU has an advantage
  • Short distances, small RTT, have an advantage
• TCP does not share bandwidth well with other streams
• The End Hosts themselves
  • Plenty of CPU power is required for the TCP/IP stack as well as the application
  • Packets can be lost in the IP stack due to lack of processing power
  • Interaction between HW, protocol processing, and disk sub-system complex
• Application architecture & implementation are also important
  • The TCP protocol dynamics strongly influence the behaviour of the Application.
• Users are now able to perform sustained 1 Gbit/s transfers
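The reminder to set the TCP buffer sizes can be acted on per socket. This is a minimal sketch using the standard `SO_SNDBUF` and `SO_RCVBUF` socket options; the helper names are illustrative, and the operating system may cap or adjust the requested size (on Linux the `net.core.wmem_max` and `net.core.rmem_max` limits apply), so the granted value should be read back.

```python
import socket
from typing import Tuple


def tcp_socket_with_buffers(buffer_bytes: int) -> socket.socket:
    """Create a TCP socket and request send/receive buffers of the given size.

    The options are set before connect() or listen() so that they can be
    taken into account when the connection is negotiated.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, buffer_bytes)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, buffer_bytes)
    return sock


def granted_buffers(sock: socket.socket) -> Tuple[int, int]:
    """Return the (send, receive) buffer sizes the kernel actually granted."""
    snd = sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
    rcv = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    return snd, rcv


if __name__ == "__main__":
    s = tcp_socket_with_buffers(256 * 1024)
    print(granted_buffers(s))
    s.close()
```

If the granted size is well below the bandwidth*delay product of the path, the system-wide limits need raising before the application setting can take effect.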