TCP Tuning and E2E Performance Anders Magnusson TREFpunkt - October 20, 2004
The speed-of-light problem
• The sender must store every sent packet until it has received an ACK from the receiver
• Because of the speed of light this can take a while, even in small countries like Sweden
• Theoretical RTT Luleå-Stockholm is (1000 km / 300000 km/s) * 2 = 6.7 ms; in reality it is 20 ms
• The TCP window size needed to keep up with 1 Gbit/s must then be (1000 Mbit/s / 8) * 0.02 s = 2.5 Mbyte
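The arithmetic on this slide can be reproduced with a short sketch. The function names are illustrative only; the 1000 km distance, 300000 km/s signal speed, 1 Gbit/s rate and 20 ms measured RTT are the figures from the slide.

```python
def theoretical_rtt_s(distance_km: float, speed_km_per_s: float = 300_000) -> float:
    """Round-trip time if the signal travelled at the speed of light in vacuum."""
    return 2 * distance_km / speed_km_per_s


def window_bytes(bandwidth_bit_per_s: float, rtt_s: float) -> float:
    """Bytes that must be in flight (and buffered by the sender) to fill the path."""
    return bandwidth_bit_per_s * rtt_s / 8


rtt_theory = theoretical_rtt_s(1000)   # Luleå-Stockholm, about 1000 km
window = window_bytes(1e9, 0.020)      # 1 Gbit/s at the measured 20 ms RTT

print(f"theoretical RTT: {rtt_theory * 1000:.1f} ms")   # 6.7 ms
print(f"required window: {window / 1e6:.1f} Mbyte")      # 2.5 Mbyte
```

The same function shows why long paths are hard: the required window scales linearly with both the bandwidth and the RTT.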
Operating system buffers
Inside the operating system kernel there are usually several different buffers that affect performance.
The term “buffers” is somewhat misleading: usually it is just some sort of data structure that is used to reference data in memory (but in theory it could just as well be real buffers).
TCP window buffers
• The TCP window sizes can be adjusted on virtually all operating systems
• There are two windows, send and receive
• The window size for one direction of flow is set to MIN(sender’s send window, receiver’s receive window)
• The send window must be large enough to hold all segments sent during one RTT
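As a rough illustration of the MIN rule, the sketch below computes the effective window and the throughput ceiling it implies (at most one window per RTT). The 64 KiB receive window is an assumed example value, not a figure from the slides; the 2.5 Mbyte send window and 20 ms RTT are taken from the earlier slide.

```python
def effective_window(send_window: int, recv_window: int) -> int:
    """Window for one direction of flow: the smaller of the two sides."""
    return min(send_window, recv_window)


def max_throughput_bit_per_s(window_bytes: int, rtt_s: float) -> float:
    """At most one full window can be delivered per round trip."""
    return window_bytes * 8 / rtt_s


send_window = 2_500_000    # sender tuned for 1 Gbit/s at 20 ms
recv_window = 64 * 1024    # assumed: receiver left at an untuned 64 KiB

window = effective_window(send_window, recv_window)
ceiling = max_throughput_bit_per_s(window, 0.020)

print(f"effective window: {window} bytes")
print(f"throughput ceiling: {ceiling / 1e6:.1f} Mbit/s")   # 26.2 Mbit/s
```

Under these assumptions a single untuned end host caps the flow far below the link rate, which is why both ends need to be adjusted.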
Socket buffers
• Limit the amount of data an application may write to the kernel before being blocked
• Often combined with the TCP send window; when ACKs are received, the socket buffer data is adjusted accordingly
• Must be >= the TCP window size to avoid limiting throughput
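An application can also request socket buffer sizes itself. The following is a minimal sketch using the standard SO_SNDBUF and SO_RCVBUF socket options; the helper name and the 2.5 Mbyte request are illustrative assumptions. The kernel may grant less than requested (or a different value), depending on the system-wide limits described in the tuning slides.

```python
import socket


def request_buffers(sock: socket.socket, size: int) -> dict:
    """Ask for send/receive buffers of `size` bytes and report what was granted."""
    granted = {}
    for name, opt in (("send", socket.SO_SNDBUF), ("recv", socket.SO_RCVBUF)):
        try:
            sock.setsockopt(socket.SOL_SOCKET, opt, size)
        except OSError:
            # Some systems reject requests above the configured maximum
            # instead of silently capping them.
            pass
        granted[name] = sock.getsockopt(socket.SOL_SOCKET, opt)
    return granted


with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    granted = request_buffers(s, 2_500_000)

print(granted)
```

Comparing the granted values with the requested size is a quick way to see whether the system-wide maximum is the limiting factor.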
MBUF clusters
• There are limits on how many network buffers (in many OSes called MBUFs) may be allocated
• MBUFs may have external storage associated with them, allocated out of a separate (limited) area
• These buffers are often allocated at compile time, and it is not uncommon for physical memory to be statically allocated for them
Other knobs to turn
RFC1323
• Turns on the “window scaling option”, needed to use TCP windows larger than 64k
Initial window size
• Avoid slow-start by injecting many packets into the network at connection startup
Interface queues
• Be able to store the packets that are ready to send until the network interface can transmit them
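To see why window scaling is needed, recall that the TCP header carries a 16-bit window field, and RFC 1323 lets each side shift it left by up to 14 bits. The sketch below (function name is illustrative) finds the smallest shift that can represent a given window.

```python
MAX_UNSCALED_WINDOW = 65535   # largest value of the 16-bit window field
MAX_SHIFT = 14                # largest shift count allowed by RFC 1323


def required_shift(window_bytes: int) -> int:
    """Smallest window scale shift that can advertise `window_bytes`."""
    for shift in range(MAX_SHIFT + 1):
        if (MAX_UNSCALED_WINDOW << shift) >= window_bytes:
            return shift
    raise ValueError("window too large for RFC 1323 window scaling")


print(required_shift(65535))        # 0: fits without scaling
print(required_shift(2_500_000))    # 6: the 1 Gbit/s, 20 ms example
print(required_shift(25_000_000))   # 9: a 25 MB window
```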
Problems often seen
Packet loss
• On a long-distance high-speed connection, packet loss in a TCP flow will reduce the speed significantly
• If the sender enters congestion avoidance, the congestion window will open linearly, and with large windows this will be really slow
• With an RTT of 185ms and window size of 25MB it will take around 50 minutes to reach full speed
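The order of magnitude of the 50-minute figure can be reproduced with a simple model, assuming the congestion window grows by one segment per RTT during congestion avoidance. The 1460-byte segment size is an assumption (the slide does not state it), and the function name is illustrative.

```python
def linear_growth_time_s(bytes_to_grow: float, mss_bytes: int, rtt_s: float) -> float:
    """Time for the congestion window to grow by `bytes_to_grow`
    when it opens by one MSS per round trip."""
    return bytes_to_grow / mss_bytes * rtt_s


RTT_S = 0.185          # from the slide
WINDOW = 25_000_000    # 25 MB, from the slide
MSS = 1460             # assumption: typical Ethernet-sized segments

full = linear_growth_time_s(WINDOW, MSS, RTT_S)
half = linear_growth_time_s(WINDOW / 2, MSS, RTT_S)

print(f"growing the whole window: {full / 60:.1f} minutes")   # 52.8
print(f"growing back from half:   {half / 60:.1f} minutes")   # 26.4
```

Under these assumptions, opening the full window linearly takes a little over 50 minutes; recovering from a single halving would still take roughly half that. A larger segment size shortens the time proportionally.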
Problems often seen
Packet bursts
• During the startup of a TCP bulk flow, the exponential increase in packet injection into the network during slow-start may cause packet bursts on links with a large bandwidth-delay product
• The result may be that intermediate switches/routers must drop packets, even though the TCP self-clocking would not permit more packets to be sent than could be received
Problems often seen
ACK/window updates
• The traditional approach for bulk flows is for the receiver to send an ACK for every second received packet
• Window updates are sent as soon as data is delivered to the receiving process
• This causes the number of returned packets to be more than half the number of transmitted packets
• Interrupts and packet handling in the sending host may use a significant amount of CPU
Problems often seen
ARP timeouts
• When an ARP entry times out, it is usually just removed from the ARP cache, and the next packet will initiate a new ARP request
• If there is an ongoing packet flow, this approach may cause packets to be dropped until an ARP reply is received
Tuning of NetBSD
• sysctl -w net.inet.tcp.rfc1323=1
  • Activate window scaling and timestamp options according to RFC1323.
• sysctl -w kern.somaxkva=[sbmax]
  • Set maximum size for all socket buffers together in the system
• sysctl -w kern.sbmax=[sbmax]
  • Set maximum size of socket buffer for one TCP flow
• sysctl -w net.inet.tcp.recvspace=[wstd]
• sysctl -w net.inet.tcp.sendspace=[wstd]
  • Set max size of TCP windows.
• sysctl kern.mbuf.nmbclusters
  • View maximum number of mbuf clusters. Used for storage of data packets to/from the network interface. Can only be set by recompiling your kernel.
Tuning of FreeBSD
• sysctl net.inet.tcp.rfc1323=1
  • Activate window scaling and timestamp options according to RFC1323.
• sysctl kern.ipc.maxsockbuf=[sbmax]
  • Set maximum size of a socket buffer, which limits the TCP window.
• sysctl net.inet.tcp.recvspace=[wstd]
• sysctl net.inet.tcp.sendspace=[wstd]
  • Set max size of TCP windows.
• sysctl kern.ipc.nmbclusters
  • View maximum number of mbuf clusters. Used for storage of data packets to/from the network interface. Can only be set at boot time.
Tuning of Linux
• echo "1" > /proc/sys/net/ipv4/tcp_window_scaling
  • Activate window scaling according to RFC 1323
• echo [wmax] > /proc/sys/net/core/rmem_max
• echo [wmax] > /proc/sys/net/core/wmem_max
  • Set maximum size of TCP windows.
• echo [wmax] > /proc/sys/net/core/rmem_default
• echo [wmax] > /proc/sys/net/core/wmem_default
  • Set default size of TCP windows.
• echo "[wmin] [wstd] [wmax]" > /proc/sys/net/ipv4/tcp_rmem
• echo "[wmin] [wstd] [wmax]" > /proc/sys/net/ipv4/tcp_wmem
  • Set min, default, max windows. Used by the autotuning function.
• echo "[bmin] [bdef] [bmax]" > /proc/sys/net/ipv4/tcp_mem
  • Set maximum total TCP buffer space allocatable (values are in memory pages). Used by the autotuning function.
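Before changing anything it can be useful to check whether the current maximum already covers the window a path needs. The sketch below parses the three-value format used by tcp_rmem and tcp_wmem; the sample string and the helper names are hypothetical, and the file is only read if it exists (that is, on Linux).

```python
from pathlib import Path


def parse_triplet(text: str) -> tuple:
    """Parse the "min default max" format of tcp_rmem / tcp_wmem."""
    wmin, wstd, wmax = (int(field) for field in text.split())
    return wmin, wstd, wmax


def max_allows(text: str, needed_bytes: int) -> bool:
    """True if the configured maximum is at least the needed window."""
    return parse_triplet(text)[2] >= needed_bytes


sample = "4096 87380 6291456"   # hypothetical values, for illustration only
print(parse_triplet(sample))
print(max_allows(sample, 2_500_000))    # enough for 1 Gbit/s at 20 ms
print(max_allows(sample, 25_000_000))   # not enough for a 25 MB window

path = Path("/proc/sys/net/ipv4/tcp_rmem")
if path.exists():
    # Read-only check of the live setting; nothing is modified.
    print("current tcp_rmem:", path.read_text().strip())
```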
Tuning of Windows (2k, XP, 2k3)
• HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\Tcp1323Opts=1
  • Turn on window scaling option
• HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\TcpWindowSize=[wmax]
  • Set maximum size of TCP window
How to set a Land Speed Record
• Recipe:
  • Really high-quality networks
  • Hardware capable of sending/receiving fast enough
  • Operating system without foolish bottlenecks
  • Enthusiasts who spend weekends sending an obscene amount of data between Luleå and San Jose
SUNET Internet Land Speed Record - Network setup
[Figure: network diagram with the labels: End host in Luleå, Sweden; 10GE; GigaSunet OC-192 core; OC192; Sprintlink OC-192 core; 10GE; End host in San Jose, CA]
Network path consists of 42(!) router hops, using paths shared with other users of the networks.
Records submitted September 12
• 1 966 080 000 000 bytes in 3648 real seconds = 4310 Mbit/second
• 1831 Gbytes in almost exactly an hour
• 120 000 packets/second transferred with an MTU of 4470 bytes
• Record submitted for the IPv4 single and multiple stream class is 124.935 Petabit-meters/second (which is a 78% increase over our previous record)
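The record figures can be cross-checked with a few lines of arithmetic. This is a sketch, not the official calculation: it assumes every packet is a full 4470-byte MTU, reads "Gbytes" as 2^30 bytes, and uses the 28,983 km path length given on the fiber path slide. The results land close to, but not exactly on, the slide's rounded numbers.

```python
BYTES_SENT = 1_966_080_000_000
SECONDS = 3648
MTU_BYTES = 4470
DISTANCE_M = 28_983_000   # 28,983 km, from the fiber path slide

rate_bit_per_s = BYTES_SENT * 8 / SECONDS
gbytes = BYTES_SENT / 2**30                      # assumption: binary gigabytes
packets_per_s = rate_bit_per_s / 8 / MTU_BYTES   # assumption: all packets full-sized
petabit_meters = rate_bit_per_s * DISTANCE_M / 1e15

print(f"rate: {rate_bit_per_s / 1e6:.0f} Mbit/s")
print(f"volume: {gbytes:.0f} Gbytes")
print(f"packets: {packets_per_s:.0f} per second")
print(f"record metric: {petabit_meters:.2f} Petabit-meters/second")
```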
Compared with others
Compared to the previous record, we can note that we achieved this using
• Less powerful end hosts
• 200% longer distance
• Less than half the MTU size (which generates heavier CPU load on the end hosts)
• The normal GigaSunet and Sprintlink production infrastructures
Fiber path for the Internet LSR
Distance from Luleå, Sweden to San Jose, CA is approximately 28,983 km (18,013 miles)
Network load
More to read…
• http://proj.sunet.se/LSR
  • Describes how the Land Speed Record(s) were achieved
• http://proj.sunet.se/E2E
  • About end-to-end performance in GigaSunet