
Towards Gigabit

This presentation surveys potential problems in gigabit networks: hardware, driver, and OS issues (NIC drivers, device management, redundant copies, device polling, zero-copy TCP), protocol stack overhead, scalability of the protocol specification, TCP stability and utilization, and related experiments and measurements.

Presentation Transcript


  1. Towards Gigabit David Wei Netlab@Caltech For FAST Meeting July.2

  2. Potential Problems • Hardware / Driver / OS • Protocol Stack Overhead • Scalability of the protocol specification • TCP Stability / Utilization (new congestion control algorithms) • Related Experiments & Measurements

  3. Hardware / Drivers / OS • NIC driver • Device management (interrupts) • Redundant copies • Device polling (http://info.iet.unipi.it/~luigi/polling/) • Zero-copy TCP • … (www.cs.duke.edu/ari/publications/talks/freebsdcon)

  4. Device Polling Current interrupt-driven path for a NIC driver in FreeBSD: • A packet arrives at the NIC • The NIC raises a hardware interrupt • The CPU jumps to the interrupt handler for that NIC • The MAC-layer handler reads the data from the NIC into a queue • Upper layers process the data in the queue (at lower priority) • Drawback: the CPU is interrupted for every packet, causing context switches; interrupts become very frequent with high-speed devices • Live-lock: the CPU is so busy servicing NIC interrupts that it never processes the data already in the queue

  5. Device Polling • Polling: the CPU checks the device when it has time • Scheduling: the user specifies a time ratio dividing the CPU between device servicing and non-device processing Advantages: • Balances device service against non-device processing • Improves performance with fast devices (a schematic sketch follows)
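
The contrast with the per-packet interrupt path can be sketched as follows. This is a schematic illustration of budgeted polling, not the actual FreeBSD polling API; the fake NIC ring and helper names are mine.

```c
/* Schematic sketch of budgeted device polling (not the FreeBSD API).
 * A fake in-memory "NIC ring" stands in for real hardware so the idea
 * can be run and observed. */
#include <stdio.h>

#define RING_SIZE 16
static int nic_ring[RING_SIZE];   /* packets waiting on the "NIC" */
static int nic_count;             /* how many are waiting */

/* Called from a timer/scheduler tick at a user-tunable rate, instead of
 * from one hardware interrupt per packet.  The budget caps how long the
 * CPU spends on the device per tick, so upper-layer processing always
 * gets CPU time and receive live-lock is avoided. */
static void nic_poll(int budget)
{
    while (budget-- > 0 && nic_count > 0)
        printf("deliver packet %d to the IP input queue\n",
               nic_ring[--nic_count]);
    /* anything left stays on the NIC until the next polling tick */
}

int main(void)
{
    for (nic_count = 0; nic_count < 8; nic_count++)   /* pretend 8 arrivals */
        nic_ring[nic_count] = nic_count;
    nic_poll(5);              /* service at most 5 packets this tick */
    nic_poll(5);              /* the rest are picked up next tick */
    return 0;
}
```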

  6. Protocol Stack Overhead Per-packet overhead: • Ethernet header / checksum • IP header / checksum • TCP header / checksum • Copying / interrupt processing Solution: increase the packet size • Optimal packet size = min{MTU along the path} (fragmentation also hurts performance)
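
A quick back-of-the-envelope calculation (my own illustration, with assumed header sizes) shows why larger packets reduce the per-packet cost:

```c
/* Header overhead vs. payload size: Ethernet (14 B header + 4 B CRC),
 * IP (20 B) and TCP (20 B) headers are paid once per packet, so the
 * useful fraction of each packet grows with the MSS. */
#include <stdio.h>

int main(void)
{
    const int headers = 14 + 4 + 20 + 20;   /* Ethernet + CRC + IP + TCP */
    const int payloads[] = { 512, 1460, 8960, 65495 };

    for (int i = 0; i < 4; i++) {
        int p = payloads[i];
        printf("payload %5d B -> header overhead %5.2f%%\n",
               p, 100.0 * headers / (p + headers));
    }
    return 0;
}
```

Per-packet interrupt and copy costs scale the same way: fewer, larger packets mean fewer interrupts and fewer copies per byte delivered.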

  7. Path MTU Discovery (RFC 1191) Current method: • "Don't Fragment" bit (router: drop or fragment; host: test and enforce) • MTU = min{576, first-hop MTU} • MSS = MTU - 40 • MTU <= 65535 (architectural limit) • MSS <= 65495 (IP sign-bit bugs…) • Drawback: usually too small

  8. Path MTU Discovery How to discover the PMTU? Current: • Search (proportional decrease / binary search) • Update (periodically increase, reset to the first-hop MTU) Proposed: • Search/update using typical MTU values • Routers: include a suggested MTU in the "Datagram Too Big" (DTB) message that reports the dropped DF packet

  9. Path MTU Discovery Implementation Host: • Packetization layer (TCP / connections over UDP): sets DF and the packet size • IP: stores the PMTU for each known path (routing table) • ICMP: handles the "Datagram Too Big" message Router: • Sends an ICMP "Datagram Too Big" message when a DF datagram exceeds the next-hop MTU Implementation problems: see RFC 2923 (a sketch of the host-side search follows)
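
The host-side search can be sketched as a probe loop over typical MTU values. This is a toy illustration: send_probe_df() is a hypothetical stand-in for sending a DF-marked segment and learning whether a "Datagram Too Big" came back, and the plateau list is a representative subset of the RFC 1191 table.

```c
/* Toy sketch of an RFC 1191-style PMTU search over typical MTU values. */
#include <stdio.h>

/* Hypothetical helper: "send" a DF-marked packet of the given size and
 * return 1 if it fits the path, 0 if some router reported Datagram Too
 * Big.  Here the narrowest link is pretended to be FDDI (4352 bytes). */
static int send_probe_df(int size)
{
    const int narrowest_link_mtu = 4352;
    return size <= narrowest_link_mtu;
}

int main(void)
{
    /* Probe common link MTUs from largest to smallest rather than
     * shrinking byte by byte. */
    const int plateaus[] = { 65535, 17914, 8166, 4352, 2002, 1492, 1006 };
    int pmtu = 576;                       /* conservative fallback */

    for (int i = 0; i < 7; i++) {
        if (send_probe_df(plateaus[i])) {
            pmtu = plateaus[i];
            break;
        }
    }
    printf("estimated PMTU %d bytes, MSS %d bytes\n", pmtu, pmtu - 40);
    return 0;
}
```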

  10. Scalability of Protocol Specifications • Window size space (<= 64 KB) • Sequence number space (wraps around; <= 2 GB usable) • Inadequate frequency of RTT sampling (one sample per window)

  11.-18. Sequence Number Space: figure-only slides illustrating how the 32-bit sequence number space wraps around at high bandwidth.

  19. Sequence Number Space • MSL (Maximum Segment Lifetime) > variance of IP delay • MSL < |sequence number space| / bandwidth

  20. Sequence Number Space • MSL (Maximum Segment Lifetime) > variance of IP delay • MSL < 8 * |sequence number space| / bandwidth • |SN space| = 2^31 bytes = 2 GB • Bandwidth = 1 Gbps • So MSL <= ~16 s, and the variance of IP delay must be <= ~16 s • Current TCP assumes MSL = 3 min • Not scalable as bandwidth grows
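
To make the bound concrete, here is the wrap-time arithmetic at 1 Gbps (my own check; it comes out to roughly 16-17 seconds depending on whether 2 GB is read as 2*10^9 or 2^31 bytes):

```c
/* Time for the 2^31-byte usable sequence space to wrap at gigabit speed.
 * MSL must be smaller than this, or a delayed segment from a previous
 * wrap could be mistaken for fresh data. */
#include <stdio.h>

int main(void)
{
    double seq_space_bytes = 2147483648.0;   /* 2^31 bytes */
    double bandwidth_bps   = 1e9;            /* 1 Gbps */

    double wrap_s = 8.0 * seq_space_bytes / bandwidth_bps;
    printf("sequence space wraps in %.1f s at 1 Gbps\n", wrap_s);
    printf("TCP's assumed MSL of 180 s is far larger, so wrap protection is needed\n");
    return 0;
}
```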

  21. TCP Extensions (RFC 1323) • Window scale: a scale factor S carried in the SYN; effective window = [16-bit window] * 2^S • RTT measurement: a timestamp in every packet (generated by the sender, echoed by the receiver) • PAWS (Protection Against Wrapped Sequence numbers): uses the timestamps to extend the sequence space (so the timestamp clock must tick neither too fast nor too slow: roughly 1 ms to 1 s) • Header prediction: simplifies processing of the common case
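
To see why the scale factor is needed, compare the bandwidth-delay product of a gigabit path with the 64 KB window field (the 1 Gbps / 100 ms figures below are my own example numbers):

```c
/* Smallest RFC 1323 window-scale factor S such that 65535 * 2^S covers
 * the bandwidth-delay product of the path.  Compile with -lm. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double bandwidth_bps = 1e9;    /* 1 Gbps */
    double rtt_s = 0.1;            /* 100 ms round trip */

    double bdp_bytes = bandwidth_bps * rtt_s / 8.0;       /* ~12.5 MB */
    int scale = (int)ceil(log2(bdp_bytes / 65535.0));     /* RFC 1323 caps S at 14 */

    printf("BDP = %.1f MB, window scale S = %d, max window = %.1f MB\n",
           bdp_bytes / 1e6, scale, 65535.0 * (1 << scale) / 1e6);
    return 0;
}
```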

  22. High Speed TCP (Floyd '02) Goals: • Achieve a large window with a realistic loss rate (use the current window size in the AIMD parameters) • High speed in a single connection (10 Gbps) • Easy to achieve a high sending rate for a given loss rate • How to achieve TCP-friendliness? • Incrementally deployable (no router support required)

  23. High Speed TCP Problem in steady state: • TCP response function: w ≈ 1.2/√p • A large congestion window therefore requires a very low loss rate Problem in recovery: • Congestion avoidance takes too long to recover (consecutive timeouts)
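
The loss-rate problem can be made concrete with the response function above; the 10 Gbps / 100 ms / 1500-byte figures below are my own example numbers, chosen to match the single-connection goal:

```c
/* Loss rate a standard TCP connection would need in order to sustain
 * 10 Gbps with 1500-byte packets and a 100 ms RTT, using the response
 * function w = 1.2 / sqrt(p), i.e. p = (1.2 / w)^2. */
#include <stdio.h>

int main(void)
{
    double rate_bps = 10e9, rtt_s = 0.1, pkt_bytes = 1500.0;

    double w = rate_bps * rtt_s / (8.0 * pkt_bytes);   /* window in packets */
    double p = (1.2 / w) * (1.2 / w);                  /* required loss rate */

    printf("window ~ %.0f packets, required loss rate ~ %.1e\n", w, p);
    /* ~83,000 packets and ~2e-10: on the order of one loss every couple
     * of hours of full-rate transfer, which is unrealistically low. */
    return 0;
}
```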

  24.-27. Consecutive Time-out: figure-only slides showing how long standard congestion avoidance takes to recover after consecutive timeouts.

  28. High Speed TCP Change the TCP response function: • When p is high (above the maxP corresponding to the default cwnd size W): behave as standard TCP • When p is low (cwnd >= W): use a(w), b(w) instead of the constants a, b when adjusting cwnd • For a given loss rate P and desired window size W1 at P: derive a(w) and b(w), keeping the response linear on a log-log scale (Δlog W proportional to Δlog P)

  29.-32. Change TCP Function: equation and plot slides. Standard TCP uses the constants a = 1 (additive increase per RTT) and b = 1/2 (multiplicative decrease on loss), which give the response function w ≈ 1.2/√p; HighSpeed TCP replaces them with a(w) and b(w).

  33. Expectations • Achieve a large window with a realistic loss rate • Relative fairness between standard TCP and HighSpeed TCP (acquired bandwidth scales with cwnd) • Moderate decrease instead of halving the window when congestion is detected (decrease factor about 0.33 at cwnd = 1000) • Pre-computed lookup table to implement a(w) and b(w) (sketched below)
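
A minimal sketch of that pre-computed lookup, interpolating b(w) and the target loss rate p(w) linearly in log space between standard-TCP behaviour at small windows and the high-speed operating point. The endpoint constants (w from 38 to 83,000, b from 0.5 to 0.1, p from about 1e-3 to 1e-7) and the relation a(w) = w^2 * p(w) * 2b(w)/(2 - b(w)) follow my reading of Floyd's proposal and should be checked against the draft before use.

```c
/* Sketch of pre-computing HighSpeed TCP's a(w) and b(w).  Endpoint
 * constants are assumptions taken from my reading of Floyd's proposal.
 * Compile with -lm. */
#include <math.h>
#include <stdio.h>

#define LOW_W   38.0       /* below this window, plain standard TCP */
#define HIGH_W  83000.0    /* target window for 10 Gbps */
#define LOW_P   1e-3       /* standard TCP loss rate at LOW_W */
#define HIGH_P  1e-7       /* target loss rate at HIGH_W */
#define HIGH_B  0.1        /* decrease factor at HIGH_W */

static double frac(double w)        /* position of w on the log scale */
{
    return (log(w) - log(LOW_W)) / (log(HIGH_W) - log(LOW_W));
}

static double b_of_w(double w)      /* decrease factor, linear in log w */
{
    return 0.5 + frac(w) * (HIGH_B - 0.5);
}

static double a_of_w(double w)      /* per-RTT increase in segments */
{
    double p = exp(log(LOW_P) + frac(w) * (log(HIGH_P) - log(LOW_P)));
    double b = b_of_w(w);
    return w * w * p * 2.0 * b / (2.0 - b);
}

int main(void)
{
    /* Pre-compute once; a real stack indexes a table by cwnd on each ACK
     * instead of calling log()/exp() in the fast path. */
    const double sizes[] = { 38, 1000, 10000, 83000 };
    for (int i = 0; i < 4; i++)
        printf("w = %6.0f  a(w) = %5.1f  b(w) = %4.2f\n",
               sizes[i], a_of_w(sizes[i]), b_of_w(sizes[i]));
    return 0;
}
```

With these constants the table reproduces standard TCP at small windows (a ≈ 1, b ≈ 0.5 at w = 38) and the slide's decrease factor of about 0.33 at w = 1000.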

  34. Slow Start Modification of slow start: • Problem: doubling cwnd every RTT is too aggressive once cwnd is large • Proposal: limit Δcwnd per RTT during slow start

  35. Limited Slow Start For each ACK: • cwnd <= max_ssthresh: Δcwnd = MSS (standard TCP slow start) • cwnd > max_ssthresh: Δcwnd = 0.5 * max_ssthresh / cwnd (at most max_ssthresh/2 per RTT); a sketch follows
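
Here is the per-ACK rule above turned into a small simulation, working in units of segments; the choice of max_ssthresh = 100 segments is just an example value:

```c
/* Limited slow start per the rule on the slide: in standard slow start
 * every ACK adds one segment (cwnd doubles per RTT); above max_ssthresh
 * the per-ACK increment shrinks so cwnd grows by at most max_ssthresh/2
 * segments per RTT. */
#include <stdio.h>

static double ss_increment(double cwnd, double max_ssthresh)
{
    if (cwnd <= max_ssthresh)
        return 1.0;                        /* standard slow start */
    return 0.5 * max_ssthresh / cwnd;      /* limited slow start */
}

int main(void)
{
    double cwnd = 1.0;
    const double max_ssthresh = 100.0;     /* example value, in segments */

    /* One RTT delivers roughly cwnd ACKs. */
    for (int rtt = 1; rtt <= 12; rtt++) {
        int acks = (int)cwnd;
        for (int i = 0; i < acks; i++)
            cwnd += ss_increment(cwnd, max_ssthresh);
        printf("after RTT %2d: cwnd = %6.1f segments\n", rtt, cwnd);
    }
    return 0;
}
```

The growth is exponential up to max_ssthresh and roughly linear (about 50 segments per RTT here) beyond it, instead of doubling all the way up.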

  36. Related Projects • Cray Research ('92) • CASA Testbed ('94) • Duke ('99) • Pittsburgh Supercomputing Center • Portland State Univ. ('00) • Internet2 ('01) • Web100 • Net100 (built on Web100)

  37. Cray Research '92 • TCP/IP Performance at Cray Research (Dave Borman) Configuration: • HIPPI between two dedicated Y-MPs with Model E IOS and Unicos 8.0 • Memory-to-memory transfer Results: • Direct channel-to-channel: 64 KB MTU: 781 Mbps • Through a HIPPI switch: 33 KB MTU: 416 Mbps; 49 KB MTU: 525 Mbps; 64 KB MTU: 605 Mbps

  38. CASA Testbed '94 Applied Network Research, San Diego Supercomputer Center + UCSD • Goal: delay and loss characteristics of a HIPPI-based gigabit testbed • Link feature: blocking (HIPPI); tradeoff between high loss rate and high delay • Conclusion: avoiding packet loss is more important than reducing delay • Performance (delay * bandwidth = 2 MB; RFC 1323 on; Cray machines): 500 Mbps sustained TCP throughput (TTCP/Netperf)

  39. Trapeze/IP (Duke) Goal: • What optimization is most useful to reduce host overheads for fast TCP? • How fast does TCP really go, at what cost? Approaches: • Zero-Copy • Checksum offloading Result: • >900Mbps for MTU>8K

  40.-43. Trapeze/IP (Duke): figure-only slides on the zero-copy data path (www.cs.duke.edu/ari/publications/talks/freebsdcon)

  44. Enabling High Performance Data Transfers on Hosts (Pittsburgh Supercomputing Center) • Enable RFC 1191 Path MTU Discovery • Enable RFC 1323 large windows • OS kernel: allow large enough socket buffers • Application: set its send and receive socket buffer sizes (a sketch follows) • The page gives detailed tuning methods for various OSes
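
On the application side, the buffer-size step comes down to a pair of setsockopt() calls made before connecting; the 4 MB value below is only illustrative, and the kernel may clamp it to its configured maximum (which is what the kernel-tuning step addresses):

```c
/* Ask for large send/receive socket buffers so the kernel can keep a
 * full bandwidth-delay product in flight on a long fat pipe. */
#include <stdio.h>
#include <sys/socket.h>

int main(void)
{
    int buf = 4 * 1024 * 1024;                  /* 4 MB, illustrative */
    int sock = socket(AF_INET, SOCK_STREAM, 0);

    if (sock < 0) { perror("socket"); return 1; }
    if (setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &buf, sizeof(buf)) < 0)
        perror("SO_SNDBUF");
    if (setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &buf, sizeof(buf)) < 0)
        perror("SO_RCVBUF");

    /* Report what the kernel actually granted. */
    socklen_t len = sizeof(buf);
    getsockopt(sock, SOL_SOCKET, SO_RCVBUF, &buf, &len);
    printf("effective receive buffer: %d bytes\n", buf);
    return 0;
}
```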

  45. PSU Experiment Goal: • Round-trip delay and TCP throughput with different window sizes • Influence of different devices (Cisco 3508/3524/5500) and different NICs Environment: • OS: FreeBSD 4.0/4.1 (without RFC 1323?), Linux, Solaris • WAN: 155 Mbps OC-3 SONET MAN • Measurement tools: Ping + TTCP

  46. PSU Experiment • "Smaller" switches and low-end routers can easily muck things up • Bugs in Linux 2.2 kernels • Different NICs have different performance • A fast PCI bus (64-bit, 66 MHz) is necessary • Switch MTU size can make a difference (giant packets are better) • Bigger TCP window sizes can help, but there seems to be a knee around 4 MB that is not remarked upon in the literature

  47. Internet-2 Experiment Goal: a single TCP connection at 700-800 Mbps over the WAN; relations among window size, MTU, and throughput (back-to-back and over the WAN) • OS: FreeBSD 4.3 release • Architecture: 64-bit, 66 MHz PCI + … • Configuration: sendspace = recvspace = 102400 • Setup: direct connection (back-to-back) and WAN • WAN: symmetric path: host1 - Abilene - host2 • Measurement: Ping + Iperf

  48. Internet-2 Experiment Back-to-back: • No loss • Found some bugs in FreeBSD 4.3 WAN: • <= 200 Mbps • Asymmetry between the two directions (MTU caching…)

  49. Web100 • Goal: make it easy for non-experts to achieve high bandwidth • Method: expose more information from TCP • Software: measurement instrumentation embedded in the kernel TCP; application layer: diagnostics / auto-tuning • Proposal: TCP MIB (RFC 2012)

  50. Net100 • Built on Web100 • Auto-tunes parameters for non-experts • Network-aware OS • Bulk file transfer for ORNL • Implementation of Floyd's HighSpeed TCP
