A Load-Balanced Switch with an Arbitrary Number of Linecards
Isaac Keslassy, Shang-Tse (Da) Chuang, Nick McKeown
Stanford University
Stanford 100 Tb/s Router
• “Optics in Routers” project
• http://yuba.stanford.edu/or/
• Some challenging numbers:
  • 100 Tb/s
  • R = 160 Gb/s linecard rate
  • N = 640 linecards
  • Performance guarantees
Router Wish List
Scale to High Linecard Speeds
• No Centralized Scheduler
• Optical Switch Fabric
• Low Packet-Processing Complexity
Scale to High Number of Linecards
• High Number of Linecards
• Arbitrary Arrangement of Linecards
Provide Performance Guarantees
• 100% Throughput Guarantee
• Delay Guarantee
• No Packet Reordering
Load-Balanced Switch
[Figure: three linecards (1, 2, 3), each with an input and an output of rate R. A load-balancing mesh is followed by a forwarding mesh; every mesh link runs at rate R/N.]
Combining the Two Meshes
[Figure: the In and Out ports that belong to one linecard are drawn together; inputs and outputs run at rate R and each mesh link at rate R/N.]
A Single Combined Mesh
[Figure: the two meshes merged into one mesh between linecards (In/Out at rate R); each link of the combined mesh runs at rate 2R/N.]
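Why 2R/N is enough on each combined link can be checked numerically. The sketch below is an illustration, not code from the talk: it assumes each linecard spreads its arrivals uniformly over all N linecards (stage 1) and that each linecard then forwards what it holds to the final outputs (stage 2). The function name `combined_mesh_loads` and the sample traffic matrix are hypothetical.

```python
def combined_mesh_loads(traffic, R):
    """Load on every link i -> k of the single combined mesh.

    traffic[i][j] is the rate from input i to output j.
    Assumption (illustrative): uniform spreading in both stages.
    """
    n = len(traffic)
    row = [sum(traffic[i]) for i in range(n)]                        # arrivals at input i
    col = [sum(traffic[i][j] for i in range(n)) for j in range(n)]   # traffic bound for output j
    # Admissible traffic: no input or output is oversubscribed.
    assert all(r <= R + 1e-9 for r in row + col), "traffic is not admissible"
    # Link i -> k carries 1/N of input i's arrivals (load balancing)
    # plus 1/N of everything destined to output k (forwarding).
    return [[(row[i] + col[k]) / n for k in range(n)] for i in range(n)]


R = 160.0  # Gb/s, linecard rate from the slides
# Hypothetical worst-looking pattern: every input sends all traffic to one output.
traffic = [
    [0, R, 0, 0],
    [R, 0, 0, 0],
    [0, 0, 0, R],
    [0, 0, R, 0],
]
loads = combined_mesh_loads(traffic, R)
print(max(max(r) for r in loads), "<=", 2 * R / len(traffic))
```

Under these assumptions the load on link i to k is (row sum of i + column sum of k)/N, so it never exceeds 2R/N for any admissible traffic matrix.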
References on Early Work
• Initial Work
  • C.-S. Chang, D.-S. Lee and Y.-S. Jou, "Load Balanced Birkhoff-von Neumann Switches, Part I: One-Stage Buffering," Computer Communications, Vol. 25, pp. 611-622, 2002.
• Sigcomm '03
  • I. Keslassy, S.-T. Chuang, K. Yu, D. Miller, M. Horowitz, O. Solgaard and N. McKeown, "Scaling Internet Routers Using Optics," ACM SIGCOMM '03, Karlsruhe, Germany, August 2003.
Example: N = 8
[Figure: eight linecards, numbered 1 to 8, connected by a single combined mesh; each link runs at rate 2R/8.]
When N is Too Large: Decompose into groups (or racks)
[Figure: the eight linecards, numbered 1 to 8, arranged into groups. Labels: 2R per linecard, 4R per group, 4R/4 between groups.]
When N is Too Large: Decompose into groups (or racks)
[Figure: Group/Rack 1 through Group/Rack G, each holding linecards 1, 2, ..., L. Labels: 2R per linecard, 2RL per group, 2RL/G on each link between groups.]
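To get a feel for what grouping buys, the sketch below compares the flat mesh with the grouped mesh using the link-rate labels from the figures (2R/N versus 2RL/G). It is illustrative only: the helper name `mesh_vs_groups` is hypothetical, and the choice G = 40 for the 640-linecard router is an assumption borrowed from the simulation setup mentioned later in the deck, with all groups assumed to be the same size.

```python
def mesh_vs_groups(N, R, G):
    """Compare a flat linecard mesh with a mesh between G equal-size groups.

    N: number of linecards, R: linecard rate (Gb/s), G: number of groups.
    Assumes (for illustration) that G divides N, so every group has L = N/G linecards.
    """
    if N % G != 0:
        raise ValueError("this sketch assumes equal-size groups")
    L = N // G
    return {
        "L": L,
        "linecard_link_gbps": 2 * R / N,   # flat mesh: one link per linecard pair
        "linecard_links": N * N,           # directed pairs, including a linecard to itself
        "group_link_gbps": 2 * R * L / G,  # grouped mesh: one link per group pair
        "group_links": G * G,
    }


stanford = mesh_vs_groups(N=640, R=160, G=40)
print(stanford)
```

With these assumed numbers the flat mesh would need 409,600 links of 0.5 Gb/s each, while the grouped mesh needs 1,600 links of 128 Gb/s each.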
When Linecards are Missing: Failures, Incremental Additions, and Removals…
[Figure: Group/Rack 1 through Group/Rack G, each holding linecards 1, 2, ..., L. Labels: 2R per linecard, 2RL per group, 2RL/G on each mesh link. The mesh of 2RL/G links is drawn as a sum of permutations (permutation + permutation + permutation = mesh).]
• Solution: replace mesh with sum of permutations
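The idea of replacing the mesh with a sum of permutations can be illustrated for the uniform case. The sketch below is an assumption-laden example, not the construction from the talk: it uses the G cyclic shifts as the permutations (the slide does not say which permutations are used), gives each one the rate 2RL/G, and shows that their sum reproduces the full group-to-group mesh.

```python
def cyclic_permutations(G):
    """The G cyclic shifts: under shift s, group g connects to group (g + s) mod G."""
    return [[(g + s) % G for g in range(G)] for s in range(G)]


def mesh_from_permutations(G, rate_per_permutation):
    """Sum the G cyclic permutations, each carrying the same rate, into a G x G rate matrix."""
    mesh = [[0.0] * G for _ in range(G)]
    for perm in cyclic_permutations(G):
        for src, dst in enumerate(perm):
            mesh[src][dst] += rate_per_permutation
    return mesh


G, L, R = 4, 2, 160.0          # hypothetical small configuration
mesh = mesh_from_permutations(G, 2 * R * L / G)
for row in mesh:
    print(row)                  # every group pair gets 2RL/G; each row sums to 2RL
```

Each permutation is a pattern that a crossbar-style switch can realise on its own, which is why the decomposition matters once the mesh is no longer uniform.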
Hybrid Electro-Optical Architecture: Using MEMS Switches
[Figure: Group/Rack 1 through Group/Rack G, each holding linecards 1, 2, ..., L with 2R links (electronics), interconnected through MEMS switches (optics).]
When Linecards are Missing
[Figure: the same hybrid architecture, Group/Rack 1 through Group/Rack G with 2R linecard links and MEMS switches between the groups.]
Questions
• Number of MEMS Switches?
• TDM Schedule?
All Link Capacities Are Equal
[Figure: Group/Rack 1 through Group/Rack G with 2R linecard links. Lasers/modulators at wavelengths λ1, λ2, ..., λ64 feed a MUX onto one link; every link through a MEMS switch carries ≤ 2R.]
• Link capacity ≈ 64 λ's × 5 Gb/s/λ = 320 Gb/s = 2R
Example: 2 Groups of 2 Linecards
[Figure: Group/Rack 1 and Group/Rack 2, each with linecards 1 and 2. Labels: 2R per linecard, 4R per group.]
Intuition on Worst-Case
[Figure: Group/Rack 1, Group/Rack 2, ..., Group/Rack G connected through MEMS switches. Labels: 2R per linecard, 2RL at Group/Rack 1, ≤ 2R on each MEMS link, and the counts L, G-1 and 1.]
Number of MEMS Switches
• Theorem: M ≤ L + G - 1
• Examples:
TDM Schedule
[Figure: Group A and Group B, each with linecards 1 and 2. Labels: 2R per linecard, 4R per group.]
TDM Schedule
[Figure: schedule table with rows "Tx Group A" and "Tx Group B".]
Bad TDM Schedule
[Figure: schedule table with rows "Tx Group A" and "Tx Group B".]
TDM Schedule Algorithm
• Intuition
  • Create TDM schedule between groups: “Group A sends to group B”
  • Assign group connections to specific linecards: “Linecard A1 sends to linecard B3”
• Theorem: There exists a polynomial-time algorithm to find a correct TDM schedule.
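The sketch below is not the polynomial-time algorithm from the theorem. It is only a checker for a candidate schedule, written to make the two levels in the intuition concrete for the two-groups-of-two example. The constraints it tests are assumptions: (1) in each slot every linecard sends to one linecard and receives from one, (2) over a frame every linecard reaches every linecard exactly once, and (3) in each slot the number of connections between two groups does not exceed a given capacity (taken here as one per group pair). The name `check_tdm_schedule` and both sample schedules are hypothetical.

```python
from collections import Counter


def check_tdm_schedule(schedule, group_of, capacity):
    """Return a list of violations; an empty list means the schedule passes.

    schedule[t][tx] is the linecard that linecard tx sends to in slot t.
    group_of[i] is the group of linecard i.
    capacity[(ga, gb)] is the assumed per-slot connection limit from group ga to gb.
    """
    n = len(group_of)
    problems = []
    for t, slot in enumerate(schedule):
        if sorted(slot) != list(range(n)):
            problems.append(f"slot {t}: not a permutation of the linecards")
            continue
        # Group-level view of this slot: "Group A sends to group B".
        used = Counter((group_of[tx], group_of[rx]) for tx, rx in enumerate(slot))
        for pair, count in used.items():
            if count > capacity.get(pair, 0):
                problems.append(f"slot {t}: {count} connections {pair[0]}->{pair[1]}")
    # Linecard-level view of the frame: every pair served exactly once.
    seen = Counter((tx, rx) for slot in schedule for tx, rx in enumerate(slot))
    if any(seen[(tx, rx)] != 1 for tx in range(n) for rx in range(n)):
        problems.append("frame: some linecard pair is not served exactly once")
    return problems


group_of = ["A", "A", "B", "B"]                       # linecards A1, A2, B1, B2
capacity = {(a, b): 1 for a in "AB" for b in "AB"}    # assumed: one connection per group pair per slot
good = [[0, 2, 1, 3], [1, 3, 0, 2], [2, 0, 3, 1], [3, 1, 2, 0]]
cyclic = [[(tx + t) % 4 for tx in range(4)] for t in range(4)]
print(check_tdm_schedule(good, group_of, capacity))
print(check_tdm_schedule(cyclic, group_of, capacity))
```

The plain cyclic schedule serves every linecard pair once per frame but fails the assumed group-level constraint: in its first slot both linecards of group A send within group A.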
Algorithm Running Time
[Plot: running time in milliseconds versus number of linecards, with worst-case, average-case and best-case curves.]
[Verilog simulation, linecard placement generated uniformly at random among 40 groups, 4 ns clock cycle, 1000 runs per case. Source: Srikanth Arekapudi]
Open Questions
• Greedy TDM algorithm with more capacity?
• A better switch fabric architecture?