
EE384Y: Packet Switch Architectures Part II Scaling Crossbar Switches

EE384Y: Packet Switch Architectures Part II Scaling Crossbar Switches. Nick McKeown, Professor of Electrical Engineering and Computer Science, Stanford University. nickm@stanford.edu http://www.stanford.edu/~nickm


Presentation Transcript


  1. EE384Y: Packet Switch Architectures Part II Scaling Crossbar Switches Nick McKeown Professor of Electrical Engineering and Computer Science, Stanford University nickm@stanford.edu http://www.stanford.edu/~nickm

  2. Outline Up until now, we have focused on high performance packet switches with: • A crossbar switching fabric, • Input queues (and possibly output queues as well), • Virtual output queues, and • Centralized arbitration/scheduling algorithm. Today we’ll talk about the implementation of the crossbar switch fabric itself. How are they built, how do they scale, and what limits their capacity?

  3. Crossbar switch: Limiting factors • N^2 crosspoints per chip, or N x (N-to-1) multiplexors. • It’s not obvious how to build a crossbar from multiple chips. • Capacity of “I/O”s per chip. • State of the art: about 300 pins, each operating at 3.125 Gb/s, or roughly 1 Tb/s per chip. • About 1/3 to 1/2 of this capacity is available in practice because of overhead and speedup. • Crossbar chips today are limited by “I/O” capacity.
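
The I/O figures on this slide can be checked with a few lines of arithmetic. This is only a sketch using the numbers quoted above; the variable names are illustrative and not part of the lecture.

```python
# Figures quoted on the slide.
PINS_PER_CHIP = 300
GBPS_PER_PIN = 3.125

# Aggregate raw I/O rate of one chip, in Gb/s (the slide rounds this to ~1 Tb/s).
raw_gbps = PINS_PER_CHIP * GBPS_PER_PIN

# The slide says roughly 1/3 to 1/2 of the raw rate survives overhead and speedup.
usable_low_gbps = raw_gbps / 3
usable_high_gbps = raw_gbps / 2

print(f"raw I/O rate: {raw_gbps} Gb/s")
print(f"usable range: {usable_low_gbps} to {usable_high_gbps} Gb/s")
```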

  4. Scaling number of outputs: Trying to build a crossbar from multiple chips [Figure labels: 16x16 crossbar switch; building block with 4 inputs and 4 outputs.] Eight inputs and eight outputs required!

  5. Scaling line-rate: Bit-sliced parallelism • Cell is “striped” across multiple identical planes. • Crossbar switched “bus”. • Scheduler makes same decision for all slices. [Figure: a linecard striping each cell across k crossbar planes (planes numbered 1 to 8), with a single scheduler.]
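
A minimal sketch of bit-slicing, assuming byte-granularity striping for readability (real hardware would stripe bits or small words). The function names `stripe`, `switch_bit_sliced` and `reassemble` are hypothetical. The point it illustrates is that one scheduling decision is reused by every plane.

```python
def stripe(cell: bytes, k: int) -> list[bytes]:
    """Split one cell into k equal slices, one slice per crossbar plane."""
    if k <= 0 or len(cell) % k != 0:
        raise ValueError("cell length must be a positive multiple of k")
    width = len(cell) // k
    return [cell[p * width:(p + 1) * width] for p in range(k)]


def switch_bit_sliced(cell: bytes, k: int, decision: tuple[int, int]):
    """Send slice p through plane p; every plane applies the same
    (input port, output port) decision made once by the scheduler."""
    return [(plane, decision, piece) for plane, piece in enumerate(stripe(cell, k))]


def reassemble(pieces: list[bytes]) -> bytes:
    """Rebuild the cell at the output linecard from the k slices, in plane order."""
    return b"".join(pieces)


sliced = switch_bit_sliced(b"ABCDEFGH", 4, (2, 5))
restored = reassemble([piece for _, _, piece in sliced])
```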

  6. Scaling line-rate: Time-sliced parallelism • Cell carried by one plane; takes k cell times. • Scheduler is unchanged. • Scheduler makes decision for each slice in turn. [Figure: a linecard sending whole cells to k crossbar planes in turn (planes numbered 1 to 8), with a single scheduler.]
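
A minimal sketch of time-slicing, assuming one cell arrives per cell time and planes are picked round-robin (the slide only says the scheduler decides "for each slice in turn", so round-robin is an assumption). The function name `assign_planes` is hypothetical. Each plane runs at 1/k of the line rate, so a cell occupies its plane for k cell times.

```python
def assign_planes(num_cells: int, k: int) -> list[tuple[int, int, int]]:
    """Return (cell index, plane, finish time) for each cell.

    Cell i arrives at time i and is sent whole to plane i mod k.
    A plane needs k cell times to carry one cell.
    """
    plane_free_at = [0] * k
    schedule = []
    for cell in range(num_cells):
        plane = cell % k
        start = max(cell, plane_free_at[plane])  # wait if the plane is still busy
        finish = start + k
        plane_free_at[plane] = finish
        schedule.append((cell, plane, finish))
    return schedule


schedule = assign_planes(6, 3)
```

With k planes and round-robin assignment, each plane becomes free exactly when its next cell arrives, so no cell has to wait in this idealized model.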

  7. Scaling a crossbar • Conclusion: scaling the capacity is relatively straightforward (although the chip count and power may become a problem). • What if we want to increase the number of ports? • Can we build a crossbar-equivalent from multiple stages of smaller crossbars? • If so, what properties should it have?

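  8. 3-stage Clos Network [Figure: N inputs feed m first-stage n x k switches; these connect to k middle-stage m x m switches, which connect to m third-stage k x n switches serving N outputs.] N = n x m, k >= n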

  9. With k = n, is a Clos network non-blocking like a crossbar? Consider the example: scheduler chooses to match (1,1), (2,4), (3,3), (4,2)

  10. With k = n is a Clos network non-blocking like a crossbar? Consider the example: scheduler chooses to match (1,1), (2,2), (4,4), (5,3), … By rearranging matches, the connections could be added. Q: Is this Clos network “rearrangeably non-blocking”?

  11. With k = n a Clos network is rearrangeably non-blocking Routing matches is equivalent to edge-coloring in a bipartite multigraph. Each vertex corresponds to an n x k or k x n switch, and each edge to one matched input-output pair, e.g. (1,1), (2,4), (3,3), (4,2). Colors correspond to middle-stage switches. No two edges at a vertex may be colored the same. Vizing ‘64: a D-degree bipartite graph can be edge-colored with D colors. Therefore, if k = n, a 3-stage Clos network is rearrangeably non-blocking (and can therefore perform any permutation).
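
The edge-coloring view can be sketched in code. The version below uses the textbook alternating-path (color-swap) method, chosen here for brevity; it is not one of the two methods compared on the next slide. The function name `route_clos` and the 0-based port numbering are assumptions for illustration.

```python
def route_clos(matching, n, m):
    """Pick a middle-stage switch for every (in_port, out_port) pair.

    Ports are 0-based; port p sits on first/third-stage switch p // n.
    Assumes k = n middle switches and at most n pairs per switch.
    Returns one middle-stage index (0..n-1) per pair, in input order.
    """
    k = n  # colors = middle-stage switches
    conns = [(i // n, o // n) for i, o in matching]
    at_in = [[None] * k for _ in range(m)]   # at_in[a][c]: pair using middle c at input switch a
    at_out = [[None] * k for _ in range(m)]  # at_out[b][c]: pair using middle c at output switch b
    middle = [None] * len(conns)

    for e, (a, b) in enumerate(conns):
        ca = at_in[a].index(None)    # a color still free at input switch a
        cb = at_out[b].index(None)   # a color still free at output switch b
        if at_out[b][ca] is not None:
            # ca is busy at b: collect the ca/cb alternating path starting at b.
            path, node, on_out, color = [], b, True, ca
            while True:
                f = (at_out if on_out else at_in)[node][color]
                if f is None:
                    break
                path.append(f)
                node = conns[f][0] if on_out else conns[f][1]
                on_out = not on_out
                color = cb if color == ca else ca
            for f in path:           # clear the old entries first
                x, y = conns[f]
                at_in[x][middle[f]] = None
                at_out[y][middle[f]] = None
            for f in path:           # then rearrange: swap ca and cb along the path
                x, y = conns[f]
                middle[f] = cb if middle[f] == ca else ca
                at_in[x][middle[f]] = f
                at_out[y][middle[f]] = f
        middle[e] = ca               # ca is now free at both a and b
        at_in[a][ca] = e
        at_out[b][ca] = e
    return middle


# The matching from the slide, (1,1), (2,4), (3,3), (4,2), rewritten 0-based.
slide_example = route_clos([(0, 0), (1, 3), (2, 2), (3, 1)], n=2, m=2)
```

The color swap along the path is exactly the "rearrangement" of existing connections: earlier pairs may move to a different middle-stage switch so that the new pair fits.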

  12. How complex is the rearrangement? • Method 1: Find a maximum size bipartite matching for each of D colors in turn, O(D N^2.5). • Method 2: Partition graph into Euler sets, O(N log D) [Cole et al. ‘00].

  13. Edge-Coloring using Euler sets • Make the graph regular: Modify the graph so that every vertex has the same degree, D. [combine vertices and add edges; O(E)]. • For D = 2^i, perform i “Euler splits” and 1-color each resulting graph. This is log D operations, each of O(E).

  14. Euler partition of a graph • Euler partition of graph G: a partition of the edges of G into open and closed paths such that: • Each odd degree vertex is at the end of one open path. • Each even degree vertex is at the end of no open path.

  15. Euler split of a graph • Euler split of G into G1 and G2: • Scan each path in an Euler partition. • Place the edges of each path alternately into G1 and G2. [Figure: a graph G and the two halves G1 and G2 produced by the split.]

  16. Edge-Coloring using Euler sets • Make the graph regular: Modify the graph so that every vertex has the same degree, D. [combine vertices and add edges; O(E)]. • For D = 2^i, perform i “Euler splits” and 1-color each resulting graph. This is log D operations, each of O(E).
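
A minimal sketch of coloring by repeated Euler splits, assuming the graph is already regular with degree D = 2^i (the "make the graph regular" step is not shown). The function names `euler_split` and `color_edges` are hypothetical. Because every degree is even, the Euler partition contains only closed paths. In a bipartite graph, alternating edges along a closed path is the same as sorting edges by the direction in which they are walked.

```python
def euler_split(edges):
    """Split a bipartite multigraph with all-even degrees into two halves.

    edges: list of (left vertex, right vertex). Returns two lists of edge
    indices; every vertex keeps exactly half of its degree in each list.
    """
    adj = {}
    for e, (u, v) in enumerate(edges):
        adj.setdefault(("L", u), []).append(e)
        adj.setdefault(("R", v), []).append(e)
    used = [False] * len(edges)
    halves = ([], [])
    for start in list(adj):
        node = start
        while True:
            while adj[node] and used[adj[node][-1]]:
                adj[node].pop()              # skip edges already walked
            if not adj[node]:
                break                        # closed path is complete
            e = adj[node].pop()
            used[e] = True
            u, v = edges[e]
            if node[0] == "L":               # walked left-to-right: first half
                halves[0].append(e)
                node = ("R", v)
            else:                            # walked right-to-left: second half
                halves[1].append(e)
                node = ("L", u)
    return halves


def color_edges(edges, degree):
    """Edge-color a regular bipartite multigraph whose degree is a power of two."""
    colors = [None] * len(edges)

    def recurse(ids, d, first_color):
        if d == 1:                           # a perfect matching: one color
            for e in ids:
                colors[e] = first_color
            return
        h0, h1 = euler_split([edges[e] for e in ids])
        recurse([ids[j] for j in h0], d // 2, first_color)
        recurse([ids[j] for j in h1], d // 2, first_color + d // 2)

    recurse(list(range(len(edges))), degree, 0)
    return colors


# 3 + 3 vertices, degree 4, with parallel edges (a multigraph).
demo_edges = [(u, (u + j) % 3) for u in range(3) for j in range(4)]
demo_colors = color_edges(demo_edges, 4)
```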

  17. Implementation [Figure: the Scheduler turns the request graph into a permutation; “Route connections” turns the permutation into paths.]

  18. Implementation Pros: • A rearrangeably non-blocking switch can perform any permutation. • A cell switch is time-slotted, so all connections are rearranged every time slot anyway. Cons: • Rearrangement algorithms are complex (in addition to the scheduler). Can we eliminate the need to rearrange?

  19. Strictly non-blocking Clos Network Clos’ Theorem: If k >= 2n – 1, then a new connection can always be added without rearrangement.

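  20. [Figure: 3-stage Clos network with first-stage n x k switches I1 … Im, middle-stage m x m switches M1 … Mk, and third-stage k x n switches O1 … Om; N inputs and N outputs.] N = n x m, k >= n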

  21. Clos Theorem • Consider adding the n-th connection between 1st stage Ia and 3rd stage Ob. • We need to ensure that there is always some center-stage M available. • If k > (n – 1) + (n – 1), then there is always an M available, i.e. we need k >= 2n – 1. [Figure: Ia and Ob each have n ports and k links to the middle stage; n – 1 links are already in use at the input and at the output.]
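
The counting argument can be checked with a small sketch. It builds the worst case from the slide, in which the n – 1 middle switches busy at Ia are all different from the n – 1 busy at Ob, and then looks for a free one. The function names `free_middle` and `worst_case` are hypothetical.

```python
def free_middle(k, busy_at_input, busy_at_output):
    """Return a middle switch used by neither Ia nor Ob, or None if all are taken."""
    for mid in range(k):
        if mid not in busy_at_input and mid not in busy_at_output:
            return mid
    return None


def worst_case(n, k):
    """Try to add the n-th connection when Ia and Ob already carry n - 1
    connections each, routed through disjoint sets of middle switches."""
    busy_in = set(range(n - 1))               # middles 0 .. n-2 in use at Ia
    busy_out = set(range(n - 1, 2 * n - 2))   # middles n-1 .. 2n-3 in use at Ob
    return free_middle(k, busy_in, busy_out)


n = 4
with_enough = worst_case(n, 2 * n - 1)   # k = 2n - 1: one middle switch is left
one_short = worst_case(n, 2 * n - 2)     # k = 2n - 2: blocked without rearranging
```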

  22. Scaling Crossbars: Summary • Scaling capacity through parallelism (bit-slicing and time-slicing) is straightforward. • Scaling number of ports is harder… • Clos network: • Rearrangeably non-blocking with k = n, but routing is complicated, • Strictly non-blocking with k >= 2n – 1, so routing is simple. But requires more bisection bandwidth.
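
A rough sketch of the trade-off in the summary, counting parts directly from the 3-stage structure (m switches of n x k, k switches of m x m, m switches of k x n). The function name `clos_cost` and the choice of metrics are illustrative assumptions; the lecture does not give these formulas.

```python
def clos_cost(n, m, k):
    """Count ports, inter-stage links and crosspoints of a 3-stage Clos network."""
    return {
        "ports": n * m,
        # Each of the m first-stage switches has one link to each of the k middles.
        "links_into_middle_stage": m * k,
        # Two outer stages of m switches with n*k crosspoints, plus k middles of m*m.
        "crosspoints": 2 * m * n * k + k * m * m,
    }


n, m = 4, 4
rearrangeable = clos_cost(n, m, k=n)          # k = n
strict = clos_cost(n, m, k=2 * n - 1)         # k = 2n - 1
```

For n = m = 4 this gives 16 links into the middle stage with k = n and 28 with k = 2n – 1, which is the extra internal bandwidth the summary refers to.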
