Switch Architectures: Input Queued, Output Queued, Combined Input and Output Queued
Outline • I. Introduction • II. System Model • III. The Least Cushion First/Most Urgent First Algorithm • IV. Conclusion
Ⅰ. Introduction • Exponential growth of Internet traffic demands large scale switches • Common Switch Architectures • Output Queued • High performance • Easier to provide QoS guarantee • Has serious scaling problem • Input Queued • More scalable • Suffers from HOL blocking • Virtual Output Queues can improve performance • Difficult to provide QoS guarantee
[Figure: Output Queued, Shared Bus. Input ports 1 to 4 connected to output ports 1 to 4 over a shared bus]
[Figure: Output Queued, Shared Memory. Input ports 1 to 4 and output ports 1 to 4 connected through a common memory]
[Figure: Input Queued. Input ports 1 to 4, each with a single queue, feeding output ports 1 to 4]
[Figure: Input Queued with VOQ. Each of input ports 1 to 4 keeps a separate queue for each of output ports 1 to 4]
Ⅰ. Introduction • Memory bandwidth requirements for three common switch architectures (S: link speed, N: switch size, N×N) • Input queueing is necessary! • The switch fabric can be sped up to improve performance: CIOQ switch
Ⅰ. Introduction • Matching algorithms for performance improvement
Ⅰ. Introduction • Exact Emulation: under identical input traffic, the departure time of every cell is the same in the CIOQ switch and in the OQ switch. [Figure: a CIOQ switch and the emulated OQ switch, each with inputs 1 to N and outputs 1 to N, receive identical input traffic and produce identical departure patterns]
Ⅰ. Introduction • We propose a new scheduling algorithm called the least cushion first / most urgent first (LCF/MUF) algorithm • O(N) complexity with parallel comparators • Exactly emulates an OQ switch with a speedup of 2 times • No constraint on service discipline
Ⅱ. System Model [Figure: CIOQ switch whose switching fabric runs with speedup = 2]
Ⅱ. System Model • The switch fabric is sped up by a factor of 2 • There are 2 scheduling phases in slot k, referred to as phase k.1 and phase k.2 • A cell delivered to its destined output port in phase k.1 can be transmitted out of the output port in the same slot (i.e., cut through) • A cell delivered in phase k.2 can only be transmitted in slot k+1 or later
Ⅲ. The Least Cushion First / Most Urgent First Algorithm • Consider a cell at input port i destined to output port j • Definition 1: The cushion of the cell: • The number of cells residing in output port j which will depart the emulated OQ switch earlier than the cell • Definition 2: The cushion between input port i and output port j: • The minimum cushion over all cells at input port i destined to output port j • If there is no cell at input port i destined to output port j, then the cushion is set to ∞
Ⅲ. The Least Cushion First / Most Urgent First Algorithm • Definition 3: The scheduling matrix of an N×N switch is an N×N square matrix whose (i,j)th entry equals the cushion between input port i and output port j • Definition 4: The input thread of a cell at input port i: • The set of cells at input port i which have a cushion smaller than or equal to that of the cell, except the cell itself • The size of an input thread is the number of cells it contains
Ⅲ. The Least Cushion First / Most Urgent First Algorithm • LCF / MUF Algorithm • Step 1: • Select the (i,j)th entry with the least cushion in the scheduling matrix (Least Cushion First). If the selected entry is ∞, then stop. • If more than one entry has the least cushion and they reside in different columns, select one column (i.e., an output port) arbitrarily. • For the selected column, say column j, determine the row i which has the most urgent cell among all cells at all input ports destined to output port j (Most Urgent First).
Ⅲ. The Least Cushion First / Most Urgent First Algorithm • LCF / MUF Algorithm • Step 2: • Eliminate the ith row and the jth column (i.e., match output port j to input port i) of the scheduling matrix. • If the reduced matrix becomes null, then stop. Otherwise, use the reduced matrix and go to Step 1. • Consider for example the scheduling matrix given in page 13
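Steps 1 and 2 can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `lcf_muf`, the `cushion` matrix (with `math.inf` for "no cell"), and the `urgency` matrix are all hypothetical. Here `urgency[i][j]` is assumed to be the emulated-OQ departure rank of the earliest cell at input i for output j (lower means more urgent), and the "arbitrary" column choice is resolved by taking the first least-cushion entry found.

```python
import math

def lcf_muf(cushion, urgency):
    """Return {input: output} chosen by one LCF/MUF pass.

    cushion[i][j]: cushion between input i and output j (math.inf if no cell).
    urgency[i][j]: assumed departure rank of the most urgent cell at input i
                   for output j (lower = more urgent); hypothetical input.
    """
    n = len(cushion)
    rows = list(range(n))   # input ports still unmatched
    cols = list(range(n))   # output ports still unmatched
    matching = {}
    while rows and cols:
        # Step 1 (Least Cushion First): smallest entry of the reduced matrix.
        least, j = min(((cushion[i][c], c) for i in rows for c in cols),
                       key=lambda t: t[0])
        if math.isinf(least):
            break  # no remaining input has a cell for any remaining output
        # Step 1 (Most Urgent First): within column j, pick the row whose
        # cell would depart the emulated OQ switch first.
        candidates = [r for r in rows if not math.isinf(cushion[r][j])]
        i = min(candidates, key=lambda r: urgency[r][j])
        # Step 2: match output j to input i and drop row i and column j.
        matching[i] = j
        rows.remove(i)
        cols.remove(j)
    return matching
```

With parallel comparators each minimum search takes constant time per eliminated row, which is consistent with the O(N) complexity quoted earlier; the sequential sketch above is slower and only illustrates the selection rule.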
Ⅳ. Conclusion • We propose a new scheduling algorithm: the least cushion first / most urgent first algorithm • Exactly emulates an OQ switch • No constraint on service discipline • Implementation issues of the LCF / MUF algorithm • A switch has to know the cushions of all cells and the relative departure order of cells destined to the same output port • It could be difficult to obtain this information for a dynamic priority assignment scheme (e.g. WFQ) • Feasible for static priority assignment schemes
Outline • Systolic Array • Binary Heap • Pipelined Heap • Hardware Design
The Systolic Array Priority Queue [Figure: chain of blocks 1 to n, each with a permanent data register and a temporary register, holding non-increasing priority values; labels: highest value, new value] • For n = 1000, hardware required: 1000 comparators, 2000 registers • Performance: constant time
The Binary Heap Priority Queue [Figure: binary heap drawn as a tree and stored as an array; values 16 14 10 4 7 8 3 2 3 3 5 7 at positions 1 to 12] • For n = 1000, hardware required: 1 comparator, 1 register, 1 SRAM • Performance: O(log n)
The Pipelined-Heap • Modified binary heap data structure • Constant-time operation. Similar to the Systolic Array. • Good hardware scalability. Similar to the Binary Heap.
P-heap Data Structure (B,T) [Figure: Binary Array (B) with levels 1 to 4 covering positions 1 to 15, and Token Array (T) with one (operation, value, position) entry per level] • Values: 16 14 10 4 7 7 3 2 1 5 8 • Capacity at positions 1 to 15: 4 1 3 1 0 1 2 0 1 0 0 1 0 1 1
The Enqueue (Insert) Operation [Figure: enqueue of 9 into a four-level P-heap holding 16; 14 10; 8 7 3; 2 4 5. (a) local-enqueue(1): token (enq, 9) at position 1. (b) local-enqueue(2): token (enq, 9) at position 2]
Enqueue (contd) [Figure: (c) local-enqueue(3): token (enq, 9) at position 5. (d) local-enqueue(4): 9 has replaced 7 at position 5 and token (enq, 7) is at position 10. (e) Final heap: 16; 14 10; 8 9 3; 2 4 7 5]
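The enqueue walk in panels (a) to (e) can be sketched sequentially in Python. This is a hedged sketch, not the pipelined hardware design: the names `make_pheap` and `enqueue` are hypothetical, the capacity of a node is assumed to be the number of empty nodes in its subtree (including the node itself), and the value 3 on level 3 is assumed to sit at position 7.

```python
def make_pheap(levels, filled=None):
    """Build (value, capacity) arrays for a P-heap with the given depth.

    Index 0 is unused; positions run from 1 to 2**levels - 1.
    capacity[i] is assumed to count the empty nodes in the subtree at i.
    """
    size = 1 << levels
    value = [None] * size
    for pos, v in (filled or {}).items():
        value[pos] = v
    capacity = [0] * size
    for i in range(size - 1, 0, -1):
        below = capacity[2 * i] + capacity[2 * i + 1] if 2 * i < size else 0
        capacity[i] = below + (1 if value[i] is None else 0)
    return value, capacity

def enqueue(value, capacity, x):
    """Insert x by a chain of local-enqueue steps; return the final position."""
    if capacity[1] == 0:
        raise OverflowError("P-heap is full")
    i = 1
    while True:
        capacity[i] -= 1              # one free slot below i will be used
        if value[i] is None:
            value[i] = x              # empty node: the value settles here
            return i
        if x > value[i]:
            value[i], x = x, value[i]  # keep the larger, push the smaller down
        left = 2 * i
        # Go left if the left subtree still has room, otherwise go right.
        i = left if capacity[left] > 0 else left + 1
```

Starting from the heap in panel (a), `enqueue(value, capacity, 9)` visits positions 1, 2, 5 and 10, matching the token positions shown in the figure. In hardware each loop iteration would be handled by a different level, so a new operation can start before the previous one finishes.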
The Dequeue (Delete) Operation [Figure: dequeue from a four-level P-heap. (a) Initial heap: 16; 14 10; 8 7 3; 2 4 5. (b) local-dequeue(1): token (deq) at position 1, root removed]
Dequeue (contd) [Figure: (c) local-dequeue(2): 14 has moved to the root and token (deq) is at position 2. (d) local-dequeue(3): 8 has moved to position 2 and token (deq) is at position 4. (e) Final heap: 14; 8 10; 4 7 3; 2 5]
Pipelined Operation [Figure: four panels showing operations in progress at levels 1 to 6 of the P-heap]
Hardware Requirements • log N SRAMs represent the Binary Array B, where N = size of the P-heap. • log N registers represent the Token Array T. • log N comparators are required, one for each level of the P-heap.
Binary Heap • Left(i) = 2*i • Right(i) = 2*i + 1 • Parent(i) = i / 2 (integer division) • A[i] >= A[Left(i)] • A[i] >= A[Right(i)] [Figure: heap 16 11 12 8 11 9 at positions 1 to 6, viewed as a binary tree and viewed as an array]
Binary Heap : Insert Operation [Figure: inserting 14 into the heap 16 11 12 8 10 9, viewed as a binary tree and as an array. Array after appending: 16 11 12 8 10 9 14. Array after swapping with the parent: 16 11 14 8 10 9 12]
Binary Heap : Delete Operation [Figure: deleting the root from the heap, viewed as a binary tree and as an array. Initial array: 16 11 14 8 10 9 12. After the last element replaces the root: 12 11 14 8 10 9. After swapping down: 14 11 12 8 10 9]
Binary Heap Operations • Both insert and delete are O(log N) operations (i.e. number of levels in the tree) • 2*i can be implemented as left shift • i / 2 can be implemented as right shift
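The insert and delete operations, with the shift-based index arithmetic, can be sketched as follows. This is a minimal sketch: the function names are hypothetical, and the list keeps index 0 unused so that positions match the 1-based numbering on the slides.

```python
def heap_insert(a, value):
    """Insert into a 1-indexed max-heap stored in list a (a[0] is unused)."""
    a.append(value)
    i = len(a) - 1
    # Sift up: Parent(i) = i / 2 is a right shift by one bit.
    while i > 1 and a[i >> 1] < a[i]:
        a[i], a[i >> 1] = a[i >> 1], a[i]
        i >>= 1

def heap_delete_max(a):
    """Remove and return the root of a non-empty 1-indexed max-heap."""
    top = a[1]
    last = a.pop()
    n = len(a) - 1                # number of elements left
    if n >= 1:
        a[1] = last               # last element replaces the root
        i = 1
        while True:
            left = i << 1         # Left(i) = 2*i is a left shift by one bit
            right = left + 1
            largest = i
            if left <= n and a[left] > a[largest]:
                largest = left
            if right <= n and a[right] > a[largest]:
                largest = right
            if largest == i:
                break
            a[i], a[largest] = a[largest], a[i]
            i = largest
    return top
```

Applied to `[None, 16, 11, 12, 8, 10, 9]`, inserting 14 and then deleting the maximum reproduces the arrays shown on the insert and delete slides. Both loops run at most once per level, which is where the O(log N) bound comes from.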
Some Scheduling Algorithms • Outline • PIM • RRM • iSLIP (better solution)
Scheduling Algorithms • When we use a crossbar switch, we require a scheduling algorithm that matches inputs with outputs. • This is equivalent to finding a matching on a bipartite graph with N input vertices and N output vertices. • The algorithm configures the fabric during each cell time and decides which inputs will be connected to which outputs.
Scheduling packets • The scheduling algorithm needs to decide the path and order of packets through the crossbar switch • For example, with P(input #, output #) = order to leave: P(1,1)=1, P(1,2)=3, P(3,2)=3, P(3,4)=1, P(4,4)=2 [Figure: crossbar switch between the input side and the output side]
High performance systems • Usually, we design an algorithm with the following properties: • High Throughput • Starvation Free • Fast • Simple to Implement
Parallel Iterative Matching (PIM) • PIM has three steps to implement • Step1 : Request • Step2 : Grant • Step3 : Accept • Each decision is made randomly.
The mathematical model of the algorithm • We can assume that • Every input In[i] maintains the following state information: • Table Ri[0] … Ri[N-1], where Ri[k] = 1, if In[i] has a request for Out[k] (0, otherwise) • Table Gdi[0] … Gdi[N-1], where Gdi[k] = 1, if In[i] receives a grant from Out[k] (0, otherwise) • Variable Ai, where Ai = k, if In[i] accepts the grant from Out[k] (-1, if no output is accepted).
The mathematical model (cont'd) • Every output Out[k] maintains the following state information: • Table Rdk[0] … Rdk[N-1], where Rdk[i] = 1, if Out[k] receives a request from In[i] (0, otherwise) • Variable Gk, where Gk = i, if Out[k] sends a grant to In[i] (-1, if no input is granted) • Variable Adk, where Adk = 1, if the grant from Out[k] is accepted (0, otherwise).
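The request, grant and accept steps can be sketched with these state variables in Python. This is a hedged sketch, not the representation used in the original slides: the function name `pim` and the `iterations` and `rng` parameters are hypothetical, and ports are numbered from 0. `R[i][k]` plays the role of Ri[k], `G[k]` of Gk, and `A[i]` of Ai.

```python
import random

def pim(R, iterations=1, rng=None):
    """Run PIM on request matrix R and return A, where A[i] is the output
    accepted by In[i], or -1 if In[i] stays unmatched."""
    rng = rng or random.Random()
    n = len(R)
    A = [-1] * n                  # A[i] = k if In[i] accepted Out[k]
    out_taken = [False] * n       # True once Out[k] has an accepted grant
    for _ in range(iterations):
        # Request + Grant: each unmatched output picks one requesting,
        # still-unmatched input uniformly at random.
        G = [-1] * n
        for k in range(n):
            if out_taken[k]:
                continue
            requesters = [i for i in range(n) if A[i] == -1 and R[i][k]]
            if requesters:
                G[k] = rng.choice(requesters)
        # Accept: each input that received grants picks one at random.
        for i in range(n):
            grants = [k for k in range(n) if G[k] == i]
            if grants:
                A[i] = rng.choice(grants)
                out_taken[A[i]] = True
    return A
```

Each iteration adds at least one match while any unmatched input still has a request for an unmatched output, so N iterations are always enough to reach a maximal matching in this sketch.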
The model of PIM • Therefore, we can represent PIM algorithm as
An example of the PIM algorithm • Requests: P(1,1)=1, P(1,2)=3, P(3,2)=3, P(3,4)=1, P(4,4)=2 [Figure: (a) Request, (b) Grant, (c) Accept, followed by a second iteration]
Problems with PIM • Randomness is hard to implement in hardware • Connections are served unfairly when an output is oversubscribed • Throughput is limited to approximately 63% for a single iteration
The unfairness problem • Arrival rates: λ1,1 = 1, λ1,2 = 1, λ2,1 = 1 • Service rates under PIM: μ1,1 = 1/4, μ1,2 = 3/4, μ2,1 = 3/4
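These service rates follow from the random grant and accept choices. The derivation below is a sketch that assumes at least two PIM iterations per cell time and independent, uniform random choices. Output 1 sees requests from inputs 1 and 2, output 2 sees a request from input 1 only, and input 1 always receives a grant from output 2.

```latex
\begin{align*}
\mu_{1,1} &= \Pr[\text{Out 1 grants In 1}] \cdot \Pr[\text{In 1 accepts Out 1}]
           = \tfrac{1}{2} \cdot \tfrac{1}{2} = \tfrac{1}{4}, \\
\mu_{1,2} &= 1 - \mu_{1,1} = \tfrac{3}{4}, \\
\mu_{2,1} &= \underbrace{\tfrac{1}{2}}_{\text{Out 1 grants In 2 in iteration 1}}
           + \underbrace{\tfrac{1}{2} \cdot \tfrac{1}{2}}_{\text{Out 1 grants In 1, In 1 accepts Out 2, In 2 wins iteration 2}}
           = \tfrac{3}{4}.
\end{align*}
```

Input 1 is always matched and output 1 is always busy, yet the flow from input 1 to output 1 receives only one third of the service that the flow from input 2 to output 1 receives.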
Round-Robin Matching Algorithm (RRM) • Use rotating priority to match inputs and outputs • Need a pointer gi to identify the highest priority element • Apply rotating priority on both inputs and outputs
RRM scheduling [Figure: steps (a), (b), (c) of RRM on the requests P(1,1)=1, P(1,2)=3, P(3,2)=3, P(3,4)=1, P(4,4)=2, showing accept pointer a1 and grant pointers g2 and g4 rotating over ports 1 to 4]
Synchronization Problem • When an output receives a request, the output should choose an input to grant and gi must move to a new value • For example, with λ1,1 = λ1,2 = 1 and λ2,1 = λ2,2 = 1, RRM gives μ1,1 = μ1,2 = 1/4 and μ2,1 = μ2,2 = 1/4 • Efficiency = 50%
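The 50% figure can be reproduced with a short simulation of single-iteration RRM. This is a minimal sketch under stated assumptions: the names `rrm_slot` and `simulate` are hypothetical, ports are numbered from 0, every queue is assumed to be permanently backlogged, and each grant pointer is assumed to move one position past the granted input whether or not the grant is accepted.

```python
def rrm_slot(R, g, a):
    """One RRM iteration. g[k] is the grant pointer of Out[k], a[i] the
    accept pointer of In[i]; both lists are updated in place.
    Returns match, where match[i] is the output matched to In[i] or -1."""
    n = len(R)
    grant = [-1] * n
    for k in range(n):
        for off in range(n):
            i = (g[k] + off) % n
            if R[i][k]:
                grant[k] = i
                g[k] = (i + 1) % n   # pointer moves even if not accepted
                break
    match = [-1] * n
    for i in range(n):
        for off in range(n):
            k = (a[i] + off) % n
            if grant[k] == i:
                match[i] = k
                a[i] = (k + 1) % n
                break
    return match

def simulate(slots=100):
    """2x2 switch with all four queues always backlogged."""
    R = [[1, 1], [1, 1]]
    g, a = [0, 0], [0, 0]
    served = [[0, 0], [0, 0]]
    for _ in range(slots):
        for i, k in enumerate(rrm_slot(R, g, a)):
            if k != -1:
                served[i][k] += 1
    return served
```

Because both grant pointers start equal and advance together, both outputs keep granting the same input, so only one cell crosses the fabric per slot. Over 100 slots each of the four flows is served 25 times, i.e. a rate of 1/4 per flow and 50% of the switch capacity.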