Some Unsolved Problems in High Speed Packet Switching. Shivendra S. Panwar. Joint work with: Yihan Li, Yanming Shen and H. Jonathan Chao. Polytechnic University, Brooklyn, NY. NY State Center for Advanced Technology in Telecommunications. http://catt.poly.edu/CATT/panwar.html
Advice to Woodward and Bernstein: “Follow the money” -- Deep Throat (aka Mark Felt)
Advice to performance analysts: “Find the bottleneck”
Buffering in a Packet Switch • Fixed-size packet switches • Operate in a time-slotted manner • The slot duration is equal to the cell transmission time • Contention occurs when multiple inputs have arrivals destined to the same output • Buffering is needed to avoid packet loss • Buffering schemes in a packet switch • Output queueing (OQ) • Input queueing (IQ) • Virtual output queueing (VOQ) / combined input-output queueing (CIOQ)
Output Queuing (OQ) • 100% throughput • Internal speedup of N • Impractical for large N [Figure: 4x4 output-queued switch, Inputs 1-4 to Outputs 1-4]
Input Queuing (IQ) • Easy to implement • Head-of-line (HOL) blocking limits throughput to 58.6% (2 − √2, for uniform i.i.d. traffic as N grows large) [Figure: 4x4 input-queued switch illustrating head-of-line blocking]
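The 58.6% figure can be checked numerically. The following is a minimal simulation sketch, assuming saturated inputs, uniformly random destinations, and random selection among contending inputs; the function name `hol_saturation_throughput` and its parameters are hypothetical and not part of the original slides.

```python
import random

def hol_saturation_throughput(n_ports, n_slots, seed=0):
    """Estimate per-port throughput of a saturated FIFO input-queued switch."""
    rng = random.Random(seed)
    # With saturated inputs only the head-of-line (HOL) destination matters.
    hol = [rng.randrange(n_ports) for _ in range(n_ports)]
    served = 0
    for _ in range(n_slots):
        # Group inputs by the output their HOL cell is destined to.
        contenders = {}
        for inp, out in enumerate(hol):
            contenders.setdefault(out, []).append(inp)
        # Each requested output serves one contender; the losers stay blocked.
        for out, inputs in contenders.items():
            winner = rng.choice(inputs)
            hol[winner] = rng.randrange(n_ports)  # next cell moves to the head
            served += 1
    return served / (n_ports * n_slots)
```

For small N the saturation throughput is higher (about 0.75 for N = 2) and it decreases toward 2 − √2 ≈ 0.586 as N grows.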
Virtual Output Queuing (VOQ) • Overcomes HOL blocking • No speedup requirement • Needs scheduling algorithms to resolve contention • Complexity • Performance guarantee
Challenges in Switch Design • Stability • 100% throughput • Delay performance • Scalability • Scale to high number of linecards and to high linecard speeds • Distributed scheduler is more desirable than a centralized scheduler • Scheduler complexity • Pin count
High Speed Packet Switches • VOQ switches and scheduling algorithms • Buffered crossbar switch • Load Balanced switch • Multi-stage switch
VOQ Switch Architecture • Input Segmentation Module (ISM): segments packets into fixed-length cells • Output Reassembly Module (ORM): reassembles cells into packets [Figure: 4x4 VOQ switch; each input has an ISM and N VOQs feeding the switch fabric, each output has an ORM]
Scheduling for VOQ Switch • Scheduling is needed to avoid output contention • A scheduling problem can be modeled as a matching problem in a bipartite graph • An input and an output are connected by an edge if the corresponding VOQ is not empty • Each edge may have a weight, which can be • The length of the VOQ • The age of the HOL cell
Maximum Weight Matching (MWM) • MWM always finds a match with the maximum weight • Stable under any admissible traffic • Very high complexity • O(N^3), impractical [Figure: weighted bipartite graph; weight of the match: 25] • References • L. Tassiulas, A. Ephremides, “Stability properties of constrained queueing systems and scheduling for maximum throughput in multihop radio networks,” IEEE Transactions on Automatic Control, Vol. 37, No. 12, pp. 1936-1949, December 1992. • E. Leonardi, M. Mellia, F. Neri, M. Ajmone Marsan, “On the stability of Input-Queued Switches with speed-up,” IEEE/ACM Transactions on Networking, Vol. 9, No. 1, pp. 104-118, ISSN: S 1063-6692(01)01313, February 2001. • N. McKeown, V. Anantharam, and J. Walrand, “Achieving 100% Throughput in an Input-Queued Switch,” IEEE Transactions on Communications, vol. 47, no. 8, Aug. 1999, pp. 1260-1267. • J.G. Dai and B. Prabhakar, “The throughput of data switches with and without speedup,” INFOCOM 2000.
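To make the definition concrete, here is a brute-force sketch that enumerates all N! matches and keeps the heaviest one. It is only meant for tiny examples and is not the O(N^3) algorithm referred to above; the function name and the queue matrix in the usage note are hypothetical.

```python
from itertools import permutations

def max_weight_matching(q):
    """Return (match, weight); match[i] is the output matched to input i.

    q[i][j] is the weight of the edge between input i and output j,
    for example the length of VOQ(i, j).
    """
    n = len(q)
    best, best_w = None, -1
    for perm in permutations(range(n)):
        w = sum(q[i][perm[i]] for i in range(n))
        if w > best_w:
            best, best_w = perm, w
    return best, best_w
```

With the illustrative weights `q = [[7, 0, 3], [4, 8, 0], [0, 5, 10]]`, the call returns the match `(0, 1, 2)` with weight 25.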
Maximum Weight Matching • The maximum weight matching algorithm is strongly stable under any admissible traffic pattern • Lyapunov function • Strongly stable • Admissible • References • Emilio Leonardi, Marco Mellia, Fabio Neri, Marco Ajmone Marsan, “On the stability of Input-Queued Switches with speed-up,” IEEE/ACM Transactions on Networking, Vol. 9, No. 1, pp. 104-118, ISSN: S 1063-6692(01)01313, February 2001. • N. McKeown, V. Anantharam, and J. Walrand, “Achieving 100% Throughput in an Input-Queued Switch,” IEEE Transactions on Communications, vol. 47, no. 8, Aug. 1999, pp. 1260-1267.
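For orientation, the three terms above are commonly formalized as follows in the input-queued switch literature. These are the standard textbook forms, stated here as an assumption rather than recovered from the slide; the symbol \lambda_{ij} (mean arrival rate from input i to output j) is introduced only for this sketch, and q_{ij}(t) is the length of VOQ(i, j).

```latex
% Admissible traffic: no input and no output is overloaded
\sum_{i=1}^{N} \lambda_{ij} < 1 \quad \forall j,
\qquad
\sum_{j=1}^{N} \lambda_{ij} < 1 \quad \forall i

% Strong stability: the expected total queue size stays bounded
\limsup_{t \to \infty} \; E\!\left[ \sum_{i,j} q_{ij}(t) \right] < \infty

% A quadratic Lyapunov function typically used in the stability proof
V\big(Q(t)\big) = \sum_{i,j} q_{ij}(t)^{2}
```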
Maximum Weight Matching • Fluid model • The maximum weight matching is rate stable if: • The arrival processes satisfy a strong law of large numbers (SLLN) with probability one, and • References • J.G. Dai and B. Prabhakar, “The throughput of data switches with and without speedup,” INFOCOM 2000, pp. 556-564.
Approximate MWM • 1-APRX • A function f(.) is a sub-linear function if lim_{x→∞} f(x)/x = 0 • Let the weight of a schedule obtained by a scheduling algorithm B be W_B • Let the weight of the maximum weight match for the same switch state be W* • If W_B ≥ W* − f(W*), B is a 1-APRX to MWM • B is stable if • Makes it possible to find stable matching algorithms with lower complexity than MWM. • References • D. Shah, M. Kopikare, “Delay bounds for approximate Maximum weight matching algorithms for input-queued switches”, IEEE INFOCOM, New York, USA, June 2002.
Average Delay Bound • Delay bound for MWM • Lyapunov function • References • E. Leonardi, M. Mellia, F. Neri, and M. Ajmone Marsan, “Bounds on average delays and queue size averages and variances in input-queued cell-based switches,” Proceedings of IEEE INFOCOM, 2001.
Average Delay Bound (contd.) • Delay bound for approximate-MWM • Lyapunov function • Cb: weight difference to the MWM matching • Under uniform traffic, the two bounds give the same result • References • D. Shah, M. Kopikare, “Delay bounds for approximate Maximum weight matching algorithms for input-queued switches”, IEEE INFOCOM, New York, USA, June 2002.
Open Issues • In simulations, MWM has the best delay performance (cell delay) • Average delay: if the weight of a queue is chosen as Q^a, the delay increases with a for a > 0 • Is MWM the optimal scheduling scheme for achieving the minimum average cell delay? • What is the optimal scheduling scheme to achieve the minimum average packet delay (including reassembly delay)?
Maximal Matching • Add connections incrementally, without removing connections made earlier • No more matches can be made trivially by the end of the operation • Solution may not be unique • Complexity O(N log N) [Figure: weighted bipartite graph; weight of the match: 23]
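One simple way to build a maximal matching is a greedy pass over the non-empty VOQs. The sketch below scans edges heaviest first, which is an assumption of this example (any scan order yields a maximal matching); the function name and the sample matrix are hypothetical.

```python
def greedy_maximal_matching(q):
    """Return a dict input -> output built greedily, heaviest edge first.

    Connections are only ever added, never removed, so the result is
    maximal (no edge can be added) but not necessarily maximum weight.
    """
    n = len(q)
    edges = sorted(
        ((q[i][j], i, j) for i in range(n) for j in range(n) if q[i][j] > 0),
        reverse=True,
    )
    match, used_outputs = {}, set()
    for _, i, j in edges:
        if i not in match and j not in used_outputs:
            match[i] = j
            used_outputs.add(j)
    return match
```

For `q = [[7, 8, 0], [0, 7, 0], [0, 0, 5]]` the greedy pass returns `{0: 1, 2: 2}` with weight 13, while the maximum weight match (inputs 0, 1, 2 to outputs 0, 1, 2) has weight 19. Input 1 is left unmatched because its only requested output is already taken.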
Maximal Matching • A maximal matching achieves 100% throughput with speed-up S≥2 under any admissible traffic pattern • [Leonardi, ToN 2001] • 100% throughput • if with probability 1 • A maximal matching algorithm is rate stable with speed-up S≥2 [Dai, Infocom 2000] • References • Emilio Leonardi, Marco Mellia, Fabio Neri, Marco Ajmone Marsan, “On the stability of Input-Queued Switches with speed-up”, IEEE/ACM Transactions on Networking, Vol.9, No.1, pp.104-118, ISSN: S 1063-6692(01)01313, February 2001 • J.G. Dai and B. Prabhakar, “The throughput of data switches with and without speedup,” INFOCOM 2000, pp. 556-564.
Multiple Iterative Matching • Use multiple iterations to converge on a maximal matching • Parallel Iterative Matching (PIM) • iSLIP and DRRM • The complexity of each iteration is O(log N) • O(log N) iterations are needed to converge on a maximal matching (iSLIP) • 100% throughput only under uniform traffic
iSLIP • Step 1: Request • Each input sends a request to every output for which it has a queued cell. • Step 2: Grant • If an output receives multiple requests, it chooses the one that appears next in a fixed round-robin schedule. • The output arbiter pointer is incremented by one location beyond the granted input if, and only if, the grant is accepted in step 3. • Step 3: Accept • If an input receives multiple grants, it accepts the one that appears next in a fixed round-robin schedule. • The input arbiter pointer is incremented by one location beyond the accepted output. [Figure: request, grant and accept phases between inputs and outputs]
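The three steps can be sketched for a single iteration as follows. This is a minimal illustration, assuming all ports start unmatched and that pointers are updated in this iteration; the function and argument names are hypothetical, and multi-iteration details are left out.

```python
def islip_iteration(requests, grant_ptr, accept_ptr):
    """Run one request-grant-accept iteration.

    requests[i][j] is True if input i has a cell queued for output j
    (step 1). grant_ptr[j] and accept_ptr[i] are the round-robin
    pointers and are updated in place. Returns a dict input -> output.
    """
    n = len(requests)

    def next_in_round_robin(candidates, pointer):
        # First candidate at or after the pointer, wrapping around.
        return min(candidates, key=lambda k: (k - pointer) % n)

    # Step 2 (grant): each output picks one requesting input.
    grants = {}  # input -> list of outputs that granted it
    for out in range(n):
        requesters = [i for i in range(n) if requests[i][out]]
        if requesters:
            chosen = next_in_round_robin(requesters, grant_ptr[out])
            grants.setdefault(chosen, []).append(out)

    # Step 3 (accept): each input picks one granting output.
    match = {}
    for inp, outs in grants.items():
        out = next_in_round_robin(outs, accept_ptr[inp])
        match[inp] = out
        # Pointers move one past the matched port, only for accepted grants.
        accept_ptr[inp] = (out + 1) % n
        grant_ptr[out] = (inp + 1) % n
    return match
```

In a 2x2 example where input 0 requests both outputs and input 1 requests only output 0, the first call matches only input 0 to output 0. Because the pointers then advance, a second call with the same requests matches input 0 to output 1 and input 1 to output 0, which is the desynchronization effect that lets iSLIP serve all ports.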
Achieving 100% Throughput without Speedup • Matching algorithms using memory • Polling system based matching
Low Complexity Algorithms with 100% Throughput • Algorithms with memory • Use the previous schedule as a candidate • References • L. Tassiulas, “Linear complexity algorithms for maximum throughput in radio networks and input queued switches,” IEEE INFOCOM 1998, vol.2, New York, 1998, pp.533-539. • P. Giaccone, B. Prabhakar, D. Shah “Toward simple, high-performance schedulers for high-aggregate bandwidth switches”, IEEE INFOCOM 2002, New York, 2002. • Polling system based matching algorithms • Improve the efficiency by using exhaustive service • References • Y. Li, S. Panwar, H. J. Chao, “Exhaustive service matching algorithms for input queued switches,” 2004 Workshop on High Performance Switching and Routing (HPSR 2004), April 2004. • Y. Li, S. Panwar, H. J. Chao, “ Performance Analysis of a Dual Round Robin Matching Switch with Exhaustive Service,” IEEE GLOBECOM 2002.
Matching Algorithms with Memory • The queue length of each VOQ does not change much during successive time slots • In each time slot: • At most one cell arrives at each input • At most one cell departs from each input • It is likely that a busy connection will continue to be busy over a few time slots, if the queue length is used as the weight of a connection • Use the match in the previous time slot as a candidate for the new match • Important results: • Randomized algorithm with memory [Tassiulas 98] • Derandomized algorithm with memory [Giaccone 02] • With higher complexity: APSARA, LAURA, SERENA [Giaccone 02]
Notations • For an N×N switch, there are N! possible matches • Q(t) = [q_ij], an N×N matrix, where q_ij is the queue length of VOQ_ij • M(t): a match at time t • The weight of M(t) • W(t) = <M(t), Q(t)> • the sum of the lengths of all matched VOQs
Randomized Algorithm with Memory • Let S(t) be the schedule used at time t • At time t+1, uniformly select a match R(t+1) at random from the set of all N! possible matches • Let S(t+1) = arg max <S, Q(t+1)>, taken over S in {S(t), R(t+1)} • Stable under any Bernoulli i.i.d. admissible arrival traffic • Very simple to implement, complexity O(log N) • Delay performance is very poor
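A minimal sketch of one scheduling step, assuming the rule is to keep the heavier of the previous schedule S(t) and the random match R(t+1) under the current queue lengths. The function names are hypothetical; matches are represented as lists where `match[i]` is the output connected to input i.

```python
import random

def weight(match, q):
    """W = <M, Q>: total queue length of the VOQs selected by the match."""
    return sum(q[i][j] for i, j in enumerate(match))

def randomized_with_memory(prev_match, q, rng=random):
    """Keep the heavier of the previous match and a fresh uniform random match."""
    candidate = list(range(len(q)))
    rng.shuffle(candidate)  # uniform over all N! permutations
    # On a tie, max() returns its first argument, i.e. the previous match.
    return max(prev_match, candidate, key=lambda m: weight(m, q))
```

The new schedule is never lighter than the previous one evaluated on the current queues, which is the property the stability argument relies on.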
Derandomized Algorithm with Memory • Hamiltonian walk • A walk which visits every vertex of a graph exactly once. • In an N×N switch: • N! vertices (possible schedules); a Hamiltonian walk visits each vertex once every N! time slots • H(t): the value of the vertex which is visited at time t • The complexity of generating H(t+1) when H(t) is known is O(1) • Derandomized algorithm with memory • Use the match generated by the Hamiltonian walk instead of the random match • Similar performance to the randomized algorithm
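The derandomized variant can be sketched by replacing the random match with a deterministic walk over all N! matches. As a stand-in for a true Hamiltonian walk, this example cycles through `itertools.permutations` in lexicographic order; that choice is an assumption made for brevity, it does not give the O(1) per-step generation mentioned above, and it stores all N! matches, so it only suits small N. The function names are hypothetical.

```python
from itertools import cycle, permutations

def weight(match, q):
    """W = <M, Q>: total queue length of the VOQs selected by the match."""
    return sum(q[i][j] for i, j in enumerate(match))

def derandomized_with_memory(n, queue_snapshots):
    """Yield one schedule per queue snapshot, walking all N! matches cyclically."""
    walk = cycle(permutations(range(n)))  # H(t): each match once per N! slots
    schedule = next(walk)                 # S(0)
    for q in queue_snapshots:
        h = next(walk)                    # H(t+1)
        schedule = max(schedule, h, key=lambda m: weight(m, q))
        yield schedule
```

Because every match is visited within N! slots, the heaviest match is eventually offered as a candidate, which mirrors the role of the random match in the randomized algorithm.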
Compared to MWM … • Simple matching algorithms can achieve stability as MWM does • Not necessary to find “the best match” in each time slot to achieve 100% throughput • MWM has much better delay performance than randomized and derandomized matching • “better” matches lead to better delay performance
With Higher Complexity and Lower Delay • Introduce higher complexity for much lower delay than the randomized and derandomized algorithms • APSARA • Include the neighbors of the latest match as candidates • LAURA • Merge the latest match with a random match to remember the heavy edges • SERENA • Merge the latest match with the arrival graph • Arrival graph: generated from the current arrival pattern • Complexity O(N)
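The merge step used by LAURA and SERENA can be sketched as follows, assuming both inputs are full matches (permutations). The union of two full matches splits into disjoint alternating cycles, and within each cycle the heavier of the two sub-matches is kept. This is an illustrative reading of the merge idea in [Giaccone 02], not the authors' exact procedure; the function name is hypothetical.

```python
def merge(a, b, q):
    """Combine two full matches into one at least as heavy as either.

    a[i] and b[i] are the outputs matched to input i; q[i][j] is the
    length of VOQ(i, j).
    """
    n = len(a)
    inv_b = [0] * n  # inv_b[out] = input matched to `out` in b
    for i, out in enumerate(b):
        inv_b[out] = i
    merged = [None] * n
    visited = [False] * n
    for start in range(n):
        if visited[start]:
            continue
        # Follow a-edge then b-edge until the cycle closes; the inputs
        # collected this way use the same set of outputs in a and in b.
        component, cur = [], start
        while not visited[cur]:
            visited[cur] = True
            component.append(cur)
            cur = inv_b[a[cur]]
        w_a = sum(q[i][a[i]] for i in component)
        w_b = sum(q[i][b[i]] for i in component)
        keep = a if w_a >= w_b else b
        for i in component:
            merged[i] = keep[i]
    return merged
```

For example, with `a = [0, 1, 2, 3]`, `b = [1, 0, 3, 2]` and queue lengths that favor `a` on inputs 0-1 and `b` on inputs 2-3, the merged match is `[0, 1, 3, 2]`, heavier than both `a` and `b`. The traversal touches each input once, consistent with the O(N) complexity stated above.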
Polling System Based Matching • Exhaustive Service Matching • Inspired by exhaustive service polling systems • All the cells in the corresponding VOQ are served after an input and an output are matched • Slot times wasted to achieve an input-output match are amortized over all the cells waiting in the VOQ instead of only one • Cells within the same packet are transferred continuously • Hamiltonian walk is used to guarantee stability
Exhaustive Service Matching with Hamiltonian Walk (EMHW) • Let S(t) be the match at time t. • At time t+1, generate match Z(t+1) by the Exhaustive Service Matching algorithm based on S(t), and H(t+1) by Hamiltonian walk • Let S(t+1) = arg max <S, Q(t+1)>, taken over S in {Z(t+1), H(t+1)}, where <S, Q(t+1)> is the weight of S at time t+1. • Stable under any admissible traffic • Analyzed by an exhaustive service polling system • Implementation complexity • HE-iSLIP: O(log N)
E-iSLIP Average Delay Analysis • Exhaustive random polling system model • Symmetric system -- only consider one input • N VOQs per input, exhaustive service policy -- an exhaustive service polling system with N stations • The service order of the VOQs is not fixed -- random polling system; assume all VOQs have the same probability of being selected for service after a VOQ is served • Switchover time S • Average delay T [Levy and Kleinrock]
Delay Performance of HE-iSLIP • Packet delay: the sum of cell delay and reassembly delay • Cell delay: measured from VOQ to destination output • Reassembly delay: time spent in an ORM, often ignored in other work [Figure: VOQ switch with an ISM and N VOQs per input and an ORM per output]
Packet Delay under Uniform Traffic • Pattern 1: packet size is 1 cell. [Figure: packet delay curves for SERENA, iSLIP, HE-iSLIP and MWM]
Packet Delay under Uniform Traffic • Pattern 2: packet length is 10 cells • Pattern 3: packet length is variable, the average is 10 cells (Internet packet size distribution) [Figure: two panels of packet delay curves, one per pattern, each comparing SERENA, iSLIP, MWM and HE-iSLIP]
When packet length is larger than 1 cell • Why does HE-iSLIP have a lower packet delay than MWM? • For example, when packet length is 10 cells: • Reassembly delay • Cell delay • Low cell delay + low reassembly delay needed for low packet delay • Open Problem: Which scheduler minimizes packet delay? [Figure: reassembly delay and cell delay curves for HE-iSLIP and MWM]
Packet-Based Scheduling • Packet-based scheduling algorithm • Once it starts transmitting the first cell of a packet to an output port, it continues the transmission until the whole packet is completely received at the corresponding output port • Packet-based MWM is stable for any admissible Bernoulli i.i.d. traffic • Lyapunov function, M. A. Marsan, A. Bianco, P. Giaccone, E. Leonardi, and F. Neri, “Packet Scheduling in Input-Queued Cell-Based Switches,” INFOCOM 2001, pp. 1085-1094. • Packet-based MWM is stable under regenerative admissible input traffic • Fluid model, Y. Ganjali, A. Keshavarzian, D. Shah, “Input Queued Switches: Cell switching v/s Packet switching,” Proceedings of Infocom, 2003. • Regenerative: let T be the time between two successive occurrences of the event that all ports are free, with E(T) being finite • Modified waiting PB-MWM algorithm is stable under any admissible traffic
Buffered Crossbar Switch • One buffer for each crosspoint • Distributed arbitration for inputs and outputs • From each input, one cell can be sent to a crosspoint buffer if it has space • One cell can be sent to an output if at least one crosspoint buffer to that output is nonempty • References • Y. Doi and N. Yamanaka, “A High-Speed ATM Switch with Input and Cross-Point Buffers,” IEICE TRANS. COMMUN., VOL. E76, NO.3, pp. 310-314, March 1993. • R. Rojas-Cessa, E. Oki, Z. Jing, and H. J. Chao, “CIXB-1: Combined Input-One-Cell-Crosspoint Buffered Switch,” Proceedings of IEEE Workshop of High Performance Switches and Routers 2001.
Birkhoff-von Neumann Switch • When traffic matrix is known • Birkhoff-von Neumann decomposition • Reference • Cheng-Shang Chang, Wen-Jyh Chen and Hsiang-Yi Huang, "On service guarantees for input buffered crossbar switches: a capacity decomposition approach by Birkhoff and von Neumann," IEEE IWQoS'99, pp. 79-86, London, U.K., 1999.
Birkhoff-von Neumann Switch • Example • High complexity, impractical
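The decomposition itself can be sketched as follows: repeatedly find a permutation supported on the positive entries of the rate matrix, subtract the largest multiple of it that keeps all entries non-negative, and record the coefficient. This is a minimal sketch assuming the input is already doubly stochastic (every row and column sums to 1) and given as exact fractions; the function names are hypothetical, and the matching routine is a plain augmenting-path search chosen for brevity.

```python
from fractions import Fraction

def find_perfect_matching(positive):
    """Return match[i] = j using only entries where positive[i][j] is True, or None."""
    n = len(positive)
    owner = [None] * n  # owner[j] = input currently holding output j

    def try_assign(i, seen):
        for j in range(n):
            if positive[i][j] and j not in seen:
                seen.add(j)
                if owner[j] is None or try_assign(owner[j], seen):
                    owner[j] = i
                    return True
        return False

    for i in range(n):
        if not try_assign(i, set()):
            return None
    match = [None] * n
    for j, i in enumerate(owner):
        match[i] = j
    return match

def birkhoff_von_neumann(rates):
    """Decompose a doubly stochastic matrix into [(coefficient, permutation), ...]."""
    r = [[Fraction(x) for x in row] for row in rates]
    n = len(r)
    terms = []
    while any(x > 0 for row in r for x in row):
        match = find_perfect_matching([[x > 0 for x in row] for row in r])
        if match is None:
            raise ValueError("matrix is not doubly stochastic")
        coeff = min(r[i][match[i]] for i in range(n))
        for i in range(n):
            r[i][match[i]] -= coeff  # zeroes at least one entry per round
        terms.append((coeff, match))
    return terms
```

Each returned pair says: use this permutation as the crossbar configuration for this fraction of the time slots. Every round zeroes at least one entry, so the number of terms is bounded by the number of non-zero entries; the need to know the traffic matrix and to compute and store this schedule is what makes the approach costly in practice.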
Load-Balanced Switch • Convert the traffic to uniform, then fixed switching • 100% throughput for a broad class of traffic • No centralized scheduler needed, scalable [Figure: two-stage switch; a load-balancing stage spreads arrivals over intermediate ports 1..N, followed by a switching stage]
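The "convert the traffic to uniform" step can be illustrated with a fixed, traffic-independent connection pattern in the first stage. The sketch below assumes the periodic pattern "input i connects to intermediate port (i + t) mod N in slot t"; that specific pattern and the function names are assumptions made for this example.

```python
def middle_port(input_port, slot, n):
    """First stage: deterministic round-robin connection, independent of traffic."""
    return (input_port + slot) % n

def spread(arrivals, n):
    """Count cells per (intermediate port, output) after the first stage.

    arrivals is a list of (slot, input_port, output_port) tuples.
    """
    load = [[0] * n for _ in range(n)]
    for slot, inp, out in arrivals:
        load[middle_port(inp, slot, n)][out] += 1
    return load
```

Even a fully concentrated flow, such as eight consecutive cells from input 0 to output 2 in a 4x4 switch, ends up as two cells for output 2 at each of the four intermediate ports. The second stage can then serve the intermediate queues with the same kind of fixed pattern, without any centralized scheduler.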
Original Work on LB Switch • Stability: the load-balanced switch is stable • Delay: burst reduction • Problem: unbounded out-of-sequence delays • Reference • C.-S. Chang, D.-S. Lee and Y.-S. Jou, “Load balanced Birkhoff-von Neumann switches, Part I: one-stage buffering,” Computer Comm., Vol. 25, pp. 611-622, 2002.
LB Switch variants • Solve the out-of-sequence problem • FCFS (First come first served) • Jitter control mechanism • Increases the average delay • EDF (Earliest deadline first) • Reduces the average delay • High complexity • Mailbox switch • Prevents packets from being out-of-sequence • Not 100% throughput • References • C.-S. Chang, D.-S. Lee and C.-M. Lien, “Load balanced Birkhoff-von Neumann switches, Part II: multi-stage buffering,” Computer Comm., Vol. 25, pp. 623-634, 2002. • C.S. Chang, D. Lee, and Y. J. Shih, “Mailbox switch: A scalable two-stage switch architecture for conflict resolution of ordered packets,” In Proceedings of IEEE INFOCOM, Hong Kong, March 2004.
More LB switch variants • FFF (Full frames first) (Infocom 2002, McKeown) • Frame-based • No need for resequencing • Requires multi-stage buffer communication: high complexity • FOFF (Full ordered frames first) (Sigcomm 2003, McKeown) • Frame-based • Maximum resequencing delay N^2 • Bandwidth wastage • References • I. Keslassy and N. McKeown, “Maintaining packet order in two-stage switches,” Proc. of the IEEE Infocom, June 2002. • I. Keslassy, S.-T. Chuang, K. Yu, D. Miller, M. Horowitz, O. Solgaard and N. McKeown, “Scaling Internet routers using optics,” ACM SIGCOMM ’03, Karlsruhe, Germany, Aug. 2003.
Byte-Focal Switch Architecture [Figure: arrivals enter the input VOQs (i,1)..(i,N) at each input i, cross the 1st stage switch fabric to the second-stage VOQs (j,1)..(j,N), then cross the 2nd stage switch fabric to the re-sequencing buffer at each output k]
Byte-Focal Switch • Packet-by-packet scheduling • Improves the average delay performance • The maximum resequencing delay is N^2 • The time complexity of the resequencing buffer is O(1) • Does not need communications between linecards • References • Y. Shen, S. Jiang, S.S. Panwar, H.J. Chao, “Byte-Focal: a practical load-balanced switch”, HPSR 2005, Hong Kong.
Multi-Stage Switches • Single Stage Switches (e.g., Cross-point switch) • Single path between each input-output pair • Cannot meet the increasing demands of Internet traffic • No packets out-of-sequence • Easy to design • Lack of scalability • Multi-stage Switches (e.g., Clos-network switch) • Multiple paths between each input-output pair • Better tradeoff between the switch performance and complexity • Highly scalable and fault tolerant • Memory-less multi-stage switches • No packets out-of-sequence, may encounter internal blocking • Buffered multi-stage switches • Packet may be out-of-sequence, easy scheduling