Limiting the Impact of Failures on Network Performance Joint work with Supratik Bhattacharyya, and Christophe Diot High Performance Networking Group, 25 Feb. 2004. Yashar Ganjali Computer Systems Lab. Stanford University yganjali@stanford.edu http://www.stanford.edu/~yganjali.
Motivation • The core of the Internet consists of several large networks (IP backbones). • IP backbones are carefully provisioned to guarantee low latency and jitter for packet delivery. • Failures occur on a daily basis as a result of • Physical layer malfunction, • Router hardware/software failures, • Maintenance, • Human errors, … • Failures affect the quality of service delivered to backbone customers.
Outline • Background • Sprint’s IP backbone • Data • Impact Metrics • Time-based metrics • Link-based metrics • Measurements • Reducing the impact • Identifying critical failures • Causes analysis • Reducing critical failures
Background – Sprint’s IP backbone • IP layer operates above DWDM with SONET framing. • IS-IS protocol used to route traffic inside the network. • IP-level restoration • When an IP link fails, all routers in the network independently compute a new path around the failure • No protection in the underlying optical infrastructure.
Data • IS-IS Link State PDU logs • Collected by passive listeners from Sprint’s North America backbone. • Feb. 1st, 2003 to Jun. 30th, 2003. • SNMP logs • Link loads recorded once every 5 minutes. • SONET layer alarms • Corresponding to minor and major problems in the optical layer. • We are only interested in two alarms: SLOS and SLOS cleared.
Inter-POP vs. Intra-POP [Figure: POP topology with routers ANA-1, ANA-2, ANA-3, ANA-4]
Two Perspectives • For a given impact metric • Time-based analysis: Measure the impact of failures on the given metric as a function of time. • Link-based analysis: Measure the impact of failures on the given metric as a function of failing links.
Time-based Impact Metrics • Number of Simultaneous Link Failures • Number of affected O-D pairs • Number of affected BGP prefixes • Path unavailability • Total rerouted traffic • Maximum load
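The first metric in this list, the number of simultaneous link failures as a function of time, can be sketched as a sweep over failure start and end events. This is a minimal illustration and not the analysis code used in the study; the `(start, end)` input format and the example intervals are assumptions.

```python
def simultaneous_failures(failures):
    """Return (time, count) steps for the number of links down at once.

    `failures` is a list of (start, end) times, one entry per link failure.
    """
    events = []
    for start, end in failures:
        events.append((start, 1))   # a link goes down
        events.append((end, -1))    # a link comes back up
    # Tuples sort by time first; at equal times recoveries (-1) sort before
    # new failures (+1), so back-to-back failures are not counted as overlapping.
    events.sort()

    steps, count = [], 0
    for time, delta in events:
        count += delta
        if steps and steps[-1][0] == time:
            steps[-1] = (time, count)   # merge events that share a timestamp
        else:
            steps.append((time, count))
    return steps


# Hypothetical failure intervals, in seconds.
example = [(0, 10), (5, 15), (12, 20)]
steps = simultaneous_failures(example)
peak = max(count for _, count in steps)
```

The same sweep could feed any of the other time-based metrics by replacing the +1/-1 deltas with a per-failure contribution.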
Number of Affected O-D Pairs [Figure: example topology with nodes A, B, C, D, E, F]
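A rough Python sketch of how affected O-D pairs might be counted for one failed link. It assumes a pair is "affected" when its current shortest path crosses the failed link, and it uses hop count in place of IS-IS link weights. The six-node topology below is hypothetical; only the node names come from the slide.

```python
from collections import deque
from itertools import permutations

# Hypothetical edges between the nodes A-F; illustrative only.
LINKS = [("A", "B"), ("B", "C"), ("C", "F"), ("A", "D"),
         ("D", "E"), ("E", "F"), ("B", "E")]


def shortest_path(links, src, dst):
    """Breadth-first search; returns one fewest-hop path as a node list, or None."""
    neighbors = {}
    for u, v in links:
        neighbors.setdefault(u, []).append(v)
        neighbors.setdefault(v, []).append(u)
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in neighbors.get(node, []):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None


def affected_od_pairs(links, failed):
    """Ordered (origin, destination) pairs whose path uses the failed link."""
    nodes = sorted({n for link in links for n in link})
    failed_set = frozenset(failed)
    affected = []
    for src, dst in permutations(nodes, 2):
        path = shortest_path(links, src, dst)
        if path is None:
            continue  # pair was already disconnected before the failure
        if any(frozenset(hop) == failed_set for hop in zip(path, path[1:])):
            affected.append((src, dst))
    return affected


affected = affected_od_pairs(LINKS, ("B", "C"))
```

When several equal-length paths exist, breadth-first search picks one arbitrarily, so counts on a real topology would depend on the actual routing weights and tie-breaking.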
Path Unavailability [Figure: example topology with nodes A, B, C, D, E, F]
Maximum Load Throughout the Network 96% of link failures were not followed by an immediate change in maximum load.
Path Coverage [Figure: example topology with nodes A, B, C, D, E, F]
Link-based Impact Metrics • Number of Link Failures • Number of affected O-D pairs • Number of affected BGP prefixes • Path coverage • Total rerouted traffic • Peak factor
Critical Failures • For each time-based metric • Removing failures occurring during 1-5% of the time improves the metric by a factor of at least 5. • For each link-based metric • Removing failures on 1-7% of links improves the metric by a factor of at least 3.
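The idea of a small critical set can be illustrated with a short Python sketch. It assumes that "improving the metric by a factor of k" means that the peak value of the metric drops by at least a factor of k once the largest samples are removed; that reading, the function name, and the sample data are all assumptions.

```python
def critical_fraction(values, factor):
    """Smallest fraction of samples to drop so the peak shrinks by `factor`.

    `values` holds one non-negative metric sample per time bin (or per link)
    and must be non-empty. Returns (fraction_removed, new_peak), or None if
    no proper subset of the samples achieves the requested improvement.
    """
    ordered = sorted(values, reverse=True)
    peak = ordered[0]
    for removed in range(1, len(ordered)):
        new_peak = ordered[removed]      # largest sample left after removal
        if new_peak * factor <= peak:
            return removed / len(ordered), new_peak
    return None


# Hypothetical metric samples: mostly small, with two large outliers.
samples = [1] * 98 + [40, 50]
result = critical_fraction(samples, factor=5)
```

With these made-up samples, dropping the two outliers (2% of the bins) is enough to cut the peak by more than a factor of 5.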
Critical Links • Any link that has a critical failure is called a Critical Link. • We are interested in fixing such links.
Correlation of the Critical Sets • Overall, 23% of all links are critical.
Cause Analysis • Markopoulou et al. used IS-IS update messages to classify link failures into the following categories [MIB+04]. • Maintenance • Unplanned • Shared failures (router-related, optical-related, unspecified) • Individual failures (about 70% of all unplanned failures)
Matching SLOS Alarms with IP Link Failures [Figure: timeline of an IP link failure, with SLOS (~20 ms) and SLOS Cleared (~12 sec) marked] • 58% of all link failures are due to optical layer problems. • 84% of critical failures are due to optical layer problems.
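A possible matching procedure, sketched in Python. It assumes the two figures on the timeline (~20 ms and ~12 sec) act as tolerance windows: a failure is attributed to the optical layer when an SLOS alarm on the same link falls within 20 ms of the failure start and the SLOS Cleared alarm falls within 12 sec of the failure end. The record layouts and the example data are hypothetical.

```python
def match_optical(failures, alarms, start_tol=0.020, end_tol=12.0):
    """Return the IP link failures that line up with an SLOS / SLOS Cleared pair.

    failures: list of (link, start, end) tuples, times in seconds
    alarms:   list of (link, slos_time, cleared_time) tuples, times in seconds
    The tolerances are assumptions read off the timeline figure.
    """
    matched = []
    for link, start, end in failures:
        for alarm_link, slos, cleared in alarms:
            if (alarm_link == link
                    and abs(start - slos) <= start_tol
                    and abs(end - cleared) <= end_tol):
                matched.append((link, start, end))
                break   # one matching alarm pair is enough
    return matched


# Hypothetical logs: only the failure on link "L1" has a nearby alarm pair.
failures = [("L1", 100.0, 160.0), ("L2", 200.0, 230.0)]
alarms = [("L1", 99.99, 150.0)]
optical = match_optical(failures, alarms)
share = len(optical) / len(failures)
```

Running the same match over all failures and then over only the critical ones would give two shares comparable to the percentages quoted above.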
Reducing Critical Failures • Replace old optical fibers/parts. • Optical Protection. • Push the traffic away. • Also works for maximum load and peak factor.
Reducing Link Down-time • Low-failure links: • Failures are very rare. • Damping doesn’t help. • High-failure links: • Failure rate changes very slowly. • Fixed damping is wasteful.
Adaptive Damping
Input:
  d: time difference between the last two failures
  T: threshold
  c: constant
Output:
  ADT: adaptive damping timer
function Adaptive_Damping
begin
  if (d < T) ADT := c × d;
  else ADT := 0;
end;
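A minimal Python rendering of the adaptive damping rule. The parameter names are illustrative, and the choice of operands in the product (the constant times the gap between the last two failures) is an assumption; the threshold and constant values in the example are made up.

```python
def adaptive_damping(gap, threshold, constant):
    """Return the adaptive damping timer (ADT) for a link.

    gap:       time between the link's last two failures
    threshold: gaps at or above this value are treated as unrelated failures
    constant:  scaling factor applied to the gap (assumed form of the timer)
    """
    if gap < threshold:
        # Failures are close together: hold the link down for a while.
        return constant * gap
    # Failures are far apart: no damping.
    return 0


# Hypothetical values: 60 s threshold, scaling constant 2.
frequent = adaptive_damping(gap=10, threshold=60, constant=2)
rare = adaptive_damping(gap=120, threshold=60, constant=2)
```

Because the timer is zero for widely spaced failures, low-failure links are never damped, while frequently failing links get a timer that tracks their recent behavior instead of a fixed value.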