Infrastructure-based Resilient Routing Ben Y. Zhao, Ling Huang, Jeremy Stribling, Anthony Joseph and John Kubiatowicz University of California, Berkeley Sahara Winter Retreat, 2004
Challenges Facing Network Applications • Network connectivity is not reliable • Disconnections frequent in the wide-area Internet • IP-level repair is slow • Wide-area: BGP 3 mins • Local-area: IS-IS 5 seconds • Next generation network applications • Mostly wide-area • Streaming media, VoIP, B2B transactions • Low tolerance of delay, jitter and faults • Our work: transparent resilient routing infrastructure that adapts to faults in milliseconds, not seconds ravenben@eecs.berkeley.edu
Talk Overview • Motivation • A Structured Overlay Infrastructure • Mechanisms and policy • Evaluation • Summary
The Challenge • Routing failures are diverse • Many causes: • Router misconfigurations, cut fiber, planned downtime, protocol implementation bugs • Occur anywhere with local or global impact: • Single fiber cut can disconnect AS pairs • Isolating failures is difficult • Wide-area measurement is ongoing research • Single event leads to complex inter-protocol interactions • End user symptoms often dynamic or intermittent • Requires: • Fault detection from multiple distributed vantage points • In-network decision making necessary for timely responses
An Infrastructure Approach • Our goals • Overlay focused on resiliency • Route around failures to maintain connectivity • Respond in milliseconds (react instantaneously to faults) • Our approach • Large-scale infrastructure for fault and route discovery • Nodes are observation points (similar to Plato’s NEWS service) • Nodes are also points of traffic redirection (forwarding path determination and data forwarding) • Automated fault detection and circumvention • No edge node involvement: fast response time, security focused on infrastructure • Fully transparent, no application awareness necessary
An Illustration [Figure: Qwest backbone map] • Goal: fast fault detection and route-around • Key: on-the-fly in-network traffic redirection
Why Structured Overlays • Resilient Overlay Networks (MIT) • Fully connected mesh • Allows each node full knowledge of network • Fast, independent calculation of routes • Nodes can construct any path, maximum flexibility • Cost of flexibility • Protocol needs to choose the “right” route/nodes • Per node O(n) state • Monitors n - 1 paths • O(n²) total path monitoring is expensive
The Big Picture [Figure: overlay nodes layered above the Internet] • Locate nearby overlay proxy • Establish overlay path to destination host • Overlay routes traffic resiliently
Traffic Tunneling [Figure: legacy nodes A and B (A, B are IP addresses) each register with a nearby proxy; the proxies issue put(hash(A), P’(A)) and put(hash(B), P’(B)) into the structured peer-to-peer overlay; get(hash(B)) returns P’(B)] • Store mapping from end host IP to its proxy’s overlay ID • Similar to approach in Internet Indirection Infrastructure (I3)
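The register / put / get exchange on this slide can be sketched roughly as follows. This is only an illustration, not the authors' implementation: `OverlayDirectory`, `register`, and `lookup_proxy` are hypothetical names, an in-memory dict stands in for the structured overlay, and SHA-1 is an assumed choice of hash.

```python
import hashlib
from typing import Dict, Optional


class OverlayDirectory:
    """Hypothetical stand-in for the overlay's put/get interface.

    A real structured overlay would route each key to the node
    responsible for it; a plain dict is used here for illustration.
    """

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}

    @staticmethod
    def key_for(ip: str) -> str:
        # Assumed hash function: SHA-1 over the dotted IP string.
        return hashlib.sha1(ip.encode("ascii")).hexdigest()

    def put(self, key: str, value: str) -> None:
        self._store[key] = value

    def get(self, key: str) -> Optional[str]:
        return self._store.get(key)


def register(directory: OverlayDirectory, host_ip: str, proxy_id: str) -> None:
    """Proxy side: store the mapping hash(host IP) -> proxy overlay ID."""
    directory.put(OverlayDirectory.key_for(host_ip), proxy_id)


def lookup_proxy(directory: OverlayDirectory, dest_ip: str) -> Optional[str]:
    """Sender's proxy side: find the overlay ID of the destination's proxy."""
    return directory.get(OverlayDirectory.key_for(dest_ip))
```

With this sketch, A's proxy would call `lookup_proxy(directory, B)` once and then tunnel A's traffic to the returned overlay ID; an unregistered destination yields `None`.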
Tradeoffs of Tunneling via P2P • Fewer neighbor paths to monitor per node: O(log(n)) • Large reduction in probing bandwidth: O(n) → O(log(n)) • Faster fault detection with low bandwidth consumption • Actively maintain path redundancy • Manageable for “small” # of paths • Redirect traffic immediately when a failure is detected • Eliminate on-the-fly calculation of new routes • Restore redundancy when a path fails • Fast fault detection + precomputed paths = increased responsiveness • Cons: overlay imposes routing stretch (mostly < 2)
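To give a feel for the O(n) versus O(log(n)) difference in monitored paths, here is a rough back-of-the-envelope sketch. The constants are assumptions made for illustration only: a prefix-routing table with `base` entries per digit level (base 16 by default) is assumed, and the slides themselves state only the asymptotic bounds.

```python
from typing import Tuple


def full_mesh_paths(n: int) -> Tuple[int, int]:
    """Fully connected mesh: each node probes n - 1 paths, n(n - 1) in total."""
    return n - 1, n * (n - 1)


def digits_needed(n: int, base: int) -> int:
    """Smallest d such that base**d >= n (integer arithmetic, no float log)."""
    digits, capacity = 0, 1
    while capacity < n:
        capacity *= base
        digits += 1
    return digits


def structured_paths_per_node(n: int, base: int = 16) -> int:
    """Assumed model: `base` neighbor entries per digit level, capped at n - 1."""
    return min(n - 1, base * digits_needed(n, base))


for n in (300, 5000):
    per_node, total = full_mesh_paths(n)
    print(n, per_node, total, structured_paths_per_node(n))
```

Under these assumed constants a node in a 5000-node overlay would probe a few dozen neighbor paths rather than 4999.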
In-network Resiliency Mechanisms • Efficient fault detection • Use soft-state to periodically probe log(n) neighbor paths • “Small” number of routes → reduced bandwidth • Exponentially weighted moving average for link quality estimation • Avoid route flapping due to short-term loss artifacts • Loss rate: L_n = (1 − α) · L_{n−1} + α · p • Simple approach taken, ongoing research available • Smart fault-detection / propagation (Zhuang04) • Intelligent and cooperative path selection (Seshadri04) • Maintaining backup paths • Each hop has flexible routing constraint • Create and store backup routes at node insertion • Restore redundancy via “intelligent” gossip after failures • Simple policies to choose among redundant paths
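The loss-rate estimate can be sketched as below, assuming the update has the usual exponentially weighted moving average form L_n = (1 − α) · L_{n−1} + α · p, where α is the smoothing weight and p is the outcome of the latest probe (1 for a lost probe, 0 for a received one). The class name, the default α, and the 0/1 encoding of p are assumptions for illustration.

```python
class LossEstimator:
    """EWMA loss-rate estimate for one neighbor path (illustrative sketch)."""

    def __init__(self, alpha: float = 0.2, initial_loss: float = 0.0) -> None:
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha        # weight given to the newest probe (assumed value)
        self.loss = initial_loss  # current estimate L_n

    def update(self, probe_lost: bool) -> float:
        """Fold one probe result into the estimate and return the new L_n."""
        p = 1.0 if probe_lost else 0.0
        self.loss = (1.0 - self.alpha) * self.loss + self.alpha * p
        return self.loss
```

A small α smooths over isolated losses, which helps avoid route flapping, while a large α reacts faster to a real failure; the right value depends on the probe period.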
First Reachable Link Selection (FRLS) • Use estimated loss results to choose shortest “usable” path • Sort next hop paths by latency • Use shortest path with minimal quality > T • Correlated failures • Reduce with intelligent topology construction • Key is to leverage redundancy available
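A minimal sketch of the selection policy described here: sort the candidate next hops by latency and take the first whose quality exceeds the threshold T. The `NextHop` record and the definition of quality as 1 minus the estimated loss rate are assumptions; the default of 0.7 only mirrors the T = 70% setting mentioned in the PlanetLab experiments.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class NextHop:
    """Hypothetical routing-entry candidate (primary or backup path)."""
    name: str
    latency_ms: float
    loss_rate: float  # estimated loss in [0, 1]


def first_reachable_link(hops: Iterable[NextHop],
                         threshold: float = 0.7) -> Optional[NextHop]:
    """Return the lowest-latency hop whose quality (1 - loss) exceeds threshold."""
    for hop in sorted(hops, key=lambda h: h.latency_ms):
        if 1.0 - hop.loss_rate > threshold:
            return hop
    return None  # no usable path among the stored candidates


candidates = [
    NextHop("primary", latency_ms=10.0, loss_rate=0.5),
    NextHop("backup-1", latency_ms=20.0, loss_rate=0.1),
    NextHop("backup-2", latency_ms=30.0, loss_rate=0.0),
]
chosen = first_reachable_link(candidates)
```

In this example the primary path is skipped because its estimated quality (0.5) is below the threshold, so the lowest-latency usable backup is chosen without computing a new route.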
Evaluation • Metrics for evaluation • How much routing resiliency can we exploit? • How fast can we adapt to faults (responsiveness)? • Experimental platforms • Event-based simulations on transit-stub topologies • Data collected over multiple 5000-node topologies • PlanetLab measurements • Microbenchmarks on responsiveness • More details in paper (ICNP03) and poster session
Exploiting Route Redundancy (Sim) • Simulation of Tapestry, 2 backup paths per routing entry • Transit-stub topology shown, results from TIER and AS graphs similar
Responsiveness to Faults (PlanetLab) [Figure: plot with annotated values 300 and 660] • Response time increases linearly with probe period • Minimum link quality threshold T = 70%, 20 runs per data point
Link Probing Bandwidth (PlanetLab) • Medium-sized routing overlays incur low probing bandwidth • Bandwidth increases logarithmically with overlay size
Conclusion • Pros and cons of infrastructure approach • Structured routing has low path maintenance costs • Allows “caching” of backup paths for quick failover • Transparent to user applications • Can no longer construct arbitrary paths • Structured routing with low redundancy comes close to ideal connectivity • Incurs low routing stretch • Fast enough for highly interactive applications • 300 ms beacon period → response time < 700 ms • On overlay networks of 300 nodes, b/w cost is 7 KB/s • Ongoing questions • Is there a lower bound on desired responsiveness? • Should we use multipath redundant routing for resilience? • How to deploy as a single network across ISPs? • VPN-like routing service?
Related Work • Redirection overlays • Detour (IEEE Micro 99) • Resilient Overlay Networks (SOSP 01) • Internet Indirection Infrastructure (SIGCOMM 02) • Secure Overlay Services (SIGCOMM 02) • Topology estimation techniques • Adaptive probing (IPTPS 03) • Internet tomography (IMC 03) • Routing underlay (SIGCOMM 03) • Many, many other structured peer-to-peer overlays • Thanks to Dennis Geels / Sean Rhea for their work on BMark