Failure Detection and Recovery in MULTI6 draft-arkko-multi6dt-failure-detection-00.txt

Multi6 Design Team -- Jari Arkko, Marcelo Bagnulo, Geoff Huston, Erik Nordmark, Margaret Wasserman, Iljitsch van Beijnum, Jukka Ylitalo

Presentation Outline
• Background
• Addresses
• Interfaces to other components
• Reachability
• Principles of failure detection
• Principles of alternative search
• A sketch of a protocol
• Design decisions
• Architectural issues

Background

• MULTI6 design team work
• HIP multihoming work
• MOBIKE multihoming work
• What SCTP has done
• Movement detection in mobility protocols
• Host address configuration mechanisms

Addresses

Multihoming Basics
There’s more than one path for traffic
Typically, multiple prefixes for some of the participants
Observations:
• Multiple addresses on one or both end hosts
• Nodes should know about their own addresses
• And learn about the peer’s addresses over the MULTI6 protocol

Node’s Own Addresses -- Where Do They Come From?
Addresses come from the other parts of the stack
• The addresses are typically configured through protocols such as DHCP or IPv6 Neighbor Discovery
• Processes related to addresses are not trivial -- Duplicate Address Detection, valid/deprecated, scoping, ...
• Relationship to what the rest of the IP layer does (e.g., Router Discovery) or what the L2 does (e.g., 802.11 attachment)

Node’s Own Addresses -- Where Do They Go To?
• Addresses are taken away by the same mechanisms

Addresses -- What Is Their Status?
• Security: we need to believe what the configuration mechanisms tell us
  • If an address is no longer usable, we need to believe it
  • Address allocation can be secured (but the security is often not turned on); nevertheless, MULTI6 has nothing else to rely on either, so an address given to it should be considered at least as a candidate
• Even if an address is assigned to an interface, it is not guaranteed that you will actually be able to use it
  • Link temporarily broken
  • Router down
  • Etc.

A Few Address-Related Definitions
Available address
• Address is assigned to an interface
• The address is valid (in IPv6) and has completed uniqueness tests
Locally operational address
• Address is available
• L2 green light is on
• Default router is reachable (IPv6 NUD)
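The two definitions above nest: a locally operational address is an available address with two extra conditions. A minimal sketch of that relationship follows. The `LocalAddress` record and its field names are hypothetical, chosen only to mirror the bullets; in practice the values would come from the configuration modules and lower layers.

```python
from dataclasses import dataclass


@dataclass
class LocalAddress:
    """Hypothetical snapshot of what the rest of the stack reports for one address."""
    address: str
    assigned_to_interface: bool
    valid: bool                      # IPv6 "valid" state
    passed_uniqueness_test: bool     # e.g. Duplicate Address Detection completed
    link_up: bool                    # the "L2 green light"
    default_router_reachable: bool   # e.g. as reported by IPv6 NUD


def is_available(a: LocalAddress) -> bool:
    # Available: assigned, valid, and uniqueness tests completed.
    return a.assigned_to_interface and a.valid and a.passed_uniqueness_test


def is_locally_operational(a: LocalAddress) -> bool:
    # Locally operational: available, link up, default router reachable.
    return is_available(a) and a.link_up and a.default_router_reachable


# An address can be available without being locally operational,
# for example while the default router is down.
example = LocalAddress("2001:db8::1", True, True, True, True, False)
print(is_available(example), is_locally_operational(example))
```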

Interfaces to Related Modules
• An obvious set of “configuration” modules and protocols that handle address assignment and deletion and other related tasks
• A growing body of work for improving the characteristics related to changing connectivity at the “lower layer” -- e.g., DNA WG
  • DNA WG: draft-ietf-dna-goals-03.txt
  • DHC WG: draft-ietf-dhc-dna-ipv4-09.txt

Reachability

Are Two Locally Operational Addresses Enough?
Answer: No (not even if you can talk to someone else)
[Figure: host1, host2, two routers (R), and cnn.com, with one element marked "(broken)"]

The Definition of an Address Pair
Address pair
• A pair of addresses (src, dst) used in communications between two peers
Operational address pair
• Both addresses are locally operational
• Traffic flows when the pair is used

Symmetric vs. Asymmetric Address Pair Reachability
Note that reachability may not always be two-way…
Host1 should send from p to r, host2 from s to q
Ping would never work here!
[Figure: host1 with addresses p and q, host2 with addresses r and s, connected through four routers (R); one path is marked "=> only" and the other "<= only"]
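The asymmetric case can be made concrete with a small model. The sketch below treats reachability as a set of directed (src, dst) pairs, which is an assumption made for illustration only; the address names p, q, r, s are the ones used on the slide. It shows that no address pair passes a ping-style round-trip test even though each host has a working outbound pair.

```python
from itertools import product

# Directed pairs over which traffic flows in the slide's example:
# host1 can send from p to r, host2 can send from s to q.
WORKING = {("p", "r"), ("s", "q")}

HOST1_ADDRS = ["p", "q"]
HOST2_ADDRS = ["r", "s"]


def traffic_flows(src: str, dst: str) -> bool:
    """One-way test: does a packet sent from src arrive at dst?"""
    return (src, dst) in WORKING


def ping_works(src: str, dst: str) -> bool:
    """Round-trip test: the reply reuses the same two addresses, reversed."""
    return traffic_flows(src, dst) and traffic_flows(dst, src)


# Pairs host1 can use to send, and pairs host2 can use to send.
host1_outbound = [pr for pr in product(HOST1_ADDRS, HOST2_ADDRS) if traffic_flows(*pr)]
host2_outbound = [pr for pr in product(HOST2_ADDRS, HOST1_ADDRS) if traffic_flows(*pr)]

# Pairs for which a ping from host1 would succeed.
pingable = [pr for pr in product(HOST1_ADDRS, HOST2_ADDRS) if ping_works(*pr)]

print(host1_outbound, host2_outbound, pingable)
```

Under this model each direction has exactly one usable pair, and the list of pingable pairs is empty, which is why a round-trip probe alone cannot find them.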

Detection and Search

Selecting an Address
• How do we know there is a problem?
  • The address went away (certain)
  • Explicit test failed (certain… but might be a transient problem)
  • Lack of TCP progress, ICMP, … (hmm...)
• Picking another pair
  • No existing protocol proposals for finding operational address pairs (MULTI6, HIP, and MOBIKE are looking at this)

Picking Another Address Pair
• The selection should not itself cause a new problem through congestion
  • If a site link goes down, it would be a bad idea for all hosts in the site to suddenly start a cartesian ping bomb
• All hosts must obey exponential back-off while searching
• Downside:
  • 4 addresses on both sides, 0.1 second start timeout
  • Exponential back-off would take 3200 seconds!
• Suggestion: either this or a slight relaxation
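The slide does not say how the 3200-second figure is derived. One plausible reading, assumed here, is that each of the 4 x 4 = 16 address pairs is tried once and the timeout doubles after every attempt. Under that assumption the 16th attempt alone waits 0.1 x 2^15, roughly 3277 seconds, which is in the range the slide quotes. The function name and the doubling factor below are illustrative choices, not part of the draft.

```python
from itertools import product
from typing import List


def backoff_timeouts(n_local: int, n_peer: int,
                     initial_timeout: float, factor: float = 2.0) -> List[float]:
    """Timeout used for each address pair when the timeout grows after every try."""
    pairs = list(product(range(n_local), range(n_peer)))  # all (src, dst) combinations
    return [initial_timeout * factor ** i for i in range(len(pairs))]


timeouts = backoff_timeouts(4, 4, 0.1)
last_timeout = timeouts[-1]   # wait before giving up on the 16th pair
total_wait = sum(timeouts)    # worst case: every pair fails

print(len(timeouts), round(last_timeout), round(total_wait))
```

Whichever reading is intended, the worst case is on the order of an hour, which motivates the "slight relaxation" the slide suggests.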

Picking Another Address Pair, Cont’d
• As a result, the order in which you try things out is important
• Some signaling of preferences can be made while nodes tell each other what addresses they have
• For the rest, a number of heuristics can apply to the order
  • Example: an address that worked 30 seconds ago would be a useful candidate to try
• Suggestion: Leave details to implementations
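Since the ordering is left to implementations, the following is only one possible heuristic, sketched to show how the two inputs mentioned above could be combined: a preference signaled by the peer, then how recently a pair last worked. The `Candidate` record, the `peer_preference` field, and the sort key are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    src: str
    dst: str
    peer_preference: int = 0             # hypothetical: higher means more preferred
    last_worked: Optional[float] = None  # time (seconds) the pair last carried traffic


def order_candidates(candidates: List[Candidate], now: float) -> List[Candidate]:
    """Sort by signaled preference first, then by how recently the pair worked."""
    def key(c: Candidate):
        age = now - c.last_worked if c.last_worked is not None else float("inf")
        return (-c.peer_preference, age)
    return sorted(candidates, key=key)


now = 1000.0
candidates = [
    Candidate("A1", "B2"),                        # never seen working
    Candidate("A2", "B1", last_worked=now - 30),  # worked 30 seconds ago
    Candidate("A2", "B2", last_worked=now - 600), # worked 10 minutes ago
]
ordered = order_candidates(candidates, now)
print([(c.src, c.dst) for c in ordered])
```

With equal preferences, the pair that worked 30 seconds ago is tried first and the never-tested pair last.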

Picking Another Address Pair, Cont’d
• Testing for bidirectional reachability is easy
• Testing for unidirectional reachability is harder
• Reachability may depend on the packet!
  • Multi6 protocol vs. PING
  • Multi6 protocol vs. payload packet
  • Significant?

Finding Pairs -- Unidirectional Case

Peer A                                        Peer B
|                                                  |
| A decides that it has a problem                  |
|                                                  |
|        Poll 1 (src=A1, dst=B1)                   |
|------------------------------------------------->|
|                                                  |
| B sees that apparently A has a problem,          |
| starts the same process                          |
|                                                  |
|        Poll 2 (src=B1, dst=A1) OK: 1             |
|            X-------------------------------------|
|                                                  |
|        Poll 3 (src=A2, dst=B1)                   |
|-------------------------X                        |
|                                                  |
|        Poll 4 (src=B2, dst=A1) OK: 1             |
|<-------------------------------------------------|
|                                                  |
|        Poll 5 (src=A1, dst=B1) OK: 4             |
|------------------------------------------------->|
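The exchange above can be replayed with a small simulation. This is a sketch under several assumptions that the slides do not state: peers take strict turns, "OK: n" carries the id of the latest poll the sender has received, a peer treats a pair as working outbound once any of its polls on that pair is acknowledged, and timers and back-off are omitted. The `Peer` class and `run_search` function are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple

Pair = Tuple[str, str]  # (source address, destination address)


@dataclass
class Peer:
    name: str
    candidates: List[Pair]                 # pairs to try, in order
    sent: Dict[int, Pair] = field(default_factory=dict)
    last_received: Optional[int] = None    # id of the latest poll heard
    outbound: Optional[Pair] = None        # pair confirmed to work outbound
    tried: int = 0

    def next_pair(self) -> Pair:
        # Once a pair is confirmed, keep using it; otherwise walk the list.
        if self.outbound is not None:
            return self.outbound
        pair = self.candidates[self.tried % len(self.candidates)]
        self.tried += 1
        return pair

    def receive(self, poll_id: int, ok: Optional[int]) -> None:
        self.last_received = poll_id
        # "OK: n" says our poll n arrived, so the pair it used works outbound.
        if self.outbound is None and ok in self.sent:
            self.outbound = self.sent[ok]


def run_search(working: Set[Pair], a: Peer, b: Peer, max_polls: int = 20):
    """Alternate turns until both peers know a working outbound pair."""
    log = []
    poll_id = 0
    turn = 0
    while poll_id < max_polls and not (a.outbound and b.outbound):
        sender, receiver = (a, b) if turn % 2 == 0 else (b, a)
        turn += 1
        if sender is b and b.last_received is None:
            continue  # b has not noticed any problem yet
        poll_id += 1
        src, dst = sender.next_pair()
        ok = sender.last_received
        sender.sent[poll_id] = (src, dst)
        delivered = (src, dst) in working  # one-way delivery only
        if delivered:
            receiver.receive(poll_id, ok)
        log.append((poll_id, src, dst, ok, delivered))
    return log


# Reachability matching the diagram: only A1->B1 and B2->A1 carry traffic.
working = {("A1", "B1"), ("B2", "A1")}
peer_a = Peer("A", [("A1", "B1"), ("A2", "B1"), ("A1", "B2"), ("A2", "B2")])
peer_b = Peer("B", [("B1", "A1"), ("B2", "A1"), ("B1", "A2"), ("B2", "A2")])
for entry in run_search(working, peer_a, peer_b):
    print(entry)
```

With these inputs the run produces five polls in the same order as the diagram, ending with A using (A1, B1) and B using (B2, A1).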

Design Decisions

Some Suggested Design Principles
• Multi6 should not venture into the area of the configuration modules or protocols -- we shall not reinvent DHCP, and we shall believe what ND tells us
• Own addresses are learned locally, peer addresses are communicated
• Search procedures need to apply some form of exponential back-off
• Multi6 only works as a fail-over
  • Not load balancing (would cause problems for TCP)
  • Not selection of “best” path (harder than “a working” path)
• No mandated search order, no application input on “primary” or “backup” connection

Some Open Design Principles
• Do we need to support unidirectional reachability?
  • It complicates the protocols
  • Many failure modes cause unidirectional reachability, particularly given ingress filtering
• Is there any limitation in the scope of addresses allowed?
  • Statistically unique site-locals should work as well as global addresses with MULTI6

Some Architectural Issues

• Division of work between configuration / lower-layer modules and MULTI6
• Some cross-layer communication is needed:
  • ULP progress information helps failure detection (similar to what IPv6 NUD already needs)
  • The multihoming layer needs to inform ULPs that a slow start is needed after we have switched to a new address
• Division of work between MULTI6 and transport/application layers
  • Reachability information at MULTI6 or transport layers
  • Congestion information at transport layer
  • Application requirements for what is an acceptable connection
