Failure Detectors

Can we do anything in asynchronous systems?
• Reliable broadcast
  • Process j sends a message m to all processes in the system
• Requirement
  • If m is delivered by any correct process, then it must be delivered by all correct processes
• Intuition
  • A process may receive m but deliver it only at a later point
• Assumptions
  • A single message send is atomic
  • If a message is sent, it is received as long as the receiving process does not fail

Is this algorithm correct?
• Let 1..n be the processes in the system
• Sender:
    for x = 1 to n
        send m to x
• Upon receiving m:
    deliver m
• What is wrong?
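One way to explore the question is to simulate a sender that crashes partway through its loop. The sketch below is only an illustration: the function name and the `sends_before_crash` parameter are hypothetical, and message delivery is modelled as instantaneous.

```python
def naive_broadcast(n, sends_before_crash):
    """Simulate the naive loop when the sender crashes after a few sends.

    Returns the set of processes that delivered m.
    """
    delivered = set()
    for x in range(1, n + 1):
        if len(delivered) == sends_before_crash:
            break                # hypothetical crash point of the sender
        delivered.add(x)         # x receives m and delivers it at once
    return delivered


if __name__ == "__main__":
    got = naive_broadcast(n=5, sends_before_crash=2)
    missed = set(range(1, 6)) - got
    print("delivered:", sorted(got), "never delivered:", sorted(missed))
```

If any process in the first set is correct, the processes in the second set violate the requirement: m was delivered by a correct process but not by all correct processes.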

How to fix it?
• Let 1..n be the processes in the system
• We will use this algorithm in our work with failure detectors
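The slide text does not show the fixed algorithm itself. A common fix, assumed here rather than taken from the slide, is to have every process relay m to all others the first time it receives m, and only then deliver it. The sketch below simulates that idea; `relay_broadcast` and its parameters are hypothetical names, and every process other than the sender is assumed to be correct.

```python
def relay_broadcast(n, sender, sends_before_crash):
    """Relay-based broadcast: on first receipt, forward m to all, then deliver.

    The sender crashes after `sends_before_crash` sends. Returns the set of
    non-sender processes that delivered m.
    """
    others = [x for x in range(1, n + 1) if x != sender]
    inbox = list(others[:sends_before_crash])   # copies of m in flight
    delivered = set()
    while inbox:
        p = inbox.pop()
        if p in delivered:
            continue                            # duplicate copy: ignore it
        inbox.extend(x for x in others if x != p)  # relay before delivering
        delivered.add(p)
    return delivered


if __name__ == "__main__":
    for k in range(5):
        print(k, sorted(relay_broadcast(n=5, sender=1, sends_before_crash=k)))
```

In this simulation the outcome is all-or-nothing: either no correct process delivers m, or every correct process does, at the cost of roughly n times more messages.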

Main Requirements
• Accuracy
  • When a process is suspected to have failed, it actually has
• Completeness
  • When a process fails, it is suspected
• Assumption in this work: no repairs are possible

Different Completeness Requirements
• Strong completeness
  • Eventually every process that crashes is permanently suspected by every correct process
• Weak completeness
  • Eventually every process that crashes is permanently suspected by some correct process

Different Accuracy Requirements
• Strong accuracy
  • No process is suspected before it crashes
• Weak accuracy
  • Some correct process is never suspected
• Eventual strong accuracy
  • There is a time (unknown to the processes themselves) after which no process is suspected before it crashes
• Eventual weak accuracy
  • There is a time (unknown to the processes themselves) after which some correct process is never suspected
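To make the definitions concrete, the following sketch checks some of them against a finite, hand-written trace. The trace format is an assumption made for illustration: `history[t][p]` is the set of processes that p suspects at time t, and `crash_time` maps each crashed process to its crash time. Because the real properties speak about infinite runs, completeness is only approximated by looking at the last step of the trace.

```python
def strong_accuracy(history, crash_time):
    """No process is suspected before it crashes."""
    for t, view in enumerate(history):
        for suspected in view.values():
            for q in suspected:
                if crash_time.get(q, float("inf")) > t:
                    return False        # q was still alive at time t
    return True


def weak_accuracy(history, crash_time, processes):
    """Some correct process is never suspected by anyone."""
    correct = [p for p in processes if p not in crash_time]
    return any(
        all(c not in suspected for view in history for suspected in view.values())
        for c in correct
    )


def strong_completeness_at_end(history, crash_time, processes):
    """Finite-trace approximation: at the last step, every crashed process
    is suspected by every correct process."""
    final = history[-1]
    correct = [p for p in processes if p not in crash_time]
    return all(q in final[p] for p in correct for q in crash_time)


if __name__ == "__main__":
    processes = [1, 2, 3]
    crash_time = {3: 2}                         # process 3 crashes at time 2
    history = [
        {1: set(), 2: {3}, 3: set()},           # 2 suspects 3 too early
        {1: set(), 2: set(), 3: set()},
        {1: {3}, 2: {3}},
    ]
    print(strong_accuracy(history, crash_time))
    print(weak_accuracy(history, crash_time, processes))
    print(strong_completeness_at_end(history, crash_time, processes))
```

In this trace the early suspicion of process 3 breaks strong accuracy, while weak accuracy still holds because process 1 is never suspected.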

Classification of Failure Detectors
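The classification figure is not present in the text. Assuming the slide follows the usual Chandra-Toueg naming (an assumption, since the figure is missing), each pair of one completeness property and one accuracy property names a detector class, which can be written as a lookup table:

```python
# Detector classes indexed by (completeness, accuracy).
# The names follow the Chandra-Toueg convention, assumed here.
CLASSES = {
    ("strong", "strong"): "P",
    ("strong", "weak"): "S",
    ("strong", "eventual strong"): "◇P",
    ("strong", "eventual weak"): "◇S",
    ("weak", "strong"): "Q",
    ("weak", "weak"): "W",
    ("weak", "eventual strong"): "◇Q",
    ("weak", "eventual weak"): "◇W",
}


def classify(completeness, accuracy):
    """Return the class name for a completeness/accuracy pair."""
    return CLASSES[(completeness, accuracy)]


if __name__ == "__main__":
    for (completeness, accuracy), name in CLASSES.items():
        print(f"{completeness:>6} completeness, {accuracy:<15} accuracy: {name}")
```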

Reducibility of Detectors
• Given a failure detector P, can we implement Q?
• Given a failure detector Q, can we implement P?

Reducibility of Detectors
[Figure: a transformation algorithm T(D→D') takes detector D as input and emulates detector D']

Reducibility of Detectors
Every process p executes:
    repeat forever
        { p queries its local failure detector D_p }
        suspect_p = D_p
        send (p, suspect_p) to all
    []
    when receive (q, suspect_q)
        output_p = (output_p ∪ suspect_q) - { q }
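The transformation above can be tried out in a small simulation. This is a sketch under simplifying assumptions: messages arrive instantly and in process order, crashed processes send nothing, and the function and parameter names are hypothetical.

```python
def emulate(local_suspects, crashed, rounds=1):
    """Run the gossip-style transformation for a few rounds.

    local_suspects[q] is what q's local detector D_q currently outputs.
    Returns output_p for every live process p.
    """
    live = [p for p in sorted(local_suspects) if p not in crashed]
    output = {p: set() for p in live}
    for _ in range(rounds):
        for q in live:                          # q sends (q, suspect_q) to all
            suspect_q = set(local_suspects[q])
            for p in live:                      # p receives (q, suspect_q)
                output[p] = (output[p] | suspect_q) - {q}
    return output


if __name__ == "__main__":
    # Process 4 has crashed, but only process 1's local detector suspects it
    # (weak completeness).
    local = {1: {4}, 2: set(), 3: set(), 4: set()}
    print(emulate(local, crashed={4}))
```

After one round every live process suspects process 4, which is the step from weak to strong completeness. Removing q on receipt reflects that a process that just sent a message cannot have crashed before sending it.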

Solving Consensus with Weak Failure Detector S
Phase 1
    for x = 1 to n - 1
        report the new votes you learnt in the previous round
        wait until you receive votes from everyone you do not suspect to have failed
    end for
Phase 2
    report all the votes you have learnt
    wait until you receive votes from everyone you do not suspect to have failed
Phase 3
    consider only those votes that are known to everyone
    choose the vote of the smallest-ID process as the decision
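The last step, Phase 3, can be sketched as a set intersection followed by a minimum. The representation is an assumption made for illustration: each Phase 2 report is a dictionary from process ID to vote, and `reports` holds the process's own report plus those it received. The sketch also assumes at least one vote is common to all reports.

```python
def decide(reports):
    """Keep only votes present in every report, then pick the vote of the
    smallest process ID among them."""
    common = set(reports[0])
    for report in reports[1:]:
        common &= set(report)       # drop votes that someone does not know
    smallest = min(common)          # assumes `common` is non-empty
    return reports[0][smallest]


if __name__ == "__main__":
    reports = [
        {1: "a", 2: "b", 3: "c"},   # own report
        {2: "b", 3: "c"},           # this process never learnt vote 1
        {1: "a", 2: "b", 3: "c"},
    ]
    print(decide(reports))
```

Here vote 1 is discarded because one process does not know it, so the decision is the vote of process 2.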

Solving Consensus with Weak Failure Detector S • Assume that the number of processes failed is strictly less than n/2 • Round based computation • Coordinator in round x is (x mod n) + 1 • Coordinator is just a process that follows a protocol that slightly differs from others • Otherwise, there are no other assumptions about it

Solving Consensus with Weak Failure Detector S • In each round • Phase 1 • Send your estimates to coordinator • Phase 2: at coordinator • Wait until at least (n+1)/2 messages are received • Use them to decide on a tentative decision • Send tentative decision to all • Phase 3 • Wait until tentative decision received from coordinator or coordinator is suspected • In the former case, send an ack, and revise your estimate to be the tentative decision • In the latter case, send a nack • Phase 4: at coordinator • If (n+1)/2 acks are received then make a final decision and send it using reliable broadcast

Solving Consensus with Weak Failure Detector S • Upon receiving reliable broadcast message • Decide on the value proposed in it

Other Results
• ◇S (or ◇W) is the weakest failure detector that can be used for solving consensus
• P is the weakest failure detector that can be used to solve leader election
• The goal of the proposed survey in this area is to study this issue further
