LINF 2345
Leader election and consensus with crash and Byzantine failures
Seif Haridi, Peter Van Roy
Overview
• Synchronous systems with crash failures
• Leader election in rings
• Fault-tolerant consensus
Leader Election in Rings
Background: Rings
• The ring topology is a circular arrangement of nodes, often used as a control topology in distributed computations
• A graph is regular if all nodes have the same degree
• A ring is an undirected, connected, regular graph of degree 2
• Equivalently, G is a ring if and only if there is a one-to-one mapping of V to {0, …, n-1} such that the neighbors of node i are nodes i-1 and i+1 (modulo n)
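The neighbor relation in the last bullet can be sketched in a few lines of Python. The function name `ring_neighbors` is an illustrative assumption, not part of the course material.

```python
def ring_neighbors(i, n):
    """Return the two neighbors of node i in a ring of n nodes numbered 0..n-1."""
    # Python's % always returns a non-negative result, so node 0 wraps to n-1.
    return ((i - 1) % n, (i + 1) % n)


print(ring_neighbors(0, 5))  # (4, 1)
print(ring_neighbors(4, 5))  # (3, 0)
```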
The Leader Election Problem
• A situation where a group of processors must select a leader among them
  • Simplified coordination
  • Helpful in achieving fault tolerance
  • Coordinator in two/three phase commits
• Represents a general class of symmetry breaking problems
  • Deadlock removal
The Leader Election Problem
• An algorithm solves the leader election problem if:
  • The terminated states are partitioned into elected and non-elected states
  • Once a processor enters an elected/non-elected state, its transition function will only move it to another (or the same) elected/non-elected state
  • In every admissible execution exactly one processor enters an elected state and all others enter a non-elected state
The Leader Election Problem: Rings
• We have already seen an election algorithm for arbitrary network topologies in the previous section
• For rings:
  • Edges go between p_i and p_{i+1} (addition modulo n), for all i, 0 ≤ i ≤ n-1
  • Processors have a consistent notion of left (clockwise) and right (anticlockwise)
[Figure: simple oriented ring of three processors p0, p1, p2, with the two channels of each processor labeled 1 and 2]
Anonymous Rings
• A leader election algorithm for a ring is anonymous if
  • Every processor has the same state machine
  • This implies that processors do not have unique identifiers
• An algorithm is uniform if it does not use the value n, the number of processors
• Otherwise the algorithm is nonuniform:
  • For each size n there is a state machine, but it might be different for different sizes n
Anonymous Rings: Impossibility Results
• Main result
  • There is no anonymous leader election algorithm for ring systems
• The result can be stated more comprehensively as:
  • There is no nonuniform anonymous algorithm for leader election in synchronous rings
• An impossibility result for synchronous systems implies the same impossibility result for asynchronous systems. Why?
• An impossibility result for nonuniform algorithms implies the same for uniform algorithms. Why?
Anonymous Rings: Impossibility Results
• An impossibility result for synchronous systems implies the same impossibility result for asynchronous systems. Why?
  • Answer: An admissible execution in a synchronous system is also an admissible execution in an asynchronous system
  • Therefore there is always at least one admissible execution of any asynchronous algorithm that does not satisfy the correctness condition of a leader election algorithm
• An impossibility result for nonuniform algorithms implies the same for uniform algorithms. Why?
  • Answer: If there were a uniform algorithm, it could be used as a nonuniform algorithm
Asynchronous Rings
• Processors have unique identifiers, which can be any natural numbers
• For each p_i, there is a variable id_i initialized to the identifier of p_i
• We specify a ring by listing the processors starting from the one with the smallest identifier
• Each processor p_i, 0 ≤ i < n, is assigned id_i
[Figure: ring of four processors p0, p1, p2, p3 with identifiers 0, 10, 5, 97]
Asynchronous Rings: An O(n^2) Algorithm
• Each processor sends a message with its id to its left neighbor, and waits for messages from its right neighbor
• When a processor p_i receives a message m, it checks the id of m
  • If m.id > p_i.id, p_i forwards m to its own left neighbor
  • Otherwise the message is consumed
• A processor p_k that receives a message with its own id declares itself the leader, and sends a termination message to its left neighbor
• A processor that receives a termination message forwards it to the left, and terminates as a non-leader
Asynchronous Rings: An O(n^2) Algorithm
• The algorithm never sends more than O(n^2) messages
  • O(n^2) means c·n^2 is an upper bound, for some constant c
  • The processor with the lowest id may forward n messages plus one termination message
• There is an admissible execution in which the algorithm sends Θ(n^2) messages
  • Θ(n^2) means c1·n^2 is an upper bound and c2·n^2 is a lower bound, for some constants c1 and c2
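Asynchronous Rings: An O(n^2) Algorithm
• Example (an execution)
  • The message of the processor with identifier i is sent exactly i+1 times
  • n termination messages
  • Total is n + Σ_{i=0}^{n-1} (i+1) = n + n(n+1)/2 = Θ(n^2)
[Figure: ring with identifiers …, 2, 1, 0, n-1, n-2, … in order around the ring]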
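The O(n^2) algorithm and its message count can be checked with a small simulation. This is a minimal sketch, not the course's code: the name `lcr_election`, the FIFO delivery order, and the convention that the left neighbor of ring position p is position (p + 1) mod n are all assumptions made for illustration.

```python
from collections import deque


def lcr_election(ids):
    """Simulate the O(n^2) ring election; return (leader id, messages sent).

    ids[p] is the identifier at ring position p. Assumption: the left
    neighbor of position p is position (p + 1) % n.
    """
    n = len(ids)
    queue = deque()
    messages = 0
    # Every processor starts by sending its own id to its left neighbor.
    for p in range(n):
        queue.append(("probe", ids[p], (p + 1) % n))
        messages += 1
    leader = None
    while queue:
        kind, mid, p = queue.popleft()
        if kind == "probe":
            if mid > ids[p]:
                # Larger id: forward it to the left.
                queue.append(("probe", mid, (p + 1) % n))
                messages += 1
            elif mid == ids[p]:
                # Own id came back: become leader and start termination.
                leader = ids[p]
                queue.append(("term", mid, (p + 1) % n))
                messages += 1
            # Smaller id: the message is consumed.
        elif ids[p] != mid:
            # Non-leader forwards the termination message and stops.
            queue.append(("term", mid, (p + 1) % n))
            messages += 1
    return leader, messages


# Identifiers decrease in the direction of travel: n + n(n+1)/2 messages.
print(lcr_election([4, 3, 2, 1, 0]))  # (4, 20)
# Identifiers increase in the direction of travel: far fewer messages.
print(lcr_election([0, 1, 2, 3, 4]))  # (4, 14)
```

With n = 5 the first ring gives 5 + 15 = 20 messages, matching the total n + n(n+1)/2 from the example.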
Asynchronous Rings: An O(n log n) Algorithm
• The k-neighborhood of a processor p_i is the set of processors up to distance k from p_i in the ring (to the left and to the right)
• The algorithm operates in phases starting at 0:
  • In the kth phase a processor tries to be the winner of that phase
  • To be a k-phase winner it must have the largest id in its 2^k-neighborhood
  • Only winners of phase k continue to phase k+1
• At the end only one processor survives, and is elected as the leader
Asynchronous Rings: An O(n log n) Algorithm
• In phase 0, each processor p_i attempts to be a phase-0 winner:
  • p_i sends a ⟨probe, id_i⟩ message to its 1-neighborhood
  • If the id of a neighbor receiving the probe is greater than id_i, the message is swallowed
  • Otherwise the neighbor sends a reply message
  • If p_i receives a reply message from both its neighbors, it becomes a phase-0 winner and continues with phase 1
Asynchronous Rings: An O(n log n) Algorithm
• In phase k, each processor p_i that is a (k-1)-phase winner sends probe messages to its 2^k-neighborhood
• Each message traverses 2^k processors one by one
• A probe is forwarded by a processor if its id is smaller than the probe's id and it is not the last processor
Asynchronous Rings: An O(n log n) Algorithm
• If the probe is not swallowed by the last processor, that processor sends back a reply
• If p_i receives reply messages from both directions it becomes a k-phase winner and continues with phase k+1
• A processor that receives its own probe declares itself the leader and sends a termination message around the ring
Asynchronous Rings: An O(n log n) Algorithm
[Figure: ring p1, p2, …, p9 drawn as a line that wraps back to p1. Phase-0 winners: p1, p3, p5, p7. Phase-1 winners: p1, p5.]
Asynchronous Rings: An O(n log n) Algorithm: Messages
• ⟨probe, id, k, i⟩
• ⟨reply, id, k⟩
  • id: identifier of the processor
  • k: integer, the phase number
  • i: integer, a hop counter
Asynchronous Rings: An O(n log n) Algorithm
• Initially: asleep = true
• Upon receiving no message:
  • if asleep then
    • asleep := false
    • send ⟨probe, id, 0, 1⟩ to left and right
Asynchronous Rings: An O(n log n) Algorithm
• Initially: asleep = true
• Upon receiving ⟨probe, j, k, d⟩ from left (resp. right):
  • if j = id then terminate as the leader
  • if j > id and d < 2^k then
    • send ⟨probe, j, k, d+1⟩ to right (resp. left)
  • if j > id and d = 2^k then
    • send ⟨reply, j, k⟩ to left (resp. right)
Asynchronous Rings: An O(n log n) Algorithm
• Initially: asleep = true
• Upon receiving ⟨reply, j, k⟩ from left (resp. right):
  • if j ≠ id then send ⟨reply, j, k⟩ to right (resp. left)
  • else
    • if already received ⟨reply, j, k⟩ from right (resp. left) then
      • send ⟨probe, id, k+1, 1⟩ to left and right
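The three handlers above can be combined into one small simulation. This is a sketch under stated assumptions, not the course's implementation: the name `hs_election` is hypothetical, all processors wake up at once, messages are delivered in FIFO order, a direction of travel (+1 or -1 along ring positions) stands in for "left" and "right", and the run stops as soon as a processor receives its own probe (termination messages are left out).

```python
from collections import deque


def hs_election(ids):
    """Simulate the O(n log n) ring election; return (leader id, messages sent)."""
    n = len(ids)
    queue = deque()
    replies = [set() for _ in range(n)]  # (phase, direction) replies received
    sent = 0

    def send(kind, j, k, d, dest, direction):
        nonlocal sent
        sent += 1
        queue.append((kind, j, k, d, dest % n, direction))

    # Wake-up: every processor probes its 1-neighborhood (phase 0, hop 1).
    for p in range(n):
        for direction in (+1, -1):
            send("probe", ids[p], 0, 1, p + direction, direction)

    while queue:
        kind, j, k, d, p, direction = queue.popleft()
        if kind == "probe":
            if j == ids[p]:
                return ids[p], sent  # own probe came back: leader
            if j > ids[p] and d < 2 ** k:
                send("probe", j, k, d + 1, p + direction, direction)
            elif j > ids[p]:
                # Last processor of the 2^k-neighborhood: reply backwards.
                send("reply", j, k, 0, p - direction, -direction)
            # Otherwise the probe is swallowed.
        elif j != ids[p]:
            send("reply", j, k, 0, p + direction, direction)  # relay reply
        else:
            replies[p].add((k, direction))
            if (k, -direction) in replies[p]:
                # Replies from both sides: phase-k winner, start phase k+1.
                for out in (+1, -1):
                    send("probe", ids[p], k + 1, 1, p + out, out)
    return None, sent


leader, count = hs_election([5, 3, 9, 1, 7, 2, 8, 4, 6])
print(leader)  # 9
```

Only the processor with the largest id can see its own probe return, because every other probe is swallowed when it reaches that processor.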
Fault-Tolerant Consensus
Fault-Tolerant Consensus: Overview
• Study problems when a distributed system is unreliable
  • Processors behave incorrectly
• The consensus problem
  • Requires processors to agree on a common output based on their (possibly conflicting) inputs
• Types of failures
  • Crash failure (a processor stops operating)
  • Byzantine failure (a processor behaves arbitrarily, also known as malicious failure)
Fault-Tolerant Consensus: Overview
• Synchronous systems
  • To solve consensus with Byzantine failures, fewer than a third of the processors may behave arbitrarily
  • We will show one algorithm in detail, which uses an optimal number of rounds but has exponential message complexity
  • More sophisticated algorithms are possible, for example, an algorithm that has polynomial message complexity
Fault-Tolerant Consensus: Overview
• Asynchronous message passing systems
  • The consensus problem cannot be solved by deterministic algorithms, neither for crash nor for Byzantine failures
  • This is a famous impossibility result first proved in 1985 by Fischer, Lynch, and Paterson
• How do we get around this impossibility?
  • We can introduce a synchrony assumption or we can make the algorithm randomized (probabilistic)
  • Both solutions can be practical, but have their limitations
Synchronous Systems with Crash Failures
• Assumptions
  • The communication graph is complete, i.e. a clique
  • Communication links are fully reliable
• In the reliable synchronous system
  • An execution consists of rounds
  • Each round consists of delivery of all messages pending in outbuf variables, followed by one computation step by each processor
Synchronous Systems with Crash Failures
• An f-resilient system
  • A system where up to f processors can fail
• Execution in an f-resilient system
  • There exists a subset F of at most f processors, the faulty processors (different for different executions)
  • Each round contains exactly one computation event for every processor not in F, and at most one computation event for every processor in F
Synchronous Systems with Crash Failures
• Execution in an f-resilient system
  • Each round contains exactly one computation event for every processor not in F, and at most one computation event for every processor in F
  • If a processor in F does not have a computation event in some round, then it has no computation event in any subsequent round
  • In the last round in which a faulty processor has a computation event, an arbitrary subset of its outgoing messages is delivered
Synchronous Systems with Crash Failures
• Clean failure
  • A situation where all or none of the processor's messages are delivered in its last step
  • Consensus is easy and efficient for clean failures
• We have to deal with non-clean failures
  • As we shall see, this is what makes the algorithm expensive
The Consensus Problem
• Each p_i has a special component x_i, called the input, and y_i, called the output
• Initially
  • Each x_i contains a value from some well-ordered set
  • y_i is undefined
• A solution to the consensus problem must satisfy the following conditions
  • Termination
  • Agreement
  • Validity
The Consensus Problem
• Termination
  • In every admissible execution, y_i is eventually assigned a value, for every nonfaulty processor p_i
• Agreement
  • In every execution, if y_i and y_j are assigned, then y_i = y_j, for all nonfaulty processors p_i and p_j
• Validity
  • In every execution, if y_i is assigned v for some value v on a nonfaulty processor p_i, then there exists a processor p_j such that x_j = v
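The three conditions can be written as a checker over the outcome of one finished execution. This is an illustrative sketch: the name `check_consensus` and the data layout (a list of inputs, a dict of assigned outputs, a list of nonfaulty processor indices) are assumptions, and a single finished run can only approximate the "eventually" in the termination condition.

```python
def check_consensus(inputs, outputs, nonfaulty):
    """Return (termination, agreement, validity) for one finished execution.

    inputs[i] is x_i; outputs maps i to y_i for processors that decided;
    nonfaulty lists the indices of the nonfaulty processors.
    Assumption: None is never used as an input value.
    """
    decided = [outputs.get(i) for i in nonfaulty]
    assigned = [v for v in decided if v is not None]
    termination = len(assigned) == len(decided)      # every nonfaulty y_i set
    agreement = len(set(assigned)) <= 1              # all assigned y_i equal
    validity = all(v in inputs for v in assigned)    # each y_i is some x_j
    return termination, agreement, validity


print(check_consensus([0, 1, 1], {0: 1, 1: 1}, [0, 1]))  # (True, True, True)
print(check_consensus([0, 1, 1], {0: 0, 1: 1}, [0, 1]))  # (True, False, True)
```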
Simple Algorithm
• Needs f+1 rounds
• Every processor maintains a set of values it knows to exist in the system
• Initially this set contains only its input value
• In later rounds:
  • A processor updates its set by adding new values received from other processors
  • And broadcasts any new additions
• At round f+1 each processor decides on the smallest value in its set
Simple Algorithm: Consensus in the Presence of Crash Failures
• Initially V = {x}
• Round k, 1 ≤ k ≤ f+1:
  • send { v ∈ V : p_i has not already sent v } to all processors
  • receive S_j from p_j, 0 ≤ j ≤ n-1, j ≠ i
  • V := V ∪ S_0 ∪ … ∪ S_{n-1}
  • if k = f+1 then y := min(V)
Illustration of the Algorithm: f = 3
• The algorithm requires f+1 rounds, and tolerates f crash failures
[Figure: processors p0, p1, p2, p3, p4 over rounds 1 to 4; the value x is passed along a single link in each round]
Illustration of the Algorithm: f = 3
• p2 and p4 survive
• Others crash one at a time
• p2 and p4 have the value x
[Figure: processors p0, p1, p2, p3, p4 over rounds 1 to 4; the value x is passed along a single link in each round]
How the algorithm works
• Why is one round not enough?
  • Hint: non-clean failures!
• In the previous slides, the value x is sent across only one link instead of all links, because the processor has a non-clean failure
• We need enough rounds to cover the possibility of a non-clean failure in each round
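The f+1-round algorithm and the effect of non-clean failures can be sketched as a simulation. The function name `flooding_consensus` and the `crashes` parameter (processor → round of crash and the only recipients reached in that round) are hypothetical choices made for this example. The crash pattern below is one possible reading of the illustration: p0, p1 and p3 crash in rounds 1, 2 and 3, each delivering its last message over a single link.

```python
def flooding_consensus(inputs, f, crashes=None):
    """Run f+1 rounds of the simple algorithm; return {survivor: decision}.

    crashes maps a processor to (round, recipients): it crashes in that
    round and its final messages reach only the listed recipients.
    """
    crashes = crashes or {}
    n = len(inputs)
    known = [{x} for x in inputs]          # the set V of each processor
    already_sent = [set() for _ in range(n)]
    alive = set(range(n))
    for rnd in range(1, f + 2):
        inbox = [set() for _ in range(n)]
        for i in sorted(alive):
            new = known[i] - already_sent[i]   # values not sent before
            already_sent[i] |= new
            if i in crashes and crashes[i][0] == rnd:
                recipients = crashes[i][1]     # non-clean failure
            else:
                recipients = range(n)
            for j in recipients:
                if j != i:
                    inbox[j] |= new
        # Processors that crashed in this round take no further steps.
        alive -= {i for i, (r, _) in crashes.items() if r == rnd}
        for i in alive:
            known[i] |= inbox[i]
    return {i: min(known[i]) for i in alive}


crashes = {0: (1, {1}), 1: (2, {3}), 3: (3, {2})}
print(flooding_consensus([0, 5, 6, 7, 8], 3, crashes))  # {2: 0, 4: 0}
# Only 3 rounds against the same 3 crashes: the survivors disagree.
print(flooding_consensus([0, 5, 6, 7, 8], 2, crashes))  # {2: 0, 4: 5}
```

In the second call the smallest value reaches p2 only in the last round, so p4 never learns it; the extra round in the first call is what lets p2 pass it on.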
Synchronous Systems with Byzantine Failures
• We want to reach an agreement in spite of malicious processors
• In an execution of an f-resilient Byzantine system, there is a subset of at most f faulty processors
• In a computation step of a faulty processor, its state and the messages sent are completely unconstrained
• A faulty processor may also mimic the behavior of a crashed processor
The Consensus Problem
• Termination
  • In every admissible execution, y_i is eventually assigned a value, for every nonfaulty processor p_i
• Agreement
  • In every execution, if y_i and y_j are assigned, then y_i = y_j, for all nonfaulty processors p_i and p_j
• Validity
  • In every execution, if y_i is assigned v for some value v on a nonfaulty processor p_i, then there exists a processor p_j such that x_j = v
Lower Bounds on the Number of Faulty Processors
• If a third or more of the processors can be Byzantine then consensus cannot be reached
• In a system with three processors such that one is Byzantine, there is no algorithm that solves the consensus problem
Three-Processor System
• Assume that there is a 3-processor algorithm A that solves the Byzantine agreement problem if one processor is faulty
• Take two copies of A and configure them into a hexagonal system S
[Figure: triangle system A with processors 1, 2, 3]
Three-Processor System
• Input value for processors 1, 2, and 3 is 0
• Input value for processors 1', 2', and 3' is 1
[Figure: triangle system A and hexagonal system S; in S, processors 1, 2, 3 have input 0 and processors 1', 2', 3' have input 1]
Three-Processor System
• S is a synchronous system; each processor in S runs the algorithm of its counterpart in the triangle system A
• Each processor in S knows its neighbors and is unaware of the other nodes
• We expect S to exhibit a well-defined behavior with its input
• Observe that S does not solve the consensus problem
• Consider the resulting infinite synchronous execution of S
[Figure: hexagonal system S; processors 1, 2, 3 have input 0 and processors 1', 2', 3' have input 1]
Execution from the point of view of processors 2 and 3
• Processors 2 and 3 see 1 as faulty, and since A is a consensus algorithm they both decide on 0 in the execution of S
[Figure: hexagonal system S next to the triangle formed by processors 2 and 3 with a faulty processor 1]
Execution from the point of view of processors 1' and 2'
• Processors 1' and 2' see 3 as faulty, and since A is a consensus algorithm they both decide on 1 in the execution of S
[Figure: hexagonal system S next to the triangle formed by processors 1' and 2' with a faulty processor 3]
Execution from the point of view of processors 1' and 3
• Processors 1' and 3 see 2 as faulty, and since A is a consensus algorithm they both must decide on one output value in the execution of S
• This is not possible since they have already decided differently: 3 decides 0 and 1' decides 1
• A contradiction! Therefore A does not exist
[Figure: hexagonal system S next to the triangle formed by processors 1' and 3 with a faulty processor 2]
Consensus Algorithm 1
• Takes exactly f+1 rounds
• Requires n ≥ 3f+1
• The algorithm has two stages:
  • First, information is gathered by communication among processors
  • Second, each processor locally computes its decision value
Information Gathering Phase
• Information in each processor is represented as a tree, in which each path from the root to a leaf contains f+2 nodes (height = f+1)
• Nodes are labeled by sequences of processor names:
  • The root is labeled by the empty sequence
  • Let the label of the internal node v be (i1, i2, …, ir); then for each i, 0 ≤ i ≤ n-1, not in the sequence, v has a child labeled by (i1, i2, …, ir, i)
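The labels of this tree are exactly the sequences of distinct processor names of length 0 to f+1, so they can be enumerated directly. A minimal sketch, where the function name `tree_labels` is an assumption made for illustration:

```python
from itertools import permutations


def tree_labels(n, f):
    """List the node labels of the information gathering tree, level by level.

    A label is a tuple of distinct processor names from 0..n-1; the root is
    the empty tuple and the leaves have length f+1.
    """
    labels = []
    for length in range(f + 2):
        # permutations(..., r) yields every ordered choice of r distinct names.
        labels.extend(permutations(range(n), length))
    return labels


labels = tree_labels(4, 1)  # n = 4 processors, f = 1
print(len(labels))          # 17 = 1 root + 4 children + 12 leaves
print([l for l in labels if len(l) == 2 and l[0] == 0])  # [(0, 1), (0, 2), (0, 3)]
```

The number of leaves is n(n-1)…(n-f), which illustrates why this algorithm has exponential message complexity in f.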