The Consensus Problem in Fault Tolerant Computing Sajayasree K K ME(CSE) E0 245 Fault Tolerant Computing
The Problem The consensus problem is to form an agreement among the fault-free members of the resource population on a quantum of information in order to maintain the performance and integrity of the system.
Organisation • Background • Different approaches • Problem formulation • The PMC model • The Byzantine Agreement • Fault Classification • Testing • Conclusion
Background What is the need for consensus? Connect computer resources to get a system with greater power and availability than any of its parts. The reverse can happen if faulty elements are allowed to corrupt the system.
Two Approaches How to overcome the inadvertent or malicious spread of information by the faulty segment of the population? • Contain the fault: Byzantine Generals (Lamport et al. 1982) • Diagnose the fault: System Diagnosis (Preparata et al. 1967)
General Problem Formulation General layered approach to fault management. Layers shown: • Reconfiguration • Fault diagnosis or masking • Reliable communication • Unreliable communication medium • Synchronization
General Problem Formulation General NMR system (modules P1, P2, P3). Problems: • Performance • Cost • Distributed and central voting
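The voting step of an NMR system can be sketched as follows. This is a minimal illustration with a single central voter, not something taken from the slides: the function name `majority_vote` and the sample module outputs are assumptions made for the example.

```python
from collections import Counter

def majority_vote(outputs):
    """Return the value produced by a strict majority of modules, else None.

    Hypothetical central voter: `outputs` is a list with one result per
    replicated module.
    """
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    # A strict majority means more than half of all modules agree.
    return value if count > len(outputs) // 2 else None

# Three replicated modules; P2 is assumed faulty in this example.
results = {"P1": 42, "P2": 17, "P3": 42}
voted = majority_vote(list(results.values()))
print(voted)
```

With three modules, one faulty output is masked by the other two. The voter itself is a single point of failure here, which is one reason the choice between distributed and central voting appears among the problems listed.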
The PMC Model Proposed in 1967 by Preparata, Metze and Chien. Each processor tests another PE. The tests form a graph, and the set of test results forms a syndrome. Conditions: • All failures are hard (permanent) failures. • A fault-free processor is always able to determine accurately the condition of the PE it is testing. • A faulty processor produces unreliable test results. • No more than t PEs may be faulty.
The PMC Model [Figure: test graph over PEs A, B, C, D and E; surviving test-result labels: 1, x, 0, 0, 0]
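Diagnosis under the PMC conditions can be sketched as a brute-force search for fault sets that explain the syndrome. The example is an assumption, not a transcription of the figure: it uses a hypothetical ring A→B→C→D→E→A, encodes "tested PE judged faulty" as 1 and "judged fault-free" as 0, and picks 0 for the unreliable result x. The names `consistent` and `diagnose` are illustrative.

```python
from itertools import combinations

def consistent(faulty, syndrome):
    """True if every test run by a fault-free PE reports the truth.

    Results produced by faulty testers are unreliable, so they are ignored.
    """
    return all(
        result == (1 if tested in faulty else 0)
        for (tester, tested), result in syndrome.items()
        if tester not in faulty
    )

def diagnose(nodes, syndrome, t):
    """List every fault set with at most t PEs that explains the syndrome."""
    return [
        candidate
        for size in range(t + 1)
        for candidate in combinations(sorted(nodes), size)
        if consistent(set(candidate), syndrome)
    ]

# Hypothetical syndrome: (tester, tested) -> result. B is the faulty PE,
# so its own test of C (the "x" entry) could have been 0 or 1.
syndrome = {
    ("A", "B"): 1,
    ("B", "C"): 0,
    ("C", "D"): 0,
    ("D", "E"): 0,
    ("E", "A"): 0,
}
fault_sets = diagnose("ABCDE", syndrome, t=1)
print(fault_sets)
```

For this syndrome and t = 1, the only consistent fault set is `('B',)`, whichever value B reports for C. Enumerating all candidate sets is exponential in t, so this is only suitable for illustrating the model on small systems.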
The Byzantine Agreement Started by the work of Wensley et al. in 1978 on Software Implemented Fault Tolerance (SIFT). The number of PEs (n) must be greater than 3t, where t is the number of faulty elements. Each processor has a secret value, and values are exchanged by messages. Interactive consistency requires: • Consistency: Each fault-free PE should form an identical vector of values whose ith element corresponds to the ith processor in the system. • Meaningfulness: A vector element corresponding to a fault-free processor should be the actual secret value of that processor.
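The n > 3t bound and the two interactive consistency conditions can be written as simple checks. This sketch only verifies a given outcome; it does not implement an agreement protocol. The function names, the dictionary layout and the sample values are assumptions made for illustration.

```python
def tolerates(n, t):
    """The bound from the slide: n PEs can tolerate t faulty ones only if n > 3t."""
    return n > 3 * t

def interactive_consistency_holds(vectors, secrets, faulty):
    """Check consistency and meaningfulness over the fault-free PEs.

    vectors[p][q] is the value PE p recorded for PE q (hypothetical layout);
    secrets[q] is the actual secret value of PE q.
    """
    good = [p for p in vectors if p not in faulty]
    # Consistency: all fault-free PEs hold identical vectors.
    consistency = all(vectors[p] == vectors[good[0]] for p in good)
    # Meaningfulness: entries for fault-free PEs equal their real secrets.
    meaningful = all(vectors[p][q] == secrets[q] for p in good for q in good)
    return consistency and meaningful

# Four PEs, PE 3 assumed faulty (n = 4 > 3 * 1).
secrets = {0: "a", 1: "b", 2: "c", 3: "d"}
agreed = {0: "a", 1: "b", 2: "c", 3: "?"}
vectors = {0: dict(agreed), 1: dict(agreed), 2: dict(agreed), 3: {}}
ok = tolerates(4, 1) and interactive_consistency_holds(vectors, secrets, {3})
print(ok)
```

The entry the fault-free PEs record for the faulty PE 3 may be any value, as long as they all record the same one. The vector held by the faulty PE itself is not constrained.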
Byzantine Generals Problem The Byzantine Generals Problem was introduced by Lamport, Shostak and Pease in 1982. A Byzantine commanding general, who has surrounded the enemy with his many armies, each led by a lieutenant general, wishes to organize a concerted plan of action, i.e., to attack or to retreat.
Fault Classification Analysis of the characteristics of faulty processors leads to the proposal of fault models. A fault model defines the behavior of a PE once it has become faulty. • System diagnosis: a description of test results given the status of the tester and the tested PE. • Byzantine agreement: a description of the limitations of a faulty processor. In general, the more constraints in the fault model, the easier it will be to form consensus.
Fault Classification: A failure in System Diagnosis [Table: interactions of a faulty PE]
Fault Classification: A failure in System Diagnosis [Table: fault classes]
Fault Classification: A failure in Byzantine Agreement In the worst case, faulty PEs are assumed to work with complete knowledge of the state of the system: the adversary model. Limitations of the adversary model: defining algorithms that work only for this model can be limiting and impractical. So another classification of faults is introduced, in which each stronger class is a subset of a weaker class.
Fault Classification: A failure in Byzantine Agreement [Figure: classification of faults; surviving labels: Fail Stop, Byzantine Fault]
Conclusion Despite their different characteristics, Byzantine agreement and system diagnosis have very similar goals, namely to produce a correct agreement despite the presence of faults. Showing the similarities of the two approaches allows future research to draw from both areas rather than continuing apart.
References • Michael Barborak, Miroslaw Malek and Anton Dahbura, “The Consensus Problem in Fault-Tolerant Computing”, ACM Computing Surveys, Vol. 25, No. 2, June 1993. • Michael Fischer, Nancy Lynch and Michael Paterson, “Impossibility of Distributed Consensus with One Faulty Process”, Journal of the ACM, April 1985. • PODC Influential Paper Award 2001, http://www.podc.org/influential/2001.html