The Consensus Problem in Fault-Tolerant Computing: A Comprehensive Overview

The Consensus Problem in Fault Tolerant Computing
Sajayasree K K, ME (CSE)
E0 245 Fault Tolerant Computing

The Problem
The consensus problem is to reach agreement among the fault-free members of the resource population on a quantum of information, in order to maintain the performance and integrity of the system.

Organisation
• Background
• Different approaches
• Problem formulation
• The PMC model
• The Byzantine Agreement
• Fault Classification
• Testing
• Conclusion

Background
What is the need for consensus?
• Connecting computer resources yields a system with greater power and availability than any of its parts.
• The reverse can happen if faulty elements are allowed to corrupt the system.

Two Approaches
How can the inadvertent or malicious spread of information by the faulty segment of the population be overcome?
• Contain the fault: Byzantine Generals (Lamport et al. 1982)
• Diagnose the fault: System Diagnosis (Preparata et al. 1967)

General Problem Formulation
General layered approach to fault management (layers as listed on the slide):
• Reconfiguration
• Fault diagnosis or masking
• Reliable communication
• Unreliable communication medium
• Synchronization

General Problem Formulation
General NMR system [Figure: processors P1, P2 and P3]
Problems:
• Performance
• Cost
• Distributed and central voting
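To make the voting step of an NMR system concrete, here is a minimal sketch of a central majority voter. The function name `nmr_vote` and the choice to return `None` when no strict majority exists are assumptions made for illustration; the slides do not specify a voter.

```python
from collections import Counter

def nmr_vote(outputs):
    """Return the value produced by a strict majority of replicas, else None.

    `outputs` is a list with one result per replica (P1, P2, P3, ...).
    Returning None on "no majority" is an illustrative assumption.
    """
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) // 2 else None

# Three replicas (TMR): P3 is faulty, P1 and P2 outvote it.
tmr_result = nmr_vote([42, 42, 17])
# All replicas disagree: the voter cannot mask the fault.
no_majority = nmr_vote([1, 2, 3])
```

A central voter like this is itself a single point of failure, which is one reason the slide lists distributed versus central voting among the problems.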

The PMC Model
Proposed in 1967 by Preparata, Metze and Chien.
Each processor tests another PE. The tests are used to construct a graph and a syndrome.
Conditions:
• All failures are hard (permanent) failures.
• A fault-free processor is always able to determine accurately the condition of the PE it is testing.
• A faulty processor produces unreliable test results.
• No more than t PEs may be faulty.
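The conditions above can be sketched in a few lines of code. This is an illustrative sketch only: the 0 = pass / 1 = fail encoding, the five-PE ring of tests, the function names and the brute-force search over candidate fault sets are all assumptions, not something the slides prescribe.

```python
import random
from itertools import combinations

def pmc_syndrome(tests, faulty, rng=None):
    """Build a syndrome {(tester, tested): outcome} under the PMC conditions.

    Assumed encoding: 0 = test passed, 1 = test failed.
    A fault-free tester reports the true state; a faulty tester's outcome
    is arbitrary (modelled here as a random bit).
    """
    rng = rng or random.Random(0)
    syndrome = {}
    for tester, tested in tests:
        if tester in faulty:
            syndrome[(tester, tested)] = rng.choice([0, 1])
        else:
            syndrome[(tester, tested)] = 1 if tested in faulty else 0
    return syndrome

def consistent(syndrome, candidate):
    """True if `candidate` (a set of PEs) could explain the syndrome.

    Only outcomes reported by testers outside the candidate set are checked,
    because outcomes from faulty testers carry no information.
    """
    return all(
        outcome == (1 if tested in candidate else 0)
        for (tester, tested), outcome in syndrome.items()
        if tester not in candidate
    )

def diagnose(pes, syndrome, t):
    """Brute force: every fault set of size <= t consistent with the syndrome."""
    return [
        set(c)
        for k in range(t + 1)
        for c in combinations(sorted(pes), k)
        if consistent(syndrome, set(c))
    ]

# Hypothetical system: five PEs testing each other in a ring, B is faulty.
pes = ["A", "B", "C", "D", "E"]
ring = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("E", "A")]
syndrome = pmc_syndrome(ring, faulty={"B"})
candidates = diagnose(pes, syndrome, t=1)
```

With at most one fault allowed, the only fault set consistent with this syndrome is `{"B"}`, whatever outcome the faulty PE B reports about C. The brute-force search is exponential in t and is meant only to show what "consistent with the syndrome" means.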

The PMC Model
[Figure: example test graph on PEs A, B, C, D and E, with edges labeled by test outcomes 0, 1 and x]

The Byzantine Agreement
• Started with the work of Wensley et al. in 1978 on Software Implemented Fault Tolerance (SIFT).
• The number of PEs (n) must be greater than 3t, where t is the number of faulty elements.
• Each processor has a secret value. Values are exchanged by messages.
Interactive Consistency:
• Consistency: Each fault-free PE should form an identical vector of values whose ith element corresponds to the ith processor in the system.
• Meaningfulness: A vector element corresponding to a fault-free processor should be the actual secret value of that processor.
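The bound n > 3t and the two interactive consistency conditions can be stated as simple checks. The sketch below only verifies the conditions on a given outcome; it does not implement an agreement protocol. The function names and the example values are hypothetical.

```python
def enough_processors(n, t):
    """Necessary condition from the slide: n must exceed 3t."""
    return n > 3 * t

def interactive_consistency_holds(vectors, secrets, faulty):
    """Check Consistency and Meaningfulness on a finished run.

    vectors: {pe_index: list of values}, where element i is the value that
             this PE attributes to processor i.
    secrets: list of the true secret values, indexed by processor.
    faulty:  set of indices of faulty PEs (their vectors are ignored).
    """
    good = [p for p in range(len(secrets)) if p not in faulty]
    # Consistency: all fault-free PEs hold an identical vector.
    consistency = all(vectors[p] == vectors[good[0]] for p in good)
    # Meaningfulness: entries for fault-free PEs equal their real secrets.
    meaningfulness = all(
        vectors[p][i] == secrets[i] for p in good for i in good
    )
    return consistency and meaningfulness

# Hypothetical run with n = 4, t = 1; PE 3 is faulty.
secrets = [10, 20, 30, 40]
agreed = {p: [10, 20, 30, 0] for p in (0, 1, 2)}
ok = interactive_consistency_holds(agreed, secrets, faulty={3})

# PE 2 ends up with a different entry for the faulty PE: Consistency fails.
split = {0: [10, 20, 30, 0], 1: [10, 20, 30, 0], 2: [10, 20, 30, 99]}
not_ok = interactive_consistency_holds(split, secrets, faulty={3})
```

Note that the entry for the faulty PE may be any value, as long as every fault-free PE records the same one.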

An Example

Byzantine Generals Problem
The Byzantine Generals Problem was introduced by Lamport, Shostak and Pease in 1982. A Byzantine commanding general, who has surrounded the enemy with his many armies, each led by a lieutenant general, wishes to organize a concerted plan of action, i.e., to attack or to retreat.
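As an illustration of how the generals can reach a concerted plan, the following is a minimal sketch of the oral-messages algorithm OM(m) from Lamport, Shostak and Pease (1982). The slides do not present this algorithm; it is added here only as an example. The traitor behaviour `two_faced`, the integer IDs and the "retreat" default on a tie are assumptions of this sketch.

```python
from collections import Counter

def majority(votes, default="retreat"):
    """Strict majority of the votes, or the default order on a tie."""
    top, count = Counter(votes).most_common(1)[0]
    return top if count > len(votes) / 2 else default

def om(commander, lieutenants, value, m, faulty, lie):
    """Oral-messages algorithm OM(m); returns {lieutenant: decided order}.

    faulty: set of traitorous generals.
    lie(sender, receiver, value): what a traitor sends (hypothetical model).
    """
    # Step 1: the commander sends an order to every lieutenant.
    received = {
        l: lie(commander, l, value) if commander in faulty else value
        for l in lieutenants
    }
    if m == 0:
        return received
    # Step 2: each lieutenant relays what it received, using OM(m - 1).
    relayed = {l: [] for l in lieutenants}
    for sender in lieutenants:
        others = [l for l in lieutenants if l != sender]
        sub = om(sender, others, received[sender], m - 1, faulty, lie)
        for l, v in sub.items():
            relayed[l].append(v)
    # Step 3: each lieutenant takes the majority of everything it has seen.
    return {l: majority([received[l]] + relayed[l]) for l in lieutenants}

def two_faced(sender, receiver, value):
    """Assumed traitor: tells odd-numbered receivers one thing, even another."""
    return "attack" if receiver % 2 else "retreat"

# n = 4 generals, t = 1 traitor, so one round of relaying (m = 1) is used.
loyal_commander = om(0, [1, 2, 3], "attack", 1, faulty={3}, lie=two_faced)
traitor_commander = om(0, [1, 2, 3], "attack", 1, faulty={0}, lie=two_faced)
```

In the first run the loyal lieutenants 1 and 2 both follow the loyal commander's order; in the second run the commander is the traitor, yet all three lieutenants still settle on the same order.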

Fault Classification
• Analysis of the characteristics of faulty processors results in the proposition of fault models.
• The proposed fault models define the behavior of a PE once it has become faulty.
• System diagnosis: a description of test results given the status of the tester and the tested PE.
• Byzantine agreement: a description of the limitations of a faulty processor.
• In general, the more constraints in the fault model, the easier it is to form consensus.

Fault Classification: A failure in System Diagnosis
[Figure: interactions of a faulty PE]

Test Validity Models

Fault Classification: A failure in System Diagnosis
[Slide graphic: Fault Class]

Fault Classification: A failure in Byzantine Agreement
• Adversary model: in the worst case, faulty PEs are assumed to work with complete knowledge of the state of the system.
• Limitations of the adversary model: defining algorithms that work only for this model can be limiting and impractical.
• So another classification of faults is introduced, in which a stronger class is a subset of a weaker class.

Fault Classification: A failure in Byzantine Agreement

Fault Classification: A failure in Byzantine Agreement
[Figure: fault classes, labeled Fail Stop and Byzantine Fault]

Testing

Conclusion
• Despite their different characteristics, Byzantine agreement and system diagnosis have very similar goals, namely to produce a correct agreement despite the presence of faults.
• Showing the similarities of both approaches allows future research to draw from both areas rather than continuing apart.

References
• Michael Barborak, Miroslaw Malek and Anton Dahbura, “The Consensus Problem in Fault-Tolerant Computing”, ACM Computing Surveys, Vol. 25, No. 2, June 1993.
• Michael Fischer, Nancy Lynch and Michael Paterson, “Impossibility of Distributed Consensus with One Faulty Process”, Journal of the ACM, April 1985.
• PODC Influential Paper Award 2001, http://www.podc.org/influential/2001.html
