Fault Tolerance

Fault Tolerance • Motivation: Systems need to be much more reliable than their components • Use Redundancy: Extra items that can be used to make up for failures • Types of Redundancy: • Hardware • Software • Time • Information

Fault-Tolerant Scheduling • Fault Tolerance: The ability of a system to suffer component failures and still function adequately • Fault-Tolerant Scheduling: Save enough time in a schedule that the system can still function despite a certain number of processor failures

FT-Scheduling: Model • System Model • Multiprocessor system • Each processor has its own memory • Tasks are preloaded into assigned processors • Task Model • Tasks are independent of one another • Schedules are created ahead of time

Basic Idea • Preassign backup copies, called ghosts. • Assign ghosts to the processors along with the primary copies • A ghost and a primary copy of the same task can’t be assigned to the same processor • For each processor, all the primaries and a particular subset of the ghost copies assigned to it should be feasibly schedulable on that processor
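
The placement rules above can be checked mechanically. The sketch below is only an illustration under simplifying assumptions that are not in the slides: each task is reduced to a single utilization number (`load`), a processor is treated as feasibly schedulable when its total utilization stays within `capacity`, and only single-processor failures are considered. The function and parameter names are hypothetical.

```python
def survives_single_failure(primary, ghost, load, capacity=1.0):
    """Check a ghost placement against any single processor failure.

    primary, ghost: dict mapping task name -> processor id (hypothetical layout).
    load: dict mapping task name -> utilization (assumed feasibility model).
    """
    # Rule 1: a ghost may never share a processor with its own primary.
    if any(primary[t] == ghost[t] for t in primary):
        return False

    processors = set(primary.values()) | set(ghost.values())
    for failed in processors:
        for p in processors - {failed}:
            # Primaries already on p ...
            demand = sum(load[t] for t in primary if primary[t] == p)
            # ... plus the ghosts on p whose primaries sat on the failed processor.
            demand += sum(load[t] for t in ghost
                          if ghost[t] == p and primary[t] == failed)
            if demand > capacity:
                return False
    return True


primary = {"a": 0, "b": 1, "c": 2}
ghost = {"a": 1, "b": 2, "c": 0}
light = {"a": 0.4, "b": 0.4, "c": 0.4}
heavy = {"a": 0.6, "b": 0.6, "c": 0.6}
print(survives_single_failure(primary, ghost, light))
print(survives_single_failure(primary, ghost, heavy))
```

Only the ghosts whose primaries were on the failed processor are counted, which is the "particular subset" the slide refers to.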

Requirements • Two main variations: • Current and future iterations of the task have to be saved if a processor fails • Only future iterations need to be saved; the current iteration can be discarded

Forward and Backward Masking • Forward Masking: Mask the output of failed units without significant loss of time • Backward Masking: After detecting an error, try to fix it by recomputing or some other means

Failure Types • Permanent: The fault is incurable • Transient: The unit is faulty for some time, following which it starts functioning correctly again • Intermittent: Frequently cycles between a faulty and a non-faulty state

Faults and Errors • A fault is some physical defect or malfunction • An error is a manifestation of a fault • Latency: • Fault Latency: Time between occurrence of a fault and its manifestation as an error • Error Latency: Time between the generation of an error and its being caught by the system

Hardware Failure Recovery • If transient, it may be enough to wait for the fault to go away and then reinvoke the computation • If permanent, reassign the tasks to other, functional, processors
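
A minimal sketch of this recovery policy follows. It assumes, purely for illustration, that a processor is modelled as a callable and that the fault type is reported through two hypothetical exception classes; a real system would also wait between retries rather than retrying immediately.

```python
class TransientFault(Exception):
    """Hypothetical signal: the fault is expected to clear by itself."""


class PermanentFault(Exception):
    """Hypothetical signal: the processor will not recover."""


def run_with_recovery(task, processors, max_retries=2):
    """Retry on transient faults; move to the next processor on permanent ones."""
    for run_on in processors:
        for _ in range(max_retries + 1):
            try:
                return run_on(task)
            except TransientFault:
                continue  # reinvoke the computation on the same processor
            except PermanentFault:
                break     # give up on this processor and reassign the task
    raise RuntimeError("no functional processor left")
```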

Faults: Output Characteristics • Stuck-at: A line is stuck at 0 or 1. • Dead: No output (e.g., high-impedance state) • Arbitrary: The output changes with time

Factors Affecting HW Failure Rate • Temperature • Radiation • Power surges • Mechanical shocks • HW failure rate often follows the “bathtub” curve

Some Terminology • Fail-safe Systems: Systems which end up in a “safe” state upon failure • Example: All traffic lights turning red in an intersection • Fail-stop Systems: Systems that stop producing output when they fail

Example of HW Redundancy • Triple-Modular Redundancy (TMR): • Three units run the same algorithm in parallel • Their outputs are voted on and the majority is picked as the output of the TMR cluster • Can forward-mask up to one processor failure
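
The voting step can be sketched as follows. This is an illustrative majority voter over three hashable outputs, not a description of any particular hardware voter; the function name is an assumption.

```python
from collections import Counter


def tmr_vote(a, b, c):
    """Return the value produced by at least two of the three units."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count >= 2:
        return value
    # All three disagree: more than one unit has failed, nothing to mask.
    raise ValueError("no majority among the three outputs")
```

With one faulty unit, `tmr_vote(5, 5, 7)` still yields the correct value without any rollback, which is what makes this forward masking.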

Mathematical Background • Basic laws of probability • Density and distribution functions • Notion of stochastic independence • Expectation, variance, etc. • Memoryless distribution • Markov chains • Steady-state & transient solutions • Bayes’s Law

Hardware FT • N-Modular Redundancy (NMR) • Basic structure • Variations • Reliability evaluation • Independent failures • Correlated failures • Voter: • Bit-by-bit comparison • Median • Formalized majority • Generalized k-plurality
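
For the independent-failure case, the reliability of an NMR cluster can be evaluated with a binomial sum. The sketch below assumes a perfect voter, an odd N, and units that each work with the same probability `r` independently of one another; these assumptions, and the function name, are added here for illustration.

```python
from math import comb


def nmr_reliability(n, r):
    """Probability that a strict majority of n independent units works."""
    need = n // 2 + 1
    return sum(comb(n, i) * r**i * (1 - r)**(n - i)
               for i in range(need, n + 1))


print(nmr_reliability(3, 0.9))  # TMR with 90%-reliable units
print(nmr_reliability(3, 0.4))  # redundancy hurts when units are poor
```

For N = 3 the sum reduces to 3r^2 - 2r^3. Note that the cluster is less reliable than a single unit when r is below 0.5, and that correlated failures break the independence assumption this formula rests on.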

Exploiting Application Semantics • Acceptance Test: Specify a range outside which the output is tagged as faulty (or at least suspicious) • No acceptance test is perfect: • Sensitivity: Probability of catching an incorrect output • Specificity: Probability that a correct output is not flagged as wrong • Specificity = 1 - False Positive Probability
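
As a small worked example, the sketch below applies a range-based acceptance test to a handful of made-up labelled outputs and measures the two figures of merit, taking specificity as 1 minus the false-positive probability. The sample data, the range, and all names are hypothetical.

```python
def evaluate_range_test(samples, low, high):
    """samples: list of (output, is_actually_wrong) pairs.

    Returns (sensitivity, specificity) of the test 'flag if outside [low, high]'.
    """
    tp = fn = tn = fp = 0
    for output, is_wrong in samples:
        flagged = not (low <= output <= high)
        if is_wrong and flagged:
            tp += 1          # incorrect output caught
        elif is_wrong:
            fn += 1          # incorrect output slipped through
        elif flagged:
            fp += 1          # correct output wrongly flagged
        else:
            tn += 1          # correct output accepted
    return tp / (tp + fn), tn / (tn + fp)


samples = [(50, False), (120, True), (70, True), (30, False), (-5, False)]
sens, spec = evaluate_range_test(samples, 0, 100)
print(sens, spec)
```

Here the wrong value 70 lies inside the range and is missed, while the correct value -5 is flagged, so neither figure reaches 1.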

Checkpointing • Store partial results in a safe place • When failure occurs, roll back to the latest checkpoint and restart • Issues: • Checkpoint positioning • Implementation • Kernel level • Application level • Correctness: Can be a problem in distributed systems
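
A minimal application-level sketch of the idea follows. It keeps the checkpoint in memory via a deep copy, which stands in for the "safe place"; a real implementation would write to stable storage. The class and method names are assumptions made for this example.

```python
import copy


class CheckpointedTask:
    """Holds a mutable state plus the most recent saved copy of it."""

    def __init__(self, state):
        self.state = state
        self._saved = copy.deepcopy(state)

    def checkpoint(self):
        # Store partial results (here: an in-memory copy).
        self._saved = copy.deepcopy(self.state)

    def rollback(self):
        # Discard everything done since the latest checkpoint.
        self.state = copy.deepcopy(self._saved)


task = CheckpointedTask({"i": 0, "total": 0})
task.state["i"], task.state["total"] = 3, 6
task.checkpoint()
task.state["total"] = -999      # simulated corruption after the checkpoint
task.rollback()
print(task.state)
```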

Terminology • Checkpointing Overhead: The part of the checkpointing activity that is not hidden from the application • Checkpointing Latency: Time from when a checkpoint starts being taken to when it is stored in non-volatile storage.

Reducing Checkpointing Overhead • Buffer checkpoint writes • Don’t checkpoint “dead” variables: • Never used again by the program, or • Next operation with respect to the variable is a write • Problem is how to identify dead variables • Don’t checkpoint read-only data, such as code

Reducing Checkpointing Latency • Consider compressing the checkpoint. Usefulness of this approach depends on: • Extent of the compression possible • Work required to execute the compression algorithm
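
The trade-off can be explored with a few lines of code. The sketch below serializes a state with `pickle` and compresses it with `zlib`; both choices, the helper names, and the sample state are assumptions for illustration. Whether compression pays off depends on the data and on how the compression time compares with the time saved writing fewer bytes.

```python
import pickle
import zlib


def pack_checkpoint(state, level=6):
    """Return (raw_bytes, compressed_bytes) for a checkpoint of `state`."""
    raw = pickle.dumps(state)
    return raw, zlib.compress(raw, level)


def unpack_checkpoint(packed):
    """Inverse of the compression step in pack_checkpoint."""
    return pickle.loads(zlib.decompress(packed))


state = {"grid": [0] * 10_000, "step": 42}   # highly compressible example
raw, packed = pack_checkpoint(state)
print(len(raw), len(packed))
```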

Optimization of Checkpointing • Objective in general-purpose systems is usually to minimize the expected execution time • Objective in real-time systems is to maximize the probability of meeting task deadlines • Need a mathematical model to determine this • Generally, we place checkpoints approximately equidistant from each other and just determine the optimal number of them
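
To make the "optimal number" concrete for the general-purpose objective, here is a sketch under one simple model that is an assumption of this example, not something the slides specify: failures arrive as a Poisson process with rate `lam`, a failure restarts the current segment from its checkpoint at no extra cost, and each of the `n` equal segments ends with a checkpoint costing `c`. Under that model a segment of length s takes (e^(lam*s) - 1)/lam on average.

```python
import math


def expected_time(n, T, c, lam):
    """Expected run time of a task of length T split into n equal segments."""
    segment = T / n + c                      # work plus checkpoint cost
    return n * (math.exp(lam * segment) - 1) / lam


def best_checkpoint_count(T, c, lam, n_max=100):
    """Search 1..n_max for the segment count with the lowest expected time."""
    return min(range(1, n_max + 1),
               key=lambda n: expected_time(n, T, c, lam))


best = best_checkpoint_count(T=100, c=1, lam=0.01)
print(best, expected_time(best, 100, 1, 0.01), expected_time(1, 100, 1, 0.01))
```

Too few checkpoints lose a lot of work per failure; too many pay the cost `c` too often. A real-time system would instead need a model of the probability of finishing before the deadline.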

Distributed Checkpointing • Ordering of Events: • Easy to do if there’s just one thread • If there are multiple threads: • Events in the same thread are trivial to order • Event A in thread X is said to precede Event B in thread Y if there is some communication from X after event A that arrives at Y before event B • Given two events A and B in separate threads, • A could precede B • B could precede A • They could be concurrent
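
One common way to decide which of the three cases holds is to stamp events with vector clocks. The slides do not name this technique, so the sketch below is an added illustration; the helper names are hypothetical, and thread X is index 0 while thread Y is index 1.

```python
def tick(clock, i):
    """Advance thread i's own component for a local event."""
    new = list(clock)
    new[i] += 1
    return tuple(new)


def receive(clock, msg_clock, i):
    """Merge the sender's timestamp, then count the receive event."""
    merged = tuple(max(a, b) for a, b in zip(clock, msg_clock))
    return tick(merged, i)


def precedes(a, b):
    """True if the event stamped a happened before the event stamped b."""
    return all(x <= y for x, y in zip(a, b)) and a != b


def relation(a, b):
    if precedes(a, b):
        return "A precedes B"
    if precedes(b, a):
        return "B precedes A"
    return "concurrent"


a = tick((0, 0), 0)        # event A in X; X then sends a message stamped a
c = tick((0, 0), 1)        # an event in Y before the message arrives
b = receive(c, a, 1)       # event B in Y: receipt of X's message
print(relation(a, b), "|", relation(a, c))
```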

Distributed Checkpointing • Domino Effect: An uncontrolled cascade of rollbacks can roll the entire system back to the starting state • To avoid the domino effect, we can coordinate the checkpointing • Tightly synchronize the checkpoints in all processors • Koo-Toueg algorithm

Checkpointing with Clock Sync • Assume the clock skew is bounded by d and the minimum message delivery time is f • Each processor: • Takes a local checkpoint at some specified time, t • Following its checkpoint, it does not send out any messages until it is sure that the message will be received only after the recipient has itself checkpointed; i.e., until t+f+d

Koo-Toueg Algorithm • A processor that wants to checkpoint: • Does so, locally • Tells every processor that has communicated with it which message (timestamp or message number) it last received from that processor • If these processors don’t have a checkpoint recording the transmission of this message, they take a checkpoint • This can result in a surge of checkpointing activity visible at the non-volatile storage

Software Fault Tolerance • It is practically impossible to produce a large piece of software that is bug-free • E.g., even the space shuttle flew with several potentially disastrous bugs despite extensive testing • Single-version Fault Tolerance • Multi-version Fault Tolerance

Fault Models • Reasonably trustworthy hardware fault models exist • Many software fault models exist in the literature, but not one can be fully trusted to represent reality

Single-Version FT • Wrappers: Code “wrapped around” the software that checks for consistency and correctness • Software Rejuvenation: Reboot the machine reasonably frequently • Use data diversity: Sometimes an algorithm may fail on some data but not if these data are subjected to minor perturbations
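
Data diversity can be illustrated with a function that fails on one specific input but works on a slightly perturbed one. In the sketch below, the `sinc` example, the perturbation sizes, and the wrapper name are all assumptions chosen for the illustration; the approach only makes sense when the application can tolerate the small change in the input.

```python
import math


def sinc(x):
    """sin(x)/x written naively: raises ZeroDivisionError at x == 0."""
    return math.sin(x) / x


def with_data_diversity(f, x, offsets=(0.0, 1e-9, -1e-9)):
    """Try f on x, then on slightly perturbed copies of x."""
    for dx in offsets:
        try:
            return f(x + dx)
        except ArithmeticError:
            continue  # this input tripped the fault; try a nearby one
    raise RuntimeError("every perturbed input failed")


print(with_data_diversity(sinc, 0.0))
```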

Multi-version FT • Very, very expensive • Two basic approaches • N-version programming • Recovery Blocks

N-Version Programming (NVP) • Theoretically appealing, but hard to make effective • Basic Idea: • Have N independent teams of programmers develop the application independently • Run the N versions in parallel and vote on their outputs • If the versions fail truly independently, the voted output will be highly reliable

Failure Diversity • Effectiveness hinges on whether faults in the versions are statistically independent of one another • Forces against truly independent failures: • Common programming “culture” • Common specifications • Common algorithms • Common software/hardware platforms

Failure Diversity • Incidental Diversity • Prohibit interaction between teams of programmers working on different versions and hope they produce independently failing versions • Forced Diversity • Diverse specifications • Diverse programming languages • Diverse development tools and compilers • Cognitively diverse teams: Probably not realistic

Experimental Results • Experiments suggest that correlated failures do occur at a much higher rate than would be the case if failures in the versions were stochastically independent • Example: Study conducted by Brilliant, Knight, and Leveson at UVa and UCI • 27 students writing code for anti-missile application • 93 correlated failures observed: if true independence had existed, we’d have expected about 5

Recovery Blocks • Also uses multiple versions • Only one version is active at any time • If the output of this version fails an acceptance test, another version is activated
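
The control flow of a recovery block can be sketched in a few lines. The version functions, the acceptance test, and all names below are hypothetical; a full implementation would also restore the program state before activating the next version.

```python
import math


def recovery_block(versions, acceptance_test, x):
    """Run versions one at a time until one passes the acceptance test."""
    for version in versions:
        try:
            result = version(x)
        except Exception:
            continue                      # a crash counts as a failed version
        if acceptance_test(x, result):
            return result
    raise RuntimeError("all versions failed the acceptance test")


def buggy_sqrt(x):
    return x / 2                          # deliberately wrong primary version


def sqrt_ok(x, result):
    return abs(result * result - x) < 1e-6


print(recovery_block([buggy_sqrt, math.sqrt], sqrt_ok, 9.0))
```

The acceptance test here happens to be easy to write because squaring the result checks it directly; in general the quality of the test limits what the scheme can catch.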

Byzantine Failures • The worst failure mode known • Original Motivating Problem (~1978): • A sensor needs to disseminate its output to a set of processors. How can we ensure that, • If the sensor is functioning correctly: All functional processors obtain the correct sensor reading • If the sensor is malfunctioning: All functional processors agree on the sensor reading

Byzantine Generals Problem • Some divisions of the Byzantine Army are besieging a city. They must all coordinate their attacks (or coordinate their retreat) to avoid disaster • The overall commander communicates to his divisional commanders by means of a confidential messenger. This messenger is trustworthy and doesn’t alter the message; it can only be read by its intended recipient

Byz Generals Problem (contd.) • If the C-in-C is loyal • He sends consistent orders to the subordinate generals • All loyal subordinates must obey his order • If the C-in-C is a traitor • All loyal subordinate generals must agree on some default action (e.g., running away)

Impossibility with 3 Generals • Suppose there are 2 divisions, A and B. • The Commander-in-chief is a traitor and sends a message to Com(A) saying “Attack!” and to Com(B) saying “Retreat!” • Com(A) sends a messenger to Com(B), saying “The boss told me to attack!” • Com(B) receives: • Direct order from the C-in-C saying “Retreat” • Message from Com(A) saying “I was ordered to attack”

Byz. Generals Problem (contd.) • Com(B)’s dilemma: • Either the C-in-C or Com(A) is a traitor: it is impossible to know which • Further communication with Com(A) won’t add any useful information • Not possible to ensure that if Com(A) and Com(B) are both loyal, they both agree on the same action • The problem cannot be solved with 3 generals if even one of them may be a traitor

Byz. Generals Problem (contd.) • Central Result: To reach agreement with a total of N participants with up to m traitors, we must have N > 3m

Byzantine Generals Algorithm • Byz(0) // no-failure algorithm • C-in-C sends his order to every subordinate • The subordinate uses the order he receives, or the default if he receives no order

Byz(m) // For up to m traitors (failures) • (1) C-in-C sends order to every subordinate, G_i: let this be received as v_i • (2) G_i acts as the C-in-C in a Byz(m-1) algorithm to circulate this order to his colleagues • (3) For each (i,j) such that i!=j, let w_(i,j) be the order that G_i got from G_j in step 2 or the default if no message was received. G_i calculates the majority of the orders {v_i, w_(i,j)} and uses it as the correct order to follow
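
The recursion in steps (1) to (3) can be simulated directly. The sketch below is an illustration under several assumptions that are not in the slides: generals are numbered with integers, a traitor lies in one fixed way (telling even-numbered receivers "attack" and odd-numbered ones "retreat"), messages are never lost, and a tied vote falls back to the default order.

```python
from collections import Counter


def majority(values, default="retreat"):
    """Most common value, or the default when the top two counts tie."""
    counts = Counter(values).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return default
    return counts[0][0]


def byz(m, commander, lieutenants, order, traitors, default="retreat"):
    """Return {lieutenant: order it settles on} after running Byz(m)."""
    def send(sender, receiver, value):
        if sender in traitors:
            # Assumed traitor behaviour, chosen only for this simulation.
            return "attack" if receiver % 2 == 0 else "retreat"
        return value

    # Step 1: the commander sends v_i to every subordinate G_i.
    received = {g: send(commander, g, order) for g in lieutenants}
    if m == 0:
        return received

    # Step 2: each G_j acts as commander in Byz(m-1) to circulate v_j.
    relayed = {}
    for j in lieutenants:
        others = [g for g in lieutenants if g != j]
        relayed[j] = byz(m - 1, j, others, received[j], traitors, default)

    # Step 3: each G_i takes the majority of v_i and the relayed values.
    return {i: majority([received[i]] +
                        [relayed[j][i] for j in lieutenants if j != i],
                        default)
            for i in lieutenants}


print(byz(1, 0, [1, 2, 3], "attack", traitors={3}))  # loyal commander
print(byz(1, 0, [1, 2, 3], "attack", traitors={0}))  # traitorous commander
print(byz(1, 0, [1, 2], "attack", traitors={2}))     # only 3 generals
```

With four generals and one traitor, the loyal subordinates obey a loyal commander and agree with each other under a traitorous one. With only three generals, the loyal subordinate ends up with a tied vote and falls back to the default, disobeying a loyal commander, which is consistent with the N > 3m requirement.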
