Algorithm-Based Fault Tolerance Theory of Check Placement

Algorithm-Based Fault Tolerance: Theory of Check Placement • Greg Bronevetsky

So Far… • Learned how certain computations could be checked using algorithm-specific checks. • In any algorithm we can develop checks to verify any set of data items. • How effective are these checks? • How many faults can a given set of checks detect?

Abstract Checks • Suppose we are given (g,h)-checks • Check defined on g data elements • If all elements correct, returns 0 • If >0 and ≤h elements erroneous, returns 1 • If >h elements erroneous, result is undefined
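A minimal sketch of the abstract check, assuming the reading that a (g,h)-check returns 1 when between 1 and h of its elements are erroneous. The function name abstract_check and the use of None for "undefined" are illustrative choices, not part of the slides.

```python
from typing import Optional, Sequence

def abstract_check(errors: Sequence[bool], h: int) -> Optional[int]:
    """Model a (g,h)-check over g = len(errors) data elements.

    errors[i] is True when element i is erroneous. Returns 0 when all
    elements are correct, 1 when between 1 and h are erroneous, and
    None (undefined) when more than h are erroneous.
    """
    bad = sum(1 for e in errors if e)
    if bad == 0:
        return 0
    if bad <= h:
        return 1
    return None  # overwhelmed: the check's output cannot be trusted

print(abstract_check([False, False, False], h=2))
print(abstract_check([True, False, True], h=2))
print(abstract_check([True, True, True], h=2))
```

The third call models an overwhelmed check, which is the case the later detectability analysis has to work around.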

Checking Example • Assume (2, 1) checks • 2 elements, 1-failure detection • Both sets of checks can detect single errors • Neither can locate individual errors [Figure, shown once per check placement: d1, d2, …, dn added into sum. Placement labels: "n checks: ∀i. di and sum" and "1 check: sum"]

But with one more check… • If we also check sum • can detect any pair of errors • can locate single errors • Need general theory of effective and efficient check placement [Figure: d1, d2, …, dn added into sum. Labels: "n checks: ∀i. di and sum" and "1 more check: sum"]

Goals • Need models for correlating processor faults to data errors • Given fault model and set of checks need to derive fault detectability and locatability

Papers covered • V.S.S. Nair, J.A. Abraham, P. Banerjee. "Efficient techniques for the analysis of algorithm-based fault tolerance (ABFT) schemes", 1996. • Choon-Sik Park and Mineo Kaneko, "An Efficient Technique for Design of ABFT Systems Based on Modified PD Graph". • Choon-Sik Park, "Algorithm-Based Fault Tolerant Systems Based on Graph-Theoretic Error Occurence+Propagation Models", 2000. (PhD Thesis) • V.S.S. Nair, J.A. Abraham. "Hierarchical design and analysis of fault-tolerant multiprocessor systems using concurrent error detection", 1990.

Outline • Matrix-based formalism of Nair et al • Dependence graph-based formalism of Park et al • Includes fault propagation models • Framework for hierarchical fault tolerant systems by Nair et al • Building fault tolerant systems out of fault tolerant components

Basic Framework • Each processor and each check is associated with a set of data elements [Figure: processors P1–P4, data elements d1–d7, checks C1–C3]

Basic Framework • Data(Pi) = set of data elements affected by processor i • If Pi fails, any subset of Data(Pi) may be erroneous • No notion of errors propagating based on data dependences • Data() defines the Processor-Data (PD) Matrix

Associated PD Matrix [Figure: PD matrix with rows for processors P1–P4 and columns for data elements d1–d7]

Basic Framework • Check(di) = set of checks that check data element di. • Must be non-empty if we expect to detect errors • Check() defines the Data-Check (DC) Matrix • Paper focuses on (g,1) checks • g data elements • can detect up to 1 fault

Associated DC Matrix • C1 and C2 are (3,1) checks • C3 is a (2,1) check [Figure: DC matrix with rows for data elements d1–d7 and columns for checks C1–C3]

The PC Matrix • Finally, associate processors and checks: • Processor-check (PC) matrix = PD × DC • PD is Processors × Data Elements, DC is Data Elements × Checks, so PC is Processors × Checks • Each PC entry = # elements of that processor verified by that check
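The product can be sketched on a small system. The matrices below are hypothetical (the slide's own figure values are not recoverable), and mat_mul is a helper written for this illustration.

```python
def mat_mul(a, b):
    """Plain integer matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

# Hypothetical system: 2 processors, 4 data elements, 2 checks.
PD = [[1, 1, 0, 0],   # P1 touches d1, d2
      [0, 0, 1, 1]]   # P2 touches d3, d4
DC = [[1, 0],         # d1 is checked by C1
      [1, 1],         # d2 is checked by C1 and C2
      [0, 1],         # d3 is checked by C2
      [0, 1]]         # d4 is checked by C2

PC = mat_mul(PD, DC)  # PC[i][j] = # elements of processor i verified by check j
print(PC)             # [[2, 1], [0, 2]]
```

Here P1's row says that C1 verifies two of P1's elements and C2 verifies one.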

Using the PC Matrix • PC matrix shows if we can detect single-processor errors: • Assume all checks are (g,h) checks • If each row of PC has all entries ≤ h, failure of that processor will be detected • Regardless of which entries actually become erroneous

Using the PC Matrix • If each row of PC has all entries ≤ h, failure of that processor will be detected [Figure: processors P1–P4, data elements d1–d7, checks C1–C3, alongside the PC matrix]
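A minimal sketch of this row test, assuming the criterion is that no entry in the processor's PC row exceeds h. The PC matrix is hypothetical and the function name is illustrative.

```python
def row_passes_conservative_test(pc_row, h):
    """Conservative criterion: no check sees more than h of the processor's elements.

    Assumes every data element is covered by at least one check, as the
    framework requires; that coverage is not re-verified here.
    """
    return all(entry <= h for entry in pc_row)

# Hypothetical PC matrix (rows = processors, columns = checks).
PC = [[2, 1],
      [0, 2]]
for h in (1, 2):
    print(h, [row_passes_conservative_test(row, h) for row in PC])
```

With h = 1 both rows fail because each contains a 2; with h = 2 both pass.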

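Relaxing Detectability • Condition is too conservative • Suppose we have (3, 2) checks • Pi’s PD row is: [matrix not recovered] • There are 2 checks. DC matrix: [matrix not recovered] • PC Matrix: [matrix not recovered] [Figure: P1 with data elements d1–d5 and checks C1, C2]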

Relaxing Detectability • C1 may be overwhelmed by errors • Will not notice error <d1, d2, d5> • By the above criterion the system can’t detect failure of P1 [Figure: P1 with data elements d1–d5 and checks C1, C2]

Reaching New Detectability Definition • But how could C1 be overwhelmed? • When all 3 of its elements have errors • Recall, these are (3,2) checks [Figure: P1 with data elements d1–d5 and checks C1, C2]

Reaching New Detectability Definition • But C1 and C2 overlap on d5 • Thus if C1 is overwhelmed, C2 detects the error • It is not overwhelmed • Thus, for any error pattern we can see if any check will notice [Figure: P1 with data elements d1–d5 and checks C1, C2]

Trivial Algorithm 2 • Try every possible error pattern • Exponentially many of them • For each pattern see if some check will detect it • Before: ensured that no check overwhelmed • Pro: Correct and not conservative • Con: Expensive
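A sketch of the exhaustive approach. The check layout below is hypothetical, chosen so that one check can be overwhelmed while another still fires; it is not the slide's figure. Checks are modeled as sets of element names.

```python
from itertools import combinations

def detects(check, pattern, h):
    """A (g,h)-check fires when it sees between 1 and h erroneous elements."""
    return 1 <= len(check & pattern) <= h

def brute_force_detectable(data, checks, h):
    """Try every non-empty error pattern over `data`; exponential in len(data)."""
    items = sorted(data)
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            pattern = set(combo)
            if not any(detects(c, pattern, h) for c in checks):
                return False, pattern  # found a pattern no check notices
    return True, None

# Hypothetical (3,2) checks; the failed processor owns d1, d2, d3.
C1 = {"d1", "d2", "d3"}   # overwhelmed when all three are erroneous
C2 = {"d3", "d4", "d5"}   # sees only d3 from this processor
print(brute_force_detectable({"d1", "d2", "d3"}, [C1, C2], h=2))
```

In this layout the processor's PC row would be [3, 1], so the conservative test rejects it with h = 2, yet every one of the 7 error patterns is caught by C1 or C2.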

New Definition of Detectability • Work with error patterns • Ex: <d1, d2, d5>, <d1, d3, d4>, <d3>, etc. • If one check detects given error pattern, no problem if other checks overwhelmed • Repeat until all error patterns detected: If some check not overwhelmed, eliminate all detectable error patterns from consideration

Example of Detectability Algorithm • Is failure of P1 detectable? • P1 fails ⇒ d1, d2 and/or d3 may have errors • C1, C2 overwhelmed • C3 not overwhelmed [Figure: P1, P2, data elements d1–d5, (2,1) checks C1–C4]

Example of Detectability Algorithm • Look at errors C3 can detect: d3 • Remove them from consideration • Since any error pattern involving d3 will be detected [Figure: P1, P2, data elements d1–d5, (2,1) checks C1–C4]

Example of Detectability Algorithm • Look at remaining error patterns: combinations of d1 and/or d2 • Now C2 not overwhelmed • Remove any error patterns involving d2 [Figure: same system with d3 removed]

Example of Detectability Algorithm • Look at remaining error patterns: d1 • C1 not overwhelmed • Remove any of its error patterns [Figure: same system with d2 and d3 removed]

Example of Detectability Algorithm • All of P1’s error patterns detected • We are done! [Figure: same system with d1, d2 and d3 removed]
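The elimination walk-through can be sketched as follows. The check memberships are an assumption: one layout consistent with the order of steps described (C3, then C2, then C1), since the original figure is not recoverable. C4's membership in particular is a guess and plays no role here.

```python
def detectable_by_elimination(data, checks, h):
    """Repeatedly pick a check not overwhelmed by the remaining elements
    and drop every element it covers; succeed when nothing remains."""
    remaining = set(data)
    order = []
    progress = True
    while remaining and progress:
        progress = False
        for name, members in checks.items():
            seen = members & remaining
            if 1 <= len(seen) <= h:
                # Any pattern touching `seen` is caught by this check.
                remaining -= seen
                order.append(name)
                progress = True
                break
    return not remaining, order

# Assumed (2,1) check layout; P1 owns d1, d2, d3.
checks = {"C1": {"d1", "d2"}, "C2": {"d2", "d3"},
          "C3": {"d3", "d4"}, "C4": {"d4", "d5"}}
ok, order = detectable_by_elimination({"d1", "d2", "d3"}, checks, h=1)
print(ok, order)   # True ['C3', 'C2', 'C1']
```

Each round costs one pass over the checks, so the loop is polynomial, in contrast to enumerating all error patterns.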

Failing Check Processors • What if the processor performing a check fails? • Add “pseudo” data elements to represent processors • Each check will also check its processor’s pseudo-data element • New element has ∞ weight, so an error in it will overwhelm any check

Final System • Check C3 runs on P1 • Checks C1, C2 and C4 run on P2 [Figure: P1, P2, data elements through d7, (2,1) checks C1–C4]

The Infinities [Figure: P1, P2, data elements d1–d7, (2,1) checks C1–C4, with the PD matrix (Processors × Data Elements), the DC matrix (Data Elements × Checks) and the PC matrix (Processors × Checks, # elements verified by check)]

The Infinities • If P1 fails, C1 and C2 overwhelmed • C3 also overwhelmed by ∞ + 1 • Because C3 runs on failed P1 • Only C4 not overwhelmed [Figure: P1, P2, data elements d1–d7, (2,1) checks C1–C4, with the PC matrix]

The Infinities • Remove all error patterns detected by C4 • Any that include d2 [Figure: P1, P2, data elements d1–d7, (2,1) checks C1–C4, with the PC matrix]

The Infinities • C1 and C2 no longer overwhelmed • Remove error patterns detected by C1 and C2 • Any that include d1 or d3 • PC matrix annotation: C4’s entry must become 0; others may go lower [Figure: same system with d2 removed, with the PC matrix]

The Infinities • Now P1’s row is all 0’s and ∞’s • All real data elements successfully checked • Only pseudo-elements remain • Don’t care • PC matrix annotation: C1’s and C2’s entries must become 0; others may go lower [Figure: same system with d1, d2 and d3 removed, with the PC matrix]

The Infinities • Note failure of P2 not detectable • d5 only checked by C4, which runs on P2 • Thus, entry will never drop to ∞ [Figure: same system, with the PC matrix]
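A sketch of the same elimination with check hosts taken into account. The layout is an assumption: one assignment of elements to checks that is consistent with the statements on these slides (C3 on P1, the rest on P2, C4 covering d2 and d5). Instead of explicit pseudo-elements, the infinite weight is added whenever a check's host is the failed processor.

```python
import math

def detectable_with_check_hosts(failed, data_of, checks, hosts, h):
    """Elimination where a check hosted on the failed processor carries
    infinite weight and therefore can never be trusted."""
    remaining = set(data_of[failed])
    while remaining:
        for name, members in checks.items():
            weight = len(members & remaining)
            if hosts[name] == failed:
                weight += math.inf      # pseudo-element of the failed host
            if 1 <= weight <= h:
                remaining -= members    # this check catches these elements
                break
        else:
            return False                # no usable check covers what remains
    return True

# Assumed layout, consistent with the slide text; (2,1) checks.
data_of = {"P1": {"d1", "d2", "d3"}, "P2": {"d4", "d5"}}
checks = {"C1": {"d1", "d2"}, "C2": {"d2", "d3"},
          "C3": {"d3", "d4"}, "C4": {"d2", "d5"}}
hosts = {"C1": "P2", "C2": "P2", "C3": "P1", "C4": "P2"}

print(detectable_with_check_hosts("P1", data_of, checks, hosts, h=1))  # True
print(detectable_with_check_hosts("P2", data_of, checks, hosts, h=1))  # False
```

For P2 the run stalls with d5 remaining, because the only check covering d5 is C4 and C4 is hosted on P2.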

Multi-Process Errors • Want to know if the system can detect failures of r processors • For every subset of r processors • Take union of all data elements they touched • Pretend each r-set is a single processor • Use above algorithm to check if all resulting error patterns are detectable
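A sketch of the r-processor test, assuming the elimination procedure from the earlier slides and ignoring check hosts for brevity. The system below is hypothetical, and both function names are illustrative.

```python
from itertools import combinations

def eliminate(data, checks, h):
    """Elimination test: True when every error pattern over `data` is caught."""
    remaining = set(data)
    while remaining:
        usable = [c for c in checks if 1 <= len(c & remaining) <= h]
        if not usable:
            return False
        remaining -= usable[0]
    return True

def r_fault_detectable(data_of, checks, h, r):
    """Treat each r-subset of processors as one processor owning the union."""
    for group in combinations(sorted(data_of), r):
        merged = set().union(*(data_of[p] for p in group))
        if not eliminate(merged, checks, h):
            return False
    return True

# Hypothetical system with (2,1) checks.
data_of = {"P1": {"d1", "d2"}, "P2": {"d3"}, "P3": {"d4"}}
checks = [{"d1", "d3"}, {"d2", "d4"}, {"d3", "d4"}]
for r in (1, 2, 3):
    print(r, r_fault_detectable(data_of, checks, h=1, r=r))
```

In this example failures of any one or two processors are detectable, but when all three fail every check sees two errors and is overwhelmed.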

Fault Locatability • We only see errors, not faults • For each error pattern, want to know which fault caused it • Given two fault patterns, are they distinguishable? • Only if they have different patterns of failed checks • Will give intuition for analysis

0-1 Disagreement • Take rows Ri and Rj of rPC (faults Fi and Fj) • For every possible error pattern in Ri and Rj look at what each check says on this pattern • If check responses different on each pattern: Fi and Fj can be differentiated

1-0 Disagreement • Want to differentiate faults Fi and Fi∪Fj • Compare each error pattern of Fi and Fj: Eik and Ejl • If some check meets Eik on ≥1 & ≤h spots and meets Ejl on 0 spots then Ejl and Eik∪Ejl are distinguishable • If this is true for all error patterns then Fi and Fi∪Fj are distinguishable

1-0 Disagreement Example • 1-0 disagreement in both directions [Figure: example error patterns and check responses, not recovered]

1-0 Disagreement Example • Clearly, Eik and Ejl look different • Eik∪Ejl corresponds to fault pattern: [figure not recovered] • Checks would say: [figure not recovered] • Different from Eik or Ejl: Distinguishable!
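A small sketch of 1-0 disagreement in both directions. The checks and error patterns are hypothetical stand-ins for the lost figure; syndrome and one_zero_disagreement are names introduced here.

```python
def syndrome(checks, pattern, h):
    """Check outputs for an error pattern: 0 = silent, 1 = fires,
    None = overwhelmed (undefined)."""
    out = []
    for c in checks:
        n = len(c & pattern)
        out.append(0 if n == 0 else 1 if n <= h else None)
    return tuple(out)

def one_zero_disagreement(checks, e_a, e_b, h):
    """True when some check meets e_a on 1..h elements and e_b on none."""
    return any(1 <= len(c & e_a) <= h and not (c & e_b) for c in checks)

# Hypothetical (2,1) checks and two single-element error patterns.
checks = [{"d1", "d2"}, {"d3", "d4"}]
e_ik, e_jl = {"d1"}, {"d3"}

print(one_zero_disagreement(checks, e_ik, e_jl, h=1))  # True
print(one_zero_disagreement(checks, e_jl, e_ik, h=1))  # True
print(syndrome(checks, e_ik, 1), syndrome(checks, e_jl, 1),
      syndrome(checks, e_ik | e_jl, 1))
```

The three syndromes are (1, 0), (0, 1) and (1, 1), so the combined pattern looks different from either individual pattern.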

Fault Locatability • If we can show 1-0 disagreement between every single-processor fault and every r-processor fault: system is r-fault locatable • Algorithm for locatability is obscure • Read the paper

Summary • Presented matrix-based framework for evaluating error detectability & locatability • Framework deals with arbitrary errors • More work by V.S.S. Nair with other coauthors

Outline • Matrix-based formalism of Nair et al • Dependence graph-based formalism of Park et al • Includes fault propagation models • Framework for hierarchical fault tolerant systems by Nair et al • Building fault tolerant systems out of fault tolerant components

Graph-Based Framework • Developed by Choon-Sik Park • Does in graphs what the work of Nair et al does in matrices • Assumes (g,1) checks • Differences: • Different definition of fault locatability • Unknown if equivalent • Presents more limited fault→error models • As opposed to “anything and everything” • Will first present general view, then specific error models

Basic Picture • Processor→Data and Data→Data dependence info maintained [Figure: four columns, Faults → Errors → Data → Checks; fault Fi induces error eiu, fault Fj induces error ejv; checks c and c′ cover data elements]

k-Faults • Faults may cause a number of possible errors • For a given fault, many errors possible • If a given error happens, all associated data elements definitely corrupted • k-Faults: faults generating errors that corrupt k data elements [Figure: fault Fi → error eiu → data elements]

Fault Detectability • System is k-fault detectable if for every error pattern eiu ∃ check c s.t. |c ∩ eiu| = 1 • ∩ means intersection of affected data elements • Proof: • If there exists such a check then every error pattern induced by a fault will be detected • If k-fault detectable then there must ∃ some check that reliably yells for any possible error pattern • Can allow the check that yells to be the check in the definition
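A minimal sketch of this condition for (g,1) checks, with checks and error patterns modeled as sets of data elements. The example system and the function name are hypothetical.

```python
def k_fault_detectable(error_patterns, checks):
    """Every error pattern must meet some check in exactly one element."""
    return all(any(len(c & e) == 1 for c in checks) for e in error_patterns)

# Hypothetical (2,1) checks and candidate error patterns.
checks = [{"d1", "d2"}, {"d2", "d3"}]
print(k_fault_detectable([{"d1"}, {"d1", "d2"}], checks))   # True
print(k_fault_detectable([{"d1", "d2", "d3"}], checks))     # False
```

The second call fails because both checks see two erroneous elements, so neither intersection has size exactly 1.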

Fault Management • k-fault detectability: If a fault affects k data elements then checks will detect it • k-fault locatability: For all faults that affect k data elements, can tell any pair of faults apart • Will examine all fault patterns Fi that come from k data elements failing

Fault Locatability 1 • To locate faults, must ensure that different faults cause different errors • Theorem 1: System k-fault locatable only if for error patterns eiu, ejv (from faults Fi and Fj) eiu Δ ejv ≠ ∅ • Δ denotes symmetric difference • Proof clear: If two faults can show up as the same error, can’t tell them apart
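The necessary condition of Theorem 1 can be sketched with Python's set symmetric difference operator (^). The fault-to-pattern data below is hypothetical, and passing this test does not by itself establish locatability.

```python
def passes_theorem_1(patterns_i, patterns_j):
    """Necessary condition only: no error pattern of Fi coincides with
    an error pattern of Fj (their symmetric difference is non-empty)."""
    return all(e_i ^ e_j for e_i in patterns_i for e_j in patterns_j)

# Hypothetical error patterns induced by three faults.
F1 = [{"d1"}, {"d1", "d2"}]
F2 = [{"d3"}, {"d2", "d3"}]
F3 = [{"d1", "d2"}]          # shares a pattern with F1

print(passes_theorem_1(F1, F2))  # True
print(passes_theorem_1(F1, F3))  # False
```

F1 and F3 fail because both can produce the pattern {d1, d2}, which is exactly the situation the proof sketch rules out.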
