
Algorithm-Based Fault Tolerance Theory of Check Placement

Algorithm-Based Fault Tolerance Theory of Check Placement. Greg Bronevetsky. So Far…. Learned how certain computations could be checked using algorithm-specific checks. In any algorithm we can develop checks to verify any set of data items. How effective are these checks?


Presentation Transcript


  1. Algorithm-Based Fault Tolerance Theory of Check Placement • Greg Bronevetsky

  2. So Far… • Learned how certain computations could be checked using algorithm-specific checks. • In any algorithm we can develop checks to verify any set of data items. • How effective are these checks? • How many faults can a given set of checks detect?

  3. Abstract Checks • Suppose we are given (g,h)-checks • Check defined on g data elements • If all elements correct, returns 0 • If > 0 and ≤ h elements erroneous, returns 1 • If > h elements erroneous, undefined
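
The behaviour of an abstract (g,h)-check can be sketched in a few lines of Python. This is only an illustration of the definition on this slide: the function name `check_response` and the use of `None` for the undefined case are assumptions, not part of the original material.

```python
from typing import Optional, Set


def check_response(checked: Set[str], erroneous: Set[str], h: int) -> Optional[int]:
    """Response of an abstract (g, h)-check, where g = len(checked).

    Returns 0 if no checked element is erroneous, 1 if between 1 and h
    checked elements are erroneous, and None when more than h are
    erroneous (the check is "overwhelmed" and its output is undefined).
    """
    hits = len(checked & erroneous)
    if hits == 0:
        return 0
    if hits <= h:
        return 1
    return None


# A (2,1) check over d1 and d2 (hypothetical element names).
print(check_response({"d1", "d2"}, set(), h=1))         # 0
print(check_response({"d1", "d2"}, {"d1"}, h=1))        # 1
print(check_response({"d1", "d2"}, {"d1", "d2"}, h=1))  # None
```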

  4. Checking Example • Assume (2, 1) checks • 2 elements, 1-failure detect • Both sets of checks can detect single errors • Neither can locate individual errors [Figure: data elements d1 … dn added into sum, drawn twice; one copy is labeled "n checks: ∀i. di and sum", the other "1 check: sum"]

  5. But with one more check… • If also check sum • can detect any pair of errors • can locate single errors • Need general theory of effective and efficient check placement [Figure: data elements d1 … dn added into sum, labeled "n checks: ∀i. di and sum; 1 more check: sum"]

  6. Goals • Need models for correlating processor faults to data errors • Given fault model and set of checks need to derive fault detectability and locatability

  7. Papers covered • V.S.S. Nair, J.A. Abraham, P. Banerjee. "Efficient techniques for the analysis of algorithm-based fault tolerance (ABFT) schemes", 1996. • Choon-Sik Park and Mineo Kaneko, "An Efficient Technique for Design of ABFT Systems Based on Modified PD Graph". • Choon-Sik Park, "Algorithm-Based Fault Tolerant Systems Based on Graph-Theoretic Error Occurence+Propagation Models", 2000. (PhD Thesis) • V.S.S. Nair, J.A. Abraham. "Hierarchical design and analysis of fault-tolerant multiprocessor systems using concurrent error detection", 1990.

  8. Outline • Matrix-based formalism of Nair et al • Dependence graph-based formalism of Park et al • Includes fault propagation models • Framework for hierarchical fault tolerant systems by Nair et al • Building fault tolerant systems out of fault tolerant components

  9. Basic Framework • Each processor and each check is associated with a set of data elements [Figure: processors P1 to P4 and checks C1 to C3, each linked to a subset of data elements d1 to d7]

  10. Basic Framework • Data(Pi) = set of data elements affected by processor i • If Pi fails, any subset of Data(Pi) may be erroneous • No notion of errors propagating based on data dependences • Data() defines the Processor-Data (PD) Matrix

  11. Associated PD Matrix [Figure: PD matrix with one row per processor (P1 to P4) and one column per data element (d1 to d7)]

  12. Basic Framework • Check(di) = set of checks that check data element di. • Must be non-empty if we expect to detect errors • Check defines the Data-Check (DC) Matrix • Paper focuses on (g,1) checks • g data elements • can detect up to 1 fault

  13. Associated DC Matrix • C1 and C2 are (3,1) checks • C3 is a (2,1) check [Figure: DC matrix with one row per data element (d1 to d7) and one column per check (C1 to C3)]

  14. The PC Matrix • Finally, associate processors and checks: • Processor-check (PC) matrix = PD × DC • PD is Processors × Data Elements, DC is Data Elements × Checks, so PC is Processors × Checks • Each PC entry = # of that processor's data elements verified by that check
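
As a small worked example of the product on this slide, the sketch below multiplies a PD matrix by a DC matrix in plain Python. The two-processor, four-element, two-check system is hypothetical (the matrices from the original figures are not available), and `mat_mul` is just an illustrative helper.

```python
def mat_mul(a, b):
    """Plain-Python matrix product of a (n x m) and b (m x p)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


# Hypothetical PD matrix: rows are processors, columns are d1..d4.
PD = [
    [1, 1, 0, 0],  # P1 affects d1, d2
    [0, 0, 1, 1],  # P2 affects d3, d4
]

# Hypothetical DC matrix: rows are d1..d4, columns are checks C1, C2.
DC = [
    [1, 0],  # d1 is checked by C1
    [1, 0],  # d2 is checked by C1
    [1, 1],  # d3 is checked by C1 and C2
    [0, 1],  # d4 is checked by C2
]

# PC[i][j] = number of Pi's data elements verified by check Cj.
PC = mat_mul(PD, DC)
print(PC)  # [[2, 0], [1, 2]]
```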

  15. Using the PC Matrix • PC matrix shows if we can detect single-processor errors: • Assume all checks are (g,h) checks • If a processor's row of PC has all entries ≤ h, failure of that processor will be detected • Regardless of which of its data elements actually become erroneous [Figure: PC matrix, rows = processors, entries = # elements verified by check]

  16. Using the PC Matrix • If a processor's row of PC has all entries ≤ h, failure of that processor will be detected [Figure: processors P1 to P4, checks C1 to C3 and data elements d1 to d7, next to the PC matrix (# elements verified by check)]
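
The conservative row test can be sketched as follows. The function names and the example PC matrix are hypothetical, and the test reads the slide's condition as "every entry of the row is at most h"; it relies on the earlier requirement that every data element is covered by at least one check.

```python
def row_passes_conservative_test(pc_row, h):
    """True if no check can be overwhelmed by this processor's errors."""
    return all(entry <= h for entry in pc_row)


def processors_failing_test(pc, h):
    """Indices of PC rows that the conservative test cannot certify."""
    return [i for i, row in enumerate(pc) if not row_passes_conservative_test(row, h)]


# Hypothetical PC matrix: rows are P1, P2; columns are C1, C2.
PC = [
    [2, 0],
    [1, 2],
]

print(processors_failing_test(PC, h=2))  # []
print(processors_failing_test(PC, h=1))  # [0, 1]
```

A row that fails this test is not necessarily undetectable; the test only fails to certify it.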

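  17. Relaxing Detectability • Condition is too conservative • Suppose we have (3, 2) checks • Pi’s PD row, the DC matrix for the 2 checks, and the resulting PC matrix are shown [Figure: processor P1, checks C1 and C2, data elements d1 to d5; matrix entries not recoverable from the transcript]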

  18. Relaxing Detectability • C1 may be overwhelmed by errors • Will not notice error <d1, d2, d5> • By above criterion system can’t detect failure in P1

  19. Reaching New Detectability Definition • But how could C1 be overwhelmed? • When all 3 of its elements have errors • Recall, these are (3,2) checks

  20. Reaching New Detectability Definition • But C1 and C2 overlap on d5 • Thus if C1 overwhelmed, C2 detects error • It is not overwhelmed • Thus, for any error pattern can see if any check will notice

  21. Trivial Algorithm 2 • Try every possible error pattern • Exponentially many of them • For each pattern see if some check will detect it • Before: ensured that no check overwhelmed • Pro: Correct and not conservative • Con: Expensive
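
A minimal sketch of this brute-force approach is shown below. The check layout (C1 to C4 chained over d1 to d5) and the names `detectable_bruteforce` and `CHECKS` are assumptions made for illustration; a pattern counts as detected when some check sees between 1 and h of its erroneous elements.

```python
from itertools import combinations


def detectable_bruteforce(data, checks, h):
    """Try every non-empty error pattern over `data` (exponentially many)."""
    elements = sorted(data)
    for size in range(1, len(elements) + 1):
        for pattern in combinations(elements, size):
            errors = set(pattern)
            # Detected if at least one check is hit but not overwhelmed.
            if not any(1 <= len(members & errors) <= h for members in checks.values()):
                return False
    return True


# Hypothetical (2,1) checks, each covering two neighbouring elements.
CHECKS = {
    "C1": {"d1", "d2"},
    "C2": {"d2", "d3"},
    "C3": {"d3", "d4"},
    "C4": {"d4", "d5"},
}

# Suppose processor P1 can corrupt d1, d2 and/or d3.
print(detectable_bruteforce({"d1", "d2", "d3"}, CHECKS, h=1))  # True

# With only C1 and C2, the pattern <d1, d2, d3> overwhelms both checks.
only_two = {"C1": CHECKS["C1"], "C2": CHECKS["C2"]}
print(detectable_bruteforce({"d1", "d2", "d3"}, only_two, h=1))  # False
```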

  22. New Definition of Detectability • Work with error patterns • Ex: <d1, d2, d5>, <d1, d3, d4>, <d3>, etc. • If one check detects given error pattern, no problem if other checks overwhelmed • Repeat until all error patterns detected: If some check not overwhelmed, eliminate all detectable error patterns from consideration

  23. Example of Detectability Algorithm • Is failure of P1 detectable? • P1 fails ⇒ d1, d2 and/or d3 may have errors • C1, C2 overwhelmed • C3 not overwhelmed [Figure: processors P1 and P2, data elements d1 to d5, and (2,1) checks C1 to C4]

  24. Example of Detectability Algorithm • Look at errors C3 can detect: d3 • Remove them from consideration • Since any error pattern involving d3 will be detected

  25. Example of Detectability Algorithm • Look at remaining error patterns: combinations of d1 and/or d2 • Now C2 not overwhelmed • Remove any error patterns involving d2

  26. Example of Detectability Algorithm • Look at remaining error patterns: d1 • C1 not overwhelmed • Remove any of its error patterns

  27. Example of Detectability Algorithm • All of P1’s error patterns detected • We are done!
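
The elimination walk-through in slides 23 to 27 can be sketched as a loop over a shrinking set of "still unexplained" data elements. The check layout below (C1 = {d1, d2}, C2 = {d2, d3}, C3 = {d3, d4}, C4 = {d4, d5}) is an assumption chosen to be consistent with the steps described in the slides; the original figure is not available, and the function name is hypothetical.

```python
def detectable_by_elimination(data, checks, h):
    """Repeatedly pick a check that is hit but not overwhelmed and drop
    the elements it covers, since every remaining pattern that includes
    one of them is guaranteed to be detected by that check."""
    remaining = set(data)
    steps = []
    progress = True
    while remaining and progress:
        progress = False
        for name, members in checks.items():
            hits = members & remaining
            if 1 <= len(hits) <= h:
                remaining -= hits
                steps.append((name, sorted(hits)))
                progress = True
                break
    return not remaining, steps


# Assumed layout of the (2,1) checks in the example.
CHECKS = {
    "C1": {"d1", "d2"},
    "C2": {"d2", "d3"},
    "C3": {"d3", "d4"},
    "C4": {"d4", "d5"},
}

ok, steps = detectable_by_elimination({"d1", "d2", "d3"}, CHECKS, h=1)
print(ok)     # True
print(steps)  # [('C3', ['d3']), ('C2', ['d2']), ('C1', ['d1'])]
```

Each step touches one check and never enumerates individual error patterns, which is what makes this cheaper than trying every pattern.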

  28. Failing Check Processors • What if processor performing check fails? • Add “pseudo” data elements to represent processors • Each check will also check its processor’s pseudo-data element • New element has ∞ weight, so error in it will overwhelm any check

  29. Final System • Check C3 runs on P1 • Checks C1, C2 and C4 run on P2 [Figure: processors P1 and P2, (2,1) checks C1 to C4, and data elements through d7]

  30. The Infinities [Figure: the final system with its PD matrix (Processors × Data Elements), DC matrix (Data Elements × Checks) and PC matrix (# elements verified by check)]

  31. The Infinities • If P1 fails, C1 and C2 overwhelmed • C3 also overwhelmed by ∞+1 • Because C3 runs on failed P1 • Only C4 not overwhelmed

  32. The Infinities • Remove all error patterns detected by C4 • Any that include d2

  33. The Infinities • C1 and C2 no longer overwhelmed • Remove error patterns detected by C1 and C2 • Any that include d1 and d3 • PC matrix note: C4’s entry must become 0; others may go lower

  34. The Infinities • Now P1’s row is all 0’s and ∞’s • All real data elements successfully checked • Only pseudo-elements remain • Don’t care • PC matrix note: C1’s and C2’s entries must become 0; others may go lower

  35. The Infinities • Note failure of P2 not detectable • d5 only checked by C4, which runs on P2 • Thus, entry will never drop to ∞
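
The effect of the ∞-weight pseudo-elements can be sketched by treating any check that runs on the failed processor as permanently overwhelmed. The layout below is an assumption chosen to reproduce the outcomes described in slides 29 to 35 (P1 detectable, P2 not); the actual matrices from the figures are not available, and all names are hypothetical.

```python
import math

# Assumed layout: which data each processor affects, what each
# (2,1) check covers, and which processor runs each check.
DATA_OF = {"P1": {"d1", "d2", "d3"}, "P2": {"d4", "d5"}}
CHECKS = {
    "C1": {"d1", "d2"},
    "C2": {"d2", "d3"},
    "C3": {"d3", "d4"},
    "C4": {"d2", "d5"},
}
RUNS_ON = {"C1": "P2", "C2": "P2", "C3": "P1", "C4": "P2"}


def pc_row(proc):
    """PC row for `proc`, adding infinite weight for checks it hosts."""
    row = {}
    for name, members in CHECKS.items():
        weight = len(members & DATA_OF[proc])
        if RUNS_ON[name] == proc:
            weight += math.inf  # pseudo-element of the failed processor
        row[name] = weight
    return row


def detectable(proc, h=1):
    """Elimination over real data elements; hosted checks never help."""
    remaining = set(DATA_OF[proc])
    progress = True
    while remaining and progress:
        progress = False
        for name, members in CHECKS.items():
            if RUNS_ON[name] == proc:
                continue  # infinite weight: always overwhelmed
            hits = members & remaining
            if 1 <= len(hits) <= h:
                remaining -= hits
                progress = True
                break
    return not remaining


print(pc_row("P1"))      # {'C1': 2, 'C2': 2, 'C3': inf, 'C4': 1}
print(detectable("P1"))  # True
print(detectable("P2"))  # False
```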

  36. Multi-Process Errors • Want to know if system can detect failures of r processors • For every subset of r processors • Take union of all data elements they touched • Pretend each r-set is single processor • Use above algorithm to check if all resulting error patterns detectable
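
A minimal sketch of this reduction: merge the data sets of every group of r processors and run the single-processor elimination test on the union. The system layout and function names are hypothetical, and failures of the processors that run the checks are ignored here for brevity.

```python
from itertools import combinations


def eliminate(data, checks, h):
    """Single-'processor' elimination test over a set of data elements."""
    remaining = set(data)
    progress = True
    while remaining and progress:
        progress = False
        for members in checks.values():
            hits = members & remaining
            if 1 <= len(hits) <= h:
                remaining -= hits
                progress = True
                break
    return not remaining


def r_fault_detectable(data_of, checks, h, r):
    """Treat each set of r processors as one processor and test it."""
    for group in combinations(sorted(data_of), r):
        merged = set().union(*(data_of[p] for p in group))
        if not eliminate(merged, checks, h):
            return False
    return True


# Hypothetical system with chained (2,1) checks.
DATA_OF = {"P1": {"d1", "d2", "d3"}, "P2": {"d4", "d5"}}
CHECKS = {
    "C1": {"d1", "d2"},
    "C2": {"d2", "d3"},
    "C3": {"d3", "d4"},
    "C4": {"d4", "d5"},
}

print(r_fault_detectable(DATA_OF, CHECKS, h=1, r=1))  # True
print(r_fault_detectable(DATA_OF, CHECKS, h=1, r=2))  # False
```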

  37. Fault Locatability • We only see errors, not faults • For each error pattern, want to know which fault caused it • Given two fault patterns, are they distinguishable? • Only if they have different patterns of failed checks • Will give intuition for analysis

  38. 0-1 Disagreement • Take rows Ri and Rj of rPC (faults Fi and Fj) • For every possible error pattern in Ri and Rj look at what each check says on this pattern • If check responses different on each pattern: Fi and Fj can be differentiated

  39. 1-0 Disagreement • Want to differentiate faults Fi and Fi ∪ Fj • Compare each error pattern of Fi and Fj: Eik and Ejl • If some check meets Eik on ≥ 1 and ≤ h spots and meets Ejl on 0 spots then Ejl and Eik ∪ Ejl distinguishable • If this is true for all error patterns then Fi and Fi ∪ Fj distinguishable
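
The pairwise condition can be sketched as a simple predicate over two error patterns. This follows the reading that a check must see between 1 and h erroneous elements of one pattern and none of the other; the check layout and function name are hypothetical, and the sketch covers only the pairwise test, not the full locatability algorithm from the paper.

```python
def one_zero_disagreement(e_ik, e_jl, checks, h):
    """True if some check reliably fires on e_ik and is silent on e_jl."""
    return any(
        1 <= len(members & e_ik) <= h and len(members & e_jl) == 0
        for members in checks.values()
    )


# Hypothetical chained (2,1) checks.
CHECKS = {
    "C1": {"d1", "d2"},
    "C2": {"d2", "d3"},
    "C3": {"d3", "d4"},
    "C4": {"d4", "d5"},
}

# Disagreement in both directions: C1 separates one way, C4 the other.
print(one_zero_disagreement({"d1"}, {"d5"}, CHECKS, h=1))  # True
print(one_zero_disagreement({"d5"}, {"d1"}, CHECKS, h=1))  # True

# Every check that sees d2 also sees d1 or d3, so no disagreement.
print(one_zero_disagreement({"d2"}, {"d1", "d3"}, CHECKS, h=1))  # False
```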

  40. 1-0 Disagreement Example • 1-0 disagreement in both directions

  41. 1-0 Disagreement Example • Clearly, Eik and Ejl look different • Eik ∪ Ejl corresponds to the combined fault pattern [Figure: fault pattern and the responses of the checks to it] • Check responses are different from those for Eik or Ejl: Distinguishable!

  42. Fault Locatability • If can show 1-0 disagreement between every single-process fault and every r-process fault: System is r-fault locatable • Algorithm for locatability is obscure • Read the paper

  43. Summary • Presented matrix-based framework for evaluating error detectability & locatability • Framework deals with arbitrary errors • More work by V.S.S. Nair with other coauthors

  44. Outline • Matrix-based formalism of Nair et al • Dependence graph-based formalism of Park et al • Includes fault propagation models • Framework for hierarchical fault tolerant systems by Nair et al • Building fault tolerant systems out of fault tolerant components

  45. Graph-Based Framework • Developed by Choon-Sik Park • Does in graphs what the Nair et al work does in matrices • Assumes (g,1) checks • Differences: • Different definition of fault locatability • Unknown if equivalent • Presents more limited fault→error models • As opposed to “anything and everything” • Will first present general view, then specific error models

  46. Basic Picture • Processor→Data, Data→Data dependence info maintained [Figure: four-layer graph Faults → Errors → Data → Checks; for example fault Fi with error eiu, fault Fj with error ejv, and checks c and c′]

  47. k-Faults • Faults may cause number of possible errors • For given fault, many errors possible • If given error happens, all associated data elements definitely corrupted • k-Faults: faults generating errors that corrupt k data elements

  48. Fault Detectability • System is k-fault detectable if for every error pattern eiu ∃ check c s.t. |c ∩ eiu| = 1 • ∩ means intersection of affected data elements • Proof: • If there exists such a check then every error pattern induced by fault will be detected • If k-fault detectable then ∃ some check that reliably yells for any possible error pattern • Can allow the check that yells to be the check in definition
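
The detectability condition for (g,1) checks can be sketched directly: every error pattern must intersect some check in exactly one data element. The fault names, error patterns and checks below are hypothetical.

```python
def k_fault_detectable(errors_of, checks):
    """errors_of maps each fault to its possible error patterns (sets of
    data elements); checks maps each check to the elements it covers."""
    return all(
        any(len(members & pattern) == 1 for members in checks.values())
        for patterns in errors_of.values()
        for pattern in patterns
    )


# Hypothetical (2,1) checks.
CHECKS = {"C1": {"d1", "d2"}, "C2": {"d2", "d3"}}

# Fault F1 may corrupt d1 alone, or d1 and d2 together.
ERRORS = {"F1": [{"d1"}, {"d1", "d2"}]}
print(k_fault_detectable(ERRORS, CHECKS))  # True

# A fault that corrupts d1, d2 and d3 hits both checks twice.
ERRORS_BAD = {"F1": [{"d1"}], "F2": [{"d1", "d2", "d3"}]}
print(k_fault_detectable(ERRORS_BAD, CHECKS))  # False
```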

  49. Fault Management • k-fault detectability: If a fault affects k data elements then checks will detect it • k-fault locatability: For all faults that affect k data elements, can tell any pair of faults apart • Will examine all fault patterns Fi that come from k data elements failing

  50. Fault Locatability 1 • To locate faults, must ensure that different faults cause different errors • Theorem 1: System k-fault locatable only if for error patterns eiu, ejv (from faults Fi and Fj) eiu ⊕ ejv ≠ ∅ • ⊕ denotes symmetric difference • Proof clear: If two faults can show up as same error, can’t tell them apart
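
The necessary condition in Theorem 1 can be sketched as a search for pairs of faults that share an identical error pattern, using Python's set symmetric difference (`^`). The fault names and error patterns are hypothetical, and passing this test does not by itself establish locatability.

```python
from itertools import combinations


def theorem_1_clashes(errors_of):
    """Return (fault_a, fault_b, pattern) for every error pattern that
    two different faults can both produce (empty symmetric difference)."""
    clashes = []
    for a, b in combinations(sorted(errors_of), 2):
        for e in errors_of[a]:
            for f in errors_of[b]:
                if not (e ^ f):  # symmetric difference is empty
                    clashes.append((a, b, frozenset(e)))
    return clashes


# Hypothetical faults and their possible error patterns.
ERRORS = {
    "F1": [{"d1"}, {"d1", "d2"}],
    "F2": [{"d3"}],
}
print(theorem_1_clashes(ERRORS))  # []

# F3 can produce the same error as F1, so the two cannot be told apart.
ERRORS["F3"] = [{"d1", "d2"}]
print(theorem_1_clashes(ERRORS))
```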
