CMSC 5719 MSc Seminar Fault-Tolerant Computing

Presentation Transcript


  1. CMSC 5719 MSc Seminar: Fault-Tolerant Computing. XU, Qiang (Johnny) 徐強 [Partly adapted from Koren & Krishna, and B. Parhami slides]

  2. Why Learn This Stuff?

  3. Outline • Motivation • Fault classification • Redundancy • Metrics for Reliability • Case studies

  4. Fault-Tolerance - Basic Definition • Fault-tolerant systems - ideally, systems capable of executing their tasks correctly despite hardware failures and software errors • In practice - we can never guarantee flawless execution of tasks under all circumstances • We therefore limit ourselves to the types of failures and errors that are most likely to occur

  5. Need for Fault-Tolerance • Critical applications require extreme fault tolerance (e.g., aircraft, nuclear reactors, medical equipment, and financial applications) • A computer malfunction in such applications can lead to catastrophe • Their probability of failure must be extremely low, possibly one in a billion per hour of operation • Systems operating in harsh environments face high failure probabilities • electromagnetic disturbances • particle hits and the like • Complex systems consist of millions of devices, any of which can fail

  6. Get to Know the Enemy: What Causes Faults? • Aging (a.k.a. circuit wearout) • Manufacturing defects

  7. Get to Know the Enemy: What Causes Faults? • Internal electronic noise • Electromagnetic interference

  8. Get to Know the Enemy: What Causes Faults? • Bugs … • Malicious attacks (beyond the scope of this seminar)

  9. Fault Classification According to Duration • Permanent faults - never go away; the component has to be repaired or replaced • Transient faults - disappear after a relatively short time • Example - a memory cell whose contents are changed by electromagnetic interference • Overwriting the memory cell with the right content makes the fault go away • Intermittent faults - cycle between active and benign states • Example - a loose connection • An increasing threat, largely due to temperature and voltage fluctuations

  10. Failures during Lifetime • Three phases of system lifetime (the classic bathtub curve of failure rate over time) • Infant mortality (imperfect test, weak components) • Normal lifetime (transient/intermittent faults) • Wear-out period (circuit aging)

  11. Seriously, Why Does Fault-Tolerance Come Back? • Simply put, it is technology-driven • [Figure: with technology scaling, transistor cost falls over time while reliability cost rises and comes to dominate total cost] • Today's chips are extremely complex (a billion transistors running with less noise margin) and are much hotter! • We cannot afford heavyweight, macro-scale redundancy for commodity computing systems

  12. The Impact of Technology Scaling • More leakage • More process variability • Smaller critical charges • Weaker transistors and wires • Consequences: higher random failure rate, less effective burn-in test, faster wear-out

  13. What Can We Do when Confronting Enemies? • Surrender, but don't become a traitor • Fail, but safely, i.e., don't corrupt anything (e.g., an ATM) • Not as easy as you may think - you have to detect the faults first! • Weaken the enemies • fault avoidance and fault removal • Process improvement to reduce threats • Testing and DfT to remove defective circuits • Careful design reviews to remove design bugs • More training to reduce operator errors • Some faults can never be completely avoided or removed • Make yourself stronger • Fault-tolerance • Adding redundancy to detect, diagnose, confine, mask, compensate for, and recover from faults • Mind the cost in terms of hardware, power, and performance • Fault evasion (a.k.a. fault prediction) • Observe, learn, and take pre-emptive steps to stop faults from occurring

  14. A Motivating Case Study • Data availability and integrity concerns • Distributed DB system with 5 sites • Full connectivity, dedicated links • Only direct communication allowed • Sites and links may malfunction • Redundancy improves availability • S: probability of a site being available • L: probability of a link being available • Single-copy availability = SL • Unavailability = 1 – SL = 1 – 0.99 × 0.95 = 5.95% • Data replication methods, and a challenge • File duplication: home / mirror sites • File triplication: home / backup 1 / backup 2 • Are there availability improvement methods with less redundancy? [Figure: a user accessing file Fi over the network]

  15. Data Duplication: Home and Mirror Sites • S: site availability, e.g., 99% • L: link availability, e.g., 95% • A = SL + (1 – SL)SL (first term: the primary site can be reached; second term: the primary site is inaccessible but the mirror site can be reached) • Duplicated availability = 2SL – (SL)² • Unavailability = 1 – 2SL + (SL)² = (1 – SL)² = 0.35% • Data unavailability reduced from 5.95% to 0.35% • Availability improved from ≈ 94% to 99.65% [Figure: user, Fi home, Fi mirror]

  16. Data Triplication: Home and Two Backups • S: site availability, e.g., 99% • L: link availability, e.g., 95% • A = SL + (1 – SL)SL + (1 – SL)²SL (primary reached; primary inaccessible but backup 1 reached; primary and backup 1 inaccessible but backup 2 reached) • Triplicated availability = 3SL – 3(SL)² + (SL)³ • Unavailability = 1 – 3SL + 3(SL)² – (SL)³ = (1 – SL)³ = 0.02% • Data unavailability reduced from 5.95% to 0.02% • Availability improved from ≈ 94% to 99.98% [Figure: user, Fi home, Fi backup 1, Fi backup 2]

  17. Data Dispersion: Three of Five Pieces • The file is encoded into five pieces, each one-third of the file's size, one piece per site; any three pieces suffice to reconstruct the file (the user's own site holds one piece, so reaching any two of the four remote pieces is enough) • S: site availability, e.g., 99% • L: link availability, e.g., 95% • A = (SL)⁴ + 4(1 – SL)(SL)³ + 6(1 – SL)²(SL)² (all 4 remote pieces can be reached; exactly 3 can be reached; only 2 can be reached) • Dispersed availability = 6(SL)² – 8(SL)³ + 3(SL)⁴ • Availability = 99.92% • Unavailability = 1 – availability = 0.08% [Figure: user plus pieces 0–4]
  Scheme          Nonredundant   Duplication   Triplication   Dispersion
  Unavailability  5.95%          0.35%         0.02%          0.08%
  Redundancy      0%             100%          200%           67%
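A minimal Python sketch that reproduces the numbers above (S = 0.99, L = 0.95; the dispersion case assumes, as in the figure, that the user's site holds one piece so at least two of the four remote pieces must be reachable):

```python
from math import comb

S, L = 0.99, 0.95   # site and link availability
p = S * L           # probability that one remote copy/piece is reachable

single = p                              # one copy at a remote site
duplicated = 1 - (1 - p) ** 2           # home + mirror
triplicated = 1 - (1 - p) ** 3          # home + two backups
# dispersion: at least 2 of the 4 remote pieces must be reachable
dispersed = sum(comb(4, k) * p**k * (1 - p) ** (4 - k) for k in range(2, 5))

for name, a in [("nonredundant", single), ("duplication", duplicated),
                ("triplication", triplicated), ("dispersion", dispersed)]:
    print(f"{name:12s}  availability = {a:.4%}  unavailability = {1 - a:.2%}")
```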

  18. Questions Ignored in Our Simple Example • 1. How are redundant copies of data kept consistent? When a user modifies the data, how do we update the redundant copies (pieces) quickly and prevent the use of stale data in the meantime? • 2. How are malfunctioning sites and links identified? Malfunction diagnosis must be quick to avoid data contamination • 3. How is recovery accomplished when a malfunctioning site/link returns to service after repair? The returning site must be brought up to date with regard to changes • 4. How is data corrupted by the actions of an adversary detected? This is more difficult than detecting random malfunctions • The example does demonstrate, however, that: • Many alternatives are available for improving dependability • Proposed methods must be assessed through modeling • The most cost-effective solution may be far from obvious

  19. Redundancy • Redundancy is at the heart of fault-tolerance • Incorporation of extra components in the design of a system to improve its reliability • Four forms of redundancy: • Hardware redundancy (spatial redundancy) • Static, dynamic and hybrid redundancy • Software redundancy • N-version programming • Information redundancy • Error detecting and correcting codes • Usually requires extra hardware for processing • Time redundancy • Re-execution

  20. Physical Redundancy • Physically replicate modules • Effective for all sorts of faults • Mind the area/energy overhead • Design issues • How many copies? • How to detect faults? • How to recover from faults? • How to organize redundancy (passive, active, or hybrid)?

  21. Triple Modular Redundancy (TMR) • The best-known FT technique • Tolerates a single error (soft or hard) in any one module • Low performance overhead • Simple design • Very high cost in terms of area and energy • To tolerate simultaneous faults, we can resort to N-modular redundancy (NMR) • N is an odd integer • Tolerates up to (N-1)/2 faulty modules • Single point of failure at the voter • The voter is typically small and hence often assumed to be very reliable (see the voter sketch below)
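As a concrete illustration, a minimal majority voter in Python (the three module outputs here are placeholders):

```python
from collections import Counter

def majority_vote(outputs):
    """Return the value agreed on by a majority of redundant modules.

    For TMR, len(outputs) == 3 and a single faulty module is masked;
    if no majority exists, the fault is uncorrectable and is reported.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count <= len(outputs) // 2:
        raise RuntimeError("no majority: too many modules faulty")
    return value

print(majority_vote([42, 41, 42]))  # module B faulty; voter still outputs 42
```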

  22. Reliability of TMR Systems • M-of-N system with M=2, N=3 - the system is good if at least two modules are operational • A voter picks the majority output • The voter can fail - reliability of the voter is Rvot(t) • R_TMR(t) = Rvot(t) · (3R²(t) – 2R³(t)), since at least two of the three modules must be up: 3R²(1 – R) + R³ = 3R² – 2R³
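A quick numeric check, assuming exponentially distributed module lifetimes R(t) = e^(–λt) and, for simplicity, a perfect voter (both are assumptions, not from the slides):

```python
import math

lam = 1e-4   # assumed module failure rate, failures per hour

def r_module(t):
    return math.exp(-lam * t)          # module reliability at time t

def r_tmr(t, r_vot=1.0):
    r = r_module(t)
    return r_vot * (3 * r**2 - 2 * r**3)

for t in (100, 1_000, 10_000):
    print(f"t = {t:>6} h   simplex = {r_module(t):.4f}   TMR = {r_tmr(t):.4f}")
```

Note that TMR beats a single module only while R(t) > 0.5; for very long missions without repair it is actually worse, since by then two modules have likely failed.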

  23. Triplicated Processor/Memory System • All communications (in either direction) between triplicated processors and triplicated memories go through majority voting • Higher reliability than voting only once on the output of the triplicated processor/memory structure

  24. Design Redundancy • Use diverse designs to furnish the same service • Another kind of physical redundancy • Advantages • Protection against design deficiency • Lower cost with simple “back-up” unit

  25. Watchdog Processor • Performs concurrent system-level error detection • Monitors the bus connecting the main processor and memory • Targets control-flow checking: are the correct program blocks executed in the right order? • Can detect hardware/software faults that cause erroneous instructions to be executed or wrong execution paths to be taken • The watchdog needs the program's control-flow information

  26. DIVA: Dynamic Implementation Verification Architecture • Core computation, communication, and control validated by checker • Checker relaxes the burden of correctness on the core processor • Key checker requirements: simple, fast, and reliable

  27. N-Version Programming • N independent teams of programmers develop software to the same specifications - the N versions are run in parallel and their outputs are voted on • If the programs are developed independently - very unlikely that they will fail on the same inputs • Assumption - failures are statistically independent; probability of failure of an individual version = q • Probability of no more than m failures out of N versions: P = Σ (i=0 to m) C(N,i) qⁱ (1 – q)^(N–i) • What are the limitations? (see the sketch below)
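A sketch of that binomial computation in Python; for example, three versions with majority voting tolerate at most one failed version:

```python
from math import comb

def p_at_most(m: int, N: int, q: float) -> float:
    """P(at most m of N independent versions fail), each failing w.p. q."""
    return sum(comb(N, i) * q**i * (1 - q) ** (N - i) for i in range(m + 1))

q = 0.01                                 # assumed per-version failure probability
print(f"P(majority correct) = {p_at_most(1, 3, q):.6f}")   # ~ 0.999702
```

The catch, and the main limitation, is the independence assumption: in practice, separately developed versions tend to fail on the same hard inputs, because teams make correlated mistakes, so the real improvement is smaller than this calculation suggests.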

  28. Information Redundancy - Coding • A data word with d bits is encoded into a codeword with c bits - c>d • Not all combinations are valid codewords • To extract original data - c bits must be decoded • If the c bits do not constitute a valid codeword an error is detected • For certain encoding schemes - some types of errors can also be corrected • Key parameters of code: • Number of erroneous bits that can be detected • Number of erroneous bits that can be corrected • Overhead of code: • Additional bits required • Additional hardware/latency for encoding and decoding

  29. Hamming Distance • The Hamming distance between two codewords - the number of bit positions in which the two words differ • A Hamming distance of two between two codewords implies that a single bit error will not change one of the codewords into the other

  30. Distance of a Code • The Distance of a code - the minimum Hamming distance between any two valid codewords • Example - Code with four codewords -{001,010,100,111} • has a distance of 2 • can detect any single bit error • Example - Code with two codewords - {000,111} • has a distance of 3 • can detect any single or double bit error • if double bit errors are not likely to happen - code can correct any single bit error
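A small Python sketch that computes Hamming distances and verifies the code distances of the two examples above:

```python
from itertools import combinations

def hamming(a: str, b: str) -> int:
    """Number of bit positions in which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))

def code_distance(code) -> int:
    """Minimum Hamming distance over all pairs of valid codewords."""
    return min(hamming(a, b) for a, b in combinations(code, 2))

print(code_distance(["001", "010", "100", "111"]))  # -> 2 (detects 1-bit errors)
print(code_distance(["000", "111"]))                # -> 3 (corrects 1-bit errors)
```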

  31. Coding vs. Redundancy • The code {000,111} can be used to encode a single data bit • 0 can be encoded as 000 and 1 as 111 • This code is identical to TMR • The code {00,11} can also be used to encode a single data bit • 0 can be encoded as 00 and 1 as 11 • This code is identical to a duplex • To detect up to k bit errors, the code distance should be at least k+1 • To correct up to k bit errors, the code distance should be at least 2k+1

  32. Separability of a Code • A code is separable if it has separate fields for the data and the code bits • Decoding consists of disregarding the code bits • The code bits can be processed separately to verify the correctness of the data • A non-separable code has the data and code bits integrated together - extracting the data from the encoded word requires some processing • The simplest separable codes are the parity codes • A parity code has a distance of 2 • Can detect all odd-bit errors • Even or odd parity code?

  33. Error-Correcting Parity Codes • Simplest scheme - data is organized in a 2-dimensional array • Bits at the end of row – parity over the row • Bits at the bottom of column – parity over the column • A single-bit error anywhere will cause a row and a column to be erroneous • This identifies a unique erroneous bit • This is an example of overlapping parity - each bit is covered by more than one parity bit
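A minimal sketch of the 2-D parity scheme over a 4×4 bit array (even parity assumed): recompute the row/column parities and locate the single flipped bit at the intersection of the failing row and column:

```python
def parities(bits):
    """Even parity of each row and each column of a 2-D bit array."""
    rows = [sum(r) % 2 for r in bits]
    cols = [sum(c) % 2 for c in zip(*bits)]
    return rows, cols

data = [[1, 0, 1, 1],
        [0, 1, 0, 0],
        [1, 1, 1, 0],
        [0, 0, 1, 1]]
row_p, col_p = parities(data)        # stored parity bits

data[2][1] ^= 1                      # inject a single-bit error
row_q, col_q = parities(data)        # recomputed parity bits

bad_row = next(i for i in range(4) if row_p[i] != row_q[i])
bad_col = next(j for j in range(4) if col_p[j] != col_q[j])
data[bad_row][bad_col] ^= 1          # flip it back: error corrected
print(f"corrected bit at row {bad_row}, column {bad_col}")  # row 2, column 1
```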

  34. Cyclic Redundancy Check (CRC) • Many applications need to detect burst errors • Why is CRC popular? • Effectiveness: an n-bit CRC (degree-n generator) detects all burst errors of length n or less, plus a large fraction of longer multi-bit errors • Ease of hardware implementation: shift registers and XORs • How does it work? • Consider the dataword and codeword as polynomials over GF(2) • At the transmitter side, Codeword = Dataword × Generator • The generator is a pre-defined CRC polynomial • An example CRC-16 polynomial (CRC-16-ANSI): x¹⁶ + x¹⁵ + x² + 1 • At the receiver side, divide the codeword by the CRC polynomial and check whether the remainder is zero
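A bit-level sketch of the divide-and-check step (the data value is arbitrary; the generator is the CRC-16-ANSI polynomial named above, 0x18005; real implementations use table-driven or shift-register forms):

```python
def crc_bits(data: int, nbits: int, poly: int, deg: int) -> int:
    """Check bits: remainder of data * x^deg divided by poly over GF(2)."""
    reg = data << deg                    # append deg zero bits
    for i in range(nbits + deg - 1, deg - 1, -1):
        if reg & (1 << i):
            reg ^= poly << (i - deg)     # XOR-subtract the generator
    return reg

def check(codeword: int, nbits: int, poly: int, deg: int) -> bool:
    """Receiver side: remainder of the codeword must be zero."""
    for i in range(nbits - 1, deg - 1, -1):
        if codeword & (1 << i):
            codeword ^= poly << (i - deg)
    return codeword == 0

POLY, DEG = 0x18005, 16                  # x^16 + x^15 + x^2 + 1
data = 0b1101011011
cw = (data << DEG) | crc_bits(data, 10, POLY, DEG)   # systematic codeword
print(check(cw, 10 + DEG, POLY, DEG))          # True: no error
print(check(cw ^ 0b100, 10 + DEG, POLY, DEG))  # False: single-bit error caught
```

(This builds the usual systematic codeword, data followed by check bits; the multiplication view on the slide is the equivalent non-systematic formulation.)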

  35. Time Redundancy • Perform execution multiple times (typically twice), and then compare the results • Effective for transient faults • Does it work for permanent errors? • Cost of time redundancy • Performance cost, can we mitigate it? • Energy cost, can we mitigate it?
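A minimal sketch of dual execution with comparison (f and its inputs are placeholders):

```python
def run_twice(f, *args):
    """Execute f twice and compare; a mismatch signals a transient error."""
    r1 = f(*args)
    r2 = f(*args)                        # re-execution (could also retry)
    if r1 != r2:
        raise RuntimeError("results disagree: transient error detected")
    return r1

print(run_twice(lambda x, y: x + y, 2, 3))   # -> 5
```

Plain re-execution on the same hardware does not catch a permanent fault, which corrupts both runs identically; variants such as recomputing with shifted operands (RESO) exercise different parts of the datapath on the second pass to extend coverage.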

  36. Reversible Computation • Many operations are reversible • addition/subtraction; shift left/shift right; etc. • If reversing the operation does not give back the original operand, we know there's a problem • What operations are non-reversible? • The devil is in the details
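A toy illustration of a reversibility check on addition:

```python
def checked_add(a: int, b: int) -> int:
    """Add, then verify by reversing the operation (subtraction)."""
    s = a + b
    if s - b != a:                 # reverse and compare with the operand
        raise RuntimeError("error detected in addition")
    return s

print(checked_add(7, 5))           # -> 12
```

(On real hardware the forward and reverse operations may use the same, possibly faulty, unit, so what this detects depends on the fault model; this is only to show the idea.)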

  37. With Redundancy, What Can We Do? • Forward Error Correction (FEC) • Also known as forward error recovery (FER), although it is actually not recovery • Use redundancy to mask error effects • The system continues to go forward in the presence of errors • Example: triple modular redundancy (TMR)

  38. With Redundancy, What Can We Do? • Backward Error Recovery (BER) • Use redundancy to recover from errors • The system rolls back to a saved good state • Example: periodic checkpointing and replay
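A toy checkpoint-and-replay loop (in-memory state and a simulated error detector; a real system would checkpoint to stable storage):

```python
import copy
import random

random.seed(1)
state = {"step": 0, "acc": 0}
checkpoint = copy.deepcopy(state)          # last known-good state

while state["step"] < 10:
    state["step"] += 1
    state["acc"] += state["step"]
    if random.random() < 0.2:              # simulated detected error
        state = copy.deepcopy(checkpoint)  # roll back ...
        continue                           # ... and replay from there
    if state["step"] % 3 == 0:
        checkpoint = copy.deepcopy(state)  # periodic checkpoint

print(state)   # reaches step 10 with acc = 55 despite the injected errors
```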

  39. The Impact of FEC vs. BER

  40. The Impact of FEC vs. BER • When failure rate is very high, which one is preferred?

  41. Fault-Tolerance is NOT Free! • Fault-tolerance can be achieved to any arbitrary degree if you are willing to throw resources at it • Canonical solutions have been around for a long time • Many FT solutions hurt performance, e.g., • Checkpoint and replay • Tightly lockstepped redundant cores • Redundant multithreading • Many FT solutions increase cost, e.g., • TMR and NMR • RAID • N-version programming • Almost all FT techniques increase energy consumption

  42. Fault-Tolerance for Designers • Fault-tolerance is essentially redundancy allocation and management, and design is about tradeoffs! • As designers, smarter FT solutions can be obtained if you • Know your enemies better (what causes the failures, the failure rate, the failure distribution, etc.) • Know your design better (specific properties, anything "free", when and what to sacrifice, etc.)

  43. Levels of Fault-Tolerance (from system level down to circuit level) • Software redundancy • Virtualization • Task migration • Redundant multithreading • Fault-tolerant scheduling • Core-level redundancy • Dynamic verification • Block-level redundancy • ECC for memory • Circuit hardening

  44. Lots of FT Buzzwords over Time … • Reliability - continuation of service while being used • Availability - readiness for use whenever needed • Serviceability - ease of service or repair • Safety - absence of catastrophic consequences • Maintainability - ability to undergo modifications and repairs • Survivability, confidentiality, accessibility … • Security - the degree of protection against danger, loss, and criminals • Dependability - the trustworthiness of a computing system which allows reliance to be justifiably placed on the service it delivers (defined by the IFIP 10.4 Working Group on Dependable Computing and Fault Tolerance) • You're right, I don't know what it exactly means … • Who cares?

  45. Designers Need Measures • A measure is a mathematical abstraction which expresses only some subset of the object's nature, i.e., FT capability here • Reliability, R(t): probability that the system is up during the whole interval [0, t]; for non-repairable products • Availability, A(t): fraction of time the system is up during the interval [0, t]; for repairable products • Point availability, Ap(t): probability that the system is up at time t • Long-term availability: A = lim (t→∞) A(t) = MTTF / (MTTF + MTTR) • People usually talk about "the 9's" (e.g., "five nines" means A = 99.999%)

  46. Designers Need Intuitive Measures • Mean Time To Failure, MTTF: average time the system remains up before it goes down and has to be repaired or replaced • MTTF captures only the mean, so there is also nTTF • Mean Time To Repair, MTTR • Mean Time Between Failures, MTBF = MTTF + MTTR • Failures in Time, FIT: number of failures per 10⁹ hours • Be careful about the assumptions behind these measures!
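A quick illustration of how these measures relate, under assumed numbers (a FIT rate of 1000 and an MTTR of 4 hours, both hypothetical):

```python
import math

FIT = 1000            # assumed: failures per 10^9 device-hours
MTTR = 4.0            # assumed: hours to repair

MTTF = 1e9 / FIT                   # 1,000,000 hours
MTBF = MTTF + MTTR
A = MTTF / (MTTF + MTTR)           # long-term availability
nines = -math.log10(1 - A)         # "how many 9's"

print(f"MTTF = {MTTF:.0f} h, MTBF = {MTBF:.0f} h")
print(f"A = {A:.7f}  ({nines:.1f} nines)")
```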

  47. More Detailed (Complex) Measures • The assumption of the system being in state "up" or "down" is very limiting • Example: multicore processors • Let Pi = Prob{i processors are operational} • Let c = computational capacity of a processor (e.g., number of fixed-size tasks it can execute) • Computational capacity of i processors: Ci = i × c • Average computational capacity of the system: Σi Ci · Pi • Performability: consider everything from the perspective of the application • The application is used to define "accomplishment levels" L1, L2, ..., Ln, each representing a QoS level • The measure is the vector (P(L1), P(L2), ..., P(Ln)), where P(Li) is the probability that the computer functions well enough to permit the application to reach accomplishment level Li
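A sketch of the average-capacity measure for a hypothetical 4-core processor (the probability distribution below is made up for illustration):

```python
c = 2.0                                   # assumed tasks per operational core
P = {0: 0.001, 1: 0.004, 2: 0.015, 3: 0.08, 4: 0.9}   # assumed Prob{i cores up}

avg_capacity = sum(i * c * p for i, p in P.items())    # sum of Ci * Pi
print(f"average computational capacity = {avg_capacity:.3f} tasks")
# all 4 cores up would give 8.0; graceful degradation is captured here,
# which a plain up/down reliability measure cannot express
```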

  48. Example: Tandem for Transaction Processing • Design objectives: • "Nonstop" operation • Modular system expansion • FT design features: • Loosely-coupled multi-computer architecture • Fail-fast hardware/software modules • Error-correcting memory • Error-detecting messages • Watchdog timers • …

  49. Example: AT&T Switching Systems • Design objectives: • High availability: 2 hours of downtime in 40 years • Differentiated user aggravation levels: • Extremely low disconnection rate for established calls • Low failure rate for call establishment • FT design features: • Redundant processors • 30% of control logic devoted to self-checking (for the 1981 3B20D processor) • Various forms of EDC and ECC • Watchdog timers • Multiple levels of fault recovery • …

  50. Example: Personal Computer • Design objectives: • Fast and cheap • Occasional corruption is tolerable • Expected lifetime: a couple of years • FT design features: • ECC for memory and hard disk • … • More FT features will be in place for commodity ICs in the near future due to increasing reliability threats • The key is cost-effectiveness!
