Greg Bronevetsky. ED4I: Error Detection by Diverse Data and Duplicated Instructions. ED4I Background. A code transformation system developed at the Stanford Center for Reliable Computing. Authors: Nahmsuk Oh, Subhasish Mitra, Edward J. McCluskey
Greg Bronevetsky ED4I: Error Detection by Diverse Data and Duplicated Instructions
ED4I Background • A code transformation system developed at the Stanford Center for Reliable Computing. • Authors: Nahmsuk Oh, Subhasish Mitra, Edward J. McCluskey • ED4I allows us to run a program on two slightly different inputs and still be able to compare results at the end.
Motivation • The simplest way to detect Byzantine Faults is to run the same program on multiple processors and compare results. • ED4I is Byzantine Fault detection for uniprocessors. • Must take into account both temporary and permanent faults.
Definitions • Temporary Faults – any fault that temporarily affects a processor, long enough to execute several instructions. • Ex: Radiation hitting wires, frayed wires. • Permanent Faults – a fault that affects a processor for a long period of time. • Ex: Spilling Coke on the chip, cut wires.
Problem Statement • We can detect Byzantine Failures by running each program or procedure twice and comparing the results. • However, this does not guard against permanent faults since the results of both runs will be the same. • Need to make the two runs different so that the same fault will affect the results differently. • Overhead = 100%.
Key Idea • Let's feed two different sets of data into the program and then compare the results. • Key Insight: • If the program only uses arithmetic operations, we can alter the input by multiplying all input numbers by a constant. • Then the modified output will be (the real output) * (the constant). • Thus, we can verify that the two computations succeeded AND the two computations will be affected by errors differently.
New Program • If we alter the input to the program, we must alter the program to work with this modified input. • The transformation is given the constant k (called the “diversity factor”) and it creates the “k-factor diverse program”. • The new program will have the same control flow graph as the old program, but all the variables will be k-multiples of the original ones.
Transformations • If k<0, branches flip directions (> ↔ <, ≥ ↔ ≤) • All constants in code get multiplied by k. • Addition and Subtraction of variables unchanged. • Multiplication: v1*v2*...*vn → (v1*v2*...*vn)/k^(n-1) • Division: v1/v2 → (v1/v2)*k
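The rules above can be tried on a toy program. The following is a minimal sketch, not the authors' tool: the functions `original`, `diverse` and `run_checked` are hypothetical names, the program itself is made up for illustration, and k = -2 is just one possible diversity factor. It assumes integer variables and that `b` is nonzero.

```python
K = -2  # diversity factor; an illustrative choice, not a recommendation


def original(a, b, c):
    """Toy integer program using *, +, a constant, a branch, - and /."""
    t = a * b + 6
    if t > c:
        return t - c
    return t // b


def diverse(a_k, b_k, c_k, k=K):
    """Hand-transformed k-factor diverse version of `original`.

    Every variable holds k times the value it has in `original`.
    """
    # Multiplication of n=2 variables: divide by k^(n-1) = k.
    # The constant 6 is multiplied by k.
    t_k = (a_k * b_k) // k + 6 * k
    # For k < 0 the branch direction flips (> becomes <).
    taken = (t_k < c_k) if k < 0 else (t_k > c_k)
    if taken:
        return t_k - c_k          # subtraction is unchanged
    return (t_k // b_k) * k       # division: multiply the quotient by k


def run_checked(a, b, c, k=K):
    """Run both versions and compare: diverse result must equal k * original."""
    r = original(a, b, c)
    r_k = diverse(k * a, k * b, k * c, k)
    if r_k != k * r:
        raise RuntimeError("mismatch between original and diverse run")
    return r


if __name__ == "__main__":
    print(run_checked(3, 4, 10))    # takes the branch
    print(run_checked(3, 4, 100))   # takes the division path
```

On a fault-free machine the check always passes; the point of the transformation is that a hardware fault would have to corrupt both runs in a k-consistent way to go unnoticed.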
Fault Detection Probability • For functional unit hi (such as the adder), fault f and diversity factor k: • Xi = the set of inputs to hi • Ei = subset of Xi containing the inputs that will result in erroneous output due to the fault. • E'i = subset of Ei that will escape detection • Ci(k) = Probability of catching an error in hi.
Data Integrity Probability • For functional unit hi, fault f and diversity factor k: • Xi = the set of inputs to hi • Ei = subset of Xi containing the inputs that will result in erroneous output due to the fault. • E'i = subset of Ei that will escape detection • Di(k) = Probability of missing no errors in hi.
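The formulas for Ci(k) and Di(k) did not survive on these two slides. One natural reading of the set definitions, assuming every input in Xi is equally likely (an assumption made here, not stated on the slides), is:

```latex
C_i(k) = 1 - \frac{|E'_i|}{|E_i|}
\qquad
D_i(k) = 1 - \frac{|E'_i|}{|X_i|}
```

Under this reading, Ci(k) is the fraction of erroneous inputs whose error is caught, while Di(k) is the fraction of all inputs that do not produce an undetected error, so Di(k) ≥ Ci(k) whenever Ei is a proper subset of Xi.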
Choosing the value of k • For some functional units we can derive Ci(k) and Di(k) analytically for each k. • This is too hard in general so we resort to trying out a range of k's empirically to determine Ci(k) and Di(k).
Bus Signal Line • Bus wire stuck at either 0 or 1. • Derived results for a 12-bit bus:
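The empirical approach can be sketched for the stuck-at bus fault by brute force. This is a toy model under stated assumptions, not a reproduction of the slide's derived results: values wrap modulo 2^12 (two's complement), all 12-bit inputs are equally likely, an error escapes when the faulty transformed value equals k times the faulty original value, and C and D are taken as 1 - |E'|/|E| and 1 - |E'|/|X|. The names `stuck`, `c_and_d` and `sweep` are hypothetical.

```python
WIDTH = 12
MASK = (1 << WIDTH) - 1


def stuck(value, bit, level):
    """Return `value` as seen through a bus whose line `bit` is stuck at `level`."""
    if level:
        return value | (1 << bit)
    return value & ~(1 << bit) & MASK


def c_and_d(k, bit, level):
    """Enumerate all inputs and return (C, D) for one stuck-at fault."""
    errors = escapes = 0
    for x in range(1 << WIDTH):
        bad = stuck(x, bit, level)
        if bad == x:
            continue                      # fault does not affect this input
        errors += 1
        bad_k = stuck((k * x) & MASK, bit, level)
        if bad_k == (k * bad) & MASK:     # comparison passes: error escapes
            escapes += 1
    total = 1 << WIDTH
    c = 1 - escapes / errors if errors else 1.0
    d = 1 - escapes / total
    return c, d


def sweep(ks):
    """Worst-case (C, D) over all bus lines and both stuck levels, per k."""
    results = {}
    for k in ks:
        pairs = [c_and_d(k, b, lvl) for b in range(WIDTH) for lvl in (0, 1)]
        results[k] = (min(c for c, _ in pairs), min(d for _, d in pairs))
    return results


if __name__ == "__main__":
    for k, (c, d) in sweep([1, -1, -2, -3]).items():
        print(f"k={k:3d}  worst C={c:.3f}  worst D={d:.3f}")
```

With k = 1 (plain duplication) every error escapes, which is the permanent-fault problem the diversity factor is meant to solve. Taking the worst case over fault sites is a design choice of this sketch; averaging would be equally defensible.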
Adder • Experimental results for a 12-bit ripple carry adder: • Experimental results for a 12-bit carry look-ahead adder:
Multiplier & Divider • Experimental Results for • 12-bit array multiplier • 8-bit Wallace Tree multiplier • SRT divider
Shifter • Experimental Results for 16-bit multiplexer-based shifter:
Using Benchmarks to pick k • Need to determine how much each functional unit is used in the average program. • Add, sub, mult and shift use the obvious functional units. • “memory access” uses the memory bus • “branch” uses a carry-lookahead adder
Benchmarked Data Integrity • Calculated Data Integrity=Di(k) given above usage statistics. (high Di(k) top priority) • Highlighted columns provide the best data integrity for each benchmark.
Benchmarked Detection Probability • Calculated Detection Probability=Ci(k) given above usage statistics. • Highlighted columns provide the best detection probability for each benchmark.
Optimum k • Optimum k selected: • Must maximize the Data Integrity=Di(k). • Given maximum Di(k), maximize Ci(k). • For each program, should get an estimate of how it uses the different functional units and pick k accordingly.
Dealing with Overflow • By multiplying all variables by k, we may cause them to overflow. • Can scale variables up to the next larger type. • Scale down variables by dividing by k. Must only check higher order bits when comparing new results to results of original program. • Can use compile-time range checking to determine vulnerability to overflow and pick k accordingly.
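The range-checking idea can be illustrated with a small sketch. It assumes signed two's complement variables whose value range [lo, hi] is known ahead of time; the helper names `fits`, `safe_factors` and `width_for` and the list of candidate widths are hypothetical.

```python
def fits(value, bits):
    """True if `value` is representable as a signed `bits`-bit integer."""
    return -(1 << (bits - 1)) <= value <= (1 << (bits - 1)) - 1


def safe_factors(lo, hi, bits, candidates):
    """Keep the diversity factors that cannot overflow a variable in [lo, hi].

    Checking the two endpoints is enough because k * x is monotonic in x.
    """
    return [k for k in candidates if fits(k * lo, bits) and fits(k * hi, bits)]


def width_for(lo, hi, k, widths=(8, 16, 32, 64)):
    """Smallest listed width that holds k * x for every x in [lo, hi]."""
    for bits in widths:
        if fits(k * lo, bits) and fits(k * hi, bits):
            return bits
    return None  # no listed type is wide enough


if __name__ == "__main__":
    print(safe_factors(-1000, 1000, 16, [-2, -3, -50]))
    print(width_for(-1000, 1000, -50))
```

The first helper corresponds to picking k according to the ranges, the second to scaling a variable up to a larger type when a preferred k would overflow.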
Floating Point Numbers • The above technique fails for floating point numbers. • IEEE 754 format: • k = -2 will only change the sign bit and some bits in the exponent. • Solution: pick separate k's for the exponent and the mantissa and run the program once with each k. • Overhead = 200%.
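The claim about k = -2 can be checked directly on the bit fields of an IEEE 754 single-precision value. The sketch below uses Python's standard `struct` module; the helper name `fields` and the sample value 0.15625 are chosen here for illustration.

```python
import struct


def fields(x):
    """Split a float32 into (sign, biased exponent, 23-bit fraction)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF


if __name__ == "__main__":
    x = 0.15625                      # exactly representable in float32
    for value in (x, -2 * x):
        sign, exponent, fraction = fields(value)
        print(f"{value:10}: sign={sign} exponent={exponent} fraction={fraction:#08x}")
```

Multiplying a normal number by -2 flips the sign and increments the exponent, but leaves the fraction bits untouched, so a fault confined to the mantissa would corrupt both runs identically.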
Picking k for the mantissa • To find errors in mantissa, pick k to be 3/2. • A stuck-at-1 fault: • In original program, variable x's value corrupted to: • In transformed program,Since However, the mantissa must be <2, so if • the mantissa is right shifted by 1 and normalized.
Transformed variables • So now, the value in transformed program is: • Value in original program is:
Fault Detection in Mantissa • If there is a stuck-at-1 fault • Value in transformed program: • Value in original program * k (for checking):
We can detect Mantissa errors! • Note that the error values for the original and the transformed programs are different! • We actually use k= in order to flip the sign bit for improved detection capability
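A stuck-at-1 fault in the mantissa can be simulated to see the two runs disagree. This is a rough sketch assuming IEEE 754 single precision, k = 3/2, a fault that forces one fraction bit to 1 in every stored value, and exact equality as the comparison; the function names are hypothetical. In general k * x may be rounded in float32, so a real checker might need to tolerate rounding differences.

```python
import struct


def to_bits(x):
    """Round a Python float to float32 and return its 32-bit pattern."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


def from_bits(bits):
    """Interpret a 32-bit pattern as a float32 value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def stuck_at_1(x, bit):
    """Value of x after bit `bit` of its float32 encoding is forced to 1."""
    return from_bits(to_bits(x) | (1 << bit))


def detected(x, bit, k=1.5):
    """True if comparing the two faulty runs reveals the fault."""
    original = stuck_at_1(x, bit)              # faulty original value
    transformed = stuck_at_1(k * x, bit)       # faulty transformed value
    expected = from_bits(to_bits(k * original))
    return transformed != expected


if __name__ == "__main__":
    # Bit 22 is the most significant fraction bit.
    print(detected(1.0, 22))          # diverse run with k = 3/2
    print(detected(1.0, 22, k=1.0))   # plain duplication, no diversity
```

For x = 1.0 the fault turns the original value into 1.5, while the transformed value 1.5 already has that bit set and is unchanged; 1.5 is not 3/2 of 1.5, so the mismatch exposes the fault. With k = 1 both runs are corrupted identically and the comparison passes.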
k for exponents • In order to flip all the bits of the exponent, need to transform program to use k= and k= • If a fault invalidates a bit of the exponent, the fault will be detected by comparing to the exponents of one of the two transformed programs.
Effectiveness for Mantissa • Effectiveness of k= (for IEEE 754 single precision)
Effectiveness for Exponent • Effectiveness of k= (for IEEE 754 single precision)
Summary • ED4I effectively detects Byzantine Failures in numerical applications on uniprocessors. • Purely software solution using Data Diversity. • Detects permanent and temporary faults. • Works with fixed-point and floating point numbers. • Compatible with arithmetic and logical operations (probably with any bitwise logical operation if it can be recast into arithmetic) • High overhead: 100% or 200%.