
Spring 2008 CSE 591 Compilers for Embedded Systems

Spring 2008 CSE 591 Compilers for Embedded Systems. Aviral Shrivastava Department of Computer Science and Engineering Arizona State University. Lecture 3: Soft Errors. Models and Techniques. Outline. Soft Errors Recap Process Technology and Packaging Solutions


Presentation Transcript


  1. Spring 2008 CSE 591 Compilers for Embedded Systems Aviral Shrivastava Department of Computer Science and Engineering Arizona State University

  2. Lecture 3: Soft Errors Models and Techniques

  3. Outline • Soft Errors Recap • Process Technology and Packaging Solutions • Gate-level and Circuit-level Solutions • Microarchitectural Solutions • Single-core • Multi-threaded • Software Solutions • Multi Bit Upsets (MBUs) • Single Event Latchup

  4. Phenomenon of Soft Error • Transient Faults • Random and spontaneous bit-changes in system • Can be caused by • Circuit noise • Cross-talk • More than 50% due to radiation strike

  5. Metrics • FIT: Failure in Time • No. of failures in 1 billion hours of operation • MTTF: Mean Time To Failure • 1000 FITs => MTTF of 114 years • 1 GByte of RAM @ 500 FIT/Mbit can expect an error every two weeks • ECC reduces failure rate by 2 orders of magnitude • hypothetical Terabyte system would experience a soft error every few minutes
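The FIT and MTTF figures on this slide can be reproduced with a few lines of arithmetic. The sketch below is only a check of the numbers; the function names are made up for illustration, and it assumes 1 GByte = 8192 Mbit and a 365-day year.

```python
# Rough check of the FIT / MTTF numbers quoted on the Metrics slide.
# Assumptions (not from the slides): 1 GByte = 8 * 1024 Mbit, 365-day year.

HOURS_PER_YEAR = 24 * 365
BILLION_HOURS = 1e9  # FIT counts failures per 10^9 device-hours


def mttf_hours(fit):
    """Mean time to failure, in hours, for a given FIT rate."""
    return BILLION_HOURS / fit


def memory_fit(size_mbit, fit_per_mbit):
    """Total FIT of a memory, assuming per-Mbit failure rates simply add."""
    return size_mbit * fit_per_mbit


# 1000 FIT corresponds to an MTTF of about 114 years.
years_at_1000_fit = mttf_hours(1000) / HOURS_PER_YEAR

# 1 GByte of RAM at 500 FIT/Mbit.
gbyte_fit = memory_fit(8 * 1024, 500)
gbyte_days = mttf_hours(gbyte_fit) / 24

# A hypothetical 1 TByte system at the same per-Mbit rate.
tbyte_fit = memory_fit(8 * 1024 * 1024, 500)
tbyte_minutes = mttf_hours(tbyte_fit) * 60

print(f"1000 FIT      -> MTTF of {years_at_1000_fit:.0f} years")
print(f"1 GByte RAM   -> one error every {gbyte_days:.1f} days")
print(f"1 TByte system -> one error every {tbyte_minutes:.1f} minutes")
```

Under these assumptions the 1 GByte case comes out at roughly ten days and the terabyte case at roughly a quarter of an hour, the same order of magnitude as the rounded figures on the slide.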

  6. Trends • DRAM • System error rate of DRAMs is fairly constant • SRAM • Increasing exponentially • Logic • Increasing exponentially

  7. Masking Effects • Logic Masking • Occurs when particle strikes a portion of combinational logic that is blocked from affecting the output due to a subsequent gate whose result is completely determined by its other input values • Electrical Masking • Occurs when the pulse resulting from a particle strike is attenuated by subsequent logic gates, and does not affect the result of the circuit • Latching Window Masking • Occurs when the pulse resulting from a particle strike reaches a latch, but not at the clock transition where the latch captures its input values • Microarchitectural Masking • Occurs when the incorrect value in the latch is ignored in evaluation of a program variable • Software Masking • Occurs when an incorrect value of a variable is ignored by the software while computing the outputs
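Logic masking can be illustrated with a two-gate circuit. This is a toy sketch, not a circuit from the lecture: an OR gate feeds an AND gate, and a hypothetical particle strike flips the OR gate's output. Whenever the AND gate's other input is 0, the flip cannot reach the circuit output.

```python
# Toy illustration of logic masking (the circuit and names are assumptions).
from itertools import product


def circuit(a, b, c, flip_internal=False):
    """out = (a OR b) AND c, with an optional injected flip on the OR output."""
    n = a | b
    if flip_internal:
        n ^= 1  # model a transient fault on the internal node
    return n & c


def is_masked(a, b, c):
    """True when a flip of the internal node does not change the output."""
    return circuit(a, b, c) == circuit(a, b, c, flip_internal=True)


for a, b, c in product((0, 1), repeat=3):
    print(a, b, c, "masked" if is_masked(a, b, c) else "propagates")
```

With c = 0 the AND gate's result is completely determined by c, so every injected flip is masked; with c = 1 every flip propagates.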

  8. Faults, Errors, Failures (“Fault Tolerant Computer Systems”, by Pradhan) • Fault • Defect in hardware or software component • defect for cosmic ray = upset from high-energy neutron strike • Error • manifestation of a fault, resulting in deviation from accuracy • faults cause errors (but not vice versa) • a masked fault is not an error! • vulnerability factor = fraction of faults that cause errors • Failure • non-performance of expected action • errors cause failures (but not vice versa) • a corrected error doesn’t cause a failure

  9. Fault Tolerance in Microprocessors • Information Redundancy • Protecting data words with information coding • Parity or Hamming codes • ECC codes mainly in memory arrays • Cost is extra/additional storage for coding overhead, and checking logic • Space Redundancy • Carrying out the same computation on multiple independent hardware at the same time • Errors are exposed by checking the independent results • Cause large hardware overhead • Good for permanent faults • Time Redundancy • Execute the same computation on the same hardware at different times

  10. The Soft Error Opportunity • Key differences with classical fault tolerance • FIT budget 100x – 1000x more than Tandem-style machines • Traditional “big hammer” solutions too expensive for volume market & can be an overkill • Why architecture plays a critical role? • error often defined in architecture & microarchitecture • e.g., strike on a branch predictor doesn’t cause an error • architectural solutions are often more cost-effective • one bit of parity can protect 64 bits, overhead < 2% • radiation-hardened cells can have overhead around 20-40%
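The parity overhead quoted above is simply 1 extra bit per 64 data bits, about 1.6%. A minimal sketch of even parity over a 64-bit word follows; the function names are illustrative. It also shows the known limit of a single parity bit: an odd number of flipped bits is detected, an even number is not.

```python
# Minimal even-parity sketch for a 64-bit word (names are illustrative).

WIDTH = 64
MASK = (1 << WIDTH) - 1


def parity_bit(word):
    """Even-parity bit: 1 if the word has an odd number of 1 bits."""
    return bin(word & MASK).count("1") & 1


def parity_ok(word, stored_parity):
    """Check a word read back from storage against its stored parity bit."""
    return parity_bit(word) == stored_parity


overhead = 1 / WIDTH  # one parity bit per 64 data bits

word = 0x0123456789ABCDEF
p = parity_bit(word)

single_flip = word ^ (1 << 17)  # one upset bit: detected
double_flip = word ^ 0b11       # two upset bits: escapes detection

print(f"overhead = {overhead:.2%}")
print("clean word passes:   ", parity_ok(word, p))
print("single flip detected:", not parity_ok(single_flip, p))
print("double flip detected:", not parity_ok(double_flip, p))
```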

  11. Outline • Soft Errors Recap • Process Technology and Packaging Solutions • Gate-level and Circuit-level Solutions • Microarchitectural Solutions • Single-core • Multi-threaded • Software Solutions • Multi Bit Upsets (MBUs) • Single Event Latchup

  12. Processing and Packaging Solutions • Reduce the number of particles that strike • Reduce upsets • Use of highly purified fabrication materials • Remove traces of boron and heavy metals • Surround by metallic frame • Reduce low-energy particles • But neutrons can pass through > 10 ft of concrete • Process Technology Solutions • Partially depleted SOI: no help after 250 nm • Fully depleted SOI: very expensive

  13. Transistor Level Techniques • Normally CMOS inverter is scaled with 2:1 ratio between p- and n-channel devices • To compensate for electron and hole mobilities • Changing this ratio can increase the tolerance

  14. Gate-Level Techniques • Some gates are more vulnerable than others • Radiation hardened designs use NAND gates • When all inputs are low, drive of p-stack is low, high leakage of n-transistors → rise in the output slow → functional failure • Gate vulnerability may change by 5X depending on the state • NAND gate • Extremely vulnerable when inputs 10 • Not vulnerable when inputs 00 • How to synthesize to minimize vulnerability

  15. Circuit-Level Techniques • Adding resistance introduces additional time constants that filter out the very fast SEU-induced transients • High temperature coefficients of poly-silicon resistors • Difficult to control variation of resistance

  16. Outline • Soft Errors Recap • Process Technology and Packaging Solutions • Gate-level and Circuit-level Solutions • Microarchitectural Solutions • Single-core • Multi-threaded • Software Solutions • Multi Bit Upsets (MBUs) • Single Event Latchup

  17. Architectural Vulnerability Factor • AVF: Probability that a fault in a particular structure will result in system failure • AVF of branch predictor = 0% • AVF of PC = 100% • ACE-bit: “Architectural bits” that must be correct for “Correct Execution” • Count number of ACE-bits in a structure • Identifying Un-ACE bits • Microarchitectural Un-ACE bits: Cannot influence correct instruction execution • Idle or Invalid state, e.g., inputs to un-chosen paths of mux • Mis-speculated state, e.g., wrong path instruction • Predictor structures, e.g., branch predictor • Ex-ACE state, e.g., registers • Architectural Un-ACE bits: Affect correct path execution, but do not change the output • NOP-instructions • Prefetch instructions • Predicated false instructions • Dynamically dead instructions, FDD, TDD • Computing AVF from a Performance Model • Gather the number of ACE-bits in each cycle
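The last step on this slide, gathering ACE-bit counts per cycle, can be turned into an AVF estimate. The sketch below assumes the AVF of a structure is the average fraction of its bits that are ACE over the simulated cycles; the function name and the per-cycle counts are made up for illustration, not taken from a real performance model.

```python
# Sketch: AVF of one structure from per-cycle ACE-bit counts.
# Assumption: AVF = (sum of ACE bits over all cycles) / (total bits * cycles).


def avf(ace_bits_per_cycle, total_bits):
    """Average fraction of the structure's bits that are ACE per cycle."""
    cycles = len(ace_bits_per_cycle)
    if cycles == 0 or total_bits <= 0:
        raise ValueError("need at least one cycle and a positive bit count")
    if any(n < 0 or n > total_bits for n in ace_bits_per_cycle):
        raise ValueError("ACE-bit count outside the structure's size")
    return sum(ace_bits_per_cycle) / (total_bits * cycles)


# Hypothetical 64-bit structure observed for four cycles.
trace = [64, 32, 0, 32]
print(f"AVF = {avf(trace, 64):.0%}")

# The two extremes quoted on the slide, for a hypothetical 32-bit structure:
# a predictor never holds ACE bits, while the PC always does.
print(f"predictor-like AVF = {avf([0, 0, 0, 0], 32):.0%}")
print(f"PC-like AVF        = {avf([32, 32, 32, 32], 32):.0%}")
```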

  18. Vulnerability Contributions • DCache - largest contributor to vulnerability • Data + tags • ICache: Close second • Instructions only • Tags are (almost) not vulnerable • Register File, Pipeline • Rate of errors may be higher in Pipeline and RF • Compute Cache and Register File Vulnerability

  19. Vulnerability Variations • System vulnerability changes with time • How can you use this information?

  20. D-Cache: Flushing • 4x reduction in vulnerability

  21. D-Cache: Write Policy • 10x reduction in vulnerability

  22. D-Cache: Refresh • 3x reduction in vulnerability using write-thru (30x total)

  23. DIVA Microarchitecture • [Figure: pipeline blocks BPred, I-$, Dec/Ren, IQ, Rename Regs, ALU, D-$, Arch Regs] • Example instruction: LR3 + LR7 → LR15, with operand values 4 and 8 and result 12 • Storage Check • Read LR3 and LR7 from Arch Regs and confirm they equal 4 and 8 • ALU Check • Add 4+8 and confirm it equals 12 • If both checks succeed, write 12 into LR15
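The two checks in the DIVA example can be sketched as a small commit-time function. This is a simplified model under stated assumptions, not the actual DIVA design: it handles only an add instruction, represents the architected register file as a dictionary, and the function name and argument layout are hypothetical.

```python
# Simplified sketch of a DIVA-style commit check for one add instruction.
# Assumptions: dictionary register file, add-only ALU, illustrative names.


def diva_check(arch_regs, src1, src2, dst, op1_val, op2_val, result):
    """Verify the primary core's work, then commit if both checks pass."""
    # Storage check: the operand values the primary core used must match
    # the architected register file.
    storage_ok = arch_regs[src1] == op1_val and arch_regs[src2] == op2_val
    # ALU check: recompute the operation and compare with the core's result.
    alu_ok = op1_val + op2_val == result
    if storage_ok and alu_ok:
        arch_regs[dst] = result  # commit to architected state
        return True
    return False  # error detected; architected state is left untouched


regs = {"LR3": 4, "LR7": 8, "LR15": 0}

# The example from the slide: LR3 + LR7 -> LR15 with values 4, 8, 12.
ok = diva_check(regs, "LR3", "LR7", "LR15", 4, 8, 12)
print("correct result committed:", ok, regs["LR15"])

# A faulty ALU result (13 instead of 12) is rejected.
regs_bad = {"LR3": 4, "LR7": 8, "LR15": 0}
bad = diva_check(regs_bad, "LR3", "LR7", "LR15", 4, 8, 13)
print("faulty result committed: ", bad, regs_bad["LR15"])
```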

  24. Microarchitecture Details • Instructions are fed to checker in order during commit • The logic and storage checks detect errors in ALUs and datapath • The checker core is a simple in-order pipeline, easy to design and verify • An error in an earlier stage (LR3 instead of LR2) can be detected by also adding a ren/decode stage to the checker • In-order core has no stalls (need bypass for register file): no data dependences, cache misses, branch mispredicts • Contention for register file and data cache can degrade primary thread

  25. Recovery • The architected register file and data cache are ECC protected; when an error is detected, it is assumed that checker and architected state are correct • Primary core is re-started from faulting instruction • A fault in the primary core may result in deadlock: e.g. instruction that produces R5 is waiting for R5 to be produced (instead of R4) • A timeout in the checker signals an error

  26. PPC (Partially Protected Caches) • 2 Caches at the same level of memory hierarchy • Main Cache, and the protected mini-cache • Mini-cache • low power, low latency • Timing slack to harden it • Compiler maps data to the two caches • Map Failure-Critical (FC) data to the protected mini-cache • Map Not Failure-Critical (FNC) data to unprotected main cache • Intuition is to provide protection to only the FC data • In multimedia applications, the multimedia data is NOT failure critical • An error → Loss in Quality of Service • How to use PPCs for general applications? • [Figure labels: Processor Pipeline, HPC, PPC, Unprotected Main Cache, Protected Mini Cache, Memory Controller, Memory, Page Mapping (FNC, FC)]

  27. Razor • Originally proposed to tolerate process variations • Shadow latch clocked with a delayed clock • If difference in values latched, raise error • How to use it to detect soft errors?
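The shadow-latch comparison can be modeled in a few lines. This is a behavioral sketch only, with made-up timing values and names: a signal is sampled once at the clock edge (main flip-flop) and once after a delay (shadow latch), and a mismatch raises the error flag. It models the original timing-error use and does not attempt to answer the soft-error question posed on the slide.

```python
# Behavioral sketch of a Razor-style flip-flop (timing values are arbitrary).


def razor_sample(signal, t_clk, delay):
    """Sample at the clock edge and again at the delayed clock, then compare."""
    main = signal(t_clk)            # value captured by the main flip-flop
    shadow = signal(t_clk + delay)  # value captured by the shadow latch
    return main, shadow, main != shadow


def on_time_signal(t):
    """Data that settles to 1 before the clock edge at t = 1.0."""
    return 1 if t >= 0.8 else 0


def late_signal(t):
    """Data that settles to 1 only after the clock edge at t = 1.0."""
    return 1 if t >= 1.2 else 0


print("on time:", razor_sample(on_time_signal, t_clk=1.0, delay=0.5))
print("late:   ", razor_sample(late_signal, t_clk=1.0, delay=0.5))
```

In the late case the main flip-flop captures the stale 0 while the shadow latch captures the settled 1, so the two values differ and the error flag is raised.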
