
Self-calibrating Online Wearout Detection Authors: Jason Blome Shuguang Feng

Self-calibrating Online Wearout Detection. Authors: Jason Blome, Shuguang Feng, Shantanu Gupta, Scott Mahlke. MICRO-40, December 3, 2007.





Presentation Transcript


  1. Self-calibrating Online Wearout Detection. Authors: Jason Blome, Shuguang Feng, Shantanu Gupta, Scott Mahlke. MICRO-40, December 3, 2007.

  2. Motivation
     • “Designing Reliable Systems from Unreliable Components…” - Shekhar Borkar (Intel)
     • Slide annotations: "More failures to come"; "Failures will be wearout induced"
     • Sources cited on slide: [Srinivasan, DSN‘04], [Borkar, MICRO‘05]

  3. Current Approaches
     • Traditional
       • Design margins
       • Burn-in
     • Detection: based on replication of computation
       • TMR (Tandem/HP NonStop servers)
       • DIVA (Bower, MICRO’05)
     • Prediction: utilizes precise analytical models and/or sensors
       • Canary circuits (SentinelSilicon, RidgeTop)
       • RAMP (Srinivasan, UIUC/IBM)
     • Slide annotations: Impractical, Static, Costly

  4. Wearout Mechanisms
     • Many failure mechanisms have been shown to be progressive
       • Hot carrier injection (HCI)
       • Negative Bias Temperature Instability (NBTI)
       • Electromigration (EM)
       • Oxide Breakdown (OBD)

  5. Objective
     • Propose a failure prediction technique that exploits the progressive nature of wearout
     • Monitor impact on path delays
     • Prediction
       • Monitors evolution of wearout
       • Proactive
         • Enables failure avoidance/mitigation
       • Continuous feedback
       • False negatives and positives
     • Detection
       • Identifies existing fault
       • Reactive
         • Enables failure recovery
       • End-of-life feedback
       • False negatives

  6. Oxide Breakdown (OBD)
     • Accumulation of defects leads to a conductive path
     • [Figure: Percolation Model, from Stathis, JAP‘06]

  7. OBD HSPICE Model
     • Post-breakdown leakage modeling
     • Sources cited on slide: [Rodriguez, Stathis, Linder, IRPS ‘03], [BSIM4.6.0, ‘06]

  8. Characterization Testbench
     • 90nm standard cell library
     • [Figure: testbench schematic; labels: tcircuit, tcell]

  9. Impact on Propagation Delay

  10. Delay Profiling Unit (DPU)
     • [Figure: DPU block diagram; labels: input signal, Latency Sampling, uArch Module]

  11. TRIX Analysis
     • Magnitude of divergence between TRIXglobal and TRIXlocal reflects amount of degradation

  12. TRIX Analysis Details
     • Exponential Moving Average (EMA)
     • Triple-smoothed Exponential Moving Average
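The slide's own equations are not present in this transcript. For reference, the textbook definitions can be sketched as follows. The symbols $x_t$ (latency sample), $\alpha$ (smoothing factor) and the stage names $E^{(k)}$ are notation assumed here, not taken from the slide.

```latex
\begin{align}
E^{(1)}_t &= \alpha\, x_t + (1-\alpha)\, E^{(1)}_{t-1}, \qquad 0 < \alpha \le 1 \\
E^{(2)}_t &= \alpha\, E^{(1)}_t + (1-\alpha)\, E^{(2)}_{t-1} \\
E^{(3)}_t &= \alpha\, E^{(2)}_t + (1-\alpha)\, E^{(3)}_{t-1}
\end{align}
```

Here $E^{(1)}$ is a plain EMA of the samples and $E^{(3)}$ is the triple-smoothed EMA. In technical analysis, TRIX usually denotes the rate of change of the triple-smoothed value, $(E^{(3)}_t - E^{(3)}_{t-1}) / E^{(3)}_{t-1}$; which of the two forms the slide uses cannot be recovered from the transcript. A smaller $\alpha$ gives a slower, longer-memory average, which is one plausible way to obtain a "global" versus a "local" trend.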

  13. Noisy Latency Profile
     • [Plot: percent nominal delay (%) versus increasing age]
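A minimal Python sketch of how the TRIX comparison might be applied to such a noisy profile. Everything numeric here is an assumption for illustration: the smoothing factors, the 3% divergence threshold, the synthetic drift model, and the use of the triple-smoothed value itself (rather than its rate of change) are not taken from the slides, and the names `Trix`, `first_alarm` and `synthetic_profile` are hypothetical.

```python
import random


def ema_step(prev, sample, alpha):
    """One exponential-moving-average update."""
    return alpha * sample + (1.0 - alpha) * prev


class Trix:
    """Triple-smoothed EMA: three EMA stages chained in series."""

    def __init__(self, alpha, initial):
        self.alpha = alpha
        self.stages = [initial, initial, initial]

    def update(self, sample):
        value = sample
        for i in range(3):
            self.stages[i] = ema_step(self.stages[i], value, self.alpha)
            value = self.stages[i]
        return value


def first_alarm(latencies, nominal=1.0, alpha_local=0.2,
                alpha_global=0.01, threshold=0.03):
    """Index of the first sample where the local trend exceeds the global
    trend by more than `threshold` (as a fraction), or None.

    All default parameter values are illustrative assumptions.
    """
    trix_local = Trix(alpha_local, nominal)
    trix_global = Trix(alpha_global, nominal)
    for i, sample in enumerate(latencies):
        local = trix_local.update(sample)
        glob = trix_global.update(sample)
        if (local - glob) / glob > threshold:
            return i
    return None


def synthetic_profile(n=4000, onset=2000, slope=0.0002, noise=0.02, seed=0):
    """Assumed toy model: nominal delay of 1.0 with Gaussian noise, plus a
    linear upward drift that starts at sample `onset`."""
    rng = random.Random(seed)
    return [1.0 + slope * max(0, t - onset) + rng.gauss(0.0, noise)
            for t in range(n)]


if __name__ == "__main__":
    alarm = first_alarm(synthetic_profile())
    print("first alarm at sample:", alarm)
```

The fast (local) average follows recent samples closely while the slow (global) one lags, so a sustained upward drift in delay opens a gap between them, whereas zero-mean noise is largely averaged out by both.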

  14. DPU with TRIX Hardware
     • [Figure: block diagram; labels: input signal, Latency Sampling, TRIXl Calculation, TRIXg Calculation, Prediction]

  15. Wearout Detection Unit (WDU)
     • [Figure: block diagram; labels: Latency Sampling, TRIXl Calculation, TRIXg Calculation, Prediction]

  16. Evaluation Framework
     • [Figure: framework diagram with the following blocks]
       • Gate-level Processor Simulator: OR1200 Verilog; Synthesis and Place and Route; 90nm Library; Fully Synthesized, P&R, OR1200 Core
       • Workload Simulator: MediaBench Suite; Timing, Power, and Temperature Simulations
       • Wearout Simulator: HSPICE Simulations; OBD Wearout Model
       • Monte Carlo Simulator

  17. WDU Accuracy

  18. WDU Overhead

  19. WDU Overhead

  20. Long-term Vision
     • Introspective Reliability Management (IRM)
       • Intelligent reliability management directed by on-chip sensor feedback
     • Prospective sensors
       • Delay (WDU)
       • Leakage/Vt
       • Temperature

  21. Introspective Reliability Management

  22. Conclusions
     • Many progressive wearout phenomena impact device-level performance
     • It’s possible to characterize this impact and anticipate failures
     • WDU performance
       • Failure predicted within 20% of end of life (tunable)
       • Area overhead < 3% (hybrid)
     • Low-level sensors can be used to enable intelligent reliability management

  23. Questions?
