MAPLD 2005 A Study of the Impact of Temperature on FPGA-based TMR Designs Karen Tomko Dept. of Electrical and Computer Engineering and Computer Science University of Cincinnati Cincinnati, OH Amr Ahmadain Dept. of Electrical and Computer Engineering and Computer Science University of Cincinnati Cincinnati, OH
Presentation Outline • Overview: The Big Picture • Study Motivations • Study Objectives • Solution Approach • Study Assumptions • Derivation of the Reliability Model • Implementation of the Reliability Model • Conclusions and Future Work • References
Overview: The Big Picture • In current high-performance sub-90 nm technologies, leakage power • is rising dramatically as feature sizes keep shrinking [1] • increases exponentially with junction temperature • is electro-thermally coupled with junction temperature [2] • A rise in junction temperature, in turn, leads to • an exponential increase in leakage power • an exponential reduction in the Mean Time to Failure [3]
Overview: The Big Picture (cont’d.) [Figure: static and dynamic power dissipation versus physical gate length (nm). Data: International Technology Roadmap for Semiconductors (ITRS) 2001, 2002. Courtesy of: “Leakage Current: Moore’s Law Meets Static Power,” Computer, December 2003, IEEE Computer Society [1].]
Study Motivations • FPGA-based Triple Modular Redundancy (TMR) designs result in • a considerable increase in total design area • a reduction in circuit performance • a tripling of the total power dissipation [4] • Tripling the static power dissipation feeds back into • junction temperature • Junction temperature, in turn, feeds back into reliability
Study Motivations (cont’d.) • Recent work has been done to alleviate the cost of a full TMR design • Selective TMR (STMR) technique which applies TMR only to sensitive sub-circuits [5]. • Partial TMR which applies TMR only to sensitive design components [6] • The above work indicates an increased awareness that a full TMR-based design is not always the “perfect solution”
Study Motivations (cont’d.) • Reliability prediction methods for electronic systems such as [7], [8] have traditionally considered • the effect of ONLY steady-state temperature • a constant failure rate during the device’s useful life • varying failure rates during the infant mortality and wear-out phases [Figure: The Bathtub Curve]
Study Motivations (cont’d.) • The assumption of constant failure rate and steady-state temperature may cause some errors in reliability prediction [9] • Failure rate might change even during the useful life of a device • Temperature itself might also vary with time • These errors could lead to a pessimistic prediction instead of a realistic one • There is a need for a reliability prediction model which accurately captures the evolution of the system with time and temperature at each phase of its lifetime to avoid such errors
Study Objectives • Phase I (Current Phase) • Develop a time- and temperature-dependent reliability model for an FPGA-based TMR design as the foundation of the prediction framework • Phase II • Build a reliability prediction framework where the time- and temperature-dependent reliability function is predicted using real data • Evaluate the impact of FPGA-based TMR designs on junction temperature in leakage-dominant technologies and the overall influence on system reliability
Solution Approach • Build a non-stationary (non-homogeneous) Markov chain to model the TMR system states and transitions, so that the assumption of a constant failure rate and steady-state temperature can be relaxed • A non-homogeneous Markov chain is a chain in which the transition probabilities are functions of time [10] • Calculate the TMR system reliability as a function of the failure rate
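A minimal Python sketch of the idea, not the authors' implementation: in a non-homogeneous discrete-time Markov chain, the state distribution at step n is the initial distribution multiplied by a different transition matrix at every step. The function names and the toy two-state failure probability below are illustrative assumptions.

```python
from typing import Callable, List

Matrix = List[List[float]]

def step(dist: List[float], P: Matrix) -> List[float]:
    """One step of a Markov chain: row vector times transition matrix."""
    size = len(dist)
    return [sum(dist[i] * P[i][j] for i in range(size)) for j in range(size)]

def propagate(dist0: List[float], P_of_n: Callable[[int], Matrix],
              n_steps: int) -> List[float]:
    """State distribution after n_steps, using P(0), P(1), ..., P(n_steps - 1)."""
    dist = list(dist0)
    for n in range(n_steps):
        dist = step(dist, P_of_n(n))  # the matrix changes with n
    return dist

def toy_matrix(n: int) -> Matrix:
    """Hypothetical 2-state chain (up, down); failure probability grows with n."""
    p_fail = min(1.0, 0.01 * (n + 1))
    return [[1.0 - p_fail, p_fail],
            [0.0, 1.0]]  # the down state is absorbing (no repair)

dist = propagate([1.0, 0.0], toy_matrix, 10)
print("P(up), P(down) after 10 steps:", dist)
```

With a time-independent `P_of_n` this reduces to the ordinary homogeneous chain, which is the constant-failure-rate case the study sets out to relax.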
Notation • n: discrete time step • t: continuous time unit • α_m: Weibull distribution shape parameter of a TMR module • α_v: Weibull distribution shape parameter of the majority voter • R(n, s(n)): reliability function that depends on both time and temperature, where temperature is itself a function of time • z(n, s(n)): hazard rate function that depends on both time and temperature, where temperature is itself a function of time • s(n): stress (temperature) as a function of time • E_A: activation energy • B (= E_A/K_B): parameter of the Arrhenius relationship associated with the activation energy • C: parameter of the Arrhenius relationship that depends on product geometry, fabrication methods and other factors • p_n^U: one-step transition probability matrix that depends on the time n • U/D: set of Up/Down states
Study Assumptions • The times to failure of the system modules are statistically independent • The majority voter has a different hazard rate (z_v) than that assumed for the TMR modules and hence a different shape parameter (α_v) • Module failure rates are time- and temperature-dependent [Figure: Assumed TMR configuration]
Study Assumptions (contd.) • We use the Arrhenius relationship to model the relationship between life and temperature • The Arrhenius-Weibull distribution is assumed to be the life distribution of the TMR system modules [10], where the Probability Density Function (PDF) is
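The equation on this slide was not recovered. For reference, the Arrhenius-Weibull density in its usual textbook form, written with this talk's symbols (shape α, Arrhenius parameters B and C) for a constant stress s in kelvin, is sketched below. It is an assumption, not something the transcript confirms, that the slide used exactly this parameterization.

```latex
\eta(s) = C\, e^{B/s}, \qquad
f(t, s) = \frac{\alpha}{\eta(s)} \left( \frac{t}{\eta(s)} \right)^{\alpha - 1}
          \exp\!\left[ -\left( \frac{t}{\eta(s)} \right)^{\alpha} \right]
```

Here η(s) is the Weibull scale parameter set by the Arrhenius life-stress relationship. Since B = E_A/K_B is positive, a higher temperature gives a smaller η(s), that is, a shorter characteristic life.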
Derivation of the Reliability Model Step 1: Definition of system states The system defines the following three states: • State ‘0’: System (all modules) functional • State ‘1’: One module failed and two modules functional • State ‘2’: System Failure; two modules failed or voter failed • States ‘0’ and ‘1’ are the Up states and state ‘2’ is the Down state
Derivation of the Reliability Model (contd.) Step 2: Determining state transition probabilities • The hazard function of the Arrhenius-Weibull distribution [11] [Figures: Continuous-time State Transition Diagram; Discrete-time State Transition Diagram]
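The hazard function itself did not survive extraction. Under the assumption of a constant stress s and the standard Arrhenius-Weibull parameterization with scale η(s) = C e^{B/s}, the hazard is the density divided by the reliability:

```latex
R(t, s) = \exp\!\left[ -\left( \frac{t}{\eta(s)} \right)^{\alpha} \right], \qquad
z(t, s) = \frac{f(t, s)}{R(t, s)}
        = \frac{\alpha}{\eta(s)} \left( \frac{t}{\eta(s)} \right)^{\alpha - 1},
\qquad \eta(s) = C\, e^{B/s}
```

For α < 1 the hazard decreases with time, for α = 1 it is constant, and for α > 1 it increases, which is how a single shape parameter can mimic the different phases of the bathtub curve.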
Derivation of the Reliability Model (contd.) • Step 3: Calculation of the reliability function The reliability of a non-homogeneous discrete-time Markov chain (NHDTMC) at time n is expressed as given in [13]. Substituting the state transition probability matrices into that expression and carrying out the necessary matrix multiplications yields the reliability of the system, R(n), at time n
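A hedged sketch of Step 3 in Python. It uses the standard result for a chain whose down state is absorbing: R(n) is the initial distribution over the up states, multiplied by the up-state block of each step's transition matrix, summed over the up states. The transition probabilities inside `up_block` and the discrete Weibull hazard are assumptions made for illustration; the slide's own expressions were not recovered.

```python
from typing import Callable, List

def discrete_weibull_hazard(q: float, alpha: float) -> Callable[[int], float]:
    """Per-step failure probability z(n) = 1 - R(n+1)/R(n), with R(n) = q**(n**alpha)."""
    return lambda n: 1.0 - q ** ((n + 1) ** alpha - n ** alpha)

def up_block(z_m: float, z_v: float) -> List[List[float]]:
    """Up-state block over state 0 (all good) and state 1 (one module failed).

    Assumed rule (not from the slides): modules and voter fail independently
    within a step; a voter failure or a second module failure sends the system
    to the absorbing down state 2.
    """
    ok_v = 1.0 - z_v
    ok_m = 1.0 - z_m
    return [[ok_v * ok_m ** 3, ok_v * 3.0 * z_m * ok_m ** 2],
            [0.0, ok_v * ok_m ** 2]]

def tmr_reliability(n: int, z_module: Callable[[int], float],
                    z_voter: Callable[[int], float]) -> float:
    """R(n): probability of still being in an up state after n steps from state 0."""
    dist = [1.0, 0.0]
    for k in range(n):
        P = up_block(z_module(k), z_voter(k))
        dist = [dist[0] * P[0][0] + dist[1] * P[1][0],
                dist[0] * P[0][1] + dist[1] * P[1][1]]
    return dist[0] + dist[1]

# Hypothetical parameters: q and alpha values are placeholders, not study data.
z_mod = discrete_weibull_hazard(q=0.98, alpha=1.4)
z_vot = discrete_weibull_hazard(q=0.999, alpha=1.0)
for n in (0, 5, 10, 20):
    print(n, tmr_reliability(n, z_mod, z_vot))
```

With a perfect voter and a constant module hazard, this reduces to the familiar TMR expression 3r² − 2r³ with r the module reliability at step n, which is a convenient sanity check.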
Implementation of the Reliability Model • The reliability model has been implemented for three different types of temperature stress • Steady-state temperature stress • Cyclic stress • Progressive stress • The reliability and failure rate functions have been implemented using numerical integration techniques. The Gauss-Kronrod quadrature method has been used [14]
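A minimal sketch of how the three stress profiles and the quadrature could fit together. Several things here are assumptions that do not come from the slides: a cumulative-exposure form of the Arrhenius-Weibull model, R(t) = exp(-(∫₀ᵗ du / (C e^{B/s(u)}))^α); a hand-coded 7/15-point Gauss-Kronrod rule applied over equal panels; and hypothetical values for B, C, the base temperature and the cycle amplitude.

```python
import math

# Gauss-Kronrod (G7, K15) abscissae and weights on [-1, 1], non-negative half.
XGK = [0.991455371120812639, 0.949107912342758525, 0.864864423359769073,
       0.741531185599394440, 0.586087235467691130, 0.405845151377397167,
       0.207784955007898468, 0.0]
WGK = [0.022935322010529225, 0.063092092629978553, 0.104790010322250184,
       0.140653259715525919, 0.169004726639267903, 0.190350578064785410,
       0.204432940075298892, 0.209482141084727828]
WG = [0.129484966168869693, 0.279705391489276668,
      0.381830050505118945, 0.417959183673469388]  # Gauss weights; last is the centre

def gk15(f, a, b):
    """Return (15-point Kronrod estimate, |K15 - G7| error indicator) on [a, b]."""
    c, h = 0.5 * (a + b), 0.5 * (b - a)
    fc = f(c)
    kron, gauss = WGK[7] * fc, WG[3] * fc
    for j in range(7):
        fsum = f(c - h * XGK[j]) + f(c + h * XGK[j])
        kron += WGK[j] * fsum
        if j % 2 == 1:  # abscissae 1, 3, 5 are shared with the 7-point Gauss rule
            gauss += WG[j // 2] * fsum
    return h * kron, abs(h * (kron - gauss))

def integrate(f, a, b, panels=32):
    """Composite G7K15 over equal-width panels."""
    w = (b - a) / panels
    return sum(gk15(f, a + i * w, a + (i + 1) * w)[0] for i in range(panels))

# Stress profiles s(t) in kelvin, t in hours (shapes follow the slide, values assumed).
def steady(T):
    return lambda t: T

def cyclic(T0, amp, period):
    return lambda t: T0 + amp * math.sin(2.0 * math.pi * t / period)

def progressive(T0, slope):
    return lambda t: T0 + slope * t

def reliability(t, stress, alpha, B, C):
    """Assumed cumulative-exposure Arrhenius-Weibull reliability at time t."""
    damage = integrate(lambda u: 1.0 / (C * math.exp(B / stress(u))), 0.0, t)
    return math.exp(-damage ** alpha)

B, C = 8000.0, 5e-9  # hypothetical Arrhenius parameters, not from the study
for name, s in [("steady", steady(373.0)),
                ("cyclic", cyclic(373.0, 25.0, math.pi)),
                ("progressive", progressive(373.0, 0.5))]:
    print(name, reliability(5.0, s, 1.4, B, C))
```

For a steady stress the integral is just t/η(s), so the sketch collapses to the closed-form Weibull reliability; the quadrature only matters for the cyclic and progressive profiles.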
Implementation of the Reliability Model (contd.) • Experimental Setup • Experiments have been designed around varying two sets of parameters • Stress test-related parameters • Probability distribution-related parameters. In this study, this parameter is the Weibull distribution shape parameter, α • Values of the stress test-related parameters have been chosen to span minimum, typical and maximum operational stress levels • α takes the following values: 0.8, 1.0, 1.4, 2.0
Results: Steady-State Temperature Stress [Figure: four plots of reliability versus time (hours). Two plots fix α (α = 2.0 and α = 0.8) and show curves for T = 328 K, 373 K and 423 K; two plots fix the temperature (423 K and 328 K) and show curves for α = 0.8, 1.0, 1.4 and 2.0.]
Results: Cyclic Stress [Figure: four plots of reliability versus time (hours). Two plots fix α (α = 2.0 and α = 0.8) and show curves for Period = π, 2π and 4π; two plots fix the period (π and 4π) and show curves for α = 0.8, 1.0, 1.4 and 2.0.]
Results: Progressive Stress [Figure: four plots of reliability versus time (hours). Two plots fix α (α = 2.0 and α = 0.8) and show curves for Slope = 0.25, 0.5 and 1.0; two plots fix the slope (1.0 and 0.25) and show curves for α = 0.8, 1.0, 1.4 and 2.0.]
Results: Reliability vs. Type of Stress Test [Figure: four plots of reliability versus time (hours), one per shape parameter, each comparing a steady-state temperature curve, a cyclic-stress curve and a progressive-stress curve. α = 0.8: T = 423 K, Period = π, Slope = 1.0. α = 1.0: T = 328 K, Period = 4π, Slope = 0.25. α = 1.4: T = 373 K, Period = 2π, Slope = 0.5. α = 2.0: T = 423 K, Period = π, Slope = 1.0.]
Conclusions • Reliability varies greatly as a result of applying different types of stress tests • The assumption of a constant failure rate and steady-state temperature may lead to errors in reliability prediction and to overly conservative decisions • The value of the Weibull distribution shape parameter (α) has a negligible effect on system reliability for all types of stress tests • The stress test-related parameters (temperature, period and slope) have a visible impact on system reliability for all types of stress tests
Future Work • Validate the proposed reliability model by performing detailed simulations of the physics of specific failure mechanisms and evaluate the overall impact on the system reliability • Model the FPGA partial reconfiguration process using a Markov chain with repair • Evaluate the combined effects of temperature overstress and radiation at both the system and transistor level
Backup Slides: Derivation of the Reliability Model • Approximate the continuous-time Markov chain (CTMC) by a discrete-time Markov chain (DTMC). For this step, the two-step technique given in [12] is used: • Convert the CTMC to a DTMC • Approximate the continuous-time hazard function by a discrete-time hazard function
Backup Slides: Derivation of the Reliability Model [Figure: Discrete-time State Transition Diagram] • For brevity, A(n, s(n)), B(n, s(n)), and C(n, s(n)) are written as A(n), B(n), and C(n), where A(n), B(n), and C(n) are given in terms of the hazard function as follows
Backup Slides: Derivation of the Reliability Model • To approximate the continuous-time hazard function by a discrete-time hazard function, we use the Probability Mass Function (PMF) of the discrete Weibull distribution as given in [12] • The PMF can equivalently be written as f(n, s(n)) = R(n) – R(n + 1), where R(n) is the reliability function
Backup Slides: Derivation of the Reliability Model • By substituting the reliability function into the PMF, we get the discrete-time PMF, where the q in the PMF is now given by
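The backup-slide equations were not recovered. As a hedged reconstruction for the constant-stress case only, with a unit time step and the usual discrete Weibull form R(n) = q^{n^α}, the relation f(n) = R(n) − R(n + 1) gives the following. The time-varying-stress version that the study actually uses is not reconstructed here.

```latex
\begin{aligned}
R(n) &= q^{\,n^{\alpha}}, \\
f(n) &= R(n) - R(n+1) = q^{\,n^{\alpha}} - q^{\,(n+1)^{\alpha}}, \\
z(n) &= \frac{f(n)}{R(n)} = 1 - q^{\,(n+1)^{\alpha} - n^{\alpha}}, \\
q &= \exp\!\left[ -\left( \frac{1}{C\, e^{B/s}} \right)^{\alpha} \right]
\end{aligned}
```

The last line follows from matching R(n) to the continuous Arrhenius-Weibull reliability exp[−(t / (C e^{B/s}))^α] at t = n. For α = 1 the discrete hazard z(n) = 1 − q is constant, which recovers the geometric (constant failure rate) case.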
Backup Slides: Derivation of the Reliability Model • Build the state transition matrix using the approximated discrete-time state transition probabilities The transition probability matrix of the NHDTMC, following [13], is given as
References • [1] N.S. Kim, T. Austin, D. Blaauw, T. Mudge, K. Flautner, J. Hu, M.J. Irwin, M. Kandemir, and V. Narayanan, “Leakage Current: Moore’s Law Meets Static Power,” Computer, Vol. 36(12), Dec. 2003, pp. 68-75. • [2] K. Banerjee, S. Lin, A. Keshavarzi, S. Narendra, and V. De, “A Self-Consistent Junction Temperature Estimation Methodology for Nanometer Scale ICs with Implications for Performance and Thermal Management,” Technical Digest of the IEEE International Electron Devices Meeting (IEDM’03), 2003, pp. 36.7.1-36.7.4. • [3] P. Lall, M.G. Pecht, and E.B. Hakim, “Influence of Temperature on Microelectronics and System Reliability,” CRC Press LLC, 1997. • [4] N. Rollins, M.J. Wirthlin, and P.S. Graham, “Evaluation of Power Costs in Applying TMR to FPGA Designs,” Proceedings of the 7th Annual Military and Aerospace Programmable Logic Devices International Conference (MAPLD), Sept. 2004.
References (contd.) • [5] P.K. Samudrala, J. Ramos, and S. Katkoori, “Selective Triple Modular Redundancy for SEU Mitigation on FPGAs,” Proceedings of the 6th Annual Military and Aerospace Programmable Logic Devices International Conference (MAPLD), Sept. 2003. • [6] B. Pratt, D.E. Johnson, M.J. Wirthlin, M. Caffrey, K. Morgan, and P. Graham, “Improving FPGA Design Robustness with Partial TMR,” Proceedings of the 8th Annual Military and Aerospace Programmable Logic Devices International Conference (MAPLD), Sept. 2005. To be published. • [7] U.S. Department of Defense, Reliability Prediction of Electronic Equipment, MIL-HDBK-217F, Washington, D.C., 1991. • [8] Siemens, SN29500 Reliability and Quality Specification Failure Rates of Components, 1986. • [9] A. Mettas and P. Vassiliou, “Modeling and Analysis of Time-Dependent Stress Accelerated Life Data,” Proceedings of the 2002 Annual Reliability and Maintainability Symposium (RAMS), Jan. 2002, pp. 343-348.
References (contd.) • [10] W. Nelson, “Accelerated Testing: Statistical Models, Test Plans and Data Analyses,” John Wiley & Sons, 1990. • [11] ReliaSoft, “Accelerated Life Testing Reference,” [Online book], 2001. Available at http://www.weibull.com/acceltestwebcontents.htm • [12] D.P. Siewiorek and R.S. Swarz, “Reliable Computer Systems,” Digital Press, 1992. • [13] A. Platis, N. Limnios, and M.L. Du, “Hitting Time in a Finite Non-Homogeneous Markov Chain with Applications,” Journal of Applied Stochastic Models and Data Analysis, Vol. 14(3), 1998, pp. 241-253. • [14] Wolfram Research, “Gauss-Kronrod Quadrature,” [Online document]. Available at http://mathworld.wolfram.com/Gauss-KronrodQuadrature.html