A Microarchitectural Analysis of Soft Error Propagation in a Production-Level Embedded Microprocessor
Jason Blome, Scott Mahlke, Daryl Bradley*, Krisztián Flautner*
Advanced Computer Architecture Lab, University of Michigan
*ARM Ltd.

Embedded Everywhere
• Not just cellphones
• Safety critical applications:
  • Automotive
  • Healthcare
(Patterson and Hennessy 2005)

Embedded Domain Constraints
• Power efficient performance
• Longer clock cycle times
• Increased logic depth between stages
• Higher area ratio of combinational logic to state elements
• Less speculative state
• Potentially less masking
• Limited real estate
All of these high-level constraints affect the behavior of faults and the potential of fault tolerance techniques.

Objectives
• Understand the effects of transient faults on a typical embedded design
• Architectural contributions to soft error effects
• Production-grade core
• Reference synthesis flow
• Design for test methodologies
• Simulate faults in both combinational and sequential logic

Soft Error Rate Contributions
[Figure: soft error rate contributions (Mitra 2005; Shivakumar 2002)]
Increasing contribution of faults in combinational logic to the overall soft error rate

Processor Model
• ARM926EJ-S
• Cell library characterized for 130 nm
• 5 ns clock cycle time
[Block diagram: ARM926EJ-S. Labels: Instruction Fetch, Instruction Decode, Instruction Address Logic, Instruction cache, MMU, Register Bank, Mux Array, Shift, Multiply, ALU, Data Address Logic, Data Interface, Data cache, MMU, Write Buffer/Bus Interface, Bus Interface]

Analysis Infrastructure
[Diagram: fault injection/error analysis framework. Labels: testbench, benchmark, reference design, test design, fault injection scheduler, error checking and logging, report generation]

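The framework above can be illustrated with a minimal sketch, assuming a lock-step comparison between a reference design and a test design that receives the injected fault. The two-register toy design and the names `step`, `run_trial` and `campaign` are hypothetical stand-ins, not the actual infrastructure used on the ARM926EJ-S.

```python
import random

def step(state):
    """One clock cycle of a toy design (hypothetical stand-in for the real core)."""
    acc, cnt = state
    return ((acc + cnt) & 0xFF, (cnt + 1) & 0xFF)

def run_trial(cycles, fault_cycle, fault_bit):
    """Run reference and test designs in lock step; flip one bit of acc in the test design."""
    ref = test = (0, 1)
    log = []
    for cycle in range(cycles):
        if cycle == fault_cycle:
            # Fault injection: single bit flip in the test design only.
            test = (test[0] ^ (1 << fault_bit), test[1])
        if ref != test:
            # Error checking and logging: record every cycle where state differs.
            log.append((cycle, ref, test))
        ref, test = step(ref), step(test)
    return log

def campaign(trials, cycles=20, seed=0):
    """Fault injection scheduler: pick random (cycle, bit) pairs and run one trial each."""
    rng = random.Random(seed)
    schedule = [(rng.randrange(cycles), rng.randrange(8)) for _ in range(trials)]
    return [(fault, run_trial(cycles, *fault)) for fault in schedule]
```

Calling `campaign(4)` returns four `(fault, log)` pairs; a report generator would aggregate these logs into per-cycle error counts.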
Fault Masking
• Logical: faulted value does not affect logical operation of the circuit
• Architectural/Software: incorrect state is overwritten before it is read
• Latching-Window: the fault pulse does not reach a state element within the latching window
• Electrical: the fault pulse is electrically attenuated by subsequent gates in the circuit
[Timing diagram: CLK with the latching window marked by t_setup and t_hold]

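Two of the masking effects above can be sketched in a few lines. This is an illustrative model only: the three-input circuit, the function names and the timing values in the usage note are assumptions, and any overlap between the pulse and the latching window is treated as sufficient for the fault to be captured.

```python
def circuit(a, b, c, flip_and=False):
    """y = (a AND b) OR c, with an optional transient flip on the AND gate output."""
    n = a & b
    if flip_and:
        n ^= 1  # transient fault on the internal node
    return n | c

def logically_masked(a, b, c):
    """True if flipping the AND output leaves the circuit output unchanged."""
    return circuit(a, b, c) == circuit(a, b, c, flip_and=True)

def latched(pulse_start, pulse_width, clk_edge, t_setup, t_hold):
    """True if the pulse overlaps the window [clk_edge - t_setup, clk_edge + t_hold]."""
    pulse_end = pulse_start + pulse_width
    return pulse_start < clk_edge + t_hold and pulse_end > clk_edge - t_setup
```

With `c = 1` the OR gate hides the flipped node, so `logically_masked(1, 1, 1)` is true while `logically_masked(1, 1, 0)` is false. With a clock edge at 5 ns (the cycle time quoted for the processor model) and assumed `t_setup = 0.1` and `t_hold = 0.05`, a 0.2 ns pulse starting at 4.9 ns is latched, while the same pulse starting at 1.0 ns is masked.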
Observed Error Rates
[Figure: two panels, Faults Occurring in Registers and Faults Occurring in Combinational Logic; data labels recovered from the chart: 94%, 7%, 16%, 4%]
At the software interface, error rates within 3%

Observed Error Rates
[Figure: two panels, Faults Occurring in Registers and Faults Occurring in Combinational Logic]
Faults in combinational logic have a much more dramatic effect on system state

Architectural Errors per Cycle
[Figure: two panels, Faults Occurring in Registers and Faults Occurring in Combinational Logic]

Architectural Corruption Characteristics
[Figure: two panels, Bits per Architectural Register Corrupted and Number of Architectural Registers Corrupted]

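The two metrics named on this slide can be computed by comparing a fault-free register file against a faulted one. The sketch below is a minimal illustration; the function name, the register names and the values are hypothetical, not data from the study.

```python
def corruption_profile(reference, faulty):
    """Map each corrupted register to its number of differing bits (XOR, then popcount)."""
    return {
        name: bin(reference[name] ^ faulty[name]).count("1")
        for name in reference
        if reference[name] != faulty[name]
    }

# Hypothetical register contents: r0 has bits 0-6 flipped, r2 has bit 0 flipped.
reference = {"r0": 0x00, "r1": 0xFF, "r2": 0x10}
faulty = {"r0": 0x7F, "r1": 0xFF, "r2": 0x11}

profile = corruption_profile(reference, faulty)
registers_corrupted = len(profile)        # number of architectural registers corrupted
bits_per_register = sorted(profile.values())  # bits corrupted per architectural register
```

Here two registers are corrupted, with one and seven flipped bits respectively.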
Results Summary
• Faults occurring in logic:
  • Will likely be much more frequent in embedded designs
  • Tend to have a more dramatic effect on system state
  • Multi-bit/multi-register architectural errors common
• Design for test methodologies can greatly impact soft error characteristics
• Error rates at the software interface consistent with those observed in high-performance microprocessors

Traditional Error Detection/Protection
• Reliable Encoding
  • ECC/Parity
  • Limited use for faults in logic
  • Unclear where/how much to protect
• Redundant Computation
  • In space
    • Area/energy overhead
  • In time
    • Energy overhead
    • Requires performance slack

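One reason simple encodings are of limited use against multi-bit errors can be shown with a single even-parity bit. This is a generic sketch of parity checking, not a scheme taken from the slides; the function names and the example words are illustrative.

```python
def parity(word):
    """Even-parity bit of an integer word: 1 if the number of set bits is odd."""
    return bin(word).count("1") & 1

def parity_detects(word, error_mask):
    """True if a single parity bit flags the corruption described by error_mask."""
    return parity(word) != parity(word ^ error_mask)
```

A single flipped bit is caught (`parity_detects(0b1011, 0b0001)` is true), but any error that flips an even number of bits goes unnoticed (`parity_detects(0b1011, 0b0011)` is false).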
Case Study I: IRoute
Cycle 1: 51 Errors
  instr_reg_ID[0, 16, 22, 31]
  ID_decode_info[0, 16, 31]
  stored_instr[29, 30]
Cycle 2: 51 Errors
  instr_reg_EX[0, 16, 22, 31]
  EX_decode_info[0, 16, 31]
Cycle 3: 17 Errors
  ALU_out[0, 1, 2, 3, 4, 5, 6]
Cycle 4: 18 Errors
  ALU_result_wb[0, 1, 2, 3, 4, 5, 6]
Cycle 5: 29 Errors
  Reg0_reg[0, 1, 2, 3, 4, 5, 6]
[Block diagram: ARM926EJ-S, same labels as on the Processor Model slide]

Case Study II: IPipe
Cycle 1: 9 Errors
  instr_reg_ID[3, 12, 17, 18, 24, 26, 29, 30, 31]
Cycle 2: 62 Errors
  instr_reg_EX
  shifter_data_opEx_reg
  Shifter_data_reg
  alu_cc_reg
Cycle 3: 49 Errors
  Shifter_data_EX
  alu_out_reg
Cycle 4: 183 Errors
  writeback and forwarding state
  register bank
[Block diagram: ARM926EJ-S, same labels as on the Processor Model slide]

Fault Characteristics
• Case Study I: uCORE.uIRoute.U600
  • First cycle error sites: 51 errors
    • uIRoute.INSTRHeld_reg[0]
    • uIRoute.INSTRHeld_reg[16]
    • uIRoute.INSTRHeld_reg[22]
    • uIRoute.INSTRHeld_reg[31]
    • u9EJ.uARM9.uCORECTL.uIPIPE.IDarmDeint_reg[0]
    • u9EJ.uARM9.uCORECTL.uIPIPE.IDarmDeint_reg[16]
    • u9EJ.uARM9.uCORECTL.uIPIPE.IDarmDeint_reg[31]
    • u9EJ.uARM9.uCORECTL.uIPIPE.StoredInstrInt_reg[29]
    • u9EJ.uARM9.uCORECTL.uIPIPE.StoredInstrInt_reg[30]
• Case Study II: uCORE.u9EJ.uARM9.uCORECTL.uIPIPE.U3626
  • First cycle error sites: 9 errors
    • u9EJ.uARM9.uCORECTL.uIPIPE.IDarmDeint_reg[3]
    • u9EJ.uARM9.uCORECTL.uIPIPE.IDarmDeint_reg[12]
    • u9EJ.uARM9.uCORECTL.uIPIPE.IDarmDeint_reg[17]
    • u9EJ.uARM9.uCORECTL.uIPIPE.IDarmDeint_reg[18]
    • u9EJ.uARM9.uCORECTL.uIPIPE.IDarmDeint_reg[24]
    • u9EJ.uARM9.uCORECTL.uIPIPE.IDarmDeint_reg[26]
    • u9EJ.uARM9.uCORECTL.uIPIPE.IDarmDeint_reg[29]
    • u9EJ.uARM9.uCORECTL.uIPIPE.IDarmDeint_reg[30]
    • u9EJ.uARM9.uCORECTL.uIPIPE.IDarmDeint_reg[31]

Embedded Design Space Potential
• Leverage significant signal fanout
• Determine that a fault has occurred during the cycle that it occurs
• Transition detection circuits
• Selectively deploy fault detection units
• Intersection of high fanout fault targets
• No roll-back necessary; simply flush the pipeline
• Low cost/area overhead critical for embedded designs

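One possible reading of "intersection of high fanout fault targets" is a set intersection over the state elements each fault corrupts in its first cycle. The sketch below applies that reading to the first-cycle error sites listed for the two case studies. The interpretation is an assumption, the helper `bits` is hypothetical, and the hierarchy prefixes of the register names are dropped for brevity.

```python
def bits(reg, indices):
    """Expand a register name and a list of bit indices into per-bit site names."""
    return {f"{reg}[{i}]" for i in indices}

# First-cycle error sites from the Fault Characteristics slide (path prefixes omitted).
case_1 = (
    bits("INSTRHeld_reg", [0, 16, 22, 31])
    | bits("IDarmDeint_reg", [0, 16, 31])
    | bits("StoredInstrInt_reg", [29, 30])
)
case_2 = bits("IDarmDeint_reg", [3, 12, 17, 18, 24, 26, 29, 30, 31])

# State elements corrupted by both faults: candidate locations for a detector.
shared = case_1 & case_2
```

For these two faults the intersection is the single bit `IDarmDeint_reg[31]`, so under this reading one detector at that bit would observe both of them in the cycle they occur.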
Conclusion
• Design domain critical:
  • Affects fault behavior
  • Limits applicable tolerance techniques
• Key observations:
  • Faults in combinational logic much more likely in embedded designs
  • Faults in combinational logic behave dramatically differently than those in state elements
  • Fault fanout offers potential for low overhead detection

Soft Error Terminology
[Figure labels: transistor, transient fault, soft error]

Pulse Detection
[Circuit diagram: flip-flop (D, Q, ~Q, CLK) with a shadow latch and an error output]

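The slide gives only the circuit labels, so the following is a behavioral sketch under an assumption: the shadow latch samples the same data input a short delay after the main flip-flop, and the error signal is raised when the two samples disagree. The function names, the delay and the pulse timings are all illustrative.

```python
def data_at(t, nominal, pulse_start, pulse_width):
    """Value on D at time t: the nominal bit, inverted while the transient pulse is active."""
    in_pulse = pulse_start <= t < pulse_start + pulse_width
    return nominal ^ 1 if in_pulse else nominal

def detect(nominal, pulse_start, pulse_width, clk_edge, shadow_delay):
    """Return (q, shadow, error) for one clock edge of the assumed detector."""
    q = data_at(clk_edge, nominal, pulse_start, pulse_width)
    shadow = data_at(clk_edge + shadow_delay, nominal, pulse_start, pulse_width)
    return q, shadow, q != shadow
```

With a clock edge at 5.0 ns and an assumed 0.5 ns shadow delay, a 0.2 ns pulse starting at 4.9 ns corrupts Q but not the shadow latch, so the error is flagged. In this simple model a pulse that outlasts the shadow delay corrupts both samples and escapes detection.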
Microarchitectural Errors per Cycle
[Figure: two panels, Faults Occurring in Registers and Faults Occurring in Combinational Logic]
Multi-bit errors are common for faults in combinational logic