
Design of High Availability Systems and Networks, Appendix: Validation

Prof. Ravi K. Iyer, Center for Reliable and High-Performance Computing, Department of Electrical and Computer Engineering and Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, iyer@crhc.uiuc.edu.


Presentation Transcript


  1. Design of High Availability Systems and Networks • Appendix: Validation • Prof. Ravi K. Iyer • Center for Reliable and High-Performance Computing • Department of Electrical and Computer Engineering and Coordinated Science Laboratory • University of Illinois at Urbana-Champaign • iyer@crhc.uiuc.edu • http://www.crhc.uiuc.edu/DEPEND

  2. Outline • Introduction • Validation methods • Design phase: fault simulation • Prototype phase: hardware- or software-implemented fault injection • Operational phase: measurement and analysis of field systems

  3. Challenges • Assessing the system dependability for • different technologies • different computers • different network topologies • different communication protocols • Validating the networked system characteristics • availability validation: (i) detection and (ii) recovery • dependability: (i) user level, (ii) system level and (iii) network level

  4. Experimental Analysis: Different Design Phases • Early Design Phase • Approach and goals: • CAD environments used to evaluate design via simulation • Simulated fault injection experiments • Evaluate effectiveness of fault-tolerant mechanisms • Provide timely feedback to system designers • Information produced: • accurate fault models • error latency, coverage • error detection and recovery time distributions • Limitation: • Simulations need accurate inputs and validation of results • Prototype Phase • Approach and goals: • System run under controlled workload conditions • Controlled fault injections used to evaluate system under faults • Information produced: • error latency, propagation times, detection distributions, availability • Limitation: • Can only study artificial faults and cannot provide dependability measures • Operational Phase • Approach and goals: • Measurement-based approach to study naturally occurring errors • Study systems in the field, under real workloads • Analyze collected error and performance data • Provides valuable information on actual failure characteristics • Information produced: • actual failure characteristics, failure rates, time-to-failure distribution • Limitation: • Limited to detected errors and the systems studied

  5. Evaluation - Experimental Methods • Design phase • Fault simulation (simulated fault injection) • Electrical, logic, or functional level • Hierarchical simulation • Issues: simulation time, level of simulation, fault conditions, accurate fault models • Prototype phase • Fault injection in prototype systems • Hardware fault injection • Software fault injection • Radiation-based fault injection • Issues: fault models, joint HW/SW fault injection, fault injection into networked environments and applications • Operational phase • Study of naturally occurring faults in real environments • Essential for believable analysis of today's complex systems • What can we say about future systems based on measurements from current systems? • Issues: HW/SW instrumentation, analysis tools

  6. Key Issues in Fault Injection • Effective fault injection mechanisms using hardware, software, and hybrid technology to accurately assess and validate networked systems • Practical evaluation methods to accurately quantify fault effect and recovery mechanisms in complex environments • Evaluation of error detection, diagnosis, and recovery techniques • Quantification of confidence in the fault-injection based validation • Usable fault tolerance benchmark for assessing systems and networks • Common evaluation/validation framework

  7. Design Phase: Fault Injection, Hybrid Simulation • [Figure: three panels. Monte Carlo simulation: mathematical models of failures drive a Monte Carlo simulation whose results are time to failure, availability, and performance. Functional simulation: faults are injected into functional simulation models; results are error latency, coverage evaluation, fault detection, and reliability estimation. Hybrid simulation: Monte Carlo simulation combined with functional simulation and fault injection. Each panel is compared against real-world results.]
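The Monte Carlo path named on this slide (mathematical models of failures in, availability out) can be illustrated with a minimal sketch. It assumes a single repairable component whose times to failure and times to repair are exponentially distributed; that model, the function name `simulate_availability`, and all parameter values are assumptions for illustration, not taken from the slides.

```python
import random

def simulate_availability(mttf, mttr, horizon, runs=1000, seed=0):
    """Estimate availability of one repairable component by Monte Carlo.

    Assumption: times to failure and to repair are exponential with
    means mttf and mttr. The slides do not specify a failure model.
    """
    rng = random.Random(seed)
    total_up = 0.0
    for _ in range(runs):
        t = 0.0   # current simulated time
        up = 0.0  # accumulated uptime in this run
        while t < horizon:
            ttf = rng.expovariate(1.0 / mttf)
            up += min(ttf, horizon - t)  # count uptime only inside the horizon
            t += ttf
            if t >= horizon:
                break
            t += rng.expovariate(1.0 / mttr)  # repair period counts as downtime
        total_up += up
    return total_up / (runs * horizon)

if __name__ == "__main__":
    estimate = simulate_availability(mttf=100.0, mttr=10.0, horizon=10_000.0)
    print(f"estimated availability: {estimate:.3f}")
    print(f"analytical MTTF / (MTTF + MTTR): {100.0 / 110.0:.3f}")
```

For this simple model the estimate can be checked against the closed form MTTF / (MTTF + MTTR); simulation becomes useful when the failure model has no such closed form.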

  8. Simulation at Different Levels • Electrical level: transistor, circuit, chip • Logic level: circuit, VLSI systems • Function level: VLSI system, computer and network systems • Levels of Simulated Fault Injection • Electrical level: change current, change voltage • Logic level: stuck-at 0 or 1, inverted fault • Function level: change CPU registers, flip memory bit • [Figure labels: electrical circuits, logic gates, functional units, network, physical process, logic operation]

  9. Issues in Simulated Fault Injection • Fault models • Fault conditions, fault types • Number of faults • Fault times • Fault locations • Workload • Real applications • Benchmarks • Synthetic programs • Simulation time explosion • Mix-mode simulation • Importance sampling • Concurrent simulation • Accelerated fault simulation • Hierarchical simulation

  10. Fault Injection at Electrical Level • Why is it needed? • Study the impact of physical causes • Simple stuck-at models do not represent many real types of faults • [Figure: transistor-level simulation and device-physics-level simulation of an ionizing particle hit, leading to a fault model. Labels: Vdd, GND, inputs A and B, output AB, SiO2, n+, p, p+, n, channel stop, angle of incidence.]

  11. Simulated Fault Injection at Logic Level • Fault Models • Basic models • stuck-at (permanent): forcing a logic value for the entire simulation duration • inverted fault (transient): altering a logic value momentarily • Fault dictionary approach • Use electrical-level simulation to derive logic-level fault models • dictionary entry: input vector, injection time, fault location • [Figure: transistor-level description of a 4-bit adder with inputs A(3:0), B(3:0), Cin and outputs S(3:0), Cout; a current-burst fault model is applied for all nodes and all input combinations to build the logic-level fault dictionary.] • Logic-level fault dictionary, excerpt (S, Cout, percentage) • Input A = 0000, B = 0000, Cin = 0: (----, F, 34%), (---F, F, 39%), (--F-, F, 7%), (F-F-, F, 20%) • Input A = 1111, B = 1111, Cin = 1: (----, F, 23%), (---F, -, 1%), (--FF, F, 9%), (-F--, -, 33%), (-FFF, -, 33%), (FFFF, F, 1%)
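A minimal sketch of logic-level fault injection on a 4-bit adder follows. It is a simplification: the slide's dictionary is derived from electrical-level simulation with a current-burst model, whereas this sketch only forces one carry signal of a ripple-carry model to a fixed value (a stuck-at fault) and tallies the resulting output error patterns over all inputs. The names `ripple_add4` and `error_pattern_counts` and the `stuck` parameter are hypothetical.

```python
def ripple_add4(a, b, cin, stuck=None):
    """4-bit ripple-carry adder with an optional stuck-at fault.

    stuck is a hypothetical (carry_index, value) pair: the carry out of
    bit position carry_index is forced to value (0 or 1) for the whole
    evaluation, which is how a permanent stuck-at fault behaves.
    """
    carry = cin
    s = 0
    for i in range(4):
        ai = (a >> i) & 1
        bi = (b >> i) & 1
        s |= (ai ^ bi ^ carry) << i
        carry = (ai & bi) | (carry & (ai ^ bi))
        if stuck is not None and stuck[0] == i:
            carry = stuck[1]  # inject the fault on this carry node
    return s, carry

def error_pattern_counts(stuck):
    """Tally (sum error bits, carry-out error bit) over all 512 inputs."""
    counts = {}
    for a in range(16):
        for b in range(16):
            for cin in (0, 1):
                good_s, good_c = ripple_add4(a, b, cin)
                bad_s, bad_c = ripple_add4(a, b, cin, stuck)
                pattern = (good_s ^ bad_s, good_c ^ bad_c)
                counts[pattern] = counts.get(pattern, 0) + 1
    return counts

if __name__ == "__main__":
    # Carry out of bit 0 stuck at 0; pattern (0, 0) means no visible error.
    for (s_err, c_err), n in sorted(error_pattern_counts((0, 0)).items()):
        print(f"S error {s_err:04b}  Cout error {c_err}  {100.0 * n / 512:.1f}%")
```

Dividing each count by the number of inputs gives percentages in the spirit of the dictionary excerpt, although the numbers will differ because the fault model differs.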

  12. Fault Injection: Electrical to Chip Level • FOCUS • A chip-level simulation environment developed at UIUC • Acceleration: mixed-mode simulation, importance sampling • Fault simulator: SPLICE1 • Applied to study a jet-engine microprocessor controller • [Figure: FOCUS experimental environment. Inputs: fault description (type of fault, transient or stuck-at; location and time) and target system description. Components: mixed-mode hierarchical fault simulation, fault tracing facility (trace), graphical analysis (visual identification, error propagation, manifestation). Functions: automatic fault injection, impact analysis, statistical analysis, design feedback.]

  13. Fault Injection: Electrical to Chip Level (Example) • HA1602 Study • Microprocessor-based jet-engine controller (Hamilton Standard) • Code chosen such that all functional units are exercised • Fault injection locations randomly selected • For each location, current transients injected at five levels: 0.5, 1, 2, 3, and 4 picocoulombs • Over 1000 faults injected • Resultant errors: • Logic upsets • Latched errors • Pin errors • Results (error category: occurrences, percentage, charge threshold) • Injected transients: 1050, 100.0%, 3.0 pC • Logic upsets: 437, 41.6%, 3.0 pC • Latched errors: 60, 5.7%, 3.0 pC • Pin errors: 59, 5.6%, 3.0 pC
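The percentages in the HA1602 table follow directly from the occurrence counts divided by the 1050 injected transients. A short check, using only the counts from the slide (the variable names are illustrative):

```python
# Occurrence counts as listed on the slide.
injected = 1050
counts = {"Logic upsets": 437, "Latched errors": 60, "Pin errors": 59}

# Fraction of injected transients that produced each error category.
percentages = {name: round(100.0 * n / injected, 1) for name, n in counts.items()}

if __name__ == "__main__":
    for name, pct in percentages.items():
        print(f"{name}: {pct}%")
```

These ratios are the kind of per-category measure a fault injection campaign reports: of all transients injected, how many became logic upsets, how many were latched, and how many reached the pins.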

  14. Fault Injection at Function Level • Diversity of Components • Object-oriented approach • Fault Models • Various types, depending on the type of component • Examples • Single bit-flip for a memory or register fault • Message corruption for a communication channel fault • Service interrupt for a node fault • More detailed fault models derived from lower-level simulation • Impact of Software • Impact of faults is application dependent • Software effect can be studied at this level

  15. Hierarchical Fault Simulation Example: LANai Processor of Myrinet Network Switch • Lower-level fault effects are propagated to the higher levels using fault dictionaries • [Figure: nested simulation levels. System level: hosts 1 to 4 on a local Myrinet network with a switch; hardware, and software consisting of a control program and modules i and j. Chip level: host interface, 256K memory, custom processor (LANai), and other details such as DMA. Logic level: LANai registers, ALU, adder, control. Transistor level and device-physics level: alpha-particle hits and the resulting fault model.]

  16. Prototype Phase: Hardware-Implemented Fault Injection • MESSALINE • Developed at LAAS-CNRS, France • Both probes and socket insertion are used • Can inject faults at up to 32 injection points • Applications • A subsystem of a railway interlocking control system • A distributed communication system • Advantages: accurate, low perturbation • Disadvantages: low flexibility, high cost • [Figure: MESSALINE architecture. Labels: input files, generation of system activity, system under study, fault injector, monitor/controller, control of experiment, activation, injection, observation, collection.]

  17. Prototype Phase: Software Implemented Fault Injection • Advantages: flexibility, low cost • Disadvantages: perturbation of workload, low time resolution • Software Fault Injection Techniques • Software faults and errors: modify the text/data segment of the program • Memory faults: flip memory bits • CPU faults: modify CPU registers, cache, buffers • Bus faults: use traps before and after an instruction to change the code or data used by the instruction, then restore them after the instruction is executed; a hardware approach may be better • Network faults: modify or delete transmitted messages; introduce faults in network controllers, drivers, buffers
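As a minimal sketch of the memory-fault and message-corruption techniques listed on this slide, the code below flips a single bit in a byte buffer. It operates on an ordinary Python `bytearray` standing in for a memory region or a message payload; a real injector would reach the target's memory or network path through OS or debugger facilities, which are not shown. The function names are hypothetical.

```python
import random

def flip_bit(buf, bit_index):
    """Flip one bit of a mutable byte buffer, in place."""
    byte, offset = divmod(bit_index, 8)
    buf[byte] ^= 1 << offset

def inject_random_bit_flip(buf, rng):
    """Flip a uniformly chosen bit and return its index for the experiment log."""
    bit_index = rng.randrange(len(buf) * 8)
    flip_bit(buf, bit_index)
    return bit_index

if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so an experiment can be repeated
    message = bytearray(b"transfer 100 units")
    where = inject_random_bit_flip(message, rng)
    print(f"flipped bit {where}: {bytes(message)!r}")
```

Logging the injected location and seeding the random generator keep an injection run reproducible, which matters when the observed error needs to be traced back to its cause.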

  18. Prototype Phase: Fault Injection Requirements • Distributed test and evaluation environment • Support for an architecture-independent approach • Evaluate hardware- and software-implemented fault tolerance of single-node architectures, distributed systems, and embedded applications • Support fault injection into a variety of targets, including CPU registers, cache, memory, I/O, network, applications, and OS functions • Examples of fault injection strategies include: • random components and locations • selected hardware and software components (predefined or random locations within a component) • application data and control flow • triggered by high stress conditions • impact the system timing (e.g., to mimic omission failures) • Allow collection and analysis of results to derive measures for characterizing the system (e.g., coverage, fault severity, propagation, latency, availability)

  19. N-FTAPE Architecture • [Figure: the N-FTAPE server provides a user interface and software libraries for measurement of system activity, fault injectors, workloads (synthetic and real programs), and fault models. It communicates over TCP/IP with the fault injection targets, machines A, B, and C (clients A, B, and C), each running a communication agent, a customized fault injector, and the workload or application(s).]

  20. Architecture of the N-FTAPE Environment • Distributed client-server architecture • Separation of the target(s) and the evaluation tool • Modular and portable • Well-defined interface between user applications and the test environment • Ability to customize the environment by specifying attributes of a fault injection strategy • injection approach (e.g., random, custom, stress-based, or path-based) • injection target (e.g., application, system, network) • injection location (CPU, memory, I/O, network controller) and time (e.g., high system stress) • a fault model for maximum impact on system operation and application execution • the type of analysis

  21. Example of Stress-Based Fault Injection in N-FTAPE • [Figure: fault injection specifications and workload specifications drive a fault injector and a workload generator acting on the system under test, whose CPU, memory, and I/O load levels are measured. Fault injection specifications: injection strategy (stress-based, path-based, random); injection method (by hardware, by software); fault location (CPU, memory, disk I/O, network I/O, other I/Os); injection time (load threshold, program execution path, fault arrival rate). Workload specifications: rates and mixes, interaction intensity.]

  22. Benchmark - Motivation • There are no availability benchmarks • Challenges: performance benchmark vs. fault tolerance benchmark • Performance benchmark: workload to exercise functionality of the system; metrics: time, throughput • Fault tolerance benchmark: workload to exercise functionality of the system and ensure maximal exercise of the fault tolerance mechanisms (FTMs), i.e., error detection and recovery; metrics: performance degradation, crashes, ?? • Benefits of benchmarks • Provide a numerical point of comparison among different systems • Yield insight into the operation of specific fault tolerance strategies and implementations

  23. Benchmark Definition/Specification • Catastrophic incident: an event that causes the computer system to become unusable. Some examples are • operating system panics and hangs • failed error recovery attempts that lead to an unusable system configuration • Performance degradation: the amount of additional time required by an application due to the presence of faults, taking into account the number of faults injected. The time overhead results from • the overhead of error recovery routines • the loss of resources such as CPUs or disks, which decreases the available compute or I/O bandwidth
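The slide defines performance degradation in words only. One possible reading, shown here as a sketch, normalizes the extra run time by the number of injected faults; the formula, the function name, and the example timings are assumptions, not the benchmark's official definition.

```python
def performance_degradation(baseline_s, faulty_s, faults_injected):
    """Extra run time per injected fault, in seconds.

    Assumed reading of the slide's definition: (time with faults minus
    fault-free time) divided by the number of faults injected.
    """
    if faults_injected <= 0:
        raise ValueError("at least one fault must be injected")
    return (faulty_s - baseline_s) / faults_injected

if __name__ == "__main__":
    # Hypothetical timings: 100 s fault-free, 130 s with 10 injected faults.
    print(performance_degradation(100.0, 130.0, 10), "s per fault")
```

Normalizing by the fault count lets runs with different injection rates be compared, at the cost of hiding whether a few faults caused most of the overhead.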

  24. Prototype Phase: Hybrid Fault Injection • Developed at University of Illinois • Faults injected by software • Impact measured by both software and hardware • Used to study Tandem Integrity prototype • [Figure: layout of HYBRID. Labels: test system (applications, fault injector, SW monitor, monitoring program, supervisor), hardware monitor, control host, backplane probes, GPIB or RS-232, LAN or other bidirectional links.]

  25. Operational Phase: Measurement-Based Analysis • Methodology • Step 1: data processing • Step 2: model identification and measure acquisition • Step 3: model solution if necessary • Step 4: analysis of models and measures • [Figure: methodology flow. Field data enter data processing (step 1), which yields coalesced data; model identification and measure acquisition (step 2) yield models and measures; model solution (step 3) and analysis of models and measures (step 4) yield results.]

  26. Correlated Failures

  27. Correlated Failures (cont.)

  28. Correlated Failures

  29. Correlated Failures (cont.)

  30. Failure Prediction • Approaches • Predict future failures based on current and historical on-line error information • Heuristic approach • Error reports and heuristic rules • Performance anomalies and signatures • Statistical approach • Statistical relationships among error states

  31. Failure Prediction (cont.) • Statistical Analysis of Symptoms • To recognize intermittent failures through statistical analysis • Recognition steps • Error record (tuples): coalescing redundant reports • Error group (bursts): error records occurring within 15 minutes • Error event: error groups occurring within 24 hours • Symptom: error types common to at least half of the groups in an event • Super event: related events • Two events are grouped into a super event if • They have at least one symptom in common, or • A symptom of one event is a subset of a symptom in another event, or • They are single-group events and have at least two error types in common
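The first recognition steps above can be sketched as follows. The slide does not say how "within 15 minutes" and "within 24 hours" are measured, so this sketch assumes a gap rule: a record joins the current group if it arrives no more than 15 minutes after the previous record, and a group joins the current event if it starts no more than 24 hours after the previous group ends. The record layout `(timestamp, error_type)` and all function names are hypothetical; super-event merging is left out.

```python
from datetime import datetime, timedelta

def cluster(items, start_of, end_of, max_gap):
    """Chain time-ordered items into clusters separated by more than max_gap."""
    clusters = []
    for item in items:
        if clusters and start_of(item) - end_of(clusters[-1][-1]) <= max_gap:
            clusters[-1].append(item)
        else:
            clusters.append([item])
    return clusters

def find_symptoms(records):
    """records: (timestamp, error_type) tuples sorted by timestamp.

    Returns one (event, symptom) pair per event, where event is a list of
    groups and symptom is the set of error types that appear in at least
    half of the event's groups.
    """
    groups = cluster(records, lambda r: r[0], lambda r: r[0],
                     timedelta(minutes=15))
    events = cluster(groups, lambda g: g[0][0], lambda g: g[-1][0],
                     timedelta(hours=24))
    result = []
    for event in events:
        type_sets = [{etype for _, etype in group} for group in event]
        all_types = set().union(*type_sets)
        symptom = {t for t in all_types
                   if 2 * sum(t in s for s in type_sets) >= len(type_sets)}
        result.append((event, symptom))
    return result

if __name__ == "__main__":
    t0 = datetime(2000, 1, 1)
    log = [(t0, "disk"), (t0 + timedelta(minutes=5), "cpu"),
           (t0 + timedelta(hours=2), "disk"), (t0 + timedelta(days=3), "net")]
    for event, symptom in find_symptoms(log):
        print(len(event), "group(s), symptom:", sorted(symptom))
```

A fixed-window rule (all records within 15 minutes of the first record in the group) would be an equally plausible reading and would split long bursts that the gap rule keeps together.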

  32. Failure Prediction (cont.)

  33. Failure Prediction (cont.)

  34. System Evaluation/Validation • [Figure: evaluation/validation across the design, prototype, and operation phases. Techniques shown: models (analytical, simulation), formal methods, fault injection (HW implemented, SW implemented), and analysis of field failure data. Information exchanged between phases: coverage, error latency, failure rates, fault models, and corrections of assumptions.]

  35. Concluding Remarks • Design/Simulation Phase • Fault tolerance issues • need well-established system-level fault models • impact of software faults • effect of failures on robustness and system integrity • Simulation issues • simulation time explosion • validation of the simulation methodology • Prototype (Fault Injection) Phase • Fault models and their validity • hardware: permanent, transient • software: errors, faults/defects • Comparison (validation) of various fault injection tools • claims, portability, coverage • Operational Measurement Phase • What to measure • When to measure • From case studies to fundamental results • Isolation of machine-specific vs. general system and software dependability characteristics • On-line diagnosis • Prediction of impact of configuration, technology, and workload changes based on field measurements
