This paper discusses fault tolerance and scheduling for real-time applications in the automotive industry, focusing on safety and redundancy requirements. It covers fault models, software redundancy techniques, off-the-shelf solutions, and a synthesis-based solution for managing distributed systems. The programming model, contributions, and data flow modeling are explored, along with the use of a platform library for fault-tolerant data flow modeling in distributed systems. The paper also presents a scheduling synthesis tool for optimizing performance in distributed systems.
DRAFTS: Distributed Real-time Applications Fault Tolerant Scheduling • Claudio Pinello (pinello@eecs.berkeley.edu)
Motivation • Drive-by-Wire applications
Motivation • No rods: increased passive safety • Interior design freedom • BMW, Daimler, Citroën, Chrysler, Bertone, SKF, etc.
Problem Overview • Safety: system failure must be as unlikely as in traditional systems • Fault tolerance: redundancy is key
Faults • SW faults (bugs): can be reduced by disciplined coding, and even further by code generation • HW faults: harsh environment; many units (>50 microprocessors in a car; subsystems with 10-15 microprocessors)
Fault Model • Silent faults: result in omission errors • Detectable faults: result in detectably corrupted data (e.g. CRC-protected channels) • Non-silent faults: result in value errors • Byzantine faults: malicious attacks, non-silent faults, unbounded delays, etc.
Software Redundancy • Space redundancy: execute replicas on different HW; send results on different/multiple channels
N-copies Solution • N copies of one design: pros: design once; cons: N× cost, 1× speed • Multiple designs: pros: reduced cost; cons: degradation, 1× speed [Figure: data-flow diagrams built from Plant, Abstract input, CoarseCTRL, FineCTRL, ArbiterBest, Iterator and Abstract out blocks]
Redundancy Management • Managing a distributed system with multiple results requires careful programming: keep the N copies synchronized; exchange and apply results; detect and isolate faults; recover
Possible solutions • Off-the-shelf solutions: TTP-based architectures, FT-CORBA middleware • Development tools: synthesis, debugged and portable libraries
Automotive Domain • Production costs dominate NRE (non-recurring engineering) costs: multi-vendor supply chain; interest in full utilization of architectures • Validation and certification are critical: validate the process; validate the product
Shortcomings of OTS solutions • TTP: proprietary communication network; network redundancy defaults to 2-way; active replication, hence potential underutilization of resources • FT-CORBA: middleware with fairly large overhead
Synthesis-based Solution • Synthesize only the needed glue code (at the extreme: get rid of the OS) • Customizable replication mechanisms (use passive replicas) • Treat the architecture as a distributed execution machine (exploit parallelism to speed up execution)
Schedule Synthesis [Figure: the application data-flow graph (Plant, Sens, Input, CoarseCTRL, FineCTRL, ArbiterBest, Iterator, Output, Act) mapped onto an architecture of CPUs]
Synthesis-based Solution • Enables fast architecture exploration
Contributions • Programming model • Metropolis platform • Schedule synthesis tool and optimization strategy • Verification tools
Programming Model • Definition of a programming model that: is amenable to specifying feedback controllers; is convenient for analysis, simulation and synthesis; supports degraded functionality/accuracy; supports redundancy; is deterministic
Static Data-flow Model • Pros: deterministic behavior; actors perform deterministic computation (no internal state); requires all inputs to fire an actor; explicit parallelism; good for periodic algorithms • Shortcomings: requires all inputs to fire an actor, but source actors may fail! [Figure: data-flow graph with actors A, B, C]
Pendulum Example [Figure: data-flow graph with Plant, Abstract input, CoarseCTRL, FineCTRL, ArbiterBest, Iterator and Abstract out; controller labels: Bang-Bang, Linear]
Model Extensions • Node criticality • Node typing (sensor, input, arbiter, etc.): some types (input and arbiter) can fire with missing inputs • Tokens have “Epoch” and “Valid” fields • Specialized single-place buffer links: manage redundant sources (and destinations)
Data Tokens: Epoch • Epoch is the iteration index of the periodic algorithm • Actors ask for “current” inputs • Comparing epochs with >= lets us account for missing results (self-synchronization) • Token fields: Data, Epoch, Valid
Data Tokens: Valid • Valid models the effect of fault detection: True means the data was received/produced correctly; False means the data was not received on time or was corrupted • Firing rules (and actors) may use it to change their behavior • Token fields: Data, Epoch, Valid
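The token fields and firing rules above can be illustrated with a minimal Python sketch. It is not the FTDF library code: the names `Token`, `is_current`, `can_fire_strict` and `can_fire_tolerant` are hypothetical, and the exact firing rules of each actor type are assumed here for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Token:
    data: float
    epoch: int   # iteration index of the periodic algorithm
    valid: bool  # False if the data arrived late or was corrupted


def is_current(token: Optional[Token], epoch: int) -> bool:
    """A buffered token is usable if it exists, is valid, and is not stale.

    Using >= (instead of ==) lets an actor that missed some results
    resynchronize on a newer token (assumed reading of the slide).
    """
    return token is not None and token.valid and token.epoch >= epoch


def can_fire_strict(inputs: Sequence[Optional[Token]], epoch: int) -> bool:
    """Plain data-flow actor: every input must be current."""
    return all(is_current(t, epoch) for t in inputs)


def can_fire_tolerant(inputs: Sequence[Optional[Token]], epoch: int,
                      required: int = 1) -> bool:
    """Input/arbiter-style actor: fires with at least `required` current inputs."""
    return sum(is_current(t, epoch) for t in inputs) >= required
```

With this split, a missing or invalid replica result blocks an ordinary actor but not an input or arbiter actor, which matches the "requires 1" annotations used in the schedule figures.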
FTDataFlow modeling • Metropolis used as the framework to develop the set of tools • FTDF is a platform library in Metropolis: modeling, simulation, fault injection; supports semi-automatic replication; results visualization
Actor Classes • DF_SENactor: sensor actor • DF_INactor: input actor • DF_AINactor: abstract input actor • DF_FUNactor: data-flow actor • DF_ARBactor: arbiter actor • DF_AOUTactor: abstract output actor • DF_OUTactor: output actor • DF_ACTactor: actuator actor • DF_MEM: state memory • DF_Injector: fault injection
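Pendulum Example [Figure: the pendulum data-flow graph (Plant, Abstract input, CoarseCTRL, FineCTRL, ArbiterBest, Iterator, Abstract out) with a fault injector (Inject)]
Simulation output [Figure: simulation trace with the fault marked]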
Summary on FTDF • Extended SDF to deal with: missing/redundant inputs; different criticality; functionality types • Developed Metropolis platform: modeling, simulation, fault injection, visualization of results; support for adding redundancy
Architecture Model • Architecture connectivity: bipartite graph • Computation and communication times: actor/CPU and data/channel matrices of execution and transmission times • Same as the SynDEx model [Figure: architecture graph with six CPUs]
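One possible encoding of this architecture model is sketched below in Python. The class and method names (`Architecture`, `channels_between`, `candidate_cpus`) and the convention that a missing matrix entry means "cannot run there" are assumptions made for illustration, not the SynDEx or Metropolis data structures.

```python
from dataclasses import dataclass
from typing import Dict, Set, Tuple


@dataclass
class Architecture:
    # Edges of the bipartite graph: each edge joins a CPU to a channel.
    links: Set[Tuple[str, str]]
    # (actor, cpu) -> execution time; a missing entry means "cannot run there".
    exec_time: Dict[Tuple[str, str], float]
    # (data, channel) -> transmission time.
    tx_time: Dict[Tuple[str, str], float]

    def channels_between(self, cpu_a: str, cpu_b: str) -> Set[str]:
        """Channels attached to both CPUs, i.e. direct communication paths."""
        chans_a = {ch for cpu, ch in self.links if cpu == cpu_a}
        chans_b = {ch for cpu, ch in self.links if cpu == cpu_b}
        return chans_a & chans_b

    def candidate_cpus(self, actor: str) -> Set[str]:
        """CPUs on which the actor has a known execution time."""
        return {cpu for (a, cpu) in self.exec_time if a == actor}
```

Keeping the times in sparse dictionaries rather than dense matrices makes it easy to express that a sensor or actuator actor is tied to specific CPUs.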
Fault Behavior • Failure patterns: subsets of the architecture graph that may fail simultaneously • For each failure pattern, specify a criticality level, i.e. which functionalities must be guaranteed • Typically, for the empty failure pattern all functionality must be guaranteed
Synthesis Problem • Given: application, architecture, fault behavior • Derive: redundancy, schedule [Figure: the application data-flow graph mapped onto an architecture of CPUs, with replicated Sens, Input, CoarseCTRL, ArbiterBest, Output and Act actors]
Pendulum Example • Actuator/sensor location • Tolerate any single fault: {empty}: all functionality; {one CPU}: may drop FineController, and the sensor/actuator on that CPU; {one channel}: may drop FineController [Figure: three CPUs with three sensors and two actuators]
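The single-fault behavior of the pendulum example can be written down as a table from failure patterns to guaranteed functionalities. The Python sketch below is illustrative only: the CPU and channel names, the number of channels, and the helper names are hypothetical, and sensors/actuators are left out for brevity.

```python
from typing import Dict, FrozenSet, Iterable, Set

# Functionalities of the pendulum example (sensors/actuators omitted).
ALL_FUNCTIONS: Set[str] = {"Input", "CoarseCTRL", "FineCTRL", "ArbiterBest", "Output"}
CPUS = ["cpu0", "cpu1", "cpu2"]   # three CPUs, names assumed
CHANNELS = ["ch0", "ch1"]         # number and names of channels assumed


def single_fault_behavior() -> Dict[FrozenSet[str], Set[str]]:
    """Map each failure pattern to the functionalities that must survive it."""
    behavior: Dict[FrozenSet[str], Set[str]] = {frozenset(): set(ALL_FUNCTIONS)}
    for component in CPUS + CHANNELS:
        # Any single CPU or channel fault may drop the fine controller.
        behavior[frozenset({component})] = ALL_FUNCTIONS - {"FineCTRL"}
    return behavior


def guaranteed(behavior: Dict[FrozenSet[str], Set[str]],
               failed: Iterable[str]) -> Set[str]:
    """Functionalities that must be guaranteed under the given failed components."""
    return behavior[frozenset(failed)]
```

Patterns that are not listed (for example two simultaneous CPU faults) are simply outside the specified fault behavior and raise a `KeyError` in this sketch.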
Refined I/O [Figure: pendulum data-flow graph with I/O refined into Sens, Input, Output and Act actors]
Full Replication [Figure: pendulum data-flow graph with replicated Sens, Input, CoarseCTRL, ArbiterBest, Output, Act and Iterator actors]
Simulation output [Figure: simulation trace]
Schedule Synthesis Strategy • Leverage existing data-flow scheduling tools (e.g. SynDEx) to achieve a distributed static schedule that is also fault-tolerant • At design time (off-line): devise a redundant schedule • At run time: trivial reconfiguration, i.e. skip actors that cannot fire
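The run-time side of this strategy, skipping actors that cannot fire, can be sketched as a loop over a fixed schedule. The Python below is an assumed, single-processor simplification: `Slot`, `run_iteration` and the example `SCHEDULE` are hypothetical names, and a fault is modeled simply as an actor that produces nothing.

```python
from typing import List, NamedTuple, Set, Tuple


class Slot(NamedTuple):
    actor: str
    inputs: Tuple[str, ...]  # producers this actor reads from
    required: int            # minimum number of inputs that must be present


# Static schedule computed off-line (illustrative example).
SCHEDULE: List[Slot] = [
    Slot("Sensor1", (), 0),
    Slot("Sensor2", (), 0),
    Slot("Input", ("Sensor1", "Sensor2"), 1),
    Slot("Function1", ("Input",), 1),
    Slot("Function2", ("Input",), 1),
    Slot("Arbiter", ("Function1", "Function2"), 1),
    Slot("Output", ("Arbiter",), 1),
]


def run_iteration(schedule: List[Slot], failed: Set[str]) -> List[str]:
    """Walk the static schedule once and return the actors that fired."""
    produced: Set[str] = set()
    fired: List[str] = []
    for slot in schedule:
        if slot.actor in failed:
            continue  # faulty actor: produces nothing
        available = sum(1 for src in slot.inputs if src in produced)
        if available < slot.required:
            continue  # cannot fire: skip, no rescheduling needed
        produced.add(slot.actor)
        fired.append(slot.actor)
    return fired
```

The order of the schedule never changes; faults only remove entries from the list of actors that actually fire in a given iteration.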
Generating Schedules • Full architecture: schedule all functionalities (maximum performance) • For each failure pattern: mark the faulty architecture components (critical functionalities cannot run there), then schedule all functionalities (adds redundancy) • Merge the schedules
Generating Schedules • Full architecture: schedule all functionalities • For each failure pattern: mark the faulty architecture components, then schedule the critical functionalities • Merge the schedules
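The generate-and-merge loop can be sketched as follows. This is a toy Python illustration under strong assumptions: the round-robin `schedule` function is a stand-in for a real data-flow scheduler such as SynDEx, timing and communication are ignored, and a "schedule" is reduced to a set of (actor, CPU) placements.

```python
from typing import Iterable, List, Set, Tuple

Placement = Set[Tuple[str, str]]  # set of (actor, cpu) pairs


def schedule(actors: List[str], cpus: Iterable[str]) -> Placement:
    """Toy stand-in for a real scheduler: round-robin placement."""
    ordered = sorted(cpus)
    if not ordered:
        raise ValueError("no surviving CPU to schedule on")
    return {(a, ordered[i % len(ordered)]) for i, a in enumerate(actors)}


def synthesize(all_actors: List[str], critical: List[str],
               cpus: Set[str], failure_patterns: List[Set[str]]) -> Placement:
    """Merge the fault-free schedule with one schedule per failure pattern."""
    merged = schedule(all_actors, cpus)      # full architecture
    for pattern in failure_patterns:
        surviving = cpus - pattern           # mark faulty components
        merged |= schedule(critical, surviving)  # critical functionalities only
    return merged
```

Merging by set union means a critical actor ends up replicated on every CPU where some per-pattern schedule placed it, while non-critical actors keep only their fault-free placement.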
Merge into FTS (fault-tolerant schedule) • Care must be taken to deal with multiple routings; the merged schedule is clearly not optimal [Figure: per-pattern schedules over ECU0 and ECU1 (Sensor1, Sensor2, Input receiver (requires 1), Function1 (required), Function2 (optional), Arbiter, Output driver (requires 1), Actuator1, Actuator2) merged into one schedule]
Heuristic 1: Limit CPU Load • Full architecture: schedule all functionalities • For each failure pattern: mark the faulty architecture components (critical functionalities cannot run there), then re-schedule only the critical functionalities (constrain the non-critical ones as in the full architecture); this adds redundancy for critical functionalities only • Merge the schedules
Heuristic 2: Limit Bus Load • Prune redundant communication • Heuristic 3: passive replicas (limit CPU load) [Figure: fault-tolerant schedule with Sensor1, Input receiver (requires 1), Function1 (required), Function2 (optional), Arbiter, Output driver (requires 1) and Actuator1 on ECU0, and Sensor2, Input receiver, Function1, Arbiter, Output driver and Actuator2 on ECU1]
Total Orders • For each processor and each channel, find a total order that is compatible with the partial order of the FTS • Prototype: “any compatible total order”
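One simple way to obtain "any compatible total order" is to compute a single topological order of the whole partial order and project it onto each resource. The Python sketch below does this with the standard-library `graphlib` module (Python 3.9+); the function name `total_orders` and the dictionary encoding are assumptions, not the prototype's actual implementation.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def total_orders(precedence: Dict[str, Set[str]],
                 placement: Dict[str, str]) -> Dict[str, List[str]]:
    """Derive one total order per resource (processor or channel).

    precedence maps each node to the set of its predecessors;
    placement maps each node to the resource it is scheduled on.
    """
    # One global linear extension of the partial order ...
    global_order = TopologicalSorter(precedence).static_order()
    orders: Dict[str, List[str]] = {}
    # ... projected onto each resource keeps every precedence constraint.
    for node in global_order:
        orders.setdefault(placement[node], []).append(node)
    return orders
```

Because all per-resource orders come from the same global order, they are mutually consistent, so no two resources can end up waiting on each other in a cycle.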
Schedule optimization • Exploit architectural redundancy as a performance boost (in the absence of faults): replica overloading and deallocation; passive replicas; graceful degradation, i.e. reduced functionality (and resource demands) under faults
Active Replicas [Figure: behavior graph (actors A, B, C, D), architecture (processors P1, P2), and the resulting active-replication schedule]
Deallocation & Degradation [Figure: behavior graph (actors A, B, C, D), architecture (processors P1, P2 with channels C1, C2), and a schedule with deallocation, including the transfers B->D and C->D]
Aggressive Heuristics • Some heuristics can be certified not to break fault tolerance/fault behavior • Others may need verification of the results, e.g. human inspection and modification
(Off-line) Verification • Functional verification: for each failure pattern, the corresponding functionality is correctly executed • Timing verification/analysis: worst-case iteration time under each fault
Functional Verification • Apply equivalence checking methods to the FT schedule under all fault scenarios (failure patterns) • Based on application DAGs and the architecture graph
Functional Verification (example, continued) • For the full-functionality case, the arbiter must include both functions • The output function only requires that one of the actuators be visible • In the other graphs (which include failures), the arbiter only needs the single required input (Function1) • Source: Sam Williams [Figure: fault-tolerant schedule on ECU0/ECU1 compared against the task graphs for Actuator1 and Actuator2, each built from Sensor1, Sensor2, Input receiver (requires 1), Function1 (required), Function2 (optional), Arbiter and Output driver (requires 1)]
Functional Verification: comments • Takes milliseconds on small cases, a few minutes on large schedules • The tool was written in Perl (performance was sufficient) • Schedule verification is performed off-line (not time-critical) • Credits: Sam Williams
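As a rough illustration of what an off-line functional check looks for, the Python sketch below tests a much weaker property than the equivalence checking described above: for every failure pattern, each functionality that must be guaranteed still has at least one replica on a surviving CPU. It is not the Perl tool; the function name `uncovered` and the data encoding are hypothetical.

```python
from typing import Dict, FrozenSet, Set, Tuple


def uncovered(placement: Set[Tuple[str, str]],
              behavior: Dict[FrozenSet[str], Set[str]]
              ) -> Dict[FrozenSet[str], Set[str]]:
    """Return, per failure pattern, the required functionalities with no replica left.

    placement is a set of (actor, cpu) pairs; behavior maps each failure
    pattern (a set of failed CPUs) to the functionalities it must preserve.
    """
    problems: Dict[FrozenSet[str], Set[str]] = {}
    for pattern, required in behavior.items():
        alive = {actor for actor, cpu in placement if cpu not in pattern}
        missing = set(required) - alive
        if missing:
            problems[pattern] = missing
    return problems
```

An empty result is only a necessary condition for fault tolerance; it says nothing about data routing or about whether the surviving replicas receive the inputs they need.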
Conclusions • Contributions: programming model (FTDF); Metropolis platform; schedule synthesis tool (in collaboration with INRIA); schedule optimization strategy; functional verification (in collaboration with Sam Williams); replica determinism analysis (not shown here)