Design of Reliable Systems and Networks ECE 442 Checkpointing & Recovery (I)

Prof. Ravi K. Iyer
Center for Reliable and High-Performance Computing
Department of Electrical and Computer Engineering and Coordinated Science Laboratory
University of Illinois at Urbana-Champaign
iyer@crhc.uiuc.edu
http://www.crhc.uiuc.edu/DEPEND

Outline
• Recovery basics
• Forward error recovery
• Backward error recovery
• Example approach (libft)

Recovery - Basic Concepts
• Providing fault tolerance involves three phases
  • Error detection
  • Assessment of the extent of the damage
  • Error recovery to eliminate errors and restart afresh
• Forward error recovery: the currently executing process continues from some further point, with compensation for the corrupted and missed data. The assumptions:
  • The precise error conditions that caused the detection and the resulting damage can be accurately assessed
  • The errors in the process’s (system’s) state can be removed
  • The process (system) can move forward
  • Example: exception handling and recovery
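
As a minimal sketch of forward error recovery through exception handling, the function below detects a corrupted sample, confines the damage to that sample, compensates by dropping it, and moves forward. The function name and the sensor-reading scenario are hypothetical illustrations, not part of the lecture.

```python
def average_readings(raw_readings):
    """Average sensor readings, moving forward past corrupted samples."""
    total, count, skipped = 0.0, 0, 0
    for raw in raw_readings:
        try:
            value = float(raw)      # error detection: conversion raises on bad data
        except (TypeError, ValueError):
            skipped += 1            # damage is confined to this one sample
            continue                # compensate by dropping it, then move forward
        total += value
        count += 1
    mean = total / count if count else None
    return mean, skipped


# Two of the five samples are corrupted; the computation still completes.
mean, skipped = average_readings(["1.0", "2.0", "garbage", None, "3.0"])
```

The approach works here only because the assumptions listed above hold: the error is precisely identified (one unparsable sample) and its effect on the state can be removed without re-executing anything.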

Recovery - Basic Concepts (cont.)
• Backward error recovery: the current process is rolled back to an error-free point and re-executes the corrupted part of the process, thus continuing the same requested service. The assumptions:
  • The nature of the faults cannot be foreseen, and errors in the process’s (system’s) state cannot be removed without re-executing
  • The process (system) state can be restored to a previous error-free state
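
Backward error recovery can be sketched as follows, assuming the state is a Python dictionary and the error is signaled by an exception. `TransientFault`, `run_with_rollback`, and the account-transfer scenario are hypothetical names chosen for illustration.

```python
import copy


class TransientFault(Exception):
    """Stand-in for an error reported by some detection mechanism."""


def run_with_rollback(state, step, max_retries=3):
    """Apply step(state) in place; on error, restore the saved state and retry."""
    saved = copy.deepcopy(state)                 # error-free recovery point
    for attempt in range(max_retries + 1):
        try:
            step(state)
            return attempt                       # number of rollbacks needed
        except TransientFault:
            state.clear()
            state.update(copy.deepcopy(saved))   # roll back to the recovery point
    raise RuntimeError("error persisted after rollback and re-execution")


calls = {"n": 0}


def transfer(accounts):
    calls["n"] += 1
    accounts["a"] -= 10                          # first half of the update
    if calls["n"] == 1:                          # fault injected on the first run only
        raise TransientFault("fault between the two updates")
    accounts["b"] += 10                          # second half of the update


accounts = {"a": 100, "b": 0}
rollbacks = run_with_rollback(accounts, transfer)
```

The first run leaves the state half-updated. No attempt is made to repair it; the state is simply restored and the step re-executed, which matches the assumption that the damage cannot be removed without re-executing.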

Comparison Forward & Backward Error Recovery

Checkpoint and Rollback
• Applicability
  • When time redundancy is allowed
  • To transient hardware faults and many software design faults (e.g., timing faults)
  • To both nonredundant and redundant architectures
  • When it is feasible to determine checkpoints in an application
• Checkpointing
  • Maintains/saves the precise system state, or a “snapshot”, at regular intervals
  • The checkpoint interval can be as small as one instruction
  • Typically, a checkpoint interval includes many instructions
  • May not be ideal when error detection latency is high (an error that occurred before the last checkpoint may already be saved in it)
• Rollback recovery
  • When an error is detected
    • Roll back (or restore) process(es) to the saved state, i.e., a checkpoint
    • Restart the computation
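
The trade-off behind the checkpoint interval can be illustrated with a small sketch: a loop that checkpoints every `interval` iterations and, on a single injected transient fault, rolls back to the last snapshot and re-executes. The function and its `fail_at` fault-injection parameter are hypothetical.

```python
def checkpointed_sum(n, interval, fail_at=None):
    """Sum 0..n-1, checkpointing every `interval` iterations.

    `fail_at` injects one transient fault just before that iteration
    (a stand-in for an error reported by a detection mechanism).
    """
    checkpoint = (0, 0)                    # (next iteration, running total)
    i, total = checkpoint
    redone = 0                             # iterations re-executed after rollback
    while i < n:
        if i % interval == 0:
            checkpoint = (i, total)        # save the snapshot
        if i == fail_at:
            fail_at = None                 # transient: does not recur on retry
            redone += i - checkpoint[0]
            i, total = checkpoint          # roll back and restart from the snapshot
            continue
        total += i
        i += 1
    return total, redone


coarse = checkpointed_sum(10, interval=4, fail_at=7)   # loses 3 iterations of work
fine = checkpointed_sum(10, interval=1, fail_at=7)     # loses none, checkpoints more often
```

Both runs produce the correct sum; the shorter interval loses less work on rollback at the price of taking more checkpoints.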

Checkpoint and Rollback: What do we need?
• Implement an appropriate error-detection mechanism
  • Internal to the application: various self-checking mechanisms (e.g., data integrity, control-flow checking, acceptance tests)
  • External to the application: signals (e.g., abnormal termination), missing heartbeats, watchdog timers
• Determine the data to be checkpointed (the process state)
  • Volatile states
    • Program stack (local variables, return pointers of function calls)
    • Program counter, stack pointer, open file descriptors, signal handlers
    • Static and dynamic data segments
  • Persistent states
    • User files related to the current program execution (whether to include the persistent state in the process state depends on the application, e.g., the persistent state is often an important part of a long-running application)
• Store the checkpoint data on stable storage
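
One common way to approximate stable storage on a single machine is to write the checkpoint to a temporary file, force it to disk, and atomically rename it over the previous checkpoint, so that a crash during the write never leaves a partial checkpoint in place. The sketch below assumes a JSON-serializable state and a POSIX-style file system; the function names are hypothetical.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write `state` so that `path` always holds a complete checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckpt-")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())       # force the data to the storage device
        os.replace(tmp_path, path)     # atomic rename over the old checkpoint
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)        # do not leave a partial file behind
        raise


def load_checkpoint(path, default):
    """Return the last saved state, or `default` if no checkpoint exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default
```

A restarted process would call `load_checkpoint` first and fall back to its initial state when nothing has been saved yet. A single local disk does not survive a node failure; tolerating that would require replicating the file to another node.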

Checkpoint and Rollback: What do we need? (cont.)
• Determine events to be logged and replayed
  • Messages
  • Events (that provoke a message to be sent)
  • Transactions
• Determine checkpoint times based on
  • Elapsed time
  • Message received or sent, e.g., parallel or distributed applications
  • Amount of dirtied state, e.g., database applications
  • Critical function invocation/exit
• Provide a procedure to restart the computation
• Provide a way to handle a persistent error
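
Logging and replay can be sketched with a toy process whose state is driven entirely by incoming messages. The class below is a hypothetical illustration; it assumes the log survives the failure and that message processing is deterministic, so replaying the log reproduces the lost state.

```python
class LoggedCounter:
    """Toy process whose state is a running total driven by messages."""

    def __init__(self):
        self.total = 0
        self.log = []                  # received messages (assumed on stable storage)
        self.checkpoint = (0, 0)       # (saved total, log length at checkpoint)

    def receive(self, amount):
        self.log.append(amount)        # log the message before processing it
        self.total += amount

    def take_checkpoint(self):
        self.checkpoint = (self.total, len(self.log))

    def recover(self):
        """Restore the checkpoint, then replay messages logged after it."""
        self.total, position = self.checkpoint
        for amount in self.log[position:]:
            self.total += amount
        return len(self.log) - position    # number of messages replayed


proc = LoggedCounter()
proc.receive(5)
proc.receive(7)
proc.take_checkpoint()
proc.receive(1)
proc.receive(2)
proc.total = -999                      # simulate corruption of the volatile state
replayed = proc.recover()
```

Only the two messages received after the checkpoint are replayed, which is why checkpointing bounds the length of the log that recovery must process.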

Checkpoint and Rollback Example: libft (AT&T)
• Software-implemented fault tolerance
• Detection and recovery from errors that cause an application process to crash or hang
  • Software faults
  • Transient faults in underlying hardware and operating system layers (if not handled in those layers)
• Four levels of software fault tolerance envisaged
  L0: No tolerance to faults in application software (manual restart from an initial internal state); full restart
  L1: Automatic detection and restart from an initial internal state (no checkpoint)
  L2: L1 plus periodic checkpointing, logging and recovery of internal state
  L3: L2 plus persistent data recovery
  L4: Continuous operation without interruption (use of hot spares, voting, multicast messaging, consistency mechanisms); not supported

Checkpoint and Rollback Example: libft (cont.)
• watchd is a watchdog daemon process that watches the life of a local application process using two methods
  • Sends a null message to the local application process using IPC (Inter Process Communication) and checks the return value
    • if watchd cannot make a connection, it tries again (after waiting some time)
    • if the second attempt fails, watchd assumes that the process is hung
  • Asks the application process to send a heartbeat message to watchd periodically
    • if the heartbeat message is not received within a specified time, watchd assumes that the process is hung
• watchd recovers the application at an initial internal state or at the last checkpointed state
• watchd also watches one neighboring watchd in a circular fashion to detect node failures
• watchd also watches itself: upon initialization, watchd creates a backup watchd, which keeps polling the primary to detect errors
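
The heartbeat method can be sketched generically as below. This is not watchd's code or interface; the class, its method names, and the explicit `now` timestamps (used instead of a real clock to keep the example deterministic) are assumptions made for illustration.

```python
class HeartbeatMonitor:
    """Declares a process hung when its heartbeat is older than `timeout`."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}            # process name -> time of last heartbeat

    def heartbeat(self, name, now):
        self.last_seen[name] = now

    def hung(self, now):
        """Names whose last heartbeat is older than the timeout."""
        return sorted(
            name for name, seen in self.last_seen.items()
            if now - seen > self.timeout
        )

    def check(self, now, restart):
        """Call restart(name) for every hung process."""
        for name in self.hung(now):
            restart(name)
            self.last_seen[name] = now     # fresh window for the restarted process


monitor = HeartbeatMonitor(timeout=5)
monitor.heartbeat("app", now=0)
monitor.heartbeat("db", now=3)

restarted = []
monitor.check(now=6, restart=restarted.append)   # "app" is 6 time units stale
```

In a real deployment the `restart` callback would relaunch the process from its initial state or from its last checkpoint, as described above.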

Checkpoint and Rollback Example: libft (cont.)
• libft is a user-level UNIX library of C functions that allows application programmers to:
  • Specify critical volatile data in an application (critical())
  • Take a checkpoint (checkpoint())
  • Recover the checkpointed data
  • Log events (ftread() and ftwrite() functions to log messages automatically)
  • Locate and reconnect a server (getsrvloc(), getsrvport(), ftconnect())
  • Do exception handling at the application level
  • Do N-version programming and use recovery blocks
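
The recovery-block scheme mentioned in the last bullet can be sketched in general terms: run a primary variant on a copy of the saved state, check its result with an acceptance test, and fall back to an alternate variant if the test fails. The code below illustrates the general technique only; it does not use or reproduce libft's C interface, and all names in it are hypothetical.

```python
import copy


def recovery_block(state, variants, acceptance_test):
    """Return the first variant result that passes the acceptance test."""
    for variant in variants:
        candidate = copy.deepcopy(state)   # every variant starts from the saved state
        try:
            result = variant(candidate)
        except Exception:
            continue                       # a crash counts as a failed variant
        if acceptance_test(result):
            return result
    raise RuntimeError("all variants failed the acceptance test")


def buggy_sort(items):
    items.sort()
    return items[:-1]                      # injected design fault: drops the largest item


def simple_sort(items):
    return sorted(items)


data = [3, 1, 2]


def acceptable(result):
    in_order = all(a <= b for a, b in zip(result, result[1:]))
    return len(result) == len(data) and in_order


result = recovery_block(data, [buggy_sort, simple_sort], acceptable)
```

Because each variant works on its own copy, the in-place damage done by the failed primary never reaches the original data, which plays the role of the checkpoint.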

Checkpoint and Rollback Example: libft (cont.)
• Flexible, portable, and reusable software components (watchd and libft) are employed to provide fault tolerance
• watchd and libft
  • can be embedded in any UNIX-based application software with minimal programming effort
  • can separate fault detection and volatile data recovery from the application functions
• Installation does not require any change to a UNIX-based operating system

Checkpoint and Rollback Example: libft (cont.)
• Cost and effectiveness
  • Performance overhead of 0.1% to 14%
  • Small programming effort to embed the FT components in the application software (weeks)
  • Reusability of software components allows more complicated schemes to be developed using the existing routines as a starting point
  • Can get significant coverage of non-design errors
  • Highly effective at tolerating certain bugs that would be expensive to fix
