Checkpoint Based Recovery from Power Failures

Christopher Sutardja, Emil Stefanov

Goals
• Consistent checkpoint
  • A consistent snapshot of memory at a specific time in the past.
• Safe even under power failure
  • The checkpoint is never "in transition".
• Small storage overhead
  • Not much more than double the memory.
• Low performance overhead
  • Should not stall the processor for too long.
• Scalable
  • Scales well in large core networks such as meshes.

Related Work
• On the feasibility of incremental checkpointing for scientific computing by J. Sancho et al.
  • Speculates about the future role of checkpointing in parallel machines.
  • As the number of processing nodes grows exponentially, a failure in at least one node becomes much more likely.
  • Error correction codes and other redundancies would introduce too much overhead when used alone.
  • As a result, research into checkpoint recovery is growing in importance.

Related Work
• Modular Checkpointing for Atomicity by L. Ziarek et al.
  • Introduces an abstraction called stabilizers to make checkpointing easier.
  • Targets message-passing machines.
    • Makes consistent checkpointing more challenging.

Related Work
• SafetyNet: improving the availability of shared memory multiprocessors with global checkpoint/recovery by D. Sorin et al.
  • Explores the concept of checkpointing in logical time.
  • Multiple checkpoints.
  • Each dirty cache line has a tag indicating when it was modified relative to a checkpoint.
  • Low execution overhead.
  • Not safe from power failures.

Related Work
• ReVive: cost-effective architectural support for rollback recovery in shared-memory multiprocessors by M. Prvulovic et al.
  • Explores different ways of rollback recovery in shared-memory multiprocessor systems. Considers:
    • the scope of the checkpoint
    • memory
    • the checkpointing mechanism
  • Achieves about 6% checkpointing overhead.
  • Not safe from power failures.
  • Not geared towards non-volatile memory: requires fast writes.

Related Work
• Efficient Initialization and Crash Recovery for Log-based File Systems over Flash Memory by Chin Wu et al.
  • As flash memory becomes cheaper and denser, the uses for flash increase.
  • Uses flash for recovering file systems.
  • Yet another use of flash for recovery.
  • Uses a log-based method to accelerate remounting after a system crash by minimizing the amount of information that has to be changed upon reboot.

[Figure: baseline architecture. A core with L1 and L2 caches, connected to memory controllers, each attached to DRAM.]

[Figure: architecture with checkpointing. The core, the L1 cache, and the L2 cache each have Checkpoint A and Checkpoint B storage; each cache has a Cache Checkpoint Controller; a Checkpoint Coordinator; each DRAM has a DRAM Checkpointer containing an address decoder, buffers, logs, and checkpoints.]

Checkpointing Techniques
• For caches and cores:
  • Each cache/core has two flash storages adjacent to it.
    • One is for the previous checkpoint.
    • One is for the current checkpoint.
  • During a checkpoint, the cache/core internal state is copied to flash storage.
• For DRAM:
  • The checkpointing system snoops on DRAM.
  • DRAM changes are continuously logged to flash memory.
  • A chain of parallel buffers ensures that DRAM checkpointing almost never causes a stall.
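
The DRAM logging path described above can be sketched as a toy software model. This is only an illustration, not the authors' hardware design: the class name `DramCheckpointer`, the modulo address decoder, the buffer count, and the buffer capacity are all assumptions made for the example.

```python
from collections import deque


class DramCheckpointer:
    """Toy model: snooped DRAM writes are buffered, then appended to a flash log."""

    def __init__(self, num_buffers=4, buffer_capacity=8):
        # One buffer and one flash log per lane (assumed layout).
        self.buffers = [deque() for _ in range(num_buffers)]
        self.logs = [[] for _ in range(num_buffers)]
        self.buffer_capacity = buffer_capacity
        self.stalls = 0

    def _decode(self, address):
        # Hypothetical address decoder: interleave addresses across lanes.
        return address % len(self.buffers)

    def snoop_write(self, address, value):
        lane = self._decode(address)
        if len(self.buffers[lane]) >= self.buffer_capacity:
            # The lane is full, so this write has to wait for a drain.
            self.stalls += 1
            self.drain(lane)
        self.buffers[lane].append((address, value))

    def drain(self, lane):
        # Move buffered changes to the flash log in arrival order.
        while self.buffers[lane]:
            self.logs[lane].append(self.buffers[lane].popleft())

    def flush_all(self):
        # Used at a checkpoint: every buffer is emptied into its log.
        for lane in range(len(self.buffers)):
            self.drain(lane)
```

Spreading writes over several lanes means a single full buffer only delays writes that map to that lane, which is one plausible reading of why parallel buffers keep stalls rare.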

Responsibilities of the Main Components
• Checkpoint Coordinator
  • Notifies the nodes and DRAM checkpointers that a checkpoint is beginning.
• DRAM Checkpointer
  • Continuously logs DRAM changes.
  • Checkpoints when instructed by the coordinator.
• Cache Checkpoint Controller
  • Checkpoints the adjacent cache when instructed by the coordinator.

Steps for Checkpointing (1 of 2)
• The coordinator sets the checkpoint signal to 1.
• In parallel, each:
  • Core:
    • Pauses processing instructions.
    • Copies internal state to flash memory.
  • Cache Checkpoint Controller:
    • Copies cache internal state to flash memory (data is copied one line at a time).
  • DRAM Checkpointer:
    • Flushes its buffer to the flash log.
    • Notifies the checkpoint coordinator that the buffer has been flushed.

Steps for Checkpointing (2 of 2)
• The coordinator sets the checkpoint signal to 0.
• In parallel, each:
  • Core:
    • Flips a flash memory bit to indicate the new checkpoint buffer.
  • Cache Checkpoint Controller:
    • Flips a flash memory bit to indicate the new checkpoint buffer.
  • DRAM Checkpointer:
    • Marks the checkpoint boundary in the flash log.
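
The two phases can be illustrated with a minimal sketch of one core or cache with two flash slots (A and B) and a selector bit. The names `CheckpointedNode`, `begin_checkpoint`, `commit_checkpoint`, and `run_checkpoint` are hypothetical, and seeding slot A with the initial state is an assumption made so that a valid checkpoint always exists.

```python
class CheckpointedNode:
    """Toy model of a core or cache with two flash checkpoint slots."""

    def __init__(self, state):
        self.state = dict(state)
        # Slot 0 is "Checkpoint A", slot 1 is "Checkpoint B".
        self.slots = [dict(state), None]
        # Flash bit: index of the last complete checkpoint.
        self.active = 0
        self.paused = False

    def begin_checkpoint(self):
        # Phase 1 (signal = 1): pause and copy state into the spare slot.
        self.paused = True
        spare = 1 - self.active
        self.slots[spare] = dict(self.state)

    def commit_checkpoint(self):
        # Phase 2 (signal = 0): flip the bit so the spare slot becomes current.
        self.active = 1 - self.active
        self.paused = False

    def recover(self):
        # After a power failure, only the slot named by the bit is trusted.
        self.state = dict(self.slots[self.active])


def run_checkpoint(nodes):
    # Stand-in for the coordinator: all copies finish before any bit flips.
    for node in nodes:
        node.begin_checkpoint()
    for node in nodes:
        node.commit_checkpoint()
```

Because the bit flips only after the copy is complete, a failure during phase 1 leaves the bit pointing at the older, intact slot. That is how this sketch keeps the checkpoint from ever being "in transition".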

[Figure: the core, the L1 cache, and the L2 cache, each with Checkpoint A and Checkpoint B storage; each cache has a Cache Checkpoint Controller. Blocks labeled "F" mark each checkpoint store.]

[Figure: DRAM checkpointer. An address decoder feeds parallel buffers (buffered changes), which drain into logs and checkpoints. The log is divided by start and end markers into previous checkpoint changes and next checkpoint changes; the previous checkpoint is stored for random access.]

Recovering
• Determining which checkpoint to use
  • The system checks which checkpoint is the most recent.
  • If the most recent checkpoint was in progress during the crash, the older checkpoint is used.
• Restoring the previous state
  • Each architectural register is rewritten.
  • Each cache is written to from its adjacent flash buffer (one cache line at a time).
  • Main memory is recovered.
    • Takes advantage of pipelined writes if available.
• Resuming execution
  • Resume from the saved program counter.
  • Notify the CPUs that the system is restoring from a checkpoint (single bit).
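
Main-memory recovery can be sketched as replaying the flash log up to the last checkpoint boundary. The slides do not spell out the replay procedure, so this is an assumption: the log is modeled as a list of `(address, value)` entries, `BOUNDARY` is a hypothetical marker object, and `recover_memory` is an illustrative name.

```python
BOUNDARY = object()  # hypothetical marker written at each completed checkpoint


def recover_memory(base_image, log):
    """Rebuild memory from a base image plus logged changes.

    Entries after the last boundary belong to a checkpoint that was still
    in progress at the time of the failure, so they are ignored.
    """
    last_boundary = -1
    for index, entry in enumerate(log):
        if entry is BOUNDARY:
            last_boundary = index

    memory = dict(base_image)
    for entry in log[: last_boundary + 1]:
        if entry is BOUNDARY:
            continue
        address, value = entry
        memory[address] = value  # later writes override earlier ones
    return memory
```

For example, with a base image `{0: 1, 1: 2}` and a log `[(0, 5), BOUNDARY, (1, 9)]`, the write to address 1 is discarded because no boundary follows it.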
