
A Fault Tolerant Protocol for Massively Parallel Machines


Presentation Transcript


  1. A Fault Tolerant Protocol for Massively Parallel Machines Sayantan Chakravorty Laxmikant Kale University of Illinois, Urbana-Champaign

  2. Outline • Motivation • Background • Design • Protocols • Results • Summary • Future Work

  3. Motivation • As machines grow in size, MTBF decreases • Applications have to tolerate faults • Checkpoint/rollback doesn't scale • All nodes are rolled back just because one crashed • Even nodes independent of the crashed node are restarted • Restart cost is similar to the checkpoint period

  4. Requirements • Fast and scalable checkpoints • Fast restart • Only the crashed processor should be restarted • Minimize the effect on fault-free processors • Restart cost less than the checkpoint period • Low fault-free runtime overhead • Transparent to the user

  5. Background • Checkpoint-based methods • Coordinated – blocking [Tamir84], non-blocking [Chandy85] • Co-check, Starfish, Clip – fault-tolerant MPI implementations • Uncoordinated – suffers from rollback propagation • Communication-induced – [Briatico84], doesn't scale well • Log-based methods • Pessimistic – MPICH-V1 and V2, SBML [Johnson87] • Optimistic – [Strom85], unbounded rollback, complicated recovery • Causal logging – [Elnozahy93] Manetho, complicated causality tracking and recovery

  6. Design • Message logging • Sender-side message logging • Asynchronous checkpoints • Each processor has a buddy processor • Stores its checkpoint in the buddy's memory • Processor virtualization • Used to speed up restart
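
  A minimal C++ sketch of the per-processor bookkeeping this design implies; the buddy choice and all names here are illustrative assumptions, not the actual Charm++ runtime code.

    #include <map>
    #include <string>
    #include <vector>

    struct ProcessorBookkeeping {
        int pe = 0;                                  // this physical processor
        int numPes = 1;
        std::vector<std::string> senderSideLog;      // copies of messages this processor has sent
        std::map<int, std::string> heldCheckpoints;  // checkpoints kept here for processors whose buddy we are

        int buddy() const { return (pe + 1) % numPes; }  // hypothetical buddy assignment
    };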

  7. System Implementation: Processor Virtualization [diagram: user view of many virtual processors mapped by the runtime onto physical processors] • Charm++ • Parallel C++ with data-driven objects (chares) • Runtime maps objects to physical processors • Asynchronous method invocation • Adaptive MPI • Implemented on Charm++ • Multiple virtual processors on a physical processor
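
  A generic illustration of processor virtualization in plain C++ (not the Charm++ runtime API): the runtime keeps a mapping from virtual processors to physical ones, and migration is simply an update of that mapping.

    #include <map>

    class VirtualizationMap {
    public:
        void place(int vp, int pe)    { vpToPe_[vp] = pe; }       // initial mapping chosen by the runtime
        void migrate(int vp, int pe)  { vpToPe_[vp] = pe; }       // move a virtual processor (e.g., during restart)
        int  physicalOf(int vp) const { return vpToPe_.at(vp); }  // where to deliver a message for this vp
    private:
        std::map<int, int> vpToPe_;  // virtual processor -> physical processor
    };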

  8. Benefits of Virtualization • Latency Tolerant • Adaptive overlap of communication and computation • Supports migration of virtual processors

  9. Message Logging Protocol • Correctness: messages must be processed in the same order before and after the crash • Problem: without extra information, a restarted processor may process the messages from A, B and C in a different order than it did before the crash [diagram: message orderings at A, B and C before and after the crash]

  10. Message Logging (contd.) • Solution: fix an order the first time and always follow it • The receiver gives each message a ticket number • Messages are processed in order of ticket number • Each message contains: • Sender ID – who sent it • Receiver ID – to whom it was sent • Sequence Number (SN) – together with the sender and receiver IDs, uniquely identifies a message • Ticket Number (TN) – decides the order of processing
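
  A small sketch of these per-message fields in plain C++ (the struct and field names are mine, not the implementation's):

    #include <cstdint>

    struct MessageEnvelope {
        int      senderId;    // which chare sent the message
        int      receiverId;  // which chare it is addressed to
        uint64_t sn;          // sequence number: with the sender/receiver IDs, uniquely identifies the message
        uint64_t tn;          // ticket number: fixes the order in which the receiver processes messages
    };

  The receiver processes messages in increasing TN order, so the order replayed after a restart matches the pre-crash order.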

  11. Message to Remote Chares [diagram: the sender chare P sends a ticket request <Sender, SN> to the receiver chare Q; Q replies with <SN, TN, Receiver>; P then sends the message <SN, TN, Message>] • If <sender, SN> has been seen earlier, its existing TN is reused and marked as received • Otherwise the receiver creates a new TN and stores <sender, SN, TN>
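
  A minimal sketch of the receiver-side ticketing logic, assuming a simple table keyed by <sender, SN>; this is illustrative C++, not the actual Charm++ code:

    #include <cstdint>
    #include <map>
    #include <utility>

    class TicketTable {
    public:
        // Returns the TN for this <sender, SN>; assigns a fresh one if the pair is unseen.
        uint64_t requestTicket(int sender, uint64_t sn) {
            auto key = std::make_pair(sender, sn);
            auto it = seen_.find(key);
            if (it != seen_.end())
                return it->second;        // duplicate request: reuse the old ticket
            uint64_t tn = nextTn_++;      // first time: hand out the next ticket number
            seen_[key] = tn;
            return tn;
        }
    private:
        std::map<std::pair<int, uint64_t>, uint64_t> seen_;
        uint64_t nextTn_ = 0;
    };

  Reusing the stored TN for a duplicate <sender, SN> is what lets a message retransmitted after a crash take the same place in the processing order.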

  12. Message to Local Chare • Multiple chares live on one processor • If the processor crashes, all trace of a local message is lost • After restart the message must get the same TN as before • So <sender, receiver, SN, TN> is stored on the buddy processor [diagram: chares P and Q on processor R exchange <sender, SN>, <SN, TN, Receiver> and <SN, TN, Message> locally; R sends <sender, receiver, SN, TN> to its buddy, which acknowledges it]
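
  For local messages the extra step is saving the <sender, receiver, SN, TN> tuple on the buddy before the message is processed; a hedged sketch in which buddyStore stands in for memory on the buddy processor and its acknowledgement:

    #include <cstdint>
    #include <vector>

    struct LocalTicketRecord { int sender, receiver; uint64_t sn, tn; };

    // Stand-in for the buddy's memory; in the real protocol this lives on another
    // node and the sending processor waits for an ack before proceeding.
    std::vector<LocalTicketRecord> buddyStore;

    void recordLocalMessage(int sender, int receiver, uint64_t sn, uint64_t tn) {
        buddyStore.push_back({sender, receiver, sn, tn});  // survives a crash of this processor
    }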

  13. Checkpoint Protocol • A processor asynchronously decides to checkpoint • It packs up the state of all its chares and sends it to its buddy • Message logs are part of a chare's state • Message logs on the senders can then be garbage collected • Deciding when to checkpoint is an interesting problem
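
  A hedged sketch of the checkpoint step, under the simplifying assumption that chare state serializes to strings; buddyMemory and MY_PE are illustrative stand-ins, not runtime API:

    #include <map>
    #include <string>
    #include <vector>

    struct Chare {
        std::string state;                 // application data
        std::vector<std::string> msgLog;   // sender-side message log, checkpointed with the chare
    };

    constexpr int MY_PE = 0;                   // illustrative processor id
    std::map<int, std::string> buddyMemory;    // stand-in for the buddy's RAM, keyed by processor id

    void checkpointToBuddy(const std::map<int, Chare>& myChares) {
        std::string snapshot;
        for (const auto& kv : myChares) {
            const Chare& c = kv.second;
            snapshot += c.state;                           // pack the chare's state
            for (const auto& m : c.msgLog) snapshot += m;  // logs travel with the chare
        }
        buddyMemory[MY_PE] = snapshot;   // kept in the buddy's memory, not on disk
        // Once this checkpoint is acknowledged, other processors may garbage
        // collect logged messages that were addressed to these chares.
    }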

  14. Reliability • Only one scenario in which our protocol fails • Processor X (the buddy of Y) crashes and restarts • The checkpoint of Y is lost • Y then crashes before saving a new checkpoint • This is the cost of not assuming reliable nodes for storing checkpoints • Still increases reliability by orders of magnitude • The probability can be minimized by having Y checkpoint as soon as X crashes and restarts
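
  A rough back-of-the-envelope estimate (an assumption of mine, not a figure from the talk): if processor failures are independent with mean time between failures M, and w is the window from X's crash until Y's next checkpoint completes, then roughly

    P(\text{Y also fails in the window} \mid \text{X crashes}) \approx \frac{w}{M}, \qquad w \ll M

  so forcing Y to checkpoint as soon as X restarts keeps w, and hence this residual failure probability, small.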

  15. Basic Restart Protocol • After a crash, the Charm++ process is restarted on a new processor • It gets its checkpoint and local-message log from its buddy • The chares are restored and the other processors are informed of their new location • Logged messages for chares on the restarted processor are resent • The highest TN seen from each crashed chare is also sent • The messages are reprocessed by the restarted chares • Local messages first check the restored local-message log, so they get the same TNs as before the crash
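
  The sequence can be sketched as follows; every helper here is a hypothetical stand-in for one of the bullets above, not real Charm++ API:

    #include <algorithm>
    #include <cstdint>
    #include <map>
    #include <string>
    #include <vector>

    struct Envelope { int sender, receiver; uint64_t sn, tn; std::string payload; };

    std::map<int, std::string> fetchCheckpointFromBuddy(int crashedPe)     { return {}; }
    std::vector<Envelope>      fetchLocalTicketLogFromBuddy(int crashedPe) { return {}; }
    void broadcastNewLocation(int crashedPe, int newPe)                    {}
    std::vector<Envelope>      collectResentMessages(int newPe)            { return {}; }  // others resend logged messages and highest TNs
    void process(const Envelope&)                                          {}

    void restart(int crashedPe, int newPe) {
        auto chares   = fetchCheckpointFromBuddy(crashedPe);      // restore chares from the buddy's checkpoint
        auto localLog = fetchLocalTicketLogFromBuddy(crashedPe);  // restored <sender, receiver, SN, TN> tuples
        broadcastNewLocation(crashedPe, newPe);                   // tell other processors where the chares now live
        auto msgs = collectResentMessages(newPe);                 // logged messages come back to us
        std::sort(msgs.begin(), msgs.end(),
                  [](const Envelope& a, const Envelope& b) { return a.tn < b.tn; });
        for (const auto& m : msgs) process(m);                    // replay in the pre-crash (TN) order
        (void)chares; (void)localLog;   // local messages would consult localLog to recover their old TNs
    }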

  16. Parallel Restart • Message logging allows fault-free processors to continue with their execution • However, sooner or later some processors start waiting for the crashed processor • Virtualization allows us to move work from the restarted processor to the waiting processors • Chares are restarted in parallel • Restart cost can be reduced
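
  An illustrative sketch of spreading the restored chares over otherwise-idle processors; the round-robin placement is my assumption for illustration, not the policy from the talk:

    #include <vector>

    // Assumes waitingPes is non-empty: the processors currently blocked on the crashed one.
    std::vector<int> assignRestartWork(int numRestoredChares, const std::vector<int>& waitingPes) {
        std::vector<int> placement(numRestoredChares);
        for (int c = 0; c < numRestoredChares; ++c)
            placement[c] = waitingPes[c % static_cast<int>(waitingPes.size())];  // chare c recovers here
        return placement;
    }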

  17. Present Status • Most of Charm++ has been ported • Support for migration has not yet been implemented in the fault tolerant protocol • Simple AMPI programs work • Barriers remain to be done • Parallel restart is not yet implemented

  18. Experimental Evaluation • The NAS benchmarks could not be used • Used a 5-point stencil computation with a 1-D decomposition • Cluster of 8 quad-processor 500 MHz Pentium III nodes with 500 MB of RAM per node, connected by Ethernet

  19. Overhead Measurement of overhead for an application with low communication to computation ratio

  20. Measurement of overhead for an application with high communication to computation ratio

  21. Recovery Performance Execution Time with increasing number of faults on 8 processors (Checkpoint period 30s)

  22. Summary • Designed a fault tolerant protocol that • Performs fast checkpoints • Performs fast parallel restarts • Doesn't depend on any completely reliable node • Supports multiple faults • Minimizes the effect of a crash on fault free processors • Partial implementation of the protocol

  23. Future Work • Include support for migration in the protocol • Parallel restart • Extend to AMPI • Test with NAS benchmark • Study the tradeoffs involved in deciding the checkpoint period
