
CS599 Presentation: A Fault Detection Service for Wide Area Distributed Computations


Presentation Transcript


  1. CS599 Presentation A Fault Detection Service for Wide Area Distributed Computations P. Stelling, I. Foster, C. Kesselman, C. Lee, G. von Laszewski, Proc. 7th IEEE Symp. on High Performance Distributed Computing, 268-278, 1998. Presenter: Narayanan Sadagopan

  2. Outline • Motivation • Design goals • Heartbeat Monitor • Performance • Future Work • References

  3. Motivation • Why is fault tolerance needed? (especially in distributed computing as opposed to sequential computing) • Wide area distributed computing enables users to access a variety of computational resources. • Wide area distributed computing has given rise to a new class of applications, e.g., distributed simulations, tele-immersion, etc. • Also, existing applications are being modified so that they can run on a computational grid.

  4. Motivation • Why is fault tolerance needed? (contd.) • Different components (modules) of the application may run on different computational resources. • Any of these components can fail, and the other components may not learn about it. • The application may still want to continue • Fault tolerance is needed!

  5. Motivation • Two approaches to fault tolerance • Leave it to the application. • Application programmers will have to consider • Detecting several failure conditions. • Recovering from these failures. • This means a lot of extra code (which may obscure the application logic) • Provide it as a basic service of wide area distributed computing (like resource discovery and the information service)

  6. Design Goals • How basic should this service be? • Two possible approaches • Should we provide mechanisms for fault detection and recovery? • Should we provide just fault detection and let the application handle recovery?

  7. Design Goals • However, recovery from faults may be application specific. • On detecting a fault, an application may: • Terminate (fail-stop) • Ignore the fault • Continue by allocating a new resource.

  8. Design Goals • Guiding principles • “End-to-End Arguments in System Design”, J.H. Saltzer, D.P. Reed, D.D. Clark • (MIT Laboratory for Computer Science) • If a functionality can be implemented at two or more levels, provide the most basic functionality at the lower level and let the upper levels take care of the rest, i.e., push the functionality towards the application.

  9. Design Goals • Guiding principles (contd.) • Result from the theory of distributed systems • Consensus algorithm • All participating processes propose values and must unanimously agree upon one of them • Useful in leader election and voting, which are essential fault tolerance functions.

  10. Design Goals • Result about consensus • Consensus cannot be solved in an asynchronous environment (where no timing assumptions can be made) if even one process may fail • This model corresponds well to the current Internet

  11. Design Goals • Guiding Principles (contd.) • Implication • The least we can do is provide a mechanism whereby the participating components can be informed of faults, i.e., provide fault detection as a basic service in the wide area distributed computing environment.

  12. Design Goals • Fault Detector Reliability • We are operating over the Internet! • Variable communication latency. • Information about the fault should be timely. • Best effort traffic • Information should reach its destination in the first place.

  13. Design Goals • Fault Detector Reliability (contd.) • Should we add a lot of overhead by building a reliable fault detector? • Result from the consensus literature: • an unreliable failure detector is sufficient.

  14. Design Goals • Fault Detector Reliability (contd.) • What is an “unreliable” detector? • It may erroneously report the failure of a component, only to correct the error later.

  15. Design Goals • What is an unreliable detector? (contd.) • Advantages • Well suited for distributed fault detection, as less consistency is needed. • An unreliable protocol => greater scalability, less overhead, lower latency.

  16. Design Goals • What is an unreliable detector? (contd.) • How unreliable can the fault detector be? • All failed components are eventually identified. • At least one component should be identified as functioning by all other functioning components!

  17. Design Goals • Fault Detector Reliability (contd.) • Trade-offs • Should an application wait longer for the fault detector to stabilize? • A longer wait => lower probability of error. • But the reaction time increases.

  18. Design Goals • Tradeoffs (contd.) • Should an application react quickly to fault reports (possibly false)? • Faster response to faults. • Effort spent in recovery may be wasted. • The final decision should be made by the application by comparing the relative costs (a toy cost comparison is sketched below).
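
The cost comparison in the last bullet can be made concrete with a toy example. The numbers and the decision rule below are illustrative assumptions, not figures from the paper; they only show how an application might weigh a possibly wasted recovery against the cost of waiting for the detector to stabilize.

    /* Toy decision rule for reacting to a (possibly false) fault report.
     * All figures are hypothetical; the paper leaves this policy to the application. */
    #include <stdio.h>

    int main(void)
    {
        double p_false       = 0.05;  /* estimated probability the report is a false positive */
        double cost_recovery = 30.0;  /* seconds of work wasted by a needless recovery         */
        double cost_waiting  = 10.0;  /* seconds lost waiting for the detector to stabilize    */

        double react_now = p_false * cost_recovery;  /* expected cost of reacting immediately */
        double wait      = cost_waiting;             /* cost of waiting for confirmation      */

        printf("react now: %.1f s expected, wait: %.1f s -> %s\n",
               react_now, wait, react_now < wait ? "react immediately" : "wait");
        return 0;
    }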

  19. Design Goals • System Model • Theoretically, all components (disks, processors, network cards, threads, processes) can be monitored. • However, in practice only processes & hosts are monitored. • Assumption: a fault at a lower level can be detected by the corresponding higher level, e.g., a thread crash can be detected by its process, which will then terminate. • The effort spent monitoring low-level entities is not worth the gain achieved!

  20. Design Goals • Fault Detector Characteristics • Scalable • should perform well as the number of processes increases • Accurate and complete • should identify faults correctly & minimize false reports • Timeliness • Fault information should be timely so that applications can respond quickly.

  21. Design Goals • Low overhead • Monitoring should not impact the performance of processes, hosts, network, etc. • Flexibility • Application should be able to decide • entities to be monitored • frequency of monitoring • criteria for failure

  22. Heartbeat Monitor (HBM) • Heart Beat Monitor (HBM): the fault detector in Globus • Architecture • 3 components • HBM Local Monitor (HBMLM) • runs on each host that is part of the Globus environment • monitors registered processes and the host itself • sends “heartbeats” to external agents (data collectors); a sketch of the heartbeat path follows.
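
To make the heartbeat path concrete, here is a minimal sketch of the sending side in C. The message layout, collector address, and function name are assumptions for illustration; the real HBMLM wire format is defined by the Globus heartbeat specification listed in the references.

    /* Hypothetical sketch of an HBMLM-style heartbeat sender (not the Globus wire format). */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    static int send_heartbeat(const char *collector_ip, int collector_port,
                              const char *host, int pid, const char *status)
    {
        char msg[256];
        struct sockaddr_in dst;
        int sock = socket(AF_INET, SOCK_DGRAM, 0);   /* heartbeats travel over UDP */
        if (sock < 0)
            return -1;

        memset(&dst, 0, sizeof dst);
        dst.sin_family = AF_INET;
        dst.sin_port   = htons(collector_port);
        inet_pton(AF_INET, collector_ip, &dst.sin_addr);

        /* One datagram per registered process: host, pid, and current status. */
        snprintf(msg, sizeof msg, "HEARTBEAT host=%s pid=%d status=%s", host, pid, status);
        sendto(sock, msg, strlen(msg), 0, (struct sockaddr *)&dst, sizeof dst);
        close(sock);
        return 0;
    }

    int main(void)
    {
        /* Example: one heartbeat for pid 1234 to a hypothetical data collector. */
        return send_heartbeat("192.0.2.10", 20000, "node1.example.org", 1234, "active");
    }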

  23. HBM • Architecture (contd.) • HBM Client Library (HBMCL) • used by processes to register and unregister with the HBMLM • only registered processes (Heartbeat Clients - HBCs) are monitored • HBM Data Collector (HBMDC) • Usually application specific. • Receives “heartbeats” sent by the HBMLMs.

  24. HBM • HBMDC (contd.) • Maintains and updates the status of the components/processes. • Infers failure of a component after a heartbeat times out. • Uses the HBMDC API to register callbacks, which are executed on normal conditions or exceptions (a sketch of this timeout-and-callback logic follows).
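
A minimal sketch of the timeout-and-callback logic described above. The structure, field names, and callback signature are illustrative assumptions; the actual HBMDC API is only named, not shown, in the slides.

    /* Hypothetical sketch of HBMDC-style failure inference from missed heartbeats. */
    #include <stdio.h>
    #include <time.h>

    typedef void (*hb_callback)(const char *host, int pid, const char *event);

    struct monitored {
        const char *host;
        int         pid;
        time_t      last_heartbeat;   /* arrival time of the most recent heartbeat    */
        int         interval_secs;    /* heartbeat interval requested at registration */
        int         missing_overdue;  /* missed heartbeats tolerated before "overdue" */
        hb_callback on_overdue;       /* callback registered by the application       */
    };

    static void report(const char *host, int pid, const char *event)
    {
        printf("%s pid %d: %s (possibly a false positive)\n", host, pid, event);
    }

    /* Called periodically from the data collector's event loop. */
    static void check_entry(const struct monitored *m, time_t now)
    {
        int missed = (int)((now - m->last_heartbeat) / m->interval_secs);
        if (missed >= m->missing_overdue)
            m->on_overdue(m->host, m->pid, "heartbeat overdue");
    }

    int main(void)
    {
        struct monitored m = { "node1", 1234, time(NULL) - 60, 10, 3, report };
        check_entry(&m, time(NULL));  /* 60 s of silence at a 10 s interval -> overdue */
        return 0;
    }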

  25. HBM Architecture [architecture diagram]

  26. HBM Client Library (HBMCL) • Two forms • Library linked to the client program • APIs • globus_module_activate, globus_hbm_client_register(), globus_hbm_client_unregister_all(), globus_module_deactivate • Standalone process • program: globus-hbm-client-register • accepts a process id as an argument • (an assumed usage sketch of the library calls follows)
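
The library calls listed above could be used roughly as follows. The call order comes from the slide, but the prototypes and arguments below are assumptions made only for illustration; the Globus heartbeat specification in the references defines the real API.

    /* Hedged sketch of the HBMCL call sequence. The function names come from the
     * slide; the prototypes and arguments are ASSUMED, not the real declarations. */
    int  globus_module_activate(void *module);
    int  globus_hbm_client_register(const char *data_collector, int port);
    void globus_hbm_client_unregister_all(void);
    int  globus_module_deactivate(void *module);

    int main(void)
    {
        globus_module_activate(0);                          /* activate the HBM client module       */
        globus_hbm_client_register("collector.example.org", /* register this process with the       */
                                   0);                      /* local HBMLM, naming a data collector */

        /* ... application work; the HBMLM monitors this process externally ... */

        globus_hbm_client_unregister_all();                 /* normal termination: de-register */
        globus_module_deactivate(0);
        return 0;
    }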

  27. HBMCL • Life Cycle • Initialization • reads the TCP port number from a config file • Registering with the HBMLM (either via the program or the API) • uses TCP • Multiple data collectors can be named • in a single register message, or • in multiple register messages

  28. HBMCL • Execution • monitored by HBMLM • no overhead on the client (HBC) • Termination • Normal termination • Process de-registers with the HBMLM

  29. HBMCL • Termination (contd.) • Abnormal termination • Process is not able to de-register • HBMLM marks the process as “abended” • All the interested HBMDCs are informed about the termination => deregistration is not selective. • HBMLM deletes the entry for that process (client)

  30. [Diagram: HBMCL life cycle: Startup, Register with the HBMLM, Execute (status queries by the HBMLM), De-Register, Terminate]

  31. HBM Local Monitor (HBMLM) • one HBMLM per host • program: globus-hbm-localmonitor • already active on any host that is part of the Globus/Gusto testbed => it does not need to be started manually by users. • Life Cycle • Initialization • Reads the TCP & UDP port numbers from a configuration file (a hypothetical config-reading sketch follows). • TCP is used for the register/de-register messages • UDP is used for sending “heartbeats” to the data collectors.
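
The slide says both port numbers come from a configuration file. The parsing sketch below assumes a simple "key value" file format that is not specified in the slides; it only illustrates the initialization step.

    /* Hypothetical config parsing for the local monitor; the file format is invented. */
    #include <stdio.h>

    static int read_ports(const char *path, int *tcp_port, int *udp_port)
    {
        char line[128];
        FILE *f = fopen(path, "r");
        if (!f)
            return -1;
        while (fgets(line, sizeof line, f)) {
            sscanf(line, "tcp_port %d", tcp_port);  /* register/unregister traffic */
            sscanf(line, "udp_port %d", udp_port);  /* outgoing heartbeats         */
        }
        fclose(f);
        return 0;
    }

    int main(void)
    {
        int tcp = 0, udp = 0;
        if (read_ports("hbmlm.conf", &tcp, &udp) == 0)  /* hypothetical file name */
            printf("tcp=%d udp=%d\n", tcp, udp);
        return 0;
    }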

  32. HBMLM • Initialization (contd.) • If ports other than the defaults are used, the configuration file must be updated so that the HBCs know the port number. • Restores state from a checkpoint file • Reads all the registered processes, their state, the time for the next heartbeat, etc. • Initializes internal tables.

  33. HBMLM • Life Cycle (contd.) • Execution loop (sketched below) • Verifies the state of registered processes (using the ps command) • Sends “heartbeats” to the interested data collectors. • Checkpoints the state tables. • Waits for register/unregister messages (using a timed select mechanism).
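
A sketch of that loop's shape, assuming a 5-second wake-up interval and stubbed-out helpers; the real HBMLM's intervals and internals are not given in the slides.

    /* Hypothetical sketch of the HBMLM execution loop: check processes, send
     * heartbeats, checkpoint, then block on a timed select() for control messages. */
    #include <sys/select.h>

    static void verify_registered_processes(void) { /* e.g., parse `ps` output      */ }
    static void send_due_heartbeats(void)         { /* UDP datagrams to collectors  */ }
    static void checkpoint_state_tables(void)     { /* persist state tables to disk */ }

    void hbmlm_loop(int listen_fd)
    {
        for (;;) {
            fd_set readfds;
            struct timeval timeout = { .tv_sec = 5, .tv_usec = 0 };  /* assumed interval */

            verify_registered_processes();
            send_due_heartbeats();
            checkpoint_state_tables();

            /* Wait for register/unregister messages on the TCP socket, or time out. */
            FD_ZERO(&readfds);
            FD_SET(listen_fd, &readfds);
            if (select(listen_fd + 1, &readfds, NULL, NULL, &timeout) > 0) {
                /* accept() and process a register/unregister request here */
            }
        }
    }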

  34. HBMLM • How are register messages processed? • A register message contains the HBMDC identity, the heartbeat interval, and the e-mail address of the user (see the sketch below) • On receiving a register message • Creates a working entry for the client & the HBMDC • Schedules heartbeats to be sent to that HBMDC at regular intervals
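
An in-memory view of those three fields, as a C struct. The grouping and field names are illustrative; the actual wire encoding is given by the heartbeat specification in the references.

    /* Hypothetical in-memory view of a register message (fields from the slide). */
    struct hbm_register_msg {
        char hbmdc_host[256];          /* identity of the data collector to send heartbeats to */
        int  heartbeat_interval_secs;  /* how often heartbeats should be sent                   */
        char user_email[256];          /* address notified on abnormal termination              */
    };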

  35. HBMLM • How are unregister messages processed? • Informs the appropriate HBMDCs (making up to 5 attempts; a retry sketch follows) • Clears the working entry for the client • Abnormal termination • The HBMLM informs the HBMDCs about the abnormality • The HBMDCs (in Globus) send an e-mail notification to the user
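
The "5 attempts" notification can be pictured as a bounded retry loop; the acknowledgement scheme and function below are assumptions for illustration.

    /* Sketch of best-effort notification with a bounded number of attempts. */
    static int notify_collector(const char *collector, const char *msg)
    {
        (void)collector; (void)msg;
        return -1;  /* stub: return 0 on acknowledged delivery */
    }

    void notify_with_retries(const char *collector, const char *msg)
    {
        for (int attempt = 0; attempt < 5; attempt++)
            if (notify_collector(collector, msg) == 0)
                return;  /* delivered */
        /* Give up after 5 attempts; the collector will time the client out
           anyway, since unreliable detection is acceptable. */
    }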

  36. [Diagram: Data Collectors 1-5 and one HBMLM; Process 1 is registered with collectors 4 and 5, Process 2 with collector 3, Process 3 with collector 2, and Process 4 with collector 1]

  37. HBMDC • The HBMDC is built using the HBMDC APIs • The data collector is usually application specific. • HBMDC Implementation • Some configurable parameters: • Callbacks: functions that are executed when certain events occur. • heartbeats_missing_overdue: threshold before a process heartbeat is marked overdue

  38. HBMDC • heartbeats_missing_shutdown: threshold before a process is marked as “abended” • network_variance_allowance_secs: slack allowed before a heartbeat is considered missing • (These tunables are grouped in the sketch below.)
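
Grouped together, the three tunables look like the configuration struct below; the field names follow the slides, while the struct itself is only an illustration.

    /* Sketch of the HBMDC tunables named above, grouped as a configuration struct. */
    struct hbmdc_config {
        int heartbeats_missing_overdue;       /* missed beats before a process is "overdue" */
        int heartbeats_missing_shutdown;      /* missed beats before it is marked "abended" */
        int network_variance_allowance_secs;  /* slack before a beat counts as missing      */
    };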

  39. Performance • False Positives • According to the experiments, most failure reports are delivered • Loss of failure reports is partly due to network partitioning and not UDP! • The failure criteria of the application determine the rate of “false positives” • As the timeout interval decreases, the false positive rate increases. • For lower timeout intervals, heartbeat rates should be higher (a toy loss model is sketched below)
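
The link between timeout and false positive rate can be illustrated with a toy loss model. Assuming (purely for illustration) that each UDP heartbeat is lost independently with probability p, a false positive needs k consecutive losses, where k is the number of missed heartbeats the timeout tolerates:

    /* Toy model, not a result from the paper: false positive probability when the
     * timeout tolerates k consecutive missed heartbeats and datagrams are lost
     * independently with probability p. Shorter timeouts (smaller k) raise the rate. */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double p = 0.01;  /* assumed per-datagram loss rate */
        for (int k = 1; k <= 4; k++)
            printf("timeout = %d missed beats -> false positive prob ~ %g\n", k, pow(p, k));
        return 0;
    }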

  40. Performance • Utilization • Low CPU utilization at both the HBMLM & HBMDC • Low network overhead due to use of UDP

  41. Future Work • Adapting heartbeat frequency in response to losses in failure reports • Adapting failure criteria in the light of network dynamics

  42. References • http://www.globus.org/hbm/heartbeat_spec.html
