
Fault Tolerance


alessa

Presentation Transcript


  1. Fault Tolerance CSCI 4780/6780

  2. Failures in Distributed Systems
     • Partial failures – characteristic of distributed systems
     • Goals:
       • Construct systems which can automatically recover from partial failures
       • System should operate in an acceptable way even during failures

  3. Basics of Dependable Systems
     • Availability – Property that the system is operating correctly at a given moment
     • Reliability – Property that a system can run continuously without failure
     • Safety – Failures should not lead to catastrophes
     • Maintainability – How easily a failed system can be repaired

  4. Failures, Errors and Faults
     • Failure – A system not meeting its promises
     • Error – Part of the system's state that may lead to a failure
       • E.g., damaged packets
     • Fault – Cause of an error
       • Bad transmission medium, bad disk, etc.
     • Types of faults
       • Transient – Occurs once and disappears
       • Intermittent – Appears, vanishes and reappears
       • Permanent – Continues until repaired

  5. Failure Models • Different types of failures.

  6. Arbitrary Failures
     • A crash failure is a benign way for a service to halt
     • Fail-stop failures – Halting can be detected by other processes
       • The halting server may announce its status
     • Fail-silent systems – Halting is not announced
       • Other processes need to detect the failure
     • Fail-safe – The server produces random output
       • Other servers can detect the failure
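For the fail-silent case, where other processes have to detect the halt themselves, a common approach is a timeout on periodic heartbeat messages. The following is only a minimal sketch: the class name `HeartbeatDetector`, its methods, and the explicit `now` timestamps are illustrative assumptions, not something taken from the slides. Note that a timeout only raises a suspicion, since a slow process or a slow network looks the same as a halted one.

```python
class HeartbeatDetector:
    """Suspects a process has halted if no heartbeat arrived within `timeout` seconds.

    Hypothetical helper for illustration; times are passed in explicitly
    so the behaviour is deterministic and easy to test.
    """

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}  # process id -> time of the most recent heartbeat

    def heartbeat(self, pid, now):
        # Record that `pid` was alive at time `now`.
        self.last_seen[pid] = now

    def suspected(self, now):
        # A process is suspected once its last heartbeat is older than the timeout.
        return sorted(
            pid for pid, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


detector = HeartbeatDetector(timeout=3.0)
detector.heartbeat("a", now=0.0)
detector.heartbeat("b", now=0.0)
detector.heartbeat("a", now=2.0)     # "a" keeps reporting, "b" goes silent
print(detector.suspected(now=4.0))   # only "b" has been silent for more than 3 s
```

A fail-stop server would make this unnecessary by announcing its own halt; the detector is what the remaining processes fall back on when no announcement comes.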

  7. Failure Masking by Redundancy
     • Hiding failures from other processes
     • Three types of redundancy
       • Information redundancy – Extra data is added to hide failures
         • E.g., Hamming codes
       • Timing redundancy – Extra actions are performed to hide failures
         • E.g., redoing a transaction
       • Physical redundancy – Extra equipment (or processes) is added to hide failures
         • Extra disks, process pools, etc.
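As an illustration of information redundancy, the sketch below implements the standard Hamming(7,4) code: three parity bits are added to four data bits so that any single flipped bit can be located and corrected. The function names `hamming74_encode` and `hamming74_decode` are assumptions made for this example; the slides only name Hamming codes without showing a construction.

```python
def hamming74_encode(data):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword.

    Codeword positions 1..7 hold: p1, p2, d1, p4, d2, d3, d4.
    """
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(code):
    """Correct at most one flipped bit and return the 4 data bits."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    # The syndrome is the 1-based position of the flipped bit, or 0 if none.
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


word = hamming74_encode([1, 0, 1, 1])
word[4] ^= 1                          # simulate one damaged bit in transit
print(hamming74_decode(word))         # the receiver still recovers the data bits
```

The receiver never learns that a fault occurred unless it inspects the syndrome, which is the sense in which the extra bits mask the failure. Two or more flipped bits are beyond what this particular code can correct.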

  8. Triple Modular Redundancy
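The slide's figure is not part of this transcript, so the following is only a software sketch of the general idea behind triple modular redundancy: run three replicas of a component and let a voter pass on the majority output, so that one faulty replica is masked. The names `majority_vote` and `tmr`, and the example replicas, are hypothetical.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of replicas."""
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 > len(outputs):
        return value
    raise ValueError("no majority among replica outputs")


def tmr(replicas, x):
    """Feed the same input to every replica and vote on the results."""
    return majority_vote([replica(x) for replica in replicas])


# Hypothetical replicas: two behave correctly, one is faulty.
replicas = [
    lambda x: x + 1,
    lambda x: x + 1,
    lambda x: 0,       # faulty replica producing a wrong value
]
print(tmr(replicas, 4))  # the faulty output is outvoted
```

With three replicas the voter tolerates one wrong output; if all three disagree there is no majority, and this sketch raises an error instead of guessing. The voter itself is a single point of failure here, which is why hardware designs usually replicate the voters as well.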
