Why Do Computers Stop and What Can Be Done About It? Jim Gray Presentation – Joe Tucek
Introduction • Many applications require high availability • Patient monitoring • Transaction processing (banks) • Yet, computers fail • Even 99.6% uptime still allows roughly an hour of downtime every two weeks • This paper – • What Tandem NonStop does • How well it works • What more we can do
Overview • Introduction • Principles of Reliable Design • Failure Results • Software Reliability Techniques • “The” Solution • Conclusions
Principles of Reliable Design • Availability is MTTF/(MTTF+MTTR) • A really small MTTR is as good as a huge MTTF • “Spare modules are configured to give the appearance of instantaneous repair” • Modules must fail independently • Modules must fail stop • Example: mirrored hard disks • 10,000 hour MTTF, 24 hour MTTR per disk • 1000 year MTTF for the mirrored pair
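The availability formula and the mirrored-disk arithmetic can be checked back-of-the-envelope. A minimal sketch, assuming independent, fail-stop disks and the standard MTTF²/(2·MTTR) approximation for a mirrored pair (the exact multi-century figure depends on the failure model assumed):

```python
# Availability = MTTF / (MTTF + MTTR)
def availability(mttf_hours: float, mttr_hours: float) -> float:
    return mttf_hours / (mttf_hours + mttr_hours)

# Single disk: 10,000 hour MTTF, 24 hour MTTR (numbers from the slide)
disk_mttf = 10_000.0
disk_mttr = 24.0
print(f"single disk availability: {availability(disk_mttf, disk_mttr):.4f}")

# The mirrored pair is lost only if the second disk fails during the
# 24-hour repair window of the first. Standard approximation for
# independent failures:
pair_mttf = disk_mttf ** 2 / (2 * disk_mttr)
print(f"mirrored pair MTTF: {pair_mttf:,.0f} hours "
      f"(~{pair_mttf / 8760:.0f} years)")
```

This is the sense in which “spare modules give the appearance of instantaneous repair”: the pair's MTTF is measured in centuries even though each disk alone fails about once a year.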
Overview • Introduction • Principles of Reliable Design • Failure Results • Software Reliability Techniques • “The” Solution • Conclusions
Failure Results • Software and administration are • 65% of all failures • 80% of non-environmental failures • Of software and administration failures • Software is 37% • Administration is 63% • Simplifying and minimizing administrator involvement is the top priority!
Overview • Introduction • Principles of Reliable Design • Failure Results • Software Reliability Techniques • Modularity • Heisenbugs • Process Pairs • “The” Solution • Conclusions
Software Reliability-Modularity • Hardware uses fail-fast modules • Copy hardware’s successful technique • Modular processes • Cannot corrupt each other (fault isolation) • Can easily be made fail fast • Defensive programming • Can simply be killed (fail stop) • So, we can use redundant modules • But computers are deterministic automata…
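The fail-fast idea can be sketched in a few lines; this is a toy module (hypothetical `FailFastQueue`, not from the paper) that defensively checks its own invariants on every operation and stops rather than continuing with corrupt state:

```python
import sys

class FailFastQueue:
    """Toy bounded queue illustrating fail-fast modules: validate
    state on every operation, halt loudly on any violation."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.items = []

    def _check_invariants(self):
        # Defensive programming: verify internal state each time.
        if not (0 <= len(self.items) <= self.capacity):
            # Fail stop: die so a redundant module can take over,
            # rather than corrupting callers with bad state.
            sys.exit("invariant violated: queue state corrupt")

    def put(self, x):
        self._check_invariants()
        if len(self.items) == self.capacity:
            raise OverflowError("queue full")  # fail fast on bad input
        self.items.append(x)

    def get(self):
        self._check_invariants()
        if not self.items:
            raise IndexError("queue empty")
        return self.items.pop(0)

q = FailFastQueue(capacity=2)
q.put("a")
q.put("b")
print(q.get())  # "a"
```

Because the module either works or visibly stops, a supervisor can pair it with a redundant copy, exactly as hardware does with spare boards.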
Software Reliability-Heisenbugs • What bugs are the hardest to find? • Race conditions • Limit conditions • “Unusual” system states • These are “heisenbugs” • They go away after you add debugging symbols • Or run in a debugger • Or add that printf • Or do anything at all to find the stupid little thing • 131 of 132 reported faults were heisenbugs
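The classic heisenbug is an unsynchronized read-modify-write race. A minimal sketch: whether any updates are actually lost depends on thread interleaving, which is exactly what makes it a heisenbug, and adding a print inside the loop can shift the timing enough to hide it:

```python
import threading

counter = 0

def worker(iters):
    global counter
    for _ in range(iters):
        # Race: "counter += 1" is a load, an add, and a store;
        # another thread can interleave between them.
        counter += 1

N_THREADS, ITERS = 8, 100_000
threads = [threading.Thread(target=worker, args=(ITERS,))
           for _ in range(N_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# May print less than the expected total if updates were lost;
# the result can vary from run to run.
print(counter, "expected", N_THREADS * ITERS)
```

Retrying in a fresh process almost always dodges such a bug, which is why the paper's redundancy techniques work so well against this fault class.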
Software Reliability-Process Pairs • Process Pairs are redundant modules • Lockstep execution for hardware faults • Manual checkpoint rollback • Still used in scientific computing • Automatic Checkpointing • Easier to do • “Delta” Checkpoints • Programmatically harder • Persistent Processes • Fast fail-over with “amnesia”
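A primary/backup checkpointing sketch in the spirit of the automatic-checkpointing variant above (single-process toy with hypothetical names; a real Tandem pair runs on separate machines and ships checkpoints over a message bus):

```python
import copy

class Backup:
    """Backup half of a process pair: stores the last checkpoint."""
    def __init__(self):
        self.state = None

    def checkpoint(self, state):
        self.state = copy.deepcopy(state)

    def take_over(self):
        # Fail-over resumes from the last checkpoint; work done after
        # it is lost (cf. the "amnesia" of persistent processes).
        return self.state

class Primary:
    """Primary half: does the work, checkpoints after each step."""
    def __init__(self, backup):
        self.state = {"balance": 0}
        self.backup = backup

    def update(self, delta):
        self.state["balance"] += delta
        # Automatic checkpoint: ship full state after every operation.
        # ("Delta" checkpoints would send only the change - cheaper,
        # but programmatically harder, as the slide notes.)
        self.backup.checkpoint(self.state)

backup = Backup()
primary = Primary(backup)
primary.update(100)
primary.update(-30)
# Suppose the primary crashes here; the backup resumes.
print(backup.take_over())  # {'balance': 70}
```

Since heisenbugs are timing-dependent, the backup replaying from the checkpoint almost never hits the same fault, so the pair masks exactly the bug class that dominates the failure data.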
Overview • Introduction • Principles of Reliable Design • Failure Results • Software Reliability Techniques • “The” Solution • Conclusions
Transactions • The reason Jim Gray is famous… • Either things happen or they don’t • You never see things half-happened • What happens cannot be corrupting • What happens has happened forever • ACID • Atomicity, Consistency, Isolation, Durability
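The all-or-nothing property can be demonstrated with any transactional store; a sketch using Python's built-in sqlite3 (a transfer either fully happens or is rolled back, so no half-happened state is ever visible):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INT)")
db.executemany("INSERT INTO accounts VALUES (?, ?)",
               [("alice", 100), ("bob", 0)])
db.commit()

def transfer(db, src, dst, amount):
    try:
        with db:  # one transaction: commit on success, rollback on error
            db.execute("UPDATE accounts SET balance = balance - ? "
                       "WHERE name = ?", (amount, src))
            if db.execute("SELECT balance FROM accounts WHERE name = ?",
                          (src,)).fetchone()[0] < 0:
                raise ValueError("insufficient funds")
            db.execute("UPDATE accounts SET balance = balance + ? "
                       "WHERE name = ?", (amount, dst))
    except ValueError:
        pass  # whole transaction rolled back; nothing half-happened

transfer(db, "alice", "bob", 60)   # succeeds and commits
transfer(db, "alice", "bob", 999)  # fails and is fully undone
print(db.execute("SELECT name, balance FROM accounts "
                 "ORDER BY name").fetchall())
# [('alice', 40), ('bob', 60)]
```

The failed transfer leaves no trace: the debit that had already been applied inside the transaction is rolled back along with everything else, which is the atomicity the slide describes.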
Transactions • Consider “persistent” process pairs • They’re easy, but they leave things half done • Why not combine them with transactions? • Of course, the OS and DB must be reliable • Still state of the art • Microreboot -- A Technique for Cheap Recovery. Candea, Kawamoto, Fujiki, Friedman, Fox. OSDI 2004.
Conclusions • Computers mostly fail due to software and administration • Modular redundancy makes hardware reliable • The same technique can work for software • Most “hard” bugs are heisenbugs • So retrying in a redundant process masks them • Transactions are the way to store state • And yet, 20 years later, computers still crash…