
Why panic()? Improving Reliability through Restartable File Systems

Presentation Transcript


  1. Why panic()? Improving Reliability through Restartable File Systems. Swaminathan Sundararaman, Sriram Subramanian, Abhishek Rajimwale, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, Michael M. Swift

  2. Data Availability [Figure: GFS masters and slave nodes replicated across a cluster] • Applications require data • Use the FS to reliably store data • Both hardware and software can fail • Typical solution: large clusters for availability, reliability through replication

  3. User Desktop Environment [Figure: applications on an OS with a file system over a RAID controller and disks] • Replication is infeasible for desktop environments • Wouldn’t RAID work? It can only tolerate H/W failures • FS crashes are more severe: services/applications are killed, requiring OS reboot and recovery • Need: better reliability in the event of file-system failures

  4. Outline • Motivation • Background • Restartable file systems • Advantages and limitations • Conclusions

  5. Failure Handling in File Systems • Exception paths not tested thoroughly • Exceptions: failed I/O, bad arguments, null pointers • On errors: call panic(), BUG(), or BUG_ON() • After a failure: data becomes inaccessible • Reasons for no recovery code: hard to apply corrective measures; not straightforward to add recovery

  6. Real-world Example: Linux 2.6.15 ReiserFS. File systems already detect failures:

    int journal_mark_dirty(...) {
        struct reiserfs_journal_cnode *cn = NULL;
        if (!cn) {
            cn = get_cnode(p_s_sb);
            if (!cn) {
                reiserfs_panic(p_s_sb, "get_cnode failed!\n");
            }
        }
    }

    void reiserfs_panic(struct super_block *sb, ...) {
        BUG();  /* this is not actually called, but makes reiserfs_panic() "noreturn" */
        panic("REISERFS: panic %s\n", error_buf);
    }

  Recovery: simplified by a generic recovery mechanism
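The talk's argument is that detection code like the above already exists; only the response needs to change. The user-space sketch below illustrates the idea of routing a detected fault into a generic recovery path instead of panicking the whole kernel; membrane_fault(), get_cnode_stub(), and the simplified super_block are hypothetical stand-ins, not code from ReiserFS or Membrane.

    /* Illustrative only: redirect a detected failure into a generic
     * recovery path instead of panicking.  Real code would live inside
     * the kernel; these are user-space stand-ins. */
    #include <stdio.h>
    #include <stdlib.h>

    struct super_block { const char *name; };

    /* Hypothetical generic recovery entry point: mark the FS fail-stop,
     * unwind in-flight requests, and schedule a restart. */
    static void membrane_fault(struct super_block *sb, const char *msg)
    {
        fprintf(stderr, "fs %s: fault detected (%s); restarting file system\n",
                sb->name, msg);
        /* ...unwind, roll back to last checkpoint, replay the op log... */
    }

    static void *get_cnode_stub(struct super_block *sb)
    {
        (void)sb;
        return NULL;              /* simulate the allocation failure */
    }

    static int journal_mark_dirty_sketch(struct super_block *sb)
    {
        void *cn = get_cnode_stub(sb);
        if (!cn) {
            membrane_fault(sb, "get_cnode failed");  /* instead of reiserfs_panic() */
            return -1;            /* fail-stop: return an error, do not crash the OS */
        }
        return 0;
    }

    int main(void)
    {
        struct super_block sb = { "reiserfs" };
        return journal_mark_dirty_sketch(&sb) ? EXIT_FAILURE : EXIT_SUCCESS;
    }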

  7. Possible Solutions [Design space: lightweight vs. heavyweight, stateless vs. stateful; stateful systems include CuriOS and EROS, stateless ones include Nooks/Shadow, Xen, Minix, L4, Nexus, SafeDrive, and Singularity] • Code to recover from all failures: not feasible in reality • Restart on failure: previous work has taken this approach • FSes need stateful & lightweight recovery

  8. Restartable File Systems • Goal: build a lightweight & stateful solution to tolerate file-system failures, with FS failures completely transparent to applications • Solution: a single generic recovery mechanism for any file-system failure • Detect failures through assertions • Clean up resources used by the file system • Restore the file-system state from before the crash • Continue to service new file-system requests

  9. Challenges • Transparency: multiple applications use the FS at the time of the crash; their execution is intertwined • Fault-tolerance: handle a gamut of failures and transform them into fail-stop failures • Consistency: the OS and FS could be left in an inconsistent state

  10. Guaranteeing FS Consistency • Not all FSes support crash-consistency • FS state is constantly modified by applications • Periodically checkpoint the FS state • Mark dirty blocks as copy-on-write • Ensure each checkpoint is atomically written • On crash: revert to the last checkpoint • FS consistency is required to prevent data loss
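As a rough illustration of "mark dirty blocks as copy-on-write", here is a user-space sketch that emulates COW checkpointing with mprotect() and a SIGSEGV handler; the kernel mechanism in the talk operates on page-cache blocks instead, and take_checkpoint()/cow_handler() are names invented for this sketch.

    /* User-space sketch of copy-on-write checkpointing over one region. */
    #define _GNU_SOURCE
    #include <signal.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    static char *region;          /* "file-system state" being checkpointed */
    static char *checkpoint;      /* copy preserved at the last checkpoint  */
    static size_t page_size;

    /* First write after a checkpoint faults; save the old page, then allow it. */
    static void cow_handler(int sig, siginfo_t *info, void *ctx)
    {
        (void)sig; (void)ctx;
        char *page = (char *)((uintptr_t)info->si_addr & ~(page_size - 1));
        memcpy(checkpoint + (page - region), page, page_size);  /* keep epoch N   */
        mprotect(page, page_size, PROT_READ | PROT_WRITE);      /* epoch N+1 runs */
    }

    static void take_checkpoint(size_t len)
    {
        /* "Commit": write-protect everything so the next writes fault as COW. */
        mprotect(region, len, PROT_READ);
    }

    int main(void)
    {
        page_size = (size_t)sysconf(_SC_PAGESIZE);
        size_t len = 4 * page_size;
        region = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        checkpoint = malloc(len);
        memcpy(checkpoint, region, len);

        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_sigaction = cow_handler;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);

        strcpy(region, "epoch 0 data");
        take_checkpoint(len);
        strcpy(region, "epoch 1 data");   /* triggers the COW handler once */

        /* prints: live: epoch 1 data | checkpoint: epoch 0 data */
        printf("live: %s | checkpoint: %s\n", region, checkpoint);
        return 0;
    }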

  11. Overview of Our Approach [Timeline: an application issues open(“file”), write(), read(), write(), write(), close() through the VFS to the file system, spanning epoch 0 and epoch 1; legend: completed, in-progress, crash] (1) Periodically create checkpoints; (2) on a crash, (3) unwind in-flight processes, (4) move back to the most recent checkpoint, (5) replay completed operations, and (6) re-execute the unwound processes.
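A compact sketch of that recovery sequence; every helper here is a hypothetical stub standing in for real kernel machinery, shown only to make the ordering of the steps concrete.

    /* Sketch of the recovery sequence from the overview above. */
    #include <stdio.h>

    static void unwind_in_flight_requests(void) { puts("unwind in-flight requests"); }
    static void restore_last_checkpoint(void)   { puts("roll back to last checkpoint"); }
    static void replay_completed_ops(void)      { puts("replay ops logged since checkpoint"); }
    static void reexecute_unwound_requests(void){ puts("re-issue unwound requests"); }

    /* Called when a file-system fault is detected (fail-stop). */
    static void recover_file_system(void)
    {
        unwind_in_flight_requests();    /* step 3: park threads inside the FS     */
        restore_last_checkpoint();      /* step 4: revert to the last epoch       */
        replay_completed_ops();         /* step 5: redo user-visible operations   */
        reexecute_unwound_requests();   /* step 6: resume the parked requests     */
    }

    int main(void) { recover_file_system(); return 0; }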

  12. Checkpoint Mechanism • File systems are constantly modified • Hard to identify a consistent recovery point • Naïve solution: prevent any new FS operation and call sync; inefficient and unacceptable overhead

  13. Key Insight [Figure: apps → VFS → file systems (ext3, VFAT) → page cache → disk] • All requests go through the VFS layer • File systems write to disk through the page cache • Therefore we can control both requests to the FS and dirty pages going to disk (see the gate sketch below)
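One way to read "control requests to the FS" is a gate at the VFS boundary that quiesces in-flight operations before a checkpoint. The sketch below models that with pthreads; all names (fs_gate_enter/exit, checkpoint_begin/end) are assumptions for illustration rather than the paper's actual interfaces.

    /* Sketch of a VFS-level gate that pauses new file-system requests
     * while a checkpoint is taken. */
    #include <pthread.h>
    #include <stdbool.h>

    static pthread_mutex_t gate_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  gate_cond = PTHREAD_COND_INITIALIZER;
    static bool checkpointing = false;   /* true while a checkpoint is in progress */
    static int  in_flight = 0;           /* requests currently inside the FS       */

    /* Called at the top of every VFS entry point (open, read, write, ...). */
    void fs_gate_enter(void)
    {
        pthread_mutex_lock(&gate_lock);
        while (checkpointing)            /* new requests wait at the VFS */
            pthread_cond_wait(&gate_cond, &gate_lock);
        in_flight++;
        pthread_mutex_unlock(&gate_lock);
    }

    /* Called when the request returns from the file system. */
    void fs_gate_exit(void)
    {
        pthread_mutex_lock(&gate_lock);
        in_flight--;
        pthread_cond_broadcast(&gate_cond);
        pthread_mutex_unlock(&gate_lock);
    }

    /* Checkpoint: stop new requests, wait for in-flight ones to drain. */
    void checkpoint_begin(void)
    {
        pthread_mutex_lock(&gate_lock);
        checkpointing = true;
        while (in_flight > 0)
            pthread_cond_wait(&gate_cond, &gate_lock);
        pthread_mutex_unlock(&gate_lock);
        /* ...mark dirty pages copy-on-write, write the checkpoint atomically... */
    }

    void checkpoint_end(void)
    {
        pthread_mutex_lock(&gate_lock);
        checkpointing = false;
        pthread_cond_broadcast(&gate_cond);
        pthread_mutex_unlock(&gate_lock);
    }

    int main(void)
    {
        fs_gate_enter();     /* a request enters the FS  */
        fs_gate_exit();      /* ...and completes         */
        checkpoint_begin();  /* quiesce, then checkpoint */
        checkpoint_end();    /* reopen for the next epoch */
        return 0;
    }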

  14. Generic COW-based Checkpoint [Figure: the app → VFS → file system → page cache → disk stack shown during regular operation, at a checkpoint (new requests and dirty-page writeback stopped), and after the checkpoint under Membrane]

  15. Interaction with Modern FSes • They have built-in crash-consistency mechanisms: journaling or snapshotting • Seamlessly integrate with these mechanisms • Need FSes to indicate the beginning and end of a transaction (see the sketch below) • Works for data and ordered journaling modes • Need to combine writeback mode with COW
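A sketch of what "indicate the beginning and end of a transaction" might look like as hooks a journaling file system calls around its commits, so checkpoints align with journal transactions; membrane_txn_begin/end and struct fs_instance are invented for this example.

    /* Sketch of transaction-boundary hooks for a journaling FS. */
    #include <stdio.h>

    struct fs_instance { const char *name; int open_txns; };

    /* The FS calls this when it opens a journal transaction. */
    void membrane_txn_begin(struct fs_instance *fs)
    {
        fs->open_txns++;
        printf("%s: transaction begins (open: %d)\n", fs->name, fs->open_txns);
    }

    /* ...and this after the transaction commits; a checkpoint is only
     * taken when no transaction is in flight, so the image it captures
     * is one the journal itself considers consistent. */
    void membrane_txn_end(struct fs_instance *fs)
    {
        fs->open_txns--;
        if (fs->open_txns == 0)
            printf("%s: quiescent point, checkpoint may be taken here\n", fs->name);
    }

    int main(void)
    {
        struct fs_instance ext3 = { "ext3", 0 };
        membrane_txn_begin(&ext3);   /* around journal start in a real FS   */
        /* ...metadata and data updates for this transaction...             */
        membrane_txn_end(&ext3);     /* around journal stop/commit          */
        return 0;
    }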

  16. Light-weight Logging • Log operations at the VFS level • Need not modify existing file systems • Operations: open, close, read, write, symlink, unlink, seek, etc. • Read: logs are thrown away after each checkpoint • What about logging writes?
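A sketch of what a VFS-level operation log could hold, assuming a simple in-memory record array; the record layout and membrane_log_op()/membrane_log_reset() are illustrative, not the paper's data structures. Write payloads are deliberately absent here, which is what motivates page stealing on the next slide.

    /* Sketch of a VFS-level operation log. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    enum vfs_op { OP_OPEN, OP_CLOSE, OP_READ, OP_WRITE, OP_UNLINK, OP_SEEK };

    struct op_record {
        enum vfs_op op;
        char        path[64];   /* which file the call touched             */
        uint64_t    offset;     /* file offset for read/write/seek         */
        uint64_t    count;      /* byte count for read/write               */
        /* no data payload for writes; see page stealing (next slide)      */
    };

    #define LOG_CAP 1024
    static struct op_record op_log[LOG_CAP];
    static size_t log_len;

    /* Called from each VFS entry point, before dispatching to the FS. */
    static void membrane_log_op(enum vfs_op op, const char *path,
                                uint64_t offset, uint64_t count)
    {
        if (log_len == LOG_CAP) return;          /* sketch: drop when full */
        struct op_record *r = &op_log[log_len++];
        r->op = op; r->offset = offset; r->count = count;
        snprintf(r->path, sizeof r->path, "%s", path);
    }

    /* At each checkpoint the log is simply discarded. */
    static void membrane_log_reset(void) { log_len = 0; }

    int main(void)
    {
        membrane_log_op(OP_OPEN,  "/home/u/file", 0, 0);
        membrane_log_op(OP_WRITE, "/home/u/file", 0, 4096);
        printf("%zu operations logged since last checkpoint\n", log_len);
        membrane_log_reset();                    /* checkpoint taken       */
        return 0;
    }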

  17. Page Stealing Mechanism [Figure: write(fd, buf, offset, count) flowing through VFS → file system → page cache, shown before the crash, during recovery, and after recovery] • Mainly used for replaying writes • Goal: reduce the overhead of logging writes • Solution: grab the data from the page cache during recovery
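The sketch below shows the replay side of this idea under the same simplifying assumptions: the log records only (path, offset, count) for a write, and recovery pulls the actual bytes from a toy in-memory page cache. lookup_page() and replay_write() are hypothetical names, not the paper's interfaces.

    /* Sketch of page stealing during replay. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define PAGE_SIZE 4096u

    struct cached_page {
        char     path[64];
        uint64_t index;             /* offset / PAGE_SIZE               */
        char     data[PAGE_SIZE];   /* survives the FS crash in memory  */
    };

    static struct cached_page page_cache[16];
    static size_t n_pages;

    static struct cached_page *lookup_page(const char *path, uint64_t index)
    {
        for (size_t i = 0; i < n_pages; i++)
            if (page_cache[i].index == index &&
                strcmp(page_cache[i].path, path) == 0)
                return &page_cache[i];
        return NULL;
    }

    /* Replay a logged write without having logged its payload: the bytes
     * are "stolen" from the page cache and handed to the restarted FS. */
    static int replay_write(const char *path, uint64_t offset, uint64_t count)
    {
        struct cached_page *pg = lookup_page(path, offset / PAGE_SIZE);
        if (!pg) return -1;              /* would fall back to other sources */
        printf("replay: write %llu bytes to %s@%llu from cached page\n",
               (unsigned long long)count, path, (unsigned long long)offset);
        (void)pg->data;                  /* ...copy into the restarted FS... */
        return 0;
    }

    int main(void)
    {
        struct cached_page *pg = &page_cache[n_pages++];
        snprintf(pg->path, sizeof pg->path, "/home/u/file");
        pg->index = 0;
        memcpy(pg->data, "hello", 5);    /* data written before the crash    */
        return replay_write("/home/u/file", 0, 5) ? 1 : 0;
    }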

  18. Handling Non-Determinism

  19. Skip/Trust Unwind Protocol

  20. Evaluation Setup

  21. OpenSSH Benchmark

  22. Postmark Benchmark

  23. Recovery Time: restarting ext2 during a random-read micro-benchmark

  24. Recovery Time (Cont.)

  25. Advantages • Improves tolerance to file-system failures • Builds trust in new file systems (e.g., ext4, btrfs) • Quick-fix bug patching: developers can transform corruptions into restarts instead of extensive code restructuring • Encourages more integrity checks in FS code: assertions can be seamlessly transformed into restarts • File systems become more robust to failures/crashes

  26. Limitations [Figure: inode# mismatch example: the application issues create(“file1”), write(“file1”, 4k), stat(“file1”); the file system assigns inode# 12 in one epoch and inode# 15 after crash recovery, so stat(“file1”) returns a different inode number than before the crash] • Only tolerates fail-stop failures • Not address-space based: faults could corrupt other kernel components • FS restart may be visible to applications, e.g., inode numbers could change after restart

  27. Conclusions • Failures are inevitable in file systems; learn to cope with them rather than hope to avoid them • A generic recovery mechanism for FS failures improves FS reliability and the availability of data • Users: install new FSes with confidence • Developers: ship FSes faster, since not every exception case is a show-stopper

  28. Thank You! Advanced Systems Lab (ADSL) University of Wisconsin-Madison http://www.cs.wisc.edu/adsl Questions and Comments
