
Using Time Travel to Diagnose Computer Problems

Andrew Whitaker, Rick Cox, Steve Gribble (The University of Washington)


Presentation Transcript


  1. Using Time Travel to Diagnose Computer Problems Andrew Whitaker, Rick Cox, Steve Gribble The University of Washington

  2. Example Scenario • Mozilla Web browser locks up after installing an extension • Current approaches are inadequate: • Google search reveals too many possibilities: bad extension, HTTP pipelining enabled, glibc update, invalid hostname, “some upgrade of some gnome or GTK package”, Mozilla bugs • Help menus cannot anticipate all error cases • Reinstalling Mozilla does not fix the problem

  3. General Problem • WYNOT errors: system worked yesterday, not today • Other examples: • Misconfigured Internet servers • Administrator mistakes are the largest source of downtime • Conflicts between applications • Registry corruption, “DLL hell” • Security policy • Over-zealous firewall • Spyware, adware, viruses Goal: automate the diagnosis of change-induced errors

  4. Chronus Overview • Use search to identify the transition from a working state to a failing state (the fault point) • Search requirements: • Time-travel mechanism • Testing mechanism [Figure: timeline of system states, working before the fault point and failing after it]

  5. Usage Model • User-written software probe: is the system working? • Chronus: when did the system stop working? • Analysis tools (diff, regdiff, log files): why did the system stop working?

  6. Outline • Introduction • Design and implementation • Debugging Experience

  7. Time Travel Mechanism • Log state changes using a time-travel disk • Boot a historical virtual machine • Captures boot-time configuration parameters • Capture state changes onto a copy-on-write disk • Avoids tampering with the system timeline
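
The copy-on-write idea on this slide can be illustrated with a toy in-memory model. This is only a sketch under stated assumptions: the class name `CowDisk`, its methods, and the dictionary-of-blocks representation are hypothetical and are not taken from Chronus. The sketch shows why writes made while testing a historical state never alter the recorded timeline.

```python
from types import MappingProxyType


class CowDisk:
    """Toy copy-on-write disk (hypothetical model): block number -> bytes."""

    def __init__(self, snapshot):
        # Read-only view of the historical snapshot, so nothing written
        # through this object can reach the recorded timeline.
        self._base = MappingProxyType(snapshot)
        # Blocks written since the historical VM was booted.
        self._overlay = {}

    def read(self, block):
        # Prefer the overlay; fall through to the snapshot otherwise.
        if block in self._overlay:
            return self._overlay[block]
        return self._base[block]

    def write(self, block, data):
        # Writes land only in the overlay.
        self._overlay[block] = data

    def discard(self):
        # Throw away everything the test run changed.
        self._overlay.clear()


snapshot = {0: b"boot", 1: b"config-v1"}
disk = CowDisk(snapshot)
disk.write(1, b"config-v2")
print(disk.read(1))   # the test run sees its own write
print(snapshot[1])    # the historical snapshot is unchanged
```

Discarding the overlay after each probe run returns the disk to the historical state, which is one plausible way to keep repeated tests independent of each other.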

  8. Time Travel Implementation • Functionality split across two VMs • Parent implements time-travel functionality • Child executes normal user programs [Figure labels: child VM, parent VM, disk requests, COW disk, time-travel disk, Denali VMM]

  9. Software Probes • Probe is arbitrary code that evaluates system correctness • Two varieties of probes: • Internal probes run inside the child VM • External probes run on a remote machine • Strategies for obtaining probes: • Pre-packaged libraries • Written on the fly by expert user or administrator
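
A probe that hangs never reports a result, so whatever runs the probe needs some way to turn "no answer" into a failure. The sketch below is one way an external harness might do that. It is an assumption, not the Chronus mechanism: the function name `run_probe`, the timeout value, and the use of the exit status are all hypothetical.

```python
import subprocess
import sys


def run_probe(command, timeout_s):
    """Run a probe command; report 'SUCCESS' or 'FAILURE' (hypothetical harness)."""
    try:
        result = subprocess.run(
            command,
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # The probe blocked (for example, the application hung).
        return "FAILURE"
    except OSError:
        # The probe could not be started at all.
        return "FAILURE"
    return "SUCCESS" if result.returncode == 0 else "FAILURE"


# Stand-in probes: one that finishes at once, one that blocks.
quick = [sys.executable, "-c", "pass"]
stuck = [sys.executable, "-c", "import time; time.sleep(30)"]
print(run_probe(quick, timeout_s=10))
print(run_probe(stuck, timeout_s=1))
```

Treating a timeout as a failure keeps the probe itself simple: it only needs to test for correctness, and the harness handles hangs.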

  10. Outline • Introduction • Design and implementation • Debugging Experience

  11. Debugging the Mozilla Hang
      • Step 1: write a probe that tests the behavior:
          #!/bin/sh
          mozilla &
          sleep 5
          mozilla -remote 'ping()'    # blocks if Mozilla hangs
          echo 'SUCCESS' > /TTOUTPUT
      • Step 2: invoke search over a time range:
          % search -begin 169354 -end 180025
          169354: SUCCESS
          180025: FAILURE
          174689: FAILURE
          172021: SUCCESS
          173355: SUCCESS
          174022: FAILURE
          173688: FAILURE
          173521: SUCCESS
          173604: FAILURE
          173562: FAILURE
          173541: SUCCESS
          173551: SUCCESS
          173556: FAILURE
          173553: FAILURE
          173552: SUCCESS
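
The sequence of tested timestamps on this slide is consistent with a binary search between a known-good and a known-bad point in time. The sketch below is an illustration under stated assumptions, not the Chronus implementation: `find_fault_point` and `simulated_probe` are hypothetical names, the probe is a stand-in for booting a historical VM and running the user's script, and the search assumes a single working-to-failing transition in the range.

```python
from typing import Callable, List, Tuple


def find_fault_point(
    begin: int, end: int, probe: Callable[[int], bool]
) -> Tuple[int, int, List[int]]:
    """Bisect [begin, end] for the working-to-failing transition.

    Returns (last_good, first_bad, tested), where tested lists the
    timestamps probed, in order. Assumes exactly one transition.
    """
    tested: List[int] = []

    def run(t: int) -> bool:
        tested.append(t)
        return probe(t)

    # Check both endpoints first, as the trace on the slide does.
    if not run(begin):
        raise ValueError("system is already failing at the start of the range")
    if run(end):
        raise ValueError("system is still working at the end of the range")

    good, bad = begin, end
    while bad - good > 1:
        mid = (good + bad) // 2
        if run(mid):
            good = mid
        else:
            bad = mid
    return good, bad, tested


def simulated_probe(t: int) -> bool:
    # Hypothetical stand-in: pretend the fault was introduced at 173553.
    return t < 173553


good, bad, tested = find_fault_point(169354, 180025, simulated_probe)
print(good, bad)
print(len(tested))
```

With the range from the slide, this sketch probes the two endpoints plus 13 midpoints and ends with 173552 as the last working state and 173553 as the first failing one, which are the two timestamps passed to `attach` in the next step.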

  12. Mozilla Hang, Continued
      • Step 3: compute the change:
          % attach child-disk 173552 173553
          % diff -r /child-before /child-after
          file /.mozilla/default/zc1irw5u.slt/chrome/chrome.rdf differs:
          <RDF:Description about="urn:mozilla:package:stockticker"
              c:baseURL="jar:file:///root/.mozilla/default/zc1irw5u.slt/chrome/stockticker.jar!/content/"
              c:locType="profile"
              c:author="Jeremy Gillick"
              c:authorURL="http://jgillick.nettripper.com/"
              c:description="Shows your favorite stocks in a customized ticker."
              c:displayName="StockTicker 0.4.2"
              c:extension="true"
              c:name="stockticker"
              c:settingsURL="chrome://stockticker/content/options.xul" />
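
Once the states on either side of the fault point are mounted, the remaining work is an ordinary tree comparison. As a rough sketch of what a `diff -r` style comparison does, the hypothetical helper below (`changed_files` is not part of Chronus) lists files that were added or whose contents differ between two directory trees. It does not report deleted files.

```python
import filecmp
import os


def changed_files(before, after):
    """List (kind, relative_path) for files added or changed in `after`."""
    changed = []
    for root, _dirs, files in os.walk(after):
        rel = os.path.relpath(root, after)
        for name in files:
            new = os.path.join(root, name)
            old = os.path.join(before, rel, name)
            rel_path = os.path.normpath(os.path.join(rel, name))
            if not os.path.isfile(old):
                changed.append(("added", rel_path))
            elif not filecmp.cmp(old, new, shallow=False):
                # shallow=False compares file contents, not just metadata.
                changed.append(("differs", rel_path))
    return sorted(changed)
```

Called as `changed_files("/child-before", "/child-after")`, it would return a sorted list of tuples such as `("differs", ...)` for each modified file, assuming both trees are mounted at those paths.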

  13. Summary • Chronus uses search to find a failure-inducing state change • User-supplied probe need only test for correctness • “Time travel” built on a logging disk and a virtual machine monitor • Chronus can diagnose many common configuration errors More details to appear at OSDI 2004

  14. Questions?

  15. Emerging Challenge: Evaluation in the Post-performance Era • How do we demonstrate “correctness”? • Conventional benchmarking cannot account for the “human factor” • Alternate approaches: • Proximate metrics • Bug count • User studies (Aaron Brown’s work) • Proofs • Research directions • Validating proximate metrics • Designing systems with evaluation in mind

  16. [Figure: timeline with the fault point marked; the system was working before it and was NOT working after it]

  17. Why it works • Testing complexity does not scale with system complexity [Figure: a client issues an HTTP GET against a stack of firewall, Apache, Perl, MySQL, and Linux, and receives "Error 400: Bad Request"]

  18. Chronus in Action
      • 1) Notice a failure:
          mount_nfs: rpcbind on server: RPC Port mapper failure - RPC: Timed out
      • 2) Write a probe to test for failure:
          #!/bin/sh
          echo 'SUCCESS' > /TTOUTPUT
      • 3) Use Chronus to locate the failure in time
      • 4) Use diff to extract the result:
          file /etc/rc.conf differs:
          ipfilter=YES

  19. Motivation • Can we substitute HW effort for human effort? [Figure: total ownership cost breakdown, 1970s versus 2000s, hardware costs versus people costs]

  20. Time travel disks • Log a window of recent disk block changes [Figure labels: block writes, block reads, index, checkpoint region, log region]

  21. Motivation • Complex systems require expert users • Performance tuning • Security policy specification • Software upgrades • Implications: • High cost • System administration is 60-80% of TCO • Poor quality • e.g., unpatched home user machines How can we use computer power to simplify system administration tasks?

  22. Rollback-based Recovery • Key challenge: recovering lost work • Requires application assistance: • Windows XP: application-specific rollback • Operator Undo: application-specific state repair • May corrupt system state [Figure labels: lost work, failing configuration, known good configuration]

  23. STRIDER • Configuration debugging by observing program side effects • Disadvantages relative to Chronus • False positives • Registry-specific • Requires heuristics to prune the search space • Misses indirect dependencies • Advantages: • Only requires one program invocation • Can compare configurations across space • e.g., a Registry on a remote machine

  24. Time Travel Disk Implementation • Capture and record block updates to a log region [Figure labels: block writes, block reads, index, time, checkpoint region, log region]
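
The slide names three pieces: a checkpoint region, a log region, and an index. A toy in-memory model can show how they might fit together to answer "what did the disk look like at time t?". Everything here is an assumption made for illustration: the class `TimeTravelLog`, the use of a sorted timestamp list as the index, and the `trim` operation that bounds the window are hypothetical and do not describe the on-disk layout Chronus uses.

```python
import bisect


class TimeTravelLog:
    """Toy model: checkpoint holds the state at the start of the window,
    the log holds every block write since then (hypothetical design)."""

    def __init__(self, checkpoint=None):
        self.checkpoint = dict(checkpoint or {})  # block -> data
        self.times = []    # index: write timestamps, ascending
        self.writes = []   # log: (block, data), parallel to self.times

    def write(self, time, block, data):
        if self.times and time < self.times[-1]:
            raise ValueError("writes must be logged in time order")
        self.times.append(time)
        self.writes.append((block, data))

    def snapshot(self, time):
        """Disk contents after all logged writes with timestamp <= time."""
        disk = dict(self.checkpoint)
        count = bisect.bisect_right(self.times, time)
        for block, data in self.writes[:count]:
            disk[block] = data
        return disk

    def trim(self, keep_from):
        """Fold writes older than keep_from into the checkpoint."""
        count = bisect.bisect_left(self.times, keep_from)
        for block, data in self.writes[:count]:
            self.checkpoint[block] = data
        del self.times[:count]
        del self.writes[:count]


log = TimeTravelLog({7: b"old"})
log.write(100, 7, b"new")
log.write(200, 8, b"extra")
print(log.snapshot(150))
```

Trimming trades history for space: after `trim(150)` the write at time 100 lives only in the checkpoint, so states earlier than the window can no longer be reconstructed.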

  25. Chronus Design Choices • Time-travel disks • Pro: Captures all state changes without OS/app support • Pro: Simple (~1200 lines of code) • Con: Lack of semantic knowledge • Con: Inconsistent results from raw disk snapshot • Virtual machine restarts • Pro: More complete than application-level restarts • Pro: Faster, safer than physical machine restarts • Con: Requires that all devices have been virtualized • Con: Misses changes in the hardware-abstraction layer
