

  1. Latency as a Performability Metric for Internet Services Pete Broadwell pbwell@cs.berkeley.edu

  2. Outline • Performability background/review • Latency-related concepts • Project status • Initial test results • Current issues

  3. Motivation • A goal of the ROC project: develop metrics to evaluate new recovery techniques • Problem: the basic concept of availability assumes a system is either “up” or “down” at a given time • “Nines” (e.g., 99.999%) describe only the fraction of uptime over a certain interval

  4. Why Is Availability Insufficient? • Availability doesn’t describe the durations or frequencies of individual outages • Both can strongly influence user perception of a service, as well as revenue • Availability doesn’t capture a system’s capacity to support degraded service • degraded performance during failures • reduced data quality during high load (Web)

  5. What is “performability”? • A combination of performance and dependability measures • Classical definition: a probabilistic (model-based) measure of a system’s “ability to perform” in the presence of faults [1] • A concept from the traditional fault-tolerant systems community, ca. 1978 • Has since been applied to other areas, but still not in widespread use
  [1] J. F. Meyer, Performability Evaluation: Where It Is and What Lies Ahead, 1994

  6. Performability Example • Discrete-time Markov chain (DTMC) model of a RAID-5 disk array [1], where:
  • D = number of data disks
  • p_i(t) = probability that the system is in state i at time t
  • w_i(t) = reward (disk I/O operations/sec) in state i
  • μ = disk repair rate
  • λ = failure rate of a single disk drive
  [1] Hannu H. Kari, Ph.D. thesis, Helsinki University of Technology, 1997
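
For concreteness, here is a minimal sketch of how such a Markov reward model turns state probabilities and per-state rewards into a single performability number. The three states, rates, and reward values below are illustrative assumptions, not the model from Kari's thesis:

```python
# Minimal sketch: 3-state DTMC reward model of a RAID-5 array.
# State 0 = all disks up, 1 = one disk failed (degraded), 2 = array failed.
# All rates and rewards are assumed values for illustration.
import numpy as np

D = 4         # number of data disks (plus one parity disk)
lam = 1e-5    # per-disk failure probability per time step (assumed)
mu = 1e-2     # disk repair probability per time step (assumed)

# One-step transition matrix of the DTMC (small-rate approximation).
P = np.array([
    [1 - (D + 1) * lam, (D + 1) * lam,    0.0],
    [mu,                1 - mu - D * lam, D * lam],
    [0.0,               0.0,              1.0],   # array failure is absorbing
])

w = np.array([1000.0, 600.0, 0.0])   # w_i: I/O ops/sec in each state (assumed)

p = np.array([1.0, 0.0, 0.0])        # p_i(0): start with all disks working
for _ in range(100_000):             # evolve p_i(t) one step at a time
    p = p @ P

print("state probabilities p_i(t):", p)
print("expected reward W(t) =", p @ w)   # reward-weighted sum = performability
```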

  7. Visualizing Performability • [Figure: throughput (I/O operations/sec) over time. Throughput is normal until a FAILURE event, drops to a degraded level through the DETECT, RECOVER, and REPAIR phases, then returns to normal; the average throughput over the interval summarizes performability.]

  8. Metrics for Web Services • Throughput: requests/sec • Latency: render time, time to first byte • Data quality: • harvest (response completeness) • yield (% of queries answered) [1]
  [1] E. Brewer, Lessons from Giant-Scale Internet Services, 2001
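
As a quick illustration of the data-quality metrics, here is a sketch of harvest and yield computed from hypothetical counts (the numbers are made up; the definitions follow Brewer):

```python
# Harvest/yield sketch with hypothetical counts.
queries_offered = 1000
queries_answered = 970      # queries that received any response
data_returned = 45          # data partitions reflected in one answer...
data_complete = 50          # ...out of those a complete answer would use

yield_ = queries_answered / queries_offered   # % of queries answered
harvest = data_returned / data_complete       # completeness of a response

print(f"yield = {yield_:.1%}, harvest = {harvest:.1%}")
```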

  9. Applications of Metrics • Modeling the expected failure-related performance of a system, prior to deployment • Benchmarking the performance of an existing system during various recovery phases • Comparing the reliability gains offered by different recovery strategies

  10. Related Projects • HP: Automating Data Dependability • uses “time to data access” as one objective for storage systems • Rutgers: PRESS/Mendosus • evaluated throughput of PRESS server during injected failures • IBM: Autonomic Storage • Numerous ROC projects

  11. Arguments for Using Latency as a Metric • Originally, performability metrics were meant to capture the end-user experience [1] • Latency better describes the experience of an end user of a web site • response time > 8 sec = site abandonment = lost income $$ [2] • Throughput describes the raw processing ability of a service • best used to quantify expenses
  [1] J. F. Meyer, Performability Evaluation: Where It Is and What Lies Ahead, 1994
  [2] Zona Research and Keynote Systems, The Need for Speed II, 2001

  12. Current Progress • Using the Mendosus fault-injection system with a 4-node PRESS web server (both from Rutgers) • Running latency-based performability tests on the cluster (see the sketch below) • Inject faults during a load test • Record page-load times before, during, and after faults
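
A minimal sketch of what such a measurement loop could look like. The target URL, request rate, and run length are assumptions for illustration, not the actual Rutgers test harness:

```python
# Record page-load times at a steady rate; a fault is injected mid-run.
import time
import urllib.request

URL = "http://press-cluster:8080/index.html"   # hypothetical test target

def timed_get(url):
    """Return (latency_seconds, ok) for one full page fetch."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            resp.read()   # read the whole body: a rough proxy for render time
        return time.monotonic() - start, True
    except OSError:       # timeout or connection error
        return time.monotonic() - start, False

samples = []
for _ in range(600):                     # e.g., 10 minutes at ~1 request/sec
    latency, ok = timed_get(URL)
    samples.append((time.time(), latency, ok))
    time.sleep(max(0.0, 1.0 - latency))  # keep the request rate steady
```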

  13. Test Setup • [Diagram: test clients send page requests through an emulated switch to the PRESS web server nodes running under Mendosus; the server nodes exchange caching info among themselves and return page responses] • Normal version: cooperative caching • HA version: cooperative caching + heartbeat monitoring

  14. Effect of Component Failure on Performability Metrics • [Figure: a performability metric plotted over time, with FAILURE and REPAIR events marked; one curve for throughput and one for latency]

  15. Observations • Below saturation, throughput is more sensitive to load than latency is • Above saturation, latency is more sensitive to load • [Figure: measurements over five time intervals, e.g., throughput 3/s at latency 0.14 s, throughput 6/s at latency 0.14 s, throughput 7/s at latency 0.4 s]

  16. How to Represent Latency? • Average response time over a given time period • Make a distinction between “render time” & “time to first byte”? • Deviation from baseline latency • Impose a greater penalty for deviations toward longer wait times?
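
One way to make the deviation-from-baseline idea concrete is an asymmetrically weighted score. This is a sketch under assumed weights, not a scheme proposed in the slides:

```python
# Mean deviation from a baseline latency, penalizing slow responses more.
def latency_score(samples, baseline, slow_weight=2.0, fast_weight=1.0):
    """samples: observed latencies in seconds. Deviations toward longer
    wait times count slow_weight times; faster-than-baseline responses
    count fast_weight times (weights are illustrative assumptions)."""
    total = 0.0
    for s in samples:
        dev = s - baseline
        total += slow_weight * dev if dev > 0 else fast_weight * -dev
    return total / len(samples)

# A single 0.9 s spike dominates the score against a 0.14 s baseline.
print(latency_score([0.12, 0.14, 0.90], baseline=0.14))
```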

  17. Response Time with Load-Shedding Policy • [Figure: response time (sec) over time, with FAILURE and REPAIR events marked; an abandonment threshold at 8 s sits above a lower load-shedding threshold, and X users get a “server too busy” msg while load is shed]

  18. Load Shedding Issues • Load shedding means returning 0% data quality – a different kind of performability metric • To combine load shedding and latency, define a “demerit” system: • “server too busy” msg – 3 demerits • 8 sec response time – 1 demerit/sec • Such systems quickly lose generality, however
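
A sketch of that demerit scoring, under one plausible reading of the weights above (3 demerits per “server too busy” message, 1 demerit per second of response time):

```python
# Demerit scoring: combines load shedding and latency into one number.
def demerits(responses):
    """responses: list of (latency_seconds, shed) pairs, where shed=True
    means the client got a "server too busy" message instead of a page."""
    total = 0.0
    for latency, shed in responses:
        if shed:
            total += 3.0             # "server too busy" msg: 3 demerits
        else:
            total += 1.0 * latency   # 1 demerit per second of response time
    return total

print(demerits([(0.2, False), (8.0, False), (0.0, True)]))   # -> 11.2
```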

  19. Further Work • Collect more experimental results! • Compare throughput-based and latency-based results for the normal and high-availability versions of PRESS • Evaluate the usefulness of “demerit” systems for describing the user experience (latency and data quality)

  20. Latency as a Performability Metric for Internet Services Pete Broadwell pbwell@cs.berkeley.edu
