Reference-Driven Performance Anomaly Identification University of Rochester Kai Shen, Christopher Stewart, Chuanpeng Li, and Xin Li SIGMETRICS 2009
Performance Anomalies • Complex software systems (like operating systems and distributed systems) have: • Many system features and configuration settings • Wide-ranging workload behaviors and concurrency • Interactions among all of the above • Performance anomalies: • Performance well below expectation • Caused by implementation errors, mis-configurations, mis-managed interactions, … • Anomalies degrade system performance and make system behavior undependable
An Example Identified by Our Research • Linux anticipatory I/O scheduler • HZ is the number of timer ticks per second, so (HZ/150) ticks is about 6.7ms • However, the integer division is inaccurate: • HZ defaults to 1000 in earlier Linux versions, so the anticipation timeout is 6 ticks (6ms) • It defaults to 250 in Linux 2.6.23, so the timeout becomes a single tick (4ms) • Premature timeouts lead to additional disk seeks /* max time we may wait to anticipate a read (default around 6ms) */ #define default_antic_expire ((HZ / 150) ? HZ / 150 : 1)
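The integer-division effect above can be checked directly. This is a small sketch that mirrors the kernel macro in Python (function names are ours, not the kernel's); `//` reproduces C's truncating integer division, and one tick lasts 1000/HZ milliseconds.

```python
def antic_expire_ticks(hz):
    """Mirror of ((HZ / 150) ? HZ / 150 : 1) with C-style integer division."""
    return (hz // 150) if (hz // 150) else 1

def antic_expire_ms(hz):
    """Convert the timeout from ticks to milliseconds (one tick = 1000/HZ ms)."""
    return antic_expire_ticks(hz) * 1000 / hz

print(antic_expire_ms(1000))  # 6.0 ms -- close to the intended ~6.7ms
print(antic_expire_ms(250))   # 4.0 ms -- a single coarse tick, so the actual
                              # anticipation wait can expire prematurely
```

With HZ=250 the timeout is not just shorter; it is a single tick, so depending on where the current tick boundary falls, anticipation can time out almost immediately, triggering the extra disk seeks described above.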
Challenges and Goals • Challenges: • Often involves the semantics of multiple system components • No obvious failure symptoms; normal performance isn't always known or even clearly defined • Performance anomaly reports are relatively rare: • Only 4% of resolved Linux 2.4/2.6 I/O bugs are performance-oriented • Goals: • Systematic techniques to identify performance anomalies and improve performance dependability • Consider wide-ranging configurations and workload conditions
Reference-driven Anomaly Identification • Given two executions T (target) and R (reference): • If T performs much worse than R, against expectation, we identify T as anomalous relative to R • How do we systematically derive the expectations?
Change Profiles • Goal – derive the expected performance deviation between reference and target (e.g., under a change of system parameters) • Approach – inference from real system measurements • Change profile – a probabilistic distribution of performance deviations • Example: p-value(–0.5) = 0.039, i.e., under the profile, a deviation of –50% or worse has probability 0.039
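The change-profile idea can be sketched empirically. This is an illustrative sketch, not the paper's exact procedure: the deviation samples are synthetic (drawn from a hypothetical distribution), and the p-value is the empirical probability of a deviation at least as negative as the one observed.

```python
import random

random.seed(0)
# Hypothetical measured performance deviations between execution pairs
# that differ in a single parameter (relative change; 0.0 = no change).
deviations = [random.gauss(0.0, 0.2) for _ in range(1000)]

def p_value(observed, profile):
    """Fraction of the profile at least as negative as the observed
    target-vs-reference deviation."""
    return sum(1 for d in profile if d <= observed) / len(profile)

p = p_value(-0.5, deviations)
print(p)  # a small value: a -50% deviation is improbable under this profile
```

A small p-value says the observed slowdown is improbable under the expected-deviation profile, which is exactly the anomaly signal used later (p-value ≤ 0.05 in the evaluation).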
Scalable Anomaly Quantification • Approach: • Construct single-parameter profiles through real system measurements • Analytically synthesize multiple single-parameter profiles for scalability • Convolution-like synthesis • Assumes independent performance effects of different parameters • Assembles the multi-parameter performance deviation distribution by convolving single-parameter change profiles • Generally applicable bounding analysis • Bounds the multi-parameter anomaly p-value from single-parameter p-values (no parameter independence needed) • Finds a tight bound (small p-value) through a Monte Carlo method
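The convolution synthesis can be illustrated on discretized profiles. This sketch uses two hypothetical single-parameter profiles (Gaussian-shaped histograms of our own choosing) and assumes, as the technique does, that the parameters' performance effects are independent and additive, so the combined deviation distribution is their convolution.

```python
import numpy as np

bins = np.linspace(-1.0, 1.0, 201)              # deviation axis, step 0.01
profile_a = np.exp(-(bins / 0.1) ** 2)          # hypothetical single-parameter
profile_b = np.exp(-((bins + 0.2) / 0.15) ** 2) # change profiles (histograms)
profile_a /= profile_a.sum()
profile_b /= profile_b.sum()

# Under independent, additive effects, the multi-parameter deviation
# distribution is the convolution of the single-parameter profiles.
combined = np.convolve(profile_a, profile_b)
combined /= combined.sum()                      # support now spans [-2, 2]

def p_value(threshold, profile, lo=-2.0, hi=2.0):
    """Probability mass of the profile at or below the threshold deviation."""
    axis = np.linspace(lo, hi, len(profile))
    return profile[axis <= threshold].sum()

print(p_value(-0.5, combined))  # small: -50% is anomalous under this synthesis
```

The separate bounding analysis exists precisely because this independence assumption can fail; the slides' evaluation shows convolution flags more anomalies but with more false positives.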
Evaluation • Linux I/O case study: • Five workload parameters and three system configuration parameters • Performance measured over 300 sampled executions, which serve as each other's references for anomaly identification • Anomalies are target executions with p-values of 0.05 or less • Validated through cause analysis; an identification without a validated cause counts as a probable false positive • Results: • Linux 2.6.10 – 35 identified; 34 validated; 1 probable false positive • Linux 2.6.23 – 12 identified; 9 validated; 3 probable false positives • Linux 2.6.23 (target) vs. 2.6.10 (reference) – 15 identified; all validated
Comparison • Bounding analysis for multi-parameter anomaly quantification • Convolution synthesis assuming parameter independence • Ranking target-reference anomalies by raw performance difference • Convolution identifies more anomalies, but with more false positives
Anomaly Cause Analysis • Given the symptom (anomalous performance degradation from reference to target), root cause analysis is still challenging • The root cause sometimes lies in complex component interactions • The most useful hints often relate to low-level system activities • Efficient mechanisms exist to acquire a large number of system metrics (some anomaly-related), but they are difficult to sift through • Approach: reference-driven filtering of anomaly-related metrics • Compare the metric manifestations of an anomalous target and its normal reference • Metrics that differ significantly may be anomaly-related
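The filtering step above can be sketched as a ranking problem. All metric names and values here are hypothetical, and the simple relative-difference score stands in for whatever distribution-level comparison is actually used; the point is only the reference-driven idea: metrics that diverge most between target and reference float to the top.

```python
# Hypothetical per-execution metric summaries (events/sec, bytes, etc.).
target_metrics = {"antic_timeout_rate": 150.0,
                  "ctx_switch_rate": 1200.0,
                  "scsi_req_size": 126.0}
reference_metrics = {"antic_timeout_rate": 4.0,
                     "ctx_switch_rate": 1150.0,
                     "scsi_req_size": 128.0}

def divergence(t, r):
    """Relative difference between target and reference manifestations."""
    return abs(t - r) / max(abs(r), 1e-9)

ranked = sorted(target_metrics,
                key=lambda m: divergence(target_metrics[m],
                                         reference_metrics[m]),
                reverse=True)
print(ranked)  # the most-divergent metrics come first
```

In the case result below, exactly this kind of ranking surfaces the anticipatory-timeout metrics for the HZ bug.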
System Events and Metrics Traced Events: • Process management – creation of a kernel thread; process fork or clone; process exit; process wait; process signal; wake up a process; CPU context switch • System call – enter a system call; exit a system call • Memory system – allocating pages; freeing pages; swapping pages in; swapping pages out • File system – file exec; file open; file close; file read; file write; file seek; file ioctl; file prefetch operation; start waiting for a data buffer; finish waiting for a data buffer • I/O scheduling – I/O request arrival at the block level; re-queue an I/O request; dispatch an I/O request; remove an I/O request; I/O request completion • SCSI device – SCSI read request; SCSI write request • Interrupt – enter an interrupt handler; exit an interrupt handler • Network socket – socket call; socket send; socket receive; socket creation
Derived System Metrics: • Inter-arrival time of each type of event • Delays between causal events – delay between a system call enter and exit; delay between file system buffer wait start and end; delay between a block-level I/O request arrival and its dispatch; delay between a block-level I/O request dispatch and its completion • Event parameters – file prefetch size; SCSI I/O request size; file offset of each I/O operation to a block device • I/O concurrency – at the system call level; at the block level; at the SCSI device level
Up to 1,361 metrics in Linux 2.6.23
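Two of the derived-metric families above (inter-arrival times, and delays between causal event pairs) can be sketched from a raw event trace. The trace below is hypothetical; timestamps are in milliseconds and event names are illustrative, not the tracer's actual identifiers.

```python
events = [  # (timestamp_ms, event_name, request_id)
    (0.0, "io_request_arrival", 1),
    (0.5, "io_request_dispatch", 1),
    (3.2, "io_request_arrival", 2),
    (3.4, "io_request_completion", 1),
    (9.1, "io_request_dispatch", 2),
]

# Inter-arrival times of one event type.
arrivals = [t for t, name, _ in events if name == "io_request_arrival"]
inter_arrival = [b - a for a, b in zip(arrivals, arrivals[1:])]

# Delay between causal events: block-level request arrival -> dispatch,
# matched per request id.
arrival = {rid: t for t, name, rid in events if name == "io_request_arrival"}
dispatch = {rid: t for t, name, rid in events if name == "io_request_dispatch"}
arrival_to_dispatch = {rid: dispatch[rid] - arrival[rid]
                       for rid in arrival if rid in dispatch}

print(inter_arrival)
print(arrival_to_dispatch)
```

Applying reductions like these across every traced event type and causal pair is how the count grows to the ~1,361 metrics cited for Linux 2.6.23.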
A Case Result • Anomaly cause: incorrect timeout setting when timer ticks per second (HZ) changed from 1000 to 250 in Linux 2.6.23 • Top-ranked metrics – anticipatory I/O timeouts and anticipation breaks #define default_antic_expire ((HZ / 150) ? HZ / 150 : 1)
Effects of Anomaly Corrections • Anomaly corrections lead to predictable performance behavior patterns
Related Work • Peer differencing for debugging • Delta debugging [Zeller'02]: differencing program runs over varied inputs • PeerPressure [Wang et al.'04]: differencing Windows registry settings • Triage [Tucek et al.'07]: differencing basic-block execution frequencies → These target program/system failures, where failure symptoms are easily identifiable and correct peers are presumably known • Our performance anomaly identification • Challenge: both anomalous and normal performance behaviors are hard to identify in complex systems • Key contribution: scalable construction of performance deviation profiles
Summary • Principled use of references in performance anomaly identification • Scalable construction of performance deviation profiles to identify anomaly symptoms • Target-reference differencing of system metric manifestations to help identify anomaly causes • Identified real performance problems in Linux and a J2EE-based distributed system