
Exploring the Potential of Performance Monitoring Hardware to Support Run-time Optimization

Exploring the Potential of Performance Monitoring Hardware to Support Run-time Optimization. Alex Shye, M.S. Thesis Defense. Committee: Daniel A. Connors, Andrew R. Pleszkun, and Manish Vachharajani. University of Colorado at Boulder, Department of Electrical and Computer Engineering.


Presentation Transcript


  1. Exploring the Potential of Performance Monitoring Hardware to Support Run-time Optimization • Alex Shye, M.S. Thesis Defense • Committee: Daniel A. Connors, Andrew R. Pleszkun, and Manish Vachharajani • University of Colorado at Boulder, Department of Electrical and Computer Engineering • DRACO Architecture Research Group

  2. Thesis Statement • Hardware Performance Monitoring (HPM) can be utilized to provide a low-overhead alternative to current techniques for profiling run-time code behavior.

  3. Introduction • Profile information is critical to the success of profile-based optimizations • Point Profile - basic block (BB) count, edge profile, etc. • Path Profile - correlated points • Off-line Path Profiling Methods: • Use static/dynamic instrumentation to gather a full path profile • On-line Path Profiling Method: • Interpretation and MRET • Both incur high overhead!! • Slowdown of 2-3x with Pin for BB counting [Figure: example control flow graph with blocks A-G and edge weights. Edge Profile: ABDFG 70-50. Path Profile: ABDFG 60, ACDFG 10, …]

  4. Performance Monitoring • HPM through on-chip Performance Monitoring Units (PMUs) • Itanium, Pentium 4, PowerPC • Coarse-grained, fine-grained features • Obstacles to PMU profiling • Non-deterministic (sampling) • Sample aliasing • Less information • Compiler analysis can extend PMU information!!! • Goal: Use branch vectors sampled by the PMU to derive a path profile comparable to software path profiling techniques. [Table: Itanium-2 PMU Features]

  5. Contributions • Characterize the information provided by PMU sampling of branch vectors • Characterize the effect of compiler analysis on PMU information • Demonstrate the construction of a PMU-based path profiler

  6. PMU Profiling Framework • Terminology • Branch Vector: Series of addresses from the Branch Trace Buffer (BTB) • Partial Path: Path of ops in compiler IR [Figure: framework diagram with Online and Offline stages. Labels: Annotated Binary, PMU, perfmon interface, Kernel Buffer, Interrupt on kernel buffer overflow, Branch Vectors …, Branch Vector Hash Table, Intermediate File, Compiler Analysis, Address Map, Partial Path Extensions, Dominator Analysis, Partial Paths, Path Profile Generation, Profile Information]

  7. PMU Configuration • Itanium-2 PMU BTB masks • Taken Mask (All, Taken, Not Taken, None) • Predicted Target Address Mask (All, Correct, Incorrect, None) • Predicted Predicate Mask (All, Correct, Incorrect, None) • Branch Type Mask (All, Indirect, Return, IP-relative) • Configuration depends on the goal • Branch prediction performance? Building a call graph? • PMU configured to sample only taken branches for path information • Not-taken branches can be inferred from the control flow graph

  8. Partial Path Extensions • Compiler view of CFG can be used to extend paths • Extend until point of uncertainty • Up until Join Point • Down until Branch Point [Figure: CFG showing BTB branch vector 1-2-3-4, the partial path from the branch vector, and the extended partial path, with the join point and branch point marked]
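The extension rule on this slide can be sketched in a few lines. This is a minimal illustration, not the thesis implementation: the CFG is assumed to be a dictionary mapping each block to its successor list, and the names `predecessors` and `extend_partial_path` are hypothetical. A path grows upward while its first block has exactly one predecessor (it stops at a join point) and downward while its last block has exactly one successor (it stops at a branch point).

```python
def predecessors(succs):
    """Build a predecessor map from a successor map (hypothetical helper)."""
    preds = {n: [] for n in succs}
    for n, targets in succs.items():
        for s in targets:
            preds.setdefault(s, []).append(n)
    return preds


def extend_partial_path(path, succs):
    """Extend a partial path up to a join point and down to a branch point."""
    preds = predecessors(succs)
    path = list(path)
    # Upward: a single predecessor must have executed just before the head.
    while len(preds.get(path[0], [])) == 1:
        p = preds[path[0]][0]
        if p in path:  # do not walk around a loop forever
            break
        path.insert(0, p)
    # Downward: a single successor must execute right after the tail.
    while len(succs.get(path[-1], [])) == 1:
        s = succs[path[-1]][0]
        if s in path:
            break
        path.append(s)
    return path


# Assumed example CFG: B is a branch point, E is a join point, F is a branch point.
cfg = {"A": ["B"], "B": ["C", "D"], "C": ["E"], "D": ["E"],
       "E": ["F"], "F": ["G", "H"], "G": [], "H": []}
print(extend_partial_path(["C", "E"], cfg))
```

In this example the sampled path C-E grows upward through B to A and downward to F, where the two-way branch makes the next block uncertain.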

  9. Dominator Analysis • Dominator Analysis • Finds all blocks guaranteed to execute • Partial Path Extensions • Subset of dominator analysis • Constrained to a path • Terminology • Dominator: u dominates v if all paths from Entry to v include u • Post-Dominator: u post-dominates v if all paths from v to Exit include u [Figure: CFG showing BTB branch vector 1-2-3-4, the partial path from the branch vector, and the basic blocks added with dominator analysis, with the join point and branch point marked]
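The two definitions above can be turned into a small coverage sketch. It uses the textbook iterative fixed-point algorithm for dominators, which is an assumption; the slides do not say which algorithm was used. Post-dominators are computed by running the same routine on the reversed CFG from the exit block. The function names and the example CFG are hypothetical, and every block is assumed reachable from the entry.

```python
def dominators(succs, entry):
    """Return {block: set of blocks dominating it}, iterating to a fixed point."""
    nodes = set(succs) | {s for targets in succs.values() for s in targets}
    preds = {n: set() for n in nodes}
    for n, targets in succs.items():
        for s in targets:
            preds[s].add(n)
    dom = {n: set(nodes) for n in nodes}  # start from "everything dominates"
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            incoming = [dom[p] for p in preds[n]]
            new = {n} | (set.intersection(*incoming) if incoming else set())
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom


def reverse(succs):
    """Flip every edge so post-dominators can reuse dominators()."""
    nodes = set(succs) | {s for targets in succs.values() for s in targets}
    rev = {n: [] for n in nodes}
    for n, targets in succs.items():
        for s in targets:
            rev[s].append(n)
    return rev


def guaranteed_blocks(sampled, succs, entry, exit_block):
    """Blocks that must execute whenever the sampled blocks execute."""
    dom = dominators(succs, entry)
    pdom = dominators(reverse(succs), exit_block)
    covered = set()
    for b in sampled:
        covered |= dom[b] | pdom[b]
    return covered


# Assumed diamond-shaped CFG with entry A and exit G.
cfg = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E", "F"],
       "E": ["G"], "F": ["G"], "G": []}
print(sorted(guaranteed_blocks(["B"], cfg, "A", "G")))
```

A single sample of block B here also credits A (a dominator) and D and G (post-dominators), which is how dominator analysis can report more covered code than the raw branch vector alone.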

  10. Path Profile Generation • Combine compiler analysis and PMU branch vectors to generate a path profile comparable to software path profiling techniques • Issues: • The path of a branch vector is inherently different from a compiler-based path • Random start and end of path - path ambiguity • Spans boundaries that compiler-based paths do not • Number of paths increases exponentially • Must map PMU paths to compiler paths • Region Formation • Split partial paths • Path Matching • Path Crediting [Figure labels: BTB Trace, Hot Path]

  11. Region Formation • Use region-based paths • Makes total # paths more manageable • Functions can be large • Create loop-based regions • Programs spend most of time in loops • Rules for Region R: • R must be single entry • R may not cross function boundaries • R may not cross loop boundaries [Figure: CFG with blocks A-Y partitioned into Region 1, Region 2, and Region 3]

  12. Path Matching and Crediting • Path Matching • Find list of all paths that contain partial path • Path Crediting • Distribute partial path weight equally among matched paths • Ex. ABDLMOP, ABDEFHIK, OPRSUVX [Figure: CFG with blocks A-Y partitioned into Region 1, Region 2, and Region 3]
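Matching and equal crediting can be sketched as follows. The region paths and sample counts below are made up for illustration, block names are assumed to be single letters so that a path can be written as a string, and `credit_paths` is a hypothetical name rather than a function from the thesis tool.

```python
from collections import defaultdict


def credit_paths(region_paths, partial_counts):
    """Split each partial path's weight equally among the region paths containing it."""
    profile = defaultdict(float)
    for partial, weight in partial_counts.items():
        # Path matching: every region path that contains the partial path.
        matches = [p for p in region_paths if partial in p]
        # Path crediting: equal share of the sampled weight per match.
        for p in matches:
            profile[p] += weight / len(matches)
    return dict(profile)


# Hypothetical region paths and sampled partial-path counts.
region_paths = ["ABDFG", "ACDFG", "ABDEG"]
partial_counts = {"BD": 6, "DFG": 4, "ACD": 2}
print(credit_paths(region_paths, partial_counts))
```

A partial path that matches no region path contributes nothing in this sketch; how the actual profiler treats such samples is not stated on the slide.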

  13. Methodology • Experiments run on an Itanium-2 with a Linux 2.6.10 kernel • Developed a tool using the perfmon kernel interface and libpfm-3.1 to interface with the PMU • Benchmarks • Set of SPEC2000 benchmarks • Compiled with the OpenIMPACT Research Compiler • Compared to a full path profile gathered with a Pin path profiling tool

  14. Effect of Sampling Period • Sampling Overhead due to: • Periodic interrupt, copying between buffers, hash table insertion

  15. PMU vs Actual Instruction Distribution • Kullback-Leibler Divergence (Relative Entropy) • $d = \sum_{k \geq 0} p_k \log_2 (p_k / q_k)$ • Relative measure of distance between two distributions
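As a worked illustration of the divergence formula, the sketch below compares two instruction distributions given as per-instruction counts. Which distribution plays the role of p and which of q is not stated on the slide, so the argument order here is an assumption, as are the function names and the sample counts.

```python
import math


def normalize(counts):
    """Turn raw execution counts into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]


def kl_divergence(p, q):
    """Sum of p_k * log2(p_k / q_k); terms with p_k == 0 contribute nothing."""
    d = 0.0
    for pk, qk in zip(p, q):
        if pk == 0:
            continue
        if qk == 0:
            return math.inf  # q assigns no mass where p does
        d += pk * math.log2(pk / qk)
    return d


# Hypothetical per-instruction counts from a full profile and a sampled profile.
actual = normalize([50, 30, 20])
sampled = normalize([40, 40, 20])
print(kl_divergence(actual, sampled))
```

The result is 0 when the two distributions are identical and grows as they diverge. It is not symmetric, so swapping the arguments generally gives a different value.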

  16. Code Coverage • Explore how PMU branch vectors translate to code coverage information • Code Coverage Types • Single BB: Simulates PC-sampling • Branch Vectors • Branch Vectors w/ Dom. Analysis • Coverage percentage is the percentage of actually executed code that is discovered with compiler-aided analysis of branch vectors [Table: Number of Instructions and Actual Code Covered]

  17. Code Coverage

  18. Hot Instruction Thresholds • For top 10-30% of instructions, code coverage does well (80-100%) • Drops off at around 40-50% of hot instructions

  19. Stability • Across 20 runs, PMU code coverage varies ~5-10%

  20. Multiple Runs • Regular Sampling: 1) gzip, parser, twolf improve greatly • Randomized Sampling may discover code regular sampling cannot

  21. Partial Path Characteristics • Partial path extensions increase length by ~20% • However, splitting drastically decreases lengths • ~30% on function boundaries, ~20% more on loop back edges

  22. Accuracy Results • Accuracy measured similar to Wall’s weight matching scheme [Wall91] • Threshold = 0.125%
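The slide only says that accuracy is measured "similar to" weight matching, so the sketch below shows one common formulation rather than the exact metric used in the thesis. It assumes a path is hot when it carries at least the threshold fraction of its profile's total flow, and scores the estimated profile by how much of the actual hot-path flow its own hot set captures. The function name and the sample profiles are hypothetical.

```python
def weight_matching_accuracy(actual, estimated, threshold=0.00125):
    """Fraction of actual hot-path flow covered by the estimated hot set.

    threshold is a fraction of total flow; 0.00125 corresponds to 0.125%.
    """
    def hot(profile):
        total = sum(profile.values())
        return {path for path, w in profile.items() if w / total >= threshold}

    hot_actual = hot(actual)
    hot_estimated = hot(estimated)
    covered = sum(actual[p] for p in hot_actual & hot_estimated)
    return covered / sum(actual[p] for p in hot_actual)


# Hypothetical path profiles: a full software profile and a PMU-derived estimate.
actual = {"P1": 60, "P2": 30, "P3": 10}
estimated = {"P1": 50, "P3": 50}
print(weight_matching_accuracy(actual, estimated))
```

The score uses the actual weights of the paths both profiles agree are hot, so an estimate that misses a heavy path is penalized more than one that misses a light path.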

  23. Conclusion • Motivates and presents initial results and rationale for PMU-based profiling • Characterizes branch vector sampling • Improves code coverage by > 50% over PC-sampling • Branch vector paths are inter-procedural • Characterizes effect of compiler analysis • Partial path extensions increase length by ~20% • Dominator analysis on branch vectors improves code coverage by > 50% • Demonstrates construction of a PMU-based path profiler • ~85% accurate at 1% overhead (at sampling period of 5M) • Questions?
