CSc 352 Performance Tuning
Saumya Debray
Dept. of Computer Science
The University of Arizona, Tucson
debray@cs.arizona.edu
Background
• Performance tuning: modifying software to make it more efficient
  • often the performance metric is execution speed
  • other metrics are also possible, e.g., memory footprint, response time, energy efficiency
• How to get performance improvements
  • “system tweaking” (e.g., compiler optimizations) can get some improvement; typically this is relatively small
  • most large improvements are algorithmic in nature
    • needs active and focused human intervention
    • requires data to identify where to focus efforts
When to optimize?
• Get the program working correctly
  • calculating incorrect results quickly isn’t useful
  • “premature optimization is the root of all evil” – Knuth (?)
• Determine whether performance is adequate
  • optimization is unnecessary for many programs
• Figure out what code changes are necessary to improve performance
• Be cognizant of the possibility that performance tuning may be necessary later on
  ► design and write the program with this in mind
Compiler optimizations
• Invoked using compiler options, e.g., “gcc -O2”
  • usually several different levels supported (gcc: -O0 … -O3)
  • may also allow fine-grained control over code optimization
    • gcc supports ~200 optimization-related command-line options
• They address machine-level inefficiencies, not algorithm-level inefficiencies
  • e.g., gcc optimizations improve hardware register usage…
  • … but not sequential search over a long linked list
• Significant performance improvements usually need human intervention
Example
• about 10% improvement overall
  • not atypical; possible to do better
• compiler optimization effect is small if either:
  • code is already highly optimized; or
  • algorithm is lousy
Where to optimize?
• Consider a program with this execution time distribution:
  [figure: execution time distribution by function, not reproduced]
• doubling the speed of func3: overall improvement = 5%
• doubling the speed of func1: overall improvement = 30%
• focusing on func1 gives better results for the time invested
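The figure with the actual distribution is not reproduced here, but the quoted numbers are consistent with func1 taking about 60% of the running time and func3 about 10% (an inference: doubling the speed of code that accounts for a fraction f of the time saves f/2 of the total). A minimal sketch of that arithmetic (Amdahl's law); the function names are hypothetical:

```c
/* Fraction of the original running time that remains after the part of
   the program accounting for `fraction` of the time is made `speedup`
   times faster (Amdahl's law). */
double remaining_time(double fraction, double speedup) {
    return (1.0 - fraction) + fraction / speedup;
}

/* Overall improvement, expressed as a fraction of the original time. */
double overall_improvement(double fraction, double speedup) {
    return 1.0 - remaining_time(fraction, speedup);
}
```

Up to floating-point rounding, overall_improvement(0.6, 2.0) is 0.30 and overall_improvement(0.1, 2.0) is 0.05, matching the 30% and 5% above. Even an infinitely fast func3 could save at most 10% under these assumed fractions.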
Profiling tools
• These are tools that:
  • monitor the program’s execution at runtime
  • give data on how often functions are called and where the program spends its time
  • provide guidance on where to focus one’s efforts
• Many different tools are available; we’ll focus on:
  • gprof: works with gcc
  • gcov: already used in connection with code coverage
  • kcachegrind: visualizes profiles collected with valgrind (callgrind tool)
Using gprof
• Compile using “gcc -pg”
  • this adds some book-keeping code, so the program will run a little slower
• Run this executable, say a.out, on “representative” inputs
  • creates a data file “gmon.out”
• Run “gprof a.out”
  • extracts information from gmon.out
  • “flat profile”: time and #calls info per function
  • “call graph”: time and #calls per function, broken down by each place where the function is called
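Using gprof: example
[figure: sample gprof flat profile with annotated columns]
• % time: % of time spent in each function
• self seconds: time accounted for by each function alone
• calls: no. of times called
• self μs/call: ave. time per call spent in the function
• total μs/call: ave. time per call spent in the function and its descendants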
Using the profile information
• Expect %time and self seconds to correlate
• If self μs/call is high [or: self seconds is high and calls is low]:
  • each call is expensive; overhead is due to the code for the function
• If calls is high and self μs/call is low:
  • each call is inexpensive; overhead is mainly due to the no. of function calls
• If self μs/call is low and total μs/call is high:
  • each call is expensive, but the overhead is due to some descendant routine
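A hypothetical workload with one function for each of the three patterns above. To profile it, one would call run_workload() from a main() of one's own, compile with gcc -pg and no -O (so the small functions are not inlined away), run a.out, then run gprof a.out. The exact numbers depend on the machine; the sketch only shows which shape of code tends to produce which profile pattern.

```c
#include <stdint.h>

/* Pattern 1: few calls, each expensive -> high self us/call. */
uint64_t expensive_few(uint64_t seed) {
    uint64_t acc = seed;
    for (long i = 0; i < 5000000; i++)
        acc = acc * 6364136223846793005ULL + 1442695040888963407ULL;
    return acc;
}

/* Pattern 2: many calls, each cheap -> high calls, low self us/call. */
uint64_t cheap_many(uint64_t x) {
    return x ^ (x >> 7);
}

/* Pattern 3: cheap body, expensive descendant ->
   low self us/call, high total us/call. */
uint64_t wrapper(uint64_t seed) {
    return expensive_few(seed) + 1;
}

uint64_t run_workload(void) {
    uint64_t acc = 0;
    for (uint64_t i = 0; i < 3; i++)
        acc += expensive_few(i);
    for (uint64_t i = 0; i < 10000000; i++)
        acc += cheap_many(i);
    acc += wrapper(42);
    return acc;
}
```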
Examining the possibilities 1
• Code for the function is expensive [self μs/call high]
  • need to get a better idea of where time is being spent in the function body
  • may help to pull parts of the function body into separate functions
    • allows more detailed profile info
    • can be “inlined back” after performance optimization
• Optimization approach:
  • reduce the cost of the common-case execution path through the function
Examining the possibilities 2
• No. of calls to a function is the problem [calls is high but self μs/call is low]:
  • need to reduce the number/cost of calls
• Possible approaches:
  • [best] avoid the call entirely whenever possible, e.g.:
    • use hashing to reduce the set of values to be considered; or
    • see if the call can be avoided in the common case (e.g., by maintaining some extra information)
  • reduce the cost of making the call
    • inline the body of the called function into the caller
Examining the possibilities
• Often, performance improvement will involve a tradeoff. E.g.:
  • transform linear search into binary search:
    • reduces the no. of values considered in the search
    • requires sorting
  • transform a simple linked list into a hash table:
    • reduces the no. of values considered when searching
    • requires more memory (hash table) and some computation (hash values)
• Need to be aware of this tradeoff
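A minimal sketch of the linear-to-binary-search tradeoff using the C standard library's qsort() and bsearch(); the wrapper function names are hypothetical. The sort is a one-time O(n log n) cost, which only pays off if it is amortized over many O(log n) lookups.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Linear search: may examine all n items on every lookup. */
bool linear_contains(const int *a, size_t n, int key) {
    for (size_t i = 0; i < n; i++)
        if (a[i] == key)
            return true;
    return false;
}

static int cmp_int(const void *p, const void *q) {
    int x = *(const int *)p, y = *(const int *)q;
    return (x > y) - (x < y);   /* avoids the overflow risk of x - y */
}

/* The price of the tradeoff: the data must be sorted first. */
void prepare_for_binary_search(int *a, size_t n) {
    qsort(a, n, sizeof a[0], cmp_int);
}

/* Binary search: examines O(log n) items, but needs sorted input. */
bool binary_contains(const int *sorted, size_t n, int key) {
    return bsearch(&key, sorted, n, sizeof sorted[0], cmp_int) != NULL;
}
```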
Approaching performance optimization
• Different problems may require very different solutions
• Essential idea:
  • avoid unnecessary work whenever possible
  • prefer cheap operations to expensive ones
• Apply these ideas at all levels:
  • library routines used
  • language-level operations (e.g., function calls vs. macros)
  • higher-level algorithms
Optimization 1: Filtering
• Useful when:
  • we are searching a large collection of items, most of which don’t match the search criteria
  • determining whether a particular item matches is expensive
  • there is a (relatively) cheap check that is satisfied only if an item does not match
• What we do:
  • use the cheap check to quickly disqualify items that won’t match
  • effectiveness depends on how many items get disqualified
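A minimal sketch of filtering a string search, assuming (hypothetically) that each key is stored together with its precomputed length. Equal strings must have equal lengths, so a length mismatch is a cheap check that disqualifies an item before the more expensive strcmp() runs. How much this saves depends, as noted above, on how many items the length check rejects.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical record: the key's length is computed once, when stored. */
struct entry {
    const char *key;
    size_t len;
};

/* Returns the index of target in items, or -1 if it is absent. */
long find_filtered(const struct entry *items, size_t n, const char *target) {
    size_t tlen = strlen(target);
    for (size_t i = 0; i < n; i++) {
        if (items[i].len != tlen)
            continue;                       /* cheap check: cannot match */
        if (strcmp(items[i].key, target) == 0)
            return (long)i;                 /* full (expensive) test */
    }
    return -1;
}
```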
Filtering
• Hashing
  • particularly useful for strings (but not restricted to them)
  • can give order-of-magnitude performance improvements
  • sensitive to the quality of the hash function
• Binary search
  • knowing that the data items are sorted allows us to quickly exclude many of them that won’t match
Filters can apply to complex structures
• In a research project, we were searching through a large no. of code fragments looking for repetition:
  • code in the compiler’s internal form (directed graph), not source code
  • we used a 64-bit “fingerprint” for each code region:
    • 48 bits: type and size of the first 8 code blocks in the region (6 bits per block: 2 bits for type, 4 bits for no. of instrs)
    • 16 bits: size of region
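A sketch of how such a fingerprint might be packed into a uint64_t. The struct, the bit layout, and clamping instruction counts above 15 are assumptions for illustration, not the project's actual encoding. Identical regions always get identical fingerprints, so only regions whose fingerprints are equal need the expensive graph comparison.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-block summary. */
struct block_info {
    unsigned type;      /* block type, 0..3 (2 bits) */
    unsigned ninstrs;   /* no. of instructions; clamped to 15 (4 bits) */
};

/* Bits 0-15: region size (low 16 bits).
   Bits 16-63: 6 bits per block (2 type + 4 size) for up to 8 blocks. */
uint64_t fingerprint(const struct block_info *blocks, size_t nblocks,
                     unsigned region_size) {
    uint64_t fp = (uint64_t)(region_size & 0xFFFF);
    for (size_t i = 0; i < 8 && i < nblocks; i++) {
        unsigned n = blocks[i].ninstrs > 15 ? 15 : blocks[i].ninstrs;
        uint64_t field = ((uint64_t)(blocks[i].type & 0x3) << 4) | n;
        fp |= field << (16 + 6 * i);
    }
    return fp;
}
```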
Optimization 2: Buffering
• Useful when:
  • an expensive operation is being applied to a large no. of items
  • the operation can also be applied collectively to a group of items
• What we do:
  • collect the items into groups
  • apply the operation to the groups instead of individual items
• Most often used for I/O operations
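A minimal sketch of output buffering: bytes are collected in a fixed-size buffer and handed to fwrite() in groups. The buffer size and all names are assumptions. C's stdio streams already buffer internally, so in practice this pattern matters most when the expensive operation is something like an unbuffered system call (e.g., POSIX write()); fwrite() stands in for it here to keep the sketch portable.

```c
#include <stddef.h>
#include <stdio.h>

#define BUF_CAP 4096            /* assumed buffer size */

struct outbuf {
    FILE *out;
    size_t len;                 /* bytes currently buffered */
    size_t nflushes;            /* times the expensive operation ran */
    char data[BUF_CAP];
};

void outbuf_init(struct outbuf *b, FILE *out) {
    b->out = out;
    b->len = 0;
    b->nflushes = 0;
}

/* Apply the expensive operation once to the whole group. */
void outbuf_flush(struct outbuf *b) {
    if (b->len > 0) {
        fwrite(b->data, 1, b->len, b->out);
        b->nflushes++;
        b->len = 0;
    }
}

/* Add one item; flush only when the buffer is full. */
void outbuf_putc(struct outbuf *b, char c) {
    if (b->len == BUF_CAP)
        outbuf_flush(b);
    b->data[b->len++] = c;
}
```

Writing 10,000 bytes this way (followed by a final outbuf_flush()) performs 3 fwrite() calls instead of 10,000.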
Optimization 3: Precomputation
• Useful when:
  • a result can be computed once and reused many times
  • we can predict which results will be computed
  • we can look up a result cheaply
• What we do:
  • identify operations that get executed over and over
  • compute the result ahead of time and save it
  • use the saved result later in the program
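A minimal sketch of precomputation: a table of bit counts for all 256 byte values is filled once by init_bit_table() (a hypothetical name), after which counting the set bits of a 32-bit word takes four table lookups instead of a loop over 32 bits. The program must call init_bit_table() before the first lookup.

```c
#include <stdint.h>

static unsigned char bits_in_byte[256];   /* precomputed results */

/* Fill the table once; entry i reuses the already-computed entry i/2. */
void init_bit_table(void) {
    for (int i = 0; i < 256; i++)
        bits_in_byte[i] = (unsigned char)((i & 1) + bits_in_byte[i / 2]);
}

/* Count set bits with four cheap lookups. */
unsigned popcount32(uint32_t x) {
    return bits_in_byte[x & 0xFF] + bits_in_byte[(x >> 8) & 0xFF]
         + bits_in_byte[(x >> 16) & 0xFF] + bits_in_byte[x >> 24];
}
```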
Optimization 3: Caching
• Useful when:
  • we repeatedly perform an expensive operation
  • there is a cheap way to check whether a computation has been done before
• What we do:
  • keep a cache of computations and results; reuse a result if it is already in the cache
• Difference from precomputation:
  • caches usually have a limited size
  • the cache may need to be emptied if it fills up
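A minimal sketch of a small, fixed-size cache, assuming a direct-mapped layout: each key maps to exactly one slot, and a new entry simply evicts whatever was there, so the cache never grows. collatz_steps() is only a stand-in for an expensive computation (it requires n ≥ 1), and the cache size is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for an expensive operation: steps for n to reach 1. */
unsigned collatz_steps(uint64_t n) {
    unsigned steps = 0;
    while (n != 1) {
        n = (n % 2 == 0) ? n / 2 : 3 * n + 1;
        steps++;
    }
    return steps;
}

#define CACHE_SLOTS 64          /* assumed (limited) cache size */

struct cache_slot {
    bool valid;
    uint64_t key;
    unsigned value;
};

static struct cache_slot cache[CACHE_SLOTS];
static unsigned long cache_misses;

unsigned cached_collatz_steps(uint64_t n) {
    struct cache_slot *s = &cache[n % CACHE_SLOTS];
    if (s->valid && s->key == n)
        return s->value;        /* cheap check succeeded: reuse result */
    cache_misses++;             /* miss: compute, overwriting the slot */
    s->valid = true;
    s->key = n;
    s->value = collatz_steps(n);
    return s->value;
}
```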
Optimization 4: Using cheaper operations
• Macros vs. functions
  • sometimes it may be cheaper to write a code fragment as a macro than as a function
    • the macro does not incur the cost of function call/return
    • macro arguments may be evaluated multiple times:
        #define foo(x, y, z) …. x …. y … x … y … x… y … z … x … y …
        foo(e1, e2, e3)  →  …. e1 …. e2 … e1 … e2 … e1 … e2 … e3 … e1 … e2 …
• Function inlining
  • conceptually similar to (but slightly different from) macros
  • replace a call to a function by a copy of the function body
  • eliminates function call/return overhead
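A sketch contrasting a macro with a static inline function in C. The helper next_value() is hypothetical and exists only to expose the difference: the macro pastes its argument in twice, so a side effect happens twice, while the inline function evaluates its argument once. Note that inline is a request; the compiler decides whether to actually substitute the body.

```c
/* Macro: no call/return overhead, but x is evaluated twice. */
#define SQUARE_MACRO(x) ((x) * (x))

/* Inline function: the call may be replaced by the body,
   but the argument is still evaluated exactly once. */
static inline int square_inline(int x) {
    return x * x;
}

/* Hypothetical helper with a side effect. */
static int counter = 0;
static int next_value(void) {
    return ++counter;
}
```

With counter starting at 0, SQUARE_MACRO(next_value()) calls next_value() twice and yields 1 × 2 = 2; a following square_inline(next_value()) calls it once and yields 3 × 3 = 9.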
Hashing and Filtering
• Many computations involve looking through data to find those items that have some property:
    for each data item X {
      if (X has property) {
        process X
      }
    }
• This can be expensive if:
  • the no. of items is large; and/or
  • checking for the property is expensive.
  Total cost = no. of data items × cost of checking each item
• Hashing and filtering can be used to reduce the cost of checking.
Filtering: Basic Idea
• Given:
  • a set of items S
  • some property P
• Find:
  • a function h such that
    • h() is easy to compute; and
    • h(x) says something useful about whether x has property P
• Goal: (cheaply) reduce the no. of items to process
Filtering: Examples
• isPrime(n):
  • full test: check for divisors d with 1 < d < n
  • filter: n == 2 or n is odd
    • filters out even numbers > 2
• equality of two strings s1 and s2
  • full test: strcmp(s1, s2) == 0
  • filter: s1[0] == s2[0]
• isDivisibleBy3(n)
• s1 and s2 are anagrams
The filter depends on the property we’re testing!
It must be a necessary condition: (∀x)[full_test(x) ⇒ filter(x)], so any item that fails the filter can be discarded.
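A minimal sketch of isPrime(n) with the cheap filter applied before the full test. For brevity the full test here is trial division by odd d with d·d ≤ n, which gives the same answer as checking every d with 1 < d < n; the function name is hypothetical.

```c
#include <stdbool.h>

bool is_prime(unsigned n) {
    if (n < 2)
        return false;
    /* Filter: cheap necessary condition (a prime is 2 or odd). */
    if (n != 2 && n % 2 == 0)
        return false;
    /* Full test, run only on items that pass the filter.
       unsigned long long keeps d * d from overflowing. */
    for (unsigned long long d = 3; d * d <= n; d += 2)
        if (n % d == 0)
            return false;
    return true;
}
```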
Hashing
• Conceptually related to filtering
• Basic idea: given a set of items S and a property P:
  • use a hash function h() to divide up the set S into a number of “buckets”
    • usually, h() maps S to integers (natural numbers)
    • h(x) == h(y) means x and y are in the same bucket
  • if x and y fall in the same bucket, they may share the property P (need to check)
  • if x and y are in different buckets, they definitely don’t share the property P (no need to check)
Hashing: An Implementation
• compute a hash function h() where
  • h(x) ∈ {0, …, n-1}
• use h() to index into the appropriate bucket
• search/insert in this bucket
[figure: hash table with n buckets (indices 0, 1, 2, …, n-1), each index pointing to a hash bucket]
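A minimal sketch of a chained hash table of strings following the three steps above. The bucket count (101), the djb2-style hash function, and all names are assumptions; the table is a single global for brevity, and entries are never freed.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 101            /* n, the number of buckets (assumed) */

struct node {
    char *key;
    struct node *next;
};

static struct node *table[NBUCKETS];

/* h(x) in {0, ..., n-1}: djb2-style string hash reduced mod n. */
static size_t h(const char *s) {
    unsigned long hash = 5381;
    for (; *s != '\0'; s++)
        hash = hash * 33 + (unsigned char)*s;
    return hash % NBUCKETS;
}

/* Search only the one bucket that could contain s. */
bool ht_contains(const char *s) {
    for (struct node *p = table[h(s)]; p != NULL; p = p->next)
        if (strcmp(p->key, s) == 0)
            return true;
    return false;
}

/* Insert s at the front of its bucket unless already present.
   Returns false only if memory allocation fails. */
bool ht_insert(const char *s) {
    if (ht_contains(s))
        return true;
    struct node *p = malloc(sizeof *p);
    if (p == NULL)
        return false;
    p->key = malloc(strlen(s) + 1);
    if (p->key == NULL) {
        free(p);
        return false;
    }
    strcpy(p->key, s);
    size_t b = h(s);
    p->next = table[b];
    table[b] = p;
    return true;
}
```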
Performance Tuning: Summary
• Big improvements come from algorithmic changes
  • but don’t ignore code-level issues (e.g., cheaper operations)
• Use profiling to understand performance behavior
  • where to focus efforts
  • reasons for performance overheads
• Figure out how to transform the program based on the nature of the overheads
• Good design and modularization are essential