CPE 619 Workloads: Types, Selection, Characterization Aleksandar Milenković The LaCASA Laboratory Electrical and Computer Engineering Department The University of Alabama in Huntsville http://www.ece.uah.edu/~milenka http://www.ece.uah.edu/~lacasa
Part II: Measurement Techniques and Tools Measurements are not to provide numbers but insight - Ingrid Bucher • Measure computer system performance • Monitor the system that is being subjected to a particular workload • How to select an appropriate workload • In general, a performance analyst should know • What are the different types of workloads? • Which workloads are commonly used by other analysts? • How are the appropriate workload types selected? • How is the measured workload data summarized? • How is the system performance monitored? • How can the desired workload be placed on the system in a controlled manner? • How are the results of the evaluation presented?
Types of Workloads benchmark v. trans. To subject (a system) to a series of tests in order to obtain prearranged results not available on competitive systems. – S. Kelly-Bootle, The Devil’s DP Dictionary • Test workload – denotes any workload used in a performance study • Real workload – one observed on a system while being used • Cannot be repeated (easily) • May not even exist (proposed system) • Synthetic workload – similar characteristics to real workload • Can be applied in a repeated manner • Relatively easy to port; relatively easy to modify without affecting operation • No large real-world data files; no sensitive data • May have built-in measurement capabilities • Benchmark == Workload • Benchmarking is the process of comparing 2+ systems with workloads
Test Workloads for Computer Systems • Addition instructions • Instruction mixes • Kernels • Synthetic programs • Application benchmarks
Addition Instructions • Early computers had the CPU as the most expensive component • System performance == Processor performance • CPUs supported few operations; the most frequent one was addition • Computer with a faster addition instruction performed better • Run many addition operations as the test workload • Problems • Systems execute more operations, not only addition • Some are more complicated than others
Instruction Mixes • Number and complexity of instructions increased • Additions were no longer sufficient • Could measure instructions individually, but they are used in different amounts • => Measure relative frequencies of various instructions on real systems • Use as weighting factors to get average instruction time • Instruction mix – specification of various instructions coupled with their usage frequency • Use average instruction time to compare different processors • Often use inverse of average instruction time • MIPS – Millions of Instructions Per Second • MFLOPS – Millions of Floating-Point Operations Per Second • Gibson mix: developed by Jack C. Gibson in 1959 for IBM 704 systems
Example: Gibson Instruction Mix (1959, IBM 650 / IBM 704) • Load and Store 31.2 • Fixed-Point Add/Sub 6.1 • Compares 3.8 • Branches 16.6 • Float Add/Sub 6.9 • Float Multiply 3.8 • Float Divide 1.5 • Fixed-Point Multiply 0.6 • Fixed-Point Divide 0.2 • Shifting 4.4 • Logical And/Or 1.6 • Instructions not using regs 5.3 • Indexing 18.0 • Total 100.0
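A minimal sketch of how a mix like the one above is turned into an average instruction time and a MIPS rating; the instruction classes and per-class times below are assumed values for illustration, not measurements:

```c
#include <stdio.h>

/* Weighted-average instruction time from an instruction mix.
   Frequencies are mix percentages; the per-class execution times
   are hypothetical, chosen only to show the calculation. */
int main(void) {
    double freq_pct[] = { 31.2, 16.6, 18.0, 34.2 };  /* load/store, branch, indexing, all others */
    double time_us[]  = {  0.8,  0.4,  0.6,  0.5 };  /* assumed times in microseconds */
    int n = sizeof(freq_pct) / sizeof(freq_pct[0]);

    double weighted = 0.0, total = 0.0;
    for (int i = 0; i < n; i++) {
        weighted += freq_pct[i] * time_us[i];
        total    += freq_pct[i];
    }
    double avg_us = weighted / total;                /* weighted average instruction time */

    printf("Average instruction time: %.3f us\n", avg_us);
    printf("MIPS (inverse of average): %.2f\n", 1.0 / avg_us); /* 1/us = million instructions/s */
    return 0;
}
```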
Problems with Instruction Mixes • In modern systems, instruction time is variable, depending upon • Addressing modes, cache hit rates, pipelining • Interference with other devices during processor-memory access • Distribution of zeros in the multiplier • Number of times a conditional branch is taken • Mixes do not reflect special hardware such as page table lookups • Only represent the speed of the processor • Bottleneck may be in other parts of the system
Kernels • Pipelining, caching, address translation, … made computer instruction times highly variable • Cannot use individual instructions in isolation • Instead, use higher-level functions • Kernel = the most frequent function (kernel = nucleus) • Commonly used kernels: Sieve, Puzzle, Tree Searching, Ackermann's Function, Matrix Inversion, and Sorting • Disadvantages • Do not make use of I/O devices • Ad-hoc selection of kernels (not based on real measurements)
Synthetic Programs • Computer systems proliferated, operating systems emerged, applications changed • No more processing-only apps; I/O became important too • Use simple exerciser loops • Make a number of service calls or I/O requests • Compute average CPU time and elapsed time for each service call • Easy to port and distribute (Fortran, Pascal) • First exerciser loop by Buchholz (1969) • Called it a synthetic program • May have built-in measurement capabilities
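A minimal modern sketch of such an exerciser loop in C (not Buchholz's original program; the scratch file name and request count are arbitrary), issuing file I/O requests and reporting average CPU and elapsed time per call:

```c
#include <stdio.h>
#include <time.h>

#define REQUESTS 1000                          /* number of service calls to issue (arbitrary) */

int main(void) {
    char buf[4096] = {0};
    clock_t cpu0  = clock();
    time_t  wall0 = time(NULL);

    for (int i = 0; i < REQUESTS; i++) {
        FILE *f = fopen("scratch.dat", "wb");  /* hypothetical scratch file */
        if (!f) { perror("fopen"); return 1; }
        fwrite(buf, sizeof(buf), 1, f);        /* one I/O request per iteration */
        fclose(f);
    }

    double cpu  = (double)(clock() - cpu0) / CLOCKS_PER_SEC;
    double wall = difftime(time(NULL), wall0);
    printf("Average CPU time per call:     %.6f s\n", cpu  / REQUESTS);
    printf("Average elapsed time per call: %.6f s\n", wall / REQUESTS);
    return 0;
}
```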
Example of a Synthetic Workload Generation Program (Buchholz, 1969)
Synthetic Programs • Advantages • Quickly developed and given to different vendors • No real data files • Easily modified and ported to different systems • Have built-in measurement capabilities • Measurement process is automated • Repeated easily on successive versions of the operating systems • Disadvantages • Too small • Do not make representative memory or disk references • Mechanisms for page faults and disk cache may not be adequately exercised • CPU-I/O overlap may not be representative • Not suitable for multi-user environments because loops may create synchronizations, which may result in better or worse performance
Application Workloads • For special-purpose systems, may be able to run representative applications as a measure of performance • E.g.: airline reservation • E.g.: banking • Make use of the entire system (I/O, etc.) • Issues may be • Input parameters • Multiuser • Only applicable when specific applications are targeted • For a particular industry: Debit-Credit for banks
Benchmarks • Benchmark = workload • Kernels, synthetic programs, and application-level workloads are all called benchmarks • Instruction mixes are not called benchmarks • Some authors try to restrict the term benchmark only to a set of programs taken from real workloads • Benchmarking is the process of performance comparison of two or more systems by measurements • Workloads used in measurements are called benchmarks
Popular Benchmarks • Sieve • Ackermann’s Function • Whetstone • Linpack • Dhrystone • Lawrence Livermore Loops • SPEC • Debit-Credit Benchmark • TPC • EEMBC
Sieve (1 of 2) • Sieve of Eratosthenes (finds primes) • Write down all numbers from 1 to n • Strike out multiples of k for each remaining number k = 2, 3, 5, …, up to sqrt(n) • The numbers left unstruck are the primes
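A minimal C sketch of the sieve kernel (the bound N is arbitrary):

```c
#include <stdio.h>
#include <string.h>

#define N 1000   /* upper bound, arbitrary for illustration */

int main(void) {
    char is_prime[N + 1];
    memset(is_prime, 1, sizeof(is_prime));
    is_prime[0] = is_prime[1] = 0;

    for (int k = 2; k * k <= N; k++)         /* only need k up to sqrt(N) */
        if (is_prime[k])                     /* k is still unstruck, so it is prime */
            for (int m = k * k; m <= N; m += k)
                is_prime[m] = 0;             /* strike out multiples of k */

    int count = 0;
    for (int i = 2; i <= N; i++)
        count += is_prime[i];
    printf("%d primes up to %d\n", count, N);
    return 0;
}
```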
Ackermann’s Function (1 of 2) • Assesses efficiency of procedure-calling mechanisms • Ackermann’s Function has two parameters and is defined recursively • Benchmark is to call Ackermann(3, n) for values of n = 1 to 6 • Average execution time per call, the number of instructions executed, and the amount of stack space required for each call are used to compare various systems • Return value is 2^(n+3) − 3, can be used to verify the implementation • Number of calls: (512×4^(n−1) − 15×2^(n+3) + 9n + 37)/3 • Can be used to compute time per call • Depth is 2^(n+3) − 4; stack space doubles when n increases by 1
Ackermann’s Function (2 of 2) (Simula)
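The original slide shows a Simula listing; the following is a minimal C sketch of the same benchmark, calling Ackermann(3, n) for n = 1 to 6 and checking each result against 2^(n+3) − 3:

```c
#include <stdio.h>

/* Ackermann's function, defined recursively on two parameters */
unsigned long ackermann(unsigned long m, unsigned long n) {
    if (m == 0) return n + 1;
    if (n == 0) return ackermann(m - 1, 1);
    return ackermann(m - 1, ackermann(m, n - 1));
}

int main(void) {
    for (unsigned long n = 1; n <= 6; n++) {
        unsigned long r = ackermann(3, n);
        unsigned long expected = (1UL << (n + 3)) - 3;   /* 2^(n+3) - 3 */
        printf("Ackermann(3,%lu) = %lu (expected %lu)\n", n, r, expected);
    }
    return 0;
}
```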
Whetstone • Set of 11 modules designed to match observed frequencies in ALGOL programs • Array addressing, arithmetic, subroutine calls, parameter passing • Ported to Fortran, most popular in C, … • Many variations of Whetstone, so take care when comparing results • Problems – specific kernel • Only valid for small, scientific (floating-point) apps that fit in cache • Does not exercise I/O
LINPACK • Developed by Jack Dongarra (1983) at ANL • Programs that solve dense systems of linear equations • Many float adds and multiplies • Core is Basic Linear Algebra Subprograms (BLAS), called repeatedly • Usually, solve 100x100 system of equations • Represents mechanical engineering applications on workstations • Drafting to finite element analysis • High computation speed and good graphics processing
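LINPACK's run time is dominated by repeated calls to BLAS routines; below is a minimal C sketch of a DAXPY-style inner loop (y := a·x + y) with arbitrary data, only to illustrate the kind of floating-point work involved (it is not the LINPACK code itself):

```c
#include <stdio.h>

/* DAXPY-style BLAS kernel: y := a*x + y */
void daxpy(int n, double a, const double *x, double *y) {
    for (int i = 0; i < n; i++)
        y[i] += a * x[i];
}

int main(void) {
    double x[100], y[100];
    for (int i = 0; i < 100; i++) { x[i] = 1.0; y[i] = 2.0; }
    daxpy(100, 3.0, x, y);            /* one of many repeated calls in the benchmark */
    printf("y[0] = %.1f\n", y[0]);    /* 2.0 + 3.0*1.0 = 5.0 */
    return 0;
}
```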
Dhrystone • Pun on Whetstone • Intended to represent systems programming environments • Most common version is in C, but many versions exist • Low nesting depth and few instructions per call • Large amount of time spent copying strings • Mostly integer performance with no floating-point operations
Lawrence Livermore Loops • 24 vectorizable, scientific tests • Floating point operations • Physics and chemistry apps spend about 40-60% of execution time performing floating point operations • Relevant for: fluid dynamics, airplane design, weather modeling
SPEC • Systems Performance Evaluation Cooperative (SPEC) (http://www.spec.org) • Non-profit, founded in 1988, by leading HW and SW vendors • Aim: ensure that the marketplace has a fair and useful set of metrics to differentiate candidate systems • Product: “fair, impartial and meaningful benchmarks for computers“ • Initially, focus on CPUs: SPEC89, SPEC92, SPEC95, SPEC CPU 2000, SPEC CPU 2006 • Now, many suites are available • Results are published on the SPEC web site
SPEC (cont’d) • Benchmarks aim to test "real-life" situations • E.g., SPECweb2005 tests web server performance by performing various types of parallel HTTP requests • E.g., SPEC CPU tests CPU performance by measuring the run time of several programs such as the compiler gcc and the chess program crafty. • SPEC benchmarks are written in a platform neutral programming language (usually C or Fortran), and the interested parties may compile the code using whatever compiler they prefer for their platform, but may not change the code • Manufacturers have been known to optimize their compilers to improve performance of the various SPEC benchmarks
SPEC Benchmark Suites (Current) • SPEC CPU2006: combined performance of CPU, memory and compiler • CINT2006 ("SPECint"): testing integer arithmetic, with programs such as compilers, interpreters, word processors, chess programs etc. • CFP2006 ("SPECfp"): testing floating point performance, with physical simulations, 3D graphics, image processing, computational chemistry etc. • SPECjms2007: Java Message Service performance • SPECweb2005: web server performance (PHP and/or JSP workloads) • SPECviewperf: performance of an OpenGL 3D graphics system, tested with various rendering tasks from real applications • SPECapc: performance of several 3D-intensive popular applications on a given system • SPEC OMP V3.1: for evaluating performance of parallel systems using OpenMP (http://www.openmp.org) applications • SPEC MPI2007: for evaluating performance of parallel systems using MPI (Message Passing Interface) applications • SPECjvm98: performance of a Java client system running a Java virtual machine • SPECjAppServer2004: a multi-tier benchmark for measuring the performance of Java 2 Enterprise Edition (J2EE) technology-based application servers • SPECjbb2005: evaluates the performance of server-side Java by emulating a three-tier client/server system (with emphasis on the middle tier) • SPEC MAIL2001: performance of a mail server, testing SMTP and POP protocols • SPECpower_2008: evaluates the energy efficiency of server systems • SPEC SFS97_R1: NFS file server throughput and response time
SPEC CPU2006 Speed Metrics • Run and reporting rules – guidelines required to build, run, and report on the SPEC CPU2006 benchmarks • http://www.spec.org/cpu2006/Docs/runrules.html • Speed metrics • SPECint_base2006 (Required Base result); SPECint2006 (Optional Peak result) • SPECfp_base2006 (Required Base result); SPECfp2006 (Optional Peak result) • The elapsed time in seconds for each of the benchmarks is given and the ratio to the reference machine (a Sun UltraSPARC II system at 296 MHz) is calculated • The SPECint_base2006 and SPECfp_base2006 metrics are calculated as a geometric mean of the individual ratios • Each ratio is based on the median execution time from three validated runs
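A minimal sketch of the speed-metric calculation under the rules above: each benchmark's ratio is its reference time divided by the measured (median) time, and the reported metric is the geometric mean of those ratios. The times below are made up for illustration:

```c
#include <stdio.h>
#include <math.h>

int main(void) {
    /* hypothetical reference and measured (median-of-three) times, in seconds */
    double ref_time[]      = { 9770.0, 8050.0, 10490.0 };
    double measured_time[] = {  610.0,  402.0,   700.0 };
    int n = sizeof(ref_time) / sizeof(ref_time[0]);

    double log_sum = 0.0;
    for (int i = 0; i < n; i++) {
        double ratio = ref_time[i] / measured_time[i];  /* ratio to the reference machine */
        log_sum += log(ratio);
    }
    /* overall metric = geometric mean of the individual ratios */
    printf("Speed metric (geometric mean of ratios): %.2f\n", exp(log_sum / n));
    return 0;
}
```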
SPEC CPU2006 Throughput Metrics • SPECint_rate_base2006 (Required Base result); SPECint_rate2006 (Optional Peak result) • SPECfp_rate_base2006 (Required Base result); SPECfp_rate2006 (Optional Peak result) • Select the number of concurrent copies of each benchmark to be run (e.g. = #CPUs) • The same number of copies must be used for all benchmarks in a base test • This is not true for the peak results where the tester is free to select any combination of copies • The "rate" calculated for each benchmark is a function of: (the number of copies run * reference factor for the benchmark) / elapsed time in seconds which yields a rate in jobs/time. • The rate metrics are calculated as a geometric mean from the individual SPECrates using the median result from three runs
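A minimal sketch of the per-benchmark rate formula quoted above (all input values are hypothetical); the reported rate metric is then the geometric mean of these per-benchmark rates:

```c
#include <stdio.h>

int main(void) {
    double copies     = 8.0;      /* concurrent copies run, e.g. one per CPU (hypothetical) */
    double ref_factor = 9770.0;   /* reference factor for this benchmark (hypothetical) */
    double elapsed_s  = 1200.0;   /* measured elapsed time in seconds (hypothetical) */

    /* rate = (number of copies run * reference factor) / elapsed time, in jobs/time */
    double rate = copies * ref_factor / elapsed_s;
    printf("Per-benchmark rate: %.2f\n", rate);
    return 0;
}
```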
Debit-Credit (1/3) • Application-level benchmark • Was the de-facto standard for transaction processing systems • A retail bank wanted 1,000 branches, 10,000 tellers, and 10,000,000 accounts online with a peak load of 100 TPS • Performance in TPS such that 95% of all transactions have a response time of 1 second or less (from arrival of the last bit to sending of the first bit) • Each TPS requires 10 branches, 100 tellers, and 100,000 accounts • A system claiming 50 TPS performance should run: 500 branches; 5,000 tellers; 5,000,000 accounts
Debit-Credit (3/3) • Metric: price/performance ratio • Performance: Throughput in terms of TPS such that 95% of all transactions provide one second or less response time • Response time: Measured as the time interval between the arrival of the last bit from the communications line and the sending of the first bit to the communications line • Cost = Total expenses for a five-year period on purchase, installation, and maintenance of the hardware and software in the machine room • Cost does not include expenditures for terminals, communications, application development, or operations • Pseudo-code Definition of Debit-Credit • See Figure 4.5 in the book
TPC • Transaction Processing Performance Council (TPC) • Mission: create realistic and fair benchmarks for transaction processing • For more info: http://www.tpc.org • Benchmark types • TPC-A (1989) • TPC-C (1992) – complex OLTP environment • TPC-H – models ad-hoc decision support (unrelated queries, no local history to optimize future queries) • TPC-W – transactional Web benchmark (simulates the activities of a business-oriented transactional Web server) • TPC-App – application server and Web services benchmark (simulates activities of a B2B transactional application server operating 24/7) • Metric: transactions per second; also includes response time (throughput performance is measured only when response time requirements are met)
EEMBC • Embedded Microprocessor Benchmark Consortium (EEMBC, pronounced “embassy”) • Non-profit consortium supported by member dues and license fees • Real-world benchmark software helps designers select the right embedded processors for their systems • Standard benchmarks and methodology ensure fair and reasonable comparisons • EEMBC Technology Center manages development of new benchmark software and certifies benchmark test results • For more info: http://www.eembc.com/ • 41 kernels used in different embedded applications • Automotive/Industrial • Consumer • Digital Entertainment • Java • Networking • Office Automation • Telecommunications
The Art of Workload Selection • Workload is the most crucial part of any performance evaluation • Inappropriate workload will result in misleading conclusions • Major considerations in workload selection • Services exercised by the workload • Level of detail • Representativeness • Timeliness
Services Exercised • SUT = System Under Test • CUS = Component Under Study
Services Exercised (cont’d) • Do not confuse SUT with CUS • Metrics depend upon the SUT: MIPS is OK for comparing two CPUs but not for two timesharing systems • Workload: depends upon the system • Examples: • CPU: instructions • System: transactions • Transactions are not good for comparing CPUs and vice versa • Two systems identical except for CPU • Comparing systems: use transactions • Comparing CPUs: use instructions • Multiple services: exercise as complete a set of services as possible
Example: Timesharing Systems • Hierarchy of interfaces and corresponding workloads: • Applications → Application benchmark • Operating System → Synthetic program • Central Processing Unit → Instruction mixes • Arithmetic Logical Unit → Addition instruction
Example: Networks • Application: user applications, such as mail, file transfer, http,… • Workload: frequency of various types of applications • Presentation: data compression, security, … • Workload: frequency of various types of security and (de)compression requests • Session: dialog between the user processes on the two end systems (init., maintain, discon.) • Workload: frequency and duration of various types of sessions • Transport: end-to-end aspects of communication between the source and the destination nodes (segmentation and reassembly of messages) • Workload: frequency, sizes, and other characteristics of various messages • Network: routes packets over a number of links • Workload: the source-destination matrix, the distance, and characteristics of packets • Datalink: transmission of frames over a single link • Workload: characteristics of frames, length, arrival rates, … • Physical: transmission of individual bits (or symbols) over the physical medium • Workload: frequency of various symbols and bit patterns
Example: Magnetic Tape Backup System • Backup System • Services: Backup files, backup changed files, restore files, list backed-up files • Factors: File-system size, batch or background process, incremental or full backups • Metrics: Backup time, restore time • Workload: A computer system with files to be backed up. Vary frequency of backups • Tape Data System • Services: Read/write to the tape, read tape label, auto load tapes • Factors: Type of tape drive • Metrics: Speed, reliability, time between failures • Workload: A synthetic program generating representative tape I/O requests
Magnetic Tape System (cont’d) • Tape Drives • Services: Read record, write record, rewind, find record, move to end of tape, move to beginning of tape • Factors: Cartridge or reel tapes, drive size • Metrics: Time for each type of service, for example, time to read record and to write record, speed (requests/time), noise, power dissipation • Workload: A synthetic program exerciser generating various types of requests in a representative manner • Read/Write Subsystem • Services: Read data, write data (as digital signals) • Factors: Data-encoding technique, implementation technology (CMOS, TTL, and so forth) • Metrics: Coding density, I/O bandwidth (bits per second) • Workload: Read/write data streams with varying patterns of bits
Magnetic Tape System (cont’d) • Read/Write Heads • Services: Read signal, write signal (electrical signals) • Factors: Composition, inter-head spacing, gap sizing, number of heads in parallel • Metrics: Magnetic field strength, hysteresis • Workload: Read/write currents of various amplitudes, tapes moving at various speeds
Level of Detail • Workload description varies from least detailed to a time-stamped list of all requests • 1) Most frequent request • Examples: Addition Instruction, Debit-Credit, Kernels • Valid if one service is much more frequent than others • 2) Frequency of request types • List various services, their characteristics, and frequency • Examples: Instruction mixes • Context sensitivity • A service depends on the services required in the past • => Use set of services (group individual service requests) • E.g., caching is a history-sensitive mechanism
Level of Detail (Cont) • 3) Time-stamped sequence of requests (trace) • May be too detailed • Not convenient for analytical modeling • May require exact reproduction of component behavior • 4) Average resource demand • Used for analytical modeling • Group similar services into classes • 5) Distribution of resource demands • Used if variance is large • Used if the distribution impacts the performance • Workloads used in simulation and analytical modeling • Non-executable: used in analytical/simulation modeling • Executable: can be executed directly on a system
Representativeness • Workload should be representative of the real application • How do we define representativeness? • The test workload and real workload should have the same • Arrival Rate: the arrival rate of requests should be the same or proportional to that of the real application • Resource Demands: the total demands on each of the key resources should be the same or proportional to that of the application • Resource Usage Profile: relates to the sequence and the amounts in which different resources are used
Timeliness • Workloads should follow the changes in usage patterns in a timely fashion • Difficult to achieve: users are a moving target • New systems → new workloads • Users tend to optimize the demand • Use those features that the system performs efficiently • E.g., fast multiplication → higher frequency of multiplication instructions • Important to monitor user behavior on an ongoing basis
Other Considerations in Workload Selection • Loading Level: A workload may exercise a system to its • Full capacity (best case) • Beyond its capacity (worst case) • At the load level observed in the real workload (typical case) • For procurement purposes → typical case • For design → best to worst, all cases • Impact of External Components • Do not use a workload that makes an external component the bottleneck → all alternatives in the system would give equally good performance • Repeatability • Workload should be such that the results can be easily reproduced without too much variance
Summary • Services exercised determine the workload • Level of detail of the workload should match that of the model being used • Workload should be representative of the real system's usage in the recent past • Loading level, impact of external components, and repeatability are other criteria in workload selection