Lies, Damn Lies and Benchmarks

Lies, Damn Lies and Benchmarks • Are your benchmark tests reliable? 6.1 Advanced Operating Systems

Typical Computer Systems Paper • Abstract: What this paper contains. • Most readers will read just this. • Introduction: Present a problem. • The universe cannot go on if the problem persists. • Related Work: Show the work of competitors. • They stink. • Solution: Present the suggested solution. • We are the best.

Typical Paper (Cont.) • Technique: Go into details. • Many drawings and figures. • Experiments: Prove our point; evaluation methodology. • Which benchmarks adhere to my assumptions? • Results: Show how great the enhancement is. • The objective benchmarks agree that we are the best. • Conclusions: Highlights of the paper. • Besides the abstract, some readers will also read this.

SPEC • SPEC is the Standard Performance Evaluation Corporation. • Legally, SPEC is a non-profit corporation registered in California. • SPEC's mission: To establish, maintain, and endorse a standardized set of relevant benchmarks and metrics for performance evaluation of modern computer systems. • "SPEC CPU2000 is the next-generation industry-standardized CPU-intensive benchmark suite." • Composed of 12 integer benchmarks (CINT2000) and 14 floating-point benchmarks (CFP2000).

Some Conference Statistics • Number of papers published: 209 • Papers that used a version of SPEC: 138 (66%) • Earliest conference deadline: December 2000 • SPEC CPU2000 announced: December 1999

Partial use of CINT2000

Why not use it all? • Many papers do not use all the benchmarks of the suite. • Selected excuses were: • “The chosen benchmarks stress the problem …” • “Several benchmarks couldn’t be simulated …” • “A subset of CINT2000 was chosen …” • “… select benchmarks from CPU2000 …” • “More benchmarks wouldn't fit into our displays …”

Omission Explanation • Only about a third of the papers (34/108) present any reason at all. • Many of the reasons are not very convincing. • Are the claims in the previous slide persuasive?

What has been omitted • Possible reasons for the omissions: • eon is written in C++. • gap calls the ioctl system call, which is device-specific. • crafty uses a 64-bit word. • perlbmk has problems with 64-bit processors.

CINT95 • Still widespread even though it was retired in June 2000. • Smaller suite (8 benchmarks vs. 12). • Over 50% of the papers use it in full, but it had already been around for at least 3 years. • Only 5 papers out of 36 explain the partial use.

Use of CINT2000 • The use of CINT has been increasing over the years. • New systems are still benchmarked with old tests.

Amdahl's Law • $F_{enhanced}$ is the fraction of the benchmarks that were enhanced. • Speedup is:

$$\text{Speedup} = \frac{\text{CPU Time}_{old}}{\text{CPU Time}_{new}} = \frac{\text{CPU Time}_{old}}{\text{CPU Time}_{old}\,(1 - F_{enhanced}) + \text{CPU Time}_{old}\,F_{enhanced}\,(1/\text{speedup}_{enhanced})} = \frac{1}{(1 - F_{enhanced}) + \dfrac{F_{enhanced}}{\text{speedup}_{enhanced}}}$$

• Example: if we have a way to improve just the gzip benchmark by a factor of 10, what fraction of usage must be gzip to achieve a 300% speedup (an overall factor of 3)?

$$3 = \frac{1}{(1 - F_{enhanced}) + \dfrac{F_{enhanced}}{10}} \;\Rightarrow\; F_{enhanced} = \frac{20}{27} \approx 74\%$$
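The gzip example can be checked numerically. The following is a minimal sketch; the function names `amdahl_speedup` and `required_fraction` are chosen here for illustration and do not come from the slides.

```python
def amdahl_speedup(f_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when a fraction f_enhanced is sped up by speedup_enhanced."""
    return 1.0 / ((1.0 - f_enhanced) + f_enhanced / speedup_enhanced)


def required_fraction(target_speedup: float, speedup_enhanced: float) -> float:
    """Fraction that must be enhanced to reach target_speedup overall.

    Obtained by solving 1/S = (1 - F) + F/s for F:
        F = (1 - 1/S) / (1 - 1/s)
    """
    return (1.0 - 1.0 / target_speedup) / (1.0 - 1.0 / speedup_enhanced)


if __name__ == "__main__":
    # gzip improved by a factor of 10, overall speedup of 3 wanted.
    f = required_fraction(target_speedup=3.0, speedup_enhanced=10.0)
    print(f"F_enhanced = {f:.4f}")
    print(f"overall speedup at that fraction = {amdahl_speedup(f, 10.0):.4f}")
```

Plugging the result back into `amdahl_speedup` is a quick sanity check that the inversion matches the forward formula.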

Breaking Amdahl's Law • "The performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used." • Only the full suite can accurately gauge the enhancement. • It is possible that the other benchmarks: • produce similar results. • degrade performance. • are invariant to the enhancement. Even in this case, the published results are too high according to Amdahl's Law.
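The "invariant" case can be illustrated with a small calculation. The sketch below uses made-up numbers: 12 benchmarks with equal run times, of which 8 are reported with a speedup of 2 and the 4 omitted ones are assumed to be unaffected. The run times and the speedup are assumptions for illustration, not measurements from any paper.

```python
def suite_speedup(old_times, new_times):
    """Overall speedup of a suite: total old run time over total new run time."""
    return sum(old_times) / sum(new_times)


# Hypothetical data: 12 benchmarks, 100 seconds each before the enhancement.
N_TOTAL = 12
N_REPORTED = 8
OLD = [100.0] * N_TOTAL

# Assumption: the 8 reported benchmarks run twice as fast,
# while the 4 omitted ones are invariant to the enhancement.
NEW = [t / 2.0 for t in OLD[:N_REPORTED]] + OLD[N_REPORTED:]

reported = suite_speedup(OLD[:N_REPORTED], NEW[:N_REPORTED])
full = suite_speedup(OLD, NEW)

if __name__ == "__main__":
    print(f"speedup on the reported subset: {reported:.2f}")
    print(f"speedup on the full suite:      {full:.2f}")
```

Under these assumptions the subset shows a speedup of 2.0 while the full suite only reaches 1.5, which is the gap Amdahl's Law predicts when a third of the run time is untouched.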

Tradeoffs (cartoon caption: "I shouldn't have left eon out") • What about papers that offer performance tradeoffs? • More than 40% of the papers offer performance tradeoffs. • An average paper contains just 8 tests out of the 12. • What do we assume about the missing results?

Besides SPEC • Categories of benchmarks: • Official benchmarks like SPEC; there are also official benchmarks from non-vendor sources. • They will not always concentrate on the points important for your usage. • Traces – real users whose activities are logged and kept. • An improved (or worsened) system may change the users' behavior.

Besides SPEC (Cont.) • Microbenchmarks – test just an isolated component of a system. • Using multiple microbenchmarks will not test the interaction between the components. • Ad-hoc benchmarks – run a bunch of programs that seem interesting. • If you suggest a way to compile Linux faster, Linux compilation can be a good benchmark. • Synthetic benchmarks – write a test program yourself. • You can stress your point.
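As an illustration of the microbenchmark category, here is a minimal sketch that times one isolated function with Python's standard `timeit` module. The helper `microbenchmark` and the workload `sort_small_list` are hypothetical names chosen for this example.

```python
import timeit


def microbenchmark(func, repeat=5, number=1000):
    """Return the best observed time per call of func, in seconds.

    func is called `number` times per run, and the run is repeated
    `repeat` times; the minimum is taken to reduce noise from other
    activity on the machine.
    """
    totals = timeit.repeat(func, repeat=repeat, number=number)
    return min(totals) / number


def sort_small_list():
    # Hypothetical isolated component under test.
    return sorted([5, 3, 8, 1, 9, 2])


if __name__ == "__main__":
    per_call = microbenchmark(sort_small_list)
    print(f"{per_call * 1e9:.0f} ns per call")
```

A harness like this says nothing about how the component behaves when it competes with the rest of the system for caches, memory, or I/O, which is the limitation noted above.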

Whetstone Benchmark • Historically, it is the first synthetic microbenchmark. • The original Whetstone benchmark was designed in the 1960s. The first practical implementation was in 1972. • It was named after the small town of Whetstone, where it was designed. • Designed to measure the execution speed of a variety of FP instructions (+, *, sin, cos, atan, sqrt, log, exp). Contains a small loop of FP instructions. • The majority of its variables are global; hence it does not show the advantages of RISC, where a large number of registers speeds up the handling of local variables.

The Andrew benchmark • The Andrew benchmark was suggested in 1988. • In the early 1990s, the Andrew benchmark was one of the popular non-vendor benchmarks for file system efficiency. • The Andrew benchmark: • Copies a directory hierarchy containing the source code of a large program. • "stat"s every file in the hierarchy. • Reads every byte of every copied file. • Compiles the code in the copied hierarchy. • Does this reflect reality? Who works like this?
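The phases listed above can be sketched as a small timing script. This is not the original Andrew benchmark code, only an illustration of its structure; the function name `andrew_like` and the optional `compile_cmd` parameter are assumptions made for this example.

```python
import os
import shutil
import subprocess
import time


def _walk_files(root):
    """Yield the path of every regular file entry under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            yield os.path.join(dirpath, name)


def andrew_like(src, dst, compile_cmd=None):
    """Run Andrew-style phases; return ({phase: seconds}, bytes_read).

    dst must not exist yet. compile_cmd is an optional command list
    (hypothetical, e.g. ["make"]) that is run inside dst.
    """
    timings = {}

    # Phase 1: copy the directory hierarchy.
    start = time.perf_counter()
    shutil.copytree(src, dst)
    timings["copy"] = time.perf_counter() - start

    # Phase 2: stat every file in the copied hierarchy.
    start = time.perf_counter()
    for path in _walk_files(dst):
        os.stat(path)
    timings["stat"] = time.perf_counter() - start

    # Phase 3: read every byte of every copied file.
    start = time.perf_counter()
    total_bytes = 0
    for path in _walk_files(dst):
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                total_bytes += len(chunk)
    timings["read"] = time.perf_counter() - start

    # Phase 4: compile the copied code, if a command was supplied.
    if compile_cmd is not None:
        start = time.perf_counter()
        subprocess.run(compile_cmd, cwd=dst, check=True)
        timings["compile"] = time.perf_counter() - start

    return timings, total_bytes
```

Calling `andrew_like("some/source/tree", "scratch/copy")` on a directory of your choice returns the per-phase wall-clock times. Note that on a warm page cache the stat and read phases may mostly measure memory speed rather than the file system, which is one reason to question how well such a script reflects real work.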

Kernel Compilation • Maybe a "real" job can be more representative? • Measure the compilation of the Linux kernel. • The compilation reads large memory areas only once. This reduces the influence of cache efficiency. • The influence of the L2 cache will be drastically reduced.

Benchmarks' Contribution • In 1999, Mogul presented statistics showing that while hardware is usually measured by SPEC, no standard is popular when it comes to operating system code. • Distributed systems are commonly benchmarked by NAS. • In 1993, Chen & Patterson wrote: "Benchmarks do not help in understanding system performance".
