Lecture 6: Performance of Multiprocessor Systems
Speedup

$$\text{Speedup} = \frac{\text{Execution time on 1 processor}}{\text{Execution time on } p \text{ processors}} = \frac{T_1}{T_p}$$

$t_s$: time for the serial part of the algorithm
$t_p$: time for the parallelizable part of the algorithm

$$T_1 = t_s + t_p, \qquad T_p = t_s + \frac{t_p}{p}$$

$$\text{Speedup}(p) = \frac{t_s + t_p}{t_s + t_p/p}$$

[Figure: Speedup versus p, with the ideal speedup shown for reference]
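The speedup formula above can be evaluated directly. The following is a minimal sketch in C; the function name `speedup` and its parameter names are chosen here for illustration and do not come from the lecture.

```c
/* Speedup(p) = (ts + tp) / (ts + tp/p).
   ts: time of the serial part, tp: time of the parallelizable part,
   both measured on one processor; p: number of processors.
   The function name is illustrative only. */
double speedup(double ts, double tp, int p)
{
    double t1 = ts + tp;          /* T1: execution time on 1 processor  */
    double t_par = ts + tp / p;   /* Tp: execution time on p processors */
    return t1 / t_par;
}
```

For example, with ts = 10 and tp = 90 (in any time unit), `speedup(10.0, 90.0, 10)` evaluates 100 / 19, a speedup of roughly 5.26 on 10 processors.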
Amdahl’s Law

If the sequential component of an algorithm is 1/s of the program’s execution time, then the maximum speedup that can be achieved on a parallel computer is s.

$$t_s = \frac{1}{s}\,T_1, \qquad t_p = \left(1 - \frac{1}{s}\right) T_1$$

$$\text{Speedup}(p) = \frac{T_1}{\dfrac{T_1}{s} + \dfrac{(1 - 1/s)\,T_1}{p}}$$

$$\lim_{p \to \infty} \text{Speedup}(p) = s$$

[Figure: Speedup versus p, approaching the limit s]
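The limit can be seen by cancelling $T_1$ and simplifying the fraction. The short derivation below is a sketch of that step; the numbers in the last line (s = 10, p = 100) are an illustrative choice, not values from the lecture.

```latex
\begin{align*}
\text{Speedup}(p)
  &= \frac{T_1}{\dfrac{T_1}{s} + \dfrac{(1 - 1/s)\,T_1}{p}}
   = \frac{1}{\dfrac{1}{s} + \dfrac{s - 1}{s\,p}}
   = \frac{s\,p}{p + s - 1} \\[6pt]
\lim_{p \to \infty} \frac{s\,p}{p + s - 1} &= s \\[6pt]
\text{Example: } s = 10,\; p = 100
  &\;\Rightarrow\; \text{Speedup} = \frac{1000}{109} \approx 9.17 < 10
\end{align*}
```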
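Superlinear speedup

$\text{Speedup}(p) > p$ is called superlinear speedup.

Reasons:
• Increased cache size: p processors bring more total cache, so more of the working set fits in cache
• Randomized algorithms
• Parallel algorithm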
Speedup

$$\text{Speedup} = \frac{T_1}{T_p}$$

• Relative speedup: $T_1$ is the execution time of the parallel algorithm on a single processor
• Absolute speedup: $T_1$ is the execution time of the best sequential algorithm on one processor
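Efficiency

$$\text{Efficiency}(p) = \frac{\text{Speedup}(p)}{p} = \frac{T_1}{p \times T_p} \le 1$$

The bound of 1 holds whenever the speedup is not superlinear.

[Figure: Efficiency versus p, bounded above by 1]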
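As a small illustration of the definition, efficiency can be computed from two measured execution times. This is a sketch only; the function name `efficiency` and the sample timings in the note below are assumptions made for the example.

```c
/* Efficiency(p) = Speedup(p) / p = T1 / (p * Tp).
   t1:    execution time on 1 processor
   t_par: execution time on p processors (Tp in the slides)
   The function name is illustrative only. */
double efficiency(double t1, double t_par, int p)
{
    return t1 / (p * t_par);
}
```

With a hypothetical T1 = 100 and Tp = 19 on p = 10 processors, the speedup is about 5.26 and `efficiency(100.0, 19.0, 10)` is about 0.53, meaning each processor is doing useful work roughly half of the time.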
Gustafson’s Law

[Figure: work and time plotted against the number of processors p = 1, 2, 3, 4, in two panels. Fixed size: the work (ws, wp) stays constant and the time ts + tp/p shrinks as p grows. Fixed time: the time (ts, tp) stays constant and the parallel work wp grows with p.]
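Gustafson’s Law

Scaled Speedup (Fixed-time Speedup)

Here $t_s$ and $t_p$ are the serial and parallel times measured on p processors, and the problem size is scaled with p:

$$T_p = t_s + t_p, \qquad T_1 = t_s + p \cdot t_p$$

If the sequential component of an algorithm is 1/s of the program’s execution time on p processors:

$$t_s = \frac{1}{s}\,T_p, \qquad t_p = \left(1 - \frac{1}{s}\right) T_p$$

$$\text{Speedup}(p) = \frac{T_1}{T_p} = \frac{1}{s} + \left(1 - \frac{1}{s}\right) p$$

$$\lim_{p \to \infty} \text{Speedup}(p) = \infty$$

[Figure: Speedup versus p, with the ideal speedup shown for reference]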
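The scaled speedup grows linearly in p, which a few lines of C make easy to check. This is a minimal sketch; the function name `gustafson_speedup` is an assumption for the example.

```c
/* Scaled (fixed-time) speedup: Speedup(p) = 1/s + (1 - 1/s) * p,
   where 1/s is the serial fraction of the execution time measured
   on p processors. The function name is illustrative only. */
double gustafson_speedup(double s, int p)
{
    double serial_fraction = 1.0 / s;
    return serial_fraction + (1.0 - serial_fraction) * p;
}
```

For a serial fraction of 1/10 and p = 10, `gustafson_speedup(10.0, 10)` gives 0.1 + 0.9 × 10 = 9.1. Under the fixed-size assumption of Amdahl's Law the same fraction and processor count give only about 5.26, because there the serial fraction is measured on one processor and the problem is not scaled.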
Sizeup

$$\text{Sizeup} = \frac{\text{Total work on } p \text{ processors}}{\text{Total work on 1 processor}}$$

$w_s$: serial work
$w_p$: parallelizable work
$w_p'$: scaled parallelizable work

$$\text{Sizeup} = \frac{w_s + w_p'}{w_s + w_p} = \frac{w_s + p \cdot w_p}{w_s + w_p}$$
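The sizeup formula can be sketched in C as follows, assuming the parallelizable work scales linearly with the processor count (wp' = p · wp, as in the formula above). The function name `sizeup` is chosen for illustration.

```c
/* Sizeup = (ws + p * wp) / (ws + wp).
   ws: serial work, wp: parallelizable work of the unscaled problem,
   p:  number of processors (the parallel work is scaled to p * wp).
   The function name is illustrative only. */
double sizeup(double ws, double wp, int p)
{
    double scaled_work = ws + p * wp;   /* work done on p processors */
    double base_work = ws + wp;         /* work done on 1 processor  */
    return scaled_work / base_work;
}
```

With ws = 1 and wp = 9 units of work, `sizeup(1.0, 9.0, 4)` is 37 / 10 = 3.7: four processors handle a problem 3.7 times larger in the same time.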
Roofline Performance Model

Arithmetic intensity is the ratio of the floating-point operations in a program to the number of data bytes the program accesses from main memory.

$$\text{Arithmetic intensity} = \frac{\text{floating-point operations}}{\text{number of data bytes}} \quad \text{(FLOPs/byte)}$$
Roofline Performance Model

$$\text{Attainable GFLOPs/second} = \min\left(\text{Peak memory bandwidth} \times \text{Arithmetic intensity},\ \text{Peak floating-point performance}\right)$$
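The roofline bound is a simple minimum of two terms, sketched below in C. The function and parameter names are assumptions for the example, and the machine figures in the note that follows are hypothetical, not taken from any real hardware specification.

```c
/* Attainable GFLOPs/s = min(peak bandwidth * arithmetic intensity,
                             peak floating-point performance).
   peak_gflops:   peak floating-point performance in GFLOPs/second
   peak_bw_gbs:   peak memory bandwidth in GBytes/second
   intensity:     arithmetic intensity in FLOPs/byte
   All names here are illustrative only. */
double attainable_gflops(double peak_gflops, double peak_bw_gbs,
                         double intensity)
{
    double memory_bound = peak_bw_gbs * intensity;  /* GFLOPs/second */
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}
```

On a hypothetical machine with 100 GFLOPs/second peak and 20 GBytes/second of memory bandwidth, a kernel with intensity 0.5 FLOPs/byte is limited by memory to 10 GFLOPs/second, while a kernel with intensity 8 FLOPs/byte is limited by the floating-point peak of 100 GFLOPs/second.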
Roofline Performance Model

• Peak floating-point performance is given by the hardware specifications of the computer (FLOPs/second).
• For multicore chips, peak performance is the collective performance of all the cores on the chip. If the system has several chips, multiply the peak per chip by the number of chips.
• Peak memory performance is also given by the hardware specifications of the computer (Mbytes/second).
• The maximum floating-point performance that the memory system can support for a given arithmetic intensity can be plotted as Peak memory bandwidth × Arithmetic intensity:

(bytes/second) × (FLOPs/byte) ==> FLOPs/second
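Roofline Performance Model

• The roofline sets an upper bound on the performance a kernel can attain at its arithmetic intensity.
• The roofline of a computer does not vary by benchmark kernel: it depends only on the hardware, so it is drawn once per machine.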
Stream Benchmark

• A synthetic benchmark
• Measures the performance of long vector operations
• The operations have no temporal locality and access arrays that are larger than the cache size

http://www.cs.virginia.edu/stream/ref.html

    #define N 2000000
    ...

    /* Copy: c = a */
    void tuned_STREAM_Copy()
    {
        int j;
    #pragma omp parallel for
        for (j = 0; j < N; j++)
            c[j] = a[j];
    }

    /* Scale: b = scalar * c */
    void tuned_STREAM_Scale(double scalar)
    {
        int j;
    #pragma omp parallel for
        for (j = 0; j < N; j++)
            b[j] = scalar * c[j];
    }

    /* Add: c = a + b */
    void tuned_STREAM_Add()
    {
        int j;
    #pragma omp parallel for
        for (j = 0; j < N; j++)
            c[j] = a[j] + b[j];
    }

    /* Triad: a = b + scalar * c */
    void tuned_STREAM_Triad(double scalar)
    {
        int j;
    #pragma omp parallel for
        for (j = 0; j < N; j++)
            a[j] = b[j] + scalar * c[j];
    }
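To connect these kernels to the roofline model, the arithmetic intensity of each one can be estimated by counting floating-point operations and bytes per loop iteration. The sketch below assumes 8-byte doubles and counts one load or store per array element touched; it ignores extra traffic such as write-allocate, so real values can be lower. The struct and function names are assumptions made for this example.

```c
/* Per-iteration cost of each STREAM kernel, assuming 8-byte doubles
   and one memory access per array element read or written.
   All names here are illustrative only. */
struct kernel_cost {
    const char *name;
    int flops;   /* floating-point operations per iteration   */
    int bytes;   /* bytes read plus bytes written per iteration */
};

static const struct kernel_cost stream_kernels[] = {
    { "Copy",  0, 16 },   /* c[j] = a[j]:               1 load,  1 store */
    { "Scale", 1, 16 },   /* b[j] = scalar * c[j]:      1 load,  1 store */
    { "Add",   1, 24 },   /* c[j] = a[j] + b[j]:        2 loads, 1 store */
    { "Triad", 2, 24 },   /* a[j] = b[j] + scalar*c[j]: 2 loads, 1 store */
};

/* Arithmetic intensity in FLOPs/byte for one kernel. */
double arithmetic_intensity(const struct kernel_cost *k)
{
    return (double)k->flops / (double)k->bytes;
}
```

Under these assumptions Triad has an intensity of 2/24, about 0.083 FLOPs/byte, and Copy has an intensity of 0. Such low values place all four kernels on the bandwidth-limited side of the roofline, which fits their use for measuring memory performance.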