
Parallel Concepts


Presentation Transcript


1. Parallel Concepts: An Introduction

2. The Goal of Parallelization

[Figures: CPU time vs. elapsed time on 1 and 4 processors, including communication overhead; reduction in elapsed time from start to finish on 1, 2, 4 and 8 processors]

• Reduction of elapsed time of a program
• Reduction in turnaround time of jobs
• Overhead:
  • total increase in CPU time
  • communication
  • synchronization
  • additional work in the algorithm
  • non-parallel part of the program (one processor works, the others spin idle)
• The trade-off between overhead and elapsed time is better expressed as Speedup and Efficiency

3. Speedup and Efficiency

[Figure: Speedup and Efficiency vs. number of processors, showing ideal, super-linear, saturation and disaster curves]

• Both measure the parallelization properties of a program
• Let T(p) be the elapsed time on p processors
• The Speedup S(p) and the Efficiency E(p) are defined as:
  S(p) = T(1)/T(p)
  E(p) = S(p)/p
• For ideal parallel speedup we get:
  T(p) = T(1)/p
  S(p) = T(1)/T(p) = p
  E(p) = S(p)/p = 1, or 100%
• Scalable programs remain efficient for a large number of processors
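
As a quick worked example (added here for illustration, not part of the original slides): if T(1) = 100 s and T(4) = 30 s, then S(4) = 100/30 ≈ 3.3 and E(4) ≈ 83%. The small C sketch below simply evaluates the two definitions for a set of hypothetical timings.

    #include <stdio.h>

    int main(void)
    {
        /* hypothetical measured elapsed times T(p) in seconds */
        const int    procs[] = { 1, 2, 4, 8 };
        const double t[]     = { 100.0, 52.0, 30.0, 19.0 };

        for (int i = 0; i < 4; i++) {
            double s = t[0] / t[i];        /* S(p) = T(1)/T(p) */
            double e = s / procs[i];       /* E(p) = S(p)/p    */
            printf("p=%d  S=%.2f  E=%.0f%%\n", procs[i], s, 100.0*e);
        }
        return 0;
    }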

4. Amdahl’s Law

• This rule states the following for parallel programs: the non-parallel fraction of the code (i.e. the overhead) imposes the upper limit on the scalability of the code
• The non-parallel (serial) fraction of the program includes the communication and synchronization overhead
• Thus, the maximum parallel Speedup S(p) for a program with parallel fraction f is:
  (1) 1 = s + f                                    ! the program has a serial fraction s and a parallel fraction f
  (2) T(1) = T(parallel) + T(serial) = T(1)*(f + s) = T(1)*(f + (1-f))
  (3) T(p) = T(1)*(f/p + (1-f))
  (4) S(p) = T(1)/T(p) = 1/(f/p + 1-f) < 1/(1-f)   ! for p -> infinity
  (5) S(p) < 1/(1-f)
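
To make the bound concrete (an illustration added here, not taken from the slides): with a parallel fraction f = 0.95, equation (5) caps the speedup at 1/(1-0.95) = 20 no matter how many processors are used. The hypothetical helper below just evaluates equation (4) for a few values of p.

    #include <stdio.h>

    /* Amdahl's Law, equation (4): S(p) = 1/(f/p + (1-f)) for parallel fraction f */
    static double amdahl_speedup(double f, int p)
    {
        return 1.0 / (f/p + (1.0 - f));
    }

    int main(void)
    {
        const double f = 0.95;                    /* assumed parallel fraction */
        for (int p = 1; p <= 1024; p *= 4)
            printf("p=%4d  S=%.2f\n", p, amdahl_speedup(f, p));
        printf("limit 1/(1-f) = %.2f\n", 1.0/(1.0 - f));
        return 0;
    }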

5. The Practicality of Parallel Computing

[Figure: Speedup vs. percentage of parallel code for P = 2, 4 and 8; the best hand-tuned codes of the 1970s, 1980s and 1990s fall in the ~99% parallel range]

• In practice, making programs parallel is not as difficult as it may seem from Amdahl’s law
• It is clear that a program has to spend a significant portion (most) of its run time in the parallel region

Source: David J. Kuck, High Performance Computing, Oxford Univ. Press, 1996

6. Fine-Grained vs. Coarse-Grained

[Figure: call tree from MAIN through routines A-O down to loops p, q, r, s and t; coarse-grained parallelism is applied near the top of the tree, fine-grained parallelism at the individual loops]

• Fine-grain parallelism (typically loop level)
  • can be done incrementally, one loop at a time
  • does not require deep knowledge of the code
  • a lot of loops have to be parallel for decent speedup
  • potentially many synchronization points (at the end of each parallel loop)
• Coarse-grain parallelism (contrasted with fine-grain in the sketch below)
  • makes larger loops parallel at a higher call-tree level, potentially enclosing many small loops
  • more code is parallel at once
  • fewer synchronization points, reducing overhead
  • requires deeper knowledge of the code
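
A minimal OpenMP sketch of the contrast (added for illustration; the loop bodies and function names are made up): in the fine-grained version every loop is its own parallel region with its own fork/join and barrier, while the coarse-grained version creates the threads once and shares several loops inside a single region.

    #include <omp.h>
    #define N 1000

    /* fine-grained: each loop is parallelized on its own */
    void fine_grained(double *a, double *b)
    {
        #pragma omp parallel for
        for (int i = 0; i < N; i++) a[i] = 2.0*i;   /* fork/join + barrier */

        #pragma omp parallel for
        for (int i = 0; i < N; i++) b[i] = 3.0*i;   /* fork/join + barrier again */
    }

    /* coarse-grained: one enclosing region, threads reused for both loops */
    void coarse_grained(double *a, double *b)
    {
        #pragma omp parallel
        {
            #pragma omp for nowait    /* the loops are independent, so this barrier can be dropped */
            for (int i = 0; i < N; i++) a[i] = 2.0*i;

            #pragma omp for
            for (int i = 0; i < N; i++) b[i] = 3.0*i;
        }                             /* single implicit barrier at the end of the region */
    }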

7. Other Impediments to Scalability

[Figure: elapsed time of threads p0-p3 between start and finish; the longest-running thread determines when the parallel section completes]

• Load imbalance (a common mitigation is sketched below):
  • the time to complete a parallel execution of a code segment is determined by the longest-running thread
  • unequal work-load distribution leads to some processors being idle while others work too much
  • with coarse-grain parallelization, more opportunities for load imbalance exist
• Too many synchronization points:
  • the compiler will put synchronization points at the start and exit of each parallel region
  • if too many small loops have been made parallel, synchronization overhead will compromise scalability
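
One common mitigation for load imbalance (added here as an illustration, not taken from the slides) is dynamic loop scheduling: iterations are handed out in chunks as threads become free, at the cost of some scheduling overhead per chunk.

    #include <math.h>
    #define N 100000

    /* iterations with very uneven cost: a static split would leave some
       threads idle while others are still working */
    void uneven_work(double *out)
    {
        #pragma omp parallel for schedule(dynamic, 64)
        for (int i = 0; i < N; i++) {
            double x = 0.0;
            for (int k = 0; k < i % 1000; k++)   /* cost varies strongly with i */
                x += sin((double)k);
            out[i] = x;
        }
    }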

8. Parallel Programming Models

• Classification of programming models:
  • Control flow - number of explicit threads of execution
  • Address space - access to global data from multiple threads
  • Communication - data transfer part of language or library
  • Synchronization - mechanism to regulate access to data
  • Data allocation - control of the data distribution to execution threads

9. Computing π with DPL

  π = ∫₀¹ 4/(1+x²) dx ≈ Σ_{0≤i<N} 4/(N·(1+((i+0.5)/N)²))

• Notes:
  • essentially sequential form
  • automatic detection of parallelism
  • automatic work sharing
  • all variables shared by default
  • number of processors specified outside of the code
  • compile with: f90 -apo -O3 -mips4 -mplist
  • the -mplist switch will show the intermediate representation

      PROGRAM PIPROG
        INTEGER, PARAMETER :: N = 1000000
        INTEGER :: I
        REAL (KIND=8) :: PI, W = 1.0/N
        ! midpoint rule over i = 0..N-1, matching the sum above
        PI = SUM( (/ (4.0*W/(1.0+((I+0.5)*W)**2), I=0,N-1) /) )
        PRINT *, PI
      END

10. Computing π with Shared Memory

  π = ∫₀¹ 4/(1+x²) dx ≈ Σ_{0≤i<N} 4/(N·(1+((i+0.5)/N)²))

• Notes:
  • essentially sequential form
  • automatic work sharing
  • all variables shared by default
  • directives to request parallel work distribution
  • number of processors specified outside of the code

      #include <stdio.h>

      #define n 1000000

      int main(void)
      {
          double l, ls = 0.0, w = 1.0/n;
          int i;
          #pragma omp parallel private(i,l) reduction(+:ls)
          {
              #pragma omp for
              for (i = 0; i < n; i++) {
                  l  = (i + 0.5)*w;
                  ls += 4.0/(1.0 + l*l);
              }
          }
          /* the reduced sum in ls is complete only after the parallel region */
          printf("pi is %f\n", ls*w);
          return 0;
      }

11. Computing π with Message Passing

  π = ∫₀¹ 4/(1+x²) dx ≈ Σ_{0≤i<N} 4/(N·(1+((i+0.5)/N)²))

• Notes:
  • process (rank) identification first
  • explicit work sharing
  • all variables are private
  • explicit data exchange (reduce)
  • all code is parallel
  • number of processors is specified outside of the code

      #include <stdio.h>
      #include <mpi.h>

      #define N 1000000

      int main(int argc, char **argv)
      {
          double pi, l, ls = 0.0, w = 1.0/N;
          int i, mid, nth;
          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &mid);
          MPI_Comm_size(MPI_COMM_WORLD, &nth);
          for (i = mid; i < N; i += nth) {   /* cyclic distribution of iterations over ranks */
              l  = (i + 0.5)*w;
              ls += 4.0/(1.0 + l*l);
          }
          MPI_Reduce(&ls, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
          if (mid == 0) printf("pi is %f\n", pi*w);
          MPI_Finalize();
          return 0;
      }

12. Computing π with POSIX Threads
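
The code listing for this slide did not survive the transcript. The following is a minimal Pthreads sketch of the same midpoint-rule computation, written here for illustration; the thread count NTH, the partial[] array and the function names are choices made for this example, not taken from the original slide.

    #include <pthread.h>
    #include <stdio.h>

    #define N   1000000
    #define NTH 4                        /* thread count chosen for illustration */

    static double partial[NTH];          /* one partial sum per thread: no locking needed */

    static void *work(void *arg)
    {
        long mid = (long)arg;            /* thread id, 0..NTH-1 */
        double l, ls = 0.0, w = 1.0/N;
        for (long i = mid; i < N; i += NTH) {   /* cyclic distribution, as in the MPI version */
            l  = (i + 0.5)*w;
            ls += 4.0/(1.0 + l*l);
        }
        partial[mid] = ls;
        return NULL;
    }

    int main(void)
    {
        pthread_t th[NTH];
        double pi = 0.0, w = 1.0/N;

        for (long t = 0; t < NTH; t++)
            pthread_create(&th[t], NULL, work, (void *)t);
        for (int t = 0; t < NTH; t++) {
            pthread_join(th[t], NULL);   /* explicit synchronization with each worker */
            pi += partial[t];            /* explicit reduction done by the main thread */
        }
        printf("pi is %f\n", pi*w);
        return 0;
    }

Compared with the OpenMP version, thread creation, work distribution and the final reduction are all explicit here, which mirrors the "explicit" notes on the message-passing slide.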

13. Comparing Parallel Paradigms

• Automatic parallelization combined with explicit Shared Memory programming (compiler directives) is used on machines with global memory:
  • Symmetric Multi-Processors, CC-NUMA, PVP
  • these methods are collectively known as Shared Memory Programming (SMP)
• The SMP programming model works at loop level and at coarse-grain level:
  • the coarse-level parallelism has to be specified explicitly
  • loop-level parallelism can be found by the compiler (implicitly)
• Explicit Message Passing methods are necessary on machines that have no global memory addressability:
  • clusters of all sorts, NOW & COW (networks and clusters of workstations)
  • Message Passing methods require coarse-level parallelism to be scalable
• Choosing a programming model is largely a matter of the application, personal preference and the target machine; it has nothing to do with scalability. Scalability limitations are:
  • communication overhead
  • process synchronization
  • scalability is mainly a function of the hardware and (your) implementation of the parallelism

14. Summary

• The serial part or the communication overhead of the code limits its scalability (Amdahl’s Law)
• Programs have to be >99% parallel to use large (>30-processor) machines efficiently
• Several programming models are in use today:
  • Shared Memory programming (SMP) (with Automatic Compiler parallelization, Data-Parallel and explicit Shared Memory models)
  • Message Passing model
• Choosing a programming model is largely a matter of the application, personal choice and the target machine; it has nothing to do with scalability
• Don’t confuse the algorithm with its implementation
• Machines with a global address space can run applications based on both the SMP and the Message Passing programming models
