
Mohsan Jameel Department of Computing NUST School of Electrical Engineering and Computer Science


Presentation Transcript


  1. Mohsan Jameel Department of Computing NUST School of Electrical Engineering and Computer Science

  2. Outline • Introduction to OpenMP • OpenMP Programming Model • OpenMP Directives • OpenMP Clauses • Run-Time Library Routine • Environment Variables • Summary

  3. What is OpenMP • Application program interface (API) that is used to explicitly direct multi-threaded, shared memory parallelism • Consists of: • Compiler directives • Run time routines • Environment variables • Specification maintained by the OpenMP Architecture Review Board (http://www.openmp.org) • Version 3.0 was released in May 2008

  4. What OpenMP is Not • Not Automatic parallelization • User explicitly specifies parallel execution • Compiler does not ignore user directives even if wrong • Not just loop level parallelism • Functionality to enable coarse grained parallelism • Not meant for distributed memory parallel systems • Not necessarily implemented identically by all vendors • Not Guaranteed to make the most efficient use of shared memory

  5. History of OpenMP • In the early 90's, vendors of shared-memory machines supplied similar, directive-based, Fortran programming extensions: • The user would augment a serial Fortran program with directives specifying which loops were to be parallelized. • First attempt at a standard was the draft for ANSI X3H5 in 1994. It was never adopted, largely due to waning interest as distributed memory machines became popular. • The OpenMP standard specification started in the spring of 1997, taking over where ANSI X3H5 had left off, as newer shared memory machine architectures started to become prevalent

  6. Goal of OpenMP • Standardization : • Provide a standard among a variety of shared memory architectures/platforms • Lean and mean : • Establish a simple and limited set of directives for programming shared memory machines. • Ease of Use : • Provide capability to incrementally parallelize a serial program • Provide the capability to implement both coarse-grain and fine-grain parallelism • Portability : • Support Fortran (77, 90, and 95), C, and C++

  7. Outline • Introduction to OpenMP • OpenMP Programming Model • OpenMP Directives • OpenMP Clauses • Run-Time Library Routine • Environment Variables • Summary

  8. OpenMP Programming Model • Thread Based Parallelism • Explicit Parallelism • Compiler Directive Based • Dynamic Threads • Nested Parallelism Support • Task parallelism support (OpenMP specification 3.0)

  9. Shared Memory Model

  10. Execution Model • Figure: master thread ID=0, worker threads ID=1,2,3…N-1

  11. Terminology • OpenMP team = master + workers • A parallel region is a block of code executed by all threads simultaneously. • The master thread always has thread ID=0 • Thread adjustment is done before entering a parallel region. • An “if” clause can be used with the parallel construct; if the condition evaluates to FALSE, the parallel region is avoided and the code runs serially • A work-sharing construct is responsible for dividing work among the threads in a parallel region
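
The effect of the “if” clause can be sketched in C as follows. This is a minimal illustration, not code from the presentation: the helper name team_size_for and the threshold of 1000 are assumptions, and the #else branch only supplies a serial stand-in so the sketch also builds without OpenMP support.

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial stand-in so the sketch also builds without OpenMP support. */
static int omp_get_num_threads(void) { return 1; }
#endif

/* Hypothetical helper: reports the team size used for a problem of size n. */
int team_size_for(int n)
{
    int team = 0;

    /* The region runs in parallel only when n exceeds an arbitrary threshold;
       otherwise it is executed serially by a team of one thread. */
    #pragma omp parallel if(n > 1000) shared(team)
    {
        /* One thread records the team size; single ends with a barrier. */
        #pragma omp single
        team = omp_get_num_threads();
    }
    return team;
}
```

Calling team_size_for(10) should report a team of one thread, while a large n lets the runtime use a full team.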

  12. Example OpenMP Code Structure

  13. Components of OpenMP

  14. Introduction to OpenMP • OpenMP Programming Model • OpenMP Directives • OpenMP Clauses • Run-Time Library Routine • Environment Variables • Summary

  15. Go to helloworld.c

  16. C/C++ Parallel Region Example • Directive pair as shown on the slide (Fortran syntax): !$OMP PARALLEL write (*,*) "Hello" !$OMP END PARALLEL • Figure: thread 0, thread 1, thread 2 • Sample output: Hello world from thread = 0; Number of threads = 3; Hello world from thread = 1; Hello world from thread = 2
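
The fragment on this slide uses Fortran sentinels. A C counterpart that would produce output of the kind listed might look like the sketch below. The function name hello_region is an assumption, helloworld.c itself is not reproduced in the transcript, and the order of the printed lines varies from run to run.

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial stand-ins so the sketch also builds without OpenMP support. */
static int omp_get_thread_num(void) { return 0; }
static int omp_get_num_threads(void) { return 1; }
#endif

/* Hypothetical helper: every thread says hello; returns the team size. */
int hello_region(void)
{
    int nthreads = 0;

    #pragma omp parallel default(none) shared(nthreads)
    {
        int tid = omp_get_thread_num();   /* private: declared in the region */
        printf("Hello world from thread = %d\n", tid);

        /* Only the master thread (ID 0) reports the team size. */
        if (tid == 0) {
            nthreads = omp_get_num_threads();
            printf("Number of threads = %d\n", nthreads);
        }
    }
    return nthreads;
}
```

With GCC the sketch would typically be compiled with the -fopenmp flag, and the team size can be chosen through OMP_NUM_THREADS.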

  17. OpenMP Directives

  18. OpenMP Scoping • Static Extent: • The code textually enclosed between the beginning and end of a structured block • The static extent does not span other routines • Orphaned Directive: • An OpenMP directive that appears independently of an enclosing parallel construct • Dynamic Extent: • Includes both the static extent and the orphaned directives
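
A small C sketch may make the three terms concrete. The routine names scale and scale_in_parallel are hypothetical. The "omp for" inside scale is orphaned because no parallel construct encloses it textually; it joins the dynamic extent of the region in scale_in_parallel when called from there.

```c
/* Orphaned work-sharing directive: this routine contains no parallel construct.
   Called from inside a parallel region, the iterations are divided among the
   team; called from serial code, the loop simply runs on one thread. */
void scale(double *a, int n, double factor)
{
    int i;
    #pragma omp for
    for (i = 0; i < n; i++)
        a[i] *= factor;
}

/* The static extent of this parallel region is just the call below;
   its dynamic extent also covers the body of scale(). */
void scale_in_parallel(double *a, int n, double factor)
{
    #pragma omp parallel default(none) shared(a, n, factor)
    {
        scale(a, n, factor);
    }
}
```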

  19. OpenMP Parallel Regions • A block of code that will be executed by multiple threads • Properties - Fork-Join Model - Number of threads won’t change inside a parallel region - SPMD execution within region - Enclosed block of code must be structured, no branching into or out of block • Format #pragma omp parallel clause1 clause2 …

  20. OpenMP Threads • How many threads? • Use of the omp_set_num_threads() library function • Setting of the OMP_NUM_THREADS environment variable • Implementation default • Dynamic Threads : • By default, the same number of threads is used to execute each parallel region • Two methods for enabling dynamic threads • Use of the omp_set_dynamic() library function • Setting of the OMP_DYNAMIC environment variable
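
The library-routine route can be sketched as below. The helper name request_team is an assumption; omp_set_num_threads(), omp_set_dynamic() and omp_get_num_threads() are the standard OpenMP routines, and the #else branch only provides serial stand-ins for builds without OpenMP.

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial stand-ins so the sketch also builds without OpenMP support. */
static void omp_set_num_threads(int n) { (void)n; }
static void omp_set_dynamic(int flag) { (void)flag; }
static int omp_get_num_threads(void) { return 1; }
#endif

/* Hypothetical helper: asks for a team of the given size and reports
   how many threads the runtime actually provided. */
int request_team(int requested)
{
    int team = 0;

    omp_set_dynamic(0);               /* turn dynamic adjustment off */
    omp_set_num_threads(requested);   /* applies to later parallel regions */

    #pragma omp parallel default(none) shared(team)
    {
        #pragma omp master
        team = omp_get_num_threads();
    }
    return team;
}
```

The same request can be made without recompiling by setting OMP_NUM_THREADS and OMP_DYNAMIC in the environment before the program starts.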

  21. OpenMP Work-sharing constructs • for: data parallelism • sections: functional parallelism • single: serialize a section

  22. Example: Count3s in an array • Let's assume we have an array of N integers. • We want to find how many 3s are in the array. • We need • a for loop • an if statement, and • a count variable • Let's look at its serial and parallel versions

  23. Serial: Count3s in an array
      int i, count = 0, n = 100;
      int array[n]; // initialize array
      for (i = 0; i < n; i++) {
          if (array[i] == 3) count++;
      }

  24. Work-sharing construct: “for loop” • The “for loop” work-sharing construct is thought of as a data-parallelism construct.

  25. Parallelize 1st attempt: Count3s in an array
      int i, count = 0, n = 100;
      int array[n]; // initialize array
      // count is shared, so concurrent count++ updates race with each other
      #pragma omp parallel for default(none) shared(n,array,count) private(i)
      for (i = 0; i < n; i++) {
          if (array[i] == 3) count++;
      }

  26. Work-sharing construct: Example of “for loop”
      #pragma omp parallel for default(none) shared(n,a,b,c) private(i)
      for (i = 0; i < n; i++) {
          c[i] = a[i] + b[i];
      }

  27. Work-sharing construct: “section” • The “section” work-sharing construct is thought of as a functional-parallelism construct.

  28. Parallelize 2nd attempt: Count3s in an array • Say we also want to count 4s in the same array. • Now we have two different functions, i.e. count 3s and count 4s.
      int i, count3 = 0, count4 = 0, n = 100;
      int array[n]; // initialize array
      #pragma omp parallel sections default(none) shared(n,array,count3,count4) private(i)
      {
          #pragma omp section
          for (i = 0; i < n; i++) { if (array[i] == 3) count3++; }
          #pragma omp section
          for (i = 0; i < n; i++) { if (array[i] == 4) count4++; }
      }
      No data race condition in this example. WHY?

  29. Work-sharing construct: Example 1 of “section”
      #pragma omp parallel sections default(none) shared(a,b,c,d,e,n) private(i)
      {
          #pragma omp section
          {
              printf("Thread %d executes 1st loop\n", omp_get_thread_num());
              for (i = 0; i < n; i++) a[i] = 3*b[i];
          }
          #pragma omp section
          {
              printf("Thread %d executes 2nd loop\n", omp_get_thread_num());
              for (i = 0; i < n; i++) e[i] = 2*c[i] + d[i];
          }
      }
      final_sum = sum(a,n) + sum(e,n);
      printf("FINAL_SUM is %d\n", final_sum);

  30. Work-sharing construct:Example 2 of “section” 1/2

  31. Work-sharing construct:Example 2 of “section” 2/2

  32. Work-sharing construct: Example of “single” • In a parallel region, a “single” block specifies that the block is executed by only one thread in the team of threads. Let's look at an example
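
The example slide itself did not survive in the transcript. A minimal sketch of the idea, with the hypothetical helper name init_once, could be:

```c
/* Hypothetical helper: counts how many times the single block runs. */
int init_once(void)
{
    int runs = 0;

    #pragma omp parallel default(none) shared(runs)
    {
        /* Executed by exactly one thread of the team, whichever arrives
           first; the others wait at the implicit barrier that ends single. */
        #pragma omp single
        {
            runs += 1;
        }
    }
    return runs;
}
```

However many threads form the team, the block runs once, so the function returns 1.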

  33. Introduction to OpenMP • OpenMP Programming Model • OpenMP Directives • OpenMP Clauses • Run-Time Library Routine • Environment Variables • Summary

  34. OpenMP Clauses: Data sharing 1/2 • shared(list) • The shared clause is used to specify which data is shared among threads. • All threads can read and write to a shared variable. • By default all variables are shared. • private(list) • Private variables are local to each thread. • A typical example of a private variable is the loop counter, since each thread has its own loop counter initialized at the entry point.
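
The two clauses can be seen side by side in a short sketch. The function name squares_plus_one and the scratch variable tmp are illustrative assumptions: the arrays and n are shared by the whole team, while each thread gets its own i and tmp.

```c
/* Hypothetical helper: out[k] = in[k]*in[k] + 1 for k in [0, n). */
void squares_plus_one(const int *in, int *out, int n)
{
    int i, tmp;

    /* in, out and n are visible to all threads; i and tmp are per-thread
       copies, so one thread's scratch value never overwrites another's. */
    #pragma omp parallel for default(none) shared(in, out, n) private(i, tmp)
    for (i = 0; i < n; i++) {
        tmp = in[i] * in[i];
        out[i] = tmp + 1;
    }
}
```

Had tmp been left shared, two threads could overwrite each other's intermediate value between the two statements of the loop body.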

  35. OpenMP Clauses: Data sharing 2/2 • A private variable is defined between the entry and exit points of the parallel region. • A private variable within a parallel region has no scope outside of it • The firstprivate and lastprivate clauses are used to extend the scope of a variable beyond the parallel region. • firstprivate: All variables in the list are initialized with the original value that the object had before entering the parallel region • lastprivate: The thread that executes the last iteration or section updates the value of the object in the list.

  36. Example: firstprivate and lastprivate
      int main() {
          int i, n = 5;   /* n is an illustrative loop bound */
          int C, B, A = 10;
          /*--- Start of parallel region ---*/
          #pragma omp parallel for default(none) shared(n) firstprivate(A) lastprivate(B) private(i)
          for (i = 0; i < n; i++) {
              …
              B = i + A;
              …
          }
          /*--- End of parallel region ---*/
          C = B;
      }

  37. OpenMP Clauses: nowait • The nowait clause is used to avoid the implicit synchronization at the end of a work-sharing directive
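
A minimal sketch of nowait, assuming two loops that touch unrelated arrays (the function name two_loops and the arithmetic are illustrative only):

```c
/* Hypothetical helper: a[k] = 3*b[k] and d[k] = 2*c[k], computed in one region. */
void two_loops(const int *b, const int *c, int *a, int *d, int n)
{
    int i;

    #pragma omp parallel default(none) shared(a, b, c, d, n) private(i)
    {
        /* nowait removes the barrier after this loop: a thread that finishes
           its share moves straight on to the second loop. This is safe only
           because the second loop never reads a[]. */
        #pragma omp for nowait
        for (i = 0; i < n; i++)
            a[i] = 3 * b[i];

        #pragma omp for
        for (i = 0; i < n; i++)
            d[i] = 2 * c[i];
    }
}
```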

  38. OpenMP Clause: schedule • The schedule clause is supported in the loop construct only. • Used to control the manner in which loop iterations are distributed over the threads. • Syntax: schedule(kind[,chunk_size]) • Types: • static[,chunk]: distribute iterations in blocks of size “chunk” over the threads in a round-robin fashion • dynamic[,chunk]: fixed portions of work; size is controlled by the value of chunk; when a thread finishes its portion it starts on the next portion. • guided[,chunk]: same as “dynamic”, but the size of the portion of work decreases exponentially. • runtime: the iteration scheduling scheme is set at runtime through the environment variable OMP_SCHEDULE (no chunk size is given in the clause)
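
As a sketch of where the clause goes, the loop below has deliberately uneven iterations, which is the case dynamic scheduling is meant for. The names work and fill_triangular and the chunk size of 2 are assumptions chosen for illustration.

```c
/* Hypothetical uneven workload: iteration i costs about i additions. */
static long work(int i)
{
    long acc = 0;
    int k;
    for (k = 1; k <= i; k++)
        acc += k;
    return acc;
}

/* Hypothetical helper: out[i] = 1 + 2 + ... + i. */
void fill_triangular(long *out, int n)
{
    int i;

    /* Chunks of 2 iterations are handed out on demand, so threads that
       draw cheap early iterations come back for more work sooner. */
    #pragma omp parallel for default(none) shared(out, n) private(i) schedule(dynamic, 2)
    for (i = 0; i < n; i++)
        out[i] = work(i);
}
```

Replacing the clause with schedule(runtime) would defer the choice to OMP_SCHEDULE, which makes it easy to compare kinds without recompiling.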

  39. The Experiment with schedule clause

  40. OpenMP Critical construct • Example: summation of a vector
      int main() {
          int i, sum = 0, n = 5;
          int a[5] = {1,2,3,4,5};
          /*--- Start of parallel region ---*/
          #pragma omp parallel for default(none) shared(sum,a,n) private(i)
          for (i = 0; i < n; i++) {
              sum += a[i];   /* race condition: unsynchronized update of shared sum */
          }
          /*--- End of parallel region ---*/
          printf("sum of vector a =%d", sum);
      }

  41. OpenMP Critical construct
      int main() {
          int i, sum = 0, local_sum, n = 5;
          int a[5] = {1,2,3,4,5};
          /*--- Start of parallel region ---*/
          #pragma omp parallel default(none) shared(sum,a,n) private(local_sum,i)
          {
              local_sum = 0;   /* private copies start uninitialized */
              #pragma omp for
              for (i = 0; i < n; i++) {
                  local_sum += a[i];
              }
              #pragma omp critical
              {
                  sum += local_sum;
              }
          } /*--- End of parallel region ---*/
          printf("sum of vector a =%d", sum);
      }

  42. Parallelize 3rd attempt: Count3s in an array
      int i, count = 0, local_count, n = 100;
      int array[n]; // initialize array
      #pragma omp parallel default(none) shared(n,array,count) private(i,local_count)
      {
          local_count = 0;   /* private copies start uninitialized */
          #pragma omp for
          for (i = 0; i < n; i++) {
              if (array[i] == 3) local_count++;
          }
          #pragma omp critical
          {
              count += local_count;
          }
      } /*--- End of Parallel region ---*/

  43. OpenMP Clause: reduction • OpenMP provides a reduction clause, which is used with the for loop and sections directives. • The reduction variable must be shared among threads • The race condition is avoided implicitly.
      int main() {
          int i, sum = 0, n = 5;
          int a[5] = {1,2,3,4,5};
          /*--- Start of parallel region ---*/
          #pragma omp parallel for default(none) shared(a,n) private(i) \
                  reduction(+:sum)
          for (i = 0; i < n; i++) {
              sum += a[i];
          }
          /*--- End of parallel region ---*/
          printf("sum of vector a =%d", sum);
      }

  44. Parallelize 4th attempt: Count3s in an array
      int i, count = 0, n = 100;
      int array[n]; // initialize array
      #pragma omp parallel for default(none) shared(n,array) private(i) \
              reduction(+:count)
      for (i = 0; i < n; i++) {
          if (array[i] == 3) count++;
      }
      /*--- End of Parallel region ---*/

  45. Tasking in OpenMP

  46. Tasking in OpenMP • In OpenMP 3.0 the concept of tasks was added to the OpenMP execution model • The task model is useful in cases where the number of parallel pieces and the work involved in each piece vary and/or are unknown • Before inclusion of the task model, OpenMP was not suited to unstructured problems • Tasks are often set up within a single construct in a manager-worker model.

  47. Task Parallelism Approach 1/2 • Threads line up as workers, go through the queue of work to be done, and do a task • Threads do not wait, as in loop parallelism, but rather go back to the queue and do more tasks. • Each task is executed serially by the worker thread that encounters that task in the queue. • Load balancing occurs as short and long tasks are done as threads become available.
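
The manager-worker pattern described above can be sketched with a linked list, a typical case where the amount of work is not known up front. The struct node type and the function name double_all are assumptions for illustration; the task directive requires OpenMP 3.0 or later.

```c
#include <stddef.h>

/* Hypothetical list node used only for this sketch. */
struct node {
    int value;
    struct node *next;
};

/* Hypothetical helper: doubles the value stored in every node. */
void double_all(struct node *head)
{
    #pragma omp parallel default(none) shared(head)
    {
        /* One thread acts as the manager and walks the list... */
        #pragma omp single
        {
            struct node *p;
            for (p = head; p != NULL; p = p->next) {
                /* ...creating one task per node; idle threads of the team
                   pick the tasks up. firstprivate captures the current p. */
                #pragma omp task firstprivate(p)
                p->value *= 2;
            }
        }
        /* All tasks are complete once the team passes the barrier
           that ends the single construct. */
    }
}
```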

  48. Task Parallelism Approach 2/2

  49. Example: Task parallelism

  50. Best Practices • Optimize barrier use • Avoid ordered construct • Avoid large critical regions • Maximize parallel regions • Avoid multiple use of parallel regions • Address poor load balance
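
One reading of "maximize parallel regions" and "avoid multiple use of parallel regions" is to place several work-sharing loops inside a single region instead of opening a new region per loop. The sketch below assumes that reading; the function name update_both and the arithmetic are illustrative.

```c
/* Hypothetical helper: a[k] += 1, then b[k] = 2*a[k], in one parallel region. */
void update_both(int *a, int *b, int n)
{
    int i;

    /* One fork/join for both loops, rather than two "parallel for" regions. */
    #pragma omp parallel default(none) shared(a, b, n) private(i)
    {
        #pragma omp for
        for (i = 0; i < n; i++)
            a[i] += 1;

        /* The implicit barrier after the first loop is kept on purpose:
           the second loop reads the updated a[]. */
        #pragma omp for
        for (i = 0; i < n; i++)
            b[i] = 2 * a[i];
    }
}
```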
