
Parallel Processing (CS 667) Lecture 5: Shared Memory Parallel Programming with OpenMP *






Presentation Transcript


  1. Parallel Processing (CS 667) Lecture 5: Shared Memory Parallel Programming with OpenMP* Jeremy R. Johnson

  2. Introduction
  • Objective: To further study the shared memory model of parallel programming. Introduction to the OpenMP standard for shared memory parallel programming.
  • Topics
    • OpenMP vs. Pthreads
    • hello_pthreads.c
    • hello_openmp.c
    • Parallel regions and execution model
    • Data parallelism with loops
    • Shared vs. private variables
    • Scheduling and chunk size
    • Synchronization and reduction variables
    • Functional parallelism with parallel sections
    • Case studies

  3. OpenMP
  • Extension to FORTRAN, C/C++
  • Uses directives (comments in FORTRAN, pragmas in C/C++)
    • ignored without compiler support
    • some library support required
  • Shared memory model
    • parallel regions
    • loop-level parallelism
    • implicit thread model
    • communication via shared address space
    • private vs. shared variables (declaration)
    • explicit synchronization via directives (e.g. critical)
    • library routines for returning thread information (e.g. omp_get_num_threads(), omp_get_thread_num())
    • environment variables used to provide system info (e.g. OMP_NUM_THREADS)
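
  A minimal sketch (not part of the original slides) of how these pieces fit together: the directive is ignored by a compiler without OpenMP support, the standard _OPENMP macro guards the library calls, and OMP_NUM_THREADS selects the thread count at run time.

  /* Sketch: directive + library routine + environment variable.
   * Without -fopenmp the pragma is ignored and the #ifdef falls
   * back to serial behavior. */
  #include <stdio.h>
  #ifdef _OPENMP
  #include <omp.h>
  #endif

  int main(void)
  {
    #pragma omp parallel          /* ignored by a non-OpenMP compiler */
    {
  #ifdef _OPENMP
      printf("thread %d of %d\n",
             omp_get_thread_num(), omp_get_num_threads());
  #else
      printf("thread 0 of 1 (compiled without OpenMP)\n");
  #endif
    }
    return 0;
  }

  Compiled with gcc -fopenmp and run with OMP_NUM_THREADS=4, this would print one line per thread; compiled without -fopenmp it prints a single line.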

  4. Benefits
  • Provides incremental parallelism
  • Small increase in code size
  • Simpler model than message passing
  • Easier to use than a thread library
  • With hardware and compiler support, smaller granularity than message passing

  5. Further Information
  • Adopted as a standard in 1997
  • Initiated by SGI
  • www.openmp.org
  • computing.llnl.gov/tutorials/openMP
  • Chandra, Dagum, Kohr, Maydan, McDonald, and Menon, “Parallel Programming in OpenMP,” Morgan Kaufmann Publishers, 2001.
  • Chapman, Jost, and Van der Pas, “Using OpenMP: Portable Shared Memory Parallel Programming,” The MIT Press, 2008.

  6. Shared vs. Distributed Memory
  [Diagram: shared memory — processors P0, P1, …, Pn all connected to a single Memory; distributed memory — each processor Pi has its own memory Mi, connected via an Interconnection Network]

  7. Shared Memory Programming Model
  • Shared memory programming does not require physically shared memory, so long as there is support for logically shared memory (in either hardware or software)
  • With logically shared memory, the cost of a memory access may differ depending on the physical location of the data
    • UMA - uniform memory access
      • SMP - symmetric multi-processor
      • typically memory connected to processors via a bus
    • NUMA - non-uniform memory access
      • typically physically distributed memory connected via an interconnection network

  8. Hello_openmp.c
  #include <stdio.h>
  #include <stdlib.h>
  #include <omp.h>

  int main(int argc, char **argv)
  {
    int n;

    if (argc > 1) {
      n = atoi(argv[1]);
      omp_set_num_threads(n);
    }
    printf("Number of threads = %d\n", omp_get_num_threads());
    #pragma omp parallel
    {
      int id = omp_get_thread_num();
      printf("Hello World from %d\n", id);
      if (id == 0)
        printf("Number of threads = %d\n", omp_get_num_threads());
    }
    exit(0);
  }

  9. Compiling & Running Hello_openmp
  % gcc -fopenmp hello_openmp.c -o hello
  % ./hello 4
  Number of threads = 1
  Hello World from 1
  Hello World from 0
  Hello World from 3
  Number of threads = 4
  Hello World from 2
  The order of the print statements is nondeterministic.

  10. Execution Model
  [Diagram: the master thread reaches a parallel region and implicitly creates slave threads (fork); master and slave threads execute the parallel region; an implicit barrier synchronization (join) at the end of the region leaves only the master thread running]
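
  A minimal sketch (not from the slides) of the fork/join pattern: the statement after the parallel region runs only after every thread has reached the implicit barrier at the region's closing brace.

  #include <stdio.h>
  #include <omp.h>

  int main(void)
  {
    printf("before region: 1 thread (master)\n");

    #pragma omp parallel                 /* fork: master + slave threads */
    {
      printf("inside region: thread %d\n", omp_get_thread_num());
    }                                    /* implicit barrier and join here */

    printf("after region: master thread only\n");
    return 0;
  }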

  11. Explicit Barrier
  #include <stdio.h>
  #include <stdlib.h>
  #include <omp.h>

  int main(int argc, char **argv)
  {
    int n;

    if (argc > 1) {
      n = atoi(argv[1]);
      omp_set_num_threads(n);
    }
    printf("Number of threads = %d\n", omp_get_num_threads());
    #pragma omp parallel
    {
      int id = omp_get_thread_num();
      printf("Hello World from %d\n", id);
      #pragma omp barrier
      if (id == 0)
        printf("Number of threads = %d\n", omp_get_num_threads());
    }
    exit(0);
  }

  12. Output with Barrier
  % ./hellob 4
  Number of threads = 1
  Hello World from 1
  Hello World from 0
  Hello World from 2
  Hello World from 3
  Number of threads = 4
  The order of the “Hello World” print statements is nondeterministic; however, the final “Number of threads” print statement always comes at the end.

  13. Hello_pthreads.c
  #include <stdio.h>
  #include <stdlib.h>
  #include <pthread.h>
  #include <errno.h>

  #define MAXTHREADS 32

  int main(int argc, char **argv)
  {
    int error, i, n;
    void hello(int *pid);
    pthread_t tid[MAXTHREADS], mytid;
    int pid[MAXTHREADS];

    if (argc > 1) {
      n = atoi(argv[1]);
      if (n > MAXTHREADS) {
        printf("Too many threads\n");
        exit(1);
      }
      pthread_setconcurrency(n);
    }
    printf("Number of threads = %d\n", pthread_getconcurrency());
    for (i = 0; i < n; i++) {
      pid[i] = i;
      error = pthread_create(&tid[i], NULL, (void *(*)(void *))hello, &pid[i]);
    }
    for (i = 0; i < n; i++) {
      error = pthread_join(tid[i], NULL);
    }
    exit(0);
  }

  14. Hello_pthreads.c
  void hello(int *pid)
  {
    pthread_t tid;

    tid = pthread_self();
    printf("Hello World from %d (tid = %u)\n", *pid, (unsigned int) tid);
    if (*pid == 0)
      printf("Number of threads = %d\n", pthread_getconcurrency());
  }

  % gcc -pthread hello.c -o hello
  % ./hello 4
  Number of threads = 4
  Hello World from 0 (tid = 1832728912)
  Hello World from 1 (tid = 1824336208)
  Number of threads = 4
  Hello World from 3 (tid = 1807550800)
  Hello World from 2 (tid = 1815943504)
  The order of the print statements is nondeterministic.

  15. Types of Parallelism
  [Diagram: two fork/join patterns — Data Parallelism (a parallel LOOP) and Functional Parallelism (independent sections F1, F2, F3, F4)]
  • Data parallelism: threads execute the same instructions, but on different data.
  • Functional parallelism: threads execute different instructions; they can read the same data but should write different data.
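
  Both patterns appear later in the deck (slides 16 and 26). As a preview, here is a small sketch (not from the slides) showing them side by side in one file.

  #include <omp.h>
  #define N 1000

  int a[N], b[N], c[N], d[N];

  /* Data parallelism: every thread runs the same loop body on different i. */
  void data_parallel(void)
  {
    int i;
    #pragma omp parallel for shared(a, b, c) private(i)
    for (i = 0; i < N; i++)
      c[i] = a[i] + b[i];
  }

  /* Functional parallelism: each section is an independent piece of work. */
  void functional_parallel(void)
  {
    #pragma omp parallel sections shared(a, b, c, d)
    {
      #pragma omp section
      { int i; for (i = 0; i < N; i++) c[i] = a[i] + b[i]; }   /* task 1 */
      #pragma omp section
      { int i; for (i = 0; i < N; i++) d[i] = a[i] * b[i]; }   /* task 2 */
    }
  }

  int main(void)
  {
    int i;
    for (i = 0; i < N; i++) { a[i] = i; b[i] = N - i; }
    data_parallel();
    functional_parallel();
    return 0;
  }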

  16. Parallel Loop
  // Serial version
  int a[1000], b[1000];
  int main()
  {
    int i;
    int N = 1000;

    for (i = 0; i < N; i++) {
      a[i] = i;
      b[i] = N - i;
    }
    for (i = 0; i < N; i++) {
      a[i] = a[i] + b[i];
    }
  }

  // Parallel version
  int a[1000], b[1000];
  int main()
  {
    int i;
    int N = 1000;

    // Serial initialization
    for (i = 0; i < N; i++) {
      a[i] = i;
      b[i] = N - i;
    }
    #pragma omp parallel for shared(a,b), private(i), schedule(static)
    for (i = 0; i < N; i++) {
      a[i] = a[i] + b[i];
    }
  }

  17. Scheduling of Parallel Loop
  [Diagram: stripmining — the elements of a and b are assigned to threads cyclically by tid (0, 1, 2, …, Nthreads-1), so each thread adds every Nthreads-th element]

  18. Implementation of Parallel Loop
  void vadd(int *id)
  {
    int i;

    for (i = *id; i < N; i += numthreads) {
      a[i] = a[i] + b[i];
    }
  }

  for (i = 0; i < numthreads; i++) {
    id[i] = i;
    error = pthread_create(&tid[i], NULL, (void *(*)(void *))vadd, &id[i]);
  }
  for (i = 0; i < numthreads; i++) {
    error = pthread_join(tid[i], NULL);
  }

  19. Scheduling Chunks of Parallel Loop
  [Diagram: the elements of a and b are divided into chunks of consecutive elements (Chunk 0, Chunk 1, Chunk 2, …); chunks are assigned to threads tid = 0, 1, 2, …, Nthreads-1]

  20. Implementation of Chunking
  #pragma omp parallel for shared(a,b), private(i), schedule(static,CHUNK)
  for (i = 0; i < N; i++) {
    a[i] = a[i] + b[i];
  }

  void vadd(int *id)
  {
    int i, j;

    for (i = *id * CHUNK; i < N; i += numthreads * CHUNK) {
      for (j = 0; j < CHUNK; j++)
        a[i+j] = a[i+j] + b[i+j];
    }
  }
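
  To see the iteration-to-thread mapping concretely, here is a small sketch (not from the slides; the values of N and CHUNK are illustrative): schedule(static,1) reproduces the cyclic assignment of the pthread code on slide 18, while schedule(static,CHUNK) reproduces the chunked assignment above.

  #include <stdio.h>
  #include <omp.h>
  #define N 16
  #define CHUNK 4

  int main(void)
  {
    int i;

    printf("schedule(static,1):\n");
    #pragma omp parallel for schedule(static,1) private(i)
    for (i = 0; i < N; i++)
      printf("  iteration %2d -> thread %d\n", i, omp_get_thread_num());

    printf("schedule(static,%d):\n", CHUNK);
    #pragma omp parallel for schedule(static,CHUNK) private(i)
    for (i = 0; i < N; i++)
      printf("  iteration %2d -> thread %d\n", i, omp_get_thread_num());
    return 0;
  }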

  21. Race Condition
  int x[10000000];

  int main(int argc, char **argv)
  {
    int sum = 0;
    …….
    omp_set_num_threads(numcounters);
    for (i = 0; i < numcounters*limit; i++)
      x[i] = 1;
    #pragma omp parallel for schedule(static) private(i) shared(sum,x)
    for (i = 0; i < numcounters*limit; i++) {
      sum = sum + x[i];
      if (i == 0)
        printf("num threads = %d\n", omp_get_num_threads());
    }

  22. Critical Sections
  int x[10000000];

  int main(int argc, char **argv)
  {
    int sum = 0;
    …….
    #pragma omp parallel for schedule(static) private(i) shared(sum,x)
    for (i = 0; i < numcounters*limit; i++) {
      #pragma omp critical(sum)
      sum = sum + x[i];
    }
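
  A possible lighter-weight alternative (not covered on this slide): when the protected code is a single scalar update such as sum = sum + x[i], #pragma omp atomic guards just that one memory update rather than a general critical section. The array size below is illustrative, not from the lecture.

  #include <stdio.h>
  #include <omp.h>
  #define N 1000000

  int x[N];

  int main(void)
  {
    int i, sum = 0;

    for (i = 0; i < N; i++)
      x[i] = 1;

    #pragma omp parallel for schedule(static) private(i) shared(sum, x)
    for (i = 0; i < N; i++) {
      #pragma omp atomic
      sum = sum + x[i];           /* atomic read-modify-write of sum */
    }

    printf("sum = %d\n", sum);    /* always N, with any number of threads */
    return 0;
  }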

  23. Reduction Variables
  int x[10000000];

  int main(int argc, char **argv)
  {
    int sum = 0;
    …….
    #pragma omp parallel for schedule(static) private(i) shared(x) reduction(+:sum)
    for (i = 0; i < numcounters*limit; i++) {
      sum = sum + x[i];
    }

  24. Reduction
  [Diagram: the array x[] is split among the threads; each thread computes a partial sum with +, and the partial sums are then combined with + into the total sum]

  25. Implementing Reduction
  #pragma omp parallel shared(sum,x)
  {
    int i;
    int localsum = 0;
    int id;

    id = omp_get_thread_num();
    for (i = id; i < numcounters*limit; i += numcounters) {
      localsum = localsum + x[i];
    }
    #pragma omp critical(sum)
    sum = sum + localsum;
  }

  26. Functional Parallelism Example
  int main()
  {
    int i;
    double a[N], b[N], c[N], d[N];

    // Parallel function
    #pragma omp parallel shared(a,b,c,d) private(i)
    {
      #pragma omp sections
      {
        #pragma omp section
        for (i = 0; i < N; i++)
          c[i] = a[i] + b[i];
        #pragma omp section
        for (i = 0; i < N; i++)
          d[i] = a[i] * b[i];
      }
    }
  }
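
  One possible way to check the assignment of sections to threads (a sketch, not from the slides) is to print omp_get_thread_num() inside each section; with at least two threads the sections may run concurrently, and with one thread they simply run one after the other.

  #include <stdio.h>
  #include <omp.h>
  #define N 100

  int main(void)
  {
    int i;
    double a[N], b[N], c[N], d[N];

    for (i = 0; i < N; i++) { a[i] = i; b[i] = N - i; }

    #pragma omp parallel sections shared(a, b, c, d)
    {
      #pragma omp section
      {
        printf("sum section on thread %d\n", omp_get_thread_num());
        for (int j = 0; j < N; j++) c[j] = a[j] + b[j];
      }
      #pragma omp section
      {
        printf("product section on thread %d\n", omp_get_thread_num());
        for (int j = 0; j < N; j++) d[j] = a[j] * b[j];
      }
    }
    printf("c[1] = %g, d[1] = %g\n", c[1], d[1]);
    return 0;
  }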
