
Five Common Defect Types in Parallel Computing



  1. Five Common Defect Types in Parallel Computing
  Prepared for Applied Parallel Computing, Prof. Alan Edelman
  Taiga Nakamura, University of Maryland

  2. Introduction
  • Debugging and testing parallel programs is hard. What kinds of mistakes do programmers make? How can defects be prevented, or found and fixed effectively?
  • Hypothesis: knowing about common defects will reduce the time spent debugging
  • Here: five common defect types in parallel programming (drawn from last year's classes)
  • These examples are in C/MPI; similar defect types are suspected in UPC, CAF, and F/MPI
  • Your feedback is solicited (by both us and UMD)!

  3. Defect 1: Language Usage
  • Example (contains the defects discussed in the notes below):

      #include <mpi.h>
      #include <stdio.h>

      int main(int argc, char **argv) {
        FILE *fp;
        MPI_Status status;
        status = MPI_Init(NULL, NULL);
        if (status != MPI_SUCCESS) { return -1; }
        fp = fopen(...);
        if (fp == NULL) { return -1; }
        ...
        fclose(fp);
        MPI_Finalize();
        return 0;
      }

  • Note: MPI_Init(NULL, NULL) is valid in MPI 2.0 only; in MPI 1.1 it had to be MPI_Init(&argc, &argv)
  • Note: MPI_Finalize must be called by all processors in every execution path; the early returns above skip it
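  A minimal corrected sketch of the same program (the file name and the elided processing are hypothetical): MPI_Init takes &argc and &argv, which is valid in both MPI 1.1 and 2.0, and MPI_Finalize is reached on every execution path.

      #include <mpi.h>
      #include <stdio.h>

      int main(int argc, char **argv) {
        FILE *fp;
        if (MPI_Init(&argc, &argv) != MPI_SUCCESS) {  /* MPI_Init returns an int error code */
          return -1;                      /* MPI never started, nothing to finalize */
        }
        fp = fopen("input.dat", "r");     /* hypothetical file name */
        if (fp == NULL) {
          MPI_Finalize();                 /* finalize on the error path too */
          return -1;
        }
        /* ... read and process the file ... */
        fclose(fp);
        MPI_Finalize();                   /* and on the normal path */
        return 0;
      }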

  4. Use of Language Features
  [Chart: MPI keywords used in the Conjugate Gradient assignment in C/C++ (15 students); 24 functions and 8 constants appear in total]
  • Advanced language features are not necessarily used
  • Try to understand a few basic language features thoroughly

  5. Defect 1: Language Usage
  • Erroneous use of parallel language features
  • E.g. inconsistent data types between send and recv (a hypothetical sketch follows this slide), or misuse of memory copy functions in UPC
  • Simple mistakes in understanding; very common in novices
  • Compile-time defects (wrong number or type of parameters, etc.) can be found easily
  • Some defects surface only under specific conditions, e.g. a particular number of processors, input value, or hardware/software environment
  • Advice: check unfamiliar language features carefully
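  A hypothetical illustration of the send/recv type mismatch mentioned above (none of this code is from the slides; run with at least two processes):

      #include <mpi.h>
      #include <stdio.h>

      int main(int argc, char **argv) {
        int rank;
        double values[2] = { 1.0, 2.0 };
        int    recvbuf[4];
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
          /* The sender describes the message as 2 doubles (16 bytes) ... */
          MPI_Send(values, 2, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
          /* ... but the receiver declares MPI_INT: the type signatures disagree,
             so the program is erroneous even though it compiles and appears to run. */
          MPI_Recv(recvbuf, 4, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
          printf("first received value: %d\n", recvbuf[0]);  /* typically garbage */
        }
        MPI_Finalize();
        return 0;
      }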

  6. Defect 2: Space Decomposition
  • Example: Game of Life
  [Diagram: the xsize × ysize board is divided into horizontal strips of roughly ysize/np rows per processor; neighboring strips exchange boundary rows via send/recv]

  Serial main loop:

      /* Main loop */
      /* ... */
      for (y = 0; y < ysize; y++) {
        for (x = 0; x < xsize; x++) {
          c = count(buffer, xsize, ysize, x, y);
          /* update buffer ... */
        }
      }

  Parallel version (defective):

      MPI_Comm_size(MPI_COMM_WORLD, &np);
      ysize /= np;                    /* ysize may not be divisible by np */
      /* Main loop */
      /* ... */
      for (y = 0; y < ysize; y++) {   /* loop boundaries must be changed
                                         (there are other approaches too) */
        for (x = 0; x < xsize; x++) {
          c = count(buffer, xsize, ysize, x, y);
          /* update buffer ... */
        }
      }
      /* MPI_Send, MPI_Recv */
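  One possible fix for both annotations, sketched with the slide's variables (the decomposition scheme itself is an assumption): give the first ysize % np processors one extra row and loop over the local row count.

      /* Sketch: block decomposition that tolerates ysize % np != 0.
         buffer, xsize, ysize, count() and c are reused from the slide's example. */
      int np, rank;
      MPI_Comm_size(MPI_COMM_WORLD, &np);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      int base        = ysize / np;                      /* rows every processor gets */
      int extra       = ysize % np;                      /* leftover rows             */
      int local_ysize = base + (rank < extra ? 1 : 0);   /* my row count              */
      int y_offset    = rank * base + (rank < extra ? rank : extra);
      /* y_offset maps local row y to global row y_offset + y when global data is needed */

      for (int y = 0; y < local_ysize; y++) {            /* loop over local rows only */
        for (int x = 0; x < xsize; x++) {
          c = count(buffer, xsize, local_ysize, x, y);
          /* update buffer ... */
        }
      }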

  7. Defect 2: Space Decomposition
  • Incorrect mapping between the problem space and the program memory space
  • The mapping in the parallel version can be different from that in the serial version
  • The array origin is different in every processor
  • Additional memory space for communication can complicate the mapping logic
  • Symptoms: segmentation fault (if an array index is out of range), incorrect or slightly incorrect output
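  A tiny sketch of the mapping those bullets describe, assuming (as in the Game of Life example) that each processor stores its block as local rows 1..local_ysize, with rows 0 and local_ysize+1 reserved as extra space for communication:

      /* y_offset = global row number of this processor's first owned row. */
      static int to_local(int global_y, int y_offset)  { return global_y - y_offset + 1; }
      static int to_global(int local_y, int y_offset)  { return y_offset + local_y - 1; }
      /* Local rows 0 and local_ysize+1 are filled by MPI_Recv; forgetting the +1/-1
         shift is exactly the off-by-one that produces the segmentation faults or
         slightly incorrect output listed above. */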

  8. Defect 3: Side-Effects of Parallelization
  • Example: approximation of pi

  Serial version:

      srand(time(NULL));
      for (i = 0; i < n; i++) {
        double x = rand() / (double)RAND_MAX;
        double y = rand() / (double)RAND_MAX;
        if (x*x + y*y < 1) ++k;
      }
      return k / (double)n;

  Parallel version (defective):

      int np;
      status = MPI_Comm_size(MPI_COMM_WORLD, &np);
      ...
      srand(time(NULL));      /* 1: all processors might use the same pseudo-random
                                 sequence, spoiling independence */
      for (i = 0; i < n; i += np) {
        double x = rand() / (double)RAND_MAX;   /* 2: hidden serialization in rand()
                                                   causes a performance bottleneck */
        double y = rand() / (double)RAND_MAX;
        if (x*x + y*y < 1) ++k;
      }
      status = MPI_Reduce( ... MPI_SUM ... );
      ...
      return k / (double)n;
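  One common way to address annotation 1, sketched as an assumption rather than the slide's own fix: derive the seed from the rank so each processor draws a different sequence (a dedicated parallel random number generator is the more rigorous option).

      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      /* Mix the rank into the seed so processors do not share one sequence. */
      srand((unsigned int)time(NULL) ^ (unsigned int)(rank + 1) * 2654435761u);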

  9. Defect 3: Side-Effects of Parallelization
  • Example: file I/O

      FILE *fp = fopen(...);
      if (fp != NULL) {
        while (...) {
          fscanf(...);
        }
        fclose(fp);
      }

  • Note: the filesystem may become a performance bottleneck if all processors access the same file simultaneously (schedule the I/O carefully, or let a "master" processor do all the I/O)
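  A sketch of the "let a master processor do all the I/O" option from the note; the file name, element count, and data layout are hypothetical.

      #define NVALS 1024                          /* hypothetical element count */
      double data[NVALS];
      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      if (rank == 0) {
        /* Only rank 0 touches the filesystem ... */
        FILE *fp = fopen("input.dat", "r");       /* hypothetical file name */
        if (fp != NULL) {
          for (int i = 0; i < NVALS; i++) fscanf(fp, "%lf", &data[i]);
          fclose(fp);
        }
      }
      /* ... then one collective call distributes the data to every processor. */
      MPI_Bcast(data, NVALS, MPI_DOUBLE, 0, MPI_COMM_WORLD);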

  10. Defect 3: Side-Effects of Parallelization
  • Typical parallel programs contain only a few parallel primitives; the rest of the code is a sequential program running in parallel
  • Ordinary serial constructs can cause correctness or performance defects when they are executed in parallel contexts
  • Advice: don't focus only on the parallel code
  • Check that the serial code works on one processor, but remember that the defect may surface only in a parallel context

  11. Defect 4: Performance
  • Example: load balancing

      myN = N / (numProc - 1);
      if (myRank != 0) {
        for (i = 0; i < myN; i++) {
          if (...) { ++myHits; }
        }
      }
      MPI_Reduce(&myHits, &totalHits, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

  • Note: the rank 0 "master" processor is just waiting while the other "worker" processors execute the loop
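  A minimal rebalancing sketch with the slide's variable names, assuming (as in the pi example) that the iterations are independent: every rank, including rank 0, takes a share of the N iterations, so no processor sits idle.

      /* Divide N across all numProc ranks; the first N % numProc ranks take one extra. */
      myN = N / numProc + (myRank < N % numProc ? 1 : 0);
      for (i = 0; i < myN; i++) {
        if (...) { ++myHits; }                   /* same per-iteration test as above */
      }
      MPI_Reduce(&myHits, &totalHits, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);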

  12. Defect 4: Performance
  • Example: scheduling
  [Diagram: ranks #0 … #(size-1) each hold rows y1..y2 of the board (width xsize) and exchange boundary rows with their neighbors via send/recv]

      if (rank != 0) {
        MPI_Send(board[y1],   Xsize, MPI_CHAR, rank-1, tag, MPI_COMM_WORLD);
        MPI_Recv(board[y1-1], Xsize, MPI_CHAR, rank-1, tag, MPI_COMM_WORLD, &status);
      }
      if (rank != (size-1)) {
        MPI_Recv(board[y2+1], Xsize, MPI_CHAR, rank+1, tag, MPI_COMM_WORLD, &status);
        MPI_Send(board[y2],   Xsize, MPI_CHAR, rank+1, tag, MPI_COMM_WORLD);
      }

  • Note: this communication requires O(size) time, while a "correct" solution takes O(1); the exchanges serialize into a chain:
      #1 Send → #0 Recv → #0 Send → #1 Recv
      #2 Send → #1 Recv → #1 Send → #2 Recv
      #3 Send → #2 Recv → #2 Send → #3 Recv
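  One way to make the exchange constant-time, sketched with the slide's variables (this corresponds to fix (1) on the later synchronization slide): alternate the send/recv order by rank parity so neighboring ranks pair up instead of forming a chain.

      if (rank % 2 == 0) {      /* even ranks send first, then receive */
        if (rank != 0)      MPI_Send(board[y1],   Xsize, MPI_CHAR, rank-1, tag, MPI_COMM_WORLD);
        if (rank != size-1) MPI_Send(board[y2],   Xsize, MPI_CHAR, rank+1, tag, MPI_COMM_WORLD);
        if (rank != 0)      MPI_Recv(board[y1-1], Xsize, MPI_CHAR, rank-1, tag, MPI_COMM_WORLD, &status);
        if (rank != size-1) MPI_Recv(board[y2+1], Xsize, MPI_CHAR, rank+1, tag, MPI_COMM_WORLD, &status);
      } else {                  /* odd ranks receive first, then send */
        if (rank != size-1) MPI_Recv(board[y2+1], Xsize, MPI_CHAR, rank+1, tag, MPI_COMM_WORLD, &status);
        if (rank != 0)      MPI_Recv(board[y1-1], Xsize, MPI_CHAR, rank-1, tag, MPI_COMM_WORLD, &status);
        if (rank != size-1) MPI_Send(board[y2],   Xsize, MPI_CHAR, rank+1, tag, MPI_COMM_WORLD);
        if (rank != 0)      MPI_Send(board[y1],   Xsize, MPI_CHAR, rank-1, tag, MPI_COMM_WORLD);
      }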

  13. Defect 4: Performance
  • Scalability problem: the processors are not working in parallel, even though the program output itself is correct
  • Perfect parallelization is often difficult; you need to evaluate whether the execution speed is acceptable
  • Symptoms: sub-linear scalability, performance much lower than expected (e.g., most of the time spent waiting), unbalanced amounts of computation
  • Load balance may depend on the input data
  • Advice: make sure all processors are "working" in parallel; a profiling tool might help

  14. Defect 5: Synchronization
  • Example: deadlock (an obvious one; you can't avoid noticing it)
  [Diagram: same strip decomposition as before; each rank exchanges boundary rows y1/y2 with its neighbors via send/recv]

      MPI_Recv(board[y2+1], Xsize, MPI_CHAR, rank+1, tag, MPI_COMM_WORLD, &status);
      MPI_Send(board[y1],   Xsize, MPI_CHAR, rank-1, tag, MPI_COMM_WORLD);
      MPI_Recv(board[y1-1], Xsize, MPI_CHAR, rank-1, tag, MPI_COMM_WORLD, &status);
      MPI_Send(board[y2],   Xsize, MPI_CHAR, rank+1, tag, MPI_COMM_WORLD);

  • Note: every processor blocks in its first MPI_Recv, so nobody ever sends:
      #0 Recv → deadlock
      #1 Recv → deadlock
      #2 Recv → deadlock

  15. Defect 5: Synchronization
  • Example: deadlock (a subtler one)

      MPI_Send(board[y1],   Xsize, MPI_CHAR, rank-1, tag, MPI_COMM_WORLD);
      MPI_Recv(board[y2+1], Xsize, MPI_CHAR, rank+1, tag, MPI_COMM_WORLD, &status);
      MPI_Send(board[y2],   Xsize, MPI_CHAR, rank+1, tag, MPI_COMM_WORLD);
      MPI_Recv(board[y1-1], Xsize, MPI_CHAR, rank-1, tag, MPI_COMM_WORLD, &status);

  • Note: this may work, but it can cause deadlock with some implementations and parameters:
      #0 Send → deadlock if MPI_Send is blocking
      #1 Send → deadlock if MPI_Send is blocking
      #2 Send → deadlock if MPI_Send is blocking
  • A "correct" solution could (1) alternate the order of send and recv, (2) use MPI_Bsend with sufficient buffer size, (3) use MPI_Sendrecv, or (4) use MPI_Isend/recv (see http://www.mpi-forum.org/docs/mpi-11-html/node41.html); a sketch of option (3) follows
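  A sketch of option (3), MPI_Sendrecv, with the slide's variables; handling the boundary ranks with MPI_PROC_NULL is an addition here (the earlier slides used if-guards for that).

      int prev = (rank == 0)        ? MPI_PROC_NULL : rank - 1;   /* neighbor holding rows below y1 */
      int next = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;   /* neighbor holding rows above y2 */

      /* Send board[y1] to rank-1 while receiving board[y2+1] from rank+1. */
      MPI_Sendrecv(board[y1],   Xsize, MPI_CHAR, prev, tag,
                   board[y2+1], Xsize, MPI_CHAR, next, tag,
                   MPI_COMM_WORLD, &status);
      /* Send board[y2] to rank+1 while receiving board[y1-1] from rank-1. */
      MPI_Sendrecv(board[y2],   Xsize, MPI_CHAR, next, tag,
                   board[y1-1], Xsize, MPI_CHAR, prev, tag,
                   MPI_COMM_WORLD, &status);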

  16. Defect 5: Synchronization
  • Example: barriers

      for (...) {
        MPI_Isend(board[y1],   Xsize, MPI_CHAR, rank-1, tag, MPI_COMM_WORLD, &request);
        MPI_Recv (board[y2+1], Xsize, MPI_CHAR, rank+1, tag, MPI_COMM_WORLD, &status);
        MPI_Isend(board[y2],   Xsize, MPI_CHAR, rank+1, tag, MPI_COMM_WORLD, &request);
        MPI_Recv (board[y1-1], Xsize, MPI_CHAR, rank-1, tag, MPI_COMM_WORLD, &status);
      }

  • Note: synchronization (e.g. MPI_Barrier) is needed at each iteration, but too many barriers can cause a performance problem
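  One hedged way to add the per-iteration synchronization the note asks for; completing the MPI_Isend requests with MPI_Waitall is an addition beyond what the slide shows, but non-blocking sends must be completed before their buffers are reused.

      MPI_Request reqs[2];
      MPI_Status  stats[2];
      for (...) {
        MPI_Isend(board[y1],   Xsize, MPI_CHAR, rank-1, tag, MPI_COMM_WORLD, &reqs[0]);
        MPI_Recv (board[y2+1], Xsize, MPI_CHAR, rank+1, tag, MPI_COMM_WORLD, &status);
        MPI_Isend(board[y2],   Xsize, MPI_CHAR, rank+1, tag, MPI_COMM_WORLD, &reqs[1]);
        MPI_Recv (board[y1-1], Xsize, MPI_CHAR, rank-1, tag, MPI_COMM_WORLD, &status);

        /* Complete the non-blocking sends before the next iteration reuses board. */
        MPI_Waitall(2, reqs, stats);
        /* Per the note above: an explicit barrier also keeps the iterations in step,
           at the cost of extra synchronization overhead. */
        MPI_Barrier(MPI_COMM_WORLD);
      }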

  17. Defect 5: Synchronization
  • Well-known defect types in parallel programming: races and deadlocks; some defects can be very subtle
  • Use of asynchronous (non-blocking) communication can lead to more synchronization defects
  • Symptoms: the program hangs, incorrect or non-deterministic output
  • This particular example stems from an insufficient understanding of the language specification
  • Advice: make sure that all communications are correctly coordinated

  18. Summary
  • This is a first cut at understanding common defects in parallel programming
