MPI Program Performance: Self Test with Solutions
Matching

Match each term with the definition that best describes it.

Terms
1. Amdahl's Law
2. Profiles
3. Relative efficiency
4. Load imbalances
5. Timers
6. Asymptotic analysis
7. Execution time
8. Cache effects
9. Event traces
10. Absolute speedup

Definitions
A. The time elapsed from when the first processor starts executing a problem to when the last processor completes execution.
B. T1/(P*Tp), where T1 is the execution time on one processor and Tp is the execution time on P processors.
C. The execution time on one processor of the fastest sequential program divided by the execution time on P processors.
D. When the sequential component of an algorithm accounts for 1/s of the program's execution time, the maximum possible speedup that can be achieved on a parallel computer is s.
E. Characterizing performance in a large limit.
F. When an algorithm suffers from computation or communication imbalances among processors.
G. When the fast memory on a processor gets used more often in a parallel implementation, causing an unexpected decrease in the computation time.
H. A performance tool that shows the amount of time a program spends on different program components.
I. A performance tool that determines the length of time spent executing a particular piece of code.
J. The most detailed performance tool; it generates a file that records the significant events in the running of a program.
Answer

1-D, 2-H, 3-B, 4-F, 5-I, 6-E, 7-A, 8-G, 9-J, 10-C
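For reference, the definitions matched above can be written compactly as formulas. The block below is a summary in LaTeX notation using the symbols from the definitions (T_1 for the one-processor time, T_P for the time on P processors, and a serial fraction 1/s); the numbers at the end are an illustrative example, not part of the original exercise.

\[
S_{\text{rel}} = \frac{T_1}{T_P}, \qquad
E_{\text{rel}} = \frac{T_1}{P\,T_P}, \qquad
S_{\text{abs}} = \frac{T_{\text{best serial}}}{T_P}, \qquad
S \le s \quad \text{(Amdahl's Law, serial fraction } 1/s\text{)}
\]

For example, if T_1 = 100 s and T_P = 30 s on P = 4 processors, then S_rel = 100/30, roughly 3.3, and E_rel = 100/(4*30), roughly 0.83; and if one tenth of the work is inherently serial (1/s = 0.1), Amdahl's Law caps the achievable speedup at s = 10 no matter how many processors are used.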
Self Test

1. Which of the following is not a performance metric?
   a) speedup
   b) efficiency
   c) problem size
Answer

a) Incorrect. No, speedup is a performance metric. Relative speedup is defined as T1/Tp, where T1 is the execution time on one processor and Tp is the execution time on P processors. Absolute speedup is obtained by replacing the execution time on one processor with the execution time of the fastest sequential algorithm.
b) Incorrect. No, efficiency is a performance metric. Relative efficiency is defined as T1/(P*Tp), where T1 is the execution time on one processor and Tp is the execution time on P processors. Absolute efficiency is obtained by replacing the execution time on one processor with the execution time of the fastest sequential algorithm.
c) Correct. That's correct! Problem size is a factor that affects a program's execution time, but it is not a metric for analyzing performance.
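To connect these definitions to something measurable, here is a minimal sketch (not part of the original self test) that times a parallel section with MPI_Wtime and reports relative speedup and efficiency. The function do_work and the one-processor time t1 are placeholders that the reader would supply.

#include <stdio.h>
#include <mpi.h>

/* Placeholder for the computation being measured. */
static void do_work(void) { /* ... */ }

int main(int argc, char **argv)
{
    int rank, size;
    double t_start, t_p, t_max;
    double t1 = 100.0;   /* assumed one-processor time T1, measured separately */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Barrier(MPI_COMM_WORLD);       /* start all processes together */
    t_start = MPI_Wtime();
    do_work();
    t_p = MPI_Wtime() - t_start;       /* this process's elapsed time */

    /* Execution time runs from the first start to the last finish,
       so take the maximum over all processes. */
    MPI_Reduce(&t_p, &t_max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("Tp = %g s, relative speedup = %g, relative efficiency = %g\n",
               t_max, t1 / t_max, t1 / (size * t_max));

    MPI_Finalize();
    return 0;
}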
Self Test

2. A good question to ask in scalability analysis is:
   a) How can one overlap computation and communications tasks in an efficient manner?
   b) How can a single performance measure give an accurate picture of an algorithm's overall performance?
   c) How does efficiency vary with increasing problem size?
   d) In what parameter regime can I apply Amdahl's law?
Answer

a) Incorrect. Sorry, that's not correct.
b) Incorrect. Sorry, that's not correct.
c) Correct. That's correct!
d) Incorrect. Sorry, that's not correct.
Self Test

3. If an implementation has unaccounted-for overhead, a possible reason is:
   a) an algorithm may suffer from computation or communication imbalances among processors.
   b) the cache, or fast memory, on a processor may get used more often in a parallel implementation, causing an unexpected decrease in the computation time.
   c) you failed to employ a domain decomposition.
   d) there is not enough communication between processors.
Answer

a) Correct. That's correct!
b) Incorrect. Sorry, that's not correct.
c) Incorrect. Sorry, that's not correct.
d) Incorrect. Sorry, that's not correct.
Self Test

4. Which one of the following is not a data collection technique used to gather performance data?
   a) counters
   b) profiles
   c) abstraction
   d) event traces
Answer

a) Incorrect. No, counters are data collection subroutines which increment whenever a specified event occurs.
b) Incorrect. No, profiles show the amount of time a program spends on different program components.
c) Correct. That's correct. Abstraction is not a data collection technique. A good performance tool allows data to be examined at a level of abstraction appropriate for the programming model of the parallel program.
d) Incorrect. No, event traces contain the most detailed program performance information. A trace-based system generates a file that records the significant events in the running of a program.
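To make these data collection techniques concrete, here is a small hand-rolled sketch (an illustration only, not one of the tools the course refers to): a counter that increments on every send, plus an accumulated timer, which together give a crude profile of how much time a run spends communicating. The wrapper name counted_send is an invented example.

#include <stdio.h>
#include <mpi.h>

static long   send_count = 0;    /* counter: increments whenever a send occurs */
static double send_time  = 0.0;  /* timer: total time spent inside MPI_Send    */

/* Wrapper used in place of MPI_Send wherever we want to collect data. */
static int counted_send(void *buf, int count, MPI_Datatype type, int dest,
                        int tag, MPI_Comm comm)
{
    double t = MPI_Wtime();
    int err = MPI_Send(buf, count, type, dest, tag, comm);
    send_time += MPI_Wtime() - t;
    send_count++;
    return err;
}

int main(int argc, char **argv)
{
    int rank, size, msg = 42;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double t0 = MPI_Wtime();
    if (size >= 2) {                 /* one counted message from P0 to P1 */
        if (rank == 0)
            counted_send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
    }
    double total = MPI_Wtime() - t0;

    /* A one-line "profile" of communication cost per process. */
    printf("P%d: %ld sends, %g s in MPI_Send out of %g s total\n",
           rank, send_count, send_time, total);

    MPI_Finalize();
    return 0;
}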
Course Problem

In this chapter, the broad subject of parallel code performance is discussed, both in terms of theoretical concepts and in terms of specific tools for measuring performance metrics that work on certain parallel machines. Put in its simplest terms, improving code performance boils down to speeding up your parallel code and/or improving how your code uses memory.
Course Problem

As you have learned new features of MPI in this course, you have also improved the performance of the code. Here is a list of the performance improvements so far:

• Using Derived Datatypes instead of sending and receiving the separate pieces of data
• Using Collective Communication routines instead of repeating/looping individual sends and receives (a small sketch of this substitution follows the list)
• Using a Virtual Topology and its utility routines to avoid extraneous calculations
• Changing the original master-slave algorithm so that the master also searches part of the global array (The slave rebellion: Spartacus!)
• Using "true" parallel I/O so that all processors write to the output file simultaneously instead of just one (the master)
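As a reminder of what the second item refers to, here is a minimal before-and-after sketch (illustrative only, not the actual course code) of replacing a root-process loop of individual sends with a single collective call:

#include <mpi.h>

/* Before: the root sends the target to every other process, one message at a time. */
void distribute_with_sends(int *target, int rank, int size)
{
    MPI_Status status;
    int dest;
    if (rank == 0) {
        for (dest = 1; dest < size; dest++)
            MPI_Send(target, 1, MPI_INT, dest, 0, MPI_COMM_WORLD);
    } else {
        MPI_Recv(target, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
    }
}

/* After: one collective call does the same job and lets the MPI library
   use a more efficient (for example, tree-based) communication pattern. */
void distribute_with_bcast(int *target)
{
    MPI_Bcast(target, 1, MPI_INT, 0, MPI_COMM_WORLD);
}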
Course Problem

But more remains to be done - especially in terms of how the program uses memory. That is the last exercise for this course. The problem description is the same as the one given in Chapter 9, but you will modify the code you wrote using what you learned in this chapter.
Course Problem

Description

The initial problem implements a parallel search of an extremely large (several thousand elements) integer array. The program finds all occurrences of a certain integer, called the target, and writes all the array indices where the target was found to an output file. In addition, the program reads both the target value and all the array elements from an input file.

Exercise

Modify your code from Chapter 9 so that it uses dynamic memory allocation: the program should use only the amount of memory it needs, and only for as long as it needs it. Make both arrays a and b dynamically allocated and connect them to memory properly. You may also assume that the input data file "b.data" now has on its first line the number of elements in the global array b. The second line now has the target value. The remaining lines are the contents of the global array b.
Solution

Note: The new sections of code are those in which the arrays a and b are declared as pointers, actually allocated, and deallocated when they are no longer needed. Further, note that only processor 0 is concerned with allocating and deallocating the global array b, while all processors dynamically create and destroy their individual subarrays a.
Solution

#include <stdio.h>
#include <stdlib.h>   /* for malloc and free */
#include <mpi.h>

int main(int argc, char **argv)
{
    int i, target;                 /* local variables */
    /* Arrays a and b have no memory assigned to them when they are declared */
    int *b = NULL, *a = NULL;      /* a is the name of the array each slave searches */
    int length = 0;
    int rank, size, err;
    MPI_Status status;
    int gi;                        /* global index written to the output file */
    FILE *sourceFile;
    /* Variables needed to prepare the file */
    int amode, intsize;
    MPI_Datatype etype, filetype;  /* MPI_File_set_view expects MPI datatypes here */
    MPI_Info info;
    MPI_File fh;
    MPI_Offset disp;
Solution

    err = MPI_Init(&argc, &argv);
    err = MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    err = MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size != 4) {
        printf("Error: You must use 4 processes to run this program.\n");
        MPI_Finalize();
        return 1;
    }

    /* Create the file and make it write only */
    amode = (MPI_MODE_CREATE | MPI_MODE_WRONLY);
    info = MPI_INFO_NULL;

    /* Name the file and open it on all processors */
    err = MPI_File_open(MPI_COMM_WORLD, "found.dat", amode, info, &fh);

    intsize = sizeof(int);         /* size in bytes of one integer in the file */
    disp = rank * intsize;
    etype = MPI_INT;
    filetype = MPI_INT;
    err = MPI_File_set_view(fh, disp, etype, filetype, "native", info);
    /* This and the preceding four lines prepare the "view" each processor has of the
       output file. This view tells where in the file each processor should put the
       target locations it finds. In our case, P0 will start putting data at the
       beginning of the file, P1 will start putting data one integer's length from
       the beginning of the file, and so on. */
Solution

    if (rank == 0) {
        /* File b1.data has the array length on its first line and the target on its second;
           the remaining lines hold the values of the b array */
        sourceFile = fopen("b1.data", "r");

        if (sourceFile == NULL) {
            printf("Error: can't access b1.data.\n");
            MPI_Abort(MPI_COMM_WORLD, 1);
        } else {
            /* Read in the length and the target */
            fscanf(sourceFile, "%d", &length);
            fscanf(sourceFile, "%d", &target);
        }
    }

    /* Notice the broadcasts are outside of the if; all processors must call them */
    err = MPI_Bcast(&target, 1, MPI_INT, 0, MPI_COMM_WORLD);
    err = MPI_Bcast(&length, 1, MPI_INT, 0, MPI_COMM_WORLD);
Solution

    if (rank == 0) {
        /* Only at this point is b connected to exactly the correct amount of memory */
        b = (int *)malloc(length * sizeof(int));

        /* Read in the b array */
        for (i = 0; i < length; i++) {
            fscanf(sourceFile, "%d", &b[i]);
        }
        fclose(sourceFile);
    }

    /* Only at this point is each processor's array a connected to a smaller amount of memory */
    a = (int *)malloc((length / size) * sizeof(int));

    /* Again, the scatter is after the if; all processors must call it */
    err = MPI_Scatter(b, length / size, MPI_INT, a, length / size, MPI_INT, 0, MPI_COMM_WORLD);

    /* Processor 0 no longer needs b */
    if (rank == 0) free(b);
Solution

    for (i = 0; i < length / size; i++) {
        if (a[i] == target) {
            /* Convert the local index into a 1-based global index. The stride is
               length/size (it was hard-coded as 75 when the array size was fixed). */
            gi = rank * (length / size) + i + 1;

            /* Each processor writes to the file */
            err = MPI_File_write(fh, &gi, 1, MPI_INT, &status);
        }
    }

    free(a);   /* All the processors are through with a */

    err = MPI_File_close(&fh);
    err = MPI_Finalize();
    return 0;
}
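One design note on the MPI_Scatter call in the solution: it assumes that length is evenly divisible by the number of processes, which holds for the course's data on 4 processes but is not guaranteed in general. The following is a hedged sketch of how the uneven case might be handled with MPI_Scatterv; the variable names sendcounts and displs are illustrative, and the rest of the program is assumed to be as above.

    /* Sketch: distributing b when length is not a multiple of size.
       sendcounts[p] elements go to process p, starting at offset displs[p] in b. */
    int *sendcounts = (int *)malloc(size * sizeof(int));
    int *displs     = (int *)malloc(size * sizeof(int));
    int p, offset = 0;

    for (p = 0; p < size; p++) {
        sendcounts[p] = length / size + (p < length % size ? 1 : 0);  /* spread the remainder */
        displs[p] = offset;
        offset += sendcounts[p];
    }

    a = (int *)malloc(sendcounts[rank] * sizeof(int));
    err = MPI_Scatterv(b, sendcounts, displs, MPI_INT,
                       a, sendcounts[rank], MPI_INT, 0, MPI_COMM_WORLD);

    free(sendcounts);
    free(displs);

With this version, the search loop would run to sendcounts[rank] instead of length/size, and the global index would be computed as displs[rank] + i + 1.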
Solution

The results obtained from running this code are in the file "found.dat". As before, it must be viewed with the octal dump (od) command. If you type

od -d found.dat

you get

0000000     62      0      0      0    183      0    271      0
0000020    291      0    296      0
0000030

which, as you can see, is exactly what you obtained in Chapter 9, where the arrays a and b were statically allocated.
Solution

Many improvements in the performance of any code are possible. This is left as your final exercise: come up with some other ideas and try them! Good luck and happy MPI programming!