
Performance Oriented MPI

Jeffrey M. Squyres and Andrew Lumsdaine, NERSC/LBNL and U. Notre Dame.

Presentation Transcript


  1. Performance Oriented MPI Jeffrey M. Squyres Andrew Lumsdaine NERSC/LBNL and U. Notre Dame

  2. Overview • Overview and History of MPI • Performance Oriented Point to Point • Collectives, Data Types • Diagnostics and Tuning • Rules of Thumb and Gotchas

  3. Scope of This Talk • Beginning to intermediate user • General principles and rules of thumb • When and where performance might be available • Omit (advanced) low-level issues

  4. Overview and History of MPI • Library (not language) specification • Goals • Portability • Efficiency • Functionality (small and large) • Safety (communicators) • Conservative (current best practices)

  5. Performance in MPI • MPI includes many performance-oriented features • These features are only potentially high-performance • The standard seeks not to preclude performance, but it does not mandate it • Progress might only be made during MPI function calls

  6. (Potential) Performance Features • Non-blocking operations • Persistent operations • Collective operations • MPI Datatypes

  7. Basic Point to Point • “Six function MPI” includes • MPI_Send() • MPI_Recv() • These are useful, but there is more

  8. Basic Point to Point
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      if (rank == 0) {
          MPI_Send(&work, 1, MPI_INT, dest, TAG, MPI_COMM_WORLD);
      } else {
          MPI_Recv(&result, 1, MPI_INT, src, TAG, MPI_COMM_WORLD, &status);
      }

  9. Non-Blocking Operations • MPI_Isend() • MPI_Irecv() • “I” is for immediate • Paired with MPI_Test()/MPI_Wait()

  10. Non-Blocking Operations
      MPI_Comm_rank(comm, &rank);
      if (rank == 0) {
          MPI_Isend(sendbuf, count, MPI_REAL, 1, tag, comm, &request);
          /* Do some computation */
          MPI_Wait(&request, &status);
      } else {
          MPI_Irecv(recvbuf, count, MPI_REAL, 0, tag, comm, &request);
          /* Do some computation */
          MPI_Wait(&request, &status);
      }

  11. Persistent Operations • MPI_Send_init() • MPI_Recv_init() • Each creates a request but does not start it • MPI_Start() begins the communication • A single request can be re-used with multiple calls to MPI_Start()

  12. Persistent Operations
      MPI_Comm_rank(comm, &rank);
      if (rank == 0)
          MPI_Send_init(sndbuf, count, MPI_REAL, 1, tag, comm, &request);
      else
          MPI_Recv_init(rcvbuf, count, MPI_REAL, 0, tag, comm, &request);
      /* … */
      for (i = 0; i < n; i++) {
          MPI_Start(&request);
          /* Do some work */
          MPI_Wait(&request, &status);
      }
      MPI_Request_free(&request); /* Release the persistent request when done */

  13. Collective Operations • May be layered on point to point • May use tree communication patterns for efficiency • Synchronization! (No non-blocking collectives in MPI-1 and MPI-2; MPI-3 later added them, e.g., MPI_Ibarrier())

  14. Collective Operations
      MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, comm);
      [Figure: two communication patterns, labeled O(P) and O(log P)]
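
The O(P) and O(log P) labels on the slide above presumably contrast a root that receives from every rank in turn with a tree-shaped reduction. The following is a minimal sketch of that difference in plain C, without MPI; the function names (`linear_rounds`, `tree_rounds`, `tree_sum`) are hypothetical, and the array slots merely stand in for ranks. It is not how any particular MPI library implements MPI_Reduce().

```c
#include <assert.h>
#include <stddef.h>

/* Rounds needed when the root receives from every other rank in turn. */
int linear_rounds(int nprocs)
{
    assert(nprocs > 0);
    return nprocs - 1;
}

/* Rounds needed by a binomial tree: ceil(log2(nprocs)). */
int tree_rounds(int nprocs)
{
    int rounds = 0;
    assert(nprocs > 0);
    for (int span = 1; span < nprocs; span *= 2)
        rounds++;
    return rounds;
}

/*
 * Simulate a tree sum over vals[0..nprocs-1], one slot per "rank".
 * In each round, rank r (a multiple of 2 * stride) adds in the partial
 * sum held by rank r + stride. The total ends up in vals[0], which
 * plays the role of the root. The array is modified in place.
 */
double tree_sum(double *vals, int nprocs)
{
    assert(vals != NULL && nprocs > 0);
    for (int stride = 1; stride < nprocs; stride *= 2)
        for (int r = 0; r + stride < nprocs; r += 2 * stride)
            vals[r] += vals[r + stride];
    return vals[0];
}
```

For 1024 ranks this gives 1023 rounds for the linear pattern and 10 for the tree, since all additions within one tree round could proceed in parallel on different ranks.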

  15. MPI Datatypes • May allow MPI to send a message directly from memory • May avoid copying/packing • (General) high performance implementations not widely available [Figure labels: network, copy]

  16. Quiz: MPI_Send() • After I call MPI_Send() • The recipient has received the message • I have sent the message • I can write to the message buffer without corrupting the message • I can write to the message buffer

  17. Sidenote: MPI_Ssend() • MPI_Ssend() has the (perhaps) expected semantics • When MPI_Ssend() returns, the matching receive has been posted and the recipient has started to receive the message • Useful for debugging (replace MPI_Send() with MPI_Ssend())

  18. Quiz: MPI_Isend() • After I call MPI_Isend() • The recipient has started to receive the message • I have started to send the message • I can write to the message buffer without corrupting the message • None of the above (I must call MPI_Test() or MPI_Wait())

  19. Quiz: MPI_Isend() • True or False • I can overlap communication and computation by putting some computation between MPI_Isend() and MPI_Test()/MPI_Wait() • False (in many/most cases)

  20. Communication is Still Computation • A CPU, usually the main one, must do the communication work • Part of your process (inside MPI calls) • Another process on main CPU • Another thread on main CPU • Another processor

  21. No Free Lunch • Part of your process (most common) • Fast but no overlap • Another process (daemons) • Overlap, but slow (extra copies) • Another thread (rare) • Overlap and fast, but difficult • Another processor (emerging) • Overlap and fast, but more hardware • E.g., Myri/gm, VIA

  22. How Do I Get Performance? • Minimize time spent communicating • Minimize data copies • Minimize synchronization • I.e., time waiting for communication

  23. Minimizing Communication Time • Bandwidth • Latency

  24. Minimizing Latency • Collect small messages together (if you can) • One 1024-byte message instead of 1024 one-byte messages • Minimize other overhead (e.g., copying) • Overlap with computation (if you can)
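
The advice to collect small messages can be made concrete with a simple cost model in which every message pays a fixed latency plus a per-byte transfer cost. This is a sketch under assumed numbers, not a measurement: the function name `transfer_time` and the latency and bandwidth values in the note below are hypothetical.

```c
#include <assert.h>

/* Hypothetical cost model: time = messages * latency + bytes / bandwidth. */
double transfer_time(long messages, long bytes_per_message,
                     double latency_s, double bandwidth_bytes_per_s)
{
    assert(messages >= 0 && bytes_per_message >= 0);
    assert(latency_s >= 0.0 && bandwidth_bytes_per_s > 0.0);
    double total_bytes = (double)messages * (double)bytes_per_message;
    return (double)messages * latency_s + total_bytes / bandwidth_bytes_per_s;
}
```

With an assumed latency of 10 microseconds and an assumed bandwidth of 100 MB/s, `transfer_time(1024, 1, 1e-5, 1e8)` is about 10.25 ms, while `transfer_time(1, 1024, 1e-5, 1e8)` is about 20 microseconds. The bytes moved are identical; the difference is almost entirely the per-message latency paid 1024 times.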

  25. Example: Domain Decomposition
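
The slides that follow call dored() and doblack() without defining them. Assuming these are the two half-sweeps of a red-black relaxation on a 2D grid (an assumption; the transcript does not say), the sketch below shows what one such half-sweep could look like on a process's local block. The name `sweep`, the grid size `N`, and the ghost-cell layout are all hypothetical.

```c
#include <assert.h>

#define N 6  /* Local grid edge, including one layer of ghost cells (assumed) */

/*
 * Update every interior cell of one colour with the average of its
 * four neighbours. colour 0 = "red" cells ((i + j) even),
 * colour 1 = "black" cells ((i + j) odd).
 */
void sweep(double grid[N][N], int colour)
{
    assert(colour == 0 || colour == 1);
    for (int i = 1; i < N - 1; i++)
        for (int j = 1; j < N - 1; j++)
            if ((i + j) % 2 == colour)
                grid[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j] +
                                     grid[i][j - 1] + grid[i][j + 1]);
}
```

Every neighbour of a red cell is black and vice versa, so each half-sweep reads only values of the other colour. Under this reading, that is why the loop on the following slides exchanges boundary data with the four neighbours before each colour's update: the ghost cells must hold the neighbours' latest values of the opposite colour.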

  26. Naïve Approach
      while (!done) {
          exchange(D, neighbors, myrank);
          dored(D);
          exchange(D, neighbors, myrank);
          doblack(D);
      }
      void exchange(Array D, int *neighbors, int myrank) {
          for (i = 0; i < 4; i++)
              MPI_Send(…);
          for (i = 0; i < 4; i++)
              MPI_Recv(…);
      }

  27. Naïve Approach • Deadlock! (Maybe) • Can fix with careful coordination of receiving versus sending on alternate processes • But this can still serialize

  28. MPI_Sendrecv()
      while (!done) {
          exchange(D, neighbors, myrank);
          dored(D);
          exchange(D, neighbors, myrank);
          doblack(D);
      }
      void exchange(Array D, int *neighbors, int myrank) {
          for (i = 0; i < 4; i++) {
              MPI_Sendrecv(…);
          }
      }

  29. Immediate Operations
      while (!done) {
          exchange(D, neighbors, myrank);
          dored(D);
          exchange(D, neighbors, myrank);
          doblack(D);
      }
      void exchange(Array D, int *neighbors, int myrank) {
          for (i = 0; i < 4; i++) {
              MPI_Isend(…);
              MPI_Irecv(…);
          }
          MPI_Waitall(…);
      }

  30. Receive Before Sending
      while (!done) {
          exchange(D, neighbors, myrank);
          dored(D);
          exchange(D, neighbors, myrank);
          doblack(D);
      }
      void exchange(Array D, int *neighbors, int myrank) {
          for (i = 0; i < 4; i++)
              MPI_Irecv(…);
          for (i = 0; i < 4; i++)
              MPI_Isend(…);
          MPI_Waitall(…);
      }

  31. Persistent Operations
      for (i = 0; i < 4; i++) {
          MPI_Recv_init(…);
          MPI_Send_init(…);
      }
      while (!done) {
          exchange(D, neighbors, myrank);
          dored(D);
          exchange(D, neighbors, myrank);
          doblack(D);
      }
      void exchange(Array D, int *neighbors, int myrank) {
          MPI_Startall(…);
          MPI_Waitall(…);
      }

  32. Overlapping
      while (!done) {
          MPI_Startall(…);            /* Start exchanges */
          do_inner_red(D);            /* Internal computation */
          for (i = 0; i < 4; i++) {
              MPI_Waitany(…);         /* As information arrives */
              do_received_red(D);     /* Process */
          }
          MPI_Startall(…);
          do_inner_black(D);
          for (i = 0; i < 4; i++) {
              MPI_Waitany(…);
              do_received_black(D);
          }
      }

  33. Advanced Overlap
      MPI_Startall(…);                /* Start all receives */
      /* … */
      while (!done) {
          MPI_Startall(…);            /* Start sends */
          do_inner_red(D);            /* Internal computation */
          for (i = 0; i < 4; i++) {
              MPI_Waitany(…);         /* Wait on receives */
              if (received) {
                  do_received_red(D); /* Process */
                  MPI_Start(…);       /* Restart receive */
              }
          }
          /* Repeat for black */
      }

  34. MPI Data Types • MPI_Type_vector • MPI_Type_struct • Etc. • MPI_Pack might be better [Figure labels: network, copy]
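
To show what "copying/packing" means for strided data, here is a minimal sketch in plain C of the manual gather that an application performs when it sends, say, one column of a row-major matrix as a contiguous buffer. The helper names `pack_strided` and `unpack_strided` are hypothetical. A call such as MPI_Type_vector(count, 1, stride, MPI_DOUBLE, &newtype) describes the same layout to the library, which may or may not avoid this copy internally.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Copy `count` elements spaced `stride` apart into a contiguous buffer.
 * This is the hand-written packing step that a strided datatype
 * description may let the library skip.
 */
void pack_strided(const double *src, size_t count, size_t stride, double *dst)
{
    assert(src != NULL && dst != NULL && stride > 0);
    for (size_t k = 0; k < count; k++)
        dst[k] = src[k * stride];
}

/* Inverse operation on the receiving side: scatter a contiguous buffer back. */
void unpack_strided(const double *src, size_t count, size_t stride, double *dst)
{
    assert(src != NULL && dst != NULL && stride > 0);
    for (size_t k = 0; k < count; k++)
        dst[k * stride] = src[k];
}
```

For a 3 x 4 row-major matrix `m`, `pack_strided(&m[0][1], 3, 4, col)` gathers column 1 into `col`. Whether a derived datatype or explicit packing is faster depends on the implementation, which is the caveat the slide raises.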

  35. Minimizing Synchronization • At synchronization point (e.g., with collective communication) all processes must arrive at collective call • Can spend lots of time waiting • This is often an algorithmic issue • E.g., check for convergence every 5 iterations instead of every iteration
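
The trade-off behind checking for convergence less often can be sketched with a toy iteration in plain C. Here the error simply halves on every step, and the convergence test stands in for a synchronizing collective call; the names `run_stats` and `iterate` and the halving model are hypothetical.

```c
#include <assert.h>

/* Result of a run: local iterations performed and global checks made. */
struct run_stats {
    int iterations;
    int checks;
};

/*
 * Toy iteration: the error halves each step. The convergence test
 * stands in for a collective call and is made only every
 * `check_every` iterations.
 */
struct run_stats iterate(double error, double tol, int check_every)
{
    struct run_stats s = {0, 0};
    assert(error > 0.0 && tol > 0.0 && check_every > 0);
    for (;;) {
        error *= 0.5;               /* One local update */
        s.iterations++;
        if (s.iterations % check_every == 0) {
            s.checks++;             /* Where the synchronizing collective would go */
            if (error < tol)
                break;
        }
    }
    return s;
}
```

Starting from an error of 1.0 with a tolerance of 1e-3, checking every iteration takes 10 iterations and 10 checks, while checking every 5 iterations takes the same 10 iterations with only 2 checks. Checking every 4 iterations takes 12 iterations and 3 checks, which shows the cost: up to `check_every - 1` extra iterations in exchange for fewer synchronization points.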

  36. Gotchas • MPI_Probe • Guarantees extra memory copy • MPI_ANY_SOURCE • Can cause additional (internal) looping • MPI_Alltoall • All pairs must communicate • Synchronization (avoid in general)

  37. Diagnostic Tools • TotalView • Prism • Upshot • XMPI

  38. Summary • Receive before sending • Collect small messages together • Overlap (if possible) • Use immediate operations • Use persistent operations • Use diagnostic tools
