
PAPI 3.0.8.1 on Blue Gene L



Presentation Transcript


  1. PAPI 3.0.8.1 on Blue Gene L: Using network performance counters to lay out tasks for improved performance

  2. Presentation overview • Project objectives • PAPI explanation • Blue Gene L explanation • Current state of research

  3. Project objectives • Upgrade PAPI on BG/L • Provide an interface for network counters • Give Lawrence Livermore National Lab users access to PAPI as well • Use network counters to place tasks optimally on BG/L

  4. PAPI – Intro Courtesy of http://icl.cs.utk.edu/papi/

  5. PAPI – Intro • PAPI is useful for profiling your own programs • Many tools are based on PAPI: • PapiEx – Command line measurement tool • PerfSuite – Aggregate measurement and statistical profiling package and API • HPCToolkit – Statistical profiling package • Many more!

  6. PAPI – Supported platforms • IBM – POWER3, 604, 604e, POWER4 • Cray T3E, Cray X1 • AMD – Athlon, Opteron • Intel – P1 to P4, Itanium I and II • UltraSparc I, II & III • MIPS R10K, R12K, R14K • Alpha

  7. PAPI – Generic Interface • Call sequence for generic interface • PAPI_library_init – Initialize memory for PAPI’s data structures • PAPI_create_eventset – Create an empty list of events • PAPI_add_event – Add events to be counted • PAPI_start – Begin counting all events within the specified eventset • PAPI_stop – Stop all counters and read their current values

  8. PAPI – Events: Presets • Presets – list of predefined events implemented on all systems where they can be supported • Not all presets available on every architecture (e.g. BG/L has no cache lower than L3 – thus L1 cache hit preset not applicable) • Native events form the basic building blocks for PAPI presets

  9. PAPI – Events: Presets Courtesy of http://icl.cs.utk.edu/papi/

  10. PAPI – Events: Native • In addition to the predefined PAPI preset events, the PAPI library also exposes a majority of the events native to each platform • Can be added to eventsets in the same manner as presets

  11. PAPI – Events: Native

  12. PAPI – Internals • Array of eventsets is the main portion

  13. PAPI – Other features • Multiplexing – Time-shares the hardware counters among events when there are more events than counters • Thread safe – Profiling is thread safe • Overflow detection – Hardware counters have limited space
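
To make the multiplexing bullet concrete: when events take turns on the hardware counters, each event is only observed for part of the run, so its count has to be scaled up to an estimate for the whole run. The sketch below shows that scaling step only. The function name and parameters are hypothetical, and this is an illustration of the general idea, not PAPI's internal code.

```python
def estimate_multiplexed_count(raw_count, active_time, total_time):
    """Scale a count seen during part of a run up to an estimate for the whole run.

    raw_count   -- events counted while this event owned a hardware counter
    active_time -- time the event was actually being counted
    total_time  -- total measured time of the run

    Assumption: the event rate is roughly uniform over the run, so the
    observed fraction is representative. The result is an estimate, not
    an exact count.
    """
    if active_time <= 0:
        raise ValueError("event was never scheduled on a counter")
    if total_time < active_time:
        raise ValueError("total_time must be at least active_time")
    return raw_count * total_time / active_time


# An event counted 250 times while active for a quarter of a 1.0 s run.
estimate = estimate_multiplexed_count(250, 0.25, 1.0)
print(estimate)  # 1000.0
```

The shorter the slice in which an event is observed, the less reliable the scaled value becomes, which is why multiplexed counts are best treated as approximate.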

  14. PAPI – PAPI2 vs PAPI3 • PAPI 3 significantly reduced overheads for starting, stopping and reading the counters Courtesy of http://icl.cs.utk.edu/papi/

  15. PAPI – PAPI2 vs PAPI3 • Better native event support in PAPI3 • Better thread support in PAPI3 • Overflow and Profiling enhancements in PAPI3 • Myriad bug fixes and code cleanup in PAPI3

  16. PAPI – PAPI2 vs PAPI3 • Overlapping eventsets supported in PAPI2 • Minor changes in the API – mostly dereferencing variables

  17. Blue Gene L – Intro • 65,536 nodes connected in a 64 x 32 x 32 3D torus • Nodes made up of PowerPC 440 embedded processors • Smaller than most supercomputers • Consumes less power

  18. Blue Gene L

  19. Blue Gene L - Networks • 3D torus network (node to node) • Tree network (broadcasts)

  20. Blue Gene L – HW counters • 48 universal performance counters • 4 floating point unit counters • Counters 32 bit – must use virtual counters to prevent overflow
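
One common way to build a virtual counter on top of a 32-bit hardware counter is to poll the raw value periodically and accumulate the differences into a wider total, treating a decrease as a wraparound. The sketch below illustrates that technique. The class name is hypothetical, and it assumes the counter is read at least once per wrap; it is not the actual BG/L or PAPI implementation.

```python
COUNTER_BITS = 32
COUNTER_MODULUS = 1 << COUNTER_BITS  # a 32-bit counter wraps at 2**32


class VirtualCounter:
    """Accumulate reads of a wrapping 32-bit counter into an unbounded total.

    Assumption: update() is called often enough that the hardware counter
    wraps at most once between two consecutive reads.
    """

    def __init__(self, initial_raw=0):
        self._last_raw = initial_raw
        self.total = 0

    def update(self, raw):
        # Modular subtraction gives the correct delta even if the raw
        # value wrapped past zero since the previous read.
        delta = (raw - self._last_raw) % COUNTER_MODULUS
        self.total += delta
        self._last_raw = raw
        return self.total


vc = VirtualCounter()
vc.update(4_000_000_000)  # close to the 32-bit limit
vc.update(100)            # raw value wrapped around
print(vc.total)           # 4294967396, i.e. 2**32 + 100
```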

  21. Blue Gene L – HW counters

  22. Research – Overall goals • Network hardware counters are new • Use network counters to determine traffic between tasks • Try to optimize placement of tasks to minimize communication latency • Given counts and distances: cost = counts * distance, summed over all communicating pairs. Minimize the total over all nodes
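
The cost model on this slide can be sketched as follows. The slides do not define "distance", so this example assumes it is the hop count on the 3D torus, with wraparound links in each dimension. The function names, the traffic dictionary, and the sample coordinates are all hypothetical.

```python
TORUS_DIMS = (64, 32, 32)  # torus shape given in the Blue Gene L intro slide


def torus_distance(a, b, dims=TORUS_DIMS):
    """Hop count between coordinates a and b on a torus.

    In each dimension a message may travel either way around the ring,
    so the per-dimension distance is the shorter of the two directions.
    """
    return sum(min(abs(p - q), d - abs(p - q)) for p, q, d in zip(a, b, dims))


def placement_cost(traffic, placement, dims=TORUS_DIMS):
    """Sum of counts * distance over all communicating task pairs.

    traffic   -- {(task_a, task_b): message count}
    placement -- {task: (x, y, z) node coordinate}
    """
    return sum(
        count * torus_distance(placement[s], placement[t], dims)
        for (s, t), count in traffic.items()
    )


traffic = {(0, 1): 100, (1, 2): 10}
placement = {0: (0, 0, 0), 1: (63, 0, 0), 2: (0, 16, 0)}

print(torus_distance(placement[0], placement[1]))  # 1 (wraps around in x)
print(placement_cost(traffic, placement))          # 270
```

Tasks 0 and 1 sit at opposite ends of the x axis but are only one hop apart because of the wraparound link, which is why a plain Manhattan distance would overestimate the cost here.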

  23. Research – Counting • First goal: determine what is being counted

  24. Research – Networks • For each MPI call – determine which network counters are being used • Tree is supposed to be for broadcasts • Torus is supposed to be for point to point communication • Ambiguities in the specification

  25. Research – Future decisions • How to profile a target application • Manually insert PAPI instrumentation: a lot of work • Instrument binaries with counting code • What information to store • All counts on each node: a lot of data • Sample of all nodes: not as accurate (what if the tasks behave / communicate differently?)

  26. Research – Future decisions • How to use collected information • Profile an application to obtain counter feedback to determine optimized static task layout • Dynamically migrate tasks in response to counters
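
As an illustration of how profiled counts could feed a static task layout, the sketch below runs a simple pairwise-swap local search that keeps any swap of two tasks' nodes that lowers the total counts * distance cost. The slides do not say which optimization method the project uses, so this heuristic is purely an assumption, as are all function names and the toy 8-node ring. It finds a local optimum, not necessarily the best layout.

```python
from itertools import combinations


def torus_distance(a, b, dims):
    """Hop count on a torus, taking the shorter way around each dimension."""
    return sum(min(abs(p - q), d - abs(p - q)) for p, q, d in zip(a, b, dims))


def placement_cost(traffic, placement, dims):
    """Total of counts * distance over all communicating task pairs."""
    return sum(
        count * torus_distance(placement[s], placement[t], dims)
        for (s, t), count in traffic.items()
    )


def improve_by_swaps(traffic, placement, dims):
    """Swap the nodes of two tasks whenever that lowers the cost.

    Repeats full passes over all task pairs until a pass makes no
    improvement. Each accepted swap strictly lowers a non-negative
    integer cost, so the loop terminates.
    """
    placement = dict(placement)  # leave the caller's layout untouched
    best = placement_cost(traffic, placement, dims)
    improved = True
    while improved:
        improved = False
        for s, t in combinations(sorted(placement), 2):
            placement[s], placement[t] = placement[t], placement[s]
            cost = placement_cost(traffic, placement, dims)
            if cost < best:
                best = cost
                improved = True
            else:
                # Undo the swap: it did not help.
                placement[s], placement[t] = placement[t], placement[s]
    return placement, best


# Toy example: four tasks on an 8-node ring (an 8 x 1 x 1 torus).
dims = (8, 1, 1)
traffic = {(0, 1): 10, (2, 3): 10}
initial = {0: (0, 0, 0), 1: (4, 0, 0), 2: (1, 0, 0), 3: (5, 0, 0)}

print(placement_cost(traffic, initial, dims))  # 80
layout, cost = improve_by_swaps(traffic, initial, dims)
print(cost)                                    # 20
```

A full pass costs one cost evaluation per task pair, so this approach would need a cheaper incremental cost update, or a different heuristic, before it could be applied to anything near 65,536 nodes.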
