PAPI Directions

Presentation Transcript


  1. PAPI Directions. Dan Terpstra, Innovative Computing Lab, University of Tennessee

  2. PAPI Directions
     • Overview
       • What’s PAPI?
       • What’s New?
         • Features
         • Platforms
       • What’s Next?
         • Network PAPI
         • Thermal PAPI
       • When?
         • PAPI release roadmap
       • What’s ICL?
         • (a word from our sponsor)
     IBM Petascale Workshop 2006

  3. What’s PAPI?
     • A software layer (library) designed to give the tool developer and application engineer a consistent interface and methodology for using the performance counter hardware found in most major microprocessors.
     • Countable events are defined in two ways:
       • Platform-neutral Preset Events
       • Platform-dependent Native Events
     • Preset Events can be derived from multiple Native Events.
     • All events are referenced by name and collected in EventSets for sampling.
     • Events can be multiplexed if counters are limited.
     • Statistical sampling is implemented by:
       • Software overflow with timer-driven sampling
       • Hardware overflow, if supported by the platform

  4. Where’s PAPI?
     • PAPI runs on most modern processors and operating systems of interest to HPC:
       • IBM POWER3, 4, 5 / AIX
       • POWER4, 5 / Linux
       • PowerPC-32 and -64 / Linux
       • Blue Gene
       • Intel Pentium II, III, 4, M, EM64T, etc. / Linux
       • Intel Itanium
       • AMD Athlon, Opteron / Linux
       • Cray T3E, X1, XD3, XT3 Catamount
       • Altix, Sparc, …
     • NOTE: All Linux implementations require the perfctr kernel patch,
       • except Itanium, which uses the built-in perfmon interface.
     • Perfmon2 development is underway to replace perfctr and be pre-installed in the kernel. NO PATCHES NEEDED!

  5. Extending PAPI beyond the CPU
     • PAPI has historically targeted on-processor performance counters
     • Several categories of off-processor counters exist
       • Network interfaces: Myrinet, Infiniband, GigE
       • Memory interfaces: Cray X1
       • Thermal and power interfaces: ACPI
     • CHALLENGE:
       • Extend the PAPI interface to address multiple counter domains
       • Preserve the PAPI calling semantics, ease of use, and platform independence for existing applications

  6. Multi-Substrate PAPI
     • Goals:
       • Isolate hardware-dependent code in a separable ‘substrate’ module
       • Extend platform-independent code to support multiple simultaneous substrates
       • Add or modify API calls to support access to any of several substrates
       • Modify build environment for easy selection and configuration of multiple available substrates

  7. PAPI 3.0 Design (layer diagram; labels top to bottom)
     • PAPI High Level, PAPI Low Level (Portable Layer)
     • Hardware Independent Layer
     • PAPI Machine Dependent Substrate (Machine Specific Layer)
     • Kernel Extension
     • Operating System
     • Hardware Performance Counters

  8. PAPI 4.0 Multiple Substrate Design (layer diagram; labels top to bottom)
     • PAPI High Level, PAPI Low Level (Portable Layer)
     • Hardware Independent Layer
     • Three PAPI Machine Dependent Substrates (Machine Specific Layer), each above a Kernel Extension and Operating System
     • Two substrates sit above Hardware Performance Counters; one sits above Off-Processor Hardware Counters

  9. API Changes
     • 3 calls augmented with a substrate index
     • Old syntax preserved in wrapper functions for backward compatibility
     • Modified entry points:
       • PAPI_create_eventset → PAPI_create_sbstr_eventset
       • PAPI_get_opt → PAPI_get_sbstr_opt
       • PAPI_num_hwctrs → PAPI_num_sbstr_hwctrs
     • New entry points for new functionality:
       • PAPI_num_substrates
       • PAPI_get_sbstr_info
     • Old code can run with no source modifications

  10. PAPI 4.0 Status
      • Multi-substrate development complete
      • Some CPU platforms not yet ported
      • Substrates available for
        • ACPI (Advanced Configuration and Power Interface)
        • Myrinet MX
      • Substrates under development for
        • Infiniband
        • GigE
      • Friendly User release available now for CVS checkout

  11. Myrinet MX Counters

  12. Myrinet MX Counters

  13. Multiple Measurements
      • The HPCC HPL benchmark with 3 performance metrics:
        • FLOPS; Temperature; Network Sends/Receives
      • Node 7 (figure)

  14. Multiple Measurements
      • The HPCC HPL benchmark with 3 performance metrics:
        • FLOPS; Temperature; Network Sends/Receives
      • Node 3 (figure)

  17. Data Structure Addressing
      • Goal:
        • Measure events related to specific data addresses (structures).
      • Availability:
        • Itanium: 160 / 475 native events
        • rumored on POWER4; POWER5?
      • PAPI example:

          ...
          opt.addr.eventset = EventSet;
          opt.addr.start = (caddr_t)array;
          opt.addr.end = (caddr_t)(array + size_array);
          retval = PAPI_set_opt(PAPI_DATA_ADDRESS, &opt);
          actual.start = (caddr_t)array - opt.addr.start_off;
          actual.end = (caddr_t)(array + size_array) + opt.addr.end_off;
          ...

  18. Rensselaer to Build and House $100 Million Supercomputer (NY Times, May 11, 2006)
      “Rensselaer Polytechnic Institute announced yesterday that it was combining forces with New York State and I.B.M. to build a $100 million supercomputer that will be among the 10 most powerful in the world. The computer, a type of I.B.M. system known as Blue Gene, will be on Rensselaer's campus in Troy, N.Y., and will have the power to perform more than 70 trillion calculations per second. It will mainly be used to help researchers make smaller, faster semiconductor devices and for nanotechnology research.”

  19. PAPI and BG/L
      • Diagram labels: 2 FPU PMCs (shown twice); UPC Module, 48 Shared Counters
      • Performance Counters:
        • 48 UPC Counters
          • shared by both CPUs
          • External to CPU cores
          • 32 bits :(
        • 2 Counters on each FPU
          • 1 counts load/stores
          • 1 counts arithmetic operations
        • Accessed via bgl_perfctr

  20. PAPI and BG/L (2): Versions
      • PAPI 2.3.4
        • Original release
        • Poor native event support
      • PAPI 3.2.2 beta
        • Currently being beta tested
        • Full access to native events by name
        • Limitations
          • Only events exposed by bgl_perfctr
          • No control over native event edges
          • Still no overflow/profile support
            • Is there a timer available?
          • No configure script (cross-compilation)
          • No scripted acceptance test suite (multiple queuing systems)

  21. PAPI and BG/L (3): Presets

      Test case avail.c: Available events and hardware information.
      -------------------------------------------------------------------------
      Vendor string and code   :  (1312)
      Model string and code    : PVR=0x5202:0x1891 Serial=R00-M0-N1-C:J16-U01 (1375869073)
      CPU Revision             : 20994.062500
      CPU Megahertz            : 700.000000
      CPU's in this Node       : 1
      Nodes in this System     : 32
      Total CPU's              : 32
      Number Hardware Counters : 52
      Max Multiplex Counters   : 32
      -------------------------------------------------------------------------
      Name              Derived  Description (Mgr. Note)
      PAPI_L3_TCM       No       Level 3 cache misses ()
      PAPI_L3_LDM       Yes      Level 3 load misses ()
      PAPI_L3_STM       No       Level 3 store misses ()
      PAPI_FMA_INS      No       FMA instructions completed ()
      PAPI_TOT_CYC      No       Total cycles ()
      PAPI_L2_DCH       Yes      Level 2 data cache hits ()
      PAPI_L2_DCA       Yes      Level 2 data cache accesses ()
      PAPI_L3_TCH       No       Level 3 total cache hits ()
      PAPI_FML_INS      No       Floating point multiply instructions ()
      PAPI_FAD_INS      No       Floating point add instructions ()
      PAPI_BGL_OED      No       BGL special event: Oedipus operations ()
      PAPI_BGL_TS_32B   Yes      BGL special event: Torus 32B chunks sent ()
      PAPI_BGL_TS_FULL  Yes      BGL special event: Torus no token UPC cycles ()
      PAPI_BGL_TR_DPKT  Yes      BGL special event: Tree 256 byte packets ()
      PAPI_BGL_TR_FULL  Yes      BGL special event: UPC cycles (CLOCKx2) tree rcv is full ()
      -------------------------------------------------------------------------
      avail.c PASSED

  22. PAPI and BG/L (4): Native Events
      • 328 native events available
      • Only events exposed by bgl_perfctr
        • 4 arithmetic events per FPU
        • 4 Load/Store events per FPU
        • 312 UPC events

      BGL_FPU_ARITH_ADD_SUBTRACT  0x40000000  |Add and subtract, fadd, fadds, fsub, fsubs (Book E add, substract)|
      BGL_FPU_ARITH_MULT_DIV      0x40000001  |Multiplication and divisions, fmul, fmuls, fdiv, fdivs (Book E mul, div)|
      BGL_FPU_ARITH_OEDIPUS_OP    0x40000002  |Oedipus operations, All symmetric, asymmetric, and complex Oedipus multiply-add instructions|
      ...
      BGL_UPC_TS_ZP_VCD0_CHUNKS   0x40000145  |ZP vcd0 chunks|
      BGL_UPC_TS_ZP_VCD1_CHUNKS   0x40000146  |ZP vcd1 chunks|
      BGL_PAPI_TIMEBASE           0x40000148  |special event for getting the timebase reg|
      -------------------------------------------------------------------------
      Total events reported: 328
      native_avail.c PASSED

  23. XT3 and Catamount (The Oak Ridger, February 21, 2006)
      “The Cray XT3 Jaguar, the flagship computing system in ORNL's Leadership Computing Facility, was ranked tenth in the world in a November 2005 survey of supercomputers, delivering 20.5 trillion operations per second (teraflops).”

  24. PAPI and Catamount
      • Opteron-based
      • Catamount OS similar to CNK
      • Driven by Sandia-Cray version of perfctr
      • No overflow / profiling
      • Configure works because compile node == compute node
      • Test Suite script works because there’s only one queuing system

  25. …and of course, Cell
      • The PlayStation 3's CPU is based on a chip codenamed "Cell"
      • Each Cell contains 8 APUs.
        • An APU is a self-contained vector processor acting independently from the others.
        • 4 floating point units capable of a total of 32 Gflop/s (8 Gflop/s each)
      • 256 Gflop/s peak! 32 bit floating point; 64 bit floating point at 25 Gflop/s.
      • But what about the performance counters!

  26. When? PAPI Release Schedule
      • PAPI 3.3.0: RealSoonNow™
        • BG/L in beta testing
        • Merging and deprecating PAPI 3.0.8.1
        • Regression testing on other platforms
      • PAPI 4.0: Q2, 2006
        • Porting some substrates to the multi-substrate model
        • Developing additional non-CPU substrates
      • Wanna Help? Distributed Testing…

  27. Distributed Testing
      • Dart / CTest
      • Mozilla Tinderbox
      • DejaGnu
      • Homegrown
      • Others?
      • Problem:
        • How do you develop/test/verify on multiple systems with multiple OSes at multiple sites?
        • Automatically; Transparently; Repetitively

  28. A Word from our Sponsor… Innovative Computing Laboratory
      • Jack’s Research Group in the CS Department
      • Size: about 45-50 people
        • 16 students; 19 scientific staff; 10 support staff; 1 visitor
      • Funding
        • NSF: Supercomputer Centers (UCSD & NCSA); Next Generation Software (NGS); Info Tech Res. (ITR); Middleware Init. (NMI)
        • DOE: Scientific Discovery through Advanced Computing (SciDAC); Math in Comp Sci (MICS)
        • DARPA: High Productivity Computing Systems
        • DOD: Modernization
      • Work with companies
        • AMD, Cray, Dolphin, Microsoft, MathWorks, Intel, Sun, Myricom, SGI, HP, IBM, Northrop-Grumman
      • PhD Dissertation, MS Project
      • Equipment
        • A number of clusters
        • Desktop machines
        • Office setup
      • Summer internships
        • Industry, ORNL, …
      • Travel to meetings
      • Participate in publications

  29. ICL Class of 2005

  30. Speculative Performance Positions
      • PostDoc Positions Probably Available
        • PAPI
          • New Platforms (Cell?)
          • New Substrates (Infiniband?)
        • KOJAK
          • Automated Performance Analysis
        • ECLIPSE PTP & TAU Integration
      • See me for brochures or more info

  31. PAPI Directions. Dan Terpstra, Innovative Computing Lab, University of Tennessee
