ChaMPIon/Pro™: A High Performance Multithreaded Portable MPI-2 Implementation for ASCI Terascale Platforms and Linux Clusters. Rossen Dimitrov, Anthony Skjellum, Kumaran Rajaram, Weiyi Chen, Dave Leimbach, Srigurunath Chakravarthi, and Jothi P. Neelamegam – MPI Software Technology, Inc.; Ronald Brightwell – Sandia National Laboratories; Bronis de Supinski and Terry Jones – Lawrence Livermore National Laboratory; Gary Grider and Marydell Nochumson – Los Alamos National Laboratory
Outline • Overview • Objectives • MPI-2 Features • ChaMPIon/Pro • Performance Results • Summary
Overview of ChaMPIon/Pro • ChaMPIon/Pro™ is the first commercial MPI-2.1 implementation available for Linux. • ChaMPIon/Pro is a robust, scalable, high-performance, commercial MPI-2.1 implementation from MPI Software Technology, Inc. • MercutIO™, a high-performance portable MPI-IO implementation that currently supports NFS, GPFS, and PVFS, is included with ChaMPIon/Pro. • ChaMPIon/Pro works to retain system scalability for applications while balancing performance criteria (such as latency vs. overhead) and resource utilization.
DOE Collaborators’ Key Contributions • ASCI-relevant input and requirements • Review of designs specialized for ASCI systems • Validating performance on systems; advice and feedback • Attracting production users • Co-design of PERUSE; leading the PERUSE forum • Test suite requirements, ideas, and advice
ChaMPIon/Pro’s Performance and Scalability Objectives • Scaling to thousands and tens of thousands of processors and beyond • Multi-device support • Topology awareness • Thread safety • Optimized collective operations • Optimized derived datatypes • Efficient memory (and NIC resource) usage
ChaMPIon/Pro’s Functionality and Usability Objectives • Integration with schedulers and resource managers • Integration with debuggers and profilers • Functionality controlled by tunable parameters • Documentation • Reflect user feedback
Major New Functionality in MPI-2 • Parallel I/O • One-sided communication • Dynamic process management • Extended collective operations • Improved error handling • Info object • External interfaces
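As a point of reference for the feature list above, the short C sketch below shows two of the smaller MPI-2 additions in use: the Info object and communicator-level error handlers. It is generic MPI-2 usage rather than ChaMPIon/Pro-specific code; the "access_style" hint is a standard MPI-IO hint chosen purely for illustration.

/* Minimal sketch of two MPI-2 additions: the Info object and
 * communicator error handlers. Compile with an MPI compiler wrapper,
 * e.g. "mpicc info_errhandler.c -o info_errhandler". */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Info info;

    MPI_Init(&argc, &argv);

    /* MPI-2 Info object: an opaque set of (key, value) hints that can be
     * passed to I/O, one-sided, and dynamic-process calls. */
    MPI_Info_create(&info);
    MPI_Info_set(info, "access_style", "write_once");  /* standard I/O hint */

    /* MPI-2 error handling: ask for error codes instead of aborting,
     * so the application can react to failures itself. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}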
ChaMPIon/Pro Technology Evolution (diagram): DOE Tri-lab Ultrascale requirements, MSTI's commercial baseline (the MPI/Pro and CMPI commercial technology products), and new ideas, know-how, and software feed on-going R&D, producing ChaMPIon/Pro and its ASCI solutions: target platforms, communication devices, I/O devices, and tools support.
Architecture (Baseline) (diagram): an upper MPI layer (datatypes; groups and communicators; virtual topologies; error handling; cached attributes; tool support) above a core layer providing point-to-point matching, ordering, and progress, plus scheduling, collectives, multi-device support, I/O, etc., all built on a common low-level messaging & I/O domain: Portals, LAPI, TCP/IP, VIA, GM, RACE, SMP, BAFS, etc.
Architecture (Morphable) (diagram): the same layering as the baseline architecture, but with exploitable semantics passed between layers and middleware functionality pushed down into hardware/firmware.
Collective Operations (diagram): multi-hierarchy collective operations – Bcast, Reduce, Gather, and Scatter organized into operation classes 1-3 across hierarchy levels 0-2.
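To make the multi-hierarchy idea concrete, here is a minimal sketch of a two-level, hierarchy-aware broadcast: the data is first broadcast among one leader per node, then within each node. It only illustrates the general technique, not ChaMPIon/Pro's actual algorithm; the helper hierarchical_bcast, the hash-based node grouping, and the assumption that the root is world rank 0 are all illustrative choices.

/* Sketch of a two-level, hierarchy-aware broadcast. Assumes the
 * broadcast root is world rank 0. */
#include <mpi.h>

static int node_color(void)
{
    /* Derive a per-node color from the processor name (simple hash;
     * good enough for a sketch, not collision-proof). */
    char name[MPI_MAX_PROCESSOR_NAME];
    int len, i;
    unsigned color = 0;
    MPI_Get_processor_name(name, &len);
    for (i = 0; i < len; i++)
        color = color * 31u + (unsigned char)name[i];
    return (int)(color & 0x7fffffff);
}

static void hierarchical_bcast(void *buf, int count, MPI_Datatype type,
                               MPI_Comm comm)
{
    int world_rank, node_rank;
    MPI_Comm node_comm, leader_comm;

    MPI_Comm_rank(comm, &world_rank);

    /* Level 0: ranks on the same node. */
    MPI_Comm_split(comm, node_color(), world_rank, &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);

    /* Level 1: one leader (node_rank == 0) per node. */
    MPI_Comm_split(comm, node_rank == 0 ? 0 : MPI_UNDEFINED,
                   world_rank, &leader_comm);

    if (leader_comm != MPI_COMM_NULL) {
        MPI_Bcast(buf, count, type, 0, leader_comm);   /* across nodes  */
        MPI_Comm_free(&leader_comm);
    }
    MPI_Bcast(buf, count, type, 0, node_comm);         /* within a node */
    MPI_Comm_free(&node_comm);
}

int main(int argc, char **argv)
{
    int rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) value = 42;
    hierarchical_bcast(&value, 1, MPI_INT, MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}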
Main Characteristics • Independent Message Progress • Multi-Device support • Fully multithreaded MPI-1, MPI I/O (MercutIO) • One-sided communication • Low CPU Overhead • Overlap of communication, computation, I/O • Thread Safe and Thread Aware [MPI_THREAD_MULTIPLE]. Works fully with OpenMP
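A minimal sketch of the thread-awareness point above: an application that intends to call MPI from multiple threads (for example under OpenMP) requests MPI_THREAD_MULTIPLE at initialization and checks the level actually provided. This is standard MPI-2 usage, not ChaMPIon/Pro-specific code.

/* Requesting full thread support (MPI_THREAD_MULTIPLE) at startup,
 * as needed when several threads call MPI concurrently. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        fprintf(stderr, "MPI library provides thread level %d only\n", provided);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* ... threaded communication, computation, and I/O go here ... */

    MPI_Finalize();
    return 0;
}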
Platform Support • LLNL ASCI Blue (PPC 603e; IBM AIX; SP Switch/LAPI) • LLNL ASCI White (IBM Power 3; IBM AIX; SP Switch/LAPI) • Sandia Cplant (HP/Compaq Alpha; Linux; Myrinet/Portals) • COTS Clusters (Intel IA-32; Linux; TCP/IP; Myrinet/GM, InfiniBand/VAPI, Quadrics/ELAN)
Communication Support • Portals • SP Switch/ LAPI • InfiniBand/VAPI • Quadrics/ELAN • Myrinet/GM1 and GM2 • TCP/IP • SMP
MercutIO • The MPI-IO Component of ChaMPIon/Pro • Distributed File Systems: NFS, ENFS • Parallel File Systems: PVFS, GPFS • Cluster File Systems: Lustre, Panasas • Design and Optimizations (SCICOMP6)
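For context, the sketch below uses the standard MPI-IO interface that MercutIO implements: each rank writes its own block of a shared file at a rank-dependent offset. The file name "datafile" and the block size are arbitrary illustration values, not taken from the presentation.

/* Minimal MPI-IO example: each rank writes one block of a shared file. */
#include <mpi.h>

#define BLOCK 1024

int main(int argc, char **argv)
{
    int rank, i, buf[BLOCK];
    MPI_File fh;
    MPI_Offset offset;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < BLOCK; i++)
        buf[i] = rank;

    MPI_File_open(MPI_COMM_WORLD, "datafile",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Rank-dependent offset gives each process a disjoint region. */
    offset = (MPI_Offset)rank * BLOCK * sizeof(int);
    MPI_File_write_at(fh, offset, buf, BLOCK, MPI_INT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}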
Integration with Tools and Resource Managers • Schedulers/resource managers • LLNL: GangLL (LoadLeveler) • LANL: LSF, BPROC • Sandia: Cplant’s yod, yod2 • Etnus TotalView parallel debugger • Pallas Vampir performance profiler
Miscellaneous Support • C and C++ Language Bindings; ISO FORTRAN 90 upcoming. • PERUSE Support (SCICOMP6) • Improved error handling • Extensive performance and correctness test suites • Customizable
Performance Numbers (GM version 2.0.2)
HPL Performance Numbers (64-node cluster, 3 GHz Xeon, 1 process per node)
ChaMPIon/Pro Differentiation • ChaMPIon/Pro is the only MPI-2 implementation on Linux to offer all of the functionality of the MPI-2 standard, and it does so efficiently. (http://www.lam-mpi.org/mpi/implementations/display.php?id=32) • MercutIO is more efficient than other MPI-IO systems in key performance benchmarks. (http://www.spscicomp.org/ScicomP6/Presentations/Rajaram/MercutIO.ppt) • ChaMPIon/Pro enables the shortest time-to-solution for real-world applications.
Summary • ChaMPIon/Pro offers all of the robustness, scalability, and performance of MPI/Pro plus all MPI-2 features. • Support for a range of target platforms, communication devices, file systems, and performance monitoring and debugging tools.
Questions? This work was supported in part by Small Business Innovation Research Phase I, II, and IIB awards from the National Science Foundation, under Contracts DMI-9860997, DMI-9983413, and DMP-0222804, respectively. Further work was performed under the ASCI Pathforward Ultrascale Tools Initiative as subcontract B510240 from the University of California, under Department of Energy Contract W-7405-Eng-48.
Selected References, I • Rossen Dimitrov and Anthony Skjellum. A Theoretical Framework for Overlapping of Communication and Computation and Early Binding, Part I: BOUM Model and Overlapping Metrics. Submitted to Parallel Computing, February 2003. • Rossen Dimitrov and Anthony Skjellum. A Theoretical Framework for Overlapping of Communication and Computation and Early Binding, Part II: Early Binding. Submitted to Parallel Computing, June 2003. • Kumaran Rajaram, Anthony Skjellum, Rossen P. Dimitrov, Purushotham V. Bangalore, Vijay Velusamy, and David Leimbach. Design, Implementation, and Evaluation of a High Performance Portable Implementation of the MPI-2 I/O Standard API. Submitted to Parallel Computing, November 2002. • W. Gropp, E. Lusk, N. Doss, and A. Skjellum. A High-performance, Portable Implementation of the MPI Message Passing Interface Standard. Parallel Computing, 22(6):789-828, September 1996.
Selected References, II • Dimitrov, Rossen. 2001. Overlapping of communication and computation and early binding: Fundamental mechanisms for improving parallel performance on clusters of workstations. Ph.D. dissertation, Mississippi State University. http://library.msstate.edu/etd/show.asp?etd=etd-04092001-231941. • Dimitrov R. and A. Skjellum. 1999. An Efficient MPI Implementation for Virtual Interface (VI) Architecture-enabled Cluster Computing. In Proc. MPIDC'99, Message Passing Interface Developer's and User's Conference, pages 15--24, Atlanta, GA, March 1999. • Kumaran Rajaram. Principal design criteria influencing the performance of a portable high performance parallel I/O implementation. M.S. Thesis, Dept of Computer Science, Mississippi State University, May 2002. http://library.msstate.edu/etd/show.asp?etd=etd-04052002-105711 • William Gropp, Ewing Lusk, and Rajeev Thakur. 1999. Using MPI-2: Advanced features of the message-passing interface. Cambridge, MA: The MIT Press.
Contacts • Dr. Anthony Skjellum, CTO, 662-320-4300 x15, tony@mpi-softtech.com • Kumaran Rajaram, Senior Software Engineer, 662-320-4300 x18, kums@mpi-softtech.com • Dr. Rossen Dimitrov, Principal Software Engineer, 603-891-4766, rossen@mpi-softtech.com
Main Characteristics • Highly optimized datatype management • Software engineering processes: SRSs, HLDs, and DDs before implementation (with feedback from end users) • Collectives with topology awareness • Optimized persistent mode of communication (see the sketch below)
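As a small illustration of the persistent communication mode mentioned above, the sketch below sets up a send/receive request once and restarts it on every iteration, which is what allows an MPI library to reuse its matching and scheduling state. Ranks, tag, message size, and iteration count are illustrative; run with at least two ranks.

/* Persistent point-to-point communication between ranks 0 and 1. */
#include <mpi.h>

#define N 4096
#define ITERS 100

int main(int argc, char **argv)
{
    int rank, i;
    double buf[N];
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Send_init(buf, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
        for (i = 0; i < ITERS; i++) {
            MPI_Start(&req);                    /* reuse the prepared request */
            MPI_Wait(&req, MPI_STATUS_IGNORE);
        }
        MPI_Request_free(&req);
    } else if (rank == 1) {
        MPI_Recv_init(buf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);
        for (i = 0; i < ITERS; i++) {
            MPI_Start(&req);
            MPI_Wait(&req, MPI_STATUS_IGNORE);
        }
        MPI_Request_free(&req);
    }

    MPI_Finalize();
    return 0;
}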
One-sided Communication • Complete implementation, including passive synchronization, accumulate operations, and non-contiguous PUT and GET • Independent progress thread provides a true one-sided effect for all operations • Supported over TCP, GM, and InfiniBand
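A minimal sketch of the passive-target synchronization listed above: rank 0 locks a window on rank 1, performs an MPI_Put, and unlocks, with no matching call on the target. Buffer sizes and the choice of target rank are illustrative; the program assumes at least two ranks.

/* One-sided MPI_Put with passive-target synchronization. */
#include <mpi.h>

#define N 256

int main(int argc, char **argv)
{
    int rank, i;
    double local[N], remote[N];
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < N; i++)
        local[i] = rank;

    /* Every rank exposes its "remote" buffer in a window. */
    MPI_Win_create(remote, N * sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    if (rank == 0) {
        /* Passive target: rank 1 takes no part in the synchronization. */
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 1, 0, win);
        MPI_Put(local, N, MPI_DOUBLE, 1, 0, N, MPI_DOUBLE, win);
        MPI_Win_unlock(1, win);
    }

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}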
Dynamic Process Creation • Spawning is handled by mpirun; no additional resource manager or per-node daemon process is required • Dynamic device initialization; multi-device architecture • Dynamic connection establishment, compatible with the MPI-1 static model
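A minimal sketch of MPI-2 dynamic process creation: the parent job spawns additional worker processes at run time and receives an intercommunicator back to them. The executable name "worker" and the number of children are hypothetical illustration values.

/* Dynamic process creation with MPI_Comm_spawn. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm children;
    int errcodes[4];

    MPI_Init(&argc, &argv);

    /* Spawn 4 worker processes; they get their own MPI_COMM_WORLD and an
     * intercommunicator ("children") back to the parents. */
    MPI_Comm_spawn("worker", MPI_ARGV_NULL, 4, MPI_INFO_NULL, 0,
                   MPI_COMM_WORLD, &children, errcodes);

    MPI_Comm_disconnect(&children);
    MPI_Finalize();
    return 0;
}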
MercutIO vs. ROMIO • Hardware Configuration • Linux Cluster • 500 MHz Pentium II Processor, 512 MB RAM • 8 Nodes interconnected by 100 Mbps Fast Ethernet • Software Configuration • PVFS 1.5.4 • MPICH 1.2.4 • Access Pattern: Contiguous
MercutIO vs. IBM MPI-IO Implementation • Hardware Configuration • IBM SP Cluster • 280 Nodes • 4 Processors per node (PowerPC 604e processors) • Total Memory: 512 GB • Total Disk Space: 16 TB GPFS, 3 TB local space • Software Configuration • GPFS • Access Pattern: Strided and Segmented
MercutIO vs. IBM MPI-IO Implementation: Strided Access Performance Platform = blue Geometry = 4 Nodes, 2 Tasks-per-node Iterations = 3 Transfer Size = 4MB Block Size = 4MB Stride Count = 100 Access pattern = Strided File Size = 12.5GB Collective = false
MercutIO vs. IBM MPI-IO Implementation: Strided Access Performance (contd.)

API                        Write Bandwidth (MB/sec)   Read Bandwidth (MB/sec)
POSIX                              181                        243
IBM MPI (w/o large block)          127                        202
IBM MPI (w/ large block)           159                        247
MercutIO                           224                        397
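For readers unfamiliar with the strided pattern being measured, the sketch below shows how such an access pattern is typically expressed in MPI-IO: a vector datatype defines an interleaved file view and each rank writes its blocks through that view. The counts, block sizes, and file name are illustrative, not the benchmark parameters above.

/* Sketch of a strided MPI-IO access pattern via a vector file view. */
#include <mpi.h>

#define BLOCK  256           /* doubles per block           */
#define NBLOCK 64            /* blocks written by each rank */

int main(int argc, char **argv)
{
    int rank, nprocs, i;
    double buf[BLOCK * NBLOCK];
    MPI_Datatype filetype;
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    for (i = 0; i < BLOCK * NBLOCK; i++)
        buf[i] = rank;

    /* Each rank owns every nprocs-th block of the file. */
    MPI_Type_vector(NBLOCK, BLOCK, BLOCK * nprocs, MPI_DOUBLE, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File_open(MPI_COMM_WORLD, "strided.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, (MPI_Offset)rank * BLOCK * sizeof(double),
                      MPI_DOUBLE, filetype, "native", MPI_INFO_NULL);
    MPI_File_write(fh, buf, BLOCK * NBLOCK, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&filetype);
    MPI_Finalize();
    return 0;
}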
MercutIO vs. IBM MPI-IO Implementation: Segmented Access Performance Platform = blue Geometry = 4 Nodes, 2 Tasks-per-node Iterations = 3 Transfer Size = 256KB Block Size = 128MB Stride Count = 1 Access pattern = Segmented File Size = 12.5GB Collective = false
MercutIO vs. IBM MPI-IO Implementation: Segmented Access Performance (contd.)

API                        Write Bandwidth (MB/sec)   Read Bandwidth (MB/sec)
POSIX                              269                        245
IBM MPI (w/o large block)          122                        156
IBM MPI (w/ large block)           335                        240
MercutIO                           221                        449
PERUSE • PERUSE provides a level of detail and accuracy of MPI performance data that is not possible through PMPI. • PERUSE helps in investigating hard performance and scalability issues. • PERUSE can be used to study the behavior of the MPI middleware, as well as the behavior of the hardware, in greater detail. • PERUSE complements the performance data accessible through PMPI. • MPI profiling tools can utilize PERUSE to provide additional performance-analysis services to MPI developers.
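Since PERUSE's own event API is not reproduced here, the sketch below shows the standard PMPI interposition that PERUSE complements: a tool supplies its own MPI_Send and forwards to PMPI_Send, timing the call. A wrapper like this can only observe call boundaries; PERUSE exposes events from inside the MPI library (for example queueing, matching, and transfer activity) that such a wrapper cannot see. Run with at least two ranks.

/* Standard PMPI profiling wrapper around MPI_Send. */
#include <mpi.h>
#include <stdio.h>

int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    double t0 = MPI_Wtime();
    int rc = PMPI_Send(buf, count, datatype, dest, tag, comm);
    double t1 = MPI_Wtime();

    /* A real tool would aggregate; this sketch just reports per call. */
    printf("MPI_Send of %d elements took %g s\n", count, t1 - t0);
    return rc;
}

int main(int argc, char **argv)
{
    int rank, data = 7;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* intercepted */
    else if (rank == 1)
        MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Finalize();
    return 0;
}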