Designing Parallel Operating Systems via Parallel Programming. Eitan Frachtenberg (1), Kei Davis (1), Fabrizio Petrini (1), Juan Fernández (1,2) and José Carlos Sancho (1). (1) Performance and Architecture Lab (PAL). (2) Grupo de Arquitectura y Computación Paralelas (GACOP).
Designing Parallel Operating Systems via Parallel Programming
Eitan Frachtenberg (1), Kei Davis (1), Fabrizio Petrini (1), Juan Fernández (1,2) and José Carlos Sancho (1)
(1) Performance and Architecture Lab (PAL), CCS-3 Modeling, Algorithms and Informatics, Los Alamos National Laboratory, NM 87545, USA. URL: http://www.c3.lanl.gov
(2) Grupo de Arquitectura y Computación Paralelas (GACOP), Dpto. Ingeniería y Tecnología de Computadores, Universidad de Murcia, 30071 Murcia, SPAIN. URL: http://www.ditec.um.es email: juanf@um.es
Motivation • Clusters have been the most successful player in high-performance computing in the last decade • HARDWARE = Independent Nodes + High-speed Network • SOFTWARE = Commodity OS + Parallel Apps + System Software [Figure: a row of cluster nodes, each running its own OS]
Motivation • Ever-increasing demand for computing capability is driving the construction of ever-larger clusters • Earth Simulator: 5120 processors • Thunder (LLNL): 4096 processors • ASCI Q (LANL): 8192 processors • Systems are becoming more complex, less efficient and less reliable
Motivation • Clusters are loosely coupled systems used for solving inherently tightly coupled problems • Parallel software keeps all the pieces together • Developing parallel software is time- and resource-consuming because of its complexity PROBLEM: parallel software has neither evolved nor scaled in step with cluster sizes SOLUTION: a new approach to the design of parallel software for large-scale clusters
Goals • Target • New methodology for the design of parallel software • Simplicity, performance, scalability, reliability • Backbone to integrate all nodes into a parallel OS • Vision • BSP-like system running MIMD applications (variable granularity on the order of hundreds of µs) • Approach • BSP-like global control and coordination of all system activities • Small set of collective communication primitives for global coordination
Outline • Motivation and Goals • Toward a Parallel Operating System • Core Primitives • Parallel Software Design • Case Studies • Concluding remarks
Toward a Parallel OS • Designing a Parallel OS: • Lack of global coordination (loose coupling) • Redundant/missing functionality (complexity) [Figure: layered stack with Resource Management, Parallel Application, ..., Parallel File System on top; Comm Protocol 1, Comm Protocol 2, ..., Comm Protocol N below them; Hardware at the bottom]
Toward a Parallel OS • Scientific applications are tightly coupled … • Data dependencies between nodes • They exchange messages very often • … but the processing nodes are “bolted together” in a loosely coupled fashion Need for global control and coordination of all the system activities, enforced by global collective communication primitives
Toward a Parallel OS • Designing a Parallel OS: • System-level, global control and coordination of all application and system software activities [Figure: the same layered stack, with a "Global control and coordination" layer inserted between the software components (Resource Management, Parallel Application, ..., Parallel File System) and Comm Protocol 1, Comm Protocol 2, ..., Comm Protocol N; Hardware at the bottom]
Toward a Parallel OS • Parallel applications use point-to-point and collective communication • System software tasks are either collective operations or can be cast in terms of them Parallel applications and system software can be built atop the same communication primitives
Toward a Parallel OS • Designing a Parallel OS: • Least common denominator of system and application software: Core Primitives [Figure: layered stack with Resource Management, Parallel Application, ..., Parallel File System on top; then Global control and coordination; then Comm Protocol 1, Comm Protocol 2, ..., Comm Protocol N; then Core Primitives; Hardware at the bottom]
Outline • Motivation and Goals • Toward a Parallel Operating System • Core Primitives • Parallel Software Design • Case Studies • Concluding remarks
Core Primitives • Parallel software built atop three primitives • Xfer-And-Signal • Transfer block of data to a set of nodes • Optionally signal local/remote event upon completion • Test-Event • Poll local event • Compare-And-Write • Compare global variable on a set of nodes • Optionally write global variable on the same set of nodes
Core Primitives • Parallel software built atop three primitives • Xfer-And-Signal (QsNet): • Node S transfers block of data to nodes D1, D2, D3 and D4 [Figure: source node S and destination nodes D1, D2, D3, D4]
Core Primitives • Parallel software built atop three primitives • Xfer-And-Signal (QsNet): • Node S transfers block of data to nodes D1, D2, D3 and D4 • Events triggered at source and destinations [Figure: source node S with its Source Event; destination nodes D1, D2, D3, D4 with their Destination Events]
Core Primitives • Parallel software built atop three primitives • Compare-And-Write (QsNet): • Node S compares variable V on nodes D1, D2, D3 and D4 • Does V satisfy the chosen comparison (for example, >) against Value? [Figure: source node S and destination nodes D1, D2, D3, D4]
Core Primitives • Parallel software built atop three primitives • Compare-And-Write (QsNet): • Node S compares variable V on nodes D1, D2, D3 and D4 • Partial results are combined in the switches [Figure: source node S and destination nodes D1, D2, D3, D4]
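To illustrate how partial results might be combined in the switches, the sketch below reduces per-node comparison results level by level with a logical AND, as a tree of switches could do. The function names and the radix of 4 are assumptions made for illustration; the slides do not describe the actual switch logic.

```python
import operator


def combine_in_switches(partials, radix=4):
    """AND-reduce a non-empty list of per-node results, one switch level at a time.

    Returns the global result and the number of switch levels traversed.
    """
    level = list(partials)
    hops = 0
    while len(level) > 1:
        # Each (hypothetical) switch combines the results of up to `radix` children.
        level = [all(level[i:i + radix]) for i in range(0, len(level), radix)]
        hops += 1
    return level[0], hops


def compare_and_write_query(values, op, value, radix=4):
    """Compare each node's variable against `value`, then combine the partial results."""
    partials = [op(v, value) for v in values]
    return combine_in_switches(partials, radix)


# Usage: 16 nodes, all holding V = 3, checked against V >= 3.
result, levels = compare_and_write_query([3] * 16, operator.ge, 3)
```

The point of the tree is that the source receives a single combined answer after a number of levels that grows logarithmically with the node count, rather than one reply per node.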
Outline • Motivation and Goals • Toward a Parallel Operating System • Core Primitives • Parallel Software Design • Case Studies • Concluding remarks
Toward a Parallel OS • Global control/coordination of all system activities [Figure: timeline of one time slice (hundreds of µs): Global Strobe (time slice starts), Task 1, Global Synchronization, Task 2, Global Synchronization, Task 3, Global Strobe (time slice ends)]
Parallel Software Design • Using the core primitives… • Global control and coordination • Strobe sent at regular intervals (time slices) • Compare-And-Write + Xfer-And-Signal (Master) • Test-Event (Slaves) • All system activities are tightly coupled • Global information is required to schedule resources; global synchronization facilitates the task but is not enough • Global resource scheduling • Exchange of requirements/restrictions • Xfer-And-Signal + Test-Event • Resource scheduling
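The master/slave strobe described above can be sketched as a sequential simulation. This is a hedged illustration only: the variable names, the per-slave "last completed slice" counter and the error raised when a slave falls behind are all assumptions, and no real network or timer is involved.

```python
def run_time_slices(num_slaves, num_slices):
    """Simulate the strobe protocol: the master checks, strobes, and slaves poll."""
    done_slice = [0] * num_slaves        # per-slave global variable: last slice completed
    strobe_event = [False] * num_slaves  # per-slave event flag
    log = []
    for s in range(1, num_slices + 1):
        # Master, Compare-And-Write: have all slaves finished slice s - 1?
        if not all(d >= s - 1 for d in done_slice):
            raise RuntimeError("a slave missed its time slice")
        # Master, Xfer-And-Signal: multicast the strobe and signal every slave.
        for i in range(num_slaves):
            strobe_event[i] = True
        # Slaves, Test-Event: consume the strobe, then do this slice's work.
        for i in range(num_slaves):
            if strobe_event[i]:
                strobe_event[i] = False
                done_slice[i] = s
                log.append((s, i))
    return done_slice, log


# Usage: three slaves, two time slices.
done, log = run_time_slices(3, 2)
```

In a real system the master would issue the strobe from a timer at the start of each time slice, and the slaves would poll asynchronously.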
Parallel Software Design SYSTEM SOFTWARE
Parallel Software Design • Using the core primitives…
Parallel Software Design Can we really build system software using this new approach?
Outline • Motivation and Goals • Introduction • Core Primitives • Parallel Software Design • Case Studies • Concluding remarks
Case Studies • Experimental Setup
Case Studies • STORM (Scalable TOol for Resource Management) • Architecture: • Set of dæmons running on the management/compute nodes • Built atop the three core primitives • BSP-like behavior: management activities are synchronized and scheduled every few hundred microseconds • Functionality: • Job Launching • Job Scheduling (FCFS, gang scheduling and others) • New scheduling algorithms can be “plugged in” • Resource Accounting
Case Studies • Job Launching: send/execute/check for completion 40 times faster than the best reported result!!!
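The send/execute/check-for-completion sequence can be pictured with the three primitives as follows. This is a speculative sketch, not STORM's code: sending the binary in chunks, the chunk size, the flow-control check and the notion that "executing" means verifying the received image are all assumptions made so that the example is self-contained.

```python
def launch_job(binary: bytes, num_nodes: int, chunk_size: int = 4) -> bool:
    """Simulate launching one job image on num_nodes nodes."""
    received = [bytearray() for _ in range(num_nodes)]
    chunks_seen = [0] * num_nodes
    exited = [False] * num_nodes
    chunks = [binary[i:i + chunk_size] for i in range(0, len(binary), chunk_size)]

    # Send: multicast each chunk (Xfer-And-Signal), then confirm that every
    # node has received it before sending the next one (Compare-And-Write).
    for n, chunk in enumerate(chunks, start=1):
        for node in range(num_nodes):
            received[node] += chunk
            chunks_seen[node] = n
        if not all(seen >= n for seen in chunks_seen):
            raise RuntimeError("flow control: a node fell behind")

    # Execute: here a node "runs" the job by checking that its image is intact.
    for node in range(num_nodes):
        exited[node] = bytes(received[node]) == binary

    # Check for completion: a single global query (Compare-And-Write).
    return all(exited)


# Usage: launch a 12-byte stand-in "binary" on 8 nodes.
ok = launch_job(b"hello world!", 8)
```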
Case Studies • BCS-MPI (Buffered CoScheduled MPI) • Architecture • Set of cooperative threads running in the NIC • Built atop the three core primitives • BSP-like behavior: communications are synchronized and scheduled every few hundred microseconds • Functionality: • Subset of the MPI standard • Paves the way to provide: • Traffic segregation • Deterministic replay of user applications • System-level fault tolerance
Case Studies • SWEEP3D and SAGE Performance (IA32) • Production-level MPI versus BCS-MPI [Plots annotated "2% SPEEDUP" and "0.5% SPEEDUP"]
Outline • Motivation and Goals • Introduction • Core Primitives • Parallel Software Design • Case Studies • Concluding remarks
Concluding Remarks • Methodology for designing parallel software • Coordination of all system and application software activities in a BSP-like fashion • Parallel applications and system software built atop a basic set of collective primitives for global coordination • Backbone to integrate all nodes into a parallel OS • Promising preliminary results demonstrate that this approach is indeed feasible
Future Work • Kernel-level implementation • User-level solution is already working • Deterministic replay of MPI programs • Ordered resource scheduling may enforce reproducibility • Transparent fault tolerance • Global coordination simplifies the state of the machine
Case Studies • Job Scheduling: gang scheduling Very small time slices: RESPONSIVENESS !!!
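Why small time slices help responsiveness can be seen from a toy round-robin gang scheduler. The sketch assumes that all nodes switch jobs together at every global strobe and that jobs take turns in a fixed order; the function names and the 500 µs figure in the usage lines are illustrative, not values taken from the slides.

```python
def gang_schedule(jobs, num_slices):
    """Return which job all nodes run together in each time slice (round-robin)."""
    return [jobs[t % len(jobs)] for t in range(num_slices)]


def worst_case_wait_us(num_jobs, slice_us):
    """Longest time a job waits before it runs again under round-robin."""
    return (num_jobs - 1) * slice_us


# Usage: two jobs alternating over four slices; three jobs with 500 µs slices.
timeline = gang_schedule(["A", "B"], 4)
wait = worst_case_wait_us(3, 500)
```

The wait scales linearly with the slice length, so under these assumptions slices in the hundreds of microseconds keep every job's wait in the millisecond range even with several jobs sharing the machine.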
Toward a Parallel OS • BCS-MPI: real-time communication scheduling [Figure: timeline of one time slice (hundreds of µs): Global Strobe (time slice starts), Exchange of comm requirements, Global Synchronization, Communication scheduling, Global Synchronization, Real transmission, Global Strobe (time slice ends)]
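The three phases of a time slice (exchange of requirements, communication scheduling, real transmission) might look like the sketch below. It is a simplified model under stated assumptions: messages are `(src, dst, payload)` tuples, a send is eligible only if a matching receive has been posted, and each node sends and receives at most one message per slice. None of these rules is taken from the slides.

```python
def run_slice(sends, recvs):
    """Process one time slice.

    sends: list of (src, dst, payload) tuples; recvs: set of (src, dst) posted receives.
    Returns the delivered messages and the sends deferred to a later slice.
    """
    # Phase 1, exchange of communication requirements: match sends to receives.
    matched = [m for m in sends if (m[0], m[1]) in recvs]

    # Phase 2, communication scheduling: one outgoing and one incoming
    # message per node in this slice (an assumed policy).
    busy_src, busy_dst, scheduled = set(), set(), []
    for src, dst, payload in matched:
        if src not in busy_src and dst not in busy_dst:
            busy_src.add(src)
            busy_dst.add(dst)
            scheduled.append((src, dst, payload))

    # Phase 3, real transmission: only scheduled messages move.
    delivered = {(src, dst): payload for src, dst, payload in scheduled}
    deferred = [m for m in sends if m not in scheduled]
    return delivered, deferred


# Usage: two senders target node 1, and one send has no posted receive.
sends = [(0, 1, "a"), (2, 1, "b"), (3, 4, "c"), (5, 6, "d")]
recvs = {(0, 1), (2, 1), (3, 4)}
delivered, deferred = run_slice(sends, recvs)
```

Because every node applies the same phases in lockstep, the set of messages moved in each slice is determined by the schedule alone, which is the property the deck links to deterministic replay.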
Toward a Parallel OS • BCS-MPI: real-time communication scheduling