Architectural Support for System Software on Large-Scale Clusters

Architectural Support for System Software on Large-Scale Clusters
Juan Fernández (1,2), Eitan Frachtenberg (1), Fabrizio Petrini (1), Kei Davis (1) and José Carlos Sancho (1)
(1) Performance and Architecture Lab (PAL), CCS-3 Modeling, Algorithms and Informatics, Los Alamos National Laboratory, NM 87545, USA. URL: http://www.c3.lanl.gov
(2) Grupo de Arquitectura y Computación Paralelas (GACOP), Dpto. Ingeniería y Tecnología de Computadores, Universidad de Murcia, 30071 Murcia, SPAIN. URL: http://www.ditec.um.es
email: juanf@um.es

Motivation • Independent nodes / OSs are glued together by system software: • Parallel development and debugging tools • Resource management • Communications • Parallel file system • Fault tolerance • System software is a key factor in maximizing usability, performance and scalability on large-scale clusters!

Motivation • Why is System Software so important? • Parallel applications rely on services provided by system software • The quality of system software affects the performance, scalability, usability and responsiveness of parallel applications • Developing system software is a very time- and resource-consuming task • System software affects the cost of ownership of clusters

Motivation Why is System Software so complex? • Commodity hardware/OSs not designed for clusters • Hardware conceived for loosely-coupled environments • Local OSs lack global awareness of parallel applications • Complex global state • Non-deterministic behavior inherent to clusters (local OS scheduling) and applications (MPI_ANY_SOURCE) • Independent design of different system software components leads to redundant/missing functionality

Motivation How is System Software built? • System Software relies on an abstract network interface to move information between nodes • Performance expressed by latency and bandwidth • Interface can evolve to exploit new hardware capabilities What hardware features should the interconnection network provide to simplify system software?

Goals • Target • Simplifying design and implementation of the system software for large-scale clusters • Simplicity, performance, scalability, responsiveness • Backbone to integrate all nodes into a single, global OS • Approach • Least common denominator of all system software components as a basic set of three network primitives • Global synchronization/scheduling of all system activities • Vision • SIMD system running MIMD applications (variable granularity on the order of hundreds of µs)

Outline • Motivation and Goals • Introduction • Core Primitives • System Software Design • Case Studies • Concluding remarks

Introduction • Challenges in the Design of System Software

Introduction • Designing a Parallel OS: What we have… • Parallel applications run on top of independent system software components (MPI, resource management, fault tolerance, ...) • Each component brings its own network protocol • All protocols sit on an abstract network interface above the hardware

Introduction • Designing a Parallel OS: What we want… • Parallel applications run on top of a single, global OS (MPI, resource management, fault tolerance, ...) • The global OS is built on the core primitives • The core primitives sit directly above the hardware

Core Primitives • System software built atop three primitives • Xfer-And-Signal • Transfer block of data to a set of nodes • Optionally signal local/remote event upon completion • Compare-And-Write • Compare global variable on a set of nodes • Optionally write global variable on the same set of nodes • Test-Event • Poll local event
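The three primitives listed above can be sketched as a small in-memory model. This is only an illustration under stated assumptions, not the QsNet interface: the `Cluster` and `Node` classes, all method signatures, and the rule that the optional write happens only when the comparison holds on every node are hypothetical choices made for this example.

```python
import operator
from dataclasses import dataclass, field


@dataclass
class Node:
    """One simulated node: data buffers, global variables and pending events."""
    memory: dict = field(default_factory=dict)
    variables: dict = field(default_factory=dict)
    events: set = field(default_factory=set)


class Cluster:
    """Hypothetical in-memory stand-in for the network hardware."""

    def __init__(self, size):
        self.nodes = [Node() for _ in range(size)]

    def xfer_and_signal(self, src, dests, key, data,
                        local_event=None, remote_event=None):
        # Transfer a block of data to a set of nodes.
        for d in dests:
            self.nodes[d].memory[key] = data
            if remote_event is not None:
                self.nodes[d].events.add(remote_event)   # optional remote signal
        if local_event is not None:
            self.nodes[src].events.add(local_event)      # optional local signal

    def compare_and_write(self, dests, var, compare, value, write=None):
        # Compare a global variable on a set of nodes (unset variables read as 0).
        ok = all(compare(self.nodes[d].variables.get(var, 0), value)
                 for d in dests)
        # Assumption: the optional write is applied only if all nodes agree.
        if ok and write is not None:
            for d in dests:
                self.nodes[d].variables[var] = write
        return ok

    def test_event(self, node, event):
        # Poll a local event; a triggered event is consumed by the poll.
        events = self.nodes[node].events
        if event in events:
            events.remove(event)
            return True
        return False


cluster = Cluster(5)
cluster.xfer_and_signal(0, [1, 2, 3, 4], "buf", b"payload",
                        local_event="sent", remote_event="arrived")
all_zero = cluster.compare_and_write([1, 2, 3, 4], "epoch",
                                     operator.eq, 0, write=1)
```

In this sketch node 0 plays the role of S and nodes 1 to 4 the roles of D1 to D4 used on the following slides.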

Core Primitives • System software built atop three primitives • Xfer-And-Signal (QsNet): • Node S transfers block of data to nodes D1, D2, D3 and D4 • Events triggered at source (S) and destinations (D1, D2, D3 and D4)

Core Primitives • System software built atop three primitives • Compare-And-Write (QsNet): • Node S compares variable V on nodes D1, D2, D3 and D4 • Does V satisfy the chosen relational comparison (e.g. >) with Value?

Core Primitives • System software built atop three primitives • Compare-And-Write (QsNet): • Node S compares variable V on nodes D1, D2, D3 and D4 • Partial results are combined in the switches
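Combining partial results in the switches can be pictured as a tree reduction: each switch ANDs the comparison results of the nodes below it and forwards a single answer upward. The sketch below is a hypothetical model of that idea. The function name `compare_in_tree` and the switch radix of 4 are assumptions made for illustration and are not taken from the slides.

```python
import operator


def compare_in_tree(values, compare, target, radix=4):
    """AND-reduce per-node comparison results level by level.

    values  -- the value of variable V on each node (non-empty list)
    compare -- relational operator, e.g. operator.ge
    target  -- the Value that V is compared against
    radix   -- assumed number of inputs combined per switch
    Returns (global_result, number_of_switch_stages).
    """
    level = [compare(v, target) for v in values]   # one partial result per node
    stages = 0
    while len(level) > 1:
        # Each switch combines up to `radix` partial results into one.
        level = [all(level[i:i + radix]) for i in range(0, len(level), radix)]
        stages += 1
    return level[0], stages


# 16 nodes that all hold V = 3: the comparison V >= 3 holds everywhere.
result_all, stages_all = compare_in_tree([3] * 16, operator.ge, 3)

# One lagging node is enough to make the global result false.
result_one_behind, _ = compare_in_tree([3] * 15 + [2], operator.ge, 3)
```

Under these assumptions the source receives one combined answer after a number of stages that grows logarithmically with the node count, rather than one reply per node.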

System Software Design • … atop the three core primitives

System Software Design Can System Software really be built atop the Core Primitives?

Case Studies • Experimental Setup

Case Studies: Job Launching • Job Launching: send/execute/check completion • 40 times faster than the best reported result!
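The send/execute/check completion sequence can be sketched in terms of the three core primitives. The mapping used here (multicast of the program image with Xfer-And-Signal, polling with Test-Event on each node, completion detection with Compare-And-Write) is an assumption made for illustration, and `launch_job`, `run` and the dictionary fields are hypothetical names.

```python
def launch_job(nodes, image, run):
    """Simulate launching one job on a list of node dictionaries."""
    # 1. Send: modelled after Xfer-And-Signal, the image is multicast to
    #    every node and an arrival event is raised at each destination.
    for node in nodes:
        node["image"] = image
        node["events"].add("image_arrived")

    # 2. Execute: each node polls for the event (Test-Event) and then runs.
    for node in nodes:
        if "image_arrived" in node["events"]:
            node["events"].discard("image_arrived")
            node["exit_code"] = run(node["image"])
            node["done"] = 1

    # 3. Check completion: modelled after Compare-And-Write, a single
    #    global query asks whether done == 1 holds on all nodes.
    return all(node.get("done", 0) == 1 for node in nodes)


nodes = [{"events": set()} for _ in range(8)]
finished = launch_job(nodes, image=b"\x7fELF...", run=lambda img: 0)
```

The point of the sketch is that none of the three steps needs a per-node loop on the management side once the primitives are provided by the network.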

Case Studies: Job Scheduling • Job Scheduling: Gang Scheduling • Very small time slices: RESPONSIVENESS!
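Gang scheduling can be illustrated with a scheduling matrix in which each row assigns a job to every node and all nodes advance to the next row together at each time slice. This is a generic sketch of the technique, not STORM code. The function name `gang_timeline` and the matrix layout are assumptions.

```python
def gang_timeline(matrix, num_slices):
    """Return the job run by every node in each time slice.

    matrix[r][n] is the job that node n runs while row r is active
    (None means the node is idle in that row).
    """
    timeline = []
    for t in range(num_slices):
        # All nodes switch to the same row at the same time slice boundary.
        row = matrix[t % len(matrix)]
        timeline.append(tuple(row))
    return timeline


# Job A spans all four nodes; jobs B and C share the machine in the second row.
matrix = [
    ["A", "A", "A", "A"],
    ["B", "B", "C", "C"],
]
timeline = gang_timeline(matrix, 3)
```

With small time slices every job gets the processors again after a short, bounded wait, which is where the responsiveness comes from.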

Case Studies: BCS-MPI • Communication Library: BCS-MPI • Time slice (hundreds of µs), in order: • Global strobe (time slice starts) • Exchange of communication requirements • Global synchronization • Communication scheduling • Global synchronization • Real transmission • Global strobe (time slice ends)
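The phases of one time slice can be sketched as a single function that takes the send and receive requests posted so far. This is a simplified model under assumptions: the tuple formats, the name `run_time_slice`, and the rule that a send is scheduled only when its matching receive has already been posted are illustrative choices, not the BCS-MPI implementation.

```python
def run_time_slice(pending_sends, pending_recvs):
    """Simulate one time slice.

    pending_sends -- list of (src, dst, payload) posted before the strobe
    pending_recvs -- list of (dst, src) receives posted before the strobe
    Returns (delivered, deferred_sends, unmatched_recvs).
    """
    # Phase 1: exchange of communication requirements.
    # After this phase every request is assumed to be globally known.
    recvs = list(pending_recvs)

    # Phase 2: communication scheduling.
    # Assumption: a send is scheduled only if a matching receive exists.
    scheduled, deferred = [], []
    for src, dst, payload in pending_sends:
        if (dst, src) in recvs:
            recvs.remove((dst, src))
            scheduled.append((src, dst, payload))
        else:
            deferred.append((src, dst, payload))   # retried in a later slice

    # Phase 3: real transmission of the scheduled messages.
    delivered = {}
    for src, dst, payload in scheduled:
        delivered.setdefault(dst, []).append((src, payload))
    return delivered, deferred, recvs


delivered, deferred, unmatched = run_time_slice(
    pending_sends=[(0, 1, "a"), (2, 3, "b")],
    pending_recvs=[(1, 0)],
)
```

In this example the message from node 0 to node 1 is delivered in the current slice, while the message from node 2 to node 3 waits for a later slice because node 3 has not yet posted its receive.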

Case Studies: BCS-MPI • Global synchronization • Strobe sent at regular intervals (time slices) • Compare-And-Write + Xfer-And-Signal (Master) • Test-Event (Slaves) • All system activities are tightly coupled • Global information is required to schedule resources; global synchronization facilitates the task, but it is not enough • Global Scheduling • Exchange of communication requirements • Xfer-And-Signal + Test-Event • Communication scheduling • Real transmission • Xfer-And-Signal + Test-Event • Implementation in Network Interface Card
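The division of labour between master and slaves for the global strobe can be sketched as follows. The primitives named in the comments come from the slide. Everything else is an assumption made for illustration: the function names, the node dictionaries, and in particular the idea that the master checks that every node finished the previous slice before sending the next strobe.

```python
def master_strobe(nodes, slice_no):
    """Master side: Compare-And-Write followed by Xfer-And-Signal."""
    # Compare-And-Write: has every node completed the previous time slice?
    if not all(node["slice_done"] == slice_no - 1 for node in nodes):
        return False
    # Xfer-And-Signal: multicast the strobe and raise an event on each node.
    for node in nodes:
        node["current_slice"] = slice_no
        node["strobe_event"] = True
    return True


def slave_poll(node):
    """Slave side: Test-Event, then the work of the time slice."""
    if not node["strobe_event"]:
        return False
    node["strobe_event"] = False                  # the event is consumed
    node["slice_done"] = node["current_slice"]    # slice work would go here
    return True


nodes = [{"slice_done": 0, "current_slice": 0, "strobe_event": False}
         for _ in range(4)]

first = master_strobe(nodes, 1)        # all nodes are ready for slice 1
polled = [slave_poll(n) for n in nodes]
second = master_strobe(nodes, 2)       # all nodes finished slice 1
slave_poll(nodes[0])                   # only node 0 processes slice 2
third = master_strobe(nodes, 3)        # refused: three nodes are still behind
```

Keeping the slave side to a single poll is what makes an implementation in the network interface card plausible in this model.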

Case Studies: BCS-MPI • SWEEP3D and SAGE Performance (IA32) • Production-level MPI versus BCS-MPI • [Figure: two performance plots, annotated "2% speedup" and "0.5% speedup"]

Case Studies System Software atop three Core Primitives: Performance/Scalability with a simple design

Concluding Remarks • New abstract network interface for system software • Three communication primitives in the network hardware • Implement the basics of most system software components • Simplicity / Performance / Scalability / Responsiveness • Promising preliminary results demonstrate that scalable resource management and parallel application communication are indeed feasible

Future Work • Kernel-level implementation of the core primitives • User-level solution is already working • Deterministic replay of MPI programs • Ordered resource scheduling may enforce reproducibility • Transparent fault tolerance • Global coordination simplifies the state of the machine

Introduction • Designing a Parallel Operating System • Most system software components need to move information between nodes frequently • Fast and scalable unicast mechanism • All system activities must be tightly coupled by means of … • … global synchronization • Fast and scalable global synchronization mechanisms • … and global resource scheduling • Fast and scalable global information exchange Hardware support is key to perform these tasks scalably and at a sub-millisecond granularity

Case Studies • Experimental Setup • STORM (Scalable TOol for Resource Management) • Architecture: • Set of dæmons running on the management/compute nodes • Network abstraction layer based on the three core primitives • Functionality: • Job Launching • Job Scheduling (FCFS, gang scheduling and others) • New scheduling algorithms can be plugged in • Resource Accounting

Case Studies: BCS-MPI • Non-blocking primitives: MPI_Isend/Irecv

Case Studies: BCS-MPI • Blocking primitives: MPI_Send/Recv
