
Architectural Support for System Software on Large-Scale Clusters


Presentation Transcript


  1. Architectural Support for System Software on Large-Scale Clusters. Juan Fernández (1,2), Eitan Frachtenberg (1), Fabrizio Petrini (1), Kei Davis (1) and José Carlos Sancho (1). (1) Performance and Architecture Lab (PAL), CCS-3 Modeling, Algorithms and Informatics, Los Alamos National Laboratory, NM 87545, USA. URL: http://www.c3.lanl.gov. (2) Grupo de Arquitectura y Computación Paralelas (GACOP), Dpto. Ingeniería y Tecnología de Computadores, Universidad de Murcia, 30071 Murcia, SPAIN. URL: http://www.ditec.um.es. Email: juanf@um.es

  2. Motivation [Figure: a set of independent nodes, each running its own OS, glued together by system software.] • Independent nodes/OSs glued together by system software: parallel development and debugging tools, resource management, communications, parallel file system, fault tolerance • System software is a key factor to maximize usability, performance and scalability on large-scale clusters!!!

  3. Motivation • Why is System Software so important? • Parallel applications rely on services provided by system software • System software quality impacts performance, scalability, usability and responsiveness of parallel applications • Development of system software is a very time- and resource-consuming task • System software affects the cost of ownership of clusters

  4. Motivation Why is System Software so complex? • Commodity hardware/OSs not designed for clusters • Hardware conceived for loosely-coupled environments • Local OSs lack global awareness of parallel applications • Complex global state • Non-deterministic behavior inherent to clusters (local OS scheduling) and applications (MPI_ANY_SOURCE) • Independent design of different system software components leads to redundant/missing functionality

  5. Motivation How is System Software built? • System Software relies on an abstract network interface to move information between nodes • Performance expressed by latency and bandwidth • Interface can evolve to exploit new hardware capabilities What hardware features should the interconnection network provide to simplify system software?

  6. Goals • Target • Simplifying design and implementation of the system software for large-scale clusters • Simplicity, performance, scalability, responsiveness • Backbone to integrate all nodes into a single, global OS • Approach • Least common denominator of all system software components as a basic set of three network primitives • Global synchronization/scheduling of all system activities • Vision • SIMD system running MIMD applications (variable granularity in the order of hundreds of μs)

  7. Outline • Motivation and Goals • Introduction • Core Primitives • System Software Design • Case Studies • Concluding remarks

  8. Introduction • Challenges in the Design of System Software [Figure: diagram with SYSTEM SOFTWARE at its center.]

  9. Introduction • Designing a Parallel OS: What we have… [Figure: layered diagram. Parallel applications sit on top of independent system software components (MPI, resource management, fault tolerance, …), each with its own network protocol, over an abstract network interface and the hardware.]

  10. Introduction • Designing a Parallel OS: What we want… [Figure: layered diagram. Parallel applications sit on top of a single, global OS (MPI, resource management, fault tolerance, …) built atop the core primitives and the hardware.]

  11. Outline • Motivation and Goals • Introduction • Core Primitives • System Software Design • Case Studies • Concluding remarks

  12. Core Primitives • System software built atop three primitives • Xfer-And-Signal • Transfer block of data to a set of nodes • Optionally signal local/remote event upon completion • Compare-And-Write • Compare global variable on a set of nodes • Optionally write global variable on the same set of nodes • Test-Event • Poll local event
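The slide defines the three primitives only informally; the minimal C-style sketch below is one way their interfaces could look. All names and signatures (xfer_and_signal, compare_and_write, test_event, node_set_t, event_t, cmp_op_t) are hypothetical illustrations, not an API taken from the presentation or from the QsNet libraries.

```c
/* Hypothetical C interface for the three core primitives.
 * Names and signatures are illustrative only; the presentation
 * does not define a concrete API. */

#include <stdbool.h>
#include <stddef.h>

typedef struct node_set node_set_t;   /* a set of destination nodes   */
typedef struct event    event_t;      /* a local or remote event slot */

typedef enum { CMP_LT, CMP_EQ, CMP_GT } cmp_op_t;

/* Xfer-And-Signal: transfer a block of data to every node in 'dests';
 * optionally trigger 'src_ev' locally and 'dst_ev' on each destination
 * when the transfer completes. */
void xfer_and_signal(const void *buf, size_t len,
                     const node_set_t *dests,
                     event_t *src_ev, event_t *dst_ev);

/* Compare-And-Write: compare the global variable identified by 'var_id'
 * against 'value' on every node in 'nodes'; if the condition holds on
 * all of them, optionally write 'new_value' back on the same set. */
bool compare_and_write(int var_id, cmp_op_t op, long value,
                       const node_set_t *nodes,
                       bool do_write, long new_value);

/* Test-Event: non-blocking poll of a local event. */
bool test_event(const event_t *ev);
```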

  13. Core Primitives • System software built atop three primitives • Xfer-And-Signal (QsNet): • Node S transfers a block of data to nodes D1, D2, D3 and D4 • Events are triggered at the source and at the destinations [Figure: node S multicasting to D1, D2, D3 and D4, with a source event at S and destination events at D1–D4.]
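As a concrete reading of this scenario, the sketch below (using the hypothetical interface introduced after slide 12) multicasts a block and then polls the local completion event with Test-Event; it is an illustration, not code from the paper.

```c
/* Sketch: node S pushes a block to D1..D4 and waits for its own
 * completion event (hypothetical API from the earlier sketch). */
void multicast_block(const void *block, size_t len,
                     const node_set_t *d1_to_d4,
                     event_t *local_done, event_t *remote_done)
{
    /* One call delivers the data to every node in the set and arms
     * the events at both ends. */
    xfer_and_signal(block, len, d1_to_d4, local_done, remote_done);

    /* Test-Event only polls, so the caller spins (or interleaves other
     * work) until the source-side event fires. */
    while (!test_event(local_done))
        ;  /* computation could be overlapped here */
}
```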

  14. Core Primitives • System software built atop three primitives • Compare-And-Write (QsNet): • Node S compares variable V on nodes D1, D2, D3 and D4 • Is V {<, =, >} a given value? [Figure: node S querying the value of V on D1–D4.]

  15. Core Primitives • System software built atop three primitives • Compare-And-Write (QsNet): • Node S compares variable V on nodes D1, D2, D3 and D4 • Partial results are combined in the switches [Figure: per-node comparison results combined inside the network switches on their way back to S.]
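Because the per-node answers are combined in the switches, one Compare-And-Write can answer a global question such as "have all nodes reached phase P?". The sketch below uses the hypothetical interface from the earlier sketch and is illustrative only; the variable id is made up.

```c
/* Sketch: global test from a master node (hypothetical API).
 * Checks whether the global variable 'phase' equals 'p' on every node;
 * the reduction of the per-node results happens in the switches. */
bool all_nodes_in_phase(int phase_var_id, long p,
                        const node_set_t *all_nodes)
{
    /* Pure global test: no write-back of the variable is requested. */
    return compare_and_write(phase_var_id, CMP_EQ, p, all_nodes,
                             /* do_write  */ false,
                             /* new_value */ 0);
}
```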

  16. Outline • Motivation and Goals • Introduction • Core Primitives • System Software Design • Case Studies • Concluding remarks

  17. System Software Design • … atop the three core primitives

  18. System Software Design Can System Software really be built atop the Core Primitives?

  19. Outline • Motivation and Goals • Introduction • Core Primitives • System Software Design • Case Studies • Concluding remarks

  20. Case Studies • Experimental Setup

  21. Case Studies: Job Launching • Job Launching: send / execute / check completion • 40 times faster than the best reported result!!!
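One way the primitives could support this send/execute/check-completion sequence, sketched with the hypothetical API from the earlier sketches (the variable id and arguments are made up for illustration): the executable is multicast with a single Xfer-And-Signal, and completion is detected with a single global Compare-And-Write instead of gathering per-node status messages.

```c
/* Sketch: job launching from the management node (hypothetical API).
 * 1. Multicast the executable image to all compute nodes.
 * 2. Compute nodes run the job and set a local 'done' flag on exit.
 * 3. The master detects completion with one global comparison. */
void launch_job(const void *image, size_t image_len,
                const node_set_t *compute_nodes,
                event_t *image_arrived, int done_var_id)
{
    /* Step 1: one multicast replaces a tree of point-to-point sends. */
    xfer_and_signal(image, image_len, compute_nodes, NULL, image_arrived);

    /* Step 3: "has every node finished?" is a single Compare-And-Write
     * whose partial results are combined in the switches. */
    while (!compare_and_write(done_var_id, CMP_EQ, 1 /* job exited */,
                              compute_nodes, false, 0))
        ;  /* back off or do other management work between polls */
}
```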

  22. Case Studies: Job Scheduling • Job Scheduling: gang scheduling with very small time slices: RESPONSIVENESS!!!
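Very small time slices are plausible because the context-switch command itself can be multicast: at each slice boundary the management node can announce the next job to every node at once. The loop below is a hypothetical sketch in that spirit, again using the illustrative API from the earlier sketches; the slice timer and schedule representation are assumptions, not the STORM implementation.

```c
#include <unistd.h>   /* usleep, for the illustrative slice timer */

/* Sketch: gang-scheduler master loop (hypothetical API and parameters).
 * At every time slice the master multicasts the id of the job to run
 * next; compute nodes poll their strobe event and switch together. */
void gang_scheduler_master(const node_set_t *nodes, event_t *strobe_ev,
                           const int *schedule, int nslots,
                           unsigned slice_usec)
{
    for (int slot = 0; ; slot = (slot + 1) % nslots) {
        int next_job = schedule[slot];

        /* Announce the next job to every node in one multicast. */
        xfer_and_signal(&next_job, sizeof next_job, nodes, NULL, strobe_ev);

        /* Time slices on the order of hundreds of microseconds. */
        usleep(slice_usec);
    }
}
```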

  23. Case Studies: BCS-MPI • Communication Library: BCS-MPI • Each time slice (hundreds of μs) proceeds as follows: • Global Strobe (time slice starts) • Exchange of communication requirements • Global Synchronization • Communication scheduling • Global Synchronization • Real transmission • Global Strobe (time slice ends)

  24. Case Studies: BCS-MPI • Global synchronization • Strobe sent at regular intervals (time slices) • Compare-And-Write + Xfer-And-Signal (Master) • Test-Event (Slaves) • All system activities are tightly coupled • Global information is required to schedule resources; global synchronization facilitates the task but is not enough • Global Scheduling • Exchange of communication requirements • Xfer-And-Signal + Test-Event • Communication scheduling • Real transmission • Xfer-And-Signal + Test-Event • Implementation in the Network Interface Card
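Slides 23 and 24 together describe one BCS-MPI time slice and its mapping onto the primitives. The loop below is a hypothetical host-side rendering of that slice using the illustrative API from the earlier sketches; the real implementation runs in the network interface card, and the event and buffer names are assumptions.

```c
/* Sketch: one BCS-MPI time slice as seen by a slave node
 * (hypothetical API; the paper implements this in the NIC). */
void bcs_mpi_time_slice(const node_set_t *partners,
                        event_t *strobe_ev, event_t *reqs_ev,
                        event_t *xmit_done_ev,
                        const void *comm_reqs, size_t reqs_len)
{
    /* 1. Global strobe opens the slice: slaves detect it with Test-Event. */
    while (!test_event(strobe_ev))
        ;

    /* 2. Send our communication requirements to the partner nodes and
     *    wait until their requirements have arrived at our local event. */
    xfer_and_signal(comm_reqs, reqs_len, partners, NULL, reqs_ev);
    while (!test_event(reqs_ev))
        ;

    /* 3. Schedule the communications for this slice (decision omitted),
     *    then perform the real transmissions that were granted, e.g.
     *    one xfer_and_signal(...) per scheduled message. */
    while (!test_event(xmit_done_ev))
        ;

    /* 4. The next global strobe closes this slice and opens the next. */
}
```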

  25. Case Studies: BCS-MPI • SWEEP3D and SAGE Performance (IA32) • Production-level MPI versus BCS-MPI [Charts comparing the two versions; annotations: 2% SPEEDUP and 0.5% SPEEDUP.]

  26. Case Studies System Software atop three Core Primitives: Performance/Scalability with a simple design

  27. Outline • Motivation and Goals • Introduction • Core Primitives • System Software Design • Case Studies • Concluding remarks

  28. Concluding Remarks • New abstract network interface for system software • Three communication primitives in the network hardware • Implement the basics of most system software components • Simplicity / Performance / Scalability / Responsiveness • Promising preliminary results demonstrate that scalable resource management and parallel application communication are indeed feasible

  29. Future Work • Kernel-level implementation of the core primitives • User-level solution is already working • Deterministic replay of MPI programs • Ordered resource scheduling may enforce reproducibility • Transparent fault tolerance • Global coordination simplifies the state of the machine

  30. Architectural Support for System Software on Large-Scale Clusters. Juan Fernández (1,2), Eitan Frachtenberg (1), Fabrizio Petrini (1), Kei Davis (1) and José Carlos Sancho (1). (1) Performance and Architecture Lab (PAL), CCS-3 Modeling, Algorithms and Informatics, Los Alamos National Laboratory, NM 87545, USA. URL: http://www.c3.lanl.gov. (2) Grupo de Arquitectura y Computación Paralelas (GACOP), Dpto. Ingeniería y Tecnología de Computadores, Universidad de Murcia, 30071 Murcia, SPAIN. URL: http://www.ditec.um.es. Email: juanf@um.es

  31. Introduction • Designing a Parallel Operating System • Most system software components need to move information between nodes frequently • Fast and scalable unicast mechanism • All system activities must be tightly coupled by means of … • … global synchronization • Fast and scalable global synchronization mechanisms • … and global resource scheduling • Fast and scalable global information exchange Hardware support is key to perform these tasks scalably and at a sub-millisecond granularity

  32. Case Studies • Experimental Setup • STORM (Scalable TOol for Resource Management) • Architecture: • Set of dæmons running on the management/compute nodes • Network abstract layer based on the three core primitives • Functionality: • Job Launching • Job Scheduling (FCFS, gang scheduling and others) • New scheduling algorithms can be plugged in • Resource Accounting

  33. System Software Design • … atop the three core primitives

  34. Case Studies: BCS-MPI • Non-blocking primitives: MPI_Isend/Irecv

  35. Case Studies: BCS-MPI • Blocking primitives: MPI_Send/Recv
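Slides 34 and 35 carry only their titles in this transcript. Assuming the time-sliced protocol of slides 23 and 24, one plausible reading is that a blocking MPI_Send posts a send descriptor and then waits for the slice in which the message is actually scheduled and transmitted, while MPI_Isend returns immediately and completion is checked later. The sketch below expresses that assumption with the hypothetical API from the earlier sketches; it is not the paper's implementation, and post_send_descriptor is an invented helper.

```c
/* Sketch (assumption, not the paper's code): mapping a blocking send
 * onto the time-sliced BCS-MPI protocol with the hypothetical API. */

/* Hypothetical helper: queue a communication requirement for the
 * scheduling phase of the next time slice. */
void post_send_descriptor(const void *buf, size_t len,
                          const node_set_t *dest);

int sketch_blocking_send(const void *buf, size_t len,
                         const node_set_t *dest,
                         event_t *granted_ev, event_t *done_ev)
{
    /* The requirement is considered at the next scheduling phase. */
    post_send_descriptor(buf, len, dest);

    /* Block until the scheduler grants the transfer ... */
    while (!test_event(granted_ev))
        ;
    /* ... and until the real transmission of that slice completes. */
    while (!test_event(done_ev))
        ;
    return 0;
}
```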
