System Software: It Slices, Dices, and makes Julienne Fries! • You have exascale problems? • Load Balancing? • Failure? • Power Management? • My system software will solve these problems
Example: Fault Tolerance • Coordinated checkpointing to the traditional parallel file system won’t scale • Checkpoint commit time approaches node MTBF => application efficiency drops quickly
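The efficiency drop can be made concrete with a first-order model. The sketch below is not from the talk: it assumes independent node failures (so system MTBF is node MTBF divided by node count), uses Young's approximation for the checkpoint interval, and ignores restart time. The function names and sample numbers are illustrative, and the approximation is only meaningful while commit time is well below the MTBF.

```python
import math

def system_mtbf(node_mtbf_s: float, nodes: int) -> float:
    """System MTBF, assuming independent node failures (an assumption)."""
    return node_mtbf_s / nodes

def young_interval(commit_s: float, mtbf_s: float) -> float:
    """Young's first-order optimal compute time between checkpoints."""
    return math.sqrt(2.0 * commit_s * mtbf_s)

def efficiency(commit_s: float, mtbf_s: float) -> float:
    """Approximate fraction of time spent on useful work.

    Waste = time writing checkpoints + expected rework after a failure.
    Restart cost is ignored; the result is clamped at zero.
    """
    tau = young_interval(commit_s, mtbf_s)
    waste = commit_s / tau + tau / (2.0 * mtbf_s)
    return max(0.0, 1.0 - waste)

if __name__ == "__main__":
    node_mtbf = 5 * 365 * 24 * 3600.0   # hypothetical: 5 years per node
    commit = 15 * 60.0                  # hypothetical: 15-minute commit
    for nodes in (1_000, 10_000, 100_000):
        m = system_mtbf(node_mtbf, nodes)
        print(f"{nodes:>7} nodes: MTBF {m / 3600:7.1f} h, "
              f"efficiency {efficiency(commit, m):.2f}")
```

Holding commit time fixed while adding nodes shrinks the MTBF, which is the regime the slide describes.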
rMPI: Replicated (not Ronco™) MPI • Each MPI rank runs as two replica processes; the job fails only if both replicas of the same rank fail • Handles full MPI semantics at scale • Ferreira, et al. SC 2011.
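A small Monte Carlo sketch shows why duplicating every rank stretches the time to interrupt. It is an illustration under stated assumptions, not rMPI itself: each failure hits a uniformly random live replica, failed replicas are never restarted, and all names are hypothetical.

```python
import random

def failures_until_interrupt(ranks: int, rng: random.Random) -> int:
    """Count replica failures until some rank has lost both replicas.

    Assumes each failure hits a uniformly random live replica and that
    failed replicas are never repaired or restarted.
    """
    half_dead = 0  # ranks that have already lost exactly one replica
    failures = 0
    while True:
        failures += 1
        live = 2 * ranks - half_dead
        # Fatal only if the failure lands on the surviving twin
        # of a rank that is already half dead.
        if rng.random() < half_dead / live:
            return failures
        half_dead += 1

def mean_failures(ranks: int, trials: int = 500, seed: int = 0) -> float:
    """Average number of failures absorbed before the job is interrupted."""
    rng = random.Random(seed)
    total = sum(failures_until_interrupt(ranks, rng) for _ in range(trials))
    return total / trials

if __name__ == "__main__":
    for ranks in (100, 10_000):
        print(ranks, "ranks:", mean_failures(ranks), "failures on average")
```

Without replication the first failure interrupts the job; with it, the number of failures absorbed grows as ranks are added, at the cost of running every rank twice.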
Only two low, low payments of… Your machine power budget and hardware acquisition budget (*) Act now, and you’ll get twice the capacity computing functionality for FREE! (*) plus contracting and granting
What are you trying to sell me? • Costs and benefits are really easy to understand • Large and node-scalable reduction in system mean time to interrupt (MTTI) • Using it as the primary fault tolerance technique means twice the power consumption on capability problems • Buying twice the number of nodes is also quite painful • SC13 Panel: “Replication is too expensive…We [as a community] will have failed if we can't do better than that.” – Marc Snir
Everything’s A Nail: Why you don’t want system software to solve your problems (if you can help it) • Patrick G. Bridges • bridges@cs.unm.edu • April 22, 2014
I have a hammer! • Theorem: Every individual complete system-level solution to an application exascale problem is “too expensive” for some real workload • Rationale • OS doesn’t know your application • General solutions are expensive • Specialized solutions have limited power or applicability
“Simple” Resilience Solutions • Save us, vendors! • Adding reliability on the compute and control path is potentially hardware-intensive • How much to pay in transistors, power, and $$? • While stepping off the commodity price/performance curve… • Burst Buffers • How much budget to spend on the I/O system? • Memory is a scarce resource at exascale • NVRAM and network bandwidth aren’t free in power • Some nice recent work in this area
Asynchronous Checkpointing • Idea: Each node checkpoints when most convenient and out of sync with other nodes • Benefit: get checkpointing off the peak B/W curve onto the sustained B/W curve • Has some (low) obvious costs, some less obvious costs
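The peak-versus-sustained point comes down to simple arithmetic. The sketch below is an idealization, not a measurement: it assumes every node writes a checkpoint of the same size, that coordinated checkpoints must all land within a fixed commit window, and that asynchronous checkpoints are spread perfectly evenly over the checkpoint interval. Names and numbers are hypothetical.

```python
def coordinated_peak_bw(nodes: int, ckpt_bytes: float,
                        commit_window_s: float) -> float:
    """Bandwidth needed when all nodes write inside one commit window."""
    return nodes * ckpt_bytes / commit_window_s

def staggered_sustained_bw(nodes: int, ckpt_bytes: float,
                           interval_s: float) -> float:
    """Bandwidth needed when writes are spread evenly over the interval."""
    return nodes * ckpt_bytes / interval_s

if __name__ == "__main__":
    nodes = 100_000          # hypothetical machine size
    ckpt = 16e9              # hypothetical 16 GB checkpoint per node
    window = 5 * 60.0        # hypothetical 5-minute commit window
    interval = 60 * 60.0     # hypothetical 1-hour checkpoint interval
    peak = coordinated_peak_bw(nodes, ckpt, window)
    sustained = staggered_sustained_bw(nodes, ckpt, interval)
    print(f"coordinated peak : {peak / 1e12:.2f} TB/s")
    print(f"async sustained  : {sustained / 1e12:.2f} TB/s")
    print(f"ratio            : {peak / sustained:.0f}x")
```

Under these assumptions the ratio is the interval divided by the commit window; the less obvious costs the slide mentions (for example, recovering a consistent state from unsynchronized checkpoints) are not modeled.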
Async. checkpointing approaches are highly application-dependent • Ferreira, et al. In submission. • Note how bimodal these performance curves are! • Clustered asynchronous checkpointing may hold promise here • [Figure panels: Apps and Benchmarks; Proxy Applications]
Checkpoint-avoidance Systems • Levy, et al. In submission. • [Figure annotation: “Cheap and powerful is here”]
Still have to solve the problems • No one inexpensive technique is enough, but each solves part of the problem • System software must stop trying to “rescue” the application and work with the application • Application/runtime can cover part of the space • System software can provide “last resort” solutions when the application cannot easily recover • The right solution is application- and hardware-dependent • Like it is for linear solvers and load balancing • Not just a resilience issue
What do we need to enable this? • Characterization of techniques at scale • Continued development of new techniques • Good decision support • Yet more knobs someone needs to turn • Many of the tradeoffs are non-linear, stochastic, etc. • Different problem areas interact “interestingly” • Complex influence on acquisition decisions, too • Clean interfaces to runtime and application • “From a runtime developer’s perspective, the way that current operating systems manage resources is fundamentally broken” – Mike Bauer, Legion project
Current OSes are Helicopter Parents • Linux (like OSF/1) will solve all your problems for you • Whether you like it or not • While making sure you can’t do the things you (think you) should do • Which is fine, as long as you don’t need to do anything interesting
Exascale OS must be your partner • Runtimes: “…it is the OS's job to provide mechanism and stay out of the way…” • Sandia lightweight kernels: “The QK provides mechanism, PCT encapsulates policy” • Go ahead and try – if you fall, I’ll catch you
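One way to picture the mechanism/policy split is a toy core allocator. Everything here is hypothetical and is not the QK/PCT interface: the kernel-side class only tracks ownership and enforces isolation, while the placement decision is an ordinary function supplied by the runtime.

```python
from typing import Dict, List

class CoreMechanism:
    """Hypothetical kernel-side mechanism: tracks ownership, enforces isolation."""

    def __init__(self, num_cores: int) -> None:
        self.num_cores = num_cores
        self.owner: Dict[int, str] = {}  # core id -> owning task

    def free_cores(self) -> List[int]:
        return [c for c in range(self.num_cores) if c not in self.owner]

    def apply(self, assignment: Dict[str, List[int]]) -> None:
        """Grant exactly what the policy asked for, or reject it whole."""
        requested = [c for cores in assignment.values() for c in cores]
        if len(requested) != len(set(requested)):
            raise ValueError("core assigned twice")
        for c in requested:
            if c in self.owner or not 0 <= c < self.num_cores:
                raise ValueError(f"core {c} unavailable")
        for task, cores in assignment.items():
            for c in cores:
                self.owner[c] = task

def first_fit(free: List[int], demand: Dict[str, int]) -> Dict[str, List[int]]:
    """Example runtime-side policy: hand out free cores in order."""
    out: Dict[str, List[int]] = {}
    pool = list(free)
    for task, n in demand.items():
        out[task], pool = pool[:n], pool[n:]
    return out

if __name__ == "__main__":
    kernel = CoreMechanism(num_cores=8)
    kernel.apply(first_fit(kernel.free_cores(), {"solver": 4, "io": 2}))
    print(kernel.owner)
```

The runtime can swap `first_fit` for any policy it likes; the mechanism never second-guesses the choice and only refuses requests that would break isolation, which is the "if you fall, I'll catch you" role.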
Lightweight OSes for Next-generation Systems • Applications more complex than when the LWK was originally designed • Users want more complex interfaces and services • Runtimes still want low-level hardware access • But we still have to provide some level of isolation • As well as backstop mechanisms in cooperation with hardware • Two predominant approaches: • Composite OS (Fused OS, MAHOS, Argo OS/R, etc.) • Virtualization (Kitten+Palacios VMM, Hobbes OS/R)
Why Virtualization? • Safe low-level hardware access for runtime systems • Supports bringing your own OS with you • Don’t have to muck with the insides of Linux • Can be very fast • [Figures: CTH on Palacios/Kitten on Red Storm; HPCC FFT over virtualized 10GbE]
Virtualization in Hobbes OS/R • Multiple virtualization architectures, not just one • Pick the point on the spectrum that provides the mechanisms your application/runtime needs • Interesting research challenges on the right mechanisms and interfaces to provide at and between each point • [Diagram: design points placed along an axis from Lightest Weight to Heaviest Weight] • Full HW VM: runs unmodified guest OSes, passthru (Palacios, KVM, …) • LWK Virtual Linux Environment (Kitten, CNK) • Fused OS: multiple native OSes (Pisces, Argo) • LWK Custom (Catamount, HybridVM) • Para-virtual Implicit: VMM changes guest OS (Gears, Guarded Modules) • Para-virtual Explicit: guest OS modified or augmented (Orig. Xen, Device Drivers) • Software Virt: emulate HW, binary translation, … (QEMU, VMware, Emulate HW TransMemory pre-product)
Now that I’ve punted on policy… • Assumption is that the runtime (and/or virtualized OS) will do this for the LWK • Is a semi-static policy + local (HW or runtime) adaptation sufficient? • Or global dynamic adaptive runtime system that sets policy and resource allocation for millions of cores? • With low overhead and application interference? • “Burning a core” probably not viable at this problem size? • Heuristics vs. more disciplined methods? • I want to believe but I have yet to see it • Distributed, Decentralized • Must be robust and efficient • Can we tolerate imperfect and unfair?
Conclusions • No, the application and runtime really shouldn’t expect the OS to rescue them • System software can and should provide a range of modest, inexpensive mechanisms • Which can backstop the app when it can’t rescue itself • Need well-quantified performance for techniques • On real legacy and next-generation workloads • Virtualization can give the runtime the low-level mechanisms it wants inexpensively
Acknowledgements • Colleagues, collaborators and students on this work • UNM: Dorian Arnold, Scott Levy, Cui Zheng • Sandia: Ron Brightwell, Kurt Ferreira, Kevin Pedretti, Patrick Widener • Northwestern: Peter Dinda, Lei Xia • Oak Ridge: Barney Maccabe • Pittsburgh: Jack Lange
Acknowledgements (cont’d) This work was supported in part by: • DOE Office of Science, Advanced Scientific Computing Research, under award number DE-SC0005050, program manager Sonia Sachs • Sandia National Labs including funding from the Hobbes project, which is funded by the 2013 Exascale Operating and Runtime Systems Program from the DOE Office of Science, Advanced Scientific Computing Research • Sandia is a multi-program laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy under contract DE-AC04-94AL85000 • U.S. National Science Foundation Awards CNS-0709168 and CNS-0707365