Programming Models and FastOS
Bill Gropp and Rusty Lusk
Application View of the OS
• The application makes use of the programming model (may be calls, may be compiler-generated)
• It considers all calls the same, or at most distinguishes libc + the programming model
• Deciding what goes in the runtime and what goes in the OS is very important
  • But not to the application or (for the most part) the programming model
  • Just make it fast and correct
[Diagram: software stack - Application, Programming Model, Node Runtime, Operating System]
It's about US
Parallel Programming Models
• Shared-nothing
  • Typically communicating OS processes, but need not be OS processes
  • Need the appearance of a separate address space (as visible to the programmer) and program counter
  • Need not have separate OS entries for each "application process"
• Shared-all
  • Typically either OS processes with shared address spaces or one process + threads
  • Need not involve the OS in each thread of control
• Processes and processors are different
  • A single "application process" could use many processors
  • (and what is a "processor"?)
Some Needs of Programming Models
• Job startup
  • Acquire and set up resources
• Job rundown
  • Release resources, even on abnormal exit
• Scheduling
  • Schedule as a job, to match collective operations
• Communication support
  • Allocate and manage resources
• Control
  • Signals, interaction with other jobs, external services
Locality of Calls from the Application Viewpoint
• Local only
  • Affects resources on the processing element
• Collective
  • All processes (or any subset) perform a coordinated operation, such as file access or a "symmetric malloc"
• Independent non-local
  • Uncoordinated access to an externally managed resource, such as a file system or network
  • A potential scalability hazard
  • Two important subsets: cacheable and noncacheable
  • (A minimal sketch of the three classes follows)
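To make the three classes concrete, here is a minimal C sketch; the particular calls are illustrative choices, not from the slides.

  #include <fcntl.h>
  #include <stdlib.h>
  #include <unistd.h>
  #include <mpi.h>

  /* Illustrative only: one call from each locality class. */
  void locality_examples(MPI_Comm comm)
  {
      /* Local only: touches resources on this processing element. */
      void *buf = malloc(1 << 20);

      /* Collective: every process in comm participates in one
         coordinated operation. */
      MPI_File fh;
      MPI_File_open(comm, "mesh.dat", MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
      MPI_File_close(&fh);

      /* Independent non-local: uncoordinated access to an externally
         managed resource (here, a file that might live on NFS); a
         scalability hazard when thousands of processes do it at once.
         Read-only data like this is cacheable. */
      int fd = open("/etc/hosts", O_RDONLY);
      if (fd >= 0) close(fd);
      free(buf);
  }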
Local Calls
[Diagram: the app on a node making calls into the NodeOS, which manages the node's resources]
Independent Non-Local Calls
[Diagram: apps on many nodes independently hitting one remote service, which is overwhelmed ("Argh!!!")]
Collective Calls
[Diagram: apps on many nodes funneling through a collective-management layer to the remote service]
• Note that collective calls can be implemented (but not efficiently) with non-local independent calls; a sketch of the scalable form follows
• Metrics are needed to identify and measure scalability goals
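As a hedged sketch of the funneling idea in the diagram: one process contacts the remote service and the result is spread with a scalable collective. The helper query_remote_service is hypothetical.

  #include <mpi.h>

  void query_remote_service(char *buf, int len);  /* hypothetical helper */

  /* Sketch: a "collective" query to a remote service, implemented by
     funneling through rank 0 instead of N uncoordinated requests. */
  void collective_query(MPI_Comm comm, char *result, int len)
  {
      int rank;
      MPI_Comm_rank(comm, &rank);
      if (rank == 0)
          query_remote_service(result, len);      /* one request, not N */
      MPI_Bcast(result, len, MPI_CHAR, 0, comm);  /* O(log N) fan-out */
  }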
Job Startup
• Independent read of the executable and shared libraries
• Hack (useful, but still a hack)
  • Capture the file accesses on startup and provide a scalable distribution of the needed data (sketched below)
• Better solution
  • Define the operations as collective: a "collective exec"
• In-between solution
  • Define them as non-local, independent operations on cacheable (read-only) data
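A minimal sketch of the "capture and distribute" hack, assuming rank 0 can read the executable from the shared file system and each node writes a node-local copy; the paths and names are illustrative.

  #include <mpi.h>
  #include <stdio.h>
  #include <stdlib.h>

  /* Sketch only: rank 0 reads the executable once; everyone else gets
     it via broadcast and writes a node-local copy (e.g., in /tmp). */
  void distribute_executable(MPI_Comm comm, const char *shared_path,
                             const char *local_path)
  {
      int rank;
      long size = 0;
      char *image = NULL;
      FILE *f;

      MPI_Comm_rank(comm, &rank);
      if (rank == 0) {                  /* one read of the shared FS */
          f = fopen(shared_path, "rb");
          fseek(f, 0, SEEK_END);
          size = ftell(f);
          rewind(f);
          image = malloc(size);
          fread(image, 1, size, f);
          fclose(f);
      }
      MPI_Bcast(&size, 1, MPI_LONG, 0, comm);
      if (rank != 0)
          image = malloc(size);
      MPI_Bcast(image, (int)size, MPI_BYTE, 0, comm);

      f = fopen(local_path, "wb");      /* node-local staging copy */
      fwrite(image, 1, size, f);
      fclose(f);
      free(image);
  }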
Many Other Examples
• Collective scheduling and signaling (avoid batch)
• gettimeofday
• It is not practical to implement a special-case solution for each system call
  • Must identify a few strategies and apply them
Implementing Independent Non-Local Calls
• For each routine, implement special caching code
  • Example: a DNS cache
• More interesting approach: exploit techniques used for coherent shared-memory caches (sketched below)
  • Virtual "system pages" (a kind of distributed, coherent /proc)
  • Read-only references can be cached
  • Write references must invalidate
  • Special case: allow no consistency if desired (e.g., NFS semantics)
  • Syscalls could choose, but must be consistent by default. Correctness über alles!
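A hedged sketch of one "virtual system page" with the read-cache/write-invalidate rule; the server-side helpers are hypothetical, and a real design would also need distributed invalidation across nodes.

  #include <pthread.h>
  #include <string.h>

  void fetch_page_from_server(char *buf);      /* hypothetical helper */
  void write_page_to_server(const char *buf);  /* hypothetical helper */

  /* One cached "system page" (think: a distributed, coherent /proc). */
  typedef struct {
      int             valid;       /* coherence state: valid/invalid */
      char            data[4096];  /* the cached page                */
      pthread_mutex_t lock;
  } syspage;

  /* Read-only reference: served locally once the page is cached. */
  void syspage_read(syspage *p, char *out)
  {
      pthread_mutex_lock(&p->lock);
      if (!p->valid) {
          fetch_page_from_server(p->data);  /* miss: fetch and cache */
          p->valid = 1;
      }
      memcpy(out, p->data, sizeof p->data);
      pthread_mutex_unlock(&p->lock);
  }

  /* Write reference: goes to the server and invalidates the copy. */
  void syspage_write(syspage *p, const char *in)
  {
      pthread_mutex_lock(&p->lock);
      write_page_to_server(in);
      p->valid = 0;   /* local invalidate; peer nodes must be told too */
      pthread_mutex_unlock(&p->lock);
  }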
Exploiting a Cached-Data View
• A shared OS-space approach provides a common way to support scalable implementation of independent non-local calls
• Caching algorithms provide guidance for both the implementation and the definition: routines should provide useful operations that can be implemented efficiently
• For operations without a useful caching strategy, use the caching model to implement flow control
  • Provides a naturally distributed approach
  • Must be careful of faults!
• Is this the answer?
  • Research will tell us; it does point out one possible strategy
Case Study of Collective Operations
• MPI I/O provides an example of the benefit of collective semantics
• MPI I/O is not POSIX; however, it is well defined and provides precise semantics that match applications' needs (unlike NFS)
• The benefit is large (100x in some cases)
• More than just collective
MPI Code to Write a Distributed Mesh to a Single File
• MPI datatypes define the memory layout and the placement in the file
• A collective write provides scalable, correct output of data from multiple processes to a single file
• (The slide elides most arguments; the ones shown below are representative)

  MPI_File_open(comm, filename, MPI_MODE_CREATE | MPI_MODE_WRONLY,
                MPI_INFO_NULL, &fh);
  /* File layout: this process's block of the global array */
  MPI_Type_create_subarray(ndims, gsizes, lsizes, starts,
                           MPI_ORDER_C, MPI_DOUBLE, &subarray);
  MPI_Type_commit(&subarray);
  /* Memory layout: the local array, possibly with ghost cells */
  MPI_Type_vector(count, blocklen, stride, MPI_DOUBLE, &memtype);
  MPI_Type_commit(&memtype);
  MPI_File_set_view(fh, 0, MPI_DOUBLE, subarray, "native", MPI_INFO_NULL);
  MPI_File_write_all(fh, A, 1, memtype, MPI_STATUS_IGNORE);
  MPI_File_close(&fh);
The Four Levels of Access
[Figure: file space vs. processes 0-3; Level 0 (independent), Level 1 (collective along one axis), Level 2, Level 3 (collective along both axes)]
Distributed Array Access: Write Bandwidth
[Chart: write bandwidth on several platforms, 8 to 256 processes; array size 512 x 512 x 512]
Unstructured Code: Read Bandwidth
[Chart: read bandwidth on several platforms, 8 to 256 processes]
MPI's Collective I/O Calls
• Includes the usual: Open, Close, Seek, Get_position
• Includes collective versions: Read_all, Write_all
• Includes thread-safe versions: Read_at_all, Write_at_all
• Includes nonblocking (split-collective) versions: Read_all_begin/end, Write_all_begin/end, Read_at_all_begin/end, Write_at_all_begin/end (sketched below)
• Includes general data patterns
  • The application can make a single system call instead of many
  • Only a few types cover very general patterns
• Includes explicit coherency control: MPI_File_sync, MPI_File_set_atomicity
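For instance, the split-collective (begin/end) form lets each process overlap the collective write with computation. This sketch reuses fh, A, and memtype from the earlier mesh example; do_local_work is a hypothetical placeholder, and the buffer must not be modified between begin and end.

  MPI_Status status;
  MPI_File_write_all_begin(fh, A, 1, memtype);  /* start collective write */
  do_local_work();                              /* overlap computation    */
  MPI_File_write_all_end(fh, A, &status);       /* complete the write     */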
MPI as Init
• MPI provides a rich set of collective operations
  • Includes collective process creation: MPI_Comm_spawn
  • Parallel I/O
• Sample MPI "init" process loop (sketch; a concrete spawn call follows):

  while (1) {
      /* receive the next syscall request */
      recv(syscall);
      switch (syscall_id) {
      case PEXEC:
          /* ... use MPI_Comm_split to create a communicator for
             this process creation ... */
          /* ... use MPI file I/O to move the executable to the nodes ... */
          MPI_Comm_spawn(...);  /* create the processes */
          /* ... remember the new intercommunicator as the handle
             for these processes ... */
          break;
      /* ... */
      }
  }
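The spawn step in the loop above could look like this concrete call; the executable name and process count are placeholders supplied by the caller.

  #include <mpi.h>

  /* Sketch: create nprocs new processes running exe and return the
     intercommunicator that serves as the handle for the job. */
  void spawn_job(const char *exe, int nprocs, MPI_Comm *children)
  {
      MPI_Comm_spawn(exe, MPI_ARGV_NULL, nprocs, MPI_INFO_NULL,
                     0 /* root */, MPI_COMM_SELF, children,
                     MPI_ERRCODES_IGNORE);
  }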
What's Missing
• Process control (e.g., signals, ptrace)
  • Some of this was considered; see the MPI "Journal of Development"
• Wait-for-any (no probe on "any intercommunicator", analogous to wait or poll)
  • A "system"-like operation, not normally appropriate for a well-designed MPI program
• Precisely defined error actions (not inconsistent with the MPI spec, but because they are not defined there, they would need to be added)
What's Not Missing
• Most of I/O
  • But directory operations are missing
• Process creation
• Fault tolerance
  • The MPI spec is relatively friendly to fault tolerance, more so than current implementations
• Scalability
  • Most (all common) routines are scalable
• Thread safety
  • E.g., MPI_File_read_at; no global state or (non-constant) global objects
• Most communication
• Could implement much of an OS
Linux System Calls
[Table of Linux system calls]
Should an OS Be Implemented with MPI?
• Probably not (but it would be interesting to see how close you could come and what else you'd need)
• But many of the concepts used in MPI are applicable
  • Do not reinvent; exploit existing technologies
  • Use an open process to ensure a solid design
  • Encourage simultaneous experimentation and development
  • Build code that can be run, and have a testbed to run it on