810 likes | 1.05k Views
Parallel I/O. Common Ways of Doing I/O in Parallel Programs. Sequential I/O: All processes send data to rank 0, and 0 writes it to the file. Pros and Cons of Sequential I/O. Pros: parallel machine may support I/O from only one process (e.g., no common file system)
E N D
Common Ways of Doing I/O in Parallel Programs • Sequential I/O: • All processes send data to rank 0, and 0 writes it to the file
Pros and Cons of Sequential I/O • Pros: • parallel machine may support I/O from only one process (e.g., no common file system) • Some I/O libraries (e.g. HDF-4, NetCDF) not parallel • resulting single file is handy for ftp, mv • big blocks improve performance • short distance from original, serial code • Cons: • lack of parallelism limits scalability, performance (single node bottleneck)
Another Way • Each process writes to a separate file • Pros: • parallelism, high performance • Cons: • lots of small files to manage • difficult to read back data from different number of processes
What is Parallel I/O? • Multiple processes of a parallel program accessing data (reading or writing) from a common file FILE P(n-1) P0 P1 P2
Why Parallel I/O? • Non-parallel I/O is simple but • Poor performance (single process writes to one file) or • Awkward and not interoperable with other tools (each process writes a separate file) • Parallel I/O • Provides high performance • Can provide a single file that can be used with other tools (such as visualization programs)
Why is MPI a Good Setting for Parallel I/O? • Writing is like sending a message and reading is like receiving. • Any parallel I/O system will need a mechanism to • define collective operations (MPI communicators) • define noncontiguous data layout in memory and file (MPI datatypes) • Test completion of nonblocking operations (MPI request objects) • i.e., lots of MPI-like machinery
MPI-IO Background • Marc Snir et al (IBM Watson) paper exploring MPI as context for parallel I/O (1994) • MPI-IO email discussion group led by J.-P. Prost (IBM) and Bill Nitzberg (NASA), 1994 • MPI-IO group joins MPI Forum in June 1996 • MPI-2 standard released in July 1997 • MPI-IO is Chapter 9 of MPI-2
FILE P(n-1) P0 P1 P2 Using MPI for Simple I/O Each process needs to read a chunk of data from a common file
Using Individual File Pointers #include<stdio.h> #include<stdlib.h> #include "mpi.h" #define FILESIZE 1000 int main(int argc, char **argv){ int rank, nprocs; MPI_File fh; MPI_Status status; int bufsize, nints; int buf[FILESIZE]; MPI_Init(&argc, &argv); MPI_Comm_rank(MPI_COMM_WORLD, &rank); MPI_Comm_size(MPI_COMM_WORLD, &nprocs); bufsize = FILESIZE/nprocs; nints = bufsize/sizeof(int); MPI_File_open(MPI_COMM_WORLD, "datafile", MPI_MODE_RDONLY, MPI_INFO_NULL, &fh); MPI_File_seek(fh, rank * bufsize, MPI_SEEK_SET); MPI_File_read(fh, buf, nints, MPI_INT, &status); MPI_File_close(&fh); }
Using Explicit Offsets #include<stdio.h> #include<stdlib.h> #include "mpi.h" #define FILESIZE 1000 int main(int argc, char **argv){ int rank, nprocs; MPI_File fh; MPI_Status status; int bufsize, nints; int buf[FILESIZE]; MPI_Init(&argc, &argv); MPI_Comm_rank(MPI_COMM_WORLD, &rank); MPI_Comm_size(MPI_COMM_WORLD, &nprocs); bufsize = FILESIZE/nprocs; nints = bufsize/sizeof(int); MPI_File_open(MPI_COMM_WORLD, "datafile", MPI_MODE_RDONLY, MPI_INFO_NULL, &fh); MPI_File_read_at(fh, rank*bufsize, buf, nints, MPI_INT, &status); MPI_File_close(&fh); }
Function Details MPI_File_open(MPI_Comm comm, char *file, int mode, MPI_Info info, MPI_File *fh) (note: mode = MPI_MODE_RDONLY, MPI_MODE_RDWR, MPI_MODE_WRONLY, MPI_MODE_CREATE, MPI_MODE_EXCL, MPI_MODE_DELETE_ON_CLOSE, MPI_MODE_UNIQUE_OPEN, MPI_MODE_SEQUENTIAL, MPI_MODE_APPEND) MPI_File_close(MPI_File *fh) MPI_File_read(MPI_File fh, void *buf, int count, MPI_Datatype type, MPI_Status *status) MPI_File_read_at(MPI_File fh, int offset, void *buf, int count, MPI_Datatype type, MPI_Status *status) MPI_File_seek(MPI_File fh, MPI_Offset offset, in whence); (note: whence = MPI_SEEK_SET, MPI_SEEK_CUR, or MPI_SEEK_END) MPI_File_write(MPI_File fh, void *buf, int count, MPI_Datatype datatype, MPI_Status *status) Note: Many other functions to get/set properties (see Gropp et al)
Writing to a File • Use MPI_File_write or MPI_File_write_at • Use MPI_MODE_WRONLY or MPI_MODE_RDWR as the flags to MPI_File_open • If the file doesn’t exist previously, the flag MPI_MODE_CREATE must also be passed to MPI_File_open • We can pass multiple flags by using bitwise-or ‘|’ in C, or addition ‘+” in Fortran
MPI Datatype Interlude • Datatypes in MPI • Elementary: MPI_INT, MPI_DOUBLE, etc • everything we’ve used to this point • Contiguous • Next easiest: sequences of elementary types • Vector • Sequences separated by a constant “stride”
MPI Datatypes, cont • Indexed: more general • does not assume a constant stride • Struct • General mixed types (like C structs)
Creating simple datatypes • Let’s just look at the simplest types: contiguous and vector datatypes. • Contiguous example • Let’s create a new datatype which is two ints side by side. The calling sequence is MPI_Type_contiguous(int count, MPI_Datatype oldtype, MPI_Datatype *newtype); MPI_Datatype newtype; MPI_Type_contiguous(2, MPI_INT, &newtype); MPI_Type_commit(newtype); /* required */
Creating vector datatype • MPI_Type_vector( int count, int blocklen, int stride, MPI_Datattype oldtype, MPI_Datatype &newtype);
Using File Views • Processes write to shared file • MPI_File_set_view assigns regions of the file to separate processes
File Views • Specified by a triplet (displacement, etype, and filetype) passed to MPI_File_set_view • displacement = number of bytes to be skipped from the start of the file • etype = basic unit of data access (can be any basic or derived datatype) • filetype = specifies which portion of the file is visible to the process
File View Example MPI_File thefile; for (i=0; i<BUFSIZE; i++) buf[i] = myrank * BUFSIZE + i; MPI_File_open(MPI_COMM_WORLD, "testfile", MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &thefile); MPI_File_set_view(thefile, myrank * BUFSIZE, MPI_INT, MPI_INT, "native", MPI_INFO_NULL); MPI_File_write(thefile, buf, BUFSIZE, MPI_INT, MPI_STATUS_IGNORE); MPI_File_close(&thefile);
MPI_Set_view • MPI_Set_view(MPI_File fh, MPI_Offset disp, MPI_Datatype etype, • MPI_Datatype filetype, char *datarep, MPI_Info info) • MPI_Set_view defines which portion of a file is accessible to a given process • Extremely important fucntion -- make sure that you understand arguments • When using previous functions (MPI_Write, etc.) we did not set a file view but rather used the default one (explained below) disp: a byte offset to define the starting point etype: this is the base (think native) type of data in the file (ints, doubles, etc. are common). It is better not to simply use bytes as an etype for heterogeneous environments, e.g. when mapping ints between big and little endian archs filetype: like etype this is any MPI_type. It is used to define regions of non-contiguous access in a file.
MPI_Set_view, cont. • For individual file pointers each process can specify its own view • View can be changed any number of times during a program • All file access done in unites of etype! • filetype must be equal to or be derived from etype
filetype = two MPI_INTs followed by a gap of four MPI_INTs A Simple File View Example etype = MPI_INT head of file FILE displacement filetype filetype and so on...
Example non-contiguous access MPI_Aint lb, extent; MPI_Datatype etype, filetype, contig; MPI_Offset disp; MPI_File fh; int buf[1000]; MPI_File_open(MPI_COMM_WORLD, “filename”, MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh); MPI_Type_contiguous(2, MPI_INT, &contig); lb = 0; MPI_Type_create_resized(contig, lb, extent, &filetype); MPI_Type_commit(&filetype); disp = 5*sizeof(int); etype = MPI_INT; MPI_File_set_view(fh, disp, etype, filetype, “native”, MPI_INFO_NULL); MPI_File_write(fh, buf, 1000, MPI_INT, MPI_STATUS_IGNORE);
Ways to Write to a Shared File • MPI_File_seek • MPI_File_read_at • MPI_File_write_at • MPI_File_read_shared • MPI_File_write_shared • Collective operations like Unix seek combine seek and I/O for thread safety use shared file pointer good when order doesn’t matter
Collective I/O in MPI • A critical optimization in parallel I/O • Allows communication of “big picture” to file system • Framework for 2-phase I/O, in which communication precedes I/O (can use MPI machinery) • Basic idea: build large blocks, so that reads/writes in I/O system will be large Small individual requests Large collective access
Noncontiguous Accesses • Common in parallel applications • Example: distributed arrays stored in files • A big advantage of MPI I/O over Unix I/O is the ability to specify noncontiguous accesses in memory and file within a single function call by using derived datatypes • Allows implementation to optimize the access • Collective IO combined with noncontiguous accesses yields the highest performance.
Collective I/O • MPI_File_read_all, MPI_File_read_at_all, etc • _all indicates that all processes in the group specified by the communicator passed to MPI_File_open will call this function • Each process specifies only its own access information -- the argument list is the same as for the non-collective functions
Collective I/O • By calling the collective I/O functions, the user allows an implementation to optimize the request based on the combined request of all processes • The implementation can merge the requests of different processes and service the merged request efficiently • Particularly effective when the accesses of different processes are noncontiguous and interleaved
Collective non-contiguousMPI-IO examples #define “mpi.h” #define FILESIZE 1048576 #define INTS_PER_BLK 16 int main(int argc, char **argv){ int *buf, rank, nprocs, nints, bufsize; MPI_File fh; MPI_Datatype filetype; MPI_Init(&argc, &argv); MPI_Comm_rank(MPI_COMM_WORLD, &rank); MPI_Comm_size(MPI_COMM_WORLD, &nprocs); bufsize = FILESIZE/nprocs; buf = (int *) malloc(bufsize); nints = bufsize/sizeof(int); MPI_File_open(MPI_COMM_WORLD, “filename”, MPI_MODE_RD_ONLY, MPI_INFO_NULL, &fh); MPI_Type_vector(nints/INTS_PER_BLK, INTS_PER_BLK, INTS_PER_BLK*nprocs, MPI_INT, &filetype); MPI_Type_commit(&filetype); MPI_File_set_view(fh, INTS_PER_BLK*sizeof(int)*rank, MPI_INT, filetype, “native”, MPI_INFO_NULL); MPI_File_read_all(fh, buf, nints, MPI_INT, MPI_STATUS_IGNORE); MPI_Type_free(&filetype); free(buf) MPI_Finalize(); return(0); }
More on MPI_Read_all • Note that the _all version has the same argument list • Difference is that all processes involved in MPI_Open must call this the read • Contrast with the non-all version where any subset may or may not call it • Allows for many optimizations
Array-specific datatypes • We can take this a level higher for the (common) case of distributed arrays • There are two very convenient datatype constructors -- darray and subarray -- that can be used to read/write distributed arrays with less work. • These are just to create MPI_Datatype. All other machinery just described is identical.
n columns P1 P2 P0 coords = (0,1) coords = (0,0) coords = (0,2) m rows P5 P3 P4 coords = (1,0) coords = (1,1) coords = (1,2) nproc(1) = 2, nproc(2) = 3 Accessing Arrays Stored in Files
Using the “Distributed Array” (Darray) Datatype int gsizes[2], distribs[2], dargs[2], psizes[2]; gsizes[0] = m; /* no. of rows in global array */ gsizes[1] = n; /* no. of columns in global array*/ distribs[0] = MPI_DISTRIBUTE_BLOCK; distribs[1] = MPI_DISTRIBUTE_BLOCK; dargs[0] = MPI_DISTRIBUTE_DFLT_DARG; dargs[1] = MPI_DISTRIBUTE_DFLT_DARG; psizes[0] = 2; /* no. of processes in vertical dimension of process grid */ psizes[1] = 3; /* no. of processes in horizontal dimension of process grid */
MPI_Type_create_darray MPI_Type_create_darray(int size, int rank, int ndims, int array_of_gsizes[], int array_of_distribs[], int array_of_dargs[], int array_of_psizes[], int order, MPI_Datatype old, MPI_Datatype &new); size: number of procs over which array is distributed rank: Rank of the process whose array is being described ndims: number of dimensions in global (and local) array array_of_gsizes: size of global array in each dimension array_of_distribs: way in which array is distributed in each dimension -- can be block, cyclic, or block-cyclic. Here it is block so we use MPI_DISTRIBUTE-BLOCK array_of_dargs: k in cyclic(k). Not necessary of block or cyclic distributions, so we use MPI_DISTRIBUTE-DFLT-DARG array_of_psizes: number of procs in each dimension Order: row-major (MPI_ORDER_C) or column-major (MPI_ORDER_FORTRAN)
Darray Continued MPI_Comm_rank(MPI_COMM_WORLD, &rank); MPI_Type_create_darray(6, rank, 2, gsizes, distribs, dargs, psizes, MPI_ORDER_C, MPI_FLOAT, &filetype); MPI_Type_commit(&filetype); MPI_File_open(MPI_COMM_WORLD, "/pfs/datafile", MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh); MPI_File_set_view(fh, 0, MPI_FLOAT, filetype, "native", MPI_INFO_NULL); local_array_size = num_local_rows * num_local_cols; MPI_File_write_all(fh, local_array, local_array_size, MPI_FLOAT, &status); MPI_File_close(&fh);
A Word of Warning about Darray • The darray datatype assumes a very specific definition of data distribution -- the exact definition as in HPF • For example, if the array size is not divisible by the number of processes, darray calculates the block size using a ceiling division (20 / 6 = 4 ) • darray assumes a row-major ordering of processes in the logical grid, as assumed by cartesian process topologies in MPI-1 • If your application uses a different definition for data distribution or logical grid ordering, you cannot use darray. Use subarray instead.
MPI_Type_create_subarray MPI_Type_create_subarray(int ndims, int array_of_sizes[], int array_of_subsizes[], int array_of_starts[], int order, MPI_Datatype old, MPI_Datatype *new); ndims: number of dimensions in array array_of_sizes: size of global array in each dimension array_of_subsizes: size of subarray in each dimension int array_of_starts: starting coordintes in each dimension (zero-based) order: same as for darray (C or Fortran ordering)
Using the Subarray Datatype gsizes[0] = m; /* no. of rows in global array */ gsizes[1] = n; /* no. of columns in global array*/ psizes[0] = 2; /* no. of procs. in vertical dimension */ psizes[1] = 3; /* no. of procs. in horizontal dimension */ lsizes[0] = m/psizes[0]; /* no. of rows in local array */ lsizes[1] = n/psizes[1]; /* no. of columns in local array */ dims[0] = 2; dims[1] = 3; periods[0] = periods[1] = 1; MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &comm); MPI_Comm_rank(comm, &rank); MPI_Cart_coords(comm, rank, 2, coords);
Subarray Datatype contd. /* global indices of first element of local array */ start_indices[0] = coords[0] * lsizes[0]; start_indices[1] = coords[1] * lsizes[1]; MPI_Type_create_subarray(2, gsizes, lsizes, start_indices, MPI_ORDER_C, MPI_FLOAT, &filetype); MPI_Type_commit(&filetype); MPI_File_open(MPI_COMM_WORLD, "/pfs/datafile", MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh); MPI_File_set_view(fh, 0, MPI_FLOAT, filetype, "native", MPI_INFO_NULL); local_array_size = lsizes[0] * lsizes[1]; MPI_File_write_all(fh, local_array, local_array_size, MPI_FLOAT, &status);
(0,0) (0,107) (4,4) (4,103) local data ghost area for storing off-process elements (103,4) (103,103) (107,0) (107,107) Local Array with Ghost Areain Memory • Use a subarray datatype to describe the noncontiguous layout in memory • Pass this datatype as argument to MPI_File_write_all
Local Array with Ghost Area memsizes[0] = lsizes[0] + 8; /* no. of rows in allocated array */ memsizes[1] = lsizes[1] + 8; /* no. of columns in allocated array */ start_indices[0] = start_indices[1] = 4; /* indices of the first element of the local array in the allocated array */ MPI_Type_create_subarray(2, memsizes, lsizes, start_indices, MPI_ORDER_C, MPI_FLOAT, &memtype); MPI_Type_commit(&memtype); /* create filetype and set file view exactly as in the subarray example */ MPI_File_write_all(fh, local_array, 1, memtype, &status);
Process 0’s data array Process 1’s data array Process 2’s data array Accessing Irregularly Distributed Arrays Process 0’s map array Process 1’s map array Process 2’s map array 0 3 8 11 2 4 7 13 1 5 10 14 The map array describes the location of each element of the data array in the common file
Accessing Irregularly Distributed Arrays integer (kind=MPI_OFFSET_KIND) disp call MPI_FILE_OPEN(MPI_COMM_WORLD, '/pfs/datafile', & MPI_MODE_CREATE + MPI_MODE_RDWR, & MPI_INFO_NULL, fh, ierr) call MPI_TYPE_CREATE_INDEXED_BLOCK(bufsize, 1, map, & MPI_DOUBLE_PRECISION, filetype, ierr) call MPI_TYPE_COMMIT(filetype, ierr) disp = 0 call MPI_FILE_SET_VIEW(fh, disp, MPI_DOUBLE_PRECISION, & filetype, 'native', MPI_INFO_NULL, ierr) call MPI_FILE_WRITE_ALL(fh, buf, bufsize, & MPI_DOUBLE_PRECISION, status, ierr) call MPI_FILE_CLOSE(fh, ierr)
Nonblocking I/O MPI_Request request; MPI_Status status; MPI_File_iwrite_at(fh, offset, buf, count, datatype, &request); for (i=0; i<1000; i++) { /* perform computation */ } MPI_Wait(&request, &status);
Split Collective I/O • A restricted form of nonblocking collective I/O • Only one active nonblocking collective operation allowed at a time on a file handle • Therefore, no request object necessary MPI_File_write_all_begin(fh, buf, count, datatype); for (i=0; i<1000; i++) { /* perform computation */ } MPI_File_write_all_end(fh, buf, &status);
Shared File Pointers #include "mpi.h" // C++ example int main(int argc, char *argv[]) { int buf[1000]; MPI::File fh; MPI::Init(); MPI::File fh = MPI::File::Open(MPI::COMM_WORLD, "/pfs/datafile", MPI::MODE_RDONLY, MPI::INFO_NULL); fh.Write_shared(buf, 1000, MPI_INT); fh.Close(); MPI::Finalize(); return 0; }
Passing Hints to the Implementation MPI_Info info; MPI_Info_create(&info); /* no. of I/O devices to be used for file striping */ MPI_Info_set(info, "striping_factor", "4"); /* the striping unit in bytes */ MPI_Info_set(info, "striping_unit", "65536"); MPI_File_open(MPI_COMM_WORLD, "/pfs/datafile", MPI_MODE_CREATE | MPI_MODE_RDWR, info, &fh); MPI_Info_free(&info);
Examples of Hints (used in ROMIO) • striping_unit • striping_factor • cb_buffer_size • cb_nodes • ind_rd_buffer_size • ind_wr_buffer_size • start_iodevice • pfs_svr_buf • direct_read • direct_write MPI-2 predefined hints New Algorithm Parameters Platform-specific hints
I/O Consistency Semantics • The consistency semantics specify the results when multiple processes access a common file and one or more processes write to the file • MPI guarantees stronger consistency semantics if the communicator used to open the file accurately specifies all the processes that are accessing the file, and weaker semantics if not • The user can take steps to ensure consistency when MPI does not automatically do so