HIGH PERFORMANCE COMPUTING : MODELS, METHODS, & MEANS PARALLEL FILE I/O 2

Prof. Thomas Sterling Prof. Hartmut Kaiser Department of Computer Science Louisiana State University March 31st, 2011

IO Problem of the day #include <stdio.h> int main() { int a = 0, b = 0; char buf[10]; scanf ("%d%d", a, b); sprintf (buf, "%d %d"); puts ("you entered: "); puts (buf); } If the user entered 3 and 17, what's the generated output?

IO Problem of the day (corrected) #include <stdio.h> int main() { int a = 0, b = 0; char buf[42]; /* max. 20 digits in 64bit int */ scanf ("%d%d", &a, &b); snprintf (buf, 42, "%d %d", a, b); puts ("you entered: "); puts (buf); }
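The corrected program passes addresses to scanf and bounds the output, but it still ignores both return values. A minimal sketch of the same logic with those checks added might look as follows; the helper name `format_pair` is hypothetical and is not part of the slides, and it parses from a string (sscanf) rather than from standard input so that it is easy to exercise.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper (not from the slides): parse two ints from `input`
 * and format them into `out`. Returns 0 on success, -1 if fewer than two
 * integers could be parsed or if `out` is too small for the result. */
int format_pair(const char *input, char *out, size_t out_size)
{
    int a = 0, b = 0;

    /* sscanf returns the number of successfully converted items */
    if (sscanf(input, "%d%d", &a, &b) != 2)
        return -1;

    /* snprintf returns the length it wanted to write, excluding '\0' */
    int n = snprintf(out, out_size, "%d %d", a, b);
    if (n < 0 || (size_t)n >= out_size)
        return -1;

    return 0;
}
```

Checking the snprintf result against the buffer size turns silent truncation into a reportable error.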

Topics Introduction POSIX I/O API Parallel I/O Libraries (MPI-IO) Scientific I/O Interface: netCDF Scientific Data Package: HDF5 Summary – Materials for Test

Parallel I/O: Library Layers (Review) High-Level I/O Library Parallel I/O (MPI I/O) Parallel File System Storage Hardware Lower level interfaces may be provided by the file system for higher-performance access Above the parallel file systems are the parallel I/O layers provided in the form of libraries such as MPI-IO The parallel I/O layer provides a low level interface and operations such as collective I/O Scientific applications work with structured data for which a higher level API written on top of MPI-IO such as HDF5 or parallel netCDF are used HDF5 and parallel netCDF allow the scientists to represent the data sets in terms closer to those used in their applications, and in a portable manner

POSIX File Access API • Widespread standard • Available on any UNIX-compliant platform • IBM AIX, HP HP-UX, SGI Irix, Sun Solaris, BSDi BSD/OS, Mac OS X, Linux, FreeBSD, OpenBSD, NetBSD, BeOS, and many others • Also: Windows NT, XP, Server 2003, Vista, Windows 7 (through C runtime libraries) • Simple interface: six functions from POSIX.1 (core services) provide practically all necessary I/O functionality • File open • File close • File data read • File data write • Flush buffer to disk • Adjust file pointer (seek) • Two interface variants provide roughly equivalent functionality • Low-level file interface (file handles are integer descriptors) • C stream interface (streams are represented by FILE structure; function names prefixed with "f") • But: no parallel I/O support

File Open #include <sys/types.h> #include <sys/stat.h> #include <fcntl.h> ... /* create empty writable file, storing its descriptor in fd; O_CREAT requires a third (mode) argument: 0666 requests default access permissions, further restricted by the umask */ int fd = open("test", O_WRONLY|O_CREAT|O_TRUNC, 0666); if (fd < 0) {/* handle error here */} #include <stdio.h> ... /* replicate the open() example above, storing file handle in f */ FILE *f = fopen("test", "w"); if (f == NULL) {/* handle error here */} ...

File Close #include <unistd.h> ... /* open a file */ int rc; int fd = open(...); ... /* file is accessed here */ ... rc = close(fd); if (rc != 0) {/* handle error here */} #include <stdio.h> ... /* open a file */ int rc; FILE *f = fopen(...); ... /* file is accessed here */ ... rc = fclose(f); if (rc != 0) {/* handle error here */}

File Read #include <unistd.h> ... ssize_t bytes; /* read() returns ssize_t; -1 signals an error */ char buf[100]; /* open an existing file for reading */ int fd = open(...); ... bytes = read(fd, buf, 100); if (bytes < 100) {/* handle EOF or error here */} ... #include <stdio.h> ... size_t items; char buf[100]; /* open an existing file for reading */ FILE *f = fopen(...); ... items = fread(buf, 1, 100, f); if (items < 100) {/* handle EOF or error here */} ...

File Write #include <unistd.h> ... ssize_t bytes; /* write() returns ssize_t; -1 signals an error */ char buf[100]; /* open a file for writing or appending */ int fd = open(...); ... /* initialize buffer data */ ... bytes = write(fd, buf, 100); if (bytes < 100) {/* handle short write or error */} ... #include <stdio.h> ... size_t items; char buf[100]; /* open a file for writing or appending */ FILE *f = fopen(...); ... /* initialize buffer data */ ... items = fwrite(buf, 1, 100, f); if (items < 100) {/* handle short write */} ...

File Seek #include <sys/types.h> #include <unistd.h> ... /* open file for read/write access */ int fd = open("/tmp/myfile", O_RDWR); ... /* write some file data */ ... /* "rewind" to the beginning of file to check the written data */ lseek(fd, 0, SEEK_SET); /* start reading... */ #include <stdio.h> ... /* open file for reading and writing */ FILE *f = fopen("/tmp/myfile", "r+"); ... /* to start appending data at the end of file: */ fseek(f, 0, SEEK_END); fwrite(...); ...

File Data Flushing #include <unistd.h> ... /* open file for writing; O_CREAT requires a third (mode) argument */ int fd = open("checkpt.dat", O_WRONLY|O_CREAT, 0644); ... /* write checkpoint data */ ... /* make sure data are flushed to disk before starting the next iteration */ fsync(fd); ... #include <stdio.h> ... /* open file for appending */ FILE *f = fopen("/var/log/app.log", "a"); ... /* special event happened: output a message */ fprintf(f, "driver initialization failed"); /* make sure message reaches at least kernel buffers before application crashes */ fflush(f); ...
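The six calls from the preceding slides (open, write, flush, seek, read, close) can be combined into one small round trip. This is a sketch using the low-level descriptor interface only; the helper name `roundtrip` and the 128-byte limit are assumptions made for illustration.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write `msg` to `path`, flush it, seek back and read
 * it again. Returns 0 if the data read back matches, -1 otherwise. */
int roundtrip(const char *path, const char *msg)
{
    char buf[128];
    size_t len = strlen(msg);
    int ok = 0;

    if (len > sizeof buf)
        return -1;

    /* open: O_CREAT requires the third (mode) argument */
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    /* && evaluates left to right and stops at the first failing step */
    if (write(fd, msg, len) == (ssize_t)len      /* write */
        && fsync(fd) == 0                        /* flush to disk */
        && lseek(fd, 0, SEEK_SET) == 0           /* rewind */
        && read(fd, buf, len) == (ssize_t)len    /* read back */
        && memcmp(buf, msg, len) == 0)
        ok = 1;

    if (close(fd) != 0)                          /* close */
        ok = 0;

    return ok ? 0 : -1;
}
```

A call such as `roundtrip("/tmp/demo.dat", "checkpoint 42")` would be expected to return 0 on a writable file system.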

Problems with POSIX File I/O • Too simplistic interface • Operates on anonymous sequences of bytes • No preservation of type or information structure • Cumbersome access to optimized/additional features (fcntl, ioctl) • Designed for sequential I/O (even regularly strided accesses require multiple calls and may suffer from poor performance) • Portability issues • Must use specialized reader/writer created for a particular application • Compatibility checks dependent on application developers (possibility of undetected failures) • No generic utilities to parse and interpret the contents of saved files • Cross platform endianness and type representation problem if saving in binary mode • Significant waste of storage space if text mode is used (for portability or readability of transferred data) • Permit access only to locally mounted storage, or remote storage via NFS (which has its share of problems) • Parallel and concurrent access issues • Lack of synchronization when accessing shared files from multiple nodes • Atomic access to shared files may not be enforceable, has unclear semantics, or has to rely on the programmer for synchronization • Uncoordinated access of I/O devices shared by multiple nodes may result in poor performance (bottlenecks) • Additional performance loss due to suboptimal bulk data movement (e.g., no collective I/O) • On the other hand, without sharing, the management of individual files (i.e. with at least one data file per I/O node) is complicated and tedious
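The endianness problem listed above is usually worked around by hand: the application fixes a byte order for the file and converts explicitly. A minimal sketch, assuming big-endian is chosen as the on-disk order (the function names are hypothetical):

```c
#include <stdint.h>

/* Store a 32-bit value as 4 bytes, most significant byte first,
 * independent of the byte order of the machine running the code. */
void put_u32_be(unsigned char out[4], uint32_t v)
{
    out[0] = (unsigned char)(v >> 24);
    out[1] = (unsigned char)(v >> 16);
    out[2] = (unsigned char)(v >> 8);
    out[3] = (unsigned char)v;
}

/* Rebuild the value from 4 bytes stored most significant byte first. */
uint32_t get_u32_be(const unsigned char in[4])
{
    return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16)
         | ((uint32_t)in[2] << 8)  |  (uint32_t)in[3];
}
```

Every application that shares the file has to agree on this convention, which is exactly the kind of bookkeeping that self-describing formats take over.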

MPI-IO Overview Initially developed as a research project at the IBM T. J. Watson Research Center in 1994 Voted by the MPI Forum to be included in MPI-2 standard (Chapter 9) Most widespread open-source implementation is ANL’s ROMIO, written by Rajeev Thakur (http://www-unix.mcs.anl.gov/romio/ ) Integrates file access with the message passing infrastructure, using similarities between send/receive and file write/read operations Allows MPI datatypes to meaningfully describe data layouts in files instead of dealing with unorganized streams of bytes Provides potential for performance optimizations through the mechanism of “hints”, collective operations on file data, or relaxation of data access atomicity Enables better file portability by offering alternative data representations

MPI-IO Features (I) • Basic file manipulation (open/close, delete, space preallocation, resize, storage synchronization, etc.) • File views (define what part of a file each process can see and how it is interpreted) • Processes can view file data independently, with possible overlaps • The users may define patterns to describe data distributions both in file and in memory, including non-contiguous layouts • Permit skipping over fixed header blocks (“displacements”) • Views can be changed by tasks at any time • Data access positioning • Explicitly specified offsets (suffix “_at”) • Independent data access by each task via individual file pointers (no suffix) • Coordinated access through shared file pointer (suffix “_shared”) • Access synchronism • Blocking • Non-blocking (include split-collective operations)

MPI-IO Features (II) • Access coordination • Non-collective (no additional suffix) • Collective (suffix: “_all” for most blocking calls, “_begin” and “_end” for split-collective, or “_ordered” for equivalent of shared pointer access) • File interoperability (ensures portability of data representation) • Native: for purely homogeneous environments • Internal: heterogeneous environments with implementation-defined data representation (subset of “external32”) • External32: heterogeneous environments using data representation defined by the MPI-IO standard • Optimization hints (the “_info” interface) • Access style (e.g. read_once, write_once, sequential, random, etc.) • Collective buffering components (buffer and block sizes, number of target nodes) • Striping unit and factor • Chunked I/O specification • Preferred I/O devices • C, C++ and Fortran bindings

MPI-IO Types Source:http://www.mhpcc.edu/training/workshop2/mpi_io/MAIN.html Etype (elementary datatype): the unit of data access and positioning; all data accesses are performed in etype units and offsets are measured in etypes Filetype: basis for partitioning the file among processes: a template for accessing the file; may be identical to or derived from the etype

MPI-IO File Views A view defines the current set of data visible and accessible from an open file as an ordered set of etypes Each process has its own view of the file, defined by: a displacement, an etype, and a filetype Displacement: an absolute byte position relative to the beginning of file; defines where a view begins
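How displacement, etype and filetype combine into a byte position can be illustrated with plain arithmetic. The sketch below is a simplified model, not an MPI call: it assumes a filetype that exposes a single contiguous block of etypes inside a repeating tile, and the struct and function names are hypothetical.

```c
#include <stdint.h>

/* Simplified model of a file view (not part of MPI): the filetype exposes
 * one contiguous block of `block_len` etypes, starting `block_start` etypes
 * into a tile of `extent` etypes; the tile repeats through the file. */
typedef struct {
    int64_t disp;        /* displacement in bytes from the start of file */
    int64_t etype_size;  /* size of one etype in bytes */
    int64_t block_start; /* first visible etype within a tile */
    int64_t block_len;   /* number of visible etypes per tile */
    int64_t extent;      /* tile length in etypes */
} view_model;

/* Absolute byte offset of the k-th etype visible through the view. */
int64_t view_byte_offset(const view_model *v, int64_t k)
{
    int64_t tile = k / v->block_len;    /* which repetition of the filetype */
    int64_t within = k % v->block_len;  /* position inside the visible block */
    return v->disp
         + (tile * v->extent + v->block_start + within) * v->etype_size;
}
```

For instance, with four processes each owning 256-byte blocks of a one-byte etype, the model for rank 1 is `{0, 1, 256, 256, 1024}`: its first visible byte sits at file offset 256, and its 257th visible byte at offset 1280, after skipping the blocks of the other ranks.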

MPI-IO: File Open #include <mpi.h> ... MPI_File fh; int err; ... /* create a writable file with default parameters */ err = MPI_File_open(MPI_COMM_WORLD, "/mnt/piofs/testfile", MPI_MODE_CREATE|MPI_MODE_WRONLY, MPI_INFO_NULL, &fh); if (err != MPI_SUCCESS) {/* handle error here */} ...

MPI-IO: File Close #include <mpi.h> ... MPI_File fh; int err; ... /* open a file storing the handle in fh */ /* perform file access */ ... err = MPI_File_close(&fh); if (err != MPI_SUCCESS) {/* handle error here */} ...

MPI-IO: Set File View #include <mpi.h> ... MPI_File fh; int err; ... /* open file storing the handle in fh */ ... /* view the file as stream of integers with no header, using native data representation */ err = MPI_File_set_view(fh, 0, MPI_INT, MPI_INT, "native", MPI_INFO_NULL); if (err != MPI_SUCCESS) {/* handle error */} ...

MPI-IO: Read File with Explicit Offset #include <mpi.h> ... MPI_File fh; MPI_Status stat; int buf[3], err; ... /* open file storing the handle in fh */ ... MPI_File_set_view(fh, 0, MPI_INT, MPI_INT, "native", MPI_INFO_NULL); /* read the third triad of integers from file (offsets are counted in etypes, so the triads start at 0, 3 and 6) */ err = MPI_File_read_at(fh, 6, buf, 3, MPI_INT, &stat); ...

MPI-IO: Write to File with Explicit Offset #include <mpi.h> ... MPI_File fh; MPI_Status stat; int err; double dt = 0.0005; ... /* open file storing the handle in fh */ ... MPI_File_set_view(fh, 0, MPI_DOUBLE, MPI_DOUBLE, "native", MPI_INFO_NULL); /* store timestep as the first item in file */ err = MPI_File_write_at(fh, 0, &dt, 1, MPI_DOUBLE, &stat); ...

MPI-IO: Read File Collectively with Individual File Pointers #include <mpi.h> ... MPI_File fh; MPI_Status stat; int buf[20], err; ... /* open file storing the handle in fh */ ... MPI_File_set_view(fh, 0, MPI_INT, MPI_INT, "native", MPI_INFO_NULL); /* read 20 integers at current file offset in every process */ err = MPI_File_read_all(fh, buf, 20, MPI_INT, &stat); ...

MPI-IO: Write to File Collectively with Individual File Pointers #include <mpi.h> ... MPI_File fh; MPI_Status stat; double t; int err, rank; ... /* open file storing the handle in fh; compute t */ ... MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* interleave time values t from each process at the beginning of file */ MPI_File_set_view(fh, rank*sizeof(t), MPI_DOUBLE, MPI_DOUBLE, "native", MPI_INFO_NULL); err = MPI_File_write_all(fh, &t, 1, MPI_DOUBLE, &stat); ...

MPI-IO: File Seek #include <mpi.h> ... MPI_File fh; MPI_Status stat; double t; int rank; ... /* open file storing the handle in fh; compute t */ ... MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* interleave time values t from each process at the beginning of file */ MPI_File_set_view(fh, 0, MPI_DOUBLE, MPI_DOUBLE, "native", MPI_INFO_NULL); /* argument order is (fh, offset, whence); the offset is counted in etypes */ MPI_File_seek(fh, rank, MPI_SEEK_SET); MPI_File_write_all(fh, &t, 1, MPI_DOUBLE, &stat); ...

MPI-IO Data Access Classification Source: http://www.mpi-forum.org/docs/mpi2-report.pdf

Example: Scatter to File Example created by Jean-Pierre Prost from IBM Corp.

Scatter Example Source #include <stdio.h> #include <stdlib.h> #include <string.h> #include "mpi.h" static int buf_size = 1024; static int blocklen = 256; static char filename[] = "scatter.out"; int main(int argc, char **argv) { char *buf, *p; int myrank, commsize; MPI_Datatype filetype, buftype; int length[3]; MPI_Aint disp[3]; MPI_Datatype type[3]; MPI_File fh; int mode, nbytes; MPI_Offset offset; MPI_Status status; /* initialize MPI */ MPI_Init(&argc, &argv); MPI_Comm_rank(MPI_COMM_WORLD, &myrank); MPI_Comm_size(MPI_COMM_WORLD, &commsize); /* initialize buffer */ buf = (char *) malloc(buf_size); memset(( void *)buf, '0' + myrank, buf_size); /* create and commit buftype */ MPI_Type_contiguous(buf_size, MPI_CHAR, &buftype); MPI_Type_commit(&buftype); /* create and commit filetype */ length[0] = 1; length[1] = blocklen; length[2] = 1; disp[0] = 0; disp[1] = blocklen * myrank; disp[2] = blocklen * commsize; type[0] = MPI_LB; type[1] = MPI_CHAR; type[2] = MPI_UB; MPI_Type_struct(3, length, disp, type, &filetype); MPI_Type_commit(&filetype); /* open file */ mode = MPI_MODE_CREATE | MPI_MODE_WRONLY;

Scatter Example Source (cont.) MPI_File_open(MPI_COMM_WORLD, filename, mode, MPI_INFO_NULL, &fh); /* set file view */ offset = 0; MPI_File_set_view(fh, offset, MPI_CHAR, filetype, "native", MPI_INFO_NULL); /* write buffer to file */ MPI_File_write_at_all(fh, offset, (void *)buf, 1, buftype, &status); /* print out number of bytes written */ MPI_Get_elements(&status, MPI_CHAR, &nbytes); printf( "TASK %d ====== number of bytes written = %d ======\n", myrank, nbytes); /* close file */ MPI_File_close(&fh); /* free datatypes */ MPI_Type_free(&buftype); MPI_Type_free(&filetype); /* free buffer */ free (buf); /* finalize MPI */ MPI_Finalize(); }

Data Access Optimizations Data Sieving 2-phase I/O Collective Read Implementation in ROMIO Source: http://www-unix.mcs.anl.gov/~thakur/papers/romio-coll.pdf

ROMIO Scaling Examples Write Operations Read Operations Source: http://www-unix.mcs.anl.gov/~thakur/sio-demo/astro.html Bandwidths obtained for 512^3 arrays (astrophysics benchmark) on Argonne IBM SP

Independent vs. Collective Access Individual I/O on IBM SP Collective I/O on IBM SP Source: http://www-unix.mcs.anl.gov/~thakur/sio-demo/upshot.html

Demos MPI-IO Demo POSIX I/O API Demo

NetCDF: Introduction Stands for Network Common Data Form Portable format to represent scientific data Developed at the Unidata Program Center in Boulder, Colorado, with many contributions from user community Project page hosted by the Unidata program at University Corporation for Atmospheric Research (UCAR): http://www.unidata.ucar.edu/software/netcdf/ Provides a set of interfaces for array-oriented data access and a collection of data access libraries for C, Fortran (77 and 90), C++, Java, Perl, Python, and other languages Available on UNIX and Windows platforms Features simple programming interface Supports large data files (and 64-bit offsets) Open source, freely available Commonly used file extension is “.nc” (changed from “.cdf” to avoid confusion with other formats) Current stable release is version 4.0 (released on June 12, 2008) Used extensively by a number of climate modeling, land and atmosphere, marine, naval data storage, satellite data processing, theoretical physics centers, geological institutes, commercial analysis, universities, as well as other research institutions in over 30 countries

NetCDF Rationale To facilitate the use of common datasets by distinct applications Permit datasets to be transported between or shared by dissimilar computers transparently, i.e., without translation (automatic handling of different data types, endian-ness, etc.) Reduce the programming effort usually spent interpreting formats Reduce errors arising from misinterpreting data and ancillary data Facilitate using output from one application as input to another Establish an interface standard which simplifies the inclusion of new software into already existing application set (originally: Unidata system) However: not another DBMS!

Key Properties of NetCDF Format • Self-describing • A netCDF file includes information about the data it contains • Portable • Files are accessible by computers that use different ways of representing and storing of integers, floating-point numbers and characters • Direct-access • Enabling an efficient access to small subsets of a large dataset without the need to read through all preceding data • Appendable • Additional data may be appended to a properly structured netCDF file without copying the dataset or redefining its structure • Sharable • One writer and multiple readers may simultaneously access the same netCDF file • Archivable • Access to all earlier forms of netCDF data will be supported by current and future versions of the software

NetCDF Dataset Building Blocks • Data in netCDF are represented as n-dimensional arrays, with n being 0, 1, 2, … (scalars are 0-dimensional arrays) • Array elements are of the same data type • Three basic entities: • Dimension: has name and length; one dimension per array may be UNLIMITED for unbounded arrays • Variable: identifies array of values of the same type (byte, character, short, int, float, or double) • In addition, coordinate variables may be named identically to dimensions, and by convention define physical coordinate set corresponding to that dimension • Attribute: provides additional information about a variable, or global properties of a dataset • There are established conventions for attribute names, e.g., unit, long_name, valid_range, etc. • Multiple attributes per dataset are allowed • The only kind of data structures supported by netCDF classic are collections of named arrays with attached vector attributes

Common Data Form Language (CDL) NetCDF uses CDL to describe its data model. CDL represents the information stored in binary netCDF files in a human-readable form, e.g.: netcdf example_1 { // example of CDL notation for a netCDF dataset dimensions: // dimension names and lengths are declared first lat = 5, lon = 10, level = 4, time = unlimited; variables: // variable types, names, shapes, attributes float temp(time,level,lat,lon); temp:long_name = "temperature"; temp:units = "celsius"; int lat(lat), lon(lon), level(level); lat:units = "degrees_north"; lon:units = "degrees_east"; level:units = "millibars"; short time(time); time:units = "hours since 1996-1-1"; // global attributes :source = "Fictional Model Output"; data: // optional data assignments level = 1000, 850, 700, 500; lat = 20, 30, 40, 50, 60; lon = -160,-140,-118,-96,-84,-52,-45,-35,-25,-15; time = 12; }

NetCDF Utilities • ncgen • takes input in CDL format and creates a netCDF file, or a C or Fortran program that creates a netCDF dataset ncgen [-b] [-o netcdf-file] [-c] [-f] [-k kind] [-x] [input-file] • ncdump • generates the CDL text representation of a netCDF dataset on standard output, optionally including some or all variable data • Output from ncdump is an acceptable input to ncgen ncdump [-c|-h] [-v var1,…] [-b lang] [-f lang] [-l len] [-p fdig[,ddig]] [-n name] [-k] [input-file]

NetCDF API: Create a Dataset #include <netcdf.h> ... int status; int ncid; ... status = nc_create("foo.nc", NC_NOCLOBBER, &ncid); if (status != NC_NOERR) handle_error(status);

NetCDF API: Open a Dataset #include <netcdf.h> ... int status; int ncid; ... status = nc_open("foo.nc", 0, &ncid); if (status != NC_NOERR) handle_error(status);

NetCDF API: Create a Dimension #include <netcdf.h> ... int status, ncid, latid, recid; ... status = nc_create("foo.nc", NC_NOCLOBBER, &ncid); if (status != NC_NOERR) handle_error(status); ... status = nc_def_dim(ncid, "lat", 18L, &latid); if (status != NC_NOERR) handle_error(status); status = nc_def_dim(ncid, "rec", NC_UNLIMITED, &recid); if (status != NC_NOERR) handle_error(status);

NetCDF API: Create a Variable #include <netcdf.h> int status, ncid; /* error status and dataset ID */ int lat_dim, lon_dim, time_dim; /* dimension IDs */ int rh_id, rh_dimids[3]; /* variable ID and shape */ ... status = nc_create("foo.nc", NC_NOCLOBBER, &ncid); if (status != NC_NOERR) handle_error(status); /* define dimensions */ status = nc_def_dim(ncid, "lat", 5L, &lat_dim); if (status != NC_NOERR) handle_error(status); status = nc_def_dim(ncid, "lon", 10L, &lon_dim); if (status != NC_NOERR) handle_error(status); status = nc_def_dim(ncid, "time", NC_UNLIMITED, &time_dim); if (status != NC_NOERR) handle_error(status); /* define variable */ rh_dimids[0] = time_dim; rh_dimids[1] = lat_dim; rh_dimids[2] = lon_dim; status = nc_def_var(ncid, "rh", NC_DOUBLE, 3, rh_dimids, &rh_id); if (status != NC_NOERR) handle_error(status);

NetCDF API: Leave Define Mode #include <netcdf.h> ... int status; int ncid; ... status = nc_create("foo.nc", NC_NOCLOBBER, &ncid); if (status != NC_NOERR) handle_error(status); ... /* create dimensions, variables, attributes */ ... status = nc_enddef(ncid); /* leave define mode */ if (status != NC_NOERR) handle_error(status);

NetCDF API: Querying Variable Information
