
Update on HDF5 1.8

The HDF Group. HDF and HDF-EOS Workshop X, November 28, 2006.


Presentation Transcript


  1. Update on HDF5 1.8. The HDF Group. HDF and HDF-EOS Workshop X, November 28, 2006

  2. Why HDF5 1.8?

  3. “… as we know, there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns -- the ones we don't know we don't know.”
     Donald Rumsfeld
     HDF and HDF-EOS Workshop X, Landover MD

  4. Some things we knew we knew
     • Need high-level APIs – image, etc.
     • Need more datatypes – packed n-bit, etc.
     • Need external and other links
     • Tools needed – h5pack, etc.
     • Caching embellishments
     • Eventually, multithreading

  5. Things we knew we did not know
     • New requirements from EOS and ASCI
     • New applications that would use HDF5
     • How HDF5 would really perform in parallel
     • What new tools, features and options would be needed
     • New APIs, API features

  6. Things we didn’t know we didn’t know
     • Completely unanticipated applications
     • New data types and structures
       • E.g. DNA sequences
     • New operations
       • E.g. write many real-time streams simultaneously

  7. HDF5 1.8 topics
     • Dataset and datatype improvements
     • Group improvements
     • Link revisions
     • Shared object header messages
     • Metadata cache improvements
     • Other improvements
     • Platform-specific changes
     • High level APIs
     • Parallel HDF5
     • Tool improvements

  8. Dataset and Datatype Improvements

  9. Text-based datatype descriptions
     • Why:
       • Simplify datatype creation
       • Make datatype creation code more readable
       • Facilitate debugging by printing the text description of a datatype
     • What:
       • New routines to convert between a datatype and its text description: H5LTtext_to_dtype (text to datatype) and H5LTdtype_to_text (datatype to text)

  10. Text datatype description – Example
      • Create a datatype of compound type.
        /* Create the datatype from a text description */
        dtype = H5Ttext_to_type("typedef struct foo {int a; float b;} foo_t;");
        /* Convert the datatype back to text */
        H5Ttype_to_text(dtype, NULL, H5T_C, &tsize);

  11. Serialized datatypes and dataspaces
      • Why:
        • Allow datatype and dataspace info to be transmitted between processes
        • Allow datatype/dataspace to be stored in non-HDF5 files
      • What:
        • A new set of routines to serialize/deserialize HDF5 datatypes and dataspaces

  12. Int to float conversion during I/O
      • Why: Convert ints to floats during I/O
      • What: Int to float conversion supported during I/O

  13. Revised conversion exception handling
      • Why: Give apps greater control over exceptions (range errors, etc.) during datatype conversion
      • What: Revised conversion exception handling

  14. Revised conversion exception handling
      • To handle exceptions during conversions, register a handling function through H5Pset_type_conv_cb().
      • Cases of exception:
        • H5T_CONV_EXCEPT_RANGE_HI
        • H5T_CONV_EXCEPT_RANGE_LOW
        • H5T_CONV_EXCEPT_TRUNCATE
        • H5T_CONV_EXCEPT_PRECISION
        • H5T_CONV_EXCEPT_PINF
        • H5T_CONV_EXCEPT_NINF
        • H5T_CONV_EXCEPT_NAN
      • Return values: H5T_CONV_ABORT, H5T_CONV_UNHANDLED, H5T_CONV_HANDLED
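
The callback mechanism on this slide can be sketched in plain C without the library. This is only an illustration of the control flow: the enum, typedef, and function names (`conv_except_t`, `conv_ret_t`, `clamp_on_range`, `convert_int_to_schar`) are hypothetical stand-ins that mirror the names listed above. They are not the HDF5 declarations, and the real callback registered through H5Pset_type_conv_cb() takes more arguments than shown here.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-ins mirroring the slide's names; NOT the HDF5 types. */
typedef enum { EXCEPT_RANGE_HI, EXCEPT_RANGE_LOW } conv_except_t;
typedef enum { CONV_ABORT = -1, CONV_UNHANDLED = 0, CONV_HANDLED = 1 } conv_ret_t;

/* Application-supplied handler: it may fill in *dst and report what it did. */
typedef conv_ret_t (*conv_except_cb)(conv_except_t kind, int src,
                                     signed char *dst, void *user_data);

/* Example handler: clamp out-of-range values and count how often it ran. */
static conv_ret_t clamp_on_range(conv_except_t kind, int src,
                                 signed char *dst, void *user_data)
{
    (void)src;
    *dst = (kind == EXCEPT_RANGE_HI) ? SCHAR_MAX : SCHAR_MIN;
    if (user_data != NULL)
        ++*(size_t *)user_data;
    return CONV_HANDLED;
}

/* Convert n ints to signed chars, consulting the callback on range errors.
 * Returns 0 on success, -1 if the callback asked to abort. */
static int convert_int_to_schar(const int *src, signed char *dst, size_t n,
                                conv_except_cb cb, void *user_data)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; i++) {
        if (src[i] > SCHAR_MAX || src[i] < SCHAR_MIN) {
            conv_except_t kind =
                (src[i] > SCHAR_MAX) ? EXCEPT_RANGE_HI : EXCEPT_RANGE_LOW;
            conv_ret_t r = cb ? cb(kind, src[i], &dst[i], user_data)
                              : CONV_UNHANDLED;
            if (r == CONV_ABORT)
                return -1;
            if (r == CONV_UNHANDLED) /* fallback policy of this sketch: clamp */
                dst[i] = (kind == EXCEPT_RANGE_HI) ? SCHAR_MAX : SCHAR_MIN;
        } else {
            dst[i] = (signed char)src[i];
        }
    }
    return 0;
}
```

Passing `clamp_on_range` with a `size_t` counter as `user_data` lets the application both repair overflowing values and learn how many exceptions occurred, which is the kind of control the slide describes.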

  15. Compression filter for n-bit data
      • Why: Compact storage for user-defined datatypes
      • What:
        • When data is stored on disk, padding bits are chopped off and only significant bits are stored
        • Supports most datatypes
        • Works with compound datatypes

  16. N-bit compression example
      • In memory, one value of N-Bit datatype is stored like this:
        | byte 3 | byte 2 | byte 1 | byte 0 |
        |????????|????SPPP|PPPPPPPP|PPPP????|
        S - sign bit, P - significant bit, ? - padding bit
      • After passing through the N-Bit filter, all padding bits are chopped off, and the bits are stored on disk like this:
        |    1st value    |    2nd value    |
        |SPPPPPPP PPPPPPPP|SPPPPPPP PPPPPPPP|...
      • Opposite (decompress) when going from disk to memory
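
The bit layout above (a 32-bit container with 16 significant bits starting at bit 4) can be reproduced with a few lines of C. This is a minimal sketch of the idea, not the library's filter code; the function names `nbit_extract` and `nbit_pack` are hypothetical, and packing most-significant-bit first is an assumption chosen to match the picture on the slide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pull the `precision` significant bits that start at bit `offset`
 * out of a 32-bit container, discarding the padding bits around them. */
static uint32_t nbit_extract(uint32_t value, unsigned offset, unsigned precision)
{
    assert(precision >= 1 && precision <= 32 && offset + precision <= 32);
    uint32_t mask = (precision == 32) ? UINT32_MAX
                                      : ((UINT32_C(1) << precision) - 1);
    return (value >> offset) & mask;
}

/* Pack n values back to back, most significant bit first, into `out`.
 * `out` must hold at least (n * precision + 7) / 8 bytes.
 * Returns the number of bytes written. */
static size_t nbit_pack(const uint32_t *in, size_t n, unsigned offset,
                        unsigned precision, uint8_t *out)
{
    size_t nbytes = (n * precision + 7) / 8;
    size_t bitpos = 0;

    for (size_t i = 0; i < nbytes; i++)
        out[i] = 0;

    for (size_t i = 0; i < n; i++) {
        uint32_t sig = nbit_extract(in[i], offset, precision);
        /* Emit the significant bits from the top bit down to bit 0. */
        for (unsigned b = precision; b-- > 0; bitpos++) {
            if ((sig >> b) & 1u)
                out[bitpos / 8] |= (uint8_t)(0x80u >> (bitpos % 8));
        }
    }
    return nbytes;
}
```

With `offset = 4` and `precision = 16`, two 4-byte values in memory shrink to 4 bytes on disk in this sketch, whatever the padding bits contained.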

  17. Offset+size storage filter
      • Why: Use less storage when less precision is needed
      • What:
        • Performs scale/offset operation on each value
        • Truncates result to fewer bits before storing
        • Currently supports integers and floats
      • Example:
        H5Pset_scaleoffset(dcr, H5Z_SO_INT, H5Z_SO_INT_MINBITS_DEFAULT);
        H5Dcreate(……, dcr);
        H5Dwrite(…);

  18. Example with floating-point type
      • Data: {104.561, 99.459, 100.545, 105.644}
      • Choose scaling factor: the decimal precision to keep, e.g. scale factor D = 2
        1. Find minimum value (offset): 99.459
        2. Subtract minimum value from each element
           Result: {5.102, 0, 1.086, 6.185}
        3. Scale data by multiplying by 10^D = 100
           Result: {510.2, 0, 108.6, 618.5}
        4. Round the data to integer
           Result: {510, 0, 109, 619}
        5. Pack and store using the minimum number of bits
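
The five steps above can be sketched in C as follows. This is an illustration under stated assumptions rather than the filter's implementation: the function name `scale_offset_quantize` is hypothetical, rounding is done by adding 0.5 and truncating, and the final packing step is reduced to computing how many bits each value would need.

```c
#include <assert.h>
#include <stddef.h>

/* Quantize n doubles: subtract the minimum, multiply by 10^d, and round to
 * the nearest integer. Writes the results to out, stores the minimum in
 * *offset, and returns the number of bits needed for the largest result. */
static unsigned scale_offset_quantize(const double *data, size_t n, unsigned d,
                                      unsigned long *out, double *offset)
{
    assert(data != NULL && out != NULL && offset != NULL && n > 0);

    /* Step 1: find the minimum value (the offset). */
    double min = data[0];
    for (size_t i = 1; i < n; i++)
        if (data[i] < min)
            min = data[i];

    /* 10^d, computed by repeated multiplication. */
    double scale = 1.0;
    for (unsigned k = 0; k < d; k++)
        scale *= 10.0;

    /* Steps 2-4: subtract, scale, round. Values are >= 0 after the
     * subtraction, so adding 0.5 and truncating rounds to nearest. */
    unsigned long max = 0;
    for (size_t i = 0; i < n; i++) {
        out[i] = (unsigned long)((data[i] - min) * scale + 0.5);
        if (out[i] > max)
            max = out[i];
    }

    /* Step 5 (size only): bits needed to represent the largest value. */
    unsigned bits = 0;
    while (max > 0) {
        bits++;
        max >>= 1;
    }
    *offset = min;
    return bits;
}
```

For the slide's data with D = 2, every quantized value is below 1024, so 10 bits per element suffice instead of 32 or 64. Note that an intermediate value sitting exactly on a .5 boundary, such as 618.5, may round either way in binary floating point.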

  19. “NULL” Dataspace
      • Why:
        • Allow datasets with no elements to be described
        • NetCDF 4 needed a “place holder” for attributes
      • What:
        • A dataset with no dimensions, no data

  20. Group improvements

  21. Access links by creation-time order
      • Why:
        • Allow iteration & lookup of group’s links (children) by creation order as well as by name order
        • Support netCDF access model for netCDF 4
      • What: Option to access objects in group according to relative creation time

  22. “Compact groups”
      • Why:
        • Save space and access time for small groups
        • If groups are small, they don’t need B-tree overhead
      • What:
        • Alternate storage for groups with few links
      • Example:
        • File with 11,600 groups
        • With original group structure, file size ~ 20 MB
        • With compact groups, file size ~ 12 MB
        • Total savings: 8 MB (40%)
        • Average savings/group: ~700 bytes
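
As a quick check of the figures on this slide, taking 1 MB as 10^6 bytes (an assumption; the slide does not say which convention it uses):

```latex
\[
\frac{20\ \text{MB} - 12\ \text{MB}}{20\ \text{MB}} = 40\%,
\qquad
\frac{8 \times 10^{6}\ \text{bytes}}{11{,}600\ \text{groups}} \approx 690\ \text{bytes per group}
\]
```

This agrees with the rounded "~700 bytes" quoted above.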

  23. Better large group storage
      • Why: Faster, more scalable storage and access for large groups
      • What: New format and method for storing groups with many links

  24. Intermediate group creation
      • Why:
        • Simplify creation of a series of connected groups
        • Avoid having to create each intermediate group separately, one by one
      • What:
        • Intermediate groups can be created when creating an object in a file, with one function call

  25. Example: add intermediate groups
      • Want to create “/A/B/C/dset1”
      • “A” exists, but “B/C/dset1” does not
        H5Dcreate(file_id, "/A/B/C/dset1", ..)
      • One call creates groups “B” & “C”, then creates “dset1”
      [Diagram: before the call, / contains only A; after the call, / contains A, which contains B, then C, then dset1]

  26. Link Revisions

  27. What are links?
      • Links connect groups to their members
      • “Hard” links point to a target by address
      • “Soft” links store the path to a target
      [Diagram: a root group with a hard link (<address>) and a soft link (“/target dataset”), both leading to the same dataset]

  28. New: External links
      • Why: Access objects by file & path within file
      • What:
        • Store location of file and path within that file
        • Can link across files
      [Diagram: in file1.h5, the root group holds an external link, “dataset EL”, that stores “file2.h5” and “target dataset”; in file2.h5, the root group holds a hard link (<address>) named “target dataset” to the dataset]

  29. New: User-defined links
      • Why:
        • Allow applications to create their own kinds of links and link operations, such as:
          • Create “hard” external link that finds an object by address
          • Create link that accesses a URL
          • Keep track of how often a link is accessed, or other behavior
      • What:
        • App can create new kinds of links by supplying custom callback functions
        • Can do anything HDF5 hard, soft, or external links do

  30. Shared Object Header Messages

  31. Shared object header messages
      • Why: metadata duplicated many times, wasting space
      • Example:
        • You create a file with 10,000 datasets
        • All use the same datatype and dataspace
        • HDF5 needs to write this information 10,000 times!
      [Diagram: Dataset 1, Dataset 2, and Dataset 3 each store their own datatype, dataspace, and data]

  32. Shared object header messages
      • What:
        • Enable messages to be shared automatically
        • HDF5 shares duplicated messages on its own!
      [Diagram: Dataset 1 and Dataset 2 share a single datatype and dataspace; each keeps its own data]

  33. Shared Messages
      • Happens automatically
        • Works with datatypes, dataspaces, attributes, fill values, and filter pipelines
        • Saves space if these objects are relatively large
        • May be faster if HDF5 can cache shared messages
      • Drawbacks
        • Usually slower than non-shared messages
        • Adds overhead to the file
          • Index for storing shared datatypes
          • 25 bytes per instance
        • Older library versions can’t read files with shared messages

  34. Two informal tests
      • File with 24 datasets, all with same big datatype
        • 26,000 bytes normally
        • 17,000 bytes with shared messages enabled
        • Saves 375 bytes per dataset
      • But, make a bad decision: invoke shared messages but only create one dataset…
        • 9,000 bytes normally
        • 12,000 bytes with shared messages enabled
        • Probably slower when reading and writing, too
      • Moral: shared messages can be a big help, but only in the right situation!
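
The two results on this slide work out as follows, using only the byte counts quoted above:

```latex
\[
\frac{26{,}000 - 17{,}000}{24\ \text{datasets}} = 375\ \text{bytes saved per dataset},
\qquad
12{,}000 - 9{,}000 = 3{,}000\ \text{bytes of added overhead for a single dataset}
\]
```

The first file comes out ahead because the cost of the shared-message index is spread over 24 datasets; in the second file there is nothing to share, so only the overhead remains.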

  35. Metadata cache improvements

  36. Metadata cache improvements
      • Why:
        • Improve I/O performance and memory usage when accessing many objects
      • What:
        • New metadata cache APIs
          • Control cache size
          • Monitor actual cache size and current hit rate
        • Under the hood: adaptive cache resizing
          • Automatically detects the current working set size
          • Sets max cache size to the working set size

  37. Metadata cache improvements
      • Note: most applications do not need to worry about the cache
      • See “Advanced topics” for details
      • And if you do see unusual memory growth or poor performance, please contact us. We want to help you.

  38. Other improvements

  39. New extendible error-handling API
      • Why: Enable app to integrate error reporting with HDF5 library error stack
      • What: New error handling API
        • H5Epush – push major and minor error ID on specified error stack
        • H5Eprint – print specified stack
        • H5Ewalk – walk through specified stack
        • H5Eclear – clear specified stack
        • H5Eset_auto – turn error printing on/off for specified stack
        • H5Eget_auto – return settings for specified stack traversal

  40. Attribute improvements
      • Why:
        • Use less storage when large numbers of attributes are attached to a single object
        • Iterate over or look up attributes by creation order
      • What:
        • Property to create index on the order in which the attributes are created
        • Improved attribute storage

  41. Support for Unicode Character Set
      • Why:
        • So apps can create names using Unicode
        • netCDF 4 needed this
      • What:
        • UTF-8 Unicode encoding now supported
        • For string datatypes, names of links and attributes
      • Example:
        H5Pset_char_encoding(lcpl_id, H5T_CSET_UTF8);
        H5Llink(file_id, "UTF-8 name", …, lcpl_id, …);

  42. Efficient copying of HDF5 objects
      • Why:
        • Enable apps to copy objects efficiently
      • What:
        • New routines to copy an object in an HDF5 file within the current file or to another file
        • Done at a low level in the HDF5 file, allowing
          • Entire group hierarchies to be copied quickly
          • Compressed datasets to be copied without going through a decompression/compression cycle

  43. Performance of object copy routines

  44. Data transformation filter
      • Why:
        • Apply arithmetic operations to data during I/O
      • What:
        • Data transformation filter
        • Transform expressed by algebraic formula
        • Only +, -, *, and / supported
      • Example:
        • Expression parameter set, such as x*(x-5)
        • When dataset read/written, x*(x-5) applied per element
        • When reading, values in file are unchanged
        • When writing, transformed data written to file
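
The read and write behavior described above can be illustrated with a small, library-free C sketch. The function names are hypothetical, and the sketch hard-codes the slide's expression x*(x-5) as a C function instead of parsing a formula string as the filter would.

```c
#include <assert.h>
#include <stddef.h>

/* The slide's example expression, x*(x-5), hard-coded for illustration. */
static int transform(int x)
{
    return x * (x - 5);
}

/* Read path: the caller's buffer receives transformed values;
 * the stored values are left untouched. */
static void read_with_transform(const int *stored, int *buf, size_t n)
{
    assert(stored != NULL && buf != NULL);
    for (size_t i = 0; i < n; i++)
        buf[i] = transform(stored[i]);
}

/* Write path: the transformed values are what end up in storage. */
static void write_with_transform(const int *buf, int *stored, size_t n)
{
    assert(buf != NULL && stored != NULL);
    for (size_t i = 0; i < n; i++)
        stored[i] = transform(buf[i]);
}
```

The asymmetry is the point: a read leaves the file contents as they were, while a write changes what is persisted.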

  45. Stackable Virtual File Drivers
      • What is a Virtual File Driver (VFD)?

  46. Structure of HDF5 Library
      • Object API (C, Fortran 90, Java, C++)
        • Specify objects and transformation properties
        • Invoke data movement operations and data transformations
      • Library internals
        • Performs data transformations and other prep for I/O
        • Configurable transformations (compression, etc.)
      • Virtual file I/O (C only)
        • Perform byte-stream I/O operations (open/close, read/write, seek)
        • User-implementable I/O (stdio, network, memory, etc.)

  47. Stackable VFD
      • HDF5 VFD allows
        • Storing data using different physical file layouts, e.g. Family VFD (writes file as “family of files”)
        • Doing different types of I/O, e.g. stdio (standard I/O); MPI-I/O (for parallel I/O)

  48. Stackable VFD
      • Why “stackable”:
        • Before now, only one VFD could be used at a time
        • VFDs could not interoperate
      • What is “stackable”:
        • A non-terminal VFD may stack on top of compatible non-terminal VFDs and, eventually, a terminal VFD
      • Two kinds of VFD
        • Non-terminal (e.g. Family)
        • Terminal (e.g. stdio; MPI-I/O)

  49. Stackable VFD
      [Diagram: Application, then HDF5 API, then non-terminal VFDs (split, which separates metadata and rawdata; Family), then terminal VFDs (Sec2, stdio, mpiio), then HDF5 files; a default I/O path is also shown]
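
One way to picture the stacking is a non-terminal driver that forwards each request to a driver underneath it. The sketch below is purely illustrative: `vfd_t`, `mem_vfd_t`, `family_vfd_t`, and the single `write` operation are hypothetical and far simpler than the HDF5 driver interface. An in-memory buffer stands in for a terminal driver, and a family-style driver splits one logical address space across several members.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical driver interface: one operation, write bytes at an offset. */
typedef struct vfd {
    int (*write)(struct vfd *self, size_t offset, const void *buf, size_t len);
} vfd_t;

/* Terminal driver: performs the actual byte-stream I/O (here, into memory). */
typedef struct {
    vfd_t base;
    unsigned char bytes[16];
} mem_vfd_t;

static int mem_write(vfd_t *self, size_t offset, const void *buf, size_t len)
{
    mem_vfd_t *m = (mem_vfd_t *)self;
    if (offset > sizeof m->bytes || len > sizeof m->bytes - offset)
        return -1;
    memcpy(m->bytes + offset, buf, len);
    return 0;
}

/* Non-terminal driver: splits one logical address space across members,
 * each of which is itself a driver stacked underneath. */
typedef struct {
    vfd_t base;
    size_t member_size;
    size_t nmembers;
    vfd_t **members;
} family_vfd_t;

static int family_write(vfd_t *self, size_t offset, const void *buf, size_t len)
{
    family_vfd_t *f = (family_vfd_t *)self;
    const unsigned char *p = buf;

    assert(f->member_size > 0);
    while (len > 0) {
        size_t idx = offset / f->member_size;   /* which member file */
        size_t local = offset % f->member_size; /* offset inside it  */
        size_t chunk = f->member_size - local;
        if (chunk > len)
            chunk = len;
        if (idx >= f->nmembers)
            return -1;
        vfd_t *lower = f->members[idx];
        if (lower->write(lower, local, p, chunk) != 0)
            return -1;
        offset += chunk;
        p += chunk;
        len -= chunk;
    }
    return 0;
}
```

Because the family driver only ever calls `write` on whatever sits below it, the members could just as well be other non-terminal drivers, which is what makes the arrangement stackable.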

  50. Platform-specific changes
