HDF5 Speedup Workshop: Understanding New Features and Enhancements

This workshop explores the latest features in HDF5 1.8.0, addressing deficiencies in the initial design and meeting new requirements. Learn about dataset and datatype improvements, error handling, compression filters, and more. Discover how to simplify data type creation and handle conversion exceptions for better control over your data. Get an overview of serialized datatypes and dataspace and how to convert between integer and float types seamlessly during I/O operations. For detailed information, visit the official HDF5 documentation.

ldouthit

Presentation Transcript


  1. New Features in HDF5 SPEEDUP Workshop - HDF5 Tutorial

  2. Why new features?

  3. Why new features?
     • HDF5 1.8.0 was released in February 2008
     • Major update of the HDF5 1.6.* series (stable set of features and APIs since 1998)
       • New features
       • 200 new APIs
       • Changes to file format
       • Changes to APIs
       • Backward compatible
     • New releases in November 2008
       • HDF5 1.6.8 and 1.8.2
       • Minor bug fixes
       • Support for new platforms and compilers

  4. Information about the release
     http://www.hdfgroup.org/HDF5/doc/
     Follow the “New Features and Compatibility Issues” links

  5. Why new features?
     • Need to address some deficiencies in initial design
     • Examples:
       • Big overhead in file sizes
       • Non-tunable metadata cache implementation
       • Handling of free space in a file

  6. Why new features?
     • Need to address new requirements
     • Add support for
       • New types of indexing (object creation order)
       • Big volumes of variable-length data (DNA sequences)
       • Simultaneous real-time streams (fast append to one-dimensional datasets)
       • UTF-8 encoding for objects’ path names
       • Accessing objects stored in other HDF5 files (external or user-defined links)

  7. Outline
     • Dataset and datatype improvements
     • Group improvements
     • Link revisions
     • Shared object header messages
     • Metadata cache improvements
     • Error handling
     • Backward/forward compatibility
     • HDF5 and NetCDF-4

  8. Dataset and Datatype Improvements

  9. Text-based data type descriptions
     • Why:
       • Simplify data type creation
       • Make data type creation code more readable
       • Facilitate debugging by printing the text description of a data type
     • What:
       • New routines to create an HDF5 data type from a text description and to get a text description from an HDF5 data type

  10. Text data type description
      Example
          /* Create the data type from DDL text description */
          dtype = H5LTtext_to_dtype("H5T_IEEE_F32BE\n", H5LT_DDL);
          /* Query the length of the text description, then convert the data type back to text */
          H5LTdtype_to_text(dtype, NULL, H5LT_DDL, &str_len);
          dt_str = (char*)calloc(str_len, sizeof(char));
          H5LTdtype_to_text(dtype, dt_str, H5LT_DDL, &str_len);

  11. Serialized datatypes and dataspaces
      • Why:
        • Allow datatype and dataspace info to be transmitted between processes
        • Allow datatype/dataspace to be stored in non-HDF5 files
      • What:
        • A new set of routines to serialize/deserialize HDF5 datatypes and dataspaces

  12. Serialized datatypes and dataspaces
      Example
          /* Find the buffer length and encode a datatype into buffer */
          status = H5Tencode(t_id, NULL, &cmpd_buf_size);
          cmpd_buf = (unsigned char*)calloc(1, cmpd_buf_size);
          H5Tencode(t_id, cmpd_buf, &cmpd_buf_size);
          /* Decode a binary description of a datatype and return a datatype handle */
          t_id = H5Tdecode(cmpd_buf);

  13. Integer to float convert during I/O
      • Why:
        • HDF5 1.6 and earlier supported conversion only within the same class (16-bit integer -> 32-bit integer, 64-bit float -> 32-bit float)
        • Conversion needed to support NetCDF 4 programming model
      • What:
        • Integer to float conversion supported during I/O

  14. Integer to float convert during I/O
      Example: conversion is transparent to application
          /* Create a dataset of 64-bit little-endian floating-point type */
          dset_id = H5Dcreate(loc_id, "Mydata", H5T_IEEE_F64LE, space_id,
                              H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
          /* Write integer data to "Mydata" */
          status = H5Dwrite(dset_id, H5T_NATIVE_INT, …);

  15. Revised conversion exception handling
      • Why:
        • Give apps greater control over exceptions (range errors, etc.) during datatype conversion
        • Needed to support NetCDF 4 programming model
      • What:
        • Revised conversion exception handling

  16. Revised conversion exception handling
      • To handle exceptions during conversions, register a handling function through H5Pset_type_conv_cb().
      • Cases of exception:
        • H5T_CONV_EXCEPT_RANGE_HI
        • H5T_CONV_EXCEPT_RANGE_LOW
        • H5T_CONV_EXCEPT_TRUNCATE
        • H5T_CONV_EXCEPT_PRECISION
        • H5T_CONV_EXCEPT_PINF
        • H5T_CONV_EXCEPT_NINF
        • H5T_CONV_EXCEPT_NAN
      • Return values: H5T_CONV_ABORT, H5T_CONV_UNHANDLED, H5T_CONV_HANDLED

  17. Compression filter for n-bit data
      • Why:
        • Compact storage for user-defined datatypes
      • What:
        • When data is stored on disk, padding bits are chopped off and only significant bits are stored
        • Supports most datatypes
        • Works with compound datatypes

  18. N-bit compression example
      • In memory, one value of N-Bit datatype is stored like this:
          | byte 3 | byte 2 | byte 1 | byte 0 |
          |????????|????SPPP|PPPPPPPP|PPPP????|
          S - sign bit, P - significant bit, ? - padding bit
      • After passing through the N-Bit filter, all padding bits are chopped off, and the bits are stored on disk like this:
          |    1st value    |    2nd value    |
          |SPPPPPPP PPPPPPPP|SPPPPPPP PPPPPPPP|...
      • Opposite (decompress) when going from disk to memory
      • Limited to integer and floating-point data
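
The bit layout above can be mimicked with plain shifts and masks. This is a minimal sketch of the idea only, not the library's filter implementation. The function names are hypothetical, and it assumes a 32-bit container with `offset < 32` and `offset + precision <= 32`.

```c
#include <stdint.h>

/* Keep only the `precision` significant bits that start `offset` bits
 * above bit 0; the padding bits on both sides are dropped. */
uint32_t nbit_pack(uint32_t word, unsigned offset, unsigned precision)
{
    uint32_t mask = (precision >= 32) ? 0xFFFFFFFFu : ((1u << precision) - 1u);
    return (word >> offset) & mask;
}

/* Put the significant bits back at `offset`; padding bits come back as 0. */
uint32_t nbit_unpack(uint32_t packed, unsigned offset, unsigned precision)
{
    uint32_t mask = (precision >= 32) ? 0xFFFFFFFFu : ((1u << precision) - 1u);
    return (packed & mask) << offset;
}
```

With the slide's layout (precision 16, offset 4), `nbit_pack(0xFFF1234Fu, 4, 16)` keeps `0x1234`, and `nbit_unpack(0x1234u, 4, 16)` yields `0x00012340`. The original padding bits are not restored.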

  19. N-bit compression example
      Example
          /* Create a N-bit datatype */
          dt_id = H5Tcopy(H5T_STD_I32LE);
          H5Tset_precision(dt_id, 16);
          H5Tset_offset(dt_id, 4);
          /* Create and write a dataset */
          dcpl_id = H5Pcreate(H5P_DATASET_CREATE);
          H5Pset_chunk(dcpl_id, …);
          H5Pset_nbit(dcpl_id);
          dset_id = H5Dcreate(…,…,…,…,…,dcpl_id,…);
          H5Dwrite(dset_id,…,…,…,…,buf);

  20. Offset+size storage filter
      • Why:
        • Use less storage when less precision is needed
      • What:
        • Performs scale/offset operation on each value
        • Truncates result to fewer bits before storing
        • Currently supports integers and floats
        • Precision may be lost

  21. Example with floating-point type
      • Data: {104.561, 99.459, 100.545, 105.644}
      • Choose scaling factor: decimal precision to keep, e.g. scale factor D = 2
        1. Find minimum value (offset): 99.459
        2. Subtract minimum value from each element
           Result: {5.102, 0, 1.086, 6.185}
        3. Scale data by multiplying by 10^D = 100
           Result: {510.2, 0, 108.6, 618.5}
        4. Round the data to integer
           Result: {510, 0, 109, 619}
        5. Pack and store using min number of bits
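
The arithmetic in steps 1 to 4 can be sketched in a few lines of C. This illustrates the steps on the slide and is not the library's filter code. The function names are hypothetical, and half-up rounding is an assumption. The last value, 618.5, sits exactly on a rounding boundary, so binary floating-point may produce 618 or 619 for it.

```c
#include <stddef.h>

/* Steps 1-4: subtract the minimum, scale by 10^d, round to integer.
 * Returns the offset (minimum), which is stored alongside the packed data. */
double scaleoffset_quantize(const double *data, size_t n, int d, long *out)
{
    double min = data[0];
    for (size_t i = 1; i < n; i++)
        if (data[i] < min)
            min = data[i];

    double scale = 1.0;
    for (int i = 0; i < d; i++)
        scale *= 10.0;

    for (size_t i = 0; i < n; i++)
        out[i] = (long)((data[i] - min) * scale + 0.5); /* value is >= 0 */

    return min;
}

/* Step 5 input: number of bits needed to hold values in 0..max_value
 * (0 when every value equals the minimum). */
unsigned min_bits(long max_value)
{
    unsigned bits = 0;
    while (max_value > 0) {
        bits++;
        max_value >>= 1;
    }
    return bits;
}
```

For the slide's data with D = 2, the largest quantized value needs 10 bits, compared with 32 or 64 bits for the original floating-point values. A value is approximately restored as `offset + q / 10^D`.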

  22. Offset+size storage filter
      Example
          /* Use scale+offset filter on integer data; let library figure out the
             minimum number of bits necessary to store the data without loss of precision */
          H5Pset_scaleoffset(dcpl_id, H5Z_SO_INT, H5Z_SO_INT_MINBITS_DEFAULT);
          H5Pset_chunk(dcpl_id,…,…);
          dset_id = H5Dcreate(…,…,…,…,…,dcpl_id,…);
          /* Use scale+offset filter on floating-point data; compression may be lossy */
          H5Pset_scaleoffset(dcpl_id, H5Z_SO_FLOAT_DSCALE, 2);

  23. “NULL” Dataspace
      • Why:
        • Allow datasets with no elements to be described
        • NetCDF 4 needed a “place holder” for attributes
      • What:
        • A dataset with no dimensions, no data

  24. NULL dataspace
      Example
          /* Create a dataset with "NULL" dataspace */
          sp_id = H5Screate(H5S_NULL);
          dset_id = H5Dcreate(…,"IntArray",…,sp_id,…,…,…);

          HDF5 "SDS.h5" {
          GROUP "/" {
             DATASET "IntArray" {
                DATATYPE H5T_STD_I32LE
                DATASPACE NULL
                DATA {
                }
             }
          }
          }

  25. HDF5 file format revision

  26. HDF5 file format revision
      • Why:
        • Address deficiencies of the original file format
        • Address space overhead in an HDF5 file
        • Enable new features
      • What:
        • New routine that instructs the HDF5 library to create all objects using the latest version of the HDF5 file format (compare with the default: the earliest version in which the object became available, e.g. array datatype)
        • Will talk about the versioning later

  27. HDF5 file format revision
      Example
          /* Use the latest version of a file format for each object created in a file */
          fapl_id = H5Pcreate(H5P_FILE_ACCESS);
          H5Pset_latest_format(fapl_id, 1);
          fid = H5Fcreate(…,…,…,fapl_id);
          /* or */
          fid = H5Fopen(…,…,fapl_id);

  28. Group Revisions

  29. Better large group storage
      • Why:
        • Faster, more scalable storage and access for large groups
      • What:
        • New format and method for storing groups with many links

  30. Informal benchmark
      • Create a file and a group in the file
      • Create up to 10^6 groups with one dataset in each group
      • Compare file sizes and performance of HDF5 1.8.1 using the latest group format with those of HDF5 1.8.1 (default, old format) and 1.6.7
      • Note: Default 1.8.1 and 1.6.7 became very slow after 700000 groups

  31. Time to open and read a dataset

  32. Time to close the file

  33. File size

  34. Access links by creation-time order
      • Why:
        • Allow iteration & lookup of group’s links (children) by creation order as well as by name order
        • Support netCDF access model for netCDF 4
      • What:
        • Option to access objects in group according to relative creation time

  35. Access links by creation-time order
      Example
          /* Track and index creation order of the links */
          H5Pset_link_creation_order(gcpl_id,
              (H5P_CRT_ORDER_TRACKED | H5P_CRT_ORDER_INDEXED));
          /* Create a group */
          gid = H5Gcreate(fid, GNAME, H5P_DEFAULT, gcpl_id, H5P_DEFAULT);

  36. Example: h5dump --group=1 tordergr.h5
          HDF5 "tordergr.h5" {
          GROUP "1" {
             GROUP "a" {
                GROUP "a1" {
                }
                GROUP "a2" {
                   GROUP "a21" {
                   }
                   GROUP "a22" {
                   }
                }
             }
             GROUP "b" {
             }
             GROUP "c" {
             }
          }
          }

  37. Example: h5dump --sort_by=creation_order
          HDF5 "tordergr.h5" {
          GROUP "1" {
             GROUP "c" {
             }
             GROUP "b" {
             }
             GROUP "a" {
                GROUP "a1" {
                }
                GROUP "a2" {
                   GROUP "a22" {
                   }
                   GROUP "a21" {
                   }
                }
             }
          }
          }

  38. “Compact groups”
      • Why:
        • Save space and access time for small groups
        • If groups are small, they don’t need B-tree overhead
      • What:
        • Alternate storage for groups with few links
        • Default storage when “latest format” is specified
        • Library converts to “original” storage (B-tree based) using default or user-specified threshold

  39. “Compact groups”
      • Example
        • File with 11,600 groups
        • With original group structure, file size ~ 20 MB
        • With compact groups, file size ~ 12 MB
        • Total savings: 8 MB (40%)
        • Average savings/group: ~700 bytes

  40. Compact groups
      Example
          /* Change storage to "dense" if number of group members is bigger than 16
             and go back to compact storage if number of group members is smaller than 12 */
          H5Pset_link_phase_change(gcpl_id, 16, 12);
          /* Create a group */
          g_id = H5Gcreate(…,…,…,gcpl_id,…);

  41. Intermediate group creation
      • Why:
        • Simplify creation of a series of connected groups
        • Avoid having to create each intermediate group separately, one by one
      • What:
        • Intermediate groups can be created when creating an object in a file, with one function call

  42. Intermediate group creation
      • Want to create “/A/B/C/dset1”
      • “A” exists, but “B/C/dset1” do not
      • One call creates groups “B” & “C”, then creates “dset1”
      [Diagram: before the call, “/” contains “A”; after the call, “/” contains “A”, “B”, “C” and “dset1” in one chain]

  43. Intermediate group creation
      Example
          /* Create link creation property list */
          lcrp_id = H5Pcreate(H5P_LINK_CREATE);
          /* Set flag for intermediate group creation;
             groups B and C will be created automatically */
          H5Pset_create_intermediate_group(lcrp_id, TRUE);
          ds_id = H5Dcreate(file_id, "/A/B/C/dset1",…,…,lcrp_id,…,…);

  44. Link Revisions

  45. What are links?
      • Links connect groups to their members
      • “Hard” links point to a target by address
      • “Soft” links store the path to a target
      [Diagram: the root group holds a hard link storing <address> and a soft link storing “/target dataset”; both lead to the dataset]

  46. New: External Links
      • Why:
        • Access objects stored in other HDF5 files in a transparent way
      • What:
        • Store location of file and path within that file
        • Can link across files

  47. New: External Links
      External link object “External_link” in file1.h5 points to the group /A/B/C/D/E in file2.h5
      [Diagram: in the root group of file1.h5, “External_link” stores the file name “file2.h5” and the path “/A/B/C/D/E”; the target object is a group under the root group of file2.h5]

  48. External links
      Example
          /* Create an external link */
          H5Lcreate_external(TARGET_FILE, "/A/B/C/D/E", source_file_id,
                             "External_link", …,…);
          /* We will use external link to create a group in a target file */
          gr_id = H5Gcreate(source_file_id, "External_link/F",…,…,…,…);
          /* We can access group "External_link/F" in the source file and
             group "/A/B/C/D/E/F" in the target file */

  49. New: User-defined Links
      • Why:
        • Allow applications to create their own kinds of links and link operations, such as
          • Create “hard” external link that finds an object by address
          • Create link that accesses a URL
          • Keep track of how often a link is accessed, or other behavior
      • What:
        • Applications can create new kinds of links by supplying custom callback functions
        • Can do anything HDF5 hard, soft, or external links do

  50. Traversing an HDF5 file
