HDF5 and H5py
● In general, binary files are more memory-efficient and open to optimizations
○ Shuffling and chunking in netCDF
○ Under the hood, netCDF uses B-trees and other data structures (as SQL databases do)
○ Parallel file systems (Lustre, GPFS) in HPC can read/write in parallel
● What is a ‘string’ anyway?
○ See this and this
● Scientific data is mostly floating-point numbers
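As a rough illustration of the memory-efficiency point, the same floating-point array can be serialized both ways and compared (sizes depend on the text format chosen; the defaults here are just one possibility):

```python
import io
import numpy as np

a = np.linspace(0.0, 1.0, 1000)   # 1000 float64 values

# Text: ~25 characters per value with np.savetxt's default '%.18e' format
text = io.StringIO()
np.savetxt(text, a)
text_size = len(text.getvalue())

# Binary: exactly 8 bytes per float64 value
binary_size = a.nbytes

print(text_size, binary_size)     # the text form is roughly 3x larger here
```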
From previous class: Text files vs binary files
h5py
HDF5
Hierarchical Data Format version 5
● Abstract data model, abstract storage model, open source library and file format
for storing and managing data
● Efficient I/O for high-volume and complex data
○ Chunked multidimensional arrays
○ netCDF4 relies on HDF5 for chunked storage
● Self-Describing data format (portability)
● Low level C and Fortran APIs
● High level wrappers
○ h5py, PyTables
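A minimal sketch of the high-level h5py wrapper in action (the file name, group, dataset, and attribute names are illustrative):

```python
import numpy as np
import h5py

# Write: create a file, a group, a dataset, and an attribute
with h5py.File("example.h5", "w") as f:
    grp = f.create_group("experiment")
    dset = grp.create_dataset("temperature", data=np.random.rand(100, 100))
    dset.attrs["units"] = "K"          # self-describing metadata travels with the data

# Read: the file carries its own structure and metadata
with h5py.File("example.h5", "r") as f:
    dset = f["experiment/temperature"]
    print(dset.shape, dset.dtype, dset.attrs["units"])
```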
HDF5
Abstract Data Model (ADM)
● Predefined
● Derived - String and compound
[Figure: an example dataset in which each element is a 32-bit integer]
HDF5 - Datatypes
● Pre-defined Datatypes
○ Created by HDF5
○ They are actually opened and closed by HDF5 and can have different values from one HDF5 session to the next
■ Standard - H5T_IEEE_F32BE
■ Native - H5T_NATIVE_INT
● Derived Datatypes
○ Derived from the pre-defined datatypes
■ String, Compound
[Figure: an example dataset with a compound datatype. Each element in the dataset consists of a 16-bit integer, a character, a 32-bit integer, and a 2x3x2 array of 32-bit floats (the datatype). It is a 2-dimensional 5 x 3 array (the dataspace). The datatype should not be confused with the dataspace.]
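The compound element described in the figure caption can be reproduced with a NumPy structured dtype in h5py (the field names are made up; the figure does not give them):

```python
import numpy as np
import h5py

# One element: a 16-bit int, a character, a 32-bit int,
# and a 2x3x2 array of 32-bit floats (the datatype)
dt = np.dtype([("a", "<i2"),
               ("b", "S1"),
               ("c", "<i4"),
               ("d", "<f4", (2, 3, 2))])

data = np.zeros((5, 3), dtype=dt)       # dataspace: 2-dimensional, 5 x 3

with h5py.File("compound.h5", "w") as f:
    f.create_dataset("table", data=data)
```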
HDF5 - Dataspaces
A dataspace describes the layout of a
dataset’s data elements
HDF5 - Attributes
Attributes look similar to HDF5 datasets in that they have a datatype and dataspace.
However, they do not support partial I/O operations, and they cannot be compressed
or extended.
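The difference is visible from h5py: datasets can be sliced (partial I/O reads only the requested region), while attributes are always read and written whole (file and names below are illustrative):

```python
import numpy as np
import h5py

with h5py.File("attrs.h5", "w") as f:
    dset = f.create_dataset("grid", data=np.arange(1_000_000).reshape(1000, 1000))
    dset.attrs["description"] = "toy grid"

with h5py.File("attrs.h5", "r") as f:
    dset = f["grid"]
    corner = dset[:10, :10]              # partial I/O: only this slab is read
    desc = dset.attrs["description"]     # an attribute comes back in one piece
```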
HDF5
Since netCDF version 4,
netCDF files are HDF5 files
Originally in netCDF
● No chunks
● No groups
HDF5
Abstract Storage Model
● Defines how HDF5 objects and data are mapped to a linear address space
● The address space is assumed to be a contiguous array of bytes stored on some random-access medium
● Level 0: File signature and super block - Information about the file
● Level 1: File infrastructure - Information about B-trees and heaps
● Level 2: Data object - Data objects (data + metadata)
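Chunked storage, the layout that the B-tree machinery above indexes, can be requested and inspected from h5py; the chunk shape and filters below are arbitrary choices, not requirements:

```python
import numpy as np
import h5py

with h5py.File("chunked.h5", "w") as f:
    dset = f.create_dataset("field",
                            shape=(1000, 1000),
                            dtype="f8",
                            chunks=(100, 100),   # stored as 100x100 tiles
                            shuffle=True,        # byte-shuffle filter before compression
                            compression="gzip")
    dset[:] = np.random.rand(1000, 1000)

with h5py.File("chunked.h5", "r") as f:
    d = f["field"]
    print(d.chunks, d.compression, d.shuffle)
```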
h5py
h5py uses straightforward NumPy and Python metaphors, like dictionary and NumPy
array syntax
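Those metaphors look like this in practice (file and dataset names are illustrative):

```python
import numpy as np
import h5py

with h5py.File("demo.h5", "w") as f:
    f["group/data"] = np.arange(10)      # dict-style assignment creates the path

with h5py.File("demo.h5", "r") as f:
    print(list(f.keys()))                # dictionary metaphor: ['group']
    print(f["group/data"][2:5])          # NumPy metaphor: slicing reads a slab
```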