
HDF5 and H5py

HDF5 is a binary file format for storing large scientific datasets that provides an efficient way to organize and access data. It defines an abstract data model of files, groups, datasets and attributes. HDF5 files can be accessed using the h5py Python library, which mimics NumPy arrays to allow easy manipulation of HDF5 datasets from Python.

Uploaded by

Jaime Bala Norma

HDF5 and h5py

Ezequiel Cimadevilla Álvarez


ezequiel.cimadevilla@unican.es

Santander Meteorology Group


ETSI Caminos, Department of Applied Mathematics and Computer Sciences
University of Cantabria
Avenida de los Castros s/n
39005 Santander, Spain
http://www.meteo.unican.es

Máster Data Science/Ciencia de Datos - 2020/2021


From previous class
Text files vs binary files

● In general, binary files are more memory efficient and they are open to
optimizations
○ Shuffling and chunking in netCDF
○ Under the hood, netCDF uses B-Trees and other data structures (same in SQL)
○ Parallel file systems (Lustre, GPFS) in HPC can read/write in parallel
● What is a ‘string’ anyway?
○ See this and this
● Scientific data is about floating point numbers

● CSV - Limited to structured data, what about metadata?
○ JSON could solve this but still not memory efficient (see this)
● Text data favors readability
● Binary data favors efficiency
● Text data can be opened anywhere
● Binary data requires a software library to visualize and deal with it
● Text data is portable while binary data requires additional work to make it
portable (self-describing data formats)
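
The size gap between text and binary encodings of floating point numbers can be illustrated with a minimal sketch that uses only the Python standard library. The sample values and the one-value-per-line text layout are assumptions made for illustration; they do not come from the slides.

```python
import random
import struct

random.seed(0)  # fixed seed so the run is repeatable
values = [random.random() for _ in range(1000)]

# Text encoding: one decimal representation per line, as in a CSV column.
text_bytes = "\n".join(repr(v) for v in values).encode("ascii")

# Binary encoding: each value as an 8-byte IEEE 754 double, little-endian.
binary_bytes = struct.pack("<1000d", *values)

# The binary form decodes back to exactly the same numbers.
decoded = list(struct.unpack("<1000d", binary_bytes))

print(len(text_bytes), "bytes as text")
print(len(binary_bytes), "bytes as binary")
```

The binary buffer is always 8 bytes per value, while the text form depends on how many digits each number needs and still has to be parsed before use.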
Outline
HDF5

● Abstract Data Model
● Abstract Storage Model
● HDF5 as netCDF backend

h5py
HDF5
HDF5
Hierarchical Data Format version 5

● Abstract data model, abstract storage model, open source library and file format
for storing and managing data
● Efficient I/O for high-volume and complex data
○ Chunked multidimensional arrays
○ netCDF4 relies on HDF5 for chunked storage
● Self-Describing data format (portability)
● Low level C and Fortran APIs
● High level wrappers
○ h5py, PyTables
HDF5
Abstract Data Model (ADM)

● Defines concepts for describing
complex data stored in files
○ File, Group, Dataset, Dataspace,
Datatype, Attribute, Property
List, Link
● Map application domain entities into
the HDF5 ADM
● The HDF5 library implements the
Abstract Data Model
● netCDF application data structures?
HDF5 - File
An HDF5 file (an object in itself) can be
thought of as a container (or group) that
holds a variety of heterogeneous data
objects (or datasets). The datasets can be
images, tables, graphs, and even
documents, such as PDF or Excel files.

The two primary objects in the HDF5 Data
Model are groups and datasets.
HDF5 - Groups
● HDF5 groups (and links) organize
data objects
● Every HDF5 file contains a root group
that can contain other groups or be
linked to objects in other files
● Similar to directories and files in
UNIX
○ / - root group
○ /foo - member of the root group
called foo
○ /foo/zoo - member of the group
foo, member of the root group

There are two groups in the HDF5 file depicted above: Viz and SimOut. Under
the Viz group are a variety of images and a table that is shared with the
SimOut group. The SimOut group contains a 3-dimensional array, a
2-dimensional array and a link to a 2-dimensional array in another HDF5 file.
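
The UNIX-like path scheme can be tried out with h5py. The following is a minimal sketch, assuming h5py is installed; the file name and the `alias` soft link are hypothetical names chosen for illustration.

```python
import os
import tempfile

import h5py

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "groups.h5")  # hypothetical file name
    with h5py.File(path, "w") as f:
        foo = f.create_group("foo")    # /foo, member of the root group
        zoo = foo.create_group("zoo")  # /foo/zoo, member of the group foo

        # A soft link is a second name that points at an existing path.
        f["alias"] = h5py.SoftLink("/foo/zoo")

        root_name = f.name             # the file object doubles as the root group
        zoo_name = zoo.name
        root_members = sorted(f.keys())
        # Absolute and relative paths reach the same group.
        same_path = f["/foo/zoo"].name == foo["zoo"].name
        alias_present = "alias" in f

print(root_name, zoo_name, root_members)
```

Groups behave like dictionaries in h5py, so `f["foo"]["zoo"]` and `f["/foo/zoo"]` are two spellings of the same lookup.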
HDF5 - Datasets
● HDF5 datasets organize and contain
the “raw” data values
● A dataset consists of metadata that
describes the data, in addition to the
data itself
● Datatypes, dataspaces, properties
and (optional) attributes are HDF5
objects that describe a dataset
● The datatype describes the individual
data elements

In the picture above, the data is stored as a three dimensional dataset of
size 4 x 5 x 6 with an integer datatype. It contains attributes, Time and
Pressure, and the dataset is chunked and compressed.
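
A dataset like the one described (4 x 5 x 6, integer, with Time and Pressure attributes, chunked and compressed) might be created in h5py as sketched below. The attribute values, the chunk shape and the dataset name are assumptions made up for illustration.

```python
import os
import tempfile

import h5py
import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dataset.h5")  # hypothetical file name
    with h5py.File(path, "w") as f:
        data = np.arange(4 * 5 * 6, dtype="int32").reshape(4, 5, 6)
        dset = f.create_dataset(
            "data",
            data=data,
            chunks=(2, 5, 6),      # illustrative chunk shape
            compression="gzip",
        )
        # Illustrative attribute values, not taken from the slides.
        dset.attrs["Time"] = 32.4
        dset.attrs["Pressure"] = 987.0

        # The metadata travels with the dataset.
        summary = {
            "shape": dset.shape,
            "dtype": str(dset.dtype),
            "chunks": dset.chunks,
            "compression": dset.compression,
            "attrs": sorted(dset.attrs.keys()),
        }

print(summary)
```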
HDF5 - Datatypes
The datatype describes the individual data
elements in a dataset

It provides complete information for data
conversion to or from that datatype

Datatypes in HDF5 can be grouped into:

● Predefined
● Derived - String and compound
In the dataset depicted above each element of the dataset is a 32-bit
integer.
HDF5 - Datatypes
● Pre-Defined Datatypes
○ Created by HDF5
○ They are actually opened and
closed by HDF5 and can have
different values from one HDF5
session to the next
■ Standard - H5T_IEEE_F32BE
■ Native - H5T_NATIVE_INT
● Derived Datatypes
○ Derived from the pre-defined
datatypes
■ String, Compound

This is an example of a dataset with a compound datatype. Each element in
the dataset consists of a 16-bit integer, a character, a 32-bit integer, and a
2x3x2 array of 32-bit floats (the datatype). It is a 2-dimensional 5 x 3 array
(the dataspace). The datatype should not be confused with the dataspace.
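
In h5py, a compound datatype is usually expressed as a NumPy structured dtype. The sketch below mirrors the element layout described above (16-bit integer, character, 32-bit integer, 2x3x2 array of 32-bit floats) on a 5 x 3 dataspace. The field names `a` to `d` and the stored values are hypothetical.

```python
import os
import tempfile

import h5py
import numpy as np

# The datatype: what one element looks like.
compound = np.dtype([
    ("a", "<i2"),             # 16-bit integer
    ("b", "S1"),              # single character
    ("c", "<i4"),             # 32-bit integer
    ("d", "<f4", (2, 3, 2)),  # 2x3x2 array of 32-bit floats
])

# The dataspace: how many elements there are and how they are arranged.
records = np.zeros((5, 3), dtype=compound)
records["a"] = 7
records["b"] = b"x"
records["c"] = 42
records["d"] = 1.5

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "compound.h5")  # hypothetical file name
    with h5py.File(path, "w") as f:
        dset = f.create_dataset("records", data=records)
        field_names = dset.dtype.names
        dset_shape = dset.shape
        back = dset[...]  # read everything back as a structured array

element = back[0, 0]
print(field_names, dset_shape, element["d"].shape)
```

The shape of the dataset (5, 3) and the shape of the `d` field (2, 3, 2) are independent, which is the datatype versus dataspace distinction in practice.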
HDF5 - Dataspaces
A dataspace describes the layout of a
dataset’s data elements

It can consist of no elements (NULL), a
single element (scalar), or a simple array

A dataspace can have dimensions that are
fixed (unchanging) or unlimited, which
means they can grow in size (i.e. they are
extendible)
This image illustrates a dataspace that is an array with dimensions of
5 x 3 and a rank (number of dimensions) of 2.
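
The dataspace kinds listed above (NULL, scalar, simple array, fixed or unlimited dimensions) can be sketched with h5py as follows. The dataset names are hypothetical, and the example assumes an h5py version that provides `h5py.Empty`.

```python
import os
import tempfile

import h5py

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dataspaces.h5")  # hypothetical file name
    with h5py.File(path, "w") as f:
        # Simple 5 x 3 array with fixed dimensions (rank 2).
        fixed = f.create_dataset("fixed", shape=(5, 3), dtype="f8")

        # Same starting shape, but the first dimension is unlimited.
        growable = f.create_dataset(
            "growable", shape=(5, 3), maxshape=(None, 3), dtype="f8"
        )
        growable.resize((8, 3))  # extend along the unlimited dimension

        # Scalar dataspace: a single element.
        scalar = f.create_dataset("scalar", data=3.14)

        # NULL dataspace: a datatype but no elements.
        empty = f.create_dataset("empty", data=h5py.Empty("f8"))

        info = {
            "fixed": (fixed.shape, fixed.maxshape, fixed.ndim),
            "growable": (growable.shape, growable.maxshape),
            "scalar": scalar.shape,
            "empty": empty.shape,
        }

print(info)
```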
HDF5 - Dataspaces
● Contain the spatial information
(logical layout) of a dataset stored in a
file

● Rank and dimensions of a dataset are
a permanent part of the dataset
definition

● Describe an application’s data buffers
and data elements participating in
I/O; it can be used to select a subset
of a dataset

The dataspace is used to describe both the logical layout of a dataset and a
subset of a dataset.
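
Selecting a subset of a dataset shows up in h5py as NumPy-style slicing, and `read_direct` lets a selection in the file be paired with a selection in a memory buffer. A minimal sketch, with a hypothetical dataset name and made-up values:

```python
import os
import tempfile

import h5py
import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "selection.h5")  # hypothetical file name
    with h5py.File(path, "w") as f:
        data = np.arange(15, dtype="int32").reshape(5, 3)
        dset = f.create_dataset("grid", data=data)

        # File-side selection: only these four elements are read.
        block = dset[1:3, 0:2]

        # Memory-side selection: file rows 0-1 land in buffer rows 3-4.
        buf = np.zeros((5, 3), dtype="int32")
        dset.read_direct(buf, np.s_[0:2, :], np.s_[3:5, :])

        # Partial write: replace the last row only.
        dset[4, :] = [-1, -1, -1]
        last_column = dset[:, 2]

print(block)
print(buf)
print(last_column)
```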
HDF5 - Property lists
A property is a characteristic or feature of an
HDF5 object. There are default properties
which handle the most common needs. These
default properties can be modified using the
HDF5 Property List API to take advantage of
more powerful or unusual features of HDF5
objects.

For example, the data storage layout property
of a dataset is contiguous by default. For better
performance, the layout can be modified to be
chunked or chunked and compressed.
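
h5py exposes the most common dataset creation properties as keyword arguments of `create_dataset`, and the underlying creation property list can still be inspected through the low-level API. The sketch below compares a default dataset with a tuned one; the dataset names, chunk shape and compression level are illustrative assumptions.

```python
import os
import tempfile

import h5py

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "properties.h5")  # hypothetical file name
    with h5py.File(path, "w") as f:
        # Default properties: contiguous layout, no filters.
        default = f.create_dataset("default", shape=(100, 100), dtype="f4")

        # Modified properties: chunked, shuffled and gzip-compressed.
        tuned = f.create_dataset(
            "tuned",
            shape=(100, 100),
            dtype="f4",
            chunks=(10, 100),
            shuffle=True,
            compression="gzip",
            compression_opts=4,
        )

        # Read the layout back from each dataset creation property list.
        default_layout = default.id.get_create_plist().get_layout()
        tuned_layout = tuned.id.get_create_plist().get_layout()

        default_is_contiguous = default_layout == h5py.h5d.CONTIGUOUS
        tuned_is_chunked = tuned_layout == h5py.h5d.CHUNKED
        default_chunks = default.chunks
        tuned_settings = (
            tuned.chunks, tuned.compression, tuned.compression_opts, tuned.shuffle
        )

print(default_is_contiguous, tuned_is_chunked, tuned_settings)
```

Whether chunking and compression actually improve performance depends on the access pattern, so the chunk shape is worth choosing to match how the data will be read.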
HDF5 - Attributes
Attributes can optionally be associated with HDF5 objects. They have two parts: a
name and a value. Attributes are accessed by opening the object that they are
attached to so are not independent objects. Typically an attribute is small in size and
contains user metadata about the object that it is attached to.

Attributes look similar to HDF5 datasets in that they have a datatype and dataspace.
However, they do not support partial I/O operations, and they cannot be compressed
or extended.
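
In h5py, attributes are reached through the `.attrs` property of the object they are attached to, which behaves like a small dictionary. A minimal sketch with hypothetical names and values:

```python
import os
import tempfile

import h5py
import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "attributes.h5")  # hypothetical file name
    with h5py.File(path, "w") as f:
        dset = f.create_dataset("temperature", data=np.zeros((4, 5)))

        # Attributes hang off the object; they are not addressed by their own path.
        dset.attrs["units"] = "K"
        dset.attrs["valid_range"] = np.array([150.0, 350.0])
        f.attrs["title"] = "example file"  # attached to the root group

    with h5py.File(path, "r") as f:
        dset = f["temperature"]
        attr_names = sorted(dset.attrs.keys())
        units = dset.attrs["units"]
        # An attribute is always read whole; there is no partial I/O.
        valid_range = dset.attrs["valid_range"]
        title = f.attrs["title"]

print(attr_names, units, valid_range, title)
```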
HDF5
Since netCDF version 4, netCDF files
in the netCDF-4 format are HDF5 files

Originally in netCDF

● No chunks
● No groups

HDF5 backend introduced
chunks and groups
HDF5
Abstract Storage Model

● Defines how HDF5 objects and data are mapped to a linear address space
● The address space is assumed to be a contiguous array of bytes stored on some
random access medium
HDF5
Abstract Storage Model

● Level 0: File signature and super block - Information about the file
● Level 1: File infrastructure - Information about B-trees and heaps
● Level 2: Data object - Data objects (data + metadata)
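
The Level 0 file signature can be observed directly by reading the first bytes of a file. The sketch below assumes a file written by h5py with default settings (no user block), so the 8-byte signature sits at offset 0; the file and dataset names are hypothetical.

```python
import os
import tempfile

import h5py

# The 8-byte HDF5 file signature.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "signature.h5")  # hypothetical file name
    with h5py.File(path, "w") as f:
        f.create_dataset("x", data=[1, 2, 3])

    # Read the raw bytes, bypassing the HDF5 library entirely.
    with open(path, "rb") as raw:
        first_bytes = raw.read(8)

    # The library performs the same kind of check.
    recognized = h5py.is_hdf5(path)

print(first_bytes, recognized)
```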
HDF5
Abstract Storage Model

● The concept of an HDF5 file is actually rather abstract
● The address space for what is normally thought of as an HDF5 file might
correspond to any of the following
○ Single file on standard file system
○ Multiple files on standard file system
○ Multiple files on parallel file system
○ Block of memory within application’s memory space
○ More abstract situations such as virtual files
HDF5
Abstract Storage Model

● Virtual File Drivers deal
with low level storage in
different systems
● Available drivers:
○ H5FD_SEC2 (POSIX)
○ H5FD_DIRECT
○ H5FD_LOG
○ H5FD_MULTI
○ H5FD_FAMILY
○ H5FD_CORE (RAM)
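
From h5py, a driver is selected with the `driver` argument of `h5py.File`. The sketch below uses the core (RAM) driver with `backing_store=False`, so the "file" lives only in memory and nothing is written to disk; the path and dataset name are hypothetical.

```python
import os
import tempfile

import h5py
import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "in_memory.h5")  # a name only; never created on disk

    # Core driver: the whole address space is a block of application memory.
    with h5py.File(path, "w", driver="core", backing_store=False) as f:
        f.create_dataset("x", data=np.arange(10))
        driver_name = f.driver
        total = int(f["x"][...].sum())

    exists_on_disk = os.path.exists(path)

print(driver_name, total, exists_on_disk)
```

The rest of the code is unchanged by the choice of driver, which is the point of separating the data model from the storage model.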
h5py
h5py
Pythonic interface to the HDF5 binary data format

It tries to mimic NumPy arrays, in order to allow storage of huge amounts of
numerical data and easy manipulation from Python

It uses straightforward NumPy and Python metaphors, like dictionary and NumPy
array syntax

● Iterate over datasets in a file
● Check out the .shape or .dtype attributes of datasets
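
Both points can be sketched in a few lines, independently of the course notebook. The file layout below (a dataset `a` and a nested dataset `grp/b`) is a hypothetical example.

```python
import os
import tempfile

import h5py
import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "walk.h5")  # hypothetical file name
    with h5py.File(path, "w") as f:
        f.create_dataset("a", data=np.zeros((2, 3), dtype="f4"))
        f.create_dataset("grp/b", data=np.arange(5, dtype="i8"))  # creates /grp too

    with h5py.File(path, "r") as f:
        # Dictionary-style iteration over the members of the root group.
        top_level = sorted(f)

        # Recursive walk that records .shape and .dtype of every dataset.
        found = {}

        def record(name, obj):
            if isinstance(obj, h5py.Dataset):
                found[name] = (obj.shape, str(obj.dtype))

        f.visititems(record)

        # NumPy-style slicing reads just the requested part.
        first_row = f["a"][0, :]

print(top_level)
print(found)
```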

Open the notebook and follow the examples


References
● HDF5 User Guide
● HDF5 File Format Specification
● https://simpsonlab.github.io/2015/05/19/io-performance/