HDF5 and H5py
● In general, binary files are more memory-efficient and open to optimizations
○ Shuffling and chunking in netCDF
○ Under the hood, netCDF uses B-trees and other data structures (as SQL databases do)
○ Parallel file systems (Lustre, GPFS) in HPC can read/write in parallel
● What is a ‘string’ anyway?
○ See this and this
● Scientific data is mostly floating-point numbers
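As a rough illustration of the memory-efficiency point, the same floating-point array can be serialized both ways and compared (sizes depend on the text format chosen; the defaults here are just one possibility):

```python
import io
import numpy as np

a = np.linspace(0.0, 1.0, 1000)   # 1000 float64 values

# Text: ~25 characters per value with np.savetxt's default '%.18e' format
text = io.StringIO()
np.savetxt(text, a)
text_size = len(text.getvalue())

# Binary: exactly 8 bytes per float64 value
binary_size = a.nbytes

print(text_size, binary_size)     # the text form is roughly 3x larger here
```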
From previous class: Text files vs binary files
h5py
HDF5
Hierarchical Data Format version 5
● Abstract data model, abstract storage model, open source library and file format
for storing and managing data
● Efficient I/O for high-volume and complex data
○ Chunked multidimensional arrays
○ netCDF4 relies on HDF5 for chunked storage
● Self-Describing data format (portability)
● Low level C and Fortran APIs
● High level wrappers
○ h5py, PyTables
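A minimal sketch of the high-level h5py wrapper in action (the file name, group, dataset, and attribute names are illustrative):

```python
import numpy as np
import h5py

# Write: create a file, a group, a dataset, and an attribute
with h5py.File("example.h5", "w") as f:
    grp = f.create_group("experiment")
    dset = grp.create_dataset("temperature", data=np.random.rand(100, 100))
    dset.attrs["units"] = "K"          # self-describing metadata travels with the data

# Read: the file carries its own structure and metadata
with h5py.File("example.h5", "r") as f:
    dset = f["experiment/temperature"]
    print(dset.shape, dset.dtype, dset.attrs["units"])
```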
HDF5
Abstract Data Model (ADM)
● Predefined
● Derived - String and compound
[Figure: an example dataset in which each element is a 32-bit integer]
HDF5 - Datatypes
● Pre-defined Datatypes
○ Created by HDF5
○ They are actually opened and closed by HDF5 and can have different values from one HDF5 session to the next
■ Standard - H5T_IEEE_F32BE
■ Native - H5T_NATIVE_INT
● Derived Datatypes
○ Derived from the pre-defined datatypes
■ String, Compound
[Figure: an example dataset with a compound datatype. Each element in the dataset consists of a 16-bit integer, a character, a 32-bit integer, and a 2x3x2 array of 32-bit floats (the datatype). It is a 2-dimensional 5 x 3 array (the dataspace). The datatype should not be confused with the dataspace.]
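The compound element described in the figure caption can be reproduced with a NumPy structured dtype in h5py (the field names are made up; the figure does not give them):

```python
import numpy as np
import h5py

# One element: a 16-bit int, a character, a 32-bit int,
# and a 2x3x2 array of 32-bit floats (the datatype)
dt = np.dtype([("a", "<i2"),
               ("b", "S1"),
               ("c", "<i4"),
               ("d", "<f4", (2, 3, 2))])

data = np.zeros((5, 3), dtype=dt)       # dataspace: 2-dimensional, 5 x 3

with h5py.File("compound.h5", "w") as f:
    f.create_dataset("table", data=data)
```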
HDF5 - Dataspaces
A dataspace describes the layout of a
dataset’s data elements
HDF5 - Attributes
Attributes look similar to HDF5 datasets in that they have a datatype and dataspace.
However, they do not support partial I/O operations, and they cannot be compressed
or extended.
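The difference is visible from h5py: datasets can be sliced (partial I/O reads only the requested region), while attributes are always read and written whole (file and names below are illustrative):

```python
import numpy as np
import h5py

with h5py.File("attrs.h5", "w") as f:
    dset = f.create_dataset("grid", data=np.arange(1_000_000).reshape(1000, 1000))
    dset.attrs["description"] = "toy grid"

with h5py.File("attrs.h5", "r") as f:
    dset = f["grid"]
    corner = dset[:10, :10]              # partial I/O: only this slab is read
    desc = dset.attrs["description"]     # an attribute comes back in one piece
```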
HDF5
Since netCDF version 4,
netCDF files are HDF5 files
Originally in netCDF
● No chunks
● No groups
HDF5
Abstract Storage Model
● Defines how HDF5 objects and data are mapped to a linear address space
● The address space is assumed to be a contiguous array of bytes stored on some random-access medium
● Level 0: File signature and super block - Information about the file
● Level 1: File infrastructure - Information about B-trees and heaps
● Level 2: Data object - Data objects (data + metadata)
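Chunked storage, the layout that the B-tree machinery above indexes, can be requested and inspected from h5py; the chunk shape and filters below are arbitrary choices, not requirements:

```python
import numpy as np
import h5py

with h5py.File("chunked.h5", "w") as f:
    dset = f.create_dataset("field",
                            shape=(1000, 1000),
                            dtype="f8",
                            chunks=(100, 100),   # stored as 100x100 tiles
                            shuffle=True,        # byte-shuffle filter before compression
                            compression="gzip")
    dset[:] = np.random.rand(1000, 1000)

with h5py.File("chunked.h5", "r") as f:
    d = f["field"]
    print(d.chunks, d.compression, d.shuffle)
```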
h5py
h5py uses straightforward NumPy and Python metaphors, like dictionary and NumPy
array syntax
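Those metaphors look like this in practice (file and dataset names are illustrative):

```python
import numpy as np
import h5py

with h5py.File("demo.h5", "w") as f:
    f["group/data"] = np.arange(10)      # dict-style assignment creates the path

with h5py.File("demo.h5", "r") as f:
    print(list(f.keys()))                # dictionary metaphor: ['group']
    print(f["group/data"][2:5])          # NumPy metaphor: slicing reads a slab
```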