ABSTRACT

The proliferation of the Internet of Things (IoT) has led to an exponential increase in time series data, distributed and applied in various contexts, demanding a dedicated storage solution. Based on our observations and analysis of IoT production systems, we have characterized 3 requirements for time series data: (1) a close association with devices and sensors, (2) continual synchronization between cloud and edge, and (3) the need for high ingestion rates and low-latency access on big volumes of data. Despite the growing trend, current time series database systems lack a standardized file format, and existing open file formats do not adequately leverage the unique characteristics of IoT time series data. In this paper, we introduce Apache TsFile, a specialized file format tailored for IoT time series data. TsFile organizes data by devices, creating indexes based on device-related information. Our experiments demonstrate the efficiency of TsFile in achieving high data ingestion rates, minimizing latency, and optimizing data compactness.

PVLDB Reference Format:
Xin Zhao, Jialin Qiao, Xiangdong Huang, Chen Wang, Shaoxu Song, and Jianmin Wang. Apache TsFile: An IoT-native Time Series File Format. PVLDB, 17(12): 4064 - 4076, 2024.
doi:10.14778/3685800.3685827

PVLDB Artifact Availability:
The source code, data, and/or other artifacts have been made available at https://github.com/apache/tsfile/.

∗ Jialin Qiao is the PMC Chair of the Apache TsFile Committee (https://tsfile.apache.org/).
† Shaoxu Song (https://sxsong.github.io/) is the corresponding author.

This work is licensed under the Creative Commons BY-NC-ND 4.0 International License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of this license. For any use beyond those covered by this license, obtain permission by emailing info@vldb.org. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment.
Proceedings of the VLDB Endowment, Vol. 17, No. 12 ISSN 2150-8097.
doi:10.14778/3685800.3685827

1 INTRODUCTION

Time series data are prevalent in Internet of Things (IoT) scenarios. With the widespread deployment of sensor-equipped devices, a vast amount of time series data is generated to reflect the operational states of these devices. These series serve diverse purposes including simulation design, production manufacturing, and equipment maintenance. For instance, CCS, one of our partners, tracks time series data throughout the whole lifecycle of 30 million shipbuilding components, storing these data in cluster servers for long-term maintenance and analysis. Another of our industrial partners, ZY, has sensors installed on their rock drilling machines, caching posture and position information on the device controller to enable real-time control. Unless otherwise specified, time series and series will be used interchangeably in this paper.

1.1 Motivation

In the aforesaid IoT scenarios, rather than directly storing the time series data in databases such as InfluxDB [18], it is highly desirable to first store the time series as files on end devices, and then sync them to edge and cloud servers. The reason is that time series database management systems are often too heavy to be installed on end devices. While SQLite [31] is light enough for end devices, it incurs huge ETL costs to transfer the data from end devices to the cloud, e.g., hosted by InfluxDB.

Some open file formats, such as Apache Parquet [19, 20, 28], have been applied to store time series data. However, they do not recognize and leverage the features of time series data in IoT, resulting in a performance fallback to some extent. To be more specific, these features include 3 aspects as follows.

1.1.1 Series Specific Compression. As sensors detect physical status like temperature, speed, pressure, or displacement and convert these into digital signals all the time, voluminous time series data has been produced and requires efficient storage. Sensors produce distinct series even when measuring the same type of physical quantity, reflecting variations in the objects being measured. Each time series fluctuates with inherent patterns, adhering to the physical laws underlying its sensor. Selecting a suitable encoding and compression scheme for each series is vital for optimal compactness [39], with each stored separately and contiguously.

However, common file formats, such as Apache Parquet, typically place multiple series of the same physical quantity type into a single column, applying a uniform compression scheme across the entire column. Time series that measure physical quantities of the same type can differ vastly in patterns, leading to additional space overhead due to the uniform compression method. This situation motivates the design in Section 3.2, which enables each series to employ an individual encoding and compression scheme.

1.1.2 Hierarchical Device Identification. Once transmitted from sensors to an Industrial PC (IPC) or PLC, time series data are matched with specifications from the point table using a communication address assigned by field engineers during installation, as shown in Figure 1 (a). The identifier of the device, an essential part of the specification, typically possesses a hierarchical structure. For instance, energy and power enterprises employ KKS coding [38] to
categorize and identify devices within a power plant, while the Domain Model in IoT-A [7] presents a self-association of the device entity, both exemplifying a hierarchical structure. Figure 1 (b) depicts a company with numerous wind farms across different regions. In this hierarchy, each leaf represents a sensor collecting time series, while the path from the root to the parent of the leaf denotes the identifier of the device, i.e., device ID. The hierarchical structure reveals the relationship between the identification of related time series, and thus leads to the design in Section 4.2. As the hierarchy naturally represents the entities and relationships in the scenarios, it is also referred to as the data model in the following sections.

Figure 1: Hierarchy Across Endpoint, Edge and Cloud

Device ID remains static throughout the lifecycle of a time series while serving as a part of the index for access, and thereby ought to be handled distinctly from ordinary time series data. In Parquet and similar open file formats, both the device ID and time series data are stored as ordinary columns without any dedicated indexes. To achieve reasonable latency for series access, these formats resort to sorting rows by device IDs, facilitating binary search upon the related columns. However, the repetition of device specifications across numerous rows introduces storage redundancy even with dictionary encoding employed. Moreover, utilizing nested datatypes to describe the hierarchy of device IDs increases complexity due to the column-striping and record-assembly algorithms [26].

1.1.3 ETL-free File Compaction. Time series data is typically compacted several times during synchronization, as shown in Figure 1 (c) and (d). End devices, such as IPCs or PLCs, are commonly resource-constrained and thus store only the latest time series data for real-time control while continuously transmitting this data. Edge computers gather time series from multiple endpoints and compact them into consolidated files for efficiency. Ultimately, cloud servers preserve the gross time series data for long-term application, conducting compaction for higher performance.

As Parquet and similar file formats rely on ordering rows by device ID to ensure efficient access, preserving the order throughout compaction is essential but costly. Compacting multiple pages, each belonging to different files and containing interleaved device IDs, into a single consolidated page requires decoding and rewriting, making the compaction rather expensive. This situation motivates a layout where data points from the same time series are stored contiguously, as elaborated in Sections 3.2 and 3.3.

1.2 Contribution

In this paper, we introduce a novel open file format dedicated to time series in IoT scenarios, referred to as TsFile (Time Series File). TsFile enhances the entire lifecycle of IoT time series data. On resource-limited endpoint devices, an open file format allows for direct data manipulation, eliminating dependency on additional processes. At the edge level, it reduces the overhead of ETL tasks during data compaction. On cluster servers, directly analyzing extensive time series data from files proves more efficient than executing database system operations [9, 26].

Specifically, the unique IoT features stated in Section 1.1 have shaped the design choices and novelty as below.

(1) TsFile organizes data by series, enabling distinct encoding and compression schemes for each series. This strategy effectively minimizes the space cost for series exhibiting various patterns. Data points within one series are stored contiguously, leveraging inherent patterns for enhanced compression. Series originating from the same device are stored with locality, since they are more likely to be accessed together for joint analysis. As some sensors generate multiple readings at once, a common timestamp sequence is utilized to reduce the storage footprint;

(2) TsFile constructs indexes based on device identifiers and sensor names, thoroughly eliminating storage redundancy of identifiers. The index adopts two implementations, based on B-Tree and Trie respectively, leveraging the shared prefixes among identifiers originating from the hierarchical structure;

(3) As TsFile organizes data by series, and distinct files being compacted, whether at the edges or in the cloud, are disjoint in terms of time range, compaction is simplified to the concatenation of series data and adjustment of index offsets. This approach minimizes deserialization and decoding, which constitute the most expensive part of ETL.
This paper is organized as follows: Section 2 gives an overall perspective of the TsFile structure, while Section 3 and Section 4 delve into the design principles behind Apache TsFile. Section 5 provides straightforward examples of usage for further comprehension. Section 6 compares TsFile against prevalent open file formats and evaluates its design choices. Section 7 explores related research on IoT time series data models and other open file formats. Finally, Section 8 concludes the paper.

2 TSFILE FORMAT OVERVIEW

The overall structure of TsFile is divided into 2 parts: the Data Area and the Index Area, as shown in Figure 2. The Data Area is self-documenting and thus can be independent of the Index Area, in spite of the low efficiency. The Index Area can be implemented in alternative structures to satisfy specific application requirements. This paper only outlines a B-Tree-based implementation.

Figure 2: Data Area and Index Area in TsFile

The Data Area comprises various Chunk Groups, each holding time series data for a device over a specific period. A device may be associated with multiple Chunk Groups, depending on the workload. Within a Chunk Group, each Chunk contains data for a single series. Other than TsFiles resulting from compaction, each Chunk within one Chunk Group is associated with a distinct series.

The Index Area links query conditions, such as identifiers, time, or value ranges, to data offsets in the Data Area. It includes a Bloom Filter to quickly determine the presence of a specific series, thus speeding up searches across multiple TsFiles. The Chunk and Series Indexes are crucial for fast access and will be explored further in subsequent sections.

3 TSFILE DATA AREA

The principle of the data area is to store the data points of each time series in a columnar way to enhance compression efficiency and to provide locality at both the device level and the file system block level. This principle distinguishes TsFile from other common open file formats, with a higher compression ratio and throughput for time series in IoT scenarios.

3.1 Chunk Group

The data area is organized into one or more contiguous chunk groups, with each chunk group corresponding to all time series data from a single device over a period. Devices can be categorized into aligned and non-aligned types, and accordingly, chunk groups also fall into these two categories. A chunk group consists of a header and one or more chunks, where each chunk stores data from a specific time series. The header of the chunk group stores the identifier of the device, which is the path from the root to the device node in the data model. The concept of chunk groups achieves device-level locality, as different time series from the same device are often queried simultaneously.

Chunk groups are the basic units for flushing TsFile to secondary storage. When data is written to TsFile, it is first buffered in memory. Once the memory usage reaches a threshold, the buffer, which may contain multiple chunk groups, will be flushed to secondary storage. This threshold can be adjusted in line with the file system configuration to deliver block-level locality. For example, adjusting the buffering threshold based on the block size in HDFS can prevent a single chunk group from being stored separately across different blocks.

Common file formats use a tabular structure as the data model, organizing tuples in their ingestion order into row groups as the unit for writing to secondary storage [15, 16, 26]. In contrast, TsFile flushes multiple independent chunk groups once it reaches the memory threshold, with each chunk group corresponding to a distinct device, thereby offering improved locality. Furthermore, different chunk groups may consist of varying chunks, while different row groups always contain the same set of columns. This feature is beneficial for typical industrial scenarios, as the datasets from our partners illustrate in Section 6.1.2. In these scenarios, one file may contain data points from up to thousands of sensors with different names; these sensors are distributed across various devices, with most devices having only a tiny subset of all the sensors. Figure 4 showcases a common scenario where, despite tuples being sorted by device IDs, values from distinct series end up grouped on the same page due to the row-wise grouping strategy, thereby reducing compression efficiency.
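To make the nesting just described concrete, the following sketch models how the Data Area entities relate in Java. The class and field names here (DataArea, ChunkGroup, Chunk) are illustrative placeholders, not the actual TsFile implementation classes.

import java.util.List;

// Illustrative sketch of how the Data Area nests (placeholder names, not the real TsFile classes).
class DataArea {
    List<ChunkGroup> chunkGroups;   // contiguous chunk groups, flushed together once the memory threshold is reached
}

class ChunkGroup {
    String deviceId;                // header: the path from the root to the device node in the data model
    List<Chunk> chunks;             // one chunk per series of this device within the covered period
}

class Chunk {
    String measurement;             // sensor name
    String encoding;                // per-series encoding scheme, e.g., RLE
    String compression;             // per-series compression scheme
    List<byte[]> pages;             // each page holds one encoded sequence of timestamps or values
}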
Depending on the type of chunk it belongs to, a page stores only one sequence, either of timestamps or of data values.

Ingested data is first placed in the buffer of the current page. Once the buffer reaches a threshold, the data is encoded, compressed, and written to the buffer of the corresponding chunk. The page buffer threshold is configurable; a higher threshold imposes a higher cost to deserialize a single page even if only a few points are expected, while a lower threshold introduces more fragmented pages, affecting both the efficiency of locating the target page and compression efficiency. A reasonable threshold needs to strike a balance between the two.
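As a rough illustration of the buffering flow above, the sketch below accumulates points of a single sequence in a page buffer and seals the page into its chunk once the configurable threshold is reached. The types and the identity encodeAndCompress stand-in are hypothetical, not the TsFile writer internals.

import java.nio.ByteBuffer;

// Sketch only: page-level buffering for one sequence (timestamps or values).
class PageBufferSketch {
    static final int PAGE_THRESHOLD = 64 * 1024;         // configurable page buffer threshold in bytes

    private final ByteBuffer pageBuffer = ByteBuffer.allocate(PAGE_THRESHOLD);
    private final ByteBuffer chunkBuffer = ByteBuffer.allocate(8 * PAGE_THRESHOLD);

    void add(long point) {
        if (pageBuffer.remaining() < Long.BYTES) {
            sealPage();                                   // threshold reached: encode, compress, append to chunk
        }
        pageBuffer.putLong(point);
    }

    private void sealPage() {
        byte[] encoded = encodeAndCompress(pageBuffer);   // stand-in for the per-series scheme
        chunkBuffer.put(encoded);                         // flushing the chunk buffer to the file is omitted here
        pageBuffer.clear();
    }

    private byte[] encodeAndCompress(ByteBuffer raw) {
        byte[] out = new byte[raw.position()];
        raw.flip();
        raw.get(out);
        return out;                                       // identity stand-in; real pages are encoded and compressed
    }
}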
… the TSM of the requested series through its identifier. The query process then determines the chunks to be accessed by sequentially inspecting the chunk metadata.

Figure 6: Detail of Chunk Index

Compared to index structures in prevalent open file formats, such as the Page Index in Parquet [28], the Chunk Index distinctively indexes only the data within the requested series. The count of chunk metadata for a specific series depends solely on its volume in the file, preserving stable access efficiency irrespective of the presence of other series, as Section 6 demonstrates. This approach leverages the structure of the Data Area, where time series data are grouped by devices and sensors.
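To illustrate the access path described above, the sketch below walks a query from a device and sensor identifier down to chunk offsets: a Bloom filter check first, then a descent of the series index, then a sequential scan of the chunk metadata of that series only. The names (SeriesIndex, ChunkMetadata, locateSeries) are hypothetical stand-ins, not the actual TsFile reader API.

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a Chunk Index lookup (illustrative names only).
class ChunkIndexLookupSketch {

    record ChunkMetadata(long startTime, long endTime, long offsetInDataArea) {}

    interface SeriesIndex {
        boolean mightContain(String deviceId, String sensor);             // Bloom filter check
        List<ChunkMetadata> locateSeries(String deviceId, String sensor); // B-Tree/Trie descent by identifier
    }

    static List<Long> chunkOffsets(SeriesIndex index, String deviceId, String sensor,
                                   long queryStart, long queryEnd) {
        List<Long> offsets = new ArrayList<>();
        if (!index.mightContain(deviceId, sensor)) {
            return offsets;                                               // series definitely absent in this TsFile
        }
        // Sequentially inspect the chunk metadata of the requested series only.
        for (ChunkMetadata cm : index.locateSeries(deviceId, sensor)) {
            if (cm.endTime() >= queryStart && cm.startTime() <= queryEnd) {
                offsets.add(cm.offsetInDataArea());
            }
        }
        return offsets;
    }
}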
Each series in TsFile can be assigned a distinct schema, with the details stored in the header of each chunk as shown in Figure 3, provided the datatype is compatible with the encoding scheme. This approach offers greater schema flexibility compared to common file formats, which typically store series with the same name in a single column, applying uniform encoding and compression schemes irrespective of their distinct characteristics.

Before ingesting data from new series, they must be registered in TsFile as follows, which evolves the schema within the TsFile. Line 6 registers a time series by specifying its device and schema; after that, the series is ready to ingest data.

1 TsFileWriter writer = new TsFileWriter(file);
2
3 String device = "Turbine.Beijing.FU01.AZQ01";
4 MeasurementSchema sensor = new MeasurementSchema(
5     "Speed", TSDataType.FLOAT, TSEncoding.RLE);
6 writer.registerTimeseries(device, sensor);

TsFile accepts time series data via TSRecords or Tablets. The former holds only one timestamp, containing multiple values measured at that time from distinct sensors within one device. The latter submits data points from one device in batch, requiring a schema for initiation while providing higher throughput. Lines 5-11 show that a tablet is created with a given device and schema list, and the tablet collects data points via arrays of timestamps and values. The evaluations in Section 6 are based on Tablets since they represent the ingestion capability of TsFile.

1 TSRecord record = new TSRecord(now(), device);
2 record.addTuple(new FloatDataPoint("Speed", 1.2f));
3 writer.write(record);
4
5 List<MeasurementSchema> schemas = new ArrayList<>();
6 schemas.add(sensor);
7 Tablet tablet = new Tablet(device, schemas);
8 tablet.timestamps[tablet.rowSize] = now();
9 float[] values = (float[]) tablet.values[0];
10 values[tablet.rowSize++] = 1.13f;
11 writer.write(tablet);

TsFileReader accepts expressions consisting of specific series paths and filters. Filters can be applied to timestamps or values, and can be composed via logical operators like and and or.

1 TsFileReader reader = new TsFileReader(file);
2 Path path = new Path(device, sensor);
3 Filter valueFilter = ValueFilterApi.gt(1.1);
4 Filter timeFilter = TimeFilterApi.gt(now() - 3 * hour);
5 IExpression filterExpression =
6     BinaryExpression.and(
7         new SingleSeriesExpression(path, valueFilter),
8         new GlobalTimeExpression(timeFilter));
9 QueryExpression expression =
10     QueryExpression
11         .create()
12         .addSelectedPath(path)
13         .setExpression(filterExpression);
14 QueryDataSet res = reader.query(expression);
15 RowRecord row = res.next();

The code example above demonstrates a naive usage to access data points of a certain series with time and value filters. Lines 3-8 exemplify that the value filter is applied to a specific series while the time filter works on all series selected in lines 9-13. Line 15 places the initial data points that meet the filter criteria, each from a selected series, into a RowRecord, thereby forming a tabular structure that smoothly integrates with various applications that utilize table formats.

Since TsFile can be stored in distributed file systems like HDFS, it can be split into fixed-size blocks distributed across cluster servers. TsFileReader provides an interface for querying data at a specific offset range, facilitating data retrieval only on the local server to minimize network overhead in big data analysis.

5.3 TsFile Compaction

TsFileResource and ICompactionPerformer are the two key components for compaction. The fundamental usage of each is outlined below. TsFileResource acts as a summary of a TsFile, offering statistics on the devices contained within the file, such as the timestamps of both the first and the last data point. ICompactionPerformer can be implemented in various approaches but consistently requires both source and target files.

1 TsFileResource rsc1 = new TsFileResource(tsFile1);
2 TsFileResource rsc2 = new TsFileResource(tsFile2);
3 TsFileResource rsc3 = new TsFileResource(newFile);
4
5 ICompactionPerformer performer =
6     new FastCompactionPerformer();
7 performer.setSourcesFiles(rsc1, rsc2);
8 performer.setTargetFiles(rsc3);
9 performer.perform();
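Because the source files of a compaction are disjoint in time range, the fast path boils down to concatenating the encoded series data and rewriting index offsets (cf. Section 1.2). The sketch below illustrates that idea with hypothetical helpers (SeriesBlock, readSeriesBlocks, writeIndex); it is not the FastCompactionPerformer implementation.

import java.io.*;
import java.util.*;

// Illustrative sketch: concatenate series data from time-disjoint source files without decoding chunks.
class ConcatCompactionSketch {

    record SeriesBlock(String seriesId, byte[] encodedChunks) {}

    static void compact(List<File> sources, File target) throws IOException {
        Map<String, Long> newOffsets = new LinkedHashMap<>();
        try (DataOutputStream out = new DataOutputStream(new FileOutputStream(target))) {
            for (File src : sources) {                        // sources cover disjoint time ranges
                for (SeriesBlock block : readSeriesBlocks(src)) {
                    newOffsets.putIfAbsent(block.seriesId(), (long) out.size());
                    out.write(block.encodedChunks());         // byte-level concatenation, no re-encoding
                }
            }
            writeIndex(out, newOffsets);                      // adjust index offsets for the merged file
        }
    }

    static List<SeriesBlock> readSeriesBlocks(File src) throws IOException {
        return List.of();                                     // placeholder: scan the chunk groups of src
    }

    static void writeIndex(DataOutputStream out, Map<String, Long> offsets) throws IOException {
        // placeholder: serialize the rebuilt Index Area
    }
}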
6 PERFORMANCE EVALUATION
We compare TsFile with other widely-used open file formats, namely Parquet and Arrow. Furthermore, we also compare Apache IoTDB [36], which employs TsFile as its underlying storage format, with InfluxDB [18] and other top performers in the time series database track. Among these systems, InfluxDB-IOx [19, 20] utilizes Parquet as its underlying storage. When storing time series data in Parquet, we will discuss alternative schemas for a fairer comparison.

6.1 Experimental Setup

6.1.1 Hardware. For the evaluation in Section 6.2, we perform the experiments on an 8-core Intel(R) Core(TM) i7-9700K CPU @ 3.60GHz machine with 32GB memory, a 1TB SSD, and 64-bit Windows 10.

For the systematic evaluation in Section 6.3.1, we conduct the experiments on a machine with a 20-core Intel(R) Core(TM) i7-12700 CPU, 16GB memory and a 512GB SSD, running 64-bit Ubuntu 22.04.1 SMP.

For Section 6.3.2, we conduct the evaluations on a Raspberry Pi 4 Model B with 8GB RAM, which approximates industrial end devices in typical IoT scenarios.

6.1.2 DataSets. We employ three public real-life datasets, one time series benchmark, and two datasets from our industrial partners, as listed in Table 1.

Table 1: Dataset Profile

DataSet | Points | Series | Devices
REDD    | 56M    | 115    | 115
GeoLife | 72M    | 543    | 181
TDrive  | 18M    | 17778  | 8889
TSBS    | 496M   | 16000  | 4000
ZY      | 376M   | 17154  | 186
CCS     | 161M   | 2750   | 1108

The Reference Energy Disaggregation Data Set (REDD) [23] contains detailed electricity usage data collected from various households, including both high-frequency appliance-level power usage and low-frequency whole-house power consumption. The dataset used in this paper contains data from 6 buildings, each with approximately 20 meters. Every meter is considered a device generating only 1 time series and is identified by the combination of building and meter number.

GeoLife [43] and TDrive [40, 41] are GPS trajectory datasets consisting of coordinates recorded during a wide array of activities like walking, running, cycling and driving. Every object tracked in these datasets is deemed to be a device equipped with sensors measuring its coordinates, which constitutes time series data.

The Time Series Benchmark Suite (TSBS) [34] is a collection of programs widely used to generate tailored datasets for benchmarking. This paper employs the IoT case in the suite, where the data pertain to a set of trucks, including their coordinates, velocity and other status. TSBS interleaves data points from different devices, but the data points for each individual device are sequential in terms of the timestamp. As Section 3 illustrates, performance in common file formats like Parquet declines when data points are not sorted by device ID, whereas TsFile maintains unaffected performance. For the sake of fairness, we reorganize all data points by their device ID, i.e., data points from the same device are stored contiguously and ordered by timestamp, before writing to the file.

The ZY dataset, provided by our industrial partner, consists of data points collected by sensors on rock drilling machines. This dataset is more sparse, as these data are only available when the related machines are working. Furthermore, the quantity of sensors linked to different devices differs significantly. Some devices have fewer than three sensors, while others have over a hundred due to the varying complexity of their tasks.

The CCS dataset is provided by our industrial partner as well. The data are collected from shipbuilding components, as mentioned earlier. In comparison to other datasets, some time series in this dataset are collected at high frequency, such as data points from vibration measurements.

6.2 File Evaluation

We evaluate TsFile with Parquet and Arrow, which are representative open file formats these days, regarding space cost, write speed, and query latency across various datasets. While Arrow was initially designed for in-memory usage, it does have an inter-process communication format, also known as Feather [25]. When we write data to disk, we actually write Feather files; when we read data from Feather, we actually read Arrow data in memory. In the following experiments, for simplicity, we will refer to both Arrow and Feather collectively as Arrow. Although there are other open file formats such as ORC [16] or RCFile [15], their architecture is similar to that of Parquet and has been thoroughly analyzed in previous research [24, 36, 42].

In contrast to the flexible and IoT-native data model in TsFile, Parquet and Arrow require the data schema to be defined based on the data characteristics before writing data to the file. As they employ a tabular schema, if the device ID has multiple fields, there are primarily two alternatives for schema definition. The first approach stores each field from the device ID in an individual field. InfluxDB-IOx, which utilizes Parquet as its underlying storage, adopts this approach [19, 20]. The second approach stores the entire device ID in a single column, resulting in a simpler layout but compromising the atomicity of these fields. For instance, the device ID in TSBS includes three parts: name, fleet, driver. The first definition stores them in different fields while the second stores them in a single one, as the following snippet illustrates. On the other hand, the device ID in TsFile is represented as a segmented string similar to "<name>.<fleet>.<driver>".
// schema of Parquet
message TSBS {
  required binary name;
  required binary fleet;
  required binary driver;
  optional double lon;
  optional double ele;
  optional double vel;
}

// schema of Parquet-AS
message TSBS {
  required binary deviceID;
  required int64 timestamp;
  ...
}

[Figures: space cost and write/query latency of TsFile, Parquet, Parquet-AS, and Arrow across TDrive, REDD, GeoLife, TSBS, CCS, ZY; panels (a) Data Area, (a) Data Write Latency, (a) Access Single Series.]
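For contrast with the two Parquet schemas above, the snippet below sketches how a TSBS series could be registered in TsFile with the segmented device ID; the concrete identifier values (truck_0, fleet_A, driver_1) are made up for illustration, and the calls follow the usage shown in Section 5.

// The TsFile device ID is a single segmented string "<name>.<fleet>.<driver>".
TsFileWriter writer = new TsFileWriter(file);
String device = "truck_0.fleet_A.driver_1";        // hypothetical TSBS identifiers
MeasurementSchema velocity = new MeasurementSchema(
    "vel", TSDataType.DOUBLE, TSEncoding.GORILLA);
writer.registerTimeseries(device, velocity);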
Query latency is measured from when a query is issued to when all target data is received. This process includes the time taken to read the relevant data blocks from disk into memory. All data is sorted first by device ID and then by timestamp, ensuring that data from the same device is stored contiguously and ordered by timestamp.

[Figure panel: (a) Filter on Time]

As shown in Figure 10, TsFile maintains consistently low latency in pinpointing series, whereas Parquet, using page indexes, …
[Figures: query latency (ms) and size (kb) panels for TsFile, Parquet, Parquet-AS, and Arrow across TDrive, REDD, GeoLife, TSBS, CCS, ZY; panel (b) Data Query.]

Figure 12: Compact Effect

… outperforms Naive-Compaction, which requires reading all chunks