0% found this document useful (0 votes)
8 views14 pages

p148 Zeng

This paper evaluates the performance and design of the widely used columnar storage formats Parquet and ORC, highlighting their advantages and inefficiencies in modern data analytics contexts. The authors conducted a comprehensive benchmark to analyze these formats under various workloads, revealing that while Parquet offers better file size and decoding speed, ORC excels in selection pruning. The study identifies key design considerations for future columnar formats, particularly in relation to machine learning workloads and GPU utilization.

Uploaded by

Abdullah Khan
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
8 views14 pages

p148 Zeng

This paper evaluates the performance and design of the widely used columnar storage formats Parquet and ORC, highlighting their advantages and inefficiencies in modern data analytics contexts. The authors conducted a comprehensive benchmark to analyze these formats under various workloads, revealing that while Parquet offers better file size and decoding speed, ORC excels in selection pruning. The study identifies key design considerations for future columnar formats, particularly in relation to machine learning workloads and GPU utilization.

Uploaded by

Abdullah Khan
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 14

An Empirical Evaluation of Columnar Storage Formats

Xinyu Zeng Yulong Hui Jiahong Shen


Tsinghua University Tsinghua University Tsinghua University
zeng-xy21@mails.tsinghua.edu.cn huiyl22@mails.tsinghua.edu.cn shen-jh20@mails.tsinghua.edu.cn

Andrew Pavlo Wes McKinney Huanchen Zhang∗


Carnegie Mellon University Voltron Data Tsinghua University
pavlo@cs.cmu.edu wes@voltrondata.com huanchen@tsinghua.edu.cn

ABSTRACT 105], Impala [16], Spark [20, 113], and Presto [19, 98], to respond to
Columnar storage is a core component of a modern data analytics the petabytes of data generated per day and the growing demand for
system. Although many database management systems (DBMSs) large-scale data analytics. To facilitate data sharing across the vari-
have proprietary storage formats, most provide extensive support to ous Hadoop-based query engines, vendors proposed open-source
open-source storage formats such as Parquet and ORC to facilitate columnar storage formats [11, 17, 18, 76], represented by Parquet
cross-platform data sharing. But these formats were developed over and ORC, that have become the de facto standard for data storage in
a decade ago, in the early 2010s, for the Hadoop ecosystem. Since today’s data warehouses and data lakes [14, 15, 19, 20, 29, 38, 61].
then, both the hardware and workload landscapes have changed. These formats, however, were developed more than a decade ago.
In this paper, we revisit the most widely adopted open-source The hardware landscape has changed since then: persistent stor-
columnar storage formats (Parquet and ORC) with a deep dive into age performance has improved by orders of magnitude, achieving
their internals. We designed a benchmark to stress-test the formats’ gigabytes per second [48]. Meanwhile, the rise of data lakes means
performance and space efficiency under different workload config- more column-oriented files reside in cheap cloud storage (e.g., AWS
urations. From our comprehensive evaluation of Parquet and ORC, S3 [7], Azure Blob Storage [24], Google Cloud Storage [33]), which
we identify design decisions advantageous with modern hardware exhibits both high bandwidth and high latency. On the software side,
and real-world data distributions. These include using dictionary a number of new lightweight compression schemes [57, 65, 87, 116],
encoding by default, favoring decoding speed over compression as well as indexing and filtering techniques [77, 86, 101, 115], have
ratio for integer encoding algorithms, making block compression been proposed in academia, while existing open columnar formats
optional, and embedding finer-grained auxiliary data structures. are based on DBMS methods from the 2000s [56].
We also point out the inefficiencies in the format designs when Prior studies on storage formats focus on measuring the end-
handling common machine learning workloads and using GPUs to-end performance of Hadoop-based query engines [72, 80]. They
for decoding. Our analysis identified important considerations that fail to analyze the design decisions and their trade-offs. Moreover,
may guide future formats to better fit modern technology trends. they use synthetic workloads that do not consider skewed data
distributions observed in the real world [109]. Such data sets are
PVLDB Reference Format: less suitable for storage format benchmarking.
Xinyu Zeng, Yulong Hui, Jiahong Shen, Andrew Pavlo, Wes McKinney, The goal of this paper is to analyze common columnar file for-
Huanchen Zhang. An Empirical Evaluation of Columnar Storage Formats. mats and to identify design considerations to provide insights for
PVLDB, 17(2): 148 - 161, 2023. developing next-generation column-oriented storage formats. We
doi:10.14778/3626292.3626298 created a benchmark with predefined workloads whose configura-
tions were extracted from a collection of real-world data sets. We
PVLDB Artifact Availability:
then performed a comprehensive analysis for the major compo-
The source code, data, and/or other artifacts have been made available at
nents in Parquet and ORC, including encodings, block compression,
https://github.com/XinyuZeng/EvaluationOfColumnarFormats.
metadata organization, indexing and filtering, and nested data mod-
eling. In particular, we investigated how efficiently the columnar
1 INTRODUCTION formats support common machine learning workloads and whether
Columnar storage has been widely adopted for data analytics be- their designs are friendly to GPUs. We detail the lessons learned in
cause of its advantages, such as irrelevant attribute skipping, effi- Section 6 and summarize our main findings below.
cient data compression, and vectorized query processing [55, 59, 68]. First, there is no clear winner between Parquet and ORC in
In the early 2010s, organizations developed data processing engines format efficiency. Parquet has a slight file size advantage because of
for the open-source big data ecosystem [12], including Hive [13, its aggressive dictionary encoding. Parquet also has faster column
decoding due to its simpler integer encoding algorithms, while ORC
This work is licensed under the Creative Commons BY-NC-ND 4.0 International is more effective in selection pruning due to the finer granularity
License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of
this license. For any use beyond those covered by this license, obtain permission by of its zone maps (a type of sparse index).
emailing info@vldb.org. Copyright is held by the owner/author(s). Publication rights Second, most columns in real-world data sets have a small num-
licensed to the VLDB Endowment.
Proceedings of the VLDB Endowment, Vol. 17, No. 2 ISSN 2150-8097.
ber of distinct values (or low “NDV ratios” defined in Section 4.1),
doi:10.14778/3626292.3626298
∗ Huanchen Zhang is also affiliated with Shanghai Qi Zhi Institute.

148
which is ideal for dictionary encoding. As a result, the efficiency Because of its closer relationship with Spark, previous work failed
of integer-encoding algorithms (i.e., to compress dictionary codes) to evaluate the format in isolation [106]. Recent work concludes
is critical to the format’s size and decoding speed. Third, faster that CarbonData has a worse performance compared with Parquet
and cheaper storage devices mean that it is better to use faster de- and ORC and has a less active community [69].
coding schemes to reduce computation costs than to pursue more A number of large companies have developed their own pro-
aggressive compression to save I/O bandwidth. Formats should not prietary columnar formats in the last decade. Google’s Capacitor
apply general-purpose block compression by default because the format is used by many of their systems [3], including BigQuery [92]
bandwidth savings do not justify the decompression overhead. and Napa [58]. It is based on the techniques from Dremel [91] and
Fourth, Parquet and ORC provide simplistic support for auxiliary Abadi et al. [56] that optimize layout based on workload behavior.
data structures (e.g., zone maps, Bloom Filters). As bottlenecks shift YouTube developed the Artus format in 2019 for the Procella DBMS
from storage to computation, there are opportunities to embed that supports adaptive encoding without block compression and
more sophisticated structures and precomputed results into the 𝑂 (1) seek time for nested schemas [66]. Meta’s DWRF is a vari-
format to trade inexpensive space for less computation. ant of ORC with better support for reading and encrypting nested
Fifth, existing columnar formats are inefficient in serving com- data [50]. Meta recently developed Alpha to improve the training
mon machine learning (ML) workloads. Current designs are sub- workloads of machine learning (ML) applications [108].
optimal in handling projections of thousands of features during Arrow is an in-memory columnar format designed for efficient
ML training and low-selectivity selection during top-k similarity exchange of data with limited or no serialization between differ-
search in the vector embeddings. Finally, the current formats do ent application processes or at library API boundaries [8]. Unlike
not provide enough parallel units to fully utilize the computing Parquet or ORC, Arrow supports random access and thus does not
power of GPUs. Also, unlike the CPUs, more aggressive compres- require block-based decoding on reads. Because Arrow is not meant
sion is preferred in the formats with GPU processing because the for long-term disk storage [5], we do not evaluate it in this paper.
I/O overhead (including PCIe transfer) dominates the file scan time. The recent lakehouse [62] trend led to an expansion of formats to
We make the following contributions in this paper. First, we support better metadata management (e.g., ACID transactions). Rep-
created a feature taxonomy for columnar storage formats like Par- resentative projects include Delta Lake [61], Apache Iceberg [15],
quet and ORC. Second, we designed a benchmark to stress-test and Apache Hudi [14]. They add an auxiliary metadata layer and
the formats and identify their performance vs. space trade-offs do not directly modify the underlying columnar file formats.
under different workloads. Lastly, we conducted a comprehensive There are also scientific data storage formats for HPC workloads,
set of experiments on Parquet and ORC using our benchmark and including HDF5, BP5, NetCDF, and Zarr [25, 39, 53, 73]. They target
summarized the lessons learned for the future format design. heterogeneous data that has complex file structures, types, and
organizations. Their data is typically multi-dimensional arrays and
does not support column-wise encoding. Although they expose
2 BACKGROUND AND RELATED WORK several language APIs, few DBMSs support these formats because
The Big Data ecosystem in the early 2010s gave rise to open-source of their lack of columnar storage features.
file formats. Apache Hadoop first introduced two row-oriented Most of the previous investigations on columnar formats target
formats, SequenceFile [49] organized as key-value pairs, and entire query processing systems without analyzing the format in-
Avro [10] based on JSON. At the same time, column-oriented ternals in isolation [72, 80, 95]. Trivedi et al. compared the read
DBMSs, such as C-Store [102], MonetDB [79], and VectorWise [118], performance of Parquet, ORC, Arrow, and JSON on the NVMe
developed the fundamental methods for efficient analytical query SSDs [106], but they only measured sequential scans with synthetic
processing [55]: columnar compression, vectorized processing, and data sets (i.e., TPC-DS [103]). There are also older industry articles
late materialization. The Hadoop community then adopted these that compare popular columnar formats, but they do not provide
ideas from columnar systems and developed more efficient formats. an in-depth analysis of the internal design details [1, 2, 4].
In 2011, Facebook/Meta released a column-oriented format for Other research proposes ways to optimize these existing colum-
Hadoop called RCFile [76]. Two years later, Meta refined RCFile nar formats under specific workloads or hardware configurations [63,
and announced the PAX (Partition Attribute Across)-based [59] 64, 89]. For example, Jiang et al. use ML to select the best encoding
ORC (Optimized Record Columnar File) format [17, 78]. A month algorithms for Parquet according to the query history [81]. Btr-
after ORC’s release, Twitter and Cloudera released the first ver- Blocks integrates a sampling-based encoding selection algorithm to
sion of Parquet [18]. Their format borrowed insights from earlier achieve the optimal decompression speed with network-optimized
columnar storage research, such as the PAX model and the record- instances [83]. Li et al. proposed using BMI instructions to improve
shredding and assembly algorithm from Google’s Dremel [91]. selection performance on Parquet [85]. None of these techniques,
Since then, both Parquet and ORC have become top-level Apache however, have been incorporated in the most popular formats.
Foundation projects. They are also supported by most data pro-
cessing platforms, including Hive [13], Presto/Trino [19, 98], and
Spark [20, 113]. Even database products with proprietary storage 3 FEATURE TAXONOMY
formats (e.g., Redshift [75], Snowflake [70], ClickHouse [27], and In this section, we present a taxonomy of columnar formats fea-
BigQuery [32]) support Parquet and ORC through external tables. tures (see Table 1). For each feature category, we first describe the
Huawei’s CarbonData [11] is another open-source columnar common designs between Parquet and ORC and then highlight
format that provides built-in inverted indexing and column groups. their differences as well as the rationale behind the divergence.

149
Table 1: Feature Taxonomy – An overview of the features of columnar storage formats.
Parquet ORC
Internal Layout (§3.1) PAX PAX
Encoding Variants (§3.2) plain, RLE_DICT, RLE, Delta, Bitpacking plain, RLE_DICT, RLE, Delta, Bitpacking, FOR
Features

Compression (§3.3) Snappy, gzip, LZO, zstd, LZ4, Brotli Snappy, zlib, LZO, zstd, LZ4
Type System (§3.4) Separate logical and physical type system One unified type system
Zone Map / Index (§3.5) Min-max per smallest zone map/row group/file Min-max per smallest zone map/row group/file
Bloom Filter (§3.5) Supported per column chunk Supported per smallest zone map
Nested Data Encoding (§3.6) Dremel Model Length and presence
Table 2: Concepts Mapping – Terms used in this paper and the corre- Parquet applies Dictionary Encoding aggressively to every col-
sponding ones in the formats. umn regardless of the data type by default, while ORC only uses
This Paper Parquet ORC it for strings. They both apply another layer of integer encoding
Row Group Row Group Stripe on the dictionary codes. The advantage of applying Dictionary En-
Smallest Zone Map Page Index (a Page) Row Index (10k rows) coding to an integer column, as in Parquet, is that it might achieve
Compression Unit Page Compression Chunk additional compression for large-value integers. However, the dic-
3.1 Format Layout tionary codes are assigned based on the values’ first appearances in
the column chunk and thus might destroy local serial patterns that
As shown in Figure 1, both Parquet and ORC employ the PAX format.
could be compressed well by Delta Encoding or Frame-of-Reference
The DBMS first partitions a table horizontally into row groups. It
(FOR) [74, 84, 117]. Therefore, Parquet only uses Bitpacking and
then stores tuples column-by-column within each row group, with
RLE to further compress the dictionary codes.
each attribute forming a column chunk. The hybrid columnar layout
Parquet imposes a limit (1 MB by default) to the dictionary size
enables the DBMS to use vectorized query processing and mitigates
for each column chunk. When the dictionary is full, later values fall
the tuple reconstruction overhead in a row group. Many systems
back to “plain” (i.e., no encoding) because a full dictionary indicates
and libraries, such as DuckDB and Arrow, leverage the PAX layout
that the number of distinct values (NDVs) is too large On the other
to perform parallel reads over column chunks.
hand, ORC computes the NDV ratio (i.e., NDV / row count) of the
Both formats first apply lightweight encoding schemes to the val-
column to determine whether to apply Dictionary Encoding to it.
ues for each column chunk. The formats then use general-purpose
If a column’s NDV ratio is greater than a predefined threshold (e.g.,
block compression algorithms to reduce the column chunk’s size.
0.8), then ORC disables encoding. Compared to Parquet’s dictionary
The entry point of a Parquet/ORC file is called a footer. Besides
size physical limit, ORC’s approach is more intuitive, and the tuning
file-level metadata such as table schema and tuple count, the footer
of the NDV ratio threshold is independent of the row group size.
keeps the metadata for each row group, including its offset in the
For integer columns, Parquet first dictionary encodes and then
file and zone maps for each column chunk. For clarity in our ex-
applies a hybrid of RLE and Bitpacking to the dictionary codes.
position, in Table 2 we also summarize the mapping between the
If the same value repeats ≥ 8 times consecutively, it uses RLE;
terminologies used in this paper and those used in Parquet/ORC.
otherwise, it uses bitpacking. Interestingly, we found that the RLE-
Although the layouts of Parquet and ORC are similar, they differ
threshold 8 is a non-configurable parameter hard-coded in every
in how they map logical blocks to physical storage. For example,
implementation of Parquet. Although it saves Parquet a tuning
(non-Java) Parquet uses a row-group size based on the number of
knob, such inflexibility could lead to suboptimal compression ratios
rows (e.g., 1M rows), whereas ORC uses fixed physical storage size
for specific data sets (e.g., when the common repetition length is 7).
(e.g., 64 MB). Parquet seeks to guarantee that there are enough
Unlike Parquet’s RLE + Bitpacking scheme, ORC includes four
entries within a row group to leverage vectorized query processing,
schemes to encode both dictionary codes (for string columns) and
but it may suffer from large memory footprints, especially with
integer columns. ORC’s integer encoder uses a rule-based greedy
wide tables. On the other hand, ORC limits the physical size of
algorithm to select the best scheme for each subsequence of values.
a row group to better control memory usage, but it may lead to
Starting from the beginning of the sequence, the algorithm keeps a
insufficient entries with large attributes.
look-ahead buffer (with a maximum size of 512 values) and tries
Another difference is that Parquet maps its compression unit
to detect particular patterns. First, if there are subsequences of
to the smallest zone map. ORC provides flexibility in tuning the
identical values with lengths between 3 and 10, ORC uses RLE to
performance-space trade-off of a block compression algorithm.
encode them. If the length of the identical values is greater than
However, misalignment between the smallest zone map and com-
10, or the values of a subsequence are monotonically increasing
pression units imposes extra complexity during query processing
or decreasing, ORC applies Delta Encoding to the values. Lastly,
(e.g., a value may be split across unit boundaries).
for the remaining subsequences, the algorithm encodes them using
either Bitpacking or a variant of PFOR [117], depending on whether
there exist “outliers” in a subsequence. Figure 2 is an example of
3.2 Encoding ORC’s integer encoding schemes.
Applying lightweight compression schemes to the columns can The sophistication (compared to Parquet) of ORC’s integer encod-
reduce both storage and network costs [56]. Parquet and ORC ing algorithm allows ORC to seize more opportunities for compres-
support standard OLAP compression techniques, such as Dictionary sion. However, switching between four encoding schemes slows
Encoding, Run-Length Encoding (RLE), and Bitpacking.

150
Column Row Group 1 Column
Row Group 1 Page 1 Page Header Col 1 Zone Map Present Stream
Chunk 1 Chunk 1
Index (logical)
Row Group 2 Column Col 1 Bloom Filter Column
Page 2 Definition Levels Length Stream
Chunk 2 Data Chunk 2
. . .
. Footer .
. . .
. Repetition Levels . Data Stream
. . .
Row Group 2
. Column Col c Zone Map Column
Page p Values .
Chunk c Chunk c
Col c Bloom Filter .
Row Group r
Metadata: version, schema, Column 1 Metadata: offset, . Metadata: version,
.... type, encoding, compression, number of rows, ... offset,
Bloom Filter Row Group r
Row Group 1 Metadata zone maps... Row Group 1 Metadata index length,
Page Index . . ColChunkStats . data length,
. . . footer length
Footer Footer
Footer Length Row Group r Metadata Column c Metadata Footer Length Row Group r Metadata

(a) Parquet layout. (b) ORC layout.

Figure 1: Format Layout – Blocks in gray are optional depending on configurations/data.

913 | 222 | 123 | 222 | 9 1 3 5 4 8 7 99 6 Union allows data values to have different types for the same col-
Original Integer Sequence umn name. Recent work shows that a Union type can help optimize
outlier
Parquet’s Dremel model with schema changes [60].
Bitpack RLE Delta RLE PFOR
Corresponding Encoding Method

Figure 2: ORC’s Hybrid Integer Encoding – Each encoding subsequence


3.5 Index and Filter
has a header for the decoder to decide which algorithm to use at run time.
Parquet and ORC include zone maps and optional Bloom Filters to
down the decoding process and creates more fragmented subse- enable selection pruning. A zone map contains the min value, the
quences that require more metadata to keep track. All the open- max value, and the row count within a predefined range in the file.
source DBMSs and libraries that we surveyed follow Parquet and If the range of the values of the zone does not satisfy a predicate,
ORC’s default encoding schemes without implementing their own the entire zone can be skipped during the table scan. Both Parquet
tools for selecting encoding algorithms in the files. and ORC contain zone maps at the file level and the row group level.
The smallest zone map granularity in Parquet is a physical page (i.e.,
3.3 Compression the compression unit), while that in ORC is a configurable value
representing the number of rows (10000 rows by default). Whether
Both Parquet and ORC enable block compression by default. The
to build the smallest zone maps is optional in Parquet.
algorithms supported by each format are summarized in Table 1. Be-
In earlier versions of Parquet, the smallest zone maps are stored
cause a block compression algorithm is type-agnostic (i.e., it treats
in the page headers. Because the page headers are co-located with
any data as a byte stream), it is mostly orthogonal to the underlying
each page and are thus discontinuous in storage, (only) checking
format layout. Most block compression algorithms contain parame-
the zone maps requires a number of expensive random I/Os. In
ters to configure the “compression level” to make trade-offs between
Parquet’s newest version (2.9.0), this is fixed by having an optional
the compression ratio and the compression/decompression speed.
component called the PageIndex, stored before the file footer to
Parquet exposes these tuning knobs directly to the users, while ORC
keep all the smallest zone maps. Similarly, ORC stores its smallest
provides a wrapper with two pre-configured options, “optimize for
zone maps at the beginning of each row group, as shown in Figure 1.
speed” and “optimize for compression”, for each algorithm.
Bloom Filters are optional in Parquet and ORC. The Bloom Fil-
One of our key observations is that applying block compression
ters in ORC have the same granularity as the smallest zone maps,
to columnar storage formats is unhelpful (or even detrimental) to
and they are co-located with each other. Bloom Filters in Parquet,
the end-to-end query speed on modern hardware. Section 5 further
however, are created only at the column chunk level partly because
discusses this issue with experimental evidence.
the PageIndex (i.e., the smallest zone maps) in Parquet is optional.
In terms of the Bloom Filter implementation, Parquet adopts the
3.4 Type System Split Block Bloom Filter (SBBF) [96], which is designed to have
Parquet provides a minimal set of primitive types (e.g., INT32, better cache performance and SIMD support [42].
FLOAT, BYTE_ARRAY). All the other supported types (e.g., INT8, According to our survey, Arrow and DuckDB only adopt zone
date, timestamp) in Parquet are implemented using those primi- maps at the row group level for Parquet, while InfluxDB and Spark
tives. For example, INT8 in Parquet is encoded as INT32 internally. enable PageIndex and Bloom Filters to trade space for better selec-
Because small integers may be dictionary compressed well, such tion performance [46]. When writing ORC files, Arrow, Spark, and
a “type expansion” has minimal impact on storage efficiency. On Presto enable row indexes but disable Bloom Filters by default.
the other hand, every type in ORC has a separate implementation Zone maps are only effective when the values are clustered (e.g.,
with a dedicated reader and writer. Although this could bring more mostly sorted). As data processing bottlenecks shift from storage
type-specific optimizations, it makes the implementation bloated. to computation, whether adding more types of auxiliary data struc-
As for complex types, Parquet and ORC both support Struct, List, tures [77, 86, 101, 115] to the format will be beneficial to the overall
and Map, but Parquet does not provide the Union type like ORC. query performance remains an interesting open question.

151
first last tag name first last tags tag
Val R D Val R D Val R D P Val P Val P Len P Val P
root Mike 0 2 Lee 0 2 a 0 2 1 Mike 1 Lee 1 2 a 1
1
{name: {first: Mike, last: Lee}, tags: [a, b]} 0 1 Hill 0 2 b 1 2 1 0 Hill 1 0 1 b 1
name tags {name: {last: Hill}, tags: []} Joe 0 2 0 1 0 1 1 Joe 1 1 c 1
{name: {first: Joe}, tags: [c]} 0 1
c 0 2
first last tag

(a) Example schema and three sample records. (b) Parquet’s. R/D=Repetition/Definition Level. (c) ORC’s. Len=length, P=presence.

Figure 3: Nested Data Example – Assume all nodes except the root can be null.
Config File Metadata Table first define several salient properties of the value distribution of
Workload a column (e.g., sortedness, skew pattern). We then extract these
Generator properties from real-world data sets to form predefined workloads
Transform
Workload representing applications ranging from BI to ML. To use our bench-
Templates Predicates Scan mark, as shown in Figure 4, a user first provides a configuration
Target Format Final results
file (or an existing workload template) that specifies the parameter
Select
values of the properties. The workload generator then produces the
Figure 4: Benchmark Procedure Overview data using this configuration and then generates point and range
predicates to evaluate the format’s (filtered) scan performance.
3.6 Nested Data Model
As semi-structured data sets such as those in JSON and Protocol 4.1 Column Properties
Buffers [44] have become prevalent, an open format must sup-
We first introduce the core properties that define the value distri-
port nested data. The nested data model in Parquet is based on
bution of a column. We use [𝑎 1, 𝑎 2, ..., 𝑎 𝑁 ] to represent the values
Dremel [91]. As shown in Figure 3b, Parquet stores the values of
in a particular column, where 𝑁 denotes the number of records.
each atomic field (the leaf nodes in the hierarchical schema in Fig-
ure 3a) as a separate column. Each column is associated with two NDV Ratio: Defined as the number of distinct values (NDV) di-
integer sequences of the same length, called the repetition level (R) vided by the total number of records in a column: 𝑓𝑐𝑟 = 𝑁 𝑁
𝐷𝑉 . A
and the definition level (D), to encode the structure. R links the numeric column typically has a higher NDV ratio than a categor-
values to their corresponding “repeated fields”, while D keeps track ical column. A column with a lower NDV ratio is usually more
of the NULLs in the “non-required fields”. compressible via Dictionary Encoding and RLE, for example.
On the other hand, ORC adopts a more intuitive model based
on length and presence to encode nested data [92]. As shown in Null Ratio: Defined as the number of NULLs divided by the total
| {𝑖 |𝑎𝑖 is null} |
Figure 3c, ORC associates a boolean column to each optional field number of records in a column: 𝑓𝑛𝑟 = 𝑁 . It is important
to indicate value presence. For each repeated field, ORC includes for a columnar storage format to handle NULL values efficiently
an additional integer column to record the repeated lengths. both in terms of space and query processing.
For comparison, ORC creates separate columns (presence and
length) for non-atomic fields (e.g., “name” and “tags” in Figure 3c), Value Range: This property defines the range of the absolute
while Parquet embeds this structural information in the atomic values in a column. Users pass two parameters: the average value
fields via R and D. The advantage of Parquet’s approach is that (e.g., 1000 for an integer column) and the variance of the value
it reads fewer columns (i.e., atomic fields only) during query pro- distribution. The value range directly impacts the compressed file
cessing. However, Parquet often produces a larger file size because size because most columnar formats apply Bitpacking to the values.
the information about the non-atomic fields could be duplicated For string, this is defined as byte length.
in multiple atomic fields (e.g., “first” and “last” both contains the Sortedness: The degree of sortedness of a column affects not
information about the presence of “name” in Figure 3b). only the efficiency of encoding algorithms such as RLE and Delta
Encoding, but also the effectiveness of zone maps. Prior work has
proposed ways to measure the sortedness of a sequence [90], but
4 COLUMNAR STORAGE BENCHMARK
these metrics do not correlate strongly with encoding efficiency,
The next step is to stress-test the performance and space efficiency so we developed a simple metric that puts more emphasis on local
of the storage formats using data sets using varying value distribu- sortedness. We divide the column into fixed-sized blocks (512 entries
tions. Standard OLAP benchmarks such as SSB [93], TPC-H [104] by default). Within each block, we compute a sortedness score to
and TPC-DS [103] generate data sets with uniform distributions. reflect its ascending or descending tendency:
Second, although some benchmarks, such as YCSB [67], DSB [71], 𝑎𝑠𝑐 = |{𝑖 |1 ≤ 𝑖 <𝑛 and 𝑎𝑖 < 𝑎𝑖+1 }|; 𝑑𝑒𝑠𝑐 = |{𝑖 |1 ≤ 𝑖 <𝑛 and 𝑎𝑖 > 𝑎𝑖+1 }|
and BigDataBench [111] allow users to set data skewness, the con- max(𝑎𝑠𝑐,𝑑𝑒𝑠𝑐 )+𝑒𝑞−⌊ 𝑁2 ⌋
figuration space is often too small to generate distributions that are 𝑒𝑞 = |{𝑖 |1 ≤ 𝑖 < 𝑛 and 𝑎𝑖 = 𝑎𝑖+1 }|; 𝑓𝑠𝑜𝑟𝑡𝑛𝑒𝑠𝑠 =
⌈ 𝑁2 ⌉−1
close to real-world data sets. Lastly, using real-world data is ideal, We then take the average of the per-block scores to represent the
but the number of high-quality resources available is insufficient column’s overall sortedness. A score of 1 means that the column
to cover a comprehensive analysis. is fully sorted, while a score close to 0 indicates a high probability
Given this, we designed a benchmark framework based on real- that the column’s values are randomly distributed. Although this
world data to evaluate multiple aspects of columnar formats. We metric is susceptible to adversarial patterns (e.g., 1, 2, 3, 4, 3, 2, 1), it

152
(0.0, 1e-05] (0.001, 0.01] uniform binary (0.0, 0.2] (0.6, 0.8] (0,1] (1e3,1e4] (0, 5] (25, 50]
0 (0.001, 0.1]
(1e-05, 0.0001] (0.01, 0.1] gentle_zipf single (0.2, 0.4] (0.8, 1.0] (1e1,1e2] (1e4,1e5] (5, 10] (50, 100]
(0.0, 1e-05] (0.1, 0.5]
(0.0001, 0.001] (0.1, 1.0] hotspot (0.4, 0.6] (1e2,1e3] (1e5,1e6] (10, 25] (100, 1000]
(1e-05, 0.001] (0.5, 1.0]
1.0 1.0 1.0 1.0 1.0 1.0

0.8 0.8 0.8 0.8 0.8 0.8


Percentage

0.6 0.6 0.6 0.6 0.6 0.6

0.4 0.4 0.4 0.4 0.4 0.4

0.2 0.2 0.2 0.2 0.2 0.2

0.0 0.0 0.0 0.0 0.0 0.0


Integer Float Point String Integer Float Point String Integer Float Point String Integer Float Point String Integer String

(a) NDV Ratio (b) Null Ratio (c) Skew Pattern (d) Sortedness (e) Int Value Range (f) String Length

Figure 5: Parameter Distribution – Percentage of total columns from diverse data sets of different parameter values.

is sufficient for our generator to produce columns with different repetitions with an NDV ratio smaller than 0.1. This implies that
sortedness levels. Given a score (e.g., 0.8), we first sort the values Dictionary Encoding would be beneficial to most of the real-world
in a block in ascending or descending order and then swap value columns. Figure 5b shows that the NULL ratio is low, and string
pairs randomly until the sortedness degrades to the target score. columns tend to have more NULLs than the other data types.
Most columns in the real world exhibit a skewed value distribu-
Skew Pattern: We use the following pseudo-zipfian distribution
tion, as shown in Figure 5c. Less than 5% of the columns can be
to model the value skewness in a column: 𝑝 (𝑘) = 𝑘1𝑠 /( 𝐶 1
∑︁
𝑛=1 𝑛𝑠 ). classified as Uniform. Regardless of the data type, a majority of the
𝐶 denotes the total number of distinct values, and 𝑘 refers to the columns fall into the category of Gentle Zipf. The remaining ≈ 30%
frequency rank (e.g., 𝑝 (1) represents the portion occupied by the of the columns contain “heavy hitters”. This distribution indicates
most frequent value). The Zipf-parameter 𝑠 determines the column that an open columnar format must handle both the “heavy hitters”
skewness: a larger 𝑠 leads to a more skewed distribution. Based on and the “long tails” (from Gentle Zipf ) efficiently at the same time.
the range of 𝑠, we classified the skew patterns into four categories: Figure 5d shows that the distribution of the sortedness scores
• Uniform: When 𝑠 ≤ 0.01. Each value appears in the column is polarized: most columns are either well-sorted or unsorted at
with a similar probability. all. This implies that encoding algorithms that excel only at sorted
• Gentle Zipf: When 0.01 < 𝑠 ≤ 2. The data is skewed to some columns (e.g., Delta Encoding and FOR) could still play an important
extent. The long tail still dominates the values of a column. role. Lastly, as shown in Figure 5e, most integer columns have small
• Hotspot: When 𝑠 > 2. The data is highly skewed. A few hot values that are ideal for Bitpacking compression. Long string values
values cover almost the entire column. are also rare in our data set collection (see Figure 5f). We also
• Single/Binary: This represents extreme cases in real-world data analyzed real-world Parquet files sampled from publicly available
where a column contains one/two distinct values. object store buckets and found that they mostly corroborate Figure 5.
The skew pattern is a key factor that determines the performance
of both lightweight encodings and block compression algorithms.
4.3 Predefined Workloads
4.2 Parameter Distribution in Real-World Data We extracted the column properties from the real-world data sets
introduced in Section 4.1 and categorized them into five predefined
We study the following real-world data sets to depict a parameter
workloads: bi (based on the Public BI Benchmark), classic (based
distribution of each of the core properties introduced in Section 4.1.
on IMDb, Yelp, and a subset of the Clickhouse sample data sets), geo
– Public BI Benchmark [45, 109]: real-world data and queries from (based on Geonames and the Cell Towers and Air Traffic data sets
Tableau with 206 tables (uncompressed 386GB). from Clickhouse), log (based on LOG and the machine-generated
– ClickHouse [28]: sample data sets from the ClickHouse tutorials, log data sets from Clickhouse), and ml (based on UCL-ML). Table 3
which represent typical OLAP workloads. presents the proportion of each data type for each workload, while
– UCI-ML [6]: a collection of 622 data sets for ML training. We select Table 4 summarizes the parameter settings of the column properties.
nine data sets that are larger than 100 MB. All are numerical data Each value in Table 4 represents a weighted average across the data
excluding unstructured images and embeddings. types (e.g., if there are 6 integer columns with an NDV ratio of 0.1,
– Yelp [52]: Yelp’s businesses, reviews, and user information. 3 string columns with an NDV ratio of 0.2, and 1 float columns with
– LOG [30]: log information on internet search traffic for EDGAR an NDV ratio of 0.4, the value reported in Table 4 would be 0.16).
filings through SEC.gov. The classic workload has a higher Zipf parameter and a higher
– Geonames [31]: geographical information covering all countries. NDV ratio at the same time, indicating a long-tail distribution. On
– IMDb [37]: data sets that describe the basic information, ratings, the other hand, the NDV ratio in log is relatively low, but the
and reviews of a collection of movies. columns are better sorted. In terms of data types, classic and geo
We extracted the core properties from each of the above data are string-heavy, while log and ml are float-heavy.
sets and plotted their parameter distributions in Figure 5. As shown We then created the core workload which is a mix of the five pre-
in Figure 5a, over 80% of the integer columns and 60% of the string defined workloads. It contains 50% of bi columns, 21% of classic,
columns have an NDV ratio smaller than 0.01. Surprisingly, even 7% of geo, 7% of log, and 15% of ml. We will use core as the de-
for floating-point columns, 60% of them have significant value fault workload in Section 5. For each workload, we also specify a

153
File Size (MB)
Table 3: Data type distribution of different workloads – Number in 150 Parquet ORC

the table indicating the proportion of columns. 100 0.6

File Size (MB)

Time (ms)
150 50 100

Time (s)
Type core bi classic geo log ml 0.4
0
100 core bi classicgeo log ml

Integer 0.37 0.46 0.33 0.31 0.22 0.24 0.2 50


50
Float 0.21 0.20 0.06 0.08 0.46 0.39 0 0.0 0
String 0.41 0.34 0.61 0.61 0.32 0.37 re bi ssicgeo log ml re bi ssicgeo log ml re bi ssicgeo log ml
co co co
cla cla cla
Bool 0.003 0.002 0.00 0.00 0.00 0.01
Table 4: Summarized Workload Properties – We categorize each prop- (a) File Size (b) Scan Time (c) Select Time
erty into three levels. The darker the color the higher the number.
Figure 6: Benchmark results with predefined workloads
Properties core bi classic geo log ml
NDV Ratio 0.12 0.08 0.25 0.18 0.08 0.12 can decode Parquet’s dictionary page directly into its dictionary
Null Ratio 0.09 0.11 0.09 0.13 0.02 0.00 array), while we must convert ORC into an intermediate in-memory
Value Range medium small large small small large representation (ColumnVectorBatch) before transforming it into
Sortedness 0.54 0.57 0.49 0.45 0.75 0.30 Arrow tables. Given this, we focus on the raw scan performance of
Zipf 𝑠 1.12 1.10 1.42 0.89 1.26 1.00 each storage format. We preallocate a fixed-sized memory buffer.
Pred. Selectivity mid high high low low mid After decoding the fixed-size unit of data, the system writes the
selectivity for our benchmark to generate predicates to evaluate the result to the same buffer, assuming that the previous one has already
filtered scan performance of a columnar storage format. As shown been consumed by upstream operators.
in Table 4, bi and classic have high selectivities because these
scenarios typically involve large scans. On the contrary, we use a 5.2 Benchmark Result Overview
low selectivity in geo and log because their queries request data We first present the results of benchmarking Parquet and ORC
from small geographic areas or specific time windows. with their default configurations using the predefined workloads
(Section 4.3). We generate a 20-column table with 1m rows for each
workload and store the data in a single Parquet/ORC file. We then
5 EXPERIMENTAL EVALUATION
perform a sequential scan of the file and report the execution time.
In this section, we analyze Parquet and ORC’s features presented in Lastly, we clear the buffer cache and perform 30 select queries. The
Section 3. The purpose is to provide experiment-backed lessons to selectivities of the range predicates are defined in Table 4, and we
guide the design of the next-generation columnar storage formats. report the average latency of the select queries for each workload.
Section 5.1 describes the experimental setup. Section 5.2 presents As shown in Figure 6a, there is no clear winner between Parquet
the performance and space results of Parquet and ORC under default and ORC in terms of file sizes. Parquet’s file size is smaller than
configurations using the predefined workloads in our benchmark. ORC’s in log and ml because Parquet applies dictionary encoding
We then examine the formats’ key components with controlled on float columns where their NDV ratios are low in real-world data
experiments in Sections 5.3 to 5.7. Lastly, we test the formats’ ability sets (Figure 5a). However, ORC generates smaller files for classic
to support ML workloads (Section 5.8) and GPUs (Section 5.9). and geo because they mostly contain string data. We provide further
analysis of the encoding schemes in Section 5.3.
5.1 Experiment Setup The results in Figure 6b indicate that Parquet is faster than ORC
We run the experiments on an AWS i3.2xlarge instance with 8 for scans. The main reason is that Parquet’s integer/dictionary-code
vCPUs of Intel Xeon CPU E5-2686 v4, 61GB memory, and 1.7TB encoding scheme is lightweight: it mostly uses Bitpacking and only
NVMe SSD. The operating system is Ubuntu 20.04 LTS. We use applies RLE when value repetition is ≥ 8 (Section 3.2). Because
Arrow v9.0.0 to generate the Parquet and ORC files. For all experi- RLE decoding is hard to accelerate using SIMD, it has an inferior
ments, we use the following configurations of the formats (unless performance compared to Bitpacking when the repetition count
specified otherwise). Parquet has a row group size of 1m rows and is small. In contrast, ORC applies RLE more aggressively (when
sets the dictionary page size limit to 1 MB. The row group size in value repetition is ≥ 3, and its integer encoding scheme switches
ORC is 64 MB, and its NDV-ratio threshold for dictionary encoding between four algorithms, thus slowing down the decoding process.
is v0.8 (Hive’s default). Snappy compression is enabled (by default Figure 6c shows the average latencies of the select queries. The
) for both formats. We use the C++ implementation of Parquet results generally follow those in Figure 6b. The only exception is
(integrated with Arrow C++) [21] and ORC (v1.8) [41] compiled geo where ORC outperforms. The reason is that ORC’s smallest
with g++ v9.4. To evaluate page-level zone maps, we use the Rust zone map has a finer granularity than Parquet’s. Compared to other
implementation (v32) of Parquet in Section 5.6. We generate the workloads, geo has a relatively high NDV ratio but a low predicate
workloads for the experiments using the benchmark introduced selectivity, which makes fine-grained zone maps more effective.
in Section 4. We measure the file sizes and the scan performance
(with filters) in these experiments. Each reported measurement is 5.3 Encoding Analysis
the average of three runs per experiment. We next examine the performance and space efficiency of the en-
One approach to measuring the (filtered) scan performance of coding schemes in Parquet and ORC in this section.
Parquet and ORC is to decode both formats into Arrow tables. But
this approach is unfair because Parquet is tightly coupled with 5.3.1 Compression Ratio. To investigate how Parquet and ORC’s
Arrow with native support for format transformation (e.g., Arrow compression ratios change based on column properties, we generate

154
File Size (MB)
Parquet ORC
50
75 18 39 36
File Size (MB)
Integer

50 12 25 26 24

25 6 0 13 12
−4 −3 −2 −1 0
10 10 10 10 10
0 0 0 0
10 −4
10 −3
10 −2
10 −1
10 0 0.0 0.5 1.0 1.5 2.0 0.0 0.2 0.4 0.6 0.8 1.0 102 104 106 108

288 186 297 104


File Size (MB)
String

192 124 198 103


96 62 99
102
0 0 0
10−4 10−3 10−2 10−1 100 0.0 0.5 1.0 1.5 2.0 0.0 0.2 0.4 0.6 0.8 1.0 101 102

168 168 171 168


File Size (MB)
Float

112 112 114 112

56 56 57 56

0 0 0 0
10−4 10−3 10−2 10−1 100 0.0 0.5 1.0 1.5 2.0 0.0 0.2 0.4 0.6 0.8 1.0 103 105 107

NDV Ratio Zipf 𝑠 Sortedness Value Range


Figure 7: Encoding size differences – Varying parameters on core workload w/o block compression.
Parquet_Decode Parquet_IO ORC_Decode ORC_IO
a series of tables, each having 1m rows with 20 columns of the same 0.76 (IO:0.05)

Time (sec)
0.5
data type. For each table, we use the core workload’s parameter 0.4
0.3
settings but modify one of the four column properties: NDV ratio, 0.2
0.1
Value Range, Sortedness, and the Zipf parameter. Figure 7 shows 0.0 int string float
int string float int string float int string float
how the file size changes when we sweep the parameter of different
column properties. We disabled block compression in both Parquet (a) Time [Uncompressed] (b) Time [Snappy] (c) Time [zstd]
and ORC temporarily in these experiments. Parquet ORC

As shown in the first row of Figure 7, Parquet achieves a better Size (MB) 150

compression ratio than ORC for integer columns with a low to 100

50
medium NDV ratio (which is common in real-world data sets) be-
0 int string float
cause Parquet applies Dictionary Encoding on integers before using int string float int string float int string float
Bitpacking + RLE. When the NDV ratio grows larger (e.g., > 0.1),
this additional layer of Dictionary Encoding becomes less effective (d) Size [Uncompressed] (e) Size [Snappy] (f) Size [zstd]

than ORC’s more sophisticated integer encoding algorithms. Figure 8: Varying compression on core workload.
As the Zipf parameter 𝑠 becomes larger, the compression ratios 20
0.20
on integer columns improve for both Parquet and ORC (row 1, Parquet ORC
Time (sec)

Size (MB)
0.15 15
column 2 in Figure 7). The file size reduction happens earlier for
0.10 10
ORC (𝑠 = 1) than Parquet (𝑠 = 1.4). This is because RLE kicks in to 0.05
replace Bitpacking earlier in ORC (with the run length ≥ 3) than 0.00
5
2 3 4 5 6 7 8 9 10 2 3 4 5 6 7 8 9 10
Parquet (with the run length ≥ 8). We also observe that when the Data Run Length Data Run Length
integer column is highly sorted, ORC compresses those integers
(a) Scan Time (b) File Size
better than Parquet (row 1, column 3 in Figure 7) because of the
adoption of Delta Encoding and FOR in its integer encoding. Figure 9: Varying run length on string, w/o compression.
Parquet’s file size is stable as the value range of the integers varies
(row 1, column 4 in Figure 7). Parquet applies Dictionary Encoding the one described in [81] is needed to handle the situation when
on the integers and uses Bitpacking + RLE on the dictionary codes Dictionary Encoding fails. Also, the format should expose certain
only. Because these codes do not change as we vary the value range, encoding parameters, such as the minimum run length for RLE for
the file size of Parquet stays the same in these experiments. On the tuning so that users can make the trade-off more smoothly.
other hand, the file size of ORC increases as the value range gets
larger because ORC encodes the original integers directly. 5.3.2 Decoding Speed. We next benchmark the decoding speed of
For string columns, as shown in the second row of Figure 7, Parquet and ORC. We use the data sets in Section 5.3.1 that follow
Parquet and ORC have almost identical file sizes because they both the default core workload. Block compression is still disabled in the
use Dictionary Encoding on strings. ORC has a slight size advantage experiments in this section. We perform a full table scan on each
over Parquet, especially when the dictionary is large because ORC file and measure the I/O time and the decoding time separately.
applies encoding on the string lengths of the dictionary entries.
The third row of Figure 7 shows the results for float columns. Table 6: Branch Mispredic- Table 7: Subsequences count and data
Parquet dominates ORC in file sizes because Dictionary Encoding tions of Figure 8a. percentage for integer in Table 6.
is surprisingly effective on float-point numbers in the real world.
Discussion: Because of the low NDV ratio of real-world columns Workload Encoding
(as shown in Figure 5), Parquet’s strategy of applying Dictionary Format int string float RLE Bitpack Delta PFOR
Encoding on every column seems to be a reasonable default for ORC 2.9M 3.1M 0.3M .7M(16%) .7M(32%) .2M(49%) .01M(3%)
future formats. However, an encoding selection algorithm such as Parquet 0.9M 1.9M 0.6M .2M(46%) .2M(54%) 0 0

155
Figure 8a shows that Parquet has faster decoding than ORC zstd_Decode
100 Parquet Metadata Parsing
Parquet Data Decode
for integer and string columns. As explained in Section 5.2, there 0.6 zstd_IO 80 ORC Metadata Parsing

Time (sec)
NoCompression_Decode

Time (ms)
ORC Data Decode
are two main reasons behind this: (1) Parquet relies more on the 0.4 NoCompression_IO 60

fast Bitpacking and applies RLE less aggressively than ORC, and 40
0.2
(2) Parquet has a simpler integer encoding scheme that involves 20
fewer algorithm options. As shown in Table 6, switching between 0.0
st1 gp3 gp2 io1 nvme s3 0
200 2000 4000 8000 10000
the four integer encoding algorithms in ORC generates 3× more Storage Type
Number of Features
branch mispredictions than Parquet during the decoding process Figure 10: Block Compression Figure 11: Wide-Table Projection
(done on a similar physical machine to collect the performance
counters). According to the breakdown in Table 7, ORC has 4× 5.4 Block Compression
more subsequences to decode than Parquet, and the encoding algo- We study the performance-space trade-offs of block compression on
rithm distribution among the subsequences is unfriendly to branch the formats in this section. We first repeat the decoding-speed exper-
prediction. Parquet’s decoding-speed advantage over ORC shrinks iments in Section 5.3.2 with different algorithms (i.e., Snappy [34],
for integers compared to strings, indicating a (slight) decoding Zstd [54]). As shown in Figures 8d to 8f, Zstd achieves a better com-
overhead due to its additional dictionary layer for integer columns. pression ratio than Snappy for all data types. The results also show
Parquet also optimizes the bit-unpacking procedure using SIMD that block compression is effective on float columns in ORC because
instructions and code generation to avoid unnecessary branches. they contain raw values. For the rest of the columns in both Parquet
To further illustrate the performance and space trade-off be- and ORC, however, the space savings of such compression is lim-
tween Bitpacking and RLE, we construct a string column with a ited because they are already compressed via lightweight encoding
pre-configured parameter 𝑟 where each string value repeats 𝑟 times algorithms. Figures 8a to 8c also shows that block compression
consecutively in the column. Recall that ORC applies RLE when imposes up 4.2× performance overhead to scanning.
𝑟 ≥ 3, while the RLE threshold for Parquet is 𝑟 ≥ 8. Figure 9 shows We further investigate the I/O benefit and the computational
the decoding speed and file sizes of Parquet and ORC with different overhead of block compression on Parquet across different storage-
𝑟 ’s. We observe that RLE takes longer to decode compared to Bit- device tiers available in AWS. The x-axis labels in Figure 10 show
packing for short repetitions. As 𝑟 increases, this performance gap the storage tiers, where st1, gp3, gp2, and io1 are from Amazon
shrinks quickly. The file sizes show the opposite trend (Figure 9b) EBS, while nvme is from an AWS i3 instance. These storage tiers
as RLE achieves a much better compression ratio than Bitpacking. are ordered by an increasing read bandwidth. We generate a table
For float columns, ORC achieves a better decoding performance with 1m rows and 20 columns according to the core workload and
than Parquet because ORC does not apply any encoding algorithms store the data in Parquet. The file sizes are 34 MB and 25 MB with
on floating-point values. Although the float columns in ORC occupy Zstd disabled and enabled, respectively. We then perform scans on
much larger space than the dictionary-encoded ones in Parquet (as the Parquet files stored in each storage tier using a single thread.
shown in Figure 7), the saving in computation outweighs the I/O As shown in Figure 10, applying Zstd to Parquet only speeds
overhead with modern NVMe SSDs. up scans on slow storage tiers (e.g., st1) where I/O dominates
Discussion: Although more advanced encoding algorithms, the execution time. For faster storage devices, especially NVMe
such as FSST [65], HOPE [116], Chimp [87] and LeCo [88], have SSDs, the I/O time is negligible compared to the computation time.
been proposed recently, it is important to keep the encoding scheme In this case, the decompression overhead of Zstd hinders scan
in an open format simple to guarantee a fast decoding speed. Se- performance. The situation is different with S3 because of its high
lecting from multiple encoding algorithms at run time imposes access latency [61]. Reading a Parquet file requires several round
noticeable performance overhead on decoding. Future format de- trips, including fetching the footer length, the footer, and lastly the
signs should be cautious about including encoding algorithms that column chunks. Therefore, even with multi-threaded optimization
only excel at specific situations in the decoding critical path. to fully utilize S3’s bandwidth, the I/O cost of reading a medium-
In addition, as the storage device gets faster, the local I/O time sized (e.g., 10s-100s MB) Parquet file is still noticeable.
could be negligible during query processing. According to the float Discussion: As storage gets faster and cheaper, the computa-
results in Figure 8a, even a scheme as lightweight as Dictionary tional overhead of block compression dominates the I/O and storage
Encoding adds significant computational overhead for a sequential savings for a storage format. Unless the application is constrained
scan, and this overhead cannot get covered by the I/O time savings. by storage space, such compression should not be used in future
This indicates that most encoding algorithms still make trade-offs formats. Moreover, as more data is located on cloud-resident ob-
between storage efficiency and decoding speed with modern hard- ject stores (e.g., S3), it is necessary to design a columnar format
ware (instead of a Pareto improvement as in the past). Future for- specifically for this operating environment (e.g., high bandwidth
mats may not want to make any lightweight encoding algorithms and high latency). Potential optimizations include storing all the
“mandatory” (e.g., leave raw data as an option). Also, the ability to metadata continuously in the format to avoid multiple round trips,
operate on compressed data is important with today’s hardware. appropriately sizing the row groups (or files) to hide the access
latency, and coalescing small-range requests to better utilize the
cloud storage bandwidth [9, 63].

156
5.5 Wide-Table Projection
According to our discussion with Meta's Alpha [108] team, it is common to store a large number of features (thousands of key-value pairs) for ML training in ORC files using the "flat map" data type, where the keys and values are stored in separate columns. Because each ML training process often fetches only a subset of the features, the columnar format must support wide-table projection efficiently. In this experiment, we generate a table of 10K rows with a varying number of float attributes. We store the table in Parquet and ORC and randomly select 10 attributes to project. Figure 11 shows the breakdown of the average latency of the projection queries.

As the number of attributes (i.e., features) in the table grows, the metadata parsing overhead increases almost linearly even though the number of projection columns stays fixed. This is because the footer structures in Parquet and ORC do not support efficient random access: the schema information is serialized in Thrift (Parquet) or Protocol Buffers (ORC), which only support sequential decoding. We also notice that ORC's performance declines as the table gets wider because there are fewer entries in each row group, whose size has a physical limit (64 MB).

Discussion: Wide tables are common, especially when storing features for ML training. Future formats must organize the metadata to support efficient random access to the per-column schema.
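As a sketch of the access pattern (assuming pyarrow and a hypothetical features.parquet file with thousands of float columns), projecting ten columns still forces the reader to deserialize the entire footer before touching any column data:

```python
import pyarrow.parquet as pq

# Opening the file parses the whole Thrift-serialized footer, whose size
# grows with the number of columns, even if only 10 columns are requested.
pf = pq.ParquetFile("features.parquet")
print(pf.metadata.num_columns, "columns in the schema")

wanted = [f"feat_{i}" for i in range(10)]   # hypothetical column names
subset = pf.read(columns=wanted)
```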
5.6 Indexes and Filters
We tested the efficacy of zone maps and Bloom Filters in Parquet and ORC by performing scans with predicates of varying selectivities. The experiment results are presented in our technical report [114]. Overall, zone maps and Bloom Filters can boost the performance of low-selectivity queries. However, zone maps are effective only for a smaller number of well-clustered columns, while Bloom Filters are useful only for point queries. Future formats should consider recent research advances in indexing and filtering structures such as column indexes [77, 86, 101] and range filters [107, 115].
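For reference, zone-map-style pruning can be emulated on Parquet through the per-row-group min/max statistics exposed by pyarrow. The sketch below (hypothetical flat-schema file and predicate) skips row groups whose value range cannot satisfy the filter:

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("data.parquet")               # hypothetical input file
col_idx = pf.schema_arrow.get_field_index("x")    # assumes a flat schema

# Keep only row groups whose [min, max] range may contain rows with x > 100.
survivors = []
for rg in range(pf.metadata.num_row_groups):
    stats = pf.metadata.row_group(rg).column(col_idx).statistics
    if stats is None or not stats.has_min_max or stats.max > 100:
        survivors.append(rg)

table = pf.read_row_groups(survivors)   # well-clustered columns prune the most
```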
5.7 Nested Data Model
In this section, we quantitatively evaluate the trade-off in the nested data model between Parquet and ORC. To only test the nested model and isolate other noise, we use float data so we can disable encoding and compression in both formats. We test against a synthetic nested schema tree designed as follows (as shown in Figure 12a): the root node is a struct containing a float field and a list field, and the list recursively contains 0-2 structs with the same schema as the root. 97% of the lists contain one struct, and 1% contain no struct. We generate a series of Arrow tables with 256K rows for different max depths of the schema tree and write them into Parquet and ORC. During table generation, the tree of a record stops growing when its depth reaches the desired max depth. We then record the file size, the time to read the file into an Arrow table, and the time to decode the nested structure information during the table scan.

[Figure 12: Nested Data Model – Varying max depth in the data. (a) File Schema (number = nested depth); (b) File Size (MB) vs. Max Depth; (c) Time of Scanning to Arrow; (d) Nested Info Decode Overhead.]

As shown in Figure 12b, as the depth of the schema tree gets larger, the Parquet file size grows faster than ORC's. On the other hand, ORC spends much more time transforming to Arrow (Figure 12c). The reason is that ORC needs to be read into its own in-memory data structure and then transformed to Arrow, and this transformation is not optimized. We therefore further profile the time spent decoding the nested structure information during the scan. The result in Figure 12d shows that ORC's overhead to decode the nested structure information grows larger than Parquet's as the schema gets deeper. The reason is that ORC needs to decode structure information for both struct and list nodes, while Parquet only needs to decode the leaf fields along with their levels. This result is consistent with Dremel's retrospective work [92].

Discussion: The trade-offs between the two nested data models only manifest when the depth is large. Future formats should pay more attention to avoiding extra overhead during the translation between the on-disk and in-memory nested models.

5.8 Machine Learning Workloads
We next investigate how well the columnar formats support common ML workloads. Besides raw data (e.g., image URLs, text) and the associated metadata (e.g., image dimensions, tags), an ML data set often contains vector embeddings of the raw data, which are vectors of floating-point numbers that enable similarity search in applications such as text-image matching and ad recommendation. It is common to store the entire ML data set in Parquet files [35], where the vector embeddings are stored as lists in Parquet's nested model. Additionally, ML applications often build separate vector indexes directly from Parquet to speed up similarity search [23].

5.8.1 Compression Ratio and Deserialization Performance with Vector Embeddings. In this experiment, we collect 30 data sets with vector embeddings from the top downloaded and top trending lists on Hugging Face and store the embeddings in four different formats: Parquet, ORC, HDF5, and Zarr. We then scan those files into in-memory NumPy arrays and record the scan time for each file. We report the median, 25/75%, and min/max of the compression ratio (format_size / Numpy_size) and the scan slowdown (format_scan_time / disk_Numpy_scan_time) in Figure 13.
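A minimal version of this measurement, assuming pyarrow and NumPy and a synthetic embedding matrix in place of the Hugging Face data sets, looks as follows; the ORC, HDF5, and Zarr variants are analogous.

```python
import os
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

emb = np.random.rand(10_000, 384).astype(np.float32)   # synthetic embeddings

# Store each embedding as a list<float> column, as in typical ML Parquet files.
col = pa.array(emb.tolist(), type=pa.list_(pa.float32()))
pq.write_table(pa.table({"emb": col}), "emb.parquet")

# format_size / Numpy_size, as reported in Figure 13a.
compression_ratio = os.path.getsize("emb.parquet") / emb.nbytes

# Scan back into an in-memory NumPy array, as timed for Figure 13b.
back = np.vstack(pq.read_table("emb.parquet")["emb"].to_pylist())
```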
[Figure 13: Efficiency of storing and scanning embeddings – (a) Compression Ratio; (b) Scan Time w.r.t. Numpy; formats compared: hdf5-gzip, parquet-zstd, orc-zstd, zarr-blosc-zstd.]

Figure 13a shows that none of the four formats achieves good compression with vector embeddings, although Zarr is optimized for storing numerical arrays. Zarr, however, incurs a smaller scanning overhead than Parquet and ORC, as shown in Figure 13b. This is because Zarr divides a list of (fixed-length) vector embeddings into grid chunks to facilitate parallel scanning/decoding of the vectors. Parquet and ORC, on the other hand, only support sequential decoding within a row group.

Discussion: Existing columnar formats are not well optimized for storing and deserializing vector embeddings, which prevail in ML data sets. Future format designs should include specialized data types/structures to allow better floating-point compression [83, 87, 94] and better parallelism.
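One direction consistent with this observation is to make the vector width explicit in the schema, for example as a fixed-size list, so that the flat value buffer can be split into equally sized chunks, similar in spirit to Zarr's grid chunks. The sketch below only shows how such a schema can be declared with pyarrow today; whether a format exploits it for chunked parallel decoding is a separate design question.

```python
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

vecs = np.random.rand(10_000, 128).astype(np.float32)

# A fixed-size list makes the vector width part of the schema, so every
# embedding starts at a predictable offset in the flat float buffer.
flat = pa.array(vecs.ravel())
fixed = pa.FixedSizeListArray.from_arrays(flat, 128)
pq.write_table(pa.table({"emb": fixed}), "emb_fixed.parquet")
```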
[Figure 14: Top-k Search Workflow Breakdown (k = 10) – (a) Time of vector index search vs. selection on files using the resulting row IDs; (b) S3 GET requests issued; configurations: Parquet on SSD, ORC on SSD, Parquet on S3, ORC on S3.]

5.8.2 Integration with Vector Search Pipeline. Despite the emerging vector databases [26, 43, 110], performing vector search directly in the data lake is still common because it avoids an expensive ETL process. Databricks recently announced their vision of Vector Data Lakes to support querying vector embeddings stored in Parquet inside Delta Lake [51]. In this experiment, we evaluate the performance of Parquet and ORC in top-k similarity search queries.

We use the image-text LAION-5B data set [97] with the corresponding embeddings. We store the first 100M entries in Parquet/ORC and then use the embeddings from the rest of the data set to perform top-k similarity search queries (k = 10). We maintain an in-memory vector index auto-tuned using the FAISS library [22, 82]. Each query first searches the vector index to get the row IDs of the top 10 most similar entries. The query then uses those row IDs to fetch the URLs and text from the underlying columnar storage. We batch the queries to amortize the read amplification.

Figure 14a shows the average time (over 20 trials) of performing the top-k queries with a varying batch size on the x-axis. We repeated the queries using local NVMe SSDs and AWS S3 for storage. We observe that the selection operations in ORC are faster than those in Parquet on local SSDs because ORC includes fine-grained zone maps that reduce the read amplification. As the query batch size gets larger, the performance gap between ORC and Parquet shrinks because the query batch fetches a significant portion of the file. The result is different when the files are stored in S3: fetching records is much slower in ORC because it issues ≈4× more S3 GET requests than Parquet during the process, as shown in Figure 14b. The reason is that the zone maps in ORC are scattered across the row-group footers, while those in Parquet are centralized in the file footer.

Discussion: ML workloads often involve low-selectivity vector search queries. Although aggressive query batching can amortize the read amplification, fine-grained indexes (e.g., zone maps) are necessary to guarantee the search latency. Also, as more and more large-scale ML data sets reside in data lakes, it is critical for future formats to reduce the number of small reads (e.g., zone map fetches in ORC) to the high-latency cloud object stores.
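The two-step workflow described above (index search, then row-ID selection from columnar storage) can be sketched as follows, assuming FAISS and pyarrow, a hypothetical Parquet file and column names, and an exact flat index in place of the auto-tuned one used in our experiments:

```python
import faiss
import numpy as np
import pyarrow.parquet as pq

table = pq.read_table("laion_subset.parquet")                  # hypothetical file
emb = np.vstack(table["emb"].to_pylist()).astype(np.float32)   # (n, d) embeddings

index = faiss.IndexFlatIP(emb.shape[1])   # exact inner-product index
index.add(emb)

queries = emb[:64]                        # a batch of 64 query vectors
_, row_ids = index.search(queries, 10)    # top-10 row IDs per query

# Selection step: fetch URLs/text for the matched rows from columnar storage.
hits = table.select(["url", "text"]).take(row_ids.ravel())
```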
5.8.3 Storage of Unstructured Data. Besides tabular data, deep learning data sets often include unstructured data such as images, audio, and videos. One approach to storing them in a columnar format is to keep only their external URLs, as done in the LAION-5B data set above. This approach, however, can suffer from massive numbers of HTTP GET requests and from URLs that become invalid over time. Therefore, it is beneficial to store the unstructured data within the same file [36].

We evaluate this on Parquet using the LAION-5B data set with the image URLs replaced by the original binaries. The resulting Parquet file is 13 GB with 219K rows and is stored on an NVMe SSD. We perform scans with five different filters (filter_0 - filter_4) whose selectivities are 1, 0.1, 0.01, 0.001, and 0.0001, respectively. We enable parallel reads and pre-buffering of column chunks. Figure 15a shows the query times when the image column is projected, while Figure 15b presents the query times with only the tabular columns projected. We vary the row-group size on the x-axis. A smaller row-group size works better when fetching the images because more row groups allow better parallel reads of the large binaries with asynchronous I/Os. A smaller row group, however, compromises the compression of the structured data, and the increased I/O time dominates the latency of queries that only project structured data.

[Figure 15: Filterscan on Image Data in Parquet – Filters 0-4 correspond to low to high selectivities. Filters are applied on tabular data. (a) With images in projection; (b) Without images in projection; x-axis: Row Group Size (number of rows).]

Discussion: It is inefficient to store large binaries together with structured data in the same PAX layout with a default row-group size. Future designs should separate them in the physical layout of the format while providing a unified query interface logically.
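The knob involved is illustrated by the following sketch (pyarrow, synthetic image blobs in place of the LAION binaries): shrinking the row-group size increases the number of independently fetchable chunks for the binary column, at the cost of the structured columns' compression.

```python
import pyarrow as pa
import pyarrow.parquet as pq

rows = 2048
table = pa.table({
    "id": pa.array(range(rows)),
    "image": pa.array([b"\x00" * 10_000] * rows, type=pa.binary()),  # stand-in blobs
})

# Default-sized row groups vs. small row groups for better parallel binary reads.
pq.write_table(table, "images_default.parquet")
pq.write_table(table, "images_small_rg.parquet", row_group_size=512)
```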
5.9 GPU Decoding
Besides machine learning, GPUs are used to speed up data analytics [99, 112] and decompression [100]. In this section, we investigate the decoding efficiency of Parquet and ORC on GPUs. We use the state-of-the-art GPU readers for Parquet and ORC in cuDF 23.10a [47]. The machine is equipped with an NVIDIA GeForce RTX 3090, an AMD EPYC 7H12 with 128 cores, 512 GB of DRAM, and an Intel P5530 NVMe SSD. We generate the data set using the core workload with a table of 32 columns and a varying number of rows.
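For orientation, the GPU-side scan in these experiments boils down to calls like the following (a sketch assuming cuDF is installed and hypothetical file names for the core-workload data):

```python
import cudf

# libcudf parses the footer, transfers the column chunks to the GPU over PCIe,
# and decompresses/decodes them with GPU kernels.
gdf = cudf.read_parquet("core_workload_32cols.parquet")
odf = cudf.read_orc("core_workload_32cols.orc")
```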
[Figure 16: GPU Decoding – (a) core workload (Parquet-Arrow, Parquet-cuDF, ORC-Arrow, ORC-cuDF); (b) Peak GPU Throughput Percentage; (c) cuDF with varying compression (uncompressed vs. zstd); (d) Time breakdown of ORC in (c): I/O (includes PCIe transfer), decompress, decode.]

In the first experiment, we scan and decode the files using Arrow (with multithreading and I/O prefetching enabled) and cuDF, respectively. As shown in Figure 16a, ORC-cuDF exhibits higher decoding throughput than Parquet-cuDF because ORC has more independent blocks to better utilize the massive parallelism provided by the GPU: the smallest zone map in ORC maps to fewer rows than Parquet's, and cuDF assigns a GPU thread block to each smallest zone-map region. As the number of rows in the files increases, the decoding throughput of Parquet-Arrow scales because there are more row groups to leverage for multi-core parallel decoding with asynchronous I/O. In contrast, the Arrow implementation for ORC does not support parallel reads.

We further profile the GPU's peak throughput in the above experiment as a percentage of its theoretical maximum using Nsight Compute [40]. As shown in Figure 16b, the overall compute utilization is low (although the GPU occupancy is full when the row count reaches 8M). This is because the integer encoding algorithms used in Parquet and ORC (e.g., hybrid RLE + Bitpacking) are not designed for parallel processing: all threads must wait for the first thread to scan the entire data block to obtain their offsets in the input and output buffers. Moreover, because cuDF assigns a warp (32 threads) to each encoded run, a short run (e.g., a length-3 RLE run in ORC) causes the threads within a warp to be underutilized.
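The serial dependency can be seen in a few lines: with run-length-encoded input, a thread cannot know where its run starts in the output until the lengths of all preceding runs are summed. The following is a simplified illustration, not the cuDF kernel.

```python
# Hybrid RLE/bit-packed data is a sequence of (value, run_length) pairs.
runs = [(7, 3), (0, 1), (7, 5), (2, 2)]

# Output offset of run i = sum of run lengths 0..i-1: an inherently
# sequential prefix sum over a variable-length input stream.
offsets, total = [], 0
for _, length in runs:
    offsets.append(total)
    total += length

decoded = [None] * total
for (value, length), off in zip(runs, offsets):
    decoded[off:off + length] = [value] * length
```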
We next perform a controlled experiment under the same setting as above to evaluate how block compression affects GPU decoding. Figure 16c shows that applying Zstd improves the scan throughput for both Parquet and ORC when there are enough rows in the files (i.e., enough data to leverage GPU parallelism). Figure 16d shows the scan time breakdown. We observe that the I/O time (including the PCIe transfer between GPU and CPU) dominates the scan performance, making aggressive block compression pay off.

Discussion: Existing columnar formats are not designed to be GPU-friendly. Their integer encoding algorithms operate on variable-length subsequences, making decoding hard to parallelize efficiently. Future formats should favor encodings with better parallel-processing potential. Besides, aggressive block compression is beneficial on GPUs because it alleviates the dominating I/O overheads (unlike with CPUs).

6 LESSONS AND FUTURE DIRECTIONS
We summarize the lessons learned from our evaluation of Parquet and ORC to guide future innovations in columnar storage formats.

Lesson 1. Dictionary Encoding is effective across data types (even for floating-point values) because most real-world data have low NDV ratios. Future formats should continue to apply the technique aggressively, as in Parquet.

Lesson 2. It is important to keep the encoding scheme simple in a columnar format to guarantee competitive decoding performance. Future format designers should pay attention to the performance cost of selecting from many codec algorithms during decoding.

Lesson 3. The bottleneck of query processing is shifting from storage to (CPU) computation on modern hardware. Future formats should limit the use of block compression and other heavyweight encodings unless the benefits are justified in specific cases.

Lesson 4. The metadata layout in future formats should be centralized and friendly to random access to better support the wide (feature) tables common in ML training. The size of the basic I/O block should be optimized for high-latency cloud storage.

Lesson 5. As storage is getting cheaper, future formats could afford to store more sophisticated indexing and filtering structures to speed up query processing.

Lesson 6. Nested data models should be designed with an affinity to modern in-memory formats to reduce the translation overhead.

Lesson 7. The characteristics of common machine learning workloads require future formats to support both wide-table projections and low-selectivity selections efficiently. This calls for better metadata organization and more effective indexing. Besides, future formats should allocate separate regions for large binary objects and incorporate compression techniques specifically designed for floats.

Lesson 8. Future formats should consider decoding efficiency with GPUs. This requires not only sufficient parallel data blocks at the file level but also encoding algorithms that are parallelizable enough to fully utilize the computation within a GPU thread block.

7 CONCLUSION
In this paper, we comprehensively evaluate the common columnar formats, including Parquet and ORC. We build a taxonomy of the two formats to summarize the design of their format internals. To better test the formats' trade-offs, we analyze real-world data sets and design a benchmark that can sweep data distributions to demonstrate the differences among encoding algorithms. Using the benchmark, we conduct experiments on various metrics of the formats. Our results highlight essential design considerations that are advantageous for modern hardware and emerging ML workloads.

ACKNOWLEDGMENTS
The authors thank Pedro Pedreira, Yoav Helfman, Orri Erling, and Zhenyuan Zhao for discussing Meta's ML use cases. We also thank Gregory Kimball from NVIDIA for the feedback on the GPU-decoding experiments. This work was supported (in part) by Shanghai Qi Zhi Institute, National Science Foundation (IIS-1846158, SPX-1822933), VMware Research Grants for Databases, Google DAPA Research Grants, and the Alfred P. Sloan Research Fellowship program.
REFERENCES
[53] 2023. Zarr. https://zarr.dev/.
[1] 2016. File Format Benchmark - Avro, JSON, ORC & Parquet. [54] 2023. Zstandard. https://github.com/facebook/zstd.
https://www.slideshare.net/HadoopSummit/file-format-benchmark-avro- [55] Daniel Abadi, Peter Boncz, Stavros Harizopoulos, Stratos Idreos, Samuel Mad-
json-orc-parquet. den, et al. 2013. The design and implementation of modern column-oriented
[2] 2016. Format Wars: From VHS and Beta to Avro and Parquet. http://www.svds. database systems. Foundations and Trends® in Databases 5, 3 (2013), 197–280.
com/dataformats/. [56] Daniel Abadi, Samuel Madden, and Miguel Ferreira. 2006. Integrating compres-
[3] 2016. Inside Capacitor, BigQuery’s next-generation columnar storage sion and execution in column-oriented database systems. In Proceedings of the
format. https://cloud.google.com/blog/products/bigquery/inside-capacitor- 2006 ACM SIGMOD international conference on Management of data. 671–682.
bigquerys-next-generation-columnar-storage-format. [57] Azim Afroozeh and Peter Boncz. 2023. The FastLanes Compression Layout:
[4] 2017. Apache Arrow vs. Parquet and ORC: Do we really need a third Apache Decoding> 100 Billion Integers per Second with Scalar Code. Proceedings of the
project for columnar data representation? http://dbmsmusings.blogspot.com/ VLDB Endowment 16, 9 (2023), 2132–2144.
2017/10/apache-arrow-vs-parquet-and-orc-do-we.html. [58] Ankur Agiwal and Kevin Lai et al. 2021. Napa: Powering Scalable Data Ware-
[5] 2017. Some comments to Daniel Abadi’s blog about Apache Arrow. https: housing with Robust Query Performance at Google. Proceedings of the VLDB
//wesmckinney.com/blog/arrow-columnar-abadi/. Endowment (PVLDB) 14 (12) (2021), 2986–2998.
[6] 2022. UCI Machine Learning Repository. https://archive.ics.uci.edu/ml/datasets. [59] Anastassia Ailamaki, David J DeWitt, Mark D Hill, and Marios Skounakis. 2001.
php. Accessed: 2022-09-22. Weaving Relations for Cache Performance.. In VLDB, Vol. 1. 169–180.
[7] 2023. Amazon S3. https://aws.amazon.com/s3/. [60] Wail Y. Alkowaileet and Michael J. Carey. 2022. Columnar Formats for Schema-
[8] 2023. Apache Arrow. https://arrow.apache.org/. less LSM-Based Document Stores. Proc. VLDB Endow. 15, 10 (sep 2022),
[9] 2023. Apache Arrow Dataset API. https://arrow.apache.org/docs/python/ 2085–2097. https://doi.org/10.14778/3547305.3547314
generated/pyarrow.parquet.ParquetDataset.html. [61] Michael Armbrust, Tathagata Das, Liwen Sun, Burak Yavuz, Shixiong Zhu,
[10] 2023. Apache Avro. https://avro.apache.org/. Mukul Murthy, Joseph Torres, Herman van Hovell, Adrian Ionescu, Alicja
[11] 2023. Apache Carbondata. https://carbondata.apache.org/. Łuszczak, et al. 2020. Delta lake: high-performance ACID table storage over
[12] 2023. Apache Hadoop. https://hadoop.apache.org/. cloud object stores. Proceedings of the VLDB Endowment 13, 12 (2020), 3411–
[13] 2023. Apache Hive. https://hive.apache.org/. 3424.
[14] 2023. Apache Hudi. https://hudi.apache.org/. [62] Michael Armbrust, Ali Ghodsi, Reynold Xin, and Matei Zaharia. 2021. Lake-
[15] 2023. Apache Iceberg. https://iceberg.apache.org/. house: a new generation of open platforms that unify data warehousing and
[16] 2023. Apache Impala. https://impala.apache.org/. advanced analytics. In Proceedings of CIDR. 8.
[17] 2023. Apache ORC. https://orc.apache.org/. [63] Haoqiong Bian and Anastasia Ailamaki. 2022. Pixels: An Efficient Column
[18] 2023. Apache Parquet. https://parquet.apache.org/. Store for Cloud Data Lakes. In 2022 IEEE 38th International Conference on Data
[19] 2023. Apache Presto. https://prestodb.io/. Engineering (ICDE). IEEE, 3078–3090.
[20] 2023. Apache Spark. https://spark.apache.org/. [64] Haoqiong Bian, Ying Yan, Wenbo Tao, Liang Jeff Chen, Yueguo Chen, Xiaoyong
[21] 2023. Arrow C++ and Parquet C++. https://github.com/apache/arrow/tree/ Du, and Thomas Moscibroda. 2017. Wide table layout optimization based on
main/cpp. column ordering and duplication. In Proceedings of the 2017 ACM International
[22] 2023. AutoFaiss. https://github.com/criteo/autofaiss. Conference on Management of Data. 299–314.
[23] 2023. AutoFAISS build index API. https://criteo.github.io/autofaiss/API/ [65] Peter Boncz, Thomas Neumann, and Viktor Leis. 2020. FSST: fast random
_autosummary/autofaiss.external.quantize.build_index.html. Accessed: 2023- access string compression. Proceedings of the VLDB Endowment 13, 12 (2020),
07-17. 2649–2661.
[24] 2023. Azure Blob Storage. https://azure.microsoft.com/en-us/services/storage/ [66] Biswapesh Chattopadhyay, Priyam Dutta, Weiran Liu, Ott Tinn, Andrew Mc-
blobs/. cormick, Aniket Mokashi, Paul Harvey, Hector Gonzalez, David Lomax, Sagar
[25] 2023. BP5. https://adios2.readthedocs.io/en/latest/engines/engines.html#bp5. Mittal, et al. 2019. Procella: Unifying serving and analytical data at YouTube.
[26] 2023. Chroma. https://github.com/chroma-core/chroma/. (2019).
[27] 2023. ClickHouse. https://clickhouse.com/. [67] Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and
[28] 2023. ClickHouse Example Datasets. https://clickhouse.com/docs/en/getting- Russell Sears. 2010. Benchmarking Cloud Serving Systems with YCSB. In SoCC.
started/example-datasets. 143–154.
[29] 2023. Dremio. https://www.dremio.com//. [68] George P Copeland and Setrag N Khoshafian. 1985. A decomposition storage
[30] 2023. EDGAR Log File Data Sets. https://www.sec.gov/about/data/edgar-log- model. Acm Sigmod Record 14, 4 (1985), 268–279.
file-data-sets.html. [69] Dario Curreri, Olivier Curé, and Marinella Sciortino. [n.d.]. RDF DATA AND
[31] 2023. GeoNames Dataset. http://www.geonames.org/. COLUMNAR FORMATS. Master’s thesis.
[32] 2023. Google BigQuery. https://cloud.google.com/bigquery. [70] Benoit Dageville, Thierry Cruanes, Marcin Zukowski, Vadim Antonov, Artin
[33] 2023. Google Cloud Storage. https://cloud.google.com/storage. Avanes, Jon Bock, Jonathan Claybaugh, Daniel Engovatov, Martin Hentschel,
[34] 2023. Google snappy. http://google.github.io/snappy/. Jiansheng Huang, et al. 2016. The Snowflake Elastic Data Warehouse. In SIG-
[35] 2023. Hugging Face Datasets Server. https://huggingface.co/docs/datasets- MOD.
server/quick_start#access-parquet-files. Accessed: 2023-07-09. [71] Bailu Ding, Surajit Chaudhuri, Johannes Gehrke, and Vivek Narasayya. 2021.
[36] 2023. image-parquet. https://discuss.huggingface.co/t/image-dataset-best- DSB: A decision support benchmark for workload-driven and traditional data-
practices/13974. base systems. Proceedings of the VLDB Endowment 14, 13 (2021), 3376–3388.
[37] 2023. IMDb Datasets. https://www.imdb.com/interfaces/. [72] Avrilia Floratou, Umar Farooq Minhas, and Fatma Özcan. 2014. Sql-on-hadoop:
[38] 2023. InfluxData. https://www.influxdata.com/. Full circle back to shared-nothing database architectures. Proceedings of the
[39] 2023. NetCDF. https://www.unidata.ucar.edu/software/netcdf/. VLDB Endowment 7, 12 (2014), 1295–1306.
[40] 2023. NVIDIA Nsight Compute. https://developer.nvidia.com/nsight-compute. [73] Mike Folk, Gerd Heber, Quincey Koziol, Elena Pourmal, and Dana Robinson.
[41] 2023. ORC C++. https://github.com/apache/orc/tree/main/c%2B%2B. 2011. An overview of the HDF5 technology suite and its applications. In
[42] 2023. Parquet Bloom Filter Jira Discussion. https://issues.apache.org/jira/ Proceedings of the EDBT/ICDT 2011 Workshop on Array Databases. 36–47.
browse/PARQUET-41. [74] Jonathan Goldstein, Raghu Ramakrishnan, and Uri Shaft. 1998. Compressing
[43] 2023. Pinecone. https://www.pinecone.io/. relations and indexes. In Proceedings 14th International Conference on Data
[44] 2023. Protocol Buffers. https://developers.google.com/protocol-buffers/. Engineering. IEEE, 370–379.
[45] 2023. Public BI benchmark. https://github.com/cwida/public_bi_benchmark. [75] Anurag Gupta, Deepak Agarwal, Derek Tan, Jakub Kulesza, Rahul Pathak,
[46] 2023. Querying Parquet with Millisecond Latency. https://www.influxdata. Stefano Stefani, and Vidhya Srinivasan. 2015. Amazon Redshift and the Case
com/blog/querying-parquet-millisecond-latency/. for Simpler Data Warehouses. In SIGMOD.
[47] 2023. RAPIDS. https://rapids.ai/. [76] Yongqiang He, Rubao Lee, Yin Huai, Zheng Shao, Namit Jain, Xiaodong Zhang,
[48] 2023. Samsung 980 PRO 4.0 NVMe SSD. https://www.samsung.com/us/ and Zhiwei Xu. 2011. RCFile: A fast and space-efficient data placement struc-
computing/memory-storage/solid-state-drives/980-pro-pcie-4-0-nvme-ssd- ture in MapReduce-based warehouse systems. In 2011 IEEE 27th International
1tb-mz-v8p1t0b-am/. Accessed: 2023-02-21. Conference on Data Engineering. IEEE, 1199–1208.
[49] 2023. SequenceFile. https://cwiki.apache.org/confluence/display/HADOOP2/ [77] Brian Hentschel, Michael S Kester, and Stratos Idreos. 2018. Column sketches:
SequenceFile. A scan accelerator for rapid and robust predicate evaluation. In Proceedings of
[50] 2023. The DWRF Format. https://github.com/facebookarchive/hive-dwrf. the 2018 International Conference on Management of Data. 857–872.
[51] 2023. Vector Data Lakes. https://www.databricks.com/dataaisummit/session/ [78] Yin Huai, Ashutosh Chauhan, Alan Gates, Gunther Hagleitner, Eric N Hanson,
vector-data-lakes/. Accessed: 2023-07-28. Owen O’Malley, Jitendra Pandey, Yuan Yuan, Rubao Lee, and Xiaodong Zhang.
[52] 2023. Yelp Open Dataset. https://www.yelp.com/dataset/. 2014. Major technical advancements in apache hive. In Proceedings of the 2014
ACM SIGMOD international conference on Management of data. 1235–1246.
[79] S Idreos, F Groffen, N Nes, S Manegold, S Mullender, and M Kersten. 2012. [100] Anil Shanbhag, Bobbi W. Yogatama, Xiangyao Yu, and Samuel Madden. 2022.
Monetdb: Two decades of research in column-oriented database. IEEE Data Tile-Based Lightweight Integer Compression in GPU. In Proceedings of the
Engineering Bulletin (2012). 2022 International Conference on Management of Data (Philadelphia, PA, USA)
[80] Todor Ivanov and Matteo Pergolesi. 2020. The impact of columnar file formats on (SIGMOD ’22). Association for Computing Machinery, New York, NY, USA,
SQL-on-hadoop engine performance: A study on ORC and Parquet. Concurrency 1390–1403. https://doi.org/10.1145/3514221.3526132
and Computation: Practice and Experience 32, 5 (2020), e5523. [101] Lefteris Sidirourgos and Martin Kersten. 2013. Column imprints: a secondary
[81] Hao Jiang, Chunwei Liu, John Paparrizos, Andrew A Chien, Jihong Ma, and index structure. In Proceedings of the 2013 ACM SIGMOD International Conference
Aaron J Elmore. 2021. Good to the Last Bit: Data-Driven Encoding with on Management of Data. 893–904.
CodecDB. In Proceedings of the 2021 International Conference on Management of [102] Michael Stonebraker, Daniel J. Abadi, Adam Batkin, Xuedong Chen, Mitch Cher-
Data. 843–856. niack, Miguel Ferreira, Edmond Lau, Amerson Lin, Samuel Madden, Elizabeth J.
[82] Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2019. Billion-scale similarity O’Neil, Patrick E. O’Neil, Alex Rasin, Nga Tran, and Stanley B. Zdonik. 2005.
search with GPUs. IEEE Transactions on Big Data 7, 3 (2019), 535–547. C-Store: A Column-oriented DBMS. In Proceedings of the 31st International
[83] Maximilian Kuschewski, David Sauerwein, Adnan Alhomssi, and Viktor Leis. Conference on Very Large Data Bases, Trondheim, Norway, August 30 - September
2023. BtrBlocks: Efficient Columnar Compression for Data Lakes. Proc. ACM 2, 2005. ACM, 553–564.
Manag. Data 1, 2, Article 118 (jun 2023), 26 pages. https://doi.org/10.1145/ [103] The Transaction Processing Council. 2021. TPC-DS Benchmark (Revision 3.2.0).
3589263 [104] The Transaction Processing Council. 2022. TPC-H Benchmark (Revision 3.0.1).
[84] Daniel Lemire and Leonid Boytsov. 2015. Decoding billions of integers per [105] Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka,
second through vectorization. Software: Practice and Experience 45, 1 (2015), Suresh Anthony, Hao Liu, Pete Wyckoff, and Raghotham Murthy. 2009. Hive: a
1–29. warehousing solution over a map-reduce framework. Proceedings of the VLDB
[85] Yinan Li, Jianan Lu, and Badrish Chandramouli. 2023. Selection Pushdown in Endowment 2, 2 (2009), 1626–1629.
Column Stores Using Bit Manipulation Instructions. Proc. ACM Manag. Data 1, [106] Animesh Trivedi, Patrick Stuedi, Jonas Pfefferle, Adrian Schuepbach, and
2, Article 178 (jun 2023), 26 pages. https://doi.org/10.1145/3589323 Bernard Metzler. 2018. Albis: { High-Performance } File Format for Big Data Sys-
[86] Yinan Li and Jignesh M Patel. 2013. Bitweaving: Fast scans for main memory data tems. In 2018 USENIX Annual Technical Conference (USENIX ATC 18). 615–630.
processing. In Proceedings of the 2013 ACM SIGMOD International Conference [107] Kapil Vaidya, Subarna Chatterjee, Eric Knorr, Michael Mitzenmacher, Stratos
on Management of Data. 289–300. Idreos, and Tim Kraska. 2022. SNARF: a learning-enhanced range filter. Pro-
[87] Panagiotis Liakos, Katia Papakonstantinopoulou, and Yannis Kotidis. 2022. ceedings of the VLDB Endowment 15, 8 (2022), 1632–1644.
Chimp: efficient lossless floating point compression for time series databases. [108] Suketu Vakharia, Peng Li, Weiran Liu, and Sundaram Narayanan. 2023. Shared
Proceedings of the VLDB Endowment 15, 11 (2022), 3058–3070. Foundations: Modernizing Meta’s Data Lakehouse. In The Conference on Inno-
[88] Yihao Liu, Xinyu Zeng, and Huanchen Zhang. 2023. LeCo: Lightweight Com- vative Data Systems Research, CIDR.
pression via Learning Serial Correlations. arXiv preprint arXiv:2306.15374 (2023). [109] Adrian Vogelsgesang, Michael Haubenschild, Jan Finis, Alfons Kemper, Viktor
[89] Samuel Madden, Jialin Ding, Tim Kraska, Sivaprasad Sudhir, David Cohen, Leis, Tobias Muehlbauer, Thomas Neumann, and Manuel Then. 2018. Get
Timothy Mattson, and Nesime Tatbul. 2022. Self-Organizing Data Containers. Real: How Benchmarks Fail to Represent the Real World. In Proceedings of
In The Conference on Innovative Data Systems Research, CIDR. the Workshop on Testing Database Systems (Houston, TX, USA) (DBTest’18).
[90] Heikki Mannila. 1985. Measures of presortedness and optimal sorting algo- Association for Computing Machinery, New York, NY, USA, Article 1, 6 pages.
rithms. IEEE transactions on computers 100, 4 (1985), 318–325. [110] Jianguo Wang, Xiaomeng Yi, Rentong Guo, Hai Jin, Peng Xu, Shengjun Li,
[91] Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shiv- Xiangyu Wang, Xiangzhou Guo, Chengming Li, Xiaohai Xu, et al. 2021. Milvus:
akumar, Matt Tolton, and Theo Vassilakis. 2010. Dremel: interactive analysis of A purpose-built vector data management system. In Proceedings of the 2021
web-scale datasets. Proceedings of the VLDB Endowment 3, 1-2 (2010), 330–339. International Conference on Management of Data. 2614–2627.
[92] Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shiv- [111] Lei Wang, Jianfeng Zhan, Chunjie Luo, Yuqing Zhu, Qiang Yang, Yongqiang He,
akumar, Matt Tolton, Theo Vassilakis, Hossein Ahmadi, Dan Delorey, Slava Wanling Gao, Zhen Jia, Yingjie Shi, Shujie Zhang, Chen Zheng, Gang Lu, Kent
Min, et al. 2020. Dremel: A decade of interactive SQL analysis at web scale. Zhan, Xiaona Li, and Bizhu Qiu. 2014. BigDataBench: A big data benchmark
Proceedings of the VLDB Endowment 13, 12 (2020), 3461–3472. suite from internet services. In 2014 IEEE 20th International Symposium on High
[93] Patrick E O’Neil, Elizabeth J O’Neil, and Xuedong Chen. 2007. The star schema Performance Computer Architecture (HPCA). 488–499. https://doi.org/10.1109/
benchmark (SSB). Pat 200, 0 (2007), 50. HPCA.2014.6835958
[94] Tuomas Pelkonen, Scott Franklin, Justin Teller, Paul Cavallaro, Qi Huang, Justin [112] Bobbi W Yogatama, Weiwei Gong, and Xiangyao Yu. 2022. Orchestrating data
Meza, and Kaushik Veeraraghavan. 2015. Gorilla: A fast, scalable, in-memory placement and query execution in heterogeneous CPU-GPU DBMS. Proceedings
time series database. Proceedings of the VLDB Endowment 8, 12 (2015), 1816– of the VLDB Endowment 15, 11 (2022), 2491–2503.
1827. [113] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma,
[95] Pouria Pirzadeh, Michael Carey, and Till Westmann. 2017. A performance study Murphy McCauly, Michael J Franklin, Scott Shenker, and Ion Stoica. 2012.
of big data analytics platforms. In 2017 IEEE international conference on big data Resilient distributed datasets: A { Fault-Tolerant } abstraction for { In-Memory }
(big data). IEEE, 2911–2920. cluster computing. In 9th USENIX Symposium on Networked Systems Design and
[96] Felix Putze, Peter Sanders, and Johannes Singler. 2010. Cache-, Hash-, and Implementation (NSDI 12). 15–28.
Space-Efficient Bloom Filters. ACM J. Exp. Algorithmics 14, Article 4 (Jan 2010), [114] Xinyu Zeng, Yulong Hui, Jiahong Shen, Andrew Pavlo, Wes McKinney, and
18 pages. Huanchen Zhang. 2023. An Empirical Evaluation of Columnar Storage Formats.
[97] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, https://arxiv.org/pdf/2304.05028.pdf/. arXiv preprint arXiv:2304.05028 (2023).
Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, [115] Huanchen Zhang, Hyeontaek Lim, Viktor Leis, David G Andersen, Michael
Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crow- Kaminsky, Kimberly Keeton, and Andrew Pavlo. 2018. Surf: Practical range
son, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. 2022. LAION-5B: query filtering with fast succinct tries. In Proceedings of the 2018 International
An open large-scale dataset for training next generation image-text models. In Conference on Management of Data. 323–336.
NeurIPS. [116] Huanchen Zhang, Xiaoxuan Liu, David G Andersen, Michael Kaminsky, Kim-
[98] Raghav Sethi, Martin Traverso, Dain Sundstrom, David Phillips, Wenlei Xie, berly Keeton, and Andrew Pavlo. 2020. Order-preserving key compression for
Yutian Sun, Nezih Yegitbasi, Haozhun Jin, Eric Hwang, Nileema Shingte, et al. in-memory search trees. In Proceedings of the 2020 ACM SIGMOD International
2019. Presto: SQL on everything. In 2019 IEEE 35th International Conference on Conference on Management of Data. 1601–1615.
Data Engineering (ICDE). IEEE, 1802–1813. [117] Marcin Zukowski, Sandor Heman, Niels Nes, and Peter Boncz. 2006. Super-
[99] Anil Shanbhag, Samuel Madden, and Xiangyao Yu. 2020. A study of the funda- scalar RAM-CPU cache compression. In 22nd International Conference on Data
mental performance characteristics of GPUs and CPUs for database analytics. In Engineering (ICDE’06). IEEE, 59–59.
Proceedings of the 2020 ACM SIGMOD international conference on Management [118] Marcin Zukowski, Mark Van de Wiel, and Peter Boncz. 2012. Vectorwise: A
of data. 1617–1632. vectorized analytical DBMS. In 2012 IEEE 28th International Conference on Data
Engineering. IEEE, 1349–1350.