ABSTRACT
Columnar storage is a core component of a modern data analytics system. Although many database management systems (DBMSs) have proprietary storage formats, most provide extensive support to open-source storage formats such as Parquet and ORC to facilitate cross-platform data sharing. But these formats were developed over a decade ago, in the early 2010s, for the Hadoop ecosystem. Since then, both the hardware and workload landscapes have changed. In this paper, we revisit the most widely adopted open-source columnar storage formats (Parquet and ORC) with a deep dive into their internals. We designed a benchmark to stress-test the formats' performance and space efficiency under different workload configurations. From our comprehensive evaluation of Parquet and ORC, we identify design decisions advantageous with modern hardware and real-world data distributions. These include using dictionary encoding by default, favoring decoding speed over compression ratio for integer encoding algorithms, making block compression optional, and embedding finer-grained auxiliary data structures. We also point out the inefficiencies in the format designs when handling common machine learning workloads and using GPUs for decoding. Our analysis identified important considerations that may guide future formats to better fit modern technology trends.

PVLDB Reference Format:
Xinyu Zeng, Yulong Hui, Jiahong Shen, Andrew Pavlo, Wes McKinney, Huanchen Zhang. An Empirical Evaluation of Columnar Storage Formats. PVLDB, 17(2): 148 - 161, 2023.
doi:10.14778/3626292.3626298

PVLDB Artifact Availability:
The source code, data, and/or other artifacts have been made available at https://github.com/XinyuZeng/EvaluationOfColumnarFormats.

∗ Huanchen Zhang is also affiliated with Shanghai Qi Zhi Institute.

1 INTRODUCTION
Columnar storage has been widely adopted for data analytics because of its advantages, such as irrelevant attribute skipping, efficient data compression, and vectorized query processing [55, 59, 68]. In the early 2010s, organizations developed data processing engines for the open-source big data ecosystem [12], including Hive [13, 105], Impala [16], Spark [20, 113], and Presto [19, 98], to respond to the petabytes of data generated per day and the growing demand for large-scale data analytics. To facilitate data sharing across the various Hadoop-based query engines, vendors proposed open-source columnar storage formats [11, 17, 18, 76], represented by Parquet and ORC, that have become the de facto standard for data storage in today's data warehouses and data lakes [14, 15, 19, 20, 29, 38, 61].

These formats, however, were developed more than a decade ago. The hardware landscape has changed since then: persistent storage performance has improved by orders of magnitude, achieving gigabytes per second [48]. Meanwhile, the rise of data lakes means more column-oriented files reside in cheap cloud storage (e.g., AWS S3 [7], Azure Blob Storage [24], Google Cloud Storage [33]), which exhibits both high bandwidth and high latency. On the software side, a number of new lightweight compression schemes [57, 65, 87, 116], as well as indexing and filtering techniques [77, 86, 101, 115], have been proposed in academia, while existing open columnar formats are based on DBMS methods from the 2000s [56].

Prior studies on storage formats focus on measuring the end-to-end performance of Hadoop-based query engines [72, 80]. They fail to analyze the design decisions and their trade-offs. Moreover, they use synthetic workloads that do not consider skewed data distributions observed in the real world [109]. Such data sets are less suitable for storage format benchmarking.
The goal of this paper is to analyze common columnar file formats and to identify design considerations to provide insights for developing next-generation column-oriented storage formats. We created a benchmark with predefined workloads whose configurations were extracted from a collection of real-world data sets. We then performed a comprehensive analysis of the major components in Parquet and ORC, including encodings, block compression, metadata organization, indexing and filtering, and nested data modeling. In particular, we investigated how efficiently the columnar formats support common machine learning workloads and whether their designs are friendly to GPUs. We detail the lessons learned in Section 6 and summarize our main findings below.

First, there is no clear winner between Parquet and ORC in format efficiency. Parquet has a slight file size advantage because of its aggressive dictionary encoding. Parquet also has faster column decoding due to its simpler integer encoding algorithms, while ORC is more effective in selection pruning due to the finer granularity of its zone maps (a type of sparse index).

Second, most columns in real-world data sets have a small number of distinct values (or low "NDV ratios" defined in Section 4.1), which is ideal for dictionary encoding. As a result, the efficiency of integer-encoding algorithms (i.e., to compress dictionary codes) is critical to the format's size and decoding speed. Third, faster and cheaper storage devices mean that it is better to use faster decoding schemes to reduce computation costs than to pursue more aggressive compression to save I/O bandwidth. Formats should not apply general-purpose block compression by default because the bandwidth savings do not justify the decompression overhead. Fourth, Parquet and ORC provide simplistic support for auxiliary data structures (e.g., zone maps, Bloom Filters). As bottlenecks shift from storage to computation, there are opportunities to embed more sophisticated structures and precomputed results into the format to trade inexpensive space for less computation.

Fifth, existing columnar formats are inefficient in serving common machine learning (ML) workloads. Current designs are suboptimal in handling projections of thousands of features during ML training and low-selectivity selection during top-k similarity search in the vector embeddings. Finally, the current formats do not provide enough parallel units to fully utilize the computing power of GPUs. Also, unlike with CPUs, more aggressive compression is preferred in the formats with GPU processing because the I/O overhead (including PCIe transfer) dominates the file scan time.

We make the following contributions in this paper. First, we created a feature taxonomy for columnar storage formats like Parquet and ORC. Second, we designed a benchmark to stress-test the formats and identify their performance vs. space trade-offs under different workloads. Lastly, we conducted a comprehensive set of experiments on Parquet and ORC using our benchmark and summarized the lessons learned for future format design.

2 BACKGROUND AND RELATED WORK
The Big Data ecosystem in the early 2010s gave rise to open-source file formats. Apache Hadoop first introduced two row-oriented formats: SequenceFile [49], organized as key-value pairs, and Avro [10], based on JSON. At the same time, column-oriented DBMSs, such as C-Store [102], MonetDB [79], and VectorWise [118], developed the fundamental methods for efficient analytical query processing [55]: columnar compression, vectorized processing, and late materialization. The Hadoop community then adopted these ideas from columnar systems and developed more efficient formats. In 2011, Facebook/Meta released a column-oriented format for Hadoop called RCFile [76]. Two years later, Meta refined RCFile and announced the PAX (Partition Attribute Across)-based [59] ORC (Optimized Record Columnar File) format [17, 78]. A month after ORC's release, Twitter and Cloudera released the first version of Parquet [18]. Their format borrowed insights from earlier columnar storage research, such as the PAX model and the record-shredding and assembly algorithm from Google's Dremel [91]. Since then, both Parquet and ORC have become top-level Apache Foundation projects. They are also supported by most data processing platforms, including Hive [13], Presto/Trino [19, 98], and Spark [20, 113]. Even database products with proprietary storage formats (e.g., Redshift [75], Snowflake [70], ClickHouse [27], and BigQuery [32]) support Parquet and ORC through external tables.

Huawei's CarbonData [11] is another open-source columnar format that provides built-in inverted indexing and column groups. Because of its closer relationship with Spark, previous work failed to evaluate the format in isolation [106]. Recent work concludes that CarbonData has worse performance compared with Parquet and ORC and has a less active community [69].

A number of large companies have developed their own proprietary columnar formats in the last decade. Google's Capacitor format is used by many of their systems [3], including BigQuery [92] and Napa [58]. It is based on the techniques from Dremel [91] and Abadi et al. [56] that optimize layout based on workload behavior. YouTube developed the Artus format in 2019 for the Procella DBMS; it supports adaptive encoding without block compression and O(1) seek time for nested schemas [66]. Meta's DWRF is a variant of ORC with better support for reading and encrypting nested data [50]. Meta recently developed Alpha to improve the training workloads of machine learning (ML) applications [108].

Arrow is an in-memory columnar format designed for efficient exchange of data with limited or no serialization between different application processes or at library API boundaries [8]. Unlike Parquet or ORC, Arrow supports random access and thus does not require block-based decoding on reads. Because Arrow is not meant for long-term disk storage [5], we do not evaluate it in this paper.

The recent lakehouse [62] trend led to an expansion of formats to support better metadata management (e.g., ACID transactions). Representative projects include Delta Lake [61], Apache Iceberg [15], and Apache Hudi [14]. They add an auxiliary metadata layer and do not directly modify the underlying columnar file formats.

There are also scientific data storage formats for HPC workloads, including HDF5, BP5, NetCDF, and Zarr [25, 39, 53, 73]. They target heterogeneous data that has complex file structures, types, and organizations. Their data is typically multi-dimensional arrays, and they do not support column-wise encoding. Although they expose several language APIs, few DBMSs support these formats because of their lack of columnar storage features.

Most of the previous investigations of columnar formats target entire query processing systems without analyzing the format internals in isolation [72, 80, 95]. Trivedi et al. compared the read performance of Parquet, ORC, Arrow, and JSON on NVMe SSDs [106], but they only measured sequential scans with synthetic data sets (i.e., TPC-DS [103]). There are also older industry articles that compare popular columnar formats, but they do not provide an in-depth analysis of the internal design details [1, 2, 4].

Other research proposes ways to optimize these existing columnar formats under specific workloads or hardware configurations [63, 64, 89]. For example, Jiang et al. use ML to select the best encoding algorithms for Parquet according to the query history [81]. BtrBlocks integrates a sampling-based encoding selection algorithm to achieve the optimal decompression speed with network-optimized instances [83]. Li et al. proposed using BMI instructions to improve selection performance on Parquet [85]. None of these techniques, however, have been incorporated in the most popular formats.

3 FEATURE TAXONOMY
In this section, we present a taxonomy of columnar format features (see Table 1). For each feature category, we first describe the common designs shared by Parquet and ORC and then highlight their differences as well as the rationale behind the divergence.
Table 1: Feature Taxonomy – An overview of the features of columnar storage formats.

Feature | Parquet | ORC
Internal Layout (§3.1) | PAX | PAX
Encoding Variants (§3.2) | plain, RLE_DICT, RLE, Delta, Bitpacking | plain, RLE_DICT, RLE, Delta, Bitpacking, FOR
Compression (§3.3) | Snappy, gzip, LZO, zstd, LZ4, Brotli | Snappy, zlib, LZO, zstd, LZ4
Type System (§3.4) | Separate logical and physical type system | One unified type system
Zone Map / Index (§3.5) | Min-max per smallest zone map / row group / file | Min-max per smallest zone map / row group / file
Bloom Filter (§3.5) | Supported per column chunk | Supported per smallest zone map
Nested Data Encoding (§3.6) | Dremel model | Length and presence
Table 2: Concepts Mapping – Terms used in this paper and the corresponding ones in the formats.

This Paper | Parquet | ORC
Row Group | Row Group | Stripe
Smallest Zone Map | Page Index (a Page) | Row Index (10k rows)
Compression Unit | Page | Compression Chunk

3.1 Format Layout
As shown in Figure 1, both Parquet and ORC employ the PAX format. The DBMS first partitions a table horizontally into row groups. It then stores tuples column-by-column within each row group, with each attribute forming a column chunk. The hybrid columnar layout enables the DBMS to use vectorized query processing and mitigates the tuple reconstruction overhead within a row group. Many systems and libraries, such as DuckDB and Arrow, leverage the PAX layout to perform parallel reads over column chunks.

Both formats first apply lightweight encoding schemes to the values of each column chunk. The formats then use general-purpose block compression algorithms to reduce the column chunk's size. The entry point of a Parquet/ORC file is called a footer. Besides file-level metadata such as the table schema and tuple count, the footer keeps the metadata for each row group, including its offset in the file and zone maps for each column chunk. For clarity in our exposition, Table 2 summarizes the mapping between the terminology used in this paper and the terms used in Parquet/ORC.

Although the layouts of Parquet and ORC are similar, they differ in how they map logical blocks to physical storage. For example, (non-Java) Parquet uses a row-group size based on the number of rows (e.g., 1M rows), whereas ORC uses a fixed physical storage size (e.g., 64 MB). Parquet seeks to guarantee that there are enough entries within a row group to leverage vectorized query processing, but it may suffer from large memory footprints, especially with wide tables. On the other hand, ORC limits the physical size of a row group to better control memory usage, but it may lead to insufficient entries with large attributes.

Another difference is that Parquet maps its compression unit to the smallest zone map. ORC, whose compression chunks are sized independently of its row indexes (see Table 2), provides flexibility in tuning the performance-space trade-off of a block compression algorithm. However, this misalignment between the smallest zone map and compression units imposes extra complexity during query processing (e.g., a value may be split across unit boundaries).
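This footer-first layout is easy to observe in practice. Below is a minimal sketch using the pyarrow library (the file name example.parquet is an assumption): opening a Parquet file parses only the footer, which then exposes per-row-group and per-column-chunk metadata, including the zone maps (min/max statistics) described above.

```python
import pyarrow.parquet as pq

# Opening the file parses only the footer, not the data pages.
pf = pq.ParquetFile("example.parquet")   # hypothetical file name
meta = pf.metadata

print(meta.num_rows, meta.num_row_groups)
for rg in range(meta.num_row_groups):
    rg_meta = meta.row_group(rg)
    for col in range(rg_meta.num_columns):
        chunk = rg_meta.column(col)
        stats = chunk.statistics          # the zone map for this column chunk
        minmax = (stats.min, stats.max) if stats is not None and stats.has_min_max else None
        print(rg, chunk.path_in_schema, chunk.total_compressed_size, minmax)
```

A query engine performs the same footer pass before deciding which row groups and column chunks it actually needs to fetch.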
3.2 Encoding
Applying lightweight compression schemes to the columns can reduce both storage and network costs [56]. Parquet and ORC support standard OLAP compression techniques, such as Dictionary Encoding, Run-Length Encoding (RLE), and Bitpacking.

Parquet applies Dictionary Encoding aggressively to every column regardless of the data type by default, while ORC only uses it for strings. They both apply another layer of integer encoding on the dictionary codes. The advantage of applying Dictionary Encoding to an integer column, as in Parquet, is that it might achieve additional compression for large-value integers. However, the dictionary codes are assigned based on the values' first appearances in the column chunk and thus might destroy local serial patterns that could be compressed well by Delta Encoding or Frame-of-Reference (FOR) [74, 84, 117]. Therefore, Parquet only uses Bitpacking and RLE to further compress the dictionary codes.

Parquet imposes a limit (1 MB by default) on the dictionary size for each column chunk. When the dictionary is full, later values fall back to "plain" (i.e., no encoding) because a full dictionary indicates that the number of distinct values (NDVs) is too large. On the other hand, ORC computes the NDV ratio (i.e., NDV / row count) of the column to determine whether to apply Dictionary Encoding to it. If a column's NDV ratio is greater than a predefined threshold (e.g., 0.8), then ORC disables the encoding. Compared to Parquet's physical limit on the dictionary size, ORC's approach is more intuitive, and the tuning of the NDV ratio threshold is independent of the row group size.
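The two policies can be seen in the writer APIs. Below is a minimal sketch with pyarrow's Parquet writer (the file names and example columns are assumptions): dictionary encoding is on by default for every column, and the per-column-chunk dictionary budget that triggers the fallback to plain encoding is exposed as dictionary_pagesize_limit.

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"ints": list(range(10_000)) * 10,
                  "strs": ["a", "b", "c", "d"] * 25_000})

# Dictionary-encode every column (Parquet's default) with the default ~1 MB budget.
pq.write_table(table, "dict_default.parquet")

# Raise the dictionary budget, or restrict dictionary encoding to selected columns
# when a column's NDV ratio is known to be high.
pq.write_table(table, "dict_tuned.parquet",
               dictionary_pagesize_limit=4 * 1024 * 1024,
               use_dictionary=["strs"])   # only dictionary-encode "strs"
```

ORC writers expose the NDV-ratio policy instead (e.g., a dictionary key-size threshold option in the Java implementation), which matches the behavior described above.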
For integer columns, Parquet first dictionary-encodes the values and then applies a hybrid of RLE and Bitpacking to the dictionary codes. If the same value repeats ≥ 8 times consecutively, it uses RLE; otherwise, it uses Bitpacking. Interestingly, we found that the RLE threshold of 8 is a non-configurable parameter hard-coded in every implementation of Parquet. Although it saves Parquet a tuning knob, such inflexibility could lead to suboptimal compression ratios for specific data sets (e.g., when the common repetition length is 7).

Unlike Parquet's RLE + Bitpacking scheme, ORC includes four schemes to encode both dictionary codes (for string columns) and integer columns. ORC's integer encoder uses a rule-based greedy algorithm to select the best scheme for each subsequence of values. Starting from the beginning of the sequence, the algorithm keeps a look-ahead buffer (with a maximum size of 512 values) and tries to detect particular patterns. First, if there are subsequences of identical values with lengths between 3 and 10, ORC uses RLE to encode them. If the length of the identical values is greater than 10, or the values of a subsequence are monotonically increasing or decreasing, ORC applies Delta Encoding to the values. Lastly, for the remaining subsequences, the algorithm encodes them using either Bitpacking or a variant of PFOR [117], depending on whether there exist "outliers" in a subsequence. Figure 2 is an example of ORC's integer encoding schemes.
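To make the rule-based selection concrete, here is a simplified sketch of the greedy pattern detection described above. It is not ORC's actual implementation: the 512-value look-ahead, the run-length bounds, and the monotonicity test follow the rules in the text, while outlier detection is reduced to a crude stand-in.

```python
def run_length(seq):
    """Length of the run of identical values at the start of seq."""
    k = 1
    while k < len(seq) and seq[k] == seq[0]:
        k += 1
    return k

def monotonic_prefix(seq):
    """Length of the longest non-decreasing or non-increasing prefix of seq."""
    if len(seq) < 2:
        return len(seq)
    inc = dec = 1
    while inc < len(seq) and seq[inc] >= seq[inc - 1]:
        inc += 1
    while dec < len(seq) and seq[dec] <= seq[dec - 1]:
        dec += 1
    return max(inc, dec)

def choose_orc_encodings(values, lookahead=512):
    """Greedy, simplified stand-in for ORC's integer-encoding selection."""
    out, i = [], 0
    while i < len(values):
        window = values[i:i + lookahead]
        run = run_length(window)
        mono = monotonic_prefix(window)
        if 3 <= run <= 10:                       # short repeats -> RLE
            chunk, enc = window[:run], "RLE"
        elif run > 10 or mono >= 3:              # long repeats / monotonic runs -> Delta
            chunk, enc = window[:max(run, mono)], "DELTA"
        else:                                    # otherwise Bitpacking, or PFOR with outliers
            chunk = window
            widths = sorted(v.bit_length() for v in chunk)
            p90 = widths[int(0.9 * (len(widths) - 1))]
            enc = "PFOR" if widths[-1] > p90 else "BITPACK"   # crude outlier test
        out.append((enc, chunk))
        i += len(chunk)
    return out
```

Running this on the sequence in Figure 2 will not reproduce ORC's exact splits (the real writer applies more rules), but it shows the per-subsequence branching that later sections measure as branch mispredictions.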
The sophistication of ORC's integer encoding algorithm (compared to Parquet's) allows ORC to seize more opportunities for compression. However, switching between four encoding schemes slows down decoding, as we quantify in Section 5.3.2.
[Figure 1: Format layouts of Parquet (left) and ORC (right). Both organize a file into row groups of column chunks. Parquet stores each column chunk as pages (page header, repetition/definition levels, values) with a footer holding per-row-group metadata, zone maps, Bloom Filters, and the Page Index. ORC stores present/length/data streams per column chunk, with per-stripe zone maps and Bloom Filters and a footer holding the version, offsets, and lengths.]

[Figure 2: Example of ORC's integer encoding. An original integer sequence is split into subsequences that are encoded with Bitpacking, RLE, Delta, and PFOR (the latter when a subsequence contains an outlier).]

Union allows data values to have different types for the same column name. Recent work shows that a Union type can help optimize Parquet's Dremel model with schema changes [60].
[Figure 3: Nested Data Example – Assume all nodes except the root can be null. (a) Example schema and three sample records: {name: {first: Mike, last: Lee}, tags: [a, b]}, {name: {last: Hill}, tags: []}, {name: {first: Joe}, tags: [c]}. (b) Parquet's encoding (R/D = Repetition/Definition Level). (c) ORC's encoding (Len = length, P = presence).]
3.6 Nested Data Model
As semi-structured data sets such as those in JSON and Protocol Buffers [44] have become prevalent, an open format must support nested data. The nested data model in Parquet is based on Dremel [91]. As shown in Figure 3b, Parquet stores the values of each atomic field (the leaf nodes in the hierarchical schema in Figure 3a) as a separate column. Each column is associated with two integer sequences of the same length, called the repetition level (R) and the definition level (D), to encode the structure. R links the values to their corresponding "repeated fields", while D keeps track of the NULLs in the "non-required fields".

On the other hand, ORC adopts a more intuitive model based on length and presence to encode nested data [92]. As shown in Figure 3c, ORC associates a boolean column with each optional field to indicate value presence. For each repeated field, ORC includes an additional integer column to record the repeated lengths.

For comparison, ORC creates separate columns (presence and length) for non-atomic fields (e.g., "name" and "tags" in Figure 3c), while Parquet embeds this structural information in the atomic fields via R and D. The advantage of Parquet's approach is that it reads fewer columns (i.e., atomic fields only) during query processing. However, Parquet often produces a larger file size because the information about the non-atomic fields could be duplicated in multiple atomic fields (e.g., "first" and "last" both contain the information about the presence of "name" in Figure 3b).
in multiple atomic fields (e.g., “first” and “last” both contains the Sortedness: The degree of sortedness of a column affects not
information about the presence of “name” in Figure 3b). only the efficiency of encoding algorithms such as RLE and Delta
Encoding, but also the effectiveness of zone maps. Prior work has
proposed ways to measure the sortedness of a sequence [90], but
4 COLUMNAR STORAGE BENCHMARK
these metrics do not correlate strongly with encoding efficiency,
The next step is to stress-test the performance and space efficiency so we developed a simple metric that puts more emphasis on local
of the storage formats using data sets using varying value distribu- sortedness. We divide the column into fixed-sized blocks (512 entries
tions. Standard OLAP benchmarks such as SSB [93], TPC-H [104] by default). Within each block, we compute a sortedness score to
and TPC-DS [103] generate data sets with uniform distributions. reflect its ascending or descending tendency:
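To make the workflow concrete, here is a hypothetical configuration for one generated column, written as a plain Python dictionary. The benchmark's actual configuration format may differ; the keys simply mirror the column properties defined in Section 4.1.

```python
# Hypothetical per-column configuration for the workload generator.
column_config = {
    "name": "order_status",        # assumed column name
    "type": "string",              # int | string | float
    "num_rows": 1_000_000,
    "ndv_ratio": 0.001,            # NDV / row count
    "null_ratio": 0.05,
    "value_range": {"avg_length": 8, "variance": 2},   # byte length for strings
    "sortedness": 0.2,             # 0 = random order, 1 = fully sorted (per block)
    "skew": {"pattern": "gentle_zipf", "s": 0.8},
}

# A workload is a list of such column configs plus the predicates to generate.
workload = {"columns": [column_config], "predicates": {"point": 10, "range": 10}}
```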
Second, although some benchmarks, such as YCSB [67], DSB [71], 𝑎𝑠𝑐 = |{𝑖 |1 ≤ 𝑖 <𝑛 and 𝑎𝑖 < 𝑎𝑖+1 }|; 𝑑𝑒𝑠𝑐 = |{𝑖 |1 ≤ 𝑖 <𝑛 and 𝑎𝑖 > 𝑎𝑖+1 }|
and BigDataBench [111] allow users to set data skewness, the con- max(𝑎𝑠𝑐,𝑑𝑒𝑠𝑐 )+𝑒𝑞−⌊ 𝑁2 ⌋
figuration space is often too small to generate distributions that are 𝑒𝑞 = |{𝑖 |1 ≤ 𝑖 < 𝑛 and 𝑎𝑖 = 𝑎𝑖+1 }|; 𝑓𝑠𝑜𝑟𝑡𝑛𝑒𝑠𝑠 =
⌈ 𝑁2 ⌉−1
close to real-world data sets. Lastly, using real-world data is ideal, We then take the average of the per-block scores to represent the
but the number of high-quality resources available is insufficient column’s overall sortedness. A score of 1 means that the column
to cover a comprehensive analysis. is fully sorted, while a score close to 0 indicates a high probability
Given this, we designed a benchmark framework based on real- that the column’s values are randomly distributed. Although this
world data to evaluate multiple aspects of columnar formats. We metric is susceptible to adversarial patterns (e.g., 1, 2, 3, 4, 3, 2, 1), it
152
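A direct implementation of this per-block score (a sketch; the block size and the fully-sorted and unsorted extremes follow the definition above):

```python
import math

def block_sortedness(block):
    """Sortedness score of one block, following the definition above."""
    n = len(block)
    if n < 3:
        return 1.0                      # degenerate block: treat as sorted
    asc  = sum(block[i] < block[i + 1] for i in range(n - 1))
    desc = sum(block[i] > block[i + 1] for i in range(n - 1))
    eq   = sum(block[i] == block[i + 1] for i in range(n - 1))
    return (max(asc, desc) + eq - math.floor(n / 2)) / (math.ceil(n / 2) - 1)

def column_sortedness(values, block_size=512):
    """Average of the per-block scores over the whole column."""
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]
    return sum(block_sortedness(b) for b in blocks) / len(blocks)

print(column_sortedness(list(range(1000))))       # 1.0 (fully sorted)
print(column_sortedness([5, 1, 4, 2, 3] * 200))   # low score (mostly unsorted)
```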
Skew Pattern: We use the following pseudo-Zipfian distribution to model the value skewness in a column: $p(k) = \frac{1/k^s}{\sum_{n=1}^{C} 1/n^s}$. $C$ denotes the total number of distinct values, and $k$ refers to the frequency rank (e.g., $p(1)$ represents the portion occupied by the most frequent value). The Zipf parameter $s$ determines the column skewness: a larger $s$ leads to a more skewed distribution. Based on the range of $s$, we classified the skew patterns into four categories (a generator sketch follows this list):
• Uniform: When s ≤ 0.01. Each value appears in the column with a similar probability.
• Gentle Zipf: When 0.01 < s ≤ 2. The data is skewed to some extent. The long tail still dominates the values of a column.
• Hotspot: When s > 2. The data is highly skewed. A few hot values cover almost the entire column.
• Single/Binary: This represents extreme cases in real-world data where a column contains one or two distinct values.
The skew pattern is a key factor that determines the performance of both lightweight encodings and block compression algorithms.
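A small sketch of the generator side of this distribution using numpy (the category boundaries mirror the list above; the column sizes are arbitrary examples):

```python
import numpy as np

def pseudo_zipf_column(num_rows, num_distinct, s, seed=0):
    """Sample num_rows values out of C = num_distinct values with P(rank k) proportional to 1/k^s."""
    rng = np.random.default_rng(seed)
    ranks = np.arange(1, num_distinct + 1)
    probs = 1.0 / ranks ** s
    probs /= probs.sum()
    return rng.choice(num_distinct, size=num_rows, p=probs)

uniform = pseudo_zipf_column(1_000_000, 1000, s=0.01)   # Uniform
gentle  = pseudo_zipf_column(1_000_000, 1000, s=0.8)    # Gentle Zipf
hotspot = pseudo_zipf_column(1_000_000, 1000, s=2.5)    # Hotspot: a few hot values dominate
```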
4.2 Parameter Distribution in Real-World Data
We study the following real-world data sets to depict a parameter distribution for each of the core properties introduced in Section 4.1.
– Public BI Benchmark [45, 109]: real-world data and queries from Tableau with 206 tables (386 GB uncompressed).
– ClickHouse [28]: sample data sets from the ClickHouse tutorials, which represent typical OLAP workloads.
– UCI-ML [6]: a collection of 622 data sets for ML training. We select nine data sets that are larger than 100 MB. All are numerical data.
– Yelp [52]: Yelp's businesses, reviews, and user information.
– LOG [30]: log information on internet search traffic for EDGAR filings through SEC.gov.
– Geonames [31]: geographical information covering all countries.
– IMDb [37]: data sets that describe the basic information, ratings, and reviews of a collection of movies.

[Figure 5: Parameter Distribution – Percentage of total columns from the diverse data sets for different parameter values. (a) NDV Ratio (b) Null Ratio (c) Skew Pattern (d) Sortedness (e) Int Value Range (f) String Length.]

We extracted the core properties from each of the above data sets and plotted their parameter distributions in Figure 5. As shown in Figure 5a, over 80% of the integer columns and 60% of the string columns have an NDV ratio smaller than 0.01. Surprisingly, even for floating-point columns, 60% of them have significant value repetitions with an NDV ratio smaller than 0.1. This implies that Dictionary Encoding would be beneficial to most real-world columns. Figure 5b shows that the NULL ratio is low, and string columns tend to have more NULLs than the other data types.

Most columns in the real world exhibit a skewed value distribution, as shown in Figure 5c. Less than 5% of the columns can be classified as Uniform. Regardless of the data type, a majority of the columns fall into the category of Gentle Zipf. The remaining ≈ 30% of the columns contain "heavy hitters". This distribution indicates that an open columnar format must handle both the "heavy hitters" and the "long tails" (from Gentle Zipf) efficiently at the same time.

Figure 5d shows that the distribution of the sortedness scores is polarized: most columns are either well sorted or not sorted at all. This implies that encoding algorithms that excel only at sorted columns (e.g., Delta Encoding and FOR) could still play an important role. Lastly, as shown in Figure 5e, most integer columns have small values that are ideal for Bitpacking compression. Long string values are also rare in our data set collection (see Figure 5f). We also analyzed real-world Parquet files sampled from publicly available object store buckets and found that they mostly corroborate Figure 5.

4.3 Predefined Workloads
We extracted the column properties from the real-world data sets introduced in Section 4.2 and categorized them into five predefined workloads: bi (based on the Public BI Benchmark), classic (based on IMDb, Yelp, and a subset of the ClickHouse sample data sets), geo (based on Geonames and the Cell Towers and Air Traffic data sets from ClickHouse), log (based on LOG and the machine-generated log data sets from ClickHouse), and ml (based on UCI-ML). Table 3 presents the proportion of each data type for each workload, while Table 4 summarizes the parameter settings of the column properties. Each value in Table 4 represents a weighted average across the data types (e.g., if there are 6 integer columns with an NDV ratio of 0.1, 3 string columns with an NDV ratio of 0.2, and 1 float column with an NDV ratio of 0.4, the value reported in Table 4 would be 0.16).

The classic workload has both a higher Zipf parameter and a higher NDV ratio, indicating a long-tail distribution. On the other hand, the NDV ratio in log is relatively low, but the columns are better sorted. In terms of data types, classic and geo are string-heavy, while log and ml are float-heavy.

We then created the core workload, which is a mix of the five predefined workloads. It contains 50% bi columns, 21% classic, 7% geo, 7% log, and 15% ml. We will use core as the default workload in Section 5. For each workload, we also specify a data type distribution (Table 3).
[Table 3: Data type distribution of the different workloads – proportion of integer, string, and float columns in core, bi, classic, geo, log, and ml.]

[Figure 7: File size of Parquet and ORC when sweeping a single column property (NDV ratio, Zipf parameter s, sortedness, and value range) for integer (row 1), string (row 2), and float (row 3) columns.]
For each table, we use the core workload's parameter settings but modify one of the four column properties: NDV ratio, Value Range, Sortedness, and the Zipf parameter. Figure 7 shows how the file size changes when we sweep the parameter of each column property. We temporarily disabled block compression in both Parquet and ORC in these experiments.

As shown in the first row of Figure 7, Parquet achieves a better compression ratio than ORC for integer columns with a low to medium NDV ratio (which is common in real-world data sets) because Parquet applies Dictionary Encoding on integers before using Bitpacking + RLE. When the NDV ratio grows larger (e.g., > 0.1), this additional layer of Dictionary Encoding becomes less effective than ORC's more sophisticated integer encoding algorithms.

As the Zipf parameter s becomes larger, the compression ratios on integer columns improve for both Parquet and ORC (row 1, column 2 in Figure 7). The file size reduction happens earlier for ORC (s = 1) than for Parquet (s = 1.4). This is because RLE kicks in to replace Bitpacking earlier in ORC (with a run length ≥ 3) than in Parquet (with a run length ≥ 8). We also observe that when the integer column is highly sorted, ORC compresses those integers better than Parquet (row 1, column 3 in Figure 7) because of the adoption of Delta Encoding and FOR in its integer encoding.

Parquet's file size is stable as the value range of the integers varies (row 1, column 4 in Figure 7). Parquet applies Dictionary Encoding on the integers and uses Bitpacking + RLE on the dictionary codes only. Because these codes do not change as we vary the value range, the file size of Parquet stays the same in these experiments. On the other hand, the file size of ORC increases as the value range gets larger because ORC encodes the original integers directly.

For string columns, as shown in the second row of Figure 7, Parquet and ORC have almost identical file sizes because they both use Dictionary Encoding on strings. ORC has a slight size advantage over Parquet, especially when the dictionary is large, because ORC applies encoding on the string lengths of the dictionary entries.

The third row of Figure 7 shows the results for float columns. Parquet dominates ORC in file sizes because Dictionary Encoding is surprisingly effective on floating-point numbers in the real world.

Discussion: Because of the low NDV ratio of real-world columns (as shown in Figure 5), Parquet's strategy of applying Dictionary Encoding on every column seems to be a reasonable default for future formats. However, an encoding selection algorithm such as the one described in [81] is needed to handle the situation when Dictionary Encoding fails. Also, the format should expose certain encoding parameters, such as the minimum run length for RLE, for tuning so that users can make the trade-off more smoothly.

[Figure 8: Varying compression on the core workload. (a) Time [Uncompressed] (b) Time [Snappy] (c) Time [zstd] (d) Size [Uncompressed] (e) Size [Snappy] (f) Size [zstd], each broken down by int, string, and float columns for Parquet and ORC.]

[Figure 9: Varying run length on string columns, without compression. (a) Scan Time (b) File Size, for run lengths 2 through 10.]

Table 6: Branch mispredictions of Figure 8a.

Format | int | string | float
ORC | 2.9M | 3.1M | 0.3M
Parquet | 0.9M | 1.9M | 0.6M

Table 7: Subsequence count and data percentage per encoding for the integer columns in Table 6.

Format | RLE | Bitpack | Delta | PFOR
ORC | .7M (16%) | .7M (32%) | .2M (49%) | .01M (3%)
Parquet | .2M (46%) | .2M (54%) | 0 | 0

5.3.2 Decoding Speed. We next benchmark the decoding speed of Parquet and ORC. We use the data sets in Section 5.3.1 that follow the default core workload. Block compression is still disabled in the experiments in this section. We perform a full table scan on each file and measure the I/O time and the decoding time separately.
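A minimal harness for this kind of measurement, assuming files named core.parquet and core.orc written without block compression (a sketch of the methodology, not the benchmark's actual driver): reading the raw bytes first lets the I/O time and the decode time be timed separately.

```python
import io
import time
import pyarrow.parquet as pq
import pyarrow.orc as orc

def timed_scan(path, reader):
    t0 = time.perf_counter()
    raw = open(path, "rb").read()       # I/O: pull the whole file into memory
    t1 = time.perf_counter()
    table = reader(io.BytesIO(raw))     # decoding: parse and decode from the in-memory buffer
    t2 = time.perf_counter()
    return len(table), t1 - t0, t2 - t1 # rows, io_seconds, decode_seconds

print("parquet:", timed_scan("core.parquet", pq.read_table))
print("orc:    ", timed_scan("core.orc", orc.read_table))
```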
Figure 8a shows that Parquet has faster decoding than ORC for integer and string columns. As explained in Section 5.2, there are two main reasons behind this: (1) Parquet relies more on the fast Bitpacking and applies RLE less aggressively than ORC, and (2) Parquet has a simpler integer encoding scheme that involves fewer algorithm options. As shown in Table 6, switching between the four integer encoding algorithms in ORC generates 3× more branch mispredictions than Parquet during the decoding process (done on a similar physical machine to collect the performance counters). According to the breakdown in Table 7, ORC has 4× more subsequences to decode than Parquet, and the encoding algorithm distribution among the subsequences is unfriendly to branch prediction. Parquet's decoding-speed advantage over ORC shrinks for integers compared to strings, indicating a (slight) decoding overhead due to its additional dictionary layer for integer columns. Parquet also optimizes the bit-unpacking procedure using SIMD instructions and code generation to avoid unnecessary branches.

To further illustrate the performance and space trade-off between Bitpacking and RLE, we construct a string column with a pre-configured parameter r, where each string value repeats r times consecutively in the column. Recall that ORC applies RLE when r ≥ 3, while the RLE threshold for Parquet is r ≥ 8. Figure 9 shows the decoding speed and file sizes of Parquet and ORC with different r's. We observe that RLE takes longer to decode than Bitpacking for short repetitions. As r increases, this performance gap shrinks quickly. The file sizes show the opposite trend (Figure 9b), as RLE achieves a much better compression ratio than Bitpacking.
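A sketch of how such a run-length-controlled column can be generated and written to both formats with pyarrow (the file names and value pool are assumptions, and the exact writer options may vary between pyarrow versions); sweeping r from 2 to 10 mirrors the setup behind Figure 9.

```python
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.orc as orc

def run_length_column(num_rows, r, pool_size=1000):
    """Each distinct string value repeats exactly r times consecutively."""
    values, i = [], 0
    while len(values) < num_rows:
        values.extend([f"val_{i % pool_size:06d}"] * r)
        i += 1
    return pa.table({"s": values[:num_rows]})

for r in (2, 3, 8, 10):
    t = run_length_column(1_000_000, r)
    pq.write_table(t, f"runlen_{r}.parquet", compression="none")
    orc.write_table(t, f"runlen_{r}.orc", compression="uncompressed")
```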
For float columns, ORC achieves better decoding performance than Parquet because ORC does not apply any encoding algorithms to floating-point values. Although the float columns in ORC occupy much more space than the dictionary-encoded ones in Parquet (as shown in Figure 7), the saving in computation outweighs the I/O overhead with modern NVMe SSDs.

Discussion: Although more advanced encoding algorithms, such as FSST [65], HOPE [116], Chimp [87], and LeCo [88], have been proposed recently, it is important to keep the encoding scheme in an open format simple to guarantee a fast decoding speed. Selecting from multiple encoding algorithms at run time imposes a noticeable performance overhead on decoding. Future format designs should be cautious about putting encoding algorithms that only excel in specific situations on the decoding critical path.

In addition, as storage devices get faster, the local I/O time could become negligible during query processing. According to the float results in Figure 8a, even a scheme as lightweight as Dictionary Encoding adds significant computational overhead to a sequential scan, and this overhead is not offset by the I/O time savings. This indicates that most encoding algorithms still make trade-offs between storage efficiency and decoding speed with modern hardware (instead of a Pareto improvement as in the past). Future formats may not want to make any lightweight encoding algorithms "mandatory" (e.g., leave raw data as an option). Also, the ability to operate on compressed data is important with today's hardware.
[Figure 10: Block Compression – single-threaded scan time of a core-workload Parquet file across storage tiers (st1, gp3, gp2, io1, nvme, s3), broken down into I/O and decode time, with zstd vs. no compression.]

[Figure 11: Wide-Table Projection – latency of projecting 10 columns as the number of feature columns grows (200 to 10,000), broken down into metadata parsing and data decoding for Parquet and ORC.]

5.4 Block Compression
We study the performance-space trade-offs of block compression on the formats in this section. We first repeat the decoding-speed experiments in Section 5.3.2 with different algorithms (i.e., Snappy [34], Zstd [54]). As shown in Figures 8d to 8f, Zstd achieves a better compression ratio than Snappy for all data types. The results also show that block compression is effective on float columns in ORC because they contain raw values. For the rest of the columns in both Parquet and ORC, however, the space savings of such compression are limited because they are already compressed via lightweight encoding algorithms. Figures 8a to 8c also show that block compression imposes up to a 4.2× performance overhead on scanning.

We further investigate the I/O benefit and the computational overhead of block compression on Parquet across the different storage-device tiers available in AWS. The x-axis labels in Figure 10 show the storage tiers, where st1, gp3, gp2, and io1 are from Amazon EBS, while nvme is from an AWS i3 instance. These storage tiers are ordered by increasing read bandwidth. We generate a table with 1M rows and 20 columns according to the core workload and store the data in Parquet. The file sizes are 34 MB and 25 MB with Zstd disabled and enabled, respectively. We then perform scans on the Parquet files stored in each storage tier using a single thread.

As shown in Figure 10, applying Zstd to Parquet only speeds up scans on slow storage tiers (e.g., st1) where I/O dominates the execution time. For faster storage devices, especially NVMe SSDs, the I/O time is negligible compared to the computation time. In this case, the decompression overhead of Zstd hinders scan performance. The situation is different with S3 because of its high access latency [61]. Reading a Parquet file requires several round trips, including fetching the footer length, the footer, and lastly the column chunks. Therefore, even with multi-threaded optimization to fully utilize S3's bandwidth, the I/O cost of reading a medium-sized (e.g., 10s-100s of MB) Parquet file is still noticeable.

Discussion: As storage gets faster and cheaper, the computational overhead of block compression dominates the I/O and storage savings for a storage format. Unless the application is constrained by storage space, such compression should not be used in future formats. Moreover, as more data is located on cloud-resident object stores (e.g., S3), it is necessary to design a columnar format specifically for this operating environment (e.g., high bandwidth and high latency). Potential optimizations include storing all the metadata contiguously in the format to avoid multiple round trips, appropriately sizing the row groups (or files) to hide the access latency, and coalescing small-range requests to better utilize the cloud storage bandwidth [9, 63].
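Some of these mitigations are already available at the reader level. The sketch below reads a Parquet file from S3 with pyarrow and enables its coalesced/buffered reads so that small column-chunk ranges are fetched with fewer GET requests. The bucket path is a placeholder, and the exact option names can vary across pyarrow versions.

```python
import pyarrow.dataset as ds
import pyarrow.fs as fs

s3 = fs.S3FileSystem(region="us-east-1")        # assumes AWS credentials in the environment
dataset = ds.dataset("my-bucket/core.parquet",  # placeholder bucket/key
                     format="parquet", filesystem=s3)

# pre_buffer=True asks the Parquet reader to coalesce adjacent byte ranges and issue
# fewer, larger requests, which matters most on high-latency object storage.
scanner = dataset.scanner(
    columns=["col_0", "col_1"],                 # project only what is needed
    fragment_scan_options=ds.ParquetFragmentScanOptions(pre_buffer=True),
)
table = scanner.to_table()
```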
[Figure 12: Nested Data Model – varying the maximum depth of the nested schema (2 to 62); panels include file size (b), Time of Scanning to Arrow (c), and Nested Info Decode Overhead (d).]
5.5 Wide-Table Projection
According to our discussion with Meta's Alpha [108] team, it is common to store a large number of features (thousands of key-value pairs) for ML training in ORC files using the "flat map" data type, where the keys and values are stored in separate columns. Because each ML training process often fetches only a subset of the features, the columnar format must support wide-table projection efficiently. In this experiment, we generate a table of 10K rows with a varying number of float attributes. We store the table in Parquet and ORC and randomly select 10 attributes to project. Figure 11 shows the breakdown of the average latency of the projection queries.

As the number of attributes (i.e., features) in the table grows, the metadata parsing overhead increases almost linearly even though the number of projected columns stays fixed. This is because the footer structures in Parquet and ORC do not support efficient random access: the schema information is serialized in Thrift (Parquet) or Protocol Buffers (ORC), which only support sequential decoding. We also notice that ORC's performance declines as the table gets wider because there are fewer entries in each row group, whose size has a physical limit (64 MB).

Discussion: Wide tables are common, especially when storing features for ML training. Future formats must organize the metadata to support efficient random access to the per-column schema.
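The effect is easy to reproduce with a sketch like the one below (pyarrow, local files, hypothetical column names, and a scaled-down row count): as the column count grows, the time spent in pq.ParquetFile(...), which only parses the Thrift footer, grows with it, even though the query always projects the same 10 columns.

```python
import time
import pyarrow as pa
import pyarrow.parquet as pq

def make_wide_table(num_cols, num_rows=1_000):
    data = {f"f{i}": [float(i)] * num_rows for i in range(num_cols)}
    return pa.table(data)

for num_cols in (200, 2000, 8000):
    path = f"wide_{num_cols}.parquet"
    pq.write_table(make_wide_table(num_cols), path)

    t0 = time.perf_counter()
    pf = pq.ParquetFile(path)             # footer / schema parse only
    t1 = time.perf_counter()
    pf.read(columns=[f"f{i}" for i in range(10)])   # fixed 10-column projection
    t2 = time.perf_counter()
    print(num_cols, "metadata:", t1 - t0, "read:", t2 - t1)
```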
5.6 Indexes and Filters
We tested the efficacy of zone maps and Bloom Filters in Parquet and ORC by performing scans with predicates of varying selectivities. The experiment results are presented in our technical report [114]. Overall, zone maps and Bloom Filters can boost the performance of low-selectivity queries. However, zone maps are effective only for a small number of well-clustered columns, while Bloom Filters are useful only for point queries. Future formats should consider recent research advances in indexing and filtering structures such as column indexes [77, 86, 101] and range filters [107, 115].
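Zone-map pruning is what a reader's predicate pushdown consumes. As a sketch with pyarrow (hypothetical file and column names): the row-group statistics written by the Parquet writer let the reader skip row groups whose min/max range cannot satisfy the filter, whereas the same filter on an unclustered column prunes nothing, which is the "well-clustered columns" caveat above.

```python
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

n = 1_000_000
table = pa.table({
    "clustered": list(range(n)),                      # sorted -> tight per-row-group min/max
    "shuffled":  [(i * 48271) % n for i in range(n)]  # scrambled -> every row group overlaps
})
pq.write_table(table, "zonemap_demo.parquet", row_group_size=100_000)

dataset = ds.dataset("zonemap_demo.parquet", format="parquet")
# Only row groups whose zone map intersects [0, 1000) are read for this filter.
hits = dataset.to_table(filter=ds.field("clustered") < 1000)
# The equivalent filter on the shuffled column must read every row group.
misses = dataset.to_table(filter=ds.field("shuffled") < 1000)
print(len(hits), len(misses))   # both return 1000 rows, but with very different I/O
```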
5.7 Nested Data Model
In this section, we quantitatively evaluate the trade-off between the nested data models of Parquet and ORC. To only test the nested model and isolate other noise, we use float data so we can disable encoding. We record the file size, the time to read the file into an Arrow table, and the time to decode the nested structure during the table scan.

As shown in Figure 12b, as the depth of the schema tree gets larger, the Parquet file size grows faster than ORC's. On the other hand, ORC spends much more time transforming to Arrow (Figure 12c). The reason is that ORC needs to be read into its own in-memory data structure and then transformed to Arrow, and this transformation is not optimized. Therefore, we further profile the time spent decoding the nested information during the scan. The result in Figure 12d shows that ORC's overhead to decode the nested structure information grows larger than Parquet's as the schema gets deeper. The reason is that ORC needs to decode the structure information of struct and list fields, while Parquet only needs to decode the leaf fields along with their levels. This result is consistent with Dremel's retrospective work [92].

Discussion: The trade-offs between the two nested data models only manifest when the depth is large. Future formats should pay more attention to avoiding extra overhead during the translation between the on-disk and in-memory nested models.

5.8 Machine Learning Workloads
We next investigate how well the columnar formats support common ML workloads. Besides raw data (e.g., image URLs, text) and the associated metadata (e.g., image dimensions, tags), an ML data set often contains the vector embeddings of the raw data, which are vectors of floating-point numbers that enable similarity search in applications such as text-image matching and ad recommendation. It is common to store the entire ML data set in Parquet files [35], where the vector embeddings are stored as lists in Parquet's nested model. Additionally, ML applications often build separate vector indexes directly from Parquet to speed up similarity search [23].

5.8.1 Compression Ratio and Deserialization Performance with Vector Embeddings. In this experiment, we collect 30 data sets with vector embeddings from the top-downloaded and top-trending lists on Hugging Face and store the embeddings in four different formats: Parquet, ORC, HDF5, and Zarr. We then scan those files into in-memory NumPy arrays and record the scan time for each file. We report the median, 25/75%, and min/max of the compression ratio (format_size / Numpy_size) and the scan slowdown (format_scan_time / disk_Numpy_scan_time) in Figure 13.

Figure 13a shows that none of the four formats achieves good compression with vector embeddings, although Zarr is optimized for storing numerical arrays. Zarr, however, incurs a smaller scanning overhead compared to Parquet and ORC, as shown in Figure 13b.
[Figure 13: Efficiency of storing and scanning embeddings – (a) Compression Ratio and (b) Scan Time relative to NumPy, for hdf5-gzip, parquet-zstd, orc-zstd, and zarr-blosc-zstd.]

[Figure 14: Top-k Search Workflow Breakdown (k = 10) – (a) time of the vector index search vs. selection on the files using the resulting row IDs, and (b) S3 GET requests issued, for Parquet and ORC on SSD and S3 while varying the vector batch size.]

This is because Zarr divides a list of (fixed-length) vector embeddings into grid chunks to facilitate parallel scanning/decoding of the vectors. On the other hand, Parquet and ORC only support sequential decoding within a row group.

Discussion: Existing columnar formats are not well optimized to store and deserialize vector embeddings, which are prevalent in ML data sets. Future format designs should include specialized data types/structures and allow better floating-point compression [83, 87, 94].
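One such specialized structure already exists in Arrow's type system: a fixed-size list, which records the vector dimensionality in the schema and stores the floats contiguously instead of as a general variable-length list. A sketch of writing embeddings both ways with pyarrow (the dimensions and file names are assumptions):

```python
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

dim, rows = 768, 10_000
vectors = np.random.rand(rows, dim).astype(np.float32)

# Variable-length list<float32>: how embeddings commonly end up in Parquet today.
var_col = pa.array(vectors.tolist(), type=pa.list_(pa.float32()))

# Fixed-size list<float32, 768>: the dimensionality lives in the schema and the
# values are stored contiguously with no per-row offsets.
fixed_col = pa.FixedSizeListArray.from_arrays(pa.array(vectors.ravel()), dim)

pq.write_table(pa.table({"emb": var_col}), "emb_varlist.parquet")
pq.write_table(pa.table({"emb": fixed_col}), "emb_fixedlist.parquet")
```

Note that Parquet itself has no fixed-size list physical type, so pyarrow maps the second column back onto Parquet's regular nested list on disk; that gap is part of what the discussion above points out.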
[Figure 16: GPU Decoding – (a) decoding throughput (Mrows/s) on the core workload for Parquet-Arrow, Parquet-cuDF, ORC-Arrow, and ORC-cuDF while varying the row count; (b) peak GPU throughput percentage; (c) cuDF throughput with and without zstd; (d) time breakdown (I/O including PCIe transfer, decompress, decode) of ORC in (c).]
In the first experiment, we scan and decode the files using Arrow (with multithreading and I/O prefetching enabled) and cuDF, respectively. As shown in Figure 16a, ORC-cuDF exhibits higher decoding throughput than Parquet-cuDF because ORC has more independent blocks to better utilize the massive parallelism provided by the GPU: the smallest zone map in ORC maps to fewer rows than Parquet's, and cuDF assigns each GPU thread block to one smallest-zone-map region. As the number of rows in the files increases, the decoding throughput of Parquet-Arrow scales because there are more row groups to leverage for multi-core parallel decoding with asynchronous I/O. In contrast, the Arrow implementation for ORC does not support parallel reads.

We further profile the GPU's peak throughput in the above experiment as a percentage of its theoretical maximum using Nsight Compute [40]. As shown in Figure 16b, the overall compute utilization is low (although the GPU occupancy is full when the row count reaches 8M). This is because the integer encoding algorithms used in Parquet and ORC (e.g., hybrid RLE + Bitpacking) are not designed for parallel processing: all threads must wait for the first thread to scan the entire data block to obtain their offsets in the input and output buffers. Moreover, because cuDF assigns a warp (32 threads) to each encoded run, a short run (e.g., a length-3 RLE run in ORC) causes the threads within a warp to be underutilized.

We next perform a controlled experiment under the same setting as above to evaluate how block compression affects GPU decoding. Figure 16c shows that applying zstd improves the scan throughput for both Parquet and ORC when there are enough rows in the files (i.e., enough data to leverage GPU parallelism). Figure 16d shows the scan time breakdown. We observe that the I/O time (including the PCIe transfer between GPU and CPU) dominates the scan performance, making aggressive block compression pay off.

Discussion: Existing columnar formats are not designed to be GPU-friendly. The integer encoding algorithms operate on variable-length subsequences, making decoding hard to parallelize efficiently. Future formats should favor encodings with better parallel processing potential. Besides, aggressive block compression is beneficial to alleviate the dominating I/O overheads (unlike with CPUs).
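The serial dependency described above comes from the fact that a thread cannot know where its run begins in the output until all earlier run lengths are known. The sketch below shows that dependency in NumPy terms: the output offsets are an exclusive prefix sum over the run lengths, which a GPU must compute as a parallel scan before any run can be expanded independently (illustration only, not cuDF's kernel code).

```python
import numpy as np

# An RLE-encoded column chunk: (value, run_length) pairs.
values      = np.array([7, 3, 9, 3], dtype=np.int64)
run_lengths = np.array([5, 2, 4, 3], dtype=np.int64)

# Exclusive prefix sum of run lengths = where each run starts in the decoded output.
# This scan is the step every thread depends on before runs can be expanded in parallel.
out_offsets = np.concatenate(([0], np.cumsum(run_lengths)[:-1]))

decoded = np.empty(run_lengths.sum(), dtype=np.int64)
for value, start, length in zip(values, out_offsets, run_lengths):
    decoded[start:start + length] = value   # each run is now independent work

print(out_offsets)   # [ 0  5  7 11]
print(decoded)       # [7 7 7 7 7 3 3 9 9 9 9 3 3 3]
```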
6 LESSONS AND FUTURE DIRECTIONS
We summarize the lessons learned from our evaluation of Parquet and ORC to guide future innovations in columnar storage formats.

Lesson 1. Dictionary Encoding is effective across data types (even for floating-point values) because most real-world data have low NDV ratios. Future formats should continue to apply the technique aggressively, as in Parquet.

Lesson 2. It is important to keep the encoding scheme simple in a columnar format to guarantee competitive decoding performance. Future format designers should pay attention to the performance cost of selecting from many codec algorithms during decoding.

Lesson 3. The bottleneck of query processing is shifting from storage to (CPU) computation on modern hardware. Future formats should limit the use of block compression and other heavyweight encodings unless the benefits are justified in specific cases.

Lesson 4. The metadata layout in future formats should be centralized and friendly to random access to better support the wide (feature) tables common in ML training. The size of the basic I/O block should be optimized for high-latency cloud storage.

Lesson 5. As storage is getting cheaper, future formats could afford to store more sophisticated indexing and filtering structures to speed up query processing.

Lesson 6. Nested data models should be designed with an affinity to modern in-memory formats to reduce the translation overhead.

Lesson 7. The characteristics of common machine learning workloads require future formats to support both wide-table projections and low-selectivity selections efficiently. This calls for better metadata organization and more effective indexing. Besides, future formats should allocate separate regions for large binary objects and incorporate compression techniques specifically designed for floats.

Lesson 8. Future formats should consider decoding efficiency with GPUs. This requires not only sufficient parallel data blocks at the file level but also encoding algorithms that are parallelizable to fully utilize the computation within a GPU thread block.

7 CONCLUSION
In this paper, we comprehensively evaluate the common columnar formats, including Parquet and ORC. We build a taxonomy of the two formats to summarize the design of their format internals. To better test the formats' trade-offs, we analyze real-world data sets and design a benchmark that can sweep data distributions to demonstrate the differences in encoding algorithms. Using the benchmark, we conduct experiments on various metrics of the formats. Our results highlight essential design considerations that are advantageous for modern hardware and emerging ML workloads.

ACKNOWLEDGMENTS
The authors thank Pedro Pedreira, Yoav Helfman, Orri Erling, and Zhenyuan Zhao for discussing Meta's ML use cases. We also thank Gregory Kimball from NVIDIA for the feedback on the GPU-decoding experiments. This work was supported (in part) by Shanghai Qi Zhi Institute, National Science Foundation (IIS-1846158, SPX-1822933), VMware Research Grants for Databases, Google DAPA Research Grants, and the Alfred P. Sloan Research Fellowship program.