
BtrBlocks: Efficient Columnar Compression for Data Lakes

Maximilian Kuschewski (maximilian.kuschewski@tum.de), Technische Universität München
David Sauerwein (david.sauerwein@fau.de), Friedrich-Alexander-Universität Erlangen-Nürnberg
Adnan Alhomssi (adnan.alhomssi@fau.de), Friedrich-Alexander-Universität Erlangen-Nürnberg
Viktor Leis (leis@in.tum.de), Technische Universität München

ABSTRACT
Analytics is moving to the cloud and data is moving into data lakes. These reside on blob storage services like S3 and enable seamless data sharing and system interoperability. To support this, many systems build on open storage formats like Apache Parquet. However, these formats are not optimized for remotely-accessed data lakes and today's high-throughput networks. Inefficient decompression makes scans CPU-bound and thus increases query time and cost. With this work we present BtrBlocks, an open columnar storage format designed for data lakes. BtrBlocks uses a set of lightweight encoding schemes, achieving fast and efficient decompression and high compression ratios.

[Figure 1: S3 scan cost and throughput (c5n.18xlarge) on the 5 largest Public BI Benchmark datasets. Axes: S3 scans per dollar vs. S3 scan throughput [Gbps]; series: BtrBlocks, Parquet, Parquet+Snappy, Parquet+Zstd; the S3 scan limit is marked.]

CCS CONCEPTS
• Information systems → Data compression.

KEYWORDS
data lake, query processing, compression, columnar storage

ACM Reference Format:
Maximilian Kuschewski, David Sauerwein, Adnan Alhomssi, and Viktor Leis. 2023. BtrBlocks: Efficient Columnar Compression for Data Lakes. Proc. ACM Manag. Data 1, 2, Article 118 (June 2023), 14 pages. https://doi.org/10.1145/3589263

1 INTRODUCTION
Data warehousing is moving to the cloud. Many organizations collect and analyze ever larger datasets, and, increasingly, these are stored in public clouds such as Amazon AWS, Microsoft Azure and Google Cloud. To analyze these datasets, customers use cloud-native data warehousing systems such as Snowflake [29], Databricks [25], Amazon Redshift [34], Microsoft Azure Synapse Analytics [23] or Google BigQuery [47, 48]. Another trend in cloud data warehousing is the disaggregation of storage and compute, where the data is stored on distributed cloud object stores such as S3, and where compute power can be spawned elastically on demand. This architecture was pioneered by BigQuery and Snowflake, and even systems that initially started with a horizontally partitioned, shared-nothing design like Redshift are transitioning to disaggregated storage [24].
Data warehouses can become proprietary data traps. Cloud-native data warehousing systems are optimized for analytical queries through vectorized processing [27] or compilation [50], and all systems rely on compressed columnar storage [21], which has become a proven and mature technology. By default, most systems use proprietary storage formats. The big downside of proprietary formats is that they effectively trap the data in one system (or one vendor's ecosystem). Non-SQL analytics systems for machine learning or business intelligence often have to first extract the data from the data warehouse, which is not only cumbersome but also inefficient and expensive for large datasets. Often this leads to several unnecessary data copies all residing in the same object store – multiplying storage cost and making data changes difficult.
Data lakes and open storage formats. Data lakes enable interoperability across different analytics applications, including SQL-based data warehousing and complex analytics [60]. They do this by storing data on cloud object stores such as S3, and by relying on open storage formats such as Parquet or ORC that can be accessed by any analytics system. Given that the idea of data lakes is not new, one may wonder why proprietary solutions are still more common than open data lakes. We believe that this is due to two reasons. First, networks used to be slow, making data lake access from object stores relatively slow. Second, compared to their proprietary cousins, Parquet and ORC are neither efficient in terms of scan performance nor compact, which is why they are often combined with general-purpose compression schemes like Snappy [11] or Zstd [12]. While the network bottleneck has been solved with the arrival of cheap 100 Gbit networking instances (e.g., c5n or c6gn in AWS), in this paper we attack the second problem.
BtrBlocks. In this paper, we propose BtrBlocks ("better blocks"), an open-source columnar storage format for data lakes.
BtrBlocks is designed to minimize overall workload cost through low storage cost and fast decompression. To achieve good compression on real-world data, we combine seven existing and one new encoding scheme, all of which offer fast decompression performance and can be used in a cascade (i.e., RLE then Bit-packing). BtrBlocks also includes an algorithm for determining which encoding to use for a particular block of data. Figure 1 compares its scan speed and cost with Parquet, the most common open data lake format. With real-world data from the five largest datasets in the Public BI Benchmark, scans using BtrBlocks are 2.2× faster and 1.8× cheaper due to its superior decompression performance. This makes BtrBlocks highly attractive as an in situ data format for data lakes.
Related Work and Contributions. Much of the existing research on compression focuses on specific encodings for integers [30, 31, 42, 61], while work on compressing strings [26, 39] and floating-point numbers [46] is more sparse. Furthermore, there is a surprising lack of end-to-end designs, i.e., a set of complementary encoding schemes and an algorithm that decides between them. This work consists of the following contributions: (1) A complete compression design for relational data based on an empirically-selected set of compression schemes that are introduced in Section 2. (2) A sampling-based algorithm for choosing the best compression scheme for any piece of data, discussed in Section 3. (3) A novel floating-point scheme called Pseudodecimal Encoding, which we describe in Section 4. (4) An extensive evaluation of BtrBlocks in Section 6 using the Public BI benchmark, a collection of real-world, heterogeneous, and complex business intelligence datasets. BtrBlocks is open source and available at https://github.com/maxi-k/btrblocks.

2 BACKGROUND
Outline. In this section, we introduce existing open data lake formats before describing the encodings used in BtrBlocks.

2.1 Existing Open File Formats
Parquet & ORC. Apache Parquet and Apache ORC are open source, column-oriented formats widely supported by modern analytics systems. Like BtrBlocks and most column stores, they apply block-based columnar compression. Both are quite similar, but Parquet is more widely used, which is why we focus on it.
Column encoding in Parquet. Parquet encodes columns using a fixed selection of encoding schemes. The supported encodings are Run-length Encoding (RLE), Dictionary, Bit-packing and variants of Delta Encoding [13]. Which encoding to use is either specified by the user or decided with hard-coded, implementation-specific rules. After encoding chunks of multiple columns, Parquet bundles the results into rowgroups. Multiple rowgroups are combined into a Parquet file, with metadata about each stored in the footer.
Metadata & Statistics. Each Parquet file includes metadata, statistics and lightweight indices. While important for query processing, we believe these are misplaced in the data file. One would like to prune data using statistics and indices before accessing a file through a high-latency network. We thus follow a different approach by decoupling compression from the rest of the file format: BtrBlocks only produces blocks of compressed data with a configurable size. Metadata, statistics and indices are completely orthogonal and may be added on top or tracked separately.

Table 1: Encoding Schemes used in BtrBlocks

  Scheme           Reference   Code   Type
  RLE              —           our    all
  One Value        —           our    all
  Dictionary       —           our    all
  Frequency        —           our    all
  SIMD-FastPFOR    [42]        [1]    int
  SIMD-FastBP128   [42]        [1]    int
  FSST             [26]        [2]    string
  Roaring          [43]        [7]    bitmap
  Pseudodecimal    Section 4   our    float

Additional general-purpose compression. The set of available encoding schemes in Parquet is small and the rules it uses to choose per-column encoding schemes are simplistic. For example, the default C++ implementation simply tries dictionary compression and leaves the data uncompressed if the dictionary grows too large [3, 54]. As a result, the achieved compression ratios are low in practice. To remedy this, encoded Parquet columns are often compressed again with a general-purpose, heavyweight compression scheme. The scheme is configurable [20] and the set of available options includes Snappy, Brotli, Gzip, Zstd, LZ4, LZO and BZip2. We show results for Zstd and Snappy, which provide two different trade-offs between compression effectiveness and decompression speed. LZ4 [14] behaved very similarly to Snappy in our experiments.
A better way to compress. We found that general-purpose schemes on top of simple encodings are quite inefficient to decompress and thus refrain from using them. Instead, BtrBlocks expands on the selection of lightweight encodings Parquet offers. Additionally, it substantially improves the scheme selection algorithm and allows for applying multiple encoding schemes recursively.

2.2 Compression Schemes Used In BtrBlocks
Combining fast encodings. The idea of BtrBlocks is to combine multiple type-specific efficient encoding schemes that cover different data distributions and therefore achieve a high compression ratio while keeping decompression fast. Table 1 lists the encoding schemes we use in BtrBlocks. BtrBlocks compresses columns of typed data (integers, double floating-point numbers and variable-length strings). Like many existing formats [8, 15, 26, 36, 38, 39, 53], it divides each column into fixed-size blocks with a default size of 64,000 entries. Compressing blocks individually allows BtrBlocks to react to changing data distributions by adapting the compression scheme to the data in each block. Blocks also facilitate parallelizing compression and decompression. BtrBlocks is based on a number of existing encoding schemes, which we briefly describe below.
RLE & One Value. Run-length Encoding (RLE) is a ubiquitous technique that compresses runs of equal values. Instead of storing the run {42, 42, 42}, for example, we store (42, 3). One Value is a specialization for columns with only one unique value per block.
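To make these two schemes concrete, a minimal scalar sketch of RLE with the One Value special case (illustrative only, not BtrBlocks' storage layout):

#include <cstdint>
#include <utility>
#include <vector>

// Encode a column chunk as (value, run length) pairs; a chunk whose
// output is a single pair is the One Value special case.
std::vector<std::pair<int32_t, uint32_t>> rleEncode(const std::vector<int32_t>& in) {
    std::vector<std::pair<int32_t, uint32_t>> runs;
    for (int32_t v : in) {
        if (!runs.empty() && runs.back().first == v)
            runs.back().second++;          // extend the current run
        else
            runs.push_back({v, 1});        // start a new run
    }
    return runs;
}

// Decompression replicates each value run-length times.
std::vector<int32_t> rleDecode(const std::vector<std::pair<int32_t, uint32_t>>& runs) {
    std::vector<int32_t> out;
    for (auto [value, count] : runs)
        out.insert(out.end(), count, value);
    return out;
}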
Dictionary. Another simple but effective scheme is Dictionary Encoding, which replaces distinct values in the input with (shorter) codes. A lookup structure (the dictionary) maps each code to the original distinct value. The data structure used for implementing the dictionary is determined by the encoded type, e.g., an array for fixed-size values and a string pool with offsets for variable-size values. In some lightweight formats [36], dictionaries are often the only way of compressing strings.
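A minimal sketch of Dictionary Encoding with an array-backed dictionary for a fixed-size type (illustrative names; for strings the dictionary would be a string pool with offsets, as described above):

#include <cstdint>
#include <unordered_map>
#include <vector>

struct DictColumn {
    std::vector<int64_t> dict;    // code -> original value (array dictionary)
    std::vector<uint32_t> codes;  // one code per input row
};

DictColumn dictEncode(const std::vector<int64_t>& in) {
    DictColumn out;
    std::unordered_map<int64_t, uint32_t> lookup;  // value -> code (compression only)
    for (int64_t v : in) {
        auto [it, inserted] = lookup.try_emplace(v, (uint32_t) out.dict.size());
        if (inserted) out.dict.push_back(v);       // first occurrence of this value
        out.codes.push_back(it->second);
    }
    return out;
}

// Decompression is a simple gather from the dictionary.
int64_t dictDecodeOne(const DictColumn& c, size_t row) { return c.dict[c.codes[row]]; }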
Frequency. Skewed distributions, where some values are much more common than the rest, are not uncommon in real-world datasets. DB2 BLU [53] proposed a Frequency Encoding that uses several code lengths based on data frequency. For example, a one bit code can represent the two most frequent values, a three bit code the next eight most frequent values, and so on [53]. In BtrBlocks, we adapt Frequency Encoding based on our analysis of real-world data [17]: Often, a column only has one dominant frequent value, with the next most frequent values occurring exponentially less often. We optimize for this case by only storing (1) the top value, (2) a bitmap marking which values are the top value and (3) the exception values which are not the top value.
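A sketch of this adapted Frequency Encoding layout (simplified; BtrBlocks stores the bitmap as a Roaring bitmap, and all names here are illustrative):

#include <cstddef>
#include <vector>

struct FrequencyEncoded {
    double top_value;               // (1) the single dominant value
    std::vector<bool> is_top;       // (2) bitmap: true where the row equals the top value
    std::vector<double> exceptions; // (3) values of the remaining rows, in order
};

FrequencyEncoded frequencyEncode(const std::vector<double>& in, double top) {
    FrequencyEncoded out{top, std::vector<bool>(in.size(), false), {}};
    for (size_t i = 0; i < in.size(); i++) {
        if (in[i] == top) out.is_top[i] = true;
        else out.exceptions.push_back(in[i]);
    }
    return out;
}

std::vector<double> frequencyDecode(const FrequencyEncoded& e) {
    std::vector<double> out(e.is_top.size(), e.top_value);
    size_t next = 0;
    for (size_t i = 0; i < out.size(); i++)
        if (!e.is_top[i]) out[i] = e.exceptions[next++];
    return out;
}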
3 SCHEME SELECTION & COMPRESSION
FOR & Bit-packing. For integers, Frame of Reference (FOR) en- Scheme selection algorithms. In Section 2.2, we presented en-
codes each value as a delta to a chosen base value. For example, coding schemes for different data types. The effectiveness of these
instead of storing {105, 101, 113}, we can choose the base 100 and encodings differs strongly depending on the data distribution. Given
store {5, 1, 13} instead. This can be useful in combination with Bit- a set of encodings, we therefore need an algorithm for deciding
packing, which truncates unnecessary leading bits. After applying which encoding is most effective for a particular data block. Sim-
FOR to our example sequence, we can bit-pack {5, 1, 13} using 4 ple, static heuristics as used by Parquet – such as always encoding
bits for each value instead of 8 bits. However, the basic FOR scheme strings with dictionaries and always bit-packing integers – are not
does not work well with outliers: adding 118 to the example se- capable of exploiting the full compression potential of a particular
quence would require us to use at least 5 bits for each value. Patched dataset. Another approach would be to rely on data statistics. For
FOR (PFOR) thus stores these outliers as exceptions and keeps the formats like Data Blocks [36] a small number of statistics such as
smaller bitwidth for the rest of the values [61]. SIMD-FastPFOR and min, max and unique count are sufficient to select among a small
SIMD-FastBP128 build on this idea and specialize the algorithms set of simple encodings (FOR, dictionary, single value). However,
and layout for SIMD [42]. We use these existing high-performance for more complex encodings, simple statistics are not enough, and
implementations in BtrBlocks. a general solution would require to exhaustively compress the data
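A scalar sketch of FOR combined with Bit-packing (no SIMD and no patching of outliers; the real scheme relies on the SIMD-FastPFOR/SIMD-FastBP128 implementations listed in Table 1):

#include <algorithm>
#include <cstdint>
#include <vector>

struct ForBitPacked {
    uint32_t base;                 // frame of reference
    uint32_t bit_width;            // bits needed per delta after bit-packing
    std::vector<uint32_t> deltas;  // kept unpacked here for clarity
};

ForBitPacked forEncode(const std::vector<uint32_t>& in) {
    if (in.empty()) return {0, 0, {}};
    uint32_t base = *std::min_element(in.begin(), in.end());
    ForBitPacked out{base, 0, {}};
    uint32_t max_delta = 0;
    for (uint32_t v : in) {
        out.deltas.push_back(v - base);
        max_delta = std::max(max_delta, v - base);
    }
    // smallest width that can represent the largest delta
    while (out.bit_width < 32 && (1u << out.bit_width) <= max_delta) out.bit_width++;
    // e.g., {105, 101, 113} -> base 101, deltas {4, 0, 12}, 4 bits each
    // (the text picks base 100 for illustration; any base <= min works)
    return out;
}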
FSST. A large portion of real-world data is stored as strings [33, 49]. Fast Static Symbol Table (FSST) is a lightweight compression scheme for strings [26]. It replaces frequently occurring substrings of up to 8 bytes with 1 byte codes. These codes are tracked in a fixed-size 255 entry dictionary: the symbol table. The symbol table is immutable and used for an entire block of strings. Decompression is simple and therefore fast: FSST uses codes from the compressed input as an index into the symbol table and copies the substring to the output. Compression is more involved because FSST needs to find a good symbol table first. BtrBlocks either uses FSST to compress strings from the input directly or applies it to a dictionary when beneficial.
NULL Storage Using Roaring Bitmaps. BtrBlocks stores NULL values for each column using a Roaring Bitmap [43]. The idea behind Roaring is to use different data structures depending on the local density of bits. This makes it highly efficient for many data distributions [57]. BtrBlocks uses Roaring Bitmaps through an open source C++ library that is optimized for modern hardware [7, 44]. Besides tracking NULL values, we also use Roaring Bitmaps to track exceptions for internal encoding schemes like Frequency Encoding.
Cascading Compression. With FOR + Bit-packing, we mentioned the idea of compressing the output of an encoding with another encoding to further reduce space. This concept has been named Cascading Compression [30]. Damme et al. [31] classify several encoding schemes into logical and physical compression schemes and study how well they combine. They develop a gray-box cost model for integer compression to tackle the problem of choosing good schemes for a given dataset. However, they limit themselves to integer columns and combinations of at most two algorithms (single-level cascade). We present a more generic approach that handles multi-level cascades and includes doubles and strings as well. Additionally, our scheme selection algorithm avoids cost models and opts for an easily-extendible sampling-based approach.

[Figure 2: Choosing a random sample from a column block — the block is divided into non-overlapping parts (Part 0, 1, 2, 3); a 64-value run is taken from a random position within each part, and the concatenated runs form the sample that is test-compressed.]

3 SCHEME SELECTION & COMPRESSION
Scheme selection algorithms. In Section 2.2, we presented encoding schemes for different data types. The effectiveness of these encodings differs strongly depending on the data distribution. Given a set of encodings, we therefore need an algorithm for deciding which encoding is most effective for a particular data block. Simple, static heuristics as used by Parquet – such as always encoding strings with dictionaries and always bit-packing integers – are not capable of exploiting the full compression potential of a particular dataset. Another approach would be to rely on data statistics. For formats like Data Blocks [36] a small number of statistics such as min, max and unique count are sufficient to select among a small set of simple encodings (FOR, dictionary, single value). However, for more complex encodings, simple statistics are not enough, and a general solution would require exhaustively compressing the data with each encoding. Even for a moderate number of encodings, this would be prohibitively slow – even without taking cascading into account, which could increase the search space exponentially.
Challenges. A better approach for encoding selection is to use sampling. For this to work well, the sample must capture the dataset characteristics relevant for compression. Random sampling, for example, may not work well for detecting whether RLE is effective. Simply taking the first k tuples, on the other hand, would result in a very biased sample. Another challenge for the scheme selection algorithm is to take cascading into account, i.e., it must decide whether to encode already encoded data again.
Solution overview. In BtrBlocks, we test each encoding scheme on a sample and select the scheme that performs best. As Section 3.1 describes, our sampling algorithm tries to find a compromise between preserving the locality of neighboring tuples and accurately representing the entire data range. Section 3.2 describes how BtrBlocks integrates cascading with our sample-based scheme selection recursively. Given a block of data to compress, each recursion level executes the following steps:
(1) Collect simple statistics about the block.
(2) Based on these statistics, filter non-viable encoding schemes.
(3) For each viable scheme, estimate the compression ratio using a sample from the data.
(4) Pick the scheme with the highest observed compression ratio and compress the entire block with it.
(5) If the output of the compression is in a compressible format, then repeat from step 1.
[Figure 3: Encoding scheme decision trees that we apply recursively — per-type scheme choices (String: FSST, Dict + FSST, Dictionary, One Value, Uncompressed; Double: Frequency, RLE, Dictionary, Pseudodecimal, One Value, Uncompressed; Integer: RLE, Dictionary, SIMD-FastPFOR, One Value, Uncompressed), with their integer, double and string outputs (codes, string pools, bitmaps, exceptions, lengths, values) as recursion points for further compression, e.g., with Roaring or SIMD-FastBP128.]

3.1 Estimating Compression Ratio with Samples
Choosing samples. To select the best scheme for each block, the sample has to be representative of the data. The main trade-off is between preserving spatial locality in the data while still capturing the distribution of unique values across the input. At the same time, samples have to stay small to keep scheme selection overhead low. As Figure 2 illustrates, we propose to select multiple small runs from random positions in non-overlapping parts of the data. For a chunk size of 64,000 values, we use 10 runs of 64 values each, resulting in a sample size of 1% of the data. We have found this method to yield a good compromise between compression speed and estimation quality, and evaluate this in detail in Section 6.3.
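A sketch of this sampling strategy under the stated defaults (10 runs of 64 values from random positions in non-overlapping parts; the function name and fixed seed are illustrative):

#include <cstdint>
#include <random>
#include <vector>

// Pick `runs` runs of `run_len` consecutive values, one from a random
// offset inside each of `runs` equally-sized, non-overlapping parts.
std::vector<int32_t> sampleBlock(const std::vector<int32_t>& block,
                                 size_t runs = 10, size_t run_len = 64) {
    std::vector<int32_t> sample;
    std::mt19937 gen(42);  // fixed seed only for reproducibility of the sketch
    size_t part = block.size() / runs;
    for (size_t p = 0; p < runs && part >= run_len; p++) {
        std::uniform_int_distribution<size_t> dist(0, part - run_len);
        size_t start = p * part + dist(gen);
        sample.insert(sample.end(), block.begin() + start,
                      block.begin() + start + run_len);
    }
    return sample;  // ~1% of a 64,000-value block with the defaults
}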
Estimating compression ratio. BtrBlocks first collects statistics like min, max, unique count and average run length in a single pass. Based on these statistics, it then applies heuristics to exclude nonviable schemes: It excludes RLE, for example, if the average run length is < 2 and Frequency Encoding if ≥ 50% of values are unique. BtrBlocks then compresses the sample with each viable encoding scheme to estimate the compression ratio of each scheme.
Performance. We evaluated the performance of this method for sampling and compression ratio estimation on real-world data. Our selection algorithm uses only 1.2% of the total compression time while accurately estimating which compression scheme is best.

3.2 Cascading
Recursive application of schemes. After selecting a compression algorithm, the output (or some part of it) may be compressed using a different scheme. This is illustrated in Figure 3, with recursion points denoting an additional possible compression step. The scheme used for the additional step is again selected with our compression ratio estimation algorithm. The maximum number of recursions is a parameter of the compression algorithm, with the default value set to 3. Once this recursion depth is reached, BtrBlocks leaves the data uncompressed.
Cascading compression: An example. Taking an input of doubles [3.5, 3.5, 18, 18, 3.5, 3.5], for example, the sampling algorithm may determine that RLE is a good choice. This produces two outputs: a value array of doubles [3.5, 18, 3.5] and a run length array [2, 2, 2]. Based on the collected statistics, BtrBlocks will decide to compress the run length array using One Value. The value array is also subject to a cascading compression step. Assuming the estimation algorithm chooses Dictionary Encoding, this will yield a code array [0, 1, 0] and a dictionary [3.5, 18]. As the maximum recursion depth is not yet reached, BtrBlocks may decide to apply FastBP128 to the code array in a final step. Decompression works analogously, with each scheme storing what scheme it cascaded into and applying the decompression algorithms in reverse order.
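Written out, the intermediate outputs of this example cascade look as follows (a trace for illustration; the final FastBP128 step and its width are the optional choice mentioned above):

// input (double):            [3.5, 3.5, 18, 18, 3.5, 3.5]
// RLE values (double):       [3.5, 18, 3.5]
// RLE run lengths (integer): [2, 2, 2]
// run lengths -> One Value:  (value 2, count 3)
// values -> Dictionary:      codes [0, 1, 0], dictionary [3.5, 18]
// codes -> FastBP128:        packed with 1 bit per code
// Decompression applies the chain in reverse: unpack the codes, look them
// up in the dictionary, then expand the runs to recover the original column.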
struct RLEData { u8 val_scheme, cnt_scheme, data[] }

double RLE::estimateRatio(Stats& s)
    if (s.average_run_length < 2) return 0
    return estimateFromSamples(s)

u32 RLE::compress(u32* src, u32 cnt, u8* dst, u8 recur)
    RLEData& res = *((RLEData*) dst)
    vector<u32> values, counts
    ...  // <- RLE algorithm, writing to the two vectors
    // cascading compression for values:
    Scheme val_casc = pickScheme(values.data(), cnt, res.data, recur-1)
    res.val_scheme = val_casc.scheme_code()
    u8* cnt_dst = res.data + val_casc.compress()
    // cascading compression for counts:
    Scheme cnt_casc = pickScheme(counts.data(), cnt, cnt_dst, recur-1)
    res.cnt_scheme = cnt_casc.scheme_code()
    return cnt_dst + cnt_casc.compress() - dst

Scheme pickScheme(u32* src, u32 cnt, u8* dst, u8 recur)
    if (!recur) return UNCOMPRESSED
    auto stats = genStats(src, cnt)
    auto scheme = UNCOMPRESSED; double max_cf = -1
    for (auto& sc : pool)
        double est = sc.estimateRatio(stats)
        if (est != 0 && est > max_cf)
            max_cf = est; scheme = sc
    return scheme

Listing 1: Pseudocode of the scheme picking algorithm and RLE as an example for an implemented scheme

Code example. Listing 1 shows a crosscut of the entire cascading compression algorithm for integers, using RLE as an example. The RLE ratio estimation method stops early if the scheme is not feasible; otherwise it uses the sampling algorithm. The displayed part of the RLE compress method shows the recursive calls to the scheme picking algorithm. In this case, there are two recursive calls: one for the values list and one for the run lengths. The scheme picking algorithm simply tests all schemes if the maximum recursion depth is not yet reached.
The encoding scheme pool. The result is a generic, extensible framework for cascading compression that draws from a pool of arbitrary encoding schemes. The scheme pool strongly affects the overall behavior of BtrBlocks: With more schemes, compression becomes slower because more samples have to be evaluated, but the compression ratio increases. Adding more heavyweight schemes may also increase the compression ratio but slows down decompression. We have chosen the set of schemes in BtrBlocks based on our analysis of the diverse set of columns in the Public BI datasets. To build up the encoding scheme pool BtrBlocks uses, we iteratively (1) found columns where its compression ratio was worse than heavyweight schemes like Bzip2, (2) analyzed patterns in the data, (3) added schemes that fit those patterns well and (4) pruned schemes that did not improve compression enough or slowed down decompression. The result is the list of schemes shown in Figure 3.

4 PSEUDODECIMAL ENCODING
Floating-point numbers in relational data. Prior research on floating-point compression in relational databases is very sparse. The lack of interest in floating-point compression schemes has a historic reason: Relational systems usually represent real numbers as Decimal or Numeric, which can physically be stored as integers. However, this is changing with the move to data lakes and the subsequent integration with non-relational systems: Tableau's internal analytical DBMS, for example, encodes all real numbers as floating-point numbers [56], and machine-learning systems rely on floating-point numbers virtually exclusively.
Pseudodecimal Encoding. While some encoding schemes shown in Figure 3 are applicable to all data types, the two bit-packing techniques and FSST are not effective for floating-point numbers. We thus introduce Pseudodecimal Encoding, a compression scheme specifically designed for binary floating-point numbers. We establish the basic idea, the encoding logic and the integration into BtrBlocks in this section, before describing efficient decompression in Section 5. We evaluate the scheme both separately and as part of BtrBlocks as a whole in Section 6.5.

4.1 Compressing Floating-Point Numbers
Challenges. Pseudodecimal Encoding sprang from our analysis of the Public BI Benchmark. We found that double-precision floating-point numbers are frequently used where fixed-precision numbers would be sufficient (and indeed better suited). A common example is storing monetary prices such as $3.25 or $0.99 as floating-point numbers. While such values may appear to be highly compressible, there are two problems: First, their physical IEEE 754 representation (1 sign, 11 exponent and 52 mantissa bits) means that standard techniques such as FOR+Bit-packing are not effective. This is because the most significant bits storing the exponent differ strongly even for numbers that are numerically fairly close (e.g., 3.25 and 0.99). Second, some decimal numbers such as 0.99 cannot be represented precisely in binary; the actual value stored is 0.98999..., which results in periodic and hard-to-compress mantissas like 0xfae147ae147ae. In a lossless compression scheme such as BtrBlocks, decompression has to yield a bitwise-identical output to avoid changing semantics.
Floating-point numbers as integer tuples. As the name suggests, Pseudodecimal Encoding uses a decimal representation for encoding doubles. It does this using two integers: significant digits with sign and exponent. For example, 3.25 becomes (+325, 2), as in 325 × 10^-2. But what happens to a double such as 0.98999..., which is 0.99 stored as a double? Intuitively, one would have to store two integers (98999..., 17) to be able to restore the precise double value later. Surprisingly, storing (99, 2) suffices; this effectively compresses the floating-point value 0x3fefae147ae147ae to a pair of integers (0x63, 0x2). Thus, the compression value of Pseudodecimal Encoding is twofold: First, it strips apart IEEE 754 floating-point values into integers that are more easily compressible. Second, it generates a compact decimal representation for hard-to-compress doubles, which is often what users wanted to store in the first place. To do this it has to find a compact decimal representation, as we describe next.
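A quick way to convince oneself of this round-trip claim (a standalone check assuming IEEE 754 doubles with round-to-nearest arithmetic; not part of BtrBlocks):

#include <cassert>
#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
    double original = 0.99;           // actually stores 0.98999... in binary
    int digits = 99, exp = 2;         // Pseudodecimal representation (99, 2)
    double restored = digits * 0.01;  // 99 * 10^-2, the decompression multiply
    uint64_t a, b;
    std::memcpy(&a, &original, sizeof a);  // compare bit patterns, not values
    std::memcpy(&b, &restored, sizeof b);
    assert(a == b);                        // bitwise-identical round trip
    std::printf("%016llx\n", (unsigned long long) a);  // 3fefae147ae147ae
}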
Encoding Algorithm. The Pseudodecimal Encoding algorithm determines the compact decimal representation by testing all powers of 10 and checking whether any of them correctly multiply the double to an integer value. Listing 2 shows this algorithm adapted for encoding a single double instead of an entire block like in BtrBlocks. We store the inverse powers of 10 in a static table to avoid recomputing them for every number (conceptually, it might be more intuitive to divide by powers of 10, but multiplication is slightly faster than division during decompression). The overloaded double number ±0.0 creates an issue because we encode the sign together with the number as an integer. Thus, the algorithm handles negative zero, as well as other special floating-point numbers like ±Inf and ±NaN, as exceptions. It stores these exceptions separately as patches, together with doubles that it cannot encode as integers, such as 5.5 × 10^-42. We limit the number of bits used for the digits and the exponent to 32 and 5, respectively. These properties ensure that the encoding produces bitwise-identical results.

const unsigned max_exp = 22, exp_exception = 23;
const double frac10[] = {1.0, 0.1, 0.01, ...};
struct Decimal { int digits, exp; double patch; };

Decimal encode_single(const double input)
    int exp; int digits; bool neg = input < 0
    double dbl = neg ? -input : input
    if (input == -0.0 && std::signbit(input))
        goto patch  // -0.0 is an exception
    // attempt conversion: find an exponent that turns dbl into an integer
    for (exp = 0; exp <= max_exp; exp++)
        double cd = dbl / frac10[exp]
        digits = round(cd)
        double orig = ((double) digits) * frac10[exp]
        if (orig == dbl) goto success
    patch:    // return exception marker in the exponent, keep the raw double as patch
        return {0, exp_exception, input}
    success:  // return decimal representation; patch is ignored
        return {(int) digits, exp, 0}

Listing 2: Pseudodecimal Compression algorithm

4.2 Pseudodecimal Encoding in BtrBlocks
Cascading to integer encoding schemes. Pseudodecimal Encoding converts a column of floating-point numbers to two integer columns and a small column of exceptions. BtrBlocks may encode these columns again using cascading compression:

  Input (Double): 0.989..., 3.25, ..., -6.425, 5.5e-42
    -> Significant Digits (Integer): 99, 325, ..., -6425      cascaded with SIMD-FastPFOR
    -> Exponents (Integer):          2, 2, ..., 3, 23=ERR     cascaded with RLE
    -> Patches (Double):             5.5e-42                  Uncompressed

The depicted choices for the cascading compression are examples and not fixed; BtrBlocks chooses the schemes using its sampling algorithm as described earlier.
When to choose Pseudodecimal Encoding. There is data for which Pseudodecimal Encoding is ill-fitted, like columns with many exception values: Pseudodecimal Encoding slightly increases the compression ratio, but decompression is slow because of the many exception values. We thus disable the scheme for columns that have more than 50% non-encodable exception values. Similarly, columns with few unique values usually compress almost as well with dictionaries, which have a much higher decompression speed. In the context of BtrBlocks, we thus choose to exclude Pseudodecimal Encoding for columns with less than 10% unique values.

5 FAST DECOMPRESSION
Decompression speed is vital. Renting compute nodes is one of the main sources of cost in cloud data analytics [41]. Saving cost is therefore best done by reducing the rental time of those nodes. Considering a compression technique, we can do this by (1) reducing network load time with a good compression ratio and (2) reducing compute time with fast decompression. After achieving a good compression ratio with our cascading compression algorithm, we thus turn our attention to decompression throughput.
Improving decompression speed. As Table 1 shows, BtrBlocks uses existing highly-optimized (SIMD) implementations of SIMD-FastPFOR, SIMD-FastBP128, FSST and Roaring. In this section, we describe fast implementations of the other encodings. All presented performance numbers pertain to the Public BI Benchmark datasets discussed in Section 6.1. We measure the performance improvements "end-to-end", meaning for an improved encoding scheme B that is part of the cascade A-B-C, we measure the resulting speedup in decompression across the entire cascade A-B-C.
Run Length Encoding. The standard RLE decompression algorithm replicates the value of a length-N run N times to the output. To vectorize RLE using AVX2, we perform 8 (4) simultaneous replications for integer (double) runs. However, run lengths are often not divisible by 8 (4), which we would need to handle in an expensive branch. We instead opt for writing behind the end of the output buffer in this case. The buffer length is corrected afterwards, as shown on the last line of Listing 3 (top). This gains an average of 76% end-to-end decompression performance for blocks that use RLE at some point in their cascade. Integer columns even decompress 128% faster on average because RLE is commonly chosen by the scheme selection algorithm. String dictionaries often use RLE to compress the code sequence and thus also gain 78% performance on average. Doubles gain 14% performance on average.

void decodeRLEAVX(int* dst, int* runlen, int* values, int runcnt)
    // dst must have >= 32 additional bytes of slack
    for (int run = 0; run < runcnt; run++)
        int* target = dst + runlen[run]
        __m256i vals = _mm256_set1_epi32(values[run])
        for (; dst < target; dst += 8)
            _mm256_storeu_si256((__m256i*) dst, vals)
        dst = target  // the SIMD store may have overshot the run

void decodeDictAVX(int* dst, const int* codes, const int* values, int cnt)
    int idx = 0  // not shown: 4x loop unroll
    if (cnt >= 8)
        while (idx < cnt - 7)
            __m256i vcodes = _mm256_loadu_si256((const __m256i*) codes)
            __m256i vvalues = _mm256_i32gather_epi32(values, vcodes, 4)
            _mm256_storeu_si256((__m256i*) dst, vvalues)
            dst += 8; codes += 8; idx += 8
    for (; idx < cnt; idx++) *dst++ = values[*codes++]

Listing 3: Vectorized RLE and Dictionary decompression

Dictionaries for fixed-size data. The standard decompression algorithm for dictionaries simply scans the code sequence and replaces each code with its value from the dictionary. We can copy 8 integer dictionary entries simultaneously using 8×32 = 256 bit AVX2 vector instructions, as shown in Listing 3 (bottom). Double decoding works analogously with 4 entries. We also manually unroll the loop 4 times for both data types. For any blocks that use Dictionary Encoding in the cascade, we saw an end-to-end speedup of 18% for integer decompression and 8% for double decompression.
String Dictionaries. We avoid copying strings during decompression. Instead, BtrBlocks replaces each code with the string length and the offset (≈ pointer) of the uncompressed string. Offset and length form a fixed-size 64 bit tuple, so we can use the same vectorized algorithm we use for double dictionary decompression. Just by avoiding the string copy, we saw a speedup of more than 10× for some low-cardinality columns. We additionally vectorize dictionary decompression, which yields another 13% end-to-end speedup.
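A scalar sketch of this (offset, length) representation (illustrative only; the packed 64-bit layout and names are assumptions, and BtrBlocks produces these tuples with the vectorized dictionary code from Listing 3):

#include <cstdint>
#include <string_view>
#include <vector>

struct StringDict {
    const char* pool;              // concatenated, uncompressed string contents
    std::vector<uint32_t> offsets; // per dictionary entry: start offset in pool
    std::vector<uint32_t> lengths; // per dictionary entry: string length
};

// Decode codes into fixed-size 64-bit (offset, length) tuples instead of
// copying string bytes; consumers can materialize string views lazily.
void decodeStringDict(const StringDict& d, const uint32_t* codes, size_t cnt,
                      uint64_t* out) {
    for (size_t i = 0; i < cnt; i++) {
        uint32_t c = codes[i];
        out[i] = (uint64_t(d.offsets[c]) << 32) | d.lengths[c];
    }
}

std::string_view materialize(const StringDict& d, uint64_t tuple) {
    return {d.pool + (tuple >> 32), tuple & 0xffffffffu};
}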
Fusing RLE and Dictionary decompression. The scheme selection algorithm often compresses the (integer) code sequence of a dictionary with RLE. It is thus worth optimizing for this case specifically. The standard implementation generated by the cascading algorithm first decodes runs of dictionary codes into an intermediate array and then looks those up in the dictionary. We can fuse these operations and get rid of the intermediate array, instead doing the dictionary lookup first and directly writing runs of (offset, size) pairs. BtrBlocks does this in the vectorized manner discussed previously, but only applies the technique if the average run length is greater than 3 as we have found it to have a negative impact otherwise. This increases the end-to-end decompression performance for string columns using RLE by another 7%.
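A scalar sketch of the fused path (the vectorized version replicates the tuple with the AVX2 code from Listing 3; names and the 64-bit tuple layout are illustrative):

#include <cstdint>
#include <vector>

// Fused RLE + Dictionary decoding for strings: look up each run's code once
// and replicate the resulting (offset, length) tuple run-length times,
// skipping the intermediate array of decoded codes entirely.
void decodeRLEDictFused(const uint32_t* run_codes, const uint32_t* run_lengths,
                        size_t run_cnt, const uint64_t* dict_tuples,
                        std::vector<uint64_t>& out) {
    for (size_t r = 0; r < run_cnt; r++) {
        uint64_t tuple = dict_tuples[run_codes[r]];  // one dictionary lookup per run
        out.insert(out.end(), run_lengths[r], tuple);
    }
}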
FSST. FSST exposes an API for decompressing a single string, taking the encoded string offset and length as an argument [19]. We can use this API to decompress an entire block by simply calling it in a loop for each string in the input data. This, however, moves CPU time out of FSST's optimized decompression loop and into edge-case detection. We can avoid this overhead by passing the offset of the first encoded string and the sum of all string lengths to the decompression API instead. In microbenchmarks, this yielded a reduction of 50 instructions per string, independent of string length. Additionally, we can forgo storing the offsets and lengths of compressed strings; storing uncompressed string lengths suffices.
Pseudodecimal. We implemented the decompression algorithm of our novel double encoding scheme using vector instructions. To reconstruct a double, the decompression simply multiplies the significant digits of each value with the respective exponent. This can be easily vectorized (_mm256_cvtepi32_pd, _mm256_mul_pd), producing blocks of 4×64 bit doubles at once. However, exception values that could not be encoded during compression complicate matters: As explained in Section 4, Pseudodecimal Encoding stores these exceptions separately as patches. The decompression algorithm thus first checks for exceptions in each vectorization block using a Roaring Bitmap. If there are none, it proceeds with vectorized decompression. Otherwise, it falls back to a scalar implementation for the current block and inserts any patch values into the output.
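A scalar sketch of this patched decoding logic (a plain bitset stands in for the Roaring bitmap BtrBlocks uses, and per-vector-block dispatch is omitted):

#include <cstddef>
#include <vector>

// Reconstruct doubles from (digits, exponent) pairs, taking patch values for
// rows marked as exceptions. frac10[e] holds the inverse powers of 10.
void decodePseudodecimal(const int* digits, const int* exponents,
                         const std::vector<bool>& is_exception,
                         const double* patches, const double* frac10,
                         double* out, size_t cnt) {
    size_t next_patch = 0;
    for (size_t i = 0; i < cnt; i++) {
        if (is_exception[i])
            out[i] = patches[next_patch++];            // scalar patch path
        else
            out[i] = digits[i] * frac10[exponents[i]]; // digits * 10^-exp
    }
}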
6 EVALUATION
Test setup. We execute all experiments on a c5n.18xlarge AWS EC2 instance. Previous work suggests that c5n is a good instance for analytics in the cloud, primarily because of its 100 Gbps networking [41]. It runs an Intel Xeon Platinum 8000 series (Skylake-SP) CPU with 36×3.5 GHz cores (72 threads), offers the AVX2 and AVX512 instruction sets and has 192 GiB of memory. Code is compiled with GCC 10.3.1 on Amazon Linux 2, kernel version 5.10. We use the TBB library [16] for parallelization and disable hyperthreading. Our benchmarks allocate and touch all memory beforehand to avoid page faults. We repeat all measurements and average the results to minimize the effects of caching and CPU frequency ramp-up.
Parquet test setup. For generating Parquet files, we tested both the Apache Arrow (pyarrow 9.0.0) and the Apache Spark (pyspark 3.3.0) libraries. The only parameter change we made was setting the rowgroup size in Apache Arrow to 2^17 because we found that to be fastest. We implemented the actual benchmarks consuming the generated Parquet files with the Arrow C++ library. This library offers a high-level API based on Arrow constructs and a low-level API that uses Parquet directly. The high-level interface was significantly slower in our tests, so we chose the low-level API in all tests. We parallelized decompression over both rowgroups and columns.

6.1 Real-World Datasets
Synthetic data. Analytical benchmarks such as TPC-H and TPC-DS have proven useful for evaluating both traditional and cloud-native query engines [55]. However, it is also well-known that their data generation algorithms do not necessarily produce realistic data distributions [33, 40, 56]. Assumptions like complete data normalization, uniform and independent distributions, or most of the data being integers do not reflect typical real-world data – particularly in data lakes. We therefore argue that compression algorithms should be evaluated using real-world rather than synthetic datasets.
The Public BI Benchmark. The large real-world collection of datasets we chose to focus on is the Public BI Benchmark [33]. It contains datasets derived from the 46 largest Tableau Public workbooks at the time of creation [56]. We thus expect its contents to be more representative of what one might find in today's large data lakes: data skew, denormalized tables, misused data types (e.g., proliferation of strings) and non-uniform NULL representations resulting from the variety of heterogeneous data sources. Additionally, Tableau stores decimal values as floating-point numbers – a data type which we found to be frequently underrepresented in compression literature [56] and which is becoming more important due to the proliferation of machine learning. To get a better understanding of the Public BI Benchmark and its effect on compression performance, we first take a closer look at its datasets.
Table 2: Public BI Benchmark (PBI) and TPC-H: Comparison of data types by volume (share) and compression ratio (compr.)

                  |        String         |        Double         |        Integer        |   Combined
                  |    PBI    |   TPC-H   |    PBI    |   TPC-H   |    PBI    |   TPC-H   |  PBI  | TPC-H
                  |share compr|share compr|share compr|share compr|share compr|share compr| compr | compr
                  | [%]   [×] | [%]   [×] | [%]   [×] | [%]   [×] | [%]   [×] | [%]   [×] |  [×]  |  [×]
  Uncompressed    |71.5    —  |61.7    —  |14.4    —  |19.5    —  |14.1    —  |18.7    —  |   —   |   —
  Parquet         |51.0  7.10 |64.1  1.63 |36.6  1.99 |14.0  2.35 |12.5  5.73 |21.9  1.45 | 3.37  | 1.69
  Parquet+LZ4     |39.8 12.05 |46.8  3.65 |44.6  2.16 |18.4  2.94 |15.6  6.07 |34.8  1.49 | 4.72  | 2.77
  Parquet+Snappy  |39.3 12.23 |45.0  3.92 |44.9  2.15 |19.2  2.91 |15.7  6.05 |35.8  1.49 | 4.79  | 2.85
  Parquet+Zstd    |33.6 17.13 |40.0  5.27 |50.1  2.30 |23.3  2.87 |16.3  6.97 |36.7  1.74 | 6.05  | 3.41
  BtrBlocks       |43.6 11.32 |54.9  4.26 |41.9  2.36 |16.2  4.58 |14.5  6.70 |28.9  2.46 | 5.28  | 3.79
  Average         |  —  10.14 |  —   3.29 |  —   1.99 |  —   2.78 |  —   5.42 |  —   1.60 | 4.20  | 2.90

Public BI vs. TPC-H. Table 2 outlines the differences between a real-world dataset and a generated dataset by comparing the Public BI datasets with TPC-H data. We do this for each data type separately. Because TPC-H can be generated on different scale factors, we use the relative data volume of each data type as a metric instead of an absolute amount. In addition to the uncompressed format, for which we use our in-memory columnar binary representation, we convert each dataset to Parquet using multiple compression schemes, as well as BtrBlocks. We then reexamine the data volume of each data type in the compressed formats, yielding a compression ratio per data type and dataset. In the following, we describe our observations about the differences between the Public BI Benchmark and TPC-H in more detail.
Public BI vs. TPC-H: Strings. As Table 2 shows, the Public BI Benchmark consists of 71.5% strings, compared to TPC-H with 61.7%. Additionally, many strings in the Public BI Benchmark are structured, like URLs and product identifiers with common prefixes. In contrast, the largest strings in TPC-H – the comment columns – are random samples from a pool of test data. This has a large impact on compression performance: Where the average compression ratio for strings in the Public BI Benchmark is 10.2× across all measured formats, it is only 3.3× in TPC-H.
Public BI vs. TPC-H: Doubles. Doubles make up 19.5% of the data volume in TPC-H, but only 14.3% in the Public BI Benchmark. Across all tested compression schemes, doubles compress with a ratio of 1.99 in the Public BI Benchmark and 2.78 in TPC-H on average. The most likely reason for this is the numeric ranges: Double columns in TPC-H usually contain price data from one size range. They are thus better suited for compression, especially with Pseudodecimal Encoding introduced in Section 4.
Public BI vs. TPC-H: Integers. TPC-H consists of 18.7% integers by volume, compared to the Public BI Benchmark with 14.1%. Additionally, integers in the Public BI Benchmark compress with an average factor of 5.4 across all measured formats, and only 1.6 in TPC-H. This effect stems mainly from the unrealistically normalized data TPC-H contains: Most integers are unique keys or foreign keys, and few columns contain runs or repeating patterns. In contrast, the Public BI Benchmark contains denormalized tables where joins result in runs and repeating patterns. This is clearly visible, for example, in samples from the largest two Public BI datasets [9, 10]. Extreme cases like the all-zero integer column "RealEstate1/New Build?" shown in Table 4 are also missing from TPC-H.
Adapting for evaluation. We use a subset of the datasets included in the Public BI Benchmark and adapt them to our use case. From each dataset, we only use the largest table: for example, we only use TrainsUK1_Table4 from the TrainsUK1 dataset. We do this because the tables in each dataset are often derived from each other and thus very similar; using only one table per dataset prevents over-representing the data mix of larger datasets. Due to their negligible sizes, we also exclude the datasets IUBLibrary, IGlocations and Hatred1 as well as the date and timestamp columns (which can be represented as integers). This adapted subset of the Public BI Benchmark totals 119.5 GB of binary data when loaded into memory. It contains 43 tables, each containing between 6 and 519 columns, 57 on average. Overall, there are 2451 columns with diverse data shapes and distributions. Even though this makes the Public BI Benchmark much more suitable for designing and evaluating a compression scheme, we also perform experiments with TPC-H so that BtrBlocks can be more easily compared with related work.

[Figure 4: Compression ratio and decompression throughput changes when successively enabling techniques, by data type. One panel shows compression ratio and one shows single-threaded decompression speed [GB/s], each for double, integer and string columns; the techniques added step by step include One Value, RLE, Dictionary, FastPFOR, FastBP128, Pseudodecimal, Frequency, Raw FSST and FSST Dict.]

6.2 The Compression Scheme Pool
Measuring the impact of individual techniques. The list of encoding techniques BtrBlocks tests with each cascading step forms a trade-off between compression ratio and decompression speed. We evaluated this by successively adding techniques to the pool and measuring the resulting compression ratio and decompression speed. Figure 4 shows one sequence of technique additions for each data type. For this experiment, we use a single thread for decompression to avoid measuring noise created through concurrency.
Impact on compression ratio. For doubles, Dictionary Encoding and Pseudodecimal Encoding have the largest impact with a 95% and 20% respective improvement. Still, as expected, doubles are inherently less compressible than integers and strings. We achieve the best average compression ratio on strings, where Dictionary Encoding yields the largest improvement (7×). Using FSST to compress an existing dictionary improves the compression ratio by another 51%. FSST applied to raw data slightly improves compression ratio and decompression speed. One Value barely increases the average compression ratio, but has a large impact on some columns both in compression ratio and speed (cf. Table 4).
Impact on decompression speed. One Value is also fastest in terms of decompression for doubles and integers, yielding an average respective throughput of 8.9 and 11.8 GB/s. For string decompression, Dictionary Encoding increases throughput from 9.4 GB/s to 19.6 GB/s. This is because in BtrBlocks, Dictionary Encoding only decompresses the code sequence into pointers to the dictionary contents and can forgo copying strings.
6.3 Sampling Algorithm
Sampling research questions. Accurately estimating the compression ratio for different schemes requires choosing a good sampling algorithm. We do this by answering two research questions:
(1) Given a fixed sample size, what is the best sampling strategy?
(2) How does sample size relate to scheme selection accuracy?
We score sampling strategies based on the percentage of correctly selected schemes, which we compute as follows: We compress the first block (64k tuples) of every column in the Public BI Benchmark using every scheme, including cascades, and determine the scheme with the best compression ratio: the optimal scheme. We do the same again for each sampling strategy, compressing the sample instead of the entire block. If a sampling strategy chooses the optimal scheme or a scheme at most 2% worse than the optimal, we consider the scheme choice to be correct (allowing almost-optimal schemes filters cases where two scheme cascades compress the same data almost equally well, e.g., Dict→RLE vs. RLE→Dict).
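One plausible reading of this scoring rule as code (a sketch with invented names; "at most 2% worse" is interpreted relative to the optimal compression ratio):

#include <cstddef>

// A sampling strategy's choice counts as correct if compressing the full
// block with the chosen cascade achieves at least 98% of the compression
// ratio of the optimal cascade for that block.
bool isCorrectChoice(double chosen_ratio, double optimal_ratio) {
    return chosen_ratio >= 0.98 * optimal_ratio;
}

double correctChoiceRate(const double* chosen, const double* optimal, size_t n) {
    size_t correct = 0;
    for (size_t i = 0; i < n; i++)
        if (isCorrectChoice(chosen[i], optimal[i])) correct++;
    return n ? 100.0 * correct / n : 0.0;
}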
[Figure 5: Correct scheme choices per strategy (N = 640). Strategies range from 640×1 (individual random tuples) over 320×2, 80×8, 40×16, 10×64 and 5×128 to 1×640 (a single tuple range); the y-axis shows the percentage of correct choices.]

Best strategy for a fixed sample size. Figure 5 shows the percentage of correctly selected schemes for different sampling strategies that always sample 640 tuples (1% of a 64k block). It includes extreme cases like sampling random individual tuples or choosing a single tuple range, which perform worst. The main takeaway is that sampling multiple small chunks across the entire block improves accuracy compared to other strategies, though there is little difference between strategies that choose chunks of ≥ 16 tuples. This confirms the intuition that the sample needs to capture both data locality and data distribution across the entire block.
Impact of sample size. We now want to evaluate the impact of the overall sample size on compression ratio. Figure 6 shows the loss in compression ratio compared to the best possible cascade for different sample sizes. Larger samples yield a better compression ratio at the cost of exponentially growing CPU overhead.

[Figure 6: Public BI compressed size for different sample sizes. The x-axis shows the sampled share of tuples (0.1% to 100%, log scale) for strategies from 10×8 up to 10×4096 and the entire block; the y-axis shows data size relative to the optimum (+1% to +8%).]

Sampling in BtrBlocks. For BtrBlocks, we thus choose to sample 10 × 64 tuples (1% of each block) by default. This takes up 1.2% of CPU time during compression and results in 77% correct scheme choices. With these choices, BtrBlocks compresses only 3.3% worse than the optimum on average.

6.4 Compression
Compression ratio. We designed BtrBlocks with relational data in mind, e.g., storing aligned columns that form tuples. We thus compared its compression ratio with four relational column stores on the Public BI datasets. These systems base their compression on internal proprietary formats. To show as complete a picture as possible, we also added the most popular open source format, Apache Parquet, to the comparison. Parquet provides different built-in high-level compression options. Figure 7 displays the combined results, showing that BtrBlocks beats every format except the heavyweight Zstd compression on Parquet.

[Figure 7: Public BI compression ratios for proprietary column stores (A–D), Parquet (uncompressed, LZ4, Snappy, Zstd), BtrBlocks and the uncompressed baseline.]

Compression speed. Starting from a CSV file, compression with both Parquet and BtrBlocks consists of two steps: (1) Convert the CSV file to an in-memory format and (2) convert the in-memory format to the compressed final form. Our single-threaded compression speed results are similar to Parquet, both beginning from CSV and the in-memory format:

                   From CSV    From binary   Compr. Factor
  BtrBlocks        38.2 MB/s   75.3 MB/s     7.06×
  Parquet+Snappy   38.0 MB/s   41.9 MB/s     6.88×
  Parquet+Zstd     37.3 MB/s   41.0 MB/s     8.24×

6.5 Pseudodecimal Encoding
Evaluation outside of BtrBlocks. Pseudodecimal Encoding is a novel double compression scheme we designed based on our observations about data in the Public BI Benchmark. To assess its effectiveness, we want to measure its compression factor outside of BtrBlocks. However, similar to FOR, Pseudodecimal Encoding does not usually reduce data size on its own; instead, it prepares the data for compression with another scheme like Bit-packing or RLE. This makes Pseudodecimal Encoding a good fit for the cascading compression applied by BtrBlocks, but it also complicates a standalone evaluation because the compression cascade conflates measurements from all used schemes.
SIGMOD ’23, June 18–23, 2023, Seattle, WA Maximilian Kuschewski, David Sauerwein, Adnan Alhomssi, & Viktor Leis

Table 3: Compression Ratios of Pseudodecimal Encoding


(PD), other double schemes (Public BI, large double columns) Public BI

Column FPC Gorilla Chimp Chimp128 PDE


CommonGov./10 1.2 1.1 1.5 1.9 1.8
CommonGov./26 15.1 48.0 28.0 6.9 75.0
CommonGov./30 6.4 7.0 7.6 5.0 7.8
CommonGov./31 9.3 14.3 13.3 5.6 23.4
CommonGov./40 14.3 38.0 25.0 6.7 54.6
Arade/4 .95 1.1 1.2 1.6 1.9
NYC/29 1.5 2.1 2.5 1.7 1.0
CMSProvider/1 1.5 1.7 1.8 2.4 1.6
CMSProvider/9 2.7 2.3 3.4 2.4 6.6 TPC-H
CMSProvider/25 .98 .98 1.1 1.2 1.0
Medicare/1 1.2 1.4 1.5 2.0 1.5
Medicare/9 2.6 2.3 3.4 2.3 6.3

two-level cascade: We first compress data using Pseudodecimal


Encoding and then always compress the output with FastBP128.
Comparing to existing double schemes. We first compare Pseudodecimal Encoding to the well-known existing double compression schemes FPC [28] and Gorilla [51], and the recently proposed Chimp and Chimp128 [46]. Table 3 shows the compression ratio of these schemes on the largest non-trivial (i.e., more than one distinct value) Public BI double columns. Pseudodecimal Encoding (PDE) does not compress columns with high-precision values well, like the longitude coordinate values in NYC/29. However, it often outperforms other schemes on columns with less precision, like the abundant pricing data columns.

Effectiveness inside BtrBlocks. In order to provide a benefit as part of the scheme pool in BtrBlocks, Pseudodecimal Encoding also has to outperform general-purpose schemes like Dictionary Encoding and RLE. We compare these schemes by again applying a fixed two-level cascade, where the output of each scheme is always compressed with FastBP128. We also include non-cascading FastBP128 to check our reasoning that Bit-packing (BP) should rarely be effective on IEEE 754 floating-point values:

Column   BP     Dict.  RLE    PDE
Gov./10  .99    1.6    1.0    1.8
Gov./26  60.9   4.4    187    75.0
Gov./30  4.7    2.9    6.9    7.8
Gov./31  12.2   4.5    15     23.4
Gov./40  38.1   4.4    91.5   54.6
Arade/4  .99    1.3    .96    1.9
NYC/29   1.1    2.5    1.6    1.0
CMS./1   .99    1.6    1.5    1.6
CMS./9   1.0    5.6    .99    6.6
CMS./25  .99    .88    .97    1.0
Medi./1  .99    1.6    1.2    1.5
Medi./9  1.0    5.4    .99    6.3

Pseudodecimal Encoding (PDE) loses on columns with large runs or few unique values, where RLE and Dictionary Encoding are best. But there are columns where Pseudodecimal Encoding offers a significant benefit over other schemes. We thus believe it to be a valuable addition to the BtrBlocks encoding scheme pool.

6.6 Decompression

Open source formats. We compared our compression ratio with proprietary systems in Section 6.4. However, these systems do not allow us to introspect compression and decompression time independent of other system parts. In the following, we thus focus on the widely used open source formats Parquet and ORC. We described our Parquet configuration at the beginning of Section 6.

ORC test setup. We generated Apache ORC files using the Apache Arrow library (pyarrow 9.0.0). Using default settings, ORC files tended to grow large, preventing parallelism. We thus changed the dictionary_key_size_threshold parameter from the default (0) to the default of Apache Hive (0.8). We changed the LZ4 compression strategy from the default (SPEED) to COMPRESSION for the same reason. Changing the stripe size – the equivalent of the rowgroup size for Parquet – did not change the performance in our multi-threaded tests, so we kept the default value. The actual benchmarks use the ORC C++ library, which cannot read files directly from memory. For a fair comparison, we implemented a custom variant of orc::InputStream that reads directly out of an in-memory buffer. Like with Parquet, we parallelized by both stripes and columns.

[Figure 8: Compression ratio vs. in-memory decompression bandwidth on the Public BI Benchmark (top) and TPC-H (bottom) for Parquet, ORC and BtrBlocks on c5n.18xlarge]

In-memory Public BI decompression throughput. Figure 8 (top) shows our results for the datasets we selected from the Public BI Benchmark as described in Section 6.1. We plot the compression ratio against decompression throughput (i.e., uncompressed size / decompression time) for Parquet, ORC and BtrBlocks. While Zstd compression is better with both Parquet and ORC in terms of compression ratio, BtrBlocks is superior in terms of decompression speed. It decompressed 2.6×, 3.6× and 3.8× faster than Parquet, Parquet+Snappy and Parquet+Zstd on average, respectively.
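As a concrete illustration of the ORC writer configuration described above, the following sketch shows how such files could be produced with pyarrow. The keyword arguments mirror the ORC writer options named in the text, but whether these exact keywords are exposed in a given pyarrow release is an assumption to check against the pyarrow.orc.write_table documentation; this is not the benchmark script actually used.

    import pyarrow as pa
    import pyarrow.orc as orc

    table = pa.table({"city": ["BETHESDA", "ATHENS", "ATHENS", "PHOENIX"] * 1000})
    orc.write_table(
        table,
        "example.orc",
        compression="lz4",
        compression_strategy="compression",   # assumed keyword; default strategy is "speed"
        dictionary_key_size_threshold=0.8,    # assumed keyword; pyarrow default is 0, Hive default 0.8
    )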

Table 4: Compression ratios and decompression throughput on a random Public BI Benchmark column sample

Dataset/Column  Type  Size [MB]  Decomp. Speed [GB/s] (Btr, Zstd ↓)  Compr. Ratio [×] (Btr, Zstd)  Scheme (Root)  Value Example
SalariesFrance/LIBDOM1 string 34.0 70.2 10.4 1,862.6 3,068.1 Dictionary null,null
MulheresMil/pcd string 59.9 20.3 3.7 240.5 418.7 Dictionary “”,””,””
Redfin2/property_type string 24.7 33.6 3.5 1,262 1,598.5 Dictionary All Residential
Motos/Medio string 94.2 30.8 2.4 5,048.8 2,504.1 OneValue CABLE,CABLE,. . .
NYC/Community Board string 89.9 15.0 1.6 8.0 13.6 Dict+FSST 01 BRONX,04 BRONX
PanCreactomy1/N[. . . ]STREET1 string 149.2 17.1 1.4 5.2 7.9 Dict+FSST 5777 E MAYO BLVD
Provider/nppes_provider_city string 77.9 12.4 1.3 5.2 6.6 Dict+FSST null,BETHESDA,ATHENS
PanCreactomy1/N[. . . ]CITY string 77.9 11.8 1.2 5.1 7.7 Dict+FSST null,PHOENIX,RALEIGH
Uberlandia/municipio_da_ue string 22.7 2.7 0.9 10.4 28.5 Dictionary Maceió,Curitiba,Curitiba
RealEstate1/New Build? integer 74.5 26.6 3.1 13,055.7 1,653.5 OneValue 0,0,0,. . .
Medicare1/TOTAL_DAY_SUPPLY integer 33.0 7.1 1.0 2.4 2.2 FastPFOR 26994,18930,7691
Uberlandia/cod_ibge_da_ue integer 28.8 3.5 0.8 3.0 3.5 FastPFOR 2704302,3547304,1200203
Eixo/cod_ibge_da_ue integer 28.8 4.0 0.8 3.0 3.5 FastPFOR 2704302,3547304,1200203
Telco/CHARGD_SMS_P3 double 22.2 5.8 2.0 11.5 14.0 Dictionary 0,0,0
Telco/TOTA_OUTGOING_REV_P3 double 22.2 6.6 1.8 10.5 13.8 Dictionary 0,0,0
Telco/RECHRG[. . . ]USED_P1 double 22.2 2.3 1.7 4.4 5.9 Frequency 83.2833,3.05,9.5999
Motos/InversionQ double 107.6 11.0 1.3 4.6 6.8 Dictionary 0,0,0
Telco/TOTAL_MINS_P1 double 22.2 3.1 0.7 2.7 2.4 Pseudodec. 0,0,0
Redfin4/median_sale_price_mom double 24.9 4.3 0.7 1.3 1.7 Dictionary null,null,null

Decompression of Parquet vs. ORC. Interestingly, every Parquet variant performs better than its ORC counterpart in terms of decompression speed: Uncompressed ORC is 4.1× slower to decode than uncompressed Parquet as measured on the Public BI Benchmark. For Snappy and Zstd, the respective factors are 4.2× and 2.4×. The difference in compression ratio for the compressed variants of both formats is at most 8%, even though ORC without compression is 28% larger than uncompressed Parquet.

Per-column performance. Table 4 facilitates more low-level insights on how the compression ratio and decompression speed of BtrBlocks compare to Parquet+Zstd. It shows metrics for a random sample of Public BI columns and lists the encoding scheme that BtrBlocks used for the first cascading step of the first block. BtrBlocks outperforms Parquet+Zstd in terms of decompression speed, and comes close in terms of compression factor. The table also shows a sample from the first 20 entries of each column [17], which may not be representative of the data distribution in the entire column. This illustrates the necessity of a well-crafted sampling algorithm for deciding on encoding schemes.

In-memory TPC-H decompression throughput. We performed another decompression experiment with data from TPC-H and show the results in Figure 8 (bottom). The average decompression throughput of all schemes is lower on TPC-H because it compresses worse. Still, BtrBlocks decompresses 2.6×, 3.9× and 4.2× faster than Parquet, Parquet+Snappy and Parquet+Zstd, respectively.

6.7 End-to-End Cloud Cost Evaluation

Is Parquet decompression fast enough? Slow decompression in network scans translates to a higher query execution time and thus higher query costs. Looking at Figure 8, however, every Parquet variant achieves an in-memory decompression throughput of over 50 GB/s. With the 100 Gbit (= 12.5 GB/s) networking of c5n.18xlarge, it seems like network bandwidth is the bottleneck and scans cannot benefit from faster decompression. This, however, is a false conclusion stemming from the definition of decompression throughput.

Decompression throughput and network bandwidth. Decompression throughput is usually measured using the uncompressed data size, i.e., 𝑇𝑢 = uncompressed size / decompression time. This is the metric that Figure 8 shows and it is the relevant metric for the data consumer. But when loading data over a network, decompression throughput has to be higher than the network throughput in terms of compressed data size. Otherwise, the network bandwidth is not yet fully exploited and decompression is CPU-bound. We thus introduce another metric for decompression throughput, 𝑇𝑐 = compressed data size / decompression time, essentially dividing 𝑇𝑢 by the compression factor. We will see how this impacts the scan cost in our end-to-end cloud cost evaluation.

Measuring end-to-end cost. Because what matters in the end for analytical processing in the cloud is cost, we explicitly evaluate the cost savings BtrBlocks brings. For scans from S3, this cost consists of two parts:
• We need an EC2 compute instance to load data to, which has an hourly rate of $3.89 in the case of our test instance c5n.18xlarge [4, 18].
• Every 1,000 GET requests to S3 cost $0.0004; the amount of data returned by each request is irrelevant.
Thus, to compute the cost of a scan, it suffices to count the number of requests and measure the scan duration. The S3 performance guidelines recommend fetching 8 MB or 16 MB chunks per request for maximum throughput [5]; we chose 16 MB chunks for this experiment. Consequently, one S3 chunk consists of multiple BtrBlocks blocks that add up to 16 MB or slightly less. Parquet data is generated by Apache Spark, which splits it into multiple files by default.
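The following worked example applies the two throughput metrics and the cost model just described. The EC2 and S3 prices are the ones quoted above; the dataset size, compression factor and timings are made up purely for illustration.

    EC2_PER_HOUR = 3.89        # c5n.18xlarge on-demand rate quoted above
    GET_PER_1000 = 0.0004      # S3 GET request price quoted above
    CHUNK_MB = 16              # request size recommended by the S3 guidelines

    uncompressed_gb = 100.0    # illustrative values, not measurements
    compression_factor = 4.0
    decompression_s = 10.0
    network_gb_s = 12.5        # 100 Gbit/s

    compressed_gb = uncompressed_gb / compression_factor
    T_u = uncompressed_gb / decompression_s   # what Figure 8 reports
    T_c = compressed_gb / decompression_s     # what must exceed the network bandwidth

    scan_s = max(decompression_s, compressed_gb / network_gb_s)  # CPU-bound if T_c < network_gb_s
    requests = compressed_gb * 1024 / CHUNK_MB
    cost = scan_s / 3600 * EC2_PER_HOUR + requests / 1000 * GET_PER_1000
    print(f"T_u={T_u:.1f} GB/s  T_c={T_c:.1f} GB/s  scan cost=${cost:.4f}")

With these made-up numbers the scan is CPU-bound (𝑇𝑐 = 2.5 GB/s is below the 12.5 GB/s network), so the instance-time term dominates the request cost.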

We have no control over the size of these files, but they usually range from 5.5 to 24 MB. Some of the datasets from the Public BI Benchmark are too small to get a useful throughput measurement for, so we exclude tables that have a CSV file size of less than 6 GB.

Table 5: S3 Scan Cost on the largest 5 Public BI workbooks

Format      S3 𝑇𝑢 [GB/s]   S3 𝑇𝑐 [Gbit/s]   Scan cost [$]   Normalized Cost [×]
BtrBlocks   174.6          86.2             0.97            1.00
Parquet     56.1           52.6             2.47            2.61
+Snappy     77.6           33.2             1.74            1.84
+Zstd       78.6           24.8             1.70            1.77

End-to-end cost test setup. Our benchmark uses the S3 C++ SDK [6] to load compressed chunks of various formats from S3 and then decompresses them in-memory like a query processing engine might. We implement our own memory pool on top of the abstractions provided by the S3 SDK in order to measure the raw decompression speed without the inefficient stream implementations the SDK provides by default. We map threads to chunks returned by S3 one-to-one because this turned out to be the most efficient technique. The requests themselves are issued asynchronously and then added to a global work queue to achieve maximum throughput.

Loading individual columns. OLAP queries rarely read entire tables, but instead select individual columns across one or many tables. Our first experiment thus loads individual columns from S3 and decompresses them. We choose the columns using random queries from the five largest Public BI datasets, i.e., our benchmark only fetches columns that a given query scans. We find that BtrBlocks scans are 9× cheaper than the compressed Parquet variants and 20× cheaper than uncompressed Parquet, on average. We also measured the cost for loading columns from all 22 TPC-H queries. In TPC-H, Parquet is 5.5×, Parquet with Snappy 3.6× and Parquet with Zstd 2.8× more expensive than BtrBlocks on average.

Cost comparability. However, we do not think this experiment represents the contributions of BtrBlocks particularly well because a different factor causes the high performance difference we measured. Parquet bundles multiple columns into one file and stores column offsets in a metadata footer at the end of the file. Thus, to load a single column in Parquet, a client has to perform three separate but dependent requests to S3: fetch the metadata length, fetch the metadata, fetch the partial file containing the column [54]. The alternative is loading the entire file and then decompressing the column locally, which we often found to be faster. In contrast, the BtrBlocks S3 metadata implementation uses one file per column and bundles metadata for the entire table in a separate file. But metadata handling is not an issue we are trying to address with BtrBlocks; in fact, we argued that metadata is orthogonal and should be handled separately in Section 2. We thus perform a different experiment for comparing the scan cost with Parquet.
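To make the three dependent requests concrete, the sketch below mimics the access pattern against an in-memory buffer. It follows the Parquet footer layout (a 4-byte metadata length followed by the PAR1 magic at the end of the file) but substitutes JSON for the Thrift metadata and a byte-slice helper for a ranged S3 GET, so it illustrates the request pattern rather than being a real Parquet reader.

    import json, struct

    def make_fake_parquet(columns):
        body = b"".join(columns.values())
        offsets, pos = {}, 0
        for name, data in columns.items():
            offsets[name] = (pos, len(data))
            pos += len(data)
        footer = json.dumps(offsets).encode()                 # stand-in for Thrift metadata
        return body + footer + struct.pack("<I", len(footer)) + b"PAR1"

    def ranged_get(obj, start, end):                          # stand-in for an S3 ranged GET
        return obj[start:end]

    def read_column(obj, name):
        tail = ranged_get(obj, len(obj) - 8, len(obj))        # request 1: metadata length + magic
        footer_len = struct.unpack("<I", tail[:4])[0]
        assert tail[4:] == b"PAR1"
        footer_start = len(obj) - 8 - footer_len
        footer = json.loads(ranged_get(obj, footer_start, len(obj) - 8))   # request 2: metadata
        offset, length = footer[name]
        return ranged_get(obj, offset, offset + length)       # request 3: the column chunk itself

    obj = make_fake_parquet({"price": b"\x01" * 64, "city": b"\x02" * 32})
    assert read_column(obj, "city") == b"\x02" * 32

Because each request depends on the previous response, the three round trips cannot be overlapped.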
Loading entire datasets. Instead of loading individual columns, we now load entire datasets from S3 and measure the combined compute instance and request cost. For this experiment, we can forgo loading metadata and just load whole files instead. The measured difference in cost can thus be entirely attributed to the novel compression scheme BtrBlocks introduces. As with the previous experiment, we used the five largest datasets from the Public BI Benchmark. We load each dataset 10,000 times and average the measured cost and throughput to get rid of network effects.

Cost of loading full datasets. Table 5 shows that BtrBlocks loads these datasets 2.6× cheaper than uncompressed Parquet and 1.8× cheaper than Parquet with Zstd/Snappy on average. No Parquet-based format can exploit the network bandwidth; BtrBlocks almost does at 𝑇𝑐 = 86 Gbps, which is close to the throughput our S3 client achieves with uncompressed data at 91 Gbps. This reaffirms the importance of 𝑇𝑐 as a measurement of decompression throughput when loading data over the network; Figure 1 further illustrates this point. Considering this benchmark does not include any CPU time for query processing, we can expect the cost difference in an actual OLAP system to be even higher.

6.8 Result Discussion

Is BtrBlocks only fast because of SIMD? Section 5 describes low-level decompression optimizations that BtrBlocks includes, most of which use SIMD and often improve performance substantially. One might deduce that BtrBlocks decompresses so much faster than existing formats solely because of these low-level optimizations, not because of its high-level design. If this were the case, we could simply improve the implementation of Parquet instead of designing a new format. We checked this by implementing scalar versions of every decompression algorithm in BtrBlocks. Running the experiments from Section 6.6 again, in-memory decompression is slowed down by 17%. This, however, is still 2.3× faster than the fastest Parquet variant. We conclude that substantially improving Parquet requires more than low-level optimizations such as SIMD.

Update the standard or create a new format? Yet, improving existing widespread formats such as Parquet is more desirable than creating a new data format: For users, there would be no costly data migration, no breaking changes and fast decompression just by updating a library version. Unfortunately, our experiments indicate that low-level improvements are not enough, and integrating larger parts of BtrBlocks – such as new encodings and cascading compression – into Parquet will cause version incompatibilities. Such a “Parquet v3” would not share much with the original besides the name, with no actual benefit to existing users of Parquet. Instead, we have open-sourced BtrBlocks and hope that compatible improvements will find their way into Parquet, while also building a new format based on BtrBlocks that is independent of Parquet.

7 RELATED WORK

Columnar Compression. There is a large body of work on columnar compression in databases [21, 22, 61]. Below, we discuss a selection that relates most closely to BtrBlocks.

SQL Server. With the introduction of column store indexes, SQL Server offers an optional column-based storage layout [38]. It divides data into aligned row groups, each of which contains segments of columns. With column store indexes, SQL Server also adds columnar compression. The system compresses each column segment individually in three steps: (1) encode everything as integers, (2) reorder rows inside each row group and (3) compress each column. During the encoding step, SQL Server translates strings to integers using Dictionary Encoding. In more recent work, it optimizes the resulting dictionaries further by keeping short strings instead of translating them to 32-bit integers [37]. Numeric types are encoded as integers by finding the smallest common exponent in each segment and multiplying with it. For integer types, SQL Server strips common leading zeros in each segment and then applies FOR encoding to reduce the data range. After encoding, the system reorders rows inside each row group to optimize for encoding using RLE. Finally, it compresses either using RLE or Bit-packing. How exactly SQL Server chooses which scheme to use is not published. From the evaluation using Microsoft-internal datasets, this compression technique achieves a weighted average compression factor of 5.1×.

DB2 BLU. Like SQL Server did with column store indexes, IBM added a column-based storage layout to DB2 with DB2 BLU [53]. Unlike SQL Server, BLU stores multiple column segments together on a single fixed-size page. Column segments are encoded using the previously mentioned Frequency Encoding. Additionally, each page may be compressed again using local dictionaries and offset-coding based on the local data distribution. As with the compression schemes used in SQL Server, DB2 BLU aims to allow for query processing on compressed data, like early filtering on range queries. However, due to the bitwise encoding schemes used, point access is more involved and requires unpacking tuples first. Like BtrBlocks, DB2 BLU uses bitmaps to indicate NULL values.

SIMD decompression and selective scans. There is a large body of work discussing the use of SIMD and SIMD-optimized data layouts to speed up decompression and column scans. Polychroniou et al. [52] implement SIMD-optimized versions of common data structure operations and compare them against their scalar counterparts. Joint work by SAP and Intel focuses on fast predicate evaluation and decompression in column stores using SSE and AVX2 [58, 59]. Vertical BitWeaving [45] and ByteSlice [32] propose separating the bits of multiple values in a radix-like fashion, such that the 𝑘-th bits of these values reside adjacently in memory, thus enabling short-circuited predicate evaluation. Motivated by the observation that predicates often act on multiple columns simultaneously, Johnson et al. [35] propose storing multiple columns together in a bank, such that the resulting compressed partial tuples fit into a word. Using a custom-designed algebra on these packed words facilitates bandwidth- and cache-friendly computation.

Compressed data processing in BtrBlocks. Most of these academic papers, as well as SQL Server and DB2 BLU, facilitate some kind of partial query processing directly on the compressed data. This makes sense in proprietary systems where processing and storage are tightly integrated. We believe that open formats, in contrast, should optimize for raw decompression speed first: This way, systems can expect speed improvements without having to build their query processing around a single format. Note that BtrBlocks can, in principle, also support processing compressed data if the used schemes support it.

HyPer Data Blocks. The in-memory HTAP system HyPer introduced Data Blocks to reduce the memory footprint of cold data. Because HyPer targets both OLTP and OLAP, Data Blocks has to preserve fast point access [36]. As such, it only uses lightweight encoding schemes that keep the data byte-addressable: One Value, Ordered Dictionary Encoding and Truncation. After splitting the data into blocks, HyPer decides which scheme is optimal based on the statistics collected about that block. Truncation is a specialized version of FOR Encoding where the frame of reference is the min value of each block. Ordered Dictionary Encoding is feasible because blocks are immutable and do not need fast updates. HyPer chooses the dictionary code size based on the number of unique values, and ordering the dictionary allows it to evaluate range predicates on compressed data. To further increase the processing speed on compressed data, every block also contains an SMA (small materialized aggregate) and a lightweight index that improves point-access performance. The authors report compression factors of up to 5×.

SAP BRPFC. With Block-Based Re-Pair Front-Coding (BRPFC), SAP introduced a new compression scheme for string dictionaries [39]. This work is motivated by an internal analysis that showed the string pools required by Dictionary Encoding make up 28% of SAP HANA's total memory footprint. The system already uses block-based Front-Coding to compress dictionaries. Given sorted input strings, this encoding replaces the common prefix of subsequent strings with the length of the prefix. For example, [SIGMM, SIGMOBILE, SIGMOD] compresses to [SIGMM, (4)OBILE, (5)D]. HANA further improves this technique by adding Re-Pair compression, which replaces substrings in the data with shorter codes using a dynamically generated grammar for each block. They apply the resulting algorithm to blocks of data that fit in the cache to increase compression speed. Additionally, the authors designed a SIMD-based decompression algorithm to improve access latency. However, decompression is still too slow for our use case: based on the reported access latency, one can calculate a sequential decompression throughput of ≤100 MB/s [26], which is why we did not include a similar compression technique in BtrBlocks.

Latency on data lakes. BRPFC optimizes for per-string access latency because this is an important metric in an in-memory database like HANA. As a data format that targets data lakes, BtrBlocks does not profit from this: Access latency matters little when fetching large chunks of data over a high-latency network. We thus chose to optimize throughput and decompression speed instead.

8 CONCLUSION

We introduced BtrBlocks, an open columnar compression format for data lakes. By analyzing a collection of real-world datasets, we selected a pool of fast encoding schemes for this use case. Additionally, we introduced Pseudodecimal Encoding, a novel compression scheme for floating-point numbers. Using our sample-based compression scheme selection algorithm and our generic framework for cascading compression, we showed that, compared to existing data lake formats, BtrBlocks achieves a high compression factor, competitive compression speed and superior decompression performance. BtrBlocks is open source and available at https://github.com/maxi-k/btrblocks.

ACKNOWLEDGMENTS

Funded/Co-funded by the European Union (ERC, CODAC, 101041375). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council. Neither the European Union nor the granting authority can be held responsible for them.

REFERENCES [35] Ryan Johnson, Vijayshankar Raman, Richard Sidle, and Garret Swart. 2008. Row-
[1] October 11, 2022. https://github.com/lemire/FastPFor. wise parallel predicate evaluation. Proc. VLDB Endow. 1, 1 (2008), 622–634.
[2] October 11, 2022. https://github.com/cwida/fsst. [36] Harald Lang, Tobias Mühlbauer, Florian Funke, Peter A. Boncz, Thomas Neumann,
[3] October 14, 2022. https://github.com/apache/arrow/blob/ and Alfons Kemper. 2016. Data Blocks: Hybrid OLTP and OLAP on Compressed
883580883aab748fe94336cbed844f09e015178f/cpp/src/parquet/column_writer. Storage using both Vectorization and Compilation. In SIGMOD. 311–326.
cc#L1376. [37] Per-Åke Larson, Cipri Clinciu, Campbell Fraser, Eric N. Hanson, Mostafa Mokhtar,
[4] October 14, 2022. https://aws.amazon.com/ec2/pricing/on-demand/. Michal Nowakiewicz, Vassilis Papadimos, Susan L. Price, Srikumar Rangarajan,
[5] October 14, 2022. https://docs.aws.amazon.com/AmazonS3/latest/userguide/ Remus Rusanu, and Mayukh Saubhasik. 2013. Enhancements to SQL server
optimizing-performance-guidelines.html. column stores. In SIGMOD. 1159–1168.
[6] October 14, 2022. https://aws.amazon.com/sdk-for-cpp/. [38] Per-Åke Larson, Cipri Clinciu, Eric N. Hanson, Artem Oks, Susan L. Price, Sriku-
[7] October 4, 2022. https://github.com/RoaringBitmap/CRoaring. mar Rangarajan, Aleksandras Surna, and Qingqing Zhou. 2011. SQL server
[8] October 4, 2022. https://orc.apache.org/specification/ORCv1. column store indexes. In SIGMOD. 1177–1184.
[9] October 6, 2022. https://github.com/cwida/public_bi_benchmark/blob/master/ [39] Robert Lasch, Ismail Oukid, Roman Dementiev, Norman May, Süleyman Sirri
benchmark/CommonGovernment/samples/CommonGovernment_1.sample. Demirsoy, and Kai-Uwe Sattler. 2019. Fast & Strong: The Case of Compressed
csv. String Dictionaries on Modern CPUs. In DaMoN. 4:1–4:10.
[10] October 6, 2022. https://github.com/cwida/public_bi_benchmark/blob/master/ [40] Viktor Leis, Andrey Gubichev, Atanas Mirchev, Peter A. Boncz, Alfons Kemper,
benchmark/Generico/samples/Generico_1.sample.csv. and Thomas Neumann. 2015. How Good Are Query Optimizers, Really? PVLDB
[11] September 20, 2022. https://github.com/google/snappy. 9, 3 (2015), 204–215.
[12] September 20, 2022. https://github.com/facebook/zstd. [41] Viktor Leis and Maximilian Kuschewski. 2021. Towards Cost-Optimal Query
[13] September 20, 2022. https://parquet.apache.org/docs/file-format/data-pages/ Processing in the Cloud. PVLDB 14, 9 (2021), 1606–1612.
encodings/. [42] Daniel Lemire and Leonid Boytsov. 2012. Decoding billions of integers per
[14] September 20, 2022. https://github.com/lz4/lz4. second through vectorization. CoRR abs/1209.2137 (2012). arXiv:1209.2137
[15] September 20, 2022. https://parquet.apache.org/. http://arxiv.org/abs/1209.2137
[16] September 21, 2022. https://oneapi-src.github.io/oneTBB/. [43] Daniel Lemire, Gregory Ssi Yan Kai, and Owen Kaser. 2016. Consistently faster
[17] September 24, 2022. https://github.com/cwida/public_bi_benchmark. and smaller compressed bitmaps with Roaring. CoRR abs/1603.06549 (2016).
[18] September 24, 2022. https://aws.amazon.com/ec2/instance-types/c5/. [44] Daniel Lemire, Owen Kaser, Nathan Kurz, Luca Deri, Chris O’Hara, François
[19] September 27, 2022. https://github.com/cwida/fsst/blob/master/fsst.h#L144. Saint-Jacques, and Gregory Ssi Yan Kai. 2017. Roaring Bitmaps: Implementation
[20] September 29, 2022. https://arrow.apache.org/docs/cpp/api/utilities.html? of an Optimized Software Library. CoRR abs/1709.07821 (2017).
highlight=lz4#compression. [45] Yinan Li and Jignesh M. Patel. 2013. BitWeaving: fast scans for main memory
[21] Daniel Abadi, Peter A. Boncz, Stavros Harizopoulos, Stratos Idreos, and Samuel data processing. In SIGMOD Conference. ACM, 289–300.
Madden. 2013. The Design and Implementation of Modern Column-Oriented [46] Panagiotis Liakos, Katia Papakonstantinopoulou, and Yannis Kotidis. 2022. Chimp:
Database Systems. Found. Trends Databases 5, 3 (2013), 197–280. Efficient Lossless Floating Point Compression for Time Series Databases. PVLDB
[22] Daniel J. Abadi, Samuel Madden, and Miguel Ferreira. 2006. Integrating compres- 15, 11 (2022), 3058–3070.
sion and execution in column-oriented database systems. In SIGMOD Conference. [47] Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shiv-
ACM, 671–682. akumar, Matt Tolton, and Theo Vassilakis. 2010. Dremel: Interactive Analysis of
[23] Josep Aguilar-Saborit and Raghu Ramakrishnan. 2020. POLARIS: The Distributed Web-Scale Datasets. PVLDB 3, 1 (2010), 330–339.
SQL Engine in Azure Synapse. Proc. VLDB Endow. 13, 12 (2020), 3204–3216. [48] Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shiv-
[24] Nikos Armenatzoglou, Sanuj Basu, Naga Bhanoori, Mengchu Cai, Naresh akumar, Matt Tolton, Theo Vassilakis, Hossein Ahmadi, Dan Delorey, Slava Min,
Chainani, Kiran Chinta, Venkatraman Govindaraju, Todd J. Green, Monish Gupta, Mosha Pasumansky, and Jeff Shute. 2020. Dremel: A Decade of Interactive SQL
Sebastian Hillig, Eric Hotinger, Yan Leshinksy, Jintian Liang, Michael McCreedy, Analysis at Web Scale. PVLDB 13, 12 (2020), 3461–3472.
Fabian Nagel, Ippokratis Pandis, Panos Parchas, Rahul Pathak, Orestis Polychro- [49] Ingo Müller, Cornelius Ratsch, and Franz Färber. 2014. Adaptive String Dictionary
niou, Foyzur Rahman, Gaurav Saxena, Gokul Soundararajan, Sriram Subramanian, Compression in In-Memory Column-Store Database Systems. In EDBT. 283–294.
and Doug Terry. 2022. Amazon Redshift Re-invented. In SIGMOD. 2205–2217. [50] Thomas Neumann. 2011. Efficiently Compiling Efficient Query Plans for Modern
[25] Alexander Behm, Shoumik Palkar, Utkarsh Agarwal, Timothy Armstrong, David Hardware. PVLDB 4, 9 (2011), 539–550.
Cashman, Ankur Dave, Todd Greenstein, Shant Hovsepian, Ryan Johnson, [51] Tuomas Pelkonen, Scott Franklin, Paul Cavallaro, Qi Huang, Justin Meza, Justin
Arvind Sai Krishnan, Paul Leventis, Ala Luszczak, Prashanth Menon, Mostafa Teller, and Kaushik Veeraraghavan. 2015. Gorilla: A Fast, Scalable, In-Memory
Mokhtar, Gene Pang, Sameer Paranjpye, Greg Rahn, Bart Samwel, Tom van Bus- Time Series Database. Proc. VLDB Endow. 8, 12 (2015), 1816–1827.
sel, Herman Van Hovell, Maryann Xue, Reynold Xin, and Matei Zaharia. 2022. [52] Orestis Polychroniou and Kenneth A. Ross. 2015. Efficient Lightweight Compres-
Photon: A Fast Query Engine for Lakehouse Systems. In SIGMOD. 2326–2339. sion Alongside Fast Scans. In DaMoN. ACM, 9:1–9:6.
[26] Peter A. Boncz, Thomas Neumann, and Viktor Leis. 2020. FSST: Fast Random [53] Vijayshankar Raman, Gopi K. Attaluri, Ronald Barber, Naresh Chainani, David
Access String Compression. PVLDB 13, 11 (2020), 2649–2661. Kalmuk, Vincent KulandaiSamy, Jens Leenstra, Sam Lightstone, Shaorong Liu,
[27] Peter A. Boncz, Marcin Zukowski, and Niels Nes. 2005. MonetDB/X100: Hyper- Guy M. Lohman, Tim Malkemus, René Müller, Ippokratis Pandis, Berni Schiefer,
Pipelining Query Execution. In CIDR. 225–237. David Sharpe, Richard Sidle, Adam J. Storm, and Liping Zhang. 2013. DB2 with
[28] Martin Burtscher and Paruj Ratanaworabhan. 2007. High Throughput Compres- BLU Acceleration: So Much More than Just a Column Store. PVLDB 6, 11 (2013),
sion of Double-Precision Floating-Point Data. In DCC. IEEE Computer Society, 1080–1091.
293–302. [54] Alice Rey, Michael Freitag, and Thomas Neumann. 2023. Seamless Integration
[29] Benoît Dageville, Thierry Cruanes, Marcin Zukowski, Vadim Antonov, Artin of Parquet Files into Data Processing. In BTW (LNI, Vol. P-331). Gesellschaft für
Avanes, Jon Bock, Jonathan Claybaugh, Daniel Engovatov, Martin Hentschel, Informatik e.V., 235–258.
Jiansheng Huang, Allison W. Lee, Ashish Motivala, Abdul Q. Munir, Steven Pelley, [55] Alexander van Renen and Viktor Leis. 2023. Cloud Analytics Benchmark. PVLDB
Peter Povinec, Greg Rahn, Spyridon Triantafyllis, and Philipp Unterbrunner. 2016. 16, 6 (2023), 1413–1425.
The Snowflake Elastic Data Warehouse. In SIGMOD. 215–226. [56] Adrian Vogelsgesang, Michael Haubenschild, Jan Finis, Alfons Kemper, Viktor
[30] Patrick Damme, Dirk Habich, Juliana Hildebrandt, and Wolfgang Lehner. 2017. Leis, Tobias Mühlbauer, Thomas Neumann, and Manuel Then. 2018. Get Real:
Lightweight Data Compression Algorithms: An Experimental Survey. In EDBT. How Benchmarks Fail to Represent the Real World. In DBTest. 1:1–1:6.
72–83. [57] Jianguo Wang, Chunbin Lin, Yannis Papakonstantinou, and Steven Swanson. 2017.
[31] Patrick Damme, Annett Ungethüm, Juliana Hildebrandt, Dirk Habich, and Wolf- An Experimental Study of Bitmap Compression vs. Inverted List Compression.
gang Lehner. 2019. From a Comprehensive Experimental Survey to a Cost-based In SIGMOD. 993–1008.
Selection Strategy for Lightweight Integer Compression Algorithms. ACM Trans. [58] Thomas Willhalm, Ismail Oukid, Ingo Müller, and Franz Faerber. 2013. Vectorizing
Database Syst. 44, 3 (2019), 9:1–9:46. Database Column Scans with Complex Predicates. In ADMS@VLDB. 1–12.
[32] Ziqiang Feng, Eric Lo, Ben Kao, and Wenjian Xu. 2015. ByteSlice: Pushing [59] Thomas Willhalm, Nicolae Popovici, Yazan Boshmaf, Hasso Plattner, Alexander
the Envelop of Main Memory Data Processing with a New Storage Layout. In Zeier, and Jan Schaffner. 2009. SIMD-Scan: Ultra Fast in-Memory Table Scan
SIGMOD Conference. ACM, 31–46. using on-Chip Vector Processing Units. Proc. VLDB Endow. 2, 1 (2009), 385–394.
[33] Bogdan Ghita, Diego G. Tomé, and Peter A. Boncz. 2020. White-box Compression: [60] Matei Zaharia, Ali Ghodsi, Reynold Xin, and Michael Armbrust. 2021. Lakehouse:
Learning and Exploiting Compact Table Representations. In CIDR. A New Generation of Open Platforms that Unify Data Warehousing and Advanced
[34] Anurag Gupta, Deepak Agarwal, Derek Tan, Jakub Kulesza, Rahul Pathak, Stefano Analytics. In CIDR.
Stefani, and Vidhya Srinivasan. 2015. Amazon Redshift and the Case for Simpler [61] Marcin Zukowski, Sándor Héman, Niels Nes, and Peter A. Boncz. 2006. Super-
Data Warehouses. In SIGMOD Conference. ACM, 1917–1923. Scalar RAM-CPU Cache Compression. In ICDE. 59.
