Calcite
Industry 1: Adaptive Query Processing SIGMOD’18, June 10-15, 2018, Houston, TX, USA
engines. Calcite was quickly adopted by Hive, Drill [13], Storm, and many other data processing engines, providing them with advanced query optimizations and query languages.¹ For example, Hive [24] is a popular data warehouse project built on top of Apache Hadoop. As Hive moved from its batch processing roots towards an interactive SQL query answering platform, it became clear that the project needed a powerful optimizer at its core. Thus, Hive adopted Calcite as its optimizer, and their integration has been growing ever since. Many other projects and products have followed suit, including Flink, MapD [12], etc.

Furthermore, Calcite enables cross-platform optimization by exposing a common interface to multiple systems. To be efficient, the optimizer needs to reason globally, e.g., make decisions across different systems about materialized view selection.

Building a common framework does not come without challenges. In particular, the framework needs to be extensible and flexible enough to accommodate the different types of systems requiring integration.

We believe that the following features have contributed to Calcite's wide adoption in the open source community and industry:

• Open source friendliness. Many of the major data processing platforms of the last decade have been either open source or largely based on open source. Calcite is an open-source framework, backed by the Apache Software Foundation (ASF) [5], which provides the means to collaboratively develop the project. Furthermore, the software is written in Java, making it easier to interoperate with many of the latest data processing systems [12, 13, 16, 24, 28, 44] that are often themselves written in Java (or in the JVM-based Scala), especially those in the Hadoop ecosystem.

• Multiple data models. Calcite provides support for query optimization and query languages using both streaming and conventional data processing paradigms. Calcite treats streams as time-ordered sets of records or events that are not persisted to disk as they would be in conventional data processing systems.

• Flexible query optimizer. Each component of the optimizer is pluggable and extensible, ranging from rules to cost models. In addition, Calcite includes support for multiple planning engines. Hence, the optimization can be broken down into phases handled by different optimization engines depending on which one is best suited for the stage.

• Cross-system support. The Calcite framework can run and optimize queries across multiple query processing systems and database backends.

• Reliability. Calcite is reliable, as its wide adoption over many years has led to exhaustive testing of the platform. Calcite also contains an extensive test suite validating all components of the system, including query optimizer rules and integration with backend data sources.

• Support for SQL and its extensions. Many systems do not provide their own query language, but rather prefer to rely on existing ones such as SQL. For those, Calcite provides support for ANSI standard SQL, as well as various SQL dialects and extensions, e.g., for expressing queries on streaming or nested data. In addition, Calcite includes a driver conforming to the standard Java API (JDBC).

The remainder is organized as follows. Section 2 discusses related work. Section 3 introduces Calcite's architecture and its main components. Section 4 describes the relational algebra at the core of Calcite. Section 5 presents Calcite's adapters, an abstraction to define how to read external data sources. In turn, Section 6 describes Calcite's optimizer and its main features, while Section 7 presents the extensions to handle different query processing paradigms. Section 8 provides an overview of the data processing systems already using Calcite. Section 9 discusses possible future extensions for the framework before we conclude in Section 10.

2 RELATED WORK
Though Calcite is currently the most widely adopted optimizer for big-data analytics in the Hadoop ecosystem, many of the ideas that lie behind it are not novel. For instance, the query optimizer builds on ideas from the Volcano [20] and Cascades [19] frameworks, incorporating other widely used optimization techniques such as materialized view rewriting [10, 18, 22]. There are other systems that try to fill a similar role to Calcite.

Orca [45] is a modular query optimizer used in data management products such as Greenplum and HAWQ. Orca decouples the optimizer from the query execution engine by implementing a framework for exchanging information between the two, known as Data eXchange Language. Orca also provides tools for verifying the correctness and performance of generated query plans. In contrast to Orca, Calcite can be used as a standalone query execution engine that federates multiple storage and processing backends, including pluggable planners and optimizers.

Spark SQL [3] extends Apache Spark to support SQL query execution and, like Calcite, can execute queries over multiple data sources. However, although the Catalyst optimizer in Spark SQL also attempts to minimize query execution cost, it lacks the dynamic programming approach used by Calcite and risks falling into local minima.

Algebricks [6] is a query compiler architecture that provides a data model agnostic algebraic layer and compiler framework for big data query processing. High-level languages are compiled to Algebricks logical algebra. Algebricks then generates an optimized job targeting the Hyracks parallel processing backend. While Calcite shares a modular approach with Algebricks, Calcite also includes support for cost-based optimizations. In the current version of Calcite, the query optimizer architecture uses dynamic programming-based planning based on Volcano [20] with extensions for multi-stage optimizations as in Orca [45]. Though in principle Algebricks could support multiple processing backends (e.g., Apache Tez, Spark), Calcite has provided well-tested support for diverse backends for many years.

Garlic [7] is a heterogeneous data management system which represents data from multiple systems under a unified object model. However, Garlic does not support query optimization across different systems and relies on each system to optimize its own queries.

FORWARD [17] is a federated query processor that implements a superset of SQL called SQL++ [38]. SQL++ has a semi-structured data model that integrates both JSON and relational data models.

¹ http://calcite.apache.org/docs/powered_by
we need to compute multiple types of metadata, such as cardinality, average row size, and selectivity for a given join, and all these computations rely on the cardinality of their inputs.

Planner engines. The main goal of a planner engine is to trigger the rules provided to the engine until it reaches a given objective. At the moment, Calcite provides two different engines. New engines are pluggable in the framework.

The first one, a cost-based planner engine, triggers the input rules with the goal of reducing the overall expression cost. The engine uses a dynamic programming algorithm, similar to Volcano [20], to create and track the different alternative plans created by firing the rules given to the engine. Initially, each expression is registered with the planner, together with a digest based on the expression attributes and its inputs. When a rule is fired on an expression e1 and the rule produces a new expression e2, the planner will add e2 to the set of equivalent expressions Sa that e1 belongs to. In addition, the planner generates a digest for the new expression, which is compared with those previously registered in the planner. If a similar digest associated with an expression e3 that belongs to a set Sb is found, the planner has found a duplicate and hence will merge Sa and Sb into a new set of equivalences. The process continues until the planner reaches a configurable fix point. In particular, it can (i) exhaustively explore the search space until all rules have been applied on all expressions, or (ii) use a heuristic-based approach to stop the search when the plan cost has not improved by more than a given threshold δ in the last planner iterations. The cost function that allows the optimizer to decide which plan to choose is supplied through metadata providers. The default cost function implementation combines estimations for the CPU, IO, and memory resources used by a given expression.

The second engine is an exhaustive planner, which triggers rules exhaustively until it generates an expression that is no longer modified by any rules. This planner is useful for quickly executing rules without taking into account the cost of each expression.

Users may choose one of the existing planner engines depending on their concrete needs, and switching from one to another when their system requirements change is straightforward. Alternatively, users may choose to generate multi-stage optimization logic, in which different sets of rules are applied in consecutive phases of the optimization process. Importantly, the existence of two planners allows Calcite users to reduce the overall optimization time by guiding the search for different query plans.

Materialized views. One of the most powerful techniques to accelerate query processing in data warehouses is the precomputation of relevant summaries, or materialized views. Multiple Calcite adapters and projects relying on Calcite have their own notion of materialized views. For instance, Cassandra allows the user to define materialized views based on existing tables which are automatically maintained by the system.

These engines expose their materialized views to Calcite. The optimizer then has the opportunity to rewrite incoming queries to use these views instead of the original tables. In particular, Calcite provides an implementation of two different materialized view-based rewriting algorithms.

The first approach is based on view substitution [10, 18]. The aim is to substitute part of the relational algebra tree with an equivalent expression which makes use of a materialized view. The algorithm proceeds as follows: (i) the scan operator over the materialized view and the materialized view definition plan are registered with the planner, and (ii) transformation rules that try to unify expressions in the plan are triggered. Views do not need to exactly match the expressions in the query being replaced, as the rewriting algorithm in Calcite can produce partial rewritings that include additional operators to compute the desired expression, e.g., filters with residual predicate conditions.

The second approach is based on lattices [22]. Once the data sources are declared to form a lattice, Calcite represents each of the materializations as a tile, which in turn can be used by the optimizer to answer incoming queries. On the one hand, the rewriting algorithm is especially efficient in matching expressions over data sources organized in a star schema, which are common in OLAP applications. On the other hand, it is more restrictive than view substitution, as it imposes restrictions on the underlying schema.

7 EXTENDING CALCITE
As we have mentioned in the previous sections, Calcite is not only tailored towards SQL processing. In fact, Calcite provides extensions to SQL for expressing queries over other data abstractions, such as semi-structured, streaming, and geospatial data. Its internal operators adapt to these queries. In addition to extensions to SQL, Calcite also includes a language-integrated query language. We describe these extensions throughout this section and provide some examples.

7.1 Semi-structured Data
Calcite supports several complex column data types that enable a hybrid of relational and semi-structured data to be stored in tables. Specifically, columns can be of type ARRAY, MAP, or MULTISET. Furthermore, these complex types can be nested, so it is possible, for example, to have a MAP where the values are of type ARRAY. Data within ARRAY and MAP columns (and nested data therein) can be extracted using the [] operator. The specific type of the values stored in any of these complex types need not be predefined.

For example, Calcite contains an adapter for MongoDB [36], a document store which stores documents consisting of data roughly equivalent to JSON documents. To expose MongoDB data to Calcite, a table is created for each document collection with a single column named _MAP: a map from document identifiers to their data. In many cases, documents can be expected to have a common structure. A collection of documents representing zip codes may each contain columns with a city name, latitude, and longitude. It can be useful to expose this data as a relational table. In Calcite, this is achieved by creating a view after extracting the desired values and casting them to the appropriate type:

SELECT CAST(_MAP['city'] AS varchar(20)) AS city,
       CAST(_MAP['loc'][0] AS float) AS longitude,
       CAST(_MAP['loc'][1] AS float) AS latitude
FROM mongo_raw.zips;

With views over semi-structured data defined in this manner, it becomes easier to manipulate data from different semi-structured sources in tandem with relational data.
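To make the mapping concrete, the following Python sketch performs the same flattening that the view above expresses in SQL. It is not Calcite's implementation (Calcite does this inside the view, in Java); the sample zip-code documents and field names are hypothetical, mirroring the example above:

```python
# Sketch: flatten semi-structured documents into typed relational rows,
# mimicking the CAST(_MAP[...]) view over the MongoDB adapter's _MAP column.
# The 'zips' documents below are illustrative sample data, not Calcite's API.

def flatten_zip(doc):
    """Extract city/longitude/latitude from one document (the _MAP analogue)."""
    return {
        "city": str(doc["city"])[:20],      # CAST(_MAP['city'] AS varchar(20))
        "longitude": float(doc["loc"][0]),  # CAST(_MAP['loc'][0] AS float)
        "latitude": float(doc["loc"][1]),   # CAST(_MAP['loc'][1] AS float)
    }

zips = [
    {"city": "Houston", "loc": [-95.36, 29.76]},
    {"city": "Seattle", "loc": [-122.33, 47.61]},
]

# The "relational table" exposed by the view: one typed row per document.
rows = [flatten_zip(d) for d in zips]
```

As in the SQL view, the untyped, nested document values only acquire relational types at extraction time.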
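The view-substitution rewriting described in Section 6 can be illustrated with a toy substitution over a tuple-encoded algebra tree. This is a deliberately simplified sketch (exact subtree match only, with a hypothetical plan encoding); Calcite's actual unification rules also produce partial rewritings with residual operators such as filters:

```python
# Sketch of view substitution: if a subtree of the query plan is identical to a
# registered materialized view's definition, replace it with a scan of the view.
# Plan nodes are plain tuples; names and structure are illustrative, not
# Calcite's RelNode/materialization API.

def substitute(plan, view_name, view_def):
    """Return plan with every subtree equal to view_def replaced by a scan."""
    if plan == view_def:
        return ("scan", view_name)
    op, *children = plan
    return (op, *[substitute(c, view_name, view_def)
                  if isinstance(c, tuple) else c
                  for c in children])

# Query: a filter over a join; materialized view mv precomputes the join.
join = ("join", ("scan", "emps"), ("scan", "depts"))
query = ("filter", join, "salary > 100")

rewritten = substitute(query, "mv", join)
# rewritten == ("filter", ("scan", "mv"), "salary > 100")
```

A real planner registers both the view scan and the view definition plan, then fires unification rules rather than requiring the exact match used here.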
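The digest-based bookkeeping of the cost-based planner engine described in Section 6 (register each expression under a digest; when a rule produces an expression whose digest is already known, merge the two equivalence sets Sa and Sb) can be sketched as follows. All names here are illustrative, not Calcite's Java implementation or its data structures:

```python
# Sketch of the cost-based planner's equivalence bookkeeping. Expressions are
# identified by their digest strings; equivalence sets are merged when a rule
# produces an expression whose digest is already registered elsewhere.

class EquivalenceRegistry:
    def __init__(self):
        self.digest_to_set = {}  # digest -> id of its equivalence set
        self.sets = {}           # set id -> set of expression digests
        self.next_id = 0

    def register(self, digest):
        """Register an expression; return the id of its equivalence set."""
        if digest in self.digest_to_set:
            return self.digest_to_set[digest]
        sid = self.next_id
        self.next_id += 1
        self.sets[sid] = {digest}
        self.digest_to_set[digest] = sid
        return sid

    def add_equivalent(self, digest_e1, digest_e2):
        """A rule fired on e1 produced e2: add e2 to e1's set Sa, merging
        Sa and Sb if e2's digest already belongs to another set Sb."""
        sa = self.register(digest_e1)
        if digest_e2 in self.digest_to_set:
            sb = self.digest_to_set[digest_e2]
            if sa != sb:  # duplicate digest found: merge Sa and Sb
                self.sets[sa] |= self.sets[sb]
                for d in self.sets.pop(sb):
                    self.digest_to_set[d] = sa
            return sa
        self.sets[sa].add(digest_e2)
        self.digest_to_set[digest_e2] = sa
        return sa
```

A real planner additionally tracks the cheapest expression per equivalence set and stops at a fix point, or when the cost improvement falls below the threshold δ.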
assembled and aligned to assess the best treatments based on the comprehensive medical history and the genomic profile of the patient. The data comes from relational sources representing patients' electronic medical records; structured and semi-structured sources representing various reports (oncology, psychiatry, laboratory tests, radiology, etc.); and imaging, signals, and sequence data stored in scientific databases. In those circumstances, Calcite represents a good foundation with its uniform query interface and flexible adapter architecture, but the ongoing research efforts are aimed at (i) the introduction of new adapters for array and textual sources, and (ii) support for efficient joining of heterogeneous data sources.

9 FUTURE WORK
Future work on Calcite will focus on the development of new features and the expansion of its adapter architecture:

• Enhancements to the design of Calcite to further support its use as a standalone engine, which would require support for data definition languages (DDL), materialized views, indexes, and constraints.

• Ongoing improvements to the design and flexibility of the planner, including making it more modular and allowing users of Calcite to supply planner programs (collections of rules organized into planning phases) for execution.

• Incorporation of new parametric approaches [53] into the design of the optimizer.

• Support for an extended set of SQL commands, functions, and utilities, including full compliance with OpenGIS.

• New adapters for non-relational data sources such as array databases for scientific computing.

• Improvements to performance profiling and instrumentation.

9.1 Performance Testing and Evaluation
Though Calcite contains a performance testing module, it does not evaluate query execution. It would be useful to assess the performance of systems built with Calcite. For example, we could compare the performance of Calcite with that of similar frameworks. Unfortunately, it might be difficult to craft fair comparisons. For example, like Calcite, Algebricks optimizes queries for Hive. Borkar et al. [6] compared Algebricks with the Hyracks scheduler against Hive version 0.12 (without Calcite). That work precedes significant engineering and architectural changes in Hive. Comparing Calcite against Algebricks fairly in terms of timings does not seem feasible, as one would need to ensure that each uses the same execution engine. Hive applications rely mostly on either Apache Tez or Apache Spark as execution engines, whereas Algebricks is tied to its own framework (including Hyracks).

Moreover, to assess the performance of Calcite-based systems, we need to consider two distinct use cases. Indeed, Calcite can be used either as part of a single system (as a tool to accelerate the construction of such a system) or for the more difficult task of combining several distinct systems (as a common layer). The former is tied to the characteristics of the data processing system, and because Calcite is so versatile and widely used, many distinct benchmarks are needed. The latter is limited by the availability of existing heterogeneous benchmarks. BigDAWG [55] has been used to integrate PostgreSQL with Vertica, and on a standard benchmark the integrated system outperformed a baseline in which entire tables are copied from one system to another to answer specific queries. Based on real-world experience, we believe that more ambitious goals are possible when integrating multiple systems: the whole should be superior to the sum of its parts.

10 CONCLUSION
Emerging data management practices and associated analytic uses of data continue to evolve towards an increasingly diverse and heterogeneous spectrum of scenarios. At the same time, relational data sources, accessed through SQL, remain an essential means by which enterprises work with data. In this somewhat dichotomous space, Calcite plays a unique role with its strong support for both conventional data processing and for other data sources, including those with semi-structured, streaming, and geospatial models. In addition, Calcite's design philosophy, with its focus on flexibility, adaptivity, and extensibility, has been another factor in Calcite becoming the most widely adopted query optimizer, used in a large number of open-source frameworks. Calcite's dynamic and flexible query optimizer and its adapter architecture allow it to be embedded selectively by a variety of data management frameworks such as Hive, Drill, MapD, and Flink. Calcite's support for heterogeneous data processing, as well as for the extended set of relational functions, will continue to improve in both functionality and performance.

ACKNOWLEDGMENTS
We would like to thank the Calcite community, contributors and users, who build, maintain, use, test, write about, and continue to push the Calcite project forward. This manuscript has been in part co-authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy.

REFERENCES
[1] Apex. Apache Apex. https://apex.apache.org. (Nov. 2017).
[2] Arvind Arasu, Shivnath Babu, and Jennifer Widom. 2003. The CQL Continuous Query Language: Semantic Foundations and Query Execution. Technical Report 2003-67. Stanford InfoLab.
[3] Michael Armbrust et al. 2015. Spark SQL: Relational Data Processing in Spark. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (SIGMOD '15). ACM, New York, NY, USA, 1383–1394.
[4] Michael Armbrust, Reynold S. Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan, Michael J. Franklin, Ali Ghodsi, and Matei Zaharia. 2015. Spark SQL: Relational Data Processing in Spark. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (SIGMOD '15). ACM, New York, NY, USA, 1383–1394.
[5] ASF. The Apache Software Foundation. (Nov. 2017). Retrieved November 20, 2017 from http://www.apache.org/
[6] Vinayak Borkar, Yingyi Bu, E. Preston Carman, Jr., Nicola Onose, Till Westmann, Pouria Pirzadeh, Michael J. Carey, and Vassilis J. Tsotras. 2015. Algebricks: A Data Model-agnostic Compiler Backend for Big Data Languages. In Proceedings of the Sixth ACM Symposium on Cloud Computing (SoCC '15). ACM, New York, NY, USA, 422–433.
[7] M. J. Carey et al. 1995. Towards heterogeneous multimedia information systems: the Garlic approach. In RIDE-DOM '95. 124–131.
[8] Cassandra. Apache Cassandra. (Nov. 2017). Retrieved November 20, 2017 from http://cassandra.apache.org/
[9] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Michael Burrows, Tushar Chandra, Andrew Fikes, and Robert Gruber. 2006. Bigtable: A Distributed Storage System for Structured Data. In 7th Symposium on Operating Systems Design and Implementation (OSDI '06), November 6–8, Seattle, WA, USA. 205–218.
[10] Surajit Chaudhuri, Ravi Krishnamurthy, Spyros Potamianos, and Kyuseok Shim. 1995. Optimizing Queries with Materialized Views. In Proceedings of the Eleventh International Conference on Data Engineering (ICDE '95). IEEE Computer Society, Washington, DC, USA, 190–200.
[11] E. F. Codd. 1970. A Relational Model of Data for Large Shared Data Banks. Commun. ACM 13, 6 (June 1970), 377–387.
[12] Alex Şuhan. Fast and Flexible Query Analysis at MapD with Apache Calcite. (Feb. 2017). Retrieved November 20, 2017 from https://www.mapd.com/blog/2017/02/08/fast-and-flexible-query-analysis-at-mapd-with-apache-calcite-2/
[13] Drill. Apache Drill. (Nov. 2017). Retrieved November 20, 2017 from http://drill.apache.org/
[14] Druid. Druid. (Nov. 2017). Retrieved November 20, 2017 from http://druid.io/
[15] Elastic. Elasticsearch. (Nov. 2017). Retrieved November 20, 2017 from https://www.elastic.co
[16] Flink. Apache Flink. https://flink.apache.org. (Nov. 2017).
[17] Yupeng Fu, Kian Win Ong, Yannis Papakonstantinou, and Michalis Petropoulos. 2011. The SQL-based all-declarative FORWARD web application development framework. In CIDR.
[18] Jonathan Goldstein and Per-Åke Larson. 2001. Optimizing Queries Using Materialized Views: A Practical, Scalable Solution. SIGMOD Rec. 30, 2 (May 2001), 331–342.
[19] Goetz Graefe. 1995. The Cascades Framework for Query Optimization. IEEE Data Eng. Bull. (1995).
[20] Goetz Graefe and William J. McKenna. 1993. The Volcano Optimizer Generator: Extensibility and Efficient Search. In Proceedings of the Ninth International Conference on Data Engineering. IEEE Computer Society, Washington, DC, USA, 209–218.
[21] Daniel Halperin, Victor Teixeira de Almeida, Lee Lee Choo, Shumo Chu, Paraschos Koutris, Dominik Moritz, Jennifer Ortiz, Vaspol Ruamviboonsuk, Jingjing Wang, Andrew Whitaker, Shengliang Xu, Magdalena Balazinska, Bill Howe, and Dan Suciu. 2014. Demonstration of the Myria Big Data Management Service. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data (SIGMOD '14). ACM, New York, NY, USA, 881–884.
[22] Venky Harinarayan, Anand Rajaraman, and Jeffrey D. Ullman. 1996. Implementing Data Cubes Efficiently. SIGMOD Rec. 25, 2 (June 1996), 205–216.
[23] HBase. Apache HBase. (Nov. 2017). Retrieved November 20, 2017 from http://hbase.apache.org/
[24] Hive. Apache Hive. (Nov. 2017). Retrieved November 20, 2017 from http://hive.apache.org/
[25] Yin Huai, Ashutosh Chauhan, Alan Gates, Gunther Hagleitner, Eric N. Hanson, Owen O'Malley, Jitendra Pandey, Yuan Yuan, Rubao Lee, and Xiaodong Zhang. 2014. Major Technical Advancements in Apache Hive. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data (SIGMOD '14). ACM, New York, NY, USA, 1235–1246.
[26] Julian Hyde. 2010. Data in Flight. Commun. ACM 53, 1 (Jan. 2010), 48–52.
[27] Janino. Janino: A super-small, super-fast Java compiler. (Nov. 2017). Retrieved November 20, 2017 from http://www.janino.net/
[28] Kylin. Apache Kylin. (Nov. 2017). Retrieved November 20, 2017 from http://kylin.apache.org/
[29] Avinash Lakshman and Prashant Malik. 2010. Cassandra: A Decentralized Structured Storage System. SIGOPS Oper. Syst. Rev. 44, 2 (April 2010), 35–40.
[30] Lingual. Lingual. (Nov. 2017). Retrieved November 20, 2017 from http://www.cascading.org/projects/lingual/
[31] Lucene. Apache Lucene. (Nov. 2017). Retrieved November 20, 2017 from https://lucene.apache.org/
[32] MapD. MapD. (Nov. 2017). Retrieved November 20, 2017 from https://www.mapd.com
[33] Erik Meijer, Brian Beckman, and Gavin Bierman. 2006. LINQ: Reconciling Object, Relations and XML in the .NET Framework. In Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data (SIGMOD '06). ACM, New York, NY, USA, 706–706.
[34] Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, and Theo Vassilakis. 2010. Dremel: Interactive Analysis of Web-Scale Datasets. PVLDB 3, 1 (2010), 330–339. http://www.comp.nus.edu.sg/~vldb2010/proceedings/files/papers/R29.pdf
[35] Marcelo RN Mendes, Pedro Bizarro, and Paulo Marques. 2009. A performance study of event processing systems. In Technology Conference on Performance Evaluation and Benchmarking. Springer, 221–236.
[36] Mongo. MongoDB. (Nov. 2017). Retrieved November 28, 2017 from https://www.mongodb.com/
[37] Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins. 2008. Pig Latin: A Not-so-foreign Language for Data Processing. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data (SIGMOD '08). ACM, New York, NY, USA, 1099–1110.
[38] Kian Win Ong, Yannis Papakonstantinou, and Romain Vernoux. 2014. The SQL++ query language: Configurable, unifying and semi-structured. arXiv preprint arXiv:1405.3631 (2014).
[39] Open Geospatial Consortium. OpenGIS Implementation Specification for Geographic information - Simple feature access - Part 2: SQL option. http://portal.opengeospatial.org/files/?artifact_id=25355. (2010).
[40] Phoenix. Apache Phoenix. (Nov. 2017). Retrieved November 20, 2017 from http://phoenix.apache.org/
[41] Pig. Apache Pig. (Nov. 2017). Retrieved November 20, 2017 from http://pig.apache.org/
[42] Qubole Quark. Qubole Quark. (Nov. 2017). Retrieved November 20, 2017 from https://github.com/qubole/quark
[43] Bikas Saha, Hitesh Shah, Siddharth Seth, Gopal Vijayaraghavan, Arun C. Murthy, and Carlo Curino. 2015. Apache Tez: A Unifying Framework for Modeling and Building Data Processing Applications. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (SIGMOD '15). ACM, New York, NY, USA, 1357–1369.
[44] Samza. Apache Samza. (Nov. 2017). Retrieved November 20, 2017 from http://samza.apache.org/
[45] Mohamed A. Soliman, Lyublena Antova, Venkatesh Raghavan, Amr El-Helw, Zhongxian Gu, Entong Shen, George C. Caragea, Carlos Garcia-Alvarado, Foyzur Rahman, Michalis Petropoulos, Florian Waas, Sivaramakrishnan Narayanan, Konstantinos Krikellas, and Rhonda Baldwin. 2014. Orca: A Modular Query Optimizer Architecture for Big Data. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data (SIGMOD '14). ACM, New York, NY, USA, 337–348.
[46] Solr. Apache Solr. (Nov. 2017). Retrieved November 20, 2017 from http://lucene.apache.org/solr/
[47] Spark. Apache Spark. (Nov. 2017). Retrieved November 20, 2017 from http://spark.apache.org/
[48] Splunk. Splunk. (Nov. 2017). Retrieved November 20, 2017 from https://www.splunk.com/
[49] Michael Stonebraker and Ugur Çetintemel. 2005. "One size fits all": an idea whose time has come and gone. In 21st International Conference on Data Engineering (ICDE '05). IEEE Computer Society, Washington, DC, USA, 2–11.
[50] Storm. Apache Storm. (Nov. 2017). Retrieved November 20, 2017 from http://storm.apache.org/
[51] Tez. Apache Tez. (Nov. 2017). Retrieved November 20, 2017 from http://tez.apache.org/
[52] Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff, and Raghotham Murthy. 2009. Hive: a warehousing solution over a map-reduce framework. VLDB (2009), 1626–1629.
[53] Immanuel Trummer and Christoph Koch. 2017. Multi-objective parametric query optimization. The VLDB Journal 26, 1 (2017), 107–124.
[54] Ashwin Kumar Vajantri, Kunwar Deep Singh Toor, and Edmon Begoli. 2017. An Apache Calcite-based Polystore Variation for Federated Querying of Heterogeneous Healthcare Sources. In 2nd Workshop on Methods to Manage Heterogeneous Big Data and Polystore Databases. IEEE Computer Society, Washington, DC, USA.
[55] Katherine Yu, Vijay Gadepally, and Michael Stonebraker. 2017. Database engine integration and performance analysis of the BigDAWG polystore system. In 2017 IEEE High Performance Extreme Computing Conference (HPEC). IEEE Computer Society, Washington, DC, USA, 1–7.
[56] Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2010. Spark: Cluster Computing with Working Sets. In HotCloud.
[57] Jingren Zhou, Per-Åke Larson, and Ronnie Chaiken. 2010. Incorporating partitioning and parallel plans into the SCOPE optimizer. In 2010 IEEE 26th International Conference on Data Engineering (ICDE 2010). IEEE Computer Society, Washington, DC, USA, 1060–1071.