HyPer: A Hybrid OLTP&OLAP Main Memory Database System Based on Virtual Memory Snapshots
Abstract—The two areas of online transaction processing (OLTP) and online analytical processing (OLAP) present different challenges for database architectures. Currently, customers with high rates of mission-critical transactions have split their data into two separate systems, one database for OLTP and one so-called data warehouse for OLAP. While allowing for decent transaction rates, this separation has many disadvantages, including data freshness issues due to the delay caused by only periodically initiating the Extract-Transform-Load data staging, and excessive resource consumption due to maintaining two separate information systems. We present an efficient hybrid system, called HyPer, that can handle both OLTP and OLAP simultaneously by using hardware-assisted replication mechanisms to maintain consistent snapshots of the transactional data. HyPer is a main-memory database system that guarantees the ACID properties of OLTP transactions and executes OLAP query sessions (multiple queries) on the same, arbitrarily current and consistent snapshot. The utilization of the processor-inherent support for virtual memory management (address translation, caching, copy on update) yields both at the same time: unprecedentedly high transaction rates as high as 100,000 per second and very fast OLAP query response times on a single system executing both workloads in parallel. The performance analysis is based on a combined TPC-C and TPC-H benchmark.

I. INTRODUCTION

Historically, database systems were mainly used for online transaction processing. Typical examples of such transaction processing systems are sales order entry or banking transaction processing. These transactions access and process only small portions of the entire data and, therefore, can be executed quite fast. According to the standardized TPC-C benchmark results, the currently most powerful systems can process more than 100,000 such sales transactions per second.

About two decades ago a new usage of database systems evolved: Business Intelligence (BI). BI applications rely on long-running so-called Online Analytical Processing (OLAP) queries that process substantial portions of the data in order to generate reports for business analysts. Typical reports include aggregated sales statistics grouped by geographical regions, by product categories, or by customer classifications. Initial attempts – such as SAP's EIS project [1] – to execute these queries on the operational OLTP database were dismissed, as the OLAP query processing led to resource contention and severely hurt the mission-critical transaction processing. Therefore, the data staging architecture was devised, where the transaction processing is carried out on a dedicated OLTP database system. In addition, a separate data warehouse system is installed for business intelligence query processing. Periodically, e.g., during the night, the OLTP database changes are extracted, transformed to the layout of the data warehouse schema, and loaded into the data warehouse. This data staging and its associated ETL (Extract-Transform-Load) processing obviously incurs the problem of data staleness, as the ETL process can only be executed periodically.

Recently, strong arguments for so-called real-time business intelligence were made. Hasso Plattner, the co-founder of SAP, advocates the "data at your fingertips" goal for enterprise resource planning systems [2]. The currently exercised separation of transaction processing on the OLTP database and BI query processing on the data warehouse that is only periodically refreshed violates this goal, as business analysts have to base their decisions on stale (outdated) data. Real-time/operational business intelligence demands executing OLAP queries on the current, up-to-date state of the transactional OLTP data. We propose to enhance the transactional database with highly effective query processing capabilities – thereby shifting (some of) the query processing from the DW to the OLTP system. Therefore, mixed workloads of OLTP transaction processing and OLAP query processing on the same database have to be supported. This is somewhat counter to the recent trend of building dedicated systems for different application scenarios. The integration of these two very different workloads on the same system necessitates drastic performance improvements, which can be achieved by main-memory database architectures.

On first view, the dramatic explosion of the (Internet-accessible) data volume may contradict this premise of keeping all transactional data main memory-resident. However, a closer examination shows that the business-critical transactional database volume has limited size, which favors main memory data management. To corroborate this assumption, let us analyze one of the largest commercial enterprises, Amazon, which has a yearly revenue of about 15 billion Euros. Assuming that an individual order line is valued at about 15 Euros and each order line incurs stored data of about 54 bytes – as specified for the TPC-C benchmark – we derive a total data volume of 54 GB per year for the order lines, which is the dominating repository in such a sales application. This estimate does not include the other data (customer and product data), which is of comparatively low volume; in any case, the yearly sales data can be fit into main memory of a large scale server. This was also analyzed by Ousterhout et al. [3], who proclaim the so-called RAMcloud as a main-memory storage device for the largest Internet software applications. Extrapolating the past developments, it is safe to forecast that the main memory capacity of commodity as well as high-end servers is growing faster than the largest business customers' requirements. For example, Intel announced a large multi-core processor with several TB of main memory as part of its so-called Tera Scale initiative [4]. We are currently in the process of ordering a TB server from Dell for a "mere" 60,000 Euros.

The transaction rate of such a large scale enterprise with 15 billion Euro revenue can be estimated at about 32 order lines per second. Even though the arrival rate of such business transactions is highly skewed (e.g., Christmas sales peaks), it is fair to assume that the peak load will be below a few thousand order lines per second.
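The arithmetic behind these two estimates can be made explicit; the following derivation uses only the figures quoted above (15 billion Euros yearly revenue, 15 Euros and 54 bytes per order line):

\[ \frac{15\cdot 10^{9}\ \text{Euros/year}}{15\ \text{Euros/order line}} = 10^{9}\ \text{order lines/year}, \qquad 10^{9}\cdot 54\ \text{bytes} = 54\ \text{GB/year} \]

\[ \frac{10^{9}\ \text{order lines/year}}{365\cdot 24\cdot 3600\ \text{s/year}} \approx 32\ \text{order lines/s} \]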
For our HyPer system we adopt a main-memory architecture for transaction processing. We follow the lock-less approach first advocated in [5], whereby all OLTP transactions are executed sequentially – or on private partitions. This architecture obviates the need for costly locking and latching of data objects or index structures, as the sole update transaction "owns" the entire database – or its private partition of the database. Obviously, this serial execution approach is only viable for a pure main memory database, where there is no need to mask IO operations on behalf of one transaction by interleavingly utilizing the CPUs for other transactions. In a main-memory architecture a typical business transaction (e.g., an order entry or a payment processing) has a duration of only a few up to ten microseconds. Such a system's viability for OLTP processing was previously proven in a research prototype named H-Store [6], conducted by researchers led by Mike Stonebraker at MIT, Yale and Brown University. The H-Store prototype was recently commercialized by a start-up company named VoltDB.

However, the H-Store architecture is limited to OLTP transaction processing only. If we simply allowed complex OLAP-style queries to be injected into the workload queue, they would clog the system, as all subsequent OLTP transactions have to wait for the completion of such a long-running query. Even if such OLAP queries finish within, say, 30 ms, they lock the system for a duration in which around 1000 or more OLTP transactions could have completed.

Nevertheless, our goal was to architect a main-memory database system that can
• process OLTP transactions at rates of tens or hundreds of thousands per second as efficiently as dedicated OLTP main memory systems such as VoltDB or TimesTen, and, at the same time,
• process OLAP queries on up-to-date snapshots of the transactional data as efficiently as dedicated OLAP main memory DBMS such as MonetDB or TREX.

[Fig. 1. Hybrid OLTP&OLAP Database Architecture: a hybrid OLTP&OLAP high-performance database system that is as efficient as dedicated OLTP main memory DBMS (e.g., VoltDB, TimesTen) and as fast as dedicated OLAP main memory DBMS (e.g., MonetDB, TREX)]

This challenge is sketched in Figure 1. We architected such an efficient hybrid system, called HyPer, that can handle both OLTP and OLAP simultaneously by using hardware-assisted replication mechanisms to maintain consistent snapshots of the transactional data. HyPer is a main-memory database system that guarantees the ACID properties of OLTP transactions. In particular, we devised logging and backup archiving schemes for durability, atomicity and fast recovery. In parallel to the OLTP processing, HyPer executes OLAP query sessions (multiple queries) on the same, arbitrarily current and consistent snapshot. These snapshots are created by forking the OLTP process and thereby creating a consistent virtual memory snapshot. This snapshot is kept consistent via the implicit OS/processor-controlled lazy copy-on-write mechanism. The utilization of the processor-inherent support for virtual memory management (address translation, caching, copy on update) accomplishes both in the same system and at the same time: unprecedentedly high transaction rates of millions of transactions per minute, as high as any OLTP-optimized database system, and ultra-low OLAP query response times, as low as the best OLAP-optimized column stores. These numbers were achieved on a commodity desktop server. Even the creation of a fresh, transaction-consistent snapshot can be achieved in subseconds.

II. RELATED WORK/SYSTEMS

HyPer is a new RISC-style database system [7] like RDF-3X [8] (albeit for a very different purpose). Both systems are developed from scratch. Thereby, historically motivated ballast of traditional database systems is omitted and new hardware and OS functionality can be leveraged.

The development of main memory database systems (or in-memory DBMS) originally started for the OLTP world. TimesTen [9] was among the first such systems; it was recently acquired by Oracle and primarily serves as a "front" cache for the Oracle mainstream database system. P*TIME/Transact in Memory [10] was acquired by SAP in 2005. SolidDB of Solid Information Technology is a main memory DB developed in Helsinki; in the meantime IBM took over this company. For SolidDB, tuple-level snapshots [11] were proposed that are kept consistent by tuple shadowing instead of page shadowing. The authors report a 30% transactional throughput increase and a smaller main memory footprint. Page-level shadowing dates back to the early ages of relational database system development [12]. In HyPer we rely on hardware-supported page shadowing that is controlled by the processor's memory management unit (MMU). For disk-based database systems shadowing was not really successful because it destroys the page clustering. This hurts the scan
performance, e.g., for a full table scan, as the disk's read/write head has to be moved. HyPer is based on virtual memory supported shadow paging, where scan performance is not hurt by shadowing: in main memory there is no difference between accessing two consecutive physical memory pages versus accessing two physical pages that are further apart. Furthermore, the snapshots based on VM shadowing do not affect the logical page layout, i.e., potentially non-sequential physical page accesses are hidden by the hardware.

Most recently, the drastic increase of main memory capacity and the demand for real-time/operational business intelligence has led to a revival of main memory database system research and commercial development. The recent main-memory database systems can be separated by their application domain: OLAP versus OLTP. MonetDB is the most influential database research project on column store storage schemes for an in-memory OLAP database. An overview of the system can be found in the summary paper [13], presented on the occasion of receiving the 10-year test of time award of the VLDB conference. TREX [14] is SAP's most prominent database project that relies, like MonetDB, on the column-major storage scheme. It is now known as Business Warehouse Accelerator and serves as the basis for SAP's business intelligence functionality. According to Hasso Plattner's keynote at SIGMOD 2009 [2], SAP intends to extend it to include OLTP functionality and then make it the basis for hosted applications, e.g., Business by Design. The hybrid system is apparently a combination of TREX and P*TIME and relies on merging the OLTP updates periodically into the column store of the OLAP TREX database [15]. In HyPer this merge is implicit and hardware-supported by creating a new VM snapshot.

Based on an early study for banking transactions [16], the authors of H-Store [17], [6], [5] deserve the credit for analyzing the overhead imposed by various traditional database management features (buffer management, logging, locking, etc.). They proved the feasibility of a main memory database system that processes transactions sequentially without synchronization overhead. VoltDB [18] is the commercialization of H-Store. The published VoltDB performance numbers are largely due to database partitioning across a compute cluster. [19] devised synchronization concepts for allowing inter-partition transactions. Ulusoy and Buchmann [20] investigated main memory database partitioning for optimized concurrency control for real-time applications. The automatic derivation of partitioning schemes is an old research issue of distributed database design and receives renewed interest [21]. HyPer's partitioning technique (cf. Section III-D) is primarily used for intra-node parallelism and is particularly beneficial for multi-tenancy database applications [22].

Crescando is a research project at ETH Zürich [23] that processes queries in a batch by periodically scanning all the data, in a similar fashion as executing continuous queries over streaming data. At EPFL Lausanne several projects around the database system Shore have the goal to optimize the locking [24] and logging [25] performance on modern multi-core processors. Blink and its commercial product IBM Smart Analytics Optimizer (ISAO) [26], [27] are recent developments at IBM to augment an OLTP database system with an in-memory database for OLAP queries. Their original design was based on materializing all the joins and using compression to reduce the size of the resulting in-memory data.

III. SYSTEM ARCHITECTURE

The HyPer architecture was devised such that OLTP transactions and OLAP queries can be performed on the same main memory resident database – without interfering with each other. In contrast to old-style disk-based storage servers, we omitted any database-specific buffer management and page structuring. The data resides in quite simple, main-memory optimized data structures within the virtual memory. Thus, we can exploit the OS/CPU-implemented address translation at "full speed" without any additional indirection. We currently experiment with the two predominant relational database storage schemes: in the row store approach we maintain relations as arrays of entire records, and in the column store approach the relations are vertically partitioned into vectors of attribute values. Currently, the HyPer prototype is globally configured to operate as a column or a row store – but in future work the table layout will be adjustable according to the access patterns. Even though the virtual memory can (significantly) outgrow the physical main memory, we limit the database to the size of the physical main memory in order to avoid OS-controlled swapping of virtual memory pages.

A. OLTP Processing

Since all data is main-memory resident, there will never be a halt to await IO. Therefore, we can rely on a single-threading approach first advocated in [5], whereby all OLTP transactions are executed sequentially. This architecture obviates the need for costly locking and latching of data objects, as the sole update transaction "owns" the entire database. Obviously, this serial execution approach is only viable for a pure main memory database, where there is no need to mask IO operations on behalf of one transaction by interleavingly utilizing the CPUs for other transactions. In a main-memory architecture a typical business transaction (e.g., an order entry or a payment processing) has a duration of only around ten µs. This translates to throughputs in the order of tens of thousands per second, much more than even large scale business applications require – as analyzed in the Introduction.

The serial execution of OLTP transactions is exemplified in Figure 2 by the queue on the left-hand side, in which the transactions are serialized to await execution. The transactions are implemented as stored procedures in a high-level scripting language. This language provides the functionality to look up database entries by search key, iterate through sets of objects, insert, update and delete data records, etc. The high-level scripting code is then compiled by the HyPer system into low-level code that directly manipulates the in-memory data structures.
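The paper does not spell out this scripting language, so the following C fragment is only a hypothetical illustration of what the compiled result of such a stored procedure might look like: a payment-style transaction reduced to direct reads and writes on main-memory arrays, with no buffer manager, no latching and no I/O on its code path (all names and layouts are invented for the example):

    #include <stdint.h>

    /* Hypothetical column-store layout: each attribute is a plain array. */
    typedef struct {
        int64_t *c_balance;    /* customer balances          */
        int64_t *w_ytd;        /* warehouse year-to-date sum */
        uint32_t n_customers;
    } Database;

    /* "Compiled" payment transaction: straight-line code over in-memory
     * data. Because OLTP transactions are executed serially, no locks or
     * latches are taken; the few memory accesses finish in microseconds. */
    int tx_payment(Database *db, uint32_t c_id, uint32_t w_id, int64_t amount)
    {
        if (c_id >= db->n_customers)
            return -1;                 /* abort: rolled back via undo log */
        db->c_balance[c_id] -= amount;
        db->w_ytd[w_id]     += amount;
        return 0;                      /* commit */
    }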
Obviously, the OLTP transactions have to guarantee short response times in order to avoid long waiting times for subsequent transactions in the queue. This prohibits any kind of interactive transactions, e.g., requesting user input or synchronously invoking a credit card check of an external agency. This, however, does not constitute a real limitation, as our experience with high-performance business applications such as SAP R/3 [28], [29] reveals that these kinds of interactions occur outside the database context in the application servers.¹
¹Nevertheless, we are currently devising an optimistic lock-less concurrency scheme for long-running transactions being executed in our system.

[Fig. 2. OLTP requests/transactions are processed serially from a queue while OLAP query sessions work on a forked virtual memory snapshot of the database (data items a, b, c, d; an update turns a into a').]

B. OLAP Snapshot Management

If we simply allowed complex OLAP-style queries to be injected into the OLTP workload queue, they would clog the system, as all subsequent OLTP transactions have to wait for the completion of such a long-running query. Even if such OLAP queries finish within, say, 30 ms, they lock the system for a duration in which possibly thousands of OLTP transactions could have completed. To achieve our goal of architecting a main-memory database system that
• processes OLTP transactions at rates of tens of thousands per second, and, at the same time,
• processes OLAP queries on up-to-date snapshots of the transactional data,
we exploit the operating system's functionality to create virtual memory snapshots for new, duplicated processes. In Unix, for example, this is done by creating a child process of the OLTP process via the fork() system call. To guarantee transactional consistency, the fork() should only be executed in between two (serial) transactions, never in the middle of one transaction. In Section IV-F we will relax this constraint by utilizing the undo log to convert an action consistent snapshot (created in the middle of a transaction) into a transaction consistent one.

The forked child process obtains an exact copy of the parent process's address space, as exemplified in Figure 2 by the overlayed page frame panel. This virtual memory snapshot that is created by the fork() operation will be used for executing a session of OLAP queries – as indicated on the right-hand side of Figure 2.
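The consistency guarantee of such a forked snapshot can be demonstrated with a few lines of C. The sketch below is a minimal stand-alone illustration (not HyPer code): the child process sums over the state as of the fork() and prints 0, no matter how far the parent's concurrent updates have progressed, because every parent update goes to a copy-on-write replica of the affected page.

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define N 1000000

    int main(void)
    {
        /* the "database": a main-memory resident array of balances */
        long *balance = calloc(N, sizeof *balance);
        if (balance == NULL)
            return 1;

        pid_t pid = fork();              /* create the snapshot           */
        if (pid == 0) {                  /* child = OLAP query session    */
            long sum = 0;
            for (int i = 0; i < N; i++)  /* scans the state at fork time  */
                sum += balance[i];
            printf("OLAP snapshot sum: %ld\n", sum);   /* always 0 */
            _exit(0);
        }
        for (int i = 0; i < N; i++)      /* parent = OLTP process; these   */
            balance[i] += 1;             /* writes hit copy-on-write pages */
        waitpid(pid, NULL, 0);
        free(balance);
        return 0;
    }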
The snapshot stays in precisely the state that existed at the time the fork() took place. Fortunately, state-of-the-art operating systems do not physically copy the memory segments right away. Rather, they employ a lazy copy-on-update strategy – as sketched out in Figure 3. Initially, parent process (OLTP) and child process (OLAP) share the same physical memory segments by translating either virtual address (e.g., to object a) to the same physical main memory location. The sharing of the memory segments is highlighted in the graphics by the dotted frames. A dotted frame represents a virtual memory page that was not (yet) replicated. Only when an object, like data item a, is updated, the OS- and hardware-supported copy-on-update mechanism initiates the replication of the virtual memory page on which a resides. Thereafter, there is a new state denoted a' accessible by the OLTP process that executes the transactions, and the old state denoted a that is accessible by the OLAP query session. Unlike the figure suggests, the additional page is really created for the OLTP process that initiated the page change, and the OLAP snapshot refers to the old page – this detail is important for estimating the space consumption if several such snapshots are created (cf. Figure 4).

Another intuitive way to view the functionality is as follows: the OLTP process operates on the entire database, part of which is shared with the OLAP module. All OLTP changes are applied to a separate copy (area), the Delta – consisting of copied (shadowed) database pages. Thus, the OLTP process creates its working set of updated pages on demand. This is somewhat analogous to swapping pages into a buffer pool – however, the copy on demand of updated pages is three to four orders of magnitude faster, as it takes only 2 µs to copy a main memory page instead of 10 ms to handle a page fault in the buffer pool. Every "now and then" the Delta is merged with the OLAP database by forking a new process for an up-to-date OLAP session. Thereby, the Delta is conceptually re-integrated into the (main snapshot) database. Unlike any software solution for merging a Delta back into the main database, our hardware-supported virtual memory merge (fork) can be achieved very efficiently in subseconds.

The replication (into the Delta) is carried out at the granularity of entire pages, which usually have a default size of 4 KB. In our example, the state change of a to a' induces not only the replication of a but also of all other data items on this page, such as b, even though they have not changed. This is the price we opt to pay in exchange for relying on the very effective and fast virtual memory management by the OS and the processor, such as ultra-efficient VM address transformation
via TLB caching and copy-on-write enforcement. Also, it should be noted that the replicated pages only persist until the OLAP session terminates – usually within seconds or minutes. Traditional shadowing concepts in database systems are based on pure software mechanisms that maintain shadow copies at the page level [30] or shadow individual objects [11].

Our snapshots incur storage overhead proportional to the number of pages updated (and therefore replicated) by the OLTP process.
D. Database Partitioning

There are many application scenarios where it is natural to partition the data. One very important application class for this is multi-tenancy – as described in [22]. The different database users (called tenants) work on the same or similar database schemas but do not share their transactional data. Rather, they maintain their private partitions of the data. Only some read-mostly data (e.g., product catalogs, geographical information, business information catalogs like Dun & Bradstreet) is shared among the different tenants.

Interestingly, the TPC-C benchmark exhibits a similar partitioning, as most of the data can be partitioned horizontally by the Warehouse to which it belongs. The only exception is the Items table, which corresponds to our read-mostly, shared data partition.

In such a partitioned application scenario, HyPer's OLTP process can be configured as multiple threads – to increase performance even further via parallelism. This is sketched out in Figure 5. As long as the transactions access and update only their private partition and access (but do not update) the shared data, we can run multiple such transactions in parallel – one per partition. This is shown in the figure, where each oval (representing a transaction) inside the panel corresponds to one such partition-constrained transaction executed by a separate thread.

[Fig. 5. Multi-Threaded OLTP Processing on Partitioned Data: OLTP requests/transactions operate on private partitions Ptn 1 through Ptn 4 and on a read-mostly shared data partition in virtual memory, with OLAP queries running on a snapshot.]

However, transactions reading across partitions or updating the shared data partition require synchronization. For the VoltDB partitioned database, two synchronization methods were analyzed in [21]: a lock-based approach and an optimistic method that may necessitate cascaded roll-backs.

In our current HyPer prototype, cross-partition transactions request exclusive access to the system – just as in our initial purely sequential approach. This is sufficiently efficient in a central system where all partitions reside on one node. However, if the nodes are distributed across a compute cluster, which necessitates a two-phase commit protocol for multi-partition transactions, more advanced synchronization approaches are beneficial. The synchronization aspects are further detailed in Section IV-C.

OLAP snapshots can be forked as before – except that we have to quiesce all threads before this can be done in a transaction consistent manner. Again, we refer to Section IV-F for a relaxation of this requirement by transforming action consistent snapshots into transaction consistent ones via the undo log. The OLAP queries can be formulated across all partitions and the shared data, which is even needed in multi-tenancy applications for administrative purposes.

The partitioning of the database can be further exploited for a distributed system that allocates the private partitions to different nodes in a compute cluster. The read-mostly, shared partition can be replicated across all nodes. Then, partition-constrained transactions can be transferred to the corresponding node and run in parallel without any synchronization overhead. Synchronization is needed for partition-crossing transactions and for the synchronized snapshot creation across all nodes.

IV. TRANSACTION SEMANTICS AND RECOVERY

Our OLTP/OLAP transaction model corresponds most closely to the multiversion mixed synchronization method as described by Bernstein, Hadzilacos and Goodman [31] (Section 5.5). In this model, updaters (in our terminology OLTP transactions, including the read-only OLTP transactions) are fully serializable, and read-only queries (our OLAP queries) access the database in a "frozen" transaction consistent state that existed at a point in time before the query was started.

Recently, such relaxed synchronization methods have regained attention, as full serializability was, in the past, considered too costly for scalable systems. HyPer achieves both: utmost scalability via OLAP snapshots and full serializability for OLTP processing. A variation of the multiversion synchronization is called snapshot isolation and was first described in [32]. It currently gains renewed interest in the database research community – see, e.g., [33], [34]. Herein, the snapshot synchronization applies not only to read-only queries but also to the read requests in update transactions.

A. Snapshot Isolation of OLAP Query Sessions

In snapshot isolation, a transaction/query continuously sees the transaction consistent database state as it existed at a point in time (just) before the transaction started. There are different possibilities to implement such a snapshot – while database modifications are running in parallel:

Roll-Back: This method, as used in Oracle, updates the database objects in place. If an older query requires an older version of a data item, it is created by undoing all updates on this object. Thus, an older copy of the object is created in a so-called roll-back segment by reversely applying all undo log records up to the required point in time.

Versioning: All object updates create a new timestamped version of the object. Thus, a read on behalf of a query retrieves the youngest version (largest timestamp) whose timestamp is smaller than the starting time of the query. The versioned objects are either maintained durably (which allows time travelling queries) or temporarily until no more active query needs to access them.

Shadowing [30]: Originally, shadowing was invented to obviate the need for undo logging, as all changes were written to shadows first and then installed in the database at transaction commit time. However, the shadowing concept can also be applied to maintaining snapshots.
Virtual Memory Snapshots: Our snapshot mechanism explicitly creates a snapshot for a series of queries, called a query session. In this respect, all queries of a query session are bundled into one transaction that can rely on the transaction consistent state preserved via the fork() process.

[Fig. 6. Durable Redo and Volatile Undo Logging: OLTP requests/transactions write redo records to durable storage; before-images go to a volatile undo log ring buffer; OLAP sessions and a backup process work on forked virtual memory snapshots (object states a, a', a'', a'''); the backup process writes a transaction-consistent DB archive.]

B. Transaction Consistent Archiving

We can also exploit the VM snapshots for creating backup archives of the entire database on non-volatile storage. This process is sketched on the lower right-hand side of Figure 6. Typically, the archive is written via a high-bandwidth network of 1 to 10 Gb/s to a dedicated storage server within the same compute center. It is beneficial to use an rDMA interface (e.g., Myrinet or Infiniband) in order to unburden the server's CPU from the data transmission task. To maintain this transfer speed, the storage server has to employ several (more than 10) disks for a corresponding aggregated bandwidth.

C. OLTP Transaction Synchronization

In the single-threaded mode, the OLTP transactions do not need any synchronization mechanisms as they own the entire database.

In the multi-threaded mode (cf. Section III-D) we distinguish two types of transactions:
• partition-constrained transactions can read and update the data in their own partition as well as read the data in the shared partition. However, the updates are limited to their own partition.
• partition-crossing transactions are those that, in addition, update the shared data or access (read or update) data in another partition.

Partition-crossing transactions should be rare, as updates to shared data seldom occur and the partitioning is derived such that transactions usually operate only on their own data. The classification of the stored procedure transactions in the OLTP workload is done automatically, based on analyzing their implementation code and their invocation parameters. If, during execution, it turns out that a transaction was erroneously classified as "partition-constrained", it is rolled back and re-inserted into the OLTP workload queue as "partition-crossing".

The HyPer system admits at most one partition-constrained transaction per partition in parallel. Therefore, there is no need for any kind of locking or latching, as the partitions have non-overlapping data structures and the shared data is accessed read-only.

A partition-crossing transaction, however, has to be admitted in exclusive mode. In essence, it has to preclaim an exclusive lock (or, in POSIX terminology, it has to pass a barrier before being admitted) on the entire database before it is admitted. Thus, the execution of partition-crossing transactions is relatively costly, as they have to wait until all other transactions are terminated, and for their duration no other transactions are admitted. Once admitted to the system, the transaction runs at full speed, as the exclusive admittance of partition-crossing transactions again obviates any kind of locking or latching synchronization of the shared data partition or the private data partitions.
D. Durability

The durability of transactions requires that all effects of committed transactions have to be restored after a failure. To achieve this, we employ classical redo logging in HyPer. This is highlighted by the gray/pink ovals emanating from the serial transaction stream leading to the non-volatile Redo-Log storage device in Figure 6. We employ logical redo logging [35] by logging the parameters of the stored procedures that represent the transactions. In traditional database systems, logical logging is problematic because after a system crash the database may be in an action-inconsistent state. This cannot happen in HyPer, as we restart from a transaction consistent archive (cf. Figure 6). It is only important to write these logical log records in the order in which they were executed, in order to be able to correctly recover the database. In the single-threaded OLTP configuration this is easily achieved. For the multi-threaded system, only the log records of the partition-crossing transactions have to be totally ordered relative to all transactions, while the partition-constrained transactions' log records may be written in parallel and thus only sequentialized per partition.
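A logical redo record therefore only has to name the stored procedure and its parameters, and recovery replays the invocations in log order against the transaction consistent archive. The sketch below illustrates this (the record layout is invented; Database and tx_payment refer to the hypothetical example of Section III-A):

    #include <stdio.h>
    #include <stdint.h>

    typedef struct Database Database;            /* see the Section III-A sketch */
    int tx_payment(Database *db, uint32_t c_id, uint32_t w_id, int64_t amount);

    /* Logical redo record: which procedure ran with which parameters --
     * not the physical page images it happened to modify.              */
    typedef struct {
        uint64_t lsn;          /* log sequence number: defines replay order */
        uint32_t c_id, w_id;   /* stored procedure parameters               */
        int64_t  amount;
    } RedoRecord;

    /* Append a record to the redo log stream (forced to stable storage
     * at commit time, or batched -- see "Optimization of the Logging"). */
    void redo_log_append(FILE *log, const RedoRecord *r)
    {
        fwrite(r, sizeof *r, 1, log);
    }

    /* Recovery: restart from the archive, then re-execute the logged
     * procedure invocations in LSN order.                              */
    void redo_replay(FILE *log, Database *db)
    {
        RedoRecord r;
        while (fread(&r, sizeof r, 1, log) == 1)
            tx_payment(db, r.c_id, r.w_id, r.amount);
    }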
High Availability and OLAP Load Balancing via a Secondary Server: The redo log stream can also be utilized to maintain a secondary server. This secondary HyPer server merely executes the same transactions as the primary server. In case of a primary server failure, the transaction processing is switched over to the secondary server. However, we do not propose to abandon the writing of redo log records to stable storage and to rely only on the secondary server for fault tolerance: a software error may – in the worst case – lead to a "synchronous" crash of primary and secondary servers.

The secondary server is typically under less load, as it need not execute any read-only OLTP transactions and, therefore, has less OLTP load than the primary server. This can be exploited by delegating some (or all) of the OLAP query sessions to the secondary server. Instead of – or in addition to – forking an OLAP session's process on the primary server, we could just as well use the secondary server.

The usage of a secondary server that acts as a stand-by for OLTP processing and as an active OLAP processor is illustrated in Figure 7. Not shown in the figure is the possibility to use the secondary server instead of the primary server for writing a consistent snapshot to a storage server's archive. Thereby, the backup process is delegated from the primary to the less-loaded secondary server.

[Fig. 7. Secondary Server: Stand-By for OLTP and Active for OLAP: the redo log stream of the OLTP requests/transactions is replayed on the secondary server, whose virtual memory snapshots (object states a, a', a'', a''') serve the OLAP sessions and the backup process.]

Optimization of the Logging: The write-ahead logging (WAL) principle may turn out to become a performance bottleneck, as it requires flushing log records before committing a transaction. This is particularly costly in a single-threaded execution, as the transaction – and all succeeding ones – have to wait.

Two commonly employed strategies that were already described by DeWitt et al. [36] and extended in the recent paper about the so-called Aether system [25] are possible: group commit or asynchronous commit.

Group commit is, for example, configurable in DB2 or MS SQL Server. A final commit of a transaction is not executed right after the end of a transaction. Rather, log records of several transactions are accumulated and flushed in a batched mode. Thus, the acknowledgment of a commit is delayed. While waiting for the batch of transactions to complete and their log records being flushed, all their locks are already freed. This is called early log release (ELR) and does not jeopardize the serializability correctness. In our non-locking system, this translates to admitting the next transaction(s) for the corresponding partition – viewing admission as granting an exclusive lock for the entire partition. Once the log buffer is flushed for the group of transactions, their commit is acknowledged to the client.

Another, less safe method can be configured in Oracle and PostgreSQL. It relaxes the WAL principle by avoiding the wait for the flushing of the log records: as soon as the log records are written into the volatile log buffer, the transaction is committed. This is called asynchronous commit. In the case of a failure, some of these log records may be lost and thus the recovery process will miss those committed transactions during restart.
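The following sketch condenses the group commit logic (illustrative only): commit records accumulate in the buffered log stream, one flush makes the whole batch durable, and only then are the clients' commits acknowledged, while early log release has already admitted the successor transactions. Dropping the fsync() turns the same code into asynchronous commit.

    #include <stdio.h>
    #include <unistd.h>

    #define BATCH 32

    typedef struct {
        FILE *log;      /* buffered redo log stream     */
        int   pending;  /* commits staged but not acked */
    } GroupCommitter;

    /* Called at the logical end of a transaction. Early log release:
     * the successor transaction of this partition may be admitted as
     * soon as this returns; the client acknowledgment waits for the
     * batched flush below.                                            */
    void tx_commit(GroupCommitter *gc, const void *record, size_t len)
    {
        fwrite(record, len, 1, gc->log);   /* volatile log buffer only */
        if (++gc->pending == BATCH) {
            fflush(gc->log);               /* one write ...                */
            fsync(fileno(gc->log));        /* ... forced to stable storage */
            /* now acknowledge all BATCH pending commits to the clients   */
            gc->pending = 0;
        }
    }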
E. Atomicity

The atomicity of transactions requires being able to eliminate any effects of a failed transaction from the database. We only have to consider explicitly aborted transactions, i.e., the so-called R1-recovery. The so-called R3-recovery, which demands that updates of "loser" transactions (those that were active at the time of the crash) are undone in the restored database, is not needed in HyPer, as the database is in volatile memory only and the logical redo logs are written only at the time when the successful commit of the transaction is guaranteed. Furthermore, the archive copy of the database that serves as the starting point for the recovery is transaction consistent and, therefore, does not contain any operations that need to be undone during recovery (cf. Figure 6). As a consequence, undo logging is only needed for the active transaction (in multi-threaded mode for all active transactions) and can be maintained in volatile memory only. This is highlighted in Figure 6 by the ring buffer in the top left side of the page frame panel. During transaction processing, the before-images of any updated data objects are logged into this buffer. The size of the ring buffer is quite small, as it is bounded by the number of updates per transaction (times the number of active transactions in multi-threaded operation).

F. Cleaning Action Consistent Snapshots

Undo logging can also be used to create a transaction consistent snapshot out of an action-consistent VM snapshot that was created while some transactions were still active. This is particularly beneficial in a multi-threaded OLTP system, as it avoids completely quiescing transaction processing. After forking the OLAP process, including its associated VM snapshot, the undo log records are applied to the snapshot state – in reverse chronological order. As the undo log buffer reflects all effects of active transactions (at the time of the fork) – and only those – the resulting snapshot is transaction-consistent and reflects the state of the database before initiation of the transactions that were still active at the time of the fork.
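Both uses of the undo log, rolling back an explicitly aborted transaction and cleaning an action-consistent snapshot, reduce to applying before-images in reverse chronological order. A compact sketch of the volatile ring buffer (the record layout is invented; real records would carry variable-length before-images):

    #include <stdint.h>
    #include <string.h>

    #define UNDO_SLOTS 1024      /* bounded by the updates of the active
                                    transaction(s); never written to disk */

    typedef struct {             /* before-image of one updated data item */
        void   *addr;            /* where the in-place update happened    */
        uint8_t old[8];          /* prior value (fixed width in this sketch) */
    } UndoRecord;

    static UndoRecord undo[UNDO_SLOTS];
    static unsigned   undo_top;          /* next free slot */

    /* Log the before-image just before an in-place update. */
    void undo_log(void *addr)
    {
        UndoRecord *r = &undo[undo_top++ % UNDO_SLOTS];
        r->addr = addr;
        memcpy(r->old, addr, sizeof r->old);
    }

    /* Applying the records in reverse order either (a) rolls back an
     * aborted transaction (R1-recovery) or (b), executed inside the
     * freshly forked OLAP process, turns an action-consistent snapshot
     * into a transaction-consistent one.                               */
    void undo_apply_reverse(void)
    {
        while (undo_top > 0) {
            UndoRecord *r = &undo[--undo_top % UNDO_SLOTS];
            memcpy(r->addr, r->old, sizeof r->old);
        }
    }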
V. EVALUATION

Fig. 9. Performance Comparison: HyPer OLTP&OLAP, MonetDB only OLAP, VoltDB only OLTP.
Each HyPer configuration reports OLTP throughput and query response times; MonetDB runs
no OLTP (1 query stream); VoltDB runs no OLAP, only OLTP (results from [18]).

  Query No.   HyPer: 1 query session    HyPer: 8 query sessions   HyPer: 3 query sessions   MonetDB
              (single-threaded OLTP)    (single-threaded OLTP)    (5 OLTP threads)          (1 query stream)
              query resp. (ms)          query resp. (ms)          query resp. (ms)          query resp. (ms)
  Q1          67                        71                        71                        63
  Q2          163                       233                       212                       210
The 22 TPC-H queries were re-formulated for the TPC-CH schema of Figure 8. In the re-formulation we made sure that the queries retained their semantics (from a business point of view) and their syntactical structure. The OLAP queries do not benefit from database partitioning, as they all require scanning the data across all partition boundaries. For example, Query Q5 of the TPC-H benchmark lists the revenue achieved through local suppliers and is re-formulated on our TPC-CH schema as follows:

  select n_name, sum(ol_amount) as revenue
  from Nation join Customer on ... join Order on ...
       join Order-Line on ... join Stock on ...
       join Supplier on ... join Region on ...
  where su_nationkey=n_nationkey /* Cu and Su in the */
    and r_name='Europe'          /* same N of this R */
    and o_entry_d>= ...
  group by n_name
  order by revenue desc;

C. Performance of Different HyPer Configurations

All benchmarks were carried out on a TPC-C setup with 12 warehouses. Thus, the initial database contained 360,000 customers with 3.6 million order lines – totalling about 1 GB of net data. For reproducibility reasons, all query sessions were started (fork-ed) at the beginning of the benchmark (i.e., in the initial 12-warehouse state) and the 22 queries were run – in altered sequence, to exclude caching effects – five times within each query session. Thus, each OLAP session/process was executing 110 queries sequentially. We report the median of each query's response times. These query sessions were executed in parallel to a single- or multi-threaded OLTP process – see Figure 9.

HyPer can be configured as a row store or as a column store. For OLTP we did not experience a significant performance difference; however, the OLAP query processing was significantly sped up by a column-wise storage scheme. Therefore, we only report the OLTP and OLAP performance of the column store configurations.

The HyPer benchmark as well as the MonetDB query benchmark were run on a commodity server with the following specifications:
• Dual Intel X5570 Quad-Core CPU, 8 MB cache
• 64 GB RAM
• 16 × 300 GB SAS HDs (not used in the benchmarks)
• Linux operating system RHEL 5.4
• Price: 13,886 Euros (discounted price for universities)

The OLTP performance of VoltDB that we list for comparison was not measured on our hardware but extracted from the product overview brochure [18] and discussions on their web site [37]. The VoltDB benchmark was carried out on similar hardware (dual quad-core Xeon CPU Dell R610 servers). The major difference was that the HyPer benchmark was run on a single server, whereas VoltDB was scaled out to 6 nodes. In addition, the HyPer benchmark was carried out with redo logging to another storage server, while VoltDB was run without any logging or replication.

HyPer's throughput results obtained on a single commodity server correspond to the published throughput results of VoltDB [18] on a 6-node cluster. As the VoltDB publications point out [18], these throughput numbers correspond to the very best published TPC-C results for high-scaled disk-based database configurations. The HyPer OLTP throughput numbers were even achieved while one, eight, or three parallel OLAP processes were continuously executing the OLAP queries in parallel to the OLTP workload (cf. Figure 9 from left to right).
The VoltDB system cannot support the parallel session(s) of OLAP queries. The performance results in comparison with MonetDB reveal that the two query execution engines essentially have the same performance. For those outlier queries where the response times vastly differ, we simply failed to "tweak" MonetDB (e.g., by hints or query rewrites or query unnesting) to execute the same logical plan as HyPer. Furthermore, the out-of-the-box MonetDB installation we used does not appear to employ the advanced "cracking" technique that horizontally partitions the columns on demand to optimize similar queries executed in sequence. MonetDB was run as a dedicated OLAP engine, as we could not effectively execute the OLTP workload on MonetDB – the lack of indexes prevents any reasonable throughput on the TPC-C benchmark.

D. Memory Consumption

In these experiments we monitored the memory consumption to assess the overhead imposed by the copy-on-write mechanism that maintains the consistency of the forked OLAP sessions. To isolate the effect of snapshot maintenance from the transient query execution's memory consumption, the OLAP processes remained idle while the OLTP process was executing and maintaining the snapshot via the implicit copy-on-write. The lower curve (A) of Figure 10 shows the memory footprint of the pure OLTP system without any OLAP snapshot. The memory footprint increases proportionally to the volume of newly generated transactional data. The steps in this curve are due to resizing the data structures upon reaching the capacity of pre-allocated column vectors. The upper curve (B) shows the memory consumption of the system in which we forked an OLAP snapshot/process at the beginning of the OLTP transaction processing. We see that initially the OLTP process builds up its working set of replicated pages. The size of this working set – once created – does not increase much during the continuous benchmark run, as the updates concern mostly newly generated data – therefore the curves A and B run largely parallel. The "zig-zag" curve (C) shows the memory footprint of the system consisting of an OLTP process and an OLAP process that is initially forked and then refreshed at intervals of 500,000 transactions, i.e., terminated and re-forked. The memory consumption of this configuration oscillates between the pure OLTP system and the configuration with one "long duration" OLAP snapshot. The spikes (above the other OLTP&OLAP configuration, B) are due to artifacts of the storage allocation for increased vector sizes and process forking overhead (the memory footprint was measured at the OS level in number of physically allocated pages).

[Fig. 10. Memory Consumption: (A) pure OLTP, (B) OLTP & one stable OLAP snapshot, (C) OLTP & continuously refreshed (respawned) OLAP snapshot]

E. Scaling to Very Large Main Memory Sizes

Technological advances will soon allow main memory sizes of several TB capacity. For a default page size of 4 KB, a TB database has to manage a page table with 250 million entries, summing up to 4 GB in size. For such ultra-large scale main memory databases, the fork execution can be optimized in several ways:
1) Use lazy page table copying as devised by [38]. Only the top levels of the hierarchically structured page table are eagerly copied, whereas the lowest level with the so-called pte entries is copied on demand.
2) Fork only the secondary server and, while forking, buffer the incoming log records.
3) Increase the page size for segments of data objects that are most likely to be immutable.

Current operating systems and processors can accommodate different page sizes: for example, 4 KB as a default size and 2 MB for large segments. We propose to partition the data into two partitions: a so-called cold and a hot partition, which are maintained in self-organized fashion. An update of a cold tuple will initiate the exchange of this tuple with an aged ("cooled down") tuple from the hot partition. The cold partition is stored on large 2 MB pages, and the hot partition, which incurs the replication costs due to snapshot maintenance, is stored on default small 4 KB pages. The subsequent table demonstrates the costs of forking a main memory database of various sizes under the two different page sizes:

  DB size    small pages (4 KB)          large pages (2 MB)
  in MB      fork duration  per 1 MB DB  fork duration  per 1 MB DB
  409.6        7 ms         17 µs        0.087 ms       0.21 µs
  819.2       14 ms         17 µs        0.119 ms       0.15 µs
  1638.4      28 ms         17 µs        0.165 ms       0.10 µs
  4096        34 ms          8 µs        0.300 ms       0.07 µs
  8192        69 ms         14 µs        0.529 ms       0.06 µs
  16384      136 ms          8 µs        0.958 ms       0.06 µs
  32768      271 ms          8 µs        1.863 ms       0.06 µs
  40960      344 ms          8 µs        2.702 ms       0.06 µs
VI. SUMMARY

Our HyPer architecture is based on virtual memory supported snapshots on transactional data for multiple query sessions. Thereby, the two workloads – OLTP transactions and OLAP queries – are executed on the same data without
interfering with each other. The snapshot maintenance and the high processing performance in terms of OLTP throughput and OLAP query response times are achieved via hardware-supported copy on demand (i.e., copy on write) to preserve snapshot consistency. The detection of shared pages that need replication is done efficiently by the OS with Memory Management Unit (MMU) assistance. The concurrent transactional workload and the BI query processing use multi-core architectures effectively without concurrency interference – as they are separated via the VM snapshot.

In this way, HyPer achieves the query performance of OLAP-centric systems such as SAP's TREX and MonetDB and, in parallel on the same system, retains the high transaction throughput of OLTP-centric systems such as Oracle's TimesTen, SAP's P*TIME, or VoltDB's H-Store. As the OLAP snapshot can be as current as desired by forking a new OLAP session, we are convinced that HyPer's virtual memory snapshot approach is a promising architecture for real-time business intelligence systems.

While the current HyPer prototype is a single-server scale-up system, the VM snapshotting mechanism is orthogonal to a distributed architecture that scales out across a compute cluster – as we will demonstrate in the future. The snapshot mechanism could also be used in a data warehouse configuration where the transaction workload queue corresponds to a continuous refresh stream emanating from one or several OLTP systems. Then, the "data-owning" process corresponds to the installer of these updates, while the OLAP queries can be executed in parallel against consistent snapshots.

ACKNOWLEDGMENT

We thank Florian Funke and Michael Seibold for helping with the performance evaluation. We acknowledge the many colleagues with whom we discussed HyPer's virtual memory snapshot architecture.

REFERENCES

[1] J. Doppelhammer, T. Höppler, A. Kemper, and D. Kossmann, "Database performance in the real world - TPC-D and SAP R/3," in SIGMOD, 1997.
[2] H. Plattner, "A common database approach for OLTP and OLAP using an in-memory column database," in SIGMOD, 2009.
[3] J. K. Ousterhout, P. Agrawal, D. Erickson, C. Kozyrakis, J. Leverich, D. Mazières, S. Mitra, A. Narayanan, G. M. Parulkar, M. Rosenblum, S. M. Rumble, E. Stratmann, and R. Stutsman, "The case for RAMClouds: scalable high-performance storage entirely in DRAM," Operating Systems Review, vol. 43, no. 4, 2009.
[4] Intel, "Tera-scale computing research program," 2010, http://techresearch.intel.com/articles/Tera-Scale/1421.htm.
[5] S. Harizopoulos, D. J. Abadi, S. Madden, and M. Stonebraker, "OLTP through the looking glass, and what we found there," in SIGMOD, 2008.
[6] R. Kallman, H. Kimura, J. Natkins, A. Pavlo, A. Rasin, S. B. Zdonik, E. P. C. Jones, S. Madden, M. Stonebraker, Y. Zhang, J. Hugg, and D. J. Abadi, "H-Store: a high-performance, distributed main memory transaction processing system," PVLDB, vol. 1, no. 2, 2008.
[7] S. Chaudhuri and G. Weikum, "Rethinking database system architecture: Towards a self-tuning RISC-style database system," in VLDB, 2000.
[8] T. Neumann and G. Weikum, "The RDF-3X engine for scalable management of RDF data," VLDB Journal, vol. 19, no. 1, 2010.
[9] Oracle, Extreme Performance Using Oracle TimesTen In-Memory Database, http://www.oracle.com/technology/products/timesten/pdf/wp/wp_timesten_tech.pdf, July 2009.
[10] S. K. Cha and C. Song, "P*TIME: Highly scalable OLTP DBMS for managing update-intensive stream workload," in VLDB, 2004.
[11] A.-P. Liedes and A. Wolski, "SIREN: A memory-conserving, snapshot-consistent checkpoint algorithm for in-memory databases," in ICDE, 2006.
[12] R. A. Lorie, "Physical integrity in a large segmented database," TODS, vol. 2, no. 1, 1977.
[13] P. A. Boncz, S. Manegold, and M. L. Kersten, "Database architecture evolution: Mammals flourished long before dinosaurs became extinct," PVLDB, vol. 2, no. 2, 2009.
[14] C. Binnig, S. Hildenbrand, and F. Färber, "Dictionary-based order-preserving string compression for main memory column stores," in SIGMOD, 2009.
[15] J. Krüger, M. Grund, C. Tinnefeld, H. Plattner, A. Zeier, and F. Faerber, "Optimizing write performance for read optimized databases," in DASFAA, 2010.
[16] A. Whitney, D. Shasha, and S. Apter, "High volume transaction processing without concurrency control, two phase commit, SQL or C," Intl. Workshop on High Performance Transaction Systems, 1997.
[17] M. Stonebraker, S. Madden, D. J. Abadi, S. Harizopoulos, N. Hachem, and P. Helland, "The end of an architectural era (it's time for a complete rewrite)," in VLDB, 2007.
[18] VoltDB, "Overview," http://www.voltdb.com/pdf/VoltDBOverview.pdf, March 2010.
[19] C. Curino, Y. Zhang, E. P. C. Jones, and S. Madden, "Schism: a workload-driven approach to database replication and partitioning," in VLDB, 2010.
[20] Ö. Ulusoy and A. P. Buchmann, "A real-time concurrency control protocol for main-memory database systems," Inf. Syst., vol. 23, no. 2, 1998.
[21] E. P. C. Jones, D. J. Abadi, and S. Madden, "Low overhead concurrency control for partitioned main memory databases," in SIGMOD, 2010.
[22] S. Aulbach, D. Jacobs, A. Kemper, and M. Seibold, "A comparison of flexible schemas for software as a service," in SIGMOD, 2009.
[23] P. Unterbrunner, G. Giannikis, G. Alonso, D. Fauser, and D. Kossmann, "Predictable performance for unpredictable workloads," PVLDB, vol. 2, no. 1, 2009.
[24] I. Pandis, R. Johnson, N. Hardavellas, and A. Ailamaki, "Data-oriented transaction execution," in VLDB, 2010.
[25] R. Johnson, I. Pandis, R. Stoica, M. Athanassoulis, and A. Ailamaki, "Aether: A scalable approach to logging," in VLDB, 2010.
[26] V. Raman, G. Swart, L. Qiao, F. Reiss, V. Dialani, D. Kossmann, I. Narang, and R. Sidle, "Constant-time query processing," in ICDE, 2008.
[27] L. Qiao, V. Raman, F. Reiss, P. J. Haas, and G. M. Lohman, "Main-memory scan sharing for multi-core CPUs," PVLDB, vol. 1, no. 1, 2008.
[28] A. Kemper, D. Kossmann, and F. Matthes, "SAP R/3: A database application system (tutorial)," in SIGMOD, 1998.
[29] S. Finkelstein, D. Jacobs, and R. Brendle, "Principles for inconsistency," in CIDR, 2009.
[30] S. Bailey, "US patent 7,389,308 B2: Shadow paging," June 17, 2008; filed May 30, 2004; granted to Microsoft.
[31] P. A. Bernstein, V. Hadzilacos, and N. Goodman, Concurrency Control and Recovery in Database Systems. Addison-Wesley, 1987.
[32] H. Berenson, P. A. Bernstein, J. Gray, J. Melton, E. J. O'Neil, and P. E. O'Neil, "A critique of ANSI SQL isolation levels," in SIGMOD, 1995.
[33] M. J. Cahill, U. Röhm, and A. D. Fekete, "Serializable isolation for snapshot databases," TODS, vol. 34, no. 4, 2009.
[34] T. Neumann and G. Weikum, "x-RDF-3X: fast querying, high update rates, and consistency for RDF databases," in VLDB, 2010.
[35] J. Gray and A. Reuter, Transaction Processing: Concepts and Techniques. Morgan Kaufmann, 1993.
[36] D. J. DeWitt, R. H. Katz, F. Olken, L. D. Shapiro, M. Stonebraker, and D. A. Wood, "Implementation techniques for main memory database systems," in SIGMOD, 1984.
[37] VoltDB, VoltDB TPC-C-like Benchmark Comparison - Benchmark Description, https://community.voltdb.com/node/134, May 2010.
[38] D. McCracken, "Sharing page tables in the Linux kernel," in Proceedings of the Linux Symposium, Ottawa, CA: IBM Linux Technology Center, July 2003, http://www.kernel.org/doc/ols/2003/ols2003-pages-315-320.pdf.