Berkeley DB Java Edition Architecture
An Oracle White Paper, September 2006
1. INTRODUCTION
Berkeley DB Java Edition (JE) finds its functional roots in the Berkeley DB (DB)
embedded database library. DB was designed to provide persistent, application-
specific data storage in a fast, scalable, and easily administered package. It provides
the traditional atomicity, consistency, isolation, and durability (ACID) guarantees
of a relational database management system (RDBMS), without the need for
separate installation of a database server or the services of a database
administrator. In addition, Berkeley DB executes directly in the application’s
address space, allowing for a single-binary installation. This is an attractive
proposition for manufacturers of embedded devices, such as set-top boxes or
mobile phones, as well as for developers of large, heavily concurrent application
servers, such as messaging and directory servers.
Java Edition opens up many additional possibilities. First, it brings the functionality
of DB to environments that require 100% Java. Second, it creates the possibility of
providing storage directly inside application servers, which may not want to
rely on an external DBMS. Application servers frequently need robust data storage,
and relying on the existence of an RDBMS can be burdensome. Additionally,
applications written for application server environments may want the benefit of a
small-footprint, embedded data manager that does not incur the overhead of a
JDBC interface and associated process switch. Even for those applications where
an RDBMS using JDBC is appropriate, we anticipate JE will be used, as Berkeley
DB is often used, as a fast front-end cache for the RDBMS.
Whether used in conjunction with an RDBMS or used directly, JE supports a
variety of Java standards. For example, the Java Transaction API (JTA) specifies the
interfaces between a transaction manager and the resource managers participating in
a distributed transaction; JE can act as such a resource manager, allowing its
transactions to take part in JTA/XA transactions.
2. ARCHITECTURAL OVERVIEW
As JE is functionally compatible with Berkeley DB, we have retained the key
concepts and definitions from the initial Berkeley DB product. We begin by
introducing these concepts and the terms Berkeley DB uses for them.
2.1 Terminology
A transaction is a group of operations that adhere to the ACID properties of
Atomicity, Consistency, Isolation, and Durability [14]. Atomicity implies that a
collection of operations appears atomically in the database; either all the operations
appear or none of them do. Consistency means the database always presents a
consistent view. (For example, consider a database of cats and owners. If Joy owns
Elsa, we would never see a database with owner Joy without cat Elsa or cat Elsa
without owner Joy.) Isolation describes the property that any transaction operates
under the illusion that it is the only transaction operating in the database at a
particular time. That is, there is no way to tell if there are multiple transactions
active concurrently. Finally, durability implies that modifications made by a
transaction are guaranteed to be persistent, even in the presence of system or
application failure.
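To make these guarantees concrete, here is a minimal sketch using JE's public API to update the cats-and-owners example within a single transaction; the environment path and database names are arbitrary choices for this illustration, and error handling is reduced to the essentials.

    import java.io.File;
    import com.sleepycat.je.Database;
    import com.sleepycat.je.DatabaseConfig;
    import com.sleepycat.je.DatabaseEntry;
    import com.sleepycat.je.Environment;
    import com.sleepycat.je.EnvironmentConfig;
    import com.sleepycat.je.Transaction;

    public class AtomicOwnerUpdate {
        public static void main(String[] args) throws Exception {
            EnvironmentConfig envConfig = new EnvironmentConfig();
            envConfig.setAllowCreate(true);
            envConfig.setTransactional(true);
            // The environment directory is an arbitrary example path.
            Environment env = new Environment(new File("/tmp/je-example"), envConfig);

            DatabaseConfig dbConfig = new DatabaseConfig();
            dbConfig.setAllowCreate(true);
            dbConfig.setTransactional(true);
            Database owners = env.openDatabase(null, "owners", dbConfig);
            Database cats = env.openDatabase(null, "cats", dbConfig);

            // Atomicity: both records appear, or neither does.
            Transaction txn = env.beginTransaction(null, null);
            try {
                owners.put(txn, new DatabaseEntry("Joy".getBytes()),
                           new DatabaseEntry("Elsa".getBytes()));
                cats.put(txn, new DatabaseEntry("Elsa".getBytes()),
                         new DatabaseEntry("Joy".getBytes()));
                txn.commit();   // durability: changes persist once commit returns
            } catch (Exception e) {
                txn.abort();    // roll back so no partial update is visible
                throw e;
            } finally {
                cats.close();
                owners.close();
                env.close();
            }
        }
    }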
[Figure: Database structure. Databases are implemented as B+trees with fixed-size internal nodes (INs). BINs (bottom internal nodes) are subclasses of INs that support iteration over the elements in the tree; leaf nodes (LNs) hold the key/data pairs. The example tree indexes cat breeds such as Korat, Laperm, Nebelung, Pixiebob, and Somali.]
Having mastered the basic terminology and structure of JE, we are now ready to
explore implementation details in more depth.
3. THE IMPLEMENTATION
The implementation revolves around the log-structured design of JE. Rather than
marshalling and unmarshalling Java objects to and from disk-backed pages, JE
serializes objects and writes them sequentially to a log, optionally using the Java
NIO API. All the necessary index structures sit atop this sequential log, and there
is no other representation of the data.
We found the java.io.Serializable interface too heavyweight for our purposes,
so the term “serialize” throughout this paper refers to the generic action of
marshalling objects; our implementation does not use java.io.Serializable to
provide this functionality.
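To illustrate this write path, the sketch below appends serialized records to a sequential log through the NIO FileChannel API. The class and its length-prefixed framing are inventions for this illustration; JE's actual log format carries additional header fields, such as the entry type and a checksum.

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    final class SequentialLogWriter implements AutoCloseable {
        private final FileChannel channel;

        SequentialLogWriter(Path logFile) throws IOException {
            channel = FileChannel.open(logFile,
                    StandardOpenOption.CREATE, StandardOpenOption.WRITE);
            channel.position(channel.size());   // start appending at the current end
        }

        // Append one serialized record and return its starting offset,
        // which can serve as the record's stable log address.
        long append(byte[] serialized) throws IOException {
            long offset = channel.position();
            ByteBuffer buf = ByteBuffer.allocate(4 + serialized.length);
            buf.putInt(serialized.length).put(serialized);
            buf.flip();
            while (buf.hasRemaining()) {
                channel.write(buf);             // strictly sequential writes
            }
            return offset;
        }

        // Force buffered log records to stable storage (e.g., at commit).
        void sync() throws IOException {
            channel.force(false);
        }

        @Override
        public void close() throws IOException {
            channel.close();
        }
    }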
This design offers several advantages. First, JE does not have to pay the
performance penalty of marshalling and unmarshalling variable length objects onto
fixed-length pages. Second, JE can perform record-level locking without the
significant recovery complexity usually associated with traditional RDBMS record-
locking solutions. Third, there is only a single representation of the data, instead of
the separate database and log files in conventional logging systems. Fourth, all
writes happen sequentially, so unless the disk is used to handle read misses, the
disk head never moves, providing near-maximum disk bandwidth utilization.
The log-structured design is not without its disadvantages. The primary
disadvantage is that a cache-miss for JE is more costly than a cache-miss in a
traditional database engine. For example, a cursor traversing the database will be
forced to issue an I/O for each item not found in the cache, whereas a traditional
on-disk structure amortizes a single I/O over the many key/data pairs clustered
together on a page.
3.1 Concurrency
JE uses record-level transactional locks on the LNs that store key/data pairs,
providing highly concurrent database access. Internal nodes are maintained non-
transactionally in order to preserve high-concurrency access, minimize the data
written to the log, and simplify abort (as modifications do not have to be undone
when transactions abort). Instead, modifications to internal nodes (INs) are
synchronized via short-term latches protecting only against physical manipulation.
Since Java monitors, using the synchronized keyword, cannot be used to perform
latch coupling (atomically acquiring a new latch and dropping a held latch) and do
not prevent starvation in the case that multiple threads are waiting for the same
object, JE provides its own latching mechanism using Java monitors as the basis.
Each object that requires a latch (for example, an IN) contains a reference to a JE
latch object. JE uses Java monitors to synchronize access to the latch object. When
a thread holds a latch that is requested by another thread, the incoming thread
waits on a synchronizer object.
Latch granting is strictly first in, first out. When a thread releases a latch it
examines the queue and notifies the first entry in that queue. Since the releasing
thread only calls notify on a single thread, we guarantee that threads are awakened
in the order they appear on the queue, reducing the possibility of starvation.
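The sketch below illustrates the technique: a FIFO latch built from Java monitors in which each waiter blocks on its own synchronizer object and the releasing thread wakes only the head of the queue. It is a simplified illustration of the approach described above, not JE's actual latch class.

    import java.util.ArrayDeque;
    import java.util.Deque;

    final class FifoLatch {
        // Per-waiter synchronizer object, as described in the text.
        private static final class Waiter {
            boolean granted;                    // set by the releasing thread
        }

        private final Deque<Waiter> queue = new ArrayDeque<>();
        private boolean held;

        void acquire() throws InterruptedException {
            Waiter me;
            synchronized (this) {
                if (!held && queue.isEmpty()) {
                    held = true;                // uncontended fast path
                    return;
                }
                me = new Waiter();
                queue.addLast(me);              // strict first-in, first-out order
            }
            synchronized (me) {
                while (!me.granted) {           // guards against spurious wakeups
                    me.wait();
                }
            }
        }

        void release() {
            Waiter next;
            synchronized (this) {
                next = queue.pollFirst();
                if (next == null) {
                    held = false;               // no waiters; the latch becomes free
                    return;
                }
                // Otherwise hand the latch directly to the first waiter:
                // "held" remains true across the handoff.
            }
            synchronized (next) {
                next.granted = true;
                next.notify();                  // wake exactly one thread, in queue order
            }
        }
    }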
Latch deadlocks must be avoided, because we perform no latch deadlock detection
or resolution. Acquiring locks in a well-defined order is the most common
deadlock-avoidance technique [30]. While descending the tree, this well-defined
order is obvious: always acquire latches down the tree. For other data structures,
such as the transaction table, we define a lock order and adhere to it.
Using short-term latches rather than transactional locks on internal nodes allows
for improved concurrency but does introduce complexity in the implementation,
especially in recovery.
JE uses a conventional lock manager to implement transactional locks. The lock
table is an object that contains a hash table mapping node IDs to lock objects.
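A lock table of this general shape can be sketched as follows. The classes are invented placeholders; the real lock manager must also track lock modes (read versus write), queue waiting lockers, and record transaction ownership.

    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    final class LockTable {
        // One lock object per node; an invented placeholder shape.
        static final class Lock {
            final Set<Long> ownerTxnIds = new HashSet<>();  // transactions holding the lock
        }

        private final Map<Long, Lock> table = new ConcurrentHashMap<>();

        // Find or create the lock object guarding the node with this ID.
        Lock lockFor(long nodeId) {
            return table.computeIfAbsent(nodeId, id -> new Lock());
        }
    }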
3.5 Recovery
Recovery reconstructs the tree structures and transactional state by making a
series of passes over the log:
1. Read all the INs to find the largest allocated node ID (so that we can
begin allocating new node IDs) and the largest used transaction ID (so that
new transaction IDs can be allocated before any transactional operations
are performed).
The next three passes recover the mapping tree, which maps database IDs to
actual databases:
2. Read all the BIN Delta records for the mapping tree.
3. Undo all aborted LNs in the map (i.e., roll backward). Keep track of
all committed transaction IDs.
4. Redo all committed LNs in the map (i.e., roll forward).
The next two passes reconstruct the physical structure of all the database trees:
5. Read all the INs and link them back into the tree.
6. Read all the BIN Deltas and apply those to the INs.
Next, we reconstruct duplicate data trees, similarly to how we reconstructed the
database trees. The reason this must be implemented as a separate set of passes is
because we cannot necessarily traverse enough of the tree to access the duplicate
trees until pass numbers 5 and 6 are complete.
7. Read all the DINs and DBINs and link them back into the tree.
8. Read all the DBIN Deltas and apply those to the DBINs.
And finally, we execute the conventional roll back and roll forward phases:
9. Roll backward undoing aborted transactions.
10. Roll forward reapplying the committed transactions.
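To make the structure of these final passes concrete, the sketch below shows a roll-forward pass over the log; LogReader, LogEntry, and Tree are invented placeholder types, and the actual implementation dispatches on many more kinds of log entries.

    import java.util.Set;

    interface LogEntry {
        boolean isLN();        // is this entry a leaf node (key/data pair)?
        long txnId();          // transaction that wrote the entry
        byte[] key();
        long logOffset();      // the entry's address in the log
    }

    interface LogReader extends Iterable<LogEntry> { }

    interface Tree {
        void redo(byte[] key, long logOffset);  // point the tree at the logged version
    }

    final class RollForwardPass {
        // Reapply every LN written by a committed transaction (pass 10).
        void run(LogReader log, Set<Long> committedTxnIds, Tree tree) {
            for (LogEntry entry : log) {        // a single sequential scan of the log
                if (entry.isLN() && committedTxnIds.contains(entry.txnId())) {
                    tree.redo(entry.key(), entry.logOffset());
                }
            }
        }
    }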
3.6 Cleaning
No discussion of a log-structured system would be complete without a discussion
of the cleaner. As has been demonstrated [24], selecting the appropriate segments
(or JE log files) for cleaning has a significant effect on cleaner performance. JE
maintains a utilization profile for each log file so the cleaner can select the eligible
log file with the lowest utilization. A log file is eligible for cleaning if its removal
does not interfere with recovery. Thus, any log file is eligible if it does not contain
the last checkpoint and has not been written since the last checkpoint.
JE cleaning consists of reading a log file, appending records that are still “live”
back into the log, and then reclaiming the space freed up by the now useless log
file. A record is alive if the node it references is still present in the database. If log
files are copied to archival media before being removed, the collection of archived
log files provides the basis for catastrophic recovery.
There are two types of information in the utilization profile: summary and detail.
Summary information for each file indicates approximately how much of the file is
obsolete and how much it will cost to clean the file. It contains the total number of
nodes, the number of obsolete nodes, and the average size of the nodes. The
summary information is cached in memory and is used to select the next log file to
be cleaned.
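A minimal sketch of this selection step, with invented types, might look as follows; the real summary also records the average node size, which feeds an estimate of the cost of cleaning each file.

    import java.util.Collection;
    import java.util.Comparator;
    import java.util.Optional;
    import java.util.Set;

    final class CleanerSelection {
        // Cached per-file summary, mirroring the fields described above.
        static final class FileSummary {
            final long fileNum;
            final long totalCount;      // total entries written to the file
            final long obsoleteCount;   // entries known to be obsolete

            FileSummary(long fileNum, long totalCount, long obsoleteCount) {
                this.fileNum = fileNum;
                this.totalCount = totalCount;
                this.obsoleteCount = obsoleteCount;
            }

            // Estimated fraction of the file that is still live data.
            double utilization() {
                return totalCount == 0 ? 0.0
                     : (double) (totalCount - obsoleteCount) / totalCount;
            }
        }

        // Pick the eligible file with the lowest utilization, if any.
        static Optional<FileSummary> pickFileToClean(
                Collection<FileSummary> summaries, Set<Long> eligibleFiles) {
            return summaries.stream()
                .filter(s -> eligibleFiles.contains(s.fileNum))   // skip files recovery needs
                .min(Comparator.comparingDouble(FileSummary::utilization));
        }
    }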
The detail information for each log file consists of a list of the byte offsets in the
file for all entries that are known to be obsolete. Without this detail information,
identifying obsolete entries would require traversing the live database tree to
determine whether each entry in the log file still exists in the tree. The detail
information is used to avoid this potentially costly check during cleaning.
The utilization profile is stored as an internal hidden database. Each record in that
database contains both the summary and some of the detail information for a
particular log file. A log file can have multiple profile records with each one
containing the complete summary information and the detail information
accumulated since the last record for the log file was written. By storing the detail
information incrementally and in a packed form, the utilization profile information
is less than 2% of the total disk space used in the environment.
JE tracks utilization profile information during live database operations. In most
cases, this is accomplished without additional latching by tracking utilization while
a latch is already held. The challenge in keeping utilization information accurate is
keeping it transactionally consistent, both during regular operation and during
recovery.
4. RELATED WORK
There are three types of prior work related to this paper. First, there is the
enormous literature on transactional systems. Second, there is the rich research
history in log-structured file systems. Lastly, there are a number of alternative pure
Java database implementations. We discuss each of these areas in the next three
sections.
5. APPLICATIONS USING JE
The key distinction between JE and the other systems discussed in the previous
section is its flexibility. If an application is not wedded to a SQL data management
interface, then JE can be molded to address practically every data management
need. Indeed, this is precisely what we observe in our customers’ applications. In
this section, we provide three examples of how customers are using JE.
6. PERFORMANCE
We began this paper by citing some of the advantages of JE and its log-structured
architecture. Up to this point, we’ve focused largely on JE’s architecture and its
functional flexibility. In this section, we illustrate its performance characteristics.
We compare JE 2.1.30 to its JNI counterpart implemented atop Berkeley DB’s C
library using a pre-release version of 4.5 (referred to as JNI for the rest of this
paper). Such a comparison is not perfect: the C library has been in widespread
commercial use for nearly a decade and has been heavily optimized, while JE has
been in commercial use for less than two years and has undergone far less
performance tuning. Nonetheless, it is the best comparison available and
highlights the areas where JE’s architecture delivers outstanding performance.
6.4 Concurrency
Our next benchmark explores the improved concurrency possible due to JE’s
record-level locking. This benchmark is similar to the one in Section 6.3, except
that rather than selecting records uniformly at random, we skew the distribution,
selecting fewer than 1% of the records for update. This sets the stage to explore the
behavior of the system under contention. Our expectation is that the record-level
locking of JE will produce better scalability than JNI’s page-level locking.
7. CONCLUSIONS
We have presented the design and implementation of the Berkeley DB Java
Edition, a native Java transactional data manager. JE’s log-structured storage
system delivers outstanding write performance without jeopardizing read
performance.
8. AVAILABILITY
Additional information about Berkeley DB Java Edition and the full product
including source code, documentation, sample code, and test code, is available for
download from:
http://www.oracle.com/technology/products/berkeley-db/je/index.html.
Oracle Corporation
World Headquarters
500 Oracle Parkway
Redwood Shores, CA 94065
U.S.A.
Worldwide Inquiries:
Phone: +1.650.506.7000
Fax: +1.650.506.7200
oracle.com