Principles of Database Buffer Management
1. INTRODUCTION
Database management systems (DBMSs) use external magnetic devices (disks)
for the storage of mass data. They offer low cost per bit and nonvolatility, which
makes them indispensable in today's DBMS technology. However, under com-
mercially available operating systems, data can only be manipulated (i.e., com-
pared, inserted, modified, and deleted) in the main storage of the computer.
Therefore, part of the database has to be loaded into a main storage area before
manipulation and written back to disk after modification. A database buffer has
to be maintained for purposes of interfacing main memory and disk.
Although several modern operating systems provide a main storage “cache”
for their file systems, most DBMSs have their own buffer pools in the user
address space; they do not use the OS file cache for various reasons (for a
detailed discussion, see, for example, [24]). In order to facilitate the exchange of
data between disk and main storage, the database is divided into pages of equal
size (generally 512 to 4096 bytes). The buffer consists of page frames of the same
size. The number of frames in the buffer can be selected as a DBMS parameter,
which remains constant during a DBMS session. Today buffer sizes vary from
about 16 K to 12 M bytes. A typical buffer size may be assumed to be between
128 K and 256 K bytes.
Since a physical access to a database page on disk is much more expensive
than an access to a database page in the buffer, the main goal of a database
buffer manager is the minimization of physical I/O for a given buffer size. This
goal has to be accomplished under certain restrictions resulting from the interface
between the buffer manager and other DBMS components.
The purpose of this paper is to describe, in some detail, the main functions of
a database buffer manager. In Section 2, its typical interface to the calling DBMS
routines is investigated. Section 3 of the paper compares the applicability of
different techniques for searching the buffer. Section 4 concentrates on the
problem of allocating sufficient buffer space for concurrent transactions. In
addition to the usual techniques, a new, page-type-oriented allocation algorithm
is considered for use in the DBMS context. In Section 5, various page replacement
algorithms are classified. The combination of classification criteria leads to the
refinement of known algorithms. Section 6 presents an empirical study of the
performance aspects of various buffer allocation algorithms in connection with
page replacement algorithms. The results were gained using page reference strings
of CODASYL DBMS applications. Section 7 describes some further buffer
management problems related to a virtual OS environment and control of
overload behavior. The final section summarizes the major aspects of DBMS
buffer management.
Figure. Classification of search strategies within the buffer: direct search within the buffer frames (sequential search in the buffer pool), or indirect search using additional tables, namely a translation table (by page number), an unsorted table, a sorted table, a chained table, or a hash table.
Physical references are strongly influenced by the size of the database buffer and
the page replacement algorithm of the buffer manager. The same string of logical
references can result in quite different physical reference strings under different
replacement algorithms. Since physical references are expensive, the optimization
of the page replacement algorithm is very important for the overall performance
of the DBMS. Optimization means the minimization of the number of physical
disk accesses for a typical transaction load, described by a logical reference string.
As the characteristics of logical reference strings depend on the implementation
details of a DBMS, the empirical results given in this paper should not be
generalized. Our emphasis is on the basic principles of database buffer manage-
ment and on the methods used in our evaluation rather than on the results
themselves.
Having described the interface and operations of a database buffer manager,
we now proceed to the implementation of single actions, as mentioned above:
search within the buffer, buffer allocation, and page replacement.
The notation used in the hash table example of Figure 2 is the following: Pi, Pj, Pk, Pl denote page numbers; BAi, BAj, BAk, BAl denote buffer addresses (frame numbers); h is the hash function; and HT(k) is the k-th entry of the hash table.
A translation table indexed by page number needs one entry per database page and is
therefore restricted to very small databases. All other tables contain only N
entries for a buffer of size N, independent of the size D of the database. The
unsorted and sorted tables both require N/2 accesses on the average for a page
found in the buffer. The sorted table reduces the number of accesses from N to
N/2 for an unsuccessful sequential search and allows binary search techniques,
but involves a much higher overhead when a table entry is inserted or deleted.
By maintaining an index to the sorted table or by implementing the sorted table
with a balanced binary tree, the search can be reduced to log2 N accesses in either
case; update costs, however, are even higher. A table with chained entries has
two advantages over a compact table:
(1) update is less costly, since no entries have to be moved;
(2) the chaining sequence can be used to represent additional information. For
example, table entries could be chained in LRU sequence, representing the
replacement information for an LRU algorithm and speeding the buffer
search when locality in the reference behavior is observed (e.g., when the
probability of rereferencing recently used pages is high).
Since the most frequent operation in a page table is direct access using a page
number, hash techniques can be used efficiently. The hash algorithm transforms
a page number into a displacement within the page table, where the entry
describing the page and its current position in the buffer can be found. Collisions
can be resolved by chaining overflow entries to the “home” entry. With an
appropriately sized hash table, the number n of entries searched per logical
reference can be on the order of 1 < n < 1.2. An example of such a hash table is
given in Figure 2.
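To make the table organization concrete, the following C sketch shows one way a buffer lookup table with overflow chaining could be implemented; all names (BufferTable, bt_lookup, HASH_SIZE, and so on) are illustrative assumptions and not taken from a particular system.

/* Minimal sketch of a buffer lookup table with overflow chaining, as
 * described in the text.  All names are illustrative.               */
#include <stdio.h>
#include <stdlib.h>

#define HASH_SIZE 257              /* prime, roughly 2x the frame count  */

typedef struct Frame {
    long page_no;                  /* database page currently held       */
    int  frame_no;                 /* position in the buffer pool        */
    struct Frame *next;            /* overflow chain ("home" entry first) */
} Frame;

typedef struct {
    Frame *ht[HASH_SIZE];          /* hash table over page numbers       */
} BufferTable;

static unsigned h(long page_no) { return (unsigned)(page_no % HASH_SIZE); }

/* Returns the frame number if the page is in the buffer, -1 otherwise. */
int bt_lookup(const BufferTable *bt, long page_no)
{
    for (Frame *f = bt->ht[h(page_no)]; f != NULL; f = f->next)
        if (f->page_no == page_no)
            return f->frame_no;
    return -1;                     /* buffer fault: page must be fetched */
}

/* Registers a page that has just been read into a buffer frame. */
void bt_insert(BufferTable *bt, long page_no, int frame_no)
{
    Frame *f = malloc(sizeof *f);
    f->page_no = page_no;
    f->frame_no = frame_no;
    f->next = bt->ht[h(page_no)];  /* chain in front of the home entry   */
    bt->ht[h(page_no)] = f;
}

int main(void)
{
    BufferTable bt = { {0} };
    bt_insert(&bt, 4711, 3);
    printf("page 4711 -> frame %d\n", bt_lookup(&bt, 4711));
    printf("page 4712 -> frame %d\n", bt_lookup(&bt, 4712));
    return 0;
}

With a hash table roughly twice as large as the buffer, most chains contain a single entry, which is consistent with the search lengths quoted above.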
replacement decisions (e.g., a global LRU algorithm). However, since the problem
of allocating frames to transactions in an optimal way is logically different from
the problem of selecting a page for replacement, buffer allocation algorithms are
treated separately.
Before discussing specific algorithms, the reference behavior of database trans-
actions has to be considered. In order to design optimal allocation and replace-
ment algorithms, as much knowledge of the actual database reference character-
istics as possible should be used. The following three basic properties of database
reference strings distinguish them clearly from page reference strings of programs
executing under a virtual memory operating system.
(1) Since database pages are a centralized resource shared by many users, the
concurrent use of a page in the buffer by several transactions is quite frequent.
(2) Locality in the reference behavior of a DBMS is not necessarily due to the
references of a single transaction; rather, the parallel execution of many trans-
actions can increase the rereferencing probability across transaction boundaries
(intertransaction locality, intratransaction sequentiality [18]).
(3) In some cases, the reference behavior of database transactions is predict-
able, being based on existing access path structures. Often, specific pages con-
taining system tables, upper index levels, and so on, have a higher reference
probability than do data pages. These identifiable, special-purpose pages can be
treated in a special way when they are referenced.
Besides these general observations, it is important to know as much as possible
about the reference behavior of the specific DBMS for which the buffer manage-
ment component is to be implemented. On such a detailed level, different systems
show different behavior. It is therefore more interesting to look at evaluation
methods for reference strings than at the results for a specific DBMS in a specific
database environment.
For storage allocation and page replacement algorithms, the most important
property of a reference string is the locality of the reference behavior. Locality
means that the probability of reference for recently referenced pages is higher
than the average reference probability. If locality is observed in a reference string,
most of the virtual memory allocation and replacement algorithms can be applied
to buffer management; these algorithms were designed to keep the most recently
referenced pages in main memory, since programs executing under virtual mem-
ory operating systems show high locality in their reference behavior [23].
Detailed information on locality is contained in an LRU stack depth distribution
of the reference string, which shows the frequency of references to pages managed
in the form of an LRU stack [22,26]. The more the distribution is biased towards
low stack depths, the higher is the locality in the string. Figure 3 shows two
examples of LRU stack depth distributions calculated from the page reference
strings of a CODASYL DBMS. The schema and transactions were taken from a
school DB application. The schema consisted of 20 record types and 21 set types.
The database contained approximately 330,000 record occurrences. The two
reference strings discussed here correspond to session times of 30 to 40 minutes
each; they contained 130,366 and 99,975 logical references. Figure 3a shows the
stack depth distribution of a transaction load with a high percentage of short
update transactions, whereas Figure 3b shows the distribution of a transaction
mix consisting of sequential retrieval transactions with only very few updates. In
both cases, up to eight transactions ran concurrently. It is easily seen that the
first mix (MIX40) had a much higher degree of locality than did the second
(MIX50); both mixes are taken from [9].

Fig. 3. LRU stack depth distributions of reference strings from a CODASYL DBMS. (a) MIX40: 130,366 logical references to 3,553 different pages. (b) MIX50: 99,975 logical references to 5,245 different pages. (Horizontal axis: LRU stack depth, 10 to 100.)

Fig. 4. Reference density (in percent) over time for the different page types. Relative frequency of page types for this example: FPA = 0.1%, DBTT = 6.1%, USER = 93.8% (access path data and records).
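The stack depth distribution can be obtained in a single pass over a logical reference string. The following C sketch (with illustrative names and a toy reference string) maintains the referenced pages as an LRU stack and counts, for every rereference, the depth at which the page is found; depth 0 is used here for first references.

/* Sketch of computing an LRU stack depth distribution from a logical
 * reference string; MAX_DEPTH and the array-based stack are
 * illustrative choices.                                             */
#include <stdio.h>

#define MAX_DEPTH 100

static long lru_stack[100000];            /* [0] is the youngest page     */
static int  stack_size = 0;
static long depth_count[MAX_DEPTH + 1];   /* [0] counts first references  */

void reference(long page)
{
    int i, depth = 0;
    for (i = 0; i < stack_size; i++)
        if (lru_stack[i] == page) { depth = i + 1; break; }

    if (depth > 0) {                      /* rereference: record its depth */
        if (depth <= MAX_DEPTH) depth_count[depth]++;
        for (; i > 0; i--) lru_stack[i] = lru_stack[i - 1];  /* move to top */
    } else {                              /* first reference to this page  */
        depth_count[0]++;
        for (i = stack_size++; i > 0; i--) lru_stack[i] = lru_stack[i - 1];
    }
    lru_stack[0] = page;
}

int main(void)
{
    long trace[] = { 1, 2, 1, 3, 2, 2, 1 };   /* toy reference string      */
    for (int k = 0; k < 7; k++) reference(trace[k]);
    for (int d = 1; d <= 3; d++)
        printf("depth %d: %ld references\n", d, depth_count[d]);
    return 0;
}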
In contrast to the reference behavior of programs under virtual memory
operating systems, the highest reference probability is not found in stack depth
1 (containing the “youngest” page). This is due to the fact that the references
observed are logical references to database pages and not addresses used by
machine instructions. Since the “youngest” page in the buffer will be fixed in
most cases, data objects within that page can be addressed without a new logical
reference to the page. Another difference between the reference (or better:
addressing) behavior of programs and the data reference behavior of database
transactions is the probability of reference in stack positions 6 to 40 (approxi-
mately). For example, the first five stack positions may cover as much as 97
percent of all references of a program (data taken from [26]), whereas our MIX50
would find only 9.5 percent of its references in this range. Rereferencing in
deeper stack positions (e.g., 6 to 40) is mainly caused by transactions being
suspended for a certain time because they are blocked by concurrent transactions
having exclusive access to the needed resources. Also, the DBMS stack depth
distributions are not monotonically decreasing, for the same reason. Similar
results have been reported by Fernandez, Lang, and Wood [10].
As mentioned before, the access path structures used by a specific DBMS lead
to a higher reference probability for certain system pages, such as pages contain-
ing free-space data, address translation tables, root pages of B-trees, and so forth.
An evaluation technique showing these effects is presented in Figure 4.
The given classification scheme seems to reflect all buffer allocation algorithms
that promise a successful application and are feasible with a reasonable amount
of overhead.
The main disadvantage of static allocation (whether transaction oriented or
page-type oriented) is its inflexibility in situations where the DBMS load changes
frequently. Since the number of buffer frames allocated to a single transaction
remains constant, static allocation is especially inefficient in an interactive
environment where transactions can be blocked by long user think times. Because
of its inflexibility, static allocation is not considered to be applicable in database
buffer management.

Fig. 5. Classification of buffer allocation algorithms: local (transaction-oriented), global, and page-type-oriented allocation; local and page-type-oriented allocation can be either dynamic or static, with static allocation using fixed partitions of equal or adaptable size.
When local dynamic allocation is applied, the size of the partition of a single
transaction grows and shrinks with the transaction’s changing need for buffer
space. Only the current reference behavior of the transaction itself is taken into
account by the allocation algorithm. When a buffer fault occurs, the allocation
algorithm calculates the optimal partition size for the transaction. Depending on
the reference history of the transaction, it may acquire an additional buffer
frame, keep the partition size constant, or lose one or more frames. In the same
way, partition sizes may vary under page-type-oriented dynamic allocation. For
example, a database buffer could consist of a system part and a user part. New
system pages would be placed in the system partition; new user pages in the user
partition. The size of the partitions would vary with changing demand. With a
fixed-size buffer, a second algorithm that selects pages for replacement has to be
provided.
Whereas local algorithms consider only the reference behavior of a single
transaction when calculating its optimal partition size, global algorithms take
into account the reference behavior of all parallel transactions. All references to
data pages are considered in the same way, independent of the transactions
causing them. Since the DBMS buffer is considered to be of fixed size, global
buffer allocation and the page replacement algorithm coincide. When a buffer
fault occurs, one single algorithm decides which page has to be replaced; the
decision is global. Depending on the owner of the corresponding buffer frame,
the actual partition sizes of the transactions change automatically. (Page replace-
ment algorithms are discussed in Section 5.)
Since static allocation is inefficient in a database environment and global
allocation coincides with replacement algorithms, the only buffer allocation
algorithms to be discussed in some detail here are local dynamic algorithms.
Fig. 6. Working sets of two concurrent transactions T1 and T2 for the reference string A A B C D E A A E F G H F A (referencing transactions T1 T1 T2 T1 T2 T2 T1 T1 T2 T2 T1 T1 T2 T1), with window size τ = 5.

The working set W(t, τ) of a transaction at time t is the set of distinct pages it has
referenced within the last τ references; τ is called the window size, and w(t, τ) =
|W(t, τ)| is the working-set size at time t. Figure 6 shows examples of working sets
and their sizes.
The average working-set size w(τ) of a transaction, that is, w(t, τ) averaged over t,
can be used as a measure of locality. The higher the locality, the more references
are made to elements previously referenced within the window of size τ, and the
lower the average working-set size w(τ).
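As a small illustration, the following C sketch computes w(t, τ) and the average working-set size for a single transaction; the reference string is transaction T1's part of Figure 6 as reconstructed above, and the brute-force window scan is an illustrative simplification.

/* Sketch of computing working-set sizes w(t, tau) and the average
 * working-set size from one transaction's reference string.         */
#include <stdio.h>

/* Number of distinct pages among refs[t-tau+1 .. t] (0-based, clipped). */
static int ws_size(const char refs[], int t, int tau)
{
    int start = (t - tau + 1 < 0) ? 0 : t - tau + 1;
    int count = 0;
    for (int i = start; i <= t; i++) {
        int seen = 0;
        for (int j = start; j < i; j++)
            if (refs[j] == refs[i]) { seen = 1; break; }
        if (!seen) count++;
    }
    return count;
}

int main(void)
{
    const char refs[] = "AACAAGHA";     /* T1's references from Fig. 6 */
    int n = 8, tau = 5;
    double sum = 0.0;
    for (int t = 0; t < n; t++) {
        int w = ws_size(refs, t, tau);
        printf("t=%d  w(t,%d)=%d\n", t + 1, tau, w);
        sum += w;
    }
    printf("average working-set size: %.2f\n", sum / n);
    return 0;
}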
Denning’s working-set model can be used for the implementation of a buffer
allocation algorithm, WS [17]. The basic principle is to keep the pages forming
the working set of a transaction in the buffer, and to make all pages not belonging
to any working set available for replacement. The window size τ has to be
determined carefully so that the working set of a transaction contains just the
minimum of pages needed for an efficient execution. In phases of high locality,
the working set of a transaction shrinks, and the buffer frames freed will be
available for reallocation to other transactions. A further refinement of the WS
algorithm could be the assignment of different window sizes τ to different types
of transactions, thereby establishing a priority system.
Note that the WS algorithm only decides whether or not a certain page in the
buffer is available for replacement. It is irrelevant whether the last reference to the
page occurred τ + 1 or τ + n (n > 1) references ago. Therefore, in addition to the
buffer allocation algorithm WS, a page replacement algorithm is needed that
selects one of the eligible pages for replacement when a buffer fault occurs. All
replacement algorithms described in the next section can be used in combination
with WS.
Implementing the WS algorithm means that whenever a buffer fault occurs,
the working sets of all active transactions must be determined. Denning’s
proposals for an implementation were based on a hardware feature of virtual
memory computers: Associated with each storage frame is a reference bit that is
set by hardware whenever a page is referenced (addressed). Since a database
buffer manager defines references to pages in a different way, these references
have to be recorded by the software. A straightforward implementation is the
following: Every active transaction has a reference counter TRC(T), which counts
all logical references of the transaction. Every page i in the buffer has, for every
transaction using it, a field "last reference count" LRC(T, i). When transaction
T references page i, TRC(T) is incremented and then copied into LRC(T, i). A
buffer page i is available for replacement iff

TRC(T) - LRC(T, i) ≥ τ

for all transactions T using it. Figure 7 shows an example of this implementation,
using the reference string of Figure 6.

Fig. 7. An implementation of the WS algorithm for buffer allocation. After the reference string of Figure 6: TRC(T1) = 8, TRC(T2) = 6; LRC(T1, A) = 8, LRC(T1, C) = 3, LRC(T1, G) = 6, LRC(T1, H) = 7; LRC(T2, B) = 1, LRC(T2, D) = 2, LRC(T2, E) = 4, LRC(T2, F) = 6. For τ = 5, pages C and B are available for replacement.
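The TRC/LRC bookkeeping can be sketched as follows in C; the fixed-size arrays, the page encoding, and the two-transaction scenario of Figure 6 are illustrative simplifications of the mechanism described above.

/* Sketch of the TRC/LRC bookkeeping for the WS allocation algorithm.  */
#include <stdio.h>

#define NTRANS 2
#define NPAGES 8          /* pages A..H mapped to indices 0..7 */
#define TAU    5

static long TRC[NTRANS];                 /* per-transaction reference counter */
static long LRC[NTRANS][NPAGES];         /* last reference count, 0 = unused  */

void reference(int t, int page)
{
    TRC[t]++;                            /* count the logical reference       */
    LRC[t][page] = TRC[t];               /* remember when this page was used  */
}

/* A buffer page is replaceable iff TRC(T) - LRC(T,i) >= tau for every
 * transaction T that has used it.                                       */
int replaceable(int page)
{
    for (int t = 0; t < NTRANS; t++)
        if (LRC[t][page] != 0 && TRC[t] - LRC[t][page] < TAU)
            return 0;
    return 1;
}

int main(void)
{
    /* Reference string of Figure 6 as (transaction, page) pairs. */
    int trans[] = {0,0,1,0,1,1,0,0,1,1,0,0,1,0};
    int pages[] = {'A','A','B','C','D','E','A','A','E','F','G','H','F','A'};
    for (int k = 0; k < 14; k++) reference(trans[k], pages[k] - 'A');

    for (int p = 0; p < NPAGES; p++)
        if (replaceable(p))
            printf("page %c is available for replacement\n", 'A' + p);
    return 0;
}

Run on the reference string of Figure 6 with τ = 5, the sketch reports pages B and C as replaceable, in agreement with Figure 7.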
Another dynamic storage allocation algorithm discussed in the literature is the
page-fault-frequency algorithm (PFF). It uses the current interval between the
last two page faults (which is related to the current page fault rate) for the
allocation decision: As long as the actual page fault rate FA of a transaction is
lower than a predefined maximum rate F, the transaction keeps its working set
in the buffer (as under WS). When the actual fault rate FA is higher than F
(determined by the fact that the interval between the last two page faults was
less than τ' = 1/F), a new buffer frame is allocated to the transaction, independ-
ent of its current working set. PFF is designed to guarantee a maximum fault
rate of F for all transactions. (Further details may be found in the literature
[5, 8].)
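A minimal sketch of the PFF decision at a buffer fault might look as follows; the structure and the names (Trans, pff_on_fault, tau_prime) are assumptions for illustration, and the release of frames outside the working set is only indicated by a comment.

/* Sketch of the page-fault-frequency (PFF) allocation decision.       */
#include <stdio.h>

typedef struct {
    long last_fault_ref;   /* TRC value at the previous buffer fault    */
    int  frames;           /* frames currently allocated                */
} Trans;

/* Called on a buffer fault of transaction t after trc logical references;
 * tau_prime = 1/F is the minimum tolerated distance between two faults.  */
void pff_on_fault(Trans *t, long trc, long tau_prime)
{
    long interval = trc - t->last_fault_ref;
    if (interval < tau_prime) {
        t->frames++;               /* fault rate above F: grow the partition */
    } else {
        /* Fault rate acceptable: keep only the working set; frames not      */
        /* referenced since the last fault can be released (as under WS).    */
    }
    t->last_fault_ref = trc;
}

int main(void)
{
    Trans t = { 0, 4 };
    pff_on_fault(&t, 12, 20);      /* interval 12 < 20: allocate a frame */
    printf("frames after fault: %d\n", t.frames);
    return 0;
}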
Further dynamic buffer allocation algorithms are proposed in the literature
(e.g., a so-called WSCLOCK algorithm and an allocation based on a modified
CLOCK algorithm; their complete descriptions and evaluations can be found in
[1] and [4]).
In this section we have discussed the application of dynamic buffer allocation
algorithms to transactions. Page-type-oriented dynamic allocations can be im-
plemented in the same way. Hence, algorithms such as WS and PFF can be used
with dynamic partitions in a transaction- or page-type-oriented database buffer.
Fig. 9. Example of an LRU stack: a reference to a page found in the buffer (here page A) moves that page to the top of the stack; the least recently used page is at the bottom.
The FIFO algorithm (first in, first out) replaces the page that has been resident in
the buffer for the longest time; the age of a page since it was fetched is thus
the only decision criterion. Hence, FIFO is only appropriate for sequential access
behavior. Figure 10a shows a common representation of FIFO, using a circular
allocation of pages and a rotating pointer moved one step at every replacement.
The pointer indicates the next page to be replaced.
The algorithm LFU (least frequently used) uses only the second decision
criterion, and replaces the buffer page with the lowest reference frequency. As
shown in Figure 10c, reference counters (RC) are needed to record all references
to a buffer page. When a page is fetched, the corresponding RC is initialized to
1; every rereference increments it by 1. When replacement is necessary, the
buffer page with the smallest value of RC is chosen; a tie is resolved by some
mechanism. In this strict LFU realization, the age of a page is not taken into
account at all; pages with very high reference activity during a short interval can
obtain such high RC values that they will never be displaced, even if they are
never referenced again. For this reason, the pure LFU mechanism should not be
implemented in a database environment. Using additional measures, the LFU
concept can be made more appropriate, while losing its original characteristics.
All further algorithms to be discussed consider age as well as references. The
widespread algorithm LRU replaces the buffer page that was least recently used,
and can be explained easily by means of a so-called LRU stack, as shown in
Figure 9.
The replacement decision is determined by which page is referenced and by
the age of each buffer page since its most recent reference. The FIX mechanism
for pages causes LRU to be optionally implemented by two versions, depending
on how the term “used” is interpreted, as
- least recently referenced, or
- least recently unfixed.
The following scenario can help to clarify the difference.
Consider the sequence FIX(A), FIX(B), UNFIX(B), UNFIX(A), followed by a
replacement decision at time t1. At time t1, with
- least recently referenced, page A is replaced;
- least recently unfixed, page B is replaced.
The version considering the UNFIX time is preferable in DBMS buffer manage-
ment because FIX phases can last a very long time due to delays caused by a
transaction's blocking times and action interrupts. Thus, only this (UNFIX time)
version guarantees the intended observation of the basic LRU idea.

Fig. 10. Examples of replacement decisions: (a) FIFO with a rotating pointer; (b) CLOCK with use-bits; (c) LFU with reference counters RC; (d) GCLOCK; (e) LRD with a global reference counter GRC, per-page reference counters RC and fetch counters FC, and the resulting reference densities RD(i).
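One way to realize the UNFIX-based variant is to maintain the LRU chain on UNFIX operations only, so that fixed pages are never candidates; the following C sketch is illustrative (the names BCB, fix_page, unfix_page, choose_victim are not taken from a real system).

/* Sketch of an LRU chain maintained on UNFIX rather than on reference:
 * a page enters the chain only when its fix count drops to zero.       */
#include <stdio.h>

typedef struct BCB {              /* buffer control block (illustrative)      */
    long page_no;
    int  fix_count;
    struct BCB *prev, *next;      /* chain, head = least recently unfixed     */
} BCB;

static BCB *lru_head = NULL, *lru_tail = NULL;

static void lru_unlink(BCB *b)
{
    if (b->prev) b->prev->next = b->next; else if (lru_head == b) lru_head = b->next;
    if (b->next) b->next->prev = b->prev; else if (lru_tail == b) lru_tail = b->prev;
    b->prev = b->next = NULL;
}

static void lru_append(BCB *b)    /* most recently unfixed end */
{
    b->prev = lru_tail; b->next = NULL;
    if (lru_tail) lru_tail->next = b; else lru_head = b;
    lru_tail = b;
}

void fix_page(BCB *b)             /* fixed pages are not displaceable         */
{
    if (b->fix_count++ == 0) lru_unlink(b);
}

void unfix_page(BCB *b)           /* LRU position is taken at UNFIX time      */
{
    if (--b->fix_count == 0) lru_append(b);
}

BCB *choose_victim(void)          /* least recently unfixed, unfixed pages only */
{
    return lru_head;
}

int main(void)
{
    BCB a = { 1, 0, NULL, NULL }, b = { 2, 0, NULL, NULL };
    fix_page(&a); fix_page(&b);        /* FIX A, FIX B     */
    unfix_page(&b); unfix_page(&a);    /* UNFIX B, UNFIX A */
    printf("victim: page %ld\n", choose_victim()->page_no);
    return 0;
}

Applied to the scenario above (FIX A, FIX B, UNFIX B, UNFIX A), the sketch selects page B, the least recently unfixed page.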
The CLOCK algorithm attempts to simulate LRU behavior by means of a
simpler implementation. As shown in Figure 10b, CLOCK is a modification of
the FIFO mechanism (Figure 10a). A use-bit is added to every buffer page,
indicating whether or not the page was referenced during the recent circulation
of the selection pointer. The page to be replaced is determined by the stepwise
examination of the use-bits. Encountering a 1-bit causes a reset to 0 and the
move of the selection pointer to the next page. The first page found with a 0-bit
is the victim for replacement. Another name for the CLOCK algorithm is
"second chance."

The generalized CLOCK algorithm (GCLOCK) replaces the use-bits by reference
counters RC(i): the selection pointer decrements each nonzero counter it passes,
and the first page encountered with a counter value of 0 is replaced.
The use of page weights (or virtual references) for the different page-types Tj (Fj
for fetch and Rj for rereference) is appropriate to introduce knowledge about
access paths and their traffic frequencies. For example, let T1 be DBTT pages,
T2 FPA pages, T3 index pages, and T4 data pages. Then a fetch weight of 2 could
be assigned to DBTT pages (F1 = 2), a fetch weight of 1 to FPA and index pages
(F2 = F3 = 1), while a weight of 0 could be assigned to data pages (F4 = 0),
expressing a low probability of rereference. A rereference could be treated by
assigning the weights, R = (2, 2, 2, 1).
This idea leads to the following versions V1 and V2 of GCLOCK, characterized
by the way they handle the reference counter RC(i) related to page Pi of type Tj:

V1: first reference (fetch): RC(i) := Fj
    each rereference:        RC(i) := RC(i) + Rj

V2: first reference:         RC(i) := Fj
    each rereference:        RC(i) := Rj

When Fj = 1 and Rj = 1 for all j, V2 is equivalent to CLOCK, while V1
represents the basic version of GCLOCK. In V2, Rj should be ≥ Fj; otherwise,
an immediate rereference to a recently fetched page would decrease the value of
the reference counter, an undesired effect.
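The following C sketch illustrates GCLOCK in version V1 with the example weights F = (2, 1, 1, 0) and R = (2, 2, 2, 1) given above; the frame layout is an assumption, and the handling of fixed pages is omitted. With all weights set to 1 and V2 counter handling, the same loop reduces to CLOCK.

/* Sketch of GCLOCK (version V1) with page-type weights.               */
#include <stdio.h>

#define NFRAMES 4
enum PageType { DBTT, FPA, INDEX, DATA };

static const int F[] = { 2, 1, 1, 0 };   /* fetch weights   F = (2,1,1,0) */
static const int R[] = { 2, 2, 2, 1 };   /* reref. weights  R = (2,2,2,1) */

static int rc[NFRAMES];                  /* reference counters            */
static enum PageType type[NFRAMES];
static int hand = 0;                     /* rotating selection pointer    */

void on_fetch(int frame, enum PageType t) { type[frame] = t; rc[frame] = F[t]; }
void on_reref(int frame)                  { rc[frame] += R[type[frame]]; }  /* V1 */
/* V2 would instead set rc[frame] = R[type[frame]] on a rereference.         */

int select_victim(void)   /* decrement nonzero counters until a 0 is found */
{
    for (;;) {
        if (rc[hand] == 0) {
            int victim = hand;
            hand = (hand + 1) % NFRAMES;
            return victim;
        }
        rc[hand]--;
        hand = (hand + 1) % NFRAMES;
    }
}

int main(void)
{
    on_fetch(0, DBTT); on_fetch(1, INDEX); on_fetch(2, DATA); on_fetch(3, DATA);
    on_reref(1);
    printf("victim frame: %d\n", select_victim());   /* a DATA page (weight 0) */
    return 0;
}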
In a real implementation, GCLOCK can be further refined, at the expense of
increased overhead. It is generally possible to create a special version, called
DGCLOCK, assigning dynamically calculated, page-related weights Fj(t) and
Rj(t). Further implementation details such as threshold values or periodic de-
crease of the RCs are necessary to adapt a GCLOCK version to transitions in
load characteristics.
GCLOCK represents a class of algorithms in which the different versions can
be tailored to special applications and types of reference behavior by the appro-
priate choice of parameters. Its classification is difficult and necessarily fuzzy
because of the variety of parameters involved.
The algorithms discussed (with the exception of FIFO) evaluate the age of a
buffer page in some indirect way (via the latest reference). It appears to be
promising to relate the actual number of references to a buffer page Pi counted
in RC(i) to its age, defined as the number of elapsed references (to all buffer
pages) since the first reference to Pi. The age of a page is measured in units of
logical references, and can be determined as follows. Let the GRC (global
reference counter) be the total number of logical references. For each buffer page
Pi the time of its first reference (fetch) is FC(i). Hence, GRC-FC(i) is the
reference interval, that is, the age of Pi. Since both the age and RC are measured in
units of logical references, they can be related to each other. By the use of simple
division, the reference density RD(i) of Pi can be obtained. In our terminology,
reference frequency always refers to an absolute number of references, whereas
reference density means a frequency related to a reference interval (i.e., a relative
frequency). This idea is materialized by the following algorithm (see Figure 10e):
RD(i) = RC(i)/(GRC - FC(i)), where GRC - FC(i) ≥ 1.
A buffer fault requires the determination of which buffer page has the lowest
value for RD. GRC is incremented by the reference leading to the buffer fault,
before RD(i) is evaluated; a tie has to be resolved in some way. This algorithm,
presented in its simplest version, can be generalized in various ways. Let us call
the resulting class of algorithms LRD (least reference density) and the described
version LRD(V1).
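In outline, LRD(V1) can be realized with three kinds of counters, as in the following C sketch; the frame bookkeeping and the small main program are illustrative, not a complete buffer manager.

/* Sketch of LRD(V1): the buffer page with the lowest reference density
 * RD(i) = RC(i) / (GRC - FC(i)) is selected for replacement.           */
#include <stdio.h>

#define NFRAMES 4

static long GRC;            /* global counter of logical references    */
static long RC[NFRAMES];    /* references to the page held in frame i  */
static long FC[NFRAMES];    /* GRC value at the page's first reference */

void on_reref(int i) { GRC++; RC[i]++; }

static void fetch_into(int i)        /* caller has already counted the reference */
{
    FC[i] = GRC;
    RC[i] = 1;
}

int handle_fault(void)               /* returns the frame chosen as victim */
{
    GRC++;                           /* the reference leading to the fault  */
    int victim = 0;
    double best = 1e30;
    for (int i = 0; i < NFRAMES; i++) {
        double rd = (double)RC[i] / (double)(GRC - FC[i]);  /* GRC - FC(i) >= 1 */
        if (rd < best) { best = rd; victim = i; }           /* ties: keep first */
    }
    fetch_into(victim);
    return victim;
}

int main(void)
{
    for (int i = 0; i < NFRAMES; i++) { GRC++; fetch_into(i); }  /* initial fill */
    on_reref(0); on_reref(0); on_reref(2);
    printf("victim frame: %d\n", handle_fault());  /* frame 1: lowest density    */
    return 0;
}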
LRD(V1) determines the average reference density of a page. It assumes
equidistant arrival of page references. High reference activity at the beginning of
the reference interval keeps a page in the buffer much longer than desired,
because the actual reference distribution within the interval is not known. The
influence of older references on the selection decision, especially in the case of
clustered arrivals, should be reduced. This goal, the reduction of the weight of
references according to their actual age without the overhead of collecting
additional page-related information, is achievable by the following LRD variant:
After reference intervals of appropriate size, the reference counters RC of all
buffer pages are reduced (e.g., by subtraction or division, using properly chosen
constants). For example, the method to enforce some kind of “periodic aging” at
the end of specific reference intervals IR could be chosen as follows:
LRD(V2): aging by subtraction:

    RC(i) := RC(i) - C1   if RC(i) - C1 ≥ C2      (with C1 > 0, C2 ≥ 0)
    RC(i) := C2           if RC(i) - C1 < C2

aging by division:

    RC(i) := RC(i)/C3     with C3 > 1
C1, C2, and C3 are appropriately selected constants. The size of the reference
interval IR for aging must also be selected carefully. In each algorithm counting
reference frequencies, a number of modifications are conceivable (e.g., the use of
page weights in case of fetch and/or rereference).
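The periodic aging step of LRD(V2) is a simple transformation of the reference counters, sketched below with illustrative constants; when and how often it is applied (the interval IR) is a tuning decision, as noted above.

/* Sketch of the periodic aging step of LRD(V2): at the end of each
 * reference interval IR, all reference counters are reduced.          */
#include <stdio.h>

#define NFRAMES 4
static long RC[NFRAMES];

void age_by_subtraction(long c1, long c2)     /* c1 > 0, c2 >= 0 */
{
    for (int i = 0; i < NFRAMES; i++)
        RC[i] = (RC[i] - c1 >= c2) ? RC[i] - c1 : c2;
}

void age_by_division(long c3)                 /* c3 > 1 */
{
    for (int i = 0; i < NFRAMES; i++)
        RC[i] = RC[i] / c3;
}

int main(void)
{
    RC[0] = 9; RC[1] = 3; RC[2] = 1; RC[3] = 7;
    age_by_subtraction(2, 1);                 /* e.g., C1 = 2, C2 = 1 */
    for (int i = 0; i < NFRAMES; i++) printf("RC[%d] = %ld\n", i, RC[i]);
    return 0;
}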
An overview of the discussed replacement algorithms is given in Figure 11,
which attempts to classify them according to their parameters and the way age
and reference are taken into consideration. Those algorithms that are candidates
for an application in buffer management are emphasized.
Another important criterion to be considered in the replacement decision is
the type of reference to a page, that is, whether a page is read only or modified.
In general, it may be preferable to keep modified pages in the buffer longer (at
least those with high probability of further updates), because their replacement
is expensive (the page itself and the corresponding log information has to be
written). On the other hand, overemphasizing this principle carries the danger
of shrinking the active window for the read-only pages that are kept in the buffer.
In a specific implementation, all algorithms have to be adapted to the particular-
ities of the buffer interface (e.g., fixed pages are not displaceable and pages being
modified have to be forced to disk at the end of the corresponding transaction
when required by the logging mechanism).
With local or page-type-oriented buffer allocation, it is conceivable to combine
various replacement algorithms, tailored to specific characteristics of the refer-
ence behavior. For example, four different reference types, apparently related to
various page types, can be observed in the DBMS INGRES [24]. Hence, further
Fig. 11. Classification of replacement algorithms, according to which references are taken into account during the selection decision (none, only the most recent reference, or all references) and how the age of a page is considered (not at all, since the most recent reference, or since the first reference): RANDOM and FIFO consider no references; LRU, CLOCK, and GCLOCK(V2) consider the most recent reference; LFU, DGCLOCK, GCLOCK(V1), LRD(V1), and LRD(V2) consider all references.
Fig. 12. Characteristics of the DB, transaction load, and logical reference strings.

                                                          MIX40      MIX50
Total number of pages (school DB)                         30,0 1
Number of different pages in the string                    3,553      5,245
Number of logical references                             130,366     99,975
Number of page modifications                               9,378      2,865
Number of pages being fixed (a)             max.              11         10
                                            avg.            4.61       6.26
Percentage of references with FIX           = 1              71%        41%
duration (in logical references)            2-10             22%        41%
                                            > 10              7%        18%
FIX duration (in logical references)        max.           1,786          -
                                            avg.             4.62       6.26
Percentage of pages of a given type         FPA              0.1%
                                            DBTT             6.1%
                                            USER            93.8%
Percentage of references to page types      FPA              0.9%       0.1%
                                            DBTT             9.4%      21.7%
                                            USER            89.7%      78.2%
Percentage of references to the most        1.              12.6%       3.3%
frequently referenced ("hot spot") pages    2.               9.8%       0.5%
                                            3.               4.8%       0.4%
Relative frequency distribution of          > 0.9%               5          1
references to the other pages               0.9%-0.1%          195        293
(number of pages)                           0.1%-0.03%       1,606      2,741
                                            < 0.03%          1,741      2,208
Number of references to shared pages (concurrently fixed) (a)  1,175      359
Cold start buffer fault rate in %                             2.72       5.24
Number of executed transactions             total              262         39
                                            parallel: max.       8          8
                                            parallel: avg.    6.31       6.29

(a) measured with a buffer size of 128 pages
Fig. 13: buffer fault rate versus buffer size (50 to 250 pages) for (a) MIX40 and (b) MIX50; curves: Worst, Random, LRU (local/fixed partitions).

Fig. 14. The buffer fault rates of the CLOCK and LRU algorithms: buffer fault rate versus buffer size (50 to 250 pages) for (a) MIX40 and (b) MIX50; curves: CLOCK, LRU (UNFIX).

Fig. 15: buffer fault rate versus buffer size (50 to 250 pages) for (a) MIX40 and (b) MIX50; curves: DGCLOCK, GCLOCK (V2).

Fig. 16. The buffer fault rates of LRD strategies: buffer fault rate versus buffer size (50 to 250 pages) for (a) MIX40 and (b) MIX50; curves include LRD (V2).

Fig. 17: buffer fault rate versus buffer size (50 to 950 pages) for (a) MIX40 and (b) MIX50, compared with OPT.

Fig. 18: buffer fault rate of Working-Set/LRU versus buffer size (50 to 950 pages) for (a) MIX40 and (b) MIX50, compared with OPT.
be worthwhile for restricted buffer sizes (less than 200 pages (400 K bytes)). In
the range greater than 200 pages, the best algorithms come fairly close to OPT,
so that additional efforts are not justified by the potential gain, at least in our
applications, where we had an average degree of parallelism of less than 7. It is,
however, conceivable that the range of buffer sizes in which further optimization
efforts pay off is enlarged under transaction loads having a higher number of
concurrent transactions [13].
6.2 Page-Type-Oriented Buffer Allocation
The disadvantages of local buffer allocation do not apply to page-type-oriented
buffer allocation. The allocation of dynamic partitions by means of a WS or PFF
algorithm permits flexible and fast adaptation to changes in reference behavior
to the various page types. As compared to global allocation, the selective use of
a particular replacement algorithm on the set of eligible pages of a specific type
is considered to be an extra advantage. Hence, concepts of this kind allow
allocation and replacement algorithms to be tailored to the various types of
reference behavior.
An analysis of page-type-related reference behavior revealed the following
characteristics, considering three different types of pages containing system data
(DBTT/FPA), access path data (tables, pointer arrays, B*-trees, etc.), and rec-
ords. Here pages containing access path data are summarized as TABLE pages,
whereas pages storing data records are called USER pages.
                             distinct pages            number of
                             referenced (in %)          references (in %)
page types                   MIX40       MIX50          MIX40       MIX50
system data (SYSTEM)           4.9         5.0           10.3        21.8
access paths (TABLE)          39.6        39.5           22.5        33.9
records (USER)                55.5        55.5           67.2        44.3
When only two partitions were considered, access path data and records were
put together into a single partition (called USER).
Static partition allocation is straightforward; replacement is always done in
the partition where the buffer fault occurs. The dynamic partition mechanism
works as follows: With N as the total number of buffer frames, at most Np =
0.8 N pages were allocated to partitions (working sets) at a time. Different window
sizes τ were assigned to determine the partition sizes dynamically, according to
the following ratios:

two partitions:   τS/τU = 15/85
three partitions: τS/τT/τU = 10/50/40.

Np is used to determine the various τ directly; for instance, for two partitions,

τS = 15/100 * Np,   τU = 85/100 * Np,

with a suitable lower limit for each τ. For three partitions, a similar assignment
was chosen. Hence, the number of buffer pages eligible for replacement was
at least 0.2 * N.
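For concreteness, the following small C sketch reproduces this sizing arithmetic for the two-partition case; the buffer size of 200 frames is an arbitrary example value.

/* Sketch of the partition sizing used in the experiments: with N buffer
 * frames, Np = 0.8 * N frames are distributed over the working sets,
 * here in the 15/85 ratio of the two-partition case.                   */
#include <stdio.h>

int main(void)
{
    int n = 200;                       /* total buffer size in frames      */
    int np = (int)(0.8 * n);           /* frames available for partitions  */
    int tau_system = 15 * np / 100;    /* window size, SYSTEM partition    */
    int tau_user   = 85 * np / 100;    /* window size, USER partition      */
    printf("Np = %d, tau_s = %d, tau_u = %d, free pool >= %d\n",
           np, tau_system, tau_user, n - np);
    return 0;
}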
The influence of partition size with static buffer allocation is evaluated in
Figure 19. The behavior of two static partitions for system and user data is shown
for various ratios of partition sizes and LRU replacement in each partition. The
different curves confirm the critical nature of the choice of partition size and the
superiority of dynamic allocation. Furthermore, the results indicate the danger
of congestion with small buffer partitions, due to pages being fixed.

Fig. 19. The buffer fault rate for page-type-oriented allocation schemes with two partitions (buffer fault rate versus buffer size, 50 to 250 pages).
Fig. 20. The buffer fault rate for page-type-oriented allocation schemes with three partitions (buffer fault rate versus buffer size, 50 to 250 pages).
An LRU-type replacement is the worst possible strategy for the index partition.
LRD or GCLOCK algorithms with page weights for the roots and upper-level
index pages are better suited to choose the best replacement decision. For data
pages, “toss immediately” or FIFO could be used.
From the results derived in our experiments, it is at least debatable whether
or not page-type-oriented schemes should be chosen. Even with a dynamic
partitioning scheme, the parameters have to be determined to approximate the
page-type reference ratio. In addition, the choice of tailored replacement algo-
rithms becomes complex and susceptible to wrong assumptions. Hence, there is
no convincing evidence that they are distinctly superior to global schemes.
A critical situation arises when all, or nearly all, buffer frames are flagged with
FIX status. This situation is especially threatening with small buffer
sizes. A solution to the problem is to undo the current operation of a transaction,
thereby freeing buffer frames occupied by its (the transaction’s) associated pages.
With a given buffer size, the number of available frames per transaction
decreases as the number of active transactions increases. Unless there is a very
high degree of intertransaction locality, the relative frequency of logical references
leading to physical I/O grows with the number of parallel transactions. Although
the cost of an I/O access remains constant, the total overhead is increased
drastically by the increasing relative frequency of page replacement. In this case,
so-called thrashing [7] can occur, a system state in which almost no useful work
is done. To limit the danger of thrashing, a number of measures are proposed [9,
11]:
- optimization of the replacement algorithm,
- reduction of the costs for replacing a page,
- program restructuring to optimize its reference behavior.
These measures serve to reduce the system overhead and to safely increase the
number of concurrent transactions that can be processed without provoking the
thrashing phenomenon. No guarantee is given that thrashing cannot occur if the
system’s concurrency is further extended. Hence, thrashing can be prevented,
while achieving optimal throughput, only by dynamically limiting the number of
transactions. This implies the close cooperation of transaction and buffer man-
agement in a DBMS in order to perform effective scheduling of transactions and
accurate load control. The same demand is stated for OS: “Memory management
and process scheduling must be closely related activities.” [6]
8. CONCLUSIONS
We have explained the interface requirements of a DBMS buffer manager and
introduced the concept of fixing pages in a buffer to prevent their uncontrolled
replacement. The spectrum of possible strategies for searching the buffer was
then discussed; hash techniques on buffer information tables with overflow
chaining are recommended as the most efficient implementation alternative for
the buffer search function.
Initial experiments have shown that locality in DBMS reference behavior is
much less significant than locality in the reference behavior of programs under
virtual memory operating systems. This motivates the thorough analysis of buffer
allocation and page replacement algorithms with respect to DBMS characteris-
tics.
We have classified different buffer allocation algorithms and explained their
relationship to page replacement algorithms. Buffer allocation and page replace-
ment are considered to be orthogonal, allowing the combination of each allocation
algorithm with an arbitrary replacement algorithm. Since the buffer manager is
implemented in software, it is not restricted to the use of hardware flags available
in a specific virtual machine architecture. This leads to much more freedom in
the design of replacement algorithms. Specifically, we have investigated new
ways to combine the age of a page in the buffer, information about recent
references, and information about page contents (page type) into new replacement
criteria.
In order to evaluate the performance of various allocation and replacement
algorithms in a DBMS environment, we have conducted an empirical study,
using two reference strings from CODASYL DBMS applications. A comparison
of local and global allocation algorithms shows that local allocation with dynamic
partitions leads to buffer fault rates very similar to global LRU replacement;
however, we recommend that local allocation algorithms not be used in an on-line
transaction processing environment, because user think times would freeze pages
in the buffer for long periods of time.
With a global allocation scheme, the adaptation of the buffer contents to the
particular reference behavior is left entirely to the replacement algorithm. LRU
and CLOCK indicate a satisfactory overall behavior; nevertheless, it could be
shown that the LRD and GCLOCK algorithms are also good candidates. Since
none of these global schemes requires explicit buffer allocation, they are easy to
understand and implement. In addition to global and local algorithms, we have
investigated algorithms using page-type information. Different working-set sizes
can be assigned to various page types to reflect their specific kind of reference
behavior.
As a general conclusion, we were able to show the optimization potential of
some of the new algorithms. Since they are parameterized, they can be tailored
to a specific DBMS and application environment. The basic trade-off is the
conceptual simplicity of the old algorithms versus a potential improvement in
performance with the new algorithms.
The following problems are likely to be of interest for future research:
- How can knowledge about the application program be made available for
the prediction of future reference behavior? Some means have to be introduced
to allow the buffer manager to accept “advice” from the query interpretation or
compilation process in order to use the context information of high-level database
languages.
- How does the DBMS-specific locality depend on the degree of concurrency
of transactions? What is the minimum buffer requirement, as a function of the
degree of concurrency?
- How can two-level storage management be generalized for DBMS use? What
additional gain can be expected when another level is introduced in the storage
hierarchy?
ACKNOWLEDGMENTS
We would like to thank Mary Loomis and Andreas Reuter for their many helpful
comments and discussions of this paper. We are grateful to Michael Brunner
and Paul Hirsch for their support of the empirical study. The helpful comments
of the referees are gratefully acknowledged.
REFERENCES
1. BABAOGLU, O., AND JOY, W. Converting a swap-based system to do paging in an architecture lacking page-referenced bits. In Proceedings 8th Symposium on Operating Systems Principles. SIGOPS 15, 5 (Dec. 1981), 78-86.
2. BELADY, L.A. A study of replacement algorithms for virtual storage computers. IBM Syst. J. 5, 2 (1966), 78-101.
3. BRICE, R.S., AND SHERMAN, S.W. An extension of the performance of a database manager in a virtual memory system using partially locked virtual buffers. ACM Trans. Database Syst. 2, 2 (1977), 196-207.
4. CARR, R.W., AND HENNESSY, J.L. WSCLOCK-a simple and effective algorithm for virtual memory management. In Proceedings 8th Symposium on Operating Systems Principles. SIGOPS 15, 5 (Dec. 1981), 87-95.
5. CHU, W.W., AND OPDERBECK, H. Program behavior and the page fault frequency replacement algorithm. Computer 9, 11 (1976), 29-38.
6. DENNING, P.J. The working set model for program behavior. Commun. ACM 11, 5 (1968), 323-333.
7. DENNING, P.J. Thrashing: Its causes and prevention. In AFIPS Conference Proceedings, Vol. 33, FJCC, 1968, 915-922.
8. DENNING, P.J. Working sets past and present. IEEE Trans. Softw. Eng. SE-6, 1 (1980), 64-84.
9. EFFELSBERG, W. Buffer management in database systems. Dissertation, Fachbereich Informatik, Technische Hochschule Darmstadt, 1981 (in German).
10. FERNANDEZ, E.B., LANG, T., AND WOOD, C. Effect of replacement algorithms on a paged buffer database system. IBM J. Res. Dev. 22, 2 (1978), 185-196.
11. FERRARI, D. The improvement of program behavior. Computer 9, 11 (1976), 39-47.
12. HAERDER, T. Embedding a database system in an operating system environment. In Datenbanktechnologie, J. Niedereichholz, Ed. Proceedings II/79 of the German Chapter of the ACM, Teubner-Verlag, Stuttgart, 1979, 9-24 (in German).
13. HOWARD, J.H. Virtual memory buffering. IBM Res. Rep., San Jose, Calif., 1980 (in preparation).
14. LANG, T., WOOD, C., AND FERNANDEZ, E.B. Database buffer paging in virtual storage systems. ACM Trans. Database Syst. 2, 4 (1977), 339-351.
15. MATTSON, R.L., GECSEI, J., SLUTZ, D.R., AND TRAIGER, I.L. Evaluation techniques for storage hierarchies. IBM Syst. J. 9, 2 (1970), 78-117.
16. REITER, A. A study of buffer management policies for data management systems. Tech. Summary Rep. No. 1619, Mathematics Research Center, Univ. of Wisconsin, Madison, Mar. 1976.
17. RODRIGUEZ-ROSELL, J., AND DUPUY, J-P. The design, implementation, and evaluation of a working set dispatcher. Commun. ACM 16, 4 (1973), 247-253.
18. RODRIGUEZ-ROSELL, J. Empirical data reference behavior in database systems. Computer 9, 11 (1976), 9-13.
19. SACCO, G.M., AND SCHKOLNICK, M. A mechanism for managing the buffer pool in a relational database system using the hot set model. In Proceedings 8th Conference on Very Large Data Bases (Mexico, 1982).
20. SHERMAN, S.W., AND BRICE, R.S. Performance of a database manager in a virtual memory system. ACM Trans. Database Syst. 1, 4 (1976), 317-343.
21. SMITH, A.J. Sequentiality and prefetching in database systems. ACM Trans. Database Syst. 3, 3 (1978), 223-247.
22. SPIRN, J. Distance string models for program behavior. Computer 9, 11 (1976), 14-20.
23. SPIRN, J.R., AND DENNING, P.J. Experiments with program locality. In AFIPS Conference Proceedings, Vol. 42, FJCC, 1972, 611-621.
24. STONEBRAKER, M. Operating system support for database management systems. Commun. ACM 24, 7 (1981), 412-418.
25. TUEL, W.G. An analysis of buffer paging in virtual storage systems. IBM J. Res. Dev. 20, 5 (1976), 518-520.
26. TURNER, R., AND STRECKER, B. Use of the LRU stack depth distribution for simulation of paging behavior. Commun. ACM 20, 11 (1977), 795-798.