Appendix F File Organizations and Indexes
Objectives
In this appendix you will learn:
Steps 4.2 and 4.3 of the physical database design methodology presented in
Chapter 18 require the selection of appropriate file organizations and indexes for
the base relations that have been created to represent the part of the enterprise
being modeled. In this appendix we introduce the main concepts regarding the
physical storage of the database on secondary storage devices such as magnetic
disks and optical disks. The computer’s primary storage (that is, main memory)
is inappropriate for storing the database. Although the access times for primary
storage are much faster than those for secondary storage, primary storage is not
large or reliable enough to store the quantity of data that a typical database might
require. Because the data stored in primary storage disappears when power is lost,
we refer to primary storage as volatile storage. In contrast, the data on secondary
storage persists through power loss, and is consequently referred to as nonvolatile
storage. In addition, the cost of storage per unit of data is an order of magnitude
greater for primary storage than for disk storage.
The order in which records are stored and accessed in the file is dependent on
the file organization.
Access method The steps involved in storing and retrieving records from a file.
Because some access methods can be applied only to certain file organizations
(for example, we cannot apply an indexed access method to a file without an
index), the terms file organization and access method are used interchangeably. In the
remainder of this appendix, we discuss the main types of file organization and
access techniques and provide guidelines for their use.
field is also a key of the file, and therefore guaranteed to have a unique value in
each record, the field is also called the ordering key for the file. For example,
consider the following SQL query:
SELECT *
FROM Staff
ORDER BY staffNo;
If the tuples of the Staff relation are already ordered according to the ordering
field staffNo, it should be possible to reduce the execution time for the query, as no
sorting is necessary. (Although in Section 4.2 we stated that tuples are unordered,
this applies as an external (logical) property, not as an implementation or physical
property. There will always be a first record, second record, and nth record.) If the
tuples are ordered on staffNo, under certain conditions we can use a binary search
to execute queries that involve a search condition based on staffNo. For example,
consider the following SQL query:
SELECT *
FROM Staff
WHERE staffNo = 'SG37';
If we use the sample tuples shown in Figure F.1 and for simplicity assume there is
one record per page, we would get the ordered file shown in Figure F.3. The
binary search proceeds as follows:
(1) Retrieve the mid-page of the file. Check whether the required record is between
the first and last records of this page. If so, the required record lies on this page
and no more pages need to be retrieved.
(2) If the value of the key field in the first record on the page is greater than the
required value, the required value, if it exists, occurs on an earlier page.
Therefore, we repeat the previous steps using the lower half of the file as the
new search area.
(3) If the value of the key field in the last record on the page is less than the
required value, the required value occurs on a later page, and so we repeat the
previous steps using the top half of the file as the new search area. In this way,
half the search space is eliminated from the search with each page retrieved.
In our case, the middle page is page 3, and the record on the retrieved page (SG14)
does not equal the one we want (SG37). The value of the key field in page 3 is
less than the one we want, so we can discard the first half of the file from the
search. We now retrieve the mid-page of the top half of the file, that is, page 5.
This time the value of the key field (SL21) is greater than SG37, which enables us
to discard the top half of this search space. We now retrieve the mid-page of the
remaining search space, that is, page 4, which holds the record we want.
Figure F.3 Binary search on an ordered file.
In general, the binary search is more efficient than a linear search. However,
binary search is applied more frequently to data in primary storage than to data
in secondary storage.
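The three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a DBMS implementation: the file is modeled as an in-memory list of pages, the helper names staff_key and binary_search_pages are hypothetical, and staff numbers are assumed to sort by letter prefix and then by numeric suffix, as in the ordered file described in the text.

```python
import re

def staff_key(staff_no):
    """Order staff numbers by letter prefix, then numeric suffix (SG5 < SG14)."""
    prefix, digits = re.fullmatch(r"([A-Z]+)(\d+)", staff_no).groups()
    return (prefix, int(digits))

def binary_search_pages(pages, target):
    """Return (page_number, pages_read); page_number is None if not found.

    pages is a list of pages, each a non-empty list of staffNo values in
    ascending order. Page numbers are 1-based, as in the text.
    """
    low, high = 0, len(pages) - 1
    want = staff_key(target)
    pages_read = []
    while low <= high:
        mid = (low + high) // 2
        page = pages[mid]
        pages_read.append(mid + 1)
        if want < staff_key(page[0]):
            high = mid - 1        # step (2): continue in the lower half
        elif want > staff_key(page[-1]):
            low = mid + 1         # step (3): continue in the top half
        else:
            # step (1): the record, if it exists, lies on this page
            return (mid + 1 if target in page else None), pages_read
    return None, pages_read

# One record per page, as assumed in the worked example.
pages = [["SA9"], ["SG5"], ["SG14"], ["SG37"], ["SL21"], ["SL41"]]
result = binary_search_pages(pages, "SG37")
print(result)  # (4, [3, 5, 4])
```

The pages read (3, then 5, then 4) match the walk-through above: each retrieval halves the remaining search space.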
Inserting and deleting records in a sorted file are problematic because the order
of records has to be maintained. To insert a new record, we must find the correct
position in the ordering for the record and then find space to insert it. If there is
sufficient space in the required page for the new record, then the single page can
be reordered and written back to disk. If this is not the case, then it would be
necessary to move one or more records on to the next page. Again, the next page may
have no free space and the records on this page must be moved, and so on.
Inserting a record near the start of a large file could be very time-consuming.
One solution is to create a temporary unsorted file, called an overflow (or
transaction) file. Insertions are added to the overflow file, and periodically, the
overflow file is merged with the main sorted file. This makes insertions very efficient,
but has a detrimental effect on retrievals. If the record is not found during the
binary search, the overflow file has to be searched linearly. Conversely, to delete a
record we must reorganize the records to remove the now free slot.
Ordered files are rarely used for database storage unless a primary index is
added to the file (see Section F.5.1).
F.4 Hash Files
Figure F.4 Collision resolution using open addressing.
occurred (the records are called synonyms). In this situation, we must insert the
new record in another position, because its hash address is occupied. Collision
management complicates hash file management and degrades overall performance.
There are several techniques that can be used to manage collisions:
• open addressing
• unchained overflow
• chained overflow
• multiple hashing
Open addressing
If a collision occurs, the system performs a linear search to find the first available
slot to insert the new record. When the last bucket has been searched, the system
starts back at the first bucket. Searching for a record employs the same technique
used to store a record, except that the record is considered not to exist when an
unused slot is encountered before the record has been located. For example,
assume we have a trivial hash function that takes the digits of the staff number
MOD 3, as shown in Figure F.4. Each bucket has two slots and staff records SG5
and SG14 hash to bucket 2. When record SL41 is inserted, the hash function generates
an address corresponding to bucket 2. As there are no free slots in bucket 2,
the system searches for the first free slot, which it finds in bucket 1, after looping
back and searching bucket 0.
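A minimal Python sketch of open addressing follows, using the trivial hash function from the example (digits of the staff number MOD 3) and two slots per bucket. The function names and the insertion order are assumptions made for illustration, and deletion is deliberately left out because it would need extra bookkeeping that the text does not discuss.

```python
import re

NUM_BUCKETS = 3
SLOTS_PER_BUCKET = 2

def bucket_of(staff_no):
    """Trivial hash from the text: digits of the staff number MOD 3."""
    return int(re.sub(r"\D", "", staff_no)) % NUM_BUCKETS

def insert(buckets, staff_no):
    """Store staff_no in its home bucket or the next bucket with a free slot."""
    home = bucket_of(staff_no)
    for step in range(NUM_BUCKETS):
        b = (home + step) % NUM_BUCKETS   # wrap around after the last bucket
        if len(buckets[b]) < SLOTS_PER_BUCKET:
            buckets[b].append(staff_no)
            return b
    raise RuntimeError("hash file is full")

def find(buckets, staff_no):
    """Probe as insert does; an unused slot means the record does not exist."""
    home = bucket_of(staff_no)
    for step in range(NUM_BUCKETS):
        b = (home + step) % NUM_BUCKETS
        if staff_no in buckets[b]:
            return b
        if len(buckets[b]) < SLOTS_PER_BUCKET:
            return None
    return None

buckets = [[] for _ in range(NUM_BUCKETS)]
# Assumed insertion order; SL41 arrives last, as in the example.
placed = {s: insert(buckets, s)
          for s in ["SL21", "SG37", "SG14", "SA9", "SG5", "SL41"]}
print(placed["SL41"])  # 1
```

SL41 hashes to bucket 2, finds it full, wraps to bucket 0 (also full), and is stored in bucket 1, which is the outcome described above.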
Unchained overflow
Instead of searching for a free slot, an overflow area is maintained for collisions
that cannot be placed at the hash address. Figure F.5 shows how the collision
illustrated in Figure F.4 would be handled using an overflow area. In this case,
instead of searching for a free slot for record SL41, the record is placed in the
overflow area. At first sight, this may appear not to offer much performance
improvement. However, using open addressing, colliding records are placed in the
first free slot, potentially causing additional collisions in the future with records
that hash to the address of that slot. Thus, the number of collisions that occur is
increased and performance is degraded. On the other hand, if we can minimize
the number of collisions, it will be faster to perform a linear search on a smaller
overflow area.
Figure F.5 Collision resolution using overflow.
Figure F.6 Collision resolution using chained overflow.
Chained overflow
As with the previous technique, an overflow area is maintained for collisions that
cannot be placed at the hash address. However, with this technique each bucket
has an additional field, sometimes called a synonym pointer, that indicates
whether a collision has occurred and, if so, points to the overflow page used. If the
pointer is zero, no collision has occurred. In Figure F.6, bucket 2 points to
overflow bucket 3; buckets 0 and 1 have a 0 pointer to indicate that there have been no
collisions with these buckets yet.
A variation of this technique provides faster access to the overflow record by
using a synonym pointer that points to a slot address within the overflow area
rather than a bucket address. Records in the overflow area also have a synonym
pointer that gives the address in the overflow area of the next synonym for the
same target address, so that all synonyms for a particular address can be retrieved
by following a chain of pointers.
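The bucket-level form of chained overflow can be sketched as follows. This is an illustrative model only: buckets are Python dictionaries, the field name next stands in for the synonym pointer, and overflow buckets are appended after the three primary buckets so that the first one is numbered 3, as in the description of Figure F.6. A pointer value of 0 means no overflow, which works here because bucket 0 is a primary bucket and is never an overflow target.

```python
import re

PRIMARY_BUCKETS = 3   # primary buckets 0..2; overflow buckets start at 3
SLOTS_PER_BUCKET = 2

def home_bucket(staff_no):
    """Same trivial hash as before: digits of the staff number MOD 3."""
    return int(re.sub(r"\D", "", staff_no)) % PRIMARY_BUCKETS

def new_bucket():
    return {"records": [], "next": 0}   # next == 0: no collision so far

def insert(hash_file, staff_no):
    """Follow the synonym pointers until a bucket with a free slot is found."""
    b = home_bucket(staff_no)
    while len(hash_file[b]["records"]) >= SLOTS_PER_BUCKET:
        if hash_file[b]["next"] == 0:
            hash_file.append(new_bucket())          # allocate overflow bucket
            hash_file[b]["next"] = len(hash_file) - 1
        b = hash_file[b]["next"]
    hash_file[b]["records"].append(staff_no)
    return b

def find(hash_file, staff_no):
    """Search the home bucket, then each bucket on its overflow chain."""
    b = home_bucket(staff_no)
    while True:
        if staff_no in hash_file[b]["records"]:
            return b
        if hash_file[b]["next"] == 0:
            return None
        b = hash_file[b]["next"]

hash_file = [new_bucket() for _ in range(PRIMARY_BUCKETS)]
where = {s: insert(hash_file, s)
         for s in ["SL21", "SG37", "SG14", "SA9", "SG5", "SL41"]}
print(where["SL41"], hash_file[2]["next"])  # 3 3
```

Unlike open addressing, the colliding record SL41 does not occupy a slot in bucket 1, so later records that hash to bucket 1 are not displaced.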
Multiple hashing
An alternative approach to collision management is to apply a second hashing
function if the first one results in a collision. The aim is to produce a new hash
address that will avoid a collision. The second hashing function is generally used
to place records in an overflow area.
With hashing, a record can be located efficiently by first applying the hash
function and, if a collision has occurred, using one of these approaches to locate
its new address. To update a hashed record, the record first has to be located. If
the field to be updated is not the hash key, the update can take place and the
record written back to the same slot. However, if the hash field is being updated,
the hash function has to be applied to the new value. If a new hash address is
generated, the record has to be deleted from its current slot and stored at its new
address.
Figure F.7 Example of extendible hashing: (a) after insert of SL21 and SG37; (b) after insert of SG14; (c) after insert of SA9.
h(r_min) < h(r_max). In addition, hashing is inappropriate for retrievals based on a field
other than the hash field. For example, if the Staff table is hashed on staffNo, then
hashing could not be used to search for a record based on the lName field. In this
case, it would be necessary to perform a linear search to find the record, or add
lName as a secondary index (see Section F.5.3).
F.5 Indexes
In this section we discuss techniques for making the retrieval of data more efficient
using indexes.
Index A data structure that allows the DBMS to locate particular records in a
file more quickly and thereby speed response to user queries.
records in a file. As in the book index analogy, the index is ordered, and each
index entry contains the item required and one or more locations (record
identifiers) where the item can be found.
Although indexes are not strictly necessary to use the DBMS, they can have a
significant impact on performance. As with the book index, we could find the
desired keyword by looking through the entire book, but this approach would be
tedious and time-consuming. Having an index at the back of the book in
alphabetical order of keyword allows us to go directly to the page or pages we want.
An index structure is associated with a particular search key and contains
records consisting of the key value and the address of the logical record in the file
containing the key value. The file containing the logical records is called the data
file and the file containing the index records is called the index file. The values in
the index file are ordered according to the indexing field, which is usually based on
a single attribute.
Figure F.8 Indexes on the Staff table: (a) (salary, branchNo) and salary; (b) (branchNo, salary) and branchNo.
IBM’s Indexed Sequential Access Method (ISAM) uses this structure and is closely
related to the underlying hardware characteristics. Periodically, these types of file
need reorganizing to maintain efficiency. Reorganization is not only expensive but
also makes the file unavailable while it takes place. The later development, Virtual
Storage Access Method (VSAM), is an improvement on ISAM, in that it is
hardware-independent. There is no separate designated overflow area, but there
is space allocated in the data area to allow for expansion. As the file grows and
shrinks, the process is handled dynamically without the need for periodic
reorganization. Figure F.9(a) illustrates a dense index on a sorted file of Staff records.
However, as the records in the data file are sorted, we can reduce the index to a
sparse index as shown in Figure F.9(b).
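The difference between a dense and a sparse index can be illustrated with a short Python sketch. The page layout (two records per page), the choice to store the first key of each page in the sparse index, and all names below are assumptions for illustration; they are not taken from Figure F.9.

```python
import re
from bisect import bisect_right

def staff_key(staff_no):
    """Order staff numbers by letter prefix, then numeric suffix."""
    prefix, digits = re.fullmatch(r"([A-Z]+)(\d+)", staff_no).groups()
    return (prefix, int(digits))

# Assumed layout: data file sorted on staffNo, two records per page.
data_pages = [["SA9", "SG5"], ["SG14", "SG37"], ["SL21", "SL41"]]

# Dense index: one entry (key, page number) for every record.
dense = [(staff_key(r), p) for p, page in enumerate(data_pages) for r in page]

# Sparse index: one entry per page, holding the first key on that page.
sparse = [(staff_key(page[0]), p) for p, page in enumerate(data_pages)]

def lookup_sparse(staff_no):
    """Find the last index entry with key <= target, then search that page."""
    keys = [k for k, _ in sparse]
    i = bisect_right(keys, staff_key(staff_no)) - 1
    if i < 0:
        return None                       # smaller than every key in the file
    page_no = sparse[i][1]
    return page_no if staff_no in data_pages[page_no] else None

print(len(dense), len(sparse))  # 6 3
print(lookup_sparse("SG37"))    # 1
```

The sparse index has one entry per page rather than one per record, which is only possible because the data file itself is sorted on the indexing field.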
Typically, a large part of a primary index can be stored in main memory and
processed faster. Access methods, such as the binary search method discussed in
Section F.3, can be used to further speed up the access. The main disadvantage of
using a primary index, as with any sorted file, is maintaining the order as we insert
and delete records. These problems are compounded as we have to maintain the
sorted order in the data file and in the index file. One method that can be used is
the maintenance of an overflow area and chained pointers, similar to the technique
described in Section F.4 for the management of collisions in hash files.
Figure F.9 Example of dense and sparse indexes: (a) dense index; (b) sparse index.
Figure F.10 Example of a multilevel index.
an access key value and a page address. The stored access key value is the highest
in the addressed page.
To locate a record with a specified staffNo value, say SG14, we start from the
second-level index and search the page for the first access key value that is greater
than or equal to SG14, in this case SG37 (recall that each stored access key value is the
highest in the page it addresses). This record contains an address to the first-level index
page to continue the search. Repeating the process leads to page 2 in the data file,
where the record is stored. If a range of staffNo values had been specified, we could
use the same process to locate the first record in the data file corresponding to the
lower range value. As the records in the data file are sorted on staffNo, we can find
the remaining records in the range by reading serially through the data file.
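The lookup through a two-level index can be sketched as below. Because each index entry stores the highest key in the page it addresses, the search at each level picks the first entry whose key is greater than or equal to the target. The page contents are an assumed layout chosen to agree with the example (SG14 on data page 2, reached through the second-level entry SG37); they are not copied from Figure F.10, and the function names are hypothetical.

```python
import re
from bisect import bisect_left

def staff_key(staff_no):
    """Order staff numbers by letter prefix, then numeric suffix."""
    prefix, digits = re.fullmatch(r"([A-Z]+)(\d+)", staff_no).groups()
    return (prefix, int(digits))

# Assumed layout: two records per data page, pages numbered from 1.
data_pages = {1: ["SA9", "SG5"], 2: ["SG14", "SG37"], 3: ["SL21", "SL41"]}
# First-level index pages: (highest key in data page, data page number).
first_level = {1: [("SG5", 1), ("SG37", 2)], 2: [("SL41", 3)]}
# Second-level index: (highest key in first-level page, first-level page number).
second_level = [("SG37", 1), ("SL41", 2)]

def follow(entries, staff_no):
    """Return the address held by the first entry whose key is >= staff_no."""
    keys = [staff_key(k) for k, _ in entries]
    i = bisect_left(keys, staff_key(staff_no))
    return entries[i][1] if i < len(entries) else None

def locate(staff_no):
    """Return the data page holding staff_no, or None if it is not stored."""
    index_page = follow(second_level, staff_no)
    if index_page is None:
        return None
    data_page = follow(first_level[index_page], staff_no)
    if data_page is None or staff_no not in data_pages[data_page]:
        return None
    return data_page

print(locate("SG14"))  # 2
```

For a range query, the same descent would locate the page for the lower bound, after which the sorted data pages can simply be read in sequence.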
IBM’s ISAM is based on a two-level index structure. Insertion is handled by
overflow pages, as discussed in Section F.4. In general, an n-level index can be
built, although three levels are common in practice; a file would have to be very
large to require more than three levels. In the following section we discuss a
particular type of multilevel dense index called a B+-tree.
F.5.5 B+-trees
Many DBMSs use a data structure called a tree to hold data or indexes. A tree
consists of a hierarchy of nodes. Each node in the tree, except the root node, has one
parent node and zero or more child nodes. A root node has no parent. A node that
does not have any children is called a leaf node.
The depth of a tree is the maximum number of levels between the root node
and a leaf node in the tree. Depth may vary across different paths from root to leaf,
or depth may be the same from the root node to each leaf node, producing a tree
called a balanced tree, or B-tree (Bayer and McCreight, 1972; Comer, 1979). The
degree, or order, of a tree is the maximum number of children allowed per parent.
Large degrees, in general, create broader, shallower trees. Because access time
in a tree structure depends more often upon depth than on breadth, it is usually
advantageous to have “bushy,” shallow trees. A binary tree has order 2: each node
has no more than two children. The rules for a B+-tree are as follows:
• If the root is not a leaf node, it must have at least two children.
• For a tree of order n, each node (except the root and leaf nodes) must have between
n/2 and n pointers and children. If n/2 is not an integer, the result is rounded up.
• For a tree of order n, the number of key values in a leaf node must be between
(n − 1)/2 and (n − 1). If (n − 1)/2 is not an integer, the result is rounded up.
• The number of key values contained in a nonleaf node is 1 less than the number
of pointers.
• The tree must always be balanced: that is, every path from the root node to a leaf
must have the same length.
• Leaf nodes are linked in order of key values.
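The occupancy rules above reduce to simple arithmetic. The following Python sketch computes the bounds for a tree of order n; the function name bplus_bounds and the dictionary keys are hypothetical, and the sketch covers only the counting rules, not balancing or leaf linkage.

```python
from math import ceil

def bplus_bounds(n):
    """Occupancy limits for a B+-tree of order n, following the rules above."""
    return {
        # Nonleaf, nonroot nodes: between n/2 (rounded up) and n pointers.
        "nonleaf_pointers": (ceil(n / 2), n),
        # Leaf nodes: between (n - 1)/2 (rounded up) and n - 1 key values.
        "leaf_keys": (ceil((n - 1) / 2), n - 1),
        # A nonleaf node holds one fewer key value than pointers.
        "max_nonleaf_keys": n - 1,
    }

print(bplus_bounds(3))
# {'nonleaf_pointers': (2, 3), 'leaf_keys': (1, 2), 'max_nonleaf_keys': 2}
```

For order 3, for example, a leaf may hold one or two key values and an interior node two or three pointers.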
Figure F.11 represents an index on the staffNo field of the Staff table in Figure F.1
as a B+-tree of order 3 (each node holds up to three pointers and two key values).
Each node is of the form:
• keyValue1 • keyValue2 •
where • can be blank or represent a pointer to another record. If the search key
value is less than or equal to keyValue_i, the pointer to the left of keyValue_i is used
to find the next node to be searched; otherwise, the pointer at the end of the node
is used. For example, to locate SL21, we start from the root node. SL21 is greater
than SG14, so we follow the pointer to the right, which leads to the second-level
node containing the key values SG37 and SL21. We follow the pointer to the left
of SL21, which leads to the leaf node containing the address of record SL21.
In practice, each node in the tree is actually a page, so we can store more than
three pointers and two key values. If we assume that a page has 4096 bytes, each
pointer is 4 bytes long, the staffNo field requires 4 bytes of storage, and each page
has a 4-byte pointer to the next node on the same level, we could store
(4096 − 4)/(4 + 4) = 511 index records per page (rounding down). The B+-tree would
be order 512. The root can store 511 records and can have 512 children. Each child
can also store 511 records, giving a total of 261,632 records. Each child can also have
512 children, giving a total of 262,144 children on level 2 of the tree. Each of these
children can have 511 records, giving a total of 133,955,584. This gives a theoretical
maximum number of index records as:
root:     511
Level 1:  261,632
Level 2:  133,955,584
TOTAL:    134,217,727
Thus, we could randomly access one record in the Staff file containing 134,217,727
records within four disk accesses (in fact, the root would normally be stored in main
memory, so there would be one fewer disk access). In practice, however, the number
of records held in each page would be smaller, as not all pages would be full (see
Figure F.11).
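The capacity calculation can be checked with a few lines of Python. The constant and function names are hypothetical; the sizes are the ones assumed in the text (4096-byte pages, 4-byte pointers, 4-byte staffNo values, and one 4-byte sibling pointer per page).

```python
PAGE_BYTES = 4096
POINTER_BYTES = 4
KEY_BYTES = 4
SIBLING_POINTER_BYTES = 4   # pointer to the next node on the same level

# Index records per page, rounded down to a whole number of records.
records_per_page = (PAGE_BYTES - SIBLING_POINTER_BYTES) // (POINTER_BYTES + KEY_BYTES)
order = records_per_page + 1            # one more pointer than key values

def records_per_level(levels):
    """Maximum index records at each level of a completely full tree."""
    counts = []
    nodes = 1                           # the root
    for _ in range(levels):
        counts.append(nodes * records_per_page)
        nodes *= order                  # every node has `order` children
    return counts

levels = records_per_level(3)
print(records_per_page, order)  # 511 512
print(levels)                   # [511, 261632, 133955584]
print(sum(levels))              # 134217727
```

The total equals 512 cubed minus 1, which shows how quickly capacity grows with each additional level of a high-order tree.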
A B+-tree always takes approximately the same time to access any data record, by
ensuring that the same number of nodes is searched: in other words, by ensuring that
the tree has a constant depth. Because the B+-tree is a dense index, every record is
addressed by the index, so there is no requirement for the data file to be sorted; for
example, it could be stored as a heap file. However, balancing can be costly to
maintain as the tree contents are updated. Figure F.12 provides a worked example of
how a B+-tree would be maintained as records are inserted using the order of the
records in Figure F.1.
Figure F.12(a) shows the construction of the tree after the insertion of the first
two records SL21 and SG37. The next record to be inserted is SG14. The node is
full, so we must split the node by moving SL21 to a new node. In addition, we
create a parent node consisting of the rightmost key value of the left node, as shown
in Figure F.12(b). The next record to be inserted is SA9. SA9 should be located to
the left of SG14, but again the node is full. We split the node by moving SG37 to
a new node. We also move SG14 to the parent node, as shown in Figure F.12(c).
The next record to be inserted is SG5. SG5 should be located to the right of SA9,
but again the node is full. We split the node by moving SG14 to a new node and
add SG5 to the parent node. However, the parent node is also full and has to be
split. In addition, a new parent node has to be created, as shown in Figure F.12(d).
Finally, record SL41 is added to the right of SL21, as shown in Figure F.11.
Figure F.12 Insertions into a B+-tree index: (a) after insertion of SL21, SG37; (b) after
insertion of SG14; (c) after insertion of SA9; (d) after insertion of SG5.
Figure F.13 (a) Staff relation; (b) bitmap indexes on the position and branchNo attributes.
In this case, we can take the third bit vector for position and perform a bitwise AND
with the first bit vector for branchNo to obtain a bit vector that has a 1 for every
Supervisor who works at branch ‘B003’.
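The bitwise AND can be illustrated with a small Python sketch. The six Staff rows below are sample values assumed for illustration (Figure F.13 is not reproduced here), bit vectors are modeled as lists of 0s and 1s, and the function names are hypothetical.

```python
# Assumed sample rows: (staffNo, position, branchNo).
staff = [
    ("SL21", "Manager", "B005"),
    ("SG37", "Assistant", "B003"),
    ("SG14", "Supervisor", "B003"),
    ("SA9", "Assistant", "B007"),
    ("SG5", "Manager", "B003"),
    ("SL41", "Assistant", "B005"),
]

def build_bitmap_index(rows, column):
    """Map each distinct value of the column to a bit vector over the rows."""
    index = {}
    for i, row in enumerate(rows):
        index.setdefault(row[column], [0] * len(rows))[i] = 1
    return index

def bitwise_and(a, b):
    """AND two equal-length bit vectors position by position."""
    return [x & y for x, y in zip(a, b)]

position_index = build_bitmap_index(staff, 1)
branch_index = build_bitmap_index(staff, 2)

hits = bitwise_and(position_index["Supervisor"], branch_index["B003"])
matching = [staff[i][0] for i, bit in enumerate(hits) if bit]
print(hits)      # [0, 0, 1, 0, 0, 0]
print(matching)  # ['SG14']
```

Each 1 in the result identifies a row that satisfies both conditions, so the base table is consulted only for those rows.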
Figure F.14 (a) Branch and PropertyForRent relations; (b) join index on the nonkey city attribute.
but it could have been sorted on any of the three attributes. Sometimes two join
indexes are created, one as shown and one with the two rowID attributes reversed.
This type of query could be common in data warehousing applications when
attempting to find out facts about related pieces of data (in this case, we are
attempting to find how many properties come from a city that has an existing
branch). The join index precomputes the join of the Branch and PropertyForRent
relations based on the city attribute, thereby removing the need to perform the join
each time the query is run, and improving the performance of the query. This
could be particularly important if the query has a high frequency. Oracle combines
the bitmap index and the join index to provide a bitmap join index.
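A join index can be pictured as a precomputed list of matching row identifiers. The sketch below builds one on city for a handful of rows; the row identifiers, the sample rows (including the property PX1 in a city with no branch), and the variable names are all assumptions made for illustration and are not taken from Figure F.14.

```python
# Assumed sample data: rowID -> (key, city).
branch = {
    101: ("B005", "London"),
    102: ("B007", "Aberdeen"),
    103: ("B003", "Glasgow"),
}
property_for_rent = {
    201: ("PA14", "Aberdeen"),
    202: ("PL94", "London"),
    203: ("PG4", "Glasgow"),
    204: ("PG36", "Glasgow"),
    205: ("PX1", "Paris"),      # hypothetical row: no branch in this city
}

# Precompute the join once: (branchRowID, propertyRowID, city),
# sorted here on the branch rowID.
join_index = sorted(
    (b_rid, p_rid, b_city)
    for b_rid, (_, b_city) in branch.items()
    for p_rid, (_, p_city) in property_for_rent.items()
    if b_city == p_city
)

# The query "how many properties come from a city that has a branch?"
# is now answered from the index alone, without re-running the join.
properties_with_branch = len({p_rid for _, p_rid, _ in join_index})
print(join_index)
print(properties_with_branch)  # 4
```

The cost of the nested comparison is paid once when the index is built (and again when it is maintained), rather than on every execution of the query.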
F.6 Clustered and Nonclustered Tables
Figure F.15 How the Branch and Staff tables would be stored clustered on branchNo.
Clusters can improve performance of data retrieval, depending on the data
distribution and what SQL operations are most often performed on the data. In
particular, tables that are joined in a query benefit from the use of clusters,
because the records common to the joined tables are retrieved with the same I/O
operation.
To create an indexed cluster in Oracle called BranchIndexedCluster with the cluster
key column branchNo, we could use the following SQL statement:
CREATE CLUSTER BranchIndexedCluster
(branchNo CHAR(4))
SIZE 512
STORAGE (INITIAL 100K NEXT 50K PCTINCREASE 10);
The SIZE parameter specifies the amount of space (in bytes) to store all records
with the same cluster key value. The size is optional and, if omitted, Oracle
reserves one data block for each cluster key value. The INITIAL parameter
specifies the size (in bytes) of the cluster’s first extent, and the NEXT parameter
specifies the size (in bytes) of the next extent to be allocated. The PCTINCREASE
parameter specifies the percentage by which the third and subsequent extents
grow over the preceding extent (default 50). In our example, we have specified
that each subsequent extent should be 10% larger than the preceding extent.
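The effect of the three STORAGE parameters can be worked through numerically. The Python sketch below is a simplified model based only on the description above: the function name is hypothetical, sizes are in kilobytes, and the sketch ignores any rounding to whole data blocks or other adjustments that Oracle may apply.

```python
def extent_sizes_kb(initial_kb, next_kb, pctincrease, count):
    """Sizes of the first `count` extents under the simplified model.

    Extent 1 is INITIAL, extent 2 is NEXT, and each later extent is
    PCTINCREASE percent larger than the one before it.
    """
    sizes = [initial_kb]
    size = next_kb
    for _ in range(count - 1):
        sizes.append(size)
        size = size * (100 + pctincrease) / 100
    return sizes

# STORAGE (INITIAL 100K NEXT 50K PCTINCREASE 10)
print(extent_sizes_kb(100, 50, 10, 4))  # [100, 50, 55.0, 60.5]
```

With the default PCTINCREASE of 50, the same model gives extents of 100K, 50K, 75K, and 112.5K, which shows why a smaller percentage produces more gradual growth.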
F.7 Guidelines for Selecting File Organizations
Heap (unordered)
The heap file organization is discussed in Appendix F.2. Heap is a good storage
structure in the following situations:
(1) When data is being bulk-loaded into the relation. For example, to populate a
relation after it has been created, a batch of tuples may have to be inserted into
the relation. If heap is chosen as the initial file organization, it may be more
efficient to restructure the file after the insertions have been completed.
(2) When the relation is only a few pages long. In this case, the time to locate any
tuple is short, even if the entire relation has to be searched serially.
(3) When every tuple in the relation has to be retrieved (in any order) every time the
relation is accessed. For example, retrieve the addresses of all properties for rent.
(4) When the relation has an additional access structure, such as an index key,
heap storage can be used to conserve space.
Heap files are inappropriate when only selected tuples of a relation are to be accessed.
Hash
The hash file organization is discussed in Appendix F.4. Hash is a good storage
structure when tuples are retrieved based on an exact match on the hash field value,
particularly if the access order is random. For example, if the PropertyForRent relation
is hashed on propertyNo, retrieval of the tuple with propertyNo equal to PG36 is efficient.
However, hash is not a good storage structure in the following situations:
(1) When tuples are retrieved based on a pattern match of the hash field value.
For example, retrieve all properties whose property number, propertyNo, begins
with the characters “PG.”
(2) When tuples are retrieved based on a range of values for the hash field. For
example, retrieve all properties with a rent in the range 300–500.
(3) When tuples are retrieved based on a field other than the hash field. For
example, if the Staff relation is hashed on staffNo, then hashing cannot be used
to search for a tuple based on the lName attribute. In this case, it would be
necessary to perform a linear search to find the tuple, or add lName as a secondary
index (see Step 4.3).
(4) When tuples are retrieved based on only part of the hash field. For example,
if the PropertyForRent relation is hashed on rooms and rent, then hashing cannot
be used to search for a tuple based on the rooms attribute alone. Again, it would
be necessary to perform a linear search to find the tuple.
(5) When the hash field is frequently updated. When a hash field is updated, the
DBMS must delete the entire tuple and possibly relocate it to a new address (if
the hash function results in a new address). Thus, frequent updating of the
hash field affects performance.
B+-tree
The B+-tree file organization is discussed in Appendix F.5.5. Like ISAM, the B+-tree
is a more versatile storage structure than hashing. It supports retrievals based on
exact key match, pattern matching, range of values, and part key specification.
The B+-tree index is dynamic, growing as the relation grows. Thus, unlike
ISAM, the performance of a B+-tree file does not deteriorate as the relation is
updated. The B+-tree also maintains the order of the access key even when the
file is updated, so retrieval of tuples in the order of the access key is more
efficient than with ISAM. However, if the relation is not frequently updated, the ISAM
structure may be more efficient, as it has one fewer level of index than the
B+-tree, whose leaf nodes contain pointers to the actual tuples rather than the
tuples themselves.
Clustered tables
Some DBMSs, for example Oracle, support clustered tables (see Appendix F.6).
The choice of whether to use a clustered or nonclustered table depends on the
analysis of the transactions undertaken previously, but the choice can have an
impact on performance. Below, we provide guidelines for the use of clustered
tables. Note that in this section we use the Oracle terminology, which refers to a
relation as a table with columns and rows.
Clusters are groups of one or more tables physically stored together because
they share common columns and are often used together. With related rows being
physically stored together, disk access time is improved. The related columns of
the tables in a cluster are called the cluster key. The cluster key is stored only once,
so clusters store a set of tables more efficiently than if the tables were stored
individually (not clustered). Oracle supports two types of clusters: indexed clusters and
hash clusters.
(a) Indexed clusters In an indexed cluster, rows with the same cluster key are
stored together. Oracle suggests using indexed clusters when:
• queries retrieve rows over a range of cluster key values;
• clustered tables may grow unpredictably.
The following guidelines may be helpful when deciding whether to cluster tables:
• Consider clustering tables that are often accessed in join statements.
• Do not cluster tables if they are joined only occasionally or their common column
values are modified frequently. (Modifying a row’s cluster key value takes
longer than modifying the value in an unclustered table, because Oracle may
have to migrate the modified row to another block to maintain the cluster.)
• Do not cluster tables if a full search of one of the tables is often required. (A
full search of a clustered table can take longer than a full search of an unclustered
table. Oracle is likely to read more blocks, because the tables are stored
together.)
• Consider clustering tables involved in a one-to-many (1:*) relationship if a row
is often selected from the parent table and then the corresponding rows from the
child table. (Child rows are stored in the same data block(s) as the parent row,
so they are likely to be in memory when selected, requiring Oracle to perform
less I/O.)
• Consider storing a child table alone in a cluster if many child rows are selected
from the same parent. (This measure improves the performance of queries that
select child rows of the same parent but does not decrease the performance of a
full search of the parent table.)
• Do not cluster tables if the data from all tables with the same cluster key value
exceeds one or two Oracle blocks. (To access a row in a clustered
table, Oracle reads all blocks containing rows with that value. If these rows
occupy multiple blocks, accessing a single row could require more reads than
accessing the same row in an unclustered table.)
(b) Hash clusters Hash clusters also cluster table data in a manner similar to
indexed clusters. However, a row is stored in a hash cluster based on the result of
applying a hash function to the row’s cluster key value. All rows with the same hash
key value are stored together on disk. Oracle suggests using hash clusters when:
• queries retrieve rows based on equality conditions involving all cluster key
columns (for example, return all rows for branch B003);
• clustered tables are static or the maximum number of rows and the maximum
amount of space required by the cluster can be determined when it is created.
The following guidelines may be helpful when deciding whether to use hash clusters:
• Consider using hash clusters to store tables that are frequently accessed using a
search clause containing equality conditions with the same column(s). Designate
these column(s) as the cluster key.
• Store a table in a hash cluster if it can be determined how much space is required
to hold all rows with a given cluster key value, both now and in the future.
• Do not use hash clusters if space is scarce and it is not affordable to allocate
additional space for rows to be inserted in the future.
• Do not use a hash cluster to store a constantly growing table if the process of
occasionally creating a new, larger hash cluster to hold that table is impractical.
• Do not store a table in a hash cluster if a search of the entire table is often
required and a significant amount of space must be allocated to the hash cluster
in anticipation of the table growing. (Such full searches must read all blocks
allocated to the hash cluster, even though some blocks may contain few rows.
Storing the table alone would reduce the number of blocks read by a full table
search.)
• Do not store a table in a hash cluster if the cluster key values are frequently
modified.
• Storing a single table in a hash cluster can be useful, regardless of whether the
table is often joined with other tables, provided that hashing is appropriate for
the table based on the previous guidelines.
Appendix Summary
• A file organization is the physical arrangement of data in a file into records and pages of secondary storage.
An access method is the steps involved in storing and retrieving records from a file.
• Heap (unordered) files store records in the same order they are inserted. Heap files are good for inserting a
large number of records into the file. They are inappropriate when only selected records are to be retrieved.
• Sequential (ordered) files store records sorted on the values of one or more fields (the ordering fields).
Inserting and deleting records in a sorted file is problematic, because the order of records has to be maintained.
As a result, ordered files are rarely used for database storage unless a primary index is added to the file.
• Hash files are good when retrieval is based on an exact key match. They are not good when retrieval is based
on pattern matching, range of values, part keys, or a column other than the hash field.
• An index is a data structure that allows the DBMS to locate particular records in a file more quickly and thereby
speed response to user queries. There are three main types of index: a primary index, clustering index, and
a secondary index (an index that is defined on a non-ordering field of the data file).
• Secondary indexes provide a mechanism for specifying an additional key for a base relation that can be used
to retrieve data more efficiently. However, there is an overhead involved in the maintenance and use of
secondary indexes that has to be balanced against the performance improvement gained when retrieving data.
• ISAM is more versatile than hashing, supporting retrievals based on exact key match, pattern matching, range of
values, and part key specification. However, the ISAM index is static, so performance deteriorates as the table is
updated. Updates also cause the ISAM file to lose the access key sequence, so retrievals in order of the access
key become slower.
• These two problems are overcome by the B+-tree file organization, which has a dynamic index. However, because
the ISAM index is static, concurrent access to it can be managed more easily than with a B+-tree. If the relation is
not frequently updated, or is not very large and not likely to become so, the ISAM structure may be more efficient,
as it has one fewer level of index than the B+-tree, whose leaf nodes contain record pointers.
• A bitmap index stores a bit vector for each distinct value of an attribute, indicating which tuples contain that
value. Each bit that is set to 1 in the bitmap corresponds to a row identifier. If the number of different domain
values is small, then bitmap indexes are very space efficient.
• A join index is an index on attributes from two or more relations that come from the same domain. The join
index precomputes the join of the two relations based on the specified attribute, thereby removing the need to
perform the join each time the query is run, and improving the performance of the query. This could be
particularly important if the query has a high frequency.
• Clusters are groups of one or more tables physically stored together because they share common columns
and are often used together. With related records being physically stored together, disk access time is improved.
The related columns of the tables in a cluster are called the cluster key. The cluster key is stored only once,
and so clusters store a set of tables more efficiently than if the tables were stored individually (not clustered).
Oracle supports two types of clusters: indexed clusters and hash clusters.