Indexing
Content
• Overview of Physical Storage Media
• RAID
• Ordered Indices
• Primary, Secondary index structures
• Multi-level indexes
• B trees and B+ trees
Overview of Physical Storage Media
• Several types of data storage exist in most computer systems.
• They vary in speed of access, cost per unit of data, and
reliability.
Cache:
-Most costly and fastest form of storage.
-Usually very small, and managed by the computer system hardware.
Main Memory (MM):
-The storage area for data available to be operated on.
-General-purpose machine instructions operate on main memory.
-Contents of main memory are usually lost in a power failure or
"crash".
-Usually too small (even with megabytes) and too expensive to
store the entire database.
Flash memory:
• A form of EEPROM (electrically erasable programmable read-only
memory).
• Data in flash memory survives power failure.
• Reading data from flash memory takes about 10 nanoseconds
(roughly as fast as from main memory). Writing data into
flash memory is more complicated: a single write takes about
4-10 microseconds.
• To overwrite what has been written, one first has to erase
the entire bank of memory. Flash memory may support only a
limited number of erase cycles.
• It is popular as a replacement for disks for
storing small volumes of data (5-10 megabytes).
Magnetic-disk storage:
-Primary medium for long-term storage.
-Typically the entire database is stored on disk.
-Data must be moved from disk to main memory in
order for the data to be operated on.
-After operations are performed, data must be copied
back to disk if any changes were made.
-Disk storage is called direct access storage as it is
possible to read data on the disk in any order (unlike
sequential access).
-Disk storage usually survives power failures and system
crashes.
Optical storage:
-CD-ROM (compact-disk read-only memory),
WORM (write-once read-many) disks (for archival
storage of data), and jukeboxes (containing a few
drives and numerous disks loaded on demand).
Tape Storage:
-Used primarily for backup and archival data.
-Cheaper, but much slower access, since tape must
be read sequentially from the beginning.
-Used as protection from disk failures!
• The storage devices form a hierarchy: the higher
levels are more expensive (cost per bit) and faster
(access time), but have smaller capacity.
RAID
• RAID is a technique which is used to combine
multiple disks together for more efficient storage
of data across the disks.
• Some RAID techniques can also be used to
reconstruct the data if it is lost.
• RAID (redundant array of independent disks) is
a way to combine multiple disk drives for
increased performance, data redundancy, and
disk reliability.
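One way lost data can be reconstructed is XOR parity, which parity-based RAID levels rely on. The sketch below is only an illustration of that idea: the disk contents and the helper name `xor_blocks` are hypothetical, and a real array works on fixed-size stripes rather than short byte strings.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equal-length byte strings (hypothetical helper)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three hypothetical data disks, each holding one 4-byte block.
data_disks = [b"ABCD", b"EFGH", b"IJKL"]

# The parity block is the XOR of all data blocks.
parity = xor_blocks(data_disks)

# Simulate losing disk 1: XOR of the surviving blocks and the parity
# block yields the missing block again.
survivors = [data_disks[0], data_disks[2], parity]
rebuilt = xor_blocks(survivors)
```

Because XOR is its own inverse, any single missing block can be rebuilt this way; losing two blocks at once cannot be repaired with a single parity block.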
• RAID - Level 1
• RAID - Level 2
• RAID - Level 3
• RAID - Level 4
• RAID - Level 5
• RAID - Level 6
Index
• An index is a data structure that organizes data
records on disk to optimize the retrieval
operations.
• The index structure typically provides a secondary
access path, that is, an alternative way of
retrieving the records without affecting the
physical storage of records on the disk.
• Index types:
1. Single-level indexing
2. Multi-level indexing
Single-Level Indexes
• An index is usually defined on a single attribute or field of a file, called
the indexing field or indexing attribute.
• Generally, the index stores each value of the index field along with a
list of pointers to all disk blocks that contain records with that field
value.
• The values in the index are ordered so that binary search can be
performed on the index.
• The index file is much smaller than the data file, so searching the
index using a binary search is highly efficient.
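The points above can be sketched in a few lines. This is a minimal illustration, not a storage engine: the data file is modelled as a hypothetical dictionary from block number to records, and `build_index` and `lookup` are illustrative names.

```python
from bisect import bisect_left

# Hypothetical data file: block number -> records of the form (name, dept).
data_blocks = {
    0: [("Asha", "HR"), ("Ben", "IT")],
    1: [("Chen", "IT"), ("Dia", "OPS")],
    2: [("Eli", "HR"), ("Fay", "IT")],
}

def build_index(blocks, field_pos):
    """Return sorted (field value, [block numbers]) entries."""
    entries = {}
    for block_no, records in blocks.items():
        for rec in records:
            ptrs = entries.setdefault(rec[field_pos], [])
            if block_no not in ptrs:
                ptrs.append(block_no)
    return sorted(entries.items())

def lookup(index, value):
    """Binary-search the ordered index; return the block pointers for value."""
    keys = [k for k, _ in index]
    i = bisect_left(keys, value)
    if i < len(index) and keys[i] == value:
        return index[i][1]
    return []

dept_index = build_index(data_blocks, 1)
it_blocks = lookup(dept_index, "IT")   # blocks that hold IT records
```

Only the blocks returned by `lookup` need to be read, instead of scanning the whole data file.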
• Single-level indexing types:
1. Primary indexing
2. Clustering indexing
3. Secondary indexing
• A data file can have either a primary index or a
clustering index, depending on its ordering field, but not both.
• It can have several secondary indexes.
Primary Indexes
• A primary index is an ordered file whose records are
of fixed length with two fields.
1. First field - It is of the same data type as the ordering
key field of the data file, called the primary key
2. Second field - It is a pointer to a disk block.
• There is one index entry in the index file for each
block in the data file.
• Each index entry has two field values:
- The value of the primary key field for the first record in a block
- A pointer to that block
• The index file needs fewer blocks than data file,
for two reasons.
1. There are fewer index entries than there are
records in the data file.
2. Each index entry is typically smaller in size than a
data record because it has only two fields.
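A lookup through a primary index can be sketched as follows, assuming each index entry holds the key of the first record in its block. The three-block data file and the name `find` are hypothetical.

```python
from bisect import bisect_right

# Hypothetical ordered data file: 3 records per block, keyed by an integer id.
blocks = [
    [(2, "a"), (5, "b"), (8, "c")],
    [(12, "d"), (15, "e"), (21, "f")],
    [(24, "g"), (29, "h"), (35, "i")],
]

# One index entry per block: (key of the first record in the block, block number).
primary_index = [(blk[0][0], n) for n, blk in enumerate(blocks)]

def find(key):
    """Binary-search the index for the block that could hold key, then scan it."""
    anchors = [k for k, _ in primary_index]
    i = bisect_right(anchors, key) - 1   # last entry whose key is <= search key
    if i < 0:
        return None                      # smaller than every key in the file
    _, block_no = primary_index[i]
    for k, payload in blocks[block_no]:
        if k == key:
            return payload
    return None
```

The index here has 3 entries for 9 records, which is why it needs fewer blocks than the data file.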
• A binary search on the index file hence requires
fewer block accesses than a binary search on the
data file.
• When a user wants to insert a record:
-Existing records must be moved to make space
for the new record, and index entries may have to
change as well.
• Similarly, deletion is difficult because the
index entries must be updated.
Clustering Indexes
• If records of a file are physically ordered on a
non-key field (one that does not have a distinct value
for each record), that field is called the clustering
field.
• This differs from a primary index, which requires that
the ordering field of the data file have a distinct value
for each record.
• There is one entry in the clustering index for
each distinct value of the clustering field,
containing the value and a pointer to the first
block in the data file that has a record with that value.
• To ease the problem of insertion, it is common to reserve a
whole block (or a cluster of contiguous blocks)
for each value of the clustering field; all records
with that value are placed in the block (or block
cluster).
• This makes insertion and deletion relatively
straightforward.
Secondary Indexes
• A secondary index is also an ordered file with two fields.
1. First field - It is of the same data type as some non-ordering
field of the data file that is an indexing field.
2. Second field - It is either a block pointer or a record pointer.
• There can be many secondary indexes for the same file.
• For a secondary index on a key field, there is one
index entry for each record in the data file. It contains:
- The value of the secondary key for
the record
- A pointer either to the block in which the record
is stored or to the record itself
• A secondary index usually needs more storage
space and longer search time than a primary
index, because of its larger number of entries.
• A secondary index provides a logical ordering on the
records by the indexing field.
INDEX
• A dense index has an index entry for every
search key value in the data file.
• A sparse or nondense index has index entries for
only some of the search values.
Question
• Suppose we have an ordered data file with r = 24,000
records stored on a disk with block size B = 512 bytes. File
records are of fixed size with record length R = 120 bytes.
• A primary index on the data file is created based
on the ordering key field of the file. Assume that the length of
each index entry is 12 bytes (key field size = 7 bytes and
block pointer size = 5 bytes). Calculate the following:
• Deletion of 10
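The quantities the primary-index question asks for are not listed above. As a hedged sketch, the figures usually derived from such parameters (blocking factors, block counts, and binary-search block accesses) can be computed as below, assuming unspanned records. The helper names are illustrative.

```python
r, B, R = 24_000, 512, 120      # records, block size, record length (from the question)
entry_size = 12                 # index entry: 7-byte key + 5-byte block pointer

def ceil_div(a, b):
    return -(-a // b)

def ceil_log2(n):
    """Smallest k with 2**k >= n, for n >= 1 (integer arithmetic only)."""
    return (n - 1).bit_length()

bfr_data = B // R                                 # records per data block -> 4
data_blocks = ceil_div(r, bfr_data)               # -> 6000
bfr_index = B // entry_size                       # index entries per block -> 42
index_entries = data_blocks                       # one entry per data block
index_blocks = ceil_div(index_entries, bfr_index) # -> 143

# Binary search directly on the data file vs. on the index
# (the index search needs one extra access to fetch the data block).
accesses_data_file = ceil_log2(data_blocks)       # -> 13
accesses_via_index = ceil_log2(index_blocks) + 1  # -> 9
```

Under these assumptions the index cuts a lookup from 13 block accesses to 9.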
B+ Trees
• The leaf nodes of the B+ tree are usually linked together to provide
ordered access on the search field to the records.
• These leaf nodes are similar to the first level of an index. Internal nodes
of the B+ tree correspond to the other levels of a multilevel index.
• The structure of the internal nodes of a B+ tree of order p is as follows:
- Each internal node is of the form <P1, K1, P2, K2, ..., Pq-1, Kq-1, Pq>,
where q <= p and each Pi is a tree pointer.
- Within each internal node, K1 < K2 < ... < Kq-1.
- Each internal node has at most p tree pointers.
- For all search field values X in the subtree pointed at by Pi, the rule is:
X <= K1 for i = 1; K(i-1) < X <= Ki for 1 < i < q; and K(i-1) < X for i = q.
- Each internal node, except the root, has at least ⌈p/2⌉ children.
- The root has at least two children if it is an internal node.
- An internal node with q tree pointers, q <= p, has q-1 search key field
values.
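The subtree rule for internal nodes translates directly into a binary search over the node's keys. The sketch below uses 0-based positions (position 0 corresponds to P1); `child_index` and `is_valid_internal` are illustrative names, not part of any library.

```python
from bisect import bisect_left
from math import ceil

def child_index(keys, x):
    """Position of the tree pointer to follow for search value x.

    bisect_left counts the keys strictly smaller than x, which matches the
    rule K(i-1) < X <= Ki for the pointer between K(i-1) and Ki.
    """
    return bisect_left(keys, x)

def is_valid_internal(keys, num_children, p, is_root=False):
    """Check the structural constraints listed for an internal node of order p."""
    sorted_strictly = all(a < b for a, b in zip(keys, keys[1:]))
    min_children = 2 if is_root else ceil(p / 2)
    return (sorted_strictly
            and num_children == len(keys) + 1
            and min_children <= num_children <= p)

keys = [10, 20, 30]   # hypothetical internal node with 4 tree pointers
```

For example, a search for 10 follows the first pointer (X <= K1), while a search for 31 follows the last one.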
• The structure of the leaf nodes of a B+ tree of order p is
as follows:
- Each leaf node is of the form <<K1, Pr1>, <K2, Pr2>, ...,
<Kq-1, Prq-1>, Pnext>, where q <= p, each Pri is a data
pointer, and Pnext is the pointer to the next leaf node.
- Within each leaf node, the keys K1, K2, ..., Kq-1 are in ascending order.
- Each leaf node has at least ⌈p/2⌉ values.
- All leaf nodes are at the same level.
• The pointers in internal nodes are tree pointers,
which point to blocks that are tree nodes, whereas the
pointers in leaf nodes are data pointers, which point to the data
records.
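The Pnext pointers are what make ordered access cheap: a range query can walk the leaf chain without going back through the internal nodes. A minimal sketch, with a hypothetical `Leaf` class and string stand-ins for data pointers:

```python
class Leaf:
    """Hypothetical leaf node: sorted (key, data pointer) pairs plus Pnext."""
    def __init__(self, entries, next_leaf=None):
        self.entries = entries
        self.next = next_leaf

def range_scan(leaf, low, high):
    """Collect all entries with low <= key <= high, following Pnext."""
    out = []
    while leaf is not None:
        for key, ptr in leaf.entries:
            if key > high:
                return out          # keys only grow from here on
            if key >= low:
                out.append((key, ptr))
        leaf = leaf.next
    return out

# Three linked leaves; in a real tree the scan would start at the leaf
# reached by searching for `low`, not necessarily at the first leaf.
leaf3 = Leaf([(16, "r16"), (20, "r20")])
leaf2 = Leaf([(7, "r7"), (10, "r10")], leaf3)
leaf1 = Leaf([(1, "r1"), (4, "r4")], leaf2)
```

A call such as `range_scan(leaf1, 4, 16)` visits the three leaves in order and stops as soon as a key exceeds the upper bound.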
Insertion
The rules for inserting a data item into the B+ tree are:
• Find the correct leaf L.
• Put the data entry onto L. There are two cases:
- If L has enough space, done.
- Else, split L into L and a new node L2. Redistribute the entries
evenly and copy up the middle key. Also, insert an index entry
pointing to L2 into the parent of L.
• If the parent overflows in turn, split it as well; splits can
propagate up to the root and increase the height of the tree.
• This process is repeated for each insertion.
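The insertion rules can be sketched as a small in-memory B+ tree that stores keys only. This is one possible reading, not a definitive implementation: it assumes a leaf holds at most p-1 keys, that the left leaf keeps the larger half after a split, and that the copied-up key is the largest key of the left leaf (so values <= the separator go left). The class and method names are hypothetical.

```python
from bisect import bisect_left, insort

class Node:
    def __init__(self, leaf):
        self.leaf = leaf
        self.keys = []
        self.children = []   # used by internal nodes only
        self.next = None     # used by leaves only (Pnext)

class BPlusTree:
    def __init__(self, order=3):
        self.order = order           # p: max tree pointers per internal node
        self.root = Node(leaf=True)

    def insert(self, key):
        split = self._insert(self.root, key)
        if split:                    # root was split: the tree grows by one level
            sep, right = split
            new_root = Node(leaf=False)
            new_root.keys = [sep]
            new_root.children = [self.root, right]
            self.root = new_root

    def _insert(self, node, key):
        """Insert key below node; return (separator, new right node) on a split."""
        if node.leaf:
            insort(node.keys, key)
            if len(node.keys) <= self.order - 1:   # assumed leaf capacity: p - 1
                return None
            mid = (len(node.keys) + 1) // 2
            right = Node(leaf=True)
            right.keys = node.keys[mid:]
            node.keys = node.keys[:mid]
            right.next, node.next = node.next, right
            return node.keys[-1], right            # copy up: key stays in the leaf
        i = bisect_left(node.keys, key)
        split = self._insert(node.children[i], key)
        if split is None:
            return None
        sep, right = split
        node.keys.insert(i, sep)
        node.children.insert(i + 1, right)
        if len(node.children) <= self.order:
            return None
        mid = len(node.keys) // 2
        up = node.keys[mid]                        # push up: key leaves this node
        new = Node(leaf=False)
        new.keys = node.keys[mid + 1:]
        new.children = node.children[mid + 1:]
        node.keys = node.keys[:mid]
        node.children = node.children[:mid + 1]
        return up, new

    def leaves(self):
        """Keys of every leaf, left to right, following the Pnext chain."""
        node = self.root
        while not node.leaf:
            node = node.children[0]
        out = []
        while node is not None:
            out.append(list(node.keys))
            node = node.next
        return out

tree = BPlusTree(order=3)
for k in [1, 4, 7, 10, 16, 20, 32, 41]:   # sample keys
    tree.insert(k)
```

Note the difference between the two kinds of split: a leaf split copies the separator up (it remains in the leaf), while an internal split moves it up. Other textbooks pick the separator or the leaf capacity differently, so the resulting tree shape can vary with the convention.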
Deletion
• The rules for deleting a data item from the B+ tree are:
• Find the correct leaf L.
• Remove the entry. There are two possible cases:
- If L is still at least half-full, done.
- Else, try to redistribute the entries by borrowing from an adjacent
sibling. If redistribution fails, merge L and the sibling. If a
merge occurred, delete the entry (pointing to L or the sibling) from
the parent of L.
• The merging process can propagate to the root, decreasing the
height of the tree.
• Deletion of 45 followed by 40
Question
• Construct a B+ tree of order 3, for (1, 4, 7, 10, 16, 20,
32, 41). Mention all steps for every insertion during the
creation of the tree.
B and B+ Tree
• Searching a B+ tree is generally faster than searching a
B-tree, since internal nodes hold only keys and tree pointers
and can therefore fit more entries per block.
• A B+ tree takes more storage space than a B-tree,
because it uses more pointers than a B-tree.
• Insertion and deletion operations in a B-tree are
more complex than those in a B+ tree.
• Ex: Draw the B-tree of order 5 for the data
items 10, 50, 30, 70, 90, 25, 40, 45, 48.
MCQ
• Q3. How many redo and undo operations will be
performed, and for which transactions, given the following log:
<T1 Start>
<T1, A, 300>
<T2 Start>
<T2, B, 400>
Checkpoint
<T2 commit>
<T3 Start>
<T1 Commit>
<T4 Start>
• Q5. Under immediate database modification, what will be done
for each transaction if a failure occurs after <T3 Start>:
<T1 Start>
<T1, A, 300>
<T1 Commit>
<T2 Start>
<T2, B, 400>
<T2 commit>
<T3 Start>
<T3, C, 700>
<T3 Commit>