Indexing

The document provides an overview of various physical storage media, including cache, main memory, flash memory, magnetic disk storage, optical storage, and tape storage, highlighting their characteristics and uses. It also explains RAID techniques for efficient data storage and redundancy, as well as different indexing methods such as single-level and multi-level indexing, including primary, clustering, and secondary indexes. Additionally, it covers B-trees and B+ trees, detailing their structures, insertion, and deletion processes.
Copyright
© All Rights Reserved

INDEXING

Content
• Overview of Physical Storage Media
• RAID
• Ordered Indices
• Primary, Secondary index structures
• Multi-level indexes
• B trees and B+ trees
Overview of Physical Storage Media
• Several types of data storage exist in most computer systems.
• They vary in speed of access, cost per unit of data, and
reliability.
Cache:
-Most costly and fastest form of storage.
-Usually very small, and managed by the computer system hardware.
Main Memory (MM):
-The storage area for data available to be operated on.
-General-purpose machine instructions operate on main memory.
-Contents of main memory are usually lost in a power failure or
"crash".
-Usually too small (even with megabytes) and too expensive to
store the entire database.
Overview of Physical Storage Media
Flash memory:
• A form of EEPROM (electrically erasable programmable read-only
memory).
• Data in flash memory survives power failure.
• Reading data from flash memory takes about 10 nanoseconds
(roughly as fast as from main memory). Writing is more
complicated: writing a location once takes about 4-10
microseconds.
• To overwrite what has been written, one must first erase
the entire bank of the memory. Flash memory may support only a
limited number of erase cycles.
• It has become popular as a replacement for disks for
storing small volumes of data (5-10 megabytes).
Overview of Physical Storage Media
Magnetic-disk storage:
-Primary medium for long-term storage.
-Typically the entire database is stored on disk.
-Data must be moved from disk to main memory in
order for the data to be operated on.
-After operations are performed, data must be copied
back to disk if any changes were made.
-Disk storage is called direct access storage as it is
possible to read data on the disk in any order (unlike
sequential access).
-Disk storage usually survives power failures and system
crashes.
Overview of Physical Storage Media
Optical storage:
-Includes CD-ROM (compact-disk read-only memory),
WORM (write-once read-many) disks (for archival
storage of data), and jukeboxes (containing a few
drives and numerous disks loaded on demand).
Tape Storage:
-Used primarily for backup and archival data.
-Cheaper than disk, but access is much slower, since tape must
be read sequentially from the beginning.
-Used as protection from disk failures!
Overview of Physical Storage Media
• The storage devices form a hierarchy in which the
higher levels are more expensive (cost per bit) and faster
(access time), but have smaller capacity.
RAID
• RAID is a technique which is used to combine
multiple disks together for more efficient storage
of data across the disks.
• Some RAID techniques can also be used to
reconstruct the data if it is lost.
• Redundant array of independent disks (RAID) is
a way to combine multiple disks for
increased performance, data redundancy and
disk reliability.
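The reconstruction idea can be sketched with XOR parity, which is the mechanism commonly used by the parity-based RAID levels. This is a minimal illustration, not a description of any particular controller: the three "disks", their contents, and the helper name `xor_blocks` are all hypothetical.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three hypothetical data disks, each holding one 4-byte stripe unit.
data = [b"DISK", b"FAIL", b"SAFE"]

# The parity unit is the XOR of all data units in the stripe.
parity = xor_blocks(data)

# Simulate losing disk 1: XOR the surviving units with the parity unit.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
```

Because XOR is its own inverse, `rebuilt` equals the lost unit `data[1]`. A single parity unit can recover from the loss of one disk only.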
RAID levels (diagrams not reproduced)
• RAID - Level 1
• RAID - Level 2
• RAID - Level 3
• RAID - Level 4
• RAID - Level 5
• RAID - Level 6
Index
• An index is a data structure that organizes data
records on disk to optimize the retrieval
operations.
• The index structure typically provides a secondary
access path, which offers an alternative way of
retrieving the records without affecting the
physical storage of records on the disk.
• Index types:
1. Single-level indexing
2. Multi-level indexing
Single-Level Indexes
• An index is usually defined on a single attribute or field of a file, called
the indexing field or indexing attribute.
• Generally, the index stores each value of the index field along with a
list of pointers to all disk blocks that contain records with that field
value.
• The values in the index are ordered so that binary search can be
performed on the index.
• The index file is much smaller than the data file, so searching the
index using a binary search is highly efficient
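A minimal sketch of such an index, assuming it fits in memory as a sorted list of (field value, block numbers) entries. The department names, block numbers, and the function name `lookup` are made up for illustration.

```python
from bisect import bisect_left

# Hypothetical index: entries are kept sorted by field value.
index = [
    ("Accounts", [3]),
    ("Design", [1, 4]),
    ("Sales", [0, 2, 5]),
]
keys = [value for value, _ in index]

def lookup(value):
    """Binary-search the index; return the blocks holding this field value."""
    i = bisect_left(keys, value)
    if i < len(keys) and keys[i] == value:
        return index[i][1]
    return []
```

`lookup("Design")` returns the two block numbers stored for that value, and a value that is not indexed returns an empty list. Only the blocks named by the entry then need to be read from the data file.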
Single-Level Indexes
• Single-level indexing types:
1. Primary indexing
2. Clustering indexing
3. Secondary indexing
• A data file can have either a primary index or a
clustering index, depending on its ordering field, but not both.
• It can have several secondary indexes.
Primary Indexes
• A primary index is an ordered file whose records are
of fixed length with two fields.
1. First field - It is of the same data type as the ordering
key field of the data file, called the primary key
2. Second field - It is a pointer to a disk block.
• There is one index entry in the index file for each
block in the data file.
Primary Indexes
• Each index entry has two field values:
- The value of the primary key field for the first record in a block
- A pointer to that block
• The index file needs fewer blocks than the data file,
for two reasons.
1. There are fewer index entries than there are
records in the data file.
2. Each index entry is typically smaller in size than a
data record because it has only two fields.
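A small sketch of a lookup through a primary index, assuming each index entry holds the key of the first record in its block. The data blocks and the function name `find` are hypothetical; each inner list stands for one disk block.

```python
from bisect import bisect_right

# Hypothetical ordered data file: each inner list is one disk block.
data_blocks = [[2, 5, 8], [12, 15, 19], [23, 27, 31]]

# One index entry per block: (key of the block's first record, block number).
primary_index = [(block[0], n) for n, block in enumerate(data_blocks)]
first_keys = [key for key, _ in primary_index]

def find(key):
    """Return the block number that holds key, or None if it is absent."""
    # Last index entry whose first key is <= the search key.
    i = bisect_right(first_keys, key) - 1
    if i < 0:
        return None  # key is smaller than every key in the file
    block_no = primary_index[i][1]
    return block_no if key in data_blocks[block_no] else None
```

The index has three entries for nine records, so the binary search runs over far fewer entries than a search on the data file, and only one data block is read at the end.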
Primary Indexes
• A binary search on the index file hence requires
fewer block accesses than a binary search on the
data file.
• Inserting a record is costly:
-Existing records must be moved to make space
for the new record, and index entries must be
changed as well.
• Similarly, deletion is difficult because the
index entries must be updated.
Clustering Indexes
• If records of a file are physically ordered on a
non-key field, that field is called the clustering
field.
• Unlike a primary index, a clustering index does not require
that the ordering field of the data file have a distinct value
for each record.
Clustering Indexes
• There is one entry in the clustering index for
each distinct value of the clustering field,
containing the value and a pointer to the first
block in the data file that holds a record with that value.
• To ease insertion, it is common to reserve a
whole block (or a cluster of contiguous blocks)
for each value of the clustering field; all records
with that value are placed in the block (or block
cluster).
• This makes insertion and deletion relatively
straightforward.
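A minimal sketch of building and using a clustering index, assuming a file ordered on a department number that repeats across records. The records, block layout, and the function name `records_for` are hypothetical.

```python
# Hypothetical file ordered on a non-key field (department number);
# each inner list is one block of (dept, name) records.
blocks = [
    [(1, "Ann"), (1, "Bob")],
    [(1, "Cy"), (2, "Dee")],
    [(3, "Eve"), (3, "Fay")],
]

# One entry per distinct value, pointing to the first block containing it.
clustering_index = {}
for block_no, block in enumerate(blocks):
    for dept, _ in block:
        clustering_index.setdefault(dept, block_no)  # keep the first block only

def records_for(dept):
    """Start at the indexed block and scan forward while the value still appears."""
    start = clustering_index.get(dept)
    if start is None:
        return []
    found = []
    for block in blocks[start:]:
        matches = [rec for rec in block if rec[0] == dept]
        if not matches:
            break
        found.extend(matches)
    return found
```

Department 1 spans two blocks here, yet the index stores a single entry for it. The forward scan relies on the file being physically ordered on the clustering field.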
Secondary Indexes
• A secondary index is also an ordered file with two fields.
1. First field - It is of the same data type as some non-ordering
field of the data file that is an indexing field.
2. Second field - It is either a block pointer or a record pointer.
• There can be many secondary indexes for the same file.
Secondary Indexes
• There is one index entry for each record in the
data file.
- It contains the value of the secondary key for
the record
- A pointer either to the block in which the record
is stored or to the record itself
• A secondary index usually needs more storage
space and longer search time than does a primary
index, because of its larger number of entries.
• A secondary index provides a logical ordering on the
records by the indexing field.
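A short sketch of a secondary index with record pointers, assuming the data file is ordered on an id while lookups are wanted by email. The records and the function name `find_by_email` are invented for the example.

```python
from bisect import bisect_left

# Hypothetical data file, physically ordered by id; records are (id, email).
data_file = [(1, "zoe@x.org"), (2, "amy@x.org"), (3, "max@x.org")]

# One (email, record position) entry per record, sorted by email.
secondary_index = sorted((email, pos) for pos, (_, email) in enumerate(data_file))
emails = [email for email, _ in secondary_index]

def find_by_email(email):
    """Binary-search the index, then follow the record pointer."""
    i = bisect_left(emails, email)
    if i < len(emails) and emails[i] == email:
        return data_file[secondary_index[i][1]]
    return None
```

The index needs one entry per record because the data file is not ordered on email, which is why it is larger than a primary index on the same file.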
INDEX
• A dense index has an index entry for every
search key value in the data file.
• A sparse or nondense index has index entries for
only some of the search values.
Question
• Suppose we have an ordered data file with r = 24,000
records stored on a disk with block size B = 512 bytes. File
records are of fixed size, with record length R = 120 bytes.
• A primary index on the data file is created on the
ordering key field of the file. Assume that the length of
each index entry is 12 bytes (key field size = 7 bytes and
block pointer size = 5 bytes). Calculate the following:
a. Blocking factor of the data file and of the index file.
b. Total number of blocks required for the data file and for the index file.
c. Number of block accesses for a binary search on the data file and
for a binary search on the index file.
Solution
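One way to work the numbers, assuming records and index entries are unspanned (they never cross a block boundary). The variable names are illustrative only.

```python
from math import ceil, log2

r, B, R = 24_000, 512, 120   # records, block size (bytes), record length (bytes)
entry = 7 + 5                # index entry = key field + block pointer (bytes)

# a. Blocking factors: whole records (or entries) that fit in one block.
bfr_data = B // R            # 512 // 120 = 4 records per block
bfr_index = B // entry       # 512 // 12  = 42 entries per block

# b. Blocks needed. A primary index has one entry per data block.
data_blocks = ceil(r / bfr_data)               # 6000
index_blocks = ceil(data_blocks / bfr_index)   # 143

# c. Block accesses for a binary search.
search_data = ceil(log2(data_blocks))          # 13
search_index = ceil(log2(index_blocks))        # 8
search_via_index = search_index + 1            # 9, including the data block read
```

Searching through the index costs 8 block accesses on the index file, plus 1 to fetch the data block, against 13 for a binary search directly on the data file.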
Multilevel Indexing
• A multilevel index can contain any number of levels,
each of which acts as a non-dense index to the level below.
• The top level fits in a single block
• A multilevel index can be created for any type of first-level
index (whether it is primary, clustering or secondary) as
long as the first-level index consists of more than one disk
block
• The advantage of multilevel index is that it reduces the
number of blocks accessed when searching a record, given
its indexing field value
• The problems associated with index insertions and
deletions still exist because all index levels are physically
ordered files
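The saving in block accesses can be estimated with a short calculation. The sketch below reuses figures that follow from the Question slide (6,000 first-level entries, one per data block, and 42 index entries per 512-byte block); the function name `index_levels` is hypothetical.

```python
from math import ceil

def index_levels(first_level_entries, fan_out):
    """Blocks per level, from the first level up to a single-block top level."""
    levels = []
    entries = first_level_entries
    while True:
        blocks = ceil(entries / fan_out)
        levels.append(blocks)
        if blocks <= 1:
            return levels
        entries = blocks  # the next level has one entry per block below it

levels = index_levels(6000, 42)
accesses = len(levels) + 1  # one block per index level, plus the data block
```

This gives levels of 143, 4 and 1 blocks, so a search reads 3 index blocks and 1 data block, 4 accesses in total, compared with 9 when binary-searching a single-level index of 143 blocks.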
Multilevel Indexing
• To avoid the insertion and deletion problem:
- most multilevel indexes use B-tree or B+ tree data structures,
- which leave space in each tree node (disk block) to allow for new
index entries
• In B-tree and B+ tree data structures:
- Each node corresponds to a disk block.
- Each node is kept between half-full and completely full.
- An insertion into a node that is not full is quite efficient.
- If a node is full, the insertion causes a split into two nodes.
- Similarly, a deletion is quite efficient if a node does not
become less than half-full.
- If a deletion causes a node to become less than half-full, it
must be merged with the neighboring nodes.
B-Trees
• B-tree is a specialized multi-way tree designed especially for use on disk.
• A B-tree of order p can be defined as follows:
- Each internal node is of the form <P1, <K1>, P2, <K2>, ..., <Kq-1>, Pq>,
where q <= p.
- Each Pi is a tree pointer, which is a pointer to another node in the B-tree.
- Within each node, K1 < K2 < ... < Kq-1.
- Each node has at most p tree pointers.
- For all search key field values X in the subtree pointed to by Pi, the rule is
X < K1 for i = 1; Ki-1 < X < Ki for 1 < i < q; and Ki-1 < X for i = q.
- All non-leaf nodes except the root have at least ⌈p/2⌉ children.
- The root is either a leaf node, or it has from two to p children.
- A node with q tree pointers, q <= p, has q-1 search key field values.
- All leaf nodes are at the same level.
- Leaf nodes have the same structure as internal nodes except that all of
their tree pointers Pi are null.
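The ordering rule on X is what drives a search. A minimal in-memory sketch follows, assuming distinct keys; the `Node` class, the `search` function, and the sample tree are hypothetical, and a leaf is represented by an empty child list instead of null pointers.

```python
from bisect import bisect_left

class Node:
    def __init__(self, keys, children=None):
        self.keys = keys                 # K1 < K2 < ... < Kq-1
        self.children = children or []   # P1 ... Pq; empty list for a leaf

def search(node, x):
    """Follow tree pointers from the root until x is found or a leaf is exhausted."""
    while True:
        i = bisect_left(node.keys, x)    # first position with keys[i] >= x
        if i < len(node.keys) and node.keys[i] == x:
            return True
        if not node.children:
            return False                 # reached a leaf without finding x
        node = node.children[i]          # x lies strictly between keys[i-1] and keys[i]

# A small two-level tree with 3 tree pointers in the root.
root = Node([20, 40], [Node([5, 10]), Node([25, 30]), Node([45, 50])])
```

Each loop iteration corresponds to reading one node, so a search reads at most one block per level of the tree.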
B-Trees
Insertion
• Insertion into a B-tree is a relatively simple process.
• A B-tree starts with a single root node at level 0.
• The rules for insertion into a B-tree are:
- Attempt to insert the new key into a leaf. If this would make
the leaf too big, split the leaf in two, promoting
the middle key to the leaf’s parent.
- If the promotion would make the parent too big,
split the parent in two, promoting its middle key. This
strategy might have to be repeated all the way to the top.
- If necessary, the root is split in two and the middle key is
promoted to a new root, making the tree one level higher.
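These rules can be sketched in a few lines. This is one possible in-memory rendering, not a disk-based implementation: it assumes distinct keys, lets a node overflow to p keys before splitting, and promotes the key at index len // 2. The names `Node`, `_insert` and `insert` are hypothetical, and the keys are the ones from the order-5 exercise in this deck.

```python
from bisect import bisect_left, insort

class Node:
    def __init__(self, keys=None, children=None):
        self.keys = keys or []
        self.children = children or []   # empty list for a leaf

def _insert(node, key, p):
    """Insert key under node; return (middle_key, new_right_node) if node split."""
    if not node.children:
        insort(node.keys, key)           # leaf: place the key in sorted order
    else:
        i = bisect_left(node.keys, key)
        split = _insert(node.children[i], key, p)
        if split:                        # child split: absorb the promoted key
            mid, right = split
            node.keys.insert(i, mid)
            node.children.insert(i + 1, right)
    if len(node.keys) <= p - 1:          # at most p - 1 keys fit in a node
        return None
    m = len(node.keys) // 2
    mid = node.keys[m]
    right = Node(node.keys[m + 1:], node.children[m + 1:])
    node.keys, node.children = node.keys[:m], node.children[:m + 1]
    return mid, right

def insert(root, key, p):
    """Insert key and return the (possibly new) root."""
    split = _insert(root, key, p)
    if split:
        mid, right = split
        return Node([mid], [root, right])   # the tree grows one level higher
    return root

root = Node()
for k in (10, 50, 30, 70, 90, 25, 40, 45, 48):
    root = insert(root, k, 5)
```

Under these assumptions the root ends with keys 30 and 50 above three leaves holding 10, 25; 40, 45, 48; and 70, 90. A different choice of which key to promote on an even split would give a different but equally valid tree.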
B-Trees
Deletion:
At deletion, removal should be done from a leaf:
• If the key is already in a leaf node, and removing it doesn’t cause that leaf
node to have too few keys, then simply remove the key.
• If the key is not in a leaf, then it is guaranteed that its predecessor or
successor will be in a leaf. In this case, delete the key and promote the
predecessor or successor key to the deleted key’s position in the non-leaf node.
• If either of the first two cases leaves a leaf node with fewer than the
minimum number of keys, then look at the siblings immediately adjacent to
that leaf:
- If one of them has more than the minimum number of keys, then promote
one of its keys to the parent and move the parent key down into the lacking leaf.
- If neither of them has more than the minimum number of keys, then the
lacking leaf and one of its neighbours can be combined, together with the key
from their shared parent; the new leaf will have the correct number of keys.
If this step leaves the parent with too few keys, repeat the process up to
the root itself.
B-Trees
• Deletion of 40

• Deletion of 10
B+ Trees
• The leaf nodes of the B+ tree are usually linked together to provide
ordered access on the search field to the records.
• These leaf nodes are similar to the first level of an index. Internal nodes
of the B+ tree correspond to the other levels of a multilevel index.
• The structure of the internal nodes of a B+ tree of order p is as follows:
- Each internal node is of the form <P1, <K1>, P2, <K2>, ..., <Kq-1>, Pq>,
where q <= p and each Pi is a tree pointer.
- Within each internal node, K1 < K2 < ... < Kq-1.
- Each internal node has at most p tree pointers.
- For all search field values X in the subtree pointed to by Pi, the rule is
X <= K1 for i = 1; Ki-1 < X <= Ki for 1 < i < q; and Ki-1 < X for i = q.
- Each internal node has at least ⌈p/2⌉ children.
- The root has at least two children if it is an internal node.
- An internal node with q tree pointers, q <= p, has q-1 search key field
values.
B+ Trees
• The structure of the leaf nodes of a B+ tree of order p is
as follows:
- Each leaf node is of the form <<K1, Pr1>, <K2, Pr2>, ...,
<Kq-1, Prq-1>, Pnext>, where q <= p, each Pri is a data
pointer and Pnext is the pointer to the next leaf node.
- Each leaf node has at least p / 2 values.
- All leaf nodes are at the same level.
- Within each leaf node, K1 < K2 < ... < Kq-1.
• The pointers in internal nodes are tree pointers,
which point to blocks that are tree nodes, whereas the
pointers in leaf nodes are data pointers, which point to the data.
B+ Trees
Insertion
The rules for the insertion of a data item into the B+ tree are:
• Find the correct leaf L
• Put the data entry onto L. There are two cases:
- If L has enough space, done
- Else, split L into L and a new node L2. Redistribute entries
evenly and copy up the middle key. Also, insert an index entry
pointing to L2 into the parent of L
• This process is repeated for each insertion
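A compact sketch of these rules, under stated assumptions: keys are distinct, a leaf holds at most p - 1 keys, the left leaf keeps the larger half on a split and its largest key is copied up (matching the X <= K1 rule for internal nodes), and an overfull internal node pushes its middle key up. The class and function names are hypothetical, data pointers are omitted, and the keys are those of the order-3 exercise in this deck.

```python
from bisect import bisect_left, insort

class Leaf:
    def __init__(self, keys):
        self.keys, self.next = keys, None      # next plays the role of Pnext

class Internal:
    def __init__(self, keys, children):
        self.keys, self.children = keys, children

def _insert(node, key, p):
    """Insert key under node; return (separator, new_right_node) if node split."""
    if isinstance(node, Leaf):
        insort(node.keys, key)
        if len(node.keys) <= p - 1:
            return None
        m = (len(node.keys) + 1) // 2          # left leaf keeps the larger half
        right = Leaf(node.keys[m:])
        node.keys = node.keys[:m]
        right.next, node.next = node.next, right
        return node.keys[-1], right            # copy up: the key stays in the leaf
    i = bisect_left(node.keys, key)            # child i holds values <= keys[i]
    split = _insert(node.children[i], key, p)
    if split:
        sep, right = split
        node.keys.insert(i, sep)
        node.children.insert(i + 1, right)
    if len(node.children) <= p:
        return None
    m = len(node.keys) // 2
    sep = node.keys[m]
    right = Internal(node.keys[m + 1:], node.children[m + 1:])
    node.keys, node.children = node.keys[:m], node.children[:m + 1]
    return sep, right                          # push up: the key leaves this node

def insert(root, key, p):
    split = _insert(root, key, p)
    if split:
        sep, right = split
        return Internal([sep], [root, right])
    return root

def leaves(node):
    """Walk to the leftmost leaf, then follow the leaf chain."""
    while isinstance(node, Internal):
        node = node.children[0]
    out = []
    while node is not None:
        out.append(node.keys)
        node = node.next
    return out

root = Leaf([])
for k in (1, 4, 7, 10, 16, 20, 32, 41):
    root = insert(root, k, 3)
```

With these conventions the tree ends with root key 10, internal nodes holding 4 and 20, and the leaf chain 1, 4 → 7, 10 → 16, 20 → 32, 41. Other textbooks split leaves the other way round (copying up the smallest key of the right leaf), which yields a different but equivalent tree.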
B+ Trees
Deletion
• The rules for deleting a data item from the B+ tree are:
• Find the correct leaf L
• Remove the entry. There are two possible cases:
- If L is at least half-full, done!
- Else, try to redistribute the entries by borrowing from an adjacent
sibling node. If redistribution fails, merge L and the sibling. If a
merge occurred, then delete the entry (pointing to L or the sibling) from
the parent of L
• The merging process could propagate to the root, thus decreasing the
height
B+ Trees
• Deletion of 45 followed by 40
Question
• Construct a B+ tree of order 3, for (1, 4, 7, 10, 16, 20,
32, 41). Mention all steps for every insertion during the
creation of the tree.
B and B+ Tree
• Searching a B+ tree is generally faster than searching
a B-tree.
• A B+ tree takes more storage space than a B-tree,
because it uses extra pointers.
• Insertion and deletion operations in a B-tree are
more complex than those in a B+ tree.
• Ex: Draw the B-tree of order 5 for the data
items 10, 50, 30, 70, 90, 25, 40, 45, 48.
MCQ
• Q3. How many redo and undo operations will be
performed, and for which transactions, given the following log?
<T1 Start>
<T1, A, 300>
<T2 Start>
<T2, B, 400>
Checkpoint
<T2 Commit>
<T3 Start>
<T1 Commit>
<T4 Start>
• Q5. What will be done under immediate database modification,
and for which transactions, if a failure occurs after <T3 Start>?
<T1 Start>
<T1, A, 300>
<T1 Commit>
<T2 Start>
<T2, B, 400>
<T2 Commit>
<T3 Start>
<T3, C, 700>
<T3 Commit>
