
UNIT – V

Data on External Storage, File Organization and Indexing, Cluster Indexes, Primary and Secondary Indexes, Index data
Structures, Hash Based Indexing, Tree base Indexing, Comparison of File Organizations, Indexes and Performance Tuning,
Intuitions for tree Indexes, Indexed Sequential Access Methods (ISAM), B+ Trees: A Dynamic Index Structure.

Data on External Storage:


A DBMS stores a vast quantity of data, and the data must persist across program executions.
Therefore, data is stored on external storage devices such as disks and tapes, and fetched into main
memory as needed for processing. In the storage hierarchy, storage capacity increases from top to
bottom (CPU registers down to tape) while access speed decreases.
These storage devices can be broadly categorized into three types:
o Primary Storage
o Secondary Storage
o Tertiary Storage

Primary Storage

Primary storage is the storage area that offers quick access to the stored data. It is also known as
volatile storage, because this type of memory does not store data permanently: if the system
suffers a power cut or a crash, the data is lost. Main memory and cache are the types of primary
storage.

o Main Memory: It holds the data that is currently being operated on and handles each instruction
of the computer. Main memory can store gigabytes of data on a system, but it is generally too small
(and too expensive) to hold an entire database. It loses its whole contents if the system shuts down
because of a power failure or other reasons.
o Cache: It is one of the costliest storage media, but also the fastest. A cache is a tiny storage
medium that is usually maintained by the computer hardware. Designers of query processors and
data structures take cache effects into account when designing their algorithms.

Secondary Storage

Secondary storage is also called online storage. It is the storage area that allows the user to
save and store data permanently. This type of memory does not lose data due to a power
failure or system crash, which is why it is also called non-volatile storage.

The following secondary storage media are available in almost every type of computer system:
o Flash Memory: Flash memory stores data in devices such as USB (Universal Serial Bus) keys,
which are plugged into the USB slots of a computer system. USB keys help transfer data between
computer systems and come in varying capacities. Unlike main memory, flash memory retains its
stored data across power cuts. This type of storage is commonly used in server systems for caching
frequently used data; it pushes the system towards high performance and can hold larger
databases than main memory.
o Magnetic Disk Storage: This type of storage medium is also known as online storage. A
magnetic disk is used for storing data for a long time and is capable of holding an entire
database. It is the responsibility of the computer system to bring the data from the disk into
main memory for access, and, if the system performs any operation on the data, to write the
modified data back to the disk. A key strength of a magnetic disk is that its data is not affected
by a system crash or power failure, but a disk failure can easily ruin or destroy the stored data.

Tertiary Storage
It is the storage type that is external to the computer system. It has the slowest access speed,
but it is capable of storing large amounts of data. It is also known as offline storage, and is
generally used for data backup. The following tertiary storage devices are available:
o Optical Storage: Optical storage can hold megabytes or gigabytes of data. A Compact Disk
(CD) can store 700 megabytes of data with a playtime of around 80 minutes, while a Digital
Video Disk (DVD) can store 4.7 or 8.5 gigabytes of data on each side of the disk.
o Tape Storage: It is a cheaper storage medium than disks. Generally, tapes are used for
archiving or backing up data. Access to data is slow, because data is read sequentially
from the start; thus, tape storage is also known as sequential-access storage. Disk storage is
known as direct-access storage, as we can directly access data at any location on the disk.

Fig. Storage Hierarchy


Magnetic Disks

Hard disk drives are the most common secondary storage devices in present-day computer systems.
They are called magnetic disks because they use the concept of magnetization to store
information. Hard disks consist of metal platters coated with magnetizable material, mounted
on a spindle. A read/write head moves between the platters and is used to magnetize or
de-magnetize the spot under it; a magnetized spot can be recognized as a 0 (zero) or a 1 (one).
Hard disks are formatted in a well-defined order to store data efficiently. A hard disk platter has
many concentric circles on it, called tracks. Every track is further divided into sectors. A sector
on a hard disk typically stores 512 bytes of data.

Redundant Array of Independent Disks

RAID stands for Redundant Array of Independent Disks. It is a technology used to
connect multiple secondary storage devices for increased performance, data redundancy, or both. It
gives you the ability to survive one or more drive failures, depending on the RAID level used.

It consists of an array of disks in which multiple disks are connected to achieve different goals.

o Level 0: Non-redundant: RAID level 0 provides data striping, i.e., data is placed across
multiple disks. Because it is based purely on striping, failure of any one disk results in the
loss of all data in the array.
o This level doesn't provide fault tolerance, but it increases system performance. It doesn't
contain any error-detection mechanism.

In our example, the RAID Level 0 solution consists of four data disks. Independent of the
number of data disks, the effective space utilization for a RAID Level 0 system is always 100
percent, since no disk space is spent on redundancy.
Level 1: Mirrored:
This level is called mirroring of data, as it copies the data from drive 1 to drive 2. It provides 100%
redundancy in case of a failure: instead of having one copy of the data, two identical copies of the
data are maintained on two different disks. This type of redundancy is often called mirroring.
Only half the space of the drives is used to store data; the other half is a mirror of the
already stored data.

Level 2: Error-Correcting Codes:


RAID 2 records Error Correction Codes, computed using Hamming codes, for its data, striped
across different disks. Each data bit in a word is recorded on a separate disk, and the ECC bits of
the data words are stored on a further set of disks. Due to its complex structure and high cost,
RAID 2 is not commercially available.

Level 3: Byte-Level Striping with Dedicated Parity:


o RAID 3 consists of byte-level striping with dedicated parity. In this level, the parity
information is computed for each stripe and written to a dedicated parity drive.
o RAID 3 stripes the data onto multiple disks, and the parity generated for the data words is
stored on a separate disk. This technique makes it possible to recover from single-disk failures.
o In case of a drive failure, the parity drive is accessed and the data is reconstructed from the
remaining devices. Once the failed drive is replaced, the missing data can be restored on the
new drive.
Level 4: Block-Interleaved Parity:
In this level, an entire block of data is written onto the data disks, and then the parity is generated
and stored on a different disk. Note that level 3 uses byte-level striping, whereas level 4 uses
block-level striping. Both level 3 and level 4 require at least three disks to implement RAID.
With four data disks and one check disk, the effective space utilization is 80 percent. The
effective space utilization increases with the number of data disks, since only one check disk is
ever necessary.

Level 5: Block-Interleaved Distributed Parity:
RAID 5 writes whole data blocks onto different disks, but the parity blocks generated for the
data-block stripes are distributed among all the disks rather than being stored on a single
dedicated parity disk.

Level 6: RAID 6 is an extension of level 5. In this level, two independent parities are generated
and stored in a distributed fashion among multiple disks. The two parities provide additional fault
tolerance. This level requires at least four disk drives to implement RAID.
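
The fault tolerance of levels 3 through 6 rests on XOR parity. Below is a minimal Python sketch
of the idea, not tied to any particular RAID implementation; the three data disks, their 4-byte
blocks and the failure scenario are made-up assumptions for illustration.

    # XOR parity as used in RAID levels 3-5 (illustrative sketch only).
    def parity(blocks):
        # The parity block is the byte-wise XOR of all given blocks.
        result = bytearray(len(blocks[0]))
        for block in blocks:
            for i, b in enumerate(block):
                result[i] ^= b
        return bytes(result)

    data = [b"\x10\x20\x30\x40", b"\x01\x02\x03\x04", b"\xAA\xBB\xCC\xDD"]
    p = parity(data)                      # stored on the check disk

    # If disk 1 fails, its block is the XOR of the parity block and the
    # blocks on the surviving disks.
    recovered = parity([p, data[0], data[2]])
    assert recovered == data[1]

Because XOR is associative and self-inverse, any single missing block can be rebuilt this way,
which is exactly why one check disk suffices to survive a single-disk failure.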
File Organization:
A database consists of a huge amount of data. The data is grouped within tables in an RDBMS,
and each table has related records. A user sees the data stored in the form of tables, but in
reality this huge amount of data is stored in physical memory in the form of files.
A file is a named collection of related information that is recorded on secondary storage such as
magnetic disks, magnetic tapes and optical disks.
File Organization refers to the logical relationships among the various records that constitute the
file, particularly with respect to the means of identification and access to any specific record. In
simple terms, storing the files in a certain order is called file organization. File Structure refers to
the format of the label and data blocks and of any logical control record.

Types of File Organizations

o Sequential file organization


o Heap file organization
o Hash file organization
o Clustered file organization

Sequential File Organization

This is the easiest method of file organization: files are stored sequentially. The method can be
implemented in two ways:

1. Pile File Method:


o It is a simple method: records are stored in a sequence, i.e., one after another, in the order in
which they are inserted into the table.
o In case of updating or deleting a record, the record is first searched for in the memory blocks.
Once found, it is marked for deletion, and the new record is inserted.
Insertion of the new record:

Suppose we have records R1, R3, and so on up to R9 and R8 in a sequence (a record is simply a
row in a table). If we want to insert a new record R2 into the sequence, it will be placed at the
end of the file.

2. Sorted File Method:


o In this method, the new record is always inserted at the end of the file, and then the sequence
is sorted in ascending or descending order. Sorting of the records is based on the primary key or
some other key.
o In case of modification of a record, the record is updated, the file is re-sorted, and the updated
record ends up in its correct position.

Insertion of the new record:

Suppose there is a pre-existing sorted sequence of records R1, R3, and so on up to R6 and R7.
If a new record R2 has to be inserted, it is first inserted at the end of the file, and then the
sequence is re-sorted.
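
The following Python sketch contrasts the two methods, using an in-memory list in place of disk
blocks; the (key, name) record layout is a made-up example.

    # Sequential file organization: pile vs. sorted method (sketch).
    records = [(1, "R1"), (3, "R3"), (6, "R6"), (7, "R7")]

    # Pile file method: simply append at the end, in arrival order.
    records.append((2, "R2"))

    # Sorted file method: append at the end, then re-sort on the key.
    records.sort(key=lambda rec: rec[0])
    print(records)  # [(1, 'R1'), (2, 'R2'), (3, 'R3'), (6, 'R6'), (7, 'R7')]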
Pros of sequential file organization
o It is a fast and efficient method for huge amounts of data.
o Files can be stored on cheaper storage media such as magnetic tape.
o It is simple in design and requires little effort to store the data.
o It suits workloads where most of the records have to be accessed, such as calculating students'
grades or generating salary slips.
o It is used for report generation and statistical calculations.

Cons of sequential file organization


o It wastes time: we cannot jump directly to the required record, but must move through the
records sequentially.
o The sorted file method takes extra time and space for sorting the records.

Heap file organization


o It is the simplest and most basic type of organization. It works with data blocks. In heap file
organization, records are inserted at the end of the file; no sorting or ordering of the records is
required when they are inserted.
o When a data block is full, the new record is stored in some other block, which need not be the
very next data block: the DBMS can select any data block in memory to store the new record.
The heap file is also known as an unordered file.
o In this file, every record has a unique id, and every page in the file is of the same size. It is the
DBMS's responsibility to store and manage the new records.
Insertion of a new record

Suppose we have five records R1, R3, R6, R4 and R5 in a heap, and we want to insert a new
record R2. If data block 3 is full, R2 will be inserted into whichever data block the DBMS
selects, say data block 1.
If we want to search, update or delete data in a heap file, we need to traverse the data from the
start of the file until we reach the requested record.

If the database is very large, searching, updating or deleting a record is therefore time-consuming,
because there is no sorting or ordering of the records: we must check all the data until we reach
the requested record.
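
The sketch below imitates this behaviour in Python; the block capacity of two records and the
initial contents are assumptions made for illustration.

    # Heap file organization (sketch): any block with room may take a record,
    # and every lookup is a linear scan over all blocks.
    BLOCK_SIZE = 2
    blocks = [["R1"], ["R3", "R6"], ["R4", "R5"]]

    def heap_insert(record):
        for block in blocks:              # the DBMS may pick any block
            if len(block) < BLOCK_SIZE:
                block.append(record)
                return
        blocks.append([record])           # all blocks full: allocate a new one

    def heap_search(record):
        for block in blocks:              # no ordering: scan from the start
            if record in block:
                return True
        return False

    heap_insert("R2")
    print(blocks)             # [['R1', 'R2'], ['R3', 'R6'], ['R4', 'R5']]
    print(heap_search("R2"))  # True, but only after scanning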

Pros of Heap file organization


o It is a very good file organization for bulk insertion: if a large amount of data needs to be
loaded into the database at one time, this method is best suited.
o For a small database, storing and retrieving records is faster than with sequential organization.

Cons of Heap file organization


o This method is inefficient for large databases, because searching for or modifying a record
requires scanning through the data blocks.

Hash File Organization

Hash file organization uses the computation of a hash function on some fields of the records. The
hash function's output determines the location of the disk block where the record is to be placed.

When a record has to be retrieved using the hash key columns, the address is generated from the
hash function, and the whole record is fetched using that address. In the same way, when a new
record has to be inserted, the address is generated using the hash key and the record is inserted
directly at that address. The same process applies in the case of delete and update.
In this method, there is no effort spent on searching or sorting the entire file; each record is
stored wherever in memory the hash function directs.
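
A minimal Python sketch of this idea follows; the number of blocks, the key values and the use
of Python's built-in hash() as the hash function are illustrative assumptions.

    # Hash file organization (sketch): the hash of the key field directly
    # determines the block, so no search over the file is needed.
    NUM_BLOCKS = 8
    blocks = [[] for _ in range(NUM_BLOCKS)]

    def block_of(key):
        return hash(key) % NUM_BLOCKS     # hash output -> disk block address

    def insert(key, record):
        blocks[block_of(key)].append((key, record))

    def fetch(key):
        for k, record in blocks[block_of(key)]:   # examine one block only
            if k == key:
                return record

    insert(101, "Ajay")
    insert(202, "Ramya")
    print(fetch(202))   # Ramya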

Indexing
If the records in a file are in sorted order, then searching becomes very fast. But in most
cases, records are placed in the file in the order in which they are inserted, with new records
appended at the end of the file, so the records are not in sorted order. To make searching
faster in files with unsorted records, indexing is used.

Indexing is a data structure technique which allows you to quickly retrieve records from a
database file. An index is a small table having only two columns: the first column contains a
copy of the primary or candidate key of the table, and the second column contains a set of disk
block addresses where the record with that specific key value is stored.
Indexing in DBMS can be of the following types:

o Primary Indexing (further classified into Dense Indexing and Sparse Indexing)
o Secondary Indexing
o Clustering Indexing

i. Primary Index
o If the index is created using the primary key of the table, then it is known as a primary index.
o As primary keys are unique and are stored in sorted order, the performance of the searching
operation is quite efficient.
o The primary index can be classified into two types: dense index and sparse index.

Dense index
o If every record in the table has one index entry in the index table, then it is called a dense
index.
o Here, the number of records (rows) in the index table is the same as the number of records
(rows) in the main table.
o As every record has an index entry, searching becomes faster.

Index entry -> record in the main table
TS -> (TS, Hyderabad, KCR)
AP -> (AP, Amaravathi, Jagan)
TN -> (TN, Madras, Palani Swamy)
MH -> (MH, Bombay, Thackray)

Sparse index
o If only a few records in the table have index entries in the index table, then it is called a
sparse index.
o Here, the number of records (rows) in the index table is less than the number of records
(rows) in the main table.
o As not every record has an index entry, searching is slower for records that do not have
index entries.

Index entry -> record in the main table
TS -> (TS, Hyderabad, KCR)
TN -> (TN, Madras, Palani Swamy)
MH -> (MH, Bombay, Thackray)

The main table still holds all four records: (TS, Hyderabad, KCR), (AP, Amaravathi, Jagan),
(TN, Madras, Palani Swamy), (MH, Bombay, Thackray).
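
The Python sketch below shows how a lookup through a sparse index proceeds: binary-search the
index for the largest entry not exceeding the search key, then scan inside the addressed block.
The miniature index and blocks are made-up stand-ins for the tables above (the data file is kept
sorted on the key, as a sparse index requires).

    import bisect

    # One index entry per block: (first key in the block, block number).
    sparse_index = [("AP", 0), ("MH", 1), ("TN", 2), ("TS", 3)]
    data_blocks = {0: [("AP", "Amaravathi", "Jagan")],
                   1: [("MH", "Bombay", "Thackray")],
                   2: [("TN", "Madras", "Palani Swamy")],
                   3: [("TS", "Hyderabad", "KCR")]}

    def lookup(key):
        keys = [k for k, _ in sparse_index]
        pos = bisect.bisect_right(keys, key) - 1   # largest indexed key <= key
        block_no = sparse_index[pos][1]
        for record in data_blocks[block_no]:       # short scan within block
            if record[0] == key:
                return record

    print(lookup("TN"))   # ('TN', 'Madras', 'Palani Swamy')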

ii. Secondary Index


When the size of the main table grows, the size of the index table also grows. If the index table
grows too large, fetching an address from the index itself becomes slower. To overcome this
problem, secondary indexing is introduced.

o In secondary indexing, to reduce the size of the mapping, another level of indexing is introduced.
o It contains two levels. In the first level, each record in the main table has one entry in the
first-level index table.
o The index entries in the first-level index table are divided into different groups. For each
group, one index entry is created and added to the second-level index table.

Multi-level Index: When the main table becomes too large, creating a second-level index
improves the search process. If the search process is still slow, we can add one more level of
indexing, and so on. This type of indexing is called a multi-level index.
iii. Clustering Index
o Sometimes the index is created on non-primary key columns, which may not be unique
for each record.
o In this case, to identify the records faster, we group two or more columns together to get a
unique value and create an index out of them. This method is called a clustering index.
o Records with similar characteristics are grouped together, and indexes are created for
these groups.

Example: Consider a college with many students in each department. All the students belonging
to the same Dept_ID are grouped together and treated as a single cluster. One index entry
points to one cluster as a whole: the index pointer points to the first record in each cluster.
Here Dept_ID is a non-unique key.

Index File          Records of table in memory

CSE -> 501  Ajay         BCD
       502  Ramya        BCA
       ...
       560  Fathima      BCE
ECE -> 401  Vijay Reddy  OC
       ...
       460  Mallesh      ST
EEE -> 201  Jhon         SC
       ...
       260  Sowmya       BCC
...

In the above diagram we can see that an index entry is created for each department in the index
file. In the data blocks, the students of each department are grouped together to form a cluster,
and the address in the index file points to the beginning of each cluster.

Hash Based Indexing

Hashing is a technique to directly compute the location of the desired data on the disk without
using an index structure. A hash function takes a piece of data (a key) as input and produces a
hash value as output, which maps the data to a particular location in the hash table.

There are mainly two types of hashing methods:


i. Static Hashing
ii. Dynamic Hashing
    o Extendible hashing
    o Linear hashing

Static Hashing:

In static hashing, the hash function produces only a fixed number of hash values. For
example, consider the hash function
f(x) = x mod 7
For any value of x, this function produces one of the hash values from {0, 1, 2, 3, 4, 5, 6}.
It means static hashing maps search-key values to a fixed set of bucket addresses.

Example: Inserting 10, 21, 16 and 12 in hash table.

Hash computation          Hash Value   Data Record
f(10) = 10 mod 7 = 3      0            21*
f(21) = 21 mod 7 = 0      1
f(16) = 16 mod 7 = 2      2            16*
f(12) = 12 mod 7 = 5      3            10*
                          4
                          5            12*
                          6

Figure 5.1: Static hashing


Suppose, later, we want to insert 23, which produces the hash value 2 (23 mod 7 = 2). But in the
above hash table, the location with hash value 2 is not empty (it contains 16*), so a collision
occurs. To resolve collisions, the following techniques are used:
o Open addressing
o Separate Chaining or Closed addressing

i. Open Addressing:

Open addressing is a collision-resolution technique which stores all the keys inside the
hash table itself; no key is stored outside the hash table. Techniques used for open addressing are:

o Linear Probing
o Quadratic Probing
o Double Hashing

 Linear Probing:

In linear probing, when there is a collision, we scan forward for the next empty slot to
store the key's record; if the last slot is reached, scanning starts again from the beginning.

Example: Consider Figure 5.1. When we try to insert 23, its hash value is 2, but slot 2 is not
empty. We move to the next slot (hash value 3); it is also full, so we move once more, to the
slot with hash value 4. As it is empty, 23 is stored there. This is shown in the diagram below.

f(23) = 23 mod 7 = 2

Hash Value   Data Record
0            21*
1
2            16*
3            10*
4            23*
5            12*
6
Figure 5.2: Linear Probing
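
A minimal Python sketch of linear probing on this example follows (table size 7, f(x) = x mod 7);
inserting 23 lands in slot 4, exactly as in Figure 5.2.

    # Static hashing with linear probing (sketch; assumes table never fills).
    TABLE_SIZE = 7
    table = [None] * TABLE_SIZE

    def insert_linear(key):
        slot = key % TABLE_SIZE
        while table[slot] is not None:      # collision: scan forward,
            slot = (slot + 1) % TABLE_SIZE  # wrapping around at the end
        table[slot] = key

    for key in (10, 21, 16, 12, 23):
        insert_linear(key)
    print(table)   # [21, None, 16, 10, 23, 12, None]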


 Quadratic Probing:

In quadratic probing, when a collision occurs, a new hash value is computed by taking the
original hash value and adding successive squares until an open slot is found. On a collision,
the following hash function is used: h(x) = (f(x) + i^2) mod n, where i = 1, 2, 3, ... and f(x)
is the initial hash value.

Example: Consider Figure 5.1. When we try to insert 23, its hash value is 2, but slot 2 is not
empty. We compute a new hash value as (2 + 1^2) mod 7 = 3; that slot is also full, so we
compute once again: (2 + 2^2) mod 7 = 6. As slot 6 is empty, 23 is stored there. This is shown
in the diagram below.

f(23) = 23 mod 7 = 2

Hash Value   Data Record
0            21*
1
2            16*
3            10*
4
5            12*
6            23*
Figure 5.3: Quadratic Probing
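
The same example with quadratic probing, as a Python sketch; inserting 23 lands in slot 6, as in
Figure 5.3.

    # Quadratic probing (sketch): probe (home + i*i) mod 7 for i = 1, 2, ...
    TABLE_SIZE = 7
    table = [None] * TABLE_SIZE

    def insert_quadratic(key):
        home = key % TABLE_SIZE
        slot, i = home, 1
        while table[slot] is not None:          # collision: try next square
            slot = (home + i * i) % TABLE_SIZE
            i += 1
        table[slot] = key

    for key in (10, 21, 16, 12, 23):
        insert_quadratic(key)
    print(table)   # [21, None, 16, 10, None, 12, 23]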

 Double Hashing

In double hashing, there are two hash functions. The second hash function provides an
offset value in case the first function causes a collision. The following function is an example
of double hashing: (firstHash(key) + i * secondHash(key)) % tableSize, with i = 1, 2, 3, ...

A popular second hash function is secondHash(key) = PRIME - (key % PRIME),


where PRIME is a prime smaller than the table size.
Example: Consider Figure 5.1. When we try to insert 23, its hash value is 2, but the slot
with hash value 2 is not empty. We compute the offset as
secondHash(23) = 5 - (23 % 5) = 2
and the probe position as (firstHash(key) + i * secondHash(key)) % tableSize = (2 + 1*2) % 7 = 4.
As the slot with hash value 4 is empty, 23 is stored there. This is shown in the diagram
below.
f(23) = 23 mod 7 = 2

Hash Value   Data Record
0            21*
1
2            16*
3            10*
4            23*
5            12*
6

Figure 5.4: Double hashing
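
And the same example once more with double hashing, as a Python sketch; PRIME = 5 is the
prime below the table size used in the text, and 23 again lands in slot 4, as in Figure 5.4.

    # Double hashing (sketch): probe (home + i * secondHash(key)) mod 7.
    TABLE_SIZE, PRIME = 7, 5
    table = [None] * TABLE_SIZE

    def second_hash(key):
        return PRIME - (key % PRIME)        # never returns 0

    def insert_double(key):
        home = key % TABLE_SIZE
        slot, i = home, 1
        while table[slot] is not None:
            slot = (home + i * second_hash(key)) % TABLE_SIZE
            i += 1
        table[slot] = key

    for key in (10, 21, 16, 12, 23):
        insert_double(key)
    print(table)   # [21, None, 16, 10, 23, 12, None]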

ii. Separate Chaining or Closed addressing:

To handle a collision, this technique creates a linked list at the slot for which the collision
occurs; the new key is then inserted into the linked list. These linked lists hanging off the slots
look like chains, so the technique is called separate chaining. It is also called closed addressing.

Example: Inserting 10, 21, 16, 12, 23, 19, 28 and 30 into the hash table.

f(10) = 10 mod 7 = 3
f(21) = 21 mod 7 = 0
f(16) = 16 mod 7 = 2
f(12) = 12 mod 7 = 5
f(23) = 23 mod 7 = 2
f(19) = 19 mod 7 = 5
f(28) = 28 mod 7 = 0
f(30) = 30 mod 7 = 2

Hash Value   Data Record (chain)
0            21* -> 28*
1
2            16* -> 23* -> 30*
3            10*
4
5            12* -> 19*
6
Figure 5.5: Separate chaining example
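
A short Python sketch of separate chaining follows; an ordinary Python list stands in for each
slot's linked chain.

    # Separate chaining (sketch): every slot holds a chain of keys.
    TABLE_SIZE = 7
    table = [[] for _ in range(TABLE_SIZE)]

    def insert_chained(key):
        table[key % TABLE_SIZE].append(key)   # collisions extend the chain

    for key in (10, 21, 16, 12, 23, 19, 28, 30):
        insert_chained(key)
    print(table)   # [[21, 28], [], [16, 23, 30], [10], [], [12, 19], []]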


2. Dynamic Hashing
The problem with static hashing is that it does not expand or shrink dynamically as the
size of the database grows or shrinks. Dynamic hashing provides a mechanism in which data
buckets are added and removed dynamically and on-demand. Dynamic hashing can be
implemented using two techniques. They are:

o Extendible hashing
o Linear Hashing

i. Extendible hashing
In extendible hashing, a separate directory of pointers to buckets is used. The number of bits
used to index the directory is called the global depth (gd), and the number of entries in the
directory is 2^gd. The number of bits on which all the keys within a bucket agree is called the
local depth (ld) of that bucket. The hash function uses the last few binary bits of the key to find
the bucket. If a bucket overflows, it splits, and if the local depth of the split bucket becomes
greater than the global depth, the directory doubles in size. Extendible hashing is one form of
dynamic hashing.
Example: Let the global depth (gd) = 2, meaning the directory contains four entries. Let the
local depth (ld) of each bucket = 2, meaning two bits are needed to locate a key's bucket. Let
each bucket's capacity be four. Let us insert 21, 15, 28, 17, 16, 13, 19, 12, 10, 24, 25 and 11.
21 = 10101    19 = 10011
15 = 01111    12 = 01100
28 = 11100    10 = 01010
17 = 10001    24 = 11000
16 = 10000    25 = 11001
13 = 01101    11 = 01011

Global depth (gd) = 2

Directory   Bucket (local depth)
00 -------> Bucket A (ld 2): 28* 16* 12* 24*
01 -------> Bucket B (ld 2): 21* 17* 25* 13*
10 -------> Bucket C (ld 2): 10*
11 -------> Bucket D (ld 2): 15* 19* 11*

Figure 6.1: Extendible hashing example
Now, let us consider insertion of data entry 20* (binary 10100). Looking at directory
element 00, we are led to bucket A, which is already full. So we have to split the bucket by
allocating a new bucket and redistributing the contents (including the new entry to be inserted)
across the old bucket and its 'split image' (the new bucket). To redistribute entries across the old
bucket and its split image, we consider the last three bits: the last two bits are 00, indicating a
data entry that belongs to one of these two buckets, and the third bit discriminates between
them. That is, if a key's last three bits are 000, it belongs to bucket A, and if the last three
bits are 100, it belongs to bucket A2. As we now use three bits for A and A2, the local depth of
these buckets becomes 3. This is illustrated in Figure 6.2 below.

Global depth (gd) = 2

Directory   Bucket (local depth)
00 -------> Bucket A (ld 3): 16* 24*
01 -------> Bucket B (ld 2): 21* 17* 25* 13*
10 -------> Bucket C (ld 2): 10*
11 -------> Bucket D (ld 2): 15* 19* 11*
            Bucket A2 (ld 3): 28* 12* 20*

Figure 6.2: After inserting 20 and splitting Bucket A

After the split, buckets A and A2 have local depth greater than the global depth (3 > 2), so the
directory is doubled and three bits are used as the global depth. As buckets A and A2 have local
depth 3, they get separate pointers from the directory, while buckets B, C and D still use only
local depth 2, so two directory entries point to each of them. This is shown in the diagram below.

Global depth (gd) = 3

Directory        Bucket (local depth)
000 -----------> Bucket A (ld 3): 16* 24*
001, 101 ------> Bucket B (ld 2): 21* 17* 25* 13*
010, 110 ------> Bucket C (ld 2): 10*
011, 111 ------> Bucket D (ld 2): 15* 19* 11*
100 -----------> Bucket A2 (ld 3): 28* 12* 20*

Figure 6.3: After inserting 20 and doubling the directory


An important question is whether splitting a bucket necessitates a directory
doubling. Consider our example, as shown in Figure 6.3. If we now insert 9* (01001), it belongs
in bucket B, which is already full. We can deal with this situation by splitting the bucket and
using directory elements 001 and 101 to point to the bucket and its split image. This is shown in
the diagram below. Bucket B and its split image now have local depth three, which is not
greater than the global depth, so no doubling is needed. Hence, a bucket split does not
necessarily require a directory doubling. However, if either bucket A or A2 grows full and an
insert then forces a bucket split, we are forced to double the directory once again.
Global depth (gd) = 3

Directory        Bucket (local depth)
000 -----------> Bucket A (ld 3): 16* 24*
001 -----------> Bucket B (ld 3): 17* 25* 9*
010, 110 ------> Bucket C (ld 2): 10*
011, 111 ------> Bucket D (ld 2): 15* 19* 11*
100 -----------> Bucket A2 (ld 3): 28* 12* 20*
101 -----------> Bucket B2 (ld 3): 21* 13*

Figure 6.4: After inserting 9

Key Observations:
o A bucket will have more than one directory pointer pointing to it if its local depth is less than
the global depth.
o When an overflow occurs in a bucket, all the entries in that bucket are rehashed with a new
(incremented) local depth.
o Only when the new local depth of the overflowing bucket exceeds the global depth are the
directories doubled and the global depth incremented by 1.
o The size of a bucket cannot be changed after the data insertion process begins.
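
The compact Python sketch below implements this scheme under a few simplifying assumptions:
it starts from global depth 1 rather than 2, bucket capacity is 4 as in the example, and a split
image that is itself full is not re-split (which never happens for this key sequence). After the
inserts shown, the buckets match Figure 6.3.

    BUCKET_CAPACITY = 4

    class Bucket:
        def __init__(self, ld):
            self.ld = ld                  # local depth
            self.keys = []

    gd = 1                                # global depth
    directory = [Bucket(1), Bucket(1)]

    def insert(key):
        global gd, directory
        bucket = directory[key & ((1 << gd) - 1)]   # last gd bits of the key
        if len(bucket.keys) < BUCKET_CAPACITY:
            bucket.keys.append(key)
            return
        if bucket.ld == gd:                   # split needs one more bit than
            directory = directory + directory # the directory has: double it
            gd += 1
        bucket.ld += 1                        # split the overflowing bucket
        image = Bucket(bucket.ld)
        old_keys, bucket.keys = bucket.keys + [key], []
        for i in range(len(directory)):       # re-point half of the entries
            if directory[i] is bucket and (i >> (bucket.ld - 1)) & 1:
                directory[i] = image
        for k in old_keys:                    # redistribute on ld bits
            directory[k & ((1 << gd) - 1)].keys.append(k)

    for k in (21, 15, 28, 17, 16, 13, 19, 12, 10, 24, 25, 11, 20):
        insert(k)

    seen = []
    for b in directory:                       # print each bucket once
        if b not in seen:
            seen.append(b)
            print(b.ld, b.keys)               # matches Figure 6.3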

ii. Linear Hashing


Linear hashing is a dynamic hashing technique that linearly grows or shrinks the number of
buckets in a hash file, without a directory of the kind used in extendible hashing. It uses a
family of hash functions instead of a single hash function.
This scheme utilizes a family of hash functions h0, h1, h2, ..., with the property that each
function's range is twice that of its predecessor. That is, if h_i maps a data entry into one of N
buckets, then h_{i+1} maps a data entry into one of 2N buckets. One example of such a family
is h_i(key) = key mod (2^i * N), where N is the initial number of buckets and i = 0, 1, 2, ...
Initially, N buckets labelled 0 through N-1 are used, and the initial hash function h0(key) =
key % N maps any key into one of the N buckets. On each overflow, the buckets are split in
serial order, and the split bucket's contents are redistributed between it and its split
image: the first overflow (in whichever bucket it occurs) splits bucket 0, the second overflow
splits bucket 1, and so on. When the number of buckets reaches 2N, this marks the end of
splitting round 0. Hash function h0 is no longer needed, as all 2N buckets can be addressed by
hash function h1. In the new round, namely splitting round 1, bucket splitting once again
starts from bucket 0 and a new hash function h2 comes into use. This process repeats as the
hash file grows.
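
The Python sketch below captures this splitting discipline with N = 4 and bucket capacity 4,
matching the example that follows; for simplicity, overflow entries stay in the same Python list
instead of on separate overflow pages.

    N, CAPACITY = 4, 4
    level, nxt = 0, 0                  # h_level(key) = key % (2**level * N)
    buckets = [[] for _ in range(N)]

    def insert(key):
        global level, nxt
        i = key % (2 ** level * N)
        if i < nxt:                            # bucket already split this
            i = key % (2 ** (level + 1) * N)   # round: use the next function
        buckets[i].append(key)
        if len(buckets[i]) > CAPACITY:         # overflow: split bucket `nxt`
            buckets.append([])
            old, buckets[nxt] = buckets[nxt], []
            for k in old:                      # redistribute with h_{level+1}
                buckets[k % (2 ** (level + 1) * N)].append(k)
            nxt += 1
            if nxt == 2 ** level * N:          # round over: next level
                level, nxt = level + 1, 0

    for k in (4, 13, 19, 25, 14, 24, 15, 18, 23, 11, 16, 12, 10, 27):
        insert(k)
    print(buckets)   # bucket contents as in the second figure below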

Example: Let N = 4, so we use 4 buckets, and the hash function h0(key) = key % 4 maps any
key into one of the four buckets. Let us initially insert 4, 13, 19, 25, 14, 24, 15, 18, 23, 11,
16, 12 and 10. This is shown in the figure below.

Bucket#   h1    h0   Primary pages      Overflow pages
0         000   00   4* 24* 16* 12*
1         001   01   13* 25*
2         010   10   14* 18* 10*
3         011   11   19* 15* 23* 11*

Next, when 27 is inserted, an overflow occurs in bucket 3. So bucket 0 (the first bucket) is split,
and its content is distributed between bucket 0 and the new bucket. This is shown in the figure below.

Bucket#   h1    h0   Primary pages      Overflow pages
0         000   00   24* 16*
1         001   01   13* 25*
2         010   10   14* 18* 10*
3         011   11   19* 15* 23* 11*   27*
4         100   00   4* 12*
Next, when 30, 31 and 34 are inserted, an overflow occurs in bucket 2. So bucket 1 is split, and
its content is distributed between bucket 1 and a new bucket. This is shown in the figure below.
Bucket#   h1    h0   Primary pages      Overflow pages
0         000   00   24* 16*
1         001   01   13*
2         010   10   14* 18* 10* 30*   34*
3         011   11   19* 15* 23* 11*   27* 31*
4         100   00   4* 12*
5         101   01   25*

When 32, 35, 40 and 48 are inserted, an overflow occurs in bucket 0. So bucket 2 is split, and its
content is distributed between bucket 2 and a new bucket. This is shown in the figure below.
Bucket#   h1    h0   Primary pages      Overflow pages
0         000   00   24* 16* 32* 40*   48*
1         001   01   13*
2         010   10   18* 10* 34*
3         011   11   19* 15* 23* 11*   27* 31* 35*
4         100   00   4* 12*
5         101   01   25*
6         110   10   14* 30*

When 26, 20 and 42 are inserted, an overflow occurs in bucket 2. So bucket 3 is split, and its
content is distributed between bucket 3 and a new bucket. This is shown in the figure below.
Bucket#   h1    h0   Primary pages      Overflow pages
0         000   00   24* 16* 32* 40*   48*
1         001   01   13*
2         010   10   18* 10* 34* 26*   42*
3         011   11   19* 11* 27* 35*
4         100   00   4* 12* 20*
5         101   01   25*
6         110   10   14* 30*
7         111   11   15* 23* 31*
This marks the end of splitting round 0. Hash function h0 is no longer needed, as all 2N
buckets can be addressed by hash function h1. In the new round, namely splitting round 1,
bucket splitting once again starts from bucket 0 and a new hash function h2 comes into use.
This process repeats as the hash file grows.

Intuitions for Tree Indexes

We can use tree-like structures as indexes as well. For example, a binary search tree can
be used as an index. If we want to find a particular record via a binary search tree, we
have the added advantage of the binary search procedure, which makes searching even
faster. A binary tree can be considered a 2-way search tree, because it has two pointers in
each of its nodes and can therefore guide the search in two distinct directions. Remember that
for a node storing 2 pointers, the number of values stored in the node is one less than the
number of pointers, i.e., each node contains 1 value.

The above concept can be further expanded to the notion of an m-way search tree, where m
represents the number of pointers in a particular node. If m = 3, then each node of the search
tree contains 3 pointers, and each node then contains 2 values.
We mainly use two tree-structured indexes in DBMS:

o Indexed Sequential Access Method (ISAM)
o B+ Tree

INDEXED SEQUENTIAL ACCESS METHODS (ISAM)


ISAM is a tree-structured index that allows the DBMS to locate a particular record using the
index without having to search the entire data set.
o The records in the file are sorted according to the primary key and saved on disk.
o For each primary key, an index value is generated and mapped to the record. This
index is essentially the address of the record.
o A data file sorted according to the primary index is called an indexed sequential file.
o The process of accessing an indexed sequential file is called ISAM.
o ISAM makes searching for a record in a large database easy and quick, but a proper
primary key has to be selected to make ISAM efficient.
o ISAM gives the flexibility to generate indexes on other fields as well, in addition to the
primary key fields.
ISAM contains three types of nodes:
o Non-leaf nodes: they store the search key values.
o Leaf nodes: they store the indexes of records.
o Overflow nodes: they also store the indexes of records, but only after the corresponding
leaf node is full.

On an ISAM tree, we can perform search, insertion and deletion operations.


Search operation: It follows a binary search process. The record being searched will be found
in the leaf nodes or overflow nodes only; the non-leaf nodes are used to direct the search.
Insertion operation: First locate the leaf node where the insertion should take place (using
binary search). If space is available in that leaf node, insert the entry there; otherwise create an
overflow node, insert the record's index entry into it, and link the overflow node to the leaf
node. The non-leaf part of the tree is never modified.
Deletion operation: First locate the leaf node where the deletion should take place (using binary
search). If the value to be deleted is in the leaf node or in an overflow node, remove it. If an
overflow node becomes empty after removing the value, delete that overflow node as well.
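
The Python sketch below mirrors these three operations on a tiny ISAM file. The leaf pages and
separator keys are a made-up miniature (leaf capacity 4); the static non-leaf level is just a
sorted list of separators that is binary-searched with bisect.

    import bisect

    leaves = [[10, 20, 23, 27], [31, 35, 42, 46], [59, 61, 68, 71]]
    overflow = [[], [], []]        # one overflow chain per leaf
    separators = [31, 59]          # static non-leaf level, never modified

    def leaf_for(key):
        return bisect.bisect_right(separators, key)   # binary search

    def insert(key):
        i = leaf_for(key)
        if len(leaves[i]) < 4:
            bisect.insort(leaves[i], key)
        else:
            overflow[i].append(key)    # leaf full: goes to the overflow chain

    def search(key):
        i = leaf_for(key)
        return key in leaves[i] or key in overflow[i]

    def delete(key):
        i = leaf_for(key)
        if key in leaves[i]:
            leaves[i].remove(key)
        elif key in overflow[i]:
            overflow[i].remove(key)

    insert(24); insert(33)
    print(search(33), overflow)        # True [[24], [33], []]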
Example: Insert 10, 23, 31, 20, 68, 35, 42, 61, 27, 71, 46 and 59
Root:     [31]
Level 1:  [23]               [42 59 68]
Leaves:   [10 20] [23 27]    [31 35] [42 46] [59 61] [68 71]
After inserting 24, 33, 36 and 39 into the above tree, it looks like this (overflow pages hang off
the full leaves):

Root:     [31]
Level 1:  [23]               [42 59 68]
Leaves:   [10 20] [23 27]    [31 35] [42 46] [59 61] [68 71]
Overflow: [23 27] -> [24]    [31 35] -> [33 36] -> [39]

Deletion: From the above figure, after deleting 42, 71, 24 and 36:

Root:     [31]
Level 1:  [23]               [42 59 68]
Leaves:   [10 20] [23 27]    [31 35] [46] [59 61] [68]
Overflow: [31 35] -> [33] -> [39]

Note that the non-leaf index entries (such as 42) are left unchanged: the ISAM index structure is
static, so deletions affect only the leaf and overflow pages.

B+ Tree:
B+ Tree is an extension of the binary search tree idea that allows efficient insertion, deletion
and search operations. It is used to implement indexing in DBMS. In a B+ tree, data values are
stored only in the leaf nodes, while the internal nodes store the search key values.

1. A B+ tree of order m can store at most m-1 values in each node.
2. Each node can have a maximum of m children and at least ⌈m/2⌉ children (except the root).
3. The values in each node are kept in sorted order.
4. All nodes must be at least half full, except the root node.
5. Only leaf nodes contain data values; non-leaf nodes contain search keys.
B+ tree Search:
Searching for a value in a B+ tree always starts at the root node and moves downwards
until it reaches a leaf node, following the usual search-tree procedure.

1. Read the value to be searched. Let us call this value X.
2. Start the search process from the root node.
3. At each non-leaf node (including the root node),
   a. If all the values in the non-leaf node are greater than X, then move to its first child.
   b. If all the values in the non-leaf node are less than or equal to X, then move to its
      last child.
   c. If, for two consecutive values in the non-leaf node, the left value is less than or
      equal to X and the right value is greater than X, then move to the child node whose
      pointer lies between these two consecutive values.
4. Repeat step 3 until a leaf node is reached.
5. At the leaf node, compare X with the values in that node from left to right. If the key
   value is found, report found; otherwise report not found.
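
The Python sketch below encodes this procedure with nodes as plain dicts; the tree constructed
is the example tree of the figure that follows. The child chosen at a non-leaf node is simply the
number of its keys that are less than or equal to X, which covers cases a, b and c at once.

    import bisect

    def leaf(*keys):
        return {"leaf": True, "keys": list(keys)}

    def node(keys, children):
        return {"leaf": False, "keys": keys, "children": children}

    # The example B+ tree from the figure below.
    root = node([18], [
        node([11], [node([8], [leaf(2, 5), leaf(8, 9)]),
                    node([15], [leaf(11, 12), leaf(15, 16)])]),
        node([31, 64], [node([23], [leaf(18, 20), leaf(23, 27)]),
                        node([42, 59], [leaf(31, 35), leaf(42, 46),
                                        leaf(59, 61)]),
                        node([68], [leaf(64, 66), leaf(68, 71)])]),
    ])

    def search(n, x):
        while not n["leaf"]:
            # child index = number of keys <= x (steps 3a, 3b, 3c together)
            n = n["children"][bisect.bisect_right(n["keys"], x)]
        return x in n["keys"]

    print(search(root, 35))   # True: [18] -> [31 64] -> [42 59] -> [31 35]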

Example: Searching for 35 in the B+ tree given below. The search path is
[18] -> [31 64] -> [42 59] -> leaf [31 35].

Root:     [18]
Level 1:  [11]                          [31 64]
Level 2:  [8] [15]                      [23] [42 59] [68]
Leaves:   [2 5] [8 9] [11 12] [15 16]   [18 20] [23 27] [31 35] [42 46] [59 61] [64 66] [68 71]

B+ tree Insertion:

1. Apply the search operation on the B+ tree to find the leaf node where the new value has to
   be inserted.
2. If the leaf node is not full, insert the value into the leaf node.
3. If the leaf node is full, then
   a. Split the leaf node, including the newly inserted value, into two nodes such that each
      contains half of the values (if the count is odd, the 2nd node gets the extra value).
   b. Copy the smallest value of the new right leaf node (the 2nd node) into the parent node,
      and add pointers from the new leaf nodes to the parent node.
   c. If the parent is full, split it too, moving the middle key (if the count is even, the 1st
      value of the 2nd half) of the parent up into its parent.
   d. Repeat until a parent is found that does not need to split.
4. If the root splits, create a new root which has one key and two pointers.
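
The leaf-split of step 3 can be sketched in a few lines of Python for order m = 4; parent
handling (steps 3b through 4) is omitted, and the call shown reproduces the "After inserting 7"
step of the example below.

    import bisect

    def split_leaf(leaf, new_key):
        bisect.insort(leaf, new_key)      # insert first, then split the keys
        mid = len(leaf) // 2              # 2nd node gets the extra key if odd
        left, right = leaf[:mid], leaf[mid:]
        return left, right, right[0]      # right[0] is copied up to the parent

    left, right, up = split_leaf([1, 3, 5], 7)
    print(left, right, up)   # [1, 3] [5, 7] 5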

Example: Insert 1, 5, 3, 7, 9, 2, 4, 6, 8, 10 into a B+ tree of order 4.

A B+ tree of order 4 holds a maximum of 3 values per node.

Initially: (empty tree)

After inserting 1:   [1]

After inserting 5:   [1 5]

After inserting 3:   [1 3 5]

After inserting 7 (the full leaf splits into [1 3] and [5 7]; 5 is copied up):
Root:   [5]
Leaves: [1 3] [5 7]

After inserting 9:
Root:   [5]
Leaves: [1 3] [5 7 9]

After inserting 2:
Root:   [5]
Leaves: [1 2 3] [5 7 9]

After inserting 4 ([1 2 3] splits into [1 2] and [3 4]; 3 is copied up):
Root:   [3 5]
Leaves: [1 2] [3 4] [5 7 9]

After inserting 6 ([5 7 9] splits into [5 6] and [7 9]; 7 is copied up):
Root:   [3 5 7]
Leaves: [1 2] [3 4] [5 6] [7 9]

After inserting 8:
Root:   [3 5 7]
Leaves: [1 2] [3 4] [5 6] [7 8 9]

After inserting 10 ([7 8 9] splits into [7 8] and [9 10]; 9 is copied up, making the parent
[3 5 7 9] over-full, so it splits into [3 5] and [9], with 7 moving up into a new root):
Root:     [7]
Level 1:  [3 5]                [9]
Leaves:   [1 2] [3 4] [5 6]    [7 8] [9 10]
B+ tree Deletion

o Identify the leaf node L from which the deletion should take place.
o Remove the data value to be deleted from the leaf node L.
o If L still meets the "half full" criterion, then the deletion is done.
o If L does not meet the "half full" criterion, then
  o If L's right sibling can spare a data value, move the smallest value of the right sibling
    into L (the right sibling must still satisfy the half-full criterion after giving up a
    value; otherwise it cannot give one).
  o Else, if L's left sibling can spare a data value, move the largest value of the left
    sibling into L (under the same condition on the left sibling).
  o Else, merge L with a sibling.
  o If any internal node (including the root) contains a key value equal to the deleted
    value, delete that value and replace it with its successor. This change may propagate
    up to the root (if it does, the tree height decreases).

Example: Consider the tree given below and perform the following deletions.

Root:     [19]
Level 1:  [5 14]                [24 33]
Leaves:   [2 3] [5 7] [14 16]   [19 20 22] [24 27 29] [33 34 38 39]

Delete 19: The half-full criterion is still satisfied after deleting 19, so just remove 19 from the
leaf node.

Root:     [19]
Level 1:  [5 14]                [24 33]
Leaves:   [2 3] [5 7] [14 16]   [20 22] [24 27 29] [33 34 38 39]
Delete 20: The half-full criterion is not satisfied after deleting 20, so bring 24 over from the
right sibling and change the key values in the internal nodes.

Root:     [19]
Level 1:  [5 14]                [27 33]
Leaves:   [2 3] [5 7] [14 16]   [22 24] [27 29] [33 34 38 39]

Delete 24: The half-full criterion is not satisfied after deleting 24, and borrowing a value from a
sibling is not possible either. Therefore merge the leaf with its right sibling and change the key
values in the internal nodes.

Root:     [19]
Level 1:  [5 14]                [33]
Leaves:   [2 3] [5 7] [14 16]   [22 27 29] [33 34 38 39]

Delete 5: The half-full criterion is not satisfied after deleting 5, and borrowing a value from a
sibling is not possible. Therefore merge the leaf with its left sibling (merging with the right
sibling is also possible) and change the key values in the internal nodes.

Root:     [19]
Level 1:  [14]               [33]
Leaves:   [2 3 7] [14 16]    [22 27 29] [33 34 38 39]

Delete 7: The half-full criterion is still satisfied after deleting 7, so just remove 7 from the
leaf node.

Root:     [19]
Level 1:  [14]               [33]
Leaves:   [2 3] [14 16]      [22 27 29] [33 34 38 39]
Delete 2: The half-full criterion is not satisfied after deleting 2, and borrowing a value from a
sibling is not possible. Therefore merge the leaf with its right sibling and change the key values
in the internal nodes; the change propagates upward and the tree height decreases.

Root:    [22 33]
Leaves:  [3 14 16] [22 27 29] [33 34 38 39]

Indexes and Performance Tuning


Indexing is very important for executing DBMS queries efficiently, and adding indexes to
important tables is a regular part of performance tuning. When we identify a frequently executed
query that is scanning a table or causing an expensive key lookup, the first consideration is
whether an index can solve the problem. If it can, add the index to that table.

While indexes can improve query execution speed, the price we pay is index
maintenance: update and insert operations need to update the index with new data, so
writes slow down slightly with each index we add to a table. We also need to monitor
index usage and identify when an existing index is no longer needed. This allows us to keep our
indexing relevant and trim enough to ensure that we don't waste disk space and I/O on write
operations to unnecessary indexes. To improve the performance of the system, we need to do the
following:

o Identify unused indexes and remove them.
o Identify minimally used indexes and remove them.
o Identify indexes that are scanned frequently but rarely find the required answer, and modify
  them so that the answer can be reached.
o Identify indexes that are very similar and combine them.
