DBMS Notes
Figure 6.1
Indexing has various attributes:
Access Types: This refers to the type of access, such as value-based search,
range access, etc.
Access Time: It refers to the time needed to find a particular data element or set of
elements.
Insertion Time: It refers to the time taken to find the appropriate space and
insert a new record.
Deletion Time: Time taken to find an item and delete it, as well as update the
index structure.
Space Overhead: It refers to the additional space required by the index.
In general, there are two types of file organization mechanisms that the indexing
methods follow to store the data:
6.1.1 Sequential File Organization or Ordered Index File: In this, the indices are
based on a sorted ordering of the values. These are generally fast and a more
traditional type of storage mechanism. These ordered or sequential file organizations
might store the data in a dense or sparse format:
o Dense Index:
For every search key value in the data file, there is an index record.
This record contains the search key and also a reference to the first
data record with that search key value.
Figure 6.2
o Sparse Index:
The index record appears only for a few items in the data file. Each item points to
a block as shown.
To locate a record, we find the index record with the largest search key value less
than or equal to the search key value we are looking for.
We start at that record pointed to by the index record, and proceed along with the
pointers in the file (that is, sequentially) until we find the desired record.
Figure 6.3
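The sparse-index lookup described above can be sketched as follows; the keys and block contents are illustrative assumptions, not taken from the figure:

```python
import bisect

# Hypothetical sparse index: one entry per block, holding the first
# (smallest) search key stored in that block.
index_keys = [10, 40, 80]          # first key in each block
blocks = [
    [10, 20, 30],                  # block 0
    [40, 50, 60, 70],              # block 1
    [80, 90],                      # block 2
]

def sparse_lookup(key):
    """Find the index record with the largest key <= the target,
    then scan that block sequentially for the record."""
    i = bisect.bisect_right(index_keys, key) - 1
    if i < 0:
        return None                # key smaller than every indexed key
    for record in blocks[i]:
        if record == key:
            return (i, record)     # (block number, record found)
    return None

print(sparse_lookup(50))
print(sparse_lookup(35))
```

Only one index entry per block keeps the index small; the price is the short sequential scan inside the chosen block.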
6.1.2 Hash File Organization: Indices are based on the values being distributed
uniformly across a range of buckets. The bucket to which a value is assigned is
determined by a function called a hash function.
There are primarily three methods of indexing:
Clustered Indexing
Non-Clustered or Secondary Indexing
Multilevel Indexing
Primary Indexing:
This is a type of Clustered Indexing wherein the data is sorted according to the search
key and the primary key of the database table is used to create the index. It is the
default format of indexing and induces sequential file organization. As primary keys are
unique and stored in sorted order, the search operation is quite efficient.
Figure 6.6
6.2 Tree Structured Indexing:
Consider a file of Students records sorted by gpa. To answer a range selection such as
"Find all students with a gpa higher than 3.0," we must identify the first such student by
doing a binary search of the file and then scan the file from that point on. If the file is
large, the initial binary search can be quite expensive, since cost is proportional to the
number of pages fetched.
One idea is to create a second file with one record per page in the original (data) file, of
the form (first key on page, pointer to page), again sorted by the key attribute (which is
gpa in our example). The format of a page in the second index file is illustrated in the figure.
Figure 6.7
We refer to pairs of the form (key, pointer) as index entries or just entries when the
context is clear. Note that each index page contains one pointer more than the number of
keys: each key serves as a separator for the contents of the pages pointed to by the
pointers to its left and right. The simple index file data structure is illustrated in the figure.
Figure 6.8
We can do a binary search of the index file to identify the page containing the first key
(gpa) value that satisfies the range selection (in our example, the first student with gpa
over 3.0) and follow the pointer to the page containing the first data record with that
key value. We can then scan the data file sequentially from that point on to retrieve
other qualifying records. This example uses the index to find the first data page
containing a Students record with gpa greater than 3.0, and the data file is scanned from
that point on to retrieve other such Students records.
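The range-selection procedure above can be sketched as follows; the index entries and gpa values are illustrative assumptions:

```python
import bisect

# Hypothetical index file: one (first_key_on_page, page_number) entry
# per data page, sorted by key (gpa in the running example).
index = [(1.5, 0), (2.4, 1), (3.1, 2), (3.6, 3)]
data_pages = [
    [1.5, 1.9, 2.2],
    [2.4, 2.8, 3.0],
    [3.1, 3.3, 3.5],
    [3.6, 3.8, 4.0],
]

def range_select(low):
    """Return all gpa values strictly greater than `low`: binary-search
    the index for the page where the first qualifying record may live,
    then scan the data file sequentially from that point on."""
    first_keys = [k for k, _ in index]
    # Page whose first key is the largest one <= low; a qualifying
    # record cannot appear on any earlier page.
    start = max(bisect.bisect_right(first_keys, low) - 1, 0)
    result = []
    for page in data_pages[start:]:
        result.extend(v for v in page if v > low)
    return result

print(range_select(3.0))
```

The binary search touches only the (small) index file; the data file is read sequentially from the first qualifying page onward.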
6.2.2 Indexed Sequential Access Methods (ISAM):
The ISAM method is an advanced sequential file organization. In this method, records are
stored in the file using the primary key. An index value is generated for each primary
key and mapped with the record. This index contains the address of the record in the
file.
The data entries of the ISAM index are in the leaf pages of the tree and additional
overflow pages chained to some leaf page. Database systems carefully organize the
layout of pages so that page boundaries correspond closely to the physical
characteristics of the underlying storage device. The ISAM structure is completely static
(except for the overflow pages, of which it is hoped, there will be few) and facilitates
such low-level optimizations. The ISAM data structure is illustrated in the figure.
Figure 6.9
Each tree node is a disk page, and all the data resides in the leaf pages. This corresponds
to an index that uses Alternative (1) for data entries. In terms of the alternatives, we can
create an index with Alternative (2) by storing the data records in a separate file and
storing (key, rid) pairs in the leaf pages of the ISAM index.
If there are several inserts to the file subsequently, so that
more entries are inserted into a leaf than will fit onto a
single page, additional pages are needed because the
index structure is static. These additional pages are
allocated from an overflow area.
The allocation of pages is illustrated in the figure.
The basic operations of insertion, deletion, and search are all quite straightforward. For
an equality selection search, we start at the root node and determine which subtree to
search by comparing the value in the search field of the given record with the key values
in the node.
The following example illustrates the ISAM index structure. Consider the tree shown
below
Figure 6.10
All searches begin at the root. For example, to locate a record with the key value 27, we
start at the root and follow the left pointer, since 27 < 40. We then follow the middle
pointer, since 20 <= 27 < 33. For a range search, we find the first qualifying data entry
as for an equality selection and then retrieve primary leaf pages sequentially. The
primary leaf pages are assumed to be allocated sequentially. This assumption is
reasonable because the number of such pages is known when the tree is created and
does not change subsequently under inserts and deletes, and so no 'next leaf page'
pointers are needed.
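The equality search just described can be sketched as follows. The keys at the root (40) and in its left child (20, 33) follow the example in the text; the remaining node contents are assumptions:

```python
# A minimal ISAM-style tree: each non-leaf node holds sorted keys and
# one more child pointer than keys; leaves are plain lists of entries.
def isam_search(node, key):
    """Equality search: at each non-leaf node, pick the subtree whose
    key range contains the search key; at a leaf, scan for the key."""
    if isinstance(node, list):              # leaf page
        return key in node
    i = 0
    while i < len(node["keys"]) and key >= node["keys"][i]:
        i += 1                              # keys act as separators
    return isam_search(node["children"][i], key)

left = {"keys": [20, 33], "children": [[10, 15], [20, 27], [33, 37]]}
right = {"keys": [51, 63], "children": [[40, 46], [51, 55], [63, 97]]}
root = {"keys": [40], "children": [left, right]}

# Follows the left pointer (27 < 40), then the middle one (20 <= 27 < 33).
print(isam_search(root, 27))
print(isam_search(root, 28))
```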
Figure 6.11
We assume that each leaf page can contain two entries. If we now insert a record with
key value 23, the entry 23* belongs in the second data page, which already contains
20* and 27* and has no more space. We deal with this situation by adding an overflow
page and putting 23* in the overflow page. Chains of overflow pages can easily develop.
For instance, inserting 48*, 41* and 42* leads to an overflow chain of two pages. All
these insertions are shown in Figure 6.11.
The deletion of an entry is handled by simply removing it. If this entry is on an
overflow page and the overflow page becomes empty, the page can be removed. If the
entry is on a primary page and deletion makes the primary page empty, the simplest
approach is to simply leave the empty primary page as it is; it serves as a placeholder
for future insertions. Thus, the number of primary leaf pages is fixed at file creation
time.
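The overflow behaviour described above can be sketched as follows, assuming (as in the text) that each leaf page holds two entries; the concrete keys follow the 23* and 48*/41*/42* examples:

```python
# Sketch of ISAM leaf behaviour: primary leaf pages are fixed at file
# creation, so inserts into a full leaf go to chained overflow pages.
PAGE_CAPACITY = 2   # each leaf page holds two entries, as in the text

class Leaf:
    def __init__(self, entries):
        self.entries = list(entries)    # primary page contents
        self.overflow = []              # chain of overflow pages

    def insert(self, key):
        if len(self.entries) < PAGE_CAPACITY:
            self.entries.append(key)
            return
        # Primary page full: append to the last overflow page, or
        # allocate a new overflow page when the chain is full too.
        if not self.overflow or len(self.overflow[-1]) == PAGE_CAPACITY:
            self.overflow.append([])
        self.overflow[-1].append(key)

leaf = Leaf([20, 27])
leaf.insert(23)                  # full primary page -> overflow page
print(leaf.entries, leaf.overflow)

chain = Leaf([40, 46])
for k in (48, 41, 42):           # builds an overflow chain of two pages
    chain.insert(k)
print(chain.overflow)
```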
6.2.3 B+ Trees:
Figure 6.12
Internal node:
An internal node of the B+ tree can contain at least n/2 child pointers, except the root
node. At most, an internal node of the tree contains n pointers.
Leaf node:
The leaf node of the B+ tree can contain at least n/2 record pointers and n/2 key values.
At most, a leaf node contains n record pointers and n key values.
Every leaf node of the B+ tree contains one block pointer P to point to the next leaf node.
Suppose we search for 55 in the tree shown below. In the intermediate node, we will
find a branch between 50 and 75 and will be redirected to the third leaf node. There the
DBMS will perform a sequential search to find 55.
Figure 6.13
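A minimal sketch of the B+ tree search for 55; the concrete keys are assumptions chosen to match the narration (a branch between 50 and 75 leading to the third leaf):

```python
# B+ tree search sketch: internal nodes only steer the search; all
# record pointers live in the leaves, where a sequential scan finishes.
def bplus_search(node, key):
    while not node["leaf"]:
        i = 0
        while i < len(node["keys"]) and key >= node["keys"][i]:
            i += 1
        node = node["children"][i]
    return key in node["keys"]       # sequential scan within the leaf

leaf1 = {"leaf": True, "keys": [10, 20]}
leaf2 = {"leaf": True, "keys": [25, 40]}
leaf3 = {"leaf": True, "keys": [50, 55, 65]}   # 55 lives here
leaf4 = {"leaf": True, "keys": [75, 80]}
root = {"leaf": False, "keys": [25, 50, 75],
        "children": [leaf1, leaf2, leaf3, leaf4]}

print(bplus_search(root, 55))    # branch between 50 and 75 -> third leaf
```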
6.2.3.3 B+ Tree Insertion:
Suppose we want to insert a record 60 in the below structure. It will go to the 3rd leaf
node after 55. It is a balanced tree, and a leaf node of this tree is already full, so we
cannot insert 60 there.
In this case, we have to split the leaf node so that it can be inserted into the tree without
affecting the fill factor, balance and order.
Figure 6.14
The 3rd leaf node has the values (50, 55, 60, 65, 70), and the intermediate node currently branches at 50. We
will split the leaf node of the tree in the middle so that its balance is not altered. So we
can group (50, 55) and (60, 65, 70) into 2 leaf nodes.
If these two are to be leaf nodes, the intermediate node cannot branch from 50 alone. It
should have 60 added to it, and then we can have pointers to the new leaf node.
Figure 6.15
This is how we can insert an entry when there is overflow. In a normal scenario, it is
very easy to find the node where it fits and then place it in that leaf node.
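The split above can be sketched as a small helper; the middle split and the copied-up separator follow the (50, 55, 60, 65, 70) example:

```python
# B+ tree leaf split sketch: a full leaf is split in the middle, and
# the first key of the new right leaf is copied up into the parent as
# the new separator, preserving balance.
def split_leaf(keys):
    mid = len(keys) // 2
    left, right = keys[:mid], keys[mid:]
    return left, right, right[0]      # separator pushed to the parent

left, right, sep = split_leaf([50, 55, 60, 65, 70])
print(left, right, sep)
```

Copying (rather than moving) the separator up is what distinguishes a B+ tree leaf split from an internal-node split: the key 60 remains in the leaf as well.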
6.2.3.4 B+ Tree Deletion:
Suppose we want to delete 60 from the above example. In this case, we have to remove
60 from the intermediate node as well as from the 4th leaf node. If we remove it
from the intermediate node, then the tree will not satisfy the rule of the B+ tree. So we
need to modify it to have a balanced tree.
After deleting node 60 from above B+ tree and re-arranging the nodes, it will show as
follows:
Figure 6.16
Advantages:
Automatically adjusts the nodes to fit a new record. Similarly, it reorganizes the
nodes on deletion, if required. Hence it does not alter the definition of the
B+ tree.
No file degradation problem
Suitable for partial and range searches too
6.3 Hash Based Indexing:
Introduction:
In DBMS, hashing is a technique to directly search the location of desired data on the
disk without using an index structure. Data is stored in the form of data blocks whose
address is generated by applying a hash function; the memory location where these
records are stored is known as a data block or data bucket.
Hashing maps a search key directly to the pid of the containing page/page-
overflow chain
Doesn’t require intermediate page fetches for internal “steering nodes” of tree-
based indices
Hash-based indexes are best for equality selections. They do not support efficient
range searches.
Static and dynamic hashing techniques exist with trade-offs similar to ISAM vs.
B+ trees.
Important Terminologies used in Hashing:
The following terminologies are commonly used in hashing:
Data bucket – Data buckets are memory locations where the records are stored.
It is also known as Unit Of Storage.
Key: A DBMS key is an attribute or set of attributes that helps you to identify
a row (tuple) in a relation (table). This allows you to find the relationship between
two tables.
Hash function: A hash function is a mapping function that maps all the
search keys to the addresses where the actual records are placed.
Linear Probing – Linear probing uses a fixed interval between probes. In this
method, the next available data block is used to enter the new record, instead of
overwriting the older record.
Quadratic probing – It helps you to determine the new bucket address. The
interval between probes grows by adding successive outputs of a quadratic
polynomial to the starting value given by the original computation.
Hash index – It is the address of the data block. A hash function could be a
simple mathematical function or even a complex one.
Double Hashing – Double hashing is a computer programming technique used in
hash tables to resolve hash collisions.
Bucket Overflow: The condition of bucket overflow is called a collision. This is a
fatal stage for any static hash function.
There are mainly two types of hashing methods:
1. Static Hashing
2. Dynamic Hashing
6.3.1 Static Hashing
In the static hashing, the resultant data bucket address will always remain the same.
Therefore, if you generate an address for say Student_ID = 10 using hashing
function mod(3), the resultant bucket address will always be 1. So, you will not see any
change in the bucket address.
Therefore, in this static hashing method, the number of data buckets in memory always
remains constant throughout. In this example, we will have five data buckets in the
memory used to store the data.
Figure 6.17
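A minimal sketch of static hashing, using the mod(3) hash function from the text (the bucket count of three is an assumption to match mod(3); the point is that it never changes):

```python
# Static hashing sketch: the number of buckets is fixed when the file
# is created, so a key always hashes to the same bucket address.
NUM_BUCKETS = 3
buckets = [[] for _ in range(NUM_BUCKETS)]

def bucket_for(student_id):
    return student_id % NUM_BUCKETS   # the hash function never changes

def insert(student_id):
    buckets[bucket_for(student_id)].append(student_id)

insert(10)
print(bucket_for(10), buckets)   # Student_ID = 10 always lands in bucket 1
```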
Static Hash Functions
Inserting a record: When a new record needs to be inserted into the table,
you can generate an address for it using its hash key. When the
address is generated, the record is automatically stored in that location.
Searching: When you need to retrieve the record, the same hash function can
be used to retrieve the address of the bucket where the data is stored.
Deleting a record: Using the hash function, you can first fetch the record you
want to delete. Then you can remove the record at that address in
memory.
Static hashing is further divided into:
1. Open hashing
2. Close hashing.
1. Open Hashing: In the open hashing method, instead of overwriting the older one, the
next available data block is used to enter the new record. This method is also known as
linear probing.
For example: suppose R3 is a new record that needs to be inserted, and the hash
function generates address 112 for R3. But the generated address is already full, so
the system searches for the next available data bucket, 113, and assigns R3 to it.
Figure 6.18
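The linear-probing insert above can be sketched as follows; the bucket numbers and record names loosely follow the R3/112 example:

```python
# Open hashing (linear probing) sketch: when the home bucket is full,
# the next available bucket is used instead of overwriting a record.
table = {112: "R1", 113: None, 114: None}   # one record per bucket
order = sorted(table)

def insert_linear(home, record):
    """Place `record` in its home bucket, or probe forward one bucket
    at a time until a free one is found."""
    i = order.index(home)
    while i < len(order):
        if table[order[i]] is None:
            table[order[i]] = record
            return order[i]                 # bucket actually used
        i += 1
    raise RuntimeError("no free bucket")

print(insert_linear(112, "R3"))   # 112 is occupied, so R3 goes to 113
```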
2. Close Hashing:
When buckets are full, then a new data bucket is allocated for the same hash result and
is linked after the previous one. This mechanism is known as Overflow chaining.
For example: Suppose R3 is a new record that needs to be inserted into the table, and
the hash function generates address 110 for it. But this bucket is too full to store the
new data. In this case, a new bucket is inserted at the end of bucket 110's chain and
linked to it.
Figure 6.19
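Overflow chaining can be sketched with linked buckets; the capacity of two records per bucket is an assumption for illustration:

```python
# Closed hashing (overflow chaining) sketch: a full bucket gets a new
# bucket allocated for the same hash result and linked after it.
BUCKET_CAPACITY = 2

class Bucket:
    def __init__(self):
        self.records = []
        self.next = None        # link to the overflow bucket, if any

    def insert(self, record):
        b = self
        while len(b.records) == BUCKET_CAPACITY:
            if b.next is None:
                b.next = Bucket()   # allocate and link an overflow bucket
            b = b.next
        b.records.append(record)

bucket110 = Bucket()
for r in ("R1", "R2", "R3"):    # the third record overflows the bucket
    bucket110.insert(r)
print(bucket110.records, bucket110.next.records)
```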
6.3.2 Dynamic Hashing:
o The dynamic hashing method is used to overcome the problems of static hashing
like bucket overflow.
o In this method, data buckets grow or shrink as the number of records increases
or decreases. This method is also known as the Extendable hashing method.
o This method makes hashing dynamic, i.e., it allows insertion or deletion without
resulting in poor performance.
How to search a key
o First, calculate the hash address of the key.
o Check how many bits are used in the directory; call this number i.
o Take the least significant i bits of the hash address. This gives an index into the
directory.
o Now, using the index, go to the directory and find the bucket address where the
record might be.
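The search steps above can be sketched as follows; the bucket contents match the grouping example later in this section, and the global depth i = 2 is an assumption:

```python
# Extendible hashing lookup sketch: the directory is indexed by the
# least significant i bits of the hash address, where i is the number
# of bits currently used by the directory (its global depth).
GLOBAL_DEPTH = 2                       # i = 2 bits in use

# Each 2-bit suffix maps to a bucket of keys.
bucket = {"00": [2, 4], "01": [5, 6], "10": [1, 3], "11": [7]}

def lookup_bucket(hash_address):
    suffix = hash_address & ((1 << GLOBAL_DEPTH) - 1)   # last i bits
    entry = format(suffix, "0{}b".format(GLOBAL_DEPTH))
    return entry, bucket[entry]

print(lookup_bucket(0b10001))   # last two bits 01 -> bucket holding 5 and 6
```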
How to insert a new record
o Firstly, you have to follow the same procedure for retrieval, ending up in some
bucket.
o If there is still space in that bucket, then place the record in it.
o If the bucket is full, then we will split the bucket and redistribute the records.
For example:
Consider the following grouping of keys into buckets, depending on the last bits of their
hash address:
Figure 6.20
The last two bits of 2 and 4 are 00, so they go into bucket B0. The last two bits
of 5 and 6 are 01, so they go into bucket B1. The last two bits of 1 and 3 are
10, so they go into bucket B2. The last two bits of 7 are 11, so it goes into B3.
Figure 6.21
Insert key 9 with hash address 10001 into the above structure:
o Since key 9 has hash address 10001 and its last two bits are 01, it must go into
bucket B1. But bucket B1 is full, so it will get split.
o The split separates 5 and 9 from 6: the last three bits of 5 and 9 are 001, so they
go into bucket B1, and the last three bits of 6 are 101, so it goes into bucket B5.
o Keys 2 and 4 are still in B0. The records in B0 are pointed to by the 000 and 100
entries because the last two bits of both entries are 00.
o Keys 1 and 3 are still in B2. The records in B2 are pointed to by the 010 and 110
entries because the last two bits of both entries are 10.
o Key 7 is still in B3. The record in B3 is pointed to by the 111 and 011 entries
because the last two bits of both entries are 11.
Figure 6.22
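The split of bucket B1 can be sketched as follows; the hash addresses are assumptions chosen to match the text (5 ends in 001, 6 ends in 101, 9 is 10001):

```python
# Extendible hashing split sketch: when a full bucket is split, its
# keys are redistributed using one more suffix bit than before.
hash_addr = {5: 0b00001, 6: 0b00101, 9: 0b10001}

def suffix(key, bits):
    """Last `bits` bits of the key's hash address."""
    return hash_addr[key] & ((1 << bits) - 1)

def split_bucket(keys, bits):
    """Redistribute `keys` into buckets keyed by a (bits+1)-bit suffix."""
    groups = {}
    for k in keys:
        groups.setdefault(suffix(k, bits + 1), []).append(k)
    return groups

# Bucket B1 (suffix 01) overflows on inserting 9; splitting on three
# bits separates 5 and 9 (suffix 001) from 6 (suffix 101).
print(split_bucket([5, 6, 9], 2))
```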
Advantages of dynamic hashing:
o In this method, performance does not decrease as the data in the system grows;
the method simply increases the memory size to accommodate the data.
o In this method, memory is well utilized as it grows and shrinks with the data.
There will not be any unused memory lying around.
o This method is good for the dynamic database where data grows and shrinks
frequently.
6.3.3 Linear Hashing:
Linear Hashing is a dynamic hashing technique that handles growth without a directory.
If the data distribution is very skewed, however, overflow chains could cause Linear
Hashing performance to be worse than that of Extendible Hashing.
The scheme utilizes a family of hash functions h0, h1, h2, ... , with the property
that each function's range is twice that of its predecessor.
That is, if hi maps a data entry into one of M buckets, hi+1 maps a data entry into one of
2M buckets.
The idea is best understood in terms of rounds of splitting. During round number Level,
only hash functions hLevel and hLevel+1 are in use. The buckets in the file at the
beginning of the round are split, one by one from the first to the last bucket, thereby
doubling the number of buckets.
Figure 6.23
Unlike Extendible Hashing, when an insert triggers a split, the bucket into which the data
entry is inserted is not necessarily the bucket that is split. An overflow page is added to
store the newly inserted data entry as in Static Hashing.
We now describe Linear Hashing in more detail.
A counter Level is used to indicate the current round number and is initialized to 0. The
bucket to split is denoted by Next and is initially bucket 0. We denote the number of
buckets in the file at the beginning of round Level by NLevel.
We can easily verify that NLevel = N * 2^Level. Let the number of buckets at the
beginning of round 0, denoted by N0, be N.
Whenever a split is triggered, the Next bucket is split, and hash function hLevel+1
redistributes entries between this bucket (say bucket number b) and its split image; the
split image is therefore bucket number b + NLevel.
After splitting a bucket, the value of Next is incremented by 1.
Figure 6.24
In the example file, inserting data entry 43* triggers a split. The file after completing the
insertion is shown in the figure.
At any time in the middle of a round Level, all buckets before bucket Next have been
split, and the file contains buckets that are their split images, as illustrated.
Buckets Next through NLevel - 1 have not yet been split. If we use hLevel on a data
entry and obtain a number b in the range Next through NLevel - 1, the data entry
belongs to bucket b.
Figure 6.25
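The bucket-selection rule can be sketched with the family hi(key) = key mod (N * 2^i); the values of N, Level, and Next below are illustrative assumptions:

```python
# Linear hashing lookup sketch. During round `level`, buckets before
# `next_` have already been split, so for those keys we must rehash
# with the next function in the family to choose between the bucket
# and its split image.
N = 4                 # buckets at the beginning of round 0

def h(i, key):
    return key % (N * 2 ** i)     # each range is twice its predecessor's

def bucket_of(key, level, next_):
    b = h(level, key)
    if b < next_:                 # bucket b was already split this round
        b = h(level + 1, key)     # h_{level+1} picks b or its split image
    return b

# Round 0 with the first bucket already split: next_ = 1.
print(bucket_of(8, 0, 1))   # h0 gives 0 < next_, so h1 decides: bucket 0
print(bucket_of(4, 0, 1))   # h0 gives 0 < next_, h1 sends it to the split image
print(bucket_of(5, 0, 1))   # h0 gives 1 >= next_, so bucket 1 directly
```

Note how key 4 lands in bucket 4, i.e. b + NLevel = 0 + 4, the split image of bucket 0.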
Not all insertions trigger a split, of course. If we insert 37* into the file, the appropriate
bucket has space for the new data entry. The file after the insertion is shown in the figure.
Figure 6.26
Sometimes the bucket pointed to by Next (the current candidate for splitting) is full, and
a new data entry should be inserted in this bucket. In this case, a split is triggered, of
course, but we do not need a new overflow bucket. This situation is illustrated by
inserting 29* into the file.
When Next is equal to NLevel - 1 and a split is triggered, we split the last of the buckets
present in the file at the beginning of round Level.
Figure 6.27
The number of buckets after the split is twice the number at the beginning of the round,
and we start a new round with Level incremented by 1 and Next reset to 0.
Incrementing Level amounts to doubling the effective range into which keys are hashed.
We do not discuss deletion in detail, but it is essentially the inverse of insertion. If the last
bucket in the file is empty, it can be removed and Next can be decremented.
If we wish, we can combine the last bucket with its split image even when it is not
empty, using some criterion to trigger this merging in essentially the same way.
6.3.4 Extendible vs. Linear Hashing:
To understand the relationship between Linear Hashing and Extendible Hashing, imagine
that we also have a directory in Linear Hashing with elements 0 to N - 1. The first split is
at bucket 0, and so we add directory element N.
We may imagine that the entire directory has been doubled at this point; however,
because element 1 is the same as element N + 1, element 2 is the same as element
N + 2, and so on, we can avoid the actual copying for the rest of the directory.
The second split occurs at bucket 1; now directory element N + 1 becomes significant
and is added. At the end of the round, all the original N buckets are split, and the
directory is doubled in size.
The directory analogy is useful for understanding the ideas behind Extendible and Linear
Hashing.
However, the directory structure can be avoided for Linear Hashing by allocating primary
bucket pages consecutively, which would allow us to locate the page for bucket i by a
simple offset calculation.
For uniform distributions, this implementation of Linear Hashing has a lower average
cost for equality selections.
A different implementation of Linear Hashing, in which a directory is actually maintained,
offers the flexibility of not allocating one page per bucket; null directory elements can be
used as in Extendible Hashing. However, this implementation introduces the overhead of
a directory level and could prove costly for large, uniformly distributed files.