L4 Indexing

The document discusses indexing in databases, highlighting its importance for efficient data retrieval without scanning every row. It covers various types of indices, including ordered and hash indices, and explains the structure and access mechanisms of disk storage. Additionally, it addresses the performance implications of different indexing strategies and the management of indices during data insertion and deletion.

Uploaded by

xihuatl074
Indexing

Dr. K. M. Azharul Hasan


Dept. of CSE, KUET
Indexing

• Indexes are used to quickly locate data without having to search every row in a database every time a database table is accessed.
• Indexes can be created using one or more columns of a database table, providing the basis for both rapid random lookups and efficient access of ordered records.
Storage and Indexing?

• DB design uses logical models (ER/relational).
  - An appropriate level for designers to begin with
  - Provides independence from implementation details
• Performance is another major factor in user satisfaction.
  - Depends on
    - Efficient data structures for data representation
    - Efficiency of system operations on those structures
  - Disks contain data files and system files, including dictionary and index files
  - Disk access is one of the most critical factors in performance.
Storage Hierarchy

• A DBMS stores information on some storage medium.
• Primary storage: can be operated on directly by the CPU.
• Secondary storage:
  - larger capacity, lower cost, slower access
  - cannot be operated on directly by the CPU; data must first be copied to primary storage
• Secondary storage has major implications for DBMS design.
  - READ: transfer data from secondary storage to main memory
  - WRITE: transfer data from main memory to secondary storage
  - Both transfers are high-cost operations relative to in-memory operations, so they must be planned carefully.
Why Not Store Everything in Main Memory?

• Cost and size
• Main memory is volatile: what's the problem? Its contents are lost when power is lost.
• Typical storage hierarchy:
  - Factors: access speed, cost per unit, reliability
  - Cache and main memory (RAM) for currently used data: fast but costly
  - Flash memory: non-volatile, limited number of writes (and slow), a disk substitute in embedded systems
  - Disk for the main database (secondary storage)
  - Tapes for archiving older versions of the data (tertiary storage)
Disks

• Secondary storage device of choice.
• Data is stored and retrieved in units called disk blocks or pages.
• Unlike RAM, the time to retrieve a disk page varies depending on its location on disk.
  - Therefore, the relative placement of pages on disk has a major impact on DBMS performance!
Components of a Disk

[Figure: disk anatomy, labelled with spindle, tracks, sector, disk head, platters, arm assembly, and arm movement]

• The platters spin.
• The arm assembly is moved in or out to position a head on a desired track. The tracks under the heads make a cylinder (imaginary!).
• Only one head reads/writes at any one time.
• Block size is a multiple of sector size (which is fixed).
Accessing a Disk Page

• Time to access (read/write) a disk block:
  - seek time (moving arms to position disk head on track)
  - rotational delay (waiting for block to rotate under head)
  - transfer time (actually moving data to/from disk surface)
• Seek time and rotational delay dominate.
• Key to lower I/O cost: reduce seek/rotation delays.
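
The three components above simply add up for each block access. A minimal sketch, assuming illustrative drive parameters (the 8 ms seek, 7200 rpm, 100 MB/s, and 4 KB block figures are assumptions, not values from these slides, and access_time_ms is a hypothetical helper):

```python
def access_time_ms(seek_ms, rpm, block_bytes, transfer_mb_per_s):
    """Estimate the time to read or write one disk block, in milliseconds."""
    # Average rotational delay: half of one full rotation.
    rotational_delay_ms = (60_000 / rpm) / 2
    # Transfer time: block size divided by the sustained transfer rate.
    transfer_ms = block_bytes / (transfer_mb_per_s * 1_000_000) * 1_000
    return seek_ms + rotational_delay_ms + transfer_ms


# Hypothetical drive: 8 ms average seek, 7200 rpm, 100 MB/s, 4 KB blocks.
random_read = access_time_ms(8.0, 7200, 4096, 100)

# Reading the physically next block: assume no seek and no rotational wait,
# so only the transfer time remains.
sequential_read = 4096 / (100 * 1_000_000) * 1_000

print(f"random: {random_read:.2f} ms, sequential: {sequential_read:.3f} ms")
```

Under these assumed numbers almost all of the random-access cost is seek plus rotation, which is why placing related pages next to each other pays off.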
Basic Concepts

• Indexing mechanisms are used to speed up access to desired data.
  - E.g., author catalog in a library
• Search key: an attribute or set of attributes used to look up records in a file.
• An index file consists of records (called index entries) of the form (search-key, pointer).
• Index files are typically much smaller than the original file.
• Two basic kinds of indices:
  - Ordered indices: search keys are stored in sorted order
  - Hash indices: search keys are distributed uniformly across "buckets" using a "hash function".
06/05/2025
Index Evaluation Metrics

• Access types supported efficiently, e.g.,
  - records with a specified value in the attribute
  - records with an attribute value falling in a specified range of values
• Access time
• Insertion time
• Deletion time
• Space overhead

Ordered Indices

• In an ordered index, index entries are stored sorted on the search-key value. E.g., author catalog in a library.
• Primary index: in a sequentially ordered file, the index whose search key specifies the sequential order of the file.
  - Also called clustering index
  - The search key of a primary index is usually but not necessarily the primary key.
• Secondary index: an index whose search key specifies an order different from the sequential order of the file. Also called non-clustering index.
• Index-sequential file: ordered sequential file with a primary index.

Dense Index Files

• Dense index: an index record appears for every search-key value in the file.

Sparse Index Files

• Sparse index: contains index records for only some search-key values.
  - Applicable when records are sequentially ordered on the search key
• To locate a record with search-key value K we:
  - Find the index record with the largest search-key value ≤ K
  - Search the file sequentially starting at the record to which the index record points

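The two lookup steps for a sparse index can be sketched as follows. This is a minimal illustration, assuming one index entry per block that holds the block's least key; the data, `blocks`, `sparse_index`, and `sparse_lookup` are hypothetical names, not part of the slides:

```python
from bisect import bisect_right

# Hypothetical file: blocks of (search-key, record) pairs sorted on the key.
blocks = [
    [(3, "r3"), (5, "r5"), (11, "r11")],
    [(30, "r30"), (35, "r35")],
    [(100, "r100"), (101, "r101"), (110, "r110")],
]

# Sparse index: one entry per block, holding the least key in that block.
sparse_index = [block[0][0] for block in blocks]


def sparse_lookup(key):
    """Return the record stored under key, or None if it is absent."""
    # Step 1: index entry with the largest search-key value <= key.
    pos = bisect_right(sparse_index, key) - 1
    if pos < 0:
        return None  # key is smaller than every indexed key
    # Step 2: sequential scan starting where that index entry points.
    for block in blocks[pos:]:
        for k, record in block:
            if k == key:
                return record
            if k > key:
                return None  # passed the place where key would be
    return None


print(sparse_lookup(35), sparse_lookup(50))
```

Note that the comparison in step 1 has to include equality: a key that is itself the first key of a block (30 here) must land on its own index entry.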
Sparse Index Files (Cont.)

• Compared to dense indices:
  - Less space and less maintenance overhead for insertions and deletions.
  - Generally slower than a dense index for locating records.
• Good tradeoff: a sparse index with an index entry for every block in the file, corresponding to the least search-key value in the block.

Multilevel Index

• If the primary index does not fit in memory, access becomes expensive.
• Solution: treat the primary index kept on disk as a sequential file and construct a sparse index on it.
  - outer index: a sparse index of the primary index
  - inner index: the primary index file
• If even the outer index is too large to fit in main memory, yet another level of index can be created, and so on.
• Indices at all levels must be updated on insertion or deletion from the file.

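A two-level lookup can be sketched like this. It is only an illustration under stated assumptions: the inner index is a sorted list of keys cut into blocks of 10 entries, the outer index keeps the least key of each inner block, and `build_sparse`, `find_entry`, and `lookup` are hypothetical helpers:

```python
from bisect import bisect_right


def build_sparse(sorted_keys, fanout):
    """Cut sorted keys into blocks of `fanout` and index each block's least key."""
    blocks = [sorted_keys[i:i + fanout] for i in range(0, len(sorted_keys), fanout)]
    return blocks, [block[0] for block in blocks]


def find_entry(index_keys, key):
    """Position of the entry with the largest key <= key (clamped to 0)."""
    return max(bisect_right(index_keys, key) - 1, 0)


# Hypothetical inner (primary) index with 200 entries, stored in blocks of 10.
inner_index = list(range(0, 1000, 5))
inner_blocks, outer_index = build_sparse(inner_index, 10)


def lookup(key):
    """Return (inner block number, inner index entry) that covers key."""
    block_no = find_entry(outer_index, key)  # outer index: small, kept in memory
    block = inner_blocks[block_no]           # one read of an inner-index block
    return block_no, block[find_entry(block, key)]


print(lookup(57))
```

Only one inner-index block is touched per lookup, instead of searching the whole inner index on disk.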
Multilevel Index (Cont.)

[Figure: multilevel index]

Index Classification

Summary
• Primary vs. secondary: whether or not the search key specifies the sequential order of the file.
• Clustered vs. unclustered: whether or not the order of the data records is the same as the order of the data entries.
• Dense vs. sparse: whether or not there is an entry in the index for each key value.
• Single level vs. multilevel: whether there is a single index, or further sparse index levels built on top of the index itself.

Hash-Based Indexes

• Good for equality selections.
• Index is a collection of buckets. Bucket = primary page plus zero or more overflow pages.
• Hashing function h: h(r) = bucket in which record r belongs. h looks at the search-key fields of r.
• Buckets may contain the data records or just the rids (record ids).
• Hash-based indexes are best for equality selections; they cannot support range searches.
• So what is the difference between hashing and indexing?

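The difference between equality and range access on a hash index can be seen in a toy model. This is a sketch only: the bucket count, the modulo hash function `h`, and the sample ages are assumptions, and overflow pages are not modelled:

```python
NUM_BUCKETS = 4  # hypothetical bucket count

# Each bucket holds (search-key, rid) pairs.
buckets = [[] for _ in range(NUM_BUCKETS)]


def h(search_key):
    """Toy hash function mapping an integer search key to a bucket number."""
    return search_key % NUM_BUCKETS


def insert(key, rid):
    buckets[h(key)].append((key, rid))


def equality_search(key):
    # Only the one bucket chosen by h needs to be examined.
    return [rid for k, rid in buckets[h(key)] if k == key]


def range_search(low, high):
    # Hashing scatters neighbouring keys, so every bucket has to be scanned.
    return sorted(rid for bucket in buckets for k, rid in bucket if low <= k <= high)


for rid, age in enumerate([25, 31, 25, 40, 33]):
    insert(age, rid)

print(equality_search(25), range_search(30, 35))
```

The equality search touches one bucket, while the range search degenerates into a scan of all buckets, which is the sense in which hash indexes do not support range searches.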
Index Update: Deletion

• If the deleted record was the only record in the file with its particular search-key value, the search key is deleted from the index also.
• Single-level index deletion:
  - Dense indices: deletion of the search key is similar to file record deletion.
  - Sparse indices:
    - If an entry for the search key exists in the index, it is deleted by replacing the entry in the index with the next search-key value in the file (in search-key order).
    - If the next search-key value already has an index entry, the entry is deleted instead of being replaced.

Index Update: Insertion

• Single-level index insertion:
  - Perform a lookup using the search-key value appearing in the record to be inserted.
  - Dense indices: if the search-key value does not appear in the index, insert it.
  - Sparse indices: if the index stores an entry for each block of the file, no change needs to be made to the index unless a new block is created.
    - If a new block is created, the first search-key value appearing in the new block is inserted into the index.
• Multilevel insertion (as well as deletion) algorithms are simple extensions of the single-level algorithms.

Secondary Indices

• Frequently, one wants to find all the records whose values in a certain field (which is not the search key of the primary index) satisfy some condition.
  - Example 1: In the account relation stored sequentially by account number, we may want to find all accounts in a particular branch.
  - Example 2: As above, but where we want to find all accounts with a specified balance or range of balances.
• We can have a secondary index with an index record for each search-key value.

Secondary Indices Example

[Figure: secondary index on the balance field of account]

• Index record points to a bucket that contains pointers to all the actual records with that particular search-key value.
• Secondary indices have to be dense.
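
A secondary index with buckets can be sketched as a mapping from each distinct search-key value to the positions of the matching records. The account rows below are made up for illustration, and `balance_index` and `accounts_with_balance` are hypothetical names:

```python
from collections import defaultdict

# Hypothetical account file, stored sequentially by account number.
accounts = [
    ("A-101", "Downtown", 500),
    ("A-110", "Downtown", 600),
    ("A-215", "Mianus", 700),
    ("A-217", "Brighton", 750),
    ("A-218", "Perryridge", 700),
]

# Dense secondary index on balance: every distinct balance gets an entry,
# and each entry points to a bucket of record positions.
balance_index = defaultdict(list)
for position, (_, _, balance) in enumerate(accounts):
    balance_index[balance].append(position)
sorted_balances = sorted(balance_index)


def accounts_with_balance(low, high):
    """Account numbers whose balance lies in [low, high], in balance order."""
    result = []
    for balance in sorted_balances:
        if low <= balance <= high:
            result.extend(accounts[p][0] for p in balance_index[balance])
    return result


print(accounts_with_balance(700, 700))
```

Because the file is not ordered by balance, the records behind one bucket (A-215 and A-218 here) sit at unrelated positions, so the index cannot skip values the way a sparse index does.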
Primary and Secondary Indices

• Indices offer substantial benefits when searching for records.
• Updating indices imposes overhead on database modification: when a file is modified, every index on the file must be updated.
• A sequential scan using a primary index is efficient, but a sequential scan using a secondary index is expensive.
  - Each record access may fetch a new block from disk
  - A block fetch requires about 5 to 10 milliseconds, versus about 100 nanoseconds for a memory access

B+-Tree Index Files

• Disadvantage of indexed-sequential files:
  - Performance degrades as the file grows, since many overflow blocks get created.
  - Periodic reorganization of the entire file is required.
• Advantage of B+-tree index files:
  - Automatically reorganizes itself with small, local changes in the face of insertions and deletions.
  - Reorganization of the entire file is not required to maintain performance.
• (Minor) disadvantage of B+-trees:
  - Extra insertion and deletion overhead, space overhead.
• Advantages of B+-trees outweigh the disadvantages.
  - B+-trees are used extensively.

B+-Tree Index Files

• B+-tree indices are an alternative to indexed-sequential files.

B+-Tree Index Files (Cont.)

A B+-tree is a rooted tree satisfying the following properties:

• All paths from root to leaf are of the same length.
• Each node that is not a root or a leaf has between ⌈n/2⌉ and n children.
• A leaf node has between ⌈(n–1)/2⌉ and n–1 values.
• Special cases:
  - If the root is not a leaf, it has at least 2 children.
  - If the root is a leaf (that is, there are no other nodes in the tree), it can have between 0 and (n–1) values.

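The occupancy bounds above can be tabulated for a given n. A minimal sketch, assuming n is the maximum number of pointers per node and that the lower bounds are rounded up (ceilings), as in the standard B+-tree definition; `bplus_bounds` is a hypothetical helper name:

```python
from math import ceil


def bplus_bounds(n):
    """Occupancy limits for a B+-tree whose nodes hold at most n pointers."""
    return {
        # Non-root internal node: between ceil(n/2) and n children.
        "internal_children": (ceil(n / 2), n),
        # Leaf node: between ceil((n-1)/2) and n-1 values.
        "leaf_values": (ceil((n - 1) / 2), n - 1),
        # A root that is not a leaf needs at least 2 children.
        "root_children_min": 2,
    }


for n in (4, 5):
    print(n, bplus_bounds(n))
```

Keeping every node at least half full is what bounds the height of the tree.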
B+ Tree Example

[Figure: example B+-tree; the pointers in the leaf nodes lead to records]

B+-Tree Node Structure

• Typical node: P1 | K1 | P2 | . . . | Pn–1 | Kn–1 | Pn
  - Ki are the search-key values
  - Pi are pointers to children (for non-leaf nodes) or pointers to records or buckets of records (for leaf nodes).
• The search keys in a node are ordered:
  K1 < K2 < K3 < . . . < Kn–1

Leaf Nodes in B+-Trees

Properties of a leaf node:
• For i = 1, 2, . . ., n–1, pointer Pi either points to a file record with search-key value Ki, or to a bucket of pointers to file records, each record having search-key value Ki.
• If Li, Lj are leaf nodes and i < j, Li's search-key values are less than Lj's search-key values.
• Pn points to the next leaf node in search-key order.

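The next-leaf pointer Pn is what makes ordered scans cheap: once a starting leaf is known, the scan just follows the chain. A minimal sketch with hypothetical `Leaf`, `link`, and `range_scan` names; for brevity it starts from the first leaf rather than descending from the root:

```python
class Leaf:
    def __init__(self, keys):
        self.keys = keys   # sorted search-key values K1, K2, ...
        self.next = None   # Pn: next leaf in search-key order


def link(leaves):
    """Chain the leaves left to right and return the first one."""
    for left, right in zip(leaves, leaves[1:]):
        left.next = right
    return leaves[0]


def range_scan(start_leaf, low, high):
    """Collect keys in [low, high] by walking next-leaf pointers."""
    found = []
    leaf = start_leaf
    while leaf is not None:
        for key in leaf.keys:
            if key > high:
                return found  # keys only grow from here on
            if key >= low:
                found.append(key)
        leaf = leaf.next
    return found


first = link([Leaf([3, 5, 11]), Leaf([30, 35]), Leaf([100, 101, 110])])
print(range_scan(first, 5, 100))
```

The scan stops at the first key above the upper bound, so it never visits leaves beyond the requested range.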
Non-Leaf Nodes in B+-Trees

• Non-leaf nodes form a multilevel sparse index on the leaf nodes. For a non-leaf node with m pointers:
  - All the search keys in the subtree to which P1 points are less than K1
  - For 2 ≤ i ≤ m – 1, all the search keys in the subtree to which Pi points have values greater than or equal to Ki–1 and less than Ki
  - All the search keys in the subtree to which Pm points have values greater than or equal to Km–1

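Choosing which pointer to follow in a non-leaf node reduces to counting the keys that are less than or equal to the search key. A minimal sketch using 0-based positions (position 0 corresponds to P1); `child_index` is a hypothetical helper and the keys are illustrative:

```python
from bisect import bisect_right


def child_index(keys, search_key):
    """0-based position of the pointer to follow in a non-leaf node."""
    # Pointer i covers keys in [K(i-1), K(i)), so the answer is the
    # number of node keys that are <= search_key.
    return bisect_right(keys, search_key)


keys = [120, 150, 180]
for k in (119, 120, 149, 150, 180, 500):
    print(k, "->", child_index(keys, k))
```

A search key equal to a node key goes to the right of that key, matching the "greater than or equal to" rule above.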
Sample non-leaf

[Figure: non-leaf node with keys 120, 150, 180; its four pointers lead to keys k < 120, 120 ≤ k < 150, 150 ≤ k < 180, and k ≥ 180]

Sample leaf node

[Figure: leaf node, reached from a non-leaf node, with keys 120 and 130; each key has a pointer to the record with that key, and the last pointer leads to the next leaf in sequence]

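B+ Tree Example

[Figure: B+-tree with root key 100. The left non-leaf node has key 30 over leaves (3, 5, 11) and (30, 35); the right non-leaf node has keys 120, 150, 180 over leaves (100, 101, 110), (120, 130), (150, 156, 179), and (180, 200). Leaf pointers lead to records.]
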
B+ Tree

• Suppose a key value is 9 bytes, the page size is 512 bytes, and a pointer (both page pointer and record pointer) is 7 bytes. How many key values can you enter in a leaf node and in a non-leaf node of a B+ tree?

(Home task)

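The sizing arithmetic behind this kind of question can be sketched with different numbers. The assumptions here are that a page holds nothing but keys and pointers (no header), that a node stores n pointers and n–1 keys, and that a leaf uses the same layout with its last pointer as the next-leaf pointer; `node_capacity` and the 4096/8/8 figures are hypothetical:

```python
def node_capacity(page_bytes, key_bytes, pointer_bytes):
    """Return (max pointers n, max keys n - 1) that fit in one page."""
    # n pointers and n - 1 keys must fit:
    #   n * pointer_bytes + (n - 1) * key_bytes <= page_bytes
    # which rearranges to n <= (page_bytes + key_bytes) / (key_bytes + pointer_bytes).
    n = (page_bytes + key_bytes) // (key_bytes + pointer_bytes)
    return n, n - 1


# Hypothetical sizes: 4096-byte page, 8-byte keys, 8-byte pointers.
pointers, keys = node_capacity(4096, 8, 8)
print(pointers, keys)
```

The same function applies to the sizes given in the task once they are substituted in.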
Insert into B+ tree

First look up the proper leaf.

(a) simple case
  - leaf not full: just insert (key, pointer-to-record)
(b) leaf overflow
(c) non-leaf overflow
(d) new root

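Cases (a) and (b) can be sketched for a single leaf. The assumptions are that n is the maximum number of keys a leaf can hold (as the n = 3 examples in this deck appear to use) and that on a split the left leaf keeps the extra key when the count is odd; `insert_into_leaf` is a hypothetical helper and record pointers are left out:

```python
from bisect import insort


def insert_into_leaf(leaf_keys, key, n):
    """Insert key into a sorted leaf that holds at most n keys.

    Returns (left, right, separator). right and separator are None
    when the leaf had room and no split was needed.
    """
    keys = list(leaf_keys)
    insort(keys, key)                 # keep the leaf sorted
    if len(keys) <= n:
        return keys, None, None       # case (a): leaf not full
    mid = (len(keys) + 1) // 2        # left leaf keeps the extra key if odd
    left, right = keys[:mid], keys[mid:]
    # Case (b): the first key of the new right leaf goes up to the parent.
    return left, right, right[0]


print(insert_into_leaf([30, 31], 32, 3))
print(insert_into_leaf([3, 5, 11], 7, 3))
```

Cases (c) and (d) repeat the same idea one level up: if the parent has no room for the separator, the parent splits as well, and a split of the root creates a new root.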
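(a) Insert key = 32 (n = 3)

[Figure: the leaf (30, 31) has room, so 32 is inserted there, giving (30, 31, 32); the neighbouring leaf (3, 5, 11) and the non-leaf keys 30 and 100 are unchanged]
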
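(b) Insert key = 7 (n = 3)

[Figure: the leaf (3, 5, 11) is full, so it splits into (3, 5) and (7, 11); key 7 is added to the parent, which now holds 7 and 30; the leaf (30, 31) and the key 100 above are unchanged]
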
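(c) Insert key = 160 (n = 3)

[Figure: the leaf (150, 156, 179) splits into (150, 156) and (160, 179); the parent (120, 150, 180) is full, so it splits into (120, 150) and (180), and key 160 moves up next to 100 in the node above; the leaf (180, 200) is unchanged]
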
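(d) New root, insert 45 (n = 3)

[Figure: the leaf (30, 32, 40) splits into (30, 32) and (40, 45); the root (10, 20, 30) is full, so it splits into (10, 20) and (40), and a new root with key 30 is created above them; the leaves (1, 2, 3), (10, 12), and (20, 25) are unchanged]

• Height grows at the root, so balance is maintained.
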
Deletion from B+ tree

Again, first look up the proper leaf.

(a) Simple case: no underflow
(b) Borrow keys from an adjacent sibling (if it doesn't become too empty)
(c) Underflow (at a leaf or at a non-leaf node)

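(b) Delete 50 (n = 4)

=> min # of keys in a leaf = ⌊(n+1)/2⌋ = ⌊5/2⌋ = 2

[Figure: parent node with keys 10, 40, 100 over leaves (10, 20, 30, 35) and (40, 50); deleting 50 leaves only 40 in its leaf, so 35 is borrowed from the sibling and the parent key 40 becomes 35]
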
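(c) Leaf Underflow: Delete 50 (n = 4)

[Figure: parent node with keys 20, 40, 100 over leaves (20, 30) and (40, 50); after deleting 50 the leaf holding 40 underflows, so 40 is merged into the sibling leaf and the parent key 40 is removed]
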
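(d) Non-leaf underflow: Delete 37 (n = 4)

=> min # of keys in a non-leaf = ⌈(n+1)/2⌉ - 1 = 3 - 1 = 2

[Figure: root with key 25 over non-leaf nodes (10, 20) and (30, 40); leaves (1, 3), (10, 14), (20, 22), (25, 26), (30, 37), (40, 45). Deleting 37 makes its leaf underflow and merge with a sibling; the non-leaf node (30, 40) then drops below 2 keys and is merged with (10, 20), pulling key 25 down, so the merged node becomes the new root]
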
Home task
• Construct a B+ tree with n = 4 or 5, up to level 3, by inserting random keys so that each of the insertion cases occurs.
• How can you perform a range-key query in a B+ tree?

Queries on B+-Trees (Cont.)

• If there are K search-key values in the file, the height of the tree is no more than ⌈log_⌈n/2⌉(K)⌉.
• A node is generally the same size as a disk block, typically 4 kilobytes
  - and n is typically around 100 (40 bytes per index entry).
• With 1 million search-key values and n = 100
  - at most log_50(1,000,000) = 4 nodes are accessed in a lookup.
• Contrast this with a balanced binary tree with 1 million search-key values: around 20 nodes are accessed in a lookup
  - the above difference is significant since every node access may need a disk I/O, costing around 20 milliseconds

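The two node counts above can be checked with integer arithmetic. A minimal sketch; `levels_needed` is a hypothetical helper that finds the smallest h with fanout^h ≥ K, and the 20 ms per node access is the figure quoted on this slide:

```python
def levels_needed(num_keys, fanout):
    """Smallest h such that fanout ** h >= num_keys."""
    h, reach = 0, 1
    while reach < num_keys:
        reach *= fanout
        h += 1
    return h


IO_MS = 20  # cost of one node access, as quoted on the slide

bplus_levels = levels_needed(1_000_000, 50)   # minimum fan-out n/2 = 50
binary_levels = levels_needed(1_000_000, 2)   # balanced binary tree

print(bplus_levels, bplus_levels * IO_MS, "ms")
print(binary_levels, binary_levels * IO_MS, "ms")
```

With these numbers the B+-tree lookup costs at most 4 node accesses against roughly 20 for the binary tree, which is the fivefold difference the slide points to.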
B-Tree Index Files

• Similar to a B+-tree, but a B-tree allows search-key values to appear only once; this eliminates redundant storage of search keys.
• Search keys in nonleaf nodes appear nowhere else in the B-tree; an additional pointer field for each search key in a nonleaf node must be included.

[Figure: generalized B-tree leaf node and nonleaf node, compared with a B+-tree]

• Nonleaf node: pointers Bi are the bucket or file record pointers.

B-Tree Index File Example

[Figure: B-tree (above) and B+-tree (below) on the same data]

B-Tree Index Files (Cont.)

• Advantages of B-tree indices:
  - May use fewer tree nodes than a corresponding B+-tree.
  - Sometimes possible to find the search-key value before reaching a leaf node.
• Disadvantages of B-tree indices:
  - Only a small fraction of all search-key values are found early.
  - Non-leaf nodes are larger, so fan-out is reduced. Thus, B-trees typically have greater depth than the corresponding B+-tree.
  - Insertion and deletion are more complicated than in B+-trees.
  - Implementation is harder than for B+-trees.
  - Range-key search is difficult.
• Typically, the advantages of B-trees do not outweigh the disadvantages.

Index Definition in SQL

• Create an index
    create index <index-name> on <relation-name> (<attribute-list>)
  E.g.: create index b_index on branch(branch_name)
• Use create unique index to indirectly specify and enforce the condition that the search key is a candidate key.
  - Not really required if the SQL unique integrity constraint is supported
• To drop an index
    drop index <index-name>

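These statements can be tried out with Python's built-in sqlite3 module. This is a sketch under assumptions: the branch schema and rows are made up, SQLite is only one possible DBMS, and the exact wording of its query-plan output varies between versions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and data for the branch relation.
conn.execute("create table branch (branch_name text, branch_city text, assets integer)")
conn.executemany(
    "insert into branch values (?, ?, ?)",
    [("Downtown", "Brooklyn", 900), ("Mianus", "Horseneck", 400)],
)

# Create an index, then ask SQLite how it would run an equality query.
conn.execute("create index b_index on branch(branch_name)")
plan = " ".join(
    row[-1]  # last column of each plan row is the human-readable detail
    for row in conn.execute(
        "explain query plan select * from branch where branch_name = 'Mianus'"
    )
)
print(plan)

# Drop it and use a unique index to enforce that branch_name is a candidate key.
conn.execute("drop index b_index")
conn.execute("create unique index b_unique on branch(branch_name)")
try:
    conn.execute("insert into branch values ('Mianus', 'Elsewhere', 1)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
print("duplicate rejected:", duplicate_rejected)
```

The plan text names b_index when the index is chosen, and the unique index turns the duplicate insert into an integrity error.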
Index Selection Guidelines
• Attributes in the WHERE clause are candidates for index keys.
  - An exact-match condition suggests a cluster/sparse/hash index.
  - A range query suggests a tree index.
    - Clustering is especially useful for range queries; it can also help on equality queries if there are many duplicates.
• Multi-attribute search keys should be considered when a WHERE clause contains several conditions.
• Try to choose indexes that benefit as many queries as possible.
• If only one index can be clustered per relation, choose it based on important queries that would benefit the most from clustering.
Index Selection Guidelines (Cont.)
SELECT E.dno
FROM Emp E
WHERE E.age > 40
• A B+ tree index on E.age can be used to get qualifying tuples.
• Things to consider
  - How selective is the condition?
    - If 99% are over 40, the index is less useful
    - If 10%, an index is useful
Index Selection Guidelines (Cont.)
SELECT E.dno, COUNT(*)
FROM Emp E
WHERE E.age > 20
GROUP BY E.dno

• Consider the GROUP BY query: is using an index on age effective?
  - If many tuples have E.age > 20, using the E.age index and sorting the retrieved tuples may be costly.
  - Especially bad if this index is not clustered
  - A clustered E.dno index may be better!
Indexes with Composite Search Keys

• Composite search keys: search on a combination of fields.
  - Equality query: every field value is equal to a constant value. E.g., wrt <sal,age> index:
    - age = 12 and sal = 75
  - Range query: some field value is not a constant. E.g.:
    - age = 12; or age = 12 and sal > 10
• Data entries in the index are sorted by search key to support range queries.

Examples of composite key indexes using lexicographic order:

Data records, sorted by name:
name  age  sal
bob   12   10
cal   11   80
joe   12   20
sue   13   75

Data entries in each index:
<age,sal>  <age>  <sal,age>  <sal>
11,80      11     10,12      10
12,10      12     20,12      20
12,20      12     75,13      75
13,75      13     80,11      80
Composite Search Keys

• To retrieve Emp records with age = 30 AND sal = 4000, an index on <age,sal> would be better than an index on age or an index on sal.
• If the condition is 20 < age < 30 AND 3000 < sal < 5000:
  - A clustered index on <age,sal> or <sal,age> is best.
• If the condition is age = 30 AND 3000 < sal < 5000:
  - A clustered <age,sal> index is much better than a <sal,age> index!
• Composite indexes are larger and are updated more often.
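
Why field order matters can be seen by sorting tuples lexicographically. A minimal sketch, assuming a small made-up Emp table and the condition age = 12 AND sal > 10; the helper names `via_age_sal` and `via_sal_age` are hypothetical, and each returns the matches together with the number of index entries examined:

```python
from bisect import bisect_right
from math import inf

# Hypothetical records: (name, age, sal).
records = [("bob", 12, 10), ("cal", 11, 80), ("joe", 12, 20), ("sue", 13, 75)]

# Data entries of the two composite indexes, in lexicographic order.
age_sal = sorted((age, sal) for _, age, sal in records)
sal_age = sorted((sal, age) for _, age, sal in records)


def via_age_sal(age, sal_low):
    """age = const AND sal > sal_low on <age,sal>: one contiguous slice."""
    lo = bisect_right(age_sal, (age, sal_low))   # first entry after (age, sal_low)
    hi = bisect_right(age_sal, (age, inf))       # end of this age's entries
    return age_sal[lo:hi], hi - lo


def via_sal_age(age, sal_low):
    """Same condition on <sal,age>: scan all entries with sal > sal_low, then filter."""
    lo = bisect_right(sal_age, (sal_low, inf))
    scanned = sal_age[lo:]
    return [(a, s) for s, a in scanned if a == age], len(scanned)


print(via_age_sal(12, 10))
print(via_sal_age(12, 10))
```

Both indexes return the same match, but <age,sal> examines only the entries that qualify, while <sal,age> has to walk every entry with a large enough sal and discard the wrong ages.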
Exercise to solve

• Emp (eid: int, salary: int, age: real, did: int)
• eid is the key, and there is a clustered index on eid and an unclustered index on age.
1. Give an example of a query that can be sped up because of the available indexes.
2. Give an example of a query that is neither sped up nor slowed down by the indexes.
3. Can there be an update that is slowed down because of the indexes?

Thank You
