1 Indexing Techniques


Data Warehousing

Need for Speed: Conventional Indexing Techniques

1
Need For Indexing: Speed
Consider searching your hard disk using the Windows SEARCH command.
• The search descends into the directory hierarchy.
• It takes about a minute, and there are only a few thousand files.

Assume a fast processor and (even more importantly) a fast hard disk.
• Assume an average file size of 5 KB.
• Assume a hard disk scan rate of a million files per second.
• This results in a scan rate of 5 GB per second.

The largest search engine indexes more than 8 billion pages.
• At the above scan rate, about 8,000 seconds are required to scan ALL pages (8 billion pages at a million pages per second).
• This is just for one user!
• No one is going to wait more than two hours, or even a few seconds.

Hence, a sequential scan is simply not feasible.
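A back-of-the-envelope check of the scan-time figures, using only the assumptions stated on this slide. Decimal units (1 KB = 1,000 bytes) and the constant names are assumptions made for the sketch. Dividing 8 billion pages by a million pages per second gives 8,000 seconds, which is a little over two hours.

```python
# Assumptions taken from the slide; decimal units are an assumption of this sketch.
FILE_SIZE_BYTES = 5_000          # 5 KB per file
FILES_PER_SECOND = 1_000_000     # scan rate in files per second
PAGES_INDEXED = 8_000_000_000    # pages indexed by the search engine


def scan_rate_bytes_per_second(file_size, files_per_second):
    """Bytes read per second when scanning whole files sequentially."""
    return file_size * files_per_second


def full_scan_seconds(items, items_per_second):
    """Seconds needed to touch every item once."""
    return items / items_per_second


rate = scan_rate_bytes_per_second(FILE_SIZE_BYTES, FILES_PER_SECOND)
seconds = full_scan_seconds(PAGES_INDEXED, FILES_PER_SECOND)

print(rate / 1e9, "GB per second")
print(seconds, "seconds, roughly", round(seconds / 60), "minutes")
```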


2
Need For Indexing: Query Complexity
• How many customers do I have in Karachi?
• How many customers in Karachi made calls during April?
• How many customers in Karachi made calls to Multan during April?
• How many customers in Karachi made calls to Multan during April using a particular calling package?

3
Need For Indexing: I/O Bottleneck
• Throwing hardware at the problem only speeds up the CPU-intensive tasks.
• The problem is I/O, which does not scale up easily.
• Putting the entire table in RAM is very, very expensive.
• Therefore, index!

4
Indexing Concept
• A purely physical concept; it has nothing to do with the logical model.

• Invisible to the end user (programmer): the optimizer chooses it, and it affects only the speed, not the answer.

• Using the library analogy: what is the time complexity of finding a book? The average time taken?

5
Indexing Concept
• Use a card catalog that is sorted and organized in several different ways, e.g. by author, topic, or title.

• It takes a little extra time to check the catalog first, but it "gives" a pointer to the shelf and the row where the book is located.

• The catalog holds no data about the book itself; it is just an efficient way of searching.

6
Indexing Goal

Look at as few blocks as possible to find the matching record(s).

7
Conventional Indexing Techniques
1. Dense
2. Sparse
3. Multi-level (or B-Tree)
4. Primary Index vs. Secondary Indexes

8
1. Dense Index: Concept
Every key in the data file is represented in the index file.

[Figure: Dense Index and Data File. The dense index holds one entry per key (10, 20, 30, ..., 120); each entry points to the record with that key in the data file (keys 10 to 100 shown).]

9
1. Dense Index: Adv & Dis Adv

• Advantage:
  • A dense index that fits in memory is very efficient at locating a record given a key.

• Disadvantage:
  • A dense index that is too big to fit in memory is expensive to use for finding a record given its key.
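A minimal sketch of a dense index, assuming a data file made of small blocks of (key, payload) records. The block layout, the payloads, and the names `data_file`, `dense_index`, and `dense_lookup` are hypothetical and chosen only for illustration. Because every key has an entry, a miss is detected from the index alone and a hit costs a single data-block read.

```python
import bisect

# Hypothetical data file: a list of blocks, each holding two (key, payload) records.
data_file = [
    [(10, "a"), (20, "b")],
    [(30, "c"), (40, "d")],
    [(50, "e"), (60, "f")],
]

# Dense index: one (key, block_no, slot) entry for EVERY record, sorted by key.
dense_index = sorted(
    (key, block_no, slot)
    for block_no, block in enumerate(data_file)
    for slot, (key, _payload) in enumerate(block)
)
index_keys = [entry[0] for entry in dense_index]


def dense_lookup(key):
    """Return the record for `key`, or None if the key is not indexed."""
    i = bisect.bisect_left(index_keys, key)
    if i == len(index_keys) or index_keys[i] != key:
        return None  # absence is decided without touching the data file
    _key, block_no, slot = dense_index[i]
    return data_file[block_no][slot]  # exactly one data block is read


print(dense_lookup(40))
print(dense_lookup(45))
```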

10
2. Sparse Index: Concept
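Normally keeps only one key per data block.

Some keys in the data file will not have an entry in the index file.

[Figure: Sparse Index and Data File. The sparse index holds the first key of each data block (10, 30, 50, ..., 230); each entry points to the block that starts with that key (blocks 10-20, 30-40, ..., 90-100 shown).]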

11
2. Sparse Index: Adv & Dis Adv

• Advantage:
  • A sparse index uses less space, at the expense of somewhat more time to find a record given its key.
  • Supports a multi-level indexing structure.

• Disadvantage:
  • Locating a record given a key performs differently for different key values.
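A minimal sketch of a sparse index lookup, assuming the data file is sorted by key and split into blocks of two records. The names `data_file`, `sparse_index`, and `sparse_lookup` are hypothetical. The index keeps only the first key of each block, so a lookup finds the last index entry not greater than the key and then scans that one block.

```python
import bisect

# Hypothetical data file, sorted by key, two (key, payload) records per block.
data_file = [
    [(10, "a"), (20, "b")],
    [(30, "c"), (40, "d")],
    [(50, "e"), (60, "f")],
]

# Sparse index: only the first key of each block; the position is the block number.
sparse_index = [block[0][0] for block in data_file]


def sparse_lookup(key):
    """Return the record for `key`, or None; the candidate block is scanned."""
    block_no = bisect.bisect_right(sparse_index, key) - 1
    if block_no < 0:
        return None  # key is smaller than every indexed key
    for record in data_file[block_no]:
        if record[0] == key:
            return record
    return None  # a miss is only known after reading the block


print(sparse_index)
print(sparse_lookup(40))
```

Unlike the dense case, a key that is absent still costs a data-block read, and a key late in its block takes longer to reach than the block's first key.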

12
2. Sparse Index: Multi level
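[Figure: Sparse 2nd level, sparse 1st level, and Data File. The second-level index (10, 90, 170, 250, ..., 570) points into the first-level sparse index (10, 30, 50, ..., 230), which in turn points to the blocks of the data file (blocks 10-20, 30-40, ..., 90-100 shown).]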
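A minimal sketch of a two-level sparse index, assuming two keys per data block and four entries per index block (so the second level starts 10, 90, ...). All names and block sizes here are assumptions for illustration. Each lookup reads one block per index level and then one data block.

```python
import bisect

ENTRIES_PER_INDEX_BLOCK = 4  # assumed index block capacity

# Hypothetical data file: blocks [10, 20], [30, 40], ..., [150, 160].
data_file = [[k, k + 10] for k in range(10, 170, 20)]

# First level: first key of every data block, split into index blocks.
first_level_keys = [block[0] for block in data_file]
first_level = [
    first_level_keys[i:i + ENTRIES_PER_INDEX_BLOCK]
    for i in range(0, len(first_level_keys), ENTRIES_PER_INDEX_BLOCK)
]

# Second level: first key of every first-level index block.
second_level = [index_block[0] for index_block in first_level]


def multilevel_lookup(key):
    """Return the data block number holding `key`, or None if it is absent."""
    top = bisect.bisect_right(second_level, key) - 1
    if top < 0:
        return None
    inner = bisect.bisect_right(first_level[top], key) - 1
    block_no = top * ENTRIES_PER_INDEX_BLOCK + inner
    return block_no if key in data_file[block_no] else None


print(second_level)
print(multilevel_lookup(120))
```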

13
3. B-tree Indexing: Concept
• Can be seen as a general form of multi-level index.

• Generalizes the usual binary search tree (BST).

• Allows efficient and fast exploration at the expense of slightly more space.

• Popular variant: B+-tree.

• Supports queries like the following more efficiently:
  • SELECT * FROM R WHERE a = 11
  • SELECT * FROM R WHERE 0 <= b AND b < 42
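A minimal sketch of how a B+-tree answers the two query shapes above. The tree is built by hand and is static (no insertion, splitting, or balancing), and the classes `Leaf` and `Internal` and the key values are hypothetical. A point query descends to one leaf; a range query descends once and then follows the leaf chain.

```python
import bisect
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Leaf:
    keys: List[int]
    next: Optional["Leaf"] = None  # link to the right sibling leaf


@dataclass
class Internal:
    keys: List[int]   # separator keys: smallest key of each child but the first
    children: list    # len(children) == len(keys) + 1


def find_leaf(node, key):
    """Descend from the root to the leaf that may contain `key`."""
    while isinstance(node, Internal):
        node = node.children[bisect.bisect_right(node.keys, key)]
    return node


def point_query(root, key):
    """WHERE a = key"""
    return key in find_leaf(root, key).keys


def range_query(root, low, high):
    """WHERE low <= b AND b < high, using the leaf chain."""
    leaf, result = find_leaf(root, low), []
    while leaf is not None:
        for k in leaf.keys:
            if k >= high:
                return result
            if k >= low:
                result.append(k)
        leaf = leaf.next
    return result


# Hand-built example tree: one root over four chained leaves.
leaves = [Leaf([1, 5]), Leaf([11, 17]), Leaf([23, 40]), Leaf([42, 50])]
for left, right in zip(leaves, leaves[1:]):
    left.next = right
root = Internal(keys=[11, 23, 42], children=leaves)

print(point_query(root, 11))
print(range_query(root, 0, 42))
```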

14
3. B-tree Indexing: Example

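Looking for Empno 250

[Figure: a B-tree on Empno with root key 200, internal nodes, and leaf nodes whose entries point to RID lists; the search for Empno 250 descends from the root to the leaf containing 250.]

Each node is stored in one disk block.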


15
3. B-tree Indexing: Limitations
• Of limited use if a table is large and has few unique values.

• Capitalization is not programmatically enforced (meaning case sensitivity does matter, and "FLASHMAN" is different from "Flashman").

• The outcome varies with inter-character spaces.

• A noun spelled differently gives different results.

• Insertion can be very expensive.


16
3. B-tree Indexing: Limitations Example
Given that MOHAMMED is the most common first name in Pakistan, a 5-million
row Customers table would produce many screens of matching rows for
MOHAMMED AHMAD, yet would skip potential matching values such as the
following:

VALUE MISSED          REASON MISSED
Mohammed Ahmad        Case sensitive
MOHAMMED AHMED        AHMED versus AHMAD
MOHAMMED  AHMAD       Extra space between names
MOHAMMED AHMAD DR     DR after AHMAD
MOHAMMAD AHMAD        Alternative spelling of MOHAMMAD
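A minimal sketch of why an exact-match index lookup skips these values. A plain dictionary stands in for the index here, which is an assumption made for brevity; any index that compares keys byte for byte behaves the same way. The row IDs and the name `exact_index` are hypothetical.

```python
# Hypothetical stored names; row IDs are their positions in the list.
rows = [
    "MOHAMMED AHMAD",
    "Mohammed Ahmad",      # different case
    "MOHAMMED AHMED",      # AHMED versus AHMAD
    "MOHAMMED  AHMAD",     # extra space between names
    "MOHAMMED AHMAD DR",   # DR after AHMAD
    "MOHAMMAD AHMAD",      # alternative spelling of MOHAMMAD
]

# Exact-match index: stored value -> list of row IDs.
exact_index = {}
for rid, name in enumerate(rows):
    exact_index.setdefault(name, []).append(rid)

matched = exact_index.get("MOHAMMED AHMAD", [])
missed = [name for rid, name in enumerate(rows) if rid not in matched]

print(matched)
for name in missed:
    print("missed:", repr(name))
```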

17
Hash Based Indexing
• You may recall that in internal memory, hashing can be used to quickly locate a specific key.

• The same technique can be used on external memory.

• However, the advantage over search trees is smaller for external search than for internal search. WHY?

• Because part of the search tree can be brought into main memory.

18
Hash Based Indexing: Concept
• In contrast to B-tree indexes, hash-based indexes do not (typically) keep index values in sorted order.

• An index entry is found by hashing on the index value, which requires an exact match.

SELECT * FROM Customers WHERE AccttNo = 110240
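A minimal sketch of a hash-based index answering the exact-match query above. The bucket count, the modulo hash function `h`, and the class `HashIndex` are assumptions for illustration; a real system would use a stronger hash and disk-resident buckets. Each bucket stores (index value, ROWID) pairs in arrival order, not sorted order.

```python
NUM_BUCKETS = 8  # assumed number of buckets


def h(key):
    """Stand-in hash function: maps an integer key to a bucket number."""
    return key % NUM_BUCKETS


class HashIndex:
    def __init__(self):
        self.buckets = [[] for _ in range(NUM_BUCKETS)]

    def insert(self, key, rowid):
        """Add a (key, rowid) entry to the bucket chosen by the hash."""
        self.buckets[h(key)].append((key, rowid))

    def lookup(self, key):
        """Return all ROWIDs for an exact key match; only one bucket is read."""
        return [rowid for k, rowid in self.buckets[h(key)] if k == key]


index = HashIndex()
index.insert(110240, 7)    # hypothetical account number -> ROWID
index.insert(110248, 12)   # lands in the same bucket as 110240
index.insert(550001, 3)

print(index.lookup(110240))
print(index.lookup(999999))
```

Because neighbouring keys are scattered across buckets, a range predicate cannot be served by reading one bucket; it would have to probe every possible key or scan all buckets.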

19
Hash Based Indexing: Concept
• Index entries are kept in hash-organized tables rather than B-tree structures.

• An index entry contains the ROWID values for each row corresponding to the index value.

• Remember that few numbers in real life are useful for hashing.

20
Hashing as Primary Index

[Figure: key → h(key) → disk block holding the records.]

Note on terminology:
The word "indexing" is often used synonymously with "B-tree indexing".

21
Hashing as Secondary Index

[Figure: key → h(key) → index entry → record.]

A hash index can always be transformed into a secondary index using indirection, as above.

Indexing the Index


22
B-tree vs. Hash Indexes

• Indexing (using B-trees) is good for range searches, e.g.:
  SELECT * FROM R WHERE A > 5

• Hashing is good for match-based searches, e.g.:
  SELECT * FROM R WHERE A = 5

23
Primary Key vs. Primary Index
Relation Students

Name    ID    dept
AHMAD   123   CS
Akram   567   EE
Numan   999   CS

Primary Key (PK) & Primary Index (PI):
• The PK is ALWAYS unique.
• The PI can be unique, but does not have to be.
• In a DSS environment, very few queries are PK-based.

24
4. Unique and Nonunique Primary Indexes
• You can define the primary index as unique (UPI) or nonunique (NUPI), depending on whether duplicate values are allowed in the indexed column set.

• UPIs provide optimal data distribution and are typically assigned to the primary key for a table.

25
4. Primary Indexing: Criterion
• Primary index selection criteria:
  • Common join and retrieval key.
  • Can be unique (UPI) or non-unique (NUPI).
  • Limits on NUPI.
  • Only one primary index per table (for a hash-based file system).

26
4. Primary Indexing Criteria: Example
Call Table
call_id         decimal (15,0)   NOT NULL
caller_no       decimal (10,0)   NOT NULL
call_duration   decimal (15,2)   NOT NULL
call_dt         date             NOT NULL
called_no       decimal (15,0)   NOT NULL

What should be the primary index of the call table for a large telecom company?

No simple answer!!

28
4. Primary Indexing
• Almost all joins and retrievals will occur through the caller_no foreign key.
  • Use caller_no as a NUPI.

• In case of a non-uniform distribution on caller_no, or if some phone numbers have a very large number of outgoing calls (e.g., an institutional number could easily have several thousand calls):
  • Use call_id as a UPI for good data distribution.
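A minimal sketch of the distribution trade-off, assuming rows are placed on one of four storage units by hashing the primary index value. The modulo hash, the unit count, and the generated call data (one institutional number making 700 of 1,000 calls) are assumptions for illustration only.

```python
from collections import Counter

UNITS = 4                        # assumed number of storage units
INSTITUTION = 3_000_000_000      # hypothetical number with very many outgoing calls

# Hypothetical call table: 700 calls from one institution, 300 from distinct callers.
rows = [
    {"call_id": call_id,
     "caller_no": INSTITUTION if call_id < 700 else INSTITUTION + call_id}
    for call_id in range(1000)
]


def distribute(rows, primary_index, units):
    """Count rows per storage unit when rows are placed by hashing `primary_index`."""
    load = Counter({unit: 0 for unit in range(units)})
    for row in rows:
        load[row[primary_index] % units] += 1  # modulo as a stand-in hash
    return load


by_caller = distribute(rows, "caller_no", UNITS)   # NUPI on caller_no
by_call_id = distribute(rows, "call_id", UNITS)    # UPI on call_id

print("caller_no:", dict(by_caller))
print("call_id:  ", dict(by_call_id))
```

With caller_no, every call from the institutional number hashes to the same unit, so one unit carries most of the table; with call_id, the rows spread evenly, at the cost of losing caller_no as the direct retrieval key.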

29
4. Primary Indexing
For a hash-based file system, the primary index is free!
• No storage cost.
• No index build required.

OLTP databases use a page-based file system and therefore do not deliver this performance advantage.

30
