
Chapter 1

An Overview of a Database
Management System
What is a DBMS?

◼ The power of databases comes from a body of

knowledge and technology that has developed over
several decades and is embodied in specialized
software called a database management system,
or DBMS.
◼ A DBMS is a powerful tool for creating and
managing large amounts of data efficiently and
allowing it to persist safely over long periods of time.

2
DBMS Capabilities
The capabilities that a DBMS provides the user are:
◼ Persistent Storage. A DBMS supports the storage of very large
amounts of data that exists independently of any processes that
are using the data.
◼ Programming Interface. A DBMS allows the user to access and
modify data through a powerful query language.
◼ Transaction management. A DBMS supports concurrent
access to data, i.e., simultaneous access by many distinct
processes (called transactions) at once. To avoid some of the
undesirable consequences of simultaneous access, the DBMS
supports:
❑ isolation

❑ atomicity

❑ resiliency

3
History of database systems and DBMS

◼ 1960s: Flat-file, Hierarchical, Network databases
◼ 1970s: Relational data model – RDBMS
◼ 1980s: Object-Oriented DBMS, Distributed DBMS
◼ 1990s: Object-relational model – ORDBMS; OLAP, data
mining, data warehouse, multimedia DB
◼ 2000s: XML DB, bioinformatics, data streams, sensor
networks, NoSQL

4
Component modules of a DBMS

5
The Database System Environment (1)

◼ DBMS component modules


❑ Buffer management
❑ Stored data manager
❑ DDL compiler
❑ Interactive query interface
• Query compiler
• Query optimizer
❑ Precompiler

6
The Database System Environment (2)

◼ DBMS component modules


❑ Runtime database processor
❑ System catalog
❑ Concurrency control system
❑ Backup and recovery system

7
Classification of DBMS
◼ DBMS classification based on:
❑ Data model:
◼ Hierarchical, network, relational, object, object-relational,
XML, document-based, graph-based, column-based, key-
value, …
❑ The number of users:
◼ Single-user systems vs. multiuser systems
❑ The number of sites
◼ Centralized vs. distributed
❑ Cost
❑ Purpose
◼ General purpose vs. special purpose
8
When should (not) we use the DBMS?
◼ Should
❑ Controlling Redundancy
❑ Restricting Unauthorized Access
❑ Providing Persistent Storage for Program Objects
❑ Providing Storage Structures and Search Techniques for Efficient Query
Processing
❑ Providing Backup and Recovery
❑ Providing Multiple User Interfaces
❑ Representing Complex Relationships among Data
❑ Enforcing Integrity Constraints
❑ Permitting Inferencing and Actions Using Rules and Triggers
❑ Additional Implications of Using the Database Approach
◼ Potential for Enforcing Standards
◼ Reduced Application Development Time
◼ Flexibility
◼ Availability of Up-to-Date Information
◼ Economies of Scale
12
When should (not) we use the DBMS?
◼ Should not
❑ Simple, well-defined database applications not expected to change at all
❑ Stringent, real-time requirements that may not be met because of DBMS
overhead
❑ Embedded systems with limited storage capacity
❑ No multiple-user access to data

13
14
Chapter 2

Disk Storage, Basic File Structures,


and Hashing.
Contents
1 Disk Storage Devices
2 Files of Records
3 Operations on Files
4 Unordered Files and Ordered Files
5 Hashed Files
6 RAID Technology

2
Contents
1 Disk Storage Devices
2 Files of Records
3 Operations on Files
4 Unordered Files and Ordered Files
5 Hashed Files
6 RAID Technology

3
Memory & Storage Hierarchy

primary storage
Secondary storage
4
Disk Storage Devices
◼ Preferred secondary storage device for high
storage capacity and low cost.
◼ Data stored as magnetized areas on
magnetic disk surfaces.
◼ A disk pack contains several magnetic disks
connected to a rotating spindle.
◼ Disks are divided into concentric circular
tracks on each disk surface.
❑ Track capacities typically vary from 10 to 150
Kbytes.
5
Disk Storage Devices (cont.)

6
Disk Storage Devices (cont.)

Track
Sector

Spindle

7
Disk Storage Devices (cont.)
◼ A track is divided into smaller blocks or
sectors
❑ because a track usually contains a large amount
of information.
◼ The division of a track into equal-sized blocks is set
by the operating system during formatting.
❑ The block size B is fixed for each system.
❑ Typical block sizes range from B=512 bytes to
B=8192 bytes.
◼ Whole blocks are transferred between disk and
main memory for processing.

8
Disk Storage Devices (cont.)
◼ A read-write head moves to the track that contains the
block to be transferred.
❑ Disk rotation moves the block under the read-write head for
reading or writing.
◼ A physical disk block (hardware) address consists of:
❑ a cylinder number (imaginary collection of tracks of same
radius from all recorded surfaces)
❑ the track number or surface number (within the cylinder)
❑ and block number (within track).
◼ Reading or writing a disk block is time consuming
because of the seek time s and rotational delay (latency)
rd.
◼ Double buffering can be used to speed up the transfer of
contiguous disk blocks.

9
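The cost of a block access can be made concrete with a small sketch. The seek time, rotation speed, block size, and transfer rate below are illustrative assumptions, not values from the slides:

# Python sketch: approximate time to read one disk block (illustrative parameter values)
def block_access_time_ms(seek_ms=12.0, rpm=7200, block_bytes=4096, transfer_mb_per_s=100.0):
    rotational_delay_ms = 0.5 * (60_000.0 / rpm)               # on average, half a revolution
    transfer_ms = block_bytes / (transfer_mb_per_s * 1e6) * 1000.0
    return seek_ms + rotational_delay_ms + transfer_ms         # s + rd + block transfer time

print(round(block_access_time_ms(), 2))                        # ~16.2 ms, dominated by s and rd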
10
Disk Storage Devices (cont.)

11
Double Buffering

12
Contents
1 Disk Storage Devices
2 Files of Records
3 Operations on Files
4 Unordered Files and Ordered Files
5 Hashed Files
6 RAID Technology

13
Records
◼ Fixed and variable length records.
◼ Records contain fields which have values of a
particular type.
❑ E.g., amount, date, time, age.
◼ Fields themselves may be fixed length or
variable length.
◼ Variable length fields can be mixed into one
record:
❑ Separator characters or length fields are needed
so that the record can be “parsed”.
14
Records (cont.)

(a) A fixed-length record with 6 fields and size of 71 bytes.

(b) A record with 2 variable-length fields and 3 fixed-length fields.
(c) A variable-field record with 3 types of separator characters.

15
Blocking
◼ Blocking: refers to storing a number of
records in one block on the disk.
◼ Blocking factor (bfr): refers to the number
of records per block.
◼ There may be empty space in a block if an
integral number of records do not fit in one
block.
◼ Spanned Records: refer to records that
exceed the size of one or more blocks and
hence span a number of blocks.

16
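A minimal sketch of the blocking arithmetic above, assuming unspanned records and example sizes (bfr = ⌊B/R⌋, b = ⌈r/bfr⌉):

import math

B, R, r = 512, 150, 30000            # block size, record size, number of records (example values)
bfr = B // R                         # blocking factor: records per block
b = math.ceil(r / bfr)               # number of blocks needed for the file
unused = B - bfr * R                 # empty space per block with unspanned records
print(bfr, b, unused)                # 3 records/block, 10000 blocks, 62 unused bytes/block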
Blocking (cont.)

17
Files of Records
◼ A file is a sequence of records, where each record is
a collection of data values (or data items).
◼ A file descriptor (or file header) includes information
that describes the file, such as the field names and
their data types, and the addresses of the file blocks
on disk.
◼ Records are stored on disk blocks.
◼ The blocking factor bfr for a file is the (average)
number of file records stored in a disk block.
◼ A file can have fixed-length records or variable-
length records.

18
Files of Records (cont.)
◼ File records can be unspanned or spanned:
❑ Unspanned: no record can span two blocks
❑ Spanned: a record can be stored in more than one block
◼ The physical disk blocks that are allocated to hold the
records of a file can be contiguous, linked, or indexed.
◼ In a file of fixed-length records, all records have the
same format. Usually, unspanned blocking is used with
such files.
◼ Files of variable-length records require additional
information to be stored in each record, such as
separator characters and field types.
❑ Usually spanned blocking is used with such files.

19
Contents
1 Disk Storage Devices
2 Files of Records
3 Operations on Files
4 Unordered Files and Ordered Files
5 Hashed Files
6 RAID Technology

20
Operation on Files
Typical file operations include:
◼ OPEN: Prepares the file for access, and associates a
pointer that will refer to a current file record at each point
in time.
◼ FIND: Searches for the first file record that satisfies
a certain condition, and makes it the current file record.
◼ FINDNEXT: Searches for the next file record (from
the current record) that satisfies a certain condition, and
makes it the current file record.
◼ READ: Reads the current file record into a program
variable.
◼ INSERT: Inserts a new record into the file, and makes
it the current file record.

21
Operation on Files (cont.)
◼ DELETE: Removes the current file record from the
file, usually by marking the record to indicate that it
is no longer valid.
◼ MODIFY: Changes the values of some fields of the
current file record.
◼ CLOSE: Terminates access to the file.
◼ REORGANIZE: Reorganizes the file records. For
example, the records marked deleted are physically
removed from the file or a new organization of the
file records is created.
◼ READ_ORDERED: Read the file blocks in order of
a specific field of the file.

22
Contents
1 Disk Storage Devices
2 Files of Records
3 Operations on Files
4 Unordered Files and Ordered Files
5 Hashed Files
6 RAID Technology

23
Unordered Files
◼ Also called a heap or a pile file.
◼ New records are inserted at the end of the file.
◼ A linear search through the file records is
necessary to search for a record.
❑ This requires reading and searching half the file
blocks on the average, and is hence quite expensive.
◼ Record insertion is quite efficient.
◼ Reading the records in order of a particular field
requires sorting the file records.

24
Ordered Files
◼ Also called a sequential file.
◼ File records are kept sorted by the values of an ordering
field.
◼ Insertion is expensive: records must be inserted in the
correct order.
❑ It is common to keep a separate unordered overflow (or
transaction) file for new records to improve insertion efficiency;
this is periodically merged with the main ordered file.
◼ A binary search can be used to search for a record on
its ordering field value.
❑ This requires reading and searching about log2(b) of the file blocks on the
average, an improvement over linear search.
◼ Reading the records in order of the ordering field is quite
efficient.

25
Ordered
Files (cont.)

26
Binary search
2 5 6 9 12 18 22 27 33

low mid high

◼ Search 27:
❑ Step 1:
2 5 6 9 12 18 22 27 33

low mid high


❑ Step 2:
2 5 6 9 12 18 22 27 33

low, high
mid
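A minimal sketch of the binary search illustrated above, over an in-memory sorted list that stands in for the ordered file blocks (in a real file, each probe would be one block access):

def binary_search(sorted_values, key):
    low, high = 0, len(sorted_values) - 1
    while low <= high:
        mid = (low + high) // 2
        if sorted_values[mid] == key:
            return mid                       # found
        elif sorted_values[mid] < key:
            low = mid + 1                    # continue in the right half
        else:
            high = mid - 1                   # continue in the left half
    return -1                                # not found

print(binary_search([2, 5, 6, 9, 12, 18, 22, 27, 33], 27))   # 7 (position of 27)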
Average Access Times

◼ The following table shows the average access time


to access a specific record for a given type of file:

28
Contents
1 Disk Storage Devices
2 Files of Records
3 Operations on Files
4 Unordered Files and Ordered Files
5 Hashed Files
6 RAID Technology

29
Hashed Files
◼ Hashing for disk files is called External Hashing.
◼ The file blocks are divided into M equal-sized buckets,
numbered bucket0, bucket1, ..., bucketM-1.
❑ Typically, a bucket corresponds to one (or a fixed number of) disk
block.
◼ One of the file fields is designated to be the hash key of
the file.
◼ The record with hash key value K is stored in bucket i,
where i=h(K), and h is the hashing function.
◼ Search is very efficient on the hash key.
◼ Collisions occur when a new record hashes to a bucket
that is already full.
❑ An overflow file is kept for storing such records.
❑ Overflow records that hash to each bucket can be linked together.
30
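A minimal sketch of static external hashing with h(K) = K mod M, fixed-capacity buckets, and a per-bucket overflow list (the bucket count, capacity, and keys are illustrative):

M, CAPACITY = 4, 2
buckets = [[] for _ in range(M)]
overflow = [[] for _ in range(M)]            # overflow records, chained per bucket

def insert(key):
    i = key % M                              # i = h(K)
    if len(buckets[i]) < CAPACITY:
        buckets[i].append(key)
    else:                                    # collision: bucket i is already full
        overflow[i].append(key)

for k in (15, 22, 2, 27, 14, 6):
    insert(k)
print(buckets, overflow)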
Hashed Files (cont.)

31
Hashed Files (cont.)
There are numerous methods for collision resolution:

◼ Open addressing: Proceeding from the occupied position
specified by the hash address, the program checks the
subsequent positions in order until an unused (empty)
position is found (a sketch follows below).
❑ h(K) = K mod 7; positions 0–6 initially hold:
[ _, 1, _, 3, 11, _, 6 ]
❑ Insert 8: h(8) = 1 is occupied, so 8 goes to position 2:
[ _, 1, 8, 3, 11, _, 6 ]
❑ Insert 15: h(15) = 1; positions 1–4 are occupied, so 15 goes to position 5:
[ _, 1, 8, 3, 11, 15, 6 ]
❑ Insert 13: h(13) = 6 is occupied; probing wraps around to position 0:
[ 13, 1, 8, 3, 11, 15, 6 ]
32
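A minimal sketch of the open-addressing insertion above, with h(K) = K mod 7 and the same key sequence (the table must not become completely full, otherwise the probe loop would not terminate):

TABLE_SIZE = 7
table = [None] * TABLE_SIZE

def insert_open_addressing(key):
    pos = key % TABLE_SIZE                    # h(K) = K mod 7
    while table[pos] is not None:             # check subsequent positions, wrapping around
        pos = (pos + 1) % TABLE_SIZE
    table[pos] = key

for k in (1, 3, 11, 6, 8, 15, 13):
    insert_open_addressing(k)
print(table)                                  # [13, 1, 8, 3, 11, 15, 6], as in the example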
Hashed Files (cont.)
◼ There are numerous methods for collision resolution,
including the following:
❑ Chaining:
◼ Various overflow locations are kept: extending the array with a
number of overflow positions.
◼ A pointer field is added to each record location.
◼ A collision is resolved by placing the new record in an unused
overflow location and setting the pointer of the occupied hash
address location to the address of that overflow location.

❑ Multiple hashing:
◼ The program applies a second hash function if the first results in
a collision.
◼ If another collision results, the program uses open addressing or
applies a third hash function and then uses open addressing if
necessary.

33
Hashed Files (cont.) - Overflow handling

34
Hashed Files (cont.)
◼ To reduce overflow records, a hash file is typically
kept 70-80% full.
◼ The hash function h should distribute the records
uniformly among the buckets;
❑ Otherwise, search time will be increased because many
overflow records will exist.
◼ Main disadvantages of static external hashing:
❑ Fixed number of buckets M is a problem if the number of
records in the file grows or shrinks.
❑ Ordered access on the hash key is quite inefficient
(requires sorting the records).

35
Dynamic And Extendible Hashed Files
◼ Dynamic and Extendible Hashing Techniques
❑ Hashing techniques are adapted to allow the dynamic
growth and shrinking of the number of file records.
❑ These techniques include the following: dynamic
hashing, extendible hashing, and linear hashing.
◼ Both dynamic and extendible hashing use the
binary representation of the hash value h(K) in
order to access a directory.
❑ In dynamic hashing the directory is a binary tree.
❑ In extendible hashing the directory is an array of size
2d where d is called the global depth.

36
Dynamic And Extendible Hashing (cont.)
◼ The directories can be stored on disk, and they
expand or shrink dynamically.
❑ Directory entries point to the disk blocks that contain
the stored records.
◼ An insertion in a disk block that is full causes the
block to split into two blocks and the records are
redistributed among the two blocks.
❑ The directory is updated appropriately.
◼ Dynamic and extendible hashing do not require
an overflow area.
◼ Linear hashing does require an overflow area
but does not use a directory.
❑ Blocks are split in linear order as the file expands.

37
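A minimal sketch of the directory lookup used by extendible hashing: the bucket is reached through the directory entry addressed by the leading d bits of the binary hash value (5-bit hash values, as in the example that follows; the split and directory-doubling logic is omitted):

def hash_bits(key, width=5):
    return format(key % (1 << width), "0{}b".format(width))   # h(K) = K mod 32, as a bit string

def directory_entry(key, global_depth):
    bits = hash_bits(key)
    return int(bits[:global_depth], 2) if global_depth > 0 else 0

print(hash_bits(5659), directory_entry(5659, 2))   # '11011' -> directory entry 3 (leading bits '11')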
Extendible
Hashing

38
Extendible Hashing – Example
Record   K      h(K) = K mod 32   h(K) in binary
r1       2657    1                00001
r2       3760   16                10000
r3       4692   20                10100
r4       4871    7                00111
r5       5659   27                11011
r6       1821   29                11101
r7       1074   18                10010
r8       2123   11                01011
r9       1620   20                10100
r10      2428   28                11100
r11      3943    7                00111
r12      4750   14                01110
r13      6975   31                11111
r14      4981   21                10101
r15      9208   24                11000

d' = local depth, d = global depth. Each bucket holds a maximum of 2 records.
Initial state (d = 0): a single bucket with d' = 0 containing r1 (00001) and r2 (10000).

39
Extendible Hashing – Example(cont.)
Directory
r1 (00001) d’ =1

r1 (00001) d’ =0
r2 (10000) 0 Insert r4 (00111)
1
d=0
d=1
Insert r3 (10100) =>
overflow=> splitting
r2 (10000) d’ =1
r3 (10100)

40
Extendible Hashing – Example(cont.)

r1 (00001) d’ =1
r4 (00111)

0
1

d=1
r2 (10000) d’ =1
r3 (10100)

Insert r5 (11011) =>


overflow=> splitting

41
Extendible Hashing – Example(cont.)

r1 (00001) d’ =1
r4 (00111)

00
01
10
11 r2 (10000) d’ =2
r3 (10100)
d=2
r5 (11011) d’ =2

Insert r6 (11101)

42
Extendible Hashing – Example(cont.)

r1 (00001) d’ =1
r4 (00111)

00
01
10
11 r2 (10000) d’ =2
r3 (10100)
d=2
Insert r7 (10010) =>
r5 (11011) d’ =2
overflow=> splitting
r6 (11101)

43
Extendible Hashing – Example(cont.)
Insert r8 (01011) =>
000 r1 (00001) d’ =1 overflow=> splitting
r4 (00111)
001
010
r2 (10000) d’ =3
011
r7 (10010)
100
101 r3 (10100) d’ =3

110
111 r5 (11011) d’ =2

d=3 r6 (11101)

44
Extendible Hashing – Example(cont.)
r1 (00001) d’ =2
r4 (00111)
000 r8 (01011) d’ =2
001
010
r2 (10000) d’ =3
011
r7 (10010)
100 Insert r9 (10100)
101 r3 (10100) d’ =3

110
111 r5 (11011) d’ =2

d=3 r6 (11101)

45
Extendible Hashing – Example(cont.)
r1 (00001) d’ =2
r4 (00111)
000 r8 (01011) d’ =2
001
010
r2 (10000) d’ =3
011
r7 (10010)
100
101 r3 (10100) d’ =3

110 r9 (10100)
111 Insert r10 (11100) =>
r5 (11011) d’ =2 overflow=> splitting
d=3 r6 (11101)

46
Extendible Hashing – Example(cont.)
r1 (00001) d’ =2
r4 (00111) Insert r11 (00111) =>
000 r8 (01011) d’ =2 overflow=> splitting
001
010
r2 (10000) d’ =3
011
r7 (10010)
100
101 r3 (10100) d’ =3

110 r9 (10100)
111 r5 (11011) d’ =3

d=3
r6 (11101) d’ =3
r10 (11100)
47
Extendible Hashing – Example(cont.)
r1 (00001) d’ =3
r4 (00111) d’ =3
000 r11 (00111)
001 r8 (01011) d’ =2
Insert r12 (01110)
010
011
r2 (10000) d’ =3
100
r3 (10100) d’ =3 r7 (10010)
101
110 r9 (10100)
111 r5 (11011) d’ =3

d=3
r6 (11101) d’ =3
r10 (11100)
48
Extendible Hashing – Example(cont.)
r1 (00001) d’ =3
r4 (00111) d’ =3
000 r11 (00111)
001 r8 (01011) d’ =2
010 r12 (01110)
011
r2 (10000) d’ =3
100
r3 (10100) d’ =3 r7 (10010)
101
110 r9 (10100)
111 r5 (11011) d’ =3

d=3 Insert r13 (11111) =>


r6 (11101) d’ =3 overflow=> splitting
r10 (11100)
49
Extendible Hashing – Example(cont.)
r1 (00001) d’ =3
0000
0001 r4 (00111) d’ =3
0010 r11 (00111)
0011
0100 r8 (01011) d’ =2
0101
0110 r12 (01110) r2 (10000) d’ =3
0111 r7 (10010)
1000
1001 Insert r14 (10101) =>
r3 (10100) d’ =3
1010 overflow=> splitting
1011 r9 (10100)
1100 r5 (11011) d’ =3
1101 r6 (11101) d’ =4
1110
r10 (11100)
1111
d=4 r1 (11111) d’ =4

50
Extendible Hashing – Example(cont.)
r1 (00001) d’ =3
0000
0001 r4 (00111) d’ =3
0010 r11 (00111)
0011
0100 r8 (01011) d’ =2
0101
0110 r12 (01110) r2 (10000) d’ =3
0111 r7 (10010)
1000 r3 (10100) d’ =4
1001
1010 r9 (10100)
1011
1100 r14 (10101) d’ =4 r5 (11011) d’ =3
1101
Insert r15 (11000)
1110
1111 r6 (11101) d’ =4
d=4 r10 (11100) r1 (11111) d’ =4

51
Extendible Hashing – Example(cont.)
r1 (00001) d’ =3
0000
0001 r4 (00111) d’ =3
0010 r11 (00111)
0011
0100 r8 (01011) d’ =2
0101
0110 r12 (01110) r2 (10000) d’ =3
0111 r7 (10010)
1000 r3 (10100) d’ =4
1001
1010 r9 (10100)
1011
1100 r14 (10101) d’ =4 r5 (11011) d’ =3
1101
r15 (11000)
1110
1111 r6 (11101) d’ =4
d=4 r10 (11100) r1 (11111) d’ =4

52
Linear Hashing – Example
◼ M = 4, h0(K) = K mod M, each bucket holds at most 3
records.
◼ Initialization:
Split pointer
0 1 2 3
4:8: 5 : 9 : 13 6: : 7 : 11 :

Insert 17 Bucket 1: overflow


(17 mod 4 = 1)
Split bucket 0
h1(K) = K mod 2*M

53
Linear Hashing – Example(cont.)

Split pointer
0 1 2 3 4
8: : 5 : 9 : 13 6: : 7 : 11 : 4: :

4: bucket (4 mod 2*4 =) 4


8: bucket (8 mod 2*4 = ) 0
17 : :
17: overflow records

54
Linear Hashing – Example(cont.)
insert 15
(15 mod 4 = 3)
Split pointer
0 1 2 3 4
8: : 5 : 9 : 13 6: : 7 : 11 : 4: :

17 : :

55
Linear Hashing – Example(cont.)
insert 3
(3 mod 4 = 3)
Split pointer
0 1 2 3 4
8: : 5 : 9 : 13 6: : 7 : 11 : 15 4: :

Bucket 3: overflow
Split bucket 1.
17 : :
=> Overflow records: Redistributed

56
Linear Hashing – Example(cont.)
5: bucket (5 mod 2*4 =) 5
9: bucket (9 mod 2*4 = ) 1
13: bucket (13 mod 2*4 = ) 5
17: bucket (17 mod 2*4 = ) 1
0 1 2 3 4 5
8: : 9 : 17 : 6: : 7 : 11 : 15 4: : 5 : 13 :

Split pointer

3: :

57
Linear Hashing – Example(cont.)
Bucket 3: overflow.
Insert 23
Split bucket 2. (23 mod 4 = 3)

0 1 2 3 4 5
8: : 9 : 17 : 6: : 7 : 11 : 15 4: : 5 : 13 :

Split pointer

3: :

58
Linear Hashing – Example(cont.)
Bucket 3: overflow
Split bucket 3
=> Overflow records: Redistributed
Split pointer

0 1 2 3 4 5 6
8: : 9 : 17 : : : 7 : 11 : 15 4: : 5 : 13 : 6: :

insert 31
(31 mod 4 = 3)
3 : 23 :

59
Linear Hashing – Example(cont.)
7: bucket (7 mod 2*4 =) 7
11: bucket (11 mod 2*4 = ) 3
15: bucket (15 mod 2*4 = ) 3 h1(K) = K mod 8
3: bucket (3 mod 2*4 = ) 3
23: bucket (23 mod 2*4 = ) 7
31: bucket (31 mod 2*4 = ) = 7

0 1 2 3 4 5 6 7
8: : 9 : 17 : : : 11 : 15 : 3 4: : 5 : 13 : 6: : 7 : 23 : 31

Split pointer

60
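A minimal sketch of the address computation used in the linear hashing example above (M = 4, h0(K) = K mod M, h1(K) = K mod 2M, and a split pointer; the splitting and redistribution logic itself is omitted):

M = 4

def linear_hash_bucket(key, split_pointer):
    bucket = key % M                  # h0(K) = K mod M
    if bucket < split_pointer:        # this bucket was already split in the current round
        bucket = key % (2 * M)        # so use h1(K) = K mod 2M instead
    return bucket

print(linear_hash_bucket(8, 1))       # 0, since h0(8) = 0 < 1 and h1(8) = 0
print(linear_hash_bucket(17, 1))      # 1, since h0(17) = 1 is not below the split pointer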
Contents
1 Disk Storage Devices
2 Files of Records
3 Operations on Files
4 Unordered Files and Ordered Files
5 Hashed Files
6 RAID Technology

61
Parallelizing Disk Access using RAID
Technology.
◼ Secondary storage technology must take steps to
keep up in performance and reliability with
processor technology.
◼ A major advance in secondary storage technology is
represented by the development of RAID, which
originally stood for Redundant Arrays of
Inexpensive Disks.
◼ The main goal of RAID is to even out the widely
different rates of performance improvement of disks
against those in memory and microprocessors.

62
RAID Technology (cont.)
◼ A natural solution is a large array of small independent
disks acting as a single higher-performance logical disk.
A concept called data striping is used, which utilizes
parallelism to improve disk performance.
◼ Data striping distributes data transparently over multiple
disks to make them appear as a single large, fast disk.

63
RAID Technology (cont.)
◼ Different raid organizations were defined based on
different combinations of the two factors of granularity of
data interleaving (striping) and pattern used to compute
redundant information.
❑ Raid level 0 has no redundant data and hence has the best write
performance.
❑ Raid level 1 uses mirrored disks.
❑ Raid level 2 uses memory-style redundancy by using Hamming
codes, which contain parity bits for distinct overlapping subsets of
components. Level 2 includes both error detection and correction.

64
RAID Technology (cont.)
❑ Raid level 3 uses a single parity disk relying on the disk controller to
figure out which disk has failed.
❑ Raid levels 4 and 5 use block-level data striping, with level 5 distributing
data and parity information across all disks.

65
RAID Technology (cont.)
❑ Raid level 6 applies the so-called P + Q redundancy scheme using
Reed-Solomon codes to protect against up to two disk failures by using
just two redundant disks.

66
Use of RAID Technology (cont.)
◼ Different raid organizations are being used under
different situations:
❑ Raid level 1 (mirrored disks) is the easiest for rebuild of a disk
from other disks
◼ It is used for critical applications like logs.
❑ Raid level 2 uses memory-style redundancy by using Hamming
codes, which contain parity bits for distinct overlapping subsets
of components. Level 2 includes both error detection and
correction.
❑ Raid level 3 (single parity disks relying on the disk controller to
figure out which disk has failed) and level 5 (block-level data
striping) are preferred for large volume storage, with level 3
giving higher transfer rates.
❑ Most popular uses of the RAID technology currently are: Level 0
(with striping), Level 1 (with mirroring) and Level 5 with an extra
drive for parity.
❑ Design decisions for RAID include – level of RAID, number of
disks, choice of parity schemes, and grouping of disks for block-
level striping. 67
Storage Area Networks
◼ The demand for higher storage has risen considerably
in recent times.
◼ Organizations have a need to move from a static fixed
data center oriented operation to a more flexible and
dynamic infrastructure for information processing.
◼ Thus they are moving to a concept of Storage Area
Networks (SANs).
❑ In a SAN, online storage peripherals are configured as
nodes on a high-speed network and can be attached and
detached from servers in a very flexible manner.
◼ This allows storage systems to be placed at longer
distances from the servers and provide different
performance and connectivity options.

68
Storage Area Networks (cont.)
◼ Advantages of SANs are:
❑ Flexible many-to-many connectivity among servers and
storage devices using fiber channel hubs and switches.
❑ Up to 10km separation between a server and a storage
system using appropriate fiber optic cables.
❑ Better isolation capabilities allowing nondisruptive addition
of new peripherals and servers.
◼ SANs face the problem of:
❑ combining storage options from multiple vendors
❑ dealing with evolving standards of storage management
software and hardware.

69
Review questions
1) What is the difference between a file organization and an
access method?
2) What is the difference between static and dynamic files?
3) What are the typical record-at-a-time operations for accessing
a file? Which of these depend on the current record of a file?
4) Discuss the advantages and disadvantages of (a) unordered
file, (b) ordered file, and (c) static hash file with buckets and
chaining. Which operations can be performed efficiently on
each of these organizations, and which operations are
expensive?
5) Discuss the techniques for allowing a hash file to expand and
shrink dynamically. What are the advantages and
disadvantages of each?

70
71
Chapter 3

Indexing Structures for Files


Contents

1 Single-level Ordered Indexes


2 Multilevel Indexes
Dynamic Multilevel Indexes Using B-Trees and
3
B+-Trees
4 Indexes on Multiple Keys
5 Other File Indexes
6 Indexes in Today‘s DBMSs

2
Contents

1 Single-level Ordered Indexes


2 Multilevel Indexes
Dynamic Multilevel Indexes Using B-Trees and
3
B+-Trees
4 Indexes on Multiple Keys
5 Other File Indexes
6 Indexes in Today‘s DBMSs

3
Single-level index introduction
◼ A single-level index is an auxiliary file that
makes it more efficient to search for a record in
the data file.
◼ The index is usually specified on one field of the
file (although it could be specified on several
fields).
◼ One form of an index is a file of entries <field
value, pointer to record>, which is ordered by
field value.
◼ The index is called an access path on the field.

4
Single-level index introduction (cont.)
◼ The index file usually occupies considerably less
disk blocks than the data file because its entries
are much smaller.
◼ A binary search on the index yields a pointer to
the file record.
◼ Indexes can also be characterized as dense or
sparse:
❑ A dense index has an index entry for every search key
value (and hence every record) in the data file.
❑ A sparse (or nondense) index, on the other hand, has
index entries for only some of the search values

5
Example 1
Given the following data file:
EMPLOYEE(NAME, SSN, ADDRESS, JOB, SAL, ... )
Suppose that:
◼ Record size R = 150 bytes, block size B = 512 bytes, r = 30,000 records
◼ SSN field size VSSN = 9 bytes, record pointer size PR = 7 bytes
Then, we get:
◼ Blocking factor: bfr = ⌊B/R⌋ = ⌊512/150⌋ = 3 records/block
◼ Number of blocks needed for the file: b = ⌈r/bfr⌉ = ⌈30,000/3⌉ = 10,000 blocks

For a dense index on the SSN field:

◼ Index entry size: Ri = (VSSN + PR) = (9 + 7) = 16 bytes
◼ Index blocking factor: bfri = ⌊B/Ri⌋ = ⌊512/16⌋ = 32 entries/block
◼ Number of blocks for the index file: bi = ⌈r/bfri⌉ = ⌈30,000/32⌉ = 938 blocks
◼ Searching for and retrieving a record needs: ⌈log2 bi⌉ + 1 = ⌈log2 938⌉ + 1 = 11
block accesses (a sketch of this arithmetic follows below)

◼ This is compared to an average linear search cost of:
(b/2) = 10,000/2 = 5,000 block accesses
◼ If the file records are ordered, the binary search cost would be:
⌈log2 b⌉ = ⌈log2 10,000⌉ = 14 block accesses
6
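A minimal sketch reproducing the arithmetic of Example 1 (same assumed sizes):

import math

B, R, r = 512, 150, 30_000                         # block size, record size, number of records
V_ssn, P_rec = 9, 7                                # SSN field size, record pointer size

bfr = B // R                                       # 3 records per block
b = math.ceil(r / bfr)                             # 10,000 data blocks
Ri = V_ssn + P_rec                                 # 16-byte index entry
bfri = B // Ri                                     # 32 index entries per block
bi = math.ceil(r / bfri)                           # 938 index blocks (dense index)

dense_index_cost = math.ceil(math.log2(bi)) + 1    # 11 block accesses
linear_search_cost = b // 2                        # 5,000 block accesses
binary_search_cost = math.ceil(math.log2(b))       # 14 block accesses (ordered file)
print(dense_index_cost, linear_search_cost, binary_search_cost)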
Types of Single-level Ordered Indexes

◼ Primary Indexes

◼ Clustering Indexes

◼ Secondary Indexes

7
Primary Index

◼ Defined on an ordered data file.


❑ The data file is ordered on a key field.

◼ One index entry for each block in the data file


❑ First record in the block, which is called the block anchor

◼ A similar scheme can use the last record in a block.

8
Primary key field Data file

ID Name DoB Salary Sex


1
2
Index file 3
(<K(i), P(i)> entries)
4
Primary Block
key value pointer 6

1 7

4 8
8 9
12 10

12
13
15

9
Primary Index

◼ Number of index entries?


❑ Number of blocks in data file.

◼ Dense or Nondense?
❑ Nondense

◼ Search/ Insert/ Update/ Delete?

10
Clustering Index

◼ Defined on an ordered data file.


❑ The data file is ordered on a non-key field.

◼ One index entry for each distinct value of the field.


❑ The index entry points to the first data block that
contains records with that field value

11
Clustering field Data file

Dept_No Name DoB Salary Sex


1
1
Index file 2
(<K(i), P(i)> entries)
2
Clustering Block
field value pointer 2
1 2
2
2
3
3
4
3
5
4
4
5

12
Dept_No Name DoB Salary Sex
Clustering field
1
1

2
2
Index file
2
(<K(i), P(i)> entries)
2
Clustering Block 2
field value pointer
1
3
2
3
3
4
4
5
4

Data file 13
Clustering Index

◼ Number of index entries?


❑ Number of distinct indexing field values in data file .

◼ Dense or Nondense?
❑ Nondense

◼ Search/ Insert/ Update/ Delete?


◼ At most one primary index or one clustering
index but not both.

14
Secondary index
◼ A secondary index provides a secondary means of
accessing a file.
❑ The data file is unordered on indexing field.
◼ Indexing field:
❑ secondary key (unique value)
❑ nonkey (duplicate values)

◼ The index is an ordered file with two fields:


❑ The first field: indexing field.
❑ The second field: block pointer or record pointer.

◼ There can be many secondary indexes for the same file.

15
Index file Secondary
(<K(i), P(i)> entries) key field Data file

Index field Block 5


value pointer
13
3
8
4
5 6
6 15
8 3
9
9
11
21
13 … 11
15
18 4
21 23
23 18

Secondary index on key field


16
Secondary index on key field

◼ Number of index entries?


❑ Number of record in data file

◼ Dense or Nondense?
❑ Dense

◼ Search/ Insert/ Update/ Delete?

17
Secondary index on non-key field
◼ Discussion: Structure of Secondary index on non-
key field?
◼ Option 1: include duplicate index entries with the
same K(i) value - one for each record.
◼ Option 2: keep a list of pointers <P(i, 1), ..., P(i, k)>
in the index entry for K(i).
◼ Option 3:
❑ more commonly used.
❑ one entry for each distinct index field value + an extra
level of indirection to handle the multiple pointers.

18
Blocks of record pointers Indexing field Data file

Dept Name DoB Job Sex


_No


3
Index file 5
(<K(i), P(i)> entries) 1


Field Block
2
value pointer
3
4

1
2 3
3 3

4 1

5 5
1

Secondary Index on non-key field: option 3


Secondary index on nonkey field

◼ Number of index entries?


❑ Number of records in data file
❑ Number of distinct index field values

◼ Dense or Nondense?
❑ Dense/ nondense

◼ Search/ Insert/ Update/ Delete?

20
Summary of Single-level indexes

◼ Ordered file on indexing field?


❑ Primary index
❑ Clustering index
◼ Indexing field is Key?
❑ Primary index
❑ Secondary index
◼ Indexing field is not Key?
❑ Clustering index
❑ Secondary index
21
Summary of Single-level indexes

◼ Dense index?
❑ Secondary index

◼ Nondense index?
❑ Primary index
❑ Clustering index
❑ Secondary index

22
Summary of Single-level indexes

23
Example 2
Given the following data file:
EMPLOYEE(NAME, SSN, ADDRESS, JOB, SAL, ... )
Suppose that:
◼ Record size R = 150 bytes, block size B = 512 bytes, r = 30,000 records
◼ SSN field size VSSN = 9 bytes, block pointer size P = 6 bytes
Then, we get:
◼ Blocking factor: bfr = ⌊B/R⌋ = ⌊512/150⌋ = 3 records/block
◼ Number of blocks needed for the file: b = ⌈r/bfr⌉ = ⌈30,000/3⌉ = 10,000 blocks

For a primary index on the ordering key field SSN:

◼ Index entry size: Ri = (VSSN + P) = (9 + 6) = 15 bytes
◼ Index blocking factor: bfri = ⌊B/Ri⌋ = ⌊512/15⌋ = 34 entries/block
◼ Number of blocks for the index file: bi = ⌈b/bfri⌉ = ⌈10,000/34⌉ = 295 blocks
◼ Searching for and retrieving a record needs: ⌈log2 bi⌉ + 1 = ⌈log2 295⌉ + 1 = 10
block accesses

◼ This is compared to a dense index cost of 11 block accesses (Example 1).
24
Contents

1 Single-level Ordered Indexes


2 Multilevel Indexes
Dynamic Multilevel Indexes Using B-Trees and
3
B+-Trees
4 Indexes on Multiple Keys
5 Other File Indexes
6 Indexes in Today‘s DBMSs

25
Multi-Level Indexes
◼ Because a single-level index is an ordered file, we
can create a primary index to the index itself.
❑ The original index file is called the first-level index and the
index to the index is called the second-level index.
◼ We can repeat the process, creating a third, fourth,
..., top level until all entries of the top level fit in
one disk block.
◼ A multi-level index can be created for any type of
first-level index (primary, secondary, clustering) as
long as the first-level index consists of more than
one disk block.

26
A two-level primary
index resembling
ISAM (Indexed
Sequential Access
Method)
organization.

27
Example 3
Given the following data file:
EMPLOYEE(NAME, SSN, ADDRESS, JOB, SAL, ... )
Suppose that:
◼ Record size R = 150 bytes, block size B = 512 bytes, r = 30,000 records
◼ SSN field size VSSN = 9 bytes, block pointer size P = 6 bytes
Then, we get:
◼ Blocking factor: bfr = ⌊B/R⌋ = ⌊512/150⌋ = 3 records/block
◼ Number of blocks needed for the file: b = ⌈r/bfr⌉ = ⌈30,000/3⌉ = 10,000 blocks
For a primary index on the ordering key field SSN (Example 2):
◼ Index entry size: Ri = (VSSN + P) = (9 + 6) = 15 bytes
◼ Index blocking factor: bfri = ⌊B/Ri⌋ = ⌊512/15⌋ = 34 entries/block
◼ Number of blocks for the index file: bi = ⌈b/bfri⌉ = ⌈10,000/34⌉ = 295 blocks
◼ Searching for and retrieving a record needs: ⌈log2 bi⌉ + 1 = ⌈log2 295⌉ + 1 = 10 block
accesses
For a multilevel index on the ordering key field SSN (a sketch follows below):
◼ Index blocking factor: bfri = ⌊B/Ri⌋ = ⌊512/15⌋ = 34 entries/block
o This is the fan-out fo of the multilevel index.
◼ Number of 1st-level index blocks: b1 = 295 blocks
◼ Number of 2nd-level index blocks: b2 = ⌈b1 / fo⌉ = ⌈295 / 34⌉ = 9 blocks
◼ Number of 3rd-level index blocks: b3 = ⌈b2 / fo⌉ = ⌈9 / 34⌉ = 1 block → top level
◼ Number of levels of this multilevel index: x = 3 levels
◼ Searching for and retrieving a record needs: x + 1 = 4 block accesses
28
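A minimal sketch of the repeated ⌈bi / fo⌉ division that determines the number of levels of the multilevel index in Example 3 (same assumed values):

import math

fo, first_level_blocks = 34, 295      # fan-out and first-level index blocks (Example 3)

levels, blocks = 1, first_level_blocks
while blocks > 1:                     # keep building levels until the top level fits in one block
    blocks = math.ceil(blocks / fo)
    levels += 1
print(levels, levels + 1)             # 3 index levels, hence 3 + 1 = 4 block accesses per lookup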
Multi-Level Indexes

◼ Such a multi-level index is a form of search


tree.
◼ However, insertion and deletion of new index
entries is a severe problem because every
level of the index is an ordered file.

38
Contents

1 Single-level Ordered Indexes


2 Multilevel Indexes
Dynamic Multilevel Indexes Using B-Trees
3
and B+-Trees
4 Indexes on Multiple Keys
5 Other File Indexes
6 Indexes in Today‘s DBMSs

39
Dynamic Multilevel Indexes Using B-
Trees and B+-Trees
◼ Most multi-level indexes use B-tree or B+-tree data
structures because of the insertion and deletion
problem.
❑ This leaves space in each tree node (disk block) to allow
for new index entries
◼ These data structures are variations of search trees
that allow efficient insertion and deletion of new
search values.
◼ In B-Tree and B+-Tree data structures, each node
corresponds to a disk block.
◼ Each node is kept between half-full and completely
full.
40
Dynamic Multilevel Indexes Using B-
Trees and B+-Trees (cont.)
◼ An insertion into a node that is not full is quite
efficient.
❑ If a node is full, the insertion causes a split into
two nodes.
◼ Splitting may propagate to other tree levels.
◼ A deletion is quite efficient if a node does not
become less than half full.
◼ If a deletion causes a node to become less than
half full, it must be merged with neighboring
nodes.
41
Difference between B-tree and B+-tree

◼ In a B-Tree, pointers to data records exist at


all levels of the tree.
◼ In a B+-Tree, all pointers to data records exist
at the leaf-level nodes.
◼ A B+-Tree can have less levels (or higher
capacity of search values) than the
corresponding B-tree.

42
B-tree Structures

43
The Nodes of a B+-Tree

44
The Nodes of a B+-Tree (cont.)

45
Example 4: Calculate the order of a B-tree
◼ Suppose that:
❑ Search field V = 9 bytes, disk block size B = 512 bytes
❑ Record (data) pointer Pt = 7 bytes, block pointer is P = 6 bytes.
◼ Each B-tree node can have at most p tree pointers, p – 1
data pointers, and p – 1 search key field values.
◼ These must fit into a single disk block if each B-tree node is to
correspond to a disk block:
(p*P) + ((p-1)*(Pt+V))  B
 (p*6) + ((p-1)*(7+9))  512
 (22*p)  528
◼ We can choose to be a large value that satisfies the above
inequality, which gives p = 23 (p = 24 is not chosen because
of additional information).

46
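A minimal sketch that solves the order inequality of Example 4 by scanning for the largest p with p·P + (p − 1)·(Pt + V) ≤ B (same assumed sizes):

B, V, Pt, P = 512, 9, 7, 6      # block size, search field size, record pointer, block pointer
p = max(p for p in range(2, 200) if p * P + (p - 1) * (Pt + V) <= B)
print(p)                        # 24; the slides settle on p = 23 to leave room for extra node information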
Example 5: Calculate approximate number
of entries of a B-tree
◼ Suppose that:
❑ Search field of Example 3 is a non-ordering key field, and we construct a B-Tree on
this field.
❑ Each node of the B-tree is 69 percent full.
◼ Each node, on the average, will have: p * 0.69 = 23 * 0.69 = 15.87 ≈ 16
pointers → 15 search key field values.
◼ The average fan-out fo = 16. We can start at the root and see how many
values and pointers can exist, on the average, at each subsequent level:
Level Nodes Index entries Pointers
Root: 1 node 15 entries 16 pointers
Level 1: 16 nodes 240 entries 256 pointers
Level 2: 256 nodes 3840 entries 4096 pointers
Level 3: 4096 nodes 61,440 entries
◼ At each level, number of entries = the total number of pointers at the
previous level * the average number of entries in each node.
◼ A two-level B-tree holds 3840+240+15 = 4095 entries on the average; a
three-level B-tree holds 65,535 entries on the average. 47
Example 6: Calculate the order of a B+-tree
◼ Suppose that:
❑ Search key field V=9 bytes, block size B=512bytes
❑ Record pointer is Pr = 7bytes, block pointer is P = 6bytes.
◼ An internal node of the B+-tree can have up to p tree pointers and p-
1 search field values; these must fit into a single block. Hence, we
have:
(p*P) + ((p-1)*V)  B
 (p*6) + ((p-1)*9)  512

 15*p  512

◼ We can choose p to be the largest value satisfying the above


inequality, which give p = 34.
◼ This is larger than the value of 23 for the B-Tree, resulting in a larger
fan-out and more entries in each internal node of a B+-Tree than in
the corresponding B-Tree.

48
Example 6: Calculate the order of a B+-tree
(cont.)
◼ The leaf nodes of B+-tree will have the same number of
values and pointers, except that the pointers are data
pointers and a next pointer. Hence, the order pleaf for the
leaf nodes can be calculated as follows:
(pleaf * (Pt+V))+P  B
 (pleaf * (7+9))+6  512
 (16 * pleaf)  506
◼ If follows that each leaf node hold up to pleaf = 31 key
value/data pointer combinations, assuming that the data
pointers are record pointers.

49
Example 7: Calculate approximate number
of entries of a B+-tree
◼ Suppose that we construct a B+-Tree on the field of Example 6:
❑ Search key field V = 9 bytes, block size B = 512bytes
❑ Record pointer is Pr = 7bytes, block pointer is P = 6bytes.
❑ Each node is 69 percent full.
◼ On the average, each internal node will have 34*0.69 ≈ 23.46 or
approximately 23 pointers, and hence 22 values.
◼ Each leaf node, on the average, will hold 0.69*pleaf = 0.69*31 ≈ 21.39 or
approximately 21 data record pointers.
◼ A B+-tree will have the following average number of entries at each level:
Level        Nodes         Index entries              Pointers
Root         1 node        22 entries                 23 pointers
Level 1      23 nodes      23*22 = 506 entries        23^2 = 529 pointers
Level 2      529 nodes     529*22 = 11,638 entries    23^3 = 12,167 pointers
Leaf level   12,167 nodes  12,167*21 = 255,507 record pointers
◼ A 3-level B+-tree holds up to 255,507 record pointers, on the average.
◼ Compare this to the 65,535 entries for the corresponding B-tree in Example 5.
50
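A minimal sketch of the capacity estimate in Example 7, assuming p = 34, pleaf = 31, and nodes that are 69 percent full:

p, p_leaf, fill = 34, 31, 0.69
fan_out = round(p * fill)              # ≈ 23 pointers per internal node
leaf_entries = round(p_leaf * fill)    # ≈ 21 record pointers per leaf node

leaf_nodes = fan_out ** 3              # root, level 1, and level 2 sit above the leaves
print(fan_out, leaf_entries, leaf_nodes, leaf_nodes * leaf_entries)   # 23 21 12167 255507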
B+-Tree: Insert entry

51
B+-Tree: Insert entry (cont.)

52
Example of insertion in B+-tree

p = 3 and pleaf = 2

Insertion Sequence: 8, 5, 1, 7, 3, 12, 9, 6

53
Example of insertion in B+-tree (cont.)

p = 3 and pleaf = 2

Insertion Sequence: 8, 5, 1, 7, 3, 12, 9, 6

54
Example of insertion in B+-tree (cont.)

p = 3 and pleaf = 2

Insertion Sequence: 8, 5, 1, 7, 3, 12, 9, 6

55
Example of insertion in B+-tree (cont.)

p = 3 and pleaf = 2

Insertion Sequence: 8, 5, 1, 7, 3, 12, 9, 6

56
Example of insertion in B+-tree (cont.)

p = 3 and pleaf = 2 Insertion Sequence: 8, 5, 1, 7, 3, 12, 9, 6

57
Example of insertion in B+-tree (cont.)

p = 3 and pleaf = 2 Insertion Sequence: 8, 5, 1, 7, 3, 12, 9, 6

58
Example of insertion in B+-tree (cont.)

p = 3 and pleaf = 2 Insertion Sequence: 8, 5, 1, 7, 3, 12, 9, 6

59
B+-Tree: Delete entry
◼ Remove the entry from the leaf node.
◼ If it happens to occur in an internal node:
❑ Remove.
❑ The value to its left in the leaf node must replace it in the internal
node.
◼ Deletion may cause underflow in leaf node:
❑ Try to find a sibling leaf node – a leaf node directly to the left or to
the right of the node with underflow.
❑ Redistribute the entries among the node and its siblings.
(Common method: The left sibling first and the right sibling later)
❑ If redistribution fails, the node is merged with its sibling.
❑ If merge occurred, must delete entry (pointing to node and
sibling) from parent node.

60
B+-Tree: Delete entry (cont.)

◼ If an internal node is underflow:


❑ Redistribute the entries among the node, its siblings and
entry pointing to node and sibling of parent node .
❑ If redistribution fails, the node is merged with its sibling and
the entry pointing to node and sibling of parent node .
❑ If merge occurred, must delete entry pointing to node and
sibling from parent node.
❑ If the root node is empty → the merged node becomes the
new root node.
◼ Merge could propagate to root, reduce the tree
levels.

61
Example of deletion from B+-tree

p = 3 and pleaf = 2.

Deletion sequence: 5, 12, 9

Delete 5

62
Example of deletion from B+-tree (cont.)
P = 3 and pleaf = 2.

Deletion sequence: 5, 12, 9

Delete 12: underflow


(redistribute)

63
Example of deletion from B+-tree (cont.)
p = 3 and pleaf = 2.

Deletion sequence: 5, 12, 9

Delete 9:
Underflow (merge with left, redistribute)

64
Example of deletion from B+-tree (cont.)
p = 3 and pleaf = 2.

Deletion sequence: 5, 12, 9

65
Search using B-trees and B+-trees
K=8
5<8

7< 8 <= 8

found

66
Search using B-trees and B+-trees
◼ Search conditions on indexing attributes
❑ =, <, >, ≤, ≥, between, MINIMUM value, MAXIMUM
value
◼ Search results
❑ Zero, one, or many data records
◼ Search cost
❑ B-trees
◼ From 1 to (1 + the number of tree levels) + data accesses
❑ B+-trees
◼ 1 (root level) + the number of tree levels + data accesses

◼ Logically ordering for a data file

67
Contents

1 Single-level Ordered Indexes


2 Multilevel Indexes
Dynamic Multilevel Indexes Using B-Trees and
3
B+-Trees
4 Indexes on Multiple Keys
5 Other File Indexes
6 Indexes in Today‘s DBMSs

69
Indexes on Multiple Keys
◼ In many retrieval and update requests, multiple
attributes are involved.
◼ If a certain combination of attributes is used
frequently, it is advantageous to set up an access
structure to provide efficient access by a key value
that is a combination of those attributes.
◼ If an index is created on attributes <A1, A2, … , An>,
the search key values are tuples with n values: <v1,
v2, … , vn>.
◼ A lexicographic ordering of these tuple values
establishes an order on this composite search key.
◼ An index on a composite key of n attributes works
similarly to any index discussed so far.

70
Contents

1 Single-level Ordered Indexes


2 Multilevel Indexes
Dynamic Multilevel Indexes Using B-Trees and
3
B+-Trees
4 Indexes on Multiple Keys
5 Other File Indexes
6 Indexes in Today‘s DBMSs

71
Other File Indexes
◼ Hash indexes
❑ The hash index is a secondary structure to access the
file by using hashing on a search key other than the one
used for the primary data file organization.
◼ Bitmap indexes
❑ A bitmap index is built on one particular value of a
field (the column in a table) with respect to all the rows
(records) and is an array of bits.
◼ Function-based indexes
❑ In Oracle, an index such that the value that results from
applying a function (expression) on a field or some fields
becomes the key to the index

72
Other File Indexes

◼ Hash indexes
❑ The hash index is a secondary structure to
access the file by using hashing on a search
key other than the one used for the primary
data file organization.
◼ access structures similar to indexes, based on
hashing
❑ Support for equality searches on the hash
field

73
Hash indexes

◼ The hash index is a secondary


structure to access the file by using
hashing on a search key other than the
one used for the primary data file
organization.
❑ access structures similar to indexes, based
on hashing
◼ Support for equality searches on the
hash field

74
(Figure) Hashing function: the sum of the digits of Emp_id modulo 10.

75
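A minimal sketch of the hash function named in the figure above (the sum of the digits of Emp_id modulo 10; the id value is an arbitrary example):

def digit_sum_hash(emp_id):
    return sum(int(d) for d in str(emp_id)) % 10   # sum of the digits of Emp_id, modulo 10

print(digit_sum_hash(12345))                        # 1+2+3+4+5 = 15 -> bucket 5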
Bitmap indexes
◼ A bitmap index is built on one particular value
of a field (the column in a table) with respect to
all the rows (records) and is an array of bits.
❑ Each bit in the bitmap corresponds to a row. If the bit is
set, then the row contains the key value.
◼ In a bitmap index, each indexing field value is
associated with pointers to multiple rows.
◼ Bitmap indexes are primarily designed for data
warehousing or environments in which queries
reference many columns in an ad hoc fashion.
❑ The number of distinct values of the indexed field is
small compared to the number of rows.
❑ The indexed table is either read-only or not subject to
significant modification by DML statements.
76
Bitmap indexes

77
Bitmap indexes

78
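A minimal sketch of a bitmap index on one low-cardinality column: one bit array per distinct value and one bit per row, so combining predicates becomes a bitwise AND/OR of bitmaps (the column data is illustrative):

rows = ["M", "F", "F", "M", "F"]                     # e.g. the Sex column, one value per row

bitmaps = {}
for row_id, value in enumerate(rows):
    bitmaps.setdefault(value, [0] * len(rows))[row_id] = 1

print(bitmaps["F"])                                   # [0, 1, 1, 0, 1]: rows where Sex = 'F'

other_predicate = [1, 1, 0, 0, 1]                     # hypothetical bitmap for another condition
print([a & b for a, b in zip(bitmaps["F"], other_predicate)])   # rows satisfying both conditions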
Function-based indexes
◼ The use of any function on a column prevents the
index defined on that column from being used.
❑ Indexes are only used with some specific search
conditions on indexed columns.

◼ In Oracle, a function-based index is an index


such that the value that results from applying
some function (expression) on a field or a
collection of fields becomes the key to the index.
❑ A function-based index can be either a B-tree or a
bitmap index.

79
Function-based indexes

80
Contents

1 Single-level Ordered Indexes


2 Multilevel Indexes
Dynamic Multilevel Indexes Using B-Trees and
3
B+-Trees
4 Indexes on Multiple Keys
5 Other File Indexes
6 Indexes in Today‘s DBMSs

81
Index Creation
CREATE [ UNIQUE ] INDEX <index name>
ON <table name> ( <column name> [ <order> ] { , <column name> [ <order> ] } )
[ CLUSTER ] ;

◼ UNIQUE is used to guarantee that no two rows of a table


have duplicate values in the key column or columns.
◼ CLUSTER is used when the index to be created should also
sort the data file records on the indexing attribute.

CREATE INDEX DnoIndex ON EMPLOYEE (Dno)


CLUSTER ;

82
B-tree index in Oracle 19c

83
B-tree for a clustered index in MS
SQL Server

84
Review questions
1) Define the following terms: indexing field, primary key field, clustering
field, secondary key field, block anchor, dense index, and nondense
(sparse) index.
2) What are the differences among primary, secondary, and clustering
indexes? How do these differences affect the ways in which these
indexes are implemented? Which of the indexes are dense, and which
are not?
3) Why can we have at most one primary or clustering index on a file, but
several secondary indexes?
4) How does multilevel indexing improve the efficiency of searching an
index file?
5) What is the order p of a B-tree? Describe the structure of B-tree nodes.
6) What is the order p of a B+-tree? Describe the structure of both internal
and leaf nodes of a B+-tree.
7) How does a B-tree differ from a B+-tree? Why is a B+-tree usually
preferred as an access structure to a data file?

85
86
Chapter 4

Algorithms for Query Processing and


Optimization
Chapter Outline
1. Introduction to Query Processing
2. Translating SQL Queries into Relational Algebra
3. Algorithms for External Sorting
4. Algorithms for SELECT and JOIN Operations
5. Algorithms for PROJECT and SET Operations
6. Implementing Aggregate Operations and Outer Joins
7. Combining Operations using Pipelining
8. Using Heuristics in Query Optimization
9. Using Selectivity and Cost Estimates in Query Optimization
10. Overview of Query Optimization in Oracle
11. Semantic Query Optimization

2
1. Introduction to Query Processing

◼ Query optimization: the process of choosing a suitable


execution strategy for processing a query.
◼ Two internal representations of a query
❑ – Query Tree
❑ – Query Graph

3
Typical steps when processing a high-level query

4
COMPANY Database Schema

5
2. Translating SQL Queries into Relational
Algebra (1)
◼ Query block: the basic unit that can be translated into the algebraic
operators and optimized.
◼ A query block contains a single SELECT-FROM-WHERE expression,
as well as GROUP BY and HAVING clause if these are part of the
block.
◼ Nested queries within a query are identified as separate query blocks.
◼ Aggregate operators in SQL must be included in the extended algebra.

6
Translating SQL Queries into Relational Algebra (2)

SELECT LNAME, FNAME


FROM EMPLOYEE
WHERE SALARY > ( SELECT MAX (SALARY)
FROM EMPLOYEE
WHERE DNO = 5);

SELECT LNAME, FNAME SELECT MAX (SALARY)


FROM EMPLOYEE FROM EMPLOYEE
WHERE SALARY > C WHERE DNO = 5

πLNAME, FNAME(σSALARY>C(EMPLOYEE))          ℱMAX SALARY(σDNO=5(EMPLOYEE))
7
3. Algorithms for External Sorting (1)

◼ External sorting : refers to sorting algorithms that are suitable for large
files of records stored on disk that do not fit entirely in main memory, such
as most database files.
◼ Sort-Merge strategy : starts by sorting small subfiles (runs ) of the main
file and then merges the sorted runs, creating larger sorted subfiles that
are merged in turn.
– Sorting phase: nR = (b/nB)
– Merging phase: dM = Min(nB-1, nR);
nP= (logdM(nR))
nR: number of initial runs; b: number of file blocks;
nB: available buffer space; dM: degree of merging;
nP: number of passes.

8
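A minimal sketch of the run/pass arithmetic above (b and nB are example values):

import math

b, nB = 1024, 5                            # file blocks and available buffer blocks (example values)
nR = math.ceil(b / nB)                     # number of initial runs produced by the sort phase
dM = min(nB - 1, nR)                       # degree of merging
nP = math.ceil(math.log(nR, dM))           # number of merge passes
total = 2 * b + 2 * b * nP                 # sort phase + merge phases, in block accesses
print(nR, dM, nP, total)                   # 205 runs, 4-way merge, 4 passes, 10240 block accesses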
Algorithms for External Sorting (2)

set i  1, j  b; /* size of the file in blocks */


k  nB; /* size of buffer in blocks */
m  (j/k); /*number of runs */
{Sort phase}
while (i<= m) do
{
read next k blocks of the file into the buffer or if there are less than k
blocks remaining, then read in the remaining blocks;
sort the records in the buffer and write as a temporary subfile;
i  i+1;
}
The number of block accesses for the sort phase = 2*b

9
Algorithms for External Sorting (3)
/* Merge phase: merge subfiles until only 1 remains */
set i ← 1;
p ← ⌈logk−1 m⌉; /* p is the number of passes for the merging phase */
j ← m; /* the number of runs */
while (i <= p) do
{
n ← 1;
q ← ⌈j/(k − 1)⌉; /* the number of subfiles to write in this pass */
while (n <= q) do
{
read next k − 1 subfiles or remaining subfiles (from previous pass) one block at a time;
merge and write as new subfile one block at a time;
n ← n + 1;
}
j ← q;
i ← i + 1;
}
The number of block accesses for the merge phase = 2*(b*⌈logdM(nR)⌉)
10
Example of External Sorting (1)
1 block = 2 records

15 22 2 27 14 6 51 18 35 16 50 36 9 8 32 12 11 33 30 20 23 21 24 29

buffer = 3 blocks = 6 records

Sort phase:
Read 3 blocks of the file → sort.
→ run: 3 blocks

11
Example of External Sorting (2)
Sort phase:

15 22 2 27 14 6 51 18 35 16 50 36 9 8 32 12 11 33 30 20 23 21 24 29

15 22 2 27 14 6 2 6 14 15 22 27

2 6 14 15 22 27

1 run

12
Example of External Sorting (3)
Sort phase

15 22 2 27 14 6 51 18 35 16 50 36 9 8 32 12 11 33 30 20 23 21 24 29

2 6 14 15 22 27 16 18 35 36 50 51 8 9 11 12 32 33 20 21 23 24 29 30

1 run 1 run 1 run 1 run

13
Example of External Sorting (4)

2 6 14 15 22 27 16 18 35 36 50 51 8 9 11 12 32 33 20 21 23 24 29 30

Merge phase:
Each step:
- Read one block from each of (nB − 1) runs into the buffer
- Merge them into a temporary output block
- When the temp block is full, write it to the file
- When an input block becomes empty, read the next block
from the corresponding run

14
Example of External Sorting (5)
Merge phase: Pass 1
2 6 14 15 22 27 16 18 35 36 50 51 8 9 11 12 32 33 20 21 23 24 29 30

2 6 16 18 6 16 18 2 16 18 2 6

Temp block Empty → read next Full


block from → write
corresponding run to file

15
Example of External Sorting (6)
Merge phase: Pass 1
2 6 14 15 22 27 16 18 35 36 50 51 8 9 11 12 32 33 20 21 23 24 29 30

14 15 16 18 15 16 18 14 16 18 14 15

Temp block Empty → read next Full


block from → write to file
corresponding run

2 6

16
Example of External Sorting (7)
Merge phase: Pass 1
2 6 14 15 22 27 16 18 35 36 50 51 8 9 11 12 32 33 20 21 23 24 29 30

Empty → read next


block from
corresponding run

22 27 16 18 22 27 18 16 22 27 16 18

Temp block Full


→ write to file

2 6 14 15

17
Example of External Sorting (8)
Merge phase: Pass 1
2 6 14 15 22 27 16 18 35 36 50 51 8 9 11 12 32 33 20 21 23 24 29 30

Empty → read next


block from
corresponding run

22 27 35 36 27 35 36 22 35 36 22 27

Temp block Full


→ write to file

2 6 14 15 16 18

18
Example of External Sorting (9)
Merge phase: Pass 1
2 6 14 15 22 27 16 18 35 36 50 51 8 9 11 12 32 33 20 21 23 24 29 30

This run has no blocks left ⇒ remove it from this
merge step ⇒ write the remaining blocks of the
other runs to the file.

35 36

Temp block

2 6 14 15 16 18 22 27

19
Example of External Sorting (10)
Merge phase: Pass 1
2 6 14 15 22 27 16 18 35 36 50 51 8 9 11 12 32 33 20 21 23 24 29 30

Temp block
1 new run

2 6 14 15 16 18 22 27 35 36 50 51

20
Example of External Sorting (11)
Merge phase: Pass 2
2 6 14 15 16 18 22 27 35 36 50 51 8 9 11 12 20 21 23 24 29 30 32 33

2 6 8 9 6 8 9 2 8 9 2 6

Temp block Empty → read next Full


block from → write to file
corresponding run

21
Example of External Sorting (12)
Merge phase: Pass 2
2 6 14 15 16 18 22 27 35 36 50 51 8 9 11 12 20 21 23 24 29 30 32 33

14 15 8 9

Temp block

2 6

22
Example of External Sorting (13)

Result:
2 6 8 9 11 12 14 15 16 18 20 21 22 23 24 27 29 30 32 33 35 36 50 51

23
4. Algorithms for SELECT and JOIN
Operations (1)
Implementing the SELECT Operation:
Examples:
◼ (OP1): σSSN='123456789'(EMPLOYEE)
◼ (OP2): σDNUMBER>5(DEPARTMENT)
◼ (OP3): σDNO=5(EMPLOYEE)
◼ (OP4): σDNO=5 AND SALARY>30000 AND SEX='F'(EMPLOYEE)
◼ (OP5): σESSN='123456789' AND PNO=10(WORKS_ON)
◼ (OP6): σDNO IN (3, 27, 49) (EMPLOYEE)

◼ (OP7): σ((Salary*Commission_pct) + Salary) > 5000(EMPLOYEE)
24


Algorithms for SELECT and JOIN (2)

Implementing the SELECT Operation (cont.):


Search Methods for Simple Selection:
◼ S1. Linear search (brute force): Retrieve every record in the file, and
test whether its attribute values satisfy the selection condition.
◼ S2. Binary search : If the selection condition involves an equality
comparison on a key attribute on which the file is ordered, binary search
(which is more efficient than linear search) can be used. (See OP1).
◼ S3. Using a primary index or hash key to retrieve a single record: If the
selection condition involves an equality comparison on a key attribute with
a primary index (or a hash key), use the primary index (or the hash key) to
retrieve the record.

25
Algorithms for SELECT and JOIN Operations (3)

Implementing the SELECT Operation (cont.):


Search Methods for Simple Selection:
◼ S4. Using a primary index to retrieve multiple records: If the
comparison condition is >, ≥ , <, or ≤ on a key field with a primary
index, use the index to find the record satisfying the corresponding
equality condition, then retrieve all subsequent records in the (ordered)
file.
◼ S5. Using a clustering index to retrieve multiple records: If the
selection condition involves an equality comparison on a non-key
attribute with a clustering index, use the clustering index to retrieve all
the records satisfying the selection condition.

26
Algorithms for SELECT and JOIN Operations (4)

Implementing the SELECT Operation (cont.):


Search Methods for Simple Selection:
◼ S6. Using a secondary (B+-tree) index : On an equality comparison, this
search method can be used to retrieve:
❑ a single record if the indexing field has unique values (is a key) OR
❑ to retrieve multiple records if the indexing field is not a key.
❑ In addition, it can be used to retrieve records on conditions involving >,>=, <, or <=
(FOR RANGE QUERIES )

27
Algorithms for SELECT and JOIN Operations (4)

Implementing the SELECT Operation (cont.):


Search Methods for Simple Selection:
◼ S7.a. Using a bitmap index: If the selection condition involves a set of
values for an attribute, the corresponding bitmaps for each value can be
OR-ed to give the set of record identifiers that qualify.
◼ S7.b. Using a functional index: If there is a functional index defined, this
index can be used to retrieve all the records that qualify.

28
Algorithms for SELECT and JOIN (5)

Implementing the SELECT Operation (cont.):


Search Methods for Complex Selection:
◼ S8. Conjunctive (AND) selection : If an attribute involved in any single
simple condition in the conjunctive condition has an access path that
permits the use of one of the methods S2 to S6, use that condition to
retrieve the records and then check whether each retrieved record
satisfies the remaining simple conditions in the conjunctive condition.
◼ S9. Conjunctive (AND) selection using a composite index: If two or
more attributes are involved in equality conditions in the conjunctive
condition and a composite index (or hash structure) exists on the
combined field, we can use the index directly.

29
Algorithms for SELECT and JOIN (6)

Implementing the SELECT Operation (cont.):


Search Methods for Complex Selection:
◼ S10. Conjunctive (AND) selection by intersection of record pointers :
❑ This method is possible if:
◼ secondary indexes are available on all (or some of) the fields involved in equality
comparison conditions in the conjunctive condition and
◼ the indexes include record pointers (rather than block pointers).
❑ Each index can be used to retrieve the set of record pointers that satisfy
the individual condition.
❑ The intersection of these sets of record pointers: record pointers that
satisfy the conjunctive condition, which are then used to retrieve those
records directly.
❑ If only some of the conditions have secondary indexes, each retrieved
record is further tested to determine whether it satisfies the remaining
conditions.
30
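S10 can be sketched as follows, assuming each secondary index returns the set of record identifiers (RIDs) that satisfy one equality condition; the helper names and index layout are assumptions of this sketch, not from the text.

# S10: conjunctive selection by intersecting sets of record pointers (RIDs).
def rids_from_index(index, value):
    return set(index.get(value, []))       # index: value -> list of RIDs

def conjunctive_select_by_rid_intersection(indexed_conds, other_conds, fetch_record):
    # indexed_conds: list of (index, value) pairs that have secondary indexes
    # other_conds:   predicates checked on the fetched records (no usable index)
    rid_sets = [rids_from_index(idx, val) for idx, val in indexed_conds]
    candidates = set.intersection(*rid_sets) if rid_sets else set()
    result = []
    for rid in candidates:
        rec = fetch_record(rid)             # retrieve the record directly via its pointer
        if all(pred(rec) for pred in other_conds):
            result.append(rec)
    return result

# e.g. for (OP4): intersect the RID sets from the DNO and SEX indexes, then test
# SALARY > 30000 on each fetched record as a remaining condition.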
Algorithms for SELECT and JOIN Operations (7)

Implementing the SELECT Operation (cont.):


◼ S11. Disjunctive (OR) selection conditions:

❑ Records satisfying the disjunctive condition are the union of the records
satisfying the individual conditions.
❑ If any one of the conditions does not have an access path, we are
compelled to use the brute force, linear search approach (S1).
❑ Only if an access path exists on every simple condition in the disjunction
can we optimize the selection by retrieving the records satisfying each
condition - or their record ids - and then applying the union operation to
eliminate duplicates.

31
Algorithms for SELECT and JOIN Operations (7)
Implementing the SELECT
Operation (cont.):
◼ S11. Disjunctive (OR)
selection conditions:

32
Algorithms for SELECT and JOIN Operations (7)

Implementing the SELECT Operation (cont.):


◼ Whenever a single condition specifies the selection, we can only check
whether an access path exists on the attribute involved in that condition. If an
access path exists, the method corresponding to that access path is used;
otherwise, the “brute force” linear search approach of method S1 is used.
(See OP1, OP2 and OP3)
◼ For conjunctive selection conditions, whenever more than one of the
attributes involved in the conditions have an access path, query optimization
should be done to choose the access path that retrieves the fewest records
in the most efficient way.

33
Which search method should be used? (1)

◼ (OP1): σSSN='123456789'(EMPLOYEE)
◼ (OP2): σDNUMBER>5(DEPARTMENT)
◼ (OP3): σDNO=5(EMPLOYEE)
◼ (OP4): σDNO=5 AND SALARY>30000 AND SEX='F'(EMPLOYEE)
◼ (OP5): σESSN='123456789' AND PNO=10(WORKS_ON)
◼ (OP6): σDNO IN (3, 27, 49) (EMPLOYEE)

◼ (OP7): σ((Salary*Commission_pct) + Salary ) > 5000(EMPLOYEE)

34
Which search method should be used? (2)

◼ EMPLOYEE
❑ A clustering index on SALARY
❑ A secondary index on the key attribute SSN
❑ A secondary index on the nonkey attribute DNO
❑ A secondary index on SEX

(OP1): σSSN='123456789'(EMPLOYEE)

❑ S1. Linear search


❑ S6. Secondary index search
35
Which search method should be used? (3)

◼ DEPARTMENT
❑ A primary index on DNUMBER
❑ A secondary index on MGRSSN

(OP2): σDNUMBER>5 (DEPARTMENT)

❑ S1. Linear search


❑ S2. Binary search
❑ S4. Primary index search

36
Which search method should be used? (4)

◼ EMPLOYEE
❑ A clustering index on SALARY
❑ A secondary index on the key attribute SSN
❑ A secondary index on the nonkey attribute DNO
❑ A secondary index on SEX

(OP3): σDNO=5 (EMPLOYEE)


❑ S1. Linear search
❑ S6. Secondary index search
37
Which search method should be used? (5)
◼ EMPLOYEE
❑ A clustering index on SALARY
❑ A secondary index on the key attribute SSN
❑ A secondary index on the nonkey attribute DNO
❑ A secondary index on SEX
(OP4): σDNO=5 AND SALARY>30000 AND SEX=‘F’ (EMPLOYEE)
❑ S1. Linear search

❑ S8. Conjunctive selection with a simple condition

❑ S10. Conjunctive selection by intersection of the record pointers

◼ S6. Secondary index search on Dno


◼ S5. Clustering index search on Salary
◼ S6. Secondary index search on Sex
38
Which search method should be used? (6)
◼ EMPLOYEE
❑ A clustering index on SALARY
❑ A secondary index on the key attribute SSN
❑ A secondary index on the nonkey attribute DNO
❑ A secondary index on SEX
(OP4’): σDNO=5 OR SALARY>30000 OR SEX=‘F’ (EMPLOYEE)
❑ S1. Linear search

❑ S11. Disjunctive selection by union of the record pointers

◼ S6. Secondary index search on Dno


◼ S5. Clustering index search on Salary
◼ S6. Secondary index search on Sex

39
Which search method should be used? (7)

◼ WORKS_ON
❑ A composite primary index on (ESSN, PNO)

(OP5): σESSN='123456789' AND PNO=10 (WORKS_ON)

❑ S1. Linear search


❑ S9. Conjunctive selection using a composite primary index

40
Which search method should be used? (8)
◼ EMPLOYEE
❑ A clustering index on SALARY
❑ A secondary index on the key attribute SSN
❑ A secondary index on the nonkey attribute DNO
❑ A secondary index on SEX

(OP6): σDNO IN (3, 27, 49) (EMPLOYEE)

❑ S1. Linear search


❑ S7.a. Bitmap index search
❑ S6. Secondary index search + S11. Disjunctive selection by union of the
record pointers
41
Which search method should be used? (9)

◼ EMPLOYEE
❑ A clustering index on SALARY
❑ A secondary index on the key attribute SSN
❑ A secondary index on the nonkey attribute DNO
❑ A secondary index on SEX

(OP7): σ((Salary*Commission_pct) + Salary ) > 5000 (EMPLOYEE)


❑ S1. Linear search

❑ S7.b. Functional index search if a functional index is defined on the


expression: (Salary*Commission_pct + Salary)
42
Algorithms for SELECT and JOIN Operations (8)
Implementing the JOIN Operation:
◼ Join (EQUIJOIN, NATURAL JOIN)

– two–way join: a join on two files


e.g. R ⋈A=B S
– multi-way joins: joins involving more than two files.
e.g. R ⋈A=B S ⋈C=D T

◼ Examples
(OP8): EMPLOYEE ⋈DNO=DNUMBER DEPARTMENT
(OP9): DEPARTMENT ⋈MGRSSN=SSN EMPLOYEE

43
Algorithms for SELECT and JOIN Operations (9)

Implementing the JOIN Operation (cont.):


Methods for implementing joins:
◼ J1. Nested-loop join (brute force): For each record t in R (outer loop),
retrieve every record s from S (inner loop) and test whether the two
records satisfy the join condition t[A] = s[B].
◼ J2. Single-loop join (Using an access structure to retrieve the
matching records): If an index (or hash key) exists for one of the two join
attributes — say, B of S — retrieve each record t in R, one at a time, and
then use the access structure to retrieve directly all matching records s
from S that satisfy s[B] = t[A].

44
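A minimal sketch of J1 and J2, assuming relations are lists of dicts and that the access structure for J2 is a dictionary mapping join-attribute values of S to its matching records; the names and data layout are illustrative only.

# J1: nested-loop join of R and S on R.A = S.B
def nested_loop_join(R, S, A, B):
    return [{**t, **s} for t in R for s in S if t[A] == s[B]]

# J2: single-loop join using an access structure on S.B
def build_index(S, B):
    index = {}
    for s in S:
        index.setdefault(s[B], []).append(s)   # B value -> matching records of S
    return index

def single_loop_join(R, index_on_B, A):
    return [{**t, **s} for t in R for s in index_on_B.get(t[A], [])]

# (OP8)-style usage:
#   single_loop_join(EMPLOYEE, build_index(DEPARTMENT, 'DNUMBER'), 'DNO')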
Algorithms for SELECT and JOIN Operations (10)

Implementing the JOIN Operation (cont.):


Methods for implementing joins:
◼ J3. Sort-merge join:
❑ Records of R and S are physically sorted (ordered) by value of the join
attributes A and B, respectively.
❑ Both files are scanned in order of the join attributes, matching the
records that have the same values for A and B.
❑ In this method, the records of each file are scanned only once each for
matching with the other file—unless both A and B are non-key attributes, in
which case the method needs to be modified slightly.

45
Algorithms for SELECT and JOIN Operations (12)
Implementing Sort-Merge Join (J3): T ← R ⋈A=B S

sort the tuples in R on attribute A; /* assume R has n tuples */
sort the tuples in S on attribute B; /* assume S has m tuples */
set i ← 1, j ← 1;
while (i ≤ n) and (j ≤ m)
do {
    if R(i)[A] > S(j)[B]
        then set j ← j + 1
    elseif R(i)[A] < S(j)[B]
        then set i ← i + 1
    else { /* R(i)[A] = S(j)[B], so output a matched tuple */
        output the combined tuple <R(i), S(j)> to T;
        /* output other tuples that match R(i), if any */
        set l ← j + 1;
        while (l ≤ m) and (R(i)[A] = S(l)[B])
        do { output the combined tuple <R(i), S(l)> to T;
             set l ← l + 1
        }
        /* output other tuples that match S(j), if any */
        set k ← i + 1;
        while (k ≤ n) and (R(k)[A] = S(j)[B])
        do { output the combined tuple <R(k), S(j)> to T;
             set k ← k + 1
        }
        set i ← i + 1, j ← j + 1
    }
}
46
        R                      S
   C    A                 B    D
        5                 4
        6                 6
        9                 6
       10                10
       17                17
       20                18

Assume that A is a key of R. Initially, two pointers are used to point to the two tuples of the two relations that have the smallest values of the two joining attributes.

47
Trace of the merge on this data: j is advanced whenever R(i)[A] > S(j)[B], i is advanced whenever R(i)[A] < S(j)[B], and whenever R(i)[A] = S(j)[B] the matching pair is output. On the data above this produces the pairs (R(2), S(2)), (R(2), S(3)), (R(4), S(4)) and (R(5), S(5)); the merge ends when j > m. (Figures stepping through the pointer positions appeared here in the original slides.)
Implementing Sort-Merge Join (J3): T ← R ⋈A=B S

Result:               C     A     B     D
  R(2), S(2)                6     6
  R(2), S(3)                6     6
  R(4), S(4)               10    10
  R(5), S(5)               17    17

59
Algorithms for SELECT and JOIN Operations (11)

Implementing the JOIN Operation (cont.):


Methods for implementing joins:
◼ J4. Hash-join:
❑ The records of files R and S are both hashed to the same hash file, using the same
hashing function on the join attributes A of R and B of S as hash keys.
❑ A single pass through the file with fewer records (say, R) hashes its records to the
hash file buckets.
❑ A single pass through the other file (S) then hashes each of its records to the
appropriate bucket, where the record is combined with all matching records from R.

60
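A minimal in-memory sketch of J4, assuming the smaller file R fits in the hash table; a real hash join would first partition both files into disk buckets, which is omitted here.

# J4: hash join of R and S on R.A = S.B
def hash_join(R, S, A, B):
    # Pass 1: hash the smaller file R into buckets on its join attribute A.
    buckets = {}
    for r in R:
        buckets.setdefault(hash(r[A]), []).append(r)
    # Pass 2: hash each record of S to the matching bucket and combine matching records.
    result = []
    for s in S:
        for r in buckets.get(hash(s[B]), []):
            if r[A] == s[B]:                 # guard against hash collisions
                result.append({**r, **s})
    return result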
5. Algorithms for PROJECT and SET
Operations (1)
◼ Algorithm for PROJECT operations π<attribute list>(R)
(Figure 19.3b)

◼ If <attribute list> includes a key of relation R, extract all tuples from R with only the values for the attributes in <attribute list>.
◼ If <attribute list> does NOT include a key of relation R, duplicate tuples must be removed from the results.
◼ Methods to remove duplicate tuples:
◼ Sorting
◼ Hashing

66
Implementing T ← π<attribute list>(R)

create a tuple t[<attribute list>] in T’ for each tuple t in R;
/* T’ contains the projection result before duplicate elimination */
if <attribute list> includes a key of R
then T ← T’
else {
    sort the tuples in T’;
    set i ← 1, j ← 2;
    while i ≤ n
    do { output the tuple T’[i] to T;
         while T’[i] = T’[j] and j ≤ n do set j ← j + 1;  /* skip duplicate tuples */
         set i ← j, j ← i + 1
    }
}
/* T contains the projection result after duplicate elimination */
67
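The pseudocode above removes duplicates by sorting; the hashing alternative mentioned earlier can be sketched as below (tuples are modeled as dicts, and the helper name is illustrative).

# PROJECT with duplicate elimination by hashing instead of sorting.
def project(R, attribute_list, includes_key):
    T_prime = [tuple(t[a] for a in attribute_list) for t in R]
    if includes_key:
        rows = T_prime                       # a key is kept, so no duplicates can arise
    else:
        seen = set()                         # hash table used for duplicate elimination
        rows = []
        for row in T_prime:
            if row not in seen:
                seen.add(row)
                rows.append(row)
    return [dict(zip(attribute_list, row)) for row in rows]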
Algorithms for PROJECT and SET Operations (2)

Algorithm for SET operations


◼ Set operations : UNION, INTERSECTION, SET DIFFERENCE and CARTESIAN
PRODUCT.
◼ CARTESIAN PRODUCT of relations R and S includes all possible combinations of
records from R and S. The attributes of the result include all attributes of R and S.
◼ Cost analysis of CARTESIAN PRODUCT: If R has n records and j attributes, and S has m records and k attributes, the result relation will have n*m records and j+k attributes.
◼ CARTESIAN PRODUCT operation is very expensive and should be avoided if possible.

68
Algorithms for PROJECT and SET Operations (3)

◼ Algorithm for SET operations (Cont.)

◼ UNION (See Figure 19.3c)


❑ 1. Sort the two relations on the same attributes.
❑ 2. Scan and merge both sorted files concurrently, whenever the same tuple exists in both
relations, only one is kept in the merged results.
◼ INTERSECTION (See Figure 19.3d)
❑ 1. Sort the two relations on the same attributes.

❑ 2. Scan and merge both sorted files concurrently, keep in the merged results only those
tuples that appear in both relations.
◼ SET DIFFERENCE R-S (See Figure 19.3e)(keep in the merged results only those
tuples that appear in relation R but not in relation S.)

69
Union: T ← R ∪ S

sort the tuples in R and S using the same unique sort attributes;
set i ← 1, j ← 1;
while (i ≤ n) and (j ≤ m) do
{
    if R(i) > S(j)
    then { output S(j) to T;
           set j ← j + 1
    }
    else if R(i) < S(j)
    then { output R(i) to T;
           set i ← i + 1
    }
    else set j ← j + 1   /* R(i) = S(j), so we skip one of the duplicate tuples */
}
if (i ≤ n) then add tuples R(i) to R(n) to T;
if (j ≤ m) then add tuples S(j) to S(m) to T;

70
Intersection: T ← R ∩ S

sort the tuples in R and S using the same unique sort attributes;
set i ← 1, j ← 1;
while (i ≤ n) and (j ≤ m) do
{
    if R(i) > S(j)
    then set j ← j + 1
    else if R(i) < S(j)
    then set i ← i + 1
    else { output R(i) to T;   /* R(i) = S(j), so we output the tuple */
           set i ← i + 1, j ← j + 1
    }
}
71
Difference: T ← R − S
sort the tuples in R and S using the same unique sort attributes;
set i ← 1, j ← 1;
while (i ≤ n) and (j ≤ m) do
{
    if R(i) > S(j)
    then set j ← j + 1
    else if R(i) < S(j)
    then { output R(i) to T;   /* R(i) has no matching S(j), so output R(i) */
           set i ← i + 1
    }
    else set i ← i + 1, j ← j + 1
}
if (i ≤ n) then add tuples R(i) to R(n) to T;

72
6. Implementing Aggregate Operations
and Outer Joins (1)
Implementing Aggregate Operations:
◼ Aggregate operators : MIN, MAX, SUM, COUNT and AVG

◼ Options to implement aggregate operators:

❑ Table Scan
❑ Index
◼ Example

SELECT MAX(SALARY) FROM EMPLOYEE;

◼ If an (ascending) index on SALARY exists for the employee relation, then the optimizer could
decide on traversing the index for the largest value, which would entail following the right
most pointer in each index node from the root to a leaf.

73
Implementing Aggregate Operations and
Outer Joins (2)
◼ Implementing Aggregate Operations (cont.):
◼ SUM, COUNT and AVG
❑ For a dense index (each record has one index entry):
apply the associated computation to the values in the index.
❑ For a non-dense index: actual number of records associated with each index entry must
be accounted for
◼ With GROUP BY: the aggregate operator must be applied separately to each group of
tuples.
❑ Use sorting or hashing on the group attributes to partition the file into the appropriate
groups;
❑ Compute the aggregate function for the tuples in each group.

◼ What if we have a clustering index on the grouping attributes?

74
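A rough sketch of GROUP BY aggregation using hashing on the grouping attributes; the relation layout, attribute names, and the set of aggregates computed are assumptions made for this example.

# Hash-based grouping, then one aggregation pass per group.
def group_aggregate(R, group_attrs, agg_attr):
    groups = {}
    for t in R:
        key = tuple(t[a] for a in group_attrs)      # partition the tuples by group key
        groups.setdefault(key, []).append(t[agg_attr])
    result = []
    for key, values in groups.items():
        result.append({
            **dict(zip(group_attrs, key)),
            'COUNT': len(values),
            'SUM': sum(values),
            'AVG': sum(values) / len(values),
            'MIN': min(values),
            'MAX': max(values),
        })
    return result

# e.g. group_aggregate(EMPLOYEE, ['DNO'], 'SALARY')
# With a clustering index on the grouping attribute, the tuples of each group are already
# stored together, so the grouping pass can simply read them group by group.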
Implementing Aggregate Operations and
Outer Joins (3)
◼ Implementing Outer Join:
◼ Outer Join Operators : LEFT OUTER JOIN, RIGHT OUTER JOIN and FULL OUTER
JOIN.
◼ The full outer join produces a result which is equivalent to the union of the results of the
left and right outer joins.
◼ Example:
SELECT FNAME, DNAME
FROM ( EMPLOYEE LEFT OUTER JOIN DEPARTMENT ON DNO = DNUMBER);
◼ Note: The result of this query is a table of employee names and their associated
departments. It is similar to a regular join result, with the exception that if an employee
does not have an associated department, the employee's name will still appear in the
resulting table, although the department name would be indicated as null.

75
Implementing Aggregate Operations and
Outer Joins (4)
◼ Implementing Outer Join (cont.):
◼ Modifying Join Algorithms:

Nested Loop or Sort-Merge joins can be modified to implement outer join.


❑ For left outer join, use the left relation as outer loop and construct result from
every tuple in the left relation.
❑ If there is a match, the concatenated tuple is saved in the result.
❑ However, if an outer tuple does not match, then the tuple is still included in the
result but is padded with a null value(s).

76
Implementing Aggregate Operations and
Outer Joins (5)
◼ Implementing Outer Join (cont.):

Executing a combination of relational algebra operators.


Implement the previous left outer join example
1. {Compute the JOIN of the EMPLOYEE and DEPARTMENT tables}
TEMP1 ← πFNAME,DNAME(EMPLOYEE ⋈DNO=DNUMBER DEPARTMENT)
2. {Find the EMPLOYEEs that do not appear in the JOIN}
TEMP2 ← πFNAME(EMPLOYEE) − πFNAME(TEMP1)
3. {Pad each tuple in TEMP2 with a null DNAME field}
TEMP2 ← TEMP2 × 'null'
4. {UNION the temporary tables to produce the LEFT OUTER JOIN result}
RESULT ← TEMP1 ∪ TEMP2
The cost of the outer join, as computed above, would include the cost of the associated steps (i.e., join, projections, set difference, and union).

77
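The modified nested-loop idea can be sketched as follows, using Python None in place of SQL NULL for the padded attributes; the function and attribute names follow the example query but are otherwise illustrative.

# Left outer join of EMPLOYEE with DEPARTMENT on DNO = DNUMBER (modified nested loop).
def left_outer_join(left, right, left_attr, right_attr, right_pad_attrs):
    result = []
    for l in left:                                   # outer loop over the left relation
        matched = False
        for r in right:
            if l[left_attr] == r[right_attr]:
                result.append({**l, **r})            # concatenated tuple saved in the result
                matched = True
        if not matched:                              # unmatched left tuple: pad with nulls
            result.append({**l, **{a: None for a in right_pad_attrs}})
    return result

# e.g. left_outer_join(EMPLOYEE, DEPARTMENT, 'DNO', 'DNUMBER', ['DNAME'])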
7. Combining Operations using Pipelining (1)

◼ Motivation
❑ A query is mapped into a sequence of operations.

❑ Each execution of an operation produces a temporary result.

❑ Generating and saving temporary files on disk is time consuming and


expensive.
◼ Alternative:
❑ Avoid constructing temporary results as much as possible.

❑ Pipeline the data through multiple operations - pass the result of a


previous operator to the next without waiting to complete the previous
operation.

78
Combining Operations using Pipelining (2)
◼ Example: 2-way join, 2 selections on the input files and one final
projection on the resulting file.
◼ Dynamic generation of code to allow for multiple operations to be
pipelined.
◼ Results of a select operation are fed in a "Pipeline " to the join
algorithm.
◼ Also known as stream-based processing.

79
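Pipelined (stream-based) processing can be imitated with Python generators: each operator pulls tuples from the previous one, so no temporary file is materialized. The operators and the commented plan below are illustrative only.

# Each operator is a generator, so tuples flow through the pipeline one at a time.
def scan(relation):
    for t in relation:
        yield t

def select(tuples, predicate):
    for t in tuples:
        if predicate(t):
            yield t

def project(tuples, attrs):
    for t in tuples:
        yield {a: t[a] for a in attrs}

def index_join(tuples, index_on_B, A):
    for t in tuples:
        for s in index_on_B.get(t[A], []):
            yield {**t, **s}

# Pipeline: selection on EMPLOYEE, join with DEPARTMENT, final projection.
# plan = project(index_join(select(scan(EMPLOYEE), lambda t: t['DNO'] == 5),
#                           dept_index_on_DNUMBER, 'DNO'),
#                ['LNAME', 'DNAME'])
# for row in plan: ...   # rows are produced without saving intermediate files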
8. Using Heuristics in Query Optimization(1)

Process for heuristics optimization


◼ 1. The parser of a high-level query generates an initial internal
representation;
◼ 2. Apply heuristics rules to optimize the internal representation.
◼ 3. A query execution plan is generated to execute groups of operations
based on the access paths available on the files involved in the query.
◼ The main heuristic is to apply first the operations that reduce the size
of intermediate results.
E.g., Apply SELECT and PROJECT operations before applying the
JOIN or other binary operations.

80
Using Heuristics in Query Optimization (2)

◼ Query tree : a tree data structure that corresponds to a relational algebra


expression. It represents the input relations of the query as leaf nodes of the
tree, and represents the relational algebra operations as internal nodes.
◼ An execution of the query tree consists of executing an internal node
operation whenever its operands are available and then replacing that
internal node by the relation that results from executing the operation.
◼ Query graph : a graph data structure that corresponds to a relational
calculus expression. It does not indicate an order on which operations to
perform first. There is only a single graph corresponding to each query.

81
Using Heuristics in Query Optimization (3)
◼ Example:
For every project located in ‘Stafford’, retrieve the project number, the
controlling department number and the department manager’s last name,
address and birthdate.
◼ Relational algebra:
πPNUMBER, DNUM, LNAME, ADDRESS, BDATE(((σPLOCATION=‘STAFFORD’(PROJECT)) ⋈DNUM=DNUMBER (DEPARTMENT)) ⋈MGRSSN=SSN (EMPLOYEE))
◼ SQL query :
Q2: SELECT P.PNUMBER, P.DNUM, E.LNAME, E.ADDRESS, E.BDATE
FROM PROJECT AS P, DEPARTMENT AS D, EMPLOYEE AS E
WHERE P.DNUM=D.DNUMBER AND D.MGRSSN=E.SSN AND
P.PLOCATION=‘STAFFORD’; 82
Two query trees for the query Q2

83
Query graph for Q2

84
Using Heuristics in Query Optimization (6)

Heuristic Optimization of Query Trees:

◼ The same query could correspond to many different relational algebra expressions —
and hence many different query trees.

◼ The task of heuristic optimization of query trees is to find a final query tree that is
efficient to execute.

◼ Example :

Q: SELECT LNAME

FROM EMPLOYEE, WORKS_ON,PROJECT

WHERE PNAME = ‘AQUARIUS’ AND PNUMBER=PNO

AND ESSN=SSN AND BDATE > ‘1957-12-31’;


85
Step in converting a query during heuristic optimization.

Step 1: Initial (canonical) query tree for SQL query Q.

86
Step 2: Moving SELECT operations down the query tree.

87
Step 3: Apply more restrictive SELECT operation first

88
Step 4: Replacing Cartesian Product and Select with Join operation.

89
Step 5: Moving Project operations down the query tree

90
Using Heuristics in Query Optimization (10)
General Transformation Rules for Relational Algebra Operations:

◼ 1. Cascade of σ : A conjunctive selection condition can be broken up into a cascade


(sequence) of individual selection operations:
σc1 AND c2 AND ... AND cn(R) = σc1(σc2(...( σcn(R))...) )

◼ 2.Commutativity of σ : The σ operation is commutative:


σc1(σc2(R)) = σc2(σc1(R))

◼ 3. Cascade of π : In a cascade (sequence) of π operations, all but the last one can be
ignored:
πList1(π List2(...( πListn (R))...) ) = π List1(R)

◼ 4. Commuting σ with π : If the selection condition c involves only the attributes A1, ..., An in
the projection list, the two operations can be commuted:
πA1, A2, ..., An(σc(R)) = σc(πA1, A2, ..., An(R))

91
Using Heuristics in Query Optimization (11)

General Transformation Rules for Relational Algebra Operations:


(cont.):

◼ 5. Commutativity of ⋈ (and ×): The ⋈ operation is commutative, as is the × operation:
R ⋈ S = S ⋈ R;  R × S = S × R

◼ 6. Commuting σ with ⋈ (or ×): If all the attributes in the selection condition c involve only the attributes of one of the relations being joined - say, R - the two operations can be commuted as follows:
σc( R ⋈ S ) = (σc(R)) ⋈ S

◼ Alternatively, if the selection condition c can be written as (c1 AND c2), where condition c1 involves only the attributes of R and condition c2 involves only the attributes of S, the operations commute as follows:
σc( R ⋈ S ) = (σc1(R)) ⋈ (σc2(S))


92
Using Heuristics in Query Optimization (12)

General Transformation Rules for Relational Algebra Operations (cont.):

◼ 7. Commuting π with ⋈ (or ×): Suppose that the projection list is L = {A1, ..., An, B1, ..., Bm}, where A1, ..., An are attributes of R and B1, ..., Bm are attributes of S. If the join condition c involves only attributes in L, the two operations can be commuted as follows:
πL( R ⋈C S ) = (πA1, ..., An(R)) ⋈C (πB1, ..., Bm(S))

◼ If the join condition c contains additional attributes not in L, these must be added to the projection list, and a final π operation is needed.

93
Using Heuristics in Query Optimization (13)

General Transformation Rules for Relational Algebra Operations (cont.):

◼ 8. Commutativity of set operations: The set operations ∪ and ∩ are commutative, but – is not.

◼ 9. Associativity of ⋈, ×, ∪, and ∩: These four operations are individually associative; that is, if θ stands for any one of these four operations (throughout the expression), we have ( R θ S ) θ T = R θ ( S θ T )

◼ 10. Commuting σ with set operations: The σ operation commutes with ∪, ∩, and –. If θ stands for any one of these three operations, we have
σc ( R θ S ) = (σc (R)) θ (σc (S))

94
Using Heuristics in Query Optimization (14)

General Transformation Rules for Relational Algebra Operations (cont.):

◼ 11. The π operation commutes with ∪:
πL( R ∪ S ) = (πL(R)) ∪ (πL(S))

◼ 12. Converting a (σ, ×) sequence into ⋈: If the condition c of a σ that follows a × corresponds to a join condition, convert the (σ, ×) sequence into a ⋈ as follows:
(σC(R × S)) = (R ⋈C S)

◼ 13. Other transformations

95
Using Heuristics in Query Optimization (15)

Outline of a Heuristic Algebraic Optimization Algorithm


◼ 1. Using rule 1, break up any select operations with conjunctive conditions
into a cascade of select operations.
◼ 2. Using rules 2, 4, 6, and 10 concerning the commutativity of select with
other operations, move each select operation as far down the query tree as is
permitted by the attributes involved in the select condition.
◼ 3. Using rule 9 concerning associativity of binary operations, rearrange the
leaf nodes of the tree so that the leaf node relations with the most restrictive
select operations are executed first in the query tree representation.
◼ 4. Using Rule 12, combine a cartesian product operation with a subsequent
select operation in the tree into a join operation.

96
Using Heuristics in Query Optimization (16)

Outline of a Heuristic Algebraic Optimization Algorithm (cont.)

◼ 5. Using rules 3, 4, 7, and 11 concerning the cascading of project and


the commuting of project with other operations, break down and move
lists of projection attributes down the tree as far as possible by creating
new project operations as needed.

◼ 6. Identify subtrees that represent groups of operations that can be


executed by a single algorithm.

97
Using Heuristics in Query Optimization (17)

Summary of Heuristics for Algebraic Optimization:


◼ 1.The main heuristic is to apply first the operations that reduce the size of
intermediate results.
◼ 2. Perform select operations as early as possible to reduce the number of
tuples and perform project operations as early as possible to reduce the
number of attributes. (This is done by moving select and project operations
as far down the tree as possible.)
◼ 3. The select and join operations that are most restrictive should be executed
before other similar operations. (This is done by reordering the leaf nodes of
the tree among themselves and adjusting the rest of the tree appropriately.)

98
Using Heuristics in Query Optimization (18)

Query Execution Plans


◼ An execution plan for a relational algebra query consists of a combination of
the relational algebra query tree and information about the access methods
to be used for each relation as well as the methods to be used in computing
the relational operators stored in the tree.

◼ Materialized evaluation: The result of an operation is stored as a temporary


relation.

◼ Pipelined evaluation: as the result of an operator is produced, it is


forwarded to the next operator in sequence.

99
9. Using Selectivity and Cost Estimates in
Query Optimization (1)
◼ Cost-based query optimization: Estimate and compare the costs of
executing a query using different execution strategies and choose the
strategy with the lowest cost estimate.

◼ Issues
❑ Cost function
❑ Number of execution strategies to be considered

100
Using Selectivity and Cost Estimates in Query Optimization (2)

Cost Components for Query Execution


◼ 1. Access cost to secondary storage
◼ 2. Storage cost

◼ 3. Computation cost

◼ 4. Memory usage cost


◼ 5. Communication cost

Note: Different database systems may focus on different cost


components.

101
Using Selectivity and Cost Estimates in Query
Optimization (3)
Catalog Information Used in Cost Functions
◼ Information about the size of a file

❑ number of records (tuples) (r),

❑ record size (R),

❑ number of blocks (b)

❑ blocking factor (bfr)

◼ Information about indexes and indexing attributes of a file


❑ Number of levels (x) of each multilevel index
❑ Number of first-level index blocks (bI1)

❑ Number of distinct values (d) of an attribute


❑ Selectivity (sl) of an attribute

❑ Selection cardinality (s) of an attribute. (s = sl * r)


102
Using Selectivity and Cost Estimates in Query Optimization (4)

Examples of Cost Functions for SELECT


◼ S1. Linear search (brute force) approach
CS1a = b;
For an equality condition on a key, CS1b = (b/2) if the record is found;
otherwise CS1b = b.
◼ S2. Binary search :
CS2= log2b + ┌ (s/bfr) ┐- 1
For an equality condition on a unique (key) attribute,
CS2 =log2b
◼ S3. Using a primary index (S3a) or hash key (S3b) to retrieve a single
record
CS3a= x + 1; CS3b = 1 for static or linear hashing;
CS3b = 2 for extendible hashing;
103
Using Selectivity and Cost Estimates in Query Optimization (5)
Examples of Cost Functions for SELECT (cont.)

◼ S4. Using an ordering index to retrieve multiple records:


For the comparison condition on a key field with an ordering index, CS4= x +
(b/2)
◼ S5. Using a clustering index to retrieve multiple records for an equality
condition:
CS5= x + ┌ (s/bfr) ┐
◼ S6. Using a secondary (B+-tree) index:
For an equality comparison, CS6a= x + s (option 1 & 2);
CS6a= x + s + 1 (option 3);
For a comparison condition such as >, <, >=, or <=,
CS6b= x + (bI1/2) + (r/2) 104
Using Selectivity and Cost Estimates in Query
Optimization (6)
Examples of Cost Functions for SELECT (cont.)

◼ S7. Conjunctive selection:


Use either S1 or one of the methods S2 to S6 to solve.
For the latter case, use one condition to retrieve the records and then check
in the memory buffer whether each retrieved record satisfies the remaining
conditions in the conjunction.

◼ S8. Conjunctive selection using a composite index:


Same as S3a, S5 or S6a, depending on the type of index.

105
Example
◼ rE = 10,000 , bE = 2000 , bfrE = 5
◼ Access paths:
❑ 1. A clustering index on SALARY, with levels xSALARY = 3 and average
selection cardinality SSALARY = 20.
❑ 2. A secondary index on the key attribute SSN, with xSSN = 4 (SSSN = 1).
❑ 3. A secondary index on the nonkey attribute DNO, with xDNO= 2 and first-
level index blocks bI1DNO= 4. There are dDNO = 125 distinct values for DNO,
so the selection cardinality of DNO is SDNO = (r/dDNO) = 80.
❑ 4. A secondary index on SEX, with xSEX = 1. There are dSEX = 2 values for
the sex attribute, so the average selection cardinality is SSEX = (r/dSEX) =
5000.
106
Example

◼ (op1): σSSN='123456789' (EMPLOYEE)

❑ CS1b = 1000
❑ CS6a = xSSN + 1 = 4+1 = 5

◼ (op2): σDNO>5 (EMPLOYEE)

❑ CS1a = 2000
❑ CS6b = xDNO + (bl1DNO/2) + (r/2) = 2 + 4/2 + 10000/2 = 5004

107
Example
◼ (op3): σDNO=5 (EMPLOYEE)

❑ CS1a = 2000
❑ CS6a = xDNO + sDNO = 2 + 80 = 82 (option 1 & 2)
❑ CS6a = xDNO + sDNO + 1 = 2 + 80 + 1= 83 (option 3)

◼ (op4): σDNO=5 AND SALARY>30000 AND SEX='F' (EMPLOYEE)


❑ CS6a-DNO = 82 (or 83)
❑ CS4-SALARY = xSALARY + (b/2) = 3 + 2000/2 = 1003
❑ CS6a-SEX = xSEX + sSEX = 1 + 5000 = 5001
❑ (CS6a-SEX = xSEX + sSEX + 1 = 5002 )

❑ => choose DNO = 5 first and check the other conditions.

108
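As a cross-check of the hand calculations above, the cost formulas can be evaluated directly; the helper function names are made up for this sketch.

# Catalog numbers from the EMPLOYEE example above.
b_E = 2000
x_SSN, s_SSN = 4, 1
x_DNO, s_DNO = 2, 80
x_SALARY = 3

def cost_linear_equality_on_key(b):          # S1: record found on average after b/2 blocks
    return b // 2

def cost_secondary_index_equality(x, s):     # S6a, option 1 & 2
    return x + s

def cost_ordering_index_range(x, b):         # S4: comparison condition on an ordered field
    return x + b // 2

print(cost_linear_equality_on_key(b_E))              # (op1) via S1: 1000
print(cost_secondary_index_equality(x_SSN, s_SSN))   # (op1) via S6a: 5
print(cost_secondary_index_equality(x_DNO, s_DNO))   # (op3) via S6a: 82
print(cost_ordering_index_range(x_SALARY, b_E))      # SALARY > 30000 via S4: 1003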
Using Selectivity and Cost Estimates in Query
Optimization (7)
Examples of Cost Functions for JOIN

◼ Join selectivity (js)


js = | R ⋈C S | / | R × S | = | R ⋈C S | / (|R| * |S|)

If condition C does not exist, js = 1;
If no tuples from the relations satisfy condition C, js = 0;
Usually, 0 <= js <= 1;

Size of the result file after the join operation:
| R ⋈C S | = js * |R| * |S|

109
Using Selectivity and Cost Estimates in Query
Optimization (8)
Examples of Cost Functions for JOIN (cont.)

◼ J1. Nested-loop join:


CJ1 = bR+ (bR*bS) + ((js* |R|* |S|)/bfrRS)
(Use R for outer loop)

◼ J2. Single-loop join(using an access structure to retrieve the matching record(s))


If an index exists for the join attribute B of S with index levels
xB, we can retrieve each record s in R and then use the index to retrieve all the matching
records t from S that satisfy t[B] = s[A].
The cost depends on the type of index.

110
Using Selectivity and Cost Estimates in Query
Optimization (9)
Examples of Cost Functions for JOIN (cont.)
◼ J2. Single-loop join (cont.)

For a secondary index,


CJ2a = bR + (|R| * (xB+ sB)) + ((js* |R|* |S|)/bfrRS) (option 1 &2);
CJ2a = bR + (|R| * (xB+ sB + 1 )) + ((js* |R|* |S|)/bfrRS) (option 3);
For a clustering index,
CJ2b = bR + (|R| * (xB+ (sB/bfrB))) + ((js* |R|* |S|)/bfrRS);
For a primary index,
CJ2c = bR + (|R| * (xB+ 1)) + ((js* |R|* |S|)/bfrRS);
If a hash key exists for one of the two join attributes — B of S
CJ2d = bR + (|R| * h) + ((js* |R|* |S|)/bfrRS);
h: the average number of block accesses to retrieve a record, given its hash key value, h>=1
◼ J3. Sort-merge join:
CJ3a = CS + bR+ bS + ((js* |R|* |S|)/bfrRS); (CS: Cost for sorting files)
111
Example

◼ Suppose that we have the EMPLOYEE file described in the previous


example
◼ Assume that the DEPARTMENT file has rD = 125 and bD = 13, a primary index on DNUMBER with xDNUMBER = 1, and a secondary index on MGRSSN with sMGRSSN = 1 and xMGRSSN = 2; jsOP6 = (1/|DEPARTMENT|) = 1/125, bfrED = 4

◼ (op8): EMPLOYEE ⋈DNO=DNUMBER DEPARTMENT

◼ (op9): DEPARTMENT ⋈MGRSSN=SSN EMPLOYEE

112
Example
DEPARTMENT: rD = 125, bD = 13, primary index on DNUMBER of DEPARTMENT with xDNUMBER = 1,
jsOP6 = (1/|DEPARTMENT|) = 1/rD = 1/125, bfrED = 4
EMPLOYEE: rE = 10000, bE = 2000, secondary index on the nonkey attribute DNO with xDNO = 2, sDNO = 80.
◼ (op8): EMPLOYEE ⋈DNO=DNUMBER DEPARTMENT
❑ Method J1 with Employee as outer:
◼ CJ1 = bE + (bE * bD) + ((jsOP6 * rE * rD)/bfrED)
◼ = 2000 + (2000 * 13) + (((1/125) * 10,000 * 125)/4) =30,500
❑ Method J1 with Department as outer:
◼ CJ1 = bD + (bE * bD) + (((jsOP6 * rE * rD)/bfrED)
◼ = 13 + (13 * 2000) + (((1/125) * 10,000 * 125/4) = 28,513
❑ Method J2 with EMPLOYEE as outer loop:
◼ CJ2c = bE + (rE * (xDNUMBER + 1)) + ((jsOP6 * rE * rD)/bfrED
◼ = 2000 + (10,000 * 2) + (((1/125) * 10,000 * 125/4) = 24,500
❑ Method J2 with DEPARTMENT as outer loop:
◼ CJ2a = bD + (rD * (xDNO+ sDNO)) + ((jsOP6 * rE * rD)/bfrED) (option 1 & 2)
◼ = 13 + (125 * (2 + 80)) + (((1/125) * 10,000 * 125/4) = 12,763
◼ CJ2a = bD + (rD * (xDNO+ sDNO + 1)) + ((jsOP6 * rE * rD)/bfrED) (option 3)
◼ = 13 + (125 * (2 + 80 + 1)) + (((1/125) * 10,000 * 125/4) = 12,888
113
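The (op8) comparison can likewise be scripted straight from the J1/J2 cost functions; the numbers reproduce the hand calculation above, and the function names are illustrative.

# Catalog numbers from the (op8) example.
r_E, b_E = 10000, 2000
r_D, b_D = 125, 13
x_DNUMBER = 1            # primary index on DEPARTMENT.DNUMBER
x_DNO, s_DNO = 2, 80     # secondary index on EMPLOYEE.DNO
js, bfr_ED = 1 / 125, 4
result_blocks = (js * r_E * r_D) / bfr_ED        # blocks written for the join result

def cj1(b_outer, b_inner):                        # J1: nested-loop join
    return b_outer + b_outer * b_inner + result_blocks

def cj2_primary(b_outer, r_outer, x):             # J2 with a primary index on the inner file
    return b_outer + r_outer * (x + 1) + result_blocks

def cj2_secondary(b_outer, r_outer, x, s):        # J2 with a secondary index (option 1 & 2)
    return b_outer + r_outer * (x + s) + result_blocks

print(round(cj1(b_E, b_D)))                       # EMPLOYEE outer: 30500
print(round(cj1(b_D, b_E)))                       # DEPARTMENT outer: 28513
print(round(cj2_primary(b_E, r_E, x_DNUMBER)))    # EMPLOYEE outer, index on DNUMBER: 24500
print(round(cj2_secondary(b_D, r_D, x_DNO, s_DNO)))  # DEPARTMENT outer, index on DNO: 12763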
Example
DEPARTMENT: rD = 125, bD = 13, secondary index on MGRSSN of DEPARTMENT with sMGRSSN = 1 and xMGRSSN = 2,
jsOP7 = (1/|EMPLOYEE|) = 1/rE = 1/10,000, bfrED = 4
EMPLOYEE: rE = 10000, bE = 2000, secondary index on the key attribute SSN, with xSSN = 4 (sSSN = 1).

◼ (op9): DEPARTMENT ⋈MGRSSN=SSN EMPLOYEE


❑ Method J1 with Employee as outer:
◼ CJ1 = bE + (bE * bD) + ((jsOP7 * rE * rD)/bfrED)
◼ = 2000 + (2000 * 13) + (((1/10,000) * 10,000 * 125)/4) = ┌28,031.25┐ = 28,032
❑ Method J1 with Department as outer:
◼ CJ1 = bD + (bE * bD) + (((jsOP7 * rE * rD)/bfrED)
◼ = 13 + (13 * 2000) + (((1/10,000) * 10,000 * 125/4) = ┌ 26,044.25 ┐ = 26,045
❑ Method J2 with EMPLOYEE as outer loop:
◼ CJ2a = bE + (rE * (xMGRSSN + sMGRSSN)) + ((jsOP7 * rE * rD)/bfrED) (option 1 & 2)
◼ = 2000 + (10,000 * (2+1)) + (((1/10,000) * 10,000 * 125/4) = ┌ 32,031.25 ┐ = 32,032
◼ CJ2a = bE + (rE * (xMGRSSN + sMGRSSN + 1)) + ((jsOP7 * rE * rD)/bfrED) = 42,032 (option 3)
❑ Method J2 with DEPARTMENT as outer loop:
◼ CJ2a = bD + (rD * (xSSN+ sSSN)) + ((jsOP7 * rE * rD)/bfrED) (option 1 & 2)
◼ = 13 + (125 * (4 + 1)) + (((1/10,000) * 10,000 * 125/4) = ┌ 669.25 ┐ = 670
◼ CJ2a = bD + (rD * (xSSN+ sSSN +1)) + ((jsOP7 * rE * rD)/bfrED) = 795 (option 3) 114
Using Selectivity and Cost Estimates in Query
Optimization (10)
Multiple Relation Queries and Join Ordering
◼ A query joining n relations will have n-1 join operations, and hence can
have a large number of different join orders when we apply the
algebraic transformation rules.
◼ Current query optimizers typically limit the structure of a (join) query
tree to that of left-deep (or right-deep) trees.
◼ Left-deep tree : a binary tree where the right child of each non-leaf
node is always a base relation.
❑ Amenable to pipelining
❑ Could utilize any access paths on the base relation (the right child) when
executing the join.

115
Using Selectivity and Cost Estimates in Query
Optimization (11)
◼ Example: 2 left-deep trees

116
10. Overview of Query Optimization in Oracle

◼ Oracle DBMS V8
❑ Rule-based query optimization: the optimizer chooses execution plans based
on heuristically ranked operations.
◼ (Currently it is being phased out)
❑ Cost-based query optimization: the optimizer examines alternative access
paths and operator algorithms and chooses the execution plan with the lowest
estimated cost.
◼ The query cost is calculated based on the estimated usage of resources such as I/O,
CPU and memory needed.
❑ Application developers could specify hints to the ORACLE query optimizer.
❑ The idea is that an application developer might know more information about the
data.

117
11. Semantic Query Optimization
◼ Semantic Query Optimization:
❑ Uses constraints specified on the database schema in order to modify one query into
another query that is more efficient to execute.

◼ Consider the following SQL query,


SELECT E.LNAME, M.LNAME
FROM EMPLOYEE AS E, EMPLOYEE AS M
WHERE E.SUPERSSN=M.SSN AND E.SALARY>M.SALARY

◼ Explanation:
❑ Suppose that we had a constraint on the database schema that stated that no employee
can earn more than his or her direct supervisor. If the semantic query optimizer checks for
the existence of this constraint, it need not execute the query at all because it knows that
the result of the query will be empty. Techniques known as theorem proving can be used
for this purpose.

118
120
Chapter 5

Introduction to Transaction
Processing Concepts and Theory

1
Chapter Outline
◼ Introduction to Transaction Processing

◼ Transaction and System Concepts


◼ Desirable Properties of Transactions

◼ Characterizing Schedules based on Recoverability

◼ Characterizing Schedules based on Serializability

◼ Transaction Support in SQL

2
1. Introduction to Transaction Processing (1)

◼ Single-User System: At most one user at a time can use


the system.
◼ Multiuser System: Many users can access the system
concurrently.
◼ Concurrency
❑ Interleaved processing: concurrent execution of
processes is interleaved in a single CPU
❑ Parallel processing: processes are concurrently executed
in multiple CPUs.

3
Introduction to Transaction Processing (2)

◼ A Transaction: logical unit of database processing


that includes one or more access operations (read -retrieval,
write - insert or update, delete).
◼ A transaction (set of operations) may be stand-alone, specified in a high-level language like SQL and submitted interactively, or it may be embedded within an application program.
◼ Transaction boundaries: Begin and End transaction.
◼ An application program may contain several
transactions separated by the Begin and End
transaction boundaries.

4
Introduction to Transaction Processing (3)

SIMPLE MODEL OF A DATABASE (for purposes


of discussing transactions):

◼ A database - collection of named data items


◼ Granularity of data - a field, a record , or a whole disk
block (Concepts are independent of granularity)
◼ Basic operations are read and write
❑ read_item(X): Reads a database item named X into a
program variable. To simplify our notation, we assume
that the program variable is also named X.
❑ write_item(X): Writes the value of program variable X
into the database item named X.

5
Introduction to Transaction Processing (4)
READ AND WRITE OPERATIONS:
◼ Basic unit of data transfer from the disk to the computer
main memory is one block.
◼ Data item (what is read or written):
❑ the field of some record in the database,
❑ a larger unit such as a record or even a whole block.
◼ read_item(X) command includes the following
steps:
1. Find the address of the disk block that contains item X.
2. Copy that disk block into a buffer in main memory (if that
disk block is not already in some main memory buffer).
3. Copy item X from the buffer to the program variable
named X.

6
Introduction to Transaction Processing (5)

READ AND WRITE OPERATIONS (cont.):


◼ write_item(X) command includes the following
steps:
1. Find the address of the disk block that contains item
X.
2. Copy that disk block into a buffer in main memory (if
that disk block is not already in some main memory
buffer).
3. Copy item X from the program variable named X
into its correct location in the buffer.
4. Store the updated block from the buffer back to disk
(either immediately or at some later point in time).

7
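A toy model of the read_item(X)/write_item(X) steps above, assuming a 'disk' of blocks keyed by block address, a main-memory buffer pool, and a directory giving the block that holds each item; every name here is invented for illustration.

# Toy model of read_item(X) / write_item(X) over blocks and a buffer pool.
disk = {0: {'X': 5, 'Y': 12}, 1: {'Z': 7}}     # block address -> block contents
directory = {'X': 0, 'Y': 0, 'Z': 1}           # item name -> address of the block holding it
buffer_pool = {}                               # main-memory copies of blocks

def read_item(name):
    block_addr = directory[name]               # 1. find the disk block containing the item
    if block_addr not in buffer_pool:          # 2. copy the block into a buffer if needed
        buffer_pool[block_addr] = dict(disk[block_addr])
    return buffer_pool[block_addr][name]       # 3. copy the item into the program variable

def write_item(name, value, flush=True):
    block_addr = directory[name]               # 1-2. locate and buffer the block
    if block_addr not in buffer_pool:
        buffer_pool[block_addr] = dict(disk[block_addr])
    buffer_pool[block_addr][name] = value      # 3. copy the program variable into the buffer
    if flush:                                  # 4. store the updated block back to disk
        disk[block_addr] = dict(buffer_pool[block_addr])

X = read_item('X'); write_item('X', X - 1)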
Two sample transactions. (a) Transaction T1.
(b) Transaction T2.

8
Introduction to Transaction Processing (7)

Why Concurrency Control is needed:


◼ The Lost Update Problem.
This occurs when two transactions that access the
same database items have their operations
interleaved in a way that makes the value of some
database item incorrect.
◼ The Temporary Update (or Dirty Read) Problem.
This occurs when one transaction updates a database
item and then the transaction fails for some reason.
The updated item is accessed by another transaction
before it is changed back to its original value.

9
Some problems that occur when concurrent execution
is uncontrolled. (a) The lost update problem.

10
Some problems that occur when concurrent execution
is uncontrolled. (b) The temporary update problem.

11
Introduction to Transaction Processing (8)
Why Concurrency Control is needed (cont.):
◼ The Incorrect Summary Problem .
If one transaction is calculating an aggregate summary
function on a number of records while other transactions
are updating some of these records, the aggregate
function may calculate some values before they are
updated and others after they are updated.

◼ The unrepeatable Read Problem:


Transaction T reads the same item twice and the item is
changed by another transaction T’ between the two
reads.
12
Some problems that occur when concurrent execution is
uncontrolled. (c) The incorrect summary problem.

13
Introduction to Transaction Processing (11)
Why recovery is needed:
(What causes a Transaction to fail)
1. A computer failure (system crash): A hardware or
software error occurs in the computer system during
transaction execution. If the hardware crashes, the
contents of the computer’s internal memory may be
lost.
2. A transaction or system error : Some operation in the
transaction may cause it to fail, such as integer overflow
or division by zero. Transaction failure may also occur
because of erroneous parameter values or because of
a logical programming error. In addition, the user may
interrupt the transaction during its execution.

14
Introduction to Transaction Processing (12)
Why recovery is needed (cont.):
3. Local errors or exception conditions detected by the
transaction:
- certain conditions necessitate cancellation of the
transaction. For example, data for the transaction may not
be found. A condition, such as insufficient account balance
in a banking database, may cause a transaction, such as a
fund withdrawal from that account, to be canceled.
- a programmed abort in the transaction causes it to fail.
4. Concurrency control enforcement: The concurrency
control method may decide to abort the transaction, to be
restarted later, because it violates serializability or because
several transactions are in a state of deadlock (see
Chapter 5).

15
Introduction to Transaction Processing (13)

Why recovery is needed (cont.):


5. Disk failure: Some disk blocks may lose their data
because of a read or write malfunction or because of
a disk read/write head crash. This may happen during
a read or a write operation of the transaction.
6. Physical problems and catastrophes: This refers
to an endless list of problems that includes power or
air-conditioning failure, fire, theft, sabotage,
overwriting disks or tapes by mistake, and mounting
of a wrong tape by the operator.

16
2 . Transaction and System Concepts (1)

◼ A transaction is an atomic unit of work that is either


completed in its entirety or not done at all. For recovery
purposes, the system needs to keep track of when the
transaction starts, terminates, and commits or aborts.

◼ Transaction states:
❑ Active state
❑ Partially committed state
❑ Committed state
❑ Failed state
❑ Terminated State

17
State transition diagram illustrating the states for
transaction execution.

18
Transaction and System Concepts (2)
Recovery manager keeps track of the following
operations:
◼ begin_transaction: This marks the beginning of
transaction execution.
◼ read or write: These specify read or write operations on
the database items
◼ end_transaction:
❑ This specifies that read and write transaction
operations have ended and marks the end limit of
transaction execution.
❑ may be necessary to check whether the changes
introduced by the transaction can be permanently
applied to the database or whether the transaction has
to be aborted because it violates concurrency control or
for some other reason.
19
Transaction and System Concepts (3)
Recovery manager keeps track of the following
operations (cont):
◼ commit_transaction: This signals a successful end of
the transaction so that any changes (updates) executed
by the transaction can be safely committed to the
database and will not be undone.
◼ rollback (or abort): This signals that the transaction
has ended unsuccessfully, so that any changes or
effects that the transaction may have applied to the
database must be undone.

20
Transaction and System Concepts (4)

Recovery techniques use the following operators:


◼ undo: Similar to rollback except that it applies to a
single operation rather than to a whole transaction.
◼ redo: This specifies that certain transaction
operations must be redone to ensure that all the
operations of a committed transaction have been
applied successfully to the database.

21
Transaction and System Concepts (5)
The System Log
◼ Log or Journal :
❑ The log keeps track of all transaction operations that
affect the values of database items.
❑ This information may be needed to permit recovery
from transaction failures.
❑ The log is kept on disk → not affected by any type of
failure except for disk or catastrophic failure.
❑ In addition, the log is periodically backed up to
archival storage (tape) to guard against such
catastrophic failures.

22
Transaction and System Concepts (6)

The System Log (cont):


Types of log record:
1. [start_transaction,T]: Records that transaction T has started
execution.
2. [write_item,T,X,old_value,new_value]: Records that
transaction T has changed the value of database item X from
old_value to new_value.
3. [read_item,T,X]: Records that transaction T has read the
value of database item X.
4. [commit,T]: Records that transaction T has completed
successfully, and affirms that its effect can be committed
(recorded permanently) to the database.
5. [abort,T]: Records that transaction T has been aborted.

23
Transaction and System Concepts (7)

The System Log (cont):


◼ Protocols for recovery that avoid cascading
rollbacks do not require that READ operations be
written to the system log, whereas other protocols
require these entries for recovery.

◼ Strict protocols require simpler WRITE entries that


do not include new_value.

24
Transaction and System Concepts (8)

Recovery using log records:


◼ If the system crashes, we can recover to a consistent database
state by examining the log and using one of the techniques
described in Chapter 7.
◼ Because the log contains a record of every write operation that
changes the value of some database item, it is possible to undo
the effect of these write operations of a transaction T by tracing
backward through the log and resetting all items changed by a
write operation of T to their old_values.
◼ We can also redo the effect of the write operations of a
transaction T by tracing forward through the log and setting all
items changed by a write operation of T (that did not get done
permanently) to their new_values.

25
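A schematic sketch of undoing and redoing write operations from log entries shaped like those listed above; the in-memory database and the sample log records are purely illustrative.

# Log-based UNDO (backward scan) and REDO (forward scan) of a transaction's writes.
db = {'X': 80, 'Y': 50}
log = [
    ['start_transaction', 'T1'],
    ['write_item', 'T1', 'X', 100, 80],   # old_value = 100, new_value = 80
    ['write_item', 'T1', 'Y', 60, 50],
]                                          # no [commit, T1] entry, so T1 must be rolled back

def undo(log, db, T):
    for rec in reversed(log):              # trace backward through the log
        if rec[0] == 'write_item' and rec[1] == T:
            _, _, item, old_value, _ = rec
            db[item] = old_value           # reset the item to its old_value

def redo(log, db, T):
    for rec in log:                        # trace forward through the log
        if rec[0] == 'write_item' and rec[1] == T:
            _, _, item, _, new_value = rec
            db[item] = new_value           # reapply the new_value

undo(log, db, 'T1')                        # db becomes {'X': 100, 'Y': 60}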
Transaction and System Concepts (9)

Commit Point of a Transaction:


◼ Definition: A transaction T reaches its commit point when
❑ all its operations that access the database have been
executed successfully and
❑ the effect of all the transaction operations on the
database has been recorded in the log.
◼ Beyond the commit point, the transaction is said to be
committed, and its effect is assumed to be permanently
recorded in the database. The transaction then writes an entry
[commit,T] into the log.
◼ Roll Back of transactions: Needed for transactions that have
a [start_transaction,T] entry into the log but no commit entry
[commit,T] into the log.

26
3. Desirable Properties of Transactions (1)

ACID properties:
◼ Atomicity: A transaction is an atomic unit of
processing; it is either performed in its entirety
or not performed at all.

◼ Consistency preservation: A correct execution


of the transaction must take the database from
one consistent state to another.

27
Desirable Properties of Transactions (2)

ACID properties (cont.):


◼ Isolation: A transaction should appear as though it is
being executed in isolation from other transaction. That
is, the execution of a transaction should not be
interfered with by any other transaction executing
concurrently.

◼ Durability or permanency: Once a transaction


changes the database and the changes are committed,
these changes must never be lost because of
subsequent failure.

28
4. Characterizing Schedules based on
Recoverability (1)
◼ Transaction schedule or history:
❑ When transactions are executing concurrently in an
interleaved fashion
❑ The order of execution of operations from the various
transactions forms → a transaction schedule (or history).

◼ A schedule (or history) S of n transactions T1, T2, ..., Tn :


❑ Constraint: for each transaction Ti that participates in S, the operations of Ti in S must appear in the same order in which they occur in Ti.
❑ However, operations from other transactions Tj can be interleaved with the operations of Ti in S.

29
Characterizing Schedules based on
Recoverability (2)
◼ Notation:

Notation Description
ri(X) read_item(X) - transaction Ti
wi(X) write_item(X) - transaction Ti
ci commit - transaction Ti
ai abort - transaction Ti
Characterizing Schedules based on
Recoverability (3)
◼ Example (1):

❑ Sa: r1(X); r2(X); w1(X); r1(Y); w2(X); w1(Y);


Characterizing Schedules based on
Recoverability (4)
◼ Example (2):

❑ Sb: r1(X); w1(X); r2(X); w2(X); r1(Y); a1;   /* a1: T1 aborts */


Characterizing Schedules based on
Recoverability (5)
◼ Two operations in a schedule are said to conflict if they satisfy all three of the following conditions:
❑ (1) they belong to different transactions;
❑ (2) they access the same item X;
❑ (3) at least one of the operations is a write_item(X).
Characterizing Schedules based on
Recoverability (6)
◼ Example (1):
❑ conflict:
◼ r1(X) and w2(X)
◼ r2(X) and w1(X)
◼ w1(X) and w2(X)

❑ Not conflict:
◼ r1(X) and r2(X)
◼ r1(Y) and w2(X)
◼ r1(X) and w1(X)
◼ …
Characterizing Schedules based on
Recoverability (7)
◼ Example (2):
❑ Sb: r1(X); w1(X); r2(X); w2(X); r1(Y); a1;

❑ Conflict:
◼ r1(X) and w2(X)
◼ w1(X) and r2(X)
◼ w1(X) and w2(X)
Characterizing Schedules based on
Recoverability (8)
Schedules classified on recoverability:
◼ Recoverable schedule: A schedule S is recoverable if
no transaction T in S commits until all transactions T’ that have
written an item that T reads have committed.

◼ Cascadeless schedule: One where every transaction


reads only the items that are written by committed transactions.
Schedules requiring cascaded rollback: A schedule in which
uncommitted transactions that read an item from a failed
transaction must be rolled back.

36
Characterizing Schedules based on
Recoverability (9)
Schedules classified on recoverability (cont.):
◼ Strict Schedules: A schedule in which a
transaction can neither read or write an item X until
the last transaction that wrote X has committed.

37
Characterizing Schedules based on
Recoverability (10)
◼ Example of Recoverable schedule :
Sa': r1(X); r2(X); w1(X); r1(Y); w2(X); c2; w1(Y); c1;
❑ Lost update

◼ Example of Nonrecoverable schedule :


Sc: r1(X); w1(X); r2(X); r1(Y); w2(X); c2; a1;
Characterizing Schedules based on
Recoverability (11)
◼ Example of Cascading Rollback:
Sd: r1(X); w1(X); r2(X); r1(Y); w2(X); w1(Y); c1; c2;
❑ Sd : Recoverable schedule

◼ Example of Cascadeless schedule:


S’d: r1(X); w1(X); r1(Y); w1(Y); c1; r2(X); w2(X); c2;

◼ Example of Strict schedule


Sf: r1(X); r2(Z); r1(Z); r3(X); r3(Y); w1(X); c1; w3(Y); c3; r2(Y); w2(Z); r2(Y); c2;
5. Characterizing Schedules based on
Serializability (1)
◼ Serial schedule: A schedule S is serial if, for every
transaction T participating in the schedule, all the
operations of T are executed consecutively in the
schedule. Otherwise, the schedule is called
nonserial schedule.
◼ Serializable schedule: A schedule S is
serializable if it is equivalent to some serial
schedule of the same n transactions.

40
Characterizing Schedules based on
Serializability (2)

Serial Schedules:
(A) T1 followed by T2 (B) T2 followed by T1
Characterizing Schedules based on
Serializability (3)

Two nonserial Schedules


Characterizing Schedules based on
Serializability (4)
◼ Result equivalent: Two schedules are called result
equivalent if they produce the same final state of
the database.
◼ Conflict equivalent: Two schedules are said to be
conflict equivalent if the order of any two conflicting
operations is the same in both schedules.
❑ Two operations in a schedule are said to conflict if they
belong to different transactions, access the same data
item, and at least one of the two operations is a write_item
operation.

43
Characterizing Schedules based on
Serializability (5)
◼ Conflict serializable: A schedule S is said to be
conflict serializable if it is conflict equivalent to some
serial schedule S’.
❑ In such a case, we can reorder the nonconflicting
operations in S until we form the equivalent serial schedule
S’.

44
Characterizing Schedules based on
Serializability (6)
◼ Being serializable is not the same as being
serial

◼ Being serializable implies that the schedule is a


correct schedule.
❑ It will leave the database in a consistent state.
❑ The interleaving is appropriate and will result in a
state as if the transactions were serially executed, yet
will achieve efficiency due to concurrent execution.

45
Characterizing Schedules based on
Serializability (7)
◼ Serializability is hard to check.
❑ Interleaving of operations occurs in an operating
system through some scheduler
❑ Difficult to determine beforehand how the
operations in a schedule will be interleaved.

46
Characterizing Schedules based on
Serializability (8)
Practical approach:
◼ Come up with methods (protocols) to ensure
serializability.
◼ It’s not possible to determine when a schedule
begins and when it ends. Hence, we reduce the
problem of checking the whole schedule to checking
only a committed projection of the schedule (i.e.,
operations from only the committed transactions).
◼ Current approach used in most DBMSs:
❑ Use of locks with two phase locking

47
Characterizing Schedules based on
Serializability (9)
Testing for conflict serializability
◼ Precedence graph (serialization graph) G = (N,
E)
❑ Directed graph

❑ Set of Nodes: N = {T1, T2, ... , Tn}

❑ Directed edge: E ={e 1, e2, …, em}

◼ ei: Tj → Tk if one of the operations in Tj appears in


the schedule before some conflicting operation in Tk
Characterizing Schedules based on
Serializability (10)
Testing for conflict serializability (cont.)
◼ Algorithm
1. For each transaction Ti participating in schedule S,
create a node labeled Ti in the precedence graph.
2. For each case in S where Tj executes a read_item(X)
after Ti executes a write_item(X), create an edge
(Ti→ Tj) in the precedence graph.
3. For each case in S where Tj executes a write_item(X)
after Ti executes a read_item(X), create an edge
(Ti→Tj) in the precedence graph.
Characterizing Schedules based on
Serializability (11)
Testing for conflict serializability (cont.)
◼ Algorithm (cont.)

4. For each case in S where Tj executes a write_item(X)


after Ti executes a write_item(X), create an edge
(Ti→ Tj) in the precedence graph.

5. The schedule S is serializable if and only if the


precedence graph has no cycles.
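The algorithm can be sketched directly: encode the schedule as a list of (transaction, operation, item) triples, add an edge for every conflicting pair, and test the graph for a cycle (here with a simple depth-first search). The schedule encoding is an assumption of this sketch.

# Conflict-serializability test via the precedence (serialization) graph.
def precedence_graph(schedule):
    # schedule: list of (transaction, op, item), op in {'r', 'w'}, in execution order
    edges = set()
    for i, (ti, op_i, x_i) in enumerate(schedule):
        for tj, op_j, x_j in schedule[i + 1:]:
            if ti != tj and x_i == x_j and (op_i == 'w' or op_j == 'w'):
                edges.add((ti, tj))        # Ti appears before a conflicting operation of Tj
    return edges

def has_cycle(edges):
    nodes = {n for e in edges for n in e}
    adj = {n: [b for a, b in edges if a == n] for n in nodes}
    def dfs(n, stack, done):
        if n in stack:
            return True
        if n in done:
            return False
        stack.add(n)
        cyclic = any(dfs(m, stack, done) for m in adj[n])
        stack.discard(n)
        done.add(n)
        return cyclic
    return any(dfs(n, set(), set()) for n in nodes)

# Schedule Sa: r1(X); r2(X); w1(X); r1(Y); w2(X); w1(Y);
Sa = [('T1','r','X'), ('T2','r','X'), ('T1','w','X'),
      ('T1','r','Y'), ('T2','w','X'), ('T1','w','Y')]
E = precedence_graph(Sa)
print(E, 'serializable' if not has_cycle(E) else 'not serializable')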
Example of Serializability Testing (1)

Serializable
schedule
Example of Serializability Testing (2)

Serializable
schedule
Example of Serializability Testing (3)

Not Serializable
schedule
Example of Serializability Testing (4)

Serializable
schedule
Another example of serializability testing. (a) The
READ and WRITE operations of three transactions T1,
T2, and T3.

55
Another example of serializability testing. (b) Schedule
E.

56
Another example of serializability testing. (c) Schedule
F.

57
58
6. Transaction Support in SQL2 (1)

◼ A single SQL statement is always considered to be


atomic. Either the statement completes execution
without error or it fails and leaves the database
unchanged.
◼ With SQL, there is no explicit Begin Transaction
statement. Transaction initiation is done implicitly
when particular SQL statements are encountered.
◼ Every transaction must have an explicit end
statement, which is either a COMMIT or
ROLLBACK.

59
Transaction Support in SQL2 (2)

Characteristics specified by a SET


TRANSACTION statement in SQL2:
◼ Access mode: READ ONLY or READ WRITE. The default
is READ WRITE unless the isolation level of READ
UNCOMITTED is specified, in which case READ ONLY is
assumed.
◼ Diagnostic size n: specifies an integer value n, indicating
the number of conditions that can be held simultaneously in
the diagnostic area (used to supply user feedback information).

60
Transaction Support in SQL2 (3)

Characteristics specified by a SET


TRANSACTION statement in SQL2 (cont.):
◼ Isolation level <isolation>, where <isolation> can be READ
UNCOMMITTED, READ COMMITTED, REPEATABLE
READ or SERIALIZABLE. The default is SERIALIZABLE.
With SERIALIZABLE: the interleaved execution of
transactions will adhere to our notion of serializability.
However, if any transaction executes at a lower level, then
serializability may be violated.

61
Transaction Support in SQL2 (4)

Potential problem with lower isolation levels:


◼ Dirty Read: Reading a value that was written by a
transaction that later failed (and was rolled back).
◼ Nonrepeatable Read: Allowing another transaction to write
a new value between multiple reads of one transaction.
A transaction T1 may read a given value from a table.
If another transaction T2 later updates that value and T1
reads that value again, T1 will see a different value.
Consider that T1 reads the employee salary for Smith.
Next, T2 updates the salary for Smith. If T1 reads Smith's
salary again, then it will see a different value for Smith's
salary.

62
Transaction Support in SQL2(5)

Potential problem with lower isolation levels


(cont.):
◼ Phantoms: New rows being read using the same
read with a condition.
❑ A transaction T1 may read a set of rows from a table,
perhaps based on some condition specified in the SQL
WHERE clause.
❑ A transaction T2 inserts a new row that also satisfies the
WHERE clause condition of T1, into the table used by T1.
❑ If T1 is repeated, then T1 will see a row that previously did
not exist, called a phantom.

63
Transaction Support in SQL2 (6)
Sample SQL transaction:
EXEC SQL whenever sqlerror go to UNDO;
EXEC SQL SET TRANSACTION
READ WRITE
DIAGNOSTICS SIZE 5
ISOLATION LEVEL SERIALIZABLE;
EXEC SQL INSERT
INTO EMPLOYEE (FNAME, LNAME, SSN, DNO, SALARY)
VALUES ('Robert','Smith','991004321',2,35000);
EXEC SQL UPDATE EMPLOYEE
SET SALARY = SALARY * 1.1
WHERE DNO = 2;
EXEC SQL COMMIT;
GOTO THE_END;
UNDO: EXEC SQL ROLLBACK;
THE_END: ...

64
Transaction Support in SQL2 (7)

Possible violation of serializability:

65
Chapter 6

Concurrency Control Techniques


Chapter Outline
◼ Purpose of Concurrency Control
◼ Two-Phase Locking Techniques
◼ Concurrency Control Based on Timestamp
Ordering
◼ Multi-version Concurrency Control Techniques
◼ Validation (Optimistic) Concurrency Control
Techniques
◼ Granularity of Data Items And Multiple Granularity
Locking

2
1. Purpose of Concurrency Control
◼ To enforce Isolation (through mutual exclusion)
among conflicting transactions.
◼ To preserve database consistency through
consistency preserving execution of
transactions.
◼ To resolve read-write and write-write conflicts.

◼ Example:
❑ In concurrent execution environment: if T1 conflicts with T2
over a data item A
❑ Then the concurrency control decides whether T1 or T2 should
get A, and whether the other transaction is rolled back or waits.

3
2. Two-Phase Locking Techniques (1)
◼ Locking is an operation which secures
❑ (a) permission to Read, or
❑ (b) permission to Write a data item for a transaction.
◼ Example:
❑ Lock (X). Data item X is locked on behalf of the requesting
transaction.
◼ Unlocking is an operation which removes these
permissions from the data item.
◼ Example:
❑ Unlock (X): Data item X is made available to all other
transactions.
◼ Lock and Unlock are Atomic operations.

4
Two-Phase Locking Techniques (2)

◼ The database system requires that all transactions
be well-formed. A transaction is well-formed if:
❑ It must lock the data item before it reads or
writes to it.
❑ It must not lock an already locked data item,
and it must not try to unlock a free data item.

5
Two-Phase Locking Techniques (3)

◼ Type of Locks:
❑ Binary Locks
❑ Shared/ Exclusive (or Read/ Write) Locks

6
Two-Phase Locking Techniques (4)

◼ Binary Locks
❑ 2 values: locked and unlocked (1 and 0)
❑ The following code performs the lock operation:
B: if LOCK (X) = 0 (*item is unlocked*)
then LOCK (X) ← 1 (*lock the item*)
else begin
wait (until LOCK (X) = 0
and the lock manager wakes up the transaction);
go to B
end;

7
Two-Phase Locking Techniques (5)

◼ Binary Locks
❑ The following code performs the unlock operation:

LOCK (X) ← 0 (*unlock the item*)


if any transactions are waiting then
wake up one of the waiting transactions;

8
Two-Phase Locking Techniques (6)

◼ Binary Locks
❑ Rules:
1. A transaction T must issue the operation lock_item(X)
before any read_item(X) or write_item(X) operations in T.
2. A transaction T must issue the operation unlock_item(X)
after all read_item(X) and write_item(X) operations are
completed in T.
3. A transaction T will not issue a lock_item(X) operation if it
already holds the lock on item X.
4. A transaction T will not issue an unlock_item(X) operation
unless it already holds the lock on item X.

9
Two-Phase Locking Techniques (7)

◼ Shared/ Exclusive (or Read/ Write) Locks


❑ Two locks modes:
◼ (a) shared (read) (b) exclusive (write).
❑ Shared mode: read lock (X)
◼ More than one transaction can apply a shared lock on
X to read its value, but no write lock can be
applied on X by any other transaction.
❑ Exclusive mode: write lock (X)
◼ Only one write lock on X can exist at any time and
no shared lock can be applied by any other
transaction on X.

10
Two-Phase Locking Techniques (8)

◼ Shared/ Exclusive (or Read/ Write) Locks


❑ Lock Manager:
◼ Managing locks on data items.
❑ Lock table:
◼ The lock manager uses it to store the identity of the
transaction locking a data item, the data item, the lock mode,
and a pointer to the next data item locked. One simple way to
implement a lock table is through a linked list.

Transaction ID Data item id lock mode Ptr to next data item


T1 X1 Read Next
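
As noted above, one simple implementation is a linked list of such records. A minimal sketch of one lock-table record (the field names are illustrative only, not a prescribed layout):

# One lock-table record; next_item chains the data items locked by the
# same transaction, mirroring the "Ptr to next data item" column above.
class LockRecord:
    def __init__(self, transaction_id, item_id, mode, next_item=None):
        self.transaction_id = transaction_id   # e.g. 'T1'
        self.item_id = item_id                 # e.g. 'X1'
        self.mode = mode                       # 'read' or 'write'
        self.next_item = next_item

head = LockRecord('T1', 'X1', 'read')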

11
Two-Phase Locking Techniques (9)
◼ Shared/ Exclusive (or Read/ Write) Locks
❑ The following code performs the read lock operation:
B: if LOCK (X) = “unlocked” then
begin LOCK (X) ← “read-locked”;
no_of_reads (X) ← 1;
end
else if LOCK (X) = “read-locked” then
no_of_reads (X) ← no_of_reads (X) + 1;
else begin wait (until LOCK (X) = “unlocked” and
the lock manager wakes up the transaction);
go to B;
end;
12
Two-Phase Locking Techniques (10)
◼ Shared/ Exclusive (or Read/ Write) Locks
❑ The following code performs the write lock operation:

B: if LOCK(X) = “unlocked”
then LOCK(X) ← “write-locked”
else begin
wait (until LOCK(X) = “unlocked”
and the lock manager wakes up the transaction);
go to B
end;

13
Two-Phase Locking Techniques (11)
◼ Shared/ Exclusive (or Read/ Write) Locks
❑ The following code performs the unlock operation:
if LOCK (X) = “write-locked” then
begin LOCK (X) ← “unlocked”;
wake up one of the transactions, if any
end
else if LOCK (X) = “read-locked” then
begin
no_of_reads (X) ← no_of_reads (X) - 1;
if no_of_reads (X) = 0 then
begin
LOCK (X) ← “unlocked”;
wake up one of the transactions, if any
end
end;
14
Two-Phase Locking Techniques (12)
◼ Shared/ Exclusive (or Read/ Write) Locks
❑ Rules:
1. A transaction T must issue the operation read_lock(X)
or write_lock(X) before any read_item(X) operation is
performed in T.

2. A transaction T must issue the operation write_lock(X)


before any write_item(X) operation is performed in T.

3. A transaction T must issue the operation unlock(X)


after all read_item(X) and write_item(X) operations are
completed in T.

15
Two-Phase Locking Techniques (13)
◼ Shared/ Exclusive (or Read/ Write) Locks
❑ Rules (cont.):
4. A transaction T must not issue a read_lock(X)
operation if it already holds a read(shared) lock or a
write(exclusive) lock on item X.

5. A transaction T must not issue a write_lock(X)


operation if it already holds a read(shared) lock or a
write(exclusive) lock on item X.

6. A transaction T must not issue the operation unlock(X)


unless it already holds a read (shared) lock or a
write(exclusive) lock on item X.

16
Two-Phase Locking Techniques (14)
◼ Shared/ Exclusive (or Read/ Write) Locks
❑ Lock conversion
◼ Lock upgrade: existing read lock to write lock

if Ti has a read-lock (X) and Tj has no read-lock (X) (i ≠ j) then


convert read-lock (X) to write-lock (X)
else
force Ti to wait until Tj unlocks X

◼ Lock downgrade: existing write lock to read lock


Ti has a write-lock (X) (*no other transaction can have any lock on X*)
convert write-lock (X) to read-lock (X)

17
Two-Phase Locking Techniques (15)

◼ Two-Phase Locking
❑ Two Phases:
◼ (a) Locking (Growing)
◼ (b) Unlocking (Shrinking).
❑ Locking (Growing) Phase:
◼ A transaction applies locks (read or write) on desired data
items one at a time.
❑ Unlocking (Shrinking) Phase:
◼ A transaction unlocks its locked data items one at a time.
❑ Requirement:
◼ For a transaction, these two phases must be mutually
exclusive; that is, the unlocking phase must not start during
the locking phase, and no new lock may be acquired once
the unlocking phase has begun.

18
Two-Phase Locking Techniques (16)

◼ Do not obey Two-Phase Locking

19
Two-Phase Locking Techniques (17)
◼ Do not obey Two-Phase Locking

20
Two-Phase Locking Techniques (18)

◼ Two-Phase Locking

T1’ and T2’ follow


two-phase policy
but they are subject
to deadlock, which
must be dealt
with.

21
T’1                          T’2
read_lock (Y);
read_item (Y);
write_lock (X);
                             read_lock (X);
unlock (Y);                  (waits)
read_item (X);
X:=X+Y;
write_item (X);
unlock (X);
                             read_lock (X);
                             read_item (X);
                             write_lock (Y);
                             unlock (X);
                             read_item (Y);
                             Y:=X+Y;
                             write_item (Y);
                             unlock (Y);

Guaranteed to be serializable.
22
T’1: read_lock (Y); read_item (Y); write_lock (X); unlock (Y);
     read_item (X); X:=X+Y; write_item (X); unlock (X);

T’2: read_lock (X); read_item (X); write_lock (Y); unlock (X);
     read_item (Y); Y:=X+Y; write_item (Y); unlock (Y);

Both T’1 and T’2 follow the two-phase policy, but interleaving them
(T’1 locks Y, T’2 locks X, then each requests the item locked by
the other) can produce a deadlock.
23
Two-Phase Locking Techniques (19)
◼ Two-Phase Locking
❑ Variations:
◼ (a) Basic
◼ (b) Conservative
◼ (c) Strict
◼ (d) Rigorous

❑ Conservative:
◼ Prevents deadlock by locking all desired data items
before transaction begins execution.
❑ Basic:
◼ Transaction locks data items incrementally. This
may cause deadlock which is dealt with.

24
Two-Phase Locking Techniques (20)
◼ Two-Phase Locking
❑ Strict:
◼ A transaction T does not release any of its
exclusive (write) locks until after it commits or
aborts.
◼ The most commonly used two-phase locking
algorithm.

❑ Rigorous:
◼ A Transaction T does not release any of its locks
(Exclusive or shared) until after it commits or
aborts.

25
Two-Phase Locking Techniques (21)
◼ Dealing with Deadlock and Starvation
❑ Deadlock
T’1                          T’2
read_lock (Y);
read_item (Y);
                             read_lock (X);
                             read_item (X);
write_lock (X);
(waits for X)                write_lock (Y);
                             (waits for Y)

T’1 and T’2 both follow the two-phase policy, but they are deadlocked.

❑ Deadlock (T’1 and T’2)

26
Two-Phase Locking Techniques (22)

◼ Dealing with Deadlock and Starvation


❑ Deadlock prevention
◼ A transaction locks all data items it refers to before it
begins execution.

◼ This way of locking prevents deadlock since a


transaction never waits for a data item.

◼ The conservative two-phase locking uses this approach.

27
Two-Phase Locking Techniques (23)

◼ Dealing with Deadlock and Starvation


❑ Deadlock detection and resolution
◼ In this approach, deadlocks are allowed to happen.

◼ The scheduler maintains a wait-for graph for detecting cycles.

◼ If a cycle exists, then one transaction involved in the cycle is


selected (victim) and rolled-back.

28
Two-Phase Locking Techniques (24)

◼ Dealing with Deadlock and Starvation


❑ Deadlock detection and resolution
◼ A wait-for-graph:
❑ One node is for each transaction that is currently executing.
❑ Whenever a transaction Ti is waiting to lock an item X that
is currently locked by a transaction Tj, a directed edge (Ti →
Tj) is created.
❑ When Tj releases the lock(s) on the items that Ti was
waiting for, the directed edge is dropped.
❑ We have a state of deadlock if and only if the wait-for graph
has a cycle.

◼ When should the system check for a deadlock?

29
Two-Phase Locking Techniques (25)

T’1                          T’2
read_lock (Y);
read_item (Y);
                             read_lock (X);
                             read_item (X);
write_lock (X);
(waits for X)                write_lock (Y);
                             (waits for Y)

a) Partial schedule of T’1 and T’2
b) Wait-for graph: T’1 → T’2 (T’1 waits for X) and T’2 → T’1
(T’2 waits for Y) — a cycle, so a deadlock exists.

30
Two-Phase Locking Techniques (26)

◼ Dealing with Deadlock and Starvation


❑ Deadlock avoidance
◼ There are many variations of two-phase locking algorithm.

◼ Some avoid deadlock by not letting a cycle complete.

◼ That is, as soon as the algorithm discovers that blocking a


transaction is likely to create a cycle, it rolls back the
transaction.

◼ Wound-Wait and Wait-Die algorithms use timestamps to avoid


deadlocks by rolling back a victim.

31
Two-Phase Locking Techniques (27)

◼ Dealing with Deadlock and Starvation


❑ Deadlock avoidance
◼ Timestamp:
❑ TS(T)
❑ A unique identifier assigned to each transaction.
❑ Typically based on the order in which transactions are
started
❑ If transaction T1 starts before transaction T2, then TS(T1) <
TS(T2). Notice that the older transaction (which starts first)
has the smaller timestamp value.

32
Two-Phase Locking Techniques (28)

◼ Dealing with Deadlock and Starvation


❑ Deadlock avoidance
◼ Wait-die:
❑ If TS(Ti) < TS(Tj), then (Ti older than Tj) Ti is allowed to wait.
❑ Otherwise (Ti younger than Tj) abort Ti (Ti dies) and restart
it later with the same timestamp.

◼ Wound-wait:
❑ If TS(Ti) < TS(Tj), then (Ti older than Tj) abort Tj (Ti wounds
Tj) and restart it later with the same timestamp.
❑ Otherwise (Ti younger than Tj) Ti is allowed to wait.
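
A minimal sketch of the two policies, applied when Ti requests an item currently held by Tj (smaller timestamp means older; the returned strings stand in for the scheduler's real wait/abort actions):

def wait_die(ts_i, ts_j):
    if ts_i < ts_j:        # Ti is older -> it is allowed to wait
        return "Ti waits"
    return "abort Ti"      # Ti is younger -> Ti dies, restarted later with the same TS

def wound_wait(ts_i, ts_j):
    if ts_i < ts_j:        # Ti is older -> it wounds (aborts) the younger Tj
        return "abort Tj"
    return "Ti waits"      # Ti is younger -> it is allowed to wait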

33
Two-Phase Locking Techniques (29)
◼ Dealing with Deadlock and Starvation
◼ Starvation
❑ Starvation occurs when a particular transaction consistently
waits or is restarted and never gets a chance to proceed
further.
❑ In a deadlock resolution it is possible that the same
transaction may consistently be selected as victim and
rolled-back.
❑ This limitation is inherent in all priority based scheduling
mechanisms.
❑ The wound-wait and wait-die schemes can avoid starvation.

34
3. Concurrency Control Based on
Timestamp Ordering (1)
◼ Timestamp
❑ A monotonically increasing variable (integer)
indicating the age of an operation or a transaction.
A larger timestamp value indicates a more recent
event or operation.

❑ Timestamp based algorithm uses timestamp to


serialize the execution of concurrent transactions.

35
Concurrency Control Based on
Timestamp Ordering (2)
◼ Timestamp
❑ The algorithm associates with each database
item X two timestamp (TS) values:
◼ Read_TS(X): The read timestamp of item X; this is
the largest timestamp among all the timestamps of
transactions that have successfully read item X.
◼ Write_TS(X):The write timestamp of item X; this is
the largest timestamp among all the timestamps of
transactions that have successfully written item X.

36
Concurrency Control Based on
Timestamp Ordering (3)
◼ Basic Timestamp Ordering
❑ 1. Transaction T issues a write_item(X) operation:
◼ (a) If read_TS(X) > TS(T) or if write_TS(X) > TS(T)
❑ a younger transaction has already read or written the data item
❑ abort and roll-back T with a new timestamp and reject the
operation.
◼ (b) If the condition in part (a) does not exist, then execute
write_item(X) of T and set write_TS(X) to TS(T).
❑ 2. Transaction T issues a read_item(X) operation:
◼ (a) If write_TS(X) > TS(T)
❑ a younger transaction has already written to the data item
❑ abort and roll-back T with a new timestamp and reject the
operation.
◼ (b) If write_TS(X) ≤ TS(T), then execute read_item(X) of T and
set read_TS(X) to the larger of TS(T) and the current
read_TS(X).
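
A minimal sketch of these two checks, assuming read_TS and write_TS are kept in dictionaries keyed by item (defaulting to 0) and that ts is TS(T); the returned strings stand in for the real abort/restart actions:

read_TS, write_TS = {}, {}

def to_write_item(X, ts):
    if read_TS.get(X, 0) > ts or write_TS.get(X, 0) > ts:   # rule 1(a)
        return "abort and roll back T, restart later with a new timestamp"
    write_TS[X] = ts                                         # rule 1(b)
    return "write executed"

def to_read_item(X, ts):
    if write_TS.get(X, 0) > ts:                              # rule 2(a)
        return "abort and roll back T, restart later with a new timestamp"
    read_TS[X] = max(read_TS.get(X, 0), ts)                  # rule 2(b)
    return "read executed"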
37
Example:Three transactions executing under a
timestamp-based scheduler
TS(T1) = 200, TS(T2) = 150, TS(T3) = 175
Initially: RT(A) = RT(B) = RT(C) = 0 and WT(A) = WT(B) = WT(C) = 0

r1(B) → RT(B) = 200
r2(A) → RT(A) = 150
r3(C) → RT(C) = 175
w1(B) → WT(B) = 200
w1(A) → WT(A) = 200
w2(C) → T2 is aborted
w3(A)

Why must T2 be aborted (rolled back)?
38
Concurrency Control Based on
Timestamp Ordering (4)
◼ Strict Timestamp Ordering
❑ 1. Transaction T issues a write_item(X)
operation:
◼ If TS(T) > write_TS(X), then delay T until the transaction
T’ that wrote X has terminated (committed or aborted).
❑ 2. Transaction T issues a read_item(X) operation:
◼ If TS(T) > write_TS(X), then delay T until the transaction
T’ that wrote X has terminated (committed or aborted).

❑ Ensures the schedules are both strict and conflict


serializable
39
Concurrency Control Based on
Timestamp Ordering (5)
◼ Thomas’s Write Rule
Modify the checks for the write_item(X) operation:
❑ 1. If read_TS(X) > TS(T) then abort and roll-back
T and reject the operation.
❑ 2. If write_TS(X) > TS(T), then just ignore the
write operation and continue execution because it
is already outdated and obsolete.
❑ If the conditions given in 1 and 2 above do not
occur, then execute write_item(X) of T and set
write_TS(X) to TS(T).
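
Using the same assumed dictionaries as in the earlier basic timestamp ordering sketch, the modified write check becomes:

def thomas_write_item(X, ts):
    # read_TS and write_TS as in the basic timestamp ordering sketch.
    if read_TS.get(X, 0) > ts:
        return "abort and roll back T"          # check 1
    if write_TS.get(X, 0) > ts:
        return "ignore the write (obsolete)"    # check 2: no abort, no write
    write_TS[X] = ts                            # otherwise execute the write
    return "write executed"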

40
4. Multiversion Concurrency Control
Techniques (1)
◼ This approach maintains a number of
versions of a data item and allocates the right
version to a read operation of a transaction.
Thus, unlike in other mechanisms, a read
operation in this approach is never
rejected.
◼ Side effect:
❑ Significantly more storage (RAM and disk) is
required to maintain multiple versions.
❑ To limit the unbounded growth of versions, garbage
collection is run when some criterion is satisfied.
41
Multiversion Concurrency Control
Techniques (2)
◼ Multiversion technique based on timestamp ordering
❑ Assume X1, X2, …, Xn are the versions of a data item X
created by write operations of transactions. With each Xi a
read_TS (read timestamp) and a write_TS (write
timestamp) are associated.
❑ read_TS(Xi): The read timestamp of Xi is the largest of all
the timestamps of transactions that have successfully read
version Xi.
❑ write_TS(Xi): The write timestamp of Xi is the timestamp
of the transaction that wrote the value of version Xi.
❑ A new version of Xi is created only by a write operation.

42
Multiversion Concurrency Control
Techniques (3)
◼ Multiversion technique based on timestamp ordering
To ensure serializability, the following two rules are used:
1. If transaction T issues write_item (X) and version i of X has
the highest write_TS(Xi) of all versions of X that is also less
than or equal to TS(T), and read_TS(Xi) > TS(T), then abort
and roll back T; otherwise, create a new version Xj of X with
read_TS(Xj) = write_TS(Xj) = TS(T).
2. If transaction T issues read_item (X), find the version i of X
that has the highest write_TS(Xi) of all versions of X that is
also less than or equal to TS(T), then return the value of Xi to
T, and set the value of read_TS(Xi) to the largest of TS(T)
and the current read_TS(Xi).
❑ Rule 2 guarantees that a read will never be rejected.
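A minimal sketch of rule 2 (the read side), assuming each item's versions are kept as [write_TS, read_TS, value] lists and that an initial version with write_TS = 0 always exists:

def mv_read_item(versions, X, ts):
    # Pick the version with the highest write_TS that is <= TS(T) = ts.
    candidates = [v for v in versions[X] if v[0] <= ts]
    chosen = max(candidates, key=lambda v: v[0])
    chosen[1] = max(chosen[1], ts)    # update that version's read_TS
    return chosen[2]                  # the read is never rejected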
43
Example: Execution of transactions using
multiversion concurrency control
TS(T1) = 150, TS(T2) = 200, TS(T3) = 175, TS(T4) = 225
Versions of A: A0, A150, A200

r1(A) → reads A0
w1(A) → creates A150
r2(A) → reads A150
w2(A) → creates A200
r3(A) → reads A150
r4(A) → reads A200
Note: T3 does not have to abort, because it can read an earlier


version of A.

44
Multiversion Concurrency Control
Techniques (4)
Multiversion Two-Phase Locking Using Certify Locks
◼ Concept:

❑ Allow a transaction T’ to read a data item X while


it is write locked by a conflicting transaction T.
❑ This is accomplished by maintaining two
versions of each data item X
◼ One version must always have been written by some
committed transaction. This means a write operation
always creates a new version of X.
◼ The second version is created when a transaction acquires
a write lock on the item.

45
Multiversion Concurrency Control
Techniques (5)
Multiversion Two-Phase Locking Using Certify Locks
◼ Steps:
1. X is the committed version of a data item.
2. T creates a second version X’ after obtaining a write lock on X.
3. Other transactions continue to read X.
4. T is ready to commit so it obtains a certify lock on X’.
5. Version X’ becomes the committed version, replacing X.
6. T releases its certify lock on X’, which is X now.
Compatibility tables:

Read/write locking scheme:
            Read    Write
Read        yes     no
Write       no      no

Read/write/certify locking scheme:
            Read    Write   Certify
Read        yes     yes     no
Write       yes     no      no
Certify     no      no      no
46
Multiversion Concurrency Control
Techniques (6)
Multiversion Two-Phase Locking Using Certify Locks
◼ Note:
❑ In multiversion 2PL read and write operations
from conflicting transactions can be processed
concurrently.
❑ This improves concurrency, but it may delay
transaction commit because certify locks must be
obtained on all items the transaction has written. It avoids
cascading aborts, but, as in the strict two-phase locking
scheme, conflicting transactions may get deadlocked.

47
5. Validation (Optimistic)
Concurrency Control Techniques (1)
◼ In this technique, serializability is checked only at commit
time, and transactions are aborted if the resulting
schedule would not be serializable.
◼ Three phases:
1. Read phase
2. Validation phase
3. Write phase

1. Read phase:
❑ A transaction can read values of committed data items.
However, updates are applied only to local copies
(versions) of the data items (in database cache).

48
Validation (Optimistic) Concurrency
Control Techniques (2)
2. Validation phase: Serializability is checked before
transactions write their updates to the database.
❑ This phase for Ti checks that, for each transaction Tj that is either
committed or is in its validation phase, one of the following
conditions holds:
1. Tj completes its write phase before Ti starts its read phase.
2. Ti starts its write phase after Tj completes its write phase, and
the read_set of Ti has no items in common with the write_set of
Tj
3. Both the read_set and write_set of Ti have no items in common
with the write_set of Tj, and Tj completes its read phase before
Ti completes its read phase.
◼ The conditions are checked in order for each such transaction Tj: if (1) is
false then (2) is checked, and if (2) is false then (3) is checked. If
none of these conditions holds, validation fails and Ti is aborted.
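
A minimal sketch of this check for one pair (Ti, Tj), assuming each transaction record carries read_set, write_set and the start/end times of its read and write phases (these field names are assumptions of the example):

def validate_against(Ti, Tj):
    if Tj.write_end < Ti.read_start:                           # condition 1
        return True
    if Tj.write_end < Ti.write_start and \
       not (Ti.read_set & Tj.write_set):                       # condition 2
        return True
    if Tj.read_end < Ti.read_end and \
       not ((Ti.read_set | Ti.write_set) & Tj.write_set):      # condition 3
        return True
    return False                                               # validation fails -> abort Ti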

49
Validation (Optimistic) Concurrency
Control Techniques (3)
3. Write phase: On a successful validation
transactions’ updates are applied to the
database; otherwise, transactions are
restarted.

50
6. Granularity of Data Items And
Multiple Granularity Locking (1)
◼ A lockable unit of data defines its granularity.
Granularity can be coarse (entire database) or it can be
fine (a tuple or an attribute of a relation).
◼ Data item granularity significantly affects concurrency
control performance. Thus, the degree of concurrency
is low for coarse granularity and high for fine granularity.
◼ Example of data item granularity:
1. A field of a database record (an attribute of a tuple)
2. A database record (a tuple or a relation)
3. A disk block
4. An entire file
5. The entire database

51
Granularity of data items and
Multiple Granularity Locking (2)
◼ The following diagram illustrates a hierarchy
of granularity from coarse (database) to fine
(record).

52
Granularity of data items and Multiple
Granularity Locking (3)
◼ To manage such a hierarchy, in addition to shared and exclusive
locks, three additional lock modes, called intention lock
modes, are defined:
❑ Intention-shared (IS): indicates that one or more shared locks
will be requested on some descendant node(s).
❑ Intention-exclusive (IX): indicates that one or more exclusive
locks will be requested on some descendant node(s).
❑ Shared-intention-exclusive (SIX): indicates that the
current node is locked in shared mode but one or more exclusive
locks will be requested on some descendant node(s).

53
Granularity of data items and Multiple
Granularity Locking (4)
◼ These locks are applied using the following
compatibility matrix:

            IS      IX      S       SIX     X
IS          yes     yes     yes     yes     no
IX          yes     yes     no      no      no
S           yes     no      yes     no      no
SIX         yes     no      no      no      no
X           no      no      no      no      no

(IS = intention-shared, IX = intention-exclusive, S = shared,
SIX = shared-intention-exclusive, X = exclusive)

54
Granularity of data items and
Multiple Granularity Locking (5)
◼ The set of rules which must be followed for
producing serializable schedule:
1. The lock compatibility matrix must be adhered to.
2. The root of the tree must be locked first, in any mode.
3. A node N can be locked by a transaction T in S or IS mode
only if the parent of N is already locked by T in either IS or
IX mode.
4. A node N can be locked by T in X, IX, or SIX mode only if the
parent of N is already locked by T in either IX or SIX mode.
5. T can lock a node only if it has not unlocked any node (to
enforce 2PL policy).
6. T can unlock a node, N, only if none of the children of N are
currently locked by T.
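
A minimal sketch of rules 2-4 (deciding whether transaction T may lock node N in a given mode, based on the lock it holds on N's parent); the locks_held mapping and mode strings are assumptions of the example:

def can_lock(T, node, mode, parent, locks_held):
    if parent is None:                       # rule 2: the root may be locked in any mode
        return True
    parent_mode = locks_held.get((T, parent))
    if mode in ('S', 'IS'):                  # rule 3
        return parent_mode in ('IS', 'IX')
    if mode in ('X', 'IX', 'SIX'):           # rule 4
        return parent_mode in ('IX', 'SIX')
    return False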

55
Granularity of data items and Multiple
Granularity Locking (6)
◼ An example of a serializable execution:

T1 wants to update r111, r211


T2 wants to update all records
on page p12
T3 wants to read r11j and the
entire file f2

56
Granularity of data items and
Multiple Granularity Locking (7)
◼ An example of a serializable execution (continued):

57
Chapter 7

Database Recovery Techniques

1
Outline

◼ Purpose of Database Recovery


◼ Recovery Concepts
◼ Recovery Based on Deferred Update
◼ Recovery Based on Immediate Update
◼ Shadow paging
◼ ARIES Recovery Algorithm
◼ Recovery in Multidatabase System

2
1. Purpose of Database Recovery
◼ To bring the database into the last consistent state,
which existed prior to the failure.
◼ To preserve transaction properties (Atomicity,
Consistency, Isolation and Durability).

◼ Example:
❑ If the system crashes before a fund transfer
transaction completes its execution, then either one or
both accounts may have incorrect value.
❑ Thus, the database must be restored to the state
before the transaction modified any of the accounts.
2. Recovery Concepts (1)

Types of Failure
◼ The database may become unavailable for use
due to
❑ Transaction failure: Transactions may fail because
of incorrect input, deadlock, incorrect synchronization.
❑ System failure: System may fail because of
addressing error, application error, operating system
fault, RAM failure, etc.
❑ Media failure: Disk head crash, power disruption, etc.
Recovery Concepts (2)
Transaction Log
◼ For recovery from any type of failure data values prior to
modification (BFIM - BeFore Image) and the new value after
modification (AFIM – AFter Image) are required.
◼ These values and other information are stored in a sequential
file called the transaction log. A sample log is given below. Back
P and Next P point to the previous and next log records of the
same transaction.
T ID Back P Next P Operation Data item BFIM AFIM
T1 0 1 Begin
T1 1 4 Write X X = 100 X = 200
T2 0 8 Begin
T1 2 5 W Y Y = 50 Y = 100
T1 4 7 R M M = 200 M = 200
T3 0 9 R N N = 400 N = 400
T1 5 nil End
Recovery Concepts (3)

Data Caching
◼ Data items to be modified are first stored into
database cache by the Cache Manager (CM)
and after modification they are flushed (written)
to the disk.
◼ The flushing is controlled by Modified and Pin-
Unpin bits.
❑ Pin-Unpin: Instructs the operating system not to flush
the data item.
❑ Modified: Indicates that the data item has been modified (holds an AFIM).
Recovery Concepts (4)
Data Update:
◼ Immediate Update: As soon as a data item is modified in
cache, the disk copy is updated.
◼ Deferred Update: All modified data items in the cache are
written either after a transaction ends its execution or after a
fixed number of transactions have completed their execution.
◼ Shadow update: The modified version of a data item does
not overwrite its disk copy but is written at a separate disk
location.
◼ In-place update: The disk version of the data item is
overwritten by the cache version.
Recovery Concepts (5)

Transaction Roll-back (Undo) and Roll-Forward


(Redo)
◼ To maintain atomicity, a transaction’s operations are
redone or undone.
❑ Undo: Restore all BFIMs on to disk (Remove all AFIMs).
❑ Redo: Restore all AFIMs on to disk.

◼ Database recovery is achieved either by performing only


Undo or only Redo or by a combination of the two.
These operations are recorded in the log as they
happen.
Recovery Concepts (6)
Write-Ahead Logging
◼ When in-place update (immediate or deferred) is used,
a log is necessary for recovery and it must be
available to the recovery manager. This is achieved by the
Write-Ahead Logging (WAL) protocol. WAL states that:
❑ For Undo: Before a data item’s AFIM is flushed to the
database disk (overwriting the BFIM) its BFIM must be
written to the log and the log must be saved on a stable
store (log disk).
❑ For Redo: Before a transaction executes its commit
operation, all its AFIMs must be written to the log and the
log must be saved on a stable store.
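
A minimal sketch of the two WAL orderings (the log and disk objects are placeholders for the real log manager and buffer manager, not an actual API):

def flush_page(log, disk, page, bfim_record):
    log.append(bfim_record)   # Undo rule: the BFIM is logged ...
    log.force()               # ... and the log is on stable storage ...
    disk.write(page)          # ... before the AFIM overwrites the BFIM on disk

def commit_transaction(log, afim_records):
    for rec in afim_records:
        log.append(rec)       # Redo rule: all AFIMs are logged ...
    log.force()               # ... and forced to stable storage ...
    log.append("[commit]")    # ... before the commit takes effect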
Recovery Concepts (7)
Steal/No-Steal and Force/No-Force
◼ Possible ways for flushing database cache to
database disk:
❑ Steal/No-Steal:
1. Steal: Cache can be flushed before transaction
commits.
2. No-Steal: Cache cannot be flushed before
transaction commit.
❑ Force/No-Force:
1. Force: Cache is immediately flushed (forced) to
disk before the transaction commit.
2. No-Force: Otherwise.
Recovery Concepts (8)

Steal/No-Steal and Force/No-Force

◼ These give rise to four different ways for


handling recovery:
❑ Steal/No-Force (Undo/Redo)
❑ Steal/Force (Undo/No-redo)
❑ No-Steal/No-Force (Redo/No-undo)
❑ No-Steal/Force (No-undo/No-redo)

11
Recovery Concepts (9)

Checkpointing
◼ From time to time (randomly or under some criterion) the
database flushes its buffers to the database disk to
minimize the task of recovery. The following steps
define a checkpoint operation:
1. Suspend execution of transactions temporarily.
2. Force write modified buffer data to disk.
3. Write a [checkpoint] record to the log, save the log to disk.
4. Resume normal transaction execution.
◼ During recovery, redo or undo is required only for
transactions appearing after the [checkpoint] record.
Recovery Concepts (10)

Fuzzy Checkpointing
◼ The time needed to force-write all modified memory
buffers may delay transaction processing
→ fuzzy checkpointing.
◼ The system can resume transaction processing after a
[begin_checkpoint] record is written to the log without
having to wait for step 2 to finish.
◼ When step 2 is completed → [end_checkpoint] record is
written to the log.
◼ Until step 2 is completed, the previous checkpoint
record remains valid.
3. Recovery Based on Deferred
Update (1)
◼ Deferred Update (No Undo/Redo)
◼ The data update goes as follows:
❑ A set of transactions records their updates in the
log.
❑ At commit point under WAL scheme these
updates are saved on database disk.
❑ After reboot from a failure the log is used to redo
all the transactions affected by this failure. No
undo is required because no AFIM is flushed to
the disk before a transaction commits.
Recovery Based on Deferred Update (2)
◼ Deferred Update in a single-user system
There is no concurrent data sharing in a single user
system. The data update goes as follows:
❑ A set of transactions records their updates in the log.
❑ At commit point under WAL scheme these updates are
saved on database disk.
◼ After reboot from a failure the log is used to redo all the
transactions affected by this failure. No undo is required
because no AFIM is flushed to the disk before a
transaction commits.
T1 T2
read_item (A) read_item (B)
read_item (D) write_item (B)
write_item (D) read_item (D)
write_item (D)

--- log ---


[start_transaction, T1]
[write_item, T1, D, 20]
[commit T1]
[start_transaction, T2]
[write_item, T2, B, 10]
[write_item, T2, D, 25] ← system crash

Redo [write_item, T1, D, 20] of T1


Ignore T2
Recovery Based on Deferred Update (3)

◼ Deferred Update with concurrent users


❑ This environment requires some concurrency control mechanism
to guarantee the isolation property of transactions. During system
recovery, transactions that were recorded in the log after the
last checkpoint are redone. The recovery manager may still scan
some of the transactions recorded before the checkpoint to get
the AFIMs.
Recovery Based on Deferred Update (4)

Deferred Update with concurrent users


◼ Two tables are required for implementing this protocol:
❑ Active table: All active transactions are entered in this
table.
❑ Commit table: Transactions to be committed are entered
in this table.

◼ During recovery, all transactions of the commit table are


redone and all transactions of active tables are ignored
since none of their AFIMs reached the database. It is
possible that a commit table transaction may be redone
twice but this does not create any inconsistency because
of a redone is “idempotent”, that is, one redone for an
AFIM is equivalent to multiple redone for the same AFIM.
--- log ---
[start_transaction, T1]
[write_item, T1, D, 20]
[checkpoint]
[start_transaction, T4]
[write_item, T4, B, 15]
[start_transaction, T2]
[commit, T1]
[write_item, T4, A, 20]
[commit, T4]
[write_item, T2, B, 12]
[start_transaction, T3]
[write_item, T3, A, 30]
[write_item, T2, D, 25] ← system crash

• Ignore: T2 & T3
• Redo: T1 & T4
D ← 20
B ← 15
A ← 20
4. Recovery Based on Immediate
Update (1)
◼ Undo/No-redo Algorithm
❑ In this algorithm AFIMs of a transaction are
flushed to the database disk under WAL before it
commits.
❑ For this reason the recovery manager undoes all
transactions during recovery.
❑ No transaction is redone.
❑ It is possible that a transaction has completed its
execution and is ready to commit, but even this
transaction is undone.
Recovery Based on Immediate
Update (2)
◼ Undo/Redo Algorithm (Single-user
environment)
❑ Recovery schemes of this category apply undo and
also redo for recovery.
❑ In a single-user environment no concurrency control is
required but a log is maintained under WAL.
❑ Note that at any time there will be one transaction in
the system and it will be either in the commit table or
in the active table.
❑ The recovery manager performs:
◼ Undo of a transaction if it is in the active table.
◼ Redo of a transaction if it is in the commit table.
Recovery Based on Immediate
Update (3)
◼ Undo/Redo Algorithm (Concurrent execution)
◼ Recovery schemes of this category apply undo and
also redo to recover the database from failure.
◼ In concurrent execution environment a concurrency
control is required and log is maintained under WAL.
◼ Commit table records transactions to be committed and
active table records active transactions. To minimize the
work of the recovery manager checkpointing is used.
◼ The recovery performs:
❑ Undo of a transaction if it is in the active table.
❑ Redo of a transaction if it is in the commit table.
--- log ---
[start_transaction, T1]
[write_item, T1, D, 12, 20]
[checkpoint]
[start_transaction, T4]
[write_item, T4, B, 23, 15]
[start_transaction, T2]
[commit, T1]
[write_item, T2, B, 15, 12]
[start_transaction, T3]
[write_item, T4, A, 30, 20]
[commit, T4]
[write_item, T3, A, 20, 30]
[write_item, T2, D, 20, 25]
[write_item, T2, B, 12, 17]
← system crash

Undo: T2 & T3
B ← 12
D ← 20
A ← 20
B ← 15

Redo: T1 & T4
D ← 20
B ← 15
A ← 20
5. Shadow paging (1)
◼ The AFIM does not overwrite its BFIM but is recorded at
another place on the disk. Thus, at any time a data item
has its AFIM and its BFIM (the shadow copy of the data item) at
two different places on the disk.

X Y
X' Y'

Database
X and Y: Shadow copies of data items
X' and Y': Current copies of data items
Shadow paging (2)
◼ To manage access of data items by concurrent transactions
two directories (current and shadow) are used.
❑ The directory arrangement is illustrated below. Here a page
is a data item.
Shadow paging (3)

◼ Recovery:
❑ Free the modified database pages and discard
the current directory (reinstating the shadow
directory).
◼ Committing a transaction corresponds to
discarding the previous shadow directory.
◼ NO-UNDO / NO-REDO
◼ In a multiuser environment → use logs and
checkpoints

27
6. ARIES Recovery Algorithm (1)

◼ Steal/no-force (UNDO/REDO)
◼ The ARIES Recovery Algorithm is based on:
❑ WAL (Write Ahead Logging)
❑ Repeating history during redo:
◼ ARIES will retrace all actions of the database system
prior to the crash to reconstruct the database state when
the crash occurred.
❑ Logging changes during undo:
◼ It will prevent ARIES from repeating the completed undo
operations if a failure occurs during recovery, which
causes a restart of the recovery process.
ARIES Recovery Algorithm (2)

◼ The ARIES recovery algorithm consists of three


steps:
1. Analysis: identifies the dirty (updated) pages
in the buffer and the set of transactions active at
the time of the crash. The appropriate point in the log
where redo is to start is also determined.
2. Redo: necessary redo operations are applied.
3. Undo: log is scanned backwards and the operations
of transactions active at the time of crash are
undone in reverse order.
ARIES Recovery Algorithm (3)

◼ The Log and Log Sequence Number (LSN)


❑ A log record is written for:
◼ (a) data update
◼ (b) transaction commit
◼ (c) transaction abort
◼ (d) undo
❑ In the case of undo a compensating log record is written.
◼ (e) transaction end
ARIES Recovery Algorithm (4)
◼ The Log and Log Sequence Number (LSN) (cont.)
❑ A unique LSN is associated with every log record.
◼ LSN increases monotonically and indicates the disk address
of the log record it is associated with.
◼ In addition, each data page stores the LSN of the latest log
record corresponding to a change for that page.
❑ A log record stores
◼ (a) the previous LSN of that transaction. It links the log record
of each transaction. It is like a back pointer points to the
previous record of the same transaction
◼ (b) the transaction ID
◼ (c) the type of log record.
ARIES Recovery Algorithm (5)
◼ The Log and Log Sequence Number (LSN) (cont.)
❑ For a write operation the following additional
information is logged:
1. Page ID for the page that includes the item
2. Length of the updated item
3. Its offset from the beginning of the page
4. BFIM of the item
5. AFIM of the item
ARIES Recovery Algorithm (6)
◼ The Transaction table and the Dirty Page
table
❑ For efficient recovery following tables are also
stored in the log during checkpointing:
◼ Transaction table: Contains an entry for each active
transaction, with information such as transaction ID,
transaction status and the LSN of the most recent log
record for the transaction.
◼ Dirty Page table: Contains an entry for each dirty page
in the buffer, which includes the page ID and the LSN
corresponding to the earliest update to that page.
ARIES Recovery Algorithm (7)
◼ Checkpointing
❑ A checkpointing does the following:
◼ Writes a begin_checkpoint record in the log
◼ Writes an end_checkpoint record in the log. With this record
the contents of transaction table and dirty page table are
appended to the end of the log.
◼ Writes the LSN of the begin_checkpoint record to a special
file. This special file is accessed during recovery to locate the
last checkpoint information.
❑ To reduce the cost of checkpointing and allow the system
to continue to execute transactions, ARIES uses “fuzzy
checkpointing”.
ARIES Recovery Algorithm (8)
◼ The following steps are performed for recovery:
❑ Analysis phase:
◼ Start at the begin_checkpoint record and proceed to the
end_checkpoint record.
◼ Access the transaction table and dirty page table that were
appended to the end of the log at that checkpoint.
◼ Modify transaction table and dirty page table:
❑ An end log record was encountered for T → delete entry T
from transaction table
❑ Some other type of log record is encountered for T’ →
insert an entry for T’ into the transaction table if not already
present, or update its last LSN.
❑ The log record corresponds to a change for page P →
insert an entry P (if not present) with the associated LSN in
dirty page table
◼ The analysis phase compiles the set of redo and undo
operations to be performed and ends.
ARIES Recovery Algorithm (9)
◼ The following steps are performed for recovery:
❑ Redo phase: Starts redoing at a point in the log where it knows
that previous changes to dirty pages have already been applied
to disk.
❑ Where?
◼ Finding the smallest LSN, M of all the dirty pages in the Dirty
Page Table.
❑ Redo starts at the log record with LSN = M and scan forward to
the end of the log.
❑ Verify whether or not the change has to be reapplied.
◼ A change recorded in the log pertains to the page P that is not in
the Dirty Page Table → no redo
◼ A change recorded in the log (LSN = N) pertains to page P and
the Dirty Page Table contains an entry for P with LSN > N → no
redo.
◼ Page P is read from disk and the LSN stored on that page > N →
no redo.
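
A minimal sketch of that per-log-record redo decision (dirty_page_table maps a page id to the LSN recorded for it during the analysis phase; page_lsn reads the LSN stored on the page itself; both are placeholders):

def needs_redo(log_lsn, page_id, dirty_page_table, page_lsn):
    if page_id not in dirty_page_table:         # page not dirty at the crash
        return False
    if dirty_page_table[page_id] > log_lsn:     # page first dirtied after this record
        return False
    if page_lsn(page_id) > log_lsn:             # page on disk already reflects a later change
        return False
    return True                                 # otherwise reapply the change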
ARIES Recovery Algorithm (10)
◼ The following steps are performed for recovery:
❑ Undo phase: Starts from the end of the log and proceeds
backward while performing appropriate undo. For each undo it
writes a compensating record in the log.
7. Recovery in multidatabase system
◼ A multidatabase system is a special distributed database
system where one node may be running a relational database
system under UNIX, another may be running an object-oriented
system under Windows, and so on.
◼ A transaction may run in a distributed fashion at multiple
nodes.
◼ In this execution scenario the transaction commits only when
all these multiple nodes agree to commit individually the part
of the transaction they were executing.
◼ This commit scheme is referred to as “two-phase commit”
(2PC).
❑ If any one of these nodes fails or cannot commit the part of
the transaction, then the transaction is aborted.
◼ Each node recovers the transaction under its own recovery
protocol.
Final exam

◼ 90 minutes
◼ Multiple choice + essay questions
◼ Open test (only paper document)

40
