An Overview of a Database
Management System
What is a DBMS?
2
DBMS Capabilities
The capabilities that a DBMS provides the user are:
◼ Persistent Storage. A DBMS supports the storage of very large
amounts of data that exists independently of any processes that
are using the data.
◼ Programming Interface. A DBMS allows the user to access and
modify data through a powerful query language.
◼ Transaction management. A DBMS supports concurrent
access to data, i.e., simultaneous access by many distinct
processes (called transactions). To avoid some of the
undesirable consequences of simultaneous access, the DBMS
supports:
❑ isolation
❑ atomicity
❑ resiliency
3
History of database systems and DBMS
4
Component modules of a DBMS
5
The Database System Environment (1)
6
The Database System Environment (2)
7
Classification of DBMS
◼ DBMS classification based on:
❑ Data model:
◼ Hierarchical, network, relational, object, object-relational,
XML, document-based, graph-based, column-based, key-
value, …
❑ The number of users:
◼ Single-user systems vs. multiuser systems
❑ The number of sites
◼ Centralized vs. distributed
❑ Cost
❑ Purpose
◼ General purpose vs. special purpose
8
9
10
11
When should we (not) use a DBMS?
◼ Should
❑ Controlling Redundancy
❑ Restricting Unauthorized Access
❑ Providing Persistent Storage for Program Objects
❑ Providing Storage Structures and Search Techniques for Efficient Query
Processing
❑ Providing Backup and Recovery
❑ Providing Multiple User Interfaces
❑ Representing Complex Relationships among Data
❑ Enforcing Integrity Constraints
❑ Permitting Inferencing and Actions Using Rules and Triggers
❑ Additional Implications of Using the Database Approach
◼ Potential for Enforcing Standards
◼ Reduced Application Development Time
◼ Flexibility
◼ Availability of Up-to-Date Information
◼ Economies of Scale
12
When should we (not) use a DBMS?
◼ Should not
❑ Simple, well-defined database applications not expected to change at all
❑ Stringent, real-time requirements that may not be met because of DBMS
overhead
❑ Embedded systems with limited storage capacity
❑ No multiple-user access to data
13
14
Chapter 2
2
Contents
1 Disk Storage Devices
2 Files of Records
3 Operations on Files
4 Unordered Files and Ordered Files
5 Hashed Files
6 RAID Technology
3
Memory & Storage Hierarchy
primary storage
Secondary storage
4
Disk Storage Devices
◼ Preferred secondary storage device for high
storage capacity and low cost.
◼ Data stored as magnetized areas on
magnetic disk surfaces.
◼ A disk pack contains several magnetic disks
connected to a rotating spindle.
◼ Disks are divided into concentric circular
tracks on each disk surface.
❑ Track capacities typically vary from 10 to 150
Kbytes.
5
Disk Storage Devices (cont.)
6
Disk Storage Devices (cont.)
Track
Sector
Spindle
7
Disk Storage Devices (cont.)
◼ A track is divided into smaller sectors
❑ because a track usually contains a large amount
of information.
◼ A track is also divided into equal-sized blocks.
❑ The block size B is fixed for each system.
◼ Typical block sizes range from B=512 bytes to
B=8192 bytes.
❑ Whole blocks are transferred between disk and
main memory for processing.
8
Disk Storage Devices (cont.)
◼ A read-write head moves to the track that contains the
block to be transferred.
❑ Disk rotation moves the block under the read-write head for
reading or writing.
◼ A physical disk block (hardware) address consists of:
❑ a cylinder number (imaginary collection of tracks of same
radius from all recorded surfaces)
❑ the track number or surface number (within the cylinder)
❑ and block number (within track).
◼ Reading or writing a disk block is time consuming
because of the seek time s and rotational delay (latency)
rd.
◼ Double buffering can be used to speed up the transfer of
contiguous disk blocks.
9
10
Disk Storage Devices (cont.)
11
Double Buffering
12
Contents
1 Disk Storage Devices
2 Files of Records
3 Operations on Files
4 Unordered Files and Ordered Files
5 Hashed Files
6 RAID Technology
13
Records
◼ Fixed and variable length records.
◼ Records contain fields which have values of a
particular type.
❑ E.g., amount, date, time, age.
◼ Fields themselves may be fixed length or
variable length.
◼ Variable length fields can be mixed into one
record:
❑ Separator characters or length fields are needed
so that the record can be “parsed”.
14
Records (cont.)
15
Blocking
◼ Blocking: refers to storing a number of
records in one block on the disk.
◼ Blocking factor (bfr): refers to the number
of records per block.
◼ There may be empty space in a block if an
integral number of records do not fit in one
block.
◼ Spanned Records: records that are too large to fit
in one block and hence span more than one
block.
16
Blocking (cont.)
17
Files of Records
◼ A file is a sequence of records, where each record is
a collection of data values (or data items).
◼ A file descriptor (or file header) includes information
that describes the file, such as the field names and
their data types, and the addresses of the file blocks
on disk.
◼ Records are stored on disk blocks.
◼ The blocking factor bfr for a file is the (average)
number of file records stored in a disk block.
◼ A file can have fixed-length records or variable-
length records.
18
Files of Records (cont.)
◼ File records can be unspanned or spanned:
❑ Unspanned: no record can span two blocks
❑ Spanned: a record can be stored in more than one block
◼ The physical disk blocks that are allocated to hold the
records of a file can be contiguous, linked, or indexed.
◼ In a file of fixed-length records, all records have the
same format. Usually, unspanned blocking is used with
such files.
◼ Files of variable-length records require additional
information to be stored in each record, such as
separator characters and field types.
❑ Usually spanned blocking is used with such files.
19
Contents
1 Disk Storage Devices
2 Files of Records
3 Operations on Files
4 Unordered Files and Ordered Files
5 Hashed Files
6 RAID Technology
20
Operation on Files
Typical file operations include:
◼ OPEN: Prepares the file for access, and associates a
pointer that will refer to a current file record at each point
in time.
◼ FIND: Searches for the first file record that satisfies
a certain condition, and makes it the current file record.
◼ FINDNEXT: Searches for the next file record (from
the current record) that satisfies a certain condition, and
makes it the current file record.
◼ READ: Reads the current file record into a program
variable.
◼ INSERT: Inserts a new record into the file, and makes
it the current file record.
21
Operation on Files (cont.)
◼ DELETE: Removes the current file record from the
file, usually by marking the record to indicate that it
is no longer valid.
◼ MODIFY: Changes the values of some fields of the
current file record.
◼ CLOSE: Terminates access to the file.
◼ REORGANIZE: Reorganizes the file records. For
example, the records marked deleted are physically
removed from the file or a new organization of the
file records is created.
◼ READ_ORDERED: Read the file blocks in order of
a specific field of the file.
22
Contents
1 Disk Storage Devices
2 Files of Records
3 Operations on Files
4 Unordered Files and Ordered Files
5 Hashed Files
6 RAID Technology
23
Unordered Files
◼ Also called a heap or a pile file.
◼ New records are inserted at the end of the file.
◼ A linear search through the file records is
necessary to search for a record.
❑ This requires reading and searching half the file
blocks on the average, and is hence quite expensive.
◼ Record insertion is quite efficient.
◼ Reading the records in order of a particular field
requires sorting the file records.
24
Ordered Files
◼ Also called a sequential file.
◼ File records are kept sorted by the values of an ordering
field.
◼ Insertion is expensive: records must be inserted in the
correct order.
❑ It is common to keep a separate unordered overflow (or
transaction) file for new records to improve insertion efficiency;
this is periodically merged with the main ordered file.
◼ A binary search can be used to search for a record on
its ordering field value.
❑ This requires reading and searching about log2(b) of the file blocks
on the average (b being the number of file blocks), an improvement
over linear search.
◼ Reading the records in order of the ordering field is quite
efficient.
25
Ordered
Files (cont.)
26
Binary search
◼ Search for 27 in the ordered values 2 5 6 9 12 18 22 27 33:
❑ Step 1: low and high point to the first and last positions and mid to the
middle position (value 12); since 27 > 12, the search continues in the
upper half, and so on until 27 is found.
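To make the search concrete, here is a minimal Python sketch of binary search over the blocks of an ordered file. The block list, field values and capacity below are made up for illustration; a real implementation would read blocks from disk:

def binary_search_file(blocks, key):
    # blocks: list of blocks, each a list of records sorted on the ordering field
    low, high = 0, len(blocks) - 1
    while low <= high:
        mid = (low + high) // 2
        block = blocks[mid]
        if key < block[0]:
            high = mid - 1       # key can only be in an earlier block
        elif key > block[-1]:
            low = mid + 1        # key can only be in a later block
        else:
            return key in block  # this is the only block that could hold the key
    return False

blocks = [[2, 5, 6], [9, 12, 18], [22, 27, 33]]   # the ordered values above, 3 per block
print(binary_search_file(blocks, 27))             # True: found after reading 2 blocks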
Average Access Times
28
Contents
1 Disk Storage Devices
2 Files of Records
3 Operations on Files
4 Unordered Files and Ordered Files
5 Hashed Files
6 RAID Technology
29
Hashed Files
◼ Hashing for disk files is called External Hashing.
◼ The file blocks are divided into M equal-sized buckets,
numbered bucket0, bucket1, ..., bucketM-1.
❑ Typically, a bucket corresponds to one (or a fixed number of) disk
block.
◼ One of the file fields is designated to be the hash key of
the file.
◼ The record with hash key value K is stored in bucket i,
where i=h(K), and h is the hashing function.
◼ Search is very efficient on the hash key.
◼ Collisions occur when a new record hashes to a bucket
that is already full.
❑ An overflow file is kept for storing such records.
❑ Overflow records that hash to each bucket can be linked together.
30
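A minimal Python sketch of static external hashing as described above, with M buckets (each standing in for one disk block) and a simple per-bucket overflow list; the hash function, capacities and sample keys are invented for illustration:

M = 4                  # number of buckets
BUCKET_CAPACITY = 3    # records that fit in one bucket (disk block)

buckets = [[] for _ in range(M)]
overflow = [[] for _ in range(M)]          # overflow records kept per bucket

def h(key):
    return key % M                         # the hashing function h

def insert(key, record):
    i = h(key)
    if len(buckets[i]) < BUCKET_CAPACITY:
        buckets[i].append((key, record))   # normal case
    else:
        overflow[i].append((key, record))  # collision: bucket i is already full

def search(key):
    i = h(key)                             # search is efficient on the hash key
    for k, rec in buckets[i] + overflow[i]:
        if k == key:
            return rec
    return None

for ssn in [12, 25, 33, 16, 20, 48, 37]:
    insert(ssn, "employee-%d" % ssn)
print(search(20), search(99))              # employee-20 None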
Hashed Files (cont.)
31
Hashed Files (cont.)
There are numerous methods for collision resolution. The original figure on this
slide traces the insertion of 8, then 15, then 13 into a partially filled array,
resolving each collision by placing the new record in a subsequent free position
(open addressing).
32
Hashed Files (cont.)
◼ There are numerous methods for collision resolution,
including the following:
❑ Open addressing:
◼ Proceeding from the occupied position specified by the hash address, the
program checks the subsequent positions in order until an unused (empty)
position is found.
❑ Chaining:
◼ Various overflow locations are kept: extending the array with a
number of overflow positions.
◼ A pointer field is added to each record location.
◼ A collision is resolved by placing the new record in an unused
overflow location and setting the pointer of the occupied hash
address location to the address of that overflow location.
❑ Multiple hashing:
◼ The program applies a second hash function if the first results in
a collision.
◼ If another collision results, the program uses open addressing or
applies a third hash function and then uses open addressing if
necessary.
33
Hashed Files (cont.) - Overflow handling
34
Hashed Files (cont.)
◼ To reduce overflow records, a hash file is typically
kept 70-80% full.
◼ The hash function h should distribute the records
uniformly among the buckets;
❑ Otherwise, search time will be increased because many
overflow records will exist.
◼ Main disadvantages of static external hashing:
❑ Fixed number of buckets M is a problem if the number of
records in the file grows or shrinks.
❑ Ordered access on the hash key is quite inefficient
(requires sorting the records).
35
Dynamic And Extendible Hashed Files
◼ Dynamic and Extendible Hashing Techniques
❑ Hashing techniques are adapted to allow the dynamic
growth and shrinking of the number of file records.
❑ These techniques include the following: dynamic
hashing, extendible hashing, and linear hashing.
◼ Both dynamic and extendible hashing use the
binary representation of the hash value h(K) in
order to access a directory.
❑ In dynamic hashing the directory is a binary tree.
❑ In extendible hashing the directory is an array of size
2^d, where d is called the global depth.
36
Dynamic And Extendible Hashing (cont.)
◼ The directories can be stored on disk, and they
expand or shrink dynamically.
❑ Directory entries point to the disk blocks that contain
the stored records.
◼ An insertion in a disk block that is full causes the
block to split into two blocks and the records are
redistributed among the two blocks.
❑ The directory is updated appropriately.
◼ Dynamic and extendible hashing do not require
an overflow area.
◼ Linear hashing does require an overflow area
but does not use a directory.
❑ Blocks are split in linear order as the file expands.
37
Extendible
Hashing
38
Extendible Hashing – Example
Hash function: h(K) = K mod 32; h(K)B is the binary representation of h(K).
Each bucket holds a maximum of 2 records; d = global depth, d' = local depth.

Record   K      h(K)   h(K)B
r1       2657   1      00001
r2       3760   16     10000
r3       4692   20     10100
r4       4871   7      00111
r5       5659   27     11011
r6       1821   29     11101
r7       1074   18     10010
r8       2123   11     01011
r9       1620   20     10100
r10      2428   28     11100
r11      3943   7      00111
r12      4750   14     01110
r13      6975   31     11111
r14      4981   21     10101
r15      9208   24     11000

Initial state: d = 0, a single bucket (d' = 0) holding r1 (00001) and r2 (10000).
39
Extendible Hashing – Example (cont.)
The original figures step through the insertions, showing the directory and the bucket
contents (with their local depths d') after each step. The key transitions are:
❑ Insert r3 (10100): the single bucket overflows; it is split and the directory doubles
to d = 1 (buckets {r1} and {r2, r3}).
❑ Insert r4 (00111): fits in the 0-bucket: {r1, r4}.
❑ Insert r5 (11011): the 1-bucket overflows; it is split and the directory doubles to
d = 2.
❑ Insert r6 (11101): fits in the 11-bucket: {r5, r6}.
❑ Insert r7 (10010): the 10-bucket overflows; it is split and the directory doubles to
d = 3 (buckets {r2, r7} and {r3}).
❑ Insert r8 (01011): the 0-bucket {r1, r4} overflows; it is split (no directory doubling)
into the 00-bucket {r1, r4} and the 01-bucket {r8}, both with d' = 2.
❑ Insert r9 (10100): fits in the 101-bucket with r3.
❑ Insert r10 (11100): the 11-bucket {r5, r6} overflows; it is split into {r5} and
{r6, r10}, both with d' = 3.
❑ Insert r11 (00111): the 00-bucket {r1, r4} overflows; it is split into {r1} and
{r4, r11}, both with d' = 3.
❑ Insert r12 (01110): fits in the 01-bucket with r8 (d' = 2).
❑ Inserts r13 (11111) and r14 (10101): further overflows cause splits and the
directory doubles to d = 4.
❑ Insert r15 (11000): fits in an existing bucket without further splitting; the final
directory has global depth d = 4.
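The following Python sketch mirrors the mechanism traced above: the directory is indexed by the leading bits of h(K) = K mod 32, buckets hold 2 records, and an overflow splits the bucket, doubling the directory when needed. It is an illustrative in-memory model only (not a disk-based implementation, and it assumes that no more than 2 records share the full 5-bit hash value); the class and variable names are invented here:

HASH_BITS = 5          # h(K) is viewed as a 5-bit binary string, as on the slides
BUCKET_CAPACITY = 2    # at most 2 records per bucket

class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.records = []                       # (key, record) pairs

class ExtendibleHashFile:
    def __init__(self):
        self.global_depth = 0
        self.directory = [Bucket(0)]            # 2**global_depth entries

    def _hash_bits(self, key):
        return format(key % (1 << HASH_BITS), "0%db" % HASH_BITS)

    def _dir_index(self, key):
        if self.global_depth == 0:
            return 0
        return int(self._hash_bits(key)[:self.global_depth], 2)

    def insert(self, key, record):
        bucket = self.directory[self._dir_index(key)]
        if len(bucket.records) < BUCKET_CAPACITY:
            bucket.records.append((key, record))
            return
        # Overflow: split the bucket, doubling the directory first if needed.
        if bucket.local_depth == self.global_depth:
            self.directory = [b for b in self.directory for _ in (0, 1)]
            self.global_depth += 1
        old_depth = bucket.local_depth
        bucket.local_depth += 1
        new_bucket = Bucket(bucket.local_depth)
        # Directory entries whose bit number old_depth (counting from the left)
        # is 1 now point to the new bucket.
        for i, b in enumerate(self.directory):
            if b is bucket and (i >> (self.global_depth - 1 - old_depth)) & 1:
                self.directory[i] = new_bucket
        # Redistribute the old records plus the new one (this may split again).
        pending = bucket.records + [(key, record)]
        bucket.records = []
        for k, r in pending:
            self.insert(k, r)

f = ExtendibleHashFile()
for k in [2657, 3760, 4692, 4871, 5659, 1821, 1074, 2123]:   # r1 .. r8
    f.insert(k, "record")
print(f.global_depth)    # 3, matching the directory state after inserting r8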
Linear Hashing – Example
◼ M = 4 initial buckets, h0(K) = K mod 4, h1(K) = K mod 8; each bucket holds 3 records.
◼ Initialization (split pointer = 0):
bucket 0: {4, 8}   bucket 1: {5, 9, 13}   bucket 2: {6}   bucket 3: {7, 11}
◼ Insert 17 (17 mod 4 = 1): bucket 1 is full, so 17 goes into its overflow chain, and the
bucket at the split pointer (bucket 0) is split using h1: 8 → bucket 0, 4 → new bucket 4.
Split pointer moves to 1.
bucket 0: {8}   bucket 1: {5, 9, 13} + overflow {17}   bucket 2: {6}
bucket 3: {7, 11}   bucket 4: {4}
◼ Insert 15 (15 mod 4 = 3): bucket 3 has room: {7, 11, 15}.
◼ Insert 3 (3 mod 4 = 3): bucket 3 overflows, so bucket 1 (at the split pointer) is split and
its records plus its overflow records are redistributed with h1: 5 → 5, 9 → 1, 13 → 5,
17 → 1. Split pointer moves to 2.
bucket 0: {8}   bucket 1: {9, 17}   bucket 2: {6}   bucket 3: {7, 11, 15} + overflow {3}
bucket 4: {4}   bucket 5: {5, 13}
◼ Insert 23 (23 mod 4 = 3): bucket 3 overflows again, so bucket 2 is split: 6 → bucket 6.
Split pointer moves to 3.
bucket 0: {8}   bucket 1: {9, 17}   bucket 2: {}   bucket 3: {7, 11, 15} + overflow {3, 23}
bucket 4: {4}   bucket 5: {5, 13}   bucket 6: {6}
◼ Insert 31 (31 mod 4 = 3): bucket 3 overflows again, so bucket 3 itself (now at the split
pointer) is split and its records plus overflow records are redistributed with h1:
7 → 7, 11 → 3, 15 → 3, 3 → 3, 23 → 7, 31 → 7. The file now has 8 buckets and the split
pointer returns to 0 for the next round.
bucket 0: {8}   bucket 1: {9, 17}   bucket 2: {}   bucket 3: {11, 15, 3}
bucket 4: {4}   bucket 5: {5, 13}   bucket 6: {6}   bucket 7: {7, 23, 31}
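A small Python sketch of the same linear-hashing behaviour (M = 4, hash functions K mod (M * 2^round), bucket capacity 3, and one split of the bucket at the split pointer per overflow). It is an illustrative in-memory model; the class name and record values are taken from or invented around the example:

class LinearHashFile:
    def __init__(self, m=4, capacity=3):
        self.m = m                       # buckets at the start of the current round
        self.capacity = capacity
        self.split_pointer = 0
        self.round = 0                   # uses K mod (m * 2**round) and K mod (m * 2**(round + 1))
        self.buckets = [[] for _ in range(m)]
        self.overflow = {}               # bucket number -> overflow records

    def _bucket_for(self, key):
        i = key % (self.m * 2 ** self.round)
        if i < self.split_pointer:       # this bucket was already split in this round
            i = key % (self.m * 2 ** (self.round + 1))
        return i

    def insert(self, key):
        i = self._bucket_for(key)
        if len(self.buckets[i]) < self.capacity:
            self.buckets[i].append(key)
            return
        self.overflow.setdefault(i, []).append(key)   # chain the record ...
        self._split()                                 # ... and split at the split pointer

    def _split(self):
        s = self.split_pointer
        records = self.buckets[s] + self.overflow.pop(s, [])
        self.buckets[s] = []
        self.buckets.append([])                       # the new image bucket
        h_next = self.m * 2 ** (self.round + 1)
        for k in records:     # redistribute with the next hash function (capacity not re-checked here)
            self.buckets[k % h_next].append(k)
        self.split_pointer += 1
        if self.split_pointer == self.m * 2 ** self.round:
            self.split_pointer = 0                    # a full round completed
            self.round += 1

f = LinearHashFile()
for k in [4, 8, 5, 9, 13, 6, 7, 11]:     # initial contents from the example
    f.insert(k)
for k in [17, 15, 3, 23, 31]:            # the insertions traced above
    f.insert(k)
print(f.buckets)   # [[8], [9, 17], [], [11, 15, 3], [4], [5, 13], [6], [7, 23, 31]]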
Contents
1 Disk Storage Devices
2 Files of Records
3 Operations on Files
4 Unordered Files and Ordered Files
5 Hashed Files
6 RAID Technology
61
Parallelizing Disk Access using RAID
Technology.
◼ Secondary storage technology must take steps to
keep up in performance and reliability with
processor technology.
◼ A major advance in secondary storage technology is
represented by the development of RAID, which
originally stood for Redundant Arrays of
Inexpensive Disks.
◼ The main goal of RAID is to even out the widely
different rates of performance improvement of disks
against those in memory and microprocessors.
62
RAID Technology (cont.)
◼ A natural solution is a large array of small independent
disks acting as a single higher-performance logical disk.
A concept called data striping is used, which utilizes
parallelism to improve disk performance.
◼ Data striping distributes data transparently over multiple
disks to make them appear as a single large, fast disk.
63
RAID Technology (cont.)
◼ Different RAID organizations were defined based on
different combinations of two factors: the granularity of
data interleaving (striping) and the pattern used to compute
redundant information.
❑ RAID level 0 has no redundant data and hence has the best write
performance.
❑ RAID level 1 uses mirrored disks.
❑ RAID level 2 uses memory-style redundancy by using Hamming
codes, which contain parity bits for distinct overlapping subsets of
components. Level 2 includes both error detection and correction.
64
RAID Technology (cont.)
❑ RAID level 3 uses a single parity disk, relying on the disk controller to
figure out which disk has failed.
❑ RAID levels 4 and 5 use block-level data striping, with level 5 distributing
data and parity information across all disks.
65
RAID Technology (cont.)
❑ RAID level 6 applies the so-called P + Q redundancy scheme using
Reed-Solomon codes to protect against up to two disk failures by using
just two redundant disks.
66
Use of RAID Technology (cont.)
◼ Different RAID organizations are being used under
different situations:
❑ RAID level 1 (mirrored disks) is the easiest for rebuilding a disk
from the other disks.
◼ It is used for critical applications such as logs.
❑ RAID level 2 uses memory-style redundancy by using Hamming
codes, which contain parity bits for distinct overlapping subsets
of components. Level 2 includes both error detection and
correction.
❑ RAID level 3 (a single parity disk, relying on the disk controller to
figure out which disk has failed) and level 5 (block-level data
striping) are preferred for large-volume storage, with level 3
giving higher transfer rates.
❑ The most popular uses of RAID technology currently are: level 0
(with striping), level 1 (with mirroring), and level 5 with an extra
drive for parity.
❑ Design decisions for RAID include: level of RAID, number of
disks, choice of parity schemes, and grouping of disks for block-
level striping.
67
Storage Area Networks
◼ The demand for higher storage has risen considerably
in recent times.
◼ Organizations have a need to move from a static fixed
data center oriented operation to a more flexible and
dynamic infrastructure for information processing.
◼ Thus they are moving to a concept of Storage Area
Networks (SANs).
❑ In a SAN, online storage peripherals are configured as
nodes on a high-speed network and can be attached and
detached from servers in a very flexible manner.
◼ This allows storage systems to be placed at longer
distances from the servers and provide different
performance and connectivity options.
68
Storage Area Networks (cont.)
◼ Advantages of SANs are:
❑ Flexible many-to-many connectivity among servers and
storage devices using Fibre Channel hubs and switches.
❑ Up to 10km separation between a server and a storage
system using appropriate fiber optic cables.
❑ Better isolation capabilities allowing nondisruptive addition
of new peripherals and servers.
◼ SANs face the problem of:
❑ combining storage options from multiple vendors
❑ dealing with evolving standards of storage management
software and hardware.
69
Review questions
1) What is the difference between a file organization and an
access method?
2) What is the difference between static and dynamic files?
3) What are the typical record-at-a-time operations for accessing
a file? Which of these depend on the current record of a file?
4) Discuss the advantages and disadvantages of (a) unordered
file, (b) ordered file, and (c) static hash file with buckets and
chaining. Which operations can be performed efficiently on
each of these organizations, and which operations are
expensive?
5) Discuss the techniques for allowing a hash file to expand and
shrink dynamically. What are the advantages and
disadvantages of each?
70
71
Chapter 3
2
Contents
3
Single-level index introduction
◼ A single-level index is an auxiliary file that
makes it more efficient to search for a record in
the data file.
◼ The index is usually specified on one field of the
file (although it could be specified on several
fields).
◼ One form of an index is a file of entries <field
value, pointer to record>, which is ordered by
field value.
◼ The index is called an access path on the field.
4
Single-level index introduction (cont.)
◼ The index file usually occupies considerably fewer
disk blocks than the data file because its entries
are much smaller.
◼ A binary search on the index yields a pointer to
the file record.
◼ Indexes can also be characterized as dense or
sparse:
❑ A dense index has an index entry for every search key
value (and hence every record) in the data file.
❑ A sparse (or nondense) index, on the other hand, has
index entries for only some of the search values
5
Example 1
Given the following data file:
EMPLOYEE(NAME, SSN, ADDRESS, JOB, SAL, ... )
Suppose that:
◼ Record size R = 150 bytes, block size B = 512 bytes, r = 30,000 records
◼ SSN field size VSSN = 9 bytes, record pointer size PR = 7 bytes
Then, we get:
◼ Blocking factor: bfr = ⌊B/R⌋ = ⌊512/150⌋ = 3 records/block
◼ Number of blocks needed for the file: b = ⌈r/bfr⌉ = ⌈30,000/3⌉ = 10,000 blocks
◼ Primary Indexes
◼ Clustering Indexes
◼ Secondary Indexes
7
Primary Index
8
[Figure: a primary index on the ordering (primary) key field, with one <K(i), P(i)>
entry per data-file block, K(i) being the key of the block's first (anchor) record.]
Primary Index
◼ Dense or Nondense?
❑ Nondense
10
Clustering Index
11
[Figure: a clustering index on the nonkey clustering field Dept_No of a data file with
fields (Dept_No, Name, DoB, Salary, Sex); the index file holds one <K(i), P(i)> entry
per distinct clustering field value, pointing to the first block containing records with
that value.]
Clustering Index
◼ Dense or Nondense?
❑ Nondense
14
Secondary index
◼ A secondary index provides a secondary means of
accessing a file.
❑ The data file is unordered on indexing field.
◼ Indexing field:
❑ secondary key (unique value)
❑ nonkey (duplicate values)
15
[Figure: a dense secondary index (<K(i), P(i)> entries) on a secondary key field of an
unordered data file, with one index entry per record.]
◼ Dense or Nondense?
❑ Dense
17
Secondary index on non-key field
◼ Discussion: Structure of Secondary index on non-
key field?
◼ Option 1: include duplicate index entries with the
same K(i) value - one for each record.
◼ Option 2: keep a list of pointers <P(i, 1), ..., P(i, k)>
in the index entry for K(i).
◼ Option 3:
❑ more commonly used.
❑ one entry for each distinct index field value + an extra
level of indirection to handle the multiple pointers.
18
[Figure: option 3 for a secondary index on a nonkey indexing field; the index file holds
one <field value, block pointer> entry per distinct value, each pointing to a block of
record pointers that in turn point to the data-file records with that value.]
◼ Dense or Nondense?
❑ Dense/ nondense
20
Summary of Single-level indexes
◼ Dense index?
❑ Secondary index
◼ Nondense index?
❑ Primary index
❑ Clustering index
❑ Secondary index
22
Summary of Single-level indexes
23
Example 2
Given the following data file:
EMPLOYEE(NAME, SSN, ADDRESS, JOB, SAL, ... )
Suppose that:
◼ Record size R = 150 bytes, block size B = 512 bytes, r = 30,000 records
◼ SSN field size VSSN = 9 bytes, block pointer size P = 6 bytes
Then, we get:
◼ Blocking factor: bfr = ⌊B/R⌋ = ⌊512/150⌋ = 3 records/block
◼ Number of blocks needed for the file: b = ⌈r/bfr⌉ = ⌈30,000/3⌉ = 10,000 blocks
24
Contents
25
Multi-Level Indexes
◼ Because a single-level index is an ordered file, we
can create a primary index to the index itself.
❑ The original index file is called the first-level index and the
index to the index is called the second-level index.
◼ We can repeat the process, creating a third, fourth,
..., top level until all entries of the top level fit in
one disk block.
◼ A multi-level index can be created for any type of
first-level index (primary, secondary, clustering) as
long as the first-level index consists of more than
one disk block.
26
A two-level primary
index resembling
ISAM (Indexed
Sequential Access
Method)
organization.
27
Example 3
Given the following data file:
EMPLOYEE(NAME, SSN, ADDRESS, JOB, SAL, ... )
Suppose that:
◼ Record size R=150 bytes, block size B=512 bytes, r=30000 records
◼ SSN Field size VSSN=9 bytes, block pointer size P=6 bytes
Then, we get:
◼ Blocking factor: bfr= B/R = 512/150 = 3 records/block
◼ Number of blocks needed for the file: b= r/bfr= 30000/3 = 10000 blocks
For a primary index on the ordering key field SSN (Example 2):
◼ Index entry size: Ri = (VSSN + P) = (9 + 6) = 15 bytes
◼ Index blocking factor: bfri = ⌊B/Ri⌋ = ⌊512/15⌋ = 34 entries/block
◼ Number of blocks for the index file: bi = ⌈b/bfri⌉ = ⌈10,000/34⌉ = 295 blocks
◼ Searching for and retrieving a record needs: ⌈log2 bi⌉ + 1 = ⌈log2 295⌉ + 1 = 10 block
accesses
For a multilevel index on the ordering key field SSN:
◼ Index blocking factor: bfri = ⌊B/Ri⌋ = ⌊512/15⌋ = 34 entries/block
o This is the fan-out fo of the multilevel index.
◼ Number of 1st-level index blocks: b1 = 295 blocks
◼ Number of 2nd-level index blocks: b2 = ⌈b1/fo⌉ = ⌈295/34⌉ = 9 blocks
◼ Number of 3rd-level index blocks: b3 = ⌈b2/fo⌉ = ⌈9/34⌉ = 1 block → top level
◼ Number of levels of this multilevel index: x = 3 levels
◼ Searching for and retrieving a record needs: x + 1 = 4 block accesses
28
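The arithmetic of Example 3 can be scripted; the short Python sketch below recomputes the index size and the number of levels. Variable names follow the slide, and the parameters are the ones assumed in the example:

import math

B, R, r = 512, 150, 30000        # block size, record size, number of records
V_ssn, P = 9, 6                  # search field size, block pointer size

bfr = B // R                     # 3 records per data block
b = math.ceil(r / bfr)           # 10,000 data blocks
Ri = V_ssn + P                   # 15-byte index entries
fo = B // Ri                     # fan-out: 34 index entries per block

level_blocks = []
blocks = math.ceil(b / fo)       # first-level index: one entry per data block -> 295 blocks
while True:
    level_blocks.append(blocks)
    if blocks == 1:
        break
    blocks = math.ceil(blocks / fo)

x = len(level_blocks)
print(level_blocks)                           # [295, 9, 1]
print("block accesses per lookup:", x + 1)    # 3 index levels + 1 data block = 4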
31
32
33
34
35
36
37
Multi-Level Indexes
38
Contents
39
Dynamic Multilevel Indexes Using B-
Trees and B+-Trees
◼ Most multi-level indexes use B-tree or B+-tree data
structures to handle the insertion and deletion
problem:
❑ space is left in each tree node (disk block) to allow
for new index entries.
◼ These data structures are variations of search trees
that allow efficient insertion and deletion of new
search values.
◼ In B-Tree and B+-Tree data structures, each node
corresponds to a disk block.
◼ Each node is kept between half-full and completely
full.
40
Dynamic Multilevel Indexes Using B-
Trees and B+-Trees (cont.)
◼ An insertion into a node that is not full is quite
efficient.
❑ If a node is full, the insertion causes a split into
two nodes.
◼ Splitting may propagate to other tree levels.
◼ A deletion is quite efficient if a node does not
become less than half full.
◼ If a deletion causes a node to become less than
half full, it must be merged with neighboring
nodes.
41
Difference between B-tree and B+-tree
42
B-tree Structures
43
The Nodes of a B+-Tree
44
The Nodes of a B+-Tree (cont.)
45
Example 4: Calculate the order of a B-tree
◼ Suppose that:
❑ Search field V = 9 bytes, disk block size B = 512 bytes
❑ Record (data) pointer Pt = 7 bytes, block pointer is P = 6 bytes.
◼ Each B-tree node can have at most p tree pointers, p – 1
data pointers, and p – 1 search key field values.
◼ These must fit into a single disk block if each B-tree node is to
correspond to a disk block:
(p * P) + ((p - 1) * (Pt + V)) ≤ B
(p * 6) + ((p - 1) * (7 + 9)) ≤ 512
(22 * p) ≤ 528
◼ We can choose p to be the largest value satisfying the above
inequality, which gives p = 23 (p = 24 is not chosen, to leave room
for additional node information).
46
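A one-line Python check of the Example 4 inequality (symbols as on the slide):

B, V, Pt, P = 512, 9, 7, 6
# p*P + (p - 1)*(Pt + V) <= B  is equivalent to  p <= (B + Pt + V) / (P + Pt + V)
p = (B + Pt + V) // (P + Pt + V)
print(p)   # 24; the slide settles on p = 23 to leave room for additional node information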
Example 5: Calculate approximate number
of entries of a B-tree
◼ Suppose that:
❑ Search field of Example 3 is a non-ordering key field, and we construct a B-Tree on
this field.
❑ Each node of the B-tree is 69 percent full.
◼ Each node, on the average, will have: p * 0.69 = 23 * 0.69 = 15.87 ≈ 16
pointers → 15 search key field values.
◼ The average fan-out fo = 16. We can start at the root and see how many
values and pointers can exist, on the average, at each subsequent level:
Level Nodes Index entries Pointers
Root: 1 node 15 entries 16 pointers
Level 1: 16 nodes 240 entries 256 pointers
Level 2: 256 nodes 3840 entries 4096 pointers
Level 3: 4096 nodes 61,440 entries
◼ At each level, number of entries = the total number of pointers at the
previous level * the average number of entries in each node.
◼ A two-level B-tree holds 3840+240+15 = 4095 entries on the average; a
three-level B-tree holds 65,535 entries on the average. 47
Example 6: Calculate the order of a B+-tree
◼ Suppose that:
❑ Search key field V=9 bytes, block size B=512bytes
❑ Record pointer is Pr = 7bytes, block pointer is P = 6bytes.
◼ An internal node of the B+-tree can have up to p tree pointers and
p - 1 search field values; these must fit into a single block. Hence,
we have:
(p * P) + ((p - 1) * V) ≤ B
(p * 6) + ((p - 1) * 9) ≤ 512
(15 * p) ≤ 521
◼ We can choose p to be the largest value satisfying the above
inequality, which gives p = 34.
48
Example 6: Calculate the order of a B+-tree
(cont.)
◼ The leaf nodes of the B+-tree will have the same number of
values and pointers, except that the pointers are data
pointers and a next pointer. Hence, the order pleaf for the
leaf nodes can be calculated as follows:
(pleaf * (Pr + V)) + P ≤ B
(pleaf * (7 + 9)) + 6 ≤ 512
(16 * pleaf) ≤ 506
◼ It follows that each leaf node can hold up to pleaf = 31 key
value/data pointer combinations, assuming that the data
pointers are record pointers.
49
Example 7: Calculate approximate number
of entries of a B+-tree
◼ Suppose that we construct a B+-Tree on the field of Example 6:
❑ Search key field V = 9 bytes, block size B = 512bytes
❑ Record pointer is Pr = 7bytes, block pointer is P = 6bytes.
❑ Each node is 69 percent full.
◼ On the average, each internal node will have 34 * 0.69 ≈ 23.46, or
approximately 23 pointers, and hence 22 values.
◼ Each leaf node, on the average, will hold 0.69 * pleaf = 0.69 * 31 ≈ 21.39, or
approximately 21 data record pointers.
◼ A B+-tree will have the following average number of entries at each level:
Level        Nodes          Index entries                  Pointers
Root         1 node         22 entries                     23 pointers
Level 1      23 nodes       23 * 22 = 506 entries          23^2 = 529 pointers
Level 2      529 nodes      529 * 22 = 11,638 entries      23^3 = 12,167 pointers
Leaf level   12,167 nodes   12,167 * 21 = 255,507 entries
◼ A 3-level B+-tree holds up to 255,507 record pointers, on the average.
◼ Compare this to the 65,535 entries of the corresponding B-tree in Example 5.
50
B+-Tree: Insert entry
51
B+-Tree: Insert entry (cont.)
52
Example of insertion in B+-tree
p = 3 and pleaf = 2
53
Example of insertion in B+-tree (cont.)
p = 3 and pleaf = 2
54
Example of insertion in B+-tree (cont.)
p = 3 and pleaf = 2
55
Example of insertion in B+-tree (cont.)
p = 3 and pleaf = 2
56
Example of insertion in B+-tree (cont.)
57
Example of insertion in B+-tree (cont.)
58
Example of insertion in B+-tree (cont.)
59
B+-Tree: Delete entry
◼ Remove the entry from the leaf node.
◼ If it happens to occur in an internal node:
❑ Remove.
❑ The value to its left in the leaf node must replace it in the internal
node.
◼ Deletion may cause underflow in leaf node:
❑ Try to find a sibling leaf node – a leaf node directly to the left or to
the right of the node with underflow.
❑ Redistribute the entries among the node and its siblings.
(Common method: The left sibling first and the right sibling later)
❑ If redistribution fails, the node is merged with its sibling.
❑ If merge occurred, must delete entry (pointing to node and
sibling) from parent node.
60
B+-Tree: Delete entry (cont.)
61
Example of deletion from B+-tree
p = 3 and pleaf = 2.
Delete 5
62
Example of deletion from B+-tree (cont.)
P = 3 and pleaf = 2.
63
Example of deletion from B+-tree (cont.)
p = 3 and pleaf = 2.
Delete 9:
Underflow (merge with left, redistribute)
64
Example of deletion from B+-tree (cont.)
p = 3 and pleaf = 2.
65
Search using B-trees and B+-trees
[Figure: tracing the search for K = 8: at the root, 5 < 8, so the search follows the
right-hand pointer; at the next level, 7 < 8 ≤ 8, so it follows the corresponding pointer;
the key 8 is then found in the leaf.]
66
Search using B-trees and B+-trees
◼ Search conditions on indexing attributes
❑ =, <, >, ≤, ≥, between, MINIMUM value, MAXIMUM
value
◼ Search results
❑ Zero, one, or many data records
◼ Search cost
❑ B-trees
◼ From 1 to (1 + the number of tree levels) + data accesses
❑ B+-trees
◼ 1 (root level) + the number of tree levels + data accesses
67
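As a concrete illustration of the search just described, here is a minimal Python sketch of B+-tree lookup: internal nodes are descended using the convention on the slides (values less than or equal to Ki lie in the subtree at/left of Ki), and the leaf is scanned for the key. The node layout and the tiny example tree are simplified assumptions, not the structure of any particular system:

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    keys: List[int] = field(default_factory=list)
    children: List["Node"] = field(default_factory=list)   # empty for leaf nodes
    records: List[object] = field(default_factory=list)    # data pointers (leaves only)

def bplus_search(root: Node, key: int) -> Optional[object]:
    node = root
    while node.children:                      # internal node: choose the subtree
        i = 0
        while i < len(node.keys) and key > node.keys[i]:
            i += 1
        node = node.children[i]               # values <= keys[i] live in children[i]
    for k, rec in zip(node.keys, node.records):   # leaf node: scan for the key
        if k == key:
            return rec
    return None

leaf_a = Node(keys=[5, 7, 8], records=["rec5", "rec7", "rec8"])
leaf_b = Node(keys=[9, 12], records=["rec9", "rec12"])
root = Node(keys=[8], children=[leaf_a, leaf_b])
print(bplus_search(root, 8))    # rec8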
Contents
69
Indexes on Multiple Keys
◼ In many retrieval and update requests, multiple
attributes are involved.
◼ If a certain combination of attributes is used
frequently, it is advantageous to set up an access
structure to provide efficient access by a key value
that is a combination of those attributes.
◼ If an index is created on attributes <A1, A2, … , An>,
the search key values are tuples with n values: <v1,
v2, … , vn>.
◼ A lexicographic ordering of these tuple values
establishes an order on this composite search key.
◼ An index on a composite key of n attributes works
similarly to any index discussed so far.
70
Contents
71
Other File Indexes
◼ Hash indexes
❑ The hash index is a secondary structure to access the
file by using hashing on a search key other than the one
used for the primary data file organization.
◼ Bitmap indexes
❑ A bitmap index is built on one particular value of a
field (the column in a table) with respect to all the rows
(records) and is an array of bits.
◼ Function-based indexes
❑ In Oracle, an index such that the value that results from
applying a function (expression) on a field or some fields
becomes the key to the index
72
Other File Indexes
◼ Hash indexes
❑ The hash index is a secondary structure to
access the file by using hashing on a search
key other than the one used for the primary
data file organization.
◼ access structures similar to indexes, based on
hashing
❑ Support for equality searches on the hash
field
73
Hash indexes
74
[Figure: a hash index on Emp_id, using as hashing function the sum of the digits of
Emp_id modulo 10.]
75
Bitmap indexes
◼ A bitmap index is built on one particular value
of a field (the column in a table) with respect to
all the rows (records) and is an array of bits.
❑ Each bit in the bitmap corresponds to a row. If the bit is
set, then the row contains the key value.
◼ In a bitmap index, each indexing field value is
associated with pointers to multiple rows.
◼ Bitmap indexes are primarily designed for data
warehousing or environments in which queries
reference many columns in an ad hoc fashion.
❑ The number of distinct values of the indexed field is
small compared to the number of rows.
❑ The indexed table is either read-only or not subject to
significant modification by DML statements.
76
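A small Python sketch of the idea: one bit vector per distinct value of the indexed column, with one bit per row. The sample rows and column names are invented for illustration; real systems compress the bitmaps:

from collections import defaultdict

def build_bitmap_index(rows, column):
    # returns {value: bit list}, where bit i is 1 iff row i has that value
    index = defaultdict(lambda: [0] * len(rows))
    for row_id, row in enumerate(rows):
        index[row[column]][row_id] = 1
    return dict(index)

employees = [
    {"Name": "A", "Dept_No": 1}, {"Name": "B", "Dept_No": 2},
    {"Name": "C", "Dept_No": 1}, {"Name": "D", "Dept_No": 3},
]
idx = build_bitmap_index(employees, "Dept_No")
print(idx[1])     # [1, 0, 1, 0]: rows 0 and 2 have Dept_No = 1
# ANDing/ORing such bit vectors answers multi-condition queries before touching any row.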
Bitmap indexes
77
Bitmap indexes
78
Function-based indexes
◼ The use of any function on a column prevents the
index defined on that column from being used.
❑ Indexes are only used with some specific search
conditions on indexed columns.
79
Function-based indexes
80
Contents
81
Index Creation
CREATE [ UNIQUE ] INDEX <index name>
ON <table name> ( <column name> [ <order> ] { , <column name> [ <order> ] } )
[ CLUSTER ] ;
82
B-tree index in Oracle 19c
83
B-tree for a clustered index in MS
SQL Server
84
Review questions
1) Define the following terms: indexing field, primary key field, clustering
field, secondary key field, block anchor, dense index, and nondense
(sparse) index.
2) What are the differences among primary, secondary, and clustering
indexes? How do these differences affect the ways in which these
indexes are implemented? Which of the indexes are dense, and which
are not?
3) Why can we have at most one primary or clustering index on a file, but
several secondary indexes?
4) How does multilevel indexing improve the efficiency of searching an
index file?
5) What is the order p of a B-tree? Describe the structure of B-tree nodes.
6) What is the order p of a B+-tree? Describe the structure of both internal
and leaf nodes of a B+-tree.
7) How does a B-tree differ from a B+-tree? Why is a B+-tree usually
preferred as an access structure to a data file?
85
86
Chapter 4
2
1. Introduction to Query Processing
3
Typical steps when processing a high-level query
4
COMPANY Database Schema
5
2. Translating SQL Queries into Relational
Algebra (1)
◼ Query block: the basic unit that can be translated into the algebraic
operators and optimized.
◼ A query block contains a single SELECT-FROM-WHERE expression,
as well as GROUP BY and HAVING clause if these are part of the
block.
◼ Nested queries within a query are identified as separate query blocks.
◼ Aggregate operators in SQL must be included in the extended algebra.
6
Translating SQL Queries into Relational Algebra (2)
◼ External sorting : refers to sorting algorithms that are suitable for large
files of records stored on disk that do not fit entirely in main memory, such
as most database files.
◼ Sort-Merge strategy : starts by sorting small subfiles (runs ) of the main
file and then merges the sorted runs, creating larger sorted subfiles that
are merged in turn.
– Sorting phase: nR = ⌈b / nB⌉
– Merging phase: dM = min(nB - 1, nR); nP = ⌈log_dM(nR)⌉
where nR = number of initial runs, b = number of file blocks,
nB = available buffer space (in blocks), dM = degree of merging,
nP = number of passes.
8
Algorithms for External Sorting (2)
9
Algorithms for External Sorting (3)
/* Merge phase: merge subfiles until only 1 remains */
set i ← 1;
p ← ⌈log_(k-1)(m)⌉; /* p is the number of passes for the merging phase */
j ← m; /* the number of runs */
while (i ≤ p) do
{
    n ← 1;
    q ← ⌈j/(k-1)⌉; /* the number of runs to write in this pass */
    while (n ≤ q) do
    {
        read next k-1 subfiles or remaining subfiles (from previous pass) one block at a time;
        merge and write as new subfile one block at a time;
        n ← n + 1;
    }
    j ← q;
    i ← i + 1;
}
The number of block accesses for the merge phase = 2 * (b * ⌈log_dM(nR)⌉).
10
Example of External Sorting (1)
1 block = 2 records
15 22 2 27 14 6 51 18 35 16 50 36 9 8 32 12 11 33 30 30 23 21 24 29
Sort phase:
Read 3 blocks of the file → sort.
→ run: 3 blocks
11
Example of External Sorting (2)
Sort phase:
15 22 2 27 14 6 51 18 35 16 50 36 9 8 32 12 11 33 30 30 23 21 24 29
15 22 2 27 14 6 2 6 14 15 22 27
2 6 14 15 22 27
1 run
12
Example of External Sorting (3)
Sort phase
15 22 2 27 14 6 51 18 35 16 50 36 9 8 32 12 11 33 30 30 23 21 24 29
2 6 14 15 22 27 16 18 35 36 50 51 8 9 11 12 32 33 20 21 23 24 29 30
13
Example of External Sorting (4)
2 6 14 15 22 27 16 18 35 36 50 51 8 9 11 12 32 33 20 21 23 24 29 30
Merge phase:
Each step:
- Read 1 block from (nB - 1) runs to buffer
- Merge → temp block
- If temp block full: write to file
- If any empty block: Read next block from
corresponding run
14
Example of External Sorting (5)
Merge phase: Pass 1
2 6 14 15 22 27 16 18 35 36 50 51 8 9 11 12 32 33 20 21 23 24 29 30
2 6 16 18 6 16 18 2 16 18 2 6
15
Example of External Sorting (6)
Merge phase: Pass 1
2 6 14 15 22 27 16 18 35 36 50 51 8 9 11 12 32 33 20 21 23 24 29 30
14 15 16 18 15 16 18 14 16 18 14 15
2 6
16
Example of External Sorting (7)
Merge phase: Pass 1
2 6 14 15 22 27 16 18 35 36 50 51 8 9 11 12 32 33 20 21 23 24 29 30
22 27 16 18 22 27 18 16 22 27 16 18
2 6 14 15
17
Example of External Sorting (8)
Merge phase: Pass 1
2 6 14 15 22 27 16 18 35 36 50 51 8 9 11 12 32 33 20 21 23 24 29 30
22 27 35 36 27 35 36 22 35 36 22 27
2 6 14 15 16 18
18
Example of External Sorting (9)
Merge phase: Pass 1
2 6 14 15 22 27 16 18 35 36 50 51 8 9 11 12 32 33 20 21 23 24 29 30
35 36
Temp block
2 6 14 15 16 18 22 27
19
Example of External Sorting (10)
Merge phase: Pass 1
2 6 14 15 22 27 16 18 35 36 50 51 8 9 11 12 32 33 20 21 23 24 29 30
Temp block
1 new run
2 6 14 15 16 18 22 27 35 36 50 51
20
Example of External Sorting (11)
Merge phase: Pass 2
2 6 14 15 16 18 22 27 35 36 50 51 8 9 11 12 20 21 23 24 29 30 32 33
2 6 8 9 6 8 9 2 8 9 2 6
21
Example of External Sorting (12)
Merge phase: Pass 2
2 6 14 15 16 18 22 27 35 36 50 51 8 9 11 12 20 21 23 24 29 30 32 33
14 15 8 9
Temp block
2 6
22
Example of External Sorting (13)
Result:
2 6 8 9 11 12 14 15 16 18 20 21 22 23 24 27 29 30 32 33 35 36 50 51
23
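The two phases above can be summarised in a compact Python sketch. Lists stand in for disk blocks and runs; with nB = 3 buffers and 2 records per block, runs are 3 blocks long and dM = nB - 1 = 2 runs are merged per pass, as in the example:

import heapq

def external_sort(blocks, nB=3, records_per_block=2):
    records = [rec for blk in blocks for rec in blk]
    run_len = nB * records_per_block
    # Sorting phase: create sorted runs of nB blocks each.
    runs = [sorted(records[i:i + run_len]) for i in range(0, len(records), run_len)]
    # Merging phase: merge dM = nB - 1 runs per pass until a single run remains.
    dM = nB - 1
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + dM])) for i in range(0, len(runs), dM)]
    return runs[0]

blocks = [[15, 22], [2, 27], [14, 6], [51, 18], [35, 16], [50, 36],
          [9, 8], [32, 12], [11, 33], [30, 30], [23, 21], [24, 29]]
print(external_sort(blocks))    # the 24 record values in ascending order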
4. Algorithms for SELECT and JOIN
Operations (1)
Implementing the SELECT Operation:
Examples:
◼ (OP1): σSSN='123456789'(EMPLOYEE)
◼ (OP2): σDNUMBER>5(DEPARTMENT)
◼ (OP3): σDNO=5(EMPLOYEE)
◼ (OP4): σDNO=5 AND SALARY>30000 AND SEX=F(EMPLOYEE)
◼ (OP5): σESSN='123456789' AND PNO=10(WORKS_ON)
◼ (OP6): σDNO IN (3, 27, 49) (EMPLOYEE)
25
Algorithms for SELECT and JOIN Operations (3)
26
Algorithms for SELECT and JOIN Operations (4)
27
Algorithms for SELECT and JOIN Operations (4)
28
Algorithms for SELECT and JOIN (5)
29
Algorithms for SELECT and JOIN (6)
❑ Records satisfying the disjunctive condition are the union of the records
satisfying the individual conditions.
❑ If any one of the conditions does not have an access path, we are
compelled to use the brute force, linear search approach (S1).
❑ Only if an access path exists on every simple condition in the disjunction
can we optimize the selection by retrieving the records satisfying each
condition - or their record ids - and then applying the union operation to
eliminate duplicates.
31
Algorithms for SELECT and JOIN Operations (7)
Implementing the SELECT
Operation (cont.):
◼ S11. Disjunctive (OR)
selection conditions:
32
Algorithms for SELECT and JOIN Operations (7)
33
Which search method should be used? (1)
◼ (OP1): σSSN='123456789'(EMPLOYEE)
◼ (OP2): σDNUMBER>5(DEPARTMENT)
◼ (OP3): σDNO=5(EMPLOYEE)
◼ (OP4): σDNO=5 AND SALARY>30000 AND SEX=F(EMPLOYEE)
◼ (OP5): σESSN='123456789' AND PNO=10(WORKS_ON)
◼ (OP6): σDNO IN (3, 27, 49) (EMPLOYEE)
34
Which search method should be used? (2)
◼ EMPLOYEE
❑ A clustering index on SALARY
❑ A secondary index on the key attribute SSN
❑ A secondary index on the nonkey attribute DNO
❑ A secondary index on SEX
(OP1): σSSN='123456789'(EMPLOYEE)
◼ DEPARTMENT
❑ A primary index on DNUMBER
❑ A secondary index on MGRSSN
36
Which search method should be used? (4)
◼ EMPLOYEE
❑ A clustering index on SALARY
❑ A secondary index on the key attribute SSN
❑ A secondary index on the nonkey attribute DNO
❑ A secondary index on SEX
39
Which search method should be used? (7)
◼ WORKS_ON
❑ A composite primary index on (ESSN, PNO)
40
Which search method should be used? (8)
◼ EMPLOYEE
❑ A clustering index on SALARY
❑ A secondary index on the key attribute SSN
❑ A secondary index on the nonkey attribute DNO
❑ A secondary index on SEX
◼ Examples
(OP8): EMPLOYEE ⋈ DNO=DNUMBER DEPARTMENT
(OP9): DEPARTMENT ⋈ MGRSSN=SSN EMPLOYEE
43
Algorithms for SELECT and JOIN Operations (9)
44
Algorithms for SELECT and JOIN Operations (10)
45
Algorithms for SELECT and JOIN Operations (12)
Implementing Sort-Merge Join (J3): T ← R ⋈ R.A=S.B S
sort the tuples in R on attribute A; /* assume R has n tuples */
sort the tuples in S on attribute B; /* assume S has m tuples */
set i ← 1, j ← 1;
while (i ≤ n) and (j ≤ m)
do {
    if R(i)[A] > S(j)[B]
    then set j ← j + 1
    elseif R(i)[A] < S(j)[B]
    then set i ← i + 1
    else { /* R(i)[A] = S(j)[B], so we output a matched tuple */
        output the combined tuple <R(i), S(j)> to T;
        /* output other tuples that match R(i), if any */
        set l ← j + 1;
        while (l ≤ m) and (R(i)[A] = S(l)[B])
        do { output the combined tuple <R(i), S(l)> to T;
             set l ← l + 1
        }
        /* output other tuples that match S(j), if any */
        set k ← i + 1;
        while (k ≤ n) and (R(k)[A] = S(j)[B])
        do { output the combined tuple <R(k), S(j)> to T;
             set k ← k + 1
        }
        set i ← i + 1, j ← j + 1
    }
}
46
Sort-Merge Join – step-by-step trace
The original figures trace the merge on R (columns C, A, with A values 5, 6, 9, 10, 17, 20)
and S (columns B, D, with B values 4, 6, 6, 10, 17, 18). At each step the algorithm
compares R(i)[A] with S(j)[B] and advances the pointer on the smaller side; on equality it
outputs the combined tuple(s). When j exceeds m the merge ends.
Matched pairs and result:
R(2), S(2): A = B = 6
R(2), S(3): A = B = 6
R(4), S(4): A = B = 10
R(5), S(5): A = B = 17
59
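The trace above corresponds to the following Python sketch of sort-merge join (0-based indexing; R tuples are (C, A) pairs and S tuples are (B, D) pairs, with the C and D values invented here since the slides only show the join attributes):

def sort_merge_join(R, S):
    # R sorted on its 2nd field (A), S sorted on its 1st field (B)
    T, i, j = [], 0, 0
    while i < len(R) and j < len(S):
        if R[i][1] > S[j][0]:
            j += 1
        elif R[i][1] < S[j][0]:
            i += 1
        else:
            l = j                          # output R(i) with every matching S tuple
            while l < len(S) and R[i][1] == S[l][0]:
                T.append(R[i] + S[l]); l += 1
            k = i + 1                      # output further R tuples matching S(j)
            while k < len(R) and R[k][1] == S[j][0]:
                T.append(R[k] + S[j]); k += 1
            i += 1; j += 1
    return T

R = [("c1", 5), ("c2", 6), ("c3", 9), ("c4", 10), ("c5", 17), ("c6", 20)]
S = [(4, "d1"), (6, "d2"), (6, "d3"), (10, "d4"), (17, "d5"), (18, "d6")]
for row in sort_merge_join(R, S):
    print(row)    # matches on 6 (twice), 10 and 17, as in the trace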
Algorithms for SELECT and JOIN Operations (11)
60
Algorithms for SELECT and JOIN Operations (11)
61
Algorithms for SELECT and JOIN Operations (12)
62
Algorithms for SELECT and JOIN Operations (12)
63
Algorithms for SELECT and JOIN Operations (12)
64
Algorithms for SELECT and JOIN Operations (12)
65
5. Algorithms for PROJECT and SET
Operations (1)
◼ Algorithm for PROJECT operations π<attribute list>(R)
(Figure 19.3b)
◼ If <attribute list> has a key of relation R, extract all tuples from R with only the values for
the attributes in <attribute list>.
◼ If <attribute list> does NOT include a key of relation R, duplicated tuples must be
removed from the results.
◼ Methods to remove duplicate tuples:
◼ Sorting
◼ Hashing
66
Implementing T ← π<attribute list>(R)
67
Algorithms for PROJECT and SET Operations (2)
68
Algorithms for PROJECT and SET Operations (3)
❑ 2. Scan and merge both sorted files concurrently, keep in the merged results only those
tuples that appear in both relations.
◼ SET DIFFERENCE R-S (See Figure 19.3e)(keep in the merged results only those
tuples that appear in relation R but not in relation S.)
69
Union: T ← R ∪ S
sort the tuples in R and S using the same unique sort attributes;
set i ← 1, j ← 1;
while (i ≤ n) and (j ≤ m) do
{
    if R(i) > S(j)
    then { output S(j) to T;
           set j ← j + 1
    }
    else if R(i) < S(j)
    then { output R(i) to T;
           set i ← i + 1
    }
    else set j ← j + 1 /* R(i) = S(j), so we skip one of the duplicate tuples */
}
if (i ≤ n) then add tuples R(i) to R(n) to T;
if (j ≤ m) then add tuples S(j) to S(m) to T;
70
Intersection: T ← R ∩ S
sort the tuples in R and S using the same unique sort attributes;
set i ← 1, j ← 1;
while (i ≤ n) and (j ≤ m) do
{
    if R(i) > S(j)
    then set j ← j + 1
    else if R(i) < S(j)
    then set i ← i + 1
    else { output R(i) to T; /* R(i) = S(j), so we output the tuple */
           set i ← i + 1, j ← j + 1
    }
}
71
Difference: T ← R − S
sort the tuples in R and S using the same unique sort attributes;
set i ← 1, j ← 1;
while (i ≤ n) and (j ≤ m) do
{
    if R(i) > S(j)
    then set j ← j + 1
    else if R(i) < S(j)
    then { output R(i) to T; /* R(i) has no matching S(j), so output R(i) */
           set i ← i + 1
    }
    else set i ← i + 1, j ← j + 1
}
if (i ≤ n) then add tuples R(i) to R(n) to T;
72
6. Implementing Aggregate Operations
and Outer Joins (1)
Implementing Aggregate Operations:
◼ Aggregate operators : MIN, MAX, SUM, COUNT and AVG
❑ Table Scan
❑ Index
◼ Example
◼ If an (ascending) index on SALARY exists for the employee relation, then the optimizer could
decide on traversing the index for the largest value, which would entail following the right
most pointer in each index node from the root to a leaf.
73
Implementing Aggregate Operations and
Outer Joins (2)
◼ Implementing Aggregate Operations (cont.):
◼ SUM, COUNT and AVG
❑ For a dense index (each record has one index entry):
apply the associated computation to the values in the index.
❑ For a non-dense index: actual number of records associated with each index entry must
be accounted for
◼ With GROUP BY: the aggregate operator must be applied separately to each group of
tuples.
❑ Use sorting or hashing on the group attributes to partition the file into the appropriate
groups;
❑ Compute the aggregate function for the tuples in each group.
74
Implementing Aggregate Operations and
Outer Joins (3)
◼ Implementing Outer Join:
◼ Outer Join Operators : LEFT OUTER JOIN, RIGHT OUTER JOIN and FULL OUTER
JOIN.
◼ The full outer join produces a result which is equivalent to the union of the results of the
left and right outer joins.
◼ Example:
SELECT FNAME, DNAME
FROM ( EMPLOYEE LEFT OUTER JOIN DEPARTMENT ON DNO = DNUMBER);
◼ Note: The result of this query is a table of employee names and their associated
departments. It is similar to a regular join result, with the exception that if an employee
does not have an associated department, the employee's name will still appear in the
resulting table, although the department name would be indicated as null.
75
Implementing Aggregate Operations and
Outer Joins (4)
◼ Implementing Outer Join (cont.):
◼ Modifying Join Algorithms:
76
Implementing Aggregate Operations and
Outer Joins (5)
◼ Implementing Outer Join (cont.):
77
7. Combining Operations using Pipelining (1)
◼ Motivation
❑ A query is mapped into a sequence of operations.
78
Combining Operations using Pipelining (2)
◼ Example: 2-way join, 2 selections on the input files and one final
projection on the resulting file.
◼ Dynamic generation of code to allow for multiple operations to be
pipelined.
◼ Results of a select operation are fed in a "Pipeline " to the join
algorithm.
◼ Also known as stream-based processing.
79
8. Using Heuristics in Query Optimization(1)
80
Using Heuristics in Query Optimization (2)
81
Using Heuristics in Query Optimization (3)
◼ Example:
For every project located in ‘Stafford’, retrieve the project number, the
controlling department number and the department manager’s last name,
address and birthdate.
◼ Relational algebra:
π PNUMBER, DNUM, LNAME, ADDRESS, BDATE (((σ PLOCATION='Stafford' (PROJECT))
⋈ DNUM=DNUMBER (DEPARTMENT)) ⋈ MGRSSN=SSN (EMPLOYEE))
◼ SQL query:
Q2: SELECT P.PNUMBER, P.DNUM, E.LNAME, E.ADDRESS, E.BDATE
FROM PROJECT AS P, DEPARTMENT AS D, EMPLOYEE AS E
WHERE P.DNUM=D.DNUMBER AND D.MGRSSN=E.SSN AND
P.PLOCATION='Stafford';
82
Two query trees for the query Q2
83
Query graph for Q2
84
Using Heuristics in Query Optimization (6)
◼ The same query could correspond to many different relational algebra expressions —
and hence many different query trees.
◼ The task of heuristic optimization of query trees is to find a final query tree that is
efficient to execute.
◼ Example :
Q: SELECT LNAME
86
Step 2: Moving SELECT operations down the query tree.
87
Step 3: Apply more restrictive SELECT operation first
88
Step 4: Replacing Cartesian Product and Select with Join operation.
89
Step 5: Moving Project operations down the query tree
90
Using Heuristics in Query Optimization (10)
General Transformation Rules for Relational Algebra Operations:
◼ 3. Cascade of π : In a cascade (sequence) of π operations, all but the last one can be
ignored:
π List1(π List2(...(π Listn(R))...)) = π List1(R)
◼ 4. Commuting σ with π : If the selection condition c involves only the attributes A1, ..., An in
the projection list, the two operations can be commuted:
π A1, A2, ..., An(σ c(R)) = σ c(π A1, A2, ..., An(R))
91
Using Heuristics in Query Optimization (11)
◼ Alternatively, if the selection condition c can be written as (c1 and c2), where condition c1
involves only the attributes of R and condition c2 involves only the attributes of S, the
operations commute as follows:
σ c(R ⋈ S) = (σ c1(R)) ⋈ (σ c2(S))
93
Using Heuristics in Query Optimization (13)
94
Using Heuristics in Query Optimization (14)
95
Using Heuristics in Query Optimization (15)
96
Using Heuristics in Query Optimization (16)
97
Using Heuristics in Query Optimization (17)
98
Using Heuristics in Query Optimization (18)
99
9. Using Selectivity and Cost Estimates in
Query Optimization (1)
◼ Cost-based query optimization: Estimate and compare the costs of
executing a query using different execution strategies and choose the
strategy with the lowest cost estimate.
◼ Issues
❑ Cost function
❑ Number of execution strategies to be considered
100
Using Selectivity and Cost Estimates in Query Optimization (2)
◼ 3. Computation cost
101
Using Selectivity and Cost Estimates in Query
Optimization (3)
Catalog Information Used in Cost Functions
◼ Information about the size of a file
105
Example
◼ rE = 10,000 , bE = 2000 , bfrE = 5
◼ Access paths:
❑ 1. A clustering index on SALARY, with levels xSALARY = 3 and average
selection cardinality SSALARY = 20.
❑ 2. A secondary index on the key attribute SSN, with xSSN = 4 (SSSN = 1).
❑ 3. A secondary index on the nonkey attribute DNO, with xDNO= 2 and first-
level index blocks bI1DNO= 4. There are dDNO = 125 distinct values for DNO,
so the selection cardinality of DNO is SDNO = (r/dDNO) = 80.
❑ 4. A secondary index on SEX, with xSEX = 1. There are dSEX = 2 values for
the sex attribute, so the average selection cardinality is SSEX = (r/dSEX) =
5000.
106
Example
❑ CS1b = 1000
❑ CS6a = xSSN + 1 = 4+1 = 5
❑ CS1a = 2000
❑ CS6b = xDNO + (bl1DNO/2) + (r/2) = 2 + 4/2 + 10000/2 = 5004
107
Example
◼ (op3): σDNO=5 (EMPLOYEE)
❑ CS1a = 2000
❑ CS6a = xDNO + sDNO = 2 + 80 = 82 (option 1 & 2)
❑ CS6a = xDNO + sDNO + 1 = 2 + 80 + 1= 83 (option 3)
108
Using Selectivity and Cost Estimates in Query
Optimization (7)
Examples of Cost Functions for JOIN
109
Using Selectivity and Cost Estimates in Query
Optimization (8)
Examples of Cost Functions for JOIN (cont.)
110
Using Selectivity and Cost Estimates in Query
Optimization (9)
Examples of Cost Functions for JOIN (cont.)
◼ J2. Single-loop join (cont.)
112
Example
DEPARTMENT: rD = 125, bD = 13; primary index on DNUMBER of DEPARTMENT with xDNUMBER = 1.
EMPLOYEE: rE = 10,000, bE = 2000; secondary index on the nonkey attribute DNO with xDNO = 2, sDNO = 80.
Join selectivity jsOP8 = 1/|DEPARTMENT| = 1/rD = 1/125; blocking factor of the result file bfrED = 4.
◼ (op8): EMPLOYEE ⋈ DNO=DNUMBER DEPARTMENT
❑ Method J1 with EMPLOYEE as the outer loop:
◼ CJ1 = bE + (bE * bD) + ((jsOP8 * rE * rD)/bfrED)
◼     = 2000 + (2000 * 13) + (((1/125) * 10,000 * 125)/4) = 30,500
❑ Method J1 with DEPARTMENT as the outer loop:
◼ CJ1 = bD + (bE * bD) + ((jsOP8 * rE * rD)/bfrED)
◼     = 13 + (13 * 2000) + (((1/125) * 10,000 * 125)/4) = 28,513
❑ Method J2 with EMPLOYEE as the outer loop:
◼ CJ2c = bE + (rE * (xDNUMBER + 1)) + ((jsOP8 * rE * rD)/bfrED)
◼      = 2000 + (10,000 * 2) + (((1/125) * 10,000 * 125)/4) = 24,500
❑ Method J2 with DEPARTMENT as the outer loop:
◼ CJ2a = bD + (rD * (xDNO + sDNO)) + ((jsOP8 * rE * rD)/bfrED)    (option 1 & 2)
◼      = 13 + (125 * (2 + 80)) + (((1/125) * 10,000 * 125)/4) = 12,763
◼ CJ2a = bD + (rD * (xDNO + sDNO + 1)) + ((jsOP8 * rE * rD)/bfrED)    (option 3)
◼      = 13 + (125 * (2 + 80 + 1)) + (((1/125) * 10,000 * 125)/4) = 12,888
113
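The four estimates can be recomputed with a few lines of Python (same catalog numbers as above):

bE, rE = 2000, 10_000
bD, rD = 13, 125
x_dnumber, x_dno, s_dno = 1, 2, 80
js, bfr_ed = 1 / rD, 4
result_blocks = (js * rE * rD) / bfr_ed           # 2,500 blocks written for the result

print(bE + bE * bD + result_blocks)               # J1, EMPLOYEE outer:   30,500
print(bD + bE * bD + result_blocks)               # J1, DEPARTMENT outer: 28,513
print(bE + rE * (x_dnumber + 1) + result_blocks)  # J2, EMPLOYEE outer:   24,500
print(bD + rD * (x_dno + s_dno) + result_blocks)  # J2, DEPARTMENT outer: 12,763 (option 1 & 2)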
Example
DEPARTMENT: rD = 125, bD = 13, xDNUMBER = 1; secondary index on MGRSSN of DEPARTMENT
with sMGRSSN = 1 and xMGRSSN = 2.
EMPLOYEE: rE = 10,000, bE = 2000; secondary index on the key attribute SSN with xSSN = 4 (sSSN = 1).
Join selectivity jsOP9 = 1/|EMPLOYEE| = 1/rE = 1/10,000; blocking factor of the result file bfrED = 4.
115
Using Selectivity and Cost Estimates in Query
Optimization (11)
◼ Example: 2 left-deep trees
116
10. Overview of Query Optimization in Oracle
◼ Oracle DBMS V8
❑ Rule-based query optimization: the optimizer chooses execution plans based
on heuristically ranked operations.
◼ (Currently it is being phased out)
❑ Cost-based query optimization: the optimizer examines alternative access
paths and operator algorithms and chooses the execution plan with the lowest
estimated cost.
◼ The query cost is calculated based on the estimated usage of resources such as I/O,
CPU and memory needed.
❑ Application developers could specify hints to the ORACLE query optimizer.
❑ The idea is that an application developer might know more information about the
data.
117
11. Semantic Query Optimization
◼ Semantic Query Optimization:
❑ Uses constraints specified on the database schema in order to modify one query into
another query that is more efficient to execute.
◼ Explanation:
❑ Suppose that we had a constraint on the database schema stating that no employee
can earn more than his or her direct supervisor, and consider a query that asks for
employees who earn more than their supervisors. If the semantic query optimizer
checks for the existence of this constraint, it need not execute the query at all
because it knows that the result of the query will be empty. Techniques known as
theorem proving can be used for this purpose.
118
120
Chapter 5
Introduction to Transaction
Processing Concepts and Theory
1
Chapter Outline
◼ Introduction to Transaction Processing
2
1. Introduction to Transaction Processing (1)
3
Introduction to Transaction Processing (2)
4
Introduction to Transaction Processing (3)
5
Introduction to Transaction Processing (4)
READ AND WRITE OPERATIONS:
◼ Basic unit of data transfer from the disk to the computer
main memory is one block.
◼ Data item (what is read or written):
❑ the field of some record in the database,
❑ a larger unit such as a record or even a whole block.
◼ read_item(X) command includes the following
steps:
1. Find the address of the disk block that contains item X.
2. Copy that disk block into a buffer in main memory (if that
disk block is not already in some main memory buffer).
3. Copy item X from the buffer to the program variable
named X.
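A minimal Python sketch of these steps against a block-oriented buffer (the dictionaries and names below are illustrative assumptions, not the text's notation); write_item(X) is sketched in the same style:

disk = {}            # block_address -> block contents (item -> value); stands in for the disk
buffer_pool = {}     # block_address -> cached copy of the block in main memory
item_location = {}   # item name -> address of the disk block that contains it

def fetch_block(addr):
    # Step 2: copy the disk block into a main-memory buffer if it is not already there.
    if addr not in buffer_pool:
        buffer_pool[addr] = dict(disk[addr])
    return buffer_pool[addr]

def read_item(x, prog_vars):
    addr = item_location[x]       # step 1: find the address of the block containing X
    block = fetch_block(addr)     # step 2: bring the block into the buffer
    prog_vars[x] = block[x]       # step 3: copy item X into the program variable named X

def write_item(x, prog_vars):
    addr = item_location[x]       # find the block and bring it into the buffer
    block = fetch_block(addr)
    block[x] = prog_vars[x]       # copy the program variable into the buffer copy
    disk[addr] = dict(block)      # store the updated block back to disk (may also be deferred)

# Tiny usage example with hypothetical data:
disk['blk1'] = {'X': 5}
item_location['X'] = 'blk1'
prog = {}
read_item('X', prog); prog['X'] += 1; write_item('X', prog)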
6
Introduction to Transaction Processing (5)
7
Two sample transactions. (a) Transaction T1.
(b) Transaction T2.
8
Introduction to Transaction Processing (7)
9
Some problems that occur when concurrent execution
is uncontrolled. (a) The lost update problem.
10
Some problems that occur when concurrent execution
is uncontrolled. (b) The temporary update problem.
11
Introduction to Transaction Processing (8)
Why Concurrency Control is needed (cont.):
◼ The Incorrect Summary Problem.
If one transaction is calculating an aggregate summary
function on a number of records while other transactions
are updating some of these records, the aggregate
function may calculate some values before they are
updated and others after they are updated.
13
Introduction to Transaction Processing (11)
Why recovery is needed:
(What causes a Transaction to fail)
1. A computer failure (system crash): A hardware or
software error occurs in the computer system during
transaction execution. If the hardware crashes, the
contents of the computer’s internal memory may be
lost.
2. A transaction or system error : Some operation in the
transaction may cause it to fail, such as integer overflow
or division by zero. Transaction failure may also occur
because of erroneous parameter values or because of
a logical programming error. In addition, the user may
interrupt the transaction during its execution.
14
Introduction to Transaction Processing (12)
Why recovery is needed (cont.):
3. Local errors or exception conditions detected by the
transaction:
- certain conditions necessitate cancellation of the
transaction. For example, data for the transaction may not
be found. A condition, such as insufficient account balance
in a banking database, may cause a transaction, such as a
fund withdrawal from that account, to be canceled.
- a programmed abort in the transaction causes it to fail.
4. Concurrency control enforcement: The concurrency
control method may decide to abort the transaction, to be
restarted later, because it violates serializability or because
several transactions are in a state of deadlock (see
Chapter 5).
15
Introduction to Transaction Processing (13)
16
2 . Transaction and System Concepts (1)
◼ Transaction states:
❑ Active state
❑ Partially committed state
❑ Committed state
❑ Failed state
❑ Terminated State
17
State transition diagram illustrating the states for
transaction execution.
18
Transaction and System Concepts (2)
Recovery manager keeps track of the following
operations:
◼ begin_transaction: This marks the beginning of
transaction execution.
◼ read or write: These specify read or write operations on
the database items
◼ end_transaction:
❑ This specifies that the transaction's read and write
operations have ended and marks the end of
transaction execution.
❑ At this point it may be necessary to check whether the changes
introduced by the transaction can be permanently
applied to the database (committed) or whether the transaction has
to be aborted because it violates serializability or
for some other reason.
19
Transaction and System Concepts (3)
Recovery manager keeps track of the following
operations (cont):
◼ commit_transaction: This signals a successful end of
the transaction so that any changes (updates) executed
by the transaction can be safely committed to the
database and will not be undone.
◼ rollback (or abort): This signals that the transaction
has ended unsuccessfully, so that any changes or
effects that the transaction may have applied to the
database must be undone.
20
Transaction and System Concepts (4)
21
Transaction and System Concepts (5)
The System Log
◼ Log or Journal :
❑ The log keeps track of all transaction operations that
affect the values of database items.
❑ This information may be needed to permit recovery
from transaction failures.
❑ The log is kept on disk → not affected by any type of
failure except for disk or catastrophic failure.
❑ In addition, the log is periodically backed up to
archival storage (tape) to guard against such
catastrophic failures.
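For illustration, the log records referred to here take the form used later in the recovery chapter, e.g.:
[start_transaction, T1]
[write_item, T1, X, old_value, new_value]
[commit, T1] (or [abort, T1])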
22
Transaction and System Concepts (6)
23
Transaction and System Concepts (7)
24
Transaction and System Concepts (8)
25
Transaction and System Concepts (9)
26
3. Desirable Properties of Transactions (1)
ACID properties:
◼ Atomicity: A transaction is an atomic unit of
processing; it is either performed in its entirety
or not performed at all.
◼ Consistency preservation: A correct execution of the transaction
must take the database from one consistent state to another.
◼ Isolation: A transaction should appear as though it is executed
in isolation from other transactions.
◼ Durability (permanency): Once a transaction commits, its changes
must persist in the database and must not be lost because of any
later failure.
27
Desirable Properties of Transactions (2)
28
4. Characterizing Schedules based on
Recoverability (1)
◼ Transaction schedule or history:
❑ When transactions are executing concurrently in an
interleaved fashion
❑ The order of execution of operations from the various
transactions forms → a transaction schedule (or history).
29
Characterizing Schedules based on
Recoverability (2)
◼ Notation:
Notation Description
ri(X) read_item(X) - transaction Ti
wi(X) write_item(X) - transaction Ti
ci commit - transaction Ti
ai abort - transaction Ti
Characterizing Schedules based on
Recoverability (3)
◼ Example (1):
❑ Not conflict:
◼ r1(X) and r2(X)
◼ r1(Y) and w2(X)
◼ r1(X) and w1(X)
◼ …
Characterizing Schedules based on
Recoverability (7)
◼ Example (2):
❑ Sb: r1(X); w1(X); r2(X); w2(X); r1(Y); a1;
❑ Conflict:
◼ r1(X) and w2(X)
◼ w1(X) and r2(X)
◼ w1(X) and w2(X)
Characterizing Schedules based on
Recoverability (8)
Schedules classified on recoverability:
◼ Recoverable schedule: A schedule S is recoverable if
no transaction T in S commits until all transactions T’ that have
written an item that T reads have committed.
36
Characterizing Schedules based on
Recoverability (9)
Schedules classified on recoverability (cont.):
◼ Strict Schedules: A schedule in which a
transaction can neither read nor write an item X until
the last transaction that wrote X has committed (or aborted).
37
Characterizing Schedules based on
Recoverability (10)
◼ Example of Recoverable schedule :
Sa': r1(X); r2(X); w1(X); r1(Y); w2(X); c2; w1(Y); c1;
❑ Lost update
40
Characterizing Schedules based on
Serializability (2)
Serial Schedules:
(A) T1 followed by T2 (B) T2 followed by T1
Characterizing Schedules based on
Serializability (3)
43
Characterizing Schedules based on
Serializability (5)
◼ Conflict serializable: A schedule S is said to be
conflict serializable if it is conflict equivalent to some
serial schedule S’.
❑ In such a case, we can reorder the nonconflicting
operations in S until we form the equivalent serial schedule
S’.
44
Characterizing Schedules based on
Serializability (6)
◼ Being serializable is not the same as being
serial
45
Characterizing Schedules based on
Serializability (7)
◼ Serializability is hard to check.
❑ Interleaving of operations occurs in an operating
system through some scheduler
❑ Difficult to determine beforehand how the
operations in a schedule will be interleaved.
46
Characterizing Schedules based on
Serializability (8)
Practical approach:
◼ Come up with methods (protocols) to ensure
serializability.
◼ It is not possible to determine when a schedule
begins and when it ends. Hence, we reduce the
problem of checking the whole schedule to checking
only a committed projection of the schedule (i.e.,
the operations from only the committed transactions).
◼ Current approach used in most DBMSs:
❑ Use of locks with two phase locking
47
Characterizing Schedules based on
Serializability (9)
Testing for conflict serializability
◼ Precedence graph (serialization graph) G = (N, E)
❑ A directed graph: the nodes N are the transactions, and there is an
edge Ti → Tj whenever an operation of Ti conflicts with and precedes
an operation of Tj in the schedule.
❑ S is conflict serializable if and only if its precedence graph has no cycle.
Serializable
schedule
Example of Serializability Testing (2)
Serializable
schedule
Example of Serializability Testing (3)
Not Serializable
schedule
Example of Serializability Testing (4)
Serializable
schedule
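A small sketch of this test: build the precedence graph from a schedule given as (transaction, operation, item) triples and check it for a cycle (Python; the input format and function names are illustrative assumptions):

# Conflict-serializability test: build the precedence graph and look for a cycle.
# Two operations conflict if they come from different transactions, access the
# same item, and at least one of them is a write.
def precedence_graph(schedule):
    edges = set()
    for i, (ti, op1, x) in enumerate(schedule):
        for tj, op2, y in schedule[i + 1:]:
            if ti != tj and x == y and 'w' in (op1, op2):
                edges.add((ti, tj))      # edge Ti -> Tj: Ti's op conflicts with and precedes Tj's
    return edges

def has_cycle(nodes, edges):
    adj = {n: [b for (a, b) in edges if a == n] for n in nodes}
    state = {n: 0 for n in nodes}        # 0 = unvisited, 1 = on the DFS stack, 2 = done
    def dfs(n):
        state[n] = 1
        for m in adj[n]:
            if state[m] == 1 or (state[m] == 0 and dfs(m)):
                return True
        state[n] = 2
        return False
    return any(state[n] == 0 and dfs(n) for n in nodes)

# Example schedule: r1(X); r2(X); w1(X); r1(Y); w2(X)
s = [('T1', 'r', 'X'), ('T2', 'r', 'X'), ('T1', 'w', 'X'), ('T1', 'r', 'Y'), ('T2', 'w', 'X')]
e = precedence_graph(s)                  # {('T1', 'T2'), ('T2', 'T1')} -> cycle
print("not serializable" if has_cycle({'T1', 'T2'}, e) else "serializable")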
Another example of serializability testing. (a) The
READ and WRITE operations of three transactions T1,
T2, and T3.
55
Another example of serializability testing. (b) Schedule
E.
56
Another example of serializability testing. (c) Schedule
F.
57
58
6. Transaction Support in SQL2 (1)
59
Transaction Support in SQL2 (2)
60
Transaction Support in SQL2 (3)
61
Transaction Support in SQL2 (4)
62
Transaction Support in SQL2(5)
63
Transaction Support in SQL2 (6)
Sample SQL transaction:
EXEC SQL whenever sqlerror go to UNDO;
EXEC SQL SET TRANSACTION
READ WRITE
DIAGNOSTICS SIZE 5
ISOLATION LEVEL SERIALIZABLE;
EXEC SQL INSERT
INTO EMPLOYEE (FNAME, LNAME, SSN, DNO, SALARY)
VALUES ('Robert','Smith','991004321',2,35000);
EXEC SQL UPDATE EMPLOYEE
SET SALARY = SALARY * 1.1
WHERE DNO = 2;
EXEC SQL COMMIT;
GOTO THE_END;
UNDO: EXEC SQL ROLLBACK;
THE_END: ...
64
Transaction Support in SQL2 (7)
65
Chapter 6
2
1. Purpose of Concurrency Control
◼ To enforce Isolation (through mutual exclusion)
among conflicting transactions.
◼ To preserve database consistency through
consistency preserving execution of
transactions.
◼ To resolve read-write and write-write conflicts.
◼ Example:
❑ In a concurrent execution environment: if T1 conflicts with T2
over a data item A,
❑ then the concurrency control decides whether T1 or T2 should
get access to A and whether the other transaction is rolled back or must wait.
3
2. Two-Phase Locking Techniques (1)
◼ Locking is an operation which secures
❑ (a) permission to Read
❑ (b) permission to Write a data item for a transaction.
◼ Example:
❑ Lock (X): Data item X is locked on behalf of the requesting
transaction.
◼ Unlocking is an operation which removes these
permissions from the data item.
◼ Example:
❑ Unlock (X): Data item X is made available to all other
transactions.
◼ Lock and Unlock are Atomic operations.
4
Two-Phase Locking Techniques (2)
5
Two-Phase Locking Techniques (3)
◼ Type of Locks:
❑ Binary Locks
❑ Shared/ Exclusive (or Read/ Write) Locks
6
Two-Phase Locking Techniques (4)
◼ Binary Locks
❑ 2 values: locked and unlocked (1 and 0)
❑ The following code performs the lock operation:
B: if LOCK (X) = 0 (*item is unlocked*)
then LOCK (X) ← 1 (*lock the item*)
else begin
wait (until LOCK (X) = 0
and the lock manager wakes up the transaction);
go to B
end;
7
Two-Phase Locking Techniques (5)
◼ Binary Locks
❑ The following code performs the unlock operation:
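The code itself is missing from this copy of the slides; the standard counterpart to the lock operation above, written in the same style, would be:
unlock_item (X):
LOCK (X) ← 0; (*unlock the item*)
if any transactions are waiting
then wake up one of the waiting transactions;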
8
Two-Phase Locking Techniques (6)
◼ Binary Locks
❑ Rules:
1. A transaction T must issue the operation lock_item(X)
before any read_item(X) or write_item(X) operations in T.
2. A transaction T must issue the operation unlock_item(X)
after all read_item(X) and write_item(X) operations are
completed in T.
3. A transaction T will not issue a lock_item(X) operation if it
already holds the lock on item X.
4. A transaction T will not issue an unlock_item(X) operation
unless it already holds the lock on item X.
9
Two-Phase Locking Techniques (7)
10
Two-Phase Locking Techniques (8)
11
Two-Phase Locking Techniques (9)
◼ Shared/ Exclusive (or Read/ Write) Locks
❑ The following code performs the read lock operation:
B: if LOCK (X) = “unlocked” then
begin LOCK (X) ← “read-locked”;
no_of_reads (X) ← 1;
end
else if LOCK (X) = “read-locked” then
no_of_reads (X) ← no_of_reads (X) + 1
else begin wait (until LOCK (X) = “unlocked” and
the lock manager wakes up the transaction);
go to B;
end;
12
Two-Phase Locking Techniques (10)
◼ Shared/ Exclusive (or Read/ Write) Locks
❑ The following code performs the write lock operation:
B: if LOCK(X) = “unlocked”
then LOCK(X) ← “write-locked”
else begin
wait (until LOCK(X) = “unlocked”
and the lock manager wakes up the transaction);
go to B
end;
13
Two-Phase Locking Techniques (11)
◼ Shared/ Exclusive (or Read/ Write) Locks
❑ The following code performs the unlock operation:
if LOCK (X) = “write-locked” then
begin LOCK (X) ← “unlocked”;
wake up one of the waiting transactions, if any
end
else if LOCK (X) = “read-locked” then
begin
no_of_reads (X) ← no_of_reads (X) - 1;
if no_of_reads (X) = 0 then
begin
LOCK (X) ← “unlocked”;
wake up one of the waiting transactions, if any
end
end;
14
Two-Phase Locking Techniques (12)
◼ Shared/ Exclusive (or Read/ Write) Locks
❑ Rules:
1. A transaction T must issue the operation read_lock(X)
or write_lock(X) before any read_item(X) operation is
performed in T.
15
Two-Phase Locking Techniques (13)
◼ Shared/ Exclusive (or Read/ Write) Locks
❑ Rules (cont.):
4. A transaction T must not issue a read_lock(X)
operation if it already holds a read(shared) lock or a
write(exclusive) lock on item X.
16
Two-Phase Locking Techniques (14)
◼ Shared/ Exclusive (or Read/ Write) Locks
❑ Lock conversion
◼ Lock upgrade: existing read lock to write lock
17
Two-Phase Locking Techniques (15)
◼ Two-Phase Locking
❑ Two Phases:
◼ (a) Locking (Growing)
◼ (b) Unlocking (Shrinking).
❑ Locking (Growing) Phase:
◼ A transaction applies locks (read or write) on desired data
items one at a time.
❑ Unlocking (Shrinking) Phase:
◼ A transaction unlocks its locked data items one at a time.
❑ Requirement:
◼ For a transaction, the two phases must be strictly separated: all locks
are acquired during the growing phase and released during the shrinking
phase, so once the transaction releases any lock it must not acquire a
new one.
18
Two-Phase Locking Techniques (16)
19
Two-Phase Locking Techniques (17)
◼ Do not obey Two-Phase Locking
20
Two-Phase Locking Techniques (18)
◼ Two-Phase Locking
21
T’1: read_lock (Y); read_item (Y); write_lock (X); unlock (Y); read_item (X);
X := X + Y; write_item (X); unlock (X);
T’2: read_lock (X); read_item (X); write_lock (Y); unlock (X); read_item (Y);
Y := X + Y; write_item (Y); unlock (Y);
In the interleaving shown, T’2's read_lock (X) must wait while T’1 holds its
write lock on X; after T’1 unlocks X, T’2 proceeds.
Guaranteed to be serializable.
22
T’1 and T’2 (the same transactions as above) in a different interleaving:
T’1: read_lock (Y); read_item (Y);
T’2: read_lock (X); read_item (X);
T’1: write_lock (X); (waits for X, read-locked by T’2)
T’2: write_lock (Y); (waits for Y, read-locked by T’1)
Neither transaction can proceed, so the remaining operations of T’1 and T’2
(the unlocks, the assignments, and the write_item calls) never execute.
Can produce a deadlock.
23
Two-Phase Locking Techniques (19)
◼ Two-Phase Locking
❑ Variations:
◼ (a) Basic
◼ (b) Conservative
◼ (c) Strict
◼ (d) Rigorous
❑ Conservative:
◼ Prevents deadlock by locking all desired data items
before transaction begins execution.
❑ Basic:
◼ Transaction locks data items incrementally. This
may cause deadlock which is dealt with.
24
Two-Phase Locking Techniques (20)
◼ Two-Phase Locking
❑ Strict:
◼ A transaction T does not release any of its
exclusive (write) locks until after it commits or
aborts.
◼ The most commonly used two-phase locking
algorithm.
❑ Rigorous:
◼ A Transaction T does not release any of its locks
(Exclusive or shared) until after it commits or
aborts.
25
Two-Phase Locking Techniques (21)
◼ Dealing with Deadlock and Starvation
❑ Deadlock
T’1: read_lock (Y); read_item (Y);
T’2: read_lock (X); read_item (X);
T’1: write_lock (X); (waits for X)
T’2: write_lock (Y); (waits for Y)
T’1 and T’2 did follow the two-phase
policy, but they are deadlocked.
26
Two-Phase Locking Techniques (22)
27
Two-Phase Locking Techniques (23)
28
Two-Phase Locking Techniques (24)
29
Two-Phase Locking Techniques (25)
T’1: read_lock (Y); read_item (Y);
T’2: read_lock (X); read_item (X);
30
Two-Phase Locking Techniques (26)
31
Two-Phase Locking Techniques (27)
32
Two-Phase Locking Techniques (28)
◼ Wound-wait:
❑ If TS(Ti) < TS(Tj), then (Ti older than Tj) abort Tj (Ti wounds
Tj) and restart it later with the same timestamp.
❑ Otherwise (Ti younger than Tj) Ti is allowed to wait.
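A compact sketch of this rule, together with the companion wait-die scheme referred to on the starvation slide below (Python; the function names are illustrative, and both rules follow the standard timestamp-based deadlock-prevention schemes):

# Decision when transaction T_req requests a lock held by T_hold.
# Smaller timestamp = older transaction.
def wound_wait(ts_req, ts_hold):
    # Wound-wait: an older requester wounds (aborts) the younger holder,
    # which is restarted later with the same timestamp; a younger requester waits.
    return "abort holder (restart later, same timestamp)" if ts_req < ts_hold else "requester waits"

def wait_die(ts_req, ts_hold):
    # Wait-die: an older requester is allowed to wait; a younger requester dies
    # (is aborted and restarted later with the same timestamp).
    return "requester waits" if ts_req < ts_hold else "abort requester (restart later, same timestamp)"

print(wound_wait(100, 200))   # older requester -> the younger holder is wounded
print(wait_die(200, 100))     # younger requester -> the requester dies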
33
Two-Phase Locking Techniques (29)
◼ Dealing with Deadlock and Starvation
◼ Starvation
❑ Starvation occurs when a particular transaction consistently
waits or is restarted and never gets a chance to proceed
further.
❑ In a deadlock resolution it is possible that the same
transaction may consistently be selected as victim and
rolled-back.
❑ This limitation is inherent in all priority based scheduling
mechanisms.
❑ Wound-Wait and wait-die scheme can avoid starvation.
34
3. Concurrency Control Based on
Timestamp Ordering (1)
◼ Timestamp
❑ A monotonically increasing variable (integer)
indicating the age of an operation or a transaction.
A larger timestamp value indicates a more recent
event or operation.
35
Concurrency Control Based on
Timestamp Ordering (2)
◼ Timestamp
❑ The algorithm associates with each database
item X with two timestamp (TS) values:
◼ Read_TS(X): The read timestamp of item X; this is
the largest timestamp among all the timestamps of
transactions that have successfully read item X.
◼ Write_TS(X):The write timestamp of item X; this is
the largest timestamp among all the timestamps of
transactions that have successfully written item X.
36
Concurrency Control Based on
Timestamp Ordering (3)
◼ Basic Timestamp Ordering
❑ 1. Transaction T issues a write_item(X) operation:
◼ (a) If read_TS(X) > TS(T) or write_TS(X) > TS(T), then
❑ a younger transaction has already read or written the data item;
❑ abort and roll back T (to be restarted later with a new timestamp)
and reject the operation.
◼ (b) If the condition in part (a) does not hold, then execute
write_item(X) of T and set write_TS(X) to TS(T).
❑ 2. Transaction T issues a read_item(X) operation:
◼ (a) If write_TS(X) > TS(T), then
❑ a younger transaction has already written the data item;
❑ abort and roll back T (to be restarted later with a new timestamp)
and reject the operation.
◼ (b) If write_TS(X) ≤ TS(T), then execute read_item(X) of T and
set read_TS(X) to the larger of TS(T) and the current
read_TS(X).
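A minimal sketch of these two rules, keeping read_TS and write_TS in dictionaries (Python; names and structure are illustrative assumptions):

# Basic timestamp ordering checks; read_ts / write_ts map item -> timestamp.
read_ts, write_ts = {}, {}

def to_write(item, ts_t):
    # Rule 1: reject (and roll back T) if a younger transaction already read or wrote the item.
    if read_ts.get(item, 0) > ts_t or write_ts.get(item, 0) > ts_t:
        return "abort T"                      # T is restarted later with a new timestamp
    write_ts[item] = ts_t                     # otherwise execute write_item and set write_TS(X)
    return "write executed"

def to_read(item, ts_t):
    # Rule 2: reject if a younger transaction already wrote the item.
    if write_ts.get(item, 0) > ts_t:
        return "abort T"
    read_ts[item] = max(read_ts.get(item, 0), ts_t)   # execute read_item, update read_TS(X)
    return "read executed"

print(to_read('X', 10))    # read executed, read_TS(X) = 10
print(to_write('X', 5))    # abort T: a younger transaction (TS 10) already read X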
37
Example: Three transactions executing under a
timestamp-based scheduler
Transactions T1, T2, T3 operate on items A, B, C; each operation below sets
the read/write timestamp shown:
r1(B) → read_TS(B) = 200
r2(A) → read_TS(A) = 150
r3(C) → read_TS(C) = 175
w1(B) → write_TS(B) = 200
w1(A) → write_TS(A) = 200
w2(C) → T2 is aborted
w3(A)
Why must T2 be aborted (rolled back)?
38
Concurrency Control Based on
Timestamp Ordering (4)
◼ Strict Timestamp Ordering
❑ 1. Transaction T issues a write_item(X)
operation:
◼ If TS(T) > write_TS(X), then delay T until the transaction
T’ that wrote X has terminated (committed or aborted).
❑ 2. Transaction T issues a read_item(X) operation:
◼ If TS(T) > write_TS(X), then delay T until the transaction
T’ that wrote X has terminated (committed or aborted).
40
4. Multiversion Concurrency Control
Techniques (1)
◼ This approach maintains a number of
versions of a data item and allocates the right
version to a read operation of a transaction.
Thus unlike other mechanisms a read
operation in this mechanism is never
rejected.
◼ Side effect:
❑ Significantly more storage (RAM and disk) is
required to maintain multiple versions.
❑ To limit the unbounded growth of versions, garbage
collection is run when some criterion is satisfied.
41
Multiversion Concurrency Control
Techniques (2)
◼ Multiversion technique based on timestamp ordering
❑ Assume X1, X2, …, Xn are the versions of a data item X
created by the write operations of transactions. With each Xi, a
read_TS (read timestamp) and a write_TS (write
timestamp) are associated.
❑ read_TS(Xi): The read timestamp of Xi is the largest of all
the timestamps of transactions that have successfully read
version Xi.
❑ write_TS(Xi): The write timestamp of Xi is the timestamp
of the transaction that wrote the value of version Xi.
❑ A new version of X is created only by a write operation.
42
Multiversion Concurrency Control
Techniques (3)
◼ Multiversion technique based on timestamp ordering
To ensure serializability, the following two rules are used:
1. If transaction T issues write_item (X) and version i of X has
the highest write_TS(Xi) of all versions of X that is also less
than or equal to TS(T), and read _TS(Xi) > TS(T), then abort
and roll-back T; otherwise create a new version Xj and
read_TS(Xj) = write_TS(Xj) = TS(T).
2. If transaction T issues read_item (X), find the version i of X
that has the highest write_TS(Xi) of all versions of X that is
also less than or equal to TS(T), then return the value of Xi to
T, and set the value of read _TS(Xi) to the largest of TS(T)
and the current read_TS(Xi).
❑ Rule 2 guarantees that a read will never be rejected.
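A sketch of these two rules for a single item X, with each version kept as a small record (Python; the structure and names are illustrative assumptions):

# Multiversion timestamp ordering for one item X.
versions = [{'value': 0, 'read_ts': 0, 'write_ts': 0}]   # initial committed version

def relevant_version(ts_t):
    # The version with the highest write_TS that is <= TS(T).
    return max((v for v in versions if v['write_ts'] <= ts_t), key=lambda v: v['write_ts'])

def mv_read(ts_t):
    v = relevant_version(ts_t)                 # Rule 2: a read is never rejected
    v['read_ts'] = max(v['read_ts'], ts_t)
    return v['value']

def mv_write(ts_t, value):
    v = relevant_version(ts_t)
    if v['read_ts'] > ts_t:                    # Rule 1: a younger transaction already read that version
        return "abort T"
    versions.append({'value': value, 'read_ts': ts_t, 'write_ts': ts_t})
    return "new version created"

print(mv_write(150, 'a'))   # creates a version with write_TS = 150
print(mv_read(200))         # reads that version and raises its read_TS to 200
print(mv_write(180, 'b'))   # abort T: the relevant version was already read with TS 200 > 180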
43
Example: Execution of transactions using
multiversion concurrency control
T1 T2 T3 T4 A0 A150 A200
r1(A) read
w1(A) Create
r2(A) Read
w2(A) Create
r3(A) read
r4(A) read
44
Multiversion Concurrency Control
Techniques (4)
Multiversion Two-Phase Locking Using Certify Locks
◼ Concept:
45
Multiversion Concurrency Control
Techniques (5)
Multiversion Two-Phase Locking Using Certify Locks
◼ Steps:
1. X is the committed version of a data item.
2. T creates a second version X’ after obtaining a write lock on X.
3. Other transactions continue to read X.
4. T is ready to commit so it obtains a certify lock on X’.
5. The committed version X is replaced by X’ (X’ becomes the new committed version).
6. T releases its certify lock on X’, which is X now.
Compatibility tables:
read/write locking scheme:
            Read   Write
Read        yes    no
Write       no     no
read/write/certify locking scheme:
            Read   Write   Certify
Read        yes    yes     no
Write       yes    no      no
Certify     no     no      no
46
Multiversion Concurrency Control
Techniques (6)
Multiversion Two-Phase Locking Using Certify Locks
◼ Note:
❑ In multiversion 2PL read and write operations
from conflicting transactions can be processed
concurrently.
❑ This improves concurrency but it may delay
transaction commit because of obtaining certify
locks on all its writes. It avoids cascading abort
but like strict two phase locking scheme conflicting
transactions may get deadlocked.
47
5. Validation (Optimistic)
Concurrency Control Techniques (1)
◼ In this technique, serializability is checked only at commit
time, and transactions are aborted in
case of non-serializable schedules.
◼ Three phases:
1. Read phase
2. Validation phase
3. Write phase
1. Read phase:
❑ A transaction can read values of committed data items.
However, updates are applied only to local copies
(versions) of the data items (in database cache).
48
Validation (Optimistic) Concurrency
Control Techniques (2)
2. Validation phase: Serializability is checked before
transactions write their updates to the database.
❑ This phase for Ti checks that, for each transaction Tj that is either
committed or is in its validation phase, one of the following
conditions holds:
1. Tj completes its write phase before Ti starts its read phase.
2. Ti starts its write phase after Tj completes its write phase, and
the read_set of Ti has no items in common with the write_set of
Tj
3. Both the read_set and write_set of Ti have no items in common
with the write_set of Tj, and Tj completes its read phase before
Ti completes its read phase.
◼ The first condition is checked first for each transaction Tj. If (1) is
false then (2) is checked, and if (2) is false then (3) is checked. If
none of these conditions holds, the validation fails and Ti is aborted.
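A sketch of this validation test, assuming each transaction is described by its read set, write set, and the start/end times of its read and write phases (Python; the representation is an assumption made for illustration):

# Validation of Ti against one transaction Tj that is committed or in its validation phase.
def valid_against(ti, tj):
    # Condition 1: Tj finished its write phase before Ti started its read phase.
    if tj['write_end'] < ti['read_start']:
        return True
    # Condition 2: Ti starts writing after Tj finishes writing, and Ti read nothing Tj wrote.
    if tj['write_end'] < ti['write_start'] and not (ti['read_set'] & tj['write_set']):
        return True
    # Condition 3: Tj finishes reading before Ti finishes reading, and neither Ti's read set
    # nor Ti's write set has items in common with Tj's write set.
    if (tj['read_end'] < ti['read_end']
            and not (ti['read_set'] & tj['write_set'])
            and not (ti['write_set'] & tj['write_set'])):
        return True
    return False           # none of the conditions holds -> validation fails, abort Ti

def validate(ti, others):
    return all(valid_against(ti, tj) for tj in others)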
49
Validation (Optimistic) Concurrency
Control Techniques (3)
3. Write phase: On a successful validation
transactions’ updates are applied to the
database; otherwise, transactions are
restarted.
50
6. Granularity of Data Items And
Multiple Granularity Locking (1)
◼ A lockable unit of data defines its granularity.
Granularity can be coarse (entire database) or it can be
fine (a tuple or an attribute of a relation).
◼ Data item granularity significantly affects concurrency
control performance. Thus, the degree of concurrency
is low for coarse granularity and high for fine granularity.
◼ Example of data item granularity:
1. A field of a database record (an attribute of a tuple)
2. A database record (a tuple or a relation)
3. A disk block
4. An entire file
5. The entire database
51
Granularity of data items and
Multiple Granularity Locking (2)
◼ The following diagram illustrates a hierarchy
of granularity from coarse (database) to fine
(record).
52
Granularity of data items and Multiple
Granularity Locking (3)
◼ To manage such hierarchy, in addition to read and write,
three additional locking modes, called intention lock
modes are defined:
❑ Intention-shared (IS): indicates that a shared lock(s) will
be requested on some descendent nodes(s).
❑ Intention-exclusive (IX): indicates that an exclusive
lock(s) will be requested on some descendent node(s).
❑ Shared-intention-exclusive (SIX): indicates that the
current node is locked in shared mode but an exclusive
lock(s) will be requested on some descendent nodes(s).
53
Granularity of data items and Multiple
Granularity Locking (4)
◼ These locks are applied using the following
compatibility matrix (S and X are the usual shared and exclusive locks):
            IS     IX     S      SIX    X
IS          yes    yes    yes    yes    no
IX          yes    yes    no     no     no
S           yes    no     yes    no     no
SIX         yes    no     no     no     no
X           no     no     no     no     no
54
Granularity of data items and
Multiple Granularity Locking (5)
◼ The set of rules which must be followed for
producing serializable schedule:
1. The lock compatibility matrix must be adhered to.
2. The root of the tree must be locked first, in any mode.
3. A node N can be locked by a transaction T in S or IS mode
only if the parent of N is already locked by T in either IS or
IX mode.
4. A node N can be locked by T in X, IX, or SIX mode only if the
parent of N is already locked by T in either IX or SIX mode.
5. T can lock a node only if it has not unlocked any node (to
enforce 2PL policy).
6. T can unlock a node, N, only if none of the children of N are
currently locked by T.
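For instance, a transaction that updates a single record in one file requests locks top-down and releases them bottom-up; a toy sketch (Python; the hierarchy names and the lock stubs are hypothetical):

def lock(node, mode):      # stub: a real lock manager would check the compatibility matrix (rule 1)
    print("lock", node, "in", mode)

def unlock(node):
    print("unlock", node)

# Top-down acquisition (rules 2-4), update under 2PL (rule 5), bottom-up release (rule 6):
lock("db", "IX"); lock("f1", "IX"); lock("p12", "IX"); lock("r123", "X")
# ... update record r123 ...
unlock("r123"); unlock("p12"); unlock("f1"); unlock("db")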
55
Granularity of data items and Multiple
Granularity Locking (6)
◼ An example of a serializable execution:
56
Granularity of data items and
Multiple Granularity Locking (7)
◼ An example of a serializable execution (continued):
57
Chapter 7
1
Outline
2
1. Purpose of Database Recovery
◼ To bring the database into the last consistent state,
which existed prior to the failure.
◼ To preserve transaction properties (Atomicity,
Consistency, Isolation and Durability).
◼ Example:
❑ If the system crashes before a fund transfer
transaction completes its execution, then either one or
both accounts may have incorrect value.
❑ Thus, the database must be restored to the state
before the transaction modified any of the accounts.
2. Recovery Concepts (1)
Types of Failure
◼ The database may become unavailable for use
due to
❑ Transaction failure: Transactions may fail because
of incorrect input, deadlock, incorrect synchronization.
❑ System failure: System may fail because of
addressing error, application error, operating system
fault, RAM failure, etc.
❑ Media failure: Disk head crash, power disruption, etc.
Recovery Concepts (2)
Transaction Log
◼ For recovery from any type of failure data values prior to
modification (BFIM - BeFore Image) and the new value after
modification (AFIM – AFter Image) are required.
◼ These values and other information are stored in a sequential
file called the transaction log. A sample log is given below. Back
P and Next P point to the previous and next log records of the
same transaction.
T ID Back P Next P Operation Data item BFIM AFIM
T1 0 1 Begin
T1 1 4 Write X X = 100 X = 200
T2 0 8 Begin
T1 2 5 W Y Y = 50 Y = 100
T1 4 7 R M M = 200 M = 200
T3 0 9 R N N = 400 N = 400
T1 5 nil End
Recovery Concepts (3)
Data Caching
◼ Data items to be modified are first stored into
database cache by the Cache Manager (CM)
and after modification they are flushed (written)
to the disk.
◼ The flushing is controlled by Modified and Pin-
Unpin bits.
❑ Pin-Unpin: instructs the operating system not to flush
the data item.
❑ Modified: indicates whether the data item has been modified
(i.e., whether an AFIM exists for it).
Recovery Concepts (4)
Data Update:
◼ Immediate Update: As soon as a data item is modified in
cache, the disk copy is updated.
◼ Deferred Update: All modified data items in the cache are
written either after a transaction ends its execution or after a
fixed number of transactions have completed their execution.
◼ Shadow update: The modified version of a data item does
not overwrite its disk copy but is written at a separate disk
location.
◼ In-place update: The disk version of the data item is
overwritten by the cache version.
Recovery Concepts (5)
11
Recovery Concepts (9)
Checkpointing
◼ From time to time (randomly or under some criterion) the
database flushes its buffers to the database disk to
minimize the task of recovery. The following steps
define a checkpoint operation:
1. Suspend execution of transactions temporarily.
2. Force-write modified buffer data to disk.
3. Write a [checkpoint] record to the log, and save the log to disk.
4. Resume normal transaction execution.
◼ During recovery, redo or undo is required only for transactions
whose operations appear in the log after the [checkpoint] record.
Recovery Concepts (10)
Fuzzy Checkpointing
◼ The time needed to force-write all modified memory
buffers may delay transaction processing
→ fuzzy checkpointing.
◼ The system can resume transaction processing after a
[begin_checkpoint] record is written to the log, without
having to wait for step 2 to finish.
◼ When step 2 is completed, an [end_checkpoint] record is
written to the log.
◼ Until step 2 is completed, the previous checkpoint
record should remain valid.
3. Recovery Based on Deferred
Update (1)
◼ Deferred Update (No Undo/Redo)
◼ The data update goes as follows:
❑ A set of transactions records their updates in the
log.
❑ At commit point under WAL scheme these
updates are saved on database disk.
❑ After reboot from a failure the log is used to redo
all the transactions affected by this failure. No
undo is required because no AFIM is flushed to
the disk before a transaction commits.
Recovery Based on Deferred Update (2)
◼ Deferred Update in a single-user system
There is no concurrent data sharing in a single user
system. The data update goes as follows:
❑ A set of transactions records their updates in the log.
❑ At commit point under WAL scheme these updates are
saved on database disk.
◼ After reboot from a failure the log is used to redo all the
transactions affected by this failure. No undo is required
because no AFIM is flushed to the disk before a
transaction commits.
T1: read_item (A); read_item (D); write_item (D);
T2: read_item (B); write_item (B); read_item (D); write_item (D);
(the figure also shows the database items with values A = 20, B = 15, D = 20)
4. Recovery Based on Immediate
Update (1)
◼ Undo/No-redo Algorithm
❑ In this algorithm AFIMs of a transaction are
flushed to the database disk under WAL before it
commits.
❑ For this reason the recovery manager undoes all
transactions during recovery.
❑ No transaction is redone.
❑ It is possible that a transaction might have
completed execution and be ready to commit, but this
transaction is also undone.
Recovery Based on Immediate
Update (2)
◼ Undo/Redo Algorithm (Single-user
environment)
❑ Recovery schemes of this category apply undo and
also redo for recovery.
❑ In a single-user environment no concurrency control is
required but a log is maintained under WAL.
❑ Note that at any time there will be one transaction in
the system and it will be either in the commit table or
in the active table.
❑ The recovery manager performs:
◼ Undo of a transaction if it is in the active table.
◼ Redo of a transaction if it is in the commit table.
Recovery Based on Immediate
Update (3)
◼ Undo/Redo Algorithm (Concurrent execution)
◼ Recovery schemes of this category applies undo and
also redo to recover the database from failure.
◼ In concurrent execution environment a concurrency
control is required and log is maintained under WAL.
◼ Commit table records transactions to be committed and
active table records active transactions. To minimize the
work of the recovery manager checkpointing is used.
◼ The recovery performs:
❑ Undo of a transaction if it is in the active table.
❑ Redo of a transaction if it is in the commit table.
--- log ---
[start_transaction, T1]
[write_item, T1, D, 12, 20]
[checkpoint]
[start_transaction, T4]
[write_item, T4, B, 23, 15]
[start_transaction, T2]
[commit, T1]
[write_item, T2, B, 15, 12]
[start_transaction, T3]
[write_item, T4, A, 30, 20]
[commit, T4]
[write_item, T3, A, 20, 30]
[write_item, T2, D, 20, 25]
[write_item, T2, B, 12, 17]
← system crash
Recovery: Undo T2 & T3; Redo T1 & T4.
(the figure also shows the corresponding database item values for A, B, and D)
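A simplified sketch of how the recovery manager derives these undo/redo sets from such a log (Python; the tuple representation is an assumption, and the checkpoint optimization is ignored):

# Scan the log and classify transactions for an UNDO/REDO recovery, as in the example above.
log = [
    ("start", "T1"), ("write", "T1", "D", 12, 20), ("checkpoint",),
    ("start", "T4"), ("write", "T4", "B", 23, 15), ("start", "T2"),
    ("commit", "T1"), ("write", "T2", "B", 15, 12), ("start", "T3"),
    ("write", "T4", "A", 30, 20), ("commit", "T4"), ("write", "T3", "A", 20, 30),
    ("write", "T2", "D", 20, 25), ("write", "T2", "B", 12, 17),
]                                       # the system crashes here

started   = {r[1] for r in log if r[0] == "start"}
committed = {r[1] for r in log if r[0] == "commit"}
redo_list = committed                   # committed transactions are redone from their log records
undo_list = started - committed         # transactions still active at the crash are undone
print("Redo:", sorted(redo_list))       # -> ['T1', 'T4']
print("Undo:", sorted(undo_list))       # -> ['T2', 'T3']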
5. Shadow paging (1)
◼ The AFIM does not overwrite its BFIM but is recorded at
another place on the disk. Thus, at any time a data item
has an AFIM and a BFIM (the shadow copy of the data item) at
two different places on the disk.
X Y
X' Y'
Database
X and Y: Shadow copies of data items
X' and Y': Current copies of data items
Shadow paging (2)
◼ To manage access of data items by concurrent transactions
two directories (current and shadow) are used.
❑ The directory arrangement is illustrated below. Here a page
is a data item.
Shadow paging (3)
◼ Recovery:
❑ Free the modified database pages and discard
the current directory (reinstating the shadow
directory).
◼ Committing a transaction corresponds to
discarding the previous shadow directory.
◼ NO-UNDO/NO-REDO
◼ In a multiuser environment → use logs and
checkpoints.
27
6. ARIES Recovery Algorithm (1)
◼ Steal/no-force (UNDO/REDO)
◼ The ARIES Recovery Algorithm is based on:
❑ WAL (Write Ahead Logging)
❑ Repeating history during redo:
◼ ARIES will retrace all actions of the database system
prior to the crash to reconstruct the database state when
the crash occurred.
❑ Logging changes during undo:
◼ It will prevent ARIES from repeating the completed undo
operations if a failure occurs during recovery, which
causes a restart of the recovery process.
ARIES Recovery Algorithm (2)
(figure: the undo and redo ranges over the log)
7. Recovery in multidatabase system
◼ A multidatabase system is a special distributed database
system where one node may be running a relational database
system under UNIX, another may be running an object-oriented
system under Windows, and so on.
◼ A transaction may run in a distributed fashion at multiple
nodes.
◼ In this execution scenario the transaction commits only when
all these multiple nodes agree to commit individually the part
of the transaction they were executing.
◼ This commit scheme is referred to as “two-phase commit”
(2PC).
❑ If any one of these nodes fails or cannot commit the part of
the transaction, then the transaction is aborted.
◼ Each node recovers the transaction under its own recovery
protocol.
Final exam
◼ 90 minutes
◼ Multiple choice + essay questions
◼ Open test (paper documents only)
40