Unit 1adtnotes
ER MODEL:
LOYOLA-ICAM COLLEGE OF ENGINEERING AND TECHNOLOGY
LOYOLA COLLEGE CAMPUS, NUNGUMBAKKAM, CH – 34
CS2029 ADVANCED DATABASE TECHNOLOGY
UNIT 1
RELATIONAL MODEL ISSUES
ENTITY OCCURRENCE:
● Each uniquely identifiable object of an entity type is referred to simply as an entity occurrence.
● We identify each entity type by a name and a list of properties.
● A database normally contains many different entity types.
● The name of the entity is normally a singular noun.
● In UML, the first letter of each word in the entity name is upper case.
RELATIONSHIP TYPES:
● A relationship type is a set of associations between one or more participating entity types.
● Each relationship type is given a name that describes its function.
● For example, a relationship called POwns associates the PrivateOwner and PropertyForRent entities.
● Whenever possible, a relationship name should be unique for a given ER model.
● A relationship is only labeled in one direction, which normally means that the name of the relationship only makes sense in one direction (for example, Branch Has Staff makes more sense than Staff Has Branch).
● Once the relationship name is chosen, an arrow symbol is placed beside the name indicating the correct direction for a reader to interpret the relationship name (for example, Branch Has Staff), as shown in the figure.
Degree of a relationship type
● The entities involved in a particular relationship type are referred to as participants in that relationship.
● The number of participants in a relationship type is called the degree of that relationship.
● An example of a binary relationship (degree two) is the POwns relationship with two participating entity types, namely PrivateOwner and PropertyForRent.
● The term ‘complex relationship’ is used to describe relationships with degrees higher than binary.
Ternary:
● A relationship of degree three is called ternary.
● An example of a ternary relationship is Registers with three participating entity types, namely Staff, Branch, and Client.
SINGLE VALUED ATTRIBUTE:
Attribute that holds a single value for each occurrence of an entity type.
● For example, each occurrence of the Branch entity type has a single value for the branch number (branchNo) attribute (for example B003), and therefore the branchNo attribute is referred to as being single-valued.
MULTIVALUED ATTRIBUTE:
Attribute that holds multiple values for each occurrence of an entity type.
Example:
● Each occurrence of the Branch entity type can have multiple values for the telNo attribute.
● A multi-valued attribute may have a set of numbers with upper and lower limits.
● For example, the telNo attribute of the Branch entity type has between one and three values.
DERIVED ATTRIBUTES:
Attribute that represents a value that is derivable from the value of a related attribute, or set of attributes, not necessarily in the same entity type.
● For example, the value for the duration attribute of the Lease entity is calculated from the rentStart and rentFinish attributes, also of the Lease entity type.
● We refer to the duration attribute as a derived attribute, the value of which is derived from the rentStart and rentFinish attributes.
● The value of an attribute may also be derived from the entity occurrences in the same entity type.
● For example, the total number of staff (totalStaff) attribute of the Staff entity type can be calculated by counting the total number of Staff entity occurrences.
● Derived attributes may also involve the association of attributes of different entity types.
● For example, consider an attribute called deposit of the Lease entity type. The value of the deposit attribute is calculated as twice the monthly rent for a property.
● Therefore, the value of the deposit attribute of the Lease entity type is derived from the rent attribute of the PropertyForRent entity type.
Keys
● Candidate Key
● Primary Key
● Composite Key
CANDIDATE KEY:
● A candidate key is the minimal set of attributes whose value(s) uniquely identify each entity occurrence.
● The candidate key must hold values that are unique for every occurrence of an entity type.
● A candidate key cannot contain a null.
● Example: The branch number (branchNo) attribute is the candidate key for the Branch entity type, and has a distinct value for each branch entity occurrence.
PRIMARY KEY:
● The candidate key that is selected to uniquely identify each occurrence of an entity type.
● The choice of primary key for an entity is based on considerations of attribute length, the minimal number of attributes required, and the future certainty of uniqueness.
● Example: staffNo is an example of a primary key.
COMPOSITE KEY:
● A candidate key that consists of two or more attributes.
● The key of an entity type is composed of several attributes, whose values together are unique for each entity occurrence but not separately.
● For example, consider an entity called Advert with propertyNo (property number), newspaperName, dateAdvert, and cost attributes. Many properties are advertised in many newspapers on a given date.
● To uniquely identify each occurrence of the Advert entity type requires values for the propertyNo, newspaperName, and dateAdvert attributes.
● Thus, the Advert entity type has a composite primary key made up of the propertyNo, newspaperName, and dateAdvert attributes.
Diagrammatic representation of attributes
ENTITY TYPES:
1. Strong Entity
2. Weak Entity
STRONG ENTITY:
Multiplicity
● The number (or range) of possible occurrences of an entity type
that may relate to a single occurrence of an associated entity type
through a particular relationship.
Multiplicity
Multiplicity of Staff Manages Branch (1:1) relationship
Multiplicity of Newspaper Advertises PropertyForRent (*:*) relationship
Semantic net of ternary Registers relationship with values for Staff and Branch entities fixed
● When the staffNo and branchNo values are fixed, the corresponding clientNo values are zero or more.
● Therefore, the multiplicity of the Registers relationship from the perspective of the Staff and Branch entities is 0..*, which is represented in the ER diagram by placing the 0..* beside the Client entity.
● If we repeat this test, we find that the multiplicity when the Staff/Client values are fixed is 1..1, which is placed beside the Branch entity, and the multiplicity when the Client/Branch values are fixed is 1..1, which is placed beside the Staff entity.
Summary of multiplicity constraints
Multiplicity is made up of two types of restrictions on relationships: cardinality and participation.
▪ Cardinality
🢭 Describes the maximum number of possible relationship occurrences for an entity participating in a given relationship type.
▪ Participation
🢭 Determines whether all or only some entity occurrences participate in a relationship.
Multiplicity as cardinality and participation constraints
Problems with ER Models
▪ Problems called connection traps may arise when designing a conceptual data model.
▪ These problems occur due to a misinterpretation of the meaning of certain relationships.
▪ The two main types of connection traps are called fan traps and chasm traps.
FAN TRAP
Where a model represents a relationship between entity types, but the pathway between certain entity occurrences is ambiguous.
An Example of a Fan Trap
▪ If we attempt to answer the question ‘At which branch does staff number SG37 work?’, we are unable to give a specific answer based on the current structure.
▪ We can only determine that staff number SG37 works at Branch B003 or B007.
▪ The inability to answer this question specifically is the result of a fan trap associated with the misrepresentation of the correct relationships between the Staff, Division, and Branch entities.
▪ We resolve this fan trap by restructuring the original ER model to represent the correct association between these entities.
Restructuring ER model to remove Fan Trap
CHASM TRAP
Where a model suggests the existence of a relationship between entity types, but the pathway does not exist between certain entity occurrences.
An Example of a Chasm Trap
▪ Therefore, to solve this problem, we need to identify the missing relationship, which in this case is the Offers relationship between the Branch and PropertyForRent entities.
ER Model restructured to remove Chasm Trap
Semantic Net of Restructured ER Model with Chasm Trap Removed
Normalization is a database design technique, like the ER model.
🢭 It begins by examining the functional dependencies between attributes.
🢭 It uses a series of steps, described as normal forms, that help to identify the optimal grouping for these attributes.
1. Top-down approach: a validation technique to check the structure of relations.
2. Bottom-up approach: a standalone database design technique.
▪ The major aim of relational database design is to group attributes into relations to minimize data redundancy.
▪ The opportunity to use normalization as a bottom-up standalone technique (approach 2) is often limited by the level of detail that the database designer is reasonably expected to manage.
🢭 Minimal redundancy, with each attribute represented only once, with the important exception of attributes that form all or part of foreign keys, which are essential for joining related relations.
▪ The benefits of using a database that has a suitable set of relations include:
o Reduction in the file storage space required by the base relations, thus minimizing costs.
▪ Relational databases also rely on the existence of a certain amount of data redundancy.
▪ This redundancy is in the form of copies of primary keys (or candidate keys) acting as foreign keys in related relations to enable the modeling of relationships between data.
▪ Problems associated with data redundancy are illustrated by comparing the Staff and Branch relations with the StaffBranch relation.
Staff (staffNo, sName, position, salary, branchNo)
Branch (branchNo, bAddress)
StaffBranch (staffNo, sName, position, salary, branchNo, bAddress)
▪ The StaffBranch relation has redundant data; the details of a branch are repeated for every member of staff.
Insertion Anomalies
1. For example, to insert the details of new staff located at branch number B007 into the StaffBranch relation, we must enter the correct details of branch number B007 so that the branch details are consistent with the values for branch B007 in other tuples. The separate Staff and Branch relations do not suffer from this potential inconsistency because we enter only the appropriate branch number for each staff member in the Staff relation.
Instead, the details of branch number B007 are recorded in the database as a single tuple in the Branch relation.
2. To insert details of a new branch that currently has no members of staff into the StaffBranch relation, it is necessary to enter nulls into the attributes for staff, such as staffNo.
However, as staffNo is the primary key for the StaffBranch relation, attempting to enter nulls for staffNo violates entity integrity, and is not allowed. We therefore cannot enter a tuple for a new branch into the StaffBranch relation with a null for the staffNo. The design of the relations shown in the figure avoids this problem because branch details are entered in the Branch relation separately from the staff details.
The details of staff ultimately located at that branch are entered at a later date into the Staff relation.
Deletion Anomalies
▪ If we delete a tuple from the StaffBranch relation that represents the last member of staff located at a branch, the details about that branch are also lost from the database.
▪ For example, if we delete the tuple for staff number SA9 (Mary Howe) from the StaffBranch relation, the details relating to branch number B007 are lost from the database.
▪ The design of the relations in the figure avoids this problem, because branch tuples are stored separately from staff tuples and only the attribute branchNo relates the two relations. If we delete the tuple for staff number SA9 from the Staff relation, the details on branch number B007 remain unaffected in the Branch relation.
Modification Anomalies
▪ If we want to change the value of one of the attributes of a particular branch in the StaffBranch relation, for example the address for branch number B003, we must update the tuples of all staff located at that branch.
▪ If this modification is not carried out on all the appropriate tuples of the StaffBranch relation, the database will become inconsistent.
▪ In the example, branch number B003 may appear to have different addresses in different staff tuples.
▪ While the StaffBranch relation is subject to update anomalies, we can avoid these anomalies by decomposing the original relation into the Staff and Branch relations.
▪ There are two important properties associated with decomposition of a larger relation into smaller relations:
1. The lossless-join property ensures that any instance of the original relation can be identified from corresponding instances in the smaller relations.
2. The dependency preservation property ensures that a constraint on the original relation can be maintained by simply enforcing some constraint on each of the smaller relations. We do not need to
perform joins on the smaller relations to check whether a constraint on the original relation is violated.
Functional Dependencies
▪ An important concept associated with normalization is functional dependency, which describes the relationship between attributes.
Characteristics of Functional Dependencies
Assume that a relational schema has attributes (A, B, C, . . . , Z) and that the database is described by a single universal relation called R = (A, B, C, . . . , Z). This assumption means that every attribute in the database has a unique name.
Functional dependency
▪ Describes the relationship between attributes in a relation.
▪ For example, if A and B are attributes of relation R, B is functionally dependent on A (denoted A → B) if each value of A is associated with exactly one value of B. (A and B may each consist of one or more attributes.)
▪ It is a property of the meaning or semantics of the attributes in a relation.
▪ The semantics indicate how attributes relate to one another, and specify the functional dependencies between attributes.
▪ When a functional dependency is present, the dependency is specified as a constraint between the attributes.
▪ Diagrammatic representation.
● Determinant
- Refers to the attribute, or group of attributes, on the left-hand side of the arrow of a functional dependency.
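The definition of A → B can be checked mechanically against a set of tuples: the dependency fails as soon as two tuples agree on A but differ on B. The sketch below is only an illustration. The function name `fd_holds` and the sample Staff rows are hypothetical, and a dependency that holds in sample data is only evidence, since a true functional dependency must hold for all possible values.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if lhs -> rhs holds in rows (a list of dicts).

    lhs and rhs are tuples of attribute names, so either side
    may consist of one or more attributes.
    """
    seen = {}
    for row in rows:
        a = tuple(row[x] for x in lhs)
        b = tuple(row[x] for x in rhs)
        # setdefault stores b the first time a is seen,
        # and returns the previously stored value otherwise
        if seen.setdefault(a, b) != b:
            return False
    return True


# Hypothetical sample of the Staff relation
staff = [
    {"staffNo": "SL21", "sName": "John White", "branchNo": "B005"},
    {"staffNo": "SG37", "sName": "Ann Beech", "branchNo": "B003"},
    {"staffNo": "SG14", "sName": "David Ford", "branchNo": "B003"},
]

staff_determines_name = fd_holds(staff, ("staffNo",), ("sName",))
branch_determines_staff = fd_holds(staff, ("branchNo",), ("staffNo",))
```

Here staffNo determines sName, while branchNo does not determine staffNo because branch B003 has two members of staff.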
An Example Functional Dependency
● However, the only functional dependency that remains true for all possible values for the staffNo and sName attributes of the Staff relation is:
staffNo → sName
Transitive Dependencies
A condition where A, B, and C are attributes of a relation such that if A → B and B → C, then C is transitively dependent on A via B (provided that A is not functionally dependent on B or C).
It is important to recognize a transitive dependency because its existence in a relation can potentially cause update anomalies.
Example of a transitive functional dependency
▪ Consider the functional dependencies in the StaffBranch relation:
staffNo → sName, position, salary, branchNo, bAddress
branchNo → bAddress
● bAddress is transitively dependent on staffNo via branchNo (staffNo → branchNo and branchNo → bAddress).
Identifying Functional Dependencies
▪ Identifying all functional dependencies between a set of attributes is relatively simple if the meaning of each attribute and the relationships between the attributes are well understood.
▪ This information should be provided by the enterprise in the form of discussions with users and/or documentation such as the users’ requirements specification.
▪ However, if the users are unavailable for consultation and/or the documentation is incomplete then, depending on the database application, it may be necessary for the database designer to use their common sense and/or experience to provide the missing information.
Example - Identifying a set of functional dependencies for the StaffBranch relation
● Examine the semantics of the attributes in the StaffBranch relation. Assume that position held and branch determine a member of staff’s salary.
● With sufficient information available, identify the functional dependencies for the StaffBranch relation as:
staffNo → sName, position, salary, branchNo, bAddress
branchNo → bAddress
bAddress → branchNo
branchNo, position → salary
bAddress, position → salary
Example - Using sample data to identify functional dependencies.
▪ Consider the data for attributes denoted A, B, C, D, and E in the Sample relation.
▪ It is important to establish that the sample data values shown in the relation are representative of all possible values that can be held by attributes A, B, C, D, and E. Assume this is true despite the relatively small amount of data shown in this relation.
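As a rough illustration of the sample-data approach, the sketch below lists every single-attribute dependency X → Y that holds in a small relation. The rows are hypothetical (the Sample relation itself is not reproduced in these notes), the helper names are made up for this example, and the result is only meaningful under the stated assumption that the sample is representative of all possible values.

```python
from itertools import permutations


def holds(rows, x, y):
    """True if attribute x functionally determines attribute y in rows."""
    mapping = {}
    for row in rows:
        if mapping.setdefault(row[x], row[y]) != row[y]:
            return False
    return True


def single_attribute_fds(rows):
    """List the (x, y) pairs such that x -> y holds in rows."""
    attrs = sorted(rows[0])
    return [(x, y) for x, y in permutations(attrs, 2) if holds(rows, x, y)]


# Hypothetical sample data, not the figure from the notes
sample = [
    {"A": "a", "B": "b", "C": "z"},
    {"A": "e", "B": "b", "C": "r"},
    {"A": "a", "B": "d", "C": "z"},
]

fds = single_attribute_fds(sample)
```

For this data only A → C and C → A survive; every other pair is refuted by at least two tuples that agree on the left-hand side and differ on the right.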
1NF EXAMPLE
▪ We identify the key attribute for the ClientRental unnormalized
table as clientNo.
▪ Next, we identify the repeating group in the unnormalized table as
the property rented details, which repeats for each client.
▪ The structure of the repeating group is:
Repeating Group = (propertyNo, pAddress, rentStart, rentFinish,
rent, ownerNo, oName)
3NF EXAMPLE
▪ The relations have the following form:
Client (clientNo, cName)
Rental (clientNo, propertyNo, rentStart, rentFinish)
PropertyOwner (propertyNo, pAddress, rent, ownerNo, oName)
Third Normal Form (3NF)
▪ Based on the concept of transitive dependency.
▪ Transitive dependency is a condition where
🢭 A, B and C are attributes of a relation such that if A → B and B → C,
🢭 then C is transitively dependent on A through B (provided that A is not functionally dependent on B or C).
DEFINITION:
▪ A relation that is in 1NF and 2NF and in which no non-primary-key attribute is transitively dependent on the primary key.
2NF to 3NF
▪ Identify the primary key in the 2NF relation.
▪ Identify functional dependencies in the relation.
▪ If transitive dependencies exist on the primary key, remove them by placing them in a new relation along with a copy of their determinant.
▪ In PropertyOwner, oName is transitively dependent on propertyNo via ownerNo. After removing this dependency, the new relations have the form:
PropertyForRent (propertyNo, pAddress, rent, ownerNo)
Owner (ownerNo, oName)
TABLE IN 3NF FORM
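The 2NF to 3NF step for PropertyOwner can be sketched in Python as two projections, followed by a join that checks nothing was lost. The tuples below are hypothetical sample data and the helper names `project` and `natural_join` are made up for this illustration.

```python
def project(rows, attrs):
    """Project rows onto attrs, removing duplicate tuples."""
    result = []
    for row in rows:
        t = {a: row[a] for a in attrs}
        if t not in result:
            result.append(t)
    return result


def natural_join(left, right, on):
    """Join two lists of dicts on a single shared attribute."""
    return [{**l, **r} for l in left for r in right if l[on] == r[on]]


# Hypothetical PropertyOwner tuples: in 2NF, but ownerNo -> oName
# makes oName transitively dependent on propertyNo.
property_owner = [
    {"propertyNo": "PG4", "pAddress": "6 Lawrence St", "rent": 350,
     "ownerNo": "CO40", "oName": "Tina Murphy"},
    {"propertyNo": "PG16", "pAddress": "5 Novar Dr", "rent": 450,
     "ownerNo": "CO93", "oName": "Tony Shaw"},
    {"propertyNo": "PG36", "pAddress": "2 Manor Rd", "rent": 375,
     "ownerNo": "CO93", "oName": "Tony Shaw"},
]

# The transitively dependent attribute moves to a new relation
# together with a copy of its determinant (ownerNo).
property_for_rent = project(
    property_owner, ["propertyNo", "pAddress", "rent", "ownerNo"])
owner = project(property_owner, ["ownerNo", "oName"])

# Joining on ownerNo should give back the original tuples.
rejoined = natural_join(property_for_rent, owner, "ownerNo")
```

The owner name for CO93 is now stored once instead of once per property, and the join on ownerNo reproduces the original tuples.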
▪ General definition of 3NF: a relation that is in first and second normal form and in which no non-primary-key attribute is transitively dependent on any candidate key.
TOPIC 3: QUERY PROCESSING
▪ The aim of query processing is to determine which of the possible execution strategies is the most cost effective.
▪ In network and hierarchical DBMSs, the low-level procedural query language is generally embedded in a high-level programming language.
▪ It is the programmer’s responsibility to select the most appropriate execution strategy.
▪ With declarative languages such as SQL, the user specifies what data is required rather than how it is to be retrieved.
▪ This relieves the user of knowing what constitutes a good execution strategy.
▪ Additionally, giving the DBMS the responsibility for selecting the best strategy prevents users from choosing strategies that are known to be inefficient and gives the DBMS more control over system performance.
There are two main techniques for query optimization:
▪ The first technique uses heuristic rules that order the operations in a query.
▪ The other technique compares different strategies based on their relative costs and selects the one that minimizes resource usage.
Disk access tends to be the dominant cost in query processing for a centralized DBMS.
Query Processing
The activities involved in parsing, validating, optimizing, and executing a query.
Aims of Query Processing:
🢭 transform a query written in a high-level language (e.g. SQL) into a correct and efficient execution strategy expressed in a low-level language (implementing RA);
🢭 execute the strategy to retrieve the required data.
Query Optimization
Activity of choosing an efficient execution strategy for processing a query.
▪ An important aspect of query processing is query optimization.
▪ As there are many equivalent transformations of the same high-level query, the aim of query optimization is to choose the one that minimizes resource usage.
▪ Generally, we try to reduce the total execution time of the query, which is the sum of the execution times of all individual operations that make up the query.
▪ Resource usage may also be viewed as the response time of the query, in which case we concentrate on maximizing the number of parallel operations.
▪ Since the problem is computationally intractable with a large number of relations, the strategy adopted is generally reduced to finding a near optimum solution.
▪ Both methods of query optimization depend on database statistics to evaluate properly the different options that are available.
▪ The accuracy and currency of these statistics have a significant bearing on the efficiency of the execution strategy chosen.
▪ The statistics cover information about relations, attributes, and indexes.
▪ For example, the system catalog may store statistics giving the cardinality of relations, the number of distinct values for each attribute, and the number of levels in a multilevel index.
▪ Keeping the statistics current can be problematic. If the DBMS updates the statistics every time a tuple is inserted, updated, or deleted, this would have a significant impact on performance during peak periods.
▪ An alternative, and generally preferable, approach is to update the statistics on a periodic basis, for example nightly, or whenever the system is idle.
▪ Another approach taken by some systems is to make it the users’ responsibility to indicate when the statistics are to be updated.
Comparison of different processing strategies
Find all Managers who work at a London branch.
SELECT *
FROM Staff s, Branch b
WHERE s.branchNo = b.branchNo AND
(s.position = 'Manager' AND b.city = 'London');
▪ Assume:
🢭 1000 tuples in Staff; 50 tuples in Branch;
🢭 50 Managers; 5 London branches;
🢭 no indexes or sort keys;
🢭 results of any intermediate operations stored on disk;
🢭 cost of the final write is ignored;
🢭 tuples are accessed one at a time.
▪ Costs (in disk accesses) are:
(1) (1000 + 50) + 2*(1000 * 50) = 101 050
(2) 2*1000 + (1000 + 50) = 3 050
(3) 1000 + 2*50 + 5 + (50 + 5) = 1 160
▪ Cartesian product and join operations are much more expensive than selection, and the third option significantly reduces the size of the relations being joined together.
▪ Query Processing has four main phases:
🢭 decomposition (consisting of parsing and validation);
🢭 optimization;
🢭 code generation;
🢭 execution.
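Returning to the three costs quoted for the Managers in London query: the three strategies themselves are not reproduced in these notes. The sketch below assumes the usual reading, namely (1) Cartesian product followed by selection, (2) join followed by selection, and (3) selection on each relation followed by join. The breakdown of each total into reads and writes in the comments is an interpretation under the stated assumptions, not something the notes spell out.

```python
n_staff, n_branch = 1000, 50    # tuples in Staff and Branch
n_managers, n_london = 50, 5    # tuples surviving each selection

# (1) Cartesian product, then selection (assumed strategy):
#     read both relations, write the product to disk, read it back.
cost1 = (n_staff + n_branch) + 2 * (n_staff * n_branch)

# (2) Join on branchNo, then selection (assumed strategy):
#     read both relations; the join yields one tuple per staff member,
#     which is written to disk and read back for the selection.
cost2 = 2 * n_staff + (n_staff + n_branch)

# (3) Selection first, then join (assumed strategy):
#     read Staff and write the Managers, read Branch and write the
#     London branches, then read both small results for the join.
cost3 = (n_staff + n_managers) + (n_branch + n_london) \
    + (n_managers + n_london)

ratio = cost1 / cost3  # how much cheaper strategy (3) is than (1)
```

Under these assumptions the totals match the figures quoted above, and strategy (3) needs roughly 87 times fewer disk accesses than strategy (1).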
Phases of Query Processing
Dynamic versus Static Optimization
▪ There are two choices for when the first three phases of query processing can be carried out:
🢭 dynamically every time the query is run;
🢭 statically when the query is first submitted.
Dynamic optimization:
▪ Advantages of dynamic query optimization arise from the fact that the information is up to date.
▪ Disadvantages are that the performance of the query is affected, and time may limit finding the optimum strategy.
Static optimization:
▪ Advantages of static query optimization are the removal of runtime overhead, and more time to find the optimum strategy.
▪ Disadvantages arise from the fact that the chosen execution strategy may no longer be optimal when the query is run.
▪ Could use a hybrid approach to overcome this disadvantage, where the query is re-optimized if the system detects that the database statistics have changed significantly since the query was last compiled.
Query Decomposition
Query decomposition is the first phase of query processing.
▪ Aims are to transform the high-level query into a Relational Algebra query and check that the query is syntactically and semantically correct.
▪ Typical stages are:
🢭 analysis,
🢭 normalization,
🢭 semantic analysis,
🢭 simplification,
🢭 query restructuring.
Analysis
▪ Analyze query lexically and syntactically using compiler techniques.
▪ Verify relations and attributes exist.
▪ Verify operations are appropriate for object type.
Example
SELECT staff_no
FROM Staff
WHERE position > 10;
▪ This query would be rejected on two grounds:
🢭 staff_no is not defined for Staff relation (should be staffNo).
🢭 Comparison ‘>10’ is incompatible with type position, which is variable character string.
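The two checks applied in this example (attributes exist, operations suit the attribute type) can be sketched against a toy catalog. Everything below is hypothetical: the `CATALOG` structure, the `analyze` function, and the simplified query representation stand in for what a real parser and system catalog would provide.

```python
# Hypothetical catalog: relation name -> {attribute name: type}
CATALOG = {
    "Staff": {"staffNo": "varchar", "position": "varchar", "salary": "decimal"},
}


def analyze(relation, select_attrs, comparisons):
    """Return a list of error messages for a single-relation query.

    comparisons is a list of (attribute, operator, literal) triples.
    """
    if relation not in CATALOG:
        return [f"unknown relation {relation}"]
    schema = CATALOG[relation]
    errors = []
    # Verify that every selected attribute exists in the relation.
    for attr in select_attrs:
        if attr not in schema:
            errors.append(f"{attr} is not defined for {relation}")
    # Verify that each comparison suits the attribute's type.
    for attr, op, literal in comparisons:
        if attr not in schema:
            errors.append(f"{attr} is not defined for {relation}")
        elif schema[attr] == "varchar" and not isinstance(literal, str):
            errors.append(f"{op} {literal} is incompatible with type of {attr}")
    return errors


# SELECT staff_no FROM Staff WHERE position > 10;
errors = analyze("Staff", ["staff_no"], [("position", ">", 10)])
```

The call reports both problems from the example, while a corrected query such as `analyze("Staff", ["staffNo"], [("position", "=", "Manager")])` returns an empty list.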
Finally, the query is transformed into some internal representation more suitable for processing.
Some kind of query tree is typically chosen, constructed as follows:
🢭 Leaf node created for each base relation.
🢭 Non-leaf node created for each intermediate relation produced by RA operation.
🢭 Root of tree represents query result.
🢭 Sequence is directed from leaves to root.
Normalization
Conjunctive normal form: A sequence of conjuncts that are connected with the ∧ (AND) operator. Each conjunct contains one or more terms connected by the ∨ (OR) operator.
Example: (position = 'Manager' ∨ salary > 20000) ∧ (branchNo = 'B003')
Disjunctive normal form: A sequence of disjuncts that are connected with the ∨ (OR) operator. Each disjunct contains one or more terms connected by the ∧ (AND) operator.
Example: (position = 'Manager' ∧ branchNo = 'B003' )
Semantic Analysis
▪ A relation connection graph.
▪ Normalized attribute connection graph.
▪ Relation connection graph
▪ Create node for each relation and node for result. Create edges between two nodes that represent a join, and edges between nodes that represent projection.
- If not connected, query is incorrectly formulated.
▪ Normalized attribute connection graph
- If graph has cycle for which valuation sum is negative, query is contradictory.
Checking Semantic Correctness
SELECT p.propertyNo, p.street
FROM Client c, Viewing v, PropertyForRent p
WHERE c.clientNo = v.clientNo AND
c.maxRent >= 500 AND
Query restructuring
In the final stage of query decomposition, the query is restructured to provide a more efficient implementation.
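Returning to the relation connection graph test from Semantic Analysis, the connectivity check can be sketched as a simple graph traversal. The query text in these notes is cut off, so this sketch assumes the WHERE clause contains no join between Viewing and PropertyForRent. The edge list and the suggested missing join are assumptions made for illustration.

```python
def is_connected(nodes, edges):
    """True if the undirected graph (nodes, edges) is connected."""
    nodes = list(nodes)
    adjacent = {n: set() for n in nodes}
    for a, b in edges:
        adjacent[a].add(b)
        adjacent[b].add(a)
    # Depth-first traversal from an arbitrary starting node
    seen, stack = set(), [nodes[0]]
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(adjacent[n] - seen)
    return len(seen) == len(nodes)


# One node per relation plus one node for the result
nodes = ["Client", "Viewing", "PropertyForRent", "result"]
edges = [
    ("Client", "Viewing"),          # join: c.clientNo = v.clientNo
    ("PropertyForRent", "result"),  # projection: p.propertyNo, p.street
]

query_ok = is_connected(nodes, edges)

# Assumed fix: add the join v.propertyNo = p.propertyNo
fixed_ok = is_connected(nodes, edges + [("Viewing", "PropertyForRent")])
```

Under this assumption the graph falls into two separate pieces, so the query is flagged as incorrectly formulated. Adding the join between Viewing and PropertyForRent connects it.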
TOPIC 4: QUERY OPTIMIZATION
Query Optimization
Activity of choosing an efficient execution strategy for processing a query.
▪ An important aspect of query processing is query optimization.
▪ As there are many equivalent transformations of the same high-level query, the aim of query optimization is to choose the one that minimizes resource usage.
▪ Generally, we try to reduce the total execution time of the query, which is the sum of the execution times of all individual operations that make up the query.
▪ Resource usage may also be viewed as the response time of the query, in which case we concentrate on maximizing the number of parallel operations.
▪ Since the problem is computationally intractable with a large number of relations, the strategy adopted is generally reduced to finding a near optimum solution.
▪ Both methods of query optimization depend on database statistics to evaluate properly the different options that are available.
▪ The accuracy and currency of these statistics have a significant bearing on the efficiency of the execution strategy chosen.
▪ The statistics cover information about relations, attributes, and indexes.
▪ For example, the system catalog may store statistics giving the cardinality of relations, the number of distinct values for each attribute, and the number of levels in a multilevel index.
▪ Keeping the statistics current can be problematic. If the DBMS updates the statistics every time a tuple is inserted, updated, or deleted, this would have a significant impact on performance during peak periods.
▪ An alternative, and generally preferable, approach is to update the statistics on a periodic basis, for example nightly, or whenever the system is idle.
▪ Another approach taken by some systems is to make it the users’ responsibility to indicate when the statistics are to be updated.
Heuristical Approach to Query Optimization
The heuristical approach to query optimization uses transformation rules to convert one relational algebra expression into an equivalent form that is known to be more efficient.
Transformation Rules for RA Operations
(i) Conjunctive Selection operations can cascade into individual Selection operations (and vice versa).
σ_{p∧q∧r}(R) = σ_p(σ_q(σ_r(R)))
▪ Sometimes referred to as cascade of Selection.
σ_{branchNo='B003' ∧ salary>15000}(Staff) = σ_{branchNo='B003'}(σ_{salary>15000}(Staff))
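The cascade of Selection rule can be checked on a few tuples. In this sketch the `select` helper and the Staff rows are hypothetical, and Python lists stand in for relations.

```python
def select(predicate, rows):
    """Relational Selection: keep the rows that satisfy predicate."""
    return [row for row in rows if predicate(row)]


def in_b003(row):
    return row["branchNo"] == "B003"


def high_paid(row):
    return row["salary"] > 15000


# Hypothetical Staff tuples
staff = [
    {"staffNo": "SG37", "branchNo": "B003", "salary": 12000},
    {"staffNo": "SG14", "branchNo": "B003", "salary": 18000},
    {"staffNo": "SL21", "branchNo": "B005", "salary": 30000},
]

# One Selection with a conjunctive predicate ...
conjunctive = select(lambda r: in_b003(r) and high_paid(r), staff)
# ... versus a cascade of two individual Selections
cascaded = select(in_b003, select(high_paid, staff))
```

Both forms return the same single tuple (SG14). The optimizer is free to pick whichever order lets the more selective predicate run first.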
Π_L Π_M … Π_N(R) = Π_L(R)
For example:
Π_{lName} Π_{branchNo, lName}(Staff) = Π_{lName}(Staff)
(iv) Commutativity of Selection and Projection.
If predicate p involves only attributes in the projection list, Selection and Projection operations commute:
Π_{A1, …, Am}(σ_p(R)) = σ_p(Π_{A1, …, Am}(R)), where p involves only A1, …, Am
(vi) Commutativity of Selection and Theta join (or Cartesian product).
▪ If the selection predicate involves only attributes of one of the join relations, Selection and Join (or Cartesian product) operations commute:
σ_p(R ⋈_r S) = (σ_p(R)) ⋈_r S
σ_p(R × S) = (σ_p(R)) × S, where p involves only attributes of R
For prospective renters of flats, find properties that match requirements and are owned by CO93.
SELECT p.propertyNo, p.street
FROM Client c, Viewing v, PropertyForRent p
WHERE c.prefType = 'Flat' AND
c.clientNo = v.clientNo AND
v.propertyNo = p.propertyNo AND
c.maxRent >= p.rent AND
Binary Search (Ordered File, No Index)
▪ If predicate is of form A = x, and file is ordered on key attribute A, cost estimate:
[log2(nBlocks(R))]
▪ Generally, cost estimate is:
[log2(nBlocks(R))] + [SCA(R)/bFactor(R)] - 1
▪ First term represents cost of finding first tuple using binary search.
▪ Expect there to be SCA(R) tuples satisfying predicate.
Equality of Hash Key
▪ If attribute A is hash key, apply hashing algorithm to calculate target address for tuple.
▪ If there is no overflow, expected cost is 1.
▪ If there is overflow, additional accesses may be necessary.
Equality Condition on Primary Key
▪ Can use primary index to retrieve single record satisfying condition.
▪ Need to read one more block than number of index accesses, equivalent to number of levels in index, so estimated cost is:
nLevelsA(I) + 1
Inequality Condition on Primary Key
▪ Can first use index to locate record satisfying predicate (A = x).
▪ Provided index is sorted, records can be found by accessing all records before/after this one.
▪ Assuming uniform distribution, would expect half the records to satisfy inequality, so estimated cost is:
nLevelsA(I) + [nBlocks(R)/2]
Equality Condition on Clustering Index
▪ Can use index to retrieve required records.
▪ Estimated cost is:
nLevelsA(I) + [SCA(R)/bFactor(R)]
▪ Second term is estimate of number of blocks that will be required to store number of tuples that satisfy equality condition, represented as SCA(R).
Equality Condition on Non-Clustering Index
▪ Can use index to retrieve required records.
▪ Have to assume that tuples are on different blocks (index is not clustered this time), so estimated cost becomes:
nLevelsA(I) + [SCA(R)]
Inequality Condition on a Secondary B+-Tree Index
▪ From leaf nodes of tree, can scan keys from smallest value up to x (< or <=) or from x up to maximum value (> or >=).
▪ Assuming uniform distribution, would expect half the leaf node blocks to be accessed and, via index, half the file records to be accessed.
▪ Estimated cost is:
nLevelsA(I) + [nLfBlocksA(I)/2 + nTuples(R)/2]
Composite Predicates - Conjunction without Disjunction
▪ May consider following approaches:
- If one attribute has index or is ordered, can use one of above selection strategies. Can then check each retrieved record.
- For equality on two or more attributes, with composite index (or hash key) on combined attributes, can search index directly.
- With secondary indexes on one or more attributes (involved only in equality conditions in predicate), could use record pointers if exist.
Composite Predicates - Selections with Disjunction
▪ If one term contains an ∨ (OR), and term requires linear search, entire selection requires linear search.
▪ Only if index or sort order exists on every term can selection be optimized by retrieving records that satisfy each condition and applying union operator.
▪ Again, record pointers can be used if they exist.
Summary of cost of selection operation
Join Operation
▪ Main strategies for implementing join:
🢭 Block Nested Loop Join.
🢭 Indexed Nested Loop Join.
🢭 Sort-Merge Join.
🢭 Hash Join.
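Before the join algorithms are examined, the selection cost estimates given earlier can be compared numerically. The sketch below turns each formula into a small function. It assumes that the square brackets in the formulas mean rounding up to the next whole block, and the function names and the statistics used in the example are hypothetical.

```python
from math import ceil, log2


def binary_search_cost(n_blocks, sc=1, b_factor=1):
    # [log2(nBlocks(R))] + [SCA(R)/bFactor(R)] - 1
    # With sc = 1 (A is a key) this reduces to [log2(nBlocks(R))].
    return ceil(log2(n_blocks)) + ceil(sc / b_factor) - 1


def primary_key_equality_cost(n_levels):
    # nLevelsA(I) + 1
    return n_levels + 1


def primary_key_inequality_cost(n_levels, n_blocks):
    # nLevelsA(I) + [nBlocks(R)/2]
    return n_levels + ceil(n_blocks / 2)


def clustering_index_equality_cost(n_levels, sc, b_factor):
    # nLevelsA(I) + [SCA(R)/bFactor(R)]
    return n_levels + ceil(sc / b_factor)


def non_clustering_index_equality_cost(n_levels, sc):
    # nLevelsA(I) + [SCA(R)]
    return n_levels + ceil(sc)


def btree_inequality_cost(n_levels, n_leaf_blocks, n_tuples):
    # nLevelsA(I) + [nLfBlocksA(I)/2 + nTuples(R)/2]
    return n_levels + ceil(n_leaf_blocks / 2 + n_tuples / 2)


# Hypothetical statistics: 3000 tuples, 30 tuples per block (100 blocks),
# a 2-level index with 50 leaf blocks, and selection cardinality 6.
costs = {
    "binary search on key": binary_search_cost(100),
    "equality on primary key": primary_key_equality_cost(2),
    "inequality on primary key": primary_key_inequality_cost(2, 100),
    "equality on clustering index": clustering_index_equality_cost(2, 6, 30),
    "equality on non-clustering index": non_clustering_index_equality_cost(2, 6),
    "inequality on secondary B+-tree": btree_inequality_cost(2, 50, 3000),
}
```

With these figures the inequality via a secondary B+-tree index is by far the most expensive option, because half the tuples are fetched one block access at a time.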
Estimating Cardinality of Join
▪ Cardinality of Cartesian product is:
nTuples(R) * nTuples(S)
▪ More difficult to estimate cardinality of any join as depends on distribution of values.
▪ Worst case, cannot be any greater than this value.
▪ If assume uniform distribution, can estimate for Equijoins with a predicate (R.A = S.B) as follows:
▪ If A is key of R: nTuples(T) ≤ nTuples(S)
▪ If B is key of S: nTuples(T) ≤ nTuples(R)
▪ Otherwise, could estimate cardinality of join as:
▪ nTuples(T) = SCA(R)*nTuples(S) or
▪ nTuples(T) = SCB(S)*nTuples(R)
Block Nested Loop Join
▪ Simplest join algorithm is nested loop that joins two relations together a tuple at a time.
▪ Outer loop iterates over each tuple in R, and inner loop iterates over each tuple in S.
▪ As basic unit of reading/writing is a disk block, better to have two extra loops that process blocks.
▪ Estimated cost of this approach is:
nBlocks(R) + (nBlocks(R) * nBlocks(S))
▪ Could read as many blocks as possible of smaller relation, R say, into database buffer, saving one block for inner relation and one for result.
▪ New cost estimate becomes:
nBlocks(R) + [nBlocks(S)*(nBlocks(R)/(nBuffer-2))]
▪ If can read all blocks of R into the buffer, this reduces to:
nBlocks(R) + nBlocks(S)
Indexed Nested Loop Join
▪ If have index (or hash function) on join attributes of inner relation, can use index lookup.
▪ For each tuple in R, use index to retrieve matching tuples of S.
▪ Cost of scanning R is nBlocks(R), as before.
▪ Cost of retrieving matching tuples in S depends on type of index and number of matching tuples.
▪ If join attribute A in S is PK, cost estimate is:
nBlocks(R) + nTuples(R)*(nLevelsA(I) + 1)
Sort-Merge Join
▪ For Equijoins, most efficient join is when both relations are sorted on join attributes.
▪ Can look for qualifying tuples merging relations.
▪ May need to sort relations first.
▪ Now tuples with same join value are in order.
▪ If assume join is *:* and each set of tuples with same join value can be held in database buffer at same time, then each block of each relation need only be read once.
▪ Cost estimate for the sort-merge join is:
nBlocks(R) + nBlocks(S)
▪ If a relation has to be sorted, R say, add:
nBlocks(R)*[log2(nBlocks(R))]
Hash Join
▪ For Natural or Equijoin, hash join may be used.
▪ Idea is to partition relations according to some hash function that provides uniformity and randomness.
▪ Each equivalent partition should hold same value for join attributes, although it may hold more than one value.
▪ Cost estimate of hash join is:
3(nBlocks(R) + nBlocks(S))
Summary
Projection Operation
▪ To implement projection need to:
🢭 remove attributes that are not required;
🢭 eliminate any duplicate tuples produced from previous
step. Only required if projection attributes do not include
a key.
▪ Two main approaches to eliminating duplicates:
🢭 sorting;
🢭 hashing.
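Returning to the join strategies, their cost estimates can be compared for one set of statistics. This is only a sketch: the function names and the sample figures are hypothetical, and the square brackets in the formulas are assumed to mean rounding up.

```python
from math import ceil, log2


def block_nested_loop_cost(nb_r, nb_s, n_buffer=None):
    """Cost in block accesses; R is the outer (smaller) relation."""
    if n_buffer is None:
        # nBlocks(R) + (nBlocks(R) * nBlocks(S))
        return nb_r + nb_r * nb_s
    if n_buffer - 2 >= nb_r:
        # All of R fits in the buffer: nBlocks(R) + nBlocks(S)
        return nb_r + nb_s
    # nBlocks(R) + [nBlocks(S) * (nBlocks(R) / (nBuffer - 2))]
    return nb_r + ceil(nb_s * (nb_r / (n_buffer - 2)))


def indexed_nested_loop_cost(nb_r, nt_r, n_levels):
    # Join attribute is the primary key of S:
    # nBlocks(R) + nTuples(R) * (nLevelsA(I) + 1)
    return nb_r + nt_r * (n_levels + 1)


def sort_merge_cost(nb_r, nb_s, sort_r=False, sort_s=False):
    # nBlocks(R) + nBlocks(S), plus nBlocks * [log2(nBlocks)] per sort
    cost = nb_r + nb_s
    if sort_r:
        cost += nb_r * ceil(log2(nb_r))
    if sort_s:
        cost += nb_s * ceil(log2(nb_s))
    return cost


def hash_join_cost(nb_r, nb_s):
    # 3(nBlocks(R) + nBlocks(S))
    return 3 * (nb_r + nb_s)


# Hypothetical statistics: R has 1000 tuples in 100 blocks, S has 2000
# blocks, and the index on S has 2 levels.
join_costs = {
    "block nested loop": block_nested_loop_cost(100, 2000),
    "block nested loop, 52 buffers": block_nested_loop_cost(100, 2000, 52),
    "block nested loop, R in buffer": block_nested_loop_cost(100, 2000, 102),
    "indexed nested loop": indexed_nested_loop_cost(100, 1000, 2),
    "sort-merge, already sorted": sort_merge_cost(100, 2000),
    "sort-merge, both unsorted": sort_merge_cost(100, 2000, True, True),
    "hash join": hash_join_cost(100, 2000),
}
```

The comparison shows how strongly the estimate depends on buffer space and on whether the inputs are already sorted, which is why the optimizer needs current statistics.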
Problem: Join
Set Operations
▪ Can be implemented by sorting both relations on same attributes,
and scanning through each of sorted relations once to obtain
desired result.
▪ Could use sort-merge join as basis.
▪ Estimated cost in all cases is:
nBlocks(R) + nBlocks(S) + nBlocks(R)*[log2(nBlocks(R))] +
nBlocks(S)*[log2(nBlocks(S))]
▪ Could also use hashing algorithm.
Estimating Cardinality of Set Operations
▪ As duplicates are eliminated when performing Union, difficult to estimate cardinality, but can give an upper and lower bound as:
max(nTuples(R), nTuples(S)) ≤ nTuples(T) ≤ nTuples(R) + nTuples(S)
▪ For Set Difference, can also give upper and lower bound:
0 ≤ nTuples(T) ≤ nTuples(R)
Aggregate Operations
SELECT AVG(salary)
FROM Staff;
▪ To implement query, could scan entire Staff relation and maintain running count of number of tuples read and sum of all salaries.
▪ Easy to compute average from these two running values.
Enumeration of Alternative Strategies
▪ Fundamental to efficiency of QO is the search space of possible execution strategies and the enumeration algorithm used to search this space.
▪ Query with 2 joins gives 12 join orderings:
▪ In general, n relations give (2(n – 1))!/(n – 1)! join orderings. If n = 4 this is 120; if n = 10 this is > 17.6 billion.
▪ Compounded by different selection/join methods.
Pipelining
▪ Materialization - output of one operation is stored in temporary relation for processing by next.
▪ Could also pipeline results of one operation to another without creating temporary relation.
▪ Known as pipelining or on-the-fly processing.
▪ Pipelining can save on cost of creating temporary relations and reading results back in again.
▪ Generally, pipeline is implemented as separate process or thread.
▪ Restriction 2: Cartesian products are never formed unless query itself specifies one.
▪ Restriction 3: Inner operand of each join is a base relation, never an intermediate result. This uses fact that with left-deep trees inner operand is a base relation and so already materialized.
▪ Restriction 3 excludes many alternative strategies but significantly reduces number to be considered.
Dynamic Programming
▪ Enumeration of left-deep trees using dynamic programming first proposed for System R QO.
▪ Algorithm based on assumption that the cost model satisfies principle of optimality.
▪ Thus, to obtain optimal strategy for query with n joins, only need to consider optimal strategies for subexpressions with (n – 1) joins and extend those strategies with an additional join. Remaining suboptimal strategies can be discarded.
▪ To ensure some potentially useful strategies are not discarded algorithm retains strategies with interesting orders: an intermediate result has an interesting order if it is sorted by a final ORDER BY attribute, GROUP BY attribute, or any attributes that participate in subsequent joins.
SELECT p.propertyNo, p.street
FROM Client c, Viewing v, PropertyForRent p
WHERE c.maxRent < 500 AND
c.clientNo = v.clientNo AND
v.propertyNo = p.propertyNo;
▪ Attributes c.clientNo, v.clientNo, v.propertyNo, and p.propertyNo are interesting.
▪ If any intermediate result is sorted on any of these attributes, then corresponding partial strategy must be included in search.
▪ Algorithm proceeds from the bottom up and constructs all alternative join trees that satisfy the restrictions above, as follows:
▪ Pass 1: Enumerate the strategies for each base relation using a linear search and all available indexes on the relation. These partial strategies are partitioned into equivalence classes based on any interesting orders. An additional equivalence class is created for the partial strategies with no interesting order.
▪ For each equivalence class, strategy with lowest cost is retained for consideration in next pass.
▪ Do not retain equivalence class with no interesting order if its lowest cost strategy is not lower than all other strategies.
▪ For a given relation R, any selections involving only attributes of R are processed on-the-fly. Similarly, any attributes of R that are not part of the SELECT clause and do not contribute to any subsequent join can be projected out at this stage (restriction 1 above).
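The size of the search space and the bottom-up enumeration can be sketched in a few lines. The code below is a simplified illustration, not the System R algorithm itself: it ignores interesting orders and access paths, uses a toy cost model (the sum of intermediate result sizes), and the cardinalities and join selectivities are hypothetical. It does respect Restriction 2 (no Cartesian products) and Restriction 3 (the inner operand is always a base relation).

```python
from fractions import Fraction
from itertools import combinations
from math import factorial


def join_orderings(n_relations):
    """Number of join orderings for n relations: (2(n-1))! / (n-1)!."""
    k = n_relations - 1
    return factorial(2 * k) // factorial(k)


# Hypothetical statistics for the three relations of the example query.
CARD = {"Client": 200, "Viewing": 5000, "PropertyForRent": 2000}
# Hypothetical selectivity of each join predicate.
SEL = {
    frozenset(["Client", "Viewing"]): Fraction(1, 1000),
    frozenset(["Viewing", "PropertyForRent"]): Fraction(1, 2000),
}


def result_size(rels):
    """Estimated cardinality of joining all relations in the set rels."""
    size = Fraction(1)
    for r in rels:
        size *= CARD[r]
    for pair, selectivity in SEL.items():
        if pair <= rels:
            size *= selectivity
    return size


def connected(rels, new):
    """True if a join predicate links new to a relation in rels."""
    return any(frozenset([r, new]) in SEL for r in rels)


def best_left_deep_plan():
    """Return (cost, join order) of the cheapest left-deep plan."""
    # best[S] holds the cheapest plan found for the set of relations S.
    best = {frozenset([r]): (0, (r,)) for r in CARD}
    for size in range(2, len(CARD) + 1):
        for subset in combinations(CARD, size):
            s = frozenset(subset)
            for inner in subset:  # inner operand is a base relation
                outer = s - {inner}
                if outer not in best or not connected(outer, inner):
                    continue  # would need a Cartesian product
                cost = best[outer][0] + result_size(s)
                if s not in best or cost < best[s][0]:
                    best[s] = (cost, best[outer][1] + (inner,))
    return best[frozenset(CARD)]


best_cost, best_order = best_left_deep_plan()
```

Only the cheapest plan per set of relations is kept at each pass, which is the principle of optimality at work; a fuller version would keep one plan per interesting order as well.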
● At this point, it may be found that the transaction has violated serializability or has violated an integrity constraint and the transaction has to be aborted.
● Alternatively, the system may fail and any data updated by the transaction may not have been safely recorded on secondary storage.
● The transaction would go into the FAILED state and would have to be aborted.
● If the transaction has been successful, any updates can be safely recorded and the transaction can go to the COMMITTED state.
● FAILED, which occurs if the transaction cannot be committed or the transaction is aborted while in the ACTIVE state, perhaps due to the user aborting the transaction or as a result of the concurrency control protocol aborting the transaction to ensure serializability.
State Transition Diagram for Transaction
Properties of Transactions
Four basic (ACID) properties of a transaction are:
(i) Atomicity: ‘All or nothing’ property.
(ii) Consistency: Must transform database from one consistent state to another.
(iii) Isolation: Partial effects of incomplete transactions should not be visible to other transactions.
(iv) Durability: Effects of a committed transaction are permanent and must not be lost because of later failure.
DBMS Transaction Subsystem
● The scheduler is sometimes referred to as the lock manager if the concurrency control protocol is locking based.
● The objective of the scheduler is to maximize concurrency without
allowing concurrently executing transactions to interfere with one
another, and so compromise the integrity or consistency of the
database.
● If a failure occurs during the transaction, then the database could
be inconsistent.
● It is the task of the recovery manager to ensure that the database
is restored to the state it was in before the start of the transaction,
and therefore a consistent state.
● The buffer manager is responsible for the efficient transfer of data
between disk storage and main memory.
● Since the write operation on balx in T8 does not conflict with the subsequent read operation on baly in T7, we can change the order of these operations to produce the equivalent schedule S2.
● If we also now change the order of the following non-conflicting operations, we produce the equivalent serial schedule S3:
- Change the order of the write(balx) of T8 with the write(baly) of T7.
- Change the order of the read(balx) of T8 with the read(baly) of T7.
- Change the order of the read(balx) of T8 with the write(baly) of T7.
● Schedule S3 is a serial schedule and, since S1 and S2 are equivalent to S3, S1 and S2 are serializable schedules.
● This type of serializability is known as conflict serializability.
● Conflict serializable schedule orders any conflicting operations in same way as some serial execution.
Testing for conflict serializability
Under constrained write rule (transaction updates data item based on its old value, which is first read), use precedence graph to test for serializability.
Precedence Graph
● Create:
o node for each transaction;
o a directed edge Ti → Tj, if Tj reads the value of an item written by Ti;
o a directed edge Ti → Tj, if Tj writes a value into an item after it has been read by Ti.
● If precedence graph contains cycle schedule is not conflict serializable.
Example - Non-conflict serializable schedule
● T9 is transferring £100 from one account with balance balx to another account with balance baly.
● T10 is increasing balance of these two accounts by 10%.
● Precedence graph has a cycle and so is not serializable.
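The precedence graph test can be sketched as follows. The representation of a schedule as (transaction, operation, item) triples and the interleaving used for T9 and T10 are assumptions made for illustration. Only the two edge rules listed for the constrained write rule are implemented, and an edge is added for every later conflicting operation rather than only the most recent one.

```python
def precedence_graph(schedule):
    """Build the edge set Ti -> Tj from a list of (txn, op, item) triples."""
    edges = set()
    for i, (ti, op_i, item_i) in enumerate(schedule):
        for tj, op_j, item_j in schedule[i + 1:]:
            if ti == tj or item_i != item_j:
                continue
            # Tj reads an item written by Ti, or Tj writes an item
            # after it has been read by Ti.
            if (op_i, op_j) in {("write", "read"), ("read", "write")}:
                edges.add((ti, tj))
    return edges


def has_cycle(edges):
    """Depth-first search for a cycle in the directed graph."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set())
    state = {}  # node -> "visiting" or "done"

    def visit(node):
        state[node] = "visiting"
        for nxt in graph[node]:
            if state.get(nxt) == "visiting":
                return True
            if nxt not in state and visit(nxt):
                return True
        state[node] = "done"
        return False

    return any(node not in state and visit(node) for node in graph)


# Hypothetical interleaving: T9 updates balx, T10 updates both items,
# then T9 updates baly.
interleaved = [
    ("T9", "read", "balx"), ("T9", "write", "balx"),
    ("T10", "read", "balx"), ("T10", "write", "balx"),
    ("T10", "read", "baly"), ("T10", "write", "baly"),
    ("T9", "read", "baly"), ("T9", "write", "baly"),
]
serial = [op for op in interleaved if op[0] == "T9"] + \
         [op for op in interleaved if op[0] == "T10"]

interleaved_edges = precedence_graph(interleaved)
serial_edges = precedence_graph(serial)
```

For the interleaved schedule the graph contains both T9 → T10 and T10 → T9, so it has a cycle; the serial schedule yields only T9 → T10.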
View Serializability
● Offers less stringent definition of schedule equivalence than
conflict serializability.
● Two schedules S1 and S2 are view equivalent if:
o For each data item x, if Ti reads initial value of x in S1, Ti
must also read initial value of x in S2.
o For each read on x by Ti in S1, if value read by Ti is written by Tj, Ti must also read value of x produced by Tj in S2.
o For each data item x, if last write on x performed by Ti in
S1, same transaction must perform final write on x in S2.
● Schedule is view serializable if it is view equivalent to a serial schedule.
● Every conflict serializable schedule is view serializable, although converse is not true.
● It can be shown that any view serializable schedule that is not conflict serializable contains one or more blind writes.
● In general, testing whether schedule is view serializable is NP-complete.
Example - View Serializable schedule
Recoverability
● Serializability identifies schedules that maintain database consistency, assuming no transaction fails.
● Could also examine recoverability of transactions within schedule.
● If transaction fails, atomicity requires effects of transaction to be undone.
● Durability states that once transaction commits, its changes cannot be undone (without running another, compensating, transaction).
Recoverable Schedule
A schedule where, for each pair of transactions Ti and Tj, if Tj reads a data item previously written by Ti, then the commit operation of Ti precedes the commit operation of Tj.
Concurrency Control Techniques
● Two basic concurrency control techniques:
▪ Locking,
▪ Timestamping.
● Both are conservative approaches: delay transactions in case they conflict with other transactions.
● Optimistic methods assume conflict is rare and only check for conflicts at commit.
Locking
A procedure used to control concurrent access to data. When one transaction is accessing the database, a lock may deny access to other transactions to prevent incorrect results.
Shared lock: If a transaction has a shared lock on a data item, it can read the item but not update it.
Exclusive lock: If a transaction has an exclusive lock on a data item, it can both read and update the item.
● Transaction uses locks to deny access to other transactions and so prevent incorrect updates.
● Most widely used approach to ensure serializability.
● Generally, a transaction must claim a shared (read) or exclusive (write) lock on a data item before read or write.
● Lock prevents another transaction from modifying item or even reading it, in the case of a write lock.
Locking - Basic Rules
● If transaction has shared lock on item, can read but not update item.
● If transaction has exclusive lock on item, can both read and update item.
● Reads cannot conflict, so more than one transaction can hold shared locks simultaneously on same item.
● Exclusive lock gives transaction exclusive access to that item.
● Some systems allow transaction to upgrade read lock to an exclusive lock, or downgrade exclusive lock to a shared lock.
● Some systems permit a transaction to issue a shared lock on an item and then later to upgrade the lock to an exclusive lock. This in effect allows a transaction to examine the data first and then decide whether it wishes to update it.
● If upgrading is not supported, a transaction must hold exclusive locks on all data items that it may update at some time during the execution of the transaction, thereby potentially reducing the level of concurrency in the system.
● For the same reason, some systems also permit a transaction to issue an exclusive lock and then later to downgrade the lock to a shared lock.
Example - Incorrect Locking Schedule
● For two transactions above, a valid schedule using these rules is:
S = {write_lock(T9, balx), read(T9, balx), write(T9, balx), unlock(T9, balx), write_lock(T10, balx), read(T10, balx), write(T10, balx), unlock(T10, balx), write_lock(T10, baly), read(T10, baly), write(T10, baly), unlock(T10, baly), commit(T10), write_lock(T9, baly), read(T9, baly), write(T9, baly), unlock(T9, baly), commit(T9)}
● If at start, balx = 100, baly = 400, result should be:
● balx = 220, baly = 330, if T9 executes before T10, or
● balx = 210, baly = 340, if T10 executes before T9.
● However, result gives balx = 220 and baly = 340.
● S is not a serializable schedule.
● Problem is that transactions release locks too soon, resulting in loss of total isolation and atomicity.
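The figures quoted for schedule S can be reproduced with a short simulation. The exact updates performed by T9 are not stated in these notes, so the sketch assumes that T9 adds 100 to balx and subtracts 100 from baly, which is the reading consistent with the quoted results; T10 increases each balance by 10%. Locks are not modelled, only the order in which the updates reach each item.

```python
def t9_update_x(value):
    return value + 100  # assumed T9 update on balx


def t9_update_y(value):
    return value - 100  # assumed T9 update on baly


def t10_update(value):
    return value * 110 // 100  # T10 increases a balance by 10%


def run(steps, balx=100, baly=400):
    """Apply (item, update) steps in order and return (balx, baly)."""
    db = {"balx": balx, "baly": baly}
    for item, update in steps:
        db[item] = update(db[item])
    return db["balx"], db["baly"]


serial_t9_first = run([
    ("balx", t9_update_x), ("baly", t9_update_y),
    ("balx", t10_update), ("baly", t10_update),
])
serial_t10_first = run([
    ("balx", t10_update), ("baly", t10_update),
    ("balx", t9_update_x), ("baly", t9_update_y),
])
# Order of updates in schedule S: T9 on balx, T10 on balx, T10 on baly,
# T9 on baly.
schedule_s = run([
    ("balx", t9_update_x), ("balx", t10_update),
    ("baly", t10_update), ("baly", t9_update_y),
])
```

Schedule S matches neither serial outcome: it sees T9 before T10 on balx but T10 before T9 on baly, which is possible only because each lock is released as soon as the item has been written.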
Cascading Rollback
● If every transaction in a schedule follows 2PL, schedule is serializable.
● However, problems can occur with interpretation of when locks can be released.
● To prevent this with 2PL, leave release of all locks until end of transaction.
Concurrency Control with Index Structures
● Could treat each page of index as a data item and apply 2PL.
● However, as indexes will be frequently accessed, particularly
higher levels, this may lead to high lock contention.
● Can make two observations about index traversal:
o Search path starts from root and moves down to leaf
nodes but search never moves back up tree. Thus, once a
lower-level node has been accessed, higher-level nodes in
that path will not be used again.
to ensure that the operation is atomic. For example, a latch would be obtained to write a page from the database buffers to disk, the page would then be written to disk, and the latch immediately unset. As the latch is simply to prevent conflict for this type of
transactions.
● Deadlock should be transparent to user, so DBMS should restart transaction(s).
● Three general techniques for handling deadlock:
Deadlock Prevention
● DBMS looks ahead to see if transaction would cause deadlock and never allows deadlock to occur.
● Could order transactions using transaction timestamps:
o Wait-Die - only an older transaction can wait for younger one, otherwise transaction is aborted (dies) and restarted with same timestamp.
o Wound-Wait - only a younger transaction can wait for an older one. If older transaction requests lock held by younger one, younger one is aborted (wounded).
Deadlock Detection
● DBMS allows deadlock to occur but recognizes it and breaks it.
● Usually handled by construction of wait-for graph (WFG) showing
Recovery from Deadlock Detection
o choice of deadlock victim;
o how far to roll a transaction back;
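The two timestamp-based prevention schemes described above differ only in what happens when a lock request conflicts with a lock already held. A minimal sketch, assuming that a smaller timestamp means an older transaction and using hypothetical function names and return strings:

```python
def wait_die(requester_ts, holder_ts):
    """Wait-Die: an older requester waits; a younger requester dies."""
    if requester_ts < holder_ts:
        return "requester waits"
    return "requester aborted (dies), restarts with same timestamp"


def wound_wait(requester_ts, holder_ts):
    """Wound-Wait: an older requester wounds the holder; a younger one waits."""
    if requester_ts < holder_ts:
        return "holder aborted (wounded)"
    return "requester waits"


# Timestamp 5 is older than timestamp 9.
older_requests = (wait_die(5, 9), wound_wait(5, 9))
younger_requests = (wait_die(9, 5), wound_wait(9, 5))
```

In both schemes waiting is allowed in one direction of age only, so a cycle of waiting transactions cannot form.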
Hierarchy of Granularity
● Could represent granularity of locks in a hierarchical structure.
● Root node represents entire database, level 1s represent files, etc.
● When node is locked, all its descendants are also locked.
● DBMS should check hierarchical path before granting lock.
● Intention lock could be used to lock all ancestors of a locked node.
Multiple-granularity locking
● To reduce the searching involved in locating locks on descendants,
the DBMS can use another specialized locking strategy called
multiple-granularity locking.
● This strategy uses a new type of lock called an intention lock.
● When any node is locked, an intention lock is placed on all the
ancestors of the node.
● Thus, if some descendant of File2 (in our example, Page2) is locked and a request is made for a lock on File2, the presence of an intention lock on File2 indicates that some descendant of that node is already locked.
● Intention locks may be either Shared (read) or eXclusive (write).
● An intention shared (IS) lock conflicts only with an exclusive lock;
an intention exclusive (IX) lock conflicts with both a shared and an
exclusive lock.
● In addition, a transaction can hold a shared and intention exclusive (SIX) lock that is logically equivalent to holding both a shared and an IX lock.
● A SIX lock conflicts with any lock that conflicts with either a
shared or IX lock; in other words, a SIX lock is compatible only
with an IS lock.
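The conflict rules above can be collected into a compatibility table. The sketch below derives the table from the stated conflicts; the names CONFLICTS, compatible and can_grant are hypothetical, and the entries for plain shared (S) and exclusive (X) locks follow the basic locking rules.

```python
MODES = ["IS", "IX", "S", "SIX", "X"]

# For each held mode, the requested modes that conflict with it.
CONFLICTS = {
    "IS": {"X"},                          # conflicts only with exclusive
    "IX": {"S", "SIX", "X"},              # conflicts with shared and exclusive
    "S": {"IX", "SIX", "X"},
    "SIX": {"IX", "S", "SIX", "X"},       # compatible only with IS
    "X": {"IS", "IX", "S", "SIX", "X"},
}


def compatible(held, requested):
    """True if requested can coexist with held on the same node."""
    return requested not in CONFLICTS[held]


def can_grant(requested, held_modes):
    """Grant only if requested is compatible with every lock already held."""
    return all(compatible(held, requested) for held in held_modes)


# Modes that can coexist with a SIX lock held by another transaction.
compatible_with_six = [m for m in MODES if compatible("SIX", m)]
```

A request for an exclusive lock on File2, for example, is refused while another transaction holds an IS lock on it, without any search through the descendants of File2.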
TOPIC 7: RECOVERY
Database Recovery
Process of restoring database to a correct state in the event of a failure.
● Need for Recovery Control
o Two types of storage: volatile (main memory) and nonvolatile.
o Volatile storage does not survive system crashes.
o Stable storage represents information that has been replicated in several nonvolatile storage media with independent failure modes.
Types of Failures
● System crashes, resulting in loss of main memory.
● Media failures, resulting in loss of parts of secondary storage.
● Application software errors.
● Natural physical disasters.
● Carelessness or unintentional destruction of data or facilities.
● Sabotage.
Transactions and Recovery
● Transactions represent basic unit of recovery.
● Recovery manager responsible for atomicity and durability.
● The database buffers occupy an area in main memory from which data is transferred to and from secondary storage. It is only once the buffers have been flushed to secondary storage that any update operations can be regarded as permanent.
● This flushing of the buffers to the database can be triggered by a specific command or automatically when the buffers become full. The explicit writing of the buffers to secondary storage is known as force-writing.
● If failure occurs between commit and database buffers being flushed to secondary storage then, to ensure durability, recovery manager has to redo (rollforward) transaction’s updates.
● If transaction had not committed at failure time, recovery manager has to undo (rollback) any effects of that transaction for atomicity.
● Partial undo - only one transaction has to be undone.
● Global undo - all transactions have to be undone.
● DBMS starts at time t0, but fails at time tf. Assume data for transactions T2 and T3 have been written to secondary storage.
● T1 and T6 have to be undone. In absence of any other information, recovery manager has to redo T2, T3, T4, and T5.
Buffer management
● The management of the database buffers plays an important role in the recovery process.
● The buffer manager is responsible for the efficient management of the database buffers that are used to transfer pages to and from secondary storage.
● This involves reading pages from disk into the buffers until the buffers become full and then using a replacement strategy to decide which buffer(s) to force-write to disk to make space for new pages.
● When a page is requested from disk, the buffer manager will check to see whether the page is already in one of the database buffers.
● If it is not, the buffer manager will:
o use the replacement strategy to choose a buffer for replacement (which we will call the replacement buffer) and increment its pinCount. The requested page is now pinned in the database buffer and cannot be written back to disk yet. The replacement strategy will not choose a buffer that has been pinned;
o if the dirty variable for the replacement buffer is set, it will write the buffer to disk;
o read the page from disk into the replacement buffer and reset the buffer’s dirty variable to zero.
● If the same page is requested again, the appropriate pinCount is incremented by 1.
● When the system informs the buffer manager that it has finished with the page, the appropriate pinCount is decremented by 1.
● At this point, the system will also inform the buffer manager if it has modified the page and the dirty variable is set accordingly.
● When a pinCount reaches zero, the page is unpinned and the page can be written back to disk if it has been modified.
● The following terminology is used in database recovery when pages are written back to disk:
o A steal policy allows the buffer manager to write a buffer to disk before a transaction commits (the buffer is unpinned). In other words, the buffer manager ‘steals’ a page from the transaction. The alternative policy is no-steal.
o A force policy ensures that all pages updated by a transaction are immediately written to disk when the transaction commits. The alternative policy is no-force.
Recovery Facilities
● DBMS should provide following facilities to assist with recovery:
o Backup mechanism, which makes periodic backup copies of database.
o Logging facilities, which keep track of current state of transactions and database changes.
o Checkpoint facility, which enables updates to database in progress to be made permanent.
o Recovery manager, which allows DBMS to restore database to consistent state following a failure.
Backup Mechanism:
● The DBMS should provide a mechanism to allow backup copies of the database and the log file to be made at regular intervals without necessarily having to stop the system first.
● The backup copy of the database can be used in the event that the database has been damaged or destroyed. A backup can be a complete copy of the entire database or an incremental backup, consisting only of modifications made since the last complete or incremental backup.
● Typically, the backup is stored on offline storage, such as magnetic tape.
Log File
To keep track of database transactions, the DBMS maintains a special file called a log (or journal) that contains information about all updates to the database.
The log may contain the following data:
o Transaction records.
o writing a checkpoint record to the log file. This record contains the identifiers of all transactions that are active at the time of the checkpoint.
o Checkpoint record is created containing identifiers of all active transactions.
o When failure occurs, redo all transactions that committed since the checkpoint and undo all transactions active at time of crash.
o In previous example, with checkpoint at time tc, changes made by T2 and T3 have been written to secondary storage.
o undo transactions T1 and T6.
Recovery Techniques
● If database has been damaged:
o Need to restore last backup copy of database and reapply updates of committed transactions using log file.
● If database is only inconsistent:
o Need to undo changes that caused inconsistency. May also need to redo some transactions to ensure updates reach secondary storage.
o Do not need backup, but can restore database using before- and after-images in the log file.
Main Recovery Techniques
● Three main recovery techniques:
– Deferred Update
– Immediate Update
– Shadow Paging
Deferred Update
● Updates are not written to the database until after a transaction has reached its commit point.
● May be necessary to redo updates of committed transactions as their effect may not have reached database.
Immediate Update
● Updates are applied to database as they occur.
● Need to redo updates of committed transactions following a
failure.
● May need to undo effects of transactions that had not committed at
time of failure.
● Essential that log records are written before write to database. This is the write-ahead log protocol.
● If no “transaction commit” record in log, then that transaction was
active at failure and must be undone.
● Undo operations are performed in reverse order in which they were
written to log.
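Recovery under immediate update can be sketched with a log that holds before- and after-images. The record layout, the function name recover and the sample log below are assumptions made for illustration. Transactions without a commit record are undone in reverse log order, then committed transactions are redone in forward order.

```python
def recover(db, log):
    """Restore db (a dict of item -> value) to a consistent state."""
    committed = {rec[1] for rec in log if rec[0] == "commit"}

    # Undo: no commit record, so restore before-images in reverse order.
    for rec in reversed(log):
        if rec[0] == "update" and rec[1] not in committed:
            _, _, item, before, _ = rec
            db[item] = before

    # Redo: committed updates may not have reached the database.
    for rec in log:
        if rec[0] == "update" and rec[1] in committed:
            _, _, item, _, after = rec
            db[item] = after
    return db


# Hypothetical log: ("update", txn, item, before-image, after-image).
log = [
    ("start", "T1"),
    ("update", "T1", "balx", 100, 150),
    ("commit", "T1"),
    ("start", "T2"),
    ("update", "T2", "baly", 400, 350),
    # failure occurs here: T2 has no commit record
]

# State on disk at failure: T1's update had not been flushed, while
# T2's uncommitted update had already been written.
recovered = recover({"balx": 100, "baly": 350}, log)
```

The example depends on the write-ahead rule: the before-image of baly is available only because the log record was written before the database page.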
Shadow Paging
● Maintain two page tables during life of a transaction: current page table and shadow page table.
● When transaction starts, two page tables are the same.
● Shadow page table is never changed thereafter and is used to
restore database in event of failure.
● During transaction, current page table records all updates to
database.
● When transaction completes, current page table becomes shadow
page table.
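A toy model of the two page tables is sketched below. The class ShadowPagedDB and its methods are hypothetical names invented for this illustration. A real system would also have to switch the page-table pointer atomically on disk and reclaim unused pages, which this sketch ignores.

```python
class ShadowPagedDB:
    """Toy shadow-paging model: page tables map logical pages to disk pages."""

    def __init__(self, pages):
        self.disk = dict(enumerate(pages))        # disk page id -> contents
        self.shadow = {i: i for i in self.disk}   # last committed page table
        self.current = None                       # exists only during a transaction
        self._next_free = len(self.disk)

    def begin(self):
        # At transaction start the current table is a copy of the shadow table.
        self.current = dict(self.shadow)

    def write(self, logical, data):
        # Copy-on-write: put the new contents on a fresh disk page and
        # repoint only the current table; the shadow table is untouched.
        new_id = self._next_free
        self._next_free += 1
        self.disk[new_id] = data
        self.current[logical] = new_id

    def read(self, logical):
        table = self.current if self.current is not None else self.shadow
        return self.disk[table[logical]]

    def commit(self):
        # The current page table becomes the new shadow page table.
        self.shadow = self.current
        self.current = None

    def crash(self):
        # Recovery: discard the current table; the shadow table still
        # points at the old, unmodified pages.
        self.current = None


db = ShadowPagedDB(["a", "b"])
db.begin()
db.write(0, "a2")
seen_in_txn = db.read(0)
db.crash()
after_crash = db.read(0)
db.begin()
db.write(1, "b2")
db.commit()
after_commit = db.read(1)
```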
TOPIC 8: DATABASE TUNING
• For the given set of tables, there may be alternative design
choices, all of which achieve 3NF or BCNF. One may be
replaced by the other.
• A relation of the form R(K, A, B, C, D, ...), with K as a
set of key attributes, that is in BCNF can be stored into
multiple tables that are also in BCNF. This is also called
vertical partitioning.
• Attribute(s) from one table may be repeated in another
even though this creates redundancy and a potential
anomaly.
• Just as vertical partitioning splits a table vertically into
multiple tables, horizontal partitioning takes horizontal
slices of a table and stores them as distinct tables.
Tuning Queries
● There are mainly two indications that suggest that query tuning
may be needed:
▪ A query issues too many disk accesses (for
example, an exact match query scans an entire
table).
▪ The query plan shows that relevant indexes are
not being used.
● Some typical instances of situations prompting query tuning
include the following:
● Many query optimizers do not use indexes in the presence of
arithmetic expressions, numerical comparisons of attributes of
different sizes and precision, NULL comparisons, and substring
comparisons.
● Indexes are often not used for nested queries using IN.
● Some DISTINCTs may be redundant and can be avoided without
changing the result. A DISTINCT often causes a sort operation and
must be avoided as far as possible.
● Unnecessary use of temporary result tables can be avoided by
collapsing multiple queries into a single query unless the
temporary relation is needed for some intermediate processing.
● In some situations involving use of correlated queries, temporaries
are useful.
● If multiple options for join condition are possible, choose one that
uses a clustering index and avoid those that contain string
comparisons.
● One idiosyncrasy with query optimizers is that the order of tables
in the FROM clause may affect the join processing. If that is the
case, one may have to switch this order so that the smaller of the
two relations is scanned and the larger relation is used with an
appropriate index.
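The point about arithmetic expressions can be observed with SQLite through Python's sqlite3 module. SQLite is only a convenient stand-in here, and other optimizers may behave differently. The table emp, the index idx_salary, and the data are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (id INTEGER PRIMARY KEY, name TEXT, salary INTEGER)")
conn.execute("CREATE INDEX idx_salary ON emp (salary)")
conn.executemany(
    "INSERT INTO emp (name, salary) VALUES (?, ?)",
    [(f"e{i}", 1000 + i) for i in range(100)],
)


def plan(sql):
    """Return the detail text of SQLite's query plan for the given statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(row[3] for row in rows)  # 4th column holds the description


plain_sql = "SELECT * FROM emp WHERE salary = 1050"
arith_sql = "SELECT * FROM emp WHERE salary + 0 = 1050"  # same rows, arithmetic on the column

plain_plan = plan(plain_sql)   # expected to mention idx_salary
arith_plan = plan(arith_sql)   # expected to fall back to a table scan

print(plain_plan)
print(arith_plan)
same_rows = conn.execute(plain_sql).fetchall() == conn.execute(arith_sql).fetchall()
```

Both queries return the same rows, but only the first exposes the bare column to the optimizer, so only it can use the index on salary.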
● Some query optimizers perform worse on nested queries compared
to their equivalent unnested counterparts. There are four types of
nested queries:
• Uncorrelated subqueries with aggregates in inner query.
• Uncorrelated subqueries without aggregates.
• Correlated subqueries with aggregates in inner query.
• Correlated subqueries without aggregates.
● Out of the above four types, the first one typically presents no
problem, since most query optimizers evaluate the inner query
once.
● However, for a query of the second type, most query optimizers
may not use an index.
● The same optimizers may do so if the query is written as an
unnested query.
● Transformation of correlated subqueries may involve setting up
temporary tables.
● Finally, many applications are based on views that define the data
of interest to those applications. Sometimes, these views become
overkill, because a query may be posed directly against a base
table, rather than going through a view that is defined by a join.
Additional Query Tuning Guidelines
● Additional techniques for improving queries apply in certain
situations:
● A query with multiple selection conditions that are connected via
OR may not prompt the query optimizer to use any index.
Such a query may be split up and expressed as a union of queries,
each with a condition on an attribute that causes an index to be
used.
● To help in expediting a query, the following transformations may
be tried:
• NOT condition may be transformed into a positive
expression.
• Embedded SELECT blocks using IN, = ALL, = ANY, and
= SOME may be replaced by joins.
• If an equality join is set up between two tables, the range
predicate (selection condition) on the joining attribute set
up in one table may be repeated for the other table.
● WHERE conditions may be rewritten to utilize the indexes on
multiple columns.
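Two of the rewrites above, OR into UNION and IN into a join, are illustrated below with SQLite via Python's sqlite3 module. The schema, names, and rows are hypothetical. The sketch only checks that each rewritten query returns the same rows, and whether the rewrite is faster depends on the optimizer in use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dept (dno INTEGER PRIMARY KEY, dname TEXT, city TEXT);
CREATE TABLE emp (eno INTEGER PRIMARY KEY, ename TEXT, age INTEGER,
                  salary INTEGER, dno INTEGER REFERENCES dept(dno));
CREATE INDEX idx_age ON emp (age);
CREATE INDEX idx_salary ON emp (salary);
INSERT INTO dept VALUES (1, 'Research', 'Chennai'), (2, 'Sales', 'Mumbai');
INSERT INTO emp VALUES (1, 'Anu', 45, 30000, 1), (2, 'Bala', 30, 60000, 2),
                       (3, 'Chitra', 45, 60000, 1), (4, 'Dev', 28, 20000, 2);
""")


def rows(sql):
    """Run a query and return its rows in a deterministic order."""
    return sorted(conn.execute(sql).fetchall())


# OR across two differently indexed columns, rewritten as a UNION of two
# single-condition queries. Selecting the key eno keeps rows distinct, so
# UNION's duplicate removal does not change the result.
or_query = "SELECT eno, ename FROM emp WHERE age = 45 OR salary = 60000"
union_query = """
SELECT eno, ename FROM emp WHERE age = 45
UNION
SELECT eno, ename FROM emp WHERE salary = 60000
"""

# Nested IN block rewritten as a join. dept.dno is a key, so the join
# produces at most one match per employee and adds no duplicates.
in_query = """
SELECT ename FROM emp
WHERE dno IN (SELECT dno FROM dept WHERE city = 'Chennai')
"""
join_query = """
SELECT e.ename FROM emp e JOIN dept d ON e.dno = d.dno
WHERE d.city = 'Chennai'
"""

print(rows(or_query), rows(union_query))
print(rows(in_query), rows(join_query))
```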
– We cannot replace any dependency A → B in X with
dependency C → B, where C is a proper subset of A, and
still have a set of dependencies that is equivalent to X.
– We cannot remove any dependency from X and still have
a set of dependencies that is equivalent to X.
Boyce–Codd Normal Form (BCNF)
● Based on functional dependencies that take into account all
candidate keys in a relation; however, BCNF also has additional
constraints compared with the general definition of 3NF.
● Boyce–Codd normal form (BCNF)
– A relation is in BCNF if and only if every determinant is a
candidate key.
● Difference between 3NF and BCNF is that for a functional
dependency A → B, 3NF allows this dependency in a relation if B
is a primary-key attribute and A is not a candidate key, whereas
BCNF insists that for this dependency to remain in a relation, A
must be a candidate key.
● Every relation in BCNF is also in 3NF. However, a relation in 3NF
is not necessarily in BCNF.
● Violation of BCNF is quite rare.
● The potential to violate BCNF may occur in a relation that:
– contains two (or more) composite candidate keys;
– has candidate keys that overlap, that is, have at least one
attribute in common.
Review of Normalization (UNF to BCNF)
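The BCNF condition can be checked mechanically with attribute closures. The sketch below uses a made-up relation R(A, B, C) with dependencies AB → C and C → B, whose candidate keys AB and AC are composite and overlap on A. The function names are illustrative. The test used is the common formulation that the determinant of every nontrivial dependency must determine all attributes of the relation.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) tuples."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def bcnf_violations(relation, fds):
    """Nontrivial FDs whose determinant does not determine every attribute."""
    return [
        (lhs, rhs)
        for lhs, rhs in fds
        if not set(rhs) <= set(lhs)                        # skip trivial FDs
        and not set(relation) <= closure(lhs, fds)         # determinant is not a key
    ]


R = ("A", "B", "C")
fds = [(("A", "B"), ("C",)), (("C",), ("B",))]

violations = bcnf_violations(R, fds)
print(violations)
```

Here C → B is reported because C alone does not determine A. The relation still satisfies 3NF, since B is part of the candidate key AB.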
Fourth Normal Form (4NF)
● Although BCNF removes anomalies due to functional
dependencies, another type of dependency called a
multi-valued dependency (MVD) can also cause data
redundancy.
● Possible existence of multi-valued dependencies in a relation
is due to 1NF and can result in data redundancy.
● Multi-valued Dependency (MVD)
o Dependency between attributes (for example, A, B,
and C) in a relation, such that for each value of A
there is a set of values for B and a set of values for C.
However, the set of values for B and C are
independent of each other.
● An MVD between attributes A, B, and C in a relation is
represented using the following notation:
A −>> B
A −>> C
● A multi-valued dependency can be further defined as being trivial
or nontrivial.
o A MVD A −>> B in relation R is defined as being
trivial if (a) B is a subset of A or (b) A ∪ B = R.
o A MVD is defined as being nontrivial if neither (a) nor
(b) are satisfied.
o A trivial MVD does not specify a constraint on a relation,
while a nontrivial MVD does specify a constraint.
● Fourth normal form (4NF) is defined as a relation that is in
Boyce-Codd Normal Form and contains no nontrivial multi-valued
dependencies.
4NF – Example
Fifth Normal Form (5NF)
● A relation decomposed into two relations must have the
lossless-join property, which ensures that no spurious tuples are
generated when the relations are reunited through a natural join.
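The link between multi-valued dependencies and the lossless-join property can be sketched with a relation R(A, B, C) held as a set of tuples. Projecting onto (A, B) and (A, C) and rejoining on A returns exactly the original tuples when A −>> B holds, and spurious tuples appear when it does not. The branch, staff, and owner values below are invented for the example.

```python
def decompose_and_rejoin(rows):
    """Project (a, b, c) tuples onto (A, B) and (A, C), then natural-join on A."""
    ab = {(a, b) for a, b, _ in rows}
    ac = {(a, c) for a, _, c in rows}
    return {(a, b, c) for a, b in ab for a2, c in ac if a == a2}


def is_lossless(rows):
    """True if the rejoined projections reproduce the relation exactly."""
    return decompose_and_rejoin(rows) == set(rows)


# Branch B003 has staff {Ann, David} and owners {Carol, Tina}. All
# combinations are present, so the two sets are independent of each other.
full = {
    ("B003", "Ann", "Carol"), ("B003", "Ann", "Tina"),
    ("B003", "David", "Carol"), ("B003", "David", "Tina"),
}
# Dropping one combination breaks the independence between staff and owner.
partial = full - {("B003", "David", "Tina")}

spurious = decompose_and_rejoin(partial) - partial
print(is_lossless(full), is_lossless(partial), spurious)
```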