

Subject: Database Management System

UNIT-IV

Database Design and Query Processing


Combine Schemas?
• Suppose we combine instructor and department into inst_dept
• (No connection to relationship set inst_dept)
• Result is possible repetition of information: the building and budget
of a department are repeated once per instructor in that department
A Combined Schema Without Repetition
• Consider combining relations
• sec_class(sec_id, building, room_number) and
• section(course_id, sec_id, semester, year)
into one relation
• section(course_id, sec_id, semester, year,
building, room_number)
• No repetition in this case
What About Smaller Schemas?
• Suppose we had started with inst_dept. How would we know to split up
(decompose) it into instructor and department?
• Write a rule “if there were a schema (dept_name, building, budget),
then dept_name would be a candidate key”
• Denote as a functional dependency:
dept_name → building, budget
• In inst_dept, because dept_name is not a candidate key, the building and
budget of a department may have to be repeated.
• This indicates the need to decompose inst_dept
• Not all decompositions are good. Suppose we decompose
employee(ID, name, street, city, salary) into
employee1 (ID, name)
employee2 (name, street, city, salary)
A Lossy Decomposition
• (Figure omitted: joining employee1 and employee2 on name can pair
tuples of different employees who share a name, producing spurious
tuples, so the decomposition is lossy.)
Example of Lossless-Join Decomposition
• Lossless-join decomposition
• Decomposition of R = (A, B, C), with B → C, into
      R1 = (A, B)    R2 = (B, C)

      r:             ΠA,B(r):      ΠB,C(r):
      A  B  C        A  B          B  C
      α  1  A        α  1          1  A
      β  2  B        β  2          2  B

      ΠA,B(r) ⋈ ΠB,C(r):
      A  B  C
      α  1  A
      β  2  B
• Joining the projections reproduces r exactly: no information is lost
First Normal Form
• Domain is atomic if its elements are considered to be indivisible
units
• Examples of non-atomic domains:
• Set of names, composite attributes
• Identification numbers like CS101 that can be broken up into
parts
• A relational schema R is in first normal form if the domains of all
attributes of R are atomic
• Non-atomic values complicate storage and encourage redundant
(repeated) storage of data
• Example: Set of accounts stored with each customer, and set of
owners stored with each account
• We assume all relations are in first normal form
First Normal Form (Cont’d)
• Atomicity is actually a property of how the elements of
the domain are used.
• Example: Strings would normally be considered indivisible
• Suppose that students are given roll numbers which are strings
of the form CS0012 or EE1127
• If the first two characters are extracted to find the department,
the domain of roll numbers is not atomic.
• Doing so is a bad idea: leads to encoding of information in
application program rather than in the database.
Goal — Devise a Theory for the Following
• Decide whether a particular relation R is in “good” form.
• In the case that a relation R is not in “good” form,
decompose it into a set of relations {R1, R2, ..., Rn}
such that
• each relation is in good form
• the decomposition is a lossless-join decomposition
• Our theory is based on:
• functional dependencies
• multivalued dependencies
Functional Dependencies
• Constraints on the set of legal relations.
• Require that the value for a certain set of attributes
determines uniquely the value for another set of
attributes.
• A functional dependency is a generalization of the
notion of a key.
Functional Dependencies (Cont.)
• Let R be a relation schema
      α ⊆ R and β ⊆ R
• The functional dependency
      α → β
holds on R if and only if for any legal relation r(R), whenever any two
tuples t1 and t2 of r agree on the attributes α, they also agree on the
attributes β. That is,
      t1[α] = t2[α]  ⇒  t1[β] = t2[β]
• Example: Consider r(A, B) with the following instance of r:

      A   B
      1   4
      1   5
      3   7

• On this instance, A → B does NOT hold, but B → A does hold.


Functional Dependencies (Cont.)
• K is a superkey for relation schema R if and only if K → R
• K is a candidate key for R if and only if
• K → R, and
• for no α ⊂ K, α → R
• Functional dependencies allow us to express constraints that cannot
be expressed using superkeys. Consider the schema:
inst_dept (ID, name, salary, dept_name, building, budget).
We expect these functional dependencies to hold:
dept_name → building
and ID → building
but would not expect the following to hold:
dept_name → salary
Use of Functional Dependencies
• We use functional dependencies to:
• test relations to see if they are legal under a given set of
functional dependencies.
• If a relation r is legal under a set F of functional
dependencies, we say that r satisfies F.
• specify constraints on the set of legal relations
• We say that F holds on R if all legal relations on R satisfy
the set of functional dependencies F.
• Note: A specific instance of a relation schema may satisfy a
functional dependency even if the functional dependency
does not hold on all legal instances.
• For example, a specific instance of instructor may, by
chance, satisfy
name → ID.
Functional Dependencies (Cont.)

• A functional dependency is trivial if it is satisfied by all
instances of a relation
• Example:
• ID, name → ID
• name → name
• In general, α → β is trivial if β ⊆ α
Closure of a Set of Functional Dependencies
• Given a set F of functional dependencies, there are
certain other functional dependencies that are logically
implied by F.
• For example: If A → B and B → C, then we can infer
that A → C
• The set of all functional dependencies logically implied
by F is the closure of F.
• We denote the closure of F by F+.
• F+ is a superset of F.
Boyce-Codd Normal Form
A relation schema R is in BCNF with respect to a set F of functional
dependencies if for all functional dependencies in F+ of the form

      α → β

where α ⊆ R and β ⊆ R, at least one of the following holds:

• α → β is trivial (i.e., β ⊆ α)
• α is a superkey for R

Example schema not in BCNF:

      instr_dept (ID, name, salary, dept_name, building, budget)

because dept_name → building, budget holds on instr_dept,
but dept_name is not a superkey
Decomposing a Schema into BCNF
• Suppose we have a schema R and a non-trivial dependency
α → β causes a violation of BCNF.
We decompose R into:
• (α ∪ β)
• (R − (β − α))
• In our example,
• α = dept_name
• β = building, budget
and inst_dept is replaced by
• (α ∪ β) = (dept_name, building, budget)
• (R − (β − α)) = (ID, name, salary, dept_name)
• (A code sketch of this one-step decomposition follows below.)
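The decomposition rule is plain set arithmetic. A minimal Python sketch
(illustrative, not from the slides; schemas are modeled as sets of
attribute names):

def bcnf_step(R, alpha, beta):
    # split R on a violating FD alpha -> beta into
    # (alpha ∪ beta) and (R − (beta − alpha))
    R, alpha, beta = set(R), set(alpha), set(beta)
    return alpha | beta, R - (beta - alpha)

inst_dept = {"ID", "name", "salary", "dept_name", "building", "budget"}
r1, r2 = bcnf_step(inst_dept, {"dept_name"}, {"building", "budget"})
print(sorted(r1))   # ['budget', 'building', 'dept_name']
print(sorted(r2))   # ['ID', 'dept_name', 'name', 'salary']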
BCNF and Dependency Preservation
• Constraints, including functional dependencies, are costly to
check in practice unless they pertain to only one relation
• If it is sufficient to test only those dependencies on each
individual relation of a decomposition in order to ensure that all
functional dependencies hold, then that decomposition is
dependency preserving.
• Because it is not always possible to achieve both BCNF and
dependency preservation, we consider a weaker normal form,
known as third normal form.
Third Normal Form
• A relation schema R is in third normal form (3NF) if for all:
      α → β in F+
at least one of the following holds:
• α → β is trivial (i.e., β ⊆ α)
• α is a superkey for R
• Each attribute A in β − α is contained in a candidate key for R.
(NOTE: each attribute may be in a different candidate key)
• If a relation is in BCNF it is in 3NF (since in BCNF one of the first
two conditions above must hold).
• Third condition is a minimal relaxation of BCNF to ensure
dependency preservation.
Goals of Normalization
• Let R be a relation scheme with a set F of functional dependencies.
• Decide whether a relation scheme R is in “good” form.
• In the case that a relation scheme R is not in “good” form,
decompose it into a set of relation schemes {R1, R2, ..., Rn} such
that
• each relation scheme is in good form
• the decomposition is a lossless-join decomposition
• Preferably, the decomposition should be dependency
preserving.
Functional-Dependency Theory

• We now consider the formal theory that tells us which functional
dependencies are implied logically by a given set of functional
dependencies.
• We then develop algorithms to generate lossless
decompositions into BCNF and 3NF
• We then develop algorithms to test if a decomposition is
dependency-preserving
Closure of a Set of Functional Dependencies
• Given a set F of functional dependencies, there are certain
other functional dependencies that are logically implied by F.
• For example: If A → B and B → C, then we can infer that A → C
• The set of all functional dependencies logically implied by F is
the closure of F.
• We denote the closure of F by F+.
Closure of a Set of Functional Dependencies
• We can find F+, the closure of F, by repeatedly applying
Armstrong’s Axioms:
• if β ⊆ α, then α → β (reflexivity)
• if α → β, then γα → γβ (augmentation)
• if α → β, and β → γ, then α → γ (transitivity)
• These rules are
• sound (generate only functional dependencies that actually
hold), and
• complete (generate all functional dependencies that hold).
Example
• R = (A, B, C, G, H, I)
F = { A → B
      A → C
      CG → H
      CG → I
      B → H }
• some members of F+
• A → H
• by transitivity from A → B and B → H
• AG → I
• by augmenting A → C with G, to get AG → CG,
and then transitivity with CG → I
• CG → HI
• by augmenting CG → I to infer CG → CGI,
and augmenting CG → H to infer CGI → HI,
and then transitivity
Procedure for Computing F+
• To compute the closure of a set of functional dependencies F:

F+ = F
repeat
      for each functional dependency f in F+
            apply reflexivity and augmentation rules on f
            add the resulting functional dependencies to F+
      for each pair of functional dependencies f1 and f2 in F+
            if f1 and f2 can be combined using transitivity
                  then add the resulting functional dependency to F+
until F+ does not change any further

NOTE: We shall see an alternative procedure for this task later


Closure of Functional Dependencies (Cont.)
• Additional rules:
• If α → β holds and α → γ holds, then α → βγ holds (union)
• If α → βγ holds, then α → β holds and α → γ holds
(decomposition)
• If α → β holds and γβ → δ holds, then αγ → δ holds
(pseudotransitivity)
The above rules can be inferred from Armstrong’s axioms.
Closure of Attribute Sets
• Given a set of attributes α, define the closure of α under F
(denoted by α+) as the set of attributes that are functionally
determined by α under F

• Algorithm to compute α+, the closure of α under F
(a Python sketch follows this slide):

result := α;
while (changes to result) do
      for each β → γ in F do
      begin
            if β ⊆ result then result := result ∪ γ
      end
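A direct Python rendering of this algorithm (an illustrative sketch;
F is a list of (lhs, rhs) pairs of attribute sets):

def attribute_closure(alpha, F):
    # compute alpha+ : all attributes functionally determined by alpha
    result = set(alpha)
    changed = True
    while changed:                  # repeat until a full pass adds nothing
        changed = False
        for lhs, rhs in F:
            # if result covers the left side, it determines the right side
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result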
Example of Attribute Set Closure
• R = (A, B, C, G, H, I)
• F = { A → B
        A → C
        CG → H
        CG → I
        B → H }
• (AG)+
1. result = AG
2. result = ABCG (A → C and A → B)
3. result = ABCGH (CG → H and CG ⊆ AGBC)
4. result = ABCGHI (CG → I and CG ⊆ AGBCH)
• Is AG a candidate key?
1. Is AG a superkey?
1. Does AG → R? == Is (AG)+ ⊇ R?
2. Is any subset of AG a superkey?
1. Does A → R? == Is (A)+ ⊇ R?
2. Does G → R? == Is (G)+ ⊇ R?
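Running the attribute_closure sketch from above on this example
(illustrative; attributes are single-character strings):

R = set("ABCGHI")
F = [(set("A"), set("B")), (set("A"), set("C")),
     (set("CG"), set("H")), (set("CG"), set("I")),
     (set("B"), set("H"))]

print(attribute_closure(set("AG"), F) == R)   # True: AG is a superkey
print(attribute_closure(set("A"), F) >= R)    # False: A alone is not
print(attribute_closure(set("G"), F) >= R)    # False: G alone is not
# No proper subset of AG is a superkey, so AG is a candidate key.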
Uses of Attribute Closure
There are several uses of the attribute closure algorithm:
• Testing for superkey:
• To test if α is a superkey, we compute α+, and check if α+
contains all attributes of R.
• Testing functional dependencies
• To check if a functional dependency α → β holds (or, in other
words, is in F+), just check if β ⊆ α+.
• That is, we compute α+ by using attribute closure, and then
check if it contains β.
• Is a simple and cheap test, and very useful
• Computing closure of F
• For each γ ⊆ R, we find the closure γ+, and for each S ⊆ γ+, we
output a functional dependency γ → S.
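The last use can be sketched with itertools (illustrative only; this
enumeration is exponential in the number of attributes, so it is
practical only for small schemas):

from itertools import combinations

def fd_closure(R, F):
    # emit gamma -> S for every non-empty gamma ⊆ R and S ⊆ gamma+
    Fplus = set()
    attrs = sorted(R)
    for k in range(1, len(attrs) + 1):
        for gamma in combinations(attrs, k):
            plus = sorted(attribute_closure(set(gamma), F))
            for j in range(1, len(plus) + 1):
                for S in combinations(plus, j):
                    Fplus.add((frozenset(gamma), frozenset(S)))
    return Fplus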
Canonical Cover
• Sets of functional dependencies may have redundant
dependencies that can be inferred from the others
• For example: A → C is redundant in: {A → B, B → C, A → C}
• Parts of a functional dependency may be redundant
• E.g.: on RHS: {A → B, B → C, A → CD} can be simplified to
{A → B, B → C, A → D}
• E.g.: on LHS: {A → B, B → C, AC → D} can be simplified to
{A → B, B → C, A → D}
• Intuitively, a canonical cover of F is a “minimal” set of functional
dependencies equivalent to F, having no redundant dependencies
or redundant parts of dependencies
Extraneous Attributes
• Consider a set F of functional dependencies and the functional
dependency α → β in F.
• Attribute A is extraneous in α if A ∈ α
and F logically implies (F − {α → β}) ∪ {(α − A) → β}.
• Attribute A is extraneous in β if A ∈ β
and the set of functional dependencies
(F − {α → β}) ∪ {α → (β − A)} logically implies F.
• Note: implication in the opposite direction is trivial in each of the
cases above, since a “stronger” functional dependency always
implies a weaker one
• Example: Given F = {A → C, AB → C}
• B is extraneous in AB → C because {A → C, AB → C} logically
implies A → C (i.e., the result of dropping B from AB → C).
• Example: Given F = {A → C, AB → CD}
• C is extraneous in AB → CD since AB → C can be inferred
even after deleting C
Testing if an Attribute is Extraneous
• Consider a set F of functional dependencies and the functional
dependency α → β in F.
• To test if attribute A ∈ α is extraneous in α:
1. compute (α − A)+ using the dependencies in F
2. check that (α − A)+ contains β; if it does, A is extraneous in α
• To test if attribute A ∈ β is extraneous in β:
1. compute α+ using only the dependencies in
      F’ = (F − {α → β}) ∪ {α → (β − A)}
2. check that α+ contains A; if it does, A is extraneous in β
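Both tests reduce to attribute-closure computations; a sketch
(illustrative, reusing attribute_closure from above):

def extraneous_in_lhs(A, alpha, beta, F):
    # A ∈ alpha is extraneous iff (alpha − A)+ under F still contains beta
    return set(beta) <= attribute_closure(set(alpha) - {A}, F)

def extraneous_in_rhs(A, alpha, beta, F):
    # A ∈ beta is extraneous iff alpha+ under the weakened F' contains A
    Fprime = [(l, r) for l, r in F
              if (set(l), set(r)) != (set(alpha), set(beta))]
    Fprime.append((set(alpha), set(beta) - {A}))
    return A in attribute_closure(set(alpha), Fprime)

F = [(set("A"), set("C")), (set("AB"), set("C"))]
print(extraneous_in_lhs("B", set("AB"), set("C"), F))   # True, as above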
Canonical Cover
• A canonical cover for F is a set of dependencies Fc such that
• F logically implies all dependencies in Fc, and
• Fc logically implies all dependencies in F, and
• No functional dependency in Fc contains an extraneous attribute,
and
• The left side of each functional dependency in Fc is unique.
• To compute a canonical cover for F:
repeat
      Use the union rule to replace any dependencies in F
            α1 → β1 and α1 → β2 with α1 → β1β2
      Find a functional dependency α → β with an
            extraneous attribute either in α or in β
      /* Note: test for extraneous attributes done using Fc,
            not F */
      If an extraneous attribute is found, delete it from α → β
until F does not change
Computing a Canonical Cover
• R = (A, B, C)
F = {A → BC, B → C, A → B, AB → C}
• Combine A → BC and A → B into A → BC
• Set is now {A → BC, B → C, AB → C}
• A is extraneous in AB → C
• Check if the result of deleting A from AB → C is implied by the
other dependencies
• Yes: in fact, B → C is already present!
• Set is now {A → BC, B → C}
• C is extraneous in A → BC
• Check if A → C is logically implied by A → B and the other
dependencies
• Yes: using transitivity on A → B and B → C.
• Can use attribute closure of A in more complex cases
• The canonical cover is: A → B
                          B → C
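A sketch of the full procedure (illustrative; builds on the helper
functions above, and assumes single-character attribute names in the
usage example):

def canonical_cover(F):
    # F: list of (lhs, rhs) attribute-set pairs
    F = [(frozenset(l), frozenset(r)) for l, r in F]
    while True:
        # union rule: merge dependencies sharing a left side
        merged = {}
        for l, r in F:
            merged[l] = merged.get(l, frozenset()) | r
        F = list(merged.items())
        # find and delete one extraneous attribute, then start over
        found = False
        for i, (l, r) in enumerate(F):
            for A in l:
                if len(l) > 1 and extraneous_in_lhs(A, l, r, F):
                    F[i] = (l - {A}, r); found = True; break
            if not found:
                for A in r:
                    if extraneous_in_rhs(A, l, r, F):
                        F[i] = (l, r - {A}); found = True; break
            if found:
                break
        if not found:
            return F

Fc = canonical_cover([("A", "BC"), ("B", "C"), ("A", "B"), ("AB", "C")])
# Fc is [A → B, B → C], matching the worked example above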
Lossless-join Decomposition
• For the case of R = (R1, R2), we require that for all possible
relations r on schema R
      r = ΠR1(r) ⋈ ΠR2(r)
• A decomposition of R into R1 and R2 is lossless join if at least one
of the following dependencies is in F+:
• R1 ∩ R2 → R1
• R1 ∩ R2 → R2
• The above functional dependencies are a sufficient condition for
lossless-join decomposition; the dependencies are a necessary
condition only if all constraints are functional dependencies
Example
• R = (A, B, C)
F = {A → B, B → C}
• Can be decomposed in two different ways
• R1 = (A, B), R2 = (B, C)
• Lossless-join decomposition:
      R1 ∩ R2 = {B} and B → BC
• Dependency preserving
• R1 = (A, B), R2 = (A, C)
• Lossless-join decomposition:
      R1 ∩ R2 = {A} and A → AB
• Not dependency preserving
(cannot check B → C without computing R1 ⋈ R2)
Dependency Preservation
• Let Fi be the set of dependencies in F+ that include only
attributes in Ri.
• A decomposition is dependency preserving if
      (F1 ∪ F2 ∪ … ∪ Fn)+ = F+
• If it is not, then checking updates for violation of
functional dependencies may require computing
joins, which is expensive.
Testing for Dependency Preservation
• To check if a dependency α → β is preserved in a decomposition
of R into R1, R2, …, Rn, we apply the following test (with attribute
closure done with respect to F); a Python sketch follows this slide:
• result = α
while (changes to result) do
      for each Ri in the decomposition
            t = (result ∩ Ri)+ ∩ Ri
            result = result ∪ t
• If result contains all attributes in β, then the functional dependency
α → β is preserved.
• We apply the test on all dependencies in F to check if a
decomposition is dependency preserving.
• This procedure takes polynomial time, instead of the exponential
time required to compute F+ and (F1 ∪ F2 ∪ … ∪ Fn)+
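A sketch of this test (illustrative; reuses attribute_closure, with
closures computed with respect to the full F):

def fd_preserved(alpha, beta, decomposition, F):
    # decomposition: list of attribute sets R1, ..., Rn
    result = set(alpha)
    changed = True
    while changed:
        changed = False
        for Ri in decomposition:
            # grow result using only attributes visible within Ri
            t = attribute_closure(result & set(Ri), F) & set(Ri)
            if not t <= result:
                result |= t
                changed = True
    return set(beta) <= result

# The earlier example: R1 = (A, B), R2 = (A, C) fails to preserve
# B → C, while R1 = (A, B), R2 = (B, C) preserves it.
F = [(set("A"), set("B")), (set("B"), set("C"))]
print(fd_preserved(set("B"), set("C"), [set("AB"), set("AC")], F))  # False
print(fd_preserved(set("B"), set("C"), [set("AB"), set("BC")], F))  # True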
Example
• R = (A, B, C)
F = {A → B
      B → C}
Key = {A}
• R is not in BCNF
• Decomposition R1 = (A, B), R2 = (B, C)
• R1 and R2 in BCNF
• Lossless-join decomposition
• Dependency preserving
Basic Steps in Query Processing
1. Parsing and translation
2. Optimization
3. Evaluation
Basic Steps in Query Processing (Cont.)
• Parsing and translation
• translate the query into its internal form. This is then
translated into relational algebra.
• Parser checks syntax, verifies relations
• Evaluation
• The query-execution engine takes a query-evaluation plan,
executes that plan, and returns the answers to the query.
Basic Steps in Query Processing: Optimization
• A relational algebra expression may have many equivalent
expressions
• E.g., σsalary<75000(Πsalary(instructor)) is equivalent to
Πsalary(σsalary<75000(instructor))
• Each relational algebra operation can be evaluated using one of
several different algorithms
• Correspondingly, a relational-algebra expression can be evaluated
in many ways.
• Annotated expression specifying detailed evaluation strategy is
called an evaluation plan.
• E.g., can use an index on salary to find instructors with salary <
75000,
• or can perform complete relation scan and discard instructors
with salary ≥ 75000
Basic Steps: Optimization
• Query Optimization: Amongst all equivalent evaluation plans, choose
the one with lowest cost.
• Cost is estimated using statistical information from the
database catalog
• e.g. number of tuples in each relation, size of tuples, etc.
Measures of Query Cost
• Cost is generally measured as total elapsed time for
answering query
• Many factors contribute to time cost
• disk accesses, CPU, or even network communication
• Typically disk access is the major cost, and is also relatively
easy to estimate. Measured by taking into account
• Number of seeks * average-seek-cost
• Number of blocks read * average-block-read-cost
• Number of blocks written * average-block-write-cost
• Cost to write a block is greater than cost to read a block
• data is read back after being written to ensure that the
write was successful
Measures of Query Cost
• For simplicity we just use the number of block transfers from disk
and the number of seeks as the cost measures
• tT – time to transfer one block
• tS – time for one seek
• Cost for b block transfers plus S seeks:
      b * tT + S * tS    (a worked example follows this slide)
• We ignore CPU costs for simplicity
• Real systems do take CPU cost into account
• We do not include cost to writing output to disk in our cost
formulae
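For illustration (timing values assumed, not from the slides): with
tS = 4 ms and tT = 0.1 ms, reading b = 100 blocks sequentially after a
single seek costs 1 * 4 + 100 * 0.1 = 14 ms, whereas reading the same
100 blocks with a separate seek per block costs
100 * 4 + 100 * 0.1 = 410 ms. Reducing seeks typically matters far more
than reducing block transfers.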
Measures of Query Cost
• Several algorithms can reduce disk IO by using extra buffer space
• Amount of real memory available to buffer depends on other
concurrent queries and OS processes, known only during
execution
• We often use worst case estimates, assuming only the
minimum amount of memory needed for the operation is
available
• Required data may be buffer resident already, avoiding disk I/O
• But hard to take into account for cost estimation
Selection Operation
• File scan
• Algorithm A1 (linear search). Scan each file block and test all
records to see whether they satisfy the selection condition.
• Cost estimate = br block transfers + 1 seek
• br denotes number of blocks containing records from relation r
• If selection is on a key attribute, can stop on finding record
• cost = (br /2) block transfers + 1 seek
• Linear search can be applied regardless of
• selection condition or
• ordering of records in the file, or
• availability of indices
Selections Using Indices
• Index scan – search algorithms that use an index
• selection condition must be on search-key of index.
• A2 (primary index, equality on key). Retrieve a single record
that satisfies the corresponding equality condition
• Cost = (hi + 1) * (tT + tS)
• A3 (primary index, equality on nonkey) Retrieve multiple
records.
• Records will be on consecutive blocks
• Let b = number of blocks containing matching records
• Cost = hi * (tT + tS) + tS + tT * b
Selections Using Indices
• A4 (secondary index, equality on nonkey).
• Retrieve a single record if the search-key is a candidate key
• Cost = (hi + 1) * (tT + tS)
• Retrieve multiple records if search-key is not a candidate key
• each of n matching records may be on a different block
• Cost = (hi + n) * (tT + tS)
• Can be very expensive!
Selections Involving Comparisons
• Can implement selections of the form σA≤V(r) or σA≥V(r) by using
• a linear file scan,
• or by using indices in the following ways:
• A5 (primary index, comparison). (Relation is sorted on A)
• For σA≥V(r) use index to find first tuple ≥ v and scan relation
sequentially from there
• For σA≤V(r) just scan relation sequentially till first tuple > v; do
not use index
• A6 (secondary index, comparison).
• For σA≥V(r) use index to find first index entry ≥ v and scan
index sequentially from there, to find pointers to records.
• For σA≤V(r) just scan leaf pages of index finding pointers to
records, till first entry > v
• In either case, retrieve records that are pointed to
• requires an I/O for each record
• Linear file scan may be cheaper (a worked comparison follows this slide)
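A rough illustration of that last point (values assumed, not from the
slides): with tS = 4 ms, tT = 0.1 ms, br = 1,000 blocks, and n = 2,000
matching records reached through a secondary index, the index approach
costs about n * (tS + tT) = 2,000 * 4.1 = 8,200 ms, while a linear scan
costs about tS + br * tT = 4 + 1,000 * 0.1 = 104 ms. One random I/O per
record quickly loses to a sequential scan.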
Implementation of Complex Selections
• Conjunction: σθ1 ∧ θ2 ∧ … ∧ θn(r)
• A7 (conjunctive selection using one index).
• Select a combination of θi and algorithms A1 through A7 that
results in the least cost for σθi(r).
• Test other conditions on tuple after fetching it into memory
buffer.
• A8 (conjunctive selection using composite index).
• Use appropriate composite (multiple-key) index if available.
• A9 (conjunctive selection by intersection of identifiers).
• Requires indices with record pointers.
• Use corresponding index for each condition, and take
intersection of all the obtained sets of record pointers.
• Then fetch records from file
• If some conditions do not have appropriate indices, apply test in
memory.
Algorithms for Complex Selections
• Disjunction: σθ1 ∨ θ2 ∨ … ∨ θn(r).
• A10 (disjunctive selection by union of identifiers).
• Applicable if all conditions have available indices.
• Otherwise use linear scan.
• Use corresponding index for each condition, and take union of
all the obtained sets of record pointers.
• Then fetch records from file
Evaluation of Expressions
• So far: we have seen algorithms for individual operations
• Alternatives for evaluating an entire expression tree
• Materialization: generate results of an expression whose
inputs are relations or are already computed, materialize
(store) it on disk. Repeat.
• Pipelining: pass on tuples to parent operations even as an
operation is being executed
Materialization
• Materialized evaluation: evaluate one operation at a time,
starting at the lowest-level. Use intermediate results
materialized into temporary relations to evaluate next-level
operations.
• E.g., in the expression tree (figure omitted), compute and store
      σbuilding=“Watson”(department),
then compute and store its join with instructor, and finally
compute the projection on name.
Materialization
• Materialized evaluation is always applicable
• Cost of writing results to disk and reading them back can be quite
high
• Our cost formulas for operations ignore cost of writing results
to disk, so
• Overall cost = Sum of costs of individual operations +
cost of writing intermediate results to disk
• Double buffering: use two output buffers for each operation,
when one is full write it to disk while the other is getting filled
• Allows overlap of disk writes with computation and reduces
execution time
Pipelining
• Pipelined evaluation : evaluate several operations simultaneously,
passing the results of one operation on to the next.
• E.g., in the previous expression tree, don’t store the result of
      σbuilding=“Watson”(department)
• instead, pass tuples directly to the join. Similarly, don’t store the
result of the join; pass tuples directly to the projection.
• Much cheaper than materialization: no need to store a temporary
relation to disk.
• Pipelining may not always be possible – e.g., sort, hash-join.
• For pipelining to be effective, use evaluation algorithms that
generate output tuples even as tuples are received for inputs to the
operation.
• Pipelines can be executed in two ways: demand driven and
producer driven
Pipelining
• In demand-driven or lazy evaluation
• system repeatedly requests next tuple from top level operation
• Each operation requests next tuple from children operations as
required, in order to output its next tuple
• In between calls, operation has to maintain “state” so it knows
what to return next
• In producer-driven or eager pipelining
• Operators produce tuples eagerly and pass them up to their
parents
• Buffer maintained between operators, child puts tuples in
buffer, parent removes tuples from buffer
• if buffer is full, child waits till there is space in the buffer, and
then generates more tuples
• System schedules operations that have space in output buffer and
can process more input tuples
• Alternative name: pull and push models of pipelining
Pipelining
• Implementation of demand-driven pipelining
• Each operation is implemented as an iterator implementing the
following operations (a sketch follows this slide)
• open()
• E.g. file scan: initialize file scan
• state: pointer to beginning of file
• E.g. merge join: sort relations;
• state: pointers to beginning of sorted relations
• next()
• E.g. for file scan: Output next tuple, and advance and store file
pointer
• E.g. for merge join: continue with merge from earlier state till
next output tuple is found. Save pointers as iterator state.
• close()
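A minimal sketch of this iterator interface in Python (illustrative
only; the file name, CSV tuple format, and column positions are
assumptions, not from the slides):

# Demand-driven (pull) pipelining: each operator exposes open/next/close;
# next() returns one tuple per call, or None when the input is exhausted.

class FileScan:
    def __init__(self, path):
        self.path = path
    def open(self):
        self.f = open(self.path)          # state: current file position
    def next(self):
        line = self.f.readline()
        return line.rstrip("\n").split(",") if line else None
    def close(self):
        self.f.close()

class Select:
    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate
    def open(self):
        self.child.open()
    def next(self):
        # pull from the child until a tuple satisfies the predicate
        t = self.child.next()
        while t is not None and not self.predicate(t):
            t = self.child.next()
        return t
    def close(self):
        self.child.close()

# e.g. σbuilding="Watson"(department), assuming building is column 1:
plan = Select(FileScan("department.csv"), lambda t: t[1] == "Watson")
plan.open()
t = plan.next()
while t is not None:
    print(t)
    t = plan.next()
plan.close()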
Evaluation Algorithms for Pipelining
• Some algorithms are not able to output results even as they get
input tuples
• E.g. merge join, or hash join
• intermediate results written to disk and then read back
• Algorithm variants to generate (at least some) results on the fly, as
input tuples are read in
• E.g. hybrid hash join generates output tuples even as probe
relation tuples in the in-memory partition (partition 0) are read in
• Double-pipelined join technique: Hybrid hash join, modified to
buffer partition 0 tuples of both relations in-memory, reading
them as they become available, and output results of any
matches between partition 0 tuples
• When a new r0 tuple is found, match it with existing s0 tuples,
output matches, and save it in r0
• Symmetrically for s0 tuples

Thank You !
