lecture notes on database normalization
lecture notes on database normalization
Chengkai Li
Department of Computer Science and Engineering
The University of Texas at Arlington
I decided to write this document, because many students do not have the textbook, the slides
themselves won’t be sufficient, and we did not have enough time in lecture to elaborate it in
further details. My goal is to summarize the concepts we learned and explain various points about
normalization through examples. These examples can help you solve similar problems in homework
and exam. To thoroughly understand these topics, you should read the textbook.
Note: The Table 15.1 in the textbook (and in the slides of Chapter 15) is not fully rigorous.
For 3NF, it says “Relation should not have a nonkey attribute functionally determined by another
nonkey attribute (or by a set of nonkey attributes).” It should be “Relation should not have a
nonkey attribute functionally determined by another nonkey attribute (or by a set of attributes
that is not a superkey).” The original statement does not cover the case in which a nonkey attribute
1
is functionally determined by a set of attributes containing both nonkey and prime attributes but
is not a superkey.
Concept 3 (Functional Dependency) The concept of functional dependency (FD) was repeat-
edly explained in lectures and you can find its definition and explanation in slides and textbook.
It is very important for understanding all other concepts.
A note to make is that given any superkey (and thus candidate key) X in a relation schema and
any set of attributes Y in the schema, there exists an FD X → Y. (The FD is nontrivial if Y does
not overlap with X.)
Concept 4 (Transitive Closure) Given a set of attributes X, the transitive closure of X, denoted
by X+ , is the set of attributes that can be derived directly or transitively from X according to
functional dependencies (FDs). To simplify the notation, we also use a+ (instead of {a}+ ) to
denote the transitive closure of a single attribute a.
Procedure 2 (How to find candidate keys by transitive closure?) Given a set of attributes
X in a relation schema, if its transitive closure X+ contains all attributes in the schema, then X is a
superkey. Furthermore, if none of X’s proper subsets can be a superkey, then X is a candidate key.
1.2 Examples
Problem 1 Consider the following FDs for relation schema R(a, b, c, d, e): a → bc; cd → e; b →
d; e → a. List all candidate keys for R.
Solution 1 To find candidate keys, we can compute the transitive closure of every possible subset
of the attributes, i.e., every possible left-hand side of an FD.
Compute a+ : Begin with a+ ={a}; expand a+ to {abc} by FD a → bc; further expand it to
{abcd} by FD b → d; further expand it to {abcde} by FD cd → e. Hence a+ ={abcde}.
Compute b+ : b+ ={b}; then b+ ={bd} by FD b → d.
Compute c+ : c+ ={c}. It cannot be expanded. As we can see, none of the FDs has c only in
the left-hand side.
Compute d+ : d+ ={d}.
Compute e+ : e+ ={e}; then e+ ={ae} by FD e → a; we already know a+ ={abcde}, therefore
+
e ={abcde}.
Since a+ ={abcde} and e+ ={abcde}, we know that both a and e are candidate keys.
To find remaining candidate keys, we can continue the enumeration of all possible combinations
of attributes and compute their transitive closures. However, we can avoid the enumeration of
several cases, by making the following observation.
2
Remember we showed in Concept 1 that any proper superset of a candidate key cannot be a
candidate key. Instead, such a proper superset is only a superkey. Hence, we can conclude that
any proper superset of a or e is not a candidate key. Based on this observation, any multi-attribute
candidate key must not contain a or e.
For finding possible two-attribute candidate keys, we only need to look at bc, bd, and cd.
{bc}+ : First {bc}+ ={bc}; then {bc}+ ={bcd} by b → d; furthermore {bc}+ ={bcde} by cd →
e; finally {bc}+ ={abcde} by e → a. Therefore bc is a candidate key.
{bd}+ : First {bd}+ ={bd}. It cannot be further expanded. (Another way of reasoning about
it: We have FD b → d. Hence {bd}+ is equal to b+ , which we have already computed.)
{cd}+ : First {cd}+ ={cd}; then {cd}+ ={cde} by cd → e. Based on the previous steps, we
already know e is a candidate key. Hence cd is also a candidate key.
For finding possible three-attribute candidate keys, we only need to consider bcd. However, it
cannot be a candidate key, given that bc (cd) is a candidate key.
In summary, a, e, bc, cd are all the candidate keys.
Problem 2 Consider the following FDs for relation schema R(a, b, c, d, e): ab → c; cd → e; de →
b. List all candidate keys for R.
Solution 2 Again, we can enumerate all subsets of the schema including itself and compute the
transitive closures of these subsets. However, we can save time and avoid such exhaustive exami-
nation, by following some key observations.
Note that none of the FDs has a or d in its right-hand side. In other words, a and d cannot be
derived from any other attributes. Hence any candidate key must contain both a and d.
Consider possible 2-attribute candidate keys, i.e., ad. ad+ ={ad}. It is not a candidate key.
Consider possible 3-attribute candidate keys, i.e., abd, acd, ade. abd+ ={abcde}, acd+ ={abcde},
+
ade ={abcde}. All of them are candidate keys.
It is easy to see that there cannot be any 4-attribute or 5-attribute candidate keys. As mentioned
above, a candidate key must contain both a and d. Thus a 4-attribute or 5-attribute candidate key
must be a superset of abd, or acd, or ade, which are all candidate keys. That would disqualify it
from being considered a candidate key.
2 Normal Forms
In defining normal forms, we only consider nontrivial FDs. An FD X → Y is trivial if X subsumes
Y, i.e, X ⊇ Y.
We also only consider simple FD in which the right-hand side is a single attribute. Thus we
will represent it by X → y, where the lower case y indicates it is a single attribute. An FD with
multiple attributes in the right-hand side can be split into individual simple FDs.
2NF: The following statements are all equivalent. They all define what 2NF is about.
• A relation is in 2NF if every nonprime attribute is fully functional dependent on every can-
didate key.
• A relation is in 2NF if there does not exist an FD X → y such that y is a nonprime attribute
and X is a proper subset of a candidate key.
• A relation is in 2NF if there does not exist a nonprime attribute y that is partially dependent
on any candidate key.
3
3NF: The following statements are all equivalent. They all define what 3NF is about.
• A relation is in 3NF if it is in 2NF and there does not exist a nonprime attribute y that is
transitively dependent on any candidate key.
• A relation is in 3NF if there does not exist a nonprime attribute y that is dependent on any
set of attributes that is not a superkey.
• A relation is in 3NF if given any FD X → y, either y is a prime attribute or X is a superkey.
In practice, we often want our relations in BCNF or 3NF. Higher normal forms 4NF
and 5NF are not pursued in practice. The lower one 2NF is often insufficient.
3 Decomposition
3.1 Decomposition Algorithms
Procedure 3 (How to determine if a relation is in 2NF, 3NF, or BCNF?) Find all can-
didate keys based on given FDs. Then check if any FD violates the corresponding normal form,
according to the definitions of 2NF, 3NF, and BCNF.
Proof 1 We have proved this in class. It is also a question in HW3. We will provide the proof in
HW3 solution.
3. Decompose R into R1 and R2. R1 contains all attributes in X+ , R2 contains the set of
attributes (R−X+ )∪X, which is also equivalent to the set of attributes R − (X+ − X). In other
words, R2 contains X itself and all attributes outside of the transitive closure of X.
4. Find which original FDs in R are still applicable (i.e., preserved) in R1 and R2. Find all
candidate keys of R1 and R2 based on the preserved FDs in these relations, and find if they
are in BCNF.
5. If R1 or R2 is not in BCNF, recursively decompose the relation, until all relations are in
BCNF. (It is guaranteed that we will eventually have a set of relations that are all in BCNF.
See Proposition 1.)
4
(2) The tuples in the original relation are projected into the new relations. We may wonder–
Can we recover the original relation? In other words, do we get exactly the same relation if we
perform natural joins of the new relations? After all, we will join these relations for various queries
that are supposed to be executed on the original single relation.
Proposition 2 If we follow the above procedure to decompose a relation into BCNF relations, it
is guaranteed that the original relation can be recovered losslessly, i.e., we can obtain the original
relation exactly by natural joins of the new relations.
In HW3, through one question, you will see that if we do not follow the above procedure (e.g.,
if we decompose a relation in a fashion that is not based on a violating FD), then the lossless
property is not guaranteed. We may not get back the exact original relation.
3.3 Examples
Problem 3 Consider the following FDs for relation schema R(a, b, c, d): ab → c; c → d; d → a.
(1) Find all candidate keys.
(2) Find all BCNF violations.
(3) Decompose R into relations in BCNF.
(4) What FDs are not preserved by BCNF.
Solution 3
(1) b is not in the right-hand side of any of the given FDs. Hence every candidate key must contain
b. b+ ={b}. Hence b itself is not a candidate key. We need to consider its supersets. ab+ ={abcd},
bc+ ={abcd}, bd+ ={abcd}. All of them are candidate keys. And thus there cannot be any candi-
date key with more than 2 attributes.
(2) ab, bc, and bd are candidate keys. Let’s check each of the given FDs:
ab → c: left-hand side is a candidate key, and thus a superkey. Hence it does not violate BCNF.
c → d: left-hand side is not a superkey. Hence it violates BCNF.
d → a: left-hand side is not a superkey. Hence it violates BCNF.
(3.1) Start with c → d: The transitive closure of c is c+ ={a, c, d}. Hence R can be decomposed into
R1(b, c) and R2(a, c, d). For R1(b, c), we know a 2-attribute relation schema must be in BCNF.
In R2(a, c, d), the original FDs c → d and d → a are preserved. The only candidate key is c. Hence
there is a transitive dependency: c → d and then d → a. a is a nonprime attribute in R2. Thus
the FD d → a in R2 violates 3NF (and thus BCNF too). By this violating FD, we will further
decompose R2 into R3(c, d) and R4(a, d). Both R3 and R4 have only 2 attributes. Therefore they
are in BCNF. Hence the original R is decomposed into R1, R3, R4 so that all of them are in BCNF.
5
(3.2) Start with d → a: The transitive closure of d is d+ ={a, d}. Hence R can be decomposed
into R5(b, c, d) and R6(a, d). R6 must be in BCNF. In R5(b, c, d), c → d is preserved. Hence b, c
is the only candidate key in R5. c → d violates 2NF (and thus 3NF and BCNF as well) in R5
since c is only part of a candidate key and d is a nonprime attribute. We further decompose R5
by this violating FD into R7(b, c) and R8(c, d). Both are in BCNF. The original R is decomposed
into R6, R7, R8 so that all of them are in BCNF.
(4) If we follow the decomposition in (3.1) or (3.2), ab → c is not preserved in any of the resulting
relations. As we can see, it is not always possible to preserve all original FDs during decomposition.