Lecture 02
MINING (SE-409)
Lecture-2
Introduction and Background
Dr. Huma
Software Engineering department
1
Normalization
What is normalization?
What are the goals of normalization?
Eliminate redundant data.
Ensure data dependencies make sense.
3
Rules for First Normal Form
The first normal form expects you to follow a few simple rules while designing your
database. One of them is that all values in a column must be of the same kind.
For example: if you have a column dob to store the dates of birth of a set of people,
then you must not store the names of some of them in that column alongside
the dates of birth of others. It should hold only a date of birth for every
record/row.
4
Rules for First Normal Form
Rule 3: Unique name for Attributes/Columns
This rule expects each column in a table to have a unique name. This
avoids confusion when retrieving data or performing any other operation on
the stored data.
If two or more columns have the same name, the DBMS cannot tell which one a
query refers to.
5
Rules for First Normal Form
roll_no   name   subject
101       Akon   OS, CN
103       Ckon   Java
102       Bkon   C, C++
6
How to solve this Problem?
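It's very simple: all we have to do is break the multi-valued entries into
atomic values, one value per row.
Here is our updated table, which now satisfies the First Normal Form.
roll_no   name   subject
101       Akon   OS
101       Akon   CN
103       Ckon   Java
102       Bkon   C
102       Bkon   C++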
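The step of breaking multi-valued entries into atomic values can be sketched in Python. This is only an illustration: the function name to_first_normal_form is hypothetical, and it assumes the subjects in a cell are separated by commas, as in the table on the previous slide.

```python
def to_first_normal_form(rows):
    """Emit one row per subject so that every cell holds a single value."""
    atomic_rows = []
    for roll_no, name, subject in rows:
        for single in subject.split(","):
            atomic_rows.append((roll_no, name, single.strip()))
    return atomic_rows


# The rows from the slide, with multi-valued subject cells.
students = [
    (101, "Akon", "OS, CN"),
    (103, "Ckon", "Java"),
    (102, "Bkon", "C, C++"),
]

for row in to_first_normal_form(students):
    print(row)
```

Each student now appears once per subject, so roll_no alone no longer identifies a row.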
7
Second Normal Form
• For a table to be in the Second Normal Form, it
must be in the First Normal Form and it must not
have any Partial Dependency.
• Partial Dependency exists when, for a composite
primary key, an attribute in the table depends only
on a part of the primary key and not on the complete
primary key.
• To remove Partial Dependency, we divide the
table: remove the attribute that causes the partial
dependency and move it to another table where
it fits well.
8
Let's create another table for Subject, which will have subject_id and subject_name fields,
with subject_id as the primary key.
subject_id   subject_name
1            Java
2            C++
3            PHP
9
Let's create another table, Score, to store the marks obtained by students
in the respective subjects. We will also save the name of the
teacher who teaches that subject along with the marks.
10
In the Score table we save student_id to know which student the marks
belong to, and subject_id to know which subject the marks
are for.
11
But where is Partial Dependency?
• Now if you look at the Score table, we have a column
named teacher which depends only on the
subject: for Java it's Java Teacher, for C++ it's C++
Teacher, and so on.
• As we just discussed, the primary key for
this table is a composite of two columns,
student_id and subject_id, but the teacher's name
depends only on the subject, hence on subject_id, and
has nothing to do with student_id.
• This is Partial Dependency: an attribute in a
table depends on only a part of the primary key and
not on the whole key.
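A partial dependency can also be spotted by checking sample data. The sketch below is an assumption-laden illustration: the helper determines and the sample rows (student ids and marks) are made up, and a check on sample rows can only show that a dependency is consistent with the data, not prove it holds in general.

```python
def determines(rows, lhs, rhs):
    """Return True if, in these rows, the columns in lhs always fix the value of rhs."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True


# Hypothetical Score rows; the primary key is (student_id, subject_id).
score = [
    {"student_id": 10, "subject_id": 1, "marks": 70, "teacher": "Java Teacher"},
    {"student_id": 10, "subject_id": 2, "marks": 75, "teacher": "C++ Teacher"},
    {"student_id": 11, "subject_id": 1, "marks": 80, "teacher": "Java Teacher"},
]

# teacher is fixed by subject_id alone: a partial dependency.
print(determines(score, ["subject_id"], "teacher"))
# marks needs the whole key: neither part alone fixes it.
print(determines(score, ["subject_id"], "marks"))
print(determines(score, ["student_id"], "marks"))
```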
12
How to remove Partial Dependency?
There can be many different solutions for this, but our objective is to remove the teacher's
name from the Score table.
The simplest solution is to remove the teacher column from the Score table and add it to the
Subject table. Hence, the Subject table will become:
subject_id   subject_name   teacher
1            Java           Java Teacher
2            C++            C++ Teacher
3            PHP            PHP Teacher
And our Score table is now in the Second Normal Form, with no partial dependency.
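The resulting two-table design could be sketched with Python's built-in sqlite3 module. The lowercase table names, the column types, and the sample marks are assumptions made for illustration; only the column names come from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE subject (
    subject_id   INTEGER PRIMARY KEY,
    subject_name TEXT NOT NULL,
    teacher      TEXT NOT NULL          -- moved here from score
);
CREATE TABLE score (
    student_id INTEGER NOT NULL,
    subject_id INTEGER NOT NULL REFERENCES subject(subject_id),
    marks      INTEGER NOT NULL,
    PRIMARY KEY (student_id, subject_id) -- composite key, no teacher column
);
INSERT INTO subject VALUES (1, 'Java', 'Java Teacher'), (2, 'C++', 'C++ Teacher');
INSERT INTO score VALUES (10, 1, 70), (10, 2, 75), (11, 1, 80);
""")

# The teacher is still available through a join, but is stored once per subject.
report = conn.execute("""
    SELECT sc.student_id, su.subject_name, sc.marks, su.teacher
    FROM score AS sc
    JOIN subject AS su ON su.subject_id = sc.subject_id
    ORDER BY sc.student_id, sc.subject_id
""").fetchall()

for line in report:
    print(line)
```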
14
Third Normal Form (3NF)
15
• By transitive functional dependency, we mean
we have the following relationships in the
table: A is functionally dependent on B, and B
is functionally dependent on C. In this case, A
is transitively dependent on C via B.
• 3rd Normal Form Example
• Consider the following example:
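As a separate sketch of a transitive dependency (a hypothetical table, not taken from the slides): suppose city depends on zip, and zip depends on the key student_id, so city depends on student_id only through zip. The column names, values, and the helper split_transitive are all assumptions for illustration.

```python
# Hypothetical rows: student_id -> zip -> city.
students = [
    {"student_id": 1, "zip": "Z1", "city": "City A"},
    {"student_id": 2, "zip": "Z2", "city": "City B"},
    {"student_id": 3, "zip": "Z1", "city": "City A"},
]


def split_transitive(rows, key, via, dependent):
    """Keep (key, via) in the base table and move (via -> dependent) to a lookup table."""
    base = [{key: r[key], via: r[via]} for r in rows]
    lookup = {}
    for r in rows:
        lookup[r[via]] = r[dependent]
    return base, lookup


student_table, zip_table = split_transitive(students, "student_id", "zip", "city")
print(student_table)
print(zip_table)
```

After the split, each city is stored once per zip instead of once per student.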
16
17
Striking a balance between “good” & “evil”
[Figure: a spectrum with De-normalization (data lists) at one end and Normalization (4+ normal forms, too many tables) at the other]
18
What is De-normalization?
It is not chaos, more like a “controlled crash”
with the aim of performance enhancement
without loss of information.
23
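De-normalization Techniques
Collapsing tables
Splitting tables (horizontal and vertical)
Pre-joining
Adding redundant columns
Derived attributes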
24
Collapsing Tables
normalized:    two tables, (ColA, ColB) and (ColA, ColC)
denormalized:  one table, (ColA, ColB, ColC)
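Collapsing two tables that share the key ColA could look like the sketch below. The sample values and the function name collapse are made up, and each table is modelled as a plain dictionary keyed on ColA.

```python
# Two normalized tables sharing the key ColA (values are hypothetical).
table_ab = {1: "b1", 2: "b2", 3: "b3"}  # ColA -> ColB
table_ac = {1: "c1", 2: "c2", 3: "c3"}  # ColA -> ColC


def collapse(ab, ac):
    """Merge the two tables into rows of (ColA, ColB, ColC) for keys present in both."""
    return [(a, ab[a], ac[a]) for a in sorted(ab.keys() & ac.keys())]


collapsed = collapse(table_ab, table_ac)
for row in collapsed:
    print(row)
```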
25
Splitting Tables
Vertical split:   Table (ColA, ColB, ColC) becomes Table_v1 (ColA, ColB) and Table_v2 (ColA, ColC)
Horizontal split: Table (ColA, ColB, ColC) becomes Table_h1 and Table_h2, each holding a subset of the rows
26
Splitting Tables: Horizontal splitting…
Breaks a table into multiple tables based upon
common column values. Example: campus-specific
queries.
GOAL
27
Splitting Tables: Horizontal splitting
ADVANTAGE
Enhance security of data.
Organizing tables differently for different
queries.
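A horizontal split on a campus column might be sketched as follows. The column names, campus names, and the function horizontal_split are assumptions for illustration only.

```python
from collections import defaultdict


def horizontal_split(rows, column):
    """Group rows into one table per distinct value of `column`."""
    parts = defaultdict(list)
    for row in rows:
        parts[row[column]].append(row)
    return dict(parts)


# Hypothetical student rows.
students = [
    {"sid": 1, "campus": "Main", "degree": "BS"},
    {"sid": 2, "campus": "City", "degree": "MS"},
    {"sid": 3, "campus": "Main", "degree": "MS"},
]

by_campus = horizontal_split(students, "campus")
# A campus-specific query now touches only that campus's rows.
print(by_campus["Main"])
```

Every row lands in exactly one part, and all parts keep the same columns.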
28
Splitting Tables: Vertical Splitting
Infrequently accessed columns become extra
“baggage” thus degrading performance.
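A vertical split that moves infrequently accessed columns out of the main table can be sketched like this. The function vertical_split and the sample values are hypothetical; the key column is repeated in both parts so the original rows can be rebuilt.

```python
def vertical_split(rows, key, hot_columns):
    """Split rows into a 'hot' table (key + frequently used columns) and a 'cold' table (key + the rest)."""
    hot, cold = [], []
    for row in rows:
        hot.append({c: row[c] for c in [key, *hot_columns]})
        cold.append({c: v for c, v in row.items() if c == key or c not in hot_columns})
    return hot, cold


# Hypothetical rows of a table with columns ColA, ColB, ColC.
table = [
    {"ColA": 1, "ColB": "b1", "ColC": "c1"},
    {"ColA": 2, "ColB": "b2", "ColC": "c2"},
]

table_v1, table_v2 = vertical_split(table, key="ColA", hot_columns=["ColB"])
print(table_v1)
print(table_v2)
```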
29
Pre-joining …
30
Pre-Joining…
normalized:
Master (Sale_ID, Sale_date, Sale_person)
    1 : M
Detail (Tx_ID, Sale_ID, Item_ID, Item_Qty, Sale_Rs)
denormalized:
one pre-joined table in which each Detail row also carries its Master columns
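Pre-joining the master and detail tables could be sketched with Python's built-in sqlite3 module. The table name sale_prejoined, the column types, and the sample rows are assumptions; only the column names come from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE master (Sale_ID INTEGER PRIMARY KEY, Sale_date TEXT, Sale_person TEXT);
CREATE TABLE detail (Tx_ID INTEGER PRIMARY KEY, Sale_ID INTEGER, Item_ID INTEGER,
                     Item_Qty INTEGER, Sale_Rs REAL);
INSERT INTO master VALUES (1, '2024-01-05', 'Ali'), (2, '2024-01-06', 'Sara');
INSERT INTO detail VALUES (100, 1, 7, 2, 500.0), (101, 1, 9, 1, 120.0),
                          (102, 2, 7, 4, 1000.0);

-- Pre-join: materialize the 1:M join once so later queries can skip it.
CREATE TABLE sale_prejoined AS
SELECT d.Tx_ID, d.Sale_ID, m.Sale_date, m.Sale_person,
       d.Item_ID, d.Item_Qty, d.Sale_Rs
FROM detail AS d
JOIN master AS m ON m.Sale_ID = d.Sale_ID;
""")

# No join is needed at query time; the master columns repeat on every detail row.
rows = conn.execute(
    "SELECT Tx_ID, Sale_person, Sale_Rs FROM sale_prejoined ORDER BY Tx_ID"
).fetchall()
for row in rows:
    print(row)
```

The trade-off is visible in the data: Sale_date and Sale_person are stored once per transaction line rather than once per sale.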
31
Adding Redundant Columns…
Before: Table_1 (ColA, ColB) and Table_2
After:  Table_1’ (ColA, ColB, ColC) and Table_2 (unchanged)
32
Adding Redundant Columns…
Columns can also be moved, instead of making them
redundant. This is very similar to pre-joining, as discussed
earlier.
EXAMPLE
A code is referenced frequently in one table while the
corresponding description sits in another table.
A join is required.
Note that:
This actually increases storage space (header
size increases) and increases update
overhead (queries become fast, a few operations become slow).
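The code/description case might be sketched as below. The department codes, column names, and both helper functions are hypothetical; the point is only to show the saved lookup and the extra update work.

```python
# Hypothetical lookup table: code -> description.
descriptions = {"CS": "Computer Science", "SE": "Software Engineering"}

# Hypothetical table that references the code frequently.
enrolments = [
    {"sid": 1, "dept_code": "SE"},
    {"sid": 2, "dept_code": "CS"},
    {"sid": 3, "dept_code": "SE"},
]


def add_redundant_description(rows, lookup):
    """Copy the description next to each code so reads no longer need the lookup."""
    return [{**row, "dept_desc": lookup[row["dept_code"]]} for row in rows]


def rename_description(rows, lookup, code, new_desc):
    """Update overhead: the lookup AND every redundant copy must change. Returns rows touched."""
    lookup[code] = new_desc
    touched = 0
    for row in rows:
        if row["dept_code"] == code:
            row["dept_desc"] = new_desc
            touched += 1
    return touched


wide = add_redundant_description(enrolments, descriptions)
print(wide[0]["dept_desc"])
print(rename_description(wide, descriptions, "SE", "Software Eng."))
```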
34
Derived Attributes: Example
Business Data Model    DWH Data Model
#SID                   #SID
DoB                    DoB
Degree                 Degree
Course                 Course
Grade                  Grade
Credits                Credits
                       GP     (derived attribute)
                       Age    (derived attribute)
Derived attributes: calculated once, used frequently.
DoB: Date of Birth
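Computing the derived attributes once at load time could look like the sketch below. The slides do not define GP, so treating it as grade points multiplied by credits, and the letter-grade scale itself, are assumptions; the function names and sample record are hypothetical.

```python
from datetime import date

# ASSUMPTION: a 4-point letter scale; the slides do not give one.
GRADE_POINTS = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0, "F": 0.0}


def age_on(dob, today):
    """Whole years between dob and today."""
    years = today.year - dob.year
    if (today.month, today.day) < (dob.month, dob.day):
        years -= 1  # birthday not reached yet this year
    return years


def with_derived(record, today):
    """Return a copy of the record with GP and Age stored as extra columns."""
    derived = dict(record)
    # ASSUMPTION: GP = grade points x credits.
    derived["GP"] = GRADE_POINTS[record["Grade"]] * record["Credits"]
    derived["Age"] = age_on(record["DoB"], today)
    return derived


row = {"SID": 1, "DoB": date(2000, 6, 15), "Degree": "BS",
       "Course": "SE-409", "Grade": "B", "Credits": 3}
loaded = with_derived(row, today=date(2024, 6, 14))
print(loaded["GP"], loaded["Age"])
```

Queries then read GP and Age directly instead of recomputing them on every access.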
36
DWH & OLAP
consistently during last quarter regional level during last year ??
only. Rest is OK
• Analysis is directional
– Drill Down (more detail, e.g. year -> month -> week)
– Roll Up
– Pivot
(More in subsequent slides)
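Drill down and roll up can be pictured as the same aggregation run at different levels of a hierarchy. In this sketch the sales figures and the function aggregate are made up, and the hierarchy is limited to year and month.

```python
from collections import defaultdict

# Hypothetical facts: (year, month, amount).
sales = [
    ("2023", "01", 100),
    ("2023", "01", 50),
    ("2023", "02", 70),
    ("2024", "01", 30),
]


def aggregate(rows, level):
    """Sum amounts by the first `level` dimensions: 1 = year, 2 = year and month."""
    totals = defaultdict(int)
    for *dims, amount in rows:
        totals[tuple(dims[:level])] += amount
    return dict(totals)


by_year = aggregate(sales, 1)   # rolled up
by_month = aggregate(sales, 2)  # drilled down
print(by_year)
print(by_month)
```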
39
Challenges…
• Not feasible to write predefined queries.
– Fails to remain user-driven (becomes programmer-
driven).
40
Challenges
• Contradiction
– Want to compute answers in advance, but don't
know the questions
• Solution
– Compute answers to “all” possible “queries”. But
how?
41
“All” possible queries (level aggregates)
ALL ALL
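One way to read "compute answers to all possible queries" is to pre-compute an aggregate for every combination of dimensions, with ALL standing for a dimension that has been summed away. The sketch below assumes a sum over a single measure; the function cube, the dimension names, and the sample rows are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def cube(rows, dims, measure):
    """Sum `measure` for every subset of `dims`; a dimension left out is shown as 'ALL'."""
    result = {}
    for size in range(len(dims) + 1):
        for kept in combinations(dims, size):
            totals = defaultdict(int)
            for row in rows:
                key = tuple(row[d] if d in kept else "ALL" for d in dims)
                totals[key] += row[measure]
            result.update(totals)
    return result


# Hypothetical facts.
sales = [
    {"region": "North", "year": 2023, "amount": 10},
    {"region": "North", "year": 2024, "amount": 20},
    {"region": "South", "year": 2023, "amount": 5},
]

aggregates = cube(sales, ["region", "year"], "amount")
for key, total in sorted(aggregates.items(), key=str):
    print(key, total)
```

With n dimensions there are 2^n such groupings, which is why the number of pre-computed aggregates grows quickly.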
43
Where does OLAP fit in?
[Figure: diagram with the labels Transaction Data, Data Loading (ELT), Data Cube (MOLAP), OLAP, Presentation Tools, Reports, Decision Maker]
44
OLTP vs. OLAP
Feature                          OLTP                              OLAP
Level of data                    Detailed                          Aggregated
Amount of data per transaction   Small                             Large
Views                            Pre-defined [Programmer]          User-defined
Typical write operation          Update, insert, delete            Bulk insert
“age” of data                    Current (60-90 days)              Historical 5-10 years and also current [Active DW]
Number of users                  High                              Low-Med
Tables                           Flat tables [Highly normalized]   Multi-Dimensional tables
Database size                    Med (10^9 B – 10^12 B)            High (10^12 B – 10^15 B)
Query Optimizing                 Requires experience               Already “optimized”
Data availability                High                              Low-Med
45
OLAP FASMI Test
Fast: Delivers information to the user at a fairly constant rate.
Most queries answered in under five seconds.
46