Lecture 02

This document discusses normalization in databases. It begins by defining normalization and its goals of eliminating redundant data and ensuring sensible data dependencies. It then describes the levels of normalization, including first normal form (1NF), second normal form (2NF), and third normal form (3NF). The document provides examples to illustrate the rules for 1NF and how to achieve 2NF and remove partial dependencies. It also discusses denormalization techniques used in data warehousing to improve query performance.

Uploaded by

Syed Badshah

Data Warehousing & Data Mining (SE-409)
Lecture-2
Introduction and Background

Dr. Huma
Software Engineering department

University of Engineering and Technology, Taxila

Normalization

What is normalization?
What are the goals of normalization?
• Eliminate redundant data.
• Ensure data dependencies make sense.

What is the result of normalization?

What are the levels of normalization?

Rules for First Normal Form
The first normal form expects you to follow a few simple rules while designing your
database:

Rule 1: Single-Valued Attributes

Each column of a table should be single valued, meaning that a cell must not
contain multiple values. An example follows after the remaining rules.

Rule 2: Attribute Domain Should Not Change

This is more of a "common sense" rule. All values stored in a column must be
of the same kind or type.

For example: if a column dob stores the dates of birth of a set of people,
it must not hold names for some rows and dates of birth for others.
It should hold only a date of birth in every record/row.

Rule 3: Unique Names for Attributes/Columns
Each column in a table should have a unique name. This avoids confusion when
retrieving data or performing any other operation on the stored data.
If two or more columns share a name, the DBMS cannot tell which one a query
refers to.

Rule 4: Order Doesn't Matter

The order in which rows are stored in the table doesn't matter.

Time for an Example

Here is our table, with some sample data added to it.

roll_no  name  subject
101      Akon  OS, CN
103      Ckon  Java
102      Bkon  C, C++

The subject column holds more than one value for students 101 and 102, which
violates Rule 1.

How to solve this Problem?
It's very simple: all we have to do is break the values into atomic values.
Here is our updated table, which now satisfies the First Normal Form.

roll_no  name  subject
101      Akon  OS
101      Akon  CN
103      Ckon  Java
102      Bkon  C
102      Bkon  C++

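The split into atomic values can be sketched in a few lines of Python. This is only an illustration: the helper name to_first_normal_form and the assumption that multiple subjects are separated by commas are not part of the slides.

```python
# Sample rows from the slide: the subject cell may hold several values.
rows = [
    (101, "Akon", "OS, CN"),
    (103, "Ckon", "Java"),
    (102, "Bkon", "C, C++"),
]

def to_first_normal_form(rows):
    """Return one (roll_no, name, subject) row per atomic subject value."""
    atomic = []
    for roll_no, name, subjects in rows:
        # Assumption: values inside a cell are comma separated.
        for subject in subjects.split(","):
            atomic.append((roll_no, name, subject.strip()))
    return atomic

atomic_rows = to_first_normal_form(rows)
for row in atomic_rows:
    print(row)
```

Each printed row now carries exactly one subject, matching the updated table above.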
Second Normal Form
• For a table to be in the Second Normal Form, it
should be in the First Normal Form and it should not
have any Partial Dependency.
• Partial Dependency exists when, for a composite
primary key, an attribute in the table depends on only
a part of the primary key and not on the complete
primary key.
• To remove Partial Dependency, we can divide the
table: remove the attribute that causes the partial
dependency and move it to another table where
it fits well.

Let's create another table for Subject, which will have subject_id and subject_name fields;
subject_id will be the primary key.

subject_id  subject_name
1           Java
2           C++
3           Php

Let's create another table, Score, to store the marks obtained by students
in the respective subjects. We will also save the name of the
teacher who teaches that subject along with the marks.

score_id  student_id  subject_id  marks  teacher
1         10          1           70     Java Teacher
2         10          2           75     C++ Teacher
3         11          1           80     Java Teacher

In the Score table we save student_id to know which student the marks
belong to, and subject_id to know which subject the marks are for.

Together, student_id + subject_id form a Candidate Key for this table,
which can be the Primary Key.

Confused about how this combination can be a primary key?

If I ask you for the marks of the student with student_id 10, can you
get them from this table? No, because you don't know for which subject.
And if I give you only a subject_id, you would not know for which student.
Hence we need student_id + subject_id to uniquely identify any row.

But where is Partial Dependency?
• If you look at the Score table, we have a column
named teacher which depends only on the
subject: for Java it's Java Teacher, for C++ it's C++
Teacher, and so on.
• As we just discussed, the primary key for
this table is a composition of two columns,
student_id & subject_id, but the teacher's name
depends only on the subject, hence on subject_id, and
has nothing to do with student_id.
• This is Partial Dependency, where an attribute in a
table depends on only a part of the primary key and
not on the whole key.
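A quick way to see the partial dependency is to test which columns determine which in the sample rows. The sketch below is an illustration only; the helper determines is a hypothetical name, and a check on sample data can refute a dependency but cannot prove that it holds for all future rows.

```python
from collections import defaultdict

# Score rows from the slides.
COLUMNS = ("score_id", "student_id", "subject_id", "marks", "teacher")
score = [
    (1, 10, 1, 70, "Java Teacher"),
    (2, 10, 2, 75, "C++ Teacher"),
    (3, 11, 1, 80, "Java Teacher"),
]

def determines(rows, lhs, rhs):
    """True if equal values in the lhs columns always go with one rhs value."""
    seen = defaultdict(set)
    for row in rows:
        record = dict(zip(COLUMNS, row))
        key = tuple(record[column] for column in lhs)
        seen[key].add(record[rhs])
    return all(len(values) == 1 for values in seen.values())

# teacher is fixed by subject_id alone: a partial dependency on the key.
print(determines(score, ("subject_id",), "teacher"))
# marks are not fixed by either half of the key on its own...
print(determines(score, ("student_id",), "marks"))
print(determines(score, ("subject_id",), "marks"))
# ...only by the whole composite key.
print(determines(score, ("student_id", "subject_id"), "marks"))
```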
How to remove Partial Dependency?
There can be many different solutions for this, but our objective is to remove the teacher's
name from the Score table.
The simplest solution is to remove the teacher column from the Score table and add it to the
Subject table. Hence, the Subject table will become:

subject_id  subject_name  teacher
1           Java          Java Teacher
2           C++           C++ Teacher
3           Php           Php Teacher

And our Score table is now in the second normal form, with no partial dependency.

score_id  student_id  subject_id  marks
1         10          1           70
2         10          2           75
3         11          1           80

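The decomposition loses no information, because a join on subject_id brings the teacher back whenever it is needed. A minimal sketch using Python's built-in sqlite3 module follows; the lowercase table names and the column types are assumptions made for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE subject (
    subject_id   INTEGER PRIMARY KEY,
    subject_name TEXT,
    teacher      TEXT
);
CREATE TABLE score (
    score_id   INTEGER PRIMARY KEY,
    student_id INTEGER,
    subject_id INTEGER REFERENCES subject(subject_id),
    marks      INTEGER,
    UNIQUE (student_id, subject_id)
);
INSERT INTO subject VALUES
    (1, 'Java', 'Java Teacher'), (2, 'C++', 'C++ Teacher'), (3, 'Php', 'Php Teacher');
INSERT INTO score VALUES (1, 10, 1, 70), (2, 10, 2, 75), (3, 11, 1, 80);
""")

# The teacher is stored once per subject; the join recovers it per score row.
joined = conn.execute("""
    SELECT s.score_id, s.student_id, s.subject_id, s.marks, sub.teacher
    FROM score AS s
    JOIN subject AS sub ON sub.subject_id = s.subject_id
    ORDER BY s.score_id
""").fetchall()
print(joined)
```

The joined rows are the same as the original Score table that still carried the teacher column.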
Third Normal Form (3NF)

• Requirements for Third Normal Form

• For a table to be in the third normal form:
– It should be in the Second Normal Form.
– It should not have any Transitive Dependency.

• By transitive functional dependency, we mean
we have the following relationships in the
table: B is functionally dependent on A, and C
is functionally dependent on B. In this case, C
is transitively dependent on A via B.
• 3rd Normal Form Example
• Consider the following example: [figure not recoverable from the slides]

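The figure from the slides is not available in this text, so the sketch below uses a purely hypothetical table as a stand-in. It assumes student_id determines zip_code and zip_code determines city, so city depends on student_id only transitively. All names and values are invented for the illustration.

```python
# Hypothetical rows: (student_id, name, zip_code, city)
students = [
    (1, "Akon", "Z1", "City X"),
    (2, "Bkon", "Z1", "City X"),
    (3, "Ckon", "Z2", "City Y"),
]

# 3NF decomposition: keep zip_code with the student, move city to its own table.
student_table = [(sid, name, zip_code) for sid, name, zip_code, _ in students]
zip_table = sorted({(zip_code, city) for _, _, zip_code, city in students})

# Rejoin the two tables to confirm that nothing was lost.
city_of = dict(zip_table)
rebuilt = [(sid, name, zip_code, city_of[zip_code])
           for sid, name, zip_code in student_table]

print(zip_table)
print(rebuilt == students)
```

Each city is now stored once per zip code instead of once per student.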
Striking a balance between “good” & “evil”
[Figure: a scale running from de-normalization to normalization]
De-normalization side: one big flat file (flat table), data lists, data cubes.
Normalization side: 1st Normal Form, 2nd Normal Form, 3rd Normal Form, 4+ Normal Forms (too many tables).

What is De-normalization?
• It is not chaos, more like a “controlled crash”
with the aim of performance enhancement
without loss of information.

• Normalization is a rule of thumb in DBMS,
but in a Decision Support System (DSS) ease of use is
achieved by way of de-normalization.

• De-normalization comes in many flavors,
such as combining tables, splitting tables,
adding data etc., but all done very carefully.
Why De-normalization In DSS?
• Bringing dispersed but related data items “close” together.

• Query performance in DSS depends significantly
on the physical data model.

• Very early studies showed performance differences
of orders of magnitude for different numbers of
de-normalized tables and rows per table.

• The level of de-normalization should be carefully
considered.
How De-normalization improves performance?
De-normalization specifically improves
performance by either:

• Reducing the number of tables and hence the
reliance on joins, which consequently speeds up
performance,

• Reducing the number of joins required during
query execution, or

• Reducing the number of rows to be retrieved from
the Primary Data Table.
Areas for Applying De-Normalization Techniques
• Dimensional modelling: dealing with the abundance of star schemas
(the two major schemas are the star schema and the snowflake schema).

• Fast access of time series data for analysis [time has a
hierarchy: days, weeks, months, years, ...]. Queries do not target a
single person. Collapse the tables when time data is normalized.
• Fast aggregate (sum, average etc.) results and
complicated calculations.
• Multidimensional analysis (e.g. geography) in a complex
hierarchy.
• Dealing with few updates but many join queries (hint:
OLTP).
De-normalization will ultimately affect the database size [redundancy
increases] and query performance.
Five principal De-normalization techniques
1. Collapsing Tables.
- Two entities with a One-to-One relationship.
- Two entities with a Many-to-Many relationship.

2. Splitting Tables (Horizontal/Vertical Splitting).

3. Pre-Joining (One-to-Many relationship).

4. Adding Redundant Columns (Reference Data).

5. Derived Attributes (Summary, Total, Balance etc).

De-normalization Techniques

Collapsing Tables
normalized:    (ColA, ColB) and (ColA, ColC)
denormalized:  (ColA, ColB, ColC)

• Reduced storage space.

• Reduced update time.

• Does not change the business view.

• Reduced foreign keys.

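Collapsing two tables that share a key in a one-to-one relationship can be sketched as follows. The dictionaries and the helper collapse are hypothetical; they only mirror the ColA, ColB, ColC layout of the figure.

```python
# Two hypothetical one-to-one tables keyed by ColA.
table_ab = {1: "b1", 2: "b2", 3: "b3"}   # ColA -> ColB
table_ac = {1: "c1", 2: "c2", 3: "c3"}   # ColA -> ColC

def collapse(ab, ac):
    """Merge two one-to-one tables into rows of (ColA, ColB, ColC)."""
    if ab.keys() != ac.keys():
        raise ValueError("tables are not one-to-one on ColA")
    return [(a, ab[a], ac[a]) for a in sorted(ab)]

collapsed = collapse(table_ab, table_ac)
print(collapsed)
```

ColA is stored once per row instead of twice, which is where the storage and foreign-key savings listed above come from.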
Splitting Tables
Vertical split: Table (ColA, ColB, ColC) becomes Table_v1 (ColA, ColB) and Table_v2 (ColA, ColC).
Horizontal split: Table (ColA, ColB, ColC) becomes Table_h1 (ColA, ColB, ColC) and Table_h2 (ColA, ColB, ColC), each holding part of the rows.

Splitting Tables: Horizontal splitting…
Breaks a table into multiple tables based upon
common column values. Example: campus-specific
queries.

GOAL

• Spreading rows to take advantage of
parallelism.

• Grouping data to avoid unnecessary query load
in the WHERE clause.

Splitting Tables: Horizontal splitting
ADVANTAGE
• Enhances security of data.
• Organizing tables differently for different
queries.

• Graceful degradation of the database in case of
table damage.

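A horizontal split on a common column value, such as the campus example mentioned earlier, might look like this. The rows, the campus names, and the helper split_horizontally are assumptions made for the illustration.

```python
from collections import defaultdict

# Hypothetical student rows: (student_id, name, campus)
students = [
    (1, "Akon", "Taxila"),
    (2, "Bkon", "Lahore"),
    (3, "Ckon", "Taxila"),
]

def split_horizontally(rows, column_index):
    """Group rows into one table per distinct value of the chosen column."""
    tables = defaultdict(list)
    for row in rows:
        tables[row[column_index]].append(row)
    return dict(tables)

by_campus = split_horizontally(students, column_index=2)
print(by_campus["Taxila"])
```

A campus-specific query then reads only its own table and no longer needs a WHERE clause on the campus column.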
Splitting Tables: Vertical Splitting
• Infrequently accessed columns become extra
“baggage”, thus degrading performance.

• Very useful for rarely accessed large text columns
with large headers.

• Header size is reduced, allowing more rows per
block, thus reducing I/O.

• For an end user, the split appears as a single table
through a view.

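The last point, a view that hides the split, can be sketched with Python's built-in sqlite3 module. The table names student_core and student_notes and the biography column are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical vertical split: frequently used columns stay in student_core,
-- the rarely read large text column moves to student_notes.
CREATE TABLE student_core  (student_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE student_notes (student_id INTEGER PRIMARY KEY, biography TEXT);
INSERT INTO student_core  VALUES (1, 'Akon'), (2, 'Bkon');
INSERT INTO student_notes VALUES (1, 'long text ...'), (2, 'more long text ...');

-- The view presents both halves as a single table to the end user.
CREATE VIEW student AS
    SELECT c.student_id, c.name, n.biography
    FROM student_core AS c
    JOIN student_notes AS n ON n.student_id = c.student_id;
""")

view_rows = conn.execute("SELECT * FROM student ORDER BY student_id").fetchall()
print(view_rows)
```

Queries that need only the name can read student_core directly and skip the large text column entirely.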
Pre-joining …

• Identify frequent joins and append the tables
together in the physical data model.

• Generally used for 1:M relationships such as master-detail.
Referential integrity (RI) is assumed to exist.

• Additional space is required as the master
information is repeated in the new header
table.

Pre-Joining…
normalized:
  Master: Sale_ID, Sale_date, Sale_person
  Detail: Tx_ID, Sale_ID, Item_ID, Item_Qty, Sale_Rs
  (Master to Detail is 1:M)
denormalized:
  Tx_ID, Sale_ID, Sale_date, Sale_person, Item_ID, Item_Qty, Sale_Rs

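The master-detail layout above can be pre-joined as in the following sketch, which uses Python's built-in sqlite3 module. The column names come from the figure; the table names, column types, and sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE master (Sale_ID INTEGER PRIMARY KEY, Sale_date TEXT, Sale_person TEXT);
CREATE TABLE detail (
    Tx_ID    INTEGER PRIMARY KEY,
    Sale_ID  INTEGER REFERENCES master(Sale_ID),
    Item_ID  INTEGER,
    Item_Qty INTEGER,
    Sale_Rs  INTEGER
);
-- Hypothetical sample data: one sale with two line items.
INSERT INTO master VALUES (1, '2024-01-05', 'Akon');
INSERT INTO detail VALUES (100, 1, 7, 2, 500), (101, 1, 9, 1, 250);

-- Pre-join: store the join result once so later queries need no join.
CREATE TABLE sale_prejoined AS
    SELECT d.Tx_ID, d.Sale_ID, m.Sale_date, m.Sale_person,
           d.Item_ID, d.Item_Qty, d.Sale_Rs
    FROM detail AS d
    JOIN master AS m ON m.Sale_ID = d.Sale_ID;
""")

prejoined = conn.execute("SELECT * FROM sale_prejoined ORDER BY Tx_ID").fetchall()
print(prejoined)
```

The sale date and sales person now repeat on every detail row, which is the additional space cost noted in the previous slide.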
Adding Redundant Columns…
Before: Table_1 (ColA, ColB) and Table_2 (ColA, ColC, ColD, …, ColZ).
After: Table_1’ (ColA, ColB, ColC) and Table_2 (ColA, ColC, ColD, …, ColZ); ColC is copied into Table_1’ and still kept in Table_2.

Adding Redundant Columns…
Columns can also be moved, instead of making them
redundant. Very similar to pre-joining as discussed
earlier.

EXAMPLE
Frequent referencing of a code in one table and the
corresponding description in another table.

• A join is required.

• To eliminate the join, a redundant attribute is added in
the target entity, which is functionally independent of
the primary key.
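The code and description example can be sketched as below. The course code SE-409 is taken from the title slide; the second course, the enrolment rows, and the helper describe_with_join are hypothetical.

```python
# Reference table: course code -> description.
course_description = {
    "SE-409": "Data Warehousing & Data Mining",
    "XX-101": "Hypothetical Course",
}
# Hypothetical enrolment rows: (student_id, course_code)
enrolment = [(10, "SE-409"), (11, "XX-101"), (12, "SE-409")]

def describe_with_join(rows, reference):
    """Normalized access: every read looks the description up in the reference table."""
    return [(sid, code, reference[code]) for sid, code in rows]

# De-normalized: copy the description into each enrolment row once...
enrolment_wide = describe_with_join(enrolment, course_description)

# ...so later reads need no lookup, at the cost of repeated text and
# extra work whenever a description changes.
descriptions = [description for _, _, description in enrolment_wide]
print(descriptions)
```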
Redundant Columns: Surprise

Note that:
• Storage space actually increases (header size
increases), and update overhead increases too
(queries get faster, updates get slower).

Derived Attributes: Example
Business Data Model: #SID, DoB, Degree, Course, Grade, Credits
DWH Data Model: #SID, DoB, Degree, Course, Grade, Credits, GP, Age

GP and Age are the derived attributes:
• Calculated once
• Used frequently

DoB: Date of Birth

Age is also a derived attribute, calculated as Current_Date – DoB
(calculated periodically).

GP (Grade Point) column in the data warehouse data
model is included as a derived value. The formula for
calculating this field is Grade*Credits.
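Both derived attributes can be computed once and stored with the record, as in this sketch. The record values, the fixed reference date, and the function names are hypothetical; only the formulas Grade*Credits and Current_Date – DoB come from the slide.

```python
from datetime import date

def grade_point(grade, credits):
    """GP as given on the slide: Grade * Credits."""
    return grade * credits

def age_in_years(dob, today):
    """Whole years between dob and today (Current_Date - DoB)."""
    years = today.year - dob.year
    # Subtract one year if this year's birthday has not happened yet.
    if (today.month, today.day) < (dob.month, dob.day):
        years -= 1
    return years

# Hypothetical student record; GP and Age are calculated once and stored.
record = {"SID": 1, "DoB": date(2000, 6, 15), "Grade": 3.5, "Credits": 3}
record["GP"] = grade_point(record["Grade"], record["Credits"])
record["Age"] = age_in_years(record["DoB"], today=date(2024, 6, 14))
print(record["GP"], record["Age"])  # prints: 10.5 23
```

GP never changes once Grade and Credits are fixed, while Age has to be refreshed periodically because it depends on the current date.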
Online Analytical Processing (OLAP)

DWH & OLAP

• Relationship between DWH & OLAP

• Data Warehouse & OLAP go together.

• Analysis supported by OLAP

Supporting the human thought process

THOUGHT PROCESS
• An enterprise-wide fall in profit.
• Profit down by a large percentage consistently during last quarter only. Rest is OK.
• What is special about last quarter?
• Products alone doing OK, but North region is most problematic.
• OK. So the problem is the high cost of products purchased in north.

QUERY SEQUENCE
• What was the quarterly sales during last year?
• What was the quarterly sales at regional level during last year?
• What was the quarterly sales at product level during last year?
• What was the monthly sale for last quarter, grouped by products?
• What was the monthly sale for last quarter, grouped by region?
• What was the monthly sale of products in north at store level, grouped by products purchased?

How many such query sequences can be programmed in advance?

Analysis of last example
• Analysis is ad-hoc [no predefined sequence of
queries]
• Analysis is interactive (user driven) [content
changes with a click: continuity of the thought process]
• Analysis is iterative
– Answer to one question leads to a dozen more

• Analysis is directional (more in subsequent slides)
– Drill Down [details: year -> month -> week]
– Roll Up
– Pivot
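Drill down and roll up along the year -> month -> week hierarchy amount to grouping the same facts at a deeper or shallower level. A minimal sketch, with hypothetical sales rows and a hypothetical helper aggregate:

```python
from collections import defaultdict

# Hypothetical sales facts: (year, month, week, amount)
sales = [
    (2023, 1, 1, 100),
    (2023, 1, 2, 150),
    (2023, 2, 5, 200),
    (2024, 1, 1, 50),
]

def aggregate(rows, depth):
    """Sum amounts grouped by the first `depth` levels of the time hierarchy."""
    totals = defaultdict(int)
    for *time_key, amount in rows:
        totals[tuple(time_key[:depth])] += amount
    return dict(totals)

by_year = aggregate(sales, depth=1)    # rolled up to year
by_month = aggregate(sales, depth=2)   # drilled down to year -> month
print(by_year)
print(by_month)
```

Increasing depth drills down into more detail; decreasing it rolls the same numbers back up.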
Challenges…
• Not feasible to write predefined queries.
– Fails to remain user-driven (becomes programmer-driven).

– Fails to remain ad-hoc and hence is not interactive.

• Enable ad-hoc query support

– Business user cannot build his/her own queries
(does not know SQL, should not know it).

– On-the-go SQL generation and execution is too slow.

Challenges
• Contradiction
– Want to compute answers in advance, but don't
know the questions

• Solution
– Compute answers to “all” possible “queries”. But
how?

– NOTE: Queries are multidimensional aggregates at
some level

“All” possible queries (level aggregates)
Level     Members shown
ALL       ALL
Province  Frontier ... Punjab
Division  Mardan ... Peshawar; Lahore ... Multan
District  Peshawar; Lahore
City      Lahore ... Gujranwala
Zone      Defense ... Gulberg
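
Since queries are aggregates at some level, one way to answer "all" of them in advance is to pre-compute the totals for every combination of dimensions. The sketch below is only an illustration under simplifying assumptions: two flat dimensions without hierarchy levels, hypothetical fact rows, and a hypothetical helper all_aggregates.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows: (province, product, units_sold)
facts = [
    ("Punjab", "A", 10),
    ("Punjab", "B", 5),
    ("Frontier", "A", 7),
]
DIMENSIONS = ("province", "product")

def all_aggregates(rows):
    """Pre-compute the sum of units_sold for every subset of the dimensions."""
    cube = {}
    for size in range(len(DIMENSIONS) + 1):
        for dims in combinations(range(len(DIMENSIONS)), size):
            totals = defaultdict(int)
            for row in rows:
                totals[tuple(row[i] for i in dims)] += row[-1]
            cube[tuple(DIMENSIONS[i] for i in dims)] = dict(totals)
    return cube

cube = all_aggregates(facts)
print(cube[()])              # grand total, the "ALL" level
print(cube[("province",)])   # totals per province
```

With n dimensions this produces 2^n groupings, so the number of pre-computed aggregates grows quickly as dimensions and hierarchy levels are added.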

OLAP: Facts & Dimensions

• FACTS: Quantitative values (numbers) or “measures.”

– e.g., units sold, sales $, Co, Kg etc.

• DIMENSIONS: Descriptive categories.

– e.g., time, geography, product etc.

– Dimensions are often organized in hierarchies representing levels
of detail in the data (e.g., week, month, quarter, year,
decade etc.).

Where does OLAP fit in?
[Figure: Transaction Data flows through Data Loading (ELT) into a Data Cube (MOLAP); OLAP and Presentation Tools turn it into Reports for the Decision Maker.]

OLTP vs. OLAP
Feature                         OLTP                             OLAP
Level of data                   Detailed                         Aggregated
Amount of data per transaction  Small                            Large
Views                           Pre-defined [Programmer]         User-defined
Typical write operation         Update, insert, delete           Bulk insert
“age” of data                   Current (60-90 days)             Historical 5-10 years and also current [Active DW]
Number of users                 High                             Low-Med
Tables                          Flat tables [Highly normalized]  Multi-Dimensional tables
Database size                   Med (10^9 B – 10^12 B)           High (10^12 B – 10^15 B)
Query Optimizing                Requires experience              Already “optimized”
Data availability               High                             Low-Med
OLAP FASMI Test
Fast: Delivers information to the user at a fairly constant rate.
Most queries answered in under five seconds.

Analysis: Performs basic numerical and statistical analysis of the
data, pre-defined by an application developer or defined ad hoc
by the user.

Shared: Implements the security requirements necessary for
sharing potentially confidential data across a large user population.

Multi-dimensional: The essential characteristic of OLAP.

Information: Accesses all the data and information necessary and
relevant for the application, wherever it may reside and not limited
by volume.
...from the OLAP Report by Pendse and Creeth.
