DWH Unit 2

Data Warehouse and Mining notes Unit 2

Data Preprocessing in Data Mining

Data preprocessing is a data mining technique used to transform raw data into a useful and efficient format.

Steps Involved in Data Preprocessing:


1. Data Cleaning:
Raw data often contains irrelevant and missing parts. Data cleaning handles these problems, which involves dealing with missing data, noisy data, etc.

• (a). Missing Data:

This situation arises when some values are absent from the dataset. It can be handled in various ways. Some of them are:
1. Ignore the tuples: discard records with missing values (usually only sensible when many values or the class label are missing).
2. Fill in the missing values: manually, with a global constant, or with the attribute mean or most probable value.

• (b). Noisy Data:

Noisy data is meaningless data that cannot be interpreted by machines. It can be generated by faulty data collection, data entry errors, etc. It can be handled in the following ways (a short pandas sketch of missing-value handling and binning follows this list):
1. Binning Method: sort the values, partition them into bins, and smooth each bin by its mean, median, or boundaries.
2. Regression: fit the data to a regression function and use it to smooth out the noise.
3. Clustering: group similar values into clusters; values that fall outside the clusters are treated as outliers.
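
As an illustration of these cleaning steps, here is a minimal pandas sketch; the column names, values, and bin count are illustrative assumptions, not part of the notes:

import pandas as pd
import numpy as np

# Hypothetical raw data with missing values and a noisy outlier (300 in "age")
df = pd.DataFrame({
    "age":    [23, np.nan, 35, 29, np.nan, 41, 300],
    "income": [30000, 42000, np.nan, 51000, 38000, 60000, 45000],
})

# (a) Missing data
df_ignored = df.dropna()                            # 1. ignore (drop) tuples with missing values
df_filled = df.fillna(df.mean(numeric_only=True))   # 2. fill missing values with the attribute mean

# (b) Noisy data: equal-frequency binning, then smoothing each bin by its mean
df_filled["age_bin"] = pd.qcut(df_filled["age"], q=3, duplicates="drop")
df_filled["age_smooth"] = df_filled.groupby("age_bin", observed=True)["age"].transform("mean")

print(df_filled)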

2. Data Transformation:
This step transforms the data into forms appropriate for the mining process. It involves the following techniques (a normalization and discretization sketch follows this list):
1. Normalization: scaling attribute values so that they fall within a small, specified range, such as 0.0 to 1.0.
2. Attribute Selection: constructing new attributes from the given set of attributes to help the mining process.
3. Discretization: replacing raw numeric values with interval or conceptual labels.
4. Concept Hierarchy Generation: replacing low-level concepts (such as "city") with higher-level concepts (such as "country").
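
As a small illustration, the sketch below applies min-max normalization and a simple discretization in pandas; the column name, value range, and labels are assumptions made only for the example:

import pandas as pd

df = pd.DataFrame({"income": [30000, 42000, 51000, 38000, 60000]})

# Normalization: min-max scaling of the attribute into the range [0.0, 1.0]
mn, mx = df["income"].min(), df["income"].max()
df["income_norm"] = (df["income"] - mn) / (mx - mn)

# Discretization: replace raw numeric values with interval (conceptual) labels
df["income_level"] = pd.cut(df["income"], bins=3, labels=["low", "medium", "high"])

print(df)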
3. Data Reduction:
Data mining is used to handle huge amounts of data, and analysis becomes harder as the volume grows. Data reduction techniques obtain a reduced representation of the data set that is much smaller in volume yet produces essentially the same analytical results, which increases storage efficiency and reduces storage and analysis costs.
The main steps of data reduction are (a short sketch follows this list):
1. Data Cube Aggregation: applying aggregation operations to construct a data cube.

2. Attribute Subset Selection: removing irrelevant or redundant attributes.

3. Numerosity Reduction: replacing the data with a smaller representation, such as a model, sample, or histogram.

4. Dimensionality Reduction: applying encoding mechanisms (for example, principal component analysis or wavelet transforms) to reduce the number of attributes.
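
The sketch below illustrates three of these strategies on a toy table; the data, column names, and the choice of PCA for dimensionality reduction are assumptions made only for the example:

import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical data set
df = pd.DataFrame({
    "f1":    [1, 2, 3, 4, 5, 6],
    "f2":    [2, 4, 6, 8, 10, 12],
    "f3":    [5, 5, 5, 5, 5, 5],        # constant attribute, carries no information
    "sales": [10, 20, 30, 40, 50, 60],
})

# Attribute subset selection: drop attributes with zero variance
df_reduced = df.loc[:, df.var() > 0]

# Numerosity reduction: keep a random sample of the tuples
df_sample = df_reduced.sample(frac=0.5, random_state=0)

# Dimensionality reduction: project two correlated attributes onto one principal component
pca = PCA(n_components=1)
components = pca.fit_transform(df_reduced[["f1", "f2"]])

print(df_sample)
print(components)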

What is Data Summarization?

Data summarization can be defined as the presentation of a summary or report of the data in a comprehensible and informative manner. The summary is derived from the entire dataset in order to relay information about it, and a carefully prepared summary conveys the trends and patterns in the dataset in a simplified manner.

As data becomes more complex, there is a growing need to summarize it in order to gain useful information. Data summarization is important in data mining because it can also help in deciding which statistical tests are appropriate, based on the general trends revealed by the summary.
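
For instance, a basic numerical summary of a data set can be produced with pandas; the columns below are purely illustrative:

import pandas as pd

df = pd.DataFrame({
    "region": ["north", "south", "north", "east"],
    "sales":  [120, 95, 143, 80],
})

print(df.describe())                        # count, mean, std, min, quartiles, max of numeric columns
print(df.groupby("region")["sales"].sum())  # summary of the measure by a categorical attribute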
Denormalization
When we normalize tables, we break them into multiple smaller tables, so retrieving data that spans them requires join operations. Denormalization is a technique that mitigates this drawback of normalization.

Denormalization is used by database administrators to optimize the efficiency of their database infrastructure. It adds redundant data to a normalized database to alleviate problems with queries that merge data from several tables into a single result. The concept builds on normalization, which is defined as arranging a database into tables correctly for a particular purpose.
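
A minimal sketch of the idea, using pandas data frames in place of database tables (the table and column names are assumptions for the example): the normalized design keeps customers and orders separate and joins them at query time, while the denormalized copy stores the join result once so that later queries read a single, redundant table:

import pandas as pd

# Normalized design: two tables, joined whenever a query needs both
customers = pd.DataFrame({"cust_id": [1, 2], "city": ["Delhi", "Pune"]})
orders = pd.DataFrame({"order_id": [10, 11, 12], "cust_id": [1, 2, 1], "amount": [250, 400, 150]})

# Denormalization: copy the customer attributes into the order table once,
# so reporting queries no longer need a join
orders_denorm = orders.merge(customers, on="cust_id", how="left")

# A reporting query against the single denormalized table
print(orders_denorm.groupby("city")["amount"].sum())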

Advantages of Denormalization
1. Enhance Query Performance

Queries in a normalized database generally require joining a large number of tables, and the more joins a query has, the slower it runs. To overcome this, we can add redundancy to the database by copying values between parent and child tables, minimizing the number of joins needed for a query.

2. Make the database more convenient to manage

A normalized database does not store the calculated values that applications need; computing them on the fly takes longer and slows query execution. A denormalized database can hold such precomputed values, and fetching queries become simpler because fewer tables need to be consulted.

3. Facilitate and accelerate reporting

Suppose you need certain statistics very frequently, for example client revenues over a given year for any or all clients. Generating such reports from live data requires "searching" throughout the entire database, which takes a long time and significantly slows down the whole system. Keeping such figures in a denormalized reporting table avoids this cost.

Disadvantages of Denormalization

o It requires more storage due to data redundancy.

o It makes updates and inserts of data more expensive.
o It makes update and insert code harder to write.
o Since the same data can be modified in several places, the data can become inconsistent; every piece of duplicated data must be kept up to date. This can be done by using triggers, transactions, and/or stored procedures for all operations that must be performed together.
Multi-Dimensional Data Model
The multi-dimensional data model is a method for organizing data in a database, with a well-defined arrangement and assembly of the database contents.

Unlike relational databases, which let users access data only by issuing queries, the multi-dimensional data model allows users to ask analytical questions associated with market or business trends. It lets users receive answers to their requests rapidly, because the data can be created and examined comparatively quickly.

OLAP (online analytical processing) and data warehousing use multidimensional databases, which are used to present multiple dimensions of the data to users.

The model represents data in the form of data cubes. Data cubes allow the data to be modeled and viewed from many dimensions and perspectives. A cube is defined by dimensions and facts and is represented by a fact table. Facts are numerical measures, and fact tables contain the measures of the related dimension tables or the names of the facts.
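
A small pandas sketch of this idea, with two illustrative dimensions (product, region) and one measure (sales); a pivot table plays the role of a 2-D slice of a data cube:

import pandas as pd

facts = pd.DataFrame({
    "product": ["laptop", "laptop", "phone", "phone"],
    "region":  ["north", "south", "north", "south"],
    "sales":   [100, 80, 150, 120],
})

# View the measure along two dimensions: rows are products, columns are regions,
# and each cell holds the aggregated (total) sales for that combination
cube = facts.pivot_table(index="product", columns="region", values="sales", aggfunc="sum")
print(cube)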

Advantages of Multi-Dimensional Data Model

• A multi-dimensional data model is easy to handle.


• It is easy to maintain.
• Its performance is better than that of normal databases (e.g. relational databases).

Disadvantages of Multi-Dimensional Data Model

• The multi-dimensional data model is somewhat complicated in nature, and it requires professionals to organize and examine the data in the database.
• If the system fails (for example, crashes) while a multi-dimensional data model is in use, the working of the system is greatly affected.
What is Star Schema?
A star schema is the elementary form of a dimensional model, in which data are organized
into facts and dimensions. A fact is an event that is counted or measured. A dimension includes
reference data about the fact.

A star schema is a relational schema whose design represents a multidimensional data model. It is the simplest form of data warehouse schema. It is known as a star schema because the entity-relationship diagram of the schema resembles a star, with points diverging from a central table. The center of the schema consists of a large fact table, and the points of the star are the dimension tables.

Fact Tables
A fact table in a star schema contains the facts and is connected to the dimensions. A fact table has two types of columns: those that contain facts and those that are foreign keys to the dimension tables. The primary key of a fact table is generally a composite key made up of all of its foreign keys.

A fact table may contain either detail-level facts or facts that have been aggregated (fact tables that contain aggregated facts are often called summary tables). A fact table generally contains facts at the same level of aggregation.
Dimension
A dimension is a structure, usually composed of one or more hierarchies, that categorizes data. If a dimension has no hierarchies and levels, it is called a flat dimension or list. The primary key of each dimension table is part of the composite primary key of the fact table. Dimensional attributes help to define the dimensional values; they are generally descriptive, textual values. Dimension tables are usually much smaller than fact tables.

For example, a fact table may store data about sales, while dimension tables store data about geographic regions (markets, cities), clients, products, times, and channels.
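
The sketch below models a tiny star schema with pandas data frames (the table and column names are illustrative): a fact table holding foreign keys and measures, two dimension tables, and a typical query that joins the fact to its dimensions before aggregating:

import pandas as pd

# Dimension tables
dim_product = pd.DataFrame({"product_key": [1, 2], "product_name": ["laptop", "phone"]})
dim_date = pd.DataFrame({"date_key": [20240101, 20240102], "month": ["Jan", "Jan"]})

# Fact table: foreign keys to each dimension plus numeric measures
fact_sales = pd.DataFrame({
    "product_key": [1, 2, 1],
    "date_key":    [20240101, 20240101, 20240102],
    "units_sold":  [3, 5, 2],
    "revenue":     [3000, 2500, 2000],
})

# A typical star-schema query: join the fact table to its dimensions, then aggregate
report = (fact_sales
          .merge(dim_product, on="product_key")
          .merge(dim_date, on="date_key")
          .groupby(["month", "product_name"])["revenue"].sum())
print(report)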

Snowflake Schema
A snowflake schema is a variant of the star schema. "A schema is known as a snowflake if one or more dimension tables do not connect directly to the fact table but must join through other dimension tables."

The snowflake schema is an expansion of the star schema in which each point of the star explodes into more points. It is called a snowflake schema because its diagram resembles a snowflake. Snowflaking is a method of normalizing the dimension tables in a star schema; when all the dimension tables are normalized completely, the resulting structure resembles a snowflake with the fact table in the middle.

Snowflaking is used to improve the performance of specific queries. The schema is diagrammed with each fact surrounded by its associated dimensions, and those dimensions related to further dimensions, branching out into a snowflake pattern.

The snowflake schema consists of one fact table that is linked to many dimension tables, which in turn can be linked to other dimension tables through many-to-one relationships. Tables in a snowflake schema are generally normalized to third normal form. Each dimension table represents exactly one level in a hierarchy.
Advantages of Snowflake Schema
1. The primary advantage of the snowflake schema is the improvement in query performance due to minimized disk storage requirements and smaller lookup tables to join.
2. It provides greater scalability in the interrelationship between dimension levels and components.
3. There is no redundancy, so it is easier to maintain.

Disadvantages of Snowflake Schema

1. The primary disadvantage of the snowflake schema is the additional maintenance effort required due to the increased number of lookup tables.
2. Queries are more complex and hence more difficult to understand.
3. More tables mean more joins, and therefore longer query execution times.

Fact Constellation
A fact constellation is a schema for representing a multidimensional model. It is a collection of multiple fact tables that share some common dimension tables. It can be viewed as a collection of several star schemas and is therefore also known as a galaxy schema. It is one of the widely used schemas for data warehouse design and is more complex than the star and snowflake schemas. Fact constellations are required for complex systems.
Difference between Star and Snowflake Schema:
1. A star schema contains fact tables and dimension tables, while a snowflake schema contains fact tables, dimension tables, and sub-dimension tables.
2. The star schema is a top-down model, while the snowflake schema is a bottom-up model.
3. The star schema uses more space, while the snowflake schema uses less space.
4. Queries take less time to execute in a star schema and more time in a snowflake schema.
5. Normalization is not used in a star schema, while both normalization and denormalization are used in a snowflake schema.
6. The design of a star schema is very simple, while the design of a snowflake schema is complex.
7. The query complexity of a star schema is low, while that of a snowflake schema is higher.
8. A star schema is very simple to understand, while a snowflake schema is more difficult to understand.
9. A star schema has fewer foreign keys, while a snowflake schema has more foreign keys.
10. A star schema has high data redundancy, while a snowflake schema has low data redundancy.
MOLAP vs. ROLAP

1. In MOLAP, information retrieval is fast; in ROLAP, it is comparatively slow.
2. MOLAP uses a sparse array to store data sets; ROLAP uses relational tables.
3. MOLAP is best suited for inexperienced users, since it is very easy to use; ROLAP is best suited for experienced users.
4. MOLAP maintains a separate database for data cubes; ROLAP may not require space other than what is available in the data warehouse.
5. In MOLAP, the DBMS facility is weak; in ROLAP, the DBMS facility is strong.

OLTP VS OLAP
Users: clerks and IT professionals (OLTP) vs. knowledge workers (OLAP)

Function: day-to-day operations vs. decision support

DB design: application-oriented vs. subject-oriented

Data: current, up-to-date, detailed, flat relational, isolated vs. historical, summarized, multidimensional, integrated, consolidated

Usage: repetitive vs. ad hoc

Access: read/write with index/hash on primary key vs. lots of scans

Unit of work: short, simple transactions vs. complex queries

Number of users: thousands vs. hundreds

DB size: 100 MB to GB vs. 100 GB to TB
OLAP Servers:
Online Analytical Processing (OLAP) refers to a set of software tools used for data analysis in order to make business decisions. OLAP provides a platform for gaining insights from data retrieved from multiple database systems at the same time. It is based on a multidimensional data model, which enables
users to extract and view data from various perspectives. A multidimensional database is used to store OLAP data. Many Business Intelligence (BI) applications rely on OLAP technology.
Type of OLAP servers:
The three major types of OLAP servers are as follows:
• ROLAP
• MOLAP
• HOLAP

Relational OLAP (ROLAP):


Relational On-Line Analytical Processing (ROLAP) is primarily used for data stored in a relational
database, where both the base data and dimension tables are stored as relational tables. ROLAP servers
are used to bridge the gap between the relational back-end server and the client’s front-end tools. ROLAP
servers store and manage warehouse data using RDBMS, and OLAP middleware fills in the gaps.
Benefits:
• It is compatible with data warehouses and OLTP systems.
• The data size limitation of ROLAP technology is determined by the underlying RDBMS. As a
result, ROLAP does not limit the amount of data that can be stored.
Limitations:
• SQL functionality is constrained.
• It’s difficult to keep aggregate tables up to date.
Multidimensional OLAP (MOLAP):
Through array-based multidimensional storage engines, Multidimensional On-Line Analytical Processing (MOLAP) supports multidimensional views of data. Storage utilization in multidimensional data stores may be low if the data set is sparse.
MOLAP stores data on disk in a specialized multidimensional array structure, and OLAP operations exploit the random-access capability of arrays. Dimension instances determine the array element, and the data or measured value associated with each cell is typically stored in the corresponding array element. The multidimensional array is usually stored in a linear allocation according to a nested traversal of the axes in some predetermined order.
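
As a rough illustration of this array-based storage (the dimensions and values are assumptions for the example), a dense NumPy array can play the role of the MOLAP cube, with dimension instances determining the cell that holds each measured value:

import numpy as np

# Two dimensions with fixed instances
products = ["laptop", "phone", "tablet"]
regions = ["north", "south"]

# MOLAP-style storage: a dense multidimensional array, one cell per dimension combination
cube = np.zeros((len(products), len(regions)))

# Loading a measured value: the dimension instances determine the array element
cube[products.index("phone"), regions.index("south")] = 120.0

# Aggregation becomes a direct array operation, e.g. total sales per product
print(cube.sum(axis=1))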
Benefits:
• Suitable for slicing and dicing operations.
• Outperforms ROLAP when data is dense.
• Capable of performing complex calculations.
Limitations:
• It is difficult to change the dimensions without re-aggregating.
• Since all calculations are performed when the cube is built, a large amount of data cannot be stored
in the cube itself.
Hybrid OLAP (HOLAP):
Hybrid On-Line Analytical Processing (HOLAP) combines ROLAP and MOLAP, offering the greater scalability of ROLAP and the faster computation of MOLAP. HOLAP servers are capable of storing large amounts of detailed data. On the one hand, HOLAP benefits from ROLAP's greater scalability; on the other hand, it makes use of cube technology for faster performance and for summary-type information. Because detailed data is kept in the relational database, the cubes are smaller than in MOLAP.
Benefits:
• HOLAP combines the benefits of MOLAP and ROLAP.
• Provide quick access at all aggregation levels.
Limitations
• Because it supports both MOLAP and ROLAP servers, HOLAP architecture is extremely complex.
• There is a greater likelihood of overlap, particularly in their functionalities.

What is Data Cube?

Data that is grouped or combined into multidimensional matrices is called a data cube. The data cube method has a few alternative names and variants, such as "multidimensional databases," "materialized views," and "OLAP (On-Line Analytical Processing)."

The general idea of this approach is to materialize certain expensive computations that are frequently queried.

For example, a relation with the schema sales (part, supplier, customer, sale-price) can be materialized into a set of eight views, one for each subset of the grouping attributes: psc denotes the view consisting of aggregate function values (such as total sales) computed by grouping the three attributes part, supplier, and customer; p denotes the view composed of the corresponding aggregate function values computed by grouping part alone; and so on.
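
A minimal sketch of materializing these eight views with pandas (the sample rows are invented for illustration): each view is the aggregate of sale-price grouped by one subset of {part, supplier, customer}, from the full psc view down to the grand total:

import pandas as pd
from itertools import combinations

sales = pd.DataFrame({
    "part":       ["p1", "p1", "p2", "p2"],
    "supplier":   ["s1", "s2", "s1", "s2"],
    "customer":   ["c1", "c1", "c2", "c2"],
    "sale_price": [10, 20, 30, 40],
})

dims = ["part", "supplier", "customer"]
views = {}
for r in range(len(dims), -1, -1):
    for group in combinations(dims, r):
        name = "".join(d[0] for d in group) or "none"   # psc, ps, pc, sc, p, s, c, none
        if group:
            views[name] = sales.groupby(list(group))["sale_price"].sum()
        else:
            views[name] = sales["sale_price"].sum()     # the empty grouping: overall total

print(views["psc"])   # total sales by part, supplier and customer
print(views["p"])     # total sales by part alone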
