DWH Unit 2
Data preprocessing is a data mining technique used to transform raw data into a useful and
efficient format.
2. Data Transformation:
This step transforms the data into forms suitable for the mining process. It
involves the following techniques:
1. Normalization:
2. Attribute Selection:
3. Discretization:
4. Concept Hierarchy Generation:
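As a rough illustration of two of the techniques listed above, the sketch below shows normalization and discretization on a small list of values. It assumes min-max scaling as the normalization method and fixed upper bounds for discretization; the notes do not say which variants are intended, and the function names, the ages, and the bin labels are all hypothetical.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly into the range [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so map every one to the lower bound.
        return [new_min for _ in values]
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]


def discretize(values, bins):
    """Replace each numeric value with the label of the interval it falls in.

    bins is a list of (upper_bound, label) pairs sorted by upper_bound.
    Values above the last bound receive the last label.
    """
    labels = []
    for v in values:
        for upper, label in bins:
            if v <= upper:
                labels.append(label)
                break
        else:
            labels.append(bins[-1][1])
    return labels


ages = [18, 25, 40, 62]  # hypothetical attribute values
normalized = min_max_normalize(ages)
age_groups = discretize(ages, [(25, "young"), (50, "middle"), (120, "senior")])
print(normalized)
print(age_groups)
```

Normalization keeps the values numeric but puts them on a common scale, while discretization trades precision for a small set of interval labels.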
3. Data Reduction:
Data mining is used to handle huge amounts of data, and analysis becomes harder as the
volume of data grows. Data reduction techniques are used to address this problem. They
aim to increase storage efficiency and to reduce data storage and analysis
costs.
The various steps to data reduction are:
1. Data Cube Aggregation:
3. Numerosity Reduction:
4. Dimensionality Reduction:
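A minimal sketch of the first step, data cube aggregation, is given here: detailed records are replaced by totals at a coarser level. The quarterly figures and the function name are made up for illustration.

```python
from collections import defaultdict

# Hypothetical quarterly sales records: (year, quarter, amount).
quarterly_sales = [
    (2022, "Q1", 100), (2022, "Q2", 150), (2022, "Q3", 120), (2022, "Q4", 130),
    (2023, "Q1", 110), (2023, "Q2", 160), (2023, "Q3", 140), (2023, "Q4", 190),
]


def aggregate_by_year(records):
    """Sum the amounts of all quarters that belong to the same year."""
    totals = defaultdict(int)
    for year, _quarter, amount in records:
        totals[year] += amount
    return dict(totals)


yearly_sales = aggregate_by_year(quarterly_sales)
print(yearly_sales)  # {2022: 500, 2023: 600}
```

Eight stored rows shrink to two, which is enough for any analysis that only needs yearly figures.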
Data summarization is the presentation of a summary or report of the generated
data in a comprehensible and informative manner. The summary is derived from the
entire dataset, so that it conveys the trends and patterns of the dataset in a
simplified form.
Data has become more complex; hence, there is a need to summarize it to gain useful
information. Data summarization is important in data mining because it can also help in deciding
which statistical tests to use, based on the general trends that the summary reveals.
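One simple form of summarization is a set of descriptive statistics. The sketch below uses Python's standard `statistics` module; the choice of statistics and the sample order amounts are assumptions made for illustration, not something the notes prescribe.

```python
import statistics


def summarize(values):
    """Return a small descriptive summary of a numeric column."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }


order_amounts = [12, 15, 15, 18, 40]  # hypothetical data
summary = summarize(order_amounts)
for name, value in summary.items():
    print(name, value)
```

Here the mean (20) lies well above the median (15), which hints at a skewed distribution. That is the kind of general trend that can guide the choice of a statistical test.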
Denormalization
When we normalize tables, we break them into multiple smaller tables. Retrieving data that
spans several of those tables then requires join operations. Denormalization is the
technique used to eliminate this drawback of normalization.
Advantages of Denormalization
1. Enhance Query Performance
Queries against a normalized database often have to join a large number of tables, and
the more joins a query needs, the slower it runs. To overcome this, we can add redundancy to the
database by copying values between parent and child tables, minimizing the number of joins needed
for a query.
A normalized database does not store the calculated values that applications need. Calculating
these values on the fly takes longer and slows down the execution of the query. In a denormalized
database, queries can be simpler because they need to look at fewer tables.
Suppose you need certain statistics very frequently. It requires a long time to create them from live data
and slows down the entire system. Suppose you want to monitor client revenues over a certain year for
any or all clients. Generating such reports from live data will require "searching" throughout the entire
database, significantly slowing it down.
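The client revenue report can be sketched with an in-memory SQLite database. All table and column names below (`clients`, `orders`, `client_revenue_by_year`) are hypothetical; the point is only that a denormalized summary table answers the report without a join or an aggregation at query time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized tables: revenue per client must be computed with a join.
    CREATE TABLE clients (client_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        order_id   INTEGER PRIMARY KEY,
        client_id  INTEGER REFERENCES clients(client_id),
        order_year INTEGER,
        amount     REAL
    );
    INSERT INTO clients VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO orders VALUES
        (1, 1, 2023, 100.0), (2, 1, 2023, 50.0),
        (3, 2, 2023, 75.0),  (4, 1, 2024, 20.0);

    -- Denormalized summary table: the client name is copied in and the
    -- revenue is precomputed once instead of on every report.
    CREATE TABLE client_revenue_by_year AS
        SELECT c.client_id, c.name, o.order_year, SUM(o.amount) AS revenue
        FROM orders o JOIN clients c ON c.client_id = o.client_id
        GROUP BY c.client_id, c.name, o.order_year;
""")

# The report reads a single table.
report_2023 = conn.execute(
    "SELECT name, revenue FROM client_revenue_by_year "
    "WHERE order_year = 2023 ORDER BY name"
).fetchall()
print(report_2023)  # [('Acme', 150.0), ('Globex', 75.0)]
```

The price of the faster report is redundancy: the summary table has to be refreshed whenever the underlying orders change.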
Disadvantages of Denormalization
The multidimensional data model allows users to ask analytical questions associated
with market or business trends, unlike relational databases, which allow users to access data in
the form of queries. It lets users receive answers to their requests rapidly, because
the data can be created and examined comparatively fast.
OLAP (online analytical processing) and data warehousing use multidimensional databases, which
show multiple dimensions of the data to users.
The model represents data in the form of data cubes. Data cubes allow the data to be modeled and
viewed from many dimensions and perspectives. A data cube is defined by dimensions and facts and is
represented by a fact table. Facts are numerical measures, and fact tables contain the measures of
the related dimension tables or the names of the facts.
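To make the idea of viewing one measure from several dimensions concrete, the following sketch stores a tiny cube as a list of fact rows and aggregates it along chosen dimensions. The dimension names, the rows, and the `roll_up` helper are hypothetical.

```python
from collections import defaultdict

DIMENSIONS = ("product", "city", "quarter")

# Hypothetical fact rows: one value per dimension, then the measure (units sold).
facts = [
    ("laptop", "Delhi", "Q1", 10),
    ("laptop", "Mumbai", "Q1", 7),
    ("phone", "Delhi", "Q1", 20),
    ("phone", "Delhi", "Q2", 25),
]


def roll_up(rows, keep):
    """Sum the measure over every dimension that is not listed in keep."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[i] for i in positions)
        totals[key] += row[-1]
    return dict(totals)


by_product = roll_up(facts, ["product"])
by_city_quarter = roll_up(facts, ["city", "quarter"])
print(by_product)       # {('laptop',): 17, ('phone',): 45}
print(by_city_quarter)
```

The same facts answer both questions; only the set of dimensions kept in the grouping key changes.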
• The multidimensional data model is somewhat complicated, and it requires professionals to
recognize and examine the data in the database.
• While a multidimensional data model is in operation, system caching has a large effect
on how the system performs.
What is Star Schema?
A star schema is the elementary form of a dimensional model, in which data are organized
into facts and dimensions. A fact is an event that is counted or measured. A dimension includes
reference data about the fact.
A star schema is a relational schema whose design represents a
multidimensional data model. The star schema is the simplest data warehouse schema. It is known as a star
schema because its entity-relationship diagram resembles a star, with points diverging
from a central table. The center of the schema consists of a large fact table, and the points of the star
are the dimension tables.
Fact Tables
A fact table is a table in a star schema that contains facts and is connected to dimensions. A fact
table has two types of columns: those that hold facts and those that are foreign keys to the dimension
tables. The primary key of a fact table is generally a composite key made up of all of its foreign keys.
A fact table may contain either detail-level facts or facts that have been aggregated (fact tables that
contain aggregated facts are often called summary tables instead). A fact table generally contains facts
at the same level of aggregation.
Dimension
A dimension is a structure usually composed of one or more hierarchies that categorize data. If a
dimension has no hierarchies or levels, it is called a flat dimension or list. The primary key of
each dimension table is part of the composite primary key of the fact table. Dimensional
attributes help to define the dimensional value. They are generally descriptive, textual values.
Dimension tables are usually smaller than the fact table.
For example, fact tables store data about sales, while dimension tables store data about the
geographic region (markets, cities), clients, products, times, and channels.
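The fact and dimension tables described above can be sketched as a small star schema in an in-memory SQLite database. The table names, columns, and rows are hypothetical; the sketch only illustrates a fact table whose composite primary key is made up of its foreign keys, surrounded by dimension tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: small, descriptive, one primary key each.
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_name TEXT);
    CREATE TABLE dim_city (city_id INTEGER PRIMARY KEY, city_name TEXT, market TEXT);
    CREATE TABLE dim_time (time_id INTEGER PRIMARY KEY, year INTEGER, quarter TEXT);

    -- Fact table: foreign keys to every dimension plus numeric measures.
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        city_id    INTEGER REFERENCES dim_city(city_id),
        time_id    INTEGER REFERENCES dim_time(time_id),
        units_sold INTEGER,
        revenue    REAL,
        PRIMARY KEY (product_id, city_id, time_id)
    );

    INSERT INTO dim_product VALUES (1, 'Laptop'), (2, 'Phone');
    INSERT INTO dim_city VALUES (1, 'Delhi', 'North'), (2, 'Mumbai', 'West');
    INSERT INTO dim_time VALUES (1, 2024, 'Q1'), (2, 2024, 'Q2');
    INSERT INTO fact_sales VALUES
        (1, 1, 1, 2, 2000.0), (2, 1, 1, 5, 1500.0), (2, 2, 2, 3, 900.0);
""")

# Each dimension is exactly one join away from the fact table.
revenue_by_city = conn.execute("""
    SELECT c.city_name, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_city c ON c.city_id = f.city_id
    GROUP BY c.city_name
    ORDER BY c.city_name
""").fetchall()
print(revenue_by_city)  # [('Delhi', 3500.0), ('Mumbai', 900.0)]
```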
Snowflake Schema
A snowflake schema is a variant of the star schema. "A schema is known as a snowflake if one or more
dimension tables do not connect directly to the fact table but must join through other dimension
tables."
The snowflake schema is an expansion of the star schema in which each point of the star explodes into
more points. It is called a snowflake schema because its diagram resembles a
snowflake. Snowflaking is a method of normalizing the dimension tables in a star schema. When we
normalize all the dimension tables entirely, the resultant structure resembles a snowflake with the fact
table in the middle.
Snowflaking is used to improve the performance of specific queries. The schema is diagrammed with each
fact surrounded by its associated dimensions, and those dimensions are related to other dimensions,
branching out into a snowflake pattern.
The snowflake schema consists of one fact table linked to many dimension tables, which can in turn
be linked to other dimension tables through many-to-one relationships. Tables in a snowflake schema
are generally normalized to the third normal form. Each dimension table represents exactly one level in
a hierarchy.
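A minimal sketch of snowflaking, using hypothetical tables: the product dimension is normalized so that the category level lives in its own table, and a query by category has to join through the product dimension to reach it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One table per hierarchy level: category is the parent of product.
    CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, category_name TEXT);
    CREATE TABLE dim_product (
        product_id   INTEGER PRIMARY KEY,
        product_name TEXT,
        category_id  INTEGER REFERENCES dim_category(category_id)
    );
    -- The fact table links only to the lowest level, dim_product.
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        units_sold INTEGER
    );

    INSERT INTO dim_category VALUES (1, 'Electronics'), (2, 'Furniture');
    INSERT INTO dim_product VALUES (1, 'Laptop', 1), (2, 'Phone', 1), (3, 'Desk', 2);
    INSERT INTO fact_sales VALUES (1, 4), (2, 6), (3, 1);
""")

# dim_category is reached through dim_product, not directly from the fact table.
units_by_category = conn.execute("""
    SELECT c.category_name, SUM(f.units_sold)
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_category c ON c.category_id = p.category_id
    GROUP BY c.category_name
    ORDER BY c.category_name
""").fetchall()
print(units_by_category)  # [('Electronics', 10), ('Furniture', 1)]
```

Each category name is stored once, which removes redundancy, but the query needs one more join than it would against a single flat product dimension.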
Advantage of Snowflake Schema
1. The primary advantage of the snowflake schema is the improvement in query performance due
to minimized disk storage requirements and joining smaller lookup tables.
2. It provides greater scalability in the interrelationship between dimension levels and components.
3. No redundancy, so it is easier to maintain.
Fact Constellation
A fact constellation is a schema for representing a multidimensional model. It is a collection of multiple
fact tables that share some common dimension tables. It can be viewed as a collection of several star
schemas and hence is also known as a galaxy schema. It is one of the widely used schemas for data
warehouse design, and it is much more complex than the star and snowflake schemas. For complex
systems, we require fact constellations.
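As a small illustration, the sketch below builds two fact tables that share one dimension table. The sales and shipments facts, the table names, and the rows are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Shared dimension table.
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_name TEXT);

    -- Two fact tables, each the center of its own star.
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        units_sold INTEGER
    );
    CREATE TABLE fact_shipments (
        product_id    INTEGER REFERENCES dim_product(product_id),
        units_shipped INTEGER
    );

    INSERT INTO dim_product VALUES (1, 'Laptop'), (2, 'Phone');
    INSERT INTO fact_sales VALUES (1, 4), (1, 1), (2, 6);
    INSERT INTO fact_shipments VALUES (1, 5), (2, 4), (2, 3);
""")

# The shared dimension lets measures from both fact tables be compared.
# Each fact table is aggregated separately so that rows are not multiplied.
sold_vs_shipped = conn.execute("""
    SELECT p.product_name,
           (SELECT SUM(s.units_sold) FROM fact_sales s
             WHERE s.product_id = p.product_id),
           (SELECT SUM(h.units_shipped) FROM fact_shipments h
             WHERE h.product_id = p.product_id)
    FROM dim_product p
    ORDER BY p.product_name
""").fetchall()
print(sold_vs_shipped)  # [('Laptop', 5, 5), ('Phone', 6, 7)]
```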
Difference between Star and Snowflake Schema:
S.NO  Star Schema                        Snowflake Schema
1.    Contains fact tables and           Contains fact tables, dimension tables,
      dimension tables.                  and sub-dimension tables.
4.    Takes less time to execute         Takes more time than the star schema
      queries.                           to execute queries.
10.   Has high data redundancy.          Has low data redundancy.
MOLAP vs. ROLAP
MOLAP                                    ROLAP
Best suited for inexperienced users,     Best suited for experienced users.
since it is very easy to use.
Maintains a separate database for        May not require space other than what is
data cubes.                              available in the data warehouse.
OLTP VS OLAP
        OLTP                             OLAP
Users   Clerks, IT professionals         Knowledge workers
The general idea of this approach is to materialize certain expensive computations that are frequently
queried.
For example, a relation with the schema sales(part, supplier, customer, sale-price) can be
materialized into a set of eight views, as shown in the figure. Here psc denotes a view consisting of
aggregate function values (such as total-sales) computed by grouping on the three attributes part,
supplier, and customer; p denotes a view of the corresponding aggregate function values computed by
grouping on part alone; and so on.
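The eight views correspond to the 2^3 subsets of the three grouping attributes. A minimal sketch that precomputes all of them follows; the sample rows are hypothetical, total sales is assumed as the aggregate, and the label "none" for the view with no grouping attribute is an assumption, since the notes name only psc and p.

```python
from collections import defaultdict
from itertools import combinations

ATTRIBUTES = ("part", "supplier", "customer")

# Hypothetical rows of sales(part, supplier, customer, sale-price).
sales = [
    ("bolt", "S1", "C1", 10.0),
    ("bolt", "S1", "C2", 5.0),
    ("nut", "S2", "C1", 2.0),
]


def materialize_views(rows):
    """Precompute total sales for every subset of the grouping attributes."""
    views = {}
    for size in range(len(ATTRIBUTES), -1, -1):
        for group in combinations(range(len(ATTRIBUTES)), size):
            # View name from attribute initials, e.g. (0, 1, 2) -> "psc".
            name = "".join(ATTRIBUTES[i][0] for i in group) or "none"
            totals = defaultdict(float)
            for row in rows:
                totals[tuple(row[i] for i in group)] += row[-1]
            views[name] = dict(totals)
    return views


views = materialize_views(sales)
print(sorted(views))   # ['c', 'none', 'p', 'pc', 'ps', 'psc', 's', 'sc']
print(views["p"])      # {('bolt',): 15.0, ('nut',): 2.0}
print(views["none"])   # {(): 17.0}
```

Once the views are stored, a query such as total sales per part becomes a lookup in `views["p"]` instead of a scan over the base relation.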