DWH Unit 2

Data Warehouse and Mining notes Unit 2

Data Preprocessing in Data Mining

Data preprocessing is a data mining technique used to transform raw data into a useful and efficient format.

Steps Involved in Data Preprocessing:


1. Data Cleaning:
Raw data often contains irrelevant and missing parts. Data cleaning handles these problems, which involves dealing with missing data, noisy data, etc.

• (a). Missing Data:

This situation arises when some values are absent from the dataset. It can be handled in various ways. Some of them are:
1. Ignore the tuples: discard records with missing values (usually only sensible when many values or the class label are missing).
2. Fill in the missing values: manually, with a global constant, or with the attribute mean or most probable value.

• (b). Noisy Data:

Noisy data is meaningless data that cannot be interpreted by machines. It can be generated by faulty data collection, data entry errors, etc. It can be handled in the following ways (a short pandas sketch of missing-value handling and binning follows this list):
1. Binning Method: sort the values, partition them into bins, and smooth each bin by its mean, median, or boundaries.
2. Regression: fit the data to a regression function and use it to smooth out the noise.
3. Clustering: group similar values into clusters; values that fall outside the clusters are treated as outliers.
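
As an illustration of these cleaning steps, here is a minimal pandas sketch; the column names, values, and bin count are illustrative assumptions, not part of the notes:

import pandas as pd
import numpy as np

# Hypothetical raw data with missing values and a noisy outlier (300 in "age")
df = pd.DataFrame({
    "age":    [23, np.nan, 35, 29, np.nan, 41, 300],
    "income": [30000, 42000, np.nan, 51000, 38000, 60000, 45000],
})

# (a) Missing data
df_ignored = df.dropna()                            # 1. ignore (drop) tuples with missing values
df_filled = df.fillna(df.mean(numeric_only=True))   # 2. fill missing values with the attribute mean

# (b) Noisy data: equal-frequency binning, then smoothing each bin by its mean
df_filled["age_bin"] = pd.qcut(df_filled["age"], q=3, duplicates="drop")
df_filled["age_smooth"] = df_filled.groupby("age_bin", observed=True)["age"].transform("mean")

print(df_filled)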

2. Data Transformation:
This step transforms the data into forms appropriate for the mining process. It involves the following techniques (a normalization and discretization sketch follows this list):
1. Normalization: scaling attribute values so that they fall within a small, specified range, such as 0.0 to 1.0.
2. Attribute Selection: constructing new attributes from the given set of attributes to help the mining process.
3. Discretization: replacing raw numeric values with interval or conceptual labels.
4. Concept Hierarchy Generation: replacing low-level concepts (such as "city") with higher-level concepts (such as "country").
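
As a small illustration, the sketch below applies min-max normalization and a simple discretization in pandas; the column name, value range, and labels are assumptions made only for the example:

import pandas as pd

df = pd.DataFrame({"income": [30000, 42000, 51000, 38000, 60000]})

# Normalization: min-max scaling of the attribute into the range [0.0, 1.0]
mn, mx = df["income"].min(), df["income"].max()
df["income_norm"] = (df["income"] - mn) / (mx - mn)

# Discretization: replace raw numeric values with interval (conceptual) labels
df["income_level"] = pd.cut(df["income"], bins=3, labels=["low", "medium", "high"])

print(df)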
3. Data Reduction:
Data mining is used to handle huge amounts of data, and analysis becomes harder as the volume grows. Data reduction techniques obtain a reduced representation of the data set that is much smaller in volume yet produces essentially the same analytical results, which increases storage efficiency and reduces storage and analysis costs.
The main steps of data reduction are (a short sketch follows this list):
1. Data Cube Aggregation: applying aggregation operations to construct a data cube.

2. Attribute Subset Selection: removing irrelevant or redundant attributes.

3. Numerosity Reduction: replacing the data with a smaller representation, such as a model, sample, or histogram.

4. Dimensionality Reduction: applying encoding mechanisms (for example, principal component analysis or wavelet transforms) to reduce the number of attributes.
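
The sketch below illustrates three of these strategies on a toy table; the data, column names, and the choice of PCA for dimensionality reduction are assumptions made only for the example:

import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical data set
df = pd.DataFrame({
    "f1":    [1, 2, 3, 4, 5, 6],
    "f2":    [2, 4, 6, 8, 10, 12],
    "f3":    [5, 5, 5, 5, 5, 5],        # constant attribute, carries no information
    "sales": [10, 20, 30, 40, 50, 60],
})

# Attribute subset selection: drop attributes with zero variance
df_reduced = df.loc[:, df.var() > 0]

# Numerosity reduction: keep a random sample of the tuples
df_sample = df_reduced.sample(frac=0.5, random_state=0)

# Dimensionality reduction: project two correlated attributes onto one principal component
pca = PCA(n_components=1)
components = pca.fit_transform(df_reduced[["f1", "f2"]])

print(df_sample)
print(components)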

What is Data Summarization?

Data summarization can be defined as the presentation of a summary or report of the data in a comprehensible and informative manner. The summary is derived from the entire dataset in order to relay information about it, and a carefully prepared summary conveys the trends and patterns in the dataset in a simplified manner.

As data becomes more complex, there is a growing need to summarize it in order to gain useful information. Data summarization is important in data mining because it can also help in deciding which statistical tests are appropriate, based on the general trends revealed by the summary.
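
For instance, a basic numerical summary of a data set can be produced with pandas; the columns below are purely illustrative:

import pandas as pd

df = pd.DataFrame({
    "region": ["north", "south", "north", "east"],
    "sales":  [120, 95, 143, 80],
})

print(df.describe())                        # count, mean, std, min, quartiles, max of numeric columns
print(df.groupby("region")["sales"].sum())  # summary of the measure by a categorical attribute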
Denormalization
When we normalize tables, we break them into multiple smaller tables, so retrieving data that spans them requires join operations. Denormalization is a technique that mitigates this drawback of normalization.

Denormalization is used by database administrators to optimize the efficiency of their database infrastructure. It adds redundant data to a normalized database to alleviate problems with queries that merge data from several tables into a single result. The concept builds on normalization, which is defined as arranging a database into tables correctly for a particular purpose.
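
A minimal sketch of the idea, using pandas data frames in place of database tables (the table and column names are assumptions for the example): the normalized design keeps customers and orders separate and joins them at query time, while the denormalized copy stores the join result once so that later queries read a single, redundant table:

import pandas as pd

# Normalized design: two tables, joined whenever a query needs both
customers = pd.DataFrame({"cust_id": [1, 2], "city": ["Delhi", "Pune"]})
orders = pd.DataFrame({"order_id": [10, 11, 12], "cust_id": [1, 2, 1], "amount": [250, 400, 150]})

# Denormalization: copy the customer attributes into the order table once,
# so reporting queries no longer need a join
orders_denorm = orders.merge(customers, on="cust_id", how="left")

# A reporting query against the single denormalized table
print(orders_denorm.groupby("city")["amount"].sum())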

Advantages of Denormalization
1. Enhance Query Performance

Queries in a normalized database generally require joining a large number of tables, and the more joins a query has, the slower it runs. To overcome this, we can add redundancy to the database by copying values between parent and child tables, minimizing the number of joins needed for a query.

2. Make the database more convenient to manage

A normalized database does not store the calculated values that applications need; computing them on the fly takes longer and slows query execution. A denormalized database can hold such precomputed values, and fetching queries become simpler because fewer tables need to be consulted.

3. Facilitate and accelerate reporting

Suppose you need certain statistics very frequently, for example client revenues over a given year for any or all clients. Generating such reports from live data requires "searching" throughout the entire database, which takes a long time and significantly slows down the whole system. Keeping such figures in a denormalized reporting table avoids this cost.

Disadvantages of Denormalization

o It requires more storage due to data redundancy.

o It makes updates and inserts of data more expensive.
o It makes update and insert code harder to write.
o Since the same data can be modified in several places, the data can become inconsistent; every piece of duplicated data must be kept up to date. This can be done by using triggers, transactions, and/or stored procedures for all operations that must be performed together.
Multi-Dimensional Data Model
The multi-dimensional data model is a method for organizing data in a database, with a well-defined arrangement and assembly of the database contents.

Unlike relational databases, which let users access data only by issuing queries, the multi-dimensional data model allows users to ask analytical questions associated with market or business trends. It lets users receive answers to their requests rapidly, because the data can be created and examined comparatively quickly.

OLAP (online analytical processing) and data warehousing use multidimensional databases, which are used to present multiple dimensions of the data to users.

The model represents data in the form of data cubes. Data cubes allow the data to be modeled and viewed from many dimensions and perspectives. A cube is defined by dimensions and facts and is represented by a fact table. Facts are numerical measures, and fact tables contain the measures of the related dimension tables or the names of the facts.
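
A small pandas sketch of this idea, with two illustrative dimensions (product, region) and one measure (sales); a pivot table plays the role of a 2-D slice of a data cube:

import pandas as pd

facts = pd.DataFrame({
    "product": ["laptop", "laptop", "phone", "phone"],
    "region":  ["north", "south", "north", "south"],
    "sales":   [100, 80, 150, 120],
})

# View the measure along two dimensions: rows are products, columns are regions,
# and each cell holds the aggregated (total) sales for that combination
cube = facts.pivot_table(index="product", columns="region", values="sales", aggfunc="sum")
print(cube)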

Advantages of Multi-Dimensional Data Model

• A multi-dimensional data model is easy to handle.


• It is easy to maintain.
• Its performance is better than that of normal databases (e.g. relational databases).

Disadvantages of Multi-Dimensional Data Model

• The multi-dimensional data model is somewhat complicated in nature, and it requires professionals to organize and examine the data in the database.
• If the system fails (for example, crashes) while a multi-dimensional data model is in use, the working of the system is greatly affected.
What is Star Schema?
A star schema is the elementary form of a dimensional model, in which data are organized
into facts and dimensions. A fact is an event that is counted or measured. A dimension includes
reference data about the fact.

A star schema is a relational schema whose design represents a multidimensional data model. It is the simplest form of data warehouse schema. It is known as a star schema because the entity-relationship diagram of the schema resembles a star, with points diverging from a central table. The center of the schema consists of a large fact table, and the points of the star are the dimension tables.

Fact Tables
A fact table in a star schema contains the facts and is connected to the dimensions. A fact table has two types of columns: those that contain facts and those that are foreign keys to the dimension tables. The primary key of a fact table is generally a composite key made up of all of its foreign keys.

A fact table may contain either detail-level facts or facts that have been aggregated (fact tables that contain aggregated facts are often called summary tables). A fact table generally contains facts at the same level of aggregation.
Dimension
A dimension is a structure, usually composed of one or more hierarchies, that categorizes data. If a dimension has no hierarchies and levels, it is called a flat dimension or list. The primary key of each dimension table is part of the composite primary key of the fact table. Dimensional attributes help to define the dimensional values; they are generally descriptive, textual values. Dimension tables are usually much smaller than fact tables.

For example, a fact table may store data about sales, while dimension tables store data about geographic regions (markets, cities), clients, products, times, and channels.
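
The sketch below models a tiny star schema with pandas data frames (the table and column names are illustrative): a fact table holding foreign keys and measures, two dimension tables, and a typical query that joins the fact to its dimensions before aggregating:

import pandas as pd

# Dimension tables
dim_product = pd.DataFrame({"product_key": [1, 2], "product_name": ["laptop", "phone"]})
dim_date = pd.DataFrame({"date_key": [20240101, 20240102], "month": ["Jan", "Jan"]})

# Fact table: foreign keys to each dimension plus numeric measures
fact_sales = pd.DataFrame({
    "product_key": [1, 2, 1],
    "date_key":    [20240101, 20240101, 20240102],
    "units_sold":  [3, 5, 2],
    "revenue":     [3000, 2500, 2000],
})

# A typical star-schema query: join the fact table to its dimensions, then aggregate
report = (fact_sales
          .merge(dim_product, on="product_key")
          .merge(dim_date, on="date_key")
          .groupby(["month", "product_name"])["revenue"].sum())
print(report)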

Snowflake Schema
A snowflake schema is a variant of the star schema. "A schema is known as a snowflake if one or more dimension tables do not connect directly to the fact table but must join through other dimension tables."

The snowflake schema is an expansion of the star schema in which each point of the star explodes into more points. It is called a snowflake schema because its diagram resembles a snowflake. Snowflaking is a method of normalizing the dimension tables in a star schema; when all the dimension tables are normalized completely, the resulting structure resembles a snowflake with the fact table in the middle.

Snowflaking is used to improve the performance of specific queries. The schema is diagrammed with each fact surrounded by its associated dimensions, and those dimensions related to further dimensions, branching out into a snowflake pattern.

The snowflake schema consists of one fact table that is linked to many dimension tables, which in turn can be linked to other dimension tables through many-to-one relationships. Tables in a snowflake schema are generally normalized to third normal form. Each dimension table represents exactly one level in a hierarchy.
Advantages of Snowflake Schema
1. The primary advantage of the snowflake schema is the improvement in query performance due to minimized disk storage requirements and smaller lookup tables to join.
2. It provides greater scalability in the interrelationship between dimension levels and components.
3. There is no redundancy, so it is easier to maintain.

Disadvantages of Snowflake Schema

1. The primary disadvantage of the snowflake schema is the additional maintenance effort required due to the increased number of lookup tables.
2. Queries are more complex and hence more difficult to understand.
3. More tables mean more joins, and therefore longer query execution times.

Fact Constellation
A fact constellation is a schema for representing a multidimensional model. It is a collection of multiple fact tables that share some common dimension tables. It can be viewed as a collection of several star schemas and is therefore also known as a galaxy schema. It is one of the widely used schemas for data warehouse design and is more complex than the star and snowflake schemas. Fact constellations are required for complex systems.
Difference between Star and Snowflake Schema:
1. A star schema contains fact tables and dimension tables, while a snowflake schema contains fact tables, dimension tables, and sub-dimension tables.
2. The star schema is a top-down model, while the snowflake schema is a bottom-up model.
3. The star schema uses more space, while the snowflake schema uses less space.
4. Queries take less time to execute in a star schema and more time in a snowflake schema.
5. Normalization is not used in a star schema, while both normalization and denormalization are used in a snowflake schema.
6. The design of a star schema is very simple, while the design of a snowflake schema is complex.
7. The query complexity of a star schema is low, while that of a snowflake schema is higher.
8. A star schema is very simple to understand, while a snowflake schema is more difficult to understand.
9. A star schema has fewer foreign keys, while a snowflake schema has more foreign keys.
10. A star schema has high data redundancy, while a snowflake schema has low data redundancy.
MOLAP vs. ROLAP

1. In MOLAP, information retrieval is fast; in ROLAP, it is comparatively slow.
2. MOLAP uses a sparse array to store data sets; ROLAP uses relational tables.
3. MOLAP is best suited for inexperienced users, since it is very easy to use; ROLAP is best suited for experienced users.
4. MOLAP maintains a separate database for data cubes; ROLAP may not require space other than what is available in the data warehouse.
5. In MOLAP, the DBMS facility is weak; in ROLAP, the DBMS facility is strong.

OLTP VS OLAP
Users: clerks and IT professionals (OLTP) vs. knowledge workers (OLAP)

Function: day-to-day operations vs. decision support

DB design: application-oriented vs. subject-oriented

Data: current, up-to-date, detailed, flat relational, isolated vs. historical, summarized, multidimensional, integrated, consolidated

Usage: repetitive vs. ad hoc

Access: read/write with index/hash on primary key vs. lots of scans

Unit of work: short, simple transactions vs. complex queries

Number of users: thousands vs. hundreds

DB size: 100 MB to GB vs. 100 GB to TB
OLAP Servers:
Online Analytical Processing (OLAP) refers to a set of software tools used for data analysis in order to make business decisions. OLAP provides a platform for gaining insights from data retrieved from multiple database systems at the same time. It is based on a multidimensional data model, which enables
users to extract and view data from various perspectives. A multidimensional database is used to store OLAP data. Many Business Intelligence (BI) applications rely on OLAP technology.
Type of OLAP servers:
The three major types of OLAP servers are as follows:
• ROLAP
• MOLAP
• HOLAP

Relational OLAP (ROLAP):


Relational On-Line Analytical Processing (ROLAP) is primarily used for data stored in a relational
database, where both the base data and dimension tables are stored as relational tables. ROLAP servers
are used to bridge the gap between the relational back-end server and the client’s front-end tools. ROLAP
servers store and manage warehouse data using RDBMS, and OLAP middleware fills in the gaps.
Benefits:
• It is compatible with data warehouses and OLTP systems.
• The data size limitation of ROLAP technology is determined by the underlying RDBMS. As a
result, ROLAP does not limit the amount of data that can be stored.
Limitations:
• SQL functionality is constrained.
• It’s difficult to keep aggregate tables up to date.
Multidimensional OLAP (MOLAP):
Through array-based multidimensional storage engines, Multidimensional On-Line Analytical Processing (MOLAP) supports multidimensional views of data. Storage utilization in multidimensional data stores may be low if the data set is sparse.
MOLAP stores data on disk in a specialized multidimensional array structure, and OLAP operations exploit the random-access capability of arrays. Dimension instances determine the array element, and the data or measured value associated with each cell is typically stored in the corresponding array element. The multidimensional array is usually stored in a linear allocation according to a nested traversal of the axes in some predetermined order.
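
As a rough illustration of this array-based storage (the dimensions and values are assumptions for the example), a dense NumPy array can play the role of the MOLAP cube, with dimension instances determining the cell that holds each measured value:

import numpy as np

# Two dimensions with fixed instances
products = ["laptop", "phone", "tablet"]
regions = ["north", "south"]

# MOLAP-style storage: a dense multidimensional array, one cell per dimension combination
cube = np.zeros((len(products), len(regions)))

# Loading a measured value: the dimension instances determine the array element
cube[products.index("phone"), regions.index("south")] = 120.0

# Aggregation becomes a direct array operation, e.g. total sales per product
print(cube.sum(axis=1))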
Benefits:
• Suitable for slicing and dicing operations.
• Outperforms ROLAP when data is dense.
• Capable of performing complex calculations.
Limitations:
• It is difficult to change the dimensions without re-aggregating.
• Since all calculations are performed when the cube is built, a large amount of data cannot be stored
in the cube itself.
Hybrid OLAP (HOLAP):
Hybrid On-Line Analytical Processing (HOLAP) combines ROLAP and MOLAP, offering the greater scalability of ROLAP and the faster computation of MOLAP. HOLAP servers are capable of storing large amounts of detailed data. On the one hand, HOLAP benefits from ROLAP's greater scalability; on the other hand, it makes use of cube technology for faster performance and for summary-type information. Because detailed data is kept in the relational database, the cubes are smaller than in MOLAP.
Benefits:
• HOLAP combines the benefits of MOLAP and ROLAP.
• Provide quick access at all aggregation levels.
Limitations
• Because it supports both MOLAP and ROLAP servers, HOLAP architecture is extremely complex.
• There is a greater likelihood of overlap, particularly in their functionalities.

What is Data Cube?

Data that is grouped or combined into multidimensional matrices is called a data cube. The data cube method has a few alternative names and variants, such as "multidimensional databases," "materialized views," and "OLAP (On-Line Analytical Processing)."

The general idea of this approach is to materialize certain expensive computations that are frequently queried.

For example, a relation with the schema sales (part, supplier, customer, sale-price) can be materialized into a set of eight views, one for each subset of the grouping attributes: psc denotes the view consisting of aggregate function values (such as total sales) computed by grouping the three attributes part, supplier, and customer; p denotes the view composed of the corresponding aggregate function values computed by grouping part alone; and so on.
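
A minimal sketch of materializing these eight views with pandas (the sample rows are invented for illustration): each view is the aggregate of sale-price grouped by one subset of {part, supplier, customer}, from the full psc view down to the grand total:

import pandas as pd
from itertools import combinations

sales = pd.DataFrame({
    "part":       ["p1", "p1", "p2", "p2"],
    "supplier":   ["s1", "s2", "s1", "s2"],
    "customer":   ["c1", "c1", "c2", "c2"],
    "sale_price": [10, 20, 30, 40],
})

dims = ["part", "supplier", "customer"]
views = {}
for r in range(len(dims), -1, -1):
    for group in combinations(dims, r):
        name = "".join(d[0] for d in group) or "none"   # psc, ps, pc, sc, p, s, c, none
        if group:
            views[name] = sales.groupby(list(group))["sale_price"].sum()
        else:
            views[name] = sales["sale_price"].sum()     # the empty grouping: overall total

print(views["psc"])   # total sales by part, supplier and customer
print(views["p"])     # total sales by part alone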
