DMDW Unit2
UNIT II Data Warehouse and OLAP Technology for Data Mining: Data Warehouse,
Multidimensional Data Model, Data Warehouse Architecture, Data Marts, Data Warehouse
Implementation, Further Development of Data Cube Technology, From Data Warehousing to
Data Mining, Data Cube Computation and Data Generalization, Attribute-Oriented Induction
Q) Write about Multi-Dimensional Modeling?
Dimensional Modeling (DM) is a data structure technique optimized for
data storage in a data warehouse.
Its purpose is to optimize the database for faster retrieval of data.
The concept of dimensional modeling was developed by Ralph Kimball, and
a dimensional model consists of “fact” and “dimension” tables.
A dimensional model in a data warehouse is designed to read, summarize,
and analyze numeric information such as values, balances, counts, and
weights.
A multidimensional model is a way of structuring data for data warehousing
tools. It presents data in the form of a data cube, which allows data to be
modeled and viewed in multiple dimensions. The cube is defined by
dimensions and facts. A data cube can be represented as follows:
In the representation above, the factory’s sales for Bangalore are shown
along the time dimension, which is organized into quarters, and the item
dimension.
If we want to view the sales data in three dimensions, it is represented as
in the diagram given below. Here the data of the sales is
ADVANTAGES:
1. Most Suitable for Query Processing: View-only reporting applications show
enhanced performance.
2. Simple Queries
3. Simplest and Easiest to design.
DISADVANTAGES:
1. They do not support many-to-many relationships between business entities.
2. More data redundancy.
Snowflake Schema
It is an extended version of the star schema in which the dimension
tables are further sub-divided.
Because the dimensions are normalized, there are several levels of
dimension tables.
Normalization is a process that splits up data to avoid data
redundancy. It sub-divides the tables, so the number of tables
increases.
In short, the snowflake schema is a normalized star schema.
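As a minimal sketch of this difference, the snippet below uses Python's standard sqlite3 module to build the same item dimension both ways. The table names (dim_item_star, dim_item, dim_category, fact_sales), the columns, and the rows are all hypothetical and chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Star schema: one denormalized dimension table; the category name repeats.
cur.execute("CREATE TABLE dim_item_star "
            "(item_id INTEGER PRIMARY KEY, item_name TEXT, category_name TEXT)")
cur.executemany("INSERT INTO dim_item_star VALUES (?, ?, ?)",
                [(1, "Laptop", "Electronics"),
                 (2, "Phone", "Electronics"),
                 (3, "Desk", "Furniture")])

# Snowflake schema: the category is normalized into its own table.
cur.execute("CREATE TABLE dim_category "
            "(category_id INTEGER PRIMARY KEY, category_name TEXT)")
cur.execute("CREATE TABLE dim_item "
            "(item_id INTEGER PRIMARY KEY, item_name TEXT, "
            "category_id INTEGER REFERENCES dim_category(category_id))")
cur.executemany("INSERT INTO dim_category VALUES (?, ?)",
                [(10, "Electronics"), (20, "Furniture")])
cur.executemany("INSERT INTO dim_item VALUES (?, ?, ?)",
                [(1, "Laptop", 10), (2, "Phone", 10), (3, "Desk", 20)])

# A small fact table shared by both variants (hypothetical figures).
cur.execute("CREATE TABLE fact_sales (item_id INTEGER, units_sold INTEGER)")
cur.executemany("INSERT INTO fact_sales VALUES (?, ?)", [(1, 5), (2, 7), (3, 2)])

# Star query: a single join reaches the category name.
star_result = cur.execute(
    "SELECT d.category_name, SUM(f.units_sold) "
    "FROM fact_sales f JOIN dim_item_star d ON f.item_id = d.item_id "
    "GROUP BY d.category_name ORDER BY d.category_name").fetchall()

# Snowflake query: the same answer needs one extra join.
snowflake_result = cur.execute(
    "SELECT c.category_name, SUM(f.units_sold) "
    "FROM fact_sales f "
    "JOIN dim_item i ON f.item_id = i.item_id "
    "JOIN dim_category c ON i.category_id = c.category_id "
    "GROUP BY c.category_name ORDER BY c.category_name").fetchall()
```

Both queries return the same totals. The snowflake version stores each category name once but pays for it with an additional join.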
ADVANTAGES:
1. Easy to maintain: reduced data redundancy makes the dimension tables
easier to update.
2. Saves storage space: normalized dimension tables store each value only
once.
DISADVANTAGES:
1. Complex schema: source queries need more joins.
2. Slower query performance: because of the more complex queries.
Galaxy Schema
It consists of more than one fact table linked to dimension tables that
hold the descriptive attributes. It is also called a fact constellation
schema.
Conformed dimensions are the dimension tables shared by the fact
tables. We can normalize the dimensions in this schema further, but that
leads to a more complex design.
The following diagram shows Placement and Workshop as the two fact tables.
The dimension tables Student and TPO are the conformed dimensions.
ADVANTAGES:
1. Flexible schema.
2. Effective analysis and reporting.
DISADVANTAGES:
1. Has huge dimension tables, which makes it difficult to manage.
2. Hard to maintain: because of the complex design and the many fact
tables.
Q) What is data warehouse architecture?
Bill Inmon is widely recognized as “the father of data warehousing”.
He defines a data warehouse as:
"A subject-oriented, integrated, time-variant, and non-volatile collection
of data in support of management's decision-making process."
Subject-oriented
The data in the warehouse is organized around subjects or topics rather than
the applications or source systems that generate the data.
Integrated
The data from each source system (e.g. CRM, ERP, behavioral data, or
e-commerce platforms) is brought together and made consistent in the data
warehouse.
Time-variant
Data in the warehouse is maintained over time, allowing for trend analysis,
forecasting, AI/ML, and historical reporting.
Non-volatile
Data written into the warehouse is not overwritten or deleted. This ensures
the stability and reliability of the data, which is crucial for trustworthy
analysis.
Data Warehouse Architecture
A data warehouse architecture is a complex system that holds historical and
cumulative data from multiple sources.
There are three approaches to constructing data warehouse layers:
single-tier, two-tier, and three-tier.
Single-tier architecture
The objective of a single layer is to minimize the amount of data stored by
removing data redundancy.
Two-tier architecture
The two-tier architecture physically separates the available sources from
the data warehouse.
Three-Tier Data Warehouse Architecture
This is the most widely used data warehouse architecture. It consists of
the top, middle, and bottom tiers.
Bottom Tier: The database of the data warehouse serves as the bottom
tier. It is usually a relational database system.
Middle Tier: The middle tier is an OLAP server, which is implemented using
either the ROLAP or the MOLAP model.
Top Tier: The top tier is the front-end client layer. It holds the tools
and APIs that you use to connect to the data warehouse and get data out of
it.
OLAP tools
1. Query and reporting tools
Query and reporting tools can be further divided into
Reporting tools
Managed query tools
2. Application development tools
3. Data mining tools
Data mining is the process of discovering meaningful new correlations,
patterns, and trends by mining large amounts of data. Data mining tools are
used to automate this process.
4. OLAP tools
These tools are based on the concepts of a multidimensional database. They
allow users to analyze the data using elaborate and complex
multidimensional views.
Data Marts
A data mart is a simple form of a data warehouse that is focused on a
single subject or line of business, such as sales, finance, or marketing.
Given their focus, data marts draw data from fewer sources than data
warehouses.
For example, a company might store data from various sources, such as
supplier information, orders, sensor data, employee information, and
financial records in their data warehouse.
However, the company stores information relevant to, for instance, the
marketing department, such as social media reviews and customer records,
in a data mart.
Advantages of Data mart:
More trustworthy data
Easier access to data.
Faster insights & decisions.
Lower cost: Data marts typically cost less to set up than a full data
warehouse.
Easier implementation & maintenance.
Better support for short-term projects.
Better data access control.
How does a data mart work?
A data mart turns raw information into structured, meaningful content for
a specific business department.
To do this, data engineers set up a data mart to receive information
either from a data warehouse or directly from external data sources.
When it is connected to a data warehouse, the data mart retrieves a
selection of information that is relevant to a business unit.
Often, the information contains summarized data and excludes
unnecessary or detailed data.
ETL
Extract, transform, and load (ETL) is a process for integrating and
transferring information from various data sources into a single physical
database.
Data marts use ETL to retrieve information from external sources when it
does not come from a data warehouse.
The process involves the following steps.
Extract: collect raw information from various sources
Transform: structure the information into a common format
Load: transfer the processed data to the database
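The three steps above can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not a production pipeline: the raw CSV text, the cleaning rules, and the target table name mart_sales are all assumptions made up for the example.

```python
import csv
import io
import sqlite3

# Hypothetical raw export from a source system (assumed for illustration).
RAW_CSV = """customer,amount,date
 alice ,120.50,2024-01-05
BOB,80,2024-01-06
"""

def extract(text):
    """Extract: collect the raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: bring every row into one common format."""
    return [(row["customer"].strip().title(), float(row["amount"]), row["date"])
            for row in rows]

def load(conn, records):
    """Load: transfer the processed rows to the target database."""
    conn.execute("CREATE TABLE IF NOT EXISTS mart_sales "
                 "(customer TEXT, amount REAL, sale_date TEXT)")
    conn.executemany("INSERT INTO mart_sales VALUES (?, ?, ?)", records)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))
loaded = conn.execute(
    "SELECT customer, amount FROM mart_sales ORDER BY customer").fetchall()
```

Keeping each step in its own function mirrors the Extract, Transform, Load order and makes it easy to swap the source or the target independently.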
Analytics
Business analysts use software tools to retrieve, analyze, and represent data
from the data mart. For example, they use the information stored in data
marts for business intelligence analytics, reporting dashboards, and cloud
applications.
Independent:
An independent data mart is created without the use of a central data
warehouse. This kind of data mart is an ideal option for smaller groups
within an organization.
An independent data mart has no relationship with the enterprise data
warehouse or with any other data mart. Its data is loaded separately, and
its analyses are also performed autonomously.
Aspect | Data Mart | Data Warehouse
Type of data | Summarized historical (traditionally). | Summarized historical (in traditional DWs).
Data sources | Fewer source systems, which are operationally focused. | Wide variety of source systems from all across the enterprise.
Use case / scope | Analyzing smaller data sets (typically <100 GB) focused on a particular subject to support analytics and business intelligence (BI). | Analyzing large (typically 100+ GB), complex, enterprise-wide datasets to support data mining, BI, artificial intelligence, and machine learning.
Data governance | Easier because data is already partitioned. | Requires strict governance rules and systems to access data.
4. Modeling
Physical modeling has the potential to significantly boost the performance of
data warehouses. Data partitioning, data location, access method selection,
indexing, and other similar aspects are all components that go into the
architecture of a physical data warehouse.
5. Connecting Drivers
The data warehouse will most likely compile information from a wide
variety of sources. Finding the sources and connecting to them requires
the use of a gateway, ODBC drivers, or another wrapper.
6. ETL Phase
This step requires the information to be extracted from the source system,
transformed, and then loaded.
The ETL tools are tested, which may require a staging environment. This is
the start of the process of filling the data warehouses.
In order for data warehouses to be useful, end-user applications are
required. The following phase entails the process of designing and deploying
software that is user-facing.
Steps to Data warehouse implementation
Data warehousing is considered one of the most important processes for
collecting usable information to support business decisions.
For a data warehouse system to be installed successfully, certain steps
need to be carried out in a precise order. The steps are as follows:
1. The Preparatory Work
It is helpful because it outlines the steps that we need to take in order to
achieve the objectives and goals that we have stated. Getting buy-in from
within an organization is one of the most important factors in the success of
any endeavor.
The base cuboid contains all three dimensions: city, item, and year. It
can return the total sales for any combination of the three dimensions.
Precomputation leads to fast response times, but it requires storage
space. If all the cuboids in a data cube are precomputed, the storage
required can become very large, especially when there are many
dimensions. This problem is referred to as the “curse of
dimensionality”.
If there were no hierarchies associated with each dimension, then the
total number of cuboids for an n-dimensional data cube, as we have
seen above, is 2^n.
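However, in practice, many dimensions have hierarchies. For
example, the dimension “time” has several conceptual levels, as in the
hierarchy “day < month < quarter < year”. For an n-dimensional data
cube, the total number of cuboids that can be generated is:
Total number of cuboids = (L1 + 1) × (L2 + 1) × ... × (Ln + 1)
where Li is the number of levels associated with dimension i. One is added
to Li to include the virtual top level, all.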
Example
If the cube has 10 dimensions and each dimension has 4 levels, what will be
the number of cuboids generated?
Solution
Here n = 10 and Li = 4 for i = 1, 2, ..., 10.
Thus
Total number of cuboids = 5×5×5×5×5×5×5×5×5×5 = 5^10 ≈ 9.8 × 10^6
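The calculation in this example, where each dimension contributes a factor of (number of levels + 1), can be checked with a short Python sketch. The function name total_cuboids is hypothetical and used only for illustration.

```python
from math import prod

def total_cuboids(levels_per_dimension):
    """Multiply (L_i + 1) over all dimensions.

    levels_per_dimension holds L_i, the number of hierarchy levels of
    dimension i; the extra 1 accounts for the top level "all".
    """
    return prod(levels + 1 for levels in levels_per_dimension)

# Ten dimensions with four levels each, as in the example above.
example = total_cuboids([4] * 10)

# With a single level per dimension the count reduces to 2^n.
no_hierarchy = total_cuboids([1, 1, 1])
```

Here example evaluates to 9,765,625, which is the 5^10 ≈ 9.8 × 10^6 of the solution, and no_hierarchy evaluates to 8, which is 2^3.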
Computation of Selected Cuboids
There are three choices for data cube materialization given a base cuboid:
1. No materialization: Do not precompute any of the “non-base” cuboids.
2. Full materialization: Precompute all of the cuboids.
3. Partial materialization: Selectively compute a proper subset of the whole
set of possible cuboids.
Indexing OLAP Data: Bitmap Index and Join Index:
Most data warehouse systems support index structures. Two index
structures commonly used for OLAP data are “bitmap indexing” and “join
indexing”.
Bitmap Index
A Bitmap Index is a type of indexing technique that uses bitmaps to
represent the presence or absence of values in a column.
It is particularly useful for low cardinality columns, where the number of
distinct values is relatively small.
Here's how Bitmap Index works:
For each distinct value in the column, a bitmap is created.
Each bit in the bitmap represents a row in the table.
If a bit is set to 1, it indicates that the corresponding row contains the
value represented by the bitmap.
If a bit is set to 0, it indicates that the corresponding row does not
contain the value.
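The steps above can be sketched in plain Python, with each bitmap held as a list of 0/1 values. The city column and its values are hypothetical, and a real system would use packed bit vectors rather than lists.

```python
def build_bitmap_index(column):
    """Create one bitmap per distinct value; bit i describes row i."""
    index = {}
    for row, value in enumerate(column):
        bitmap = index.setdefault(value, [0] * len(column))
        bitmap[row] = 1  # this row contains the value
    return index

def rows_matching(bitmap):
    """Return the row numbers whose bit is set to 1."""
    return [row for row, bit in enumerate(bitmap) if bit]

# Hypothetical low-cardinality column (three distinct values, five rows).
city = ["Delhi", "Mumbai", "Delhi", "Pune", "Mumbai"]
city_index = build_bitmap_index(city)

# "city = Delhi OR city = Pune" becomes a bitwise OR of two bitmaps.
delhi_or_pune = [a | b for a, b in zip(city_index["Delhi"], city_index["Pune"])]
matching_rows = rows_matching(delhi_or_pune)
```

Because selections turn into bitwise operations on the bitmaps, the table itself does not need to be scanned to find the matching rows.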
Example:
Join index tables based on the linkages between the sales fact table and the
location and item dimension tables shown in Figure.
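As a rough sketch of the idea, a join index can be modeled as a mapping from each dimension value to the keys of the fact rows that join with it. The sales rows, the key names (T1 to T4), and the function build_join_index below are hypothetical and are not taken from the figure.

```python
from collections import defaultdict

# Hypothetical sales fact rows, each tagged with its location and item.
sales_fact = [
    {"sales_key": "T1", "location": "Main Street", "item": "TV", "amount": 300},
    {"sales_key": "T2", "location": "Main Street", "item": "Radio", "amount": 50},
    {"sales_key": "T3", "location": "Lake Road", "item": "TV", "amount": 280},
    {"sales_key": "T4", "location": "Main Street", "item": "TV", "amount": 310},
]

def build_join_index(fact_rows, dimension):
    """Map each value of one dimension to the fact rows that join with it."""
    index = defaultdict(list)
    for row in fact_rows:
        index[row[dimension]].append(row["sales_key"])
    return dict(index)

location_index = build_join_index(sales_fact, "location")
item_index = build_join_index(sales_fact, "item")

# Combining both indexes finds the sales of TVs on Main Street
# without joining the fact table to the dimension tables again.
main_street_tv = sorted(set(location_index["Main Street"]) & set(item_index["TV"]))
```

The joinable rows are recorded once, when the index is built, so later queries can look them up instead of recomputing the join.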
Efficient Processing of OLAP Queries
The purpose of materializing cuboids and constructing OLAP index structures
is to speed up query processing in data cubes. Given materialized views,
query processing should proceed as follows:
1. Determine which operations should be performed on the available
cuboids: This involves transforming any selection, projection, roll-up
In this example, the cities New Jersey and Los Angeles are rolled up into
the country USA.
The sales figures of New Jersey and Los Angeles are 440 and 1560
respectively. They become 2000 after roll-up.
In this aggregation process, the data moves up the location hierarchy
from city to country.
In the roll-up process, one or more dimensions need to be
removed. In this example, the Cities dimension is removed.
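The roll-up of New Jersey (440) and Los Angeles (1560) into USA can be reproduced with a small Python sketch. The sales figures come from the example above; the function name roll_up and the dictionary layout are assumptions made for illustration.

```python
from collections import defaultdict

# Sales per city, as given in the example.
city_sales = {"New Jersey": 440, "Los Angeles": 1560}

# One step of the location hierarchy: city -> country.
city_to_country = {"New Jersey": "USA", "Los Angeles": "USA"}

def roll_up(sales, hierarchy):
    """Aggregate each member's sales into its parent in the hierarchy."""
    totals = defaultdict(int)
    for member, amount in sales.items():
        totals[hierarchy[member]] += amount
    return dict(totals)

country_sales = roll_up(city_sales, city_to_country)
```

After the call, country_sales holds a single entry for USA with the value 2000, and the individual cities are no longer visible.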
2) Drill-down
In drill-down data is fragmented into smaller parts. It is the opposite of the
rollup process. It can be done via
Moving down the concept hierarchy
Increasing a dimension
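Moving down a concept hierarchy can be sketched as follows, going from a quarter-level summary to the months of one quarter. The monthly figures and the function names are hypothetical, chosen only to make the example concrete.

```python
# Hypothetical monthly sales, keyed by (quarter, month).
monthly_sales = {
    ("Q1", "Jan"): 150, ("Q1", "Feb"): 100, ("Q1", "Mar"): 150,
    ("Q2", "Apr"): 200, ("Q2", "May"): 250, ("Q2", "Jun"): 150,
}

def summarize_by_quarter(data):
    """Coarse view: total sales per quarter."""
    totals = {}
    for (quarter, _month), amount in data.items():
        totals[quarter] = totals.get(quarter, 0) + amount
    return totals

def drill_down(data, quarter):
    """Finer view: break one quarter into its months."""
    return {month: amount for (q, month), amount in data.items() if q == quarter}

quarter_view = summarize_by_quarter(monthly_sales)
q1_detail = drill_down(monthly_sales, "Q1")
```

The monthly values in q1_detail add up to the Q1 total in quarter_view, which shows that drill-down exposes the detail behind an aggregate without changing it.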
4) Pivot
In pivot, you rotate the data axes to provide an alternative presentation
of the data.
In the following example, the pivot is based on item types.
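A minimal sketch of a pivot on a two-dimensional view is shown below: the row and column axes are swapped, and the values themselves do not change. The cities, item types, and figures are hypothetical and are not taken from the figure.

```python
def pivot(table):
    """Swap the row and column axes of a nested-dictionary table."""
    rotated = {}
    for row_label, row in table.items():
        for col_label, value in row.items():
            rotated.setdefault(col_label, {})[row_label] = value
    return rotated

# Hypothetical view with cities as rows and item types as columns.
by_city = {
    "Delhi": {"Mobile": 605, "Modem": 825},
    "Mumbai": {"Mobile": 680, "Modem": 952},
}

# After the pivot, item types are the rows and cities are the columns.
by_item = pivot(by_city)
```

Applying pivot a second time returns the original layout, which confirms that only the presentation is rotated.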
The apex (0-D) cuboid, representing the concept "all", is at the top of the
lattice. The 3-D base cuboid, ABC, is at the bottom of the lattice. It is the
least aggregated (most detailed or specialized) level.
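The lattice for the three dimensions A, B, and C can be enumerated with a short Python sketch, listing the cuboids level by level from the apex down to the base. The function name cuboid_lattice is hypothetical.

```python
from itertools import combinations

def cuboid_lattice(dimensions):
    """List the cuboids level by level, from the apex (0-D) to the base."""
    return [
        ["".join(combo) or "all" for combo in combinations(dimensions, k)]
        for k in range(len(dimensions) + 1)
    ]

lattice = cuboid_lattice("ABC")
apex, base = lattice[0], lattice[-1]
cuboid_count = sum(len(level) for level in lattice)
```

For three dimensions this gives one apex cuboid, three 1-D cuboids, three 2-D cuboids, and the base cuboid ABC, for a total of 2^3 = 8 cuboids.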
Example
Let's say there is a University database that is to be characterized.
Its corresponding DMQL will be:
    use University_DB
    mine characteristics as “Science_Students”
    in relevance to name, gender, major, birth_place, birth_date,
        residence, phone_no, GPA
    from student
Its corresponding SQL statement can be:
    SELECT name, gender, major, birth_place, birth_date, residence,
           phone_no, GPA
    FROM student
    WHERE status IN ('Msc', 'MBA', 'Ph.D.')
Initially the data is stored in the “Initial Working Relation” table,
from which we can remove the few attributes that carry no meaning for
the task-relevant data.