Unit-I DW - Architecture
by:
Prof. Asha Ambhaikar
1
UNIT-I
2
Contents of Unit-I
3
Text Books:
4
Reference Books:
5
What is Data Warehousing?
A process of transforming data into information
and making it available to
users in a timely enough
manner to make a
difference
6
Data Warehousing --
It is a process
A technique for assembling and
managing data from various
sources for the purpose of
answering business
questions, thus enabling
decisions that were not previously
possible
A decision support database
maintained separately from the
organization’s operational
database
7
What is Data Warehouse?
9
Data Warehouse—Integrated
Constructed by integrating multiple,
heterogeneous data sources
relational databases, flat files, on-line transaction
records
Data cleaning and data integration techniques
are applied.
Ensure consistency in naming conventions, encoding
structures, attribute measures, etc. among different
data sources
E.g., Hotel price: currency, tax, breakfast covered, etc.
When data is moved to the warehouse, it is
converted.
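The hotel price example above can be sketched in code. This is only an illustrative sketch: the `HotelPrice` record, the exchange rates, and the tax rates are hypothetical and do not come from any real source system. It shows two sources that disagree on currency and tax treatment being converted to one warehouse convention (USD, tax included).

```python
from dataclasses import dataclass

# Hypothetical exchange rates to USD (assumption for illustration only).
RATES_TO_USD = {"USD": 1.0, "EUR": 1.25}

@dataclass
class HotelPrice:
    hotel: str
    amount: float
    currency: str
    tax_included: bool
    tax_rate: float  # e.g. 0.10 means 10% tax

def to_warehouse_price(p: HotelPrice) -> float:
    """Return the price in USD with tax included, rounded to cents."""
    gross = p.amount if p.tax_included else p.amount * (1 + p.tax_rate)
    return round(gross * RATES_TO_USD[p.currency], 2)

# Two sources with different conventions for the same attribute.
sources = [
    HotelPrice("A", 100.0, "USD", True, 0.10),   # already tax-inclusive USD
    HotelPrice("B", 80.0, "EUR", False, 0.10),   # EUR, tax not yet added
]
unified = {p.hotel: to_warehouse_price(p) for p in sources}
print(unified)
```

After conversion, prices from both sources can be compared directly because they share one currency and one tax convention.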
10
Data Warehouse—Time Variant
11
Data Warehouse—Non-Volatile
12
Very Large Data Bases
Terabytes -- 10^12 bytes: Walmart-- 24 Terabytes
13
Data Warehousing
Physical separation of operational and decision
support environments
Purpose: to establish a data repository making
operational data accessible
Transforms operational data to relational form
Only data needed for decision support come from the
TPS
Data are transformed and integrated into a
consistent structure
Data warehousing (information warehousing): solves
the data access problem
End users perform ad hoc queries, reporting, analysis,
and visualization
14
Evolution of Data Warehouse
15
Data Warehouse vs. Heterogeneous DBMS
16
Benefits of Data warehouse
Better Information
Better Strategies and plans
Better tactics and decisions
More efficient processes
Time saving
Reduction in paper reporting
17
Data Warehousing Benefits
18
Benefits of DW
Executives, managers and staff are provided
with improved access to data from many
databases within the organization.
Managers manage with the data they want
rather than the data they get.
Less time is spent gathering data from various
systems, leaving more time available to analyze
and act.
Ability to quickly answer a series of questions,
each of which depends upon the answer to the
previous question (in seconds or minutes).
19
Data Warehouse vs. Operational DBMS
OLTP (on-line transaction processing)
Major task of traditional relational DBMS
Day-to-day operations: purchasing, inventory, banking,
manufacturing, payroll, registration, accounting, etc.
OLAP (on-line analytical processing)
Major task of data warehouse system
Data analysis and decision making
Distinct features (OLTP vs. OLAP):
User and system orientation: customer vs. market
Data contents: current, detailed vs. historical, consolidated
Database design: ER + application vs. star + subject
View: current, local vs. evolutionary, integrated
Access patterns: update vs. read-only but complex queries
20
OLTP vs. Data Warehouse
21
OLTP vs Data Warehouse
OLTP                    Warehouse (DSS)
Application Oriented    Subject Oriented
Used to run business    Used to analyze business
Detailed data           Summarized and refined
Current up to date      Snapshot data
Isolated Data           Integrated Data
Repetitive access       Ad-hoc access
Clerical User           Knowledge User (Manager)
22
To summarize ...
25
OLTP vs. OLAP
                     OLTP                            OLAP
users                clerk, IT professional          knowledge worker
function             day to day operations           decision support
DB design            application-oriented            subject-oriented
data                 current, up-to-date,            historical,
                     detailed, flat relational,      summarized, multidimensional,
                     isolated                        integrated, consolidated
usage                repetitive                      ad-hoc
access               read/write,                     lots of scans
                     index/hash on prim. key
unit of work         short, simple transaction       complex query
# records accessed   tens                            millions
# users              thousands                       hundreds
DB size              100MB-GB                        100GB-TB
metric               transaction throughput          query throughput, response
26
The Goals of a Data Warehouse
27
Goals of Data Warehouse
Makes an organization’s information accessible.
28
Needs for Data Warehousing
29
Why We need Separate Data Warehouse?
30
Trends in Data Warehouse
31
Three Complementary Trends
Data Warehousing:
– Consolidate data from many sources in one large repository.
– Loading, periodic synchronization of replicas.
– Semantic integration.
OLAP:
– Complex SQL queries and views.
– Queries based on spreadsheet-style operations and a
“multidimensional” view of data.
– Interactive and “online” queries.
Data Mining:
– Exploratory search for interesting
trends and anomalies.
32
Architecture and Infrastructure
33
Design of a Data Warehouse: A
Business Analysis Framework
Four views regarding the design of a data warehouse
Top-down view
allows selection of the relevant information necessary for the
data warehouse
Data source view
exposes the information being captured, stored, and
managed by operational systems
Data warehouse view
consists of fact tables and dimension tables
Business query view
sees the perspectives of data in the warehouse from the viewpoint
of the end user
34
Basic Elements of Data Warehouse
35
Basic Elements of a Data Warehouse
Source System
Staging Area
Presentation Area
End User Data Access Tools
Metadata
36
Basic Elements of Data Warehouse
[Figure: basic elements of a data warehouse; surviving labels: Purchased Data, Metadata Repository]
37
Data Warehouse Architecture
[Figure: three-tier data warehouse architecture]
Data Sources: operational DBs and other sources, passed through Extract, Transform, Load and Refresh
Bottom Layer (Data Storage): data warehouse and data marts, with metadata, monitor and integrator
Middle Layer (OLAP Engine): OLAP server
Top Layer (Front-End Tools): query and reporting tools, analysis tools, data mining tools
38
Working of Data Warehouse
Bottom Layer:
The bottom layer is a data warehouse database server,
which is almost always a relational database
system
Data from operational databases and
external sources are extracted using
application program interfaces known as
gateways
It is supported by primary system
39
cont….
It has a repository of metadata (data about data),
which is responsible for extracting the information
from the DW according to the queries given by the end
users
Metadata is the bridge between DW and the DSS
It provides logical linkage between data and
application
Metadata can pinpoint access to information across
the entire DW.
40
Middle Layer:
TOP Layer:
The top layer is the client,
that is, the end user
It consists of
1. Query and reporting tools
2. Analysis tools
3. Data mining tools
It acts as an interface between the user
and the server
42
Cont..
43
Principles of Dimensional Modeling
44
Multidimensional Data Model
45
Multidimensional Data Models
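[Figure: lattice of cuboids for the dimensions product, date, country]
0-D (apex) cuboid: all
1-D cuboids: product; date; country
2-D cuboids: product, date; product, country; date, country
3-D (base) cuboid: product, date, country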
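The lattice of cuboids can be enumerated mechanically: each cuboid corresponds to one subset of the dimensions, so n dimensions give 2^n cuboids. A minimal sketch, assuming the three dimensions named on the slide (the function name `cuboids` is just an illustrative choice):

```python
from itertools import combinations

dimensions = ("product", "date", "country")

def cuboids(dims):
    """Yield every subset of dims, from the 0-D apex to the n-D base cuboid."""
    for k in range(len(dims) + 1):
        for group in combinations(dims, k):
            yield group

lattice = list(cuboids(dimensions))
for group in lattice:
    # The empty group is the apex cuboid, usually labelled "all".
    label = ", ".join(group) if group else "all"
    print(f"{len(group)}-D: {label}")
```

For three dimensions this lists one apex cuboid, three 1-D cuboids, three 2-D cuboids, and one base cuboid.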
47
Multidimensionality
3-D + Spreadsheets (OLAP has this)
Data can be organized the way managers like to see
them, rather than the way that the system analysts do
Different presentations of the same data can be
arranged easily and quickly
48
Multidimensional Data
Sales volume as a function of product, month,
and region
Dimensions: Product, Location, Time
Hierarchical summarization paths
[Figure: summarization hierarchies; surviving labels: Office, Day, Month]
49
A Sample Data Cube
[Figure: sample data cube of sales]
Dimensions: Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, PC, VCR, sum), Country (U.S.A., Canada, Mexico, sum)
Highlighted cell: total annual sales of TV in U.S.A.
50
Browsing a Data Cube
Visualization
OLAP capabilities
Interactive manipulation
51
OLAP Operations
OLAP means On Line Analytical Processing.
It is used to perform analysis on data and
Pivot : rotate
53
Roll-up and Drill Down
Roll-up moves toward a higher level of aggregation; drill-down moves toward low-level details
Hierarchy levels, from highest to lowest:
Sales Channel
Region
Country
State
Location Address
Sales Representative
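Roll-up and drill-down can be illustrated as grouping the same rows at a coarser or finer level of a hierarchy. The sketch below uses only the Country and State levels; the sales rows and the helper `aggregate` are hypothetical and exist purely for illustration.

```python
from collections import defaultdict

# Hypothetical sales rows: (country, state, amount)
sales = [
    ("India", "Maharashtra", 120),
    ("India", "Gujarat", 80),
    ("USA", "Texas", 200),
    ("USA", "Ohio", 50),
]

def aggregate(rows, key_len):
    """Sum amounts grouped by the first key_len location columns."""
    totals = defaultdict(int)
    for *location, amount in rows:
        totals[tuple(location[:key_len])] += amount
    return dict(totals)

by_state = aggregate(sales, 2)    # drill-down level: (country, state)
by_country = aggregate(sales, 1)  # roll-up level: (country,)
print(by_state)
print(by_country)
```

Moving from `by_state` to `by_country` is a roll-up; going the other way is a drill-down.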
54
Slicing and Dicing
[Figure: slicing and dicing a data cube; surviving labels: Household, Telecomm, Video, Audio; Europe, Far East, India; Product: Juice, Cola, Milk, Cream; cell values 10, 47, 30, 12]
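Slicing fixes one dimension to a single value, while dicing selects a sub-cube by restricting several dimensions at once. A minimal sketch follows; the cube cells, the third dimension (month), and the helper names `dice` and `slice_` are assumptions made for illustration, not values taken from the figure.

```python
# Hypothetical cube cells keyed by (product, region, month)
cube = {
    ("Juice", "Europe", "Jan"): 10,
    ("Cola", "Europe", "Jan"): 47,
    ("Milk", "India", "Jan"): 30,
    ("Cream", "India", "Feb"): 12,
    ("Cola", "Far East", "Feb"): 25,
}
DIMS = ("product", "region", "month")

def dice(cells, **allowed):
    """Keep cells whose value on each named dimension is in the allowed set."""
    idx = {name: DIMS.index(name) for name in allowed}
    return {
        key: value
        for key, value in cells.items()
        if all(key[idx[name]] in allowed[name] for name in allowed)
    }

def slice_(cells, dim, value):
    """A slice is a dice that fixes a single dimension to one value."""
    return dice(cells, **{dim: {value}})

jan = slice_(cube, "month", "Jan")
sub = dice(cube, product={"Cola", "Milk"}, region={"Europe", "India"})
print(jan)
print(sub)
```

In a real OLAP tool the slice would usually also drop the fixed dimension from the result; the sketch keeps the full key so the selected cells stay easy to read.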
Other operations
drill across: Executes queries involving
(across) more than one fact table
drill through: Operation uses relational SQL
facilities to drill through the bottom level of
the data cube to its back-end relational
tables
59
Physical Design Process
60
Stars, Snowflakes & fact Constellations:
61
Cont….
62
Fact Table
Central table
mostly raw numeric items
63
Star Schema
A single fact table and for each dimension
one dimension table
Does not capture hierarchies directly
[Figure: a central fact table (date, custno, prodno, cityname, ...) linked to the dimension tables time, prod, cust and city]
64
Snowflake schema
Represent dimensional hierarchy directly
by normalizing tables.
Easy to maintain and saves storage
[Figure: a fact table (date, custno, prodno, cityname, ...) linked to the dimension tables Time, Prod and Cust, with the further normalized tables city and Region]
65
Example of Star Schema
time dimension table: time_key, day, day_of_the_week, month, quarter, year
item dimension table: item_key, item_name, brand, type, supplier_type
branch dimension table: branch_key, branch_name, branch_type
location dimension table: location_key, street, city, province_or_state, country
Sales Fact Table: time_key, item_key, branch_key, location_key
Measures: units_sold, dollars_sold, avg_sales
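A star schema like the one above can be tried out with an in-memory SQLite database. The sketch below is a reduced version under stated assumptions: it keeps only two of the four dimensions, the table names `time_dim`, `item_dim` and `sales_fact` are illustrative choices, and all rows are made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
-- The fact table holds foreign keys to each dimension plus the measures.
CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES time_dim(time_key),
    item_key INTEGER REFERENCES item_dim(item_key),
    units_sold INTEGER,
    dollars_sold REAL
);
INSERT INTO time_dim VALUES (1, 'Q1', 2024), (2, 'Q2', 2024);
INSERT INTO item_dim VALUES (1, 'TV', 'BrandA'), (2, 'PC', 'BrandB');
INSERT INTO sales_fact VALUES (1, 1, 3, 900.0), (1, 2, 1, 700.0), (2, 1, 2, 600.0);
""")

# A typical star join: fact table joined to its dimensions, then aggregated.
rows = conn.execute("""
    SELECT i.item_name, t.quarter, SUM(f.dollars_sold)
    FROM sales_fact f
    JOIN time_dim t ON t.time_key = f.time_key
    JOIN item_dim i ON i.item_key = f.item_key
    GROUP BY i.item_name, t.quarter
    ORDER BY i.item_name, t.quarter
""").fetchall()
print(rows)
```

Each dimension is reached with a single join from the fact table, which is the property that distinguishes a star schema from a snowflake schema.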
66
Example of Snowflake Schema
time dimension table: time_key, day, day_of_the_week, month, quarter, year
item dimension table: item_key, item_name, brand, type, supplier_key
supplier dimension table: supplier_key, supplier_type
branch dimension table: branch_key, branch_name, branch_type
location dimension table: location_key, street, city_key
city dimension table: city_key, city, province_or_state, country
Sales Fact Table: time_key, item_key, branch_key, location_key
Measures: units_sold, dollars_sold, avg_sales
67
Example of Fact constellation
time dimension table: time_key, day, day_of_the_week, month, quarter, year
item dimension table: item_key, item_name, brand, type, supplier_type
Sales Fact Table: time_key, item_key, branch_key
Shipping Fact Table: time_key, item_key, shipper_key, from_location
69
OLAP Is FASMI
Fast
Analysis of
Shared
Multidimensional
Information
70
Types of OLAP Servers
ROLAP SERVERS:
Relational On-Line Analytical Processing (ROLAP) servers are
intermediate servers that lie between a
relational back-end server and client front-end
tools.
They use a relational DBMS to store and
manage data
ROLAP servers support multidimensional
views of data
71
Cont…
Multidimensional OLAP (MOLAP) Servers
These servers support multidimensional views of data
level: array
Specialized SQL servers
specialized support for SQL queries over
star/snowflake schemas
72
From the Data Warehouse to Data Marts
[Figure: from data to information. Levels: organizationally structured (data warehouse), departmentally structured, individually structured; other surviving labels: Less, More, History, Normalized, Detailed]
73
Data Warehouse and Data Marts
[Figure: OLAP data marts (lightly summarized, departmentally structured) built on the data warehouse (organizationally structured, atomic, detailed data warehouse data)]
74
Characteristics of Data Mart
Data Sources
Data Warehouse
Data Marts
76
Data Warehouse Back-End Tools and
Utilities (ETL Tool)
Data extraction:
get data from multiple, heterogeneous, and external
sources
Data cleaning:
detect errors in the data and rectify them when
possible
Data transformation:
convert data from legacy or host format to warehouse
format
Load:
sort, summarize, consolidate, compute views, check
integrity, and build indices and partitions
Refresh
propagate the updates from the data sources to the
warehouse
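The back-end steps above can be sketched as a small pipeline of functions. This is a toy illustration under stated assumptions: the source lists `crm` and `billing`, the field names, and the dictionary used as the warehouse are all hypothetical, and a real ETL tool would read from databases or files instead.

```python
def extract(sources):
    """Gather raw rows from several (here in-memory) sources."""
    return [row for src in sources for row in src]

def clean(rows):
    """Drop rows whose amount is missing or not numeric."""
    good = []
    for row in rows:
        try:
            good.append({**row, "amount": float(row["amount"])})
        except (KeyError, TypeError, ValueError):
            continue  # error detected; this sketch discards rather than repairs
    return good

def transform(rows):
    """Convert source-specific fields to the warehouse format."""
    return [{"customer": r["cust"].strip().upper(), "amount": r["amount"]} for r in rows]

def load(warehouse, rows):
    """Consolidate amounts per customer into the target table."""
    for r in rows:
        warehouse[r["customer"]] = warehouse.get(r["customer"], 0.0) + r["amount"]
    return warehouse

crm = [{"cust": " alice ", "amount": "10.5"}, {"cust": "bob", "amount": "n/a"}]
billing = [{"cust": "Alice", "amount": 4.5}, {"cust": "carol", "amount": None}]
warehouse = load({}, transform(clean(extract([crm, billing]))))
print(warehouse)
```

A refresh can be modelled by running the same pipeline on newly arrived source rows and passing the existing `warehouse` to `load` again.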
77
Components of Data Warehouse
[Figure: components of a data warehouse. Operational data sources feed the warehouse, which holds metadata, detailed data, lightly summarized data and highly summarized data; end users access it through reporting, query and EIS tools, OLAP tools and data mining tools]
79
1.Operational Data Sources
80
2. Operational Data Stores (ODS)
81
3. Load Manager
82
4. Warehouse Manager
Warehouse Manager performs all the
operations associated with the management of
the data in the warehouse.
The operations performed by this component
include:
Analysis of the data to ensure consistency
83
5. Query Manager
archive/backup data.
84
Cont…
Metadata
End-user access tools:
They can be categorized into five main groups
85
Data Flow
86
Cont…
87
Detailed Data
88
Lightly and Highly Summarized Data
This area stores all the predefined lightly and highly
aggregated data generated by the warehouse manager.
It is transient, as it is subject to change on an ongoing
basis in order to respond to changing query
profiles.
The purpose of summary information is to….
Speed up the performance of queries.
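The speed-up comes from answering queries out of a pre-aggregated table instead of rescanning the detailed rows. A minimal sketch, assuming hypothetical detailed rows and illustrative helper names (`build_summary`, `units_sold`):

```python
from collections import Counter

# Hypothetical detailed rows: (month, product, units)
detailed = [("Jan", "TV", 3), ("Jan", "PC", 1), ("Feb", "TV", 2), ("Feb", "TV", 4)]

def build_summary(rows):
    """Pre-aggregate units per (month, product), as a warehouse manager might."""
    summary = Counter()
    for month, product, units in rows:
        summary[(month, product)] += units
    return summary

summary = build_summary(detailed)

def units_sold(month, product):
    """Answer from the summary with one lookup instead of a full scan."""
    return summary[(month, product)]

print(units_sold("Feb", "TV"))
```

Because the summary depends on which queries are common, it would be rebuilt with different grouping keys as query profiles change.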
90
Meta data
This area of the warehouse stores all the metadata (data
about data) definitions used by all the processes in the
warehouse.
It is used for a variety of purposes….
Extraction and loading process: Meta data is
92
Tools used for Data Warehouse
Cognos Tool
EIS
DSS
OLAP
93
Data Warehouse Usage
Three kinds of data warehouse applications
Information processing
supports querying, basic statistical analysis, and
reporting using crosstabs, tables, charts and graphs
Analytical processing
multidimensional analysis of data warehouse data
supports basic OLAP operations, slice-dice, drilling, pivoting
Data mining
knowledge discovery from hidden patterns
supports associations, constructing analytical models,
performing classification and prediction, and presenting the
mining results using visualization tools.
94
Summary
Data warehouse
A subject-oriented, integrated, time-variant, and nonvolatile
collection of data in support of management’s decision-
making process
A multi-dimensional model of a data warehouse
Star schema, snowflake schema, fact constellations
95
Important Questions
What is Data Warehouse? Explain in detail.
Draw and explain the Data Warehouse Architecture.
Explain Data warehouse component with suitable
diagram.
What is OLAP? Explain OLAP operations along with its
types.
Explain Star Schema and snowflake Schema.
What is multidimensional data model? Explain with neat
diagram.
Compare the OLTP and OLAP.
What do you mean by project planning and
requirement? Explain how it is necessary in DW.
Explain the role of Project Management.
96