CHP 02 Data Warehouse Architecture

Uploaded by

tejaswinibambole

Chapter-02:

Data Warehousing & Online Analytical Processing

Dr.Sunil Khilari

Chap-02: Data Warehousing & Online Analytical Processing
1. Introduction to data warehousing, need for a data warehouse (DW), operational database versus DW.
2. Data warehouse life cycle, building a data warehouse, data warehousing components, data warehousing architecture, DW models.
3. Extraction, transformation & loading (ETL), metadata repository, feature selection & creation.
4. Multi-dimensional data modeling: star schema, snowflake schema & fact constellation schema; On-Line Analytical Processing (OLAP); categorization of OLAP tools; data cubes & operations on cubes.
5. Design and usage of a data warehouse (at least one system diagram).

2.1 Introduction to data warehousing

• The concept of data warehousing dates back to the late 1980s, when IBM researchers Barry Devlin and Paul Murphy developed the "business data warehouse". In essence, the data warehousing concept was intended to provide an architectural model for the flow of data from operational systems to decision support environments.
• Corporate organizations generate huge amounts of data from their day-to-day operations:
  • On-line transaction processing (OLTP): ATM, reservation systems, net banking.
  • Insurance, manufacturing, web, e-commerce.
• In today's information era, these huge, routinely generated data sets are collected by many corporate organizations.
• These organizations typically face the problem of being "DATA RICH" but "INFORMATION POOR".

Introduction to data warehousing (cont.)

• This problem leads to the challenge of extracting valuable information from huge volumes of data and making it available to the right person, at the right place, at the right time, at the right cost and in the desired form to support the decision-making process.
• A data warehouse can contain a huge number of records, numbering millions or even hundreds of millions.
• A data warehouse can be used for strategic purposes such as sophisticated multi-dimensional analysis.
• A data warehouse may be used for knowledge discovery and strategic decision-making using data-mining tools.

Definition of a data warehouse

• A data warehouse is a central repository for all or significant parts of the data that an enterprise's various business systems collect. It usually contains historical data derived from transaction data, but it can also include data from other sources.
• Definition: "A data warehouse is a subject-oriented, integrated, time-variant, non-volatile collection of data in support of management's decision-making process."

Definition of a data warehouse (cont.)
1. Subject-oriented:
A data warehouse is organized around the major subjects of the business. For example, a warehouse built to analyze sales is defined by that subject matter, and this makes it subject-oriented.
e.g. A bank's warehouse would be organized by customer, deposit, interest rate and loan; other examples are purchase, sales and inventory.

Customer data, Deposit data, Sales data, Purchase data

2. Integrated:
While constructing the data warehouse, multiple heterogeneous sources such as relational databases, flat files and OLTP files are used, and the data collected from them is integrated.

Integration is closely related to subject orientation. A data warehouse must put data from different sources into a consistent format.
e.g. Purchase, Sales, Inventory

Definition of a data warehouse (cont.)
3. Nonvolatile:
Nonvolatile means that once data has entered the warehouse, it should not change. Data is not updated or modified after it enters the data warehouse; it is only loaded, refreshed and accessed for queries.

4. Time-variant:
All data in the data warehouse is identified with a particular time period.

Data in the data warehouse is collected from the corporate data archives and could be 3 to 10 years old or even older. The data provides a historical perspective and is used for comparisons, trends and forecasting.

Data Warehousing Architecture

Data warehouses and their architectures vary depending upon the specifics of
an organization's situation. Three common architectures are:

• Data Warehouse Architecture (Basic)
• Data Warehouse Architecture (with a Staging Area)
• Data Warehouse Architecture (with a Staging Area and Data Marts)

[Figure: Data Warehouse Architecture (Basic)]

[Figure: Data Warehouse Architecture (with a Staging Area)]

[Figure: Data Warehouse Architecture (with a Staging Area and Data Marts)]

[Figure: Data Warehouse Architecture, showing the inflow, up flow, down flow, outflow and meta flow, and the archive/backup store]

Data Warehouse Architecture (cont.)
• Operational data: a repository of current and integrated operational data used for analysis, e.g. bank system, purchase system, inventory system, PPC system, OLTP, order processing, billing system, payroll, accounting system, HR system.
• Staging area: operational data needs to be cleaned and processed before it is put into the warehouse. A staging area simplifies building summaries and general warehouse management.
• Data marts: "A subset of a data warehouse that supports the requirements of a particular department or business function", e.g. finance, purchase, sales, inventory, PPC, payroll. For example, a financial analyst might want to analyze historical data for purchases and sales.
• Metadata: data about data, for example about the extraction and loading processes.

Data Warehouse Architecture (cont.)

• Data mart: A data mart is a simple form of a data warehouse that is focused on a single subject (or functional area), such as sales, finance or marketing. Data marts are often built and controlled by a single department within an organization. Given their single-subject focus, data marts usually draw data from only a few sources. The sources could be internal operational systems, a central data warehouse, or external data.

Information flows of a data warehouse

Inflow: the processes associated with the extraction, cleansing, and loading of the data from source systems into the data warehouse.

Up flow: the processes associated with adding value to the data in the warehouse through summarizing, packaging and distributing the data.

Down flow: the processes associated with archiving and backing up the data in the warehouse.

Meta flow: the processes associated with the management of the metadata.

Outflow: the processes associated with making the data available to end users.

[Figure: Data warehouse]

Data Warehouse ETL

Data is loaded into the data warehouse and periodically updated via Extract/Transform/Load (ETL) tools.

[Figure: ETL pipelines from the sources RDBMS1, RDBMS2, HTML1 and XML1 feed the data warehouse, which holds the ETL pipeline outputs]

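The extract, transform and load steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production pipeline: the two in-memory record lists stand in for sources such as RDBMS1 and XML1, and the table name `sales`, the field names and the helper functions `extract`, `transform` and `load` are all hypothetical.

```python
import sqlite3

# Hypothetical source extracts; in practice these would come from RDBMS1, XML1, etc.
SOURCE_RDBMS1 = [{"cust": " Alice ", "amount": "120.50", "date": "2024-01-05"}]
SOURCE_XML1 = [{"customer": "BOB", "amt": "80", "day": "2024-01-06"}]


def extract():
    """Extract: pull raw records from every source, tagging each with its origin."""
    for row in SOURCE_RDBMS1:
        yield "rdbms1", row
    for row in SOURCE_XML1:
        yield "xml1", row


def transform(source, row):
    """Transform: map source-specific field names and formats to one consistent layout."""
    if source == "rdbms1":
        name, amount, day = row["cust"], row["amount"], row["date"]
    else:
        name, amount, day = row["customer"], row["amt"], row["day"]
    return (name.strip().title(), float(amount), day)


def load(conn, rows):
    """Load: append the cleaned rows to the warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL, sale_date TEXT)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, [transform(src, row) for src, row in extract()])
loaded = conn.execute("SELECT customer, amount, sale_date FROM sales ORDER BY sale_date").fetchall()
```

Keeping the three steps as separate functions mirrors the E, T and L stages, so a new source only needs a new branch in `extract` and `transform`.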
Data warehouse design

Data warehouse database design raises its own issues. Data warehouses have developed their own design techniques, dimensional design techniques, which are distinct from those used for transaction processing systems.

Designing a data warehouse database is highly complex: the designer must decide which user requirements are most important and which data should be considered first. One solution is to start by building data marts.

The database component of a data warehouse is described using a technique called dimensionality modeling.

Data warehouse design (cont.)

Dimensionality modeling:
"A logical design technique that aims to present the data in a standard, intuitive form that allows for high-performance access"

Fact table: every dimensional model (DM) is composed of one table with a composite primary key, called the fact table.

Dimension tables: the fact table is accompanied by a set of smaller tables called dimension tables.

"The fact table contains business facts or measures and foreign keys which refer to candidate keys (normally primary keys) in the dimension tables."

Fact Tables
In data warehousing, a fact table consists of the
measurements, metrics or facts of a business process. It is
located at the center of a star schema or a snowflake schema
surrounded by dimension tables. Where multiple fact tables
are used, these are arranged as a fact constellation schema. A
fact table typically has two types of columns: those that
contain facts and those that are foreign keys to dimension
tables. The primary key of a fact table is usually a composite
key that is made up of all of its foreign keys. Fact tables
contain the content of the data warehouse and store different
types of measures: additive, non-additive, and semi-additive
measures.

Fact Table (Cont..)
Fact tables provide the (usually) additive values that act as independent
variables by which dimensional attributes are analyzed. Fact tables are often
defined by their grain. The grain of a fact table represents the most atomic
level by which the facts may be defined. The grain of a SALES fact table
might be stated as "Sales volume by Day by Product by Store". Each record
in this fact table is therefore uniquely defined by a day, product and store.
Other dimensions might be members of this fact table (such as
location/region) but these add nothing to the uniqueness of the fact records.
These "affiliate dimensions" allow for additional slices of the independent
facts but generally provide insights at a higher level of aggregation.

Additive: measures that can be added across all dimensions.
Non-additive: measures that cannot be added across any dimension.
Semi-additive: measures that can be added across some dimensions but not others.

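The three kinds of measure can be illustrated with a small sketch. The rows and the measure names `units_sold`, `stock_on_hand` and `unit_price` are assumptions chosen for the example; they are not taken from the slides.

```python
# Hypothetical rows: (day, store, units_sold, stock_on_hand, unit_price)
rows = [
    ("mon", "s1", 10, 100, 2.0),
    ("mon", "s2", 5, 50, 4.0),
    ("tue", "s1", 20, 80, 2.0),
    ("tue", "s2", 5, 45, 4.0),
]

# Additive: units_sold can be summed across days and stores alike.
total_units = sum(r[2] for r in rows)

# Semi-additive: stock_on_hand can be summed across stores for one day...
stock_tuesday = sum(r[3] for r in rows if r[0] == "tue")
# ...but summing it across days would count the same goods repeatedly, so along
# the time dimension a different aggregate (here: the latest value) is used.
stock_s1_latest = [r[3] for r in rows if r[1] == "s1"][-1]

# Non-additive: unit_price cannot be summed along any dimension; an average
# price is derived from additive components instead.
revenue = sum(r[2] * r[4] for r in rows)
avg_price = revenue / total_units
```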
Dimension Tables

A dimension is a structure, often composed of one or more hierarchies, that categorizes data. Dimensional attributes help to describe the dimensional value. They are normally descriptive, textual values. Several distinct dimensions, combined with facts, enable you to answer business questions. Commonly used dimensions are customers, products, time and location.
Dimension data is typically collected at the lowest level of detail and then aggregated into higher-level totals that are more useful for analysis. These natural rollups or aggregations within a dimension table are called hierarchies.

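A hierarchy can be sketched as a chain of parent lookups along which detailed values are aggregated. The day, month and quarter mappings below, and the helper name `roll_up`, are illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical time hierarchy: day -> month -> quarter.
DAY_TO_MONTH = {"2024-01-15": "2024-01", "2024-02-03": "2024-02", "2024-04-20": "2024-04"}
MONTH_TO_QUARTER = {"2024-01": "2024-Q1", "2024-02": "2024-Q1", "2024-04": "2024-Q2"}

# Facts collected at the lowest level of detail (day).
daily_sales = {"2024-01-15": 100, "2024-02-03": 50, "2024-04-20": 70}


def roll_up(values, mapping):
    """Aggregate values keyed by a lower level into totals keyed by the parent level."""
    totals = defaultdict(int)
    for key, value in values.items():
        totals[mapping[key]] += value
    return dict(totals)


monthly_sales = roll_up(daily_sales, DAY_TO_MONTH)
quarterly_sales = roll_up(monthly_sales, MONTH_TO_QUARTER)
```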
Dimensional data modeling of data warehouses

• Star schema
• Snowflake schema
• Fact constellations

Dimensional data modeling of data warehouses
• Star schema: "A fact table in the middle connected to a set of dimension tables."
A star schema is a logical structure that has a fact table containing factual data in the center, surrounded by dimension tables containing reference data. The key of the fact table is made up of two or more foreign keys. This characteristic star-like structure is called a star schema or star join.
The star schema is the simplest data warehouse schema. It is called a star schema because the diagram resembles a star, with points radiating from a center. The center of the star consists of one or more fact tables and the points of the star are the dimension tables, as shown in the figure. It contains:
• A large central table (fact table)
• A set of smaller attendant tables (dimension tables), one for each dimension

[Figure: Star schema. The fact table holds the measures (additive, semi-additive, non-additive); the dimension tables hold descriptive, textual values.]

[Figure: Star Schema for Sales, showing the fact table and its dimension tables]

Dimensional data modeling of data warehouses

• Snowflake schema: a refinement of the star schema in which some dimensional hierarchies are further split (normalized) into a set of smaller dimension tables, forming a shape similar to a snowflake.
• However, the snowflake structure can reduce the effectiveness of browsing, since more joins will be needed.
• Starflake schema: a hybrid structure that contains a mixture of star and snowflake schemas.

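The extra join cost of a snowflaked dimension can be seen in a small SQLite sketch. The tables follow the item and supplier idea used in this chapter's examples, but the reduced column set and the sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Snowflaked dimension: supplier attributes move out of item into their own table.
CREATE TABLE supplier (supplier_key INTEGER PRIMARY KEY, supplier_type TEXT);
CREATE TABLE item (item_key INTEGER PRIMARY KEY, item_name TEXT,
                   supplier_key INTEGER REFERENCES supplier(supplier_key));
CREATE TABLE sales (item_key INTEGER REFERENCES item(item_key), units_sold INTEGER);
INSERT INTO supplier VALUES (1, 'wholesale'), (2, 'direct');
INSERT INTO item VALUES (10, 'tv', 1), (11, 'phone', 2), (12, 'radio', 1);
INSERT INTO sales VALUES (10, 4), (11, 7), (12, 1);
""")

# In a star schema supplier_type would sit in item and need one join; here the
# same question needs two joins: sales -> item -> supplier.
units_by_supplier_type = conn.execute("""
    SELECT sp.supplier_type, SUM(s.units_sold)
    FROM sales AS s
    JOIN item AS i ON i.item_key = s.item_key
    JOIN supplier AS sp ON sp.supplier_key = i.supplier_key
    GROUP BY sp.supplier_type
    ORDER BY sp.supplier_type
""").fetchall()
```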
[Figure: Snowflake schema]

Fact constellations
• Fact constellation: multiple fact tables share dimension tables. The schema can be viewed as a collection of stars and is therefore called a galaxy schema or fact constellation.

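A minimal sketch of a fact constellation, assuming two hypothetical fact tables `sales` and `shipping` that share one `item` dimension. The query shows one common way to compare measures from both fact tables: aggregate each fact table on the shared key first, then join the results.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item (item_key INTEGER PRIMARY KEY, item_name TEXT);
-- Two fact tables that both reference the same item dimension.
CREATE TABLE sales (item_key INTEGER REFERENCES item(item_key), units_sold INTEGER);
CREATE TABLE shipping (item_key INTEGER REFERENCES item(item_key), units_shipped INTEGER);
INSERT INTO item VALUES (1, 'tv'), (2, 'phone');
INSERT INTO sales VALUES (1, 3), (1, 2), (2, 5);
INSERT INTO shipping VALUES (1, 4), (2, 5), (2, 1);
""")

# Aggregate each fact table separately, then combine on the shared dimension.
# Joining the two fact tables directly would multiply rows and inflate the sums.
sold_vs_shipped = conn.execute("""
    SELECT i.item_name, s.sold, sh.shipped
    FROM item AS i
    JOIN (SELECT item_key, SUM(units_sold) AS sold FROM sales GROUP BY item_key) AS s
      ON s.item_key = i.item_key
    JOIN (SELECT item_key, SUM(units_shipped) AS shipped FROM shipping GROUP BY item_key) AS sh
      ON sh.item_key = i.item_key
    ORDER BY i.item_name
""").fetchall()
```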
[Figure: Fact constellation]

Example of Star Schema

Sales fact table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
time dimension: time_key, day, day_of_the_week, month, quarter, year
item dimension: item_key, item_name, brand, type, supplier_type
branch dimension: branch_key, branch_name, branch_type
location dimension: location_key, street, city, province_or_state, country

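A star schema like the one above can be sketched in SQLite as follows. This is an illustration under assumptions: only a subset of the columns is used, the time dimension is named `time_dim` for clarity, and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, day INTEGER, month INTEGER, quarter TEXT, year INTEGER);
CREATE TABLE item (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, city TEXT, country TEXT);
-- Fact table: foreign keys to the dimensions plus the measures.
CREATE TABLE sales (
    time_key INTEGER REFERENCES time_dim(time_key),
    item_key INTEGER REFERENCES item(item_key),
    location_key INTEGER REFERENCES location(location_key),
    units_sold INTEGER,
    dollars_sold REAL,
    PRIMARY KEY (time_key, item_key, location_key)
);
INSERT INTO time_dim VALUES (1, 5, 1, 'Q1', 2024), (2, 9, 4, 'Q2', 2024);
INSERT INTO item VALUES (1, 'TV-40', 'Acme', 'tv'), (2, 'Phone-X', 'Acme', 'phone');
INSERT INTO location VALUES (1, 'Pune', 'India'), (2, 'Mumbai', 'India');
INSERT INTO sales VALUES (1, 1, 1, 3, 900.0), (1, 2, 2, 5, 1500.0), (2, 1, 2, 2, 600.0);
""")

# Star join: the fact table joins directly to each dimension it needs.
star_join = conn.execute("""
    SELECT t.quarter, i.type, SUM(s.dollars_sold)
    FROM sales AS s
    JOIN time_dim AS t ON t.time_key = s.time_key
    JOIN item AS i ON i.item_key = s.item_key
    GROUP BY t.quarter, i.type
    ORDER BY t.quarter, i.type
""").fetchall()
```

The composite primary key of `sales` is made up of its foreign keys, which matches the description of the fact table key given earlier in the chapter.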
Example of Snowflake Schema

Sales fact table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
time dimension: time_key, day, day_of_the_week, month, quarter, year
item dimension: item_key, item_name, brand, type, supplier_key
supplier dimension: supplier_key, supplier_type
branch dimension: branch_key, branch_name, branch_type
location dimension: location_key, street, city_key
city dimension: city_key, city, province, country

Example of Fact Constellation

Sales fact table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
Shipping fact table: time_key, item_key, shipper_key, from_location, to_location; measures: dollars_cost, units_shipped
time dimension: time_key, day, day_of_the_week, month, quarter, year
item dimension: item_key, item_name, brand, type, supplier_type
branch dimension: branch_key, branch_name, branch_type
location dimension: location_key, street, city, province_or_state, country
shipper dimension: shipper_key, shipper_name, location_key, shipper_type

Dimensional data modeling

• Another important feature of a DM is that all natural keys are replaced with surrogate keys. This means that every join between fact and dimension tables is based on surrogate keys, not natural keys. The use of surrogate keys allows the data in the warehouse to have some independence from the data used and produced by the OLTP systems.

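Surrogate key assignment can be sketched as a lookup that hands out warehouse-generated integers. The class name `SurrogateKeyMap`, the source system names and the sample rows are hypothetical.

```python
import itertools


class SurrogateKeyMap:
    """Assigns a warehouse-generated integer key to each natural (source) key."""

    def __init__(self):
        self._next = itertools.count(1)
        self._keys = {}

    def key_for(self, natural_key):
        # Reuse the existing surrogate key, or allocate the next one.
        if natural_key not in self._keys:
            self._keys[natural_key] = next(self._next)
        return self._keys[natural_key]


customers = SurrogateKeyMap()

# Hypothetical source rows: (source system, natural customer id, amount).
source_rows = [("crm", "C-001", 120.0), ("billing", "9917", 35.0), ("crm", "C-001", 60.0)]

# Fact rows carry only the surrogate key; the natural key stays in the dimension.
fact_rows = [(customers.key_for((system, cust_id)), amount)
             for system, cust_id, amount in source_rows]
```

Because the fact rows never store the source identifiers, a change of identifier format in a source system only affects the mapping, not the fact table.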
OLAP Technology & Data cubes

OLAP can be a valuable and rewarding business tool. Aside from producing reports, OLAP
analysis can help an organization evaluate balanced scorecard targets.

OLAP Technology
To obtain answers to business questions from a data model, OLAP cubes are created. OLAP cubes are not strictly cuboids; "cube" is the name given to the process of linking data from the different dimensions. Cubes can be developed along business units such as sales or marketing, or a giant cube can be formed with all the dimensions.

[Figure: OLAP Cube with Time, Customer and Product Dimensions]

OLAP Technology for Data Mining

OLAP tools enable users to interactively analyze multidimensional data from multiple perspectives. OLAP consists of three basic analytical operations: consolidation (roll-up), drill-down, and slicing and dicing.

Consolidation involves the aggregation of data that can be accumulated and computed in one or more dimensions. For example, all sales offices are rolled up to the sales department or sales division to anticipate sales trends.

In contrast, drill-down is a technique that allows users to navigate through the details. For instance, users can view the sales by the individual products that make up a region's sales.

Slicing and dicing is a feature whereby users can take out (slicing) a specific set of data from the OLAP cube and view (dicing) the slices from different viewpoints.

OLAP Server Architectures: MOLAP, ROLAP, HOLAP

OLAP storage is one of the critical choices to be made when designing the solution. OLAP storage comes in three forms:

1) MOLAP - Multidimensional OLAP. In MOLAP, both the source data and the aggregations are stored in a multidimensional format. MOLAP is the fastest option for data retrieval, but requires the most disk space. Disk space is less of a concern these days, given falling storage and processing costs.

2) ROLAP - Relational OLAP. All data, including the aggregations, is stored within the source relational database. This will be a concern for larger data warehousing implementations which have higher usage needs. ROLAP is the slowest for data retrieval. Whether an aggregation exists or not, a ROLAP database must access the data warehouse itself. ROLAP is best suited for smaller data warehousing implementations.

OLAP Server Architectures (cont.)

3) HOLAP - Hybrid OLAP. HOLAP is a combination of both of the above storage methodologies. HOLAP databases store the aggregations within a multidimensional structure, leaving the cell-level data itself in a relational form. Where the data is pre-aggregated, HOLAP offers the performance of MOLAP; where the data must be fetched from the tables, HOLAP is as slow as ROLAP.

Due to shrinking hardware and processing costs, MOLAP is generally used most often. HOLAP is a better choice if the solution accesses a stand-alone database. ROLAP is more convenient to set up when the query demands are relatively low, also on a stand-alone database.

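The HOLAP idea can be sketched as a query router: precomputed aggregates are answered from an in-memory structure, and anything finer falls back to the relational detail table. This is only an illustration of the storage split; the table `sales`, the dictionary `region_totals` and the function `query_total` are assumptions and do not reflect any particular OLAP server.

```python
import sqlite3

# Relational store holding the cell-level (detail) data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [("east", "tv", 100.0), ("east", "phone", 50.0), ("west", "tv", 70.0)])

# Multidimensional store holding precomputed aggregates (here: totals by region).
region_totals = dict(conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region"))


def query_total(region, product=None):
    """Answer from the aggregate store when possible, else from the detail table."""
    if product is None:
        return region_totals[region], "aggregate store"
    row = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM sales WHERE region = ? AND product = ?",
        (region, product),
    ).fetchone()
    return row[0], "relational detail"
```

For example, `query_total("east")` is served from the aggregates, while `query_total("east", "tv")` has to read the relational table.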
OLAP Server Architectures (cont.)

• What is the difference between OLAP, ROLAP & MOLAP?
• OLAP is an abbreviation for On-line Analytical Processing. The main OLAP types are:
  • MOLAP: Multidimensional OLAP, enabling OLAP by providing cubes.
  • ROLAP: Relational OLAP, enabling OLAP using a relational database management system.
  • DOLAP: Desktop OLAP, enabling OLAP functionality on the local computer.
  • HOLAP: Hybrid OLAP, a combination of the MOLAP and ROLAP storage methodologies.

OLAP Server Architectures(Cont..)

Online analytical processing (OLAP)

For OLAP systems, response time is an effectiveness measure. OLAP applications are widely used by data mining techniques. OLAP databases store aggregated, historical data in multi-dimensional schemas (usually star schemas). OLAP systems typically have a data latency of a few hours, as opposed to data marts, where latency is expected to be closer to one day.

Operational DBMS vs. Data Warehouse (OLTP vs. OLAP)

Feature            | OLTP                                                      | OLAP
users              | clerk, IT professional                                    | knowledge worker
function           | day-to-day operations                                     | decision support
DB design          | application-oriented                                      | subject-oriented
data               | current, up-to-date; detailed, flat relational; isolated  | historical; summarized, multidimensional; integrated, consolidated
usage              | repetitive                                                | ad hoc
access             | read/write; index/hash on primary key                     | lots of scans
unit of work       | short, simple transaction                                 | complex query
# records accessed | tens                                                      | millions
# users            | thousands                                                 | hundreds
DB size            | 100 MB to GB                                              | 100 GB to TB
metric             | transaction throughput                                    | query throughput, response time

From OLAP to OLAM (on-line analytical mining)


A multi-dimensional data model
• A data warehouse is based on a multidimensional data model, which views data in the form of a data cube.
3D Data cube Example
• A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions.
• Suppose AllElectronics creates a sales data warehouse with respect to the dimensions:
  • Time
  • Item
  • Location

[Figure: 3D Data cube Example]

4D Data cube Example
• A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions.
• Suppose AllElectronics creates a sales data warehouse with respect to the dimensions:
  • Time
  • Item
  • Location
  • Supplier

[Figure: 4D Data cube Example]

Relational Database Model

      | Attribute 1 | Attribute 2 | Attribute 3 | Attribute 4
      | Name        | Age         | Gender      | Emp No.
Row 1 | Anderson    | 31          | F           | 1001
Row 2 | Green       | 42          | M           | 1007
Row 3 | Lee         | 22          | M           | 1010
Row 4 | Ramos       | 32          | F           | 1020

The table above illustrates the employee relation.

Multidimensional Database Model

[Figure: SALES and FINANCE cubes, with the dimension labels Customer, Store, Time, Product and GL_Line]

The data is found at the intersection of dimensions.

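The idea that a value sits at the intersection of dimensions can be sketched with a dictionary keyed by one coordinate per dimension. The dimension names, the sample cells and the helper `cell` are assumptions made for the example.

```python
# A tiny cube: each cell is addressed by one value per dimension.
DIMENSIONS = ("time", "product", "store")

# Hypothetical measure values keyed by (time, product, store).
sales_cube = {
    ("Q1", "tv", "pune"): 120,
    ("Q1", "phone", "pune"): 80,
    ("Q2", "tv", "mumbai"): 95,
}


def cell(cube, **coords):
    """Return the measure at the intersection of the given dimension values."""
    key = tuple(coords[d] for d in DIMENSIONS)
    return cube.get(key)  # None when no fact exists at that intersection


q1_tv_pune = cell(sales_cube, time="Q1", product="tv", store="pune")
```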
[Figure: Two dimensions]

[Figure: Three dimensions]

Operations on cubes
Conceiving data as a cube with hierarchical dimensions leads to conceptually straightforward operations to facilitate analysis. Aligning the data content with a familiar visualization enhances analyst learning and productivity. The user-initiated process of navigating by calling for page displays interactively, through the specification of slices via rotations and drill down, is sometimes called "slice and dice". Common operations include slice and dice, drill down, roll up, and pivot.

OLAP slicing
Slice is the act of picking a rectangular subset of a cube by choosing a single value for one of its dimensions, creating a new cube with one fewer dimension.

The picture shows a slicing operation: the sales figures of all sales regions and all product categories of the company in the year 2004 are "sliced" out of the data cube.

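A slice can be sketched on a dictionary-based cube. The cube contents and the function name `slice_cube` are illustrative assumptions; the year 2004 follows the example in the text.

```python
# Hypothetical cube keyed by (year, region, category).
cube = {
    (2003, "north", "tv"): 10,
    (2004, "north", "tv"): 12,
    (2004, "south", "tv"): 7,
    (2004, "south", "radio"): 3,
}


def slice_cube(cube, axis, value):
    """Fix one dimension to a single value and drop it from the cell keys."""
    return {
        key[:axis] + key[axis + 1:]: measure
        for key, measure in cube.items()
        if key[axis] == value
    }


sales_2004 = slice_cube(cube, axis=0, value=2004)
```

The result is keyed by (region, category) only, which is the "one fewer dimension" property of a slice.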
OLAP dicing
Dice: The dice operation produces a subcube by allowing the analyst to pick specific values of multiple dimensions.

The picture shows a dicing operation: the new cube shows the sales figures of a limited number of product categories, while the time and region dimensions cover the same range as before.

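A dice can be sketched as a filter that keeps all dimensions but restricts the allowed values on some of them. The cube contents and the function name `dice` are assumptions for illustration.

```python
# Hypothetical cube keyed by (year, region, category).
cube = {
    (2003, "north", "tv"): 10,
    (2004, "north", "tv"): 12,
    (2004, "south", "radio"): 3,
    (2004, "south", "phone"): 9,
}


def dice(cube, allowed):
    """Keep cells whose value on each listed axis is in the allowed set for that axis."""
    return {
        key: measure
        for key, measure in cube.items()
        if all(key[axis] in values for axis, values in allowed.items())
    }


# Restrict the category axis (index 2) only; time and region keep their full range.
tv_and_radio = dice(cube, {2: {"tv", "radio"}})
```

Unlike a slice, the subcube keeps the same number of dimensions as the original cube.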
OLAP Drill down & up

Drill down / drill up:
The two basic hierarchical operations when displaying data at multiple levels of aggregation are the "drill-down" and "roll-up" operations. Drill-down refers to the process of viewing data at a level of increased detail, while roll-up refers to the process of viewing data with decreasing detail. Drill-down goes from a higher-level summary to a lower-level summary or detailed data, or introduces new dimensions.

The picture shows a drill-down operation: the analyst moves from the summary category "TV" to see the sales figures for the individual products.

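Roll-up and drill-down can be sketched over a two-level product hierarchy. The product names, the mapping `CATEGORY_OF` and the sample figures are hypothetical; the "TV" category follows the example in the text.

```python
from collections import defaultdict

# Hypothetical product hierarchy: individual product -> summary category.
CATEGORY_OF = {"tv-32": "TV", "tv-55": "TV", "radio-a": "Radio"}

# Detailed data at the product level.
product_sales = {"tv-32": 40, "tv-55": 25, "radio-a": 10}


def roll_up(detail):
    """Summarize product-level figures into category totals (less detail)."""
    totals = defaultdict(int)
    for product, amount in detail.items():
        totals[CATEGORY_OF[product]] += amount
    return dict(totals)


def drill_down(detail, category):
    """Show the product-level figures behind one category (more detail)."""
    return {p: a for p, a in detail.items() if CATEGORY_OF[p] == category}


category_sales = roll_up(product_sales)
tv_products = drill_down(product_sales, "TV")
```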
OLAP pivoting
Pivot allows an analyst to rotate the cube in space to see its various faces. For example, cities could be arranged vertically and products horizontally while viewing data for a particular quarter. Pivoting could replace products with time periods to see data across time for a single product.
The picture shows a pivoting operation: the whole cube is rotated, giving another perspective on the data.

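On a two-dimensional view, a pivot amounts to swapping the row and column axes. The sketch below assumes a nested-dictionary table with invented city and product values.

```python
# Hypothetical 2D view for one quarter: rows are cities, columns are products.
view = {
    "pune": {"tv": 5, "phone": 8},
    "mumbai": {"tv": 7, "phone": 2},
}


def pivot(table):
    """Swap the row and column axes of a nested-dict table."""
    rotated = {}
    for row_label, columns in table.items():
        for col_label, value in columns.items():
            rotated.setdefault(col_label, {})[row_label] = value
    return rotated


by_product = pivot(view)
```

Pivoting twice returns the original arrangement, which reflects that the operation changes only the orientation of the data, not its content.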
Roll-up and Drill Down

Hierarchy levels, from high-level aggregation (low level of detail) to the most detailed level:
• Sales Channel
• Region
• Country
• State
• Location Address
• Sales Representative

Typical OLAP Operations
• Slice and dice: project and select
• Pivot (rotate): reorient the cube for visualization, 3D to a series of 2D planes
Other operations
• What are drill up, drill down, drill by, drill through and drill across?
  • Drill up: one level up in the hierarchy
  • Drill down: one level down in the hierarchy
  • Drill by: direct selection of a level in the hierarchy
  • Drill through: drilling data from one hierarchy to another hierarchy
  • Drill across: involving (across) more than one fact table

Browsing a Data Cube

• Visualization
• OLAP capabilities
• Interactive manipulation

[Figure: Example of Cube. Oracle 11g, data warehouse and OLAP]

[Figure: Example of Cube]

The Complete Decision Support System

[Figure: three-tier decision support system. Information sources (semistructured sources, operational DBs) are extracted, transformed, loaded and refreshed into the data warehouse server (Tier 1) and its data marts; these serve the OLAP servers (Tier 2), e.g. MOLAP and ROLAP; the OLAP servers serve the clients (Tier 3): OLAP, query/reporting and data mining.]

Multi-Tiered Architecture

[Figure: multi-tiered architecture. Data sources (operational DBs, OLTP and other sources) pass through extract, transform, load and refresh, with a monitor & integrator and metadata, into data storage (data warehouse and data marts); this serves the OLAP engine (OLAP server), which serves the front-end tools (analysis, query/reporting, data mining). The layers are labeled source layer, transformation layer and presentation layer.]

Data Warehouse Usage
• Three kinds of data warehouse applications
  • Information processing
    • supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs
  • Analytical processing
    • multidimensional analysis of data warehouse data
    • supports basic OLAP operations: slice-dice, drilling, pivoting
  • Data mining
    • knowledge discovery from hidden patterns
    • supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools

Benefits of data warehousing
• Time savings
  • For data suppliers and for users
• More and better information
• More cost-effective decision-making
• Improvement of business processes
• Support for the accomplishment of strategic business objectives
• Better enterprise intelligence
• Information system reengineering

Disadvantages of DW

• Over their life, DWs can have high costs.
• The data warehouse is usually not static.
• DWs may contain unstructured data.
• Data must be extracted, transformed and loaded.
• There is a cost of delivering suboptimal information to the organization.
• Maintenance costs are high.
• Data warehouse security issues.

Summary
• Data warehouse
  • A subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process
• A multi-dimensional model of a data warehouse
  • Star schema, snowflake schema, fact constellations
  • A data cube consists of dimensions & measures
• OLAP operations: drilling, rolling, slicing, dicing and pivoting
• OLAP servers: ROLAP, MOLAP, HOLAP
• Further development of data cube technology
  • Discovery-driven and multi-feature cubes
  • From OLAP to OLAM (on-line analytical mining)
• Data preprocessing and major tasks

Questions

Q.1) What is a data warehouse? Explain data warehouse architecture in detail.
Q.2) Explain dimensional data modeling in detail.
Q.3) What is OLAP? What are the three types of OLAP servers?
Q.4) Write short notes on OLAP.
Q.5) Write short notes on data cubes.
Q.6) Write short notes on operations on cubes.
