What Is A Data Warehouse?

The document defines a data warehouse as a subject-oriented, integrated, non-volatile collection of historical data used to support management decision making. It contains summarized data from multiple operational databases. A data warehouse is organized around subjects like customers, products, sales rather than day-to-day operations. It provides a time-variant view of integrated data to support analysis and strategic decision making. Developing a data warehouse involves iterative planning, prototyping, and implementation to meet evolving analytical needs.

Uploaded by

Kishori Patil

What is a Data Warehouse?

Definition
A data warehouse is a stand-alone repository
of information integrated from several, possibly
heterogeneous, operational databases.
Bill Inmon's definition: a data warehouse is a subject-oriented, integrated, non-volatile and
time-variant collection of data in support of
management's decision-making process.
It contains integrated, granular, historical data.

Application-Orientation vs. Subject-Orientation

Application-Orientation (Operational Database): Loans, Credit Card, Trust, Savings
Subject-Orientation (Data Warehouse): Customer, Vendor, Product, Activity

Basic Features of Data Warehousing


Subject-oriented: A data warehouse is organized around major
subjects, such as customer, vendor, product, and
sales. It focuses on the modeling and analysis of
data rather than day-to-day business operations.
Integrated: A data warehouse is constructed by
integrating multiple heterogeneous data sources.
Time variant: A data warehouse is a repository of
historical data. It gives the view of the data for a
designated time frame.
Non-volatile: A data warehouse is always a
physically separate store of data transformed from the
application data found in the operational environment.
Due to this

Operational DBMS
A relational DBMS is often used for OLTP (online transaction processing).
It deals with day-to-day operations such as
banking, purchasing, manufacturing,
registration, accounting, etc.
These systems typically get data into the
database.
Each transaction processes information
about a single entity.
Following are some examples of OLTP
queries:
What is the price of a 2 GB Kingston pen
drive?

Data Warehouse
The major purpose of maintaining a data
warehouse is OLAP (online analytical processing).
Data warehousing systems are used for data
analysis and the strategic decision-making process.
Following are some examples of OLAP queries:
How is the student job placement rate
changing over the years across different
colleges of the HSNC board?
Is it financially viable to continue the
production unit at location X?
Since OLAP queries involve a large amount of
aggregation, a large amount of data must be read
before these queries can be answered conclusively.
The purpose of these queries is to support

Feature           Operational Systems         Decision Support System
Data              Current values              Summarized historical data
Data structure    Optimized for transactions  Optimized for queries
Access frequency  High                        Medium to low
Access type       Read, update, insert        Read
Usage             Predictable, repetitive     Ad hoc, random, heuristic
Response time     Sub-seconds                 Several seconds to minutes
Users             Large number                Relatively small

Operational Systems (OLTP) vs Data Warehouse

OLTP                                    Warehouse (DSS)
Application oriented                    Subject oriented
Used to run business                    Used to analyze business
Detailed data                           Summarized and refined
Current, up to date                     Snapshot data
Isolated data                           Integrated data
Repetitive access                       Ad hoc access
Clerical user                           Knowledge user (manager)
Performance sensitive                   Performance relaxed
Few records accessed at a time (tens)   Large volumes accessed at a time (millions)
Read/update access                      Mostly read
Database size 100 MB - 100 GB           Database size 100 GB - few terabytes
Thousands of users                      Hundreds of users
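
The contrast between the two access patterns can be sketched in a few lines. This is only an illustration using Python's built-in sqlite3 module; the `sales` table, its columns and its rows are hypothetical and do not come from the slides.

```python
import sqlite3

# Hypothetical sales table; names and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, item TEXT, year INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales (item, year, amount) VALUES (?, ?, ?)",
    [
        ("pen drive", 2022, 10.0),
        ("pen drive", 2023, 12.0),
        ("keyboard", 2022, 25.0),
        ("keyboard", 2023, 30.0),
        ("keyboard", 2023, 20.0),
    ],
)

# OLTP-style access: read a single record through its key.
oltp_row = conn.execute(
    "SELECT item, amount FROM sales WHERE sale_id = ?", (1,)
).fetchone()

# OLAP-style access: scan many records and aggregate them.
olap_rows = conn.execute(
    "SELECT year, SUM(amount) FROM sales GROUP BY year ORDER BY year"
).fetchall()

print(oltp_row)
print(olap_rows)
```

The first query touches one row, which is why OLTP systems can answer in sub-seconds. The second has to read every row in the table, which is the behaviour the warehouse is optimized for.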

Data Warehouse Architecture


Following are the various parts of a Data
Warehouse:

Source System
Source Data Transport Layer
Data Quality Control and Data Profiling Layer
Metadata Management Layer
Data Integration Layer
Data Processing Layer
End User Reporting Layer

Source System
The source systems are operational systems that
feed data into the data warehouse.
The databases in operational systems are
designed to handle business transactions.
Such databases are poorly suited to providing data
for management or other informational purposes.
Therefore organizations require a data warehouse
that integrates data from several operational
systems to understand customers, operations,
financial situation, product performance, trends
and a host of key business measurements.

A data warehouse integrates data from
several operational systems and combines it with
information from other, often external, sources of
data.
It is essential to identify the right data
sources and determine an efficient process to
collect the facts.

Source Data Transport Layer


This layer of the data warehouse architecture is
concerned with the transmission of data from the
source system to the enterprise warehouse system.
There are various tools and processes involved in
transporting data from the source system to the
enterprise warehouse system.

Data Quality Control and Data Profiling Layer
The data in a data warehouse must be complete
and accurate.
So the quality of the data must be examined before
the source system data is loaded into the data
warehouse.
The data profiling process helps catch data quality
problems before they are introduced into the data
warehouse.
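
A minimal sketch of what a profiling check might look like before loading, assuming source records arrive as Python dictionaries. The `profile` function, the field names and the sample rows are hypothetical; real profiling tools check far more than this.

```python
from collections import Counter

def profile(rows, key, required):
    """Return simple quality metrics for a list of dict records."""
    # Count empty or missing values for each required column.
    missing = {
        col: sum(1 for r in rows if r.get(col) in (None, ""))
        for col in required
    }
    # Find key values that occur more than once.
    key_counts = Counter(r.get(key) for r in rows)
    duplicate_keys = sorted(k for k, n in key_counts.items() if n > 1)
    return {
        "row_count": len(rows),
        "missing": missing,
        "duplicate_keys": duplicate_keys,
    }

# Hypothetical extract from a source system.
source_rows = [
    {"customer_id": 1, "name": "Asha", "city": "Mumbai"},
    {"customer_id": 2, "name": "", "city": "Pune"},
    {"customer_id": 2, "name": "Ravi", "city": None},
]

report = profile(source_rows, key="customer_id", required=["name", "city"])
print(report)
```

A load process could refuse to continue, or route rows to a review queue, when the report shows missing values or duplicate keys.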

Metadata Management Layer

Metadata is the auxiliary descriptive data that
exists to tell the user and the analyst where data
is in the DW 2.0 environment.
Metadata describes the actual data.
It is important for designing, constructing,
retrieving and controlling the data warehouse.

Data Integration Layer


Integration is the process of aggregating data
from different data sources to create a unified
view of these data.
A lot of formatting and cleaning activities happen
in this layer so that the data is consistent across
the enterprise.
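
The formatting and cleaning step can be illustrated with a small sketch. It assumes two hypothetical source systems (`loans_system` and `cards_system`) that describe customers with different field names, date formats and gender codes; none of these names come from the slides.

```python
from datetime import datetime

# Two hypothetical source systems with different field names and formats.
loans_system = [{"cust": "C001", "opened": "2023-01-15", "gender": "M"}]
cards_system = [{"customer_no": "c002", "open_dt": "15/02/2023", "sex": "female"}]

# Mapping from the codes used by the sources to one warehouse-wide code.
GENDER_CODES = {"m": "M", "male": "M", "f": "F", "female": "F"}

def unify(customer_id, opened, date_format, gender, source):
    """Convert one source record to the common warehouse format."""
    return {
        "customer_id": customer_id.upper(),
        "opened": datetime.strptime(opened, date_format).date().isoformat(),
        "gender": GENDER_CODES[gender.lower()],
        "source": source,
    }

unified = [
    unify(r["cust"], r["opened"], "%Y-%m-%d", r["gender"], "loans")
    for r in loans_system
] + [
    unify(r["customer_no"], r["open_dt"], "%d/%m/%Y", r["sex"], "cards")
    for r in cards_system
]

for row in unified:
    print(row)
```

After this step both records share the same keys, the same ISO date format and the same gender coding, which is what makes a unified view possible.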

Data Processing Layer


This layer consists of data staging and the enterprise
warehouse.
Data staging often involves complex
programming, but increasingly warehouse tools
are being created that help in this process.
Staging may also involve data quality analysis
programs and filters that identify patterns and
structures within existing operational data.

End User Reporting Layer

Data Warehouse Options


There are many ways to develop an enterprise
data warehouse.
Following are the key factors to be considered
while developing a data warehouse.
Scope
Data redundancy
Type of end user

Scope
The scope of a data warehouse project can lie at either of
two extremes:
Very broad scope (integrating all enterprise
data from the beginning of time)
Very narrow scope (developing only a personal
data warehouse for a single manager for a single
year)
If the data warehouse is developed by taking all
informational data for the entire enterprise from the
beginning of time, then it is very expensive and takes a
large amount of time to build.
Therefore most organizations develop inexpensive
departmental data warehouses as first steps towards

Data Redundancy
There are essentially three levels of data redundancy
that enterprises should think about when
considering their data warehouse options:
"Virtual" or "Point-to-Point" Data Warehouses
Central Data Warehouses
Distributed Data Warehouses

Virtual Data Warehouses


This option provides end users with direct access to
multiple operational databases through middleware
tools.
The advantages of this approach are:
Flexibility
No data redundancy
Provides end users with the
most current corporate
information

Central Data Warehouses


It is a single physical repository that contains all
data for a specific functional area, department,
division or enterprise.
A central data warehouse may contain information
from multiple operational systems.
A central data warehouse contains time-variant data.
The advantages of this approach are:
Security
Ease of management
The disadvantages are:
Performance implications
Expansion is expensive
At times unreliable
Cost

Distributed Data Warehouses


Distributed data warehouses are those in which
certain components are distributed across a number
of different physical databases.
Increasingly, large organizations are pushing
decision-making down to lower and lower levels of
the organization and in turn pushing the data
needed for decision making down (or out) to the LAN
or local computer serving the local decision-maker.
Distributed Data Warehouses usually involve the
most redundant data.

Inmon's approach to Distributed Data Warehouses

Type of End-User
Executives and managers
Power users (business and financial analysts,
engineers)
Support users (clerical, administrative)

Why developing a Data Warehouse is a different approach
To build a classical operational system, developers
need to gather all requirements and build a complete
system all at once.
This approach is well suited for the operational
application environment where processes are run
repetitively, and where complete requirements can be
gathered before a system is built.
Unlike a classical operational system, a data warehouse
is built iteratively, a step at a time.
The first reason for this approach is that data
warehouse projects tend to be large.
Another reason for this approach is due to the fact
that the requirements for a data warehouse are not
known when it is first built.
This is because the end users of the data warehouse
do not know exactly what they want.

Developing Data Warehouses


Developing a data warehouse involves activities
such as careful planning, requirements definition,
design, prototyping and implementation.

Developing Strategy
The first and foremost step in developing a data
warehouse is formulating a strategy that is
appropriate for the organization's needs and its user population.
There are a number of strategies by which
organizations can get into data warehousing.
One way is to establish a "Virtual Data Warehouse"
environment by:
Installing a set of data access, data directory and
process management facilities
Training the end-users
Monitoring how the data warehouse facilities are
actually used
Based on actual usage, create a physical data
warehouse to support the high-frequency
requests.

Another strategy to develop a data warehouse is as
follows:
Duplicate the operational data from a single
operational system
and provide data warehouse features on the copy with
the help of a series of information access tools.

Ultimately, the optimal data warehousing strategy is
to select a user population based on value to the
enterprise and do an analysis of their issues,
questions and data access needs.
Based on these needs, prototype data warehouses
are built and populated so the end-users can
experiment and modify their requirements.
Based on the requirements data can be extracted
from various operational systems.
The needs of various enterprises are different and
hence there is no one approach to build a data
warehouse that will fit the needs of every enterprise.

Data Warehouse Design Consideration and Dimensional Modeling

Defining Dimensional Model (Star Schema Model)
The central theme of a dimensional model is the star
schema.
It represents multidimensional data.
A star schema consists of:
a central fact table containing measures
and a set of dimension tables.
In the star schema model, the fact table is at the center of
the star and the dimension tables are the points of the
star.

A star schema represents one central set of facts.
The dimension tables contain descriptions of
each aspect of those facts.
For example, in a warehouse that stores sales data,
a sales fact table stores facts about sales,
while dimension tables store data about locations,
clients, items, times and branches.
The primary key in each dimension table is related
to a foreign key in the fact table.
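
The sales example above can be sketched as a tiny star schema. This is an illustration only, built with Python's sqlite3 module; the table names (`fact_sales`, `dim_item`, `dim_location`, `dim_time`), the columns and the rows are assumptions, and only three of the dimensions mentioned above are modelled to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: one row per item, location and time period.
CREATE TABLE dim_item     (item_id INTEGER PRIMARY KEY, item_name TEXT);
CREATE TABLE dim_location (location_id INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE dim_time     (time_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);

-- Fact table: foreign keys to each dimension plus the measures.
CREATE TABLE fact_sales (
    item_id     INTEGER REFERENCES dim_item(item_id),
    location_id INTEGER REFERENCES dim_location(location_id),
    time_id     INTEGER REFERENCES dim_time(time_id),
    quantity    INTEGER,
    amount      REAL
);

INSERT INTO dim_item VALUES (1, 'Pen drive'), (2, 'Keyboard');
INSERT INTO dim_location VALUES (1, 'Mumbai'), (2, 'Pune');
INSERT INTO dim_time VALUES (1, 2023, 1), (2, 2023, 2);
INSERT INTO fact_sales VALUES
    (1, 1, 1, 3, 30.0),
    (1, 2, 2, 2, 20.0),
    (2, 1, 2, 1, 25.0);
""")

# A typical star join: aggregate the measures, grouped by a dimension attribute.
sales_by_city = conn.execute("""
    SELECT l.city, SUM(f.quantity), SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_location AS l ON l.location_id = f.location_id
    GROUP BY l.city
    ORDER BY l.city
""").fetchall()

print(sales_by_city)
```

Every analytical query follows the same shape: join the fact table to one or more dimension tables on their keys, then group by the descriptive columns of interest.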

Star Schema for Sales

One star can contain multiple facts.

Star Schema for Sales

Snowflake Schema
It is a modified version of the star schema.
In a star schema, if a dimension is complex and contains relationships
such as hierarchies, it is compressed or flattened into a single
dimension table.
Like the star schema model, the snowflake schema also represents a
dimensional model containing a central fact table and a set of
constituent dimension tables.
The major difference is that in a snowflake schema complex
dimensions are normalized into sub-dimension tables.
In a snowflake schema implementation, Warehouse Builder uses
more than one table or view to store the dimension data.
Separate tables store data pertaining to each level in the
dimension.
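
The difference between a flattened and a normalized dimension can be sketched as follows. The example uses Python's sqlite3 module and a hypothetical location dimension with a city-to-state hierarchy; all table and column names are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Star style: the city/state hierarchy is flattened into one table.
CREATE TABLE dim_location_flat (
    location_id INTEGER PRIMARY KEY, city TEXT, state TEXT
);
INSERT INTO dim_location_flat VALUES
    (1, 'Mumbai', 'Maharashtra'),
    (2, 'Pune',   'Maharashtra'),
    (3, 'Surat',  'Gujarat');

-- Snowflake style: each level of the hierarchy gets its own table.
CREATE TABLE dim_state (state_id INTEGER PRIMARY KEY, state TEXT UNIQUE);
CREATE TABLE dim_city (
    city_id  INTEGER PRIMARY KEY,
    city     TEXT,
    state_id INTEGER REFERENCES dim_state(state_id)
);
INSERT INTO dim_state (state) SELECT DISTINCT state FROM dim_location_flat;
INSERT INTO dim_city (city_id, city, state_id)
    SELECT f.location_id, f.city, s.state_id
    FROM dim_location_flat AS f
    JOIN dim_state AS s ON s.state = f.state;
""")

# Each state name is now stored once instead of once per city.
state_count = conn.execute("SELECT COUNT(*) FROM dim_state").fetchone()[0]

# Reading the full dimension back requires an extra join.
rebuilt = conn.execute("""
    SELECT c.city_id, c.city, s.state
    FROM dim_city AS c
    JOIN dim_state AS s ON s.state_id = c.state_id
    ORDER BY c.city_id
""").fetchall()

print(state_count)
print(rebuilt)
```

The trade-off is visible here: the snowflake form removes the repeated state names, but queries that need the state must join one more table.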

Star Schema Vs. Snowflake Schema

Star Schema

Snowflake Schema

Granularity of Facts

Granularity refers to the level of detail in a fact
table.
The highest level of granularity is always advisable.
For the highest level of granularity, data should be kept
at the most detailed level.
Low granularity means data is summarized or
aggregated.
Each non-key column in a fact table must be at the
same level of granularity.
Say for example following levels of granularity can
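
One way to see why detailed data is preferred: a fine-grained fact table can always be rolled up to a coarser grain, but the reverse is not possible. The sketch below assumes hypothetical fact rows stored at daily grain; the data and the `roll_up_to_month` helper are illustrative only.

```python
from collections import defaultdict

# Hypothetical fact rows at daily grain: (date, item, quantity).
daily_facts = [
    ("2023-01-05", "pen drive", 3),
    ("2023-01-20", "pen drive", 2),
    ("2023-02-02", "pen drive", 4),
]

def roll_up_to_month(rows):
    """Aggregate daily rows to (month, item) grain."""
    totals = defaultdict(int)
    for date, item, quantity in rows:
        month = date[:7]  # "YYYY-MM" prefix of an ISO date
        totals[(month, item)] += quantity
    return dict(totals)

monthly_facts = roll_up_to_month(daily_facts)
print(monthly_facts)
```

Once only `monthly_facts` is stored, the split between 5 January and 20 January can no longer be recovered, so any question about individual days becomes unanswerable.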

Additivity of Facts

A fact is something measurable, typically a
numerical value that can be aggregated.
Following are the three types of facts:
Additive: facts that are additive across all
dimensions.
Semi-Additive: facts that are additive across some
of the dimensions, but not all.
Non-Additive: facts that are not additive across
any dimension.
Additive facts are the most useful facts.

In general, facts representing individual transactions
are fully additive, although cumulative totals are semi-additive.
Non-additive facts are usually the result of ratios or
other calculations.

Additivity of Facts
Example of Additive Fact:
Dimensions: Time, Customer, Item, Location, Branch
Fact: Quantity
The Quantity can be summed up over all of the
dimensions (Time, Customer, Item, Location, Branch).

Additivity of Facts
Example of Semi-Additive Fact:
Suppose a bank has the following table to store the
current balance by account at the end of each day.
Dimensions: Date, Account
Fact: Current_Balance
The balance cannot be summed up across the Time
dimension. It does not make sense to sum the
current balance by date.
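
The bank example can be worked through with a few numbers. The balances below are hypothetical; the point is only to show which aggregations are meaningful for a semi-additive fact.

```python
# Hypothetical end-of-day balances: (date, account, current_balance).
balances = [
    ("2023-01-01", "A", 100.0),
    ("2023-01-01", "B", 50.0),
    ("2023-01-02", "A", 120.0),
    ("2023-01-02", "B", 50.0),
]

# Summing across accounts for one date is meaningful:
# it is the total amount held by the bank on that day.
total_on_jan_2 = sum(b for d, _, b in balances if d == "2023-01-02")

# Summing one account across dates is not meaningful:
# 100 + 120 is not a balance that account A ever had.
misleading_sum_a = sum(b for _, a, b in balances if a == "A")

# Across time, use the latest value or an average instead.
rows_a = sorted((d, b) for d, a, b in balances if a == "A")
closing_balance_a = rows_a[-1][1]
average_balance_a = sum(b for _, b in rows_a) / len(rows_a)

print(total_on_jan_2, misleading_sum_a, closing_balance_a, average_balance_a)
```

Reporting tools usually handle semi-additive facts this way: sum across the additive dimensions, and apply "last value" or "average" across time.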

Additivity of Facts
Example of Non-Additive Fact:
Dimensions: Time, Customer, Item, Location, Branch
Facts: Price, Net_Profit_Margin
The Price and Net_Profit_Margin cannot be summed
up across any dimension.
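
A short numeric sketch of why a ratio such as a profit margin is non-additive. The branch figures are hypothetical, and margin is assumed here to mean profit divided by revenue.

```python
# Hypothetical fact rows: (branch, revenue, profit).
rows = [("North", 200.0, 20.0), ("South", 800.0, 240.0)]

# Per-branch margin = profit / revenue (10% and 30%).
margins = [profit / revenue for _, revenue, profit in rows]

# Adding the margins gives 40%, which describes nothing.
# Even a plain average (20%) is wrong, because it ignores
# that South has four times the revenue of North.
naive_average = sum(margins) / len(margins)

# The correct overall margin is recomputed from the additive
# components: total profit divided by total revenue.
total_revenue = sum(revenue for _, revenue, _ in rows)
total_profit = sum(profit for _, _, profit in rows)
overall_margin = total_profit / total_revenue

print(naive_average, overall_margin)
```

This is why fact tables usually store the additive components (revenue, profit) and leave the ratio to be calculated at query time.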

Helper Tables
Helper tables usually take one of two forms:
Helper tables for multi-valued dimensions
Helper tables for complex hierarchies

Multi Valued Dimensions

Complex Hierarchies
