What Is A Data Warehouse?
Definition
A data warehouse is a standalone repository
of information integrated from several, possibly
heterogeneous, operational databases.
Bill Inmon's definition: A data warehouse is a subject-oriented,
integrated, non-volatile and time-variant collection of data in support of
management's decision-making process.
It contains integrated granular historical data.
Subject-Orientation
Operational Database: Loans, Credit Card, Trust, Savings
Data Warehouse: Customer, Vendor, Product, Activity
Operational DBMS
A relational DBMS is often used for OLTP (online transaction processing).
It deals with day-to-day operations such as
banking, purchasing, manufacturing,
registration, accounting, etc.
These systems typically get data into the
database.
Each transaction processes information
about a single entity.
The following is an example of an OLTP query:
What is the price of a 2 GB Kingston pen drive?
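The OLTP query above can be sketched as a single-row lookup. This is a minimal illustration using Python's built-in sqlite3 module; the product table, its columns and the prices are hypothetical and are not taken from any real system.

```python
import sqlite3

# Hypothetical operational table; names and prices are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE product ("
    "product_id INTEGER PRIMARY KEY, name TEXT UNIQUE, price REAL)"
)
conn.executemany(
    "INSERT INTO product (name, price) VALUES (?, ?)",
    [("Kingston Pen Drive 2GB", 4.50), ("Kingston Pen Drive 4GB", 6.75)],
)

# OLTP-style query: reads one row about a single entity.
# The UNIQUE constraint on name gives SQLite an index for the lookup.
row = conn.execute(
    "SELECT price FROM product WHERE name = ?",
    ("Kingston Pen Drive 2GB",),
).fetchone()
price = row[0]
print(price)
```

The query touches one small, well-defined piece of data, which is typical of day-to-day operational work.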
Data Warehouse
The major purpose of maintaining a data
warehouse is OLAP (online analytical processing).
Data warehousing systems are used for data
analysis and strategic decision making.
Following are some examples of OLAP queries:
How is the student job placement rate
changing over the years across different
colleges of the HSNC board?
Is it financially viable to continue the
production unit at location X?
Since OLAP queries involve a large amount of
aggregation, a large amount of data must be read
before these queries can be answered
conclusively.
The purpose of these queries is to support decision making.
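The placement-rate question above can be sketched as an aggregation over many rows. This is only an illustration with sqlite3: the placement table, the college names and all figures are made up, and a real warehouse would hold far more history.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE placement ("
    "college TEXT, year INTEGER, department TEXT, "
    "graduates INTEGER, placed INTEGER)"
)
# Hypothetical historical data, one row per college, year and department.
conn.executemany(
    "INSERT INTO placement VALUES (?, ?, ?, ?, ?)",
    [
        ("College A", 2021, "CS", 100, 70),
        ("College A", 2021, "Arts", 100, 50),
        ("College A", 2022, "CS", 110, 88),
        ("College A", 2022, "Arts", 110, 66),
        ("College B", 2021, "CS", 150, 90),
        ("College B", 2022, "CS", 160, 88),
    ],
)

# OLAP-style query: scans every row and aggregates by college and year.
rates = conn.execute(
    """
    SELECT college, year, 100.0 * SUM(placed) / SUM(graduates) AS rate
    FROM placement
    GROUP BY college, year
    ORDER BY college, year
    """
).fetchall()

for college, year, rate in rates:
    print(college, year, rate)
```

Unlike the single-entity OLTP lookup, the answer here depends on every row in the table.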
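Operational Systems vs. Decision Support System
Data content: current values (operational systems); summarized, historical data (decision support system)
Characteristics of a decision support system:
Data structure: optimized for queries
Access frequency: medium to low
Access type: read
Usage: ad hoc, random, heuristic
Response time: several seconds to minutes
Users: relatively small number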
Warehouse (DSS)
Subject oriented
Used to analyze the business
Summarized and refined
Snapshot data
Integrated data
Ad hoc access
Knowledge user (manager)
Performance relaxed
Large volumes accessed at a time (millions)
Mostly read
Database size: 100 GB to a few terabytes
Hundreds of users
Source System
Source Data Transport Layer
Data Quality Control and Data Profiling Layer
Metadata Management Layer
Data Integration Layer
Data Processing Layer
End User Reporting Layer
Source System
The Source Systems are operational systems that
feed data into the data warehouse.
The databases in operational systems are
designed to handle business transactions.
It is difficult to access the data in such databases
for other management or informational purposes.
Therefore organizations require a data warehouse
that has integrated data from several operational
systems to understand customers, operations,
financial situation, product performance, trends
and a host of key business measurements.
The data warehouse integrates data from
several operational systems and combines it with
information from other, often external, sources of
data.
It is essential to identify the right data
sources and determine an efficient process to
collect the facts.
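Integrating several operational sources can be sketched as follows. This is a simplified illustration, not a real ETL process: the two source layouts, the field names and the key format are all hypothetical, chosen only to show records from heterogeneous systems being brought into one subject-oriented customer view.

```python
# Hypothetical extracts from two operational systems with different layouts.
loans_system = [
    {"cust_no": "C001", "cust_name": "ASHA MEHTA", "loan_amt": 250000},
    {"cust_no": "C002", "cust_name": "RAVI SHAH", "loan_amt": 100000},
]
savings_system = [
    {"customer_id": 1, "name": "Asha Mehta", "balance": 40000},
    {"customer_id": 3, "name": "Neha Rao", "balance": 15000},
]


def integrate(loans, savings):
    """Merge both sources into one record per customer."""
    warehouse = {}

    def row_for(key, name):
        # Create the integrated record on first sight of a customer.
        return warehouse.setdefault(
            key, {"name": name.title(), "loan_amt": 0, "balance": 0}
        )

    for rec in loans:
        row_for(rec["cust_no"], rec["cust_name"])["loan_amt"] += rec["loan_amt"]

    for rec in savings:
        # Assumed rule: savings ids map to the loan system's "C001" style key.
        key = f"C{rec['customer_id']:03d}"
        row_for(key, rec["name"])["balance"] += rec["balance"]

    return warehouse


customers = integrate(loans_system, savings_system)
for key in sorted(customers):
    print(key, customers[key])
```

Most of the effort in practice goes into rules like the key mapping above, which is why identifying the right sources early matters.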
Metadata Management Layer
Scope
A data warehouse project can take one of two extreme
scopes:
Very broad scope (integrating all enterprise
data from the beginning of time)
Very narrow scope (developing only a personal
data warehouse for a single manager for a single
year)
If the data warehouse is developed by taking all
informational data for the entire enterprise from the
beginning of time, it is very expensive and takes
a large amount of time and money to build.
Therefore most organizations develop inexpensive
departmental data warehouses as first steps towards
Data Redundancy
There are essentially three levels of data redundancy
that enterprises should think about when
considering their data warehouse options:
"Virtual" or "Point-to-Point" Data Warehouses
Central Data Warehouses
Distributed Data Warehouses
Inmon's Approach to Distributed Data Warehouses
Type of End-User
Executives and managers
Power users (business and financial analysts,
engineers)
Support users (clerical, administrative)
Developing Strategy
The first and foremost step in developing a data
warehouse is formulating a strategy that is
appropriate for the organization's needs and its user population.
There are a number of strategies by which
organizations can get into data warehousing.
One way is to establish a "Virtual Data Warehouse"
environment by:
Installing a set of data access, data directory and
process management facilities
Training the end-users
Monitoring how the data warehouse facilities are
actually used
Creating, based on actual usage, a physical data
warehouse to support the high-frequency
requests
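The monitoring step can be sketched as a simple frequency count over a request log. The log contents, the request names and the 25% threshold below are assumptions made for illustration; a real installation would define its own measure of "high-frequency".

```python
from collections import Counter

# Hypothetical log of requests issued through the virtual data warehouse.
request_log = [
    "sales_by_region", "sales_by_region", "inventory_level",
    "sales_by_region", "customer_churn", "inventory_level",
    "sales_by_region", "inventory_level", "vendor_list",
]


def high_frequency_requests(log, min_share=0.25):
    """Return requests that make up at least min_share of all traffic."""
    counts = Counter(log)
    total = len(log)
    # most_common() yields (request, count) pairs, most frequent first.
    return [name for name, n in counts.most_common() if n / total >= min_share]


candidates = high_frequency_requests(request_log)
print(candidates)
```

The requests returned are the ones a physical data warehouse would be built to serve first.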
Another strategy to develop a data warehouse is as
follows:
Duplicate operational data from a single
operational system
Provide data warehouse features on top of it with
the help of a series of information access tools
Ultimately, the optimal data warehousing strategy is
to select a user population based on its value to the
enterprise and analyze its issues,
questions and data access needs.
Based on these needs, prototype data warehouses
are built and populated so that the end-users can
experiment and modify their requirements.
Based on the requirements, data can be extracted
from various operational systems.
The needs of enterprises differ, and
hence there is no single approach to building a data
warehouse that will fit the needs of every enterprise.
Snowflake Schema
It is a modified version of the star schema.
In a star schema, if a dimension is complex and contains relationships
such as hierarchies, it is compressed or flattened into a single
dimension table.
Like the star schema model, the snowflake schema also represents a
dimensional model containing a central fact table and a set of
constituent dimension tables.
The major difference is that in a snowflake schema complex
dimensions are normalized into sub-dimension tables.
In a snowflake schema implementation, Warehouse Builder uses
more than one table or view to store the dimension data.
Separate tables store data pertaining to each level in the
dimension.
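A minimal sketch of a snowflaked dimension, using sqlite3. The table and column names (fact_sales, dim_product, dim_category) and the data are hypothetical. The product dimension has a two-level hierarchy, and each level is stored in its own table; in a star schema the category name would instead be a column of dim_product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Sub-dimension table: the category level of the product hierarchy.
    CREATE TABLE dim_category (
        category_id INTEGER PRIMARY KEY,
        category_name TEXT
    );
    -- Dimension table: the product level, pointing at its category.
    CREATE TABLE dim_product (
        product_id INTEGER PRIMARY KEY,
        product_name TEXT,
        category_id INTEGER REFERENCES dim_category(category_id)
    );
    -- Central fact table.
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        quantity INTEGER
    );
    INSERT INTO dim_category VALUES (1, 'Storage'), (2, 'Input');
    INSERT INTO dim_product VALUES
        (10, 'Pen Drive', 1), (11, 'Hard Disk', 1), (12, 'Keyboard', 2);
    INSERT INTO fact_sales VALUES (10, 5), (11, 2), (12, 7), (10, 3);
    """
)

# Rolling up to category level needs one join per level of the hierarchy.
totals = conn.execute(
    """
    SELECT c.category_name, SUM(f.quantity)
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_category c ON c.category_id = p.category_id
    GROUP BY c.category_name
    ORDER BY c.category_name
    """
).fetchall()
print(totals)
```

The extra join is the cost of normalization; the benefit is that each category name is stored only once.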
Star Schema
Snowflake Schema
Granularity of Facts
Additivity of Facts
In general, facts representing individual transactions
are fully additive, whereas cumulative totals are semi-additive.
Non-additive facts are usually the result of ratios or
other calculations.
Example of Additive Fact:
Dimensions: Time, Customer, Item, Location, Branch
Fact: Quantity
The Quantity can be summed up over all of the
dimensions (Time, Customer, Item, Location, Branch).
Example of Semi-Additive Fact:
Suppose a bank has the following table to store the
current balance of each account at the end of each day.
Dimensions: Date, Account
Fact: Current_Balance
The balance cannot be summed up across the Time
dimension: adding up an account's daily balances does
not produce a meaningful figure. It can, however, be
summed across accounts for a single date.
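The difference between additive and semi-additive facts can be sketched in a few lines. All figures below are made up, and the helper names (total_on, closing_balance, average_balance) are hypothetical.

```python
# Additive fact: quantities sold, as (date, item, quantity).
sales = [
    ("2024-01-01", "Pen Drive", 5),
    ("2024-01-02", "Pen Drive", 3),
    ("2024-01-02", "Keyboard", 7),
]
# Summing over every dimension is meaningful.
total_quantity = sum(q for _, _, q in sales)

# Semi-additive fact: end-of-day balances, as (date, account, balance).
balances = [
    ("2024-01-01", "A1", 100), ("2024-01-01", "A2", 50),
    ("2024-01-02", "A1", 120), ("2024-01-02", "A2", 50),
]


def total_on(date):
    """Valid: sum across the Account dimension for a single date."""
    return sum(b for d, _, b in balances if d == date)


def closing_balance(account):
    """Across time, take the latest balance instead of a sum."""
    # ISO date strings sort chronologically, so max() finds the last day.
    return max((d, b) for d, a, b in balances if a == account)[1]


def average_balance(account):
    """Across time, an average is also a meaningful summary."""
    values = [b for _, a, b in balances if a == account]
    return sum(values) / len(values)


# Not meaningful: account A1 never held this amount.
misleading_sum = sum(b for _, a, b in balances if a == "A1")

print(total_quantity, total_on("2024-01-02"))
print(closing_balance("A1"), average_balance("A1"), misleading_sum)
```

A ratio such as a margin percentage would be non-additive: it has to be recomputed from its underlying additive components at each level of aggregation.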
Helper Tables
Helper tables usually take one of two forms:
Helper tables for multi-valued dimensions
Helper tables for complex hierarchies
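One common way to handle a multi-valued dimension is a helper table that sits between the fact table and the dimension and carries a weighting factor. The sketch below assumes a bank account that can have several holders; the table names, the equal weights and the balances are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE fact_balance (account_id INTEGER, balance REAL);
    -- Helper table: one row per (account, customer) pair.
    -- The weights for each account are assumed to add up to 1.
    CREATE TABLE helper_account_customer (
        account_id INTEGER,
        customer_id INTEGER,
        weight REAL
    );
    INSERT INTO dim_customer VALUES (1, 'Asha'), (2, 'Ravi');
    INSERT INTO fact_balance VALUES (100, 1000.0), (101, 400.0);
    -- Account 100 is held jointly; account 101 has a single holder.
    INSERT INTO helper_account_customer VALUES
        (100, 1, 0.5), (100, 2, 0.5), (101, 2, 1.0);
    """
)

# Weighting each balance keeps joint accounts from being counted twice.
per_customer = conn.execute(
    """
    SELECT c.name, SUM(f.balance * h.weight)
    FROM fact_balance f
    JOIN helper_account_customer h ON h.account_id = f.account_id
    JOIN dim_customer c ON c.customer_id = h.customer_id
    GROUP BY c.name
    ORDER BY c.name
    """
).fetchall()
print(per_customer)
```

Because the weights of each account add up to 1, the per-customer figures add up to the total of all balances.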
Complex Hierarchies