Data Warehouse Unit-I
4
BI USING DATA WAREHOUSING
4.1 Introduction to DW
4.2 DW architecture
4.3 ETL Process
4.4 Data Warehouse Design
4.1 INTRODUCTION TO DW
A Data Warehouse (DW) is an environment maintained separately from the
organization’s operational database. Its architectural construct
provides users with current and historical decision support information
that is not available from a traditional operational data store. A DW
provides a design that helps reduce response time and enhance
the performance of queries for reports and analytics.
A data warehouse system is also known by the following names:
❖ Decision Support System (DSS)
❖ Executive Information System
❖ Management Information System
❖ Business Intelligence Solution
❖ Analytic Application
❖ Data Warehouse
History of Data Warehouse
Data warehousing arose from the need to handle increasing amounts of information.
3. Data Mart:
A data mart is a subset of the DW designed for a particular line of business,
such as sales or finance.
❖ Query Manager
❖ End-user access tools:
These are categorized into five groups: 1. Data reporting tools 2.
Query tools 3. Application development tools 4. EIS tools 5. OLAP tools
and data mining tools.
The Future of Data Warehousing
❖ Change in regulatory constraints.
❖ Size of the database.
❖ Multimedia data.
2. Oracle:
https://www.oracle.com/index.html
3. Amazon RedShift:
https://aws.amazon.com/redshift/?nc2=h_m1
Here is a complete list of useful Datawarehouse Tools.
3. QuerySurge (Windows, Linux): It speeds up the testing process by up to
1,000x and provides up to 100% data coverage. It integrates an
out-of-the-box DevOps solution for most build, ETL and QA management
software.
4.2 DW ARCHITECTURE
Business Analysis Framework
Business analysts get information from the data warehouse to
measure performance and make critical adjustments in order to stay
ahead of competitors in the market. A data warehouse has the following advantages:
❖ Can enhance business productivity.
❖ Helps manage customer relationships.
❖ Brings down costs by tracking trends and patterns over a long period
in a consistent and reliable manner.
To design an effective and efficient data warehouse, we need to
understand and analyze the business needs and construct a business
analysis framework. The views to consider are as follows:
❖ The top-down view
❖ The data source view
❖ The data warehouse view
❖ The business query view
Data warehouses and their architectures vary depending upon the elements
of an organization's situation and are classified as:
❖ Data Warehouse Architecture: Basic
❖ Data Warehouse Architecture: With Staging Area
❖ Data Warehouse Architecture: With Staging Area and Data Marts
Fig 6: Data Warehouse Architecture with Staging Area and Data Marts (a)
Figure 6 illustrates an example where purchasing, sales, and stocks are
separated. In this example, a financial analyst wants to analyze historical
data for purchases and sales or mine historical information to make
predictions about customer behavior.
Two-Tier Architecture
The requirement for separation plays an essential role in defining the two-
tier architecture for a data warehouse system, as shown in fig:
Three-Tier Architecture
The three-tier architecture consists of the source layer (containing multiple
source systems), the reconciled layer and the data warehouse layer
(containing both data warehouses and data marts). The reconciled layer
sits between the source data and the data warehouse.
The main advantage of the reconciled layer is that it creates a standard
reference data model for a whole enterprise. At the same time, it separates
the problems of source data extraction and integration from those of data
warehouse population.
Load Performance
Data warehouses require incremental loading of new data on a periodic basis
within a short amount of time; performance of the load process should be
measured in hundreds of millions of rows and gigabytes per hour and must
not artificially constrain the volume of data the business requires.
Load Processing
Many steps must be taken to load new or updated data into the data
warehouse, including data conversion, filtering, reformatting, indexing,
and metadata update.
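The load steps named above can be sketched as a small pipeline. This is only an illustration under stated assumptions: the rows are held in memory, the field names (order_id, amount, region) are hypothetical, and a dictionary stands in for the warehouse table and its index.

```python
from datetime import datetime, timezone


def load_batch(raw_rows, warehouse, metadata):
    """Run the load steps over one batch of source rows (illustrative only)."""
    # Data conversion: cast text fields to typed values.
    converted = [
        {
            "order_id": int(r["order_id"]),
            "amount": float(r["amount"]),
            "region": r["region"],
        }
        for r in raw_rows
    ]
    # Filtering: drop rows that fail a basic validity rule.
    filtered = [r for r in converted if r["amount"] >= 0]
    # Reformatting: normalise values to one convention.
    for r in filtered:
        r["region"] = r["region"].strip().upper()
    # Indexing: key each row by order_id (the dict acts as a stand-in index).
    for r in filtered:
        warehouse[r["order_id"]] = r
    # Metadata update: record what this load did.
    metadata["rows_loaded"] = metadata.get("rows_loaded", 0) + len(filtered)
    metadata["last_load"] = datetime.now(timezone.utc).isoformat()
    return len(filtered)
```

Calling load_batch with one valid and one negative-amount row loads only the valid row and records a row count of one in the metadata dictionary.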
Query Performance
Fact-based management must not be slowed by the performance of the
data warehouse RDBMS; large, complex queries must complete in
seconds.
Terabyte Scalability
Data warehouse sizes are growing at enormous rates. Today they range
from a few gigabytes to hundreds of gigabytes, up to terabyte-sized data warehouses.
Types of Data Warehouses
There are different types of data warehouses, which are as follows:
Host-Based Data Warehouses
There are two types of host-based data warehouses which can be
implemented:
❖ Host-based mainframe warehouses, which reside on a high-volume
database and are supported by robust and reliable high-capacity structures such as
IBM System/390, UNISYS and Data General Sequent systems, and
databases such as Sybase, Oracle, Informix, and DB2.
❖ Host-based LAN data warehouses, where data delivery can be
handled either centrally or from the workgroup environment. The size of
the data warehouse database depends on the platform.
Data Extraction and transformation tools allow the automated extraction
and cleaning of data from production systems.
1. A huge load of complex warehousing queries would possibly have too
much of a harmful impact upon the mission-critical transaction
processing (TP)-oriented application.
2. These transaction processing systems have been tuned in their
database design for transaction throughput.
3. There is no assurance that data remains consistent.
Host-Based (MVS) Data Warehouses
Data warehouses that reside on large-volume databases on MVS
are the host-based type of data warehouse. Often the DBMS is DB2, with
a wide variety of original sources for legacy information such as VSAM, DB2,
flat files, and Information Management System (IMS).
❖ Impacting performance since the customer will be competing with the
production data stores.
Disadvantages
1. Queries competing with production record transactions can degrade
the performance.
2. No metadata, no summary record, or no individual DSS (Decision
Support System) integration or history.
3. No refreshing process, causing the queries to be very complex.
Step 2) Transformation
Data extracted from the source server is raw and not usable in its original
form. It needs to be cleansed, mapped and transformed.
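A minimal sketch of the three operations named above, applied to one raw record. The field names, the COUNTRY_MAP lookup and the validation rule are all assumptions made for illustration; they do not come from any particular ETL tool.

```python
# Hypothetical mapping from source-system codes to warehouse values.
COUNTRY_MAP = {"IN": "India", "US": "United States"}


def transform(record):
    """Cleanse, map and transform one raw record; return None if unusable."""
    # Cleansing: trim whitespace and reject records missing a required field.
    name = (record.get("customer") or "").strip()
    if not name:
        return None
    # Mapping: translate a source code into the warehouse's standard value.
    code = (record.get("country") or "").strip().upper()
    country = COUNTRY_MAP.get(code, "Unknown")
    # Transformation: derive a new measure from the raw fields.
    total = int(record["qty"]) * float(record["unit_price"])
    return {"customer": name.title(), "country": country, "total": round(total, 2)}
```

Records that fail cleansing return None so that the caller can route them to a reject file instead of loading them.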
Step 3) Loading
A large volume of data needs to be loaded in a relatively short period, so the
load process needs to be optimized for performance.
In case of load failure, recovery mechanisms should be configured to restart
from the point of failure without loss of data integrity.
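One way to restart from the point of failure is to record a checkpoint after every successfully written row. The sketch below assumes an in-memory checkpoint dictionary and a hypothetical write_row callback; a real loader would persist the checkpoint and commit it together with the data.

```python
def load_with_checkpoint(rows, target, checkpoint, write_row):
    """Load rows in order, resuming from the last recorded position."""
    start = checkpoint.get("next_index", 0)
    for i in range(start, len(rows)):
        write_row(target, rows[i])          # may raise if the load fails
        checkpoint["next_index"] = i + 1    # record progress only after success
    return checkpoint.get("next_index", start)


# Demo: a hypothetical writer that fails once, on its third call.
calls = {"count": 0}


def flaky_write(target, row):
    calls["count"] += 1
    if calls["count"] == 3:
        raise IOError("simulated load failure")
    target.append(row)


rows = ["r1", "r2", "r3", "r4"]
loaded, checkpoint = [], {}
try:
    load_with_checkpoint(rows, loaded, checkpoint, flaky_write)
except IOError:
    # Rows before this index were loaded successfully.
    position_at_failure = checkpoint["next_index"]

# The restart resumes at the failed row instead of reloading everything.
load_with_checkpoint(rows, loaded, checkpoint, flaky_write)
```

Because the checkpoint only advances after a successful write, the restart neither skips the failed row nor duplicates rows that were already loaded.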
Types of Loading:
❖ Initial Load: populating all the data warehouse tables.
❖ Incremental Load: applying ongoing changes periodically, as and when
needed.
❖ Full Refresh: erasing the contents of one or more tables and
reloading with fresh data.
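The three load types can be contrasted with a short sketch. It uses Python's built-in sqlite3 module with an in-memory database as a stand-in for the warehouse; the sales table and its columns are hypothetical.

```python
import sqlite3


def initial_load(conn, rows):
    """Initial load: create the table and populate it for the first time."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (id INTEGER PRIMARY KEY, amount REAL)"
    )
    conn.executemany("INSERT INTO sales (id, amount) VALUES (?, ?)", rows)


def incremental_load(conn, changed_rows):
    """Incremental load: apply only new or changed rows."""
    conn.executemany(
        "INSERT OR REPLACE INTO sales (id, amount) VALUES (?, ?)", changed_rows
    )


def full_refresh(conn, rows):
    """Full refresh: erase the table contents and reload fresh data."""
    conn.execute("DELETE FROM sales")
    conn.executemany("INSERT INTO sales (id, amount) VALUES (?, ?)", rows)


conn = sqlite3.connect(":memory:")
initial_load(conn, [(1, 10.0), (2, 20.0)])
incremental_load(conn, [(2, 25.0), (3, 30.0)])   # row 2 changed, row 3 is new
after_incremental = conn.execute(
    "SELECT id, amount FROM sales ORDER BY id"
).fetchall()
full_refresh(conn, [(1, 11.0)])
after_refresh = conn.execute("SELECT id, amount FROM sales ORDER BY id").fetchall()
```

After the incremental load the table holds three rows with row 2 updated; after the full refresh it holds only the reloaded row.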
ETL Tools
Prominent data warehousing tools available in the market are:
1. MarkLogic:
https://www.marklogic.com/product/getting-started/
2. Oracle:
https://www.oracle.com/index.html
3. Amazon RedShift:
https://aws.amazon.com/redshift/?nc2=h_m1
Weaknesses
❖ Flexibility
❖ Hardware
❖ Learning Curve
ELT (Extract, Load and Transform)
ELT stands for Extract, Load and Transform and is a different way of
looking at data migration or movement. ELT involves the extraction of
aggregate information from the source system and loading it into the target
system, instead of transforming it between the extraction and loading
phases. Once the data is copied or loaded into the target system, the
transformation takes place.
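The ordering that distinguishes ELT can be sketched as follows: raw rows are loaded unchanged, and the transformation then runs inside the target using its own SQL engine. An in-memory SQLite database stands in for the target here, and the table and column names are assumptions for illustration.

```python
import sqlite3

elt_target = sqlite3.connect(":memory:")  # stand-in for the target system

# Extract and Load: copy the source rows into the target unchanged.
source_rows = [
    ("2024-01-05", "east", "100.50"),
    ("2024-01-06", "EAST", "49.50"),
    ("2024-01-06", "west", "10"),
]
elt_target.execute(
    "CREATE TABLE raw_sales (sale_date TEXT, region TEXT, amount TEXT)"
)
elt_target.executemany("INSERT INTO raw_sales VALUES (?, ?, ?)", source_rows)

# Transform: performed afterwards, inside the target.
elt_target.execute(
    """
    CREATE TABLE sales_by_region AS
    SELECT UPPER(region) AS region, SUM(CAST(amount AS REAL)) AS total
    FROM raw_sales
    GROUP BY UPPER(region)
    """
)
elt_result = elt_target.execute(
    "SELECT region, total FROM sales_by_region ORDER BY region"
).fetchall()
```

The raw table stays available in the target, so further transformations can be added later without extracting the data again.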
❖ Risk minimization
❖ Utilize Existing Hardware
❖ Utilize Existing Skill sets
Weaknesses
❖ Against the Norm
❖ Tools Availability
Difference between ETL vs. ELT
Process
  ETL: Data is transferred to the ETL server and moved back to the DB. High network bandwidth required.
  ELT: Data remains in the DB except for cross-database loads (e.g. source to object).
Transformation
  ETL: Transformations are performed in the ETL server.
  ELT: Transformations are performed (in the source or) in the target.
Code Usage
  ETL: Typically used for source to target transfer, compute-intensive transformations, and small amounts of data.
  ELT: Typically used for high amounts of data.
Time-Maintenance
  ETL: It needs high maintenance as you need to select data to load and transform.
  ELT: Low maintenance as data is always available.
Calculations
  ETL: Overwrites an existing column, or needs to append the dataset and push it to the target platform.
  ELT: Easily adds the calculated column to the existing table.
Analysis
Bottom-Up Design Approach
In the "Bottom-Up" approach, a DW is described as "a copy of transaction
data specifically structured for query and analysis," termed the star schema.
In this approach, a data mart is created first to provide the necessary reporting and
analytical capabilities for particular business processes. Data marts include
the lowest grain data and, if needed, aggregated data.
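A small star schema for a single data mart might look like the sketch below: one fact table at the lowest grain surrounded by dimension tables, with aggregated data derived from it by a query. The table names, columns and sample rows are hypothetical, and SQLite is used only because it ships with Python.

```python
import sqlite3

mart = sqlite3.connect(":memory:")
mart.executescript(
    """
    CREATE TABLE dim_product (
        product_id INTEGER PRIMARY KEY, name TEXT, category TEXT
    );
    CREATE TABLE dim_date (
        date_id INTEGER PRIMARY KEY, day TEXT, month TEXT
    );
    -- Fact table at the lowest grain: one row per product per day.
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        date_id INTEGER REFERENCES dim_date(date_id),
        units INTEGER,
        revenue REAL
    );
    INSERT INTO dim_product VALUES
        (1, 'Phone A', 'Mobile'), (2, 'Laptop B', 'Computer');
    INSERT INTO dim_date VALUES
        (1, '2024-09-01', '2024-09'), (2, '2024-09-02', '2024-09');
    INSERT INTO fact_sales VALUES
        (1, 1, 2, 600.0), (1, 2, 1, 300.0), (2, 1, 1, 900.0);
    """
)

# Aggregated data, derived from the lowest-grain facts by joining a dimension.
revenue_by_category = mart.execute(
    """
    SELECT p.category, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    GROUP BY p.category
    ORDER BY p.category
    """
).fetchall()
```

Keeping the facts at the lowest grain means other aggregates, such as revenue by month, can be produced later by joining a different dimension.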
The main advantage of the "bottom-up" design approach is that it has a quick ROI and
takes less time and effort than developing an enterprise-wide data
warehouse. In addition, the risk of failure is lower. This method is
inherently incremental and allows the project team to learn and
grow.
Differentiate between Top-Down Design Approach and Bottom-Up Design Approach
Top-Down: Breaks the vast problem into smaller subproblems.
Bottom-Up: Solves the essential low-level problems and integrates them into a higher one.
5
DATA MART
Unit Structure
5.1 Data mart
5.2 OLAP
5.3 Dimensional Modeling
5.4 Operations on Data Cube
5.5 Schema
5.6 References
5.7 MOOCs
5.8 Video Lectures
5.9 Quiz
Data mart
Hybrid Data Mart:
It combines input from sources in addition to the data warehouse and is helpful
for integration. A hybrid data mart also supports large storage structures, and
it is well suited to smaller data-centric applications that need flexibility.
Constructing Data mart
The second phase of implementation involves creating the physical
database and the logical structures. It involves the following tasks:
● Implementing the physical database designed in the earlier phase.
Database schema objects such as tables, indexes and views are created.
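As a rough illustration of the construction phase, the sketch below creates one table, one index and one view, then lists them from the database catalog. The object names are hypothetical and SQLite is used as a stand-in for whichever database hosts the data mart.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript(
    """
    CREATE TABLE sales (
        sale_id INTEGER PRIMARY KEY,
        region TEXT,
        sale_date TEXT,
        amount REAL
    );
    CREATE INDEX idx_sales_region ON sales (region);
    CREATE VIEW v_sales_by_region AS
        SELECT region, SUM(amount) AS total_amount
        FROM sales
        GROUP BY region;
    """
)

# sqlite_master is SQLite's catalog of schema objects.
schema_objects = db.execute(
    "SELECT type, name FROM sqlite_master ORDER BY type, name"
).fetchall()
```

The index supports lookups by region, and the view gives report writers a pre-aggregated structure without duplicating the data.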
Populating:
In the third phase, data is populated in the data mart. This involves the
following tasks:
❖ Data Mapping
❖ Extraction of source data
❖ Cleaning and transformation operations
❖ Loading data into the data mart
❖ Creating and storing metadata
Accessing
Accessing is the fourth step, which involves putting the data to use:
submitting queries to the database and displaying the results of the queries.
The accessing step needs to perform the following tasks:
❖ Translate database structures and object names into business terms
❖ Set up and maintain database structures
❖ Set up APIs and interfaces if required
Managing
This is the last step of the data mart implementation process and covers
management tasks like:
❖ User access management.
❖ System optimizations and fine-tuning.
❖ Adding and managing fresh data in the data mart.
❖ Planning recovery scenarios and ensuring system availability in case
the system fails.
Disadvantages
● Maintenance problem.
● Data analysis is limited.
5.2 OLAP
Online Analytical Processing (OLAP) provides analysis of data for business
decisions and allows users to analyze database information from multiple
database systems at one time.
The primary objective is data analysis and not data processing.
Example of OLAP
Any Data warehouse system is an OLAP system.
Uses of OLAP:
❖ A company might compare its mobile phone sales in September with
sales in October, then compare those results with another location
which may be stored in a separate database.
❖ Amazon analyzes purchases by its customers to come up with a
personalized homepage with products likely to interest each
customer.
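The first use case above, comparing September and October phone sales across locations held in separate databases, can be sketched as follows. The two in-memory SQLite databases, the phone_sales table and the sample figures are all made up for illustration.

```python
import sqlite3


def make_location_db(sales):
    """Create a stand-in database for one location (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE phone_sales (sale_date TEXT, units INTEGER)")
    conn.executemany("INSERT INTO phone_sales VALUES (?, ?)", sales)
    return conn


def monthly_sales(conn):
    """Return {month number as text: total units} for one location."""
    rows = conn.execute(
        "SELECT strftime('%m', sale_date) AS month, SUM(units) "
        "FROM phone_sales GROUP BY month"
    ).fetchall()
    return dict(rows)


store_a = make_location_db(
    [("2024-09-03", 40), ("2024-09-20", 35), ("2024-10-11", 90)]
)
store_b = make_location_db([("2024-09-15", 60), ("2024-10-02", 55)])

# Change in units sold from September to October, per location.
comparison = {}
for name, conn in (("store_a", store_a), ("store_b", store_b)):
    totals = monthly_sales(conn)
    comparison[name] = totals.get("10", 0) - totals.get("09", 0)
```

The same aggregate query runs against each database, and the per-location results are combined afterwards, which is the multi-database analysis the example describes.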
OLTP
Online transaction processing (OLTP) supports transaction-oriented applications in
a 3-tier architecture. It administers the day-to-day transactions of an
organization.
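A typical day-to-day transaction is a transfer between two accounts, which must either complete fully or not at all. The sketch below shows this with Python's sqlite3 module; the accounts table, the CHECK constraint and the amounts are assumptions chosen to make the rollback visible.

```python
import sqlite3

bank = sqlite3.connect(":memory:")
bank.execute(
    "CREATE TABLE accounts ("
    "id INTEGER PRIMARY KEY, balance REAL CHECK (balance >= 0))"
)
bank.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100.0), (2, 50.0)])
bank.commit()


def transfer(conn, src, dst, amount):
    """Move money between accounts as one atomic transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(bank, 1, 2, 30.0)          # succeeds
rejected = transfer(bank, 1, 2, 500.0)   # would overdraw account 1
balances = bank.execute("SELECT id, balance FROM accounts ORDER BY id").fetchall()
```

In the rejected transfer the credit to account 2 is applied first and then undone by the rollback, so neither account changes.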
OLTP Vs OLAP