Data Warehousing Reema Thareja
DATA WAREHOUSING
REEMA THAREJA
Assistant Professor
Department of Computer Science
Shyama Prasad Mukherjee College for Women
University of Delhi
© Oxford University Press
CONTENTS
Preface
PART I
Chapter 1: Introduction to Data Warehousing
1.1 A Short Historical Note
1.2 Increasing Demand for Strategic Information
1.3 Data Warehouse Defined
1.4 Data Warehouse Users
1.5 Benefits of Data Warehousing
1.6 Concerns in Data Warehousing
PART II
Chapter 4: Gathering the Business Requirements
4.1 Introduction
4.2 Determining the End-user Requirements
4.3 Requirements Gathering Methods
4.4 Requirements Analysis
4.5 Dimensional Analysis
4.6 Information Package Diagrams (IPD)
PART III
Chapter 11: Building a Data Warehouse
11.1 Introduction
11.2 Problem Definition
11.3 Critical Success Factors
11.4 Requirement Analysis
11.5 Planning for the Data Warehouse
11.6 The Data Warehouse Design Stage
11.7 Building and Implementing Data Marts
11.8 Building Data Warehouses
11.9 Backup and Recovery
11.10 Establish the Data Quality Framework
11.12 Operating the Warehouse
11.13 Recipe for a Successful Warehouse
11.14 Data Warehouse Pitfalls
Glossary
Bibliography
Index
This chapter aims to provide an overview of the fundamental concepts of data warehousing.
It endeavours to answer questions regarding the need for a data warehouse, its evolution,
characteristics, and applications.
Case Study
We shall introduce a case study of a company that requires a data warehouse to
ease the running of its operations. Through the decisions taken by the
managers, we shall study construction, operation and management of a data
warehouse. This case study shall run throughout the book.
Pallav Raj is the CEO of a large garments retail chain called JRTs. He asks one
of his employees to provide him with a status report on the business as he
wishes to know if the company was making an overall profit or loss. JRTs has
approximately 100 stores spread throughout the country. Although this is not a
difficult question to answer, the problem lies in collecting the relevant data
that is spread across 100 stores. With great difficulty, the employee contacts
each and every store and asks the store managers to give a summarized figure
describing whether the store is running at a profit or loss. After obtaining
100 such figures, he calculates the cumulative result and gives it to Pallav
Raj.
The problem does not end here for the employee! Pallav Raj now wants a detailed
product report of the previous year as he wishes to know which products sold
well and those that did not even have a marginal sale. Again the employee
contacts each and every store and thus the entire process is repeated.
Such situations prevail in a non-data warehouse environment. This is where the
concept of data warehouses comes into picture. In a data warehouse environment,
the entire data of all the stores is stored at one place, that is, on one
single computer system at the main office. In such a situation, the employee’s
work would have been very easy. Or rather, he would not have been required as
the CEO could have himself gained access to all the data while sitting on his
chair.
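To make the contrast in the case study concrete, here is a minimal sketch in
Python of how both of the CEO's questions could be answered once the data of
all stores sits in one place. The table name `sales`, its columns, and the
sample figures are hypothetical and chosen only for illustration; an in-memory
SQLite database stands in for the central system.

```python
import sqlite3

# Hypothetical consolidated table: one row per store, product, and year.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (store_id INTEGER, product TEXT, year INTEGER, "
    "units INTEGER, revenue REAL, cost REAL)"
)
rows = [
    (1, "shirt", 2023, 120, 2400.0, 1800.0),
    (1, "scarf", 2023, 2, 30.0, 45.0),
    (2, "shirt", 2023, 90, 1800.0, 1500.0),
    (2, "jeans", 2023, 60, 3000.0, 2100.0),
    (3, "jeans", 2023, 40, 2000.0, 2300.0),
]
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?, ?, ?)", rows)


def overall_profit(conn):
    """Company-wide profit (a negative value means a loss) across all stores."""
    (profit,) = conn.execute("SELECT SUM(revenue - cost) FROM sales").fetchone()
    return profit


def product_report(conn, year):
    """Units sold per product for one year, best sellers first."""
    return conn.execute(
        "SELECT product, SUM(units) AS total_units FROM sales "
        "WHERE year = ? GROUP BY product ORDER BY total_units DESC, product",
        (year,),
    ).fetchall()


print(overall_profit(conn))
print(product_report(conn, 2023))
```

With the sample rows above, the first query reports an overall profit of 1485.0
and the second lists shirts, jeans, and scarves in decreasing order of units
sold. Neither question requires contacting an individual store.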
calls for adoption of design and implementation techniques that are strikingly
different from those applied in underlying operational information systems.
rectify the situation. Now, there may not be any regular reports to give to the
marketing department on what they want. The IT department has to gather the
data from multiple applications and start forming the report from scratch.
Sometimes, they have to get the information required for such ad hoc
reports from the databases of not one but several applications, perhaps
running on different platforms. What happens next? The marketing
department likes the report but now wants it produced in a different form,
containing some more information, as illustrated in Fig. 1.1.
[Fig. 1.1: a cycle of four stages: Users need information; IT places request
on backlog; IT creates ad hoc queries; IT sends requested reports]
Most of these attempts by IT in the past ended in failure as the users could
not clearly define what they wanted in the first place. After seeing the first set
of reports, they wanted more data in different formats. The chain continued.
The mess was clearly due to the very nature of the process of making strategic
decisions.
Information needed for making strategic decisions must be available in an
interactive manner so that the users can query online, get results, and query
further. The information must be in a format suitable for analysis. Hence, some
factors that were responsible for the inability to provide strategic information
before data warehousing are as follows:
• IT received too many ad hoc requests for a variety of reports. With
  limited resources, IT was not able to generate all the reports in the
  requested manner and within the assigned timeframe.
• Requests were not only numerous but also kept changing over time, with
  users wanting further reports to expand on and understand earlier ones.
• The users got caught in a spiral of asking for more and more
  supplementary reports, thereby increasing the IT load even further.
• The users depended on IT to provide the information as they could not
  access the information directly in an interactive manner.
• As a result, IT was unable to provide the managers and executives with a
  flexible environment conducive to analysis for making strategic decisions.
Table 1.3 illustrates how a data warehouse can help its users to analyse
sales.
[Figure: Operational system → data transformation functions like data
extraction, cleansing, aggregation → Data warehouse]
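The flow in the figure above, from operational systems through extraction,
cleansing, and aggregation into the warehouse, can be illustrated with a small
Python sketch. The two source lists, their field names, and the cleansing rules
are assumptions made up for this example; real transformation logic depends on
the source systems involved.

```python
from collections import defaultdict

# Hypothetical extracts from two operational systems with inconsistent formats.
billing_rows = [
    {"product": " Shirt ", "amount": "100.0"},
    {"product": "JEANS", "amount": "250.5"},
    {"product": "shirt", "amount": ""},  # missing amount
]
web_orders = [
    {"item": "jeans", "total": 49.5},
    {"item": "Scarf", "total": 20.0},
]


def extract():
    """Pull rows from each source into one common (product, amount) shape."""
    for row in billing_rows:
        yield row["product"], row["amount"]
    for row in web_orders:
        yield row["item"], row["total"]


def cleanse(records):
    """Normalize product names and drop records with unusable amounts."""
    for product, amount in records:
        try:
            value = float(amount)
        except (TypeError, ValueError):
            continue  # skip rows whose amount cannot be parsed
        yield product.strip().lower(), value


def aggregate(records):
    """Sum amounts per product, ready to be loaded into the warehouse."""
    totals = defaultdict(float)
    for product, value in records:
        totals[product] += value
    return dict(totals)


warehouse_table = aggregate(cleanse(extract()))
print(warehouse_table)
```

Keeping the three steps as separate functions mirrors the figure: each stage
can be changed (for example, a stricter cleansing rule) without touching the
others.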
Other application areas include insurance companies, utilities providers, health care
providers, financial services companies, telecommunications service providers, travel,
transport and tourism companies, security agencies, logistics, inventory, and purchasing.
Users can also be classified based upon their job functions as below:
Executives and managers: They need information for making high level strategic
decisions. They prefer customized and personalized reports.
Technical analysts: They perform complex analysis and statistical analysis,
perform drill-down, roll-up, slice and dice operations on the data.
Business analysts: Although these users are comfortable with the technology,
they may not be able to write queries and create reports from scratch. So, they
rely on predefined queries and reports to satisfy their information needs.
Figure 1.3 Data warehouse users classification
Classification of users
  Based on computing proficiency: Casual/novice user, Regular user, Power user
  Based on job functions: Executives/managers, Technical analysts, Business analysts
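The drill-down, roll-up, slice, and dice operations mentioned for technical
analysts can be sketched in Python as follows. This is a minimal illustration
over a handful of made-up fact rows; the dimension names (`region`, `city`,
`product`, `quarter`) and the helper functions `summarize` and `select` are
hypothetical, not part of any particular tool.

```python
from collections import defaultdict

# Hypothetical fact rows: (region, city, product, quarter, sales).
facts = [
    ("North", "Delhi", "shirt", "Q1", 100),
    ("North", "Delhi", "jeans", "Q1", 150),
    ("North", "Jaipur", "shirt", "Q2", 80),
    ("South", "Chennai", "shirt", "Q1", 70),
    ("South", "Chennai", "jeans", "Q2", 90),
]
COLUMNS = ("region", "city", "product", "quarter")


def summarize(rows, by):
    """Total sales grouped by the named dimension columns."""
    idx = [COLUMNS.index(name) for name in by]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[i] for i in idx)] += row[-1]
    return dict(totals)


def select(rows, **conditions):
    """Keep rows whose value for each named dimension is in the allowed set."""
    return [
        row
        for row in rows
        if all(row[COLUMNS.index(k)] in allowed for k, allowed in conditions.items())
    ]


roll_up = summarize(facts, by=["region"])             # coarser view: by region
drill_down = summarize(facts, by=["region", "city"])  # finer view: region, then city
slice_q1 = select(facts, quarter={"Q1"})              # fix a single dimension
dice = select(facts, region={"North"}, product={"shirt"})  # restrict several dimensions

print(roll_up)
print(drill_down)
```

Roll-up and drill-down differ only in how many levels of the hierarchy are
kept in the grouping, while slice and dice differ in how many dimensions are
restricted.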
Usefulness: Initially, when organizations start with, say, 50 GB of data, the
probability of all the data being used is quite high. But as the data grows in
size, the percentage of data that is actually used goes down.
Data management: When a data warehouse is newly deployed, it holds a small
amount of data, so data management is not complex. But as the data
grows in size, the data management activities become more and more complex
and take much more time to accomplish. For example, refreshing the data
with new values might have taken only an hour when there was a meagre 50
GB of data in the database, but now that the database has grown to 50
TB, the same activity may take several hours to complete.
Recapitulation
• The operational computer systems provide information to run the day-to-day
  operations, but they cannot be readily used to make strategic decisions.
• Data warehousing is a new paradigm specifically intended to provide
  strategic information.
• Data warehouses support decision making and present a flexible, conducive,
  and interactive source of strategic information to the managers and
  executives.
• A data warehouse is not a single software or hardware product. Rather it is
  a computing environment where users are put directly in touch with the data
  they need to make better decisions. It is a user-centric environment.
• A data warehouse is a blend of many technologies as it takes data from
  different operational systems and from outside sources like magazines,
  journals, reports of other organizations in the same industry; removes
  inconsistencies, transforms the data, and finally stores it in formats
  suitable for easy access for decision making.
• Data warehouses are meant to be used by executives, managers, and other
  people at higher managerial levels who may not have much technical expertise
  in handling the databases.
• Advantages of data warehouses include better decisions, increased
  productivity, lower operational costs, enhanced asset and liability
  management, and better CRM.
• While implementing a data warehouse in your organization, you need to be
  careful about extracting, cleaning, and loading of data; checking its
  compatibility with systems already in place; providing training to end-users
  and paying special attention to the security of the data.
• The data warehouse is used in two basic modes. In the verification mode, the
  user proposes a hypothesis and asks a series of questions to either confirm
  or repudiate it. In the discovery mode, the user desires to discover patterns
  of customer behaviour and relationships among the products that sell
  together.
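The difference between the verification and discovery modes can be shown with
a short Python sketch. The baskets below are invented sample data, and the
functions `verify` and `discover` are hypothetical names used only for this
illustration: the first checks a hypothesis supplied by the user, the second
looks for products that sell together without any hypothesis.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market baskets: the set of products bought in each transaction.
baskets = [
    {"shirt", "tie"},
    {"shirt", "tie", "belt"},
    {"jeans", "belt"},
    {"shirt", "jeans"},
    {"shirt", "tie"},
]


def verify(baskets, antecedent, consequent):
    """Verification mode: test the hypothesis 'buyers of X also buy Y'.

    Returns the fraction of baskets containing `antecedent` that also
    contain `consequent` (0.0 if the antecedent never appears).
    """
    with_x = [b for b in baskets if antecedent in b]
    if not with_x:
        return 0.0
    return sum(consequent in b for b in with_x) / len(with_x)


def discover(baskets, top=3):
    """Discovery mode: no hypothesis; count every product pair sold together."""
    pairs = Counter()
    for basket in baskets:
        pairs.update(combinations(sorted(basket), 2))
    return pairs.most_common(top)


print(verify(baskets, "shirt", "tie"))
print(discover(baskets, top=1))
```

On this sample, three of the four baskets containing a shirt also contain a
tie, so the hypothesis is supported at 0.75, and the most frequent pair found
in discovery mode is shirt and tie, with three joint sales.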
Objective Questions
1. Choose the right statements
(a) Operational systems are meant to provide information to run the
    day-to-day business.
(b) Operational systems are used to make strategic decisions.
(c) Data warehouse stores historical as well as current data.
(d) Historical data is used to study unusual trends in sales.
(e) Data warehouse contains integrated information from heterogeneous sources.
(f) Strategic information is required to run day-to-day operations.
(g) Strategic information is needed for the survival of the corporation in a
    highly competitive world.
(h) Operational staff needs strategic information.
2. Fill in the blanks
(a) A _____ is a user centric environment.
(b) _____ provides the users with access to accurate, consolidated information
    from various internal and external sources.
(c) The users of the data warehouse include _____, _____, _____ and _____.
(d) Data warehouse provides _____ data.
3. Multiple choice questions
(a) Reasons for moving into data warehousing include
    (i) Processing huge volumes of data
    (ii) Providing interactive analysis
    (iii) Increase in the computing power
    (iv) Lower costs
    (v) None of these
    (vi) All of these
(b) Characteristics of a data warehouse include
    (i) Stores only current data
    (ii) Facilitates analyses of large volumes of data
    (iii) Data extracted from only a single application
    (iv) User friendliness
    (v) Contains read intensive data
    (vi) Can be updated
(c) Choose the characteristics of an operational system.
    (i) Current data
    (ii) Optimized for complex queries
    (iii) Predictive usage
    (iv) 100 MB – 1 GB database size
    (v) High access frequency
Review Questions
1. What do you understand by strategic information? Give suitable examples.
   Also write down some of the characteristics of strategic information. For a
   commercial bank, name five types of strategic objectives.
2. Explain the term Information Crisis.
3. As you have seen, a retail store collects huge amounts of data through its
   operational systems. Name any four types of transaction data that are likely
   to be collected by the retail store through its daily operations.
4. Differentiate between operational systems and informational systems.
5. Give reasons why operational systems are not useful for making strategic
   decisions.
6. Explain the factors which lead to the growth and usage of data warehouses.
7. Data warehouse is an environment, not a product. Comment.
8. Write a short note on benefits of data warehousing.
9. How can you say that data warehousing is a blend of many technologies?
10. Data warehousing is the only viable means to resolve the information crisis
    and to provide strategic information. Justify the statement.