
MCS-221
Data Warehousing and Data Mining

Indira Gandhi National Open University
School of Computer and Information Sciences (SOCIS)

Block 1

DATA WAREHOUSE FUNDAMENTALS AND ARCHITECTURE
UNIT 1
Fundamentals of Data Warehouse
UNIT 2
Data Warehouse Architecture
UNIT 3
Dimensional Modeling
PROGRAMME DESIGN COMMITTEE
Prof. (Retd.) S.K. Gupta, IIT, Delhi
Prof. T.V. Vijay Kumar, JNU, New Delhi
Prof. Ela Kumar, IGDTUW, Delhi
Prof. Gayatri Dhingra, GVMITM, Sonipat
Mr. Milind Mahajan, Impressico Business Solutions, New Delhi
Sh. Shashi Bhushan Sharma, Associate Professor, SOCIS, IGNOU
Sh. Akshay Kumar, Associate Professor, SOCIS, IGNOU
Dr. P. Venkata Suresh, Associate Professor, SOCIS, IGNOU
Dr. V.V. Subrahmanyam, Associate Professor, SOCIS, IGNOU
Sh. M.P. Mishra, Assistant Professor, SOCIS, IGNOU
Dr. Sudhansh Sharma, Assistant Professor, SOCIS, IGNOU

COURSE DESIGN COMMITTEE
Prof. T.V. Vijay Kumar, JNU, New Delhi
Dr. Rahul Johri, USICT, GGSIPU, New Delhi
Mr. Vinay Kumar Sharma, NVLI, IGNOU
Sh. Shashi Bhushan Sharma, Associate Professor, SOCIS, IGNOU
Sh. Akshay Kumar, Associate Professor, SOCIS, IGNOU
Dr. P. Venkata Suresh, Associate Professor, SOCIS, IGNOU
Dr. V.V. Subrahmanyam, Associate Professor, SOCIS, IGNOU
Sh. M.P. Mishra, Assistant Professor, SOCIS, IGNOU
Dr. Sudhansh Sharma, Assistant Professor, SOCIS, IGNOU

SOCIS FACULTY
Prof. P. Venkata Suresh, Director, SOCIS, IGNOU
Prof. V.V. Subrahmanyam, SOCIS, IGNOU
Dr. Akshay Kumar, Associate Professor, SOCIS, IGNOU
Dr. Naveen Kumar, Associate Professor, SOCIS, IGNOU (on EOL)
Dr. M.P. Mishra, Associate Professor, SOCIS, IGNOU
Dr. Sudhansh Sharma, Assistant Professor, SOCIS, IGNOU
Dr. Manish Kumar, Assistant Professor, SOCIS, IGNOU

BLOCK PREPARATION TEAM

Course Editor
Prof. Devendra Kumar Tayal, Dept. of Computer Science & Engineering, Indira Gandhi Delhi Technical University for Women, New Delhi

Language Editor
Prof. Parmod Kumar, School of Humanities, IGNOU, New Delhi

Course Writers
Unit 1: Ms. U. Chaitanya, Asst Professor, Dept. of Information Technology, Mahatma Gandhi Institute of Technology, Hyderabad
Unit 2: Prof. K. Swathi, NRI Institute of Technology, Vijayawada
Unit 3: Prof. Archana Singh, Dept. of Information Technology, Amity School of Engineering & Technology, Noida

Course Coordinator: Prof. V.V. Subrahmanyam

Print Production
Mr. Sanjay Aggarwal, Assistant Registrar (Publication), MPDD

April, 2022
July, 2023

© Indira Gandhi National Open University, 2022

ISBN- 978-93-5568-774-6

All rights reserved. No part of this work may be reproduced in any form, by mimeograph or any other means, without
permission in writing from the Indira Gandhi National Open University.

Further information on the Indira Gandhi National Open University courses may be obtained from the University’s office at
Maidan Garhi, New Delhi-110068.

Printed and published on behalf of the Indira Gandhi National Open University, New Delhi by MPDD, IGNOU.

Laser Typeset by Raj Printers, A-9, Sector B-2, Tronica City, Loni (Gzb.)

Printed at: Rohan Pragya Printing And Packaging Pvt Ltd., H-76, Site-V, UPSIDC, Kasna
COURSE INTRODUCTION
This course is of 4 credits, divided into two parts: the first part (2 credits) covers Data Warehousing and the second part (2 credits) covers Data Mining.
A data warehouse is a system that stores data from a company’s operational
databases as well as external sources. Data warehouse platforms are different from
operational databases because they store historical information, making it easier
for business leaders to analyze data over a specific period of time. Data warehouse
platforms also sort data based on different subject matter, such as customers,
products or business activities.
Many global corporations have turned to data warehousing to organize data that
streams in from corporate branches and operations centers around the world. It’s
essential for IT students to understand how data warehousing helps businesses
remain competitive in a quickly evolving global marketplace. Data warehousing
is an increasingly important business intelligence tool which enables historical
insights, ensures consistency, allows organizations to make better business decisions,
decreases costs, maximizes efficiency, increases the power and speed of data analytics,
provides a major competitive edge, and increases sales to improve the bottom line.
It is necessary to choose adequate Data Mining algorithms to make the Data
Warehouse more useful. Data mining algorithms are used for transforming data
into business information and thereby improving the decision-making process. Data
Mining is a set of methods used for data analysis, created with the aim of finding
specific dependences, relations and rules in the data and turning them into new,
higher-quality information. Data Mining gives results that show the interdependence
and relations of data. These dependences are mainly based on various mathematical
and statistical relations. Data are collected from internal databases and converted
into various documents, reports, lists, etc. which can be further used in decision-making
processes. After the data for analysis is selected, Data Mining is applied to find the
relevant rules of behavior and patterns. That is the reason why Data Mining is also
known as extraction of knowledge, data archeology or pattern analysis. Data mining
helps to develop smart market decisions, run accurate campaigns, make predictions,
and more. With the help of Data Mining, we can analyze customer behaviors and
their insights. This leads to greater success and a data-driven business.
The course is organized into 4 Blocks:
Block 1 covers the Introductory topics on Data Warehousing, Data Warehouse
architecture, Data Marts and Dimensional Modeling.
Block 2 covers the Extract, Transform and Loading (ETL) aspects of Data
Warehousing, Online Analytical Processing and some Trends in Data Warehouse.
Block 3 covers the introductory topics related to Data Mining, Data Preprocessing,
and Mining Frequent Patterns and Associations.
Block 4 covers Classification and Clustering in Data Mining, and Text and Web
Mining.
There is a lab component associated with this course (i.e., Section-2 Data Mining
Lab of MCSL-223 course).
BLOCK INTRODUCTION
The title of the Block is Data Warehouse Fundamentals and Architecture. The
objectives of this block are to make you understand about the underlying concepts
of Data Warehousing, identify the components of the Data Warehouse Architecture,
to know the difference between the Data Warehouse and Data Marts, to understand
the Data Warehouse Development Life Cycle and to elucidate the dimensional
modeling techniques.
The block is organized into 3 units:
Unit 1 covers the fundamentals of data warehousing, its evolution, characteristics
of data warehousing, online transaction processing systems and applications of data
warehouses;
Unit 2 covers the data warehouse architecture, data marts and data warehouse
development life cycle; and
Unit 3 covers the introduction to dimensional modeling, identifying facts and
dimensions, star schema, snowflake schema and fact constellation schema.
UNIT 1 FUNDAMENTALS OF DATA WAREHOUSE
Structure
1.0 Introduction
1.1 Objectives
1.2 Evolution of Data Warehouse
1.3 Data Warehouse and its Need
1.3.1 Need for Data Warehouse
1.3.2 Benefits of Data Warehouse
1.4 Data Warehouse Design Approaches
1.4.1 Top-Down Approach
1.4.2 Bottom-Up Approach
1.5 Characteristics of a Data Warehouse
1.5.1 How Data Warehouse Works?
1.6 OLTP and OLAP
1.6.1 Online Transaction Processing (OLTP)
1.6.2 Online Analytical Processing (OLAP)
1.7 Data Granularity
1.8 Metadata and Data Warehousing
1.9 Data Warehouse Applications
1.10 Types of Data Warehouses
1.10.1 Enterprise Data Warehouse
1.10.2 Operational Data Store
1.10.3 Data Mart
1.11 Popular Data Warehouse Platforms
1.12 Summary
1.13 Solutions/Answers
1.14 Further Readings

1.0 INTRODUCTION
A database is a collection of information or data that is generally stored
electronically in a computer system. It is easy to access, manage, modify, update,
monitor, and organize the data. Data is stored in the tables of the database.
The process of consolidating data and analyzing it to obtain insights has
been around for centuries, but we only recently began referring to this as data
warehousing. Any operational or transactional system is designed only for its
own functionality and hence it can handle limited amounts of data for a limited
amount of time. Operational systems are not designed or architected for long-term
data retention, as the historical data is of little to no importance to them. However,
to gain point-in-time visibility and understand the high-level operational aspects
of any business, the historical data plays a vital role. With the emergence of mature
Relational Database Management Systems (RDBMS) in the 1970s, engineers across
various enterprises started architecting ways to copy the data from the transactional
systems over to separate databases via manual or automated mechanisms and use
it for reporting and analysis. As the data in the transactional systems would get
purged periodically, this would not be the case in these analytical repositories, as their
purpose was to store as much data as possible; hence the term "data warehouse"
came into existence, because these repositories would become a warehouse for the
data.
Data Warehousing (DW) as a practice became very prominent during the late 1980s,
when enterprises started building decision support systems that were mainly responsible
for supporting reporting. As there was a rapid advancement in the performance of
relational databases during the late 1990s and early 2000s, Data Warehousing became
a core part of the Information Technology group across large enterprises. In fact,
some vendors like Netezza and Teradata started offering customized hardware
to manage data warehouse architectures within state-of-the-art machines. Data
Warehousing has been at the top of the list of priorities since the mid-2000s. The data
supply chain ecosystem has grown exponentially in the current world, and so has the
way enterprises architect their data warehouses.
A well architected data warehouse serves as an extended vision for the enterprise
where multiple departments can gain actionable insights to manage key business
decisions that could drive operational excellence or revenue generating opportunities
for the enterprise.
This unit covers the basic features of data warehousing, its evolution, characteristics,
online transaction processing (OLTP), online analytical processing, popular
platforms and applications of data warehouses.

1.1 OBJECTIVES
After going through this unit, you shall be able to:
• understand the evolution of the data warehouse;
• describe various characteristics of a data warehouse;
• list the benefits and applications of a data warehouse;
• discuss the significance of metadata in a data warehouse;
• list and discuss the types of data warehouses; and
• identify the popular data warehouse platforms.

1.2 EVOLUTION OF DATA WAREHOUSE


The relational database revolution in the early 1980s ushered in an era of improved
access to the valuable information contained deep within data. It was soon discovered
that databases modeled to be efficient at transactional processing were not always
optimized for complex reporting or analytical needs.
In fact, the need for systems offering decision support functionality predates the
first relational model and SQL. But the practice known today as Data Warehousing
really saw its genesis in the late 1980s. An IBM Systems Journal article published
in 1988, "An architecture for a business information system," coined the term
"business data warehouse," although a future progenitor of the practice, Bill Inmon,
used a similar term in the 1970s. Considered by many to be the Father of Data
Warehousing, Bill Inmon, an American computer scientist, first began to discuss
the principles around the Data Warehouse and even coined the term. Throughout
the latter 1970s into the 1980s, Inmon worked extensively as a data professional,
honing his expertise in all manners of relational Data Modeling. Inmon's work as
a Data Warehousing pioneer took off in the early 1990s when he ventured out on
his own, forming his first company, Prism Solutions. One of Prism’s main products
was the Prism Warehouse Manager, one of the first industry tools for creating and
managing a Data Warehouse.
In 1992, Inmon published Building the Data Warehouse, one of the seminal
volumes of the industry. Later in the 1990s, Inmon developed the concept of the
Corporate Information Factory, an enterprise level view of an organization’s data
of which Data Warehousing plays one part. Inmon’s approach to Data Warehouse
design focuses on a centralized data repository modeled to the third normal form.
Inmon's approach is often characterized as a top-down approach. Inmon feels that
using strong relational modeling leads to enterprise-wide consistency, facilitating easier
development of individual data marts to better serve the needs of the departments
using the actual data. This approach differs in some respects from that of the "other"
father of Data Warehousing, Ralph Kimball.
While Inmon’s Building the Data Warehouse provided a robust theoretical
background for the concepts surrounding Data Warehousing, it was Ralph
Kimball’s The Data Warehouse Toolkit, first published in 1996, that included a
host of industry-honed, practical examples for OLAP-style modeling. Kimball, on
the other hand, favors the development of individual data marts at the departmental
level that get integrated together using the Information Bus architecture. This bottom-up
approach fits in nicely with Kimball's preference for star-schema modeling.
Both approaches remain core to Data Warehousing architecture as it stands today.
Smaller firms might find Kimball’s data mart approach to be easier to implement
with a constrained budget. Dimensional modeling in many cases is easier for the
end user to understand.
According to Bill Inmon, “A warehouse is a subject-oriented, integrated, time-
variant and non-volatile collection of data in support of management’s decision
making process”.
According to Ralph Kimball, “Data Warehouse (DW) is the conglomerate of all data
marts within the enterprise. Information is always stored in the dimensional model”.

1.3 DATA WAREHOUSING AND ITS NEED


Data Warehouse is used to collect and manage data from various sources, in order
to provide meaningful business insights. A data warehouse is usually used for
linking and analyzing heterogeneous sources of business data. The data warehouse
is the center of the data collection and reporting framework developed for the BI
system. Data warehouse systems are repositories of historical information that are
typically not tied to specific applications. Data warehouses gather data from
multiple sources (including databases), with an emphasis on storing, filtering,
retrieving and, in particular, analyzing huge quantities of organized data. The data
warehouse operates in an information-rich environment that provides an overview
of the company, makes the current and historical data of the company available
for decisions, enables decision support transactions without obstructing operational
systems, makes information consistent for the organization, and presents a flexible
and interactive information source.
1.3.1 Need for Data Warehouse
Data warehouses are used extensively in the largest and most complex businesses
around the world. In demanding situations, good decision making becomes critical.
Significant and relevant data is required to make decisions. This is possible only
with the help of a well-designed data warehouse. Following are some of the reasons
for the need of Data Warehouses:
Enhancing the turnaround time for analysis and reporting: Data warehouse
allows business users to access critical data from a single source enabling them
to take quick decisions. They need not waste time retrieving data from multiple
sources. The business executives can query the data themselves with minimal or no
support from IT which in turn saves money and time.
Improved Business Intelligence: Data warehouse helps in achieving the vision
for the managers and business executives. Outcomes that affect the strategy and
procedures of an organization will be based on reliable facts and supported with
evidence and organizational data.
Benefit of historical data: Transactional systems store data on a day-to-day basis
or for a very short duration, without including historical data. In
comparison, a data warehouse stores large amounts of historical data, which enables
the business to perform time-period analysis, trend analysis, and trend forecasts.
Standardization of data: The data from heterogeneous sources is made available in a
single format in the data warehouse. This simplifies the readability and accessibility
of data. For example, gender may be denoted as Male/Female in Source 1 and m/f in
Source 2, but in the data warehouse it is stored in a single format which is common
across all the sources, i.e. M/F, as illustrated in the sketch below.
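As a small illustration of this idea, the following minimal sketch (in Python, with purely hypothetical source values and mapping) shows how the differing gender codes of two sources can be mapped onto the single convention used in the warehouse.

```python
# Minimal sketch: standardizing gender codes from two hypothetical sources
# into the single M/F convention used by the warehouse.
GENDER_MAP = {
    "male": "M", "female": "F",   # convention used by Source 1 (Male/Female)
    "m": "M", "f": "F",           # convention used by Source 2 (m/f)
}

def standardize_gender(value: str) -> str:
    """Return the warehouse-standard code, or 'U' (unknown) if unmapped."""
    return GENDER_MAP.get(value.strip().lower(), "U")

print(standardize_gender("Female"))  # -> F
print(standardize_gender("m"))       # -> M
```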
Immense ROI (Return On Investment): Return on investment refers to the
additional revenue or reduced expenses that a business will be able to realize from a
project.
Now, let us study the benefits.
1.3.2 Benefits of Data Warehouse
Several enterprises adopt data warehousing as it offers many benefits, such as
streamlining the business and increasing profits. Following are some of the benefits
of having a data warehouse:
Scalability - Businesses today cannot survive for long if they cannot easily expand
and scale to match the increase in the volume of daily transactions. DW is easy to
scale, making it easier for the business to stride ahead with minimum hassle.
Access to Historical Insights - Though real-time data is important, historical
insights cannot be ignored when tracing patterns. Data warehousing allows
businesses to access past data with just a few clicks. Data that are months and years
old can be stored in the warehouse.
Works On-Premises and on Cloud - Data warehouses can be built on-premises
or on cloud platforms. Enterprises can choose either option, depending on their
existing business system and the long-term plan. Some businesses rely on both.
Better Efficiency - Data warehousing increases the efficiency of the business by
collecting data from multiple sources and processing it to provide reliable and
actionable insights. The top management uses these insights to make better and
faster decisions, resulting in more productivity and improved performance.

Improved Data Security - Data security is crucial in every enterprise. By collecting
data in a centralized warehouse, it becomes easier to set up a multi-level security
system to prevent the data from being misused. Access to data can be restricted
based on the roles and responsibilities of the employees.
Increase Revenue and Returns - When the management and employees have
access to valuable data analytics, their decisions and actions will strengthen the
business. This increases the revenue in the long run.
Faster and Accurate Data Analytics - When data is available in the central data
warehouse, it takes less time to perform data analysis and generate reports. Since
the data is already cleaned and formatted, the results will be more accurate.
Let us study the various approaches in detail in the following section.

1.4 DATA WAREHOUSE DESIGN APPROACHES


The design approach is a very important aspect of building a data
warehouse. Selection of the right data warehouse design could save a lot of project
time and cost.
There are two different Data Warehouse Design Approaches normally followed
when designing a Data Warehouse solution and based on the requirements of
your project you can choose which one suits your particular scenario. These
methodologies are a result of research from Bill Inmon (Top-Down Approach) and
Ralph Kimball (Bottom-Up Approach).
1.4.1 Top-down Approach
Bill Inmon’s design methodology is based on a top-down approach which is
illustrated in the Figure 1. In the top-down approach, the data warehouse is
designed first and then data mart is built on top of data warehouse.

Figure 1: Top-Down DW Design Approach (the figure shows data flowing from the Source System, extracted into the Staging area, and loaded via ETL into the Data Warehouse)


Below are the steps that are involved in the top-down approach (a small illustrative sketch follows the list):
• Data is extracted from the various source systems. The extracts are loaded
and validated in the staging area. Validation is required to make sure the
extracted data is accurate and correct. You can use ETL tools or a similar approach
to extract the data and push it to the data warehouse.
• Data is extracted from the data warehouse on a regular basis into the staging area. At
this step, you apply various aggregation and summarization techniques to the
extracted data and load it back into the data warehouse.
• Once the aggregation and summarization is completed, the various data marts
extract that data and apply some more transformations to shape the data into the
structure defined by the data marts.
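The flow can be sketched with Python's built-in sqlite3 module, as below. The table names (source_orders, stage_orders, dw_sales) are hypothetical and simply stand in for a real source system, staging area and warehouse; a production implementation would, of course, use dedicated ETL tooling.

```python
import sqlite3

con = sqlite3.connect(":memory:")  # stands in for source, stage and warehouse

# 1. Extract from the (hypothetical) source system into the staging area.
con.executescript("""
    CREATE TABLE source_orders (order_id INTEGER, product TEXT,
                                qty INTEGER, order_date TEXT);
    INSERT INTO source_orders VALUES
        (1, 'Pen', 10, '2022-01-05'),
        (2, 'Pen',  5, '2022-01-17'),
        (3, 'Book', 2, '2022-02-03');
    CREATE TABLE stage_orders AS SELECT * FROM source_orders;
""")

# 2. Validate in the staging area (e.g., reject rows with a non-positive quantity).
con.execute("DELETE FROM stage_orders WHERE qty <= 0")

# 3. Summarize and load into the warehouse; data marts would later read from here.
con.execute("""
    CREATE TABLE dw_sales AS
    SELECT product, strftime('%Y-%m', order_date) AS month, SUM(qty) AS total_qty
    FROM stage_orders
    GROUP BY product, month
""")
print(con.execute("SELECT * FROM dw_sales ORDER BY month").fetchall())
```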
1.4.2 Bottom-up Approach
Ralph Kimball’s data warehouse design approach is called dimensional modelling
or the Kimball methodology which is illustrated in Figure 2. This methodology
follows the bottom-up approach.
As per this method, data marts are first created to provide reporting and analytics
capability for specific business processes; later, the enterprise data warehouse is
created from these data marts.

Source Systems

DM2

ETL

Data Warehouse

Figure 2: Bottom-Up DW Design Approach

Basically, the Kimball model reverses the Inmon model, i.e. the data marts are directly
loaded with the data from the source systems and an ETL process is then used to load
the Data Warehouse. The above figure depicts how the bottom-up approach works.
Below are the steps that are involved in the bottom-up approach (a small illustrative sketch follows the list):
• The data flow in the bottom-up approach starts with the extraction of data from
the various source systems into the staging area, where it is processed and loaded
into the data marts that handle specific business processes.
• After the data marts are refreshed, the current data is once again extracted from
the data marts into the staging area, where transformations, aggregation and
summarization are applied; it is then loaded into the EDW and made available
to the end users for analysis, enabling critical business decisions.
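Continuing with the same illustrative sqlite3-based style (all table names hypothetical), the reverse flow can be sketched as follows: the departmental marts are loaded first, and the enterprise data warehouse is then assembled from them.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Department-level data marts are built first from their own source extracts.
con.executescript("""
    CREATE TABLE dm_sales   (month TEXT, amount REAL);
    CREATE TABLE dm_returns (month TEXT, amount REAL);
    INSERT INTO dm_sales   VALUES ('2022-01', 1500.0), ('2022-02', 1750.0);
    INSERT INTO dm_returns VALUES ('2022-01',   90.0);
""")

# The enterprise data warehouse (EDW) is then assembled from the conformed marts.
con.execute("""
    CREATE TABLE edw_facts AS
    SELECT 'sales'   AS subject, month, amount FROM dm_sales
    UNION ALL
    SELECT 'returns' AS subject, month, amount FROM dm_returns
""")
print(con.execute("SELECT * FROM edw_facts ORDER BY month, subject").fetchall())
```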
Having discussed the data warehouse design strategies, let us study the
characteristics of the DW in the next section.

1.5 CHARACTERISTICS OF A DATA WAREHOUSE


Data warehouses are systems that are concerned with studying, analyzing and
presenting enterprise data in a way that enables senior management to make
decisions. The data warehouses have four essential characteristics that distinguish
them from any other data and these characteristics are as follows:
• Subject-oriented
A DW is always subject-oriented, as it provides information about a specific
theme instead of the current organizational operations; the data warehousing process
is organized around a specific, well-defined theme (subject). Figure 3 shows Sales,
Products, Customers and Accounts as different themes.
A data warehouse does not emphasize current operations alone. Instead, it focuses on
the presentation and analysis of data to support different decisions. It also provides an
easy and accurate presentation of specific themes by eliminating information that
is not needed to make decisions.

Figure 3: Subject-Oriented Characteristic Feature of a DW

• Integrated
Integration involves setting up a common system of measurement for all similar data
from multiple systems. Data shared across several database repositories must be
stored in a consistent, secure manner so that it can be accessed by the data warehouse.
A data warehouse integrates data from various sources and combines it in a relational
database. It must be consistent, readable, and consistently coded. The data warehouse
integrates several subject areas, as shown in Figure 4.

Figure 4: Integrated Characteristic Feature of a DW
• Time-Variant
Information may be held at various intervals, such as weekly, monthly, and yearly,
as shown in Figure 5. The warehouse stores data as a series of snapshots, each
covering a limited period of time. The data warehouse covers a broader time horizon
than the operational systems. Because the data stored in the warehouse spans a
period of time, it can be used for prediction and provides history; it has an element
of time embedded within it. Another facet of the data warehouse is that the data
cannot be changed, modified or updated once it is stored.

Figure 5: Time- Variant Characteristic Feature of a DW

• Non-Volatile
The data residing in the data warehouse is permanent, as the name non-volatile
suggests. This ensures that when new data is added, existing data is not erased or
removed. The warehouse holds a mammoth amount of data, and the data is analyzed
within the warehouse technologies. Figure 6 shows the non-volatile data warehouse
versus the operational database. A data warehouse is kept separate from the operational
database and thus does not reflect the frequent changes made in the
operational database. Data warehouse integration manages the different stores
relevant to each subject.

Figure 6: Non-Volatile Characteristic Feature of a DW

1.5.1 How Data Warehouse Works?


A data warehouse is a central repository in which one or more sources of information
are collected. Data in the data warehouse may be Structured, Semi-structured, or
Unstructured. Data are processed, transformed, and accessed by end users for use in
business intelligence reporting and decision-making. A data warehouse integrates
disparate primary sources into a comprehensive source. Through the integration of
all this information, an organization can maintain a more holistic level of customer
service. This ensures that all available data is properly considered. The data warehouse
also enables data mining to find patterns in the information that can increase profits.
Figure 7 shows the important components of the data warehouse.

Figure 7: Components of a Data Warehouse

• Load Manager
The Load Manager component of the data warehouse is responsible for collecting
data from the operational systems and converting it into a usable form for the users.
This component is responsible for importing and exporting data from operational
systems. It includes all of the programs and application interfaces
that are responsible for pulling the data out of the operational system, preparing it,
and loading it into the warehouse itself. It performs tasks such as identification
of data; validation of the data for accuracy; extraction of data from the original
source; cleansing of data by eliminating meaningless values and making it usable;
data formatting; data standardization, by getting the data into a consistent form; data
merging, by taking data from different sources and consolidating it in one place; and
establishing referential integrity.
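As one small example of the load manager's validation work, the sketch below (pure Python, with hypothetical records) checks referential integrity between extracted fact rows and a customer dimension before the load, diverting orphan rows for review.

```python
# Minimal sketch: a load-manager-style referential-integrity check applied
# to hypothetical extracts before they are loaded into the warehouse.
customer_dimension_keys = {"C001", "C002"}          # keys known to the dimension
fact_rows = [
    {"customer_id": "C001", "amount": 120.0},
    {"customer_id": "C999", "amount": 45.0},        # orphan row: unknown customer
]

clean, rejected = [], []
for row in fact_rows:
    (clean if row["customer_id"] in customer_dimension_keys else rejected).append(row)

print("rows to load:", clean)
print("rows held for review:", rejected)
```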
• Warehouse Manager
The warehouse manager is the center of data-warehousing system and is the
data warehouse itself. It is a large, physical database that holds a vast amount of
information from a wide variety of sources. The data within the data warehouse
is organized such that it becomes easy to find, use and update frequently from its
sources.
• Query Manager
Query Manager Component provides the end-users with access to the stored
warehouse information through the use of specialized end-user tools. Data mining
access tools have various categories such as query and reporting, on-line analytical
processing (OLAP), statistics, data discovery and graphical and geographical
information systems.
• End-user access tools
These are divided into the following categories:
• Reporting Data
• Query Tools
• Data Dippers
• Tools for EIS
• Tools for OLAP and tools for data mining.
Check Your Progress 1
1) What is a Data Warehouse and why is it important?
……………………………………………………………………………
……………………………………………………………………………
……………………………………………………………………………
…….…………………………………………………………......................
2) Mention the characteristics of a Data Warehouse.
……………………………………………………………………………
……………………………………………………………………………
……………………………………………………………………………
…….…………………………………………………………......................

1.6 OLTP AND OLAP


Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP)
are the two terms which look similar but refer to different kinds of systems. Online
transaction processing (OLTP) captures, stores, and processes data from transactions
in real time. Online analytical processing (OLAP) uses complex queries to analyze
aggregated historical data from OLTP systems.
1.6.1 Online Transaction Processing (OLTP)
An OLTP system captures and maintains transaction data in a database. Each
transaction involves individual database records made up of multiple fields or
columns. Examples include banking and credit card activity or retail checkout
scanning.
In OLTP, the emphasis is on fast processing, because OLTP databases are read,
written, and updated frequently. If a transaction fails, built-in system logic ensures
data integrity.
1.6.2 Online Analytical Processing (OLAP)
OLAP applies complex queries to large amounts of historical data, aggregated
from OLTP databases and other sources, for data mining, analytics, and business
intelligence projects. In OLAP, the emphasis is on response time to these complex
queries. Each query involves one or more columns of data aggregated from many
rows.
Examples include year-over-year financial performance or marketing lead
generation trends. OLAP databases and data warehouses give analysts and decision-makers
the ability to use custom reporting tools to turn data into information. Query
failure in OLAP does not interrupt or delay transaction processing for customers,
but it can delay or impact the accuracy of business intelligence insights.
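The contrast can be sketched with two operations against a small, hypothetical orders table (Python with the built-in sqlite3 module): the OLTP-style statement touches one row in a short transaction, while the OLAP-style query aggregates many rows.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL, yr INTEGER)")
con.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)",
                [(1, 'North', 100.0, 2021), (2, 'North', 150.0, 2022),
                 (3, 'South',  80.0, 2022)])

# OLTP-style operation: a short transaction that updates a single record.
con.execute("UPDATE orders SET amount = 120.0 WHERE id = 1")

# OLAP-style query: aggregate many rows for a year-over-year view.
for row in con.execute("""SELECT yr, region, SUM(amount) AS total
                          FROM orders GROUP BY yr, region ORDER BY yr, region"""):
    print(row)
```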

OLTP is operational, while OLAP is informational. A glance at the key features


of both kinds of processing illustrates their fundamental differences, and how they
work together. The table (Table 1) below summarizes differences between OLTP
and OLAP.
Table 1: OLTP Vs OLAP

• Characteristics: OLTP handles a large number of small transactions; OLAP handles large volumes of data with complex queries.
• Query types: OLTP uses simple standardized queries; OLAP uses complex queries.
• Operations: OLTP is based on INSERT, UPDATE, DELETE commands; OLAP is based on SELECT commands to aggregate data for reporting.
• Response time: OLTP responds in milliseconds; OLAP in seconds, minutes, or hours depending on the amount of data to process.
• Design: OLTP is industry-specific, such as retail, manufacturing, or banking; OLAP is subject-specific, such as sales, inventory, or marketing.
• Source: OLTP works on transactions; OLAP on aggregated data from transactions.
• Purpose: OLTP controls and runs essential business operations in real time; OLAP is used to plan, solve problems, support decisions, and discover hidden insights.
• Data updates: OLTP has short, fast updates initiated by the user; OLAP data is periodically refreshed with scheduled, long-running batch jobs.
• Space requirements: OLTP is generally small if historical data is archived; OLAP is generally large due to aggregating large datasets.
• Backup and recovery: OLTP requires regular backups to ensure business continuity and meet legal and governance requirements; in OLAP, lost data can be reloaded from the OLTP database as needed in lieu of regular backups.
• Productivity: OLTP increases the productivity of end users; OLAP increases the productivity of business managers, data analysts, and executives.
• Data view: OLTP lists day-to-day business transactions; OLAP provides a multi-dimensional view of enterprise data.
• User examples: OLTP users are customer-facing personnel, clerks, and online shoppers; OLAP users are knowledge workers such as data analysts, business analysts, and executives.
• Database design: OLTP uses normalized databases for efficiency; OLAP uses denormalized databases for analysis.
OLTP provides an immediate record of current business activity, while OLAP
generates and validates insights from that data as it's compiled over time. That
historical perspective empowers accurate forecasting, but as with all business
intelligence, the insights generated with OLAP are only as good as the data pipeline
from which they emanate.
Check Your Progress 2
1) Why a data warehouse is separated from Operational Databases?
……………………………………………………………………………
……………………………………………………………………………
……………………………………………………………………………
…….…………………………………………………………......................
2) Mention the key differences between a database and a data warehouse.
……………………………………………………………………………
……………………………………………………………………………
……………………………………………………………………………..

1.7 DATA GRANULARITY


Granularity is one of the main elements in the modeling of DW data. Granularity
of data refers to the level of detail. Multiple levels of detail may be available depending
on the requirements, and at least two granular levels exist in many data warehouses.
The relation between detail and granularity is important to understand: less (fine)
granularity means greater detail of the data (less summarization), while greater
(gross) granularity means fewer details (greater summarization).

Operational data is stored at the lowest level of detail. In an outlet system, sale units
are stored at the level of product units per transaction; in an order entry system, the
amount ordered is stored per customer at the unit level per order. You can add up the
individual transactions whenever you need the summary data: when you take the items
of a product ordered this month and add them together, you have the total of all orders
for that product entered in that month. Summary data is generally not kept in an
operational system.

When a user requests an analysis in the data warehouse, the user usually wants to view
summary data first. A user may start with the total sale units of a product across an
entire region, may then wish to examine the breakdown within the region, and might
next look at the sales units in each store. The analysis often starts at a high level and
is then taken down into detail. Therefore, in a data warehouse, summaries of the data
at different levels can be maintained effectively, and answers can be provided at the
most summarized or the most detailed level, as illustrated in the sketch below.

The level of detail of the data is the level of granularity of the data warehouse; the
more detail in the data, the finer the granularity. A large amount of data needs to
be stored permanently in the data warehouse. The choice of granularity depends on
how the data will be processed and on the performance expectations. For example,
for each year, details for each month, day, hour, minute, second, and so forth may
be available.
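The sketch below (pure Python, with hypothetical transaction records) illustrates the idea: the same detail-level rows are rolled up to daily and then to monthly granularity.

```python
from collections import defaultdict

# Hypothetical transaction-level (finest-granularity) sales records.
transactions = [
    {"date": "2022-03-01", "units": 4},
    {"date": "2022-03-01", "units": 2},
    {"date": "2022-03-15", "units": 7},
    {"date": "2022-04-02", "units": 3},
]

daily, monthly = defaultdict(int), defaultdict(int)
for t in transactions:
    daily[t["date"]] += t["units"]        # finer granularity: per day
    monthly[t["date"][:7]] += t["units"]  # coarser granularity: per month

print(dict(daily))    # {'2022-03-01': 6, '2022-03-15': 7, '2022-04-02': 3}
print(dict(monthly))  # {'2022-03': 13, '2022-04': 3}
```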
1.8 METADATA AND DATA WAREHOUSING
In a data warehouse, data is stored using a common schema controlled by a common
dictionary. Within the data dictionary, data is kept about the logical data structures,
file and address data, index information and others. The metadata should contain
the following data warehouse information:
• The data structure based on the programmer's view
• The data structure based on the DSS analysts' view
• The DW's data sources
• The data transformation at the moment of its migration to the DW
• The model of the data
• The connection between the data model and the DW
• The data extraction history
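A small, purely hypothetical illustration of such metadata is sketched below: a single entry describing a warehouse column, where it came from, and the transformation applied when it was loaded.

```python
# Minimal sketch: one hypothetical metadata entry describing the lineage of a
# warehouse column; real metadata repositories hold many such records.
metadata_entry = {
    "warehouse_table": "dw_sales",
    "column": "total_qty",
    "source_system": "orders_oltp",
    "source_column": "qty",
    "transformation": "SUM(qty) grouped by product and month",
    "load_frequency": "daily batch",
    "last_extracted": "2022-04-02T01:00:00",
}

# Such entries let an analyst trace a figure in a report back to its source.
print(metadata_entry["transformation"])
```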
In the DW environment, metadata is a major component. Metadata helps to control
reporting accuracy, validates the transformation of data and ensures calculation
accuracy. Metadata also records the company end-users' definitions of
business terms. More details on metadata are provided in the next Unit.

1.9 DATA WAREHOUSING APPLICATIONS


In different sectors, there are numerous applications such as e-commerce,
telecommunication, transport, marketing, distribution and retail. Given below are
some of the applications of data warehouses:
Investment and Insurance: In this sector, data warehousing is used to analyze the
customer, market trends and other patterns of data. The two sub-sectors where data
warehousing plays an important role are Forex and stock markets.
Healthcare: A data warehousing system is used to forecast the outcomes of a treatment,
generate reports, and share the data with different units. These units can be the
research labs, medical units, and insurance providers. Enterprise data warehouses
serve as the backbone of healthcare systems as they are updated with recent
information which is crucial for saving lives.
Retail: Be it distribution, marketing, examining pricing policies, keeping track
of promotional deals, or finding patterns in customer buying trends, data
warehousing supports it all. Many retail chains incorporate enterprise data warehousing
for business intelligence and forecasting.
Social Media Websites: Social networking sites such as Facebook, Twitter,
LinkedIn, etc. are based on the analysis of large data sets. These sites collect data on
members, groups, locations, etc. and store this information in a single central
repository. Because of the high volume of data, a data warehouse is needed to
manage it.
Banking: Most banks are now using warehouses to see account/cardholder spending
patterns. They use this to make special offers, deals, etc. available.
Government: The government uses data warehouses to store and analyze tax records
and to detect tax evasion.
Airlines: Data warehouses are used in airline systems for operational purposes such as
crew assignment, route profitability analysis, frequent flyer program promotions, etc.
Public sector: Information is collected in the public sector's data warehouse. It helps
government agencies and departments manage their data and records.
1.10 TYPES OF DATA WAREHOUSES
There are three different types of traditional Data Warehouse models as listed
below:
i. Enterprise
ii. Operational
iii. Data Mart
(i) Enterprise Data Warehouse
An enterprise data warehouse provides a central repository for decision support throughout
the enterprise. It is a central place where all business information from different
sources and applications is made available. Once stored, it can be used for
analysis by all the people across the organization. The goal of the enterprise data
warehouse is to provide a complete overview of any particular object in the data model.
(ii) Operational Data Store
An operational data store has a sizable, enterprise-wide scope, but unlike the
enterprise warehouse, the data is refreshed in near real time and used for routine
commercial activity. It assists in obtaining data straight from the database and also
supports transaction processing. The data present in the Operational Data
Store can be scrubbed, and any duplication present can be reviewed and
fixed by examining the corresponding business rules.
(iii) Data Mart
A data mart is a subset of the data warehouse that supports a specific
region, business unit, or business function. A data mart focuses on storing data for a
particular functional area, and it contains a subset of the data held in the data warehouse.
Data marts help in improving user response times and also reduce the volume of data
to be scanned for analysis, which makes it easier to produce reports. More on Data
Marts can be studied in the next Unit.

1.11 POPULAR DATA WAREHOUSE PLATFORMS


A data warehouse is a critical database for supporting data analysis and acts as a
conduit between analytical tools and operational data stores. The most popular data
warehousing solutions include a range of useful features for data management and
consolidation.
You can use them to extract/curate data from a range of environments, transform
data and remove duplicates, and ensure consistency in your analytics.
Google BigQuery
BigQuery is a cost-effective data warehousing tool with built-in machine learning
capabilities. You can integrate it with Cloud ML and TensorFlow to create powerful
AI models. It can also execute queries on petabytes of data for real-time analytics.
This scalable and serverless cloud data warehouse is ideal for companies that want
to keep costs low. If you need a quick way to make informed decisions through data
analysis, BigQuery is one of the solutions.

AWS Redshift
Redshift is a cloud-based data warehousing tool for enterprises. The platform can
process petabytes of data quite fast. That's why it’s suitable for high-speed data
analytics. It also supports automatic concurrency scaling. The automation increases
or decreases query processing resources to match workload demand.
Although tooling provided by Amazon reduces the need to have a database
administrator full time, it does not eliminate the need for one. Amazon Redshift is
known to have issues with handling storage efficiently in an environment prone to
frequent deletes.
Snowflake
Snowflake is a data warehousing solution that offers a variety of options for public
cloud technology. With Snowflake, you can make your business more data-driven.
You may use Snowflake to set up an enterprise-grade cloud data warehouse. With
Snowflake, you can analyze data from various unstructured and structured sources.
However, Snowflake is dependent on Azure, Amazon Web Services (AWS), or
Google Cloud for its underlying infrastructure. Support can be a problem whenever
one of those cloud providers has an outage.
Microsoft Azure Synapse
Microsoft Azure is a robust platform for data management, analytics, integration,
and more, with solutions spanning AI, blockchain, and more than a dozen unique
databases for varying use cases. Among them is Azure Synapse, formerly known
as Azure SQL Data Warehouse, a platform built for analytics, providing you the
ability to query data using either serverless or provisioned resources at scale.
Azure Synapse brings together the two worlds of data warehousing and analytics
with a unified experience to ingest, prepare, manage, and serve data for immediate
BI and machine learning. The broader Azure platform includes thousands of tools,
including others that interface with the various Azure databases.

1.12 SUMMARY
In this unit you have studied about the evolution, characteristics, benefits and
applications of Data Warehouse.
Operational database systems provide day-to-day information, but they cannot easily
be used for strategic decision-making. The Data Warehouse is a concept designed to
aid strategic information needs. A Data Warehouse allows people to make decisions and
provides flexible, convenient and interactive sources of strategic intelligence. A
data warehouse combines several technologies because it collects data from various
operational data base systems and external sources such as magazines, newspapers
and reports from the same industry, removes contradictions, transforms the data
and then stores them in formats suited to easy access for decision-making purposes.
The defining characteristics of the data warehouse are: Subject oriented, integrated,
time-variant, and non-volatile.
Data warehouses are meant to be used by executives, managers, and other people
at higher managerial levels who may not have much technical expertise in handling
the databases.
Advantages of data warehouses include better decisions, increased productivity,
lower operational costs, enhanced asset and liability management, and better CRM.
1.13 SOLUTIONS/ANSWERS
Check Your Progress 1
1) Data Warehousing (DW) is a process for collecting and managing data
from diverse sources to provide meaningful insights into the business. A
Data Warehouse is typically used to connect and analyze heterogeneous
sources of business data. The data warehouse is the centerpiece of the BI
system built for data analysis and reporting.
It is an amalgam of technologies and components which helps to use data
strategically. Rather than supporting transaction processing, it is an organized
collection of a vast amount of information, configured by a company for querying
and analysis. It is a process of transforming data into information and making
it available to users in a timely way so that it can make a difference.
The decision-support repository (the Data Warehouse) is managed independently
of the operational infrastructure of the organization. The data warehouse,
however, is not a product but rather an environment. It is an organizational
framework of an information system that provides users with current and
historical decision-support information that is difficult to access or present
from the conventional operational data store.
Data warehouse platforms also sort data on a variety of subjects, such as customers,
products or business activities. Data warehousing is a tool that companies
increasingly use for business intelligence, because it helps to:
• Make uniformity possible. All data gathered and shared with decision
makers worldwide should be in a uniform format. Standardization of
data from various sources reduces the risk of misinterpretation and improves
the overall accuracy of interpretation.
• Take better business decisions. Successful entrepreneurs have a thorough
understanding of data and are good at predicting future trends. The data
warehouse helps users access various data sets with speed and efficiency.
• Evaluate past performance. Data warehouse platforms allow companies to access
their business's past history and evaluate ideas and projects. This gives managers
an idea of how they can improve their sales and management practices.
2). Following are the four main characteristics of a data warehouse:
i) Subject oriented
A data warehouse is subject-oriented, as it provides information on a topic rather
than the ongoing operations of organizations. Such issues may be inventory,
promotion, storage, etc. A data warehouse never concentrates on current processes
alone; instead, it emphasizes modeling and analyzing data for decision making.
It also provides a simple and succinct description of the particular subject by
excluding details that would not be useful in supporting the decision process.

(ii) Integrated
Integration in Data Warehouse means establishing a standard unit of measurement
from the different databases for all the similar data. The data must also get stored in
a simple and universally acceptable manner within the Data Warehouse. Through
combining data from various sources such as a mainframe, relational databases, flat
files, etc., a data warehouse is created. It must also keep the naming conventions,
format, and coding consistent. Such an application assists in robust data analysis.
Consistency must be maintained in naming conventions, measurements of
characteristics, specification of encoding, etc.
(iii) Time-variant
Compared to operational systems, the time horizon for the data warehouse is much
longer, and it provides historical information. It contains a temporal element, either
explicitly or implicitly. One place where Data Warehouse data shows its time variance
is in the record key structure: each primary key contained in the DW should have an
element of time, either implicitly or explicitly, such as the day, the week or the month.
(iv) Non-volatile
Also, the data warehouse is non-volatile, meaning that prior data is not erased
when new data is entered into it. Data is read-only and is only refreshed at regular
intervals. This also assists in analyzing historical data and in understanding what
happened and when. Transaction processing, recovery, and concurrency control
mechanisms are not required. In the Data Warehouse environment, activities such as
deleting, updating, and inserting that are performed in an operational application
environment are omitted.
Check Your Progress 2
1) Data Warehouse systems are segregated from production databases so that
they aren't intermingled and cause conflicts.
• There is a database available for tasks such as searching records,
indexing, and digital archiving. Data warehouse queries, in contrast,
are often complex due to their varied nature.
• It is possible to manage multiple transactions simultaneously
through business databases. Concurrency control and recovery
mechanisms are needed to ensure that operational databases remain
robust and consistent.
• Operational database queries allow both reading and modification
operations, whilst only read access to the stored information is required
for OLAP queries.
• An operational database maintains current information. In contrast,
historical data is kept in a warehouse.
2) A database stores the current data required to power an application. A data
warehouse stores current and historical data from one or more systems in
a predefined and fixed schema, which allows business analysts and data
scientists to easily analyze the data. The table below summarizes differences
between databases and data warehouses:

Table 2: Database Vs Data Warehouse

• Workloads: a database handles operational and transactional workloads; a data warehouse handles analytical workloads.
• Data Type: a database stores structured or semi-structured data; a data warehouse stores structured and/or semi-structured data.
• Schema Flexibility: a database has a rigid or flexible schema depending on the database type; a data warehouse has a pre-defined and fixed schema definition for ingest (schema on write and read).
• Data Freshness: a database holds real-time data; a data warehouse may not be up-to-date, depending on the frequency of the ETL processes.
• Users: databases serve application developers; data warehouses serve business analysts and data scientists.
• Pros: a database offers fast queries for storing and updating data; in a data warehouse, the fixed schema makes working with the data easy for business analysts.
• Cons: a database may have limited analytics capabilities; a data warehouse can be difficult to design and evolve the schema for, and scaling compute may require unnecessary scaling of storage because the two are tightly coupled.

1.14 FURTHER READINGS


1. William H. Inmon, Building the Data Warehouse, 4th Edition, Wiley, 2005.
2. Paulraj Ponnaiah, Data Warehousing Fundamentals, Wiley Student Edition.
3. Reema Thareja, Data Warehousing, Oxford University Press, 2011.

UNIT 2 DATA WAREHOUSE ARCHITECTURE
Structure
2.0 Introduction
2.1 Objectives
2.2 Data Warehouse Architecture and its Types
2.2.1 Types of Data Warehouse Architectures
2.3 Components of Data Warehouse Architecture
2.4 Layers of Data Warehouse Architecture
2.4.1 Best Practices for Data Warehouse Architecture
2.5 Data Marts
2.5.1 Data Mart Vs Data Warehouse
2.6 Benefits of Data Marts
2.7 Types of Data Marts
2.8 Structure of a Data Mart
2.9 Designing the Data Marts
2.10 Limitations with Data Marts
2.11 Summary
2.12 Solutions / Answers
2.13 Further Readings

2.0 INTRODUCTION
In the previous unit, we studied data warehousing and related topics.
Despite numerous advancements over the last five years in the arena of Big
Data, cloud computing, predictive analytics, and information technologies, data
warehouses have gained even more significance. For the success of any data warehouse,
its architecture plays an important role. For three decades, the data warehouse
architecture has been a pillar of corporate data ecosystems.
This unit presents various topics, including the basic concept of data warehouse
architecture, its types, the significant components and layers of data warehouse
architecture, and data marts and their design.

2.1 OBJECTIVES
After going through this unit, you shall be able to:
• understand the purpose of data warehouse architecture;
• describe the process of storing the data in a data warehouse;
• list and discuss the various types of data warehouse architectures;
• discuss various components and layers of data warehouse architecture;
• summarize the functionality of data marts, their benefits and various
types; and
• know the ways of structuring and designing the data marts.
2.2 DATA WAREHOUSE ARCHITECTURE AND ITS TYPES
Data warehouse architecture is the design of an organization's data storage framework.
It takes information from raw data sets and stores it in a structured and easily
digestible format.
A data warehouse architecture plays a vital role in the data enterprise: databases
assist in storing and processing data, while data warehouses help in analyzing that
data.
Data warehousing is a process of storing a large amount of data by a business or
organization. The data warehouse is designed to perform large, complex analytical
queries on large multi-dimensional datasets in a straightforward manner. Data
warehouses extract data from different sources, which are in different formats,
convert it into a uniform format, and place the data in the Data Warehouse.
2.2.1 Types of Data Warehouse Architectures
Data warehouse architecture defines the arrangement of the data in different
databases. As the data must be organized and cleansed to be valuable, a modern
data warehouse structure identifies the most effective technique of extracting
information from raw data.
Using a dimensional model, the raw data in the staging area is extracted and converted
into a simple consumable warehousing structure to deliver valuable business
intelligence. When designing a data warehouse, there are three different types of
models to consider, based on the number of tiers the architecture has.
(i) Single-tier data warehouse architecture
(ii) Two-tier data warehouse architecture
(iii) Three-tier data warehouse architecture
The details of each of the architecture are given below:
(i) Single-tier data warehouse architecture
The single-tier architecture (Figure 1) is not a frequently practiced approach. The
main goal of having such architecture is to remove redundancy by minimizing the
amount of data stored. Its primary disadvantage is that it doesn’t have a component
that separates analytical and transactional processing.

Figure 1: Single Tier Data Warehouse Architecture


(ii) Two-tier data warehouse architecture
The two-tier architecture (Figure 2) includes a staging area for all data sources,
before the data warehouse layer. By adding a staging area between the sources and
the storage repository, you ensure all data loaded into the warehouse is cleansed
and in the appropriate format.

Figure 2: Two-Tier Data Warehouse Architecture

(iii) Three-tier data warehouse architecture


The three-tier approach (Figure 3) is the most widely used architecture for data
warehouse systems.
Essentially, it consists of three tiers:
1. The bottom tier is the database of the warehouse, where the cleansed and
transformed data is loaded.
2. The middle tier is the application layer giving an abstracted view of the
database. It arranges the data to make it more suitable for analysis. This is
done with an OLAP server, implemented using the ROLAP or MOLAP
model.
3. The top-tier is where the user accesses and interacts with the data. It
represents the front-end client layer. You can use reporting, query, analysis, or data mining tools.

Figure 3: Three-Tier Data Warehouse Architecture

Figure 4 illustrates the complete data warehouse architecture with three tiers:

Figure 4: Three-Tier Data Warehouse Architecture

2.2.2 Cloud-based Data Warehouse Architecture


Cloud-based data warehouse architecture is relatively new when compared to legacy options. In this architecture, the actual data warehouses are accessed through the cloud. There are several cloud-based data warehouse options, each of which has a different architecture for the same benefits of integrating, analyzing, and acting on data from different sources. The differences between a cloud-based data warehouse approach and a traditional approach include:
• Up-front costs: The different components required for traditional, on-
premises data warehouses mandate pricey up-front expenses. Since the
components of cloud architecture are accessed through the cloud, these
expenses don’t apply.
• Ongoing costs: While businesses with on-prem data warehouses must deal
with upgrade and maintenance costs, the cloud offers a low, pay-as-you-go
model.
• Speed: Cloud-based data warehouse architecture is substantially speedier
than on-premises options, partly due to the use of ELT — which is an
uncommon process for on-premises counterparts.
• Flexibility: Cloud data warehouses are designed to account for the variety
of formats and structures found in big data. Traditional relational options
are designed simply to integrate similarly structured data.
• Scale: The elastic resources of the cloud make it ideal for the scale required
of big datasets. Additionally, cloud-based data warehousing options can
also scale down as needed, which is difficult to do with other approaches.
Cloud-based platforms make it possible to create, share, and store massive data sets
with ease, paving the way for more efficient and effective data access and analysis.
Cloud systems are built for sustainable business growth, with many modern
Software-as-a-Service (SaaS) providers separating data storage from computing to
improve scalability when querying data.
Some of the more notable cloud data warehouses in the market include Amazon Redshift, Google BigQuery, Snowflake, and Microsoft Azure SQL Data Warehouse.

Now, let’s learn about the major components of a data warehouse and how they
help build and scale a data warehouse in the next section.

2.3 COMPONENTS OF DATA WAREHOUSE ARCHITECTURE
A data warehouse design consists of six main components:
• Data Warehouse Database
• ETL
• Metadata
• Data Warehouse Access Tools
• Data Warehouse Bus
• Data Warehouse Reporting Layer
The details of all the components are given below:
2.3.1 Data Warehouse Database
The central component of a DW architecture is the data warehouse database that stores all enterprise data and makes it manageable for reporting. Obviously, this means you need to choose which kind of database you'll use to store data in your warehouse.
The following are the four database types that you can use:
• Typical relational databases are the row-centered databases you perhaps use
on an everyday basis —for example, Microsoft SQL Server, SAP, Oracle,
and IBM DB2.
• Analytics databases are precisely developed for data storage to sustain and
manage analytics, such as Teradata and Greenplum.
• Data warehouse appliances aren't exactly a kind of storage database, but several vendors now offer appliances that combine software for data management with hardware for storing data. For example, SAP HANA, Oracle Exadata, and IBM Netezza.
• Cloud-based databases can be hosted and retrieved on the cloud so that you
don’t have to procure any hardware to set up your data warehouse—for
example, Amazon Redshift, Google BigQuery, and Microsoft Azure SQL.
2.3.2 Extraction, Transformation, and Loading Tools (ETL)
ETL tools are central components of enterprise data warehouse architecture.
These tools help extract data from different sources, transform it into a suitable
arrangement, and load it into a data warehouse.
The ETL tool you choose will determine:
• The time expended in data extraction
• Approaches to extracting data
• The kind of transformations applied and the simplicity of doing so
• Business rule definitions for data validation and cleansing to improve end-product analytics
• The handling of missing data
• How information is distributed from the central repository to your BI applications
More on ETL can be studied in Unit 4. A minimal sketch of an ETL flow is given below.
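To make the extract, transform and load flow concrete, the following is a minimal, hypothetical sketch in Python. It is not the design of any particular ETL product: the source file orders.csv, the column names, and the SQLite file standing in for the warehouse database are all illustrative assumptions.

    # Minimal ETL sketch (illustrative only): extract rows from a CSV source,
    # apply simple cleansing/transformation rules, and load them into a warehouse table.
    import csv
    import sqlite3

    def extract(path):
        # Extract: read the raw rows from the operational source file.
        with open(path, newline="") as f:
            return list(csv.DictReader(f))

    def transform(rows):
        # Transform: standardise formats and drop rows that fail validation.
        clean = []
        for r in rows:
            if not r.get("order_id") or not r.get("amount"):
                continue                      # business rule: reject incomplete rows
            clean.append((int(r["order_id"]),
                          (r.get("customer") or "").strip().title(),
                          round(float(r["amount"]), 2)))
        return clean

    def load(rows, db="warehouse.db"):
        # Load: append the cleansed rows into the warehouse fact table.
        con = sqlite3.connect(db)
        con.execute("CREATE TABLE IF NOT EXISTS fact_orders "
                    "(order_id INTEGER, customer TEXT, amount REAL)")
        con.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", rows)
        con.commit()
        con.close()

    if __name__ == "__main__":
        load(transform(extract("orders.csv")))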
2.3.3 Metadata
Before we delve into the different types of metadata, we first need to understand what metadata is. In the data warehouse architecture, metadata describes the data warehouse database and offers a framework for data. It helps in constructing, preserving, handling, and making use of the data warehouse.
There are two types of metadata in a data warehouse:
• Technical Metadata comprises information that can be used by developers
and managers when executing warehouse development and administration
tasks.
• Business Metadata comprises information that offers an easily understandable
standpoint of the data stored in the warehouse.
Metadata plays an important role for businesses and the technical teams to
understand the data present in the warehouse and convert it into information.
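As a small illustration of the two kinds of metadata, the sketch below records both technical and business metadata for a single warehouse column in Python; every name and value in it is a made-up example, not a prescribed format.

    # Illustrative metadata entry for one warehouse column (all names are hypothetical).
    column_metadata = {
        "technical": {                 # used by developers and administrators
            "table": "fact_sales",
            "column": "dollars_sold",
            "data_type": "DECIMAL(12,2)",
            "source_system": "orders_oltp",
            "load_job": "nightly_etl",
        },
        "business": {                  # an easily understandable business view
            "name": "Sales Amount",
            "definition": "Total value of items sold, in dollars, before tax",
            "owner": "Sales department",
        },
    }

    print(column_metadata["business"]["definition"])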
2.3.4 Data Warehouse Access Tools
A data warehouse uses a database or group of databases as a foundation. Organizations generally cannot work with these databases directly without the use of tools unless they have database administrators available; however, that is not the case with all business units. This is why they use the assistance of several no-code data warehousing tools, such as:
• Query and reporting tools help users produce corporate reports for analysis
that can be in the form of spreadsheets, calculations, or interactive visuals.
• Application development tools help create tailored reports and present them
in interpretations intended for reporting purposes.
• Data mining tools for data warehousing automate the process of identifying patterns and relationships in huge quantities of data using cutting-edge statistical modeling methods.
• OLAP tools help construct a multi-dimensional data warehouse and allow
the analysis of enterprise data from numerous viewpoints.
2.3.5 Data Warehouse Bus
It defines the data flow within a data warehousing bus architecture and includes a
data mart. A data mart is an access level that allows users to transfer data. It is also
used for partitioning data that is produced for a particular user group.
2.3.6 Data Warehouse Reporting Layer
The reporting layer in the data warehouse allows end-users to access the BI interface or BI database architecture. The purpose of the reporting layer is to act as a dashboard for data visualization, create reports, and extract any required information.
Constructing a data warehouse depends primarily on the particular business, but every data warehouse architecture has four layers. Let us study them in the following section.

2.4 LAYERS OF DATA WAREHOUSE ARCHITECTURE


In general, the data warehouse architecture can be divided into four layers. They are:
i. Data Source Layer
ii. Data Staging Layer
iii. Data Storage Layer
iv. Data Presentation Layer
Let us study the various layers and their functionality.
(i) Data source layer
The data source layer is the place where the original data, gathered from an assortment of internal and external sources, resides in a relational database. The following are examples of the data source layer:
• Operational Data — product information, inventory information, marketing information, or HR information
• Social Media Data — website hits, content popularity, contact page completions
• Third-party Data — demographic information, survey information, census information
While most data warehouses manage structured data, consideration should be given to the future use of unstructured data sources, for example voice recordings, scanned images, and unstructured text. These streams of data are significant repositories of information and should be taken into account when building up your warehouse.
(ii) Data Staging Layer
This layer resides between the data sources and the data warehouse. In this layer, information is extracted from various internal and external data sources. Since source data comes in various formats, the data extraction layer will use numerous technologies and tools to extract the necessary information. Once the extracted data has been loaded, it will be subjected to high-level quality checks. The final outcome will be clean and organized data that you will load into your data warehouse. The staging layer contains the following parts:
• Landing Database and Staging Area
The landing database stores the information retrieved from the data source. Before the data goes to the warehouse, the staging process performs stringent quality checks on it. Staging is a fundamental step in the architecture: poor information leads to inadequate data, and the result is poor business decision-making. The staging layer is also where you make adjustments to the business process to deal with unstructured information sources.
• Data Integration Tool
Extract, Transform and Load (ETL) tools are the data tools used to extract information from source systems, transform and prepare the information, and load it into the warehouse.
(iii) Data Storage Layer
This layer is where the data that was cleansed in the staging area is stored as a single central repository. Depending on your business and your warehouse architecture requirements, your data storage might be a data warehouse, a data mart (a data warehouse partially replicated for specific departments), or an Operational Data Store (ODS).
(iv) Data Presentation Layer
This is where users interact with the cleansed and organized data. This layer of the data architecture gives users the capacity to query the data for product or service insights, analyze the data to explore hypothetical business scenarios, and create automated or ad hoc reports.
You may use an OLAP or reporting tool with an easy-to-understand Graphical User Interface (GUI) to assist users with building their queries, performing analysis, or designing their reports.
2.4.1 Best Practices for Data Warehouse Architecture
Designing the data warehouse with the designated architecture is an art. Some of
the best practices are shown below:
• Create data warehouse models that are optimized for information retrieval, using dimensional, de-normalized, or hybrid approaches.
• Select a single approach for data warehouse designs such as the top-down
or the bottom-up approach and stick with it.
• Always cleanse and transform data using an ETL tool before loading the
data to the data warehouse.
• Create an automated data cleansing process where all data is uniformly
cleaned before loading.
• Allow sharing of metadata between different components of the data
warehouse for a smooth retrieval process.
• Always make sure that data is properly integrated and not just consolidated
when moving it from the data stores to the data warehouse. This would
require the 3NF normalization of data models.
• Monitor the performance and security. The information in the data warehouse
is valuable, though it must be readily accessible to provide value to the
organization. Monitor system usage carefully to ensure that performance
levels are high.
• Maintain the data quality standards, metadata, structure, and governance.
New sources of valuable data are becoming available routinely, but
they require consistent management as part of a data warehouse. Follow
procedures for data cleaning, defining metadata, and meeting governance
standards.
• Provide an agile architecture. As the corporate and business unit usage
increases, they will discover a wide range of data mart and warehouse needs.
A flexible platform will support them far better than a limited, restrictive
product.
• Automate the processes such as maintenance. In addition to adding value to
business intelligence, machine learning can automate data warehouse technical
management functions to maintain speed and reduce operating costs.
• Use the cloud strategically. Business units and departments have different
deployment needs. Use on-premise systems when required, and capitalize
on cloud data warehouses for scalability, reduced cost, and phone and tablet
access.

2.5 DATA MARTS


A data mart is a subset of a data warehouse focused on a particular line of business,
department, or subject area. Data marts make specific data available to a defined
group of users, which allows those users to quickly access critical insights without
wasting time searching through an entire data warehouse. For example, many
companies may have a data mart that aligns with a specific department in the
business, such as finance, sales, or marketing.
2.5.1 Data Mart Vs Data Warehouse
Data marts and data warehouses are both highly structured repositories where data
is stored and managed until it is needed. However, they differ in the scope of data
stored: data warehouses are built to serve as the central store of data for the entire
business, whereas a data mart fulfills the request of a specific division or business
function. Because a data warehouse contains data for the entire company, it is best
practice to have strictly control who can access it. Additionally, querying the data
you need in a data warehouse is an incredibly difficult task for the business. Thus,
the primary purpose of a data mart is to isolate—or partition—a smaller set of data
from a whole to provide easier data access for the end consumers.
A data mart can be created from an existing data warehouse—the top-down
approach—or from other sources, such as internal operational systems or external
data. Similar to a data warehouse, it is a relational database that stores transactional
data (time value, numerical order, reference to one or more object) in columns and
rows making it easy to organize and access.
On the other hand, separate business units may create their own data marts based
on their own data requirements. If business needs dictate, multiple data marts
can be merged together to create a single data warehouse. This is the bottom-up development approach.
In a nut-shell, following are the differences:
• Data mart is for a specific company department and normally a subset of an
enterprise-wide data warehouse.
• Data marts improve query speed with a smaller, more specialized set of
data.
• Data warehouses help make enterprise-wide strategic decisions, whereas data marts are for department-level, tactical decisions.
• A data warehouse includes many data sets and takes time to update, while data marts handle smaller, faster-changing data sets.
• A data warehouse implementation can take many years, whereas data marts are much smaller in scope and can be implemented in months.
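To illustrate the idea of partitioning a smaller set of data for one department, the following is a small, hypothetical Python/SQLite sketch of a dependent data mart carved out of a warehouse fact table; the table and column names are invented for the example and do not correspond to any specific product.

    # Sketch of a dependent data mart: copy only one department's slice of the
    # warehouse fact table into a smaller, department-level table.
    import sqlite3

    con = sqlite3.connect(":memory:")        # stands in for the warehouse database
    con.execute("CREATE TABLE fact_sales (sale_id INTEGER, department TEXT, "
                "region TEXT, amount REAL)")
    con.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
                    [(1, "marketing", "east", 120.0),
                     (2, "finance",   "west",  75.5),
                     (3, "marketing", "west",  60.0)])

    # The data mart holds only the subset relevant to one department, so
    # departmental users query a smaller and faster table.
    con.execute("CREATE TABLE marketing_mart AS "
                "SELECT sale_id, region, amount FROM fact_sales "
                "WHERE department = 'marketing'")

    for row in con.execute("SELECT * FROM marketing_mart"):
        print(row)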

2.6 BENEFITS OF DATA MARTS


Data marts are designed to meet the needs of specific groups by having a
comparatively narrow subject of data. And while a data mart can still contain
millions of records, its objective is to provide business users with the most relevant
data in the shortest amount of time.
With its smaller, focused design, a data mart has several benefits to the end user,
including the following:
• Cost-efficiency: There are many factors to consider when setting up a data
mart, such as the scope, integrations, and the process to extract, transform,
and load (ETL). However, a data mart typically only incurs a fraction of the
cost of a data warehouse.
• Simplified data access: Data marts only hold a small subset of data, so
users can quickly retrieve the data they need with less work than they could
when working with a broader data set from a data warehouse.
• Quicker access to insights: Insight gained from a data warehouse supports
strategic decision-making at the enterprise level, which impacts the entire
business. A data mart fuels business intelligence and analytics that guide
decisions at the department level. Teams can leverage focused data insights
with their specific goals in mind. As teams identify and extract valuable
data in a shorter space of time, the enterprise benefits from accelerated
business processes and higher productivity.
• Simpler data maintenance: A data warehouse holds a wealth of business
information, with scope for multiple lines of business. Data marts focus on
a single line, housing fewer than 100GB, which leads to less clutter and
easier maintenance.
• Easier and faster implementation: A data warehouse involves significant
implementation time, especially in a large enterprise, as it collects data from
a host of internal and external sources. On the other hand, you only need a
small subset of data when setting up a data mart, so implementation tends
to be more efficient and include less set-up time.

2.7 TYPES OF DATA MARTS


There are three types of data marts that differ based on their relationship to the data
warehouse and the respective data sources of each system.
• Dependent data marts are partitioned segments within an enterprise data
warehouse. This top-down approach begins with the storage of all business
data in one central location. The newly created data marts extract a defined
subset of the primary data whenever required for analysis.
• Independent data marts act as a standalone system that doesn't rely on a
28 data warehouse. Analysts can extract data on a particular subject or business
process from internal or external data sources, process it, and then store it in Data Warehouse
a data mart repository until the team needs it. Architecture

• Hybrid data marts combine data from existing data warehouses and other
operational sources. This unified approach leverages the speed and user-friendly interface of the independent (bottom-up) approach and also offers the enterprise-level integration of the dependent method.

2.8 STRUCTURE OF A DATA MART


A data mart is a subject-oriented relational database that stores transactional data
in rows and columns, which makes it easy to access, organize, and understand. As
it contains historical data, this structure makes it easier for an analyst to determine
data trends. Typical data fields include numerical order, time value, and references
to one or more objects.
Companies organize data marts in a multidimensional schema as a blueprint to
address the needs of the people using the databases for analytical tasks. The three main types of schema are:
Star
Star schema is a logical formation of tables in a multidimensional database that
resembles a star shape. In this blueprint, one fact table—a metric set that relates to
a specific business event or process—resides at the center of the star, surrounded
by several associated dimension tables.
There is no dependency between dimension tables, so a star schema requires fewer
joins when writing queries. This structure makes querying easier, so star schemas
are highly efficient for analysts who want to access and navigate large data sets.
Snowflake
A snowflake schema is a logical extension of a star schema, building out the
blueprint with additional dimension tables. The dimension tables are normalized to
protect data integrity and minimize data redundancy.
While this method requires less space to store dimension tables, it is a complex
structure that can be difficult to maintain. The main benefit of using snowflake
schema is the low demand for disk space, but the caveat is a negative impact on
performance due to the additional tables.
Data Vault
Data vault is a modern database modeling technique that enables IT professionals to
design agile enterprise data warehouses. This approach enforces a layered structure
and has been developed specifically to combat issues with agility, flexibility, and
scalability that arise when using the other schema models. Data vault eliminates
star schema's need for cleansing and streamlines the addition of new data sources
without any disruption to existing schema.

2.9 DESIGNING THE DATA MARTS


Data marts guide important business decisions at a departmental level. For example,
a marketing team may use data marts to analyze consumer behaviors, while sales
staff could use data marts to compile quarterly sales reports. As these tasks happen
within their respective departments, the teams don't need access to all enterprise data.
Typically, a data mart is created and managed by the specific business department
that intends to use it. The process for designing a data mart usually comprises the
following steps:
(i) Essential Requirements Gathering
The first step is to create a robust design. Some critical processes involved in this
phase include collecting the corporate and technical requirements, identifying data
sources, choosing a suitable data subset, and designing the logical layout (database
schema) and physical structure.
(ii) Build/Construct
The next step is to construct it. This includes creating the physical database and the
logical structures. In this phase, you’ll build the tables, fields, indexes, and access
controls.
(iii) Populate/Data Transfer
The next step is to populate the mart, which means transferring data into it. In this
phase, you can also set the frequency of data transfer, such as daily or weekly. This
usually involves extracting source information, cleaning and transforming the data,
and loading it into the departmental repository.
(iv) Data Access
In this step, the data loaded into the data mart is used in querying, generating reports,
graphs, and publishing. The main task involved in this phase is setting up a meta-
layer and translating database structures and item names into corporate expressions
so that non-technical operators can easily use the data mart. If necessary, you can also set up APIs and interfaces to simplify data access.
(v) Manage
The last step involves management and observation, which includes:
• Controlling ongoing user access.
• Optimization and refinement of the target system for improved performance.
• Addition and management of new data into the repository.
• Configuring recovery settings and ensuring system availability in the event
of failure.

2.10 LIMITATIONS WITH DATA MARTS


Prospective builders of data warehouses are frequently advised to “start small” with
a data mart and use that kernel to expand gradually into a full blown data warehouse.
This approach to warehousing generally leads to failed projects for several reasons.
Sometimes the new data mart is so successful that the configuration is overrun
by user demands. The databases grow too large too fast, response times become
unacceptably long, and user frustration leads to searching for other ways to get the
answers.
The more common reason for failure is that the data mart is immediately unsuccessful because it is designed in such a way that users are unable to retrieve
the sort of information they want and need to extract from the data. Databases
are highly denormalized to respond to a small set of canned queries; summaries,
rather than detail data, comprise the database so that fine-grained exploratory data
analysis is not possible; and support for ad hoc queries is either absent or so poor as
to discourage users from bothering with them.
The very factors that frequently defeat data mart projects are also the most
commonly recommended approaches to designing data marts and data warehouses
in the popular data warehousing literature:
• Denormalization (dimensional modeling)
• Storing aggregates at the expense of detail data
• Skewing performance toward a small, preselected set of queries at the
expense of all other exploratory analyses
 Check your Progress 1
1. Define data warehouse architecture.
……………………………………………………………………………
……………………………………………………………………………
……………………………………………………………………………
……………………………………………………………………………
……………………………………………………………………………...
2. What is the correct flow of the data warehouse architecture?
……………………………………………………………………………
……………………………………………………………………………
……………………………………………………………………………
……………………………………………………………………………
………………………………………………………………………….......
3. Mention some Data Mart Use Cases.
……………………………………………………………………………
……………………………………………………………………………
……………………………………………………………………………
……………………………………………………………………………
……………………………………………………………………………...

2.11 SUMMARY
Data warehouse architecture is the design and building blocks of the modern
data warehouse. In this unit we have studied the basic building blocks of the data
warehouse, data warehouse architecture, its types, architecture models, data marts,
designing of data marts and limitations.
In the next unit we will study Dimensional Modeling.

2.12 SOLUTIONS / ANSWERS
 Check Your Progress 1:
1. The method for defining the entire architecture of data communication
processing as well as the presentation that exists for end-clients is
the data warehouse architecture. Every data warehouse is different,
and each of them is characterized based on the standard vital
components.
In simple words, a data warehouse is an information system
that consists of cumulative and historical data from single
or multiple sources. The process of reporting and analysis of
data in the organizations is simplified with the help of different
data warehousing concepts. There are different approaches to
constructing a data warehouse architecture. Any approach is used
based on the requirements of the organizations.
2. On every operational database, there are a certain fixed number of
operations that have to be applied. There are different well-defined
techniques for delivering suitable solutions. Data warehousing
is found to be more effective when the correct flow of the data
warehouse architecture is completely followed.
The four different processes that contribute to a data warehouse
are extracting and loading the data, cleaning and transforming the
data, backing up and archiving the data, and carrying out the query
management process by directing them to the appropriate data
sources.
3. Data marts are used to solve specific organizational problems,
especially those that are unique to one department. Typical use
cases for a data mart include:
Focused Analytics
Analytics is perhaps the most common application of data marts. The
data in these repositories is entirely relevant to the requirements of
the business department, with no extraneous information, resulting
in faster and more accurate analysis. For example, financial analysts
will find it easier to work with a financial data mart, rather than
working with an entire data warehouse.
Fast Turnaround
Data marts are generally faster to develop than a data warehouse, as
the developers are working with fewer sources and a limited schema.
Data marts are ideal for data projects operating under challenging
time constraints.

Permission Management
Data marts can be a risk-free way to grant limited data access without
exposing the entire data warehouse. For example, dependent data
mart contains a segment of warehouse data, and users are only able
to view the contents of the mart. This prevents unauthorized access
and accidental writes.
Better Resource Management
Data marts are sometimes used where there is a disparity in resource
usage between different departments. For example, the logistics
department might perform a high volume of daily database actions,
which causes the marketing team’s analytics tools to run slow. By
providing each department with its own data mart, it’s easier to
allocate resources according to their needs.

2.13 FURTHER READINGS


1. William H. Inmon, Building the Data Warehouse, Wiley, 4th Edition, 2005.
2. Paulraj Ponnaiah, Data Warehousing Fundamentals, Wiley Student Edition, 2001.
3. Reema Thareja, Data Warehousing, Oxford University Press, 2011.

UNIT 3 DIMENSIONAL MODELING
Structure
3.0 Introduction
3.1 Objectives
3.2 Dimensional Modeling
3.2.1 Strengths of Dimensional Modeling
3.3 Identifying Facts and Dimensions
3.4 Star Schema
3.4.1 Features of Star Schema
3.5 Advantages and Disadvantages of Star Schema
3.6 Snowflake Schema
3.6.1 Features of Snowflake Schema
3.7 Advantages and Disadvantages of Snowflake Schema
3.7.1 Star Schema Vs Snowflake Schema
3.8 Fact Constellation Schema
3.8.1 Advantages and Disadvantages of Fact Constellation Schema
3.9 Aggregate Tables
3.10 Need for Building Aggregate Fact Tables
Limitations of Aggregate Fact Tables
3.11 Aggregate Fact Tables and Derived Dimension Tables
3.12 Summary
3.13 Solutions/Answers
3.14 Further Readings

3.0 INTRODUCTION
In the earlier unit, we studied the Data Warehouse Architecture and Data Marts. In this unit, let us focus on the modeling aspects. We will go through dimensional modeling, the star schema, the snowflake schema, aggregate tables and the fact constellation schema.

3.1 OBJECTIVES
After going through this unit, you shall be able to:
• understand the purpose of dimension modeling;
• identify the measures, facts, and dimensions;
• discuss the fact and dimension tables and their pros and cons;
• discuss the Star and Snowflake schemas;
• explore comparative analysis of star and snowflake schema;
• describe Aggregate facts, fact constellation, and
• discuss various examples of star and snowflake schema.
3.2 DIMENSIONAL MODELING
Dimensional modeling is a data model design adopted when building a data warehouse. Simply put, dimensional modeling reduces the response time of queries, unlike relational systems. The concept behind dimensional modeling is all about conceptual design. First, let us see an introduction to dimensional modeling and how it is different from a traditional data model design. A data model is a representation of how data is stored in a database, and it is usually a diagram of a few tables and the relationships that exist between them. Dimensional modeling is designed to read, summarize and compute numeric data from a data warehouse. A data warehouse is an example of a system that requires a small number of large tables, because many users use the application to read a lot of data: a characteristic of a data warehouse is that data is written once and read many times over, so it is the read operation that is dominant. Now consider a data warehouse containing customer-related information in a single table. This makes it a lot easier for analytics, for example simply counting the number of customers by country; this use of tables in the data warehouse simplifies query processing.
Dimensional modeling populates data in a cube as a logical representation with OLAP data management. The concept was developed by Ralph Kimball. It has "fact" and "dimension" as its two important constructs. The transaction record is divided into either "facts", which consist of numerical business transaction data, or "dimensions", which are the reference information that gives context to the facts. More detail about facts and dimensions is given in the subsequent sections.
The main objective of dimensional modeling is to provide an easy architecture for the end user to write queries, and to reduce the number of relationships between the tables and dimensions, hence providing efficient query handling.
The following are the steps in dimensional modeling, as shown in Figure 1.
1. Identify Business Process
2. Identify Grain (level of detail)
3. Identify dimensions and attributes
4. Build Schema
The model should describe the Why, How much, When/Where/Who and What of
your business process.


Figure 1: Steps in Dimension Modeling

Step 1: Identify the Business Objectives


Selection of the right business process to build a data warehouse and identifying the
business objectives is the first step in dimensional modeling. This is a very important step; otherwise, it can lead to repeated work and software defects.
Step 2: Identifying Granularity
The grain literally means the finest level of detail of the business problem. This step decomposes the large and complex problem into the lowest level of information. For example, if the data is kept month-wise, the table would contain details of all the months in a year. The grain depends on the reports to be submitted to the management, and it affects the size of the data warehouse.
Step 3: Identifying Dimensions and attributes
The dimensions of the data warehouse can be understood as the entities of the database, like items, products, date, stocks, time, etc. The identification of the primary keys and the foreign key specifications is also described here.
Step 4: Build the Schema
The database structure, or the arrangement of columns in the database tables, decides the schema. There are various popular schemas, like the star, snowflake and fact constellation schemas. Summarizing: from the selection of the business process to identifying the finest level of detail of the business transactions, identifying the significant dimensions and attributes helps to build the schema.
3.2.1 Strengths of Dimensional Modeling
Following are some of the strengths of Dimensional Modeling:
• It provides a simple architecture or schema that various stakeholders, from warehouse designers to business clients, can understand and handle.
• It reduces the number of relationships between different data elements.
• It promotes data quality by enforcing foreign key constraints as a form of referential integrity check on the data warehouse. Dimensional modeling helps the database administrators maintain the reliability of the data.
• The aggregate functions used in the schemas optimize the performance of queries posted by customers. Since the data warehouse size keeps increasing, optimization becomes a concern with this growth, and dimensional modeling makes it easier.

3.3 IDENTIFYING FACTS AND DIMENSIONS


We have studied the steps of dimensional modeling in the previous section. The last step described is to build the schema. So, let us see the elementary building blocks of a schema.
Facts and Fact table: A fact is an event. It is a measure which represents business items or transactions of items having association and context data. The fact table contains the primary keys of all the dimension tables used in the business process, which act as foreign keys in the fact table. It also holds the measures on which aggregate functions are computed for the business process. A measure is a numeric attribute of a fact, representing the performance or behavior of the business relative to the dimensions. The number of columns in the fact table is less than in a dimension table, and it is in a more normalized form.
Dimensions and Dimension table: A dimension table is a collection of data which describes one business dimension. Dimensions provide the contextual background for the facts, and they are the framework over which OLAP is performed. Dimension tables establish the context of the facts; the table stores fields that describe the facts. The data in the table is in denormalized form, so it contains a large number of columns compared to the fact table. The attributes in a dimension table are used as row and column headings in a document or query results display.
Example: In a student registration case study, the registration of a student for any particular course can have attributes like student_id, course_id, program_id, date_of_registration and fee_id in the fact table. The course summary can have the course name, duration of the course, etc. Student information can contain the personal details about the student, like name, address, contact details, etc.
Student Registration
Fact Table: (student_id, course_id, program_id, date_of_registration, fee_id)
Measure: Sum(Fee_amount)
Dimension Tables: Student_details, Course_details, Program_details, Fee_details, Date
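A minimal sketch of this registration schema in Python with SQLite is shown below; the columns beyond those named above (for example, the name and address fields) are illustrative assumptions, and the date of registration is modelled here through a simple date dimension key.

    # Sketch of the student-registration fact and dimension tables
    # (extra columns are illustrative; only the layout matters here).
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE student_details (student_id INTEGER PRIMARY KEY, name TEXT, address TEXT);
    CREATE TABLE course_details  (course_id  INTEGER PRIMARY KEY, course_name TEXT, duration TEXT);
    CREATE TABLE program_details (program_id INTEGER PRIMARY KEY, program_name TEXT);
    CREATE TABLE fee_details     (fee_id     INTEGER PRIMARY KEY, fee_amount REAL);
    CREATE TABLE date_dim        (date_id    INTEGER PRIMARY KEY, day INTEGER, month INTEGER, year INTEGER);

    -- Fact table: one row per registration, holding foreign keys to every dimension.
    CREATE TABLE fact_registration (
        student_id INTEGER REFERENCES student_details(student_id),
        course_id  INTEGER REFERENCES course_details(course_id),
        program_id INTEGER REFERENCES program_details(program_id),
        date_of_registration INTEGER REFERENCES date_dim(date_id),
        fee_id     INTEGER REFERENCES fee_details(fee_id)
    );
    """)

    # Measure: total fee collected per year, computed by joining the fact table
    # to the fee and date dimensions.
    query = """
    SELECT d.year, SUM(f.fee_amount) AS total_fee
    FROM fact_registration r
    JOIN fee_details f ON r.fee_id = f.fee_id
    JOIN date_dim d    ON r.date_of_registration = d.date_id
    GROUP BY d.year;
    """
    print(con.execute(query).fetchall())     # empty until registrations are loaded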

3.4 STAR SCHEMA


There are three basic popular models which are used for dimensional modeling:
• Star Model
• Snowflake Model
• Fact Constellation Schema
Star Schema: It represents the multidimensional model. In this model the data is organized into facts and dimensions. The star model is the underlying structure for a dimensional model. It has one broad central table (fact table) and a set of smaller tables (dimensions) arranged in a star design. This design is shown logically in Figure 2 below.

Figure 2 : Star Schema

3.4.1 Features of Star Schema


• The data is stored in denormalized form.
• It provides quick query responses.
• The star schema is flexible and can be changed or extended easily.
• It reduces the complexity of metadata for developers and end users.

3.5 ADVANTAGES AND DISADVANTAGES OF STAR SCHEMA
3.5.1 Advantages of Star Schema
Star schemas are easy for end users and applications to understand and navigate.
With a well-designed schema, users can quickly analyze large, multidimensional
data sets. The main advantages of star schemas in a decision-support environment
are:
• Query performance
Because a star schema database has a small number of tables and clear join paths,
queries run faster than they do against an OLTP system. Small single-table queries,
usually of dimension tables, are almost instantaneous. Large join queries that
involve multiple tables take only seconds or minutes to run.
In a star schema database design, the dimensions are linked only through the central
fact table. When two dimension tables are used in a query, only one join path,
intersecting the fact table, exists between those two tables. This design feature
enforces accurate and consistent query results.

• Load performance and administration
Structural simplicity also reduces the time required to load large batches of data
into a star schema database. By defining facts and dimensions and separating them
into different tables, the impact of a load operation is reduced. Dimension tables
can be populated once and occasionally refreshed. You can add new facts regularly
and selectively by appending records to a fact table.
• Built-in referential integrity
A star schema has referential integrity built in when data is loaded. Referential
integrity is enforced because each record in a dimension table has a unique primary
key, and all keys in the fact tables are legitimate foreign keys drawn from the
dimension tables. A record in the fact table that is not related correctly to a dimension
cannot be given the correct key value to be retrieved.
• Easily understood
A star schema is easy to understand and navigate, with dimensions joined only
through the fact table. These joins are more significant to the end user, because they
represent the fundamental relationship between parts of the underlying business.
Users can also browse dimension table attributes before constructing a query.
3.5.2 Disadvantages of Star Schema
As mentioned before, improving read queries and analysis in a star schema could
involve certain challenges:
• Decreased data integrity: Because of the denormalized data structure, star
schemas do not enforce data integrity very well. Although star schemas use
countermeasures to prevent anomalies from developing, a simple insert or
update command can still cause data incongruities.
• Less capable of handling diverse and complex queries: Database designers build and optimize star schemas for specific analytical needs. As denormalized data sets, they work best with a relatively narrow set of simple queries. Comparatively, a normalized schema permits a far wider variety of more complex analytical queries.
• No many-to-many relationships: Because they offer a simple dimension schema, star schemas don't work well for many-to-many data relationships.
Example 1: Suppose a star schema is composed of a Sales fact table as shown in
Figure 3a and several dimension tables connected to it for Time, Branch, Item and
Location.
Fact Table
Sales is the Fact table.
Dimension Tables
The Time table has columns for day, month, quarter, year, etc.
The Item table has columns for item_key, item_name, brand, type and supplier_type.
The Branch table has columns for branch_key, branch_name and branch_type.
The Location table has columns of geographic data, including street, city, state, and country. Unit_Sold and Dollars_Sold are the Measures.

Figure 3a: Example of Star Schema

The measures may be unit_sold and dollars_sold.
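To make this example concrete, the following is a small, hypothetical Python/SQLite sketch over a reduced version of this schema (only the Item and Location dimensions are shown; the sample rows and the key columns such as item_key and location_key are illustrative). It computes dollars sold per brand and city by joining the dimensions through the central fact table.

    # Query sketch for the sales star schema: dollars sold per brand and city.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE item     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
    CREATE TABLE location (location_key INTEGER PRIMARY KEY, city TEXT, country TEXT);
    CREATE TABLE sales    (item_key INTEGER, location_key INTEGER,
                           units_sold INTEGER, dollars_sold REAL);
    INSERT INTO item     VALUES (1, 'Pen', 'Acme'), (2, 'Pad', 'Acme');
    INSERT INTO location VALUES (10, 'Delhi', 'India'), (11, 'Pune', 'India');
    INSERT INTO sales    VALUES (1, 10, 5, 50.0), (2, 10, 2, 40.0), (1, 11, 3, 30.0);
    """)

    # Each dimension is reached only through the central fact table, so a single
    # join path exists between any two dimensions, which is the defining star property.
    query = """
    SELECT i.brand, l.city, SUM(s.dollars_sold) AS total_dollars
    FROM sales s
    JOIN item     i ON s.item_key = i.item_key
    JOIN location l ON s.location_key = l.location_key
    GROUP BY i.brand, l.city;
    """
    for brand, city, total in con.execute(query):
        print(brand, city, total)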


Example 2:
The star schema works by dividing data into measurements and the “who, what,
where, when, why, and how” descriptive context. Broadly, these two groups are
facts and dimensions.
By doing this, the star schema methodology allows the business user to restructure
their transactional database into smaller tables that are easier to fit together. Fact
tables are then linked to their associated dimension tables with primary or foreign
key relationships. An example of this would be a quick grocery store purchase. The
amount you spent and how many items you bought would be considered a fact,
but what you bought, when you bought it and the specific grocery store’s location
would all be considered dimensions.
Once these two groups have been established, we can connect them by the unique
transaction number associated with your specific purchase. An important note is
that each fact, or measurement, will be associated with multiple dimensions. This
is what forms the star shape, the fact in the center, and dimensions drawing out
around it. Dimensions relating to the grocery store, the products you bought, and
descriptions about you as their customer will be carefully separated into their own tables with their attributes.
This example is modeled as shown below and star schema for this is depicted in
Figure 3b.
Fact Table
Sales is the Fact Table.

Dimension Tables
The Store table consists of columns like store_id, store_address, city, region, state
and country.
The Customer table has columns describing the customer, such as customer_id and customer_name.
Sales_Type includes sales_type_id and type_name columns.
Product table consists of product_id, product_name and product_type.
The Time table consists of columns like time_id, action_date, action_week, action_month, action_year and action_weekday.
Measures may be amount spent and no. of items bought.

Figure 3b: Example of Star Schema

 Check Your Progress 1


1) Discuss the characteristics of star schema?
……………………………………………………………………………
……………………………………………………………………………
……………………………………………………………………………
……………………………………………………………………………...
2) Draw a Star Schema for a marketing employee staying in New York city in the USA. He buys products and wants to compute the total products sold and the total sales value.
……………………………………………………………………………
……………………………………………………………………………
……………………………………………………………………………
……………………………………………………………………………...
3.6 SNOWFLAKE SCHEMA
The other popular modeling technique is the Snowflake Schema. You can think of the term "flakes" as chocolate flakes on a pastry or an ice-cream: these flakes add extra taste to the chocolate. Similarly, the snowflake schema is an extension of the star schema which adds more dimension tables to give more meaning to the logical view of the database. These additional tables are more normalized than in the star schema. The data is arranged so that the centralized fact table relates to multiple related dimension tables, and this can become more complex if the dimensions are more detailed and at multiple levels. In the conceptual hierarchy, a child table can have multiple parent tables. You must keep in mind that we are just extending, or flaking, the dimension tables, not the fact tables.
Snowflake Model
The snowflake model is the result of decomposing one or more of the dimensions. A snowflake schema in a data warehouse is a logical arrangement of tables in a multidimensional database such that the ER diagram resembles a snowflake shape. A snowflake schema is an extension of a star schema, and it adds additional dimension tables. The dimension tables are normalized, which splits data into additional tables.
In the following Snowflake Schema example, Country is further normalized into
an individual table.
3.6.1 Features of Snowflake Schema
Following are the important features of snowflake schema:
1. It has normalized tables.
2. It occupies less disk space.
3. It requires more lookup time, as many interconnected tables extend the dimensions.
Example
Figure 4 shows the snowflake schema for a case study of customers, sales, products and stores, in which the location-wise quantity sold and the number of items sold are calculated. The customer, product, date and store keys are saved in the fact table, where they act as foreign keys. You will observe that two aggregate functions can be applied, to calculate the quantity sold and the amount sold. Further, some dimensions are extended, to the type of customer and also to territory-wise store information. Note that the date has been expanded into date, month and year. This schema gives you more opportunity to handle queries in detail. A small sketch of such flaking is given after Figure 4.


Figure 4: Snowflake Schema
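The following is a small, hypothetical Python/SQLite sketch of such flaking: a store dimension is shown first in a denormalized, star-style form and then split so that the country attributes move into their own table, in the spirit of the Country example mentioned above. All table and column names are illustrative.

    # Sketch of "flaking" a dimension: the snowflake-style store dimension keeps
    # only a key to a separate country table instead of repeating country details.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    -- Star-style, denormalized dimension: country details repeated in every store row.
    CREATE TABLE dim_store_star (store_key INTEGER PRIMARY KEY, store_name TEXT,
                                 city TEXT, country_name TEXT, country_region TEXT);

    -- Snowflake-style: the country attributes move into their own normalized table.
    CREATE TABLE dim_country (country_key INTEGER PRIMARY KEY,
                              country_name TEXT, country_region TEXT);
    CREATE TABLE dim_store_snow (store_key INTEGER PRIMARY KEY, store_name TEXT,
                                 city TEXT,
                                 country_key INTEGER REFERENCES dim_country(country_key));
    """)

    # A query on the snowflake version needs one extra join to reach the country,
    # which is the price paid for the reduced redundancy.
    query = """
    SELECT c.country_name, COUNT(*) AS stores
    FROM dim_store_snow s
    JOIN dim_country c ON s.country_key = c.country_key
    GROUP BY c.country_name;
    """
    print(con.execute(query).fetchall())     # empty until store rows are inserted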

3.7 ADVANTAGES AND DISADVANTAGES OF SNOWFLAKE SCHEMA
Following are the advantages of Snowflake schema:
• A Snowflake schema occupies a much smaller amount of disk space
compared to the Star schema. Lesser disk space means more convenience
and less hassle.
• The snowflake schema offers some protection from various data integrity issues. Many people tend to prefer the snowflake schema because of how safe it is.
• Data is easy to maintain and more structured.
• Data quality is better than star schema.
Disadvantages of Snowflake Schema
• Complex data schemas: As you might imagine, snowflake schemas create
many levels of complexity while normalizing the attributes of a star schema.
This complexity results in more complicated source query joins. In offering
a more efficient way to store data, snowflake can result in performance
declines while browsing these complex joins. Still, processing technology
advancements have resulted in improved snowflake schema query
performance in recent years, which is one of the reasons why snowflake
schemas are rising in popularity.
• Slower at processing cube data: In a snowflake schema, the complex joins
result in slower cube data processing. The star schema is generally better
for cube data processing.
• Lower data integrity levels: While snowflake schemas offer greater normalization and fewer risks of data corruption after performing UPDATE and INSERT commands, they do not provide the level of transactional assurance that comes with a traditional, highly-normalized database structure. Therefore, when loading data into a snowflake schema, it's vital to be careful and double-check the quality of information post-loading.
3.7.1 Star Schema Vs Snowflake Schema
Following are the differences between the Star and Snowflake schemas.
• Normalized Dimension Tables: The dimension tables in a star schema are not normalized, so they may contain redundancies; the snowflake schema has normalized dimension tables.
• Queries: In a star schema, query execution is relatively faster as fewer joins are needed to form a query; in a snowflake schema, the execution of complex queries is slower, as many joins and foreign key relations are needed to form a query, so performance is affected.
• Performance: The star schema model has faster execution and response time; the snowflake schema has slower performance compared to the star schema.
• Storage Space: The star schema requires more storage space compared to the snowflake schema due to its unnormalized tables; snowflake schema tables are easy to maintain and save storage space due to normalized tables.
• Usage: The star schema is preferred when the dimension tables have fewer rows; if the dimension tables contain a large number of rows, the snowflake schema is preferred.
• Type of data warehouse: The star schema is suitable for 1:1 or 1:many relationships, such as data marts; the snowflake schema is used for complex relationships, such as many:many, in enterprise data warehouses.
• Dimension Tables: A star schema has a single table for each dimension; a snowflake schema may have more than one table for each dimension.

3.8 FACT CONSTELLATION SCHEMA


There is another schema for representing a multidimensional model. The term fact constellation is like a galaxy of the universe containing several stars. It is a collection of fact tables having one or more dimension tables in common, as shown in Figure 5 below. This logical representation is mainly used in designing complex database systems.

Figure 5: Fact Constellation Schema


In Figure 5 it can be observed that there are two fact tables, and the two dimension tables in the pink boxes are the common dimension tables connecting both star schemas.
For example, if we are designing a fact constellation schema for Placement and
Workshop in a University, consider:
Fact tables
Placement (Stud_roll, Company_id, TPO_id): needs to calculate the number of students eligible and the number of students placed.
Workshop (Stud_roll, Institute_id, TPO_id): needs to find out facts such as the number of students selected and the number of students who attended the workshop.
So, there are two fact tables namely, Placement and Workshop which are part of
two different star schemas having:
i) dimension tables – Company, Student and TPO in Star schema with fact
table Placement and
ii) dimension tables – Training Institute, Student and TPO in Star schema with
fact table Workshop.
Both star schemas have two dimension tables in common, hence forming a fact constellation or galaxy schema, as shown in Figure 6. A small sketch of this arrangement is given after the figure.

Figure 6: Fact Constellation of placement and workshop

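A small, hypothetical Python/SQLite sketch of this constellation is given below; the measure columns (placed, attended) and any other details beyond the keys named above are illustrative assumptions.

    # Sketch of the placement/workshop fact constellation: two fact tables
    # sharing the Student and TPO dimension tables.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE student (stud_roll INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE tpo     (tpo_id    INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE company (company_id INTEGER PRIMARY KEY, company_name TEXT);
    CREATE TABLE training_institute (institute_id INTEGER PRIMARY KEY, institute_name TEXT);

    -- Two fact tables, each the centre of its own star, sharing two dimensions.
    CREATE TABLE fact_placement (stud_roll INTEGER REFERENCES student(stud_roll),
                                 company_id INTEGER REFERENCES company(company_id),
                                 tpo_id INTEGER REFERENCES tpo(tpo_id),
                                 placed INTEGER);
    CREATE TABLE fact_workshop  (stud_roll INTEGER REFERENCES student(stud_roll),
                                 institute_id INTEGER REFERENCES training_institute(institute_id),
                                 tpo_id INTEGER REFERENCES tpo(tpo_id),
                                 attended INTEGER);
    """)

    # Because the two facts share the student dimension, a cross-fact question such
    # as "how many placed students also attended a workshop" can be answered by
    # joining the two fact tables through stud_roll.
    query = """
    SELECT COUNT(DISTINCT p.stud_roll)
    FROM fact_placement p
    JOIN fact_workshop w ON p.stud_roll = w.stud_roll
    WHERE p.placed = 1 AND w.attended = 1;
    """
    print(con.execute(query).fetchone())     # (0,) until fact rows are loaded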
3.8.1 Advantages and Disadvantages of Fact Constellation Schema


Advantage
This schema is more flexible and gives a wider perspective of the data warehouse system.
Disadvantage
As this schema connects two or more fact tables to form a constellation, the structure is complex to implement and maintain.

 Check Your Progress 2 Dimensional Modeling

1. Compare and contrast Star schema with Snowflake Schema?


……………………………………………………………………………
……………………………………………………………………………
……………………………………………………………………………...
2. Suppose that a data warehouse consists of dimensions time, doctor, ward
and patient, and the two measures count and charge, where charge is the
fee that a doctor charges a patient for a visit. Enumerate three classes of
schemes that are popularly used for modeling.
a) Draw a Star Schema diagram
b) Draw a Snowflake Schema diagram.
……………………………………………………………………………..
…………………………………………………………………………..…
……………………………………………………………………………...

3.9 AGGREGATE TABLES


In a data warehouse, the data is stored in a multidimensional cube. In the information technology industry, there are various tools available to process the queries posted on the data warehouse engine. These tools are called Business Intelligence (BI) tools; they help to answer complex queries and to take decisions. The word aggregate is very similar to the aggregation over relational tables that you must be familiar with. Aggregate fact tables roll up the basic fact tables of the schema to improve query processing, and the business tools transparently select the level of aggregation to improve query performance. Aggregate fact tables contain foreign keys referring to dimension tables.
Points to note about Aggregate tables:
1) They are also called summary tables.
2) They contain pre-computed query results over the data warehouse schema.
3) They reduce the dimensionality of the base fact tables.
4) They can be used to respond to queries over the dimensions that are saved.

3.10 NEED FOR BUILDING AGGREGATE FACT TABLES


Let us understand the need for building aggregate tables. Aggregate tables are also referred to as pre-computed tables holding partially summarized data.
• Simply put, in one word, it is about speed, or quick response to queries. You can understand an aggregate as an intermediate table which stores the results of queries on disk. It uses aggregate functionality.
For example, consider a company, ABC Corporation Limited, which takes orders online and has millions of customer transactions placing orders. The dimension tables for the company could be Customer, Product and Order_date, and the fact table, say Fact_Orders, maintains all the orders placed. To generate a report of monthly orders by product type and by a particular region, it needs aggregates, which are summary tables that can be obtained with a GROUP BY SQL query (a small sketch is given at the end of this section).
• An aggregate table occupies less space than the atomic fact tables, and querying it takes nearly half the time of general query processing.
• One of the more popular uses of aggregates is to adjust the granularity
of a dimension. When the granularity of a dimension is changed, the fact
table must be partially summarized to match the current grain of the new
dimension, resulting in the creation of new dimensional and fact tables that
fit this new grain standard.
• The Roll-up OLAP operation of the base fact tables generates aggregate
tables. Hence the query performance increases, as it reduces the number of rows that must be accessed to retrieve the data for a query.
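The following is the small sketch referred to above: a hypothetical Python/SQLite version of the ABC Corporation example, in which an aggregate (summary) table of monthly orders by product type and region is built once with GROUP BY and then queried instead of the atomic fact table. The table and column names, and the sample rows, are made up for illustration.

    # Sketch of building an aggregate (summary) fact table with GROUP BY.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE fact_orders (order_id INTEGER, order_month TEXT,
                              product_type TEXT, region TEXT, amount REAL);
    INSERT INTO fact_orders VALUES
      (1, '2024-01', 'electronics', 'north', 200.0),
      (2, '2024-01', 'electronics', 'north', 150.0),
      (3, '2024-02', 'clothing',    'south',  80.0);

    -- The aggregate table pre-computes the roll-up once, so monthly reports read
    -- far fewer rows than the atomic fact table would require.
    CREATE TABLE agg_orders_monthly AS
    SELECT order_month, product_type, region,
           COUNT(*)    AS order_count,
           SUM(amount) AS total_amount
    FROM fact_orders
    GROUP BY order_month, product_type, region;
    """)

    for row in con.execute("SELECT * FROM agg_orders_monthly"):
        print(row)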

3.11 AGGREGATE FACT TABLE AND DERIVED DIMENSION TABLES
Aggregate facts are produced by calculating measures from more atomic fact tables. These tables are built with computational SQL aggregate functions like AVG, MIN, MAX, COUNT, etc., together with GROUP BY. The aggregate fact tables produce summary statistics. Whenever speedy query handling is required, aggregate fact tables are the best option.
• Basically, aggregates allow you to store the intermediate results or pre-
calculate the subqueries or queries fired on a data warehouse by summing
data up to higher levels and storing them in a separate star.
• You can understand an aggregate fact table as a conformed copy of the fact table, as it should provide the same query result as the detailed fact table.
• Aggregate fact tables can be used in the case of large datasets or when there is a large number of queries. They reduce the response time of the queries fired by users or customers and are very useful in business intelligence application tools.
The levels at which facts are stored become especially important when you begin to
have complex queries with multiple facts in multiple tables that are stored at levels
different from one another, and when a reporting request involves still a different
level. You must be able to support fact reporting at the business levels which users
require. There is nothing wrong with enhancing an aggregate with new facts or deriving new dimensions. For measures, the only issue is whether the new measures are atomic in the context of the aggregate fact. If, however, the new measures are received at a lower grain, you would be better off creating a new atomic fact for those measures prior to incorporating summarized measures into the aggregate. This would allow the new measures to be used for other purposes without having to go back to the source.
Let's say we have a fact table: FactBillReciept has monthly transactions. There can Dimensional Modeling
be different types of transaction receipts during a month for each supplier. This
huge data would result in lot of calculations. So, we would build another aggregate
table which is derived of base table.
FactBillMonthReceipt: it contains aggregated receipts per month, per supplier (a sketch follows below). The problem is that it also needs additional foreign keys, such as supplier_status for the month. To handle this, we have the concept of derived tables, which contain additional measures and foreign keys that are not present in the base fact table.
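A minimal sketch of the aggregation step is given below. The column names (supplier_id, receipt_date, receipt_amount) are assumed for illustration; the additional keys and measures mentioned above would then be added to this derived table.

    -- Hypothetical monthly aggregate per supplier, derived from the base fact table
    CREATE TABLE FactBillMonthReceipt AS
    SELECT supplier_id,
           EXTRACT(YEAR FROM receipt_date)  AS receipt_year,
           EXTRACT(MONTH FROM receipt_date) AS receipt_month,
           SUM(receipt_amount)              AS total_receipt_amount,
           COUNT(*)                         AS receipt_count
    FROM FactBillReceipt
    GROUP BY supplier_id,
             EXTRACT(YEAR FROM receipt_date),
             EXTRACT(MONTH FROM receipt_date);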
Conformed Dimension
A conformed dimension is a dimension that is shared across multiple data marts or subject areas. An organization may use the same dimension table across different projects without making any changes to it.
Derived Tables
Derived tables are a significant addition to the data warehouse. They are used to create second-level data marts for cross-functional analysis.
Consolidated Fact Tables: a consolidated fact table holds data from different fact tables brought together to form a schema with a common grain.
For example, consider designing a Sales department data warehouse schema, assuming the following entities and their respective grains:
Sales: Employee, date, and product.
Budget: Department, financial year, quarter-wise.
Product can have various attributes such as product_size, product_category, etc. One thing to notice here is that the product attributes keep changing as per the requirements, but the product dimension remains the same; so it is better to keep Product as a separate dimension.
Let us design the tables and their grains.
Figure 7: Aggregate Tables and Derived Tables
Derived tables are very useful for putting less computational load on the data warehouse engine, as sketched below.
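As a hedged sketch of such a derived, consolidated table, the query below brings Sales and Budget to the common grain of department and quarter. The table and column names are assumed for illustration, and Fact_Sales_Quarterly is assumed to be the sales fact already rolled up from its (employee, date, product) grain to department and quarter.

    -- Hypothetical consolidated fact table at the common grain: department x quarter
    CREATE TABLE Fact_Sales_Vs_Budget AS
    SELECT s.department_id,
           s.fin_year,
           s.quarter,
           SUM(s.sales_amount)  AS total_sales,
           MAX(b.budget_amount) AS budget_amount
    FROM Fact_Sales_Quarterly s
    JOIN Fact_Budget b
      ON b.department_id = s.department_id
     AND b.fin_year      = s.fin_year
     AND b.quarter       = s.quarter
    GROUP BY s.department_id, s.fin_year, s.quarter;

Because the heavy summarization is done once when this table is loaded, reporting queries that compare sales against budget place very little load on the warehouse engine.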
Check Your Progress 3
1. Discuss the limitations of Aggregate Fact tables.
……………………………………………………………………………
……………………………………………………………………………
……………………………………………………………………………...
3.12 SUMMARY
This unit presented the basic design of a data warehouse, focusing on the various kinds of modeling and schemas. It explored the grains, facts, and dimensions of the schemas. It is important to know about dimensional modeling, as the appropriate modeling technique yields correct responses to queries.
Dimensional modeling is a way of structuring data that optimizes the design of a data warehouse for query retrieval operations. There are various schema designs; this unit discussed the star, snowflake, and fact constellation schemas. From denormalized to normalized, these schemas use dimension, fact, derived and aggregate fact tables. Every table has a purpose and is used for efficient design in terms of space and query handling. The unit discussed the pros and cons of each of these tables, and a number of examples were used to explain the design in different scenarios.
3.13 SOLUTIONS/ANSWERS
Check Your Progress 1:
1) Characteristics of Star Schema:
• Every dimension in a star schema is represented with only one dimension table.
• The dimension table should contain the set of attributes.
• The dimension table is joined to the fact table using a foreign key.
• The dimension tables are not joined to each other.
• The fact table contains keys and measures.
• The star schema is easy to understand and provides optimal disk usage.
• The dimension tables are not normalized. For instance, in the above figure, Country ID does not have a Country lookup table as an OLTP design would have.
• The schema is widely supported by BI Tools
2)
Figure 8: Star Schema
Check Your Progress 2:
1:
Star Schema:
• It is a logical arrangement of one fact table surrounded by other dimension tables, like a star.
• It requires a single SQL join to fetch the data.
• Simple database design; query response time is very low.
• The data is not normalized, so there is a high level of redundancy.
Snowflake Schema:
• It is a logical arrangement of one fact table with dimension tables, where the dimension tables are further normalized into other dimension tables.
• It requires many SQL joins to fetch the data.
• Complex database design; query response time is higher.
• The data is normalized, so there is a low level of redundancy.
2: a. Star Schema of Hospital Management
Figure 9: Star Schema of Hospital Management System
b. Snowflake Schema of Hospital Management
Figure 10: Snowflake Schema of Hospital Management System
Check Your Progress 3:
1.
Limitations of aggregate fact tables: Building an aggregate table takes a lot of time, because the rows of the base fact table must be scanned, and there are then more tables to manage. Computing the aggregates can be costly in terms of size; the size of the aggregates is decided using a greedy approach together with hashing techniques. If there are n dimensions in the table, then there can be 2^n possible aggregates (for example, with 3 dimensions there are 2^3 = 8 possible aggregates). The load on the data warehouse therefore becomes more complex.
3.14 FURTHER READINGS
• Building the Data Warehouse, William H. Inmon, Wiley, 4th Edition, 2005.
• Data Warehousing Fundamentals, Paulraj Ponnaiah, Wiley Student Edition
• Data Warehousing, Reema Thareja, Oxford University Press.
• Data Warehousing, Data Mining & OLAP, Alex Berson and Stephen J. Smith, Tata McGraw-Hill Edition, 2016.