Block 1
Data Warehousing and Data Mining
Indira Gandhi National Open University
School of Computer and Information Sciences (SOCIS)
SOCIS FACULTY
Prof. P. Venkata Suresh, Director, SOCIS, IGNOU
Prof. V.V. Subrahmanyam, SOCIS, IGNOU
Dr. Akshay Kumar, Associate Professor, SOCIS, IGNOU
Dr. Naveen Kumar, Associate Professor, SOCIS, IGNOU (on EOL)
Dr. M.P. Mishra, Associate Professor, SOCIS, IGNOU
Dr. Sudhansh Sharma, Assistant Professor, SOCIS, IGNOU
Dr. Manish Kumar, Assistant Professor, SOCIS, IGNOU
Print Production
Mr. Sanjay Aggarwal, Assistant Registrar (Publication), MPDD
April 2022
July, 2023
ISBN- 978-93-5568-774-6
All rights reserved. No part of this work may be reproduced in any form, by mimeograph or any other means, without
permission in writing from the Indira Gandhi National Open University.
Further information on the Indira Gandhi National Open University courses may be obtained from the University’s office at
Maidan Garhi, New Delhi-110068.
Printed and published on behalf of the Indira Gandhi National Open University, New Delhi by MPDD, IGNOU.
Laser Typeset by Raj Printers, A-9, Sector B-2, Tronica City, Loni (Gzb.)
Printed at: Rohan Pragya Printing And Packaging Pvt Ltd. H-76, Site-V, UPSIDC, Kasna
COURSE INTRODUCTION
This course is of 4 credits, divided into two parts: the first part (2 credits) covers
Data Warehousing and the second part (2 credits) covers Data Mining.
A data warehouse is a system that stores data from a company’s operational
databases as well as external sources. Data warehouse platforms are different from
operational databases because they store historical information, making it easier
for business leaders to analyze data over a specific period of time. Data warehouse
platforms also sort data based on different subject matter, such as customers,
products or business activities.
Many global corporations have turned to data warehousing to organize data that
streams in from corporate branches and operations centers around the world. It’s
essential for IT students to understand how data warehousing helps businesses
remain competitive in a quickly evolving global marketplace. Data warehousing
is an increasingly important business intelligence tool that enables historical
insights, ensures consistency, allows organizations to make better business decisions,
decreases costs, maximizes efficiency, increases the power and speed of data analytics,
provides a major competitive edge and increases sales to improve the bottom line.
It is necessary to choose adequate Data Mining algorithms to make a Data
Warehouse more useful. Data mining algorithms are used to transform data
into business information and thereby improve the decision-making process. Data
Mining is a set of methods for data analysis, created with the aim of finding
specific dependences, relations and rules in data and presenting them as new,
higher-quality information. Data Mining gives results that show
the interdependence and relations of data. These dependences are mainly based
on various mathematical and statistical relations. Data are collected from internal
databases and converted into various documents, reports, lists etc., which can be
further used in decision-making processes. After the data for analysis have been selected,
Data Mining is applied to find the appropriate rules of behavior and patterns. That
is the reason why Data Mining is also known as knowledge extraction, data
archeology or pattern analysis. Data mining helps to make smart market decisions,
run accurate campaigns, make predictions, and more. With the help of Data mining,
we can analyze customer behavior and gain insights into it. This leads to greater success
and a data-driven business.
The course is organized into 4 Blocks:
Block 1 covers the Introductory topics on Data Warehousing, Data Warehouse
architecture, Data Marts and Dimensional Modeling.
Block 2 covers the Extract, Transform and Load (ETL) aspects of Data
Warehousing, Online Analytical Processing and some Trends in Data Warehouse.
Block 3 covers the introductory topics related to Data Mining, Data Preprocessing
and Mining Frequent Patterns and Associations.
Block 4 covers the Classification, Clustering of Data Mining, Text and Web
Mining.
There is a lab component associated with this course (i.e., Section-2 Data Mining
Lab of MCSL-223 course).
BLOCK INTRODUCTION
The title of the Block is Data Warehouse Fundamentals and Architecture. The
objectives of this block are to help you understand the underlying concepts
of Data Warehousing, identify the components of the Data Warehouse Architecture,
know the difference between the Data Warehouse and Data Marts, understand
the Data Warehouse Development Life Cycle, and learn the dimensional
modeling techniques.
The block is organized into 3 units:
Unit 1 covers the fundamentals of data warehousing, its evolution, characteristics
of data warehousing, online transaction processing systems and applications of data
warehouses;
Unit 2 covers the data warehouse architecture, data marts and data warehouse
development life cycle; and
Unit 3 covers the introduction to dimensional modeling, identifying facts and
dimensions, star schema, snowflake schema and fact constellation schema.
UNIT 1 FUNDAMENTALS OF DATA WAREHOUSE
Structure
1.0 Introduction
1.1 Objectives
1.2 Evolution of Data Warehouse
1.3 Data Warehouse and its Need
1.3.1 Need for Data Warehouse
1.3.2 Benefits of Data Warehouse
1.4 Data Warehouse Design Approaches
1.4.1 Top-Down Approach
1.4.2 Bottom-Up Approach
1.5 Characteristics of a Data Warehouse
1.5.1 How Data Warehouse Works?
1.6 OLTP and OLAP
1.6.1 Online Transaction Processing (OLTP)
1.6.2 Online Analytical Processing (OLAP)
1.7 Data Granularity
1.8 Metadata and Data Warehousing
1.9 Data Warehouse Applications
1.10 Types of Data Warehouses
1.10.1 Enterprise Data Warehouse
1.10.2 Operational Data Store
1.10.3 Data Mart
1.11 Popular Data Warehouse Platforms
1.12 Summary
1.13 Solutions/Answers
1.14 Further Readings
1.0 INTRODUCTION
A database is a collection of information or data that is generally stored
electronically in a computer system. It makes the data easy to access, manage, modify,
update, monitor and organize. Data is stored in the tables of the database.
The process of consolidating data and analyzing it to obtain insights has
been around for centuries, but we only recently began referring to it as data
warehousing. An operational or transactional system is designed only for its
own functionality, and hence it can handle limited amounts of data for a limited
amount of time. Operational systems are not designed or architected for long-term
data retention, as historical data is of little to no importance to them. However,
to gain point-in-time visibility and understand the high-level operational aspects
of any business, historical data plays a vital role. With the emergence of mature
Relational Database Management Systems (RDBMS) in the 1980s, engineers across
various enterprises started architecting ways to copy data from the transactional
systems to separate databases, via manual or automated mechanisms, and use
it for reporting and analysis. While the data in the transactional systems would get
purged periodically, this would not be the case in these analytical repositories, as their
purpose was to store as much data as possible; hence the term “data warehouse”
came into existence, because these repositories would become a warehouse for the
data.
Data Warehousing (DW) as a practice became very prominent during the late 1980s, when
enterprises started building decision support systems that were mainly responsible
for supporting reporting. With the rapid advancement in the performance of
relational databases during the late 1990s and early 2000s, Data Warehousing became
a core part of the Information Technology group across large enterprises. In fact,
vendors such as Netezza and Teradata started offering customized hardware
to manage data warehouse architectures within state-of-the-art machines. Data
Warehousing has been at the top of the list of priorities since the mid-2000s. The data
supply chain ecosystem has grown exponentially, and so has the
way enterprises architect their data warehouses.
A well architected data warehouse serves as an extended vision for the enterprise
where multiple departments can gain actionable insights to manage key business
decisions that could drive operational excellence or revenue generating opportunities
for the enterprise.
This unit covers the basic features of data warehousing, its evolution, characteristics,
online transaction processing (OLTP), online analytical processing, popular
platforms and applications of data warehouses.
1.1 OBJECTIVES
After going through this unit, you shall be able to:
• understand the evolution of data warehouse;
• describe various characteristics of data warehouse;
• list the benefits and applications of a data warehouse;
• discuss the significance of metadata in data warehouse;
• list and discuss the types of data warehouses; and
• identify the popular data warehouse platforms.
[Figure: data flow from the source systems through the staging area and ETL to the data marts (DM2) and the Data Warehouse]
Basically, the Kimball model reverses the Inmon model: data marts are loaded directly
with data from the source systems, and an ETL process is then used to load
the Data Warehouse. The above image depicts how the bottom-up approach works.
Below are the steps involved in the bottom-up approach:
• The data flow in the bottom-up approach starts with the extraction of data from
various source systems into the staging area, where it is processed and loaded
into the data marts that handle specific business processes.
• After the data marts are refreshed, the current data is once again extracted into the
staging area, and transformations are applied to fit the data into the data mart
structure. The data extracted from the data marts to the staging area is then
aggregated, summarized and so on, loaded into the enterprise data warehouse (EDW), and made available
to end users for analysis, enabling critical business decisions.
Having discussed the data warehouse design strategies, let us study the
characteristics of the DW in the next section.
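The bottom-up flow described above can be sketched in a few lines of Python. This is only an illustrative sketch: the row layout, the function names (`extract_to_stage`, `load_data_marts`, `load_edw`) and the choice of summing amounts by region are assumptions made for the example, not part of the Kimball method itself.

```python
from collections import defaultdict

# Hypothetical source-system rows; the field names are illustrative assumptions.
SOURCE_ROWS = [
    {"process": "sales", "region": "north", "amount": 120.0},
    {"process": "sales", "region": "south", "amount": 80.0},
    {"process": "inventory", "region": "north", "amount": 40.0},
]

def extract_to_stage(rows):
    """Copy source rows into the staging area unchanged."""
    return [dict(row) for row in rows]

def load_data_marts(staged):
    """Route staged rows to one data mart per business process."""
    marts = defaultdict(list)
    for row in staged:
        marts[row["process"]].append(row)
    return dict(marts)

def load_edw(marts):
    """Re-stage the mart data, aggregate it, and load it into the EDW."""
    edw = defaultdict(float)
    for process, rows in marts.items():
        for row in extract_to_stage(rows):
            edw[(process, row["region"])] += row["amount"]
    return dict(edw)

marts = load_data_marts(extract_to_stage(SOURCE_ROWS))
edw = load_edw(marts)
print(sorted(marts))  # the data marts are filled first
print(edw)            # the warehouse is filled from the marts
```

Note that the data marts are populated before the warehouse, which is the defining difference from the top-down approach.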
• Integrated
Integration involves setting up a common system of measurement for all similar data from
multiple systems. Data that is shared across several database repositories
must be stored in a secure manner so that the data warehouse can access it. A data
warehouse integrates data from various sources and combines it in a relational
database. The data must be consistent, readable and coded. The data warehouse integrates
several subject areas, as shown in Figure 4.
Figure 4: Integrated Characteristic Feature of a DW
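As a small illustration of integration, the sketch below brings records from two sources into one common coding and one unit of measure. The source names (`crm`, `billing`), the field layouts and the code mappings are hypothetical and chosen only for the example.

```python
# Hypothetical mappings from each source's local codes to one warehouse code.
GENDER_CODES = {"m": "M", "male": "M", "1": "M", "f": "F", "female": "F", "0": "F"}
LB_PER_KG = 2.20462  # pounds per kilogram

def integrate(record, source):
    """Return the record in the warehouse's common naming, coding and units."""
    if source == "crm":          # assumed layout: {"sex": ..., "weight_kg": ...}
        gender, weight_kg = record["sex"], record["weight_kg"]
    elif source == "billing":    # assumed layout: {"gender": ..., "weight_lb": ...}
        gender, weight_kg = record["gender"], record["weight_lb"] / LB_PER_KG
    else:
        raise ValueError(f"unknown source: {source}")
    return {
        "gender": GENDER_CODES[str(gender).strip().lower()],
        "weight_kg": round(weight_kg, 1),
    }

rows = [
    integrate({"sex": "male", "weight_kg": 70.0}, "crm"),
    integrate({"gender": 0, "weight_lb": 154.3}, "billing"),
]
print(rows)
```

After integration both rows use the same field names, the same gender codes and kilograms, whatever the original source used.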
• Time-Variant
Information may be held at various intervals, such as weekly, monthly and yearly,
as shown in Figure 5. It holds data from online transactions as a series of
time-bound records. The data warehouse covers a broader time range of data than the operational
systems. Because the stored data is associated with a specific period, it
provides a history that can be used to analyze trends. It has aspects of time embedded within it. One
other facet of the data warehouse is that the data cannot be changed, modified or
updated once it is stored.
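A minimal sketch of the time-variant idea, using Python's built-in `sqlite3` module: every row carries an explicit time element in its key, so the same history can be viewed at different intervals. The table name `stock_snapshot` and its columns are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Every row carries an explicit time element (snapshot_date) in its key.
conn.execute("""
    CREATE TABLE stock_snapshot (
        product TEXT,
        snapshot_date TEXT,
        quantity INTEGER,
        PRIMARY KEY (product, snapshot_date)
    )
""")
conn.executemany(
    "INSERT INTO stock_snapshot VALUES (?, ?, ?)",
    [("pen", "2023-01-31", 100), ("pen", "2023-02-28", 80), ("pen", "2024-01-31", 60)],
)

# Roll the monthly history up to yearly intervals.
yearly = conn.execute("""
    SELECT substr(snapshot_date, 1, 4) AS year, AVG(quantity)
    FROM stock_snapshot
    WHERE product = 'pen'
    GROUP BY year
    ORDER BY year
""").fetchall()
print(yearly)  # [('2023', 90.0), ('2024', 60.0)]
```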
• Non-Volatile
The data residing in the data warehouse is permanent, as the name non-volatile
suggests. When new data is added, existing data is not erased or
removed. The warehouse therefore accumulates a mammoth amount of data, which is analysed
with the technologies of the warehouse. Figure 6 shows the non-volatile data warehouse
vs the operational database. A data warehouse is kept separate from the operational
database, and thus regular changes in the operational database are not reflected
in the data warehouse. Data warehouse integration manages different warehouses
relevant to the topic.
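The non-volatile property can be illustrated with a toy append-only store. The class name `AppendOnlyStore` and its methods are hypothetical; the point is only that loading new data leaves previously loaded rows untouched and that no update or delete operation is offered.

```python
class AppendOnlyStore:
    """A toy warehouse table that accepts new rows but never changes old ones."""

    def __init__(self):
        self._rows = []

    def load(self, new_rows):
        """Append a batch; existing rows are left untouched."""
        self._rows.extend(tuple(row) for row in new_rows)

    def rows(self):
        """Return a read-only copy of everything loaded so far."""
        return tuple(self._rows)

store = AppendOnlyStore()
store.load([("2023-01", "sales", 100)])
before = store.rows()
store.load([("2023-02", "sales", 130)])
after = store.rows()

# The earlier rows are still present and unchanged after the second load.
print(after[: len(before)] == before, len(after))  # True 2
```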
• Load Manager
The Load Manager component of the data warehouse is responsible for collecting
data from the operational systems and converting it into a form usable by the users.
It is responsible for importing and exporting data from operational
systems, and includes all of the programs and application interfaces
that pull the data out of the operational systems, prepare it and
load it into the warehouse itself. It performs the following tasks:
• identification of data;
• validation of data for accuracy;
• extraction of data from the original source;
• cleansing of data by eliminating meaningless values and making it usable;
• data formatting;
• data standardization by bringing data into a consistent form;
• data merging by taking data from different sources and consolidating it in one place; and
• establishing referential integrity.
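Several of the load manager tasks listed above (cleansing, standardization, merging and a referential integrity check) can be sketched as follows. The row layout, the rejection rules and the names `cleanse` and `load` are assumptions for this example; a real load manager would be far more elaborate.

```python
def cleanse(row):
    """Drop meaningless values and standardise formatting; None means reject."""
    name = (row.get("customer") or "").strip().title()
    amount = row.get("amount")
    if not name or amount is None or amount < 0:
        return None
    return {"customer": name, "amount": float(amount)}

def load(sources, known_customers):
    """Merge rows from all sources, keeping clean rows that reference a known customer."""
    loaded, rejected = [], []
    for rows in sources:
        for row in rows:
            clean = cleanse(row)
            if clean and clean["customer"] in known_customers:
                loaded.append(clean)
            else:
                rejected.append(row)
    return loaded, rejected

# Two hypothetical operational sources with inconsistent formatting.
pos = [{"customer": " asha ", "amount": 250}, {"customer": "", "amount": 10}]
web = [{"customer": "RAVI", "amount": 99.5}, {"customer": "Zed", "amount": 5}]

loaded, rejected = load([pos, web], known_customers={"Asha", "Ravi"})
print(loaded)
print(len(rejected))
```

Here the row with an empty customer name fails cleansing, and the row for "Zed" fails the referential integrity check because that customer is not in the assumed customer list.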
• Warehouse Manager
The warehouse manager is the center of the data-warehousing system and is the
data warehouse itself. It is a large, physical database that holds a vast amount of
information from a wide variety of sources. The data within the data warehouse
is organized such that it is easy to find and use, and it is updated frequently from its
sources.
• Query Manager
Query Manager Component provides the end-users with access to the stored
warehouse information through the use of specialized end-user tools. Data mining
access tools have various categories such as query and reporting, on-line analytical
processing (OLAP), statistics, data discovery and graphical and geographical
information systems.
• End-user access tools
This is divided into the following categories, such as:
• Reporting Data
• Query Tools
• Data Dippers
• Tools for EIS
• Tools for OLAP and tools for data mining.
Check Your Progress 1
1) What is a Data Warehouse and why is it important?
……………………………………………………………………………
……………………………………………………………………………
……………………………………………………………………………
…….…………………………………………………………......................
2) Mention the characteristics of a Data Warehouse.
……………………………………………………………………………
……………………………………………………………………………
……………………………………………………………………………
…….…………………………………………………………......................
| | OLTP | OLAP |
| Characteristics | Handles a large number of small transactions | Handles large volumes of data with complex queries |
| Query types | Simple standardized queries | Complex queries |
| Operations | Based on INSERT, UPDATE, DELETE commands | Based on SELECT commands to aggregate data for reporting |
| Response time | Milliseconds | Seconds, minutes, or hours depending on the amount of data to process |
| Design | Industry-specific, such as retail, manufacturing, or banking | Subject-specific, such as sales, inventory, or marketing |
| Source | Transactions | Aggregated data from transactions |
| Purpose | Control and run essential business operations in real time | Plan, solve problems, support decisions, discover hidden insights |
| Data updates | Short, fast updates initiated by user | Data periodically refreshed with scheduled, long-running batch jobs |
| Space requirements | Generally small if historical data is archived | Generally large due to aggregating large datasets |
| Backup and recovery | Regular backups required to ensure business continuity and meet legal and governance requirements | Lost data can be reloaded from OLTP database as needed in lieu of regular backups |
| Productivity | Increases productivity of end users | Increases productivity of business managers, data analysts, and executives |
| Data view | Lists day-to-day business transactions | Multi-dimensional view of enterprise data |
| User examples | Customer-facing personnel, clerks, online shoppers | Knowledge workers such as data analysts, business analysts, and executives |
| Database design | Normalized databases for efficiency | Denormalized databases for analysis |
OLTP provides an immediate record of current business activity, while OLAP
generates and validates insights from that data as it’s compiled over time. That
historical perspective empowers accurate forecasting, but as with all business
intelligence, the insights generated with OLAP are only as good as the data pipeline
from which they emanate.
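The contrast between the two kinds of workload can be seen in a small `sqlite3` sketch. The `orders` table and its data are hypothetical; a single in-memory database is used here only to keep the example short, whereas in practice the OLAP query would run against a separate warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")

# OLTP-style work: short transactions that insert or update single rows.
with conn:
    conn.execute("INSERT INTO orders (region, amount) VALUES ('north', 100.0)")
    conn.execute("INSERT INTO orders (region, amount) VALUES ('south', 40.0)")
    conn.execute("INSERT INTO orders (region, amount) VALUES ('north', 60.0)")
with conn:
    conn.execute("UPDATE orders SET amount = 45.0 WHERE id = 2")

# OLAP-style work: a read-only SELECT that aggregates many rows for reporting.
report = conn.execute(
    "SELECT region, COUNT(*), SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()
print(report)  # [('north', 2, 160.0), ('south', 1, 45.0)]
```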
Check Your Progress 2
1) Why is a data warehouse separated from operational databases?
……………………………………………………………………………
……………………………………………………………………………
……………………………………………………………………………
…….…………………………………………………………......................
2) Mention the key differences between a database and a data warehouse.
……………………………………………………………………………
……………………………………………………………………………
……………………………………………………………………………..
AWS Redshift
Redshift is a cloud-based data warehousing tool for enterprises. The platform can
process petabytes of data quite fast. That's why it’s suitable for high-speed data
analytics. It also supports automatic concurrency scaling. The automation increases
or decreases query processing resources to match workload demand.
Although tooling provided by Amazon reduces the need to have a database
administrator full time, it does not eliminate the need for one. Amazon Redshift is
known to have issues with handling storage efficiently in an environment prone to
frequent deletes.
Snowflake
Snowflake is a data warehousing solution that offers a variety of options for public
cloud technology. With Snowflake, you can make your business more data-driven.
You may use Snowflake to set up an enterprise-grade cloud data warehouse. With
Snowflake, you can analyze data from various unstructured and structured sources.
However, Snowflake depends on Microsoft Azure, Amazon Web Services (AWS) or
Google Cloud Platform (GCP). Service can be affected whenever one of
those cloud platforms has an outage.
Microsoft Azure Synapse
Microsoft Azure is a robust platform for data management, analytics, integration,
and more, with solutions spanning AI, blockchain, and more than a dozen unique
databases for varying use cases. Among them is Azure Synapse, formerly known
as Azure SQL Data Warehouse, a platform built for analytics, providing you the
ability to query data using either serverless or provisioned resources at scale.
Azure Synapse brings together the two worlds of data warehousing and analytics
with a unified experience to ingest, prepare, manage, and serve data for immediate
BI and machine learning. The broader Azure platform includes thousands of tools,
including others that interface with the various Azure databases.
1.12 SUMMARY
In this unit you have studied about the evolution, characteristics, benefits and
applications of Data Warehouse.
An operational database system provides day-to-day information, but it cannot easily
be used for strategic decision-making. A Data Warehouse is designed to
provide strategic information. It allows people to make decisions and
provides a flexible, convenient and interactive source of strategic intelligence. A
data warehouse combines several technologies: it collects data from various
operational database systems and from external sources such as magazines, newspapers
and reports from the same industry, removes contradictions, transforms the data,
and then stores it in formats suited to easy access for decision-making purposes.
The defining characteristics of the data warehouse are: Subject oriented, integrated,
time-variant, and non-volatile.
Data warehouses are meant to be used by executives, managers, and other people
at higher managerial levels who may not have much technical expertise in handling
the databases.
Advantages of data warehouses include better decisions, increased productivity,
lower operational costs, enhanced asset and liability management, and better CRM.
1.13 SOLUTIONS/ANSWERS
Check Your Progress 1
1) Data Warehousing (DW) is a process for collecting and managing data
from diverse sources to provide meaningful insights into the business. A
Data Warehouse is typically used to connect and analyze heterogeneous
sources of business data. The data warehouse is the centerpiece of the BI
system built for data analysis and reporting.
It is an amalgam of technologies and components that helps to use data
strategically. Rather than transaction processing, it is the electronic storage
of a vast amount of information by a company, designed for query
and analysis. It is a process of transforming data into information and making
it available to users in a timely way so that it can make a difference.
The decision support database (Data Warehouse) is maintained separately
from the operational database of the organization. The data warehouse,
however, is not a product but rather an environment. It is an architectural
construct of an information system that provides users with
current and historical decision-support information that is difficult to
access or present in the conventional operational data store.
Data warehouse platforms also sort data by subject, such as customers,
products or business activities.
Data warehousing is an increasingly important business intelligence tool
for companies:
• Ensure consistency. All data gathered and shared with decision
makers worldwide should be in a uniform format. Standardization of
data from various sources reduces the risk of misinterpretation and
improves overall accuracy.
• Make better business decisions. Successful entrepreneurs have a thorough
understanding of data, and are good at predicting future trends. The data
warehouse helps users access various data sets with speed and efficiency.
• Data warehouse platforms allow companies to access their business's past
history and evaluate ideas and projects. This gives managers an idea of how
they can improve their sales and management practices.
2) Following are the four main characteristics of a data warehouse:
(i) Subject-oriented
A data warehouse is subject-oriented, as it provides information on a subject rather
than on the ongoing operations of the organization. Such subjects may be inventory,
promotion, storage, etc. A data warehouse never concentrates on current
operations. Instead, it emphasizes modeling and analysis of data for decision-making.
It also provides a simple and concise view of the particular subject by
excluding details that would not be useful in helping the decision process.
(ii) Integrated
Integration in a Data Warehouse means establishing a standard unit of measure
for all similar data from the different databases. The data must also be stored in
a simple and universally acceptable manner within the Data Warehouse. A data
warehouse is created by combining data from various sources such as a mainframe,
relational databases, flat files, etc. It must keep naming conventions, formats,
measurements of attributes and encoding consistent across these sources. Such
integration assists in robust data analysis.
(iii) Time-variant
Compared to operational systems, the data warehouse covers a longer time horizon: its
data is identified with a given period and provides historical information. It contains
an element of time, either explicitly or implicitly. One place where Data
Warehouse data shows time variance is in the structure of the record key. Each primary
key contained in the DW should have an element of time, either implicitly or explicitly,
such as the day, week or month.
(iv) Non-volatile
Also, the data warehouse is non-volatile, meaning that prior data is not erased
when new data is entered into it. Data is read-only and is refreshed only periodically. This also
assists in analyzing historical data and in understanding what happened and when.
Transaction processing, recovery and concurrency control mechanisms are not
required. In the Data Warehouse environment, activities such as delete, update
and insert, which are performed in an operational application environment, are
omitted.
Check Your Progress 2
1) Data Warehouse systems are kept separate from operational databases so that
the two workloads do not intermingle and conflict.
• An operational database is built for well-known tasks such as searching records,
indexing, and digital archiving. Data warehouse queries, in contrast, are often
complex and varied.
• Operational databases handle multiple transactions simultaneously.
Concurrency control and recovery mechanisms are needed to ensure that
operational databases remain robust and consistent.
• An operational database query allows both read and modify operations,
whereas an OLAP query needs only read access to the stored information.
• An operational database maintains current information. In contrast,
historical data is kept in a warehouse.
2) A database stores the current data required to power an application. A data
warehouse stores current and historical data from one or more systems in
a predefined and fixed schema, which allows business analysts and data
scientists to easily analyze the data. The table below summarizes the differences
between databases and data warehouses:
Table 2: Database Vs Data Warehouse
| Characteristic Feature | Database | Data Warehouse |
| Workloads | Operational and transactional | Analytical |
| Data Type | Structured or semi-structured | Structured and/or semi-structured |
| Schema Flexibility | Rigid or flexible schema depending on database type | Pre-defined and fixed schema definition for ingest (schema on write and read) |
| Data Freshness | Real time | May not be up-to-date based on frequency of ETL processes |
| Users | Application developers | Business analysts and data scientists |
| Pros | Fast queries for storing and updating data | The fixed schema makes working with the data easy for business analysts |
| Cons | May have limited analytics capabilities | Difficult to design and evolve schema. Scaling compute may require unnecessary scaling of storage, because they are tightly coupled |
UNIT 2 DATA WAREHOUSE ARCHITECTURE
Structure
2.0 Introduction
2.1 Objectives
2.2 Data Warehouse Architecture and its Types
2.2.1 Types of Data Warehouse Architectures
2.3 Components of Data Warehouse Architecture
2.4 Layers of Data Warehouse Architecture
2.4.1 Best Practices for Data Warehouse Architecture
2.5 Data Marts
2.5.1 Data Mart Vs Data Warehouse
2.6 Benefits of Data Marts
2.7 Types of Data Marts
2.8 Structure of a Data Mart
2.9 Designing the Data Marts
2.10 Limitations with Data Marts
2.11 Summary
2.12 Solutions / Answers
2.13 Further Readings
2.0 INTRODUCTION
In the previous unit we studied data warehousing and related topics.
Despite numerous advancements over the last five years in the arenas of Big
Data, cloud computing, predictive analysis and information technologies, data
warehouses have gained more significance. For the success of any data warehouse,
its architecture plays an important role. For three decades, the data warehouse
architecture has been the pillar of corporate data ecosystems.
This unit presents various topics, including the basic concept of data warehouse
architecture, its types, the significant components and layers of data warehouse
architecture, and data marts and their design.
2.1 OBJECTIVES
After going through this unit, you shall be able to:
• understand the purpose of data warehouse architecture;
• describe the process of storing the data in a data warehouse;
• list and discuss the various types of data warehouse architectures;
• discuss various components and layers of data warehouse architecture;
• summarize the functionality of data marts, their benefits and various
types; and
• know the ways of structuring and designing the data marts.
2.2 DATA WAREHOUSE ARCHITECTURE AND ITS TYPES
Data warehouse architecture is the design of an organization's data storage framework.
It takes information from raw data sets and stores it in a structured and easily
digestible format.
A data warehouse architecture plays a vital role in the data enterprise: databases
assist in storing and processing data, and data warehouses help in analyzing that
data.
Data warehousing is the process of storing a large amount of data by a business or
organization. The data warehouse is designed to perform large, complex analytical
queries on large multi-dimensional datasets in a straightforward manner. Data
warehouses extract data from different sources, which are in different formats,
convert it into a single uniform format, and place it in the Data Warehouse.
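As a simple illustration of converting data from different formats into one form, the sketch below normalizes date strings from several hypothetical source systems into ISO dates before loading. The list of source formats is an assumption made for the example.

```python
from datetime import datetime

# Assumed input formats, one per hypothetical source system.
SOURCE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y%m%d")

def to_iso_date(text):
    """Try each known source format and return the date as YYYY-MM-DD."""
    for fmt in SOURCE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognised date format: {text!r}")

raw = ["2023-07-01", "01/07/2023", "20230701"]
unified = [to_iso_date(value) for value in raw]
print(unified)  # ['2023-07-01', '2023-07-01', '2023-07-01']
```

Values that match none of the known formats raise an error instead of being loaded silently, which keeps badly formatted data out of the warehouse.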
2.2.1 Types of Data Warehouse Architectures
Data warehouse architecture defines the arrangement of the data in different
databases. As the data must be organized and cleansed to be valuable, a modern
data warehouse structure identifies the most effective technique of extracting
information from raw data.
Using a dimensional model, the raw data in the staging area is extracted and converted
into a simple consumable warehousing structure to deliver valuable business
intelligence. When designing a data warehouse, there are three different types of
models to consider, based on the number of tiers the architecture has.
(i) Single-tier data warehouse architecture
(ii) Two-tier data warehouse architecture
(iii) Three-tier data warehouse architecture
The details of each of the architecture are given below:
(i) Single-tier data warehouse architecture
The single-tier architecture (Figure 1) is not a frequently practiced approach. The
main goal of having such architecture is to remove redundancy by minimizing the
amount of data stored. Its primary disadvantage is that it doesn’t have a component
that separates analytical and transactional processing.
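One way to picture a single-tier design is as a virtual layer over the operational data, so that no second copy is stored. The sketch below uses a `sqlite3` view for this purpose; treating the warehouse as a view is an assumption made for illustration, and the table and view names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, amount REAL)")  # operational table
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("pen", 10.0), ("pen", 15.0), ("ink", 30.0)],
)

# The "warehouse" is only a view: no second copy of the data is stored.
conn.execute("""
    CREATE VIEW sales_by_product AS
    SELECT product, SUM(amount) AS total FROM sales GROUP BY product
""")

# A new operational write is visible to the analytical view immediately,
# because both workloads share the same stored data.
conn.execute("INSERT INTO sales VALUES ('ink', 5.0)")
totals = conn.execute(
    "SELECT product, total FROM sales_by_product ORDER BY product"
).fetchall()
print(totals)  # [('ink', 35.0), ('pen', 25.0)]
```

The example also shows the disadvantage noted above: analytical queries and transactional writes hit the same tables, so nothing separates the two kinds of processing.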
Figure 4 illustrates the complete data warehouse architecture with three tiers:
Now, let’s learn about the major components of a data warehouse and how they
help build and scale a data warehouse in the next section.
• Hybrid data marts combine data from existing data warehouses and other
operational sources. This unified approach leverages the speed and user-
friendly interface of a top-down approach and also offers the enterprise-
level integration of the independent method.
2.11 SUMMARY
Data warehouse architecture is the design and building blocks of the modern
data warehouse. In this unit we have studied the basic building blocks of the data
warehouse, data warehouse architecture, its types, architecture models, data marts,
designing of data marts and limitations.
In the next unit we will study Dimensional Modeling.
2.12 SOLUTIONS / ANSWERS
Check Your Progress 1:
1. Data warehouse architecture is the method of defining the overall
architecture of data communication, processing and presentation
that exists for end-clients. Every data warehouse is different,
but each is characterized by standard vital
components.
In simple words, a data warehouse is an information system
that contains cumulative and historical data from a single source
or multiple sources. Data warehousing concepts simplify the
process of reporting and analysis of data in organizations.
There are different approaches to
constructing a data warehouse architecture; the approach is chosen
based on the requirements of the organization.
2. On every operational database, there are a certain fixed number of
operations that have to be applied. There are different well-defined
techniques for delivering suitable solutions. Data warehousing
is found to be more effective when the correct flow of the data
warehouse architecture is completely followed.
The four different processes that contribute to a data warehouse
are extracting and loading the data, cleaning and transforming the
data, backing up and archiving the data, and carrying out the query
management process by directing them to the appropriate data
sources.
3. Data marts are used to solve specific organizational problems,
especially those that are unique to one department. Typical use
cases for a data mart include:
Focused Analytics
Analytics is perhaps the most common application of data marts. The
data in these repositories is entirely relevant to the requirements of
the business department, with no extraneous information, resulting
in faster and more accurate analysis. For example, financial analysts
will find it easier to work with a financial data mart, rather than
working with an entire data warehouse.
Fast Turnaround
Data marts are generally faster to develop than a data warehouse, as
the developers are working with fewer sources and a limited schema.
Data marts are ideal for data projects operating under challenging
time constraints.
Permission Management
Data marts can be a risk-free way to grant limited data access without
exposing the entire data warehouse. For example, a dependent data
mart contains a segment of warehouse data, and users are only able
to view the contents of the mart. This prevents unauthorized access
and accidental writes.
Better Resource Management
Data marts are sometimes used where there is a disparity in resource
usage between different departments. For example, the logistics
department might perform a high volume of daily database actions,
which causes the marketing team’s analytics tools to run slow. By
providing each department with its own data mart, it’s easier to
allocate resources according to their needs.
UNIT 3 DIMENSIONAL MODELING
Structure
3.0 Introduction
3.1 Objectives
3.2 Dimensional Modeling
3.2.1 Strengths of Dimensional Modeling
3.3 Identifying Facts and Dimensions
3.4 Star Schema
3.4.1 Features of Star Schema
3.5 Advantages and Disadvantages of Star Schema
3.6 Snowflake Schema
3.6.1 Features of Snowflake Schema
3.7 Advantages and Disadvantages of Snowflake Schema
3.7.1 Star Schema Vs Snowflake Schema
3.8 Fact Constellation Schema
3.8.1 Advantages and Disadvantages of Fact Constellation Schema
3.9 Aggregate Tables
3.10 Need for Building Aggregate Fact Tables
Limitations of Aggregate Fact Tables
3.11 Aggregate Fact Tables and Derived Dimension Tables
3.12 Summary
3.13 Solutions/Answers
3.14 Further Readings
3.0 INTRODUCTION
In the earlier unit, we studied the Data Warehouse Architecture and
Data Marts. In this unit let us focus on the modeling aspects. We will
go through dimensional modeling, the star schema, the snowflake schema, the
fact constellation schema and aggregate tables.
3.1 OBJECTIVES
After going through this unit, you shall be able to:
• understand the purpose of dimensional modeling;
• identify the measures, facts, and dimensions;
• discuss the fact and dimension tables and their pros and cons;
• discuss the Star and Snowflake schemas;
• explore a comparative analysis of the star and snowflake schemas;
• describe aggregate facts and the fact constellation schema; and
• discuss various examples of star and snowflake schemas.
3.2 DIMENSIONAL MODELING
Dimensional modeling is a data model design adopted when building a data
warehouse. Put simply, dimensional modeling reduces the response time of
queries compared with a traditional relational design. The concept behind
dimensional modeling is all about the conceptual design. First, let us see what
dimensional modeling is and how it differs from a traditional data model design.
A data model is a representation of how data is stored in a database, usually a
diagram of the tables and the relationships that exist between them. A dimensional
model is designed to read, summarize and compute numeric data from a data
warehouse.
A data warehouse is an example of a system that requires a small number of large
tables. This is because many users use the application to read a lot of data: a
characteristic of a data warehouse is that data is written once and read many times
over, so the read operation is dominant. For example, consider a data warehouse
that holds customer-related information in a single table. This makes it much easier
for an analyst to count the number of customers by country, because using fewer
tables simplifies query processing. The main objective of dimensional modeling is to
provide an easy architecture for the end user to write queries and to reduce
the number of relationships between the tables and dimensions, hence providing
efficient query handling.
Dimensional modeling organizes data logically as a cube for OLAP data
management. The concept was developed by Ralph Kimball. Its two key concepts
are “facts” and “dimensions”. A transaction record is divided into “facts”, which
consist of the numerical business transaction data, and “dimensions”, which are the
reference information that gives context to the facts. Facts and dimensions are
explained in more detail in the subsequent sections.
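The split of a transaction record into facts and dimensions can be sketched in a few lines of Python. This is only an illustrative sketch: the sample records, the helper build_dimension and all column names are assumptions made for this example and do not come from the unit.

```python
# Hypothetical flat transaction records (names and values are illustrative only).
transactions = [
    {"date": "2023-01-05", "product": "Pen", "country": "India", "units": 10, "amount": 50.0},
    {"date": "2023-01-05", "product": "Book", "country": "India", "units": 2, "amount": 30.0},
    {"date": "2023-01-06", "product": "Pen", "country": "Nepal", "units": 4, "amount": 20.0},
]


def build_dimension(rows, attribute):
    """Assign a surrogate key (1, 2, 3, ...) to each distinct value of one attribute."""
    dimension = {}
    for row in rows:
        value = row[attribute]
        if value not in dimension:
            dimension[value] = len(dimension) + 1
    return dimension


# Dimensions: descriptive context for the facts.
date_dim = build_dimension(transactions, "date")
product_dim = build_dimension(transactions, "product")
country_dim = build_dimension(transactions, "country")

# Facts: only dimension keys plus the numeric measures.
fact_sales = [
    {
        "date_key": date_dim[t["date"]],
        "product_key": product_dim[t["product"]],
        "country_key": country_dim[t["country"]],
        "units": t["units"],
        "amount": t["amount"],
    }
    for t in transactions
]

# Summarize one measure by one dimension: total units per country.
key_to_country = {key: name for name, key in country_dim.items()}
units_by_country = {}
for fact in fact_sales:
    name = key_to_country[fact["country_key"]]
    units_by_country[name] = units_by_country.get(name, 0) + fact["units"]

print(units_by_country)
```

The fact rows carry no descriptive text at all; every description is reached through a key into a dimension, which is what keeps the fact table narrow even when it grows very large.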
The following are the steps in Dimensional Modeling, as shown in Figure 1.
1. Identify Business Process
2. Identify Grain (level of detail)
3. Identify dimensions and attributes
4. Identify facts
5. Build Schema
The model should describe the Why, How much, When/Where/Who and What of
your business process.
3.5 ADVANTAGES AND DISADVANTAGES OF STAR SCHEMA
3.5.1 Advantages of Star Schema
Star schemas are easy for end users and applications to understand and navigate.
With a well-designed schema, users can quickly analyze large, multidimensional
data sets. The main advantages of star schemas in a decision-support environment
are:
• Query performance
Because a star schema database has a small number of tables and clear join paths,
queries run faster than they do against an OLTP system. Small single-table queries,
usually of dimension tables, are almost instantaneous. Large join queries that
involve multiple tables take only seconds or minutes to run.
In a star schema database design, the dimensions are linked only through the central
fact table. When two dimension tables are used in a query, only one join path,
intersecting the fact table, exists between those two tables. This design feature
enforces accurate and consistent query results.
• Load performance and administration
Structural simplicity also reduces the time required to load large batches of data
into a star schema database. By defining facts and dimensions and separating them
into different tables, the impact of a load operation is reduced. Dimension tables
can be populated once and occasionally refreshed. You can add new facts regularly
and selectively by appending records to a fact table.
• Built-in referential integrity
A star schema has referential integrity built in when data is loaded. Referential
integrity is enforced because each record in a dimension table has a unique primary
key, and all keys in the fact tables are legitimate foreign keys drawn from the
dimension tables. A record in the fact table that is not related correctly to a dimension
cannot be given the correct key value to be retrieved.
• Easily understood
A star schema is easy to understand and navigate, with dimensions joined only
through the fact table. These joins are more significant to the end user, because they
represent the fundamental relationship between parts of the underlying business.
Users can also browse dimension table attributes before constructing a query.
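Two of the advantages above, the single join path through the fact table and referential integrity at load time, can be tried out with a small sketch using Python's built-in sqlite3 module. The table names, column names and sample rows are assumptions chosen for illustration. Note also that SQLite checks foreign keys only after PRAGMA foreign_keys = ON is issued; other database systems behave differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# Hypothetical two-dimension star: item and location around a sales fact table.
conn.executescript("""
CREATE TABLE item (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, city TEXT, country TEXT);
CREATE TABLE sales (
    item_key     INTEGER NOT NULL REFERENCES item(item_key),
    location_key INTEGER NOT NULL REFERENCES location(location_key),
    units_sold   INTEGER,
    dollars_sold REAL
);
INSERT INTO item VALUES (1, 'Pen', 'BrandA'), (2, 'Book', 'BrandB');
INSERT INTO location VALUES (1, 'Delhi', 'India'), (2, 'Kathmandu', 'Nepal');
INSERT INTO sales VALUES (1, 1, 10, 50.0), (2, 1, 2, 30.0), (1, 2, 4, 20.0);
""")

# The two dimensions meet only through the fact table, so there is one join path.
query = """
SELECT l.country, i.item_name, SUM(s.units_sold) AS units
FROM sales AS s
JOIN item AS i ON i.item_key = s.item_key
JOIN location AS l ON l.location_key = s.location_key
GROUP BY l.country, i.item_name
ORDER BY l.country, i.item_name
"""
rows = conn.execute(query).fetchall()
print(rows)

# A fact row whose item_key (99) has no matching dimension row is rejected.
try:
    conn.execute("INSERT INTO sales VALUES (99, 1, 1, 5.0)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print("orphan fact rejected:", rejected)
```

Because item and location are never joined to each other directly, any query that combines them must pass through sales, which is why results stay consistent from one query to the next.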
3.5.2 Disadvantages of Star Schema
Optimizing a star schema for read queries and analysis involves certain
challenges:
• Decreased data integrity: Because of the denormalized data structure, star
schemas do not enforce data integrity very well. Although star schemas use
countermeasures to prevent anomalies from developing, a simple insert or
update command can still cause data incongruities.
• Less capable of handling diverse and complex queries: Database
designers build and optimize star schemas for specific analytical needs.
As denormalized data sets, they work best with a relatively narrow set of
simple queries. Comparatively, a normalized schema permits a far wider
variety of more complex analytical queries.
• No Many-to-Many Relationships: Because they offer a simple dimension
schema, star schemas do not work well for many-to-many data relationships.
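The first disadvantage, decreased data integrity, can be illustrated with a minimal sketch. It assumes a hypothetical denormalized dimension table store_dim in which the country is repeated on every store row; the table, its columns and its values are invented for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE store_dim (
    store_key INTEGER PRIMARY KEY,
    city      TEXT,
    country   TEXT   -- repeated for every store in the same city
);
INSERT INTO store_dim VALUES
    (1, 'Delhi', 'India'),
    (2, 'Delhi', 'India'),
    (3, 'Mumbai', 'India');
""")

# An update that reaches only one of the two Delhi rows.
conn.execute("UPDATE store_dim SET country = 'Bharat' WHERE store_key = 1")

# The same city now maps to two different country values.
delhi_countries = sorted(
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT country FROM store_dim WHERE city = 'Delhi'"
    )
)
print(delhi_countries)
```

In a normalized design the country would be stored once in a separate lookup table, so a single update could not leave the two Delhi stores disagreeing with each other.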
Example 1: Suppose a star schema is composed of a Sales fact table as shown in
Figure 3a and several dimension tables connected to it for Time, Branch, Item and
Location.
Fact Table
Sales is the Fact table.
Dimension Tables
The Time table has a column for each of day, month, quarter, year, etc.
The Item table has columns for item_key, item_name, brand, type and
supplier_type.
The Branch table has columns for branch_key, branch_name and branch_type.
The Location table has columns of geographic data, including street, city, state,
and country. Unit_Sold and Dollars_Sold are the Measures.
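The schema of Example 1 could be declared roughly as follows. This is a sketch under stated assumptions: the unit does not name the key columns of the Time and Location tables, so time_key and location_key are hypothetical, as are the table name time_dim and all column data types.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables (time_key and location_key are assumed surrogate keys).
CREATE TABLE time_dim (
    time_key INTEGER PRIMARY KEY,
    day INTEGER, month INTEGER, quarter INTEGER, year INTEGER
);
CREATE TABLE item (
    item_key INTEGER PRIMARY KEY,
    item_name TEXT, brand TEXT, type TEXT, supplier_type TEXT
);
CREATE TABLE branch (
    branch_key INTEGER PRIMARY KEY,
    branch_name TEXT, branch_type TEXT
);
CREATE TABLE location (
    location_key INTEGER PRIMARY KEY,
    street TEXT, city TEXT, state TEXT, country TEXT
);

-- Fact table: one foreign key per dimension plus the two measures.
CREATE TABLE sales (
    time_key     INTEGER NOT NULL REFERENCES time_dim(time_key),
    item_key     INTEGER NOT NULL REFERENCES item(item_key),
    branch_key   INTEGER NOT NULL REFERENCES branch(branch_key),
    location_key INTEGER NOT NULL REFERENCES location(location_key),
    unit_sold    INTEGER,
    dollars_sold REAL
);
""")

# Column index 2 of PRAGMA foreign_key_list is the referenced table name.
referenced = sorted(
    row[2] for row in conn.execute("PRAGMA foreign_key_list(sales)")
)
print(referenced)

# Dimension tables reference no other table, which gives the star its shape.
dimension_links = [
    len(conn.execute(f"PRAGMA foreign_key_list({name})").fetchall())
    for name in ("time_dim", "item", "branch", "location")
]
print(dimension_links)
```

Only the sales table holds foreign keys, which matches the star shape described in the example: every dimension is one join away from the facts and none is joined to another dimension.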
Dimension Tables
The Store table consists of columns like store_id, store_address, city, region, state
and country.
Customer table has columns for each product_id, product_time and product_type.
Sales_Type includes sales_type_id and type_name columns.
Product table consists of product_id, product_name and product_type.
Time table consists of columns like time_id, action_date, action_week,
action_month, action_year and action_weekday.
Measures may be amount spent and number of items bought.
Check Your Progress 2
3.12 SUMMARY
This unit presented the basic design of a data warehouse, focusing on the various
kinds of modeling and schemas. It explored the grains, facts, and dimensions of
the schemas. It is important to know about dimensional modeling, as the
appropriate modeling technique yields correct responses to queries.
Dimensional modeling is a data structure technique used to optimize the design of a
data warehouse for query retrieval operations. There are various schema designs;
this unit discussed the star, snowflake, and fact constellation schemas. Schemas
ranging from denormalized to normalized use dimension, fact, derived and aggregate
fact tables. Every table has a purpose and is used for efficient design in terms of
space and query handling. This unit discussed the pros and cons of each kind of
table and used a number of examples to explain the design in different scenarios.
3.13 SOLUTIONS/ANSWERS
Check Your Progress 1:
1) Characteristics of Star Schema:
• Every dimension in a star schema is represented with only one
dimension table.
• The dimension table should contain the set of attributes.
• The dimension table is joined to the fact table using a foreign key.
• The dimension tables are not joined to each other.
• The fact table contains keys and measures.
• The Star schema is easy to understand and provides optimal disk
usage.
• The dimension tables are not normalized. For instance, in the above
figure, Country ID does not have a Country lookup table as an OLTP
design would have.
• The schema is widely supported by BI tools.
2)