
RAJALAKSHMI INSTITUTE OF TECHNOLOGY

KUTHAMBAKKAM, CHENNAI - 600 124

CCS341 DATA WAREHOUSING

NOTES

DEPARTMENT OF COMPUTER SCIENCE AND BUSINESS SYSTEMS

(THIRD YEAR/SIXTH SEMESTER)

2023-2024(EVEN)
SYLLABUS

UNIT I INTRODUCTION TO DATA WAREHOUSE


Data warehouse Introduction - Data warehouse components- operational
database Vs data warehouse – Data warehouse Architecture – Three-tier Data
Warehouse Architecture - Autonomous Data Warehouse- Autonomous Data
Warehouse Vs Snowflake - Modern Data Warehouse

UNIT II ETL AND OLAP TECHNOLOGY


What is ETL – ETL Vs ELT – Types of Data warehouses - Data warehouse Design
and Modeling - Delivery Process - Online Analytical Processing (OLAP) -
Characteristics of OLAP - Online Transaction Processing (OLTP) Vs OLAP - OLAP
operations- Types of OLAP- ROLAP Vs MOLAP Vs HOLAP.

UNIT III META DATA, DATA MART AND PARTITION STRATEGY


Meta Data – Categories of Metadata – Role of Metadata – Metadata Repository –
Challenges for Meta Management - Data Mart – Need of Data Mart- Cost Effective
Data Mart- Designing Data Marts- Cost of Data Marts- Partitioning Strategy –
Vertical partition – Normalization – Row Splitting – Horizontal Partition

UNIT IV DIMENSIONAL MODELING AND SCHEMA


Dimensional Modeling- Multi-Dimensional Data Modeling – Data Cube- Star
Schema- Snowflake schema- Star Vs Snowflake schema- Fact constellation
Schema- Schema Definition - Process Architecture- Types of Data Base
Parallelism – Datawarehouse Tools

UNIT V SYSTEM & PROCESS MANAGERS


Data Warehousing System Managers: System Configuration Manager- System
Scheduling Manager - System Event Manager - System Database Manager -
System Backup Recovery Manager - Data Warehousing Process Managers: Load
Manager – Warehouse Manager- Query Manager – Tuning – Testing

UNIT-I
INTRODUCTION TO DATA WAREHOUSE
Data warehouse Introduction - Data warehouse components- operational database Vs data
warehouse – Data warehouse Architecture – Three-tier Data Warehouse Architecture -
Autonomous Data Warehouse- Autonomous Data Warehouse Vs Snowflake - Modern Data
Warehouse

Introduction
A Data Warehouse is built by combining data from multiple diverse sources to support
analytical reporting, structured and unstructured queries, and decision making for the organization,
and Data Warehousing is a step-by-step approach for constructing and using a Data Warehouse.
Many data scientists get their data in raw form from various sources, but for many data scientists
as well as business decision-makers, particularly in big enterprises, the main sources of data and
information are corporate data warehouses. A data warehouse holds data from multiple sources,
including internal databases and Software-as-a-Service (SaaS) platforms. After the data is
loaded, it is often cleansed, transformed, and checked for quality before it is used for analytics,
reporting, data science, machine learning, or other purposes.

What is Data Warehouse?


A Data Warehouse is a collection of software tools that facilitates analysis of a large set of
business data used to help an organization make decisions. A large amount of data in data
warehouses comes from numerous sources, such as internal applications like marketing, sales, and
finance; customer-facing apps; and external partner systems, among others. It is a centralized data
repository for analysts that can be queried whenever required for business benefit. A data
warehouse is mainly a data management system that is designed to enable and support business
intelligence (BI) activities, particularly analytics. Data warehouses are designed to perform querying,
cleaning, manipulation, transformation and analysis of the data, and they also contain large amounts of
historical data.


What is Data Warehousing?


The process of creating data warehouses to store a large amount of data is called Data
Warehousing. Data Warehousing helps to improve the speed and efficiency of accessing different
data sets and makes it easier for company decision-makers to obtain insights that will benefit the
business and promote marketing tactics that set it apart from its competitors. We can say that
it is a blend of technologies and components which aids the strategic use of data and information. The
main goal of data warehousing is to build a rich store of historical data that can be retrieved and
analyzed to supply helpful insight into the organization's operations.

Need for Data Warehousing


Data Warehousing is an increasingly essential tool for business intelligence. It allows organizations
to make quality business decisions. Besides improving data analytics, a data warehouse also
helps the organization gain considerable revenue and the strength to compete more strategically in the market. By
efficiently providing systematic, contextual data to an organization's business intelligence tools,
data warehouses help uncover more practical business strategies.


1. Business User: Business users or customers need a data warehouse to look at summarized
data from the past. Since these users often come from a non-technical background, the
data should be presented to them in an uncomplicated way.
2. Maintains consistency: Data warehouses are programmed in such a way that a consistent
format is applied to all the data collected from different sources, which makes it
effortless for company decision-makers to analyze and share data insights with their
colleagues around the globe. By standardizing the data, the risk of error in interpretation is
reduced and overall accuracy improves.
3. Store historical data: Data warehouses are also used to store historical data, that is,
time-variant data from the past, and this input can be used for various purposes.
4. Make strategic decisions: Data warehouses contribute to making better strategic decisions.
Some business strategies may depend upon the data stored within the data warehouses.
5. High response time: A data warehouse has to be prepared for somewhat unexpected loads and
types of queries, which demands a high degree of flexibility and fast response times.
Characteristics of Data warehouse:
1. Subject Oriented: A data warehouse is subject-oriented because it delivers information
organized around a particular theme, which means the data warehousing process is designed to
handle a specific, well-defined subject. These subjects are often sales, distribution,
marketing, etc.
2. Time-Variant: Data is maintained over different intervals of time, such as weekly,
monthly, or annually. The time horizon of a data warehouse is much longer than that of
operational (OLTP) systems. The data residing within the data warehouse is associated with a
particular interval of time and delivers information from a historical perspective; it
contains an element of time, directly or indirectly.
3. Non-volatile: The data residing in the data warehouse is permanent, as the name implies.
This means that the data in the data warehouse cannot be erased or deleted when new
data is inserted into it. In the data warehouse, data is read-only and can only be refreshed at a
particular interval of time. Operations such as delete, update and insert that are performed in an
operational software application are absent in the data warehouse environment. There are only two
types of data operations that can be done in the data warehouse:
 Data Loading
 Data Access
4. Integrated: A data warehouse is created by integrating data from numerous different sources,
such as mainframe computers and relational databases. Additionally, it should also
have reliable naming conventions, formats, and codes. Integration of the data warehouse
aids in the successful analysis of data. Consistency in naming conventions, column
scaling, encoding structure, etc. needs to be ensured. An integrated data warehouse
handles numerous subject-oriented sources.

Architecture & Components of Data Warehouse:


Data warehouse architecture defines the comprehensive architecture of data processing and
presentation that will be useful for data analysis and decision making within the enterprise and
organization. Each organization has different data warehouses depending upon their need, but all of
them are characterized by some standard components.
Data Warehouse applications are designed to support the user’s data requirements, an example of
this is online analytical processing (OLAP). These include functions such as forecasting, profiling,
summary reporting, and trend analysis.
The architecture of the data warehouse mainly consists of the proper arrangement of its elements, to
build an efficient data warehouse with software and hardware components. The elements and
components may vary based on the requirement of organizations. All of these depend on the
organization’s circumstances.


1. Source Data Component:


In the Data Warehouse, the source data comes from different places. It is grouped into four
categories:
 External Data: For data gathering, most executives and data analysts rely on
information coming from external sources for a considerable amount of the information they
use, such as statistics related to their organization or industry produced by external
agencies and departments.
 Internal Data: In every organization, users keep their "private" spreadsheets,
reports, client profiles, and sometimes even departmental databases. This is the internal
data, part of which could be helpful in every data warehouse.
 Operational System data: Operational systems are principally meant to run the business. In
each operational system, we periodically take the old data and store it in archived files.
 Flat files: A flat file is nothing but a text database that stores data in a plain text format. Flat
files generally are text files that have all data processing and structure markup removed. A
flat file contains a table with a single record per line.
2. Data Staging:
After the data is extracted from various sources, now it’s time to prepare the data files for storing in
the data warehouse. The extracted data collected from various sources must be transformed and
made ready in a format that is suitable to be saved in the data warehouse for querying and analysis.
The data staging contains three primary functions that take place in this part:


 Data Extraction: This stage handles the various data sources. Data analysts should employ
suitable techniques for every data source.
 Data Transformation: As we all know, data for a data warehouse comes from many
different sources. If data extraction for a data warehouse poses big challenges, data
transformation presents even bigger ones. Many individual tasks are performed as part of
data transformation. First, the data extracted from each source is cleaned. Standardization of
data elements forms a large part of data transformation. Transformation also involves
combining pieces of information from different sources, purging source data that is not
helpful, and separating source records into new combinations. Once data transformation
ends, we have a set of integrated data that is clean, standardized, and summarized.
 Data Loading: When we complete the structure and construction of the data warehouse and
go live for the first time, we do the initial loading of the data into the data warehouse
storage. The initial load moves high volumes of data and consumes a considerable amount of
time. A minimal sketch of these three staging functions is given below.
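To make these staging functions concrete, here is a minimal Python sketch of extract, transform and load, assuming a hypothetical sales.csv source file and a local SQLite database standing in for the warehouse storage; the file name, column names and table name are illustrative only, not taken from any particular product.

import csv
import sqlite3

# Extract: read raw records from a flat-file source (hypothetical sales.csv)
with open("sales.csv", newline="") as f:
    raw_rows = list(csv.DictReader(f))          # one dict per source record

# Transform: clean and standardize the extracted records
clean_rows = []
for row in raw_rows:
    if not row.get("amount"):                   # purge records that are not useful
        continue
    clean_rows.append((
        row["order_id"].strip(),
        row["region"].strip().upper(),          # standardize a data element
        float(row["amount"]),                   # enforce a single numeric format
    ))

# Load: store the prepared data in the warehouse storage
conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS sales_fact (order_id TEXT, region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", clean_rows)
conn.commit()
conn.close()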
3. Data Storage in Warehouse:
Data storage for data warehousing is split into multiple repositories. These data repositories contain
structured data in a highly normalized form for fast and efficient processing.
 Metadata: Metadata means data about data, i.e. it summarizes basic details regarding the data,
making it easier to find and work with particular instances of data. Metadata can be created
manually or generated automatically and contains basic information about the data.
 Raw Data: Raw data is a set of data and information that has not yet been processed; it is
delivered as-is from a particular data source and has not yet been worked on by machine or
human. Such data is often gathered from online sources to deliver deep insight into users'
online behavior.
 Summary Data or Data summary: A data summary is a brief condensation of a large body of
detailed data. It is often the final output that analysts produce at the end of their
processing, declaring the ultimate result in the form of summarized data. Data summarization is
an essential part of data mining and processing.
4. Data Marts:
Data marts are also part of the storage component in a data warehouse. A data mart can store the information
of a specific function of an organization that is handled by a single authority. There may be any
number of data marts in a particular organization depending upon the functions. In short, data marts
contain subsets of the data stored in data warehouses. Now, the users and analysts can use data for
various applications like reporting, analyzing, mining, etc. The data is made available to them
whenever required.

Three Tier Architectures


Data Warehouse Architecture
The Three-Tier Data Warehouse Architecture is the commonly used Data Warehouse design for
building a Data Warehouse: it includes the required Data Warehouse Schema Model, the
required OLAP server type, and the required front-end tools for Reporting or Analysis purposes.
As the name suggests, it contains three tiers, namely the Top tier, the Bottom tier and the Middle
tier, which are procedurally linked with one another from the Bottom tier (data sources) through
the Middle tier (OLAP servers) to the Top tier (front-end tools).
Data Warehouse Architecture is the design based on which a Data Warehouse is built, to
accommodate the desired type of Data Warehouse Schema, user interface application and
database management system, for data organization and repository structure. The type of
Architecture is chosen based on the requirement provided by the project team. Three-tier Data
Warehouse Architecture is the commonly used choice, due to the detail in its structure. The
three different tiers here are termed as:
 Top-Tier
 Middle-Tier
 Bottom-Tier
Each tier can have different components based on the prerequisites presented by the decision-
makers of the project, but each component stays within the role of its respective tier.

Three-Tier Data Warehouse Architecture


Here is a pictorial representation for the Three-Tier Data Warehouse Architecture
1. Bottom Tier
The Bottom Tier in the three-tier architecture of a data warehouse consists of the Data Repository.
Data Repository is the storage space for the data extracted from various data sources, which
undergoes a series of activities as a part of the ETL process. ETL stands for Extract, Transform and
Load. As a preliminary process, before the data is loaded into the repository, all the data relevant and
required are identified from several sources of the system. These data are then cleaned up, to avoid
repeating or junk data from its current storage units. The next step is to transform all these data into a
single format of storage. The final step of ETL is to load the data into the repository. A few commonly
used ETL tools are:
 Informatica
 Microsoft SSIS
 Snaplogic
 Confluent
 Apache Kafka
 Alooma
 Ab Initio
 IBM Infosphere
The storage type of the repository can be a relational database management system or a
multidimensional database management system. A relational database system can hold simple
relational data, whereas a multidimensional database system can hold data that has more than one
dimension. Whenever the Repository includes both relational and multidimensional database
management systems, there exists a metadata unit. As the name suggests, the metadata unit consists of
all the metadata fetched from both the relational database and multidimensional database systems. This
Metadata unit provides incoming data to the next tier, that is, the middle tier. From the user’s
standpoint, the data from the bottom tier can be accessed only with the use of SQL queries. The
complexity of the queries depends on the type of database. Data from the relational database system can
be retrieved using simple queries, whereas the multidimensional database system demands complex
queries with multiple joins and conditional statements.
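As an illustration of querying the bottom tier with SQL, the following Python/SQLite sketch builds a tiny stand-in repository with hypothetical sales_fact and date_dim tables; the table and column names are invented for the example.

import sqlite3

conn = sqlite3.connect(":memory:")              # stand-in for the data repository
conn.executescript("""
    CREATE TABLE date_dim  (date_key INTEGER PRIMARY KEY, year INTEGER, quarter TEXT);
    CREATE TABLE sales_fact(date_key INTEGER, region TEXT, amount REAL);
    INSERT INTO date_dim   VALUES (1, 2023, 'Q1'), (2, 2023, 'Q2');
    INSERT INTO sales_fact VALUES (1, 'NORTH', 120.0), (1, 'SOUTH', 90.0), (2, 'NORTH', 150.0);
""")

# A simple query over relational data: total sales by region
for row in conn.execute("SELECT region, SUM(amount) FROM sales_fact GROUP BY region"):
    print(row)

# A more complex query with a join and a condition, as a reporting tool might issue
query = """
    SELECT d.quarter, f.region, SUM(f.amount) AS total
    FROM sales_fact AS f
    JOIN date_dim  AS d ON d.date_key = f.date_key
    WHERE d.year = 2023
    GROUP BY d.quarter, f.region
"""
for row in conn.execute(query):
    print(row)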
2. Middle Tier
The Middle tier here is the tier with the OLAP servers. The Data Warehouse can have more than
one OLAP server, and it can have more than one type of OLAP server model as well, which
depends on the volume of the data to be processed and the type of data held in the bottom tier. There
are three types of OLAP server models, such as:
ROLAP
Relational online analytical processing is a model of online analytical processing which
carries out an active multidimensional breakdown of data stored in a relational database,
instead of redesigning a relational database into a multidimensional database.
This is applied when the repository consists of only the relational database system in it.
MOLAP
Multidimensional online analytical processing is another model of online analytical
processing that catalogs and indexes data directly in its multidimensional
database system.
This is applied when the repository consists of only the multidimensional database system in
it.
HOLAP
Hybrid online analytical processing is a hybrid of both relational and multidimensional online
analytical processing models.
When the repository contains both the relational database management system and the
multidimensional database management system, HOLAP is the best solution for a smooth
functional flow between the database systems. HOLAP allows storing data in both the

relational and the multidimensional formats.
The Middle Tier acts as an intermediary component between the top tier and the data repository,
that is, the top tier and the bottom tier respectively. From the user’s standpoint, the middle tier gives
an idea about the conceptual outlook of the database.
3. Top Tier
The Top Tier is a front-end layer, that is, the user interface that allows the user to connect with the
database systems. This user interface is usually a tool or an API call, which is used to fetch the
required data for Reporting, Analysis, and Data Mining purposes. The type of tool depends purely
on the form of outcome expected. It could be a Reporting tool, an Analysis tool, a Query tool or a
Data mining tool.
It is essential that the Top Tier be uncomplicated in terms of usability. Only user-friendly
tools can give effective outcomes. Even when the bottom tier and middle tier are designed with the
utmost caution and clarity, if the Top tier is equipped with a clumsy front-end tool, the
whole Data Warehouse Architecture can become an utter failure. This makes the selection of the user
interface/front-end tool for the Top Tier, which will serve as the face of the Data Warehouse system, a
very significant part of the Three-Tier Data Warehouse Architecture design process.

Autonomous Data Warehouse

Introduction to Oracle ADW


Within the range of Oracle's Cloud-based services is Oracle's Autonomous Data Warehouse. It is a
Cloud Data Warehouse that completely manages all Data Warehouse operations and their
complexity. Broad automation of features such as data security, configuration, provisioning,
development, scaling, and backups is provided with Oracle Autonomous Data Warehouse.
The process of Data Management and Processing with Oracle Autonomous Data Warehouse can be
illustrated as follows:


Key Features of Oracle ADW


The Oracle Autonomous Data Warehouse offers several features that are easy to use and remain
compatible with various existing tools and applications. These include the following:
 The end-to-end management system of Oracle ADW is fully managed and fully tuned to
load large amounts of data and run complex queries without any human moderation
required.
 It is a highly elastic service and can be auto-scaled or manually scaled depending upon
how your data requirements grow.
 The Oracle Autonomous Data Warehouse provides automated index management and
real-time statistics updates collectively.
 It can be configured in advance to incorporate different user types and provide optimized
querying features for multiple workloads.
 It is enriched with built-in tools such as Oracle Application Express (APEX), Oracle
REST Data Services (ORDS), and Oracle Database Actions.

Autonomous Data Warehouse Vs Snowflake
Autonomous Data Warehouse:
 "Load and Go!" No DBA expertise required to provision, backup, secure, or patch
 Runs on Exadata backbone with enterprise-class capabilities
 99.95% availability (only 4 hours of downtime per year) with fully automated backups and RAC for compute node failures
 Auto-applies security patches and upgrades with complete tenant and data isolation
 Single data warehouse; scaling compute or storage is not constrained by fixed blocks
 Advanced enterprise-grade analytics tools built in, with complete Oracle Cloud integration

Snowflake:
 Requires manual tuning and specialized DBA expertise for cluster setup and storage options
 Runs on generic AWS or Azure infrastructure
 Only 99.9% availability (up to 9 hours of downtime per year), and requires Premier Support for HA at additional cost
 Shared tenancy and no isolation of operational users from customer application data
 DBA must set up single or multi-clusters, resulting in over-provisioning and higher costs
 Must rely on 3rd parties for BI, Data Integration and advanced analytics

Modern Data Warehouse

What is Modern Data Warehouse?


A Modern Data Warehouse is a cloud-based solution that gathers and stores information from many
sources. Organizations can process this data to make intelligent decisions. That is why various organizations
use a Modern Data Warehouse to improve their finance, human resources, and operations business
processes. Departments rely on the quality information held in the cloud-based warehouse to make smarter
decisions.

Modern Data Warehouse Pyramid

There are five different components of a Modern Data Warehouse.

Level 1: Data Acquisition


Data acquisition can come from a variety of sources such as:
 IoT devices
 Social media posts
 YouTube videos
 Website content
 Customer data
 Enterprise Resource Planning
 Legacy data stores
Level 2: Data Engineering
Once you have acquired the data, you need to load it into the data warehouse. Data engineering uses pipelines
and ETL (extract, transform, load) tools. Using these tools, you can load that information into the
data warehouse, much as a truck brings raw materials into a factory.
Level 3: Data Management Governance
Once the data comes into the factory, you need someone to evaluate the quality of the data. You then
need to steward that data, because security and privacy must be considered.
Data governance helps ensure the quality of the information by stewarding, prepping, and cleaning the data
to ensure it is ready for analysis.
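As a small, hypothetical illustration of the prepping and cleaning step, the pandas sketch below removes duplicates, drops or fills missing values, and standardizes a column before the data moves on to analysis; the column names and values are made up.

import pandas as pd

# Hypothetical raw feed arriving in the warehouse staging area
raw = pd.DataFrame({
    "customer": ["Asha", "Asha", "Ravi", None],
    "region":   ["north", "north", "South", "south"],
    "amount":   [100.0, 100.0, None, 75.0],
})

cleaned = (
    raw.drop_duplicates()                                   # remove exact duplicate records
       .dropna(subset=["customer"])                         # drop rows missing a mandatory field
       .assign(region=lambda d: d["region"].str.upper(),    # standardize codes
               amount=lambda d: d["amount"].fillna(0.0))    # fill missing measures
)
print(cleaned)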


Level 4: Reporting and Business Intelligence


Once you prep and clean the data, you can continue the factory analogy and take that raw material (data)
and turn it into a finished good (business intelligence). For our purposes, we will use Microsoft
Power BI to help you visualize the information by using advanced analytics, KPIs, and workflow
automation. When you are finished, you can see exactly what's going on with your data.
Level 5: Data Science
Modern Data Warehouse is about more than seeing the information; it’s about using the data to make
smarter decisions. That’s one of the key concepts you should walk away with here today. There are
several different programs to help you leverage the data to your benefit, including:
 AI
 Deep learning
 Machine learning
 Statistical modeling
 Natural language processing (NLP)
Keep in mind that all the algorithms above need data to work successfully. The more data you
provide, the smarter your decisions, and the smarter your results. If you want to understand your
reports, it is essential to leverage AI to get better answers, which leads us back to the Modern Data
Warehouse. Again, it is more than gathering and storing data. It is about making smart decisions.
Final Thoughts
Modern data warehouse takes the data most corporations have and turns it into actionable
intelligence based on visual stories.

Referential Link: https://youtu.be/KP1rvlrrrwY


UNIT-II
ETL AND OLAP TECHNOLOGY

What is ETL – ETL Vs ELT – Types of Data warehouses - Data warehouse Design and Modeling -
Delivery Process - Online Analytical Processing (OLAP) - Characteristics of OLAP – Online
Transaction Processing (OLTP) Vs OLAP - OLAP operations- Types of OLAP- ROLAP Vs MOLAP Vs
HOLAP.

Extraction, Load and Transform (ELT): Extraction, Load and Transform (ELT) is the
technique of extracting raw data from the source, storing it in the data warehouse of the target
server, and preparing it for downstream users.
ELT comprises three different operations performed on the data:
1. Extract:
Extracting data is the technique of identifying data from one or more sources. The sources
may be databases, files, ERP, CRM or any other useful source of data.
2. Load:
Loading is the process of storing the extracted raw data in data warehouse or data lakes.
3. Transform:
Data transformation is the process in which the raw data source is transformed to the
target format required for analysis.


Data is retrieved from the warehouse whenever required. The data transformed as
required is then sent forward for analysis. When you use ELT, you move the entire data
set as it exists in the source systems to the target. This means that you have the raw data
at your disposal in the data warehouse, in contrast to the ETL approach.
Extraction, Transform and Load (ETL):
ETL is the traditional technique of extracting raw data, transforming it for the users as
required and storing it in data warehouses. ELT was later developed, having ETL as its
base. The three operations happening in ETL and ELT are the same except that their order
of processing is slightly varied. This change in sequence was made to overcome some
drawbacks.
1. Extract:
It is the process of extracting raw data from all available data sources such as databases,
files, ERP, CRM or any other.
2. Transform:
The extracted data is immediately transformed as required by the user.
3. Load:
The transformed data is then loaded into the data warehouse from where the users can
access it.

The data collected from the sources are directly stored in the staging area. The
transformations required are performed on the data in the staging area. Once the data is
transformed, the resultant data is stored in the data warehouse. The main drawback of
ETL architecture is that once the transformed data is stored in the warehouse, it cannot be
modified again, whereas in ELT, a copy of the raw data is always available in the
warehouse and only the required data is transformed when needed.
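The difference in ordering can be sketched in a few lines of Python; the transform() function, the records and the "warehouse" lists below are hypothetical placeholders, and the only point is where the transformation happens relative to the load.

def transform(record):
    # Hypothetical cleanup applied to one raw record
    return {**record, "region": record["region"].upper()}

source_records = [{"id": 1, "region": "north"}, {"id": 2, "region": "south"}]

# ETL: transform in a staging area first, then load only the transformed result
staging = [transform(r) for r in source_records]
warehouse_etl = list(staging)                    # warehouse holds transformed data only

# ELT: load the raw data as-is, then transform on demand when it is needed
warehouse_elt_raw = list(source_records)         # a raw copy stays available in the warehouse
on_demand_view = [transform(r) for r in warehouse_elt_raw if r["id"] == 2]

print(warehouse_etl)
print(on_demand_view)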


Difference between ELT and ETL:

 Hardware: ELT tools do not require additional hardware, whereas ETL tools require specific
hardware with their own engines to perform transformations.
 Storage: ELT mostly uses Hadoop or NoSQL databases to store data, and rarely an RDBMS;
ETL uses an RDBMS exclusively to store data.
 Loading: In ELT all components are in one system, so loading is done only once; since ETL
uses a staging area, extra time is required to load the data.
 Transformation time: With ELT, the time to transform data is independent of the size of the
data; with ETL, the system has to wait for large volumes of data, and as the size of data
increases, transformation time also increases.
 Cost: ELT is cost effective and available to all businesses using SaaS solutions; ETL is not
cost effective for small and medium businesses.
 Users: Data transformed with ELT is used by data scientists and advanced analysts; data
transformed with ETL is used by users reading reports and by SQL coders.
 Views: ELT creates ad hoc views at low cost for building and maintaining; in ETL, views are
created based on multiple scripts, and deleting a view means deleting data.
 Data types: ELT is best for unstructured and non-relational data, ideal for data lakes and
suited for very large amounts of data; ETL is best for relational and structured data and better
for small to medium amounts of data.


Data Warehouse Design


A data warehouse is a single data repository where data from multiple data sources is
integrated for online business analytical processing (OLAP). This implies a data warehouse needs to
meet the requirements from all the business stages within the entire organization. Thus, data warehouse
design is a hugely complex, lengthy, and hence error-prone process. Furthermore, business analytical
functions change over time, which results in changes in the requirements for the systems. Therefore,
data warehouse and OLAP systems are dynamic, and the design process is continuous.
Data warehouse design takes an approach different from view materialization as practiced in industry. It sees
data warehouses as database systems with particular needs, such as answering management-related
queries. The target of the design becomes how the record from multiple data sources should be
extracted, transformed, and loaded (ETL) to be organized in a database as the data warehouse.
There are two approaches
1. "top-down" approach
2. "bottom-up" approach

Top-down Design Approach


In the "Top-Down" design approach, a data warehouse is described as a subject-oriented, time-variant,
non-volatile and integrated data repository for the entire enterprise. Data from different sources are
validated, reformatted and saved in a normalized (up to 3NF) database as the data warehouse. The
data warehouse stores "atomic" information, the data at the lowest level of granularity, from where
dimensional data marts can be built by selecting the data required for specific business subjects or
particular departments. This is a data-driven approach, as the information is gathered and
integrated first, and then the business requirements by subjects for building data marts are formulated.
The advantage of this method is that it supports a single integrated data source; thus data marts built
from it will be consistent when they overlap.

Advantages of top-down design


Data Marts are loaded from the data warehouses.
Developing new data mart from the data warehouse is very easy.


Disadvantages of top-down design


This technique is inflexible to changing departmental needs.
The cost of implementing the project is high.
Bottom-Up Design Approach

In the "Bottom-Up" approach, a data warehouse is described as "a copy of transaction data specifically
architected for query and analysis," termed the star schema. In this approach, a data mart is created first
to provide the necessary reporting and analytical capabilities for particular business processes (or
subjects). Thus it is a business-driven approach, in contrast to Inmon's data-driven approach.
Data marts include the lowest grain data and, if needed, aggregated data too. Instead of a normalized
database for the data warehouse, a denormalized dimensional database is adapted to meet the data
delivery requirements of data warehouses.


With this method, to use the set of data marts as the enterprise data warehouse, data marts should be
built with conformed dimensions in mind, meaning that common objects are represented in the same way in
different data marts. The conformed dimensions connect the data marts to form a data warehouse,
which is generally called a virtual data warehouse.
The advantage of the "bottom-up" design approach is that it has quick ROI, as developing a data mart,
a data warehouse for a single subject, takes far less time and effort than developing an enterprise-wide
data warehouse. Also, the risk of failure is even less. This method is inherently incremental. This
method allows the project team to learn and grow.

Advantages of bottom-up design


Documents can be generated quickly.
The data warehouse can be extended to accommodate new business units.
It is just developing new data marts and then integrating with other data marts.

Disadvantages of bottom-up design


The locations of the data warehouse and the data marts are reversed in the bottom-up approach.


Differentiate between Top-Down Design Approach and Bottom-Up Design Approach

Top-Down Approach:
 Breaks the vast problem into smaller sub-problems.
 Inherently architected; not a union of several data marts.
 Single, central storage of information about the content.
 Centralized rules and control.
 It includes redundant information.
 It may see quick results if implemented with repetitions.

Bottom-Up Approach:
 Solves the essential low-level problems and integrates them into a higher one.
 Inherently incremental; essential data marts can be scheduled first.
 Departmental information is stored.
 Departmental rules and control.
 Redundancy can be removed.
 Less risk of failure, favorable return on investment, and proof of techniques.

Data Warehouse Modeling


Data warehouse modeling is the process of designing the schemas of the detailed and summarized
information of the data warehouse. The goal of data warehouse modeling is to develop a schema
describing the reality, or at least a part of it, which the data warehouse is needed to support.
Data warehouse modeling is an essential stage of building a data warehouse for two main
reasons. Firstly, through the schema, data warehouse clients can visualize the relationships among the
warehouse data and so use them with greater ease. Secondly, a well-designed schema allows an effective
data warehouse structure to emerge, helping to decrease the cost of implementing the warehouse and
improve the efficiency of using it.
Data modeling in data warehouses is different from data modeling in operational database systems. The
primary function of data warehouses is to support DSS processes. Thus, the objective of data
warehouse modeling is to make the data warehouse efficiently support complex queries on long-term
information.


In contrast, data modeling in operational database systems targets efficiently supporting simple
transactions in the database, such as retrieving, inserting, deleting, and changing data. Moreover, data
warehouses are designed for customers with general knowledge about the enterprise,
whereas operational database systems are more oriented toward use by software specialists for creating
distinct applications.
Data Warehouse model is illustrated in the given diagram.

The data within the specific warehouse itself has a particular architecture with the emphasis on
various levels of summarization, as shown in figure:


The current detail record is central in importance as it:


 Reflects the most current happenings, which are commonly the most
interesting. It is voluminous, as it is saved at the lowest level of
granularity.
 It is almost always saved on disk storage, which is fast to access but expensive and
difficult to manage.

Older detail data is stored in some form of mass storage; it is infrequently accessed and kept at a
level of detail consistent with the current detailed data.
Lightly summarized data is data extracted from the low level of detail found at the current, detailed level
and is usually stored on disk storage. When building the data warehouse we have to decide over what
unit of time the summarization is done and what components or attributes the summarized
data will contain.
Highly summarized data is compact and directly available and can even be found outside the
warehouse.
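As an illustrative sketch (with invented table and column names), lightly and highly summarized tables can be derived from the current detail data with simple aggregations over different units of time:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- current detail data: one row per sale per day
    CREATE TABLE sales_detail (sale_date TEXT, product TEXT, amount REAL);
    INSERT INTO sales_detail VALUES
      ('2023-01-05', 'Pen', 10.0), ('2023-01-20', 'Pen', 15.0),
      ('2023-02-03', 'Pen', 12.0), ('2023-02-10', 'Book', 40.0);

    -- lightly summarized data: aggregated per month and product
    CREATE TABLE sales_monthly AS
      SELECT substr(sale_date, 1, 7) AS month, product, SUM(amount) AS total
      FROM sales_detail GROUP BY month, product;

    -- highly summarized data: aggregated per year only
    CREATE TABLE sales_yearly AS
      SELECT substr(sale_date, 1, 4) AS year, SUM(amount) AS total
      FROM sales_detail GROUP BY year;
""")
print(conn.execute("SELECT * FROM sales_monthly").fetchall())
print(conn.execute("SELECT * FROM sales_yearly").fetchall())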
Metadata is the final element of the data warehouse and is really of a different dimension, in that it is
not the same as data drawn from the operational environment, but it is used as:

 A directory to help the DSS investigator locate the contents of the data warehouse.
 A guide to the mapping of data as it is changed from the operational environment to the data
warehouse environment.
 A guide to the methods used for summarization between the current, detailed data, the lightly
summarized data and the highly summarized data, etc.

Data Modeling Life Cycle


In this section, we define a data modeling life cycle. It is a straightforward process of transforming the
business requirements to fulfill the goals for storing, maintaining, and accessing the data within IT
systems. The result is a logical and physical data model for an enterprise data warehouse.
The objective of the data modeling life cycle is primarily the creation of a storage area for
business information. That area comes from the logical and physical data modeling stages, as shown in
Figure.


Conceptual Data Model


A conceptual data model recognizes the highest-level relationships between the different entities.
Characteristics of the conceptual data model
 It contains the essential entities and the relationships among them.
 No attribute is specified.
 No primary key is specified.
We can see that the only information shown via the conceptual data model is the entities that define the data
and the relationships between those entities. No other detail is shown through the conceptual data
model.


Logical Data Model


A logical data model defines the information in as much detail as possible, without regard to how
it will be physically implemented in the database. The primary objective of logical data modeling is to
document the business data structures, processes, rules, and relationships in a single view - the logical
data model.

Features of a logical data model


 It involves all entities and the relationships among them.
 All attributes for each entity are specified.
 The primary key for each entity is stated.
 Referential Integrity is specified (FK Relation).

The steps for designing the logical data model are as follows:

 Specify primary keys for all entities.
 List the relationships between different entities.
 List all attributes for each entity.
 Normalization.
 No data types are listed.


Physical Data Model


Physical data model describes how the model will be presented in the database. A physical database
model demonstrates all table structures, column names, data types, constraints, primary key, foreign
key, and relationships between tables. The purpose of physical data modeling is the mapping of the
logical data model to the physical structures of the RDBMS system hosting the data warehouse. This
contains defining physical RDBMS structures, such as tables and data types to use when storing the
information. It may also include the definition of new data structures for enhancing query performance.

Characteristics of a physical data model


 Specification of all tables and columns.
 Foreign keys are used to identify relationships between tables.
The steps for physical data model design are as follows (a minimal sketch is given after the list):
 Convert entities to tables.
 Convert relationships to foreign keys.
 Convert attributes to columns.
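Here is a minimal sketch of these steps in Python/SQLite, with hypothetical Customer and Order entities: each entity becomes a table, each attribute becomes a typed column, and the relationship between them becomes a foreign key.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Entity 'Customer' converted to a table; its attributes become typed columns
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        city        TEXT
    );

    -- Entity 'Order' converted to a table; the Customer-Order relationship
    -- becomes a foreign key column (referential integrity)
    CREATE TABLE customer_order (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
        order_date  TEXT,
        amount      REAL
    );
""")
print([r[0] for r in conn.execute("SELECT name FROM sqlite_master WHERE type='table'")])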


Types of Data Warehouse Models

Enterprise Warehouse
An Enterprise warehouse collects all of the records about subjects spanning the entire organization. It
supports corporate-wide data integration, usually from one or more operational systems or external
data providers, and it's cross-functional in scope. It generally contains detailed information as well as
summarized information and can range in size from a few gigabytes to hundreds of gigabytes,
terabytes, or beyond.
An enterprise data warehouse may be implemented on traditional mainframes, UNIX super servers, or
parallel architecture platforms. It requires extensive business modeling and may take years to develop
and build.
Data Mart
A data mart includes a subset of corporate-wide data that is of value to a specific collection of users.
The scope is confined to particular selected subjects. For example, a marketing data mart may restrict
its subjects to the customer, items, and sales. The data contained in the data marts tend to be
summarized.

Data marts are divided into two types:
Independent Data Mart: An independent data mart is sourced from data captured from one or more
operational systems or external data providers, or from data generated locally within a particular
department or geographic area.
Dependent Data Mart: Dependent data marts are sourced directly from enterprise data warehouses.
Virtual Warehouses
A virtual data warehouse is a set of views over operational databases. For efficient query
processing, only some of the possible summary views may be materialized. A virtual warehouse is
simple to build but requires excess capacity on operational database servers.
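As a small, hypothetical illustration, a dependent data mart can be carved out of warehouse data by restricting it to the subject one department needs and summarizing it; the DataFrame below and its columns are invented for the example.

import pandas as pd

# Corporate-wide warehouse data (hypothetical)
warehouse = pd.DataFrame({
    "subject": ["sales", "sales", "finance", "hr"],
    "region":  ["NORTH", "SOUTH", "NORTH", "SOUTH"],
    "amount":  [120.0, 90.0, 300.0, 55.0],
})

# Dependent marketing data mart: only the subject of value to that group,
# kept in summarized rather than fully detailed form
marketing_mart = (
    warehouse[warehouse["subject"] == "sales"]
    .groupby("region", as_index=False)["amount"].sum()
)
print(marketing_mart)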

Data Warehouse Delivery Process


Now we discuss the delivery process of the data warehouse. The main steps used in the data warehouse
delivery process are as follows:

IT Strategy: The DWH project must have an IT strategy for procuring and retaining funding.
Business Case Analysis: After the IT strategy has been designed, the next step is the business case. It
is essential to understand the level of investment that can be justified and to recognize the projected
business benefits which should be derived from using the data warehouse.

Education & Prototyping: The company will experiment with the ideas of data analysis and educate
itself on the value of the data warehouse. This is valuable, and is especially needed if this is the
company's first exposure to the benefits of decision-support data. The prototyping method can advance
the growth of this education; it is better than working models alone. Prototyping requires business
requirements, a technical blueprint, and structures.
Business Requirement:
It contains the following:
 The logical model for data within the data warehouse
 The source systems that provide this data (mapping rules)
 The business rules to be applied to the information
 The query profiles for the immediate requirement
Technical blueprint: It arranges the architecture of the warehouse. The technical blueprint of the delivery
process makes an architecture plan which satisfies long-term requirements. It lays out the server and data mart
architecture and the essential components of the database design.
Building the vision: It is the phase where the first production deliverable is produced. This stage will
probably create significant infrastructure elements for extracting and loading information but limit
them to the extraction and load of information sources.
History Load: The next step is one where the remainder of the required history is loaded into the data
warehouse. This means that the new entities would not be added to the data warehouse, but additional
physical tables would probably be created to save the increased record volumes.
AD-Hoc Query: In this step, we configure an ad-hoc query tool to operate against the data
warehouse.
These end-customer access tools are capable of automatically generating the database query that
answers any question posed by the user.
Automation: The automation phase is where many of the operational management processes are fully
automated within the DWH. These would include:
 Extracting and loading the data from a variety of source systems
 Transforming the information into a form suitable for analysis
 Backing up, restoring and archiving data
 Generating aggregations from predefined definitions within the Data Warehouse
 Monitoring query profiles and determining the appropriate aggregates to maintain system performance
Extending Scope: In this phase, the scope of DWH is extended to address a new set of business
requirements. This involves the loading of additional data sources into the DWH i.e. the introduction
of new data marts.

Requirement Evolution: This is the last step of the delivery process of a data warehouse. As we all
know, requirements are not static and evolve continuously. As the business requirements
change, they have to be reflected in the system.

OLAP INTRODUCTION
In the earlier unit you studied Extract, Transform and Load (ETL) for a
Data Warehouse. Within the data science field, there are two types of data processing
systems: online analytical processing (OLAP) and online transaction processing (OLTP).
The main difference is that one uses data to gain valuable insights, while the other is
purely operational. However, there are meaningful ways to use both systems to solve
data problems. OLAP is a system for performing multi-dimensional analysis at high
speeds on large volumes of data. Typically, this data is from a data warehouse, data mart
or some other centralized data store. OLAP is ideal for data mining, business
intelligence and complex analytical calculations, as well as business reporting functions
like financial analysis, budgeting and sales forecasting.

OLAP AND ITS NEED


Online Analytical Processing (OLAP) is the technology to analyze and process data from
multiple sources at the same time. It accesses multiple databases at the same time. It
is software which helps data analysts collect data from different perspectives for
developing effective business strategies. Query operations like group, join or
aggregation can be done easily with OLAP using pre-calculated or pre-aggregated data,
hence making it much faster than simple relational databases. You can understand OLAP
as a multi-cube structure, which has many cubes, each cube pertaining to some
database. The cubes are designed in such a way that reports are generated effectively and
efficiently.
OLAP is the core component of a data warehouse implementation, providing fast and
flexible multi-dimensional data analysis for business intelligence (BI) and decision
support applications. OLAP (online analytical processing) is software used to
perform high-speed, multivariate analysis of large amounts of data in data warehouses,
data marts, or other unified and centralized data stores. The data is broken down
for display, monitoring or analysis. For example, sales figures can be related to location

(region, country, state/province, company), time (year, month, week, day), product
(clothing, male/female/child, brand, type), etc. But in a data warehouse, records are
stored in tables, and each table can only sort data on two of the dimensions at a time.
Recording and reorganizing the data into a multi-dimensional format allows very fast
processing and very in-depth analysis.
The primary objective of OLAP is not just data processing. For
instance, a company might compare its sales in the month of January with those of
February, and then compare those results with those of another location, which may be stored in a
separate database. In this case, it needs a multi-dimensional view of the database storing all the
data categories. As another example, Amazon analyzes purchases made by its
customers to recommend a personalized home page of products
they are likely to be interested in. This is one good example of an OLAP
system. OLAP creates a single platform for all types of business analysis, which
includes planning, budgeting, forecasting and analysis. The main benefit of OLAP is the
consistency of information and calculations; using OLAP systems we can easily apply
security restrictions on users and objects to comply with regulations and protect
sensitive data.
OLAP assists managers in making decisions by giving multidimensional record views
that are efficient to provide, hence enhancing their productivity. Due to the inherent
flexibility supported by organized databases, OLAP functions are self-contained.
Through extensive control of analysis capabilities, it permits simulation of business
models and problems.
Let us see the need for OLAP, to have a better understanding of OLAP over relational
databases:
Efficient and effective methods to improve the sales of an organization: In retail, an
organization may have multiple products sold through a number of different channels
across the globe. OLAP makes it effective and efficient to search for a product of a
particular region within a specified time period (for example, only weekend sales, or
festival-period sales, picked out very specifically from a very large distributed data set).
It improves the sales of a business. The data analysis power of OLAP brings effective
results in sales. It helps in identifying expenditures which produce a high return on
investment (ROI).

Usually, data operations and analysis are performed using a simple spreadsheet, where
data values are arranged in row and column format. This is ideal for two-dimensional
data. However, OLAP involves multidimensional data, with data usually obtained from
different and unrelated sources. Using a spreadsheet is not an optimal option. A cube
can store and analyze multidimensional data in a logical and orderly manner.

CHARACTERISITCS OF OLAP

The main characteristics of OLAP are as follows:

Fast: OLAP acts as a bridge between the Data Warehouse and the front-end, and hence helps in
better accessibility of data, yielding faster results.
Analysis: OLAP data analysis and computed measures and their results are stored in
separate data files. OLAP distinguishes between zero and missing values: it should ignore
missing values and still produce the correct aggregate values (a small sketch of this
distinction is given after this list). OLAP facilitates interactive
query handling and complex analysis for the users.
Shared: Through OLAP operations such as drill-down and roll-up, users navigate between various
dimensions of the multidimensional cube, making it an effective and efficient reporting system.
Multidimensional: OLAP provides a multidimensional conceptual view and access to data for
different users at different levels. As the number of dimensions increases, the report-
generation performance of the OLAP system does not significantly degrade.
Data and Information: OLAP has calculation power for complex queries and data. It
supports data visualization using graphs and charts.
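The zero-versus-missing distinction mentioned under "Analysis" can be seen in a tiny pandas example: a missing value is ignored by the aggregate, whereas a real zero participates in it.

import numpy as np
import pandas as pd

sales = pd.Series([100.0, 0.0, np.nan])    # one real zero, one missing value

print(sales.mean())                        # 50.0 -> the missing value is ignored, the zero is counted
print(sales.fillna(0).mean())              # 33.33... -> treating missing as zero distorts the aggregate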

OLAP Operations
OLAP stands for Online Analytical Processing Server. It is a software technology that
allows users to analyze information from multiple database systems at the same time. It is
based on a multidimensional data model and allows the user to query multi-
dimensional data (e.g. Delhi -> 2018 -> Sales data). OLAP databases are divided into one
or more cubes, and these cubes are known as hyper-cubes.


OLAP operations:
There are five basic analytical operations that can be performed on an OLAP cube (a small sketch of these operations is given after the list):
1. Drill down: In drill-down operation, the less detailed data is converted into
highly detailed data. It can be done by:
Moving down in the concept hierarchy
Adding a new dimension
In the cube given in overview section, the drill down operation is performed by moving
down in the concept hierarchy of Time dimension (Quarter -> Month).

2. Roll up: It is just opposite of the drill-down operation. It performs aggregation
on the OLAP cube. It can be done by:
Climbing up in the concept hierarchy
Reducing the dimensions
In the cube given in the overview section, the roll-up operation is performed by
climbing up in the concept hierarchy of Location dimension (City -> Country).

3. Dice: It selects a sub-cube from the OLAP cube by selecting two or more dimensions.
In the cube given in the overview section, a sub-cube is selected by selecting
following dimensions with criteria:
Location = "Delhi" or "Kolkata"
Time = "Q1" or "Q2"
Item = "Car" or "Bus"

4. Slice: It selects a single dimension from the OLAP cube which results in a new sub-
cube creation. In the cube given in the overview section, Slice is performed on the
dimension
Time = “Q1”.

5. Pivot: It is also known as rotation operation as it rotates the current view to get a new
view of the representation. In the sub-cube obtained after the slice operation,
performing pivot operation gives a new view of it.
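These five operations can be mimicked on a small pandas data cube; this is only an illustrative sketch using made-up Location/Time/Item data, not a real OLAP server.

import pandas as pd

cube = pd.DataFrame({
    "Location": ["Delhi", "Delhi", "Kolkata", "Kolkata"],
    "Time":     ["Q1", "Q2", "Q1", "Q2"],
    "Item":     ["Car", "Bus", "Car", "Bus"],
    "Sales":    [120, 80, 150, 95],
})

# Roll up: aggregate away a dimension (here, summing over Item)
rollup = cube.groupby(["Location", "Time"], as_index=False)["Sales"].sum()

# Drill down would go the other way (e.g. Quarter -> Month) and needs
# more detailed source data than this toy cube holds.

# Slice: fix a single dimension to one value
slice_q1 = cube[cube["Time"] == "Q1"]

# Dice: select a sub-cube using criteria on two or more dimensions
dice = cube[cube["Location"].isin(["Delhi", "Kolkata"]) & cube["Item"].isin(["Car", "Bus"])]

# Pivot: rotate the view so Locations become rows and Time periods become columns
pivot = cube.pivot_table(index="Location", columns="Time", values="Sales", aggfunc="sum")

print(rollup, slice_q1, dice, pivot, sep="\n\n")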


UNIT-III
METADATA, DATAMART AND PARTITION STRATEGY

Meta Data – Categories of Metadata – Role of Metadata – Metadata Repository – Challenges for Meta
Management - Data Mart – Need of Data Mart- Cost Effective Data Mart- Designing Data Marts- Cost
of Data Marts- Partitioning Strategy – Vertical partition – Normalization – Row Splitting – Horizontal
Partition

Meta data Definitions


Here is a sample list of definitions:
Data about the data
Table of contents for the data
Catalog for the data
Data warehouse atlas
Data warehouse roadmap
Data warehouse directory
Glue that holds the data warehouse contents together
Tongs to handle the data
The nerve center


Role of Metadata

Challenges for Metadata Management


Although metadata is so vital in a data warehouse environment, seamlessly integrating all the parts of
metadata is a formidable task. Industry-wide standardization is far from being a reality. Metadata created by
a process at one end cannot be viewed through a tool used at another end without going through
convoluted transformations. These challenges force many data warehouse developers to abandon the
requirements for proper metadata management.

Here are the major challenges to be addressed while providing metadata:

 Each software tool has its own proprietary metadata. If you are using several tools in your
data warehouse, how can you reconcile the formats?
 No industry-wide accepted standards exist for metadata formats.
 There are conflicting claims on the advantages of a centralized metadata repository as
opposed to a collection of fragmented metadata stores.
 There are no easy and accepted methods of passing metadata along the processes as data
moves from the source systems to the staging area and thereafter to the data warehouse
storage.
 Preserving version control of metadata uniformly throughout the data warehouse is
tedious and difficult.
 In a large data warehouse with numerous source systems, unifying the metadata relating
to the data sources can be an enormous task. You have to deal with conflicting
standards, formats, data naming conventions, data definitions, attributes, values,
business rules, and units of measure. You have to resolve indiscriminate use of aliases
and compensate for inadequate data validation rules.

Metadata Repository
Think of a metadata repository as a general-purpose information directory or cataloguing device to
classify, store, and manage metadata. As we have seen earlier, business metadata and technical metadata
serve different purposes. The end-users need the business metadata; data warehouse developers and
administrators require the technical metadata.
The structures of these two categories of metadata also vary. Therefore, the metadata repository can be
thought of as two distinct information directories, one to store business metadata and the other to store
technical metadata. This division may also be logical within a single physical repository.
The following Figure shows the typical contents in a metadata repository. Notice the division between
business and technical metadata. Did you also notice another component called the information navigator?
This component is implemented in different ways in commercial offerings. The functions of the
information navigator include the following:


Interface from query tools. This function attaches data warehouse data to third-party query tools so that
metadata definitions inside the technical metadata may be viewed from these tools.
Drill-down for details. The user of metadata can drill down and proceed from one level of metadata to a
lower level for more information. For example, you can first get the definition of a data table, then go to
the next level for seeing all attributes, and go further to get the details of individual attributes.

Review predefined queries and reports. The user is able to review predefined queries and reports, and
launch the selected ones with proper parameters.

A centralized metadata repository accessible from all parts of the data warehouse for your end- users,
developers, and administrators appears to be an ideal solution for metadata management. But for a
centralized metadata repository to be the best solution, the repository must meet some basic requirements.
Let us quickly review these requirements. It is not easy to find a repository tool that satisfies every one of
the requirements listed below.


Flexible organization. Allow the data administrator to classify and organize metadata into logical
categories and subcategories, and assign specific components of metadata to the classifications.

 Historical. Use versioning to maintain the historical perspective of the metadata.


 Integrated. Store business and technical metadata in formats meaningful to all types of users.
 Good compartmentalization. Able to separate and store logical and physical database models.
 Analysis and look-up capabilities. Capable of browsing all parts of metadata and also navigating
through the relationships.
 Customizable. Able to create customized views of metadata for individual groups of users and to
include new metadata objects as necessary.
 Maintain descriptions and definitions. View metadata in both business and technical terms.
 Standardization of naming conventions. Flexibility to adopt any type of naming convention and
standardize throughout the metadata repository.
 Synchronization. Keep metadata synchronized within all parts of the data warehouse environment
and with the related external systems.
 Open. Support metadata exchange between processes via industry-standard interfaces and be
compatible with a large variety of tools.

Selection of a suitable metadata repository product is one of the key decisions the project team must
make. Use the above list of criteria as a guide while evaluating repository tools for your data warehouse
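As a rough illustration of how such a repository can separate business and technical metadata and support drill-down from a table definition to individual attributes, here is a minimal Python sketch; the class and field names are invented for this example and do not correspond to any particular repository product.

from dataclasses import dataclass, field

@dataclass
class AttributeMeta:
    name: str
    data_type: str            # technical metadata
    business_definition: str  # business metadata

@dataclass
class TableMeta:
    name: str
    description: str
    attributes: list[AttributeMeta] = field(default_factory=list)
    version: int = 1          # versioning supports the "historical" requirement

class MetadataRepository:
    """Toy information directory holding business and technical metadata."""
    def __init__(self):
        self._tables: dict[str, TableMeta] = {}

    def register(self, table: TableMeta) -> None:
        self._tables[table.name] = table

    def describe(self, table_name: str) -> str:
        # Drill-down level 1: the table definition
        return self._tables[table_name].description

    def attributes(self, table_name: str) -> list[str]:
        # Drill-down level 2: all attributes of the table
        return [a.name for a in self._tables[table_name].attributes]

    def attribute_detail(self, table_name: str, attr: str) -> AttributeMeta:
        # Drill-down level 3: details of an individual attribute
        return next(a for a in self._tables[table_name].attributes if a.name == attr)

repo = MetadataRepository()
repo.register(TableMeta(
    "sales_fact", "Daily sales measures by product, store and date",
    [AttributeMeta("qty", "INTEGER", "Units sold"),
     AttributeMeta("value", "DECIMAL(10,2)", "Sale value in rupees")]))
print(repo.attributes("sales_fact"))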

What Is A Data Mart?

A data mart is a small portion of the data warehouse that is mainly related to a particular business
domain such as marketing or sales.

The data stored in the DW system is huge hence data marts are designed with a subset of data that belongs
to individual departments. Thus a specific group of users can easily utilize this data for their analysis.

Unlike a data warehouse that has many combinations of users, each data mart will have a particular set of
end-users. The lesser number of end-users results in better response time.

Data marts are also accessible to business intelligence (BI) tools. Data marts do not contain duplicated
(or) unused data. They do get updated at regular intervals. They are subject-oriented and flexible databases.
Each team has the right to develop and maintain its data marts without modifying data warehouse (or)
other data mart’s data.

A data mart is more suitable for small businesses as it costs much less than a data warehouse system. The
time required to build a data mart is also less than the time required for building a data warehouse.


Pictorial representation of Multiple Data Marts:

When Do We Need Data Mart?


Based on the necessity, plan and design a data mart for your department by engaging the
stakeholders, because the operational cost of a data mart can sometimes be high.

Consider the below reasons to build a data mart:


 If you want to partition the data with a set of user access control strategy.
 If a particular department wants to see the query results much faster instead of scanning huge DW
data.
 If a department wants data to be built on other hardware (or) software platforms.
 If a department wants data to be designed in a manner that is suitable for its tools.

Cost-Effective Data Mart:


A cost-effective data mart can be built by the following steps:

 Identify The Functional Splits: Divide the organization's data into department-specific (data mart)
subsets to meet each department's requirements, without any further organizational dependency.
 Identify User Access Tool Requirements: There may be different user access tools in the
market that need different data structures. Data marts are used to support all these internal
structures without disturbing the DW data. One data mart can be associated with one tool as per
the user needs. Data marts can also provide updated data to such tools daily.
 Identify Access Control Issues: If different data segments in a DW system need privacy and
should be accessed by a set of authorized users then all such data can be moved into data marts.


Cost of Data Mart:

The cost of data mart can be estimated as follows:

 Hardware and Software Cost: Any newly added data mart may need extra hardware, software,
processing power, network, and disk storage space to work on queries requested by the end-
users. This makes data marting an expensive strategy. Hence the budget should be planned
precisely.
 Network Access: If the location of the data mart is different from that of the data warehouse,
then all the data should be transferred with the data mart loading process. Thus a network should
be provided to transfer huge volumes of data which may be expensive.
 Time Window Constraints: The time taken for the data mart loading process will depend on
various factors such as complexity & volumes of data, network capacity, and data transfer
mechanisms, etc.

Comparison Of Data Warehouse Vs Data Mart

S.No | Data Warehouse | Data Mart

1 | Complex and costs more to implement. | Simple and cheaper to implement.
2 | Works at the organization level for the entire business. | The scope is limited to a particular department.
3 | Querying the DW is difficult for business users because of huge data dependencies. | Querying the data mart is easy for business users because of limited data.
4 | Implementation time is longer, typically months or years. | Implementation time is shorter, typically days, weeks, or months.
5 | Gathers data from various external source systems. | Gathers data from a few centralized DW, internal, or external source systems.
6 | Strategic decisions can be made. | Business decisions can be made.

Types Of Data Marts

Data marts are classified into three types i.e. Dependent, Independent and Hybrid. This classification
is based on how they have been populated i.e. either from a data warehouse (or) from any other data
sources.

Extraction, Transformation, and Transportation (ETT) is the process that is used to populate data mart’s
data from any source systems.

#1) Dependent Data Mart


In a dependent data mart, data is sourced from the existing data warehouse itself. This is a top-down
approach because the portion of data restructured into the data mart is extracted from the centralized data
warehouse.


A data mart can use DW data either logically or physically as shown below:
 Logical View: In this scenario, data mart’s data is not physically separated from the DW. It
refers to DW data through virtual views (or) tables logically.

 Physical subset: In this scenario, data mart’s data is physically separated from the DW. Once one
or more data marts are developed, you can allow the users to access only the data marts (or) to access
both Data marts and Data warehouses.

ETT is a simplified process in the case of dependent data marts because the usable data already
exists in the centralized DW. Only the appropriate set of summarized data needs to be moved to the
respective data marts.
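A minimal sketch of this movement, assuming the central DW exposes a wide sales fact table and the department needs only its own region, might look like the following in Python with pandas (all table and column names are illustrative):

import pandas as pd

# Assume the centralized DW exposes a wide sales fact (illustrative columns).
dw_sales = pd.DataFrame({
    "sale_date": ["2024-01-05", "2024-01-06", "2024-01-07"],
    "region":    ["South", "North", "South"],
    "product":   ["Car", "Bus", "Car"],
    "value":     [3.67, 5.33, 2.50],
})

# Dependent data mart for the South region: extract only that subset
# and pre-summarize it to the grain the department actually queries.
south_mart = (dw_sales[dw_sales["region"] == "South"]
              .groupby(["sale_date", "product"], as_index=False)["value"].sum())

# The mart could then be persisted to the department's own store,
# e.g. south_mart.to_parquet("south_sales_mart.parquet")
print(south_mart)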

An Image of Dependent Data Mart is shown below:


#2) Independent Data Mart


An independent data mart is best suitable for small departments in an organization. Here data is not
sourced from the existing data warehouse. The Independent data mart is neither dependent on enterprise
DW nor other data marts.

Independent data marts are stand-alone systems where data is extracted, transformed and loaded from
external (or) internal data sources. They are easy to design and maintain as long as they support simple
department-wise business needs.

For independent data marts, you have to work through each phase of the ETT process in much the same
way as data is processed into the centralized DW. However, the number of sources and the volume of data
populated into the data marts may be smaller.

Pictorial representation of an Independent Data Mart:


#3) Hybrid Data Mart


In a hybrid data mart, data is integrated from both the DW and other operational systems. Hybrid data
marts are flexible, with large storage structures. They can also refer to other data marts' data.

Pictorial representation of a Hybrid Data Mart:

Implementation Steps Of A Data Mart


The implementation of Data Mart which is considered to be a bit complex is explained in the below steps:

 Designing: The designing phase starts when business users request a data mart. It involves
requirements gathering, identifying the appropriate data from the respective data sources, and creating the
logical and physical data structures and ER diagrams.
 Constructing: The team will design all tables, views, indexes, etc., in the data mart system.
 Populating: Data will be extracted, transformed and loaded into data mart along with metadata.
 Accessing: Data Mart data is available to be accessed by the end-users. They can query the data
for their analysis and reports.
 Managing: This involves various managerial tasks such as user access controls, data mart
performance fine-tuning, maintaining existing data marts and creating data mart recovery
scenarios in case the system fails.


Structure Of A Data Mart


The structure of each data mart is created as per the requirement. Data Mart structures are called Star joins.
This structure will differ from one data mart to another.

Star joins are multi-dimensional structures that are formed with fact and dimension tables to support large
amounts of data. Star join will have a fact table in the center surrounded by the dimension tables.

Respective fact table data is associated with dimension tables’ data with a foreign key reference. A fact table
can be surrounded by 20-30 dimension tables.

Similar to the DW system, in star joins as well, the fact tables contain only numerical data and the
respective textual data is described in dimension tables. This structure resembles a star schema in DW.

Pictorial representation of a Star Join Structure.

But the granular data from the centralized DW is the base for any data mart's data. Many calculations
are performed on the normalized DW data to transform it into multidimensional data mart data,
which is stored in the form of cubes.

This works similarly as to how the data from legacy source systems is transformed into a normalized
DW data.

When Is A Pilot Data Mart Useful?


A pilot can be deployed in a small environment with a restricted number of users to check whether the
deployment is successful before the full-fledged deployment. However, this is not essential all the time.
Pilot deployments are of no use once their purpose is met.


You need to consider the below scenarios that recommend for the pilot deployment:
 If the end-users are new to the Data warehouse system.
 If the end-users want to feel comfortable to retrieve data/reports by themselves before going to
production.
 If the end-users want hands-on with the latest tools (or) technologies.
 If the management wants to see the benefits as a proof of concept before making it a big release.
 If the team wants to ensure that all ETL components (or) infrastructure components work well before
the release.

Drawbacks Of Data Mart


Though data marts have some benefits over the DW, they also have some drawbacks as explained below:

 Unwanted data marts that have been created are tough to maintain.
 Data marts are meant for small business needs. Increasing the size of a data mart will decrease its
performance.
 If you create a large number of data marts, then management should properly take care of
their versioning, security, and performance.
 Data marts may contain historical (or) summarized (or) detailed data. However, because updates to DW data
and data mart data may not happen at the same time, data inconsistency issues can arise.
Partitioning Strategy
Partitioning is done to enhance performance and facilitate easy management of data. Partitioning also
helps in balancing the various requirements of the system. It optimizes the hardware performance and
simplifies the management of the data warehouse by partitioning each fact table into multiple separate
partitions. In this chapter, we will discuss different partitioning strategies.
Why is it Necessary to Partition?
Partitioning is important for the following reasons −

 For easy management,


 To assist backup/recovery,
 To enhance performance.

For Easy Management


The fact table in a data warehouse can grow up to hundreds of gigabytes in size. This huge size of fact table
is very hard to manage as a single entity. Therefore it needs partitioning.
To Assist Backup/Recovery
If we do not partition the fact table, then we have to load the complete fact table with all the data.
Partitioning allows us to load only as much data as is required on a regular basis. It reduces the time to load
and also enhances the performance of the system.
Note − To cut down on the backup size, all partitions other than the current partition can be marked as
read-only. We can then put these partitions into a state where they cannot be modified. Then they can be
backed up. It means only the current partition is to be backed up.


To Enhance Performance
By partitioning the fact table into sets of data, the query procedures can be enhanced. Query performance
is enhanced because now the query scans only those partitions that are relevant. It does not have to scan
the whole data.

Horizontal Partitioning
There are various ways in which a fact table can be partitioned. In horizontal partitioning, we have to
keep in mind the requirements for manageability of the data warehouse.

Partitioning by Time into Equal Segments


In this partitioning strategy, the fact table is partitioned on the basis of time period. Here each time period
represents a significant retention period within the business. For example, if the user queries for month
to date data then it is appropriate to partition the data into monthly segments. We can reuse the
partitioned tables by removing the data in them.
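A simple sketch of partitioning by time into equal (monthly) segments, using pandas and made-up fact rows, is shown below; in a real warehouse each partition would be a separate physical table or file rather than an in-memory DataFrame.

import pandas as pd

fact = pd.DataFrame({
    "sales_date": pd.to_datetime(["2013-08-03", "2013-09-03", "2013-09-15"]),
    "product_id": [30, 35, 40],
    "value":      [3.67, 5.33, 2.50],
})

# Horizontal partitioning by time: one segment per month.
partitions = {str(period): part
              for period, part in fact.groupby(fact["sales_date"].dt.to_period("M"))}

# A month-to-date query now scans only the relevant partition.
print(partitions["2013-09"])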

Partition by Time into Different-sized Segments


This kind of partition is done where the aged data is accessed infrequently. It is implemented as a set of
small partitions for relatively current data and a larger partition for inactive data.

Points to Note
 The detailed information remains available online.
 The number of physical tables is kept relatively small, which reduces the operating
cost.
 This technique is suitable where a mix of data dipping into recent history and data mining
through the entire history is required.
 This technique is not useful where the partitioning profile changes on a regular basis,
because repartitioning will increase the operation cost of data warehouse.


Partition on a Different Dimension


The fact table can also be partitioned on the basis of dimensions other than time such as product group,
region, supplier, or any other dimension. Let's have an example.
Suppose a marketing function has been structured into distinct regional departments, for example on a state-by-state
basis. If each region wants to query information captured within its region, it would prove more
effective to partition the fact table into regional partitions. This will speed up the queries because
they do not need to scan information that is not relevant.
Points to Note
 The query does not have to scan irrelevant data which speeds up the query process.
 This technique is appropriate only where the dimension is unlikely to change in the future.
So, it is worth determining that the dimension does not change in the future.
 If the dimension changes, then the entire fact table would have to be repartitioned.

Note − We recommend performing the partition only on the basis of the time dimension, unless you are
certain that the suggested dimension grouping will not change within the life of the data warehouse.
Partition by Size of Table
When there is no clear basis for partitioning the fact table on any dimension, we should partition
the fact table on the basis of its size. We can set a predetermined size as a critical point. When the
table exceeds the predetermined size, a new table partition is created.
Points to Note
 This partitioning is complex to manage.
 It requires metadata to identify what data is stored in each partition.

Partitioning Dimensions
If a dimension contains a large number of entries, then it may be required to partition the dimension. Here we
have to check the size of the dimension.
Consider a large design that changes over time. If we need to store all the variations in order to apply
comparisons, that dimension may become very large. This would definitely affect the response time.
Round Robin Partitions
In the round robin technique, when a new partition is needed, the old one is archived. It uses metadata
to allow the user access tool to refer to the correct table partition.
This technique makes it easy to automate table management facilities within the data warehouse.
Vertical Partition
Vertical partitioning splits the data vertically. The following image depicts how vertical
partitioning is done.


Vertical partitioning can be performed in the following two ways:


 Normalization
 Row Splitting

Normalization
Normalization is the standard relational method of database organization. In this method, the
rows are collapsed into a single row, hence it reduce space. Take a look at the following tables
that show how normalization is performed.
Table before Normalization

Product_id Qty Value sales_date Store_id Store_name Location Region

30 5 3.67 3-Aug-13 16 sunny Bangalore S

35 4 5.33 3-Sep-13 16 sunny Bangalore S

40 5 2.50 3-Sep-13 64 san Mumbai W

45 7 5.66 3-Sep-13 16 sunny Bangalore S

Table after Normalization


Store_id Store_name Location Region

16 Sunny Bangalore S

64 San Mumbai W

Product_id Quantity Value sales_date Store_id

30 5 3.67 3-Aug-13 16

35 4 5.33 3-Sep-13 16

40 5 2.50 3-Sep-13 64

45 7 5.66 3-Sep-13 16
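The same normalization can be expressed programmatically. The sketch below, using pandas and the rows from the tables above, splits the repeating store columns into a store table and keeps only the Store_id foreign key with the sales rows:

import pandas as pd

# De-normalized table (store details repeated on every row).
before = pd.DataFrame({
    "Product_id": [30, 35, 40, 45],
    "Qty":        [5, 4, 5, 7],
    "Value":      [3.67, 5.33, 2.50, 5.66],
    "sales_date": ["3-Aug-13", "3-Sep-13", "3-Sep-13", "3-Sep-13"],
    "Store_id":   [16, 16, 64, 16],
    "Store_name": ["sunny", "sunny", "san", "sunny"],
    "Location":   ["Bangalore", "Bangalore", "Mumbai", "Bangalore"],
    "Region":     ["S", "S", "W", "S"],
})

# Vertical partitioning by normalization: the repeating store columns are
# moved to a separate table keyed by Store_id and stored once per store.
store_dim = (before[["Store_id", "Store_name", "Location", "Region"]]
             .drop_duplicates().reset_index(drop=True))
sales_fact = before[["Product_id", "Qty", "Value", "sales_date", "Store_id"]]

# The original rows can be rebuilt with a join when needed.
rebuilt = sales_fact.merge(store_dim, on="Store_id")
print(store_dim, sales_fact, sep="\n\n")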

Row Splitting
Row splitting tends to leave a one-to-one map between partitions. The motive of row splitting is to speed
up access to a large table by reducing its size.
Note − While using vertical partitioning, make sure that there is no requirement to perform a major join
operation between two partitions.
Identify Key to Partition
It is very crucial to choose the right partition key. Choosing a wrong partition key will lead to reorganizing
the fact table. Let's have an example. Suppose we want to partition the following table.
Account_Txn_Table
transaction_id
account_id
transaction_type
value
transaction_date
region
branch_name
We can choose to partition on any key. The two possible keys could be

 Region
 Transaction date
Suppose the business is organized in 30 geographical regions and each region has a different number of
branches. That will give us 30 partitions, which is reasonable. This partitioning is good enough because
our requirements capture has shown that a vast majority of queries are restricted to the user's own
business region.
If we partition by transaction date instead of region, then the latest transaction from every region will be
in one partition. Now the user who wants to look at data within his own region has to query across
multiple partitions. Hence it is worth determining the right partitioning key.
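The sketch below contrasts the two candidate keys on a toy Account_Txn_Table; the rows and region names are made up, but it shows why a regional query touches one partition under region partitioning and many partitions under date partitioning:

import pandas as pd

txn = pd.DataFrame({
    "transaction_id":   [1, 2, 3, 4],
    "region":           ["North", "North", "South", "East"],
    "transaction_date": pd.to_datetime(
        ["2024-03-01", "2024-03-02", "2024-03-01", "2024-03-02"]),
    "value":            [100, 250, 75, 60],
})

# Partition key = region: a user's query for their own region touches one partition.
by_region = dict(tuple(txn.groupby("region")))
north_only = by_region["North"]                      # single-partition scan

# Partition key = transaction_date: the same regional query now has to be
# answered by scanning every date partition and filtering each one.
by_date = dict(tuple(txn.groupby(txn["transaction_date"].dt.date)))
north_scattered = pd.concat(p[p["region"] == "North"] for p in by_date.values())
print(len(by_region), len(by_date), len(north_scattered))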


UNIT-IV

DIMENSIONAL MODELING AND SCHEMA

Dimensional Modeling- Multi-Dimensional Data Modeling – Data Cube- Star Schema-


Snowflake schema- Star Vs Snowflake schema- Fact constellation Schema- Schema
Definition - Process Architecture- Types of Data Base Parallelism – Data warehouse Tools

Multidimensional Model:
A multidimensional model views data in the form of a data-cube. A data cube enables data to be modelled
and viewed in multiple dimensions. It is defined by dimensions and facts.
The dimensions are the perspectives or entities concerning which an organization keeps records. For
example, a shop may create a sales data warehouse to keep records of the store's sales for the dimensions time,
item, and location. These dimensions allow the store to keep track of things such as monthly sales of
items and the locations at which the items were sold. Each dimension has a table related to it, called a
dimensional table, which describes the dimension further. For example, a dimensional table for an item may
contain the attributes item name, brand, and type.
A multidimensional data model is organized around a central theme, for example, sales. This theme is
represented by a fact table. Facts are numerical measures. The fact table contains the names of the facts or
measures of the related dimensional tables.


Consider the data of a shop for items sold per quarter in the city of Delhi. The data is shown in the table. In
this 2D representation, the sales for Delhi are shown for the time dimension (organized in quarters) and the
item dimension (classified according to the types of items sold). The fact or measure displayed is rupees
sold (in thousands).

Now, suppose we want to view the sales data with a third dimension. For example, suppose the data according to
time and item, as well as location, is considered for the cities Chennai, Kolkata, Mumbai, and Delhi.
These 3D data are shown in the table. The 3D data of the table are represented as a series of 2D tables.

Conceptually, it may also be represented by the same data in the form of a 3D data cube, as shown in fig:
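Since the cube is easier to grasp with a concrete computation, here is a minimal pandas sketch that builds the same kind of cube; the items, cities, and rupee figures are illustrative and not the values from the table above:

import pandas as pd

# Shop sales at (time, item, location) granularity; figures are made up.
sales = pd.DataFrame({
    "quarter":     ["Q1", "Q1", "Q2", "Q2", "Q1", "Q2"],
    "item":        ["Phone", "TV", "Phone", "TV", "Phone", "TV"],
    "location":    ["Delhi", "Delhi", "Delhi", "Delhi", "Chennai", "Chennai"],
    "rupees_sold": [605, 825, 680, 952, 818, 728],
})

# The 3-D cube rendered as a series of 2-D tables: one (quarter x item)
# slice per location, exactly as the notes describe.
cube = sales.pivot_table(index="quarter", columns=["location", "item"],
                         values="rupees_sold", aggfunc="sum")
print(cube)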


Working on a Multidimensional Data Model


The following stages should be followed by every project for building a Multi-Dimensional Data Model
Stage 1: Assembling data from the client - In the first stage, a Multi-Dimensional Data Model collects
correct data from the client. Typically, software professionals make clear to the client the range
of data that can be gained with the selected technology and collect the complete data in detail.

Stage 2: Grouping different segments of the system - In the second stage, the Multi-Dimensional Data
Model recognizes and classifies all the data to the respective section they belong to and also builds it
problem-free to apply step by step.

Stage 3: Noticing the different proportions - The third stage forms the basis on which the design of the
system is built. In this stage, the main factors are recognized according to the user's point of view. These
factors are also known as “Dimensions”.

Stage 4: Preparing the actual-time factors and their respective qualities - In the fourth stage, the
factors which are recognized in the previous step are used further for identifying the related qualities.
These qualities are also known as “attributes” in the database.

Stage 5: Finding the actuality of factors which are listed previously and their qualities - In the fifth
stage, a Multi-Dimensional Data Model separates the facts from the factors (dimensions) which
were collected earlier. These facts play a significant role in the arrangement of a Multi-Dimensional Data
Model.

Stage 6: Building the Schema to place the data, with respect to the information collected from the
steps above - In the sixth stage, on the basis of the data which was collected previously, a Schema is
built.


Features of multidimensional data models:


Measures: Measures are numerical data that can be analyzed and compared, such as sales or revenue. They
are typically stored in fact tables in a multidimensional data model.
Dimensions: Dimensions are attributes that describe the measures, such as time, location, or product. They
are typically stored in dimension tables in a multidimensional data model.
Cubes: Cubes are structures that represent the multidimensional relationships between measures and
dimensions in a data model. They provide a fast and efficient way to retrieve and analyze data.
Aggregation: Aggregation is the process of summarizing data across dimensions and levels of detail. This
is a key feature of multidimensional data models, as it enables users to quickly analyze data at different levels
of granularity.
Drill-down and roll-up: Drill-down is the process of moving from a higher-level summary of data to a lower
level of detail, while roll-up is the opposite process of moving from a lower-level detail to a higher-level
summary. These features enable users to explore data in greater detail and gain insights into the underlying
patterns.
Hierarchies: Hierarchies are a way of organizing dimensions into levels of detail. For example, a time
dimension might be organized into years, quarters, months, and days. Hierarchies provide a way to navigate
the data and perform drill-down and roll-up operations.
OLAP (Online Analytical Processing): OLAP is a type of multidimensional data model that supports fast
and efficient querying of large datasets. OLAP systems are designed to handle complex queries and provide
fast response times.
Advantages of Multi-Dimensional Data Model
The following are the advantages of a multi-dimensional data model:
 A multi-dimensional data model is easy to handle.
 It is easy to maintain.
 Its performance is better than that of normal databases (e.g. relational databases).
 The representation of data is better than traditional databases. That is because the multi-dimensional
databases are multi-viewed and carry different types of factors.
 It is workable on complex systems and applications, contrary to the simple one-dimensional
database systems.
 The compatibility of this type of database is an advantage for projects that have limited bandwidth for
maintenance staff.


Disadvantages of Multi-Dimensional Data Model


The following are the disadvantages of a Multi-Dimensional Data Model:
 The multi-dimensional data model is slightly complicated in nature and it requires professionals to
recognize and examine the data in the database.
 While a multi-dimensional data model is in use, system caching has a significant effect on the
working of the system.
 It is complicated in nature, due to which the databases are generally dynamic in design.
 The path to achieving the end product is complicated most of the time.
 As the multi-dimensional data model involves complicated systems with a large number of
databases, the system is very insecure when there is a security breach.
What is Data Cube?

When data is grouped or combined into multidimensional matrices, the result is called a data cube. The data cube method
has a few alternative names or variants, such as "Multidimensional databases," "materialized views,"
and "OLAP (On-Line Analytical Processing)."

The general idea of this approach is to materialize certain expensive computations that are frequently
inquired.

For example, a relation with the schema sales (part, supplier, customer, and sale-price) can be materialized
into a set of eight views as shown in fig, where psc indicates a view consisting of aggregate function value
(such as total-sales) computed by grouping three attributes part, supplier, and customer, p indicates a view
composed of the corresponding aggregate function values calculated by grouping part alone, etc.
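A small sketch of this idea, enumerating all 2^3 group-by combinations of the sales(part, supplier, customer, sale-price) relation with pandas, is given below; the rows are made up for illustration:

from itertools import combinations
import pandas as pd

sales = pd.DataFrame({
    "part":       ["p1", "p1", "p2"],
    "supplier":   ["s1", "s2", "s1"],
    "customer":   ["c1", "c1", "c2"],
    "sale_price": [10.0, 12.5, 7.0],
})

dims = ["part", "supplier", "customer"]

# The 2^3 = 8 group-by combinations of {part, supplier, customer} give the
# eight materializable views (psc, ps, pc, sc, p, s, c, and the grand total).
views = {}
for k in range(len(dims), -1, -1):
    for combo in combinations(dims, k):
        key = "".join(d[0] for d in combo) or "ALL"
        if combo:
            views[key] = sales.groupby(list(combo))["sale_price"].sum()
        else:
            views[key] = sales["sale_price"].sum()   # apex: total sales

print(sorted(views))   # ['ALL', 'c', 'p', 'pc', 'ps', 'psc', 's', 'sc']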

A data cube is created from a subset of attributes in the database. Specific attributes are chosen to be measure
attributes, i.e., the attributes whose values are of interest. Other attributes are selected as dimensions or
functional attributes. The measure attributes are aggregated according to the dimensions.


For example, XYZ may create a sales data warehouse to keep records of the store's sales for the dimensions
time, item, branch, and location. These dimensions enable the store to keep track of things like monthly sales
of items, and the branches and locations at which the items were sold. Each dimension may have a table
associated with it, known as a dimensional table, which describes the dimension. For example, a dimension
table for items may contain the attributes item name, brand, and type.
Data cube method is an interesting technique with many applications. Data cubes could be sparse in many
cases because not every cell in each dimension may have corresponding data in the database.
Techniques should be developed to handle sparse cubes efficiently.
If a query contains constants at even lower levels than those provided in a data cube, it is not clear how to
make the best use of the precomputed results stored in the data cube.
The model views data in the form of a data cube. OLAP tools are based on the multidimensional data model.
Data cubes usually model n-dimensional data.
A data cube enables data to be modelled and viewed in multiple dimensions. A multidimensional data model
is organized around a central theme, like sales and transactions. A fact table represents this theme. Facts are
numerical measures. Thus, the fact table contains measure (such as Rs. sold) and keys to each of the related
dimensional tables.
Dimensions and facts together define a data cube. Facts are generally quantities, which are used for analyzing
the relationships between dimensions.

Example: In the 2-D representation, we will look at the All Electronics sales data for items sold per quarter in
the city of Vancouver. The measured display in dollars sold (in thousands).


3-Dimensional Cuboids
Let suppose we would like to view the sales data with a third dimension. For example, suppose we would
like to view the data according to time, item as well as the location for the cities Chicago, New York, Toronto,
and Vancouver. The measured display in dollars sold (in thousands). These 3-D data are shown in the table.
The 3-D data of the table are represented as a series of 2-D tables.

Conceptually, we may represent the same data in the form of 3-D data cubes, as shown in fig:

Let us suppose that we would like to view our sales data with an additional fourth dimension, such as a
supplier.
In data warehousing, the data cubes are n-dimensional. The cuboid which holds the lowest level of
summarization is called a base cuboid.


For example, the 4-D cuboid in the figure is the base cuboid for the given time, item, location, and supplier
dimensions.

The figure shows a 4-D data cube representation of sales data, according to the dimensions time, item,
location, and supplier. The measure displayed is dollars sold (in thousands).
The topmost 0-D cuboid, which holds the highest level of summarization, is known as the apex cuboid. In
this example, this is the total sales, or dollars sold, summarized over all four dimensions.
The lattice of cuboids forms a data cube. The figure shows the lattice of cuboids creating a 4-D data cube for
the dimensions time, item, location, and supplier. Each cuboid represents a different degree of summarization.

Schemas Used in Data Warehouses: Star, Galaxy (Fact constellation), and Snowflake:

What Is a Data Warehouse Schema?

We can think of a data warehouse schema as a blueprint or an architecture of how data will be stored and
managed. A data warehouse schema isn’t the data itself, but the organization of how data is stored and how it
relates to other data components within the data warehouse architecture.

In the past, data warehouse schemas were often strictly enforced across an enterprise, but in modern
implementations where storage is increasingly inexpensive, schemas have become less constrained. Despite this
loosening or sometimes total abandonment of data warehouse schemas, knowledge of the foundational
schema designs can be important both to maintaining legacy resources and to creating modern data
warehouse designs that learn from the past.


The basic components of all data warehouse schemas are fact and dimension tables. The different combinations
of these two central elements compose almost the entirety of all data warehouse schema designs.
Fact Table

A fact table aggregates metrics, measurements, or facts about business processes. Fact tables
are connected to dimension tables to form a schema architecture representing how data relates within the data
warehouse. Fact tables store the primary keys of dimension tables as foreign keys within the fact table.

Dimension Table

Dimension tables store the data attributes or dimensions that describe the facts. As mentioned
above, the primary key of a dimension table is stored as a foreign key in the fact table. Dimension tables are
not joined together directly. Instead, they are joined via association through the central fact table.

3 Types of Schema Used in Data Warehouses

History presents us with three prominent types of data warehouse schema known as Star Schema, Snowflake
Schema, and Galaxy Schema. Each of these data warehouse schemas has unique design constraints and
describes a different organizational structure for how data is stored and how it relates to other data within the
data warehouse.


What Is a Star Schema in a Data Warehouse?

The star schema in a data warehouse is historically one of the most straightforward designs. This schema
follows some distinct design parameters, such as permitting only one central fact table and a handful of single-
dimension tables joined to it. In following these design constraints, the star schema resembles a star, with one
central table and the dimension tables joined around it (which is where the star schema got its name).

Star Schema is known to create denormalized dimension tables – a database structuring strategy that organizes
tables to introduce redundancy for improved performance. Denormalization intends to introduce redundancy
in additional dimensions so long as it improves query performance.

Characteristics of the Star Schema:


 Star data warehouse schemas create a denormalized database that enables quick querying responses
 The primary key in the dimension table is joined to the fact table by the foreign key
 Each dimension in the star schema maps to one dimension table
 Dimension tables within a star scheme are not to be connected directly
 Star schema creates denormalized dimension tables
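A minimal sketch of a star schema and a typical query against it, written with pandas DataFrames standing in for the fact and dimension tables (all names and figures are illustrative), follows:

import pandas as pd

# Denormalized dimension tables (names and attributes are illustrative).
dim_product = pd.DataFrame({"product_key": [1, 2],
                            "name": ["Phone", "TV"],
                            "brand": ["A", "B"],
                            "category": ["Electronics", "Electronics"]})
dim_date = pd.DataFrame({"date_key": [20240101, 20240102],
                         "quarter": ["Q1", "Q1"]})

# Central fact table stores measures plus foreign keys to each dimension.
fact_sales = pd.DataFrame({"date_key": [20240101, 20240101, 20240102],
                           "product_key": [1, 2, 1],
                           "amount": [300.0, 700.0, 320.0]})

# A typical star-schema query: join the fact to its dimensions, then aggregate.
report = (fact_sales
          .merge(dim_product, on="product_key")
          .merge(dim_date, on="date_key")
          .groupby(["quarter", "brand"], as_index=False)["amount"].sum())
print(report)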


What Is a Snowflake Schema?

The Snowflake Schema is a data warehouse schema that encompasses a logical arrangement of dimension
tables. This data warehouse schema builds on the star schema by adding additional sub-dimension tables that
relate to first-order dimension tables joined to the fact table.

Just like the relationship between the foreign key in the fact table and the primary key in the dimension table,
with the snowflake schema approach, a primary key in a sub-dimension table will relate to a foreign key within
the higher order dimension table.

Snowflake schema creates normalized dimension tables – a database structuring strategy that organizes tables
to reduce redundancy. The purpose of normalization is to eliminate any redundant data to reduce overhead.

Characteristics of the Snowflake Schema:


 Snowflake schemas are permitted to have dimension tables joined to other dimension tables
 Snowflake schemas are to have one fact table only
 Snowflake schemas create normalized dimension tables
 The normalized schema reduces the disk space required for running and managing this data warehouse
 Snowflake schemas offer an easier way to implement a dimension
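Continuing the same toy example, the sketch below snowflakes the product dimension by normalizing its category attribute into a sub-dimension table (again, all names and figures are illustrative):

import pandas as pd

# Snowflaking the product dimension: the category attribute is moved out
# into a sub-dimension table referenced by a foreign key.
dim_category = pd.DataFrame({"category_key": [10],
                             "category_name": ["Electronics"]})
dim_product = pd.DataFrame({"product_key": [1, 2],
                            "name": ["Phone", "TV"],
                            "category_key": [10, 10]})   # FK to sub-dimension
fact_sales = pd.DataFrame({"product_key": [1, 2, 1],
                           "amount": [300.0, 700.0, 320.0]})

# Queries now need one extra join compared with the star schema,
# but the category name is stored exactly once (no redundancy).
report = (fact_sales
          .merge(dim_product, on="product_key")
          .merge(dim_category, on="category_key")
          .groupby("category_name", as_index=False)["amount"].sum())
print(report)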


What Is a Galaxy Schema?

The Galaxy Data Warehouse Schema, also known as a Fact Constellation Schema, acts as the next iteration
of the data warehouse schema. Unlike the Star Schema and Snowflake Schema, the Galaxy Schema uses
multiple fact tables connected with shared, normalized dimension tables. The Galaxy Schema can be thought of as
interlinked star schemas that are completely normalized, avoiding any kind of redundancy or inconsistency of data.

Characteristics of the Galaxy Schema:


 Galaxy Schema is multidimensional, acting as a strong design consideration for complex database
systems
 Galaxy Schema reduces redundancy to near zero redundancy as a result of normalization
 Galaxy Schema is known for high data quality and accuracy and lends to effective reporting and
analytics
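A brief sketch of a galaxy (fact constellation) arrangement, with two fact tables sharing the same dimension tables, is shown below in the same illustrative style:

import pandas as pd

# Two fact tables (sales and shipping) sharing normalized dimensions.
dim_product = pd.DataFrame({"product_key": [1, 2], "name": ["Phone", "TV"]})
dim_date = pd.DataFrame({"date_key": [20240101, 20240102], "quarter": ["Q1", "Q1"]})

fact_sales = pd.DataFrame({"date_key": [20240101, 20240102],
                           "product_key": [1, 2],
                           "amount": [300.0, 700.0]})
fact_shipping = pd.DataFrame({"date_key": [20240102],
                              "product_key": [1],
                              "units_shipped": [5]})

# Each fact table is analysed through the same shared dimension tables.
sales_by_q = fact_sales.merge(dim_date, on="date_key") \
                       .groupby("quarter")["amount"].sum()
ship_by_q = fact_shipping.merge(dim_date, on="date_key") \
                         .groupby("quarter")["units_shipped"].sum()
print(sales_by_q, ship_by_q, sep="\n\n")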


Key Differences between Star, Snowflake, and Galaxy Schema:


Summary of Data Warehouse Schemas’:

To understand data warehouse schema and its various types at the conceptual level, here are a few things to
remember:

 Data warehouse schema is a blueprint for how data will be stored and managed. It includes
definitions of terms, relationships, and the arrangement of those terms and relationships.
 Star, galaxy, and snowflake are common types of data warehouse schema that vary in the
arrangement and design of the data relationships.
 Star schema is the simplest data warehouse schema and contains just one central table and a handful
of single-dimension tables joined together.
 Snowflake schema builds on star schema by adding sub-dimension tables, which eliminates
redundancy and reduces overhead costs.
 Galaxy schema uses multiple fact tables (Snowflake and Star use only one), which makes it like an
interlinked star schema. This nearly eliminates redundancy and is ideal for complex database
systems.

Which Data Warehouse Schema is Best?

There’s no one “best” data warehouse schema. The “best” schema depends on (among other things) your
resources, the type of data you’re working with, and what you’d like to do with it.

For instance, star schema is ideal for organizations that want maximum simplicity and can tolerate higher disk
space usage. But galaxy schema is more suitable for complex data aggregation. And snowflake schema could
be superior for an organization that wants lower data redundancy without the complexity of star schema.

How StreamSets’ Schema-agnostic Approach Makes Schemas Easy

Our agnostic approach to schema management means that StreamSets data pipeline tools can manage any kind
of schema – simple, complex or non-existent. Meaning, with StreamSets you don’t have to spend hours match-
ing the schema from a legacy origin into your destination, instead StreamSets can infer any kind of schema
without you having to lift a finger. If however, you want to enforce a schema and create hard and fast validation
rules, StreamSets can help you with that as well. Our flexibility in how we manage schemas means your data
teams have less to figure out on their own and more time to spend on what really matters: your data.


Data Warehouse Process Architecture

The process architecture defines an architecture in which the data from the data warehouse is processed for
a particular computation.

Following are the two fundamental process architectures:

Centralized Process Architecture


In this architecture, the data is collected into single centralized storage and processed upon completion by a
single machine with a huge structure in terms of memory, processor, and storage.

Centralized process architecture evolved with transaction processing and is well suited for small
organizations with one location of service.
It requires minimal resources both from people and system perspectives.

It is very successful when the collection and consumption of data occur at the same location.

Distributed Process Architecture

In this architecture, information and its processing are allocated across data centres; processing of data is
localized, and the results are grouped into centralized storage. Distributed architectures are used to overcome
the limitations of the centralized process architectures, where all the information needs to be collected in one
central location and results are available in one central location.


There are several architectures of the distributed process:

Client-Server

In this architecture, the user does all the information collecting and presentation, while the server does the
processing and management of data.

Three-tier Architecture

With client-server architecture, the client machines need to be connected to a server machine, thus mandating
finite states and introducing latencies and overhead in terms of records to be carried between clients and
servers. The three-tier architecture addresses this by introducing a middle tier (such as an application server)
between the clients and the database server.

N-tier Architecture

The n-tier or multi-tier architecture is where clients, middleware, applications, and servers are isolated into
multiple tiers.

Cluster Architecture

In this architecture, machines that are connected in a network (through software or hardware) work together to
process information or compute requirements in parallel. Each device in a
cluster is associated with a function that is processed locally, and the result sets are collected at a master
server that returns them to the user.

Peer-to-Peer Architecture

This is a type of architecture where there are no dedicated servers and clients. Instead, all the processing
responsibilities are allocated among all machines, called peers. Each machine can perform the function of a
client or a server, or just process data.


Types of Database Parallelism

Parallelism is used to support speedup, where queries are executed faster because more resources, such as
processors and disks, are provided. Parallelism is also used to provide scale-up, where increasing workloads
are managed without increasing response time, via an increase in the degree of parallelism.

Different architectures for parallel database systems are shared-memory, shared-disk, shared-nothing, and
hierarchical structures.

(a)Horizontal Parallelism: It means that the database is partitioned across multiple disks, and parallel
processing occurs within a specific task (i.e., table scan) that is performed concurrently on different
processors against different sets of data.

(b) Vertical Parallelism: It occurs among different tasks. All component query operations (i.e., scan, join, and
sort) are executed in parallel in a pipelined fashion. In other words, the output of one operation (e.g., a scan) is
fed to the next operation (e.g., a join) as soon as records become available.
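As a rough illustration of horizontal parallelism, the sketch below runs the same scan-and-aggregate task concurrently over different partitions of a toy fact table; the partitions and values are made up, and a real system would parallelize inside the database engine rather than in application code:

from concurrent.futures import ProcessPoolExecutor
import pandas as pd

# Three horizontal partitions of a toy fact table (e.g. one per month).
PARTITIONS = [
    pd.DataFrame({"value": [10.0, 20.0]}),
    pd.DataFrame({"value": [5.0, 7.5]}),
    pd.DataFrame({"value": [1.0]}),
]

def scan_partition(idx: int) -> float:
    """Per-partition task: scan one subset of the data and aggregate it."""
    return PARTITIONS[idx]["value"].sum()

if __name__ == "__main__":
    # Horizontal parallelism: the same operation (a table scan) runs
    # concurrently on different processors against different partitions;
    # the partial results are combined at the end. Vertical parallelism
    # would instead pipeline scan -> join -> sort stages.
    with ProcessPoolExecutor() as pool:
        partial_sums = list(pool.map(scan_partition, range(len(PARTITIONS))))
    print(sum(partial_sums))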

Intraquery Parallelism
Intraquery parallelism defines the execution of a single query in parallel on multiple processors and disks.
Using intraquery parallelism is essential for speeding up long-running queries.

Interquery Parallelism
In interquery parallelism, different queries or transactions execute in parallel with one another. This form of
parallelism does not speed up an individual query, since each query is still run sequentially, but it can
increase transaction throughput. The response times of individual transactions are not faster than they would
be if the transactions were run in isolation. Thus, the primary use of interquery parallelism is to scale up a
transaction processing system to support a greater number of transactions per second.

Intraquery parallelism, by contrast, decomposes a serial SQL query into lower-level operations such as scan,
join, sort, and aggregation, and these lower-level operations are executed concurrently, in parallel.


Database vendors started to take advantage of parallel hardware architectures by implementing multiserver
and multithreaded systems designed to handle a large number of client requests efficiently.

This approach naturally resulted in interquery parallelism, in which different server threads (or processes)
handle multiple requests at the same time.

Interquery parallelism has been successfully implemented on SMP systems, where it increased the
throughput and allowed the support of more concurrent users.

Data Warehouse Tools

The tools that allow accurate sourcing of data contents and formats from operational and external data stores into the data
warehouse have to perform several essential tasks, which include:
 Data consolidation and integration.
 Data transformation from one form to another form.
 Data transformation and calculation based on the function of business rules that force transformation.
 Metadata synchronization and management, which includes storing or updating metadata about
source files, transformation actions, loading formats, and events.

There are several selection criteria which should be considered while implementing a data warehouse:

1. The ability to identify the data in the data source environment that can be read by the tool is necessary.
2. Support for flat files, indexed files, and legacy DBMSs is critical.
3. The capability to merge records from multiple data stores is required in many installations.
4. A specification interface to indicate the information to be extracted and the conversions to apply is essential.
5. The ability to read information from repository products or data dictionaries is desired.
6. The code developed by the tool should be completely maintainable.
7. Selective data extraction of both data items and records enables users to extract only the required
data.
8. A field-level data examination for the transformation of data into information is needed.
9. The ability to perform data type and the character-set translation is a requirement when moving data
between incompatible systems.
10. The ability to create aggregation, summarization, and derivation fields and records is necessary.
11. Vendor stability and support for the products are components that must be evaluated carefully.


Data Warehouse Software Components:

A warehousing team will require different types of tools during a warehouse project. These software
products usually fall into one or more of the categories illustrated, as shown in the figure.

Extraction and Transformation


The warehouse team needs tools that can extract, transform, integrate, clean, and load information from a
source system into one or more data warehouse databases. Middleware and gateway products may be needed
for warehouses that extract records from host-based source systems.

Warehouse Storage
Software products are also needed to store warehouse data and their accompanying metadata. Relational
database management systems are well suited to large and growing warehouses.

Data access and retrieval


Different types of software are needed to access, retrieve, distribute, and present warehouse data to its
end-clients.


UNIT-V
SYSTEM & PROCESS MANAGERS
Data Warehousing System Managers: System Configuration Manager- System Scheduling
Manager - System Event Manager - System Database Manager - System Backup Recovery
Manager - Data Warehousing Process Managers: Load Manager – Warehouse Manager-
Query Manager – Tuning – Testing

System Managers: System management is mandatory for the successful implementation of a
data warehouse. The most important system managers are −
 System configuration manager
 System scheduling manager
 System event manager
 System database manager
 System backup recovery manager

System Configuration Manager


 The system configuration manager is responsible for the management of the setup and
configuration of data warehouse.
 The structure of the configuration manager varies from one operating system to another.
 On Unix, the structure of the configuration manager varies from vendor to vendor.
 Configuration managers have a single user interface.
 The interface of configuration manager allows us to control all aspects of the system.
 The most important configuration tool is the I/O manager.


System Scheduling Manager


System Scheduling Manager is responsible for the successful implementation of the data warehouse.
Its purpose is to schedule ad hoc queries. Every operating system has its own scheduler with some
form of batch control mechanism. The list of features a system scheduling manager must have is as
follows –
 Work across cluster or MPP boundaries
 Deal with international time differences
 Handle job failure
 Handle multiple queries
 Support job priorities
 Restart or re-queue the failed jobs
 Notify the user or a process when job is completed
 Maintain the job schedules across system outages
 Re-queue jobs to other queues
 Support the stopping and starting of queues
 Log Queued jobs
 Deal with inter-queue processing
Note − The above list can be used as evaluation parameters for the evaluation of a good scheduler.
Some important jobs that a scheduler must be able to handle are as follows −
 Daily and ad hoc query scheduling
 Execution of regular report requirements
 Data load
 Data processing
 Index creation
 Backup
 Aggregation creation
 Data transformation
Note − If the data warehouse is running on a cluster or MPP architecture, then the system
scheduling manager must be capable of running across the architecture.

System Event Manager
The event manager is a kind of software. It manages the events that are defined
on the data warehouse system. We cannot manage the data warehouse manually because the
structure of data warehouse is very complex. Therefore we need a tool that automatically handles all
the events without any intervention of the user.
Note − The Event manager monitors the events occurrences and deals with them. The event
manager also tracks the myriad of things that can go wrong on this complex data warehouse system.
Events
Events are the actions that are generated by the user or the system itself. It may be noted that the
event is a measurable, observable, occurrence of a defined action.
Given below is a list of common events that are required to be tracked.
 Hardware failure
 Running out of space on certain key disks
 A process dying
 A process returning an error
 CPU usage exceeding an 80% threshold
 Internal contention on database serialization points
 Buffer cache hit ratios exceeding or falling below a threshold
 A table reaching the maximum of its size
 Excessive memory swapping
 A table failing to extend due to lack of space
 Disk exhibiting I/O bottlenecks
 Usage of temporary or sort area reaching a certain threshold
 Any other database shared memory usage
The most important thing about events is that they should be capable of executing on their own.
Event packages define the procedures for the predefined events. The code associated with each
event is known as event handler. This code is executed whenever an event occurs.

System and Database Manager
System and database manager may be two separate pieces of software, but they do the same job.
The objective of these tools is to automate certain processes and to simplify the execution of others.
The criteria for choosing a system and the database manager are as follows −
 increase user's quota.
 assign and de-assign roles to the users
 assign and de-assign the profiles to the users
 perform database space management
 monitor and report on space usage
 tidy up fragmented and unused space
 add and expand the space
 add and remove users
 manage user password
 manage summary or temporary tables
 assign or deassign temporary space to and from the user
 reclaim the space form old or out-of-date temporary tables
 manage error and trace logs
 to browse log and trace files
 redirect error or trace information
 switch on and off error and trace logging
 perform system space management
 monitor and report on space usage
 clean up old and unused file directories
 add or expand space.

System Backup Recovery Manager
The backup and recovery tool makes it easy for operations and management staff to back up the data. Note that the system backup manager must be integrated with the schedule manager software being used. The important features required for the management of backups are as follows −
 Scheduling
 Backup data tracking
 Database awareness
Backups are taken only to protect against data loss. Following are the important points to remember −
 The backup software will keep some form of database of where and when the piece of data
was backed up.
 The backup recovery manager must have a good front-end to that database.
 The backup recovery software should be database aware.
 Being aware of the database, the software then can be addressed in database terms, and will
not perform backups that would not be viable.
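A minimal sketch of backup data tracking is shown below: the backup manager keeps a small catalogue of where and when each piece of data was backed up so that the recovery front-end can query it. The catalogue structure, object names, and file paths are hypothetical assumptions.

    # Illustrative backup catalogue: tracks what was backed up, where and when.
    import datetime

    backup_catalog = []   # in practice this would itself be a small database

    def record_backup(object_name, backup_file, media):
        backup_catalog.append({
            "object": object_name,
            "file": backup_file,
            "media": media,
            "taken_at": datetime.datetime.now().isoformat(timespec="seconds"),
        })

    record_backup("sales_fact", "/backup/sales_fact_2024_01_15.dmp", "tape")
    record_backup("customer_dim", "/backup/customer_dim_2024_01_15.dmp", "disk")

    # The recovery front-end can then ask, for example, where the latest copy
    # of a given table is held.
    latest = max((b for b in backup_catalog if b["object"] == "sales_fact"),
                 key=lambda b: b["taken_at"])
    print("Restore sales_fact from:", latest["file"], "on", latest["media"])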

Process managers:
Process managers are responsible for maintaining the flow of data both into and out of the
data warehouse. There are three different types of process managers −
 Load manager
 Warehouse manager
 Query manager

Data Warehouse Load Manager
The load manager performs the operations required to extract the data and load it into the database. The size and complexity of a load manager varies from one data warehouse solution to another.

Load Manager Architecture
The load manager performs the following functions −
 Extract data from the source system.
 Fast load the extracted data into a temporary data store.
 Perform simple transformations into a structure similar to the one in the data warehouse.
Extract Data from Source
The data is extracted from the operational databases or from external information providers. Gateways are the application programs used to extract the data. A gateway is supported by the underlying DBMS and allows a client program to generate SQL to be executed at the server. Open Database Connectivity (ODBC) and Java Database Connectivity (JDBC) are examples of gateways.
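For illustration, the sketch below extracts data from a source system through an ODBC gateway using the pyodbc package. The DSN, credentials, source table, and column names are hypothetical assumptions.

    # Illustrative extraction through an ODBC gateway (requires the pyodbc package).
    # DSN name, credentials and the source table are hypothetical assumptions.
    import pyodbc

    conn = pyodbc.connect("DSN=source_oltp;UID=etl_user;PWD=secret")
    cursor = conn.cursor()

    # The client program generates SQL that is executed at the source server.
    cursor.execute("SELECT sale_id, product_id, quantity, amount, sale_date "
                   "FROM pos_sales WHERE sale_date = ?", "2024-01-15")

    rows = cursor.fetchall()          # extracted records, ready for fast load
    print(f"Extracted {len(rows)} rows from the source system")
    conn.close()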
Fast Load
 In order to minimize the total load window, the data needs to be loaded into the warehouse in
the fastest possible time.
 Transformations affect the speed of data processing.
 It is more effective to load the data into a relational database prior to applying transformations
and checks.
 Gateway technology is not suitable, since it is inefficient when large data volumes are involved.
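The sketch below illustrates the idea of a fast load into a temporary staging table using a single bulk call instead of row-by-row inserts. SQLite is used only so the example runs as-is; a production warehouse would use the DBMS's native bulk-load utility, and the table and column names are assumptions.

    # Illustrative fast load: bulk insert into a temporary staging table.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE stage_sales (sale_id INTEGER, amount REAL)")

    extracted_rows = [(1, 10.5), (2, 99.0), (3, 7.25)]   # from the extract step

    # One bulk call instead of one INSERT per row keeps the load window short.
    conn.executemany("INSERT INTO stage_sales VALUES (?, ?)", extracted_rows)
    conn.commit()

    print(conn.execute("SELECT COUNT(*) FROM stage_sales").fetchone()[0], "rows staged")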
Simple Transformations
While loading, it may be required to perform simple transformations. After completing the simple transformations, we can do complex checks. Suppose we are loading the EPOS sales transactions; we need to perform the following checks −
 Strip out all the columns that are not required within the warehouse.
 Convert all the values to required data types.
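A minimal sketch of these two checks is given below; the column names and target data types are assumptions based on the EPOS example.

    # Illustrative simple transformations on extracted EPOS records:
    # strip out unneeded columns and convert values to the required data types.
    # Column names and target types are hypothetical assumptions.
    REQUIRED_COLUMNS = {"sale_id": int, "amount": float, "sale_date": str}

    def transform(record):
        # Keep only the columns required in the warehouse and cast each value.
        return {col: cast(record[col]) for col, cast in REQUIRED_COLUMNS.items()}

    raw = {"sale_id": "42", "amount": "19.99", "sale_date": "2024-01-15",
           "till_operator": "E017", "store_printer_id": "P3"}   # extra columns dropped

    print(transform(raw))   # {'sale_id': 42, 'amount': 19.99, 'sale_date': '2024-01-15'}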
Warehouse Manager
The warehouse manager is responsible for the warehouse management process. It consists of
third-party system software, C programs, and shell scripts. The size and complexity of a warehouse manager varies from one solution to another.
Warehouse Manager Architecture
A warehouse manager includes the following −
 The controlling process
 Stored procedures or C with SQL
 Backup/Recovery tool
 SQL scripts

Functions of Warehouse Manager
A warehouse manager performs the following functions −
 Analyzes the data to perform consistency and referential integrity checks.
 Creates indexes, business views, partition views against the base data.
 Generates new aggregations and updates the existing aggregations.
 Generates normalizations.
 Transforms and merges the source data of the temporary store into the published data
warehouse.
 Backs up the data in the data warehouse.
 Archives the data that has reached the end of its captured life.
Note − A warehouse manager analyzes query profiles to determine whether the indexes and aggregations are appropriate.
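As an illustration, the fragment below lists the kind of statements a warehouse manager might issue when publishing newly loaded data: a referential integrity check, an index, and an aggregation. The table and column names are hypothetical assumptions and the SQL may need adapting to the DBMS in use.

    # Illustrative warehouse manager steps after a load (assumed table/column names).
    PUBLISH_STEPS = [
        # referential integrity check between staged facts and a dimension
        "SELECT COUNT(*) FROM stage_sales s "
        "LEFT JOIN product_dim p ON s.product_id = p.product_id "
        "WHERE p.product_id IS NULL",
        # index creation against the base data
        "CREATE INDEX idx_sales_date ON sales_fact (sale_date)",
        # aggregation creation
        "CREATE TABLE sales_by_product AS "
        "SELECT product_id, SUM(amount) AS total_amount "
        "FROM sales_fact GROUP BY product_id",
    ]

    for step in PUBLISH_STEPS:
        print(step + ";")   # in practice these would run through the DBMS client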
Query Manager
The query manager is responsible for directing the queries to suitable tables. By directing the
queries to appropriate tables, it speeds up the query request and response process. In addition,
the query manager is responsible for scheduling the execution of the queries posted by the
user.
Query Manager Architecture
A query manager includes the following components −
 Query redirection via C tool or RDBMS
 Stored procedures
 Query management tool
 Query scheduling via C tool or RDBMS
 Query scheduling via third-party software

Functions of Query Manager
 It presents the data to the user in a form they understand.
 It schedules the execution of the queries posted by the end-user.
 It stores query profiles to allow the warehouse manager to determine which indexes and
aggregations are appropriate.
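A minimal sketch of query redirection is given below: a query that can be answered from an aggregation table is pointed at that table instead of the base fact table. The table names and the deliberately crude matching rule are assumptions.

    # Illustrative query redirection: send a query to the aggregate table when
    # it can be answered there, otherwise to the base fact table.
    def redirect(query: str) -> str:
        if "GROUP BY product_id" in query and "sales_fact" in query:
            return query.replace("sales_fact", "sales_by_product")
        return query

    user_query = "SELECT product_id, SUM(amount) FROM sales_fact GROUP BY product_id"
    print(redirect(user_query))   # now runs against the smaller aggregation table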
DATA WAREHOUSE TUNING:
A data warehouse keeps evolving, and it is unpredictable what query the user is going to post in the future. Therefore it becomes more difficult to tune a data warehouse system. In this section, we discuss how to tune the different aspects of a data warehouse, such as performance, data load, and queries.
Difficulties in Data Warehouse Tuning
Tuning a data warehouse is a difficult procedure due to following reasons −
 Data warehouse is dynamic; it never remains constant.
 It is very difficult to predict what query the user is going to post in the future.
 Business requirements change with time.
 Users and their profiles keep changing.
 The user can switch from one group to another.
 The data load on the warehouse also changes with time.
Performance Assessment
Here is a list of objective measures of performance −
 Average query response time
 Scan rates
 Time used per query
 Memory usage per process
 I/O throughput rates
Following are the points to remember.
 It is necessary to specify the measures in service level agreement (SLA).
 It is of no use trying to tune response times if they are already better than those required.
 It is essential to have realistic expectations while making performance assessment.
 It is also essential that the users have feasible expectations.
 To hide the complexity of the system from the user, aggregations and views should be used.
 It is also possible that the user can write a query you had not tuned for.
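As an illustration, the fragment below computes the average query response time from a hypothetical query log and compares it with an assumed SLA figure; this is the kind of objective measure referred to above.

    # Illustrative performance measure: average query response time from a log.
    # The log entries (query name, elapsed seconds) are hypothetical sample data.
    query_log = [("daily_sales_report", 12.4),
                 ("stock_by_region", 8.1),
                 ("ad_hoc_customer_drill", 95.0)]

    average_response = sum(t for _, t in query_log) / len(query_log)
    print(f"Average query response time: {average_response:.1f} s")

    # Compare against the figure agreed in the SLA (assumed value).
    SLA_SECONDS = 60
    print("Within SLA" if average_response <= SLA_SECONDS else "SLA breached")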
Data Load Tuning
Data load is a critical part of overnight processing. Nothing else can run until data load is complete.
This is the entry point into the system.
Note − If there is a delay in transferring the data, or in the arrival of data, then the entire system is affected badly. Therefore it is very important to tune the data load first.
There are various approaches of tuning data load that are discussed below −
 The most common approach is to insert data using the SQL layer. In this approach, normal checks and constraints need to be performed. When the data is inserted into the table, the code will run to check for enough space to insert the data. If sufficient space is not available, then more space may have to be allocated to these tables. These checks take time to perform and are costly in CPU.
 The second approach is to bypass all these checks and constraints and place the data directly
into the preformatted blocks. These blocks are later written to the database. It is faster than the
first approach, but it can work only with whole blocks of data. This can lead to some space
wastage.
 The third approach is that, while loading the data into a table that already contains data, we can maintain the indexes.
 The fourth approach is to drop the indexes before loading data into tables that already contain data, and to recreate them when the data load is complete (a minimal sketch follows this list). The choice between the third and the fourth approach depends on how much data is already loaded and how many indexes need to be rebuilt.
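A minimal sketch of the fourth approach (drop the indexes, load, recreate the indexes) is shown below. SQLite is used only so the sketch runs as-is; the table and index names are assumptions and the same pattern applies to any DBMS.

    # Illustrative "drop indexes, load, recreate indexes" load pattern.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales_fact (sale_id INTEGER, amount REAL)")
    conn.execute("CREATE INDEX idx_sales_id ON sales_fact (sale_id)")

    new_rows = [(i, i * 1.5) for i in range(10000)]

    conn.execute("DROP INDEX idx_sales_id")                               # 1. drop indexes
    conn.executemany("INSERT INTO sales_fact VALUES (?, ?)", new_rows)    # 2. load the data
    conn.execute("CREATE INDEX idx_sales_id ON sales_fact (sale_id)")     # 3. rebuild indexes
    conn.commit()

    print(conn.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0], "rows loaded")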
Integrity Checks
Integrity checking highly affects the performance of the load. Following are the points to remember −
 Integrity checks need to be limited because they require heavy processing power.
 Integrity checks should be applied on the source system to avoid performance degrade of data
load.
Tuning Queries
We have two kinds of queries in data warehouse −
 Fixed queries
 Ad hoc queries
Fixed Queries
Fixed queries are well defined. Following are the examples of fixed queries −
 Regular reports
 Canned queries
 Common aggregations
Tuning the fixed queries in a data warehouse is same as in a relational database system. The only
difference is that the amount of data to be queried may be different. It is good to store the most
successful execution plans while testing fixed queries. Storing these execution plans will allow us to
spot changing data size and data skew, as it will cause the execution plan to change.
Note − We cannot do much more with the fact table, but when dealing with dimension tables or the aggregations, the usual collection of SQL tweaking, storage mechanisms, and access methods can be used to tune these queries.
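For illustration, the sketch below captures the execution plan of a fixed query so that it can be stored and compared on later runs. SQLite's EXPLAIN QUERY PLAN is used purely as a runnable stand-in for the plan facility of the warehouse DBMS, and the query and table are assumptions.

    # Illustrative capture of a fixed query's execution plan.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales_fact (product_id INTEGER, amount REAL)")
    conn.execute("CREATE INDEX idx_product ON sales_fact (product_id)")

    fixed_query = "SELECT product_id, SUM(amount) FROM sales_fact GROUP BY product_id"

    # Store the plan; if data volume or skew later changes the plan, the stored
    # copy makes the change easy to spot during testing.
    plan = conn.execute("EXPLAIN QUERY PLAN " + fixed_query).fetchall()
    for row in plan:
        print(row)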
Ad hoc Queries
To understand ad hoc queries, it is important to know the ad hoc users of the data warehouse. For each
user or group of users, you need to know the following −
 The number of users in the group
 Whether they use ad hoc queries at regular intervals of time
 Whether they use ad hoc queries frequently
 Whether they use ad hoc queries occasionally at unknown intervals.
 The maximum size of query they tend to run
 The average size of query they tend to run
 Whether they require drill-down access to the base data
 The elapsed login time per day
 The peak time of daily usage
 The number of queries they run per peak hour
Summary
 It is important to track the user's profiles and identify the queries that are run on a regular basis.
 It is also important that the tuning performed does not adversely affect the overall performance.
 Identify similar and ad hoc queries that are frequently run.
 If these queries are identified, then the database can be changed and new indexes can be added for those queries.
 If these queries are identified, then new aggregations can be created specifically for those
queries that would result in their efficient execution.

FIG: SNOWFLAKE PERFORMANCE TUNING

DATA WAREHOUSE TESTING:
Testing is very important for data warehouse systems to make them work correctly and efficiently.
There are three basic levels of testing performed on a data warehouse −
 Unit testing
 Integration testing
 System testing
Unit Testing
 In unit testing, each component is separately tested.
 Each module, i.e., procedure, program, SQL script, or Unix shell script, is tested.
 This test is performed by the developer.
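As an illustration, a unit test for one small ETL routine might look as follows; the function under test and the expected values are hypothetical assumptions.

    # Illustrative unit test for a single ETL module (assumed function and values).
    import unittest

    def to_amount(value: str) -> float:
        # module under test: converts an extracted string to the required type
        return round(float(value), 2)

    class TestToAmount(unittest.TestCase):
        def test_converts_string_to_float(self):
            self.assertEqual(to_amount("19.999"), 20.0)

        def test_rejects_non_numeric_input(self):
            with self.assertRaises(ValueError):
                to_amount("not-a-number")

    if __name__ == "__main__":
        unittest.main()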
Integration Testing
 In integration testing, the various modules of the application are brought together and then tested against a number of inputs.
 It is performed to test whether the various components do well after integration.
System Testing
 In system testing, the whole data warehouse application is tested together.
 The purpose of system testing is to check whether the entire system works correctly together or
not.
 System testing is performed by the testing team.
 Since the size of the whole data warehouse is very large, usually only minimal system testing can be performed before the full test plan is enacted.
Test Schedule
First of all, the test schedule is created in the process of developing the test plan. In this schedule, we
predict the estimated time required for the testing of the entire data warehouse system.
There are different methodologies available to create a test schedule, but none of them are perfect
because the data warehouse is very complex and large. Also the data warehouse system is evolving in
nature. One may face the following issues while creating a test schedule −
 A simple problem may involve a very large query that can take a day or more to complete, i.e., the query does not complete in the desired time scale.
 There may be hardware failures such as losing a disk or human errors such as accidentally
deleting a table or overwriting a large table.
Note − Due to the above-mentioned difficulties, it is recommended to always double the amount of
time you would normally allow for testing.
Testing Backup Recovery
Testing the backup recovery strategy is extremely important. Here is the list of scenarios for which this
testing is needed −
 Media failure
 Loss or damage of table space or data file
 Loss or damage of redo log file
 Loss or damage of control file
 Instance failure
 Loss or damage of archive file
 Loss or damage of table
Testing Operational Environment
There are a number of aspects that need to be tested. These aspects are listed below.
 Security − A separate security document is required for security testing. This document contains a list of disallowed operations and the tests devised for each.
 Scheduler − Scheduling software is required to control the daily operations of a data
warehouse and needs to be tested during system testing. The scheduling software requires an interface with the data warehouse; the scheduler will need to control overnight processing and the management of aggregations.
 Disk Configuration − Disk configuration also needs to be tested to identify I/O bottlenecks. The test should be performed multiple times with different settings.
 Management Tools − It is required to test all the management tools during system testing. Here is the list of tools that need to be tested.
 Event manager
 System manager
 Database manager
 Configuration manager
 Backup recovery manager
Testing the Database
The database is tested in the following three ways −
 Testing the database manager and monitoring tools − To test the database manager and the monitoring tools, they should be used in the creation, running, and management of a test database.
 Testing database features − Here is the list of features that we have to test −
 Querying in parallel
 Create index in parallel
 Data load in parallel
 Testing database performance − Query execution plays a very important role in data
warehouse performance measures. There are sets of fixed queries that need to be run regularly
and they should be tested. To test ad hoc queries, one should go through the user requirement
document and understand the business completely. Take time to test the most awkward queries
that the business is likely to ask against different index and aggregation strategies.
Testing the Application
 All the managers should be integrated correctly and work together to ensure that the end-to-end load, index, aggregation, and queries work as expected.
 Each function of each manager should work correctly
 It is also necessary to test the application over a period of time.
 Week-end and month-end tasks should also be tested.
Logistics of the Test
The aim of the system test is to test all of the following areas −
 Scheduling software
 Day-to-day operational procedures
 Backup recovery strategy
 Management and scheduling tools
 Overnight processing
 Query performance
Note − The most important point is to test scalability. Failure to do so will leave us with a system design that does not work when the system grows.

FIG: DATAWAREHOUSE TESTING