CCS341 Data Warehousing - All Units
CCS341 DATA WAREHOUSING
NOTES
2023-2024(EVEN)
SYLLABUS
UNIT-I
INTRODUCTION TO DATA WAREHOUSE
Data warehouse Introduction - Data warehouse components- operational database Vs data
warehouse – Data warehouse Architecture – Three-tier Data Warehouse Architecture -
Autonomous Data Warehouse- Autonomous Data Warehouse Vs Snowflake - Modern Data
Warehouse
Introduction
A data warehouse is built by combining data from multiple diverse sources to support analytical reporting, structured and ad hoc queries, and decision making for the organization; data warehousing is the step-by-step approach for constructing and using a data warehouse. Many data scientists get their data in raw form from various sources, but for data scientists as well as business decision-makers, particularly in large enterprises, the main source of data and information is the corporate data warehouse. A data warehouse holds data from multiple sources, including internal databases and Software-as-a-Service (SaaS) platforms. After the data is loaded, it is often cleansed, transformed, and checked for quality before it is used for analytics, reporting, data science, machine learning, or other purposes. A data warehouse is needed for the following reasons:
1. Business User: Business users or customers need a data warehouse to view summarized data from the past. Since these people may come from a non-technical background, the data should be presented to them in an uncomplicated way.
2. Maintains consistency: Data warehouses apply a uniform format to all the data collected from different sources, which makes it effortless for company decision-makers to analyze and share data insights with their colleagues around the globe. Standardizing the data also reduces the risk of errors in interpretation and improves overall accuracy.
3. Store historical data: Data warehouses are also used to store historical data, that is, time-variant data from the past, and this input can be used for various purposes.
4. Make strategic decisions: Data warehouses contribute to making better strategic decisions. Some business strategies may depend upon the data stored within the data warehouses.
5. High response time: A data warehouse has to be prepared for somewhat unexpected loads and types of queries, which demands a significant degree of flexibility and a fast response time.
Characteristics of Data Warehouse:
1. Subject Oriented: A data warehouse is subject-oriented because it delivers information about a particular theme rather than about the organization's ongoing operations. These themes can be sales, distribution, marketing, etc.
2. Time-Variant: The data is maintained over different intervals of time, such as weekly, monthly, or annually, and every record carries a time element that gives it a historical perspective.
2. Data Staging: The data staging component prepares the data for the warehouse through three major functions: extraction, transformation, and loading.
Data Extraction: This function handles the various data sources. Data analysts should employ suitable techniques for every data source.
Data Transformation: As we know, data for a data warehouse comes from many different sources. If data extraction for a data warehouse poses big challenges, data transformation presents even bigger ones. We perform many individual tasks as part of data transformation. First, we clean the data extracted from every data source. Standardization of data elements forms a large part of data transformation. Data transformation also involves combining pieces of data from different sources, purging source data that is not useful, and separating source records into new combinations. Once the data transformation function ends, we have a collection of integrated data that is cleaned, standardized, and summarized (a short transformation sketch in Python follows this section).
Data Loading: When we complete the structure and construction of the data warehouse and
go live for the first time, we do the initial loading of the data into the data warehouse
storage. The initial load moves high volumes of data consuming a considerable amount of
time.
3. Data Storage in Warehouse:
Data storage for data warehousing is split into multiple repositories. These data repositories contain structured data in a highly normalized form for fast and efficient processing.
Metadata: Metadata means data about data, i.e., it summarizes basic details regarding the data and makes it easier to find and work with particular instances of data. Metadata can be created manually or generated automatically and contains basic information about the data.
Raw Data: Raw data is a set of data and information that has not yet been processed by machine or human; it is delivered as-is from a particular data entity to the data supplier. This data is gathered from online and internal sources and is processed before it is used for reporting and analysis.
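Referring back to the data transformation function described in the data staging part above, here is a minimal Python sketch of cleaning and standardizing records; the source record layouts, field names, and date formats are invented for the example and are not from any particular system.

```python
# Minimal sketch of the transformation step: records from two hypothetical
# sources are cleaned, standardized to one format, and combined for loading.
from datetime import datetime

crm_rows = [{"cust": " Anita ", "city": "bangalore", "joined": "03-08-2013"}]
erp_rows = [{"CUSTOMER": "RAVI", "CITY": "Mumbai", "JOIN_DATE": "2013/09/03"}]

def standardize_crm(row):
    return {
        "customer": row["cust"].strip().title(),
        "city": row["city"].strip().title(),
        "joined": datetime.strptime(row["joined"], "%d-%m-%Y").date(),
    }

def standardize_erp(row):
    return {
        "customer": row["CUSTOMER"].strip().title(),
        "city": row["CITY"].strip().title(),
        "joined": datetime.strptime(row["JOIN_DATE"], "%Y/%m/%d").date(),
    }

# Integrated, standardized set ready for loading into warehouse storage.
staged = [standardize_crm(r) for r in crm_rows] + [standardize_erp(r) for r in erp_rows]
print(staged)
```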
UNIT-II
ETL AND OLAP TECHNOLOGY
What is ETL – ETL Vs ELT – Types of Data warehouses - Data warehouse Design and Modeling -
Delivery Process - Online Analytical Processing (OLAP) - Characteristics of OLAP – Online
Transaction Processing (OLTP) Vs OLAP - OLAP operations- Types of OLAP- ROLAP Vs MOLAP Vs
HOLAP.
Extraction, Load and Transform (ELT): Extraction, Load and Transform (ELT) is the technique of extracting raw data from the source, storing it in the data warehouse of the target server, and preparing it for downstream users.
ELT comprises three different operations performed on the data:
1. Extract:
Extracting data is the technique of identifying data from one or more sources. The sources
may be databases, files, ERP, CRM or any other useful source of data.
2. Load:
Loading is the process of storing the extracted raw data in data warehouse or data lakes.
3. Transform:
Data transformation is the process in which the raw data source is transformed to the
target format required for analysis.
Data is retrieved from the warehouse whenever required. The data, transformed as required, is then sent forward for analysis. When you use ELT, you move the entire data set as it exists in the source systems to the target. This means that you have the raw data at your disposal in the data warehouse, in contrast to the ETL approach.
Extraction, Transform and Load (ETL):
ETL is the traditional technique of extracting raw data, transforming it for the users as
required and storing it in data warehouses. ELT was later developed, having ETL as its
base. The three operations happening in ETL and ELT are the same except that their order
of processing is slightly varied. This change in sequence was made to overcome some
drawbacks.
1. Extract:
It is the process of extracting raw data from all available data sources such as databases,
files, ERP, CRM or any other.
2. Transform:
The extracted data is immediately transformed as required by the user.
3. Load:
The transformed data is then loaded into the data warehouse from where the users can
access it.
The data collected from the sources is directly stored in the staging area. The transformations required are performed on the data in the staging area. Once the data is transformed, the resultant data is stored in the data warehouse. The main drawback of the ETL architecture is that once the transformed data is stored in the warehouse, it cannot be modified again, whereas in ELT a copy of the raw data is always available in the warehouse and only the required data is transformed when needed.
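To make the difference in ordering concrete, here is a minimal Python sketch of the two flows; the extract, transform, and load functions are hypothetical stand-ins for real connector, transformation, and warehouse-loading logic, not a specific tool's API.

```python
# Minimal sketch contrasting the order of operations in ETL and ELT.

def extract(source):
    # Pull raw rows from a source system (database, file, CRM, ERP, ...).
    return [{"source": source, "amount": "100"}]

def transform(rows):
    # Apply the cleansing / type conversions required for analysis.
    return [{**r, "amount": int(r["amount"])} for r in rows]

def load(rows, target):
    # Store rows in the target (staging area, warehouse table or data lake).
    target.extend(rows)

warehouse_etl, warehouse_elt = [], []

# ETL: transform in a staging area first, then load the result.
staged = transform(extract("crm"))
load(staged, warehouse_etl)

# ELT: load the raw data as-is, transform later inside the warehouse when needed.
load(extract("crm"), warehouse_elt)
warehouse_elt = transform(warehouse_elt)

print(warehouse_etl, warehouse_elt)
```

In the ETL branch the raw rows never reach the warehouse, while in the ELT branch they do, which is exactly the trade-off summarized in the comparison that follows.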
ELT vs ETL:
1. Hardware: ELT tools do not require additional hardware, whereas ETL tools require specific hardware with their own engines to perform transformations.
2. Loading: In ELT, all components are in one system, so loading is done only once; as ETL uses a staging area, extra time is required to load the data.
3. Transformation time: In ELT, the time to transform data is independent of the size of the data; in ETL, the system has to wait for large sizes of data, and as the size of data increases, transformation time also increases.
4. Cost: ELT is cost effective and available to all businesses using SaaS solutions; ETL is not cost effective for small and medium businesses.
5. Users: The data transformed by ELT is used by data scientists and advanced analysts; the data transformed by ETL is used by users reading reports and by SQL coders.
6. Views: ELT creates ad hoc views, with low cost for building and maintaining them; in ETL, views are created based on multiple scripts, and deleting a view means deleting data.
7. Data types: ELT is best for unstructured and non-relational data, is ideal for data lakes, and is suited for very large amounts of data; ETL is best for relational and structured data and is better for small to medium amounts of data.
In the "Bottom-Up" approach, a data warehouse is described as "a copy of transaction data specifical
architecture for query and analysis," term the star schema. In this approach, a data mart is created first
to necessary reporting and analytical capabilities for particular business processes (or subjects). Thus it
is needed to be a business-driven approach in contrast to Inmon's data- driven approach.
Data marts include the lowest grain data and, if needed, aggregated data too. Instead of a normalized database for the data warehouse, a denormalized dimensional database is adopted to meet the data delivery requirements of data warehouses.
Using this method, to use the set of data marts as the enterprise data warehouse, the data marts should be built with conformed dimensions in mind, meaning that common objects are represented the same way in different data marts. The conformed dimensions connect the data marts to form a data warehouse, which is generally called a virtual data warehouse.
The advantage of the "bottom-up" design approach is that it has quick ROI, as developing a data mart,
a data warehouse for a single subject, takes far less time and effort than developing an enterprise-wide
data warehouse. Also, the risk of failure is even less. This method is inherently incremental. This
method allows the project team to learn and grow.
It may show quick results if implemented with repetitions. There is less risk of failure, a favorable return on investment, and proof of techniques.
In contrast, data modeling in operational database systems targets efficiently supporting simple
transactions in the database such as retrieving, inserting, deleting, and changing data. Moreover, data
warehouses are designed for the customer with general information knowledge about the enterprise,
whereas operational database systems are more oriented toward use by software specialists for creating
distinct applications.
Data Warehouse model is illustrated in the given diagram.
The data within the specific warehouse itself has a particular architecture with the emphasis on
various levels of summarization, as shown in figure:
Older detail data is stored in some form of mass storage; it is infrequently accessed and kept at a level of detail consistent with the current detailed data.
Lightly summarized data is data extracted from the low level of detail found at the current detailed level and is usually stored on disk storage. When building the data warehouse, we have to decide what unit of time the summarization is done over and which components or attributes the summarized data will contain.
Highly summarized data is compact and directly available and can even be found outside the
warehouse.
Metadata is the final component of the data warehouse. It has a different dimension than the other data in that it is not drawn from the operational environment; instead, it is used as:
A directory to help the DSS analyst locate the contents of the data warehouse.
A guide to the mapping of data as it is transformed from the operational environment to the data warehouse environment.
A guide to the methods used for summarization between the current detailed data and the lightly summarized data, and between the lightly summarized data and the highly summarized data, etc.
From the architecture point of view, the data warehouse models are as follows:
Enterprise Warehouse
An Enterprise warehouse collects all of the records about subjects spanning the entire organization. It
supports corporate-wide data integration, usually from one or more operational systems or external
data providers, and it's cross-functional in scope. It generally contains detailed information as well as
summarized information and can range in size from a few gigabytes to hundreds of gigabytes, terabytes, or beyond.
An enterprise data warehouse may be implemented on traditional mainframes, UNIX super servers, or parallel architecture platforms. It requires extensive business modeling and may take years to design and build.
Data Mart
A data mart includes a subset of corporate-wide data that is of value to a specific collection of users.
The scope is confined to particular selected subjects. For example, a marketing data mart may restrict
its subjects to the customer, items, and sales. The data contained in the data marts tend to be
summarized.
IT Strategy: A data warehouse project must have an IT strategy for procuring and retaining funding.
Business Case Analysis: After the IT strategy has been designed, the next step is the business case. It
is essential to understand the level of investment that can be justified and to recognize the projected
business benefits which should be derived from using the data warehouse.
OLAP INTRODUCTION
In the earlier unit you had studied about Extract, Transform and Loading (ETL) of a
Data Warehouse. Within the data science field, there are two types of data processing
systems: online analytical processing (OLAP) and online transaction processing (OLTP).
The main difference is that one uses data to gain valuable insights, while the other is purely operational. However, there are meaningful ways to use both systems to solve data problems. OLAP is a system for performing multi-dimensional analysis at high speeds on large volumes of data. Typically, this data is from a data warehouse, data mart or some other centralized data store. OLAP is ideal for data mining, business
intelligence and complex analytical calculations, as well as business reporting functions
like financial analysis, budgeting and sales forecasting.
CHARACTERISTICS OF OLAP
OLAP Operations
OLAP stands for Online Analytical Processing. It is a software technology that allows users to analyze information from multiple database systems at the same time. It is based on a multidimensional data model and allows the user to query multi-dimensional data (e.g., Delhi -> 2018 -> Sales data). OLAP databases are divided into one or more cubes, and these cubes are known as hyper-cubes.
OLAP operations:
There are five basic analytical operations that can be performed on an OLAP cube:
1. Drill down: In drill-down operation, the less detailed data is converted into
highly detailed data. It can be done by:
Moving down in the concept hierarchy
Adding a new dimension
In the cube given in the overview section, the drill-down operation is performed by moving down in the concept hierarchy of the Time dimension (Quarter -> Month).
2. Roll up: It is the opposite of the drill-down operation and is performed by either climbing up in the concept hierarchy or reducing the number of dimensions. In the cube given in the overview section, the roll-up operation is performed by climbing up in the concept hierarchy of the Location dimension (City -> Country).
3. Dice: It selects a sub-cube from the OLAP cube by selecting two or more dimensions.
In the cube given in the overview section, a sub-cube is selected by selecting the following dimensions with criteria:
Location = "Delhi" or "Kolkata"
Time = "Q1" or "Q2"
Item = "Car" or "Bus"
4. Slice: It selects a single dimension from the OLAP cube, which results in a new sub-cube. In the cube given in the overview section, a slice is performed on the dimension Time = "Q1".
5. Pivot: It is also known as rotation operation as it rotates the current view to get a new
view of the representation. In the sub-cube obtained after the slice operation,
performing pivot operation gives a new view of it.
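The operations above can be sketched with pandas on a tiny in-memory cube. The column names and values are assumptions chosen to mirror the Delhi/Kolkata example in this section; this is only an illustration, not a full OLAP engine.

```python
# Illustrative pandas sketch of roll-up, drill-down, slice, dice, and pivot.
import pandas as pd

cube = pd.DataFrame({
    "Location": ["Delhi", "Delhi", "Kolkata", "Kolkata"],
    "Quarter":  ["Q1", "Q2", "Q1", "Q2"],
    "Month":    ["Jan", "Apr", "Feb", "May"],
    "Item":     ["Car", "Bus", "Car", "Bus"],
    "Sales":    [120, 80, 95, 60],
})

# Roll up: climb the Time hierarchy (Month -> Quarter) by aggregating.
roll_up = cube.groupby(["Location", "Quarter", "Item"], as_index=False)["Sales"].sum()

# Drill down: move to the more detailed Month level.
drill_down = cube.groupby(["Location", "Quarter", "Month", "Item"], as_index=False)["Sales"].sum()

# Slice: fix one dimension (Time = "Q1") to get a sub-cube.
slice_q1 = cube[cube["Quarter"] == "Q1"]

# Dice: select on two or more dimensions.
dice = cube[cube["Location"].isin(["Delhi", "Kolkata"])
            & cube["Item"].isin(["Car", "Bus"])
            & cube["Quarter"].isin(["Q1", "Q2"])]

# Pivot: rotate the view so Locations become columns and Items become rows.
pivot = cube.pivot_table(index="Item", columns="Location", values="Sales", aggfunc="sum")
print(pivot)
```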
UNIT-III
METADATA, DATAMART AND PARTITION STRATEGY
Meta Data – Categories of Metadata – Role of Metadata – Metadata Repository – Challenges for Metadata
Management - Data Mart – Need of Data Mart- Cost Effective Data Mart- Designing Data Marts- Cost
of Data Marts- Partitioning Strategy – Vertical partition – Normalization – Row Splitting – Horizontal
Partition
Role of Metadata
Each software tool has its own proprietary metadata. If you are using several tools in your data warehouse, how can you reconcile the formats?
No industry-wide accepted standards exist for metadata formats.
There are conflicting claims on the advantages of a centralized metadata repository as opposed to a collection of fragmented metadata stores.
There are no easy and accepted methods of passing metadata along the processes as data
moves from the source systems to the staging area and thereafter to the data warehouse
storage.
Preserving version control of metadata uniformly throughout the data warehouse is
tedious and difficult.
In a large data warehouse with numerous source systems, unifying the metadata relating to the data sources can be an enormous task. You have to deal with conflicting standards, formats, data naming conventions, data definitions, attributes, values, business rules, and units of measure. You have to resolve indiscriminate use of aliases and compensate for inadequate data validation rules.
Metadata Repository
Think of a metadata repository as a general-purpose information directory or cataloguing device to
classify, store, and manage metadata. As we have seen earlier, business metadata and technical metadata
serve different purposes. The end-users need the business metadata; data warehouse developers and
administrators require the technical metadata.
The structures of these two categories of metadata also vary. Therefore, the metadata repository can be
thought of as two distinct information directories, one to store business metadata and the other to store
technical metadata. This division may also be logical within a single physical repository.
The following Figure shows the typical contents in a metadata repository. Notice the division between
business and technical metadata. Did you also notice another component called the information navigator?
This component is implemented in different ways in commercial offerings. The functions of the
information navigator include the following:
Interface from query tools. This function attaches data warehouse data to third-party query tools so that
metadata definitions inside the technical metadata may be viewed from these tools.
Drill-down for details. The user of metadata can drill down and proceed from one level of metadata to a
lower level for more information. For example, you can first get the definition of a data table, then go to
the next level for seeing all attributes, and go further to get the details of individual attributes.
Review predefined queries and reports. The user is able to review predefined queries and reports, and
launch the selected ones with proper parameters.
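As a rough illustration of how business metadata, technical metadata, and a drill-down lookup might sit together in one repository, here is a small Python sketch; the structure and field names are assumptions, not a particular product's design.

```python
# Illustrative sketch of a metadata repository holding business and technical
# metadata, with a drill-down lookup similar to the information navigator above.
technical_metadata = {
    "sales_fact": {
        "definition": "Daily sales transactions loaded from the order system",
        "attributes": {
            "sale_amount": {"type": "DECIMAL(12,2)", "source": "orders.amount"},
            "sale_date":   {"type": "DATE",          "source": "orders.created_on"},
        },
    }
}

business_metadata = {
    "sales_fact": "Revenue figures used in monthly financial reporting"
}

def drill_down(table, attribute=None):
    """Go from table-level metadata to attribute-level detail."""
    entry = technical_metadata[table]
    if attribute is None:
        return {"business": business_metadata.get(table),
                "definition": entry["definition"],
                "attributes": list(entry["attributes"])}
    return entry["attributes"][attribute]

print(drill_down("sales_fact"))
print(drill_down("sales_fact", "sale_amount"))
```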
A centralized metadata repository accessible from all parts of the data warehouse for your end- users,
developers, and administrators appears to be an ideal solution for metadata management. But for a
centralized metadata repository to be the best solution, the repository must meet some basic requirements.
Let us quickly review these requirements. It is not easy to find a repository tool that satisfies every one of
the requirements listed below.
Flexible organization. Allow the data administrator to classify and organize metadata into logical
categories and subcategories, and assign specific components of metadata to the classifications.
Selection of a suitable metadata repository product is one of the key decisions the project team must
make. Use the above list of criteria as a guide while evaluating repository tools for your data warehouse.
A data mart is a small portion of the data warehouse that is mainly related to a particular business domain such as marketing or sales.
The data stored in the DW system is huge hence data marts are designed with a subset of data that belongs
to individual departments. Thus a specific group of users can easily utilize this data for their analysis.
Unlike a data warehouse that has many combinations of users, each data mart will have a particular set of
end-users. The lesser number of end-users results in better response time.
Data marts are also accessible to business intelligence (BI) tools. Data marts do not contain duplicated
(or) unused data. They do get updated at regular intervals. They are subject-oriented and flexible databases.
Each team has the right to develop and maintain its data marts without modifying data warehouse (or)
other data mart’s data.
A data mart is more suitable for small businesses as it costs much less than a data warehouse system. The time required to build a data mart is also less than the time required for building a data warehouse.
Identify The Functional Splits: Divide the organization's data into data mart (departmental) specific sets that meet each department's requirements, without any further organizational dependency.
Identify User Access Tool Requirements: There may be different user access tools in the
market that need different data structures. Data marts are used to support all these internal
structures without disturbing the DW data. One data mart can be associated with one tool as per
the user needs. Data marts can also provide updated data to such tools daily.
Identify Access Control Issues: If different data segments in a DW system need privacy and
should be accessed by a set of authorized users then all such data can be moved into data marts.
Hardware and Software Cost: Any newly added data mart may need extra hardware, software,
processing power, network, and disk storage space to work on queries requested by the end-
users. This makes data marting an expensive strategy. Hence the budget should be planned
precisely.
Network Access: If the location of the data mart is different from that of the data warehouse,
then all the data should be transferred with the data mart loading process. Thus a network should
be provided to transfer huge volumes of data which may be expensive.
Time Window Constraints: The time taken for the data mart loading process will depend on
various factors such as complexity & volumes of data, network capacity, and data transfer
mechanisms, etc.
Data marts are classified into three types i.e. Dependent, Independent and Hybrid. This classification
is based on how they have been populated i.e. either from a data warehouse (or) from any other data
sources.
Extraction, Transformation, and Transportation (ETT) is the process that is used to populate data mart’s
data from any source systems.
A data mart can use DW data either logically or physically as shown below:
Logical View: In this scenario, data mart’s data is not physically separated from the DW. It
refers to DW data through virtual views (or) tables logically.
Physical subset: In this scenario, data mart’s data is physically separated from the DW. Once one
or more data marts are developed, you can allow the users to access only the data marts (or) to access
both Data marts and Data warehouses.
ETT is a simplified process in the case of dependent data marts because the usable data is already
existing in the centralized DW. The accurate set of summarized data should be just moved to the
respective data marts.
Independent data marts are stand-alone systems where data is extracted, transformed and loaded from external (or) internal data sources. These are easy to design and maintain as long as they support simple department-wise business needs.
You have to work with each phase of the ETT process in case of independent data marts in a similar way
as to how the data has been processed into centralized DW. However, the number of sources and the data
populated to the data marts may be less.
Designing: From the time business users request a data mart, the designing phase involves requirements gathering, creating appropriate data from the respective data sources, and creating the logical and physical data structures and ER diagrams.
Constructing: The team will design all tables, views, indexes, etc., in the data mart system.
Populating: Data will be extracted, transformed and loaded into the data mart along with metadata (a short population sketch follows this list).
Accessing: Data Mart data is available to be accessed by the end-users. They can query the data
for their analysis and reports.
Managing: This involves various managerial tasks such as user access controls, data mart
performance fine-tuning, maintaining existing data marts and creating data mart recovery
scenarios in case the system fails.
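A minimal Python sketch of the populating step for a dependent data mart is shown below; the warehouse row layout and the marketing subject area are assumed purely for illustration.

```python
# Sketch of populating a dependent data mart: a subject-area subset of warehouse
# rows is selected and summarized to the grain the mart needs.
from collections import defaultdict

warehouse_rows = [
    {"department": "marketing", "region": "South", "month": "2024-01", "sales": 120},
    {"department": "marketing", "region": "South", "month": "2024-01", "sales": 80},
    {"department": "finance",   "region": "North", "month": "2024-01", "sales": 300},
]

# Select only the marketing subject area, then summarize by region and month.
summary = defaultdict(int)
for row in warehouse_rows:
    if row["department"] == "marketing":
        summary[(row["region"], row["month"])] += row["sales"]

marketing_mart = [{"region": r, "month": m, "total_sales": s} for (r, m), s in summary.items()]
print(marketing_mart)
```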
Star joins are multi-dimensional structures that are formed with fact and dimension tables to support large
amounts of data. Star join will have a fact table in the center surrounded by the dimension tables.
Respective fact table data is associated with dimension tables’ data with a foreign key reference. A fact table
can be surrounded by 20-30 dimension tables.
Similar to the DW system, in star joins as well, the fact tables contain only numerical data, and the respective textual data is described in dimension tables. This structure resembles a star schema in DW.
But the granular data from the centralized DW is the base for any data mart’s data. Many calculations
will be performed on the normalized DW data to transform it into multidimensional data marts data
which is stored in the form of cubes.
This works similarly as to how the data from legacy source systems is transformed into a normalized
DW data.
You need to consider the below scenarios that recommend for the pilot deployment:
If the end-users are new to the Data warehouse system.
If the end-users want to feel comfortable to retrieve data/reports by themselves before going to
production.
If the end-users want hands-on with the latest tools (or) technologies.
If the management wants to see the benefits as a proof of concept before making it a big release.
If the team wants to ensure that all ETL components (or) infrastructure components work well before the release.
Unwanted data marts that have been created are tough to maintain.
Data marts are meant for small business needs. Increasing the size of data marts will decrease its
performance.
If you are creating more number of data marts then the management should properly take care of
their versioning, security, and performance.
Data marts may contain historical (or) summarized (or) detailed data. However, updates to DW data and data mart data may not happen at the same time, which can cause data inconsistency issues.
Partitioning Strategy
Partitioning is done to enhance performance and facilitate easy management of data. Partitioning also
helps in balancing the various requirements of the system. It optimizes the hardware performance and
simplifies the management of data warehouse by partitioning each fact table into multiple separate
partitions. In this chapter, we will discuss different partitioning strategies.
Why is it Necessary to Partition?
Partitioning is important for the following reasons −
To Enhance Performance
By partitioning the fact table into sets of data, the query procedures can be enhanced. Query performance
is enhanced because now the query scans only those partitions that are relevant. It does not have to scan
the whole data.
Horizontal Partitioning
There are various ways in which a fact table can be partitioned. In horizontal partitioning, we have to
keep in mind the requirements for manageability of the data warehouse.
Points to Note
The detailed information remains available online.
The number of physical tables is kept relatively small, which reduces the operating
cost.
This technique is suitable where a mix of dipping into recent history and data mining through the entire history is required.
This technique is not useful where the partitioning profile changes on a regular basis,
because repartitioning will increase the operation cost of data warehouse.
Note − We recommend performing the partition only on the basis of the time dimension, unless you are certain that the suggested dimension grouping will not change within the life of the data warehouse.
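A small Python sketch of horizontal partitioning on the time dimension is given below; the partition naming and row format are illustrative assumptions, not any product's syntax.

```python
# Sketch of horizontal partitioning by time: fact rows are routed into one
# bucket per month, so a time-restricted query scans only relevant partitions.
from collections import defaultdict

fact_rows = [
    {"transaction_id": 1, "transaction_date": "2013-08-03", "value": 250},
    {"transaction_id": 2, "transaction_date": "2013-09-03", "value": 400},
    {"transaction_id": 3, "transaction_date": "2013-09-21", "value": 130},
]

partitions = defaultdict(list)
for row in fact_rows:
    partition_key = row["transaction_date"][:7]          # e.g. "2013-09"
    partitions[f"sales_fact_{partition_key}"].append(row)

# A query restricted to September only has to scan one partition.
september = partitions["sales_fact_2013-09"]
print(len(september), "rows scanned instead of", len(fact_rows))
```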
Partition by Size of Table
When there is no clear basis for partitioning the fact table on any dimension, we should partition the fact table on the basis of its size. We can set the predetermined size as a critical point. When the
table exceeds the predetermined size, a new table partition is created.
Points to Note
This partitioning is complex to manage.
It requires metadata to identify what data is stored in each partition.
Partitioning Dimensions
If a dimension contains a large number of entries, then it is required to partition the dimension. Here we
have to check the size of a dimension.
Consider a large design that changes over time. If we need to store all the variations in order to apply
comparisons, that dimension may be very large. This would definitely affect the responsetime.
Round Robin Partitions
In the round robin technique, when a new partition is needed, the old one is archived. It uses metadata to allow the user access tool to refer to the correct table partition.
This technique makes it easy to automate table management facilities within the data warehouse.
Vertical Partition
Vertical partitioning splits the data vertically. It can be performed in the following two ways: normalization and row splitting.
Normalization
Normalization is the standard relational method of database organization. In this method, the rows are collapsed into a single row, hence it reduces space. Take a look at the following tables, which show how normalization is performed.
Table before Normalization:
Product_id  Quantity  Value  Sales_date  Store_id  Store_name  Location   Region
30          5         3.67   3-Aug-13    16        Sunny       Bangalore  W
35          4         5.33   3-Sep-13    16        Sunny       Bangalore  W
40          5         2.50   3-Sep-13    64        San         Mumbai     S
45          7         5.66   3-Sep-13    16        Sunny       Bangalore  W

Table after Normalization (Store table):
Store_id  Store_name  Location   Region
16        Sunny       Bangalore  W
64        San         Mumbai     S

Table after Normalization (Sales table):
Product_id  Quantity  Value  Sales_date  Store_id
30          5         3.67   3-Aug-13    16
35          4         5.33   3-Sep-13    16
40          5         2.50   3-Sep-13    64
45          7         5.66   3-Sep-13    16
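The same normalization can be sketched in Python; the row layout below mirrors the tables above, and the code only illustrates splitting out the repeated store columns.

```python
# Sketch of the normalization shown above: the repeated store columns are split
# into a separate Store table keyed by Store_id, leaving a smaller Sales table.
denormalized = [
    {"Product_id": 30, "Quantity": 5, "Value": 3.67, "Sales_date": "3-Aug-13",
     "Store_id": 16, "Store_name": "Sunny", "Location": "Bangalore", "Region": "W"},
    {"Product_id": 40, "Quantity": 5, "Value": 2.50, "Sales_date": "3-Sep-13",
     "Store_id": 64, "Store_name": "San", "Location": "Mumbai", "Region": "S"},
]

stores = {}   # one row per store (the collapsed rows)
sales = []    # sales rows keep only the Store_id foreign key
for row in denormalized:
    stores[row["Store_id"]] = {k: row[k] for k in ("Store_id", "Store_name", "Location", "Region")}
    sales.append({k: row[k] for k in ("Product_id", "Quantity", "Value", "Sales_date", "Store_id")})

print(list(stores.values()))
print(sales)
```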
Row Splitting
Row splitting tends to leave a one-to-one map between partitions. The motive of row splitting is to speed
up the access to large table by reducing its size.
Note − While using vertical partitioning, make sure that there is no requirement to perform a major join
operation between two partitions.
Identify Key to Partition
It is very crucial to choose the right partition key. Choosing a wrong partition key will lead to reorganizing the fact table. Let's have an example. Suppose we want to partition the following table.
Account_Txn_Table
transaction_id
account_id
transaction_type
value
transaction_date
region
branch_name
We can choose to partition on any key. The two possible keys could be
Region
Transaction date
Suppose the business is organized in 30 geographical regions and each region has different number of
branches. That will give us 30 partitions, which is reasonable. This partitioning is good enough because
our requirements capture has shown that a vast majority of queries are restricted to the user's own
business region.
If we partition by transaction date instead of region, then the latest transaction from every region will be
in one partition. Now the user who wants to look at data within his own region has to query across
multiple partitions. Hence it is worth determining the right partitioning key.
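The effect of the partition key choice can be illustrated with a small Python sketch; the sample rows and the helper function are assumptions made for the example.

```python
# Sketch of how the choice of partition key changes which partitions a regional
# query must touch, using the Account_Txn_Table columns listed above.
txns = [
    {"transaction_id": 1, "region": "South", "transaction_date": "2024-03-01", "value": 10},
    {"transaction_id": 2, "region": "North", "transaction_date": "2024-03-01", "value": 20},
    {"transaction_id": 3, "region": "South", "transaction_date": "2024-02-15", "value": 30},
]

def partitions_touched(rows, key, predicate):
    """Return the set of partitions a query with this predicate has to scan."""
    return {row[key] for row in rows if predicate(row)}

south_only = lambda r: r["region"] == "South"

# Partitioned by region: the regional query stays inside a single partition.
print(partitions_touched(txns, "region", south_only))            # {'South'}

# Partitioned by transaction_date: the same query spans several partitions.
print(partitions_touched(txns, "transaction_date", south_only))  # two partitions
```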
UNIT-IV
Multidimensional Model:
A multidimensional model views data in the form of a data-cube. A data cube enables data to be modelled
and viewed in multiple dimensions. It is defined by dimensions and facts.
The dimensions are the perspectives or entities concerning which an organization keeps records. For
example, a shop may create a sales data warehouse to keep records of the store's sales for the dimension time,
item, and location. These dimensions allow the store to keep track of things, for example, monthly sales of
items and the locations at which the items were sold. Each dimension has a table related to it, called a
dimensional table, which describes the dimension further. For example, a dimensional table for an item may
contain the attributes item name, brand, and type.
A multidimensional data model is organized around a central theme, for example, sales. This theme is
represented by a fact table. Facts are numerical measures. The fact table contains the names of the facts or
measures of the related dimensional tables.
Consider the data of a shop for items sold per quarter in the city of Delhi. The data is shown in the table. In
this 2D representation, the sales for Delhi are shown for the time dimension (organized in quarters) and the
item dimension (classified according to the types of items sold). The fact or measure displayed is rupees sold (in thousands).
Now, suppose we want to view the sales data with a third dimension. For example, suppose the data according to time and item, as well as the location, is considered for the cities Chennai, Kolkata, Mumbai, and Delhi.
These 3D data are shown in the table. The 3D data of the table are represented as a series of 2D tables.
Conceptually, it may also be represented by the same data in the form of a 3D data cube, as shown in fig:
Stage 2: Grouping different segments of the system - In the second stage, the Multi-Dimensional Data
Model recognizes and classifies all the data to the respective section they belong to and also builds it
problem-free to apply step by step.
Stage 3: Noticing the different proportions - In the third stage, it is the basis on which the design of the
system is based. In this stage, the main factors are recognized according to the user’s point of view. These
factors are also known as “Dimensions”.
Stage 4: Preparing the actual-time factors and their respective qualities - In the fourth stage, the
factors which are recognized in the previous step are used further for identifying the related qualities.
These qualities are also known as “attributes” in the database.
Stage 5: Finding the actuality of factors which are listed previously and their qualities - In the fifth
stage, A Multi-Dimensional Data Model separates and differentiates the actuality from the factors which
are collected by it. These actually play a significant role in the arrangement of a Multi-Dimensional Data
Model.
Stage 6: Building the Schema to place the data, with respect to the information collected from the
steps above - In the sixth stage, on the basis of the data which was collected previously, a Schema is
built.
Data is grouped or combined in multidimensional matrices called Data Cubes. The data cube method
has a few alternative names or a few variants, such as "Multidimensional databases," "materialized views,"
and "OLAP (On-Line Analytical Processing)."
The general idea of this approach is to materialize certain expensive computations that are frequently
inquired.
For example, a relation with the schema sales (part, supplier, customer, and sale-price) can be materialized
into a set of eight views as shown in fig, where psc indicates a view consisting of aggregate function value
(such as total-sales) computed by grouping three attributes part, supplier, and customer, p indicates a view
composed of the corresponding aggregate function values calculated by grouping part alone, etc.
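A small Python sketch of computing those eight group-by views follows; the sample rows are invented, and the code simply materializes one aggregate per subset of the three dimension attributes.

```python
# Sketch of the eight group-by views of sales(part, supplier, customer, sale_price):
# one aggregate per subset of the three dimension attributes (2^3 = 8 views).
from itertools import combinations

sales = [
    {"part": "p1", "supplier": "s1", "customer": "c1", "sale_price": 100},
    {"part": "p1", "supplier": "s2", "customer": "c1", "sale_price": 150},
    {"part": "p2", "supplier": "s1", "customer": "c2", "sale_price": 200},
]
dimensions = ("part", "supplier", "customer")

views = {}
for k in range(len(dimensions) + 1):
    for group in combinations(dimensions, k):          # (), (part,), ..., (part, supplier, customer)
        totals = {}
        for row in sales:
            key = tuple(row[d] for d in group)
            totals[key] = totals.get(key, 0) + row["sale_price"]
        views[group] = totals

print(views[()])                                # total sales over everything
print(views[("part",)])                         # view "p": totals grouped by part alone
print(views[("part", "supplier", "customer")])  # view "psc": the most detailed view
```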
Data cube is created from a subset of attributes in the database. Specific attributes are chosen to be measure
attributes, i.e., the attributes whose values are of interest. Other attributes are selected as dimensions or
functional attributes. The measure attributes are aggregated according to the dimensions.
For example, XYZ may create a sales data warehouse to keep records of the store's sales for the dimensions
time, item, branch, and location. These dimensions enable the store to keep track of things like monthly sales
of items, and the branches and locations at which the items were sold. Each dimension may have a table
identify with it, known as a dimensional table, which describes the dimensions. For example, a dimension
table for items may contain the attributes item name, brand, and type.
Data cube method is an interesting technique with many applications. Data cubes could be sparse in many
cases because not every cell in each dimension may have corresponding data in the database.
Techniques should be developed to handle sparse cubes efficiently.
If a query contains constants at even lower levels than those provided in a data cube, it is not clear how to
make the best use of the precomputed results stored in the data cube.
The model views data in the form of a data cube. OLAP tools are based on the multidimensional data model.
Data cubes usually model n-dimensional data.
A data cube enables data to be modelled and viewed in multiple dimensions. A multidimensional data model
is organized around a central theme, like sales and transactions. A fact table represents this theme. Facts are
numerical measures. Thus, the fact table contains measure (such as Rs. sold) and keys to each of the related
dimensional tables.
A data cube is defined by its dimensions and facts. Facts are generally quantities, which are used for analyzing
the relationship between dimensions.
Example: In the 2-D representation, we will look at the All Electronics sales data for items sold per quarter in the city of Vancouver. The measure displayed is dollars sold (in thousands).
3-Dimensional Cuboids
Let us suppose we would like to view the sales data with a third dimension. For example, suppose we would like to view the data according to time and item, as well as the location, for the cities Chicago, New York, Toronto, and Vancouver. The measure displayed is dollars sold (in thousands). These 3-D data are shown in the table.
The 3-D data of the table are represented as a series of 2-D tables.
Conceptually, we may represent the same data in the form of 3-D data cubes, as shown in fig:
Let us suppose that we would like to view our sales data with an additional fourth dimension, such as a
supplier.
In data warehousing, the data cubes are n-dimensional. The cuboid which holds the lowest level of
summarization is called a base cuboid.
For example, the 4-D cuboid in the figure is the base cuboid for the given time, item, location, and supplier
dimensions.
The figure shows a 4-D data cube representation of sales data, according to the dimensions time, item,
location, and supplier. The measure displayed is dollars sold (in thousands).
The topmost 0-D cuboid, which holds the highest level of summarization, is known as the apex cuboid. In
this example, this is the total sales, or dollars sold, summarized over all four dimensions.
The lattice of cuboid forms a data cube. The figure shows the lattice of cuboids creating 4-D data cubes for
the dimension time, item, location, and supplier. Each cuboid represents a different degree of summarization.
Schemas Used in Data Warehouses: Star, Galaxy (Fact constellation), and Snowflake:
We can think of a data warehouse schema as a blueprint or an architecture of how data will be stored and
managed. A data warehouse schema isn’t the data itself, but the organization of how data is stored and how it
relates to other data components within the data warehouse architecture.
In the past, data warehouse schemas were often strictly enforced across an enterprise, but in modern implementations where storage is increasingly inexpensive, schemas have become less constrained. Despite this loosening or sometimes total abandonment of data warehouse schemas, knowledge of the foundational schema designs can be important both for maintaining legacy resources and for creating a modern data warehouse design that learns from the past.
The basic components of all data warehouse schemas are fact and dimension tables. Different combinations of these two central elements compose almost the entirety of all data warehouse schema designs.
Fact Table
A fact table aggregates metrics, measurements, or facts about business processes. In this example, fact tables
are connected to dimension tables to form a schema architecture representing how data relates within the data
warehouse. Fact tables store primary keys of dimension tables as foreign keys within the fact table.
Dimension Table
Dimension tables are non-denormalized tables used to store data attributes or dimensions. As mentioned
above, the primary key of a dimension table is stored as a foreign key in the fact table. Dimension tables are
not joined together. Instead, they are joined via association through the central fact table.
History presents us with three prominent types of data warehouse schema, known as Star Schema, Snowflake Schema, and Galaxy Schema. Each of these data warehouse schemas has unique design constraints and describes a different organizational structure for how data is stored and how it relates to other data within the
data warehouse.
The star schema in a data warehouse is historically one of the most straightforward designs. This schema
follows some distinct design parameters, such as only permitting one central table and a handful of single-
dimension tables joined to the table. In following these design constraints, star schema can resemble a star
with one central table, and five dimension tables joined (thus where the star schema got its name).
Star Schema is known to create denormalized dimension tables – a database structuring strategy that organizes
tables to introduce redundancy for improved performance. Denormalization intends to introduce redundancy
in additional dimensions so long as it improves query performance.
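A minimal star schema can be sketched with SQLite from Python's standard library; the table and column names below are illustrative assumptions, not taken from any particular warehouse.

```python
# Minimal star-schema sketch: one central fact table whose foreign keys
# reference denormalized dimension tables, joined only through the fact table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date  (date_key INTEGER PRIMARY KEY, full_date TEXT, quarter TEXT);
CREATE TABLE dim_item  (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT);
CREATE TABLE dim_store (store_key INTEGER PRIMARY KEY, store_name TEXT, city TEXT, region TEXT);

CREATE TABLE fact_sales (
    date_key   INTEGER REFERENCES dim_date(date_key),
    item_key   INTEGER REFERENCES dim_item(item_key),
    store_key  INTEGER REFERENCES dim_store(store_key),
    units_sold INTEGER,
    amount_sold REAL
);
""")

conn.execute("INSERT INTO dim_date VALUES (1, '2024-01-15', 'Q1')")
conn.execute("INSERT INTO dim_item VALUES (1, 'Notebook', 'Acme', 'Stationery')")
conn.execute("INSERT INTO dim_store VALUES (1, 'Central', 'Delhi', 'North')")
conn.execute("INSERT INTO fact_sales VALUES (1, 1, 1, 5, 250.0)")

for row in conn.execute("""
    SELECT d.quarter, i.item_name, s.city, f.amount_sold
    FROM fact_sales f
    JOIN dim_date d  ON f.date_key = d.date_key
    JOIN dim_item i  ON f.item_key = i.item_key
    JOIN dim_store s ON f.store_key = s.store_key
"""):
    print(row)
```

A snowflake variant would further normalize a dimension, for example splitting city and region out of dim_store into a sub-dimension table, as described next.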
The Snowflake Schema is a data warehouse schema that encompasses a logical arrangement of dimension
tables. This data warehouse schema builds on the star schema by adding additional sub-dimension tables that
relate to first-order dimension tables joined to the fact table.
Just like the relationship between the foreign key in the fact table and the primary key in the dimension table,
with the snowflake schema approach, a primary key in a sub-dimension table will relate to a foreign key within
the higher order dimension table.
Snowflake schema creates normalized dimension tables – a database structuring strategy that organizes tables
to reduce redundancy. The purpose of normalization is to eliminate any redundant data to reduce overhead.
The Galaxy Data Warehouse Schema, also known as a Fact Constellation Schema, acts as the next iteration
of the data warehouse schema. Unlike the Star Schema and Snowflake Schema, the Galaxy Schema uses
multiple fact tables connected with shared normalized dimension tables. Galaxy Schema can be thought of as
star schema interlinked and completely normalized, avoiding any kind of redundancy or inconsistency of data.
To understand data warehouse schema and its various types at the conceptual level, here are a few things to
remember:
Data warehouse schema is a blueprint for how data will be stored and managed. It includes
definitions of terms, relationships, and the arrangement of those terms and relationships.
Star, galaxy, and snowflake are common types of data warehouse schema that vary in the
arrangement and design of the data relationships.
Star schema is the simplest data warehouse schema and contains just one central table and a handful
of single-dimension tables joined together.
Snowflake schema builds on star schema by adding sub-dimension tables, which eliminates redundancy and reduces overhead costs.
Galaxy schema uses multiple fact tables (Snowflake and Star use only one), which makes it like an interlinked star schema. This nearly eliminates redundancy and is ideal for complex database systems.
There’s no one “best” data warehouse schema. The “best” schema depends on (among other things) your
resources, the type of data you’re working with, and what you’d like to do with it.
For instance, star schema is ideal for organizations that want maximum simplicity and can tolerate higher disk
space usage. But galaxy schema is more suitable for complex data aggregation. And snowflake schema could
be superior for an organization that wants lower data redundancy without the complexity of star schema.
Our agnostic approach to schema management means that StreamSets data pipeline tools can manage any kind
of schema – simple, complex or non-existent. Meaning, with StreamSets you don’t have to spend hours matching the schema from a legacy origin into your destination, instead StreamSets can infer any kind of schema
without you having to lift a finger. If however, you want to enforce a schema and create hard and fast validation
rules, StreamSets can help you with that as well. Our flexibility in how we manage schemas means your data
teams have less to figure out on their own and more time to spend on what really matters: your data.
The process architecture defines an architecture in which the data from the data warehouse is processed for
a particular computation.
Centralized process architecture evolved with transaction processing and is well suited for small
organizations with one location of service.
It requires minimal resources both from people and system perspectives.
It is very successful when the collection and consumption of data occur at the same location.
In this architecture, information and its processing are allocated across data centres, processing of the data is localized within each data centre, and the results are grouped into centralized storage. Distributed architectures are used to overcome the limitations of centralized process architectures, where all the information needs to be collected in one central location and the results are available in one central location.
Client-Server
In this architecture, the user does all the information collecting and presentation, while the server does the
processing and management of data.
Three-tier Architecture
With client-server architecture, the client machines need to be connected to a server machine, thus mandating
finite states and introducing latencies and overhead in terms of record to be carried between clients and
servers.
N-tier Architecture
The n-tier or multi-tier architecture is where clients, middleware, applications, and servers are isolated into
multiple tiers.
Cluster Architecture
In this architecture, machines that are connected in a network architecture (software or hardware) work together to process information or compute requirements in parallel. Each device in a cluster is associated with a function that is processed locally, and the result sets are collected by a master server, which returns them to the user.
Peer-to-Peer Architecture
This is a type of architecture where there are no dedicated servers and clients. Instead, all the processing
responsibilities are allocated among all machines, called peers. Each machine can perform the function of a
Client or server or just process data.
Parallelism is used to support speedup, where queries are executed faster because more resources, such as
processors and disks, are provided. Parallelism is also used to provide scale-up, where increasing workloads
are managed without increasing response time, via an increase in the degree of parallelism.
Different architectures for parallel database systems are shared-memory, shared-disk, shared-nothing, and
hierarchical structures.
(a)Horizontal Parallelism: It means that the database is partitioned across multiple disks, and parallel
processing occurs within a specific task (i.e., table scan) that is performed concurrently on different
processors against different sets of data.
(b)Vertical Parallelism: It occurs among various tasks. All component query operations (i.e., scan, join, and
sort) are executed in parallel in a pipelined fashion. In other words, the output from one function (e.g., join) is consumed by the next function as soon as records become available.
Intraquery Parallelism
Intraquery parallelism defines the execution of a single query in parallel on multiple processors and disks. Using intraquery parallelism is essential for speeding up long-running queries. This application of parallelism decomposes the serial SQL query into lower-level operations such as scan, join, sort, and aggregation, which are then executed concurrently.
Interquery Parallelism
Interquery parallelism does not speed up an individual query, since each query is still run sequentially. In interquery parallelism, different queries or transactions execute in parallel with one another.
This form of parallelism can increase transaction throughput. The response times of individual transactions
are not faster than they would be if the transactions were run in isolation.
Thus, the primary use of interquery parallelism is to scale up a transaction processing system to support a
more significant number of transactions per second.
Database vendors started to take advantage of parallel hardware architectures by implementing multiserver
and multithreaded systems designed to handle a large number of client requests efficiently.
This approach naturally resulted in interquery parallelism, in which different server threads (or processes)
handle multiple requests at the same time.
Interquery parallelism has been successfully implemented on SMP systems, where it increased the
throughput and allowed the support of more concurrent users.
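Interquery parallelism can be illustrated with a small Python sketch in which several independent "queries" run concurrently; the queries here are plain functions over in-memory rows, so this only demonstrates the idea of concurrent execution, not a real database engine.

```python
# Illustrative sketch of interquery parallelism: several independent queries are
# executed concurrently, improving throughput even though each individual query
# is no faster on its own.
from concurrent.futures import ThreadPoolExecutor

rows = [{"region": r, "value": v} for r, v in
        [("North", 10), ("South", 20), ("North", 30), ("East", 40)]]

def total_for_region(region):
    # Stand-in for one user's query/transaction.
    return region, sum(r["value"] for r in rows if r["region"] == region)

queries = ["North", "South", "East"]
with ThreadPoolExecutor(max_workers=3) as pool:
    for region, total in pool.map(total_for_region, queries):
        print(region, total)
```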
The tools that allow sourcing of data contents and formats accurately from operational systems and external data stores into the data warehouse have to perform several essential tasks that include:
Data consolidation and integration.
Data transformation from one form to another form.
Data transformation and calculation based on the application of the business rules that drive the transformation.
Metadata synchronization and management, which includes storing or updating metadata about
source files, transformation actions, loading formats, and events.
There are several selection criteria which should be considered while implementing a data warehouse:
1. The ability to identify the data in the data source environment that can be read by the tool is necessary.
2. Support for flat files, indexed files, and legacy DBMSs is critical.
3. The capability to merge records from multiple data stores is required in many installations.
4. A specification interface to indicate the information to be extracted and the conversion required is essential.
5. The ability to read information from repository products or data dictionaries is desired.
6. The code developed by the tool should be completely maintainable.
7. Selective data extraction of both data items and records enables users to extract only the required
data.
8. A field-level data examination for the transformation of data into information is needed.
9. The ability to perform data type and the character-set translation is a requirement when moving data
between incompatible systems.
10. The ability to create aggregation, summarization and derivation fields and records is necessary.
11. Vendor stability and support for the products are components that must be evaluated carefully.
A warehousing team will require different types of tools during a warehouse project. These software
products usually fall into one or more of the categories illustrated, as shown in the figure.
Warehouse Storage
Software products are also needed to store warehouse data and their accompanying metadata. Relational
database management systems are well suited to large and growing warehouses.
UNIT-V
SYSTEM & PROCESS MANAGERS
Data Warehousing System Managers: System Configuration Manager- System Scheduling
Manager - System Event Manager - System Database Manager - System Backup Recovery
Manager - Data Warehousing Process Managers: Load Manager – Warehouse Manager-
Query Manager – Tuning – Testing
Process managers:
Process managers are responsible for maintaining the flow of data both into and out of the
data warehouse. There are three different types of process managers −
Load manager
Warehouse manager
Query manager
DATA WAREHOUSE TUNING:
A data warehouse keeps evolving and it is unpredictable what query the user is going to post in the
future. Therefore it becomes more difficult to tune a data warehouse system. In this chapter, we will
discuss how to tune the different aspects of a data warehouse such as performance, data load, queries,
etc.
Difficulties in Data Warehouse Tuning
Tuning a data warehouse is a difficult procedure due to following reasons −
Data warehouse is dynamic; it never remains constant.
It is very difficult to predict what query the user is going to post in the future.
Business requirements change with time.
Users and their profiles keep changing.
The user can switch from one group to another.
The data load on the warehouse also changes with time.
Performance Assessment
Here is a list of objective measures of performance −
Average query response time
Scan rates
Time used per day query
Memory usage per process
I/O throughput rates
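A small sketch of computing two of these measures from a hypothetical query log follows; the log format is assumed for illustration only.

```python
# Sketch of computing average query response time and I/O throughput from a
# hypothetical log of query executions.
query_log = [
    {"query": "Q1", "response_seconds": 2.4, "bytes_read": 40_000_000},
    {"query": "Q2", "response_seconds": 0.9, "bytes_read": 5_000_000},
    {"query": "Q3", "response_seconds": 4.1, "bytes_read": 120_000_000},
]

avg_response = sum(q["response_seconds"] for q in query_log) / len(query_log)
io_throughput = sum(q["bytes_read"] for q in query_log) / sum(q["response_seconds"] for q in query_log)

print(f"Average query response time: {avg_response:.2f} s")
print(f"I/O throughput: {io_throughput / 1_000_000:.1f} MB/s")
```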
Following are the points to remember.
It is necessary to specify the measures in service level agreement (SLA).
It is of no use trying to tune response times if they are already better than those required.
It is essential to have realistic expectations while making performance assessment.
It is also essential that the users have feasible expectations.
To hide the complexity of the system from the user, aggregations and views should be used.
It is also possible that the user can write a query you had not tuned for.
Data Load Tuning
Data load is a critical part of overnight processing. Nothing else can run until data load is complete.
This is the entry point into the system.
Note − If there is a delay in transferring the data, or in arrival of data then the entire system is affected
badly. Therefore it is very important to tune the data load first.
DATA WAREHOUSE TESTING:
Testing is very important for data warehouse systems to make them work correctly and efficiently.
There are three basic levels of testing performed on a data warehouse −
Unit testing
Integration testing
System testing
Unit Testing
In unit testing, each component is separately tested.
Each module, i.e., procedure, program, SQL script, or Unix shell script, is tested.
This test is performed by the developer.
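A short sketch of what a unit test for one ETL module might look like is given below; the transform_amount function and its expected behaviour are illustrative assumptions, not part of any specific project.

```python
# Sketch of a unit test for a single ETL module: a small transformation function
# is tested in isolation by the developer.
import unittest

def transform_amount(record):
    """Convert the textual amount to a float and tag invalid rows."""
    try:
        return {**record, "amount": float(record["amount"]), "valid": True}
    except (KeyError, ValueError):
        return {**record, "valid": False}

class TransformAmountTest(unittest.TestCase):
    def test_valid_amount_is_converted(self):
        self.assertEqual(transform_amount({"amount": "12.50"})["amount"], 12.5)

    def test_invalid_amount_is_flagged(self):
        self.assertFalse(transform_amount({"amount": "N/A"})["valid"])

if __name__ == "__main__":
    unittest.main()
```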