Data Warehousing
Definitions
Data warehousing: Data warehousing is a process of collecting, storing, and managing data
from various sources in a centralized repository for analysis and reporting. It's designed to
provide a unified view of an organization's data, making it easier to understand, analyze, and
report on.
Data warehouse: A data warehouse is a central repository that stores structured and
semi-structured data from multiple sources, optimized for analysis and reporting to support
business intelligence.
Dimensional data warehouse: A dimensional data warehouse is a specialized database
designed to support business intelligence and data analytics applications. It's optimized for
efficient querying and analysis of large volumes of historical data. The key concept is to store
data in a denormalized, dimensional format, making it easier to understand, analyze, and
report on.
Key Concepts:
Fact Table: Stores quantitative data (measures) associated with a specific event or
transaction.
Dimension Tables: Store descriptive data (attributes) that provide context to the fact
table.
Star Schema: The most common design pattern, where the fact table is at the center, and
dimension tables radiate outwards.
Snowflake Schema: A variation of the star schema in which dimension tables are further
normalized into their own hierarchies.
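As a sketch of these concepts, the tables of a small star schema can be created with SQL. This uses Python's built-in sqlite3 module purely for illustration; all table and column names here are hypothetical, not prescribed by any standard.

```python
import sqlite3

# In-memory database for illustration only.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension tables hold descriptive attributes (context).
cur.execute("""CREATE TABLE dim_product (
    product_key INTEGER PRIMARY KEY,
    product_name TEXT,
    category TEXT)""")
cur.execute("""CREATE TABLE dim_date (
    date_key INTEGER PRIMARY KEY,
    full_date TEXT,
    year INTEGER,
    month INTEGER)""")

# The fact table stores the measures plus a foreign key to each dimension,
# forming the "center" of the star.
cur.execute("""CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    date_key INTEGER REFERENCES dim_date(date_key),
    quantity_sold INTEGER,
    revenue REAL)""")
```

The fact table carries only keys and numeric measures; everything descriptive lives in the dimensions, which is what makes the star shape.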
Subject-Oriented
A data warehouse targets the modeling and analysis of data for decision-makers.
Therefore, data warehouses typically provide a concise and straightforward view around a
particular subject, such as customer, product, or sales, rather than the organization's
ongoing operations as a whole. This is done by excluding data that is not useful for the
subject and including all data the users need to understand the subject.
Integrated
A data warehouse integrates various heterogeneous data sources like RDBMS, flat files, and
online transaction records. It requires performing data cleaning and integration during data
warehousing to ensure consistency in naming conventions, attribute types, and so on across
the different data sources.
Time-Variant
Historical information is kept in a data warehouse. For example, one can retrieve data from
3 months, 6 months, 12 months, or even further back. This contrasts with a transaction
system, where often only the most current data is kept.
Non-Volatile
The data warehouse is physically separate data storage, transformed from the source
operational RDBMS. Operational updates of data do not occur in the data warehouse,
i.e., update, insert, and delete operations are not performed. Data access usually requires
only two procedures: the initial loading of data and read access to the data. Therefore,
the DW does not require transaction processing, recovery, or concurrency-control
capabilities, which allows for a substantial speedup of data retrieval. Non-volatile means
that once entered into the warehouse, data should not change.
Components of a data warehouse
1. Source Data Components
Source data coming into the data warehouses may be grouped into four broad categories:
Production Data: This type of data comes from the different operational systems of the
enterprise. Based on the data requirements in the data warehouse, we choose segments of
the data from the various operational systems.
Internal Data: In each organization, users keep their own "private" spreadsheets,
reports, customer profiles, and sometimes even departmental databases. This is the internal
data, part of which could be useful in a data warehouse.
Archived Data: Operational systems are mainly intended to run the current business. In
every operational system, we periodically take the old data and store it in archived files.
External Data: Most executives depend on information from external sources for a large
percentage of the information they use, such as statistics relating to their industry
produced by external agencies.
2. Data Staging Component
After we have extracted data from various operational systems and external sources,
we have to prepare it for storage in the data warehouse. The extracted data coming
from several different sources must be changed, converted, and made ready in a format
that is suitable for querying and analysis.
We will now discuss the three primary functions that take place in the staging area.
Data Extraction: This method has to deal with numerous data sources. We have to
employ the appropriate techniques for each data source.
Data Transformation:
First, we clean the data extracted from each source. Cleaning may be the correction of
misspellings or may deal with providing default values for missing data elements, or
elimination of duplicates when we bring in the same data from various source systems.
Standardization of data elements forms a large part of data transformation. Data
transformation also involves many forms of combining pieces of data from different sources:
we may combine data from a single source record or related data parts from many source
records.
On the other hand, data transformation also includes purging source data that is not useful
and separating out source records into new combinations. Sorting and merging of data
take place on a large scale in the data staging area. When the data transformation function
ends, we have a collection of integrated data that is cleaned, standardized, and
summarized.
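A minimal sketch of the cleaning steps described above (correcting misspellings, supplying default values for missing elements, and eliminating duplicates from multiple source systems); the sample records and correction table are invented for illustration:

```python
# Hypothetical records extracted from two source systems.
records = [
    {"customer_id": 1, "country": "U.S.A", "segment": None},
    {"customer_id": 1, "country": "USA", "segment": "Retail"},   # duplicate of id 1
    {"customer_id": 2, "country": "Grmany", "segment": "Retail"},
]

# Standardize spellings and codes to one naming convention.
corrections = {"U.S.A": "USA", "Grmany": "Germany"}

cleaned = {}
for rec in records:
    rec = dict(rec)  # work on a copy
    rec["country"] = corrections.get(rec["country"], rec["country"])
    rec["segment"] = rec["segment"] or "Unknown"   # default for a missing element
    cleaned.setdefault(rec["customer_id"], rec)    # keep first, drop duplicates

result = list(cleaned.values())  # cleaned, standardized, de-duplicated records
```

Real ETL tools perform the same three operations at scale; the logic here is only meant to make each step concrete.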
Data Loading: Two distinct categories of tasks form data loading functions. When we
complete the structure and construction of the data warehouse and go live for the first
time, we do the initial loading of the information into the data warehouse storage. The
initial load moves high volumes of data and takes up a substantial amount of time.
3. Data Storage Components
Data storage for data warehousing is a separate repository. The data repositories for the
operational systems generally include only the current data, and they store data in a highly
normalized structure for fast and efficient processing.
6. Data Marts
A data mart includes a subset of corporate-wide data that is of value to a specific group of
users. Its scope is confined to particular selected subjects. Data in a data warehouse should
be fairly current, though not necessarily up to the minute, although developments in the data
warehouse industry have made standard and incremental data loads more achievable.
Data marts are smaller than data warehouses and are usually scoped to a single department
or business unit. The current trend in data warehousing is to develop a data warehouse with
several smaller related data marts for particular kinds of queries and reports.
Single-Tier Architecture
Pros:
Simplicity in design and implementation
Lower initial costs
Suitable for small organizations with limited data sources
Cons:
Limited scalability
Potential performance issues as data volumes grow
Lack of separation between storage and compute resources
Use Case: Small businesses or departments with straightforward reporting needs and limited
data sources.
Two-Tier Architecture
Pros:
Better scalability compared to single-tier
Improved performance by offloading analytical queries from operational systems
Allows for more complex transformations
Cons:
Increased complexity in design and maintenance
Potential for data latency between source systems and the warehouse
Use Case: Medium-sized organizations with multiple data sources and more complex analytical
needs.
Three-Tier Architecture
Bottom Tier: Includes source systems and the staging area for initial data extraction and
storage.
Middle Tier: Comprises the main data warehouse and potentially separate data marts.
Top Tier: Consists of query and analysis tools, reporting applications, and data mining tools.
Pros:
High scalability and flexibility
Clear separation of concerns between layers
Supports complex querying and analytics
Better performance for large-scale data processing
Cons:
More complex to design and implement
Higher initial costs
Requires more specialized skills to manage
Use Case: Large enterprises with diverse data sources, complex analytical requirements, and the
need for high scalability.
Hub-and-Spoke Architecture
This architecture combines a centralized data warehouse (the hub) with multiple subject-
specific data marts (the spokes). Data is first integrated and stored in the central warehouse,
then distributed to various data marts for specific departmental or functional needs.
Pros:
Balances centralized control with departmental flexibility
Improves query performance for specific business domains
Facilitates easier data governance and consistency
Cons:
Can lead to data redundancy
Requires careful coordination between the central warehouse and data marts
More complex ETL processes
Use Case: Organizations with distinct departmental data needs but requiring a single source of
truth.
Federated Architecture
In this model, data remains distributed across multiple sources, with a virtual layer providing a
unified view of the data. Instead of physically moving all data to a central repository, queries
are distributed across the various sources.
Pros:
Reduces data movement and storage costs
Provides real-time access to source data
Useful for organizations with regulatory constraints on data centralization
Cons:
Can have performance issues with complex queries
Requires sophisticated query optimization
Challenging to maintain data consistency across sources
Use Case: Organizations with strict data residency requirements or those needing real-time
access to operational data.
The choice of architecture depends on various factors including the organization’s size, data
volume, analytical needs, existing infrastructure, and budget. Many modern implementations
use a hybrid approach, combining elements from different architectures to create a solution
tailored to specific business requirements.
Challenges
Data quality is a critical concern in data warehousing. Ensuring data accuracy, consistency,
and completeness is essential for reliable analysis. Errors, inconsistencies, or missing values can
lead to incorrect insights and decisions.
Data governance refers to the policies, procedures, and standards that govern the use,
management, and protection of data. It helps ensure data quality, security, and compliance with
regulations. Lack of effective data governance can lead to data silos, inconsistencies, and
security risks.
Performance is another key challenge. Data warehouses often deal with large volumes of data,
and slow query performance can hinder analysis and reporting. Optimizing the data warehouse
for efficient querying, indexing, and partitioning is crucial.
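As one concrete illustration of query optimization, indexing the foreign-key columns of a fact table lets the database locate matching rows without scanning the whole table. The sketch below uses sqlite3 (names are hypothetical); production warehouses use the same idea alongside partitioning and columnar storage.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE fact_sales (date_key INTEGER, product_key INTEGER, revenue REAL)")

# Index the foreign-key columns that queries filter and join on.
cur.execute("CREATE INDEX ix_fact_sales_date ON fact_sales(date_key)")
cur.execute("CREATE INDEX ix_fact_sales_product ON fact_sales(product_key)")

# EXPLAIN QUERY PLAN shows whether the optimizer uses the index
# instead of a full table scan for this filter.
plan = cur.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT SUM(revenue) FROM fact_sales WHERE date_key = 20240101"
).fetchall()
```

Inspecting the plan before and after adding an index is a simple way to verify that an optimization actually changed the access path.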
Scalability is the ability of a data warehouse to handle growth in data volume and complexity.
As organizations expand and generate more data, the data warehouse must be able to
accommodate the increased load without sacrificing performance.
Cost is a significant factor to consider. Data warehousing involves investments in hardware,
software, personnel, and ongoing maintenance. Balancing the need for a powerful data
warehouse with cost constraints is essential.
For instance, suppose a business wants to analyze sales data. In that case, the dimensions could
include customers, products, regions, and time, while the facts could be the number of products
sold, the total revenue generated, and the profit earned.
The data is then structured into a star or snowflake schema, with the fact table at the center and
the dimension tables connected via foreign keys. Each dimension table contains descriptive
attributes that describe a specific aspect of the fact table.
Star Schema
The star schema is the simplest and most common dimensional modeling technique. In a star
schema, the fact table is at the center and connected via foreign key(s) to the dimension tables.
The fact table contains the numerical values or metrics being analyzed, while the dimension
tables have the attributes that describe the data.
For instance, in the sales data example mentioned earlier, the fact table could contain the total
revenue generated and the profit earned. In contrast, the dimension tables could have
attributes such as customer name, product name, region, and time.
The star schema is a straightforward and efficient method of dimensional modeling that is easy
to understand and use. It is suitable for data warehouses that require fast and efficient queries.
Snowflake Schema
The snowflake schema is a more complex dimensional modeling technique used when there are
multiple levels of granularity within a dimension. In a snowflake schema, the dimension tables
are normalized, meaning they are split into multiple tables to reduce data redundancy. This
normalization results in a more complex schema that resembles a snowflake, hence the name.
For instance, the customer dimension table could be normalized in the sales data example to
include separate tables for customer and address information.
The snowflake schema suits large, complex data warehouses requiring extensive data analysis
and reporting. However, it can be more challenging to use and maintain than the star schema.
Dimension
Dimensions are the descriptive data elements that are used to categorize or classify the data. For
example, in a sales data warehouse, the dimensions might include product, customer, time, and
location. Each dimension is made up of a set of attributes that describe the dimension. For
example, the product dimension might include attributes such as product name, product
category, and product price.
Attributes
Characteristics of a dimension in data modeling are known as attributes. These are used to
filter and search facts. For a location dimension, attributes can be state, country, zip code,
etc.
Fact Table
In a dimensional data model, the fact table is the central table that contains the measures or
metrics of interest, surrounded by the dimension tables that describe the attributes of the
measures. The dimension tables are related to the fact table through foreign key relationships.
Dimension Table
The dimension table describes the dimensions of a fact and is joined to the fact table by a
foreign key. Dimension tables are typically denormalized tables. A dimension can
participate in one or more relationships.
Steps to Create Dimensional Data Modeling
Step-1: Identifying the business objective: The first step is to identify the business objective.
Sales, HR, Marketing, etc. are some examples of organizational needs. Since this is the
most important step of data modeling, the selection of the business objective also depends
on the quality of data available for that process.
Step-2: Identifying Granularity: Granularity is the lowest level of information stored in the
table. The grain describes the level of detail captured for the business problem and its
solution.
Step-3: Identifying Dimensions and their Attributes: Dimensions are objects or things.
Dimensions categorize and describe data warehouse facts and measures in a way that supports
meaningful answers to business questions. A data warehouse organizes descriptive attributes as
columns in dimension tables. For example, the date dimension may contain attributes such
as year, month, and weekday.
Step-4: Identifying the Fact: The fact table holds the measurable data. Most fact table
columns contain numeric values such as price or cost per unit.
Step-5: Building of Schema: We implement the Dimension Model in this step. A schema is a
database structure.
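The five steps above can be sketched end to end for the sales example. Everything here is a hypothetical illustration built with sqlite3; note how the chosen grain (one row per product per day) is enforced directly in the schema:

```python
import sqlite3

# Step 1: business objective -> analyze sales (hypothetical example).
# Step 2: granularity       -> one fact row per product per day.
# Step 3: dimensions        -> date and product, with their attributes.
# Step 4: facts             -> units_sold and revenue (numeric measures).
# Step 5: build the schema.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY,
                       year INTEGER, month INTEGER, weekday TEXT);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY,
                          product_name TEXT, category TEXT);
CREATE TABLE fact_daily_sales (
    date_key INTEGER REFERENCES dim_date(date_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    units_sold INTEGER,
    revenue REAL,
    PRIMARY KEY (date_key, product_key)  -- enforces the chosen grain
);
""")
```

Declaring the grain as the fact table's primary key means a second row for the same product and day is rejected, which keeps the model faithful to Step 2.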