
Data Warehousing

The document provides an overview of dimensional data warehousing, defining key concepts such as data warehouses, fact and dimension tables, and various schemas like star and snowflake. It discusses the need for data warehousing for effective analytics, its components, architectures, and the benefits and challenges associated with it. Additionally, it covers dimensional modeling techniques and the importance of data quality, governance, and performance in data warehousing.

Uploaded by

stonecode254

CHAPTER ASSIGNED: DIMENSIONAL DATA WAREHOUSE AND DATA WAREHOUSING
Definitions
Data warehousing: Data warehousing is a process of collecting, storing, and managing data
from various sources in a centralized repository for analysis and reporting. It's designed to
provide a unified view of an organization's data, making it easier to understand, analyze, and
report on.
Data warehouse: A central repository that stores structured and semi-structured data from
multiple sources, optimized for analysis and reporting to support business intelligence.
Dimensional data warehouse: A dimensional data warehouse is a specialized database
designed to support business intelligence and data analytics applications. It's optimized for
efficient querying and analysis of large volumes of historical data. The key concept is to store
data in a denormalized format, organized into facts and dimensions, making it easier to
understand, analyze, and report on.

Key Concepts:
• Fact Table: Stores quantitative data (measures) associated with a specific event or
transaction.
• Dimension Tables: Store descriptive data (attributes) that provide context to the fact
table.
• Star Schema: The most common design pattern, where the fact table is at the center and
dimension tables radiate outwards.
• Snowflake Schema: A variation of the star schema in which dimension tables are normalized
into additional related tables, for example to represent hierarchies.

Need for data warehouse


An operational database is usually built for one specific purpose and often holds megabytes to
gigabytes of data. As data volumes grow to terabytes, storage for analysis typically shifts to a
data warehouse. In addition, a transactional database does not lend itself to analytics. To
perform analytics effectively, an organization keeps a central data warehouse so it can study
its business closely by organizing, understanding, and using its historical data to analyze
trends and make strategic decisions.
What is a data warehouse used for?
Cloud data warehousing offers a range of solutions that can benefit an organization. Here are
some of the most common data warehouse use cases:
• Making real-time decisions: Analyze data in real time to address challenges proactively,
identify opportunities, gain efficiency, reduce costs, and respond to business events.
• Consolidating siloed data: Quickly pull data from multiple structured sources across
your organization, such as point-of-sale systems, websites, and email lists, and bring it
together into one location so that you can perform analysis and get insights.
• Enabling business reporting and ad hoc analysis: Keep historical data on a separate
server from operational data so that end users can access it and run their own queries and
reports without impacting the performance of operational systems or waiting to get help
from IT.
• Implementing machine learning and AI: Collect historical and real-time data to
develop algorithms that can provide predictive insights, such as anticipating traffic spikes
or suggesting relevant products to a customer browsing a website.

Key features of a data warehouse

Subject-Oriented
A data warehouse focuses on the modeling and analysis of data for decision-makers.
Therefore, data warehouses typically provide a concise and straightforward view around a
particular subject, such as customer, product, or sales, rather than the organization's
ongoing operations as a whole. This is done by excluding data that is not useful for the subject
and including all data needed by the users to understand the subject.

Integrated
A data warehouse integrates various heterogeneous data sources like RDBMS, flat files, and
online transaction records. Data cleaning and integration are performed during data
warehousing to ensure consistency in naming conventions, attribute types, and so on among
different data sources.

Time-Variant
Historical information is kept in a data warehouse. For example, one can retrieve data from
3 months, 6 months, or 12 months ago, or even older data, from a data warehouse. This
contrasts with a transaction system, where often only the most current data is kept.

Non-Volatile
The data warehouse is a physically separate data store whose contents are transformed from
the source operational RDBMS. Operational updates of data do not occur in the data
warehouse, i.e., update, insert, and delete operations are not performed. It usually requires
only two procedures in data accessing: initial loading of data and access to data. Therefore,
the DW does not require transaction processing, recovery, and concurrency capabilities,
which allows for substantial speedup of data retrieval. Non-volatile means that once data
has entered the warehouse, it should not change.
Components of a data warehouse
1. Source Data Components
Source data coming into the data warehouses may be grouped into four broad categories:

Production Data: This type of data comes from the different operational systems of the
enterprise. Based on the data requirements in the data warehouse, we choose segments of
the data from the various operational systems.
Internal Data: In each organization, users keep their "private" spreadsheets,
reports, customer profiles, and sometimes even departmental databases. This is the internal
data, part of which could be useful in a data warehouse.
Archived Data: Operational systems are mainly intended to run the current business. In
every operational system, we periodically take the old data and store it in archived files.
External Data: Most executives depend on information from external sources for a large
percentage of the information they use. They use statistics relating to their industry
produced by external organizations.
2. Data Staging Component
After we have extracted data from various operational systems and external sources,
we have to prepare the files for storing in the data warehouse. The extracted data coming
from several different sources needs to be changed, converted, and made ready in a format
that is suitable for querying and analysis.
We will now discuss the three primary functions that take place in the staging area.
Data Extraction: This function has to deal with numerous data sources. We have to
employ the appropriate techniques for each data source.
Data Transformation:
First, we clean the data extracted from each source. Cleaning may be the correction of
misspellings or may deal with providing default values for missing data elements, or
elimination of duplicates when we bring in the same data from various source systems.
Standardization of data components forms a large part of data transformation. Data
transformation contains many forms of combining pieces of data from different sources.
We combine data from a single source record or related data parts from many source
records.

On the other hand, data transformation also involves purging source data that is not useful
and separating out source records into new combinations. Sorting and merging of data
take place on a large scale in the data staging area. When the data transformation function
ends, we have a collection of integrated data that is cleaned, standardized, and
summarized.
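
The cleaning steps described above (correcting misspellings, supplying defaults for missing
values, removing duplicates, standardizing) can be sketched in plain Python. This is only an
illustration: the record layout, the CITY_FIXES lookup, and the default country value are
hypothetical and do not come from any particular tool.

```python
# Minimal sketch of staging-area cleaning; field names and lookups are assumptions.
CITY_FIXES = {"nairobbi": "Nairobi", "mombassa": "Mombasa"}  # hypothetical misspelling map
DEFAULT_COUNTRY = "Unknown"  # hypothetical default for missing values


def clean_records(records):
    """Standardize names, fix misspellings, fill defaults, drop duplicate customer ids."""
    seen_ids = set()
    cleaned = []
    for rec in records:
        customer_id = rec["customer_id"]
        if customer_id in seen_ids:
            continue  # the same customer arrived from more than one source system
        seen_ids.add(customer_id)
        city = (rec.get("city") or "").strip()
        city = CITY_FIXES.get(city.lower(), city.title())
        cleaned.append({
            "customer_id": customer_id,
            "name": rec["name"].strip().title(),
            "city": city,
            "country": rec.get("country") or DEFAULT_COUNTRY,
        })
    return cleaned


raw = [
    {"customer_id": 1, "name": " alice wanjiru ", "city": "nairobbi", "country": "Kenya"},
    {"customer_id": 2, "name": "BOB OTIENO", "city": "mombassa", "country": None},
    {"customer_id": 1, "name": "Alice Wanjiru", "city": "Nairobi", "country": "Kenya"},
]
staged = clean_records(raw)
```

Here the first record seen for a customer id wins; a real staging area might instead merge
the duplicates or prefer the most trusted source.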

Data Loading: Two distinct categories of tasks form the data loading function. When we
complete the structure and construction of the data warehouse and go live for the first
time, we do the initial loading of the information into the data warehouse storage. The
initial load moves high volumes of data and uses up a substantial amount of time. After
that, incremental loads apply ongoing changes from the source systems.
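
As a rough sketch of the loading function, the example below uses Python's built-in sqlite3
module as a stand-in for warehouse storage. The sales table, its columns, and the rule "load
only rows newer than the latest date already stored" are assumptions made for illustration;
the second kind of load is assumed here to be an incremental load.

```python
import sqlite3

# Sketch only: table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, sale_date TEXT, amount REAL)")


def initial_load(conn, rows):
    """Bulk-insert the full history when the warehouse goes live."""
    with conn:
        conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)


def incremental_load(conn, rows):
    """Insert only rows newer than the latest date already loaded."""
    (last_date,) = conn.execute("SELECT MAX(sale_date) FROM sales").fetchone()
    new_rows = [r for r in rows if last_date is None or r[1] > last_date]
    with conn:
        conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", new_rows)
    return len(new_rows)


history = [(1, "2024-01-05", 120.0), (2, "2024-01-06", 80.5)]
initial_load(conn, history)

# A later extract overlaps the history; only the new row should be added.
extract = [(2, "2024-01-06", 80.5), (3, "2024-01-07", 42.0)]
added = incremental_load(conn, extract)
total_rows = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
```

Comparing ISO-formatted date strings works because they sort in calendar order; a production
load would more likely track a high-water mark or change log per source.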
3. Data Storage Components
Data storage for the data warehouse is a separate repository. The data repositories for the
operational systems generally include only the current data. Also, these data repositories
hold data structured in a highly normalized form for fast and efficient transaction processing.

4. Information Delivery Component


The information delivery element is used to enable the process of subscribing to data
warehouse files and having them transferred to one or more destinations according to some
customer-specified scheduling algorithm.
5. Metadata Component
Metadata in a data warehouse is similar to the data dictionary or the data catalog in a
database management system. In the data dictionary, we keep the data about the logical
data structures, the data about the records and addresses, the information about the
indexes, and so on.

6. Data Marts
A data mart includes a subset of corporate-wide data that is of value to a specific group of
users. The scope is confined to particular selected subjects. Data in a data warehouse should be
fairly current, but not necessarily up to the minute, although developments in the data
warehouse industry have made regular and incremental data dumps more achievable.
Data marts are smaller than data warehouses and usually contain data for one part of the
organization. The current trend in data warehousing is to develop a data warehouse with
several smaller related data marts for particular kinds of queries and reports.

7. Management and Control Component


The management and control elements coordinate the services and functions within the
data warehouse. These components control the data transformation and the data transfer
into the data warehouse storage. They also moderate the data delivery to the
clients. They work with the database management systems and enable data to be
correctly saved in the repositories. They monitor the movement of information into the
staging area and from there into the data warehouse storage itself.
Data warehousing architectures
Single-Tier Data Warehouse Architecture
This is the simplest form, where the data warehouse acts as a centralized repository for all data.
In this architecture, data is extracted from source systems, transformed, and loaded directly into
the data warehouse, which also serves as the platform for querying and analysis.

Pros:
Simplicity in design and implementation
Lower initial costs
Suitable for small organizations with limited data sources
Cons:
Limited scalability
Potential performance issues as data volumes grow
Lack of separation between storage and compute resources
Use Case: Small businesses or departments with straightforward reporting needs and limited
data sources.

Two-Tier Data Warehouse Architecture


In this model, the data warehouse is separated from the source systems, creating two distinct
layers. Data is extracted from source systems, transformed, and then loaded into the data
warehouse. The warehouse itself handles both storage and querying.

Pros:
Better scalability compared to single-tier
Improved performance by offloading analytical queries from operational systems
Allows for more complex transformations
Cons:
Increased complexity in design and maintenance
Potential for data latency between source systems and the warehouse
Use Case: Medium-sized organizations with multiple data sources and more complex analytical
needs.

Three-Tier Data Warehouse Architecture


This architecture consists of a bottom tier (data source layer), a middle tier (data warehouse
layer), and a top tier (client or BI tools layer). It provides high scalability, performance, and
integration with advanced analytics tools.

Bottom Tier: Includes source systems and the staging area for initial data extraction and
storage.
Middle Tier: Comprises the main data warehouse and potentially separate data marts.
Top Tier: Consists of query and analysis tools, reporting applications, and data mining tools.
Pros:
High scalability and flexibility
Clear separation of concerns between layers
Supports complex querying and analytics
Better performance for large-scale data processing
Cons:
More complex to design and implement
Higher initial costs
Requires more specialized skills to manage
Use Case: Large enterprises with diverse data sources, complex analytical requirements, and the
need for high scalability.

Hub-and-Spoke Architecture
This architecture combines a centralized data warehouse (the hub) with multiple subject-
specific data marts (the spokes). Data is first integrated and stored in the central warehouse,
then distributed to various data marts for specific departmental or functional needs.
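
The hub-to-spoke flow can be sketched with Python's built-in sqlite3 module. Everything named
here (the warehouse_sales table, the mart_ prefix, the build_mart helper) is hypothetical and
only illustrates the idea of deriving a departmental mart from the central store.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hub: central warehouse table (hypothetical layout).
conn.execute(
    "CREATE TABLE warehouse_sales (sale_id INTEGER, department TEXT, region TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO warehouse_sales VALUES (?, ?, ?, ?)",
    [(1, "retail", "East", 100.0), (2, "online", "West", 250.0), (3, "retail", "West", 75.0)],
)


def build_mart(conn, department):
    """Create (or rebuild) a spoke: a data mart holding one department's rows."""
    mart = f"mart_{department}"  # department is a trusted constant here, not user input
    conn.execute(f"DROP TABLE IF EXISTS {mart}")
    conn.execute(f"CREATE TABLE {mart} (sale_id INTEGER, region TEXT, amount REAL)")
    conn.execute(
        f"INSERT INTO {mart} SELECT sale_id, region, amount "
        "FROM warehouse_sales WHERE department = ?",
        (department,),
    )
    return mart


retail_mart = build_mart(conn, "retail")
online_mart = build_mart(conn, "online")
retail_total = conn.execute(f"SELECT SUM(amount) FROM {retail_mart}").fetchone()[0]
online_rows = conn.execute(f"SELECT COUNT(*) FROM {online_mart}").fetchone()[0]
```

Because each mart copies rows out of the hub, the same sale exists in two places, which is the
data redundancy listed among the drawbacks of this architecture.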

Pros:
Balances centralized control with departmental flexibility
Improves query performance for specific business domains
Facilitates easier data governance and consistency
Cons:
Can lead to data redundancy
Requires careful coordination between the central warehouse and data marts
More complex ETL processes
Use Case: Organizations with distinct departmental data needs but requiring a single source of
truth.

Federated Architecture
In this model, data remains distributed across multiple sources, with a virtual layer providing a
unified view of the data. Instead of physically moving all data to a central repository, queries
are distributed across the various sources.
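
A toy version of the virtual layer, assuming two independent source databases that happen to
share an orders table: the layer sends the same aggregate query to each source and merges the
partial results, without copying rows into a central store. The source names and schema are
hypothetical, and in-memory SQLite databases stand in for remote systems.

```python
import sqlite3


def make_source(rows):
    """Build one independent source database (stand-in for a remote system)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    return conn


sources = {
    "crm": make_source([("East", 100.0), ("West", 50.0)]),
    "webshop": make_source([("East", 25.0), ("North", 10.0)]),
}


def revenue_by_region(sources):
    """Virtual layer: push the same aggregate to every source, then merge the results."""
    combined = {}
    query = "SELECT region, SUM(amount) FROM orders GROUP BY region"
    for conn in sources.values():
        for region, total in conn.execute(query):
            combined[region] = combined.get(region, 0.0) + total
    return combined


unified = revenue_by_region(sources)
```

Real federation layers also have to reconcile differing schemas and optimize queries across
sources, which this sketch leaves out.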

Pros:
Reduces data movement and storage costs
Provides real-time access to source data
Useful for organizations with regulatory constraints on data centralization
Cons:
Can have performance issues with complex queries
Requires sophisticated query optimization
Challenging to maintain data consistency across sources
Use Case: Organizations with strict data residency requirements or those needing real-time
access to operational data.

The choice of architecture depends on various factors including the organization’s size, data
volume, analytical needs, existing infrastructure, and budget. Many modern implementations
use a hybrid approach, combining elements from different architectures to create a solution
tailored to specific business requirements.

Types of Data Warehouses


Enterprise Data Warehouse (EDW): A centralized repository that integrates data from
various sources across an organization.
Data Mart: A smaller, focused data warehouse designed to serve a specific department or
business unit.
Operational Data Store (ODS): A temporary storage area for operational data before it's
loaded into the data warehouse.

Benefits of Data Warehousing


Improved Decision Making: Provides access to timely and accurate information for informed
decision-making.
Enhanced Reporting: Enables the creation of comprehensive and customizable reports.
Increased Efficiency: Streamlines data analysis and reporting processes.
Better Customer Insights: Helps understand customer behavior, preferences, and trends.
Enhanced Operational Efficiency: Optimizes business processes and resource allocation.
Challenges and Considerations
• Data Quality: Ensuring data accuracy, consistency, and completeness.
• Data Governance: Establishing policies and procedures for data management.
• Performance: Optimizing the data warehouse for efficient query performance.
• Scalability: Accommodating growth in data volume and complexity.
• Cost: Managing the costs associated with data storage, processing, and maintenance.

Challenges
Data quality is a critical concern in data warehousing. Ensuring data accuracy, consistency,
and completeness is essential for reliable analysis. Errors, inconsistencies, or missing values can
lead to incorrect insights and decisions.
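
As one possible illustration, a simple check can count missing values and duplicate keys in a
batch before it is loaded. The function name, field names, and sample batch below are
hypothetical.

```python
def quality_report(rows, required_fields, key_field):
    """Count missing required values and duplicate keys in a batch of records."""
    missing = sum(
        1 for row in rows for field in required_fields if row.get(field) in (None, "")
    )
    keys = [row.get(key_field) for row in rows]
    duplicates = len(keys) - len(set(keys))
    return {"rows": len(rows), "missing_values": missing, "duplicate_keys": duplicates}


batch = [
    {"order_id": 1, "customer": "Alice", "amount": 10.0},
    {"order_id": 2, "customer": "", "amount": 5.0},      # missing customer
    {"order_id": 2, "customer": "Bob", "amount": None},  # duplicate key, missing amount
]
report = quality_report(batch, required_fields=["customer", "amount"], key_field="order_id")
```

A load process could reject or quarantine a batch whose report exceeds agreed thresholds.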
Data governance refers to the policies, procedures, and standards that govern the use,
management, and protection of data. It helps ensure data quality, security, and compliance with
regulations. Lack of effective data governance can lead to data silos, inconsistencies, and
security risks.
Performance is another key challenge. Data warehouses often deal with large volumes of data,
and slow query performance can hinder analysis and reporting. Optimizing the data warehouse
for efficient querying, indexing, and partitioning is crucial.
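
The effect of indexing can be observed on a small scale with SQLite's EXPLAIN QUERY PLAN. This
is a sketch under assumptions: the fact_sales table, its columns, and the index name are
hypothetical, and the exact plan wording depends on the SQLite version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (date_key INTEGER, product_key INTEGER, revenue REAL)")
conn.executemany(
    "INSERT INTO fact_sales VALUES (?, ?, ?)",
    [(20240101 + d, p, 10.0 * p) for d in range(30) for p in range(1, 6)],
)


def query_plan(conn, sql):
    """Return SQLite's plan description for a query as one string."""
    return " | ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))


sql = "SELECT SUM(revenue) FROM fact_sales WHERE date_key = 20240115"
plan_before = query_plan(conn, sql)  # no index yet, so the table is scanned
conn.execute("CREATE INDEX idx_sales_date ON fact_sales (date_key)")
plan_after = query_plan(conn, sql)   # the plan should now mention idx_sales_date
```

Printing plan_before and plan_after shows the switch from a table scan to an index search.
Partitioning is not shown because SQLite has no built-in table partitioning.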
Scalability is the ability of a data warehouse to handle growth in data volume and complexity.
As organizations expand and generate more data, the data warehouse must be able to
accommodate the increased load without sacrificing performance.
Cost is a significant factor to consider. Data warehousing involves investments in hardware,
software, personnel, and ongoing maintenance. Balancing the need for a powerful data
warehouse with cost constraints is essential.

Addressing these challenges requires careful planning, implementation, and ongoing
management. Organizations must invest in data quality initiatives, establish effective data
governance frameworks, optimize performance, ensure scalability, and manage costs
effectively.
Dimensional modelling in data warehousing
Dimensional modeling is a data modeling technique used in data warehousing that allows
businesses to structure data to optimize analysis and reporting. This method involves organizing
data into dimensions and facts, where dimensions are used to describe the data, and facts are
used to quantify the data.

For instance, suppose a business wants to analyze sales data. In that case, the dimensions could
include customers, products, regions, and time, while the facts could be the number of products
sold, the total revenue generated, and the profit earned.
The data is then structured into a star or snowflake schema, with the fact table at the center and
the dimension tables connected via foreign keys. Each dimension table contains descriptive
attributes that describe a specific aspect of the fact table.

Dimensional Modeling Techniques


There are two primary techniques used in dimensional modeling:

Star Schema
The star schema is the simplest and most common dimensional modeling technique. In a star
schema, the fact table is at the center and connected via foreign key(s) to the dimension tables.
The fact table contains the numerical values or metrics being analyzed, while the dimension
tables have the attributes that describe the data.

For instance, in the sales data example mentioned earlier, the fact table could contain the total
revenue generated and the profit earned. In contrast, the dimension tables could have the
attributes such as customer name, product name, region, and time.

The star schema is a straightforward and efficient method of dimensional modeling that is easy
to understand and use. It is suitable for data warehouses that require fast and efficient queries.
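
A minimal star schema for the sales example might look like the sketch below, built with
Python's sqlite3 module. The table names, key columns, and sample rows are assumptions, and the
time dimension is left out to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, customer_name TEXT);
CREATE TABLE dim_product  (product_key  INTEGER PRIMARY KEY, product_name  TEXT);
CREATE TABLE dim_region   (region_key   INTEGER PRIMARY KEY, region_name   TEXT);
-- Fact table: one foreign key per dimension plus the numeric measures.
CREATE TABLE fact_sales (
    customer_key INTEGER REFERENCES dim_customer (customer_key),
    product_key  INTEGER REFERENCES dim_product (product_key),
    region_key   INTEGER REFERENCES dim_region (region_key),
    revenue REAL,
    profit  REAL
);
INSERT INTO dim_customer VALUES (1, 'Alice'), (2, 'Bob');
INSERT INTO dim_product  VALUES (1, 'Laptop'), (2, 'Phone');
INSERT INTO dim_region   VALUES (1, 'East'), (2, 'West');
INSERT INTO fact_sales VALUES
    (1, 1, 1, 1200.0, 300.0),
    (2, 2, 1,  800.0, 200.0),
    (1, 2, 2,  750.0, 150.0);
""")

# A typical star query: join the fact table to one dimension and aggregate.
revenue_by_region = conn.execute("""
    SELECT r.region_name, SUM(f.revenue), SUM(f.profit)
    FROM fact_sales AS f
    JOIN dim_region AS r ON r.region_key = f.region_key
    GROUP BY r.region_name
    ORDER BY r.region_name
""").fetchall()
```

Each analytical question needs only one join per dimension it uses, which is a large part of
why star queries stay simple.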
Snowflake Schema
The snowflake schema is a more complex dimensional modeling technique used when there are
multiple levels of granularity within a dimension. In a snowflake schema, the dimension tables
are normalized, meaning they are split into multiple tables to reduce data redundancy. This
normalization results in a more complex schema that resembles a snowflake, hence the name.

For instance, the customer dimension table could be normalized in the sales data example to
include separate tables for customer and address information.

The snowflake schema suits large, complex data warehouses requiring extensive data analysis
and reporting. However, it can be more challenging to use and maintain than the star schema.
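
The customer and address split mentioned above could be sketched as follows. Table and column
names are hypothetical. Compared with a star schema, reaching the city now takes two joins
instead of one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Address attributes are moved out of the customer dimension into their own table.
CREATE TABLE dim_address (address_key INTEGER PRIMARY KEY, city TEXT, country TEXT);
CREATE TABLE dim_customer (
    customer_key  INTEGER PRIMARY KEY,
    customer_name TEXT,
    address_key   INTEGER REFERENCES dim_address (address_key)
);
CREATE TABLE fact_sales (
    customer_key INTEGER REFERENCES dim_customer (customer_key),
    revenue REAL
);
INSERT INTO dim_address  VALUES (1, 'Nairobi', 'Kenya'), (2, 'Kampala', 'Uganda');
INSERT INTO dim_customer VALUES (1, 'Alice', 1), (2, 'Bob', 1), (3, 'Carol', 2);
INSERT INTO fact_sales   VALUES (1, 100.0), (2, 40.0), (3, 60.0), (1, 25.0);
""")

revenue_by_city = conn.execute("""
    SELECT a.city, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_customer AS c ON c.customer_key = f.customer_key
    JOIN dim_address  AS a ON a.address_key  = c.address_key
    GROUP BY a.city
    ORDER BY a.city
""").fetchall()
```

The city and country are stored once per address rather than once per customer, which is the
reduction in redundancy that normalization buys.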

Elements of Dimensional Data Model


Facts
Facts are the measurable data elements that represent the business metrics of interest. For
example, in a sales data warehouse, the facts might include sales revenue, units sold, and profit
margins. Each fact is associated with one or more dimensions, creating a relationship between
the fact and the descriptive data.

Dimension
Dimensions are the descriptive data elements that are used to categorize or classify the data. For
example, in a sales data warehouse, the dimensions might include product, customer, time, and
location. Each dimension is made up of a set of attributes that describe the dimension. For
example, the product dimension might include attributes such as product name, product
category, and product price.

Attributes
The characteristics of a dimension in data modeling are known as attributes. These are used to
filter, search facts, etc. For a dimension of location, attributes can be State, Country, Zipcode,
etc.

Fact Table
In a dimensional data model, the fact table is the central table that contains the measures or
metrics of interest, surrounded by the dimension tables that describe the attributes of the
measures. The dimension tables are related to the fact table through foreign key relationships.

Dimension Table
A dimension table describes the dimensions of a fact and is joined to the fact table by a
foreign key. Dimension tables are typically de-normalized tables. A dimension can have
one or more relationships.
Steps to Create Dimensional Data Modeling
Step-1: Identifying the business objective: The first step is to identify the business objective.
Sales, HR, Marketing, etc. are some examples, depending on the needs of the organization. This
is the most important step of data modeling, and the selection of business objectives also
depends on the quality of data available for that process.

Step-2: Identifying Granularity: Granularity is the lowest level of information stored in the
table. The grain describes the level of detail for the business problem and its solution.
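
To make the idea of grain concrete, the sketch below holds the same sales at two levels of
detail: one row per transaction, and one row per (date, product). The sample data and function
name are hypothetical.

```python
from collections import defaultdict

# Transaction grain: one row per individual sale (hypothetical sample data).
transactions = [
    {"date": "2024-03-01", "product": "Laptop", "units": 1, "revenue": 1200.0},
    {"date": "2024-03-01", "product": "Laptop", "units": 2, "revenue": 2400.0},
    {"date": "2024-03-01", "product": "Phone", "units": 1, "revenue": 800.0},
    {"date": "2024-03-02", "product": "Phone", "units": 3, "revenue": 2400.0},
]


def to_daily_product_grain(rows):
    """Roll transactions up to a coarser grain: one row per (date, product)."""
    totals = defaultdict(lambda: {"units": 0, "revenue": 0.0})
    for row in rows:
        key = (row["date"], row["product"])
        totals[key]["units"] += row["units"]
        totals[key]["revenue"] += row["revenue"]
    return dict(totals)


daily = to_daily_product_grain(transactions)
```

The roll-up can always be computed from the finer grain, but individual transactions cannot be
recovered from the daily totals, so the grain chosen at design time limits which questions the
warehouse can answer.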

Step-3: Identifying Dimensions and their Attributes: Dimensions are objects or things.
Dimensions categorize and describe data warehouse facts and measures in a way that supports
meaningful answers to business questions. A data warehouse organizes descriptive attributes as
columns in dimension tables. For example, the date dimension may contain data like year,
month, and weekday.
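
A date dimension of this kind can be generated rather than typed in. The sketch below assumes a
hypothetical row layout with an integer date_key in YYYYMMDD form, which is a common
convention but not something the text above prescribes.

```python
from datetime import date, timedelta


def build_date_dimension(start, end):
    """Return one row per calendar day between start and end (inclusive)."""
    rows = []
    current = start
    while current <= end:
        rows.append({
            "date_key": int(current.strftime("%Y%m%d")),  # surrogate key such as 20240229
            "year": current.year,
            "month": current.month,
            "weekday": current.strftime("%A"),  # weekday name in the current locale
        })
        current += timedelta(days=1)
    return rows


dim_date = build_date_dimension(date(2024, 2, 28), date(2024, 3, 1))
```

Fact rows would then store only the date_key and pick up year, month, and weekday through a
join.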

Step-4: Identifying the Fact: The measurable data is held by the fact table. Most fact table
columns hold numerical values such as price or cost per unit.

Step-5: Building of Schema: We implement the Dimension Model in this step. A schema is a
database structure.

Advantages of Dimensional Data Modeling

• Simplified Data Access: Dimensional data modeling enables users to easily access data
through simple queries, reducing the time and effort required to retrieve and analyze data.
• Enhanced Query Performance: The simple structure of dimensional data modeling allows
for faster query performance, particularly when compared to highly normalized relational
data models.
• Increased Flexibility: Dimensional data modeling allows for more flexible data analysis,
as users can quickly and easily explore relationships between data.
• Improved Data Quality: Dimensional data modeling can improve data quality by
reducing redundancy and inconsistencies in the data.
• Easy to Understand: Dimensional data modeling uses simple, intuitive structures that are
easy to understand, even for non-technical users.

Disadvantages of Dimensional Data Modeling

• Limited Complexity: Dimensional data modeling may not be suitable for very complex
data relationships, as it relies on simple structures to organize data.
• Limited Integration: Dimensional data modeling may not integrate well with other data
models, particularly those that rely on normalization techniques.
• Limited Scalability: Dimensional data modeling may not be as scalable as other data
modeling techniques, particularly for very large datasets.
• Limited History Tracking: Dimensional data modeling may not track changes to dimension
attributes over time unless the model is explicitly designed to keep that history.
