Ch4 DW Detailed Version

BI

Uploaded by

rymachayeb

Data warehousing

Outline
Part 1:
I. Introduction to Data Warehousing
II. Architecture of Data Warehousing
III. Design and Modeling in Data Warehousing

Part 2:

IV. ETL Processes in Data Warehousing


A. Extracting Data
1. Data Extraction Techniques
2. Data Profiling
B. Transforming Data
1. Data Cleaning and Quality
2. Data Integration
C. Loading Data
Data warehouse: Definition
A data warehouse is a centralized repository that integrates and stores large volumes
of structured, historical data from various sources within an organization.

It is designed to support business intelligence (BI) activities, including reporting,
analysis, and decision-making processes.

Data warehouses provide a consolidated view of an organization's data, allowing users
to analyze trends, identify patterns, and gain valuable insights that can inform
strategic and operational decisions.

Data warehouses play a crucial role in business intelligence by providing
decision-makers with a unified and consistent view of historical data.
Key characteristics
Key characteristics of a data warehouse include:

• Subject-Oriented: Data warehouses are organized around specific business
subjects or areas, such as sales, finance, or customer relations, to support
analytical queries and reporting within those domains.
• Integrated Data: Data from disparate sources, such as transactional databases,
spreadsheets, and external systems, is integrated and transformed to ensure
consistency and coherence in the warehouse. This integration process is often
facilitated through ETL (Extract, Transform, Load) procedures.
• Time-Variant: Data in a data warehouse is time-stamped, allowing users to
analyze trends and changes over time. This time-variant aspect enables historical
analysis and reporting.
• Non-Volatile: Unlike operational databases that are frequently updated with
transactional data, a data warehouse is non-volatile. Once data is loaded into
the warehouse, it is typically not updated or deleted, ensuring a stable
environment for analytical processing.
• Optimized for Query and Reporting: Data warehouses are structured and
indexed for efficient querying and reporting. They often use denormalized
schemas, such as star or snowflake schemas, to simplify and accelerate
analytical queries.
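
The time-variant and non-volatile characteristics can be illustrated with a minimal sketch using Python's built-in sqlite3 module. The table customer_history, its columns, and both helper functions are hypothetical names chosen for illustration; they are not taken from any particular warehouse product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical history table: every row carries the date from which it is valid.
conn.execute(
    "CREATE TABLE customer_history (customer_id INTEGER, city TEXT, valid_from TEXT)"
)

def record_change(customer_id, city, valid_from):
    """Non-volatile: append a new time-stamped row instead of updating the old one."""
    conn.execute(
        "INSERT INTO customer_history VALUES (?, ?, ?)",
        (customer_id, city, valid_from),
    )

def city_as_of(customer_id, as_of):
    """Time-variant: return the city that was valid on the given ISO date, or None."""
    row = conn.execute(
        "SELECT city FROM customer_history "
        "WHERE customer_id = ? AND valid_from <= ? "
        "ORDER BY valid_from DESC LIMIT 1",
        (customer_id, as_of),
    ).fetchone()
    return row[0] if row else None

record_change(1, "Springfield", "2023-01-01")
record_change(1, "Riverside", "2024-06-01")
```

Because old rows are never overwritten, a query such as city_as_of(1, "2023-12-31") can still see the earlier state, which is what makes historical analysis possible.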

Data warehouse VS Database (1/3)

Purpose
• Data warehouse: Primarily designed for analytical processing and business
intelligence. It is optimized for complex queries and reporting.
• Database: Designed for transactional processing and day-to-day operations. Focus
is on efficient data retrieval, insertion, and updating.

Data Types
• Data warehouse: Stores large volumes of historical, structured data. Often
includes data from multiple sources within the organization.
• Database: Stores operational data, often in real-time. Primarily contains current
and frequently updated information.

Schema Design
• Data warehouse: Uses specialized schemas like star schema or snowflake schema for
efficient querying and reporting.
• Database: Typically uses normalized schemas to reduce redundancy and maintain data
integrity. Normalization helps in transactional processing.
Data warehouse VS Database (2/3)

Data Integration
• Data warehouse: Involves the integration of data from various sources using ETL
(Extract, Transform, Load) processes to ensure consistency and coherence.
• Database: May store data from a specific application or domain. Integration is
focused on maintaining consistency within the operational context.

Data Volatility
• Data warehouse: Non-volatile; historical data is stored and rarely updated.
Changes typically involve adding new data rather than modifying existing records.
• Database: Volatile; data is frequently updated and modified as part of ongoing
transactions.
Data warehouse VS Database (3/3)

Query Optimization
• Data warehouse: Optimized for complex queries.
• Database: Optimized for fast retrieval and updating of individual records.

User Base
• Data warehouse: Primarily used by analysts, data scientists, and decision-makers
for in-depth analysis, reporting, and business intelligence activities.
• Database: Used by application developers, system administrators, and operational
staff for day-to-day application support and transactional processing.

Data Processing
• Data warehouse: Online Analytical Processing (OLAP)
• Database: Online Transactional Processing (OLTP)
OLTP VS OLAP

Data warehouses are tailored for analytical processing, historical analysis, and
business intelligence, whereas databases are focused on supporting transactional
processing and day-to-day operations.
Main Components of a Data
Warehouse
A data warehouse comprises several components that work together to facilitate the
storage, integration, and retrieval of large volumes of data for analytical processing.
The main components of a data warehouse include:
1. Data Sources:
These are systems or applications that generate and store data. Data sources can
include operational databases, external data feeds, spreadsheets, and other
repositories.
2. ETL (Extract, Transform, Load) Processes:
ETL processes are responsible for extracting data from various sources,
transforming it to conform to the data warehouse's structure and quality
standards, and loading it into the data warehouse.
3. Data Warehouse Database:
The central repository that stores the integrated and transformed data. It is
optimized for analytical querying and reporting. Data warehouses often use
specialized database management systems (DBMS) designed for analytical
workloads.
4. Data Marts:
Data marts are subsets of the data warehouse that focus on specific business
functions or departments. They are often designed for the needs of a particular
group of users.
5. OLAP (Online Analytical Processing) Servers:
OLAP servers enable users to interactively analyze and explore data in a
multidimensional way. OLAP provides capabilities for slicing and dicing data,
drilling down into details, and performing complex analyses.
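
As a rough illustration of component 2 (ETL), the sketch below runs the three steps on a handful of in-memory records using only the Python standard library. The source rows, the sales table, and the cleaning rules (drop duplicates and rows with a missing amount, trim and capitalize names) are assumptions made for this example, not a prescribed ETL design.

```python
import sqlite3

# Hypothetical source rows, standing in for an export from an operational system.
source_rows = [
    {"order_id": "1", "customer": " alice ", "amount": "19.90"},
    {"order_id": "2", "customer": "BOB", "amount": "5.00"},
    {"order_id": "2", "customer": "BOB", "amount": "5.00"},  # duplicate record
    {"order_id": "3", "customer": "carol", "amount": ""},    # missing amount
]

def extract(rows):
    """Extract: read raw records from the source (here, an in-memory list)."""
    return list(rows)

def transform(rows):
    """Transform: drop duplicates and incomplete rows, normalize text and types."""
    seen, clean = set(), []
    for row in rows:
        if row["order_id"] in seen or not row["amount"]:
            continue
        seen.add(row["order_id"])
        clean.append(
            (int(row["order_id"]), row["customer"].strip().title(), float(row["amount"]))
        )
    return clean

def load(conn, rows):
    """Load: append the cleaned rows to the (hypothetical) warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(conn, transform(extract(source_rows)))
loaded = conn.execute(
    "SELECT order_id, customer, amount FROM sales ORDER BY order_id"
).fetchall()
```

Keeping the three steps as separate functions mirrors the Extract, Transform, Load split: each step can be replaced (a different source, stricter quality rules, another target) without touching the others.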
Design and Modeling in Data Warehousing
Data warehouse modeling involves designing the structure and organization of data
within a data warehouse to facilitate efficient querying, reporting, and analysis.

The goal is to provide a clear and optimized representation of data that supports
business intelligence and decision-making.

Dimensional modeling is generally better suited to business intelligence (BI)
applications and data warehousing (DW) than the normalized models used for
transactional databases.

The key concepts in dimensional modeling are facts, dimensions, and attributes.

All these concepts can be organized in several ways, called schemas.


Dimensional modeling overview

The fact table Tbl_Fact_Store_Sales is at the core of the dimensional model.

Four surrounding dimensions define and put into context the store sales:
• Tbl_Dim_Item, which is what products were sold
• Tbl_Dim_Date, which is when those products were sold
• Tbl_Dim_Customer, which is who bought the products
• Tbl_Dim_Buyer, which is who purchased the products for the store
Key concepts: Fact Tables

A fact is a measurement of a business activity, such as a business event or
transaction, and is generally numeric.
Examples of facts are sales, expenses, and inventory levels.
Fact tables are composed of two types of columns: keys and measures.
• The first type, the key columns, consists of a group of foreign keys (FK) that
point to the primary keys of the dimension tables associated with this fact
table to enable business analysis. The relationship between each dimension and
the fact table is one-to-many: one dimension row can be referenced by many
fact rows.
• The second type of column holds the actual measures of the business activity,
such as the sales revenue and order quantity. Every measurement has a
grain, which is the level of detail in the measurement of an event, such as the
unit of measure or currency used.
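
The split between key columns and measure columns can be sketched with Python's built-in sqlite3 module. The table names follow the overview slide (Tbl_Fact_Store_Sales, Tbl_Dim_Item, Tbl_Dim_Date); every column name and every data row below is an assumption made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked to
conn.executescript("""
CREATE TABLE Tbl_Dim_Item (item_key INTEGER PRIMARY KEY, item_name TEXT);
CREATE TABLE Tbl_Dim_Date (date_key INTEGER PRIMARY KEY, full_date TEXT);
CREATE TABLE Tbl_Fact_Store_Sales (
    sales_key      INTEGER PRIMARY KEY,  -- surrogate key of the fact row
    item_key       INTEGER NOT NULL REFERENCES Tbl_Dim_Item(item_key),  -- FK
    date_key       INTEGER NOT NULL REFERENCES Tbl_Dim_Date(date_key),  -- FK
    order_quantity INTEGER NOT NULL,     -- measure
    sales_revenue  REAL    NOT NULL      -- measure
);
INSERT INTO Tbl_Dim_Item VALUES (1, 'milk'), (2, 'cream');
INSERT INTO Tbl_Dim_Date VALUES (20010101, '2001-01-01');
INSERT INTO Tbl_Fact_Store_Sales VALUES
    (1, 1, 20010101, 3, 4.50),
    (2, 1, 20010101, 1, 1.50),
    (3, 2, 20010101, 2, 5.00);
""")

# A fact row pointing at a non-existent item is rejected by the foreign key.
try:
    conn.execute("INSERT INTO Tbl_Fact_Store_Sales VALUES (4, 99, 20010101, 1, 1.0)")
    fk_enforced = False
except sqlite3.IntegrityError:
    fk_enforced = True

# One-to-many: a single dimension row (item 1) is referenced by several fact rows.
rows_per_item = conn.execute(
    "SELECT item_key, COUNT(*) FROM Tbl_Fact_Store_Sales "
    "GROUP BY item_key ORDER BY item_key"
).fetchall()
```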


Fact Tables: Example

[Figure: example fact table. The primary key is a surrogate key, and the table holds several measures.]
Key concepts: Dimension

A dimension is an entity that establishes the business context for the measures
(facts) used by an enterprise.
Dimensions define the who, what, where, and why of the dimensional model,
and group similar attributes into a category or subject area. Examples of
dimensions are product, geography, customers, employees, and time. Whereas
facts are numeric, dimensions are descriptive in nature (although some of those
descriptions, such as a product list price, may be numeric).
Creating a dimension lets the attributes that describe the facts be stored in a
single place.
Dimension

Dimensions keep the database from being overrun with redundant data. With all
the attributes in a dimension table, they don’t have to be repeated in the fact
tables.
Example:
Take Amazon, for example. The data for an individual sale will contain the product
identification number, but will not repeat all the attributes of the product (color,
description, reviews, etc.). Those attributes are in a dimension, and each individual
sale of that product just points to them.
From a business perspective, the key purpose of dimensions is to use their
attributes to filter and analyze data based on performance measures.
Dimension
• Dimensions are used for
• Selection of data
• Grouping of data at the right level of detail
• Dimensions consist of dimension values
• Product dimension has values "milk", "cream", …
• Time dimension has values "1/1/2001", "2/1/2001", …
• Dimension values may have an ordering
• Used for comparing cube data across values
• Especially used for Time dimension
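
The three uses listed above (selection, grouping, and comparison across ordered values) can be sketched in plain Python. The fact tuples and their values are invented for illustration, and the dates are written in ISO format (an assumption) so that sorting them as text gives chronological order.

```python
from collections import defaultdict

# Hypothetical sales facts: (product, date, sales value).
facts = [
    ("milk",  "2001-01-01", 10.0),
    ("cream", "2001-01-01", 4.0),
    ("milk",  "2001-01-02", 6.0),
    ("cream", "2001-01-02", 5.0),
]

# Selection: keep only the facts whose Product dimension value is "milk".
milk_only = [fact for fact in facts if fact[0] == "milk"]

# Grouping: total sales per value of the Time dimension.
sales_by_date = defaultdict(float)
for product, date, value in facts:
    sales_by_date[date] += value

# Ordering: compare the totals of two consecutive Time values.
ordered_dates = sorted(sales_by_date)
change = sales_by_date[ordered_dates[1]] - sales_by_date[ordered_dates[0]]
```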

Dimension
• Dimensions have hierarchies with levels
• Typically 3-5 levels (of detail)
• Dimension values are organized in a tree structure
• Product: Product->Type->Category
• Store: Store->Area->City->County
• Time: Day->Month->Quarter->Year
• Dimensions have a bottom level and a top level
• Levels may have attributes
• Simple, non-hierarchical information
• Day has Workday as attribute
• Dimensions should contain a lot of information
• Time dimension may contain holiday, season, events, …
• Good dimensions have 50-100 or more attributes/levels
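
The Time hierarchy above (Day->Month->Quarter->Year) can be sketched as a function that maps a bottom-level value to its ancestors. The function name, the label formats, and the rule that Monday to Friday count as workdays are assumptions made for this example.

```python
from datetime import date

def time_levels(day: date) -> dict:
    """Map a bottom-level Day value to its ancestors in the Time hierarchy."""
    quarter = (day.month - 1) // 3 + 1
    return {
        "Day": day.isoformat(),
        "Month": f"{day.year}-{day.month:02d}",
        "Quarter": f"{day.year}-Q{quarter}",
        "Year": str(day.year),
        # Attribute of the Day level (simple, non-hierarchical information).
        # Assumption: Monday to Friday are workdays; holidays are ignored.
        "Workday": day.weekday() < 5,
    }

levels = time_levels(date(2001, 2, 1))
```

In a warehouse these values would usually be precomputed once per day and stored as columns of the Time dimension table rather than derived at query time.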
Dimensional model: Example

Example: sales of supermarkets


• Facts and measures
• Each sales record is a fact, and its sales value is a measure
• Dimensions
• Group correlated attributes into the same dimension
• Each sales record is associated with its values of Product, Store, and Time
Granularity: Dimensionality Hierarchy
• Granularity of facts is important
• Level of detail
• Given by combination of bottom levels
• A dimensional hierarchy defines mappings from a set of lower-level
concepts to higher level concepts.
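
A minimal sketch of how a hierarchy mapping changes the grain: facts recorded at the bottom levels (product, store, day) are rolled up to the coarser grain (product type, month). The mapping product_to_type, the fact rows, and the function name are hypothetical.

```python
from collections import defaultdict

# Hypothetical mapping from the bottom Product level to the next level up (Type).
product_to_type = {"whole milk": "milk", "skim milk": "milk", "cream": "cream"}

# Facts at the finest grain: one row per (product, store, day).
facts = [
    ("whole milk", "store 1", "2001-01-01", 3.0),
    ("skim milk",  "store 1", "2001-01-01", 2.0),
    ("cream",      "store 1", "2001-01-02", 4.0),
    ("whole milk", "store 1", "2001-02-01", 1.0),
]

def roll_up(rows):
    """Aggregate the measure to the coarser grain (product type, month)."""
    totals = defaultdict(float)
    for product, _store, day, value in rows:
        # day[:7] keeps "YYYY-MM", i.e. the Month level of an ISO date.
        totals[(product_to_type[product], day[:7])] += value
    return dict(totals)

by_type_month = roll_up(facts)
```

Rolling up is always possible from a fine grain, but the reverse is not: once only monthly totals are stored, the daily detail cannot be recovered, which is why the choice of grain matters.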

Data Warehouse Design

• A schema is a logical description of the entire database.

• An operational database typically uses a normalized relational model, while a
data warehouse uses a star, snowflake, or fact constellation schema.
Star Schema
In a star schema, there is a central fact table surrounded by dimension tables.

Each dimension in a star schema is represented by only one dimension table.

The fact table contains numerical measures (such as sales or revenue), and
dimension tables provide descriptive information about the measures.

Each dimension table contains a set of attributes.
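
A star schema query can be sketched with Python's built-in sqlite3 module: the fact table is joined once to each dimension, and the dimension attributes are used for grouping. The tables fact_sales, dim_product, and dim_store, their columns, and the data rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical star: one fact table, one table per dimension.
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_store   (store_key   INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE fact_sales  (
    product_key INTEGER REFERENCES dim_product(product_key),
    store_key   INTEGER REFERENCES dim_store(store_key),
    revenue     REAL
);
INSERT INTO dim_product VALUES
    (1, 'milk', 'dairy'), (2, 'cream', 'dairy'), (3, 'bread', 'bakery');
INSERT INTO dim_store VALUES (1, 'Springfield'), (2, 'Riverside');
INSERT INTO fact_sales VALUES (1, 1, 10.0), (2, 1, 4.0), (3, 2, 6.0), (1, 2, 5.0);
""")

# One join per dimension: the fact table is the hub, dimensions supply the labels.
star_query = """
SELECT p.category, s.city, SUM(f.revenue)
FROM fact_sales AS f
JOIN dim_product AS p ON p.product_key = f.product_key
JOIN dim_store   AS s ON s.store_key   = f.store_key
GROUP BY p.category, s.city
ORDER BY p.category, s.city
"""
revenue_by_category_city = conn.execute(star_query).fetchall()
```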

Star Schema: Example

Snowflake schema
Snowflake schema is an expanded version of a star schema in which dimension
tables are normalized into several related tables.

• Advantages

• Small saving in storage space

• Normalized structures are easier to update and maintain

• Disadvantages

• A schema that is less intuitive

• Browsing through the content is more difficult

• Degraded query performance because of the additional joins
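
The cost of the additional joins can be seen in a minimal sqlite3 sketch in which the product dimension is normalized into two tables. All table names, column names, and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Snowflaked product dimension: category attributes live in their own table.
CREATE TABLE dim_category (category_key INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE dim_product  (
    product_key  INTEGER PRIMARY KEY,
    name         TEXT,
    category_key INTEGER REFERENCES dim_category(category_key)
);
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    revenue     REAL
);
INSERT INTO dim_category VALUES (1, 'dairy'), (2, 'bakery');
INSERT INTO dim_product  VALUES (1, 'milk', 1), (2, 'cream', 1), (3, 'bread', 2);
INSERT INTO fact_sales   VALUES (1, 10.0), (2, 4.0), (3, 6.0);
""")

# Reaching the category name takes two joins; in a star schema, where the
# category is a column of the product dimension, one join would be enough.
snowflake_query = """
SELECT c.category_name, SUM(f.revenue)
FROM fact_sales AS f
JOIN dim_product  AS p ON p.product_key  = f.product_key
JOIN dim_category AS c ON c.category_key = p.category_key
GROUP BY c.category_name
ORDER BY c.category_name
"""
revenue_by_category = conn.execute(snowflake_query).fetchall()
```

The storage saving comes from writing each category name once in dim_category instead of repeating it on every product row.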
Snowflake schema: Example

Fact constellation schema
• A fact constellation has multiple fact tables. It is also known as a galaxy
schema.

• The following diagram shows two fact tables, namely Sales and Inventory.
From the Data Warehouse to Data Marts

• A data mart contains only those data that are specific to a particular
group. For example, the marketing data mart may contain only data
related to items, customers, and sales.
• Data marts are confined to subjects.
• Data marts are small in size.
• Data marts are customized by department.
The complete Decision Support System

DWH Architecture
Types of Data Warehousing Architectures
1. Centralized Data Warehouse: a single, unified repository that stores and
manages data from various sources within an organization. It serves as a
centralized and integrated platform for business intelligence and decision-making.

2. Data Marts: smaller, specialized subsets of a data warehouse that focus on
specific business areas, departments, or user groups. They are designed to meet
the needs of a particular set of users with common interests.

3. Federated Data Warehouse: an architecture that integrates data from multiple
independent data sources without physically consolidating the data into a central
repository. It enables distributed data access and processing.

4. Hybrid Data Warehouse: combines elements of both centralized and distributed
architectures. It may involve a mix of on-premises and cloud-based solutions, as
well as a combination of centralized and federated approaches.
Extraction, Transformation, Loading (ETL) tools
Data architecture VS Data modeling
• Data architecture applies to the higher-level view of how the enterprise
handles its data, such as how it is categorized, integrated, and stored.

• Data modeling applies to very specific and detailed rules about how pieces
of data are arranged in the database. Where data architecture is the
blueprint for your house, data modeling is the instructions for installing a
faucet.
Kimball Approach:
• Kimball emphasizes the use of dimensional modeling, creating star or
snowflake schemas. This approach focuses on designing the data
warehouse based on business processes and user requirements.

• Follows a bottom-up development approach, starting with the creation of data
marts that address immediate business requirements. These data marts are then
integrated to form the complete data warehouse.

• Kimball's approach involves the use of Extract, Transform, Load (ETL)
processes that are specifically designed for dimensional models. This ensures
the transformation of source data into a format optimized for reporting and
analysis.
Inmon's Approach:
Inmon supports the creation of a centralized Enterprise Data Warehouse
(EDW) as the foundation. This EDW serves as a single, integrated repository for
the entire organization.

Inmon's approach follows a top-down development methodology. It begins with the
creation of an enterprise-wide data warehouse and then focuses on building data
marts to meet specific business needs.
Kimball VS Inmon’s Approach
Philosophy:
Kimball: Business-driven, iterative, and agile.
Inmon: Enterprise-centric, normalized, and long-term.
Data Model:
Kimball: Dimensional modeling, star or snowflake schemas.
Inmon: Normalized data model, 3NF.
Development Approach:
Kimball: Bottom-up development, starting with data marts.
Inmon: Top-down development, starting with the enterprise data warehouse.
Data Marts:
Kimball: Considers data marts as primary deliverables.
Inmon: Views data marts as subsets of the enterprise data warehouse.
Flexibility:
Kimball: Agile and adaptable to changing business needs.
Inmon: Emphasizes a stable and scalable architecture for long-term use.
Kimball approach: Main steps
1. Choose the subject: Clearly define the business objectives and scope of
the data warehouse project.
2. Requirements Gathering: Collaborate closely with business users to
gather their reporting and analysis requirements.
3. Dimensional Modeling: Develop dimensional models using star or snowflake
schemas. Identify dimensions and facts.
4. ETL Design and Development: Create Extract, Transform, Load (ETL)
processes based on dimensional models.
5. Data Mart Development: Develop data marts as subsets of the data
warehouse, addressing specific business needs.
6. Business Intelligence Tools Integration: Choose and integrate business
intelligence tools compatible with dimensional models.
