Data Warehousing: Modern Database Management
Objectives
• Definition of terms
• Reasons for the information gap between information needs and information availability
• Reasons for the need for data warehousing
• Describe the three levels of data warehouse architectures
• List the four steps of data reconciliation
• Describe the two components of a star schema
• Estimate fact table size
• Design a data mart
Definitions
• Data Warehouse:
– A subject-oriented, integrated, time-variant, non-updatable collection of data used in support of management decision-making processes
– Subject-oriented: e.g., customers, patients, students, products
– Integrated: consistent naming conventions, formats, and encoding structures; drawn from multiple data sources
– Time-variant: can study trends and changes
– Non-updatable: read-only, periodically refreshed
• Data Mart:
– A data warehouse that is limited in scope
Need for Data Warehousing
• Integrated, company-wide view of high-quality
information (from disparate databases)
• Separation of operational and informational systems
and data (for improved performance)
Data Warehouse Architectures
• Generic Two-Level Architecture
• Independent Data Mart
• Dependent Data Mart and Operational
Data Store
• Logical Data Mart and @ctive Warehouse
• Three-Layer architecture
Figure 11-2: Generic two-level data warehousing architecture (data is extracted from source systems, transformed, and loaded into the warehouse)

Dependent data mart with operational data store architecture:
– One, company-wide warehouse
– Single ETL for the enterprise data warehouse (EDW)
– Simpler data access
– Dependent data marts loaded from the EDW
Figure 11-5: Logical data mart and real-time data warehouse architecture
– ODS and data warehouse are one and the same
– Near real-time ETL for the data warehouse
– Data marts are NOT separate databases, but logical views of the data warehouse
– Easier to create new data marts
Figure 11-7: Data characteristics (status vs. event data)
Example of a DBMS log entry: the status of a record before the event, the event (transaction) itself, and the status after the event
Figure 11-8: Transient operational data (transient vs. periodic data)
With transient data, changes to existing records are written over previous records, thus destroying the previous data content.
Figure 11-9: Periodic warehouse data (transient vs. periodic data)
Periodic data are never physically altered or deleted once they have been added to the store.
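To make the transient/periodic distinction concrete, here is a minimal Python sketch (the record layout and field names are hypothetical): the transient store overwrites a customer's city in place, while the periodic store appends a new timestamped row and keeps the old one.

```python
from datetime import date

# Transient store: an update overwrites the prior record (history lost)
transient = {101: {"name": "Alice", "city": "Austin"}}
transient[101]["city"] = "Boston"   # previous value "Austin" is destroyed

# Periodic store: an update appends a new timestamped row (history kept)
periodic = [{"cust_id": 101, "city": "Austin", "effective": date(2024, 1, 5)}]
periodic.append({"cust_id": 101, "city": "Boston", "effective": date(2024, 6, 2)})

# The warehouse can still answer "where did customer 101 live in March?"
as_of = date(2024, 3, 1)
rows = [r for r in periodic if r["cust_id"] == 101 and r["effective"] <= as_of]
print(max(rows, key=lambda r: r["effective"])["city"])  # -> Austin
```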
Other Data Warehouse Changes
• New descriptive attributes
• New business activity attributes
• New classes of descriptive attributes
• Descriptive attributes become more
refined
• Descriptive data are related to one another
• New sources of data
The Reconciled Data Layer
• Typical operational data is:
– Transient–not historical
– Not normalized (perhaps due to denormalization for performance)
– Restricted in scope–not comprehensive
– Sometimes poor quality–inconsistencies and errors
• After ETL, data should be:
– Detailed–not summarized yet
– Historical–periodic
– Normalized–3rd normal form or higher
– Comprehensive–enterprise-wide perspective
– Timely–data should be current enough to assist decision-making
– Quality controlled–accurate with full integrity
The ETL Process
• Capture/Extract
• Scrub or data cleansing
• Transform
• Load and Index
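A minimal end-to-end sketch of the four steps in Python; all record layouts, field names, and cleansing rules below are hypothetical, chosen only to illustrate the flow:

```python
# Capture/Extract: a snapshot of a chosen subset of source rows (made up here)
raw = [
    {"order_id": "1001", "state": " tx ", "amount": "49.990"},
    {"order_id": "",     "state": "MA",   "amount": "75.00"},   # bad record
    {"order_id": "1003", "state": "ma",   "amount": "20.5"},
]

# Scrub (data cleansing): reject bad records, standardize encodings
clean = []
for row in raw:
    if not row["order_id"]:                      # missing key -> reject
        continue
    row["state"] = row["state"].strip().upper()  # consistent encoding
    clean.append(row)

# Transform: translate from operational form to warehouse form
transformed = [
    {"order_id": int(r["order_id"]),
     "state": r["state"],
     "amount_usd": round(float(r["amount"]), 2)}
    for r in clean
]

# Load and Index: place transformed data in the warehouse, build an index
warehouse = {r["order_id"]: r for r in transformed}
index_by_state = {}
for r in transformed:
    index_by_state.setdefault(r["state"], []).append(r["order_id"])
```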
Figure 11-10: Steps in data reconciliation
Capture/Extract: obtaining a snapshot of a chosen subset of the source data for loading into the data warehouse

Record-level transformations:
– Selection: data partitioning
– Joining: data combining
– Aggregation: data summarization

Field-level transformations:
– Single-field: from one field to one field
– Multi-field: from many fields to one field, or one field to many
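The record-level transformations correspond to familiar relational operations. A pandas sketch on made-up DataFrames (selection = filtering, joining = merging, aggregation = group-and-summarize):

```python
import pandas as pd

orders = pd.DataFrame({"order_id": [1, 2, 3], "cust_id": [10, 11, 10],
                       "amount": [50.0, 75.0, 20.0]})
customers = pd.DataFrame({"cust_id": [10, 11], "state": ["TX", "MA"]})

# Selection (data partitioning): keep only the records of interest
big_orders = orders[orders["amount"] > 30]

# Joining (data combining): merge data from multiple sources on a common key
joined = orders.merge(customers, on="cust_id")

# Aggregation (data summarization): roll detail records up to a coarser grain
by_state = joined.groupby("state")["amount"].sum()
```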
Figure 11-10: Steps in data reconciliation (cont.)
Load/Index: place transformed data into the warehouse and create indexes
– In general, some transformation function translates data from the old form to the new form
– Table lookup is another approach: it uses a separate table keyed by source record code
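A table lookup can be as simple as a mapping keyed by the source record code. A Python sketch with hypothetical codes:

```python
# Lookup table keyed by source record code (hypothetical codes)
state_lookup = {"TX": "Texas", "MA": "Massachusetts", "CA": "California"}

def transform_state(source_code: str) -> str:
    """Single-field transformation via table lookup."""
    return state_lookup.get(source_code.strip().upper(), "UNKNOWN")

print(transform_state("tx"))   # -> Texas
```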
Figure 11-12: Multifield transformation
– 1:M: from one source field to many target fields
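For example, a 1:M transformation might split one combined address field into separate city and state fields. A sketch (the field layout is assumed):

```python
def split_city_state(combined: str) -> dict:
    """1:M multifield transformation: one source field -> two target fields."""
    city, state = (part.strip() for part in combined.split(",", 1))
    return {"city": city, "state": state}

print(split_city_state("Austin, TX"))   # -> {'city': 'Austin', 'state': 'TX'}
```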
Derived Data
• Objectives:
– Ease of use for decision support applications
– Fast response to predefined user queries
– Customized data for particular target audiences
– Ad-hoc query support
– Data mining capabilities
• Characteristics:
– Detailed (mostly periodic) data
– Aggregate (for summary)
– Distributed (to departmental servers)
• Excellent for ad-hoc queries, but bad for online transaction processing
Figure 11-14: Star schema example
Figure 11-15: Star schema with sample data
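The two components of a star schema are the fact table and its dimension tables: each dimension carries a surrogate key plus descriptive attributes, and the fact table holds one foreign key per dimension plus the numeric measures. A minimal structural sketch using Python's built-in sqlite3 module (table and column names are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: surrogate keys plus descriptive attributes
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_period  (period_key  INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE dim_store   (store_key   INTEGER PRIMARY KEY, city TEXT, region TEXT);

-- Fact table: one foreign key per dimension plus the measures
CREATE TABLE fact_sales (
    product_key  INTEGER REFERENCES dim_product(product_key),
    period_key   INTEGER REFERENCES dim_period(period_key),
    store_key    INTEGER REFERENCES dim_store(store_key),
    units_sold   INTEGER,
    dollars_sold REAL
);
""")
```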
Issues Regarding Star Schema
• Dimension table keys must be surrogate (non-intelligent and
non-business related), because:
– Keys may change over time
– Length/format consistency
• Granularity of Fact Table–what level of detail do you want?
– Transactional grain–finest level
– Aggregated grain–more summarized
– Finer grain enables better market basket analysis capability
– Finer grain means more dimension tables and more rows in the fact table
• Duration of the database–how much history should be kept?
– Natural duration–13 months or 5 quarters
– Financial institutions may need longer duration
– Older data is more difficult to source and cleanse
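Per the objectives, fact table size can be estimated as the expected number of fact rows times the bytes per row, where the row count is bounded by the product of the dimension cardinalities at the chosen grain. A hypothetical worked example (all counts and field widths are assumptions):

```python
# Hypothetical dimension cardinalities
products = 1_000          # active products
stores = 200              # stores reporting sales
periods = 13 * 30         # 13 months at daily grain, ~390 days

# Worst case: every product sells in every store every day
max_rows = products * stores * periods            # 78,000,000 rows

# Row width: 3 four-byte keys + 2 four-byte measures = 20 bytes
bytes_per_row = 3 * 4 + 2 * 4

print(f"{max_rows:,} rows, ~{max_rows * bytes_per_row / 1e9:.1f} GB raw data")
# -> 78,000,000 rows, ~1.6 GB raw data
```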
Figure 11-16: Modeling dates
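A date dimension is usually generated rather than sourced, with one row per calendar day and precomputed attributes. A sketch of such a generator; the attribute set shown is an assumption:

```python
from datetime import date, timedelta

def build_date_dimension(start: date, days: int) -> list[dict]:
    """Generate one row per calendar day with precomputed attributes."""
    rows = []
    for i in range(days):
        d = start + timedelta(days=i)
        rows.append({
            "date_key": int(d.strftime("%Y%m%d")),  # surrogate-style key
            "year": d.year,
            "month": d.month,
            "day_of_week": d.strftime("%A"),
            "is_weekend": d.weekday() >= 5,
        })
    return rows

dim_date = build_date_dimension(date(2024, 1, 1), 366)
```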
Figure 11-22: Slicing a data cube
Figure 11-24: Example of drill-down (from a summary report to finer detail)
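Slicing fixes a value on one dimension of the cube; drilling down re-aggregates at a finer grain by adding a dimension attribute. A pandas sketch of both operations on made-up sales data:

```python
import pandas as pd

sales = pd.DataFrame({
    "product": ["shoes", "shoes", "hats", "hats"],
    "region":  ["East", "West", "East", "West"],
    "month":   [1, 1, 1, 1],
    "units":   [120, 80, 45, 60],
})

# Slice: hold one dimension constant (the month = 1 "slab" of the cube)
jan_slice = sales[sales["month"] == 1]

# Summary report: totals by product only
summary = jan_slice.groupby("product")["units"].sum()

# Drill-down: add the region dimension for a finer-grained view
detail = jan_slice.groupby(["product", "region"])["units"].sum()
```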
Data Mining and Visualization
• Knowledge discovery using a blend of statistical, AI, and
computer graphics techniques
• Goals:
– Explain observed events or conditions
– Confirm hypotheses
– Explore data for new or unexpected relationships
• Techniques
– Case-based reasoning
– Rule discovery
– Signal processing
– Neural nets
– Fractals
• Data visualization – representing data in graphical/multimedia
formats for analysis