Data Warehouse
analysis and query reporting in the star schema.
1. A fact is a performance measure. For example, "Sales of Product X".
2. Fact values are not known in advance. They are only known when event measurement occurs.
3. Facts are numeric.
4. The most useful facts are numeric and additive.
DIMENSION TABLES
• In a star schema, dimension tables contain classification and aggregation information about the values in the fact table.
• Dimension tables contain the parameters by which the fact table measures are analyzed. For example, the amount sold is analyzed by day, month, quarter, or year. Or the amount sold on sunny days vs. rainy days, and so on.
7. Flags and Indicators
Now how do we generate dimensional models?
The Dimensional Normal Form
• Is a creative and practical approach originated by Mike Schmitz to design Dimension Table Families.
• Here, fact tables are highly normalized for maintainability and flexibility.
• Dimensions have their hierarchies de-normalized into them for usability and performance.
• Its schema is limited to two levels:
1. A single first-level, central, highly normalized table called a fact table, and
2. multiple second-level tables called dimension tables, linked to the first-level table in primarily one-to-many relationships.
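The relationship between a fact table and its dimension tables can be sketched with Python's built-in sqlite3 module. This is only an illustration under assumptions: the table names (sales_fact, date_dim, product_dim), the columns, and the sample rows are all hypothetical and do not come from the slides.

```python
import sqlite3

# Hypothetical star schema: one central fact table and two dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE date_dim (
    date_key      INTEGER PRIMARY KEY,
    calendar_date TEXT,
    month         TEXT,
    weather       TEXT
);
CREATE TABLE product_dim (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT
);
CREATE TABLE sales_fact (
    date_key    INTEGER REFERENCES date_dim(date_key),
    product_key INTEGER REFERENCES product_dim(product_key),
    amount_sold REAL
);
""")
conn.executemany("INSERT INTO date_dim VALUES (?, ?, ?, ?)", [
    (1, "2024-01-01", "2024-01", "sunny"),
    (2, "2024-01-02", "2024-01", "rainy"),
    (3, "2024-02-01", "2024-02", "sunny"),
])
conn.executemany("INSERT INTO product_dim VALUES (?, ?)", [
    (1, "Product X"),
    (2, "Product Y"),
])
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", [
    (1, 1, 10.0), (1, 2, 5.0), (2, 1, 7.0), (3, 1, 3.0),
])

# Each analysis attribute maps to the dimension table that holds it
# and to the key column that joins that dimension to the fact table.
ATTRIBUTES = {
    "month": ("date_dim", "date_key"),
    "weather": ("date_dim", "date_key"),
    "product_name": ("product_dim", "product_key"),
}

def amount_sold_by(attribute):
    """Sum the additive fact amount_sold, grouped by one dimension attribute."""
    if attribute not in ATTRIBUTES:
        raise ValueError(f"unknown attribute: {attribute}")
    table, key = ATTRIBUTES[attribute]
    sql = (
        f"SELECT d.{attribute}, SUM(f.amount_sold) "
        f"FROM sales_fact f JOIN {table} d ON d.{key} = f.{key} "
        f"GROUP BY d.{attribute}"
    )
    return dict(conn.execute(sql).fetchall())

by_month = amount_sold_by("month")
by_weather = amount_sold_by("weather")
by_product = amount_sold_by("product_name")
```

Every query is a single-level join from the fact table to one dimension table, and the same fact can be summed by any attribute, which is what makes it additive.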
PROJECT AND PROGRAM PLANNING
CURRENT TECHNICAL ENVIRONMENT
TRAIN TEAM
4. IDENTIFY THE FACTS
DEVELOP PRELIMINARY AGGREGATION PLAN
PHYSICAL DESIGN
TYPICALLY UNDERESTIMATED
INMON MODEL
-operational
-departmental
-individual
KIMBALL MODEL
DIFFERENCES
             inmon                          kimball
             low                            high
objective    deliver sound tech solution    makes it easy for end users
             based on proven methods.       to directly query data.
modeling business processes results in numerous data entities/tables and a spaghetti-like interweaving of relationships among them.
not usable by end users: too complicated
not usable for dw queries
dimensional models may contain more content than normalized model
DIMENSIONAL MODELLING
a logical design technique for structuring data such that it is intuitive for bus. users and delivers fast query performance.
widely accepted as the preferred approach for DW presentation
simplicity is fundamental to usefulness.
allows software to easily navigate database.
Divides world into measurements and context.
Measurements are numeric values called facts.
Context intuitively divided into clumps called dimensions.
Dimensions describe the “who, what, where, when, why, and how” of the facts.
A dimensional model consists of a fact table containing measurements, surrounded by a halo of dimension tables containing textual context.
Known as a star join.
Known as a star schema when stored in a relational database (RDBMS).
Two Key Benefits of Dimensional Modeling à la Kimball
• Understandability
– Model must be easily understood by business users
– Yet represent complexities of the business
• Performance
– Fast response to queries that summarize millions of rows is essential
– Limiting models to single level joins rather than multi-level joins
– Denormalization has a significant impact on performance
Benefits of Dimensional Models
Predictable, Standard Framework
– Users recognize that this is “their business”
– Report writers, query tools, and user interfaces can be built into BI tools
– Makes user interfaces more understandable
– Makes processing more efficient
Gracefully Extensible to Accommodate Change
– Existing tables can be changed by adding new data rows
• Data should not have to be reloaded
– No query tool or reporting tool has to be reprogrammed
– Old BI applications continue to run without yielding different results
RELATIONAL MODELING
widely used method of database
data is divided into discrete entities.
- each of which becomes a relational database table called an entity.
models are shown in two forms - logical and physical
- logical models are designed to be independent of any particular rdbms
- tables in a logical model are called entities; the columns are called attributes
physical models are derived from logical models but are specific to a given RDBMS
each entity has a unique identifier known as its primary key
the primary key consists of one or more attributes/columns
NORMALIZED MODELS
designed to eliminate redundancies
in 3NF
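To make the contrast between a normalized model and a de-normalized dimension concrete, here is a minimal sketch using Python's sqlite3 module. The department/category/product hierarchy, the table names, and the sample rows are assumptions made up for illustration. The same question is answered once with multi-level joins over normalized tables and once with a single-level join to a flattened dimension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Normalized form: each hierarchy level lives in its own table.
CREATE TABLE department (department_id INTEGER PRIMARY KEY, department_name TEXT);
CREATE TABLE category (
    category_id   INTEGER PRIMARY KEY,
    category_name TEXT,
    department_id INTEGER REFERENCES department(department_id)
);
CREATE TABLE product (
    product_id   INTEGER PRIMARY KEY,
    product_name TEXT,
    category_id  INTEGER REFERENCES category(category_id)
);
CREATE TABLE sale (product_id INTEGER REFERENCES product(product_id), amount REAL);

INSERT INTO department VALUES (1, 'Grocery'), (2, 'Home');
INSERT INTO category VALUES (10, 'Dairy', 1), (11, 'Bakery', 1), (20, 'Kitchen', 2);
INSERT INTO product VALUES (100, 'Milk', 10), (101, 'Bread', 11), (200, 'Pan', 20);
INSERT INTO sale VALUES (100, 4.0), (101, 3.0), (200, 25.0), (100, 4.0);

-- De-normalized dimension: the whole hierarchy is flattened into one row per product.
CREATE TABLE product_dim AS
SELECT p.product_id AS product_key, p.product_name, c.category_name, d.department_name
FROM product p
JOIN category c ON c.category_id = p.category_id
JOIN department d ON d.department_id = c.department_id;
""")

# Normalized model: three joins to reach the department level.
normalized = conn.execute("""
SELECT d.department_name, SUM(s.amount)
FROM sale s
JOIN product p ON p.product_id = s.product_id
JOIN category c ON c.category_id = p.category_id
JOIN department d ON d.department_id = c.department_id
GROUP BY d.department_name
ORDER BY d.department_name
""").fetchall()

# Dimensional model: one join, because the hierarchy is already in the dimension.
dimensional = conn.execute("""
SELECT pd.department_name, SUM(s.amount)
FROM sale s
JOIN product_dim pd ON pd.product_key = s.product_id
GROUP BY pd.department_name
ORDER BY pd.department_name
""").fetchall()
```

Both queries return the same totals. The dimensional version trades some redundancy in product_dim (department names repeat on every product row) for fewer joins.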
Star Join Schema is Symmetrical
– Every dimension is equivalent
– All dimensions are symmetrically equal entry points to the fact table
• No concern about order in selecting tables
– Logical design can be done nearly independent of expected query patterns
• Sales Date versus Received Date
– Heterogeneous products
• Develop Detailed Bus Matrix
• Identify, Track, and Resolve Issues
Establishing Naming Conventions
• Use descriptive and consistent data names. Reasons:
– Names become column headers in reports.
– Column names must be non-redundant. Example: not just City, but Customer City or Supplier City
• Know the naming rules of your RDBMS
• Document the detailed dimension worksheet
– Known as a Source-to-Target Map
Note that spreadsheets are used extensively in metadata documentation
Develop Detailed Bus Matrix
• Bus matrix makes several things articulate and obvious
– Business processes have several fact tables
– Explicit granularity for fact tables
– Named facts for fact tables
– Reusable conformed dimensions
Identify, Track, and Resolve Issues
• Issues continually arise as the team works among its members and with business participants
• Important to identify, track, and resolve these issues
• Assign someone to capture and track issues that arise at meetings or in discussions
Fact Table Facts
• A fact is a performance measure
– Sales of Product X
– In PhP
• The most useful facts are numeric and additive
• Are usually the largest tables
• Are usually appended to
• Are primarily joined to dimension tables through foreign keys
Fact Table Granularity
• The fact table’s grain is the business definition of the measurement event that produces the fact table
– Example: Each time a customer submits an order online, a customer order event ultimately becomes a row in the customer order fact table.
• Declaring the grain means a fact table row represents the blank in this statement: “A fact row is created when _______ occurs.”
Determining the Grain of a Fact Table
• In business terms
– What is the meaning of an individual row in the fact table
• In data modeling terms
– What is the unique logical identifier
– What are the identifying dimension keys
• In ETL terms
– What is the rule for populating the table
• Summaries
Detail Fact Table – Granularity Statement
– “One row for each item in a transaction”
• Notice that the standard dimensions are not part of the granularity statement
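One way to make a declared grain enforceable is to turn the unique logical identifier into a key constraint. The sketch below assumes a hypothetical order_item_fact table whose grain is "one row for each item in a transaction", identified by (transaction_id, line_number). The column names and the load_item helper are illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE order_item_fact (
    transaction_id INTEGER NOT NULL,
    line_number    INTEGER NOT NULL,
    date_key       INTEGER NOT NULL,   -- standard dimension, not part of the grain
    product_key    INTEGER NOT NULL,   -- standard dimension, not part of the grain
    quantity       INTEGER NOT NULL,
    amount         REAL    NOT NULL,
    -- The declared grain: one row per item (line) in a transaction.
    PRIMARY KEY (transaction_id, line_number)
)
""")

def load_item(row):
    """Insert one fact row; return False if it would violate the grain."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO order_item_fact VALUES (?, ?, ?, ?, ?, ?)", row
            )
        return True
    except sqlite3.IntegrityError:
        return False

results = [
    load_item((5001, 1, 20240101, 1, 2, 20.0)),  # first item of transaction 5001
    load_item((5001, 2, 20240101, 2, 1, 5.0)),   # second item, same transaction
    load_item((5001, 2, 20240101, 2, 1, 5.0)),   # duplicate item: rejected
]
row_count = conn.execute("SELECT COUNT(*) FROM order_item_fact").fetchone()[0]
```

The date and product keys sit in the row as ordinary foreign keys, but they play no part in the identifier, which matches the note that the standard dimensions are not part of the granularity statement.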
Granularity Enforcement
– Keys that are foreign keys in the Fact Table connected to primary keys of their respective Dimensions
– Day
– Store
– Distribution Mgr
• Sometimes the grain of a fact table is not made up of all the dimensions in the fact table
Exceptions:
• Resolves many to many relationships between several pairs of dimension tables
Snapshot Fact Table (Point in time values)
– “One row for each product sold by store by day”
• A perfect cube since the lpk is made up of all of the dimension keys
• Dimensions identify the grain or granularity of this table.
• All attributes are stored as Integers
• Usually stored in a 4 byte sized Integer = 32 Bits
• A 32-Bit attribute can handle:
• Measures can be base facts from a source system
• Measures can be derived or calculated from base facts
• Measures sometimes called metrics
– Example: Key Performance Indicators or KPIs are metrics or measures; often used in dashboards
Three Types of Facts
• Additive Measures
– Can be summed across all dimensions and all combinations of dimensions
• Semi-Additive Measures
– Can be summed across some, but not all dimensions
• Non-Additive Measures
– Cannot be summed across any dimension, but can be aggregated in other ways (avg, min, max)
5.0 Designing Dimension Tables
Dimension Tables
• Contain the parameters by which the fact table measures are analyzed:
– amount sold is analyzed by day, month, quarter, or year
– amount sold on sunny days vs. rainy days
• Contain textual and discrete data
• Are usually smaller than fact tables
• Have a single column surrogate primary key (called the warehouse dimension key)
• Are joined to a fact table through a foreign key reference to their primary key
• Can contain one or more hierarchies
• The hierarchies are de-normalized into the dimension tables
• What hierarchies do you see?
• They can also be identified as the “by” words in a business question that asks for a report.
– Example: “I’d like a report that lists sales by store by product by quarter.”
What is a surrogate key?
A surrogate key is a system assigned primary key.
• When the first row is added to a dimension, the system automatically assigns a key of 1 to the row.
• As each additional row is added, the system automatically increments the key by 1.
• It’s meaningless, but essential as a foreign key in fact tables
• Important: Retain source system primary key as unique identifier to use as lookup argument during ETL process and for report headers
Warehouse Dimension Keys: Single column surrogate keys
– Provide key control within the data warehouse
– Enable one method of tracking attribute history
– Facilitate exception references from a fact table
• Implemented in every dimension, even date and time dimensions
• 0 – The fact table row had an invalid legacy id for this dimension (Invalid)
• -1 – The fact table row should reference a value for this dimension, but the value is unknown (Missing Mandatory)
• -2 – The fact table row is not applicable for this dimension (Missing Optional)
Examples
• Invalid reference from fact table
– The sale of a product whose product ID is not in the dimension table
• Unknown reference from fact table
• The fact table row is not applicable to this dimension
– The sale of a product that is not on promotion
Dimension Table Classifications and Examples
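Before turning to the individual dimension classifications, the surrogate key assignment and the exception keys (0, -1, -2) described above can be sketched in a few lines of Python. This is a minimal illustration, not a production ETL routine: the class name Dimension, its methods, and the sample source IDs are all hypothetical.

```python
# Exception keys, using the values listed in the slides.
INVALID = 0             # source id present but not found in the dimension
MISSING_MANDATORY = -1  # a value should exist but is unknown
MISSING_OPTIONAL = -2   # the dimension does not apply to this fact row

class Dimension:
    """Assigns meaningless, incrementing surrogate keys to source system ids."""

    def __init__(self):
        self._next_key = 1        # the first row added gets key 1
        self._key_by_source = {}  # retained source id -> surrogate key

    def add(self, source_id):
        """Add a dimension row and return its surrogate key."""
        if source_id in self._key_by_source:
            return self._key_by_source[source_id]
        key = self._next_key
        self._key_by_source[source_id] = key
        self._next_key += 1       # each additional row increments the key by 1
        return key

    def lookup(self, source_id, mandatory=True):
        """ETL lookup: map a source id to a surrogate or exception key."""
        if source_id is None:
            return MISSING_MANDATORY if mandatory else MISSING_OPTIONAL
        return self._key_by_source.get(source_id, INVALID)

product_dim = Dimension()
promotion_dim = Dimension()
first_key = product_dim.add("SKU-A")
second_key = product_dim.add("SKU-B")

known = product_dim.lookup("SKU-B")                     # valid reference
invalid = product_dim.lookup("SKU-Z")                   # id not in the dimension
unknown = product_dim.lookup(None)                      # mandatory but missing
not_on_promo = promotion_dim.lookup(None, mandatory=False)  # not applicable
```

The source system id is kept only as the lookup argument; the fact table would store the returned integer, so every fact row can join to some dimension row even in the exception cases.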
The Date Dimension Family
Implement the family and name each dimension for its granularity
– Date Dim
– Week Dim
– Month Dim
– Quarter Dim
– Year Dim
• Use at least one character column for date
• Put in all attributes that simplify analysis
– Specific Business Dimension (Sales Date Dim)
• One row for each change date for each product
• All transaction attributes have already been attached to the fact table
Degenerate Dimension: Alternative Model
• Make a dimension for transaction
– Protects against reuse of transaction ids
– Retain smaller key in fact table which is seldom used
– Will serve as a dimension to a transaction level fact table if built
– It will only be used if there is a need to bring back individual detail
– Good container for text comment field
Condition Dimensions
• Conditions that may affect fact table activity dependent on date and another dimension
• Cannot be handled in the date dimension
• Often need system to capture
• Need standardized method of handling
• Dimensions where the dimension table takes on more than one value for an individual fact table row
• One solution is to convert the multiple rows into one row
• Usually Called: Mix Dimension table
• One row for each different mix of values encountered in an individual fact table row
• Entity taking on different roles or uses for the same entity
• Only one dimension table in a dimension family is attached to any one fact table
• To use normalized tables in the dimensional model.
• Break dimension hierarchies into normalized tables connected by foreign key – primary key relationships
Every join costs something and one extra join may cause the database optimizer to choose a bad algorithm
Dimensional Physical Implementation
• Physical base table
– Date Dim
• Views*
• View
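The idea of one physical base table exposed through views, so that the same entity can play different roles, can be sketched as follows. This is an assumption-laden example: the slides name Date Dim and Sales Date Dim, while received_date_dim, shipment_fact, and all column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Physical base table: the only place date rows are stored.
CREATE TABLE date_dim (
    date_key      INTEGER PRIMARY KEY,
    calendar_date TEXT,
    month_name    TEXT
);
INSERT INTO date_dim VALUES (1, '2024-01-30', 'January'), (2, '2024-02-02', 'February');

-- One view per role, with role-specific column names for report headers.
CREATE VIEW sales_date_dim AS
SELECT date_key AS sales_date_key, calendar_date AS sales_date,
       month_name AS sales_month
FROM date_dim;

CREATE VIEW received_date_dim AS
SELECT date_key AS received_date_key, calendar_date AS received_date,
       month_name AS received_month
FROM date_dim;

-- A fact table that references the date entity in two different roles.
CREATE TABLE shipment_fact (
    sales_date_key    INTEGER,
    received_date_key INTEGER,
    quantity          INTEGER
);
INSERT INTO shipment_fact VALUES (1, 2, 5), (1, 1, 3);
""")

rows = conn.execute("""
SELECT s.sales_month, r.received_month, SUM(f.quantity)
FROM shipment_fact f
JOIN sales_date_dim s ON s.sales_date_key = f.sales_date_key
JOIN received_date_dim r ON r.received_date_key = f.received_date_key
GROUP BY s.sales_month, r.received_month
ORDER BY r.received_month
""").fetchall()
```

Because each view renames its columns for its role, a report shows "sales_month" and "received_month" rather than two ambiguous "month_name" headers, while the date rows are maintained in one place.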