
Data Warehousing (DWH)

Data Warehousing Concepts


(Architecture to Implementation)

Version 1.0 (Rev 0)


Presented By:
Deepak.
A producer wants to know….
 Which are our lowest/highest margin customers?
 Who are my customers and what products are they buying?
 What is the most effective distribution channel?
 Which customers are most likely to go to the competition?
 What product promotions have the biggest impact on revenue?
 What impact will new products/services have on revenue and margins?
Data, Data everywhere yet ...
 I can’t find the data I need
 data is scattered over the network
 many versions, subtle differences
 I can’t get the data I need
 need an expert to get the data
 I can’t understand the data I found
 available data poorly documented
 I can’t use the data I found
 results are unexpected
 data needs to be transformed from one form to another
Information Crisis
 Over 50% of business users found it “difficult” or “very difficult” to get the information they need.

 77 percent of them said bad decisions had been made due to lacking information.

 One in six of 1,000 workers surveyed currently experiences problems in finding the information they need to do their jobs.

 GBs to TBs of data are available, yet the IT department is still not able to respond to critical business queries on time.

Will you please provide the answer of my queries on time with accuracy??
What should I do?
Is Business Intelligence the solution?
Should I go for it right now??
What is Data Warehousing?

Data → Information

A process of transforming data into information and making it available to users in a timely enough manner to make a difference

[Forrester Research, April 1996]
Evolution
 60’s: Batch reports
 hard to find and analyze information
 inflexible and expensive, reprogram every new request

 70’s: Terminal-based DSS and EIS (executive


information systems)
 still inflexible, not integrated with desktop tools
 80’s: Desktop data access and analysis tools
 query tools, spreadsheets, GUIs
 easier to use, but only access operational databases

 90’s: Data warehousing with integrated OLAP


engines and tools
Very Large Data Bases
 Terabytes -- 10^12 bytes: Walmart -- 24 Terabytes

 Petabytes -- 10^15 bytes: Geographic Information Systems


National Medical Records

 Exabytes -- 10^18 bytes: Weather images

 Zettabytes -- 10^21 bytes: Intelligence Agency Videos

 Yottabytes -- 10^24 bytes:


Basics
 OLTP—On-Line Transaction Processing
 Supports CRUD—create, read, update, delete
 Requires exclusive locks in order to make changes
 Performance, concurrency are important goals
 Traditional use of database systems

 BI—Business Intelligence
 Provides analytics
 May lock a whole table for a table scan
 Results from large amounts of data an important goal
 Increasingly the source of great business benefit
Different Systems
                   OLTP            ODS                    OLAP                DM/DW
Business Focus     Operational     Operational/Tactical   Tactical            Strategic/Tactical
End User Tools     Client Server   Client Server, Web     Client Server, Web  Client Server, Web
DB Technology      Relational      Relational             Cubic               Relational
Trans Count        Large           Medium                 Small               Small
Trans Size         Small           Medium                 Medium              Large
Trans Time         Short           Medium                 Long                Long
Size in Gigs       10 – 200        50 – 400               50 – 400            400 – 4000
Normalization      3NF             3NF                    N/A                 0–3NF
Data Modeling      Traditional ER  Traditional ER         N/A                 Dimensional


Data Warehousing Definition
 A data warehouse is a
 subject-oriented,
 integrated,
 nonvolatile, and
 time-variant
collection of data in support of management’s decisions. The
data warehouse contains granular corporate data.

 A data warehouse is a copy of transactional data specifically


structured for querying and analysis. According to this definition:
 The form of the stored data (RDBMS, flat file) has nothing to
do with whether something is a data warehouse.
 Data warehousing is not necessarily for the needs of
"decision makers" or used in the process of decision making
Data Warehousing Definition
 Subject-Oriented: Stored data targets specific subjects.
Example: It may store data regarding total Sales, Number of
Customers, etc. and not general data on everyday operations.

 Integrated: Data may be distributed across heterogeneous sources


which have to be integrated.
Example: Sales data may be on RDB, Customer information on Flat
files, etc.

 Time Variant: Data stored may not be current but varies with time and
data have an element of time.
Example: Data of sales in last 5 years, etc.

 Non-Volatile: It is separate from the Enterprise Operational Database
and hence is not subject to frequent modification. It generally has only
two operations performed on it: loading of data and access of data.
Subject-Oriented Data Collections

Classical operations systems are


organized around the applications of
the company. For an insurance
company, the applications may be
auto, health, life, and casualty. The
major subject areas of the insurance
corporation might be customer,
policy, premium, and claim. For a
manufacturer, the major subject
areas might be product, order,
vendor, bill of material, and raw
goods. For a retailer, the major
subject areas may be product, SKU,
sale, vendor, and so forth. Each
type of company has its own unique
set of subjects
Integrated Data Collections

Of all the aspects of a data


warehouse, integration is the most
important. Data is fed from multiple
disparate sources into the data
warehouse. As the data is fed it is
converted, reformatted,
resequenced, summarized, and so
forth. The result is that data—once it
resides in the data warehouse—has
a single physical corporate image.
Non-volatile Data Collections
Data is updated in the
operational environment as a
regular matter of course, but
warehouse data exhibits a very
different set of characteristics.
Data warehouse data is loaded
(usually en masse) and
accessed, but it is not updated
(in the general sense). Instead,
when data in the data warehouse
is loaded, it is loaded in a
snapshot, static format. When
subsequent changes occur, a
new snapshot record is written.
In doing so a history of data is
kept in the data warehouse.
Time-variant Data Collections
Time variance implies that every
unit of data in the data warehouse
is accurate as of some one
moment in time. In some cases, a
record is time stamped. In other
cases, a record has a date of
transaction. But in every case,
there is some form of time
marking to show the moment in
time during which the record is
accurate. A 60-to-90-day time
horizon is normal for operational
systems; a 5-to-10-year time
horizon is normal for the data
warehouse. As a result of this
difference in time horizons, the
data warehouse contains much
more history than any other
environment.
Data Warehouse vs. OLTP
                   OLTP                             DW
Purpose            Automate day-to-day operations   Analysis
Structure          RDBMS                            RDBMS
Data Model         Normalized (ER)                  Dimensional
Access             SQL                              SQL and business analysis programs
Data               Data that runs the business      Current and historical information
Condition of data  Changing, incomplete             Historical, complete, descriptive


Data Mart
 A logical subset of the complete data warehouse. A data mart is a
complete “pie-wedge” of the overall data warehouse pie. A data mart
represents a project that can be brought to completion rather than
being an impossible galactic undertaking. A data warehouse is made up
of the union of all its data marts.
 Contains a subset of corporate-wide data that is of value to a specific
group of users. Scope is confined to specific selected subjects
 Implemented on low-cost departmental servers that are UNIX or
Windows NT based. Implementation cycle is measured in weeks rather
than months or years
 They are further classified as:
 Independent
 Sourced from data captured from one or more operational systems or
external information providers
 Sourced from data generated locally within a particular department or
geographic area
 Dependent
 Sourced directly from enterprise data warehouses
Data Warehouse & Data Mart
Granularity
 The term Granularity is used to indicate the level of
detail stored in the fact table. The granularity of the
Fact table follows naturally from the level of detail of
its related dimensions.
 For example, if each Time record represents a day,
each Product record represents a product, and each
Organization record represents one branch, then the
grain of a sales Fact table with these dimensions would
likely be: sales per product per day per branch.
 Proper identification of the granularity of each
schema is crucial to the usefulness and cost of the
warehouse.
Granularity
 Too High
 Severely limits the ability of users to obtain additional detail.
 For example, if each time record represented an entire year,
there will be one sales fact record for each year, and it would
not be possible to obtain sales figures on a monthly or daily
basis.

 Too Low
 Results in an exponential increase in the size requirements
of the warehouse.
 For example, if each time record represented an hour, there will
be one sales fact record for each hour of the day
 8,760 sales fact records for a year with 365 days for each
combination of Product, Client, and Organization
 If daily sales facts are all that are required, the number of
records in the database can be reduced dramatically.
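A minimal sketch of the grain arithmetic above, assuming hypothetical sizes for the Product, Client, and Organization dimensions; it only multiplies out worst-case row counts to show how the choice of time grain drives the size of the fact table.

```python
# Hypothetical dimension sizes; only the time grain changes between scenarios.
PRODUCTS, CLIENTS, BRANCHES = 1_000, 500, 20

def fact_rows(time_records_per_year: int) -> int:
    """Worst-case fact rows per year if every combination produces a record."""
    return time_records_per_year * PRODUCTS * CLIENTS * BRANCHES

hourly = fact_rows(24 * 365)   # 8,760 time records per year
daily  = fact_rows(365)        # 365 time records per year
yearly = fact_rows(1)          # a single time record per year

print(f"hourly grain: {hourly:,} rows/year")
print(f"daily grain : {daily:,} rows/year")
print(f"yearly grain: {yearly:,} rows/year")
# Daily grain is 24x smaller than hourly; yearly grain loses all monthly/daily detail.
```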
Operational Data Store (ODS)
The Operational Data Store is used for tactical decision making
while the DW supports strategic decisions. It contains
transaction data, at the lowest level of detail for the subject area
 subject-oriented, just like a DW
 integrated, just like a DW
 volatile (or updateable), unlike a DW
 an ODS is like a transaction processing system
 information gets overwritten with updated data
 no history is maintained (other than an audit trail or operational history)
 current, i.e., not time-variant, unlike a DW
 current data, up to a few years
 no history is maintained (other than an audit trail or operational history)
Operational Data Store (ODS)
 An ODS is a collection of integrated databases designed to
support the monitoring of operations. Unlike the databases of
OLTP applications (that are function oriented), the ODS
contains subject oriented, volatile, and current enterprise-wide
detailed information. It serves as a system of record that
provides comprehensive views of data in operational sources.

 Like data warehouses, ODS are integrated and subject-


oriented. However, an ODS is always current and is constantly
updated. The ODS is an ideal data source for a data
warehouse, since it already contains integrated operational data
as of a given point in time.

 In short, ODS is an integrated collection of clean data destined


for the data warehouse.
Operational Data Store (ODS)
 Characteristics:
 Subject Oriented
 Customer, Vendor, Account, Product, etc.

 Integrated
 Data is cleansed, standardized and placed into a consistent data model

 Volatile
 UPDATEs occur regularly, whereas data warehouses are refreshed via

INSERTs to firmly preserve history


 Current Valued
 No INSERTs means no significant retention of historical data (don’t take

this literally…history can extend to an accounting cycle, for instance)


 Detailed

 Examples
 Goldman Sachs
 Disparate, global real estate investment details collected daily, then

standardized and transmitted to HQ


 WalMart
 Real-time transaction ‘flash’ sales report. Highly granular and can be

aggregated as seen fit


Data Warehouse Architecture
Data Warehouse Architectures (A)
 The different types of DW architecture…
 Independent Data Marts
 Dependent Data Marts / Hub and Spoke
 Bus Architecture
 Central Data Warehouse (no dependent data marts)
 Federated
 And more…

 With many variances on the themes. Therefore


some of the architectures represented here might
look slightly different to how others represent them.
Data Warehouse Architectures (A)
Independent Data Marts

[Diagram: each subject area has its own ETL feed into its own data mart with its own reports: Finance Data Mart, SMS Data Mart, HR Data Mart, and Other/External Data Mart]

How Data Warehousing was often performed in the early days:
 Individual projects developing solutions into functional silos
 No program / enterprise perspective
 No conformed dimensions
Data Warehouse Architectures (A)
Dependent Data Marts / Hub & Spoke
Usually employing a Top-Down approach (Inmon)

[Diagram: a single ETL process loads an enterprise Data Warehouse, which feeds dependent Finance, SMS, External, and HR data marts, each delivering reports through a shared reporting infrastructure]

An approach also used in the early days, but refined over time:
 Originally suggested extensive effort in building the DW
 Now recommends building the DW incrementally
Data Warehouse Architectures (A)
Data Mart Bus (conformed)
Usually employing a Bottom-Up approach (Kimball)

[Diagram: a single ETL process loads a Staging Area, which feeds conformed Finance, SMS, External, and HR data marts, each delivering reports through a shared reporting infrastructure]

An approach also used in the early days, but refined over time:
 Originally suggested building silos
 Now recommends an enterprise perspective
Data Warehouse Architectures (A)
Central Data Warehouse
Usually employing a Hybrid approach

[Diagram: a single ETL process loads a Staging Area and a central Data Warehouse; Finance, SMS, External, and HR reports are delivered directly from the warehouse through a shared reporting infrastructure]

Seeks to overcome the limitations of previous architectures
Highly variable with many individual approaches
Data Warehouse Architectures (A)
Federated Data Warehouse

[Diagram: legacy warehouses (DW 1, DW 2, DW 3) are consolidated via ETL into an Enterprise Data Warehouse; Finance, SMS, External, and HR reports are delivered through a shared reporting infrastructure]

An attempt to consolidate legacy Data Marts


Data Warehouse Architectures (B)
 1.Generic Two-Level Architecture
 2.Independent Data Mart
 3.Dependent Data Mart and Operational Data
Store
 4.Logical Data Mart and @ctive Warehouse
 5.Three-Layer architecture
All involve some form of extraction, transformation and loading (ETL)
Data Warehouse Architectures (B)
Generic two-level architecture

Periodic extraction  data is not completely current in warehouse


Data Warehouse Architectures (B)
Independent Data Mart

[Diagram: source systems feed each independent data mart through its own E-T-L process]

Separate ETL for each independent data mart
Data access complexity due to multiple data marts
Data Warehouse Architectures (B)
Dependent data mart with operational data store
[Diagram: a single E-T-L process feeds an ODS and an enterprise data warehouse (EDW); dependent data marts are loaded from the EDW]

ODS provides an option for obtaining current data
Single ETL for the enterprise data warehouse (EDW)
Simpler data access
Dependent data marts loaded from the EDW
Data Warehouse Architectures (B)
Logical data mart and Active data warehouse
[Diagram: near real-time E-T-L feeds a combined ODS / data warehouse; data marts are defined as views over it]

ODS and data warehouse are one and the same
Near real-time ETL for the @ctive Data Warehouse
Data marts are NOT separate databases, but logical views of the data warehouse
 Easier to create new data marts
Data Warehouse Architectures (B)
Three-layer architecture: reconciled and derived data
 Reconciled data: detailed, current data intended to be the single,
authoritative source for all decision support.
 Derived data: Data that have been selected, formatted, and aggregated
for end-user decision support application.
 Metadata: technical and business data that describe the properties or
characteristics of other data.
Data Sources
 Unstructured data – Support of any text file type in 32 languages

 Relational data – Generic ODBC, HP NeoView, IBM DB2/UDB, Informix IDS,


Microsoft SQL Server, mySQL, Netezza, Teradata, Oracle, Sybase Adaptive
Server Enterprise (ASE), and Sybase IQ. Native bulk loading supported for all
major databases.

 Mainframe (via partner) – Adabas, ISAM (C-ISAM, DISAM), Enscribe , IMS/DB,


RMS, and VSAM

 Enterprise applications – JD Edwards OneWorld and World, Oracle e-


Business Suite (EBS), PeopleTools, SAP NetWeaver Business Intelligence,
SAP ERP, and SAP R/3 via ABAP, BAPI, and IDOC, Siebel, and
Salesforce.com

 Technology and standards – An open services-based architecture permits


third-party integration using standards like CWM, JMS, SNMP, and Web
services. Data access from flat files, XML (DTD and schema definitions), Cobol,
Microsoft Excel, HTTP/HTTPS, IBM MQ Series, JMS, and Web services (SOAP,
WSDL).
Extraction E (ETL)
 Logical Extraction Methods
 Full Extraction
 An example of full extraction may be an export file.

 Incremental Extraction
 Only the data that has changed since a well-defined event back in
history will be extracted

 Physical Extraction Methods


 Online Extraction
 You need to consider whether the distributed transactions are using

original source objects or prepared source objects.


 Offline Extraction
 Flat files, Dump files, Redo and archive logs, Transportable
tablespaces
 Data in a defined, generic format.
 Oracle-specific format. Information about the containing objects is
included.
To stage or not to stage
 A conflict between
 getting the data from the operational systems as fast
as possible
 having the ability to restart without repeating the
process from the beginning
 Reasons for staging
 Recoverability: stage the data as soon as it has been
extracted from the source systems and immediately
after major processing (cleaning, transformation, etc).
 Backup: can reload the data warehouse from the
staging tables without going to the sources
 Auditing: lineage between the source data and the
underlying transformations before the load to the data
warehouse
Designing the staging area
 The staging area is owned by the ETL team
 no indexes, no aggregations, no presentation access, no
querying, no service level agreements
 Users are not allowed in the staging area for any reason
 staging is a “construction” site
 Reports cannot access data in the staging area
 tables can be added, or dropped without modifying the user
community
 Only ETL processes can read/write the staging area (ETL
developers must capture table names, update strategies, load
frequency, ETL jobs, expected growth and other details about
the staging area)
 The staging area consists of both RDBMS tables and data files
Staging Area data Structures in the
ETL System
 Flat files
 fast to write, append to, sort and filter (grep) but slow to update, access
or join
 enables restart without going to the sources
 XML Data Sets (not really used in Staging)
 Relational Tables
 Metadata, SQL interface, DBA support
 Dimensional Model Constructs: Facts, Dimensions, Atomic Facts
tables, Aggregate Fact Tables (OLAP Cubes)
 Surrogate Key Mapping Tables
 map natural keys from the OLTP systems to the surrogate key from the
DW
 can be stored in files or the RDBMS (but you can use the IDENTITY
function if you go with the RDBMS approach)
 Best Practices about these data structures:
 perform impact analysis, capture metadata, use naming conventions,
Extracting
 Effectively integrate data from
 different DBMS, OS, H/W, communication protocols
 need a logical map, data movement view documents, data lineage report
 have a plan
 identify source candidates
 analyze source systems with a data profiling tool
 receive walk-through of data lineage and business rules (from the DW architect and business analyst to the ETL developer)
 data alterations during data cleansing, calculations and formulas
 measure twice, cut once
 standard conformance to dimensions and numerical facts
 receive walk-through of the dimensional model
Components of the Data Movement
Views
 Target table and column, table type (Dimension,
Fact)
 Slow-changing dimension type per target column
 Type 1, overwrite (Customer first name)
 Type 2, retain history (Customer last name)
 Type 3, retain multiple valid alternative values
 Source database, table, column
 Transformations (the guts of the document)
 Where do you capture this information? Which tool?
How do you maintain this metadata?
Keeping track of source systems
 Data modelers and ETL developers should maintain
a documentation of the source systems including
 subject area, application name, business name,
department name
 priority
 business owner and technical owner
 DBMS, production server
 DB size, #users, complexity, #transactions per day
 comments
 Determine the system of record
Some good rules
 Dealing with derived data (derive from base facts or accepted
calculated columns from the source systems?)
 if calculations are done by the DW infrastructure, the ETL developer is
responsible for them
 and what if the numbers don’t match?
 recommendations: stay true to the definition of system-of-record

 The further downstream you go from the originating data source, the
more you increase the risk of extracting corrupt data. Barring rare
exceptions, maintain the practice of sourcing data only from the
system-of-record.
 Analyze your source system
 get an ER-model for the system or reverse-engineer one (develop one by looking at the metadata of the system)
 reverse engineering is not the same as “forward engineering”, i.e., given the ER-models of the source systems, deriving the dimensional schema of the data warehouse
Data Analysis
 Reverse engineering of the understanding of a source system
 unique identifiers and natural keys
 data types
 relationships between tables (1-to-1, many-to-1, many to
many), problematic when source database does not have
foreign keys defined
 discrete relationships (static data, reference tables)
 Data content analysis
 NULL values, especially in foreign keys, NULL result in lossy
joins
 In spite of the most detailed analysis, we recommend
using outer join logic when extracting from relational
source systems, simply because referential integrity often
cannot be trusted on remote systems
 Dates in non-date fields
Extract data from disparate systems
 What is the standard for the
enterprise?
 ODBC, OLE DB, JDBC, .NET
 access databases from
windows applications, so that
applications are portable
 performance is major drawback
 every DBMS has an ODBC
driver, even flat files
 Adds two layers of interaction
between the ETL and the
database
Extracting from different sources
 Mainframe
 COBOL copybooks give you the datatypes
 EBCDIC and not ASCII character set (FTP does the
translation between the mainframe and Unix/Windows)
 Working with redefined fields (To save space the same field
is used for different types of data)
 Extracting from IMS, IDMS, Adabas
 you need special adapters or you get someone in those systems to give you a file
 XML sources, Web Log Files: doable, if you understand the structure of those sources
 Enterprise-Resource-Planning ERP Systems (SAP, PeopleSoft, Oracle)
 Don’t treat it as a relational system -- it’s a mess
 Use adapters
Extracting Changed Data
 Using Audit Columns
 Use the last update timestamp, populated by triggers or the front-end application
 Must ensure that the timestamp is dependable, that is if the front-end modifies it, a batch job
does not override it
 Index the timestamp if it’s dependable
 Database Log Scraping or Sniffing
 Take the log of the source file and try to determine the transactions that affect you
 Sniffing does it real time
 Timed extracts
 Retrieve all records from the source that were modified “today”
 POTENTIALLY dangerous -- what if the process fails today? When it runs tomorrow, you’d
have lost today’s changes
 Process of elimination
 Preserve yesterday’s data in the stage area
 Bring today’s entire data in the stage area
 Perform a comparison
 Inefficient, but the most reliable
 Initial and Incremental Loads
 Create two tables, previous-load and current-load
 Load into the current-load, compare with the previous-load; when you are done, drop the previous-load, rename the current-load to previous-load, and create a new current-load
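A minimal sketch of the audit-column approach above, using an in-memory SQLite table with a hypothetical last_update column; log scraping, timed extracts, and the process of elimination are not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    amount      REAL,
    last_update TEXT)""")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, 100.0, "2024-01-01 10:00:00"),
    (2, 250.0, "2024-01-02 09:30:00"),
    (3,  75.0, "2024-01-02 18:45:00"),
])

# Watermark recorded at the end of the previous extraction run.
last_extract = "2024-01-02 00:00:00"

# Incremental extraction: only rows touched since the last run.
changed = conn.execute(
    "SELECT order_id, amount, last_update FROM orders WHERE last_update > ?",
    (last_extract,),
).fetchall()

print(changed)   # rows 2 and 3 only
```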
Tips for Extracting
 Constrain on indexed columns
 Retrieve only the data you need
 Use DISTINCT sparingly
 Use the SET operations sparingly
 Use HINT (HINT tells the DBMS to make sure
it uses a certain index)
 Avoid NOT
 Avoid functions in the where clause
 Avoid subqueries
Transformation (ETL)
 Data Profiling
 Gather metadata
 Identify and Prepare Data for Profiling
 Value Analysis
 Structure Analysis
 Single Object Data Rule Analysis
 Multiple Object Data Rule Analysis

 Data Cleaning
 Parsing
 Correcting
 Standardizing
 Matching
 Consolidating
Cleaning Deliverables
 Keep accurate records of the types of data
quality problems you look for, when you look,
what you look at, etc
 Is data quality getting better or worse?
 Which source systems generate the most data
quality errors?
 Is there any correlation between data quality
levels and the performance of the organization
as a whole?
Cleaning and Conforming
 While the Extracting and Loading parts of an ETL process simply move data, the cleaning and conforming part (the transformation part) truly adds value
 How do we deal with dirty data?
 Data Profiling report
 The Error Event fact table
 Audit Dimension
Defining Data Quality
 Basic definition of data quality is data accuracy and
that means
 Correct: the values of the data are valid, e.g., my
resident state is PA
 Unambiguous: The values of the data can mean only
one thing, e.g., there is only one PA
 Consistent: the values of the data use the same
format, e.g., PA and not Penn, or Pennsylvania
 Complete: data are not null, and aggregates do not
lose data somewhere in the information flow
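A minimal sketch of some of the accuracy checks listed above (correct, consistent, complete), run against hypothetical customer rows; real profiling tools go much further.

```python
# Hypothetical extracted customer rows: (customer_id, state)
rows = [
    (1, "PA"),
    (2, "Penn"),        # inconsistent format
    (3, None),          # incomplete
    (4, "ZZ"),          # not a valid state code
]

VALID_STATES = {"PA", "NY", "CA"}                     # toy reference domain
CANONICAL = {"Penn": "PA", "Pennsylvania": "PA"}      # known non-standard spellings

issues = []
for cust_id, state in rows:
    if state is None:
        issues.append((cust_id, "complete", "state is NULL"))
    elif state in CANONICAL:
        issues.append((cust_id, "consistent", f"non-standard value {state!r}"))
    elif state not in VALID_STATES:
        issues.append((cust_id, "correct", f"invalid value {state!r}"))

for issue in issues:
    print(issue)
```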
Who cares about information
quality?
 Most organizations accept low quality data as normal; after all, we are profitable, aren’t we?
 In fact, as long as information quality is relatively the same across the competition, it’s probably acceptable
 But look what happened to the U.S. auto manufacturers (GM, Ford, Chrysler), who have been consistently losing ground to Japanese automobile quality
The high cost of low quality data #1
 Some Metro Nashville city pensioners were overpaid $2.3 million from 1987 to 1995, while another set of pensioners were underpaid $2.6 million, as a result of incorrect pension calculations (The Tennessean, March 21, 1998)
 Two 20-year old “calculation errors” socked Los Angeles County’s
pension systems with $1.2 billion in unforeseen liabilities and will force
county officials to make $25 million/year of unplanned contributions to
make up the difference (Los Angeles Times April 8, 1998)
 Wrong price data in retail databases may cost American consumers as
much as $2.5 billion in overcharges annually. Data audits show 4 out of
5 errors in prices are overcharges. (Information Week Sept 1992)
 Four years later, 1 out of 20 items scanned incorrectly, according to a
Federal Trade Commission study of 17,000 items.
The high cost of low quality data #2
 The US Attorney General’s office has stated that “approximately $23 billion or 14% of the health care dollar” is wasted in fraud or incorrect billing (Nashville Business Journal, Sept 1997)
 In 1992, 96,000 IRS tax refund checks were returned as undeliverable
due to bad addresses
 No fewer than 1 out of 6 US registered voters on voter registration lists
have either moved or are deceased, according to audits that compare
voter registration lists with the US postal office change-of-address lists
 Electronic data audits reveal that invalid data values in a typical
customer database averages 15-20%
 Barbra Streisand pulled her investment account from her investment
bank because it misspelled her name as “Barbara”
The high cost of low quality data #3
 The Gartner group estimates for the worldwide costs to modify software
and change databases to fix the Y2K problem was $400-$600 billion.
T.Capers Jones says this estimate is low, it should be $1.5 trillion. The
cost to fix this single pervasive error is one eighth of the US federal
deficit ($8 trillion Oct 2005).
 Another way to look at it. The 50 most profitable companies in the world
earned a combined $178 billion in profits in 1996. If the entire profit of
these companies was used to fix the problem, it would only fix about
12% of the problem
 And MS Excel, in the year 2000, still regarded 1900 as a leap year (which it is not).
Data Profiling Deliverable
 Start before building the ETL system
 Data profiling analysis including
 Schema definitions
 Business objects
 Domains
 Data Sources
 Table definitions
 Synonyms
 Data rules
 Value rules
 Issues that need to be addressed
Loading (ETL)
 Data are physically moved to the data
warehouse
 The loading takes place within a “load
window”
 The trend is to near real time updates of the
data warehouse as the warehouse is
increasingly used for operational applications
Dimensional Modeling
 The process and outcome of designing logical
database schemas created to support OLAP and
Data Warehousing solutions
 Used by most contemporary BI solutions
 – “Right” mix of normalization and denormalization
often called Dimensional Normalization
 – Some use for full data warehouse design
 – Others use for data mart designs
 Consists of two primary types of tables
 – Dimension tables
 – Fact tables
Dimensional Modeling …
Dimension Tables!
 Organized hierarchies of categories, levels, and members
 Used to “slice” and query within a cube
 Business perspective from which data is looked upon
 Collection of text attributes that are highly correlated
(e.g. Product, Store, Time)
 Shared with multiple fact relationships to provides data
correlation
Dimensional Modeling …
Dimension Details
 Attributes
Descriptive characteristics of an entity
 Building blocks of dimensions, describe each instance
 Usually text fields, with discrete values
 e.g., the flavor of a product, the size of a product
 Dimension Keys
 Surrogate Keys
 Candidate Business Keys
 Dimension Granularity
 Granularity in general is the level of detail of data contained
in an entity
 A dimension’s granularity is the lowest-level object which uniquely identifies a member
 Typically the identifying name of a dimension
Dimensional Modeling …
Hierarchies
Dimensional Modeling …
DW - Surrogate Keys
 OLTP – Natural Keys
 Production Keys
 Intelligent Keys
 Smart Keys
NKs (Natural Keys) tell us something about the record they represent
For example: Student IDNO - 2003B4A7290
 DW - Surrogate Keys
 Integer keys
 Artificial Keys
 Non-intelligent Keys
 Meaningless Keys
SKs do not tell us anything about the record they represent
 Surrogate Keys - Advantages
 Buffers the DW from operational changes
 Saves Space
 Faster Joins
 Allows proper handling of changing dimensions
Dimensional Modeling …
DW - Surrogate Keys
Buffering DW from operational changes
 Production keys are often reused
 For Eg. Inactive account numbers or obsolete product codes are reassigned after a
period of dormancy
 Not a problem in operational system, but can cause problems in a DW
 SKs allow the DW to differentiate between the two instances of the same production
key
Space Saving
 Surrogate Keys are integers
 4 bytes of space
 Are 4 bytes enough?
 Nearly 4 billion values!!!
 For example
 Date data type occupies 8 bytes
 10 million records in fact table
 Space saving = 4 bytes x 10 million rows = 40,000,000 bytes ≈ 38.15 MB

Faster Joins
 Every join between dimension table and fact table is based on SKs and not on NKs
 Which is faster?
 Comparing 2 strings
 Comparing 2 integers
 But the issue is – Do we need joins in the first place?

Changing Dimensions
 Surrogate keys help in saving historical data in the same dimension.
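A minimal sketch of how integer surrogate keys buffer the warehouse from reused production keys, as described above; the key map, counter, and the idea of an "instance" number are illustrative, not any specific tool's mechanism.

```python
from itertools import count

next_sk = count(1)   # surrogate keys are plain integers with no business meaning
key_map = {}         # (natural_key, instance) -> surrogate key

def assign_surrogate(natural_key: str, instance: int = 0) -> int:
    """Return (or create) the surrogate key for one instance of a natural key."""
    k = (natural_key, instance)
    if k not in key_map:
        key_map[k] = next(next_sk)
    return key_map[k]

# Product code "P-100" is used, retired, then reassigned to a new product.
old_sk = assign_surrogate("P-100", instance=0)
new_sk = assign_surrogate("P-100", instance=1)   # same production key, new instance

print(old_sk, new_sk)   # two different warehouse rows: 1 2
```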
Dimensional Modeling …
Dimension Table
 The final component of the dimension, besides the SK and NK, is a set of descriptive attributes (there may be a large number, approximately 100, in a dimension). The DW architect should not call for numeric measured fields in the dimension tables; dimensions should carry only textual, descriptive attributes. All descriptive attributes should be truly static or should change only slowly and episodically.
Example: the difference between a measured fact and a numeric descriptive attribute is obvious in 98% of cases, but sometimes it takes time to distinguish them. Take the case of catalog price: the standard catalog price is numeric, so it could be taken as a fact, but what happens when the standard price of a product changes? In that case we can’t take this numeric attribute as a measure or fact; it should be in the dimension table as a numeric descriptive attribute.
 Contains attributes for dimensions
 50 to 100 attributes common
 Best attributes are textual and descriptive
 DW is only as good as the dimension attributes
 Contains hierarchal information albeit redundantly
 Entry points into the fact table
Dimensional Modeling …
Dimension Types
Static Dimension:
When a dimension is static and is not being updated
for historical changes to individual rows, there is 1-to-1
relationship between the PK (SK) and the Natural key
(NK)
i.e. PK (SK) : NK :: 1 : 1
Dynamic or Slowly changing Dimension:-
When a dimension is slowly changing we generate
many PKs (SK) for each Natural Key as we track the
history of changes to the dimension then the
relationship between the PK (SK) to Natural Key (NK)
is N-to-1.
i.e. PK (SK) : NK :: N : 1
Dimensional Modeling …
Dimension Types
Big dimensions
Big means wide as well as deep, e.g. customer, product, location, with millions of records and a hundred or more fields. Almost always derived from multiple sources.
Examples: Customer, Product, Location
 Millions of records with hundreds of fields (insurance customers) or hundreds of millions of records with few fields (supermarket customers)
 Almost always derived from multiple sources
 These dimensions should be conformed

Small Dimensions
Many of the dimensions in the DWH are tiny lookup tables with only a few records and one or two columns, e.g. the transaction dimension. These dimensions need not be, and should not be, conformed across the various fact tables.
Examples: Transaction Type, Claim Status
 Tiny lookup tables with only a few records and one or two columns
 Built by typing into a spreadsheet and loading the data into the DW
 These dimensions should NOT be conformed
Dimensional Modeling …
One dimension or two dimension
 In dimensional modeling we usually think of dimensions as independent. But a good statistician would be able to demonstrate a degree of correlation between, say, the product dimension and the store dimension (if the degree of correlation is high, we could combine these two dimensions into one).
 But understand that if there are 10,000 rows in the product dimension and 100 rows in the store dimension, the combined dimension could have up to 1,000,000 rows, which is the disadvantage of combining the two dimensions.
 There may also be more than one independent type of correlation between the two dimensions.

 Weak statistical correlation and big dimensions


 Even if there is a weak correlation, build the DW as if the dimensions are independent,
especially for large dimensions
 100 stores × 1 million products would be 100 million rows for the combined dimension: too large
 If the statistical correlation is significant, build a fact table for the correlation, and there may be
many such correlations, e.g., merchandizing correlation, pricing-strategy correlation,
changing-seasonality correlation
 Leave the dimensions simple and independent if the dimensions are large and the statistical
correlation is weak

 Strong correlations and small resulting dimensions


 If the correlation is strong, e.g., product always has a brand, leave the brand OUT of the fact
table, because the product always rolls up to a single brand and COMBINE the product and
the brand into a single dimension
 Arbitrary bound: 100K records is not small dimension
Dimensional Modeling …
Dimensional Roles
 When the same dimension is attached to a fact table multiple
times
 Sale has an order date, payment date, a shipping date, a
return date
 A claim transaction has multiple internal people, a claim
intake, a medical nurse, a claim adjuster, a payer, etc
 End-user might be confused when they do drill-down into the
dimension as to what the dimension represents, so it helps to
have the dimension have different names
 Solution: Create views over the dimensions for each of the
dimensional role to help with the name recognition
 For very large dimensions (e.g. locations) it might be appropriate to create physical copies of the dimension table to help with the name confusion
Dimensional Modeling …
Role-Playing and Time Dimensions
Dimensional Modeling …
Junk/Profile Dimension
 Miscellaneous flags and textual fields are left over in the source data structures. These include Y/N flags, textual codes and free-form textual attributes that are not significant enough to be fields in the major dimensions. We are left with these options:

 Exclude and discard all flags and texts: not a good option.
 Place the flags and texts unchanged in the fact table: also not a good option, as it swells the fact table to no specific advantage.
 Make the flags and textual fields into separate dimension tables of their own: not good, because it increases the number of dimension tables.
 Best approach: keep only those flags and texts that are meaningful, and group all the useful flags into a single “JUNK” dimension. These are useful for constraining queries based on flag/text values.

 Again on junk: some data sources have a dozen or more operational codes attached to the fact table records, many of which have very low cardinalities, even if there is no obvious correlation between the values of the operational codes. A single junk dimension can be created to bring all these little codes into one dimension and tidy up the design. The records in the junk dimension should probably be created as they are encountered in the data, rather than beforehand as the Cartesian product of all the separate codes. The incrementally produced junk dimension is likely to be much smaller than the full Cartesian product of all the values of the codes.
Dimensional Modeling …
Degenerated Dimension
 When a parent-child relationship exists and
the grain of the fact table is the child, the
parent is kind of left out in the design process
 Example:
 the grain of the fact table is the line item in an order
 the order number is a significant part of the key
 but we don’t create a dimension for the order number, because it would be useless
 we insert the order number as part of the key, as if it were a dimension, but we don’t create a dimension table for it
Dimensional Modeling …
Date and Time Dimension
 Virtually everywhere: measurements are defined at specific times, repeated over time, etc.
 Most common: calendar-day dimension with the grain of a single day, many attributes
 Doesn’t have a conventional source:
 Built by hand, spreadsheet
 Holidays, workdays, fiscal periods, week numbers, last-day-of-month flags must be entered manually
 10 years are about 4K rows
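Because the calendar-day dimension has no conventional source, it is usually generated by hand; the sketch below builds a few of the attributes mentioned above (day of week, week number, month, quarter, month-end flag) for an assumed 10-year range. Holidays, workdays, and fiscal periods would still have to be entered manually.

```python
from datetime import date, timedelta

def build_date_dimension(start: date, end: date):
    rows, sk, day = [], 1, start
    while day <= end:
        rows.append({
            "date_key": sk,                       # surrogate key
            "full_date": day.isoformat(),
            "day_of_week": day.strftime("%A"),
            "week_number": int(day.strftime("%W")),
            "month": day.strftime("%B"),
            "quarter": (day.month - 1) // 3 + 1,
            "year": day.year,
            "is_month_end": (day + timedelta(days=1)).day == 1,
            # holidays, workdays, fiscal periods: entered manually afterwards
        })
        sk += 1
        day += timedelta(days=1)
    return rows

dim_date = build_date_dimension(date(2020, 1, 1), date(2029, 12, 31))
print(len(dim_date))   # 3,653 rows for 10 years, in line with "about 4K rows"
```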
Dimensional Modeling …
Slow-changing Dimensions
 When the DW receives notification that some
record in a dimension has changed, there are
three basic responses:
 Type 1 slow changing dimension (Overwrite)
 Type 2 slow changing dimension (Partitioning
History)
 Type 3 slow changing dimension (Alternate
Realities)
Dimensional Modeling …
Type 1 Slowly Changing Dimension (Overwrite)
 Overwrite one or more values of the dimension with the new value
 Use when
 the data are corrected
 there is no interest in keeping history
 there is no need to run previous reports or the changed value is immaterial to the report
 Type 1 Overwrite results in an UPDATE SQL statement when the value changes
 If a column is Type-1, the ETL subsystem must
 Add the dimension record, if it’s a new value or
 Update the dimension attribute in place
 Must also update any Staging tables, so that any subsequent DW load from the staging tables
will preserve the overwrite
 This update never affects the surrogate key
 But it affects materialized aggregates that were built on the value that changed (will be
discussed more next week when we talk about delivering fact tables)
 Beware of ETL tools “Update else Insert” statements, which are convenient but inefficient
 Some developers use “UPDATE else INSERT” for fast changing dimensions and “INSERT else UPDATE”
for very slow changing dimensions
 Better Approach: Segregate INSERTS from UPDATES, and feed the DW independently for the updates
and for the inserts
 No need to invoke a bulk loader for small tables, simply execute the SQL updates, the performance impact
is immaterial, even with the DW logging the SQL statement
 For larger tables, a loader is preferable, because SQL updates will result in unacceptable database logging activity
 Turn the logger off before you update with SQL Updates and separate SQL Inserts
 Or use a bulk loader
 Prepare the new dimension in a staging file
 Drop the old dimension table
 Load the new dimension table using the bulk loader
Dimensional Modeling …
Type-2 Slowly Changing Dimension (Partitioning History)
 Standard

 When a record changes, instead of overwriting


 create a new dimension record
 with a new surrogate key
 add the new record into the dimension table
 use this record going forward in all fact tables
 no fact tables need to change
 no aggregates need to be re-computed

 Perfectly partitions history because each detailed version of the dimension is correctly connected to the span of fact tables for which that version is correct

 With a Type-2 change, you might want to include the following additional
attributes in the dimension
 Date of change
 Exact timestamp of change
 Reason for change
 Current Flag (current/expired)
Dimensional Modeling …
Type-2 Slowly Changing Dimension (Partitioning History)

 The natural key does not


change
 The job attribute changes
 We can constraint our query
 the Manager job
 Joe’s employee id
 Type-2 do not change the
natural key (the natural key
should never change)
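A minimal sketch of the Type-2 response described above: the natural key is never changed, a new surrogate key and row are created, and begin/end dates plus a current flag partition history. Table and attribute names are illustrative.

```python
from datetime import date

customer_dim = [
    # surrogate key, natural key, attribute, validity range, current flag
    {"sk": 1, "nk": "C-42", "city": "Pune",
     "begin": date(2020, 1, 1), "end": date(9999, 12, 31), "current": True},
]
next_sk = 2

def apply_type2_change(nk: str, attr: str, new_value: str, effective: date):
    """Expire the current row for the natural key and insert a new version."""
    global next_sk
    current = next(r for r in customer_dim if r["nk"] == nk and r["current"])
    if current[attr] == new_value:
        return                                   # nothing actually changed
    current["end"], current["current"] = effective, False        # expire old version
    customer_dim.append({**current, "sk": next_sk, attr: new_value,
                         "begin": effective, "end": date(9999, 12, 31),
                         "current": True})                        # new version
    next_sk += 1

apply_type2_change("C-42", "city", "Mumbai", date(2024, 6, 1))
for row in customer_dim:
    print(row)
# Facts loaded before 2024-06-01 keep pointing at sk=1; later facts use sk=2.
```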
Dimensional Modeling …
Type-3 Slowly Changing Dimension (Alternate Realities)
 Applicable when a change happens to a dimension record but
the old record remains valid as a second choice
 Product category designations
 Sales-territory assignments
 Instead of creating a new row, a new column is inserted (if it
does not already exist)
 The old value is added to the secondary column
 Before the new value overrides the primary column
 Example: old category, new category
 Usually defined by the business after the main ETL process is
implemented
 “Please move Brand X from Men’s Sportswear to Leather
goods but allow me to track Brand X optionally in the old
category”
 The old category is described as an “Alternate reality”
Dimensional Modeling …
Fact/Measure/Metric
 Fully Additive Facts
 Units_sold, Sales_amt
 Semi Additive Facts
 Account_balance, Customer_count
 Sample fact rows (date, product, store, units, amount, customer count):
 28/3, tissue paper, store1, 25, 250, 20
 28/3, paper towel, store1, 35, 350, 30
 Is the number of customers who bought either tissue paper or paper towel 50? (Not necessarily: the same customers may appear in both rows, so customer counts are not additive across products.)
 Non Additive Facts
 Gross margin=Gross profit/amount
 Note that GP and Amount are fully additive
 Ratio of the sums and not sum of the ratios
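A small worked example of the point above that a ratio such as gross margin is non-additive: the correct total is the ratio of the sums, not the sum (or average) of the ratios. The numbers are made up.

```python
# (gross_profit, sales_amount) for two fact rows
rows = [(20.0, 250.0), (30.0, 350.0)]

ratio_of_sums = sum(gp for gp, _ in rows) / sum(amt for _, amt in rows)
avg_of_ratios = sum(gp / amt for gp, amt in rows) / len(rows)

print(f"ratio of the sums  : {ratio_of_sums:.4f}")   # 50 / 600 = 0.0833 (correct)
print(f"average of ratios  : {avg_of_ratios:.4f}")   # (0.0800 + 0.0857) / 2 = 0.0829 (wrong)
```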
Dimensional Modeling …
Schema Designs
 Star schema: A fact table in the middle connected to a set of
dimension tables
 Snowflake schema: A refinement of star schema where some
dimensional hierarchy is normalized into a set of smaller dimension
tables, forming a shape similar to snowflake
 Advantages
 Small saving in storage space
 Normalized structures are easier to update and maintain
 Disadvantages
 Schema less intuitive
 Ability to browse through the content difficult
 Degraded query performance because of additional joins.

 Fact constellations: Multiple fact tables share dimension tables,


viewed as a collection of stars, therefore called galaxy schema or fact
constellation
Dimensional Modeling …
The basic structure of a fact table
 Every table defined by its
grain
 in business terms
 in terms of the dimension foreign
keys and other fields
 A set of foreign keys (FK)
 context for the fact
 Join to Dimension Tables
 Degenerate Dimensions
 Part of the key
 Not a foreign key to a Dimension
table
 Primary Key
 a subset of the FKs
 must be defined in the table
 Fact Attributes
 measurements
Dimensional Modeling …
Star Schema
Sales Fact Table: time_key, item_key, branch_key, location_key, units_sold, dollars_sold, avg_sales (measures/facts/metrics)

Dimension tables:
 time: time_key, day, day_of_the_week, month, quarter, year
 item: item_key, item_name, brand, type, supplier_type
 branch: branch_key, branch_name, branch_type
 location: location_key, street, city, province_or_state, country
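A minimal sketch of the star schema above expressed as DDL, executed against SQLite purely so the example is self-contained; the column types are simplifying assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_time   (time_key INTEGER PRIMARY KEY, day TEXT, day_of_the_week TEXT,
                         month TEXT, quarter TEXT, year INTEGER);
CREATE TABLE dim_item   (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT,
                         type TEXT, supplier_type TEXT);
CREATE TABLE dim_branch (branch_key INTEGER PRIMARY KEY, branch_name TEXT, branch_type TEXT);
CREATE TABLE dim_location (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT,
                           province_or_state TEXT, country TEXT);

-- Fact table: foreign keys give the context, numeric columns are the measures.
CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES dim_time(time_key),
    item_key     INTEGER REFERENCES dim_item(item_key),
    branch_key   INTEGER REFERENCES dim_branch(branch_key),
    location_key INTEGER REFERENCES dim_location(location_key),
    units_sold   INTEGER,
    dollars_sold REAL,
    avg_sales    REAL,
    PRIMARY KEY (time_key, item_key, branch_key, location_key)
);
""")
print("star schema created")
```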
Dimensional Modeling …
Star Schema Example
Dimensional Modeling …
Star Schema with sample data
Dimensional Modeling …
Multiple Fact Tables
 More than one fact table in a given star
schema.
 Ex: There are 2 fact tables, one at the center
of each star:
 Sales – facts about the sale of a product to a customer
in a store on a date.
 Receipts - facts about the receipt of a product from a
vendor to a warehouse on a date.
 Two separate product dimension tables have been
created.
 One date dimension table is used.
Dimensional Modeling …
Multiple Fact tables
Dimensional Modeling …
Snowflake Schema
Sales Fact Table: time_key, item_key, branch_key, location_key, units_sold, dollars_sold, avg_sales (measures/facts/metrics)

Dimension tables (partially normalized):
 time: time_key, day, day_of_the_week, month, quarter, year
 item: item_key, item_name, brand, type, supplier_key
 supplier: supplier_key, supplier_type
 branch: branch_key, branch_name, branch_type
 location: location_key, street, city_key
 city: city_key, city, province_or_state, country
Dimensional Modeling …
Snowflake Schema
REVENUE (fact table): transaction id, movie key (FK), market key (FK), customer key (FK), time key (FK), payment date, payment status, movie rental rate, overdue charge, payment amount

Dimension tables:
 MOVIE: movie key, movie copy number, movie number, movie title, rental status, rental date, due date
 CUSTOMER: customer key, first name, last name
 ADDRESS: customer key (FK), street address, city, state, zip
 TIME: time key, day, month, year, period
 MARKET: market key, store number, store city, store state, region key (FK), district key (FK)
 DISTRICT: district key, number, name, office address, manager name
 REGION: region key, number, name, office address, manager name
Dimensional Modeling …
Fact Constellation or Galaxy Schema

Sales Fact Table: time_key, item_key, branch_key, location_key, units_sold, dollars_sold, avg_sales (measures)

Shipping Fact Table: time_key, item_key, shipper_key, from_location, to_location, dollars_cost, units_shipped

Shared dimension tables:
 time: time_key, day, day_of_the_week, month, quarter, year
 item: item_key, item_name, brand, type, supplier_type
 branch: branch_key, branch_name, branch_type
 location: location_key, street, city, province_or_state, country
 shipper: shipper_key, shipper_name, location_key, shipper_type
Dimensional Modeling …
Normalizing dimension tables
 Dimension tables may not be normalized. Most data
warehouse experts find this acceptable.
 There are some situations in which it makes sense to further normalize dimension tables.
 Multivalued dimensions:
 Ex: Hospital charge/payment for a patient on a date is
associated with one or more diagnosis.
 N:M relationship between the Diagnosis and Finances
fact table.
 Solution: create an associative entity (helper table)
between Diagnosis and Finances.
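A minimal sketch of the helper (associative) table suggested above for the hospital example; the table and column names are assumptions, and the weighting factors often carried on such bridge tables are omitted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_diagnosis (diagnosis_key INTEGER PRIMARY KEY, description TEXT);
CREATE TABLE finances_fact (charge_key INTEGER PRIMARY KEY, patient_key INTEGER,
                            date_key INTEGER, diagnosis_group_key INTEGER, amount REAL);
-- Helper table: one charge (via its diagnosis group) maps to many diagnoses.
CREATE TABLE diagnosis_group_bridge (
    diagnosis_group_key INTEGER,
    diagnosis_key       INTEGER REFERENCES dim_diagnosis(diagnosis_key),
    PRIMARY KEY (diagnosis_group_key, diagnosis_key)
);
""")
conn.executemany("INSERT INTO dim_diagnosis VALUES (?, ?)",
                 [(1, "Hypertension"), (2, "Diabetes")])
conn.execute("INSERT INTO finances_fact VALUES (100, 7, 20240601, 55, 1200.0)")
conn.executemany("INSERT INTO diagnosis_group_bridge VALUES (?, ?)", [(55, 1), (55, 2)])

# One charge row, two associated diagnoses via the bridge (N:M relationship).
for row in conn.execute("""
    SELECT f.charge_key, d.description, f.amount
    FROM finances_fact f
    JOIN diagnosis_group_bridge b ON b.diagnosis_group_key = f.diagnosis_group_key
    JOIN dim_diagnosis d ON d.diagnosis_key = b.diagnosis_key"""):
    print(row)
```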
Dimensional Modeling …
Multivalued Dimension
Dimensional Modeling …
Guaranteeing Referential Integrity
1. Check Before Loading
• Check before you add fact
records
• Check before you delete
dimension records
• Best approach
2. Check While Loading
• DBMS enforces RI
• Elegant but typically SLOW
• Exception: Red Brick
database system is capable
of loading 100 million records
an hour into a fact table
where it is checking
referential integrity on all the
dimensions simultaneously!
3. Check After Loading
• No RI in the DBMS
• Periodic checks for invalid
foreign keys looking for
invalid data
• Ridiculously slow
Dimensional Modeling …
Options for loading the Surrogate Keys of Dimensions
 Option 1: Look up the current surrogate key in each
dimension, fetch the record with the most current
surrogate key for the natural key and use that
surrogate key. Good option but very slow.
 Option 2: Maintain a surrogate key lookup table for
each dimension. This table is updated whenever a
new record is added or when a Type-2 update occurs
in an existing dimensional entity.
 The dimensions must be updated with Type-2 updates
before any facts are loaded into the Data Warehouse,
to guarantee referential integrity
 Also known as surrogate key pipeline method
Dimensional Modeling …
Surrogate Key Pipeline
 Assume that all records to be added to the fact table
are current, i.e.,
 in the incoming fact records, the value for the natural
key of each dimension is the most current value known
to the DW
 When loading a fact table, the final ETL step
converts the natural keys of the new input records
into the correct surrogate key of the dimensions using
the key mapping tables
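A minimal sketch of the surrogate key pipeline (Option 2 above): incoming fact records arrive with natural keys, and a per-dimension key-mapping lookup swaps them for the current surrogate keys before the fact rows are loaded. The mapping contents are illustrative.

```python
# Surrogate key lookup tables, maintained as the dimensions are loaded
# (natural key -> most current surrogate key).
customer_map = {"C-42": 2, "C-77": 5}
product_map  = {"P-100": 9, "P-200": 11}

incoming_facts = [
    {"customer_nk": "C-42", "product_nk": "P-200", "units": 3, "amount": 45.0},
    {"customer_nk": "C-77", "product_nk": "P-100", "units": 1, "amount": 20.0},
]

def to_surrogate_keys(fact: dict) -> dict:
    """Replace natural keys with surrogate keys; fail loudly on an unknown key
    so referential integrity is preserved."""
    return {
        "customer_key": customer_map[fact["customer_nk"]],
        "product_key":  product_map[fact["product_nk"]],
        "units":        fact["units"],
        "amount":       fact["amount"],
    }

loadable = [to_surrogate_keys(f) for f in incoming_facts]
print(loadable)
```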
Dimensional Modeling …
Surrogate Key Pipeline
Dimensional Modeling …
Kinds of Fact Tables
 Each fact table should have one and only one
fundamental grain
 There are three types of fact tables
 Transaction grain
 Periodic snapshot grain
 Accumulating snapshot grain
Dimensional Modeling …
Transaction Grain Fact Tables
 The grain represents an instantaneous measurement
at a specific point in space and time
 retail sales transaction
 The largest and the most detailed type
 Unpredictable sparseness, i.e., given a set of
dimensional values, no fact may be found; could this
be significant to the end users?
 Usually partitioned by time
Dimensional Modeling …
Periodic Snapshot Fact Tables
 The grain represents a span of time periodically repeated
 A periodic snapshot for a checking account in a bank,
reported every month
 Beginning balance
 Ending balance
 Number of deposits
 Total for deposits
 Number of withdrawals
 Total for withdrawals
 Graceful modifications
 Predictable sparseness
 one checking account record per account per month
 what if there is no account activity? How do you create a fact record
then?
 What should be the surrogate key for the non-date attributes, the values
at the beginning or the end of the period?
Dimensional Modeling …
Accumulating Snapshot Fact Tables
 The grain represents finite processes that have a definite
beginning and an end
 Order fulfillment, claim processing, “small workflows”
 But not large complicated looping workflows
 Example: shipment invoice line item
 Order date
 Requested ship date
 Actual ship date
 Delivery date
 Last payment date
 Return date
 Characteristics: large number of date dimensions, data are
created and overwritten multiple times as events unfold
Dimensional Modeling …
Loading a Table
 Separate inserts from updates (if updates are relatively few compared
to insertions and compared to table size)
 First process the updates (with SQL updates?)
 Then process the inserts
 Use a bulk loader
 To improve performance of the inserts & decrease database overhead
 Load in parallel
 Break data in logical segments, say one per year & load the data in parallel
 Minimize physical updates
 To decrease database overhead with writing the logs
 It might be better to delete the records to be updated and then use a bulk-
loader to load the new records
 Some trial and error is necessary
 Perform aggregates outside of the DBMS
 SQL has COUNT, MAX, etc. functions and GROUP BY, ORDER BY constructs
 But they are slow compared to dedicated tools outside the DBMS
 Replace entire table (if updates are many compared to the table size)
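A minimal sketch of the "separate inserts from updates" advice above, assuming a staged batch of incoming rows and a lookup of keys already present in the target table; the bulk-load and parallel-load steps are only indicated in comments.

```python
# Incoming staged rows for a target table: (business_key, payload)
incoming = [("A", {"qty": 5}), ("B", {"qty": 7}), ("C", {"qty": 2})]

# Keys already present in the warehouse table (e.g. fetched once, up front).
existing_keys = {"A", "B"}

updates = [(k, p) for k, p in incoming if k in existing_keys]
inserts = [(k, p) for k, p in incoming if k not in existing_keys]

# 1. Process the (relatively few) updates first, e.g. with SQL UPDATE statements.
for key, payload in updates:
    print(f"UPDATE target SET qty = {payload['qty']} WHERE business_key = '{key}'")

# 2. Then hand the inserts to a bulk loader (possibly split into parallel segments).
print(f"{len(inserts)} new rows written to a flat file for the bulk loader")
```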
Dimensional Modeling …
Updating and Correcting Facts

 Should we change fact data once they are in the data warehouse? No if they represent business events, but yes if they represent data errors in the OLTP systems
 Example
 Company sells 1000 containers of sodas, each
container has 12 oz cans
 A mistake is made, the container has 20oz cans
 This mistake must be corrected in the DW
 The business never sold 12oz cans

 Preserving the data might misrepresent the business


Dimensional Modeling …
Updating and Correcting Facts
 Negate the fact
 Create an exact duplicate of the fact where all the
measurements are negated (-minus), so the measures “cancel”
each other in summaries
 Do it for audit reasons and/or if capturing/measuring erroneous
entries is significant to the business
 Update the fact
 Delete and reload the fact
 Drawback: current versions of previously released reports no
longer valid
 Physical deletes -- the record is deleted
 Logical deletes -- the record is tagged “deleted”
 If the source system does not contain an audit trail to give you
the deletions, deletions must be “deduced” by comparing the
source data with the data warehouse
Dimensional Modeling …
Graceful Modifications
 One of the significant advantages of dimensional models is to support graceful
modifications
 Adding a fact/measurement to an existing fact table at the same grain
 Values for the new measurement in the old records are left NULL
 NULLS do not change counts and other aggregate functions
 Adding a dimension to an existing fact table at the same grain
 The FK of the new dimension must point to the Not-Applicable record
in the new dimension for the old records in the fact table
 Adding an attribute to an existing dimension
 Type-1 Dimension: Do nothing
 Type-2 Dimension: All records referring to time spans preceding the introduction of the attribute have NULL for the attribute
 Increasing the granularity of existing fact and dimension tables
 Very tricky
 Example: change the date dimension from a week to a day
 Can be done if there is a many-to-1 mapping
 Use an ALTER table command, not a drop and recreate
Dimensional Modeling …
Late Arriving Facts
 Suppose we receive today a purchase order that is one month old and our
dimensions are type-2 dimensions
 We are willing to insert this late arriving fact into the correct historical position,
even though our sales summary for last month will change
 We must be careful how we will choose the old historical record for which this
purchase applies
 For each dimension, find the corresponding dimension record in effect at
the time of the purchase
 Using the surrogate keys found above, replace the incoming natural keys
with the surrogate keys
 Insert the late arriving record in the correct partition of the table

1. The dimension records must have a begin and end timestamp so that they can
be located quickly for the late arriving fact
2. We must be willing to accept late arriving facts and invalidate previously
published reports
3. If using partitioning you must guarantee that the late arriving fact is put in its
correct partition
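A minimal sketch of step 1 above: using the begin/end timestamps on Type-2 dimension rows to find the surrogate key that was in effect on the (late) transaction date. The dimension rows are illustrative.

```python
from datetime import date

# Type-2 customer dimension rows with validity ranges.
customer_dim = [
    {"sk": 1, "nk": "C-42", "begin": date(2023, 1, 1), "end": date(2024, 5, 31)},
    {"sk": 2, "nk": "C-42", "begin": date(2024, 6, 1), "end": date(9999, 12, 31)},
]

def surrogate_key_as_of(nk: str, as_of: date) -> int:
    """Return the surrogate key that was valid on the transaction date."""
    for row in customer_dim:
        if row["nk"] == nk and row["begin"] <= as_of <= row["end"]:
            return row["sk"]
    raise LookupError(f"no dimension row for {nk} on {as_of}")

# A purchase order that is a month old must use the historically correct key.
print(surrogate_key_as_of("C-42", date(2024, 5, 15)))   # -> 1, not 2
```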
Dimensional Modeling …
Error Event Table Deliverable
 Built as a star schema
 Each data quality error or issue is added to the table
Dimensional Modeling …
Audit Dimension Deliverable
 Captures the specific data quality of a given table

Audit dimension attributes:
 audit key (PK)
 overall quality category (text)
 overall quality score (integer)
 completeness category (text)
 completeness score (integer)
 validation category (text)
 validation score (integer)
 out-of-bounds category (text)
 out-of-bounds score (integer)
 number of screens failed
 max severity score
 extract time stamp
 clean time stamp
 conform time stamp
 ETL system version
 allocation version
 currency conversion version
 other audit attributes
Factless Fact Tables
 There are applications in which fact tables do not have any non-key data but do have foreign keys for the associated dimensions.

 The two situations:
 To track events
 To inventory the set of possible occurrences (called coverage)
Factless fact table showing occurrence
of an event.
Factless fact table showing coverage
Dimensional Modeling VS Entity-
Relationship Modeling
 An OLTP system requires a normalized structure to minimize redundancy,
provide validation of input data, and support a high volume of fast transactions.
A transaction usually involves a single business event, such as placing an order
or posting an invoice payment. An OLTP model often looks like a spider web of
hundreds or even thousands of related tables.

 In contrast, a typical dimensional model uses a star design that is easy to


understand and relate to business needs, supports simplified business queries,
and provides superior query performance by minimizing table joins.

 Dimensional modeling, is the name of the logical design technique often used
for data warehouses. It is different from entity-relationship modeling.

 Entity relationship modeling is a logical design technique that seeks to eliminate


data redundancy while Dimensional modeling seeks to present data in a
standard framework that is intuitive and allows for high-performance access.

 For example, a query that requests the total sales income and quantity sold for a
range of products in a specific geographical region for a specific time period can
typically be answered in a few seconds or less regardless of how many
hundreds of millions of rows of data are stored in the data warehouse database.
Entity–Relationship Modeling

[Diagram: an ER model shown as a web of many related tables: Customer Demographics, CustomerSubscriptions, Salesperson, Payment, SalesConditions, Channel, Campaign, Offer, Carrier Master, Carrier History, Campaign History, District, Zones, City]
Dimensional Modeling
[Diagram: a star design. The Subscription Sales fact table (EffectiveDateKey, CustomerKey, SubscriptionsKey, PaymentKey, CampaignKey, SalesPersonKey, RouteKey, DemographicsKey, UnitsSold, DollarsSold, DiscountCost, PremiumCost) is surrounded by the Customer, Date, Payment, Subscriptions, Campaign, Salesperson, Route, and Demographics dimensions]
Entity–Relationship Modeling
Dimensional Model
Few Facts:
 Q: Ralph Kimball invented the fact and dimension terminology.
 A: While Ralph played a critical role in establishing these terms
as industry standards, he didn’t “invent” the concepts. As best as
we can determine, the terms facts and dimensions originated
from a joint research projected conducted by General Mills and
Dartmouth University in the 1960s. By the 1970s, both AC
Nielsen and IRI used these terms consistently when describing
their syndicated data offerings. Ralph first heard about
“dimensions,” “facts,” and “conformed dimensions” from AC
Nielsen in 1983 as they were explaining their dimensional
structures for simplifying the presentation of analytic information.

 Q: Dimensional models are fully denormalized.


 A: Dimensional models combine normalized and denormalized
table structures. The dimension tables of descriptive information
are highly denormalized with detailed and hierarchical roll-up
attributes in the same table. Meanwhile, the fact tables with
performance metrics are typically normalized.
