Lecture 03
TIES443
Lecture 3
Data Warehousing
Mykola Pechenizkiy
Course webpage: http://www.cs.jyu.fi/~mpechen/TIES443
November 3, 2006
Department of Mathematical Information Technology, University of Jyväskylä
TIES443: Introduction to DM Lecture 3: Data Warehousing
UNIVERSITY OF JYVÄSKYLÄ
OLAP
Codd's 12 rules for OLAP
Main OLAP operations
New buzzwords
Data Warehouse
A decision support DB that is maintained separately from the organization's operational databases.
Why Separate Data Warehouse?
High performance for both systems
DBMS tuned for OLTP
access methods, indexing, concurrency control, recovery
Three-Tier Architecture
[Figure: three-tier data warehouse architecture. Data sources (operational DBs and other sources) are extracted, transformed, loaded, and refreshed into the data storage tier (data warehouse, data marts, metadata), which serves OLAP servers (e.g., a ROLAP server) that feed the analysis, query/reporting, and data mining tools.]
Three-Tier Architecture
Warehouse database server
Almost always a relational DBMS, rarely flat files
Schema design
Specialized scan, indexing and join techniques
Handling of aggregate views (querying and materialization)
Supporting query language extensions beyond SQL
Complex query processing and optimization
Data partitioning and parallelism
OLAP servers
Relational OLAP (ROLAP): extended relational DBMS that maps operations on multidimensional data to standard relational operators
Multidimensional OLAP (MOLAP): special-purpose server that directly implements multidimensional data and operations
Hybrid OLAP (HOLAP): user flexibility, e.g., low level: relational, high level: array
Specialized SQL servers: specialized support for SQL queries over star/snowflake schemas
Clients
Query and reporting tools
Analysis tools
Data mining tools
[Figure: data warehouse architecture variants built on the sources: centralized architecture, federated architecture, and tiered architecture.]
Data Warehouse
A data warehouse is a subject-oriented, integrated, time-varying, non-volatile collection of data that is used primarily in organizational decision making
Data Warehouse: Subject-Oriented
Organized around major subjects, such as customer, product, sales
Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing
Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process
Data Warehouse: Integrated
Constructed by integrating multiple, heterogeneous data sources
relational databases, flat files, on-line transaction records
Data Warehouse: Non-Volatile
A physically separate store of data transformed from the operational environment
Operational update of data does not occur in the data warehouse environment
Does not require transaction processing, recovery, and concurrency control mechanisms
Requires only two operations in data accessing: initial loading of data and access of data
Data warehouse
update-driven, high performance
Information from heterogeneous sources is integrated in advance and stored in warehouses for direct query and analysis
Snowflake Schema
Star Schema
A single fact table and a single table for each dimension
Every fact points to one tuple in each of the dimensions and has additional attributes
Does not capture hierarchies directly
Generated keys are used for performance and maintenance reasons
Fact constellation: multiple fact tables that share many dimension tables
Example: Projected expense and the actual expense may share dimensional tables
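A minimal sketch of a star schema, using Python's built-in sqlite3 module. The table names, column names, and rows (product, store, sales, the cities) are hypothetical and chosen only for illustration: one fact table holds generated keys into each dimension table plus a measure, and a query joins the fact table to a dimension to summarize the measure.

```python
import sqlite3

# In-memory database; all table names and rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE store   (store_key   INTEGER PRIMARY KEY, city TEXT);
-- Fact table: one generated key per dimension plus the measure.
CREATE TABLE sales (
    product_key INTEGER REFERENCES product(product_key),
    store_key   INTEGER REFERENCES store(store_key),
    amount      INTEGER
);
INSERT INTO product VALUES (1, 'bolt', 'hardware'), (2, 'nut', 'hardware');
INSERT INTO store   VALUES (1, 'Toronto'), (2, 'Belfast');
INSERT INTO sales   VALUES (1, 1, 12), (2, 1, 11), (1, 2, 50);
""")

# Total sales per city: join the fact table to one dimension and aggregate.
rows = conn.execute("""
    SELECT s.city, SUM(f.amount)
    FROM sales AS f
    JOIN store AS s ON s.store_key = f.store_key
    GROUP BY s.city
    ORDER BY s.city
""").fetchall()
print(rows)
```

Every fact row points to exactly one tuple in each dimension, which is what makes the join-then-aggregate pattern straightforward.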
Some Terms
The relation that relates the dimensions to the measure of interest is called the fact table (e.g., sale)
Information about dimensions can be represented as a collection of relations called the dimension tables (e.g., product, customer, store)
Each dimension can have a set of associated attributes
For each dimension, the set of associated attributes can be structured as a hierarchy
[Figure: a location hierarchy with the levels country, city, and office; example values include Canada, Mexico, Ireland, France, Toronto, Belfast, and Blackrock.]
Product (ProductNO, ProdName, ProdDescr, Category, UnitPrice)
Date (DateKey, Date, Month)
City (CityName, State, Country)
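[Figure: example schema; the recoverable dimension tables are listed below, and the fact table holds the measures.]
Item (item_key, item_name, brand, type, supplier_key)
Location (location_key, street, city, province_or_state, country)
Shipper (shipper_key, shipper_name, location_key, shipper_type)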
Fact relation (sale):
Product  Client  Date  Amt
p1       c1      1     12
p2       c1      1     11
p1       c3      1     50
p2       c2      1     8
p1       c1      2     44
p1       c2      2     4

Two-dimensional cube (values match the day 1 rows of the fact relation):
      c1   c2   c3
p1    12   -    50
p2    11   8    -

3-dimensional cube:
day 1
      c1   c2   c3
p1    12   -    50
p2    11   8    -
day 2
      c1   c2   c3
p1    44   4    -
p2    -    -    -
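The mapping from the fact relation to the cubes can be sketched as follows. This is an illustrative sketch, not a description of how an OLAP server stores data: the cube is assumed to be a plain dictionary keyed by dimension values, and the helper name build_cube is hypothetical. Empty cells are simply absent from the dictionary.

```python
from collections import defaultdict

# Fact relation from the slide: (product, client, date, amount).
SALE = [
    ("p1", "c1", 1, 12),
    ("p2", "c1", 1, 11),
    ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8),
    ("p1", "c1", 2, 44),
    ("p1", "c2", 2, 4),
]

def build_cube(facts, dims):
    """Sum the amount for every combination of the chosen dimension positions."""
    cube = defaultdict(int)
    for row in facts:
        key = tuple(row[d] for d in dims)
        cube[key] += row[3]
    return dict(cube)

# 3-dimensional cube: product x client x date.
cube_3d = build_cube(SALE, dims=(0, 1, 2))

# 2-dimensional cube for day 1 only: product x client.
cube_day1 = build_cube([r for r in SALE if r[2] == 1], dims=(0, 1))

print(cube_3d)
print(cube_day1)
```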
Multidimensional Data
Sales volume as a function of product, month, and region
Dimensions: Product, Location, Time
[Figure: sales cube with axes Product, Region, and Month; the hierarchical summarization paths are headed by Industry, Region, and Year.]
A data cube allows data to be modeled and viewed in multiple dimensions (such as sales)
Dimension tables, such as item (item_name, brand, type), or time (day, week, month, quarter, year)
Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables
Definitions
An n-dimensional base cube is called a base cuboid
The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid
The lattice of cuboids forms a data cube
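The lattice of cuboids can be enumerated directly: each cuboid corresponds to one subset of the dimensions. The sketch below assumes four dimensions (time, item, location, supplier) and no concept hierarchies inside a dimension, in which case there are 2^n cuboids for n dimensions. The function name cuboids is hypothetical.

```python
from itertools import combinations

def cuboids(dimensions):
    """Group every subset of the dimensions by its number of dimensions."""
    lattice = {}
    for k in range(len(dimensions) + 1):
        lattice[k] = list(combinations(dimensions, k))
    return lattice

lattice = cuboids(["time", "item", "location", "supplier"])
total = sum(len(level) for level in lattice.values())

print(lattice[0])   # the 0-D (apex) cuboid: one empty grouping
print(lattice[4])   # the 4-D (base) cuboid: all dimensions
print(total)        # number of cuboids in the lattice
```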
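[Figure: lattice of cuboids for the dimensions time, item, location, supplier. It runs from the 0-D (apex) cuboid through the 2-D cuboids (e.g., time,supplier and item,supplier) and the 3-D cuboids (time,item,location; time,location,supplier; item,location,supplier; time,item,supplier) to the 4-D (base) cuboid (time, item, location, supplier).]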
[Figure: sample data cube with axes Country, Product (TV, PC, VCR, sum), and quarter (1Qtr, 2Qtr).]
Pivot (rotate)
reorient the cube, visualization, 3D to series of 2D planes.
Other operations
drill across: involving (across) more than one fact table
drill through: through the bottom level of the cube to its back-end relational tables (using SQL)
rankings
time functions: e.g., time average
[Figure: drill paths between summary levels.]
Drill Down / Drill Up:
Total Sales
Total Sales per city
Total Sales per city per store
Total Sales per city per store per month
Drill Across / Drill Up:
Total Sales
Total Sales per city
Total Sales per city by category
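3-dimensional cube:
day 1
      c1   c2   c3
p1    12   -    50
p2    11   8    -
day 2
      c1   c2   c3
p1    44   4    -
p2    -    -    -

Rolled up over day:
      c1   c2   c3
p1    56   4    50
p2    11   8    -

Rolled up over day and product:
      c1   c2   c3
sum   67   12   50

Rolled up over day and client:
      p1    p2
sum   110   19

Rolled up over all dimensions: 129

Rollup moves from the detailed cube toward the totals; drill-down moves back toward the detailed cube.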
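The roll-up steps in this example can be reproduced with a few lines of Python. This is a sketch under the assumption that the cube is held as a dictionary keyed by (product, client, day); the function name roll_up is hypothetical. Rolling up means summing away the dimensions that are not kept.

```python
from collections import defaultdict

# The 3-dimensional cube from the slide, keyed by (product, client, day).
CUBE = {
    ("p1", "c1", 1): 12, ("p2", "c1", 1): 11, ("p1", "c3", 1): 50,
    ("p2", "c2", 1): 8,  ("p1", "c1", 2): 44, ("p1", "c2", 2): 4,
}

def roll_up(cube, keep):
    """Aggregate away every dimension whose position is not listed in keep."""
    out = defaultdict(int)
    for key, amt in cube.items():
        out[tuple(key[i] for i in keep)] += amt
    return dict(out)

by_product_client = roll_up(CUBE, keep=(0, 1))        # sum over day
by_client = roll_up(by_product_client, keep=(1,))     # then sum over product
by_product = roll_up(by_product_client, keep=(0,))    # or sum over client
grand_total = roll_up(by_client, keep=())             # sum over everything

print(by_product_client)
print(by_client, by_product, grand_total)
```

Drill-down has no such shortcut: the finer cells cannot be recomputed from the totals, so the detailed cube (or the fact table) must still be available.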
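3-dimensional cube:
day 1
      c1   c2   c3
p1    12   -    50
p2    11   8    -
day 2
      c1   c2   c3
p1    44   4    -
p2    -    -    -

Client by day, with sums:
      day 1   day 2   Sum
c1    23      44      67
c2    8       4       12
c3    50      -       50
Sum   81      48      129

Client by product, with sums:
      p1   p2   Sum
c1    56   11   67
c2    4    8    12
c3    50   -    50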
2. Transparency
The OLAP functionality should be provided behind the user's existing software without adversely affecting the functionality of the 'host', i.e., the OLAP server should shield the user from the complexity of the data and application.
3. Accessibility
OLAP should allow the user to access diverse data stores (relational, nonrelational and legacy systems) but see the data within a common 'schema' provided by the OLAP tool, i.e., users shouldn't have to know the location, type or layout of the data to access it. The OLAP server should automate the mapping of the logical schema to the physical data.
6. Generic Dimensionality
Data dimensions must all be treated equally. Functions available for one dimension must be available for others.
Data cleaning:
detect errors in the data and rectify them when possible
Data transformation:
convert data from legacy or host format to warehouse format: different data formats, languages, etc.
Load:
sort, summarize, consolidate, compute views, check integrity, and build indices and partitions
Refresh
propagate the updates from the data sources to the warehouse
DW Information Flows
INFLOW - Processes associated with the extraction, cleansing, and loading of the data from the source systems into the data warehouse.
UPFLOW - Processes associated with adding value to the data in the warehouse through summarizing, packaging, and distribution of the data.
DOWNFLOW - Processes associated with archiving and backing-up/recovery of data in the warehouse.
OUTFLOW - Processes associated with making the data available to the end-users.
METAFLOW - Processes associated with the management of the metadata.
Data Cleaning
why?
Data warehouse contains data that is analyzed for business decisions
More data and multiple sources could mean more errors in the data, and such errors are harder to trace
Results in incorrect analysis
Finding and resolving inconsistency in the source data
Detecting data anomalies and rectifying them early has huge payoffs
Important to identify tools that work together well
Long Term Solution
Change business practices and data entry tools
Repository for meta-data
Load
Issues:
huge volumes of data to be loaded
small time window (usually at night) when the warehouse can be taken off-line
when to build indexes and summary tables
allow the system administrator to monitor status; cancel, suspend, or resume a load; or change the load rate
restart after failure with no loss of data integrity
Techniques:
batch load utility: sort input records on clustering key and use sequential I/O; build indexes and derived tables
sequential loads still too long (~100 days for TB)
use parallelism and incremental techniques
Refresh
when to refresh
on every update: too expensive, only necessary if OLAP queries need current data (e.g., up-to-the-minute stock quotes)
periodically (e.g., every 24 hours, every week) or after significant events
refresh policy set by administrator based on user needs and traffic
possibly different policies for different sources
how to refresh
Full extract from base tables
read entire source table or database: expensive
Incremental techniques
detect & propagate changes on base tables: replication servers
logical correctness
transactional correctness: incremental load
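One simple way to picture an incremental refresh is a snapshot comparison: the previous and current states of a source table are compared, and only the differences are propagated. This is a sketch under that assumption and is not the replication-server mechanism named on the slide; the function names and sample rows are hypothetical.

```python
def diff_snapshots(old, new):
    """Compare two {key: row} snapshots of a source table and return the changes."""
    inserts = {k: new[k] for k in new.keys() - old.keys()}
    deletes = {k: old[k] for k in old.keys() - new.keys()}
    updates = {k: new[k] for k in new.keys() & old.keys() if new[k] != old[k]}
    return inserts, updates, deletes

def apply_changes(warehouse, inserts, updates, deletes):
    """Propagate the detected changes into the warehouse copy, in place."""
    for k in deletes:
        warehouse.pop(k, None)
    warehouse.update(inserts)
    warehouse.update(updates)
    return warehouse

# Hypothetical source table at two points in time: {id: (name, price)}.
old = {1: ("bolt", 10), 2: ("nut", 5)}
new = {1: ("bolt", 12), 3: ("screw", 2)}

warehouse = dict(old)  # the warehouse copy as of the previous refresh
inserts, updates, deletes = diff_snapshots(old, new)
apply_changes(warehouse, inserts, updates, deletes)
print(inserts, updates, deletes)
```

A full extract would instead reload every row of the source, which is what makes it expensive for large tables.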
Metadata Repository
Administrative metadata
source databases and their contents
warehouse schema, view & derived data definitions
dimensions, hierarchies
pre-defined queries and reports
data mart locations and contents
data partitions
data extraction, cleansing, transformation rules, defaults
data refresh and purging rules
user profiles, user groups
security: user authorization, access control
Business data
business terms and definitions
ownership of data
charging policies
Operational metadata
data lineage: history of migrated data and sequence of transformations applied
currency of data: active, archived, purged
monitoring information: warehouse usage statistics, error reports, audit trails
Common DW Problems
Underestimation of resources for data loading
Hidden problems with source systems
Required data not captured
Increased end-user demands
Data homogenization
High demand for resources
Data ownership
High maintenance
Long duration projects
Complexity of integration
Research Issues
Data cleaning
focus on data inconsistencies, not schema differences
data mining techniques
Physical Design
design of summary tables, partitions, indexes
tradeoffs in use of different indexes
Query processing
selecting appropriate summary tables
dynamic optimization with feedback
query optimization: cost estimation, use of transformations, search strategies
partitioning query processing between OLAP server and backend server
Warehouse Management
incremental refresh techniques
computing summary tables during load
failure recovery during load and refresh
process management: scheduling queries, load and refresh
use of workflow technology for process management
Summary
What is a data warehouse
Data warehouse architectures
Conceptual DW Modelling
Physical DW Modelling
Additional Slides
[Figure: only the labels Data Mart and Model refinement are recoverable.]
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)
define cube shipping [time, item, shipper, from_location, to_location]:
    dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type)
define dimension from_location as location in cube sales
define dimension to_location as location in cube sales
[Figure: only the labels Promotion and Organization are recoverable.]
Index Structures
This and the following slides on Indexing for DW are adapted with minor modifications from: http://infolab.stanford.edu/~hector/cs245/Notes12.ppt
Indexing principle:
mapping key values to records for associative direct access
Most popular indexing technique in relational databases: B+-trees
For multi-dimensional data, a large number of indexing techniques have been developed, e.g., R-trees
Index structures applied in warehouses
inverted lists
bit map indexes
join indexes
text indexes
Inverted Lists
[Figure: an age index (values 18 to 26) points to inverted lists of record ids (r4 r18 r34 r35; r5 r19 r37 r40), which point to the data records (ages 20, 20, 21, 20, 20, 25, 21, 26, ...).]
List for age = 20: r4, r18, r34, r35
List for name = fred: r18, r52
Answer is intersection: r18
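The inverted-list lookup for age = 20 and name = fred can be sketched as follows. The record ids and the two lists come from the slide; the names of r4, r34, and r35 and the age of r52 are not given there and are made up here, and the function name build_inverted_index is hypothetical.

```python
from collections import defaultdict

# Record ids are from the slide; unlisted attribute values are made up.
RECORDS = {
    "r4":  {"name": "joe",   "age": 20},
    "r18": {"name": "fred",  "age": 20},
    "r34": {"name": "nancy", "age": 20},
    "r35": {"name": "tom",   "age": 20},
    "r52": {"name": "fred",  "age": 31},  # age assumed, only known to differ from 20
}

def build_inverted_index(records, attribute):
    """Map each attribute value to the list of record ids that carry it."""
    index = defaultdict(list)
    for rid, rec in records.items():
        index[rec[attribute]].append(rid)
    return index

age_index = build_inverted_index(RECORDS, "age")
name_index = build_inverted_index(RECORDS, "name")

# A conjunctive query is answered by intersecting the two lists.
answer = sorted(set(age_index[20]) & set(name_index["fred"]))
print(answer)
```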
Bitmap Indexes
Bitmap index: an indexing technique that has attracted attention in multi-dimensional database implementation
Base table:
Customer  City     Car
c1        Detroit  Ford
c2        Chicago  Honda
c3        Detroit  Honda
c4        Poznan   Ford
c5        Paris    BMW
c6        Paris    Nissan
Bitmap Indexes
The index consists of bitmaps:
Index on City:
Record   1  2  3  4  5  6
Chicago  0  1  0  0  0  0
Detroit  1  0  1  0  0  0
Paris    0  0  0  0  1  1
Poznan   0  0  0  1  0  0
Index on Car:
Record   1  2  3  4  5  6
BMW      0  0  0  0  1  0
Ford     1  0  0  1  0  0
Honda    0  1  1  0  0  0
Nissan   0  0  0  0  0  1
Index on a particular column
Index consists of a number of bit vectors (bitmaps)
Each value in the indexed column has a bit vector (bitmap)
The length of the bit vector is the number of records in the base table
The i-th bit is set if the i-th row of the base table has the value for the indexed column
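A bitmap index over the customer table can be sketched as below. The rows are the ones from the base table on the earlier slide; the function name build_bitmap_index and the use of Python lists of 0/1 values (rather than packed bit vectors) are assumptions made for readability.

```python
# Base table from the slide: (customer, city, car).
CUSTOMERS = [
    ("c1", "Detroit", "Ford"),
    ("c2", "Chicago", "Honda"),
    ("c3", "Detroit", "Honda"),
    ("c4", "Poznan", "Ford"),
    ("c5", "Paris", "BMW"),
    ("c6", "Paris", "Nissan"),
]

def build_bitmap_index(rows, column):
    """Map each distinct value in the column to a list of 0/1 bits, one per row."""
    index = {}
    for i, row in enumerate(rows):
        bits = index.setdefault(row[column], [0] * len(rows))
        bits[i] = 1  # the i-th bit is set when the i-th row has this value
    return index

city_index = build_bitmap_index(CUSTOMERS, column=1)
car_index = build_bitmap_index(CUSTOMERS, column=2)

# Customers in Detroit who drive a Ford: bitwise AND of the two bitmaps.
both = [a & b for a, b in zip(city_index["Detroit"], car_index["Ford"])]
matches = [CUSTOMERS[i][0] for i, bit in enumerate(both) if bit]
print(both, matches)
```

Because every row has exactly one value per column, each record position is set in exactly one bitmap of an index, which is a quick consistency check on the tables above.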
Bitmap Index
[Figure: an age index (values 18 to 26) points to bit maps, one per age value, whose bit positions correspond to the data records.]
Data records:
id   name    age
1    joe     20
2    fred    20
3    sally   21
4    nancy   20
5    tom     20
6    pat     25
7    dave    21
8    jeff    26
...
Query: Get people with age = 20 and name = fred
List for age = 20: 1101100000
List for name = fred: 0100000001
Answer is intersection: 0100000000
Suited well for domains with small cardinality
Join
Combine SALE, PRODUCT relations
In SQL: SELECT * FROM SALE, PRODUCT WHERE SALE.prodId = PRODUCT.id
sale:
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4
product:
id   name   price
p1   bolt   10
p2   nut    5
joinTb:
prodId  price  storeId  date  amt
p1      10     c1       1     12
p2      5      c1       1     11
p1      10     c3       1     50
p2      5      c2       1     8
p1      10     c1       2     44
p1      10     c2       2     4
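The join can be checked with Python's built-in sqlite3 module. A comma-separated FROM list without a join condition returns the cross product (12 rows for these tables), whereas joinTb has 6 rows, so the sketch joins on prodId = id. The table contents are from the slide; the column list and ordering in the query are choices made for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER);
CREATE TABLE product (id TEXT, name TEXT, price INTEGER);
INSERT INTO sale VALUES
    ('p1','c1',1,12), ('p2','c1',1,11), ('p1','c3',1,50),
    ('p2','c2',1,8),  ('p1','c1',2,44), ('p1','c2',2,4);
INSERT INTO product VALUES ('p1','bolt',10), ('p2','nut',5);
""")

# Without a join condition every sale row pairs with every product row.
cross = conn.execute("SELECT COUNT(*) FROM sale, product").fetchone()[0]

# With the join condition each sale row pairs only with its own product.
joined = conn.execute("""
    SELECT sale.prodId, product.price, sale.storeId, sale.date, sale.amt
    FROM sale JOIN product ON sale.prodId = product.id
    ORDER BY sale.date, sale.storeId, sale.prodId
""").fetchall()
print(cross, joined)
```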
Join Indexes
join index
product:
id   name   price   jIndex
p1   bolt   10      r1, r3, r5, r6
p2   nut    5       r2, r4
sale:
rId  prodId  storeId  date  amt
r1   p1      c1       1     12
r2   p2      c1       1     11
r3   p1      c3       1     50
r4   p2      c2       1     8
r5   p1      c1       2     44
r6   p1      c2       2     4
Join Indexes
Traditional indexes map the value to a list of record ids.
Join indexes map the tuples in the join result of two relations to the source tables.
In data warehouse cases, join indexes relate the values of the dimensions of a star schema to rows in the fact table.
For a warehouse with a Sales fact table and dimension city, a join index on city maintains for each distinct city a list of RIDs of the tuples recording the sales in the city
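A join index for the sale/product example can be sketched as a map from each dimension key to the RIDs of the matching fact rows. The sale rows and RIDs are from the earlier slide; the function name build_join_index and the dictionary representation are assumptions for illustration.

```python
from collections import defaultdict

# Fact table rows keyed by RID: (prodId, storeId, date, amt).
SALE = {
    "r1": ("p1", "c1", 1, 12),
    "r2": ("p2", "c1", 1, 11),
    "r3": ("p1", "c3", 1, 50),
    "r4": ("p2", "c2", 1, 8),
    "r5": ("p1", "c1", 2, 44),
    "r6": ("p1", "c2", 2, 4),
}

def build_join_index(fact_rows, key_position):
    """For each dimension key, list the RIDs of the fact rows that join with it."""
    index = defaultdict(list)
    for rid, row in fact_rows.items():
        index[row[key_position]].append(rid)
    return dict(index)

j_index = build_join_index(SALE, key_position=0)

# Total amount for product p1, reading only the fact rows the index points to.
p1_total = sum(SALE[rid][3] for rid in j_index["p1"])
print(j_index, p1_total)
```

The same construction with storeId (or a city attribute of a store dimension) as the key gives the per-city list of RIDs described above.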