Data Warehousing and OLAP Technology
1. Objectives
2. What is Data Warehouse?
2.1. Definitions
2.2. Data Warehouse: Subject-Oriented
2.3. Data Warehouse: Integrated
2.4. Data Warehouse: Time Variant
2.5. Data Warehouse: Non-Volatile
2.6. Data Warehouse vs. Heterogeneous DBMS
2.7. Data Warehouse vs. Operational DBMS
2.8. OLTP vs. OLAP
2.9. Why Separate Data Warehouse?
3. Multidimensional Data Model
3.1. Definitions
4. Conceptual Modeling of Data Warehousing
4.1. Star Schema
4.2. Snowflake Schema
4.3. Fact Constellation
5. A Data Mining Query Language: DMQL
5.1. Definitions and syntax
5.2. Defining a Star Schema in DMQL
5.3. Defining a Snowflake Schema in DMQL
5.4. Defining a Fact Constellation in DMQL
5.5. Measures: Three Categories
5.6. How to compute data cube measures?
6. A Concept Hierarchy
7. OLAP Operations in a Multidimensional Data
8. OLAP Operations
9. Starnet Query Model for Multidimensional Databases
10. Data warehouse architecture
10.1. DW Design Process
10.2. Three Data Warehouse models
10.3. OLAP Server Architectures
11. Data Warehouse Implementation
11.1. Materialization of data cube
11.2. Cube Operation
11.3. Cube Computation Methods
11.4. Multi-way Array Aggregation for Cube Computation
11.5. Indexing OLAP Data: Bitmap Index
11.6. Indexing OLAP Data: Join Indices
11.7. Efficient Processing OLAP Queries
11.8. Data Warehouse Usage
11.9. Why online analytical mining?
12. An OLAM Architecture
1. Objectives
2.1. Definitions
                    OLTP                          OLAP
Users               Clerk, IT professional        Knowledge worker
Function            Day-to-day operations         Decision support
DB design           Application-oriented          Subject-oriented
Data                Current, up-to-date,          Historical, summarized,
                    detailed, flat relational,    multidimensional,
                    isolated                      integrated, consolidated
Usage               Repetitive                    Ad hoc
Access              Read/write,                   Lots of scans
                    index/hash on primary key
Unit of work        Short, simple transaction     Complex query
# records accessed  Tens                          Millions
# users             Thousands                     Hundreds
DB size             100MB-GB                      100GB-TB
Metric              Transaction throughput        Query throughput, response
2.9. Why Separate Data Warehouse?
3.1. Definitions
Examples:
Product
Dates
Locations
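[Figure: lattice of cuboids over the dimensions time, item, location, and supplier.
1-D cuboids: time; item; location
2-D cuboids include: (time, supplier); (item, supplier)
3-D cuboids: (time, item, location); (time, location, supplier); (time, item, supplier); (item, location, supplier)]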
[Figure: star schema with a central Sales Fact Table and four dimension tables.
Sales Fact Table: time_key, item_key, branch_key, location_key
Measures: units_sold, dollars_sold, avg_sales
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city, state_or_province, country]
4.2. Snowflake Schema
[Figure: snowflake schema. The item and location dimensions are normalized into further tables.
Sales Fact Table: time_key, item_key, branch_key, location_key
Measures: units_sold, dollars_sold, avg_sales
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, with a link to supplier
supplier: supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city_key
city: city_key, city, state_or_province, country]
4.3. Fact Constellation
[Figure: fact constellation. Two fact tables share dimension tables.
Sales Fact Table: time_key, item_key, branch_key, location_key
Measures: units_sold, dollars_sold, avg_sales
Shipping Fact Table: time_key, item_key, shipper_key, from_location, to_location
Measures: dollars_cost, units_shipped
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city, state_or_province, country
shipper: shipper_key, shipper_name, location_key, ...]
5. A Data Mining Query Language: DMQL
Syntax:
define cube <cube_name> [<dimension_list>]:
<measure_list>
Example:
define cube sales_star [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars),
avg_sales = avg(sales_in_dollars),
units_sold = count(*)
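As a rough illustration of what the three measures in sales_star compute, the following Python sketch groups fact rows by their four dimension values and derives sum, average, and count per cell. The sample rows and the helper name cube_cell_measures are hypothetical; they are not part of DMQL.

```python
from collections import defaultdict

# Hypothetical fact rows: (time, item, branch, location, sales_in_dollars).
FACTS = [
    ("Q1", "TV", "B1", "Toronto", 400.0),
    ("Q1", "TV", "B1", "Toronto", 600.0),
    ("Q1", "PC", "B1", "Toronto", 1200.0),
    ("Q2", "TV", "B2", "Vancouver", 500.0),
]


def cube_cell_measures(facts):
    """Group rows by their four dimension values and compute the three measures."""
    groups = defaultdict(list)
    for time, item, branch, location, sales in facts:
        groups[(time, item, branch, location)].append(sales)
    return {
        cell: {
            "dollars_sold": sum(values),             # sum(sales_in_dollars)
            "avg_sales": sum(values) / len(values),  # avg(sales_in_dollars)
            "units_sold": len(values),               # count(*)
        }
        for cell, values in groups.items()
    }


measures = cube_cell_measures(FACTS)
print(measures[("Q1", "TV", "B1", "Toronto")])
```

Each key of the result corresponds to one cell of the base cuboid over [time, item, branch, location].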
Syntax:
define dimension <dimension_name>
as (<attribute_or_subdimension_list>)
Example:
define dimension item
as (item_key, item_name, brand, type,
supplier_type)
Special Case (Shared Dimension Tables)
Syntax:
define dimension <dimension_name>
as <dimension_name_first_time>
in cube <cube_name_first_time>
Example:
Distributive:
Algebraic:
Holistic:
o If there is no constant bound on the storage size
needed to describe a subaggregate.
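The three categories can be contrasted with a small Python sketch. It assumes the usual reading of the terms (only the holistic case is spelled out above): a distributive measure such as sum can be combined from per-partition results with the same function, an algebraic measure such as avg is derived from a fixed number of distributive results, and a holistic measure such as median needs the full data. The partitions are made-up values.

```python
from statistics import median

# Hypothetical partitions of one measure column.
PARTITIONS = [[4, 8, 15], [16, 23], [42]]
ALL_VALUES = [v for part in PARTITIONS for v in part]

# Distributive: per-partition sums are combined with sum again.
partial_sums = [sum(part) for part in PARTITIONS]
total = sum(partial_sums)

# Algebraic: avg is derived from two distributive results per partition
# (sum and count), a fixed amount of storage per subaggregate.
partial_counts = [len(part) for part in PARTITIONS]
average = sum(partial_sums) / sum(partial_counts)

# Holistic: combining per-partition medians does not, in general, give the
# overall median, so no fixed-size subaggregate is enough.
median_of_medians = median(median(part) for part in PARTITIONS)
true_median = median(ALL_VALUES)

print(total, average, median_of_medians, true_median)
```

In this data the median of the partition medians differs from the median of all values, which is the practical reason holistic measures are expensive to precompute in a cube.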
[Figure: a concept hierarchy with levels all; country (Germany, ..., Spain, Canada, ..., Mexico); office (L. Chan, ..., M. Wind).]
The order can be either partial or total:
Total order: street < city < state < country
Partial order: day < {month < quarter; week} < year
Set-grouping hierarchy:
[Figure: dimension hierarchies, with labels including Product, region, Month, Week, and Day.]
A Sample data cube:
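[Figure: a sample data cube with dimensions Product (TV, PC, VCR), Date (1Qtr, 2Qtr, 3Qtr, 4Qtr), and Country (U.S.A, Canada, Mexico). Each dimension has an extra "sum" slice; one labeled cell holds total annual sales.]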
0-D (apex) cuboid: all
1-D cuboids: product; date; country
2-D cuboids: (product, date); (product, country); (date, country)
3-D (base) cuboid: (product, date, country)
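The lattice of cuboids for the sample cube can be enumerated mechanically: a k-D cuboid is any choice of k of the dimensions. The following Python sketch is one way to list them; the function name cuboids_by_level is an illustrative choice.

```python
from itertools import combinations

DIMENSIONS = ("product", "date", "country")


def cuboids_by_level(dimensions):
    """Map k to the list of k-D cuboids (each a tuple of dimension names)."""
    return {
        k: list(combinations(dimensions, k))
        for k in range(len(dimensions) + 1)
    }


lattice = cuboids_by_level(DIMENSIONS)
for k, cuboids in lattice.items():
    print(f"{k}-D:", cuboids)
```

The empty tuple at level 0 stands for the apex cuboid, and the single tuple at level 3 is the base cuboid.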
Querying a data cube
8. OLAP Operations
Objectives:
o OLAP is a powerful analysis tool:
Forecasting
Statistical computations,
aggregations,
etc.
[Figure: roll-up example. A third axis of the cube is labeled "Date of sale".]

Before roll up:
                   Video  Camera  CD
New Orleans  c1    10     3       21
             c2    12     5       9
Virginia     c3    11     7       7
             c4    12     11      15

After roll up:
      Video  Camera  CD
NO    22     8       30
VA    23     18      22
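The roll-up in this example can be reproduced with a short Python sketch. It assumes, as a reading of the figure, that c1 and c2 belong to New Orleans (NO) and c3 and c4 to Virginia (VA); the function name roll_up is illustrative.

```python
CATEGORIES = ("Video", "Camera", "CD")

# Cell values per detail-level member c1..c4, in CATEGORIES order.
DETAIL = {
    "c1": (10, 3, 21),
    "c2": (12, 5, 9),
    "c3": (11, 7, 7),
    "c4": (12, 11, 15),
}

# Assumed mapping from each member to the next level up in the hierarchy.
PARENT = {"c1": "NO", "c2": "NO", "c3": "VA", "c4": "VA"}


def roll_up(detail, parent):
    """Sum the measures of all members that share the same parent."""
    rolled = {}
    for member, values in detail.items():
        group = parent[member]
        current = rolled.get(group, (0,) * len(values))
        rolled[group] = tuple(a + b for a, b in zip(current, values))
    return rolled


summary = roll_up(DETAIL, PARENT)
for group, values in summary.items():
    print(group, dict(zip(CATEGORIES, values)))
```

Under that mapping the sums match the rolled-up table: NO is 22, 8, 30 and VA is 23, 18, 22.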
Drill down (roll down):
o It is the reverse of roll-up
o It is performed by stepping down a concept
hierarchy for a dimension or introducing new
dimensions.
Pivot (rotate):
o Re-orient the cube for an alternative
presentation of the data
o Transform 3D view to series of 2D planes.
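For a 2-D view, pivoting amounts to swapping the row and column axes while leaving the cell values unchanged. A minimal Python sketch, using the NO/VA table from the roll-up example as input (the function name pivot is illustrative):

```python
# Rolled-up table: rows are locations, columns are product categories.
ROW_LABELS = ["NO", "VA"]
COLUMN_LABELS = ["Video", "Camera", "CD"]
CELLS = [
    [22, 8, 30],
    [23, 18, 22],
]


def pivot(row_labels, column_labels, cells):
    """Swap the two axes of a 2-D view; cell values are only rearranged."""
    transposed = [list(column) for column in zip(*cells)]
    return column_labels, row_labels, transposed


rows, cols, cells = pivot(ROW_LABELS, COLUMN_LABELS, CELLS)
for label, row in zip(rows, cells):
    print(label, dict(zip(cols, row)))
```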
Other operations
o Drill across: runs a query across more than one
fact table.
o Drill through: goes through the bottom level of the
cube to its back-end relational tables (using
SQL).
9. Starnet Query Model for Multidimensional Databases
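[Figure: starnet query model. Radial lines represent dimensions; points on each line represent abstraction levels.
Dimension labels: Customer Orders, Shipping, Customer, Product, Time, Location, Organization, Promotion
Level labels: CONTRACTS, ORDER, AIR-EXPRESS, TRUCK, PRODUCT LINE, PRODUCT ITEM, PRODUCT GROUP, ANNUALLY, QTRLY, DAILY, DIVISION]
Each circle is called a footprint.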
10. Data warehouse architecture
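[Figure: data warehouse architecture. Operational DBs and other sources are extracted, transformed, loaded, and refreshed, under a Monitor & Integrator with Metadata, into the Data Warehouse and Data Marts, which serve the OLAP Server.]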
Enterprise warehouse
o Collects all of the information about subjects
spanning the entire organization.
Data mart
o A subset of corporate-wide data that is of value to
a specific group of users. Its scope is confined to
specific, selected groups, e.g. a marketing data
mart.
o Independent vs. dependent (sourced directly from
the warehouse) data marts.
Virtual warehouse
o A set of views over operational databases.
o Only some of the possible summary views may
be materialized.
A Recommended Approach
[Figure: recommended approach, relating data marts and the enterprise data warehouse.]
o Two methods:
Base cuboid data is stored in a base fact
table.
Aggregate data:
► Data can be stored in the base fact
table (summary fact table), or
► Data can be stored in separate
summary fact tables, one for each
level of abstraction.
11. Data Warehouse Implementation
Objectives:
Cube Computation
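\[
\text{number of cuboids} =
\begin{cases}
2^{n} & \text{if no hierarchy} \\
\prod_{i=1}^{n} (L_i + 1) & \text{if hierarchy}
\end{cases}
\]
where n is the number of dimensions and L_i is the number of levels associated with dimension i.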
[Figure: lattice of cuboids, including the 1-D cuboids (city), (item), and (year).]
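The cuboid-count formula can be checked numerically with a small Python sketch. The level counts passed in below are hypothetical, and the function names are illustrative.

```python
from math import prod


def cuboids_without_hierarchy(n):
    """Number of cuboids for n dimensions with no concept hierarchies: 2^n."""
    return 2 ** n


def cuboids_with_hierarchy(levels):
    """Product of (L_i + 1) over all dimensions, L_i levels in dimension i."""
    return prod(level_count + 1 for level_count in levels)


# Three dimensions, no hierarchies.
print(cuboids_without_hierarchy(3))

# Hypothetical level counts for three dimensions.
print(cuboids_with_hierarchy([4, 3, 4]))

# With exactly one level per dimension the two cases agree.
print(cuboids_with_hierarchy([1, 1, 1]))
```

The last call shows the two cases are consistent: with a single level per dimension each factor is 2, so the product reduces to 2^n.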
ROLAP-based cubing
o Sorting, hashing, and grouping operations
are applied to the dimension attributes in
order to reorder and cluster related tuples
o Grouping is performed on some
subaggregates as a “partial grouping step”
o Aggregates may be computed from
previously computed aggregates, rather than
from the base fact table
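The last point, computing aggregates from previously computed aggregates, can be sketched in Python as follows. The base rows and the helper group_sum are hypothetical, and the sketch covers only a sum measure, not a full ROLAP cubing algorithm.

```python
from collections import defaultdict

# Hypothetical base fact rows: (city, item, year, dollars_sold).
BASE = [
    ("Toronto", "TV", 2004, 100),
    ("Toronto", "TV", 2005, 150),
    ("Toronto", "PC", 2004, 300),
    ("Chicago", "TV", 2004, 200),
]


def group_sum(rows, key_positions):
    """Sum the last field of each row, grouped by the given key positions."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[i] for i in key_positions)
        totals[key] += row[-1]
    return dict(totals)


# (city, item) cuboid computed from the base fact table.
city_item = group_sum(BASE, (0, 1))

# (city) cuboid computed from the smaller (city, item) cuboid: each
# aggregated cell becomes a row of key fields followed by its subtotal.
city_item_rows = [key + (total,) for key, total in city_item.items()]
city_from_aggregate = group_sum(city_item_rows, (0,))

# Aggregating the base table directly gives the same (city) cuboid.
city_from_base = group_sum(BASE, (0,))
print(city_from_aggregate == city_from_base)
```

Reusing the (city, item) cuboid means the second aggregation scans three cells instead of four base rows; on a real fact table the saving is far larger.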
MOLAP Approach
o Uses an array-based algorithm
o The base cuboid is stored as a
multidimensional array.
o A number of cells are read in to compute partial
cuboids
11.4. Indexing OLAP Data: Bitmap Index
Approach:
o Index on a particular column
o Each value in the column has its own bit vector; bit
operations are fast
o The length of each bit vector equals the number of
records in the base table
o The i-th bit is set if the i-th row of the base table has
that value in the indexed column
o Not suitable for high-cardinality domains
Example:
Base Table:
Cust Region Type
C1 Asia Retail
C2 Europe Dealer
C3 Asia Dealer
C4 America Retail
C5 Europe Dealer
Index on Region:
RecID Asia Europe America
1 1 0 0
2 0 1 0
3 1 0 0
4 0 0 1
5 0 1 0
Index on Type:
RecID Retail Dealer
1 1 0
2 0 1
3 0 1
4 1 0
5 0 1
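The example tables can be reproduced with a minimal Python sketch that stores each bit vector as an integer. The helper names are illustrative, and bit positions are 0-based here (bit 0 corresponds to RecID 1), which is an implementation choice of this sketch.

```python
BASE_TABLE = [
    ("C1", "Asia", "Retail"),
    ("C2", "Europe", "Dealer"),
    ("C3", "Asia", "Dealer"),
    ("C4", "America", "Retail"),
    ("C5", "Europe", "Dealer"),
]


def build_bitmap_index(rows, column):
    """Return {value: bitmask}; bit i is set when row i holds that value."""
    index = {}
    for i, row in enumerate(rows):
        index[row[column]] = index.get(row[column], 0) | (1 << i)
    return index


def matching_rows(rows, bitmask):
    """Return the rows whose bit is set in the mask."""
    return [row for i, row in enumerate(rows) if bitmask & (1 << i)]


region_index = build_bitmap_index(BASE_TABLE, 1)
type_index = build_bitmap_index(BASE_TABLE, 2)

# Region = 'Asia' AND Type = 'Dealer' becomes a single bitwise AND.
asia_dealers = matching_rows(BASE_TABLE, region_index["Asia"] & type_index["Dealer"])
print(asia_dealers)
```

A conjunctive selection touches only the two bit vectors until the final lookup, which is why bit operations make this index fast for low-cardinality columns.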
11.5. Indexing OLAP Data: Join Indices
Join index:
JI(R-id, S-id)
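One way to picture JI(R-id, S-id) is as a precomputed list of row-id pairs whose join keys match, so that a later query can follow the pairs instead of recomputing the join. The Python sketch below assumes hypothetical relations R (a location dimension) and S (a sales fact table) joined on location_key; none of these names come from the text above.

```python
# Hypothetical relations keyed by row id.
R = {  # R-id -> (location_key, city)
    "r1": ("L1", "Toronto"),
    "r2": ("L2", "Chicago"),
}
S = {  # S-id -> (location_key, dollars_sold)
    "s1": ("L1", 100),
    "s2": ("L2", 250),
    "s3": ("L1", 75),
}


def build_join_index(r, s):
    """Precompute JI(R-id, S-id): pairs of row ids whose join keys match."""
    s_ids_by_key = {}
    for s_id, (key, _) in s.items():
        s_ids_by_key.setdefault(key, []).append(s_id)
    return [
        (r_id, s_id)
        for r_id, (key, _) in r.items()
        for s_id in s_ids_by_key.get(key, [])
    ]


JOIN_INDEX = build_join_index(R, S)


def sales_for_city(city):
    """Answer a query by following the precomputed pairs."""
    return sum(S[s_id][1] for r_id, s_id in JOIN_INDEX if R[r_id][1] == city)


print(JOIN_INDEX)
print(sales_for_city("Toronto"))
```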
Architecture of OLAM
12. An OLAM Architecture
[Figure: OLAM architecture. Recoverable labels: Layer4 (User Interface); Layer2 (MDDB) with Meta Data; Database API; Filtering & Integration; Filtering.]